CVPR Daily - Friday

Computer Vision and Pattern Recognition, Friday, Nashville. CVPR Daily 2025: Awards, Highlights, Posters, Workshops, Previews, Women in Computer Vision, reviewed for y’all at CVPR 2025 in Nashville!

Mathis’ Picks of the Day

Mathis Kruse is a PhD student at Leibniz University Hannover, supervised by Bodo Rosenhahn. Currently, he is working on anomaly detection, with a special focus on multi-view scenarios. "I recently presented my paper, 'Multi-Flow: Multi-View-Enriched Normalizing Flows for Industrial Anomaly Detection', at the VAND 3.0 workshop on Thursday. Feel free to chat with me about all things anomaly detection, flows, and multi-view…"

Mathis’ picks for today, Friday 13:

Orals
[2A-1] FoundationStereo: Zero-Shot Stereo Matching
[2A-4] MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds

Posters
[1-52] DeSplat: Decomposed Gaussian Splatting for Distractor-Free Rendering
[1-240] One Diffusion to Generate Them All
[1-437] PIAD: Pose and Illumination agnostic Anomaly Detection
[2-208] Reversing Flow for Image Restoration

Why these picks? Leveraging multi-view and 3D information can greatly enhance tasks typically performed using regular 2D images. Moreover, many of these ideas can be applied to exciting domains such as X-ray imaging and scene understanding.

Editorial

CVPR Daily
Publisher: RSIP Vision
Copyright: RSIP Vision
Editor: Ralph Anzarouth
All rights reserved. Unauthorized reproduction is strictly forbidden. Our editorial choices are fully independent from IEEE, CVPR and the conference organizers.

Dear all,

I’ll make this very short. Welcome to Smashville! It is a pleasure to publish this CVPR Daily once again for you. This is the 10th consecutive year! Thank y’all for reading, and thank you to IEEE, to Nicole Finn and to the CVPR community for trusting me once again...

Like this magazine? Keep in touch with Computer Vision News and subscribe for free here!

Y’all enjoy the reading, and don’t forget to celebrate the southern charm of Nashvegas, only two blocks away from the conference ☺

Ralph Anzarouth
Editor, Computer Vision News

Ralph’s photo above was taken in peaceful, lovely and brave Odessa, Ukraine.

UKRAINE CORNER
Russian Invasion of Ukraine
CVPR condemns in the strongest possible terms the actions of the Russian Federation government in invading the sovereign state of Ukraine and engaging in war against the Ukrainian people. We express our solidarity and support for the people of Ukraine and for all those who have been adversely affected by this war.

Oral & Award Candidate

VGGT: Visual Geometry Grounded Transformer

Jianyuan Wang is a joint PhD student at the University of Oxford’s Visual Geometry Group and Meta AI. His paper introduces a superfast feed-forward reconstruction model, representing a significant advancement in 3D computer vision. Ahead of his oral presentation this afternoon, Jianyuan tells us more about his innovative work.

Jianyuan’s paper proposes a novel feed-forward reconstruction model that processes multiple input images to generate a 3D reconstruction. Unlike prior classical and deep learning-based methods, which often rely on time-consuming test-time optimization, this model operates without such constraints. Optimization techniques such as bundle adjustment or global alignment can take minutes or longer to complete. In contrast, Jianyuan’s model achieves reconstruction in seconds, significantly enhancing speed and efficiency. “Such optimization steps are usually non-differentiable and can’t work as a plug-and-play component in recent deep learning frameworks,” he explains. “That’s the bottleneck for 3D vision these days. Therefore, we go for a feed-forward-only model!”

Jianyuan identifies two major challenges in developing this model. The first was the need for a robust dataset to solve the problem in a data-driven manner. He collected 17 public datasets and processed them into a unified format, a task that required considerable engineering work. However, this was crucial because the quality of the data determines the limits of any method. The second challenge involved ensuring the model’s generalization ability. “We want the model to handle an arbitrary number of input frames during inference,” he tells us. “Users may have only one frame or 100 frames, but they still want the reconstruction results.” To address this, he implemented an alternating-attention mechanism, utilizing frame-wise attention to enable the model to identify which tokens correspond to which input frame.
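For readers who want a concrete picture, here is a minimal PyTorch sketch of the alternating-attention idea just described: attention is applied first within each frame, so tokens stay associated with their source image, and then globally across all frames. The class, dimensions, and variable names are hypothetical illustrations, not the authors' code.

```python
import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    """Hypothetical sketch: one frame-wise plus one global attention pass.

    Tokens arrive as (batch, frames, patches, dim), mirroring images that
    were patchified by a DINO-style vision transformer backbone.
    """

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b, f, p, d = tokens.shape

        # Frame-wise attention: each frame attends only to its own patches,
        # so the model can tell which tokens belong to which input image.
        x = tokens.reshape(b * f, p, d)
        h = self.norm1(x)
        x = x + self.frame_attn(h, h, h)[0]

        # Global attention: tokens from all frames attend to each other,
        # letting information flow across views.
        x = x.reshape(b, f * p, d)
        h = self.norm2(x)
        x = x + self.global_attn(h, h, h)[0]
        return x.reshape(b, f, p, d)

tokens = torch.randn(2, 5, 196, 768)  # (batch, frames, patches, dim)
print(AlternatingAttentionBlock()(tokens).shape)  # torch.Size([2, 5, 196, 768])
```

Because neither attention pass depends on the number of frames, the same weights can process one image or a hundred, which matches the generalization behavior Jianyuan describes.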

Jianyuan’s research leverages several advanced computer vision techniques. Drawing inspiration from the success of 2D vision, it utilizes DINO, a 2D foundation model based on a vision transformer architecture. This approach enables the model to patchify the input images into multiple tokens, transforming the image information into a format that networks can understand and process. Additionally, the model features a camera head that regresses the camera’s extrinsic and intrinsic parameters. This simple transformer approach is informed by previous works in camera pose estimation, such as RelPose, PoseDiffusion, and VGGSfM. He also employs DPT, a computer vision network developed four years ago, to predict dense, pixel-wise outputs.
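As a similarly hedged sketch, a camera head that regresses extrinsic and intrinsic parameters from the network's tokens might look like the following; the pooling, dimensions, and 9-parameter pose encoding (translation, rotation quaternion, focal lengths) are illustrative guesses, not the paper's actual parameterization.

```python
import torch
import torch.nn as nn

class CameraHead(nn.Module):
    """Hypothetical head regressing per-frame camera parameters.

    Pools each frame's tokens and maps them to a 9-D vector:
    3 for translation, 4 for a rotation quaternion, 2 for focal lengths.
    """

    def __init__(self, dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 9))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, frames, patches, dim) -> mean-pool over patches
        pooled = tokens.mean(dim=2)
        out = self.mlp(pooled)                        # (batch, frames, 9)
        t, quat, focal = out.split([3, 4, 2], dim=-1)
        quat = nn.functional.normalize(quat, dim=-1)  # unit quaternion
        return torch.cat([t, quat, focal], dim=-1)
```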

Now that we know which techniques Jianyuan has learned from, are there computer vision techniques that he thinks could benefit from his work? “Yes, neural rendering methods, such as 3D Gaussian Splatting or NeRF, because they need camera poses predicted from upstream methods such as ours,” he responds. “Also, our model can predict a high-level latent representation of the 3D properties, so recent large 3D VLM models could benefit from it.”

One potential application of this work in the real world is in online shopping, where customers often rely on 2D images of products. By utilizing this model, retailers could offer 3D reconstructions of items, allowing customers to rotate and view products from all angles, and even create personal 3D avatars for a virtual fitting.

Jianyuan’s paper has not only earned him an oral presentation slot at this year’s conference but has also been nominated for a prestigious Best Paper award. He attributes this recognition to the pressing need for advancements in 3D vision, which currently lags behind rapid developments in 2D vision and natural language processing. “They have built a lot of fantastic works, like GPT and SAM,” he points out. “In 3D vision, we’re still working with smaller models and classical techniques. A joint thought among the 3D vision community is that we need a large 3D foundation model that can handle numerous downstream tasks. I think that’s why this paper is kind of special!”

Looking to the future, Jianyuan is optimistic about the potential applications of his research. He has

already observed follow-up works, including AnySplat, which utilizes VGGT’s feature backbone to enable feed-forward Gaussian parameter prediction for novel view synthesis, and Spatial-MLLM, which combines its backbone with other large vision models to establish a unified foundation model for 3D perception. “In the future, we could see further trials on 4D tasks,” he envisions. “As we go from 2D to 3D, I think in probably two or three years, we’ll have something good in 4D. In 4D, people dance, run, and many scenes are dynamic!”

In conclusion, while Jianyuan’s model represents a significant step forward, he emphasizes that data-driven 3D vision is just the beginning. “As Rich Sutton said in 2019, general approaches that leverage computation will ultimately prove to be the most effective,” he reflects. “This ‘Bitter Lesson’ has attracted great attention in the 2D and NLP communities, and we believe it’s true for 3D as well. Feed-forward models will be the future of 3D vision.”

To learn more about Jianyuan’s work, visit Oral Session 2A: 3D Computer Vision (Karl Dean Ballroom) this afternoon from 13:00 to 14:30 [Oral 5] and Poster Session 2 (ExHall D) from 16:00 to 18:00 [Poster 86].

Highlight Presentation

Multitwine: Multi-Object Compositing with Text and Layout Control

Gemma Canet Tarrés just finished her PhD at the University of Surrey, where she worked on improving controllability in image generation models under the supervision of John Collomosse and Andrew Gilbert. She defended her thesis a couple of weeks ago; you will find a short review of it just after this article. During her PhD, she completed two internships at Adobe, and she is currently interning at Amazon. She is now in the market for new exciting opportunities where she can continue learning and contributing to the field of generative AI. She’s a catch, so take her before it’s too late!

This paper is the first one, as far as Gemma is aware, to do multiple-object compositing at the same time. Additionally, it adds layout and text control for extra controllability over the final image. But why do we need this? So far, there is a lot of work on multi-object subject-driven generation, but for object compositing, which is a very important task in many editing pipelines, there is nothing that handles multiple objects at the same time. If we sequentially add different objects, some things become very hard or almost impossible to do, like adding two people hugging, which requires reposing both people at the same time, or adding a person walking a dog, which involves the person, the dog, and also the leash. Some interactions are simply very hard to achieve with sequential single-object compositing, so providing a way to simultaneously add multiple objects is an important contribution in itself.

It is a challenging task because it carries all the challenges of single-object compositing at once. You have to reharmonize the object, blend the

boundaries, and make the final image look natural. At the same time, you have to repose the objects to fit some kind of interaction. If the user specifies the desired interaction through text, the model needs to follow that text while also keeping the identity of the objects, which is a very hard task. It took a while, and it wasn’t easy to find a solution, but it is a brilliant one.

Gemma found that a big part of the solution was in the training data. She found different ways of obtaining this training data, combining four different sources and four different ways of acquiring that data using segmentation models, VLMs, and grounding models. That allowed her to combine various sorts of data that excelled in separate aspects, offering different kinds of text prompts and different kinds of interactions, which in turn allowed the model to learn a more complete set of interactions and reposings.

Moreover, how do you balance the textual and the visual information? For that, Gemma added customization as an auxiliary task: during training, the background is sometimes not provided, and the model has to generate it on its own. That forces the model to focus only on balancing textual and visual information in those steps, making the final result more balanced.

“Honestly, this customization-as-an-auxiliary-task part was kind of by chance,” Gemma admits. “We wanted to see if we could get our model to do both tasks: object compositing, but also customization. And we found that adding that actually helped the other task. And then we thought about it, and we thought, like, yes, actually, it makes sense, because it is an easier task and they are complementary!”
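As a rough sketch of this auxiliary-task trick, the training step below randomly withholds the background conditioning, turning a compositing step into a customization step. The batch field names, the drop probability, and the model's loss interface are all invented for illustration; they are not details from the paper.

```python
import random
import torch

def training_step(model, batch, p_drop_background: float = 0.3):
    """One hypothetical training step mixing compositing and customization."""
    background = batch["background"]
    if random.random() < p_drop_background:
        # Withhold the background: the model must generate the whole scene
        # from the object exemplars, text, and layout alone (customization).
        background = torch.zeros_like(background)

    # Assumed interface: the model returns its diffusion training loss.
    loss = model(
        objects=batch["object_exemplars"],  # reference crops of each object
        text=batch["prompt"],               # e.g. "a person walking a dog"
        layout=batch["boxes"],              # target bounding boxes
        background=background,
        target=batch["composite"],          # ground-truth composited image
    )
    loss.backward()
    return loss.item()
```

Occasionally training on the easier, complementary customization task is what, in Gemma's experience, ended up helping the compositing task as well.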

The model is based on Stable Diffusion 1.5, which is quite an old version now. Since the team were combining so many things, they decided to test their pipeline on a smaller model, even though it is not exactly small. It can now be adapted to a bigger model that would provide better image quality. Still, the baseline is basically a UNet-based diffusion model.

We asked Gemma about her thoughts when she discovered that her paper was accepted as a highlight. Gemma candidly admits that she did not expect it, though she’s very happy about it. “I think people like the fact that it’s a new task that we’re doing,” Gemma suggests. “Like, no one had done multiple-object compositing at the same time. And also we’re able to do compositing and customization, which are very, very hot topics right now, topics that are being talked about a lot. They’re very useful for industry, for any content creation task: entertainment, marketing, etc.”

Gemma knows that the problem is not solved yet; there are many more things that can be done. This model, for example, doesn’t work super well when you try to use many, many objects at the same time. The authors tried up to 10 objects, which is crazy, and it didn’t work as well: the success rate definitely falls. There’s definitely some work to do there. Like what? Gemma thinks of some cool extensions: could it be used for video? Could it be used to generate crazy interactions, like adding a person lifting a car over their head?

Gemma remembers a funny moment when she took many screenshots and sent them to people. At some point she managed to get multiple objects to be there, but the identities were completely mixed: she was getting dogs with cat faces. This happened around Halloween, so she called them Halloween dogs, and she was kind enough to share the image here with us.

To learn more about this paper, visit today’s (Friday) Poster Session 2 (ExHall D) from 16:00 to 18:00 [Poster 260]. Congratulate Gemma for being selected as an ‘Outstanding Reviewer’ at CVPR 2025. Continue to the next page to read about Gemma’s thesis.

Congrats, Doctor Gemma!

Gemma’s bio is on page 8. Here is a review of her thesis, two weeks after her successful defense.

Recent advancements in deep learning have transformed the field of image generation, enabling the creation of highly realistic and visually compelling images. However, despite their impressive capabilities, state-of-the-art models often lack the fine-grained control needed to tailor outputs precisely. This challenge is particularly evident when user input is ambiguous or when multiple constraints must be satisfied simultaneously.

Addressing these limitations, Gemma’s work explores novel methods to constrain and guide the image generation process by leveraging multimodal inputs, such as sketches, style, text, and exemplars, to guide the creative process.

Based on the success of DALL-E, her first approach was CoGS (ECCV 2022), a framework for style-conditioned, sketch-driven image synthesis. By decoupling structure and appearance, CoGS empowers users to define coarse layouts via sketches and class labels and guide aesthetics using exemplar style images. A transformer-based encoder converts these inputs into a discrete codebook representation, which can be mapped into a metric space for fine-grained adjustments. This unification of search and synthesis allows iterative refinement, enabling users to explore diverse appearance possibilities and produce results that closely match their vision.

Building on this idea, PARASOL (CVPR WiCV 2024) advances control by enabling disentangled, parametric control of the visual style. This multimodal synthesis model conditions a latent diffusion framework on both content and fine-grained style embeddings, ensuring

independent yet complementary control of each modality. Using a novel training strategy based on auxiliary search-driven triplets, PARASOL introduces precise style manipulation while preserving content integrity.

Expanding to conditioning on exemplars, the next model, Thinking Outside the BBox (ECCV 2024), addresses the novel challenge of 'unconstrained generative object compositing'. This task involves seamlessly integrating objects into background images without requiring explicit positional guidance. By training a diffusion-based model on paired synthetic data, the approach autonomously handles tasks such as object placement, scaling, lighting harmonization, and generating realistic effects like shadows and reflections. Notably, the model explores diverse, natural placements when no positional input is provided, enabling flexibility and accelerating workflows. This solution surpasses existing methods in realism and user satisfaction, setting a new standard for generative compositing.

Finally, Gemma’s thesis culminates in Multitwine (CVPR 2025), a model for simultaneous multi-object compositing, combining text, layout, and exemplar-based inputs. For more information about this model, see pages 8-11, or go ask Gemma directly at her poster session today!

Together, these different approaches form a cohesive framework for controllable image generation, addressing challenges in structural, stylistic, and compositional control. By leveraging diverse input modalities, the generation space is narrowed, producing outputs more closely aligned with the inputs and unlocking greater precision and new creative possibilities. Congrats, Doctor Gemma!

Women in Computer Vision

Marcella Cornia is an Associate Professor at the University of Modena, Italy. She has been at this university for almost a decade.

Read 160 FASCINATING interviews with Women in Computer Vision.

Marcella, tell us about your work.
My research activities are mainly related to vision and language. I work on multimodal learning in general. During my PhD I worked a lot on image captioning: I developed solutions to automatically describe an image in natural language. With the changes in AI research over the last couple of years, we now mainly focus on multimodal large language models, which are probably the state-of-the-art architectures in the vision and language literature.

Is it true that many vision people switch to language because of this?
Yeah, now probably 60-70% of the computer vision papers are related to multimodal LLMs. Many architectures are now based on language models. Even when we want to generate an image, basically

we do that by giving an input in natural language sentences.

Should I change the name of the magazine to Natural Language News, and maybe CVPR should change its name too?
No, I don’t think so. There are also a lot of problems that are based on computer vision, and it is very important that the computer vision community focus on the visual part and the understanding of the visual components.

Maybe a serious answer would be: it’s not that vision people went to language, it’s that vision and language converged in some way.
Yeah, yeah, yeah, it’s true. So now there is no longer a very significant difference between the two fields. Natural language processing and computer vision are now very related somehow.

How has the very strong wave of AI LLMs in the last couple of years changed your work?
Oh, well, when I started my PhD, my research was also related to language at the beginning, but basically we trained the architecture from scratch, so we didn’t have a pre-trained architecture, a pre-trained language model, as a base for our solutions. Nowadays, many research efforts focus on starting from a pre-trained language model and teaching it multimodal capabilities. I think the most significant change is the starting point itself. Also, the size of the models changed a lot. The models that we developed at the beginning of my PhD were quite small in size. Now we have large architectures that are also quite expensive to train and to use.


Tell me, Marcella, were you meant to be a professor, or did it come just like this?
It’s difficult to answer, because when I started my PhD, my ambition was to learn something new, and I didn’t think that my career would be linked to academia. It was probably during the postdoc that I decided to stay in academia. There was an opportunity and I decided to take it. But actually, my original plan was more to teach in high school.

Now the question that you certainly knew was coming: what is it like to grow as a scientist, as a researcher and as a professor under a pillar of our community like Rita Cucchiara?

Working with Rita is first of all a pleasure, because she is recognized inside the community, and it also gives us a lot of opportunities to connect with other research groups in Europe, in Italy, and also in the United States and everywhere. This is mainly an opportunity!

Is there one thing of Rita’s style that you would like to have too?
Her capabilities during public presentations, I think, because she performs very well during public talks, during scientific talks, in conferences, invited talks and so on. This is a great skill that I find quite difficult to acquire.

Are you from Modena too, Marcella?
Yes, yes.

Wow, so the three of us are from Modena: you, Rita and myself. That’s so funny! Okay, most of my readers have never heard about Modena, or maybe they only know the name. Tell us something fantastic about our town!
Well, the food, obviously! The best town in the world for food! And also Ferrari, which is based very near Modena. These are probably the most important things to know.

I even think that Ferrari works with the university. Is that right?
Yeah, yeah. Some years ago, we also had a couple of projects. I didn’t work on that, but the lab had a couple of research projects with Ferrari.

That’s a big honor, of course. Did you ever drive a Ferrari?
No, never.

That will come, maybe. Let’s talk a bit about the future, Marcella. Are you going to keep teaching?
Yeah, yeah. This is my present and also my future. I recently obtained a full-time position at the university, and I think that I will continue to work here in the next years.

Congratulations from us! This magazine has seen you growing for almost 10 years, I think.
Exactly! We met the first time in Las Vegas at CVPR 2016.

2016! That famous CVPR at the Caesars Palace!
Yes, this was my first conference, before the start of my PhD in November 2016. We met for the first time there, at the Women in Computer Vision workshop.

It was my first CVPR and my first conference too. So we were born in the same place and almost together. Wow! The funny thing is that now in Nashville we are both celebrating our 10th CVPR. Tell something to the readers who are attending their first CVPR today. Tell them how to take the most advantage of this show!
I remember ResNet was first presented at CVPR 2016, and it also won the Best Paper award. If I take a look at that CVPR, it was a very, very big conference with many attendees. It was difficult to talk with everyone. This is the most important thing when we have the opportunity to go to conferences: it is important to connect with other participants, talk a lot with the others, meet the others, and also try to understand. Stop by the posters and try to discuss with people, because we are now in a specific period in which we are starting the research activities for the next year. CVPR is also important to brainstorm and to try to acquire all the possible knowledge from the other participants.

It is also special to meet your heroes in person!
Yes!

Give me a couple of names that you were honored to meet in person when you started, when you didn’t know yet that they were here, available in real life.
Well, there are many people. Maybe when I started, when I was working on image captioning, one of the names was Devi Parikh, because she worked on similar topics. There are obviously many, many others: Ross Girshick, Raquel Urtasun, Sanja Fidler.

You gave a preference to ladies.
Yeah, I started to attend this type of conference thanks to the Women in Computer Vision workshop. The workshop gave me the opportunity to attend several conferences when I was a student.

Women for president! Indeed, Rita is now running to be elected rector of our university. OK, Rita for rector! We vote for her. Can we vote, Marcella?
I can vote, you can’t. The second ballot will be in only a few days.

Now a question I ask with passion: what would you tell a young scholar arriving at their first CVPR now, or young Marcella nine years ago in Las Vegas?
My advice is to be curious, and not to be shy like me. Also, connect with the other researchers, the other students. The connections that you make over the years at conferences will be useful for organizing other activities, research papers, workshops and much more!

My First CVPR

Klára Janoušková, a second-year PhD student at the Faculty of Electrical Engineering of the Czech Technical University in Prague, celebrates in front of her poster at CVPR 2025. Klára is supervised by Jiří Matas. Her poster FungiTastic, presented at FGVC12, the 12th Workshop on Fine-Grained Visual Categorization, introduces a new multi-modal dataset and benchmark for image categorization.

Full House at the Workshop!

Full house at the workshop 4D Vision: Modeling the Dynamic World. Adam Harley of Meta is presenting his talk “4D Vision Tomorrow: Structured, Slow, and Data-Driven”.

CV for Mixed Reality Workshop

Rana Hanocka of the University of Chicago presenting her talk “Data-Driven Neural Mesh Editing - without 3D Data” at the Computer Vision for Mixed Reality workshop.

Double-DIP

Don’t miss the BEST OF CVPR 2025 in Computer Vision News of July. Subscribe for free and get it in your mailbox! Click here

Do you read Computer Vision News? Read it here

Our First CVPR

[From left to right] Alexander Pondaven, Ben Kaye and Lorenza Prospero are all PhD students at the University of Oxford, and they are all celebrating their first CVPR. Alexander (TVG) will present his poster Video Motion Transfer with Diffusion Transformers on Sunday morning in the main program. Ben (VGG) is presenting his highlight poster DualPM: Dual Posed-Canonical Point Maps for 3D Shape and Pose Reconstruction this afternoon; he works on image reconstruction. Lorenza (VGG) presented her poster GST: Precise 3D Human Body from a Single Image with Gaussian Splatting Transformers at the workshop on Computer Vision in Sports; she works on human pose estimation.
