CVPR Daily 2025 - Computer Vision and Pattern Recognition
Saturday, Nashville
What happens at CVPR 2025
Jovana's Picks of the Day - Saturday 14

Orals:
[3A-1] MegaSaM: Accurate, Fast and Robust Structure and Motion from Casual …
[3A-2] Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos
[3C-4] Thinking in Space: How Multimodal Large Language Models See, Remember …
[4B-4] Navigation World Models

Posters:
[3-285] Progress-Aware Video Frame Captioning
[4-173] FIction: 4D Future Interaction Prediction from Video

Jovana Videnovic is a Master's student at the University of Ljubljana in Slovenia, currently working on visual object tracking and segmentation. "At CVPR, I'll be presenting my Master's thesis on distractor-aware object tracking - a training-free memory module integrated into SAM2. It's designed to improve the performance of memory-based trackers, especially in challenging scenes with occlusions, distractors, or long video sequences. This summer, I'm joining EPFL as an intern working on computer vision for biomedicine. My poster session is on Sunday morning - see you at poster #309!"

"I am originally from Serbia, and in my spare time I love to hike and spend time in nature!"
Editorial

CVPR Daily
Publisher: RSIP Vision
Editor: Ralph Anzarouth
Copyright: RSIP Vision. All rights reserved. Unauthorized reproduction is strictly forbidden.
Our editorial choices are fully independent from IEEE, CVPR, and the conference organizers.

Good Morning Nashville!

Did you follow the award announcements yesterday at CVPR 2025? Did you know that Jianyuan Wang won the Best Paper Award? Yes, the same paper that we reviewed in yesterday's magazine. Once again, we kinda guessed the Best Paper in advance. Did you miss our full review? Find it again here: Best Paper Award CVPR 2025.

Enjoy reading this CVPR Daily and have a great Saturday!

Ralph Anzarouth
Editor, Computer Vision News

Ralph's photo above was taken in peaceful, lovely and brave Odessa, Ukraine.
Oral & Award Candidate

Convex Relaxation for Robust Vanishing Point Estimation in Manhattan World

Bangyan Liao is a second-year PhD student at Westlake University in China. His paper introduces GlobustVP, a novel method for vanishing point estimation in a Manhattan world. In addition to being accepted for a coveted oral slot, it has been nominated as a candidate for a Best Paper award this year. Ahead of his presentation, Bangyan tells us more about his work.

In this paper, Bangyan explores the challenge of vanishing point estimation in a Manhattan world, a classical problem in 3D computer vision. The problem involves identifying the points at which multiple parallel lines appear to converge in 3D space, a critical task for many computer vision applications. Bangyan's research addresses the limitations of previous methods, which often struggled to be both fast and globally optimal. "We apply a convex relaxation technique to these problems for the first time," he explains. "Our method can jointly estimate vanishing point positions and line associations simultaneously, showing that GlobustVP achieves a superior balance of efficiency, robustness, and global optimality compared to prior works."

Vanishing point estimation is a fundamental building block for many downstream applications in 3D computer vision, and one which has not yet been fully solved - a key motivator for Bangyan to push his research forward. "Since this problem is very old, everybody thinks it must be very simple, but I don't think so," he points out. "It's a typical chicken-and-egg problem. To solve such problems, you must address two subproblems simultaneously. These two subproblems are highly coupled with each other. Solving each subtask is simple, but solving both together is a very hard problem."
Bangyan outlines three key insights for navigating these challenges. First, he proposes a joint approach to solving the coupled subproblems through a novel scheme called soft data association. This reformulates the problem as a quadratically constrained quadratic programming (QCQP) problem, a standard optimization problem in mathematical programming. The second insight involves transforming the non-convex QCQP problem into a convex semidefinite programming (SDP) problem, thereby simplifying the solution process. Finally, to enhance efficiency, he iteratively solves smaller SDP problems rather than one large problem, significantly accelerating the process.

As one of only 15 out of almost 2,900 accepted papers to be recognized as Best Paper award candidates, Bangyan reflects on why he thinks his work stood out from the crowd. "I think the reason comes from the fact that we reveal that even longstanding fundamental geometric problems are not entirely solved," he considers. "It demonstrates that there is still a need for more advanced and powerful optimization algorithms to tackle these classical challenges in the computer vision field."

Looking to the future, Bangyan aims to extend his research beyond the constraints of the Manhattan world and apply convex relaxation techniques to a broader range of applications in computer vision. "Our paper currently focuses on the Manhattan world, which is a relatively strict assumption," he
explains. "I want to overcome this assumption and use it more generally. Also, I want to utilize convex relaxation techniques for other applications in computer vision." Moving forward, he wants to explore the fusion of classical geometric optimization methods with advanced learning-based techniques to develop more powerful solvers. "I have to think about it," he ponders. "The combination is really hard. You have to learn both classical mathematics and keep up with the advanced techniques."

Bangyan highlights that many challenging fundamental problems in the field remain unsolved. If given the opportunity, he would love to solve them, particularly those related to convex relaxation and the scalability of semidefinite programming solvers. "As you can imagine, relaxation means relaxing some assumptions," he adds. "This relaxation might be tight or might be loose, and there's no theory that can prove it entirely. They can only prove a tight relaxation under strict assumptions. Additionally, classical solvers have a significant scalability issue. At a very large scale, efficiency can't be guaranteed. They can be really slow."

Last author Peidong Liu
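The lifting step behind this kind of convex relaxation can be sketched on a toy problem. The snippet below is a generic illustration, not GlobustVP's actual formulation: for the non-convex QCQP min xᵀCx subject to ‖x‖² = 1, lifting X = xxᵀ and dropping the rank-1 constraint yields a convex SDP whose optimum here coincides with the smallest eigenpair of C.

```python
import numpy as np

# Toy QCQP:  min_x  x^T C x   s.t.  ||x||^2 = 1   (non-convex quadratic equality).
# Lifting X = x x^T rewrites it as  min_X tr(C X)  s.t.  tr(X) = 1, X PSD, rank(X) = 1.
# Dropping the rank constraint gives a convex SDP; for this toy problem the
# relaxation is tight, and the optimum is the smallest eigenpair of C.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
C = A + A.T                          # symmetric cost matrix

lam, V = np.linalg.eigh(C)           # eigenvalues in ascending order
x_star = V[:, 0]                     # minimizer of the original QCQP
X_star = np.outer(x_star, x_star)    # rank-1 PSD solution of the lifted SDP

sdp_value = np.trace(C @ X_star)
print(np.isclose(sdp_value, lam[0]))       # SDP value matches the QCQP optimum
print(np.linalg.matrix_rank(X_star))       # rank 1 => relaxation is tight here
```

When the SDP returns a rank-1 solution, the relaxation is tight and the original non-convex optimum can be recovered by factorizing X; a higher-rank solution signals a loose relaxation, which is exactly the tightness issue Bangyan mentions.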
China has seen increasing success at international computer vision conferences, with a growing number of Chinese scientists and scholars submitting papers and seeing them widely accepted. Bangyan attributes this success to the intense competition within the country, driven by its large population. He laughs: "We have more people than other countries, so we have to work harder, think harder, and do harder!"

As he prepares for his presentation, Bangyan is committed to making his research accessible to a wider audience. "My paper is full of mathematical equations, but I want to emphasize that such mathematical equations are not as hard as you may imagine," he says. "I have some very insightful slides, and I want beginners to understand the insights behind my paper!"

To learn more about Bangyan's work, visit Oral Session 4C: 3D Computer Vision (Davidson Ballroom) this afternoon from 13:00 to 14:15 [Oral 2] and Poster Session 4 (Hall D) from 17:00 to 19:00 [Poster 102].

Co-author Zhenjun Zhao
Highlight Presentation

MASt3R-SLAM: Real-Time Dense SLAM with 3D Reconstruction Priors

Riku Murai (right), a postdoctoral researcher, and Eric Dexheimer (left), a fourth-year PhD student at Imperial College London, are the joint first authors of an innovative new paper that introduces a real-time monocular dense SLAM system. They speak to us ahead of their poster session later today.

In their paper, Riku and Eric address the challenge of SLAM (Simultaneous Localization and Mapping), which involves estimating the egomotion of a camera – essentially tracking its movement – while simultaneously mapping the 3D geometry of the surrounding scene. "Typically, SLAM, particularly monocular RGB SLAM, where we're given only images, is hard to do because there are many ambiguous cases," Riku explains. "Often, people need expertise. You need very careful motion. People typically move the camera in a very specific manner to perform SLAM, so it's not very robust."

While SLAM technology has matured and is now seen as a foundational building block for various robotics and augmented reality products, the need for a plug-and-play solution remains. A significant innovation in this work is the integration of a deep 3D reconstruction prior, known as MASt3R, developed by Naver Labs Europe. This prior is flexible, powerful, and does not require calibration. "Usually, you have an internal team that will calibrate your camera and make sure the hardware-software stack is aligned," Eric points out. "Having a SLAM system with a single camera
that works well, gives you good geometry, and tells you where your robot is, is pretty helpful!"

One benefit of having a very general SLAM system is its ability to meet the growing demand for data. A system that works on in-the-wild videos can utilize the vast selection of video clips already available online to reconstruct scenes and understand objects, without needing a fully integrated hardware-software stack.
Despite their advancements, Riku and Eric acknowledge that there are still challenges to overcome. In particular, the model assumes static scenes, meaning everything in the environment must be stationary. Eric notes that while their system is one of the more robust options available for single-camera SLAM, it is not infallible. "There are issues with dynamics and working in a variety of scenes, like super large-scale, outdoor scenes, where it's still not perfect," he reveals. "Also, we can run it on movie clips, and it works well, but we can't just feed in a whole movie."

Managing real-time performance in SLAM is another challenge the pair encountered. Eric explains that they wanted to see what was happening live on their desktop as they moved the camera. "One of our key contributions was this prior that puts everything in 3D," he tells us. "We can also imagine that as a set of rays from a camera. That gave us the benefit that we could keep the camera model general, so we could handle things that typical monocular SLAM systems can't, like zooming in on a video!" One way they pushed the real-time aspect of the project was by finding ways to implement some general form of projective correspondence. There are other ways of matching, such as in pixel space or 3D space, but they found that those methods were a bottleneck to real-time performance and degraded the system's accuracy.

Looking ahead, Riku envisions future work that addresses the limitations of static scene assumptions. "We need to handle dynamic scenes better, as they're everywhere in the real world," he adds. "Humans are moving, cars are moving, and assuming static scenes has always been a limitation of typical SLAM systems. Many new works try to
address dynamic scenes, and I hope that people will integrate these techniques into our pipeline to make the system more robust and able to reconstruct dynamically moving objects."

Riku and Eric are a truly global partnership. Riku, originally from Japan, has lived in the UK for 20 years, while Eric hails from Connecticut in the US. With SLAM being a fundamental problem in many 3D vision systems, from robotics to autonomous driving, both researchers express excitement about the potential for people to take their work forward into real-world applications. "Making our code public and then seeing people build on top of it is a fun part of this journey," Riku smiles. "Working close to the bottom of the stack, where we're providing data that other people can build on, and then getting feedback saying it was helpful, is always exciting and fulfilling."

Eric agrees and reflects on the journey they have taken: "We saw the MASt3R prior and recognized its potential, but getting the full system to work took some effort and expertise. It was exciting to figure out the problems and develop a real-time solution while also ensuring accuracy. Having the final system in play, it's fun. We sometimes still surprise ourselves with it!"

Before they go, the team is keen to give credit to Naver Labs, and in particular Jérôme Revaud, who visited Imperial, allowing them to see the potential of his work firsthand. "We're building off of their stuff," Eric adds. "Their work is very inspiring to many people in the community!"

To learn more about MASt3R-SLAM, visit Poster Session 4 (Hall D) today from 17:00 to 19:00 [Poster 83].
Congrats, Doctor Sascha!

From Roots to Gradients: Teaching Trees to Learn Differently

Sascha Marton recently defended his PhD at the University of Mannheim and is now Assistant Professor at the Technical University of Clausthal. At the CORE lab, Sascha continues to explore structured and interpretable learning systems. His research brings decision trees - one of the most classic machine learning models - back to the spotlight with a fresh twist: teaching them to learn like neural networks. The resulting gradient-based trees strike a balance between transparency and performance, showing promising results across tabular learning, multimodal architectures, and even reinforcement learning.

Let's face it - decision trees (DTs) were probably one of the first models you ever used in a machine learning course. They're intuitive, interpretable, and often overlooked the moment we move on to deep learning. I chose a different route: instead of leaving DTs behind, I doubled down on them. During my PhD, I developed a method that allows hard, axis-aligned DTs to be trained using gradient descent - just like neural networks. This turns traditional tree induction on its head: no more greedy, locally optimal split-by-split procedures (Figure 1). Instead, all parameters - including thresholds, features, and even leaf outputs - are optimized jointly using backpropagation.

Sounds simple? It's not. Hard, axis-aligned splits are non-differentiable by nature. But by combining a dense, parameter-based tree representation (Figure 2) with the straight-through operator and soft approximations during backpropagation, we can calculate meaningful gradients - all while keeping the tree structure hard and interpretable throughout training. The
result: GradTree, a new class of DTs that outperforms many baselines on tabular data by combining the inductive bias of trees with the optimization power of neural nets.

Once we could optimize a single tree, we scaled it up to GRANDE - a fully differentiable, weighted ensemble of DTs that remains efficient, robust, and expressive. It achieves state-of-the-art results on tabular benchmarks and extends naturally to broader domains like reinforcement learning and multimodal learning, where it integrates seamlessly and delivers strong results. For instance, we demonstrated how GRANDE can act as a structured backbone for tabular inputs, combined effectively with CNNs for image processing, or how tree-based heads can serve as the final decision-making layer to boost interpretability.

DTs are not outdated - we've just been using them the same way for over 40 years. This work shows how a structured, interpretable model can benefit from modern optimization tricks and, in doing so, opens the door to new use cases that previously felt out of reach. Sometimes, it's worth going back to where you started - and teaching it some new tricks.

Sascha Marton

Figure 1. Greedy vs. Gradient-Based DT. Two DTs trained on the Echocardiogram dataset. The CART DT (a) makes only locally optimal splits, while GradTree (b) jointly optimizes all parameters, leading to significantly better performance.

Figure 2. Standard vs. Dense DT Representation. Comparison of a standard (a) and its equivalent dense representation (b) for an exemplary DT with depth 2 and a dataset with 3 variables and 2 classes. Here, ℎ stands for a discretized logistic function.
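The straight-through idea Sascha describes - hard splits in the forward pass, gradients taken through a soft sigmoid surrogate in the backward pass - can be sketched in a few lines. This is an illustrative toy with assumed names (`hard_split_forward`, `straight_through_grad`, temperature `tau`), not the actual GradTree implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hard_split_forward(x, threshold, tau=0.1):
    """Hard axis-aligned split: route x right (1.0) iff x > threshold.
    The soft sigmoid is kept only as the backward-pass surrogate."""
    soft = sigmoid((x - threshold) / tau)
    hard = float(soft > 0.5)
    return hard, soft

def straight_through_grad(x, threshold, upstream, tau=0.1):
    """Straight-through backward: differentiate the soft surrogate instead of
    the gradient-free hard step; d soft / d threshold = -s(1 - s) / tau."""
    s = sigmoid((x - threshold) / tau)
    return upstream * (-s * (1.0 - s) / tau)

# One gradient step that moves the threshold so a misrouted sample flips sides.
x, t, target = 0.3, 0.5, 1.0
hard, _ = hard_split_forward(x, t)                          # routed left (0.0)
t -= 1.0 * straight_through_grad(x, t, upstream=hard - target)
hard_after, _ = hard_split_forward(x, t)                    # now routed right (1.0)
```

In the article's method, the same trick is applied jointly to thresholds, feature selectors, and leaf values over the dense tree representation, so the whole tree trains by backpropagation while staying hard at every step.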
Women in Computer Vision

Read 160 FASCINATING interviews with Women in Computer Vision

Elisabeta Oneata is a computer vision researcher at Bitdefender in Romania.

Elisabeta, tell us about your work.
I work in a fundamental research team on deepfake detection, mostly focused on detecting fakes in the video and image domains, but more recently I'm moving to multimodal work with audiovisual data.
Apparently, after your PhD, you decided to continue your career in industry and not in academia. Was this by chance, or something you decided for a reason?
Well, I just wanted to work in research. And it happened that shortly after I finished my PhD, I took a two-year break when my daughter was born. So I felt like I needed to make a change, and in some ways going into academia in Romania felt a bit hard, because I needed to invest time in teaching and speaking in front of people - which I'm not very comfortable with. So I tried to find a position as close as possible to academic research, but in a company; and I was lucky enough to find this position here at Bitdefender.

Now that you are at Bitdefender, you might still have to explain things to people, to talk to groups of people. You certainly do.
I certainly do have to do this. You cannot run away from it. And I still teach, but only for about two hours, a class we run through some collaborations with our local universities. So yes, I cannot run away from it. Moving toward industry started a bit before, when I worked for one year at Google here in Romania. But since it was with my PhD supervisor, I think the transition from academia towards industry felt smoother.

I am sure that the more knowledgeable you become about your science, the easier it becomes to tell other people about it. Is that right?
Yes, that's right. But I feel that in order to be comfortable with the things that you're working on, you need to be good at them. You need to study them in depth, which is really hard because the field evolves very quickly. And I personally always feel
behind! It's like one step, or many steps, behind where I think I should be. Probably that's a more general feeling, because it's really hard to comprehend everything that is published, everything that is going on, at least in the whole AI field.

It's challenging, but I'm sure you have developed some personal methods to cope with this.
Yes, of course. I follow a group of scientists who are very relevant to my research, and I tend to skim through their articles and know exactly where to look and how to find the things that are most relevant to my research. It doesn't feel as overwhelming as it used to be, but it still is.

You have made a very interesting choice: to work in a field that is very international and to continue your career in your country. Was it a deliberate decision, or did it happen by chance?
It was a complex decision and there are many factors in it. I studied abroad for my master's in the UK, and then I came back to Romania, and I was not very sure if I was going to stay here or go back. There are many aspects, both personal and professional, that enter into this decision. On the professional side, I got the chance to be in a very good research laboratory with high standards that actually taught me what research is, and to be able to do a PhD at Western European standards, let's say. That was certainly one reason that mattered in choosing to be here, because I saw that I could go to conferences, I could publish, I had the right mentorship and very good colleagues to learn from, and I could do research at standards that felt similar to other countries. Of course, there are other personal reasons for choosing to stay in your country, and those mattered as well. I think the decision was taken over a series of years. It was not one day. At one
moment it felt like, okay, it seems to work!

How did you like doing your master's in the UK?
I loved it!

Tell us about it, make us dream!
It was a great exposure for me to meet so many international students. There was this cultural novelty in talking to people from all over the world and thinking, okay, these are people like me. I feel that there was more emphasis on practical aspects, intuition, and understanding, rather than the theoretical and abstract. And I do need this part of intuition in order to build mental models of what I'm working on. So it felt good for me. I think that the relationship between professors and students was a bit more natural. I felt more confident to ask questions and to be wrong than I did back in my home country. Overall it was a really nice experience.

Where do you think Romania is going?
I feel it's going in a good direction. I've been here for the past 12 to 13 years since I started, and especially when I came back from my master's, it felt really hard to think that I could find a research position here that would allow me to work at the highest standards. So people have more opportunities here in Romania than there were 10 to 12 years ago.

Where are you based, Elisabeta?
In Bucharest.

Okay, tell us something about Bitdefender that we don't know. No secrets, only things that we are allowed to know.
I think that not many people know about our theoretical research team, which is not a very large one. We are about 10 to 15 people. But what is particular about it is that we're trying to do more fundamental research; we're not tied to a particular product. So in a sense it's closer to academic research. It's a very nice place. I mentioned I was particularly privileged and lucky to find this type of position. And at the same time, you can still work with more industrial approaches or with the engineering team, if you'd like, and you can still do the more fundamental type.
You have the resources and the time to devote to
a problem that doesn't require getting results by a deadline.

I think that you consider yourself a lucky woman, right?
Yes, that's true.

Do you think your luck is something you looked for, or did it come by itself?
Probably a bit of both. I was offered some opportunities and privileges that were quite unique, and I do hope I have tried to make the most out of them. And of course, it's not always easy, not all flowers and butterflies along the way. I think it's a combination of both. I had a lot of support from outside - from mentors, from family, from everyone - and good opportunities, and I also tried to make the most of them when I got them.

What will you do in the future to be sure that you continue to be a lucky woman?
That's a harder question. I don't have too many plans for the future, and actually I never had. I never dreamed of being a researcher. I tend to take things one day at a time and focus on the things that I'm doing now. And I think that if I strive for quality, for competence, for all the basic, very strong values, the other things will somehow follow along.

In tennis they say: point after point, one point at a time. So probably I cannot ask you what is the dream for the rest of your career. But is there anything that you would like to have that you don't have yet?
I think it's not about external recognition. I want to strive to be more confident in my decisions, in my approach, and in how I mentor the newer generations. I'm not yet where I want to be, but I'm better than I used to be. It's a journey that has ups and downs. I want to be less influenced by rejections or minor things - which I know are normal - along the way. You don't get everything that you want. So stop feeling them like failures - see them rather as normal steps along the way. It's more like a mental state that I want to achieve, rather than recognition for publishing good research, being visible, or other things like that.

You know that there is no shop where you can buy confidence.
Yes.
That's how you become confident. Failures are normal and they are needed. But it's easier to talk about them than to feel them, or to let them pass without getting stuck too much feeling like, OK, I want to give up!
CVPR Paris
UKRAINE CORNER

Yes, there was a CVPR Paris this year, only a few days before we met in Nashville. The awesome Ukrainian ladies here are (from left) Tetiana Martynyuk, Sophia Sirko and Yaroslava Lochman. Yara was so sweet to share these photos with us. It is always right to remember CVPR's official motion against the Russian invasion of Ukraine: "CVPR condemns in the strongest possible terms the actions of the Russian Federation government in invading the sovereign state of Ukraine and engaging in war against the Ukrainian people. We express our solidarity and support for the people of Ukraine and for all those who have been adversely affected by this war."
My First CVPR

Ege Özsoy is a third-year PhD student at the Technical University of Munich (TUM). This afternoon he will present his research on surgical scene understanding: MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High-Intensity Surgical Environments [poster 4-341].
Double-DIP

Don't miss the BEST OF CVPR 2025 in Computer Vision News of July. Subscribe for free and get it in your mailbox! Click here.
Our Own CVPR T-shirts

Xingyu Chen and Yue Chen are both PhD students at Westlake University. Their team is presenting Feat2GS at CVPR 2025, where they explore the 3D awareness of Visual Foundation Models using only 2D images! Xingyu is co-advised by Anpei Chen and Andreas Geiger. Yue is co-advised by Yuliang Xiu and Gerard Pons-Moll. Dunno about you, but I like these folks very much and I love their T-shirts!
Poster

"Hello! I am Kaiyue Sun, a second-year PhD student from the Department of Electrical and Electronic Engineering at the University of Hong Kong. I'm presenting my research in front of my poster at CVPR 2025! My poster, 'T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-Video Generation', introduces a comprehensive benchmark for compositional text-to-video generation."