Computer Vision News - Summer 2025

Full review of Best Paper and other winners
Let us tell you what happened at CVPR
Exclusive Interview with Kristen Grauman
Summer 2025
Enthusiasm is common, endurance is rare!

Best Paper CVPR 2025
VGGT: Visual Geometry Grounded Transformer

Jianyuan Wang is a joint PhD student at the University of Oxford’s Visual Geometry Group and Meta AI. His paper introduces a super-fast feed-forward reconstruction model, representing a significant advancement in 3D computer vision. Ahead of his oral presentation this afternoon, Jianyuan tells us more about his innovative work.

This article was published BEFORE the CVPR 2025 award winners were announced. Which means that we kinda guessed the best paper in advance two years in a row. CVPR 2024 was here. Don’t ask, I’ve no clue!

Jianyuan’s paper proposes a novel feed-forward reconstruction model that processes multiple input images to generate a 3D reconstruction. Unlike prior classical and deep learning-based methods, which often rely on time-consuming test-time optimization, this model operates without such constraints. Optimization techniques such as bundle adjustment or global alignment can take minutes or longer to complete. In contrast, Jianyuan’s model achieves reconstruction in seconds, significantly enhancing speed and efficiency. “Such optimization steps are usually non-differentiable and can’t work as a plug-and-play component in recent deep learning frameworks,” he explains. “That's the bottleneck for 3D vision these days. Therefore, we go for a feed-forward only model!” Jianyuan identifies two major challenges in developing this model. The first was the need for a robust dataset to solve the problem in a data-driven manner. He collected 17 public datasets and processed them into a unified format, a task that required considerable engineering work. However, this was crucial because the quality of the data determines the limits for any method. The second challenge involved ensuring the model's generalization ability. “We want the model to handle an arbitrary number of input frames during inference,” he tells us.

“Users may have only one frame or 100 frames, but they still want the reconstruction results.” To address this, he implemented an alternating-attention mechanism, utilizing frame-wise attention to enable the model to identify which tokens correspond to which input frame. Jianyuan's research leverages several advanced computer vision techniques. Drawing inspiration from the success of 2D vision, it utilizes DINO, a 2D foundation model based on a vision transformer architecture. This approach enables the model to patchify the input images into multiple tokens, transforming the image information into a format that networks can understand and process. Additionally, the model features a camera head that regresses the camera's extrinsic and intrinsic parameters. This simple transformer approach is informed by previous works in camera pose estimation, such as RelPose, PoseDiffusion, and VGGSfM. He also employs DPT, a computer vision network developed four years ago, to predict dense, pixel-wise outputs. Now that we know which techniques Jianyuan has learned from, are there computer vision techniques that he thinks could benefit from his work? “Yes, neural rendering methods, such as 3D Gaussian Splatting or NeRF, because they need camera poses predicted from upstream methods such as ours,” he responds. “Also, our model can predict a high-level latent representation of the 3D properties, so recent large 3D VLM models could benefit from it.” One potential application of this work in the real world is in online shopping, where customers often rely on 2D images of products. By utilizing this model, retailers could offer 3D reconstructions of items, allowing customers to rotate and view products from all angles, and even create personal 3D avatars for a virtual fitting.
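To make the alternating-attention idea a little more concrete, here is a minimal PyTorch sketch of one such block (our own illustration under assumed tensor shapes, not the official VGGT code): tokens first attend within their own frame, then across all frames.

```python
# Hypothetical sketch of an alternating frame-wise / global attention block.
# Assumes tokens shaped (batch, frames, tokens_per_frame, dim); not VGGT's actual code.
import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, F, N, D = x.shape
        # Frame-wise attention: tokens only see other tokens from the same frame
        t = self.norm1(x).reshape(B * F, N, D)
        t, _ = self.frame_attn(t, t, t)
        x = x + t.reshape(B, F, N, D)
        # Global attention: every token sees all tokens from all frames
        g = self.norm2(x).reshape(B, F * N, D)
        g, _ = self.global_attn(g, g, g)
        return x + g.reshape(B, F, N, D)

tokens = torch.randn(1, 4, 196, 256)  # e.g. 4 frames, 14x14 DINO-style patches, 256-dim
out = AlternatingAttentionBlock(256)(tokens)  # output keeps the same shape (1, 4, 196, 256)
```

Stacking blocks like this lets the same weights ingest one frame or a hundred, which is what makes the “arbitrary number of input frames” requirement tractable.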

Jianyuan’s paper has not only earned him an oral presentation slot at this year’s conference but has also been nominated for a prestigious Best Paper award. He attributes this recognition to the pressing need for advancements in 3D vision, which currently lags behind rapid developments in 2D vision and natural language processing. “They have built a lot of fantastic works, like GPT and SAM,” he points out. “In 3D vision, we’re still working with smaller models and classical techniques. A joint thought among the 3D vision community is that we need a large 3D foundation model that can handle numerous downstream tasks. I think that’s why this paper is kind of special!” Looking to the future, Jianyuan is optimistic about the potential applications of his research.

He has already observed follow-up works, including AnySplat, which utilizes VGGT’s feature backbone to enable feed-forward Gaussian parameter prediction for novel view synthesis, and Spatial-MLLM, which combines its backbone with other large vision models to establish a unified foundation model for 3D perception. “In the future, we could see further trials on 4D tasks,” he envisions. “As we go from 2D to 3D, I think in probably two or three years, we’ll have something good in 4D. In 4D, people dance, run, and many scenes are dynamic!” In conclusion, while Jianyuan’s model represents a significant step forward, he emphasizes that data-driven 3D vision is just the beginning. “As Rich Sutton said in 2019, general approaches that leverage computation will ultimately prove to be the most effective,” he reflects. “This ‘Bitter Lesson’ has attracted great attention in the 2D and NLP communities, and we believe it’s true for 3D as well. Feed-forward models will be the future of 3D vision.”

NOTE: this article was published before the announcement of the award winners. Which explains why it does not mention being a winning paper. Once again, we placed our bets on the right horse! Congratulations to Jianyuan and team for the brilliant win! And to the other winning papers too!

Oral & Award Candidate
Zero-Shot Monocular Scene Flow Estimation in the Wild

Yiqing Liang is a PhD student in Computer Science at Brown University. Her recent paper on scene flow estimation, developed during a summer internship at NVIDIA Research, has been accepted for an oral presentation at CVPR 2025 and nominated for a coveted Best Paper award. Ahead of her presentation, Yiqing told us more about her fascinating work.

In this paper, Yiqing introduces a novel generalizable foundation model for estimating both geometry and motion in dynamic scenes using just a pair of image frames as input. This task, known as scene flow estimation, has long been a challenge in computer vision and is crucial for applications such as robotics, augmented reality, and autonomous driving, where understanding 3D motion is essential. Yiqing likens the task to a first-person video game: “Your head is always in the center,” she explains. “You see the wall move, people walk around, and objects change shape. Our model can predict the geometry and motion of all of it!” The timing of this work was key. Monocular scene flow was proposed about five years ago, but it hit a wall: there was not enough compute, data, or pretrained weights to make it work. Now, that has all changed. “We benefited from advancements in 3D over the last year,” she reveals. “People found that if you scale up training for 3D geometry prediction, you can get feed-forward methods that predict 3D geometry from 2D information. We go one step further than that, and ask: can we also add motion?” The answer, it turns out, is yes, but the biggest challenge was not the model architecture – it was the data. “I’m probably giving an answer that my fellow researchers working in the field are very familiar with!” she laughs.

“The coding part is fairly easy, but having enough data to formulate the problem properly takes more time!” Yiqing eventually curated a massive dataset of over 1 million annotated training samples, spanning indoor scenes, outdoor scenarios with vehicles and pedestrians, animated films, and even simulated environments with chaotic motion. Much of the data was derived from existing RGB-D video datasets, which combine color, depth, and camera parameters. By carefully converting and filtering them to reduce edge noise, she was able to reconstruct scene flow annotations at scale.
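As a rough illustration of how such annotations can be derived from RGB-D video with known camera parameters (a generic construction under assumed notation, not necessarily the exact pipeline used in the paper): a pixel $\mathbf{u}$ at time $t$ with depth $d_t(\mathbf{u})$ is back-projected with the intrinsics $K$, its correspondence $\mathbf{u}'$ at time $t{+}1$ is back-projected likewise and mapped into the first camera’s frame with the relative pose $T_{t+1\to t}$, and the difference of the two 3D points is the scene flow label:

$$
\mathbf{P}_t(\mathbf{u}) = d_t(\mathbf{u})\,K^{-1}\tilde{\mathbf{u}},
\qquad
\mathbf{s}(\mathbf{u}) = T_{t+1\to t}\!\left(d_{t+1}(\mathbf{u}')\,K^{-1}\tilde{\mathbf{u}}'\right) - \mathbf{P}_t(\mathbf{u}),
$$

where $\tilde{\mathbf{u}}$ denotes homogeneous pixel coordinates.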

Even with an innovative new model in hand and a Best Paper nomination on the table, Yiqing remains grounded. “All of us are very honored to be an award candidate, but I think it’s not only because of the quality of the work, but it’s also because of luck,” she says modestly. “There are many, many works that are very, very high quality.” What made this one stand out, she suggests, is its perspective: “We’re looking at a classic problem through a modern lens. I’ve worked on popular methods like NeRF and Gaussian splatting before, and the main bottleneck is the inference-time learning - you wait minutes, hours, even days. Classical methods don’t have that problem. Now, classical methods are generalizable, so we try to marry the two trends together to create a new possibility.” Looking ahead, Yiqing sees several promising directions for future work, which she hopes the community will take forward. First, there is the potential to scale the dataset even further, not just in terms of size, but also in terms of diversity. Incorporating noisy real-world data would be particularly valuable. “One million sounds big,” she remarks, “but it’s still small compared to what's used for diffusion models.” Next is extending the model’s capabilities beyond geometry and scene flow. “We’re interested in predicting other modalities, like camera motion, to decompose scene motion into different fractions for more applications,” she tells us. The method could also be extended to long-term tracking. “Right now, we work with image pairs, but what if we had more pairs? What if we had a longer time horizon between the pairs?”

She is also excited about potential applications in robotics: “People have been trying to use particle systems in robotics because they found that it’s a more abstract version of the information compared to the raw camera. For example, RGB-D point clouds. It's possible to abstract our output like that.” Looking at the bigger picture, Yiqing is curious about how this research could intersect with multimodal large language models. “For LLMs, the multimodal side still has a lot to explore,” she points out. “People are interested in how to encode visual information more efficiently, and how to let it interact more with textual information.” More than anything, what excites Yiqing most is this model’s generalizability. “It’s really cool how general it is!” she says with a smile. “We’ve tested it on out-of-domain datasets – real-world, high-motion scenes – and it still works!”

NOTE: this interview was taken before the announcement of the award winners. Which explains why it mentions it being a Best Paper nomination and award candidate.

Oral & Award Candidate
Convex Relaxation for Robust Vanishing Point Estimation in Manhattan World

Bangyan Liao is a second-year PhD student at Westlake University in China. His paper introduces GlobustVP, a novel method for vanishing point estimation in a Manhattan world. In addition to being accepted for a coveted oral slot, it has been nominated as a candidate for a Best Paper award this year. Ahead of his presentation, Bangyan told us more about his work.

In this paper, Bangyan explores the challenge of vanishing point estimation in a Manhattan world, a classical problem in 3D computer vision. This problem involves identifying the points in the image at which the projections of parallel 3D lines appear to converge, a critical task for several computer vision applications. Bangyan's research addresses the limitations of previous methods, which often struggled with speed and global solutions. “We apply a convex relaxation technique to these problems for the first time,” he explains. “Our method can jointly estimate vanishing point positions and line associations simultaneously, showing that GlobustVP achieves a superior balance of efficiency, robustness, and global optimality compared to prior works.” Vanishing point estimation is a fundamental building block for many downstream applications in 3D computer vision, and one which has not yet been fully solved, which served as a key motivator for Bangyan to push his research forward. “Since this problem is very old, everybody thinks it must be very simple, but I don't think so,” he points out. “It’s a typical chicken-and-egg problem. To solve such problems, you must address two subproblems simultaneously. These two subproblems are highly coupled with each other. Solving each subtask is simple, but solving them both is a very hard problem.”

Bangyan outlines three key insights for navigating these challenges. First, he proposes a joint approach to solving the coupled subproblems through a novel scheme called soft data association. This reformulates the problem as a quadratically constrained quadratic programming (QCQP) problem, a standard optimization problem in mathematical programming. The second insight involves transforming the non-convex QCQP problem into a convex semidefinite programming (SDP) problem, thereby simplifying the solution process. Finally, to enhance efficiency, he iteratively solves smaller SDP problems rather than one large problem, significantly accelerating the process. As one of only 15 out of almost 2,900 accepted papers to be recognized as Best Paper award candidates, Bangyan reflects on why he thinks his work stood out from the crowd. “I think the reason comes from the fact that we reveal that even longstanding fundamental geometric problems are not entirely solved,” he considers. “It demonstrates that there is still a need for more advanced and powerful optimization algorithms to tackle these classical challenges in the computer vision field.”
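The second insight is easiest to see in generic form (a standard textbook lifting, not the paper’s exact formulation): a QCQP over a vector $\mathbf{x}$ is rewritten over the lifted matrix $\mathbf{X}=\mathbf{x}\mathbf{x}^{\top}$, and dropping the non-convex rank-one constraint leaves a convex SDP:

$$
\begin{aligned}
\text{QCQP:}\quad & \min_{\mathbf{x}}\ \mathbf{x}^{\top}C\,\mathbf{x}
\quad \text{s.t.}\quad \mathbf{x}^{\top}A_i\,\mathbf{x}=b_i,\ i=1,\dots,m,\\
\text{SDP relaxation:}\quad & \min_{\mathbf{X}\succeq 0}\ \operatorname{tr}(C\mathbf{X})
\quad \text{s.t.}\quad \operatorname{tr}(A_i\mathbf{X})=b_i,\ i=1,\dots,m.
\end{aligned}
$$

If the optimal $\mathbf{X}^{\star}$ turns out to be rank one, the relaxation is tight and $\mathbf{x}^{\star}$ can be recovered exactly from its factorization, which is the kind of global-optimality certificate convex relaxations can provide.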

Looking to the future, Bangyan aims to extend his research beyond the constraints of the Manhattan world and apply convex relaxation techniques to a broader range of applications in computer vision. “Our paper currently focuses on the Manhattan world, which is a relatively strict assumption,” he explains. “I want to overcome this assumption and use it more generally. Also, I want to utilize a convex relaxation technique for other applications in computer vision.” Moving forward, he wants to explore the fusion of classical geometric optimization methods with advanced learning-based techniques to develop more powerful solvers. “I have to think about it,” he ponders. “The combination is really hard. You have to learn both classical mathematics and keep up with the advanced techniques.” Bangyan highlights that many challenging fundamental problems in the field remain unsolved. If given the opportunity, he would love to solve them, particularly those related to convex relaxation and the scalability of semidefinite programming solvers. “As you can imagine, relaxation is to relax some assumptions,” he adds. “This relaxation might be tight or might be loose, and there’s no theory that can prove it entirely. They can only prove a tight relaxation under strict assumptions. Additionally, classical solvers have a significant scalability issue. At a very large scale, efficiency can’t be guaranteed. They can be really slow.”

Last author: Peidong Liu

China has seen increasing success at international computer vision conferences, with a growing number of Chinese scientists and scholars submitting their papers and receiving widespread acceptance. Bangyan attributes this success to the intense competition within the country, which is driven by its larger population. He laughs: “We have more people than other countries, so we have to work harder, think harder, and do harder!” As he prepares for his presentation, Bangyan is committed to making his research accessible to a wider audience. “My paper is full of mathematical equations, but I want to emphasize that such mathematical equations are not as hard as you may imagine,” he says. “I have some very insightful slides, and I want beginners to understand the insights behind my paper!”

NOTE: this interview was taken before the announcement of the award winners. Which explains why it mentions it being a Best Paper award candidate.

Co-author: Zhenjun Zhao

CVPR Paris

UKRAINE CORNER

Yes, there was a CVPR Paris this year, only a few days before we met in Nashville. The awesome Ukrainian ladies here are (from left) Tetiana Martynyuk, Sophia Sirko and Yaroslava Lochman. Yara was so sweet to share these photos with us. It is always right to remember CVPR’s official motion against the Russian invasion of Ukraine: CVPR condemns in the strongest possible terms the actions of the Russian Federation government in invading the sovereign state of Ukraine and engaging in war against the Ukrainian people. We express our solidarity and support for the people of Ukraine and for all those who have been adversely affected by this war.

Workshop
Event-based Vision Workshop at CVPR 2025: A Growing Hub for Neuromorphic Innovation
by Cornelia Fermuller and Guillermo Gallego

The Fifth Workshop on Event-based Vision, held on June 12, 2025, at CVPR, reaffirmed its role as a central forum for the rapidly expanding community working at the intersection of sensing hardware, computer vision, and intelligent systems. The workshop has become a cornerstone event for researchers advancing event-based and neuromorphic vision, a field that continues to gain momentum. This year’s program featured an open call for papers, live demonstrations, poster presentations, and multiple international competitions designed to benchmark progress and promote collaboration.

After completing his PhD, Joshua has begun working as a machine learning engineer at Google.

Academic highlights included Davide Scaramuzza (University of Zurich) presenting new developments in structure-from-motion and SLAM with event sensors in robotics; Christopher Metzler (University of Maryland) showcasing innovative uses of event sensors in computational photography; and Priyadarshini Panda (Yale University) discussing the integration of event data with spiking neural networks for efficient hardware and software systems. Industry contributions stood out as particularly impactful. Kynan Eng of SynSense (Switzerland) outlined current barriers to mainstream adoption of event-based vision and emphasized its strong potential in 3D motion applications for robotics. Davide Migliore, representing the newly launched event-vision startup Tempo Sense, engaged the audience in an interactive discussion using live polling to gather perspectives on current challenges, key milestones, and promising directions for future research and applications.

Challenge Winner
EgoVis HD-EPIC VQA
by Agnese Taluzzi and Davide Gesualdi

This challenge focused on a rapidly growing area of research: how can AI understand what’s happening in first-person videos, like the ones captured with smart glasses? Egocentric videos are particularly complex: the camera moves as it is attached to someone’s head, objects go in and out of view, and there is often a lot happening outside the frame. While humans naturally excel at interpreting both fine-grained actions and broader activities, traditional video understanding models struggle in this scenario because they’re trained mostly on third-person, nicely framed images. So, this challenge was about pushing the limits of AI to understand in-the-wild interactions and environments from a first-person perspective, something that’s becoming increasingly relevant for wearable tech, assistive devices, and robotics. We addressed the challenge by focusing on limitations of current models, particularly their difficulty in reasoning over long egocentric video sequences and their lack of structured semantic understanding. Most state-of-the-art multimodal models like Gemini are trained on third-person data and rely heavily on frame-level representations, which makes it hard for them to capture object affordances, state changes, and causal dynamics in first-person settings. To overcome this, we adopted a hybrid approach rooted in neuro-symbolic AI, combining two complementary modules: SceneNet, which generates structured scene graphs capturing spatial relationships and actions grounded in the video; and KnowledgeNet, which expands object understanding using external commonsense knowledge from ConceptNet, enabling reasoning that extends beyond visual evidence.

From left: Riccardo, Agnese, Davide and Chiara. Work was carried out in the Smart Eyewear Lab, a joint research center of Politecnico di Milano with EssilorLuxottica. Other team members: Francesca, Simone, Matteo. Great job guys!

Our solution stood out in the competition due to its innovative integration of neuro-symbolic reasoning, which effectively bridged the gap between low-level visual perception and high-level semantic understanding. SceneNet gave the model a grounded, structured view of the world, perfect for understanding things like where objects are and how they move. KnowledgeNet added a layer of abstract reasoning, which helped with questions that required prior knowledge or procedural understanding. By integrating symbolic scene representations into the prompt, this structured form of prompt engineering enabled more efficient reasoning over long sequences. These graph-based abstractions provided compact and interpretable encodings of spatial-temporal dynamics, reducing the need for extensive processing of raw visual tokens. The infusion of external knowledge further allowed the model to reason beyond visual input, enhancing its robustness and generalization in complex and data-scarce scenarios. This challenge provided both the community and our team with valuable insights into the potential and limitations of current approaches to egocentric video understanding. We demonstrated that graph-based structured representations can significantly enhance a model's ability to interpret complex object interactions and temporal dynamics when combined with external commonsense knowledge. At the same time, we observed the challenges of integrating heterogeneous information sources, reinforcing the need for more refined and context-aware fusion strategies. Our results support the importance of developing neuro-symbolic frameworks that are selectively applied and task-specific, rather than one-size-fits-all. These findings open several avenues for future research, including the creation of unified models that combine the strengths of SceneNet and KnowledgeNet, the exploration of more advanced, question-aware fusion techniques, and the extension of symbolic reasoning to new modalities and domains. Additionally, the idea of performing reasoning directly on symbolic representations, without relying on raw video, holds promise for scenarios with limited computational resources. Indeed, advancing scalable reasoning over long, complex egocentric sequences remains a key challenge in enabling deeper temporal and causal understanding. Overall, the experience confirmed that structured intermediate representations not only improve model performance but also lay the groundwork for building AI systems that are more interpretable, robust, and aligned with real-world human contexts.
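A minimal sketch of the structured-prompting idea described above (purely illustrative: the SceneNet and KnowledgeNet outputs are mocked, and the prompt format is our assumption rather than the team’s actual code) could look like this:

```python
import json

# Mocked outputs of SceneNet-like and KnowledgeNet-like modules (hypothetical content)
scene_graph = {
    "objects": ["hand", "kettle", "mug"],
    "relations": [["hand", "holds", "kettle"], ["kettle", "above", "mug"]],
    "actions": ["pouring water"],
}
commonsense = ["a kettle is used to pour hot water", "pouring fills the mug"]
question = "What is the person about to do with the mug?"

# Serialize the symbolic context into the prompt so the multimodal model reasons
# over a compact, structured description instead of long runs of raw video tokens
prompt = (
    "Scene graph:\n" + json.dumps(scene_graph, indent=2) + "\n\n"
    "Background knowledge:\n- " + "\n- ".join(commonsense) + "\n\n"
    "Question: " + question + "\nAnswer concisely."
)
print(prompt)  # this string would be sent to a VLM together with a few key frames
```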

From the CVPR Expo

Paula Ramos is a Senior Computer Vision Machine Learning Advocate Lead at Voxel51. She spoke to me at CVPR 2025 in Nashville.

Who is Voxel51?
Voxel51 is more than the company that is putting its name in the liner at CVPR. It is the company behind the FiftyOne platform. FiftyOne is the only data-centric AI platform where basically we have the mission to improve the dataset quality, to improve the model performance.

Who needs this?
Everyone in the computer vision field. You need to understand your dataset. You need to understand what kind of computer vision tasks you need. You can understand maybe the cluster that the raw data has. You can maybe also understand the annotations, if you have mislabeled something. And also, you can understand the predictions. So, you can understand everything. And at the end with the predictions, you can also compare with model evaluation how the model is performing with those data.

How did we do research before Voxel51?
Oh my gosh, you know, I think that FiftyOne is a time saver. Because before that, you needed to maybe navigate a JSON file with all the file names and patterns that you have and metadata. With FiftyOne, you can visualize everything in one single platform. FiftyOne has an SDK and a GUI. You need to code and program in the SDK. And then you can launch the session of the GUI and you can curate, filter the information.
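For readers who have never touched it, a minimal FiftyOne session looks roughly like this (a sketch based on the public SDK; the dataset path, directory layout, and names are placeholders):

```python
import fiftyone as fo
import fiftyone.brain as fob

# Load an image dataset from disk (path and directory format are placeholders)
dataset = fo.Dataset.from_dir(
    dataset_dir="/path/to/dataset",
    dataset_type=fo.types.ImageClassificationDirectoryTree,
    name="my-dataset",
)

# Compute an embeddings visualization to spot clusters, duplicates and outliers
fob.compute_visualization(dataset, brain_key="img_viz")

# Launch the App (the GUI) to curate, filter and inspect the data interactively
session = fo.launch_app(dataset)
```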

Before that, you needed to navigate the JSON file. Just understand more or less the structure of your directory, how your data is there. Maybe you have a timestamp. Maybe you have the name of the labels in the JSON file. It is difficult to understand the structure of your data, understand the blind spots of your data. With FiftyOne, we can make a lot of things and explore the dataset on multiple fronts. With the metadata, we can understand that with interactive plots. And we also have embeddings, which are so important in the datasets; because with the embeddings, you can see how the data is clustering. And if you have maybe some outliers, you can understand what is going on with those outliers with the model and see maybe you need to add that information to the dataset for training or maybe you need to remove that. You have a lot of advantages to use the platform. And at the end, you will save a lot of time and you can expedite the research process.

What makes Voxel51 able to do this? How is Voxel51 able to give me all these functions that without it I wouldn't have?
I think that FiftyOne is making a great job putting the top AI trends in the platform, basically grabbing the concepts for data management. And we are putting different concepts in functions. We have functions to curate data. For example, understand the labels that we have in the metadata. We can filter the dataset that we have just with those functions for curating and filtering. We can also explore the curated models that we have in the state of the art.

We can use those models and apply them to the dataset in order to find some blind spots that we have in the dataset. I think that the team has been doing a great job adding features every day to the platform, making the platform more intuitive and easier to use.

How immediate is it to work with?
It is super easy to work with the documentation. It's intuitive. We have videos, we have images, we have 3D point clouds.

Data can be very diverse. Does this work for health data? Does this work for sports data? Does this work with any data?
Yes. The beauty of this platform is that it is agnostic to the vertical or industry. We have projects in healthcare, oil and gas, retail, manufacturing, agriculture, LIDAR, autonomous vehicles… The company was born basically in an autonomous vehicles scenario.

What do you do in the autonomous driving field?
There are many things happening now. One of the main things that we have in autonomous vehicles is that we have multiple sensors to take the information of the car. You know, we have images, we have LIDAR information, we have Doppler information. With FiftyOne, it's possible to merge everything in the visualization. We are also launching a partnership that we have with NVIDIA. With NVIDIA, we can create this simulation, synthetic information, and put people, more people, cars, and also change the scenario. For example, if we are in summer, we can put the trees, green trees. If we are in fall, we can just put rain. Empty trees.

Data can be good, it can be bad. What happens if the user comes with a dataset which is quite poor, and tries to organize it with your tool? Does it improve their data? Or does it give them more understanding about how bad their data is?
That is a great question. Great, great question. Because normally we don't understand what kind of data we have and the quality of our data. We just trust in the protocols that we have to acquire data and believe that the protocols are okay. But then when we check, data is poor. And for sure, if you have poor data or bad data, you have bad models. So the models are not bad because they are dumb; they are bad because the data is bad. But when they use the platform, they can understand that they have bad data because we have a lot of indexes to qualify the quality of the data and the quality of the models.

What message do you want to convey at CVPR?
We have a huge announcement about verified auto labeling. The platform will help you to verify the labels, create the labels so you don't need to use the ground truth information. The new feature is called Verified Auto Labeling: a new workflow in FiftyOne that helps you to automatically label datasets with curated open-source models. You can make classification, detection, segmentation, and you can add your own label classes and validate predictions. And you can visualize also the confidence, the scores. You can filter poor predictions. You can boost the dataset quality before training. And there are three main things that make it a game changer. It will be 50x faster than manual annotation. It will be 1000x cheaper than human labeling. And it will have competitive mAP score gains versus the human-labeled data.

What else should we know?
There is a global community. I think that we have more than 50k community members. That is a good number. And we are doing a lot of Meetups and we have a lot of engagement with the community.
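The Verified Auto Labeling workflow itself lives in the product, but the general auto-labeling pattern Paula describes can be sketched with public FiftyOne primitives (the zoo model name and confidence threshold below are illustrative choices, not part of the announcement):

```python
import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

dataset = fo.load_dataset("my-dataset")  # assumes this dataset already exists

# Apply a curated open-source detector to generate candidate labels
model = foz.load_zoo_model("faster-rcnn-resnet50-fpn-coco-torch")
dataset.apply_model(model, label_field="auto_labels")

# Split confident predictions from the ones a human should verify
confident = dataset.filter_labels("auto_labels", F("confidence") > 0.7)
to_review = dataset.filter_labels("auto_labels", F("confidence") <= 0.7)
print(len(confident), "samples with confident labels;", len(to_review), "left to review")
```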

Meshcapade with Anica Wilhelm

Look at Anica: Meshcapade is able, with one monocular camera, to capture her movements and turn them into these figures - in real time. What a long way they have come! Congrats to Naureen, Michael and team! And great moves, Anica!

Kristen Grauman is a professor at the University of Texas at Austin. She spoke to me at CVPR 2025 in Nashville.

Kristen, what is your work about?
The work that I do is about video understanding. And in particular, I have a lot of interest in egocentric or first-person video, as well as the implications, kind of guiding problems for video understanding coming from the domains of augmented reality and robot learning.

There is a lot of computer vision in there.
Yes, that's right. Yeah.

So you have been doing computer vision for all your career until now?
That's right. I guess I got started as an undergrad, kind of exposed to computer vision in an early class at the time, started doing some simple research and getting more and more interested from there on.

OK, let's now disclose everything to the readers. This is my 10th CVPR. How many do you have? How many has it been?
Well, I guess it would be something like 24. I haven't counted them, but since 2001.

Wow, that's amazing. Tell us something that has not changed since 2001.
That's a great question. What hasn't changed about being at CVPR? This might surprise you, but I'd say even just the energy and the culture and the nature, despite the scale change and obviously the major tech advances in the meantime, that quality of the conference, I think, has been preserved. And so, yeah, I think that's something that's steady!

What occupies your time now?
Work-wise, it's about research and teaching, as well. What that looks like for me is spending time with my graduate students and coming up with the problems we want to address and going for it.

Read 160 FASCINATING interviews with Women in Science

Women in Computer Vision

And on the personal side, my life is a lot about my children and all the wonderful things I get to do with them.

What is the best part of your work?
I think the best part of my work is the constant, like the dynamic nature, the constant change, evolution, new things, things that you can care about. And then the other best part of my work or my job is the freedom that comes with it. What a gift or a privilege, to be able to guide your own path and follow curiosity and work with lots of talented people and on the university side, help them follow their journey!

What is the best thing that you will teach to your next students that you didn't know when you started to be a professor?
Yeah, part of me thinks it's all the little things. It's not that there's some epiphany that now I've had it and now I can pass it on to my students. I think it's everything, just that anyone who gets the chance to work with a colleague or mentor anyone or work with a student, it's all the little things. I think it's everything from habits in research, good practices, and what is the norm and how does that norm change, or what do we know how to do better for it that we accumulate.

Are you still passionate about research?
Yes, yes, of course, yeah.

What is the spark in there that you like?
I think there's the abstract aspect and there's the concrete. I think on the abstract, it's that I love the work that has perhaps an unknown end point and this ability to explore and forge and learn along the way. So that's abstract. I think in the concrete, one thing I'm very excited about right now is the future of what computer vision can do for Augmented Reality. And I really believe that, thinking of wearable AI, wearable computer vision is gonna change daily life activities and the quality of assistance we can get in daily life! And in particular, I'm motivated by, in our research, trying to enable skill learning through video and helping people and robots learn new skills in a more effective, efficient way.

What was the most difficult down moment in your career that you can share with us?

“I really believe that wearable AI is gonna change daily life activities and the quality of assistance we can get in daily life!”

The generic downs that we all have and maybe become easier and easier to take, but I think it's quite familiar to people in this CVPR context: the paper submission process and having your work be understood, and we all have failures such as rejected papers or rejected grant proposals. And I think it's a trick of the trade or a necessary skill that most of us probably can acquire pretty naturally, but it takes time. It's just like the thick skin and the resilience and the persistence. Like maybe the thing that can turn around any of that that I've most learned to do effectively is like that persistence, right? You have to have that!

What was the best wow moment that you had while listening to a student of yours?
Well, I'll mention two things. One, I think what I've been really impressed by and learning about increasingly is how, between my students, the way they support each other, technically or otherwise, and just seeing that blossom more and more, how it works and how it works best, is something that I'm learning from them. I think the other thing that I'm very touched by and that I've, as you say, I've learned is - and it's fresh in my head because every CVPR or so, we round up all of the former graduates, anyone who's here, and it's now stretching back, whatever it is, 18-ish years - I was telling them today, it's a lifelong relationship, me and them, and then with each other too. And I think learning at these reunion moments or graduations, the influence that I potentially can have in whatever small way or otherwise for their career, for their learning, I think that's another thing that I'm aware of. And it gets exposed, like the day to day, we all do it and we're all working hard and that's not so present, but then in these special moments, you get to remember that.

Can you imagine - it's a game, I know - what CVPR will look like 24 CVPRs from now?
Well, I'm sure we all are hoping or expecting that the growth won't be the same factor. I think what will keep happening now 10 years and say 24 is the closeness or the pulling closeness between adjacent fields on our own. We're already seeing that of course today, especially in robotics and language.

“What a gift or a privilege, to be able to guide your own path and follow curiosity and work with lots of talented people!”

And I think that is the thing that will continue, you know, in even more dramatic ways than what we have today, the boundaries breaking down between these sister or adjacent areas. And that'll change the problems, the way we think. I mean, it's already happening now, so this is an easy bet, right? But I think that's what we can see moving forward.

Maybe something like computer vision becoming one together with NLP.
Yeah, exactly. That kind of stuff. I mean, the problems are not so isolated as they once were years ago. And that's a big deal for what's gonna be possible in AI.

If you had a magic wand and could improve something in our community, what would you want to change to make it better?
I think the scale is challenging as far as sharing, communicating ideas thoroughly. We have the advantage and the blessing that the field is so vibrant and so big and moving so fast. But that comes with this challenge of adequate awareness of work and communication of that work. So yeah, I think a magic wand that could help us along with that or remove that: keep the scale, the vibrancy and the energy and the movement, but allow just scholarly process and the knowledge to scale in the same way. I think that's a real challenge, yeah!

You have many more years of career. What would you like to achieve that will make you say when you retire, I'm happy, I did it. What will make you retire with no regrets and say I did what I wanted to do?

The exceptional line-up at CVPR 2023’s Women in Computer Vision workshop panel: from left to right, Angel Chang, Kristen Grauman, Judy Hoffman, Ilke Demir, Abby Stylianou, Devi Parikh. Just wow!

“If I am able to continue being a supportive and effective advisor for students and watching them grow and seeing that change, I will be very happy!”

I mean, I'm very satisfied with the work I've had the opportunity to do even thus far. I think that's the right trajectory. If I am able to continue being a supportive and effective advisor for students and watching them grow and seeing that change, I will be very happy! I think on the technical side, if I can be some part of realizing this transformation into wearables and augmented reality and first-person vision and taking it to the next level for the technology that can change our lives. I don't need to be the one that changes our whole lives. But if our work moves towards that end in a way that I can see, that would be something I'd be very excited about as a researcher and also as a future user.

What about the bureaucracy of the work? There is a lot of bureaucracy and reviewing and doing things that are not so much research-oriented, but you have to do them anyway. Do you get used to this part of the work?
Yeah, I don't think it's that significant.

Some people hate writing grants.
Yeah, but that's not bureaucracy. That's thinking work. Yes, it's hard and it's so essential, so it's like do or die to get it right, but I mean, writing and thinking about how to formulate your agenda, I don't think of it as bureaucracy. It's real work!

On the debate about whether AI is dangerous for humankind or not: on which side do you stand, if any?
I don't think it's binary. I think that we can be well aware of the challenges or the pitfalls or the dangers at the same time we're aware of opportunities. I'm gonna leave it at that!

Read 160 FASCINATING interviews with Women in Computer Vision!

Computer Vision News
Publisher: C.V. News
Copyright: C.V. News
Editor: Ralph Anzarouth
All rights reserved. Unauthorized reproduction is strictly forbidden.
Our editorial choices are fully independent from IEEE, WACV, CVPR, ICCV and all conference organizers.

From the CVPR Expo

Antonio Tavera is a founder and CEO at Focoos AI, an Italian startup that works in the computer vision sector. He spoke to us at CVPR.

Antonio was still a PhD student when he started the company with colleague Fabio Cermelli and supervisor Barbara Caputo, two and a half years ago. Antonio started in domain adaptation applied to autonomous driving and aerial image analysis, Fabio in continual learning. They then took everything that they learned during their PhD research to found Focoos AI. They developed the technology that is behind the models provided in the platform: a Neural Architecture Search (NAS) that creates a custom computer vision architecture tailored to a specific area.

What they found at the time is that all computer vision models follow a trend of becoming bigger and bigger in terms of computational complexity. This also increases the energy consumption of the hardware. They designed a technology to move a step forward and create a new architecture that is optimized by design to run on specific low-power devices, while maintaining performance and speed. Focoos have a development platform that streamlines the whole development process of computer vision applications. But its particularity is not in being a development platform, but in the models that are provided inside the platform. These are models optimized with a technology designed to run complex computer vision applications on low-power edge devices: environments with limited computational resources. What is the most challenging thing in doing this project? “The first thing that was difficult in this project was to validate the technology with the potential partners and the commercial users,” is Antonio’s prompt answer. “And especially to find a specific market where to start. Because it is a technology that is agnostic to the computer vision sector, and can be applied to different kinds of sectors, so that's the first problem. But now we are focusing specifically on the smart city sector, on defense and security and surveillance.” The real users of the platform are senior users, the ones that are here at the CVPR conference, who can then train their own model and finally deploy it on the hardware infrastructure. Focoos supports them through every step of the process: from dataset management, to training, monitoring, model comparison, and finally deployment, either on the company’s cloud infrastructure or directly on local devices. Counting the founders and the advisor, Focoos are 13 people, based in Turin, Italy. We are curious to learn one last thing: how is it to work regularly with Barbara Caputo? “Barbara says every time that we are her children, her academic children,” Antonio smiles. “She gave us a lot of things not only from a technical point of view, but also from a business point of view. Because she teaches us how to speak to people, how to sell our product!”

From the CVPR Expo

Dor Dagan is Strategy and Business Development manager at Tiltan Software Engineering. Tiltan works in simulations and synthetic data, and their focus is on simulation. Their system is called TOPS.

TOPS is a physics-based simulation for any band that exists in the market, from SWIR to MWIR and LWIR sensors, which are spectrally validated. They also include in the simulation the actual physics of the sensor itself, which means that if it has optical issues and different behavior from the electronics and from the signal processing points of view, that is included in the data. They claim that their simulation is fully validated, including all those features and including atmospheric effects, time effects, location effects, and various external effects that you can add to the system, like dust and dirt. What makes them special, I asked Dor. “We actually simulate the physical properties of what is happening in the environment and in the sensor itself,” Dor explains, “which means that we minimize the gap between real and synthetic data, which is what mostly scares AI scientists and big data analysts about using synthetic data instead of real data. The accuracy of our systems is 95%!” Validation is a very major thing in computer vision and it is crucial for this business too. Their second product is Majestic AI. It is a data-as-a-service offering based on a simulation that they provide to scientists. They give validated data including scenes, different kinds of scenes, behavior analysis, temporal analysis, and the like. They pride themselves on scaling the development of AI models in such a way that it can be done much faster than with regular training, saving companies a lot of money. It also helps speed up the development and the go-to-market timing. Supported verticals mainly include outdoor operations, from the defense market to the homeland security market like firefighting, police, etc., and also commercial markets, autonomous driving, ADAS, drone inspection services for bridges, wind turbines, power plants, ambulances, Emergency Medical Services - hence a very wide spectrum of application fields. They can create validated data for very scarce events for which real data is not available. “We can create specific models that maybe no one has,” Dor boasts, “for example a car under a flood!”

Tiltan were focusing until now on local markets and are now expanding worldwide. This is why they came to CVPR: “We want the CVPR community to know what Tiltan is doing and approach us with the specific and special projects that they have, in order to speed up their development,” is Dor’s rationale, “so the AI scientists will know that we can give them high-end synthetic data for training and validation, and also simulation.” Tiltan are now 25 people, of which about 20 are scientists and engineers. “We also have our auto geomapping system,” Dor proudly concludes, “which is a 3D reconstruction system that works in near real time!”

Poster

Pablo Ruiz Ponce had the chance to present his work MixerMDM: Learnable Composition of Human Motion Diffusion Models to Michael Black. Pablo told us: “Michael is a researcher whose work I’ve long admired and learned from. It was both exciting and overwhelming to see him engage with my poster and ask questions. Thankfully, he was incredibly approachable and easy to talk to, so my initial nervousness quickly disappeared. Michael was particularly interested in our approach to synthetically increasing the diversity of generated human interactions, and we had an insightful discussion about the challenges of modeling realistic contacts between humans and the limitations of the data currently available.”

CVPR Monodepth Challenge
by Matteo Poggi

The Monocular Depth Estimation Challenge (MDEC) had its first edition in late 2022, when monodepth was working reasonably well, yet it was still far from being considered as mature as we could consider stereo already. In the three years between it and this fourth edition, things drastically changed, and this workshop evolved accordingly to keep pace with this revolution. A quiet revolution, as defined by Konrad Schindler through an inspiring analogy to weather reports, driven by small and consistent advances occurring one after the other, as well as by the much larger scale of data becoming available through the years. This has brought monodepth much closer to becoming a commodity, and already a core ingredient in some of the intriguing, higher-level applications shown by Yiyi Liao in her talk. Despite this steady progress, some challenges remain open and represent a fascinating opportunity to further advance the field, such as the possibility of scaling up to high-resolution images spotlighted by Peter Wonka and his cutting-edge works. We feel this fourth edition marked a line, further confirming the increasing excitement around monodepth and making both the participants and the organizers wonder about what is yet to come next.


At the Mecharithm Lab

On my way to Nashville and CVPR, I paid a long-due visit to the awesome Madi Babaiasl at her robotics lab at Saint Louis University. Here you can see me with different modalities, like an EEG headset and eye-tracking glasses, whose signals can be translated into robotic actions.

ACVSS 2025

The African Computer Vision Summer School (ACVSS 2025) was an exciting gig in Kigali, Rwanda, that united outstanding African students with leading computer vision experts: 10 intensive days, fully funded for participants, with 15 speakers, 8 lectures, 8 keynotes, 8 practicals, 4 mentoring sessions, a hackathon, a poster session, and probably a lot of fun! If you look carefully at the photos, you will find old friends of the magazine: Daniel Cremers, Vicky Kalogeiton and Raoul de Charette (thank you for the lovely photos!)
