Inside: Exclusive reviews of two Best Paper candidates! One of them was awarded the Best Paper Honorable Mention.
Dat’s picks of the day (Thursday)

Aloha from ICCV’25 in Hawai’i! I am Dat Nguyen, originally from Vietnam and currently a final-year PhD candidate at the University of Luxembourg, supervised by Djamila Aouada. Before starting my PhD, my early career focused on building AI systems for Autonomous Driving and Recommendation Systems. My doctoral research focuses on Deepfake Detection, an increasingly important field in the era of Generative AI. Specifically, I develop deepfake detectors that are robust and generalizable to unseen manipulation methods while also offering interpretability. Beyond performance, I explore how to design detectors that are lightweight yet strong, and how to evaluate them more precisely under realistic conditions. This is a crucial step toward making deepfake detection truly practical and trustworthy in the real world.

My picks for today, Thursday 23, among the orals:
5A-1 LaRender: Training-Free Occlusion Control in Image Generation via Latent Rendering ➔ Read the full review of this paper on page 12
5B-4 Counting Stacked Objects

At ICCV, I presented yesterday FakeSTormer, a generalizable deepfake video detection model designed to identify manipulated faces in videos! Unlike most existing approaches, FakeSTormer is trained exclusively on real videos and high-quality pseudo-fakes generated by a proposed video-level data synthesis. This makes it robust and independent of any specific deepfake generation technique that relies on face blending. The model employs a multi-task learning framework that explicitly guides it to attend to artifact-prone regions in both the spatial and temporal domains.
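As a rough, hedged illustration of that multi-task idea, here is a minimal sketch of how such a combined training objective could be written. It is not FakeSTormer’s actual code: the head names, tensor shapes and loss weights are assumptions made only for this example.

```python
# Illustrative sketch only, not FakeSTormer's implementation. It assumes a
# detector with three heads (video-level logit, per-frame spatial artifact map,
# per-frame temporal artifact score) and supervision derived from the blending
# masks produced by the video-level pseudo-fake synthesis.
import torch.nn.functional as F

def multitask_loss(outputs, targets, w_spatial=0.5, w_temporal=0.5):
    # outputs['logit']:       (B,)         video-level real/fake logit
    # outputs['spatial_map']: (B, T, H, W) predicted artifact heat-maps
    # outputs['temporal']:    (B, T)       predicted per-frame artifact scores
    # targets carries the matching labels/maps from the pseudo-fake synthesis.
    cls = F.binary_cross_entropy_with_logits(outputs['logit'], targets['label'])
    spatial = F.mse_loss(outputs['spatial_map'], targets['artifact_map'])
    temporal = F.mse_loss(outputs['temporal'], targets['artifact_score'])
    return cls + w_spatial * spatial + w_temporal * temporal
```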
Editorial

Aloha ICCV! Chee-hoo! What a great conference! In the words of Program Chair Richard Souvenir: “ICCV season begins with ambition and ends with perspective. In between are long nights, reviews, rebuttals, and plenty of opinions. It’s an imperfect process, but somehow, we end up with a remarkable set of papers and a community that keeps pushing the field forward (and maybe arguing a little along the way). And we get to do it in Hawaii. Mahalo to everyone who makes ICCV what it is.”

Enjoy the last day and see you soon at WACV and CVPR!

Ralph Anzarouth
Editor, Computer Vision News

Ralph’s photo above was taken in peaceful, lovely and brave Odessa, Ukraine.

ICCV Daily Editor: Ralph Anzarouth
Publisher & Copyright: Computer Vision News. All rights reserved. Unauthorized reproduction is strictly forbidden. Our editorial choices are fully independent from IEEE, ICCV and the conference organizers.
Spatially-Varying Autofocus

Yingsi Qin is currently a fifth-year Ph.D. student at Carnegie Mellon University, under the supervision of Aswin Sankaranarayanan and Matthew O’Toole. She is also the first author of this great paper, which was selected among the 13 best papers of ICCV out of more than 11,000 papers submitted. This interview was conducted before the ICCV 2025 awards were known: Yingsi and her team earned a fabulous Best Paper Honorable Mention! Ahead of her oral and poster presentations this afternoon, Yingsi tells us more about her work.

This work is about a new type of camera that can focus sharply everywhere on the sensor, for every pixel. Conventional cameras use a lens, which can only focus on one plane, one depth, at a time. For example, if I point a camera at a water bottle in front of me and focus on that bottle, the background is going to appear blurry; my kitchen over there is going to be blurry. And if I focus on my kitchen, then the object in front will be blurry. Of course, this is with the camera I have here, with a large aperture, but it is generally true for any camera with a sufficiently large aperture. The underlying reason is the depth of field of the lens. With any conventional camera today, the focus across the entire sensor is the same: if you focus half a meter away, then all the pixels focus half a meter away. The focus is a focal plane. This work introduces a new kind of camera that gives you not just a global focal plane but elevates focusing to another dimension. What if that focal plane could adapt to the three-dimensional structure of the scene? The focus would then no longer be a flat plane; it would have a shape that conforms to the scene geometry.
And what would that do? Having a focal surface that conforms to the scene geometry allows you to have any type of focusing across the sensor. For example, you can perform optical all-in-focus imaging, which has been studied extensively in the literature. Of the two most straightforward ways to get everything in focus, one is to use a small aperture: as you decrease the size of the aperture, your depth of field increases, so more of the depth range is in focus. But that comes with light loss, because the smaller the aperture, the less light you gather; and the smaller the aperture, the more diffraction blur you encounter.
The other method is focus stacking, the most common go-to approach for photographers: you step the focus across the depth range, capturing one photo at each setting, and then computationally fuse the in-focus parts across the sensor from the whole stack. So these two common approaches each have their own drawbacks. Yingsi’s method maintains a large aperture, so you don’t have to use a long exposure, and it does not suffer defocus blur even over an extreme depth range. You also don’t rely on computational post-processing to produce the in-focus result.

There are two parts to this work, Yingsi explains: “It’s a work that combines hardware and software. Two key innovations enable it. One is the optics, which is the camera itself; the optics enables us to have spatial control of focus. The other is the algorithm, which tells us what kind of control to put into the camera. For example, if I want the focal surface to conform to scene geometry, I need the depth map of the scene. The optics, the hardware of this work, enables us to perform all-in-focus imaging as long as we have this depth map. The algorithm is what gives us the depth map.”

This didn’t go without challenges. The first came when Yingsi was building the first iteration of the prototype, almost two years ago. It was very different from the current one: it used a totally different set of lenses and a different sensor, a machine vision sensor, with 50-millimetre lenses for the relay. She played around with that setup for a few months, but the 50-millimetre lenses turned out to produce too much chromatic aberration in the prototype. The other challenge was that the machine vision sensor
7 DAILY ICCV Thursday only to do contrast-detection autofocus (CDAF): for every iteration in the algorithm, you have to capture multiple images to land on the autofocus image because contrast detection autofocus relies on searching for the best focus instead of computing for the best focus. There is a lot of computer vision work to discover in this paper. First of all, all in focus imaging is computer vision. “I would say this one falls into the category of physics-based computer vision,” Yingsi adds, “where you use physicsbased ideas and models to enable new capabilities for a computer vision system. This camera is a computer vision system because it enables the machine to have vision, to see the world, to perceive more information. All in focus imaging itself is providing more information to the machine or to any computer at a single instant compared to conventional cameras, because conventional cameras would have blurry information at other depths. But all in focus imaging, you can have all seen in focus at the same time!” Yingsi feels very excited for this new technology because for the first time we can auto focus every object at the same time: “There's no camera that can do it today!”, she exclaims. Also in autonomous driving Yingsi Qin
it can have a lot of impact. Let Yingsi explain: “If I’m capturing the scene in front of the car and there’s a pedestrian walking by, any conventional camera is going to autofocus on that pedestrian. But then you lose focus on the street behind, the far street and the cars. That’s not desirable, because you want to know what’s happening at all times.” Also in microscopy, if you want to capture different layers of a thick tissue, you can image the multiple depths simultaneously; today you would need post-processing, and that is time-consuming. With this technology, you can have an arbitrary depth of field and an arbitrary shape for the focal surface, which means you can image things at different depths at the same time. Yingsi wants to add one more key point: “With our spatially varying focusing framework, any type of autofocus algorithm can be readapted to the spatially varying setting. We show examples with contrast-detection autofocus (CDAF) and phase-detection autofocus (PDAF), but follow-up research can go beyond that. You don’t have to stick to these two kinds of autofocus algorithms, although they are the mainstream today. You can use depth from defocus to produce the depth map; that’s one way. And there are also other kinds of contrast-detection autofocus algorithms, like
hill climbing. There are a variety of algorithms for autofocusing, and the key point is that all of them can be readapted for our framework. This means you don’t land on one single depth; you perform the autofocus for every pixel area, pixel region or superpixel at the same time!” To learn more about Yingsi’s work, visit Oral Session 6A: Physical Scene Perception (Exhibit Hall III) this afternoon from 13:00 to 14:15 [Oral 6] and Poster Session 6 (Exhibit Hall I) from 14:30 to 16:30 [Poster 74].
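To make that per-region idea concrete, here is a minimal, hypothetical sketch of contrast-detection autofocus evaluated per image tile rather than globally. It is not the authors’ implementation; the stack shape, tile size and sharpness metric are assumptions for illustration.

```python
# Hypothetical per-region CDAF sketch (not the paper's code): sweep the focus
# setting, score local sharpness in every tile, and keep the best setting per
# tile. The result is a coarse focal "surface" instead of a single focal plane.
import numpy as np
from scipy.ndimage import laplace

def per_tile_focus_map(focus_stack, tile=32):
    """focus_stack: (D, H, W) grayscale frames captured at D focus settings."""
    D, H, W = focus_stack.shape
    ty, tx = H // tile, W // tile
    focus_idx = np.zeros((ty, tx), dtype=int)
    for i in range(ty):
        for j in range(tx):
            patch = focus_stack[:, i * tile:(i + 1) * tile, j * tile:(j + 1) * tile]
            # Variance of the Laplacian is a standard contrast/sharpness score.
            sharpness = [laplace(patch[d].astype(float)).var() for d in range(D)]
            focus_idx[i, j] = int(np.argmax(sharpness))
    return focus_idx  # per-tile focus setting, i.e. a coarse focal surface
```

In the spirit of the interview above, such a per-region focus map could then drive the optics directly, rather than being used to fuse a captured stack in post-processing.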
Don’t miss the BEST OF ICCV 2025 in Computer Vision News of November. Subscribe for free and get it in your mailbox! Click here
LaRender: Training-Free Occlusion Control in Image Generation via Latent Rendering (Oral & Award Candidate)

Xiaohang Zhan is a Senior Research Scientist at Adobe and previously worked at Tencent. His paper, which has been shortlisted as a candidate for a Best Paper Award, introduces a new approach to controlling spatial relationships between objects in generated images. Ahead of his oral and poster presentations, Xiaohang tells us how the idea came about and what makes it different.

When Xiaohang began exploring how to control occlusion in image generation – deciding which object appears in front of another – he quickly identified the limits of current diffusion models. “Occlusion is a spatial relationship of objects rather than a semantic one,” he explains. “It’s not something that a text prompt can easily control.” LaRender proposes a method for generating images with precise occlusion relationships, eliminating the need for retraining or fine-tuning of the model. “We designed a very novel method using the principle of 3D rendering to generate the image in latent space,” Xiaohang tells us. “We use rendering to let the model understand the spatial relationship of objects. In this way, we don’t introduce any extra parameters or training modules, so the whole framework is training-free. We observed very good quality and very accurate control of occlusion.” The idea grew from a reluctance to rely on traditional data-driven methods. “When we consider controlling something in a model, we need to collect paired data,” he notes. “We manually annotate the relationships and use this paired data to fine-tune the model. That’s the typical way, but I think it’s a little bit boring. I wanted to find a way to perform this without any annotation, without any paired data, without tuning – and that’s hard.”
At first, there was no clear direction. “No one had done this before,” Xiaohang admits. “Occlusion is part of rendering. When we perform rendering on 3D scenes, it naturally contains occlusions. I started thinking about how to introduce rendering into a pre-trained diffusion model. I performed a lot of experiments and finally made it work. That was the most challenging part.”
It was also a gradual and highly iterative process. The research was conducted largely in his spare time, while his primary work focused on multimodal understanding. “I just find this topic very interesting,” he says with a smile. “I was focused on it with one of my interns, but neither of us had a lot of time. That’s one of the reasons why it’s training-free – we didn’t have time!” Early prototypes failed completely. “At first, we wanted to make a full 3D rendering inside the diffusion model,” he recalls. “But it’s really difficult to estimate a 3D shape inside a representation.” He continued simplifying it – switching to an orthographic camera, using 2D latent features instead of 3D shapes, but keeping the 3D layout and spatial relationships. At last, it worked. That experience taught Xiaohang a valuable lesson. “Sometimes you need to simplify,” he reflects. “Your first idea might seem appealing, but if it doesn’t work, you need to find the best trade-off among your ideas. You might need to sacrifice something and simplify it.” LaRender adapts principles of volumetric rendering to the “latent” level of a diffusion model, allowing the system to combine features according to physical rules of occlusion and transmittance. The research shows that it outperforms text-to-image and layout-to-image methods on occlusion-related benchmarks.
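The following is a minimal sketch of what volume-rendering-style compositing looks like when applied to latent features rather than colors, to illustrate the occlusion-and-transmittance idea described above. It is not LaRender’s actual algorithm: the per-object feature maps, opacities and front-to-back ordering are assumed inputs.

```python
# Hedged illustration of occlusion via transmittance in latent space, not the
# paper's code: composite per-object latent feature maps front to back, so that
# nearer objects attenuate the contribution of the objects behind them.
import torch

def composite_latents(features, opacities):
    # features:  list of (C, H, W) latent feature maps, nearest object first
    # opacities: list of (H, W) per-pixel opacities in [0, 1], same order
    C, H, W = features[0].shape
    out = torch.zeros(C, H, W)
    transmittance = torch.ones(H, W)              # fraction of "light" still unblocked
    for feat, alpha in zip(features, opacities):
        out = out + transmittance * alpha * feat  # occluded regions contribute less
        transmittance = transmittance * (1.0 - alpha)
    return out
```

Swapping the order of two objects in the list flips which one occludes the other, which is exactly the kind of control a text prompt struggles to express.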
Asked why he thinks the ICCV committee rated the work so highly, Xiaohang points to its originality. “The idea is different from any existing paper,” he observes. “It provides some new insight because it’s performed in a very different way. People can be inspired by this paper.” He also believes the method opens up broader possibilities. “I simplified a lot of things, but that doesn’t mean they wouldn’t work,” he adds. “Researchers can use this idea before simplification and extend it beyond occlusion control. 3D rendering can control many things, like lighting conditions, camera poses, and field of view. It opens up new directions for training-free image editing and generation.” That potential for expansion is what excites him the most. “I only implemented a small piece of latent rendering in occlusion control,” he remarks. “But people can use latent rendering to do a lot more. I think this is the most important contribution to the community.” For visitors to ICCV, Xiaohang plans to utilize both his oral and poster sessions to share ideas on how others can build upon his work. “In the poster, I can explain face-to-face and give some more ideas about how to follow this paper,” he reveals. “I’ll give them some hints about how to use latent rendering to do some very different things – and maybe make a best paper of their own!”
Since completing the research, Xiaohang has joined Adobe as a Senior Research Scientist. “When I was doing this work, I was in the transition from Tencent to Adobe,” he explains. “Now, I develop image-editing algorithms for Adobe products like Photoshop and Lightroom. I’m working on some very magic things, like generating or removing visual effects in images and helping with creative design for users.” To learn more about Xiaohang’s work, visit Oral Session 5A: Content Generation (Exhibit Hall III) today from 8:00 to 9:15 [Oral 1] and Poster Session 5 (Exhibit Hall I) from 10:45 to 12:45 [Poster 75].
LookOut: Real-World Humanoid Egocentric Navigation (Poster Presentation)

Boxiao Pan is currently a research scientist at Luma AI working on unified multimodal generation and understanding models. Shortly before this, he completed a PhD at Stanford with Leo Guibas, working mostly on human-scene and human-object interaction and understanding, on both 2D and 3D problems. Boxiao is also the first author of a very nice paper that was accepted as a poster at ICCV 2025. Ahead of his poster presentation today, Boxiao agreed to tell us about his work.
This work deals with the problem of egocentric humanoid navigation: given egocentric video, how can we predict a navigation trajectory for a humanoid robot, or for assistive policies that help people navigate? Boxiao basically asked this question: how close are humanoid robots to actually being deployed in the real world? The surprising answer is: not really. We are still quite far from it, and this paper wants to take one step towards making that a reality. Boxiao and his team approached the problem on several fronts. First, they made the problem statement closer to that reality. Most prior works study environments where obstacles are mostly static, or where no more than one or two people move in front of the robot, which is very far from reality! The authors decided to study dynamic environments in the real world. They collected data by walking in very busy streets, specifically going out to find streets and times with a lot of people and cars, the two primary dynamic obstacles in this study. The policy needs to find a trajectory that avoids both the static and the dynamic obstacles, just from egocentric video, which makes the problem very close to a real-world policy. And that is only the input side. On the output side, they want the policy to learn what they call human-like active information-gathering behavior, which corresponds to what humans normally do, like rotating our heads to look for useful information. For example, before we cross the road, we first look to the sides. “We want our robot,” Boxiao explains, “to learn these behaviors as well. So we specifically include such behaviors
in our data so that the policy can learn them. And the method is specifically designed for this purpose: to learn to aggregate the information in that egocentric video into the policy.” We can imagine that the team had to invent a few things from the ground
up, starting with the challenge of data. They needed data that, first, captures both the static and dynamic obstacles they care about; second, provides egocentric observations; and third, exhibits the human-like information-gathering behaviors, for example head turning. Boxiao and his co-authors were not able to find any dataset that has all three. There are a couple of close ones in autonomous driving, but they don’t have egocentric observations, or they are synthetic: they have a lot of data, but the dynamic obstacles are not very vivid or similar to real life. So they had to collect their own, and they ended up using Meta Aria glasses as the only collection hardware, which solved the data problem. The main novelty of this work lies in being a complete solution, a pipeline that brings humanoid navigation closer to the real world. But what makes Boxiao proudest? “The major thing I’m proud of,” he answers, “is that this entire pipeline didn’t exist before. This is very similar to some of my previous projects in which I had to come up with an entirely new solution, for the method, the data and the evaluation, because we were dealing with a new problem. I’m proud that I personally, as the first author, led the innovation and development of the entire pipeline.” To learn more about Boxiao’s work, visit Poster Session 6 (Exhibit Hall I) from 14:30 to 16:30 [Poster 24].
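For readers who want a concrete picture of the input-output setup described above, here is a purely illustrative sketch of a video-to-trajectory interface: a window of egocentric frames goes in, a handful of future waypoints comes out. It is not LookOut’s architecture; the encoder design, module sizes and waypoint count are made-up assumptions.

```python
# Illustrative interface sketch only, not the paper's model: encode a short
# window of egocentric frames and regress a few future 2D waypoints that a
# navigation policy could follow.
import torch
import torch.nn as nn

class EgoTrajectoryPredictor(nn.Module):
    def __init__(self, num_waypoints=8, feat_dim=256):
        super().__init__()
        self.frame_encoder = nn.Sequential(           # per-frame CNN features
            nn.Conv2d(3, 32, 5, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.temporal = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, num_waypoints * 2)  # (x, y) per waypoint

    def forward(self, frames):                        # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        f = self.frame_encoder(frames.flatten(0, 1)).view(B, T, -1)
        _, h = self.temporal(f)                       # aggregate over time
        return self.head(h[-1]).view(B, -1, 2)        # (B, num_waypoints, 2)
```

A real system would also have to predict the head-rotation (information-gathering) actions discussed above, which this toy interface leaves out.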
Posters and People

From left: Rana Hanocka, Richard Liu and Itai Lang from UChicago, presenting WIR3D yesterday at ICCV. Their abstractions are so good that we had to fly them all the way from Chicago to come and show you!

The Turkish Computer Vision community is growing every year! Have you ever heard of Hawaiian pide? Just like these 28 bright minds’ celebration of community, it is a taste of Turkey blended locally in Hawaii! Thanks to the awesome Ilke Demir for the photo ☺
UKRAINE CORNER

ICCV's sister conference CVPR adopted a motion with a very large majority, condemning in the strongest possible terms the actions of the Russian Federation government in invading the sovereign state of Ukraine.
Women in Computer Vision

Read 160 FASCINATING interviews with Women in Science

Miaomiao Liu is currently an associate professor at the Australian National University in Canberra, Australia, working as a faculty member with the School of Computing.

Miaomiao, what’s your work about?
I lead a group of six students. In my lab, my main research directions are 3D vision, geometric scene understanding, human motion understanding, and recently acoustic field learning. Our goal is to interpret the 3D world from images and let intelligent robots or AI agents understand or perceive the world like a human.

Why did you choose these subjects in particular?
Yes, for people working in 3D vision, we know that the output of the algorithm is quite… I would say ‘interesting’. So very interesting, it’s
like you can see the 3D model. After applying the algorithms that we developed, we can basically retrieve, produce or create the 3D model of what we see, or of the data captured from video sequences. For me it’s quite visual: you can play with the 3D model. Now that we have 3D printers, you can print the 3D model. You can digitize yourself, the person.

You are originally from China and you teach in Australia. How do you find yourself in this interesting mix?
Australia is a very nice place for work and also for life. Life there is not as hectic as in other places; people there are relaxed. At the academic level, you have more time, because things are quite slow. Canberra is relatively quiet and it’s called a bush-walking city. You have a lot of nature and hiking places you can go to during the weekend. It’s a small city and you can drive everywhere in 20 minutes, so normally we do not have traffic jams.

Very smart choice.
I graduated from the University of Hong Kong, a very busy city, so it’s quite different.

How different is it to teach Australian students versus teaching other students in general?
I think there’s not much difference in Australia. We teach undergraduate courses and master’s courses. Our international students are mainly from China, India, Sri Lanka… For the teaching style, it’s probably similar to other universities. It’s normally 12 weeks of teaching, and students will probably find that the workload is quite intense. After 12 weeks you have a break and then move on to the next semester.

Students probably interact more, compared to students in China.
Yes, they definitely ask direct questions. In China, probably... I’m not sure now, because I graduated a long time ago in China. Probably nowadays the young generation is different, but in our older days people were quieter in class. Now, if you have a question, you just put up your hand and ask. More interaction!

Since you have started to talk about the past, let’s reveal something to our readers. We met for the first time nine years ago at ECCV in Amsterdam. Tell me, how did Miaomiao change from the Miaomiao of 2016 into the Miaomiao of 2025?
Let’s see. I was working at NICTA in Australia, a research center called the National ICT Centre of Australia, where I was a researcher. The work in that research center did not have a teaching component; it was research-focused. I did not need to
apply for funding. In 2018, I moved on to join ANU, the Australian National University, as a faculty member. It’s different!

And you as a researcher, as a teacher, how did you change from that time? Are you the same?
Okay, I don’t think I changed much. Teaching was quite new to me when I joined ANU, so that is probably the different part. Like other faculty members at other universities, you need to spend time learning how to deliver the lectures in a way that the students can understand. So that is different from the research. In the past, before 2018, I mainly focused on research: working with colleagues, students, papers. I mainly focused on research excellence. Now I need to balance: as a faculty member in the university, I need to have both teaching excellence and research excellence.

We spoke about the present, we spoke about the past. Now let’s speak a little bit about the future. Where do you see yourself going? Are you going to stay in academia? Do you see yourself becoming an old professor?
At this stage, I do not have a plan to move to industry. But yes, as other colleagues are doing, it would be great to get more involved with local industries and also industries overseas. With my students and collaborators, we are aiming to develop techniques that benefit society. Yes, I will probably become an old professor! [smiles]

What would be a career dream for you? What are you dreaming of that you have not achieved yet?
That is a really good question.

Thank you. I came all the way to Hawaii to ask it.

Yes, so definitely, like every researcher, you would like to see more impactful work. It’s more than just papers; it’s about work that could generate more impact within society. Not only benefit industry, I mean, but improve other people’s lives. That would be great! For us in computer vision, especially 3D computer vision, the closest collaborative application field is robotics. I have some ideas in my mind. People are becoming older in our society, so how do we make people’s lives easier? Techniques like robots that help in people’s lives. I can see there are now many startups of this kind, especially in China. I was trying to look at how they help the robot to have the movements of a human. And now there are also cooking robots, those things called home service robots. That would be great! So I hope that the techniques we develop will generate that impact or benefit in the future.

So when you and I are old, there will be some robots to take care of us. Tell me, I have one last question for you. It’s about your motivation. What keeps your passion alive after many years in computer vision?
I think it’s curiosity. I like to know things that I don’t know! I like trying to know how things work and how we can create something that is much better. That is probably the motive for me, very simple, yes. That’s a lovely question! Be curious! I have been in this field for a really long time. I think the computer vision community, from my point of view, is quite healthy. I am an area chair at CVPR and I can see people who are quite senior, and they are still reviewing papers, attending conferences, and mentoring young researchers. I think that is really great! So we are doing great work. You can see the techniques develop so fast. This is what I like. People are still doing exciting things!