Full review of Best Paper and Honorable Mention Best Paper
December 2025
Enthusiasm is common, endurance is rare!
Best Paper ICCV 2025

Most generative AI techniques focus on generating things for the digital world: text, images, videos, and digital 3D models. But what Ava and her co-authors wanted to do was bring generative AI into the physical world, not the digital world. They wanted to generate objects that could actually be built from pre-made pieces in real life, and that, once built, would stand up and be stable. Ava Pun is a second-year PhD student at Carnegie Mellon University, under the supervision of Jun-Yan Zhu. With Kangle Deng - bottom left - and Ruixuan Liu - bottom right - Ava is also a shared first author of a great paper, Generating Physically Stable and Buildable Brick Structures from Text, which won the Best Paper award at ICCV 2025 ☺ Ahead of her oral and poster presentations, Ava told us more about this work. This interview was conducted before the ICCV 2025 awards were known. Yes, we guessed again.
In pursuit of that goal, they developed BrickGPT, a model that generates brick structures as a brick-by-brick list - structures made out of toy bricks, such as Lego bricks. And those structures, when built in real life, will stand up and not collapse. Theirs is a quest for stability and physical possibility. But what do we gain once a model can generate stable structures? If we can make a model generate stable 3D outputs, it could have a lot of applications in manufacturing, design, and architecture. For example, we could design custom furniture for someone with specific needs - maybe the furniture has to be lower than usual. Or maybe someone could design houses and buildings very quickly using generative AI techniques. Those houses and buildings, of course, will have to stand up. One challenge was how to determine whether something is stable, or will be stable, when it is built in real life. Naively, that means running a full physics simulation, which could be very time-consuming and resource-intensive. So the team built a physics model that is specific to these toy bricks and accounts for all the forces applied to each brick. An optimization technique then tries to make all the forces sum to zero. If that is possible, there are no net forces: the structure will not move and will not fall down - it is stable. Otherwise, it is unstable and it will fall down. “It's way more than just Lego!”
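To make that idea concrete, here is a minimal, illustrative sketch (not the authors' code) of a force-balance feasibility check. It uses a toy 1-D vertical stack of bricks, example masses, and scipy.optimize.linprog, all of which are assumptions made for this sketch; the force model described in the article also accounts for contact and friction between the bricks, which a real check needs in order to flag overhanging designs as unstable.

```python
# Minimal sketch: check whether non-negative contact forces can balance gravity
# for every brick in a vertical stack. Masses and the solver choice are
# illustrative assumptions, not the authors' implementation.
import numpy as np
from scipy.optimize import linprog

def stack_is_stable(masses, g=9.81):
    """Return True if a set of non-negative support forces puts every brick
    in static equilibrium (bottom brick is index 0)."""
    n = len(masses)
    weights = np.asarray(masses, dtype=float) * g
    # Unknowns: f[i] = upward contact force acting on brick i from below.
    # Equilibrium of brick i: f[i] - f[i+1] - weights[i] = 0, with no force
    # pressing down on the topmost brick.
    A_eq = np.zeros((n, n))
    for i in range(n):
        A_eq[i, i] = 1.0
        if i + 1 < n:
            A_eq[i, i + 1] = -1.0
    # Pure feasibility problem: any objective works; forces must be >= 0.
    result = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=weights,
                     bounds=[(0, None)] * n, method="highs")
    return result.success

print(stack_is_stable([0.1, 0.1, 0.1]))  # a plain vertical stack: True
```

If the solver finds such forces, the toy stack is declared in static equilibrium; if no valid set of forces exists, the structure would shift or collapse.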
We asked Ava: what makes it difficult to assess stability? Testing whether the structure is stable is not a straightforward problem, because existing physics simulators can't reliably simulate the contact points between the bricks. That's why they developed a customized physics-reasoning algorithm. They simplified the model and developed this custom algorithm, which accounts for all the physical forces that each brick experiences due to gravity, contact, and friction. Then, using this force model, they used an optimization algorithm to try and make all the forces sum to zero, which means that the structure would be in static equilibrium and won't shift or collapse - it is stable. Why should we come to both oral and poster presentations today? “Because this is a very cool project!”, is Ava’s confident reply. “This is one of the first times that people have tried developing generative AI for the physical world. And if you come to our talk and our poster, you will
see some of the cool structures generated by BrickGPT that were actually built by humans in real life. We also made a robot system that picks up the bricks, puts them together, and builds the structures in real life. We're planning to bring some real bricks to the poster and build our structures there, so you can see them standing up and touch them.” As awesome as it sounds, we want to know why, out of more than 11,000 submitted papers, this paper has come in the top 13. Ava’s guess is that the challenge they're trying to tackle is very applicable and very understandable to a lot of people. “Because everyone's played with Lego bricks before!”, she declares. “Everyone knows how important it is to actually make the structure stable and make it buildable in real life. And bringing generative AI out of the digital world and into the physical world is something that a lot of people haven't seen before and would probably like to see, because it's something that we all experience every day in the physical world!” The authors ended up running a user study where a bunch of people wrote prompts and submitted them, and the model would return instructions so that people could either build the structures themselves or send them to a robot, which would then build the results. “That was really cool! People liked it! They liked being able to take just a text prompt in their mind, send it to the computer, and then get this physical product that you could touch. That was really cool for me and everyone involved!” A funny detail: during the process of developing the model, the team came up with many attempts that didn't work so well and generated many images of chairs that were obviously not very good. Ava was kind enough to share a set of these - here it is below.
Ava is firmly convinced that this can open new directions. Taking the custom furniture generation example again, it is a harder problem than just making something stable, because someone has to sit on it and it has to be strong enough to hold their weight. Same thing for architecture: it has to be extremely stable. It's very bad if your chair collapses, but it's even worse if a house falls down. “These are definitely very exciting avenues to explore”, Ava remarks. “Even though our model is tested and trained on these Lego toy bricks, the project isn't really about just Lego bricks. The project is about extending this to general buildability and stability, trying to make generative AI that produces things that can work in the real world. It's way more than just Lego!” NOTE: this article was published before the announcement of the award winners, which explains why it does not mention being a winning paper. Once again, we placed our bets on the right horse! Congratulations to Ava and team for the brilliant win! And to the other winning papers too! Best Paper ICCV 2025!!!
ICCV Workshops

Michael Black speaking at the 1st Workshop on Interactive Human-centric Foundation Models. MIT undergraduates Yifan Kang and Dingning Cao present Doodle Agent, a multimodal LLM-driven system that explores how AI can doodle - selecting brushes, colors, and strokes to create expressive, emotion-guided artworks without explicit instructions - at the 2nd AI for Visual Arts Workshop [AI4VA].
Best Paper Honorable Mention

This work is about a new type of camera that can focus sharply everywhere on the sensor, for every pixel. Conventional cameras today use a lens, which can only focus on one plane, one depth, at a time. For example, if I point a camera at a water bottle in front of me and focus on that bottle, the background - my kitchen there - is going to appear blurry. And if I focus on my kitchen, then the object in front will be blurry. Of course, this is with the camera I have here, with its large aperture, but it is generally true for any camera with a sufficiently large aperture. The underlying reason is the depth of field of the lens. With any conventional camera today, the focus across the entire sensor is the same. So if you focus at half a meter away, then all the pixels focus half a meter away: the focus is a focal plane. This work introduces a new kind of camera that does not just have a global focal plane, but elevates focusing to another dimension. What if the focal plane could adapt to the three-dimensional structure of the scene? The focus would no longer be a flat plane; it would have a shape that conforms to the scene geometry. Yingsi Qin is currently a fifth-year PhD student at Carnegie Mellon University, under the supervision of Aswin Sankaranarayanan and Matthew O'Toole. She is also the first author of this fascinating paper, Spatially-Varying Autofocus, which was selected as the Honorable Mention Best Paper out of more than 11,000 papers submitted to ICCV 2025. Ahead of her oral and poster presentations, Yingsi told us more about her work. This interview was conducted before the ICCV 2025 awards were known. Yingsi and her team earned a fabulous Best Paper Honorable Mention!
And what would that do? Having a focal surface that can conform to the scene geometry allows any type of focusing across the sensor. For example, you can perform optical all-in-focus imaging, which has been studied extensively in the literature. The two most straightforward ways to get everything in focus are these. One is to use a small aperture: as you decrease the size of the aperture, your depth of field increases, so you can have more depth in focus. But that also comes with light loss. The smaller the aperture you use, the less light you have, and the more diffraction blur you encounter.
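To illustrate the tradeoff Yingsi describes, here is a small, hedged sketch based on the standard thin-lens depth-of-field approximation (not taken from the paper): stopping down the aperture widens the range of depths that appear sharp, while admitting less light. The focal length, focus distance, and circle-of-confusion values below are example numbers chosen for this sketch.

```python
# Approximate depth-of-field limits for a few apertures, using the common
# hyperfocal-distance approximation. All numbers are illustrative.
def depth_of_field(focal_mm, f_number, focus_m, coc_mm=0.03):
    """Return approximate near/far limits of acceptable focus, in meters."""
    hyperfocal_m = (focal_mm ** 2) / (f_number * coc_mm) / 1000.0
    near = hyperfocal_m * focus_m / (hyperfocal_m + focus_m)
    far = (hyperfocal_m * focus_m / (hyperfocal_m - focus_m)
           if focus_m < hyperfocal_m else float("inf"))
    return near, far

for f_number in (1.8, 4.0, 11.0):
    near, far = depth_of_field(focal_mm=50, f_number=f_number, focus_m=2.0)
    print(f"f/{f_number}: in focus from {near:.2f} m to {far:.2f} m")
```

The output shows the in-focus range growing as the f-number increases, which is exactly the small-aperture workaround, and its light-loss cost, that the new camera is designed to avoid.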
The other method is focus stacking, the most common go-to approach for photographers, where they stack the focus across the depth range. You capture one photo focused here, another focused there, another there, and then computationally fuse the stack so the in-focus regions across the sensor are combined. Both of these common approaches have their own drawbacks. Yingsi’s method maintains a large aperture, so you don't have to use a long exposure. It also avoids defocus blur even across an extreme depth range, and you don't rely on computational post-processing to produce the in-focus result. There are two parts that go into this work, Yingsi explains: “It's a work that combines hardware and software: the two key innovations that enable this work. One is the optics, which is the camera itself. The optics enables us to have spatial control of focus. And then the algorithm tells us what kind of control we put into the camera to enable this. So for example, if I want the focal surface to conform to scene geometry, I need the depth map of the scene. The optics, which is the hardware of this work, enables us, as long as we have this depth map, to perform all-in-focus imaging. The algorithm is what gives us the depth map.” This didn’t go without challenges. The first challenge came when Yingsi was building the first iteration of the prototype, almost two years ago. It was very different from this one. It used a totally different set of lenses and a different sensor, a machine vision sensor, and it used 50-millimeter lenses for the relay. She played around with that setup for a few months. But then the 50-millimeter lenses she was using turned out to produce too much chromatic aberration in the prototype. The other challenge is that the machine vision sensor allows
only contrast-detection autofocus (CDAF): for every iteration of the algorithm, you have to capture multiple images to land on the in-focus image, because contrast-detection autofocus relies on searching for the best focus instead of computing it. There is a lot of computer vision to discover in this paper. First of all, all-in-focus imaging is computer vision. “I would say this one falls into the category of physics-based computer vision,” Yingsi adds, “where you use physics-based ideas and models to enable new capabilities for a computer vision system. This camera is a computer vision system because it enables the machine to have vision, to see the world, to perceive more information. All-in-focus imaging itself provides more information to the machine, or to any computer, at a single instant compared to conventional cameras, because conventional cameras have blurry information at other depths. But with all-in-focus imaging, you can have the whole scene in focus at the same time!” Yingsi feels very excited about this new technology because, for the first time, we can autofocus on every object at the same time: “There's no camera that can do it today!”, she exclaims. Also in autonomous driving
it can have a lot of impact. Let Yingsi explain: “If I'm capturing the scene in front of the car and there's a pedestrian walking by, any conventional camera is going to autofocus on that pedestrian. But then you lose focus on the street behind, the far street and the cars. That's not desirable, because you want to know what's happening at all times.” Also in microscopy, if you want to capture different layers of a thick tissue, today you have to image each depth separately and rely on post-processing, which is time-consuming. With this technology, you can have an arbitrary depth of field and an arbitrary shape for the focal surface, which means you can image things at different depths at the same time. Yingsi wants to add one more key point: “With our spatially varying focusing framework, any type of autofocus algorithm can be re-adapted to the spatially varying framework. We show examples with contrast-detection autofocus (CDAF) and phase-detection autofocus (PDAF). But for follow-up research, you can go beyond that. You don't have to stick to these two kinds of autofocus algorithms, although they are the mainstream today. You can use depth from defocus to produce the depth map. That's one way. And there are also other kinds of contrast-detection autofocus algorithms like
hill climbing, and there is a variety of algorithms for autofocusing. The key point is that all of them can be re-adapted to our framework, which means you don't land on one single depth: you perform the autofocus for every pixel area, pixel region, or superpixel at the same time!” NOTE: this interview was conducted before the announcement of the award winners, which explains why it refers to the paper only as a Best Paper award candidate.
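To make the per-region autofocus idea Yingsi describes above more concrete, here is a minimal, hypothetical sketch (not her implementation) of spatially varying contrast-detection autofocus: given frames captured at different focus settings, score the sharpness of each image region in each frame and keep, for every region, the focus index with the highest contrast. The synthetic random frames, the region size, and the Laplacian-variance sharpness score are all assumptions made for this sketch.

```python
# Per-region contrast-detection autofocus sketch: pick, for each region,
# the focus setting that maximizes a simple sharpness score.
import numpy as np

def laplacian_variance(patch):
    """Simple sharpness score: variance of a 4-neighbour Laplacian."""
    lap = (-4.0 * patch[1:-1, 1:-1]
           + patch[:-2, 1:-1] + patch[2:, 1:-1]
           + patch[1:-1, :-2] + patch[1:-1, 2:])
    return lap.var()

def per_region_autofocus(frames, region=32):
    """frames: array of shape (num_focus_steps, H, W).
    Returns a (H//region, W//region) map of best focus indices."""
    n, h, w = frames.shape
    focus_map = np.zeros((h // region, w // region), dtype=int)
    for i in range(h // region):
        for j in range(w // region):
            scores = [laplacian_variance(
                frames[k, i*region:(i+1)*region, j*region:(j+1)*region])
                for k in range(n)]
            focus_map[i, j] = int(np.argmax(scores))
    return focus_map

# Toy usage: 5 focus settings, 128x128 frames of random noise stand in
# for a real focal sweep.
frames = np.random.rand(5, 128, 128)
print(per_region_autofocus(frames))
```

In a real system, such a per-region focus map would be what drives a spatially varying optical element, rather than a single global lens position.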
No Comment
Public Identity Management

Tal Hassner (left) is a co-founder and CTO of a young startup company called WEIR AI. Tal and co-founder and CEO Gary McCoy (right) hope to have their product out very soon. Tal was formerly an associate professor back in Israel and a visiting research associate professor at USC. We asked him to tell us: Who owns your face?
WEIR AI with Tal Hassner

Tal spent much of his tech career working on face recognition products at AWS and Facebook, particularly face recognition in the wild. Around 2020 there started to be a worldwide increase in the enforcement of privacy regulations, which is a good thing per se, but it led to the eventual shutdown of, and stepping away from, face recognition products. It became touchy and very expensive. At some point Big Tech decided, and we can understand why, to take a step back. That left a gap, because there are very important and sensitive use cases that would be very difficult to solve without having a way to recognize people. Tal tells us of real-life examples, like when your name, your image, or your likeness is being misused. Someone's taking your face and doing something with it that is harmful to you, including spreading misinformation, harassment, and ransoming you.
According to a 2024 paper with a Google co-author, across the 10 countries they surveyed, 2.2% of surveyed people said they fell victim to harassment using non-consensual, synthetic, intimate imagery. If you think of 2.2% of the American population, it's millions! “Now why does that happen?,” Tal asks. “There are laws against that, that's a crime. But you cannot enforce it without knowing, without scanning the image and looking for people who are not supposed to be there!” If you were an advertiser and wanted to use some famous person's likeness in an ad, you could just download the photo from Google Images, add your logo, and put it on one of many ad platforms. It would then be seen by millions of people and nothing is there to stop it, because in order to stop it, you need to know that that person is there, by running face recognition. And that is a major regulatory problem. This was a problem long before generative AI - since Photoshop, actually. WEIR AI is a research company that is developing new technology. Its goal is not to do face recognition.
Rather, it is to do something that on the one hand respects privacy-related regulations, and on the other hand provides the same sort of technical solution, or answers the same sort of need, that was previously satisfied using face recognition technology. “Look, today face recognition is commoditized,” Tal explains. “It's generally speaking a solved problem. You would be hard-pressed to find an image with faces in it that would be too difficult for modern state-of-the-art face recognition systems to recognize. Take some off-the-shelf library that does face recognition. The thing is, the way that those pipelines work violates privacy regulations. Who owns your face? You think you do. But others think that they own it too!” Tal is not a lawyer and this is not legal advice, only his understanding. WEIR.AI’s goal is to avoid taking people's fingerprints, but still be able to answer questions such as who is in that photo, and prevent those cases of name, image, and likeness (NIL) misuse. Tal and co-founder Gary McCoy hope to have their product out very, very soon. “The frontier today is being able to redevelop these technologies,” Tal declares, “but in a way that's responsible, that doesn't violate people's privacy, that treats people equitably. This is why, by the way, we are not an LLC, we are a PBC, a Public Benefit Corporation, which means it's an organization that reserves the right to make decisions based on our mission, not based on what our shareholders want us to do.” To give you an example of just how difficult this problem is, the state of the art in addressing it right now is perhaps exemplified by what the actor Tom Hanks did, being fed up and frustrated with the abuse of his image and the absence of a response. He went on his own social media to tell people: look, if you see a picture of me selling something, it's not me. “Not many people have that ability,” Tal admits. “And even that, I don't know how effective it is. That's fantastic. He demarketed his own image!” But concretely, what is WEIR.AI going to provide as a company or as an organization? “A product that solves these problems,” is Tal’s answer, “based on new technology that we're developing that is an alternative to face recognition, built from the ground up. I'd love to tell you about the technology, but obviously I can't! Anyway, the case is not technological. It is regulatory. We're developing new technology in order to provide the solutions and help with these pain points.”
Meshcapade with Anica Wilhelm
Deepti Ghadiyaram

Deepti Ghadiyaram is an assistant professor at Boston University. Deepti, what is your work about? My students and I focus primarily on understanding how different large-scale and small-scale language models, vision-language models, and vision models assimilate information and how they respond to different reasoning tasks. The goal is that, by understanding how these models work, we will be able to build better, more generalizable algorithms. When did you guys start to work on this? I started at BU in July of 2024. That's when I recruited students, and together all of us have been pursuing this research agenda. Why did you choose this subject? I've been interested in the field of computer vision and in building models for a very long time. I spent over five years at Facebook, and there we had an opportunity to work on very large-scale models. But it was always the failure modes that fascinated me: how can a model that can do such difficult tasks so well fail or struggle on such benign, slightly tweaked inputs? That question has always intrigued me. Probably because it's a probabilistic model and not deterministic or something like that, right? Correct, yeah. Understanding that and resolving those failures has been something that grew organically out of working on more and more models. Tell me about your Facebook years. Yeah, I joined a Facebook applied research group right after I graduated. Some of the work included building very large-scale video understanding models. They were deployed in different video products on Facebook and Instagram for content moderation, activity detection, etc. And then I led an effort on building safer and more responsible models. After that I spent a year at Runway, a generative AI company. That gave me an opportunity to get exposed to how to build generative image and video models. Read 160 FASCINATING interviews with Women in Science
Women in Computer Vision

I'm very curious, did you see some of the features that you developed in the real products? Did you ever look at Facebook or Instagram and say, hey, I did this! Yeah, I think some of the algorithms that we built were powering some of the features on Instagram. I did get a chance to see it when we were able to recommend hashtags suggesting the action happening in a video. That would tell me, okay, this is powered by a model that my group and my team have worked on. Who chose the team that’s working with you? I currently have a group of five PhD students and a few masters and undergraduate researchers. They're all very bright, hardworking, and just brilliant at what they do. As an incoming faculty member, I think it's a choice both of us have to make, both my students and me. I do consider myself lucky that they trusted a new faculty person. As far as I remember, Facebook and also Yann LeCun, whom I have interviewed a couple of times, are very much in favor of dual affiliations, like working for one company and at the same time teaching at a university. What is your take on that? Yeah, I think it is a hard balance to strike, especially if you are on the tenure track and there are lofty pursuits. But I've seen a lot of colleagues of mine do that very efficiently. Why did you want to leave Facebook? It was time for a change into a faster pace. We spoke a little bit about the near past. Can we speak of the time before? What happened in India? I grew up in a very close-knit family. My parents are both educated and working in the service sector. It was very strongly instilled in me and my brother that education is our source. It is our strength. It is where we can derive our power. We took education very seriously. I did my undergraduate degree there and then was very interested in pursuing research. I applied for a master's and joined UT Austin. Then my master's thesis advisor, Alan Bovik, was very encouraging and offered that I stay and pursue a PhD under his supervision. “It is a cost to pay. To quit the family, to quit the friends, to quit your country!”
Did you have to leave India for that? I did have to leave. There are opportunities there as well. There are very good premier institutions in India. But my friends who pursued their master's here in America, and my brother's friends, have always told me how unique the system here is. The flexibility in being able to select the courses that you want to do, pick an advisor if they're interested… So it's a cost to pay? It is a cost to pay. To quit the family, to quit the friends, to quit your country. Are you going to live in India again one day? I have never done long-term planning. It's funny you say that, because my family was visiting me and I made the exact same comment to my mom, about how a tiny dot of a decision many years ago had such a profound impact on my life and, of course, consequently their lives. No one saw this trajectory. It is true for many of us. How do you live with this? I have a three-year-old and most of my non-professional world revolves around her. I take this idea of making the tiny moments that I spend with her very special. I keep a note so that I do remember these are the tiny moments that are shaping the life we have together. What would be a career dream to achieve? I've drawn so much inspiration from Kristen Grauman, because for the past 15 years she was also at UT. She was a faculty member there when I was a student, and there weren't a lot of women anywhere where I grew up, so she became a role model for me. She continues to be that way. On the professional front, I aspire to hopefully instill good research skills in my students and to continue doing meaningful research which is fun but also hopefully adds some benefit to society. Personally, I just want a fulfilling relationship with my daughter and family. That's very important.
Kristen told me this year that it was her 25th CVPR... Isn't that phenomenal? It is phenomenal indeed. I am still only at 10. Tell me one scientist from the past that you particularly admire. I read a lot of stories about Marie Curie, how much she persevered and how much she had to fight societal biases, and yet she pursued and persisted. I had a book of stories about different scientists, and this was a story I read and reread when I was young. I had completely forgotten about that until you asked me this question. I'm happy to remind you. Maybe you'll find the book somewhere. My dad saved everything, so I'm sure it's there. What is the most interesting thing that you have learned from a student of yours? I like that question a lot. I will like your answer a lot. I said this to a student recently - without revealing more details, she has been going through some personal crisis. All I could offer her as a mentor was checking in, emotional support of some sort, and also the flexibility to go visit her family whenever needed. But then I did learn through her, again, how to persevere and continue working hard, and somehow be able to compartmentalize and push the research. That's an amazing muscle for such a young graduate student. Is there anything in your career until now that, if possible, you would change? No, I don't think so. As we mentioned earlier, these tiny dots of decisions brought us to where we are, and I do own every dot, every choice, so I wouldn't go back and edit. Your message to the community. A recurring question that I've heard people ask is how to choose what to work on, and that's a question that I grapple with. Maybe what I tell myself is: pick what interests you, what fascinates you, and as long as you're having fun and you're learning something, it is likely valuable! Read 160 FASCINATING interviews with Women in Computer Vision!
My First ICCV - Elena

Elena Bueno-Benito is a PhD student at the Institut de Robòtica i Informàtica Industrial, CSIC-UPC, in Barcelona, Spain. “It’s my first time at ICCV,” she told us, “where I’m presenting our latest work, CLOT: Closed Loop Optimal Transport for Unsupervised Action Segmentation. After a long journey to Hawaii and with a research stay in Tokyo ahead, I’m starting to think the universe really wants me to close the loop ;) ”
My First ICCV - Lilika

Lilika Makabe is a PhD student at the University of Osaka, Japan. “I presented my poster "Spectral Sensitivity Estimation with an Uncalibrated Diffraction Grating" on the final day of ICCV. It was a really fun and inspiring experience! My research focuses on physics-based vision, including 3D reconstruction and reflectivity estimation. I'm currently looking for academic or industry positions worldwide starting next year after graduation.” Reach out or check her homepage - Lilika is a catch!!!
Posters

Ivan Martinović, PhD student at the Faculty of Electrical Engineering and Computing, University of Zagreb, enthusiastically presenting what it takes to make open-vocabulary segmentation truly open, accepted as a poster at the "What is Next in Multimodal Foundation Models?" workshop. When the object mutates its frame - Danda Paudel, faculty at INSAIT, Sofia University, presents ObjectRelator, a highlight paper exploring how objects transform across ego- and exo-centric perspectives to bridge first- and third-person visual understanding.
Posters and People

From left: Rana Hanocka, Richard Liu and Itai Lang from UChicago, presenting their oral WIR3D at ICCV 2025. They've got such good abstractions that we had to fly them all the way from Chicago to come and show you! The Turkish Computer Vision community is growing every year! Have you ever heard of Hawaiian Pide? Just like these 28 bright minds’ celebration of community at ICCV 2025, a taste of Turkey blended locally in Honolulu! Thanks to the awesome Ilke Demir for the photo ☺
MICCAI Challenge

Trackerless 3D Freehand Ultrasound Reconstruction (TUS-REC) Challenge

Qi Li is a final-year PhD student in Medical Imaging at the Hawkes Institute, University College London, supervised by Yipeng Hu and Tom Vercauteren. Over the years, Qi has made a long journey through the world of trackerless 3D freehand ultrasound reconstruction. More recently, she has ventured into uncertainty quantification in image segmentation. If you share the same interest and ambition, talk to her now while she is still on the postdoc market, especially if your opportunity is near London! As the lead organizer of the TUS-REC Challenge series, Qi accepted our invitation and is very excited to share how this initiative has advanced the field by providing high-quality benchmarks, fostering collaboration, and accelerating progress in trackerless freehand ultrasound reconstruction.

What is the TUS-REC Challenge: Held at MICCAI in 2024 and 2025, the TUS-REC Challenge is the world’s first publicly available international benchmark for reconstructing 3D ultrasound volumes without any tracking hardware - no bulky, expensive optical or electromagnetic trackers. Instead, participants must reconstruct a full 3D ultrasound volume by “stitching together” a stream of 2D ultrasound slices, inferring how the probe moved in space based solely on what each ultrasound frame shows. It’s a bit like navigating a plane without radar - relying only on what you can see to work out how you’re moving.

Why it matters: Imagine trying to take a panoramic photo with your phone, but the phone doesn’t record how you moved it. You’d end up with lots of separate pictures, but no way to stitch them together into a smooth, full scene.
That’s exactly the challenge with freehand ultrasound. In many hospitals, especially those with limited resources, clinicians use simple handheld ultrasound probes: affordable, portable, and easy to carry anywhere in the clinic or even to remote communities. These probes capture 2D slices of the body as the clinician moves their hand, but without a tracking device to record the probe’s position and orientation, turning those slices into a 3D image is extremely difficult. It’s like trying to build a 3D puzzle without knowing where each piece came from. Trackerless freehand ultrasound reconstruction aims to solve this problem. It takes ordinary, inexpensive ultrasound scans and uses clever algorithms to estimate how the probe moved, making it possible to reconstruct a full 3D picture of organs or tissues. This could make 3D ultrasound far more accessible around the world, reducing the need for expensive equipment and giving clinicians richer information for diagnosis. However, despite decades of research, progress has been slow. One major reason is that researchers haven’t had shared datasets or standard benchmarks to compare methods fairly. Each group uses different data, making it hard to tell whether new techniques are genuinely better. TUS-REC changes this. It is the first open, standardised benchmark specifically for trackerless ultrasound reconstruction. It provides a common platform where researchers can test and compare their algorithms on the same data, helping the field move forward more quickly and reliably. In short, trackerless freehand ultrasound could bring advanced imaging to places that need it most, and TUS-REC helps the community finally build and compare the tools needed to make that vision real - a vision shared by Qi and her team. Setup for freehand ultrasound data acquisition.
Illustration of three coordinate systems in the task: image, tracker tool, and camera (or world) coordinate systems.

What they did: 1) published a large, ethically approved in vivo ultrasound dataset comprising over 2,000 scans (more than 1 million frames) collected from the forearms of 85 diverse volunteers, with synchronised pose data captured by a high-precision optical tracker; 2) developed comprehensive step-by-step tutorials that help the public and the wider research community understand the basics of this task, accompanied by easy-to-use modular baseline code, allowing anyone interested to get started quickly; 3) established a standardised benchmark for fair comparison across methods in this field, supported by rigorously defined metrics and a transparent ranking scheme, contributing to the growing need for a standard in this field; 4) released the challenge report paper with a comprehensive background introduction and literature review, describing the challenge design and dataset and presenting a comparative analysis of the submitted methods across multiple evaluation metrics.
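As a minimal sketch of the core bookkeeping behind trackerless reconstruction, suppose a model predicts the rigid transform between each pair of consecutive 2D frames; chaining those relative transforms gives every frame's pose in the coordinate system of the first frame, which is what lets the slices be placed in 3D. The 4x4 homogeneous matrices and the toy 1 mm translation below are assumptions made for this illustration, not the challenge baseline.

```python
# Chain predicted frame-to-frame rigid transforms into absolute frame poses.
import numpy as np

def to_homogeneous(rotation, translation):
    """Pack a 3x3 rotation and a 3-vector translation into a 4x4 matrix."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def accumulate_poses(relative_transforms):
    """Chain frame-to-frame transforms into absolute poses.
    relative_transforms[i] maps frame i+1 into frame-i coordinates."""
    poses = [np.eye(4)]  # frame 0 defines the reference coordinate system
    for T_rel in relative_transforms:
        poses.append(poses[-1] @ T_rel)
    return poses

# Toy usage: pretend the probe translated 1 mm along z between frames.
T_rel = to_homogeneous(np.eye(3), [0.0, 0.0, 1.0])
poses = accumulate_poses([T_rel] * 4)
print(poses[-1][:3, 3])  # frame 4 sits 4 mm from frame 0 along z
```

Errors in each predicted transform accumulate along this chain, which is one reason reconstruction quality tends to degrade on longer scans, as noted in the challenge results below.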
What was the outcome: By the submission deadline, 101 participants (43 teams) had registered for the TUS-REC2024 Challenge, representing both academia and industry across 14 countries. Six teams, comprising 25 participants, submitted their algorithms (21 valid dockerized solutions) for final evaluation. The submitted methods span a wide range of approaches, including state space models, recurrent models, registration-driven volume refinement, attention mechanisms, and physics-informed models. The dataset has been downloaded over 2,300 times to date. All data and code are publicly available to facilitate ongoing development and reproducibility. As a live and evolving benchmark, it is designed to be continuously iterated and improved. The 2025 edition includes more challenging data and continues to grow in scale and impact, drawing 141 individuals (59 teams) across 16 countries so far, with 23 solutions from 7 teams. Key Results & Take-aways: 1) Trackerless reconstruction really works. Multiple teams were able to produce convincing 3D ultrasound volumes without using any external tracking device. This shows that the idea is not just a theoretical concept, but an achievable task under controlled acquisition protocols. 2) Different methods have different strengths. There is no single “winning paradigm” yet: each type of algorithm performed well on some metrics and less well on others. 3) There is still room for improvement. Our analysis showed that reconstruction quality tends to worsen in longer scans, and current models still struggle to handle a wide variety of scanning patterns. Improving robustness and generalisation remains an important challenge. We welcome interested readers to check out the challenge website and contribute to future developments. Huge thanks to all co-organisers and participants, whose energy and hard work made the challenge possible.
NeurIPS Paper

Gaia Di Lorenzo recently graduated from ETH Zurich with a Master’s in Computer Science (major in Machine Intelligence), and her thesis - Object-X - has been accepted to NeurIPS 2025, an exciting milestone at the start of her research career. She has now joined NVIDIA, where she works on AI Agents, focusing on how intelligent, embodied systems can understand, interact with, and reason about the world. Congratulations, Gaia! In 3D vision, objects are often represented through point clouds, meshes, or neural fields such as NeRFs or 3D Gaussian Splatting (3DGS). These representations achieve strong results, but they are far from lightweight. When dealing with thousands of objects - a realistic scenario in robotics, AR/VR, simulation, or large-scale scene understanding - storage and computation quickly become major bottlenecks. Object-X, developed at ETH Zurich, introduces a more scalable alternative: object-centric embeddings that are both compact and decodable. Each object is reduced to a single, fixed-size vector that requires 3-4 orders of magnitude less storage than common 3D formats, while still retaining enough information to reconstruct the object with high visual fidelity. The pipeline begins by canonicalizing each segmented object and voxelizing it into a regular 3D grid. Using posed images, multi-view features are then projected into the voxels, creating a dense 3D field of learned signals. A 3D encoder compresses this representation into SLat, a structured latent that organizes geometry and appearance. SLat is then further compressed into the U-3DGS embedding, a compact vector that is easy to store, share, or index. From this tiny embedding, Object-X can decode a full 3D Gaussian Splatting model, effectively turning a small vector back into a high-quality 3D object ready for rendering, mesh extraction, editing, or alignment. One of the strengths of Object-X is that this embedding is not only compact but also versatile. Through
multi-task training, the same representation can support retrieval, localization, and scene alignment. These additional signals, such as text features or scene-graph context, enrich the embedding without compromising its ability to decode back into accurate 3D geometry. In practice, this means a single vector can both describe how an object looks and help reason about where it belongs. Object-X matters because it makes scalable 3D understanding much more practical. Large libraries of objects become manageable, and each object becomes a portable, modular unit that can be inserted into scenes, compared across datasets, or updated with minimal overhead. More broadly, it represents a shift toward object-centric 3D representations, a direction that simplifies existing pipelines and opens new possibilities in robotics, AR/VR, digital twins, and generative 3D systems. It points toward a future where interacting with 3D objects is as efficient and flexible as working with today’s learned embeddings. Despite its compact size, Object-X reconstructs detailed geometry and appearance comparable to classical 3D representations. Object-X compresses each object into a small, multi-modal embedding that can be fully decoded into a 3D Gaussian Splatting model.
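To give a feel for the encode-then-decode idea described above, here is a minimal, hypothetical PyTorch sketch: voxelized multi-view features are compressed into one fixed-size embedding per object, and a decoder maps that embedding to a set of 3D Gaussian parameters. The layer sizes, the number of Gaussians, and the 14-parameter Gaussian layout are assumptions made for this sketch; the actual Object-X architecture and its SLat intermediate are considerably more involved.

```python
# Toy encoder/decoder: voxel features -> compact object embedding -> Gaussians.
import torch
import torch.nn as nn

class ObjectEmbeddingSketch(nn.Module):
    def __init__(self, feat_dim=32, embed_dim=256, num_gaussians=1024):
        super().__init__()
        # Encoder: 3D convolutions over the voxel grid, pooled and projected
        # down to a single compact vector per object.
        self.encoder = nn.Sequential(
            nn.Conv3d(feat_dim, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )
        # Decoder: expand the embedding into per-Gaussian parameters
        # (3 position + 3 scale + 4 rotation + 3 color + 1 opacity = 14).
        self.decoder = nn.Sequential(
            nn.Linear(embed_dim, 512), nn.ReLU(),
            nn.Linear(512, num_gaussians * 14),
        )
        self.num_gaussians = num_gaussians

    def forward(self, voxel_features):
        embedding = self.encoder(voxel_features)    # (B, embed_dim)
        gaussians = self.decoder(embedding)         # (B, N * 14)
        return embedding, gaussians.view(-1, self.num_gaussians, 14)

# Toy usage: one object voxelized into a 32^3 grid of 32-dim features.
model = ObjectEmbeddingSketch()
voxels = torch.randn(1, 32, 32, 32, 32)
embedding, gaussians = model(voxels)
print(embedding.shape, gaussians.shape)  # (1, 256) and (1, 1024, 14)
```

The point of the sketch is the storage argument: the object travels as the small embedding vector, and the heavier Gaussian representation is only materialized when it is needed for rendering or editing.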
WACV 2026 Paper Preview

An amazing visual showing Vida Adeli’s doctoral work, GAITGen, already accepted at WACV 2026. Co-author Andrea Iaboni posted that the most striking part of this work is that six clinical experts (herself included) blindly reviewed real and AI-generated gait videos and couldn’t tell them apart. They guessed correctly only half the time, no better than chance.
GAITGen by Vida Adeli

GAITGen: Disentangled Motion-Pathology Impaired Gait Generative Model - Bringing Motion Generation to the Clinical Domain. arXiv. Project page. Images were generated by Vida using Gemini 3 pro. Vida is featured on the right in the last image, together with Andrea and supervisor Babak Taati.
Congrats, Doctor Ivona!

Ivona Najdenkoska recently completed her PhD at the University of Amsterdam, under the supervision of Marcel Worring and Yuki Asano. She worked on multimodal foundation models and generative AI, investigating how these systems combine information from different modalities. In particular, her work studies how richer context - visual, textual, or both - can strengthen multimodal understanding, generation, and alignment. Ivona will continue her research at UvA as a post-doc. Congratulations, Doctor Ivona! The idea that motivated much of her thesis is simple: humans rarely rely on a single cue when understanding the world. We look at what surrounds an object, at earlier examples, at prior demonstrations of a task, at object relationships, and at how images and text complement each other. Multimodal foundation models should ideally learn to use context in the same way. Her thesis begins with the challenge of learning from only a few image-caption examples as context. Language models often rely on hand-engineered task instructions that guide the model toward the correct task. The first chapter introduces a meta-learning approach that makes this task instruction learnable. It leverages frozen vision and language backbones connected through a lightweight module named the meta-mapper. This allows quick model adaptation from limited demonstrations, showing that even frozen models can be far more flexible than expected. A similar type of context appears in her work on diffusion models for image generation. These models are typically guided by crafted text prompts, yet many visual concepts - styles, color palettes, object arrangements - are hard to describe in words. Ivona introduced Context Diffusion, a framework that lets diffusion models learn from examples provided as context. Instead of using only text prompts, users can show the model a few images or combine them with text.
The model then generates new images that follow the demonstrated patterns, making image generation more intuitive and faithful to the intended target. Another limitation she tackled concerns contrastive vision-language models like CLIP, whose training context window is limited to only 77 tokens of text. This becomes a bottleneck when working with long captions. Her proposed approach, called TULIP, augments CLIP with relative positional encodings and distills knowledge from the original text encoder. This improves performance in long-caption retrieval, image generation, and any multimodal task that benefits from richer textual inputs. Her final chapter turns to the generation of long captions, i.e., paragraphs, by considering the inherent diversity of the data. This challenge is tackled in the context of radiology report generation, as these reports often reflect uncertainty and diversity between experts. Her proposed Variational Topic Inference framework models this diversity by capturing sentence-level topics, leading to the generation of reports that are coherent and better aligned with the images. Across all chapters, her PhD work shows that visual and textual context can meaningfully improve multimodal foundation models. As Ivona moves into her postdoctoral work, she aims to build models that leverage context while behaving reliably in real-world applications. More information about her work and publications is available here.
Computer Vision News Computer Vision News 40 "Datasets through the LookingGlass" is a webinar series focused on reflecting on the data-related facets of Machine Learning (ML) methods. We are building a community of enthusiastic researchers who care about understanding the impact that data and ML methods could have on our society. The webinar is part of “Making MetaDataCount” project and was originally organized by Veronika Cheplygina at the IT University of Copenhagen and Amelia Jiménez-Sánchez, now at IT University of Barcelona. The webinar goes beyond the project since Théo Sourget, also affiliated at ITU Copenhagen, and Steff Groefsema have joined the organizational team. Steff is a PhD-student on the topic "Ethical Uncertainty in medical AI“ at the Department of Artificial Intelligence, Bernoulli Institute, the University of Groningen. He accepted our invitation to tell us about the latest edition of the webinar, in which several researchers presented their work on bias and dataset quality. Datasets through the L king-Glass Photo by Reyer Boxem Steff Groefsema
Mamunur Rahaman is a PhD candidate in Computer Science and Engineering at the University of New South Wales (UNSW), Sydney, specializing in AI, biomedical imaging, and computational pathology. His research focuses on multimodal deep learning for cancer diagnostics. His talk gave a detailed overview of his PhD journey, beginning with why computational pathology is essential yet difficult to implement effectively. Standard computer vision tools often fall short, requiring specialized data processing to handle varying dataset characteristics. Mamunur shared valuable insights on building a reliable multimodal framework that delivers consistent predictions across datasets. His first project established a robust, generalizable foundation for histopathology classification, using supervised contrastive learning and strong feature fusion, resulting in highly accurate and dependable predictions. Want to know more about his projects and the relevant papers? The entire talk can be found on YouTube:
David Restrepo is a PhD student in Applied Mathematics at the Mathematics and Computer Science Lab (MICS) at CentraleSupélec, University of Paris-Saclay. His research focuses on fairness and bias in medical AI, with a particular emphasis on multimodal approaches. Health disparities are also a challenge in ophthalmology, because the equipment necessary to obtain high-quality data from the retina is not available to everyone. Obtaining data that is both high quality and fair is essential for providing reliable care for visual impairments, but also for people with cardiovascular diseases or diabetes. The presented work therefore addresses these challenges through the development of three open and representative datasets: BRSET, mBRSET, and Multi-OphthaLingua. These resources capture diverse retinal images and associated demographic and clinical data from Latin America and beyond, enabling systematic benchmarking of AI performance across geographic, socioeconomic, and linguistic dimensions. Want to take a closer look into this interesting topic? The entire talk can be found on YouTube:
Would you be interested in joining our next webinar, to improve your own knowledge or to have the opportunity to ask experts questions? Please look at our website and sign up to the newsletter!

Yuki Arase is a professor at the School of Computing, Institute of Science Tokyo (aka Tokyo Institute of Technology). Her research interests focus on paraphrasing and NLP technology for language education and healthcare. You just met with your doctor, but you have absolutely no clue what all those terms meant. Wouldn't it be nice to have a tool translating them into easier, plain language? In English this might be feasible thanks to the large amount of available data; however, this is not the case for Japanese. During her talk, Yuki explained that the creation of JASMINE provided a first step for Japanese medical text simplification. First, 17,000 sentences from 1,000 publicly available patient blogs were reviewed by two highly experienced NLP annotators. This resulted in more than 1,400 pairs of complex and simplified sentences. Do you want to know more about how the model is trained? Or do you want to see some examples? The entire talk can be found on YouTube:

Computer Vision News
Publisher: C.V. News
Copyright: C.V. News
Editor: Ralph Anzarouth
All rights reserved. Unauthorized reproduction is strictly forbidden. Our editorial choices are fully independent from IEEE, WACV, CVPR, ICCV and all conference organizers.