Winter Conference on Applications of Computer Vision 2026 - Tuesday WACV DAILY
Jon's picks of the day:

Orals
9:45-10:45 CLIP-UP: CLIP-Based Unanswerable Problem Detection for Visual …
14:45-15:45 Broadcast2Pitch: MM-TS: Multi-Modal Temperature and Margin …
14:45-15:45 Locally Explaining Prediction Behavior via Gradual Interventions and … (see our full review on page 10)

Posters
Poster Session 5-35: GHOST: Getting to the Bottom of Hallucinations with a …
Poster Session 5-116: VectorSynth: Fine-Grained Satellite Image Synthesis with …
Poster Session 6-110: GorillaWatch: An Automated System for In-the-Wild Gorilla …

Jon Crall has a PhD in Computer Science from Rensselaer Polytechnic Institute and is a Staff R&D Engineer on Kitware's Computer Vision team.

"At Kitware, I work on a variety of interesting vision problems and contribute to the open source ecosystem by building and maintaining Python tools. Currently I'm working on the DARPA AIQ program, where we are studying what kinds of mathematical guarantees can be made about large language models and empirically evaluating how well those theories hold up at scale. I also work on my own independent pet projects, and one of them is the reason I'm here."

"The last time I was at WACV was in 2013, when I presented HotSpotter. Now, in 2026, I'm back with ScatSpotter: a dataset for dog poop detection. This may sound silly --- and it is --- but it's also a serious and challenging vision problem with practical applications. Among these is a struggle that many dog owners know: when your dog goes in leaf clutter, if you take your eye off it for even a second, it can be surprisingly hard to find again. I'm working on a phone app to make relocating it easier. While it isn't the most important problem in the world, it's at least number two!"
WACV Daily Editor: Ralph Anzarouth. Publisher & Copyright: Computer Vision News. All rights reserved. Unauthorized reproduction is strictly forbidden. Our editorial choices are fully independent from IEEE, WACV and the conference organizers.

Good Morning Tucson! Enjoy the last magazine of WACV this year and have a great Tuesday!

Ralph Anzarouth, Editor, Computer Vision News

Ralph's photo above was taken in peaceful, lovely and brave Odessa, Ukraine.
Active Speaker Detection (ASD) asks a deceptively simple question: when several people appear in a video, who is actually speaking at a given moment? The task requires models to analyze both the visual scene and the audio track, and to learn to align facial motion with speech signals. "You take in the faces and the voice and the audio, and then you have to make a prediction about whether someone is speaking over the span of time," Le explains. The field has long been considered largely solved. On standard benchmarks, systems typically reach around 95–96 mean average precision (mAP), the standard evaluation metric for the task. With numbers that high, many researchers assume the task is saturated. But when Le tested one of these models on everyday videos, he found a different story. When he ran a state-of-the-art system on a noisy YouTube game show clip, the results were surprising. "I realized that the model broke," he recounts. "I thought, what happened here? I saw a systematic failure: whenever you introduce very high noise, the model suddenly fails." Tracing the issue further, he found that the datasets used to train and evaluate many models share a common characteristic: they are built largely from movie footage. Dialogue in films is typically recorded and mixed to emphasize speech while minimizing background noise. "They are trained on movie video, and they are tested on movie video," Le notes. "When you watch a movie, all the noise has been filtered. You can only hear the voice of the actor."

Oral Presentation: LASER: Lip Landmark Assisted Speaker Detection for Robustness. Le Thien Phuc Nguyen (top) is an undergraduate researcher, and Zhuoran Yu (bottom) is a final-year PhD student at the University of Wisconsin–Madison. They are the first authors of a paper exploring how active speaker detection models behave in noisy real-world environments. They speak to us ahead of their oral and poster presentation this afternoon.
As a result, models that perform well on these datasets may struggle when applied to real-world video, where crowd noise, music, or overlapping conversations are common. To measure the impact of this problem more systematically, the team created a new evaluation dataset designed to test models under different noise conditions. Their benchmark, called LASERbench, includes modern online video clips with realistic background sound. The results revealed a clear gap between clean and noisy environments. In clips with clean audio, models maintained performance close to the familiar benchmark results. But as background noise increased, their accuracy dropped noticeably. In particularly noisy clips, performance fell to around 80% of the level observed under clean conditions. With a clearer understanding of where the models struggled, the researchers began exploring ways to make them more robust. Their approach draws on a simple strategy people naturally use when speech is hard to hear: watching the speaker's mouth. "When the noise is very loud, one way you know if that person is speaking or not is that you look at their lips," Le points out. "What we did was extract the lip landmarks, which are coordinate points around the speaker's lips, and then we embedded those into the model to guide it to look at the lips when there's high noise." The method they developed, called LASER, incorporates lip landmarks during training. These landmarks describe the position and movement of the mouth. They are encoded into feature maps, which are combined with visual features extracted from the face track, helping the model focus directly on
mouth motion when deciding whether someone is speaking. Implementing this idea raised its own challenges. Lip landmarks are typically extracted using lightweight facial landmark detectors, which do not always perform perfectly. Faces can be partially occluded, captured at unusual angles, or too small for reliable detection. Le remembers that the first version of the system struggled with exactly this issue. "The first version of the model failed a little bit," he says. "The performance decreased a little in clearer audio conditions because sometimes the lip model cannot extract the lip landmarks." To overcome this limitation, the team introduced a consistency loss during training. The model learns to produce similar predictions whether lip landmarks are present or not, retaining the benefits of lip guidance without relying on a landmark detector at test time. The result is a system that can still focus on lip-audio synchronization even when landmark detection fails.
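The consistency idea can be sketched in a few lines. This is a minimal illustration under assumed names, not the authors' code: `consistency_loss` and the per-frame score lists are hypothetical, and a real implementation would operate on tensors inside the training loop.

```python
# Minimal sketch of the consistency idea (hypothetical names, not the
# authors' implementation): the model should produce similar per-frame
# speaking scores whether or not lip landmarks are available, so a failed
# landmark detector at test time does not change the prediction.

def consistency_loss(scores_with_landmarks, scores_without_landmarks):
    """Mean squared difference between the two prediction branches."""
    pairs = list(zip(scores_with_landmarks, scores_without_landmarks))
    return sum((a - b) ** 2 for a, b in pairs) / len(pairs)

# Toy per-frame speaking probabilities from the two branches.
with_lm = [0.9, 0.8, 0.2, 0.1]       # branch that sees lip landmarks
without_lm = [0.85, 0.75, 0.3, 0.1]  # branch without landmarks
loss = consistency_loss(with_lm, without_lm)  # small when branches agree
```

In training, a term like this would be added to the usual detection loss, pushing the landmark-free branch toward the landmark-guided one.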
As Le points out, the approach builds on ideas that are already familiar in computer vision and machine learning. "We did two things," he explains. "The first one is something pretty well known in previous computer vision models, which is using an additional cue to guide the model through an embedding. Basically, you encode it into a vector and embed it into the model. The second thing we used is the consistency loss, which is well known in the semi-supervised learning field." Zhuoran emphasizes that the project also highlights a broader issue in the field's evaluation of its models. "From the motivation side, it pinpoints potential weaknesses that have been overlooked by the previous benchmarks," he tells us. "We propose an approach to improve upon that, but more importantly, we also created a new benchmark that covers gaps the earlier benchmarks didn't capture." The new benchmark includes both high- and low-background-noise scenarios, enabling future work to conduct more detailed quantitative analysis of how models perform under different noise conditions. Although the work improves performance in challenging environments, the researchers do not see it as solving the task entirely. Instead, they view it as opening new directions for future research. Much of the remaining challenge lies in building datasets that better capture the diversity of real-world conditions. For Le, the experience has also been personally meaningful. As the project's undergraduate researcher,
he was responsible for coordinating experiments and organizing the work. "When I first came in, everything was very messy," he recalls. "But then I learned how to organize the experiment in a structured way. How to organize a team to annotate a test set. Those are skills that I'm proud of having acquired for myself." He is now continuing this line of work by exploring how to curate larger and more diverse training datasets that expose models to a broader range of acoustic conditions. The goal is to make active speaker detection systems more reliable in the environments where they are most likely to be used. For Zhuoran, working with undergrads like Le forms part of his regular work at UW–Madison. His own research focuses more broadly on multimodal learning, particularly text-to-image models and multimodal large language models. Le hopes the project encourages others to revisit a task that many researchers assumed was already solved. "If you come to our oral presentation, what you can learn is that the field is not done yet," he teases. "There are still many things that can break an ASD model, which means the task is not solved, and there are many opportunities for further development. We also present a technique that people might be able to use for other tasks, and we'll have a small demo." You can learn more about this work during Oral Session 8B: Video Recognition and Understanding II, Tuesday 13:30–14:30 in AZ Ballroom 7, and during Poster Session 6, Tuesday 15:45–17:30 in the Tucson Ballroom and Prefunction Space, poster #9.
Double-DIP: Don't miss the BEST OF WACV 2026 in Computer Vision News of April. Subscribe for free and get it in your mailbox! Click here
Oral Presentation: Locally Explaining Prediction Behavior via Gradual Interventions and Measuring Property Gradients. Niklas Penzel is a sixth-year PhD student in Jena under the supervision of Joachim Denzler. He's the first author of a very nice paper that has been accepted as an oral at WACV 2026. He speaks to us ahead of his oral and poster presentations later today.

This work's main focus is explainability: figuring out what happens inside a neural network. Niklas works in a group that includes scholars directly working on causality, and his focus and perspective on explainability come from that area. "I kind of try to figure out from a causal perspective," he starts, "what do neural networks do? And while I studied the different methods that exist, how they work, I kind of took a step back and basically viewed it through an interventional lens. So the core idea of my paper is to perform input interventions in images, to explain shifts in the behavior of a neural network." Specifically, Niklas measures gradients with respect to properties, and he's doing it to learn how neural networks
encode information and also how behavior changes under concrete interventions. "In my opinion," he explains, "a straightforward way to answer this is the causal hierarchy theorem. Causal queries and questions about data-generating systems can be summarized in a hierarchy. Associational queries have less power than interventional queries." One challenge Niklas encountered was building on image-to-image editing diffusion models. While these models keep getting better, and his method is directly tied to their quality, they always introduce a domain shift. He
made some preliminary experiments, and then also some more experiments beyond the paper, just to convince himself of that. The problem was solved by also studying whether the information for the domain shift is actually in there. "While this didn't make it into the paper in the end," Niklas confesses, "because this was preliminary work, I figured out that in this specific case, the information to differentiate based on the noise was not in the backbone or in the model activations. And how I solved this in the end was to only use diffusion models. A separate idea we had is to actually record new data and to perform synthetic interventions, where we design a specific intervention, like handcrafting a solution." For text inputs, generating changes is often more tractable because of limited vocabularies and strong generative models like LLMs. But for vision, this is more complex. "For vision," Niklas declares, "there are a lot of different combinations of pixels that can correspond to a change. For example, imagine an image of a dog with very white fur, and we want to gradually change the fur color to black; the form this change takes can vary depending on many factors, for example lighting conditions. We could imagine different color gradients, or a dog where dark spots increase in size. In our work, we build on diffusion image-to-image editing models. They are similar to LLMs in the sense that they are generative models in the domain we are interested in, and we use them to sample from interventional distributions via text guidance." Niklas thinks that, in general, a focus on the causal perspective is something that is needed in the explainability area. There are many other works that do something like this as well, where they cast explanations as verifiable hypotheses. "My work is one of many…"
How did this paper turn into an oral presentation at WACV? Did Niklas give any thought to this? "I got valuable feedback at previous conferences where it was rejected," is Niklas' candid answer. "And I used all of these insights they recommended to improve the paper. So for this specific paper, I tried to incorporate a lot of feedback, like from my colleagues, from reviewers, and then of course my own feedback. What did reviewers criticize? What did they not look at? How can I tell the story so they actually follow my argument? I think it was a lot of improvement that led to this, in a sense. I think if I had submitted the first version directly here, then it would not have been understood. The kind of continuous improvement, that's what actually led to this result!" Go to Niklas' presentations today! You will discover a causal, interventional perspective on explainability. "I hope you can take something away from that, that you can apply to your own research in the area of explainability. Even if it's not specifically the method I use, think about this from a causal perspective!" If that has piqued your interest, you can learn more about Niklas' work during Oral Session 9B: Machine Learning II, Tuesday 14:45–15:45 in AZ Ballroom 7, and during Poster Session 6, Tuesday 15:45–17:30 in the Tucson Ballroom and Prefunction Space, poster #19.
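To make the gradual-intervention idea concrete, here is a toy sketch, an assumed reading of the approach rather than Niklas' code: an editing function moves an image along an intervention strength, the model is queried at each step, and the property gradient is estimated with finite differences. `edit` and `model` stand in for the diffusion-based editor and the classifier under study.

```python
# Assumed sketch of gradual interventions: sweep an intervention strength
# lam from 0 to 1 (e.g., fur color white -> black), query the model at each
# step, and estimate the property gradient with finite differences.

def property_gradient(model, edit, image, steps=5):
    lams = [i / (steps - 1) for i in range(steps)]
    outputs = [model(edit(image, lam)) for lam in lams]
    # Forward differences between consecutive intervention steps.
    return [(outputs[i + 1] - outputs[i]) / (lams[i + 1] - lams[i])
            for i in range(steps - 1)]

# Toy stand-ins: the "image" is a single fur-darkness value, the editor
# darkens it by lam, and the model scores it linearly, so every estimated
# gradient should equal the model's slope of 2.0.
toy_edit = lambda image, lam: image + lam
toy_model = lambda image: 2.0 * image
grads = property_gradient(toy_model, toy_edit, 0.0)
```

In the real setting, `edit` would sample from an interventional distribution via a text-guided diffusion model, and the gradient would be estimated over many sampled edits rather than a single deterministic path.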
Poster Presentation (UKRAINE CORNER): Digital Forensic AI You Can Explain: A Case Study on Video Source Camera Identification. Maryna Veksler is a postdoc at Virginia Commonwealth University. Her paper was accepted at WACV as a poster and she talks to us ahead of her presentation today.

This work is a case study on why explainability is important and how to apply it in the field of digital forensics, specifically to AI-based digital forensic methods. Maryna worked on the problem of creating AI models to detect the source camera of a video. "And once I concluded that work," she reveals, "[a question] kind of came to my mind: how do we know that the model actually follows the hypothesis that we follow, specifically that every video has a unique noise print for a specific camera? So for that reason, we decided to design the pipeline called xDFAI, which is basically a digital forensics explainability pipeline, particularly for video source camera identification AI methods." There are multiple reasons why this is important. One of the main ones would be the specificity of the domain of the implementation: when we talk about digital forensics, any method we use eventually needs to be, and will be, presented in court. And the experts that use it need to be able to go on the stand or present the evidence to a lawyer, and they need to be able to communicate their findings clearly. They
also need to be able to demonstrate the validity of the method, that it can be replicated, and that it actually does what it's supposed to be doing and not just random magic. It's almost as sensitive an issue as in medicine: "Yes, yes!", she confirms. "Actually, I think digital forensics is also part of the safety-critical domains and applications where AI is used, because, you know, it may result in a wrongful conviction, but it can also result in the inadmissibility of critical evidence for important cases!" It is so important not to mess up the evidence, especially if it is critical to link a suspect to the crime. Maryna is thinking of any evidence or cases that involve linking videos to a suspect, such as child abuse or, you know, any kind of trafficking, as well as placing the suspect on the scene and identifying the observer who saw it. The next question is obvious: with all the fake things and the ease with which we can fake things today, doesn't the line between sure and not sure become more blurred? Let Maryna reply: "It does. It definitely does. And coming back to our work, yes, we did it for digital forensics, but at the end we also elaborated that this specific application of source camera ID is only the first step. Eventually, this is something that we want to be applied to all AI methods. And I think it's particularly relevant when we talk about deepfakes, because it's coming to the point that anyone can claim about any video, this is a deepfake! And there are probably tools that can somehow support it, even if it's real. So we need additional methods that can validate the tool!" Traditionally, we used to do this manually. We would just run statistical equations or manually validate the evidence, but as we use AI models, which are more often than not black boxes for those tasks, it becomes
really hard to just manually verify it. This is where it became obvious to the authors that explainability is something that needs to be there. "But it shouldn't be as easy as just saying, oh, this is explainable," Maryna replies, "because I can say that anything is explainable. But how do I communicate it to you? How I interpret these explainability results is also important, and what conclusions I draw based on them." The main challenge in this work is figuring out how to go from obtaining some explanations and statistics to turning those data into meaningful interpretations of what the model is doing. That's why the team didn't create just one method, but rather focused on the pipeline as a whole. They started with a first step of getting those explanations, and they obtained them per video frame, because that's how models usually process videos: instead of taking the whole video, they process it frame by frame. And from there, there are multiple challenges. The first challenge is what we even need to measure to prove the hypothesis that it's a noise print, as noise prints are specific to devices. And the second challenge was how to aggregate those explanations across videos and across devices, because the explanations are unique to the local frame. But then we want to look at it as a whole, because one frame is not enough to identify the camera that was used to record the video. "We need to understand how to obtain those global explanations," Maryna clarifies. "And then also, how do we make sure that during our aggregation we didn't mess up the actual statistical data, basically that our preprocessing didn't compromise the results." Ask Maryna at her poster about the fascinating work she did to solve these two challenges and what her conclusions are. There are multiple directions Maryna and her team are considering as follow-ups.
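As a rough illustration of the aggregation challenge described above, and a sketch under assumed names rather than the actual xDFAI pipeline, per-frame explanation scores could first be averaged within each video and then across all videos from the same device, yielding one global explanation per camera:

```python
from statistics import mean

def aggregate(per_frame_scores):
    """per_frame_scores: {device_id: {video_id: [frame_score, ...]}}.
    Average within each video first, then across a device's videos, so
    long videos do not dominate the device-level explanation."""
    return {device: mean(mean(frames) for frames in videos.values())
            for device, videos in per_frame_scores.items()}

# Toy per-frame explanation scores for two cameras.
scores = {"camA": {"v1": [0.8, 0.9], "v2": [0.7]},
          "camB": {"v3": [0.2, 0.3]}}
result = aggregate(scores)  # one global score per device
```

A real pipeline would also need to verify, as Maryna stresses, that this kind of preprocessing does not distort the underlying statistics.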
One of them is to use this knowledge to improve the existing AI methods, because what it essentially did is reveal the vulnerabilities and instabilities of the model. And the other path is
to apply it to different domains and different tasks, and essentially come up with a more or less standard approach that can be transferred as a backbone to different applications.

UKRAINE CORNER: Russian Invasion of Ukraine. Our sister conference CVPR condemns in the strongest possible terms the actions of the Russian Federation government in invading the sovereign state of Ukraine and engaging in war against the Ukrainian people. We express our solidarity and support for the people of Ukraine and for all those who have been adversely affected by this war.

Let's now talk about the country Maryna comes from: Ukraine. The computer vision community is very supportive and condemns the Russian aggression in the strongest possible terms. Is there anything she wants to tell the community about this? "I'm very grateful. And I think it's very important for me that the community recognizes it as an aggression, not a special operation,
or a conflict, or the other ways they like to call it these days. For me, and for almost all Ukrainians that I know, the support is important, and I would also encourage everyone to just speak up about it! What's valuable these days is to spread the information about it whenever possible. A lot of people that I know, even some athletes who are world class on the world stage, have been speaking out about it consistently, and also other public figures. And I think that these acknowledgments and this spread of information help a lot to filter out the discussion [claiming that] there is something happening, but nobody really knows what!" Maryna will be presenting her paper during Poster Session 5, Tuesday 10:45–12:15 in the Tucson Ballroom and Prefunction, poster #117.

These women were competitive gymnasts. They became computer vision scientists. They are Italian. They are in today's WACV Daily. Meet the scientist behind the science on page 20!
Annika and Sadia. Teamwork across career stages: Sadia (second from left, PhD student) and Annika (right, postdoc) from the Computer Vision and Robust AI Lab Osnabrück, enjoying great discussions and insightful feedback during the poster session at WACV! Their work explores how generated objects can be used to systematically challenge open-vocabulary object detectors and make AI perception systems safer and more reliable in applications like autonomous driving.
Isabella Poles is an Italian computer vision researcher and PhD candidate in Computer Science and Engineering at Politecnico di Milano. Her work focuses on deep learning methods for medical image analysis, with the goal of shortening the time from diagnosis to treatment through representation learning, multimodal AI, and learning-to-optimize strategies applied to diagnosis, prognosis, and radiotherapy planning. She earned her double master's degree at the University of Illinois at Chicago and Politecnico di Milano and, during her PhD, was a visiting researcher at Brigham and Women's Hospital-Harvard Medical School and an intern at Siemens Healthineers.

Before pursuing research, Isabella dedicated many years to rhythmic gymnastics. At just 13, she moved away from her hometown near Treviso to train with the Virtus Gallarate team near Milan, balancing intensive training with school while living far from home at a young age. She competed in the Italian Serie A championships and, in 2014, became Italian champion in the 5-ribbons team event. Years of training and competition shaped her discipline, resilience, and ability to work within highly coordinated teams under pressure.
From Gymnastics to Science #1

Beyond the athletic challenge, gymnastics sparked her curiosity about the physics behind movement and the complexity of the human body: how precise motions emerge from underlying biomechanical principles. This early fascination with human motion and structure gradually evolved into a broader scientific interest in understanding the body through technology. These experiences ultimately inspired her decision to pursue a scientific education, first in biomedical engineering and later in computer vision for medical imaging.
Giulia Avanzato is a computer engineer and computer scientist from Italy. Her journey in artificial intelligence is driven by a deep curiosity about how technology can contribute to solving real-world problems, especially in healthcare. "I am participating in WACV 2026 with research focused on breast cancer detection using multimodal models. For me, working at the intersection of AI and medicine is not just an academic pursuit; it is a way to contribute to tools that may one day support doctors in making earlier and more accurate diagnoses. Our approach leverages privileged unpaired text from external datasets: it aligns external textual information with histology images through a vision–text module based on CLIP, enabling multimodal training while preserving unimodal inference and achieving efficient multimodal knowledge distillation."
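The CLIP-style alignment Giulia mentions rests on scoring image and text embeddings in a shared space. A minimal, assumed sketch with toy vectors, not the actual model:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings: a histology-image feature and two candidate reports.
# A CLIP-style vision-text module is trained so the matching pair
# scores higher than the mismatched one.
image_emb = [0.9, 0.1, 0.0]
matching_text = [0.8, 0.2, 0.0]
unrelated_text = [0.0, 0.1, 0.9]
```

During training, a contrastive loss pushes matching image-text pairs toward high similarity and mismatched pairs toward low similarity; preserving unimodal inference then means the image branch can be used on its own at test time.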
From Gymnastics to Science #2

"Gymnastics has been an important part of my life and has shaped the way I approach challenges. It taught me discipline, attention to detail, and the ability to adapt to different contexts, just as in gymnastics you move across different apparatus, each requiring a different mindset and skill set. During my journey, I had the opportunity to work as a research intern in Montreal, an experience that expanded my perspective both scientifically and culturally. Today, I work in consulting in Zurich, where I continue to explore the potential of AI security."

Giulia is looking to start a new career in the research field as a PhD student. She's a catch! Contact her for interesting collaborations!
Rosaura VidalMata is an AI Researcher at Lenovo Research.

Rosaura, what is your work about?
In our lab we do all kinds of research towards camera solutions, because Lenovo of course does laptops, but we also have a tight collaboration with Motorola, since it's owned by Lenovo. So it also involves phone cameras and all the fun things we can do with that.

What does your work consist of?
We have a set of different image processing projects, so I am currently working on a project related to stereo setup segmentation. For that, we are looking into some phone applications and the information we can learn from stereo setups: having two cameras lets us know about the 3D information of our users, their background, how distant they are from objects, whether they are interacting with things. Learning about all that and, more importantly, fitting it into the devices that users are going to be using: a laptop or a phone, rather than a super server. This is one of the most challenging parts: making a very big model work in a very small device.

I have a Lenovo laptop. What can you learn about me?
Technically not that much, because a lot of this is for devices that are not actually in production; it's more research for the sake of research, though some of the other projects we have worked on are actually going to production. Some of them are more theoretical, like this one that we're working on.

Do you enjoy research for the sake of research, or do you sometimes feel you want to do research that will translate into the real world?
I think it's good to have a little balance! Actually, before this project, most of the research I have done here at Lenovo was oriented toward products, so we had a very real need that we wanted to address, to improve the experiences of our users or solve some of the unmet needs we had identified in products.
And obviously, since that is related to an actual production pipeline, it has a completely different basis and priorities. Now we moved into more experimental or theoretical stuff, to try and define directions for the future. It's a little bit less rigorous, because we don't have a production pipeline, but you also have to demonstrate whether this has any potential for a product in the future, and you also feel more freedom, because you don't have as many strict requirements as if you were

Read 160 FASCINATING interviews with Women in Computer Vision
going for a product; at the end of the day, you still have to come up with a recommendation: this is probably something valuable in the next 2-3 years, or maybe there's a way to adapt it for other business needs in other departments. Sometimes your department will maybe not be focused on that, but there might be other departments working on something similar, so you can just share that knowledge and pass the torch to them.

That is fascinating, Rosaura! I would like to know your personal taste between these different parts: would you prefer the one that is less constrained, maybe research for its own sake, or something more constrained but where you will see the fruits of your work in a real product?
That's a difficult question! Sorry. I think I have enjoyed a lot getting the fruits of my research into something that will actually be in the hands of users. That is very rewarding, when people tell me 'I have a laptop' and I'm like, 'you know the feature when you open the camera that does this'... yeah, we had some small part in that! That is always very rewarding... though it can be very stressful! So I do enjoy having small, purely research breaks in between... [laughs]

Is this what you expected when you joined Lenovo?
I wasn't completely sure, actually. I thought that it would be very heavy engineering and product-dependent, but when I joined the group within Lenovo that I work in, we were in a foundational phase of figuring out if this research direction was possible for one of our software solutions that was just in its very early steps. So it was a bit of a shock at that point: it was very heavy research and analyzing stuff; of course, that is usually the start of the project, and then you have to go to optimizations and testing and all that, so it was a good transition period actually. I was lucky to join the company when we were just starting that transition!

Can you tell me something about Lenovo that we don't know?
There are a lot of branches in the company: for example, we also do servers. We are looking to start doing services, especially around AI; there are even labs that do robotics, and every now and then we see crates of robots moving around the building! So it is a really huge company, and it's all over the world, so having meetings across different time zones is actually very common!

How much of IBM is there in the
27 DAILY WACV Tuesday Rosaura VidalMata “ … this is one of the most challenging parts: making a very big model work in a very small device!”
Women in Computer Vision 28 DAILY WACV Tuesday mentality, in the everyday? Lenovo acquired the computer solutions and we kind of got inspired by some of the ways, the structuring of the company… so it was a very heavy inspiration there and I think my manager was part of the acquisition - he used to be in IBM - so there is a lot of people around that were actually part of the acquisition. Is there anything from academia that you miss? One of the things I miss in academia is the change of pace. Since in academia projects timelines are longer, you have more liberty to go all the way to the nitty-gritty. So you do more depth-based exploration, rather than breadth, which is something that I actually had to adapt when I joined the company. Because even though we do want to know the heavy details of the research, you might not necessarily have enough time to tell specifically why this branch of solutions might not be the best one. So you have to instead look at all the possible solutions and then try and explore them quickly to figure out which one do you want to go depth into - that's one of the things I miss most in academia.
Let’s talk about the future. Where is your career going?

That's a very good question! Thank you. We are obviously looking into continuing the research here at Lenovo. We have a lot of exciting products and projects planned for the future. Especially now, with this huge AI boom, one of the major things I'm looking forward to is learning more and getting my hands deeper into heavy AI model optimization. In vision, these models can be really large - they deal with multi-dimensional data, and if you're working with video there is also the temporal aspect - so I'm looking into optimizations to get all these fantastically performing models onto devices that might not have that much compute, like your phone or your laptop. Most laptops are not designed for this heavy computational load, especially laptops for regular consumers.

In terms of your career, how do you see yourself evolving?

I'm a staff researcher, which involves very hands-on research work, but career-wise I'm looking forward to moving toward determining the paths we are going to take - maybe setting the future research and product directions that we grow into, or even developing something that doesn't really exist yet. So I'm really looking to move into more of a leading role in deciding where we are going.

Is there any specific dream that you have in this regard?

It's a difficult question… Looking into the future is useful, but it's always hard to take a step back and look forward. One thing I want to accomplish in my career: there are foundational inventions, even here within Lenovo - for example, the foldable screens in laptops and phones. That is a foundational step forward in technology and in hardware! I would love to make a similar contribution, especially in adapting all these AI solutions into consumer-accessible innovations, because some of them are fantastic, but not necessarily something the day-to-day consumer can access. So, figuring out maybe a platform or a framework that would enable basically any user to access these fantastic inventions - I think that would be an amazing thing to achieve! I don't know how possible it is, but getting closer to that would be fantastic!

I wish you get there. Your word for the community?

Do not be afraid, as long as you always keep trying. If you continue to fail and fail, you're just finding ways to reach the correct solution. So be brave and, no matter how many times you fail, just continue trying!

Read 160 FASCINATING interviews with Women in Computer Vision!
Poster

Lakshay Sharma is a Senior Applied Scientist / ML Engineer at Instacart in New York. He presented Subimage Overlap Prediction at the CV4EO workshop: a novel self-supervised pretext task designed for semantic segmentation in remote sensing imagery. With it, the authors demonstrate significantly faster convergence and better or comparable performance relative to other state-of-the-art methods, while using significantly less pretraining data.
Poster

Andreas Lolos is a researcher working at the intersection of AI, computational pathology, and medical imaging. He builds machine learning methods for processing whole-slide images, with a focus on uncertainty, efficiency, and interpretability. He presented SGPMIL at WACV 2026.
Congrats, Doctor Shashank!

Shashank Tripathi recently completed his PhD at the Max Planck Institute for Intelligent Systems under the supervision of Michael J. Black. His research focuses on 3D human body modeling, human-object interactions, and physics-inspired motion understanding. Shashank will soon be starting his own venture in the gaming industry. Reach out to him if you would like to learn more!

Building convincing digital humans is central to the vision of shared virtual worlds for AR, VR, and telepresence. Yet, despite rapid progress, today’s virtual humans often fall into a physical "uncanny valley": bodies float above or penetrate objects, motions ignore balance and biomechanics, and human-object interactions miss the rich contact patterns that make behavior look real. Enforcing physics through simulation is possible, but remains too slow, restrictive, and brittle for real-world, in-the-wild settings.

In his PhD thesis, Shashank argues that physical realism does not necessarily require full simulation. Instead, it can emerge from the same principles humans rely on every day: intuitive physics and contact. Inspired by biomechanics and cognitive science, Shashank presents a unified framework that embeds these ideas directly into learning-based 3D human modeling.

His first work, IPMAN, introduces intuitive physics into 3D human pose and shape estimation. Instead of full physics simulation, IPMAN uses differentiable biomechanical quantities, such as Center of Pressure and Center of Mass, to encourage physically plausible poses with proper balance and ground contact. These constraints are efficient and integrate easily into existing pipelines, improving both realism and accuracy.
While IPMAN ensures physically plausible static poses, generating realistic dynamic motion presents additional challenges. Many existing models ignore body shape and therefore overlook how physiology affects movement. To address this, his second work, HUMOS, introduces a self-supervised, shape-conditioned motion generation method. By leveraging dynamic intuitive physics constraints, HUMOS produces diverse and physically plausible motions tailored to individual body shapes.

Realistic behavior also depends on how humans interact with objects. Understanding such interactions requires detecting contact across the entire body surface. In his third work, DECO, Shashank proposes a 3D contact detector that estimates dense, vertex-level contact on the body. DECO combines body-part and scene-context attention to handle occlusions and detect contact reliably in complex scenes.

Finally, recovering full 3D human-object interactions requires reasoning about contact on both the body and the object. The key challenge is establishing correspondences between these contact points. In PICO, Shashank introduces a data collection method that transfers body contact annotations to arbitrary objects, along with a reconstruction approach that uses these correspondences to recover accurate 3D human-object interactions from images.

Together, these contributions move physics-aware human modeling closer to practical use, enabling digital humans that not only look realistic but also move and interact with the world in ways that feel natural. For more information, see his site: https://sha2nkt.github.io/