ICCV Daily 2023 - Wednesday

Brilliant Oral and Poster Presentations
Workshop: Quo Vadis? by Georgia Gkioxari
Demo Preview: Inline microscopic 3D shape reconstruction
Today’s Picks by: Fatma Guney
Women in Computer Vision: Nadiya Shvai

Fatma’s picks of the day (Wednesday)

Fatma Guney is an Assistant Professor at Koc University in Istanbul. During her PhD, she worked with Andreas Geiger at MPI in Tubingen. Currently, she is leading a small team called Autonomous Vision Group (AVG; shamelessly stolen from her advisor).

“I am pleased to share the news of my recent success in securing an ERC Starting Grant. Ironically, despite the European Union's substantial financial trust, my visa application has encountered multiple rejections this summer, three to be exact. While I would prefer to attribute these setbacks to personal factors rather than discrimination against scientists hailing from non-EU nations such as Turkey, I remain skeptical. Nevertheless, I extend my best wishes to all attendees of ICCV and hope you find the conference enriching and insightful.”

For today, Wednesday 4, papers that I’d definitely read (oral and posters):
PM GameFormer: Game-theoretic Modeling and Learning of Transformer-based …
AM Tracking Anything with Decoupled Video Segmentation…
AM Unified Out-Of-Distribution Detection: A Model-Specific Perspective
AM DiffDreamer: Towards Consistent Unsupervised Single-view Scene…
AM Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D …
PM Anomaly Detection Under Distribution Shift

by Fatma Guney

In our group, we focus on computer vision problems related to autonomous driving, hence the name autonomous vision. In the last few years, we worked on future prediction: predicting the next frames in a video or predicting the future trajectories of agents in the scene. Recently, we also started looking into the action part, which is learning to act based on perception and prediction input. After experiencing the difficulties of behavior learning first-hand, I better understand the requirements of robotics from computer vision algorithms, mainly efficiency. In this ICCV, we present one of the smallest and fastest trajectory prediction algorithms: ADAPT. I am currently betting on object-level reasoning, ideally without using any labels. We have another paper on unsupervised object discovery: when we also reason about 3D geometry, unsupervised segmentation becomes a lot more accurate! Lastly, we present “RbA: Rejected by All” on segmenting chickens on the road, a.k.a. out-of-distribution objects. This has been a largely ignored problem until recently; we need to start thinking about those “corner cases”.

Our 3 papers:
❖ ADAPT: Efficient Multi-Agent Trajectory Prediction with Adaptation
❖ Multi-Object Discovery by Low-Dimensional Object Motion
❖ RbA: Segmenting Unknown Regions Rejected by All

Oral Presentation

EgoLoc: Revisiting 3D Object Localization from Egocentric Videos with Visual Queries

Jinjie Mai is a master’s student in Bernard Ghanem’s lab at KAUST. His paper on 3D object localization in egocentric videos has been accepted as an oral and poster this year. He speaks to us ahead of his presentation this morning.

Imagine asking your AI assistant to find your misplaced mobile phone or keys. The task at the heart of this paper, VQ3D, or Visual Queries with 3D Localization, holds the promise of helping people locate their belongings and the objects of interest in their daily lives. Last year, the Ego4D benchmark introduced multiple tasks related to video understanding, including VQ3D, a fusion of 3D geometric comprehension and egocentric video understanding. Jinjie’s new take on this is a multi-stage, multi-module solution looking to improve performance over the baseline.

The task hinges on localizing objects in 3D space based on pictures illustrating the objects to find. “Based on the best method, our solution improves the camera pose estimation path, the performance of the VQ2D detection network, and the view backprojection and multi-view aggregation,” he tells us. “After these steps, we do depth estimation for the objects we’re looking for and then aggregate them together to get the final prediction.”

Egocentric videos are inherently dynamic, with freely changing viewpoints and motion blur. Ego4D proposed performing camera pose estimation by relocalizing egocentric video frames to a Matterport scan. However, the noisy nature of Matterport scans leads to low accuracy and poor performance when matching the two.

“We identified this problem and proposed to run structure from motion inside the egocentric videos to construct the correct correspondences between the frames for a complete 3D map,” Jinjie reveals. “This insight has improved the performance greatly. Since we choose to run the structure from motion just for the egocentric video, we can construct a big 3D map containing more frames, giving us more camera poses. With more camera poses, we can localize more objects and get a more accurate 3D object localization.”

Another challenge Jinjie encountered involved integrating the VQ2D task, which is closely related to VQ3D. VQ2D seeks to localize, within an egocentric video, the object shown in a query image. Past approaches simply combined the two tasks, applying VQ2D results directly in VQ3D. However, this naive combination limits performance. VQ2D outputs a tracking result for the query object, and the last frame of that track is typically used, when the object is usually outside the image. This makes the steps that 3D localization depends on, such as depth estimation and accurate 2D bounding boxes, challenging.

To address this, Jinjie proposed a novel strategy. “We input the egocentric video into the 2D detection network,” he explains. “The detection network will compute the similarity between the proposals and the query object to give a similarity score. For each frame, this score will tell you how similar the top prediction is to the object you want to find. After we get that query score, we propose to use the peaks of those similarity scores as the positive candidate proposals. We believe those peaks show the appearance of the target object. We extract those peaks and get a 2D bounding box of those candidates. Since we already have the 3D camera pose from the previous egocentric structure from motion, we can backproject those 2D proposals into the 3D world to get those 3D predictions. Finally, we use the confidence score to do weighted averages for those candidates, then aggregate to get a single final prediction, showing the object’s location in the 3D environment.”
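The last three steps Jinjie describes (picking score peaks, backprojecting 2D candidates with known camera poses, and confidence-weighted averaging) can be illustrated with a small sketch. This is not the EgoLoc implementation: the function names, the simple local-maximum peak rule, the pinhole camera model and the use of a single depth value per box center are all assumptions made for illustration.

```python
import numpy as np

def find_score_peaks(scores, threshold):
    """Return indices of local maxima in a per-frame score sequence.

    Assumption: a frame is a peak if its score reaches `threshold` and
    is higher than the previous frame and at least as high as the next.
    """
    s = np.asarray(scores, dtype=float)
    peaks = []
    for i in range(len(s)):
        left = s[i - 1] if i > 0 else -np.inf
        right = s[i + 1] if i < len(s) - 1 else -np.inf
        if s[i] >= threshold and s[i] > left and s[i] >= right:
            peaks.append(i)
    return peaks

def backproject(center_px, depth, K, cam_to_world):
    """Lift a pixel (u, v) with a depth value to a 3D world point.

    K is a 3x3 pinhole intrinsics matrix; cam_to_world is a 4x4 pose,
    e.g. as recovered by structure from motion.
    """
    u, v = center_px
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])  # camera-frame point at depth 1
    point_cam = np.append(ray * depth, 1.0)         # homogeneous camera-frame point
    return (cam_to_world @ point_cam)[:3]

def aggregate(points, weights):
    """Confidence-weighted average of candidate 3D points."""
    p = np.asarray(points, dtype=float)
    w = np.asarray(weights, dtype=float)
    return (p * w[:, None]).sum(axis=0) / w.sum()
```

In use, one would call `find_score_peaks` on the per-frame similarity scores, call `backproject` once per peak frame with that frame's box center, estimated depth and pose, and pass the resulting points and their scores to `aggregate` to obtain a single 3D location.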

These innovations have significantly improved performance. Compared to the previous camera relocalization approach, egocentric structure-from-motion lifted the success rate from just 8% to 77%. Furthermore, the shift from relying on VQ2D tracking results to using the detection network and selecting the most confident peaks increased the rate to 87%. A huge success!

Jinjie tells us that while this project is a part of his master’s work, it does not encompass the entirety of his thesis, which is expected to include various related projects and topics reflecting his broader research interests.

To learn more about Jinjie’s work, visit his oral this morning at 9:30-10:30 and his poster at 10:30-12:30.

ICCV Daily
Publisher: RSIP Vision
Copyright: RSIP Vision
Editor: Ralph Anzarouth
All rights reserved. Unauthorized reproduction is strictly forbidden.
Our editorial choices are fully independent from IEEE, the Computer Vision Foundation and ICCV organizers.

Workshop

Quo Vadis, Computer Vision?

by the awesome Georgia Gkioxari

“Quo Vadis, Computer Vision?” means “Where are you headed, Computer Vision?”

We stand at a pivotal juncture. The past two years have been an exhilarating ride, brimming with innovation and creativity. The dawn of Generative AI (the author of this piece loves some drama!) has ushered in an epoch few could have foreseen just three years prior. Anyone claiming the contrary is with high certainty lying! Amidst the exhilaration, there's discernible concern regarding the direction of Computer Vision research. The industry's aggressive investments, both in talent and computing power, signal a rush to capitalize on the latest technological advances. This surge is welcome; it offers more opportunities for our community members. This is nothing but a sign of a healthy field. But simultaneously, it instills a sense of uncertainty in many about their next steps.

These concerns came under the spotlight and were extensively discussed at the “Big Scholars” workshop during CVPR, sparking debates about the trajectory of academic versus industrial research and its implications for the future of Computer Vision. Arguably, our field’s fast pace is instilling in our budding talents a sense of agony around how they could make their own significant mark in this new landscape of research. This is where our "Quo Vadis, Computer Vision?" workshop enters the scene, aspiring to guide and galvanize young researchers in navigating this transformed research milieu.

We've asked experts from diverse backgrounds and research focuses to share their insights. We've posed to them an important question: "In today's landscape, what would you, as a grad student, focus on?". Many of us, including the organizers of this workshop, are staunch in our belief that countless challenges in CV await solutions. To put it another way, the most crucial problems remain unconquered. But we are concerned that this sentiment isn't universally shared by our emerging scholars. We are optimistic that our seminar will inspire them to think, delve deeper, and find their place in this ever-evolving landscape of Computer Vision.

Industrial Inspection and Tricky Surfaces

An exciting workshop about Transparent & Reflective objects In the wild (TRICKY) offered a vivid exchange about approaches for handling such objects. Especially interesting were the presentations and discussions about lights, photometry, industrial challenges, depth completion for transparent objects, self-supervised learning and 3D reconstruction for objects with complex light transport.

In related news, an example of industrial challenges with tricky objects will be shown in the inline microscopic 3D shape reconstruction demo, which will be presented by Lukas Traxler, Doris Antensteiner and Christian Kapeller from AIT, the Austrian Institute of Technology. The demo can be found in the exhibition area from today (Wednesday) to Friday at stand D1. The presenters focus on researching novel methods to inspect industrial objects with difficult surface structures (e.g. reflective or translucent materials) and geometries.

Poster Presentation

HybridAugment++: Unified Frequency Spectra Perturbations for Model Robustness

Mehmet Kerim Yücel is a Research Scientist at Samsung Electronics Research and a recent PhD graduate from Hacettepe University in Ankara, Turkey. His paper proposes a new method leveraging data augmentation techniques to enhance the robustness of convolutional neural networks (CNNs). He speaks to us ahead of his poster this afternoon.

Generalization is a fundamental problem in machine learning, involving training models to perform well on the data they were trained on, as well as unseen or modified data. A key subfield within generalization is model robustness, which ensures that machine learning models maintain their performance even when faced with changes in the input data.

Imagine training a model on ImageNet and achieving impressive accuracy on that dataset. However, the model’s performance drops significantly when the test distribution changes, perhaps due to image corruptions like JPEG compression artifacts, adverse weather conditions like fog or rain, or image contrast and brightness alterations. Model robustness is critical in machine learning because real-world data is often dynamic and unpredictable.

“A good model will be robust to those changes,” Mehmet tells us. “What we want to do is improve the robustness of our model to ensure it works in different distributions while keeping or improving its performance on the test set. How we do it is through data augmentation. We augment the images in a way that improves the accuracy by diversifying the training distribution. What we propose is a two-step algorithm. Firstly, the HybridAugment data augmentation method. Then we build on it and propose HybridAugment++, which is the end result of our paper.”

Inspired by Mehmet’s background in electronics engineering, the paper delves into frequency analysis, a specific area of the robustness field, which plays a crucial role in signal processing. The central idea is that CNNs and humans process information differently. While CNNs focus on high-frequency components, human perception emphasizes low-frequency information. This divergence in processing methods is one of the reasons adversarial examples, where small, imperceptible changes can fool a model, are a concern. Humans may not be able to perceive them visually, but CNNs do.

“If I were to show you an image of a cat with a couple of pixels flipped, just some random noise, you will still classify it as a cat because we are robust to those pixel changes,” Mehmet points out. “We wanted to do the same thing on CNNs but with a frequency perspective!”

The work introduces a technique inspired by a 2006 paper, Hybrid Images, which proposes merging one image’s high-frequency components with another’s low-frequency components to create a new image. This new image is perceived differently depending on viewing distance, with the high-frequency content dominating up close and the low-frequency content dominating from afar, aligning with the differences in how CNNs and humans perceive information.

For the first part of his novel method, Mehmet performs a simple hybrid image augmentation. The model is trained on a batch of images, and some are randomly picked to have their high- and low-frequency components mixed up. This straightforward process diversifies the training data, introducing variations that help the model generalize better, yet it only requires a few lines of code, which are readily available online.

“The second part goes deeper into frequency analysis literature,” he explains. “The Fourier transform decomposes a signal into an amplitude and a phase component. Amplitude is essentially the magnitude of frequency components in that signal. Phase shows the phase. We find that humans focus more on the phase information, which is important because if we remove the amplitude, we can pretty much guess what the image is. It turns out that CNNs overfit to amplitude information just like they overfit to high frequency. We merge these two techniques: doing the hybrid images and taking the phase information.”
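Both ideas, mixing frequency bands between two images and keeping one image's phase while borrowing another's amplitude, can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' released code: the ideal circular low-pass mask, the `radius` parameter, the function names and the restriction to 2D grayscale arrays are assumptions, and the paper's actual filters and the way it combines the two steps may differ.

```python
import numpy as np

def split_frequencies(img, radius):
    """Split a 2D image into low- and high-frequency parts.

    Assumption: an ideal circular low-pass mask in the Fourier domain,
    with `radius` given in integer frequency bins.
    """
    h, w = img.shape
    fy = np.fft.fftfreq(h)[:, None] * h  # vertical frequency index
    fx = np.fft.fftfreq(w)[None, :] * w  # horizontal frequency index
    mask = (fx ** 2 + fy ** 2) <= radius ** 2
    low = np.real(np.fft.ifft2(np.fft.fft2(img) * mask))
    return low, img - low

def hybrid_image(img_a, img_b, radius=4):
    """Low frequencies of img_a combined with high frequencies of img_b."""
    low_a, _ = split_frequencies(img_a, radius)
    _, high_b = split_frequencies(img_b, radius)
    return low_a + high_b

def swap_amplitude(img_a, img_b):
    """Keep the Fourier phase of img_a and take the amplitude of img_b."""
    spec_a = np.fft.fft2(img_a)
    spec_b = np.fft.fft2(img_b)
    mixed = np.abs(spec_b) * np.exp(1j * np.angle(spec_a))
    return np.real(np.fft.ifft2(mixed))
```

In a training loop, one plausible use is to pick random pairs within a batch and replace some images with `hybrid_image(x_i, x_j)` or `swap_amplitude(x_i, x_j)` while keeping the label of `x_i`, on the reasoning that the low-frequency and phase content carry most of what humans use to recognize the image.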

Overfitting, where a model performs exceptionally well on the training data but poorly on new, unseen data, is a significant concern in machine learning. Model robustness is closely connected to overfitting because robust models are less susceptible to changes in the data distribution and environment, which is a critical requirement if they are to be deployed in the real world.

The work aims to contribute to the ongoing efforts to enhance model robustness in the context of computer vision. Still, Mehmet believes the proposed methods could benefit various downstream tasks, extending to other domains like natural language processing.

“I think our method would work off the shelf for many computer vision tasks,” he says. “Our work is primarily focused on vision and CNNs, but we show it also works for transformers. For NLP or other modalities, I don’t think it would be that easy. In NLP, the frequency of the words rather than the image is not trivial for me to define at this point. That’s why we’ve shared all the details in our paper, and all the code and all the pre-trained models are available on GitHub.”

To learn more about Mehmet’s work, visit his poster this afternoon at 14:30-16:30.

Women in Computer Vision

UKRAINE CORNER

Nadiya Shvai is currently a Senior Data Scientist responsible for research at Cyclope.AI.

Where shall we start? From Nadiya or from Cyclope.AI?

[laughs] Let's start with Cyclope.AI because I think we'll have plenty to talk about.

Perfect!

Cyclope.AI is a relatively small company that works on artificial intelligence-based solutions for smart road infrastructure and safety. For example, among the products that we do is the security system for the tunnels. You’ve probably heard about the accident in the Mont Blanc Tunnel that happened some years ago. After this, the regulations for the safety of tunnels have been reinforced a lot in France. We are working on the automation of the system to make sure that they are as fault-proof as possible. At the same time, they do not generate a lot of false alarms because a system that generates a lot of false alarms finally becomes not useful at all.

What do you do there as the data scientist?

My work really considers almost all the aspects of deep learning product development. Starting from the data collection to data selection, to supervising the data labeling, to model training and testing. Then, we put the deep learning models into the pipeline and finally prepare this pipeline for deployment. This is additional to the other research activities that we do.

Is this what you wanted to do when you studied? Or was it an opportunity that came to you, and you took it?

[Thinks a little moment] It's an opportunity that came to me, and I took it. This has more or less been happening throughout my professional path. I think it's normal for opportunities to come our way, and it's important to recognize them and grab them. Recognize those that speak to you, that are close to your spirit.

During your studies, what did you think you would be doing when you grew up?

Ahh, it's a very good question!

Thank you. I didn’t come for nothing. [both laugh]

Well, deep learning as a mainstream activity is relatively new. It comes from signal processing, but this was not my specialization when I was studying. At the core, I'm a mathematician. You can think about this as being relatively far from what I do because I was doing linear algebra, and my PhD is also on linear algebra. But then, slowly, I drifted towards more applied skills, which is how I came to where I am today.

So it's not true that women are weaker in mathematics, or are you special?

[laughs] No, I really don't think that I'm special. I honestly don't think that women are weaker in mathematics. However, I think we have to talk about the point that we are coming from. We're coming from the point that there is enough of the existing bias of what women should be occupied with and the lack of the examples of the women researchers. That's why the interviews that you do are so important. They provide examples to other women and young girls to broaden their spectrum of possibilities and realize, yes, I can do this. This is possible for me!

You’ve told us something about the present and something about the past. Let’s speak about the future. Where are you planning to go?

Currently, I'm increasing the amount of research activities in my day-to-day work. This is my current vector of development. But where it will bring me, I don't know for now. I do know that this is what I am enjoying doing, and this is important for me.

Can you be a researcher all your life?

[hesitates a moment] Hopefully. If we're talking from the mathematician's point of view, there is this preconception that mathematicians usually are most fruitful in their 20s, maybe 30s. Then, after this, there is some sort of decline in activity.

I never heard that. That would be terrible if it were true.

[laughs] This is a conception that I have heard, and I'm not sure if there are actually some sort of statistics regarding this. But in one form or another, I would like to continue doing research as much as possible. Because for me, one of my main drives is curiosity. That's what makes research appealing to me. I don't think this curiosity is going to go away with time.

Are you curious about learning new things to progress yourself or to make progress in science? What is your drive?

I'm not that ambitious to think that I'm going to push science forward. For me, it’s to discover things for myself or for the team, even if it's a small thing. I also enjoy seeing the applied results of the research that we do, because I believe that deep learning is the latest wave of automation and industrialization. The final goal is to give all the repetitive tasks to the machine, so we as humans can enjoy more creative tasks or just leisure time.

You just brought us to my next question!

[laughs] Please go ahead.

What has been your greatest success so far that you are most proud of?

If we're talking about automation, I was the person responsible for training and testing the model that right now does the vehicle classification according to required payment at the tolls all over France. It means that every day, literally hundreds of thousands of vehicles are being classified using the computer vision models that I have trained. So, I'm at least partial to the final product, and it means less of a repetitive job for the operators. Before, there was a need for the operator because physical sensors were not able to capture the differences between some classes, so humans had to check this. This is a job very similar to labeling. If you ever did the labeling of images and videos, you know how tough it actually is. You have to do it hour after hour after hour; it's quite tough. So right now, I'm super happy that a machine can do it instead of a human.

What will humans do instead?

Something else. [laughs] Hopefully, something more pleasant or maybe more useful.

That means you are not in the group of those who are scared of artificial intelligence taking too much space in our lives.

In our workload? No, I don't think so. First of all, as a person working with AI every day, I think I understand pretty well the limitations that it has. Definitely, it cannot replace humans, but it's just a tool. It's a powerful tool that enables you to do things faster, to do things better. And am I worried about AI in regard to life? Maybe to some extent. Sometimes, when I see some things, I think, do I want this for myself or for my family? The answer is no. But again, it's rather a personal choice. For example, I think a couple of years ago, I saw this prototype of an AI-powered toy for really young kids that can communicate with the kid, etc. And honestly, I am not sure that this is something that I would like for my kids. I don't think that we are at the point that it's A, really safe, and B, I think it might be a little bit early for the child to present this to them. It might create some sort of confusion between live beings and AI toys. But again, this is just my personal opinion, and here everyone chooses for themselves.

Nadiya, up until now, our chat has been fascinating. My next topic may be more painful. You are Ukrainian, and you do not live in Ukraine. How much do you miss Ukraine, and what can you tell us about how you have been living the past 18 months?

[hesitates for a moment] You feel inside as if you are split in two. Because for me, I live in France, and I have to continue functioning normally every day. I go to work, I spend time with my family, I smile at people, etc. Then there's a second half that reads news or gets messages from friends and family that are facing the horror and the tragedy and the pain of war. Of course, it cannot be even closely compared to people's experience who are in Ukraine right now. But I believe there is no Ukrainian in the world that is not affected by the war. How can life go on when somebody is burning down your house? I honestly don't know. But it has to, as you cannot just stop doing whatever you are doing and say I'm going to wait until the war is over.

Can you really say, okay, business as usual? Sometimes, don't you feel the whole world should stop and say, hey, come on, this can’t go on?

[hesitates for a moment] I wish it could be like this, but it's not like this. We have to do our best in this situation that we are in.

Do you know, Nadiya, that one and a half years ago CVPR passed a resolution condemning the invasion of Ukraine and offering solidarity and support to the people of Ukraine? You enjoy a lot of sympathy in this community. Can you tell all of us what you expect from us to make things easier for you?

I do feel the support of the research community, and I appreciate a lot the work that they are doing. It means a lot to me personally, and I'm sure that it means a lot also for other Ukrainians. Being seen and heard is one of the basic human needs, particularly in the worst situation that we are in right now. To use our conversation as a stage, I think that the best the community can do is to provide support to Ukrainian researchers, particularly for those who are right now staying in Ukraine. For collaborations and projects, this is probably the best approach.

Do you have anything else to tell the community?

[hesitates for a moment] Sure, it’s not work-related, but what I want to say is that a couple of days ago, I was reading a book. There was a quote in the book that I liked a lot that I'd like to share: “There are no big breakthroughs. Only a series of small breakthroughs.” And I'm saying this to support young researchers, particularly young girls. Just continue there, and you're going to achieve. Right? This is also my word of support to all Ukrainians who are going to read this. Every day brings us closer to victory.

Poster Presentation

Identification of Systematic Errors of Image Classifiers on Rare Subgroups

Jan Hendrik Metzen is a senior expert at the Bosch Center for Artificial Intelligence. His paper on auditing image classifiers to identify systematic errors has been accepted as a poster. He speaks to us ahead of his presentation this morning.

The performance of image classifiers often hinges on their ability to correctly identify images across a wide range of scenarios, including those that may not have been encountered during training. In this paper, Jan proposes a novel approach to identify systematic errors in image classifiers, recognizing the importance of ensuring these models perform well even on rare corner cases.

“If you have an image classifier trained on some data distribution, often in the long tail of this data distribution, there are cases which are not covered in the training or test data,” he points out. “In this case, the system could misclassify the images, which goes unnoticed because these cases are also not covered in the validation set.”

This problem becomes particularly significant when deploying computer vision models in real-world scenarios where unseen situations may arise. Before shipping these models to the field, it is important to test that they would also do well on these corner cases. Jan highlighted two critical motivations for this: safety and fairness.

“In terms of safety, if we have an autonomous system, it must behave safely on corner cases,” he asserts. “In terms of fairness, if we have demographic minority groups underrepresented in training and test data, we want the system to perform well on those groups.”

Figure: PromptAttack identifies the subgroup “rear views of small orange minivans in front of snowy forest” as systematic error of a VGG16, which misclassifies 25% of the corresponding samples as snowplows (not as minivans). A ConvNeXt-B classifies the same samples with 99% accuracy.

What sets this research apart is its innovative approach to identifying systematic errors by introducing the concept of an operational design domain. This domain encompasses all the scenarios where the system should perform well. Using a text-to-image model like Stable Diffusion, it synthesizes images within this domain and rigorously tests systems against them, which has not been done before.

However, relying on text-to-image models presents a challenge because they occasionally produce images that do not align with the intended text prompts. Ensuring the faithfulness of these generated images is vital since they are being used to validate downstream systems, and it would not be possible to screen thousands of images manually. Jan had to perform moderate prompt engineering and assign specific classifiers to address this.
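The idea of describing an operational design domain as combinations of attributes, turning each combination into a text prompt, and measuring a classifier's error rate per subgroup can be sketched as follows. This is an illustrative outline only, not the PromptAttack procedure: the exhaustive enumeration, the attribute names, the prompt template and the `generate` and `classify` callables are hypothetical stand-ins (in practice `generate` would wrap a text-to-image model and `classify` the classifier under test), and the faithfulness check on generated images is omitted.

```python
from itertools import product

def enumerate_subgroups(domain):
    """List every attribute combination in the domain as a dict."""
    keys = list(domain)
    return [dict(zip(keys, values))
            for values in product(*(domain[k] for k in keys))]

def rank_subgroups(domain, template, generate, classify,
                   expected_label, n_samples=8):
    """Return (error_rate, prompt) pairs, worst subgroups first.

    `generate(prompt, seed)` and `classify(image)` are hypothetical
    callables supplied by the caller.
    """
    results = []
    for subgroup in enumerate_subgroups(domain):
        prompt = template.format(**subgroup)
        images = [generate(prompt, seed) for seed in range(n_samples)]
        errors = sum(classify(img) != expected_label for img in images)
        results.append((errors / n_samples, prompt))
    return sorted(results, reverse=True)

# Toy stand-ins so the sketch runs without any model: the "image" is
# just the prompt string, and the stub classifier fails on one subgroup.
def fake_generate(prompt, seed):
    return prompt

def fake_classify(image):
    if "orange" in image and "snowy" in image:
        return "snowplow"
    return "minivan"

domain = {"color": ["orange", "blue"],
          "background": ["snowy forest", "city street"]}
template = "a photo of a small {color} minivan in front of a {background}"
ranking = rank_subgroups(domain, template, fake_generate,
                         fake_classify, expected_label="minivan")
```

With real models, the subgroups at the top of `ranking` would be candidates for systematic errors; exhaustive enumeration grows quickly with the number of attributes, so a practical procedure would need a smarter search over combinations.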

“One thing we thought about a lot during this work was how we can evaluate procedures for systematic error identification in a quantitative way,” he says. “Often, other works just showcase examples of systematic errors, but in the end, you want to understand how reliable these procedures are and how often they find systematic errors.”

He found a controlled way to inject systematic errors into zero-shot classifiers based on CLIP and then check if the procedure could find them. Then, he could quantitatively assess how good the procedure is and tune its hyperparameters.

“Before we started, we didn’t know which systematic errors we’d find in the models because these were typically strong models, close to state of the art, which performed very well on a validation set,” Jan recalls. “People said they were close to human performance. Then we identified some systematic errors, which were obviously wrong to a human, but the system would make the same error over and over again!”

One specific example was a rear view of an orange minivan in a snowy scene that was often misclassified as a snowplow, despite looking nothing like one.

Regarding future possibilities for this work, Jan is optimistic that stronger text-to-image models will be released and that further domains could be explored. “Overall, we hypothesize that these models get better over time,” he tells us. “Stable Diffusion v1.5 was state of the art back then. Even if these models have shortcomings now, our approach will automatically benefit from progress in text-to-image models in the future.”

To learn more about Jan’s work, visit his poster this afternoon at 14:30-16:30.

Figure: PromptAttack identifies the subgroup “old male African person with long hair” as systematic error of a Mixer-B/16, which misclassifies 25% of the corresponding samples as apes (not as humans). A Mixer-L/16 classifies the same samples with 97% accuracy.

Posters

Ivan Reyes-Amezcua is a PhD student in Computer Science at CINVESTAV, Mexico. He is researching adversarial robustness in deep learning systems and developing defense mechanisms to enhance the reliability of models. He presented his poster at the LatinX workshop, demonstrating how subtle changes to images can fool a model: shifting its confidence from identifying an image as a pig to confidently labeling it as an airliner.

Laura Hanu and Anita L Verő are both Machine Learning Research Engineers at Unitary, a startup building multimodal contextual AI for content moderation. Laura told us that in this work, they demonstrate for the first time that LLMs like GPT-3.5/Claude/Llama 2 can be used to directly classify multimodal content like videos in-context with no training required. "To do this," she added, "we propose a new model-agnostic approach for generating detailed textual descriptions that capture multimodal video information, which are then fed to the LLM along with the labels to classify. To prove the efficacy of this method, we evaluate our method on action recognition benchmarks like UCF-101 and Kinetics-400."