Computer Vision News - November 2023

A publication by RSIP Vision | November 2023 | BEST OF ICCV | BEST OF MICCAI | Yann LeCun Exclusive Interview

Exclusive Interview

Yann LeCun was a keynote speaker at MICCAI 2023. He was kind enough to give a second interview to Ralph during his visit to ICCV 2023 in Paris.

Yann, thank you very much for being with us again. When we talked five years ago, you told me you had a clear plan for the next few years. Did you stick to it?

The plan hasn't changed very much – the details have changed, and we've made progress, but the original plan is still the same. The original plan rested on the observation that the limitation of current AI systems is that they're not capable of understanding the world. You need a system that can understand the world if you want it to be able to plan. You need to imagine in your head what the consequences of your actions might be, and for this, you need a world model. I've been advocating for this for a long time. This is not a new idea. The concept is very old, from optimal control, but using machine learning to learn the world models is the big problem. Back when we talked, I can't remember if I'd made the transition between what I called latent variable generative models and what I'm advocating now, which I call JEPA – joint embedding predictive architectures. I used to think that the proper way to do this would be to train a system on videos to predict what will happen in the video, perhaps as a consequence of some action being taken. If you have

a system that can predict what's going to happen in the video, then you can use that system for planning. I've been playing with this idea for almost 10 years. We started working on video prediction at FAIR in 2014/15. We had some papers on this. Then, we weren't moving very fast. We had Mikael Henaff and Alfredo Canziani working on a model of this type that could help plan a trajectory for self-driving cars, which was somewhat successful. But then, we made progress. We realized that predicting everything in a video was not just useless but probably impossible and even hurtful. I came up with this new idea, derived from experimental results: if you want to use self-supervised learning on images to train a system to learn good representations of images, the generative methods don't work. Those methods are based on essentially corrupting an image and then training a neural network to recover the original image. Large language models are trained this way. You take a text, corrupt it, and then train a system to reconstruct it. When you do this with images, it doesn't work very well. There are a number of techniques to do this, but they don't work very well. The most successful is probably MAE, the masked autoencoder. Some of my colleagues at Meta did that. What really works are those joint embedding architectures. You take an image and a corrupted version of the image, run them through encoders, and train the encoders to produce identical representations for those two images, so that the representation produced from the corrupted image is identical to that from the uncorrupted image. In the case of a video, you take a segment of video and the following segment, you run them through encoders, and you want to predict the representation of the following segment from the representation of the previous segment. It's no longer a generative model because you're not predicting all the missing pixels; you're predicting a representation of them. The trick is, how do you train something like this while preventing it from collapsing? It's easy for this system to collapse, ignore the input, and always predict the same thing. That's the question.

So, we did not get to solve the exact problem we wanted?

It was the wrong problem to solve. The real problem is to learn how the world works from video. The original approach was a generative model that predicts the next video frames. We couldn't get this to work. Then, we discovered a bunch of methods that allow one of those joint embedding systems to learn without collapsing.

"This is, I think, the future of AI systems. Computer vision has a very important role to play there!" – Yann LeCun
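The joint embedding recipe LeCun sketches can be compressed into a few lines. What follows is only an illustrative sketch in PyTorch, not code from FAIR: the encoder and predictor are stand-in modules, and the stop-gradient on the target branch is one of the simple collapse-avoidance tricks (as in BYOL) rather than the full machinery of VICReg or I-JEPA.

import torch
import torch.nn.functional as F

def joint_embedding_loss(encoder, predictor, x_context, x_target):
    # encode both views, e.g., a video segment and the segment that follows it
    z_context = encoder(x_context)
    with torch.no_grad():               # stop-gradient on the target branch,
        z_target = encoder(x_target)    # one simple way to discourage collapse
    # predict the target representation from the context representation
    z_pred = predictor(z_context)
    # regress in representation space rather than pixel space
    return F.mse_loss(z_pred, z_target)

Real systems add more (a momentum target encoder, or variance and covariance regularization as in VICReg), but the architecture – two encoders plus a predictor trained in representation space – is the part that distinguishes this from a generative model.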

There are a number of those methods. There's one called BYOL from DeepMind – Bootstrap Your Own Latent. There are things like MoCo. There have been a number of contrastive methods to do this. I probably had the first paper on this in 1993, on a Siamese neural network. You train two identical neural nets to produce identical representations for things you know are semantically identical and then push away the outputs for dissimilar things. More recently, there's been some progress with the SimCLR paper from Google. Then, I became somewhat negative about those contrastive methods because I don't think they scale very well. A number of non-contrastive methods appeared about four years ago. One of them is BYOL. Another one, which came from my group at FAIR, is called Barlow Twins, and there are a number of others. Then, we came up with two other ones called VICReg and I-JEPA, or Image JEPA. Another group at FAIR worked on something called DINOv2, which works amazingly well. Those are all different ways of training a joint embedding architecture with two parallel networks and predicting the representation of one from the representation of the other. DINOv2 is applied to images, VICReg to images and short videos, I-JEPA to images, and now we're working on something called V-JEPA, or Video JEPA, a version of this for video. We've made a lot of progress. I'm very optimistic about where we're going.

You have long been a partisan of the double affiliation model. Would you suggest young people today consider a career with hats in both academia and industry, or would your advice for this generation be a little bit different?

I wouldn't advise young people at the beginning of their career to wear two hats of this type because you have to focus on one thing. In North America, if you go into academia, you have to focus on getting tenure. In Europe, it's different, but you have to focus on building your group, your publications, your students, your brand, your research project. You can't do this if you split your time.

Yann's interview with Ralph in 2018

Once you're more senior, then it's a different thing. Frankly, it's only in the last 10 years that I've been straddling the fence, in a situation where I'm pretty senior and can choose what I want to work on. At FAIR, we don't take part-time researchers who are also faculty if they're not tenured. Even among the tenured, we tend only to take people who are quite senior and well established, and sometimes only for a short time, for a few years or something like that. It's not for everyone. It depends on which way you want to have an impact and whether you like working with students. In industry, you tend to be more hands-on, whereas in a university, you generally work through students. There are pluses and minuses.

You are one of the well-known scientists in our community who does not shy away from talking to younger and less experienced people on social media, in articles, and at venues like ICCV and MICCAI. Do you also learn from these exchanges?

The main reason for doing it is to inspire young people to work on interesting things. I've been here at ICCV for about an hour and a half, and about 100 people came to take selfies with me. I don't turn them down because they're so enthusiastic. I don't want to disappoint them. I think we should encourage enthusiasm for science and technology in young people. I find that adorable. I want to encourage it. I want to inspire people to work on technology that will improve the human condition and make progress in knowledge. That's my goal. It's very indirect. Sometimes, those people get inspired. Sometimes, that puts them on a good trajectory. That's why I don't shy away. There are a lot of exchanges about the potential benefits and risks of AI, for example. The discussions I've had on social media about this have allowed me to think about things I didn't think of spontaneously and to answer questions I didn't know people were asking themselves. It makes my argument better to have these discussions on social media and to have them in public as well. I've held public debates about the risks of AI with various people, including Yoshua Bengio and people like that. I think it's useful. Those are the discussions we need to have between well-meaning, serious people. The problem with social media is that there's a lot of noise and people who don't know anything. I don't think we should blame people for not knowing; I think we should blame people for being dishonest, not for not knowing things. I'm a professor. My job is to educate people. I'm not going to blame them for not knowing something!

You started in a place where you

knew every single scientist in your field. Now, you are meeting thousands and cannot learn all their names. What is your message to our growing community?

A number of different messages. The first one is that there are a lot of applications of current technologies where you need to tweak an existing technique and apply it to an important problem. There's a lot of that. Many people who attend these conferences are looking for ideas for the applications they're interested in: medicine, environmental protection, manufacturing, transportation, etc. That's one category of people – essentially AI engineers. Then, some people are looking for new methods because we need to invent new methods to solve new problems. Here's a long-term question. The success we've seen in natural language manipulation and large language models – not just generation but also understanding – is entirely due to progress in self-supervised learning. You train some giant transformer to fill in the blanks missing from a text. The special case is when the blank is just the last word. That's how you get autoregressive LLMs. Self-supervised learning has been a complete revolution in NLP. We've not seen this revolution in vision yet. A lot of people are using self-supervised learning. A lot of people are experimenting with it. A lot of people are applying it to problems where there's not that much data, so you need to pre-train on whatever data you have available, or on synthetic data, and then fine-tune on whatever data you have. So, there has been some progress in imaging. I'm really happy about this because I think that's a good thing, but the successful methods aren't generative. The kind of methods that work in these cases aren't the same kind of methods that work in NLP. In my opinion, the idea that you're going to tokenize your video or learn to predict the tokens is not going anywhere. We have to develop specific techniques for images because images and video are considerably more complicated than language. Language is discrete, which makes it simpler, particularly when it comes to handling uncertainty. Vision is very challenging. We've made progress. We have good techniques now that do self-supervised learning from images. The next step is video. Once we figure out a recipe to train a system to learn good representations of the world from video, we can also train it to learn predictive world models: here's the state of the world at time T; here's an action I'm taking; what's going to be the state of the world at time T+1? If we have that, we can have machines that can plan, which means they can reason and figure out a sequence of actions to arrive at a goal. I call this objective-driven AI. This is, I think, the future of AI systems. Computer vision has a very important role to play there. That's what I'm working on. My research is entirely focused on this!
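The T-to-T+1 loop LeCun describes maps onto a few lines of code. The following is only a schematic sketch of such a planning loop, not any real system: the world model, encoder, and candidate action sequences are all hypothetical stand-ins.

import torch

def plan(world_model, encoder, obs, candidate_action_seqs, goal_state):
    s = encoder(obs)                        # state of the world at time T
    best_seq, best_cost = None, float("inf")
    for actions in candidate_action_seqs:   # e.g., randomly sampled sequences
        s_t = s
        for a in actions:                   # predict state at T+1, T+2, ...
            s_t = world_model(s_t, a)
        cost = torch.norm(s_t - goal_state)  # distance to the goal in state space
        if cost < best_cost:
            best_seq, best_cost = actions, cost
    return best_seq                          # action sequence that best reaches the goal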

Inspiring (and surprising) slides from Yann LeCun's keynote speech at MICCAI.

ICCV Best Student Paper Award

Tracking Everything Everywhere All At Once

Qianqian Wang is a postdoc at UC Berkeley. She recently completed her PhD in Computer Science at Cornell Tech. She speaks to us about her work on estimating motion from video sequences ahead of her oral presentation and poster this afternoon. Read our full review of her winning work in the next pages!

This exceptional work has just won the Best Student Paper Award at ICCV 2023. The interview was conducted before the announcement of the award. RSIP Vision continues a long tradition of selecting future award-winning papers in advance for a full feature! Congrats, Qianqian!

In this paper, Qianqian proposes a novel optimization method for estimating the complete motion of a video sequence. It presents a dense and

long-range motion representation that allows for tracking through occlusions and modeling full-length trajectories. The method finds correspondences between frames, a fundamental problem in computer vision. These correspondences are the foundation for various applications, notably dynamic 3D scene reconstruction, as understanding 2D correspondences between frames in a dynamic scene is essential for reconstructing its 3D geometry and 3D motion. The research also opens up exciting possibilities for video editing, allowing edits to be propagated seamlessly across multiple frames.

"I came up with this idea because, in my last project, I realized there was no such motion representation in the past," Qianqian tells us. "It's not a new problem, but people don't have a good solution. The last paper I saw that was similar to our work was from 10 years ago; because the problem is so challenging, and people didn't have new tools to work on it, progress was suspended for a decade."

Now, renewed interest in this problem has sparked concurrent research. While approaches may differ, the shared goal remains the same – tracking points in a video

over extended periods. However, the road to achieving that is not without its challenges.

"The first challenge was to formulate the problem because it's different from what most people did before," Qianqian explains. "We have sparse feature tracking, which gives you long-range correspondences, but they are sparse. On the other hand, we have optical flow, which gives you dense correspondences, but only for a very short period of time. What we want is dense and long-range correspondences. It took a little bit of time to figure that out."

An important moment in the project was realizing the need for an invertible mapping. Without it, the global consistency of the estimated motion trajectories could not be guaranteed. It was then a challenge to determine how to represent the geometry.
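The invertible mapping is what makes correspondences cycle consistent: mapping a point forward and then backward must return it to its starting position. The paper obtains this property by construction; purely as a hedged illustration of what the property means, here is how one could measure cycle consistency post hoc for ordinary flow-based correspondences (the dense flow fields of shape (H, W, 2) and integer query coordinates are assumptions, not the paper's code).

import numpy as np

def cycle_error(flow_ab, flow_ba, xs, ys):
    # xs, ys: integer pixel coordinates of query points in frame A
    fwd = flow_ab[ys, xs]                                  # displacement A -> B
    xb = np.clip((xs + fwd[:, 0]).round().astype(int), 0, flow_ba.shape[1] - 1)
    yb = np.clip((ys + fwd[:, 1]).round().astype(int), 0, flow_ba.shape[0] - 1)
    bwd = flow_ba[yb, xb]                                  # displacement B -> A
    # residual after the round trip; zero everywhere means cycle-consistent
    return np.hypot(fwd[:, 0] + bwd[:, 0], fwd[:, 1] + bwd[:, 1])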

Parameterizing the quasi-3D space was far from straightforward, which led the team to explore the concept of the neural radiance field, a dense representation offering the flexibility needed to optimize the scene structure and the mapping between each local frame and the canonical frame. The work opens up opportunities for future extensions, including using similar principles for reconstructing dynamic scenes and enhancing video editing techniques with speed and efficiency improvements. "Compared to other correspondence work, our approach guarantees cycle consistency," Qianqian points out. "We're mapping it to 3D space, which allows it to handle occlusion. That's a nice property because most works on motion estimation remain in 2D. They don't build a consistent 3D representation of the scene to track."

Qianqian is originally from China but has been in the US since starting her PhD in 2018 and says it is a "very welcoming and inclusive" environment. Her advisors on this project at Cornell Tech were Noah Snavely and Bharath Hariharan. "Noah is the most wonderful advisor in the world," she smiles. "He's super caring. He has very creative ideas and guided me through the whole process. We discussed along the way and then figured out the right formulation for the problem. He pointed me to important papers that inspired me to work on this. He's super helpful, and I appreciate his guidance!"

In her new position at UC Berkeley, Qianqian works with two exceptional professors, who are also great friends of our magazine: Angjoo Kanazawa and Alyosha Efros. She is in a transition stage but plans to continue working on motion estimation, 3D reconstruction, and video understanding, particularly fine-grained and deep video understanding. She adds that if we better understand motion in a video, we'll better understand higher-level information, like semantic information. Where does she see herself in 10 years? "That's a very hard question to answer," she ponders. "I still want to do research and contribute to the field of computer vision. I hope to find a faculty position in a university and stay in academia, but if that doesn't work out, I'm also fine with finding a research position in industry. I'll keep doing research. That's something I know!"

Best Demo at ICCV
by Doris Antensteiner

The Inline Microscopic 3D Shape Reconstruction is the winner of the ICCV 2023 Best Demo Award for its contribution to computer vision and industrial inspection. The demo was presented by scientists from the Austrian Institute of Technology (AIT): Christian Kapeller, Lukas Traxler, and Doris Antensteiner (in the photo with Ralph, from left to right). It showcased a novel optical microscopic inline 3D imaging system, which can be used by future industries for micro-scale object analysis.

Our innovation resulted from a computer vision research project aimed at retrieving accurate 3D shape at micro-scale for industrial inspection applications. Our system fuses light-field imaging with photometric stereo to simultaneously capture detailed 3D shape information, object texture, and photometric stereo characteristics. The integration of light-field imaging and photometric stereo offers a holistic approach to 3D shape reconstruction. Light-field imaging captures the angular information of the incoming light, allowing for a multi-view perspective of the object. Photometric stereo complements this by analyzing the way light interacts with the object's surface, providing crucial information about surface normals and reflectance properties. A notable feature of our system is its ability to perform 3D reconstructions without relying on traditional scanning or stacking processes, setting it apart from technologies like confocal scanning or focus stacking.

This functionality allows for data acquisition during continuous object motion, making it versatile for applications like inspecting moving parts on a production line or in roll-to-roll manufacturing processes. Traditional methods, such as confocal scanning or focus stacking, involve capturing multiple images from varying focal depths or perspectives and subsequently combining them to generate a 3D model. These techniques can be time-consuming and less suitable for dynamic or moving objects. In contrast, the Inline Microscopic 3D Shape Reconstruction system excels at capturing 3D data while the object is in continuous motion, eliminating the need for costly setup changes or interruptions.

In our demo, various samples were shown to cover a wide variety of scanning scenarios. The chosen samples are typically considered challenging for 3D inspection (metal surfaces, ball-grid arrays from integrated circuits, security prints, etc.). They were placed on a translation stage to simulate object motion during inspection. The system demonstrated its capabilities by capturing objects with a point-to-point distance of 700nm per pixel at an acquisition speed of up to 12mm per second, equivalent to 39 million 3D points per second.

Our Inline Microscopic 3D Shape Reconstruction system has the potential to make a great impact in the field of microscopic inspection across various industries. It reaches a high level of precision and efficiency in inspection processes and has a wide range of practical applications. Our innovation has the potential to enhance micro-scale object analysis and industrial inspection in real-world scenarios.

ICI microscopy setup, computing 3D depth and capturing RGB data with a lateral point-to-point distance of 700nm.
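To make the photometric component described above concrete: classic Lambertian photometric stereo recovers per-pixel surface normals from several images taken under known light directions. The sketch below is the textbook least-squares formulation only, assuming a Lambertian surface; it is a stand-in illustration, not AIT's actual pipeline.

import numpy as np

def estimate_normals(images, light_dirs):
    # images: (K, H, W) intensities under K known lights; light_dirs: (K, 3)
    # Lambertian model: i = L @ (albedo * n), solved per pixel by least squares
    K, H, W = images.shape
    I = images.reshape(K, -1)                               # (K, H*W)
    G, *_ = np.linalg.lstsq(light_dirs, I, rcond=None)      # (3, H*W)
    albedo = np.linalg.norm(G, axis=0)                      # per-pixel albedo
    normals = G / np.maximum(albedo, 1e-8)                  # unit surface normals
    return normals.reshape(3, H, W), albedo.reshape(H, W)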

FunnyBirds: A Synthetic Vision Dataset for a Part-Based Analysis of Explainable AI Methods

ICCV Oral Presentation
by Robin Hesse

Robin Hesse is a third-year PhD student at the Technical University of Darmstadt under the supervision of Simone Schaub-Meyer and Stefan Roth. In his research, he works on explainable artificial intelligence, with a particular interest in intrinsically more explainable models and in how to evaluate explanation methods.

Today's strong performance of deep neural networks coincides with the development of increasingly complex models. As a result, humans cannot easily understand how these models work and therefore place only limited trust in them. This makes their application in safety-critical domains such as autonomous driving or medical imaging difficult. To counteract this issue, the field of explainable artificial intelligence (XAI) has emerged, which aims to shed light on how deep models work. While numerous fascinating methods have been proposed to improve the explainability of vision systems, the evaluation of these methods has often been limited by the absence of ground-truth explanations. This naturally leads to the lingering question: "How do I decide which explanation method is most suited for my specific application?", which is the motivating question for our work. To answer it, the XAI community has so far resorted to proxy tasks that approximate ground-truth explanations. One popular instance is feature deletion protocols, where pixels or patches are incrementally removed to measure their impact on the model output and thereby approximate their importance. However, these protocols come with several limitations: they introduce out-of-domain issues that can interfere with the metric, they only consider a single dimension of XAI quality, and they operate on a semantically less meaningful pixel or patch level. The last point is especially important considering that explanations aim to support humans, and humans perceive images in semantically meaningful concepts rather than pixels.

Motivated by these limitations, our paper proposes a synthetic classification dataset that is specifically designed for the part-based evaluation of XAI methods. It consists of renderings of funny-looking birds of various 'species' on which 'semantically meaningful' image-space interventions can be performed to approximate ground-truth explanations. Following a similar idea as the feature deletion protocols above, the dataset makes it possible to delete individual parts of the birds, e.g., their beak or feet, and to measure how much the output of the model drops. If a deleted part causes a large drop in the output confidence, one can assume that this part is more important than one that only causes a minor drop (Fig. 1). This moves the evaluation from the pixel level to a semantically more meaningful part level, and, as the training set now includes images with deleted parts, all interventions can be considered in-domain.

Fig 1. Removing individual bird parts and measuring the output change allows approximating ground-truth importances for each part. In this example, the beak is more important than the feet.

To thoroughly analyze various aspects of an explanation, the FunnyBirds framework considers three dimensions of XAI quality and two dimensions of model quality (Fig. 2).
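In code, the part-deletion protocol boils down to comparing class confidences with and without each part. This is only a hedged sketch of that idea: `render_bird` is a hypothetical stand-in for the FunnyBirds renderer, and the scoring in the actual framework is more involved.

import torch

@torch.no_grad()
def part_importance(model, bird_spec, parts, target_class):
    # confidence for the intact bird
    full = model(render_bird(bird_spec, removed=[]))
    base = full.softmax(-1)[0, target_class]
    scores = {}
    for p in parts:                                        # e.g., "beak", "feet"
        out = model(render_bird(bird_spec, removed=[p]))
        drop = base - out.softmax(-1)[0, target_class]     # confidence drop
        scores[p] = drop.item()                            # larger = more important
    return scores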

Various interesting findings were made using the proposed FunnyBirds framework. First, architectures that were designed to be more interpretable, such as BagNet, often achieve higher metrics than the corresponding standard backbone networks. Second, the VGG16 backbone appears to be more explainable than the ResNet-50 backbone, indicating that different architectures or model components are more or less explainable. Third, the ranking of XAI methods may change across different backbones, so it is crucial to consider multiple backbones in future evaluations of XAI methods.

Fig 2. (left) Example image from the FunnyBirds dataset. (center) BagNet explanation for the left image. (right) Quantitative results for the examined BagNet model in the FunnyBirds framework.

Left: supervisor Simone Schaub-Meyer

ECCV Milano 2024

Dear Computer Vision community,

It was at CVPR 2019 that Vitto, Andrew, and I sat down at a coffee shop and started brainstorming about crazy places where we could organize ECCV in 2024. A year later we presented the bid, and three years later here we are, full steam ahead toward what will hopefully be a fun, exciting, and engaging ECCV. We will meet in less than a year in Milano, Italy, capital of fashion and spritz, and now also of computer vision. The Program Chairs – Olga, Elisa, Gül, Torsten, Ales, and Stefan – have been working non-stop to ensure that our scientific program will be innovative and extremely interesting. We are looking forward to the event, and we hope that you will enjoy every minute of what we have prepared for you! Be ready to be surprised!

Ci vediamo a Milano!
Laura

Let's all have a spritz with Laura in Milano! NB: awesome Laura Leal-Taixé is a co-General Chair of ECCV 2024.

ICCV Workshop

Quo Vadis, Computer Vision?

"Quo Vadis, Computer Vision?" means "Where are you headed, Computer Vision?". That's the name of a very successful workshop at ICCV 2023, showcasing a fantastic line-up of speakers. Did you miss it? Awesome Georgia got you covered!

by Georgia Gkioxari

We stand at a pivotal juncture. The past two years have been an exhilarating ride, brimming with innovation and creativity. The dawn of Generative AI (the author of this piece loves some drama!) has ushered in an epoch few could have foreseen just three years prior. Anyone claiming the contrary is with high certainty lying! Amidst the exhilaration, there's discernible concern regarding the direction of computer vision research. The industry's aggressive investments, both in talent and in computing power, signal a rush to capitalize on the latest technological advances. This surge is welcome; it offers more opportunities for our community members and is nothing but a sign of a healthy field. But it simultaneously instills in many a sense of uncertainty about their next steps.

These concerns came under the spotlight and were extensively discussed at the "Big Scholars" workshop during CVPR, sparking debates about the trajectory of academic versus industrial research and its implications for the future of computer vision. Arguably, our field's fast pace is instilling in our budding talents a sense of agony around how they can make their own significant mark in this new landscape of research. This is where our "Quo Vadis, Computer Vision?" workshop enters the scene, aspiring to guide

and galvanize young researchers in navigating this transformed research milieu. We've asked experts from diverse backgrounds and research foci to share their insights. We've posed to them an important question: "In today's landscape, what would you, as a grad student, focus on?". Many of us, including the organizers of this workshop, are staunch in our belief that countless challenges in CV await solutions. To put it another way, the most crucial problems remain unconquered. But we are concerned that this sentiment isn't universally shared by our emerging scholars. We are optimistic that our workshop will inspire them to think, delve deeper, and find their place in this ever-evolving landscape of computer vision.

"… countless challenges in CV await solutions. To put it another way, the most crucial problems remain unconquered…"

"… how they could make their own significant mark in this new landscape of research…"

UKRAINE CORNER

Paris supports Ukraine

ICCV's sister conference CVPR adopted a motion with a very large majority, condemning in the strongest possible terms the actions of the Russian Federation government in invading the sovereign state of Ukraine. One day before ICCV in Paris, we decided to involve the Eiffel Tower and the Mona Lisa. Photo credit to the awesome Doris Antensteiner!

Ralph's photo on the right was taken in lovely, peaceful and brave Odessa, Ukraine.

Computer Vision News. Editor: Ralph Anzarouth. Publisher: RSIP Vision. Copyright: RSIP Vision. All rights reserved. Unauthorized reproduction is strictly forbidden.

Women in Computer Vision

UKRAINE CORNER

Nadiya Shvai is currently a Senior Data Scientist responsible for research at Cyclope.AI.

Read 100 FASCINATING interviews with Women in Computer Vision!

Where shall we start? From Nadiya or from Cyclope.AI?

[laughs] Let's start with Cyclope.AI because I think we'll have plenty to talk about.

Perfect!

Cyclope.AI is a relatively small company that works on artificial intelligence-based solutions for smart road infrastructure and safety. For example, among the products we build is a security system for tunnels. You've probably heard about the accident in the Mont Blanc Tunnel that happened some years ago. After this, the regulations

for the safety of tunnels have been reinforced a lot in France. We are working on automating the system to make sure it is as fault-proof as possible while not generating a lot of false alarms, because a system that generates a lot of false alarms ends up not being useful at all.

What do you do there as a data scientist?

My work covers almost all aspects of deep learning product development: from data collection to data selection, to supervising the data labeling, to model training and testing. Then, we put the deep learning models into the pipeline and finally prepare this pipeline for deployment. This is in addition to the other research activities that we do.

Is this what you wanted to do when you studied? Or was it an opportunity that came to you, and you took it?

[Thinks a little moment] It's an opportunity that came to me, and I took it. This has more or less been happening throughout my professional path. I think it's normal for opportunities to come our way, and it's important to recognize them and grab them. Recognize those that speak to you, that are close to your spirit.

During your studies, what did you think you would be doing when you grew up?

Ahh, it's a very good question!

Thank you. I didn't come for nothing. [both laugh]

Well, deep learning as a mainstream activity is relatively new. It comes from signal processing, but this was not my specialization when I was studying. At the core, I'm a mathematician. You can think of this as being relatively far from what I do, because I was doing linear algebra, and my PhD is also on linear algebra. But then, slowly, I drifted towards more applied skills, which is how I came to where I am today.

So it's not true that women are weaker in mathematics, or are you special?

"Every day brings us closer to victory!"

[laughs] No, I really don't think that I'm special. I honestly don't think that women are weaker in mathematics. However, I think we have to talk about the point we are starting from. We are starting from a point where there is plenty of existing bias about what women should occupy themselves with, and a lack of examples of women researchers. That's why the interviews that you do are so important. They provide examples to other women and young girls, broadening their spectrum of possibilities and letting them realize: yes, I can do this. This is possible for me!

You've told us something about the present and something about the past. Let's speak about the future. Where are you planning to go?

Currently, I'm increasing the amount of research activity in my day-to-day work. This is my current vector of development. But where it will bring me, I don't know for now. I do know that this is what I enjoy doing, and this is important for me.

Can you be a researcher all your life?

[hesitates a moment] Hopefully. If we're talking from the mathematician's point of view, there is this preconception that mathematicians are usually most fruitful in their 20s, maybe 30s. Then, after this, there is some sort of decline in activity.

I never heard that. That would be terrible if it were true.

[laughs] This is a conception that I have heard, and I'm not sure if there are actually statistics about it. But in one form or another, I would like to continue doing research as much as possible. Because one of my main drives is curiosity. That's what makes research appealing to me. I don't think this curiosity is going to go away with time.

Are you curious about learning new things to progress yourself or to make progress in science? What is your drive?

I'm not that ambitious to think that I'm going to push science forward. For me, it's about discovering things for myself or for the team, even if it's a small thing. I also enjoy seeing the applied results of the research that we do, because I believe that deep learning is the latest wave of automation and industrialization. The final goal is to give all the repetitive tasks to the machine, so we as humans can enjoy more creative tasks or just leisure time.

You just brought us to my next question! [laughs] Please go ahead. What has been your greatest success so far, the one you are most proud of?

If we're talking about automation, I was the person responsible for training and testing the model that right now does the vehicle classification according to required payment at tolls all over France. It means that every day, literally hundreds of thousands of vehicles are being classified using the computer vision models that I have trained. So, I at least played a part in the final product, and it means less of a repetitive job for the operators. Before, there was a need for the operator because physical sensors were not able to capture the differences between some classes, so humans had to check this. This is a job very similar to labeling. If you have ever labeled images and videos, you know how tough it actually is. You have to do it hour after hour after hour; it's quite tough. So right now, I'm super happy that a machine can do it instead of a human.

What will humans do instead?

Something else. [laughs] Hopefully, something more pleasant or maybe more useful.

That means you are not in the group of those who are scared of artificial intelligence taking too much space in our lives.

In our workload? No, I don't think so. First of all, as a person working with AI every day, I think I understand pretty well the limitations that it has. Definitely, it

cannot replace humans; it's just a tool. It's a powerful tool that enables you to do things faster and better. And am I worried about AI in regard to life? Maybe to some extent. Sometimes, when I see some things, I think: do I want this for myself or for my family? The answer is no. But again, it's rather a personal choice. For example, a couple of years ago, I saw a prototype of an AI-powered toy for really young kids that can communicate with the kid, etc. Honestly, I am not sure that this is something I would like for my kids. I don't think we are at the point where it's, A, really safe, and, B, I think it might be a little bit early to present this to a child. It might create some sort of confusion between living beings and AI toys. But again, this is just my personal opinion, and here everyone chooses for themselves.

Nadiya, up until now, our chat has been fascinating. My next topic may be more painful. You are Ukrainian, and you do not live in Ukraine. How much do you miss Ukraine, and what can you tell us about how you have been living the past 18 months?

[hesitates for a moment] You feel inside as if you are split in two. For me, I live in France, and I have to continue functioning normally every day. I go to work, I spend time with my family, I smile at people, etc. Then there's a second half that reads the news or gets messages from friends and family who are facing the horror and the tragedy and the pain of war. Of course, it cannot be even closely compared to the experience of people who are in Ukraine right now. But I believe there is no Ukrainian in the world who is not affected by the war.

How can life go on when somebody is burning down your house?

I honestly don't know. But it has to, as you cannot just stop doing whatever you are doing and say: I'm going to wait until the war is over.

Can you really say, okay, business as usual? Sometimes, don't you feel the whole world should stop and say, hey, come on, this can't go on?

[hesitates for a moment] I wish it could be like this, but it's not. We have to do our best in the situation that we are in.

Do you know, Nadiya, that one and a half years ago CVPR passed a resolution condemning the invasion of Ukraine and offering solidarity and support to the people of Ukraine? You enjoy a lot of sympathy in this community. Can you tell all of us what you expect from us to make things easier for you?

I do feel the support of the research community, and I appreciate a lot the work that they are doing. It means a lot to me personally, and I'm sure that it means a lot to other Ukrainians as well. Being seen and heard is one of the basic human needs, particularly in the worst situation, which we are in right now. To use our conversation as a stage: I think the best the community can do is to provide support to Ukrainian researchers, particularly those who are staying in Ukraine right now. Collaborations and projects are probably the best approach.

Do you have anything else to tell the community?

[hesitates for a moment] Sure. It's not work-related, but a couple of days ago, I was reading a book, and there was a quote in it that I liked a lot and would like to share: "There are no big breakthroughs. Only a series of small breakthroughs." I'm saying this to support young researchers, particularly young girls. Just continue, and you're going to achieve. Right? This is also my word of support to all Ukrainians who are going to read this. Every day brings us closer to victory.

"I do feel the support of the research community, and I appreciate a lot the work that they are doing. It means a lot to me personally!"

Posters ☺

Ivan Reyes-Amezcua is a PhD student in Computer Science at CINVESTAV, Mexico. He is researching adversarial robustness in deep learning systems and developing defense mechanisms to enhance the reliability of models. He presented his poster at the LatinX workshop, demonstrating how subtle changes to images can fool a model: shifting its confidence from identifying an image as a pig to confidently labeling it as an airliner.

Laura Hanu (right) and Anita L Verő are both Machine Learning Research Engineers at Unitary, a startup building multimodal contextual AI for content moderation. Laura told us that in this work, they demonstrate for the first time that LLMs like GPT-3.5, Claude, and Llama 2 can be used to directly classify multimodal content like videos in-context, with no training required. "To do this," she added, "we propose a new model-agnostic approach for generating detailed textual descriptions that capture multimodal video information, which are then fed to the LLM along with the labels to classify. To prove the efficacy of this method, we evaluate it on action recognition benchmarks like UCF-101 and Kinetics-400."
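The model-agnostic recipe Laura describes reduces to prompt construction. The sketch below is purely illustrative and is not Unitary's code: `caption_frames` and `call_llm` are hypothetical placeholders for a captioning model and an LLM API.

def classify_video(video_frames, labels):
    # turn the video into text, e.g., per-frame captions
    descriptions = caption_frames(video_frames)
    # ask the LLM to pick one of the candidate labels in-context
    prompt = (
        "Video description:\n" + "\n".join(descriptions) +
        "\n\nChoose the single best action label from: " + ", ".join(labels) +
        "\nAnswer with the label only."
    )
    return call_llm(prompt).strip()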

NOVEMBER 2023
BEST OF MICCAI 2023
What's this? Find out on page 50!

Deep Learning for the Eyes
by Christina Bornberg @datascEYEnce

It is time for another deep learning in ophthalmology interview as part of the datascEYEnce column here in Computer Vision News magazine! I am Christina, and through my work with retinal images, I come across a lot of amazing research that I want to share with you! This time, I interviewed Robbie Holland from Imperial College London about his work on age-related macular degeneration (AMD)!

featuring Robbie Holland

Automated AMD biomarker discovery

Robbie's decision to get involved with deep learning for ophthalmology was influenced both by his interest in modelling complex systems and by reading a 2018 DeepMind publication: "Clinically applicable deep learning for diagnosis and referral in retinal disease". Following his undergrad in Mathematics and Computer Science and a project on abdominal MRI analysis, he started his PhD with a focus on the early detection of AMD under the supervision of Daniel Rueckert and Martin Menten in the BioMedIA lab at Imperial College London. Leaving behind a world of completing Jira tickets as a software engineer led him to work on finding disease trajectories for AMD, which he presented at last month's MICCAI in Vancouver! In case you missed it, I am covering the key points of "Clustering Disease Trajectories in Contrastive Feature Space for Biomarker Proposal in Age-Related Macular Degeneration" here!

Let's start with the application. Robbie explained to me the limitations of common grading systems for AMD: they lack prognostic ability. Simply said, it is unclear how long it will take until a patient transitions from an early stage to a late stage of AMD. Some patients progress more quickly than others.

To account for this, his research focuses on using deep learning for automatic temporal biomarker discovery – in other words, on clustering disease progression trajectories in a pre-trained feature space. The choice of a self-supervised approach, specifically contrastive learning, was made in order to identify trajectories, or more precisely sub-trajectories. Contrastive learning methods have shown their capability to autonomously learn disease-related features (including previously unknown biomarkers) without the need for clinical guidance. For the setup, a ResNet-50 backbone was trained on cost-effective yet informative OCT scans. The self-supervised BYOL contrastive loss makes it possible to train with positive pairs only. The decision to use this

specific loss function for a self-supervised learning task is based on his previous work (and by chance was emphasized by Yann LeCun!!). Now the ResNet-50 backbone can be used to extract features, which are subsequently projected into a feature space. The next step is clustering. Clustering sub-trajectories makes it possible to find common paths of disease progression among patients. In this work, spectral clustering was applied. Why not k-means? Because the distance function is not Euclidean: the trajectories have varying numbers of points, so it is difficult to find an average value to represent them.
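This is exactly why spectral clustering fits: it only needs a pairwise affinity matrix, so trajectories of different lengths can be compared with any distance. The sketch below is a hedged illustration of that setup, not the paper's code; `dtw_distance` is a hypothetical stand-in for whatever trajectory distance is used (dynamic time warping is one common choice).

import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_trajectories(trajectories, n_clusters, sigma=1.0):
    n = len(trajectories)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # any non-Euclidean distance between variable-length trajectories
            D[i, j] = D[j, i] = dtw_distance(trajectories[i], trajectories[j])
    A = np.exp(-D**2 / (2 * sigma**2))          # turn distances into affinities
    sc = SpectralClustering(n_clusters=n_clusters, affinity="precomputed")
    return sc.fit_predict(A)                     # cluster label per trajectory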

So, what is the final outcome? Robbie and his collaborators were surprised at how well contrastive learning can model human concepts. With the help of their deep learning pipeline, they were able to isolate patterns of disease progression that clinicians suspect to be related to AMD. In other words, the clinicians were able to relate the automatically generated clusters to known – and additionally to yet unknown – temporal biomarkers.

As a last question, I asked Robbie for advice for prospective PhD students. For everyone who also wants to pursue research in deep learning for ophthalmology, Robbie emphasized research in self-supervised learning applied to retinal images. It is an exciting field where you can try big new ideas – another very good example is RETFound, which was recently published by researchers from UCL/Moorfields. In general, there is a high demand for understanding eye-related diseases, and a lot of problems remain to be solved!!

More about AI for Ophthalmology

Do you enjoy this November issue of Computer Vision News? We are glad that you do! We have one more important community message for you. It's an advert for something free ☺ Just go to page 64, and you'll know. Keep in touch!

MICCAI Oral Presentation

An Explainable Geometric-Weighted Graph Attention Network for Identifying Functional Networks Associated with Gait Impairment

Favour Nerrise is a PhD candidate in the Department of Electrical Engineering at Stanford University under the supervision of Ehsan Adeli and Kilian Pohl. Her work proposes a new method to derive neuroimaging biomarkers associated with disturbances in gait for people with Parkinson's disease and has been accepted for oral and poster presentations. She spoke to us ahead of her presentations at MICCAI 2023.

Parkinson's disease, a neurodegenerative disorder affecting millions worldwide, has long been a focus of research to improve diagnostics and treatment. In this paper, Favour presents a method capable of deriving neuroimaging biomarkers associated with disturbances in gait – a common symptom in individuals with Parkinson's disease. However, the significance of this work extends beyond the laboratory. Favour is determined to make a tangible impact on clinical practice. "A big piece of the work is an explainability framework, meaning that it's not only computational for how other medical physicists can

use the work, but more importantly, we created visualizations that you can turn on and off through our pipeline that allow people with relevant clinical or neurological knowledge to interpret what our computational model is doing on the back end," she explains. "We hope that helps computational scientists, neuroscientists, or clinicians who want to adopt and expand this work translate it into clinical understanding."

One of the most significant challenges was the scarcity of clinical data, a common issue in research involving rare diseases like Parkinson's. Clinical datasets are typically small and imbalanced, with more healthy cases than diseased ones. To address this, Favour developed a novel statistical method for learning-based sample selection. This method identifies the most valuable samples in any given class for training and oversamples them to achieve a balanced representation across all classes. "There were some existing methods that did sample selection either randomly or through synthetic oversampling, but we thought it would be better to address this directly in a stratified way," she tells us. "Making sure there's equal representation of that strong sample bias across every class before applying an external method like random oversampling."

The method performed very well, surpassing existing approaches, including those that inspired Favour to take up the work in the first place, with up to 29% improvement in area under the curve (AUC). The success can be attributed to a solution comprising various techniques rather than a linear approach of only visualization explainability or sample selection. Favour plans to expand the research by evaluating it on a larger dataset, the Parkinson's Progression Markers Initiative (PPMI) database, and by exploring new methods involving self-attention and semi-supervised techniques.
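For readers who want the basic balancing idea in code: below is a minimal stratified random-oversampling baseline, the simple version of the step described above. It is only a sketch; Favour's method goes further by statistically scoring which samples in each class are most valuable before oversampling them.

import numpy as np

def stratified_oversample(X, y, seed=0):
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()                       # bring every class to this size
    idx = []
    for c in classes:
        members = np.flatnonzero(y == c)
        idx.append(members)                     # keep all original samples
        extra = target - members.size
        if extra > 0:                           # resample underrepresented classes
            idx.append(rng.choice(members, size=extra, replace=True))
    idx = np.concatenate(idx)
    return X[idx], y[idx]                       # balanced dataset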

"We also want to do something that I'm really passionate about, which is trying to identify cross-modal relationships between our various features," she reveals. "In this work, we focus on the neuroimaging side, taking the rs-fMRI connectivity matrices and then optimizing them using Riemannian geometry and leveraging some features. Now, I'm interested in combining some of the patient attributes and seeing how that could better inform linkages that could be learned during training, among other types of techniques we'll try."

Favour has several follow-up projects in the pipeline that promise to push the boundaries further, leveraging attention-based methods and geometric deep learning techniques. Beyond neuroimaging, she aims to incorporate video data from the same patients. This multimodal approach is a significant next step. She intends to derive specific motion biomarkers that can be associated with the existing features. This expansion aims to optimize the learning process and further enhance the understanding of those linkages. The ultimate goal is to combine all these modalities into a comprehensive framework that can be generalized to a broader population. She envisions creating a foundation model that can serve as a valuable resource for researchers and clinicians in various downstream tasks.

Research journeys are rarely smooth sailing, and Favour tells us a remarkable aspect of this one was the need for constant course correction in her coding efforts. As the deadline for MICCAI submission approached, the results were not exactly where she needed them to be. "It was weighing on my heart so heavily," she admits. "At first, we were even doing an approach of comparative binary classification and multi-class classification, and things just weren't making sense. Then, I just focused on the multi-class classification. Once I did that and started to look into how I could directly optimize my metrics, ensuring everything was weighted in my loss functions, sampling techniques, and all those things, we started to see consistent results that could be repeated over trials. I was so concerned about that because I'd get good results here and there, but I couldn't repeat them. I was so happy once it got stable enough to have reproducible results. That literally happened a few weeks before we were supposed to submit! I kept updating my paper every day until the night of submission."

Favour remains dedicated to her academic journey. With a couple of years to go until she completes her PhD, she is committed to ongoing research in the field and to leadership roles both on and off campus. As we wrap up our time together, she acknowledges the importance of the support she has received along the way. This work has been a significant milestone as the first self-owned project and paper she has released during her graduate school career. It now has the honor of being accepted as an oral at a prestigious conference like MICCAI. "I literally can't believe it!" she smiles. "To any other PhD students thinking, I don't know what I'm doing, does this even make sense? I don't understand what I'm writing. Just have faith in your work because when I read my paper, I'm like, did I write this?! Just have faith in the work you're doing, and somebody will love it, and you're absolutely brilliant, and it's going to be worth it!"
