ICCV Daily 2023 - Thursday

A publication by RSIP Vision

Exclusive Interview with Yann LeCun

Yann LeCun was so kind as to give another interview to Ralph at ICCV 2023 in Paris.

Yann, thank you very much for being with us again. When we talked five years ago, you told me you had a clear plan for the next few years. Did you stick to it?

The plan hasn't changed very much – the details have changed, and we've made progress, but the original plan is still the same. The original plan was based on the idea that the limitation of current AI systems is that they're not capable of understanding the world. You need a system that can understand the world if you want it to be able to plan. You need to imagine in your head what the consequences of your actions might be, and for this, you need a world model. I've been advocating for this for a long time. This is not a new idea. The concept is very old, from optimal control, but using machine learning to learn the world models is the big problem. Back when we talked, I can't remember if I'd made the transition between what I called latent variable generative models and what I'm advocating now, which I call JEPA, so joint embedding predictive architectures. I used to think that the proper way to do this would be to train a system on videos to predict what will happen in the video, perhaps as a consequence of some action being taken.

If you have a system that can predict what's going to happen in the video, then you can use that system for planning. I've been playing with this idea for almost 10 years. We started working on video prediction at FAIR in 2014/15. We had some papers on this. Then, we weren't moving very fast. We had Mikael Henaff and Alfredo Canziani working on a model of this type that could help plan a trajectory for self-driving cars, which was somewhat successful. But then, we made progress. We realized that predicting everything in a video was not just useless but probably impossible and even hurtful. I came up with this new idea derived from experimental results. The results are such that if you want to use self-supervised learning from images to train a system to learn good representations of images, the generative methods don't work. Those methods are based on essentially corrupting an image and then training a neural network to recover the original image. Large language models are trained this way. You take a text, corrupt it, and then train a system to reconstruct it. When you do this with images, it doesn't work very well. There are a number of techniques to do this, but they don't work very well. The most successful is probably MAE, the masked autoencoder. Some of my colleagues at Meta did that. What really works are those joint embedding architectures. You take an image and a corrupted version of the image, run them through encoders, and train the encoders to produce identical representations for those two images, so that the representation produced from the corrupted image is identical to that from the uncorrupted image. In the case of a video, you take a segment of video and the following segment, you run them through encoders, and you want to predict the representation of the following segment from the representation of the previous segment. It's no longer a generative model because you're not predicting all the missing pixels; you're predicting a representation of them. The trick is, how do you train something like this while preventing it from collapsing? It's easy for this system to collapse, ignore the input, and always predict the same thing. That's the question.

So, we did not get to solve the exact problem we wanted?

It was the wrong problem to solve. The real problem is to learn how the world works from video. The original approach was a generative model that predicts the next video frames. We couldn't get this to work. Then, we discovered a bunch of methods that allow one of those joint embedding systems to learn without collapsing.
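To make the joint-embedding idea above concrete, here is a minimal sketch of one training step: two views of the same images pass through an encoder, a predictor maps one representation onto the other, and a variance penalty discourages the collapse LeCun describes. The tiny encoder, the loss weights, and the random tensors standing in for augmented views are illustrative assumptions only, not FAIR's actual models or training recipe.

```python
import torch
import torch.nn as nn

# Toy joint-embedding setup: an encoder maps each view to an embedding,
# and a predictor maps the embedding of one view to that of the other.
# All sizes and weights below are illustrative, not published recipes.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 256), nn.ReLU(), nn.Linear(256, 128))
predictor = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))
opt = torch.optim.Adam(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-3)

def training_step(view_a, view_b):
    """view_a / view_b: two views (e.g., clean and corrupted) of the same images, (B, 3, 32, 32)."""
    z_a = encoder(view_a)                   # representation of view A
    z_b = encoder(view_b)                   # representation of view B
    pred_b = predictor(z_a)                 # predict B's representation from A's

    # Prediction loss in representation space (no pixel reconstruction).
    pred_loss = ((pred_b - z_b.detach()) ** 2).mean()

    # Anti-collapse term (VICReg-style variance hinge): keep each embedding
    # dimension's std above 1 so the encoder cannot output a constant vector.
    std = z_a.std(dim=0)
    var_loss = torch.relu(1.0 - std).mean()

    loss = pred_loss + 10.0 * var_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example with random tensors standing in for two augmented views.
imgs = torch.randn(16, 3, 32, 32)
print(training_step(imgs + 0.1 * torch.randn_like(imgs), imgs))
```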

There are a number of those methods. There's one called BYOL from DeepMind – Bootstrap Your Own Latent. There are things like MoCo. There have been a number of contrastive methods to do this. I probably had the first paper on this in 1993, on a Siamese neural network. You train two identical neural nets to produce identical representations for things you know are semantically identical and then push away the outputs for dissimilar things. More recently, there's been some progress with the SimCLR paper from Google. Then, I became somewhat negative about those contrastive methods because I don't think they scale very well. A number of non-contrastive methods appeared about four years ago. One of them is BYOL. Another one, which came from my group at FAIR, is called Barlow Twins, and there are a number of others. Then, we came up with two other ones called VICReg and I-JEPA, or Image JEPA. Another group at FAIR worked on something called DINOv2, which works amazingly well. Those are all different ways of training a joint embedding architecture with two parallel networks and predicting the representation of one from the representation of the other. DINOv2 is applied to images, VICReg is applied to images and short videos, I-JEPA to images, and now we're working on something called V-JEPA, or Video JEPA, a version of this for video. We've made a lot of progress. I'm very optimistic about where we're going.

You have long been a partisan of the double affiliation model. Would you suggest young people today consider a career with hats in both academia and industry, or would your advice for this generation be a little bit different?

I wouldn't advise young people at the beginning of their career to wear two hats of this type because you have to focus on one thing. In North America, if you go into academia, you have to focus on getting tenure. In Europe, it's different, but you have to focus on building your group, your publications, your students, your brand, your research project. You can't do this if you split your time.

[Photo: Yann's interview with Ralph in 2018]

Once you're more senior, then it's a different thing. Frankly, it's only in the last 10 years that I've been straddling the fence in a situation where I'm pretty senior and can choose what I want to work on. At FAIR, we don't take part-time researchers who are also faculty if they're not tenured. Even among the tenured, we tend only to take people who are quite senior, well established, and sometimes only for a short time, for a few years or something like that. It's not for everyone. It depends on which way you want to have an impact and whether you like working with students. In industry, you tend to be more hands-on, whereas in a university, you work through students generally. There are pluses and minuses.

You are one of the well-known scientists in our community who does not shy away from talking to younger and less experienced people on social media, in articles, and at venues like ICCV and MICCAI. Do you also learn from these exchanges?

The main reason for doing it is to inspire young people to work on interesting things. I've been here at ICCV for about an hour and a half, and about 100 people came to take selfies with me. I don't turn them down because they're so enthusiastic. I don't want to disappoint them. I think we should encourage enthusiasm for science and technology from young people. I find that adorable. I want to encourage it. I want to inspire people to work on technology that will improve the human condition and make progress in knowledge. That's my goal. It's very indirect. Sometimes, those people get inspired. Sometimes, that puts them on a good trajectory. That's why I don't shy away. There are a lot of exchanges about the potential benefits and risks of AI, for example. The discussions I've had on social media about this have allowed me to think about things I didn't think of spontaneously and answer questions I didn't know people were asking themselves. It makes my argument better to have these discussions on social media and have them in public as well. I've held public debates about the risks of AI with various people, including Yoshua Bengio and people like that. I think it's useful. Those are the discussions we need to have between well-meaning, serious people. The problem with social media is that there's a lot of noise and people who don't know anything. I don't think we should blame people for not knowing; I think we should blame people for being dishonest, not for not knowing things. I'm a professor. My job is to educate people. I'm not going to blame them for not knowing something!

You started in a place where you knew every single scientist in your field. Now, you are meeting thousands and cannot learn all their names. What is your message to our growing community?

A number of different messages. The first one is that there are a lot of applications of current technologies where you need to tweak an existing technique and apply it to an important problem. There's a lot of that. Many people who attend these conferences are looking for ideas for the applications they're interested in: medicine, environmental protection, manufacturing, transportation, etc. That's one category of people – essentially AI engineers. Then, some people are looking for new methods because we need to invent new methods to solve new problems. Here's a long-term question. The success we've seen in natural language manipulation and large language models – not just generation but also understanding – is entirely due to progress in self-supervised learning. You train some giant transformer to fill in the blanks missing from a text. The special case is when the blank is just the last word. That's how you get autoregressive LLMs. Self-supervised learning has been a complete revolution in NLP. We've not seen this revolution in vision yet. A lot of people are using self-supervised learning. A lot of people are experimenting with it. A lot of people are applying it to problems where there's not that much data, so you need to pre-train on whatever data you have available or on synthetic data and then fine-tune on whatever data you have. So, there has been some progress in imaging. I'm really happy about this because I think that's a good thing, but the successful methods aren't generative. The kind of methods that work in these cases aren't the same kind of methods that work in NLP. In my opinion, the idea that you're going to tokenize your video or learn to predict the tokens is not going anywhere. We have to develop specific techniques for images because images and video are considerably more complicated than language. Language is discrete. That makes it simple, particularly when having to handle uncertainty. Vision is very challenging. We've made progress. We have good techniques now that do self-supervised learning from images. The next step is video. Once we figure out a recipe to train a system to learn good representations of the world from video, we can also train it to learn predictive world models: here's the state of the world at time T; here's an action I'm taking; what's going to be the state of the world at time T+1? If we have that, we can have machines that can plan, which means they can reason and figure out a sequence of actions to arrive at a goal. I call this objective-driven AI. This is, I think, the future of AI systems. Computer vision has a very important role to play there. That's what I'm working on. My research is entirely focused on this!
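As a toy illustration of the objective-driven loop described above (a learned world model predicts the state at time T+1 from the state and action at time T, and a sequence of actions is chosen to reach a goal), here is a hedged sketch. The linear "world model", the random-shooting planner, and all dimensions and constants are stand-in assumptions for illustration, not LeCun's proposed architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy world model: a predictor f(state, action) -> next state.
# Here it is a fixed linear map standing in for a trained network.
A = np.eye(4) + 0.05 * rng.standard_normal((4, 4))
B = 0.1 * rng.standard_normal((4, 2))

def world_model(state, action):
    """Predict the state at time T+1 from the state and action at time T."""
    return A @ state + B @ action

def plan(state, goal, horizon=5, candidates=256):
    """Objective-driven planning: pick the action sequence whose imagined
    rollout ends closest to the goal (random shooting, for illustration only)."""
    best_cost, best_seq = np.inf, None
    for _ in range(candidates):
        seq = rng.uniform(-1.0, 1.0, size=(horizon, 2))
        s = state
        for a in seq:
            s = world_model(s, a)           # imagine the consequences of actions
        cost = np.linalg.norm(s - goal)     # task objective: distance to the goal
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq, best_cost

state = np.zeros(4)
goal = np.array([1.0, -0.5, 0.0, 0.2])
actions, cost = plan(state, goal)
print("first planned action:", actions[0], "expected final distance:", round(cost, 3))
```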

Jiayu's picks of the day (Thursday)

Jiayu Yang is currently a final-year PhD student at the Australian National University, supervised by Miaomiao Liu, Jose M. Alvarez and Richard Hartley. His research focuses on 3D Computer Vision, 3D Reconstruction, Multi-view Stereo, Autonomous Driving, Extended Reality, and 3D Artificial Intelligence Generated Content.

Jiayu sent us his recommendations below. But he forgot to tell us that he also has a poster to present today: Parametric Depth Based Feature Representation Learning for Object Detection and Segmentation in Bird's Eye View. Visit him during the morning session at 10:30 AM-12:30 PM [Room "Nord" – 030].

Jiayu's picks (orals and posters):
- AM: Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models
- AM: FocalFormer3D: Focusing on Hard Instance for 3D Object Detection
- AM: Towards Viewpoint Robustness in Bird's Eye View Segmentation
- AM: MetaBEV: Solving Sensor Failures for 3D Detection and Map Segmentation
- AM: Viewset Diffusion: (0-)Image-Conditioned 3D Generative Models from 2D Data
- AM: View Consistent Purification for Accurate Cross-View Localization

OmniLabel: A Challenging Benchmark for Language-Based Object Detection

Samuel Schulter is a Senior Researcher at NEC Laboratories America, working on computer vision. His recent focus is on the intersection between vision and language for 2D/3D perception tasks. In this paper, he proposes a new benchmark for language-based object detection. He speaks to us ahead of his oral and poster this afternoon.

Computer vision has seen significant advancements in language-based object detection in recent years. Unlike conventional object detection tasks with predefined label categories, language-based perception allows algorithms to understand and respond to a diverse range of textual descriptions associated with images. This progress is paving the way for a future where the limitations of fixed label spaces are replaced by an expansive, almost infinite, label space. However, with this exciting shift comes a pressing challenge: effectively evaluating the performance of these algorithms.

"For language-based detection, that's not as easy as you think," Samuel points out. "We looked at benchmarks for referring expression datasets and open-vocabulary detection, and while there are great benchmarks for open-vocabulary detection, they're mostly limited to evaluating how good you are at detecting a novel category not seen during training. But these category names are still simple categories, like a bottle or an iPad case."

For many of these datasets, there is an assumption that the text relates to precisely one bounding box in the image. An algorithm already knows there is one bounding box in the image that refers to the text; it just needs to find it. This scenario differs from object detection, where you are given several categories, and an image might contain one or five categories and multiple or no instances. To address this gap, Samuel wanted to create a benchmark that evaluated an algorithm's ability to handle more complex, free-form descriptions. He needed descriptions that encompassed multiple objects or even referred to objects not present in the image, introducing the concept of negative descriptions.

"If a person is wearing a blue shirt, and the description is a person wearing a red shirt, the algorithm's output should be no bounding box," he explains. "Object detection benchmarks do evaluate for this; existing language-based benchmarks do not. That's where our paper comes into play. The task is: I have an image and a list of descriptions, and the algorithm gives me back a set of bounding boxes only for the objects that match."
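To make the task format concrete, here is a small sketch of the interface Samuel describes: the input is an image plus a list of free-form descriptions, and the output maps each description to the boxes of matching objects, possibly none for negative descriptions. The detector below is a hypothetical stub for illustration, not the benchmark's API.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def run_language_detection(detector, image, descriptions: List[str]) -> Dict[str, List[Box]]:
    """Language-based detection: return boxes only for descriptions that
    actually match objects in the image; unmatched (negative) descriptions
    must come back with an empty list."""
    results = {}
    for text in descriptions:
        results[text] = detector(image, text)  # hypothetical detector call
    return results

# Toy stand-in detector illustrating the expected behaviour.
def toy_detector(image, text: str) -> List[Box]:
    known = {"a person wearing a blue shirt": [(40.0, 20.0, 120.0, 260.0)]}
    return known.get(text, [])  # negative description -> no boxes

out = run_language_detection(
    toy_detector,
    image=None,  # placeholder for real image data
    descriptions=["a person wearing a blue shirt", "a person wearing a red shirt"],
)
print(out)
```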

Although vision and language research has existed for many years, the explosion of large-scale models like CLIP has spurred the development of benchmarks for perception tasks. Samuel tells us there are already three papers similar to his, which is a promising sign of the field's growth and direction. "People should come to my oral presentation, of course, but I think it's good confirmation that this is the right direction, and people are interested in it," he adds. "One paper was already at CVPR, and two others are unpublished, but there are now four benchmarks with similar goals, which is great for the community."

As is often the case, data collection proved to be the most challenging aspect of this work. Samuel employed as much automation as possible, aiming for the most efficient way to get the necessary annotations. Still, data quality became an issue due to the use of Amazon Mechanical Turk, a platform for crowdsourcing tasks. "You start a round of annotations and realize maybe my instructions weren't as good as I thought when you get something back really different from what you expected," he says. "Of course, people there have an incentive too. They want to earn money doing annotations, so they get to the solution in the quickest way they can and leverage shortcuts to get the task done quickly. You don't get what you want if you forget something in your instructions. We had to do a couple of iterations to get that right."

The key motivation was to have challenging descriptions that refer to multiple objects in a scene, requiring models to consider the entire context of a sentence. Every detector can take an image of a cat on a bench and find the cat. Instead, he would use an image of two cats, one on a bench and one on the ground, which is a more challenging task. "We leveraged existing datasets, starting with object detection datasets where we knew there were two or three cats in the image," he tells us. "We selected those first and asked annotators to pick only a subset, two out of three, and describe them so that they only refer to those two, but not the third one."

Regarding the next steps, Samuel highlighted a notable follow-up paper on arXiv: the winner of the OmniLabel challenge hosted at CVPR. This research explored how to teach a model to focus on all aspects of a given sentence when identifying objects in an image. It used large language models and ChatGPT to generate negative descriptions, training the algorithms with this additional augmented data. The challenge itself is still online and open to anyone with a language-based object detector to evaluate and benchmark. He hopes to host another edition of the challenge and workshop for CVPR 2024.

Samuel acknowledges that the success of this work was a collective effort and is keen to thank his collaborators, including second and third authors Vijay and Yumin; the students at Rutgers University who helped with the benchmarks and baselines; and everyone at NEC Labs, particularly his department head, who encouraged and supported the work despite its academic bent.

"I'm going to be bold here and say, in the future, this is what object detection looks like," he declares. "Our paper is just a benchmark, but it's a benchmark for models that will be the next generation. It's a transition already, but this is how all detection papers will look in the next few years."

Until the next revolution, at least?

"Yes, exactly," he laughs. "It could be over soon!"

To learn more about Samuel's work, visit his poster this afternoon at 14:30-16:30 and his oral at 16:30-18:00.

ICCV Daily
Publisher: RSIP Vision
Copyright: RSIP Vision
Editor: Ralph Anzarouth
All rights reserved. Unauthorized reproduction is strictly forbidden. Our editorial choices are fully independent from IEEE, the Computer Vision Foundation and the ICCV organizers.

SceneRF: Self-Supervised Monocular 3D Scene Reconstruction with Radiance Fields

Anh-Quan Cao is a third-year PhD student at Inria Paris, supervised by Raoul de Charette. His paper proposes a self-supervised 3D reconstruction method from a single image. He speaks to us ahead of his poster this morning.

In this paper, Anh-Quan extends the capabilities of 3D reconstruction to encompass large-scale scenes, such as outdoor driving scenarios and robot navigation, using only image-based supervision. He builds on a NeRF-based method, using a sequence of future frames to supervise the training of the model conditioned on a single image. The idea follows his previous work, MonoScene, in which he successfully predicted complete 3D semantic scenes from a single image with 3D ground truth.

"The setting is very challenging because we want to work on a large-scale scene," Anh-Quan explains. "It's a very complex scene with a lot of occlusion, objects, and clutter because it extends up to 100 meters away. To reconstruct it from only a single image is hard. Furthermore, we don't want to use 3D ground truth; we only want to use images to supervise, adding another difficulty to the problem."

To address these challenges, he extends the PixelNeRF method to enable it to reconstruct the 3D geometry from the image and work on large-scale scenes. He proposes a novel encoder-decoder architecture designed to expand the image's field of view, allowing the extraction of features from points outside the immediate view.

"We also propose an efficient ray sampling technique," he continues. "In neural radiance fields, we need to project the rays into the views and then sample the points on the rays. In our case, we need to have very long rays, around 100 meters. Therefore, we propose an efficient technique to sample a small number of points on the rays. For example, we only need 60 points for 100 meters, reducing the computation required."

Anh-Quan believes this is the first instance of a large-scale, self-supervised 3D reconstruction method that operates solely from a single image. The work also builds on neural radiance fields, a highly popular and award-winning method, and demonstrates its ability to generalize to unseen nuScenes images. Advisor Raoul de Charette told us that he finds SceneRF particularly interesting because it alleviates the need for 3D ground truth, thus stepping towards arbitrary 3D reconstruction from a video stream.

"It's a very challenging project because of the setting," Anh-Quan adds. "That's the thing I like about it. It's tough to solve this problem, and I remember we only solved it several weeks before the deadline!"

The practical applications of this research in the real world are far-reaching. In autonomous driving, where the prediction of 3D scenes is essential, this method eliminates the need for 3D ground-truth supervision during training of the computer vision network, enabling training on larger image datasets.
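As a rough illustration of the coarse ray sampling Anh-Quan mentions above (on the order of 60 points spread over roughly 100 meters instead of dense sampling), here is a small sketch. The log-spaced scheme and every number are assumptions for illustration, not the paper's exact sampler.

```python
import numpy as np

def sample_points_on_ray(origin, direction, near=0.5, far=100.0, n_points=60):
    """Sample a small number of points along a long ray.
    Spacing grows with distance (log-spaced depths), so nearby geometry is
    sampled more densely than far-away geometry -- one simple way to cover
    ~100 m with only ~60 samples."""
    direction = direction / np.linalg.norm(direction)
    depths = np.geomspace(near, far, n_points)                       # (n_points,)
    points = origin[None, :] + depths[:, None] * direction[None, :]  # (n_points, 3)
    return points, depths

pts, d = sample_points_on_ray(origin=np.zeros(3), direction=np.array([0.0, 0.0, 1.0]))
print(pts.shape, d[:3], d[-1])  # dense near the camera, sparse far away
```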

The method could also be deployed on drones or small robots. These devices often face challenges in capturing 3D ground-truth data, making a single-camera-based method highly advantageous.

"It's still quite heavy to train," Anh-Quan reveals. "It takes five days. The next step would be to make that more efficient. Also, inference is quite slow for now. Another step is to incorporate the semantic information inside the method. Currently, it only predicts the scene's geometry without the semantic information. I hope other researchers will take it from here."

Anh-Quan is originally from Hanoi, the capital of Vietnam, which he tells us has a lot of "beautiful landscapes." Outside of this project, his work is focused on 3D scene reconstruction from images or point clouds, aiming to reduce the reliance on supervision, estimate model uncertainty, and enhance explainability. "I will finish my PhD next year," he tells us. "The next step is to find a job – I want to be a research scientist in a company."

To learn more about Anh-Quan's work, visit his poster this morning at 10:30-12:30.

UKRAINE CORNER

Sophia Sirko-Galouchenko is starting her PhD on representation learning for autonomous driving with limited supervision at Valeo.ai, a team based in Paris that conducts research for automotive applications. Sophia is volunteering at ICCV.

ICCV's sister conference CVPR adopted a motion with a very large majority, condemning in the strongest possible terms the actions of the Russian Federation government in invading the sovereign state of Ukraine. We decided to host a Ukraine Corner in the ICCV Daily as well.

LIMITR: Leveraging Local Information for Medical Image-Text Representation

Gefen Dawidowicz (left) and Elad Hirsch (right) are PhD candidates in the Electrical Engineering Department at the Technion – Israel Institute of Technology under the supervision of Ayellet Tal. Their paper is about learning a joint representation of images and text from medical images and reports. They speak to us ahead of their poster tomorrow (Friday) afternoon.

In this paper, Gefen and Elad explore medical image-text representation to assist radiologists in interpreting chest X-ray images and their corresponding reports. Their journey started with an in-depth exploration of the distinctive features of chest X-rays, particularly when compared to natural images, which landed on three key observations.

Firstly, chest X-rays have a unique structure influenced by the layout of the human body. Secondly, the pathologies within medical images are typically confined to small regions, unlike natural images, where differences are often pronounced and readily discernible. Furthermore, the medical reports accompanying these images tend to focus heavily on normal observations, with relatively brief pathology descriptions. Thirdly, the studies used in this work typically contain a medical report and one or more images.

"One of the images is always the frontal image, and sometimes, in the studies, they provide an additional view, which is the lateral view," Gefen tells us. "Most of the works that were done before us ignored the lateral view, even though it is mentioned in some of the reports, and the information it contains is helpful for the radiologist to understand the pathology and the conditions of the subject in that examination."

The team's primary goal was to create a shared space where text and image data could coexist. They envisioned a system where an image representing a specific pathology would align closely with text describing the same condition. This alignment could be leveraged in various applications, including retrieval, where the text can be used to retrieve images that closely match the description. They introduced the concept of generating attention maps linking phrases to corresponding areas in the image. This technique, known as phrase grounding, demonstrates the quality of the local alignment.

"The end goal is to help new radiologists and those who are not yet experts and give them the option to retrieve similar studies," Elad points out. "Say they have an image and are not sure what happens in that image. They can easily retrieve and get similar reports and then see if they missed something or can learn from it. That's why we need this representation space that captures these similarities."

When asked about the origin of their work, Gefen, who is the first author, shared that her background in biomedical engineering provided a starting point. Then, her collaboration with Elad, whose prior work related to image and text, led them in this innovative direction. However, given their engineering backgrounds, they both acknowledge the significant challenge they faced in becoming familiar with the medical field.

"We're engineers, not doctors," Elad affirms. "We didn't know how to read chest X-rays, but we had to do that. It's very important to verify and find the failures of the system. We had to observe the data and try to understand what was right and wrong. In natural images, it's easy. You see that there's a dog, and if the system says it's a cat, you try to understand what caused the failure. If there's a lung lesion or opacity, and you don't know either of them, it's much more difficult."

After overcoming this initial hurdle, they devoted their efforts to understanding the unique properties and challenges of the field and tailoring their solution accordingly. This solution revolved around several key principles: handling multiple images, leveraging a known structure, and weighted learning. They developed a flexible solution that accommodated cases with additional lateral images and cases with only one frontal view. In doing this, they recognized that a radiologist would naturally examine lateral images if available but wouldn't if not, so it was essential to mimic that behavior. Their model learned to match words in the report to relevant images or regions within the images.

"The second thing we did was leverage the known structure," Gefen adds. "We know that in the images, there's a structure. The heart is always in the middle of the image. We use positional encoding to leverage this known structure in the entire dataset. Thirdly, we know that the interesting differences between the images and the reports lie in small areas of the image or a few words in the reports. Our model learns to weigh each of the words in the report or the regions in the images and gives more weight to areas that represent those pathologies or abnormalities."

Outside of this work, the pair think their method could be helpful in other fields and datasets. "One of our colleagues worked on archaeology, and he had similar challenges with small datasets," Gefen continues. "You can't always use the big models to solve your problems." Elad agrees: "If you have a small dataset without an ability to segment the regions and match image regions to text, but you know that there's a connection between the words and the image, you can utilize or easily adapt our method."

While their immediate plans do not involve a direct continuation of this work, Gefen and Elad are sticking with the field but exploring new horizons. One promising direction is the generation of reports from images, a task similar to this one. They are also investigating the phrase-grounding task, which visually connects words or phrases in a report to specific regions in an image and can potentially benefit both trainee and expert radiologists.

We were intrigued to know what led to Gefen and Elad's decision to submit their work to ICCV rather than MICCAI.

Although it is easy to see why their backgrounds in computer vision led them to ICCV, can they see the benefit of presenting their work to a more specialized medical imaging audience in the future?

"Yeah, on the one hand, you're correct, but many of the publications in that field, and even the works we're compared to, are not from MICCAI," Elad responds. "We're happy to see that the community is looking for more fields, trying to find the impact of computer vision in other domains, and publishes these works in CVPR, ICCV, ECCV, etc., and not only MICCAI. Also, as computer vision engineers, we can see these works. I think it goes both ways."

To learn more about Gefen and Elad's work, visit their poster tomorrow [Friday] afternoon at 14:30-16:30.

We received this very nice photo and the ones on the next page from Debora Caldarola, an organizer of the Women in Computer Vision workshop and a dear friend of our magazine. The inspiring panel included (from left to right) the awesome Diana Larlus, Siyu Tang, Anna Rohrbach, Hilde Kuehne and Arsha Nagrani. It is interesting to note that 4 of the 5 panelists have already been interviewed in RSIP Vision's Women in Computer Vision series.


Women in Computer Vision

Azade Farshad is a Senior Researcher at the Technical University of Munich.

Azade, tell us about yourself.

I just submitted my doctoral thesis last week. But since last year, I've been a senior researcher at the Technical University of Munich. I will be staying there for the next month at least.

When is your defense?

Probably in six months.

Can we say congrats before, or do we have to wait six months?

[laughs] Yeah, you can say it before.

Azade, what is your work about?

It's about foundational research in machine learning and computer vision. I work on generative models, scene graphs, meta-learning, and a bit of medical imaging.

Is that difficult?

The topic is not difficult, but it's very competitive. So, coming up with novel ideas and publishing sometimes becomes difficult.

Why did you take a competitive field? You could have taken an easier one.

It's always been my interest. Even when I was a child, I wanted to go into artificial intelligence, and I still love that.

You're probably doing pretty well with the competition.

Yeah, I hope so.

What problem are you trying to solve in the real world?

Before this, I was working on images, generating images, and manipulating them. Now, with the recent advances in diffusion models, that is almost solved. But there are still some challenges. For example, the semantics are not accurate in generated images. Also, there is not much research going on on videos. So, I plan to also move towards that, to generate videos.

What is the world going to do with that?

There are two directions. One is entertainment. I think everyone is entertained by using these tools with images. And with videos, it will be even more interesting.

Another aspect I am trying to do this for is medical imaging, and that part will be more impactful. For example, predicting diseases.

Do you prefer working to entertain people better or to improve their health? Is there a big difference for you?

The applications are very different, but I work on the methodological part, not exactly on the application. Something that can be applied to both worlds.

Is it a choice to work more in methodology than in application?

Yes, but in academia, it also depends on the funding that you get. There is more funding available for application-based research. For the more theoretical work, there wasn't much funding. Recently, however, in Germany, where I am coming from, there have been many more funds. So, I was able to put more effort into it.

Apparently, you chose the branch with the lowest funding.

Yes. [laughs]

Wouldn't it be better if you could do science where you think it is needed rather than where the funding leads you?

Yes, exactly. That's the best option. It would be great if everyone could work on what they like and what they think will be more impactful, rather than competing just to get the publication. Then people can actually benefit from the work, not just from slightly increasing the numbers in the tables.

Sure! That was something that Nikos Paragios called the "deep depression". We hope that whoever needs to hear that heard it. Azade, you told me that when you were a child you already wanted to work in artificial intelligence. How did that happen, at what age, and how did you feel about it?

[hesitates for a moment] I think it was around four or five years old. When someone asked me what I wanted to become when I was older, I would say I wanted to make robots. That was to help my parents when they needed help, for example, with work at home.

Later, when I went to high school, I knew that if I wanted to go towards this path, I needed to study computer science. Before this, I also loved math.

Has everything always moved in that direction?

Almost. Currently, I'm not working on robotics, but from the moment I entered university, I realized I was more interested in the intelligence end than the hardware end.

Okay, imagine that I give you a big grant to work on anything you want. What would it be?

It would be the thing that I am currently working on with predicting future videos, and also the same for medical imaging.

Will that be the subject of your thesis in six months?

Not exactly, although the subject of my thesis was the foundation for that; it was working on images without any temporal data. But in the future, I want to go in that direction.

If I could give you something that you don't already have, what would it be?

Probably money. [laughs]

More money for you or for your research?

Well, for example, one of the biggest choices for PhD holders is whether to go towards academia or industry. I'm still considering going for both because academia doesn't have much money, and industry does. If you want to have a nice life and also work on your interest, then you need to have both of these.

I see. You don't see this getting mended very soon?

In some countries, it is fine. For example, in Germany it is good, and also in Switzerland. But in some countries, academia especially does not have much money, and you don't get paid well.

What's the situation in Iran? So many Iranian scientists go abroad. Why do they go away?

One of the reasons is the current economic situation. They cannot find very well-suited jobs. They also see others having better lives when they go abroad, so they follow the same path.

So there is no solution?

At this point, I am not sure. Maybe if the government changes its attitude.

Can you share something that we don't know about Iran?

It's a very beautiful country. The people are very nice and friendly.

Would you recommend visiting Iran?

Yes. It has many great and beautiful cities. [smiles]

What's one thing that is not very well known but you know is beautiful?

[hesitates for a moment] I think many of them are well known. But in the south, there is the Persian Gulf and a spot where the desert meets the sea. I have not seen that myself, but I've been planning to do that for some years.

Do you plan to go back to Iran one day to live?

It is possible, but I am not planning that at the moment. The opportunities are better in Europe and the USA.

After your PhD, do you plan to stay in the research area, or do you want to teach?

First, I will be staying in academia for at least the next two years. I plan to stay in academia, probably as a postdoctoral researcher, because I love research. I didn't like teaching four years ago when I started my PhD. But then I got used to it, and now I even like it. [laughs]

Maybe you will have a young PhD student who will start working with you. Would you like that?

Yes. I'm even supervising master's students now, and I did in the past years during my PhD.

How many?

Each semester, maybe twenty.

Wow. Is it demanding?

It's difficult to manage the time and to have all the meetings, but it's also enjoyable.

Are there some that are more talented than others?

Yes, but one of the things is to try to match people and help all of them grow together. This was one of the challenges at the beginning of my work. I learned to help them grow as a group.

What if you found one who is not talented? Would you tell him, listen, this field is not for you?

It depends on how you define talent. If they are just lazy and don't want to put the work in, then I may tell them. But if they just need more help to understand and learn things, I would help them.

Very nice. So, it depends on how many chances they have to continue. But if you feel they should leave, you would tell them?

Yes, because this is a very competitive field. At some point, I even thought I was too slow to compete with the industry, and I slightly changed my path to be a bit away from industry.

Let's fast forward to the day you retire. What would be something that you have done in your career that would satisfy you, and you say, "Oh, that was worth it." What success do you wish for yourself? Something you can think back to on your day of retirement and think that was great.

The best thing would be to receive the Turing Award. [laughs] I'm not sure if I can get there. But even if I can do work on impactful research and be helpful to the world, that would be something.

If one day I am on the jury of the Turing Award, I will vote for you.

[laughs] Thank you.

Of all the skills that engineers and scientists have that are important to succeed, which one do you cherish the most? Which skill would you say, please come to me, stay with me, and never go away?

I think being curious and trying new things out. Just being excited about one's research is the most important part.

What drives you to work so hard in this field?

If I look into the far future, then having very intelligent systems that would help people would still be a dream. But in the short term, achieving some minimal goals and impactful work would be nice.

Do you have something you'd like to share with the community?

Put more effort into academia. And try to push artificial intelligence and computer vision toward more impactful and interesting ideas rather than working just for profit.
