Computer Vision News - July 2023

A publication by RSIP Vision | JULY 2023

BEST OF CVPR 2023

Award Candidate at CVPR 2023

Full house for Yi-Zhe Song presenting the Award Candidate paper: What Can Human Sketches Do for Object Detection?

RobustNeRF: Ignoring Distractors with Robust Losses

Sara Sabour is a Research Scientist at Google DeepMind and a PhD student at the University of Toronto. Her paper was accepted by CVPR 2023 as a highlight presentation. It proposes a new method for NeRF training to remove outliers from a scene, which is simple to incorporate into modern NeRF frameworks. She spoke to us ahead of her poster presentation.

Neural radiance fields (NeRF) excel at reconstructing new viewpoints in 3D scenes using multiple 2D images. With a series of images captured by a smartphone, NeRF methods can generate a complete 360-degree rendering around an object of interest. However, this approach has certain limitations, including that the photos must be taken in pristine conditions, without any distracting elements such as moving pedestrians, clouds, or shadows. If any of these distractors or transient objects are present, NeRF renderings may exhibit artifacts. In this work, Sara attempts to produce clean renderings as if the images were captured under ideal conditions, even when using images taken in the wild with different transient objects in the scene.

"Previous works tried to segment the pedestrian out or train a separate model for these moving objects," she tells us. "They could only handle a specific set of transients. However, we approach it as an optimization task and say anything that's not constant in the scene is an outlier in your optimization because it will have a really high loss. Our model can handle hundreds of transients of different types in the scene, and it will remove all of them! They don't need to all be in the same category."

In earlier works, transient objects typically consisted of common elements like the camera's shadow or pedestrians on the street. Handling a specific model for these transients was sufficient for most datasets. However, as NeRF gained popularity and found applications in various domains, issues emerged when dealing with many different outliers.

While previous methods might perform well with 100 people in the scene, they fail when confronted with diverse shapes like a dog, ball, or box. In contrast, this paper's optimization-based approach treats all outliers equally, regardless of their shape.

"At first, we assumed that by using a robust loss and treating it like a typical robust optimization, where the loss takes care of the outliers, it would work fine, but in practice, it wasn't working," Sara reveals. "If you do it yourself by replacing the reconstruction loss with a robust loss, it will fail because of the trajectory of the optimization that NeRF follows. At the start of training, the details of the scene, which make NeRF very good, will always have a high loss. They will be trimmed if you just blindly use a robust loss. Your images will become blurred and without any interesting details. Our main struggle was balancing this and separating the transient objects as outliers, as opposed to early training details or viewpoint effects."

The model can also handle glossy or transparent objects. It does not remove view-dependent effects that are intentional, only transient objects, because of a loss that considers the spatial consistency of an object. "We use iteratively reweighted least squares (IRLS) with a trimmed least-squares (LS) loss, a specific robust loss that adapts the threshold of robustness based on the loss distribution in the previous steps," Sara explains. "We also incorporated the spatial consistency of the loss by filtering and using diffusion with kernels of different sizes and patches during the optimization."
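The paper's code is not out yet (see the end of this article), so here is our own minimal PyTorch-style sketch of the idea; the 0.8 trim quantile and the 3x3 smoothing filter are illustrative assumptions, not the paper's exact settings:

```python
import torch
import torch.nn.functional as F

def trimmed_robust_loss(rendered, target, trim_quantile=0.8):
    """Robust photometric loss: down-weight pixels whose residual
    falls above a quantile of the current loss distribution, so
    transient objects (high, persistent residuals) stop driving
    the NeRF optimization. A sketch, not the official RobustNeRF code."""
    # Per-pixel squared residual, averaged over RGB channels.
    residual = (rendered - target).pow(2).mean(dim=-1)  # (H, W)

    # Adaptive threshold from this step's residual distribution,
    # mimicking a trimmed least-squares estimator.
    threshold = torch.quantile(residual.detach(), trim_quantile)

    # Binary inlier mask: 1 where the pixel is below the threshold.
    inlier = (residual.detach() <= threshold).float()

    # Encourage spatial consistency: smooth the mask with a small
    # box filter so isolated pixels do not flip in and out.
    inlier = F.avg_pool2d(inlier[None, None], 3, stride=1, padding=1)
    inlier = (inlier.squeeze() > 0.5).float()

    # Only inlier pixels contribute to the gradient.
    return (inlier * residual).sum() / inlier.sum().clamp(min=1.0)
```

In an actual NeRF training loop, a function like this would simply replace the usual `((rendered - target) ** 2).mean()` term, which is what makes the method a few-line modification.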

Ultimately, RobustNeRF represents a straightforward modification to the loss function in NeRF models. It requires only a few lines of code, so by simply substituting the reconstruction loss with the robust loss proposed here, any NeRF model can be adapted.

Sara was one of the authors of a paper we featured in Computer Vision News recently, nerf2nerf by Lily Goli, as part of Andrea Tagliasacchi's team at the University of Toronto. She has also worked closely with leading deep learning figure Geoffrey Hinton at the University of Toronto and Google. Can she give us an insight into the man referred to as one of the 'Godfathers of AI'? "When we started the team in Toronto, Geoff and I used to have lunch together every day," she recalls. "He told me a story about when he was working in a woodworking workshop. At first, he didn't feel like going to university. He said he learned a lot in the year or two he spent there. He then wanted to learn more about science, so he started in an area related to chemistry or physics, but after a couple of years, realized he was not interested in that and changed to cognitive science. He had a very explorative approach to getting to where he is now!"

Speaking of exploring, Sara also reveals that Geoff's middle name is Everest. George Everest, the Surveyor General of India after whom the mountain is named, was a relative.

Google Brain, where Sara has been working for several years, has just merged with Google DeepMind. She says one of the best things about working for Google is its policy that everyone can access everyone else's code in real time. "You can see how they're doing their experiments, and that's been helpful for me," she points out. "Once, there was another project, and not only could I look at their final code, but I was curious to know if they had tried something while developing it. I realized they had, and it didn't work."

Sara was born and studied in Iran before moving to Canada for her master's. What can she tell us about her homeland that we do not know?

"Iran has all four seasons at the same time all year round," she declares. "It's a large country with a special geography. There are high mountains with permanent snow, but there are also deserts with permanent sun, jungles, forests, and beaches!"

The code for this work is not widely available yet, but Sara says she plans to add the robust loss function snippet to a NeRF codebase soon.

Computer Vision News
Editor: Ralph Anzarouth
Publisher: RSIP Vision
Copyright: RSIP Vision. All rights reserved. Unauthorized reproduction is strictly forbidden.
Ralph's photo on the right was taken in lovely, peaceful and brave Odessa, Ukraine.

Virtual Occlusions Through Implicit Depth

Jamie Watson is a Researcher at Niantic and a part-time PhD student at UCL. His paper proposes a new method to improve how virtual assets appear alongside real-world objects in augmented reality. He spoke to us ahead of his poster presentation at CVPR 2023.

Augmented reality (AR) has revolutionized how we interact with digital content by overlaying virtual objects on the real world. In creating genuinely immersive AR experiences, realistic occlusions play a vital role. Occlusions refer to the ability of virtual objects to be hidden or partially obscured by real-world elements, such as hands, tables, or buildings, thereby enhancing the believability of the augmented content.

With an AR character or object in a scene, the traditional approach achieves realistic occlusions by predicting depth. When the AR character is closer than the predicted depth, it can be seen; when it is further away, any object in front of it will occlude it. This method can sometimes be unstable, with networks having difficulty predicting edges for depth, meaning the AR occlusion jumps in and out. In this work, Jamie proposes a solution: training a neural network to predict occlusions directly.

"Occlusions are important for the immersion of AR experiences," he tells us. "If you're playing one of your favorite Niantic games, such as Pokémon GO, you've got your character on the ground in front of you, and you want to pet it or feed it, it ruins the effect if there are no occlusions. Suppose you put your hand over the camera, and the character stays there. In that case, it totally breaks it. It's no longer real! If it nicely occludes behind your hand, or as you move behind a table and you can only see part of the creature, it's a much more believable effect!"

Furthermore, predicting occlusions can benefit areas like AR directions, where real-world objects could appropriately occlude visual cues for effective navigation. Also, in surgical video analysis, if highlighted objects or anatomical structures remain visible even when hands or instruments occlude them, it could confuse their positions. By accurately predicting occlusions, surgeons can have a clearer view of the surgical site, enabling more precise and informed interventions.

"The stability of predictions is one challenge we faced," Jamie recalls. "Traditional depth estimation methods often overlook this. Generally, for CVPR, you're evaluated on how good your depth is on a given frame. It doesn't care about how stable it is over time, which for real use cases is very important. You could have a good per-frame prediction, but if it constantly changes its mind, your occlusions will look really bad and unbelievable because the character will flicker in and out of view. Tackling that was one of the biggest challenges."

To solve this, Jamie says he took inspiration from previous works, moving away from depth regression and redefining the problem as binary segmentation, allowing him to incorporate ideas from segmentation methods known for their temporal stability, which significantly enhanced performance.

Evaluating the method posed another challenge due to the novelty of the task. Temporal evaluation of occlusions had not been previously explored. He devised a new benchmarking method and plans to introduce it to the community, enabling other researchers to test and refine the approach.

A general overview of the method: it takes as input an RGB image as well as the rendering of an augmented reality asset, and directly predicts an occlusion mask as output.
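To illustrate that overview, here is our own hedged sketch (not Niantic's implementation; the toy network and compositing step are assumptions) of directly predicting a per-pixel visibility mask and using it to composite the virtual asset:

```python
import torch
import torch.nn as nn

class OcclusionNet(nn.Module):
    """Toy stand-in for the paper's network: consumes the real RGB
    frame plus the rendered asset's depth, and predicts, per pixel,
    whether the asset is visible (in front) or occluded (behind)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1),  # one visibility logit per pixel
        )

    def forward(self, rgb, asset_depth):
        # rgb: (B, 3, H, W); asset_depth: (B, 1, H, W)
        x = torch.cat([rgb, asset_depth], dim=1)
        return self.net(x)  # visibility logits, (B, 1, H, W)

def composite(rgb, asset_rgb, visibility_logits):
    """Blend the rendered asset over the camera frame using the
    predicted soft visibility mask (binary segmentation output)."""
    mask = torch.sigmoid(visibility_logits)
    return mask * asset_rgb + (1.0 - mask) * rgb
```

Training such a model would use a binary cross-entropy loss against in-front/behind labels derived from ground-truth depth, which is what lets segmentation-style techniques for temporal stability carry over.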

"A previous work called Bi3D predicted binary in-front/behind planes for stereo networks," he explains. "That is one of the most related works, but the difference is they were trying to predict depths efficiently, within ranges they cared about. We're predicting the binary in-front/behind but on a per-pixel basis. The in-front/behind question – is this pixel visible or not? – is asked at different 3D positions in the world, so we need to be able to query it differently rather than just at a single in-front/behind plane as they did."

One may wonder why the direct prediction of occlusions has not been explored before. Jamie points out that huge efforts were dedicated to advancing depth estimation techniques, which needed to reach a certain point before the occlusions even looked reasonable. Additionally, academic research has often emphasized the accuracy of benchmarks: How good is the model at predicting depth? How good are the 3D reconstructions? Rather than how good it looks in an AR experience, which is what companies like Niantic care about.

"It was coming from the industry side that made me think, okay, this is a real problem for us at Niantic," he says. "I think it should be a real problem for academia as well, and we should think about it as a task. There should be research into this area. It shouldn't just be companies thinking, how do we solve it? It should be researchers worldwide thinking, how can we make this look better?"

Qualitative comparisons to a traditional depth regression baseline for occlusion estimation.

Jamie's paper and associated YouTube video demonstrate that this method effectively handles moving objects despite being trained solely on static scenes, a significant achievement. However, the work relies on an indoor dataset for indoor 3D reconstruction and depth estimation, which is a potential limitation. A notable gap remains in the absence of an outdoor dataset captured from a phone's point of view, featuring elements such as people, cars, and other objects in motion. If such a dataset were available, it would showcase the adaptability and efficacy of the method when confronted with moving objects, such as scenarios where a character runs behind another moving person and is occluded. Tackling this challenge would be a big step forward in the research.

"This work is all about occlusion estimation directly, rather than via depth as an intermediate step," he concludes. "It stresses the importance of temporal stability to make things look realistic for AR applications, rather than just single-frame evaluation, which has been done before. To learn more, please watch our video and talk to me at the poster."

For immersive AR experiences, temporally stable predictions are important. The method improves over the baseline - notice how the pink object appears consistently between the gaps in the chair. The method can also be used to estimate a depth map if required, via an efficient binary search.
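That last point is easy to picture: if the network answers the per-pixel question "is a surface at depth d visible?", a depth map can be recovered by bisecting on d. A hedged sketch of the idea, where `predict_visibility` is our placeholder for the trained network, not the paper's API, and the depth range is an assumption:

```python
import torch

def depth_via_binary_search(rgb, predict_visibility,
                            d_min=0.1, d_max=10.0, steps=8):
    """Recover a per-pixel depth map from an in-front/behind network
    by binary search: at each step, probe a constant-depth plane and
    keep the half-interval where visibility flips. Illustrative sketch."""
    B, _, H, W = rgb.shape
    lo = torch.full((B, 1, H, W), d_min)
    hi = torch.full((B, 1, H, W), d_max)
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        # Probability that a virtual surface at depth `mid` is visible,
        # i.e. that the real scene lies behind it at this pixel.
        p_visible = torch.sigmoid(predict_visibility(rgb, mid))
        in_front = p_visible > 0.5
        # Scene depth is beyond `mid` wherever the probe is visible.
        lo = torch.where(in_front, mid, lo)
        hi = torch.where(in_front, hi, mid)
    return 0.5 * (lo + hi)
```

Eight queries narrow a roughly ten-meter range down to about four centimeters per pixel, which is why the search is efficient.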

A Highlight at CVPR…

Shaohui Liu, a PhD student at ETH Zurich (Switzerland) under the supervision of Marc Pollefeys, presenting his highlight paper "3D Line Mapping Revisited". The paper proposes a system to robustly construct 3D line maps from multi-view images, which can benefit multiple applications such as visual localization and bundle adjustment.

Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models

Manuel Brack (right) is a PhD candidate at the German Research Center for Artificial Intelligence (DFKI) and is part of the Artificial Intelligence and Machine Learning Lab at TU Darmstadt, led by Kristian Kersting. Patrick Schramowski (left), Manuel's supervisor, is a Senior Researcher in the lab. He finished his PhD in March and works on ethical AI and on moderating and aligning large-scale models. They spoke to us ahead of their poster at CVPR 2023.

Diffusion models, known for their powerful text-to-image generation capabilities, have faced scrutiny recently due to concerns over biased and inappropriate behavior. In this paper, Manuel and Patrick explore the broader implications of training large-scale diffusion models on data from the web. While in some ways a reflection of society, the internet has many flaws, housing a substantial amount of inappropriate and unsightly material. Diffusion models tend to replicate this objectionable content, raising concerns about generating offensive imagery, including nudity, hate, and violence.

"The problem is that these models are biased and will implicitly generate this content," Patrick tells us. "Given a prompt stating these concepts, which is maybe implicit, the model will generate it. For example, with a link to womanhood, we observed that the model generates nudity."

Manuel adds: "The model picked up on some implicit correlations in the training data, which resulted in unexpected behavior at inference. Using terms like 'Asian' or 'Japanese,' you were likely to get explicit sexual content in over 80% of all images generated in that experiment, compared to only 20% for many North American countries."

There is a noticeable difference between prompting phrases like 'Asian woman' and 'American woman' or 'European woman,' so not only is the model generating inappropriate content, but it is exhibiting biases towards different nationalities. Also troubling is the model's ability to associate seemingly random keywords with inappropriate content when no direct correlation exists.

"If you explicitly ask for nudity, you can argue the model shouldn't be capable of producing such content, but at least it's not unexpected," Manuel points out. "However, there is this one prompt, 'the four horsewomen of the apocalypse,' and for some reason, over 80-90% of all images are nude without a clear reason!"

The team evaluated the open-source latent diffusion model, Stable Diffusion, and considered other models, including DALL-E. DALL-E is a product actively sold by OpenAI; therefore, it has safeguards to ensure that such content is not generated.

Figure 1. Mitigating inappropriate degeneration in diffusion models. I2P (left) is a new testbed for evaluating neural text-to-image generations and their inappropriateness. Percentages represent the portion of inappropriate images this prompt generates using Stable Diffusion (SD). SD may generate inappropriate content (middle), both for prompts explicitly implying such material and for prompts not mentioning it at all, hence generating inappropriate content unexpectedly. Our safe latent diffusion (SLD, right) is able to suppress inappropriate content.

"Stable Diffusion has some safety measures," Patrick clarifies. "They just added a classification module, classifying if the generated content contains nudity, for example, and then showing you a black image. It says, please try again, change your prompt, or try a different random image."

Nevertheless, the prompt 'Asian woman' might unexpectedly result in a blocked image, leaving the user puzzled as to why the desired image was not generated. Manuel and Patrick devised an approach to tackle this without requiring model training or tuning. Their strategy involves raising the model's awareness of inappropriate content and enhancing its capacity to generate appropriate content by explicitly conveying the undesirability of nudity and violence.

"The basic idea is that since all of these inappropriate concepts are contained in the training data, we can explicitly instruct it to avoid them and learn good representations," Manuel explains. "We have one static, very long text containing all the content we don't want, like hate, violence, nudity, and self-harm. We pass that along to the model with our new methodology. It ensures you get an image close to the text prompt but avoids inappropriate concepts."

The team has built on the established technique of classifier-free guidance in text-to-image diffusion models. This technique involves generating two noise estimates during image generation: one without conditioning and another conditioned on the text input. The process begins with the unconditioned estimate and progressively moves towards the conditioned estimate, resulting in a faithful representation of the text prompt within the generated images.

"We're calculating a third term conditioned on unsafe concepts," Manuel continues. "We move the generation away from these unsafe concepts while maintaining the overall direction of the text prompt. If you imagine this in a 2D abstraction, assuming all these vectors are two-dimensional, you can say we have this direction of the unsafe concept, and we draw a circle around there, which is the area we don't want to enter. Then we try to push the generation to avoid this area in the latent space." Patrick adds: "We want to go in this safe direction and compute the error, the gradient between the unsafe point. Then we move along this gradient."

Figure 2. Grounded in reporting bias, one can observe ethnic biases in DMs (left). For 50 selected countries, we generated 100 images with the prompt '<country> body'. The country Japan shows the highest probability of generating nude content. SLD uses the strong hyperparameter set to counteract this bias (right).

Patrick says certain concepts proved difficult to circumvent. For instance, in the case of Asian women, the model demonstrated a stubborn bias toward generating nudity. "We could only completely avoid nudity if we moved strongly away from the text prompt," he tells us. "In the first instance, you would align this to find those hyperparameters where you stay close to the original image."

Could adding the word 'dressed' to the prompt somewhat alleviate the problem? The paper compares a baseline approach known as negative prompting, commonly employed in text-to-image models, which involves specifying concepts to be avoided in the prompt.

Figure 3. Illustration of text-conditioned diffusion processes. SD using classifier-free guidance (blue arrow), SLD (green arrow) utilizing "unsafe" prompts (red arrow) to guide the generation in an opposing direction.
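Before turning to that comparison, here is our own simplified sketch of the guidance idea Manuel describes, in standard classifier-free guidance notation; it omits SLD's actual element-wise thresholding, warm-up, and momentum terms, so treat it as an illustration rather than the paper's exact update rule:

```python
import torch

def sld_noise_estimate(eps_uncond, eps_text, eps_unsafe,
                       guidance_scale=7.5, safety_scale=7.5):
    """Combine three denoiser outputs (noise estimates) for one
    diffusion step: unconditioned, conditioned on the user prompt,
    and conditioned on a fixed 'unsafe concepts' prompt. Simplified
    sketch of safe latent diffusion's safety guidance."""
    # Standard classifier-free guidance: push from the unconditioned
    # estimate towards the text-conditioned one.
    guidance = eps_text - eps_uncond

    # Safety direction: points towards unsafe content; subtracting it
    # steers generation away from that region of latent space while
    # the text guidance keeps the overall direction of the prompt.
    safety = eps_unsafe - eps_uncond

    return eps_uncond + guidance_scale * guidance - safety_scale * safety
```

At each denoising step, the three estimates come from the same U-Net evaluated with the empty, user, and unsafe-concept prompts; keeping the safety term separate from the prompt term is what disentangles safety guidance from image generation.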

However, negative prompting exhibits drawbacks when compared to the method proposed by the researchers. Empirical evaluations indicate that Safe Latent Diffusion, which maintains a distinct direction and estimate for the concepts to be avoided, disentangling the safety guidance from the image generation, significantly outperforms negative prompting. Also, employing negative prompts or modifying the text prompt to include the word 'dressed' yields suboptimal results, including excessive alterations in the generated image. The aim is to generate an image closely resembling the original with all inappropriate content removed. Negative prompting often fails to achieve this goal, resulting in a completely different image.

Regarding the next steps for this work, Manuel tells us they already have a preprint out, scaling up the evaluation. "We're now looking into a more general approach, evaluating the plethora of text-to-image models," he reveals. "First, assessing their inappropriate degeneration. Do they generate this content, and at what scale? Then also looking at their mitigation strategies. Can we use these instructions to avoid generating this content? We've introduced a new benchmark in this paper, which can be used for evaluating inappropriate degeneration in diffusion models, and I think this will be valuable for the community."

Patrick continues: "It's actually already being used. For example, some papers from other labs erase these concepts from the model. They're tuning the model to forget. If readers are working on anything like that, they can use our benchmarking dataset."

My first CVPR

Ali is a second-year PhD student at Hanyang University in South Korea: "My primary research interest lies in the realm of low-level computer vision applications, particularly in the field of media restoration. I am excited to be a part of CVPR as it provides me with an invaluable opportunity to connect with professionals in my field and learn from seasoned researchers who are spearheading revolutionary advancements worldwide!"

Women in Computer Vision

Linjie Li

Linjie Li is a Senior Researcher at Microsoft in Redmond, Washington. Read 100 FASCINATING interviews with Women in Computer Vision.

You are also a co-author of a paper accepted this year at CVPR. Is that right?
We actually have five papers.

Five papers?
I think five or six. Some of them are from my interns, and there are two papers where I'm the first author.

What is the field you're working in?
For the last four years of my career, I've been working on vision and language research. Basically, we are training models to understand both vision and language modality inputs together.

Is this something that you recommended to Microsoft or something that Microsoft recommended to you?
It's a long story! [laughs] About five years ago, when I graduated, I was in a PhD program, but I didn't graduate with a PhD degree. I was interning with Tesla and noticed that the industry was doing some research, and there are a lot of interesting and applicable problems. So I chose to quit my PhD around 2018 and interviewed for a position in industry. I actually interviewed as a software engineer first with Microsoft. The team that gave me the offer suddenly lost their headcount, and I was contacted by my previous manager, who's not at Microsoft now: Jingjing Liu is now a professor at Tsinghua University, a top university in China. She saw my resume and felt like I had a strong background doing research when I was a student. She decided to take me on as a research engineer. The position of her team was based on vision and language. At the time, it was still a new field. People were either doing computer vision or NLP, right? So, our team was dedicated to five or six researchers working on vision and language, building models that can solve both problems.

It's a kind of convergence?
Yes, it's a convergence.

So there was a little bit of bravery in getting into a new field like this.
Yes, and also from my career perspective, I knew there was a chance I probably wouldn't be able to do research again with a master's degree. Those places usually require a PhD. It was lucky that Jingjing found me and offered me a position where I can still do research. So there was a little bit of luck, a little bit of the unexpected, and a little bit of courage involved.

Yeah. Which one was the most?
Maybe courage. [laughs] I don't know.

Because you were already doing something very brave, which is being very far from your house.
Yeah, I don't like traveling.

And also being in a place where the language is not your main language. So there was already something. Then you took an additional courageous step and went into a new field, not knowing what was going to happen.
Yeah, you could say that.

What could have happened?
It could be like, I will never do research again. Then I would just do engineering.

And you didn't want to do engineering?
I'm still passionate about research. At the time, I was thinking that if I joined Microsoft as an engineer, inside Microsoft, there are research teams. Then I'd try to find opportunities inside Microsoft and switch from engineering to research. The reason why is that I love open problems. In most engineering systems, you have a very clear goal. In research, it's not like that. You can have a goal, but you cannot plan everything ahead, because you don't know! Maybe this approach works, or you think it's going to work, but it's not going to work in the end. Then you need to shift it. So that's why I love research.

Tell me something about Microsoft that we don't know.
I don't know if you know this, but Microsoft has really diverse teams. Our team is not a pure research team. We are a research team that will make product shipments. So in our team, we not only have researchers, but we also have research engineers to ship our models to a product. I don't know if this is old news to you. [laughs]

No, it's new! How many people are on your team?
We have eight people on our team. Four are researchers, and the other three are research engineers. Then we have my manager.

If you were told years ago that this would be the development of your career, would you have signed for this?
Yes, I'm very happy with what I have achieved right now.

What would be a sign of success for you in five to ten years? What is something you really dream of now?
A little bit of news about me is that I applied for a PhD program again. This time I will be working with Microsoft and doing my PhD. I'm hoping I can get the degree, as this degree will be a hard requirement for a professorship. So my plan right now is that one day I want to be a professor. I don't know if it'll be in my hometown, like in China, or in the US. But after quitting one PhD program and applying for a new one, that is the main reason why I want another PhD study.

Will you do it in San Diego or Seattle?
In Seattle, at the University of Washington. The advisor that admitted me is Yejin Choi: she's giving a keynote here on Wednesday.

What is the one talent or skill that you are missing now to become a good professor?
Maybe a long-term plan. Compared to all my advisors, not only from Microsoft but also the advisor I will be working with at the University of Washington, they all have this grand picture and a long-term plan for their research directions. I think I'm not that far yet, but I'm working on it.

On the other hand, you have the autonomous capability of research. You can do your own research, and you know how to do it.
Yes, after five years of training at Microsoft. [laughs]

Is America your new home?
I got my green card, so I have a residence permit right now. But I'm still single, so I don't have a sense of home yet because my parents are still in China.

It's funny because every time I cross the border to come into the US through immigration, the customs officer will always ask, where's your home? And I always reply with: my home in the US or my home in China? And he says, you tell me! [both laugh] It's a little awkward.

What about a citizen of the world?
Yeah! [laughs] … a citizen of the world.

There must be things in the US that you like very much, since you have been here for many years.
Well, I guess it's mostly in my career, I think.

That means your career is important enough for you to make it set the place where you are going to live.
Yes, although China has been evolving very fast in this AI field.

True.
I feel like the bigger news always comes from the US first, like GPT and stuff.

What is your drive?
I feel like it's very similar to other people. Especially in research, when you write a paper, then in Google Scholar, you see your paper there; that's some kind of certificate for me. Like, you accomplished a project, and then this project is well formatted and well summarized in this paper. And then, seeing my paper count and also seeing the citation numbers grow means people are recognizing my work. That's very satisfying for me. Especially one paper, called UNITER. It's my highest-cited paper right now. This work has been recognized by many colleagues in this field. I hope in the future I will do more work like that.

All my readers want to know the recipe for having highly cited work. How did you make it? What is the recipe for success?
I only have one highly cited work, so I can only speak from that perspective. UNITER was proposed at a very early stage of pretraining in vision-language research. We saw the success of BERT in NLP, and we quickly realized you can do similar things in vision-language too: just by taking the transformer architecture and taking the image feature as input, and then when you train with large image-text pairs, the model can learn vision capabilities too. It was not only us who thought of that. We saw similar works around the same time. Actually, that was September of 2019, and there were a few similar works popping up while we were still working on the project.

So you were in a hurry to make it?
We were in such a hurry to make it out! Because we saw other projects reporting numbers, like benchmark performance of models, while we were still working, we tried very hard to beat all of them through careful model training techniques. Also, we tried to cover all the benchmarks we could think of in vision-language research. Some of the papers around the same time that were on arXiv focused on three or four benchmarks, and others focused on another three or four. So we tried to cover all of the benchmarks. Then after our paper was out, and we successfully published it at ECCV 2020, we did tons of follow-up work. That is also an important reason to get highly cited works. Not only do you need to promote your work to other people so they recognize and cite it, but you need to do follow-up work yourself. Basically, it's another way of promoting your work. It's a progressive way to accumulate citations.

From a level of zero to 100, how much do you enjoy this game?
100. [laughs]

I hope you have many more hundreds in the future. I have one last question for you. You come from a very special place, far away. You evolve in a community that is changing dramatically. What is your point of view? What would you like to change in this community?
That's a very hard question.

Thank you. I came from far away to ask this question. [both laugh]
On the research side, I think it's always good to bring in more diversity, no matter the background of where the research is coming from. Also, I feel like diversity in culture will somewhat be reflected in the diversity of people's research ideas as well. I saw this in other people, like my professor's work studying the social bias of language models. Sometimes you think something is not an issue in your cultural background, but people from other cultural backgrounds will see the social bias. People are just so used to the social norm. Nowadays, large language models or large multimodal models that are coming out, like GPT-4, will be deployed to products like Microsoft products. Studying social bias in these large models is very important. I think we should change to maybe accepting PhD students or giving work offers to researchers who bring more diversity into the community.

If you look around, you will see there are more Asian faces. I see quite a few Asians here at CVPR…
Yeah, a lot, right? Asian faces in this community. I think we need to pay attention to bringing more minorities into the field as well!

CVPR Stands with Ukraine

Pavlo Melnyk is a PhD student at the Computer Vision Laboratory, Linköping University, developing neurons with spherical decision surfaces for geometric deep learning. Pavlo is wearing a Vyshyvanka, a traditional Ukrainian shirt.

Olga Gavranovic is a Senior Software Engineer at ImFusion GmbH, a German company specializing in research and development in the field of medical image processing and computer vision.

Yaroslava (Yara) Lochman is a PhD student at Chalmers University of Technology. She is working on learning priors and representations for matrix factorization problems, in particular non-rigid structure from motion. Yara too is wearing the traditional Vyshyvanka shirt.

Denys Rozumnyi is a final-year PhD student at ETH Zurich and a researcher at CTU in Prague. His main research topics are 3D reconstruction and deblurring, especially of highly motion-blurred objects.

Photo: Matias Valdenegro

UKRAINE CORNER

Oleksandr Maksymets is a Ukrainian researcher working at Meta AI Research. During CVPR, he co-organized the Embodied AI workshop and also ran the robot demo "Language-guided Skill Coordination for Open-Vocabulary" at the Meta CVPR booth.

Russian Invasion of Ukraine
CVPR condemns in the strongest possible terms the actions of the Russian Federation government in invading the sovereign state of Ukraine and engaging in war against the Ukrainian people. We express our solidarity and support for the people of Ukraine and for all those who have been adversely affected by this war.

Congrats, Doctor Max!

Max Zhaoshuo Li recently completed his PhD at Johns Hopkins University (JHU), working with Russell H. Taylor and Mathias Unberath. His research focused on developing vision-guided systems for surgery, which reconstruct 3D structures densely and track moving objects accurately. He also worked on developing infrastructure for next-generation mixed reality and autonomous systems. He will be joining NVIDIA as a research scientist to contribute to the development of 3D AI systems. Congrats, Doctor Max!

Metaverse? Mixed reality? Autonomous systems? All these applications require a system that can understand the 3D motion and structures of a scene. The computer vision community has a strong interest in video-based solutions due to the widespread availability of cameras. Such a solution could revolutionize healthcare systems by aiding clinicians and surgeons with decision making. However, accurately and reliably recovering 3D information still poses a challenge. In his PhD dissertation, Max addresses these challenges with innovative computer vision algorithms.

His algorithm named Stereo Transformer (Li et al., ICCV 2021) is the first to use attention mechanisms to achieve robust out-of-distribution depth estimation, even when trained only on a synthetic dataset. Stereo Transformer is well suited for scenarios where collecting data is difficult, such as surgical scenes. Furthermore, the reasoning process of Stereo Transformer can be examined visually, making it a white-box algorithm.

Yet, Stereo Transformer can only extract sparse geometric information. Thus, Max introduced Neuralangelo (Li et al., CVPR 2023) to recover dense and accurate geometry from casually captured videos. Neuralangelo enables high-fidelity surface reconstruction, achieving sub-millimeter accuracy for surgical procedures. It also allows users to reconstruct large-scale scenes, such as creating virtual models of the JHU campus (Fig. 1). Neuralangelo represents a critical step towards building virtual representations of real objects.

Max's research also explores tracking objects in 3D space over time. He proposes the TAToo algorithm (Li et al., IPCAI 2023) to track the surgical tool and patient simultaneously from video input. TAToo employs an end-to-end differentiable pipeline with geometric optimization to achieve millimeter-level tracking accuracy. TAToo has the potential to replace commercial motion capture devices like optical trackers and make motion tracking more accessible.

With the advancements mentioned above, Max has pioneered the concept of digital twins, where virtual models fully mimic real-world processes. In collaboration with others, Max co-developed the Twin-S system (Shu et al., IPCAI 2023), which models surgical processes with great accuracy (Fig. 2). This serves as a foundation for the next generation of surgical guidance systems, where users are provided with computational analysis that is otherwise hard to obtain.

Looking ahead, Max is eager to explore more advanced systems that not only recover motion and structures from video input, but also understand 3D geometries. Such a system could truly democratize the creation of the metaverse and autonomous systems. For more information, see Max's website.

ICRA 2023 took place a few weeks ago in London. With 6000 attendees overall, it broke all previous records! All photos courtesy of Lily Goli. In the left image, Lily is presenting her acclaimed nerf2nerf paper.

See you in Seattle at CVPR 2024!

JULY 2023

Medical Imaging Lab

Shadi Albarqouni is Professor of Computational Medical Imaging Research at the University of Bonn and University Hospital Bonn. He is also a Young Investigator Group Leader at Helmholtz AI, Helmholtz Munich, in Germany.

The Albarqouni Lab focuses on three main areas of study: computational medical imaging, federated learning, and affordable AI. In computational medical imaging, it develops better, faster, and more accurate deep learning algorithms to make the lives of clinicians and physicians easier, regardless of disease or modality. An important part is working with data from different hospitals without requiring extensive radiologist annotations. Shadi explains that the lab's work includes developing algorithms that can learn from a limited amount of data; adapt to different domains, scanners, and demographic factors; and generalize to unseen classes or domains. It has also explored using non-imaging data, such as electronic health records, through geometric deep learning. Additionally, it has worked with colleagues from Stanford and ETH Zurich on making models more interpretable and trustworthy.

Federated learning involves training algorithms in a distributed manner while preserving patient privacy. This approach allows the lab to leverage data from diverse medical settings without compromising the confidentiality of sensitive information.

The third theme, which Shadi says is closest to his heart but one where they have not yet seen as much progress, is affordable AI. "The devices themselves should be affordable, but we're focusing more on computational algorithms that could work in low-resource settings," he tells us. "This is something I'm highly interested in. Developing these algorithms and deploying them in low- and middle-income countries, or even remote villages in developed countries!"

One of the lab's recent notable achievements is a study published in Nature Machine Intelligence, where an algorithm was trained to detect and segment lesions or tumors in MR brain imaging data without annotations. By training the algorithm on healthy patient data from multiple hospitals, it learned the characteristics of a healthy brain. When provided with data from patients with lesions or tumors, the algorithm generated a pseudo-healthy version of the patient, allowing for accurate quantification and analysis of the aggressiveness of the condition.

The Albarqouni Lab aims to develop the next generation of AI in medicine to improve the clinical workflow for patients and professionals. If you are a postdoc or a PhD, you may be pleased to hear they are hiring for two new positions on a DFG-funded radiotherapy project. If that piques your interest, do not hesitate to contact them!

To learn more about the work of the Albarqouni Lab, including their other federated learning projects, the challenges they've overcome, and their open-source, cross-silo healthcare dataset suite, check out our video interview with Shadi above.
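The pseudo-healthy idea above follows a common unsupervised anomaly detection pattern. Here is a generic, hedged sketch of that pattern (an autoencoder trained on healthy scans only, with the anomaly map taken as the reconstruction residual); this is our own illustration, not the lab's published model:

```python
import torch
import torch.nn as nn

class HealthyBrainAE(nn.Module):
    """Autoencoder trained only on healthy MR slices. At test time
    its reconstruction acts as a 'pseudo-healthy' version of the input."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def anomaly_map(model, scan):
    """Lesion candidates appear where the scan deviates from its
    pseudo-healthy reconstruction."""
    with torch.no_grad():
        pseudo_healthy = model(scan)
    return (scan - pseudo_healthy).abs()
```

Because the model never sees lesions during training, it cannot reproduce them, so the residual highlights exactly the regions that do not belong in a healthy brain.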

AI for Ophthalmology by RSIP Vision

In medical imaging, especially in radiology, deep learning has been used in recent years to gain super-resolution. On one hand, it is possible to scan faster and more efficiently at lower resolutions. On the other hand, quality scans provide added benefits. Super-resolution is a class of techniques in image processing that enhance the resolution of an imaging system, making images sharper and clearer. By improving the quality of OCT images, super-resolution can potentially enhance our ability to detect and analyze pathological changes, thus facilitating more accurate and early diagnosis of various eye conditions. OCT images are frequently used to image the retina, understand different pathologies, decide the diagnostics, and help with procedure planning.

The figure above shows the predicted image compared to the ground truth, where the prediction was restored from an image with a resolution reduced by a factor of 2. Note that we are always dealing with images of a given size. What super-resolution or upsampling gives us is "more slices": the distance between slices will be smaller because we have predicted the slices between them.

Super-Resolution in OCT images

This is why we use super-resolution in OCT images too, to get better resolution or, alternatively, to gain scanning speed while producing images with the quality needed for diagnostics. RSIP Vision has already done that, developing super-resolution for OCT images. It can also be useful in real-time scans, when high-quality images are needed in real time.

Several steps make up the typical process. It is crucial to select a training dataset which includes distinctive pathologies in the right proportion. Otherwise, the algorithm might be trained only on healthy eyes and would offer no clinical value. Each pathology needs to be represented in the dataset, for several reasons: first, to train the system on that pathology; second, to declare to the regulatory authorities that the new algorithm won't miss any meaningful diagnostics. From the point of view of the compliance officer, he needs to ensure that no pathology is missed due to the process. The way to do this is to run through a list of typical pathologies and validate them one by one.

The training of the convolutional neural network itself can take advantage of several architectures for improving the resolution, in particular the super-resolution CNN (SR-CNN): a model is trained on patches of OCT scans at low resolution, where the higher-resolution scan is known. In this manner, we go from small patches to more detailed patches by inferring what the model knows about expected retinal anatomy; a minimal sketch of this patch-based setup appears below. We have shown that, using a minimal dataset of specific OCT scans, we were able to generalize and obtain a super-resolution model with good results.

It is also important to examine the loss function and adjust it to the specific case of OCT images, which abound in stain-like textures stemming from speckle that are not meaningful. The algorithm engineer has to adjust the loss function accordingly.

RSIP Vision has developed a very precise super-resolution algorithm for OCT. Contact us to check how we can use it for your AI project in ophthalmology.
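For illustration, here is a hedged sketch of the classic SRCNN-style patch model described above; the layer sizes follow the original SRCNN paper and are not RSIP Vision's proprietary design, and the random tensors merely stand in for real OCT patch pairs:

```python
import torch
import torch.nn as nn

class SRCNN(nn.Module):
    """Classic three-stage super-resolution CNN: patch feature
    extraction, non-linear mapping, reconstruction. The input is a
    low-resolution OCT patch upsampled to the target size."""
    def __init__(self):
        super().__init__()
        self.features = nn.Conv2d(1, 64, kernel_size=9, padding=4)
        self.mapping = nn.Conv2d(64, 32, kernel_size=1)
        self.reconstruct = nn.Conv2d(32, 1, kernel_size=5, padding=2)

    def forward(self, x):
        x = torch.relu(self.features(x))
        x = torch.relu(self.mapping(x))
        return self.reconstruct(x)

# Training sketch: pairs of (upsampled low-res patch, high-res patch).
# A plain L2 loss is shown here, though for OCT the loss should be
# adjusted so speckle-induced textures do not dominate the objective.
model = SRCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
lr_patch = torch.rand(8, 1, 33, 33)   # stand-in batch of OCT patches
hr_patch = torch.rand(8, 1, 33, 33)
loss = nn.functional.mse_loss(model(lr_patch), hr_patch)
loss.backward()
optimizer.step()
```

The same trained model can then be applied between acquired slices to predict the missing intermediate ones, which is what yields the "more slices" effect described earlier.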

Congrats, Doctor Nandita!

Nandita Bhaskhar was a PhD candidate in the Department of Electrical Engineering at Stanford University. She recently successfully defended her thesis on Robust, Data-efficient and Trustworthy Medical AI. Nandita is broadly interested in developing machine learning methodology for medical applications. Her current research focuses on developing alternative sources of supervision for medical data, such as observational supervision using passively collected event logs, and self-supervision strategies for data-efficient medical representation learning. She deeply cares about model performance in the wild and works on developing strategies for quantifying model trust and mitigating distribution shifts for reliable clinical deployment. Outside of research, her curiosity spans a wide gamut of things, including but not restricted to biking, social dance, traveling, creative writing, music, getting lost, hiking, and exploring new things. Congrats, Doctor Nandita!

Robust, Data-efficient and Trustworthy Medical AI

Artificial intelligence (AI) has revolutionized multiple fields, including safety-critical domains such as healthcare. It has shown remarkable potential for building both diagnostic and predictive models in medicine using various types of healthcare data. However, despite its potential, there are two major barriers to medical AI development and subsequent adoption in healthcare systems: 1) Training AI models that perform well with a limited amount of labeled data is challenging, while curating large labeled datasets is costly and in several cases not possible; 2) Even well-trained, state-of-the-art models with impressive accuracies on their test sets, developed with rigorous validation and testing, may fail to generalize to new patients when deployed, and they may be brittle under distribution shifts.

In my thesis, I address the above barriers to medical AI development and deployment, namely robustness, data-efficiency and model trust, by presenting three different works that improve upon the current state of the art for various modalities. First, I propose observational supervision, a novel supervision paradigm wherein we use passively collected, auxiliary metadata to train AI models. I show that leveraging observational supervision for structured Electronic Health Records using audit logs improves the performance and robustness of AI models trained to predict clinical outcomes, even with limited labeled training data. Second, I present domain-specific augmentation strategies for self-supervised frameworks that enable large-scale, label-efficient training of AI models. I show that such strategies improve performance over data-hungry, fully supervised models in chest X-ray classification and generalize to both unseen populations and out-of-distribution data. Third, I present TRUST-LAPSE, an explainable, post-hoc and actionable trust-scoring framework for continuous AI model monitoring. I show that TRUST-LAPSE can determine when a model's prediction can and cannot be trusted with high accuracy, can identify when the model encounters classes unseen during training or a change in distribution, and can accommodate various types of incoming data (vision, audio and clinical EEG). Together, these works pave the way for developing and deploying robust, data-efficient and trustworthy medical AI models to improve clinical care.

Topology-Guided Multi-Class Cell Context Generation for Digital Pathology

Shahira Abousamra is a senior PhD student at Stony Brook University, advised by Chao Chen and Dimitris Samaras, and working closely with Joel Saltz's group. She speaks to us about her work on pathology data ahead of its poster at CVPR 2023.

Pathologists analyze cell samples to determine the presence and type of cancer, assess its stage and aggressiveness, and evaluate patient responsiveness. However, automating these tasks is challenging due to the limited data available on the tumor microenvironment. Unlike natural images, which are abundant and anyone can annotate, pathology data, like all biomedical data, requires patient information and expert annotation. To overcome this limitation, this work aims to create a generative model capable of producing synthetic labeled data to augment training and facilitate downstream tasks.

"We realized that when pathologists look at an image or data from pathology, they look at the distribution of different types of cells, how they colocalize, and the patterns they make, which has not been considered in a generative model before," Shahira tells us. "We're trying to address this by generating realistic cell layouts that capture this biology. With a reference sample, you can generate new samples that exhibit the same spatial and structural patterns or characteristics and use them efficiently in training for downstream tasks. This paper is about how to model this kind of tumor microenvironment, the cell layout, and these patterns. Given some specific characteristics, how do we train the model to generate data that satisfies these characteristics?"

Shahira's approach differs from existing work on generating synthetic data in several ways. Previous methods focused on creating visually appealing images, whereas this paper takes it back to the beginning, focusing on the location and distribution of different cell types. The spatial organization and
