ICCV Daily 2023 - Friday

Brilliant Oral and Poster Presentations

Best Student Paper: Qianqian Wang, Tracking Everything Everywhere All At Once

UniverSeg: Universal Medical Image Segmentation

Today’s Picks by: Yiming Xie

Women in Computer Vision: Hilde Kuehne

A publication by RSIP Vision

Yiming’s picks of the day (Friday)

“Hi, I am Yiming Xie, currently a PhD student at Northeastern University, USA, working with Huaizu Jiang. My research is motivated by a desire to enhance the real world by Augmented Reality (AR) via visual computing and machine learning approaches. Some of my recent research includes 3D scene reconstruction from a posed video, 3D object detection from multi-view images, controllable human motion generation, etc.”

For today, Friday (orals and posters):
- PM: Tracking Everything Everywhere All at Once (see our full review on page 3)
- PM: LERF: Language Embedded Radiance Fields
- PM: Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions
- PM: Neural Haircut: Prior-Guided Strand-Based Hair Reconstruction
- AM: LightGlue: Local Feature Matching at Light Speed
- AM: Total-Recon: Deformable Scene Reconstruction for Embodied View Synthesis

“My work at ICCV this year is about online 3D object detection from just a few consecutive views. We train a model to leverage appearance-enhanced queries initialized from 3D points and update the locations of 3D points with a recurrent layer. It achieves better performance, exhibits speedier convergence, and is robust to a varying number of queries!”

“Sad to not be heading to Paris for ICCV this year. Georgia Gkioxari will present our poster in the morning session today, Room “Foyer Sud” - 074: Pixel-Aligned Recurrent Queries for Multi-View 3D Object Detection. I hope everyone attending ICCV has a great time!”

Tracking Everything Everywhere All At Once

Qianqian Wang is a postdoc at UC Berkeley. She recently completed her PhD in Computer Science at Cornell Tech. She speaks to us about her work on estimating motion from video sequences ahead of her oral presentation and poster this afternoon. Read our full review of her winning work in the next pages!

This exceptional work has just won the Best Student Paper Award at ICCV 2023. This interview was conducted before the announcement of the award. RSIP Vision continues a long tradition of selecting in advance the future award-winning papers for full feature! Congrats Qianqian!

In this paper, Qianqian proposes a novel optimization method for estimating the complete motion of a video sequence. It presents a dense

and long-range motion representation that allows for tracking through occlusions and modeling full-length trajectories. This method finds correspondences between frames, a fundamental problem in computer vision. These correspondences are the foundation for various applications, notably dynamic 3D scene reconstruction, as understanding 2D correspondences between frames in a dynamic scene is essential for reconstructing its 3D geometry and 3D motion. The research also opens up exciting possibilities for video editing, allowing for seamless propagation of edits across multiple frames. “I came up with this idea because, in my last project, I realized there was no such motion representation in the past,” Qianqian tells us. “It’s not a new problem, but people don’t have a good solution. The last paper I saw similar to our work was 10 years ago, but because it’s too challenging, and people don’t have new tools to work on it, progress has been suspended for a decade.” Now, renewed interest in this problem has sparked concurrent research. While approaches may differ, the shared goal remains the same – tracking points in a video

over extended periods. However, the road to achieving that is not without its challenges. “The first challenge was to formulate the problem because it’s different from what most people did before,” Qianqian explains. “We have sparse feature tracking, which gives you long-range correspondences but they are sparse. On the other hand, we have optical flow, which gives you dense correspondences, but only for a very short period of time. What we want is dense and long-range correspondences. It took a little bit of time to figure that out.” An important moment in the project was realizing the need for invertible mapping. Without it, the global consistency of estimated motion trajectories could not be guaranteed. It was then a challenge to determine how to represent the geometry. Parameterizing the quasi-3D space was far from straightforward,

which led to the team exploring the concept of neural radiance field, a dense representation offering the flexibility needed to optimize scene structure and the mapping between each local and canonical frame. The work opens up opportunities for future extensions, including using similar principles for reconstructing dynamic scenes and enhancing video editing techniques with speed and efficiency improvements. “Compared to other correspondence work, our approach guarantees cycle consistency,” Qianqian points out. “We’re mapping it to 3D space, which allows it to handle occlusion. That’s a nice property because most works on motion estimation remain in 2D. They don’t build a consistent 3D representation of the scene to track.” Qianqian is originally from China but has been in the US since starting her PhD in 2018 and says it is a “very welcoming and inclusive” environment. Her advisors on this project at Cornell Tech were Noah Snavely and Bharath Hariharan. “Noah is the most wonderful advisor in the world,” she smiles. “He’s super caring. He has very creative ideas and guided me through the whole process. We discussed along the way and then figured out the right formulation for the problem. He pointed me to important papers that inspired me to work on this. He’s super helpful, and I appreciate his guidance!” In her new position at UC Berkeley, Qianqian works with two exceptional professors, who are also great friends of our magazine: Angjoo Kanazawa and Alyosha Efros. She is in a transition stage but

plans to continue working on motion estimation, 3D reconstruction, and video understanding, particularly fine-grained and deep video understanding. She adds that if we better understand motion in a video, we’ll better understand higher-level information, like semantic information. Where does she see herself in 10 years? “That’s a very hard question to answer,” she ponders. “I still want to do research and contribute to the field of computer vision. I hope to find a faculty position in a university and stay in academia, but if that doesn’t work out, I’m also fine to find a research position in industry. I’ll keep doing research. That’s something I know!” To learn more about Qianqian’s work, visit her poster this afternoon at 14:30-16:30 and her oral presentation at 16:30-18:00.

Did you miss yesterday’s fascinating interview with Yann LeCun? Read it here!

ICCV Daily. Publisher: RSIP Vision. Copyright: RSIP Vision. Editor: Ralph Anzarouth. All rights reserved. Unauthorized reproduction is strictly forbidden. Our editorial choices are fully independent from IEEE, the Computer Vision Foundation and ICCV organizers.

UniverSeg: Universal Medical Image Segmentation

Victor Ion Butoi (right) is a second-year PhD student, and Jose Javier Gonzalez Ortiz (left) is a final-year PhD student at MIT, supervised by Adrian Dalca. They speak to us ahead of their poster this afternoon on UniverSeg, a novel method for solving unseen medical segmentation tasks without additional training.

Medical professionals, including clinicians and scientists, frequently encounter new segmentation tasks and protocols for different anatomies, necessitating the training of additional or fine-tuned machine learning networks. Unfortunately, this process can be time-consuming and inaccessible to many due to the technical expertise and hardware required. “This kind of problem necessitates that you have a model that is flexible to new tasks, without retraining, that can be deployed at inference,” Victor begins. “Our model, UniverSeg, can segment new tasks without retraining across a variety of different anatomies. It takes advantage of context learning, a mechanism that hasn’t been well explored in medical segmentation thus far. It has a place as a foundational model for different applications.” This work is not about enabling more tasks but rather simplifying the adoption of machine-learning tools for medical segmentation. When a clinician encounters a new problem and collects medical scans, it can take months to collaborate with an expert to train and deploy a model. UniverSeg streamlines this process by eliminating the need for training.

“We’re trying to make the task easier by having a model that you just need to use instead of build,” Jose explains. “Then, clinical researchers will have a much easier time just using the model. Instead of months, hopefully, it will take them hours or minutes. Beyond that, it will help them iterate faster on their research. We’re still doing the same segmentation tasks; we’re just making the whole process simpler to reach a wider audience of researchers.” When we ask whose idea UniverSeg was, the team quickly emphasizes that it was a collaborative effort spanning multiple institutions over several years, including Cornell, MIT, and Harvard. It involved extensive work on data, training on over 50 datasets, and iterating on a consistent format for the model to train on. Decisions on how to evaluate on out-of-distribution datasets and architectural choices, such as the new CrossBlock mechanism introduced to produce accurate segmentation maps, required the combined effort of researchers to create a model capable of adapting to new tasks seamlessly. While UniverSeg is a significant boon to research, its potential extends to real-world medical applications. Jose acknowledges that deploying such technology in the medical field, with its high stakes, is challenging. “If you had told me this was possible when I started my PhD, I would not have believed you,” he reveals. “This is a really good first milestone. As people keep working on it and improving on the ideas, I see a bigger potential for getting this into real-world systems, but it requires a lot more testing and evaluation when you’re making very important decisions.” Victor continues: “A large part of our work is helping boost the efficiency of these medical professionals. Suppose you have a small dataset tied to a particular disease or sub-group of people you don’t have a model for. You’d like to measure things about the subjects’ brains, such as white matter hyperintensities or hemorrhage, specific to this one group. Segmentations of these quantities

are very useful because they help clinicians determine things about subjects tied to diseases such as Alzheimer’s. You can use a UniverSeg network to do automatic segmentations because you can provide a context of similar examples without retraining, letting you perform diagnoses much faster. Maybe we’re a step before in the pipeline, but there’s some clinical use.” Adrian, an Assistant Professor at Harvard Medical School, a Research Scientist at MIT, and a long-time friend of our magazine, points out there is a distinction between clinical researchers and clinicians here. “Clinicians may be a little further just because they have all these limitations of what they’re allowed to use and whatnot,” he adds. “For clinical researchers, we’re maybe one step removed, but with improvement, I think we can work with our collaborators to use this.” UniverSeg has laid a strong foundation for further research in context-driven segmentation models. While the model primarily focuses on binary segmentations and 2D slices, there’s immense potential for expansion. The team has already begun work on extending the model to multi-class segmentation and 3D data, addressing the complexities of a wide variety of medical scans. “We’ve built an incredibly important foundation,” Victor asserts. “But there are many directions we would like to go such that it is more applicable.” The model presents a promising research direction for young scholars. One of the team’s most significant challenges was amassing a large and diverse data collection to ensure the model generalized across different datasets and anatomies. As they added more datasets, the model kept improving, consistent with what has been seen in the general AI space and the progress of popular models like ChatGPT. “The literature has focused for a long time on the model aspects and not as much on the data,” Jose tells us. “I would encourage researchers to scale up the data

collection and standardization efforts because to develop this model, we did a lot of research on the model and machine learning, but it wouldn’t have worked if we didn’t devote a considerable amount of time and effort to collecting the data and deriving different protocols for making it as diverse as possible.” Summing up and reflecting on what is particularly special about the work, Adrian highlights its potential to break down barriers, opening up new avenues for clinical researchers and enabling studies previously hindered by technical challenges. “It opens up a lot of new technical areas in my lab and our sphere, and that’s exciting,” he says. “But probably the fact that it can advance science is the most exciting!” To learn more about the team’s work, visit their poster this afternoon at 14:30-16:30.

Women in Computer Vision: Hilde Kuehne

Hilde Kuehne is an Associate Professor at the University of Bonn as well as an Affiliated Professor at the MIT-IBM Watson AI Lab in Cambridge.

What is your work about, Hilde?
I'm working on everything in multimodal learning at the moment. Technically, it means trying to figure out how we can learn from different modalities and across different modalities. So we started with video, where it's obvious that you have more than one modality. Video is not only the vision part; it also has audio. In most cases nowadays, it comes with ASR. So there's also text! Actually, the interesting thing is that we figured out that one modality can actually be used to enhance learning for the other. I think that's a super cool thing because, similar to vision language models, it kind of frees us a bit from having to use annotation. It also opens up the space for anything free text. I think that's super cool!

Tell us why it is super cool.
Oh, that's a hard one. [laughs]

I am here for the hard ones!
Okay, so I especially come from video understanding and action recognition. One problem that we have with actions, probably more than with objects or anything else in the world, is that they are very hard to describe. People usually have a very good understanding of what an object is like. A mug is a mug, period. But actually, understanding actions highly depends on your world knowledge, on your expert knowledge for a specific task, and so on. Therefore, describing actions by pure categories usually works for a certain subset of tasks. This is what we have in current data sets, but it's usually not enough to capture all actions that are going on in the world. Therefore, moving away from pure classification, especially in the context of action and video understanding, is very important. First, having foundation models that actually transfer much better than what we have at the moment, and second, actually to get closer or to do even more for real-world applications.

I understand now why it is cool. Is it cool enough to dedicate the best years of your career to research?
[laughs] Absolutely!

So what is best, teaching or researching?
[hesitates a moment…] Both have good sides and bad sides. If I had to choose at the moment, I would probably say research. However, teaching and research are not separate for me. I mean, obviously, there are lectures. But technically, teaching and research happen together. When we have good Master's students or even PhD students, and they do research, technically, we also teach them on the fly how to be good researchers. This is something that I really love. So, actually, it's both.

Isn't it funny that most of the research is done by people who are not yet proficient in research? They are just learning to research.

[hesitates a moment…] Perhaps yes, perhaps no. The interesting thing that we have is that a lot of the people who are doing the real research are researchers in training, if you want to see it like that. But let's say the interesting thing is when you look at what those people then later do, either they move on to industry and apply what they have [learned] to build crazy good products, or they actually move on to academia and start educating the next generation of researchers. In this sense, it makes sense that it's kind of like a self-reinforcing system.

You already have a few years of research behind you. Maybe you want to tell me what you consider your best find till now. What are you most proud of?
Well, I have done some data sets, and I'm still surprised that they are still around by now. I would have guessed that each of them would last probably for two to three years, and then they would be replaced by something way cooler. They are both still around, and I don’t know why.

Oh, you can mention them! We are not shy.
[laughs] Okay, I have done HMDB and Breakfast. It's actually very cool to see that people still find them useful. However, when you ask me what's the most important thing that I have done, honestly, it will always be one of my current projects. So, the current ones are always the most important to me, no matter what I have done in the past.

What are the current projects? Can you share something with us?
Yes, all the projects that I do at the moment are about multimodal learning. Technically, they all somehow deal with this question of how to bridge modalities. With this respect, many of them are actually not so much about building new architectures but understanding what current systems are doing and how to make this better. One of them is, for example, a paper that will be published at ICCV about learning by sorting. For example, we show that by changing the loss

function, we can learn embedding spaces that are better suited for K-nearest neighbor retrieval. And as retrieval is one of the cornerstones for multimodal learning, this is something that's actually pretty cool. We have a lot of interesting papers on the usage of language together with video or how we can actually make text better for video. We have some interesting ones, which will hopefully be available soon, but I cannot talk about them at the moment. [laughs]

Let's tell the ICCV people that they should come to the posters of Nina Shvetsova, Sirnam Swetha and Wei Lin. Come to the three posters of these young and fine people and ask questions. You might, by chance, find Hilde there. Three posters are no mean feat!
Absolutely! Actually, there are four posters.

Which is the fourth?
Nina has two. Nina has Sorting and In-Style.

Okay, so you will have to tell me about Nina, who was able to get two first-author posters at the same conference… What is special about her way of working?
[laughs] Well, let’s first say Nina is great! Nina is also working with me. Nina is my first PhD student, so it's always something special. And first, just to not overstate, the sorting paper was a lot of hard work, and it got rejected twice or even three times. Whoever gets rejected always resubmits and makes it better. At some point, it will work. But the second thing is this In-Style paper, which is a bit more about this research on how to use language models to make video annotations better. So that's generally Nina's idea; it’s all hers. I think it's super cool work, and hopefully, it helps the video community to solve a few of our problems.

What is the most difficult thing that you have done in this field until now?
Oh, my God. That's a good question!

Thank you.
[laughs] Um, I don't know, actually, because every project is kind of unique!

Every project has ups and downs! I could not say, okay, this was especially a nightmare because of blah… There is no crazy outline or crazy point where I would say, okay, this was a nightmare because of this.

Did you ever think, “This is too tough, I give up”?
[laughs] One thing that I always wanted to do that never worked, and I would still love to, is actually binary networks, like real binary neural networks.

Why don't you do it?
Because it's tough, it's just a super tough problem.

Perhaps some ambitious researcher in this community will say, I want to do that and will ask you for advice.
My advice is probably to do something else. [both laugh]

What is the best advice that you have given, and what is the best advice you have received?
[hesitates for a moment] So, the best advice I ever received in my life was probably when I was considering studying computer science. I was

definitely not planning to do computer science in the first place. Actually, I was leaning more towards arts and design. But I went to the study counselor of the university, and he told me, “If you want to study computer science, go home, start programming, and if you love it, just come back, then you're right for this!” Obviously, I went home and started programming. I loved it, and so I went back, and that's the rest of the story. I guess I don't know what's the best advice I ever gave to people because I'm randomly blurting out stupid stuff all the time. [laughs] You only have to ask people what's the best advice they ever got. But I think if I had to give advice, it would be exactly that. If you want to do something, if you consider doing a PhD, try publishing. If you love it, come back.

Can you tell us about the MIT-IBM Watson AI Lab?
First, the lab is a collaboration between IBM and MIT. Technically, it's a very interesting lab because it's an industry lab, but it's run in a very academic way. So, it sometimes feels more like academia than industry. And the reason for that is our work is mainly project-based, like in academia. We have to hand in proposals for projects like in academia. Each project is then actually headed by one PI from MIT and one PI from IBM. So, it's always both sides involved. I think this makes it a bit special and super interesting as it's exactly at this intersection between academia and industry. This is actually where I feel most comfortable because I really like academia, I really like industry, and I always looked for a place where I can have a balance between both of them. I don't want to be 100% on one side. I also don't want to be 100% on the other side. I always want to be in between somehow.

Elementary, Mr. Watson! You have found the right balance between both.
Exactly!

We have spoken about the present, and we have spoken a little bit about the past. Let's speak about the future. What is your future?
As I just started in Bonn, I guess currently, it's mainly settling in. Actually, I am starting to hire more people because I also got an ERC Starting Grant last month.

Oh, very nice! Do you need people?
Yeah.

PhD students or postdocs?
Actually, both.

Guys, if you read this and you are interested, you have an incredible chance to work with awesome Hilde Kuehne. Don't miss it, or both Hilde and I will be disappointed. [both laugh] BTW, you are also a program chair at the upcoming WACV, and this is a baby that is very dear to your heart. I have one last question for you, Hilde. It's about ICCV. What do you expect from the upcoming conference?
I'm super looking forward to the workshops and to the poster session. I have to say, I love the poster session. I will try to stop by every video-related poster, I promise!

CVAAD Workshop

Noah Snavely delivered an interesting keynote at the 1st Computer Vision-Aided Architectural Design (CVAAD) workshop during ICCV. Program Chair Seyran Khademi praised his talk, titled “Towards Generative Architecture”, noting its enthusiastic reception among the audience. During his presentation, Noah highlighted the potential of scene synthesis and 3D modeling in the realm of architectural representation. He also shared his passion for monumental buildings and explored the possibilities of enhancing generated building graphics through crowd-sourced public data.

My first (in person) ICCV

Posters ☺

Yael Vinker is a PhD student at Tel Aviv University. Here she is shown during her oral presentation at ICCV. In her own words, “Our method converts a scene image into a sketch with different types of abstractions by disentangling abstraction into two axes of control: fidelity and simplicity. We produce a whole matrix of sketches per image, from which users can select the desired sketch.” On the right pages, she observes co-author Yuval Alaluf present their work.


UKRAINE CORNER

Dmytro Fishman is a young group leader at the University of Tartu, Estonia. His research group focuses specifically on the analysis of various biomedical imaging data, aiming to empower biologists and healthcare professionals with state-of-the-art technology. In collaboration with world-leading companies such as Revvity and Syngenta, they are developing advanced techniques for microscopy image analysis, including classification, segmentation, and detection of cells. Also, Dmytro's group works closely with local Estonian hospitals to automate image analysis, to significantly speed up the workflow in pathology labs and improve patient outcomes. Finally, two years ago, Dr. Fishman co-founded a med-tech startup, Better Medicine, intending to bring his research to practice, saving lives and empowering healthcare professionals with cutting-edge technology. Dima is wearing a traditional Ukrainian Vyshyvanka.

Anna Rohrbach recently became a full professor at TU Darmstadt; she is looking for prospective PhD students and postdocs in the area of Vision and Language / multimodal AI. Get in touch! Anna is wearing the traditional Ukrainian Vyshyvanka. ICCV's sister conference CVPR adopted a motion with a very large majority, condemning in the strongest possible terms the actions of the Russian Federation government in invading the sovereign state of Ukraine. We decided to host a Ukraine Corner also in the ICCV Daily.

SGAligner: 3D Scene Alignment with Scene Graphs

Sayan Deb Sarkar is a Computer Science master’s student at ETH Zurich majoring in Visual Interactive Computing. Currently, he is interning with Qualcomm XR Research in Amsterdam. In this paper, Sayan presents a novel method for aligning pairs of 3D scene graphs robust to in-the-wild scenarios. He speaks to us ahead of his poster this afternoon.

Generating 3D maps of environments is a fundamental task in computer vision. These maps must be actionable, containing crucial information about objects and instances and their positions and relationships to other elements. Recently, the emergence of 3D scene graphs has sparked considerable interest in the field of scene representation. These graphs are easily scalable, updatable, and shareable while maintaining a lightweight, privacy-aware profile. With their increased use in solving downstream tasks, such as navigation, completion, and room rearrangement, this paper explores the potential of leveraging and recycling 3D scene graphs for creating comprehensive 3D maps of environments, a pivotal step in robot-agent operation. “Building 3D scene graphs has recently emerged as a topic in scene representation, which is used in several embodied AI applications to represent the world in a structured and rich manner,” Sayan tells us. “SGAligner focuses on the fundamental problem of aligning pairs of 3D scene graphs whose overlap can range from zero to partial. We address this problem using multimodal learning and leverage the output for multiple downstream tasks of 3D point cloud registration and 3D scene reconstruction by developing a holistic and intuitive understanding of the scene aided with semantic reasoning.” Sayan demonstrates that aligning 3D scenes directly on the scene graph level enhances accuracy and

speed in these tasks and is robust to in-the-wild scenarios and scenes with unknown overlap. This finding opens up exciting possibilities to unlock potential in fields like graphic design, architectural modeling, XR/VR experiences, and even the construction industry. The ultimate goal is to move toward a gradient reality, where spaces are designed with user needs in mind, fostering immersive connectivity, communication, and interaction on a global scale. This work is the first step toward that goal using semantic reasoning and aligning 3D scenes using a semantic meaning. The journey to develop SGAligner has not been without its challenges. From a technical standpoint, Sayan tells us formulating the project and navigating the complexities of generating and aligning scene graphs was difficult. “We wanted to be inspired by the language domain and then applied a similar setting in our computer vision problems,” he explains. “That was one challenge figuring it out. The second part was understanding and visualizing which sorts of potential real-life applications we could plug into, and understanding the practicality and how to make this whole module very lightweight so that it is privacy-aware and can be easily shared among people.”

To his knowledge, this is the first work to address the problem of aligning pairs of 3D scene graphs. Its approach sets SGAligner apart – rather than relying on metric data, it operates exclusively on the graph level. This unique perspective confers robustness against various challenges, including noise, in-the-wild scenarios, and overlap. The implications of this approach are far-reaching, opening doors to applications in mixed and augmented reality and even SLAM. Sayan started his master’s degree last September and took up this project in the first semester. One key personal hurdle was the balancing act of pursuing all this while moving to Zurich and adapting to new surroundings. However, any obstacles were surmountable with the support of dedicated supervisors like Ondrej Miksik, Marc Pollefeys, Dániel Baráth, and Iro Armeni. “I originally had reached out to Iro for the project when I was moving to ETH to start my master’s,” he recalls. “She has been very helpful in understanding where I need support and where I can be independent. Dániel is very experienced with point cloud registration and all the technical parts. He was the best person in the community to help me figure out the downstream applications. Marc is one of the grandfathers of 3D computer vision.

He’s been very helpful in driving the project. Ondrej always had a high-level understanding of the project, which helped me because, as a first author, it’s easy to get lost in the technicalities when you work on a project. The last year of working with them helped me get my internship at Qualcomm, and I’m very grateful for all their help and advice.” The impact of SGAligner is already being felt across the computer vision community. Making the code public and releasing the paper on arXiv has sparked interest among fellow researchers exploring various avenues for further development. Potential directions include cross-modal alignments, such as aligning point clouds with CAD models or other modalities, applications in scene retrieval from extensive databases, and multiple downstream applications in AR scenarios. The benefit of SGAligner’s lightweight and privacy-aware scene graphs to the community cannot be overstated. As lead author, this is the first time Sayan has had a paper accepted at a top conference – a fantastic achievement. Does he have any wisdom for those whose papers were not accepted this year? “The computer vision community has grown manifold in the last few years,” he responds. “Please do not be disheartened that your paper wasn’t accepted because we have all been there at some point. Always have a high-level understanding of where your work could play into both industry and academia. It’s not only about solving a new problem but also having real-life applications of the problem. SGAligner was a new problem, but we could also find multiple real-life applications where we could make a difference. Also, don’t forget to do ablations and understand which other works might be relevant for comparison because it’s always good from a reviewer’s perspective to understand how your work performs differently from others.”

Sayan has a research internship in Marco Manfredi’s team at Qualcomm XR, working on improving the performance of lifelong SLAM systems and understanding how multimodality could plug into SLAM. “I’m still a master’s student, so I’m very happy getting exposed to this sort of research,” he smiles. “It’s a journey that started before ETH with Vincent Lepetit, and then with Iro, Marc, and everybody at ETH, it’s continuing!” To learn more about Sayan’s work, visit his poster this Friday afternoon at ICCV, in the poster session from 14:30-16:30.

Congrats to both Michael Black and Rama Chellappa!

Poster: Liliane Momeni

Liliane Momeni presenting her poster “Verbs in Action: Improving Verb Understanding in Video-Language Models”
