WACV 2024 Daily - Saturday

A publication of the Winter Conference on Applications of Computer Vision (WACV) 2024 - Saturday

Eadom's Picks of the Day

Hello! My name is Eadom Dessalene, a PhD student in the Department of Computer Science at the University of Maryland, College Park. I'm a member of the PRG lab under the advising of Yiannis Aloimonos. My research is in the area of action understanding in egocentric video.

For today, Saturday, here are my picks among the posters:
- 786 - Hyperbolic vs Euclidean Embeddings in Few-Shot Learning: Two Sides of the Same Coin
- 1424 - Solving the Plane-Sphere Ambiguity in Top-Down Structure-From-Motion
- VIRTUAL - Can You Even Tell Left From Right? Presenting a New Challenge for VQA
- VIRTUAL - Top-Down Beats Bottom-Up in 3D Instance Segmentation
- 708 - Differentiable JPEG: The Devil is in the Details

My most recent work, inspired by Active Inference, introduces a method for generating action programs over egocentric video using LLMs. These action programs can theoretically be transferred to any robot platform and compiled for execution in the real world. This contrasts with the tradition of regarding action understanding and robotics as two different fields with zero mutual overlap. The work I am presenting today involves a novel two-stream architecture for decomposing action into base physical movements and the context in which the action occurs.

I've had so many interesting conversations with some of the students here. Shoutout to Dima Damen for doing so much for the egocentric action understanding community! I'll be at Poster #366 from 5:30 to 7:15 PM: "Context in Human Action Through Motion Complementarity". Please come and say hi!

I'm really happy to be in Hawaii, experiencing all it has to offer from snorkeling, scuba diving, hiking and whale watching with the brilliant people I've met at WACV! I'm huge into skateboarding (which is huge in Hawaii with all the hills here), but I'm still recovering from a bad wrist injury, so unfortunately I'll have to watch for now.

Oral Presentation

CSAM: A 2.5D Cross-Slice Attention Module for Anisotropic Volumetric Medical Image Segmentation

Alex Hung is a PhD student at UCLA. His paper is trying to solve the problem of anisotropic medical image segmentation. He speaks to us ahead of his oral presentation this afternoon.

Medical image segmentation faces a significant challenge when dealing with anisotropic volumes. A large proportion of volumetric medical data, particularly MRI data, exhibits anisotropy, where the through-plane resolution is much lower than the in-plane resolution.

Both 3D and purely 2D deep learning-based segmentation methods fall short when confronted with volumetric data of this nature. In general, 2D methods lack the capacity to fully harness the volumetric information effectively, while 3D methods face limitations when confronted with variations in volume resolution. “Sometimes 2D methods are more or less sufficient, and then for more isotropic volumes, 3D methods are sufficient, though there’s not as much need for that,” Alex points out. “But there are problems where only anisotropic images are used. In that scenario, it’s better to use a more effective approach.” The dichotomy between 2D and 3D segmentation methods has led him to explore a different approach to reconcile these disparities and enhance the accuracy of anisotropic image segmentation: 2.5D segmentation. These models focus on learning the relationship across slices. However, existing methods typically have many parameters to train.

“We propose a Cross-Slice Attention Module (CSAM) that considers the cross-slice information,” Alex tells us. “We mostly analyze the images in 2D, but we incorporate 3D information into the 2D feature maps. For volumetric segmentation, 3D methods don’t work that well when dealing with anisotropic images, and 2D methods don’t use volumetric information.”

While the problem is not entirely solved, substantial progress has been made by incorporating cross-slice information into the segmentation process with minimal trainable parameters. The proposed 2.5D segmentation approach bridges the gap by encoding and decoding the images in 2D but using 3D information during the attention, so that the 2D feature maps are enriched with crucial volumetric data.

“There’s still room to improve,” he reveals. “For the current method, I mostly use global information during the attention. That’s the information within the entire volume, but I think in some cases, only more local information would be sufficient or even better. Maybe finding a way to use more localized information would further improve the performance.”

This paper is actually a follow-up to Alex’s previously published journal paper, CAT-Net, which he says had some flaws, including requiring a large number of parameters for training. His passion for this work stems from the prospect of overcoming these challenges and contributing a more memory-efficient solution to a less-explored problem.

“What we’re trying to do is called 2.5D segmentation, and that’s not something many people have done,” he tells us. “There are some works that focus on it, but all of them have drawbacks. Some methods have hyperparameters that need to be set, which we don’t want, and others are pretty large in terms of memory.”

To learn more about Alex’s work, visit Orals 3.1 [Paper 3] today at 14:00-15:45 (Naupaka).
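To make the "encode in 2D, attend in 3D" idea more concrete, here is a minimal sketch of a cross-slice attention layer in PyTorch. It is our own illustrative simplification, not the authors' exact CSAM: the class name, the global-average-pooling summary of each slice, and the tensor shapes are assumptions chosen for brevity.

```python
# Minimal sketch of cross-slice attention (illustrative, not the authors' CSAM):
# slices are encoded independently in 2D, then attention across the slice axis
# enriches each 2D feature map with volumetric context.
import torch
import torch.nn as nn

class CrossSliceAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Linear(channels, channels)
        self.key = nn.Linear(channels, channels)
        self.value = nn.Linear(channels, channels)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (S, C, H, W) -- one 2D feature map per slice of the volume
        s, c, h, w = feats.shape
        desc = feats.mean(dim=(2, 3))                        # (S, C) per-slice summary
        q, k, v = self.query(desc), self.key(desc), self.value(desc)
        attn = torch.softmax(q @ k.t() / c ** 0.5, dim=-1)   # (S, S) cross-slice weights
        context = attn @ v                                    # (S, C) volume-aware context
        # Re-inject the volumetric context into every 2D feature map
        return feats + context.view(s, c, 1, 1)

# Per-slice features from a 2D encoder, e.g. 32 slices with 64 channels each
feats = torch.randn(32, 64, 48, 48)
enriched = CrossSliceAttention(64)(feats)
print(enriched.shape)  # torch.Size([32, 64, 48, 48])
```

Because only small attention projections are added on top of a 2D backbone, a layer like this keeps the trainable parameter count far below that of a full 3D network, which is the trade-off the interview describes.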

Poster Presentation

Learning Robust Deep Visual Representations from EEG Brain Recordings

Prajwal Singh is a third-year Computer Science PhD student at IIT Gandhinagar, India, advised by Shanmuganathan Raman. His paper attempts to answer a big question: how can computers reconstruct human thoughts? He speaks to us ahead of his virtual poster session today.

In recent years, significant progress has been made in generative networks, particularly in synthesizing images from text prompts. The exploration of this concept has led to groundbreaking research on brain electroencephalography (EEG) signals to visualize the images generated by the human mind. The journey to reconstruct images from EEG data began in 2017, with pioneers like Spampinato and Palazzo laying the foundations for this work when they published their Brain2Image approach. Their datasets have been a crucial resource for Prajwal, who is looking to advance their progress.

When humans view images, their brains produce chemical responses and electrical impulses between neurons, which an EEG cap can capture. This data can then be stored in a computer, but how do you extract the visual information from the EEG signal?

“We use a contrastive learning approach to extract the features from the EEG data,” Prajwal explains. “Specifically, a triplet loss formulation. Once we have these features, a StyleGAN handles the synthesis part. Different generative networks have been used in the past, but for our work, we used StyleGAN, which generates photorealistic images and has been quite on trend recently.”

To address the scarcity of large datasets required to train deep learning architectures, the researchers used a StyleGAN-ADA network that can generalize across different datasets. Previous methods all dealt with generating images from particular EEG datasets instead of working more broadly.

The potential applications of this work are significant, particularly in improving the quality of life of people with certain medical conditions, such as individuals who are mute or paralyzed, yet whose brains remain active. In those scenarios, an EEG cap could be placed over that person’s scalp to record their brain signals and, ultimately, reconstruct their thoughts.

Each image is generated with different EEG signals across different classes (ThoughtViz dataset).

Prajwal tells us the most challenging aspect of this research has been handling the noisy nature of EEG data.

The non-deterministic nature of brain signals poses difficulties in extracting useful information, making it a complex and intricate process. “If I’m looking at a picture of a dog and record the EEG brain signal, when I repeat the experiment, the signal I get is going to be different because thoughts contain so many biases,” he explains. “While seeing one thing, we might think about something else or hear something from our surroundings, which makes it very difficult to extract useful information from the EEG data.”

Synthesizing the information once extracted from the EEG was a further challenge. To solve this, he used a self-supervised strategy to train every approach, avoiding relying on supervised settings or ground truth. EEG has been compared to a fingerprint for each and every person.

Looking ahead, Prajwal hopes that future research explores the possibility of generalized EEG feature extraction methods. Currently, datasets are very controlled, but is it possible to move towards generalization and a better strategy for synthesizing images? Recently, he has demonstrated the potential for deploying the model in a live setting by conducting an image-to-image translation experiment.

“We’ve got a collection of images that are never shown to the network, but similar classes have been shown to the StyleGAN for training,” Prajwal tells us. “We took a bunch of unseen images, transformed all those images into the EEG feature space, and tried to reconstruct the EEG features with the StyleGAN. Even though our network hadn’t seen those images, it was able to reconstruct a close approximation of what this EEG might belong to. That shows we can deploy the model live if possible.”

Unseen images mapped to the EEG representation space and reconstructed back using EEGStyleGAN-ADA.

To learn more about this work, visit Posters 3 today at 17:15-19:15. Virtual papers are available via the WACV 2024 online interface.
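For readers who want a feel for the triplet-loss feature learning Prajwal describes, here is a minimal sketch: an EEG encoder pulls together embeddings of signals recorded for the same image class and pushes apart signals from different classes, and the resulting features would then condition a generator such as StyleGAN-ADA. The encoder architecture, signal dimensions, and margin below are our own assumptions, not details from the paper.

```python
# Illustrative sketch of triplet-loss EEG feature learning (assumed architecture).
# anchor/positive: EEG recorded while viewing the same image class;
# negative: EEG recorded for a different class.
import torch
import torch.nn as nn

class EEGEncoder(nn.Module):
    def __init__(self, channels: int = 128, feat_dim: int = 128):
        super().__init__()
        self.net = nn.LSTM(input_size=channels, hidden_size=feat_dim, batch_first=True)

    def forward(self, eeg: torch.Tensor) -> torch.Tensor:
        # eeg: (B, T, channels) -> (B, feat_dim), L2-normalized embedding
        _, (h, _) = self.net(eeg)
        return nn.functional.normalize(h[-1], dim=-1)

encoder = EEGEncoder()
loss_fn = nn.TripletMarginLoss(margin=0.2)

# Example batch: 8 recordings, 440 time steps, 128 electrode channels (assumed sizes)
anchor, positive, negative = (torch.randn(8, 440, 128) for _ in range(3))
loss = loss_fn(encoder(anchor), encoder(positive), encoder(negative))
loss.backward()  # features from the trained encoder would then condition the GAN
```

Training this way needs no image-level ground truth for the encoder itself, which matches the self-supervised strategy mentioned in the interview.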

Poster Presentation

MarsLS-Net: Martian Landslides Segmentation Network and Benchmark Dataset

Abel Reyes is a Graduate Research Assistant and PhD student at Michigan Technological University. In this paper, he introduces a benchmark dataset and deep learning architecture for studying landslides on Mars. He speaks to us ahead of his virtual poster session today.

Valles Marineris is a system of canyons covering a vast area on Mars, equivalent to the size of the USA. The area is considered a museum of landslides, providing valuable clues about the planet’s geological history. Landslides, triggered by events such as volcanic activity, earthquakes, or heavy rainfall, play a crucial role in shaping the planet’s morphology. Besides the evidence of life, early Mars showed conditions similar to those we have now on Earth, and understanding these events can offer insights into Mars’s past and Earth’s current and future challenges, such as climate change.

Abel has compiled a benchmark dataset incorporating various satellite image modalities from Mars missions, including NASA’s Viking Mission and the 2001 Mars Odyssey orbiter mission.

“We have information in RGB colors and grayscale, thermal information, elevation information, and slope statistics,” he tells us. “Our work was to collate all of that into this benchmark dataset.”

However, he faced a challenge, as these images did not all have the same resolution. “The Viking Mission images have a resolution of 232 m per pixel, but the most current CTX data has a resolution of 5 m per pixel, so it’s a really high resolution,” Abel continues. “We upsampled some of the modalities. We aligned all of those modalities and have this big image with all the landslides visually annotated by an expert. Some visual characteristics are length, width, relative relief, slope statistics, and the slope of the scarp.”

The size of the image also proved challenging. He addressed this by breaking it into small patches of 128 x 128 pixels, which can easily be used as input for a deep learning model.

In addition to the dataset, Abel has developed the Martian landslide segmentation network (MarsLS-Net), a segmentation model specifically designed to be computationally efficient.

“Last month, I attended the NeurIPS conference and was talking with someone from NASA JPL,” he recalls. “He told me they’re using traditional computer vision algorithms in their devices because deep learning models are computationally expensive. We wanted to create a segmentation model, and usually, they’re huge. They have millions of parameters. Our segmentation model is a fraction of this size and performs very well. In some cases, it actually outperforms the state-of-the-art deep learning segmentation models.”

Diverging from traditional segmentation models, which usually have an encoder-decoder framework, MarsLS-Net uses a stack of blocks called Progressively Expanded Neuron Attention (PEN-Attention) blocks. “We’re using this concept of progressive neuron expansion, where each neuron is progressively expanded using the Maclaurin series expansion of a nonlinear function,” he explains. “We’re doing that to obtain a richer and more relevant feature representation, which led to a very lightweight model with a different structure than state-of-the-art segmentation models.”

Looking to the future, Abel highlights plans to improve the reliability of the dataset by utilizing image enhancement models to upsample lower-resolution images to match the higher ones. Additionally, he aims to make the architecture more trainable. “The way we’re using the progressive expansion neurons is actually fixing some parameters,” he points out.

“We’re not injecting new trainable parameters. For future iterations of this architecture, we want to add trainable parameters to be more suitable for different kinds of projects and tasks. We also want to make this model more stochastic, adding some generative model attributes during training.”

The new benchmark dataset is now open to the community for training, testing, and evaluation. “We believe this is going to be a good contribution to the research community,” Abel adds, “especially if they want to help with the exploration of extraplanetary places like Mars.”

To learn more about this work, visit Posters 3 today at 17:15-19:15. Virtual papers are available via the WACV 2024 online interface.
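For the curious, here is a rough sketch of the progressive-expansion idea Abel describes: a feature map is expanded with the first few Maclaurin-series terms of a nonlinearity (exp in this sketch), enriching the representation without adding trainable parameters for the expansion itself. The layer name, the choice of nonlinearity, and the expansion order are our assumptions; this is not the authors' exact PEN-Attention block.

```python
# Rough sketch of progressive neuron expansion via a Maclaurin series (assumed
# design, not the authors' PEN-Attention block). Each channel is expanded into
# the first `order` terms of exp(x) = 1 + x + x^2/2! + ..., adding richer
# features with no trainable parameters for the expansion step.
import math
import torch
import torch.nn as nn

class ProgressiveExpansion(nn.Module):
    def __init__(self, order: int = 3):
        super().__init__()
        self.order = order

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> (B, C * order, H, W)
        terms = [x.pow(k) / math.factorial(k) for k in range(1, self.order + 1)]
        return torch.cat(terms, dim=1)

# A lightweight block could follow this expansion with a cheap attention step.
expand = ProgressiveExpansion(order=3)
features = torch.randn(2, 16, 128, 128)   # e.g. features from a 128 x 128 patch
expanded = expand(features)
print(expanded.shape)  # torch.Size([2, 48, 128, 128])
```

Because the expansion is a fixed, parameter-free transform, most of the model's capacity stays in a small number of learned layers, which is consistent with the lightweight design described in the interview.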

Women in Computer Vision

Luisa Verdoliva is a full professor at the University of Naples Federico II in Italy.

Luisa, you are also General Chair at WACV.
Yes, yes, it was hard work, so it’s important to say that! [laughs]

What is your work in general?
My work is about media forensics. More specifically, I work on deepfake detection. I started around 12 years ago on this topic, when I actually worked on detecting manipulated images. I started with a challenge. I joined the challenge. It was very fun at the time to detect if an image was manipulated or not. They were manipulations made using Photoshop, so very different from the ones that you can do right now. The tools to detect it were not very good at the time.

Yes, they were not sophisticated, but in any case, it was challenging because if you spent time creating a good forged image, it was harder to detect.

Our readers remember you at a Media Forensics workshop with Cristian Canton a few years back. Is cooperating with these kinds of initiatives part of your work?
Yes, in that case, we joined that workshop with a paper. We won the Best Paper at the time.

I remember, you won!
Yes, but now I’m actually in the organisation of the CVPR Workshop on Media Forensics.

So, winning a workshop pays. You get another job! [Luisa laughs]
Yes, that’s right!

You have seen remarkable progress in this field in the last 12 years. What is the most important thing that you have witnessed?
I think what was really a breakthrough was in 2018, with the generative adversarial networks that were used to create synthetic images, so the GAN images. That was really incredible to us. We couldn’t believe that it was possible to obtain such high resolution in generated faces, and then not only faces, but now you can generate whatever you want.

You can even describe what you want to generate, and you get it. You get it for images, and you get it for videos. This is really astonishing. There are things I couldn’t believe could ever happen.

Are you aware of the work done by Ilke Demir at Intel with her FakeCatcher?
Do you mean the deepfake detector? Yes, I know it. I think they were actually inspired by a paper, if I’m not wrong, on checking the heartbeat. Yes, this is really interesting. We also worked on these biometric features in order to understand if a video of a person is the real person or not based on these biometrics. This is a very interesting direction.

What do we still need to solve in that area?
What I think is really challenging is the fact that often, all these videos and images can be of low quality, compressed, and resized. When you upload them over a social network, they can be strongly compressed, so the quality reduces, and also these tiny traces can be reduced…

Like artifacts?
Yes, these artifacts can be reduced, and it could be harder to detect them. Also, what is really important is to develop explainable detectors so that, as you say, you can look for some specific traces that you can explain. Otherwise, they’re harder to interpret, and you don’t know what’s happening. If the detector says yes or no, why? Can I trust it? This is also very important.

It seems like a game of cat and mouse: how to create fakes that are so good that they cannot be detected and how to find them. Who is going to win in the end?
This is a really difficult question to answer, but note that even if a fake is perfect visually, this doesn’t mean it doesn’t embed some artifacts inside. It can be a perfect fake, but it can contain some artifacts that can be highlighted by some detectors. The main problem is if you have a very smart, malicious attacker that’s able to hide the traces or even inject some specific traces if they know the detector you’re using. The problem is when this game is played with people who are also aware of the forensic detectors or have some knowledge, so they can actually attack your detector.

It seems that, in some way, you believe in the power of your opponents, and they are making your task very difficult.
Yes, so you have to take this into consideration when you develop a detector, and you have to try to develop a method that is also robust to possible attacks.

You probably do not know the next technique they will develop, but you are confident that you will always find an answer. How can you be so confident?


The point is that you can develop different strategies that are based on different artifacts, and this can help a lot because maybe you can attack a specific detector, but it’s harder to attack several ones. It’s really important never to rely on one single detector but to have different strategies, each of them trying to detect a specific artifact.

Obviously, you are passionate about this subject. Is it strong enough to keep you interested for years to come?
For now, yeah. It depends on what will happen in the future. Also, in terms of protection, there are active methods: methods for which you can maybe protect your data using some signatures or watermarks. Of course, maybe it can change, and it can evolve in the future, but I think there will be, in any case, some space for passive detectors and for a strategy that can integrate passive detectors with active ones.

What fascinates you about the subject?
It’s like an investigation. You have some difficult traces you have to highlight, and this is quite challenging.

It is like, “Elementary, Mr. Watson”?
Yes, right. You have to find evidence. Sometimes, something looks perfect.

I remember CVPR 2016 in Las Vegas when Matthias Niessner showed his Face2Face.
Face2Face! Actually, I also worked with Matthias because we developed the FaceForensics++ dataset.

I was in the audience when he gave that live demo for the first time. It was very, very impressive.
Yeah, it was impressive.

It does not happen very much that we speak with researchers from your university in our magazine. I interviewed Fanny Ficuciello once, and that is pretty much it. Do you know why that is?
The main problem is, in general, probably the area of computer vision. For Fanny, it was robotics, I think.

Medical robotics.
Medical robotics, yes. It’s probably expanding, so maybe you will have more interviews in the future! [she laughs]

How is it to work there?
It was the university where I studied, so I like it a lot. I think it’s stimulating. I feel lucky to work with a lot of students who are very smart and who want to learn. Yes, I like the teaching aspect there a lot in terms of connection with the students, and I find a lot of stimulus with them.

What about the connection with the hometown, with Napoli?
Okay, I like Naples a lot. The food, the weather.

The people…
The people. What I like is the atmosphere you can feel there. I think that everyone who has been to Napoli knows that.

There is the famous saying: ‘See Napoli and die!’ [laughs]
Yeah, yeah, you’re right.

It is translated into most languages that I have ever heard of. Why did they choose Napoli for that?
I think that Napoli is full of art, full of history, and probably not everybody knows that.

Let’s say something about WACV, because you are General Chair. Tell us something about it. It is the biggest WACV ever.
Yes, we are very, very happy about that. It’s a really great success, with so many submissions compared to previous years. 2,000 submissions. It’s great. There was a lot of interest from people who wanted to come, as well as workshops and tutorials. It was hard work, but it was worth it.

What was the most challenging part of this whole organization?
At least from my perspective, it was not a single part; it was the whole. [laughs] Everything needed supervision. The General Chair supervises everything, so it’s not that you are doing all the work. There were great people in the organization who were doing everything, and you needed to make sure everything was done. It was really a lot. I didn’t find one single, specific topic that was more challenging; it was the whole.

You started working in this community when you were one of very few women, right? Now, I can tell that there are many more. How have you found the progression?
Even at WACV, we have two keynote speakers who are women. I think this is really increasing a lot, slowly, because it probably starts from when you’re a child. I noticed even in my class at university that I have very few girls. This year, I had mostly males in my class, but I think it’s something that should start from education, trying to stimulate girls to study maths and coding. I think they would be great!

Read 100 FASCINATING interviews with Women in Computer Vision!

Double-DIP Don’t miss the BEST OF WACV 2023 in Computer Vision News of February. Subscribe for free and get it in your mailbox! Click here

WACV Panel… ☺

Full house for the panel on “Innovation in Computer Vision: What Works and What Doesn’t”, moderated by Anthony Hoogs (left). Daniel Cremers is speaking near Walter Scheirer (right).

Did you read Computer Vision News of December? Read it here

More Orals… ☺

Paul Grimal (top), presenting his oral work “A Metric for Evaluating Alignment in Text-to-Image Generation”. Jordy Van Landeghem (bottom), presenting his oral work “Beyond Document Page Classification: Design, Datasets, and Challenges”.

Keynote… ☺

One day after Dima Damen, another fascinating keynote speech by Lihi Zelnik (Technion): “Digitizing Touch”.

More Posters… ☺

Sebastian Koch (top), a PhD student from Ulm University, presenting his research on improving 3D scene graph prediction using a novel self-supervised pre-training approach that needs no additional scene graph labels. Qiuxiao Chen (bottom), a PhD student at Utah State University, presenting her work on how global contextual information improves bird’s eye view map segmentation performance by applying residual graph convolutional layers.
