them with the visual information.” Manuel notes that past research includes only a few works on single-frame action recognition or action anticipation, so he decided to explore this setting as a first step into single-frame anticipation. In fact, the team plans future work to develop this line of research further.

The vision part of this work consists of two main components. One is the visual encoder, with the same encoder used for both RGB and depth. Manuel evaluated different encoders and found that a self-supervised encoder provided the best results. That is expected, since self-supervised training captures the intrinsic structure of the images rather than relying on labels from pretrained datasets. In the depth case, RGB encoders cannot use depth information directly because the depth map has a single channel. “We decided to apply a coloring strategy to convert the depth frame into an RGB one,” Manuel clarifies (a minimal sketch of this idea appears at the end of this article). “And lastly, for obtaining the depth image, we also realized that the ground truth depth can often be noisy. So instead of relying on original depth captures, we relied on depth estimation models. In this case, it’s Depth Anything V2. I wish to add that VLMs were quite limited for this task, rather than working very well as everyone expects them to!”

The idea for this work came from discussions with colleagues at ICS-FORTH in Greece, where Manuel was a Visiting Researcher two summers ago. In his regular work beyond this paper, he focuses on action understanding and multimodal action understanding. He has worked on recognition, online action detection, and action anticipation, and he has also collaborated on multimodal computer vision tasks outside action understanding, such as pain assessment.

Manuel will be presenting his paper during Poster Session 1, Sunday 11:15–13:00 in the Tucson Ballroom and Prefunction, poster 27.
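The depth colorization step Manuel describes can be illustrated with a short sketch. This is not the paper's implementation: it assumes a single-channel depth map as a NumPy array (which, per the interview, would come from Depth Anything V2 rather than ground-truth captures) and uses a matplotlib colormap to produce a three-channel image that an RGB encoder can consume. The specific colormap and min-max normalization are illustrative assumptions.

```python
# Minimal sketch (assumed details, not the authors' method): colorize a
# single-channel depth map into a 3-channel RGB image for an RGB encoder.
import numpy as np
import matplotlib

def colorize_depth(depth: np.ndarray) -> np.ndarray:
    """Map an HxW depth array to an HxWx3 uint8 RGB image."""
    d = depth.astype(np.float32)
    d = (d - d.min()) / (d.max() - d.min() + 1e-8)   # normalize to [0, 1]
    rgba = matplotlib.colormaps["inferno"](d)        # HxWx4 floats in [0, 1]
    return (rgba[..., :3] * 255).astype(np.uint8)    # drop alpha channel

# Example with a synthetic depth map standing in for a model prediction.
depth_map = np.random.rand(224, 224)
rgb_like = colorize_depth(depth_map)
print(rgb_like.shape, rgb_like.dtype)  # (224, 224, 3) uint8
```

The resulting three-channel image has the same spatial layout as an RGB frame, which is what allows a single shared encoder to process both modalities as the interview describes.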