25 DAILY WACV Sunday

Manuel Benavent-Lledo

But why do we need this? In some contexts, latency and real-time understanding are critical. Frame aggregation can be computationally expensive, and in such settings a single frame can go a long way.

The main difficulty for the authors was fusing the different modalities: they had to find the right strategies to combine them, as well as the right features. Using pre-trained backbones is not the same as using self-supervised backbones, and the choice has a critical impact on the final results. More importantly, they realized that the effectiveness of the single-frame approach depends enormously on dataset complexity. On simpler datasets, the method can work really well; in contrast, in more complex scenarios with greater variability, such as ADL, the information from video is still needed.

How did Manuel and his team solve the problem? Their method is based on three branches. "First, we have the RGB branch," he specifies. "We extract features from a self-supervised transformer, in this case DINOv2, and fuse them with depth features to include spatial information. For this purpose, we use a cross-attention transformer."
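To make the fusion step concrete, here is a minimal PyTorch sketch of cross-attention fusion between RGB and depth tokens. This is not the authors' implementation: the module name, the residual-plus-norm layout, and the dimensions (384-dimensional tokens, as in a small DINOv2 variant) are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative sketch: RGB tokens (queries) attend to depth tokens (keys/values)."""

    def __init__(self, dim: int = 384, num_heads: int = 6):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tokens: torch.Tensor, depth_tokens: torch.Tensor) -> torch.Tensor:
        # RGB features query the depth features; a residual connection
        # keeps the original RGB content alongside the injected spatial cues.
        fused, _ = self.attn(query=rgb_tokens, key=depth_tokens, value=depth_tokens)
        return self.norm(rgb_tokens + fused)

# Token shapes assume a 16x16 patch grid and dim 384 (hypothetical values).
rgb = torch.randn(2, 256, 384)    # (batch, tokens, dim) from the RGB backbone
depth = torch.randn(2, 256, 384)  # depth features projected to the same token layout
out = CrossAttentionFusion()(rgb, depth)
print(out.shape)  # torch.Size([2, 256, 384])
```

The key design point the quote describes is asymmetric fusion: RGB provides the queries, so the output stays in the RGB token space while borrowing spatial information from depth.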
RkJQdWJsaXNoZXIy NTc3NzU=