Computer Vision News - Summer 2025

ConceptNet, enabling reasoning that extends beyond visual evidence. Our solution stood out in the competition due to its innovative integration of neuro-symbolic reasoning, which effectively bridged the gap between low-level visual perception and high-level semantic understanding. SceneNet gave the model a grounded, structured view of the world, well suited to understanding where objects are and how they move. KnowledgeNet added a layer of abstract reasoning, which helped with questions that required prior knowledge or procedural understanding.

By integrating symbolic scene representations into the prompt, this structured form of prompt engineering enabled more efficient reasoning over long sequences. These graph-based abstractions provided compact and interpretable encodings of spatial-temporal dynamics, reducing the need for extensive processing of raw visual tokens. The infusion of external knowledge further allowed the model to reason beyond visual input, enhancing its robustness and generalization in complex and data-scarce scenarios.

This challenge provided both the community and our team with valuable insights into the potential and limitations of current approaches to egocentric video understanding. We demonstrated that graph-based structured representations can significantly enhance a model's ability to interpret complex object interactions and temporal dynamics when combined with external commonsense knowledge. At the same time, we observed the challenges of integrating heterogeneous information sources, reinforcing the need for more refined and context-aware fusion strategies. Our results support the importance of developing neuro-symbolic frameworks that are selectively applied and task-specific, rather than one-size-fits-all.
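To make the idea of structured prompt engineering concrete, here is a minimal sketch of how scene-graph relations and commonsense triples can be serialized into a text prompt. All names and data here are hypothetical illustrations, not the authors' actual SceneNet/KnowledgeNet pipeline:

```python
# Hypothetical sketch: serialize a toy scene graph and commonsense
# triples into a structured text prompt for a vision-language model.
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    subject: str
    relation: str
    obj: str

# SceneNet-style spatial relations observed in a clip (toy data).
scene_graph = [
    Edge("hand", "holds", "knife"),
    Edge("knife", "above", "cutting_board"),
    Edge("onion", "on", "cutting_board"),
]

# KnowledgeNet-style commonsense triples (ConceptNet-like, toy data).
knowledge = [
    Edge("knife", "UsedFor", "cutting"),
    Edge("onion", "ReceivesAction", "chopping"),
]

def serialize(edges, header):
    # One line per triple keeps the encoding compact and interpretable.
    lines = [header] + [f"- {e.subject} {e.relation} {e.obj}" for e in edges]
    return "\n".join(lines)

def build_prompt(question):
    # Symbolic context replaces long runs of raw visual tokens.
    return "\n\n".join([
        serialize(scene_graph, "Scene graph:"),
        serialize(knowledge, "Commonsense knowledge:"),
        f"Question: {question}",
    ])

prompt = build_prompt("What is the person about to do?")
print(prompt)
```

The resulting prompt gives the language model a compact, human-readable encoding of spatial structure plus external knowledge, which is the mechanism the passage above describes.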
These findings open several avenues for future research, including the creation of unified models that combine the strengths of SceneNet and KnowledgeNet, the exploration of more advanced, question-aware fusion techniques, and the extension of symbolic reasoning to new modalities and domains. Additionally, the idea of performing reasoning directly on symbolic representations, without relying on raw video, holds promise for scenarios with limited computational resources. Indeed, advancing scalable reasoning over long, complex egocentric sequences remains a key challenge in enabling deeper temporal and causal understanding. Overall, the experience confirmed that structured intermediate representations not only improve model performance but also lay the groundwork for building AI systems that are more interpretable, robust, and aligned with real-world human contexts.
