by Agnese Taluzzi and Davide Gesualdi

From left: Riccardo, Agnese, Davide and Chiara. Work was carried out in the Smart Eyewear Lab, a joint research center of Politecnico di Milano with EssilorLuxottica. Other team members: Francesca, Simone, and Matteo. Great job guys!

This challenge, the EgoVis HD-EPIC VQA challenge, focused on a rapidly growing area of research: how can AI understand what's happening in first-person videos, like the ones captured with smart glasses? Egocentric videos are particularly complex: the camera moves because it is attached to someone's head, objects go in and out of view, and there is often a lot happening outside the frame. While humans naturally excel at interpreting both fine-grained actions and broader activities, traditional video understanding models struggle in this scenario because they're trained mostly on third-person, nicely framed images. So, this challenge was about pushing the limits of AI to understand in-the-wild interactions and environments from a first-person perspective, something that's becoming increasingly relevant for wearable tech, assistive devices, and robotics.

We addressed the challenge by focusing on the limitations of current models, particularly their difficulty in reasoning over long egocentric video sequences and their lack of structured semantic understanding. Most state-of-the-art multimodal models, like Gemini, are trained on third-person data and rely heavily on frame-level representations, which makes it hard for them to capture object affordances, state changes, and causal dynamics in first-person settings. To overcome this, we adopted a hybrid approach rooted in neuro-symbolic AI, combining two complementary modules: SceneNet, which generates structured scene graphs capturing spatial relationships and actions grounded in the video, and KnowledgeNet, which expands object understanding using external commonsense knowledge.
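To make the two modules a little more concrete, here is a minimal sketch of how a SceneNet-style scene graph and a KnowledgeNet-style commonsense expansion could be merged into structured context for a question-answering model. Every name in it (the Triple class, scene_graph_from_clip, commonsense_facts, the toy knowledge base) is a hypothetical stand-in for illustration, not the team's actual code.

```python
from dataclasses import dataclass

# Hypothetical data structures and stubs; real modules would run perception
# models on the egocentric clip and query an external knowledge source.

@dataclass
class Triple:
    subject: str
    relation: str
    obj: str

def scene_graph_from_clip(clip_id: str) -> list[Triple]:
    """Stand-in for a SceneNet-style module: returns a fixed example
    scene graph for a kitchen clip instead of running detectors."""
    return [
        Triple("hand", "holds", "knife"),
        Triple("knife", "cuts", "onion"),
        Triple("onion", "on", "cutting board"),
    ]

def commonsense_facts(obj: str) -> list[str]:
    """Stand-in for a KnowledgeNet-style lookup: maps an object to
    commonsense facts (toy dictionary here)."""
    kb = {
        "knife": ["used for cutting", "is sharp"],
        "onion": ["is a vegetable", "changes state when chopped"],
        "cutting board": ["supports food while cutting"],
    }
    return kb.get(obj, [])

def build_prompt(question: str, triples: list[Triple]) -> str:
    """Serialise the scene graph plus expanded knowledge into structured
    context that a multimodal reasoner could answer over."""
    lines = ["Scene graph:"]
    lines += [f"  ({t.subject}, {t.relation}, {t.obj})" for t in triples]
    lines.append("Commonsense knowledge:")
    for obj in sorted({t.subject for t in triples} | {t.obj for t in triples}):
        for fact in commonsense_facts(obj):
            lines.append(f"  {obj}: {fact}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)

if __name__ == "__main__":
    triples = scene_graph_from_clip("clip_0001")
    print(build_prompt("What is the person preparing?", triples))
```

The sketch only shows how the two structured outputs are combined before being handed to a downstream model; in the real pipeline, the scene-graph builder would process the egocentric frames themselves and the knowledge lookup would draw on an external commonsense resource.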
RkJQdWJsaXNoZXIy NTc3NzU=