CVPR Daily - Wednesday

AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning

Madeleine Grunde-McLaughlin is an incoming PhD student at the University of Washington, having just graduated from the University of Pennsylvania with a degree in Cognitive Science. Her first CVPR paper proposes a visual question-answering dataset over video. She speaks to us ahead of her presentation today.

Madeleine tells us the work started out with thinking about how to break down video understanding. A video is a complex piece of visual information with many important composite parts, and this is how people understand visual information too: as a composition of different parts. How can you evaluate understanding of that and extract the most important elements? A good method is question-answering, because questions are themselves compositional; to answer one, a model has to run through different reasoning steps and understand different visual concepts.

“We wanted to make a question-answering benchmark for measuring visual understanding, but as we were looking into the question-answering space, we realized there were limitations,” Madeleine explains. “The ability to answer a question is only a proxy for actual understanding, rather than a direct measurement, so there are ways the model can use linguistic biases in the dataset to artificially get a higher accuracy by making educated guesses. Also, if you have a question with so many different complex parts mushed together, it’s hard to know what the model is getting better at and what it’s struggling with. We wanted to explore how to use the medium of a question for a more in-depth understanding.”

There has been similar prior work on image question-answering: two datasets, GQA and CLEVR, looked explicitly at compositional reasoning and questions about images. Compositional question-answering has also been done on video, but only on synthetic datasets rather than real-world videos, partly because video is very expensive to annotate with human annotators. There have been video question-answering benchmarks, but they tended to be smaller.
