CVPR Daily - Friday

LaTr: Layout-Aware Transformer for Scene-Text VQA

Ali Furkan Biten is a PhD student in the Computer Vision Center at the Autonomous University of Barcelona. His paper proposes a novel multimodal architecture for Scene-Text VQA. He speaks to us ahead of his oral presentation today.

Visual question answering (VQA) involves answering a question about an image, but the models handling these VQA tasks are illiterate: they cannot read the text within an image. In an ICCV 2019 paper, Ali proposed a novel task called Scene-Text VQA (STVQA), which gave VQA models the ability to read text and use it to reason and generate an appropriate answer about an image. However, he realized answers were still constrained and not as freeform or human-like as he wanted, so he turned to language models for help.

“Language models like GPT-3 and T5 already encapsulate most of the things we require to answer these questions,” Ali explains. “These models give long answers, and they already have some prior world knowledge, which is very important because if the dataset doesn’t know what a website looks like or what a brand is, there’s no way it can answer a question about it.”

STVQA requires models to reason over different modalities, so this paper proposes a multimodal architecture and pre-training scheme which achieved significant gains of almost 10% over the previous state of the art. However, in the process, it was discovered that over 60% of the test set questions could be answered using only the text enriched with the layout information rather than the image itself. This result poses a question for the community: does the task definition require only the language and layout information, or is this due to bias within the dataset inherited from the way it was created?