CVPR Daily - Friday

“This phenomenon first occurred when they created the VQA dataset,” Ali tells us. “Everybody was getting 70-80% until they realized there was bias in the dataset. The model didn’t even look at the visual features. For example, if asked, ‘Is there an elephant in the room?’ most of the time the answer was yes, because if there wasn’t, why would someone ask about an elephant in the first place?”

This new STVQA model takes advantage of recent advances in computer vision and NLP. Building on the insight that language and layout are essential, it uses the famous T5 architecture, a transformer-based language model already trained on text data, together with a novel pre-training scheme that further pre-trains the model on scanned documents with text in a variety of complex layouts. In conjunction with a vision transformer, it extracts the visual information, combines it all, and produces the answer.

The layout-aware pre-training task fuses the semantic information with the layout information using something akin to a masked language model. Instead of just masking the word, the model is given a rough location for the masked words and must then use the layout to solve the masked-language-modeling task. Unlabeled documents downloaded from the web are used, as they are more readily available than large quantities of natural images with text.
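To make the idea concrete, here is a minimal Python sketch of what such a layout-aware masking step might look like. The function names, the coordinate quantization, and the T5-style sentinel convention are illustrative assumptions, not taken from the paper itself.

```python
import random

def quantize_box(box, bins=100):
    """Map a normalized (x0, y0, x1, y1) bounding box to integer bins,
    so a rough location can be given to the model as discrete values."""
    return tuple(min(bins - 1, int(c * bins)) for c in box)

def make_pretraining_example(tokens, boxes, mask_ratio=0.15, seed=0):
    """Mask a subset of OCR words but keep their (quantized) layout.

    The model sees sentinel tokens paired with a coarse location and
    must recover the masked words from context plus layout -- a
    masked-language-model objective extended with spatial hints."""
    rng = random.Random(seed)
    inputs, targets = [], []
    sentinel = 0
    for word, box in zip(tokens, boxes):
        qbox = quantize_box(box)
        if rng.random() < mask_ratio:
            # Hide the word, but expose where on the page it was.
            inputs.append((f"<extra_id_{sentinel}>", qbox))
            targets.append((f"<extra_id_{sentinel}>", word))
            sentinel += 1
        else:
            inputs.append((word, qbox))
    return inputs, targets

# Toy scanned-document line: OCR words with normalized bounding boxes.
tokens = ["Invoice", "No.", "42", "Total:", "$17.50"]
boxes = [(0.05, 0.10, 0.20, 0.14), (0.22, 0.10, 0.28, 0.14),
         (0.30, 0.10, 0.34, 0.14), (0.05, 0.90, 0.16, 0.94),
         (0.18, 0.90, 0.30, 0.94)]

inputs, targets = make_pretraining_example(tokens, boxes, mask_ratio=0.4)
print(inputs)   # words / sentinels, each paired with a coarse location
print(targets)  # sentinel -> original word pairs the model must predict
```

The point of the sketch is the input construction: because each masked sentinel still carries a rough position, the model is pushed to exploit layout, not just left-to-right text context, when filling in the blanks.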
