Computer Vision News - January 2018

We’ll present two similarly-structured, simple, elegant models for handling VQA, and show their implementation in Python, using the Keras and Torch deep learning libraries. Both are easy-to-use deep learning libraries written in Python. Both models were developed to compete in the VQA Challenge (for more on the competition, see link). The first was developed back in 2015, based on a preliminary dataset prepared at that time. A number of problems arose: in retrospect, it turned out that a non-negligible number of questions could be answered without referring to the image, such as “what color is the sky in the image?” These realizations led to the development of a new, updated dataset in 2017, which we will describe in more detail below. We will also present an up-to-date deep learning VQA model, developed this summer, that achieved the best results on the new dataset.

Dataset: Several datasets have been published for VQA (for a survey, see Visual Question Answering: A Survey of Methods and Datasets). Since its introduction in 2015, the VQA dataset of Antol et al. has been the de-facto benchmark. To arrive at a set of good-quality questions, the dataset’s developers ran studies asking subjects for questions about a given image that they believed a “smart robot” would have trouble answering. However, in the aftermath of the 2015 benchmark competition, they discovered that many of the questions laymen proposed, such as “what color is the cat?” or “how many chairs are in the scene?”, were too simple, requiring only low-level computer vision knowledge. The VQA dataset developers’ goal had been questions that require commonsense knowledge about the scene, like “what sound does the pictured animal make?” Furthermore, questions should also require the image to be correctly parsed and not be answerable using commonsense alone, like “what is the mustache made of?”, which combines identifying a location within the image with object identification.
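To give a feel for how such similarly-structured VQA models are wired, here is a minimal pure-Python sketch of the fusion-and-classify step common to simple VQA baselines: an image feature vector (as a CNN would produce) and a question embedding (as an RNN would produce) are combined element-wise, then passed through a linear layer and a softmax over a fixed set of candidate answers. All names, dimensions, and weights here are illustrative toy values, not the article’s actual models.

```python
import math
import random

random.seed(0)

def fuse_and_classify(img_feat, q_feat, W, b):
    """Pointwise fusion of image and question features, then a
    linear layer and softmax over candidate answers."""
    assert len(img_feat) == len(q_feat)
    fused = [i * q for i, q in zip(img_feat, q_feat)]   # element-wise fusion
    logits = [sum(w * f for w, f in zip(row, fused)) + b_k
              for row, b_k in zip(W, b)]
    m = max(logits)                                     # numerically stable softmax
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy setup: 8-d features, 3 candidate answers (e.g. "yes", "no", "2").
D, K = 8, 3
img_feat = [random.uniform(-1, 1) for _ in range(D)]
q_feat = [random.uniform(-1, 1) for _ in range(D)]
W = [[random.uniform(-0.1, 0.1) for _ in range(D)] for _ in range(K)]
b = [0.0] * K

probs = fuse_and_classify(img_feat, q_feat, W, b)
print([round(p, 3) for p in probs])  # a distribution over the 3 answers
```

In the real models the fusion weights are of course learned end-to-end; this sketch only shows the shape of the computation that both architectures share.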
