Computer Vision News - January 2018

Visual Question Answering (VQA) Networks
Tool by Assaf Spanier

This month we’ll be looking at Visual Question Answering (VQA) networks. We’ll discuss their overall structure, the type of input they receive, the datasets typically used for training them today, present an implementation in Keras, and review some of the latest results from benchmark competitions in the field.

First, what is Visual Question Answering? A visual question answering (VQA) task consists of an image and a free-text question about the content of that image; the goal is to develop a model that automatically answers the question correctly for the given image. Models in this field combine computer vision, natural language processing and artificial intelligence. Since the questions tend to be visual and refer to specific areas of the image, such as background details or context, VQA methods require a much better and more detailed comprehension of the image, its content, logic and internal structure than methods that only need to estimate an image’s overall content in order to label it.

The models developed in recent years adopt the following overall structure: the network takes its input through a two-branch architecture (see figure below), one branch for the image and one for the text of the question. The image branch uses a CNN to extract image features, for example the activations from the last hidden layer of VGGNet. The text branch uses an LSTM-type network, taking the internal state of the LSTM as the question features. The outputs of both branches are then combined in a single fully-connected (FC) layer, whose output gives the probabilities predicting the answer. A sketch of this architecture in Keras is shown below.

“Since its introduction in 2015, the VQA dataset of Antol et al. has been the de-facto benchmark”
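The following is a minimal Keras sketch of the two-branch structure just described: precomputed CNN image features on one side, an embedding plus LSTM over the question tokens on the other, merged and passed through FC layers ending in a softmax over candidate answers. The feature dimension, vocabulary size, question length, layer widths and answer-set size below are illustrative assumptions, not values taken from the article.

from keras.models import Model
from keras.layers import Input, Dense, LSTM, Embedding, concatenate

IMG_FEAT_DIM = 4096   # e.g. activations of VGGNet's last hidden (fc) layer (assumed)
VOCAB_SIZE = 10000    # assumed question vocabulary size
MAX_Q_LEN = 26        # assumed maximum question length in tokens
NUM_ANSWERS = 1000    # assumed size of the candidate answer set

# Image branch: precomputed CNN features pass through a fully-connected layer.
img_input = Input(shape=(IMG_FEAT_DIM,), name='image_features')
img_feat = Dense(1024, activation='tanh')(img_input)

# Question branch: word indices -> embedding -> LSTM; the LSTM's final
# internal state serves as the question representation.
q_input = Input(shape=(MAX_Q_LEN,), name='question_tokens')
q_embed = Embedding(VOCAB_SIZE, 300, mask_zero=True)(q_input)
q_feat = LSTM(1024)(q_embed)

# Combine both branches and predict a probability for each candidate answer.
merged = concatenate([img_feat, q_feat])
hidden = Dense(1024, activation='tanh')(merged)
answer_probs = Dense(NUM_ANSWERS, activation='softmax')(hidden)

model = Model(inputs=[img_input, q_input], outputs=answer_probs)
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()

Answering is framed here as classification over a fixed set of frequent answers, which is the common simplification in this line of work; other merge strategies (for example element-wise multiplication of the two 1024-dimensional projections) are also used in the literature.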
