Computer Vision News - January 2018

This led to the development of the VQA v2 dataset, in which every question is associated with two images. The images are purposely selected so that each leads to a different answer, discouraging blind guessing from the question alone. This new dataset was the basis for the 2017 VQA challenge.

The model of Agrawal et al. (2015):

1. Image features: the input image is passed through a Convolutional Neural Network (VGGNet) to obtain L2-normalized activations from the last hidden layer.
2. Question text features: a 2048-dim embedding of the question is produced by an LSTM with two hidden layers, followed by a fully connected layer with tanh non-linearity that transforms the 2048-dim embedding to 1024 dimensions (the question words themselves are embedded before being fed to the LSTM).
3. The image and question embeddings are then passed to a fully connected neural network classifier with 2 hidden layers and 1000 hidden units. Dropout of 0.5 is applied between the hidden layers. Finally, a softmax layer produces a distribution over K answers.

The entire model is learned end-to-end using a cross-entropy loss. The VGGNet parameters in the image channel are those learned for ImageNet classification and are not fine-tuned.

We'll look at the core parts of the model (the full implementation can be found here, in the file model.py). The model uses three functions. The first (Word2VecModel) embeds the question text and extracts the relevant features from it. The second (img_model) extracts the image features. The third (vqa_model) fuses the text and image features into a single feature vector, adds the fully connected classifier layers, and compiles the model with an optimizer (a sketch of this fusion step appears below).

1. The Word2VecModel function is built on a Keras Sequential model that first encodes the question using an embedding layer. The encoding is fed into 2 LSTM layers with a dropout layer between them -- all this is implemented in the short code snippet at the beginning of the next page -- the beauty of the Keras library is that the code is simple, clean, minimalistic and self-explanatory.
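To make the structure concrete before looking at the article's actual snippet, here is a minimal sketch of such a question encoder in Keras. The vocabulary size, embedding dimension, sequence length and LSTM width are illustrative assumptions, not values taken from the article's model.py:

```python
# A hedged sketch of the Word2VecModel question channel.
# Assumed hyperparameters: 10,000-word vocabulary, 300-dim word
# embeddings, questions padded to 26 tokens, 512 LSTM units.
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense

def Word2VecModel(vocab_size=10000, embedding_dim=300,
                  seq_length=26, dropout_rate=0.5):
    model = Sequential()
    # Map each word index to a dense embedding vector
    model.add(Embedding(vocab_size, embedding_dim,
                        input_length=seq_length))
    # First LSTM returns the full sequence so the second LSTM can consume it
    model.add(LSTM(512, return_sequences=True))
    model.add(Dropout(dropout_rate))
    # Second LSTM returns only its final state
    model.add(LSTM(512, return_sequences=False))
    # Project the question encoding to a 1024-dim feature vector
    model.add(Dense(1024, activation='tanh'))
    return model
```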
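For completeness, here is a hedged sketch of how the vqa_model fusion step described above could look in Keras. The feature dimensions, the choice of concatenation for fusing the two modalities, and the RMSprop optimizer are assumptions for illustration and may differ from the actual implementation:

```python
# A sketch of the fusion-and-classifier step: two hidden layers of
# 1000 units, dropout 0.5, and a softmax over the answer vocabulary,
# as described in the architecture above. Inputs are assumed to be
# pre-extracted VGGNet features and the 1024-dim question embedding.
from keras.models import Model
from keras.layers import Input, Dense, Dropout, concatenate

def vqa_model(img_feat_dim=4096, question_feat_dim=1024,
              num_answers=1000, dropout_rate=0.5):
    image_input = Input(shape=(img_feat_dim,))
    question_input = Input(shape=(question_feat_dim,))

    # Fuse the two modalities into a single feature vector
    fused = concatenate([image_input, question_input])

    # Two fully connected hidden layers with dropout in between
    x = Dense(1000, activation='tanh')(fused)
    x = Dropout(dropout_rate)(x)
    x = Dense(1000, activation='tanh')(x)
    x = Dropout(dropout_rate)(x)

    # Softmax distribution over the K candidate answers
    output = Dense(num_answers, activation='softmax')(x)

    model = Model(inputs=[image_input, question_input], outputs=output)
    model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
    return model
```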
