Computer Vision News - January 2018

The model of Teney et al.: The model described so far was developed for the first dataset, from 2015, and was published two years ago. For completeness, here is the Keras code that assembles that model (img_model and Word2VecModel are defined earlier in the review; the Merge layer below exists only in older Keras versions):

# Legacy Keras (1.x-era) imports.
from keras.models import Sequential
from keras.layers import Merge, Dense, Dropout

def vqa_model(embedding_matrix, num_words, embedding_dim, seq_length,
              dropout_rate, num_classes):
    vgg_model = img_model(dropout_rate)  # image branch, defined earlier
    lstm_model = Word2VecModel(embedding_matrix, num_words, embedding_dim,
                               seq_length, dropout_rate)  # question branch
    print("Merging final model...")
    fc_model = Sequential()
    # Element-wise product fuses the image and question representations.
    fc_model.add(Merge([vgg_model, lstm_model], mode='mul'))
    fc_model.add(Dropout(dropout_rate))
    fc_model.add(Dense(1000, activation='tanh'))
    fc_model.add(Dropout(dropout_rate))
    fc_model.add(Dense(num_classes, activation='softmax'))
    fc_model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
                     metrics=['accuracy'])
    return fc_model

Now let's turn to the model that won the competition last summer, by Teney et al. (link). Its code, too, can be found in the model.py file; this model is implemented in PyTorch. The method achieved better results than those of the 2015 competition, even on the improved (more difficult) 2017 dataset. We'll see that the model follows the same basic structure, but through a few minor changes it achieved better performance on a task that had been made harder. The input for each instance, whether at training or test time, is a text question and an image.

1. Image features: the input image is passed through a Convolutional Neural Network (CNN) to obtain a vector representation of size K × 2048, where K is the number of image locations. These features are produced by a ResNet CNN inside a Faster R-CNN framework, so they can be thought of as ResNet features centered on the top-K objects in the image. The detector is trained to focus on specific elements in the given image, using annotations from the Visual Genome dataset. (A shape walk-through of these K × 2048 features appears after the code excerpts below.)

The corresponding image encoding and attention steps in model.py:

# Excerpt from the model's forward pass; assumes the usual
# "import torch" and "import torch.nn.functional as F".

# Image encoding: L2-normalize each of the K feature vectors.
image = F.normalize(image, p=2, dim=-1)  # (batch, K, feat_dim)

# Image attention: tile the question encoding qenc over the K locations,
# fuse it with the image features through a gated tanh layer, and
# compute one attention score per location.
qenc_reshape = qenc.repeat(1, self.K).view(-1, self.K, self.hid_dim)
concated = torch.cat((image, qenc_reshape), -1)
concated = self._gated_tanh(concated, self.gt_W_img_att, self.gt_W_prime_img_att)
a = self.att_wa(concated)
a = F.softmax(a.squeeze(-1), dim=-1)
# Attention-weighted sum over the K locations: one 2048-d vector per image.
v_head = torch.bmm(a.unsqueeze(1), image).squeeze(1)
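The helper _gated_tanh is not shown in the excerpt. Here is a minimal sketch of what it computes, following the gated tanh non-linearity described in the paper; the two gt_W_* arguments are assumed to be nn.Linear layers:

import torch
import torch.nn as nn

def gated_tanh(x, W, W_prime):
    # y = tanh(W x) * sigmoid(W' x): the sigmoid branch acts as a learned,
    # element-wise gate in [0, 1] on the tanh activation.
    y_tilde = torch.tanh(W(x))
    g = torch.sigmoid(W_prime(x))
    return y_tilde * g

In the paper, this gated activation is used throughout the network in place of plain ReLU or tanh layers.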
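To make the shapes concrete, here is a small self-contained walk-through of the attention step on random stand-in tensors. The dimensions (K = 36 locations, 2048-d image features, 512-d question encoding) and the layer names are illustrative assumptions, and gated_tanh is the sketch above:

batch, K, feat_dim, hid_dim = 2, 36, 2048, 512

image = torch.randn(batch, K, feat_dim)  # stand-in for Faster R-CNN features
qenc = torch.randn(batch, hid_dim)       # stand-in for the question encoding

# Hypothetical layers playing the roles of gt_W_img_att,
# gt_W_prime_img_att and att_wa from the excerpt.
W = nn.Linear(feat_dim + hid_dim, hid_dim)
W_prime = nn.Linear(feat_dim + hid_dim, hid_dim)
att_wa = nn.Linear(hid_dim, 1)

qenc_tiled = qenc.unsqueeze(1).expand(-1, K, -1)      # (batch, K, hid_dim)
h = gated_tanh(torch.cat((image, qenc_tiled), dim=-1), W, W_prime)
a = torch.softmax(att_wa(h).squeeze(-1), dim=-1)      # (batch, K), rows sum to 1
v_head = torch.bmm(a.unsqueeze(1), image).squeeze(1)  # (batch, feat_dim)
print(v_head.shape)  # torch.Size([2, 2048])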
