Computer Vision News - November 2018

the depth of each "pixel" is equal to the sum of the source and target embedding dimensions, since the two embeddings are concatenated at every position of the grid. In this setting, the label output for the "image" is the predicted next word of the target sequence. In the paper's illustration of such an "image", trained for translating a sentence from French into English, the source appears on the left and the target at the top.

The core of the authors' implementation works as follows: the first lines embed the target and source sequences and merge them to form the "image" (denoted as X). The function then calls the _forward function, which implements the CNN (DenseNet) together with a special aggregation function that projects the 2D data onto a one-dimensional vector. (The authors evaluated several aggregation methods, such as max-pooling, average-pooling, and attention, with results detailed in the paper.) This one-dimensional vector is fed into a softmax function to predict the next word.

This method outperformed state-of-the-art methods on the IWSLT German-English translation dataset. This initial article, with such promising results, seems to point to a revolutionary direction for sequence-to-sequence research. Detailed comparisons of different settings, such as embedding size and network depth, can be found in the original paper; the source code is available here.
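To make the construction concrete, here is a minimal PyTorch sketch of that flow. This is not the authors' code: the class name, the dimensions, and the plain two-layer convolution standing in for DenseNet are all assumptions, and the masked (causal) convolutions that keep a target position from seeing future target words are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Conv2dSeq2Seq(nn.Module):
    """Sketch of a 2D-convolutional sequence-to-sequence model."""

    def __init__(self, src_vocab, tgt_vocab, emb_dim=128, hidden=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        # Input depth is the sum of the two embedding sizes, because each
        # "pixel" concatenates a source and a target embedding.
        self.conv = nn.Sequential(
            nn.Conv2d(2 * emb_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt):
        # src: (batch, S) source token ids; tgt: (batch, T) target token ids
        e_src = self.src_emb(src)   # (batch, S, E)
        e_tgt = self.tgt_emb(tgt)   # (batch, T, E)
        S, T = e_src.size(1), e_tgt.size(1)
        # Build the (batch, T, S, 2E) "image": pixel (t, s) holds the
        # concatenation of target embedding t and source embedding s.
        x = torch.cat([
            e_tgt.unsqueeze(2).expand(-1, -1, S, -1),
            e_src.unsqueeze(1).expand(-1, T, -1, -1),
        ], dim=-1).permute(0, 3, 1, 2)            # (batch, 2E, T, S)
        h = self.conv(x)                          # (batch, H, T, S)
        # Max-pooling aggregation: collapse the source axis so each target
        # position is summarized by one hidden vector.
        pooled = h.max(dim=3).values              # (batch, H, T)
        logits = self.out(pooled.transpose(1, 2)) # (batch, T, tgt_vocab)
        return F.log_softmax(logits, dim=-1)      # next-word distribution
```

A quick smoke test with random token ids shows the expected shapes:

```python
model = Conv2dSeq2Seq(src_vocab=10000, tgt_vocab=10000)
src = torch.randint(0, 10000, (4, 12))   # batch of 4 source sentences
tgt = torch.randint(0, 10000, (4, 9))    # shifted target prefixes
print(model(src, tgt).shape)             # torch.Size([4, 9, 10000])
```

Swapping h.max(dim=3) for h.mean(dim=3) gives the average-pooling variant, one way to compare the aggregation choices the article mentions.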
