Computer Vision News - October 2018

not across channels) and the bottleneck layer only performs 1×1 convolution across channels. The Xception model has shown promising image classification results on ImageNet with fast computation. The DeepLab team adapted Xception, taking some inspiration from MSRA's Aligned Xception. Their modifications: (1) a deeper Xception; (2) all max pooling operations were replaced by depthwise separable convolution with striding, which allows atrous separable convolution to be used and feature maps to be extracted at an arbitrary resolution; and (3) extra batch normalization and ReLU layers were added.

DeepLab: DeepLab has been published in four versions so far: 1, 2, 3 and 3+. The main innovations of each version are detailed below:

1. DeepLabV1: used atrous convolution to explicitly control the resolution at which feature responses are computed within deep convolutional neural networks.
2. DeepLabV2: used filters at multiple sampling rates and effective fields-of-view; the method is known as atrous spatial pyramid pooling (ASPP), reviewed above.
3. DeepLabV3: augmented the ASPP module with image-level features to capture longer-range information, and added batch normalization.
4. DeepLabV3+: extended DeepLabV3 with a simple yet effective encoder-decoder module that refines the segmentation results, particularly along object boundaries.

The main innovation of DeepLabV3+ is incorporating an encoder-decoder structure: the encoder applies atrous convolution at multiple scales to encode multi-scale contextual information, and the decoder module uses depthwise separable convolution to achieve improved segmentation along object boundaries. For a more detailed explanation, see our Focus on section on page 56 of this magazine.

Results: Below are the results for the PASCAL VOC 2012 dataset. The acronyms in the table are as follows:

train OS: the output stride used during training.
eval OS: the output stride used during evaluation.
Decoder: employing the proposed decoder structure.
MS: multi-scale inputs during evaluation.
Flip: adding left-right flipped inputs.
SC: adopting depthwise separable convolution for both the ASPP and decoder modules.
COCO: models pretrained on MS-COCO.
JFT: models pretrained on JFT.
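To make the depthwise separable idea concrete — a spatial convolution performed independently over each channel, followed by a 1×1 convolution that mixes information across channels — here is a minimal NumPy sketch. The function name and 'valid' padding are illustrative choices, not part of the DeepLab implementation:

```python
import numpy as np

def depthwise_separable_conv(x, depthwise_k, pointwise_w):
    """Depthwise separable convolution on a CHW feature map.

    x:            (C, H, W) input feature map
    depthwise_k:  (C, kh, kw) one spatial kernel per input channel
    pointwise_w:  (C_out, C) 1x1 convolution weights mixing channels
    Returns a (C_out, H-kh+1, W-kw+1) map ('valid' padding for brevity).
    """
    C, H, W = x.shape
    _, kh, kw = depthwise_k.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    # Depthwise step: each channel is convolved with its own spatial
    # kernel only -- no information crosses channels here.
    dw = np.empty((C, out_h, out_w))
    for c in range(C):
        for i in range(out_h):
            for j in range(out_w):
                dw[c, i, j] = np.sum(x[c, i:i + kh, j:j + kw] * depthwise_k[c])
    # Pointwise (1x1) step: mix across channels at each spatial position.
    return np.tensordot(pointwise_w, dw, axes=([1], [0]))
```

Factoring a standard convolution this way cuts the parameter count from C_out·C·kh·kw to C·kh·kw + C_out·C, which is the source of Xception's fast computation.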
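The atrous convolution and ASPP ideas running through all four DeepLab versions can also be sketched briefly. Below is a toy single-channel NumPy version: the kernel is applied with "holes" at a given rate, enlarging the effective field of view without adding parameters, and ASPP simply runs several such convolutions in parallel at different rates. Function names are hypothetical; the rates (6, 12, 18) follow the values commonly quoted for DeepLab, but everything else is an illustrative simplification:

```python
import numpy as np

def atrous_conv2d(x, kernel, rate):
    """2-D atrous (dilated) convolution of a single-channel map.

    x:      (H, W) input
    kernel: (k, k) filter
    rate:   dilation rate; rate=1 reduces to ordinary convolution.
    Zero padding keeps the output the same size as the input, so
    feature resolution is controlled by the rate, not by downsampling.
    """
    k = kernel.shape[0]
    eff = k + (k - 1) * (rate - 1)   # effective field of view
    pad = eff // 2
    xp = np.pad(x, pad)
    H, W = x.shape
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            # Sample the input with stride `rate` inside the window:
            # the "holes" of atrous convolution.
            patch = xp[i:i + eff:rate, j:j + eff:rate]
            out[i, j] = np.sum(patch * kernel)
    return out

def aspp(x, kernel, rates=(1, 6, 12, 18)):
    """Toy ASPP: parallel atrous convolutions at several sampling
    rates, stacked along a new channel axis."""
    return np.stack([atrous_conv2d(x, kernel, r) for r in rates])
```

Each branch of the stack sees the same input at a different effective field of view, which is how ASPP captures objects at multiple scales from a single feature map.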

RkJQdWJsaXNoZXIy NTc3NzU=