Computer Vision News - January 2019

very high resolution images, with labeling for every pixel. This allows only a very small number of images in each batch, making batch normalization excessively difficult. Networks trained under the fine-tuning paradigm avoid this difficulty because they reuse the normalization parameters learned during pre-training. For training from random initialization on detection and segmentation tasks, the authors adopted the following normalization methods:

1. Group Normalization (GN) -- GN performs computation that is independent of the batch dimension, so its accuracy is insensitive to batch size.

2. Synchronized Batch Normalization (SyncBN) -- an implementation of BN that increases the effective batch size by computing statistics across many GPUs, overcoming the small-batch problem.

Learning rate

The learning rate policy was to lower the learning rate by 10x for the last 60k iterations and by another 10x for the last 20k iterations. The authors showed that there is no need to lower the learning rate earlier than just before the very end of training. There is also no need to train at a low learning rate for a long time -- this only causes overfitting.

Hyper-parameters

All other hyper-parameters follow those in Detectron. Specifically, the initial learning rate is 0.02 (with a linear warm-up), the weight decay is 0.0001 and the momentum is 0.9. All models are trained on 8 GPUs using synchronized SGD, with a mini-batch size of 2 images per GPU. Per Detectron's default, Mask R-CNN used no data augmentation for testing and only horizontal flipping augmentation for training. The image scale was 800 pixels for the shorter side.

Results

Given enough data, any network can be trained :-) as can be seen in the graph that follows. The volume of data used for ImageNet pre-training is shown in light blue; the volume used for fine-tuning is in darker blue; and the volume used for training from scratch is in purple.
The top bar is the number of images used for training; the middle bar is the number of object instances (each image can include more than one object); the bottom purple bar shows the total volume of pixels processed (image sizes vary between datasets), which translates to the volume of data. You can see from the bottom purple bar that the network processes roughly the same overall data volume whether it is pre-trained and then fine-tuned or trained from scratch (random initialization).
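The batch-size independence of Group Normalization can be seen directly in how its statistics are computed. A minimal sketch (assuming NumPy; the helper name group_norm and the tensor sizes are ours for illustration, not from the paper): the mean and variance are taken per sample and per group of channels, never across the batch dimension, which is why a mini-batch of only 2 images per GPU poses no problem.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    # x has shape (N, C, H, W). Statistics are computed per sample and
    # per channel group -- never across the batch dimension N -- so the
    # result is identical whether the batch holds 2 images or 256.
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    g = (g - mean) / np.sqrt(var + eps)
    return g.reshape(n, c, h, w)

# Tiny batch, as in the detection setup quoted above (2 images per GPU).
x = np.random.randn(2, 64, 14, 14)
y = group_norm(x, num_groups=32)
```

In a real network a learned per-channel scale and shift would follow the normalization; they are omitted here to keep the statistics computation in focus.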

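The learning-rate schedule described above can be sketched as a small helper (the function name lr_at and the 270k total-iteration count are hypothetical choices of ours; the 10x drops over the last 60k and last 20k iterations follow the article):

```python
def lr_at(iteration, total_iters, base_lr=0.02):
    # Keep the base rate (0.02 per the Detectron defaults quoted above)
    # until near the end, then drop 10x for the last 60k iterations and
    # another 10x for the last 20k iterations.
    if iteration >= total_iters - 20_000:
        return base_lr * 0.01
    if iteration >= total_iters - 60_000:
        return base_lr * 0.1
    return base_lr

# For an illustrative 270k-iteration run, the drops land at 210k and 250k.
schedule = [lr_at(i, 270_000) for i in (0, 100_000, 220_000, 260_000)]
```

The linear warm-up mentioned in the text would modify the very first iterations and is omitted from this sketch.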