Computer Vision News

CNN Architectures for Large-Scale Audio Classification
by Assaf Spanier

CNNs have proven highly effective for image classification and have shown promising preliminary results for audio classification. In this article the authors adapt several popular CNN architectures (AlexNet, VGG, Inception, ResNet), alongside a fully connected DNN baseline, and demonstrate their capacity to categorize video segments based on their audio track alone, using the YouTube-100M dataset. This is a very large-scale library of 100M video clips averaging 4.6 minutes each; the 70M training clips alone total about 5.4 million hours. The clips carry about 30,000 distinct labels (objects seen during the clip), with an average of about 5 labels per video. The authors split the dataset into 70M, 10M and 20M clips for training, evaluation, and validation respectively.

The audio is divided into non-overlapping 960 ms frames, which for the 70M training videos gave about 20 billion training samples. Each audio frame inherited all the labels of its parent video. A short-time Fourier transform with 25 ms windows every 10 ms was used to decompose the frames, and the result was integrated into 64 mel-spaced frequency bins and log-transformed. This process resulted in a spectrogram of dimension 96 × 64 (96 time steps of 10 ms each by 64 frequency bins), which was used as input to all the networks.

Evaluation: The authors evaluated the following networks (four CNNs and one fully connected network) on the above dataset:

1. Fully Connected Network -- A baseline fully connected network with ReLU activations, 3 layers and 2000 neurons per layer.
2. AlexNet -- The original AlexNet, with the stride of the initial convolutional layer set to 2×1 so that the number of activations after that layer is similar to that of the original AlexNet with its 224×224×3 input. In addition, batch normalization was used after each convolutional layer instead of local response normalization (LRN), and the final layer was replaced with a 3087-unit layer to fit the number of labels in the current task.
3. VGG -- The original VGG network was used with two changes similar to those made to AlexNet: (a) the final layer was set to 3087 units with a sigmoid to fit the number of labels in the current task; (b) batch normalization was used instead of LRN.
4. Inception V3 -- The original Inception V3 network was used with the following changes: the first four layers of the stem, up to and including the MaxPool, were removed, as was the auxiliary network. The average pool size was changed to 10×6 to reflect the change in activation layer size.
5. ResNet-50 -- The original ResNet-50 network was used with the following modification: the stride of 2 was removed from the first 7×7 convolution so that the number of activations was not too different in the audio version. In addition, the average pool size was set to 6×4 to reflect the change in activation layer size.
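
The framing and labelling scheme described above can be sketched in a few lines of Python. This is only an illustration: the function names and the example labels are hypothetical, and the sample count is a back-of-the-envelope estimate from the figures quoted in the article (70M training videos, 4.6 minutes on average, 960 ms frames), not the authors' exact bookkeeping.

```python
FRAME_MS = 960            # non-overlapping frame length, from the article
MEAN_CLIP_MS = 276_000    # 4.6 minutes = 276 s, the average clip length


def expand_labels(video_labels, duration_ms):
    """Give every complete 960 ms frame the full label set of its parent video.

    Returns a list of (frame_start_ms, labels) pairs.
    """
    n_frames = duration_ms // FRAME_MS   # an incomplete tail frame is dropped
    labels = tuple(video_labels)
    return [(i * FRAME_MS, labels) for i in range(n_frames)]


def estimated_training_frames(n_videos=70_000_000, mean_clip_ms=MEAN_CLIP_MS):
    """Rough number of training frames implied by the article's figures."""
    return n_videos * mean_clip_ms // FRAME_MS


if __name__ == "__main__":
    # Hypothetical 3-second clip with two video-level labels.
    for start_ms, labels in expand_labels(["guitar", "music"], 3_000):
        print(start_ms, labels)
    print(f"{estimated_training_frames():,} frames")
```

The estimate comes out at roughly 20 billion frames, which agrees with the figure quoted for the 70M training videos.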
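
To make the 96 × 64 input concrete, here is a minimal NumPy sketch of how such patches could be computed. Only the 25 ms window, the 10 ms hop, the 960 ms patch length and the 96 × 64 shape come from the article. The 16 kHz sample rate, the Hann window, the 64 triangular mel filters between 125 Hz and 7500 Hz, the log offset, and the choice to cut patches from the spectrogram rather than from the raw waveform are all assumptions made for illustration.

```python
import numpy as np

SAMPLE_RATE = 16_000              # assumption: not stated in the article
WINDOW_S, HOP_S = 0.025, 0.010    # from the article: 25 ms windows every 10 ms
N_MEL = 64                        # frequency bins per patch
PATCH_FRAMES = 96                 # 96 hops x 10 ms = 960 ms per patch


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_fft, fmin=125.0, fmax=7500.0):
    """Triangular mel filters, shape (n_fft // 2 + 1, N_MEL). fmin/fmax are assumptions."""
    fft_freqs = np.linspace(0.0, SAMPLE_RATE / 2, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), N_MEL + 2))
    lower, center, upper = edges[:-2], edges[1:-1], edges[2:]
    rising = (fft_freqs[:, None] - lower) / (center - lower)
    falling = (upper - fft_freqs[:, None]) / (upper - center)
    return np.maximum(0.0, np.minimum(rising, falling))


def log_mel_patches(waveform):
    """Return log-mel patches of shape (n_patches, 96, 64) for a mono waveform."""
    waveform = np.asarray(waveform, dtype=float)
    win = int(round(WINDOW_S * SAMPLE_RATE))    # 400 samples
    hop = int(round(HOP_S * SAMPLE_RATE))       # 160 samples
    n_fft = 1 << (win - 1).bit_length()         # next power of two: 512
    n_frames = max(0, 1 + (len(waveform) - win) // hop)
    idx = hop * np.arange(n_frames)[:, None] + np.arange(win)[None, :]
    frames = waveform[idx] * np.hanning(win)    # Hann window is an assumption
    magnitude = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    log_mel = np.log(magnitude @ mel_filterbank(n_fft) + 0.01)  # offset avoids log(0)
    n_patches = n_frames // PATCH_FRAMES        # drop the incomplete tail
    return log_mel[: n_patches * PATCH_FRAMES].reshape(n_patches, PATCH_FRAMES, N_MEL)


if __name__ == "__main__":
    t = np.arange(3 * SAMPLE_RATE) / SAMPLE_RATE
    patches = log_mel_patches(np.sin(2 * np.pi * 440.0 * t))  # 3 s test tone
    print(patches.shape)
```

Each patch starts 96 hops (960 ms) after the previous one, so consecutive patches do not share spectrogram frames.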
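
The average pool sizes quoted for Inception V3 (10×6) and ResNet-50 (6×4) can be checked with simple output-size arithmetic on the 96 × 64 input. The sketch below lists only the layers that change spatial resolution. Their kernel, stride and padding values are taken from the standard published versions of the two architectures and are assumptions here, since the article does not list them.

```python
def out_size(size, kernel, stride=1, pad=0):
    """Output length of a convolution or pooling layer along one axis."""
    return (size + 2 * pad - kernel) // stride + 1


def final_shape(shape, layers):
    """Apply a list of (kernel, stride, pad) layers to a (time, freq) shape."""
    for kernel, stride, pad in layers:
        shape = tuple(out_size(s, kernel, stride, pad) for s in shape)
    return shape


# ResNet-50 with the stride of 2 removed from the first 7x7 convolution.
RESNET50_AUDIO = [
    (7, 1, 3),  # first 7x7 convolution, stride 1 instead of 2
    (3, 2, 1),  # 3x3 max pool
    (3, 2, 1),  # downsampling at the start of stage 3
    (3, 2, 1),  # downsampling at the start of stage 4
    (3, 2, 1),  # downsampling at the start of stage 5
]

# Inception V3 with the first four stem layers (up to the MaxPool) removed.
INCEPTION_V3_AUDIO = [
    (1, 1, 0),  # remaining 1x1 stem convolution
    (3, 1, 0),  # remaining 3x3 stem convolution, no padding
    (3, 2, 0),  # remaining stem max pool
    (3, 2, 0),  # first grid-reduction block
    (3, 2, 0),  # second grid-reduction block
]

if __name__ == "__main__":
    print(final_shape((96, 64), RESNET50_AUDIO))      # (6, 4)
    print(final_shape((96, 64), INCEPTION_V3_AUDIO))  # (10, 6)
```

Under these assumptions the final feature maps are 6×4 and 10×6, so an average pool of the quoted size covers the whole map, just as the 7×7 and 8×8 pools do in the original image networks.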

Computer Vision News - May 2018