Computer Vision News - October 2016

Lines 1-3: The results start with a “root-only” model (i.e., no parts) and then show results after adding 4 or 8 parts to each component. With 4 parts, mAP increases by 0.9 percentage points.

Line 4: The hypothesis is that the convolution filters in conv5 already act as a set of shared “parts” on top of the conv4 features. This idea is implemented by applying a 3×3 max filter to conv5 and then training a root-only DeepPyramid DPM with three components, which achieves 45.2. This is the best result in this section and it supports the hypothesis. The hypothesis is also supported by the face heat maps above (HOG vs. conv5): conv5 selects specific visual structures at their locations and scales (a face at conv5, level 5).

Lines 5-6: Compared to the HOG-DPM baseline using 6 components and 8 parts (33.7%), removing the parts decreases performance to 25.2%. Neither variant performs as well as DeepPyramid DPM max5.

Line 7: A comparison with the recently proposed R-CNN [Goodfellow, ICML 2013]. The R-CNN compared here uses pool5 features without fine-tuning.

Lines 8-9: Fine-tuned R-CNN [Goodfellow, ICML 2013] clearly outperforms DeepPyramid DPM. However, DeepPyramid DPM runs about 20x faster.

The R-CNN results (lines 7-9) suggest that the gains from fine-tuning come in through the non-linear classifier (layers fc6 and fc7) applied to pool5 features. This in turn suggests that similar levels of performance might be achievable with DeepPyramid DPM by using a more powerful non-linear classifier than the SVM.
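To make the max5 idea from line 4 concrete, here is a minimal sketch of a 3×3 max filter applied to one channel of a feature map. It is an illustration only: the function name `max_filter_3x3`, the stride of 1 and the border handling are assumptions, not details taken from the paper. The point it shows is that a single strong conv5 response is spread over its immediate neighbourhood, which gives a root-only model a small amount of built-in tolerance to shifts.

```python
def max_filter_3x3(feature_map):
    """Apply a 3x3 max filter with stride 1 to a 2D feature map (list of rows).

    Assumption: border cells use only the neighbours that exist, so the
    output has the same height and width as the input.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    output = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Collect every value in the (clipped) 3x3 window around (y, x).
            window = [
                feature_map[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            output[y][x] = max(window)
    return output


# Toy single-channel "conv5" response (hypothetical values).
conv5_channel = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 5.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
max5_channel = max_filter_3x3(conv5_channel)
for row in max5_channel:
    print(row)
```

In this toy input, the activation of 5.0 at position (1, 1) ends up covering its whole 3×3 neighbourhood, so a filter placed one cell off still sees the response.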
Line  Method                                 Number of components  Number of parts  Train  Test mAP
1     DP-DPM conv5                           3                     0                43.3   N/A
2     DP-DPM conv5                           3                     4                44.2
3     DP-DPM conv5                           3                     8                44.4
4     DP-DPM max5                            3                     0                45.2   42.0
5     HOG-DPM                                6                     0                25.2   N/A
6     HOG-DPM                                6                     8                33.7   33.4
7     R-CNN pool5 [Goodfellow, ICML 2013]    N/A                   N/A              44.2   N/A
8     R-CNN FT fc7 [Goodfellow, ICML 2013]   N/A                   N/A              54.2   50.2
9     R-CNN FT fc7 BB                        N/A                   N/A              85.5   53.7

Sum up: For decades, visual recognition models have made wide use of part-based representations such as deformable part models (DPMs). In recent years, strong performance on image classification, object detection and a wide variety of other vision tasks has made CNNs extremely popular among researchers. DPMs and CNNs are generally viewed as distinct approaches to visual recognition: the former are graphical models and the latter are non-linear classifiers. The question asked by this paper is whether these models are actually distinct, or whether any DPM can in fact be formulated as an equivalent CNN. To show this, the authors introduce the notion of distance transform pooling and the object geometry layer. The authors find that DeepPyramid DPMs significantly outperform DPMs based on histograms of oriented gradients (HOG) features and slightly outperform a comparable version of the R-CNN detection system, while running significantly faster.
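The distance transform pooling mentioned above can be illustrated with a small one-dimensional sketch. It assumes the usual DPM-style deformation cost, in which a part's score at a displaced position is reduced by a quadratic (plus optional linear) penalty on the displacement. The function name, parameter names and the brute-force search are illustrative assumptions, not the authors' implementation, and efficient distance transform algorithms avoid the quadratic loop used here.

```python
def distance_transform_pool_1d(scores, quad_cost, lin_cost=0.0):
    """Brute-force 1D distance transform pooling (illustrative sketch).

    For every anchor position p, return the best part score over all
    positions q, after subtracting a deformation penalty on the
    displacement (p - q). quad_cost and lin_cost are hypothetical names
    for the deformation weights.
    """
    n = len(scores)
    pooled = []
    for p in range(n):
        best = max(
            scores[q] - quad_cost * (p - q) ** 2 - lin_cost * (p - q)
            for q in range(n)
        )
        pooled.append(best)
    return pooled


# Toy part-filter response with one strong peak (hypothetical values).
part_scores = [0.0, 0.0, 4.0, 0.0, 0.0]

# A quadratic penalty lets the peak contribute to nearby anchors at a cost.
pooled = distance_transform_pool_1d(part_scores, quad_cost=1.0)

# With zero penalty, the pooling degenerates to a global max.
pooled_free = distance_transform_pool_1d(part_scores, quad_cost=0.0)

print(pooled)
print(pooled_free)
```

Seen this way, ordinary max pooling is the special case in which the penalty is zero inside a fixed window and infinite outside it, which is one way to see why a DPM's deformable parts can be expressed as a pooling layer in a CNN.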