1607.07155 - hassony2/inria-research-wiki GitHub Wiki
ECCV 2016
[1607.07155] A unified multi-scale deep convolutional neural network for fast object detection [PDF] [notes]
Zhaowei Cai, Quanfu Fan, Rogerio S. Feris, Nuno Vasconcelos
read 25/09/2017
Objective
Solve multi-class object detection when objects can appear at various scales (detectors often miss small instances because of pooling operations, which reduce small objects to basically nothing at the feature level)
Introduce MS-CNN specifically to handle those edge cases
MS-CNN idea:
Train several complementary detectors at different output layers to cover various scales.
Various scales match various conv layers, which have different receptive field sizes.
Deconvolutional layers are introduced to increase the resolution of feature maps.
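As a rough illustration of why deeper layers match larger objects, here is a minimal sketch (not from the paper; the VGG-like layer specs are assumptions) that computes the receptive field size after each conv/pool layer:

```python
# Minimal sketch: receptive field growth through a VGG-like stack.
# Layer specs are assumptions: (kernel_size, stride) per layer.

def receptive_fields(layers):
    """Return the receptive field size after each layer."""
    rf, jump = 1, 1  # receptive field and cumulative stride ("jump")
    out = []
    for kernel, stride in layers:
        rf = rf + (kernel - 1) * jump
        jump = jump * stride
        out.append(rf)
    return out

# VGG-ish prefix: 3x3 conv blocks with 2x2 max-pooling in between
vgg_like = [(3, 1), (3, 1), (2, 2),          # conv1_x + pool1
            (3, 1), (3, 1), (2, 2),          # conv2_x + pool2
            (3, 1), (3, 1), (3, 1), (2, 2),  # conv3_x + pool3
            (3, 1), (3, 1), (3, 1), (2, 2)]  # conv4_x + pool4

print(receptive_fields(vgg_like))  # receptive field grows with depth
```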
Synthesis
Previous approaches
- An efficient way to cover multi-scale detection is to learn a single classifier and feed it several images rescaled with different ratios (see the sketch after this list). This approach is costly, as the classifier has to be run as many times as there are rescaled images.
  - accurate
  - slow
- Apply several detectors, each specialized in one specific detection scale.
  - not as accurate
  - faster, as features can be computed only once
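A minimal sketch of the single-classifier strategy above, in a PyTorch setting; `detector` is a hypothetical single-scale model, and the point is only that one full forward pass is needed per rescaled image:

```python
# Hedged sketch of single-classifier, multi-scale-input detection.
import torch
import torch.nn.functional as F

def pyramid_detect(detector, image, scales=(0.5, 1.0, 2.0)):
    """Run one fixed-scale detector over several rescalings of an NCHW image."""
    detections = []
    for s in scales:
        resized = F.interpolate(image, scale_factor=s, mode="bilinear",
                                align_corners=False)
        dets = detector(resized)          # one full forward pass per scale
        detections.append((s, dets))
    return detections  # boxes would then be mapped back to the original scale
```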
Hybrid approaches
- Compute features for a small number of rescaled images and interpolate the feature values at intermediate scales.
  - small accuracy decrease
  - speed-up, as features are computed only a small number of times
- RPN: apply several models to the features to detect bounding boxes at various locations.
  - as a single set of features is used (matching a single scale), it is mismatched for some object sizes
MS-CNN approach
Region Proposal Network
Use various conv layers that correspond to different receptive fields as input to shallow networks; each shallow network outputs, at various locations, b bounding box coordinates and one score per class for c classes (with a number of locations that matches the resolution of the feature map)
This gives different "detection branches" (detections computed from a single forward pass but with outputs that match different scales).
All the parameters are learned during training
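A hedged sketch of what such detection branches could look like in PyTorch; the channel counts, kernel sizes, and number of branches are assumptions, not the paper's exact architecture:

```python
# Sketch: shallow conv heads attached to backbone layers with different
# receptive fields, each predicting class scores and box coordinates
# at every spatial location of its feature map.
import torch
import torch.nn as nn

class DetectionBranch(nn.Module):
    def __init__(self, in_channels, num_classes, num_boxes):
        super().__init__()
        # one score per class and 4 coordinates per box, per location
        self.cls = nn.Conv2d(in_channels, num_classes, kernel_size=3, padding=1)
        self.box = nn.Conv2d(in_channels, 4 * num_boxes, kernel_size=3, padding=1)

    def forward(self, feat):
        return self.cls(feat), self.box(feat)

# branches on feature maps with different receptive fields (e.g. conv3/4/5)
branches = nn.ModuleList([
    DetectionBranch(256, num_classes=4, num_boxes=1),  # small objects
    DetectionBranch(512, num_classes=4, num_boxes=1),  # medium objects
    DetectionBranch(512, num_classes=4, num_boxes=1),  # large objects
])
```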
Loss
- They use a multi-task loss:
- class loss that is a cross-entropy loss over the predicted class probabilities
- bounding box loss
- smooth L1 penalization over the predicted coordinates and dimensions of the bounding box
- only computed for positive samples (if max IoU with ground truth bounding boxes >= 0.5)
- positive and negative samples are heavily unbalanced (more patches in natural images are background than objects), therefore hard negatives are mined by choosing a given number of the negative samples with the largest loss values (a sketch of the whole loss follows this list)
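A minimal sketch of this multi-task loss with bootstrapped hard negatives, under assumed tensor shapes and an assumed 3:1 negative-to-positive ratio (the exact sampling details differ in the paper):

```python
# Hedged sketch: cross-entropy class loss + smooth L1 box loss on
# positives only, with hard negatives mined by largest class loss.
import torch
import torch.nn.functional as F

def rpn_loss(cls_logits, box_pred, labels, box_targets, neg_per_pos=3):
    # labels: (N,) with 0 = background, >0 = object class
    # cls_logits: (N, num_classes); box_pred / box_targets: (N, 4)
    pos = labels > 0  # positive samples (max IoU with ground truth >= 0.5)

    # per-sample class loss, no reduction so we can mine hard negatives
    ce = F.cross_entropy(cls_logits, labels, reduction="none")

    # keep all positives plus the negatives with the largest loss
    num_neg = min(int(pos.sum()) * neg_per_pos, int((~pos).sum()))
    hard_neg = ce.masked_fill(pos, -1).topk(num_neg).indices
    cls_loss = ce[pos].sum() + ce[hard_neg].sum()

    # smooth L1 box loss, computed for positive samples only
    box_loss = F.smooth_l1_loss(box_pred[pos], box_targets[pos], reduction="sum")

    total = int(pos.sum()) + num_neg
    return (cls_loss + box_loss) / max(total, 1)
```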
Object detection network
The RPN could serve as a detector, but it is not strong enough, as its sliding windows do not cover all object shapes.
- Deconvolution is used to increase the resolution of the feature maps. This is justified as a faster and less memory-intensive way to increase resolution than enlarging the input image. Deconvolution was also compared with simple feature upsampling (which would be even faster, with virtually no extra memory cost) and showed better performance. Note that the article reminds us that upsampling does not increase the resolution of image details; instead, it allows higher convolutional layers to respond more strongly to small objects.
- A ROI pooling layer is added to extract features of fixed dimension (7x7x512) for both the object (a feature map crop centered on the object) and its context (a 1.5x larger crop). Those two 7x7x512 features are stacked together to form a context-aware detection feature (see the sketch after this list). Context embedding almost doubles the number of parameters in the model, justifying the need for dimensionality reduction.
- A convolutional layer without padding is added to compress the output to an SxSx512 feature map.
- The features are fed to a fully connected layer and to the output layers (bounding box and class probabilities).
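A hedged sketch of this context-aware feature extraction, assuming a 512-channel VGG-style feature map and using torchvision's `roi_align` as a stand-in for the paper's ROI pooling; `scale_boxes` is a small helper written for this example, and the exact layer shapes are assumptions:

```python
# Sketch: deconv upsampling, then object + 1.5x context ROI features,
# stacked and compressed by a padding-free convolution.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

deconv = nn.ConvTranspose2d(512, 512, kernel_size=4, stride=2, padding=1)  # 2x upsampling
reduce_conv = nn.Conv2d(2 * 512, 512, kernel_size=3, padding=0)  # no padding: 7x7 -> 5x5

def scale_boxes(boxes, factor):
    """Hypothetical helper: enlarge (x1, y1, x2, y2) boxes around their centers."""
    cx, cy = (boxes[:, 0] + boxes[:, 2]) / 2, (boxes[:, 1] + boxes[:, 3]) / 2
    w = (boxes[:, 2] - boxes[:, 0]) * factor / 2
    h = (boxes[:, 3] - boxes[:, 1]) * factor / 2
    return torch.stack([cx - w, cy - h, cx + w, cy + h], dim=1)

def detection_features(feat, boxes, spatial_scale):
    feat = deconv(feat)                            # higher-resolution feature map
    obj = roi_align(feat, [boxes], output_size=7, spatial_scale=spatial_scale)
    ctx = roi_align(feat, [scale_boxes(boxes, 1.5)], output_size=7,
                    spatial_scale=spatial_scale)   # 1.5x larger context crop
    stacked = torch.cat([obj, ctx], dim=1)         # 7x7 x 1024 context-aware feature
    return reduce_conv(stacked)                    # compressed (here: 5x5 x 512)
```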
An object detection loss (again with cross-entropy for the class loss and smooth L1 for the bounding box values) is added to the RPN loss, and the structure is trained end-to-end
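Summarizing, the end-to-end objective can plausibly be written as below; the balancing weight lambda is an assumption, not a value stated in these notes:

```latex
% Hedged summary of the combined training objective
L = \underbrace{L_{cls}^{rpn} + \lambda\, L_{loc}^{rpn}}_{\text{proposal loss}}
  + \underbrace{L_{cls}^{det} + \lambda\, L_{loc}^{det}}_{\text{detection loss}}
```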
Results
KITTI dataset
- Detection is accepted if the best-match IoU is higher than 70% for cars and 50% for pedestrians/cyclists
- Increasing the size of the input image improves detection accuracy when going from 384px to 576px, but there is no noticeable improvement beyond that
- Outperforms Faster-RCNN on pedestrians and cyclists