1607.07155 - hassony2/inria-research-wiki GitHub Wiki
ECCV 2016
[1607.07155] A unified multi-scale deep convolutional neural network for fast object detection [PDF] [notes]
Zhaowei Cai, Quanfu Fan, Rogerio S. Feris, Nuno Vasconcelos
read 25/09/2017
Objective
Solve multi-class object detection when objects can appear at various scales (detectors often miss small instances because of pooling operations, which reduce small objects to basically nothing at the feature level)
Introduce MS-CNN specifically to handle those edge cases
MS-CNN idea:
Train several complementary detectors at different output layers to cover various scales.
Various scales match various conv layers, which have different receptive field sizes.
Deconvolutional layers are introduced to increase the resolution of feature maps.
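As a rough illustration of why deeper layers match larger objects, here is a minimal sketch (not from the paper; the VGG-like layer specs are assumptions) that computes the receptive field size after each conv/pool layer:

```python
# Minimal sketch: receptive field growth through a VGG-like stack.
# Layer specs are assumptions: (kernel_size, stride) per layer.

def receptive_fields(layers):
    """Return the receptive field size after each layer."""
    rf, jump = 1, 1  # receptive field and cumulative stride ("jump")
    out = []
    for kernel, stride in layers:
        rf = rf + (kernel - 1) * jump
        jump = jump * stride
        out.append(rf)
    return out

# VGG-ish prefix: 3x3 conv blocks with 2x2 max-pooling in between
vgg_like = [(3, 1), (3, 1), (2, 2),          # conv1_x + pool1
            (3, 1), (3, 1), (2, 2),          # conv2_x + pool2
            (3, 1), (3, 1), (3, 1), (2, 2),  # conv3_x + pool3
            (3, 1), (3, 1), (3, 1), (2, 2)]  # conv4_x + pool4

print(receptive_fields(vgg_like))  # receptive field grows with depth
```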
Synthesis
Previous approaches
- An efficient way to cover multi-scale detection is to learn a single classifier and feed it several images rescaled with different ratios (see the sketch after this list). This approach is costly, as the classifier has to be run as many times as there are rescaled images.
  - accurate
  - slow
- Apply several detectors, each specialized in one specific detection scale.
  - not as accurate
  - faster, as features can be computed only once
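A minimal sketch of the single-classifier strategy above, in a PyTorch setting; `detector` is a hypothetical single-scale model, and the point is only that one full forward pass is needed per rescaled image:

```python
# Hedged sketch of single-classifier, multi-scale-input detection.
import torch
import torch.nn.functional as F

def pyramid_detect(detector, image, scales=(0.5, 1.0, 2.0)):
    """Run one fixed-scale detector over several rescalings of an NCHW image."""
    detections = []
    for s in scales:
        resized = F.interpolate(image, scale_factor=s, mode="bilinear",
                                align_corners=False)
        dets = detector(resized)          # one full forward pass per scale
        detections.append((s, dets))
    return detections  # boxes would then be mapped back to the original scale
```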
Hybrid approaches
- Compute features for a small number of rescaled images and interpolate the feature values at intermediate scales.
  - small accuracy decrease
  - speed-up, as features are computed only a small number of times
- RPN: apply several models to the features to detect bounding boxes at various locations.
  - as a single set of features is used (matching a single scale), it is mismatched for some object sizes
MS-CNN approach
Region Proposal Network
Use various conv layers that correspond to different receptive fields as input to shallow networks; each shallow network outputs, at various locations, b bounding box coordinates and one score per class for c classes (with a number of locations that matches the resolution of the feature map)
This gives different "detection branches" (detections computed from a single forward pass but with outputs that match different scales).
All the parameters are learned during training
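A hedged sketch of what such detection branches could look like in PyTorch; the channel counts, kernel sizes, and number of branches are assumptions, not the paper's exact architecture:

```python
# Sketch: shallow conv heads attached to backbone layers with different
# receptive fields, each predicting class scores and box coordinates
# at every spatial location of its feature map.
import torch
import torch.nn as nn

class DetectionBranch(nn.Module):
    def __init__(self, in_channels, num_classes, num_boxes):
        super().__init__()
        # one score per class and 4 coordinates per box, per location
        self.cls = nn.Conv2d(in_channels, num_classes, kernel_size=3, padding=1)
        self.box = nn.Conv2d(in_channels, 4 * num_boxes, kernel_size=3, padding=1)

    def forward(self, feat):
        return self.cls(feat), self.box(feat)

# branches on feature maps with different receptive fields (e.g. conv3/4/5)
branches = nn.ModuleList([
    DetectionBranch(256, num_classes=4, num_boxes=1),  # small objects
    DetectionBranch(512, num_classes=4, num_boxes=1),  # medium objects
    DetectionBranch(512, num_classes=4, num_boxes=1),  # large objects
])
```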
Loss
- They use a multi-task loss:
- class loss that is a cross-entropy loss over the predicted class probabilities
- bounding box loss
- smooth L1 penalization over the predicted coordinates and dimensions of the bounding box
- only computed for positive samples (if max IoU with ground truth bounding boxes >= 0.5)
- positive and negative samples are heavily unbalanced (more patches in natural images are background than objects), therefore hard negatives are mined by choosing a given number of the negative samples with the largest loss values (a sketch of the whole loss follows this list)
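A minimal sketch of this multi-task loss with bootstrapped hard negatives, under assumed tensor shapes and an assumed 3:1 negative-to-positive ratio (the exact sampling details differ in the paper):

```python
# Hedged sketch: cross-entropy class loss + smooth L1 box loss on
# positives only, with hard negatives mined by largest class loss.
import torch
import torch.nn.functional as F

def rpn_loss(cls_logits, box_pred, labels, box_targets, neg_per_pos=3):
    # labels: (N,) with 0 = background, >0 = object class
    # cls_logits: (N, num_classes); box_pred / box_targets: (N, 4)
    pos = labels > 0  # positive samples (max IoU with ground truth >= 0.5)

    # per-sample class loss, no reduction so we can mine hard negatives
    ce = F.cross_entropy(cls_logits, labels, reduction="none")

    # keep all positives plus the negatives with the largest loss
    num_neg = min(int(pos.sum()) * neg_per_pos, int((~pos).sum()))
    hard_neg = ce.masked_fill(pos, -1).topk(num_neg).indices
    cls_loss = ce[pos].sum() + ce[hard_neg].sum()

    # smooth L1 box loss, computed for positive samples only
    box_loss = F.smooth_l1_loss(box_pred[pos], box_targets[pos], reduction="sum")

    total = int(pos.sum()) + num_neg
    return (cls_loss + box_loss) / max(total, 1)
```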
Object detection network
The RPN could serve as a detector, but it is not strong enough, as its sliding windows do not cover all object shapes.
- Deconvolution is used to increase the resolution of the feature maps. This is justified as a faster and less memory-intensive way to increase resolution than enlarging the input image. Deconvolution was also compared with simple feature upsampling (which would be even faster, with virtually no extra memory cost) and showed better performance. Note that the article reminds us that upsampling does not increase the resolution of image details; instead, it allows higher convolutional layers to respond more strongly to small objects.
- A ROI pooling layer is added to extract features of fixed dimension (7x7x512) for both the object (a feature map crop centered on the object) and its context (a 1.5x larger crop). Those two 7x7x512 features are stacked together to form a context-aware detection feature (see the sketch after this list). Context embedding almost doubles the number of parameters in the model, justifying the need for dimensionality reduction.
- A convolutional layer without padding is added to compress the output to an SxSx512 feature map.
- The features are fed to a fully connected layer and to the output layers (bounding box and class probabilities).
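A hedged sketch of this context-aware feature extraction, assuming a 512-channel VGG-style feature map and using torchvision's `roi_align` as a stand-in for the paper's ROI pooling; `scale_boxes` is a small helper written for this example, and the exact layer shapes are assumptions:

```python
# Sketch: deconv upsampling, then object + 1.5x context ROI features,
# stacked and compressed by a padding-free convolution.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

deconv = nn.ConvTranspose2d(512, 512, kernel_size=4, stride=2, padding=1)  # 2x upsampling
reduce_conv = nn.Conv2d(2 * 512, 512, kernel_size=3, padding=0)  # no padding: 7x7 -> 5x5

def scale_boxes(boxes, factor):
    """Hypothetical helper: enlarge (x1, y1, x2, y2) boxes around their centers."""
    cx, cy = (boxes[:, 0] + boxes[:, 2]) / 2, (boxes[:, 1] + boxes[:, 3]) / 2
    w = (boxes[:, 2] - boxes[:, 0]) * factor / 2
    h = (boxes[:, 3] - boxes[:, 1]) * factor / 2
    return torch.stack([cx - w, cy - h, cx + w, cy + h], dim=1)

def detection_features(feat, boxes, spatial_scale):
    feat = deconv(feat)                            # higher-resolution feature map
    obj = roi_align(feat, [boxes], output_size=7, spatial_scale=spatial_scale)
    ctx = roi_align(feat, [scale_boxes(boxes, 1.5)], output_size=7,
                    spatial_scale=spatial_scale)   # 1.5x larger context crop
    stacked = torch.cat([obj, ctx], dim=1)         # 7x7 x 1024 context-aware feature
    return reduce_conv(stacked)                    # compressed (here: 5x5 x 512)
```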
An object detection loss (again with cross-entropy for the class loss and smooth L1 for the bounding box values) is added to the RPN loss, and the structure is trained end-to-end
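Summarizing, the end-to-end objective can plausibly be written as below; the balancing weight lambda is an assumption, not a value stated in these notes:

```latex
% Hedged summary of the combined training objective
L = \underbrace{L_{cls}^{rpn} + \lambda\, L_{loc}^{rpn}}_{\text{proposal loss}}
  + \underbrace{L_{cls}^{det} + \lambda\, L_{loc}^{det}}_{\text{detection loss}}
```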
Results
KITTI dataset
- Detection is accepted if the best-match IoU is higher than 70% for cars and 50% for pedestrians/cyclists
- Increasing the size of the input image improves detection accuracy when going from 384px to 576px, but there is no noticeable improvement beyond that
- Outperforms Faster-RCNN on pedestrians and cyclists