SSD - Lab41/attalos GitHub Wiki

Single Shot MultiBox Detector

Single Shot MultiBox Detector (SSD) is a fast method for detecting objects in images using a deep neural network. The network produces thousands of predictions at various scales and aspect ratios before performing non-maximum suppression, resulting in a handful of final tags. The following page provides some links to help in setting up and understanding SSD.
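
To make the non-maximum suppression step concrete, here is a minimal NumPy sketch of greedy NMS. It is an illustration only, not the Caffe implementation SSD uses; the function name and the default overlap threshold of 0.45 are assumptions made for this example.

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.45):
    """Greedy NMS sketch.

    boxes:  (N, 4) array of [xmin, ymin, xmax, ymax]
    scores: (N,) array of confidences
    Returns the indices of the boxes that survive, highest score first.
    """
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)[::-1]  # highest score first
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Intersection of the best box with every remaining box
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[best] + areas[rest] - inter)
        # Drop every box that overlaps the best box too much
        order = rest[iou <= iou_threshold]
    return keep
```

Two heavily overlapping boxes collapse to the higher-scoring one, while a distant box is kept; this is how thousands of raw predictions shrink to a handful of tags.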

Set Up

Follow the instructions on Wei Liu's SSD GitHub page to install the necessary packages, prepare the data, and train and evaluate Caffe models. That page also links to models trained on VOC0712, MSCOCO, and ILSVRC2015.

Make sure $CAFFE_ROOT is set to your Caffe directory and that $PYTHONPATH includes $CAFFE_ROOT/python. We also had to add /opt/conda/bin/python to $PYTHONPATH. $CAFFE_ROOT/python must appear first in $PYTHONPATH; otherwise running ./data/VOC0712/create_data.sh will not work. If you are running in a Docker container, you may need to run apt-get install -y python-numpy and set export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64 to get rid of CUDA errors when running SSD. Running make pycaffe is not required, but it is recommended, particularly if you have trouble creating the LMDB files with ./data/VOC0712/create_data.sh.
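
The ordering requirement is easy to get wrong, so a quick sanity check can help before running create_data.sh. The sketch below only inspects environment variables; the function name check_caffe_pythonpath is hypothetical and not part of Caffe or SSD.

```python
import os

def check_caffe_pythonpath(env=None):
    """Return a list of problems with CAFFE_ROOT / PYTHONPATH (empty if none).

    env defaults to os.environ; pass a dict to check a hypothetical setup.
    """
    env = os.environ if env is None else env
    caffe_root = env.get("CAFFE_ROOT")
    if not caffe_root:
        return ["CAFFE_ROOT is not set"]

    expected = os.path.normpath(os.path.join(caffe_root, "python"))
    entries = [
        os.path.normpath(p)
        for p in env.get("PYTHONPATH", "").split(os.pathsep)
        if p
    ]

    problems = []
    if expected not in entries:
        problems.append("PYTHONPATH does not include " + expected)
    elif entries[0] != expected:
        problems.append(expected + " is not the first PYTHONPATH entry")
    return problems
```

Calling check_caffe_pythonpath() with no arguments checks the current shell environment; an empty list means both conditions described above hold.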

Comparing VOC_SSD_300 vs COCO_SSD_500

The architecture diagram in the paper shows the COCO model rather than the VOC model. The two networks differ substantially, particularly past the VGG-16 layers. Though the COCO model gains a conv9_2 layer, average pooling still remains as the last layer of the network. The thinking is that this adds another 'multi-scale' feature map for detection.

![VOC0712 SSD 300x300](https://github.com/Lab41/attalos/blob/master/analysis/ssd/images/SSD_300_deploy_prototext.png) VOC0712 SSD 300x300 deploy

![MS COCO SSD 500x500](https://github.com/Lab41/attalos/blob/master/analysis/ssd/images/ssd_coco_500_network.png) COCO SSD 500x500 deploy

The Training Architecture

As discussed in the paper, the training objective is derived from the MultiBox objective, which has been extended to handle multiple object categories. The loss function is the weighted sum of the confidence loss (conf) and the localisation loss (loc); many of the layers feeding these losses come after the VGG-16/pool5 part of the network. The weighted sum is evaluated at the end of the network. The conv6 to conv9 layers are averaged just once, with the result fed into a (global) averaging function, also at the end of the network, whereas GoogLeNet uses multiple averaging functions (from its inception layers).
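
As a rough illustration of that weighted sum, the NumPy sketch below combines a softmax confidence loss with a Smooth L1 localisation loss and normalises by the number of matched boxes. It is a simplification under stated assumptions: the function names are made up for this page, label 0 is assumed to be background, and hard negative mining (which the real training code uses to select negatives) is omitted.

```python
import numpy as np

def smooth_l1(diff):
    """Smooth L1: quadratic near zero, linear elsewhere."""
    abs_diff = np.abs(diff)
    return np.where(abs_diff < 1.0, 0.5 * diff ** 2, abs_diff - 0.5)

def softmax_cross_entropy(logits, labels):
    """Per-box cross-entropy of softmax(logits) against integer labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]

def multibox_loss(conf_logits, labels, loc_pred, loc_target, alpha=1.0):
    """Weighted sum of conf and loc losses, divided by the matched-box count.

    conf_logits: (num_boxes, num_classes) raw class scores
    labels:      (num_boxes,) integer class per box, 0 = background (assumed)
    loc_pred, loc_target: (num_boxes, 4) box offsets
    """
    positive = labels > 0
    num_matched = max(int(positive.sum()), 1)  # avoid dividing by zero
    conf_loss = softmax_cross_entropy(conf_logits, labels).sum()
    # Only matched (positive) boxes contribute to the localisation loss
    loc_loss = smooth_l1(loc_pred[positive] - loc_target[positive]).sum()
    return (conf_loss + alpha * loc_loss) / num_matched
```

The weight alpha trades off the two terms; setting it to 0 leaves only the confidence loss.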

Using a Trained SSD Model

After following the steps through "Preparation," you can run your own test images through SSD using the Python notebook found in $CAFFE_ROOT/examples/ssd_detect.ipynb.

Untagged image before running SSD

Tagged image after running SSD with 0.6 confidence
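
The confidence cut-off is applied to the network's detection output, which holds one row per detection in the form [image_id, label, confidence, xmin, ymin, xmax, ymax] with coordinates normalised to [0, 1]. The sketch below shows one way to apply a 0.6 threshold and convert boxes to pixels; filter_detections is a hypothetical helper written for this page, and it takes a plain NumPy array so it can be tried without Caffe installed.

```python
import numpy as np

def filter_detections(detections, image_width, image_height, threshold=0.6):
    """Keep detections at or above `threshold` and scale boxes to pixels.

    detections: array reshapeable to (N, 7), each row being
                [image_id, label, confidence, xmin, ymin, xmax, ymax]
                with normalised coordinates.
    """
    dets = np.asarray(detections, dtype=float).reshape(-1, 7)
    kept = dets[dets[:, 2] >= threshold]
    results = []
    for _, label, conf, xmin, ymin, xmax, ymax in kept:
        results.append({
            "label": int(label),
            "confidence": float(conf),
            "box": (
                int(round(xmin * image_width)),
                int(round(ymin * image_height)),
                int(round(xmax * image_width)),
                int(round(ymax * image_height)),
            ),
        })
    return results
```

Lowering the threshold shows more, less certain tags; raising it leaves only the strongest detections.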

Tracking down layer responsible for object detections

SSD attempts to find objects of various sizes and scales using multiple layers, each detecting different objects. After feeding an image through the network, it is not immediately clear which layer is responsible for a high-confidence detection. To solve this problem, this Python notebook feeds an image forward through the model, then traces back to find the specific layer and features responsible for any high-confidence detection. This "high confidence" threshold is tunable; note that the network itself filters down to the top 200 detections after performing non-maximum suppression.
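
One ingredient of such a trace can be sketched briefly. SSD concatenates the default boxes from each detection layer in a fixed order, so if you know the index of the default box behind a detection, a cumulative count tells you which layer it came from. This is not the notebook's code: it assumes you have already recovered the box index (the detection output does not report it directly), and the layer names and per-layer counts must be supplied by the caller from their own model definition.

```python
from bisect import bisect_right
from itertools import accumulate

def layer_for_prior(prior_index, layer_names, priors_per_layer):
    """Map an index into the concatenated default boxes to its source layer.

    layer_names:      layer names in concatenation order
    priors_per_layer: number of default boxes each layer contributes
    """
    # boundaries[i] is the first index that no longer belongs to layer i
    boundaries = list(accumulate(priors_per_layer))
    if not 0 <= prior_index < boundaries[-1]:
        raise IndexError("prior_index is outside the concatenated box range")
    return layer_names[bisect_right(boundaries, prior_index)]
```

For example, with three layers contributing 4, 2 and 1 boxes, indices 0 to 3 map to the first layer, 4 and 5 to the second, and 6 to the third.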

SSD Layer/Label/Shape Statistics for VOC0712

We can also look at which layers produce high-confidence predictions for various inputs. This Python notebook runs each image through the network and produces a couple of heat maps, for Layer vs Label and Layer vs Object Size. Not surprisingly, earlier layers produce predictions for smaller objects.

SSD Layer vs Area

It also appears as though some layers respond more strongly to different object types. However, the relation between object type and size in each image has not been explored.

SSD Layer vs Label
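
The data behind a Layer vs Label heat map is just a count matrix. The sketch below shows one way to accumulate it from (layer, label) pairs; it is an illustration rather than the notebook's code, and the function name and input format are assumptions.

```python
import numpy as np

def layer_label_counts(records, layer_names, label_names):
    """Count high-confidence detections per (layer, label) pair.

    records: iterable of (layer_name, label_name) tuples, one per detection
    Returns an integer matrix with one row per layer and one column per label.
    """
    layer_idx = {name: i for i, name in enumerate(layer_names)}
    label_idx = {name: i for i, name in enumerate(label_names)}
    counts = np.zeros((len(layer_names), len(label_names)), dtype=int)
    for layer, label in records:
        counts[layer_idx[layer], label_idx[label]] += 1
    return counts
```

The resulting matrix can be passed to any plotting routine that draws a 2D array as an image, for example matplotlib's imshow.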

Parsing an SSD LMDB File

This final Python notebook parses an SSD training or testing LMDB file to pull out images by index, along with the associated tagged objects and their bounding boxes.

Testing on SSD500 with the MSCOCO dataset

Keep in mind that the tests we've done so far have been on SSD300 for VOC0712. The SSD paper references the architecture for MSCOCO/SSD500.

Drawing a Caffe Network

I found it very helpful to have a graphical representation of SSD's network. Luckily, I found Christopher Bourez's blog which includes a nice tutorial on Caffe. To draw the network, from the command line enter

python $CAFFE_ROOT/python/draw_net.py $CAFFE_ROOT/models/VGGNet/VOC0712/SSD_300x300/test.prototxt my_net.png

The SSD models are too large to display here, but another model from Christopher's blog is shown below.

VGG Model

In order to use draw_net.py, I first had to install both pydot and graphviz:

conda install pydot
conda install graphviz