DenseCap Plan

### 1. Run DenseCap*

*Can replace DenseCap with another approach and follow similar steps.

DenseCap requires Torch and several other Lua dependencies, all of which are included in the Dockerfile.densecap build. As a user, you simply need to run the l41-densecap image with the appropriate volumes mounted and ports exposed. The Makefile included in the repo provides a good template to get started. The DenseCap Dockerfile builds on an NVIDIA base container, so deploy on a GPU machine for the best performance.
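For concreteness, a minimal sketch of the container launch is shown below as a Python wrapper around the docker invocation. The image tag comes from the build above, but the mounted path, exposed port, and use of nvidia-docker are assumptions; the repo Makefile is the authoritative reference for the real targets and arguments.

```python
# Hypothetical launch of the l41-densecap container; the host path, port,
# and nvidia-docker wrapper are assumptions -- check the repo Makefile for
# the actual invocation.
import subprocess

cmd = [
    "nvidia-docker", "run", "-it",
    "-v", "/data/visualgenome:/data",   # mount the data volume (hypothetical path)
    "-p", "8888:8888",                  # expose a port for notebooks/visualization
    "l41-densecap",                     # image built from Dockerfile.densecap
    "/bin/bash",
]
subprocess.run(cmd, check=True)
```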

Once attached to the container, clone the DenseCap repo. The DenseCap README covers how to download a pretrained model, process images, and view the output. Also see the Run DenseCap wiki page.
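As a rough sketch of those steps, the commands below fetch the pretrained model and caption a test image. Script path, flag names, and the sample image are taken from the DenseCap README as of this writing and should be verified against the current repo.

```python
# Inside the container: fetch the pretrained model, then run it on a test image.
# Script and flag names follow the DenseCap README; verify before use.
import subprocess

subprocess.run(["sh", "scripts/download_pretrained_model.sh"],
               cwd="densecap", check=True)
subprocess.run(["th", "run_model.lua", "-input_image", "imgs/elephant.jpg"],
               cwd="densecap", check=True)
```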

### 2. Overview of DenseCap components

DenseCap consists of a CNN for image feature extraction, a localization layer, a recognition network, and finally a language model for captioning; a shape-level sketch of the data flow through these components appears after the list below.

  1. CNN -- The DenseCap model uses VGG-16 for image feature extraction; specifically, it takes the output of the last convolutional layer, just before the final max-pooling layer. The pretrained DenseCap model is initialized with VGG-16 weights from the Caffe Model Zoo.
  2. The localization layer -- This layer consists of several operations for proposing regions, including:
    • a convolutional layer
    • region proposal operations
    • grid generation and sampling
    • bilinear sampling
    The core of DenseCap's success lies in this differentiable localization layer.
  3. The recognition network -- This network consists of two fully connected layers with ReLU activations and produces the region codes.
  4. The language model -- The language model tags each object with a caption or label, depending on the training-data target. DenseCap uses short captions, whereas Lab41 is currently targeting object tags; Lab41 has made no architectural changes to the model.
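The list above corresponds to a fairly linear data flow. The NumPy sketch below uses random tensors as stand-ins for the real layers just to illustrate the shapes passed between components; the dimensions are toy values (the paper uses 512-channel conv features, a 7x7 sampling grid, and 4096-d region codes), and none of the names correspond to DenseCap's actual code.

```python
import numpy as np

# Toy dimensions; the paper uses 512 conv channels, a 7x7 sampling grid,
# and 4096-d region codes.
C, G, D, B = 64, 7, 128, 32

# 1. CNN: VGG-16 conv features at roughly 1/16 of the input resolution
feat = np.random.randn(C, 45, 34)

# 2. Localization layer: proposes B boxes and bilinearly samples each
#    region of the feature map onto a fixed G x G grid
boxes = np.random.rand(B, 4)                # one (x, y, w, h) per region
region_feats = np.random.randn(B, C, G, G)

# 3. Recognition network: two fully connected layers with ReLU -> region codes
relu = lambda x: np.maximum(0.0, x)
W1 = np.random.randn(C * G * G, D)
W2 = np.random.randn(D, D)
codes = relu(relu(region_feats.reshape(B, -1) @ W1) @ W2)

# 4. Language model: DenseCap decodes a caption per region code with an LSTM;
#    Lab41's tag-oriented variant would classify the codes instead (section 5).
print(codes.shape)  # (B, D): one region code per proposed box
```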

### 3. Object Feature Extraction

To better align with other models in the challenge, the Attalos team needs the localized features produced by the model. The first attempt to achieve this was to train a new DenseCap model on Visual Genome (VG) objects, since objects provide the more targeted localization the team is after. With the new model, the team needs to extract the DenseCap features for all VG objects (in progress as of 7 July). The ideal solution (see section 5 below) would be to reconfigure DenseCap to produce object bounding boxes and classifications using a simpler SoftMax classifier.
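Once the features are extracted, downstream experiments only need the per-region boxes and codes. A minimal sketch of loading such a dump is below; the file name and dataset keys are assumptions about the export format, not the actual output of the extraction job.

```python
# Hedged example of loading extracted DenseCap region features; the file
# name and dataset keys ("boxes", "feats") are hypothetical.
import h5py
import numpy as np

with h5py.File("vg_densecap_features.h5", "r") as f:
    boxes = np.asarray(f["boxes"])  # (N, 4) region bounding boxes
    feats = np.asarray(f["feats"])  # (N, D) region codes from the recognition net

print(boxes.shape, feats.shape)
```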

### 4. Evaluate DenseCap retrieval capabilities

  • Configure DenseCap to compute p( object | image ) for each region proposal
  • Extract NUS-WIDE features and design experiment to retrieve images given a query
    • DenseCap image retrieval experiment: "We use 1000 random images from the VG test set for this experiment. We generate 100 test queries by repeatedly sampling four random captions from some image and then expect the model to retrieve the source image for each query"
    • L41 image retrieval experiment:
      1. Extract DenseCap features for a new set of tagged images
      2. Construct a small network to predict tags and evaluate using prediction-ground-truth overlap and/or distance to the target word-vector cluster (a hedged sketch of such a predictor and overlap metric follows this list).
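The sketch below illustrates that second experiment: a placeholder linear-plus-sigmoid "network" over region features and a per-image prediction/ground-truth overlap (Jaccard) score. All arrays are random stand-ins, and the tag count of 81 assumes the NUS-WIDE concept list.

```python
# Hedged sketch of the L41 tagging evaluation; random arrays stand in for
# real DenseCap features and tag annotations.
import numpy as np

rng = np.random.default_rng(0)
N, D, T = 100, 128, 81              # images, feature dim, tags (NUS-WIDE has 81 concepts)
feats = rng.standard_normal((N, D))
truth = rng.random((N, T)) < 0.05   # binary ground-truth tag matrix

# "Small network": a random linear layer + sigmoid as a placeholder predictor
W = rng.standard_normal((D, T))
pred = 1.0 / (1.0 + np.exp(-feats @ W)) > 0.5

# Prediction / ground-truth overlap: per-image Jaccard index, then averaged
inter = (pred & truth).sum(axis=1)
union = (pred | truth).sum(axis=1)
overlap = inter / np.maximum(union, 1)
print("mean overlap:", overlap.mean())
```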

### 5. Extend/build on DenseCap or try a more novel approach (?)

  • Remove the LSTM layer and replace it with a simple SoftMax classifier. This will allow the Attalos team to incorporate other text embedding methods. There are a lot of hooks in DenseCap for captioning, so this is not simply a few-line change: there is preprocessing to incorporate start and end tokens, and it is likely that a new Classifier class is required to return the necessary outputs to make the model run smoothly. A shape-level sketch of the proposed classifier head follows.
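To make the proposal concrete, here is an illustration (in NumPy, not DenseCap's Torch/Lua) of the replacement head: each region code goes through a single linear layer and a softmax over object classes instead of through the LSTM. All dimensions and weights are placeholders.

```python
# Sketch of the proposed classifier head over region codes (NumPy placeholder,
# not DenseCap's actual Torch code); dimensions are examples.
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

B, D, K = 32, 4096, 1000          # regions, region-code dim, object classes
codes = np.random.randn(B, D)     # output of the recognition network
W = 0.01 * np.random.randn(D, K)
b = np.zeros(K)

probs = softmax(codes @ W + b)    # (B, K): p(class | region) for each proposal
labels = probs.argmax(axis=1)     # predicted object tag per region
print(labels.shape)
```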