DenseCap Feature Embedding

Goal

Pictures are great. Words are baller. Vector spaces are clutch. Let's have them all live in the same space. The catch with pictures is that many different objects appear in the same image. A picture of a street would have people and cars and trees... so do we have that image 'live' close to the vector for the word 'person', or maybe close to 'car', but what about 'tree'? The current thinking on this problem is to localize objects within the image and then embed those localized regions. So for our picture of the street, we'd draw a box around the people and embed just that box close to the representation of the word 'person.'

The attalos team has been exploring two ways to draw those boxes around objects of interest in images: SSD and DenseCap. This report will be about how DenseCap was used to try to accomplish this goal.

How to get DenseCap features

DenseCap uses the network shown in the diagram below to draw bounding boxes around objects of interest, and it also generates a caption for each of those boxes.

[Figure: the DenseCap network architecture]

The idea we wanted to explore was to take the pre-trained DenseCap network, run images through the 'region code' portion of the network, and then embed those codes into word2vec space. Essentially, we were chopping off the language model at the end of the network. Sounds good in theory, but let's take a look at how it worked (spoiler: it failed, for reasons we'll explore later).

DenseCap Code exploration

The way that DenseCap works (with the pre-trained model) is that you give it an image, it draws boxes around things of interest, and it tries to caption those boxes. Instead of having it suggest the boxes for us, we can tell it which boxes to create features for. Namely, if we give the network the ground-truth boxes from the Visual Genome dataset, we can get the region codes for just those boxes, know where each region code should be embedded in word2vec space, and then train a model to fit that regression. Perfect.

We note that the code exposes a method called roi_pooling, which takes in the CNN features produced by the convnet part of the model along with a list of bounding boxes. This method returns a feature for each region (or box). As described in the DenseCap paper, the output of this pooling is a tensor of size 512 x 7 x 7 per region. That tensor can then be transformed via the recog_base method into a feature vector of length 4,096.
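A minimal sketch of that feature-extraction path is below. The checkpoint path, the handle for roi_pooling, and the stand-in tensors are assumptions on my part; only the roi_pooling to recog_base flow and the 512 x 7 x 7 to 4,096 shapes come from the code and paper.

```lua
-- Sketch: turn a list of boxes into 4,096-dim region codes.
-- The module paths below are assumptions about where things live in the
-- checkpoint; the stand-in tensors replace a real image's conv features.
require 'torch'
require 'nn'

local checkpoint = torch.load('data/models/densecap/densecap-pretrained-vgg16.t7')
local model = checkpoint.model
model:evaluate()

local roi_pooling = model.nets.localization_layer.nets.roi_pooling  -- assumed path
local recog_base  = model.nets.recog_base

-- Stand-ins: conv-trunk features for one image and two (xc, yc, w, h) boxes.
local cnn_feats    = torch.randn(1, 512, 45, 60)
local boxes_xcycwh = torch.Tensor{{300, 200, 120, 80}, {500, 350, 60, 200}}

-- Pool each box to 512 x 7 x 7, then map it to a 4,096-dim region code.
-- (The pooling layer may also need the original image size set first.)
local pooled       = roi_pooling:forward({cnn_feats, boxes_xcycwh})  -- B x 512 x 7 x 7
local region_codes = recog_base:forward(pooled)                      -- B x 4096
```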

Regression :)

I know that it's a lot to ask to embed a vector of length 4,096 into a vector space that only has 300 dimensions (as in the pretrained word2vec model), but it's worth a shot, right?

We ran a simple regression, trained with an MSECriterion. Our training data were the region features obtained by running Visual Genome data through the (partial) DenseCap model. The target for each region feature was the word2vec representation of that region's label in Visual Genome. (We ignored labels that were not in the word2vec vocabulary.)
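For concreteness, here is a minimal sketch of that kind of regression in Torch. The random tensors are stand-ins for real region codes and word2vec targets, and the single linear layer, learning rate, and epoch count are illustrative choices, not the exact settings we used.

```lua
-- Sketch: regress 4,096-dim region codes onto 300-dim word2vec vectors
-- with a mean-squared-error loss. Random tensors stand in for real data.
require 'torch'
require 'nn'

local n, featDim, embDim = 256, 4096, 300
local regionCodes = torch.randn(n, featDim)   -- stand-in for recog_base outputs
local wordVectors = torch.randn(n, embDim)    -- stand-in for word2vec targets

local net = nn.Linear(featDim, embDim)        -- the regression model
local criterion = nn.MSECriterion()
local learningRate = 1e-4

for epoch = 1, 100 do
  local preds = net:forward(regionCodes)
  local loss  = criterion:forward(preds, wordVectors)

  net:zeroGradParameters()
  local gradPreds = criterion:backward(preds, wordVectors)
  net:backward(regionCodes, gradPreds)
  net:updateParameters(learningRate)

  if epoch % 10 == 0 then
    print(string.format('epoch %d  MSE loss %.4f', epoch, loss))
  end
end
```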

To put it mildly, this regression didn't converge. We now go over some reasons this might be the case.

Regression :(

Maybe you didn't run the model long enough?

Totally possible. I wasn't very patient, so maybe just more time with the data would make everything better.

Maybe the region feature representations are not separable?

Well then we can't do anything.

Maybe the bounding boxes in the dataset are wrong?

This certainly might be the case. We can look at the labels from the ground truth set and compare those labels to the captions that DenseCap would predict for those regions. In an ideal world, they would be the same. In reality, they aren't.

We will look at the data in the file VG-object-regions.h5 and consider the second image in the dataset. The file VG-object-regions-dicts.json tells us that it corresponds to this image with these captions.

[Image: the second Visual Genome image with its ground-truth region captions]

If we use the command `boxes_xcycwh = boxes[{{img_to_first_box[2],img_to_last_box[2]}}]`, we get the ground-truth boxes for that image, and running those through the model gives us region features. Once we have those, we can get the suggested captions with `model.nets.language_model:decodeSequence(model.nets.language_model:sample(features))`. This gives us the captions:

  • "the surfboard is white"
  • "white and blue ocean waves"
  • "white and brown waves in ocean"
  • "two men standing on the beach"

These clearly have nothing to do with the image. I suspect that there is an issue with the metadata in the file VG-object-regions-dicts.json, as the bounding boxes for the image do not correspond to the bounding boxes here.
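For reference, here is roughly the inspection workflow end to end. The HDF5 dataset names are guesses based on the variable names above, and extract_features is a hypothetical helper standing in for the roi_pooling to recog_base path from the previous section.

```lua
-- Sketch: compare ground-truth boxes for image 2 with the captions the
-- pre-trained DenseCap language model produces for those boxes.
-- Dataset names and module handles are assumptions based on the text above.
require 'torch'
require 'hdf5'   -- torch-hdf5

local f = hdf5.open('data/VG-object-regions.h5', 'r')
local boxes            = f:read('/boxes'):all()
local img_to_first_box = f:read('/img_to_first_box'):all()
local img_to_last_box  = f:read('/img_to_last_box'):all()
f:close()

-- Ground-truth boxes for the second image in the dataset.
local boxes_xcycwh = boxes[{{img_to_first_box[2], img_to_last_box[2]}}]

local checkpoint = torch.load('data/models/densecap/densecap-pretrained-vgg16.t7')
local model = checkpoint.model
model:evaluate()

-- extract_features is a hypothetical helper: run the image through the conv
-- trunk, then roi_pooling and recog_base on boxes_xcycwh (see earlier sketch).
local features = extract_features(model, 'image_2.jpg', boxes_xcycwh)

local lm = model.nets.language_model
local captions = lm:decodeSequence(lm:sample(features))
for i, caption in ipairs(captions) do print(i, caption) end
```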

Recommendations

Clearly there was some post-processing on the VG-object-regions-dicts.json file to take captions that were phrases and trim them down to single words. Whatever that processing was, it might have changed the way in which the bounding boxes were saved. Verifying that the box indices in VG-object-regions.h5 still line up with the trimmed captions in that JSON file would be the first thing to check.