Visual Genome Localizations in Inception
Goal
What happens when you take Visual Genome localizations and run them through the Inception model?
Set Up
We first look at the known labels in the Visual Genome dataset and the Inception-v3 labels, and we use this script to compute the intersection of the two sets. The intersection contains 5,419 labels.
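As a rough sketch of that intersection step (the file names, paths, and label-file formats below are assumptions for illustration; the linked script may preprocess the labels differently):

```python
import json

def visual_genome_labels(path="objects.json"):
    """Collect every object name used to tag a Visual Genome region."""
    with open(path) as f:
        images = json.load(f)
    labels = set()
    for image in images:
        for obj in image.get("objects", []):
            for name in obj.get("names", []):
                labels.add(name.strip().lower())
    return labels

def inception_labels(path="imagenet_labels.txt"):
    """Collect the class names Inception-v3 can emit.

    Assumes one class per line, with synonyms separated by commas.
    """
    labels = set()
    with open(path) as f:
        for line in f:
            for name in line.strip().split(","):
                labels.add(name.strip().lower())
    return labels

shared = visual_genome_labels() & inception_labels()
print("labels in both sets:", len(shared))
```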
This means that we can now take every region in the Visual Genome set tagged with one of these 5,419 labels and feed it through the Inception model. (More than 1,000,000 regions satisfy this condition.)
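A minimal sketch of that region-classification step is below. It uses the Keras InceptionV3 weights as a stand-in for the original Inception model; the bounding-box fields (x, y, w, h) mirror Visual Genome's region format, but the image path and region values shown are hypothetical.

```python
import numpy as np
from PIL import Image
from tensorflow.keras.applications.inception_v3 import (
    InceptionV3, preprocess_input, decode_predictions)

model = InceptionV3(weights="imagenet")

def classify_region(image_path, region):
    """Crop one Visual Genome region and return Inception's top prediction."""
    image = Image.open(image_path).convert("RGB")
    x, y, w, h = region["x"], region["y"], region["w"], region["h"]
    # Inception-v3 expects 299x299 inputs, so resize the cropped region.
    crop = image.crop((x, y, x + w, y + h)).resize((299, 299))
    batch = preprocess_input(np.expand_dims(np.array(crop, dtype=np.float32), 0))
    top = decode_predictions(model.predict(batch), top=1)[0][0]
    return top[1], float(top[2])  # (class name, probability)

# Hypothetical example: a region from one Visual Genome image.
label, prob = classify_region("vg_image_1.jpg",
                              {"x": 100, "y": 50, "w": 120, "h": 200})
print(label, prob)
```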
Results
Well, the results are not that promising. It's hard to give statistics on the efficacy of this system, so we'll just present some generalizations.
Consider this image from the Visual Genome dataset.
We see that sometimes the Inception model works well: the Visual Genome project tagged that image as jacket, and the Inception model classifies it as sweatshirt, which is pretty good.
We also see that sometimes the Inception model performs poorly, for instance on this small image:
And at other times, Inception fails because the people doing the tagging fail: that region is tagged as street, which is clearly not the most salient feature in the image, though there is a parking meter in the scene.