Visual Genome Localizations in Inception
Goal
What happens when you take Visual Genome localizations and run them through the Inception model?
Set Up
We first look at the known labels in the Visual Genome dataset and the Inception-v3 labels, and we use this script to compute the intersection of the two sets. The intersection contains 5,419 labels.
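As a rough sketch of that intersection step (the file names, paths, and label-file formats below are assumptions for illustration; the linked script may preprocess the labels differently):

```python
import json

def visual_genome_labels(path="objects.json"):
    """Collect every object name used to tag a Visual Genome region."""
    with open(path) as f:
        images = json.load(f)
    labels = set()
    for image in images:
        for obj in image.get("objects", []):
            for name in obj.get("names", []):
                labels.add(name.strip().lower())
    return labels

def inception_labels(path="imagenet_labels.txt"):
    """Collect the class names Inception-v3 can emit.

    Assumes one class per line, with synonyms separated by commas.
    """
    labels = set()
    with open(path) as f:
        for line in f:
            for name in line.strip().split(","):
                labels.add(name.strip().lower())
    return labels

shared = visual_genome_labels() & inception_labels()
print("labels in both sets:", len(shared))
```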
This means that we can now take every region in the Visual Genome set tagged with one of these 5,419 labels and feed it through the Inception model. (More than 1,000,000 regions satisfy this condition.)
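A minimal sketch of that region-classification step is below. It uses the Keras InceptionV3 weights as a stand-in for the original Inception model; the bounding-box fields (x, y, w, h) mirror Visual Genome's region format, but the image path and region values shown are hypothetical.

```python
import numpy as np
from PIL import Image
from tensorflow.keras.applications.inception_v3 import (
    InceptionV3, preprocess_input, decode_predictions)

model = InceptionV3(weights="imagenet")

def classify_region(image_path, region):
    """Crop one Visual Genome region and return Inception's top prediction."""
    image = Image.open(image_path).convert("RGB")
    x, y, w, h = region["x"], region["y"], region["w"], region["h"]
    # Inception-v3 expects 299x299 inputs, so resize the cropped region.
    crop = image.crop((x, y, x + w, y + h)).resize((299, 299))
    batch = preprocess_input(np.expand_dims(np.array(crop, dtype=np.float32), 0))
    top = decode_predictions(model.predict(batch), top=1)[0][0]
    return top[1], float(top[2])  # (class name, probability)

# Hypothetical example: a region from one Visual Genome image.
label, prob = classify_region("vg_image_1.jpg",
                              {"x": 100, "y": 50, "w": 120, "h": 200})
print(label, prob)
```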
Results
Well, the results are not that promising. It's hard to give statistics on the efficacy of this system, so we'll just present some generalizations.
Consider this image from the Visual Genome dataset.
We see that sometimes the Inception model works well: the Visual Genome project tagged that image as jacket, and the Inception model classifies it as sweatshirt, which is pretty good.
We also see that sometimes the Inception model performs poorly, for instance on this small image:
And at other times, Inception fails because the people doing the tagging fail: that region is tagged as street, which is clearly not the most salient feature in the image, though there is a parking meter in the scene.