# Image Featurization Using Mistrained Caffe Models (Sotera/watchman wiki)
We use the BVLC Reference model (https://github.com/BVLC/caffe/tree/master/models/bvlc_reference_caffenet) to generate features for associating images with clusters. I have investigated several ways to improve our results.
1. I investigated using the "PlacesNet" network rather than the BVLC network. This gave much worse results. My thinking was that the images of our events have a deeper field of focus, more like the images fed into PlacesNet than those fed into the BVLC network. I also thought the "deeper" features needed for place recognition might be more like the distinction between a concert and a riot. However, the conv3 layer of PlacesNet was simply not highly activated by our images, which effectively leads to random results at the "prob" layer of that network.
2. I investigated the conv3, fc7, fc8, and prob layers of the BVLC network for their potential as generic feature detectors.
a. conv3: This layer essentially represents the third-order compound features of a given image. However, there are over 20 layers between this layer and the layer where training is enforced, so activations here can be non-discriminatory. I also struggled with how to aggregate the per-picture values of this layer into a cluster vector.
b. fc7 and fc8: These are the two layers just before the final prob layer. We initially used the fc7 layer. Applying the same logic to both layers gave almost identical results.
c. prob: This layer is just the softmax function applied to the fc8 output. The softmax function drives the network to a decision by exponentially amplifying the differences in the fc8 output.
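As a sketch of what the prob layer computes: softmax exponentiates the fc8 scores and normalizes them to sum to 1, so a modest gap in fc8 becomes a much larger relative gap in prob. The toy scores below are illustrative, not taken from our data:

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax: shift by the max before exponentiating."""
    shifted = scores - np.max(scores)
    exps = np.exp(shifted)
    return exps / exps.sum()

# Toy fc8-style scores: the top entry leads the runner-up by 2x...
fc8 = np.array([2.0, 1.0, 0.1, -1.0])
prob = softmax(fc8)
# ...but after softmax the top entry leads by a factor of e^(2.0 - 1.0),
# which is how softmax "drives the network to a decision".
```

In the real network this is a 1000-element vector, one bucket per ImageNet class, but the amplification behaves the same way.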
For the first day-protest image, we get an fc8 layer that looks like:
The softmax function transforms fc8 to the prob layer:
This makes the determination that this image from our “day protest image collection” is likely to be an image of a maypole, due to the 0.47 in bucket #645 (of 1000) (note the light posts surrounded by a crowd):
I intuitively hate the softmax function. However, I get better results by building a non-algebraic “max” vector for each cluster (the element-wise maximum, per bucket, over that cluster's per-picture prob vectors) and scoring each picture by its dot product with that vector than I do with the more algebraic approach of computing each picture's distance from the cluster average.
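A minimal sketch of the max-vector scheme described above. The data is synthetic: I fake prob vectors in which each cluster concentrates its mass on roughly 8 "hot" buckets (mimicking the behavior we observed), and the function and variable names are mine, not from the codebase:

```python
import numpy as np

rng = np.random.default_rng(0)
N_BUCKETS, N_IMAGES = 1000, 50
NAMES = ["day-protest", "night-protest", "concert", "riot"]

def fake_prob(hot_start):
    """Stand-in prob vector: one strong 'hot' bucket drawn from a
    cluster-specific band of 8 buckets, plus low background noise."""
    v = rng.random(N_BUCKETS) * 0.01
    v[hot_start + rng.integers(0, 8)] = 0.5 + rng.random() * 0.4
    return v / v.sum()

# Per-cluster stacks of per-image prob vectors, shape (N_IMAGES, N_BUCKETS).
clusters = {name: np.stack([fake_prob(i * 8) for _ in range(N_IMAGES)])
            for i, name in enumerate(NAMES)}

# Element-wise max over each cluster's prob vectors -> one "max" vector per cluster.
max_vectors = {name: probs.max(axis=0) for name, probs in clusters.items()}

def assign(prob_vector):
    """Assign an image to the cluster whose max vector has the
    largest dot product with the image's prob vector."""
    return max(max_vectors, key=lambda name: prob_vector @ max_vectors[name])
```

On this synthetic data every image lands in its own cluster; the real images are far noisier, which is where the 74% figure below comes from.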
The next image shows the four "cluster vectors" based on the fc8 vectors, computed by averaging the fc8 values of each image in the cluster. We can see that discrimination is tenuous.
We can contrast the above image with the prob vectors for each cluster aggregated using the max function on each bucket:
| Layer | fc8 | prob |
|---|---|---|
| Aggregation function | `average()` | `max()` |
| Successful assignments | 96 of 197 | 146 of 197 |
| Percent | 48% | 74% |
Note that most buckets in the prob image associate strongly with just one cluster (color).
This "prob-max-dot" method approaches being the same thing as assigning specific output features to each bucket. It looks like we found 8 or so buckets per cluster in this way for our four clusters. Since the output vector size is 1000, there is clearly a limit to how many clusters we can associate in this way. It is clearly less than 1000/8 = 125 and because we have to assume bucket choices are random, my guess is that the limit is closer to 16 or so clusters that we can discriminated in this way and once we go beyond that, the odd of vector collisions go up and the accuracy will go down. The accuracy is also likely proportional to the number of clusters, so possibly by the time we get to 8 clusters the accuracy may drop below 50%.
The code currently uses average-matching against the “fc7” layer, with results similar to the average-matching against the “fc8” layer shown above.