Final Report - psastras/vocabtree
#Introduction
Given the large amount of image data available and the rich information associated with images, we often want to find images similar to a given query image. These datasets are growing rapidly, especially when considering video, and we want to search collections on the order of millions of images. Traditional image search techniques do not scale well with this ever increasing amount of data.
Nister and Stewenius [1] introduced the vocabulary tree, a scalable image search technique built on a hierarchical data structure, and found that it gives good accuracy and performance.
For our project we implement both an inverted index and a vocabulary tree. Both have parallel training (single-node for the tree, multi-node for the index), and both search with multiple threads and across multiple nodes. We compare our implementations of the vocabulary tree and the inverted index on a single node with varying numbers of threads. We also add a cache to the vocabulary tree to speed up subsequent image searches by reducing disk reads.
#Image Search Background
The first step in an image search is creating descriptors for images. These describe regions of the image that will ideally look the same under different viewing angles, lighting, and so on. Given a database of images with their descriptors and a query image with its descriptors, we want to return the database images whose descriptors match best.
The key idea in image search is turning these descriptors into visual words. An image database will have a large number of descriptors. We group similar descriptors into a single visual word; doing this for all descriptors builds a vocabulary. Image search then becomes document search, where each image is a document with keywords. To search, we match each of the query image's descriptors to a word, and then find images whose words are similar to those of the query.
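To make the quantization step concrete, here is a minimal C++ sketch of mapping a descriptor to its nearest visual word by brute-force distance to the cluster centers. The types and names are our own illustration; a real implementation would typically use an approximate nearest-neighbor search (e.g. [3]) for speed:

```cpp
#include <cstddef>
#include <limits>
#include <vector>

using Descriptor = std::vector<float>;

// Squared Euclidean distance between two descriptors of equal length.
float squaredDistance(const Descriptor &a, const Descriptor &b) {
    float d = 0.f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        float diff = a[i] - b[i];
        d += diff * diff;
    }
    return d;
}

// Assign a descriptor to the closest visual word (brute-force over centers).
std::size_t quantize(const Descriptor &desc,
                     const std::vector<Descriptor> &centers) {
    std::size_t best = 0;
    float bestDist = std::numeric_limits<float>::max();
    for (std::size_t w = 0; w < centers.size(); ++w) {
        float d = squaredDistance(desc, centers[w]);
        if (d < bestDist) { bestDist = d; best = w; }
    }
    return best;
}
```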
It is also important to note that the size of the vocabulary strongly affects both accuracy and performance. A larger vocabulary increases accuracy because each visual word is more distinctive and overlaps with fewer images, so a match on a word carries more weight. It also speeds up search because there are fewer overlaps with irrelevant images.
#Inverted Index and its Shortcomings
The inverted index is the straightforward implementation of a visual vocabulary. For a desired vocabulary size k, the inverted index performs a single k-means computation. This creates k visual words and assigns each descriptor to the closest word. Finally, each word stores a list of the images that contain it (an inverted list).
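A minimal sketch of this construction, reusing quantize() and Descriptor from the sketch above. The kMeans() call is a hypothetical stand-in for the actual clustering step (e.g. a wrapper around OpenCV's cv::kmeans); the point is the word -> image-list structure:

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical clustering wrapper: returns k cluster centers.
std::vector<Descriptor> kMeans(const std::vector<Descriptor> &descs,
                               std::size_t k);

struct InvertedIndex {
    std::vector<Descriptor> words;                  // k cluster centers
    std::unordered_map<std::size_t,
                       std::vector<int>> postings;  // word id -> image ids
};

InvertedIndex buildIndex(const std::vector<std::vector<Descriptor>> &images,
                         const std::vector<Descriptor> &allDescriptors,
                         std::size_t k) {
    InvertedIndex index;
    index.words = kMeans(allDescriptors, k);
    for (int img = 0; img < (int)images.size(); ++img)
        for (const Descriptor &d : images[img]) {
            std::size_t w = quantize(d, index.words);
            auto &list = index.postings[w];
            // Images are processed in order, so checking back() dedups
            // repeated words within the same image.
            if (list.empty() || list.back() != img) list.push_back(img);
        }
    return index;
}
```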
To search an inverted index, you first assign each of the query image's descriptors to a word, then compile a list of images that share those words. For each of these candidates, the inverted index computes the dot product of its Term Frequency-Inverse Document Frequency (TF-IDF) weight vector with the query's to rank the images.
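A sketch of the TF-IDF ranking step under the same assumptions, with each image represented by a sparse word -> weight map (our own simplification of the data layout):

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

using TfIdfVector = std::map<std::size_t, float>;  // word id -> tf * idf

// Dot product of two sparse TF-IDF vectors.
float dot(const TfIdfVector &a, const TfIdfVector &b) {
    float s = 0.f;
    for (const auto &kv : a) {
        auto it = b.find(kv.first);
        if (it != b.end()) s += kv.second * it->second;
    }
    return s;
}

// Rank candidate images (gathered from the inverted lists) by score.
std::vector<std::pair<int, float>>
rank(const TfIdfVector &query,
     const std::vector<std::pair<int, TfIdfVector>> &candidates) {
    std::vector<std::pair<int, float>> scores;
    for (const auto &c : candidates)
        scores.push_back({c.first, dot(query, c.second)});
    std::sort(scores.begin(), scores.end(),
              [](const auto &x, const auto &y) { return x.second > y.second; });
    return scores;
}
```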
The problem with inverted indexes is that they do not scale well with the number of images in the database. For a fixed vocabulary size, adding more images makes the list for each word very long, and the search ends up scanning a huge number of images (similar to the typical hash table performance problems when there are not enough buckets). One solution is to increase the vocabulary size and thereby decrease the number of images per word, but this only works up to a point: the clustering step for constructing the inverted index becomes increasingly expensive, so construction itself stops scaling.
#Vocabulary Tree
Nister and Stewenius [1] present the vocabulary tree as an alternative to the inverted index. Instead of creating all the visual words in one step, the vocabulary tree builds the vocabulary recursively. The root node starts with all descriptors and performs a k-means clustering with k equal to the branch factor of the tree (we and the paper use 10). This creates k words, and each descriptor is assigned to one of them. Each of these words and its associated descriptors becomes its own node, and the computation continues recursively to a predefined depth (we and the paper use 6). This creates branch^depth words (for our parameters, 10^6 = 1 million), one word per leaf. Each leaf/word then acts like an inverted index entry and stores all images containing that word.
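A sketch of the recursive build, with kMeansSplit() as a hypothetical stand-in for the per-node clustering (again, e.g. a wrapper around cv::kmeans):

```cpp
#include <memory>
#include <vector>

// Hypothetical: cluster descriptors into `branch` groups, returning the
// groups and their centers.
void kMeansSplit(const std::vector<Descriptor> &descs, int branch,
                 std::vector<std::vector<Descriptor>> &groups,
                 std::vector<Descriptor> &centers);

struct TreeNode {
    Descriptor center;                                // this node's word
    std::vector<std::unique_ptr<TreeNode>> children;  // branch-factor wide
    std::vector<int> invertedList;                    // image ids (at leaves)
};

void buildNode(TreeNode &node, const std::vector<Descriptor> &descriptors,
               int branch, int depth) {
    if (depth == 0 || (int)descriptors.size() < branch) return;  // leaf
    std::vector<std::vector<Descriptor>> groups;
    std::vector<Descriptor> centers;
    kMeansSplit(descriptors, branch, groups, centers);
    for (int c = 0; c < branch; ++c) {
        auto child = std::make_unique<TreeNode>();
        child->center = centers[c];
        buildNode(*child, groups[c], branch, depth - 1);  // recurse per word
        node.children.push_back(std::move(child));
    }
}
```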
To search the tree, each descriptor of the query image travels down the tree, at each node performing a dot product against each child to determine the closest word. This results in branch factor * depth dot products per descriptor. The images associated with each reached word become part of the candidate image list, just as in the inverted index. To score an image, we count the number of times one of its descriptors passes through each node, giving a count vector whose length is the total number of nodes. Furthermore, each node is weighted by the number of build images that have passed through it (in [1] the weight of node i is w_i = ln(N / N_i), where N is the number of build images and N_i is the number passing through node i). For example, if every image from the build dataset goes through the first branch off the root node, that node's weight is ln(N/N) = 0, because an image landing in that node tells us nothing. Nodes that only a few images pass through get a much higher weight.
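A sketch of a single descriptor's traversal, reusing squaredDistance() and TreeNode from the earlier sketches; nodeId() is a hypothetical mapping from a node to its index in the count vector:

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical: map a node to its index in the per-query count vector.
std::size_t nodeId(const TreeNode &node);

// Descend from the root, picking the nearest child at each level and
// bumping that node's visit count.
void traverse(const TreeNode &root, const Descriptor &desc,
              std::vector<int> &visitCount) {
    const TreeNode *node = &root;
    while (!node->children.empty()) {
        const TreeNode *best = nullptr;
        float bestDist = std::numeric_limits<float>::max();
        for (const auto &child : node->children) {  // branch-factor comparisons
            float d = squaredDistance(desc, child->center);
            if (d < bestDist) { bestDist = d; best = child.get(); }
        }
        node = best;
        visitCount[nodeId(*node)]++;
    }
}
```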
#Implementation Details
###Parallel Inverted Index
For construction, the k-means clustering (the bulk of the computation) can be performed in parallel. We use the distributed k-means clustering implemented in [3] and [4] when clustering across multiple nodes, and the implementation provided by OpenCV when clustering on a single node. When searching for a specific query, an initial list of candidate images is built from the inverted index; these images are then scored using their TF-IDF weight vectors in parallel.
###Parallel Vocabulary Tree
During the vocabulary tree training stage, each sub-tree is independent, so its k-means clusterings can be computed in parallel and the sub-trees merged into a single tree afterwards.
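A sketch of this idea using std::async, building on buildNode() from the earlier sketch: after the root split, each subtree is handed to its own task, and joining the tasks is the "merge":

```cpp
#include <future>
#include <vector>

// Build each of the root's subtrees in its own asynchronous task.
// Assumes the root split has already produced root.children and the
// per-child descriptor groups.
void buildTreeParallel(TreeNode &root,
                       const std::vector<std::vector<Descriptor>> &rootGroups,
                       int branch, int depth) {
    std::vector<std::future<void>> tasks;
    for (int c = 0; c < branch; ++c) {
        TreeNode *child = root.children[c].get();
        tasks.push_back(std::async(std::launch::async, [=, &rootGroups] {
            buildNode(*child, rootGroups[c], branch, depth - 1);
        }));
    }
    for (auto &t : tasks) t.get();  // wait for all subtrees to finish
}
```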
In the search, each descriptor of the query image can traverse the tree in parallel because there is no dependence between descriptors. This is easily done over multiple nodes because the only communication needed is at the end, when each node synchronizes the visited count for each tree node. On a single node this is also easily parallelized with a simple atomic increment for the visited count. From there, the images from the top leaves are compiled and then checked for the best matches. On multiple nodes, each node needs to know which inverted lists (from the leaves) to draw from, which requires synchronizing a (relatively) small list of ints. Each node can then independently compile the same list of possible images and find its top images. Each node then sends its top images to the head node, which does the final compilation. Assuming the number of images is small (for us 10-16), this should be fast too. On a single node, looking through the candidate list can be done with a parallel for loop as well.
###Vocab Tree Limiting Images
An additional detail is that we do not look at every image with an overlapping word. The vocabulary tree ranks each leaf it will consider by the weighted score described above, the idea being that leaves with a higher weight will likely contain more relevant images. The tree uses this score to determine the order in which it visits the leaves and stops after it has checked a set number of images.
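A sketch of the limiting step under the same assumptions: visited leaves are sorted by weight and their inverted lists consumed until a fixed image budget is hit (names are our own):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Gather candidate image ids from the highest-weighted leaves first,
// stopping once maxImages candidates have been collected.
std::vector<int> gatherCandidates(
        std::vector<std::pair<float, const TreeNode*>> &visitedLeaves,
        std::size_t maxImages) {
    std::sort(visitedLeaves.begin(), visitedLeaves.end(),
              [](const auto &a, const auto &b) { return a.first > b.first; });
    std::vector<int> candidates;
    for (const auto &leaf : visitedLeaves) {
        for (int img : leaf.second->invertedList) {
            if (candidates.size() >= maxImages) return candidates;
            candidates.push_back(img);
        }
    }
    return candidates;
}
```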
#Improvements (Cache)
We want to improve search times for applications that perform multiple subsequent queries, such as video search. Part of the search process is reading descriptors and other image data from the hard disk (there is too much to fit in memory). We implement an LRU cache between the search code and the file system to speed up repeated searches. Ideally, when the first query reads some image data it will store it in the cache, so that if a second query wants the same data it can load it much faster and complete sooner.
##Types of Cache
The main problem with an LRU cache is that it requires a lock when accessing the data structure, since the recency ordering must be updated on every access. We test three different types of caches to determine the effect of locking on our cache performance.
We implement three different types of LRU caches. In the first type (single cache), all threads on a single node share a single cache, with a single mutex protecting the cache for all threads. In the second implementation (distributed cache), we test one cache per thread; since each thread has its own cache, no locking mechanism is needed. Finally, we test a layout where sets of sequential images in the database are assigned to a single cache, such that images 1 through k are assigned to cache 1, images k+1 through 2k to cache 2, and so on (ring cache).
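For reference, here is a minimal sketch of the single shared cache (the first variant): one mutex guards both the recency list and the lookup map, which is exactly the contention point under discussion. The key and value types are placeholders for image ids and their descriptor data:

```cpp
#include <cstddef>
#include <list>
#include <mutex>
#include <unordered_map>
#include <utility>
#include <vector>

class LruCache {
public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

    // Look up a key; on a hit, move it to the front of the recency list.
    bool get(int key, std::vector<float> &out) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = map_.find(key);
        if (it == map_.end()) return false;                        // miss
        order_.splice(order_.begin(), order_, it->second.second);  // touch
        out = it->second.first;
        return true;
    }

    // Insert or refresh a key, evicting the least recently used if full.
    void put(int key, std::vector<float> value) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = map_.find(key);
        if (it != map_.end()) {
            order_.splice(order_.begin(), order_, it->second.second);
            it->second.first = std::move(value);
            return;
        }
        if (map_.size() >= capacity_) {
            map_.erase(order_.back());  // evict least recently used key
            order_.pop_back();
        }
        order_.push_front(key);
        map_[key] = {std::move(value), order_.begin()};
    }

private:
    std::size_t capacity_;
    std::list<int> order_;  // most recently used at the front
    std::unordered_map<int,
        std::pair<std::vector<float>, std::list<int>::iterator>> map_;
    std::mutex mutex_;
};
```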
In our implementation, the ring cache actually stripes the database across the caches: with c caches and k images per slice, cache 1 is assigned images 1 through k, ck+1 through (c+1)k, 2ck+1 through (2c+1)k, and so on (similarly shifted for the other caches). Furthermore, when querying a cache, a thread first tries its assigned cache, but will not wait on that cache if it is locked; instead it proceeds to the first unlocked cache it finds within the ring.
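A sketch of the ring lookup under these assumptions, using std::try_to_lock so a busy cache is skipped rather than waited on (LRU bookkeeping omitted for brevity; types are our own):

```cpp
#include <cstddef>
#include <mutex>
#include <unordered_map>
#include <vector>

// One slot in the ring: a mutex plus its cached entries.
struct CacheSlot {
    std::mutex mutex;
    std::unordered_map<int, std::vector<float>> entries;
};

struct RingCache {
    std::vector<CacheSlot> slots;
    std::size_t sliceSize;  // k images per slice

    RingCache(std::size_t numCaches, std::size_t k)
        : slots(numCaches), sliceSize(k) {}

    // Image n's home cache is (n / k) % c: slices are striped round-robin.
    bool get(int imageId, std::vector<float> &out) {
        std::size_t home = (imageId / sliceSize) % slots.size();
        for (std::size_t i = 0; i < slots.size(); ++i) {
            CacheSlot &slot = slots[(home + i) % slots.size()];
            std::unique_lock<std::mutex> lock(slot.mutex, std::try_to_lock);
            if (!lock.owns_lock()) continue;  // busy: try the next in the ring
            auto it = slot.entries.find(imageId);
            if (it != slot.entries.end()) { out = it->second; return true; }
            return false;  // this cache answered: not cached
        }
        return false;  // every cache was busy; caller falls back to disk
    }
};
```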
#Accuracy
Accuracy is measured on the Oxford buildings dataset [4]. We provide two measurements of accuracy. First, we measure the number of matching images in the top k=16 results, averaged over 128 image queries. Two images are considered a match if a geometric verification (estimating a fundamental matrix between them) succeeds.
| Method | Correct | Total | Score |
| --- | --- | --- | --- |
| Inverted index (65k clusters) | 257 | 2048 | 0.12548828125 |
| Vocabulary tree (10 branch, 6 depth) | 219 | 2048 | 0.10693359375 |
We then measure precision as we increase k from 1 to 64, where precision is the number of matching images in the first k results divided by the total number of images in the first k results.
The vocabulary tree gives slightly worse but very comparable accuracy to the inverted index. Nister and Stewenius [1] discuss accuracy and parameter selection in more depth. Below we also show some qualitative results on the Oxford buildings dataset.
In the image above, each row represents a query and its matches. The leftmost image is the query, and the images to its right are the matches in sorted order (the first match has the highest score). Images with a green border passed geometric verification with the query image, while those with a red border failed it.
#Performance Data
##Parallel Speedup
###Desktop:
4 Cores (2 threads per core) Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
Average image search time on a standard desktop computer for one through eight threads (lower is better). As we move to eight threads performance gains start to flat-line.
###Balaton parallel speedup:
16 cores (4 threads per core) AMD Opteron(TM) Processor 6272 @ 2.4 GHz
Average image search time on a server for one through sixty four threads.
###Warp cluster speedup:
4 Cores (2 threads per core) Quad-Core AMD Opteron(tm) Processor 2354 @ 1.1 GHz
Average image search time on a single cluster node for one through eight threads. Unfortunately, while our MPI implementation works on a single node, we were unable to test its scaling across multiple nodes in the warp cluster due to MPI environment errors (in particular, the warp cluster could not successfully launch MPI processes across nodes due to communication errors).
##Cache Speedup
Search speed as cache size increases (smaller is better). In the test set we had ~2k database vectors, so a single cache of size 2048 is able to hold all of the vectors in memory and achieves maximum performance. Note that cache size indicates total cache size: for the distributed and ring caches, each thread's cache is 1/n of the size of the single cache, where n is the number of threads, so each cache type uses an equivalent amount of memory.
Cache hit rate for different cache sizes (measured as #hits / (#hits + #misses)). Note that the single cache has a hit ratio of about one once the cache size reaches about 2k, since there are only 2k database vectors in our training set.
###Cache Performance on Video Data
Cache hit ratio (#hits / (#hits + #misses)) as the number of search queries increases for a video dataset using a vocabulary tree, where the queries are chosen either randomly or sequentially. During the first few iterations, cache performance is poor since the database vectors have not yet been loaded. Sequential frames have similar appearance and are therefore more likely to travel down the same paths in the vocabulary tree, so the required database vectors should be similar, giving sequential queries a slight performance boost over random queries.
#Discussion of Data
The vocabulary tree outperforms the inverted index by a significant amount on 2 of the 3 machines. The speedup as a function of thread count differs greatly between machines. On the desktop, the two implementations show almost identical speedup. On all 3 machines the vocabulary tree sees good speedup up to 4 threads and then only a slight improvement at 8 threads. The inverted index is curious in that it sees essentially no performance increase on the balaton machine, but a large increase up to 4 threads on the warp machine followed by a decrease in performance at 8. We suspect these differences are due in part to hard drive I/O, as both the vocabulary tree and the inverted index read heavily from disk, but we do not know the real reason behind these discrepancies.
Balaton gives some puzzling data for thread counts over 8. There is a huge performance decrease all the way up to 64 threads. Part of this may be due to poor scaling at that many threads (locks, work per node, etc.), but we suspect there are other factors about the machine that we are not considering (disk I/O...).
From the data we can see that parallelization within a single query increases performance only to a point. For larger numbers of threads it would be best to parallelize over queries, since queries are inherently parallel. On a single machine these parallel queries will be completely independent except for contention over disk I/O. We have implemented a search parallelized over queries; however, we did not measure its scaling, although we expect it to be roughly linear in the number of threads, since little communication or locking is needed when parallelizing over queries (see the sketch below).
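A sketch of what query-level parallelism could look like; Query, Result, and search() are placeholders for the project's actual types and single-query entry point:

```cpp
#include <future>
#include <vector>

struct Query {};   // stands in for a query image's descriptors
struct Result {};  // stands in for a ranked list of matches

// Stands in for the existing single-query search.
Result search(const Query &) { return {}; }

// Run each query as an independent asynchronous task; the only shared
// resource between tasks is disk I/O.
std::vector<Result> searchBatch(const std::vector<Query> &queries) {
    std::vector<std::future<Result>> tasks;
    tasks.reserve(queries.size());
    for (const Query &q : queries)
        tasks.push_back(std::async(std::launch::async,
                                   [&q] { return search(q); }));
    std::vector<Result> results;
    for (auto &t : tasks) results.push_back(t.get());
    return results;
}
```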
We do have working MPI parallelization for both the vocabulary tree and inverted index. Unfortunately we were not able to get MPI working properly on the cluster and were unable to get data for running on multiple nodes.
We do not have timings for build time, but it is an important consideration when comparing vocabulary trees to inverted indexes. To get an inverted index with x words, training involves one very large k-means computation with x clusters. The tree, however, performs many smaller k-means computations (each with k equal to the branch factor), which is significantly faster. As an example, the inverted index with 65k words took a day to train, while the vocabulary tree with 1 million words took 10 minutes. Additionally, to increase the number of words in the vocabulary tree (without changing the branch factor) you can simply recurse on one of the leaf nodes without rebuilding the whole tree, saving compute time. The inverted index would have to start over completely for any resizing of the vocabulary.
We unsurprisingly found that using a cache helped to improve performance. What we found interesting, however, was that the single shared cache suffered from lock contention (at least until everything was loaded into the cache); and for larger training datasets it becomes more difficult to load all of the database vectors into main memory. Similarly, the per-thread caches performed well for small cache sizes since no locking is needed, but for larger cache sizes their performance started to suffer: each thread can have vastly different matches from query to query, so different features must be loaded into each cache. Our ring cache implementation served as a good middle ground, since queries were directed to the cache that would probably contain them, with a lower chance of multiple threads contending for the same cache.
#Future Work
There are many ways the vocabulary tree could be sped up further, both on a single thread and across multiple threads and nodes. There is currently no multi-node implementation of training. Given more time and more suitable test data (especially for multi-node runs), the implementation could be made considerably faster.
The inverted lists at each leaf store the images that have matching descriptors and how many matching descriptors each one has. Currently this second number is unused, but it could be used to decide which images to look at first and which to throw away. This may raise accuracy issues but would make searching faster.
The tree currently searches over a given number of images and returns a given number of them. There could be some speedup from stopping this check early if certain conditions are met: if the tree finds some very good matches early on, it could theoretically stop searching and just return those. The key problem is determining what constitutes a good enough match and deciding if/when the matches in hand are good enough.
Currently a batch query will simply run in parallel over each query, which should provide linear speedup when only considering the CPU (and not memory, cache, disk I/O…). There may be ways to perform this batch query faster than multiple single queries by re-using overlapping data.
One key application that the cache attempted to address is performing additional queries with similar images, i.e. from video where most neighboring frames will be very similar. Ideally performing 2 very similar queries in succession will see a big speedup on the second since most of the work would be the same. The cache does this in part by keeping recently used descriptors and other image data in the cache so multiple queries don’t need to read from disk. Future work would be to include other intermediate data generated in a query and attempt to use that on the next search.
Another similar idea would be for queries to modify the tree itself so that more frequently returned images show up first, letting the tree halt computation earlier.
#References
[1] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," CVPR, 2006. http://www.vis.uky.edu/~stewe/publications/nister_stewenius_cvpr2006.pdf
[2] J. Sivic and A. Zisserman, "Video Google: A text retrieval approach to object matching in videos," ICCV, 2003.
[3] M. Muja and D. Lowe, "Fast approximate nearest neighbors with automatic algorithm configuration," VISAPP, 2009.
[4] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, "Object retrieval with large vocabularies and fast spatial matching," CVPR, 2007.