Checkpoint 1
#Progress Summary

We have implemented the baseline bag-of-words model with an inverted index (tf-idf weighted) for a single node with multiple threads, and tested it on a few datasets (it appears to be working based on those tests). This includes boilerplate such as computing and storing features and reading the different datasets. The vocab tree has also been implemented for a single node, single thread. This includes building the descriptor tree that defines the vocabulary, creating and storing all of the database images' vectors so that queries are fast, and functions for saving and loading the tree to and from files. Parts of the code are (painfully) serial, but there are plenty of places to add parallelism.
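To make the two pieces above concrete, here is a minimal, hypothetical sketch of how a vocabulary tree quantizes a feature descriptor to a visual-word id and how a tf-idf weighted inverted index scores database images against a query. All names (`Node`, `VocabTree`-style structs, `InvertedIndex`, `quantize`, `query`) are illustrative assumptions, not the actual classes or scoring scheme in this repository.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using Descriptor = std::vector<float>;

// One node of the vocabulary tree: k cluster centers (from hierarchical
// k-means), each with a child subtree. Leaves carry a visual-word id.
struct Node {
    std::vector<Descriptor> centers;
    std::vector<Node> children;   // empty at a leaf
    uint32_t word_id = 0;         // valid only at a leaf
};

static float sq_dist(const Descriptor &a, const Descriptor &b) {
    float d = 0.f;
    for (size_t i = 0; i < a.size(); ++i) { float t = a[i] - b[i]; d += t * t; }
    return d;
}

// Descend the tree, following the nearest center at each level, until a leaf.
uint32_t quantize(const Node &root, const Descriptor &desc) {
    const Node *node = &root;
    while (!node->children.empty()) {
        size_t best = 0;
        float best_d = sq_dist(desc, node->centers[0]);
        for (size_t i = 1; i < node->centers.size(); ++i) {
            float d = sq_dist(desc, node->centers[i]);
            if (d < best_d) { best_d = d; best = i; }
        }
        node = &node->children[best];
    }
    return node->word_id;
}

// Inverted index: for each visual word, the database images containing it and
// their precomputed tf-idf weights, so a query only touches images that share
// at least one word with it.
struct InvertedIndex {
    std::vector<std::vector<std::pair<uint32_t, float>>> postings; // word -> (image_id, weight)
    std::vector<float> idf;                                        // idf[word_id]
};

// Score database images against a query's descriptors by accumulating
// products of tf-idf weights over shared words (the real scoring function
// in the repo may normalize differently).
std::unordered_map<uint32_t, float>
query(const Node &tree, const InvertedIndex &index,
      const std::vector<Descriptor> &query_descriptors) {
    // Build the query's term-frequency histogram over visual words.
    std::unordered_map<uint32_t, float> tf;
    for (const auto &d : query_descriptors) tf[quantize(tree, d)] += 1.f;

    // Accumulate similarity only over images sharing words with the query.
    std::unordered_map<uint32_t, float> scores;
    for (const auto &[word, count] : tf) {
        float q_weight = count * index.idf[word];
        for (const auto &[image_id, db_weight] : index.postings[word])
            scores[image_id] += q_weight * db_weight;
    }
    return scores;
}
```

Both the tree descent per descriptor and the per-image score accumulation are independent across query descriptors and database images, which is where most of the planned multicore and multinode parallelism would go.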
So far, no unforeseen problems have come up. The planned schedule for our project remains largely the same, although we have removed one future week (we miscounted the number of available weeks); since that week was the overflow week in our original schedule, we are still largely on track. The slightly modified schedule is below. By checkpoint two, we should have a multinode baseline implementation and a multinode vocab tree implementation, plus some benchmark code in place.
#Planned Schedule
Week 1 - Work on vocab tree implementation (single node, single core) + get data ready - done
Week 2 - Work on vocab tree implementation (single node, single core) - done
Week 3 - Test the initial vocab tree on a subset of the data + debug; start converting it to single node, multicore.
Week 4 - Test single node + multicore, start working on multinode + multicore.
Week 5 - Test multinode + multicore. If time permits and we actually have a full implementation, work on optimizing + implementing other speedup schemes (ex. feature compression)
Week 6 - Add benchmarking tools / final code changes + misc things that should have been done earlier. Run tests on larger data sets. Make some vis tools (ex: what does the tree look like?)
Week 7 - Final benchmarking and runs + report (benchmarking could take several days given data size).