IndexAnalyzer - gtoubassi/femtozip GitHub Wiki

IndexAnalyzer

NOTE: The Java implementation of FemtoZip has fallen slightly behind the C version. If you are using the Java version and find in ad hoc tests that the C version performs better, please file an Issue and the author will be glad to help remedy it.

As part of the pure Java implementation of FemtoZip, IndexAnalyzer is a tool for benchmarking how effectively FemtoZip compresses a specific Lucene search index. The following assumes you have already built FemtoZip and that you have an index located at ~/myindex.

Prerequisites

You have already built FemtoZip.
You have a Lucene index, for example at ~/myindex.
You know how many documents are in the index. We will assume 100,000 and sample 10% of them (10,000) to build our model against.

1. Build a model of your index

% cd java/femtozip
% java -classpath bin:lib/lucene-core-2.4.1.jar org.toubassi.femtozip.lucene.IndexAnalyzer --build --model ~/index.fzm --numsamples 10000 ~/myindex

This will create a FemtoZip compression model for each stored field in the index. The tool samples 10,000 documents for model building and stores the resulting models in the newly created ~/index.fzm/ directory (the model path given on the command line).

It will dump out statistics about how well different compression methods performed on each field, and at the end you will get a summary which aggregates across all fields:

Summary:
Total Index Size: 123423872
# Documents in Index: 89095
Approx. Stored Data Size: 84309661 (68.31% of index)
Aggregate performance:
Best per Field 33.22% (314396 from 946355 bytes)
FemtoZipCompressionModel 34.03% (322022 from 946355 bytes)
GZipDictionaryCompressionModel 54.99% (520368 from 946355 bytes)
PureHuffmanCompressionModel 61.3% (580092 from 946355 bytes)
GZipCompressionModel 75.49% (714380 from 946355 bytes)

You can see the total size of the index (all bytes under the index directory), the number of documents in the index, and the total size of all stored fields in the index. The total stored fields size gives you an idea of how much room there is to shrink the overall index. Compression rates are reported under "Aggregate performance" as compressed size relative to original size, so lower is better. The "Best per Field" line represents the expected compression rate for the stored data. The GZip and GZipDictionary compression models are included mainly for comparison: GZip tells you how effective plain GZip would be (equivalent to Lucene's built-in field compression), and GZipDictionary tells you how effective GZip would be with a deflate dictionary set (the same one that FemtoZip is using, although only 32k in size).
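To make the summary concrete, the sketch below turns the reported numbers into a rough projected index size. It is not part of FemtoZip: the class and method names are hypothetical. It assumes that only stored field data shrinks and that the rate measured during the analysis holds for all stored data in the index.

```java
// Sketch only: IndexSavingsEstimate is a hypothetical helper, not a FemtoZip class.
class IndexSavingsEstimate {

    // Compression rate as reported in the summary: compressed bytes / original bytes.
    static double rate(long compressedBytes, long originalBytes) {
        return (double) compressedBytes / originalBytes;
    }

    // Assumption: the non-stored parts of the index are unchanged and the
    // measured rate applies uniformly to all stored field data.
    static long projectedIndexSize(long totalIndexBytes, long storedBytes, double rate) {
        long compressedStored = Math.round(storedBytes * rate);
        return totalIndexBytes - storedBytes + compressedStored;
    }

    public static void main(String[] args) {
        long totalIndex = 123423872L;          // "Total Index Size"
        long stored = 84309661L;               // "Approx. Stored Data Size"
        double best = rate(314396L, 946355L);  // "Best per Field"
        double gzip = rate(714380L, 946355L);  // "GZipCompressionModel"

        long withBest = projectedIndexSize(totalIndex, stored, best);
        long withGzip = projectedIndexSize(totalIndex, stored, gzip);

        System.out.printf("Best per Field rate: %.2f%%%n", best * 100);
        System.out.printf("Projected index size (Best per Field): %d bytes%n", withBest);
        System.out.printf("Projected index size (GZip): %d bytes%n", withGzip);
    }
}
```

Comparing the two projected sizes gives a quick sense of how much more the FemtoZip models could save over plain GZip field compression for this particular index. Treat the result as an estimate, since the stored data size itself is reported as approximate.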