Tutorial - gtoubassi/femtozip GitHub Wiki
Tutorial
The following walks through generating sample data, building a model, and compressing/decompressing in comparison to gzip. If you can easily extract sample data into files (one file per document), this is an easy recipe for evaluating FemtoZip's impact on your own data. It is assumed that FemtoZip has been installed and that the initial build was done in ~/femtozip (this is so the fzdatagen tool can be run, which is not installed into /usr/local/bin). For more on building FemtoZip, see How to Build.
1. Generate Sample Data
First we use fzdatagen to generate sample data. fzdatagen generates example JSON-serialized user records, with field values such as first name and last name following English-language character distribution statistics. Run the tool and make sure it is generating data:
% ~/femtozip/cpp/fzdatagen/src/fzdatagen --num 5
{"first":"i ftaparrh","last":"hereecseherota","email":"patnsdaatlcih @cktret.com","gender":"m","bday":"1994-6-24"}
{"first":"lbrdtar","last":"srroaori tsrg","email":"htch lnnh @gmail.com","gender":"f","bday":"2000-7-19"}
{"first":"oobn hebigm","last":"teedrcnt","email":"mthaarkduagli@nr r ebr.edu","gender":"f","bday":"1980-6-21"}
{"first":"ectlaoorntj","last":"idotnntmo","email":"nre [email protected]","gender":"m","bday":"1963-5-11"}
{"first":"siooil","last":"eei erdl","email":"ett mat gee@emmihios rce.edu","gender":"f","bday":"1976-8-9"}
You can see the form of the data. The documents are very similar to one another, yet each has very little internal repetition/similarity. This is ideal data for FemtoZip. Imagine our luck! Let's generate 25,000 documents to sample for purposes of building a model, and then another 25,000 that we will use for benchmarking vs gzip. Note we don't want to build a model using the same documents that we benchmark, as that may result in unrealistically good compression rates vs the common case.
% cd ~/femtozip/cpp
% mkdir -p /tmp/data/train
% mkdir -p /tmp/data/benchmark
% fzdatagen/src/fzdatagen --num 25000 /tmp/data/train
% fzdatagen/src/fzdatagen --num 25000 /tmp/data/benchmark
The last argument causes fzdatagen to output each record in its own file within the specified directory. Check out the data:
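If your own corpus is one document per line rather than one file per document, a few lines of shell produce the same layout fzip expects. This is a minimal sketch; the input file and `/tmp/data/mine` directory below are hypothetical stand-ins, with a tiny corpus generated inline so the snippet runs on its own:

```shell
# Stand-in corpus: three newline-delimited JSON documents.
mkdir -p /tmp/data/mine
printf '{"a":1}\n{"a":2}\n{"a":3}\n' > /tmp/mydata.jsonl

# Write each line to its own numbered file -- the one-file-per-document
# layout that fzip's --build/--compress flags operate on.
i=0
while IFS= read -r line; do
    i=$((i + 1))
    printf '%s' "$line" > "/tmp/data/mine/$i"
done < /tmp/mydata.jsonl
```

For real use, point the loop at your own newline-delimited file instead of the generated one.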
% ls /tmp/data/train | head -5
1
10
100
1000
101
% grep first /tmp/data/train/* | head -5
/tmp/data/train/1:{"first":"i ftaparrh","last":"hereecseherota","email":"patnsdaatlcih @cktret.com","gender":"m","bday":"1994-6-24"}
/tmp/data/train/10:{"first":"eehnenmaudefee","last":"r ola rorot","email":"suhteoi tt [email protected]","gender":"m","bday":"1968-5-18"}
/tmp/data/train/100:{"first":"lr aehsna","last":"efthof trp","email":"[email protected]","gender":"m","bday":"1991-1-5"}
/tmp/data/train/1000:{"first":"poohitahcaie","last":" lteeirggi","email":"[email protected]","gender":"f","bday":"1988-2-23"}
/tmp/data/train/101:{"first":"iudnherrtt","last":"fse slne","email":"roeddiotahri@a nrslkdssis.com","gender":"m","bday":"1962-6-8"}
Also, let's see exactly how much data is in the benchmark data set, as well as its md5. We will use these later for verifying the compression/decompression round trip, and for comparing compression rates to gzip. NOTE: On Linux, use md5sum rather than md5.
% cat /tmp/data/benchmark/* | wc -c
2726923
% cat /tmp/data/benchmark/* | md5
a3755df1fb4d94962071984eaa68aa8d
Note your values will, of course, vary.
2. Build a Model
Now build a model of the data which FemtoZip can use for compression/decompression later on.
% fzip/src/fzip --model /tmp/data/model.fzm --build /tmp/data/train
3. Compress the data
Now compress the benchmark data set using our model, and see how big the resulting data is.
% fzip/src/fzip --model /tmp/data/model.fzm --compress /tmp/data/benchmark
% cat /tmp/data/benchmark/* | wc -c
821805
Dividing by the original total size computed above gives a compression ratio of 821805/2726923 = 30%!
4. Decompress the data
Decompress the data, and make sure the total size and md5 are the same as those computed earlier.
% fzip/src/fzip --model /tmp/data/model.fzm --decompress /tmp/data/benchmark
% cat /tmp/data/benchmark/* | wc -c
2726923
% cat /tmp/data/benchmark/* | md5
a3755df1fb4d94962071984eaa68aa8d
The MD5s match, so the data was restored correctly.
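The same round-trip check can be scripted. Here is a minimal sketch using gzip/gunzip as a stand-in compressor so the snippet is self-contained (substitute the two fzip invocations above for real use), and md5sum as on Linux (use md5 on macOS):

```shell
# Create a tiny sample directory so the sketch runs on its own.
rm -rf /tmp/rt && mkdir -p /tmp/rt
printf 'hello world\n' > /tmp/rt/doc

before=$(cat /tmp/rt/* | md5sum | cut -d' ' -f1)   # checksum before
gzip /tmp/rt/doc                                   # compress in place
gunzip /tmp/rt/doc.gz                              # restore the original
after=$(cat /tmp/rt/* | md5sum | cut -d' ' -f1)    # checksum after

[ "$before" = "$after" ] && echo "round trip OK"
```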
5. Compare to gzip
% gzip /tmp/data/benchmark/*
% cat /tmp/data/benchmark/* | wc -c
2972728
% gunzip /tmp/data/benchmark/*
Gzip comes in at 2972728/2726923 = 109% of the original size. In other words, FemtoZip shrunk the data to 30% of its original size, while gzip actually made it 9% bigger.
6. Compare to zstd's dictionary based compression
% zstd --train /tmp/data/train/* -o /tmp/data/model.zstd
% ls -l /tmp/data/model.*
-rw-r--r-- 1 nnnnnnnnn nnnn 71331 Jul 26 10:41 model.fzm
-rw-r--r-- 1 nnnnnnnnn nnnn 112640 Jul 26 10:43 model.zstd
% zstd -D /tmp/data/model.zstd --rm /tmp/data/benchmark/*
% cat /tmp/data/benchmark/* | wc -c
1506618
% zstd -D /tmp/data/model.zstd --decompress --rm /tmp/data/benchmark/*
Zstd used a roughly 60% larger dictionary and compressed to 1506618/2726923 = 55% of the original size, vs 30% for FemtoZip.
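Pulling the three measurements together (the byte counts below are the ones recorded above; your own runs will differ):

```shell
# Compressed size as a percentage of the 2726923-byte original, per tool.
awk 'BEGIN {
    o = 2726923
    printf "femtozip: %.0f%%\n", 100 * 821805  / o
    printf "gzip:     %.0f%%\n", 100 * 2972728 / o
    printf "zstd:     %.0f%%\n", 100 * 1506618 / o
}'
```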