Tutorial - gtoubassi/femtozip GitHub Wiki
Tutorial
The following walks through generating sample data, building a model, and compressing/decompressing in comparison to gzip. If you can easily extract sample data into files (one file per document), this is an easy recipe for evaluating FemtoZip's impact on your own data. It is assumed that FemtoZip has been installed and that the initial build was done in ~/femtozip (this is so the fzdatagen tool can be run, which is not installed into /usr/local/bin). For more on building FemtoZip, see How to Build.
1. Generate Sample Data
First we use fzdatagen to generate sample data. fzdatagen generates example JSON-serialized user records, with field values such as first name and last name following English-language character distribution statistics. Run the tool and make sure it is generating data:
% ~/femtozip/cpp/fzdatagen/src/fzdatagen --num 5
{"first":"i ftaparrh","last":"hereecseherota","email":"patnsdaatlcih @cktret.com","gender":"m","bday":"1994-6-24"}
{"first":"lbrdtar","last":"srroaori tsrg","email":"htch lnnh @gmail.com","gender":"f","bday":"2000-7-19"}
{"first":"oobn hebigm","last":"teedrcnt","email":"mthaarkduagli@nr r ebr.edu","gender":"f","bday":"1980-6-21"}
{"first":"ectlaoorntj","last":"idotnntmo","email":"nre [email protected]","gender":"m","bday":"1963-5-11"}
{"first":"siooil","last":"eei erdl","email":"ett mat gee@emmihios rce.edu","gender":"f","bday":"1976-8-9"}
You can see the form of the data. The documents are very similar to one another, yet each has very little internal repetition/similarity. This is ideal data for FemtoZip. Imagine our luck! Let's generate 25,000 documents to sample for purposes of building a model, and then another 25,000 that we will use for benchmarking vs gzip. Note we don't want to build a model using the same documents that we benchmark, as that may result in unrealistically good compression rates vs the common case.
% cd ~/femtozip/cpp
% mkdir -p /tmp/data/train
% mkdir -p /tmp/data/benchmark
% fzdatagen/src/fzdatagen --num 25000 /tmp/data/train
% fzdatagen/src/fzdatagen --num 25000 /tmp/data/benchmark
The last argument causes fzdatagen to output each record in its own file within the specified directory. Check out the data:
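If your own corpus is one document per line rather than one file per document, a few lines of shell produce the same layout fzip expects. This is a minimal sketch; the input file and `/tmp/data/mine` directory below are hypothetical stand-ins, with a tiny corpus generated inline so the snippet runs on its own:

```shell
# Stand-in corpus: three newline-delimited JSON documents.
mkdir -p /tmp/data/mine
printf '{"a":1}\n{"a":2}\n{"a":3}\n' > /tmp/mydata.jsonl

# Write each line to its own numbered file -- the one-file-per-document
# layout that fzip's --build/--compress flags operate on.
i=0
while IFS= read -r line; do
    i=$((i + 1))
    printf '%s' "$line" > "/tmp/data/mine/$i"
done < /tmp/mydata.jsonl
```

For real use, point the loop at your own newline-delimited file instead of the generated one.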
% ls /tmp/data/train | head -5
1
10
100
1000
101
% grep first /tmp/data/train/* | head -5
/tmp/data/train/1:{"first":"i ftaparrh","last":"hereecseherota","email":"patnsdaatlcih @cktret.com","gender":"m","bday":"1994-6-24"}
/tmp/data/train/10:{"first":"eehnenmaudefee","last":"r ola rorot","email":"suhteoi tt [email protected]","gender":"m","bday":"1968-5-18"}
/tmp/data/train/100:{"first":"lr aehsna","last":"efthof trp","email":"[email protected]","gender":"m","bday":"1991-1-5"}
/tmp/data/train/1000:{"first":"poohitahcaie","last":" lteeirggi","email":"[email protected]","gender":"f","bday":"1988-2-23"}
/tmp/data/train/101:{"first":"iudnherrtt","last":"fse slne","email":"roeddiotahri@a nrslkdssis.com","gender":"m","bday":"1962-6-8"}
Also, let's see exactly how much data is in the benchmark data set, as well as its md5. We will use these later for verifying the compression/decompression round trip, and for comparing compression rates to gzip. NOTE: On Linux, use md5sum rather than md5.
% cat /tmp/data/benchmark/* | wc -c
2726923
% cat /tmp/data/benchmark/* | md5
a3755df1fb4d94962071984eaa68aa8d
Note your values will, of course, vary.
2. Build a Model
Now build a model of the data which FemtoZip can use for compression/decompression later on.
% fzip/src/fzip --model /tmp/data/model.fzm --build /tmp/data/train
3. Compress the data
Now compress the benchmark data set using our model, and see how big the resulting data is.
% fzip/src/fzip --model /tmp/data/model.fzm --compress /tmp/data/benchmark
% cat /tmp/data/benchmark/* | wc -c
821805
Dividing by the original total size computed above gives a compression ratio of 821805/2726923 = 30%!
4. Decompress the data
Decompress the data, and make sure the total size and md5 are the same as those computed earlier.
% fzip/src/fzip --model /tmp/data/model.fzm --decompress /tmp/data/benchmark
% cat /tmp/data/benchmark/* | wc -c
2726923
% cat /tmp/data/benchmark/* | md5
a3755df1fb4d94962071984eaa68aa8d
The MD5s match, so the data was restored correctly.
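The same round-trip check can be scripted. Here is a minimal sketch using gzip/gunzip as a stand-in compressor so the snippet is self-contained (substitute the two fzip invocations above for real use), and md5sum as on Linux (use md5 on macOS):

```shell
# Create a tiny sample directory so the sketch runs on its own.
rm -rf /tmp/rt && mkdir -p /tmp/rt
printf 'hello world\n' > /tmp/rt/doc

before=$(cat /tmp/rt/* | md5sum | cut -d' ' -f1)   # checksum before
gzip /tmp/rt/doc                                   # compress in place
gunzip /tmp/rt/doc.gz                              # restore the original
after=$(cat /tmp/rt/* | md5sum | cut -d' ' -f1)    # checksum after

[ "$before" = "$after" ] && echo "round trip OK"
```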
5. Compare to gzip
% gzip /tmp/data/benchmark/*
% cat /tmp/data/benchmark/* | wc -c
2972728
% gunzip /tmp/data/benchmark/*
Gzip comes in at 2972728/2726923 = 109% of the original size. In other words, FemtoZip shrunk the data to 30% of its original size, while gzip actually made it 9% bigger.
6. Compare to zstd's dictionary based compression
% zstd --train /tmp/data/train/* -o /tmp/data/model.zstd
% ls -l /tmp/data/model.*
-rw-r--r-- 1 nnnnnnnnn nnnn 71331 Jul 26 10:41 model.fzm
-rw-r--r-- 1 nnnnnnnnn nnnn 112640 Jul 26 10:43 model.zstd
% zstd -D /tmp/data/model.zstd --rm /tmp/data/benchmark/*
% cat /tmp/data/benchmark/* | wc -c
1506618
% zstd -D /tmp/data/model.zstd --decompress --rm /tmp/data/benchmark/*
Zstd used a roughly 60% larger dictionary and compressed to 1506618/2726923 = 55% of the original size, vs 30% for FemtoZip.
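Pulling the three measurements together (the byte counts below are the ones recorded above; your own runs will differ):

```shell
# Compressed size as a percentage of the 2726923-byte original, per tool.
awk 'BEGIN {
    o = 2726923
    printf "femtozip: %.0f%%\n", 100 * 821805  / o
    printf "gzip:     %.0f%%\n", 100 * 2972728 / o
    printf "zstd:     %.0f%%\n", 100 * 1506618 / o
}'
```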