Data Compression Corpora - pete4abw/lrzip-next GitHub Wiki

These are sources for test data for evaluating lrzip-next and other compression programs.

Evaluating compression benchmarks is highly subjective. Real-world data compression uses are the best benchmarks. Some programs can be tuned to do better on text, binary data, or source code. Others are good general-purpose compressors. Some are fast and some are slow.

It comes down to time versus compression. Is a few percent better compression worth the extra time? Today's systems have vast resources at low prices. Do a few megabytes of space savings justify hours of extra compression time? It all depends on the use case.
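The trade-off can be made concrete by measuring both sides of it. The following is only a minimal sketch, assuming Python's standard-library codecs (`zlib`, `bz2`, `lzma`) as stand-ins. It does not call lrzip-next, and the `benchmark` function and the sample data are hypothetical, chosen for illustration.

```python
import bz2
import lzma
import time
import zlib


def benchmark(data: bytes) -> list[dict]:
    """Compress data with several stdlib codecs and record size and time."""
    # Stand-in codecs only; swap in whatever programs you actually compare.
    codecs = {
        "zlib-6": lambda d: zlib.compress(d, 6),
        "bz2-9": lambda d: bz2.compress(d, 9),
        "lzma-6": lambda d: lzma.compress(d, preset=6),
    }
    results = []
    for name, compress in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        results.append({
            "codec": name,
            "ratio": len(data) / len(packed),        # higher is better
            "saved_bytes": len(data) - len(packed),  # space gained
            "seconds": elapsed,                      # time spent
        })
    return results


if __name__ == "__main__":
    # Hypothetical sample: repetitive source-like text plus binary-like bytes.
    sample = (b"static int foo(void) { return 0; }\n" * 20000
              + bytes(range(256)) * 2000)
    for row in benchmark(sample):
        print(f"{row['codec']:8s} ratio={row['ratio']:.2f} "
              f"saved={row['saved_bytes']} bytes time={row['seconds']:.3f}s")
```

Timings vary by machine and by data, so the numbers are only meaningful when compared against each other on the same system and the same input. Running this on your own files will say more than any synthetic sample.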

In the README I wrote on Benchmarks, I discuss one way to compare results.

Here, then, are some sources for test data. Personally, I find that any kernel tar file is good, and a compiled kernel is better because it mixes text and binary.

Data Compression.info. Contains links to a variety of sources, including those below.
The Squash Corpus. Maintained by @nemequ right here on GitHub.
Large Text Data Compression, by Matt Mahoney, the creator of ZPAQ.
The Silesia Corpus (2003)
The Canterbury Corpus (1997)
The Calgary Corpus (1987)