Introduction to the Stanford Compression Library - kedartatwawadi/stanford_compression_library GitHub Wiki
Why stanford compression library?
The Stanford Compression Library is being implemented to aid research in the area of data compression. As students working in the field of data compression, we noticed that although these softwares are used everyday (gzip, jpeg etc); unfortunately there is no good research implementation which can be tweaked quickly.
The goals of SCL are:
- To provide research implementation of common data compression algorithms
- To provide convenient framework to quickly modify existing compression algorithm and to aid research in the area
- More Selfishly, to understand these algorithms better :)
Getting Started
Before we dive deep into the library, please make sure you have followed the instructions in the README and that all the tests are passing. Please file an issue if you face problems with the setup.
Library Structure
The directory structure for the library is as follows:
├── LICENSE
├── README.md
├── requirements.txt
├── core
│ ├── data_block.py
│ ├── data_encoder_decoder.py
│ ├── data_stream.py
│ ├── encoded_stream.py
│ └── prob_dist.py
├── compressors
│ ├── baseline_compressors.py
│ ├── golomb_coder.py
│ ├── huffman_coder.py
│ ├── ...
│ └── universal_integer_coder.py
├── external_compressors
│ └── zlib_external.py
└── utils
├── bitarray_utils.py
├── ...
└── tree_utils.py
The directories are self-explanatory, but here are some more details:
- /core: This contains the core part of the library which is common to almost all compressors. For eg: classes to represent input data, encoded bitarrays, Encoders, Decoders
- /compressors: This includes compressors implemented natively in SCL.
- /external_compressors: SCL-like interface to external compressors (such as
zlib
etc.) - /utils: general utility functions for debugging, bitarray manipulation, testing etc.
core
library
The the core library implements all the basic frameworks and classes common to all compressors. We elaborate them below:
- DataBlock: The encoders and decoders in SCL operate on blocks of data. Each input block is of type
DataBlock
. TheDataBlock
can be thought of as a thin wrapper around a list of input symbols. Lets take a look at the class definition:
class DataBlock:
"""
wrapper around a list of symbols.
The class is a wrapper around a list of symbols (self.data_list). The data_block is typically used
to represent input to the data encoders (or output from data decoders)
Some utility functions (useful generally for compression) implemented are:
- size
- alphabet
- empirical_distribution
- entropy
"""
def __init__(self, data_list: List):
self.data_list = data_list
@property
def size(self):
return len(self.data_list)
...
As you can see, the main data is stored in the self.data_list
attribute, the other functions are helper functions which are useful in multiple places in the code.
One useful think in the SCL is that unit tests are present in the same file at the bottom, and are very useful as usage examples. For example, lets take a look at the tests for DataBlock
def test_data_block_basic_ops():
"""checks basic operations for a DataBlock"""
data_list = [0, 1, 0, 0, 1, 1]
# create data block object
data_block = DataBlock(data_list)
# check size
assert data_block.size == 6
# check counts
counts_dict = data_block.get_counts(order=0)
assert counts_dict[0] == 3
...
- DataEncoder and DataDecoder: Any compressor consists of an Encoder and a Decoder. the encoder maps input symbols to bits while the decoder does the reverse mapping (bits to output symbols). In case of SCL, all encoders are subclasses of
DataEncoder
and all decoders are subclasses ofDataDecoder
. Let's take a look at the class definitions to understand better:
class DataEncoder(abc.ABC):
...
def encode_block(self, data_block: DataBlock):
...
# return encoded_bitarray
raise NotImplementedError
...
class DataDecoder(abc.ABC):
...
def decode_block(self, bitarray: BitArray):
...
# return decoded_block, num_bits_consumed
raise NotImplementedError
...
For now let's focus on the encode_block
and decode_block
member functions, which are inverses of each other. The encode_block
function of DataEncoder
maps input DataBlock
to a BitArray
, while the decode_block
function of DataDecoder
does the reverse. Note that decode_block
also returns the num_bits_consumed
. This is useful as the input BitArray
might contain bits written by other encoders, and so the decode_block
might not consume all the bits. We will see how this is useful in combining multiple encoders.
The encode_block
and decode_block
functions are the core logic of any compressor, and is usually the only part subclassing encoders/decoders need to implement. Here is an example of the
We will also return to DataEncoder, DataDecoder
later to discuss the other member functions (encode
, decode
, encode_file
... etc.)