Introduction to the Stanford Compression Library - kedartatwawadi/stanford_compression_library GitHub Wiki

Why stanford compression library?

The Stanford Compression Library is being implemented to aid research in the area of data compression. As students working in the field of data compression, we noticed that although these softwares are used everyday (gzip, jpeg etc); unfortunately there is no good research implementation which can be tweaked quickly.

The goals of SCL are:

To provide research implementation of common data compression algorithms
To provide convenient framework to quickly modify existing compression algorithm and to aid research in the area
More Selfishly, to understand these algorithms better :)

Getting Started

Before we dive deep into the library, please make sure you have followed the instructions in the README and that all the tests are passing. Please file an issue if you face problems with the setup.

Library Structure

The directory structure for the library is as follows:

├── LICENSE
├── README.md
├── requirements.txt
├── core
│   ├── data_block.py
│   ├── data_encoder_decoder.py
│   ├── data_stream.py
│   ├── encoded_stream.py
│   └── prob_dist.py
├── compressors
│   ├── baseline_compressors.py
│   ├── golomb_coder.py
│   ├── huffman_coder.py
│   ├── ...
│   └── universal_integer_coder.py
├── external_compressors
│   └── zlib_external.py
└── utils
    ├── bitarray_utils.py
    ├── ...
    └── tree_utils.py

The directories are self-explanatory, but here are some more details:

/core: This contains the core part of the library which is common to almost all compressors. For eg: classes to represent input data, encoded bitarrays, Encoders, Decoders
/compressors: This includes compressors implemented natively in SCL.
/external_compressors: SCL-like interface to external compressors (such as zlib etc.)
/utils: general utility functions for debugging, bitarray manipulation, testing etc.

The `core` library

the core library implements all the basic frameworks and classes common to all compressors. We elaborate them below:

DataBlock: The encoders and decoders in SCL operate on blocks of data. Each input block is of type DataBlock. The DataBlock can be thought of as a thin wrapper around a list of input symbols. Lets take a look at the class definition:

class DataBlock:
    """
    wrapper around a list of symbols.
    The class is a wrapper around a list of symbols (self.data_list). The data_block is typically used
    to represent input to the data encoders (or output from data decoders)
    Some utility functions (useful generally for compression) implemented are:
    - size
    - alphabet
    - empirical_distribution
    - entropy
    """

    def __init__(self, data_list: List):
        self.data_list = data_list

    @property
    def size(self):
        return len(self.data_list)
    ...

As you can see, the main data is stored in the self.data_list attribute, the other functions are helper functions which are useful in multiple places in the code.

One useful think in the SCL is that unit tests are present in the same file at the bottom, and are very useful as usage examples. For example, lets take a look at the tests for DataBlock

def test_data_block_basic_ops():
    """checks basic operations for a DataBlock"""
    data_list = [0, 1, 0, 0, 1, 1]

    # create data block object
    data_block = DataBlock(data_list)

    # check size
    assert data_block.size == 6

    # check counts
    counts_dict = data_block.get_counts(order=0)
    assert counts_dict[0] == 3
    ...

DataEncoder and DataDecoder: Any compressor consists of an Encoder and a Decoder. the encoder maps input symbols to bits while the decoder does the reverse mapping (bits to output symbols). In case of SCL, all encoders are subclasses of DataEncoder and all decoders are subclasses of DataDecoder. Let's take a look at the class definitions to understand better:

class DataEncoder(abc.ABC):
    ...
    def encode_block(self, data_block: DataBlock):
        ...
        # return encoded_bitarray
        raise NotImplementedError
    ...


class DataDecoder(abc.ABC):
    ...
    def decode_block(self, bitarray: BitArray):
        ...
        # return decoded_block, num_bits_consumed
        raise NotImplementedError
   ...

For now let's focus on the encode_block and decode_block member functions, which are inverses of each other. The encode_block function of DataEncoder maps input DataBlock to a BitArray, while the decode_block function of DataDecoder does the reverse. Note that decode_block also returns the num_bits_consumed. This is useful as the input BitArray might contain bits written by other encoders, and so the decode_block might not consume all the bits. We will see how this is useful in combining multiple encoders.

The encode_block and decode_block functions are the core logic of any compressor, and is usually the only part subclassing encoders/decoders need to implement. Here is an example of the