HuffmanCoding - Eishaaya/GMRDataStructuresTest GitHub Wiki

Background

The world is filled with data, and algorithms designed to represent data as efficiently as possible play an important role in modern infrastructure.

Because data takes up space, we need ways to represent it as efficiently as possible. That is where compression comes into play. Data compression is the process of encoding information using fewer bits than the original representation. A compressed file takes up less space than its uncompressed version and can be transferred to other systems more quickly, which is why compression is so often used to reduce the time needed to send files over the internet.

Fixed Length Encoding

We are going to use a binary code that encodes each character as a binary string. The goal is to find a code that encodes a file using as few bits as possible; in essence, we will be compressing our file. In C-like languages a char is typically 1 byte, which means it can represent 256 different values, from 0 to 255 inclusive. The original ASCII character set, however, is a 7-bit code, which is why it can represent only 128 values, from 0 to 127 inclusive; the 26 letters of the English alphabet fit comfortably within it. Because every character is represented by the same 7 bits, ASCII is a fixed-length code. When those characters are stored in full bytes, a whole extra bit sits there unused, and across a huge amount of data a single wasted bit per character adds up to a lot of wasted space. In fact, we don't necessarily need all 7 bits to represent our characters, as shown by the upcoming examples.

Variable Length Encoding

Huffman coding works by utilizing a variable-length code, which means the binary strings used to represent characters can have different lengths. Suppose we want to compress a 100,000-byte file that we know contains only the letters A through F. Since there are only six distinct letters to encode, we can represent each letter with a fixed-length code of 3 bits: A = 000, B = 001, ..., F = 101. Can we do better than this? Yes, we can!
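A fixed-length scheme like this is easy to sketch. The snippet below (illustrative names, not part of the repository's implementation) assigns each of the six letters a 3-bit string and encodes text by simple lookup:

```python
# A sketch of fixed-length encoding for a six-letter alphabet (A-F).
# With 6 symbols we need ceil(log2(6)) = 3 bits per symbol.
FIXED_CODES = {letter: format(i, "03b") for i, letter in enumerate("ABCDEF")}
# {'A': '000', 'B': '001', 'C': '010', 'D': '011', 'E': '100', 'F': '101'}

def fixed_encode(text):
    """Encode text as a bit string, 3 bits per character."""
    return "".join(FIXED_CODES[ch] for ch in text)

bits = fixed_encode("ABCA")
print(bits)        # 000001010000
print(len(bits))   # 12 -- always exactly 3 bits per character
```

Every character costs the same 3 bits regardless of how often it appears, which is exactly the inefficiency the variable-length scheme below exploits.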

In our 100,000-byte file, if we knew the occurrence frequency of each letter, we could exploit that information to create our variable-length encoding: assign the most frequently occurring letters the shorter codes and the least frequently occurring the longer ones. For example, given the following letter frequencies (in thousands of characters):

  • A = 45, B = 13, C = 12, D = 16, E = 9, F = 5

the corresponding variable-length codes could be:

  • A = 0, B = 101, C = 100, D = 111, E = 1101, F = 1100

As we can see, the Huffman coding scheme conserves additional space because the letters with higher frequencies are encoded with shorter bit patterns.
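The saving is easy to quantify for the example above. Using the frequencies and codes just listed, the fixed-length scheme costs 3 bits per character while the variable-length scheme weights each code length by its frequency:

```python
# Comparing the two schemes from the example above
# (frequencies in thousands of characters).
freq = {"A": 45, "B": 13, "C": 12, "D": 16, "E": 9, "F": 5}
codes = {"A": "0", "B": "101", "C": "100", "D": "111", "E": "1101", "F": "1100"}

fixed_bits = sum(freq.values()) * 3                        # 3 bits per character
variable_bits = sum(freq[c] * len(codes[c]) for c in freq) # weighted code lengths

print(fixed_bits)     # 300 (thousand bits)
print(variable_bits)  # 224 (thousand bits) -- roughly a 25% saving
```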

Implementation

The Huffman coding algorithm is considered a greedy algorithm because it makes choices that are locally optimal. The algorithm constructs the tree in a bottom-up manner. Given a set of leaves containing characters and their frequencies, we repeatedly merge the two subtrees with the smallest frequencies until a single node remains. That single node is the root of our tree, which we can traverse to build the path for each character (leaf node). A left edge indicates a binary 0 and a right edge indicates a binary 1.

The algorithm itself is not too difficult to understand, but implementing it requires knowledge of a Priority Queue. To create the Huffman tree from a given string, the first step is to calculate the frequency (number of occurrences) of every character within the string. Once the frequencies are calculated, store the characters as nodes inside a Priority Queue, where each node's frequency determines its priority. When all the nodes are added, repeatedly pop two nodes at a time, create a parent node that holds them as left and right children along with their combined weight, and reinsert the parent into the Priority Queue. Once the Priority Queue has only one node remaining, tree construction is done: that node is your root. The tree is highly specific to the given input, so changing the letter frequencies ever so slightly could cause a dramatic change in the tree.
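The construction steps above can be sketched in Python, with the standard library's `heapq` standing in for the Priority Queue (the `Node` class and `build_tree` function are illustrative names, not the repository's implementation):

```python
import heapq
from collections import Counter

class Node:
    def __init__(self, freq, char=None, left=None, right=None):
        self.freq, self.char, self.left, self.right = freq, char, left, right
    def __lt__(self, other):          # lets heapq order nodes by frequency
        return self.freq < other.freq

def build_tree(text):
    """Build a Huffman tree from character frequencies; returns the root."""
    heap = [Node(f, c) for c, f in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)       # the two lowest-frequency subtrees
        hi = heapq.heappop(heap)
        heapq.heappush(heap, Node(lo.freq + hi.freq, left=lo, right=hi))
    return heap[0]

root = build_tree("aabbbcccc")
print(root.freq)   # 9 -- the root's weight equals the total character count
```

Each merge removes two nodes and inserts one, so the loop shrinks the queue by one node per iteration until only the root remains, exactly as described above.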

Once the tree is created, it is usually traversed with every left edge as a 0 and every right edge as a 1. Each time a leaf node is reached, the character value of the node is mapped to the binary string representing the directions travelled to get to that node. Once all the characters in the tree have been mapped, it is easy to convert a given string to its Huffman representation under that tree, so long as all of the string's characters appear in the tree.
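The mapping step can be sketched as a recursive traversal that accumulates a "0" for each left edge and a "1" for each right edge until a leaf is reached. The tree-building helpers are repeated here so the sketch stands alone; all names are illustrative:

```python
import heapq
from collections import Counter

class Node:
    def __init__(self, freq, char=None, left=None, right=None):
        self.freq, self.char, self.left, self.right = freq, char, left, right
    def __lt__(self, other):
        return self.freq < other.freq

def build_tree(text):
    heap = [Node(f, c) for c, f in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        heapq.heappush(heap, Node(lo.freq + hi.freq, left=lo, right=hi))
    return heap[0]

def build_codes(node, prefix="", table=None):
    """Map each character to its path: left edge = '0', right edge = '1'."""
    if table is None:
        table = {}
    if node.char is not None:              # leaf: record the accumulated path
        table[node.char] = prefix or "0"   # "0" covers a one-character input
    else:
        build_codes(node.left, prefix + "0", table)
        build_codes(node.right, prefix + "1", table)
    return table

codes = build_codes(build_tree("aabbbcccc"))
print(codes)   # the most frequent character ('c') gets the shortest code
```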

Decompression is also simple once the tree is created. Reading the encoded bit string one bit at a time, go left when a 0 is seen and right when a 1 is seen, walking down the tree in the correct order. Every time you reach a leaf node, append its character to a string builder and reset the current node to the root. Once you have consumed the entire bit string, the string builder contains the decompressed message.
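Putting the pieces together, here is a hedged end-to-end sketch (illustrative names, Python's `heapq` as the priority queue, a list playing the role of the string builder) that encodes a message and decodes it back. It assumes the input has at least two distinct characters so the root is not itself a leaf:

```python
import heapq
from collections import Counter

class Node:
    def __init__(self, freq, char=None, left=None, right=None):
        self.freq, self.char, self.left, self.right = freq, char, left, right
    def __lt__(self, other):
        return self.freq < other.freq

def build_tree(text):
    heap = [Node(f, c) for c, f in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        heapq.heappush(heap, Node(lo.freq + hi.freq, left=lo, right=hi))
    return heap[0]

def build_codes(node, prefix="", table=None):
    if table is None:
        table = {}
    if node.char is not None:
        table[node.char] = prefix
    else:
        build_codes(node.left, prefix + "0", table)
        build_codes(node.right, prefix + "1", table)
    return table

def encode(text, codes):
    return "".join(codes[ch] for ch in text)

def decode(bits, root):
    out, node = [], root
    for bit in bits:
        node = node.left if bit == "0" else node.right
        if node.char is not None:    # leaf: emit and restart at the root
            out.append(node.char)
            node = root
    return "".join(out)

message = "huffman coding example"
root = build_tree(message)
codes = build_codes(root)
encoded = encode(message, codes)
print(decode(encoded, root) == message)   # True
```

Because Huffman codes are prefix-free, no code is a prefix of another, so the decoder never needs lookahead: the first leaf reached is always the right character.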

Analysis

Let's perform a runtime analysis on the tree. Building the initial heap takes O(n * log(n)) time because each add operation on the heap takes O(log(n)) time, and we have n items. There are also (n - 1) merges, each of which takes O(log(n)) time. Therefore the overall runtime of Huffman encoding is O(n * log(n)).

Drawbacks

Our implementation of Huffman coding has a major drawback: we have to scan the entire file and build the tree before we can compress anything. This can take a very long time, because disk access is orders of magnitude slower than RAM access. We then need to iterate over the input a second time to write out each character's codeword. We also need to write out the prefix tree itself so that the decompression algorithm knows how to interpret the stream of bits.


Assignments


  • Compress a file using fixed-length compression and compare the file size of the compressed version against the original file.
    • Note: Also implement the decompression logic and validate that the decompressed data matches the original data.
  • Compress a file using Huffman variable length compression.
    • Perform frequency analysis on the data, and build the heap and then the tree.
    • Once the tree has been built, generate the mapping for each character and use it to encode the input into a stream of bytes, thereby compressing it.
    • Validate the data by decompressing the data and comparing it against the original file.
