entropy example 4 - davidar/scholarpedia GitHub Wiki
To understand the formula
- <math>\label{approx}
consider a lossless compression algorithm applied to a message <math>B</math> of a finite length <math>N\ .</math> To save on the output length, the decoding instructions must be relatively short compared to <math>N\ .</math> This is easily achieved in codes which refer to relatively short components of the messages only. For example, the decoding instructions may consist of a complete list of the subwords <math>A</math> of some carefully chosen length <math>n</math> appearing in <math>B</math> and their images <math>\Phi(A)\ .</math> The images may have different lengths (as short as possible). The image of <math>B</math> is obtained by cutting <math>B</math> into subwords <math>B=A_1A_2\dots A_k</math> of length <math>n</math> and concatenating the images of these subwords: <math>\Phi(B)=\Phi(A_1)\Phi(A_2)\cdots\Phi(A_k)\ .</math> (There are additional issues here: in order for such a code to be invertible, the images <math>\Phi(A)</math> must form a prefix-free family, so that there is always a unique way of partitioning <math>\Phi(B)</math> back into the images <math>\Phi(A_i)\ .</math> But this does not essentially affect the computations.) For best compression results, it is reasonable to assign the shortest images to the subwords appearing in <math>B</math> with highest frequencies. It can be proved, with the help of the Shannon-McMillan-Breiman theorem, that such a code realizes the compression rate as in \eqref{approx} for nearly all sufficiently long messages <math>B\ .</math>
For example, consider the binary message of length 30:
- <math>
On the right, <math>B</math> is shown divided into subwords of length 3. The frequencies of the subwords in this division are:
- <math>
The theoretical value (obtained using the formula \eqref{approx} for <math>n=3</math>) of the best compression ratio is
- <math>
A binary prefix-free code giving the shortest images to the most frequent subwords is
- <math>
The image of <math>B</math> by this code is:
- <math>
The compression rate achieved on <math>B</math> using this code equals <math>21/30 = 70\%\ ,</math> which means that the code is quite efficient on <math>B\ .</math>
Remark: The image given here does not include the decoding instructions, which comprise simply a record of the above prefix-free code. One has to imagine that <math>B</math> is much longer and then the decoding instructions (of fixed length) do not affect the compression ratio.