Range Coding - kedartatwawadi/stanford_compression_library GitHub Wiki

TODO: this is not the final version, will clean this up shortly.

Range coder v1

This is the canonical range coder, also known as the Schindler range coder. I believe it is very similar to the one in the Martin paper (https://sachingarg.com/compression/entropy_coding/range_coder.pdf), but that paper was unfortunately a bit hard to read, so I can't say for sure whether there are differences.

The implementation here is based on:

This is more efficient than the Russian range coder (v2 below) since it handles the mid-range/carry case more carefully rather than just giving up part of the range. In addition, it allows higher frequency precision (23 bits vs. 16 bits when using 32-bit integers). It is, however, slightly more complicated. Below I try to describe the main ideas, working throughout with 32-bit precision for the sake of exposition.

Let's define the following:

Top value = 0x80000000 (binary 1000...0000)
Bottom value = 0x00800000 (1 byte, i.e., 8 bits, below top)

As usual, we deal with low and range, where high = low + range. The shrink_range step is exactly the same as in RangeCoderV2, which is basically the same as in arithmetic coding, so we skip it here (similarly for the decode_symbol function); a generic sketch is included below for reference.
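The shrink_range update narrows [low, low + range) to the sub-interval belonging to the encoded symbol. Here is a minimal generic sketch in the arithmetic-coding style; the exact SCL formulation may differ (e.g., in how rounding or the last symbol is handled), and the names are illustrative:

    def shrink_range(low, range_, cum_low, cum_high, tot_freq):
        # Narrow [low, low + range) to the sub-interval of the current symbol,
        # whose cumulative frequency interval is [cum_low, cum_high) out of
        # tot_freq. Requires tot_freq <= bottom value so that, after
        # normalization (range > bottom value), the new range is at least 1.
        new_low = low + (range_ * cum_low) // tot_freq
        new_high = low + (range_ * cum_high) // tot_freq
        return new_low, new_high - new_low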

We use at most 31 bits (bits 30-0) for the range, i.e., range is at most the top value. We use 23 bits for the distribution (i.e., the sum of frequencies is <= the bottom value). Once range drops below the bottom value, we need to normalize.

The top bit (bit 31) of low is used as a carry. The layout of low is:

Bit 31 - "carry"
Bits 30-23 - "renormalization byte"
Bits 22-0 - remaining precision (the frequency bits live here, since the sum of frequencies is at most the bottom value = 2^23)

Top value: 0x80000000
Bottom value: 0x00800000
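As a minimal sketch, these constants and the invariants they encode look like this in 32-bit arithmetic (names are illustrative and need not match the SCL code):

    TOP = 1 << 31      # 0x80000000; bit 31 is reserved for the carry
    BOTTOM = 1 << 23   # 0x00800000; one byte (8 bits) below TOP
    SHIFT = 23         # the renormalization byte is bits 30-23 of low

    # Invariants maintained by the coder:
    #   after normalization: BOTTOM < range <= TOP
    #   sum of symbol frequencies <= BOTTOM (23 bits of frequency precision)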

The first written byte is arbitrary; this simplifies the rest of the logic slightly (the encoder always releases the previously buffered byte, so the very first release is a dummy).

The bottom value is the maximum allowed sum of frequencies, and range is not allowed to fall below it (normalization kicks in when it does).

Encoder normalize:

While range is less than bottom:

If the top bit of low is 1 (a carry happened), we increment the previously buffered byte by 1, release it, and put the next byte (bits 30-23 of low) into the buffer. If the top bit is 0 and the renorm byte is not 0xFF, we release the previously buffered byte as-is and put the renorm byte into the buffer. In both cases we are not yet sure whether the byte we just buffered will eventually come out as-is or with +1 (from a future carry), which is why it goes into the buffer rather than directly into the output.

The last case is similar to the mid-range case in arithmetic coding.

If the top bit is 0 and the renorm byte is 0xFF, we can't release anything yet, because we don't know whether a future carry will propagate through this byte. Instead we increment a counter of pending 0xFF bytes; later, once we hit one of the previous two cases, we first release the pending bytes (depending on the case: either the buffered byte followed by that many 0xFF bytes, or the buffered byte + 1 followed by that many 0x00 bytes, since a carry turns each pending 0xFF into 0x00).

This is very similar to the arithmetic coding approach, except that we deal with a byte at a time rather than a bit. Another difference is that we don't check high; we always allow for the possibility of a carry and resolve it via the buffered byte and the pending-0xFF counter.
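Putting the three cases together, here is a minimal sketch of the encoder normalization described above. The names (buffer_byte, num_pending_ff, output) are illustrative, not necessarily those used in the SCL implementation:

    TOP = 1 << 31      # carry bit
    BOTTOM = 1 << 23   # renormalization threshold
    SHIFT = 23         # renorm byte = bits 30-23 of low

    def encoder_normalize(low, range_, buffer_byte, num_pending_ff, output):
        # Loop while range is too small; using <= here ensures
        # BOTTOM < range <= TOP after the loop.
        while range_ <= BOTTOM:
            renorm_byte = (low >> SHIFT) & 0xFF
            if low & TOP:
                # Case 1: a carry happened -> the buffered byte gets +1 and
                # every pending 0xFF turns into 0x00.
                output.append((buffer_byte + 1) & 0xFF)
                output.extend([0x00] * num_pending_ff)
                num_pending_ff = 0
                buffer_byte = renorm_byte
            elif renorm_byte != 0xFF:
                # Case 2: top bit is 0 and the renorm byte is not 0xFF ->
                # no carry can reach the buffered byte(s) anymore.
                output.append(buffer_byte)
                output.extend([0xFF] * num_pending_ff)
                num_pending_ff = 0
                buffer_byte = renorm_byte
            else:
                # Case 3: renorm byte is 0xFF and a carry is still possible;
                # just remember one more pending 0xFF.
                num_pending_ff += 1
            low = (low << 8) & (TOP - 1)  # drop carry bit + renorm byte, keep 31 bits
            range_ <<= 8
        return low, range_, buffer_byte, num_pending_ff

At the start of encoding, buffer_byte is arbitrary (which is why the first written byte is arbitrary, as noted above) and num_pending_ff is 0; the end-of-stream flush mentioned below takes care of releasing whatever is still held back.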

Why is range at most top value?

The top bit is not used for the range; it is only for the carry. So we always move bits 30-23 (the renorm byte) into the buffer, knowing that we will eventually output either this byte or this byte plus 1 (ignoring the "mid-range" 0xFF case for the moment). Later, based on whether a carry occurs, we find out whether we need to add 1, and only then do we actually move the byte from the buffer to the output. If we didn't reserve this bit, an overflow would wrap around and we wouldn't be able to capture the carry.

Note that range (after normalization) satisfies 0x00800000 < range <= 0x80000000. Since each normalization loop iteration shifts range left by 8 bits, the decoder's normalization performs exactly the same number of iterations as the encoder's, so we can be sure that the normalization step during decoding ends up with the same range (and hence low) as the normalization step during encoding (after each symbol).

Correspondence between encoder and decoder states, and how the end-flush part works correctly to guarantee that we read back all the bytes we have written (stages on the same line have the exact same state (low, range, buffer, mid-range count) after they finish):

Decoding:

The C implementation fixes low to 0 and uses the variable named low to represent what we call state. We stick with the low, range, and state variables as used in V2 and in arithmetic coding, for clarity.

Range coder v2

Also called "Russian range coder" in certain places, e.g., at http://cbloomrants.blogspot.com/2008/10/10-05-08-5.html. This is the range coder used in PPMd and is somewhat less optimal since it just sort of gives up when facing underflow and simply releases bytes without trying to figure out if the ultimate range is above or below mid. Bloom suggests possible optimization for this in https://cbloomrants.blogspot.com/2008/10/10-06-08-followup-on-russian-range-coder.html.

We implement here the version as described in https://sachingarg.com/compression/entropy_coding/64bit/ (code at https://sachingarg.com/compression/entropy_coding/64bit/code/entropy/range32.cpp), which is the standard version as described in Bloom's first post above.

The range coder as described on Wikipedia (https://en.wikipedia.org/w/index.php?title=Range_coding&oldid=1069851861) is very similar to this approach.

https://web.archive.org/web/20020420161153/http://www.softcomplete.com/algo/pack/rus-range.txt

Normalization

  1. If low and low+range have the same top byte, we can already output that byte, because we now know the final value's top byte is determined. Another way to write this check is that the XOR of low and low+range is less than self.TOP (meaning all bits in the top byte must be the same). In terms of range, the right way to think about this while loop is: think of the total possible range as divided into 256 partitions, where each partition corresponds to a top-byte value. Here the range lies completely within one partition, so the top byte can be safely released and we can continue working within that partition at higher resolution.

  2. We are at a point where the range is too small but the top bytes of low and high do not match. This means we have a very small range that spans two of the partitions defined by top-byte equality. More precisely (in the 32-bit case), this can only happen when low looks like 0xABFFxxxx and high looks like 0xAC00xxxx, where the top byte of high is one more than the top byte of low. To handle this we set the new range to (self.MASK-low)&(self.BOTTOM-1), which brings high down to the byte boundary. After this we can release a byte and continue with the loop. Some examples below (32-bit case); a sketch of the whole loop follows the examples.

                # old_low: 0xabffacdd
                # old_range: 0xfff0
                # old_high: 0xac00accd
                # range_after_step 0x5323
                # high_after_step 0xac000000
                # byte_released 0xab
                # old_low: 0xabffffff
                # old_range: 0x54
                # old_high: 0xac000053
                # range_after_step 0x1
                # high_after_step 0xac000000
                # byte_released 0xab
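Here is a minimal sketch of this normalization loop, assuming 32-bit arithmetic with top = 2^24 and bottom = 2^16 (constants renamed TOP_V2/BOTTOM_V2 here to avoid clashing with the v1 constants above; the SCL code may express the case-2 range update differently):

    TOP_V2 = 1 << 24      # top-byte boundary
    BOTTOM_V2 = 1 << 16   # renormalization threshold / max total frequency
    MASK32 = (1 << 32) - 1

    def normalize_v2(low, range_, output):
        while True:
            high = (low + range_) & MASK32
            if (low ^ high) < TOP_V2:
                # Case 1: low and high share the top byte, so it is final.
                pass
            elif range_ < BOTTOM_V2:
                # Case 2: tiny range straddling a top-byte boundary;
                # shrink range so that high drops down to the byte boundary.
                range_ = (-low) & (BOTTOM_V2 - 1)
            else:
                break  # range is large enough; nothing more to release
            output.append((low >> 24) & 0xFF)  # release the top byte
            low = (low << 8) & MASK32
            range_ <<= 8
        return low, range_

Running the first example above through this loop: case 2 fires and sets range to 0x5323 (so high becomes exactly 0xac000000), after which the byte 0xab is released and low and range are shifted left by 8 bits.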

Not completely understood points:

  1. Why is bottom chosen as 2^16 (for 32-bit arithmetic)? Note that bottom determines the maximum sum of frequencies, and we do not allow range to go below it. Clearly, values that are too high or too low have issues (too-frequent normalization and loss vs. too little resolution in the probabilities). Is 2^16 the only value that would work? Is there significance to it being byte aligned (i.e., 16 = 2*8)?

    Based on testing, setting bottom to 2^20 also works fine, but the gap to entropy is slightly higher.