TKuLZ - cr88192/bgbtech_misc GitHub Wiki
TK uLZ
- Small Huffman-coded LZ77 variant.
- Aim is to be smaller/simpler to decode than Deflate.
- Will still be Huffman coded, unlike LZ4.
- Will not (necessarily) aim for speed, but should not be overly slow.
- LSB first.
- Huffman symbols are 1-12 bits.
- Rather than 1-15 as in Deflate.
- This allows simplifying the table encoding some.
- Also allows for a single-stage lookup table to be smaller.
Top level bitstream will consist of a series of 4 bit tags.
- 0: End of Stream.
- 1: Raw Uncompressed Data
- 2: LZ Compressed Data
- 3: Reserved
- 4: Packed Huffman Table, Tag
- 5: Packed Huffman Table, Literal
- 6: Packed Huffman Table, Distance
- 7: Reserved
- 8: Fixed Table
- 9: Raw Uncompressed Data, Stream Ends.
- A: LZ Packed Data, Stream Ends.
- B..F: Reserved
- 2 bit prefixed length:
- 0: 12 bit length (0 to 4K)
- 1: 16 bit length (0 to 64K)
- 2: 20 bit length (0 to 1M)
- 3: 24 bit length (0 to 16M)
- Bitstream then realigns to a byte boundary.
- This many raw bytes follows.
Huffman tables will encode a series of lengths, but the tables will not be entropy coded. Each table will represent up to 256 lengths.
Table contents will be encoded via 4 bit codes:
- 0: Non-encoded symbol
- 1..D: Symbol Length, 1..13 bits.
- Length 13: Reserved for now.
- E: 3..18 zeroes (4 bits follow).
- F: Escaped Run, 2 bit code follows:
- 0: 19..82 zeroes (6 bits).
- 1: 4..66 repeats of prior length (6 bits).
- Bias is 3, with 0 encoding an "End of Table".
- 2: Reserved
- 3: Reserved
- Rest of table is filled with zeroes.
If an End of Table is seen, any remaining lengths are set to zero.
Major Tables:
- Tag(T): Run Prefix Tags
- Literal(L): Literal Bytes
- Distance(D): Distance Prefix
- Shorter symbols precede longer ones;
- Symbols with the same length will be be ordered by symbol value.
These tables will contained a fixed pattern (not Huffman).
These will contain:
- 2 bits: Table Number
- 2 bits: Table Tag
- 4 bits: Table Parameter
- 0: Tag
- 1: Literal
- 2: Distance
- 0: Fixed Bit-Length Symbols
- Parameter: Number of symbol bits from 1 to 8
- 1: Length-Limited Rice
- Parameter gives the number of suffix bits (K).
- Parameter Range is 0 to 7 bits.
- 2: Length-Limited Rice-Pair
- Symbol encoded as two Rice-Coded values (Low, High).
- The parameter is split into two 2-bit K values.
The Rice symbol will be length-limited, such that four 1 bits will be followed by an 8 bit suffix giving the raw value. This means that the longest possible Rice symbol will be 12 bits.
The Rice-Pair case will be similar to the Rice case.
- The pair will be encoded first with the low nybble, then the high nybble.
- The low 2 bits of the parameter will give the K value for the low nybble.
- The high 2 bits of the parameter will give the K value for the high nybble.
- No encoded symbol may have a Rice code with a Q greater than 3.
- No encoded symbol may have a combined length exceeding 12 bits.
Runs will be encoded as:
- Tag Prefix
- If Needed, Overflowed Literal Length
- If Needed, Overflowed Match Length
- Match Distance
- If Special Case, Extra Distance
- Zero or more literal symbols
- (Match copy happens here)
- High 4 bits: Literal Length (0..14), or 15 for overflow.
- Low 4 bits: Match Length (3..17), or 15 for overflow.
- The overflow length will add a bias, 15 for literal, and 18 for match.
Distance Prefix:
- High 5 bits: Exponent (ExtraBits-1)
- Low 3 bits: Fraction
- 00..07: 0.. 7 (0)
- 08..0F: 8.. 15 (0)
- 10..17: 16.. 31 (1)
- 18..1F: 32.. 63 (2)
- 20..27: 64.. 127 (3)
- 28..2F: 128.. 255 (4)
- 30..37: 256.. 511 (5)
- 38..3F: 512.. 1023 (6)
- 40..47: 1024.. 2047 (7)
- 48..4F: 2048.. 4095 (8)
- 50..57: 4096.. 8191 (9)
- 58..5F: 8192..16383 (10)
- 60..67: 16384..32767 (11)
- 68..6F: 32768..65535 (12)
- ...
- 3(0): End of Block
- 4(1): Predicted match length and distance.
- 5(2): Predicted match length (match distance given as ExtraDistance).
- 6(3): Predicted distance (match length given as ExtraDistance+3).
- Other values are reserved.