LZSS Format - niemasd/PyFF7 GitHub Wiki
Lempel–Ziv–Storer–Szymanski (LZSS) is a lossless data compression algorithm used to compress many files in Final Fantasy VII. It seems like most approaches to LZSS compression (at least in gaming) stem from the design by Haruhiko Okumura (see this article). Some LZSS-compressed files in Final Fantasy VII do not have a file extension, and some have the extension .lzs
(likely to have exactly 3 characters in the extension), but note that LZSS compression is not the same as LZS compression.
An LZSS-compressed file's header is only 4 bytes, and it is a 4-byte unsigned integer denoting the size of the compressed data (should be LZSS filesize - 4).
The rest of the LZSS-compressed file is the actual compressed data split into blocks.
Each block in the LZSS-compressed file consists of 1 Control Byte and 8 pieces of data.
The Control Byte indicates how many pieces of the block are uncompressed ("literal data") and how many are compressed ("references"): 0 = "reference" and 1 = "literal". You read the bits of the Control Byte LSB-first (LSB = Least Significant Bit, i.e., right-to-left). For example, a Control Byte of 11100000 indicates that the first 6 bytes of the block are "references" (11000000) and the last 2 bytes of the block are "literal data" (1100000).
Each piece of "literal data" is just 1 byte of uncompressed data. Simply output each byte of "literal data" as-is.
Each reference consists of 2 bytes in the format OOOOOOOO OOOOLLLL
: the first 12 bits represent the Offset (where to find the data in existing decompressed data), and the last 4 bits represent the Length.
Note: It seems like some files (e.g. something contained in RIDE_0.NPK
) have a reference that is only 1 byte in the format OOOOLLLL
. A valid LZSS decompressor should check for this
The length consists of the 4 least significant (i.e., rightmost) bits of the second byte of the reference + 3. For example, if the 4 bits of the length were 0000
, this would represent a length of 0 + 3 = 3, and if the 4 bits of the length were 1111
, this would represent a length of 15 + 3 = 18. Thus, the length must be in the range [3,18]. This is because, if the length were less than 3, it would be more efficient to just output as uncompressed data.
The first byte of the reference is the 8 least significant (i.e., rightmost) bits of the offset. The 4 most significant (i.e., leftmost) bits of the second byte of the reference contain the 4 most significant (i.e., leftmost) bits of the offset.
If you use a 4-KiB (4,096-byte) buffer, you can use this raw offset directly, but if you are not (e.g. if you're storing everything in memory at once), you can compute the real offset as follows (where raw_offset
is this offset and tail
is your current position in the output):
real_offset = tail - ((tail - 18 - raw_offset) mod 4096)
Also note that, if you are using a 4-KiB (4,096-byte) buffer, you should initialize the buffer position to 0xFEE
, not 0 (but the buffer content should be initialized to 0).
Now that you have your length and your offset, simply copy the corresponding chunk of data and append it to the output.
Imagine we've already written 1,000 bytes of output, and we need to read in a new Control Byte because we finished the previous block. Imagine the next few bytes are as follows:
0xFC, 0x53, 0x12 ...
First, we read in a Control Byte: 0xFC
(11111100 in binary). This tells us that our current block has 2 references (each taking up 2 bytes) followed by 6 literal data bytes (each of course taking up 1 byte). In other words, this current block consists of the 11 bytes: the Control Byte we just read followed by (2 bytes/reference) * (2 references) = 4 bytes of references followed by (1 byte/literal) * (6 literals) = 6 bytes of literal data.
To read the first reference, we read 2 bytes (0x53, 0x12
). This yields an offset of 0x153
(0x53
from the first byte and 0x10
from the higher nibble of the second byte) and a length of 0x2
(the lower nibble of the second byte). Note that 0x153
= 339 and 0x2
= 2, so our raw offset is 339 and our length is 2 + 3 (because we have to add 3) = 5.
Our current position in the output is 1,000, so our real offset is 1000 - ((1000 - 18 - 339) and 0xFFF)
(and
with 0xFFF
is a quick way to do modulo 4096), yielding a real offset of 357.
We go to position 357 of our output data, copy a chunk of length 5 (so positions 357, 358, 359, 360, 361) and add those to the end of our output.
Now, we can read in our second reference of this block and do it again, and then we can read our 6 literal data bytes, and we will be done with this block as well.
Final Fantasy VII uses clever tricks in their compression that we need to take into account.
For all negative offsets (i.e., "before the beginning of the output file"), simply output <NULL>
(i.e., 0x0
). For example, if you currently at position 50 of the output, if the offset indicates that you need to go back 60 positions (to position -10) and read 15 bytes, this corresponds to outputting 10 <NULL>
(i.e., 0x0
) bytes followed by the first 5 bytes of your output.
When you go off the end of your output, you simply loop what you are outputting. For example, if you are at position 100 of the output and the offset indicates going back 5 positions (to position 95) and read 7 bytes, you will output the bytes at positions 95, 96, 97, 98, 99, 95, 96 (i.e., when you ran out of available bytes, you looped back around).
Sometimes, the chunk you load is 0 bytes. In this case, just use <NULL>
(i.e., 0x0
) for any bytes you try to write.
-
File Header
- 4 bytes (Compressed Data Size)
-
Actual Compressed Data
- All remaining bytes (Compressed Data)