Codecs - bnnm/vgmstream GitHub Wiki

VGM CODECS 101

Some general info about video game codecs for the neophyte.

Codec families are ordered depending on how different they are from the original (the closer to the top the more standard they are).

Several codecs don't have an official name per se and are internally referred to by generic names (ex. PS1 ADPCM is just "ADPCM"), so the names used here may be just conventions.

CODEC FAMILIES

ADPCM

Codecs are roughly defined by:

layout or frame format (how data is organized, frame size, how channels are interleaved, etc)
decoding function (how 4-bit becomes 16-bit)
headers to seek/reset state, or none

Decoding roughly transforms ("expands") a 4-bit ADPCM code (nibble) into some value, which is then combined with previous 16-bit history samples to become the final 16-bit sample.

Most codecs are mono, with a few having stereo (or more) modes; the rest just interleave channel data at configurable sizes.
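
As a rough sketch of what "interleaving at configurable sizes" means, the snippet below splits a buffer into per-channel data. The function name and the 4-byte block size are made up for illustration; real formats use their own sizes and may have extra rules for the last block.

```python
def deinterleave(data, channels, block_size):
    """Split interleaved data into one bytes object per channel.

    Assumes fixed-size blocks laid out as ch0, ch1, ..., ch0, ch1, ...
    (hypothetical helper, not taken from any particular format).
    """
    out = [bytearray() for _ in range(channels)]
    stride = block_size * channels
    for pos in range(0, len(data), stride):
        for ch in range(channels):
            start = pos + ch * block_size
            out[ch] += data[start:start + block_size]
    return [bytes(buf) for buf in out]


# Two channels interleaved every 4 bytes: LLLL RRRR llll rrrr
left, right = deinterleave(b"LLLLRRRRllllrrrr", channels=2, block_size=4)
print(left, right)
```

Each per-channel buffer can then be fed to a mono decoder on its own.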

IMA (1 SAMPLE HISTORY ADPCM)

Defined by Intel/DVI and recommended by the IMA (Interactive Multimedia Association).

Main decoding depends on:

step_table and step_index_table config
nibble expand with a series of ops, then modifying previous sample

The codec is fairly simple and tends to be a bit noisy (precision is not high). This is (theoretically) improved in variations with "headered frames".
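
The two points above can be sketched as a minimal decoder for the base raw codec, using the standard IMA step and index tables. The nibble order (low nibble first) and the starting history/step index of 0 are assumptions for the example; as noted in the list below, variations change exactly those kinds of details.

```python
# Standard IMA step table (89 entries) and index adjustment table.
STEP_TABLE = [
    7, 8, 9, 10, 11, 12, 13, 14, 16, 17,
    19, 21, 23, 25, 28, 31, 34, 37, 41, 45,
    50, 55, 60, 66, 73, 80, 88, 97, 107, 118,
    130, 143, 157, 173, 190, 209, 230, 253, 279, 307,
    337, 371, 408, 449, 494, 544, 598, 658, 724, 796,
    876, 963, 1060, 1166, 1282, 1411, 1552, 1707, 1878, 2066,
    2272, 2499, 2749, 3024, 3327, 3660, 4026, 4428, 4871, 5358,
    5894, 6484, 7132, 7845, 8630, 9493, 10442, 11487, 12635, 13899,
    15289, 16818, 18500, 20350, 22385, 24623, 27086, 29794, 32767,
]
INDEX_TABLE = [-1, -1, -1, -1, 2, 4, 6, 8]


def clamp(value, low, high):
    return max(low, min(high, value))


def ima_expand_nibble(nibble, hist, step_index):
    """Expand one 4-bit code into a 16-bit sample (1 history sample)."""
    step = STEP_TABLE[step_index]
    delta = step >> 3
    if nibble & 1:
        delta += step >> 2
    if nibble & 2:
        delta += step >> 1
    if nibble & 4:
        delta += step
    if nibble & 8:  # sign bit
        delta = -delta
    hist = clamp(hist + delta, -32768, 32767)
    step_index = clamp(step_index + INDEX_TABLE[nibble & 7], 0, 88)
    return hist, step_index


def decode_ima(data, hist=0, step_index=0):
    """Decode raw mono IMA data, low nibble first (an assumption here)."""
    samples = []
    for byte in data:
        for nibble in (byte & 0x0F, byte >> 4):
            hist, step_index = ima_expand_nibble(nibble, hist, step_index)
            samples.append(hist)
    return samples


print(decode_ima(bytes([0x87])))  # nibbles 0x7 then 0x8 -> [11, 9]
```

"Headered frames" variations basically store `hist` and `step_index` at the start of each frame instead of carrying them over from the start of the stream.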

Main codecs

IMA: base raw codec, no frames
- DVI IMA: same with different nibble order
- 3DS, SNDS, OTNS, WV6, ALP, FFTA2, etc: change main decoding slightly
- UBI IMA: custom file header and decoding
MS-IMA: headered frames, odd samples per frame
- Reflections IMA: custom data layout
- NDS IMA: even samples per frame
- RAD IMA, DAT4 IMA
- Apple/Quicktime IMA: smaller header
- XBOX-IMA: fixed-size frames, even samples per frame
  - FSB-IMA: different header format
  - Wwise IMA: machine endian
- H4M IMA: variable frame formats controlled by blocks
OKI (aka Dialogic ADPCM/VOX): smaller step table, 12-bit output
- PC-FX: modified decoding, buggy
- 'OKI 16': 16-bit output
YAMAHA: custom tables
- AICA: minor 'filtering' differences
- 'framed' YAMAHA
- NXAP: variation

BRR/XA (2 HISTORY SAMPLES ADPCM)

Defined by Sony's researchers (~mid 1980s) and patented (expired?); the actual name is unclear. BRR = Bit Rate Reduction.

Main decoding depends on:

coefs/filter and shift/scale config
scaled nibble, adjusted with hist1 sample * coef1 + hist2 sample * coef2

Codec is a bit more complex and allows better 'fine tuning' vs IMA.
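
A minimal sketch of that "scaled nibble + hist1 * coef1 + hist2 * coef2" idea follows. The coef pairs are the commonly documented PS-ADPCM ones (fixed-point, divided by 64); the exact scaling, rounding and clamping differ per codec, so treat the details as assumptions rather than any codec's exact decoder.

```python
# Commonly documented PS-ADPCM filter pairs (coef1, coef2), in 1/64 units.
COEFS = [(0, 0), (60, 0), (115, -52), (98, -55), (122, -60)]


def sign_extend_nibble(nibble):
    """Treat a 4-bit code as a signed value (-8..7)."""
    return nibble - 16 if nibble & 8 else nibble


def clamp16(value):
    return max(-32768, min(32767, value))


def decode_nibbles(nibbles, shift, coef_index, hist1=0, hist2=0):
    """Decode a run of nibbles sharing one shift/filter config.

    Assumed layout: the config is given per frame, and nibbles were
    already split out of the frame bytes by the caller.
    """
    coef1, coef2 = COEFS[coef_index]
    samples = []
    for nibble in nibbles:
        scaled = (sign_extend_nibble(nibble) << 12) >> shift
        predicted = (hist1 * coef1 + hist2 * coef2) >> 6
        sample = clamp16(scaled + predicted)
        samples.append(sample)
        hist2, hist1 = hist1, sample  # newest sample becomes hist1
    return samples


print(decode_nibbles([1, 0, 0], shift=0, coef_index=1))  # [4096, 3840, 3600]
```

Note how a single non-zero nibble keeps "ringing" through the following samples via the history, which is what lets the encoder fine tune the result by choosing a filter and shift per frame.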

Codecs

BRR / SNES ADPCM: simple layout, 4-coef table, shifts
XA (PS1/CD-i): complex layout for CD format quirks
ADP/DTK: simple layout, int hist clamping
PS-ADPCM (aka VAG): simple layout, 5-coef table, SPU flags
- PS-ADPCM with bad flags: crafty devs reuse flags for other causes
- PS-ADPCM with configurable frame size, no flags
AFC/XMD/ASF/LSF/L5-555/PROCYON: quirky but close to XA
FADPCM: complex layout
EA-XA: slightly modified decoding
- MAXIS XA: modified layout
- EA-XA v2: has PCM blocks
  - EA-XAS v0: has header frames
    - EA-XAS v1: complex layout
MS-ADPCM: configurable frames, scales, complex fixed table (theoretically configurable)
- Cricket Audio MSADPCM: minor variation
ADX: scales, 2-coef table per file
GC-ADPCM (aka DSP): scales, 16-coef table per file
- DSP with subinterleave

OTHER ADPCMs

Custom or unique enough.

MTA: YAMAHA-like with multi tables
MTA2: EA-XAS v1-like with shift tables
HEVAG: PS-ADPCM-like with multi tables and 4 hist samples
MC3: 3-bit ADPCM
Westwood: VBR, multi-mode
ACM: multi-mode, unknown
ESS: Eugen Systems multi-pass ADPCM
8-bit XA
Circus/NWA/other Japanese makers' A/DPCM: often weird and non-useful variations

SPEECH

Speech codecs are different in that they use the characteristics of human speech (such as more mids) to compress. Since human voice is more predictable and simpler than music, you can do things that don't make sense in other codecs. Techniques to achieve this can be quite different and more or less emulate vocal cords.

EA-MT / CBX
Speex
ITU-T G.722.1 annex C (Polycom Siren14): MLT/IMLT based (somewhat MDCT-like)
ITU-T G.719 annex B (Polycom Siren22): improved Siren14, almost audio codec
SILK (part of Opus)

TRANSFORM-BASED (FFT/DCT/MDCT/etc + psychoacoustics)

(bear in mind my understanding of those codecs is limited and there can be inaccuracies)

TLDR: take samples > make a spectrogram > throw away useless values (such as higher frequencies) > put into a file. The key here is deciding what to "throw away".

Very roughly, FFT/DCT/MDCT/etc are math functions that "transform" data (for audio or any kind of file), grouping it in a way that allows further and better compression techniques to be applied. But rather than using all data 1:1 (=lossless), parts that could be removed and still sound ok enough to human ears are discarded (=lossy), so that compression improves. What to discard is decided with "psychoacoustics" models.

When encoding (more or less):

take an audio signal and convert it into PCM "samples" (numbers going up and down)
divide all samples into discrete "frames" (often parts of 1024 samples)
transform frame samples ("time domain") into spectrogram ("frequency domain"), using FFT/DCT/MDCT or a similar math function
classify spectrogram into "bands" (rough grouping of signals)
discard parts of the spectrogram that aren't noticeable to human ears ("psychoacoustics") to improve compression
simplify/compress/codify using as few bits as possible ("codebooks")
put that data into a custom "bitstream"

Decoding reverses those steps.

read bitstream
uncompress and put into bands to reconstruct the spectrogram
apply inverse transform functions (iFFT/iDCT/iMDCT/etc) on the spectrogram to get samples
output samples
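
The encode/decode steps above can be sketched as a toy round trip. This is heavily simplified and not how any real codec works in detail: it uses a plain (slow) DCT instead of an overlapped MDCT, "keep the first N coefficients" stands in for a psychoacoustic model, and rounding stands in for codebooks. All names are made up for the example.

```python
import math


def dct(frame):
    """Time domain -> frequency domain (unnormalized DCT-II)."""
    n = len(frame)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(frame))
        for k in range(n)
    ]


def idct(coefs):
    """Frequency domain -> time domain (inverse of dct above)."""
    n = len(coefs)
    return [
        coefs[0] / n
        + 2 / n * sum(coefs[k] * math.cos(math.pi / n * (i + 0.5) * k)
                      for k in range(1, n))
        for i in range(n)
    ]


def encode_frame(frame, keep):
    """Transform, throw away the highest frequencies, quantize the rest."""
    return [round(c) for c in dct(frame)[:keep]]


def decode_frame(kept, frame_size):
    """Pad the discarded coefficients with zeros and transform back."""
    coefs = list(kept) + [0] * (frame_size - len(kept))
    return [round(s) for s in idct(coefs)]


frame = [0, 100, 200, 300, 200, 100, 0, -100]
full = decode_frame(encode_frame(frame, keep=8), len(frame))
lossy = decode_frame(encode_frame(frame, keep=4), len(frame))
print(full)   # very close to the original frame
print(lossy)  # same general shape from half the data, but less accurate
```

Real codecs spend most of their complexity on the parts faked here: deciding what can be discarded per band and packing the remaining values into as few bits as possible.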

Codecs under this family all do similar steps, but each uses its own collection of compression tricks and can be quite different. For example:

since audio from L and R channels is often very similar, both channels can be partially grouped (joint stereo, MS-stereo)
volume could be scaled down first to get better compression (smaller numbers), then scaled up when decompressing
audio from very high or very low pitched "bands" can be codified with less "resolution", because humans don't hear those sounds as well.
- 3-bits for less hearable bands, 10-bits for more important ones... (more resolution = more bits = more accurate output sound)
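
As a sketch of the first trick, here is an integer mid/side conversion. This is the lossless variant (similar to what lossless codecs do), chosen because it is easy to verify; lossy codecs apply the same idea on spectral data and with their own rules, so take it only as an illustration.

```python
def ms_encode(left, right):
    """L/R -> mid (average) and side (difference)."""
    mid = [(l + r) >> 1 for l, r in zip(left, right)]
    side = [l - r for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid, side):
    """mid/side -> L/R, exactly undoing ms_encode."""
    left, right = [], []
    for m, s in zip(mid, side):
        # L+R and L-R share the same parity, so the bit lost by ">> 1"
        # in ms_encode can be restored from the side value.
        total = (m << 1) | (s & 1)
        left.append((total + s) >> 1)
        right.append((total - s) >> 1)
    return left, right


left = [1000, 1001, -500]
right = [998, 1001, -503]
mid, side = ms_encode(left, right)
print(mid, side)  # side values are tiny when both channels are similar
```

When L and R are similar, the side channel is mostly small numbers, which need fewer bits than storing two full channels.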

Common codecs:

MPEG: CBR/VBR
- MPEG Audio Layer I (MP1): the original
- MPEG Audio Layer II (MP2): more complex, more samples per frame
  - AHX: fake 'deflated' frames
- MPEG Audio Layer III (MP3): hacky MP2 extension, even more samples per frame
  - EA-MP3: PCM blocks
  - EALayer3: PCM blocks, simplified bitstream, can output 576
RELIC: somewhat MPEG-like, mono, simplistic
Musepack (MPC): MPEG-like
AC3
AAC: robust, simplified vs MPEG
HCA: CBR, clean and simpler decoding bitstream vs others
Ogg Vorbis: VBR, weird Ogg layout, per-song codebooks (allows fine tuning compression)
- many simple encrypted/obfuscated variations just to make it harder to play them outside the game
- FSB5 Vorbis: simplified layout, common codebooks (many)
- Wwise Vorbis: simplified layout, trimmed bitstream, common codebooks (few)
- OGL Vorbis: simplified layout
CELT: VBR, weird (never finalized so there are many Xiph variations)
- FSB CELT: simplified layout
- CELT (for audio) along with SILK (for speech) were absorbed into OPUS
Ogg Opus: CBR/VBR, CELT+SILK variable modes, complex
- Switch (NX) Opus: simplified layout
- EA Opus: simplified layout
- UE4 Opus: simplified layout
- Exient Opus: simplified layout
WMA: VBR, complex
WMA Pro: VBR, more complex, multichannel support
- XMA1/2: same as WMA Pro with fixed config/frames, stereo-pairs multichannel
  - EA-XMA: 'deflated' frames
ATRAC3: CBR, simple/weird
ATRAC3Plus: CBR, a bit less weird
ATRAC9: CBR, multi-pass, complex
Bink Audio: VBR
ICE DCT Codec: VBR, odd, simple
KA1A Codec: CBR, odd, simple

MP3 VS OGG?? Note that those codecs "trim" audio using psychoacoustics, but what exactly is trimmed is not specified by the format. This means one MP3 encoder may decide to trim some things and another MP3 encoder other things, plus each may use the MP3 format in slightly different ways (there is room to use variations of tricks). Add to that bit-rate settings (aka how much room the encoder has to play around). Same thing happens with OGG. So basically, comparing MP3 (the format) to OGG (the format) is not very useful; it's better to compare "MP3 encoded with X at Y bitrate" vs "OGG encoded with X at Y bitrate", since a crap encoder will sound like crap, no matter the format.

Another thing to note is that Ogg lies a little about its bitrate (to look better in comparisons basically), and doesn't count parts of the file in the bitrate that MP3 does count, so it's not always useful to compare bitrates 1:1.

A problem common to all these codecs is that decoding depends on previous frames. This causes a "delay" (silence) of at least 1 frame before getting audible sound data. Since samples per frame can be somewhat high (like ~1000 samples), it's not great for small, immediate SFX or gapless tracks. To solve this, the encoder usually specifies how long this silence is, so the player skips it when decoding. If your player doesn't understand this, though, you don't get proper gapless audio (vgmstream tries its best to handle it, since it's very important for looped audio, while for example FFmpeg gets this wrong in several codecs).
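
A minimal sketch of how a player might handle that delay, assuming the file's header provides the delay and the real sample count (names and values are made up for the example; each format stores these differently, if at all):

```python
def frames_needed(encoder_delay, num_samples, samples_per_frame):
    """Frames to decode so that all real samples are covered."""
    total = encoder_delay + num_samples
    return -(-total // samples_per_frame)  # ceiling division


def trim_decoded(decoded, encoder_delay, num_samples):
    """Skip the initial silence and drop padding after the last sample."""
    return decoded[encoder_delay:encoder_delay + num_samples]


# 3 samples of delay, 5 real samples, 2 samples of end padding
decoded = [0, 0, 0, 11, 22, 33, 44, 55, 0, 0]
print(trim_decoded(decoded, encoder_delay=3, num_samples=5))
print(frames_needed(encoder_delay=1024, num_samples=3000, samples_per_frame=1024))
```

A player that ignores both values outputs the extra silence too, which is what breaks loops and gapless playback.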

AUDIO QUALITY

Encoders convert from .wav (or a similar format) and create a file in another format, while decoders take this file and create a .wav (or a similar format).

Generally from worst to best: IMA ADPCM > low bitrate transform-based codecs > custom ADPCM > XA/BRR ADPCM > high bitrate transform-based codecs.

ENCODER AUDIO QUALITY

An important detail is that the encoder that creates the file matters a lot for the final sound quality. Especially for transform-based (MDCT/psychoacoustics) codecs, but also for ADPCM. The encoder's job is to pick the best internal numbers, and shoddy encoders will pick worse ones.

For example, Unreal Engine 4 uses MS-ADPCM, but the home-baked encoder they use is worse than Microsoft's. Or an ancient circa 2000 MP3 created by Xing encoders will sound worse than one made by the current LAME MP3 encoder at the same bitrate.

Also note that "high bitrate" is relative. Some codecs have high bitrate but are just wasteful and don't sound very good. Or you can also make a +400kbps MP3 (seriously, the format allows this); however, the MP3 format has some internal quality limits, so most of those kbps are wasted and the result is not much better than 320kbps (this odd setting is mainly to avoid 'bit reservoir').

In other words use your ears or at least some tool to compare spectrograms.

DECODER AUDIO QUALITY

While there are multiple ways to encode, there is only 1 way to decode. In other words, the decoder doesn't matter for audio quality and all decoders should sound the same.

BUT! You can have a decoder that sounds mostly correct yet has bugs which result in slight differences, often not noticeable.

In ADPCM, bugs will usually make wobbly waveforms, which typically manifest as little "pops" at times. You can see in Audacity that the waveform looks a bit off (such as going upwards too much). Also, rounding errors may make results off by ±1, but this isn't noticeable.

In transform-based codecs (MP3, HCA etc), bugs are a lot harder to notice and track down, as incorrect decoding usually sounds fine. They usually manifest as final values going slightly higher or lower than expected, or show up as some types of noise. This can (sometimes) be "seen" using spectrograms.

It's best to use official decoders as the reference when comparing with other decoders.
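
A rough way to do such a comparison on decoded PCM, as a sketch (the function name and the default tolerance of 1, for the harmless rounding differences mentioned above, are assumptions for the example):

```python
def compare_pcm(reference, candidate, tolerance=1):
    """Return (largest difference, index of first sample over tolerance).

    The index is None when every sample is within the tolerance.
    """
    if len(reference) != len(candidate):
        raise ValueError("sample counts differ, check delay/padding handling first")
    worst = 0
    first_bad = None
    for i, (a, b) in enumerate(zip(reference, candidate)):
        diff = abs(a - b)
        worst = max(worst, diff)
        if diff > tolerance and first_bad is None:
            first_bad = i
    return worst, first_bad


official = [0, 100, 200, 300]
print(compare_pcm(official, [0, 101, 199, 300]))  # rounding-level differences
print(compare_pcm(official, [0, 100, 260, 300]))  # likely a decoding bug
```

Knowing the first sample that goes wrong helps find which frame (and so which part of the decoder) to look at.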