AAC and MPEG-4 Audio introduction

MPEG-2 AAC

AAC was standardized under that name in ISO/IEC 13818-7:1997, MPEG-2 Part 7: Advanced Audio Coding, as a non-backwards-compatible successor to the popular MP3.

Along with the raw frame data format, the spec goes one step further and defines two basic file/stream formats to contain the frames:

  • ADIF, a simple but now rare format for storing a stream in a file (this was before MP4 was a thing): a tiny header followed by all the frames concatenated.
  • ADTS, a self-synchronizing format that is still widely used as a relatively barebones representation. It's used as-is in Shoutcast radio and it's also the payload of PES packets containing AAC audio in MPEG-TS. There is no stream-wide header; instead, each frame carries the configuration of the stream, which for this reason is done with very few bits and limited extensibility (a header-parsing sketch follows this list). This causes problems later.
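
To make that tradeoff concrete, here is a minimal Python sketch of parsing the fixed part of an ADTS frame header. The field layout is from the spec; the function and helper names are invented for this example.

    # A minimal sketch of parsing a (MPEG-4 flavor) ADTS frame header,
    # assuming a well-formed header is available.
    def parse_adts_header(data: bytes) -> dict:
        if len(data) < 7:
            raise ValueError("an ADTS header is at least 7 bytes")
        bits = int.from_bytes(data[:7], "big")  # the 56 header bits as one integer

        def field(offset, width):
            # Extract `width` bits starting `offset` bits from the left.
            return (bits >> (56 - offset - width)) & ((1 << width) - 1)

        if field(0, 12) != 0xFFF:
            raise ValueError("bad syncword")
        return {
            "mpeg_version": 2 if field(12, 1) else 4,  # the retrofitted ID bit
            "protection_absent": field(15, 1),         # 0 means a CRC follows the header
            "audio_object_type": field(16, 2) + 1,     # 2-bit profile field = AOT - 1
            "sampling_frequency_index": field(18, 4),
            "channel_configuration": field(23, 3),
            "frame_length": field(30, 13),             # includes the header itself
        }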

MPEG-4 Audio

AAC as defined by MPEG-2 still has some usage, but what most people refer to as AAC nowadays is actually ISO/IEC 14496-3, MPEG-4 Part 3: Audio (from here on, MPEG-4 Audio), using AAC.

Existing MPEG-2 AAC frames are valid MPEG-4 Audio frames, but MPEG-4 Audio introduces additional optional encoding steps that can improve coding efficiency.

Conceptually the biggest difference is that MPEG-4 Audio is not just about AAC anymore. Instead, it attempts to be very overarching and cover wildly different use cases. The most popular use case is referred to as "General Audio Coding (GA)" and inherits much of the inner workings of AAC, but MPEG-4 Audio also has sections devoted to other use cases, like human speech coding, machine speech synthesis, lossless audio coding, and Structured Audio (more akin to MIDI files and soundbanks).

The differentiation between use cases is actually important, because the spec defines a different pipeline template for each one, describing how coding is done, often as a series of steps (tools) where each one takes the data from the previous step, performs some transformation, and writes to the compressed audio stream whatever data allows the decoder to reverse that transformation to an acceptable degree.

In the case of MPEG-4 General Audio most of these tools are transposed (and potentially extended) from the MPEG-2 AAC standard, but there are also new tools introduced in MPEG-4, notably Perceptual Noise Substitution (PNS).

Most of the tools in the MPEG-4 Audio GA pipeline are optional (codec applications will use them when requested and appropriate), and some have different alternatives. In particular, for quantization and coding (the last, but very important, step in the pipeline) AAC quantization and coding is commonly used, but the spec alternatively allows different, much less supported codings like TwinVQ (known to perform well at very low bitrates) and BSAC. Files using the latter would generally not be considered AAC.

As far as the MP4 container is concerned, there is no such thing as "MPEG-4 AAC"; the contained type is just MPEG-4 Audio. It's represented inside MP4 by an mp4a box.

Audio Object Types (AOT)

With so many use cases and variants, there needs to be a way to differentiate what kind of content an MPEG-4 stream has. For this purpose, they added the concept of MPEG-4 Audio Object Types (AOT). Each MPEG-4 Audio stream has one AOT. Some examples:

  • 1: AAC main
  • 2: AAC LC
  • 3: AAC SSR (a rare variant)
  • 5: SBR (Spectral Band Replication, this one is important and will be explained later in this document)
  • 7: TwinVQ
  • 8: CELP (a natural speech coding format defined also by MPEG-4)
  • 9: HVXC (another natural speech coding format by MPEG-4)
  • 12: TTSI (a synthetic speech format by MPEG-4)
  • 15: General MIDI
  • 34: Layer-3 (MP3 wrapped as MPEG-4)
  • 36: ALS (a lossless coding format defined by MPEG-4)
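
For convenience in the sketches later in this document, here is the same list as a Python lookup table (the spec's full AOT table is much longer):

    # Only the example AOTs mentioned in this document; the spec defines many more.
    AUDIO_OBJECT_TYPES = {
        1: "AAC main",
        2: "AAC LC",
        3: "AAC SSR",
        5: "SBR",
        7: "TwinVQ",
        8: "CELP",
        9: "HVXC",
        12: "TTSI",
        15: "General MIDI",
        29: "PS",  # covered in the Parametric Stereo section below
        34: "Layer-3",
        36: "ALS",
    }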

There is a table defining which tools each object type is allowed to use. Looking at it you can see, for instance, that the difference between AAC main and the much more common AAC LC is only one tool: "frequency domain prediction". Also, as one would expect, files using the AAC object types only have access to AAC quantization and coding, not TwinVQ, and vice versa. There are also many simple cases where the object type and the tool coincide.

In addition to object types, there are also profiles. Profiles define which object types a decoder supports, according to a table in the spec.

  • AAC Profile: AAC LC (2).
  • High Efficiency AAC Profile: SBR (5), AAC LC (2).
  • High Efficiency AAC v2 Profile: SBR (5), PS (29), AAC LC (2).
  • Speech Audio Profile: CELP (8), HVXC (9).

(Despite the name, AAC main is so rare it's only included in Main Audio Profile and Natural Audio Profile, both of which require support for lots of other rarely used object types.)

AudioSpecificConfig (codec-data)

MPEG-4 Audio streams need some out-of-band data to declare the object type and other important stream-wide parameters, some of which apply to all object types (e.g. sampling rate) and some of which are specific to particular AOTs. This is done with a standard bitstring defined as AudioSpecificConfig in the MPEG-4 Audio spec.

Decoder software is usually configured by providing this raw string, and many container formats will also embed it. In GStreamer this string is the codec-data in audio/mpeg, mpegversion=4. In MP4 it is filled in DecoderSpecificInfo inside the esds box, inside the mp4a box. The esds box, unlike most boxes, is not part of ISO 14496 Part 12: ISO Base Media File Format, but rather an ISO 14496 Part 14: MP4 File Format exclusive, and its content is just an ES_Descriptor, defined in ISO 14496 Part 1: Systems. MPEG standards are really intricate and interlinked like that.

The first 5 bits of AudioSpecificConfig define the object type. They are followed by more bits defining the sampling rate (usually from a fixed table) and speaker configuration (mono, stereo, surround). The specification for the whole AudioSpecificConfig is several pages long and includes references to other parts of the spec, but most of it is due to the wide variety of options. Despite the complexity, these strings are usually very short (2-8 bytes).

GStreamer has a very readable real-world AudioSpecificConfig parser in gst_codec_utils_aac_get_audio_object_type_full(), in codec-utils.c.

For a typical AAC LC file, the AudioSpecificConfig is 0x1210. It is parsed from left to right; all values here are coded as unsigned integers:

# bits  Purpose                 Value                      Notes
5       audioObjectType         2 (AAC LC)
4       samplingFrequencyIndex  4 (44100 Hz)
4       channelConfiguration    2 (stereo)
1       GA frameLengthFlag      0 (1024/128 lines IMDCT)
1       GA dependsOnCoreCoder   0                          If 1, the next 14 bits specify coreCoderDelay
1       GA extensionFlag        0
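
As a worked example, here is a minimal Python sketch that parses this exact case. It only handles plain AAC LC with no extensions; the function and helper names are invented.

    # A minimal AudioSpecificConfig parser covering only the plain AAC LC
    # case from the table above.
    SAMPLING_FREQUENCIES = [96000, 88200, 64000, 48000, 44100, 32000,
                            24000, 22050, 16000, 12000, 11025, 8000, 7350]

    def parse_asc_aac_lc(data: bytes) -> dict:
        bits = int.from_bytes(data, "big")
        total = len(data) * 8

        def field(offset, width):
            # Extract `width` bits starting `offset` bits from the left.
            return (bits >> (total - offset - width)) & ((1 << width) - 1)

        aot = field(0, 5)
        if aot != 2:
            raise NotImplementedError("this sketch only handles AAC LC")
        return {
            "audio_object_type": aot,
            "sample_rate": SAMPLING_FREQUENCIES[field(5, 4)],
            "channel_configuration": field(9, 4),
            "frame_length_flag": field(13, 1),
            "depends_on_core_coder": field(14, 1),
            "extension_flag": field(15, 1),
        }

    # parse_asc_aac_lc(bytes.fromhex("1210")) returns AAC LC, 44100 Hz, stereo.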

SBR and HE-AAC

Well after MPEG-4 Audio was already deployed, it was discovered that the perceived quality of low-bitrate audio could be increased by encoding high frequencies (which we are less sensitive to) roughly in terms of the low frequencies, potentially with more aggressive compression. This technique was named Spectral Band Replication (SBR).

SBR is specified as an MPEG-4 Audio tool for the General Audio use case and can in theory be paired with a number of AOTs, but in practice it's most commonly used with the popular AAC LC.

When using SBR, the original codec (usually AAC) encodes only the audio contained in the lower half of the spectrum, whereas SBR is used to code the rest.

Back when it was introduced for AAC, some degree of compatibility was desired. Honoring this, SBR data is coded inside the AAC extension_payload field so that older (now rare) players can skip it safely and at least play the baseband audio, resulting in noticeably lower quality, but still audible output.

Files using this technique are referred to as High Efficiency AAC (HE-AAC) by the spec. Some manufacturers refer to it as AAC+.

Owing to the desire to remain compatible with (now rare) old AAC players, there are a number of alternative methods to signal the presence of SBR in MPEG-4 streams, all defined by the spec, with different consequences.

Explicit hierarchical signaling

This method is not backwards-compatible with old MPEG-4 Audio players: they will refuse to play the file since it declares an AOT they don't know. This method can be used when one wants to ensure that the stream will only be played at the full audio quality it was encoded with.

In this method, the AOT of the MPEG-4 Audio stream is 5 (SBR). Since each AOT can have different fields following it in AudioSpecificConfig, in the case of SBR one of these signals the AOT of the codec SBR is applied to, usually AAC LC (2).

# bits  Purpose                          Value                      Notes
5       audioObjectType                  5 (SBR)
4       samplingFrequencyIndex           7 (22050 Hz)
4       channelConfiguration             2 (stereo)
4       extensionSamplingFrequencyIndex  4 (44100 Hz)
5       nested audioObjectType           2 (AAC LC)
1       GA frameLengthFlag               0 (1024/128 lines IMDCT)
1       GA dependsOnCoreCoder            0                          If 1, the next 14 bits specify coreCoderDelay
1       GA extensionFlag                 0

Here, and in all cases, samplingFrequencyIndex still refers to the sampling rate of the AAC frames before SBR is applied. extensionSamplingFrequencyIndex refers to the sampling rate after SBR is applied.
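
As a cross-check, here is a small Python sketch packing the fields from the table above into bytes (the 25 payload bits are zero-padded to a byte boundary); pack_bits is invented for this example:

    def pack_bits(fields):
        # Concatenate (width, value) pairs into a zero-padded byte string.
        value, total = 0, 0
        for width, v in fields:
            value = (value << width) | v
            total += width
        pad = (-total) % 8
        return (value << pad).to_bytes((total + pad) // 8, "big")

    asc = pack_bits([
        (5, 5),   # audioObjectType: 5 (SBR)
        (4, 7),   # samplingFrequencyIndex: 7 (22050 Hz)
        (4, 2),   # channelConfiguration: 2 (stereo)
        (4, 4),   # extensionSamplingFrequencyIndex: 4 (44100 Hz)
        (5, 2),   # nested audioObjectType: 2 (AAC LC)
        (3, 0),   # GA frameLengthFlag, dependsOnCoreCoder, extensionFlag
    ])
    assert asc.hex() == "2b920800"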

Explicit compatible signaling

This take on explicit signaling takes advantage of the fact that older AudioSpecificConfig parsers are expected to stop at the last field they know. By adding more bits after that, newer parsers can read information about extensions while keeping the stream backwards-compatible.

This method is only feasible with container formats that embed an AudioSpecificConfig (therefore, not ADTS) and, further, only those where the size of the AudioSpecificConfig is coded in the format, so that an old parser will skip the additional bits. This is the case for most byte-oriented containers like MP4.

The additional bits contain an 11-bit syncExtensionType, for which only one value is currently registered in this position (0x2b7), followed by the AOT of the extension (5 for SBR).

After this there is an additional bit that says whether the extension is used. When it is zero, it explicitly states that the stream does not use SBR; this is the only signaling method where that is possible. When it is one, the SBR parameters follow, described in the same way as in explicit hierarchical signaling.

Here is an example:

# bits  Purpose                          Value                      Notes
5       audioObjectType                  2 (AAC LC)
4       samplingFrequencyIndex           7 (22050 Hz)
4       channelConfiguration             2 (stereo)
1       GA frameLengthFlag               0 (1024/128 lines IMDCT)
1       GA dependsOnCoreCoder            0                          If 1, the next 14 bits specify coreCoderDelay
1       GA extensionFlag                 0
11      syncExtensionType                695 (0x2b7)
5       extensionAudioObjectType         5 (SBR)
1       sbrPresentFlag                   1
4       extensionSamplingFrequencyIndex  4 (44100 Hz)
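
The same packing exercise for this example, as a self-contained Python sketch derived entirely from the table above (37 payload bits plus 3 zero padding bits):

    fields = [
        (5, 2),       # audioObjectType: 2 (AAC LC)
        (4, 7),       # samplingFrequencyIndex: 7 (22050 Hz)
        (4, 2),       # channelConfiguration: 2 (stereo)
        (3, 0),       # GA flags (all zero)
        (11, 0x2B7),  # syncExtensionType: 695
        (5, 5),       # extensionAudioObjectType: 5 (SBR)
        (1, 1),       # sbrPresentFlag
        (4, 4),       # extensionSamplingFrequencyIndex: 4 (44100 Hz)
    ]
    value, total = 0, 0
    for width, v in fields:
        value, total = (value << width) | v, total + width
    pad = (-total) % 8  # 3 bits of zero padding
    asc = (value << pad).to_bytes((total + pad) // 8, "big")
    assert asc.hex() == "139056e5a0"

Note that the first two bytes, 0x1390, are exactly what an SBR-unaware parser reads: a plain AAC LC stream at 22050 Hz.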

Implicit signaling (and the ADTS format again)

This is a backwards-compatible method: the stream poses as normal AAC LC with half the sample rate. When the first frame is fed to the decoder, an HE-AAC-compatible decoder is required by the spec to notice the SBR extension if present and then output at a sample rate double the one used by the AAC LC stream.
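
The resulting rule is trivial but easy to overlook; a sketch, with invented names:

    # The implicit SBR rule: the real output rate is only known after the
    # first frame is decoded.
    def output_sample_rate(declared_rate_hz: int, first_frame_has_sbr: bool) -> int:
        # With implicit signaling the stream declares the core AAC LC rate
        # (e.g. 22050 Hz); if the first frame carries an SBR extension
        # payload, the decoded output doubles that rate (44100 Hz).
        return declared_rate_hz * 2 if first_frame_has_sbr else declared_rate_hz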

Whenever you see in GStreamer an AAC decoder receiving AAC with sample-rate=22050 but outputting sample-rate=44100, this is what is happening.

This has the unfortunate consequence that the sample rate needed to play a stream posing as AAC LC is not known until the first frame is completely parsed. This has tripped up Chromium's MediaSource Extensions implementation in the past. Apple also does not fully support it.

Although this method's main purpose was to remain compatible with old AAC decoders, it remains in use in today's world of widespread SBR support due to the limitations of the ADTS format.

ADTS was not scrapped in MPEG-4 Audio. Instead a new version was introduced with very few changes: there was a single ID bit that was supposed to differentiate between MPEG-1 Audio (never actually used, since ADTS was only used with AAC) and MPEG-2 Audio; this bit was retrofitted into an AAC version bit, 1 for MPEG-2, 0 for MPEG-4.

The changes to the format were rather subtle. In particular, the profile field was modified to codify the AOT (minus 1, since there is no AOT=0), but it's still only 2 bits long. And there is still no place to put the AudioSpecificConfig bitstring.

As a consequence of all this, it's impossible to explicitly signal SBR in ADTS. It also follows that, unlike other containers, ADTS can only carry AAC, not more general MPEG-4 Audio.

ADTS streams are still very common, in part due to their simplicity and in part because they remain a common way to encapsulate AAC in MPEG-2 TS. They also find use in adaptive streaming implementations.

MPEG-2 AAC and SBR

Although it originated in MPEG-4, SBR has also been backported into the MPEG-2 AAC spec, so that it can be used in protocols that support only MPEG-2 streams.

Downsampled SBR

SBR always operates at twice the sample rate of the original AAC stream. In some streams SBR is used with 44100 Hz or 48000 Hz audio, so the SBR process produces 88200 Hz or 96000 Hz audio respectively, with frequency bands well above the human hearing limit and above the common sample rates of sound cards; the output therefore has to be downsampled by half in order to be played back.

The spec defines a simple method for the SBR decoder to do this downsampling by itself, saving what could be a costly transformation on very low power systems: since the SBR algorithm operates in the frequency domain, it's easy to discard the extreme high band there.

This usage of SBR is referred to as downsampled SBR by the spec, and it's used every time the SBR sample rate would be higher than supported.

I couldn't find explanations in the spec or any other documents about why this usage is advantageous over plain AAC. I've seen reports of better compression rates at high bitrates compared to plain AAC, but no explanations as to why this is the case.

Downsampled SBR can be signaled explicitly by setting extensionSamplingFrequencyIndex to the same value as samplingFrequencyIndex. SBR by design always doubles the sample rate internally, but this setting tells the decoder to do the downsampling itself if possible.
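
A sketch of the resulting check, with argument names matching the AudioSpecificConfig fields used earlier (the function name is invented):

    def is_downsampled_sbr(sampling_frequency_index: int,
                           extension_sampling_frequency_index: int) -> bool:
        # Equal indices mean the decoder should fold SBR's doubled rate back
        # down itself instead of outputting a higher sample rate.
        return extension_sampling_frequency_index == sampling_frequency_index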

Parametric Stereo

HE AAC v2 is actually AAC LC+SBR+PS (Parametric Stereo).

Whereas SBR is encoded as an extension of AAC LC, PS is encoded as an extension of SBR. Therefore, when there is PS it always comes with SBR.

PS improves the efficiency of the core codec (AAC LC) by encoding only the "middle" channel (the average of the left and right channels) and then using the PS extension to recover the differences.
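
As an illustration of the downmix the encoder starts from, assuming simple per-sample averaging (the PS parameters themselves are beyond the scope of this sketch):

    # The core AAC LC stream in HE AAC v2 carries the mono "mid" downmix
    # shown here; the PS extension carries the compact spatial parameters
    # (not shown) used to re-create a stereo image from it.
    def mid_downmix(left: list[float], right: list[float]) -> list[float]:
        return [(l + r) / 2 for l, r in zip(left, right)]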

(Due to the nature of this optimization, there is no such thing as mono HE AAC v2. In fact, codecs like fdkaac will fail if HE AAC v2 is requested but a mono source is given.)

The signaling techniques for PS are the same as those for SBR. For hierarchical explicit signaling, a new AOT is defined: 29 (PS). The PS AOT implies SBR, so no further nested SBR AOT is needed; the nested AOT is directly that of the core codec (usually AAC LC). In the case of explicit backwards-compatible signaling, an additional suffix with its own syncExtensionType is appended after the SBR extension, in order to keep compatibility with HE AAC v1 decoders.