MP4 Notes - ntrrgc/media-notes GitHub Wiki


About the nomenclature of MPEG standards

There are a number of MPEG standard sets, from older to recent:

  • MPEG-1 (ISO/IEC 11172)
  • MPEG-2 (ISO/IEC 13818)
  • MPEG-4 (ISO/IEC 14496)

Each of these standards is divided into "parts" that act as standards of their own. Each part defines a concrete matter, whilst the scope of the standard set is much broader.

For instance, MPEG-1 is decomposed into:

  • MPEG-1 Part 1: Systems
  • MPEG-1 Part 2: Video
  • MPEG-1 Part 3: Audio

The first one explains the .mpg container format. The second one explains a video format derived from H.261 that is used with it, and so on.

Sometimes, parts introduce profiles and/or layers. Regardless of the term used, they both serve to limit the burden on implementors, especially when dealing with low-power hardware. For instance, MPEG-1 Part 3 introduced three layers, each one defining a different audio compression format:

  • MPEG-1 Part 3 Layer I. Format for very limited hardware that rapidly became obsolete. Used the extensions .mp1 or .m1a.
  • MPEG-1 Part 3 Layer II. Similar to Layer I, with greater compression. Obsolete. Used the extensions .mp2 or .m2a.
  • MPEG-1 Part 3 Layer III. A different, more efficient but also more demanding audio format; it's the very MP3 we still use today. It uses the extension .mp3, of course.

Note: the extensions are quite arbitrary and are often ambiguous. .mpg may refer to the MPEG-1 Part 1 video container, MPEG-2 Part 1 (a more recent version of the format) or any of the previous audio formats.

Even if the extension includes a number, the number does not always mean the same thing. The most notable case of this is .mp4 vs .mp3. .mp4 is used nowadays exclusively with MPEG-4 Part 14, which is a container format; the number refers to the version of the standard set. On the other hand, in .mp3 the number refers to the audio layer... Actually they are so different that .mp3 files are uncontainerized elementary streams (they have no container format at all, unlike newer audio formats like Ogg Vorbis or AAC).

Bonus fact: Did you ever wonder why tagging MP3 files has so many compatibility problems and competing formats? Well, it is because tagging is usually a responsibility of the container... and we have no such thing in MP3 files. The tagging formats used with MP3 (ID3v1, ID3v2 and APE) all exploit some specific features of its data stream, mainly that it's a synchronizing stream where data is skipped until a known MP3 frame header is found.

The MP4 container format standards

The MP4 container format is heavily based on the QuickTime container format. This is the reason they are demuxed with qtdemux in GStreamer: the differences are small enough that they can be handled by the same demuxer. This does not mean that a player that handles one will necessarily be able to handle the other, though.

There have actually been three container formats defined in MPEG-4:

  • ISO/IEC 14496-1:1998 (Systems) defined an ambitious set of standards for mixing and synchronizing audio and video as well as raster and vector 2D/3D objects. These were supposed to be used with early transport containers like MPEG-2 Transport Stream, but it also defined a minimal multiplexing format called FlexMux to contain their streams.

    FlexMux has since been withdrawn and many of the concepts initially defined in this old standard have been moved out into newer parts.

  • The same standard was revised in 2001 (ISO/IEC 14496-1:2001), replacing the early container format with a new one based on QuickTime, sometimes called "MP4 version 1".

  • Shortly thereafter, the previous version was revised and split out into the similar ISO/IEC 14496-14:2003, sometimes called "MP4 version 2", which continues to be revised to this day.

Furthermore, in 2004, MP4 was generalized into the ISO Base Media File Format (ISO BMFF). 3GPP and 3GPP2 are two container formats, distinct from MP4, that emerged from this new base format.

As a consequence of this trajectory and a desire for separation of concerns, there are several standards defining the MP4 format that may be of relevance.

  • ISO/IEC 14496-12 (ISO Base Media File Format) is the most important one. It defines the layout of the container and most of the boxes that can be used with it. (Freely available)
  • ISO/IEC 14496-14 (MP4 File Format) is a short document describing the few differences between ISO BMFF files and MP4 files. It also defines a few MP4-specific boxes: iods, esds, mp4a, mp4v, mp4s and stsd.
  • ISO/IEC 14496-1 (Systems) explains the buffering model, the timing model and a metadata format known as Object Description Framework used in MP4.
  • ISO/IEC 14496-10 (Advanced Video Coding or AVC) documents a video compression format identical to H.264. (Freely available)
  • ISO/IEC 14496-15 (Advanced Video Coding file format) adds additional requirements to ISO BMFF files containing AVC video. In particular, it describes the following boxes: avc1 (mandatory), avcC (mandatory), btrt (recommended for streaming) and m4ds.
  • ISO/IEC 14496-2 (Visual) defines an older video compression format heavily based on H.263. It was widely used before H.264 became popular. The famous DivX and Xvid codecs both compressed video to this format.
  • MSE ISO BMFF Byte Stream Format defines some welcome limitations on the format of MP4 files being used with MSE and details how certain processes required by MSE will be carried out with these files.

A few MPEG standards are freely available here: http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html

The rest are sold by ISO on an individual basis.

Wikipedia has a very nice table with all the MPEG-4 standards.

Basics of the MP4 file format

MP4 files use a straightforward hierarchical structure called Box (formerly called atom) to describe all kinds of metadata.

Every box starts with a 32-bit big-endian integer size field; this size counts the size field itself plus every other field and child of the box. Following the size comes the type field, a 4-byte character code (e.g. ftyp, moov, free, mdat...).

Most, but not all boxes are defined in the ISO BMFF spec. An extensive list of boxes including extensions and their related specifications can be found here: http://mp4ra.org/atoms.html

As a guideline, boxes either have data fields or they contain child boxes. There are a few exceptions though: boxes that contain both data fields and child boxes. For instance — although it's a rather obscure example — the trep box (Track Extension Properties) is a container, yet it has an additional track_id field to specify which track its children modify.

An MP4 file is nothing more than an ordered sequence of contiguous boxes. The boundary of an MP4 file is the end of file.
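
As a minimal sketch of this structure (my own; it assumes a well-formed file and ignores the special size values 0 and 1, used for boxes that extend to the end of the file or carry a 64-bit largesize), the top-level boxes of a file can be listed like this:

import struct

def list_top_level_boxes(path):
    """Print the type and size of every top-level box of an ISO BMFF / MP4 file."""
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break  # end of file: no more boxes
            size, box_type = struct.unpack(">I4s", header)
            print(box_type.decode("ascii", "replace"), size)
            # The size counts the 8 header bytes already read, so skip the rest.
            f.seek(size - 8, 1)

list_top_level_boxes("movie.mp4")  # might print e.g.: ftyp 32, moov 4520, mdat 1048576...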

Unlike other hierarchical formats like Matroska or XML, MP4 does not have a single root element. In fact, every useful MP4 file will have many top-level boxes.

The ISO BMFF specification defines boxes in an object oriented way, using class inheritance. The size and type fields mentioned above are defined in the class Box. Many boxes inherit from a child class named FullBox that adds the fields version and flags. The version field allows spec authors to modify the syntax of a box in backwards-incompatible ways without changing the box type. The interpretation of flags is box-specific.

Pseudocode is provided in the spec to explain the syntax of each box: if conditional blocks may appear around optional fields and for loops are used to explain how tables with multiple entries are coded inside the box.

Top level boxes

ftyp

ftyp (File Type Box) serves as a file type declaration. Its mere presence indicates this is an ISO BMFF or QuickTime file. Its first specific field, major_brand, is a 4-byte string that defines the specific container format, for instance:

  • qt (padded with spaces to the right): Quick Time container format.
  • 3gp4: 3GPP container format (there are also more versions)
  • 3g2a: 3GPP2 container format (there are also more versions)
  • isom: First version of ISO BMFF.
  • iso2: Added in ISO/IEC 14496-12:2005
  • iso3: Added in ISO/IEC 14496-12:2008
  • iso4, iso5, iso6: Added in ISO/IEC 14496-12:2012
  • iso7, iso8, iso9: Added in ISO/IEC 14496-12:2015

Every new iso brand requires support for additional boxes. The boxes required by each major brand are specified in Annex E of the ISO BMFF spec.

An old yet extensive list of major brands can be found here: http://www.ftyps.com/

In addition to the major brand, any number of additional brands can be added. These are referred to as compatible brands. They represent features that a player needs to support in order to play the file. For instance, the avc1 brand specifies that the file has AVC video.

MP4 files, unlike QuickTime, 3GPP and 3GPP2 files, are most often identified by a compatible brand rather than the major brand. The reverse can occur too, so it's often safer for readers not to make any distinction between the major brand and the compatible brands.

  • mp41: MP4 version 1 defined in ISO/IEC 14496-1:2001, rare.
  • mp42: MP4 version 2, much more common.
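
For illustration, here is a sketch of my own for splitting a raw ftyp payload (the bytes following the box size and type; note the 32-bit minor_version field sitting between the major brand and the list of compatible brands) into its brands:

def parse_ftyp(payload: bytes):
    """Split a raw ftyp payload into (major_brand, minor_version, compatible_brands)."""
    major_brand = payload[0:4].decode("ascii")
    minor_version = int.from_bytes(payload[4:8], "big")
    compatible_brands = [payload[i:i + 4].decode("ascii")
                         for i in range(8, len(payload), 4)]
    return major_brand, minor_version, compatible_brands

# A fragmented MP4 produced for MSE might yield something like:
# ('iso5', 512, ['iso6', 'mp41'])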

moov

moov (Movie Box) is a container box. Its children define all the metadata of the movie, e.g. how many tracks it has, where the frames are stored... Generally a player needs to read this box in its entirety before it can play anything.

moof

moof (Movie Fragment Box) denotes the start of a fragment and serves as a container of all the metadata of the fragment. Fragments are an optional feature of ISO BMFF, explained later.

mdat

mdat (Media Data Box) is an opaque box where video and/or audio can be stored. It has neither fields nor child boxes. It's never decoded by the demuxer directly. Instead, metadata boxes inside moov or moof specify file offsets that must point somewhere inside a box of this type.

Not all the contents of mdat boxes need to be used. Authoring programs may allocate a big mdat box and then only use areas inside it as needed. Post-processing applications can read the offsets and sizes specified in the metadata to reorganize mdat contents in optimized ways (e.g. no wasted space, reorder frames in decoding order, etc.)

free

free or skip (Free Space Box) is very similar to mdat, but its contents must not be used anywhere in the file. Post-processing applications can delete all free boxes without affecting the presentation, as long as they take care to update the offsets in the metadata that may point to data in more meaningful boxes.

free boxes serve two purposes:

  • As comments from authoring or post-processing applications, e.g. Produced with GPAC 0.7.1-revrelease.
  • As buffers for editor applications, so that the moov box can grow to a generous size without having to move the mdat box that follows. This is necessary because, as you may remember, all boxes in ISO BMFF are tightly packed: finding an empty area between boxes is an error.

Timestamps and composition

MP4 was created as a rather ambitious standard. On paper it has many more features than are actually used in the wild and supported by the players nowadays.

One particular area where this ambition is quite evident is the composition system. Understanding it is helpful to grasp important aspects of the format, even if the full extent of its capabilities is never used.

An MP4 file is formed by an arbitrary number of tracks, possibly of different durations. The type of a track is specified in the handler_type field of the hdlr box (Handler Reference Box) inside moov/trak/mdia. ISO BMFF defines the following values; MP4 defines a few more not covered here.

  • vide: Video track.
  • soun: Audio track.
  • hint: Streaming hint track: Used by streaming protocols like RTP, its contents are specific to that protocol.
  • meta: Timed metadata track: its frames contain arbitrary data, derived specifications define how to make sense of it. It can be used for subtitles, for instance.
  • auxv: Auxiliary video track: quite exotic, it's a video track that is not supposed to be directly shown, but interpreted in some way by the application, e.g. to represent depth in a movie projected in a 3D scene.

The tkhd box (Track Header Box) in moov/trak defines several notable parameters related to the composition model:

  • A track_ID used to reference the track elsewhere.

  • layer: Higher values are further away from the viewer. In theory, this would allow for several audio or video tracks to be played simultaneously.

  • matrix: An affine transformation matrix for the video. It could be used to rotate, flip, scale, move or even shear the video. Combined with layer, this would allow fancy things like picture-in-picture without re-encoding.

    Real numbers in ISO BMFF are stored in fixed-point binary, with the most significant half representing the integer part and least significant half representing the fractional part.

  • alternate_group: Tracks having the same value in this field form an alternate group: only one track from each group should be played at any given time. For instance, a movie may have audio in several languages, but only one should be chosen.

Tracks contain samples which are synonymous with frames. Note that for audio, samples as understood in MP4 are not individual PCM-samples as usually denoted in other contexts, but refer instead to audio frames: chunks of audio compressed and stored together with a given format (e.g. AAC or MP3). The rest of this document will use the term frame in most cases, but bear in mind both terms are synonymous, in particular when reading the ISO specs.

Each frame has three timestamps:

  • Decoding Timestamp (DT): It defines a deadline for when the frame should be completely decoded, measured in the timeline of the track. Every frame within a track has a DT associated.

    Frames need to be decoded in ascending DT order so that their dependencies are fulfilled. MP4 does not define frame-per-frame dependencies within the container like Matroska does.

  • Composition Timestamp (CT): It defines a deadline for when the frame should be ready to be composited with the other layers and shown in the screen, measured in the timeline of the track. Every frame within a track defines a CT, usually as an offset from DT.

    Special case, relevant only for some video formats: If a frame is never shown directly but should be decoded nevertheless because it may be referenced from following frames, CT should be the most negative time that can be represented. The details of this are explained in Time to Sample Boxes in the 2015 revision of ISO BMFF, point 8.6.1.1.

  • Presentation Timestamp (PT): It defines the time the frame should be shown on the screen, or in the case of an audio frame, when it should start playing. The difference between PT and CT is that while CT is track time, PT defines time in the context of the overall movie. PT values should be the ones the end user can see in their video or audio player.

    The mapping between track time (CT) and movie time (PT) is defined by the edit list of the track. In the absence of an edit list, PT = CT.

  • Duration: It defines how long the frame should be played for.

Timescale

MP4 does not standardize a time unit for the container. Instead, it allows authors to use arbitrary time units.

Any movie or track must define a timescale: an integer that defines the number of time units that pass in one second. Times and durations related to that movie or track are expressed as an integer number of time units of that timescale.

For instance, if a track specified a timescale of 100, a duration value of 1 would be interpreted as 10 ms. All time values are integers, so it would not be possible to represent a duration or time offset of 15 ms with that timescale.

Different tracks may, and usually do, define different timescales. These are set in mdhd (Media Header Box) within trak/mdia.

Audio tracks usually use the PCM sampling rate as their timescale. This makes a lot of sense: if your track is intended to be played with a sound card capable of playing PCM samples at 48000 Hz (i.e. the voltage sent to the speakers can be changed every 1/48000 of a second), you don't need a finer timescale, since that would only allow offsetting the audio by smaller time intervals than the sound card can physically play.

Video tracks use timescales such that it's possible to specify multiples of their intended frame rate exactly. For instance, to accommodate 23.976 fps (or more precisely, 24 frames each 1.001 seconds) — a very common frame rate, intended to be compatible with NTSC players — a timescale of 24000 could be chosen, using durations of 1001 for each frame.

1001\ \cancel{units} \cdot \frac{1\ second}{24000\ \cancel{units}} = \frac{1001}{24000}\ seconds\ per\ frame,\ i.e.\ \frac{24000}{1001}\ Hz \approx 23.976\ Hz
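
As a sketch of this reasoning (the helper name and approach are my own, not from any spec), a rational frame rate can be turned into a timescale and per-frame duration like this:

from fractions import Fraction

def timescale_for_fps(fps: Fraction):
    """Pick a track timescale and per-frame duration for a rational frame rate."""
    # One frame lasts 1/fps seconds; using the numerator as the timescale
    # makes the frame duration an exact integer number of time units.
    return fps.numerator, fps.denominator

print(timescale_for_fps(Fraction(24000, 1001)))  # (24000, 1001) for 23.976 fps
print(timescale_for_fps(Fraction(30, 1)))        # (30, 1) for exactly 30 fps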

In addition to the track timescales, the overall movie also has its own timescale, defined in mvhd (Movie Header Box), just inside moov. This timescale is used when non-track-specific time is desired. Usually it has much less precision than the track timescales. Values of 1000 or 600 are not uncommon for this timescale.

Track edit list

An edit list may be used to offset a track a given time or play different parts of the track at different time intervals of the movie. The latter is rare, but the former is actually quite common and MSE ISO BMFF requires support for it.

Edit lists are stored in the elst box (Edit List Box) that must be the only child of a edts (Edit Box) inside trak (Track Box).

An edit list defines a table with the following fields:

  • Movie edit start time (implicit, initially zero).
  • Movie edit duration (segment_duration), in movie time units. The value 0 is special: it is interpreted as "until the end of the track, fragments included".
  • Track start time (media_time), in track time units. The value -1 is special: it means "don't play this track for the duration of the edit" (this is referred to as an empty edit).
  • Track play rate (media_rate): only 1.0 (normal track play rate) and 0.0 (still image) are supported by the ISO BMFF spec. QuickTime allows using a 32-bit fixed-point number to specify any playback rate.

Movie start time is implicit: it corresponds to the sum of the previous edit durations, initially zero.

Each entry of the edit list (or edit) defines a portion of the movie that is filled with a portion of the track. Both portions have the same duration but may start on different times, e.g. a track may start playing from the beginning when we're at second 5 of the overall movie, or vice versa.

media_rate would allow inserting still images instead of moving video, but this is very poorly supported (only QuickTime Player supports it). In QuickTime Format (.mov), fractional values are supposed to be interpreted as playing the track media faster or slower. Note that this only affects the track timeline, not the movie timeline. So for instance, an edit at the start of the movie with segment_duration=5s and media_rate=2.0 would play 10 seconds worth of the contained video track during the first 5 seconds of the movie. At the end of the edit the user would see the movie position in their player UI is at 5 seconds.

Different edits may have different media_rate, but the movie timeline should always advance at the same speed.

Note that not all portions of the movie need to have a portion of the track. It's possible to specify a range of time of the movie where the track would not be played (empty edit). Since ISO BMFF supports layers, other tracks could still be played during that time, although player support for that is quite unlikely.

The behavior when a player finds a region of the movie that is not covered by any track is not standardized by the spec and varies widely among players: some skip it, some wait (either with still image or blank), some vary depending on whether the gap is at the end of the movie or between edits.

The following is an example of an edit list including a bit of everything:

from collections import namedtuple

Edit = namedtuple("Edit", ["duration", "media_time", "media_rate"])

movie_timescale = 600
video_track_timescale = 24000
audio_track_timescale = 48000

def gen_desired_edits(track_timescale):
    return [
        # First 5 seconds of the movie consist of the track media from
        # [4, 4+5) s.
        Edit(5 * movie_timescale, 4 * track_timescale, 1),
        # The following 90 seconds of the movie consist of the track media from
        # [40, 40+90) s.
        Edit(90 * movie_timescale, 40 * track_timescale, 1),
        # The following 20 seconds in the movie, there is nothing.
        Edit(20 * movie_timescale, -1, 1),
        # Then, 10 seconds of video from the beginning of the track
        Edit(10 * movie_timescale, 0 * track_timescale, 1),
        # Then, 5 seconds of still image from the second 1 of the track
        Edit(5 * movie_timescale, 1 * track_timescale, 0),
    ]
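
As a usage sketch (my own addition, not part of any spec), the total movie duration implied by such an edit list is just the sum of the edit durations, converted with the movie timescale:

edits = gen_desired_edits(video_track_timescale)
total_movie_duration_s = sum(edit.duration for edit in edits) / movie_timescale
print(total_movie_duration_s)  # 130.0 seconds (5 + 90 + 20 + 10 + 5)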

Edits are not required by the spec to start on sync frame boundaries, but support for that in actual players is very rare (of the ones I tried, only ffplay was able to deal with it correctly).

Simple edit lists

Edit lists are very powerful, but their support among players is very patchy and they are rarely used to their full extent in the wild — and many players don't support them at all.

By far, the most common use case of an edit list is offsetting a track by a small amount of time. The ISO BMFF spec recommends such an edit list to be used when composition offsets are used and the first frame has CT > 0, so that there is actually a frame on PT = 0.

This case happens whenever a movie has a B-frame, because composition offsets in MP4 are always positive. Let's take this simple example with a single 3-frame GOP with IBP layout (the simplest GOP layout with all the types of video frames):

                     ·---·   ·---·   ·---·
Presentation order:  | A |-->| B |<--| C |
                     ·---·   ·---·   ·---·

In order for decoding to succeed, all dependencies must come before their dependents, so this would be the decode order:

                     ·---·   ·---·   ·---·
Decode order:        | A |   | C |   | B |
                     ·---·   ·---·   ·---·

Let's also set the DTS for the frames in the video track. ISO BMFF requires that DTS starts at zero. This is due to the fact that DTSs are implicitly coded by accumulating the durations of the previous frames. Let's assume this track has 3 frames per second and a track timescale of 300 (300 units per second).

                     ·---·   ·---·   ·---·
Decode order:        | A |   | C |   | B |
                     ·---·   ·---·   ·---·
Duration:             100     100     100
Track DTS:              0     100     200

Next come the track PTS (or CTS, in the parlance of the ISO BMFF spec). We need to satisfy the rule that DTS <= PTS. In other words, we can't show a frame before it has been decoded. This is the best we can get:

                     ·---·   ·---·   ·---·
Decode order:        | A |   | C |   | B |
                     ·---·   ·---·   ·---·
Duration:             100     100     100
Track DTS:              0     100     200
Track PTS (AKA CTS)   100     300     200

Note that the PTS of the first frame is not zero as expected, but 100 track units (a third of a second). Unfortunately, we can't make it any smaller, otherwise B would have DTS > PTS, which is invalid.

ISO BMFF 2015 introduces an exception: non-displaying frames (frames that are supposed to be decoded but not shown) are now supported. In these frames, PTS must be a negative number (so that they are never played) and in consequence, DTS > PTS for these frames.

That is bad, we want our movie to start at zero seconds, not 0.333 seconds.

The problem would be solved very easily if ISO BMFF allowed us to set the track DTS of the first frame to a negative amount (in this case -100 track units), but the format does not allow that. The solution that ISO BMFF gives us for this is edit lists. Definitely an over-engineered solution for such a simple problem, but it is all there is.

By using an edit list, track DTS and track PTS remain the same, but we can construct a movie that only plays the part of the track that actually has content, skipping the empty third of a second at the start of the track.

This is how we do it. In this example we use a movie timescale of 30 (30 units per second), set in mvhd. We can use whatever timescale for the movie we want, as long as we can express the exact amount we want to offset with it (a third of a second in this case). The edit list will have a single edit.

Movie timescale: 30
Track timescale: 300

Edit list:
Edit(duration=30 [1s], media_time=100 [0.333s], media_rate=1.0)

Since there are no frames with a track PTS < 0.333s, we are not discarding any frames from the beginning of the track. Since the duration of the edit reaches the end of the track, we are not discarding any frames from the end of the track either. But thanks to the edit, now the first frame, which has track PTS=0.333 will be displayed at movie PTS=0, fixing the problem.

                     ·---·   ·---·   ·---·
Decode order:        | A |   | C |   | B |
                     ·---·   ·---·   ·---·
Duration:             100     100     100
Track DTS:              0     100     200
Track PTS (AKA CTS)   100     300     200
Movie DTS:           -100       0     100
Movie PTS:              0     200     100
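
A minimal sketch (my own, assuming the single edit above) of how the Movie PTS row of this table is derived from the Track PTS (CTS) row:

# The single edit above: media_time = 100 track units (0.333 s into the track).
edit_media_time = 100  # track units

def movie_pts_from_track_cts(track_cts):
    """Map a track CTS to a movie PTS, both expressed in track units here.
    Only valid for frames covered by the edit (track CTS >= edit_media_time)."""
    return track_cts - edit_media_time

for name, cts in [("A", 100), ("C", 300), ("B", 200)]:
    print(name, movie_pts_from_track_cts(cts))  # A 0, C 200, B 100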

GStreamer specific notes: signed timestamps

GStreamer wraps frames in a class named GstBuffer, which has PTS, DTS and duration fields. These GstBuffer flow in a pipeline serialized with a variety of events.

One important kind of event is GST_EVENT_SEGMENT. It defines a GstSegment, a structure defining timing metadata for the GstBuffer's coming later through the stream.

GstSegment's main purpose is time synchronization: calculations are performed with a GstBuffer and its associated GstSegment in order to figure out the running time PTS of the frame, that is, when the frame should be displayed relative to when playback started. For instance, a frame may have PTS=30s, but if at the beginning of playback we made a seek to 25s, the frame should be played in 5 seconds, not 30. The complete running time formula is explained in the GStreamer design documents.

So far so good.

Much like ISO BMFF track structures, GStreamer made PTS and DTS fields unsigned in GstBuffer. After reading the previous section it can be argued that this was a mistake, but not an easy one to fix without a major ABI break so, also much like ISO BMFF, GStreamer came up with an over-engineered, poorly understood solution to be able to have negative timestamps somehow: stream time.

In this design, GstBuffer.pts and GstBuffer.dts (referred to as buffer time) are supposed to be meaningless by themselves. Instead, calculations are performed using the associated GstSegment to figure out the stream time PTS or DTS. The GstSegment.time field determines what stream time it is for the buffer time specified in GstSegment.start. The full stream time formula is like this:

stream time <timestamp> = (B.<timestamp> - S.start) * abs(S.applied_rate) + S.time

<timestamp> can be either PTS or DTS. B is GstBuffer. S is GstSegment.

Even though all the fields in the formula are unsigned, the result can become negative thanks to the subtraction. Let's take for example the first frame from the section above. We could get valid stream time like this — though it's not the only way:

S = {start: 0.333s, applied_rate: 1.0, time: 0s}
B = {pts: 0.333s, dts: 0s}

stream time PTS = (0.333s - 0.333s) * 1.0 + 0s = 0s
stream time DTS = (0s - 0.333s) * 1.0 + 0s = -0.333s
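
For clarity, the same arithmetic as a small Python sketch (using seconds for readability; real GStreamer code works in nanoseconds and should use the library helper mentioned below):

def stream_time(buffer_ts, segment_start, applied_rate, segment_time):
    """Compute stream time from a buffer timestamp and its GstSegment fields.
    The result can be negative even though every input is unsigned in GStreamer."""
    return (buffer_ts - segment_start) * abs(applied_rate) + segment_time

print(stream_time(0.333, 0.333, 1.0, 0.0))  # stream time PTS = 0.0
print(stream_time(0.0,   0.333, 1.0, 0.0))  # stream time DTS = -0.333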

GStreamer provides a function to perform this calculation: gst_segment_to_stream_time_full(). Beware that you need to check the return value to know the sign of the computed stream time.

Unfortunately, this design is — maybe unsurprisingly — poorly understood by many users, especially since the difference between stream time and buffer time is small in most cases, so not performing this calculation is a common cause of bugs.

In this design, applications should not directly use buffer time.

Note that the timestamps that appear in gst-launch when using silent=false are buffer time, not stream time. You need to perform the calculation yourself in order to get stream time.

GStreamer specific notes: Edit lists in qtdemux

Edit lists require signed timestamps (see Simple edit lists for a common example), so qtdemux needs a way to express them. As explained in the last section, this has to be achieved with GstSegment in some way or another.

One approach qtdemux could have used is to emit a single GstSegment with an arbitrary start, e.g. 100 hours, mapped to zero in stream time. This way negative timestamps would be represented in buffer time as a number less than 100 hours, and calculating stream time would yield a negative number. This approach would be relatively simple and could work in theory — it's what some elements that need negative timestamps, like x264enc, do — but given that stream time is such an obscure feature for so many users and qtdemux buffers reach the whole pipeline downstream, implementing it that way would be a breaking change in practice, even if in theory it shouldn't be, because nothing should take decisions based upon buffer time.

Instead, the approach implemented in qtdemux is as follows:

  • Buffer time maps directly to track time. Track DTS is unsigned in ISO BMFF. Track PTS is only negative for non-displaying frames.
  • A GstSegment is emitted per edit.
  • Stream time determines the movie timeline.

Non fragmented vs fragmented vs segmented MP4

Fragments and segments are not the same thing, but ISO BMFF Bytestream for MSE requires them to have a 1:1 mapping. This actually makes some sense, as it makes streaming easier.

Concerning ISO BMFF (or MP4) files in general, not only the subset of those files used in MSE, a movie may be:

a) Not fragmented. This is the most common case.

b) Fragmented. Fragments were introduced so that devices such as video cameras could record very long movies safely; they are explained in depth in their own section below.

c) Fragmented and segmented files. The idea here is that ordered sets of fragments, named "segments" can be extracted from the original file. The word can makes it a bit more confusing, as segments can be stored in separate .m4s files or all in the same .mp4 file.

The distinction between the last two becomes much more subtle when you consider that every fragmented movie can be understood also as a segmented movie; it becomes a matter of whether we consider the segments or not. More on that in the #segments section.

Frame index in non-fragmented MP4

When a movie is not fragmented, a per-track index of all audio and video frames is stored in moov/trak/mdia/minf/stbl/{stts,ctts,stss,stsc,stsz,stco}. Every track has a single one of each of these boxes, though some of them are optional. It's a bit confusing because each of these boxes encodes a different but related piece of data and all of them compress their tables with RLE, which makes the spec harder to understand; but the basic idea is that the player reads them when it opens the file and this way it knows, for each frame, its DTS, its CTS, whether it is a sync frame, its size in bytes and its offset within the file.

Note about RLE (1): RLE (Run-length encoding) is a simple compression method that works like this: Suppose you have a large array of values (these values may be bytes, 32-bit integers or whatever you want) but many values stored contiguously are equal. With RLE, instead of repeating the same values all over the array, you store an array of compressed entries instead, each consisting of a single value and a header saying how many items share the same value. The ISO BMFF spec uses this technique very extensively when encoding tables, so watch for it.

Note about RLE (2): In order for RLE to work with some tables, often fields are modified to use delta encoding. For instance, instead of storing a list of DTS, a table may encode the difference between the DTS of the frame referred by that entry and the previous one. This way, in the most common case where every frame has the same duration, the table is reduced to a single RLE compressed entry. Watch for this too when reading the spec.

Note about RLE (3): RLE is a general compression schema that may be implemented in several equivalent ways. Although most compressed tables in ISO BMFF use a field specifying how many entries have the same value (e.g. sample_count in stts), a few other tables specify instead the index of the first different element (e.g. first_chunk in stsc).
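
To make these notes concrete, here is a sketch (my own, not from the spec) of how an stts table — a list of (sample_count, sample_delta) entries — expands into per-frame DTS values:

def decode_stts(entries, first_dts=0):
    """Expand the RLE + delta coded stts table into a list of per-frame DTS values.
    entries: list of (sample_count, sample_delta) pairs as stored in the box."""
    dts_list = []
    dts = first_dts
    for sample_count, sample_delta in entries:
        for _ in range(sample_count):
            dts_list.append(dts)
            dts += sample_delta  # delta coding: each entry stores a frame duration
    return dts_list

# 300 frames of 1001 units each, followed by one shorter frame:
print(decode_stts([(300, 1001), (1, 500)])[:3])  # [0, 1001, 2002]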

Furthermore — this time not related to RLE — in order to make the tables more compact at the expense of making the index a little more confusing, the format establishes that frames are stored in chunks within the file. Frames within the same chunk are stored contiguously (with zero bytes of separation).

Finally, these are the mappings defined by each of the boxes. Note that you still need to read the spec if you want to make sense of the fields within these boxes, but having this list handy will serve as a hint of where and what you should be looking for.

  • stco or co64 (Chunk Offset Box): number of chunk ➝ absolute position of the chunk within the file.
  • stts (Decoding Time to Sample Box): number of frame ➝ delta DTS.
  • ctts (Composition Time to Sample Box): number of frame ➝ (CTS − DTS).
  • stsc (Sample To Chunk Box): number of frame ➝ number of chunk and index inside the chunk (i.e. whether it's the first, second, etc. frame within that chunk).
  • stsz or stz2 (Sample Size Box): number of frame ➝ size of the frame in bytes.
  • stss (Sync Sample Box): number of frame ➝ whether it is a sync frame.

Much less common boxes:

  • padb (Padding Bits Box): number of frame ➝ number of padding bits in the last byte of the frame (needed when a video or audio format is not byte-oriented and the decoder needs to be fed a number of bits that may not be a multiple of 8).
  • sdtp (Independent and Disposable Samples Box): number of frame ➝ whether the frame is known to have dependencies on other frames, whether it is known to be a dependency of other frames, whether it is a leading frame*, and whether it has redundant coding (i.e. it can be decoded with different sets of dependencies).
  • stdp (Degradation Priority Box): number of frame ➝ degradation priority (not currently used by either GStreamer or FFmpeg, but it would hint which frames are the least harmful to drop)... or so I think; according to the ISO BMFF spec, «specifications derived from this define the exact meaning», but I've not read any of them so far.

These boxes allow applications to find frames and perform efficient skipping without needing to know anything about the contained video format.

* Leading frames are frames that, in decoding order, arrive after an I-frame, but are presented before it and have no dependencies on it (though they may have dependencies on previous frames). When playback starts at the presentation timestamp of the I-frame, the decoding of leading frames can be skipped and in any case they must not be presented.

Leading frame graphical explanation

Fragmented MP4

MP4 fragments were introduced so that video camera devices could record and play very long sequences without worrying about the moov box growing so much it collided with the mdat box and ensuring that if the power is lost, the video could still be played until the last written fragment.

Fragments are blocks of video and audio appended to MP4 files in such a way they extend the original movie and any preceding fragments. Appending a new fragment to an MP4 file with support for fragmentation does not require modifying previous structures within the file. A video where the last fragment has been lost can be played to that point without issues.

A fragment consists of a moof (Movie Fragment Box) followed by any number of mdat and free boxes.

You can create a fragmented MP4 file very easily with MP4Box. The following command remuxes a movie so that new fragments appear every 60 seconds.

MP4Box -frag 60000 sintel_trailer-720p.mp4

Fragmented movie metadata

Most metadata is similar between fragmented and non-fragmented video, but there are new boxes: not only inside moof, but also inside moov. This new metadata in moov forms the mvex box (Movie Extends Box).

The mere presence of the mvex box warns readers that there may be movie fragments in the file, so they should scan the file until the end for them. But also, mvex is a container box that has these other boxes inside:

  • It may contain a mehd box (Movie Extends Header Box) that says how long the movie is, in movie time units. Camcorders can write this box once when they create the file and update it every time they add a new fragment. Doing so does not need to displace any data since it's just a fixed 32 or 64 bit integer.

    The mehd box allows players to show a status bar without reading the entire file. Unfortunately it's optional and even then, some authoring programs (e.g. MP4Box) include the box but set the field to zero.

  • For each track in the movie there is a trex box that sets default values for that track in the following fragments. This is just another compression trick MP4 uses, as all those values can later be overridden in the individual fragments.

  • An optional trep container box may also appear for each track, containing extensions. These were added in ISO/IEC 14496-12:2015 but are not common.

Frame indices in fragmented MP4

The tables from the section before (stco, stsc and the like) are not very useful for fragmented MP4: They only describe the part of the movie that is not fragmented. According to the ISO BMFF spec, a movie can have a regular non-fragmented part before any fragments (this way a player not supporting fragments could at least play the non-fragmented part), but for the sake of orthogonality, when a movie is fragmented usually all its frames are in fragments and no frames are in the non-fragmented part. The moov box still has the metadata of the movie (e.g. number and type of tracks) but its frame index boxes enumerate no frames.

MSE note: This common practice of having no frames in moov in fragmented MP4 becomes a mandatory requirement in the MSE ISO BMFF spec.

Inside moof there are two types of boxes. The first one, mfhd (Movie Fragment Header Box) contains a sequence number that serves as a safety check. It's an error to construct a file where fragments are out of order.

For each track being extended by the fragment there is a traf box (Track Fragment Box) with its own frame index. It contains a tfhd (Track Fragment Header Box) and several trun boxes (Track Fragment Run Box).

The tfhd box

tfhd has several fields that set some default values for its neighboring trun boxes, a track_id field that specifies what track this track fragment corresponds to, some flags and — most importantly — an optional base_data_offset that defines a position within the file that is used as a base to be summed to the offsets found in the trun boxes in order to locate the frames. Depending on whether base_data_offset is present and other flags there are three ways of addressing frames in fragments:

  • The sane way: If base_data_offset is not present and the flag default-base-is-moof (0x020000) is set, trun offsets are relative to the beginning of the fragment, that is, the beginning of the moof box. This is a reasonable base address: offsets that stay within the size of the fragment cannot fall outside of it, and it does not require you to know the size of previous fragments, which makes it easy to move fragments into separate files.

    It could be argued that the base_data_offset field is actually quite pointless, since these defaults already make enough sense and introducing it makes the format unnecessarily complicated. Unfortunately, when MP4 fragments were defined, not only was this questionable field added, but this way of addressing frames was not even considered.

    This addressing requires the iso5 brand as it's not supported in earlier versions of the ISO BMFF spec.

  • The complicated way: In the absence of both the base_data_offset field and the default-base-is-moof flag, addressing is different for each track:

    The first* track uses the moof box as the base address, just as in the point above. The second and subsequent tracks in the fragment use the end of the data referenced by the preceding track as their base address.

    This way fragments can also be moved easily without invalidating their offsets, but this addressing is unfortunately a bit ambiguous and does not work intuitively with interleaving of video and audio frames, given the definition above.

    * Unfortunately the spec does not clarify a definition of "first track". Does it mean the one whose traf comes first, the first one of these to not specify base_data_offset nor default-base-is-moof or the one with the lowest track ID? I don't know. Often all these possible definitions coincide as traf boxes are usually written in track ID order and all use the same kind of addressing.

  • The redundant, file-dependent way: If base_data_offset is present, the base address used is the value specified in it, interpreted as an offset from the beginning of the file. This field must be updated if the fragment is moved.

    This way of addressing requires having knowledge of file offsets, therefore it's incompatible with MSE, since MSE has no way to know which file and what position within it a media segment came from.

The MSE ISO BMFF bytestream requires either of the first two ways of addressing. It's an error to use the third in MSE, which must trigger the Append Error Algorithm.

The trun boxes

For some reason, frame index tables in fragments have a completely different format than frame index tables in the movie header. Fortunately, this format is considerably simpler!

Frames of the same track within a fragment are grouped contiguously in runs (note in non-fragmented media these were called chunks, but they are essentially the very same thing).

Each run is encoded in a separate trun (Track Fragment Run Box). trun boxes specify an offset that is summed to the base offset defined or computed in tfhd and defines where the run starts in the file. By decoding trun boxes the demuxer obtains a table with the following columns:

  • Number of frame.
  • Size of the frame in bytes.
  • Duration of the frame.
  • Composition time offset (CTS − DTS).
  • Sample (frame) flags: An innocent-looking 32 bit field coding all of these. The layout of this field is explained in the definition of trex (not trun!) within the ISO BMFF spec.
    • Whether this is a sync frame
    • Whether the frame is known to have dependencies on other frames
    • Whether the frame is known to be a dependency of other frames
    • Whether this is a leading frame
    • Whether it has redundant coding
    • Its degradation priority
    • The number of padding bits

As a way of achieving compression, neither frame numbers nor DTS are directly encoded, but sequential trun boxes describe sequential sets of sequential frames. DTSs are computed by adding duration to the DTS of the previous frame (initially the last DTS + duration defined in the non-fragmented movie, but more usually, zero).

Also, defaults can be set for all the table columns so that fewer columns have to be encoded.

There is also an optional first_sample_flags field that allows overriding the sample flags default only for the first frame, as it's often the case that a run consists of a sync frame followed by many non-sync frames, with all the other flags cleared.
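
As an illustration (a simplification of my own that handles only the per-sample fields discussed here and ignores the defaults coming from tfhd and trex), a demuxer could expand one trun roughly like this:

def expand_trun(base_offset, data_offset, first_dts, samples):
    """Expand one trun into (file_offset, dts, cts, size) tuples.
    samples: list of dicts with 'size', 'duration' and 'cts_offset' keys."""
    entries = []
    offset = base_offset + data_offset  # data_offset is the trun's own offset field
    dts = first_dts
    for sample in samples:
        cts = dts + sample["cts_offset"]  # composition time offset is (CTS - DTS)
        entries.append((offset, dts, cts, sample["size"]))
        offset += sample["size"]    # frames in a run are stored contiguously
        dts += sample["duration"]   # DTS is implicit: accumulated durations
    return entries

# e.g. two frames of 100 time units each, starting 1024 bytes after the base offset:
print(expand_trun(0, 1024, 0, [
    {"size": 4000, "duration": 100, "cts_offset": 100},
    {"size": 1500, "duration": 100, "cts_offset": 200},
]))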

Seeking in fragmented MP4 files

Seeking in files with lots of fragments is problematic, as normally you would need to scan the entire file in order to read each of the frame indices in the moof boxes until that point. To make seeking in long fragmented movies faster, the optional mfra box (Movie Fragment Random Access Box) was introduced.

It consists of several indices (one per track), placed at the end of the file, listing some sync video frames. For each listed frame, its position in the file and the position of the moof of the fragment containing it is specified. Not all sync frames need to appear in the mfra box.

The values used in this table are presentation timestamps expressed in the timescale of the track.

The player, once it finds that the movie is fragmented (by finding the mvex box), can look at the end of the file for this index, which allows it to quickly find the desired playback point.

There is a slight problem with that approach though: boxes cannot normally be read backwards, as the size of a box is stated at its beginning, not at its end. A reader could therefore not normally differentiate between the size field and arbitrary data stored inside a box. To work around this limitation, the mfra is actually a container with two child box types, like this:

mfra (Movie Fragment Random Access Box)
 `- tfra (Track Fragment Random Access Box)
 `- mfro (Movie Fragment Random Access Offset Box)

There is a tfra box for each track containing the index explained above. The mfro box, on the other hand, contains a single 32-bit integer stating the size of the entire mfra box. A reader would read these last bytes, seek back to (size of the file − size of mfra) and use the index if an mfra box is actually found there.
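
A sketch of that lookup (my own; it assumes the last box in the file really is an mfra and does not guard against the false-positive case described in the note below):

import os
import struct

def find_mfra_offset(path):
    """Locate the mfra box by reading the mfro box at the very end of the file."""
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        # mfro is a FullBox: size (4) + type (4) + version/flags (4) + mfra size (4).
        f.seek(file_size - 16)
        _size, box_type, _verflags, mfra_size = struct.unpack(">I4sII", f.read(16))
        if box_type != b"mfro":
            return None  # no random access index at the end of this file
        f.seek(file_size - mfra_size)
        _size, box_type = struct.unpack(">I4s", f.read(8))
        return file_size - mfra_size if box_type == b"mfra" else None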

Note: Being pedantic, this method is not completely bullet-proof. Technically, another box (e.g. a mdat or free box) could have at the end of its contained data something that could be parsed like an mfra box that also contained something that could be parsed as an mfro box, yet not technically be an actual box if the file was completely parsed from start to finish. This could be reproduced by changing the size field of the last mdat box to reach the end of the file.

Beginning time of fragments

Fragments as explained until now don't specify their starting decoding time. There are actually three ways to know the starting time of a fragment:

a) Reading all the moof boxes, parsing all their trun boxes and summing their frame durations.

b) If performing random access with a tfra table, it can also be calculated by looking at the presentation time of the sample specified in it, converting it into composition time, finding the associated sample within the trun box (which is also specified in tfra), calculating its decoding time by subtracting its composition offset, and then subtracting the duration of every preceding sample in the fragment. (As complicated as it sounds, this computation is actually explicitly mentioned in the ISO BMFF spec.)

c) If the fragment contains it, read the tfdt box (Track Fragment Decoding Time), which is inside moof/traf. This box specifies the exact decoding time of the first sample of the fragment. Although this is the simplest way, this box is not guaranteed to exist or be respected by players.

tfdt was actually added in ISO/IEC 14496-12:2012, coinciding with the standardization of MPEG-DASH, which requires it: having a box that specifies the starting time of a fragment allows fragments to be played independently without needing to provide that bit of information out of band. tfdt boxes are required by the MSE ISO BMFF Bytestream spec. Per spec, browsers must run the append error algorithm if a traf does not contain a tfdt box.
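
As an illustration (my own sketch), reading the decode time out of a raw tfdt payload (the bytes following the box size and type) looks like this:

import struct

def parse_tfdt(payload: bytes):
    """Return baseMediaDecodeTime from a raw tfdt payload (version, flags, time)."""
    version = payload[0]
    if version == 1:
        return struct.unpack(">Q", payload[4:12])[0]  # 64-bit decode time
    return struct.unpack(">I", payload[4:8])[0]       # version 0: 32-bit decode time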

Segments

Also related to MPEG-DASH, ISO/IEC 14496-12:2012 introduced the concept of segments, which are pieces of a movie that can be split and served separately for streaming. The can part is very important: a segment may or may not be stored in an independent file, and a file may contain more than one segment.

A segment, as defined in the ISO BMFF spec is either of the following:

a) A portion of a file that includes a moov box and its related boxes, not including any fragments.

b) One or more movie fragments.

Usually segmented files do not contain any frames in the non-fragmented part, but this is not a requirement of ISO BMFF.

Segments are confused easily with fragments, so it can be helpful to pick up some similarities and differences from their definitions:

  • Every fragmented MP4 file has, per definition, segments.
  • Every fragment constitutes a valid segment.
  • The non-fragmented part of a fragmented MP4 file does not constitute a fragment but constitutes a segment.
  • Every fragment contains a single moof box but a segment can contain more than one fragment, and therefore, more than one moof box.

Segments in MSE

The definition of segments in MSE ISO BMFF is much stricter than in ISO BMFF. A segment, per the MSE definition is either:

a) A portion of a file that includes a moov box and its related boxes, not including any fragments. These are called initialization segments. MSE initialization segments must not contain any frames.

b) A single movie fragment. These are called media segments.

Note that in the MSE definition segments and fragments are much more related than in ISO BMFF. Here are some conclusions:

  • Every movie used in MSE is fragmented and has segments.
  • Media segments consist of a single fragment, so they're pretty much the same thing for practical purposes.

MSE also adds more restrictions to segments/fragments than ISO BMFF in other regards:

  • Every fragment must start with a sync frame (i.e. fragments used with MSE must not have any dependencies on each other).
  • Every fragment must include a tfdt box.
  • Every fragment must use movie-fragment relative addressing, as explained in The tfhd box.

Reified segments, subsegments and segment trees

The definition of segments does not require any boxes or additional markup. Therefore, any given fragmented movie could be segmented in many different ways (e.g. using one, two or three fragments per segment).

ISO BMFF also defines a subsegment as an interval of a segment formed from movie fragments that is also a valid segment. This allows for a recursive description of segments, where every segment can be understood as a set of smaller subsegments.

Streaming implementations may want to define a certain arrangement of segments in a certain markup so that clients can find them. MPD files used in MPEG-DASH provide an implementation of such markup out-of-band.

Another markup that can be used alternatively or in conjunction with MPD files is the sidx (Segment Index Box). It consists of a table specifying how a segment is subdivided. A file indexed this way would usually have a top level segment containing all fragments, which is divided with this box. For each subsegment it is specified:

  • Whether the entry defines a leaf subsegment containing fragments or a subsegment that is further subdivided in deeper subsegments.
  • A file offset and size pointing to the beginning of the leaf subsegment or — if it's not a leaf subsegment — a deeper sidx box.
  • The presentation time covered by the segment, in a custom timescale specified in the sidx box. It's highly recommended that this timescale matches the track timescale.

Note that these layouts and markup formats are not processed by MSE-enabled browsers themselves. Instead, it is the client library that must parse one or more of these in order to find the segments to feed to the MSE API.

Segmenting media with MP4Box

There are several ways to create segmented media with MP4Box:

  • Split a movie into segments covering 60 seconds each, store them all in the same file, with a sidx box for each, further decomposing it into a subsegment per fragment. These sidx boxes are not referred to from a top-level sidx box. An MPD file is also created pointing to each 60-second segment.

    MP4Box -dash 60000 movie.mp4
    

    A player performing a seek would first read movie_dashinit.mp4 to find the byte range of the 60-second segment and the location of its sidx box. Then, it would read the sidx box to find the leaf subsegment (fragment) containing the desired presentation time.

  • Same as before, but ensure that every fragment (and therefore every segment) starts with a random access point as required by MSE, even if the durations of the segments are no longer exactly 60 seconds.

    MP4Box -dash 60000 -frag-rap movie.mp4
    
  • Same as before, but store each 60 second segment in a different file.

    MP4Box -dash 60000 -frag-rap -segment-name segment_ movie.mp4
    

    Every segment file contains a single sidx box describing its leaf subsegments (fragments).

    The MPD file no longer includes file offsets, but points to the entire segment files instead, which have the .m4s extension. A player would find the file, read its sidx box and find the leaf subsegment with the desired presentation time.

  • Create a one-level deeper segment hierarchy: subdivide each 60-second segment into 4 subsegments which in turn contain the leaf subsegments (fragments).

    MP4Box -dash 60000 -frag-rap -segment-name segment_ -subsegs-per-sidx 4 movie.mp4
    

    In this configuration there are 5 sidx boxes per .m4s file: a root one and one for each child subsegment.

  • Alternatively, create a daisy chain of subsegments. Each sidx is limited to a few entries; all except the last are leaf subsegments, and the last entry instead points to a nested sidx following the same rules, until there are no more leaf subsegments.

    MP4Box -dash 60000 -frag-rap -segment-name segment_ -daisy-chain movie.mp4
    