VP9 DTS - ntrrgc/media-notes GitHub Wiki

How VP9 compound prediction works in containers

VP9 works in such a way that, as far as the demuxer is concerned, there is no bi-prediction and there are no forward references. As a consequence, DTS=PTS, which is nice.

Under the hood, there is still something very similar to bi-prediction (as that's important for the format to compress efficiently), but it is not exposed to the container as it is in earlier formats.

This works thanks to several tricks:

  • VP9 frames can be visible or hidden.

  • Instead of bi-prediction there is compound prediction, which is the same except that both reference frames must precede the predicted frame in decoding order.

  • It's very cheap to create a frame that reuses an earlier frame. If there are no differences to show on top of it, it's just one or two bytes (according to the spec).

  • So let's imagine we want to have the following frames in presentation order:

    +-----+   +-----+   +-----+
    |  A  |-->|  B  |<--|  C  |
    +-----+   +-----+   +-----+
    

    (The frame at the pointy end of an arrow depends on the frame at the other end.)

    We cannot have references into the future in VP9, so instead of that, we turn C into a hidden frame (C_h) and move it before B. Then, we add a visible frame after B (C') that is just a compressed copy of C_h.

       +--------------------+
       |         +--------+ |
       |         |        v v
    +-----+   +-----+   +-----+   +-----+
    |  A  |   | C_h |   |  B  |   |  C' |
    +-----+   +-----+   +-----+   +-----+
                 |                   ^
                 |                   |
                 +-------------------+
    

    The above does not play very nicely with containers, because C_h must have a zero-ish duration and no PTS, which is quite problematic. So they came up with a great idea: superframes.

    A superframe is (usually*) one or more hidden frames followed by a visible frame. Superframes are defined at the video format level: the container is oblivious to whether its chunks are regular frames or superframes.

    * The spec does not require superframes to follow any specific pattern of frames: they may contain several or zero frames of each kind (visible and hidden). The pattern described above is the one that works best with containers, though, as we usually want the container to hold an index of visible frames.

                +-------------------+
        +-----+ | +-----+   +-----+ | +-----+
        |  A  | | | C_h |   |  B  | | |  C' |
        +-----+ | +-----+   +-----+ | +-----+
                +-------------------+
            
    PTS    1              2              3
    DTS    1              2              3
    

    This way, as far as the demuxer and upper layers are concerned, there is nothing like B-frames and PTS=DTS.
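
To make the superframe idea concrete, here is a minimal Python sketch of how a decoder-side layer could split one container chunk back into the frames it carries. It assumes the superframe index layout as I understand it from the VP9 bitstream specification's superframe annex: the chunk ends with an index whose first and last byte are the same marker (bit pattern 110xxyyy, where xx is the number of bytes per size field minus one and yyy is the number of frames minus one), with little-endian frame sizes in between. The names `split_superframe` and `build_superframe` are hypothetical, and the payloads used with them are dummy bytes, not real VP9 frames.

```python
def split_superframe(chunk: bytes) -> list[bytes]:
    """Split one container chunk into the VP9 frames it carries.

    A chunk without a valid superframe index is returned as a single frame.
    """
    if not chunk:
        return []
    marker = chunk[-1]
    # A superframe index ends with a marker byte of the form 110xxyyy.
    if marker & 0xE0 != 0xC0:
        return [chunk]
    frame_count = (marker & 0x07) + 1         # yyy: frames in superframe, minus 1
    size_bytes = ((marker >> 3) & 0x03) + 1   # xx: bytes per size field, minus 1
    index_size = 2 + frame_count * size_bytes
    # The same marker byte must also open the index; otherwise this is a plain frame.
    if len(chunk) < index_size or chunk[-index_size] != marker:
        return [chunk]
    payload_end = len(chunk) - index_size
    sizes_start = payload_end + 1
    frames = []
    offset = 0
    for i in range(frame_count):
        field = chunk[sizes_start + i * size_bytes:sizes_start + (i + 1) * size_bytes]
        size = int.from_bytes(field, "little")
        if offset + size > payload_end:
            raise ValueError("frame size exceeds superframe payload")
        frames.append(chunk[offset:offset + size])
        offset += size
    return frames


def build_superframe(frames: list[bytes]) -> bytes:
    """Hypothetical helper: concatenate frames and append a superframe index."""
    if not 1 <= len(frames) <= 8:
        raise ValueError("a superframe index can describe 1 to 8 frames")
    size_bytes = max(1, max((len(f).bit_length() + 7) // 8 for f in frames))
    if size_bytes > 4:
        raise ValueError("frame too large for a 4-byte size field")
    marker = 0xC0 | ((size_bytes - 1) << 3) | (len(frames) - 1)
    sizes = b"".join(len(f).to_bytes(size_bytes, "little") for f in frames)
    return b"".join(frames) + bytes([marker]) + sizes + bytes([marker])


# Dummy payloads standing in for the hidden frame C_h and the visible frame B.
c_hidden = b"\x01\x02\x03"
b_visible = b"\x04\x05"
sample = build_superframe([c_hidden, b_visible])
recovered = split_superframe(sample)
```

In this sketch the container would store `sample` as a single chunk with one PTS, and only the layer that calls `split_superframe` learns that it holds two frames. Telling hidden frames from visible ones would additionally require reading each frame's header, which is left out here.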