SVT AV1 Encoder Design Guide - AlexBoswellVCD/gitlab_wiki_test GitHub Wiki
- Revision History
- Table of Contents
- Introduction
- Definitions
- High-level encoder architecture
- Inter-process data and control management
-
Encoder Processes and Algorithms
- Resource Coordination Process
- Picture Analysis Process
- Picture Decision Process
- Motion Estimation Process
- Initial Rate Control Process
- Source-based Operations Process
- Picture Manager Process
- Rate Control Process
- Partitioning, Mode Decision and Encoding
- Loop Filter Process
- Constrained Directional Enhancement Filter Process
- Restoration Filter Process
- Entropy Coding Process
- Packetization Process
-
Detailed Feature Implementation Design Appendices
- Altref Appendix
- CDEF Appendix
- CfL Appendix
- Compliant Subpel Interpolation Filter Search Appendix
- Deblocking Loop Filter (LF) Appendix
- Filter Intra Appendix
- Global Motion Appendix
- Intra Block Copy Appendix
- Local Warped Motion appendix
- OBMC Appendix
- Palette Prediction Appendix
- Restoration Filter Appendix
- Subpel Interpolation in the Open Loop Motion Estimation Appendix
- Legal Disclaimer
This document describes the Intel SVT-AV1 encoder design. In particular, the encoder block diagram and system resource manager are described. In addition, the document contains detailed descriptions of the SVT-AV1 encoder-specific algorithms such as those used for motion estimation, mode decision, quantization, etc. This document is meant to be an accompanying document to the “C Model” source code, which contains the more specific details of the inner workings of each algorithm.
This section contains definitions used throughout this design document.
Term | Definition |
---|---|
Picture | collection of luma and chroma samples assembled into rectangular regions with a width, height, and sample bit-depth. |
Super block (SB) | a square block of luma and chroma samples defined to have a size of either 64 or 128 luma samples. SBs are the root node of a tree data structure. |
Block | a square or rectangular region of data that is part of a SB and that is obtained through the partitioning of the SB. |
Transform block | a square or rectangular region of data whose size is the same as or smaller than the size of the corresponding block. |
Bitstream | a collection of bits corresponding to entropy coded data. |
Syntax elements | pre-entropy coder encoded symbols used during the decoding reconstruction process. |
Tiles | a rectangular collection of SBs which are independently decodable. |
Groups-of-Pictures (GoP) | a collection of pictures with a particular referencing structure. |
SAD | sum of absolute differences, representing the sum of absolute values of sample differences; distortion measurement. |
SSE | sum of squared sample error; distortion measurement. |
The source video is partitioned into various groupings of various spatial and temporal divisions. The following partitions and nomenclature are used extensively within this document and the source code. Furthermore, the following partitioning scheme determines data flow and influences algorithmic designs. At the highest level, pictures in the source video are grouped into groups of pictures (GoPs) that are defined according the prediction structure. Figure 1 shows an example of the relationship between pictures contained in a five layer prediction structure where each frame references only one picture in each direction. In a prediction structure, each picture is of a particular prediction type and belongs to a specific temporal layer. Also, each picture might reference other pictures depending on its picture type and it might be referenced itself multiple times or not at all depending on the Prediction Structure used and the picture’s relative position within the period. In the example shown in Figure 1, Pictures 0 and 16 are said to belong to temporal layer 0 or base layer, whereas pictures 1, 3, 5, 7, 9, 11, 13 and 15 are said to belong to the non-reference layer or temporal layer 4.
Figure 1: Five-layer prediction structure used in the SVT-AV1 encoder with one reference picture in each direction.
Figure 2 shows the relationship between Pictures, Tiles, SBs, blocks and transform blocks.
Figure 2: Picture partitioning diagram.
The encoder is designed around processes. A process is an execution thread in software or an IP core in hardware. Processes execute one or more encoder tasks (e.g., motion estimation, rate control, deblocking filter, etc.). Processes are either control oriented and are picture based, or are data processing oriented. An example of a control process is the Picture Manager Process, which determines the prediction structure and when input pictures are available to start encoding depending on the state of the decoded reference picture buffer. An example of a data processing process is the Motion Estimation Process, which performs Motion Estimation on the input picture data. Only one process of each control process is allowed since control decisions cannot be made independently of each other without risking data leakage and deadlock scenarios.
To facilitate parallel processing on a variety of compute platforms (e.g., multi-core general purpose CPUs, GPUs, DSPs, FPGAs, etc.), inter-process communication should be minimized. One important design feature of the SVT-AV1 encoder architecture is that processes are stateless. All information related to process states is conveyed via inter-process control and data information. System resource managers manage inter-process control and data information coherence through FIFO buffers and are discussed in detail later.
The architecture is flexible enough to support an implementation in which one superblock (SB) at a time is encoded through the entire pipeline. Alternatively, a single encoder task might process all SBs in a picture before moving on to the next task. In the latter case for example, motion estimation might be performed on all SBs within a picture before moving to the next task. Statistics about each picture important for rate control are gathered early in the processing pipeline and used in subsequent tasks. To allow for low-delay, the rate control process may proceed before all motion estimation picture processes are complete. The picture QP value is derived using information available to it at the time the QP is derived. The QP may be varied on a SB basis within the coding loop process. The next major processing step involves the processes associated with the coding loop. The latter includes encoding tasks such as intra prediction, mode decision, transform, quantization, in-loop filtering, entropy coding etc. Subsequent processing would involve the filtering operations that include the in-loop filter, CDEF and restoration filter.
An important aspect of the encoder architecture is that it supports multiple encoder instances. That is, an encoder implemented based on the architecture described in this document would be capable of encoding a 720p sequence and a completely independent 1080p sequence simultaneously. At any given time, some of the motion estimation, coding loop, entropy encoder, etc. processes would be working on the 720p sequence while the other processes would be working on the 1080p sequences. In addition, the architecture supports encoding different pictures (whether they are associated with the same instance or not) in the pipeline at any given time instant.
In the SVT-AV1 encoder, a picture could be divided into segments. Parallelism in the encoder could be achieved at multiple levels. At the process level, different processes could be running simultaneously, where each process could, for example, be performing a different task in the encoding pipeline. At the picture level, multiple instances of the same process could process different pictures simultaneously. At the segment level, multiple instances of a given process could process different segments from the same picture simultaneously.
A high-level diagram of the encoder pipeline is shown in Figure 3, with more details provided in Figure 4. An illustration of the segment level processing is shown in Figure 5.
Figure 3. High-level encoder process dataflow.
Figure 4. Detailed encoder process dataflow.
Figure 5. Illustration of segment-based processing: The squares in the picture represent segments. Light yellow segments have already been processed. Dark yellow segments are being processed in a parallel manner.
System resource managers perform inter-process data and control management. They manage objects and connect processes to one another by controlling how objects are passed. Objects encapsulate data and control information and are organized into four types: results, sequence control sets, picture control sets, and picture descriptors. Objects are described later in this section.
Figure 6 shows a block diagram of a system resource manager. As depicted in the diagram, the empty object path begins when an empty object from the empty object FIFO is assigned to one of N producer processes. The producer process fills the empty object with data and control information and queues the now full object onto the full object FIFO. In a manner similar to that of the empty object path, the full object path begins when a full object from the full object FIFO is assigned to one of M consumer processes. The consumer process uses the information in the full object and completes the data path by queuing the now empty object back onto the original empty object FIFO. To better understand how the encoder block diagram in Figure 4 and the system resource manager block diagram in Figure 6 relate to one another, we have used matching line colors to indicate corresponding object flow. It is important to note that each encoder process acts as both a producer and consumer of objects to processes occurring later, and respectively, earlier in the encoder pipeline.
The system resource manager dynamically assigns objects to processes so as to minimize idle process time. In addition, separate coordination of the empty and full object paths allows a great deal of configuration flexibility. This flexibility is important when, for example, producer and consumer processes require differing amounts of computational resources. In this case, a system resource manager may have N producers and M consumers where N is not equal to M.
Figure 6: System resource manager dataflow.
Objects encapsulate inter-process data and control information. While the term object may have a wide variety of meanings, in this document object shall be used exclusively to refer to four types of data and control information: sequence control sets, picture control sets, picture descriptors, and results. The objects are described in detail below.
The sequence control set object contains a mixture of system resources (e.g., system resource managers, a picture descriptor pool, API callback functions, etc.) and encoder configuration parameters that may apply to more than one picture (e.g., GOP length, prediction structure, number of reference pictures, motion estimation search range, target bit rate, etc.).
The data structures associated with sequence control set objects are defined in EbSequenceControlSet.h.
The picture control set object contains encoder configuration parameters that apply to a single picture (e.g., picture descriptors, reference lists, SB motion and mode information, etc.). The information carried by the results object has a lifetime of at most one picture time interval.
The data structures associated with picture control set objects are defined in EbPictureControlSet.h.
The picture descriptor object contains data and control information that describe picture attributes. The input, enhanced, motion compensated prediction, reconstructed, and reference pictures all have associated picture descriptor objects. In addition, transform coefficients generated in the coding loop process and consumed by the entropy coding process have an associated picture descriptor. Examples of data and control information conveyed by picture descriptor objects include: color space, bit depth, picture dimensions, and pointers to memory locations containing the picture data.
The data structures associated with picture descriptor objects are defined in EbPictureBufferDesc.h.
The results object is used to convey data and control information between two processes. Producer processes fill results objects while consumer processes use the information conveyed in those objects. A consumer process waits (i.e., blocks) for a results object to arrive from the preceding producer process before commencing execution. Results objects convey process state information explicitly in the case of encoder parameters (e.g., bit depth, color space, etc.) and implicitly in the case of system timing (e.g., when to begin execution).
The data structures associated with results objects are defined in results header files, e.g, EbRateControlResults.h.
The system resource managers manage objects and connect processes to one another by controlling how objects are passed. They provide the link between processes through which nearly all inter-process communication takes place. The light green lines on the encoder block diagram indicate rare instances where inter-process communication is not handled by the system resource manager. The encoder block diagram shows only system resource manager associated with results objects. It is important to understand that the encoder block diagram does not depict system resource managers associated with other objects.
As shown in Figure 5, the system resource managers comprise many individual components. These components are described in this subsection.
The empty object FIFO contains empty objects that have not yet been assigned to a producer process. When an object is released, it is placed back into the empty object FIFO. As with all objects, empty objects are not restricted to use within the context of a single resource manager. For example, sequence control set objects are passed throughout the full encoder pipeline.
The ith producer empty object FIFO contains empty objects that have been assigned to the ith producer process. The ith producer process requires an empty object from the ith producer empty object FIFO in order to begin execution. The ith producer empty object FIFO is hardwired to the ith producer process.
The producer process works to convert empty objects provided by its producer empty object FIFO into full objects. The producer process queues recently filled objects onto the full object FIFO. It is important to understand that a process that is a producer process in a system resource manager can simultaneously be a consumer process in a different system resource manager. When a producer process completes its task, it places itself on the producer process FIFO.
The producer process FIFO contains producer processes waiting to be assigned an empty object on which to work. Without the producer process FIFO, producer processes would have to poll the empty object FIFO that can lead to inefficiencies when the empty object FIFO contains no empty objects.
The full object FIFO contains full objects that have not yet been assigned to a consumer process. Only system resource managers that manage results objects have full object FIFOs.
The jth consumer full object FIFO contains full objects that have been assigned to the jth consumer process. The jth consumer process requires an empty object from the jth consumer full object FIFO in order to begin execution. The jth consumer full object FIFO is hardwired to the jth consumer process.
The consumer process uses full objects provided by its consumer full object FIFO. The consumer process queues used objects onto the empty object FIFO. It is important to understand that a process that is a consumer process in a system resource manager can simultaneously be a producer process in a different system resource manager. When a consumer process completes its task, it places itself on the consumer process FIFO. System resource managers managing results objects are the only resource managers with an associated consumer process. Sequence control set, picture control set, and picture descriptor objects do not have consumer processes.
The consumer process FIFO contains consumer processes waiting to be assigned a full object to use. Without the consumer process FIFO, consumer processes would have to poll the full object FIFO which can lead to inefficiencies when the full object FIFO contains no full objects.
This subsection contains an execution flow snapshot whose purpose is to show how the individual components of the system resource manager interact with one another.
-
The producer process produces a full object (dark green arrows)
-
The full object demux queues the object in the full object FIFO
-
The full object MUX checks the status of the consumer process FIFO. If the consumer process FIFO
-
is not empty then the first object in the full object FIFO is assigned to the first process identified in the consumer process FIFO. The assignment wakes the first process in the consumer process FIFO.
-
is empty, then the producer process requests an empty object from its empty object FIFO and continues to step 1
-
-
The consumer process awoken in step 3a begins to work and consumes the full object (blue arrows)
-
The consumer releases its used full object whereupon it is returned to the Empty Object FIFO (red arrows)
Note: steps 1 and 2 are carried out by the producer process. Step 3 is carried out in the system resource code, while steps 4 and 5 are executed by the consumer process.
Note: The ith producer process will not become active if the ith producer empty object FIFO contains no objects
Note: The ith consumer process will not become active if the ith consumer full object FIFO contains no objects
Note: the tangerine arrows depict the producer processes’ acquisition of empty objects and are not described in the execution flow snapshot above.
There are many signals that are used throughout the encoder at various stages and for various amounts of time. The following groups of signals are used extensively throughout the encoder.
The configuration sets are general settings that can be used to control the encoder externally.
Signal | Description |
---|---|
frame_rate | Input frame rate |
intra_period_length | Interval between successive I-pictures |
hierarchical_levels | The number of levels of hierarchy in the Prediction Structure |
The Sequence Control Set contains information specific to the whole sequence. The relevant data structure is SequenceControlSet_s.
Signal | Description |
---|---|
max_input_luma_width | Width of picture |
max_input_luma_height | Height of picture |
profile_idc | Defines the encoding profile |
level_idc | Defines the encoding level |
The Picture Control Set contains the information for individually coded pictures. The information is split between Picture Parent Control Set and Picture Control Set. Picture Parent Control Set contains information that is needed in the first few processes in the pipeline (from Resource Coordination process to Source-Based Operations process). The Picture control set includes a pointer to the Picture Parent Control Set and includes also additional information needed in the subsequent processes starting at the Picture Manager process. The relevant data structures are PictureParentControlSet_s and PictureControlSet_s.
Signal | Description |
---|---|
av1_frame_type | 0 (KEY_FRAME) to 3 (S_FRAME) |
show_frame | 1/0: Display or not decoded frame |
is_skip_mode_allowed | 1/0: sets whether using skip modes is allowed within the current frame or not |
allow_high_precision_mv | 1/0 Enable/disable eighth pel MV precision |
base_qindex | quantization base index for the picture |
Picture descriptors provide information on the various picture buffers used in the encoder. The relevant data structure is EbPictureBufferDesc_s.
Signal | Description |
---|---|
buffer_y | Address of the luma buffer |
buffer_cb | Address of the Cb chroma buffer |
buffer_cr | Address of the Cr chroma buffer |
origin_x | Contains the x-padding offset of the picture buffer (luma) |
origin_y | Contains the y-padding offset of the picture buffer (luma) |
width | Luma picture width excluding padding |
height | Luma picture height excluding padding |
bit_depth | The bitdepth of the buffers |
luma_size | The size (bytes) of the luma buffers |
chroma_size | The size (bytes) of the chroma buffers |
stride_y | The number of luma samples to increment a position to the next line (includes padding) |
stride_cb / stride_cr | The number of chroma samples to increment a position to the next line (includes padding) |
The following section describes the processes and algorithms used in the SVT-AV1 encoder. The first-level subsections describe the operation of each concurrent process found in Figure 3 and Figure 4. Second-level and lower subsections describe high-level algorithm behavior and design considerations.
The Resource Coordination Process performs two functions, namely it gathers input data and distributes encoder-setting changes properly into the encoder pipeline. Input data is passed to the encoder in packets that can contain varying amounts of data from partial segments of one input pictures to multiple pictures. The Resource Coordination Process assembles the input data packets into complete frames and passes this data along with the current encoder settings to the Picture Analysis Processes. Encoder settings include, but are not limited to Bitrate Settings, Rate Control Mode of operation, etc.
The Picture Analysis processes perform the first stage of encoder pre-processing analysis as well as any intra-picture image conversion procedures, such as resampling, color space conversion, or tone mapping. The Picture Analysis processes can be multithreaded and as such can process multiple input pictures at a time. The encoder pre-analysis includes creating an n-bin histogram for the purpose of scene change detection, gathering the 1st and 2nd moment statistics for each 8x8 block in the picture which are used to compute variance, input subsampling and screen content detection. All image-modifying functions should be completed before any statistics-gathering functions begin.
The statistical moments of a picture are useful in a number of algorithms in the encoder including rate control and mode decision configuration. A useful property of the 1st and 2nd statistical moments is that blocks can be combined together to find the mean and variance of larger block sizes, so with the statistical moments of the 8x8 blocks, we can calculate the variance of all blocks in the SB that are larger than 8x8.
Let the first and second moment-accumulations be defined as:
1st Moment:
2nd Moment:
Variance is normally calculated as
where, is the input sample, and is the mean of the input samples.
This method requires an initial pass over the input samples to calculate the mean, followed by a second pass to calculate the variance. However, an alternate method can be used that requires only a single pass over the input data:
This form can be made more efficient:
Let
where is the number of samples.
So,
It follows that
So the relative variance for an N-sized block can be calculated by:
The relative variance can be calculated from two accumulations over each region. In order to calculate the variance for all block sizes in a SB, the first two moment-accumulations are first computed for each of the 8x8 blocks in a SB. If, for example, it is desired to compute the relative variance of 16x16 blocks in a SB, the 16x16 block moment-accumulations can be calculated from the four collocated 8x8 block-accumulations by adding them together. After the two moment-accumulations are computed, the relative variance can be calculated by modified relative variance calculation.
The Picture Decision process performs multi-picture level decisions, including setting of the prediction structure, setting the picture type, and scene change detection. Since the prior Picture Analysis process stage is multithreaded, inputs to the Picture Decision Process can arrive out-of-display-order, so a reordering queue is used to enforce processing of pictures in display order. The algorithms employed in the Picture Decision process are dependent on prior pictures’ statistics, so the order in which pictures are processed must be strictly enforced. Additionally, the Picture Decision process uses the reorder queue to hold input pictures until they can be started into the Motion Estimation process while following the proper prediction structure. The data flow in the picture decision process is show in Figure 6.
Figure 7: Picture decision process dataflow.
Setting up the prediction structure. To illustrate the process by which a prediction structure is implemented, an example showing a four-layer prediction structure is shown in Figure 8. The prediction structure is implemented using the Show_frame and Show_existing_frame flags, as shown in the table below.
)
Figure 8. An example of a four-layer prediction structure.
Table 5: Settings for the show_frame and Show_existing_frame flags when encoding one mini-GoP in a 4-layer prediction structure.
PicNumber | Show_frame | Show_existing_frame | DPB |
---|---|---|---|
0 | 1 | 0 | 0 |
8 | 0 | 0 | 8-0 |
4 | 0 | 0 | 4-8-0 |
2 | 0 | 0 | 2-4-8-0 |
1 | 1 | 0 | 2-4-8-0 |
2e | 0 | 1 | 2-4-8-0 |
3 | 1 | 0 | 2-4-8-0 |
4e | 0 | 1 | 2-4-8-0 |
6 | 0 | 0 | 6-2-4-8-0 |
5 | 1 | 0 | 6-2-4-8-0 |
6e | 0 | 1 | 6-2-4-8-0 |
7 | 1 | 0 | 6-2-4-8-0 |
8e | 0 | 1 | 6-2-4-8-0 |
16 | 0 | 0 | 6-2-4-8-16 |
The Motion Estimation (ME) process performs motion estimation and open-loop intra search (OIS). This process has access to the current input picture as well as to the input pictures the current picture uses as references according to the prediction structure pattern. The Motion Estimation process is multithreaded, so pictures can be processed out of order as long as all inputs are available.
The ME process generates inter-prediction and intra-prediction candidates using highly parallelizable, open loop, neighbor-independent methods. The candidates’ costs generated in the ME processes are further refined in downstream processes and as more neighbor information becomes available allowing for more accurate costs to be calculated. ME consists of four components: Hierarchical motion estimation (HME), search center selection, Motion Estimation and open-loop intra candidate search (OIS). The HME performs a quick search on down-sampled pictures to converge to a candidate search center for full-pel motion estimation search at the full picture resolution. Search center select uses a competition method to select a single search center from several methods, including hierarchical motion estimation, temporal motion vector predictors, and/or external search center candidates. The search parameters of each search center method can be configured at the picture-level in-order to control the number of computations used during search center selection. One goal of this design is to eliminate large areas of the search area with search center selection and then to fully search a smaller search area with the refinement stage. Tier-0 Motion Estimation finds the best motion vector around the SB search center for each of the partitions searched in ME.
In the current SVT-AV1 encoder, the ME process is based on the input pictures, i.e. the reference pictures are replaced by the corresponding source pictures. As a result, the ME is an open loop operation.
OIS searches all available intra luma modes for each active block. Intra chroma modes are not considered. The intra prediction process is conformant with the AOM-AV1 specification; however, the intra reference samples disregard filtering and use input samples rather than the nominal reconstructed samples.
Hierarchical Motion Estimation (HME) takes as input an Enhanced Input Picture and Reference Picture and produces a search center for each SB. The HME consists of three stages: a one-sixteenth resolution Level-0 full search, a one-quarter resolution Level-1 refinement search, and a base-resolution Level-2 refinement search as depicted in Figure 8. In addition, the total search area is sub-divided into N-search areas, where each of the Level-0, Level-1, and Level-2 searches are performed independently to produce N-search centers. Of the N-search centers, one search center is finally selected. Having multiple search centers prevents the Level-0 and Level-1 searches from choosing local minima and missing the true center of motion completely. All search and selection decisions are based on a pure SAD distortion metric. Figure 10 depicts an example HME full search and refinement data flow through Level-0, Level-1, and Level-2.
Figure 9: Hierarchical motion estimation dataflow.
Figure 10: Hierarchical motion estimation data illustration.
Search center selection chooses the best SB search center for Tier-0 Motion Estimation based on a SAD Distortion Metric. Search Center candidates may be generated from HME or other sources of candidate search centers. A diagram showing search center selection and the Tier-0 Motion Estimation is given in Figure 10.
Motion Estimation (ME) takes as input an enhanced input picture, reference picture, and search center for each SB. ME produces Motion Vector Pairs, one for each of the blocks in a SB. ME is composed of four main components: Integer Full Search, Half-Pel Refinement Search, Quarter-Pel Refinement Search, and the Search Area Quarter-Pel Interpolation filter as depicted in Figure 12. First, the Integer Full Search produces an integer MV candidate for each block. In parallel to the Integer Full Search, the Search Area Quarter-Pel Interpolation filter produces the quarter-pel interpolation for the Search Area. After both the Integer Full Search and Search Area Quarter-Pel Interpolation have completed, the Half-Pel Refinement process produces a half-pel refined MV candidate for each block. Finally the Quarter-Pel Refinement process produces a quarter-pel refined MV candidate for each block and the SB SAD estimation using the base 8x8 block SAD data. The filters for each of the subsample locations can be configured independently as well as the Search Area width and height.
Figure 11: Search center select and motion estimation.
Figure 12: Motion estimation dataflow.
Motion Estimation Integer Full-Search (ME FS) takes as input an enhanced input picture, reference picture, and search center for SB. ME FS produces integer-resolution Motion Vector Pairs for each of the PUs in the SB. ME FS performs a full search on the Search Area for every block in each SB. The cost metric used in the search is composed only of a SAD distortion. In order to avoid redundant SAD distortion calculations among the blocks, the ME FS algorithm uses the 8x8 SAD blocks (corresponding to the 8x8 blocks in a SB) at each search position to construct the blocks as depicted in Figure 13. Blocks of size 4xN and Nx4 use the motion vectors of the parent block and their corresponding SAD is computed accordingly.
Figure 13: Motion Estimation Integer Full-Search for the case of 64x64 SB. 4xN and Nx4 blocks are not included in this diagram to keep the presentation simple.
Motion Estimation Half-Pel Refinement (ME HPR) takes as input an interpolated enhanced input picture, an interpolated reference picture, a search center for the motion vector difference (MVD) cost metric, and integer-resolution Motion Vector Pairs. ME HPR produces half-pel resolution Motion Vector Pairs for each of the blocks in each SB. Additionally, ME HPR also has a selectable refinement area. The ME HPR algorithm is to refine each of the Motion Vector Pairs from integer-resolution to half-pel-resolution as depicted in Figure 14. Since the MV pairs are restricted to the Search Area, only the Search Area in the Reference Picture needs to be interpolated. As in the ME FS, the Search Center is used as the basis for the MVD cost metric. The MVD Cost Estimate look-up-table used in ME FS is again used in ME HPR.
Figure 14: Half-Pel Refinement for the case of 64x64 SB. 4xN and Nx4 blocks are not included in this diagram to keep the presentation simple.
Motion Estimation Quarter-Pel Refinement (ME QPR) takes as input an interpolated enhanced input picture, an interpolated reference picture, a search center for the motion vector difference (MVD) cost metric, and half-pel-resolution Motion Vector Pairs. Additionally, ME QPR also has a selectable refinement area. The ME QPR algorithm is exactly the same as the ME HPR algorithm, except that the Motion Vector Pairs are being refined to quarter-pel resolution and distortion estimates. Again, the MV Pairs are restricted to the Search Area and the Search Center is used as the basis for the MVD cost metric and the ME FS MVD Cost Estimate Table is used again. A diagram of the ME QPR is shown in Figure 15.
Figure 15: Quarter-Pel Refinement for the case of 64x64 SB. 4xN and Nx4 blocks are not included in this diagram to keep the presentation simple.
The Initial Rate Control process determines the initial bit budget for each picture depending on the data gathered in the Picture Analysis and Motion Estimation processes as well as the settings determined in the Picture Decision process. The Initial Rate Control process also employs a sliding window buffer to analyze multiple pictures if the delay is allowed. Note that through this process, until the subsequent Picture Manager process, no reference picture data has been used. The data flow in the Initial Rate Control process is illustrated in Figure 16.
Figure 16: Initial Rate Control Process Dataflow.
Source-based operations process involves a number analysis algorithms to identify spatiotemporal characteristics of the input pictures. Some of the operations aim at characterizing individual SBs (e.g. grass areas), others aim at characterizing whole pictures (e.g. Potential aura areas).
The Picture Manager Process performs the function of managing both the Enhanced (Input) Picture and Reference Picture buffers and subdividing the Input Picture into Tiles. Both the Enhanced Picture and Reference Picture buffers particular management depends on the GoP structure currently in use. The Picture Manager Process uses the Enhanced Picture and Reference Picture buffers to implement Pyramidal B GoP structures. Figure 16 shows the interaction of the Picture Management algorithm with the Enhanced Input Picture and Reference Picture buffers.
Figure 17: Picture Management Dataflow.
The Picture Manager Processes runs in an asynchronous mode of operation where the Picture Management algorithm is run whenever an Enhanced Input Picture or Reference Picture input are received. For both Enhanced Input Pictures and Reference Picture inputs, the Picture is first placed in the appropriate picture buffer. Then the Picture Management algorithm is run.
The Picture Management algorithm walks the Enhanced Input Picture Buffer and for each entry checks if the necessary Reference Pictures are available in the Reference Picture Buffer. If all the References are available, the Input Picture is split into Tiles and the now complete Reference Lists are passed to the Motion Estimation Processes. Additionally, an entry is made in the Reference Picture Buffer for the now active Input Picture where the Input Picture’s finished Reference Picture will be stored for future reference. This entry contains a descriptor to the Reference Picture data and a count of the Expected References by subsequent pictures in coding order. Finally for each Reference Picture in the Reference Lists, the expected reference count for the Reference Picture’s corresponding Reference Picture Buffer entry is decremented by one. Once the expected reference count is decremented to zero, the Reference Picture’s entry is removed from the Reference Picture Buffer. When the Picture Manager Process receives a Key_Frame flag or prediction structure reset, the Reference Picture Buffer entries must be decremented properly to avoid memory leakage. All buffer management actions are triggered by either receiving a Key_Frame, prediction structure reset, or upon the successful start of an Input Picture into the pipeline.
The Rate Control process uses the distortion and image statistics generated in previous processes, the current picture’s bit budget, and previous picture statistics to set the QP and the bit budget for each picture. The encoder currently support VBR -type of rate control.
The next several steps in the encoder pipeline involve finalizing the partitioning and mode decisions for each SB in the picture. Give the large number of block sizes and the large number of modes the could be considered for each block, it would be computationally very expensive to evaluate all options using all available tools to converge on the final partitioning and mode decisions. Consequently, a staged decision approach is considered in SVT-AV1 as shown in Figure 17. The process starts with the very large number N0 of candidates at the input of PD Stage 0. At this stage, very elementary tools and performance measures are used in evaluating the fitness of the different candidates. The best N1<<N0 of candidates are selected and passed on to PD stage 1. More sophisticated prediction and performance measure tools are considered in PD Stage 1 to evaluate all the N1 input candidates and select the top N2<<N1 from among the tested N1 candidates. The same idea is applied in subsequent steps until PD Stage n where Nn candidates and their corresponding coding modes are selected.
Figure 18. Partitioning decision stages.
An illustration of the different processing details that can take place in each PD stage are given in Figure 18. In this example, PD Stage 0 is based on the ME and OIS distortion data. Candidate costs including the MV rate are used in making decision to select the best subset of candidates to pass on to PD Stage 1. The latter may involve more precise prediction tools and more accurate cost calculation as part of the mode decision (MD) process to select an even smaller subset of candidates to pass on to PD stage 2. For example, PD Stage 1 may use more accurate interpolation filters in the sub-pel search. The same idea is applied in subsequent stages. In the last stage, only very few candidates as considered at the input and usually the full set of high performance prediction tools are used to finalize the list of candidates.
Figure 19. Example of the processing details in each PD stage.
The overall flow of the remaining tasks in the encoder is summarized in Figure 20.
Figure 20. Partitioning, mode decision, encoding and filtering tasks.
The Mode Decision Configuration process is associated with PD Stage 0 and determines which SB and block coding decisions that will be searched in the Coding Loop. The SB coding decisions specify which blocks to search in the SB. For example, the search could include only the 16x16 blocks or it could include the 64x64 and the 32x32 blocks. The block coding decisions could include, for example, how many inter and intra modes to search, and whether any specific intra or inter modes should be searched as well.
The Mode Decision Configuration Output Signals are meant to be a powerful, yet lightweight means of shaping the compute budget to focus on the coding tools that are most likely to yield the best coding decisions.
The EncDec process currently encapsulates the remaining PD stages where many of the encoder tasks such as Intra Prediction, Motion Compensated Prediction, Transform, Quantization and Mode Decision are performed. It takes as input the Motion Vector XY Pairs, OIS results and Distortion Estimations from the Motion Estimation processes, and the picture-level QP from the Rate Control process. The Enc/Dec process operates on an SB coding basis.
The Neighbor Array is a structure that manages neighboring block information during the EncDec process by continuously updating its memory locations throughout the encoding process. The Neighbor Array replaces the normal entire-picture block array solutions that are used to access neighboring block data. There are three neighbor array types: Top block, Left block, and Top-left block as illustrated in Figure 20. Also note that the neighbor array design can store either mode information directly or reference data indirectly (e.g. pointers).
Figure 21: Neighbor array structure.
The Neighbor Array design hinges on how its memory locations are accessed and updated. The Left Neighbor Array is approximately one SB tall and is accessed by using the SB y-location of the current block. The Top Neighbor Array is approximately one Picture wide and is accessed by using the x-location of the current block. The Top-Left Neighbor Array is accessed as seen in Figure 16.
The basic processing flow is that at the beginning of a picture, each of the Neighbor Arrays is reset. As each block is completed, its information mode or reference information is written to each of the Neighbor Arrays using the appropriate index. The Neighbor Array update and access flow can be described as follows:
-
Construct neighbor information using the Neighbor Arrays
-
Block Mode Decision
-
Update each of the Neighbor Arrays using the current block location
-
If at a Quadtree mode decision point, update the neighbor array
-
Proceed to the next block
This process is illustrated in Figure 17. The arrows represent the block z-scan coding order and the colors represent each block’s mode information. The three neighbor arrays contain a snapshot of the mode information currently stored in each block position at the time that the block labeled “Current Block” is being processed.
Figure 22: Neighbor array data illustration.
The SVT-AV1 encoder makes use of the AOM AV1 encoder rate estimation tables.
The AOM-AV1 specification provides various means to encode each block, which are grouped together into candidate types by syntax element usage for this design. Each candidate type has its own sample prediction, rate estimation, and transform block coding requirements. The candidate types are Intra, Inter, and Compound.
Intra candidates are predicted for each Block. Intra candidates are searched by their luma/chroma modes. The search includes also Chroma from Luma (CfL) search, Intra block copy (IBC).
Motion estimation candidates are inter predicted for each block. ME candidates (NEWMV mode) use MVP to signal the motion vector values. The skip mode in AV1 is considered only for NEAREST/NEAREST inter mode candidates. The mode determination is made by performing rate distortion optimization (RDO) analysis.
Compound candidates are generated using either inter-intra compound mode (which include the smooth inter-intra prediction and inter-intra wedge prediction), or inter-inter compound mode (which includes inter-inter wedge prediction, distance-weighted inter-inter prediction, and difference-weighted inter-inter prediction).
Encode SB performs RDO analysis on each candidate in each block and sets the partitioning structure, sets syntax elements, produces AV1-conformant reconstructed samples. The two main tasks that are performed in the encode SB are the mode decisions tasks and the encode pass tasks, both outlined below.
The mode decision takes place in PD stages 1 to n. In the current implementation, n=1. The tasks performed in each PD stage are structures as shown in Figure 22. The input candidates to a PD stage are grouped according to classes. In the current implementation, there are seven classes corresponding to Intra, Inter, MV Pred (Nearest, Near…), Inter-Inter, Intra-Inter, OBMC and Filter Intra candidates. Moreover, multiple MD stages are involved in the PD stage, where the complexity of the MD stages increases from MD Stage 0 to MD stage n due to the use of more accurate prediction tools and more accurate performance measures. Once the input candidates to a given MD stage are processed, only the best among the processed candidates are passed on to the next stage, hence reducing the number of candidates to be processed in the subsequent stage. In the process, some classes might be retired (not considered in subsequent stages) if the performance their corresponding candidates is not satisfactory. The main idea behind introducing candidate classes it to ensure that important types of candidates are given a chance to be present in the final MD stage and to compete at that stage against the best from other candidate classes.
It should be noted that the prediction tools considered in MD are not necessarily conformant tools, as the objective of MD is to produce partitioning and mode decisions, and not necessarily residuals to be coded and transmitted in the bitstream, which is the task performed by the encode pass discussed next.
Figure 23. MD flow.
The encode/decode pass takes as input the selected partitioning and coding modes from mode decision for each SB and produces AOM-AV1 conformant bit stream. The encode/decode pass includes intra prediction, motion compensation, transform, quantization, inverse quantization, inverse transform and reconstruction. All the prediction tools considered in the encode pass are conformant tools.
The deblocking filter is used to address blocking artifacts in reconstructed pictures. The filter was developed based on VP9 deblocking filter and switches between two types of filters: Narrow filters and Wide filters. Filtering is applied to all vertical edges first, then to all horizontal edges.
The steps involved in the deblocking filter are as follows:
-
Determine the loopfilter level and sharpness. Both are frame level parameters.
-
The level takes value in [0, 63] and can be set using different methods:
-
0 to disable filtering,
-
can be set as a function of the AC quantization step size, or
-
set through evaluation of filtering using multiple levels and selecting the level with the lowest distortion in the filtered frame.
-
-
The sharpness takes value in [0, 7]. For keyframes, sharpness=0, else sharpness=input sharpness setting.
-
-
Identify edges to filter.
-
Determine adaptive filter strength parameters: |v|, limit, blimit and thresh. These are block level properties. They build on the frame level settings and include refinements based on segmentation, coding mode, reference picture and loop filter data.
-
Determine filter masks: High edge variance mask (hevMask), fiterMask, flatMask and flatMask2.
-
Select and apply filters.
The constrained directional enhancement filter (CDEF) is applied after the deblocking filter and aims at improving the reconstructed picture. CDEF is a combination of the directional de-ringing filter from the Daala codec and the Constrained Low Pass Filter (CLPF) from the Thor codec, and provides directional de-ringing filtering (helps with ringing artifacts) and constrained low-pass filtering (CLPF) (helps with quantization and interpolation errors). The filtering is applied on 8x8 block level.
The filtering algorithm involves the following steps:
-
Identify the direction of the block (i.e. direction of edges). Eight directions could be identified. The search is performed on an 8x8 block basis. Eight possible directions (0 to 7) can be considered. The search is performed for the luma component and the direction is assumed to be the same for the chroma components.
-
Apply a nonlinear filter along the edge in the identified direction.
-
Primary filtering: Filter taps are aligned in the direction of the block. The main goal is to address ringing artifacts.
-
Secondary filtering: Mild filter along a 45 degrees direction from the edge.
-
The filtering operation is a function the pixel value difference between the pixel being filtered and a neighboring pixel, the filter strength and the filter damping. The strength determines the maximum difference allowed. Damping determines where filtering would not be considered.
Signaling:
-
The frame is divided into 64x64 filter blocks.
-
Information signaled at the frame level: damping, the number of bits used for filter block signaling, and a list of presets (could be 1, 2, 4 or 8 presets).
-
Preset: Luma and chroma primary strengths, Luma and chroma secondary strengths.
-
At the filter block level, the preset to use for the filter block is signaled.
More details on the CDEF algorithm and its implementation in SVT-AV1 are included in the Appendix.
The restoration filter is applied after the constrained directional enhancement filter and aims at improving the reconstructed picture. Two types of filters are involved:
-
Separable Symmetric Wiener Filter where only half of the filter coefficients are included in the bit stream due to symmetry.
-
Self-guided restoration filter with projection that has the effect of edge preserving smoothing.
Switching between OFF, Wiener filter and the self-guided restoration filter can be performed on a restoration unit basis.
The Entropy Coding process is responsible for producing an AV1 conformant bitstream for each frame. It takes as input the coding decisions and information for each block and produces as output the bitstream for each frame. The entropy coder is a frame-based process and is based on multi-symbol arithmetic range coding.
The Packetization process gathers the bitstreams from each frame, codes the Temporal Delimiter (TD) as well as the sequence and frame headers, and serves as an end-point for the encoder pipeline. The Packetization process takes as input the bitstreams for each frame as well as sequence and picture level coding settings, and produces the final bitstream in picture-decoding order.
The following appendices highlight the design and implementation of features in much greater detail than this document.
- Altref Appendix
- CDEF Appendix
- CfL Appendix
- Compliant Subpel Interpolation Filter Search Appendix
- Deblocking Loop Filter (LF) Appendix
- Filter Intra Appendix
- Global Motion Appendix
- Intra Block Copy Appendix
- Local Warped Motion appendix
- OBMC Appendix
- Palette Prediction Appendix
- Restoration Filter Appendix
- Subpel Interpolation in the Open Loop Motion Estimation Appendix
The notices and disclaimers can be found here.