Overview of AVS Video Coding Standard
author: Houqiang LI, Hui LIU
University of Science and Technology of China, Hefei
- Introduction ===============
Since the early 1990s, significant advances in digital video processing and communication technologies have made many long-envisioned applications possible. With the widespread deployment of applications and services such as digital television, video conferencing, Internet video, and DVD-Video, video compression has become an essential component of multimedia and communication technology; it bridges the gap between the huge amount of visual data that must be transmitted or stored and the limited bandwidth of communication channels or the limited capacity of storage media. During the past two decades, digital video compression technologies have fundamentally changed the way we create, communicate, and consume visual information.
With the rapid development of video compression technology, video coding standards have played a pivotal role in establishing the corresponding technology by providing interoperability among products developed by different manufacturers, and by assuring content creators that their content will run everywhere as long as it complies with the corresponding standard. Standards have driven steep reductions in cost, making the technology affordable to the masses, and have cultivated open interaction among experts from different companies. Two international organizations have been heavily involved in the standardization of video coding methodologies, namely ISO/IEC and ITU-T. The ITU-T Video Coding Experts Group (VCEG) develops international standards for advanced moving-image coding methods appropriate for conversational and non-conversational audio/video applications, generally referred to as the H.26x series[3,5-6,8]; it caters essentially to real-time video applications. The ISO/IEC Moving Picture Experts Group (MPEG) develops international standards for the compression, coding, decompression, processing, and representation of moving pictures, images, audio, and their combinations, generally referred to as the MPEG series[4-5,7-8]; it caters essentially to video storage, broadcast video, and video streaming (video over Internet/DSL/wireless) applications. In late 2001, ISO/IEC MPEG and ITU-T VCEG decided on a joint venture towards enhancing standard video coding performance and formed a joint team of the two standards organizations, called the Joint Video Team (JVT), which developed the H.264/AVC standard, finalized in March 2003[8]. H.264/AVC represents a number of advances in coding efficiency and in flexibility for effective use over a broad variety of network types and application domains[9-10]. So far, H.264/AVC has been strongly embraced by industry and is now being deployed in virtually all new and existing applications of digital video technology.
During the past two decades, China has become the consumer electronics market with the highest development potential in the world, owing to its huge population and rapid economic growth. Furthermore, China is becoming an extremely important consumer electronics manufacturing base globally, with many products now mainly produced there. So far, the dominant audio/video compression codecs, MPEG and H.26x, enjoy widespread use, which requires Chinese manufacturers to pay substantial royalty fees to the foreign companies that hold patents on the technology in those standards. In order to reduce dependence on foreign core intellectual property in digital media technology, the government of China initiated the development of a Chinese national standard called AVS (the Audio Video coding Standard of China) several years ago. The AVS standard is developed by the Audio Video Coding Standard Working Group of China, which was authorized and established by the Science and Technology Department under the Ministry of Information Industry (MII) in June 2002. As of April 2008, it had 175 members in total, including 135 official members and 40 observer members. The group is diverse, encompassing computer hardware and software manufacturers, telecommunications manufacturers, consumer electronics companies, semiconductor chip design firms, and universities and research organizations. The goal of the group is to establish general technical standards for the compression, decoding, processing, and representation of digital audio and video. The standard is applied in important information-industry fields such as high-resolution multimedia communication and Internet broadband streaming media.
AVS is China's independent standard, based on domestic innovations together with some public technologies. It is an efficient, state-of-the-art video compression standard: its coding efficiency is about twice that of MPEG-2 and close to that of H.264/AVC, but with lower complexity. Thanks to its many new features, the AVS standard has attracted much attention from researchers and engineers internationally. While the AVS standard wends its way through the approval process, it is already being used on a trial basis for IPTV, mobile TV, and terrestrial broadcasting in China. Some parts of AVS have already been completed or are close to completion. AVS1-P1 is the system part; it is an extension of the previous Chinese standard, which is equivalent to the MPEG-2 system, to support AVS video and audio. AVS1-P2 is the video part for high-end applications, such as standard-definition (SD) and high-definition (HD) broadcast and storage[1]. AVS1-P2 defines two profiles: the baseline profile (J-Profile) and the extended profile (X-Profile). The AVS1-P2 J-Profile targets typical applications such as IPTV, set-top boxes, SDTV, and HDTV; it was approved as a Chinese national standard in March 2006. The AVS1-P2 X-Profile is designed to be backward compatible with the J-Profile and targets high-end applications (such as HDTV and storage media) by introducing several high-efficiency coding tools[49-50]. As of this writing, the X-Profile is still underway, and the AVS working group forecasts the release of its FCD (Final Committee Draft) in May 2008. AVS1-P2 also includes a part named AVS-S, which targets real-time encoding and decoding for video surveillance[51]. Another part of AVS video is AVS1-P7 (also called AVS-M), which targets low-complexity, low-resolution mobile applications[2]. It is not defined as a profile of AVS1-P2 and is not compatible with it. AVS1-P7 was finalized in March 2006 and is expected to be approved as a national standard of China.
AVS1-P3 is the audio part and AVS1-P6 is the digital rights management (DRM) part. They were finalized in April 2006 and March 2007, respectively, and were submitted to the Chinese government for approval. This chapter aims to give an overview of the video coding parts of the AVS standard. It is organized as follows. Section 2 and Section 3 focus on AVS1-P2: Section 2 provides an overview of the basic coding structure of AVS1-P2 and highlights some of its key technical features. Section 4 is devoted to AVS1-P7 and introduces its coding tools, its profiles and levels, its performance and complexity, and so on.
- Basic Architecture of AVS1-P2 ================================
In common with many previous and existing video coding standards, AVS1-P2 employs the so-called block-based hybrid video coding scheme, which has been demonstrated to be a very successful video compression framework[11-13]. In fact, the coding structure of AVS1-P2 is similar in spirit to that of the new international standard H.264/AVC[1,9-10]. However, considering the target applications, the need for low implementation complexity, and IPR issues, the techniques used in each module of the AVS1-P2 codec differ more or less from those used in H.264/AVC. Like most video coding methods, AVS1-P2 exploits both temporal and spatial redundancy to achieve compression. In the temporal domain, there is usually a high correlation (similarity) between frames of video captured at around the same time: temporally adjacent frames (successive frames in time order) are often highly correlated, especially if the temporal sampling rate (the frame rate) is high. In the spatial domain, there is usually a high correlation between pixels (samples) that are close to each other, i.e., the values of neighboring samples are often very similar. AVS1-P2 uses the block-based hybrid coding scheme to reduce these redundancies and achieve compression.
The encoder of the block-based hybrid scheme mainly consists of intra prediction, inter prediction, transform and quantization, and entropy coding. Intra prediction is generally used to exploit spatial redundancy. It is conducted in the spatial domain, by referring to neighboring samples of previously coded blocks which are to the left of and/or above the block to be predicted. Inter prediction attempts to reduce temporal redundancy by exploiting the similarities between neighboring video frames, usually by constructing a prediction of the current video frame. In AVS1-P2, the prediction is formed from one or two previous or future frames and is improved by compensating for differences between the frames (motion-compensated prediction). The output of inter prediction is a residual frame (created by subtracting the prediction from the actual current frame) and a set of prediction parameters, typically a set of motion vectors describing how the motion was compensated. The residual data resulting from either intra or inter prediction are transformed, mapping the samples into another domain in which they are represented by transform coefficients. The coefficients are quantized to remove insignificant values, leaving a small number of significant coefficients that provide a more compact representation of the residual frame. The output of the transform module is thus a set of quantized transform coefficients. The parameters of both intra and inter prediction (typically mode information, motion vectors, etc.) and the quantized transform coefficients are entropy coded. The entropy coder exploits statistical redundancy in the data (for example, representing commonly occurring vectors and coefficients by short binary codes) and produces a compressed bit stream or file that may be transmitted and/or stored. A compressed sequence consists of coded motion vector parameters, coded residual coefficients, and header information.
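The predict/transform/quantize loop described above can be sketched as follows. This is a toy model: it uses a floating-point DCT and a single uniform quantizer step `qstep` as stand-ins for the actual 8×8 integer transform and quantization tables of AVS1-P2, which are defined in the specification.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis, a stand-in for the codec's integer transform."""
    C = np.array([[np.cos(np.pi * (2 * j + 1) * i / (2 * n)) for j in range(n)]
                  for i in range(n)])
    C[0] /= np.sqrt(2)
    return C * np.sqrt(2.0 / n)

def encode_block(current, prediction, qstep):
    """Predict -> residual -> transform -> quantize, for one square block."""
    C = dct_matrix(current.shape[0])
    residual = current.astype(np.float64) - prediction   # intra or inter prediction
    coeffs = C @ residual @ C.T                          # forward transform
    return np.round(coeffs / qstep).astype(np.int64)     # quantization (the lossy step)

def decode_block(levels, prediction, qstep):
    """Inverse quantize -> inverse transform -> add prediction (decoder/reconstruction)."""
    C = dct_matrix(levels.shape[0])
    residual = C.T @ (levels * float(qstep)) @ C         # scale and inverse transform
    return prediction + residual
```

Note that `decode_block` is exactly the decoder-side process described next: the encoder runs the same steps in its reconstruction path so that both ends predict from identical reference samples. A larger `qstep` zeros out more coefficients, trading quality for bit rate.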
The video decoder reconstructs a video frame from the compressed bit stream. The coefficients and motion vectors are decoded by an entropy decoder, and the coefficients are then scaled and inverse transformed to reconstruct a version of the residual frame. The decoder uses the motion vector parameters, together with one or more previously decoded frames, to create a prediction of the current frame, and the decoded frame itself is reconstructed by adding the residual frame to this prediction.
Fig.1 depicts the block diagram of the AVS1-P2 video encoder. For simplicity, we focus on the encoder and do not depict the decoder, which can simply be viewed as the inverse process. Fig.1 includes two dataflow paths, a "forward" path and a "reconstruction" path.
(1) Forward Path
An input frame or field is first partitioned into macroblocks (MBs), each of which contains 16×16 pixels. Each MB is predicted and encoded in intra or inter mode. For each block in the MB, a prediction PRED is formed based on reconstructed picture samples. The first picture of a sequence and random-access frames are typically intra coded, i.e., PRED is formed from samples in the current slice that have previously been encoded, decoded, and reconstructed. Each sample of an MB in an intra frame is predicted using spatially neighboring samples of previously coded blocks. The encoding process chooses which neighboring samples are used for intra prediction and how; the decoder reproduces the same prediction using the transmitted intra-prediction side information. For the remaining pictures of a sequence, or between random-access points, inter mode coding is typically used. In inter mode, PRED is formed by motion-compensated prediction from one or two reference pictures, each of which may be chosen from a selection of past or future pictures (in display order) that have already been encoded, reconstructed, and filtered. Inter prediction in AVS1-P2 is detailed in the next section. The prediction PRED is subtracted from the current block to produce a residual (difference) block. The residual block is then transformed (using an 8×8 integer transform) and quantized to give a set of quantized transform coefficients, which are reordered and entropy encoded. The scanning order for progressive blocks is still zigzag (similar to that used in MPEG-2); however, a new scanning order is defined for interlaced blocks. AVS1-P2 employs a Context-based Variable Length Coding (CBVLC) technique to carry out entropy coding. The entropy coder also encodes the side information of each block within the MB (e.g., prediction modes, quantizer parameter, motion vector information, etc.), which the decoder recovers from the compressed bitstream.
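The zigzag reordering mentioned above walks the 8×8 coefficient block from low to high frequency so that significant low-frequency values cluster at the front of the scan and long runs of trailing zeros can be coded cheaply. A minimal sketch of the classic MPEG-2-style progressive zigzag order (the AVS1-P2 interlace scan differs):

```python
def zigzag_order(n: int = 8):
    """Coefficient positions along anti-diagonals, alternating direction."""
    order = []
    for s in range(2 * n - 1):                     # s indexes the anti-diagonal
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        order.extend(diag if s % 2 else reversed(diag))
    return order

def zigzag_scan(block):
    """Flatten an n x n coefficient block into 1-D zigzag scan order."""
    return [block[i][j] for i, j in zigzag_order(len(block))]
```

After this reordering, the quantized coefficients of a typical block look like a few significant values followed by zeros, which is the pattern the (run, level) entropy coding of CBVLC is built to exploit.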
(2) Reconstruction Path
The encoder decodes (reconstructs) each encoded block in an MB to provide a reference for further predictions. The coefficients are scaled and inverse transformed to produce a difference block. The prediction block PRED is added to it to create a reconstructed block (a decoded version of the original block). A de-blocking filter is applied to reduce the effects of blocking distortion, and the reconstructed reference picture is created from the resulting blocks. AVS1-P2 uses the de-blocking filter inside the motion compensation loop; it acts directly on the reconstructed reference picture, first across vertical edges and then across horizontal edges. Obviously, different image regions and different bit rates need different degrees of smoothing; therefore, the de-blocking filter in AVS1-P2 is automatically adjusted depending on block activities and the QP parameter.
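The adaptive idea can be illustrated with a one-dimensional sketch: samples `p1, p0` sit on one side of a block edge and `q0, q1` on the other, and two thresholds `alpha` and `beta` (derived from QP and block activity in the real codec; the values and the filter taps below are illustrative, not the normative AVS1-P2 filter) decide whether a discontinuity is a blocking artifact to smooth or a genuine image edge to preserve.

```python
def deblock_edge(p1, p0, q0, q1, alpha, beta):
    """Filter the two samples adjacent to a block edge.

    alpha bounds the step across the edge, beta the gradients beside it;
    exceeding either suggests a real image edge, which is left untouched.
    """
    if abs(p0 - q0) >= alpha or abs(p1 - p0) >= beta or abs(q1 - q0) >= beta:
        return p0, q0                          # genuine edge: no filtering
    new_p0 = (p1 + 2 * p0 + q0 + 2) >> 2       # low-pass across the edge
    new_q0 = (p0 + 2 * q0 + q1 + 2) >> 2
    return new_p0, new_q0
```

Because `alpha` and `beta` grow with QP, coarsely quantized (high-QP) pictures are smoothed more aggressively, matching the observation that different bit rates need different degrees of smoothing.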
Since MPEG-2 codecs and systems are extensively deployed in existing broadcast systems, the syntax structure of AVS1-P2 is specially designed to be similar to that of MPEG-2. This similarity enables AVS1-P2 to be readily carried over the MPEG-2 system. Start codes are defined in AVS1-P2 to indicate the beginning of a sequence, a picture, and a slice in a stream; however, there is no group-of-pictures (GOP) header in AVS1-P2. Absolute time information can be carried in an I-picture header by setting a flag to 1. I pictures therefore enable random access into a compressed stream and switching among programs. To distinguish I pictures from P and B pictures, the start code of I pictures differs from that of P and B pictures[1].
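As in MPEG-2, these start codes are byte-aligned patterns beginning with the prefix `0x000001`, so a decoder can locate sequence, picture, and slice boundaries with a simple byte scan. A sketch (the code values shown are the commonly cited AVS1-P2 assignments, but should be verified against the specification):

```python
SEQUENCE_HEADER = 0xB0    # commonly cited AVS1-P2 values; verify against the spec
I_PICTURE       = 0xB3    # distinct from the P/B picture start code
PB_PICTURE      = 0xB6

def find_start_codes(stream: bytes):
    """Yield (byte_offset, code) for every 0x000001 start-code prefix."""
    i = 0
    while True:
        i = stream.find(b"\x00\x00\x01", i)
        if i < 0 or i + 3 >= len(stream):
            return
        yield i, stream[i + 3]
        i += 4                                 # skip the prefix and the code byte
```

Scanning for the I-picture code is exactly how a receiver finds random-access points or switches programs mid-stream, since only I pictures can be decoded without reference frames.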
The above description of AVS1-P2 is simplified in order to provide an overview of its basic coding structure. A more detailed introduction to the technical contents of AVS1-P2 is given below.