AOMedia AV1 Codec
AV1 ENCODER GUIDE

Introduction

This document provides an architectural overview of the libaom AV1 encoder.

It is intended as a high-level starting point for anyone wishing to contribute to the project, helping them to understand the structure of the encoder more quickly and to find their way around the codebase.

It sits above, and where necessary links to, more detailed function-level documents.

Generic Block Transform Based Codecs

Most modern video encoders, including VP8, H.264, VP9, HEVC and AV1 (in increasing order of complexity), share a common basic paradigm. This comprises separating a stream of raw video frames into a series of discrete blocks (of one or more sizes), then computing a prediction signal and a quantized, transform coded residual error signal. The prediction and residual error signal, along with any side information needed by the decoder, are then entropy coded and packed to form the encoded bitstream. See Figure 1 below, where the blue blocks are, to all intents and purposes, the lossless parts of the encoder and the red block is the lossy part.

This is of course a gross oversimplification, even in regard to the simplest of the above codecs. For example, all of them allow for block based prediction at multiple different scales (i.e. different block sizes) and may use previously coded pixels in the current frame for prediction or pixels from one or more previously encoded frames. Further, they may support multiple different transforms and transform sizes and quality optimization tools like loop filtering.
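The basic paradigm described above can be illustrated with a minimal sketch. It is not libaom code: the function names and the sample values are hypothetical, the transform and entropy coding stages are omitted for brevity, and a plain scalar quantizer stands in for the lossy step.

```python
def encode_block(source, prediction, qstep):
    """Return quantized residual levels for one block of samples."""
    # Residual error signal: what the prediction failed to capture.
    residual = [s - p for s, p in zip(source, prediction)]
    # Lossy step: quantization. A real codec would transform the residual first.
    return [round(r / qstep) for r in residual]


def decode_block(levels, prediction, qstep):
    """Rebuild the block from the prediction and the dequantized residual."""
    return [p + level * qstep for p, level in zip(prediction, levels)]


source = [100, 102, 104, 110]
prediction = [100, 100, 100, 100]  # assumed output of some predictor
levels = encode_block(source, prediction, qstep=4)
recon = decode_block(levels, prediction, qstep=4)
```

With this quantizer the reconstruction error per sample is bounded by half the quantizer step, which is the fidelity that is traded against the number of bits needed to code the levels.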

AV1 Structure and Complexity

As previously stated, AV1 adopts the same underlying paradigm as other block transform based codecs. However, it is much more complicated than previous generation codecs and supports many more block partitioning, prediction and transform options.

AV1 supports block partitions of various sizes from 128x128 pixels down to 4x4 pixels using a multi-layer recursive tree structure, as illustrated in Figure 2 below.
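The recursive structure can be sketched as follows. This is a simplified illustration, not the libaom partition search: only a four-way square split is modelled, AV1's other partition shapes are omitted, and the `should_split` decision callback is a hypothetical stand-in for the encoder's real decision logic.

```python
def partition(x, y, size, should_split, min_size=4):
    """Return the leaf blocks (x, y, size) of a recursive four-way split."""
    if size > min_size and should_split(x, y, size):
        half = size // 2
        leaves = []
        for dy in (0, half):
            for dx in (0, half):
                leaves += partition(x + dx, y + dy, half, should_split, min_size)
        return leaves
    return [(x, y, size)]


# Hypothetical decision rule: keep splitting only the top-left block.
leaves = partition(0, 0, 128, lambda x, y, size: x == 0 and y == 0)
```

Whatever decision rule is used, the leaves always tile the original 128x128 area exactly, and no leaf is smaller than 4x4.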

AV1 also provides 71 basic intra prediction modes, 56 single frame inter prediction modes (7 reference frames x 4 modes x 2 for OBMC (overlapped block motion compensation)), 12768 compound inter prediction modes (that combine inter predictors from two reference frames) and 36708 compound inter / intra prediction modes. Furthermore, in addition to simple inter motion estimation, AV1 also supports warped motion prediction using affine transforms.

In terms of transform coding, it has 16 separable 2-D transform kernels \((DCT, ADST, fADST, IDTX)^2\) that can be applied at up to 19 different scales from 64x64 down to 4x4 pixels.
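Two of the counts quoted above follow from simple products and can be reproduced directly. The snippet below is purely illustrative arithmetic: the kernel names are taken from the text, and the variable names are hypothetical.

```python
from itertools import product

kernels_1d = ["DCT", "ADST", "fADST", "IDTX"]
# Choosing one 1-D kernel per direction gives the separable 2-D kernels.
kernels_2d = list(product(kernels_1d, repeat=2))

# 7 reference frames x 4 modes x 2 (with and without OBMC).
single_ref_inter_modes = 7 * 4 * 2

print(len(kernels_2d), single_ref_inter_modes)
```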

When combined together, this means that for any one 8x8 pixel block in a source frame, there are approximately 45,000,000 different ways that it can be encoded.

Consequently, AV1 requires complex control processes. While not necessarily a normative part of the bitstream, these are the algorithms that turn a set of compression tools and a bitstream format specification into a coherent and useful codec implementation. These may include, but are not limited to:

  • Rate distortion optimization (The process of trying to choose the most efficient combination of block size, prediction mode, transform type etc.)
  • Rate control (regulation of the output bitrate)
  • Encoder speed vs quality trade offs.
  • Features such as two pass encoding or optimization for low delay encoding.

For a more detailed overview of AV1's encoding tools and a discussion of some of the design considerations and hardware constraints that had to be accommodated, please refer to A Technical Overview of AV1.

Figure 3 provides a slightly expanded but still simplistic view of the AV1 encoder architecture with blocks that relate to some of the subsequent sections of this document. In this diagram, the raw uncompressed frame buffers are shown in dark green and the reconstructed frame buffers used for prediction in light green. Red indicates those parts of the codec that are (or may be) lossy, where fidelity can be traded off against compression efficiency, whilst light blue shows algorithms or coding tools that are lossless. The yellow blocks represent non-bitstream normative configuration and control algorithms.

The Libaom Command Line Interface

Add details or links here: TODO ? elliotk@

Main Encoder Data Structures

The following are the main high level data structures used by the libaom AV1 encoder and referenced elsewhere in this overview document:

Encoder Use Cases

The libaom AV1 encoder is configurable to support a number of different use cases and rate control strategies.

The principal use cases for which it is optimised are as follows:

  • Video on Demand / Streaming
  • Low Delay or Live Streaming
  • Video Conferencing / Real Time Coding (RTC)
  • Fixed Quality / Testing

Other examples of use cases for which the encoder could be configured but for which there is less by way of specific optimizations include:

  • Download and Play
  • Disk Playback
  • Storage
  • Editing
  • Broadcast video

Specific use cases may have particular requirements or constraints. For example:

Video Conferencing: In a video conference we need to encode the video in real time and to avoid any coding tools that could increase latency, such as frame look ahead.

Live Streams: In cases such as live streaming of games or events, it may be possible to allow some limited buffering of the video and use of lookahead coding tools to improve encoding quality. However, whilst a lag of a second or two may be fine given the one way nature of this type of video, it is clearly not possible to use tools such as two pass coding.

Broadcast: Broadcast video (e.g. digital TV over satellite) may have specific requirements such as frequent and regular key frames (e.g. once per second or more) as these are important as entry points to users when switching channels. There may also be strict upper limits on bandwidth over a short window of time.

Download and Play: Download and play applications may have less strict requirements in terms of local frame by frame rate control but there may be a requirement to accurately hit a file size target for the video clip as a whole. Similar considerations may apply to playback from mass storage devices such as DVD or disk drives.

Editing: In certain special use cases such as offline editing, it may be desirable to have very high quality and data rate but also very frequent key frames or indeed to encode the video exclusively as key frames. Lossless video encoding may also be required in this use case.

VOD / Streaming: One of the most important and common use cases for AV1 is video on demand or streaming, for services such as YouTube and Netflix. In this use case it is possible to do two or even multi-pass encoding to improve compression efficiency. Streaming services will often store many encoded copies of a video at different resolutions and data rates to support users with different types of playback device and bandwidth limitations. Furthermore, these services support dynamic switching between multiple streams, so that they can respond to changing network conditions.

Exact rate control when encoding for a specific format (e.g. 360P or 1080P on YouTube) may not be critical, provided that the video bandwidth remains within allowed limits. Whilst a format may have a nominal target data rate, this can be considered more as the desired average egress rate over the video corpus rather than a strict requirement for any individual clip. Indeed, in order to maintain optimal quality of experience for the end user, it may be desirable to encode some easier videos or sections of video at a lower data rate and harder videos or sections at a higher rate.

VOD / streaming does not usually require very frequent key frames (as in the broadcast case) but key frames are important in trick play (scanning back and forth to different points in a video) and for adaptive stream switching. As such, in a use case like YouTube, there is normally an upper limit on the maximum time between key frames of a few seconds, but within certain limits the encoder can try to align key frames with real scene cuts.
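The interaction between a maximum key frame interval and scene cuts can be sketched as below. This is a hypothetical illustration of the idea only, not libaom's key frame placement logic: the function name, its parameters and the frame numbers are all assumptions, and no minimum interval is modelled.

```python
def place_key_frames(num_frames, scene_cuts, max_interval):
    """Return key frame indices: scene cuts, plus forced key frames
    whenever the gap since the last key frame reaches max_interval."""
    cuts = set(scene_cuts)
    key_frames = [0]  # the first frame is always a key frame
    for frame in range(1, num_frames):
        if frame in cuts or frame - key_frames[-1] >= max_interval:
            key_frames.append(frame)
    return key_frames


key_frames = place_key_frames(num_frames=300, scene_cuts=[40, 200], max_interval=120)
```

In this example the key frames at 40 and 200 coincide with scene cuts, while the one at 160 is forced because the upper limit on key frame spacing has been reached.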

Whilst encoder speed may not seem to be as critical in this use case, for services such as YouTube, where millions of new videos have to be encoded every day, encoder speed is still important, so libaom allows command line control of the encode speed vs quality trade off.

Fixed Quality / Testing Mode: Libaom also has a fixed quality encoder pathway designed for testing under highly constrained conditions.

Speed vs Quality Trade Off

In any modern video encoder there are trade offs that can be made in regard to the amount of time spent encoding a video or video frame vs the quality of the final encode.

These trade offs typically limit the scope of the search for an optimal prediction / transform combination, with faster encode modes doing fewer partition, reference frame, prediction mode and transform searches at the cost of some reduction in coding efficiency.

The pruning of the size of the search tree is typically based on assumptions about the likelihood of different search modes being selected, given what has gone before and features such as the dimensions of the video frames and the Q value selected for encoding the frame. For example, certain intra modes are less likely to be chosen at high Q but may be more likely if similar modes were used for the previously coded blocks above and to the left of the current block.

The speed settings depend both on the use case (e.g. Real Time encoding) and on an explicit speed control passed in on the command line as --cpu-used and stored in the AV1_COMP::speed field of the main compressor instance data structure (cpi).

The control flags for the speed trade off are stored in the AV1_COMP::sf field of the compressor instance and are set in the following functions:

A second factor impacting the speed of encode is rate distortion optimisation (rd vs non-rd encoding).

When rate distortion optimization is enabled, each candidate combination of a prediction mode and transform coding strategy is fully encoded, and the resulting error (or distortion) as compared to the original source, together with the number of bits used, is passed to a rate distortion function. This function converts the distortion and cost in bits to a single RD value (where lower is better). This RD value is used to decide between different encoding strategies for the current block where, for example, one strategy may result in a lower distortion but a larger number of bits.

The calculation of this RD value is broadly speaking as follows:

\[ RD = (λ * Rate) + Distortion \]

This assumes a linear relationship between the number of bits used and distortion (represented by the rate multiplier value λ) which is not actually valid across a broad range of rate and distortion values. Typically, where distortion is high, expending a small number of extra bits will result in a large change in distortion. However, at lower values of distortion the cost in bits of each incremental improvement is large.

To deal with this we scale the value of λ based on the quantizer value chosen for the frame. This is assumed to be a proxy for our approximate position on the true rate distortion curve and it is further assumed that over a limited range of distortion values, a linear relationship between distortion and rate is a valid approximation.
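The RD decision described above can be sketched in a few lines. This is a minimal illustration, not the libaom RD code: the function names, candidate names and numeric values are hypothetical, and the mapping from quantizer to λ is not modelled (the λ values are simply picked by hand).

```python
def rd_cost(rate_bits, distortion, lam):
    """RD = (lambda * Rate) + Distortion; lower is better."""
    return lam * rate_bits + distortion


def best_candidate(candidates, lam):
    """Pick the (name, rate_bits, distortion) tuple with the lowest RD value."""
    return min(candidates, key=lambda c: rd_cost(c[1], c[2], lam))


candidates = [
    ("mode_a", 120, 400.0),  # more bits, lower distortion
    ("mode_b", 60, 900.0),   # fewer bits, higher distortion
]

low_lambda_choice = best_candidate(candidates, lam=1.0)
high_lambda_choice = best_candidate(candidates, lam=10.0)
```

With a small λ the extra bits of `mode_a` are cheap relative to the distortion they remove, so it wins. With a large λ, bits are weighted more heavily and the cheaper `mode_b` is preferred.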

Doing a rate distortion test on each candidate prediction / transform combination is expensive in terms of cpu cycles. Hence, for cases where encode speed is critical, libaom implements a non-rd pathway where the RD value is estimated based on the prediction error and quantizer setting.

Source Frame Processing

Main Data Structures

The following are the main data structures referenced in this section (see also Main Encoder Data Structures):

Frame Ingest / Coding Pipeline

To encode a frame, first call av1_receive_raw_frame() to obtain the raw frame data. Then call av1_get_compressed_data() to encode the raw frame data into compressed frame data. The main body of av1_get_compressed_data() is av1_encode_strategy(), which determines the high-level encode strategy (frame type, frame placement, etc.) and then encodes the frame by calling av1_encode(). In av1_encode(), av1_first_pass() executes the first pass of two-pass encoding, while encode_frame_to_data_rate() performs the final pass for either one-pass or two-pass encoding.

The main body of encode_frame_to_data_rate() is encode_with_recode_loop_and_filter(), which handles encoding before the in-loop filters (either with recode loops, in encode_with_recode_loop(), or without any recode loop, in encode_without_recode()), followed by the in-loop filters (deblocking filters in loopfilter_frame(), then CDEF and restoration filters in cdef_restoration_frame()).

Except for rate/quality control, both encode_with_recode_loop() and encode_without_recode() call av1_encode_frame() to manage the reference frame buffers and encode_frame_internal() to perform the rest of encoding that does not require access to external frames. encode_frame_internal() is the starting point for the partition search (see Block Partition Search).
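The call hierarchy described in this section can be summarised as a small tree. The snippet below is only a reading aid built from the function names in the text: the nesting reflects the prose above rather than an inspection of the source, and the two functions called by the recode paths are listed side by side in the order the text names them.

```python
CALL_TREE = {
    "av1_receive_raw_frame": {},
    "av1_get_compressed_data": {
        "av1_encode_strategy": {
            "av1_encode": {
                "av1_first_pass": {},
                "encode_frame_to_data_rate": {
                    "encode_with_recode_loop_and_filter": {
                        "encode_with_recode_loop | encode_without_recode": {
                            "av1_encode_frame": {},
                            "encode_frame_internal": {},
                        },
                        "loopfilter_frame": {},
                        "cdef_restoration_frame": {},
                    },
                },
            },
        },
    },
}


def render(tree, depth=0):
    """Return the tree as a list of lines indented two spaces per level."""
    lines = []
    for name, children in tree.items():
        lines.append("  " * depth + name)
        lines.extend(render(children, depth + 1))
    return lines


print("\n".join(render(CALL_TREE)))
```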

Temporal Filtering

Overview

Video codecs exploit the spatial and temporal correlations in video signals to achieve compression efficiency. Noise in the source signal weakens these correlations and impairs codec performance, so denoising the video signal is a potentially promising solution.

One strategy for denoising a source is motion compensated temporal filtering. Unlike image denoising, where only the spatial information is available, video denoising can leverage a combination of the spatial and temporal information. Specifically, in the temporal domain, similar pixels can often be tracked along the motion trajectory of moving objects. Motion estimation is applied to neighboring frames to find similar patches or blocks of pixels that can be combined to create a temporally filtered output.

AV1, in common with VP8 and VP9, uses an in-loop motion compensated temporal filter to generate what are referred to as alternate reference frames (or ARF frames). These can be encoded in the bitstream and stored as frame buffers for use in the prediction of subsequent frames, but are not usually directly displayed (hence they are sometimes referred to as non-display frames).

The following command line parameters set the strength of the filter, the number of frames used and determine whether filtering is allowed for key frames.

Note that in AV1, the temporal filtering scheme is designed around the hierarchical ARF based pyramid coding structure. We typically apply denoising only on key frames and ARF frames at the highest (and sometimes the second highest) layer in the hierarchical coding structure.

Temporal Filtering Algorithm

Our method divides the current frame into "MxM" blocks. For each block, a motion search is applied on frames before and after the current frame. Only the best matching patch with the smallest mean square error (MSE) is kept as a candidate patch for a neighbour frame. The current block is also a candidate patch. A total of N candidate patches are combined to generate the filtered output.

Let f(i) represent the filtered sample value and \(p_{j}(i)\) the sample value of the j-th patch. The filtering process is:

\[ f(i) = \frac{p_{0}(i) + \sum_{j=1}^{N} ω_{j}(i) \cdot p_{j}(i)} {1 + \sum_{j=1}^{N} ω_{j}(i)} \]

where \( ω_{j}(i) \) is the weight of the j-th patch from a total of N patches. The weight is determined by the patch difference as:

\[ ω_{j}(i) = \exp\left(-\frac{D_{j}(i)}{h^2}\right) \]

where \( D_{j}(i) \) is the sum of squared difference between the current block and the j-th candidate patch:

\[ D_{j}(i) = \sum_{k \in Ω_{i}} ||p_{0}(k) - p_{j}(k)||_{2}^{2} \]

where:

  • \(p_{0}\) refers to the current frame.
  • \(Ω_{i}\) is the patch window, an "LxL" pixel square.
  • h is a critical parameter that controls the decay of the weights measured by the Euclidean distance. It is derived from an estimate of noise amplitude in the source. This allows the filter coefficients to adapt for videos with different noise characteristics.
  • Usually, M = 32, N = 7, and L = 5, but they can be adjusted.

It is recommended that the reader refers to the code for more details.
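As a reading aid, the filtering equations above can be sketched in plain Python. This is not the libaom implementation: the function names are hypothetical, patches are small nested lists, the patch difference is computed as a sum of squared differences as described in the text, and the patch window is simply clipped at the block edges (an assumption, since the text does not say how edges are handled). Motion search and the derivation of h from the noise estimate are not modelled.

```python
import math


def window_ssd(p0, pj, r, c, L):
    """Sum of squared differences over the LxL window centred on (r, c)."""
    half = L // 2
    rows, cols = len(p0), len(p0[0])
    total = 0.0
    for y in range(max(0, r - half), min(rows, r + half + 1)):
        for x in range(max(0, c - half), min(cols, c + half + 1)):
            diff = p0[y][x] - pj[y][x]
            total += diff * diff
    return total


def temporal_filter_block(p0, candidates, h, L=5):
    """Blend the current block p0 with candidate patches using the
    per-sample weights w_j(i) = exp(-D_j(i) / h^2)."""
    rows, cols = len(p0), len(p0[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            num, den = float(p0[r][c]), 1.0  # the current block has weight 1
            for pj in candidates:
                w = math.exp(-window_ssd(p0, pj, r, c, L) / (h * h))
                num += w * pj[r][c]
                den += w
            out[r][c] = num / den
    return out


current = [[10, 10], [10, 10]]
similar = [[12, 12], [12, 12]]
smoothed = temporal_filter_block(current, [similar], h=1e6)
unchanged = temporal_filter_block(current, [similar], h=0.01)
```

A large h makes the weights approach 1, so the output tends towards a plain average of the patches. A small h drives the weight of any differing patch towards 0, leaving the current block essentially untouched.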

Temporal Filter Functions

The main entry point for temporal filtering is av1_temporal_filter(). This function returns 1 if temporal filtering is successful, otherwise 0. When temporal filtering is applied, the filtered frame will be held in the output_frame, which is the frame to be encoded in the following encoding process.

Almost all temporal filter related code is in av1/encoder/temporal_filter.c and av1/encoder/temporal_filter.h.

Inside av1_temporal_filter(), the reader's attention is directed to tf_setup_filtering_buffer() and tf_do_filtering().