1/*!\page encoder_guide AV1 ENCODER GUIDE 2 3\tableofcontents 4 5\section architecture_introduction Introduction 6 7This document provides an architectural overview of the libaom AV1 encoder. 8 9It is intended as a high level starting point for anyone wishing to contribute 10to the project, that will help them to more quickly understand the structure 11of the encoder and find their way around the codebase. 12 13It stands above and will where necessary link to more detailed function 14level documents. 15 16\subsection architecture_gencodecs Generic Block Transform Based Codecs 17 18Most modern video encoders including VP8, H.264, VP9, HEVC and AV1 19(in increasing order of complexity) share a common basic paradigm. This 20comprises separating a stream of raw video frames into a series of discrete 21blocks (of one or more sizes), then computing a prediction signal and a 22quantized, transform coded, residual error signal. The prediction and residual 23error signal, along with any side information needed by the decoder, are then 24entropy coded and packed to form the encoded bitstream. See Figure 1: below, 25where the blue blocks are, to all intents and purposes, the lossless parts of 26the encoder and the red block is the lossy part. 27 28This is of course a gross oversimplification, even in regard to the simplest 29of the above codecs. For example, all of them allow for block based 30prediction at multiple different scales (i.e. different block sizes) and may 31use previously coded pixels in the current frame for prediction or pixels from 32one or more previously encoded frames. Further, they may support multiple 33different transforms and transform sizes and quality optimization tools like 34loop filtering. 35 36\image html genericcodecflow.png "" width=70% 37 38\subsection architecture_av1_structure AV1 Structure and Complexity 39 40As previously stated, AV1 adopts the same underlying paradigm as other block 41transform based codecs. However, it is much more complicated than previous 42generation codecs and supports many more block partitioning, prediction and 43transform options. 44 45AV1 supports block partitions of various sizes from 128x128 pixels down to 4x4 46pixels using a multi-layer recursive tree structure as illustrated in figure 2 47below. 48 49\image html av1partitions.png "" width=70% 50 51AV1 also provides 71 basic intra prediction modes, 56 single frame inter prediction 52modes (7 reference frames x 4 modes x 2 for OBMC (overlapped block motion 53compensation)), 12768 compound inter prediction modes (that combine inter 54predictors from two reference frames) and 36708 compound inter / intra 55prediction modes. Furthermore, in addition to simple inter motion estimation, 56AV1 also supports warped motion prediction using affine transforms. 57 58In terms of transform coding, it has 16 separable 2-D transform kernels 59\f$(DCT, ADST, fADST, IDTX)^2\f$ that can be applied at up to 19 different 60scales from 64x64 down to 4x4 pixels. 61 62When combined together, this means that for any one 8x8 pixel block in a 63source frame, there are approximately 45,000,000 different ways that it can 64be encoded. 65 66Consequently, AV1 requires complex control processes. While not necessarily 67a normative part of the bitstream, these are the algorithms that turn a set 68of compression tools and a bitstream format specification, into a coherent 69and useful codec implementation. These may include but are not limited to 70things like :- 71 72- Rate distortion optimization (The process of trying to choose the most 73 efficient combination of block size, prediction mode, transform type 74 etc.) 75- Rate control (regulation of the output bitrate) 76- Encoder speed vs quality trade offs. 77- Features such as two pass encoding or optimization for low delay 78 encoding. 79 80For a more detailed overview of AV1's encoding tools and a discussion of some 81of the design considerations and hardware constraints that had to be 82accommodated, please refer to <a href="https://arxiv.org/abs/2008.06091"> 83A Technical Overview of AV1</a>. 84 85Figure 3 provides a slightly expanded but still simplistic view of the 86AV1 encoder architecture with blocks that relate to some of the subsequent 87sections of this document. In this diagram, the raw uncompressed frame buffers 88are shown in dark green and the reconstructed frame buffers used for 89prediction in light green. Red indicates those parts of the codec that are 90(or may be) lossy, where fidelity can be traded off against compression 91efficiency, whilst light blue shows algorithms or coding tools that are 92lossless. The yellow blocks represent non-bitstream normative configuration 93and control algorithms. 94 95\image html av1encoderflow.png "" width=70% 96 97\section architecture_command_line The Libaom Command Line Interface 98 99 Add details or links here: TODO ? elliotk@ 100 101\section architecture_enc_data_structures Main Encoder Data Structures 102 103The following are the main high level data structures used by the libaom AV1 104encoder and referenced elsewhere in this overview document: 105 106- \ref AV1_PRIMARY 107 - \ref AV1_PRIMARY.gf_group (\ref GF_GROUP) 108 - \ref AV1_PRIMARY.lap_enabled 109 - \ref AV1_PRIMARY.twopass (\ref TWO_PASS) 110 - \ref AV1_PRIMARY.p_rc (\ref PRIMARY_RATE_CONTROL) 111 - \ref AV1_PRIMARY.tf_info (\ref TEMPORAL_FILTER_INFO) 112 113- \ref AV1_COMP 114 - \ref AV1_COMP.oxcf (\ref AV1EncoderConfig) 115 - \ref AV1_COMP.rc (\ref RATE_CONTROL) 116 - \ref AV1_COMP.speed 117 - \ref AV1_COMP.sf (\ref SPEED_FEATURES) 118 119- \ref AV1EncoderConfig (Encoder configuration parameters) 120 - \ref AV1EncoderConfig.pass 121 - \ref AV1EncoderConfig.algo_cfg (\ref AlgoCfg) 122 - \ref AV1EncoderConfig.kf_cfg (\ref KeyFrameCfg) 123 - \ref AV1EncoderConfig.rc_cfg (\ref RateControlCfg) 124 125- \ref AlgoCfg (Algorithm related configuration parameters) 126 - \ref AlgoCfg.arnr_max_frames 127 - \ref AlgoCfg.arnr_strength 128 129- \ref KeyFrameCfg (Keyframe coding configuration parameters) 130 - \ref KeyFrameCfg.enable_keyframe_filtering 131 132- \ref RateControlCfg (Rate control configuration) 133 - \ref RateControlCfg.mode 134 - \ref RateControlCfg.target_bandwidth 135 - \ref RateControlCfg.best_allowed_q 136 - \ref RateControlCfg.worst_allowed_q 137 - \ref RateControlCfg.cq_level 138 - \ref RateControlCfg.under_shoot_pct 139 - \ref RateControlCfg.over_shoot_pct 140 - \ref RateControlCfg.maximum_buffer_size_ms 141 - \ref RateControlCfg.starting_buffer_level_ms 142 - \ref RateControlCfg.optimal_buffer_level_ms 143 - \ref RateControlCfg.vbrbias 144 - \ref RateControlCfg.vbrmin_section 145 - \ref RateControlCfg.vbrmax_section 146 147- \ref PRIMARY_RATE_CONTROL (Primary Rate control status) 148 - \ref PRIMARY_RATE_CONTROL.gf_intervals[] 149 - \ref PRIMARY_RATE_CONTROL.cur_gf_index 150 151- \ref RATE_CONTROL (Rate control status) 152 - \ref RATE_CONTROL.intervals_till_gf_calculate_due 153 - \ref RATE_CONTROL.frames_till_gf_update_due 154 - \ref RATE_CONTROL.frames_to_key 155 156- \ref TWO_PASS (Two pass status and control data) 157 158- \ref GF_GROUP (Data related to the current GF/ARF group) 159 160- \ref FIRSTPASS_STATS (Defines entries in the first pass stats buffer) 161 - \ref FIRSTPASS_STATS.coded_error 162 163- \ref SPEED_FEATURES (Encode speed vs quality tradeoff parameters) 164 - \ref SPEED_FEATURES.hl_sf (\ref HIGH_LEVEL_SPEED_FEATURES) 165 166- \ref HIGH_LEVEL_SPEED_FEATURES 167 - \ref HIGH_LEVEL_SPEED_FEATURES.recode_loop 168 - \ref HIGH_LEVEL_SPEED_FEATURES.recode_tolerance 169 170- \ref TplParams 171 172\section architecture_enc_use_cases Encoder Use Cases 173 174The libaom AV1 encoder is configurable to support a number of different use 175cases and rate control strategies. 176 177The principle use cases for which it is optimised are as follows: 178 179 - <b>Video on Demand / Streaming</b> 180 - <b>Low Delay or Live Streaming</b> 181 - <b>Video Conferencing / Real Time Coding (RTC)</b> 182 - <b>Fixed Quality / Testing</b> 183 184Other examples of use cases for which the encoder could be configured but for 185which there is less by way of specific optimizations include: 186 187 - <b>Download and Play</b> 188 - <b>Disk Playback</b>> 189 - <b>Storage</b> 190 - <b>Editing</b> 191 - <b>Broadcast video</b> 192 193Specific use cases may have particular requirements or constraints. For 194example: 195 196<b>Video Conferencing:</b> In a video conference we need to encode the video 197in real time and to avoid any coding tools that could increase latency, such 198as frame look ahead. 199 200<b>Live Streams:</b> In cases such as live streaming of games or events, it 201may be possible to allow some limited buffering of the video and use of 202lookahead coding tools to improve encoding quality. However, whilst a lag of 203a second or two may be fine given the one way nature of this type of video, 204it is clearly not possible to use tools such as two pass coding. 205 206<b>Broadcast:</b> Broadcast video (e.g. digital TV over satellite) may have 207specific requirements such as frequent and regular key frames (e.g. once per 208second or more) as these are important as entry points to users when switching 209channels. There may also be strict upper limits on bandwidth over a short 210window of time. 211 212<b>Download and Play:</b> Download and play applications may have less strict 213requirements in terms of local frame by frame rate control but there may be a 214requirement to accurately hit a file size target for the video clip as a 215whole. Similar considerations may apply to playback from mass storage devices 216such as DVD or disk drives. 217 218<b>Editing:</b> In certain special use cases such as offline editing, it may 219be desirable to have very high quality and data rate but also very frequent 220key frames or indeed to encode the video exclusively as key frames. Lossless 221video encoding may also be required in this use case. 222 223<b>VOD / Streaming:</b> One of the most important and common use cases for AV1 224is video on demand or streaming, for services such as YouTube and Netflix. In 225this use case it is possible to do two or even multi-pass encoding to improve 226compression efficiency. Streaming services will often store many encoded 227copies of a video at different resolutions and data rates to support users 228with different types of playback device and bandwidth limitations. 229Furthermore, these services support dynamic switching between multiple 230streams, so that they can respond to changing network conditions. 231 232Exact rate control when encoding for a specific format (e.g 360P or 1080P on 233YouTube) may not be critical, provided that the video bandwidth remains within 234allowed limits. Whilst a format may have a nominal target data rate, this can 235be considered more as the desired average egress rate over the video corpus 236rather than a strict requirement for any individual clip. Indeed, in order 237to maintain optimal quality of experience for the end user, it may be 238desirable to encode some easier videos or sections of video at a lower data 239rate and harder videos or sections at a higher rate. 240 241VOD / streaming does not usually require very frequent key frames (as in the 242broadcast case) but key frames are important in trick play (scanning back and 243forth to different points in a video) and for adaptive stream switching. As 244such, in a use case like YouTube, there is normally an upper limit on the 245maximum time between key frames of a few seconds, but within certain limits 246the encoder can try to align key frames with real scene cuts. 247 248Whilst encoder speed may not seem to be as critical in this use case, for 249services such as YouTube, where millions of new videos have to be encoded 250every day, encoder speed is still important, so libaom allows command line 251control of the encode speed vs quality trade off. 252 253<b>Fixed Quality / Testing Mode:</b> Libaom also has a fixed quality encoder 254pathway designed for testing under highly constrained conditions. 255 256\section architecture_enc_speed_quality Speed vs Quality Trade Off 257 258In any modern video encoder there are trade offs that can be made in regard to 259the amount of time spent encoding a video or video frame vs the quality of the 260final encode. 261 262These trade offs typically limit the scope of the search for an optimal 263prediction / transform combination with faster encode modes doing fewer 264partition, reference frame, prediction mode and transform searches at the cost 265of some reduction in coding efficiency. 266 267The pruning of the size of the search tree is typically based on assumptions 268about the likelihood of different search modes being selected based on what 269has gone before and features such as the dimensions of the video frames and 270the Q value selected for encoding the frame. For example certain intra modes 271are less likely to be chosen at high Q but may be more likely if similar 272modes were used for the previously coded blocks above and to the left of the 273current block. 274 275The speed settings depend both on the use case (e.g. Real Time encoding) and 276an explicit speed control passed in on the command line as <b>--cpu-used</b> 277and stored in the \ref AV1_COMP.speed field of the main compressor instance 278data structure (<b>cpi</b>). 279 280The control flags for the speed trade off are stored the \ref AV1_COMP.sf 281field of the compressor instancve and are set in the following functions:- 282 283- \ref av1_set_speed_features_framesize_independent() 284- \ref av1_set_speed_features_framesize_dependent() 285- \ref av1_set_speed_features_qindex_dependent() 286 287A second factor impacting the speed of encode is rate distortion optimisation 288(<b>rd vs non-rd</b> encoding). 289 290When rate distortion optimization is enabled each candidate combination of 291a prediction mode and transform coding strategy is fully encoded and the 292resulting error (or distortion) as compared to the original source and the 293number of bits used, are passed to a rate distortion function. This function 294converts the distortion and cost in bits to a single <b>RD</b> value (where 295lower is better). This <b>RD</b> value is used to decide between different 296encoding strategies for the current block where, for example, a one may 297result in a lower distortion but a larger number of bits. 298 299The calculation of this <b>RD</b> value is broadly speaking as follows: 300 301\f[ 302 RD = (λ * Rate) + Distortion 303\f] 304 305This assumes a linear relationship between the number of bits used and 306distortion (represented by the rate multiplier value <b>λ</b>) which is 307not actually valid across a broad range of rate and distortion values. 308Typically, where distortion is high, expending a small number of extra bits 309will result in a large change in distortion. However, at lower values of 310distortion the cost in bits of each incremental improvement is large. 311 312To deal with this we scale the value of <b>λ</b> based on the quantizer 313value chosen for the frame. This is assumed to be a proxy for our approximate 314position on the true rate distortion curve and it is further assumed that over 315a limited range of distortion values, a linear relationship between distortion 316and rate is a valid approximation. 317 318Doing a rate distortion test on each candidate prediction / transform 319combination is expensive in terms of cpu cycles. Hence, for cases where encode 320speed is critical, libaom implements a non-rd pathway where the <b>RD</b> 321value is estimated based on the prediction error and quantizer setting. 322 323\section architecture_enc_src_proc Source Frame Processing 324 325\subsection architecture_enc_frame_proc_data Main Data Structures 326 327The following are the main data structures referenced in this section 328(see also \ref architecture_enc_data_structures): 329 330- \ref AV1_PRIMARY ppi (the primary compressor instance data structure) 331 - \ref AV1_PRIMARY.tf_info (\ref TEMPORAL_FILTER_INFO) 332 333- \ref AV1_COMP cpi (the main compressor instance data structure) 334 - \ref AV1_COMP.oxcf (\ref AV1EncoderConfig) 335 336- \ref AV1EncoderConfig (Encoder configuration parameters) 337 - \ref AV1EncoderConfig.algo_cfg (\ref AlgoCfg) 338 - \ref AV1EncoderConfig.kf_cfg (\ref KeyFrameCfg) 339 340- \ref AlgoCfg (Algorithm related configuration parameters) 341 - \ref AlgoCfg.arnr_max_frames 342 - \ref AlgoCfg.arnr_strength 343 344- \ref KeyFrameCfg (Keyframe coding configuration parameters) 345 - \ref KeyFrameCfg.enable_keyframe_filtering 346 347\subsection architecture_enc_frame_proc_ingest Frame Ingest / Coding Pipeline 348 349 To encode a frame, first call \ref av1_receive_raw_frame() to obtain the raw 350 frame data. Then call \ref av1_get_compressed_data() to encode raw frame data 351 into compressed frame data. The main body of \ref av1_get_compressed_data() 352 is \ref av1_encode_strategy(), which determines high-level encode strategy 353 (frame type, frame placement, etc.) and then encodes the frame by calling 354 \ref av1_encode(). In \ref av1_encode(), \ref av1_first_pass() will execute 355 the first_pass of two-pass encoding, while \ref encode_frame_to_data_rate() 356 will perform the final pass for either one-pass or two-pass encoding. 357 358 The main body of \ref encode_frame_to_data_rate() is 359 \ref encode_with_recode_loop_and_filter(), which handles encoding before 360 in-loop filters (with recode loops \ref encode_with_recode_loop(), or 361 without any recode loop \ref encode_without_recode()), followed by in-loop 362 filters (deblocking filters \ref loopfilter_frame(), CDEF filters and 363 restoration filters \ref cdef_restoration_frame()). 364 365 Except for rate/quality control, both \ref encode_with_recode_loop() and 366 \ref encode_without_recode() call \ref av1_encode_frame() to manage the 367 reference frame buffers and \ref encode_frame_internal() to perform the 368 rest of encoding that does not require access to external frames. 369 \ref encode_frame_internal() is the starting point for the partition search 370 (see \ref architecture_enc_partitions). 371 372\subsection architecture_enc_frame_proc_tf Temporal Filtering 373 374\subsubsection architecture_enc_frame_proc_tf_overview Overview 375 376Video codecs exploit the spatial and temporal correlations in video signals to 377achieve compression efficiency. The noise factor in the source signal 378attenuates such correlation and impedes the codec performance. Denoising the 379video signal is potentially a promising solution. 380 381One strategy for denoising a source is motion compensated temporal filtering. 382Unlike image denoising, where only the spatial information is available, 383video denoising can leverage a combination of the spatial and temporal 384information. Specifically, in the temporal domain, similar pixels can often be 385tracked along the motion trajectory of moving objects. Motion estimation is 386applied to neighboring frames to find similar patches or blocks of pixels that 387can be combined to create a temporally filtered output. 388 389AV1, in common with VP8 and VP9, uses an in-loop motion compensated temporal 390filter to generate what are referred to as alternate reference frames (or ARF 391frames). These can be encoded in the bitstream and stored as frame buffers for 392use in the prediction of subsequent frames, but are not usually directly 393displayed (hence they are sometimes referred to as non-display frames). 394 395The following command line parameters set the strength of the filter, the 396number of frames used and determine whether filtering is allowed for key 397frames. 398 399- <b>--arnr-strength</b> (\ref AlgoCfg.arnr_strength) 400- <b>--arnr-maxframes</b> (\ref AlgoCfg.arnr_max_frames) 401- <b>--enable-keyframe-filtering</b> 402 (\ref KeyFrameCfg.enable_keyframe_filtering) 403 404Note that in AV1, the temporal filtering scheme is designed around the 405hierarchical ARF based pyramid coding structure. We typically apply denoising 406only on key frame and ARF frames at the highest (and sometimes the second 407highest) layer in the hierarchical coding structure. 408 409\subsubsection architecture_enc_frame_proc_tf_algo Temporal Filtering Algorithm 410 411Our method divides the current frame into "MxM" blocks. For each block, a 412motion search is applied on frames before and after the current frame. Only 413the best matching patch with the smallest mean square error (MSE) is kept as a 414candidate patch for a neighbour frame. The current block is also a candidate 415patch. A total of N candidate patches are combined to generate the filtered 416output. 417 418Let f(i) represent the filtered sample value and \f$p_{j}(i)\f$ the sample 419value of the j-th patch. The filtering process is: 420 421\f[ 422 f(i) = \frac{p_{0}(i) + \sum_{j=1}^{N} ω_{j}(i).p_{j}(i)} 423 {1 + \sum_{j=1}^{N} ω_{j}(i)} 424\f] 425 426where \f$ ω_{j}(i) \f$ is the weight of the j-th patch from a total of 427N patches. The weight is determined by the patch difference as: 428 429\f[ 430 ω_{j}(i) = exp(-\frac{D_{j}(i)}{h^2}) 431\f] 432 433where \f$ D_{j}(i) \f$ is the sum of squared difference between the current 434block and the j-th candidate patch: 435 436\f[ 437 D_{j}(i) = \sum_{k\inΩ_{i}}||p_{0}(k) - p_{j}(k)||_{2} 438\f] 439 440where: 441- \f$p_{0}\f$ refers to the current frame. 442- \f$Ω_{i}\f$ is the patch window, an "LxL" pixel square. 443- h is a critical parameter that controls the decay of the weights measured by 444 the Euclidean distance. It is derived from an estimate of noise amplitude in 445 the source. This allows the filter coefficients to adapt for videos with 446 different noise characteristics. 447- Usually, M = 32, N = 7, and L = 5, but they can be adjusted. 448 449It is recommended that the reader refers to the code for more details. 450 451\subsubsection architecture_enc_frame_proc_tf_funcs Temporal Filter Functions 452 453The main entry point for temporal filtering is \ref av1_temporal_filter(). 454This function returns 1 if temporal filtering is successful, otherwise 0. 455When temporal filtering is applied, the filtered frame will be held in 456the output_frame, which is the frame to be 457encoded in the following encoding process. 458 459Almost all temporal filter related code is in av1/encoder/temporal_filter.c 460and av1/encoder/temporal_filter.h. 461 462Inside \ref av1_temporal_filter(), the reader's attention is directed to 463\ref tf_setup_filtering_buffer() and \ref tf_do_filtering(). 464 465- \ref tf_setup_filtering_buffer(): sets up the frame buffer for 466 temporal filtering, determines the number of frames to be used, and 467 calculates the noise level of each frame. 468 469- \ref tf_do_filtering(): the main function for the temporal 470 filtering algorithm. It breaks each frame into "MxM" blocks. For each 471 block a motion search \ref tf_motion_search() is applied to find 472 the motion vector from one neighboring frame. tf_build_predictor() is then 473 called to build the matching patch and \ref av1_apply_temporal_filter_c() (see 474 also optimised SIMD versions) to apply temporal filtering. The weighted 475 average over each pixel is accumulated and finally normalized in 476 \ref tf_normalize_filtered_frame() to generate the final filtered frame. 477 478- \ref av1_apply_temporal_filter_c(): the core function of our temporal 479 filtering algorithm (see also optimised SIMD versions). 480 481\subsection architecture_enc_frame_proc_film Film Grain Modelling 482 483 Add details here. 484 485\section architecture_enc_rate_ctrl Rate Control 486 487\subsection architecture_enc_rate_ctrl_data Main Data Structures 488 489The following are the main data structures referenced in this section 490(see also \ref architecture_enc_data_structures): 491 492 - \ref AV1_PRIMARY ppi (the primary compressor instance data structure) 493 - \ref AV1_PRIMARY.twopass (\ref TWO_PASS) 494 495 - \ref AV1_COMP cpi (the main compressor instance data structure) 496 - \ref AV1_COMP.oxcf (\ref AV1EncoderConfig) 497 - \ref AV1_COMP.rc (\ref RATE_CONTROL) 498 - \ref AV1_COMP.sf (\ref SPEED_FEATURES) 499 500 - \ref AV1EncoderConfig (Encoder configuration parameters) 501 - \ref AV1EncoderConfig.rc_cfg (\ref RateControlCfg) 502 503 - \ref FIRSTPASS_STATS *frame_stats_buf (used to store per frame first 504 pass stats) 505 506 - \ref SPEED_FEATURES (Encode speed vs quality tradeoff parameters) 507 - \ref SPEED_FEATURES.hl_sf (\ref HIGH_LEVEL_SPEED_FEATURES) 508 509\subsection architecture_enc_rate_ctrl_options Supported Rate Control Options 510 511Different use cases (\ref architecture_enc_use_cases) may have different 512requirements in terms of data rate control. 513 514The broad rate control strategy is selected using the <b>--end-usage</b> 515parameter on the command line, which maps onto the field 516\ref aom_codec_enc_cfg_t.rc_end_usage in \ref aom_encoder.h. 517 518The four supported options are:- 519 520- <b>VBR</b> (Variable Bitrate) 521- <b>CBR</b> (Constant Bitrate) 522- <b>CQ</b> (Constrained Quality mode ; A constrained variant of VBR) 523- <b>Fixed Q</b> (Constant quality of Q mode) 524 525The value of \ref aom_codec_enc_cfg_t.rc_end_usage is in turn copied over 526into the encoder rate control configuration data structure as 527\ref RateControlCfg.mode. 528 529In regards to the most important use cases above, Video on demand uses either 530VBR or CQ mode. CBR is the preferred rate control model for RTC and Live 531streaming and Fixed Q is only used in testing. 532 533The behaviour of each of these modes is regulated by a series of secondary 534command line rate control options but also depends somewhat on the selected 535use case, whether 2-pass coding is enabled and the selected encode speed vs 536quality trade offs (\ref AV1_COMP.speed and \ref AV1_COMP.sf). 537 538The list below gives the names of the main rate control command line 539options together with the names of the corresponding fields in the rate 540control configuration data structures. 541 542- <b>--target-bitrate</b> (\ref RateControlCfg.target_bandwidth) 543- <b>--min-q</b> (\ref RateControlCfg.best_allowed_q) 544- <b>--max-q</b> (\ref RateControlCfg.worst_allowed_q) 545- <b>--cq-level</b> (\ref RateControlCfg.cq_level) 546- <b>--undershoot-pct</b> (\ref RateControlCfg.under_shoot_pct) 547- <b>--overshoot-pct</b> (\ref RateControlCfg.over_shoot_pct) 548 549The following control aspects of vbr encoding 550 551- <b>--bias-pct</b> (\ref RateControlCfg.vbrbias) 552- <b>--minsection-pct</b> ((\ref RateControlCfg.vbrmin_section) 553- <b>--maxsection-pct</b> ((\ref RateControlCfg.vbrmax_section) 554 555The following relate to buffer and delay management in one pass low delay and 556real time coding 557 558- <b>--buf-sz</b> (\ref RateControlCfg.maximum_buffer_size_ms) 559- <b>--buf-initial-sz</b> (\ref RateControlCfg.starting_buffer_level_ms) 560- <b>--buf-optimal-sz</b> (\ref RateControlCfg.optimal_buffer_level_ms) 561 562\subsection architecture_enc_vbr Variable Bitrate (VBR) Encoding 563 564For streamed VOD content the most common rate control strategy is Variable 565Bitrate (VBR) encoding. The CQ mode mentioned above is a variant of this 566where additional quantizer and quality constraints are applied. VBR 567encoding may in theory be used in conjunction with either 1-pass or 2-pass 568encoding. 569 570VBR encoding varies the number of bits given to each frame or group of frames 571according to the difficulty of that frame or group of frames, such that easier 572frames are allocated fewer bits and harder frames are allocated more bits. The 573intent here is to even out the quality between frames. This contrasts with 574Constant Bitrate (CBR) encoding where each frame is allocated the same number 575of bits. 576 577Whilst for any given frame or group of frames the data rate may vary, the VBR 578algorithm attempts to deliver a given average bitrate over a wider time 579interval. In standard VBR encoding, the time interval over which the data rate 580is averaged is usually the duration of the video clip. An alternative 581approach is to target an average VBR bitrate over the entire video corpus for 582a particular video format (corpus VBR). 583 584\subsubsection architecture_enc_1pass_vbr 1 Pass VBR Encoding 585 586The command line for libaom does allow 1 Pass VBR, but this has not been 587properly optimised and behaves much like 1 pass CBR in most regards, with bits 588allocated to frames by the following functions: 589 590- \ref av1_calc_iframe_target_size_one_pass_vbr() 591- \ref av1_calc_pframe_target_size_one_pass_vbr() 592 593\subsubsection architecture_enc_2pass_vbr 2 Pass VBR Encoding 594 595The main focus here will be on 2-pass VBR encoding (and the related CQ mode) 596as these are the modes most commonly used for VOD content. 597 5982-pass encoding is selected on the command line by setting --passes=2 599(or -p 2). 600 601Generally speaking, in 2-pass encoding, an encoder will first encode a video 602using a default set of parameters and assumptions. Depending on the outcome 603of that first encode, the baseline assumptions and parameters will be adjusted 604to optimize the output during the second pass. In essence the first pass is a 605fact finding mission to establish the complexity and variability of the video, 606in order to allow a better allocation of bits in the second pass. 607 608The libaom 2-pass algorithm is unusual in that the first pass is not a full 609encode of the video. Rather it uses a limited set of prediction and transform 610options and a fixed quantizer, to generate statistics about each frame. No 611output bitstream is created and the per frame first pass statistics are stored 612entirely in volatile memory. This has some disadvantages when compared to a 613full first pass encode, but avoids the need for file I/O and improves speed. 614 615For two pass encoding, the function \ref av1_encode() will first be called 616for each frame in the video with the value \ref AV1EncoderConfig.pass = 1. 617This will result in calls to \ref av1_first_pass(). 618 619Statistics for each frame are stored in \ref FIRSTPASS_STATS frame_stats_buf. 620 621After completion of the first pass, \ref av1_encode() will be called again for 622each frame with \ref AV1EncoderConfig.pass = 2. The frames are then encoded in 623accordance with the statistics gathered during the first pass by calls to 624\ref encode_frame_to_data_rate() which in turn calls 625 \ref av1_get_second_pass_params(). 626 627In summary the second pass code :- 628 629- Searches for scene cuts (if auto key frame detection is enabled). 630- Defines the length of and hierarchical structure to be used in each 631 ARF/GF group. 632- Allocates bits based on the relative complexity of each frame, the quality 633 of frame to frame prediction and the type of frame (e.g. key frame, ARF 634 frame, golden frame or normal leaf frame). 635- Suggests a maximum Q (quantizer value) for each ARF/GF group, based on 636 estimated complexity and recent rate control compliance 637 (\ref RATE_CONTROL.active_worst_quality) 638- Tracks adherence to the overall rate control objectives and adjusts 639 heuristics. 640 641The main two pass functions in regard to the above include:- 642 643- \ref find_next_key_frame() 644- \ref define_gf_group() 645- \ref calculate_total_gf_group_bits() 646- \ref get_twopass_worst_quality() 647- \ref av1_gop_setup_structure() 648- \ref av1_gop_bit_allocation() 649- \ref av1_twopass_postencode_update() 650 651For each frame, the two pass algorithm defines a target number of bits 652\ref RATE_CONTROL.base_frame_target, which is then adjusted if necessary to 653reflect any undershoot or overshoot on previous frames to give 654\ref RATE_CONTROL.this_frame_target. 655 656As well as \ref RATE_CONTROL.active_worst_quality, the two pass code also 657maintains a record of the actual Q value used to encode previous frames 658at each level in the current pyramid hierarchy 659(\ref PRIMARY_RATE_CONTROL.active_best_quality). The function 660\ref rc_pick_q_and_bounds(), uses these values to set a permitted Q range 661for each frame. 662 663\subsubsection architecture_enc_1pass_lagged 1 Pass Lagged VBR Encoding 664 6651 pass lagged encode falls between simple 1 pass encoding and full two pass 666encoding and is used for cases where it is not possible to do a full first 667pass through the entire video clip, but where some delay is permissible. For 668example near live streaming where there is a delay of up to a few seconds. In 669this case the first pass and second pass are in effect combined such that the 670first pass starts encoding the clip and the second pass lags behind it by a 671few frames. When using this method, full sequence level statistics are not 672available, but it is possible to collect and use frame or group of frame level 673data to help in the allocation of bits and in defining ARF/GF coding 674hierarchies. The reader is referred to the \ref AV1_PRIMARY.lap_enabled field 675in the main compressor instance (where <b>lap</b> stands for 676<b>look ahead processing</b>). This encoding mode for the most part uses the 677same rate control pathways as two pass VBR encoding. 678 679\subsection architecture_enc_rc_loop The Main Rate Control Loop 680 681Having established a target rate for a given frame and an allowed range of Q 682values, the encoder then tries to encode the frame at a rate that is as close 683as possible to the target value, given the Q range constraints. 684 685There are two main mechanisms by which this is achieved. 686 687The first selects a frame level Q, using an adaptive estimate of the number of 688bits that will be generated when the frame is encoded at any given Q. 689Fundamentally this mechanism is common to VBR, CBR and to use cases such as 690RTC with small adjustments. 691 692As the Q value mainly adjusts the precision of the residual signal, it is not 693actually a reliable basis for accurately predicting the number of bits that 694will be generated across all clips. A well predicted clip, for example, may 695have a much smaller error residual after prediction. The algorithm copes with 696this by adapting its predictions on the fly using a feedback loop based on how 697well it did the previous time around. 698 699The main functions responsible for the prediction of Q and the adaptation over 700time, for the two pass encoding pipeline are: 701 702- \ref rc_pick_q_and_bounds() 703 - \ref get_q() 704 - \ref av1_rc_regulate_q() 705 - \ref get_rate_correction_factor() 706 - \ref set_rate_correction_factor() 707 - \ref find_closest_qindex_by_rate() 708- \ref av1_twopass_postencode_update() 709 - \ref av1_rc_update_rate_correction_factors() 710 711A second mechanism for control comes into play if there is a large rate miss 712for the current frame (much too big or too small). This is a recode mechanism 713which allows the current frame to be re-encoded one or more times with a 714revised Q value. This obviously has significant implications for encode speed 715and in the case of RTC latency (hence it is not used for the RTC pathway). 716 717Whether or not a recode is allowed for a given frame depends on the selected 718encode speed vs quality trade off. This is set on the command line using the 719--cpu-used parameter which maps onto the \ref AV1_COMP.speed field in the main 720compressor instance data structure. 721 722The value of \ref AV1_COMP.speed, combined with the use case, is used to 723populate the speed features data structure AV1_COMP.sf. In particular 724\ref HIGH_LEVEL_SPEED_FEATURES.recode_loop determines the types of frames that 725may be recoded and \ref HIGH_LEVEL_SPEED_FEATURES.recode_tolerance is a rate 726error trigger threshold. 727 728For more information the reader is directed to the following functions: 729 730- \ref encode_with_recode_loop() 731- \ref encode_without_recode() 732- \ref recode_loop_update_q() 733- \ref recode_loop_test() 734- \ref av1_set_speed_features_framesize_independent() 735- \ref av1_set_speed_features_framesize_dependent() 736 737\subsection architecture_enc_fixed_q Fixed Q Mode 738 739There are two main fixed Q cases: 740-# Fixed Q with adaptive qp offsets: same qp offset for each pyramid level 741 in a given video, but these offsets are adaptive based on video content. 742-# Fixed Q with fixed qp offsets: content-independent fixed qp offsets for 743 each pyramid level. 744 745The reader is also refered to the following functions: 746- \ref av1_rc_pick_q_and_bounds() 747- \ref rc_pick_q_and_bounds_no_stats_cbr() 748- \ref rc_pick_q_and_bounds_no_stats() 749- \ref rc_pick_q_and_bounds() 750 751\section architecture_enc_frame_groups GF/ ARF Frame Groups & Hierarchical Coding 752 753\subsection architecture_enc_frame_groups_data Main Data Structures 754 755The following are the main data structures referenced in this section 756(see also \ref architecture_enc_data_structures): 757 758- \ref AV1_COMP cpi (the main compressor instance data structure) 759 - \ref AV1_COMP.rc (\ref RATE_CONTROL) 760 761- \ref FIRSTPASS_STATS *frame_stats_buf (used to store per frame first pass 762stats) 763 764\subsection architecture_enc_frame_groups_groups Frame Groups 765 766To process a sequence/stream of video frames, the encoder divides the frames 767into groups and encodes them sequentially (possibly dependent on previous 768groups). In AV1 such a group is usually referred to as a golden frame group 769(GF group) or sometimes an Alt-Ref (ARF) group or a group of pictures (GOP). 770A GF group determines and stores the coding structure of the frames (for 771example, frame type, usage of the hierarchical structure, usage of overlay 772frames, etc.) and can be considered as the base unit to process the frames, 773therefore playing an important role in the encoder. 774 775The length of a specific GF group is arguably the most important aspect when 776determining a GF group. This is because most GF group level decisions are 777based on the frame characteristics, if not on the length itself directly. 778Note that the GF group is always a group of consecutive frames, which means 779the start and end of the group (so again, the length of it) determines which 780frames are included in it and hence determines the characteristics of the GF 781group. Therefore, in this document we will first discuss the GF group length 782decision in Libaom, followed by frame structure decisions when defining a GF 783group with a certain length. 784 785\subsection architecture_enc_gf_length GF / ARF Group Length Determination 786 787The basic intuition of determining the GF group length is that it is usually 788desirable to group together frames that are similar. Hence, we may choose 789longer groups when consecutive frames are very alike and shorter ones when 790they are very different. 791 792The determination of the GF group length is done in function \ref 793calculate_gf_length(). The following encoder use cases are supported: 794 795<ul> 796 <li><b>Single pass with look-ahead disabled(\ref has_no_stats_stage()): 797 </b> in this case there is no information available on the following stream 798 of frames, therefore the function will set the GF group length for the 799 current and the following GF groups (a total number of MAX_NUM_GF_INTERVALS 800 groups) to be the maximum value allowed.</li> 801 802 <li><b>Single pass with look-ahead enabled (\ref AV1_PRIMARY.lap_enabled):</b> 803 look-ahead processing is enabled for single pass, therefore there is a 804 limited amount of information available regarding future frames. In this 805 case the function will determine the length based on \ref FIRSTPASS_STATS 806 (which is generated when processing the look-ahead buffer) for only the 807 current GF group.</li> 808 809 <li><b>Two pass:</b> the first pass in two-pass encoding collects the stats 810 and will not call the function. In the second pass, the function tries to 811 determine the GF group length of the current and the following GF groups (a 812 total number of MAX_NUM_GF_INTERVALS groups) based on the first-pass 813 statistics. Note that as we will be discussing later, such decisions may not 814 be accurate and can be changed later.</li> 815</ul> 816 817Except for the first trivial case where there is no prior knowledge of the 818following frames, the function \ref calculate_gf_length() tries to determine the 819GF group length based on the first pass statistics. The determination is divided 820into two parts: 821 822<ol> 823 <li>Baseline decision based on accumulated statistics: this part of the function 824 iterates through the firstpass statistics of the following frames and 825 accumulates the statistics with function accumulate_next_frame_stats. 826 The accumulated statistics are then used to determine whether the 827 correlation in the GF group has dropped too much in function detect_gf_cut. 828 If detect_gf_cut returns non-zero, or if we've reached the end of 829 first-pass statistics, the baseline decision is set at the current point.</li> 830 831 <li>If we are not at the end of the first-pass statistics, the next part will 832 try to refine the baseline decision. This algorithm is based on the analysis 833 of firstpass stats. It tries to cut the groups in stable regions or 834 relatively stable points. Also it tries to avoid cutting in a blending 835 region.</li> 836</ol> 837 838As mentioned, for two-pass encoding, the function \ref 839calculate_gf_length() tries to determine the length of as many as 840MAX_NUM_GF_INTERVALS groups. The decisions are stored in 841\ref PRIMARY_RATE_CONTROL.gf_intervals[]. The variables 842\ref RATE_CONTROL.intervals_till_gf_calculate_due and 843\ref PRIMARY_RATE_CONTROL.gf_intervals[] help with managing and updating the stored 844decisions. In the function \ref define_gf_group(), the corresponding 845stored length decision will be used to define the current GF group. 846 847When the maximum GF group length is larger or equal to 32, the encoder will 848enforce an extra layer to determine whether to use maximum GF length of 32 849or 16 for every GF group. In such a case, \ref calculate_gf_length() is 850first called with the original maximum length (>=32). Afterwards, 851\ref av1_tpl_setup_stats() is called to analyze the determined GF group 852and compare the reference to the last frame and the middle frame. If it is 853decided that we should use a maximum GF length of 16, the function 854\ref calculate_gf_length() is called again with the updated maximum 855length, and it only sets the length for a single GF group 856(\ref RATE_CONTROL.intervals_till_gf_calculate_due is set to 1). This process 857is shown below. 858 859\image html tplgfgroupdiagram.png "" width=40% 860 861Before encoding each frame, the encoder checks 862\ref RATE_CONTROL.frames_till_gf_update_due. If it is zero, indicating 863processing of the current GF group is done, the encoder will check whether 864\ref RATE_CONTROL.intervals_till_gf_calculate_due is zero. If it is, as 865discussed above, \ref calculate_gf_length() is called with original 866maximum length. If it is not zero, then the GF group length value stored 867in \ref PRIMARY_RATE_CONTROL.gf_intervals[\ref PRIMARY_RATE_CONTROL.cur_gf_index] is used 868(subject to change as discussed above). 869 870\subsection architecture_enc_gf_structure Defining a GF Group's Structure 871 872The function \ref define_gf_group() defines the frame structure as well 873as other GF group level parameters (e.g. bit allocation) once the length of 874the current GF group is determined. 875 876The function first iterates through the first pass statistics in the GF group to 877accumulate various stats, using accumulate_this_frame_stats() and 878accumulate_next_frame_stats(). The accumulated statistics are then used to 879determine the use of the use of ALTREF frame along with other properties of the 880GF group. The values of \ref PRIMARY_RATE_CONTROL.cur_gf_index, \ref 881RATE_CONTROL.intervals_till_gf_calculate_due and \ref 882RATE_CONTROL.frames_till_gf_update_due are also updated accordingly. 883 884The function \ref av1_gop_setup_structure() is called at the end to determine 885the frame layers and reference maps in the GF group, where the 886construct_multi_layer_gf_structure() function sets the frame update types for 887each frame and the group structure. 888 889- If ALTREF frames are allowed for the GF group: the first frame is set to 890 KF_UPDATE, GF_UPDATE or ARF_UPDATE. The last frames of the GF group is set to 891 OVERLAY_UPDATE. Then in set_multi_layer_params(), frame update 892 types are determined recursively in a binary tree fashion, and assigned to 893 give the final IBBB structure for the group. - If the current branch has more 894 than 2 frames and we have not reached maximum layer depth, then the middle 895 frame is set as INTNL_ARF_UPDATE, and the left and right branches are 896 processed recursively. - If the current branch has less than 3 frames, or we 897 have reached maximum layer depth, then every frame in the branch is set to 898 LF_UPDATE. 899 900- If ALTREF frame is not allowed for the GF group: the frames are set 901 as LF_UPDATE. This basically forms an IPPP GF group structure. 902 903As mentioned, the encoder may use Temporal dependancy modelling (TPL - see \ref 904architecture_enc_tpl) to determine whether we should use a maximum length of 32 905or 16 for the current GF group. This requires calls to \ref define_gf_group() 906but should not change other settings (since it is in essence a trial). This 907special case is indicated by the setting parameter <b>is_final_pass</b> for to 908zero. 909 910For single pass encodes where look-ahead processing is disabled 911(\ref AV1_PRIMARY.lap_enabled = 0), \ref define_gf_group_pass0() is used 912instead of \ref define_gf_group(). 913 914\subsection architecture_enc_kf_groups Key Frame Groups 915 916A special constraint for GF group length is the location of the next keyframe 917(KF). The frames between two KFs are referred to as a KF group. Each KF group 918can be encoded and decoded independently. Because of this, a GF group cannot 919span beyond a KF and the location of the next KF is set as a hard boundary 920for GF group length. 921 922<ul> 923 <li>For two-pass encoding \ref RATE_CONTROL.frames_to_key controls when to 924 encode a key frame. When it is zero, the current frame is a keyframe and 925 the function \ref find_next_key_frame() is called. This in turn calls 926 \ref define_kf_interval() to work out where the next key frame should 927 be placed.</li> 928 929 <li>For single-pass with look-ahead enabled, \ref define_kf_interval() 930 is called whenever a GF group update is needed (when 931 \ref RATE_CONTROL.frames_till_gf_update_due is zero). This is because 932 generally KFs are more widely spaced and the look-ahead buffer is usually 933 not long enough.</li> 934 935 <li>For single-pass with look-ahead disabled, the KFs are placed according 936 to the command line parameter <b>--kf-max-dist</b> (The above two cases are 937 also subject to this constraint).</li> 938</ul> 939 940The function \ref define_kf_interval() tries to detect a scenecut. 941If a scenecut within kf-max-dist is detected, then it is set as the next 942keyframe. Otherwise the given maximum value is used. 943 944\section architecture_enc_tpl Temporal Dependency Modelling 945 946The temporal dependency model runs at the beginning of each GOP. It builds the 947motion trajectory within the GOP in units of 16x16 blocks. The temporal 948dependency of a 16x16 block is evaluated as the predictive coding gains it 949contributes to its trailing motion trajectory. This temporal dependency model 950reflects how important a coding block is for the coding efficiency of the 951overall GOP. It is hence used to scale the Lagrangian multiplier used in the 952rate-distortion optimization framework. 953 954\subsection architecture_enc_tpl_config Configurations 955 956The temporal dependency model and its applications are by default turned on in 957libaom encoder for the VoD use case. To disable it, use --tpl-model=0 in the 958aomenc configuration. 959 960\subsection architecture_enc_tpl_algoritms Algorithms 961 962The scheme works in the reverse frame processing order over the source frames, 963propagating information from future frames back to the current frame. For each 964frame, a propagation step is run for each MB. it operates as follows: 965 966<ul> 967 <li> Estimate the intra prediction cost in terms of sum of absolute Hadamard 968 transform difference (SATD) noted as intra_cost. It also loads the motion 969 information available from the first-pass encode and estimates the inter 970 prediction cost as inter_cost. Due to the use of hybrid inter/intra 971 prediction mode, the inter_cost value is further upper bounded by 972 intra_cost. A propagation cost variable is used to collect all the 973 information flowed back from future processing frames. It is initialized as 974 0 for all the blocks in the last processing frame in a group of pictures 975 (GOP).</li> 976 977 <li> The fraction of information from a current block to be propagated towards 978 its reference block is estimated as: 979\f[ 980 propagation\_fraction = (1 - inter\_cost/intra\_cost) 981\f] 982 It reflects how much the motion compensated reference would reduce the 983 prediction error in percentage.</li> 984 985 <li> The total amount of information the current block contributes to the GOP 986 is estimated as intra_cost + propagation_cost. The information that it 987 propagates towards its reference block is captured by: 988 989\f[ 990 propagation\_amount = 991 (intra\_cost + propagation\_cost) * propagation\_fraction 992\f]</li> 993 994 <li> Note that the reference block may not necessarily sit on the grid of 995 16x16 blocks. The propagation amount is hence dispensed to all the blocks 996 that overlap with the reference block. The corresponding block in the 997 reference frame accumulates its own propagation cost as it receives back 998 propagation. 999 1000\f[ 1001 propagation\_cost = propagation\_cost + 1002 (\frac{overlap\_area}{(16*16)} * propagation\_amount) 1003\f]</li> 1004 1005 <li> In the final encoding stage, the distortion propagation factor of a block 1006 is evaluated as \f$(1 + \frac{propagation\_cost}{intra\_cost})\f$, where the second term 1007 captures its impact on later frames in a GOP.</li> 1008 1009 <li> The Lagrangian multiplier is adapted at the 64x64 block level. For every 1010 64x64 block in a frame, we have a distortion propagation factor: 1011 1012\f[ 1013 dist\_prop[i] = 1 + \frac{propagation\_cost[i]}{intra\_cost[i]} 1014\f] 1015 1016 where i denotes the block index in the frame. We also have the frame level 1017 distortion propagation factor: 1018 1019\f[ 1020 dist\_prop = 1 + 1021 \frac{\sum_{i}propagation\_cost[i]}{\sum_{i}intra\_cost[i]} 1022\f] 1023 1024 which is used to normalize the propagation factor at the 64x64 block level. The 1025 Lagrangian multiplier is hence adapted as: 1026 1027\f[ 1028 λ[i] = λ[0] * \frac{dist\_prop}{dist\_prop[i]} 1029\f] 1030 1031 where λ0 is the multiplier associated with the frame level QP. The 1032 64x64 block level QP is scaled according to the Lagrangian multiplier. 1033</ul> 1034 1035\subsection architecture_enc_tpl_keyfun Key Functions and data structures 1036 1037The reader is also refered to the following functions and data structures: 1038 1039- \ref TplParams 1040- \ref av1_tpl_setup_stats() builds the TPL model. 1041- \ref setup_delta_q() Assign different quantization parameters to each super 1042 block based on its TPL weight. 1043 1044\section architecture_enc_partitions Block Partition Search 1045 1046 A frame is first split into tiles in \ref encode_tiles(), with each tile 1047 compressed by av1_encode_tile(). Then a tile is processed in superblock rows 1048 via \ref av1_encode_sb_row() and then \ref encode_sb_row(). 1049 1050 The partition search processes superblocks sequentially in \ref 1051 encode_sb_row(). Two search modes are supported, depending upon the encoding 1052 configuration, \ref encode_nonrd_sb() is for 1-pass and real-time modes, 1053 while \ref encode_rd_sb() performs more exhaustive rate distortion based 1054 searches. 1055 1056 Partition search over the recursive quad-tree space is implemented by 1057 recursive calls to \ref av1_nonrd_use_partition(), 1058 \ref av1_rd_use_partition(), or av1_rd_pick_partition() and returning best 1059 options for sub-trees to their parent partitions. 1060 1061 In libaom, the partition search lays on top of the mode search (predictor, 1062 transform, etc.), instead of being a separate module. The interface of mode 1063 search is \ref pick_sb_modes(), which connects the partition_search with 1064 \ref architecture_enc_inter_modes and \ref architecture_enc_intra_modes. To 1065 make good decisions, reconstruction is also required in order to build 1066 references and contexts. This is implemented by \ref encode_sb() at the 1067 sub-tree level and \ref encode_b() at coding block level. 1068 1069 See also \ref partition_search 1070 1071\section architecture_enc_intra_modes Intra Mode Search 1072 1073AV1 also provides 71 different intra prediction modes, i.e. modes that predict 1074only based upon information in the current frame with no dependency on 1075previous or future frames. For key frames, where this independence from any 1076other frame is a defining requirement and for other cases where intra only 1077frames are required, the encoder need only considers these modes in the rate 1078distortion loop. 1079 1080Even so, in most use cases, searching all possible intra prediction modes for 1081every block and partition size is not practical and some pruning of the search 1082tree is necessary. 1083 1084For the Rate distortion optimized case, the main top level function 1085responsible for selecting the intra prediction mode for a given block is 1086\ref av1_rd_pick_intra_mode_sb(). The readers attention is also drawn to the 1087functions \ref hybrid_intra_mode_search() and \ref av1_nonrd_pick_intra_mode() 1088which may be used where encode speed is critical. The choice between the 1089rd path and the non rd or hybrid paths depends on the encoder use case and the 1090\ref AV1_COMP.speed parameter. Further fine control of the speed vs quality 1091trade off is provided by means of fields in \ref AV1_COMP.sf (which has type 1092\ref SPEED_FEATURES). 1093 1094Note that some intra modes are only considered for specific use cases or 1095types of video. For example the palette based prediction modes are often 1096valueable for graphics or screen share content but not for natural video. 1097(See \ref av1_search_palette_mode()) 1098 1099See also \ref intra_mode_search for more details. 1100 1101\section architecture_enc_inter_modes Inter Prediction Mode Search 1102 1103For inter frames, where we also allow prediction using one or more previously 1104coded frames (which may chronologically speaking be past or future frames or 1105non-display reference buffers such as ARF frames), the size of the search tree 1106that needs to be traversed, to select a prediction mode, is considerably more 1107massive. 1108 1109In addition to the 71 possible intra modes we also need to consider 56 single 1110frame inter prediction modes (7 reference frames x 4 modes x 2 for OBMC 1111(overlapped block motion compensation)), 12768 compound inter prediction modes 1112(these are modes that combine inter predictors from two reference frames) and 111336708 compound inter / intra prediction modes. 1114 1115As with the intra mode search, libaom supports an RD based pathway and a non 1116rd pathway for speed critical use cases. The entry points for these two cases 1117are \ref av1_rd_pick_inter_mode() and \ref av1_nonrd_pick_inter_mode_sb() 1118respectively. 1119 1120Various heuristics and predictive strategies are used to prune the search tree 1121with fine control provided through the speed features parameter in the main 1122compressor instance data structure \ref AV1_COMP.sf. 1123 1124It is worth noting, that some prediction modes incurr a much larger rate cost 1125than others (ignoring for now the cost of coding the error residual). For 1126example, a compound mode that requires the encoder to specify two reference 1127frames and two new motion vectors will almost inevitable have a higher rate 1128cost than a simple inter prediction mode that uses a predicted or 0,0 motion 1129vector. As such, if we have already found a mode for the current block that 1130has a low RD cost, we can skip a large number of the possible modes on the 1131basis that even if the error residual is 0 the inherent rate cost of the 1132mode itself will garauntee that it is not chosen. 1133 1134See also \ref inter_mode_search for more details. 1135 1136\section architecture_enc_tx_search Transform Search 1137 1138AV1 implements the transform stage using 4 seperable 1-d transforms (DCT, 1139ADST, FLIPADST and IDTX, where FLIPADST is the reversed version of ADST 1140and IDTX is the identity transform) which can be combined to give 16 2-d 1141combinations. 1142 1143These combinations can be applied at 19 different scales from 64x64 pixels 1144down to 4x4 pixels. 1145 1146This gives rise to a large number of possible candidate transform options 1147for coding the residual error after prediction. An exhaustive rate-distortion 1148based evaluation of all candidates would not be practical from a speed 1149perspective in a production encoder implementation. Hence libaom addopts a 1150number of strategies to prune the selection of both the transform size and 1151transform type. 1152 1153There are a number of strategies that have been tested and implememnted in 1154libaom including: 1155 1156- A statistics based approach that looks at the frequency with which certain 1157 combinations are used in a given context and prunes out very unlikely 1158 candidates. It is worth noting here that some size candidates can be pruned 1159 out immediately based on the size of the prediction partition. For example it 1160 does not make sense to use a transform size that is larger than the 1161 prediction partition size but also a very large prediction partition size is 1162 unlikely to be optimally pared with small transforms. 1163 1164- A Machine learning based model 1165 1166- A method that initially tests candidates using a fast algorithm that skips 1167 entropy encoding and uses an estimated cost model to choose a reduced subset 1168 for full RD analysis. This subject is covered more fully in a paper authored 1169 by Bohan Li, Jingning Han, and Yaowu Xu titled: <b>Fast Transform Type 1170 Selection Using Conditional Laplace Distribution Based Rate Estimation</b> 1171 1172<b>TODO Add link to paper when available</b> 1173 1174See also \ref transform_search for more details. 1175 1176\section architecture_post_enc_filt Post Encode Loop Filtering 1177 1178AV1 supports three types of post encode <b>in loop</b> filtering to improve 1179the quality of the reconstructed video. 1180 1181- <b>Deblocking Filter</b> The first of these is a farily traditional boundary 1182 deblocking filter that attempts to smooth discontinuities that may occur at 1183 the boundaries between blocks. See also \ref in_loop_filter. 1184 1185- <b>CDEF Filter</b> The constrained directional enhancement filter (CDEF) 1186 allows the codec to apply a non-linear deringing filter along certain 1187 (potentially oblique) directions. A primary filter is applied along the 1188 selected direction, whilst a secondary filter is applied at 45 degrees to 1189 the primary direction. (See also \ref in_loop_cdef and 1190 <a href="https://arxiv.org/abs/2008.06091"> A Technical Overview of AV1</a>. 1191 1192- <b>Loop Restoration Filter</b> The loop restoration filter is applied after 1193 any prior post filtering stages. It acts on units of either 64 x 64, 1194 128 x 128, or 256 x 256 pixel blocks, refered to as loop restoration units. 1195 Each unit can independently select either to bypass filtering, use a Wiener 1196 filter, or use a self-guided filter. (See also \ref in_loop_restoration and 1197 <a href="https://arxiv.org/abs/2008.06091"> A Technical Overview of AV1</a>. 1198 1199\section architecture_entropy Entropy Coding 1200 1201\subsection architecture_entropy_aritmetic Arithmetic Coder 1202 1203VP9, used a binary arithmetic coder to encode symbols, where the propability 1204of a 1 or 0 at each descision node was based on a context model that took 1205into account recently coded values (for example previously coded coefficients 1206in the current block). A mechanism existed to update the context model each 1207frame, either explicitly in the bitstream, or implicitly at both the encoder 1208and decoder based on the observed frequency of different outcomes in the 1209previous frame. VP9 also supported seperate context models for different types 1210of frame (e.g. inter coded frames and key frames). 1211 1212In contrast, AV1 uses an M-ary symbol arithmetic coder to compress the syntax 1213elements, where integer \f$M\in[2, 14]\f$. This approach is based upon the entropy 1214coding strategy used in the Daala video codec and allows for some bit-level 1215parallelism in its implementation. AV1 also has an extended context model and 1216allows for updates to the probabilities on a per symbol basis as opposed to 1217the per frame strategy in VP9. 1218 1219To improve the performance / throughput of the arithmetic encoder, especially 1220in hardware implementations, the probability model is updated and maintained 1221at 15-bit precision, but the arithmetic encoder only uses the most significant 12229 bits when encoding a symbol. A more detailed discussion of the algorithm 1223and design constraints can be found in 1224<a href="https://arxiv.org/abs/2008.06091"> A Technical Overview of AV1</a>. 1225 1226TODO add references to key functions / files. 1227 1228As with VP9, a mechanism exists in AV1 to encode some elements into the 1229bitstream as uncrompresed bits or literal values, without using the arithmetic 1230coder. For example, some frame and sequence header values, where it is 1231beneficial to be able to read the values directly. 1232 1233TODO add references to key functions / files. 1234 1235\subsection architecture_entropy_coef Transform Coefficient Coding and Optimization 1236\image html coeff_coding.png "" width=70% 1237 1238\subsubsection architecture_entropy_coef_what Transform coefficient coding 1239Transform coefficient coding is where the encoder compresses a quantized version 1240of prediction residue into the bitstream. 1241 1242\paragraph architecture_entropy_coef_prepare Preparation - transform and quantize 1243Before the entropy coding stage, the encoder decouple the pixel-to-pixel 1244correlation of the prediction residue by transforming the residue from the 1245spatial domain to the frequency domain. Then the encoder quantizes the transform 1246coefficients to make the coefficients ready for entropy coding. 1247 1248\paragraph architecture_entropy_coef_coding The coding process 1249The encoder uses \ref av1_write_coeffs_txb() to write the coefficients of 1250a transform block into the bitstream. 1251The coding process has three stages. 12521. The encoder will code transform block skip flag (txb_skip). If the skip flag is 1253off, then the encoder will code the end of block position (eob) which is the scan 1254index of the last non-zero coefficient plus one. 12552. Second, the encoder will code lower magnitude levels of each coefficient in 1256reverse scan order. 12573. Finally, the encoder will code the sign and higher magnitude levels for each 1258coefficient if they are available. 1259 1260Related functions: 1261- \ref av1_write_coeffs_txb() 1262- write_inter_txb_coeff() 1263- \ref av1_write_intra_coeffs_mb() 1264 1265\paragraph architecture_entropy_coef_context Context information 1266To improve the compression efficiency, the encoder uses several context models 1267tailored for transform coefficients to capture the correlations between coding 1268symbols. Most of the context models are built to capture the correlations 1269between the coefficients within the same transform block. However, transform 1270block skip flag (txb_skip) and the sign of dc coefficient (dc_sign) require 1271context info from neighboring transform blocks. 1272 1273Here is how context info spread between transform blocks. Before coding a 1274transform block, the encoder will use get_txb_ctx() to collect the context 1275information from neighboring transform blocks. Then the context information 1276will be used for coding transform block skip flag (txb_skip) and the sign of 1277dc coefficient (dc_sign). After the transform block is coded, the encoder will 1278extract the context info from the current block using 1279\ref av1_get_txb_entropy_context(). Then encoder will store the context info 1280into a byte (uint8_t) using av1_set_entropy_contexts(). The encoder will use 1281the context info to code other transform blocks. 1282 1283Related functions: 1284- \ref av1_get_txb_entropy_context() 1285- av1_set_entropy_contexts() 1286- get_txb_ctx() 1287- \ref av1_update_intra_mb_txb_context() 1288 1289\subsubsection architecture_entropy_coef_rd RD optimization 1290Beside the actual entropy coding, the encoder uses several utility functions 1291to make optimal RD decisions. 1292 1293\paragraph architecture_entropy_coef_cost Entropy cost 1294The encoder uses \ref av1_cost_coeffs_txb() or \ref av1_cost_coeffs_txb_laplacian() 1295to estimate the entropy cost of a transform block. Note that 1296\ref av1_cost_coeffs_txb() is slower but accurate whereas 1297\ref av1_cost_coeffs_txb_laplacian() is faster but less accurate. 1298 1299Related functions: 1300- \ref av1_cost_coeffs_txb() 1301- \ref av1_cost_coeffs_txb_laplacian() 1302- \ref av1_cost_coeffs_txb_estimate() 1303 1304\paragraph architecture_entropy_coef_opt Quantized level optimization 1305Beside computing entropy cost, the encoder also uses \ref av1_optimize_txb() 1306to adjust the coefficient’s quantized levels to achieve optimal RD trade-off. 1307In \ref av1_optimize_txb(), the encoder goes through each quantized 1308coefficient and lowers the quantized coefficient level by one if the action 1309yields a better RD score. 1310 1311Related functions: 1312- \ref av1_optimize_txb() 1313 1314All the related functions are listed in \ref coefficient_coding. 1315 1316*/ 1317 1318/*!\defgroup encoder_algo Encoder Algorithm 1319 * 1320 * The encoder algorithm describes how a sequence is encoded, including high 1321 * level decision as well as algorithm used at every encoding stage. 1322 */ 1323 1324/*!\defgroup high_level_algo High-level Algorithm 1325 * \ingroup encoder_algo 1326 * This module describes sequence level/frame level algorithm in AV1. 1327 * More details will be added. 1328 * @{ 1329 */ 1330 1331/*!\defgroup speed_features Speed vs Quality Trade Off 1332 * \ingroup high_level_algo 1333 * This module describes the encode speed vs quality tradeoff 1334 * @{ 1335 */ 1336/*! @} - end defgroup speed_features */ 1337 1338/*!\defgroup src_frame_proc Source Frame Processing 1339 * \ingroup high_level_algo 1340 * This module describes algorithms in AV1 assosciated with the 1341 * pre-processing of source frames. See also \ref architecture_enc_src_proc 1342 * 1343 * @{ 1344 */ 1345/*! @} - end defgroup src_frame_proc */ 1346 1347/*!\defgroup rate_control Rate Control 1348 * \ingroup high_level_algo 1349 * This module describes rate control algorithm in AV1. 1350 * See also \ref architecture_enc_rate_ctrl 1351 * @{ 1352 */ 1353/*! @} - end defgroup rate_control */ 1354 1355/*!\defgroup tpl_modelling Temporal Dependency Modelling 1356 * \ingroup high_level_algo 1357 * This module includes algorithms to implement temporal dependency modelling. 1358 * See also \ref architecture_enc_tpl 1359 * @{ 1360 */ 1361/*! @} - end defgroup tpl_modelling */ 1362 1363/*!\defgroup two_pass_algo Two Pass Mode 1364 \ingroup high_level_algo 1365 1366 In two pass mode, the input file is passed into the encoder for a quick 1367 first pass, where statistics are gathered. These statistics and the input 1368 file are then passed back into the encoder for a second pass. The statistics 1369 help the encoder reach the desired bitrate without as much overshooting or 1370 undershooting. 1371 1372 During the first pass, the codec will return "stats" packets that contain 1373 information useful for the second pass. The caller should concatenate these 1374 packets as they are received. In the second pass, the concatenated packets 1375 are passed in, along with the frames to encode. During the second pass, 1376 "frame" packets are returned that represent the compressed video. 1377 1378 A complete example can be found in `examples/twopass_encoder.c`. Pseudocode 1379 is provided below to illustrate the core parts. 1380 1381 During the first pass, the uncompressed frames are passed in and stats 1382 information is appended to a byte array. 1383 1384~~~~~~~~~~~~~~~{.c} 1385// For simplicity, assume that there is enough memory in the stats buffer. 1386// Actual code will want to use a resizable array. stats_len represents 1387// the length of data already present in the buffer. 1388void get_stats_data(aom_codec_ctx_t *encoder, char *stats, 1389 size_t *stats_len, bool *got_data) { 1390 const aom_codec_cx_pkt_t *pkt; 1391 aom_codec_iter_t iter = NULL; 1392 while ((pkt = aom_codec_get_cx_data(encoder, &iter))) { 1393 *got_data = true; 1394 if (pkt->kind != AOM_CODEC_STATS_PKT) continue; 1395 memcpy(stats + *stats_len, pkt->data.twopass_stats.buf, 1396 pkt->data.twopass_stats.sz); 1397 *stats_len += pkt->data.twopass_stats.sz; 1398 } 1399} 1400 1401void first_pass(char *stats, size_t *stats_len) { 1402 struct aom_codec_enc_cfg first_pass_cfg; 1403 ... // Initialize the config as needed. 1404 first_pass_cfg.g_pass = AOM_RC_FIRST_PASS; 1405 aom_codec_ctx_t first_pass_encoder; 1406 ... // Initialize the encoder. 1407 1408 while (frame_available) { 1409 // Read in the uncompressed frame, update frame_available 1410 aom_image_t *frame_to_encode = ...; 1411 aom_codec_encode(&first_pass_encoder, img, pts, duration, flags); 1412 get_stats_data(&first_pass_encoder, stats, stats_len); 1413 } 1414 // After all frames have been processed, call aom_codec_encode with 1415 // a NULL ptr repeatedly, until no more data is returned. The NULL 1416 // ptr tells the encoder that no more frames are available. 1417 bool got_data; 1418 do { 1419 got_data = false; 1420 aom_codec_encode(&first_pass_encoder, NULL, pts, duration, flags); 1421 get_stats_data(&first_pass_encoder, stats, stats_len, &got_data); 1422 } while (got_data); 1423 1424 aom_codec_destroy(&first_pass_encoder); 1425} 1426~~~~~~~~~~~~~~~ 1427 1428 During the second pass, the uncompressed frames and the stats are 1429 passed into the encoder. 1430 1431~~~~~~~~~~~~~~~{.c} 1432// Write out each encoded frame to the file. 1433void get_cx_data(aom_codec_ctx_t *encoder, FILE *file, 1434 bool *got_data) { 1435 const aom_codec_cx_pkt_t *pkt; 1436 aom_codec_iter_t iter = NULL; 1437 while ((pkt = aom_codec_get_cx_data(encoder, &iter))) { 1438 *got_data = true; 1439 if (pkt->kind != AOM_CODEC_CX_FRAME_PKT) continue; 1440 fwrite(pkt->data.frame.buf, 1, pkt->data.frame.sz, file); 1441 } 1442} 1443 1444void second_pass(char *stats, size_t stats_len) { 1445 struct aom_codec_enc_cfg second_pass_cfg; 1446 ... // Initialize the config file as needed. 1447 second_pass_cfg.g_pass = AOM_RC_LAST_PASS; 1448 cfg.rc_twopass_stats_in.buf = stats; 1449 cfg.rc_twopass_stats_in.sz = stats_len; 1450 aom_codec_ctx_t second_pass_encoder; 1451 ... // Initialize the encoder from the config. 1452 1453 FILE *output = fopen("output.obu", "wb"); 1454 while (frame_available) { 1455 // Read in the uncompressed frame, update frame_available 1456 aom_image_t *frame_to_encode = ...; 1457 aom_codec_encode(&second_pass_encoder, img, pts, duration, flags); 1458 get_cx_data(&second_pass_encoder, output); 1459 } 1460 // Pass in NULL to flush the encoder. 1461 bool got_data; 1462 do { 1463 got_data = false; 1464 aom_codec_encode(&second_pass_encoder, NULL, pts, duration, flags); 1465 get_cx_data(&second_pass_encoder, output, &got_data); 1466 } while (got_data); 1467 1468 aom_codec_destroy(&second_pass_encoder); 1469} 1470~~~~~~~~~~~~~~~ 1471 */ 1472 1473 /*!\defgroup look_ahead_buffer The Look-Ahead Buffer 1474 \ingroup high_level_algo 1475 1476 A program should call \ref aom_codec_encode() for each frame that needs 1477 processing. These frames are internally copied and stored in a fixed-size 1478 circular buffer, known as the look-ahead buffer. Other parts of the code 1479 will use future frame information to inform current frame decisions; 1480 examples include the first-pass algorithm, TPL model, and temporal filter. 1481 Note that this buffer also keeps a reference to the last source frame. 1482 1483 The look-ahead buffer is defined in \ref av1/encoder/lookahead.h. It acts as an 1484 opaque structure, with an interface to create and free memory associated with 1485 it. It supports pushing and popping frames onto the structure in a FIFO 1486 fashion. It also allows look-ahead when using the \ref av1_lookahead_peek() 1487 function with a non-negative number, and look-behind when -1 is passed in (for 1488 the last source frame; e.g., firstpass will use this for motion estimation). 1489 The \ref av1_lookahead_depth() function returns the current number of frames 1490 stored in it. Note that \ref av1_lookahead_pop() is a bit of a misnomer - it 1491 only pops if either the "flush" variable is set, or the buffer is at maximum 1492 capacity. 1493 1494 The buffer is stored in the \ref AV1_PRIMARY::lookahead field. 1495 It is initialized in the first call to \ref aom_codec_encode(), in the 1496 \ref av1_receive_raw_frame() sub-routine. The buffer size is defined by 1497 the g_lag_in_frames parameter set in the 1498 \ref aom_codec_enc_cfg_t::g_lag_in_frames struct. 1499 This can be modified manually but should only be set once. On the command 1500 line, the flag "--lag-in-frames" controls it. The default size is 19 for 1501 non-realtime usage and 1 for realtime. Note that a maximum value of 35 is 1502 enforced. 1503 1504 A frame will stay in the buffer as long as possible. As mentioned above, 1505 the \ref av1_lookahead_pop() only removes a frame when either flush is set, 1506 or the buffer is full. Note that each call to \ref aom_codec_encode() inserts 1507 another frame into the buffer, and pop is called by the sub-function 1508 \ref av1_encode_strategy(). The buffer is told to flush when 1509 \ref aom_codec_encode() is passed a NULL image pointer. Note that the caller 1510 must repeatedly call \ref aom_codec_encode() with a NULL image pointer, until 1511 no more packets are available, in order to fully flush the buffer. 1512 1513 */ 1514 1515/*! @} - end defgroup high_level_algo */ 1516 1517/*!\defgroup partition_search Partition Search 1518 * \ingroup encoder_algo 1519 * For and overview of the partition search see \ref architecture_enc_partitions 1520 * @{ 1521 */ 1522 1523/*! @} - end defgroup partition_search */ 1524 1525/*!\defgroup intra_mode_search Intra Mode Search 1526 * \ingroup encoder_algo 1527 * This module describes intra mode search algorithm in AV1. 1528 * More details will be added. 1529 * @{ 1530 */ 1531/*! @} - end defgroup intra_mode_search */ 1532 1533/*!\defgroup inter_mode_search Inter Mode Search 1534 * \ingroup encoder_algo 1535 * This module describes inter mode search algorithm in AV1. 1536 * More details will be added. 1537 * @{ 1538 */ 1539/*! @} - end defgroup inter_mode_search */ 1540 1541/*!\defgroup palette_mode_search Palette Mode Search 1542 * \ingroup intra_mode_search 1543 * This module describes palette mode search algorithm in AV1. 1544 * More details will be added. 1545 * @{ 1546 */ 1547/*! @} - end defgroup palette_mode_search */ 1548 1549/*!\defgroup transform_search Transform Search 1550 * \ingroup encoder_algo 1551 * This module describes transform search algorithm in AV1. 1552 * @{ 1553 */ 1554/*! @} - end defgroup transform_search */ 1555 1556/*!\defgroup coefficient_coding Transform Coefficient Coding and Optimization 1557 * \ingroup encoder_algo 1558 * This module describes the algorithms of transform coefficient coding and optimization in AV1. 1559 * More details will be added. 1560 * @{ 1561 */ 1562/*! @} - end defgroup coefficient_coding */ 1563 1564/*!\defgroup in_loop_filter In-loop Filter 1565 * \ingroup encoder_algo 1566 * This module describes in-loop filter algorithm in AV1. 1567 * More details will be added. 1568 * @{ 1569 */ 1570/*! @} - end defgroup in_loop_filter */ 1571 1572/*!\defgroup in_loop_cdef CDEF 1573 * \ingroup encoder_algo 1574 * This module describes the CDEF parameter search algorithm 1575 * in AV1. More details will be added. 1576 * @{ 1577 */ 1578/*! @} - end defgroup in_loop_restoration */ 1579 1580/*!\defgroup in_loop_restoration Loop Restoration 1581 * \ingroup encoder_algo 1582 * This module describes the loop restoration search 1583 * and estimation algorithm in AV1. 1584 * More details will be added. 1585 * @{ 1586 */ 1587/*! @} - end defgroup in_loop_restoration */ 1588 1589/*!\defgroup cyclic_refresh Cyclic Refresh 1590 * \ingroup encoder_algo 1591 * This module describes the cyclic refresh (aq-mode=3) in AV1. 1592 * More details will be added. 1593 * @{ 1594 */ 1595/*! @} - end defgroup cyclic_refresh */ 1596 1597/*!\defgroup SVC Scalable Video Coding 1598 * \ingroup encoder_algo 1599 * This module describes scalable video coding algorithm in AV1. 1600 * More details will be added. 1601 * @{ 1602 */ 1603/*! @} - end defgroup SVC */ 1604/*!\defgroup variance_partition Variance Partition 1605 * \ingroup encoder_algo 1606 * This module describes variance partition algorithm in AV1. 1607 * More details will be added. 1608 * @{ 1609 */ 1610/*! @} - end defgroup variance_partition */ 1611/*!\defgroup nonrd_mode_search NonRD Optimized Mode Search 1612 * \ingroup encoder_algo 1613 * This module describes NonRD Optimized Mode Search used in Real-Time mode. 1614 * More details will be added. 1615 * @{ 1616 */ 1617/*! @} - end defgroup nonrd_mode_search */ 1618