1LZ4 Block Format Description 2============================ 3Last revised: 2022-07-31 . 4Author : Yann Collet 5 6 7This specification is intended for developers willing to 8produce or read LZ4 compressed data blocks 9using any programming language of their choice. 10 11LZ4 is an LZ77-type compressor with a fixed byte-oriented encoding format. 12There is no entropy encoder back-end nor framing layer. 13The latter is assumed to be handled by other parts of the system 14(see [LZ4 Frame format]). 15This design is assumed to favor simplicity and speed. 16 17This document describes only the Block Format, 18not how the compressor nor decompressor actually work. 19For more details on such topics, see later section "Implementation Notes". 20 21[LZ4 Frame format]: lz4_Frame_format.md 22 23 24 25Compressed block format 26----------------------- 27An LZ4 compressed block is composed of sequences. 28A sequence is a suite of literals (not-compressed bytes), 29followed by a match copy operation. 30 31Each sequence starts with a `token`. 32The `token` is a one byte value, separated into two 4-bits fields. 33Therefore each field ranges from 0 to 15. 34 35 36The first field uses the 4 high-bits of the token. 37It provides the length of literals to follow. 38 39If the field value is smaller than 15, 40then it represents the total nb of literals present in the sequence, 41including 0, in which case there is no literal. 42 43The value 15 is a special case: more bytes are required to indicate the full length. 44Each additional byte then represents a value from 0 to 255, 45which is added to the previous value to produce a total length. 46When the byte value is 255, another byte must be read and added, and so on. 47There can be any number of bytes of value `255` following `token`. 48The Block Format does not define any "size limit", 49though real implementations may feature some practical limits 50(see more details in later chapter "Implementation Notes"). 51 52Note : this format explains why a non-compressible input block is expanded by 0.4%. 53 54Example 1 : A literal length of 48 will be represented as : 55 56 - 15 : value for the 4-bits High field 57 - 33 : (=48-15) remaining length to reach 48 58 59Example 2 : A literal length of 280 will be represented as : 60 61 - 15 : value for the 4-bits High field 62 - 255 : following byte is maxed, since 280-15 >= 255 63 - 10 : (=280 - 15 - 255) remaining length to reach 280 64 65Example 3 : A literal length of 15 will be represented as : 66 67 - 15 : value for the 4-bits High field 68 - 0 : (=15-15) yes, the zero must be output 69 70Following `token` and optional length bytes, are the literals themselves. 71They are exactly as numerous as just decoded (length of literals). 72Reminder: it's possible that there are zero literals. 73 74 75Following the literals is the match copy operation. 76 77It starts by the `offset` value. 78This is a 2 bytes value, in little endian format 79(the 1st byte is the "low" byte, the 2nd one is the "high" byte). 80 81The `offset` represents the position of the match to be copied from the past. 82For example, 1 means "current position - 1 byte". 83The maximum `offset` value is 65535. 65536 and beyond cannot be coded. 84Note that 0 is an invalid `offset` value. 85The presence of a 0 `offset` value denotes an invalid (corrupted) block. 86 87Then the `matchlength` can be extracted. 88For this, we use the second `token` field, the low 4-bits. 89Such a value, obviously, ranges from 0 to 15. 90However here, 0 means that the copy operation is minimal. 91The minimum length of a match, called `minmatch`, is 4. 92As a consequence, a 0 value means 4 bytes. 93Similarly to literal length, any value smaller than 15 represents a length, 94to which 4 (`minmatch`) must be added, thus ranging from 4 to 18. 95A value of 15 is special, meaning 19+ bytes, 96to which one must read additional bytes, one at a time, 97with each byte value ranging from 0 to 255. 98They are added to total to provide the final match length. 99A 255 value means there is another byte to read and add. 100There is no limit to the number of optional `255` bytes that can be present, 101and therefore no limit to representable match length, 102though real-life implementations are likely going to enforce limits for practical reasons (see more details in "Implementation Notes" section below). 103 104Note: this format has a maximum achievable compression ratio of about ~250. 105 106Decoding the `matchlength` reaches the end of current sequence. 107Next byte will be the start of another sequence, and therefore a new `token`. 108 109 110End of block conditions 111------------------------- 112There are specific restrictions required to terminate an LZ4 block. 113 1141. The last sequence contains only literals. 115 The block ends right after the literals (no `offset` field). 1162. The last 5 bytes of input are always literals. 117 Therefore, the last sequence contains at least 5 bytes. 118 - Special : if input is smaller than 5 bytes, 119 there is only one sequence, it contains the whole input as literals. 120 Even empty input can be represented, using a zero byte, 121 interpreted as a final token without literal and without a match. 1223. The last match must start at least 12 bytes before the end of block. 123 The last match is part of the _penultimate_ sequence. 124 It is followed by the last sequence, which contains _only_ literals. 125 - Note that, as a consequence, 126 blocks < 12 bytes cannot be compressed. 127 And as an extension, _independent_ blocks < 13 bytes cannot be compressed, 128 because they must start by at least one literal, 129 that the match can then copy afterwards. 130 131When a block does not respect these end conditions, 132a conformant decoder is allowed to reject the block as incorrect. 133 134These rules are in place to ensure compatibility with 135a wide range of historical decoders 136which rely on these conditions for their speed-oriented design. 137 138Implementation notes 139----------------------- 140The LZ4 Block Format only defines the compressed format, 141it does not tell how to create a decoder or an encoder, 142which design is left free to the imagination of the implementer. 143 144However, thanks to experience, there are a number of typical topics that 145most implementations will have to consider. 146This section tries to provide a few guidelines. 147 148#### Metadata 149 150An LZ4-compressed Block requires additional metadata for proper decoding. 151Typically, a decoder will require the compressed block's size, 152and an upper bound of decompressed size. 153Other variants exist, such as knowing the decompressed size, 154and having an upper bound of the input size. 155The Block Format does not specify how to transmit such information, 156which is considered an out-of-band information channel. 157That's because in many cases, the information is present in the environment. 158For example, databases must store the size of their compressed block for indexing, 159and know that their decompressed block can't be larger than a certain threshold. 160 161If you need a format which is "self-contained", 162and also transports the necessary metadata for proper decoding on any platform, 163consider employing the [LZ4 Frame format] instead. 164 165#### Large lengths 166 167While the Block Format does not define any maximum value for length fields, 168in practice, most implementations will feature some form of limit, 169since it's expected for such values to be stored into registers of fixed bit width. 170 171If length fields use 64-bit registers, 172then it can be assumed that there is no practical limit, 173as it would require a single continuous block of multiple petabytes to reach it, 174which is unreasonable by today's standard. 175 176If length fields use 32-bit registers, then it can be overflowed, 177but requires a compressed block of size > 16 MB. 178Therefore, implementations that do not deal with compressed blocks > 16 MB are safe. 179However, if such a case is allowed, 180then it's recommended to check that no large length overflows the register. 181 182If length fields use 16-bit registers, 183then it's definitely possible to overflow such register, 184with less than < 300 bytes of compressed data. 185 186A conformant decoder should be able to detect length overflows when it's possible, 187and simply error out when that happens. 188The input block might not be invalid, 189it's just not decodable by the local decoder implementation. 190 191Note that, in order to be compatible with the larger LZ4 ecosystem, 192it's recommended to be able to read and represent lengths of up to 4 MB, 193and to accept blocks of size up to 4 MB. 194Such limits are compatible with 32-bit length registers, 195and prevent overflow of 32-bit registers. 196 197#### Safe decoding 198 199If a decoder receives compressed data from any external source, 200it is recommended to ensure that the decoder is resilient to corrupted input, 201and made safe from buffer overflow manipulations. 202Always ensure that read and write operations 203remain within the limits of provided buffers. 204 205Of particular importance, ensure that the nb of bytes instructed to copy 206does not overflow neither the input nor the output buffers. 207Ensure also, when reading an offset value, that the resulting position to copy 208does not reach beyond the beginning of the buffer. 209Such a situation can happen during the first 64 KB of decoded data. 210 211For more safety, test the decoder with fuzzers 212to ensure it's resilient to improbable sequences of conditions. 213Combine them with sanitizers, in order to catch overflows (asan) 214or initialization issues (msan). 215 216Pay some attention to offset 0 scenario, which is invalid, 217and therefore must not be blindly decoded: 218a naive implementation could preserve destination buffer content, 219which could then result in information disclosure 220if such buffer was uninitialized and still containing private data. 221For reference, in such a scenario, the reference LZ4 decoder 222clears the match segment with `0` bytes, 223though other solutions are certainly possible. 224 225Finally, pay attention to the "overlap match" scenario, 226when `matchlength` is larger than `offset`. 227In which case, since `match_pos + matchlength > current_pos`, 228some of the later bytes to copy do not exist yet, 229and will be generated during the early stage of match copy operation. 230Such scenario must be handled with special care. 231A common case is an offset of 1, 232meaning the last byte is repeated `matchlength` times. 233 234#### Compression techniques 235 236The core of a LZ4 compressor is to detect duplicated data across past 64 KB. 237The format makes no assumption nor limits to the way a compressor 238searches and selects matches within the source data block. 239For example, an upper compression limit can be reached, 240using a technique called "full optimal parsing", at high cpu and memory cost. 241But multiple other techniques can be considered, 242featuring distinct time / performance trade-offs. 243As long as the specified format is respected, 244the result will be compatible with and decodable by any compliant decoder. 245