1LZ4 Block Format Description 2============================ 3Last revised: 2018-04-25. 4Author : Yann Collet 5 6 7This specification is intended for developers 8willing to produce LZ4-compatible compressed data blocks 9using any programming language. 10 11LZ4 is an LZ77-type compressor with a fixed, byte-oriented encoding. 12There is no entropy encoder back-end nor framing layer. 13The latter is assumed to be handled by other parts of the system (see [LZ4 Frame format]). 14This design is assumed to favor simplicity and speed. 15It helps later on for optimizations, compactness, and features. 16 17This document describes only the block format, 18not how the compressor nor decompressor actually work. 19The correctness of the decompressor should not depend 20on implementation details of the compressor, and vice versa. 21 22[LZ4 Frame format]: lz4_Frame_format.md 23 24 25 26Compressed block format 27----------------------- 28An LZ4 compressed block is composed of sequences. 29A sequence is a suite of literals (not-compressed bytes), 30followed by a match copy. 31 32Each sequence starts with a `token`. 33The `token` is a one byte value, separated into two 4-bits fields. 34Therefore each field ranges from 0 to 15. 35 36 37The first field uses the 4 high-bits of the token. 38It provides the length of literals to follow. 39 40If the field value is 0, then there is no literal. 41If it is 15, then we need to add some more bytes to indicate the full length. 42Each additional byte then represent a value from 0 to 255, 43which is added to the previous value to produce a total length. 44When the byte value is 255, another byte is output. 45There can be any number of bytes following `token`. There is no "size limit". 46(Side note : this is why a not-compressible input block is expanded by 0.4%). 47 48Example 1 : A literal length of 48 will be represented as : 49 50 - 15 : value for the 4-bits High field 51 - 33 : (=48-15) remaining length to reach 48 52 53Example 2 : A literal length of 280 will be represented as : 54 55 - 15 : value for the 4-bits High field 56 - 255 : following byte is maxed, since 280-15 >= 255 57 - 10 : (=280 - 15 - 255) ) remaining length to reach 280 58 59Example 3 : A literal length of 15 will be represented as : 60 61 - 15 : value for the 4-bits High field 62 - 0 : (=15-15) yes, the zero must be output 63 64Following `token` and optional length bytes, are the literals themselves. 65They are exactly as numerous as previously decoded (length of literals). 66It's possible that there are zero literal. 67 68 69Following the literals is the match copy operation. 70 71It starts by the `offset`. 72This is a 2 bytes value, in little endian format 73(the 1st byte is the "low" byte, the 2nd one is the "high" byte). 74 75The `offset` represents the position of the match to be copied from. 761 means "current position - 1 byte". 77The maximum `offset` value is 65535, 65536 cannot be coded. 78Note that 0 is an invalid value, not used. 79 80Then we need to extract the `matchlength`. 81For this, we use the second token field, the low 4-bits. 82Value, obviously, ranges from 0 to 15. 83However here, 0 means that the copy operation will be minimal. 84The minimum length of a match, called `minmatch`, is 4. 85As a consequence, a 0 value means 4 bytes, and a value of 15 means 19+ bytes. 86Similar to literal length, on reaching the highest possible value (15), 87we output additional bytes, one at a time, with values ranging from 0 to 255. 88They are added to total to provide the final match length. 89A 255 value means there is another byte to read and add. 90There is no limit to the number of optional bytes that can be output this way. 91(This points towards a maximum achievable compression ratio of about 250). 92 93Decoding the `matchlength` reaches the end of current sequence. 94Next byte will be the start of another sequence. 95But before moving to next sequence, 96it's time to use the decoded match position and length. 97The decoder copies `matchlength` bytes from match position to current position. 98 99In some cases, `matchlength` is larger than `offset`. 100Therefore, `match_pos + matchlength > current_pos`, 101which means that later bytes to copy are not yet decoded. 102This is called an "overlap match", and must be handled with special care. 103A common case is an offset of 1, 104meaning the last byte is repeated `matchlength` times. 105 106 107Parsing restrictions 108----------------------- 109There are specific parsing rules to respect in order to remain compatible 110with assumptions made by the decoder : 111 1121. The last 5 bytes are always literals. In other words, the last five bytes 113 from the uncompressed input (or all bytes, if the input has less than five 114 bytes) must be encoded as literals on behalf of the last sequence. 115 The last sequence is incomplete, and stops right after the literals. 1162. The last match must start at least 12 bytes before end of block. 117 The last match is part of the penultimate sequence, 118 since the last sequence stops right after literals. 119 Note that, as a consequence, blocks < 13 bytes cannot be compressed. 120 121These rules are in place to ensure that the decoder 122can speculatively execute copy instructions 123without ever reading nor writing beyond provided I/O buffers. 124 1251. To copy literals from a non-last sequence, an 8-byte copy instruction 126 can always be safely issued (without reading past the input), 127 because literals are followed by a 2-byte offset, 128 and last sequence is at least 1+5 bytes long. 1292. Similarly, a match operation can speculatively copy up to 12 bytes 130 while remaining within output buffer boundaries. 131 132Empty inputs can be represented with a zero byte, 133interpreted as a token without literals and without a match. 134 135 136Additional notes 137----------------------- 138There is no assumption nor limits to the way the compressor 139searches and selects matches within the source data block. 140It could be a fast scan, a multi-probe, a full search using BST, 141standard hash chains or MMC, well whatever. 142 143Advanced parsing strategies can also be implemented, such as lazy match, 144or full optimal parsing. 145 146All these trade-off offer distinctive speed/memory/compression advantages. 147Whatever the method used by the compressor, its result will be decodable 148by any LZ4 decoder if it follows the format specification described above. 149