1LZ4 Block Format Description 2============================ 3Last revised: 2015-05-07. 4Author : Yann Collet 5 6 7This specification is intended for developers 8willing to produce LZ4-compatible compressed data blocks 9using any programming language. 10 11LZ4 is an LZ77-type compressor with a fixed, byte-oriented encoding. 12There is no entropy encoder back-end nor framing layer. 13The latter is assumed to be handled by other parts of the system (see [LZ4 Frame format]). 14This design is assumed to favor simplicity and speed. 15It helps later on for optimizations, compactness, and features. 16 17This document describes only the block format, 18not how the compressor nor decompressor actually work. 19The correctness of the decompressor should not depend 20on implementation details of the compressor, and vice versa. 21 22[LZ4 Frame format]: lz4_Frame_format.md 23 24 25 26Compressed block format 27----------------------- 28An LZ4 compressed block is composed of sequences. 29A sequence is a suite of literals (not-compressed bytes), 30followed by a match copy. 31 32Each sequence starts with a token. 33The token is a one byte value, separated into two 4-bits fields. 34Therefore each field ranges from 0 to 15. 35 36 37The first field uses the 4 high-bits of the token. 38It provides the length of literals to follow. 39 40If the field value is 0, then there is no literal. 41If it is 15, then we need to add some more bytes to indicate the full length. 42Each additional byte then represent a value from 0 to 255, 43which is added to the previous value to produce a total length. 44When the byte value is 255, another byte is output. 45There can be any number of bytes following the token. There is no "size limit". 46(Side note : this is why a not-compressible input block is expanded by 0.4%). 47 48Example 1 : A length of 48 will be represented as : 49 50 - 15 : value for the 4-bits High field 51 - 33 : (=48-15) remaining length to reach 48 52 53Example 2 : A length of 280 will be represented as : 54 55 - 15 : value for the 4-bits High field 56 - 255 : following byte is maxed, since 280-15 >= 255 57 - 10 : (=280 - 15 - 255) ) remaining length to reach 280 58 59Example 3 : A length of 15 will be represented as : 60 61 - 15 : value for the 4-bits High field 62 - 0 : (=15-15) yes, the zero must be output 63 64Following the token and optional length bytes, are the literals themselves. 65They are exactly as numerous as previously decoded (length of literals). 66It's possible that there are zero literal. 67 68 69Following the literals is the match copy operation. 70 71It starts by the offset. 72This is a 2 bytes value, in little endian format 73(the 1st byte is the "low" byte, the 2nd one is the "high" byte). 74 75The offset represents the position of the match to be copied from. 761 means "current position - 1 byte". 77The maximum offset value is 65535, 65536 cannot be coded. 78Note that 0 is an invalid value, not used. 79 80Then we need to extract the match length. 81For this, we use the second token field, the low 4-bits. 82Value, obviously, ranges from 0 to 15. 83However here, 0 means that the copy operation will be minimal. 84The minimum length of a match, called minmatch, is 4. 85As a consequence, a 0 value means 4 bytes, and a value of 15 means 19+ bytes. 86Similar to literal length, on reaching the highest possible value (15), 87we output additional bytes, one at a time, with values ranging from 0 to 255. 88They are added to total to provide the final match length. 89A 255 value means there is another byte to read and add. 90There is no limit to the number of optional bytes that can be output this way. 91(This points towards a maximum achievable compression ratio of about 250). 92 93With the offset and the matchlength, 94the decoder can now proceed to copy the data from the already decoded buffer. 95On decoding the matchlength, we reach the end of the compressed sequence, 96and therefore start another one. 97 98 99Parsing restrictions 100----------------------- 101There are specific parsing rules to respect in order to remain compatible 102with assumptions made by the decoder : 103 1041. The last 5 bytes are always literals 1052. The last match must start at least 12 bytes before end of block. 106 Consequently, a block with less than 13 bytes cannot be compressed. 107 108These rules are in place to ensure that the decoder 109will never read beyond the input buffer, nor write beyond the output buffer. 110 111Note that the last sequence is also incomplete, 112and stops right after literals. 113 114 115Additional notes 116----------------------- 117There is no assumption nor limits to the way the compressor 118searches and selects matches within the source data block. 119It could be a fast scan, a multi-probe, a full search using BST, 120standard hash chains or MMC, well whatever. 121 122Advanced parsing strategies can also be implemented, such as lazy match, 123or full optimal parsing. 124 125All these trade-off offer distinctive speed/memory/compression advantages. 126Whatever the method used by the compressor, its result will be decodable 127by any LZ4 decoder if it follows the format specification described above. 128