• Home
  • Line#
  • Scopes#
  • Navigate#
  • Raw
  • Download
1LZ4 Block Format Description
2============================
3Last revised: 2022-07-31 .
4Author : Yann Collet
5
6
7This specification is intended for developers willing to
8produce or read LZ4 compressed data blocks
9using any programming language of their choice.
10
11LZ4 is an LZ77-type compressor with a fixed byte-oriented encoding format.
12There is no entropy encoder back-end nor framing layer.
13The latter is assumed to be handled by other parts of the system
14(see [LZ4 Frame format]).
15This design is assumed to favor simplicity and speed.
16
17This document describes only the Block Format,
18not how the compressor nor decompressor actually work.
19For more details on such topics, see later section "Implementation Notes".
20
21[LZ4 Frame format]: lz4_Frame_format.md
22
23
24
25Compressed block format
26-----------------------
27An LZ4 compressed block is composed of sequences.
28A sequence is a suite of literals (not-compressed bytes),
29followed by a match copy operation.
30
31Each sequence starts with a `token`.
32The `token` is a one byte value, separated into two 4-bits fields.
33Therefore each field ranges from 0 to 15.
34
35
36The first field uses the 4 high-bits of the token.
37It provides the length of literals to follow.
38
39If the field value is smaller than 15,
40then it represents the total nb of literals present in the sequence,
41including 0, in which case there is no literal.
42
43The value 15 is a special case: more bytes are required to indicate the full length.
44Each additional byte then represents a value from 0 to 255,
45which is added to the previous value to produce a total length.
46When the byte value is 255, another byte must be read and added, and so on.
47There can be any number of bytes of value `255` following `token`.
48The Block Format does not define any "size limit",
49though real implementations may feature some practical limits
50(see more details in later chapter "Implementation Notes").
51
52Note : this format explains why a non-compressible input block is expanded by 0.4%.
53
54Example 1 : A literal length of 48 will be represented as :
55
56  - 15 : value for the 4-bits High field
57  - 33 : (=48-15) remaining length to reach 48
58
59Example 2 : A literal length of 280 will be represented as :
60
61  - 15  : value for the 4-bits High field
62  - 255 : following byte is maxed, since 280-15 >= 255
63  - 10  : (=280 - 15 - 255) remaining length to reach 280
64
65Example 3 : A literal length of 15 will be represented as :
66
67  - 15 : value for the 4-bits High field
68  - 0  : (=15-15) yes, the zero must be output
69
70Following `token` and optional length bytes, are the literals themselves.
71They are exactly as numerous as just decoded (length of literals).
72Reminder: it's possible that there are zero literals.
73
74
75Following the literals is the match copy operation.
76
77It starts by the `offset` value.
78This is a 2 bytes value, in little endian format
79(the 1st byte is the "low" byte, the 2nd one is the "high" byte).
80
81The `offset` represents the position of the match to be copied from the past.
82For example, 1 means "current position - 1 byte".
83The maximum `offset` value is 65535. 65536 and beyond cannot be coded.
84Note that 0 is an invalid `offset` value.
85The presence of a 0 `offset` value denotes an invalid (corrupted) block.
86
87Then the `matchlength` can be extracted.
88For this, we use the second `token` field, the low 4-bits.
89Such a value, obviously, ranges from 0 to 15.
90However here, 0 means that the copy operation is minimal.
91The minimum length of a match, called `minmatch`, is 4.
92As a consequence, a 0 value means 4 bytes.
93Similarly to literal length, any value smaller than 15 represents a length,
94to which 4 (`minmatch`) must be added, thus ranging from 4 to 18.
95A value of 15 is special, meaning 19+ bytes,
96to which one must read additional bytes, one at a time,
97with each byte value ranging from 0 to 255.
98They are added to total to provide the final match length.
99A 255 value means there is another byte to read and add.
100There is no limit to the number of optional `255` bytes that can be present,
101and therefore no limit to representable match length,
102though real-life implementations are likely going to enforce limits for practical reasons (see more details in "Implementation Notes" section below).
103
104Note: this format has a maximum achievable compression ratio of about ~250.
105
106Decoding the `matchlength` reaches the end of current sequence.
107Next byte will be the start of another sequence, and therefore a new `token`.
108
109
110End of block conditions
111-------------------------
112There are specific restrictions required to terminate an LZ4 block.
113
1141. The last sequence contains only literals.
115   The block ends right after the literals (no `offset` field).
1162. The last 5 bytes of input are always literals.
117   Therefore, the last sequence contains at least 5 bytes.
118   - Special : if input is smaller than 5 bytes,
119     there is only one sequence, it contains the whole input as literals.
120     Even empty input can be represented, using a zero byte,
121     interpreted as a final token without literal and without a match.
1223. The last match must start at least 12 bytes before the end of block.
123   The last match is part of the _penultimate_ sequence.
124   It is followed by the last sequence, which contains _only_ literals.
125   - Note that, as a consequence,
126     blocks < 12 bytes cannot be compressed.
127     And as an extension, _independent_ blocks < 13 bytes cannot be compressed,
128     because they must start by at least one literal,
129     that the match can then copy afterwards.
130
131When a block does not respect these end conditions,
132a conformant decoder is allowed to reject the block as incorrect.
133
134These rules are in place to ensure compatibility with
135a wide range of historical decoders
136which rely on these conditions for their speed-oriented design.
137
138Implementation notes
139-----------------------
140The LZ4 Block Format only defines the compressed format,
141it does not tell how to create a decoder or an encoder,
142which design is left free to the imagination of the implementer.
143
144However, thanks to experience, there are a number of typical topics that
145most implementations will have to consider.
146This section tries to provide a few guidelines.
147
148#### Metadata
149
150An LZ4-compressed Block requires additional metadata for proper decoding.
151Typically, a decoder will require the compressed block's size,
152and an upper bound of decompressed size.
153Other variants exist, such as knowing the decompressed size,
154and having an upper bound of the input size.
155The Block Format does not specify how to transmit such information,
156which is considered an out-of-band information channel.
157That's because in many cases, the information is present in the environment.
158For example, databases must store the size of their compressed block for indexing,
159and know that their decompressed block can't be larger than a certain threshold.
160
161If you need a format which is "self-contained",
162and also transports the necessary metadata for proper decoding on any platform,
163consider employing the [LZ4 Frame format] instead.
164
165#### Large lengths
166
167While the Block Format does not define any maximum value for length fields,
168in practice, most implementations will feature some form of limit,
169since it's expected for such values to be stored into registers of fixed bit width.
170
171If length fields use 64-bit registers,
172then it can be assumed that there is no practical limit,
173as it would require a single continuous block of multiple petabytes to reach it,
174which is unreasonable by today's standard.
175
176If length fields use 32-bit registers, then it can be overflowed,
177but requires a compressed block of size > 16 MB.
178Therefore, implementations that do not deal with compressed blocks > 16 MB are safe.
179However, if such a case is allowed,
180then it's recommended to check that no large length overflows the register.
181
182If length fields use 16-bit registers,
183then it's definitely possible to overflow such register,
184with less than < 300 bytes of compressed data.
185
186A conformant decoder should be able to detect length overflows when it's possible,
187and simply error out when that happens.
188The input block might not be invalid,
189it's just not decodable by the local decoder implementation.
190
191Note that, in order to be compatible with the larger LZ4 ecosystem,
192it's recommended to be able to read and represent lengths of up to 4 MB,
193and to accept blocks of size up to 4 MB.
194Such limits are compatible with 32-bit length registers,
195and prevent overflow of 32-bit registers.
196
197#### Safe decoding
198
199If a decoder receives compressed data from any external source,
200it is recommended to ensure that the decoder is resilient to corrupted input,
201and made safe from buffer overflow manipulations.
202Always ensure that read and write operations
203remain within the limits of provided buffers.
204
205Of particular importance, ensure that the nb of bytes instructed to copy
206does not overflow neither the input nor the output buffers.
207Ensure also, when reading an offset value, that the resulting position to copy
208does not reach beyond the beginning of the buffer.
209Such a situation can happen during the first 64 KB of decoded data.
210
211For more safety, test the decoder with fuzzers
212to ensure it's resilient to improbable sequences of conditions.
213Combine them with sanitizers, in order to catch overflows (asan)
214or initialization issues (msan).
215
216Pay some attention to offset 0 scenario, which is invalid,
217and therefore must not be blindly decoded:
218a naive implementation could preserve destination buffer content,
219which could then result in information disclosure
220if such buffer was uninitialized and still containing private data.
221For reference, in such a scenario, the reference LZ4 decoder
222clears the match segment with `0` bytes,
223though other solutions are certainly possible.
224
225Finally, pay attention to the "overlap match" scenario,
226when `matchlength` is larger than `offset`.
227In which case, since `match_pos + matchlength > current_pos`,
228some of the later bytes to copy do not exist yet,
229and will be generated during the early stage of match copy operation.
230Such scenario must be handled with special care.
231A common case is an offset of 1,
232meaning the last byte is repeated `matchlength` times.
233
234#### Compression techniques
235
236The core of a LZ4 compressor is to detect duplicated data across past 64 KB.
237The format makes no assumption nor limits to the way a compressor
238searches and selects matches within the source data block.
239For example, an upper compression limit can be reached,
240using a technique called "full optimal parsing", at high cpu and memory cost.
241But multiple other techniques can be considered,
242featuring distinct time / performance trade-offs.
243As long as the specified format is respected,
244the result will be compatible with and decodable by any compliant decoder.
245