LZ4 Frame Format Description
============================

### Notices

Copyright (c) 2013-2015 Yann Collet

Permission is granted to copy and distribute this document
for any purpose and without charge,
including translations into other languages
and incorporation into compilations,
provided that the copyright notice and this notice are preserved,
and that any substantive changes or deletions from the original
are clearly marked.
Distribution of this document is unlimited.

### Version

1.6.1 (30/01/2018)


Introduction
------------

The purpose of this document is to define a lossless compressed data format
that is independent of CPU type, operating system,
file system and character set, suitable for
file compression, pipe and streaming compression
using the [LZ4 algorithm](http://www.lz4.org).

The data can be produced or consumed,
even for an arbitrarily long sequentially presented input data stream,
using only an a priori bounded amount of intermediate storage,
and hence can be used in data communications.
The format uses the LZ4 compression method,
and optional [xxHash-32 checksum method](https://github.com/Cyan4973/xxHash),
for detection of data corruption.

The data format defined by this specification
does not attempt to allow random access to compressed data.

This specification is intended for use by implementers of software
to compress data into LZ4 format and/or decompress data from LZ4 format.
The text of the specification assumes a basic background in programming
at the level of bits and other primitive data representations.

Unless otherwise indicated below,
a compliant compressor must produce data sets
that conform to the specifications presented here.
It does not need to support every option, though.

A compliant decompressor must be able to decompress
at least one working set of parameters
that conforms to the specifications presented here.
It may also ignore checksums.
Whenever it does not support a specific parameter within the compressed stream,
it must produce a non-ambiguous error code
and associated error message explaining which parameter is unsupported.


General Structure of LZ4 Frame format
-------------------------------------

| MagicNb | F. Descriptor | Block | (...) | EndMark | C. Checksum |
|:-------:|:-------------:| ----- | ----- | ------- | ----------- |
| 4 bytes |  3-15 bytes   |       |       | 4 bytes | 0-4 bytes   |

__Magic Number__

4 Bytes, Little endian format.
Value : 0x184D2204

__Frame Descriptor__

3 to 15 Bytes, to be detailed in its own paragraph,
as it is the most important part of the spec.

The combined __Magic Number__ and __Frame Descriptor__ fields are sometimes
called ___LZ4 Frame Header___. Its size varies between 7 and 19 bytes.

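As a minimal sketch of what the little-endian convention means in practice,
the check below reads the first 4 bytes as an unsigned 32-bit integer and compares it to the magic value,
so the stored byte sequence is `04 22 4D 18`.
The helper name `has_lz4_frame_magic` is hypothetical and not part of this specification.

```python
import struct

LZ4_FRAME_MAGIC = 0x184D2204  # value given by this specification

def has_lz4_frame_magic(buf: bytes) -> bool:
    """Return True if buf starts with the LZ4 frame magic number."""
    if len(buf) < 4:
        return False
    (magic,) = struct.unpack_from("<I", buf, 0)  # "<I" = little-endian uint32
    return magic == LZ4_FRAME_MAGIC
```
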
__Data Blocks__

To be detailed in its own paragraph.
That’s where compressed data is stored.

__EndMark__

The flow of blocks ends when the last data block has a size of “0”.
The size is expressed as a 32-bit value.

__Content Checksum__

Content Checksum verifies that the full content has been decoded correctly.
The content checksum is the result
of [xxh32() hash function](https://github.com/Cyan4973/xxHash)
digesting the original (decoded) data as input, and a seed of zero.
Content checksum is only present when its associated flag
is set in the frame descriptor.
Content Checksum validates the result:
it confirms that all blocks were fully transmitted in the correct order and without error,
and also that the encoding/decoding process itself generated no distortion.
Its usage is recommended.

The combined __EndMark__ and __Content Checksum__ fields might sometimes be
referred to as ___LZ4 Frame Footer___. Its size varies between 4 and 8 bytes.

__Frame Concatenation__

In some circumstances, it may be preferable to append multiple frames,
for example in order to add new data to an existing compressed file
without re-framing it.

In such a case, each frame has its own set of descriptor flags.
Each frame is considered independent.
The only relation between frames is their sequential order.

The ability to decode multiple concatenated frames
within a single stream or file
is left outside of this specification.
As an example, the reference lz4 command line utility behavior is
to decode all concatenated frames in their sequential order.


Frame Descriptor
----------------

| FLG     | BD      | (Content Size) | (Dictionary ID) | HC      |
| ------- | ------- |:--------------:|:---------------:| ------- |
| 1 byte  | 1 byte  |  0 - 8 bytes   |   0 - 4 bytes   | 1 byte  |

The descriptor uses a minimum of 3 bytes,
and up to 15 bytes depending on optional parameters.

__FLG byte__

|  BitNb  |  7-6  |   5   |    4     |  3   |    2     |    1     |   0  |
| ------- |-------|-------|----------|------|----------|----------|------|
|FieldName|Version|B.Indep|B.Checksum|C.Size|C.Checksum|*Reserved*|DictID|


__BD byte__

|  BitNb  |     7    |     6-5-4     |  3-2-1-0 |
| ------- | -------- | ------------- | -------- |
|FieldName|*Reserved*| Block MaxSize |*Reserved*|

In the tables, bit 7 is highest bit, while bit 0 is lowest.

__Version Number__

2-bit field, must be set to `01`.
Any other value cannot be decoded by this version of the specification.
Other version numbers will use different flag layouts.

__Block Independence flag__

If this flag is set to “1”, blocks are independent.
If this flag is set to “0”, each block depends on previous ones
(up to LZ4 window size, which is 64 KB).
In such a case, it’s necessary to decode all blocks in sequence.

Block dependency improves compression ratio, especially for small blocks.
On the other hand, it makes random access or multi-threaded decoding impossible.

__Block checksum flag__

If this flag is set, each data block will be followed by a 4-byte checksum,
calculated by using the xxHash-32 algorithm on the raw (compressed) data block.
The intention is to detect data corruption (storage or transmission errors)
immediately, before decoding.
Block checksum usage is optional.

__Content Size flag__

If this flag is set, the uncompressed size of data included within the frame
will be present as an 8-byte unsigned little-endian value, after the flags.
Content Size usage is optional.

__Content checksum flag__

If this flag is set, a 32-bit content checksum will be appended
after the EndMark.

__Dictionary ID flag__

If this flag is set, a 4-byte Dict-ID field will be present,
after the descriptor flags and the Content Size.

__Block Maximum Size__

This information is useful to help the decoder allocate memory.
Size here refers to the original (uncompressed) data size.
Block Maximum Size is one of the values in the following table:

|  0  |  1  |  2  |  3  |   4   |   5    |  6   |  7   |
| --- | --- | --- | --- | ----- | ------ | ---- | ---- |
| N/A | N/A | N/A | N/A | 64 KB | 256 KB | 1 MB | 4 MB |

The decoder may refuse to allocate block sizes above any system-specific size.
Unused values may be used in a future revision of the spec.
A decoder conformant with the current version of the spec
is only able to decode block sizes defined in this spec.

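The table lookup can be sketched as follows.
This is only an illustration: the names `BLOCK_MAX_SIZES` and `block_max_size` are hypothetical,
and the sketch assumes that KB and MB denote powers of 1024.

```python
# Table from this specification: BD bits 6-5-4 -> maximum uncompressed block size.
# Assumption: 1 KB = 1024 bytes and 1 MB = 1024 * 1024 bytes.
BLOCK_MAX_SIZES = {
    4: 64 * 1024,
    5: 256 * 1024,
    6: 1024 * 1024,
    7: 4 * 1024 * 1024,
}

def block_max_size(bd: int) -> int:
    """Return the Block Maximum Size in bytes encoded in a BD byte."""
    code = (bd >> 4) & 0x07  # bits 6-5-4
    if code not in BLOCK_MAX_SIZES:
        # Values 0 to 3 are not defined by the current specification.
        raise ValueError(f"unsupported Block MaxSize code {code}")
    return BLOCK_MAX_SIZES[code]
```
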
__Reserved bits__

Value of reserved bits **must** be 0 (zero).
Reserved bits might be used in a future version of the specification,
typically enabling new optional features.
When this happens, a decoder respecting the current specification version
shall not be able to decode such a frame.

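The FLG and BD rules described so far can be combined into one validation step.
The sketch below is a possible reading of the bit tables, not a reference implementation;
`FrameFlags` and `parse_flg_bd` are hypothetical names.

```python
from typing import NamedTuple

class FrameFlags(NamedTuple):
    block_independence: bool
    block_checksum: bool
    content_size: bool
    content_checksum: bool
    dict_id: bool

def parse_flg_bd(flg: int, bd: int) -> FrameFlags:
    """Decode the FLG byte and reject unsupported versions or reserved bits."""
    version = (flg >> 6) & 0x03  # bits 7-6
    if version != 0b01:
        raise ValueError(f"unsupported version {version}")
    if flg & 0x02:  # FLG bit 1 is reserved
        raise ValueError("reserved bit set in FLG")
    if bd & 0x8F:  # BD bit 7 and bits 3-0 are reserved
        raise ValueError("reserved bit set in BD")
    return FrameFlags(
        block_independence=bool(flg & 0x20),  # bit 5
        block_checksum=bool(flg & 0x10),      # bit 4
        content_size=bool(flg & 0x08),        # bit 3
        content_checksum=bool(flg & 0x04),    # bit 2
        dict_id=bool(flg & 0x01),             # bit 0
    )
```
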
__Content Size__

This is the original (uncompressed) size.
This information is optional, and only present if the associated flag is set.
Content size is provided using unsigned 8 Bytes, for a maximum of 16 exabytes.
Format is Little endian.
This value is informational, typically for display or memory allocation.
It can be skipped by a decoder, or used to validate content correctness.

__Dictionary ID__

Dict-ID is only present if the associated flag is set.
It's an unsigned 32-bit value, stored using little-endian convention.
A dictionary is useful to compress short input sequences.
The compressor can take advantage of the dictionary context
to encode the input in a more compact manner.
It works as a kind of “known prefix” which is used by
both the compressor and the decompressor to “warm-up” reference tables.

The decompressor can use Dict-ID identifier to determine
which dictionary must be used to correctly decode data.
The compressor and the decompressor must use exactly the same dictionary.
It's presumed that the 32-bit dictID uniquely identifies a dictionary.

Within a single frame, a single dictionary can be defined.
When the frame descriptor defines independent blocks,
each block will be initialized with the same dictionary.
If the frame descriptor defines linked blocks,
the dictionary will only be used once, at the beginning of the frame.

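Since both optional fields have fixed widths, their positions follow directly from the FLG byte.
The sketch below reads them and derives the total descriptor length (3 to 15 bytes);
the function name `parse_optional_fields` and its return layout are hypothetical.

```python
import struct

def parse_optional_fields(descriptor: bytes):
    """Return (content_size, dict_id, descriptor_length).

    descriptor starts at the FLG byte; absent optional fields yield None.
    """
    flg = descriptor[0]
    offset = 2  # skip FLG and BD
    content_size = dict_id = None
    if flg & 0x08:  # Content Size flag: 8 bytes, little-endian, unsigned
        (content_size,) = struct.unpack_from("<Q", descriptor, offset)
        offset += 8
    if flg & 0x01:  # Dictionary ID flag: 4 bytes, little-endian, unsigned
        (dict_id,) = struct.unpack_from("<I", descriptor, offset)
        offset += 4
    return content_size, dict_id, offset + 1  # +1 for the HC byte
```
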
__Header Checksum__

One-byte checksum of combined descriptor fields, including optional ones.
The value is the second byte of `xxh32()` : `(xxh32() >> 8) & 0xFF`
using zero as a seed, and the full Frame Descriptor as an input
(including optional fields when they are present).
A wrong checksum indicates an error in the descriptor.
Header checksum is informational and can be skipped.


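To make the header checksum formula concrete, here is a small pure-Python sketch.
It includes an illustrative (and slow) XXH32 so that the block is self-contained;
a real implementation would normally call an existing xxHash library instead.
The sketch assumes that the hashed input covers FLG, BD and any optional fields, but not the HC byte itself.
`header_checksum` is a hypothetical name.

```python
import struct

P1, P2, P3, P4, P5 = 0x9E3779B1, 0x85EBCA77, 0xC2B2AE3D, 0x27D4EB2F, 0x165667B1
MASK = 0xFFFFFFFF

def _rotl(x: int, r: int) -> int:
    return ((x << r) | (x >> (32 - r))) & MASK

def xxh32(data: bytes, seed: int = 0) -> int:
    """Pure-Python XXH32, for illustration only."""
    n, i = len(data), 0
    if n >= 16:
        acc = [(seed + P1 + P2) & MASK, (seed + P2) & MASK,
               seed & MASK, (seed - P1) & MASK]
        while i + 16 <= n:  # consume 16-byte stripes, one 32-bit lane per accumulator
            for k, lane in enumerate(struct.unpack_from("<4I", data, i)):
                acc[k] = (_rotl((acc[k] + lane * P2) & MASK, 13) * P1) & MASK
            i += 16
        h = (_rotl(acc[0], 1) + _rotl(acc[1], 7)
             + _rotl(acc[2], 12) + _rotl(acc[3], 18)) & MASK
    else:
        h = (seed + P5) & MASK
    h = (h + n) & MASK
    while i + 4 <= n:  # remaining 4-byte words
        (word,) = struct.unpack_from("<I", data, i)
        h = (_rotl((h + word * P3) & MASK, 17) * P4) & MASK
        i += 4
    while i < n:  # remaining single bytes
        h = (_rotl((h + data[i] * P5) & MASK, 11) * P1) & MASK
        i += 1
    h ^= h >> 15  # final avalanche
    h = (h * P2) & MASK
    h ^= h >> 13
    h = (h * P3) & MASK
    h ^= h >> 16
    return h

def header_checksum(descriptor_without_hc: bytes) -> int:
    """Second byte of xxh32() over the descriptor fields, with a seed of zero."""
    return (xxh32(descriptor_without_hc, 0) >> 8) & 0xFF
```
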
Data Blocks
-----------

| Block Size |  data  | (Block Checksum) |
|:----------:| ------ |:----------------:|
|  4 bytes   |        |   0 - 4 bytes    |


__Block Size__

This field uses 4 bytes, format is little-endian.

The highest bit is “1” if data in the block is uncompressed.

The highest bit is “0” if data in the block is compressed by LZ4.

All other bits give the size, in bytes, of the following data block
(the size does not include the block checksum if present).

Block Size shall never be larger than Block Maximum Size.
Compression could otherwise exceed this limit for incompressible source data.
In such a case, the data block shall be stored in uncompressed format.

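Splitting the field into its flag and its size takes two masks.
The following is a minimal sketch; `decode_block_size` is a hypothetical helper name.

```python
import struct

def decode_block_size(field: bytes):
    """Split the 4-byte Block Size field into (is_uncompressed, size_in_bytes)."""
    (raw,) = struct.unpack("<I", field)       # little-endian uint32
    is_uncompressed = bool(raw & 0x80000000)  # highest bit
    size = raw & 0x7FFFFFFF                   # remaining 31 bits
    return is_uncompressed, size
```
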
__Data__

This is where the actual data to decode is stored.
It might be compressed or not, depending on previous field indications.
Uncompressed size of Data can be any size, up to “block maximum size”.
Note that a data block is not necessarily full :
an arbitrary “flush” may happen anytime. Any block can be “partially filled”.

__Block checksum__

Only present if the associated flag is set.
This is a 4-byte checksum value, in little endian format,
calculated by using the xxHash-32 algorithm on the raw (undecoded) data block,
and a seed of zero.
The intention is to detect data corruption (storage or transmission errors)
before decoding.

Block checksum is cumulative with Content checksum.


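The block layout can be traversed without decompressing anything.
The sketch below walks the blocks of one frame until it meets the EndMark,
returning each block as stored, together with its checksum when present.
It assumes that the EndMark is a Block Size field whose 32-bit value is zero;
`iter_blocks` and its parameters are hypothetical names, and checksum verification is left out.

```python
import struct
from typing import Iterator, Optional, Tuple

def iter_blocks(buf: bytes, offset: int, has_block_checksum: bool
                ) -> Iterator[Tuple[bool, bytes, Optional[int]]]:
    """Yield (is_uncompressed, data, checksum) for each block until the EndMark.

    offset is the position of the first Block Size field,
    right after the Frame Descriptor.
    """
    while True:
        (raw,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        if raw == 0:  # EndMark: a block size of zero
            return
        size = raw & 0x7FFFFFFF  # the size excludes the block checksum
        data = buf[offset:offset + size]
        if len(data) != size:
            raise ValueError("truncated block")
        offset += size
        checksum = None
        if has_block_checksum:  # 4 bytes, little-endian, after the data
            (checksum,) = struct.unpack_from("<I", buf, offset)
            offset += 4
        yield bool(raw & 0x80000000), data, checksum
```
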
Skippable Frames
----------------

| Magic Number | Frame Size | User Data |
|:------------:|:----------:| --------- |
|   4 bytes    |  4 bytes   |           |

Skippable frames allow the integration of user-defined data
into a flow of concatenated frames.
Their design is straightforward,
with the sole objective to allow the decoder to quickly skip
over user-defined data and continue decoding.

For the purpose of facilitating identification,
it is discouraged to start a flow of concatenated frames with a skippable frame.
If there is a need to start such a flow with some user data
encapsulated into a skippable frame,
it’s recommended to start with a zero-byte LZ4 frame
followed by a skippable frame.
This will make it easier for file type identifiers.


__Magic Number__

4 Bytes, Little endian format.
Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F.
All 16 values are valid to identify a skippable frame.

__Frame Size__

This is the size, in bytes, of the following User Data
(excluding the magic number and the size field itself).
4 Bytes, Little endian format, unsigned 32-bit.
This means User Data can’t be bigger than (2^32-1) Bytes.

__User Data__

User Data can be anything. Data will just be skipped by the decoder.


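Skipping such a frame only requires reading the two fixed fields.
A minimal sketch follows, with the hypothetical helper name `skip_skippable_frame`;
masking the low 4 bits of the magic number accepts all 16 valid values.

```python
import struct

def skip_skippable_frame(buf: bytes, offset: int) -> int:
    """If a skippable frame starts at offset, return the offset just past it.

    Returns the unchanged offset when no skippable magic number is found.
    """
    if len(buf) - offset < 8:
        return offset
    magic, frame_size = struct.unpack_from("<II", buf, offset)
    if (magic & 0xFFFFFFF0) != 0x184D2A50:  # 0x184D2A50 to 0x184D2A5F
        return offset
    end = offset + 8 + frame_size  # magic + size field + User Data
    if end > len(buf):
        raise ValueError("truncated skippable frame")
    return end
```
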
Legacy frame
------------

The Legacy frame format was defined in the initial versions of “LZ4Demo”.
Newer compressors should not use this format anymore, as it is too restrictive.

Main characteristics of the legacy format :

- Fixed block size : 8 MB.
- All blocks must be completely filled, except the last one.
- All blocks are always compressed, even when compression is detrimental.
- The last block is detected either because
  it is followed by the “EOF” (End of File) mark,
  or because it is followed by a known Frame Magic Number.
- No checksum
- Convention is Little endian

| MagicNb | B.CSize | CData | B.CSize | CData |  (...)  | EndMark |
| ------- | ------- | ----- | ------- | ----- | ------- | ------- |
| 4 bytes | 4 bytes | CSize | 4 bytes | CSize | x times |   EOF   |


__Magic Number__

4 Bytes, Little endian format.
Value : 0x184C2102

__Block Compressed Size__

This is the size, in bytes, of the following compressed data block.
4 Bytes, Little endian format.

__Data__

This is where the actual compressed data is stored.
Data is always compressed, even when compression is detrimental.

__EndMark__

End of legacy frame is implicit only.
It must be followed by a standard EOF (End Of File) signal,
whether it is a file or a stream.

Alternatively, if the frame is followed by a valid Frame Magic Number,
it is considered completed.
This policy makes it possible to concatenate legacy frames.

Any other value will be interpreted as a block size,
and trigger an error if it does not fit within acceptable range.


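The implicit end-of-frame rule can be sketched as a loop that stops either at the end of input
or when the next 4 bytes match a known Frame Magic Number.
This is an illustration under simplifying assumptions: the whole input is held in memory,
the only range check is that a block fits in the remaining input,
and `iter_legacy_blocks` and `is_frame_magic` are hypothetical names.

```python
import struct
from typing import Iterator

LEGACY_MAGIC = 0x184C2102
LZ4_FRAME_MAGIC = 0x184D2204

def is_frame_magic(value: int) -> bool:
    """True for the LZ4 frame, legacy frame, or any skippable frame magic."""
    return (value in (LZ4_FRAME_MAGIC, LEGACY_MAGIC)
            or (value & 0xFFFFFFF0) == 0x184D2A50)

def iter_legacy_blocks(buf: bytes, offset: int = 0) -> Iterator[bytes]:
    """Yield each compressed block of the legacy frame starting at offset."""
    (magic,) = struct.unpack_from("<I", buf, offset)
    if magic != LEGACY_MAGIC:
        raise ValueError("not a legacy frame")
    offset += 4
    while offset < len(buf):  # end of input ends the frame
        (value,) = struct.unpack_from("<I", buf, offset)
        if is_frame_magic(value):  # a new frame starts here
            return
        offset += 4
        block = buf[offset:offset + value]
        if len(block) != value:
            raise ValueError("block size out of range")
        yield block
        offset += value
```
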
Version changes
---------------

1.6.1 : introduced terms "LZ4 Frame Header" and "LZ4 Frame Footer"

1.6.0 : restored Dictionary ID field in Frame header

1.5.1 : changed document format to MarkDown

1.5 : removed Dictionary ID from specification

1.4.1 : changed wording from “stream” to “frame”

1.4 : added skippable streams, re-added stream checksum

1.3 : modified header checksum

1.2 : reduced choice of “block size”, to postpone decision on “dynamic size of BlockSize Field”.

1.1 : optional fields are now part of the descriptor

1.0 : changed “block size” specification, adding a compressed/uncompressed flag

0.9 : reduced scale of “block maximum size” table

0.8 : removed : high compression flag

0.7 : removed : stream checksum

0.6 : settled : stream size uses 8 bytes, endian convention is little endian

0.5 : added copyright notice

0.4 : changed format to Google Doc compatible OpenDocument
