LZ4 Frame Format Description
============================

### Notices

Copyright (c) 2013-2015 Yann Collet

Permission is granted to copy and distribute this document
for any purpose and without charge,
including translations into other languages
and incorporation into compilations,
provided that the copyright notice and this notice are preserved,
and that any substantive changes or deletions from the original
are clearly marked.
Distribution of this document is unlimited.

### Version

1.6.1 (30/01/2018)


Introduction
------------

The purpose of this document is to define a lossless compressed data format,
that is independent of CPU type, operating system,
file system and character set, suitable for
file compression, pipe and streaming compression
using the [LZ4 algorithm](http://www.lz4.org).

The data can be produced or consumed,
even for an arbitrarily long sequentially presented input data stream,
using only an a priori bounded amount of intermediate storage,
and hence can be used in data communications.
The format uses the LZ4 compression method,
and an optional [xxHash-32 checksum method](https://github.com/Cyan4973/xxHash),
for detection of data corruption.

The data format defined by this specification
does not attempt to allow random access to compressed data.

This specification is intended for use by implementers of software
to compress data into LZ4 format and/or decompress data from LZ4 format.
The text of the specification assumes a basic background in programming
at the level of bits and other primitive data representations.

Unless otherwise indicated below,
a compliant compressor must produce data sets
that conform to the specifications presented here.
It does not need to support all options, though.

A compliant decompressor must be able to decompress
at least one working set of parameters
that conforms to the specifications presented here.
It may also ignore checksums.
Whenever it does not support a specific parameter within the compressed stream,
it must produce a non-ambiguous error code
and associated error message explaining which parameter is unsupported.


General Structure of LZ4 Frame format
-------------------------------------

| MagicNb | F. Descriptor | Block | (...) | EndMark | C. Checksum |
|:-------:|:-------------:| ----- | ----- | ------- | ----------- |
| 4 bytes |  3-15 bytes   |       |       | 4 bytes | 0-4 bytes   |

__Magic Number__

4 Bytes, Little endian format.
Value : 0x184D2204

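Because the magic number is stored in little-endian order, its bytes appear on disk as `04 22 4D 18`. A minimal sketch of the check follows; the helper name `has_lz4_frame_magic` is an assumption made for illustration and is not part of any LZ4 API.

```python
import struct

LZ4_FRAME_MAGIC = 0x184D2204  # value given in this specification


def has_lz4_frame_magic(buf: bytes) -> bool:
    """Return True if buf starts with the LZ4 frame magic number."""
    if len(buf) < 4:
        return False
    # "<I" decodes an unsigned 32-bit integer in little-endian byte order.
    (magic,) = struct.unpack_from("<I", buf, 0)
    return magic == LZ4_FRAME_MAGIC


print(has_lz4_frame_magic(bytes.fromhex("04224d18")))  # True
print(has_lz4_frame_magic(bytes.fromhex("184d2204")))  # False (big-endian order)
```
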
__Frame Descriptor__

3 to 15 Bytes, to be detailed in its own paragraph,
as it is the most important part of the spec.

The combined __Magic Number__ and __Frame Descriptor__ fields are sometimes
called ___LZ4 Frame Header___. Its size varies between 7 and 19 bytes.

__Data Blocks__

To be detailed in its own paragraph.
That’s where compressed data is stored.

__EndMark__

The flow of blocks ends when the last data block has a size of “0”.
The size is expressed as a 32-bit value.

__Content Checksum__

The Content Checksum verifies that the full content has been decoded correctly.
The content checksum is the result
of [xxh32() hash function](https://github.com/Cyan4973/xxHash)
digesting the original (decoded) data as input, and a seed of zero.
Content checksum is only present when its associated flag
is set in the frame descriptor.
It validates that all blocks were fully transmitted
in the correct order and without error,
and also that the encoding/decoding process itself generated no distortion.
Its usage is recommended.

The combined __EndMark__ and __Content Checksum__ fields might sometimes be
referred to as ___LZ4 Frame Footer___. Its size varies between 4 and 8 bytes.

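As a worked example of the overall layout, the sketch below assembles a frame that carries no data blocks at all: a 7-byte header followed by an 8-byte footer. The FLG and BD values are one valid choice among several (their bits are detailed in the Frame Descriptor paragraph), and the two checksum constants are precomputed xxh32 results for these exact inputs, not values computed by the snippet.

```python
import struct

MAGIC = 0x184D2204
FLG = 0x64  # version 01, independent blocks, content checksum present
BD = 0x40   # Block Maximum Size code 4 (64 KB)
HC = 0xA7   # precomputed: (xxh32(bytes([FLG, BD]), seed 0) >> 8) & 0xFF
EMPTY_CONTENT_XXH32 = 0x02CC5D05  # precomputed: xxh32 of zero bytes, seed 0

header = struct.pack("<I", MAGIC) + bytes([FLG, BD, HC])
end_mark = struct.pack("<I", 0)  # a block size of 0 ends the flow of blocks
footer = end_mark + struct.pack("<I", EMPTY_CONTENT_XXH32)
frame = header + footer

print(len(header), len(footer), frame.hex())
```
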
__Frame Concatenation__

In some circumstances, it may be preferable to append multiple frames,
for example in order to add new data to an existing compressed file
without re-framing it.

In such a case, each frame has its own set of descriptor flags.
Each frame is considered independent.
The only relation between frames is their sequential order.

The ability to decode multiple concatenated frames
within a single stream or file
is left outside of this specification.
As an example, the reference lz4 command line utility behavior is
to decode all concatenated frames in their sequential order.


Frame Descriptor
----------------

| FLG     | BD      | (Content Size) | (Dictionary ID) | HC      |
| ------- | ------- |:--------------:|:---------------:| ------- |
| 1 byte  | 1 byte  |  0 - 8 bytes   |   0 - 4 bytes   | 1 byte  |

The descriptor uses a minimum of 3 bytes,
and up to 15 bytes depending on optional parameters.

__FLG byte__

|  BitNb  |  7-6  |   5   |    4     |  3   |    2     |    1     |   0  |
| ------- |-------|-------|----------|------|----------|----------|------|
|FieldName|Version|B.Indep|B.Checksum|C.Size|C.Checksum|*Reserved*|DictID|


__BD byte__

|  BitNb  |     7    |     6-5-4     |  3-2-1-0 |
| ------- | -------- | ------------- | -------- |
|FieldName|*Reserved*| Block MaxSize |*Reserved*|

In the tables, bit 7 is highest bit, while bit 0 is lowest.

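The two tables translate directly into shifts and masks. The following is a minimal sketch of the bit layout only, with no validation; the function names and dictionary keys are hypothetical and chosen to mirror the table headers.

```python
def split_flg(flg: int) -> dict:
    """Extract the FLG fields; bit 7 is the highest bit, bit 0 the lowest."""
    return {
        "version": (flg >> 6) & 0b11,          # bits 7-6
        "block_independence": (flg >> 5) & 1,  # bit 5
        "block_checksum": (flg >> 4) & 1,      # bit 4
        "content_size": (flg >> 3) & 1,        # bit 3
        "content_checksum": (flg >> 2) & 1,    # bit 2
        "reserved": (flg >> 1) & 1,            # bit 1
        "dict_id": flg & 1,                    # bit 0
    }


def split_bd(bd: int) -> dict:
    """Extract the BD fields; all bits outside 6-5-4 are reserved."""
    return {
        "block_max_size": (bd >> 4) & 0b111,   # bits 6-5-4
        "reserved": bd & 0x8F,                 # bit 7 and bits 3-0
    }


print(split_flg(0x64))
print(split_bd(0x40))
```
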
__Version Number__

2-bit field, must be set to `01`.
Any other value cannot be decoded by this version of the specification.
Other version numbers will use different flag layouts.

__Block Independence flag__

If this flag is set to “1”, blocks are independent.
If this flag is set to “0”, each block depends on previous ones
(up to LZ4 window size, which is 64 KB).
In such a case, it’s necessary to decode all blocks in sequence.

Block dependency improves compression ratio, especially for small blocks.
On the other hand, it makes random access or multi-threaded decoding impossible.

__Block checksum flag__

If this flag is set, each data block will be followed by a 4-byte checksum,
calculated by using the xxHash-32 algorithm on the raw (compressed) data block.
The intention is to detect data corruption (storage or transmission errors)
immediately, before decoding.
Block checksum usage is optional.

__Content Size flag__

If this flag is set, the uncompressed size of data included within the frame
will be present as an 8-byte unsigned little-endian value, after the flags.
Content Size usage is optional.

__Content checksum flag__

If this flag is set, a 32-bit content checksum will be appended
after the EndMark.

__Dictionary ID flag__

If this flag is set, a 4-byte Dict-ID field will be present,
after the descriptor flags and the Content Size.

__Block Maximum Size__

This information is useful to help the decoder allocate memory.
Size here refers to the original (uncompressed) data size.
Block Maximum Size is one of the values in the following table:

|  0  |  1  |  2  |  3  |   4   |   5    |  6   |  7   |
| --- | --- | --- | --- | ----- | ------ | ---- | ---- |
| N/A | N/A | N/A | N/A | 64 KB | 256 KB | 1 MB | 4 MB |

The decoder may refuse to allocate block sizes above any system-specific size.
Unused values may be used in a future revision of the spec.
A decoder conformant with the current version of the spec
is only able to decode block sizes defined in this spec.

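The table can be expressed as a small lookup. This sketch rejects the N/A codes and also shows how a decoder might refuse sizes above a limit of its own; the `limit` parameter and the function name are assumptions for illustration.

```python
BLOCK_MAX_SIZES = {
    4: 64 * 1024,
    5: 256 * 1024,
    6: 1024 * 1024,
    7: 4 * 1024 * 1024,
}


def block_max_size(bd: int, limit: int = 4 * 1024 * 1024) -> int:
    """Map the BD byte to a Block Maximum Size in bytes."""
    code = (bd >> 4) & 0x07  # bits 6-5-4 of BD
    if code not in BLOCK_MAX_SIZES:
        raise ValueError(f"unsupported Block MaxSize code: {code}")
    size = BLOCK_MAX_SIZES[code]
    if size > limit:
        # A decoder may refuse block sizes above a system-specific size.
        raise ValueError(f"Block MaxSize {size} exceeds decoder limit {limit}")
    return size


print(block_max_size(0x40))  # 65536
print(block_max_size(0x70))  # 4194304
```
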
__Reserved bits__

Value of reserved bits **must** be 0 (zero).
Reserved bits might be used in a future version of the specification,
typically enabling new optional features.
When this happens, a decoder respecting the current specification version
shall not be able to decode such a frame.

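A decoder is expected to report which parameter it cannot handle. One possible shape for that check is sketched below, covering the version field and the reserved bits; the function name and the error wording are hypothetical.

```python
def check_flags(flg: int, bd: int) -> None:
    """Raise ValueError naming the first unsupported parameter found."""
    version = (flg >> 6) & 0b11
    if version != 0b01:
        raise ValueError(f"unsupported Version Number: {version:02b}")
    if flg & 0x02:
        raise ValueError("unsupported FLG: reserved bit 1 is set")
    if bd & 0x80:
        raise ValueError("unsupported BD: reserved bit 7 is set")
    if bd & 0x0F:
        raise ValueError("unsupported BD: reserved bits 3-0 are set")


check_flags(0x64, 0x40)  # valid flags: returns without raising

try:
    check_flags(0x66, 0x40)  # same FLG with reserved bit 1 set
except ValueError as err:
    print(err)
```
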
__Content Size__

This is the original (uncompressed) size.
This information is optional, and only present if the associated flag is set.
Content size is provided as an unsigned 8-byte value, for a maximum of 16 Exabytes.
Format is Little endian.
This value is informational, typically for display or memory allocation.
It can be skipped by a decoder, or used to validate content correctness.

__Dictionary ID__

Dict-ID is only present if the associated flag is set.
It's an unsigned 32-bit value, stored using little-endian convention.
A dictionary is useful to compress short input sequences.
The compressor can take advantage of the dictionary context
to encode the input in a more compact manner.
It works as a kind of “known prefix” which is used by
both the compressor and the decompressor to “warm-up” reference tables.

The decompressor can use the Dict-ID identifier to determine
which dictionary must be used to correctly decode data.
The compressor and the decompressor must use exactly the same dictionary.
It's presumed that the 32-bit dictID uniquely identifies a dictionary.

Within a single frame, a single dictionary can be defined.
When the frame descriptor defines independent blocks,
each block will be initialized with the same dictionary.
If the frame descriptor defines linked blocks,
the dictionary will only be used once, at the beginning of the frame.

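Since both optional fields are announced by FLG bits, the descriptor length and the field offsets follow from the FLG byte alone. A minimal sketch, assuming the buffer starts at the FLG byte; the helper names are hypothetical, and the trailing HC byte in the example is a placeholder rather than a valid checksum.

```python
import struct


def descriptor_size(flg: int) -> int:
    """FLG + BD + HC, plus 8 bytes for Content Size and 4 for Dict-ID if flagged."""
    return 3 + (8 if flg & 0x08 else 0) + (4 if flg & 0x01 else 0)


def parse_optional_fields(descriptor: bytes):
    """Return (content_size, dict_id, hc); absent fields are None."""
    flg = descriptor[0]
    pos = 2  # skip FLG and BD
    content_size = dict_id = None
    if flg & 0x08:  # Content Size flag
        (content_size,) = struct.unpack_from("<Q", descriptor, pos)
        pos += 8
    if flg & 0x01:  # Dictionary ID flag, stored after the Content Size
        (dict_id,) = struct.unpack_from("<I", descriptor, pos)
        pos += 4
    return content_size, dict_id, descriptor[pos]


example = (bytes([0x69, 0x40]) + (1000).to_bytes(8, "little")
           + (0x12345678).to_bytes(4, "little") + b"\x00")  # placeholder HC
print(descriptor_size(0x69), parse_optional_fields(example))
```
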
__Header Checksum__

One-byte checksum of combined descriptor fields, including optional ones.
The value is the second byte of `xxh32()`: `(xxh32()>>8) & 0xFF`,
using zero as a seed, and the full Frame Descriptor as an input
(including optional fields when they are present).
A wrong checksum indicates an error in the descriptor.
Header checksum is informational and can be skipped.


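To make the formula concrete, here is a self-contained sketch that pairs a pure-Python XXH32 (written for illustration, not speed) with the one-byte extraction. The function names are assumptions; a real implementation would more likely call an existing xxHash library. The input to `header_checksum` is the descriptor without its trailing HC byte.

```python
import struct

_P1, _P2, _P3, _P4, _P5 = (2654435761, 2246822519, 3266489917,
                           668265263, 374761393)
_MASK = 0xFFFFFFFF


def _rotl(x: int, r: int) -> int:
    return ((x << r) | (x >> (32 - r))) & _MASK


def xxh32(data: bytes, seed: int = 0) -> int:
    """XXH32 of data, following the published xxHash algorithm."""
    n, i = len(data), 0
    if n >= 16:
        acc = [(seed + _P1 + _P2) & _MASK, (seed + _P2) & _MASK,
               seed & _MASK, (seed - _P1) & _MASK]
        while i + 16 <= n:  # four 32-bit lanes per 16-byte stripe
            for k, lane in enumerate(struct.unpack_from("<4I", data, i)):
                acc[k] = (_rotl((acc[k] + lane * _P2) & _MASK, 13) * _P1) & _MASK
            i += 16
        h = (_rotl(acc[0], 1) + _rotl(acc[1], 7)
             + _rotl(acc[2], 12) + _rotl(acc[3], 18)) & _MASK
    else:
        h = (seed + _P5) & _MASK
    h = (h + n) & _MASK
    while i + 4 <= n:  # remaining 4-byte words
        (word,) = struct.unpack_from("<I", data, i)
        h = (_rotl((h + word * _P3) & _MASK, 17) * _P4) & _MASK
        i += 4
    while i < n:  # remaining single bytes
        h = (_rotl((h + data[i] * _P5) & _MASK, 11) * _P1) & _MASK
        i += 1
    h ^= h >> 15  # final avalanche
    h = (h * _P2) & _MASK
    h ^= h >> 13
    h = (h * _P3) & _MASK
    h ^= h >> 16
    return h


def header_checksum(descriptor_without_hc: bytes) -> int:
    """Second byte of xxh32() over the descriptor fields, seed zero."""
    return (xxh32(descriptor_without_hc, 0) >> 8) & 0xFF


print(hex(header_checksum(bytes([0x64, 0x40]))))  # 0xa7
```
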
Data Blocks
-----------

| Block Size |  data  | (Block Checksum) |
|:----------:| ------ |:----------------:|
|  4 bytes   |        |   0 - 4 bytes    |


__Block Size__

This field uses 4 bytes, in little-endian format.

The highest bit is “1” if data in the block is uncompressed.

The highest bit is “0” if data in the block is compressed by LZ4.

All other bits give the size, in bytes, of the following data block.
The size does not include the block checksum if present.

Block Size shall never be larger than Block Maximum Size.
Such a thing could potentially happen for non-compressible sources.
In such a case, the data block shall be passed in uncompressed format.

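Decoding the Block Size field amounts to one little-endian read and two masks. A minimal sketch follows, with a hypothetical function name; note that an all-zero field is the EndMark rather than a block.

```python
import struct


def decode_block_size(field: bytes):
    """Return (is_uncompressed, data_size) for a 4-byte Block Size field."""
    (word,) = struct.unpack("<I", field)
    is_uncompressed = bool(word & 0x80000000)  # highest bit
    data_size = word & 0x7FFFFFFF              # remaining 31 bits
    return is_uncompressed, data_size


print(decode_block_size(bytes.fromhex("05000080")))  # (True, 5)
print(decode_block_size(bytes.fromhex("00010000")))  # (False, 256)
```
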
__Data__

Where the actual data to decode stands.
It might be compressed or not, depending on previous field indications.

When compressed, the data must respect the [LZ4 block format specification](https://github.com/lz4/lz4/blob/master/doc/lz4_Block_format.md).

Note that the block is not necessarily full.
Uncompressed size of data can be any size, up to “Block Maximum Size”,
so it may contain less data than the maximum block size.

__Block checksum__

Only present if the associated flag is set.
This is a 4-byte checksum value, in little endian format,
calculated by using the xxHash-32 algorithm on the raw (undecoded) data block,
and a seed of zero.
The intention is to detect data corruption (storage or transmission errors)
before decoding.

Block checksum is cumulative with Content checksum.


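Putting the three fields together, a reader can walk the flow of blocks until it meets the EndMark. The generator below is a sketch under stated assumptions: `pos` points at the first Block Size field, the caller already knows whether the block checksum flag is set, and checksums are returned unverified. The name `iter_blocks` is hypothetical. In the example, a single uncompressed block holding `hello` is followed by the EndMark.

```python
import struct


def iter_blocks(buf: bytes, pos: int, block_checksum: bool):
    """Yield (is_uncompressed, data, checksum_or_None) until the EndMark."""
    while True:
        (word,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        size = word & 0x7FFFFFFF
        if size == 0:  # EndMark: a block size of 0 ends the flow
            return
        data = buf[pos:pos + size]
        if len(data) != size:
            raise ValueError("truncated data block")
        pos += size
        checksum = None
        if block_checksum:  # 4-byte little-endian value after the data
            (checksum,) = struct.unpack_from("<I", buf, pos)
            pos += 4
        yield bool(word & 0x80000000), data, checksum


stream = bytes.fromhex("05000080") + b"hello" + bytes(4)
print(list(iter_blocks(stream, 0, block_checksum=False)))
```
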
Skippable Frames
----------------

| Magic Number | Frame Size | User Data |
|:------------:|:----------:| --------- |
|   4 bytes    |  4 bytes   |           |

Skippable frames allow the integration of user-defined data
into a flow of concatenated frames.
Their design is straightforward,
with the sole objective to allow the decoder to quickly skip
over user-defined data and continue decoding.

For the purpose of facilitating identification,
it is discouraged to start a flow of concatenated frames with a skippable frame.
If there is a need to start such a flow with some user data
encapsulated into a skippable frame,
it’s recommended to start with a zero-byte LZ4 frame
followed by a skippable frame.
This will make it easier for file type identifiers.


__Magic Number__

4 Bytes, Little endian format.
Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F.
All 16 values are valid to identify a skippable frame.

__Frame Size__

This is the size, in bytes, of the following User Data
(not including the magic number or the size field itself).
4 Bytes, Little endian format, unsigned 32-bit.
This means User Data can’t be bigger than (2^32-1) Bytes.

__User Data__

User Data can be anything. Data will just be skipped by the decoder.


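Writing and skipping such a frame needs only the magic range and the size field. The sketch below shows both directions; `make_skippable_frame` and `skip_if_skippable` are hypothetical helper names, and the second function returns the offset unchanged when the bytes at that position are not a skippable frame.

```python
import struct

SKIPPABLE_MAGIC_MIN = 0x184D2A50
SKIPPABLE_MAGIC_MAX = 0x184D2A5F


def make_skippable_frame(user_data: bytes,
                         magic: int = SKIPPABLE_MAGIC_MIN) -> bytes:
    """Wrap user_data as Magic Number + Frame Size + User Data."""
    if not SKIPPABLE_MAGIC_MIN <= magic <= SKIPPABLE_MAGIC_MAX:
        raise ValueError("magic number is outside the skippable range")
    return struct.pack("<II", magic, len(user_data)) + user_data


def skip_if_skippable(buf: bytes, pos: int) -> int:
    """Return the offset just past a skippable frame starting at pos."""
    if len(buf) - pos < 8:
        return pos
    magic, size = struct.unpack_from("<II", buf, pos)
    if (magic & 0xFFFFFFF0) != SKIPPABLE_MAGIC_MIN:  # low 4 bits are free
        return pos
    end = pos + 8 + size
    if end > len(buf):
        raise ValueError("truncated skippable frame")
    return end


frame = make_skippable_frame(b"meta")
print(frame.hex(), skip_if_skippable(frame + b"next", 0))
```
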
Legacy frame
------------

The Legacy frame format was defined in the initial versions of “LZ4Demo”.
Newer compressors should not use this format anymore, as it is too restrictive.

Main characteristics of the legacy format :

- Fixed block size : 8 MB.
- All blocks must be completely filled, except the last one.
- All blocks are always compressed, even when compression is detrimental.
- The last block is detected either because
  it is followed by the “EOF” (End of File) mark,
  or because it is followed by a known Frame Magic Number.
- No checksum
- Convention is Little endian

| MagicNb | B.CSize | CData | B.CSize | CData |  (...)  | EndMark |
| ------- | ------- | ----- | ------- | ----- | ------- | ------- |
| 4 bytes | 4 bytes | CSize | 4 bytes | CSize | x times |   EOF   |


__Magic Number__

4 Bytes, Little endian format.
Value : 0x184C2102

__Block Compressed Size__

This is the size, in bytes, of the following compressed data block.
4 Bytes, Little endian format.

__Data__

Where the actual compressed data stands.
Data is always compressed, even when compression is detrimental.

__EndMark__

End of legacy frame is implicit only.
It must be followed by a standard EOF (End Of File) signal,
whether it is a file or a stream.

Alternatively, if the frame is followed by a valid Frame Magic Number,
it is considered completed.
This policy makes it possible to concatenate legacy frames.

Any other value will be interpreted as a block size,
and trigger an error if it does not fit within acceptable range.


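The implicit EndMark rule can be sketched as a loop that stops at end of input or at a known magic number. This is an illustration only: the specification does not define the "acceptable range", so the bound used here (the `LZ4_compressBound()` formula from `lz4.h` applied to an 8 MB block) is an assumption, and `iter_legacy_blocks` is a hypothetical name. The payloads in the example are opaque placeholders, not valid LZ4 blocks.

```python
import struct

LEGACY_MAGIC = 0x184C2102
LEGACY_BLOCK_SIZE = 8 * 1024 * 1024
# Assumption: worst-case compressed size per the LZ4_compressBound() formula.
MAX_CSIZE = LEGACY_BLOCK_SIZE + LEGACY_BLOCK_SIZE // 255 + 16


def is_frame_magic(value: int) -> bool:
    """True for the LZ4 frame, legacy frame, or any skippable magic number."""
    return (value in (0x184D2204, LEGACY_MAGIC)
            or (value & 0xFFFFFFF0) == 0x184D2A50)


def iter_legacy_blocks(buf: bytes, pos: int = 0):
    """Yield each compressed block of the legacy frame starting at pos."""
    (magic,) = struct.unpack_from("<I", buf, pos)
    if magic != LEGACY_MAGIC:
        raise ValueError("not a legacy frame")
    pos += 4
    while pos < len(buf):  # reaching end of input ends the frame
        (value,) = struct.unpack_from("<I", buf, pos)
        if is_frame_magic(value):  # another frame starts here
            return
        if value > MAX_CSIZE:
            raise ValueError(f"block size {value} is out of range")
        pos += 4
        data = buf[pos:pos + value]
        if len(data) != value:
            raise ValueError("truncated block")
        pos += value
        yield data


legacy = (struct.pack("<I", LEGACY_MAGIC)
          + struct.pack("<I", 3) + b"abc"
          + struct.pack("<I", 2) + b"de")
print(list(iter_legacy_blocks(legacy)))
```
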
Version changes
---------------

1.6.1 : introduced terms "LZ4 Frame Header" and "LZ4 Frame Footer"

1.6.0 : restored Dictionary ID field in Frame header

1.5.1 : changed document format to MarkDown

1.5 : removed Dictionary ID from specification

1.4.1 : changed wording from “stream” to “frame”

1.4 : added skippable streams, re-added stream checksum

1.3 : modified header checksum

1.2 : reduced choice of “block size”, to postpone decision on “dynamic size of BlockSize Field”.

1.1 : optional fields are now part of the descriptor

1.0 : changed “block size” specification, adding a compressed/uncompressed flag

0.9 : reduced scale of “block maximum size” table

0.8 : removed : high compression flag

0.7 : removed : stream checksum

0.6 : settled : stream size uses 8 bytes, endian convention is little endian

0.5 : added copyright notice

0.4 : changed format to Google Doc compatible OpenDocument
