LZ4 Frame Format Description
============================

### Notices

Copyright (c) 2013-2015 Yann Collet

Permission is granted to copy and distribute this document
for any purpose and without charge,
including translations into other languages
and incorporation into compilations,
provided that the copyright notice and this notice are preserved,
and that any substantive changes or deletions from the original
are clearly marked.
Distribution of this document is unlimited.

### Version

1.6.1 (30/01/2018)


Introduction
------------

The purpose of this document is to define a lossless compressed data format,
that is independent of CPU type, operating system,
file system and character set, suitable for
File compression, Pipe and streaming compression
using the [LZ4 algorithm](http://www.lz4.org).

The data can be produced or consumed,
even for an arbitrarily long sequentially presented input data stream,
using only an a priori bounded amount of intermediate storage,
and hence can be used in data communications.
The format uses the LZ4 compression method,
and optional [xxHash-32 checksum method](https://github.com/Cyan4973/xxHash),
for detection of data corruption.

The data format defined by this specification
does not attempt to allow random access to compressed data.

This specification is intended for use by implementers of software
to compress data into LZ4 format and/or decompress data from LZ4 format.
The text of the specification assumes a basic background in programming
at the level of bits and other primitive data representations.

Unless otherwise indicated below,
a compliant compressor must produce data sets
that conform to the specifications presented here.
It doesn’t need to support all options though.

A compliant decompressor must be able to decompress
at least one working set of parameters
that conforms to the specifications presented here.
It may also ignore checksums.
Whenever it does not support a specific parameter within the compressed stream,
it must produce a non-ambiguous error code
and associated error message explaining which parameter is unsupported.


General Structure of LZ4 Frame format
-------------------------------------

| MagicNb | F. Descriptor | Block | (...) | EndMark | C. Checksum |
|:-------:|:-------------:| ----- | ----- | ------- | ----------- |
| 4 bytes |  3-15 bytes   |       |       | 4 bytes | 0-4 bytes   |

__Magic Number__

4 Bytes, Little endian format.
Value : 0x184D2204

__Frame Descriptor__

3 to 15 Bytes, to be detailed in its own paragraph,
as it is the most important part of the spec.

The combined __Magic Number__ and __Frame Descriptor__ fields are sometimes
called ___LZ4 Frame Header___. Its size varies between 7 and 19 bytes.

__Data Blocks__

To be detailed in its own paragraph.
That’s where compressed data is stored.

__EndMark__

The flow of blocks ends when the last data block has a size of “0”.
The size is expressed as a 32-bits value.

__Content Checksum__

Content Checksum verifies that the full content has been decoded correctly.
The content checksum is the result
of [xxh32() hash function](https://github.com/Cyan4973/xxHash)
digesting the original (decoded) data as input, and a seed of zero.
Content checksum is only present when its associated flag
is set in the frame descriptor.
Content Checksum validates the result,
that all blocks were fully transmitted in the correct order and without error,
and also that the encoding/decoding process itself generated no distortion.
Its usage is recommended.

The combined __EndMark__ and __Content Checksum__ fields might sometimes be
referred to as ___LZ4 Frame Footer___. Its size varies between 4 and 8 bytes.
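
As a minimal sketch of how a reader might recognise the start and the end of a frame,
the following Python reads the 4-byte little-endian Magic Number
and tests a 32-bit block size field for the EndMark.
The names `is_lz4_frame` and `is_end_mark` are illustrative assumptions,
not part of this specification.

```python
import struct

LZ4_FRAME_MAGIC = 0x184D2204  # Magic Number value given above


def is_lz4_frame(buf: bytes) -> bool:
    """Return True if buf starts with the LZ4 frame Magic Number."""
    if len(buf) < 4:
        return False
    (magic,) = struct.unpack_from("<I", buf, 0)  # 4 bytes, little endian
    return magic == LZ4_FRAME_MAGIC


def is_end_mark(buf: bytes, pos: int) -> bool:
    """Return True if the 32-bit block size field at pos is the EndMark (0)."""
    if len(buf) - pos < 4:
        raise ValueError("truncated block size field")
    (block_size,) = struct.unpack_from("<I", buf, pos)
    return block_size == 0
```

Because the value is stored little endian,
the Magic Number appears in a file as the byte sequence `04 22 4D 18`.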

__Frame Concatenation__

In some circumstances, it may be preferable to append multiple frames,
for example in order to add new data to an existing compressed file
without re-framing it.

In such case, each frame has its own set of descriptor flags.
Each frame is considered independent.
The only relation between frames is their sequential order.

The ability to decode multiple concatenated frames
within a single stream or file
is left outside of this specification.
As an example, the reference lz4 command line utility behavior is
to decode all concatenated frames in their sequential order.


Frame Descriptor
----------------

| FLG     | BD      | (Content Size) | (Dictionary ID) | HC      |
| ------- | ------- |:--------------:|:---------------:| ------- |
| 1 byte  | 1 byte  |  0 - 8 bytes   |   0 - 4 bytes   | 1 byte  |

The descriptor uses a minimum of 3 bytes,
and up to 15 bytes depending on optional parameters.

__FLG byte__

|  BitNb  |  7-6  |   5   |    4     |  3   |    2     |    1     |   0  |
| ------- |-------|-------|----------|------|----------|----------|------|
|FieldName|Version|B.Indep|B.Checksum|C.Size|C.Checksum|*Reserved*|DictID|


__BD byte__

|  BitNb  |     7    |     6-5-4     |  3-2-1-0 |
| ------- | -------- | ------------- | -------- |
|FieldName|*Reserved*| Block MaxSize |*Reserved*|

In the tables, bit 7 is highest bit, while bit 0 is lowest.

__Version Number__

2-bits field, must be set to `01`.
Any other value cannot be decoded by this version of the specification.
Other version numbers will use different flag layouts.

__Block Independence flag__

If this flag is set to “1”, blocks are independent.
If this flag is set to “0”, each block depends on previous ones
(up to LZ4 window size, which is 64 KB).
In such case, it’s necessary to decode all blocks in sequence.

Block dependency improves compression ratio, especially for small blocks.
On the other hand, it makes random access or multi-threaded decoding impossible.

__Block checksum flag__

If this flag is set, each data block will be followed by a 4-bytes checksum,
calculated by using the xxHash-32 algorithm on the raw (compressed) data block.
The intention is to detect data corruption (storage or transmission errors)
immediately, before decoding.
Block checksum usage is optional.

__Content Size flag__

If this flag is set, the uncompressed size of data included within the frame
will be present as an 8 bytes unsigned little endian value, after the flags.
Content Size usage is optional.

__Content checksum flag__

If this flag is set, a 32-bits content checksum will be appended
after the EndMark.

__Dictionary ID flag__

If this flag is set, a 4-bytes Dict-ID field will be present,
after the descriptor flags and the Content Size.

__Block Maximum Size__

This information is useful to help the decoder allocate memory.
Size here refers to the original (uncompressed) data size.
Block Maximum Size is one value among the following table :

|  0  |  1  |  2  |  3  |   4   |   5    |  6   |  7   |
| --- | --- | --- | --- | ----- | ------ | ---- | ---- |
| N/A | N/A | N/A | N/A | 64 KB | 256 KB | 1 MB | 4 MB |

The decoder may refuse to allocate block sizes above any system-specific size.
Unused values may be used in a future revision of the spec.
A decoder conformant with the current version of the spec
is only able to decode block sizes defined in this spec.

__Reserved bits__

Value of reserved bits **must** be 0 (zero).
Reserved bits might be used in a future version of the specification,
typically enabling new optional features.
When this happens, a decoder respecting the current specification version
shall not be able to decode such a frame.

__Content Size__

This is the original (uncompressed) size.
This information is optional, and only present if the associated flag is set.
Content size is provided using unsigned 8 Bytes, for a maximum of 16 Exabytes.
Format is Little endian.
This value is informational, typically for display or memory allocation.
It can be skipped by a decoder, or used to validate content correctness.

__Dictionary ID__

Dict-ID is only present if the associated flag is set.
It's an unsigned 32-bits value, stored using little-endian convention.
A dictionary is useful to compress short input sequences.
The compressor can take advantage of the dictionary context
to encode the input in a more compact manner.
It works as a kind of “known prefix” which is used by
both the compressor and the decompressor to “warm-up” reference tables.

The decompressor can use Dict-ID identifier to determine
which dictionary must be used to correctly decode data.
The compressor and the decompressor must use exactly the same dictionary.
It's presumed that the 32-bits dictID uniquely identifies a dictionary.

Within a single frame, a single dictionary can be defined.
When the frame descriptor defines independent blocks,
each block will be initialized with the same dictionary.
If the frame descriptor defines linked blocks,
the dictionary will only be used once, at the beginning of the frame.

__Header Checksum__

One-byte checksum of combined descriptor fields, including optional ones.
The value is the second byte of `xxh32()` : ` (xxh32()>>8) & 0xFF `
using zero as a seed, and the full Frame Descriptor as an input
(including optional fields when they are present).
A wrong checksum indicates an error in the descriptor.
Header checksum is informational and can be skipped.
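
The descriptor layout described in this section can be sketched in Python as follows.
This is an illustration under stated assumptions rather than a reference implementation:
the `FrameDescriptor` class, the `parse_frame_descriptor` function
and the caller-supplied `xxh32` callable are hypothetical names.
Python has no xxHash in its standard library,
so the header checksum is only verified when the caller provides
a function returning the 32-bit xxHash (seed zero) of its argument.

```python
import struct
from dataclasses import dataclass
from typing import Callable, Optional

# Block Maximum Size table: value of BD bits 6-5-4 -> size in bytes
BLOCK_MAX_SIZES = {4: 64 << 10, 5: 256 << 10, 6: 1 << 20, 7: 4 << 20}


@dataclass
class FrameDescriptor:
    block_independence: bool
    block_checksum: bool
    content_checksum: bool
    block_max_size: int
    content_size: Optional[int]
    dict_id: Optional[int]
    size: int  # total descriptor length in bytes (3 to 15)


def parse_frame_descriptor(
    buf: bytes, xxh32: Optional[Callable[[bytes], int]] = None
) -> FrameDescriptor:
    """Parse a Frame Descriptor located at the start of buf.

    xxh32 is a hypothetical caller-supplied hash (seed 0);
    when it is None, the header checksum byte is not verified.
    """
    if len(buf) < 3:
        raise ValueError("truncated frame descriptor")
    flg, bd = buf[0], buf[1]
    if flg >> 6 != 0b01:
        raise ValueError("unsupported version number")
    if flg & 0x02 or bd & 0x8F:
        raise ValueError("reserved bits must be zero")
    block_max_id = (bd >> 4) & 0x07
    if block_max_id not in BLOCK_MAX_SIZES:
        raise ValueError("unsupported Block Maximum Size")

    # FLG + BD + optional Content Size + optional Dict-ID + HC
    needed = 2 + (8 if flg & 0x08 else 0) + (4 if flg & 0x01 else 0) + 1
    if len(buf) < needed:
        raise ValueError("truncated frame descriptor")

    pos = 2
    content_size = dict_id = None
    if flg & 0x08:  # Content Size: 8 bytes, little endian
        (content_size,) = struct.unpack_from("<Q", buf, pos)
        pos += 8
    if flg & 0x01:  # Dictionary ID: 4 bytes, little endian
        (dict_id,) = struct.unpack_from("<I", buf, pos)
        pos += 4

    # HC is the second byte of xxh32() over the descriptor bytes before it.
    if xxh32 is not None and buf[pos] != (xxh32(buf[:pos]) >> 8) & 0xFF:
        raise ValueError("header checksum mismatch")

    return FrameDescriptor(
        block_independence=bool(flg & 0x20),
        block_checksum=bool(flg & 0x10),
        content_checksum=bool(flg & 0x04),
        block_max_size=BLOCK_MAX_SIZES[block_max_id],
        content_size=content_size,
        dict_id=dict_id,
        size=pos + 1,
    )
```

The sketch rejects unknown versions, non-zero reserved bits and undefined block sizes,
in line with the requirement that a decoder reports unsupported parameters
instead of guessing.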


Data Blocks
-----------

| Block Size |  data  | (Block Checksum) |
|:----------:| ------ |:----------------:|
|  4 bytes   |        |   0 - 4 bytes    |


__Block Size__

This field uses 4-bytes, format is little-endian.

The highest bit is “1” if data in the block is uncompressed.

The highest bit is “0” if data in the block is compressed by LZ4.

All other bits give the size, in bytes, of the following data block
(the size does not include the block checksum if present).

Block Size shall never be larger than Block Maximum Size.
Compressing incompressible source data could otherwise produce a larger block.
In such a case, the data block shall be passed in uncompressed format.

__Data__

Where the actual data to decode stands.
It might be compressed or not, depending on previous field indications.
Uncompressed size of Data can be any size, up to “block maximum size”.
Note that data block is not necessarily full :
an arbitrary “flush” may happen anytime. Any block can be “partially filled”.

__Block checksum__

Only present if the associated flag is set.
This is a 4-bytes checksum value, in little endian format,
calculated by using the xxHash-32 algorithm on the raw (undecoded) data block,
and a seed of zero.
The intention is to detect data corruption (storage or transmission errors)
before decoding.

Block checksum is cumulative with Content checksum.


Skippable Frames
----------------

| Magic Number | Frame Size | User Data |
|:------------:|:----------:| --------- |
|   4 bytes    |  4 bytes   |           |

Skippable frames allow the integration of user-defined data
into a flow of concatenated frames.
Their design is pretty straightforward,
with the sole objective to allow the decoder to quickly skip
over user-defined data and continue decoding.
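
The skipping behavior can be sketched in a few lines of Python,
using the magic number range (0x184D2A50 to 0x184D2A5F)
and the 4-byte little-endian Frame Size given in the field descriptions of this section.
The function name `skip_skippable_frame` is an illustrative assumption.

```python
import struct

SKIPPABLE_MAGIC_MIN = 0x184D2A50
SKIPPABLE_MAGIC_MAX = 0x184D2A5F


def skip_skippable_frame(buf: bytes, pos: int = 0) -> int:
    """Return the offset just past the skippable frame starting at pos."""
    if len(buf) - pos < 8:
        raise ValueError("truncated skippable frame header")
    # Magic Number and Frame Size are both 4 bytes, little endian.
    magic, frame_size = struct.unpack_from("<II", buf, pos)
    if not SKIPPABLE_MAGIC_MIN <= magic <= SKIPPABLE_MAGIC_MAX:
        raise ValueError("not a skippable frame")
    end = pos + 8 + frame_size  # Frame Size excludes the 8 header bytes
    if end > len(buf):
        raise ValueError("truncated user data")
    return end
```

A decoder walking concatenated frames would call this whenever it meets a skippable magic number,
then resume decoding at the returned offset.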

For the purpose of facilitating identification,
it is discouraged to start a flow of concatenated frames with a skippable frame.
If there is a need to start such a flow with some user data
encapsulated into a skippable frame,
it’s recommended to start with a zero-byte LZ4 frame
followed by a skippable frame.
This will make it easier for file type identifiers.


__Magic Number__

4 Bytes, Little endian format.
Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F.
All 16 values are valid to identify a skippable frame.

__Frame Size__

This is the size, in bytes, of the following User Data
(without including the magic number nor the size field itself).
4 Bytes, Little endian format, unsigned 32-bits.
This means User Data can’t be bigger than (2^32-1) Bytes.

__User Data__

User Data can be anything. Data will just be skipped by the decoder.


Legacy frame
------------

The Legacy frame format was defined in the initial versions of “LZ4Demo”.
Newer compressors should not use this format anymore, as it is too restrictive.

Main characteristics of the legacy format :

- Fixed block size : 8 MB.
- All blocks must be completely filled, except the last one.
- All blocks are always compressed, even when compression is detrimental.
- The last block is detected either because
  it is followed by the “EOF” (End of File) mark,
  or because it is followed by a known Frame Magic Number.
- No checksum
- Convention is Little endian

| MagicNb | B.CSize | CData | B.CSize | CData |  (...)  | EndMark |
| ------- | ------- | ----- | ------- | ----- | ------- | ------- |
| 4 bytes | 4 bytes | CSize | 4 bytes | CSize | x times |   EOF   |


__Magic Number__

4 Bytes, Little endian format.
Value : 0x184C2102

__Block Compressed Size__

This is the size, in bytes, of the following compressed data block.
4 Bytes, Little endian format.

__Data__

Where the actual compressed data stands.
Data is always compressed, even when compression is detrimental.

__EndMark__

End of legacy frame is implicit only.
It must be followed by a standard EOF (End Of File) signal,
whether it is a file or a stream.

Alternatively, if the frame is followed by a valid Frame Magic Number,
it is considered completed.
This policy makes it possible to concatenate legacy frames.

Any other value will be interpreted as a block size,
and trigger an error if it does not fit within acceptable range.


Version changes
---------------

1.6.1 : introduced terms "LZ4 Frame Header" and "LZ4 Frame Footer"

1.6.0 : restored Dictionary ID field in Frame header

1.5.1 : changed document format to MarkDown

1.5 : removed Dictionary ID from specification

1.4.1 : changed wording from “stream” to “frame”

1.4 : added skippable streams, re-added stream checksum

1.3 : modified header checksum

1.2 : reduced choice of “block size”, to postpone decision on “dynamic size of BlockSize Field”.

1.1 : optional fields are now part of the descriptor

1.0 : changed “block size” specification, adding a compressed/uncompressed flag

0.9 : reduced scale of “block maximum size” table

0.8 : removed : high compression flag

0.7 : removed : stream checksum

0.6 : settled : stream size uses 8 bytes, endian convention is little endian

0.5 : added copyright notice

0.4 : changed format to Google Doc compatible OpenDocument