LZ4 Frame Format Description
============================

### Notices

Copyright (c) 2013-2015 Yann Collet

Permission is granted to copy and distribute this document
for any purpose and without charge,
including translations into other languages
and incorporation into compilations,
provided that the copyright notice and this notice are preserved,
and that any substantive changes or deletions from the original
are clearly marked.
Distribution of this document is unlimited.

### Version

1.6.1 (30/01/2018)


Introduction
------------

The purpose of this document is to define a lossless compressed data format,
that is independent of CPU type, operating system,
file system and character set, suitable for
File compression, Pipe and streaming compression
using the [LZ4 algorithm](http://www.lz4.org).

The data can be produced or consumed,
even for an arbitrarily long sequentially presented input data stream,
using only an a priori bounded amount of intermediate storage,
and hence can be used in data communications.
The format uses the LZ4 compression method,
and optional [xxHash-32 checksum method](https://github.com/Cyan4973/xxHash),
for detection of data corruption.

The data format defined by this specification
does not attempt to allow random access to compressed data.

This specification is intended for use by implementers of software
to compress data into LZ4 format and/or decompress data from LZ4 format.
The text of the specification assumes a basic background in programming
at the level of bits and other primitive data representations.

Unless otherwise indicated below,
a compliant compressor must produce data sets
that conform to the specifications presented here.
It doesn’t need to support all options though.

A compliant decompressor must be able to decompress
at least one working set of parameters
that conforms to the specifications presented here.
It may also ignore checksums.
Whenever it does not support a specific parameter within the compressed stream,
it must produce a non-ambiguous error code
and associated error message explaining which parameter is unsupported.


General Structure of LZ4 Frame format
-------------------------------------

| MagicNb | F. Descriptor | Block | (...) | EndMark | C. Checksum |
|:-------:|:-------------:| ----- | ----- | ------- | ----------- |
| 4 bytes |  3-15 bytes   |       |       | 4 bytes | 0-4 bytes   |

__Magic Number__

4 Bytes, Little endian format.
Value : 0x184D2204

__Frame Descriptor__

3 to 15 Bytes, to be detailed in its own paragraph,
as it is the most important part of the spec.

The combined __Magic Number__ and __Frame Descriptor__ fields are sometimes
called ___LZ4 Frame Header___. Its size varies between 7 and 19 bytes.

__Data Blocks__

To be detailed in its own paragraph.
That’s where compressed data is stored.

__EndMark__

The flow of blocks ends when the last data block has a size of “0”.
The size is expressed as a 32-bit value.

__Content Checksum__

Content Checksum verifies that the full content has been decoded correctly.
The content checksum is the result
of [xxh32() hash function](https://github.com/Cyan4973/xxHash)
digesting the original (decoded) data as input, and a seed of zero.
Content checksum is only present when its associated flag
is set in the frame descriptor.
Content Checksum validates the result,
that all blocks were fully transmitted in the correct order and without error,
and also that the encoding/decoding process itself generated no distortion.
Its usage is recommended.

The combined __EndMark__ and __Content Checksum__ fields might sometimes be
referred to as ___LZ4 Frame Footer___. Its size varies between 4 and 8 bytes.
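
The layout above can be illustrated with a minimal Python sketch. It only covers recognizing the magic number, detecting the EndMark, and computing the footer size. The function names (`has_lz4_frame_magic`, `is_end_mark`, `footer_size`) are hypothetical and chosen for this example; they do not come from any reference implementation.

```python
import struct

# Magic number of an LZ4 frame, as given in this section.
LZ4_FRAME_MAGIC = 0x184D2204


def has_lz4_frame_magic(data: bytes) -> bool:
    """Return True if `data` starts with the LZ4 frame magic number.

    The value is stored little endian, so the bytes on the wire
    are 04 22 4D 18.
    """
    if len(data) < 4:
        return False
    (magic,) = struct.unpack_from("<I", data, 0)
    return magic == LZ4_FRAME_MAGIC


def is_end_mark(block_size_field: bytes) -> bool:
    """Return True if a 4-byte block size field is the EndMark (size 0)."""
    (size,) = struct.unpack("<I", block_size_field)
    return size == 0


def footer_size(content_checksum_flag: bool) -> int:
    """Footer = 4-byte EndMark, plus 4 bytes when a content checksum is present."""
    return 4 + (4 if content_checksum_flag else 0)
```

For instance, `has_lz4_frame_magic(bytes([0x04, 0x22, 0x4D, 0x18]))` is true, while the same four bytes in big-endian order are rejected.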

__Frame Concatenation__

In some circumstances, it may be preferable to append multiple frames,
for example in order to add new data to an existing compressed file
without re-framing it.

In such a case, each frame has its own set of descriptor flags.
Each frame is considered independent.
The only relation between frames is their sequential order.

The ability to decode multiple concatenated frames
within a single stream or file
is left outside of this specification.
As an example, the reference lz4 command line utility behavior is
to decode all concatenated frames in their sequential order.


Frame Descriptor
----------------

| FLG     | BD      | (Content Size) | (Dictionary ID) | HC      |
| ------- | ------- |:--------------:|:---------------:| ------- |
| 1 byte  | 1 byte  |  0 - 8 bytes   |   0 - 4 bytes   | 1 byte  |

The descriptor uses a minimum of 3 bytes,
and up to 15 bytes depending on optional parameters.

__FLG byte__

|  BitNb  |  7-6  |   5   |    4     |  3   |    2     |    1     |   0  |
| ------- |-------|-------|----------|------|----------|----------|------|
|FieldName|Version|B.Indep|B.Checksum|C.Size|C.Checksum|*Reserved*|DictID|


__BD byte__

|  BitNb  |     7    |     6-5-4     |  3-2-1-0 |
| ------- | -------- | ------------- | -------- |
|FieldName|*Reserved*| Block MaxSize |*Reserved*|

In the tables, bit 7 is highest bit, while bit 0 is lowest.

__Version Number__

2-bit field, must be set to `01`.
Any other value cannot be decoded by this version of the specification.
Other version numbers will use different flag layouts.

__Block Independence flag__

If this flag is set to “1”, blocks are independent.
If this flag is set to “0”, each block depends on previous ones
(up to LZ4 window size, which is 64 KB).
In such a case, it’s necessary to decode all blocks in sequence.

Block dependency improves compression ratio, especially for small blocks.
On the other hand, it makes random access or multi-threaded decoding impossible.

__Block checksum flag__

If this flag is set, each data block will be followed by a 4-byte checksum,
calculated by using the xxHash-32 algorithm on the raw (compressed) data block.
The intention is to detect data corruption (storage or transmission errors)
immediately, before decoding.
Block checksum usage is optional.

__Content Size flag__

If this flag is set, the uncompressed size of data included within the frame
will be present as an 8-byte unsigned little endian value, after the flags.
Content Size usage is optional.

__Content checksum flag__

If this flag is set, a 32-bit content checksum will be appended
after the EndMark.

__Dictionary ID flag__

If this flag is set, a 4-byte Dict-ID field will be present,
after the descriptor flags and the Content Size.

__Block Maximum Size__

This information is useful to help the decoder allocate memory.
Size here refers to the original (uncompressed) data size.
Block Maximum Size is one value among the following table :

|  0  |  1  |  2  |  3  |   4   |   5    |  6   |  7   |
| --- | --- | --- | --- | ----- | ------ | ---- | ---- |
| N/A | N/A | N/A | N/A | 64 KB | 256 KB | 1 MB | 4 MB |

The decoder may refuse to allocate block sizes above any system-specific size.
Unused values may be used in a future revision of the spec.
A decoder conformant with the current version of the spec
is only able to decode block sizes defined in this spec.

__Reserved bits__

Value of reserved bits **must** be 0 (zero).
Reserved bits might be used in a future version of the specification,
typically enabling new optional features.
When this happens, a decoder respecting the current specification version
shall not be able to decode such a frame.

__Content Size__

This is the original (uncompressed) size.
This information is optional, and only present if the associated flag is set.
Content size is provided using unsigned 8 Bytes, for a maximum of 16 Exabytes.
Format is Little endian.
This value is informational, typically for display or memory allocation.
It can be skipped by a decoder, or used to validate content correctness.

__Dictionary ID__

Dict-ID is only present if the associated flag is set.
It's an unsigned 32-bit value, stored using little-endian convention.
A dictionary is useful to compress short input sequences.
The compressor can take advantage of the dictionary context
to encode the input in a more compact manner.
It works as a kind of “known prefix” which is used by
both the compressor and the decompressor to “warm-up” reference tables.

The decompressor can use Dict-ID identifier to determine
which dictionary must be used to correctly decode data.
The compressor and the decompressor must use exactly the same dictionary.
It's presumed that the 32-bit dictID uniquely identifies a dictionary.

Within a single frame, a single dictionary can be defined.
When the frame descriptor defines independent blocks,
each block will be initialized with the same dictionary.
If the frame descriptor defines linked blocks,
the dictionary will only be used once, at the beginning of the frame.

__Header Checksum__

One-byte checksum of combined descriptor fields, including optional ones.
The value is the second byte of `xxh32()` : ` (xxh32()>>8) & 0xFF `
using zero as a seed, and the full Frame Descriptor as an input
(including optional fields when they are present).
A wrong checksum indicates an error in the descriptor.
Header checksum is informational and can be skipped.
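
The descriptor fields described above can be decoded along the lines of the following Python sketch. It is an illustration under stated assumptions, not a reference implementation: the names `FrameDescriptor`, `parse_frame_descriptor` and `BLOCK_MAX_SIZES` are hypothetical, and the optional `xxh32` argument stands for any caller-supplied function computing xxHash-32 over `(data, seed)`, since the Python standard library does not provide one. The sketch assumes the checksum input is the descriptor bytes that precede the HC byte.

```python
import struct
from dataclasses import dataclass
from typing import Callable, Optional

# Block Maximum Size table from the BD byte; values 0-3 are not defined.
BLOCK_MAX_SIZES = {4: 64 << 10, 5: 256 << 10, 6: 1 << 20, 7: 4 << 20}


@dataclass
class FrameDescriptor:
    block_independence: bool
    block_checksum: bool
    content_checksum: bool
    block_max_size: int
    content_size: Optional[int]
    dict_id: Optional[int]
    header_checksum: int
    size: int  # total descriptor length in bytes (3 to 15)


def parse_frame_descriptor(
    buf: bytes,
    xxh32: Optional[Callable[[bytes, int], int]] = None,  # hypothetical hook
) -> FrameDescriptor:
    """Decode FLG, BD, optional fields and HC from the start of `buf`."""
    if len(buf) < 3:
        raise ValueError("descriptor needs at least 3 bytes")
    flg, bd = buf[0], buf[1]
    if (flg >> 6) != 0b01:
        raise ValueError("unsupported version number")
    if (flg & 0x02) or (bd & 0x8F):
        raise ValueError("reserved bits must be zero")
    block_id = (bd >> 4) & 0x07
    if block_id not in BLOCK_MAX_SIZES:
        raise ValueError("undefined Block Maximum Size value")

    has_content_size = bool(flg & 0x08)
    has_dict_id = bool(flg & 0x01)
    total = 3 + (8 if has_content_size else 0) + (4 if has_dict_id else 0)
    if len(buf) < total:
        raise ValueError("truncated descriptor")

    pos = 2
    content_size = dict_id = None
    if has_content_size:
        (content_size,) = struct.unpack_from("<Q", buf, pos)  # 8 bytes, LE
        pos += 8
    if has_dict_id:
        (dict_id,) = struct.unpack_from("<I", buf, pos)  # 4 bytes, LE
        pos += 4
    hc = buf[pos]

    # Verification is optional: it only runs when a hash function is supplied.
    if xxh32 is not None and ((xxh32(buf[:pos], 0) >> 8) & 0xFF) != hc:
        raise ValueError("header checksum mismatch")

    return FrameDescriptor(
        block_independence=bool(flg & 0x20),
        block_checksum=bool(flg & 0x10),
        content_checksum=bool(flg & 0x04),
        block_max_size=BLOCK_MAX_SIZES[block_id],
        content_size=content_size,
        dict_id=dict_id,
        header_checksum=hc,
        size=total,
    )
```

Rejecting non-zero reserved bits and undefined block size values follows the rule that a decoder for the current version shall not decode frames using features it does not know. Leaving the hash function as a parameter keeps the sketch free of third-party dependencies.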


Data Blocks
-----------

| Block Size |  data  | (Block Checksum) |
|:----------:| ------ |:----------------:|
|  4 bytes   |        |   0 - 4 bytes    |


__Block Size__

This field uses 4 bytes, format is little-endian.

The highest bit is “1” if data in the block is uncompressed.

The highest bit is “0” if data in the block is compressed by LZ4.

All other bits give the size, in bytes, of the following data block.
The size does not include the block checksum if present.

Block Size shall never be larger than Block Maximum Size.
Such a thing could potentially happen for non-compressible sources.
In such a case, the data block shall be passed using uncompressed format.

__Data__

Where the actual data to decode stands.
It might be compressed or not, depending on previous field indications.

When compressed, the data must respect the [LZ4 block format specification](https://github.com/lz4/lz4/blob/master/doc/lz4_Block_format.md).

Note that the block is not necessarily full.
Uncompressed size of data can be any size, up to “Block Maximum Size”,
so it may contain less data than the maximum block size.

__Block checksum__

Only present if the associated flag is set.
This is a 4-byte checksum value, in little endian format,
calculated by using the xxHash-32 algorithm on the raw (undecoded) data block,
and a seed of zero.
The intention is to detect data corruption (storage or transmission errors)
before decoding.

Block checksum is cumulative with Content checksum.


Skippable Frames
----------------

| Magic Number | Frame Size | User Data |
|:------------:|:----------:| --------- |
|   4 bytes    |  4 bytes   |           |

Skippable frames allow the integration of user-defined data
into a flow of concatenated frames.
Its design is pretty straightforward,
with the sole objective to allow the decoder to quickly skip
over user-defined data and continue decoding.

For the purpose of facilitating identification,
it is discouraged to start a flow of concatenated frames with a skippable frame.
If there is a need to start such a flow with some user data
encapsulated into a skippable frame,
it’s recommended to start with a zero-byte LZ4 frame
followed by a skippable frame.
This will make it easier for file type identifiers.


__Magic Number__

4 Bytes, Little endian format.
Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F.
All 16 values are valid to identify a skippable frame.

__Frame Size__

This is the size, in bytes, of the following User Data
(without including the magic number nor the size field itself).
4 Bytes, Little endian format, unsigned 32-bits.
This means User Data can’t be bigger than (2^32-1) Bytes.

__User Data__

User Data can be anything. Data will just be skipped by the decoder.


Legacy frame
------------

The Legacy frame format was defined in the initial versions of “LZ4Demo”.
Newer compressors should not use this format anymore, as it is too restrictive.

Main characteristics of the legacy format :

- Fixed block size : 8 MB.
- All blocks must be completely filled, except the last one.
- All blocks are always compressed, even when compression is detrimental.
- The last block is detected either because
  it is followed by the “EOF” (End of File) mark,
  or because it is followed by a known Frame Magic Number.
- No checksum
- Convention is Little endian

| MagicNb | B.CSize | CData | B.CSize | CData |  (...)  | EndMark |
| ------- | ------- | ----- | ------- | ----- | ------- | ------- |
| 4 bytes | 4 bytes | CSize | 4 bytes | CSize | x times |   EOF   |


__Magic Number__

4 Bytes, Little endian format.
Value : 0x184C2102

__Block Compressed Size__

This is the size, in bytes, of the following compressed data block.
4 Bytes, Little endian format.

__Data__

Where the actual compressed data stands.
Data is always compressed, even when compression is detrimental.

__EndMark__

End of legacy frame is implicit only.
It must be followed by a standard EOF (End Of File) signal,
whether it is a file or a stream.

Alternatively, if the frame is followed by a valid Frame Magic Number,
it is considered completed.
This policy makes it possible to concatenate legacy frames.

Any other value will be interpreted as a block size,
and trigger an error if it does not fit within acceptable range.


Version changes
---------------

1.6.1 : introduced terms "LZ4 Frame Header" and "LZ4 Frame Footer"

1.6.0 : restored Dictionary ID field in Frame header

1.5.1 : changed document format to MarkDown

1.5 : removed Dictionary ID from specification

1.4.1 : changed wording from “stream” to “frame”

1.4 : added skippable streams, re-added stream checksum

1.3 : modified header checksum

1.2 : reduced choice of “block size”, to postpone decision on “dynamic size of BlockSize Field”.

1.1 : optional fields are now part of the descriptor

1.0 : changed “block size” specification, adding a compressed/uncompressed flag

0.9 : reduced scale of “block maximum size” table

0.8 : removed : high compression flag

0.7 : removed : stream checksum

0.6 : settled : stream size uses 8 bytes, endian convention is little endian

0.5 : added copyright notice

0.4 : changed format to Google Doc compatible OpenDocument