1# Deflate 2 3Deflate is a general purpose compression format, using Huffman codes and 4Lempel-Ziv style length-distance back-references. It is specified in [RFC 51951](https://www.ietf.org/rfc/rfc1951.txt). 6 7Gzip ([RFC 1952](https://www.ietf.org/rfc/rfc1952.txt)) and zlib ([RFC 81950](https://www.ietf.org/rfc/rfc1950.txt)) are two thin wrappers (with 9similar purpose, but different wire formats) around the raw deflate encoding. 10Gzip (the file format) is used by the `gzip` and `tar` command line tools and 11by the HTTP network protocol. Zlib is used by the ELF executable and PNG image 12file formats. 13 14[Zip](https://support.pkware.com/display/PKZIP/APPNOTE) (also known as PKZIP) 15is another wrapper, similar to combining gzip with tar, that can compress 16multiple files into a single archive. Zip is widely used by the ECMA Office 17Open XML format, the OASIS Open Document Format for Office Applications and the 18Java JAR format. 19 20Wrangling those formats that build on deflate (gzip, zip and zlib) is not 21provided by this package. For zlib, look at the `std/zlib` package instead. The 22other formats are TODO. 23 24For example, look at `test/data/romeo.txt*`. First, the uncompressed text: 25 26 $ xxd test/data/romeo.txt 27 00000000: 526f 6d65 6f20 616e 6420 4a75 6c69 6574 Romeo and Juliet 28 00000010: 0a45 7863 6572 7074 2066 726f 6d20 4163 .Excerpt from Ac 29 etc 30 00000390: 740a 536f 2073 7475 6d62 6c65 7374 206f t.So stumblest o 31 000003a0: 6e20 6d79 2063 6f75 6e73 656c 3f0a n my counsel?. 32 33The raw deflate encoding: 34 35 $ xxd test/data/romeo.txt.deflate 36 00000000: 4d53 c16e db30 0cbd f32b d853 2e46 0e3d MS.n.0...+.S.F.= 37 00000010: 2e87 20ed 0234 c5ba 0049 861e 861d 149b .. ..4...I...... 38 etc 39 00000200: 7d13 8ffd b9a3 24bb 68f4 eb30 7a59 610d }.....$.h..0zYa. 40 00000210: 7f01 .. 41 42The gzip format wraps a variable length header (in this case, 20 bytes) and 8 43byte footer around the raw deflate data. The header contains the NUL-terminated 44C string name of the original, uncompressed file, "romeo.txt", amongst other 45data. The footer contains a 4 byte CRC32 checksum and a 4 byte length of the 46uncompressed file (0x3ae = 942 bytes). 47 48 $ xxd test/data/romeo.txt.gz 49 00000000: 1f8b 0808 26d8 5d59 0003 726f 6d65 6f2e ....&.]Y..romeo. 50 00000010: 7478 7400 4d53 c16e db30 0cbd f32b d853 txt.MS.n.0...+.S 51 etc 52 00000210: de5d 2c67 7d13 8ffd b9a3 24bb 68f4 eb30 .],g}.....$.h..0 53 00000220: 7a59 610d 7f01 ef07 e5ab ae03 0000 zYa........... 54 55The zlib format wraps a 2 byte header and 4 byte footer around the raw deflate 56data. The footer contains a 4 byte Adler32 checksum. TODO: move this to 57std/zlib/README.md. 58 59 $ xxd test/data/romeo.txt.zlib 60 00000000: 789c 4d53 c16e db30 0cbd f32b d853 2e46 x.MS.n.0...+.S.F 61 00000010: 0e3d 2e87 20ed 0234 c5ba 0049 861e 861d .=.. ..4...I.... 62 etc 63 00000200: 2c67 7d13 8ffd b9a3 24bb 68f4 eb30 7a59 ,g}.....$.h..0zY 64 00000210: 610d 7f01 57bb 3ede a...W.>. 65 66 67# Wire Format Worked Example 68 69Consider `test/data/romeo.txt.deflate`. The relevant spec is RFC 1951. 70 71 offset xoffset ASCII hex binary 72 000000 0x0000 M 0x4D 0b_0100_1101 73 000001 0x0001 S 0x53 0b_0101_0011 74 000002 0x0002 . 0xC1 0b_1100_0001 75 000003 0x0003 n 0x6E 0b_0110_1110 76 000004 0x0004 . 0xDB 0b_1101_1011 77 000005 0x0005 0 0x30 0b_0011_0000 78 000006 0x0006 . 0x0C 0b_0000_1100 79 000007 0x0007 . 0xBD 0b_1011_1101 80 000008 0x0008 . 0xF3 0b_1111_0011 81 000009 0x0009 + 0x2B 0b_0010_1011 82 000010 0x000A . 0xD8 0b_1101_1000 83 000011 0x000B S 0x53 0b_0101_0011 84 000012 0x000C . 0x2E 0b_0010_1110 85 000013 0x000D F 0x46 0b_0100_0110 86 000014 0x000E . 0x0E 0b_0000_1110 87 000015 0x000F = 0x3D 0b_0011_1101 88 000016 0x0010 . 0x2E 0b_0010_1110 89 000017 0x0011 . 0x87 0b_1000_0111 90 000018 0x0012 0x20 0b_0010_0000 91 000019 0x0013 . 0xED 0b_1110_1101 92 etc 93 000522 0x020A . 0xEB 0b_1110_1011 94 000523 0x020B 0 0x30 0b_0011_0000 95 000524 0x020C z 0x7A 0b_0111_1010 96 000525 0x020D Y 0x59 0b_0101_1001 97 000526 0x020E a 0x61 0b_0110_0001 98 000527 0x020F . 0x0D 0b_0000_1101 99 000528 0x0210 . 0x7F 0b_0111_1111 100 000529 0x0211 . 0x01 0b_0000_0001 101 102Starting at the top: 103 104 offset xoffset ASCII hex binary 105 000000 0x0000 M 0x4D 0b_0100_1101 106 107As per the RFC section 3.2.3. Details of block format: 108 109 - BFINAL is the first (LSB) bit, 0b1. This block is the final block. 110 - BTYPE is the next two bits in natural (MSB to LSB) order, 0b10, which means 111 a block that's compressed with dynamic Huffman codes. 112 113There are 3 block types: uncompressed, compressed with fixed Huffman codes and 114compressed with dynamic Huffman codes. The first type is straightforward: 115containing a uint16 length `N` (and its complement, for error detection) and 116then `N` literal bytes. The second type is the same as the third type but with 117built-in `lcode` and `dcode` tables (see below). The third type is the 118interesting part of the deflate format, and its deconstruction continues below. 119 120 121## Dynamic Huffman Blocks 122 123The bit stream is now at: 124 125 offset xoffset ASCII hex binary 126 000000 0x0000 M 0x4D 0b_0100_1... 127 000001 0x0001 S 0x53 0b_0101_0011 128 000002 0x0002 . 0xC1 0b_1100_0001 129 130As per the RFC section 3.2.7. Compression with dynamic Huffman codes, three 131variables (5, 5 and 4 bits) are read in the same natural (MSB to LSB) order: 132 133 - HLIT = 0b01001 = 9 134 - HDIST = 0b10011 = 19 135 - HCLEN = 0b1010 = 10 136 137Adding the implicit biases give: 138 139 nlit = 9 + 257 = 266 140 ndist = 19 + 1 = 20 141 nclen = 10 + 4 = 14 142 143 144## Code Length Code Lengths 145 146The bit stream is now at: 147 148 offset xoffset ASCII hex binary 149 000002 0x0002 . 0xC1 0b_1100_000. 150 000003 0x0003 n 0x6E 0b_0110_1110 151 000004 0x0004 . 0xDB 0b_1101_1011 152 000005 0x0005 0 0x30 0b_0011_0000 153 000006 0x0006 . 0x0C 0b_0000_1100 154 000007 0x0007 . 0xBD 0b_1011_1101 155 156There are 19 possible code length code lengths (this is not a typo), but the 157wire format order is shuffled (in the idiosyncratic order: 16, 17, 18, 0, 8, 158..., 15) so that later elements are more likely to be zero (and hence 159compressible). There are `nclen` (in this case, 14) explicit code length code 160lengths, 3 bits each. 161 162 clcode_lengths[16] = 0b000 = 0 163 clcode_lengths[17] = 0b100 = 4 164 clcode_lengths[18] = 0b101 = 5 165 clcode_lengths[ 0] = 0b011 = 3 166 clcode_lengths[ 8] = 0b011 = 3 167 clcode_lengths[ 7] = 0b011 = 3 168 clcode_lengths[ 9] = 0b011 = 3 169 clcode_lengths[ 6] = 0b011 = 3 170 clcode_lengths[10] = 0b000 = 0 171 clcode_lengths[ 5] = 0b011 = 3 172 clcode_lengths[11] = 0b000 = 0 173 clcode_lengths[ 4] = 0b011 = 3 174 clcode_lengths[12] = 0b000 = 0 175 clcode_lengths[ 3] = 0b101 = 5 176 177The remaining (19 - `nclen`) = 5 entries are implicitly zero: 178 179 clcode_lengths[13] = 0 180 clcode_lengths[ 2] = 0 181 clcode_lengths[14] = 0 182 clcode_lengths[ 1] = 0 183 clcode_lengths[15] = 0 184 185Undoing the shuffle gives: 186 187 clcode_lengths[ 0] = 3 188 clcode_lengths[ 1] = 0 189 clcode_lengths[ 2] = 0 190 clcode_lengths[ 3] = 5 191 clcode_lengths[ 4] = 3 192 clcode_lengths[ 5] = 3 193 clcode_lengths[ 6] = 3 194 clcode_lengths[ 7] = 3 195 clcode_lengths[ 8] = 3 196 clcode_lengths[ 9] = 3 197 clcode_lengths[10] = 0 198 clcode_lengths[11] = 0 199 clcode_lengths[12] = 0 200 clcode_lengths[13] = 0 201 clcode_lengths[14] = 0 202 clcode_lengths[15] = 0 203 clcode_lengths[16] = 0 204 clcode_lengths[17] = 4 205 clcode_lengths[18] = 5 206 207For `clcode`s, there are 7 + 1 + 2 = 10 non-zero entries: 7 3-bit codes, 1 2084-bit code and 2 5-bit codes. The canonical Huffman encoding's map from bits to 209`clcode`s is: 210 211 bits clcode 212 0b000 0 213 0b001 4 214 0b010 5 215 0b011 6 216 0b100 7 217 0b101 8 218 0b110 9 219 0b1110 17 220 0b11110 3 221 0b11111 18 222 223Call that Huffman table `H-CL`. 224 225 226## Lcodes and Dcodes 227 228The bit stream is now at: 229 230 offset xoffset ASCII hex binary 231 000007 0x0007 . 0xBD 0b_1011_1... 232 000008 0x0008 . 0xF3 0b_1111_0011 233 000009 0x0009 + 0x2B 0b_0010_1011 234 000010 0x000A . 0xD8 0b_1101_1000 235 000011 0x000B S 0x53 0b_0101_0011 236 000012 0x000C . 0x2E 0b_0010_1110 237 000013 0x000D F 0x46 0b_0100_0110 238 239When decoding via a Huffman table, bits are read in LSB to MSB order, 240right-to-left in this debugging output. Extra bits, if any, are then read in 241the other, natural order (MSB to LSB). Decoding via `H-CL` gives: 242 243 bits clcode 244 0b1110 17, 3 extra bits = 0b111 = 7 245 0b001 4 246 0b11111 18, 7 extra bits = 0b0001010 = 10 247 0b001 4 248 0b101 8 249 0b1110 17, 3 extra bits = 0b010 = 2 250 0b100 7 251 0b1110 17, 3 extra bits = 0b001 = 1 252 0b011 6 253 0b000 0 254 255Still in the RFC section 3.2.7., this means: 256 257 lcode_lengths has 3 + 7 = 10 consecutive zeroes 258 lcode_lengths[ 10] = 4 259 lcode_lengths has 11 + 10 = 21 consecutive zeroes 260 lcode_lengths[ 32] = 4 261 lcode_lengths[ 33] = 8 262 lcode_lengths has 3 + 2 = 5 consecutive zeroes 263 lcode_lengths[ 39] = 7 264 lcode_lengths has 3 + 1 = 4 consecutive zeroes 265 lcode_lengths[ 44] = 6 266 lcode_lengths[ 45] = 0 267 268After decoding the first 10 code lengths, the bit stream is now at: 269 270 offset xoffset ASCII hex binary 271 000013 0x000D F 0x46 0b_01.._.... 272 273Continuing to read all `nlit` (266) `lcode` lengths and `ndist` (20) `dcode` 274lengths (zeroes are not shown here): 275 276 lcode_lengths[ 10] = 4 277 lcode_lengths[ 32] = 4 278 lcode_lengths[ 33] = 8 279 lcode_lengths[ 39] = 7 280 lcode_lengths[ 44] = 6 281 lcode_lengths[ 46] = 7 282 lcode_lengths[ 50] = 8 283 lcode_lengths[ 58] = 9 284 lcode_lengths[ 59] = 7 285 lcode_lengths[ 63] = 7 286 etc 287 lcode_lengths[264] = 9 288 lcode_lengths[265] = 9 289 290and 291 292 dcode_lengths[ 5] = 6 293 dcode_lengths[ 6] = 5 294 dcode_lengths[ 8] = 4 295 dcode_lengths[ 9] = 5 296 dcode_lengths[10] = 3 297 etc 298 dcode_lengths[18] = 3 299 dcode_lengths[19] = 6 300 301The `H-CL` table is no longer used from here onwards. 302 303For `lcode`s, there are 6 + 9 + 10 + 16 + 10 + 12 = 63 non-zero entries: 6 3044-bit codes, 9 5-bit codes, 10 6-bit codes, 16 7-bit codes, 10 8-bit codes and 30512 9-bit codes. The canonical Huffman encoding's map from bits to `lcode` 306values is: 307 308 bits lcode 309 0b0000 10 310 0b0001 32 311 0b0010 101 312 0b0011 116 313 0b0100 257 314 0b0101 259 315 0b01100 97 316 0b01101 104 317 0b01110 105 318 0b01111 110 319 0b10000 111 320 0b10001 114 321 0b10010 115 322 0b10011 258 323 0b10100 260 324 0b101010 44 325 0b101011 99 326 0b101100 100 327 0b101101 102 328 0b101110 108 329 0b101111 109 330 0b110000 112 331 0b110001 117 332 0b110010 119 333 0b110011 121 334 0b1101000 39 335 0b1101001 46 336 0b1101010 59 337 0b1101011 63 338 0b1101100 65 339 0b1101101 69 340 0b1101110 73 341 0b1101111 79 342 0b1110000 82 343 0b1110001 83 344 0b1110010 84 345 0b1110011 98 346 0b1110100 103 347 0b1110101 261 348 0b1110110 262 349 0b1110111 263 350 0b11110000 33 351 0b11110001 50 352 0b11110010 66 353 0b11110011 67 354 0b11110100 72 355 0b11110101 74 356 0b11110110 77 357 0b11110111 87 358 0b11111000 107 359 0b11111001 118 360 0b111110100 58 361 0b111110101 68 362 0b111110110 76 363 0b111110111 78 364 0b111111000 85 365 0b111111001 91 366 0b111111010 93 367 0b111111011 120 368 0b111111100 122 369 0b111111101 256 370 0b111111110 264 371 0b111111111 265 372 373Call that Huffman table `H-L`. 374 375For `dcode`s, there are 5 + 4 + 3 + 2 = 14 non-zero entries: 5 3-bit codes, 4 3764-bit codes, 3 5-bit codes and 2 6-bit codes. The canonical Huffman encoding's 377map from bits to `dcode` values is: 378 379 bits dcode 380 0b000 10 381 0b001 14 382 0b010 16 383 0b011 17 384 0b100 18 385 0b1010 8 386 0b1011 12 387 0b1100 13 388 0b1101 15 389 0b11100 6 390 0b11101 9 391 0b11110 11 392 0b111110 5 393 0b111111 19 394 395Call that Huffman table `H-D`. 396 397 398## Decoding Literals 399 400The bit stream is now at: 401 402 offset xoffset ASCII hex binary 403 000052 0x0034 . 0xE7 0b_1..._.... 404 000053 0x0035 C 0x43 0b_0100_0011 405 000054 0x0036 . 0xE8 0b_1110_1000 406 000055 0x0037 ) 0x29 0b_0010_1001 407 000056 0x0038 . 0xA0 0b_1010_0000 408 000057 0x0039 . 0xF1 0b_1111_0001 409 410Decoding from that bit stream via `H-L` gives some literal bytes (where the 411`lcode` value is < 256): 412 413 bits lcode 414 0b1110000 82 (ASCII 'R') 415 0b10000 111 (ASCII 'o') 416 0b101111 109 (ASCII 'm') 417 0b0010 101 (ASCII 'e') 418 0b10000 111 (ASCII 'o') 419 0b0001 32 (ASCII ' ') 420 0b01100 97 (ASCII 'a') 421 0b01111 110 (ASCII 'n') 422 423 424## Decoding Back-References 425 426This continues, up until we get to the "O Romeo, Romeo! wherefore art thou 427Romeo?" line. 428 429The bit stream is now at: 430 431 offset xoffset ASCII hex binary 432 000089 0x0059 . 0xC1 0b_11.._.... 433 000090 0x005A . 0x1E 0b_0001_1110 434 000091 0x005B . 0x0F 0b_0000_1111 435 000092 0x005C . 0xF9 0b_1111_1001 436 437Decoding from that bit stream via `H-L` gives: 438 439 bits lcode 440 0b1101111 79 (ASCII 'O') 441 0b0001 32 (ASCII ' ') 442 0b1110000 82 (ASCII 'R') 443 0b10011 258 444 445This 258 is our first non-literal `lcode`, the start of a length-distance 446back-reference. We first need to decode the length. The RFC section 3.2.5. 447Compressed blocks (length and distance codes) says that an `lcode` of 258 means 448that length=4 with no extra bits. 449 450The bit stream is now at: 451 452 offset xoffset ASCII hex binary 453 000092 0x005C . 0xF9 0b_111._.... 454 000093 0x005D Y 0x59 0b_0101_1001 455 456We next decode via `H-D` instead of `H-L`. 457 458 bits dcode 459 0b11110 11 460 461Again, from section 3.2.5., a `dcode` of 11 means a distance in the range [49, 46264], 49 plus 4 extra bits. The next 4 bits are 0b0110=6, so the distance is 55. 463We therefore copy length=4 bytes from distance=55 bytes ago: "omeo". 464 465The bit stream is now at: 466 467 offset xoffset ASCII hex binary 468 000093 0x005D Y 0x59 0b_01.._.... 469 000094 0x005E U 0x55 0b_0101_0101 470 000095 0x005F > 0x3E 0b_0011_1110 471 472We go back to decoding via `H-L`, which gives 473 474 bits lcode 475 0b101010 44 (ASCII ',') 476 0b10100 260 477 478This is another non-literal. Section 3.2.5. says that an `lcode` of 260 means 479length=6 with no extra bits. 480 481The bit stream is now at: 482 483 offset xoffset ASCII hex binary 484 000095 0x005F > 0x3E 0b_0011_111. 485 486Decode with `H-D`. 487 488 bits dcode 489 0b111110 5 490 491Again, from section 3.2.5., a `dcode` of 5 means a distance in the range [7, 4928], 7 plus 1 extra bit. The next 1 bits are 0b0=0, so the distance is 7. We 493therefore copy length=6 bytes from distance=7 bytes ago: " Romeo". 494 495The bit stream is now at: 496 497 offset xoffset ASCII hex binary 498 000096 0x0060 . 0x0F 0b_0000_1111 499 500We go back to decoding via `H-L`, which gives 501 502 bits lcode 503 0b11110000 33 (ASCII '!') 504 505And on we go. 506 507 508## End of Block 509 510The block finishes with the bit stream at: 511 512 offset xoffset ASCII hex binary 513 000522 0x020A . 0xEB 0b_1110_101. 514 000523 0x020B 0 0x30 0b_0011_0000 515 000524 0x020C z 0x7A 0b_0111_1010 516 000525 0x020D Y 0x59 0b_0101_1001 517 000526 0x020E a 0x61 0b_0110_0001 518 000527 0x020F . 0x0D 0b_0000_1101 519 000528 0x0210 . 0x7F 0b_0111_1111 520 000529 0x0211 . 0x01 0b_0000_0001 521 522The decoding is: 523 524 bits lcode 525 0b101011 99 (ASCII 'c') 526 0b10000 111 (ASCII 'o') 527 0b110001 117 (ASCII 'u') 528 0b01111 110 (ASCII 'n') 529 0b0100 257 (length=3 + 0_extra_bits...) 530 531 bits dcode 532 0b1101 15 (distance=193 + 6_extra_bits, 0b000010 in MSB to LSB order; 533 length=3, distance=195: copy "sel") 534 535 bits lcode 536 0b1101011 63 (ASCII '?') 537 0b0000 10 (ASCII '\n') 538 0b111111101 256 (end of block) 539 540In other words, the block ends with "counsel?\n". 541