README.md
1# Deflate
2
3Deflate is a general purpose compression format, using Huffman codes and
4Lempel-Ziv style length-distance back-references. It is specified in [RFC
51951](https://www.ietf.org/rfc/rfc1951.txt).
6
7Gzip ([RFC 1952](https://www.ietf.org/rfc/rfc1952.txt)) and zlib ([RFC
81950](https://www.ietf.org/rfc/rfc1950.txt)) are two thin wrappers (with
9similar purpose, but different wire formats) around the raw deflate encoding.
10Gzip (the file format) is used by the `gzip` and `tar` command line tools and
11by the HTTP network protocol. Zlib is used by the ELF executable and PNG image
12file formats.
13
14[Zip](https://support.pkware.com/display/PKZIP/APPNOTE) (also known as PKZIP)
15is another wrapper, similar to combining gzip with tar, that can compress
16multiple files into a single archive. Zip is widely used by the ECMA Office
17Open XML format, the OASIS Open Document Format for Office Applications and the
18Java JAR format.
19
20Wrangling those formats that build on deflate (gzip, zip and zlib) is not
21provided by this package. For zlib, look at the `std/zlib` package instead. The
22other formats are TODO.
23
24For example, look at `test/data/romeo.txt*`. First, the uncompressed text:
25
26 $ xxd test/data/romeo.txt
27 00000000: 526f 6d65 6f20 616e 6420 4a75 6c69 6574 Romeo and Juliet
28 00000010: 0a45 7863 6572 7074 2066 726f 6d20 4163 .Excerpt from Ac
29 etc
30 00000390: 740a 536f 2073 7475 6d62 6c65 7374 206f t.So stumblest o
31 000003a0: 6e20 6d79 2063 6f75 6e73 656c 3f0a n my counsel?.
32
33The raw deflate encoding:
34
35 $ xxd test/data/romeo.txt.deflate
36 00000000: 4d53 c16e db30 0cbd f32b d853 2e46 0e3d MS.n.0...+.S.F.=
37 00000010: 2e87 20ed 0234 c5ba 0049 861e 861d 149b .. ..4...I......
38 etc
39 00000200: 7d13 8ffd b9a3 24bb 68f4 eb30 7a59 610d }.....$.h..0zYa.
40 00000210: 7f01 ..
41
42The gzip format wraps a variable length header (in this case, 20 bytes) and 8
43byte footer around the raw deflate data. The header contains the NUL-terminated
44C string name of the original, uncompressed file, "romeo.txt", amongst other
45data. The footer contains a 4 byte CRC32 checksum and a 4 byte length of the
46uncompressed file (0x3ae = 942 bytes).
47
48 $ xxd test/data/romeo.txt.gz
49 00000000: 1f8b 0808 26d8 5d59 0003 726f 6d65 6f2e ....&.]Y..romeo.
50 00000010: 7478 7400 4d53 c16e db30 0cbd f32b d853 txt.MS.n.0...+.S
51 etc
52 00000210: de5d 2c67 7d13 8ffd b9a3 24bb 68f4 eb30 .],g}.....$.h..0
53 00000220: 7a59 610d 7f01 ef07 e5ab ae03 0000 zYa...........
54
55The zlib format wraps a 2 byte header and 4 byte footer around the raw deflate
56data. The footer contains a 4 byte Adler32 checksum. TODO: move this to
57std/zlib/README.md.
58
59 $ xxd test/data/romeo.txt.zlib
60 00000000: 789c 4d53 c16e db30 0cbd f32b d853 2e46 x.MS.n.0...+.S.F
61 00000010: 0e3d 2e87 20ed 0234 c5ba 0049 861e 861d .=.. ..4...I....
62 etc
63 00000200: 2c67 7d13 8ffd b9a3 24bb 68f4 eb30 7a59 ,g}.....$.h..0zY
64 00000210: 610d 7f01 57bb 3ede a...W.>.
65
66
67# Wire Format Worked Example
68
69Consider `test/data/romeo.txt.deflate`. The relevant spec is RFC 1951.
70
71 offset xoffset ASCII hex binary
72 000000 0x0000 M 0x4D 0b_0100_1101
73 000001 0x0001 S 0x53 0b_0101_0011
74 000002 0x0002 . 0xC1 0b_1100_0001
75 000003 0x0003 n 0x6E 0b_0110_1110
76 000004 0x0004 . 0xDB 0b_1101_1011
77 000005 0x0005 0 0x30 0b_0011_0000
78 000006 0x0006 . 0x0C 0b_0000_1100
79 000007 0x0007 . 0xBD 0b_1011_1101
80 000008 0x0008 . 0xF3 0b_1111_0011
81 000009 0x0009 + 0x2B 0b_0010_1011
82 000010 0x000A . 0xD8 0b_1101_1000
83 000011 0x000B S 0x53 0b_0101_0011
84 000012 0x000C . 0x2E 0b_0010_1110
85 000013 0x000D F 0x46 0b_0100_0110
86 000014 0x000E . 0x0E 0b_0000_1110
87 000015 0x000F = 0x3D 0b_0011_1101
88 000016 0x0010 . 0x2E 0b_0010_1110
89 000017 0x0011 . 0x87 0b_1000_0111
90 000018 0x0012 0x20 0b_0010_0000
91 000019 0x0013 . 0xED 0b_1110_1101
92 etc
93 000522 0x020A . 0xEB 0b_1110_1011
94 000523 0x020B 0 0x30 0b_0011_0000
95 000524 0x020C z 0x7A 0b_0111_1010
96 000525 0x020D Y 0x59 0b_0101_1001
97 000526 0x020E a 0x61 0b_0110_0001
98 000527 0x020F . 0x0D 0b_0000_1101
99 000528 0x0210 . 0x7F 0b_0111_1111
100 000529 0x0211 . 0x01 0b_0000_0001
101
102Starting at the top:
103
104 offset xoffset ASCII hex binary
105 000000 0x0000 M 0x4D 0b_0100_1101
106
107As per the RFC section 3.2.3. Details of block format:
108
109 - BFINAL is the first (LSB) bit, 0b1. This block is the final block.
110 - BTYPE is the next two bits in natural (MSB to LSB) order, 0b10, which means
111 a block that's compressed with dynamic Huffman codes.
112
113There are 3 block types: uncompressed, compressed with fixed Huffman codes and
114compressed with dynamic Huffman codes. The first type is straightforward:
115containing a uint16 length `N` (and its complement, for error detection) and
116then `N` literal bytes. The second type is the same as the third type but with
117built-in `lcode` and `dcode` tables (see below). The third type is the
118interesting part of the deflate format, and its deconstruction continues below.
119
120
121## Dynamic Huffman Blocks
122
123The bit stream is now at:
124
125 offset xoffset ASCII hex binary
126 000000 0x0000 M 0x4D 0b_0100_1...
127 000001 0x0001 S 0x53 0b_0101_0011
128 000002 0x0002 . 0xC1 0b_1100_0001
129
130As per the RFC section 3.2.7. Compression with dynamic Huffman codes, three
131variables (5, 5 and 4 bits) are read in the same natural (MSB to LSB) order:
132
133 - HLIT = 0b01001 = 9
134 - HDIST = 0b10011 = 19
135 - HCLEN = 0b1010 = 10
136
137Adding the implicit biases give:
138
139 nlit = 9 + 257 = 266
140 ndist = 19 + 1 = 20
141 nclen = 10 + 4 = 14
142
143
144## Code Length Code Lengths
145
146The bit stream is now at:
147
148 offset xoffset ASCII hex binary
149 000002 0x0002 . 0xC1 0b_1100_000.
150 000003 0x0003 n 0x6E 0b_0110_1110
151 000004 0x0004 . 0xDB 0b_1101_1011
152 000005 0x0005 0 0x30 0b_0011_0000
153 000006 0x0006 . 0x0C 0b_0000_1100
154 000007 0x0007 . 0xBD 0b_1011_1101
155
156There are 19 possible code length code lengths (this is not a typo), but the
157wire format order is shuffled (in the idiosyncratic order: 16, 17, 18, 0, 8,
158..., 15) so that later elements are more likely to be zero (and hence
159compressible). There are `nclen` (in this case, 14) explicit code length code
160lengths, 3 bits each.
161
162 clcode_lengths[16] = 0b000 = 0
163 clcode_lengths[17] = 0b100 = 4
164 clcode_lengths[18] = 0b101 = 5
165 clcode_lengths[ 0] = 0b011 = 3
166 clcode_lengths[ 8] = 0b011 = 3
167 clcode_lengths[ 7] = 0b011 = 3
168 clcode_lengths[ 9] = 0b011 = 3
169 clcode_lengths[ 6] = 0b011 = 3
170 clcode_lengths[10] = 0b000 = 0
171 clcode_lengths[ 5] = 0b011 = 3
172 clcode_lengths[11] = 0b000 = 0
173 clcode_lengths[ 4] = 0b011 = 3
174 clcode_lengths[12] = 0b000 = 0
175 clcode_lengths[ 3] = 0b101 = 5
176
177The remaining (19 - `nclen`) = 5 entries are implicitly zero:
178
179 clcode_lengths[13] = 0
180 clcode_lengths[ 2] = 0
181 clcode_lengths[14] = 0
182 clcode_lengths[ 1] = 0
183 clcode_lengths[15] = 0
184
185Undoing the shuffle gives:
186
187 clcode_lengths[ 0] = 3
188 clcode_lengths[ 1] = 0
189 clcode_lengths[ 2] = 0
190 clcode_lengths[ 3] = 5
191 clcode_lengths[ 4] = 3
192 clcode_lengths[ 5] = 3
193 clcode_lengths[ 6] = 3
194 clcode_lengths[ 7] = 3
195 clcode_lengths[ 8] = 3
196 clcode_lengths[ 9] = 3
197 clcode_lengths[10] = 0
198 clcode_lengths[11] = 0
199 clcode_lengths[12] = 0
200 clcode_lengths[13] = 0
201 clcode_lengths[14] = 0
202 clcode_lengths[15] = 0
203 clcode_lengths[16] = 0
204 clcode_lengths[17] = 4
205 clcode_lengths[18] = 5
206
207For `clcode`s, there are 7 + 1 + 2 = 10 non-zero entries: 7 3-bit codes, 1
2084-bit code and 2 5-bit codes. The canonical Huffman encoding's map from bits to
209`clcode`s is:
210
211 bits clcode
212 0b000 0
213 0b001 4
214 0b010 5
215 0b011 6
216 0b100 7
217 0b101 8
218 0b110 9
219 0b1110 17
220 0b11110 3
221 0b11111 18
222
223Call that Huffman table `H-CL`.
224
225
226## Lcodes and Dcodes
227
228The bit stream is now at:
229
230 offset xoffset ASCII hex binary
231 000007 0x0007 . 0xBD 0b_1011_1...
232 000008 0x0008 . 0xF3 0b_1111_0011
233 000009 0x0009 + 0x2B 0b_0010_1011
234 000010 0x000A . 0xD8 0b_1101_1000
235 000011 0x000B S 0x53 0b_0101_0011
236 000012 0x000C . 0x2E 0b_0010_1110
237 000013 0x000D F 0x46 0b_0100_0110
238
239When decoding via a Huffman table, bits are read in LSB to MSB order,
240right-to-left in this debugging output. Extra bits, if any, are then read in
241the other, natural order (MSB to LSB). Decoding via `H-CL` gives:
242
243 bits clcode
244 0b1110 17, 3 extra bits = 0b111 = 7
245 0b001 4
246 0b11111 18, 7 extra bits = 0b0001010 = 10
247 0b001 4
248 0b101 8
249 0b1110 17, 3 extra bits = 0b010 = 2
250 0b100 7
251 0b1110 17, 3 extra bits = 0b001 = 1
252 0b011 6
253 0b000 0
254
255Still in the RFC section 3.2.7., this means:
256
257 lcode_lengths has 3 + 7 = 10 consecutive zeroes
258 lcode_lengths[ 10] = 4
259 lcode_lengths has 11 + 10 = 21 consecutive zeroes
260 lcode_lengths[ 32] = 4
261 lcode_lengths[ 33] = 8
262 lcode_lengths has 3 + 2 = 5 consecutive zeroes
263 lcode_lengths[ 39] = 7
264 lcode_lengths has 3 + 1 = 4 consecutive zeroes
265 lcode_lengths[ 44] = 6
266 lcode_lengths[ 45] = 0
267
268After decoding the first 10 code lengths, the bit stream is now at:
269
270 offset xoffset ASCII hex binary
271 000013 0x000D F 0x46 0b_01.._....
272
273Continuing to read all `nlit` (266) `lcode` lengths and `ndist` (20) `dcode`
274lengths (zeroes are not shown here):
275
276 lcode_lengths[ 10] = 4
277 lcode_lengths[ 32] = 4
278 lcode_lengths[ 33] = 8
279 lcode_lengths[ 39] = 7
280 lcode_lengths[ 44] = 6
281 lcode_lengths[ 46] = 7
282 lcode_lengths[ 50] = 8
283 lcode_lengths[ 58] = 9
284 lcode_lengths[ 59] = 7
285 lcode_lengths[ 63] = 7
286 etc
287 lcode_lengths[264] = 9
288 lcode_lengths[265] = 9
289
290and
291
292 dcode_lengths[ 5] = 6
293 dcode_lengths[ 6] = 5
294 dcode_lengths[ 8] = 4
295 dcode_lengths[ 9] = 5
296 dcode_lengths[10] = 3
297 etc
298 dcode_lengths[18] = 3
299 dcode_lengths[19] = 6
300
301The `H-CL` table is no longer used from here onwards.
302
303For `lcode`s, there are 6 + 9 + 10 + 16 + 10 + 12 = 63 non-zero entries: 6
3044-bit codes, 9 5-bit codes, 10 6-bit codes, 16 7-bit codes, 10 8-bit codes and
30512 9-bit codes. The canonical Huffman encoding's map from bits to `lcode`
306values is:
307
308 bits lcode
309 0b0000 10
310 0b0001 32
311 0b0010 101
312 0b0011 116
313 0b0100 257
314 0b0101 259
315 0b01100 97
316 0b01101 104
317 0b01110 105
318 0b01111 110
319 0b10000 111
320 0b10001 114
321 0b10010 115
322 0b10011 258
323 0b10100 260
324 0b101010 44
325 0b101011 99
326 0b101100 100
327 0b101101 102
328 0b101110 108
329 0b101111 109
330 0b110000 112
331 0b110001 117
332 0b110010 119
333 0b110011 121
334 0b1101000 39
335 0b1101001 46
336 0b1101010 59
337 0b1101011 63
338 0b1101100 65
339 0b1101101 69
340 0b1101110 73
341 0b1101111 79
342 0b1110000 82
343 0b1110001 83
344 0b1110010 84
345 0b1110011 98
346 0b1110100 103
347 0b1110101 261
348 0b1110110 262
349 0b1110111 263
350 0b11110000 33
351 0b11110001 50
352 0b11110010 66
353 0b11110011 67
354 0b11110100 72
355 0b11110101 74
356 0b11110110 77
357 0b11110111 87
358 0b11111000 107
359 0b11111001 118
360 0b111110100 58
361 0b111110101 68
362 0b111110110 76
363 0b111110111 78
364 0b111111000 85
365 0b111111001 91
366 0b111111010 93
367 0b111111011 120
368 0b111111100 122
369 0b111111101 256
370 0b111111110 264
371 0b111111111 265
372
373Call that Huffman table `H-L`.
374
375For `dcode`s, there are 5 + 4 + 3 + 2 = 14 non-zero entries: 5 3-bit codes, 4
3764-bit codes, 3 5-bit codes and 2 6-bit codes. The canonical Huffman encoding's
377map from bits to `dcode` values is:
378
379 bits dcode
380 0b000 10
381 0b001 14
382 0b010 16
383 0b011 17
384 0b100 18
385 0b1010 8
386 0b1011 12
387 0b1100 13
388 0b1101 15
389 0b11100 6
390 0b11101 9
391 0b11110 11
392 0b111110 5
393 0b111111 19
394
395Call that Huffman table `H-D`.
396
397
398## Decoding Literals
399
400The bit stream is now at:
401
402 offset xoffset ASCII hex binary
403 000052 0x0034 . 0xE7 0b_1..._....
404 000053 0x0035 C 0x43 0b_0100_0011
405 000054 0x0036 . 0xE8 0b_1110_1000
406 000055 0x0037 ) 0x29 0b_0010_1001
407 000056 0x0038 . 0xA0 0b_1010_0000
408 000057 0x0039 . 0xF1 0b_1111_0001
409
410Decoding from that bit stream via `H-L` gives some literal bytes (where the
411`lcode` value is < 256):
412
413 bits lcode
414 0b1110000 82 (ASCII 'R')
415 0b10000 111 (ASCII 'o')
416 0b101111 109 (ASCII 'm')
417 0b0010 101 (ASCII 'e')
418 0b10000 111 (ASCII 'o')
419 0b0001 32 (ASCII ' ')
420 0b01100 97 (ASCII 'a')
421 0b01111 110 (ASCII 'n')
422
423
424## Decoding Back-References
425
426This continues, up until we get to the "O Romeo, Romeo! wherefore art thou
427Romeo?" line.
428
429The bit stream is now at:
430
431 offset xoffset ASCII hex binary
432 000089 0x0059 . 0xC1 0b_11.._....
433 000090 0x005A . 0x1E 0b_0001_1110
434 000091 0x005B . 0x0F 0b_0000_1111
435 000092 0x005C . 0xF9 0b_1111_1001
436
437Decoding from that bit stream via `H-L` gives:
438
439 bits lcode
440 0b1101111 79 (ASCII 'O')
441 0b0001 32 (ASCII ' ')
442 0b1110000 82 (ASCII 'R')
443 0b10011 258
444
445This 258 is our first non-literal `lcode`, the start of a length-distance
446back-reference. We first need to decode the length. The RFC section 3.2.5.
447Compressed blocks (length and distance codes) says that an `lcode` of 258 means
448that length=4 with no extra bits.
449
450The bit stream is now at:
451
452 offset xoffset ASCII hex binary
453 000092 0x005C . 0xF9 0b_111._....
454 000093 0x005D Y 0x59 0b_0101_1001
455
456We next decode via `H-D` instead of `H-L`.
457
458 bits dcode
459 0b11110 11
460
461Again, from section 3.2.5., a `dcode` of 11 means a distance in the range [49,
46264], 49 plus 4 extra bits. The next 4 bits are 0b0110=6, so the distance is 55.
463We therefore copy length=4 bytes from distance=55 bytes ago: "omeo".
464
465The bit stream is now at:
466
467 offset xoffset ASCII hex binary
468 000093 0x005D Y 0x59 0b_01.._....
469 000094 0x005E U 0x55 0b_0101_0101
470 000095 0x005F > 0x3E 0b_0011_1110
471
472We go back to decoding via `H-L`, which gives
473
474 bits lcode
475 0b101010 44 (ASCII ',')
476 0b10100 260
477
478This is another non-literal. Section 3.2.5. says that an `lcode` of 260 means
479length=6 with no extra bits.
480
481The bit stream is now at:
482
483 offset xoffset ASCII hex binary
484 000095 0x005F > 0x3E 0b_0011_111.
485
486Decode with `H-D`.
487
488 bits dcode
489 0b111110 5
490
491Again, from section 3.2.5., a `dcode` of 5 means a distance in the range [7,
4928], 7 plus 1 extra bit. The next 1 bits are 0b0=0, so the distance is 7. We
493therefore copy length=6 bytes from distance=7 bytes ago: " Romeo".
494
495The bit stream is now at:
496
497 offset xoffset ASCII hex binary
498 000096 0x0060 . 0x0F 0b_0000_1111
499
500We go back to decoding via `H-L`, which gives
501
502 bits lcode
503 0b11110000 33 (ASCII '!')
504
505And on we go.
506
507
508## End of Block
509
510The block finishes with the bit stream at:
511
512 offset xoffset ASCII hex binary
513 000522 0x020A . 0xEB 0b_1110_101.
514 000523 0x020B 0 0x30 0b_0011_0000
515 000524 0x020C z 0x7A 0b_0111_1010
516 000525 0x020D Y 0x59 0b_0101_1001
517 000526 0x020E a 0x61 0b_0110_0001
518 000527 0x020F . 0x0D 0b_0000_1101
519 000528 0x0210 . 0x7F 0b_0111_1111
520 000529 0x0211 . 0x01 0b_0000_0001
521
522The decoding is:
523
524 bits lcode
525 0b101011 99 (ASCII 'c')
526 0b10000 111 (ASCII 'o')
527 0b110001 117 (ASCII 'u')
528 0b01111 110 (ASCII 'n')
529 0b0100 257 (length=3 + 0_extra_bits...)
530
531 bits dcode
532 0b1101 15 (distance=193 + 6_extra_bits, 0b000010 in MSB to LSB order;
533 length=3, distance=195: copy "sel")
534
535 bits lcode
536 0b1101011 63 (ASCII '?')
537 0b0000 10 (ASCII '\n')
538 0b111111101 256 (end of block)
539
540In other words, the block ends with "counsel?\n".
541