• Home
Name Date Size #Lines LOC

..--

README.mdD12-May-202416.5 KiB541434

common_consts.wuffsD12-May-20242.9 KiB5552

decode_deflate.wuffsD12-May-202426.2 KiB856798

decode_huffman_fast.wuffsD12-May-202414.1 KiB395362

decode_huffman_slow.wuffsD12-May-20248.9 KiB286265

README.md

1# Deflate
2
3Deflate is a general purpose compression format, using Huffman codes and
4Lempel-Ziv style length-distance back-references. It is specified in [RFC
51951](https://www.ietf.org/rfc/rfc1951.txt).
6
7Gzip ([RFC 1952](https://www.ietf.org/rfc/rfc1952.txt)) and zlib ([RFC
81950](https://www.ietf.org/rfc/rfc1950.txt)) are two thin wrappers (with
9similar purpose, but different wire formats) around the raw deflate encoding.
10Gzip (the file format) is used by the `gzip` and `tar` command line tools and
11by the HTTP network protocol. Zlib is used by the ELF executable and PNG image
12file formats.
13
14[Zip](https://support.pkware.com/display/PKZIP/APPNOTE) (also known as PKZIP)
15is another wrapper, similar to combining gzip with tar, that can compress
16multiple files into a single archive. Zip is widely used by the ECMA Office
17Open XML format, the OASIS Open Document Format for Office Applications and the
18Java JAR format.
19
20Wrangling those formats that build on deflate (gzip, zip and zlib) is not
21provided by this package. For zlib, look at the `std/zlib` package instead. The
22other formats are TODO.
23
24For example, look at `test/data/romeo.txt*`. First, the uncompressed text:
25
26    $ xxd test/data/romeo.txt
27    00000000: 526f 6d65 6f20 616e 6420 4a75 6c69 6574  Romeo and Juliet
28    00000010: 0a45 7863 6572 7074 2066 726f 6d20 4163  .Excerpt from Ac
29    etc
30    00000390: 740a 536f 2073 7475 6d62 6c65 7374 206f  t.So stumblest o
31    000003a0: 6e20 6d79 2063 6f75 6e73 656c 3f0a       n my counsel?.
32
33The raw deflate encoding:
34
35    $ xxd test/data/romeo.txt.deflate
36    00000000: 4d53 c16e db30 0cbd f32b d853 2e46 0e3d  MS.n.0...+.S.F.=
37    00000010: 2e87 20ed 0234 c5ba 0049 861e 861d 149b  .. ..4...I......
38    etc
39    00000200: 7d13 8ffd b9a3 24bb 68f4 eb30 7a59 610d  }.....$.h..0zYa.
40    00000210: 7f01                                     ..
41
42The gzip format wraps a variable length header (in this case, 20 bytes) and 8
43byte footer around the raw deflate data. The header contains the NUL-terminated
44C string name of the original, uncompressed file, "romeo.txt", amongst other
45data. The footer contains a 4 byte CRC32 checksum and a 4 byte length of the
46uncompressed file (0x3ae = 942 bytes).
47
48    $ xxd test/data/romeo.txt.gz
49    00000000: 1f8b 0808 26d8 5d59 0003 726f 6d65 6f2e  ....&.]Y..romeo.
50    00000010: 7478 7400 4d53 c16e db30 0cbd f32b d853  txt.MS.n.0...+.S
51    etc
52    00000210: de5d 2c67 7d13 8ffd b9a3 24bb 68f4 eb30  .],g}.....$.h..0
53    00000220: 7a59 610d 7f01 ef07 e5ab ae03 0000       zYa...........
54
55The zlib format wraps a 2 byte header and 4 byte footer around the raw deflate
56data. The footer contains a 4 byte Adler32 checksum. TODO: move this to
57std/zlib/README.md.
58
59    $ xxd test/data/romeo.txt.zlib
60    00000000: 789c 4d53 c16e db30 0cbd f32b d853 2e46  x.MS.n.0...+.S.F
61    00000010: 0e3d 2e87 20ed 0234 c5ba 0049 861e 861d  .=.. ..4...I....
62    etc
63    00000200: 2c67 7d13 8ffd b9a3 24bb 68f4 eb30 7a59  ,g}.....$.h..0zY
64    00000210: 610d 7f01 57bb 3ede                      a...W.>.
65
66
67# Wire Format Worked Example
68
69Consider `test/data/romeo.txt.deflate`. The relevant spec is RFC 1951.
70
71    offset  xoffset ASCII   hex     binary
72    000000  0x0000  M       0x4D    0b_0100_1101
73    000001  0x0001  S       0x53    0b_0101_0011
74    000002  0x0002  .       0xC1    0b_1100_0001
75    000003  0x0003  n       0x6E    0b_0110_1110
76    000004  0x0004  .       0xDB    0b_1101_1011
77    000005  0x0005  0       0x30    0b_0011_0000
78    000006  0x0006  .       0x0C    0b_0000_1100
79    000007  0x0007  .       0xBD    0b_1011_1101
80    000008  0x0008  .       0xF3    0b_1111_0011
81    000009  0x0009  +       0x2B    0b_0010_1011
82    000010  0x000A  .       0xD8    0b_1101_1000
83    000011  0x000B  S       0x53    0b_0101_0011
84    000012  0x000C  .       0x2E    0b_0010_1110
85    000013  0x000D  F       0x46    0b_0100_0110
86    000014  0x000E  .       0x0E    0b_0000_1110
87    000015  0x000F  =       0x3D    0b_0011_1101
88    000016  0x0010  .       0x2E    0b_0010_1110
89    000017  0x0011  .       0x87    0b_1000_0111
90    000018  0x0012          0x20    0b_0010_0000
91    000019  0x0013  .       0xED    0b_1110_1101
92    etc
93    000522  0x020A  .       0xEB    0b_1110_1011
94    000523  0x020B  0       0x30    0b_0011_0000
95    000524  0x020C  z       0x7A    0b_0111_1010
96    000525  0x020D  Y       0x59    0b_0101_1001
97    000526  0x020E  a       0x61    0b_0110_0001
98    000527  0x020F  .       0x0D    0b_0000_1101
99    000528  0x0210  .       0x7F    0b_0111_1111
100    000529  0x0211  .       0x01    0b_0000_0001
101
102Starting at the top:
103
104    offset  xoffset ASCII   hex     binary
105    000000  0x0000  M       0x4D    0b_0100_1101
106
107As per the RFC section 3.2.3. Details of block format:
108
109  - BFINAL is the first (LSB) bit, 0b1. This block is the final block.
110  - BTYPE is the next two bits in natural (MSB to LSB) order, 0b10, which means
111    a block that's compressed with dynamic Huffman codes.
112
113There are 3 block types: uncompressed, compressed with fixed Huffman codes and
114compressed with dynamic Huffman codes. The first type is straightforward:
115containing a uint16 length `N` (and its complement, for error detection) and
116then `N` literal bytes. The second type is the same as the third type but with
117built-in `lcode` and `dcode` tables (see below). The third type is the
118interesting part of the deflate format, and its deconstruction continues below.
119
120
121## Dynamic Huffman Blocks
122
123The bit stream is now at:
124
125    offset  xoffset ASCII   hex     binary
126    000000  0x0000  M       0x4D    0b_0100_1...
127    000001  0x0001  S       0x53    0b_0101_0011
128    000002  0x0002  .       0xC1    0b_1100_0001
129
130As per the RFC section 3.2.7. Compression with dynamic Huffman codes, three
131variables (5, 5 and 4 bits) are read in the same natural (MSB to LSB) order:
132
133  - HLIT  = 0b01001 =  9
134  - HDIST = 0b10011 = 19
135  - HCLEN =  0b1010 = 10
136
137Adding the implicit biases give:
138
139    nlit  =  9 + 257 = 266
140    ndist = 19 +   1 =  20
141    nclen = 10 +   4 =  14
142
143
144## Code Length Code Lengths
145
146The bit stream is now at:
147
148    offset  xoffset ASCII   hex     binary
149    000002  0x0002  .       0xC1    0b_1100_000.
150    000003  0x0003  n       0x6E    0b_0110_1110
151    000004  0x0004  .       0xDB    0b_1101_1011
152    000005  0x0005  0       0x30    0b_0011_0000
153    000006  0x0006  .       0x0C    0b_0000_1100
154    000007  0x0007  .       0xBD    0b_1011_1101
155
156There are 19 possible code length code lengths (this is not a typo), but the
157wire format order is shuffled (in the idiosyncratic order: 16, 17, 18, 0, 8,
158..., 15) so that later elements are more likely to be zero (and hence
159compressible). There are `nclen` (in this case, 14) explicit code length code
160lengths, 3 bits each.
161
162    clcode_lengths[16] = 0b000 = 0
163    clcode_lengths[17] = 0b100 = 4
164    clcode_lengths[18] = 0b101 = 5
165    clcode_lengths[ 0] = 0b011 = 3
166    clcode_lengths[ 8] = 0b011 = 3
167    clcode_lengths[ 7] = 0b011 = 3
168    clcode_lengths[ 9] = 0b011 = 3
169    clcode_lengths[ 6] = 0b011 = 3
170    clcode_lengths[10] = 0b000 = 0
171    clcode_lengths[ 5] = 0b011 = 3
172    clcode_lengths[11] = 0b000 = 0
173    clcode_lengths[ 4] = 0b011 = 3
174    clcode_lengths[12] = 0b000 = 0
175    clcode_lengths[ 3] = 0b101 = 5
176
177The remaining (19 - `nclen`) = 5 entries are implicitly zero:
178
179    clcode_lengths[13] = 0
180    clcode_lengths[ 2] = 0
181    clcode_lengths[14] = 0
182    clcode_lengths[ 1] = 0
183    clcode_lengths[15] = 0
184
185Undoing the shuffle gives:
186
187    clcode_lengths[ 0] = 3
188    clcode_lengths[ 1] = 0
189    clcode_lengths[ 2] = 0
190    clcode_lengths[ 3] = 5
191    clcode_lengths[ 4] = 3
192    clcode_lengths[ 5] = 3
193    clcode_lengths[ 6] = 3
194    clcode_lengths[ 7] = 3
195    clcode_lengths[ 8] = 3
196    clcode_lengths[ 9] = 3
197    clcode_lengths[10] = 0
198    clcode_lengths[11] = 0
199    clcode_lengths[12] = 0
200    clcode_lengths[13] = 0
201    clcode_lengths[14] = 0
202    clcode_lengths[15] = 0
203    clcode_lengths[16] = 0
204    clcode_lengths[17] = 4
205    clcode_lengths[18] = 5
206
207For `clcode`s, there are 7 + 1 + 2 = 10 non-zero entries: 7 3-bit codes, 1
2084-bit code and 2 5-bit codes. The canonical Huffman encoding's map from bits to
209`clcode`s is:
210
211    bits        clcode
212    0b000        0
213    0b001        4
214    0b010        5
215    0b011        6
216    0b100        7
217    0b101        8
218    0b110        9
219    0b1110      17
220    0b11110      3
221    0b11111     18
222
223Call that Huffman table `H-CL`.
224
225
226## Lcodes and Dcodes
227
228The bit stream is now at:
229
230    offset  xoffset ASCII   hex     binary
231    000007  0x0007  .       0xBD    0b_1011_1...
232    000008  0x0008  .       0xF3    0b_1111_0011
233    000009  0x0009  +       0x2B    0b_0010_1011
234    000010  0x000A  .       0xD8    0b_1101_1000
235    000011  0x000B  S       0x53    0b_0101_0011
236    000012  0x000C  .       0x2E    0b_0010_1110
237    000013  0x000D  F       0x46    0b_0100_0110
238
239When decoding via a Huffman table, bits are read in LSB to MSB order,
240right-to-left in this debugging output. Extra bits, if any, are then read in
241the other, natural order (MSB to LSB). Decoding via `H-CL` gives:
242
243    bits        clcode
244    0b1110      17, 3 extra bits = 0b111 = 7
245    0b001        4
246    0b11111     18, 7 extra bits = 0b0001010 = 10
247    0b001        4
248    0b101        8
249    0b1110      17, 3 extra bits = 0b010 = 2
250    0b100        7
251    0b1110      17, 3 extra bits = 0b001 = 1
252    0b011        6
253    0b000        0
254
255Still in the RFC section 3.2.7., this means:
256
257    lcode_lengths has 3 + 7 = 10 consecutive zeroes
258    lcode_lengths[ 10] = 4
259    lcode_lengths has 11 + 10 = 21 consecutive zeroes
260    lcode_lengths[ 32] = 4
261    lcode_lengths[ 33] = 8
262    lcode_lengths has 3 + 2 = 5 consecutive zeroes
263    lcode_lengths[ 39] = 7
264    lcode_lengths has 3 + 1 = 4 consecutive zeroes
265    lcode_lengths[ 44] = 6
266    lcode_lengths[ 45] = 0
267
268After decoding the first 10 code lengths, the bit stream is now at:
269
270    offset  xoffset ASCII   hex     binary
271    000013  0x000D  F       0x46    0b_01.._....
272
273Continuing to read all `nlit` (266) `lcode` lengths and `ndist` (20) `dcode`
274lengths (zeroes are not shown here):
275
276    lcode_lengths[ 10] = 4
277    lcode_lengths[ 32] = 4
278    lcode_lengths[ 33] = 8
279    lcode_lengths[ 39] = 7
280    lcode_lengths[ 44] = 6
281    lcode_lengths[ 46] = 7
282    lcode_lengths[ 50] = 8
283    lcode_lengths[ 58] = 9
284    lcode_lengths[ 59] = 7
285    lcode_lengths[ 63] = 7
286    etc
287    lcode_lengths[264] = 9
288    lcode_lengths[265] = 9
289
290and
291
292    dcode_lengths[ 5] = 6
293    dcode_lengths[ 6] = 5
294    dcode_lengths[ 8] = 4
295    dcode_lengths[ 9] = 5
296    dcode_lengths[10] = 3
297    etc
298    dcode_lengths[18] = 3
299    dcode_lengths[19] = 6
300
301The `H-CL` table is no longer used from here onwards.
302
303For `lcode`s, there are 6 + 9 + 10 + 16 + 10 + 12 = 63 non-zero entries: 6
3044-bit codes, 9 5-bit codes, 10 6-bit codes, 16 7-bit codes, 10 8-bit codes and
30512 9-bit codes. The canonical Huffman encoding's map from bits to `lcode`
306values is:
307
308    bits        lcode
309    0b0000       10
310    0b0001       32
311    0b0010      101
312    0b0011      116
313    0b0100      257
314    0b0101      259
315    0b01100      97
316    0b01101     104
317    0b01110     105
318    0b01111     110
319    0b10000     111
320    0b10001     114
321    0b10010     115
322    0b10011     258
323    0b10100     260
324    0b101010     44
325    0b101011     99
326    0b101100    100
327    0b101101    102
328    0b101110    108
329    0b101111    109
330    0b110000    112
331    0b110001    117
332    0b110010    119
333    0b110011    121
334    0b1101000    39
335    0b1101001    46
336    0b1101010    59
337    0b1101011    63
338    0b1101100    65
339    0b1101101    69
340    0b1101110    73
341    0b1101111    79
342    0b1110000    82
343    0b1110001    83
344    0b1110010    84
345    0b1110011    98
346    0b1110100   103
347    0b1110101   261
348    0b1110110   262
349    0b1110111   263
350    0b11110000   33
351    0b11110001   50
352    0b11110010   66
353    0b11110011   67
354    0b11110100   72
355    0b11110101   74
356    0b11110110   77
357    0b11110111   87
358    0b11111000  107
359    0b11111001  118
360    0b111110100  58
361    0b111110101  68
362    0b111110110  76
363    0b111110111  78
364    0b111111000  85
365    0b111111001  91
366    0b111111010  93
367    0b111111011 120
368    0b111111100 122
369    0b111111101 256
370    0b111111110 264
371    0b111111111 265
372
373Call that Huffman table `H-L`.
374
375For `dcode`s, there are 5 + 4 + 3 + 2 = 14 non-zero entries: 5 3-bit codes, 4
3764-bit codes, 3 5-bit codes and 2 6-bit codes. The canonical Huffman encoding's
377map from bits to `dcode` values is:
378
379    bits        dcode
380    0b000       10
381    0b001       14
382    0b010       16
383    0b011       17
384    0b100       18
385    0b1010       8
386    0b1011      12
387    0b1100      13
388    0b1101      15
389    0b11100      6
390    0b11101      9
391    0b11110     11
392    0b111110     5
393    0b111111    19
394
395Call that Huffman table `H-D`.
396
397
398## Decoding Literals
399
400The bit stream is now at:
401
402    offset  xoffset ASCII   hex     binary
403    000052  0x0034  .       0xE7    0b_1..._....
404    000053  0x0035  C       0x43    0b_0100_0011
405    000054  0x0036  .       0xE8    0b_1110_1000
406    000055  0x0037  )       0x29    0b_0010_1001
407    000056  0x0038  .       0xA0    0b_1010_0000
408    000057  0x0039  .       0xF1    0b_1111_0001
409
410Decoding from that bit stream via `H-L` gives some literal bytes (where the
411`lcode` value is < 256):
412
413    bits        lcode
414    0b1110000    82 (ASCII 'R')
415    0b10000     111 (ASCII 'o')
416    0b101111    109 (ASCII 'm')
417    0b0010      101 (ASCII 'e')
418    0b10000     111 (ASCII 'o')
419    0b0001       32 (ASCII ' ')
420    0b01100      97 (ASCII 'a')
421    0b01111     110 (ASCII 'n')
422
423
424## Decoding Back-References
425
426This continues, up until we get to the "O Romeo, Romeo! wherefore art thou
427Romeo?" line.
428
429The bit stream is now at:
430
431    offset  xoffset ASCII   hex     binary
432    000089  0x0059  .       0xC1    0b_11.._....
433    000090  0x005A  .       0x1E    0b_0001_1110
434    000091  0x005B  .       0x0F    0b_0000_1111
435    000092  0x005C  .       0xF9    0b_1111_1001
436
437Decoding from that bit stream via `H-L` gives:
438
439    bits        lcode
440    0b1101111    79 (ASCII 'O')
441    0b0001       32 (ASCII ' ')
442    0b1110000    82 (ASCII 'R')
443    0b10011     258
444
445This 258 is our first non-literal `lcode`, the start of a length-distance
446back-reference. We first need to decode the length. The RFC section 3.2.5.
447Compressed blocks (length and distance codes) says that an `lcode` of 258 means
448that length=4 with no extra bits.
449
450The bit stream is now at:
451
452    offset  xoffset ASCII   hex     binary
453    000092  0x005C  .       0xF9    0b_111._....
454    000093  0x005D  Y       0x59    0b_0101_1001
455
456We next decode via `H-D` instead of `H-L`.
457
458    bits        dcode
459    0b11110     11
460
461Again, from section 3.2.5., a `dcode` of 11 means a distance in the range [49,
46264], 49 plus 4 extra bits. The next 4 bits are 0b0110=6, so the distance is 55.
463We therefore copy length=4 bytes from distance=55 bytes ago: "omeo".
464
465The bit stream is now at:
466
467    offset  xoffset ASCII   hex     binary
468    000093  0x005D  Y       0x59    0b_01.._....
469    000094  0x005E  U       0x55    0b_0101_0101
470    000095  0x005F  >       0x3E    0b_0011_1110
471
472We go back to decoding via `H-L`, which gives
473
474    bits        lcode
475    0b101010     44 (ASCII ',')
476    0b10100     260
477
478This is another non-literal. Section 3.2.5. says that an `lcode` of 260 means
479length=6 with no extra bits.
480
481The bit stream is now at:
482
483    offset  xoffset ASCII   hex     binary
484    000095  0x005F  >       0x3E    0b_0011_111.
485
486Decode with `H-D`.
487
488    bits        dcode
489    0b111110     5
490
491Again, from section 3.2.5., a `dcode` of 5 means a distance in the range [7,
4928], 7 plus 1 extra bit. The next 1 bits are 0b0=0, so the distance is 7. We
493therefore copy length=6 bytes from distance=7 bytes ago: " Romeo".
494
495The bit stream is now at:
496
497    offset  xoffset ASCII   hex     binary
498    000096  0x0060  .       0x0F    0b_0000_1111
499
500We go back to decoding via `H-L`, which gives
501
502    bits        lcode
503    0b11110000   33 (ASCII '!')
504
505And on we go.
506
507
508## End of Block
509
510The block finishes with the bit stream at:
511
512    offset  xoffset ASCII   hex     binary
513    000522  0x020A  .       0xEB    0b_1110_101.
514    000523  0x020B  0       0x30    0b_0011_0000
515    000524  0x020C  z       0x7A    0b_0111_1010
516    000525  0x020D  Y       0x59    0b_0101_1001
517    000526  0x020E  a       0x61    0b_0110_0001
518    000527  0x020F  .       0x0D    0b_0000_1101
519    000528  0x0210  .       0x7F    0b_0111_1111
520    000529  0x0211  .       0x01    0b_0000_0001
521
522The decoding is:
523
524    bits        lcode
525    0b101011     99 (ASCII 'c')
526    0b10000     111 (ASCII 'o')
527    0b110001    117 (ASCII 'u')
528    0b01111     110 (ASCII 'n')
529    0b0100      257 (length=3 + 0_extra_bits...)
530
531    bits        dcode
532    0b1101      15  (distance=193 + 6_extra_bits, 0b000010 in MSB to LSB order;
533                     length=3, distance=195: copy "sel")
534
535    bits        lcode
536    0b1101011    63 (ASCII '?')
537    0b0000       10 (ASCII '\n')
538    0b111111101 256 (end of block)
539
540In other words, the block ends with "counsel?\n".
541