• Home
Name Date Size #Lines LOC

..--

images/03-May-2024-

puffin/src/03-May-2024-

scripts/03-May-2024-249171

src/03-May-2024-7,6665,430

.clang-formatD03-May-202472 42

Android.bpD03-May-20242.9 KiB142135

BUILD.gnD03-May-20243.5 KiB190178

LICENSED03-May-20241.5 KiB2827

METADATAD03-May-202439 43

MakefileD03-May-20241.3 KiB6445

OWNERSD03-May-2024113 97

PRESUBMIT.cfgD03-May-2024128 64

PREUPLOAD.cfgD03-May-202436 32

README.mdD03-May-202410.7 KiB232185

README.versionD03-May-202484 32

TEST_MAPPINGD03-May-2024142 1312

libpuffdiff.pcD03-May-2024291 108

libpuffpatch.pcD03-May-2024281 108

README.md

1# Puffin: A deterministic deflate re-compressor (for patching purposes)
2
3## Table of Contents
4
5[TOC]
6
7## Puffin
8
9Puffin is a deterministic deflate recompressor. It is mainly used for patching
10deflate compressed images (zip, gzip, etc.) because patches created from deflate
11files/streams are usually large (deflate has a bit-aligned format, hence,
12changing one byte in the raw data can cause the entire deflate stream to change
13drastically.)
14
15Puffin has two tools (operations): `puffdiff` and `puffpatch` (shown
16[here](puffin).)  The purpose of `puffdiff` operation is to create a patch
17between a source and a target file with one or both of them having some deflate
18streams. This patch is used in the `puffpatch` operation to generate the target
19file from the source file deterministically. The patch itself is created by
20`bsdiff` library (but can be replaced by any other diffing mechanism). But,
21before it uses `bsdiff` to create the patch, we have to transform both the
22source and target files into a specific format so `bsdiff` can produce smaller
23patches. We define `puff` operation to perform such a transformation. The
24resulting stream is called a `puff` stream. The reverse transformation (from
25`puff` stream to deflate stream) is called a `huff` operation. `huff` is used in
26the client to transform the `puff` stream back into its original deflate stream
27deterministically.
28
29![puffin](images/puffin.png "Puffin architecture")
30
31For information about deflate format see [RFC 1951].
32
33## puff and huff
34
35`puff` is an operation that decompresses only the Huffman part of the deflate
36stream and keeps the structure of the LZ77 coding unchanged. This is roughly
37equivalent of decompressing ‘half way’.
38
39`huff` is the exact opposite of `puff` and it deterministically converts the
40`puff` stream back to its original deflate stream. This deterministic conversion
41is based on two facts:
42
43*   There is no need to perform LZ77 algorithm. This means the deflate stream
44    could be built by any LZ77 algorithm.
45*   The dynamic Huffman tables can be recreated uniquely from the code length
46    array stored inside the `puff` stream. This means the deflate stream can be
47    reconstructed deterministically using the Huffman tables.
48
49The inclusion of Huffman tables in the `puff` stream has minuscule space burden
50(on average maximum 320 bytes for each block. There is roughly one block per
5132KB of uncompressed data).
52
53`bsdiff` of two `puffed` streams has much smaller patch in comparison to their
54deflate streams, but it is larger than uncompressed streams.
55
56**The major benefits**
57
58*   Size of the patch is smaller than deflate stream’s patch.
59*   `puff` and `huff` are deterministic operations.
60*   `huff` is in order of 10X faster than full recompression. It is even faster
61    than Huffman algorithm because it already has the Huffman tables and does
62    not need to reconstruct it.
63*   Both algorithms have very low memory footprint.
64*   The underlying LZ77 algorithm can be anything (as long as it is deflate
65    compatible). This includes google’s [zopfli]
66
67**The drawbacks**
68
69*   The need to define a file format for the puffed stream and stay with this
70    format forever. If format needs to be changed in the future, then some
71    versioning mechanism should be there to handle it and backward compatibility
72    should be maintained.
73*   The need to define a payload format and stay with it forever. Similarly
74    needs to be versioned if required later change.
75*   Does not reduces the patch size as full recompression.
76
77
78## puffdiff and puffpatch
79
80A deflate compressed file (gzip, zip, etc.) contains multiple deflate streams at
81different locations. When this file is puffed, the resulting `puff` stream
82contains both puffs and the raw data that existed between the deflates in the
83compressed file. When performing `huff` operation, the location of puffs in the
84`puff` stream and deflates in the deflate stream should be passed upon. This is
85necessary as `huff` operation has to know exactly where the locations of both
86puffs and deflates are. (See the following [image](puffin-stream))
87
88![puffin-stream](images/puffin-stream.png "Puffin stream")
89
90Similarly `puffpatch` requires deflates and puffs locations in both the source
91and target streams. These location information is saved using Protobufs in the
92patch generated by `bsdiff`. One requirement for these two operations are that
93`puffpatch` has to be efficient in the client. Client devices have limited
94memory and CPU bandwidth and it is necessary that each of these operations are
95performed with the most efficiency available. In order to achieve this
96efficiency a new operation can be added to `bspatch`, that reads and writes into
97a `puff` streams using special interfaces for puffing and huffing on the fly.
98
99
100## Puffin Patch Format
101
102*   Magic (4 bytes) - The string literal "PUF1".
103*   Header Length (4 bytes) - The size of the header (length of the generated
104    Protobuf).
105*   Header - Lengths and locations of deflate and puffs streams in source and
106    target files in a Protobuf format. See [puffin.proto].
107*   Patch - This is a binary array directly generated by the `bsdiff` program.
108
109## Puffin Stream Format
110
111*   [Block Header](#block-header-format) (3+ bytes) - Defines the type of the
112    block.
113*   Data - A mix of literals list and copy length/distances.
114*   End of Block (2 bytes) - The end of block symbol. Similar to the symbols for
115    copy length/distance but without the distance bytes. The actual copy length
116    value is 259 (0x81FF).
117
118### Block Header Format
119
120*   Length (2 Bytes) - The size of the block header excluding the two Length
121    bytes itself - 1.
122*   Final Block (1 Bit) - Whether the block is the last one or not.
123    *   1 - Final block
124	*   0 - Middle block
125*   Type (2 Bits)
126    *   0 - Uncompressed. Immediately after the header, zero or one literals
127        list is present which defines the content of the uncompressed blocks.
128	*   1 - Fixed Huffman table. The list of literals and length/distances comes
129        immediately after the header. Fixed Huffman table is defined the deflate
130        specification and will not change.
131	*   2 - Dynamic Huffman table. The dynamic Huffman table comes at the end of
132        the block header.
133	*   3 - Undefined. This is an error.
134*   Skip Bits (5 Bits) - Used only for uncompressed blocks (For other types its
135    value is zero). In an uncompressed block, the [RFC 1951] skips any bits
136    after reading final block and type bits until the byte boundary in the
137    input. However, it does not define what the value of these bits should
138    be. Most implementations assume 0, but some implementations may use any
139    value. This segment contains those bits as a five-bits integer. When writing
140    the block header back to the deflate format, the actual number of bits which
141    where skipped will be calculated easily.
142*   Huffman Table - It only comes for dynamic Huffman table.
143
144#### Dynamic Huffman Table Format
145
146Depending on the scheme for storing the Huffman tables, the payload size can
147change. We discovered that the following scheme does not produce the smallest
148payload, but it is the most deterministic one. In a deflate stream, Huffman
149tables for literals/length and distances are themselves Huffman coded. In this
150format, we also `puff` the Huffman tables into the `puff` stream instead of
151completely decompressing them.
152
153There are three tables stored in this structure very similar to the one defined
154in [RFC 1951].  A Huffman table can be defined as an array of unsigned integer
155code length values. Three Puffed Huffman tables appear like the following
156scheme. The first table (codes) is the Huffman table used to decode the next two
157Huffman tables. The second Huffman table is used to decode literals and lengths,
158and the third Huffman table is used to decode distances.
159
160
161*   Literal/Length Count (1 byte) - Number of alphabets used for literal/length
162    Huffman codes - 257.
163*   Distance Count (1 byte) - Number of alphabets used for distance Huffman
164    codes - 1.
165*   Alphabet Count (1 byte) - Number of alphabets for coding two previous
166    Huffman tables - 4.
167*   Code Lengths ((Alphabet Count + 1) / 2 bytes) - A list of codes for reading
168    the next two Huffman tables. Each byte contains two codes. If the number of
169    codes is odd, the last four Bits will be zero.
170*   Literal/Length/Distance Code Lengths - List of code lengths for encoding
171    literals/lengths/distance followed The encoding is as follows:
172	*   [0..15] - Represent code [0..15]
173	*   [16..19] - Copy the last code length [3..6] times.
174	*   [20..27] - Repeat code length of 0 for [3..10] times.
175	*   [28..155] - Repeat code length of 0 for [11..138] times.
176
177### Literals List
178Literals lists are constructed by a “length” value followed by “length” bytes of
179literals. The Puffer should try to merge adjacent literals lists as much as
180possible into one literals list in the `puff` stream.  This Is a length value
181followed by length bytes of literals (Even if there is only one literal.)
182
183*   Tag (1 Bit) - The value is 0 indicating that this is a list of literals (not
184    a copy length/distance).
185*   Length - The number of literals that would follow in the list.
186    *   (7 Bits) Length = [1 .. 127], The value is: Length - 1
187	*   (7 Bits + 2 Bytes) Length = [128 .. 65663], The values are: 127,
188        Length - 127.
189	Conserves size by using only one byte if the number of upcoming
190    literals is smaller or equal to 127 (About 99% of literals length in a
191    normal deflate stream fit into this category.) We should never have zero
192    length literals. Otherwise it will use three bytes.
193*   Literals (Length Bytes) - A sequence of Length number of literals.
194
195### Copy Length/Distance
196
197This Is a Length value followed by a Distance value.
198
199*   Tag (1 Bit) - The value is 1 indicating that this is a copy length/distance
200    field.
201*   Length - Conserves size by using only one byte if the length value is
202    smaller or equal to 129. Otherwise two bytes are used.
203    *   (7 Bits) - Length = [3 .. 129], The value is: Length - 3
204	*   (7 Bites + 1 Byte) Length = [130 .. 258], The value is: 127, Length -
205        130:
206*   Distance (2 Bytes) - Distance = [1 .. 32768], The value is:
207    Distance - 1. The distance value as an unsigned integer.
208
209## Building Puffin
210Currently Puffin is being used in both Android and Chrome OS and is built
211differently in each of them. There is also a Makefile build, but it is not
212comprehensive.
213
214## Usage
215
216Puffin builds an executable `puffin` which can be used to perform diffing and
217patching algorithms using Puffin format. To get the list of options available
218run:
219
220```shell
221puffin --help
222```
223
224It can also be used as a library (currently used by update_engine) that provides
225different APIs.
226
227## References
228
229[RFC 1951]: https://www.ietf.org/rfc/rfc1951.txt
230[puffin.proto]: /src/puffin.proto
231[zopfli]: https://en.wikipedia.org/wiki/Zopfli
232

README.version

1Homepage: https://android.googlesource.com/platform/external/puffin/
2Version: 1.0.0
3