# Wrangling Untrusted File Formats Safely

([Formerly known as
Puffs](https://groups.google.com/d/topic/puffslang/ZX-ymyf8xh0/discussion):
Parsing Untrusted File Formats Safely).

Wuffs is a domain-specific language and library for wrangling untrusted file
formats safely. Wrangling includes parsing, decoding and encoding. Examples of
such file formats include images, audio, video, fonts and compressed archives.

Unlike the C programming language, Wuffs is safe with respect to buffer
overflows, integer arithmetic overflows and null pointer dereferences. The key
difference between Wuffs and other memory-safe languages is that all such
checks are done at compile time, not at run time. *If it compiles, it is safe*,
with respect to those three bug classes.

The aim is to produce software libraries that are as safe as Go or Rust,
roughly speaking, but as fast as C, and that can be used anywhere C libraries
are used. This includes very large C/C++ products, such as popular web browsers
and operating systems (using that term to include desktop and mobile user
interfaces, not just the kernel).

The trade-off in aiming for both safety and speed is that Wuffs programs take
longer for a programmer to write, as they have to explicitly annotate their
programs with proofs of safety. A statement like `x += 1` unsurprisingly means
to increment the variable `x` by `1`. However, in Wuffs, such a statement is a
compile time error unless the compiler can also prove that `x` is not the
maximal value of `x`'s type (e.g. `x` is not `255` if `x` is a `u8`), as the
increment would otherwise overflow. Similarly, an integer arithmetic expression
like `x / y` is a compile time error unless the compiler can also prove that
`y` is not zero.
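
For contrast, here is roughly what the same two guarantees cost when they are
enforced by hand at run time in C. This is an illustrative sketch only: it is
plain C rather than Wuffs, and the helper names `checked_increment_u8` and
`checked_divide_u32` are hypothetical, not part of any Wuffs API.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical helper: increments *x only if that cannot overflow a uint8_t.
// Returns false, leaving *x untouched, when *x is already 255.
bool checked_increment_u8(uint8_t *x) {
    if (*x == UINT8_MAX) {
        return false;
    }
    *x = (uint8_t)(*x + 1);
    return true;
}

// Hypothetical helper: divides x by y only if y is not zero.
// Returns false, leaving *quotient untouched, when y is zero.
bool checked_divide_u32(uint32_t x, uint32_t y, uint32_t *quotient) {
    if (y == 0) {
        return false;
    }
    *quotient = x / y;
    return true;
}
```

Each call pays for a comparison and a branch, and the caller has to remember to
test the result. Wuffs instead rejects the program at compile time unless the
equivalent fact is already provable.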

Wuffs is not a general purpose programming language. While technically
possible, it is unlikely that a Wuffs compiler would be worth writing in Wuffs.


## What Does Wuffs Code Look Like?

The [`std/lzw/decode_lzw.wuffs`](./std/lzw/decode_lzw.wuffs) file is a good
example. See the "Poking Around" section below for more guidance.


## What Does Compile Time Checking Look Like?

For example, making this one-line edit to the GIF codec leads to a compile time
error. `wuffs gen` fails to generate the C code, i.e. fails to compile
(transpile) the Wuffs code to C code:

```diff
diff --git a/std/lzw/decode_lzw.wuffs b/std/lzw/decode_lzw.wuffs
index f878c5e..f10dcee 100644
--- a/std/lzw/decode_lzw.wuffs
+++ b/std/lzw/decode_lzw.wuffs
@@ -98,7 +98,7 @@ pub func lzw_decoder.decode?(dst ptr buf1, src ptr buf1, src_final bool)() {
                        in.dst.write?(x:s)

                        if use_save_code {
-                               this.suffixes[save_code] = c as u8
+                               this.suffixes[save_code] = (c + 1) as u8
                                this.prefixes[save_code] = prev_code as u16
                        }
```

```
$ wuffs gen std/gif
check: expression "(c + 1) as u8" bounds [1..256] is not within bounds [0..255] at
/home/n/go/src/github.com/google/wuffs/std/lzw/decode_lzw.wuffs:101. Facts:
    n_bits < 8
    c < 256
    this.stack[s] == (c as u8)
    use_save_code
```

In comparison, this two-line edit will compile (but the "does it decode GIF
correctly" tests then fail):

```diff
diff --git a/std/lzw/decode_lzw.wuffs b/std/lzw/decode_lzw.wuffs
index f878c5e..b43443d 100644
--- a/std/lzw/decode_lzw.wuffs
+++ b/std/lzw/decode_lzw.wuffs
@@ -97,8 +97,8 @@ pub func lzw_decoder.decode?(dst ptr buf1, src ptr buf1, src_final bool)() {
                        // type checking, bounds checking and code generation for it).
                        in.dst.write?(x:s)

-                       if use_save_code {
-                               this.suffixes[save_code] = c as u8
+                       if use_save_code and (c < 200) {
+                               this.suffixes[save_code] = (c + 1) as u8
                                this.prefixes[save_code] = prev_code as u16
                        }
```

```
$ wuffs gen std/gif
gen wrote:      /home/n/go/src/github.com/google/wuffs/gen/c/gif.c
gen unchanged:  /home/n/go/src/github.com/google/wuffs/gen/h/gif.h
$ wuffs test std/gif
gen unchanged:  /home/n/go/src/github.com/google/wuffs/gen/c/gif.c
gen unchanged:  /home/n/go/src/github.com/google/wuffs/gen/h/gif.h
test:           /home/n/go/src/github.com/google/wuffs/test/c/gif
gif/basic.c     clang   PASS (8 tests run)
gif/basic.c     gcc     PASS (8 tests run)
gif/gif.c       clang   FAIL test_lzw_decode: bufs1_equal: wi: got 19311, want 19200.
contents differ at byte 3 (in hex: 0x000003):
  000000: dcdc dc00 00d9 f5f9 f6df dc5f 393a 3a3a  ..........._9:::
  000010: 3a3b 618e c8e4 e4e4 e5e4 e600 00e4 bbbb  :;a.............
  000020: eded 8f91 9191 9090 9090 9190 9192 9192  ................
  000030: 9191 9292 9191 9293 93f0 f0f0 f1f1 f2f2  ................
excerpts of got (above) versus want (below):
  000000: dcdc dcdc dcd9 f5f9 f6df dc5f 393a 3a3a  ..........._9:::
  000010: 3a3a 618e c8e4 e4e4 e5e4 e6e4 e4e4 bbbb  ::a.............
  000020: eded 8f91 9191 9090 9090 9090 9191 9191  ................
  000030: 9191 9191 9191 9193 93f0 f0f0 f1f1 f2f2  ................

gif/gif.c       gcc     FAIL test_lzw_decode: bufs1_equal: wi: got 19311, want 19200.
contents differ at byte 3 (in hex: 0x000003):
  000000: dcdc dc00 00d9 f5f9 f6df dc5f 393a 3a3a  ..........._9:::
  000010: 3a3b 618e c8e4 e4e4 e5e4 e600 00e4 bbbb  :;a.............
  000020: eded 8f91 9191 9090 9090 9190 9192 9192  ................
  000030: 9191 9292 9191 9293 93f0 f0f0 f1f1 f2f2  ................
excerpts of got (above) versus want (below):
  000000: dcdc dcdc dcd9 f5f9 f6df dc5f 393a 3a3a  ..........._9:::
  000010: 3a3a 618e c8e4 e4e4 e5e4 e6e4 e4e4 bbbb  ::a.............
  000020: eded 8f91 9191 9090 9090 9090 9191 9191  ................
  000030: 9191 9191 9191 9193 93f0 f0f0 f1f1 f2f2  ................

wuffs-test-c: some tests failed
wuffs test: some tests failed
```

# Background

Decoding untrusted data, such as images downloaded from across the web, has a
long history of security vulnerabilities. As of 2017, libpng is over 18 years
old, and the [PNG specification is dated 2003](https://www.w3.org/TR/PNG/), but
that well-examined C library is still getting [CVEs published in
2017](https://www.cvedetails.com/vulnerability-list/vendor_id-7294/year-2017/Libpng.html).

Sandboxing and fuzzing can mitigate the danger, but they are reactions to C's
fundamental unsafety. Newer programming languages remove entire classes of
potential security bugs. Buffer overflows and null pointer dereferences are
amongst the most well known.

Less well known are integer overflow bugs. Offset-length pairs, defining a
sub-section of a file, are seen in many file formats, such as OpenType fonts
and PDF documents. A conscientious C programmer might think to check that a
section of a file or a buffer is within bounds by writing `if (offset + length
< end)` before processing that section, but that addition can silently
overflow, and a maliciously crafted file might bypass the check.
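
A minimal C sketch of that pitfall, assuming 32-bit unsigned offsets and
lengths. The function names are hypothetical and exist only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

// Naive check: offset + length is computed modulo 2^32, so a huge length
// can make the sum wrap around to a small value that passes the test.
bool range_ok_naive(uint32_t offset, uint32_t length, uint32_t end) {
    return (uint32_t)(offset + length) < end;
}

// Overflow-free check: the same mathematical condition, rearranged so that
// offset + length is never formed. The subtraction cannot underflow because
// it is only evaluated after offset < end has been established.
bool range_ok_safe(uint32_t offset, uint32_t length, uint32_t end) {
    return (offset < end) && (length < end - offset);
}
```

With `offset = 16`, `length = 0xFFFFFFF8` and `end = 100`, the naive check
passes because the sum wraps around to `8`, while the rearranged check rejects
the range.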

A variation on this theme is where `offset` is a pointer, exemplified by
[capnproto's
CVE-2017-7892](https://github.com/sandstorm-io/capnproto/blob/master/security-advisories/2017-04-17-0-apple-clang-elides-bounds-check.md)
and [another
example](https://www.blackhat.com/docs/us-14/materials/us-14-Rosenberg-Reflections-on-Trusting-TrustZone.pdf).
For a pointer-typed offset, witnessing such a vulnerability can depend on both
the malicious input itself and the addresses of the memory the software used to
process that input. Those addresses can vary from run to run and from system to
system, e.g. 32-bit versus 64-bit systems and whether dynamically allocated
memory can have sufficiently high address values, and that variability makes it
harder to reproduce and to catch such subtle bugs from fuzzing.

In C, signed integer overflow is *undefined behavior*, as per [section 3.4.3 of
the C11 draft (N1570)](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf). In
Go, integer overflow is [silently
ignored](https://golang.org/ref/spec#Integer_overflow). In Rust, integer
overflow is [checked at run time in debug mode and silently ignored in release
mode](http://huonw.github.io/blog/2016/04/myths-and-legends-about-integer-overflow-in-rust/)
by default, as the run time performance penalty was deemed too great. In Swift,
it's a [run time
error](https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/AdvancedOperators.html#//apple_ref/doc/uid/TP40014097-CH27-ID37).
In D, it's [configurable](http://dconf.org/2017/talks/alexandrescu.pdf). Other
languages like Python and Haskell can automatically spill into 'big integers'
larger than 64 bits, but this can have a performance impact when such integers
are used in inner loops.
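
The C case can be sketched as follows. Signed overflow is undefined, so it has
to be ruled out before the addition is performed, whereas unsigned arithmetic
is defined to wrap around. The helper names are hypothetical, and the code only
illustrates a hand-written run time check; it is not something Wuffs generates.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical helper: adds two signed 32-bit integers only when the exact
// result fits in an int32_t. The test is done on the operands, before the
// addition, because a signed addition that overflows is undefined behavior.
bool checked_add_i32(int32_t a, int32_t b, int32_t *sum) {
    if ((b > 0 && a > INT32_MAX - b) || (b < 0 && a < INT32_MIN - b)) {
        return false;
    }
    *sum = a + b;
    return true;
}

// Unsigned addition is well defined: the result is reduced modulo 2^32.
uint32_t wrapping_add_u32(uint32_t a, uint32_t b) {
    return (uint32_t)(a + b);
}
```

Both behaviors are legitimate choices, but each one either costs a run time
branch or silently produces a value the programmer may not have intended.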

Even if overflow is checked, it is usually checked at run time. Similarly,
modern languages do their bounds checking at run time. An expression like
`a[i]` is really `if ((0 <= i) && (i < a.length)) { use a[i] } else { throw }`,
in mangled pseudo-code. Compilers for these languages can often eliminate many
of these bounds checks, e.g. if `i` is an iterator index, but not always all of
them.
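
Spelled out in C, where nothing is checked unless the programmer writes the
check, that looks something like the sketch below. The `byte_slice` type and
both function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical slice type: a pointer plus the number of valid elements.
typedef struct {
    const uint8_t *ptr;
    size_t len;
} byte_slice;

// Bounds-checked read of an arbitrary index. The comparison runs on every
// call. The index is unsigned, so the (0 <= i) half holds trivially.
bool slice_get(byte_slice s, size_t i, uint8_t *out) {
    if (i >= s.len) {
        return false;
    }
    *out = s.ptr[i];
    return true;
}

// Iterator-index case: the loop condition already establishes i < s.len,
// so a separate per-element bounds check would be redundant here.
uint32_t slice_sum(byte_slice s) {
    uint32_t total = 0;
    for (size_t i = 0; i < s.len; i++) {
        total += s.ptr[i];
    }
    return total;
}
```

The second function is the easy case that compilers can usually optimize. The
first, where the index comes from the input data, is the one that tends to
keep its run time check.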

The run time cost is small, measured in nanoseconds. But if an image decoding
library has to eat this cost per pixel, and you have a megapixel image, then
nanoseconds become milliseconds, and milliseconds can matter.

In comparison, in Wuffs, all bounds checks and arithmetic overflow checks
happen at compile time, with zero run time overhead.


# Getting Started

Wuffs code (that is proved safe via explicit assertions) is compiled to C code
(with those assertions removed) - it is transpiled. If you are a C/C++
programmer and just want to *use* the C edition of the Wuffs standard library,
then clone the repository and look at the files in the `gen/c` and `gen/h`
directories. No other software tools are required and there are no library
dependencies, other than C standard library concepts like `<stdint.h>`'s
`uint32_t` type and `<string.h>`'s `memset` function.

If your C/C++ project is large, you might want both the .c files (adding each
to your build system) and the .h files. If your C/C++ project is small, you
might only need the .c files, not the .h files, as the .c files are designed to
be a [drop-in library](http://gpfault.net/posts/drop-in-libraries.txt.html).
For example, if you want a GIF decoder, you only need `gif.c`. See TODO for an
example. More complicated decoders might require multiple .c files - multiple
modules. For example, the PNG codec (TODO) requires the deflate codec, but they
are separate files, since HTTP can also use deflate compression (also known as
gzip or zlib, roughly speaking) without necessarily processing PNG images.


## Getting Deeper

If you want to modify the Wuffs standard library, or compile your own Wuffs
code, you will need to do a little more work, and will have to install at least
the Go toolchain in order to build the Wuffs tools. To run the test suite, you
might also have to install C compilers like clang and gcc, as well as C
libraries (and their .h files) like libjpeg and libpng, as some tests compare
that Wuffs produces exactly the same output as these other libraries.

Running `go get -v github.com/google/wuffs/cmd/...` will download and install
the Wuffs tools. Change `get` to `install` to re-install those programs without
downloading, e.g. after you've modified their source code, or after a manually
issued `git pull`. The Wuffs tools that you'll most often use are `wuffsfmt`
(analogous to `clang-format`, `gofmt` or `rustfmt`) and `wuffs` (roughly
analogous to `make`, `go` or `cargo`).

You should now be able to run `wuffs test`. If all goes well, you should see
some output containing the word "PASS" multiple times.


## Poking Around

Feel free to edit the `std/lzw/decode_lzw.wuffs` file, which implements the GIF
LZW decoder. After editing, run `wuffs gen std/gif` or `wuffs test std/gif` to
re-generate the C edition of the Wuffs standard library's GIF codec, and
optionally run its tests.

Try deleting an assert statement and re-running `wuffs gen`. The result should
be syntactically valid, but a compile error, as some bounds checks can no
longer be proven.

Find the line `var bits u32`, which declares the `bits` variable and initializes
it to zero. Try adding `bits -= 1` on a new line of code after it. Again,
`wuffs gen` should fail, as the computation can underflow.

Similarly, replacing the line `var n_bits u32` with `var n_bits u32 = 10`
should fail, as an `n_bits < 8` assertion, a pre-condition, a few lines further
down again cannot be proven.

Similarly, changing the `4095` in `var prev_code u32[..4095]` either higher or
lower should fail.

Try adding `assert false` at various places, which should obviously fail, but
should also cause `wuffs gen` to print what facts the compiler can prove at
that point. This can be useful when debugging why Wuffs can't prove something
you think it should be able to.


## Running the Tests

If you've changed any of the tools (i.e. changed any `.go` code), re-run `go
install -v github.com/google/wuffs/cmd/...` and `go test
github.com/google/wuffs/lang/...`.

If you've changed any of the libraries (i.e. changed any `.wuffs` code), run
`wuffs test` or, ideally, `wuffs test -mimic` to also check that Wuffs' output
mimics (i.e. exactly matches) other libraries' output, such as giflib for GIF,
libpng for PNG, etc.

If your library change is an optimization, run `wuffs bench` or `wuffs bench
-mimic` both before and after your change to quantify the improvement. The
mimic benchmark numbers shouldn't change if you're only changing `.wuffs` code,
but seeing zero change in those numbers is a sanity check on any unrelated
system variance, such as software updates or virus checkers running in the
background.


## Directory Layout

- `lang` holds the Go libraries that implement the Wuffs language: tokenizer,
  AST, parser, renderer, etc. The Wuffs tools are written in Go, but as
  mentioned above, Wuffs transpiles to C code, and Go is not necessarily
  involved if all you want is to use the C edition of Wuffs.
- `lib` holds other Go libraries, not specific to the Wuffs language per se.
- `internal` holds internal implementation details, as per Go's [internal
  packages](https://golang.org/s/go14internal) convention.
- `cmd` holds Wuffs' command line tools, also written in Go.
- `std` holds the Wuffs standard library's code. The initial focus is on
  popular image codecs: BMP, GIF, JPEG, PNG, TIFF and WEBP.
- `gen` holds the transpiled editions of that standard library. The initial
  focus is generating C code. Later on, the repository might include generated
  Go and Rust code.
- `release` holds the releases of the Wuffs standard library.
- `test` holds the regular tests for the Wuffs standard library.
- `fuzz` holds the fuzz tests for the Wuffs standard library.
- `script` holds miscellaneous utility programs.
- `doc` holds documentation.
- `example` holds example programs.

For a guide on how various things work together, the "99ff8e2 Let fields have
default values" commit is an example of adding new Wuffs syntax and threading
that all the way through to C code generation and testing.


# Documentation

- [Changelog](./doc/changelog.md)
- [Related Work](./doc/related-work.md)
- [Roadmap](./doc/roadmap.md)
- [Wuffs the Language](./doc/wuffs-the-language.md)
- Wuffs the Library (TODO)

Measurements:

- [Benchmarks](./doc/benchmarks.md)
- [Binary Size](./doc/binary-size.md)
- [Compatibility](./doc/compatibility.md)


# Status

Proof of concept. Version 0.1 at best. API and ABI aren't stabilized yet. There
are plenty of tests to create, docs to write and TODOs to do. The compiler
undoubtedly has bugs. Assertion checking needs more rigor, especially around
side effects and aliasing, and being sufficiently well specified to allow
alternative implementations. Lots of detail needs work, but the broad
brushstrokes are there.


# Discussion

The mailing list is at
[https://groups.google.com/forum/#!forum/wuffs](https://groups.google.com/forum/#!forum/wuffs).


# Contributing

The [CONTRIBUTING.md](./CONTRIBUTING.md) file contains instructions on how to
file the Contributor License Agreement before sending any pull requests (PRs).
Of course, if you're new to the project, it's usually best to discuss any
proposals and reach consensus before sending your first PR.


# License

Apache 2. See the LICENSE file for details.


# Disclaimer

This is not an official Google product; it is just code that happens to be
owned by Google.


---

Updated in June 2018.