• Home
  • Line#
  • Scopes#
  • Navigate#
  • Raw
  • Download
1# Making a new parser from scratch
2
3Writing a parser is a very fun, interactive process, but sometimes a daunting
4task. How do you test it? How to see ambiguities in specifications?
5
6nom is designed to abstract data manipulation (counting array offsets,
7converting to structures, etc) while providing a safe, composable API. It also
8takes care of making the code easy to test and read, but it can be confusing at
9first, if you are not familiar with parser combinators, or if you are not used
10to Rust generic functions.
11
12This document is here to help you in getting started with nom. You can also find
13[nom recipes for common short parsing tasks here](nom_recipes.md). If you need
14more specific help, please ping `geal` on IRC (libera, geeknode,
15oftc), go to `#nom-parsers` on Libera IRC, or on the
16[Gitter chat room](https://gitter.im/Geal/nom).
17
18# First step: the initial research
19
20A big part of the initial work lies in accumulating enough documentation and
21samples to understand the format. The specification is useful, but specifications
22represent an "official" point of view, that may not be the real world usage. Any
23blog post or open source code is useful, because it shows how people understand
24the format, and how they work around each other's bugs (if you think a
25specification ensures every implementation is consistent with the others, think again).
26
27You should get a lot of samples (file or network traces) to test your code. The
28easy way is to use a small number of samples coming from the same source and
29develop everything around them, to realize later that they share a very specific
30bug.
31
32# Code organization
33
34While it is tempting to insert the parsing code right inside the rest of the
35logic, it usually results in  unmaintainable code, and makes testing challenging.
36Parser combinators, the parsing technique used in nom, assemble a lot of small
37functions to make powerful parsers. This means that those functions only depend
38on their input, not on an external state. This makes it easy to parse the input
39partially, and to test those functions independently.
40
41Usually, you can separate the parsing functions in their own module, so you
42could have a `src/lib.rs` file containing this:
43
44```rust
45pub mod parser;
46```
47
48And the `src/parser.rs` file:
49
50```rust
51use nom::IResult;
52use nom::number::complete::be_u16;
53use nom::bytes::complete::take;
54
55pub fn length_value(input: &[u8]) -> IResult<&[u8],&[u8]> {
56    let (input, length) = be_u16(input)?;
57    take(length)(input)
58}
59```
60
61# Writing a first parser
62
63Let's parse a simple expression like `(12345)`. nom parsers are functions that
64use the `nom::IResult` type everywhere. As an example, a parser taking a byte
65slice `&[u8]` and returning a 32 bits unsigned integer `u32` would have this
66signature: `fn parse_u32(input: &[u8]) -> IResult<&[u8], u32>`.
67
68The `IResult` type depends on the input and output types, and an optional custom
69error type. This enum can either be `Ok((i,o))` containing the remaining input
70and the output value, or, on the `Err` side, an error or an indication that more
71data is needed.
72
73```rust
74pub type IResult<I, O, E=(I,ErrorKind)> = Result<(I, O), Err<E>>;
75
76#[derive(Debug, PartialEq, Eq, Clone, Copy)]
77pub enum Needed {
78  Unknown,
79  Size(u32)
80}
81
82#[derive(Debug, Clone, PartialEq)]
83pub enum Err<E> {
84  Incomplete(Needed),
85  Error(E),
86  Failure(E),
87}
88```
89
90nom uses this type everywhere. Every combination of parsers will pattern match
91on this to know if it must return a value, an error, consume more data, etc.
92But this is done behind the scenes most of the time.
93
94Parsers are usually built from the bottom up, by first writing parsers for the
95smallest elements, then assembling them in more complex parsers by using
96combinators.
97
98As an example, here is how we could build a (non spec compliant) HTTP request
99line parser:
100
101```rust
102// first implement the basic parsers
103let method = take_while1(is_alpha);
104let space = take_while1(|c| c == ' ');
105let url = take_while1(|c| c!= ' ');
106let is_version = |c| c >= b'0' && c <= b'9' || c == b'.';
107let http = tag("HTTP/");
108let version = take_while1(is_version);
109let line_ending = tag("\r\n");
110
111// combine http and version to extract the version string
112// preceded will return the result of the second parser
113// if both succeed
114let http_version = preceded(http, version);
115
116// combine all previous parsers in one function
117fn request_line(i: &[u8]) -> IResult<&[u8], Request> {
118
119  // tuple takes as argument a tuple of parsers and will return
120  // a tuple of their results
121  let (input, (method, _, url, _, version, _)) =
122    tuple((method, space, url, space, http_version, line_ending))(i)?;
123
124  Ok((input, Request { method, url, version }))
125}
126```
127
128Since it is easy to combine small parsers, I encourage you to write small
129functions corresponding to specific parts of the format, test them
130independently, then combine them in more general parsers.
131
132# Finding the right combinator
133
134nom has a lot of different combinators, depending on the use case. They are all
135described in the [reference](https://docs.rs/nom).
136
137[Basic functions](https://docs.rs/nom/#functions) are available. They deal mostly
138in recognizing character types, like `alphanumeric` or `digit`. They also parse
139big endian and little endian integers and floats of multiple sizes.
140
141Most of the functions are there to combine parsers, and they are generic over
142the input type.
143
144# Testing the parsers
145
146Once you have a parser function, a good trick is to test it on a lot of the
147samples you gathered, and integrate this to your unit tests. To that end, put
148all of the test files in a folder like `assets` and refer to test files like
149this:
150
151```rust
152#[test]
153fn header_test() {
154  let data = include_bytes!("../assets/axolotl-piano.gif");
155  println!("bytes:\n{}", &data[0..100].to_hex(8));
156  let res = header(data);
157  // ...
158```
159
160The `include_bytes!` macro (provided by Rust's standard library) will integrate
161the file as a byte slice in your code. You can then just refer to the part of
162the input the parser has to handle via its offset. Here, we take the first 100
163bytes of a GIF file to parse its header
164(complete code [here](https://github.com/Geal/gif.rs/blob/master/src/parser.rs#L305-L309)).
165
166If your parser handles textual data, you can just use a lot of strings directly
167in the test, like this:
168
169```rust
170#[test]
171fn factor_test() {
172  assert_eq!(factor("3"),       Ok(("", 3)));
173  assert_eq!(factor(" 12"),     Ok(("", 12)));
174  assert_eq!(factor("537  "),   Ok(("", 537)));
175  assert_eq!(factor("  24   "), Ok(("", 24)));
176}
177```
178
179The more samples and test cases you get, the more you can experiment with your
180parser design.
181
182# Debugging the parsers
183
184There are a few tools you can use to debug how code is generated.
185
186## dbg_dmp
187
188This function wraps a parser that accepts a `&[u8]` as input and
189prints its hexdump if the child parser encountered an error:
190
191```rust
192use nom::{IResult, error::dbg_dmp, bytes::complete::tag};
193
194fn f(i: &[u8]) -> IResult<&[u8], &[u8]> {
195  dbg_dmp(tag("abcd"), "tag")(i)
196}
197
198  let a = &b"efghijkl"[..];
199
200// Will print the following message:
201// Error(Position(0, [101, 102, 103, 104, 105, 106, 107, 108])) at l.5 by ' tag ! ( "abcd" ) '
202// 00000000        65 66 67 68 69 6a 6b 6c         efghijkl
203f(a);
204```
205
206