1# Making a new parser from scratch 2 3Writing a parser is a very fun, interactive process, but sometimes a daunting 4task. How do you test it? How to see ambiguities in specifications? 5 6nom is designed to abstract data manipulation (counting array offsets, 7converting to structures, etc) while providing a safe, composable API. It also 8takes care of making the code easy to test and read, but it can be confusing at 9first, if you are not familiar with parser combinators, or if you are not used 10to Rust generic functions. 11 12This document is here to help you in getting started with nom. You can also find 13[nom recipes for common short parsing tasks here](nom_recipes.md). If you need 14more specific help, please ping `geal` on IRC (libera, geeknode, 15oftc), go to `#nom-parsers` on Libera IRC, or on the 16[Gitter chat room](https://gitter.im/Geal/nom). 17 18# First step: the initial research 19 20A big part of the initial work lies in accumulating enough documentation and 21samples to understand the format. The specification is useful, but specifications 22represent an "official" point of view, that may not be the real world usage. Any 23blog post or open source code is useful, because it shows how people understand 24the format, and how they work around each other's bugs (if you think a 25specification ensures every implementation is consistent with the others, think again). 26 27You should get a lot of samples (file or network traces) to test your code. The 28easy way is to use a small number of samples coming from the same source and 29develop everything around them, to realize later that they share a very specific 30bug. 31 32# Code organization 33 34While it is tempting to insert the parsing code right inside the rest of the 35logic, it usually results in unmaintainable code, and makes testing challenging. 36Parser combinators, the parsing technique used in nom, assemble a lot of small 37functions to make powerful parsers. This means that those functions only depend 38on their input, not on an external state. This makes it easy to parse the input 39partially, and to test those functions independently. 40 41Usually, you can separate the parsing functions in their own module, so you 42could have a `src/lib.rs` file containing this: 43 44```rust 45pub mod parser; 46``` 47 48And the `src/parser.rs` file: 49 50```rust 51use nom::IResult; 52use nom::number::complete::be_u16; 53use nom::bytes::complete::take; 54 55pub fn length_value(input: &[u8]) -> IResult<&[u8],&[u8]> { 56 let (input, length) = be_u16(input)?; 57 take(length)(input) 58} 59``` 60 61# Writing a first parser 62 63Let's parse a simple expression like `(12345)`. nom parsers are functions that 64use the `nom::IResult` type everywhere. As an example, a parser taking a byte 65slice `&[u8]` and returning a 32 bits unsigned integer `u32` would have this 66signature: `fn parse_u32(input: &[u8]) -> IResult<&[u8], u32>`. 67 68The `IResult` type depends on the input and output types, and an optional custom 69error type. This enum can either be `Ok((i,o))` containing the remaining input 70and the output value, or, on the `Err` side, an error or an indication that more 71data is needed. 72 73```rust 74pub type IResult<I, O, E=(I,ErrorKind)> = Result<(I, O), Err<E>>; 75 76#[derive(Debug, PartialEq, Eq, Clone, Copy)] 77pub enum Needed { 78 Unknown, 79 Size(u32) 80} 81 82#[derive(Debug, Clone, PartialEq)] 83pub enum Err<E> { 84 Incomplete(Needed), 85 Error(E), 86 Failure(E), 87} 88``` 89 90nom uses this type everywhere. Every combination of parsers will pattern match 91on this to know if it must return a value, an error, consume more data, etc. 92But this is done behind the scenes most of the time. 93 94Parsers are usually built from the bottom up, by first writing parsers for the 95smallest elements, then assembling them in more complex parsers by using 96combinators. 97 98As an example, here is how we could build a (non spec compliant) HTTP request 99line parser: 100 101```rust 102// first implement the basic parsers 103let method = take_while1(is_alpha); 104let space = take_while1(|c| c == ' '); 105let url = take_while1(|c| c!= ' '); 106let is_version = |c| c >= b'0' && c <= b'9' || c == b'.'; 107let http = tag("HTTP/"); 108let version = take_while1(is_version); 109let line_ending = tag("\r\n"); 110 111// combine http and version to extract the version string 112// preceded will return the result of the second parser 113// if both succeed 114let http_version = preceded(http, version); 115 116// combine all previous parsers in one function 117fn request_line(i: &[u8]) -> IResult<&[u8], Request> { 118 119 // tuple takes as argument a tuple of parsers and will return 120 // a tuple of their results 121 let (input, (method, _, url, _, version, _)) = 122 tuple((method, space, url, space, http_version, line_ending))(i)?; 123 124 Ok((input, Request { method, url, version })) 125} 126``` 127 128Since it is easy to combine small parsers, I encourage you to write small 129functions corresponding to specific parts of the format, test them 130independently, then combine them in more general parsers. 131 132# Finding the right combinator 133 134nom has a lot of different combinators, depending on the use case. They are all 135described in the [reference](https://docs.rs/nom). 136 137[Basic functions](https://docs.rs/nom/#functions) are available. They deal mostly 138in recognizing character types, like `alphanumeric` or `digit`. They also parse 139big endian and little endian integers and floats of multiple sizes. 140 141Most of the functions are there to combine parsers, and they are generic over 142the input type. 143 144# Testing the parsers 145 146Once you have a parser function, a good trick is to test it on a lot of the 147samples you gathered, and integrate this to your unit tests. To that end, put 148all of the test files in a folder like `assets` and refer to test files like 149this: 150 151```rust 152#[test] 153fn header_test() { 154 let data = include_bytes!("../assets/axolotl-piano.gif"); 155 println!("bytes:\n{}", &data[0..100].to_hex(8)); 156 let res = header(data); 157 // ... 158``` 159 160The `include_bytes!` macro (provided by Rust's standard library) will integrate 161the file as a byte slice in your code. You can then just refer to the part of 162the input the parser has to handle via its offset. Here, we take the first 100 163bytes of a GIF file to parse its header 164(complete code [here](https://github.com/Geal/gif.rs/blob/master/src/parser.rs#L305-L309)). 165 166If your parser handles textual data, you can just use a lot of strings directly 167in the test, like this: 168 169```rust 170#[test] 171fn factor_test() { 172 assert_eq!(factor("3"), Ok(("", 3))); 173 assert_eq!(factor(" 12"), Ok(("", 12))); 174 assert_eq!(factor("537 "), Ok(("", 537))); 175 assert_eq!(factor(" 24 "), Ok(("", 24))); 176} 177``` 178 179The more samples and test cases you get, the more you can experiment with your 180parser design. 181 182# Debugging the parsers 183 184There are a few tools you can use to debug how code is generated. 185 186## dbg_dmp 187 188This function wraps a parser that accepts a `&[u8]` as input and 189prints its hexdump if the child parser encountered an error: 190 191```rust 192use nom::{IResult, error::dbg_dmp, bytes::complete::tag}; 193 194fn f(i: &[u8]) -> IResult<&[u8], &[u8]> { 195 dbg_dmp(tag("abcd"), "tag")(i) 196} 197 198 let a = &b"efghijkl"[..]; 199 200// Will print the following message: 201// Error(Position(0, [101, 102, 103, 104, 105, 106, 107, 108])) at l.5 by ' tag ! ( "abcd" ) ' 202// 00000000 65 66 67 68 69 6a 6b 6c efghijkl 203f(a); 204``` 205 206