// pest. The Elegant Parser
// Copyright (c) 2018 Dragoș Tiselice
//
// Licensed under the Apache License, Version 2.0
// <LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0> or the MIT
// license <LICENSE-MIT or http://opensource.org/licenses/MIT>, at your
// option. All files in the project carrying such notice may not be copied,
// modified, or distributed except according to those terms.

//! # pest. The Elegant Parser
//!
//! pest is a general purpose parser written in Rust with a focus on accessibility, correctness,
//! and performance. It uses parsing expression grammars (or [PEG]) as input, which are similar in
//! spirit to regular expressions, but which offer the enhanced expressivity needed to parse
//! complex languages.
//!
//! [PEG]: https://en.wikipedia.org/wiki/Parsing_expression_grammar
//!
//! ## Getting started
//!
//! The recommended way to start parsing with pest is to read the official [book].
//!
//! Other helpful resources:
//!
//! * API reference on [docs.rs]
//! * play with grammars and share them on our [fiddle]
//! * leave feedback, ask questions, or greet us on [Gitter]
//!
//! [book]: https://pest-parser.github.io/book
//! [docs.rs]: https://docs.rs/pest
//! [fiddle]: https://pest-parser.github.io/#editor
//! [Gitter]: https://gitter.im/dragostis/pest
//!
//! ## `.pest` files
//!
//! Grammar definitions reside in custom `.pest` files located in the `src` directory. Their path
//! is specified relative to `src`, between the `derive` attribute and the empty `struct` on which
//! `Parser` will be derived.
//!
//! ```ignore
//! #[derive(Parser)]
//! #[grammar = "path/to/my_grammar.pest"] // relative to src
//! struct MyParser;
//! ```
//!
//! ## Inline grammars
//!
//! Grammars can also be inlined by using the `#[grammar_inline = "..."]` attribute.
//!
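//! For example, a minimal sketch (the grammar and the parser name here are illustrative):
//!
//! ```ignore
//! #[derive(Parser)]
//! #[grammar_inline = r#"a = { "a" }"#] // the grammar is passed as a string literal
//! struct InlineParser;
//! ```
//!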
//! ## Grammar
//!
//! A grammar is a series of rules separated by whitespace, possibly containing comments.
//!
//! ### Comments
//!
//! Comments start with `//` and end at the end of the line.
//!
//! ```ignore
//! // a comment
//! ```
//!
//! ### Rules
//!
//! Rules have the following form:
//!
//! ```ignore
//! name = optional_modifier { expression }
//! ```
//!
//! A rule's name may contain alphanumeric characters and underscores (`_`), provided the first
//! character is not a digit. The name is used to create token pairs: a start token is produced
//! when the rule begins parsing, and an end token is produced when it finishes.
//!
//! The token pair notation `a(b(), c())` denotes the tokens: start `a`, start `b`, end
//! `b`, start `c`, end `c`, end `a`.
//!
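//! For instance, with this illustrative grammar:
//!
//! ```ignore
//! a = { b ~ c }
//! b = { "b" }
//! c = { "c" }
//! ```
//!
//! parsing `"bc"` produces the token pairs `a(b(), c())`.
//!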
//! #### Modifiers
//!
//! Modifiers are optional and can be one of `_`, `@`, `$`, or `!`. These modifiers change the
//! behavior of the rules.
//!
//! 1. Silent (`_`)
//!
//!     Silent rules do not create token pairs during parsing, nor are they reported in errors.
//!
//!     ```ignore
//!     a = _{ "a" }
//!     b =  { a ~ "b" }
//!     ```
//!
//!     Parsing `"ab"` produces the token pair `b()`.
//!
//! 2. Atomic (`@`)
//!
//!     Atomic rules do not accept whitespace or comments within their expressions and have a
//!     cascading effect on any rule they call, i.e. rules that are not atomic but are called by
//!     atomic rules behave atomically.
//!
//!     Any rules called by atomic rules do not generate token pairs.
//!
//!     ```ignore
//!     a =  { "a" }
//!     b = @{ a ~ "b" }
//!
//!     WHITESPACE = _{ " " }
//!     ```
//!
//!     Parsing `"ab"` produces the token pair `b()`, while `"a   b"` produces an error.
//!
//! 3. Compound-atomic (`$`)
//!
//!     Compound-atomic rules are identical to atomic rules, with the exception that rules called
//!     by them are not forbidden from generating token pairs.
//!
//!     ```ignore
//!     a =  { "a" }
//!     b = ${ a ~ "b" }
//!
//!     WHITESPACE = _{ " " }
//!     ```
//!
//!     Parsing `"ab"` produces the token pairs `b(a())`, while `"a   b"` produces an error.
//!
//! 4. Non-atomic (`!`)
//!
//!     Non-atomic rules are identical to normal rules, with the exception that they stop the
//!     cascading effect of atomic and compound-atomic rules.
//!
//!     ```ignore
//!     a =  { "a" }
//!     b = !{ a ~ "b" }
//!     c = @{ b }
//!
//!     WHITESPACE = _{ " " }
//!     ```
//!
//!     Parsing both `"ab"` and `"a   b"` produces the token pairs `c(a())`.
//!
//! #### Expressions
//!
//! Expressions can be either terminals or non-terminals.
//!
//! 1. Terminals
//!
//!     | Terminal   | Usage                                                          |
//!     |------------|----------------------------------------------------------------|
//!     | `"a"`      | matches the exact string `"a"`                                 |
//!     | `^"a"`     | matches the exact string `"a"` case insensitively (ASCII only) |
//!     | `'a'..'z'` | matches one character between `'a'` and `'z'`                  |
//!     | `a`        | matches rule `a`                                               |
//!
//! Strings and characters follow
//! [Rust's escape mechanisms](https://doc.rust-lang.org/reference/tokens.html#byte-escapes), while
//! identifiers can contain alphanumeric characters and underscores (`_`), as long as they do not
//! start with a digit.
//!
//! 2. Non-terminals
//!
//!     | Non-terminal          | Usage                                                     |
//!     |-----------------------|-----------------------------------------------------------|
//!     | `(e)`                 | matches `e`                                               |
//!     | `e1 ~ e2`             | matches the sequence `e1` `e2`                            |
//!     | <code>e1 \| e2</code> | matches either `e1` or `e2`                               |
//!     | `e*`                  | matches `e` zero or more times                            |
//!     | `e+`                  | matches `e` one or more times                             |
//!     | `e{n}`                | matches `e` exactly `n` times                             |
//!     | `e{, n}`              | matches `e` at most `n` times                             |
//!     | `e{n,}`               | matches `e` at least `n` times                            |
//!     | `e{m, n}`             | matches `e` between `m` and `n` times inclusively         |
//!     | `e?`                  | optionally matches `e`                                    |
//!     | `&e`                  | matches `e` without making progress                       |
//!     | `!e`                  | matches if `e` doesn't match, without making progress     |
//!     | `PUSH(e)`             | matches `e` and pushes its captured string down the stack |
//!
//!     where `e`, `e1`, and `e2` are expressions.
//!
//! Expressions can modify the stack only if they match the input. For example,
//! if `e1` in the compound expression `e1 | e2` does not match the input, then
//! it does not modify the stack, so `e2` sees the stack in the same state as
//! `e1` did. Repetitions and optionals (`e*`, `e+`, `e{, n}`, `e{n,}`,
//! `e{m, n}`, `e?`) can modify the stack each time `e` matches. The `!e` and `&e`
//! expressions are a special case; they never modify the stack.
//!
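//! As an illustration (the rule names here are assumed), several of these expressions can be
//! combined into a single rule:
//!
//! ```ignore
//! // one identifier, followed by zero or more comma-separated identifiers
//! list  = { ident ~ ("," ~ ident)* }
//! ident = { ('a'..'z' | "_")+ }
//! ```
//!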
//! ## Special rules
//!
//! Special rules can be called within the grammar. They are:
//!
//! * `WHITESPACE` - runs between rules and sub-rules
//! * `COMMENT` - runs between rules and sub-rules
//! * `ANY` - matches exactly one `char`
//! * `SOI` - (start-of-input) matches only when a `Parser` is still at the starting position
//! * `EOI` - (end-of-input) matches only when a `Parser` has reached its end
//! * `POP` - pops a string from the stack and matches it
//! * `POP_ALL` - pops the entire state of the stack and matches it
//! * `PEEK` - peeks a string from the stack and matches it
//! * `PEEK[a..b]` - peeks part of the stack and matches it
//! * `PEEK_ALL` - peeks the entire state of the stack and matches it
//! * `DROP` - drops the top of the stack (fails to match if the stack is empty)
//!
//! `WHITESPACE` and `COMMENT` should be defined manually if needed. All other rules cannot be
//! overridden.
//!
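//! For instance, `SOI` and `EOI` can anchor a rule to the entire input; this illustrative rule
//! matches only when the whole input consists of `"a"`s:
//!
//! ```ignore
//! only_as = { SOI ~ "a"* ~ EOI }
//! ```
//!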
//! ## `WHITESPACE` and `COMMENT`
//!
//! When defined, these rules get matched automatically in sequences (`~`) and repetitions
//! (`*`, `+`) between expressions. Atomic rules and those rules called by atomic rules are exempt
//! from this behavior.
//!
//! Because they are run in repetitions, these rules should each be defined to match only one
//! whitespace character or one comment at a time.
//!
//! If both `WHITESPACE` and `COMMENT` are defined, this grammar:
//!
//! ```ignore
//! a = { b ~ c }
//! ```
//!
//! is effectively transformed into this one behind the scenes:
//!
//! ```ignore
//! a = { b ~ WHITESPACE* ~ (COMMENT ~ WHITESPACE*)* ~ c }
//! ```
//!
//! ## `PUSH`, `POP`, `DROP`, and `PEEK`
//!
//! `PUSH(e)` simply pushes the captured string of the expression `e` down a stack. This stack can
//! then later be used to match grammar based on its content with `POP` and `PEEK`.
//!
//! `PEEK` always matches the string at the top of stack. So, if the stack contains `["b", "a"]`
//! (`"a"` being on top), this grammar:
//!
//! ```ignore
//! a = { PEEK }
//! ```
//!
//! is effectively transformed into this one at parse time:
//!
//! ```ignore
//! a = { "a" }
//! ```
//!
//! `POP` works the same way, with the exception that it pops the string off of the stack if the
//! match worked. With the stack from above, if `POP` matches `"a"`, the stack will be mutated
//! to `["b"]`.
//!
//! `DROP` makes it possible to remove the string at the top of the stack
//! without matching it. If the stack is nonempty, `DROP` drops the top of the
//! stack. If the stack is empty, then `DROP` fails to match.
//!
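//! An illustrative use of the stack is matching a repeated delimiter; this assumed rule accepts
//! `"a-a"` but rejects `"a-b"`, because `POP` must match the exact string pushed earlier:
//!
//! ```ignore
//! same = { PUSH(ASCII_ALPHA) ~ "-" ~ POP }
//! ```
//!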
//! ### Advanced peeking
//!
//! `PEEK[start..end]` and `PEEK_ALL` allow peeking deeper into the stack. The syntax works exactly
//! like Rust’s exclusive slice syntax. Additionally, negative indices can be used to indicate an
//! offset from the top. If the end lies before or at the start, the expression matches (as does
//! a `PEEK_ALL` on an empty stack). With the stack `["c", "b", "a"]` (`"a"` on top):
//!
//! ```ignore
//! fill = PUSH("c") ~ PUSH("b") ~ PUSH("a")
//! v = { PEEK_ALL } = { "a" ~ "b" ~ "c" }  // top to bottom
//! w = { PEEK[..] } = { "c" ~ "b" ~ "a" }  // bottom to top
//! x = { PEEK[1..2] } = { PEEK[1..-1] } = { "b" }
//! y = { PEEK[..-2] } = { PEEK[0..1] } = { "a" }
//! z = { PEEK[1..] } = { PEEK[-2..3] } = { "c" ~ "b" }
//! n = { PEEK[2..-2] } = { PEEK[2..1] } = { "" }
//! ```
//!
//! For historical reasons, `PEEK_ALL` matches from top to bottom, while `PEEK[start..end]` matches
//! from bottom to top. There is currently no syntax to match a slice of the stack top to bottom.
//!
//! ## `Rule`
//!
//! All rules defined or used in the grammar populate a generated `enum` called `Rule`. This
//! implements `pest`'s `RuleType` and can be used throughout the API.
//!
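//! As a brief sketch of how the generated `enum` might be used (the rule and parser names here
//! are assumptions, not part of the generated API):
//!
//! ```ignore
//! let pairs = MyParser::parse(Rule::my_rule, "input").expect("parse failed");
//! for pair in pairs {
//!     match pair.as_rule() {
//!         Rule::my_rule => println!("matched: {}", pair.as_str()),
//!         _ => unreachable!(),
//!     }
//! }
//! ```
//!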
//! ## `Built-in rules`
//!
//! Pest also comes with a number of built-in rules for convenience. They are:
//!
//! * `ASCII_DIGIT` - matches a numeric character from 0..9
//! * `ASCII_NONZERO_DIGIT` - matches a numeric character from 1..9
//! * `ASCII_BIN_DIGIT` - matches a numeric character from 0..1
//! * `ASCII_OCT_DIGIT` - matches a numeric character from 0..7
//! * `ASCII_HEX_DIGIT` - matches a numeric character from 0..9 or a..f or A..F
//! * `ASCII_ALPHA_LOWER` - matches a character from a..z
//! * `ASCII_ALPHA_UPPER` - matches a character from A..Z
//! * `ASCII_ALPHA` - matches a character from a..z or A..Z
//! * `ASCII_ALPHANUMERIC` - matches a character from a..z or A..Z or 0..9
//! * `ASCII` - matches a character from \x00..\x7f
//! * `NEWLINE` - matches either "\n" or "\r\n" or "\r"
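//!
//! For example, an assumed grammar for simple integers and identifiers could be built from these
//! rules:
//!
//! ```ignore
//! integer = @{ ASCII_DIGIT+ }
//! ident   = @{ (ASCII_ALPHA | "_") ~ (ASCII_ALPHANUMERIC | "_")* }
//! ```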

#![doc(html_root_url = "https://docs.rs/pest_derive")]

extern crate pest_generator;
extern crate proc_macro;

use proc_macro::TokenStream;

#[proc_macro_derive(Parser, attributes(grammar, grammar_inline))]
pub fn derive_parser(input: TokenStream) -> TokenStream {
    pest_generator::derive_parser(input.into(), true).into()
}