1[/============================================================================== 2 Copyright (C) 2001-2011 Joel de Guzman 3 Copyright (C) 2001-2011 Hartmut Kaiser 4 5 Distributed under the Boost Software License, Version 1.0. (See accompanying 6 file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) 7===============================================================================/] 8[section Parsing Expression Grammar] 9 10Parsing Expression Grammars (PEG) [footnote Bryan Ford: Parsing Expression 11Grammars: A Recognition-Based Syntactic Foundation, 12[@http://pdos.csail.mit.edu/~baford/packrat/popl04/]] are a derivative of 13Extended Backus-Naur Form (EBNF) [footnote Richard E. Pattis: EBNF: A Notation 14to Describe Syntax, [@http://www.cs.cmu.edu/~pattis/misc/ebnf.pdf]] 15with a different interpretation, designed to represent a recursive descent 16parser. A PEG can be directly represented as a recursive-descent parser. 17 18Like EBNF, PEG is a formal grammar for describing a formal language in 19terms of a set of rules used to recognize strings of this language. 20Unlike EBNF, PEGs have an exact interpretation. There is only one valid 21parse tree (see __ast__) for each PEG grammar. 22 23[heading Sequences] 24 25Sequences are represented by juxtaposition like in EBNF: 26 27 a b 28 29The PEG expression above states that, in order for this to succeed, 30`b` must follow `a`. Here's the syntax diagram: 31 32[:__sd_sequence__] 33 34Here's a trivial example: 35 36 'x' digit 37 38which means the character `x` must be followed by a digit. 39 40[note In __qi__, we use the `>>` for sequences since C++ does not 41allow juxtaposition, while in __karma__ we use the `<<` instead.] 42 43[heading Alternatives] 44 45Alternatives are represented in PEG using the slash: 46 47 a / b 48 49[note In __qi__ and __karma__, we use the `|` for alternatives just as in EBNF.] 50 51Alternatives allow for choices. The expression above reads: try to match 52`a`. If `a` succeeds, success, if not try to match `b`. This is a bit of 53a deviation from the usual EBNF interpretation where you simply match 54`a` *or* `b`. Here's the syntax diagram: 55 56[:__sd_choice__] 57 58PEGs allow for ambiguity in the alternatives. In the expression above, 59both `a` or `b` can both match an input string. However, only the first 60matching alternative is valid. As noted, there can only be one valid 61parse tree. [/FIXME: $$$ explain more about this $$$] 62 63[heading Loops] 64 65Again, like EBNF, PEG uses the regular-expression Kleene star and the 66plus loops: 67 68 a* 69 a+ 70 71[note __qi__ and __karma__ use the prefix star and plus since there is no 72postfix star or plus in C++.] 73 74Here are the syntax diagrams: 75 76[:__sd_kleene__] 77[:__sd_plus__] 78 79The first, called the Kleene star, matches zero or more of its subject 80`a`. The second, plus, matches one ore more of its subject `a`. 81 82Unlike EBNF, PEGs have greedy loops. It will match as much as it can 83until its subject fails to match without regard to what follows. The 84following is a classic example of a fairly common EBNF/regex expression 85failing to match in PEG: 86 87 alnum* digit 88 89In PEG, alnum will eat as much alpha-numeric characters as it can 90leaving nothing more left behind. Thus, the trailing digit will get 91nothing. Loops are simply implemented in recursive descent code as 92for/while loops making them extremely efficient. That is a definite 93advantage. On the other hand, those who are familiar with EBNF and regex 94behavior might find the behavior a major gotcha. PEG provides a couple 95of other mechanisms to circumvent this. We will see more of these other 96mechanisms shortly. 97 98[heading Difference] 99 100In some cases, you may want to restrict a certain expression. You can 101think of a PEG expression as a match for a potentially infinite set of 102strings. The difference operator allows you to restrict this set: 103 104 a - b 105 106The expression reads: match `a` but not `b`. 107 108[note There is no difference operator in __karma__, as the concept does not 109make sense in the context of output generation.] 110 111[endsect] 112 113 114