1[/ 2 / Copyright (c) 2008 Eric Niebler 3 / 4 / Distributed under the Boost Software License, Version 1.0. (See accompanying 5 / file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) 6 /] 7 8[section Grammars and Nested Matches] 9 10[h2 Overview] 11 12One of the key benefits of representing regexes as C++ expressions is the ability to easily refer to other C++ 13code and data from within the regex. This enables programming idioms that are not possible with other regular 14expression libraries. Of particular note is the ability for one regex to refer to another regex, allowing you 15to build grammars out of regular expressions. This section describes how to embed one regex in another by value 16and by reference, how regex objects behave when they refer to other regexes, and how to access the tree of results 17after a successful parse. 18 19[h2 Embedding a Regex by Value] 20 21The _basic_regex_ object has value semantics. When a regex object appears on the right-hand side in the definition 22of another regex, it is as if the regex were embedded by value; that is, a copy of the nested regex is stored by 23the enclosing regex. The inner regex is invoked by the outer regex during pattern matching. The inner regex 24participates fully in the match, back-tracking as needed to make the match succeed. 25 26Consider a text editor that has a regex-find feature with a whole-word option. You can implement this with 27xpressive as follows: 28 29 find_dialog dlg; 30 if( dialog_ok == dlg.do_modal() ) 31 { 32 std::string pattern = dlg.get_text(); // the pattern the user entered 33 bool whole_word = dlg.whole_word.is_checked(); // did the user select the whole-word option? 34 35 sregex re = sregex::compile( pattern ); // try to compile the pattern 36 37 if( whole_word ) 38 { 39 // wrap the regex in begin-word / end-word assertions 40 re = bow >> re >> eow; 41 } 42 43 // ... use re ... 44 } 45 46Look closely at this line: 47 48 // wrap the regex in begin-word / end-word assertions 49 re = bow >> re >> eow; 50 51This line creates a new regex that embeds the old regex by value. Then, the new regex is assigned back to 52the original regex. Since a copy of the old regex was made on the right-hand side, this works as you might 53expect: the new regex has the behavior of the old regex wrapped in begin- and end-word assertions. 54 55[note Note that `re = bow >> re >> eow` does ['not] define a recursive regular expression, since regex 56objects embed by value by default. The next section shows how to define a recursive regular expression by 57embedding a regex by reference.] 58 59[h2 Embedding a Regex by Reference] 60 61If you want to be able to build recursive regular expressions and context-free grammars, embedding a regex 62by value is not enough. You need to be able to make your regular expressions self-referential. Most regular 63expression engines don't give you that power, but xpressive does. 64 65[tip The theoretical computer scientists out there will correctly point out that a self-referential 66regular expression is not "regular", so in the strict sense, xpressive isn't really a ['regular] expression engine 67at all. But as Larry Wall once said, "the term '''[regular expression]''' has grown with the capabilities of our 68pattern matching engines, so I'm not going to try to fight linguistic necessity here."] 69 70Consider the following code, which uses the `by_ref()` helper to define a recursive regular expression that 71matches balanced, nested parentheses: 72 73 sregex parentheses; 74 parentheses // A balanced set of parentheses ... 75 = '(' // is an opening parenthesis ... 76 >> // followed by ... 77 *( // zero or more ... 78 keep( +~(set='(',')') ) // of a bunch of things that are not parentheses ... 79 | // or ... 80 by_ref(parentheses) // a balanced set of parentheses 81 ) // (ooh, recursion!) ... 82 >> // followed by ... 83 ')' // a closing parenthesis 84 ; 85 86Matching balanced, nested tags is an important text processing task, and it is one that "classic" regular 87expressions cannot do. The `by_ref()` helper makes it possible. It allows one regex object to be embedded 88in another ['by reference]. Since the right-hand side holds `parentheses` by reference, assigning the right-hand 89side back to `parentheses` creates a cycle, which will execute recursively. 90 91[h2 Building a Grammar] 92 93Once we allow self-reference in our regular expressions, the genie is out of the bottle and all manner of 94fun things are possible. In particular, we can now build grammars out of regular expressions. Let's have 95a look at the text-book grammar example: the humble calculator. 96 97 sregex group, factor, term, expression; 98 99 group = '(' >> by_ref(expression) >> ')'; 100 factor = +_d | group; 101 term = factor >> *(('*' >> factor) | ('/' >> factor)); 102 expression = term >> *(('+' >> term) | ('-' >> term)); 103 104The regex `expression` defined above does something rather remarkable for a regular expression: it matches 105mathematical expressions. For example, if the input string were `"foo 9*(10+3) bar"`, this pattern would 106match `"9*(10+3)"`. It only matches well-formed mathematical expressions, where the parentheses are 107balanced and the infix operators have two arguments each. Don't try this with just any regular expression 108engine! 109 110Let's take a closer look at this regular expression grammar. Notice that it is cyclic: `expression` is 111implemented in terms of `term`, which is implemented in terms of `factor`, which is implemented in terms 112of `group`, which is implemented in terms of `expression`, closing the loop. In general, the way to define 113a cyclic grammar is to forward-declare the regex objects and embed by reference those regular expressions 114that have not yet been initialized. In the above grammar, there is only one place where we need to reference 115a regex object that has not yet been initialized: the definition of `group`. In that place, we use 116`by_ref()` to embed `expression` by reference. In all other places, it is sufficient to embed the other 117regex objects by value, since they have already been initialized and their values will not change. 118 119[tip [*Embed by value if possible] 120\n\n 121In general, prefer embedding regular expressions by value rather than by reference. It 122involves one less indirection, making your patterns match a little faster. Besides, value semantics 123are simpler and will make your grammars easier to reason about. Don't worry about the expense of "copying" 124a regex. Each regex object shares its implementation with all of its copies.] 125 126[h2 Dynamic Regex Grammars] 127 128Using _regex_compiler_, you can also build grammars out of dynamic regular expressions. 129You do that by creating named regexes, and referring to other regexes by name. Each 130_regex_compiler_ instance keeps a mapping from names to regexes that have been created 131with it. 132 133You can create a named dynamic regex by prefacing your regex with `"(?$name=)"`, where 134/name/ is the name of the regex. You can refer to a named regex from another regex with 135`"(?$name)"`. The named regex does not need to exist yet at the time it is referenced 136in another regex, but it must exist by the time you use the regex. 137 138Below is a code fragment that uses dynamic regex grammars to implement the calculator 139example from above. 140 141 using namespace boost::xpressive; 142 using namespace regex_constants; 143 144 sregex expr; 145 146 { 147 sregex_compiler compiler; 148 syntax_option_type x = ignore_white_space; 149 150 compiler.compile("(? $group = ) \\( (? $expr ) \\) ", x); 151 compiler.compile("(? $factor = ) \\d+ | (? $group ) ", x); 152 compiler.compile("(? $term = ) (? $factor )" 153 " ( \\* (? $factor ) | / (? $factor ) )* ", x); 154 expr = compiler.compile("(? $expr = ) (? $term )" 155 " ( \\+ (? $term ) | - (? $term ) )* ", x); 156 } 157 158 std::string str("foo 9*(10+3) bar"); 159 smatch what; 160 161 if(regex_search(str, what, expr)) 162 { 163 // This prints "9*(10+3)": 164 std::cout << what[0] << std::endl; 165 } 166 167As with static regex grammars, nested regex invocations create nested 168match results (see /Nested Results/ below). The result is a complete parse tree 169for string that matched. Unlike static regexes, dynamic regexes are always 170embedded by reference, not by value. 171 172[h2 Cyclic Patterns, Copying and Memory Management, Oh My!] 173 174The calculator examples above raises a number of very complicated memory-management issues. Each of the 175four regex objects refer to each other, some directly and some indirectly, some by value and some by 176reference. What if we were to return one of them from a function and let the others go out of scope? 177What becomes of the references? The answer is that the regex objects are internally reference counted, 178such that they keep their referenced regex objects alive as long as they need them. So passing a regex 179object by value is never a problem, even if it refers to other regex objects that have gone out of scope. 180 181Those of you who have dealt with reference counting are probably familiar with its Achilles Heel: cyclic 182references. If regex objects are reference counted, what happens to cycles like the one created in the 183calculator examples? Are they leaked? The answer is no, they are not leaked. The _basic_regex_ object has some tricky 184reference tracking code that ensures that even cyclic regex grammars are cleaned up when the last external 185reference goes away. So don't worry about it. Create cyclic grammars, pass your regex objects around and 186copy them all you want. It is fast and efficient and guaranteed not to leak or result in dangling references. 187 188[h2 Nested Regexes and Sub-Match Scoping] 189 190Nested regular expressions raise the issue of sub-match scoping. If both the inner and outer regex write 191to and read from the same sub-match vector, chaos would ensue. The inner regex would stomp on the 192sub-matches written by the outer regex. For example, what does this do? 193 194 sregex inner = sregex::compile( "(.)\\1" ); 195 sregex outer = (s1= _) >> inner >> s1; 196 197The author probably didn't intend for the inner regex to overwrite the sub-match written by the outer 198regex. The problem is particularly acute when the inner regex is accepted from the user as input. The 199author has no way of knowing whether the inner regex will stomp the sub-match vector or not. This is 200clearly not acceptable. 201 202Instead, what actually happens is that each invocation of a nested regex gets its own scope. Sub-matches 203belong to that scope. That is, each nested regex invocation gets its own copy of the sub-match vector to 204play with, so there is no way for an inner regex to stomp on the sub-matches of an outer regex. So, for 205example, the regex `outer` defined above would match `"ABBA"`, as it should. 206 207[h2 Nested Results] 208 209If nested regexes have their own sub-matches, there should be a way to access them after a successful 210match. In fact, there is. After a _regex_match_ or _regex_search_, the _match_results_ struct behaves 211like the head of a tree of nested results. The _match_results_ class provides a `nested_results()` 212member function that returns an ordered sequence of _match_results_ structures, representing the 213results of the nested regexes. The order of the nested results is the same as the order in which 214the nested regex objects matched. 215 216Take as an example the regex for balanced, nested parentheses we saw earlier: 217 218 sregex parentheses; 219 parentheses = '(' >> *( keep( +~(set='(',')') ) | by_ref(parentheses) ) >> ')'; 220 221 smatch what; 222 std::string str( "blah blah( a(b)c (c(e)f (g)h )i (j)6 )blah" ); 223 224 if( regex_search( str, what, parentheses ) ) 225 { 226 // display the whole match 227 std::cout << what[0] << '\n'; 228 229 // display the nested results 230 std::for_each( 231 what.nested_results().begin(), 232 what.nested_results().end(), 233 output_nested_results() ); 234 } 235 236This program displays the following: 237 238[pre 239( a(b)c (c(e)f (g)h )i (j)6 ) 240 (b) 241 (c(e)f (g)h ) 242 (e) 243 (g) 244 (j) 245] 246 247Here you can see how the results are nested and that they are stored in the order in which they 248are found. 249 250[tip See the definition of [link boost_xpressive.user_s_guide.examples.display_a_tree_of_nested_results output_nested_results] in the 251[link boost_xpressive.user_s_guide.examples Examples] section.] 252 253[h2 Filtering Nested Results] 254 255Sometimes a regex will have several nested regex objects, and you want to know which result corresponds 256to which regex object. That's where `basic_regex<>::regex_id()` and `match_results<>::regex_id()` 257come in handy. When iterating over the nested results, you can compare the regex id from the results to 258the id of the regex object you're interested in. 259 260To make this a bit easier, xpressive provides a predicate to make it simple to iterate over just the 261results that correspond to a certain nested regex. It is called `regex_id_filter_predicate`, and it is 262intended to be used with _iterator_. You can use it as follows: 263 264 sregex name = +alpha; 265 sregex integer = +_d; 266 sregex re = *( *_s >> ( name | integer ) ); 267 268 smatch what; 269 std::string str( "marsha 123 jan 456 cindy 789" ); 270 271 if( regex_match( str, what, re ) ) 272 { 273 smatch::nested_results_type::const_iterator begin = what.nested_results().begin(); 274 smatch::nested_results_type::const_iterator end = what.nested_results().end(); 275 276 // declare filter predicates to select just the names or the integers 277 sregex_id_filter_predicate name_id( name.regex_id() ); 278 sregex_id_filter_predicate integer_id( integer.regex_id() ); 279 280 // iterate over only the results from the name regex 281 std::for_each( 282 boost::make_filter_iterator( name_id, begin, end ), 283 boost::make_filter_iterator( name_id, end, end ), 284 output_result 285 ); 286 287 std::cout << '\n'; 288 289 // iterate over only the results from the integer regex 290 std::for_each( 291 boost::make_filter_iterator( integer_id, begin, end ), 292 boost::make_filter_iterator( integer_id, end, end ), 293 output_result 294 ); 295 } 296 297where `output_results` is a simple function that takes a `smatch` and displays the full match. 298Notice how we use the `regex_id_filter_predicate` together with `basic_regex<>::regex_id()` and 299`boost::make_filter_iterator()` from the _iterator_ to select only those results 300corresponding to a particular nested regex. This program displays the following: 301 302[pre 303marsha 304jan 305cindy 306123 307456 308789 309] 310 311 312 313[endsect] 314