[/
 / Copyright (c) 2008 Eric Niebler
 /
 / Distributed under the Boost Software License, Version 1.0. (See accompanying
 / file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
 /]

[section Semantic Actions and User-Defined Assertions]

[h2 Overview]

Imagine you want to parse an input string and build a `std::map<>` from it. For
something like that, matching a regular expression isn't enough. You want to
/do something/ when parts of your regular expression match. Xpressive lets
you attach semantic actions to parts of your static regular expressions. This
section shows you how.

[h2 Semantic Actions]

Consider the following code, which uses xpressive's semantic actions to parse
a string of word/integer pairs and stuff them into a `std::map<>`. It is
described below.

    #include <map>
    #include <string>
    #include <iostream>
    #include <boost/xpressive/xpressive.hpp>
    #include <boost/xpressive/regex_actions.hpp>
    using namespace boost::xpressive;

    int main()
    {
        std::map<std::string, int> result;
        std::string str("aaa=>1 bbb=>23 ccc=>456");

        // Match a word and an integer, separated by =>,
        // and then stuff the result into a std::map<>
        sregex pair = ( (s1= +_w) >> "=>" >> (s2= +_d) )
            [ ref(result)[s1] = as<int>(s2) ];

        // Match one or more word/integer pairs, separated
        // by whitespace.
        sregex rx = pair >> *(+_s >> pair);

        if(regex_match(str, rx))
        {
            std::cout << result["aaa"] << '\n';
            std::cout << result["bbb"] << '\n';
            std::cout << result["ccc"] << '\n';
        }

        return 0;
    }

This program prints the following:

[pre
1
23
456
]

The regular expression `pair` has two parts: the pattern and the action. The
pattern says to match a word, capturing it in sub-match 1, and an integer,
capturing it in sub-match 2, separated by `"=>"`. The action is the part in
square brackets: `[ ref(result)[s1] = as<int>(s2) ]`.
It says to take sub-match 1
and use it to index into the `result` map, and assign to it the result of
converting sub-match 2 to an integer.

[note To use semantic actions with your static regexes, you must
`#include <boost/xpressive/regex_actions.hpp>`]

How does this work? Just like the rest of the static regular expression, the part
between brackets is an expression template. It encodes the action and executes
it later. The expression `ref(result)` creates a lazy reference to the `result`
object. The larger expression `ref(result)[s1]` is a lazy map index operation.
Later, when this action is executed, `s1` is replaced with the
first _sub_match_. Likewise, when `as<int>(s2)` is executed, `s2` is replaced
with the second _sub_match_. The `as<>` action converts its argument to the
requested type using Boost.Lexical_cast. The effect of the whole action is to
insert a new word/integer pair into the map.

[note There is an important difference between the function `boost::ref()` in
`<boost/ref.hpp>` and `boost::xpressive::ref()` in
`<boost/xpressive/regex_actions.hpp>`. The first returns a plain
`reference_wrapper<>` which behaves in many respects like an ordinary
reference. By contrast, `boost::xpressive::ref()` returns a /lazy/ reference
that you can use in expressions that are executed lazily. That is why we can
say `ref(result)[s1]`, even though `result` doesn't have an `operator[]` that
would accept `s1`.]

In addition to the sub-match placeholders `s1`, `s2`, etc., you can also use
the placeholder `_` within an action to refer back to the string matched by
the sub-expression to which the action is attached.
For instance, you can use
the following regex to match a bunch of digits, interpret them as an integer
and assign the result to a local variable:

    int i = 0;
    // Here, _ refers back to all the
    // characters matched by (+_d)
    sregex rex = (+_d)[ ref(i) = as<int>(_) ];

[h3 Lazy Action Execution]

What does it mean, exactly, to attach an action to part of a regular expression
and perform a match? When does the action execute? If the action is part of a
repeated sub-expression, does the action execute once or many times? And if the
sub-expression initially matches, but ultimately fails because the rest of the
regular expression fails to match, is the action executed at all?

The answer is that by default, actions are executed /lazily/. When a sub-expression
matches a string, its action is placed on a queue, along with the current
values of any sub-matches to which the action refers. If the match algorithm
must backtrack, actions are popped off the queue as necessary. Only after the
entire regex has matched successfully are the actions actually executed. They
are executed all at once, in the order in which they were added to the queue,
as the last step before _regex_match_ returns.

For example, consider the following regex that increments a counter whenever
it finds a digit.

    int i = 0;
    std::string str("1!2!3?");
    // count the exciting digits, but not the
    // questionable ones.
    sregex rex = +( _d [ ++ref(i) ] >> '!' );
    regex_search(str, rex);
    assert( i == 2 );

The action `++ref(i)` is queued three times: once for each found digit. But
it is only /executed/ twice: once for each digit that precedes a `'!'`
character. When the `'?'` character is encountered, the match algorithm
backtracks, removing the final action from the queue.
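
The queueing rules above can be pictured with a small stand-alone model. The
sketch below is plain C++11 and does not use xpressive at all; `action_queue`
and `count_exciting_digits` are hypothetical names invented for illustration,
and xpressive's real implementation may well differ. It hand-codes the pattern
`+( _d >> '!' )`, anchored at the start of the input, and shows actions being
queued, discarded on backtracking, and run only once the match has succeeded.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of a deferred-action queue (not xpressive's internals).
class action_queue
{
public:
    // Queue an action without running it.
    void push(std::function<void()> action) { pending_.push_back(std::move(action)); }

    // Remember the current queue length so we can backtrack to it later.
    std::size_t mark() const { return pending_.size(); }

    // Backtrack: discard every action queued after the given mark.
    void rollback(std::size_t mark) { pending_.erase(pending_.begin() + mark, pending_.end()); }

    // The overall match succeeded: run the surviving actions in queue order.
    void commit()
    {
        for (auto const &action : pending_)
            action();
        pending_.clear();
    }

private:
    std::vector<std::function<void()>> pending_;
};

// Hand-written matcher for +( digit >> '!' ), anchored at position 0.
// The "++count" action is attached to the digit, as in the example above.
int count_exciting_digits(std::string const &input)
{
    int count = 0;
    action_queue queue;
    std::size_t pos = 0;
    bool matched_once = false;

    for (;;)
    {
        std::size_t const saved = queue.mark();

        // Try to match a digit; on success, queue the action.
        if (pos >= input.size() || !std::isdigit(static_cast<unsigned char>(input[pos])))
            break;
        queue.push([&count] { ++count; });

        // The digit must be followed by '!'; otherwise this iteration fails
        // and its queued action is removed again.
        if (pos + 1 >= input.size() || input[pos + 1] != '!')
        {
            queue.rollback(saved);
            break;
        }

        pos += 2;
        matched_once = true;
    }

    // Actions run only if the pattern as a whole matched.
    if (matched_once)
        queue.commit();
    return count;
}
```

In this model, `count_exciting_digits("1!2!3?")` queues three increments, rolls
the last one back when it meets the `'?'`, and runs the remaining two at the end.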

[h3 Immediate Action Execution]

When you want semantic actions to execute immediately, you can wrap the
sub-expression containing the action in a [^[funcref boost::xpressive::keep keep()]].
`keep()` turns off back-tracking for its sub-expression, but it also causes
any actions queued by the sub-expression to execute at the end of the `keep()`.
It is as if the sub-expression in the `keep()` were compiled into an
independent regex object, and matching the `keep()` is like a separate invocation
of `regex_search()`. It matches characters and executes actions but never backtracks
or unwinds. For example, imagine the above example had been written as follows:

    int i = 0;
    std::string str("1!2!3?");
    // count all the digits.
    sregex rex = +( keep( _d [ ++ref(i) ] ) >> '!' );
    regex_search(str, rex);
    assert( i == 3 );

We have wrapped the sub-expression `_d [ ++ref(i) ]` in `keep()`. Now, whenever
this regex matches a digit, the action will be queued and then immediately
executed before we try to match a `'!'` character. In this case, the action
executes three times.

[note Like `keep()`, actions within [^[funcref boost::xpressive::before before()]]
and [^[funcref boost::xpressive::after after()]] are also executed early when their
sub-expressions have matched.]

[h3 Lazy Functions]

So far, we've seen how to write semantic actions consisting of variables and
operators. But what if you want to be able to call a function from a semantic
action? Xpressive provides a mechanism to do this.

The first step is to define a function object type.
Here, for instance, is a
function object type that calls `push()` on its argument:

    struct push_impl
    {
        // Result type, needed for tr1::result_of
        typedef void result_type;

        template<typename Sequence, typename Value>
        void operator()(Sequence &seq, Value const &val) const
        {
            seq.push(val);
        }
    };

The next step is to use xpressive's `function<>` template to define a function
object named `push`:

    // Global "push" function object.
    function<push_impl>::type const push = {{}};

The initialization looks a bit odd, but this is because `push` is being
statically initialized. That means it doesn't need to be constructed
at runtime. We can use `push` in semantic actions as follows:

    std::stack<int> ints;
    // Match digits, cast them to an int
    // and push it on the stack.
    sregex rex = (+_d)[push(ref(ints), as<int>(_))];

You'll notice that doing it this way causes member function invocations
to look like ordinary function invocations. You can choose to write your
semantic action in a different way that makes it look a bit more like
a member function call:

    sregex rex = (+_d)[ref(ints)->*push(as<int>(_))];

Xpressive recognizes the use of the `->*` and treats this expression
exactly the same as the one above.

When your function object must return a type that depends on its
arguments, you can use a `result<>` member template instead of the
`result_type` typedef. Here, for example, is a `first` function object
that returns the `first` member of a `std::pair<>` or _sub_match_:

    // Function object that returns the
    // first element of a pair.
    struct first_impl
    {
        template<typename Sig> struct result {};

        template<typename This, typename Pair>
        struct result<This(Pair)>
        {
            typedef typename remove_reference<Pair>
                ::type::first_type type;
        };

        template<typename Pair>
        typename Pair::first_type
        operator()(Pair const &p) const
        {
            return p.first;
        }
    };

    // OK, use as first(s1) to get the begin iterator
    // of the sub-match referred to by s1.
    function<first_impl>::type const first = {{}};

[h3 Referring to Local Variables]

As we've seen in the examples above, we can refer to local variables within
an action using `xpressive::ref()`. Any such variables are held by reference
by the regular expression, and care should be taken to avoid letting those
references dangle. For instance, in the following code, the reference to `i`
is left to dangle when `bad_voodoo()` returns:

    sregex bad_voodoo()
    {
        int i = 0;
        sregex rex = +( _d [ ++ref(i) ] >> '!' );
        // ERROR! rex refers by reference to a local
        // variable, which will dangle after bad_voodoo()
        // returns.
        return rex;
    }

When writing semantic actions, it is your responsibility to make sure that
none of the references dangle. One way to do that would be to make the
variables shared pointers that are held by the regex by value.

    sregex good_voodoo(boost::shared_ptr<int> pi)
    {
        // Use val() to hold the shared_ptr by value:
        sregex rex = +( _d [ ++*val(pi) ] >> '!' );
        // OK, rex holds a reference count to the integer.
        return rex;
    }

In the above code, we use `xpressive::val()` to hold the shared pointer by
value. That's not normally necessary because local variables appearing in
actions are held by value by default, but in this case, it is necessary. Had
we written the action as `++*pi`, it would have executed immediately.
That's
because `++*pi` is not an expression template, but `++*val(pi)` is.

It can be tedious to wrap all your variables in `ref()` and `val()` in your
semantic actions. Xpressive provides the `reference<>` and `value<>` templates
to make things easier. The following table shows the equivalencies:

[table reference<> and value<>
[[This ...][... is equivalent to this ...]]
[[``int i = 0;

sregex rex = +( _d [ ++ref(i) ] >> '!' );``][``int i = 0;
reference<int> ri(i);
sregex rex = +( _d [ ++ri ] >> '!' );``]]
[[``boost::shared_ptr<int> pi(new int(0));

sregex rex = +( _d [ ++*val(pi) ] >> '!' );``][``boost::shared_ptr<int> pi(new int(0));
value<boost::shared_ptr<int> > vpi(pi);
sregex rex = +( _d [ ++*vpi ] >> '!' );``]]
]

As you can see, when using `reference<>`, you need to first declare a local
variable and then declare a `reference<>` to it. These two steps can be combined
into one using `local<>`.

[table local<> vs. reference<>
[[This ...][... is equivalent to this ...]]
[[``local<int> i(0);

sregex rex = +( _d [ ++i ] >> '!' );``][``int i = 0;
reference<int> ri(i);
sregex rex = +( _d [ ++ri ] >> '!' );``]]
]

We can use `local<>` to rewrite the above example as follows:

    local<int> i(0);
    std::string str("1!2!3?");
    // count the exciting digits, but not the
    // questionable ones.
    sregex rex = +( _d [ ++i ] >> '!' );
    regex_search(str, rex);
    assert( i.get() == 2 );

Notice that we use `local<>::get()` to access the value of the local
variable. Also, beware that `local<>` can be used to create a dangling
reference, just as `reference<>` can.

[h3 Referring to Non-Local Variables]

In the beginning of this
section, we used a regex with a semantic action to parse a string of
word/integer pairs and stuff them into a `std::map<>`.
That required that
the map and the regex be defined together and used before either could
go out of scope. What if we wanted to define the regex once and use it
to fill lots of different maps? We would prefer to pass the map into the
_regex_match_ algorithm rather than embed a reference to it directly in
the regex object. What we can do instead is define a placeholder and use
that in the semantic action instead of the map itself. Later, when we
call one of the regex algorithms, we can bind the placeholder to an actual
map object. The following code shows how.

    // Define a placeholder for a map object:
    placeholder<std::map<std::string, int> > _map;

    // Match a word and an integer, separated by =>,
    // and then stuff the result into a std::map<>
    sregex pair = ( (s1= +_w) >> "=>" >> (s2= +_d) )
        [ _map[s1] = as<int>(s2) ];

    // Match one or more word/integer pairs, separated
    // by whitespace.
    sregex rx = pair >> *(+_s >> pair);

    // The string to parse
    std::string str("aaa=>1 bbb=>23 ccc=>456");

    // Here is the actual map to fill in:
    std::map<std::string, int> result;

    // Bind the _map placeholder to the actual map
    smatch what;
    what.let( _map = result );

    // Execute the match and fill in result map
    if(regex_match(str, what, rx))
    {
        std::cout << result["aaa"] << '\n';
        std::cout << result["bbb"] << '\n';
        std::cout << result["ccc"] << '\n';
    }

This program displays:

[pre
1
23
456
]

We use `placeholder<>` here to define `_map`, which stands in for a
`std::map<>` variable. We can use the placeholder in the semantic action as if
it were a map. Then, we define a _match_results_ struct and bind an actual map
to the placeholder with "`what.let( _map = result );`". The _regex_match_ call
behaves as if the placeholder in the semantic action had been replaced with a
reference to `result`.

[note Placeholders in semantic actions are not /actually/ replaced at runtime
with references to variables. The regex object is never mutated in any way
during any of the regex algorithms, so they are safe to use in multiple
threads.]

The syntax for late-bound action arguments is a little different if you are
using _regex_iterator_ or _regex_token_iterator_. The regex iterators accept
an extra constructor parameter for specifying the argument bindings. There is
a `let()` function that you can use to bind variables to their placeholders.
The following code demonstrates how.

    // Define a placeholder for a map object:
    placeholder<std::map<std::string, int> > _map;

    // Match a word and an integer, separated by =>,
    // and then stuff the result into a std::map<>
    sregex pair = ( (s1= +_w) >> "=>" >> (s2= +_d) )
        [ _map[s1] = as<int>(s2) ];

    // The string to parse
    std::string str("aaa=>1 bbb=>23 ccc=>456");

    // Here is the actual map to fill in:
    std::map<std::string, int> result;

    // Create a regex_iterator to find all the matches
    sregex_iterator it(str.begin(), str.end(), pair, let(_map=result));
    sregex_iterator end;

    // step through all the matches, and fill in
    // the result map
    while(it != end)
        ++it;

    std::cout << result["aaa"] << '\n';
    std::cout << result["bbb"] << '\n';
    std::cout << result["ccc"] << '\n';

This program displays:

[pre
1
23
456
]

[h2 User-Defined Assertions]

You are probably already familiar with regular expression /assertions/. In
Perl, some examples are the [^^] and [^$] assertions, which you can use to
match the beginning and end of a string, respectively. Xpressive lets you
define your own assertions. A custom assertion is a condition which must be
true at a point in the match in order for the match to succeed.
You can check
a custom assertion with xpressive's _check_ function.

There are a couple of ways to define a custom assertion. The simplest is to
use a function object. Let's say that you want to ensure that a sub-expression
matches a sub-string that is either 3 or 6 characters long. The following
struct defines such a predicate:

    // A predicate that is true IFF a sub-match is
    // either 3 or 6 characters long.
    struct three_or_six
    {
        bool operator()(ssub_match const &sub) const
        {
            return sub.length() == 3 || sub.length() == 6;
        }
    };

You can use this predicate within a regular expression as follows:

    // match words of 3 characters or 6 characters.
    sregex rx = (bow >> +_w >> eow)[ check(three_or_six()) ] ;

The above regular expression will find whole words that are either 3 or 6
characters long. The `three_or_six` predicate accepts a _sub_match_ that refers
back to the part of the string matched by the sub-expression to which the
custom assertion is attached.

[note The custom assertion participates in determining whether the match
succeeds or fails. Unlike actions, which execute lazily, custom assertions
execute immediately while the regex engine is searching for a match.]

Custom assertions can also be defined inline using the same syntax as for
semantic actions. Below is the same custom assertion written inline:

    // match words of 3 characters or 6 characters.
    sregex rx = (bow >> +_w >> eow)[ check(length(_)==3 || length(_)==6) ] ;

In the above, `length()` is a lazy function that calls the `length()` member
function of its argument, and `_` is a placeholder that receives the
`sub_match`.

Once you get the hang of writing custom assertions inline, they can be
very powerful. For example, you can write a regular expression that
only matches valid dates (for some suitably liberal definition of the
term ["valid]).

    // Maximum number of days in each month (February is allowed 29).
    int const days_per_month[] =
        {31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};

    mark_tag month(1), day(2);
    // find a valid date of the form month/day/year.
    sregex date =
        (
            // Month must be between 1 and 12 inclusive
            (month= _d >> !_d)     [ check(as<int>(_) >= 1
                                        && as<int>(_) <= 12) ]
        >>  '/'
            // Day must be between 1 and 31 inclusive
        >>  (day=   _d >> !_d)     [ check(as<int>(_) >= 1
                                        && as<int>(_) <= 31) ]
        >>  '/'
            // Only consider years between 1970 and 2038
        >>  (_d >> _d >> _d >> _d) [ check(as<int>(_) >= 1970
                                        && as<int>(_) <= 2038) ]
        )
        // Ensure the month actually has that many days!
        [ check( ref(days_per_month)[as<int>(month)-1] >= as<int>(day) ) ]
    ;

    smatch what;
    std::string str("99/99/9999 2/30/2006 2/28/2006");

    if(regex_search(str, what, date))
    {
        std::cout << what[0] << std::endl;
    }

The above program prints out the following:

[pre
2/28/2006
]

Notice how the inline custom assertions are used to range-check the values for
the month, day and year. The regular expression doesn't match `"99/99/9999"` or
`"2/30/2006"` because they are not valid dates. (There is no 99th month, and
February doesn't have 30 days.)

[endsect]