[/
 / Copyright (c) 2008 Eric Niebler
 /
 / Distributed under the Boost Software License, Version 1.0. (See accompanying
 / file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
 /]

[section Static Regexes]

[h2 Overview]

The feature that really sets xpressive apart from other C/C++ regular
expression libraries is the ability to author a regular expression using C++
expressions. xpressive achieves this through operator overloading, using a
technique called ['expression templates] to embed a mini-language dedicated
to pattern matching within C++. These "static regexes" have many advantages
over their string-based brethren. In particular, static regexes:

* are syntax-checked at compile-time; they will never fail at run-time due to
  a syntax error.
* can naturally refer to other C++ data and code, including other regexes,
  making it simple to build grammars out of regular expressions and bind
  user-defined actions that execute when parts of your regex match.
* are statically bound for better inlining and optimization. Static regexes
  require no state tables, virtual functions, byte-code or calls through
  function pointers that cannot be resolved at compile time.
* are not limited to searching for patterns in strings. You can declare a
  static regex that finds patterns in an array of integers, for instance.

Since we compose static regexes using C++ expressions, we are constrained by
the rules for legal C++ expressions. Unfortunately, that means that
"classic" regular expression syntax cannot always be mapped cleanly into
C++. Rather, we map the regex ['constructs], picking new syntax that is
legal C++.

[h2 Construction and Assignment]

You create a static regex by assigning one to an object of type _basic_regex_.
For instance, the following defines a regex that can be used to find patterns
in objects of type `std::string`:

    sregex re = '$' >> +_d >> '.' >> _d >> _d;

Assignment works similarly.

[h2 Character and String Literals]

In static regexes, character and string literals match themselves. For
instance, in the regex above, `'$'` and `'.'` match the characters `'$'` and
`'.'` respectively. Don't be confused by the fact that [^$] and [^.] are
meta-characters in Perl. In xpressive, literals always represent themselves.

When using literals in static regexes, you must take care that at least one
operand is not a literal. For instance, the following are ['not] valid
regexes:

    sregex re1 = 'a' >> 'b';         // ERROR!
    sregex re2 = +'a';               // ERROR!

The two operands to the binary `>>` operator are both literals, and the
operand of the unary `+` operator is also a literal, so these statements
will call the native C++ binary right-shift and unary plus operators,
respectively. That's not what we want. To get operator overloading to kick
in, at least one operand must be a user-defined type. We can use xpressive's
`as_xpr()` helper function to "taint" an expression with regex-ness, forcing
operator overloading to find the correct operators. The two regexes above
should be written as:

    sregex re1 = as_xpr('a') >> 'b'; // OK
    sregex re2 = +as_xpr('a');       // OK

[h2 Sequencing and Alternation]

As you've probably already noticed, sub-expressions in static regexes must
be separated by the sequencing operator, `>>`. You can read this operator as
"followed by".

    // Match an 'a' followed by a digit
    sregex re = 'a' >> _d;

Alternation works just as it does in Perl with the `|` operator. You can
read this operator as "or". For example:

    // match a digit character or a word character one or more times
    sregex re = +( _d | _w );

[h2 Grouping and Captures]

In Perl, parentheses `()` have special meaning. They group, but as a
side-effect they also create back\-references like [^$1] and [^$2].
In C++, parentheses only group \-\- there is no way to give them
side\-effects. To get the same effect, we use the special `s1`, `s2`, etc.
tokens. Assigning to one creates a back-reference. You can then use the
back-reference later in your expression, like using [^\1] and [^\2] in Perl.
For example, consider the following regex, which finds matching HTML tags:

    "<(\\w+)>.*?</\\1>"

In static xpressive, this would be:

    '<' >> (s1= +_w) >> '>' >> -*_ >> "</" >> s1 >> '>'

Notice how you capture a back-reference by assigning to `s1`, and then you
use `s1` later in the pattern to find the matching end tag.

[tip [*Grouping without capturing a back-reference] \n\n In
xpressive, if you just want grouping without capturing a back-reference, you
can just use `()` without `s1`. That is the equivalent of Perl's [^(?:)]
non-capturing grouping construct.]

[h2 Case-Insensitivity and Internationalization]

Perl lets you make part of your regular expression case-insensitive by using
the [^(?i:)] pattern modifier. xpressive also has a case-insensitivity
pattern modifier, called `icase`. You can use it as follows:

    sregex re = "this" >> icase( "that" );

In this regular expression, `"this"` will be matched exactly, but `"that"`
will be matched irrespective of case.

Case-insensitive regular expressions raise the issue of
internationalization: how should case-insensitive character comparisons be
evaluated? Also, many character classes are locale-specific. Which
characters are matched by `digit` and which are matched by `alpha`? The
answer depends on the `std::locale` object the regular expression object is
using. By default, all regular expression objects use the global locale.
You can override the default by using the `imbue()` pattern modifier, as
follows:

    std::locale my_locale = /* initialize a std::locale object */;
    sregex re = imbue( my_locale )( +alpha >> +digit );

This regular expression will evaluate `alpha` and `digit` according to
`my_locale`. See the section on [link boost_xpressive.user_s_guide.localization_and_regex_traits
Localization and Regex Traits] for more information about how to customize
the behavior of your regexes.

[h2 Static xpressive Syntax Cheat Sheet]

The table below lists the familiar regex constructs and their equivalents in
static xpressive.

[def _s1_ [globalref boost::xpressive::s1 s1]]
[def _bos_ [globalref boost::xpressive::bos bos]]
[def _eos_ [globalref boost::xpressive::eos eos]]
[def _b_ [globalref boost::xpressive::_b _b]]
[def _n_ [globalref boost::xpressive::_n _n]]
[def _ln_ [globalref boost::xpressive::_ln _ln]]
[def _d_ [globalref boost::xpressive::_d _d]]
[def _w_ [globalref boost::xpressive::_w _w]]
[def _s_ [globalref boost::xpressive::_s _s]]
[def _alnum_ [globalref boost::xpressive::alnum alnum]]
[def _alpha_ [globalref boost::xpressive::alpha alpha]]
[def _blank_ [globalref boost::xpressive::blank blank]]
[def _cntrl_ [globalref boost::xpressive::cntrl cntrl]]
[def _digit_ [globalref boost::xpressive::digit digit]]
[def _graph_ [globalref boost::xpressive::graph graph]]
[def _lower_ [globalref boost::xpressive::lower lower]]
[def _print_ [globalref boost::xpressive::print print]]
[def _punct_ [globalref boost::xpressive::punct punct]]
[def _space_ [globalref boost::xpressive::space space]]
[def _upper_ [globalref boost::xpressive::upper upper]]
[def _xdigit_ [globalref boost::xpressive::xdigit xdigit]]
[def _set_ [globalref boost::xpressive::set set]]
[def _repeat_ [funcref boost::xpressive::repeat repeat]]
[def _range_ [funcref boost::xpressive::range range]]
[def _icase_ [funcref boost::xpressive::icase icase]]
[def _before_ [funcref boost::xpressive::before before]]
[def _after_ [funcref boost::xpressive::after after]]
[def _keep_ [funcref boost::xpressive::keep keep]]

[table Perl syntax vs. Static xpressive syntax
    [[Perl] [Static xpressive] [Meaning]]
    [[[^.]] [[globalref boost::xpressive::_ `_`]] [any character (assuming Perl's /s modifier).]]
    [[[^ab]] [`a >> b`] [sequencing of [^a] and [^b] sub-expressions.]]
    [[[^a|b]] [`a | b`] [alternation of [^a] and [^b] sub-expressions.]]
    [[[^(a)]] [`(_s1_= a)`] [group and capture a back-reference.]]
    [[[^(?:a)]] [`(a)`] [group and do not capture a back-reference.]]
    [[[^\1]] [`_s1_`] [a previously captured back-reference.]]
    [[[^a*]] [`*a`] [zero or more times, greedy.]]
    [[[^a+]] [`+a`] [one or more times, greedy.]]
    [[[^a?]] [`!a`] [zero or one time, greedy.]]
    [[[^a{n,m}]] [`_repeat_<n,m>(a)`] [between [^n] and [^m] times, greedy.]]
    [[[^a*?]] [`-*a`] [zero or more times, non-greedy.]]
    [[[^a+?]] [`-+a`] [one or more times, non-greedy.]]
    [[[^a??]] [`-!a`] [zero or one time, non-greedy.]]
    [[[^a{n,m}?]] [`-_repeat_<n,m>(a)`] [between [^n] and [^m] times, non-greedy.]]
    [[[^^]] [`_bos_`] [beginning of sequence assertion.]]
    [[[^$]] [`_eos_`] [end of sequence assertion.]]
    [[[^\b]] [`_b_`] [word boundary assertion.]]
    [[[^\B]] [`~_b_`] [not word boundary assertion.]]
    [[[^\\n]] [`_n_`] [literal newline.]]
    [[[^.]] [`~_n_`] [any character except a literal newline (without Perl's /s modifier).]]
    [[[^\\r?\\n|\\r]] [`_ln_`] [logical newline.]]
    [[[^\[^\\r\\n\]]] [`~_ln_`] [any single character not a logical newline.]]
    [[[^\w]] [`_w_`] [a word character, equivalent to set\[alnum | '_'\].]]
    [[[^\W]] [`~_w_`] [not a word character, equivalent to ~set\[alnum | '_'\].]]
    [[[^\d]] [`_d_`] [a digit character.]]
    [[[^\D]] [`~_d_`] [not a digit character.]]
    [[[^\s]] [`_s_`] [a space character.]]
    [[[^\S]] [`~_s_`] [not a space character.]]
    [[[^\[:alnum:\]]] [`_alnum_`] [an alpha-numeric character.]]
    [[[^\[:alpha:\]]] [`_alpha_`] [an alphabetic character.]]
    [[[^\[:blank:\]]] [`_blank_`] [a horizontal white-space character.]]
    [[[^\[:cntrl:\]]] [`_cntrl_`] [a control character.]]
    [[[^\[:digit:\]]] [`_digit_`] [a digit character.]]
    [[[^\[:graph:\]]] [`_graph_`] [a graphable character.]]
    [[[^\[:lower:\]]] [`_lower_`] [a lower-case character.]]
    [[[^\[:print:\]]] [`_print_`] [a printing character.]]
    [[[^\[:punct:\]]] [`_punct_`] [a punctuation character.]]
    [[[^\[:space:\]]] [`_space_`] [a white-space character.]]
    [[[^\[:upper:\]]] [`_upper_`] [an upper-case character.]]
    [[[^\[:xdigit:\]]] [`_xdigit_`] [a hexadecimal digit character.]]
    [[[^\[0-9\]]] [`_range_('0','9')`] [characters in range `'0'` through `'9'`.]]
    [[[^\[abc\]]] [`as_xpr('a') | 'b' |'c'`] [characters `'a'`, `'b'`, or `'c'`.]]
    [[[^\[abc\]]] [`(_set_= 'a','b','c')`] [['same as above]]]
    [[[^\[0-9abc\]]] [`_set_[ _range_('0','9') | 'a' | 'b' | 'c' ]`] [characters `'a'`, `'b'`, `'c'` or in range `'0'` through `'9'`.]]
    [[[^\[0-9abc\]]] [`_set_[ _range_('0','9') | (_set_= 'a','b','c') ]`] [['same as above]]]
    [[[^\[^abc\]]] [`~(_set_= 'a','b','c')`] [not characters `'a'`, `'b'`, or `'c'`.]]
    [[[^(?i:['stuff])]] [`_icase_(`[^['stuff]]`)`] [match ['stuff] disregarding case.]]
    [[[^(?>['stuff])]] [`_keep_(`[^['stuff]]`)`] [independent sub-expression, match ['stuff] and turn off backtracking.]]
    [[[^(?=['stuff])]] [`_before_(`[^['stuff]]`)`] [positive look-ahead assertion, match if before ['stuff] but don't include ['stuff] in the match.]]
    [[[^(?!['stuff])]] [`~_before_(`[^['stuff]]`)`] [negative look-ahead assertion, match if not before ['stuff].]]
    [[[^(?<=['stuff])]] [`_after_(`[^['stuff]]`)`] [positive look-behind assertion, match if after ['stuff] but don't include ['stuff] in the match. (['stuff] must be constant-width.)]]
    [[[^(?<!['stuff])]] [`~_after_(`[^['stuff]]`)`] [negative look-behind assertion, match if not after ['stuff]. (['stuff] must be constant-width.)]]
    [[[^(?P<['name]>['stuff])]] [`_mark_tag_ `[^['name]]`(`['n]`);`\n ...\n `(`[^['name]]`= `[^['stuff]]`)`] [Create a named capture.]]
    [[[^(?P=['name])]] [`_mark_tag_ `[^['name]]`(`['n]`);`\n ...\n [^['name]]] [Refer back to a previously created named capture.]]
]
\n

[endsect]