[/
  Copyright 2006-2007 John Maddock.
  Distributed under the Boost Software License, Version 1.0.
  (See accompanying file LICENSE_1_0.txt or copy at
  http://www.boost.org/LICENSE_1_0.txt).
]


[section:basic_syntax POSIX Basic Regular Expression Syntax]

[h3 Synopsis]

The POSIX-Basic regular expression syntax is used by the Unix utility `sed`,
and variations are used by `grep` and `emacs`.  You can construct POSIX
basic regular expressions in Boost.Regex by passing the flag `basic` to the
regex constructor (see [syntax_option_type]), for example:

   // e1 is a case sensitive POSIX-Basic expression:
   boost::regex e1(my_expression, boost::regex::basic);
   // e2 is a case insensitive POSIX-Basic expression:
   boost::regex e2(my_expression, boost::regex::basic|boost::regex::icase);

[#boost_regex.posix_basic][h3 POSIX Basic Syntax]

In POSIX-Basic regular expressions, all characters match themselves except
for the following special characters:

[pre .\[\\*^$]

[h4 Wildcard:]

The single character '.' when used outside of a character set will match any
single character except:

* The NULL character when the flag `match_not_dot_null` is passed to the
matching algorithms.
* The newline character when the flag `match_not_dot_newline` is passed to
the matching algorithms.

[h4 Anchors:]

A '^' character shall match the start of a line when used as the first
character of an expression, or the first character of a sub-expression.

A '$' character shall match the end of a line when used as the last
character of an expression, or the last character of a sub-expression.

[h4 Marked sub-expressions:]

A section beginning `\(` and ending `\)` acts as a marked sub-expression.
Whatever matched the sub-expression is split out in a separate field by the
matching algorithms.  Marked sub-expressions can also be repeated, or
referred to by a back-reference.
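
As a rough illustration of marked sub-expressions, the sketch below returns
the text that a `\(...\)` group matched as a separate field.  It is written
against `std::regex`, used here only as a stand-in so that the example builds
without Boost: its `basic` flag selects a POSIX-Basic grammar much like
`boost::regex::basic`, but the two libraries are not guaranteed to agree in
every detail.  The helper name `first_field` is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the text captured by the first marked
// sub-expression when the whole of `text` matches `pattern`, or an empty
// string otherwise (so "no match" and "empty capture" look the same here).
// Assumption: std::regex::basic stands in for boost::regex::basic.
std::string first_field(const std::string& pattern, const std::string& text)
{
   std::regex re(pattern, std::regex::basic);
   std::smatch what;
   if (std::regex_match(text, what, re) && what.size() > 1)
      return what[1].str();
   return std::string();
}
```

For instance, `first_field("\\(a*\\)b", "aaab")` yields `aaa`, since `\(a*\)`
marks the run of a's as sub-expression 1.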

[h4 Repeats:]

Any atom (a single character, a marked sub-expression, or a character class)
can be repeated with the \* operator.

For example `a*` will match the letter 'a' repeated zero or more
times (an atom repeated zero times matches an empty string), so the
expression `a*b` will match any of the following:

[pre
b
ab
aaaaaaaab
]

An atom can also be repeated with a bounded repeat:

`a\{n\}`  Matches 'a' repeated exactly n times.

`a\{n,\}`  Matches 'a' repeated n or more times.

`a\{n,m\}` Matches 'a' repeated between n and m times inclusive.

For example:

[pre ^a\{2,3\}$]

Will match either of:

[pre
aa
aaa
]

But neither of:

[pre
a
aaaa
]

It is an error to use a repeat operator if the preceding construct cannot be
repeated, for example:

[pre a\(*\)]

Will raise an error, as there is nothing for the \* operator to be applied to.

[h4 Back references:]

An escape character followed by a digit /n/, where /n/ is in the range 1-9,
matches the same string that was matched by sub-expression /n/.  For example
the expression:

[pre ^\\(a\*\\)\[\^a\]\*\\1$]

Will match the string:

[pre aaabbaaa]

But not the string:

[pre aaabba]

[h4 Character sets:]

A character set is a bracket-expression starting with \[ and ending with \];
it defines a set of characters, and matches any single character that is a
member of that set.

A bracket expression may contain any combination of the following:

[h5 Single characters:]

For example `[abc]` will match any of the characters 'a', 'b', or 'c'.

[h5 Character ranges:]

For example `[a-c]` will match any single character in the range 'a' to 'c'.
By default, for POSIX-Basic regular expressions, a character /x/ is within the
range /y/ to /z/ if it collates within that range; this results in
locale specific behavior.  This behavior can be turned off by unsetting
the `collate` option flag when constructing the regular expression,
in which case whether a character appears within
a range is determined by comparing the code points of the characters only.

[h5 Negation:]

If the bracket-expression begins with the ^ character, then it matches the
complement of the characters it contains, for example `[^a-c]` matches
any character that is not in the range a-c.

[h5 Character classes:]

An expression of the form `[[:name:]]` matches the named character class "name",
for example `[[:lower:]]` matches any lower case character.
See [link boost_regex.syntax.character_classes character class names].

[h5 Collating Elements:]

An expression of the form `[[.col.]]` matches the collating element /col/.
A collating element is any single character, or any sequence of
characters that collates as a single unit.  Collating elements may also
be used as the end point of a range, for example: `[[.ae.]-c]` matches
the character sequence "ae", plus any single character in the range "ae"-c,
assuming that "ae" is treated as a single collating element in the current locale.

Collating elements may be used in place of escapes (which are not
normally allowed inside character sets), for example `[[.^.]abc]` would
match any one of the characters 'abc^'.

As an extension, a collating element may also be specified via its
symbolic name, for example:

[pre \[\[\.NUL\.\]\]]

matches a 'NUL' character.
See [link boost_regex.syntax.collating_names collating element names].
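
The example patterns from the preceding sections (repeats, back-references
and the simpler bracket expressions) can be tried out with a sketch like the
following.  Here `std::regex` with its `basic` flag is only a stand-in,
assumed to behave like `boost::regex::basic` for these simple cases; unlike
the behavior described above it does not apply collation to ranges unless
asked to, so range results may differ from Boost outside the "C" locale.
The helper names `bre_match` and `check_examples` are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when the whole of `text` matches the
// POSIX-Basic expression `pattern`.
// Assumption: std::regex::basic stands in for boost::regex::basic.
bool bre_match(const std::string& pattern, const std::string& text)
{
   std::regex re(pattern, std::regex::basic);
   return std::regex_match(text, re);
}

// Collects examples taken from the text; returns true only if every
// one of them behaves as the text describes.
bool check_examples()
{
   return bre_match("a*b", "aaaaaaaab")                  // unbounded repeat
       && bre_match("^a\\{2,3\\}$", "aaa")               // bounded repeat
       && !bre_match("^a\\{2,3\\}$", "aaaa")
       && bre_match("^\\(a*\\)[^a]*\\1$", "aaabbaaa")    // back-reference
       && !bre_match("^\\(a*\\)[^a]*\\1$", "aaabba")
       && bre_match("[a-c]", "b")                        // character range
       && bre_match("[^a-c]", "d")                       // negation
       && bre_match("[[:lower:]]", "q")                  // character class
       && bre_match("[[:digit:]a-c]", "7");              // combination
}
```

Collating elements and symbolic names are deliberately left out of the
sketch, since their behavior depends on the locale and on the library in use.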

[h5 Equivalence classes:]

An expression of the form `[[=col=]]` matches any character or collating
element whose primary sort key is the same as that for collating element
/col/.  As with collating elements, the name /col/ may be a
[link boost_regex.syntax.collating_names collating symbolic name].
A primary sort key is one that ignores case, accents, or
locale-specific tailorings; so for example `[[=a=]]` matches any of
the characters: a, '''À''', '''Á''', '''Â''',
'''Ã''', '''Ä''', '''Å''', A, '''à''', '''á''',
'''â''', '''ã''', '''ä''' and '''å'''.
Unfortunately the implementation of this relies on the platform's
collation and localisation support; this feature cannot be relied
upon to work portably across all platforms, or even all locales on one platform.

[h5 Combinations:]

All of the above can be combined in one character set declaration, for
example: `[[:digit:]a-c[.NUL.]]`.

[h4 Escapes]

With the exception of the escape sequences \\{, \\}, \\(, and \\),
which are documented above, an escape followed by any character matches
that character.  This can be used to make the special characters

[pre .\[\\\*^$]

"ordinary".  Note that the escape character loses its special meaning
inside a character set, so `[\^]` will match either a literal '\\' or a '^'.

[h3 What Gets Matched]

When there is more than one way to match a regular expression, the
"best" possible match is obtained using the
[link boost_regex.syntax.leftmost_longest_rule leftmost-longest rule].
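
A small sketch of what the rule means in the simplest cases: the match that
starts earliest wins, and among the matches starting there the longest is
taken.  As an assumption, `std::regex` with the `basic` flag stands in for
`boost::regex::basic`, and `first_match` is a hypothetical helper.  These
cases are too simple to tell leftmost-longest apart from other matching
strategies; they only show the outcome.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: returns {offset, length} of the first match of
// `pattern` found in `text`, or {npos, 0} when nothing matches.
// Assumption: std::regex::basic stands in for boost::regex::basic.
std::pair<std::size_t, std::size_t>
first_match(const std::string& pattern, const std::string& text)
{
   std::regex re(pattern, std::regex::basic);
   std::smatch what;
   if (!std::regex_search(text, what, re))
      return std::make_pair(std::string::npos, std::size_t(0));
   return std::make_pair(static_cast<std::size_t>(what.position(0)),
                         static_cast<std::size_t>(what.length(0)));
}
```

With the text `xxaaay`, the expression `aa*` reports offset 2 and length 3
(the longest run starting at the leftmost 'a'), whereas `a*` reports offset 0
and length 0, because an empty match at the very start of the text lies
further left than the run of a's.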

[h3 Variations]

[#boost_regex.grep_syntax][h4 Grep]

When an expression is compiled with the flag `grep` set, then the
expression is treated as a newline separated list of
[link boost_regex.posix_basic POSIX-Basic expressions];
a match is found if any of the expressions in the list match, for example:

   boost::regex e("abc\ndef", boost::regex::grep);

will match either of the [link boost_regex.posix_basic POSIX-Basic expressions]
"abc" or "def".

As its name suggests, this behavior is consistent with the Unix utility grep.

[h4 emacs]

In addition to the [link boost_regex.posix_basic POSIX-Basic features]
the following characters are also special:

[table
[[Character][Description]]
[[+][repeats the preceding atom one or more times.]]
[[?][repeats the preceding atom zero or one times.]]
[[*?][A non-greedy version of *.]]
[[+?][A non-greedy version of +.]]
[[??][A non-greedy version of ?.]]
]

And the following escape sequences are also recognised:

[table
[[Escape][Description]]
[[\\|][specifies an alternative.]]
[[\\(?:  ...  \\)][is a non-marking grouping construct: it allows you to lexically group something without producing an extra sub-expression.]]
[[\\w][matches any word character.]]
[[\\W][matches any non-word character.]]
[[\\sx][matches any character in the syntax group x; the following
         emacs groupings are supported: 's', ' ', '_', 'w', '.', ')', '(', '"', '\\'', '>' and '<'.
Refer to the emacs docs for details.]]
[[\\Sx][matches any character not in the syntax grouping x.]]
[[\\c and \\C][These are not supported.]]
[[\\`][matches zero characters only at the start of a buffer (or string being matched).]]
[[\\'][matches zero characters only at the end of a buffer (or string being matched).]]
[[\\b][matches zero characters at a word boundary.]]
[[\\B][matches zero characters, not at a word boundary.]]
[[\\<][matches zero characters only at the start of a word.]]
[[\\>][matches zero characters only at the end of a word.]]
]

Finally, you should note that emacs style regular expressions are matched
according to the
[link boost_regex.syntax.perl_syntax.what_gets_matched Perl "depth first search" rules].
Emacs expressions are
matched this way because they contain Perl-like extensions that do not
interact well with the
[link boost_regex.syntax.leftmost_longest_rule POSIX-style leftmost-longest rule].

[h3 Options]

There are a [link boost_regex.ref.syntax_option_type.syntax_option_type_basic variety of flags] that may be combined with the `basic` and `grep`
options when constructing the regular expression.  In particular, note
that the
[link boost_regex.ref.syntax_option_type.syntax_option_type_basic `newline_alt`, `no_char_classes`, `no_intervals`, `bk_plus_qm`
and `bk_plus_vbar`] options all alter the syntax, while the
[link boost_regex.ref.syntax_option_type.syntax_option_type_basic `collate` and `icase` options] modify how the case and locale sensitivity
are to be applied.

[h3 References]

[@http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap09.html IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX), Base Definitions and Headers, Section 9, Regular Expressions (FWD.1).]

[@http://www.opengroup.org/onlinepubs/000095399/utilities/grep.html IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX), Shells and Utilities, Section 4, Utilities, grep (FWD.1).]

[@http://www.gnu.org/software/emacs/ Emacs Version 21.3.]

[endsect]