[/
  Copyright 2006-2007 John Maddock.
  Distributed under the Boost Software License, Version 1.0.
  (See accompanying file LICENSE_1_0.txt or copy at
  http://www.boost.org/LICENSE_1_0.txt).
]


[section:basic_syntax POSIX Basic Regular Expression Syntax]

[h3 Synopsis]

The POSIX-Basic regular expression syntax is used by the Unix utility `sed`,
and variations are used by `grep` and `emacs`.  You can construct POSIX
basic regular expressions in Boost.Regex by passing the flag `basic` to the
regex constructor (see [syntax_option_type]), for example:

   // e1 is a case sensitive POSIX-Basic expression:
   boost::regex e1(my_expression, boost::regex::basic);
   // e2 is a case insensitive POSIX-Basic expression:
   boost::regex e2(my_expression, boost::regex::basic|boost::regex::icase);

[#boost_regex.posix_basic][h3 POSIX Basic Syntax]

In POSIX-Basic regular expressions, all characters match themselves except
for the following special characters:

[pre .\[\\*^$]

[h4 Wildcard:]

The single character '.' when used outside of a character set will match any
single character except:

* The NULL character when the flag `match_not_dot_null` is passed to the
matching algorithms.
* The newline character when the flag `match_not_dot_newline` is passed to
the matching algorithms.

[h4 Anchors:]

A '^' character shall match the start of a line when used as the first
character of an expression, or the first character of a sub-expression.

A '$' character shall match the end of a line when used as the last
character of an expression, or the last character of a sub-expression.

[h4 Marked sub-expressions:]

A section beginning `\(` and ending `\)` acts as a marked sub-expression.
Whatever matched the sub-expression is split out in a separate field by the
matching algorithms.  Marked sub-expressions can also be repeated, or
referred-to by a back-reference.

[h4 Repeats:]

Any atom (a single character, a marked sub-expression, or a character class)
can be repeated with the \* operator.

For example `a*` will match the letter 'a' repeated zero or more
times (an atom repeated zero times matches an empty string), so the
expression `a*b` will match any of the following:

[pre
b
ab
aaaaaaaab
]

An atom can also be repeated with a bounded repeat:

`a\{n\}`  Matches 'a' repeated exactly n times.

`a\{n,\}`  Matches 'a' repeated n or more times.

`a\{n, m\}`  Matches 'a' repeated between n and m times inclusive.

For example:

[pre ^a\{2,3\}$]

Will match either of:

[pre
aa
aaa
]

But neither of:

[pre
a
aaaa
]

It is an error to use a repeat operator if the preceding construct cannot be
repeated, for example:

[pre a\(*\)]

Will raise an error, as there is nothing for the \* operator to be applied to.

[h4 Back references:]

An escape character followed by a digit /n/, where /n/ is in the range 1-9,
matches the same string that was matched by sub-expression /n/.  For example
the expression:

[pre ^\\(a\*\\)\[\^a\]\*\\1$]

Will match the string:

[pre aaabbaaa]

But not the string:

[pre aaabba]

[h4 Character sets:]

A character set is a bracket-expression starting with \[ and ending with \];
it defines a set of characters, and matches any single character that is a
member of that set.

A bracket expression may contain any combination of the following:

[h5 Single characters:]

For example `[abc]`, will match any of the characters 'a', 'b', or 'c'.

[h5 Character ranges:]

For example `[a-c]` will match any single character in the range 'a' to 'c'.
By default, for POSIX-Basic regular expressions, a character /x/ is within the
range /y/ to /z/, if it collates within that range; this results in
locale specific behavior.  This behavior can be turned off by unsetting
the `collate` option flag when constructing the regular expression,
in which case whether a character appears within
a range is determined by comparing the code points of the characters only.

[h5 Negation:]

If the bracket-expression begins with the ^ character, then it matches the
complement of the characters it contains, for example `[^a-c]` matches
any character that is not in the range a-c.

[h5 Character classes:]

An expression of the form `[[:name:]]` matches the named character class "name",
for example `[[:lower:]]` matches any lower case character.
See [link boost_regex.syntax.character_classes character class names].

[h5 Collating Elements:]

An expression of the form `[[.col.]]` matches the collating element /col/.
A collating element is any single character, or any sequence of
characters that collates as a single unit.  Collating elements may also
be used as the end point of a range, for example: `[[.ae.]-c]` matches
the character sequence "ae", plus any single character in the range "ae"-c,
assuming that "ae" is treated as a single collating element in the current locale.

Collating elements may be used in place of escapes (which are not
normally allowed inside character sets), for example `[[.^.]abc]` would
match any one of the characters 'a', 'b', 'c' or '^'.

As an extension, a collating element may also be specified via its
symbolic name, for example:

[pre \[\[\.NUL\.\]\]]

matches a 'NUL' character.
See [link boost_regex.syntax.collating_names collating element names].

[h5 Equivalence classes:]

An expression of the form `[[=col=]]`, matches any character or collating
element whose primary sort key is the same as that for collating element
/col/, as with collating elements the name /col/ may be a
[link boost_regex.syntax.collating_names collating symbolic name].
A primary sort key is one that ignores case, accents, or
locale-specific tailorings; so for example `[[=a=]]` matches any of
the characters: a, '''À''', '''Á''', '''Â''',
'''Ã''', '''Ä''', '''Å''', A, '''à''', '''á''',
'''â''', '''ã''', '''ä''' and '''å'''.
Unfortunately implementation of this is reliant on the platform's
collation and localisation support; this feature cannot be relied
upon to work portably across all platforms, or even all locales on one platform.

[h5 Combinations:]

All of the above can be combined in one character set declaration, for
example: `[[:digit:]a-c[.NUL.]]`.

[h4 Escapes]

With the exception of the escape sequences \\{, \\}, \\(, and \\),
which are documented above, an escape followed by any character matches
that character.  This can be used to make the special characters

[pre .\[\\\*^$]

"ordinary".  Note that the escape character loses its special meaning
inside a character set, so `[\^]` will match either a literal '\\' or a '^'.

[h3 What Gets Matched]

When there is more than one way to match a regular expression, the
"best" possible match is obtained using the
[link boost_regex.syntax.leftmost_longest_rule leftmost-longest rule].

[h3 Variations]

[#boost_regex.grep_syntax][h4 Grep]

When an expression is compiled with the flag `grep` set, then the
expression is treated as a newline separated list of
[link boost_regex.posix_basic POSIX-Basic expressions];
a match is found if any of the expressions in the list match, for example:

   boost::regex e("abc\ndef", boost::regex::grep);

will match either of the [link boost_regex.posix_basic POSIX-Basic expressions]
"abc" or "def".

As its name suggests, this behavior is consistent with the Unix utility grep.

[h4 emacs]

In addition to the [link boost_regex.posix_basic POSIX-Basic features]
the following characters are also special:

[table
[[Character][Description]]
[[+][repeats the preceding atom one or more times.]]
[[?][repeats the preceding atom zero or one times.]]
[[*?][A non-greedy version of *.]]
[[+?][A non-greedy version of +.]]
[[??][A non-greedy version of ?.]]
]

And the following escape sequences are also recognised:

[table
[[Escape][Description]]
[[\\|][specifies an alternative.]]
[[\\(?:  ...  \\)][is a non-marking grouping construct: it allows you to lexically group something without producing an extra sub-expression.]]
[[\\w][matches any word character.]]
[[\\W][matches any non-word character.]]
[[\\sx][matches any character in the syntax group x; the following
   emacs groupings are supported: 's', ' ', '_', 'w', '.', ')', '(', '"', '\\'', '>' and '<'.  Refer to the emacs docs for details.]]
[[\\Sx][matches any character not in the syntax grouping x.]]
[[\\c and \\C][These are not supported.]]
[[\\`][matches zero characters only at the start of a buffer (or string being matched).]]
[[\\'][matches zero characters only at the end of a buffer (or string being matched).]]
[[\\b][matches zero characters at a word boundary.]]
[[\\B][matches zero characters, not at a word boundary.]]
[[\\<][matches zero characters only at the start of a word.]]
[[\\>][matches zero characters only at the end of a word.]]
]

Finally, you should note that emacs style regular expressions are matched
according to the
[link boost_regex.syntax.perl_syntax.what_gets_matched Perl "depth first search" rules].
Emacs expressions are
matched this way because they contain Perl-like extensions that do not
interact well with the
[link boost_regex.syntax.leftmost_longest_rule POSIX-style leftmost-longest rule].

[h3 Options]

There are a [link boost_regex.ref.syntax_option_type.syntax_option_type_basic variety of flags] that may be combined with the `basic` and `grep`
options when constructing the regular expression.  In particular, note
that the
[link boost_regex.ref.syntax_option_type.syntax_option_type_basic `newline_alt`, `no_char_classes`, `no_intervals`, `bk_plus_qm`
and `bk_vbar`] options all alter the syntax, while the
[link boost_regex.ref.syntax_option_type.syntax_option_type_basic `collate` and `icase` options] modify how the case and locale sensitivity
are to be applied.

[h3 References]

[@http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap09.html IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX), Base Definitions and Headers, Section 9, Regular Expressions (FWD.1).]

[@http://www.opengroup.org/onlinepubs/000095399/utilities/grep.html IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX), Shells and Utilities, Section 4, Utilities, grep (FWD.1).]

[@http://www.gnu.org/software/emacs/ Emacs Version 21.3.]

[endsect]