1<html>
2<head>
3<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
4<title>Understanding Marked Sub-Expressions and Captures</title>
5<link rel="stylesheet" href="../../../../../doc/src/boostbook.css" type="text/css">
6<meta name="generator" content="DocBook XSL Stylesheets V1.79.1">
7<link rel="home" href="../index.html" title="Boost.Regex 5.1.4">
8<link rel="up" href="../index.html" title="Boost.Regex 5.1.4">
9<link rel="prev" href="unicode.html" title="Unicode and Boost.Regex">
10<link rel="next" href="partial_matches.html" title="Partial Matches">
11</head>
12<body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF">
13<table cellpadding="2" width="100%"><tr>
14<td valign="top"><img alt="Boost C++ Libraries" width="277" height="86" src="../../../../../boost.png"></td>
15<td align="center"><a href="../../../../../index.html">Home</a></td>
16<td align="center"><a href="../../../../../libs/libraries.htm">Libraries</a></td>
17<td align="center"><a href="http://www.boost.org/users/people.html">People</a></td>
18<td align="center"><a href="http://www.boost.org/users/faq.html">FAQ</a></td>
19<td align="center"><a href="../../../../../more/index.htm">More</a></td>
20</tr></table>
21<hr>
22<div class="spirit-nav">
23<a accesskey="p" href="unicode.html"><img src="../../../../../doc/src/images/prev.png" alt="Prev"></a><a accesskey="u" href="../index.html"><img src="../../../../../doc/src/images/up.png" alt="Up"></a><a accesskey="h" href="../index.html"><img src="../../../../../doc/src/images/home.png" alt="Home"></a><a accesskey="n" href="partial_matches.html"><img src="../../../../../doc/src/images/next.png" alt="Next"></a>
24</div>
25<div class="section">
26<div class="titlepage"><div><div><h2 class="title" style="clear: both">
27<a name="boost_regex.captures"></a><a class="link" href="captures.html" title="Understanding Marked Sub-Expressions and Captures">Understanding Marked Sub-Expressions
28    and Captures</a>
29</h2></div></div></div>
30<p>
31      Captures are the iterator ranges that are "captured" by marked sub-expressions
32      as a regular expression gets matched. Each marked sub-expression can result
33      in more than one capture, if it is matched more than once. This document explains
34      how captures and marked sub-expressions in Boost.Regex are represented and
35      accessed.
36    </p>
37<h5>
38<a name="boost_regex.captures.h0"></a>
39      <span class="phrase"><a name="boost_regex.captures.marked_sub_expressions"></a></span><a class="link" href="captures.html#boost_regex.captures.marked_sub_expressions">Marked
40      sub-expressions</a>
41    </h5>
42<p>
      Each parenthesized group <code class="computeroutput"><span class="special">()</span></code> in a Perl regular expression produces an extra field, known as a
      marked sub-expression. For example, the expression:
45    </p>
46<pre class="programlisting">(\w+)\W+(\w+)</pre>
47<p>
      has two marked sub-expressions, known as $1 and $2 respectively. In addition,
      the complete match is known as $&amp;, everything before the match as
      $`, and everything after the match as $'. So if the above expression is searched
      for within <code class="computeroutput"><span class="string">"@abc def--"</span></code>,
      then we obtain:
53    </p>
54<div class="informaltable"><table class="table">
55<colgroup>
56<col>
57<col>
58</colgroup>
59<thead><tr>
60<th>
61              <p>
62                Sub-expression
63              </p>
64            </th>
65<th>
66              <p>
67                Text found
68              </p>
69            </th>
70</tr></thead>
71<tbody>
72<tr>
73<td>
74              <p>
75                $`
76              </p>
77            </td>
78<td>
79              <p>
80                "@"
81              </p>
82            </td>
83</tr>
84<tr>
85<td>
86              <p>
87                $&amp;
88              </p>
89            </td>
90<td>
91              <p>
92                "abc def"
93              </p>
94            </td>
95</tr>
96<tr>
97<td>
98              <p>
99                $1
100              </p>
101            </td>
102<td>
103              <p>
104                "abc"
105              </p>
106            </td>
107</tr>
108<tr>
109<td>
110              <p>
111                $2
112              </p>
113            </td>
114<td>
115              <p>
116                "def"
117              </p>
118            </td>
119</tr>
120<tr>
121<td>
122              <p>
123                $'
124              </p>
125            </td>
126<td>
127              <p>
128                "--"
129              </p>
130            </td>
131</tr>
132</tbody>
133</table></div>
134<p>
      In Boost.Regex all of these are accessible via the <a class="link" href="ref/match_results.html" title="match_results"><code class="computeroutput"><span class="identifier">match_results</span></code></a> class, which is filled
      in by the regular expression matching algorithms (<a class="link" href="ref/regex_search.html" title="regex_search"><code class="computeroutput"><span class="identifier">regex_search</span></code></a>, <a class="link" href="ref/regex_match.html" title="regex_match"><code class="computeroutput"><span class="identifier">regex_match</span></code></a>, or <a class="link" href="ref/regex_iterator.html" title="regex_iterator"><code class="computeroutput"><span class="identifier">regex_iterator</span></code></a>). So given:
137    </p>
138<pre class="programlisting"><span class="identifier">boost</span><span class="special">::</span><span class="identifier">match_results</span><span class="special">&lt;</span><span class="identifier">IteratorType</span><span class="special">&gt;</span> <span class="identifier">m</span><span class="special">;</span>
139</pre>
140<p>
141      The Perl and Boost.Regex equivalents are as follows:
142    </p>
143<div class="informaltable"><table class="table">
144<colgroup>
145<col>
146<col>
147</colgroup>
148<thead><tr>
149<th>
150              <p>
151                Perl
152              </p>
153            </th>
154<th>
155              <p>
156                Boost.Regex
157              </p>
158            </th>
159</tr></thead>
160<tbody>
161<tr>
162<td>
163              <p>
164                $`
165              </p>
166            </td>
167<td>
168              <p>
169                <code class="computeroutput"><span class="identifier">m</span><span class="special">.</span><span class="identifier">prefix</span><span class="special">()</span></code>
170              </p>
171            </td>
172</tr>
173<tr>
174<td>
175              <p>
176                $&amp;
177              </p>
178            </td>
179<td>
180              <p>
181                <code class="computeroutput"><span class="identifier">m</span><span class="special">[</span><span class="number">0</span><span class="special">]</span></code>
182              </p>
183            </td>
184</tr>
185<tr>
186<td>
187              <p>
188                $n
189              </p>
190            </td>
191<td>
192              <p>
193                <code class="computeroutput"><span class="identifier">m</span><span class="special">[</span><span class="identifier">n</span><span class="special">]</span></code>
194              </p>
195            </td>
196</tr>
197<tr>
198<td>
199              <p>
200                $'
201              </p>
202            </td>
203<td>
204              <p>
205                <code class="computeroutput"><span class="identifier">m</span><span class="special">.</span><span class="identifier">suffix</span><span class="special">()</span></code>
206              </p>
207            </td>
208</tr>
209</tbody>
210</table></div>
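<p>
      The mapping in the two tables above can be sketched as follows. This is an
      illustration only: it assumes <code class="computeroutput">std::regex</code> as a stand-in for
      <code class="computeroutput">boost::regex</code> (the members <code class="computeroutput">prefix()</code>,
      <code class="computeroutput">operator[]</code>, <code class="computeroutput">suffix()</code> and
      <code class="computeroutput">str()</code> carry the same names in both), and the
      <code class="computeroutput">MatchParts</code> type and <code class="computeroutput">split_match</code>
      function are hypothetical helpers, not part of either library.
    </p>

```cpp
#include <regex>
#include <string>

// Assumption: std::regex stands in for boost::regex; with Boost the same
// member names (prefix(), operator[], suffix(), str()) would be used on
// boost::smatch.

// Hypothetical helper type holding the Perl-style fields of one match.
struct MatchParts {
    std::string before;  // Perl $`
    std::string whole;   // Perl $&
    std::string first;   // Perl $1
    std::string second;  // Perl $2
    std::string after;   // Perl $'
};

// Hypothetical helper: searches text for two words separated by non-word
// characters and copies each field out of the match_results object.
MatchParts split_match(const std::string& text)
{
    static const std::regex e(R"((\w+)\W+(\w+))");
    std::smatch m;
    MatchParts parts;
    if (std::regex_search(text, m, e)) {
        parts.before = m.prefix().str();
        parts.whole  = m[0].str();
        parts.first  = m[1].str();
        parts.second = m[2].str();
        parts.after  = m.suffix().str();
    }
    return parts;  // all fields stay empty when nothing matched
}
```

<p>
      Calling <code class="computeroutput">split_match("@abc def--")</code> fills the five fields with
      the values listed in the first table of this section.
    </p>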
211<p>
      In Boost.Regex each sub-expression match is represented by a <a class="link" href="ref/sub_match.html" title="sub_match"><code class="computeroutput"><span class="identifier">sub_match</span></code></a> object. This is essentially
      a pair of iterators denoting the start and end position of the sub-expression
      match, with some additional operators provided so that objects of
      type <a class="link" href="ref/sub_match.html" title="sub_match"><code class="computeroutput"><span class="identifier">sub_match</span></code></a>
      behave a lot like a <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">basic_string</span></code>: for example, they are implicitly
      convertible to a <code class="computeroutput"><span class="identifier">basic_string</span></code>,
      and they can be compared to a string, added to a string, or streamed out to an
      output stream.
220    </p>
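<p>
      A minimal sketch of this string-like behaviour, again assuming
      <code class="computeroutput">std::regex</code> as a stand-in for Boost.Regex. The function name
      <code class="computeroutput">describe_first_word</code> is hypothetical. The standard library's
      <code class="computeroutput">sub_match</code> offers conversion, comparison and streaming, but
      not the addition operators mentioned above, so the sketch calls
      <code class="computeroutput">str()</code> before concatenating.
    </p>

```cpp
#include <regex>
#include <sstream>
#include <string>

// Assumption: std::regex stands in for boost::regex. Conversion, comparison
// and streaming of sub_match objects exist in both libraries.

// Hypothetical helper: finds the first word in text and reports on it.
std::string describe_first_word(const std::string& text)
{
    static const std::regex e(R"((\w+))");
    std::smatch m;
    if (!std::regex_search(text, m, e))
        return "no word";

    std::string word = m[1];        // implicit conversion to std::string
    bool is_abc = (m[1] == "abc");  // comparison against a string literal
    // Concatenation goes through str() because std::sub_match has no
    // operator+; the text above says Boost's sub_match can be added directly.
    std::string tagged = "<" + m[1].str() + ">";

    std::ostringstream out;
    out << "word=" << m[1]          // a sub_match streams like a string
        << " length=" << word.size()
        << " is_abc=" << std::boolalpha << is_abc
        << " tagged=" << tagged;
    return out.str();
}
```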
221<h5>
222<a name="boost_regex.captures.h1"></a>
223      <span class="phrase"><a name="boost_regex.captures.unmatched_sub_expressions"></a></span><a class="link" href="captures.html#boost_regex.captures.unmatched_sub_expressions">Unmatched
224      Sub-Expressions</a>
225    </h5>
226<p>
      When a regular expression match is found, not all of the marked
      sub-expressions need to have participated in the match. For example, the expression:
229    </p>
230<pre class="programlisting">(abc)|(def)</pre>
231<p>
232      can match either $1 or $2, but never both at the same time. In Boost.Regex
233      you can determine which sub-expressions matched by accessing the <code class="computeroutput"><span class="identifier">sub_match</span><span class="special">::</span><span class="identifier">matched</span></code> data member.
234    </p>
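<p>
      The check can be sketched as below, assuming <code class="computeroutput">std::regex</code> as a
      stand-in for Boost.Regex (both expose a public <code class="computeroutput">matched</code>
      member on <code class="computeroutput">sub_match</code>). The function name
      <code class="computeroutput">which_alternative</code> is hypothetical.
    </p>

```cpp
#include <regex>
#include <string>

// Assumption: std::regex stands in for boost::regex; sub_match::matched is a
// public data member in both.

// Hypothetical helper: returns 1 if (abc) took part in the match, 2 if (def)
// did, and 0 if the text did not match at all.
int which_alternative(const std::string& text)
{
    static const std::regex e("(abc)|(def)");
    std::smatch m;
    if (!std::regex_match(text, m, e))
        return 0;
    if (m[1].matched)
        return 1;
    if (m[2].matched)
        return 2;
    return 0;
}
```

<p>
      Only one of the two branches can report <code class="computeroutput">matched == true</code> for
      any given match, which is why the checks are made in sequence.
    </p>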
235<h5>
236<a name="boost_regex.captures.h2"></a>
237      <span class="phrase"><a name="boost_regex.captures.repeated_captures"></a></span><a class="link" href="captures.html#boost_regex.captures.repeated_captures">Repeated
238      Captures</a>
239    </h5>
240<p>
      When a marked sub-expression is repeated, the sub-expression gets "captured"
      multiple times. Normally, however, only the final capture is available. For example,
      if
244    </p>
<pre class="programlisting">(?:(\w+)\W*)+</pre>
246<p>
247      is matched against
248    </p>
249<pre class="programlisting">one fine day</pre>
250<p>
      then $1 will contain the string "day", and all the previous captures
      will have been forgotten.
253    </p>
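<p>
      A minimal sketch of this "last capture wins" behaviour, assuming
      <code class="computeroutput">std::regex</code> as a stand-in for Boost.Regex. The function name
      <code class="computeroutput">last_capture</code> is hypothetical. The pattern in the sketch ends
      each repetition with <code class="computeroutput">\W*</code>, so the final word needs no trailing
      separator to take part in the match.
    </p>

```cpp
#include <regex>
#include <string>

// Assumption: std::regex stands in for boost::regex. Without the match_extra
// feature described in this section, only the final capture of a repeated
// sub-expression is kept.

// Hypothetical helper: matches a whole line of words and returns what $1
// holds afterwards, or an empty string if the text does not match.
std::string last_capture(const std::string& text)
{
    static const std::regex e(R"((?:(\w+)\W*)+)");
    std::smatch m;
    if (std::regex_match(text, m, e))
        return m[1].str();  // earlier captures ("one", "fine") are gone
    return std::string();
}
```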
254<p>
      However, Boost.Regex has an experimental feature that allows all the capture
      information to be retained. It is accessed via either the <code class="computeroutput"><span class="identifier">match_results</span><span class="special">::</span><span class="identifier">captures</span></code> member function or the <code class="computeroutput"><span class="identifier">sub_match</span><span class="special">::</span><span class="identifier">captures</span></code> member function. These functions
      return a container holding the sequence of all the captures obtained during
      the regular expression matching. The following example program shows how this
      information may be used:
260    </p>
<pre class="programlisting"><span class="preprocessor">#include</span> <span class="special">&lt;</span><span class="identifier">boost</span><span class="special">/</span><span class="identifier">regex</span><span class="special">.</span><span class="identifier">hpp</span><span class="special">&gt;</span>
<span class="preprocessor">#include</span> <span class="special">&lt;</span><span class="identifier">iostream</span><span class="special">&gt;</span>
<span class="preprocessor">#include</span> <span class="special">&lt;</span><span class="identifier">string</span><span class="special">&gt;</span>

<span class="comment">// Requires a build with BOOST_REGEX_MATCH_EXTRA defined (see the end of this section).</span>
<span class="keyword">void</span> <span class="identifier">print_captures</span><span class="special">(</span><span class="keyword">const</span> <span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span><span class="special">&amp;</span> <span class="identifier">regx</span><span class="special">,</span> <span class="keyword">const</span> <span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span><span class="special">&amp;</span> <span class="identifier">text</span><span class="special">)</span>
<span class="special">{</span>
   <span class="identifier">boost</span><span class="special">::</span><span class="identifier">regex</span> <span class="identifier">e</span><span class="special">(</span><span class="identifier">regx</span><span class="special">);</span>
   <span class="identifier">boost</span><span class="special">::</span><span class="identifier">smatch</span> <span class="identifier">what</span><span class="special">;</span>
   <span class="identifier">std</span><span class="special">::</span><span class="identifier">cout</span> <span class="special">&lt;&lt;</span> <span class="string">"Expression:  \""</span> <span class="special">&lt;&lt;</span> <span class="identifier">regx</span> <span class="special">&lt;&lt;</span> <span class="string">"\"\n"</span><span class="special">;</span>
   <span class="identifier">std</span><span class="special">::</span><span class="identifier">cout</span> <span class="special">&lt;&lt;</span> <span class="string">"Text:        \""</span> <span class="special">&lt;&lt;</span> <span class="identifier">text</span> <span class="special">&lt;&lt;</span> <span class="string">"\"\n"</span><span class="special">;</span>
   <span class="comment">// match_extra asks the matcher to retain every capture, not just the last one.</span>
   <span class="keyword">if</span><span class="special">(</span><span class="identifier">boost</span><span class="special">::</span><span class="identifier">regex_match</span><span class="special">(</span><span class="identifier">text</span><span class="special">,</span> <span class="identifier">what</span><span class="special">,</span> <span class="identifier">e</span><span class="special">,</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">match_extra</span><span class="special">))</span>
   <span class="special">{</span>
      <span class="keyword">unsigned</span> <span class="identifier">i</span><span class="special">,</span> <span class="identifier">j</span><span class="special">;</span>
      <span class="identifier">std</span><span class="special">::</span><span class="identifier">cout</span> <span class="special">&lt;&lt;</span> <span class="string">"** Match found **\n   Sub-Expressions:\n"</span><span class="special">;</span>
      <span class="keyword">for</span><span class="special">(</span><span class="identifier">i</span> <span class="special">=</span> <span class="number">0</span><span class="special">;</span> <span class="identifier">i</span> <span class="special">&lt;</span> <span class="identifier">what</span><span class="special">.</span><span class="identifier">size</span><span class="special">();</span> <span class="special">++</span><span class="identifier">i</span><span class="special">)</span>
         <span class="identifier">std</span><span class="special">::</span><span class="identifier">cout</span> <span class="special">&lt;&lt;</span> <span class="string">"      $"</span> <span class="special">&lt;&lt;</span> <span class="identifier">i</span> <span class="special">&lt;&lt;</span> <span class="string">" = \""</span> <span class="special">&lt;&lt;</span> <span class="identifier">what</span><span class="special">[</span><span class="identifier">i</span><span class="special">]</span> <span class="special">&lt;&lt;</span> <span class="string">"\"\n"</span><span class="special">;</span>
      <span class="identifier">std</span><span class="special">::</span><span class="identifier">cout</span> <span class="special">&lt;&lt;</span> <span class="string">"   Captures:\n"</span><span class="special">;</span>
      <span class="comment">// captures(i) holds every capture made by sub-expression i, in order.</span>
      <span class="keyword">for</span><span class="special">(</span><span class="identifier">i</span> <span class="special">=</span> <span class="number">0</span><span class="special">;</span> <span class="identifier">i</span> <span class="special">&lt;</span> <span class="identifier">what</span><span class="special">.</span><span class="identifier">size</span><span class="special">();</span> <span class="special">++</span><span class="identifier">i</span><span class="special">)</span>
      <span class="special">{</span>
         <span class="identifier">std</span><span class="special">::</span><span class="identifier">cout</span> <span class="special">&lt;&lt;</span> <span class="string">"      $"</span> <span class="special">&lt;&lt;</span> <span class="identifier">i</span> <span class="special">&lt;&lt;</span> <span class="string">" = {"</span><span class="special">;</span>
         <span class="keyword">for</span><span class="special">(</span><span class="identifier">j</span> <span class="special">=</span> <span class="number">0</span><span class="special">;</span> <span class="identifier">j</span> <span class="special">&lt;</span> <span class="identifier">what</span><span class="special">.</span><span class="identifier">captures</span><span class="special">(</span><span class="identifier">i</span><span class="special">).</span><span class="identifier">size</span><span class="special">();</span> <span class="special">++</span><span class="identifier">j</span><span class="special">)</span>
         <span class="special">{</span>
            <span class="keyword">if</span><span class="special">(</span><span class="identifier">j</span><span class="special">)</span>
               <span class="identifier">std</span><span class="special">::</span><span class="identifier">cout</span> <span class="special">&lt;&lt;</span> <span class="string">", "</span><span class="special">;</span>
            <span class="keyword">else</span>
               <span class="identifier">std</span><span class="special">::</span><span class="identifier">cout</span> <span class="special">&lt;&lt;</span> <span class="string">" "</span><span class="special">;</span>
            <span class="identifier">std</span><span class="special">::</span><span class="identifier">cout</span> <span class="special">&lt;&lt;</span> <span class="string">"\""</span> <span class="special">&lt;&lt;</span> <span class="identifier">what</span><span class="special">.</span><span class="identifier">captures</span><span class="special">(</span><span class="identifier">i</span><span class="special">)[</span><span class="identifier">j</span><span class="special">]</span> <span class="special">&lt;&lt;</span> <span class="string">"\""</span><span class="special">;</span>
         <span class="special">}</span>
         <span class="identifier">std</span><span class="special">::</span><span class="identifier">cout</span> <span class="special">&lt;&lt;</span> <span class="string">" }\n"</span><span class="special">;</span>
      <span class="special">}</span>
   <span class="special">}</span>
   <span class="keyword">else</span>
   <span class="special">{</span>
      <span class="identifier">std</span><span class="special">::</span><span class="identifier">cout</span> <span class="special">&lt;&lt;</span> <span class="string">"** No Match found **\n"</span><span class="special">;</span>
   <span class="special">}</span>
<span class="special">}</span>

<span class="keyword">int</span> <span class="identifier">main</span><span class="special">(</span><span class="keyword">int</span> <span class="special">,</span> <span class="keyword">char</span><span class="special">*</span> <span class="special">[])</span>
<span class="special">{</span>
   <span class="identifier">print_captures</span><span class="special">(</span><span class="string">"(([[:lower:]]+)|([[:upper:]]+))+"</span><span class="special">,</span> <span class="string">"aBBcccDDDDDeeeeeeee"</span><span class="special">);</span>
   <span class="identifier">print_captures</span><span class="special">(</span><span class="string">"(.*)bar|(.*)bah"</span><span class="special">,</span> <span class="string">"abcbar"</span><span class="special">);</span>
   <span class="identifier">print_captures</span><span class="special">(</span><span class="string">"(.*)bar|(.*)bah"</span><span class="special">,</span> <span class="string">"abcbah"</span><span class="special">);</span>
   <span class="identifier">print_captures</span><span class="special">(</span><span class="string">"^(?:(\\w+)|(?&gt;\\W+))*$"</span><span class="special">,</span>
      <span class="string">"now is the time for all good men to come to the aid of the party"</span><span class="special">);</span>
   <span class="keyword">return</span> <span class="number">0</span><span class="special">;</span>
<span class="special">}</span>
</pre>
307<p>
308      Which produces the following output:
309    </p>
<pre class="programlisting">Expression:  "(([[:lower:]]+)|([[:upper:]]+))+"
Text:        "aBBcccDDDDDeeeeeeee"
** Match found **
   Sub-Expressions:
      $0 = "aBBcccDDDDDeeeeeeee"
      $1 = "eeeeeeee"
      $2 = "eeeeeeee"
      $3 = "DDDDD"
   Captures:
      $0 = { "aBBcccDDDDDeeeeeeee" }
      $1 = { "a", "BB", "ccc", "DDDDD", "eeeeeeee" }
      $2 = { "a", "ccc", "eeeeeeee" }
      $3 = { "BB", "DDDDD" }
Expression:  "(.*)bar|(.*)bah"
Text:        "abcbar"
** Match found **
   Sub-Expressions:
      $0 = "abcbar"
      $1 = "abc"
      $2 = ""
   Captures:
      $0 = { "abcbar" }
      $1 = { "abc" }
      $2 = { }
Expression:  "(.*)bar|(.*)bah"
Text:        "abcbah"
** Match found **
   Sub-Expressions:
      $0 = "abcbah"
      $1 = ""
      $2 = "abc"
   Captures:
      $0 = { "abcbah" }
      $1 = { }
      $2 = { "abc" }
Expression:  "^(?:(\w+)|(?&gt;\W+))*$"
Text:        "now is the time for all good men to come to the aid of the party"
** Match found **
   Sub-Expressions:
      $0 = "now is the time for all good men to come to the aid of the party"
      $1 = "party"
   Captures:
      $0 = { "now is the time for all good men to come to the aid of the party" }
      $1 = { "now", "is", "the", "time", "for", "all", "good", "men", "to",
         "come", "to", "the", "aid", "of", "the", "party" }
</pre>
356<p>
      Unfortunately, enabling this feature has an impact on performance (even if you
      don't use it), and a much bigger impact if you do use it. Therefore, to use
      this feature you need to:
360    </p>
361<div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; ">
362<li class="listitem">
          Define BOOST_REGEX_MATCH_EXTRA for all translation units, including the
          library source (the best way to do this is to uncomment this define in
          boost/regex/user.hpp and then rebuild everything).
366        </li>
367<li class="listitem">
368          Pass the match_extra flag to the particular algorithms where you actually
369          need the captures information (regex_search, regex_match, or regex_iterator).
370        </li>
371</ul></div>
372</div>
373<table xmlns:rev="http://www.cs.rpi.edu/~gregod/boost/tools/doc/revision" width="100%"><tr>
374<td align="left"></td>
375<td align="right"><div class="copyright-footer">Copyright © 1998-2013 John Maddock<p>
376        Distributed under the Boost Software License, Version 1.0. (See accompanying
377        file LICENSE_1_0.txt or copy at <a href="http://www.boost.org/LICENSE_1_0.txt" target="_top">http://www.boost.org/LICENSE_1_0.txt</a>)
378      </p>
379</div></td>
380</tr></table>
381<hr>
382<div class="spirit-nav">
383<a accesskey="p" href="unicode.html"><img src="../../../../../doc/src/images/prev.png" alt="Prev"></a><a accesskey="u" href="../index.html"><img src="../../../../../doc/src/images/up.png" alt="Up"></a><a accesskey="h" href="../index.html"><img src="../../../../../doc/src/images/home.png" alt="Home"></a><a accesskey="n" href="partial_matches.html"><img src="../../../../../doc/src/images/next.png" alt="Next"></a>
384</div>
385</body>
386</html>
387