<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Introduction and Overview</title>
<link rel="stylesheet" href="../../../../../doc/src/boostbook.css" type="text/css">
<meta name="generator" content="DocBook XSL Stylesheets V1.79.1">
<link rel="home" href="../index.html" title="Boost.Regex 5.1.4">
<link rel="up" href="../index.html" title="Boost.Regex 5.1.4">
<link rel="prev" href="install.html" title="Building and Installing the Library">
<link rel="next" href="unicode.html" title="Unicode and Boost.Regex">
</head>
<body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF">
<table cellpadding="2" width="100%"><tr>
<td valign="top"><img alt="Boost C++ Libraries" width="277" height="86" src="../../../../../boost.png"></td>
<td align="center"><a href="../../../../../index.html">Home</a></td>
<td align="center"><a href="../../../../../libs/libraries.htm">Libraries</a></td>
<td align="center"><a href="http://www.boost.org/users/people.html">People</a></td>
<td align="center"><a href="http://www.boost.org/users/faq.html">FAQ</a></td>
<td align="center"><a href="../../../../../more/index.htm">More</a></td>
</tr></table>
<hr>
<div class="spirit-nav">
<a accesskey="p" href="install.html"><img src="../../../../../doc/src/images/prev.png" alt="Prev"></a><a accesskey="u" href="../index.html"><img src="../../../../../doc/src/images/up.png" alt="Up"></a><a accesskey="h" href="../index.html"><img src="../../../../../doc/src/images/home.png" alt="Home"></a><a accesskey="n" href="unicode.html"><img src="../../../../../doc/src/images/next.png" alt="Next"></a>
</div>
<div class="section">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="boost_regex.intro"></a><a class="link" href="intro.html" title="Introduction and Overview">Introduction and Overview</a>
</h2></div></div></div>
<p>
      Regular expressions are a form of pattern-matching that
      is often used in text
      processing; many users will be familiar with the Unix utilities grep, sed and
      awk, and the programming language Perl, each of which makes extensive use of
      regular expressions. Traditionally, C++ users have been limited to the POSIX
      C APIs for manipulating regular expressions, and while Boost.Regex does provide
      these APIs, they do not represent the best way to use the library. For example,
      Boost.Regex can cope with wide character strings, and with search and replace operations
      (in a manner analogous to either sed or Perl), something that traditional C
      libraries cannot do.
    </p>
<p>
      The class <a class="link" href="ref/basic_regex.html" title="basic_regex"><code class="computeroutput"><span class="identifier">basic_regex</span></code></a>
      is the key class in this library; it represents a "machine readable"
      regular expression, and is very closely modeled on <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">basic_string</span></code>;
      think of it as a string plus the actual state-machine required by the regular
      expression algorithms.
      As with <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">basic_string</span></code>,
      there are two typedefs that are almost always the means by which this class
      is referenced:
    </p>
<pre class="programlisting"><span class="keyword">namespace</span> <span class="identifier">boost</span><span class="special">{</span>

<span class="keyword">template</span> <span class="special"><</span><span class="keyword">class</span> <span class="identifier">charT</span><span class="special">,</span>
          <span class="keyword">class</span> <span class="identifier">traits</span> <span class="special">=</span> <span class="identifier">regex_traits</span><span class="special"><</span><span class="identifier">charT</span><span class="special">></span> <span class="special">></span>
<span class="keyword">class</span> <span class="identifier">basic_regex</span><span class="special">;</span>

<span class="keyword">typedef</span> <span class="identifier">basic_regex</span><span class="special"><</span><span class="keyword">char</span><span class="special">></span> <span class="identifier">regex</span><span class="special">;</span>
<span class="keyword">typedef</span> <span class="identifier">basic_regex</span><span class="special"><</span><span class="keyword">wchar_t</span><span class="special">></span> <span class="identifier">wregex</span><span class="special">;</span>

<span class="special">}</span>
</pre>
<p>
      To see how this library can be used, imagine that we are writing a credit card
      processing application. Credit card numbers generally come as a string of 16 digits,
      split into groups of 4 digits separated by either a space or a hyphen.
      Before storing a credit card number in a database (not necessarily something
      your customers will appreciate!), we may want to verify that the number is
      in the correct format.
      To match any digit we could use the regular expression
      [0-9]; however, ranges of characters like this are actually locale dependent.
      Instead we should use the POSIX standard form [[:digit:]], or the Boost.Regex
      and Perl shorthand for this, \d (note that many older libraries tended to be
      hard-coded to the C locale; consequently this was not an issue for them). That
      leaves us with the following regular expression to validate credit card number
      formats:
    </p>
<pre class="programlisting">(\d{4}[- ]){3}\d{4}</pre>
<p>
      Here the parentheses act to group (and mark for future reference) sub-expressions,
      and the {4} means "repeat exactly 4 times". This is an example of
      the extended regular expression syntax used by Perl, awk and egrep. Boost.Regex
      also supports the older "basic" syntax used by sed and grep, but
      this is generally less useful, unless you already have some basic regular expressions
      that you need to reuse.
    </p>
<p>
      Now let's take that expression and place it in some C++ code to validate the
      format of a credit card number:
    </p>
<pre class="programlisting"><span class="keyword">bool</span> <span class="identifier">validate_card_format</span><span class="special">(</span><span class="keyword">const</span> <span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span><span class="special">&</span> <span class="identifier">s</span><span class="special">)</span>
<span class="special">{</span>
   <span class="keyword">static</span> <span class="keyword">const</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">regex</span> <span class="identifier">e</span><span class="special">(</span><span class="string">"(\\d{4}[- ]){3}\\d{4}"</span><span class="special">);</span>
   <span class="keyword">return</span> <span class="identifier">regex_match</span><span class="special">(</span><span class="identifier">s</span><span
class="special">,</span> <span class="identifier">e</span><span class="special">);</span>
<span class="special">}</span>
</pre>
<p>
      Note how we had to add some extra escapes to the expression: remember that
      the escape is seen once by the C++ compiler before it is seen by the
      regular expression engine; consequently, escapes in regular expressions have
      to be doubled up when embedding them in C/C++ code. Also note that all the
      examples assume that your compiler supports argument-dependent lookup; if yours
      doesn't (for example VC6), then you will have to add some <code class="computeroutput"><span class="identifier">boost</span><span class="special">::</span></code> prefixes to some of the function calls in
      the examples.
    </p>
<p>
      Those of you who are familiar with credit card processing will have realized
      that while the format used above is suitable for human readable card numbers,
      it does not represent the format required by online credit card systems; these
      require the number as a string of 16 (or possibly 15) digits, without any intervening
      spaces. What we need is a means to convert easily between the two formats,
      and this is where search and replace comes in. Those who are familiar with
      the utilities sed and Perl will already be ahead here. We need two strings:
      one a regular expression, the other a "format string" that provides
      a description of the text to replace the match with.
      In Boost.Regex this search
      and replace operation is performed with the algorithm <a class="link" href="ref/regex_replace.html" title="regex_replace"><code class="computeroutput"><span class="identifier">regex_replace</span></code></a>; for our credit card
      example we can write two algorithms like this to provide the format conversions:
    </p>
<pre class="programlisting"><span class="comment">// match any format with the regular expression:</span>
<span class="keyword">const</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">regex</span> <span class="identifier">e</span><span class="special">(</span><span class="string">"\\A(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})\\z"</span><span class="special">);</span>
<span class="keyword">const</span> <span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span> <span class="identifier">machine_format</span><span class="special">(</span><span class="string">"\\1\\2\\3\\4"</span><span class="special">);</span>
<span class="keyword">const</span> <span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span> <span class="identifier">human_format</span><span class="special">(</span><span class="string">"\\1-\\2-\\3-\\4"</span><span class="special">);</span>

<span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span> <span class="identifier">machine_readable_card_number</span><span class="special">(</span><span class="keyword">const</span> <span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span><span class="special">&</span> <span class="identifier">s</span><span class="special">)</span>
<span class="special">{</span>
   <span class="keyword">return</span> <span class="identifier">regex_replace</span><span class="special">(</span><span class="identifier">s</span><span class="special">,</span> <span
class="identifier">e</span><span class="special">,</span> <span class="identifier">machine_format</span><span class="special">,</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">match_default</span> <span class="special">|</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">format_sed</span><span class="special">);</span>
<span class="special">}</span>

<span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span> <span class="identifier">human_readable_card_number</span><span class="special">(</span><span class="keyword">const</span> <span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span><span class="special">&</span> <span class="identifier">s</span><span class="special">)</span>
<span class="special">{</span>
   <span class="keyword">return</span> <span class="identifier">regex_replace</span><span class="special">(</span><span class="identifier">s</span><span class="special">,</span> <span class="identifier">e</span><span class="special">,</span> <span class="identifier">human_format</span><span class="special">,</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">match_default</span> <span class="special">|</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">format_sed</span><span class="special">);</span>
<span class="special">}</span>
</pre>
<p>
      Here we've used marked sub-expressions in the regular expression to split out
      the four parts of the card number as separate fields; the format string then
      uses the sed-like syntax to replace the matched text with the reformatted version.
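    </p>
<p>
      As a rough, self-contained sketch of the same conversion, the code below uses
      <code class="computeroutput">std::regex</code> from the C++ standard library (whose
      interface was derived from Boost.Regex) so that it builds without Boost. This is a
      substitution made for illustration, not the library's own code: the names
      <code class="computeroutput">card_re</code>, <code class="computeroutput">machine_readable</code>
      and <code class="computeroutput">human_readable</code> are hypothetical, and because the
      default <code class="computeroutput">std::regex</code> grammar has no \A or \z, the
      pattern is anchored with ^ and $ instead.
    </p>
<pre class="programlisting">
```cpp
#include <regex>
#include <string>

// Assumption: std::regex stands in for boost::regex in this sketch.
// The default ECMAScript grammar lacks \A and \z, so ^ and $ anchor the pattern.
static const std::regex card_re(
   "^(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})$");

// With format_sed, \1..\4 in the format string refer to the marked sub-expressions.
std::string machine_readable(const std::string& s)
{
   return std::regex_replace(s, card_re, std::string("\\1\\2\\3\\4"),
                             std::regex_constants::format_sed);
}

std::string human_readable(const std::string& s)
{
   return std::regex_replace(s, card_re, std::string("\\1-\\2-\\3-\\4"),
                             std::regex_constants::format_sed);
}
```
</pre>
<p>
      Under these assumptions, <code class="computeroutput">machine_readable("1234-5678 9012-3456")</code>
      yields 1234567890123456 and <code class="computeroutput">human_readable("1234567890123456")</code>
      yields 1234-5678-9012-3456; input that does not match the expression is returned unchanged.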
    </p>
<p>
      In the examples above, we haven't directly manipulated the results of a regular
      expression match; however, in general the result of a match contains a number
      of sub-expression matches in addition to the overall match. When the library
      needs to report a regular expression match, it does so using an instance of
      the class <a class="link" href="ref/match_results.html" title="match_results"><code class="computeroutput"><span class="identifier">match_results</span></code></a>;
      as before, there are typedefs of this class for the most common cases:
    </p>
<pre class="programlisting"><span class="keyword">namespace</span> <span class="identifier">boost</span><span class="special">{</span>

<span class="keyword">typedef</span> <span class="identifier">match_results</span><span class="special"><</span><span class="keyword">const</span> <span class="keyword">char</span><span class="special">*></span> <span class="identifier">cmatch</span><span class="special">;</span>
<span class="keyword">typedef</span> <span class="identifier">match_results</span><span class="special"><</span><span class="keyword">const</span> <span class="keyword">wchar_t</span><span class="special">*></span> <span class="identifier">wcmatch</span><span class="special">;</span>
<span class="keyword">typedef</span> <span class="identifier">match_results</span><span class="special"><</span><span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span><span class="special">::</span><span class="identifier">const_iterator</span><span class="special">></span> <span class="identifier">smatch</span><span class="special">;</span>
<span class="keyword">typedef</span> <span class="identifier">match_results</span><span class="special"><</span><span class="identifier">std</span><span class="special">::</span><span class="identifier">wstring</span><span class="special">::</span><span
class="identifier">const_iterator</span><span class="special">></span> <span class="identifier">wsmatch</span><span class="special">;</span>

<span class="special">}</span>
</pre>
<p>
      The algorithms <a class="link" href="ref/regex_search.html" title="regex_search"><code class="computeroutput"><span class="identifier">regex_search</span></code></a>
      and <a class="link" href="ref/regex_match.html" title="regex_match"><code class="computeroutput"><span class="identifier">regex_match</span></code></a>
      make use of <a class="link" href="ref/match_results.html" title="match_results"><code class="computeroutput"><span class="identifier">match_results</span></code></a>
      to report what matched; the difference between these algorithms is that <a class="link" href="ref/regex_match.html" title="regex_match"><code class="computeroutput"><span class="identifier">regex_match</span></code></a>
      will only find matches that consume <span class="emphasis"><em>all</em></span> of the input text,
      whereas <a class="link" href="ref/regex_search.html" title="regex_search"><code class="computeroutput"><span class="identifier">regex_search</span></code></a>
      will search for a match anywhere within the text being matched.
    </p>
<p>
      Note that these algorithms are not restricted to searching regular C-strings:
      any bidirectional iterator type can be searched, allowing for the possibility
      of seamlessly searching almost any kind of data.
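    </p>
<p>
      The difference between <code class="computeroutput">regex_match</code> and
      <code class="computeroutput">regex_search</code> described above can be sketched as follows.
      As an assumption made so that the example builds without Boost, it uses
      <code class="computeroutput">std::regex</code> and <code class="computeroutput">std::smatch</code>
      from the C++ standard library, which mirror the Boost.Regex names used above; the helper
      functions <code class="computeroutput">first_group_if_whole_match</code> and
      <code class="computeroutput">find_card_anywhere</code> are hypothetical.
    </p>
<pre class="programlisting">
```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first 4-digit group if the whole of s is a
// card number, or an empty string otherwise.
std::string first_group_if_whole_match(const std::string& s)
{
   static const std::regex e("(\\d{4})[- ](\\d{4})[- ](\\d{4})[- ](\\d{4})");
   std::smatch what;
   // regex_match succeeds only if the expression consumes all of s.
   if (std::regex_match(s, what, e))
      return what[1].str();   // what[0] is the overall match, what[1]..what[4] the groups
   return std::string();
}

// Hypothetical helper: returns the first card number found anywhere in text,
// or an empty string if there is none.
std::string find_card_anywhere(const std::string& text)
{
   static const std::regex e("(\\d{4}[- ]){3}\\d{4}");
   std::smatch what;
   // regex_search accepts a match that starts at any position in text.
   if (std::regex_search(text, what, e))
      return what[0].str();
   return std::string();
}
```
</pre>
<p>
      Here <code class="computeroutput">first_group_if_whole_match</code> rejects
      "card: 1234-5678-9012-3456" because of the leading text, while
      <code class="computeroutput">find_card_anywhere</code> locates the number inside it.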
    </p>
<p>
      For search and replace operations, in addition to the algorithm <a class="link" href="ref/regex_replace.html" title="regex_replace"><code class="computeroutput"><span class="identifier">regex_replace</span></code></a> that we have already
      seen, the <a class="link" href="ref/match_results.html" title="match_results"><code class="computeroutput"><span class="identifier">match_results</span></code></a>
      class has a <code class="computeroutput"><span class="identifier">format</span></code> member that
      takes the result of a match and a format string, and produces a new string
      by merging the two.
    </p>
<p>
      For iterating through all occurrences of an expression within a text, there
      are two iterator types: <a class="link" href="ref/regex_iterator.html" title="regex_iterator"><code class="computeroutput"><span class="identifier">regex_iterator</span></code></a> will enumerate over
      the <a class="link" href="ref/match_results.html" title="match_results"><code class="computeroutput"><span class="identifier">match_results</span></code></a>
      objects found, while <a class="link" href="ref/regex_token_iterator.html" title="regex_token_iterator"><code class="computeroutput"><span class="identifier">regex_token_iterator</span></code></a> will enumerate
      a series of strings (similar to Perl-style split operations).
    </p>
<p>
      For those who dislike templates, there is a high level wrapper class <a class="link" href="ref/deprecated/old_regex.html" title="High Level Class RegEx (Deprecated)"><code class="computeroutput"><span class="identifier">RegEx</span></code></a>
      that is an encapsulation of the lower level template code; it provides a simplified
      interface for those who don't need the full power of the library, and supports
      only narrow characters and the "extended" regular expression syntax.
      This class is now deprecated as it does not form part of the regular expressions
      C++ standard library proposal.
    </p>
<p>
      The POSIX API functions <a class="link" href="ref/posix.html#boost_regex.ref.posix.regcomp"><code class="computeroutput"><span class="identifier">regcomp</span></code></a>, <a class="link" href="ref/posix.html#boost_regex.ref.posix.regexec"><code class="computeroutput"><span class="identifier">regexec</span></code></a>, <a class="link" href="ref/posix.html#boost_regex.ref.posix.regfree"><code class="computeroutput"><span class="identifier">regfree</span></code></a> and <code class="computeroutput"><span class="identifier">regerror</span></code> are available
      in both narrow character and Unicode versions, and are provided for those who
      need compatibility with these APIs.
    </p>
<p>
      Finally, note that the library now has <a class="link" href="background/locale.html" title="Localization">run-time
      localization support</a>, and recognizes the full POSIX regular expression
      syntax, including advanced features like multi-character collating elements
      and equivalence classes, as well as providing compatibility with other regular
      expression libraries including GNU and BSD4 regex packages, PCRE and Perl 5.
    </p>
</div>
<table xmlns:rev="http://www.cs.rpi.edu/~gregod/boost/tools/doc/revision" width="100%"><tr>
<td align="left"></td>
<td align="right"><div class="copyright-footer">Copyright © 1998-2013 John Maddock<p>
        Distributed under the Boost Software License, Version 1.0.
        (See accompanying
        file LICENSE_1_0.txt or copy at <a href="http://www.boost.org/LICENSE_1_0.txt" target="_top">http://www.boost.org/LICENSE_1_0.txt</a>)
      </p>
</div></td>
</tr></table>
<hr>
<div class="spirit-nav">
<a accesskey="p" href="install.html"><img src="../../../../../doc/src/images/prev.png" alt="Prev"></a><a accesskey="u" href="../index.html"><img src="../../../../../doc/src/images/up.png" alt="Up"></a><a accesskey="h" href="../index.html"><img src="../../../../../doc/src/images/home.png" alt="Home"></a><a accesskey="n" href="unicode.html"><img src="../../../../../doc/src/images/next.png" alt="Next"></a>
</div>
</body>
</html>