[/
  Copyright 2006-2007 John Maddock.
  Distributed under the Boost Software License, Version 1.0.
  (See accompanying file LICENSE_1_0.txt or copy at
  http://www.boost.org/LICENSE_1_0.txt).
]

[section:intro Introduction and Overview]

Regular expressions are a form of pattern matching often used in
text processing; many users will be familiar with the Unix utilities grep, sed
and awk, and the programming language Perl, each of which makes extensive use
of regular expressions. Traditionally, C++ users have been limited to the
POSIX C APIs for manipulating regular expressions, and while Boost.Regex does
provide these APIs, they do not represent the best way to use the library.
For example, Boost.Regex can cope with wide character strings and with search and
replace operations (in a manner analogous to either sed or Perl), something
that traditional C libraries cannot do.

The class [basic_regex] is the key class in this library; it represents a
"machine readable" regular expression, and is very closely modeled on
`std::basic_string`: think of it as a string plus the actual state machine
required by the regular expression algorithms. As with `std::basic_string`, there
are two typedefs that are almost always the means by which this class is referenced:

   namespace boost{

   template <class charT,
             class traits = regex_traits<charT> >
   class basic_regex;

   typedef basic_regex<char>    regex;
   typedef basic_regex<wchar_t> wregex;

   }

To see how this library can be used, imagine that we are writing a credit
card processing application. Credit card numbers generally come as a string
of 16 digits, separated into groups of 4 digits by either a
space or a hyphen. Before storing a credit card number in a database
(not necessarily something your customers will appreciate!), we may want to
verify that the number is in the correct format.
To match any digit we could
use the regular expression \[0-9\]; however, ranges of characters like this are
actually locale dependent. Instead we should use the POSIX standard
form \[\[:digit:\]\], or the Boost.Regex and Perl shorthand for this, \\d (note
that many older libraries tended to be hard-coded to the C locale;
consequently this was not an issue for them). That leaves us with the
following regular expression to validate credit card number formats:

[pre (\d{4}\[- \]){3}\d{4}]

Here the parentheses act to group (and mark for future reference)
sub-expressions, and the {4} means "repeat exactly 4 times". This is an
example of the extended regular expression syntax used by Perl, awk and egrep.
Boost.Regex also supports the older "basic" syntax used by sed and grep,
but this is generally less useful unless you already have some basic regular
expressions that you need to reuse.

Now let's take that expression and place it in some C++ code to validate the
format of a credit card number:

   bool validate_card_format(const std::string& s)
   {
      static const boost::regex e("(\\d{4}[- ]){3}\\d{4}");
      return regex_match(s, e);
   }

Note how we had to add some extra escapes to the expression: remember that
the escape is seen once by the C++ compiler before it is seen by
the regular expression engine; consequently escapes in regular expressions
have to be doubled up when embedding them in C/C++ code. Also note that
all the examples assume that your compiler supports argument-dependent
lookup; if yours doesn't (for example VC6), then you will have to add some
`boost::` prefixes to some of the function calls in the examples.

Those of you who are familiar with credit card processing will have realized
that while the format used above is suitable for human readable card numbers,
it does not represent the format required by online credit card systems; these
require the number as a string of 16 (or possibly 15) digits, without any
intervening spaces. What we need is a means to convert easily between the two
formats, and this is where search and replace comes in. Those who are familiar
with the utilities sed and Perl will already be ahead here; we need two
strings: one a regular expression, the other a "format string" that provides
a description of the text to replace the match with. In Boost.Regex this
search and replace operation is performed with the algorithm [regex_replace];
for our credit card example we can write two algorithms like this to
provide the format conversions:

   // match any format with the regular expression:
   const boost::regex e("\\A(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})\\z");
   const std::string machine_format("\\1\\2\\3\\4");
   const std::string human_format("\\1-\\2-\\3-\\4");

   std::string machine_readable_card_number(const std::string& s)
   {
      return regex_replace(s, e, machine_format, boost::match_default | boost::format_sed);
   }

   std::string human_readable_card_number(const std::string& s)
   {
      return regex_replace(s, e, human_format, boost::match_default | boost::format_sed);
   }

Here we've used marked sub-expressions in the regular expression to split out
the four parts of the card number as separate fields; the format string then
uses the sed-like syntax to replace the matched text with the reformatted version.

In the examples above, we haven't directly manipulated the results of
a regular expression match; in general, however, the result of a match contains
a number of sub-expression matches in addition to the overall match.
When the
library needs to report a regular expression match it does so using an instance
of the class [match_results]; as before, there are typedefs of this class for
the most common cases:

   namespace boost{

   typedef match_results<const char*>                  cmatch;
   typedef match_results<const wchar_t*>               wcmatch;
   typedef match_results<std::string::const_iterator>  smatch;
   typedef match_results<std::wstring::const_iterator> wsmatch;

   }

The algorithms [regex_search] and [regex_match] make use of [match_results]
to report what matched; the difference between these algorithms is that
[regex_match] will only find matches that consume /all/ of the input text,
whereas [regex_search] will search for a match anywhere within the text being matched.

Note that these algorithms are not restricted to searching regular C-strings:
any bidirectional iterator type can be searched, allowing for the
possibility of seamlessly searching almost any kind of data.

For search and replace operations, in addition to the algorithm [regex_replace]
that we have already seen, the [match_results] class has a `format` member that
takes the result of a match and a format string, and produces a new string
by merging the two.

For iterating through all occurrences of an expression within a text,
there are two iterator types: [regex_iterator] will enumerate over the
[match_results] objects found, while [regex_token_iterator] will enumerate
a series of strings (similar to Perl-style split operations).

For those that dislike templates, there is a high level wrapper class
[RegEx] that is an encapsulation of the lower level template code - it
provides a simplified interface for those that don't need the full
power of the library, and supports only narrow characters and the
"extended" regular expression syntax.
This class is now deprecated as
it does not form part of the regular expressions C++ standard library proposal.

The POSIX API functions [regcomp], [regexec], [regfree] and [regerr]
are available in both narrow character and Unicode versions, and are
provided for those who need compatibility with these APIs.

Finally, note that the library now has
[link boost_regex.background.locale run-time localization support],
and recognizes the full POSIX regular expression syntax - including
advanced features like multi-character collating elements and equivalence
classes - as well as providing compatibility with other regular expression
libraries including GNU and BSD4 regex packages, PCRE and Perl 5.

[endsect]