[/
 / Copyright (c) 2008 Eric Niebler
 /
 / Distributed under the Boost Software License, Version 1.0. (See accompanying
 / file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
 /]

[section String Splitting and Tokenization]

_regex_token_iterator_ is the Ginsu knife of the text manipulation world. It slices! It dices! This section describes
how to use the highly configurable _regex_token_iterator_ to chop up input sequences.

[h2 Overview]

You initialize a _regex_token_iterator_ with an input sequence, a regex, and some optional configuration parameters.
The _regex_token_iterator_ will use _regex_search_ to find the first place in the sequence that the regex matches. When
dereferenced, the _regex_token_iterator_ returns a ['token] in the form of a `std::basic_string<>`. Which string it returns
depends on the configuration parameters. By default it returns a string corresponding to the full match, but it could also
return a string corresponding to a particular marked sub-expression, or even the part of the sequence that ['didn't] match.
When you increment the _regex_token_iterator_, it will move to the next token. Which token is next depends on the configuration
parameters. It could simply be a different marked sub-expression in the current match, or it could be part or all of the
next match. Or it could be the part that ['didn't] match.

As you can see, _regex_token_iterator_ can do a lot. That makes it hard to describe, but some examples should make it clear.

[h2 Example 1: Simple Tokenization]

This example uses _regex_token_iterator_ to chop a sequence into a series of tokens consisting of words.

    std::string input("This is his face");
    sregex re = +_w;                      // find a word

    // iterate over all the words in the input
    sregex_token_iterator begin( input.begin(), input.end(), re ), end;

    // write all the words to std::cout
    std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
    std::copy( begin, end, out_iter );

This program displays the following:

[pre
This
is
his
face
]

[h2 Example 2: Simple Tokenization, Reloaded]

This example also uses _regex_token_iterator_ to chop a sequence into a series of tokens consisting of words,
but it uses the regex as a delimiter. When we pass a `-1` as the last parameter to the _regex_token_iterator_
constructor, it instructs the token iterator to consider as tokens those parts of the input that ['didn't]
match the regex.

    std::string input("This is his face");
    sregex re = +_s;                      // find white space

    // iterate over all non-white space in the input. Note the -1 below:
    sregex_token_iterator begin( input.begin(), input.end(), re, -1 ), end;

    // write all the words to std::cout
    std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
    std::copy( begin, end, out_iter );

This program displays the following:

[pre
This
is
his
face
]

[h2 Example 3: Simple Tokenization, Revolutions]

This example also uses _regex_token_iterator_ to chop a sequence containing a bunch of dates into a series of
tokens consisting of just the years. When we pass a positive integer [^['N]] as the last parameter to the
_regex_token_iterator_ constructor, it instructs the token iterator to consider as tokens only the [^['N]]-th
marked sub-expression of each match.

    std::string input("01/02/2003 blahblah 04/23/1999 blahblah 11/13/1981");
    sregex re = sregex::compile("(\\d{2})/(\\d{2})/(\\d{4})"); // find a date

    // iterate over all the years in the input. Note the 3 below, corresponding to the 3rd sub-expression:
    sregex_token_iterator begin( input.begin(), input.end(), re, 3 ), end;

    // write all the years to std::cout
    std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
    std::copy( begin, end, out_iter );

This program displays the following:

[pre
2003
1999
1981
]

[h2 Example 4: Not-So-Simple Tokenization]

This example is like the previous one, except that instead of tokenizing just the years, this program
turns the days, months and years into tokens. When we pass an array of integers [^['{I,J,...}]] as the last
parameter to the _regex_token_iterator_ constructor, it instructs the token iterator to consider as tokens the
[^['I]]-th, [^['J]]-th, etc. marked sub-expression of each match.

    std::string input("01/02/2003 blahblah 04/23/1999 blahblah 11/13/1981");
    sregex re = sregex::compile("(\\d{2})/(\\d{2})/(\\d{4})"); // find a date

    // iterate over the days, months and years in the input
    int const sub_matches[] = { 2, 1, 3 }; // day, month, year
    sregex_token_iterator begin( input.begin(), input.end(), re, sub_matches ), end;

    // write all the days, months and years to std::cout
    std::ostream_iterator< std::string > out_iter( std::cout, "\n" );
    std::copy( begin, end, out_iter );

This program displays the following:

[pre
02
01
2003
23
04
1999
13
11
1981
]

The `sub_matches` array instructs the _regex_token_iterator_ to first take the value of the 2nd sub-match, then
the 1st sub-match, and finally the 3rd. Incrementing the iterator again instructs it to use _regex_search_ again
to find the next match. At that point, the process repeats: the token iterator takes the value of the 2nd
sub-match, then the 1st, et cetera.

[endsect]