<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Tokenizing Input Data</title>
<link rel="stylesheet" href="../../../../../../../doc/src/boostbook.css" type="text/css">
<meta name="generator" content="DocBook XSL Stylesheets V1.79.1">
<link rel="home" href="../../../index.html" title="Spirit 2.5.8">
<link rel="up" href="../abstracts.html" title="Abstracts">
<link rel="prev" href="lexer_primitives/lexer_token_values.html" title="About Tokens and Token Values">
<link rel="next" href="lexer_semantic_actions.html" title="Lexer Semantic Actions">
</head>
<body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF">
<table cellpadding="2" width="100%"><tr>
<td valign="top"><img alt="Boost C++ Libraries" width="277" height="86" src="../../../../../../../boost.png"></td>
<td align="center"><a href="../../../../../../../index.html">Home</a></td>
<td align="center"><a href="../../../../../../../libs/libraries.htm">Libraries</a></td>
<td align="center"><a href="http://www.boost.org/users/people.html">People</a></td>
<td align="center"><a href="http://www.boost.org/users/faq.html">FAQ</a></td>
<td align="center"><a href="../../../../../../../more/index.htm">More</a></td>
</tr></table>
<hr>
<div class="spirit-nav">
<a accesskey="p" href="lexer_primitives/lexer_token_values.html"><img src="../../../../../../../doc/src/images/prev.png" alt="Prev"></a><a accesskey="u" href="../abstracts.html"><img src="../../../../../../../doc/src/images/up.png" alt="Up"></a><a accesskey="h" href="../../../index.html"><img src="../../../../../../../doc/src/images/home.png" alt="Home"></a><a accesskey="n" href="lexer_semantic_actions.html"><img src="../../../../../../../doc/src/images/next.png" alt="Next"></a>
</div>
<div class="section">
<div class="titlepage"><div><div><h4 class="title">
<a name="spirit.lex.abstracts.lexer_tokenizing"></a><a class="link" href="lexer_tokenizing.html"
title="Tokenizing Input Data">Tokenizing Input
        Data</a>
</h4></div></div></div>
<h6>
<a name="spirit.lex.abstracts.lexer_tokenizing.h0"></a>
        <span class="phrase"><a name="spirit.lex.abstracts.lexer_tokenizing.the_tokenize_function"></a></span><a class="link" href="lexer_tokenizing.html#spirit.lex.abstracts.lexer_tokenizing.the_tokenize_function">The
        tokenize function</a>
      </h6>
<p>
        The <code class="computeroutput"><span class="identifier">tokenize</span><span class="special">()</span></code>
        function is a helper that simplifies using a lexer in a standalone
        fashion. For instance, you may have a standalone lexer where all
        the functional requirements are implemented inside lexer semantic actions.
        A good example of this is the <a href="../../../../../example/lex/word_count_lexer.cpp" target="_top">word_count_lexer</a>
        described in more detail in the section <a class="link" href="../tutorials/lexer_quickstart2.html" title="Quickstart 2 - A better word counter using Spirit.Lex">Lex
        Quickstart 2 - A better word counter using <span class="emphasis"><em>Spirit.Lex</em></span></a>.
      </p>
<p>
</p>
<pre class="programlisting"><span class="keyword">template</span> <span class="special">&lt;</span><span class="keyword">typename</span> <span class="identifier">Lexer</span><span class="special">&gt;</span>
<span class="keyword">struct</span> <span class="identifier">word_count_tokens</span> <span class="special">:</span> <span class="identifier">lex</span><span class="special">::</span><span class="identifier">lexer</span><span class="special">&lt;</span><span class="identifier">Lexer</span><span class="special">&gt;</span>
<span class="special">{</span>
    <span class="identifier">word_count_tokens</span><span class="special">()</span>
      <span class="special">:</span> <span class="identifier">c</span><span class="special">(</span><span class="number">0</span><span class="special">),</span> <span class="identifier">w</span><span class="special">(</span><span class="number">0</span><span class="special">),</span> <span class="identifier">l</span><span class="special">(</span><span class="number">0</span><span class="special">)</span>
      <span class="special">,</span> <span class="identifier">word</span><span class="special">(</span><span class="string">"[^ \t\n]+"</span><span class="special">)</span> <span class="comment">// define tokens</span>
      <span class="special">,</span> <span class="identifier">eol</span><span class="special">(</span><span class="string">"\n"</span><span class="special">)</span>
      <span class="special">,</span> <span class="identifier">any</span><span class="special">(</span><span class="string">"."</span><span class="special">)</span>
    <span class="special">{</span>
        <span class="keyword">using</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">spirit</span><span class="special">::</span><span class="identifier">lex</span><span class="special">::</span><span class="identifier">_start</span><span class="special">;</span>
        <span class="keyword">using</span> <span
class="identifier">boost</span><span class="special">::</span><span class="identifier">spirit</span><span class="special">::</span><span class="identifier">lex</span><span class="special">::</span><span class="identifier">_end</span><span class="special">;</span>
        <span class="keyword">using</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">phoenix</span><span class="special">::</span><span class="identifier">ref</span><span class="special">;</span>

        <span class="comment">// associate tokens with the lexer</span>
        <span class="keyword">this</span><span class="special">-&gt;</span><span class="identifier">self</span>
            <span class="special">=</span> <span class="identifier">word</span> <span class="special">[++</span><span class="identifier">ref</span><span class="special">(</span><span class="identifier">w</span><span class="special">),</span> <span class="identifier">ref</span><span class="special">(</span><span class="identifier">c</span><span class="special">)</span> <span class="special">+=</span> <span class="identifier">distance</span><span class="special">(</span><span class="identifier">_start</span><span class="special">,</span> <span class="identifier">_end</span><span class="special">)]</span>
            <span class="special">|</span> <span class="identifier">eol</span> <span class="special">[++</span><span class="identifier">ref</span><span class="special">(</span><span class="identifier">c</span><span class="special">),</span> <span class="special">++</span><span class="identifier">ref</span><span class="special">(</span><span class="identifier">l</span><span class="special">)]</span>
            <span class="special">|</span> <span class="identifier">any</span> <span class="special">[++</span><span class="identifier">ref</span><span class="special">(</span><span class="identifier">c</span><span class="special">)]</span>
            <span class="special">;</span>
    <span class="special">}</span>

    <span
class="identifier">std</span><span class="special">::</span><span class="identifier">size_t</span> <span class="identifier">c</span><span class="special">,</span> <span class="identifier">w</span><span class="special">,</span> <span class="identifier">l</span><span class="special">;</span>
    <span class="identifier">lex</span><span class="special">::</span><span class="identifier">token_def</span><span class="special">&lt;&gt;</span> <span class="identifier">word</span><span class="special">,</span> <span class="identifier">eol</span><span class="special">,</span> <span class="identifier">any</span><span class="special">;</span>
<span class="special">};</span>
</pre>
<p>
      </p>
<p>
        Tokenizing the given input while discarding all generated tokens is a
        common application of the lexer. For this reason <span class="emphasis"><em>Spirit.Lex</em></span>
        exposes an API function <code class="computeroutput"><span class="identifier">tokenize</span><span class="special">()</span></code> that minimizes the code required:
      </p>
<pre class="programlisting"><span class="comment">// Read input from the given file</span>
<span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span> <span class="identifier">str</span> <span class="special">(</span><span class="identifier">read_from_file</span><span class="special">(</span><span class="number">1</span> <span class="special">==</span> <span class="identifier">argc</span> <span class="special">?</span> <span class="string">"word_count.input"</span> <span class="special">:</span> <span class="identifier">argv</span><span class="special">[</span><span class="number">1</span><span class="special">]));</span>

<span class="identifier">word_count_tokens</span><span class="special">&lt;</span><span class="identifier">lexer_type</span><span class="special">&gt;</span> <span class="identifier">word_count_lexer</span><span class="special">;</span>
<span
class="identifier">std</span><span class="special">::</span><span class="identifier">string</span><span class="special">::</span><span class="identifier">iterator</span> <span class="identifier">first</span> <span class="special">=</span> <span class="identifier">str</span><span class="special">.</span><span class="identifier">begin</span><span class="special">();</span>

<span class="comment">// Tokenize all the input, while discarding all generated tokens</span>
<span class="keyword">bool</span> <span class="identifier">r</span> <span class="special">=</span> <span class="identifier">tokenize</span><span class="special">(</span><span class="identifier">first</span><span class="special">,</span> <span class="identifier">str</span><span class="special">.</span><span class="identifier">end</span><span class="special">(),</span> <span class="identifier">word_count_lexer</span><span class="special">);</span>
</pre>
<p>
        This code is equivalent to the more verbose version shown
        in the section <a class="link" href="../tutorials/lexer_quickstart2.html" title="Quickstart 2 - A better word counter using Spirit.Lex">Lex
        Quickstart 2 - A better word counter using <span class="emphasis"><em>Spirit.Lex</em></span></a>.
        The function <code class="computeroutput"><span class="identifier">tokenize</span><span class="special">()</span></code>
        returns either when the end of the input has been reached (in which case
        the return value is <code class="computeroutput"><span class="keyword">true</span></code>),
        or when the lexer cannot match any of the token definitions in the input
        (in which case the return value is <code class="computeroutput"><span class="keyword">false</span></code>
        and the iterator <code class="computeroutput"><span class="identifier">first</span></code>
        points to the first unmatched character in the input sequence).
      </p>
<p>
        The prototype of this function is:
      </p>
<pre class="programlisting"><span class="keyword">template</span> <span class="special">&lt;</span><span class="keyword">typename</span> <span class="identifier">Iterator</span><span class="special">,</span> <span class="keyword">typename</span> <span class="identifier">Lexer</span><span class="special">&gt;</span>
<span class="keyword">bool</span> <span class="identifier">tokenize</span><span class="special">(</span><span class="identifier">Iterator</span><span class="special">&amp;</span> <span class="identifier">first</span><span class="special">,</span> <span class="identifier">Iterator</span> <span class="identifier">last</span><span class="special">,</span> <span class="identifier">Lexer</span> <span class="keyword">const</span><span class="special">&amp;</span> <span class="identifier">lex</span>
  <span class="special">,</span> <span class="keyword">typename</span> <span class="identifier">Lexer</span><span class="special">::</span><span class="identifier">char_type</span> <span class="keyword">const</span><span class="special">*</span> <span class="identifier">initial_state</span> <span class="special">=</span> <span class="number">0</span><span class="special">);</span>
</pre>
<div class="variablelist">
<p class="title"><b>where:</b></p>
<dl class="variablelist">
<dt><span class="term">Iterator&amp; first</span></dt>
<dd><p>
            The beginning of the input sequence to tokenize. The lexer updates
            this iterator so that, after the function returns, it points to the
            first unmatched character of the input.
          </p></dd>
<dt><span class="term">Iterator last</span></dt>
<dd><p>
            The end of the input sequence to tokenize.
          </p></dd>
<dt><span class="term">Lexer const&amp; lex</span></dt>
<dd><p>
            The lexer instance to use for tokenization.
          </p></dd>
<dt><span class="term">Lexer::char_type const* initial_state</span></dt>
<dd><p>
            This optional parameter can be used to specify the initial lexer
            state for tokenization.
          </p></dd>
</dl>
</div>
<p>
        A second overload of the <code class="computeroutput"><span class="identifier">tokenize</span><span class="special">()</span></code> function allows specifying an arbitrary
        function or function object to be called for each of the generated tokens.
        For some applications this is very useful, as it may remove the need for lexer
        semantic actions. For an example of how to use this function, please have
        a look at <a href="../../../../../example/lex/word_count_functor.cpp" target="_top">word_count_functor.cpp</a>:
      </p>
<p>
        The main function simply loads the given file into memory (as a <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span></code>), creates an instance of
        the token definition template using the lexer type (<code class="computeroutput"><span class="identifier">word_count_tokens</span><span class="special">&lt;</span><span class="identifier">lex</span><span class="special">::</span><span class="identifier">lexertl</span><span class="special">::</span><span class="identifier">lexer</span><span class="special">&lt;&gt;</span> <span class="special">&gt;</span></code>), and finally calls <code class="computeroutput"><span class="identifier">lex</span><span class="special">::</span><span class="identifier">tokenize</span></code>, passing an instance of the
        counter function object. The return value of <code class="computeroutput"><span class="identifier">lex</span><span class="special">::</span><span class="identifier">tokenize</span><span class="special">()</span></code> will be <code class="computeroutput"><span class="keyword">true</span></code>
        if the whole input sequence has been successfully tokenized, and <code class="computeroutput"><span class="keyword">false</span></code> otherwise.
      </p>
<p>
</p>
<pre class="programlisting"><span class="keyword">int</span> <span class="identifier">main</span><span class="special">(</span><span class="keyword">int</span> <span class="identifier">argc</span><span class="special">,</span> <span class="keyword">char</span><span class="special">*</span> <span class="identifier">argv</span><span class="special">[])</span>
<span class="special">{</span>
    <span class="comment">// these variables are used to count characters, words and lines</span>
    <span class="identifier">std</span><span class="special">::</span><span class="identifier">size_t</span> <span class="identifier">c</span> <span class="special">=</span> <span class="number">0</span><span class="special">,</span> <span class="identifier">w</span> <span class="special">=</span> <span class="number">0</span><span class="special">,</span> <span class="identifier">l</span> <span class="special">=</span> <span class="number">0</span><span class="special">;</span>

    <span class="comment">// read input from the given file</span>
    <span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span> <span class="identifier">str</span> <span class="special">(</span><span class="identifier">read_from_file</span><span class="special">(</span><span class="number">1</span> <span class="special">==</span> <span class="identifier">argc</span> <span class="special">?</span> <span class="string">"word_count.input"</span> <span class="special">:</span> <span class="identifier">argv</span><span class="special">[</span><span class="number">1</span><span class="special">]));</span>

    <span class="comment">// create the token definition instance needed to invoke the lexical analyzer</span>
    <span class="identifier">word_count_tokens</span><span class="special">&lt;</span><span class="identifier">lex</span><span class="special">::</span><span class="identifier">lexertl</span><span
class="special">::</span><span class="identifier">lexer</span><span class="special">&lt;&gt;</span> <span class="special">&gt;</span> <span class="identifier">word_count_functor</span><span class="special">;</span>

    <span class="comment">// tokenize the given string, the bound functor gets invoked for each of</span>
    <span class="comment">// the matched tokens</span>
    <span class="keyword">using</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">placeholders</span><span class="special">::</span><span class="identifier">_1</span><span class="special">;</span>
    <span class="keyword">char</span> <span class="keyword">const</span><span class="special">*</span> <span class="identifier">first</span> <span class="special">=</span> <span class="identifier">str</span><span class="special">.</span><span class="identifier">c_str</span><span class="special">();</span>
    <span class="keyword">char</span> <span class="keyword">const</span><span class="special">*</span> <span class="identifier">last</span> <span class="special">=</span> <span class="special">&amp;</span><span class="identifier">first</span><span class="special">[</span><span class="identifier">str</span><span class="special">.</span><span class="identifier">size</span><span class="special">()];</span>
    <span class="keyword">bool</span> <span class="identifier">r</span> <span class="special">=</span> <span class="identifier">lex</span><span class="special">::</span><span class="identifier">tokenize</span><span class="special">(</span><span class="identifier">first</span><span class="special">,</span> <span class="identifier">last</span><span class="special">,</span> <span class="identifier">word_count_functor</span><span class="special">,</span>
        <span class="identifier">boost</span><span class="special">::</span><span class="identifier">bind</span><span class="special">(</span><span class="identifier">counter</span><span class="special">(),</span> <span
class="identifier">_1</span><span class="special">,</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">ref</span><span class="special">(</span><span class="identifier">c</span><span class="special">),</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">ref</span><span class="special">(</span><span class="identifier">w</span><span class="special">),</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">ref</span><span class="special">(</span><span class="identifier">l</span><span class="special">)));</span>

    <span class="comment">// print results</span>
    <span class="keyword">if</span> <span class="special">(</span><span class="identifier">r</span><span class="special">)</span> <span class="special">{</span>
        <span class="identifier">std</span><span class="special">::</span><span class="identifier">cout</span> <span class="special">&lt;&lt;</span> <span class="string">"lines: "</span> <span class="special">&lt;&lt;</span> <span class="identifier">l</span> <span class="special">&lt;&lt;</span> <span class="string">", words: "</span> <span class="special">&lt;&lt;</span> <span class="identifier">w</span>
                  <span class="special">&lt;&lt;</span> <span class="string">", characters: "</span> <span class="special">&lt;&lt;</span> <span class="identifier">c</span> <span class="special">&lt;&lt;</span> <span class="string">"\n"</span><span class="special">;</span>
    <span class="special">}</span>
    <span class="keyword">else</span> <span class="special">{</span>
        <span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span> <span class="identifier">rest</span><span class="special">(</span><span class="identifier">first</span><span class="special">,</span> <span class="identifier">last</span><span class="special">);</span>
        <span class="identifier">std</span><span class="special">::</span><span
class="identifier">cout</span> <span class="special">&lt;&lt;</span> <span class="string">"Lexical analysis failed\n"</span> <span class="special">&lt;&lt;</span> <span class="string">"stopped at: \""</span>
                  <span class="special">&lt;&lt;</span> <span class="identifier">rest</span> <span class="special">&lt;&lt;</span> <span class="string">"\"\n"</span><span class="special">;</span>
    <span class="special">}</span>
    <span class="keyword">return</span> <span class="number">0</span><span class="special">;</span>
<span class="special">}</span>
</pre>
<p>
      </p>
<p>
        Here is the prototype of this <code class="computeroutput"><span class="identifier">tokenize</span><span class="special">()</span></code> function overload:
      </p>
<pre class="programlisting"><span class="keyword">template</span> <span class="special">&lt;</span><span class="keyword">typename</span> <span class="identifier">Iterator</span><span class="special">,</span> <span class="keyword">typename</span> <span class="identifier">Lexer</span><span class="special">,</span> <span class="keyword">typename</span> <span class="identifier">F</span><span class="special">&gt;</span>
<span class="keyword">bool</span> <span class="identifier">tokenize</span><span class="special">(</span><span class="identifier">Iterator</span><span class="special">&amp;</span> <span class="identifier">first</span><span class="special">,</span> <span class="identifier">Iterator</span> <span class="identifier">last</span><span class="special">,</span> <span class="identifier">Lexer</span> <span class="keyword">const</span><span class="special">&amp;</span> <span class="identifier">lex</span><span class="special">,</span> <span class="identifier">F</span> <span class="identifier">f</span>
  <span class="special">,</span> <span class="keyword">typename</span> <span class="identifier">Lexer</span><span class="special">::</span><span class="identifier">char_type</span> <span class="keyword">const</span><span class="special">*</span> <span
class="identifier">initial_state</span> <span class="special">=</span> <span class="number">0</span><span class="special">);</span>
</pre>
<div class="variablelist">
<p class="title"><b>where:</b></p>
<dl class="variablelist">
<dt><span class="term">Iterator&amp; first</span></dt>
<dd><p>
            The beginning of the input sequence to tokenize. The lexer updates
            this iterator so that, after the function returns, it points to the
            first unmatched character of the input.
          </p></dd>
<dt><span class="term">Iterator last</span></dt>
<dd><p>
            The end of the input sequence to tokenize.
          </p></dd>
<dt><span class="term">Lexer const&amp; lex</span></dt>
<dd><p>
            The lexer instance to use for tokenization.
          </p></dd>
<dt><span class="term">F f</span></dt>
<dd><p>
            A function or function object to be called for each matched token.
            This function is expected to have the prototype: <code class="computeroutput"><span class="keyword">bool</span>
            <span class="identifier">f</span><span class="special">(</span><span class="identifier">Lexer</span><span class="special">::</span><span class="identifier">token_type</span><span class="special">);</span></code>.
            The <code class="computeroutput"><span class="identifier">tokenize</span><span class="special">()</span></code>
            function will return immediately if <code class="computeroutput"><span class="identifier">f</span></code>
            returns <code class="computeroutput"><span class="keyword">false</span></code>.
          </p></dd>
<dt><span class="term">Lexer::char_type const* initial_state</span></dt>
<dd><p>
            This optional parameter can be used to specify the initial lexer
            state for tokenization.
          </p></dd>
</dl>
</div>
</div>
<table xmlns:rev="http://www.cs.rpi.edu/~gregod/boost/tools/doc/revision" width="100%"><tr>
<td align="left"></td>
<td align="right"><div class="copyright-footer">Copyright © 2001-2011 Joel de Guzman, Hartmut Kaiser<p>
        Distributed under the Boost Software License, Version 1.0. (See accompanying
        file LICENSE_1_0.txt or copy at <a href="http://www.boost.org/LICENSE_1_0.txt" target="_top">http://www.boost.org/LICENSE_1_0.txt</a>)
      </p>
</div></td>
</tr></table>
</body>
</html>