<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Tokenizing Input Data</title>
<link rel="stylesheet" href="../../../../../../../doc/src/boostbook.css" type="text/css">
<meta name="generator" content="DocBook XSL Stylesheets V1.79.1">
<link rel="home" href="../../../index.html" title="Spirit 2.5.8">
<link rel="up" href="../abstracts.html" title="Abstracts">
<link rel="prev" href="lexer_primitives/lexer_token_values.html" title="About Tokens and Token Values">
<link rel="next" href="lexer_semantic_actions.html" title="Lexer Semantic Actions">
</head>
<body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF">
<table cellpadding="2" width="100%"><tr>
<td valign="top"><img alt="Boost C++ Libraries" width="277" height="86" src="../../../../../../../boost.png"></td>
<td align="center"><a href="../../../../../../../index.html">Home</a></td>
<td align="center"><a href="../../../../../../../libs/libraries.htm">Libraries</a></td>
<td align="center"><a href="http://www.boost.org/users/people.html">People</a></td>
<td align="center"><a href="http://www.boost.org/users/faq.html">FAQ</a></td>
<td align="center"><a href="../../../../../../../more/index.htm">More</a></td>
</tr></table>
<hr>
<div class="spirit-nav">
<a accesskey="p" href="lexer_primitives/lexer_token_values.html"><img src="../../../../../../../doc/src/images/prev.png" alt="Prev"></a><a accesskey="u" href="../abstracts.html"><img src="../../../../../../../doc/src/images/up.png" alt="Up"></a><a accesskey="h" href="../../../index.html"><img src="../../../../../../../doc/src/images/home.png" alt="Home"></a><a accesskey="n" href="lexer_semantic_actions.html"><img src="../../../../../../../doc/src/images/next.png" alt="Next"></a>
</div>
<div class="section">
<div class="titlepage"><div><div><h4 class="title">
<a name="spirit.lex.abstracts.lexer_tokenizing"></a><a class="link" href="lexer_tokenizing.html" title="Tokenizing Input Data">Tokenizing Input
        Data</a>
</h4></div></div></div>
<h6>
<a name="spirit.lex.abstracts.lexer_tokenizing.h0"></a>
          <span class="phrase"><a name="spirit.lex.abstracts.lexer_tokenizing.the_tokenize_function"></a></span><a class="link" href="lexer_tokenizing.html#spirit.lex.abstracts.lexer_tokenizing.the_tokenize_function">The
          tokenize function</a>
        </h6>
<p>
          The <code class="computeroutput"><span class="identifier">tokenize</span><span class="special">()</span></code>
          function is a helper function simplifying the use of a lexer in a standalone
          fashion. For instance, you may have a standalone lexer where all of the
          functional requirements are implemented inside lexer semantic actions.
          A good example of this is the <a href="../../../../../example/lex/word_count_lexer.cpp" target="_top">word_count_lexer</a>
          described in more detail in the section <a class="link" href="../tutorials/lexer_quickstart2.html" title="Quickstart 2 - A better word counter using Spirit.Lex">Lex
          Quickstart 2 - A better word counter using <span class="emphasis"><em>Spirit.Lex</em></span></a>.
        </p>
<p>
</p>
<pre class="programlisting"><span class="keyword">template</span> <span class="special">&lt;</span><span class="keyword">typename</span> <span class="identifier">Lexer</span><span class="special">&gt;</span>
<span class="keyword">struct</span> <span class="identifier">word_count_tokens</span> <span class="special">:</span> <span class="identifier">lex</span><span class="special">::</span><span class="identifier">lexer</span><span class="special">&lt;</span><span class="identifier">Lexer</span><span class="special">&gt;</span>
<span class="special">{</span>
    <span class="identifier">word_count_tokens</span><span class="special">()</span>
      <span class="special">:</span> <span class="identifier">c</span><span class="special">(</span><span class="number">0</span><span class="special">),</span> <span class="identifier">w</span><span class="special">(</span><span class="number">0</span><span class="special">),</span> <span class="identifier">l</span><span class="special">(</span><span class="number">0</span><span class="special">)</span>
      <span class="special">,</span> <span class="identifier">word</span><span class="special">(</span><span class="string">"[^ \t\n]+"</span><span class="special">)</span>     <span class="comment">// define tokens</span>
      <span class="special">,</span> <span class="identifier">eol</span><span class="special">(</span><span class="string">"\n"</span><span class="special">)</span>
      <span class="special">,</span> <span class="identifier">any</span><span class="special">(</span><span class="string">"."</span><span class="special">)</span>
    <span class="special">{</span>
        <span class="keyword">using</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">spirit</span><span class="special">::</span><span class="identifier">lex</span><span class="special">::</span><span class="identifier">_start</span><span class="special">;</span>
        <span class="keyword">using</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">spirit</span><span class="special">::</span><span class="identifier">lex</span><span class="special">::</span><span class="identifier">_end</span><span class="special">;</span>
        <span class="keyword">using</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">phoenix</span><span class="special">::</span><span class="identifier">ref</span><span class="special">;</span>

        <span class="comment">// associate tokens with the lexer</span>
        <span class="keyword">this</span><span class="special">-&gt;</span><span class="identifier">self</span>
            <span class="special">=</span>   <span class="identifier">word</span>  <span class="special">[++</span><span class="identifier">ref</span><span class="special">(</span><span class="identifier">w</span><span class="special">),</span> <span class="identifier">ref</span><span class="special">(</span><span class="identifier">c</span><span class="special">)</span> <span class="special">+=</span> <span class="identifier">distance</span><span class="special">(</span><span class="identifier">_start</span><span class="special">,</span> <span class="identifier">_end</span><span class="special">)]</span>
            <span class="special">|</span>   <span class="identifier">eol</span>   <span class="special">[++</span><span class="identifier">ref</span><span class="special">(</span><span class="identifier">c</span><span class="special">),</span> <span class="special">++</span><span class="identifier">ref</span><span class="special">(</span><span class="identifier">l</span><span class="special">)]</span>
            <span class="special">|</span>   <span class="identifier">any</span>   <span class="special">[++</span><span class="identifier">ref</span><span class="special">(</span><span class="identifier">c</span><span class="special">)]</span>
            <span class="special">;</span>
    <span class="special">}</span>

    <span class="identifier">std</span><span class="special">::</span><span class="identifier">size_t</span> <span class="identifier">c</span><span class="special">,</span> <span class="identifier">w</span><span class="special">,</span> <span class="identifier">l</span><span class="special">;</span>
    <span class="identifier">lex</span><span class="special">::</span><span class="identifier">token_def</span><span class="special">&lt;&gt;</span> <span class="identifier">word</span><span class="special">,</span> <span class="identifier">eol</span><span class="special">,</span> <span class="identifier">any</span><span class="special">;</span>
<span class="special">};</span>
</pre>
<p>
        </p>
<p>
          Tokenizing the given input while discarding all generated tokens is a common
          application of a lexer. For this reason <span class="emphasis"><em>Spirit.Lex</em></span>
          exposes an API function <code class="computeroutput"><span class="identifier">tokenize</span><span class="special">()</span></code> that minimizes the code required:
        </p>
<pre class="programlisting"><span class="comment">// Read input from the given file</span>
<span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span> <span class="identifier">str</span> <span class="special">(</span><span class="identifier">read_from_file</span><span class="special">(</span><span class="number">1</span> <span class="special">==</span> <span class="identifier">argc</span> <span class="special">?</span> <span class="string">"word_count.input"</span> <span class="special">:</span> <span class="identifier">argv</span><span class="special">[</span><span class="number">1</span><span class="special">]));</span>

<span class="identifier">word_count_tokens</span><span class="special">&lt;</span><span class="identifier">lexer_type</span><span class="special">&gt;</span> <span class="identifier">word_count_lexer</span><span class="special">;</span>
<span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span><span class="special">::</span><span class="identifier">iterator</span> <span class="identifier">first</span> <span class="special">=</span> <span class="identifier">str</span><span class="special">.</span><span class="identifier">begin</span><span class="special">();</span>

<span class="comment">// Tokenize all the input, while discarding all generated tokens</span>
<span class="keyword">bool</span> <span class="identifier">r</span> <span class="special">=</span> <span class="identifier">tokenize</span><span class="special">(</span><span class="identifier">first</span><span class="special">,</span> <span class="identifier">str</span><span class="special">.</span><span class="identifier">end</span><span class="special">(),</span> <span class="identifier">word_count_lexer</span><span class="special">);</span>
</pre>
<p>
          This code is completely equivalent to the more verbose version shown in
          the section <a class="link" href="../tutorials/lexer_quickstart2.html" title="Quickstart 2 - A better word counter using Spirit.Lex">Lex
          Quickstart 2 - A better word counter using <span class="emphasis"><em>Spirit.Lex</em></span></a>.
          The <code class="computeroutput"><span class="identifier">tokenize</span><span class="special">()</span></code> function
          returns either when the end of the input has been reached (in which case
          the return value is <code class="computeroutput"><span class="keyword">true</span></code>), or when the
          lexer could not match any of the token definitions against the input (in
          which case the return value is <code class="computeroutput"><span class="keyword">false</span></code> and
          the iterator <code class="computeroutput"><span class="identifier">first</span></code> points to the first
          character of the input sequence that was not matched).
        </p>
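<p>
          For instance, the return value and the updated iterator can be used to
          report where tokenization stopped. The following is only a sketch, not
          part of the example code above; it assumes the <code class="computeroutput"><span class="identifier">word_count_lexer</span></code>
          instance and the input string <code class="computeroutput"><span class="identifier">str</span></code>
          from above, as well as a namespace alias <code class="computeroutput"><span class="identifier">lex</span></code>
          for <code class="computeroutput"><span class="identifier">boost</span><span class="special">::</span><span class="identifier">spirit</span><span class="special">::</span><span class="identifier">lex</span></code>:
        </p>
<pre class="programlisting">// Sketch only: report where tokenization stopped on failure
std::string::iterator it = str.begin();
std::string::iterator end = str.end();

if (!lex::tokenize(it, end, word_count_lexer)) {
    // 'it' now refers to the first character that could not be matched
    std::cerr &lt;&lt; "tokenization stopped at: \""
              &lt;&lt; std::string(it, end) &lt;&lt; "\"\n";
}
</pre>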
<p>
          The prototype of this function is:
        </p>
<pre class="programlisting"><span class="keyword">template</span> <span class="special">&lt;</span><span class="keyword">typename</span> <span class="identifier">Iterator</span><span class="special">,</span> <span class="keyword">typename</span> <span class="identifier">Lexer</span><span class="special">&gt;</span>
<span class="keyword">bool</span> <span class="identifier">tokenize</span><span class="special">(</span><span class="identifier">Iterator</span><span class="special">&amp;</span> <span class="identifier">first</span><span class="special">,</span> <span class="identifier">Iterator</span> <span class="identifier">last</span><span class="special">,</span> <span class="identifier">Lexer</span> <span class="keyword">const</span><span class="special">&amp;</span> <span class="identifier">lex</span>
  <span class="special">,</span> <span class="keyword">typename</span> <span class="identifier">Lexer</span><span class="special">::</span><span class="identifier">char_type</span> <span class="keyword">const</span><span class="special">*</span> <span class="identifier">initial_state</span> <span class="special">=</span> <span class="number">0</span><span class="special">);</span>
</pre>
<div class="variablelist">
<p class="title"><b>where:</b></p>
<dl class="variablelist">
<dt><span class="term">Iterator&amp; first</span></dt>
<dd><p>
                The beginning of the input sequence to tokenize. The lexer updates
                this iterator; after the function returns it points to the first
                character of the input that could not be matched.
              </p></dd>
<dt><span class="term">Iterator last</span></dt>
<dd><p>
                The end of the input sequence to tokenize.
              </p></dd>
<dt><span class="term">Lexer const&amp; lex</span></dt>
<dd><p>
                The lexer instance to use for tokenization.
              </p></dd>
<dt><span class="term">Lexer::char_type const* initial_state</span></dt>
<dd><p>
                This optional parameter can be used to specify the initial lexer
                state for tokenization.
              </p></dd>
</dl>
</div>
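<p>
          If the lexer associates token definitions with a lexer state other than
          the default <code class="computeroutput"><span class="string">"INITIAL"</span></code> state, the name of
          that state can be passed as the optional last argument. The following is
          only a sketch: the lexer instance <code class="computeroutput"><span class="identifier">my_lexer</span></code>
          and the state name <code class="computeroutput"><span class="string">"WS"</span></code> are hypothetical and
          have to correspond to a lexer which actually defines such a state:
        </p>
<pre class="programlisting">// Sketch only: start tokenization in a (hypothetical) lexer state "WS",
// which the lexer would have populated via this-&gt;self("WS") = ...;
std::string::iterator it = str.begin();
bool ok = lex::tokenize(it, str.end(), my_lexer, "WS");
</pre>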
<p>
          A second overload of the <code class="computeroutput"><span class="identifier">tokenize</span><span class="special">()</span></code> function allows an arbitrary
          function or function object to be specified, which is called for each of
          the generated tokens. For some applications this is very useful, as it
          may avoid the need for lexer semantic actions altogether. For an example
          of how to use this overload, please have a look at <a href="../../../../../example/lex/word_count_functor.cpp" target="_top">word_count_functor.cpp</a>:
        </p>
<p>
          The main function simply loads the given file into memory (as a <code class="computeroutput"><span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span></code>), creates an instance of the token
          definition template using the proper lexer type (<code class="computeroutput"><span class="identifier">word_count_tokens</span><span class="special">&lt;</span><span class="identifier">lex</span><span class="special">::</span><span class="identifier">lexertl</span><span class="special">::</span><span class="identifier">lexer</span><span class="special">&lt;&gt;</span> <span class="special">&gt;</span></code>), and finally calls <code class="computeroutput"><span class="identifier">lex</span><span class="special">::</span><span class="identifier">tokenize</span></code>, passing an instance of the
          counter function object. The return value of <code class="computeroutput"><span class="identifier">lex</span><span class="special">::</span><span class="identifier">tokenize</span><span class="special">()</span></code> will be <code class="computeroutput"><span class="keyword">true</span></code>
          if the whole input sequence has been successfully tokenized, and <code class="computeroutput"><span class="keyword">false</span></code> otherwise.
        </p>
<p>
</p>
<pre class="programlisting"><span class="keyword">int</span> <span class="identifier">main</span><span class="special">(</span><span class="keyword">int</span> <span class="identifier">argc</span><span class="special">,</span> <span class="keyword">char</span><span class="special">*</span> <span class="identifier">argv</span><span class="special">[])</span>
<span class="special">{</span>
    <span class="comment">// these variables are used to count characters, words and lines</span>
    <span class="identifier">std</span><span class="special">::</span><span class="identifier">size_t</span> <span class="identifier">c</span> <span class="special">=</span> <span class="number">0</span><span class="special">,</span> <span class="identifier">w</span> <span class="special">=</span> <span class="number">0</span><span class="special">,</span> <span class="identifier">l</span> <span class="special">=</span> <span class="number">0</span><span class="special">;</span>

    <span class="comment">// read input from the given file</span>
    <span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span> <span class="identifier">str</span> <span class="special">(</span><span class="identifier">read_from_file</span><span class="special">(</span><span class="number">1</span> <span class="special">==</span> <span class="identifier">argc</span> <span class="special">?</span> <span class="string">"word_count.input"</span> <span class="special">:</span> <span class="identifier">argv</span><span class="special">[</span><span class="number">1</span><span class="special">]));</span>

    <span class="comment">// create the token definition instance needed to invoke the lexical analyzer</span>
    <span class="identifier">word_count_tokens</span><span class="special">&lt;</span><span class="identifier">lex</span><span class="special">::</span><span class="identifier">lexertl</span><span class="special">::</span><span class="identifier">lexer</span><span class="special">&lt;&gt;</span> <span class="special">&gt;</span> <span class="identifier">word_count_functor</span><span class="special">;</span>

    <span class="comment">// tokenize the given string, the bound functor gets invoked for each of </span>
    <span class="comment">// the matched tokens</span>
    <span class="keyword">using</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">placeholders</span><span class="special">::</span><span class="identifier">_1</span><span class="special">;</span>
    <span class="keyword">char</span> <span class="keyword">const</span><span class="special">*</span> <span class="identifier">first</span> <span class="special">=</span> <span class="identifier">str</span><span class="special">.</span><span class="identifier">c_str</span><span class="special">();</span>
    <span class="keyword">char</span> <span class="keyword">const</span><span class="special">*</span> <span class="identifier">last</span> <span class="special">=</span> <span class="special">&amp;</span><span class="identifier">first</span><span class="special">[</span><span class="identifier">str</span><span class="special">.</span><span class="identifier">size</span><span class="special">()];</span>
    <span class="keyword">bool</span> <span class="identifier">r</span> <span class="special">=</span> <span class="identifier">lex</span><span class="special">::</span><span class="identifier">tokenize</span><span class="special">(</span><span class="identifier">first</span><span class="special">,</span> <span class="identifier">last</span><span class="special">,</span> <span class="identifier">word_count_functor</span><span class="special">,</span>
        <span class="identifier">boost</span><span class="special">::</span><span class="identifier">bind</span><span class="special">(</span><span class="identifier">counter</span><span class="special">(),</span> <span class="identifier">_1</span><span class="special">,</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">ref</span><span class="special">(</span><span class="identifier">c</span><span class="special">),</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">ref</span><span class="special">(</span><span class="identifier">w</span><span class="special">),</span> <span class="identifier">boost</span><span class="special">::</span><span class="identifier">ref</span><span class="special">(</span><span class="identifier">l</span><span class="special">)));</span>

    <span class="comment">// print results</span>
    <span class="keyword">if</span> <span class="special">(</span><span class="identifier">r</span><span class="special">)</span> <span class="special">{</span>
        <span class="identifier">std</span><span class="special">::</span><span class="identifier">cout</span> <span class="special">&lt;&lt;</span> <span class="string">"lines: "</span> <span class="special">&lt;&lt;</span> <span class="identifier">l</span> <span class="special">&lt;&lt;</span> <span class="string">", words: "</span> <span class="special">&lt;&lt;</span> <span class="identifier">w</span>
                  <span class="special">&lt;&lt;</span> <span class="string">", characters: "</span> <span class="special">&lt;&lt;</span> <span class="identifier">c</span> <span class="special">&lt;&lt;</span> <span class="string">"\n"</span><span class="special">;</span>
    <span class="special">}</span>
    <span class="keyword">else</span> <span class="special">{</span>
        <span class="identifier">std</span><span class="special">::</span><span class="identifier">string</span> <span class="identifier">rest</span><span class="special">(</span><span class="identifier">first</span><span class="special">,</span> <span class="identifier">last</span><span class="special">);</span>
        <span class="identifier">std</span><span class="special">::</span><span class="identifier">cout</span> <span class="special">&lt;&lt;</span> <span class="string">"Lexical analysis failed\n"</span> <span class="special">&lt;&lt;</span> <span class="string">"stopped at: \""</span>
                  <span class="special">&lt;&lt;</span> <span class="identifier">rest</span> <span class="special">&lt;&lt;</span> <span class="string">"\"\n"</span><span class="special">;</span>
    <span class="special">}</span>
    <span class="keyword">return</span> <span class="number">0</span><span class="special">;</span>
<span class="special">}</span>
</pre>
<p>
        </p>
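<p>
          The <code class="computeroutput"><span class="identifier">counter</span></code> function object bound above
          is not shown in this section. A simplified sketch of what it might look
          like follows; the token ids <code class="computeroutput"><span class="identifier">ID_WORD</span></code>,
          <code class="computeroutput"><span class="identifier">ID_EOL</span></code>, and <code class="computeroutput"><span class="identifier">ID_CHAR</span></code>
          are placeholders standing in for whatever ids were assigned to the
          corresponding token definitions:
        </p>
<pre class="programlisting">// Simplified sketch of a counter function object as bound above.
// ID_WORD, ID_EOL and ID_CHAR are placeholders for the token ids
// assigned to the corresponding token definitions.
struct counter
{
    typedef bool result_type;   // needed by boost::bind

    template &lt;typename Token&gt;
    bool operator()(Token const&amp; t, std::size_t&amp; c, std::size_t&amp; w
      , std::size_t&amp; l) const
    {
        switch (t.id()) {
        case ID_WORD:               // a word was matched
            ++w; c += t.value().size();
            break;
        case ID_EOL:                // an end of line was matched
            ++l; ++c;
            break;
        case ID_CHAR:               // any other character
            ++c;
            break;
        }
        return true;                // always continue tokenizing
    }
};
</pre>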
<p>
          Here is the prototype of this <code class="computeroutput"><span class="identifier">tokenize</span><span class="special">()</span></code> function overload:
        </p>
<pre class="programlisting"><span class="keyword">template</span> <span class="special">&lt;</span><span class="keyword">typename</span> <span class="identifier">Iterator</span><span class="special">,</span> <span class="keyword">typename</span> <span class="identifier">Lexer</span><span class="special">,</span> <span class="keyword">typename</span> <span class="identifier">F</span><span class="special">&gt;</span>
<span class="keyword">bool</span> <span class="identifier">tokenize</span><span class="special">(</span><span class="identifier">Iterator</span><span class="special">&amp;</span> <span class="identifier">first</span><span class="special">,</span> <span class="identifier">Iterator</span> <span class="identifier">last</span><span class="special">,</span> <span class="identifier">Lexer</span> <span class="keyword">const</span><span class="special">&amp;</span> <span class="identifier">lex</span><span class="special">,</span> <span class="identifier">F</span> <span class="identifier">f</span>
  <span class="special">,</span> <span class="keyword">typename</span> <span class="identifier">Lexer</span><span class="special">::</span><span class="identifier">char_type</span> <span class="keyword">const</span><span class="special">*</span> <span class="identifier">initial_state</span> <span class="special">=</span> <span class="number">0</span><span class="special">);</span>
</pre>
<div class="variablelist">
<p class="title"><b>where:</b></p>
<dl class="variablelist">
<dt><span class="term">Iterator&amp; first</span></dt>
<dd><p>
                The beginning of the input sequence to tokenize. The lexer updates
                this iterator; after the function returns it points to the first
                character of the input that could not be matched.
              </p></dd>
<dt><span class="term">Iterator last</span></dt>
<dd><p>
                The end of the input sequence to tokenize.
              </p></dd>
<dt><span class="term">Lexer const&amp; lex</span></dt>
<dd><p>
                The lexer instance to use for tokenization.
              </p></dd>
<dt><span class="term">F f</span></dt>
<dd><p>
                A function or function object to be called for each matched token.
                This function is expected to have the prototype: <code class="computeroutput"><span class="keyword">bool</span>
                <span class="identifier">f</span><span class="special">(</span><span class="identifier">Lexer</span><span class="special">::</span><span class="identifier">token_type</span><span class="special">);</span></code>.
                The <code class="computeroutput"><span class="identifier">tokenize</span><span class="special">()</span></code>
                function will return immediately if <code class="computeroutput"><span class="identifier">f</span></code>
                returns <code class="computeroutput"><span class="keyword">false</span></code>.
              </p></dd>
<dt><span class="term">Lexer::char_type const* initial_state</span></dt>
<dd><p>
                This optional parameter can be used to specify the initial lexer
                state for tokenization.
              </p></dd>
</dl>
</div>
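<p>
          Returning <code class="computeroutput"><span class="keyword">false</span></code> from the supplied function
          object can be used to abort tokenization early. The following sketch (not
          taken from the library's examples) stops after a fixed number of tokens:
        </p>
<pre class="programlisting">// Sketch only: stop tokenization after max_count tokens by returning
// false from the callback passed to tokenize()
struct stop_after
{
    explicit stop_after(std::size_t max_count)
      : count(0), max_count(max_count) {}

    template &lt;typename Token&gt;
    bool operator()(Token const&amp;)
    {
        return ++count &lt;= max_count;    // false makes tokenize() return
    }

    std::size_t count, max_count;
};

// usage, assuming the word_count_lexer instance from above:
// char const* it = str.c_str();
// bool r = lex::tokenize(it, it + str.size(), word_count_lexer, stop_after(10));
</pre>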
</div>
<table xmlns:rev="http://www.cs.rpi.edu/~gregod/boost/tools/doc/revision" width="100%"><tr>
<td align="left"></td>
<td align="right"><div class="copyright-footer">Copyright © 2001-2011 Joel de Guzman, Hartmut Kaiser<p>
        Distributed under the Boost Software License, Version 1.0. (See accompanying
        file LICENSE_1_0.txt or copy at <a href="http://www.boost.org/LICENSE_1_0.txt" target="_top">http://www.boost.org/LICENSE_1_0.txt</a>)
      </p>
</div></td>
</tr></table>
<hr>
<div class="spirit-nav">
<a accesskey="p" href="lexer_primitives/lexer_token_values.html"><img src="../../../../../../../doc/src/images/prev.png" alt="Prev"></a><a accesskey="u" href="../abstracts.html"><img src="../../../../../../../doc/src/images/up.png" alt="Up"></a><a accesskey="h" href="../../../index.html"><img src="../../../../../../../doc/src/images/home.png" alt="Home"></a><a accesskey="n" href="lexer_semantic_actions.html"><img src="../../../../../../../doc/src/images/next.png" alt="Next"></a>
</div>
</body>
</html>