[/==============================================================================
    Copyright (C) 2001-2011 Joel de Guzman
    Copyright (C) 2001-2011 Hartmut Kaiser

    Distributed under the Boost Software License, Version 1.0. (See accompanying
    file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
===============================================================================/]

[section:lexer_quickstart3 Quickstart 3 - Counting Words Using a Parser]

The whole purpose of integrating __lex__ into the __spirit__ library was to
provide a component that merges lexical analysis with the parsing process as
defined by a __spirit__ grammar. __spirit__ parsers read their input from an
input sequence accessed by iterators, so iterators were the natural choice for
the interface between the lexer and the parser. A second goal of the
lexer/parser integration was to enable the use of different lexical analyzer
libraries. Iterators are the right choice from this standpoint as well, because
they provide an abstraction layer hiding the implementation specifics of the
underlying lexer library. The [link spirit.lex.flowcontrol picture] below shows
the common flow of control implemented while parsing combined with lexical
analysis.

[fig flowofcontrol.png..The common flow of control implemented while parsing combined with lexical analysis..spirit.lex.flowcontrol]

Another problem related to the integration of the lexical analyzer with the
parser was to find a way to blend the defined tokens syntactically with the
grammar definition syntax of __spirit__. For tokens defined as instances of
the `token_def<>` class, the most natural integration was to allow these
instances to be used directly as parser components. Semantically, these parser
components succeed in matching their input whenever the corresponding token
type has been matched by the lexer. This quickstart example will demonstrate
this (and more) by counting words again, simply by adding up the counts inside
the semantic actions of a parser (for the full example code see here:
[@../../example/lex/word_count.cpp word_count.cpp]).


[import ../example/lex/word_count.cpp]


[heading Prerequisites]

This example uses two of the __spirit__ library components, __lex__ and
__qi__, so we have to `#include` the corresponding header files. Again, we
need to include a couple of header files from the __phoenix__ library. This
example shows how to attach functors to parser components, which could be done
using any C++ technique resulting in a callable object. Using __phoenix__ for
this task simplifies things and avoids adding dependencies to other libraries
(__phoenix__ is already in use for __spirit__ anyway).

[wcp_includes]

To make all the code below more readable, we introduce the following namespaces.

[wcp_namespaces]


[heading Defining Tokens]

Compared to the two previous quickstart examples (__sec_lex_quickstart_1__ and
__sec_lex_quickstart_2__), the token definition class for this example holds
no surprises. However, it uses lexer token definition macros (patterns) to
simplify the composition of the regular expressions, which will be described
in more detail in the section __fixme__. Generally, any token definition can
be used without modification either from a standalone lexical analyzer or in
conjunction with a parser.
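In short, a pattern macro associates a name with a regular expression, which
later token definitions can reference as `{NAME}`. The following is a rough
sketch of this mechanism in isolation (the class name `pattern_sketch` is
illustrative only; the pattern name `WORD` matches the one used by the full
token definition class shown below):

    // requires <boost/spirit/include/lex_lexertl.hpp>
    template <typename Lexer>
    struct pattern_sketch : lex::lexer<Lexer>
    {
        pattern_sketch()
        {
            // associate the name 'WORD' with a regular expression ...
            this->self.add_pattern("WORD", "[^ \t\n]+");

            // ... and reference it as '{WORD}' in a later token definition
            word = "{WORD}";
            this->self.add(word);
        }
        lex::token_def<std::string> word;
    };

The full token definition class of this example follows.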
[wcp_token_definition]


[heading Using Token Definition Instances as Parsers]

While the integration of lexer and parser in the control flow is achieved by
using special iterators wrapping the lexical analyzer, we still need a means
of expressing in the grammar what tokens to match and where. The token
definition class above uses three different ways of defining a token:

* Using an instance of a `token_def<>`, which is handy whenever you need to
  specify a token attribute (for more information about lexer related
  attributes please look here: __sec_lex_attributes__).
* Using a single character as the token; in this case the character represents
  itself, and the token id is the character's ASCII value.
* Using a regular expression represented as a string, where the token id needs
  to be specified explicitly to make the token accessible from the grammar
  level.

Each of the three token definition methods requires a different form of
grammar integration. But as you can see from the following code snippet, each
of them is straightforward and blends the corresponding token instances
naturally with the surrounding __qi__ grammar syntax.

[table
    [[Token definition]     [Parser integration]]
    [[`token_def<>`]        [The `token_def<>` instance is directly usable as a
                             parser component. Parsing of this component will
                             succeed if the regular expression used to define
                             it has been matched successfully.]]
    [[single character]     [The single character is directly usable in the
                             grammar. However, under certain circumstances it
                             needs to be wrapped by a `char_()` parser
                             component. Parsing of this component will succeed
                             if the single character has been matched.]]
    [[explicit token id]    [To use an explicit token id in a __qi__ grammar
                             you are required to wrap it with the special
                             `token()` parser component. Parsing of this
                             component will succeed if the current token has
                             the same token id as specified in the expression
                             `token(<id>)`.]]
]

The grammar definition below uses all three types, demonstrating their usage.

[wcp_grammar_definition]

As already described (see: __sec_attributes__), the __qi__ parser library
builds upon a set of fully attributed parser components. Consequently, all
token definitions support this attribute model as well. The most natural way
of implementing this was to use the token values as the attributes exposed by
the parser component corresponding to the token definition (you can read more
about this topic here: __sec_lex_tokenvalues__). The example above takes
advantage of the full integration of the token values as the `token_def<>`'s
parser attributes: the `word` token definition is declared as a
`token_def<std::string>`, making every instance of a `word` token carry the
string representation of the matched input sequence as its value. The semantic
action attached to `tok.word` receives this string (represented by the `_1`
placeholder) and uses it to calculate the number of matched characters:
`ref(c) += size(_1)`.


[heading Pulling Everything Together]

The main function needs to implement a bit more logic now, as we have to
initialize and start not only the lexical analysis but the parsing process as
well. The three type definitions (`typedef` statements) simplify the creation
of the lexical analyzer and the grammar.
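As a rough sketch, assuming the `lexertl`-based lexer engine is used and the
token definition class above is named `word_count_tokens` (as it is in the
full example source), these typedefs typically look like this:

    // the token type, listing all attribute types used for token_def<>'s
    // (here just std::string, the attribute of the 'word' token)
    typedef lex::lexertl::token<
        char const*, boost::mpl::vector<std::string>
    > token_type;

    // the lexer type, parametrized with the token type
    typedef lex::lexertl::lexer<token_type> lexer_type;

    // the iterator type exposed by the lexer and consumed by the parser
    typedef word_count_tokens<lexer_type>::iterator_type iterator_type;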
After reading the contents of the given file into memory, the main function
calls __api_tokenize_and_parse__ to initiate the combined lexical analysis and
parsing process.

[wcp_main]


[endsect]