[/==============================================================================
    Copyright (C) 2001-2011 Joel de Guzman
    Copyright (C) 2001-2011 Hartmut Kaiser

    Distributed under the Boost Software License, Version 1.0. (See accompanying
    file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
===============================================================================/]

[section:lexer_token_values About Tokens and Token Values]

As already discussed, lexical scanning is the process of analyzing the stream
of input characters and separating it into strings called tokens, most of the
time separated by whitespace. The different token types recognized by a
lexical analyzer often get assigned unique integer token identifiers (token
ids). These token ids are normally used by the parser to identify the current
token without having to look at the matched string again. The __lex__ library
is no different in this respect, as it uses token ids as the main means of
identifying the different token types defined for a particular lexical
analyzer. However, it differs from commonly used lexical analyzers in that it
returns (references to) instances of a (user defined) token class to the
user. The only requirement imposed on this token class is that it must carry
at least the token id of the token it represents. For more information about
the interface a user defined token type has to expose please look at the
__sec_ref_lex_token__ reference. The library provides a default token type
based on the __lexertl__ library which should be sufficient in most cases:
the __class_lexertl_token__ type. This section focuses on the general
features a token class may implement and how these integrate with the other
parts of the __lex__ library.

[heading The Anatomy of a Token]

It is very important to understand the difference between a token definition
(represented by the __class_token_def__ template) and a token itself (for
instance represented by the __class_lexertl_token__ template).

The token definition is used to describe the main features of a particular
token type, especially:

* to simplify the definition of a token type using a regular expression
  pattern applied while matching this token type,
* to associate a token type with a particular lexer state,
* to optionally assign a token id to a token type,
* to optionally associate some code to execute whenever an instance of this
  token type has been matched,
* and to optionally specify the attribute type of the token value.

The token itself is a data structure returned by the lexer iterators.
Dereferencing a lexer iterator returns a reference to the last matched token
instance. It encapsulates the part of the underlying input sequence matched
by the regular expression used during the definition of this token type.
Incrementing the lexer iterator invokes the lexical analyzer to match the
next token by advancing the underlying input stream. The token data structure
contains at least the token id of the matched token type, allowing the
matched character sequence to be identified. Optionally, the token instance
may contain a token value and/or the lexer state it was matched in.
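
To illustrate, here is a small self-contained program sketching this
interface. The lexer shown (`word_lexer`, its patterns, and the input string)
is made up for the purpose of this illustration; it assumes the default token
configuration described below, where the token value holds a pair of
iterators pointing to the matched input.

    #include <boost/spirit/include/lex_lexertl.hpp>
    #include <iostream>
    #include <string>

    namespace lex = boost::spirit::lex;

    // a made-up lexer recognizing three token types, for illustration only
    template <typename Lexer>
    struct word_lexer : lex::lexer<Lexer>
    {
        word_lexer()
          : word("[a-zA-Z]+"), number("[0-9]+"), ws("[ \\t\\n]+")
        {
            this->self = word | number | ws;
        }
        lex::token_def<> word, number, ws;
    };

    int main()
    {
        // the default token configuration: iterator range as token value
        typedef lex::lexertl::token<char const*> token_type;
        typedef lex::lexertl::lexer<token_type> lexer_type;

        word_lexer<lexer_type> lexer;

        std::string input("spirit 42");
        char const* first = input.c_str();
        char const* last = &first[input.size()];

        lexer_type::iterator_type iter = lexer.begin(first, last);
        lexer_type::iterator_type end = lexer.end();

        // dereferencing the iterator returns the last matched token, which
        // carries at least the token id of the matched token type
        while (iter != end && token_is_valid(*iter))
        {
            std::cout << iter->id() << ": '"
                      << std::string(iter->value().begin(), iter->value().end())
                      << "'\n";
            ++iter;
        }
        return 0;
    }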
The following [link spirit.lex.tokenstructure figure] shows the schematic
structure of a token.

[fig tokenstructure.png..The structure of a token..spirit.lex.tokenstructure]

The token value and the lexer state the token has been recognized in may be
omitted for optimization reasons, thus avoiding the need for the token to
carry more data than actually required. This configuration can be achieved
by supplying appropriate template parameters for the __class_lexertl_token__
template while defining the token type.

The lexer iterator returns the same token type for each of the different
matched token definitions. To accommodate the possibly different token
/value/ types exposed by the various token types (token definitions), the
general type of the token value is a __boost_variant__. At a minimum (for the
default configuration) this token value variant will be configured to always
hold a __boost_iterator_range__ containing the pair of iterators pointing to
the matched input sequence for this token instance.

[note If the lexical analyzer is used in conjunction with a __qi__ parser,
      the stored __boost_iterator_range__ token value will be converted to
      the requested token type (parser attribute) exactly once. This happens
      at the time of the first access to the token value requiring the
      corresponding type conversion. The converted token value will be stored
      in the __boost_variant__ replacing the initially stored iterator range.
      This avoids having to convert the input sequence to the token value
      more than once, thus optimizing the integration of the lexer with
      __qi__, even during parser backtracking.
]

Here is the template prototype of the __class_lexertl_token__ template:

    template <
        typename Iterator = char const*,
        typename AttributeTypes = mpl::vector0<>,
        typename HasState = mpl::true_
    >
    struct lexertl_token;

[variablelist where:
    [[Iterator]       [This is the type of the iterator used to access the
                       underlying input stream. It defaults to a plain
                       `char const*`.]]
    [[AttributeTypes] [This is either an mpl sequence containing all
                       attribute types used for the token definitions or the
                       type `omit`. If the mpl sequence is empty (which is
                       the default), all token instances will store a
                       __boost_iterator_range__`<Iterator>` pointing to the
                       start and the end of the matched section in the input
                       stream. If the type is `omit`, the generated tokens
                       will contain no token value (attribute) at all.]]
    [[HasState]       [This is either `mpl::true_` or `mpl::false_`,
                       controlling whether the generated token instances will
                       contain the lexer state they were generated in. The
                       default is `mpl::true_`, so all token instances will
                       contain the lexer state.]]
]

During construction, a token instance holds the __boost_iterator_range__ as
its token value, unless it has been defined using the `omit` token value
type. This iterator range is then converted in place to the requested token
value type (attribute) when it is accessed for the first time.
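
For instance, the following typedefs sketch three possible configurations,
written in terms of `lex::lexertl::token`, the class the
__class_lexertl_token__ references above refer to. The typedef names are
arbitrary, and the `omit` tag is assumed here to be accessible as
`lex::omit`.

    #include <boost/spirit/include/lex_lexertl.hpp>
    #include <boost/mpl/vector.hpp>
    #include <string>

    namespace lex = boost::spirit::lex;
    namespace mpl = boost::mpl;

    typedef char const* base_iterator_type;

    // default configuration: the token value always holds an iterator range
    // pointing to the matched input, and the lexer state is stored as well
    typedef lex::lexertl::token<base_iterator_type> default_token_type;

    // the token value may additionally hold std::string or unsigned int
    // values (on top of the iterator range held initially)
    typedef lex::lexertl::token<
        base_iterator_type, mpl::vector<std::string, unsigned int>
    > valued_token_type;

    // smallest possible token: no token value and no lexer state at all
    typedef lex::lexertl::token<
        base_iterator_type, lex::omit, mpl::false_
    > minimal_token_type;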
[heading The Physiognomy of a Token Definition]

The token definitions (represented by the __class_token_def__ template) are
normally used as part of the definition of the lexical analyzer. At the same
time a token definition instance may be used as a parser component in __qi__.

The template prototype of this class is shown here:

    template<
        typename Attribute = unused_type,
        typename Char = char
    >
    class token_def;

[variablelist where:
    [[Attribute] [This is the type of the token value (attribute)
                  supported by token instances representing this token
                  type. This attribute type is exposed to the __qi__
                  library whenever this token definition is used as a
                  parser component. The default attribute type is
                  `unused_type`, which means the token instance holds a
                  __boost_iterator_range__ pointing to the start
                  and the end of the matched section in the input stream.
                  If the attribute is `omit` the token instance will
                  expose no token value at all. Any other type will be
                  used directly as the token value type.]]
    [[Char]      [This is the value type of the iterator for the
                  underlying input sequence. It defaults to `char`.]]
]

The semantics of the template parameters for the token type and the token
definition type are very similar and interdependent. As a rule of thumb you
can think of the token definition type as the means of specifying everything
related to a single specific token type (such as `identifier` or `integer`).
On the other hand, the token type is used to define the general properties
of all token instances generated by the __lex__ library.

[important If you don't list any token value types in the declaration of the
    token type (resulting in the usage of the default __boost_iterator_range__
    token value type) everything will compile and work just fine, just a bit
    less efficiently. This is because the token value will be converted from
    the matched input sequence every time it is requested.

    But as soon as you specify at least one token value type while defining
    the token type you'll have to list all value types used for
    __class_token_def__ declarations in the token type as well, otherwise
    compilation errors will occur.
]
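
The following condensed sketch illustrates this rule. The names (`my_tokens`
and its patterns) are made up, modeled after the example referenced in the
next section; the point is that both value types used by the token
definitions appear in the token type's attribute list.

    #include <boost/spirit/include/lex_lexertl.hpp>
    #include <boost/mpl/vector.hpp>
    #include <string>

    namespace lex = boost::spirit::lex;
    namespace mpl = boost::mpl;

    // two token definitions exposing explicit token value types
    template <typename Lexer>
    struct my_tokens : lex::lexer<Lexer>
    {
        my_tokens()
          : identifier("[a-zA-Z_][a-zA-Z0-9_]*")
          , constant("[0-9]+")
        {
            this->self = identifier | constant;
        }

        lex::token_def<std::string>  identifier;   // token value: std::string
        lex::token_def<unsigned int> constant;     // token value: unsigned int
    };

    // because std::string and unsigned int are used above, both types have
    // to be listed for the token type; removing either one from the
    // mpl::vector leads to compilation errors
    typedef lex::lexertl::token<
        char const*, mpl::vector<std::string, unsigned int>
    > token_type;
    typedef lex::lexertl::lexer<token_type> lexer_type;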
[heading Examples of using __class_lexertl_token__]

Let's start with some examples. We refer to one of the __lex__ examples (for
the full source code of this example please see
[@../../example/lex/example4.cpp example4.cpp]).

[import ../example/lex/example4.cpp]

The first code snippet shows an excerpt of the token definition class: the
definition of a couple of token types. Some of the token types do not expose
a special token value (`if_`, `else_`, and `while_`). Their token value will
always hold the iterator range of the matched input sequence. The token
definitions for the `identifier` and the integer `constant` are specialized
to expose an explicit token value type each: `std::string` and
`unsigned int`.

[example4_token_def]

As the parsers generated by __qi__ are fully attributed, every __qi__ parser
component needs to expose a certain type as its parser attribute. Naturally,
the __class_token_def__ exposes the token value type as its parser attribute,
enabling a smooth integration with __qi__.

The next code snippet demonstrates how the required token value types are
specified while defining the token type to use. All of the token value types
used for at least one of the token definitions have to be listed while
defining the token type as well.

[example4_token]

To prevent a token from having a token value at all, the special tag `omit`
can be used: `token_def<omit>` and `lexertl_token<base_iterator_type, omit>`.

[endsect]