[/==============================================================================
    Copyright (C) 2001-2011 Joel de Guzman
    Copyright (C) 2001-2011 Hartmut Kaiser

    Distributed under the Boost Software License, Version 1.0. (See accompanying
    file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
===============================================================================/]

[section:lexer_introduction Introduction to __lex__]

Lexical scanning is the process of analyzing the stream of input characters
and separating it into strings called tokens, which are most often delimited
by whitespace. Most compiler texts start here, and devote several chapters to
discussing various ways to build scanners. __lex__ is a library built to take
care of the complexities of creating a lexer for your grammar (in this
documentation we will use the terms 'lexical analyzer', 'lexer' and 'scanner'
interchangeably). All that is needed to create a lexer is to know the set of
patterns describing the different tokens you want to recognize in the input.
To make this a bit more formal, here are some definitions:

* A token is a sequence of consecutive characters having a collective meaning.
  Tokens may have attributes specific to the token type, carrying additional
  information about the matched character sequence.
* A pattern is a rule expressed as a regular expression and describing how a
  particular token can be formed. For example, [^\[A-Za-z\]\[A-Za-z_0-9\]*] is
  a pattern for a rule matching C++ identifiers.
* Characters between tokens are called whitespace; these include spaces, tabs,
  newlines, and formfeeds. Many people also count comments as whitespace,
  though since some tools such as lint look at comments, this method is not
  perfect.

[heading Why Use a Separate Lexer?]

Typically, lexical scanning is done in a separate module from the parser,
feeding the parser with a stream of input tokens only. Strictly speaking, it
is not necessary to implement this separation, as in the end there is only
one set of syntactical rules defining the language, so we could write the
whole parser in one module. In fact, __qi__ allows you to write parsers
without using a lexer, parsing the input character stream directly, and for
the most part this is the way __spirit__ has been used since its invention.

However, this separation has both a theoretical and a practical basis, and
proves to be very useful in practical applications. In 1956, Noam Chomsky
defined the "Chomsky Hierarchy" of grammars:

* Type 0: Unrestricted grammars (e.g., natural languages)
* Type 1: Context-Sensitive grammars
* Type 2: Context-Free grammars
* Type 3: Regular grammars

The complexity of these grammars increases from regular grammars being the
simplest to unrestricted grammars being the most complex. Similarly, the
complexity of pattern recognition for these grammars increases. Although a
few features of some programming languages (such as C++) are Type 1,
fortunately, for the most part, programming languages can be described using
only Types 2 and 3. The neat part about these two types is that they are well
known and the ways to parse them are well understood. It has been shown that
any regular grammar can be parsed using a state machine (finite automaton).
Similarly, context-free grammars can always be parsed using a push-down
automaton (essentially a state machine augmented by a stack).
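
To make the first of these statements concrete, consider the identifier
pattern from the definitions above. It can be recognized by a trivial two
state automaton, sketched below in plain C++ (the function [^is_identifier]
is invented for this illustration and is not part of __lex__):

    #include <cctype>
    #include <string>

    // Illustration only: the pattern [A-Za-z][A-Za-z_0-9]* expressed as a
    // hand written two state automaton. The current state is the only
    // memory required - no stack is involved.
    bool is_identifier(std::string const& s)
    {
        enum { start, in_id } state = start;
        for (std::string::size_type i = 0; i < s.size(); ++i)
        {
            unsigned char c = static_cast<unsigned char>(s[i]);
            if (state == start)             // first character: a letter
            {
                if (!std::isalpha(c))
                    return false;
                state = in_id;
            }
            else                            // then: letters, digits or '_'
            {
                if (!std::isalnum(c) && c != '_')
                    return false;
            }
        }
        return state == in_id;              // at least one character seen
    }

A push-down automaton differs from this only in maintaining an additional
stack, which is exactly what is needed to cope with the nesting allowed by
context-free grammars.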

In real programming languages and practical grammars, the parts that can be
handled as regular expressions tend to be the lower-level pieces, such as the
definition of an identifier or of an integer value:

    letter     := [a-zA-Z]
    digit      := [0-9]

    identifier := letter [ letter | digit ]*
    integer    := digit+

Higher level parts of practical grammars tend to be more complex and can't be
implemented using plain regular expressions. We need to store information on
the built-in hardware stack while recursing the grammar hierarchy, and that
is the preferred approach used for top-down parsing. Since it takes a
different kind of abstract machine to parse these two types of grammars, it
proved to be efficient to factor the lexical scanner out into a separate
module built around the idea of a state machine. The goal here is to use the
simplest parsing technique needed for the job.

Another, more practical, reason for separating the scanner from the parser is
the need for backtracking during parsing. The input data is a stream of
characters, which is often thought to be processed left to right without any
backtracking. Unfortunately, in practice most of the time that isn't
possible. Almost every language has certain keywords such as IF, FOR, and
WHILE. The decision whether a certain character sequence actually comprises a
keyword or just an identifier often can be made only after seeing the first
delimiter /after/ it. This effectively makes the process a backtracking one,
since we need to store the string long enough to be able to make the
decision. The same is true for more coarse grained language features such as
nested IF/ELSE statements, where the decision as to which IF the last ELSE
belongs can be made only after seeing the whole construct.

So the structure of a conventional compiler often involves splitting up the
functions of the lower-level and higher-level parsing. The lexical scanner
deals with things at the character level, collecting characters into strings,
converting character sequences into other representations such as integers,
etc., and passing them along to the parser proper as indivisible tokens. It
is also considered normal to let the scanner do additional jobs, such as
identifying keywords, storing identifiers in tables, etc.

Now, __spirit__ follows this structure, where __lex__ can be used to
implement state machine based recognizers, while __qi__ can be used to build
recognizers for context-free grammars. Since both modules are seamlessly
integrated with each other and with the C++ target language, it is even
possible to use the provided functionality to build more complex grammar
recognizers.

[heading Advantages of using __lex__]

The advantage of using __lex__ to create the lexical analyzer over using more
traditional tools such as __flex__ is its carefully crafted integration with
the __spirit__ library and the C++ host language. You don't need any external
tools to generate the code, and your lexer is perfectly integrated with the
rest of your program, making it possible to freely access any context
information and data structures. Since the C++ compiler sees all the code, it
will generate optimal code no matter what configuration options have been
chosen by the user. __lex__ gives you the vast majority of features you could
get from a similar __flex__ program without the need to leave C++ as a host
language:

* The definition of tokens is done using regular expressions (patterns)
* The token definitions can refer to special substitution strings (pattern
  macros) simplifying pattern definitions
* The generated lexical scanner may have multiple start states
* It is possible to attach code to any of the token definitions; this code
  gets executed whenever the corresponding token pattern has been matched

Even though it is possible to use __lex__ to generate C++ code representing
the lexical analyzer (we will refer to that as the /static/ model, described
in more detail in the section __sec_lex_static_model__) - a model very
similar to the way __flex__ operates - we will mainly focus on the /dynamic/
model. You can directly integrate the token definitions into your C++
program, building the lexical analyzer dynamically at runtime. The dynamic
model is something not supported by __flex__ or other lexical scanner
generators (such as __re2c__, __ragel__, etc.). This dynamic flexibility
allows you to speed up the development of your application.
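
To give a first impression of the dynamic model, here is a minimal sketch
using the lexertl-based lexer shipping with __lex__. The token ids and the
names ([^example_tokens], [^ID_IDENTIFIER], etc.) are made up for this
illustration; the patterns correspond to the identifier/integer definitions
shown earlier:

    #include <boost/spirit/include/lex_lexertl.hpp>
    #include <iostream>
    #include <string>

    namespace lex = boost::spirit::lex;

    // token identifiers, starting above the range reserved by the library
    enum token_ids
    {
        ID_IDENTIFIER = lex::min_token_id + 1,
        ID_INTEGER,
        ID_WS
    };

    // the token definitions are plain strings handed to the lexer object
    // at runtime - no external code generation step is involved
    template <typename Lexer>
    struct example_tokens : lex::lexer<Lexer>
    {
        example_tokens()
        {
            this->self.add
                ("[a-zA-Z][a-zA-Z_0-9]*", ID_IDENTIFIER)
                ("[0-9]+", ID_INTEGER)
                ("[ \t\n]+", ID_WS)
            ;
        }
    };

    // invoked for every token recognized in the input
    struct print_token_id
    {
        template <typename Token>
        bool operator()(Token const& t) const
        {
            std::cout << t.id() << std::endl;
            return true;        // returning false would abort tokenization
        }
    };

    int main()
    {
        typedef lex::lexertl::token<char const*> token_type;
        typedef lex::lexertl::lexer<token_type> lexer_type;

        example_tokens<lexer_type> lexer;   // the lexer is built at runtime

        std::string input("somevar 42");
        char const* first = input.c_str();
        char const* last = &first[input.size()];

        return lex::tokenize(first, last, lexer, print_token_id()) ? 0 : 1;
    }

Since the lexer tables are computed when the [^example_tokens] object is
constructed, the set of patterns could just as well be assembled at runtime,
for instance from configuration data - something a static table generator
such as __flex__ cannot offer.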

[heading The Library Structure of __lex__]

The [link spirit.lexerflow figure] below shows a high level overview of how
the __lex__ library might be used in an application. __lex__ allows you to
create lexical analyzers based on patterns. These patterns are regular
expression based rules used to define the different tokens to be recognized
in the character input sequence. The input sequence is expected to be
provided to the lexical analyzer as an arbitrary standard forward iterator.
The lexical analyzer itself exposes a standard forward iterator as well. The
difference here is that the exposed iterator provides access to the token
sequence instead of the character sequence. The tokens in this sequence are
constructed on the fly by analyzing the underlying character sequence and
matching it to the patterns as defined by the application.

[fig lexerflow.png..The Library structure and Common Flow of Information while using __lex__ in an application..spirit.lexerflow]

[endsect]