• Home
  • Line#
  • Scopes#
  • Navigate#
  • Raw
  • Download
1<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
2<html>
3  <head>
4    <meta content=
5    "HTML Tidy for Windows (vers 1st February 2003), see www.w3.org"
6          name="generator">
7    <title>
8      Basic Concepts
9    </title>
10    <meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
11    <link rel="stylesheet" href="theme/style.css" type="text/css">
12  </head>
13  <body>
14    <table width="100%" border="0" background="theme/bkd2.gif" cellspacing="2">
15      <tr>
16        <td width="10"></td>
17        <td width="85%">
18          <font size="6" face="Verdana, Arial, Helvetica, sans-serif"><b>Basic
19          Concepts</b></font>
20        </td>
21        <td width="112">
22          <a href="http://spirit.sf.net"><img src="theme/spirit.gif"
23               width="112" height="48" align="right" border="0"></a>
24        </td>
25      </tr>
26    </table><br>
27    <table border="0">
28      <tr>
29        <td width="10"></td>
30        <td width="30">
31          <a href="../index.html"><img src="theme/u_arr.gif" border="0"></a>
32        </td>
33        <td width="30">
34          <a href="quick_start.html"><img src="theme/l_arr.gif" border="0"></a>
35        </td>
36        <td width="30">
37          <a href="organization.html"><img src="theme/r_arr.gif" border="0">
38          </a>
39        </td>
40      </tr>
41    </table>
42    <p>
43      There are a few fundamental concepts that need to be understood well: 1)
44      The <strong>Parser</strong>, 2) <strong>Match</strong>, 3) The
45      <strong>Scanner</strong>, and 4) <strong>Semantic Actions</strong>. These
46      basic concepts interact with one another, and the functionalities of each
47      interweave throughout the framework to make it one coherent whole.
48    </p>
49    <table width="48%" border="0" align="center">
50      <tr>
51        <td height="211">
52          <img src="theme/intro1.png">
53        </td>
54      </tr>
55    </table>
56    <h2>
57      The Parser
58    </h2>
59    <p>
60      Central to the framework is the parser. The parser does the actual work
61      of recognizing a linear input stream of data read sequentially from start
62      to end by the scanner. The parser attempts to match the input following a
63      well-defined set of specifications known as grammar rules. The parser
64      reports the success or failure to its client through a match object. When
65      successful, the parser calls a client-supplied semantic action. Finally,
66      the semantic action extracts structural information depending on the data
67      passed by the parser and the hierarchical context of the parser it is
68      attached to.
69    </p>
70    <p>
71      Parsers come in different flavors. The Spirit framework comes bundled
72      with an extensive set of pre-defined parsers that perform various parsing
73      tasks from the trivial to the complex. The parser, as a concept, has a
74      public conceptual interface contract. Following the contract, anyone can
75      write a conforming parser that will play along well with the framework's
76      predefined components. We shall provide a blueprint detailing the
77      conceptual interface of the parser later.
78    </p>
79    <p>
80      Clients of the framework generally do not need to write their own
81      hand-coded parsers at all. Spirit has an immense repertoire of
82      pre-defined parsers covering all aspects of syntax and semantic analysis.
83      We shall examine this repertoire of parsers in the following sections. In
84      the rare case where a specific functionality is not available, it is
85      extremely easy to write a user-defined parser. The ease in writing a
86      parser entity is the main reason for Spirit's extensibility.
87    </p>
88    <h2>
89      Primitives and Composites
90    </h2>
91    <p>
92      Spirit parsers fall into two categories: <b>primitives</b> and
93      <b>composites</b>. These two categories are more or less synonymous to
94      terminals and non-terminals in parsing lingo. Primitives are
95      non-decomposable atomic units. Composites on the other hand are parsers
96      that are composed of other parsers which can in turn be a primitive or
97      another composite. To illustrate, consider the Spirit expression:
98    </p>
99
100<pre><code><font color="#000000">    </font></code><code><span class="identifier">real_p</span> <span class=
101"special">&gt;&gt;</span> <span class="special">*(</span><span class=
102"literal">','</span> <span class="special">&gt;&gt;</span> <span class=
103"identifier">real_p</span><span class="special">)</span></code>
104</pre>
105<p>
106      <tt><tt>real_p</tt></tt> is a primitive parser that can parse real
107      numbers. The quoted comma <tt class="quotes">','</tt> in the expression
108      is a shortcut and is equivalent to <tt>ch_p<span class=
109      "operators">(</span><span class="quotes">','</span><span class=
110      "operators">)</span></tt>, which is another primitive parser that
111      recognizes single characters.
112    </p>
113    <p>
114      The expression above corresponds to the following parse tree:
115    </p>
116    <table width="29%" border="0" align="center">
117      <tr>
118        <td>
119          <img src="theme/intro7.png">
120        </td>
121      </tr>
122    </table>
123    <p>
124      The expression:
125    </p>
126
127<pre><code><font color="#000000">    </font></code><span class=
128     "literal">','</span> <span class="special">&gt;&gt;</span> <span class=
129     "identifier">real_p</span>
130</pre>
131<p>
132      composes a <b>sequence</b> parser. The <tt>sequence</tt> parser is a
133      composite parser comprising two parsers: the one on its left hand side
134      (lhs), <tt>ch_p<span class="operators">(</span><span class=
135      "quotes">','</span><span class="operators">)</span></tt> ; and the other
136      on its right hand side (rhs), <tt>real_p</tt>. This composite parser,
137      when called, calls its lhs and rhs in sequence and reports a successful
138      match only if both are successful.
139    </p>
140    <table width="14%" border="0" align="center">
141      <tr>
142        <td>
143          <img src="theme/intro2.png">
144        </td>
145      </tr>
146    </table>
147    <p>
148      The <tt>sequence</tt> parser is a binary composite. It is composed of two
149      parsers. There are unary composites as well. Unary composites hold only a
150      single subject. Like the binary composite, the unary composite may change
151      the behavior of its embedded subject. One particular example is the
152      <b>Kleene star</b>. The Kleene star, when called to parse, calls its sole
153      subject zero or more times. "Zero or more" implies that the Kleene star
154      always returns a successful match, possibly matching the null string: "".
155    </p>
156    <p>
157      The expression:
158    </p>
159
160<pre><code><font color="#000000">    </font></code><code><span class=
161"special">*(</span><span class="literal">','</span> <span class=
162"special">&gt;&gt;</span> <span class=
163      "identifier">real_p</span><span class="special">)</span></code>
164</pre>
165    <p>
166      wraps the whole sequence composite above inside a <tt>kleene_star</tt>.
167    </p>
168    <table width="17%" border="0" align="center">
169      <tr>
170        <td>
171          <img src="theme/intro3.png">
172        </td>
173      </tr>
174    </table>
175    <p>
176      Finally, the full expression composes a <tt>real_p</tt> primitive parser
177      and the <tt>kleene_star</tt> we have above into another higher level
178      <tt>sequence</tt> parser composite.
179    </p>
180    <table width="34%" border="0" align="center">
181      <tr>
182        <td>
183          <img src="theme/intro4.png">
184        </td>
185      </tr>
186    </table>
187    <p>
188      A few simple classes, when composed and structured in a hierarchy, form a
189      very powerful object-oriented recursive-descent parsing engine. These
190      classes provide the infrastructure needed for the construction of
191      more-complex parsers. The final parser composite is a non-deterministic
192      recursive-descent parser with infinite look-ahead.
193    </p>
194    <p>
195      Top-down descent traverses the hierarchy. The outer <tt>sequence</tt>
196      calls the leftmost <tt>real_p</tt> parser. If successful, the
197      <tt>kleene_star</tt> is called next. The <tt>kleene_star</tt> calls the
198      inner <tt>sequence</tt> repeatedly in a loop until it fails to match, or
199      the input is exhausted. Inside, <tt>ch_p(',')</tt> and then
200      <tt>real_p</tt> are called in sequence. The following diagram illustrates
201      what is happening, somewhat reminiscent of Pascal syntax diagrams.
202    </p>
203    <table width="37%" border="0" align="center">
204      <tr>
205        <td>
206          <img src="theme/intro5.png">
207        </td>
208      </tr>
209    </table>
210    <p>
211      The flexibility of object embedding and composition combined with
212      recursion opens up a unique approach to parsing. Subclasses are free to
213      form aggregates and algorithms of arbitrary complexity. Complex parsers
214      can be created with the composition of only a few primitive classes.
215    </p>
216    <p>
217      The framework is designed to be fully open-ended and extensible. New
218      primitives or composites, from the trivial to the complex, may be added
219      any time. Composition happens (statically) at compile time. This is
220      possible through the expressive flexibility of C++ expression templates
221      and template meta-programming.
222    </p>
223    <p>
224      The result is a composite composed of primitives and smaller composites.
225      This embedding strategy gives us the ability to build hierarchical
226      structures that fully model EBNF expressions of arbitrary complexity.
227      Later on, we shall see more primitive and composite building blocks.
228    </p>
229    <h2>
230      The Scanner
231    </h2>
232    <p>
233      Like the parser, the scanner is also an abstract concept. The task of the
234      scanner is to feed the sequential input data stream to the parser. The
235      scanner is composed of two STL conforming forward iterators, first and
236      last, where first is held by reference and last, by value. The first
237      iterator is held by reference to allow re-positioning by the parser. A
238      set of policies governs how the scanner behaves. Parsers extract data
239      from the scanner and position the iterator appropriately through its
240      member functions.
241    </p>
242    <p>
243      Knowledge of the intricacies of these policies is not required at all in
244      most cases. However, knowledge of the scanner's basic API is required to
245      write fully-conforming Spirit parsers. The scanner's API will be outlined
246      in a separate section. In addition, for the power users and the
247      adventurous among us, a full section will be devoted to covering the
248      scanner policies. The scanner policies make Spirit very flexible and
249      extensible. For instance, some of the policies may be modified to filter
250      data. A practical example is a scanner policy that does not distinguish
251      upper and lower case whereby making it useful for parsing case
252      insensitive input. Another example is a scanner policy that strips white
253      spaces from the input.
254    </p>
255    <h2>
256      The Match
257    </h2>
258    <p>
259      The parser has a conceptual parse member function taking in a scanner and
260      returning a match object. The primary function of the match object is to
261      report parsing success (or failure) back to the parser's caller; i.e., it
262      evaluates to true if the parse function is successful, false otherwise.
263      If the parse is successful, the match object may also be queried to
264      report the number of characters matched (using <tt>match.length()</tt>).
265      The length is non-negative if the match is successful, and the typical
266      length of a parse failure is -1. A zero length is perfectly valid and
267      still represents a successful match.
268    </p>
269    <p>
270      Parsers may have attribute data associated with it. For example, the
271      real_p parser has a numeric datum associated with it. This attribute is
272      the parsed number. This attribute is passed on to the returned match
273      object. The match object may be queried to get this attribute. This datum
274      is valid only when the match is successful.
275    </p>
276    <h2>
277      Semantic Actions
278    </h2>
279    <p>
280      A composite parser forms a hierarchy. Parsing proceeds from the topmost
281      parent parser which delegates and apportions the parsing task to its
282      children recursively to its children's children and so on until a
283      primitive is reached. By attaching semantic actions to various points in
284      this hierarchy, in effect we can transform the flat linear input stream
285      into a structured representation. This is essentially what parsers do.
286    </p>
287    <p>
288      Recall our example above:
289    </p>
290
291<pre><code><font color="#000000">    </font></code><code><span class=
292"identifier">real_p</span> <span class=
293      "special">&gt;&gt;</span> <span class="special">*(</span><span class=
294      "literal">','</span> <span class="special">&gt;&gt;</span> <span class=
295      "identifier">real_p</span><span class="special">)</span></code>
296</pre>
297<p>
298      By hooking a function (or functor) into the real_p parsers, we can
299      extract the numbers from the input:
300    </p>
301
302<pre><code><font color="#000000">    </font></code><span class=
303"identifier">real_p</span><span class="special">[&amp;</span><span class=
304"identifier">f</span><span class="special">]</span> <span class=
305"special">&gt;&gt;</span> <span class="special">*(</span><span class=
306"literal">','</span> <span class="special">&gt;&gt;</span> <span class=
307"identifier">real_p</span><span class="special">[&amp;</span><span class=
308"identifier">f</span><span class="special">])</span>
309</pre>
310<table width="41%" border="0" align="center">
311      <tr>
312        <td>
313          <img src="theme/intro6.png">
314        </td>
315      </tr>
316    </table>
317
318<p> where <tt>f</tt> is a function that takes in a single argument. The <tt><span class="operators">[&amp;</span>f<span class=
319      "operators">]</span></tt> hooks the parser with the function such that when
320  <tt>real_p</tt> recognizes a valid number, the function <tt>f</tt> is called.
321  It is up to the function then to do what is appropriate. For example, it can
322  stuff the numbers in a vector. Or perhaps, if the grammar is changed slightly
323  by replacing <tt class="quotes">','</tt> with <tt class="quotes">'+'</tt>, then
324  we have a primitive calculator that computes sums. The function <tt>f</tt> then
325  can then be made to add all incoming numbers.<br>
326</p>
327<table border="0">
328      <tr>
329        <td width="10"></td>
330        <td width="30">
331          <a href="../index.html"><img src="theme/u_arr.gif" border="0"></a>
332        </td>
333        <td width="30">
334          <a href="quick_start.html"><img src="theme/l_arr.gif" border="0"></a>
335        </td>
336        <td width="30">
337          <a href="organization.html"><img src="theme/r_arr.gif" border="0">
338          </a>
339        </td>
340      </tr>
341    </table><br>
342    <hr size="1">
343    <p class="copyright">
344      Copyright &copy; 1998-2003 Joel de Guzman<br>
345      <br>
346      <font size="2">Use, modification and distribution is subject to the Boost
347      Software License, Version 1.0. (See accompanying file LICENSE_1_0.txt or
348      copy at http://www.boost.org/LICENSE_1_0.txt)</font>
349    </p>
350    <p>&nbsp;
351
352    </p>
353  </body>
354</html>
355