1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" 2 "http://www.w3.org/TR/html4/strict.dtd"> 3 4<html> 5<head> 6 <title>Kaleidoscope: Implementing a Parser and AST</title> 7 <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> 8 <meta name="author" content="Chris Lattner"> 9 <link rel="stylesheet" href="../llvm.css" type="text/css"> 10</head> 11 12<body> 13 14<h1>Kaleidoscope: Implementing a Parser and AST</h1> 15 16<ul> 17<li><a href="index.html">Up to Tutorial Index</a></li> 18<li>Chapter 2 19 <ol> 20 <li><a href="#intro">Chapter 2 Introduction</a></li> 21 <li><a href="#ast">The Abstract Syntax Tree (AST)</a></li> 22 <li><a href="#parserbasics">Parser Basics</a></li> 23 <li><a href="#parserprimexprs">Basic Expression Parsing</a></li> 24 <li><a href="#parserbinops">Binary Expression Parsing</a></li> 25 <li><a href="#parsertop">Parsing the Rest</a></li> 26 <li><a href="#driver">The Driver</a></li> 27 <li><a href="#conclusions">Conclusions</a></li> 28 <li><a href="#code">Full Code Listing</a></li> 29 </ol> 30</li> 31<li><a href="LangImpl3.html">Chapter 3</a>: Code generation to LLVM IR</li> 32</ul> 33 34<div class="doc_author"> 35 <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p> 36</div> 37 38<!-- *********************************************************************** --> 39<h2><a name="intro">Chapter 2 Introduction</a></h2> 40<!-- *********************************************************************** --> 41 42<div> 43 44<p>Welcome to Chapter 2 of the "<a href="index.html">Implementing a language 45with LLVM</a>" tutorial. This chapter shows you how to use the lexer, built in 46<a href="LangImpl1.html">Chapter 1</a>, to build a full <a 47href="http://en.wikipedia.org/wiki/Parsing">parser</a> for 48our Kaleidoscope language. Once we have a parser, we'll define and build an <a 49href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax 50Tree</a> (AST).</p> 51 52<p>The parser we will build uses a combination of <a 53href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent 54Parsing</a> and <a href= 55"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence 56Parsing</a> to parse the Kaleidoscope language (the latter for 57binary expressions and the former for everything else). Before we get to 58parsing though, lets talk about the output of the parser: the Abstract Syntax 59Tree.</p> 60 61</div> 62 63<!-- *********************************************************************** --> 64<h2><a name="ast">The Abstract Syntax Tree (AST)</a></h2> 65<!-- *********************************************************************** --> 66 67<div> 68 69<p>The AST for a program captures its behavior in such a way that it is easy for 70later stages of the compiler (e.g. code generation) to interpret. We basically 71want one object for each construct in the language, and the AST should closely 72model the language. In Kaleidoscope, we have expressions, a prototype, and a 73function object. We'll start with expressions first:</p> 74 75<div class="doc_code"> 76<pre> 77/// ExprAST - Base class for all expression nodes. 78class ExprAST { 79public: 80 virtual ~ExprAST() {} 81}; 82 83/// NumberExprAST - Expression class for numeric literals like "1.0". 84class NumberExprAST : public ExprAST { 85 double Val; 86public: 87 NumberExprAST(double val) : Val(val) {} 88}; 89</pre> 90</div> 91 92<p>The code above shows the definition of the base ExprAST class and one 93subclass which we use for numeric literals. The important thing to note about 94this code is that the NumberExprAST class captures the numeric value of the 95literal as an instance variable. This allows later phases of the compiler to 96know what the stored numeric value is.</p> 97 98<p>Right now we only create the AST, so there are no useful accessor methods on 99them. It would be very easy to add a virtual method to pretty print the code, 100for example. Here are the other expression AST node definitions that we'll use 101in the basic form of the Kaleidoscope language: 102</p> 103 104<div class="doc_code"> 105<pre> 106/// VariableExprAST - Expression class for referencing a variable, like "a". 107class VariableExprAST : public ExprAST { 108 std::string Name; 109public: 110 VariableExprAST(const std::string &name) : Name(name) {} 111}; 112 113/// BinaryExprAST - Expression class for a binary operator. 114class BinaryExprAST : public ExprAST { 115 char Op; 116 ExprAST *LHS, *RHS; 117public: 118 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs) 119 : Op(op), LHS(lhs), RHS(rhs) {} 120}; 121 122/// CallExprAST - Expression class for function calls. 123class CallExprAST : public ExprAST { 124 std::string Callee; 125 std::vector<ExprAST*> Args; 126public: 127 CallExprAST(const std::string &callee, std::vector<ExprAST*> &args) 128 : Callee(callee), Args(args) {} 129}; 130</pre> 131</div> 132 133<p>This is all (intentionally) rather straight-forward: variables capture the 134variable name, binary operators capture their opcode (e.g. '+'), and calls 135capture a function name as well as a list of any argument expressions. One thing 136that is nice about our AST is that it captures the language features without 137talking about the syntax of the language. Note that there is no discussion about 138precedence of binary operators, lexical structure, etc.</p> 139 140<p>For our basic language, these are all of the expression nodes we'll define. 141Because it doesn't have conditional control flow, it isn't Turing-complete; 142we'll fix that in a later installment. The two things we need next are a way 143to talk about the interface to a function, and a way to talk about functions 144themselves:</p> 145 146<div class="doc_code"> 147<pre> 148/// PrototypeAST - This class represents the "prototype" for a function, 149/// which captures its name, and its argument names (thus implicitly the number 150/// of arguments the function takes). 151class PrototypeAST { 152 std::string Name; 153 std::vector<std::string> Args; 154public: 155 PrototypeAST(const std::string &name, const std::vector<std::string> &args) 156 : Name(name), Args(args) {} 157}; 158 159/// FunctionAST - This class represents a function definition itself. 160class FunctionAST { 161 PrototypeAST *Proto; 162 ExprAST *Body; 163public: 164 FunctionAST(PrototypeAST *proto, ExprAST *body) 165 : Proto(proto), Body(body) {} 166}; 167</pre> 168</div> 169 170<p>In Kaleidoscope, functions are typed with just a count of their arguments. 171Since all values are double precision floating point, the type of each argument 172doesn't need to be stored anywhere. In a more aggressive and realistic 173language, the "ExprAST" class would probably have a type field.</p> 174 175<p>With this scaffolding, we can now talk about parsing expressions and function 176bodies in Kaleidoscope.</p> 177 178</div> 179 180<!-- *********************************************************************** --> 181<h2><a name="parserbasics">Parser Basics</a></h2> 182<!-- *********************************************************************** --> 183 184<div> 185 186<p>Now that we have an AST to build, we need to define the parser code to build 187it. The idea here is that we want to parse something like "x+y" (which is 188returned as three tokens by the lexer) into an AST that could be generated with 189calls like this:</p> 190 191<div class="doc_code"> 192<pre> 193 ExprAST *X = new VariableExprAST("x"); 194 ExprAST *Y = new VariableExprAST("y"); 195 ExprAST *Result = new BinaryExprAST('+', X, Y); 196</pre> 197</div> 198 199<p>In order to do this, we'll start by defining some basic helper routines:</p> 200 201<div class="doc_code"> 202<pre> 203/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current 204/// token the parser is looking at. getNextToken reads another token from the 205/// lexer and updates CurTok with its results. 206static int CurTok; 207static int getNextToken() { 208 return CurTok = gettok(); 209} 210</pre> 211</div> 212 213<p> 214This implements a simple token buffer around the lexer. This allows 215us to look one token ahead at what the lexer is returning. Every function in 216our parser will assume that CurTok is the current token that needs to be 217parsed.</p> 218 219<div class="doc_code"> 220<pre> 221 222/// Error* - These are little helper functions for error handling. 223ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;} 224PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; } 225FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; } 226</pre> 227</div> 228 229<p> 230The <tt>Error</tt> routines are simple helper routines that our parser will use 231to handle errors. The error recovery in our parser will not be the best and 232is not particular user-friendly, but it will be enough for our tutorial. These 233routines make it easier to handle errors in routines that have various return 234types: they always return null.</p> 235 236<p>With these basic helper functions, we can implement the first 237piece of our grammar: numeric literals.</p> 238 239</div> 240 241<!-- *********************************************************************** --> 242<h2><a name="parserprimexprs">Basic Expression Parsing</a></h2> 243<!-- *********************************************************************** --> 244 245<div> 246 247<p>We start with numeric literals, because they are the simplest to process. 248For each production in our grammar, we'll define a function which parses that 249production. For numeric literals, we have: 250</p> 251 252<div class="doc_code"> 253<pre> 254/// numberexpr ::= number 255static ExprAST *ParseNumberExpr() { 256 ExprAST *Result = new NumberExprAST(NumVal); 257 getNextToken(); // consume the number 258 return Result; 259} 260</pre> 261</div> 262 263<p>This routine is very simple: it expects to be called when the current token 264is a <tt>tok_number</tt> token. It takes the current number value, creates 265a <tt>NumberExprAST</tt> node, advances the lexer to the next token, and finally 266returns.</p> 267 268<p>There are some interesting aspects to this. The most important one is that 269this routine eats all of the tokens that correspond to the production and 270returns the lexer buffer with the next token (which is not part of the grammar 271production) ready to go. This is a fairly standard way to go for recursive 272descent parsers. For a better example, the parenthesis operator is defined like 273this:</p> 274 275<div class="doc_code"> 276<pre> 277/// parenexpr ::= '(' expression ')' 278static ExprAST *ParseParenExpr() { 279 getNextToken(); // eat (. 280 ExprAST *V = ParseExpression(); 281 if (!V) return 0; 282 283 if (CurTok != ')') 284 return Error("expected ')'"); 285 getNextToken(); // eat ). 286 return V; 287} 288</pre> 289</div> 290 291<p>This function illustrates a number of interesting things about the 292parser:</p> 293 294<p> 2951) It shows how we use the Error routines. When called, this function expects 296that the current token is a '(' token, but after parsing the subexpression, it 297is possible that there is no ')' waiting. For example, if the user types in 298"(4 x" instead of "(4)", the parser should emit an error. Because errors can 299occur, the parser needs a way to indicate that they happened: in our parser, we 300return null on an error.</p> 301 302<p>2) Another interesting aspect of this function is that it uses recursion by 303calling <tt>ParseExpression</tt> (we will soon see that <tt>ParseExpression</tt> can call 304<tt>ParseParenExpr</tt>). This is powerful because it allows us to handle 305recursive grammars, and keeps each production very simple. Note that 306parentheses do not cause construction of AST nodes themselves. While we could 307do it this way, the most important role of parentheses are to guide the parser 308and provide grouping. Once the parser constructs the AST, parentheses are not 309needed.</p> 310 311<p>The next simple production is for handling variable references and function 312calls:</p> 313 314<div class="doc_code"> 315<pre> 316/// identifierexpr 317/// ::= identifier 318/// ::= identifier '(' expression* ')' 319static ExprAST *ParseIdentifierExpr() { 320 std::string IdName = IdentifierStr; 321 322 getNextToken(); // eat identifier. 323 324 if (CurTok != '(') // Simple variable ref. 325 return new VariableExprAST(IdName); 326 327 // Call. 328 getNextToken(); // eat ( 329 std::vector<ExprAST*> Args; 330 if (CurTok != ')') { 331 while (1) { 332 ExprAST *Arg = ParseExpression(); 333 if (!Arg) return 0; 334 Args.push_back(Arg); 335 336 if (CurTok == ')') break; 337 338 if (CurTok != ',') 339 return Error("Expected ')' or ',' in argument list"); 340 getNextToken(); 341 } 342 } 343 344 // Eat the ')'. 345 getNextToken(); 346 347 return new CallExprAST(IdName, Args); 348} 349</pre> 350</div> 351 352<p>This routine follows the same style as the other routines. (It expects to be 353called if the current token is a <tt>tok_identifier</tt> token). It also has 354recursion and error handling. One interesting aspect of this is that it uses 355<em>look-ahead</em> to determine if the current identifier is a stand alone 356variable reference or if it is a function call expression. It handles this by 357checking to see if the token after the identifier is a '(' token, constructing 358either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate. 359</p> 360 361<p>Now that we have all of our simple expression-parsing logic in place, we can 362define a helper function to wrap it together into one entry point. We call this 363class of expressions "primary" expressions, for reasons that will become more 364clear <a href="LangImpl6.html#unary">later in the tutorial</a>. In order to 365parse an arbitrary primary expression, we need to determine what sort of 366expression it is:</p> 367 368<div class="doc_code"> 369<pre> 370/// primary 371/// ::= identifierexpr 372/// ::= numberexpr 373/// ::= parenexpr 374static ExprAST *ParsePrimary() { 375 switch (CurTok) { 376 default: return Error("unknown token when expecting an expression"); 377 case tok_identifier: return ParseIdentifierExpr(); 378 case tok_number: return ParseNumberExpr(); 379 case '(': return ParseParenExpr(); 380 } 381} 382</pre> 383</div> 384 385<p>Now that you see the definition of this function, it is more obvious why we 386can assume the state of CurTok in the various functions. This uses look-ahead 387to determine which sort of expression is being inspected, and then parses it 388with a function call.</p> 389 390<p>Now that basic expressions are handled, we need to handle binary expressions. 391They are a bit more complex.</p> 392 393</div> 394 395<!-- *********************************************************************** --> 396<h2><a name="parserbinops">Binary Expression Parsing</a></h2> 397<!-- *********************************************************************** --> 398 399<div> 400 401<p>Binary expressions are significantly harder to parse because they are often 402ambiguous. For example, when given the string "x+y*z", the parser can choose 403to parse it as either "(x+y)*z" or "x+(y*z)". With common definitions from 404mathematics, we expect the later parse, because "*" (multiplication) has 405higher <em>precedence</em> than "+" (addition).</p> 406 407<p>There are many ways to handle this, but an elegant and efficient way is to 408use <a href= 409"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence 410Parsing</a>. This parsing technique uses the precedence of binary operators to 411guide recursion. To start with, we need a table of precedences:</p> 412 413<div class="doc_code"> 414<pre> 415/// BinopPrecedence - This holds the precedence for each binary operator that is 416/// defined. 417static std::map<char, int> BinopPrecedence; 418 419/// GetTokPrecedence - Get the precedence of the pending binary operator token. 420static int GetTokPrecedence() { 421 if (!isascii(CurTok)) 422 return -1; 423 424 // Make sure it's a declared binop. 425 int TokPrec = BinopPrecedence[CurTok]; 426 if (TokPrec <= 0) return -1; 427 return TokPrec; 428} 429 430int main() { 431 // Install standard binary operators. 432 // 1 is lowest precedence. 433 BinopPrecedence['<'] = 10; 434 BinopPrecedence['+'] = 20; 435 BinopPrecedence['-'] = 20; 436 BinopPrecedence['*'] = 40; // highest. 437 ... 438} 439</pre> 440</div> 441 442<p>For the basic form of Kaleidoscope, we will only support 4 binary operators 443(this can obviously be extended by you, our brave and intrepid reader). The 444<tt>GetTokPrecedence</tt> function returns the precedence for the current token, 445or -1 if the token is not a binary operator. Having a map makes it easy to add 446new operators and makes it clear that the algorithm doesn't depend on the 447specific operators involved, but it would be easy enough to eliminate the map 448and do the comparisons in the <tt>GetTokPrecedence</tt> function. (Or just use 449a fixed-size array).</p> 450 451<p>With the helper above defined, we can now start parsing binary expressions. 452The basic idea of operator precedence parsing is to break down an expression 453with potentially ambiguous binary operators into pieces. Consider ,for example, 454the expression "a+b+(c+d)*e*f+g". Operator precedence parsing considers this 455as a stream of primary expressions separated by binary operators. As such, 456it will first parse the leading primary expression "a", then it will see the 457pairs [+, b] [+, (c+d)] [*, e] [*, f] and [+, g]. Note that because parentheses 458are primary expressions, the binary expression parser doesn't need to worry 459about nested subexpressions like (c+d) at all. 460</p> 461 462<p> 463To start, an expression is a primary expression potentially followed by a 464sequence of [binop,primaryexpr] pairs:</p> 465 466<div class="doc_code"> 467<pre> 468/// expression 469/// ::= primary binoprhs 470/// 471static ExprAST *ParseExpression() { 472 ExprAST *LHS = ParsePrimary(); 473 if (!LHS) return 0; 474 475 return ParseBinOpRHS(0, LHS); 476} 477</pre> 478</div> 479 480<p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for 481us. It takes a precedence and a pointer to an expression for the part that has been 482parsed so far. Note that "x" is a perfectly valid expression: As such, "binoprhs" is 483allowed to be empty, in which case it returns the expression that is passed into 484it. In our example above, the code passes the expression for "a" into 485<tt>ParseBinOpRHS</tt> and the current token is "+".</p> 486 487<p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em> 488minimal operator precedence</em> that the function is allowed to eat. For 489example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is 490passed in a precedence of 40, it will not consume any tokens (because the 491precedence of '+' is only 20). With this in mind, <tt>ParseBinOpRHS</tt> starts 492with:</p> 493 494<div class="doc_code"> 495<pre> 496/// binoprhs 497/// ::= ('+' primary)* 498static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) { 499 // If this is a binop, find its precedence. 500 while (1) { 501 int TokPrec = GetTokPrecedence(); 502 503 // If this is a binop that binds at least as tightly as the current binop, 504 // consume it, otherwise we are done. 505 if (TokPrec < ExprPrec) 506 return LHS; 507</pre> 508</div> 509 510<p>This code gets the precedence of the current token and checks to see if if is 511too low. Because we defined invalid tokens to have a precedence of -1, this 512check implicitly knows that the pair-stream ends when the token stream runs out 513of binary operators. If this check succeeds, we know that the token is a binary 514operator and that it will be included in this expression:</p> 515 516<div class="doc_code"> 517<pre> 518 // Okay, we know this is a binop. 519 int BinOp = CurTok; 520 getNextToken(); // eat binop 521 522 // Parse the primary expression after the binary operator. 523 ExprAST *RHS = ParsePrimary(); 524 if (!RHS) return 0; 525</pre> 526</div> 527 528<p>As such, this code eats (and remembers) the binary operator and then parses 529the primary expression that follows. This builds up the whole pair, the first of 530which is [+, b] for the running example.</p> 531 532<p>Now that we parsed the left-hand side of an expression and one pair of the 533RHS sequence, we have to decide which way the expression associates. In 534particular, we could have "(a+b) binop unparsed" or "a + (b binop unparsed)". 535To determine this, we look ahead at "binop" to determine its precedence and 536compare it to BinOp's precedence (which is '+' in this case):</p> 537 538<div class="doc_code"> 539<pre> 540 // If BinOp binds less tightly with RHS than the operator after RHS, let 541 // the pending operator take RHS as its LHS. 542 int NextPrec = GetTokPrecedence(); 543 if (TokPrec < NextPrec) { 544</pre> 545</div> 546 547<p>If the precedence of the binop to the right of "RHS" is lower or equal to the 548precedence of our current operator, then we know that the parentheses associate 549as "(a+b) binop ...". In our example, the current operator is "+" and the next 550operator is "+", we know that they have the same precedence. In this case we'll 551create the AST node for "a+b", and then continue parsing:</p> 552 553<div class="doc_code"> 554<pre> 555 ... if body omitted ... 556 } 557 558 // Merge LHS/RHS. 559 LHS = new BinaryExprAST(BinOp, LHS, RHS); 560 } // loop around to the top of the while loop. 561} 562</pre> 563</div> 564 565<p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next 566iteration of the loop, with "+" as the current token. The code above will eat, 567remember, and parse "(c+d)" as the primary expression, which makes the 568current pair equal to [+, (c+d)]. It will then evaluate the 'if' conditional above with 569"*" as the binop to the right of the primary. In this case, the precedence of "*" is 570higher than the precedence of "+" so the if condition will be entered.</p> 571 572<p>The critical question left here is "how can the if condition parse the right 573hand side in full"? In particular, to build the AST correctly for our example, 574it needs to get all of "(c+d)*e*f" as the RHS expression variable. The code to 575do this is surprisingly simple (code from the above two blocks duplicated for 576context):</p> 577 578<div class="doc_code"> 579<pre> 580 // If BinOp binds less tightly with RHS than the operator after RHS, let 581 // the pending operator take RHS as its LHS. 582 int NextPrec = GetTokPrecedence(); 583 if (TokPrec < NextPrec) { 584 <b>RHS = ParseBinOpRHS(TokPrec+1, RHS); 585 if (RHS == 0) return 0;</b> 586 } 587 // Merge LHS/RHS. 588 LHS = new BinaryExprAST(BinOp, LHS, RHS); 589 } // loop around to the top of the while loop. 590} 591</pre> 592</div> 593 594<p>At this point, we know that the binary operator to the RHS of our primary 595has higher precedence than the binop we are currently parsing. As such, we know 596that any sequence of pairs whose operators are all higher precedence than "+" 597should be parsed together and returned as "RHS". To do this, we recursively 598invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum 599precedence required for it to continue. In our example above, this will cause 600it to return the AST node for "(c+d)*e*f" as RHS, which is then set as the RHS 601of the '+' expression.</p> 602 603<p>Finally, on the next iteration of the while loop, the "+g" piece is parsed 604and added to the AST. With this little bit of code (14 non-trivial lines), we 605correctly handle fully general binary expression parsing in a very elegant way. 606This was a whirlwind tour of this code, and it is somewhat subtle. I recommend 607running through it with a few tough examples to see how it works. 608</p> 609 610<p>This wraps up handling of expressions. At this point, we can point the 611parser at an arbitrary token stream and build an expression from it, stopping 612at the first token that is not part of the expression. Next up we need to 613handle function definitions, etc.</p> 614 615</div> 616 617<!-- *********************************************************************** --> 618<h2><a name="parsertop">Parsing the Rest</a></h2> 619<!-- *********************************************************************** --> 620 621<div> 622 623<p> 624The next thing missing is handling of function prototypes. In Kaleidoscope, 625these are used both for 'extern' function declarations as well as function body 626definitions. The code to do this is straight-forward and not very interesting 627(once you've survived expressions): 628</p> 629 630<div class="doc_code"> 631<pre> 632/// prototype 633/// ::= id '(' id* ')' 634static PrototypeAST *ParsePrototype() { 635 if (CurTok != tok_identifier) 636 return ErrorP("Expected function name in prototype"); 637 638 std::string FnName = IdentifierStr; 639 getNextToken(); 640 641 if (CurTok != '(') 642 return ErrorP("Expected '(' in prototype"); 643 644 // Read the list of argument names. 645 std::vector<std::string> ArgNames; 646 while (getNextToken() == tok_identifier) 647 ArgNames.push_back(IdentifierStr); 648 if (CurTok != ')') 649 return ErrorP("Expected ')' in prototype"); 650 651 // success. 652 getNextToken(); // eat ')'. 653 654 return new PrototypeAST(FnName, ArgNames); 655} 656</pre> 657</div> 658 659<p>Given this, a function definition is very simple, just a prototype plus 660an expression to implement the body:</p> 661 662<div class="doc_code"> 663<pre> 664/// definition ::= 'def' prototype expression 665static FunctionAST *ParseDefinition() { 666 getNextToken(); // eat def. 667 PrototypeAST *Proto = ParsePrototype(); 668 if (Proto == 0) return 0; 669 670 if (ExprAST *E = ParseExpression()) 671 return new FunctionAST(Proto, E); 672 return 0; 673} 674</pre> 675</div> 676 677<p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as 678well as to support forward declaration of user functions. These 'extern's are just 679prototypes with no body:</p> 680 681<div class="doc_code"> 682<pre> 683/// external ::= 'extern' prototype 684static PrototypeAST *ParseExtern() { 685 getNextToken(); // eat extern. 686 return ParsePrototype(); 687} 688</pre> 689</div> 690 691<p>Finally, we'll also let the user type in arbitrary top-level expressions and 692evaluate them on the fly. We will handle this by defining anonymous nullary 693(zero argument) functions for them:</p> 694 695<div class="doc_code"> 696<pre> 697/// toplevelexpr ::= expression 698static FunctionAST *ParseTopLevelExpr() { 699 if (ExprAST *E = ParseExpression()) { 700 // Make an anonymous proto. 701 PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>()); 702 return new FunctionAST(Proto, E); 703 } 704 return 0; 705} 706</pre> 707</div> 708 709<p>Now that we have all the pieces, let's build a little driver that will let us 710actually <em>execute</em> this code we've built!</p> 711 712</div> 713 714<!-- *********************************************************************** --> 715<h2><a name="driver">The Driver</a></h2> 716<!-- *********************************************************************** --> 717 718<div> 719 720<p>The driver for this simply invokes all of the parsing pieces with a top-level 721dispatch loop. There isn't much interesting here, so I'll just include the 722top-level loop. See <a href="#code">below</a> for full code in the "Top-Level 723Parsing" section.</p> 724 725<div class="doc_code"> 726<pre> 727/// top ::= definition | external | expression | ';' 728static void MainLoop() { 729 while (1) { 730 fprintf(stderr, "ready> "); 731 switch (CurTok) { 732 case tok_eof: return; 733 case ';': getNextToken(); break; // ignore top-level semicolons. 734 case tok_def: HandleDefinition(); break; 735 case tok_extern: HandleExtern(); break; 736 default: HandleTopLevelExpression(); break; 737 } 738 } 739} 740</pre> 741</div> 742 743<p>The most interesting part of this is that we ignore top-level semicolons. 744Why is this, you ask? The basic reason is that if you type "4 + 5" at the 745command line, the parser doesn't know whether that is the end of what you will type 746or not. For example, on the next line you could type "def foo..." in which case 7474+5 is the end of a top-level expression. Alternatively you could type "* 6", 748which would continue the expression. Having top-level semicolons allows you to 749type "4+5;", and the parser will know you are done.</p> 750 751</div> 752 753<!-- *********************************************************************** --> 754<h2><a name="conclusions">Conclusions</a></h2> 755<!-- *********************************************************************** --> 756 757<div> 758 759<p>With just under 400 lines of commented code (240 lines of non-comment, 760non-blank code), we fully defined our minimal language, including a lexer, 761parser, and AST builder. With this done, the executable will validate 762Kaleidoscope code and tell us if it is grammatically invalid. For 763example, here is a sample interaction:</p> 764 765<div class="doc_code"> 766<pre> 767$ <b>./a.out</b> 768ready> <b>def foo(x y) x+foo(y, 4.0);</b> 769Parsed a function definition. 770ready> <b>def foo(x y) x+y y;</b> 771Parsed a function definition. 772Parsed a top-level expr 773ready> <b>def foo(x y) x+y );</b> 774Parsed a function definition. 775Error: unknown token when expecting an expression 776ready> <b>extern sin(a);</b> 777ready> Parsed an extern 778ready> <b>^D</b> 779$ 780</pre> 781</div> 782 783<p>There is a lot of room for extension here. You can define new AST nodes, 784extend the language in many ways, etc. In the <a href="LangImpl3.html">next 785installment</a>, we will describe how to generate LLVM Intermediate 786Representation (IR) from the AST.</p> 787 788</div> 789 790<!-- *********************************************************************** --> 791<h2><a name="code">Full Code Listing</a></h2> 792<!-- *********************************************************************** --> 793 794<div> 795 796<p> 797Here is the complete code listing for this and the previous chapter. 798Note that it is fully self-contained: you don't need LLVM or any external 799libraries at all for this. (Besides the C and C++ standard libraries, of 800course.) To build this, just compile with:</p> 801 802<div class="doc_code"> 803<pre> 804# Compile 805clang++ -g -O3 toy.cpp 806# Run 807./a.out 808</pre> 809</div> 810 811<p>Here is the code:</p> 812 813<div class="doc_code"> 814<pre> 815#include <cstdio> 816#include <cstdlib> 817#include <string> 818#include <map> 819#include <vector> 820 821//===----------------------------------------------------------------------===// 822// Lexer 823//===----------------------------------------------------------------------===// 824 825// The lexer returns tokens [0-255] if it is an unknown character, otherwise one 826// of these for known things. 827enum Token { 828 tok_eof = -1, 829 830 // commands 831 tok_def = -2, tok_extern = -3, 832 833 // primary 834 tok_identifier = -4, tok_number = -5 835}; 836 837static std::string IdentifierStr; // Filled in if tok_identifier 838static double NumVal; // Filled in if tok_number 839 840/// gettok - Return the next token from standard input. 841static int gettok() { 842 static int LastChar = ' '; 843 844 // Skip any whitespace. 845 while (isspace(LastChar)) 846 LastChar = getchar(); 847 848 if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]* 849 IdentifierStr = LastChar; 850 while (isalnum((LastChar = getchar()))) 851 IdentifierStr += LastChar; 852 853 if (IdentifierStr == "def") return tok_def; 854 if (IdentifierStr == "extern") return tok_extern; 855 return tok_identifier; 856 } 857 858 if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+ 859 std::string NumStr; 860 do { 861 NumStr += LastChar; 862 LastChar = getchar(); 863 } while (isdigit(LastChar) || LastChar == '.'); 864 865 NumVal = strtod(NumStr.c_str(), 0); 866 return tok_number; 867 } 868 869 if (LastChar == '#') { 870 // Comment until end of line. 871 do LastChar = getchar(); 872 while (LastChar != EOF && LastChar != '\n' && LastChar != '\r'); 873 874 if (LastChar != EOF) 875 return gettok(); 876 } 877 878 // Check for end of file. Don't eat the EOF. 879 if (LastChar == EOF) 880 return tok_eof; 881 882 // Otherwise, just return the character as its ascii value. 883 int ThisChar = LastChar; 884 LastChar = getchar(); 885 return ThisChar; 886} 887 888//===----------------------------------------------------------------------===// 889// Abstract Syntax Tree (aka Parse Tree) 890//===----------------------------------------------------------------------===// 891 892/// ExprAST - Base class for all expression nodes. 893class ExprAST { 894public: 895 virtual ~ExprAST() {} 896}; 897 898/// NumberExprAST - Expression class for numeric literals like "1.0". 899class NumberExprAST : public ExprAST { 900 double Val; 901public: 902 NumberExprAST(double val) : Val(val) {} 903}; 904 905/// VariableExprAST - Expression class for referencing a variable, like "a". 906class VariableExprAST : public ExprAST { 907 std::string Name; 908public: 909 VariableExprAST(const std::string &name) : Name(name) {} 910}; 911 912/// BinaryExprAST - Expression class for a binary operator. 913class BinaryExprAST : public ExprAST { 914 char Op; 915 ExprAST *LHS, *RHS; 916public: 917 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs) 918 : Op(op), LHS(lhs), RHS(rhs) {} 919}; 920 921/// CallExprAST - Expression class for function calls. 922class CallExprAST : public ExprAST { 923 std::string Callee; 924 std::vector<ExprAST*> Args; 925public: 926 CallExprAST(const std::string &callee, std::vector<ExprAST*> &args) 927 : Callee(callee), Args(args) {} 928}; 929 930/// PrototypeAST - This class represents the "prototype" for a function, 931/// which captures its name, and its argument names (thus implicitly the number 932/// of arguments the function takes). 933class PrototypeAST { 934 std::string Name; 935 std::vector<std::string> Args; 936public: 937 PrototypeAST(const std::string &name, const std::vector<std::string> &args) 938 : Name(name), Args(args) {} 939 940}; 941 942/// FunctionAST - This class represents a function definition itself. 943class FunctionAST { 944 PrototypeAST *Proto; 945 ExprAST *Body; 946public: 947 FunctionAST(PrototypeAST *proto, ExprAST *body) 948 : Proto(proto), Body(body) {} 949 950}; 951 952//===----------------------------------------------------------------------===// 953// Parser 954//===----------------------------------------------------------------------===// 955 956/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current 957/// token the parser is looking at. getNextToken reads another token from the 958/// lexer and updates CurTok with its results. 959static int CurTok; 960static int getNextToken() { 961 return CurTok = gettok(); 962} 963 964/// BinopPrecedence - This holds the precedence for each binary operator that is 965/// defined. 966static std::map<char, int> BinopPrecedence; 967 968/// GetTokPrecedence - Get the precedence of the pending binary operator token. 969static int GetTokPrecedence() { 970 if (!isascii(CurTok)) 971 return -1; 972 973 // Make sure it's a declared binop. 974 int TokPrec = BinopPrecedence[CurTok]; 975 if (TokPrec <= 0) return -1; 976 return TokPrec; 977} 978 979/// Error* - These are little helper functions for error handling. 980ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;} 981PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; } 982FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; } 983 984static ExprAST *ParseExpression(); 985 986/// identifierexpr 987/// ::= identifier 988/// ::= identifier '(' expression* ')' 989static ExprAST *ParseIdentifierExpr() { 990 std::string IdName = IdentifierStr; 991 992 getNextToken(); // eat identifier. 993 994 if (CurTok != '(') // Simple variable ref. 995 return new VariableExprAST(IdName); 996 997 // Call. 998 getNextToken(); // eat ( 999 std::vector<ExprAST*> Args; 1000 if (CurTok != ')') { 1001 while (1) { 1002 ExprAST *Arg = ParseExpression(); 1003 if (!Arg) return 0; 1004 Args.push_back(Arg); 1005 1006 if (CurTok == ')') break; 1007 1008 if (CurTok != ',') 1009 return Error("Expected ')' or ',' in argument list"); 1010 getNextToken(); 1011 } 1012 } 1013 1014 // Eat the ')'. 1015 getNextToken(); 1016 1017 return new CallExprAST(IdName, Args); 1018} 1019 1020/// numberexpr ::= number 1021static ExprAST *ParseNumberExpr() { 1022 ExprAST *Result = new NumberExprAST(NumVal); 1023 getNextToken(); // consume the number 1024 return Result; 1025} 1026 1027/// parenexpr ::= '(' expression ')' 1028static ExprAST *ParseParenExpr() { 1029 getNextToken(); // eat (. 1030 ExprAST *V = ParseExpression(); 1031 if (!V) return 0; 1032 1033 if (CurTok != ')') 1034 return Error("expected ')'"); 1035 getNextToken(); // eat ). 1036 return V; 1037} 1038 1039/// primary 1040/// ::= identifierexpr 1041/// ::= numberexpr 1042/// ::= parenexpr 1043static ExprAST *ParsePrimary() { 1044 switch (CurTok) { 1045 default: return Error("unknown token when expecting an expression"); 1046 case tok_identifier: return ParseIdentifierExpr(); 1047 case tok_number: return ParseNumberExpr(); 1048 case '(': return ParseParenExpr(); 1049 } 1050} 1051 1052/// binoprhs 1053/// ::= ('+' primary)* 1054static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) { 1055 // If this is a binop, find its precedence. 1056 while (1) { 1057 int TokPrec = GetTokPrecedence(); 1058 1059 // If this is a binop that binds at least as tightly as the current binop, 1060 // consume it, otherwise we are done. 1061 if (TokPrec < ExprPrec) 1062 return LHS; 1063 1064 // Okay, we know this is a binop. 1065 int BinOp = CurTok; 1066 getNextToken(); // eat binop 1067 1068 // Parse the primary expression after the binary operator. 1069 ExprAST *RHS = ParsePrimary(); 1070 if (!RHS) return 0; 1071 1072 // If BinOp binds less tightly with RHS than the operator after RHS, let 1073 // the pending operator take RHS as its LHS. 1074 int NextPrec = GetTokPrecedence(); 1075 if (TokPrec < NextPrec) { 1076 RHS = ParseBinOpRHS(TokPrec+1, RHS); 1077 if (RHS == 0) return 0; 1078 } 1079 1080 // Merge LHS/RHS. 1081 LHS = new BinaryExprAST(BinOp, LHS, RHS); 1082 } 1083} 1084 1085/// expression 1086/// ::= primary binoprhs 1087/// 1088static ExprAST *ParseExpression() { 1089 ExprAST *LHS = ParsePrimary(); 1090 if (!LHS) return 0; 1091 1092 return ParseBinOpRHS(0, LHS); 1093} 1094 1095/// prototype 1096/// ::= id '(' id* ')' 1097static PrototypeAST *ParsePrototype() { 1098 if (CurTok != tok_identifier) 1099 return ErrorP("Expected function name in prototype"); 1100 1101 std::string FnName = IdentifierStr; 1102 getNextToken(); 1103 1104 if (CurTok != '(') 1105 return ErrorP("Expected '(' in prototype"); 1106 1107 std::vector<std::string> ArgNames; 1108 while (getNextToken() == tok_identifier) 1109 ArgNames.push_back(IdentifierStr); 1110 if (CurTok != ')') 1111 return ErrorP("Expected ')' in prototype"); 1112 1113 // success. 1114 getNextToken(); // eat ')'. 1115 1116 return new PrototypeAST(FnName, ArgNames); 1117} 1118 1119/// definition ::= 'def' prototype expression 1120static FunctionAST *ParseDefinition() { 1121 getNextToken(); // eat def. 1122 PrototypeAST *Proto = ParsePrototype(); 1123 if (Proto == 0) return 0; 1124 1125 if (ExprAST *E = ParseExpression()) 1126 return new FunctionAST(Proto, E); 1127 return 0; 1128} 1129 1130/// toplevelexpr ::= expression 1131static FunctionAST *ParseTopLevelExpr() { 1132 if (ExprAST *E = ParseExpression()) { 1133 // Make an anonymous proto. 1134 PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>()); 1135 return new FunctionAST(Proto, E); 1136 } 1137 return 0; 1138} 1139 1140/// external ::= 'extern' prototype 1141static PrototypeAST *ParseExtern() { 1142 getNextToken(); // eat extern. 1143 return ParsePrototype(); 1144} 1145 1146//===----------------------------------------------------------------------===// 1147// Top-Level parsing 1148//===----------------------------------------------------------------------===// 1149 1150static void HandleDefinition() { 1151 if (ParseDefinition()) { 1152 fprintf(stderr, "Parsed a function definition.\n"); 1153 } else { 1154 // Skip token for error recovery. 1155 getNextToken(); 1156 } 1157} 1158 1159static void HandleExtern() { 1160 if (ParseExtern()) { 1161 fprintf(stderr, "Parsed an extern\n"); 1162 } else { 1163 // Skip token for error recovery. 1164 getNextToken(); 1165 } 1166} 1167 1168static void HandleTopLevelExpression() { 1169 // Evaluate a top-level expression into an anonymous function. 1170 if (ParseTopLevelExpr()) { 1171 fprintf(stderr, "Parsed a top-level expr\n"); 1172 } else { 1173 // Skip token for error recovery. 1174 getNextToken(); 1175 } 1176} 1177 1178/// top ::= definition | external | expression | ';' 1179static void MainLoop() { 1180 while (1) { 1181 fprintf(stderr, "ready> "); 1182 switch (CurTok) { 1183 case tok_eof: return; 1184 case ';': getNextToken(); break; // ignore top-level semicolons. 1185 case tok_def: HandleDefinition(); break; 1186 case tok_extern: HandleExtern(); break; 1187 default: HandleTopLevelExpression(); break; 1188 } 1189 } 1190} 1191 1192//===----------------------------------------------------------------------===// 1193// Main driver code. 1194//===----------------------------------------------------------------------===// 1195 1196int main() { 1197 // Install standard binary operators. 1198 // 1 is lowest precedence. 1199 BinopPrecedence['<'] = 10; 1200 BinopPrecedence['+'] = 20; 1201 BinopPrecedence['-'] = 20; 1202 BinopPrecedence['*'] = 40; // highest. 1203 1204 // Prime the first token. 1205 fprintf(stderr, "ready> "); 1206 getNextToken(); 1207 1208 // Run the main "interpreter loop" now. 1209 MainLoop(); 1210 1211 return 0; 1212} 1213</pre> 1214</div> 1215<a href="LangImpl3.html">Next: Implementing Code Generation to LLVM IR</a> 1216</div> 1217 1218<!-- *********************************************************************** --> 1219<hr> 1220<address> 1221 <a href="http://jigsaw.w3.org/css-validator/check/referer"><img 1222 src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a> 1223 <a href="http://validator.w3.org/check/referer"><img 1224 src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a> 1225 1226 <a href="mailto:sabre@nondot.org">Chris Lattner</a><br> 1227 <a href="http://llvm.org/">The LLVM Compiler Infrastructure</a><br> 1228 Last modified: $Date: 2011-10-16 04:07:38 -0400 (Sun, 16 Oct 2011) $ 1229</address> 1230</body> 1231</html> 1232