1=========================================== 2Kaleidoscope: Implementing a Parser and AST 3=========================================== 4 5.. contents:: 6 :local: 7 8Chapter 2 Introduction 9====================== 10 11Welcome to Chapter 2 of the "`Implementing a language with LLVM in 12Objective Caml <index.html>`_" tutorial. This chapter shows you how to 13use the lexer, built in `Chapter 1 <OCamlLangImpl1.html>`_, to build a 14full `parser <http://en.wikipedia.org/wiki/Parsing>`_ for our 15Kaleidoscope language. Once we have a parser, we'll define and build an 16`Abstract Syntax 17Tree <http://en.wikipedia.org/wiki/Abstract_syntax_tree>`_ (AST). 18 19The parser we will build uses a combination of `Recursive Descent 20Parsing <http://en.wikipedia.org/wiki/Recursive_descent_parser>`_ and 21`Operator-Precedence 22Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_ to 23parse the Kaleidoscope language (the latter for binary expressions and 24the former for everything else). Before we get to parsing though, lets 25talk about the output of the parser: the Abstract Syntax Tree. 26 27The Abstract Syntax Tree (AST) 28============================== 29 30The AST for a program captures its behavior in such a way that it is 31easy for later stages of the compiler (e.g. code generation) to 32interpret. We basically want one object for each construct in the 33language, and the AST should closely model the language. In 34Kaleidoscope, we have expressions, a prototype, and a function object. 35We'll start with expressions first: 36 37.. code-block:: ocaml 38 39 (* expr - Base type for all expression nodes. *) 40 type expr = 41 (* variant for numeric literals like "1.0". *) 42 | Number of float 43 44The code above shows the definition of the base ExprAST class and one 45subclass which we use for numeric literals. The important thing to note 46about this code is that the Number variant captures the numeric value of 47the literal as an instance variable. This allows later phases of the 48compiler to know what the stored numeric value is. 49 50Right now we only create the AST, so there are no useful functions on 51them. It would be very easy to add a function to pretty print the code, 52for example. Here are the other expression AST node definitions that 53we'll use in the basic form of the Kaleidoscope language: 54 55.. code-block:: ocaml 56 57 (* variant for referencing a variable, like "a". *) 58 | Variable of string 59 60 (* variant for a binary operator. *) 61 | Binary of char * expr * expr 62 63 (* variant for function calls. *) 64 | Call of string * expr array 65 66This is all (intentionally) rather straight-forward: variables capture 67the variable name, binary operators capture their opcode (e.g. '+'), and 68calls capture a function name as well as a list of any argument 69expressions. One thing that is nice about our AST is that it captures 70the language features without talking about the syntax of the language. 71Note that there is no discussion about precedence of binary operators, 72lexical structure, etc. 73 74For our basic language, these are all of the expression nodes we'll 75define. Because it doesn't have conditional control flow, it isn't 76Turing-complete; we'll fix that in a later installment. The two things 77we need next are a way to talk about the interface to a function, and a 78way to talk about functions themselves: 79 80.. code-block:: ocaml 81 82 (* proto - This type represents the "prototype" for a function, which captures 83 * its name, and its argument names (thus implicitly the number of arguments the 84 * function takes). *) 85 type proto = Prototype of string * string array 86 87 (* func - This type represents a function definition itself. *) 88 type func = Function of proto * expr 89 90In Kaleidoscope, functions are typed with just a count of their 91arguments. Since all values are double precision floating point, the 92type of each argument doesn't need to be stored anywhere. In a more 93aggressive and realistic language, the "expr" variants would probably 94have a type field. 95 96With this scaffolding, we can now talk about parsing expressions and 97function bodies in Kaleidoscope. 98 99Parser Basics 100============= 101 102Now that we have an AST to build, we need to define the parser code to 103build it. The idea here is that we want to parse something like "x+y" 104(which is returned as three tokens by the lexer) into an AST that could 105be generated with calls like this: 106 107.. code-block:: ocaml 108 109 let x = Variable "x" in 110 let y = Variable "y" in 111 let result = Binary ('+', x, y) in 112 ... 113 114The error handling routines make use of the builtin ``Stream.Failure`` 115and ``Stream.Error``s. ``Stream.Failure`` is raised when the parser is 116unable to find any matching token in the first position of a pattern. 117``Stream.Error`` is raised when the first token matches, but the rest do 118not. The error recovery in our parser will not be the best and is not 119particular user-friendly, but it will be enough for our tutorial. These 120exceptions make it easier to handle errors in routines that have various 121return types. 122 123With these basic types and exceptions, we can implement the first piece 124of our grammar: numeric literals. 125 126Basic Expression Parsing 127======================== 128 129We start with numeric literals, because they are the simplest to 130process. For each production in our grammar, we'll define a function 131which parses that production. We call this class of expressions 132"primary" expressions, for reasons that will become more clear `later in 133the tutorial <OCamlLangImpl6.html#unary>`_. In order to parse an 134arbitrary primary expression, we need to determine what sort of 135expression it is. For numeric literals, we have: 136 137.. code-block:: ocaml 138 139 (* primary 140 * ::= identifier 141 * ::= numberexpr 142 * ::= parenexpr *) 143 parse_primary = parser 144 (* numberexpr ::= number *) 145 | [< 'Token.Number n >] -> Ast.Number n 146 147This routine is very simple: it expects to be called when the current 148token is a ``Token.Number`` token. It takes the current number value, 149creates a ``Ast.Number`` node, advances the lexer to the next token, and 150finally returns. 151 152There are some interesting aspects to this. The most important one is 153that this routine eats all of the tokens that correspond to the 154production and returns the lexer buffer with the next token (which is 155not part of the grammar production) ready to go. This is a fairly 156standard way to go for recursive descent parsers. For a better example, 157the parenthesis operator is defined like this: 158 159.. code-block:: ocaml 160 161 (* parenexpr ::= '(' expression ')' *) 162 | [< 'Token.Kwd '('; e=parse_expr; 'Token.Kwd ')' ?? "expected ')'" >] -> e 163 164This function illustrates a number of interesting things about the 165parser: 166 1671) It shows how we use the ``Stream.Error`` exception. When called, this 168function expects that the current token is a '(' token, but after 169parsing the subexpression, it is possible that there is no ')' waiting. 170For example, if the user types in "(4 x" instead of "(4)", the parser 171should emit an error. Because errors can occur, the parser needs a way 172to indicate that they happened. In our parser, we use the camlp4 173shortcut syntax ``token ?? "parse error"``, where if the token before 174the ``??`` does not match, then ``Stream.Error "parse error"`` will be 175raised. 176 1772) Another interesting aspect of this function is that it uses recursion 178by calling ``Parser.parse_primary`` (we will soon see that 179``Parser.parse_primary`` can call ``Parser.parse_primary``). This is 180powerful because it allows us to handle recursive grammars, and keeps 181each production very simple. Note that parentheses do not cause 182construction of AST nodes themselves. While we could do it this way, the 183most important role of parentheses are to guide the parser and provide 184grouping. Once the parser constructs the AST, parentheses are not 185needed. 186 187The next simple production is for handling variable references and 188function calls: 189 190.. code-block:: ocaml 191 192 (* identifierexpr 193 * ::= identifier 194 * ::= identifier '(' argumentexpr ')' *) 195 | [< 'Token.Ident id; stream >] -> 196 let rec parse_args accumulator = parser 197 | [< e=parse_expr; stream >] -> 198 begin parser 199 | [< 'Token.Kwd ','; e=parse_args (e :: accumulator) >] -> e 200 | [< >] -> e :: accumulator 201 end stream 202 | [< >] -> accumulator 203 in 204 let rec parse_ident id = parser 205 (* Call. *) 206 | [< 'Token.Kwd '('; 207 args=parse_args []; 208 'Token.Kwd ')' ?? "expected ')'">] -> 209 Ast.Call (id, Array.of_list (List.rev args)) 210 211 (* Simple variable ref. *) 212 | [< >] -> Ast.Variable id 213 in 214 parse_ident id stream 215 216This routine follows the same style as the other routines. (It expects 217to be called if the current token is a ``Token.Ident`` token). It also 218has recursion and error handling. One interesting aspect of this is that 219it uses *look-ahead* to determine if the current identifier is a stand 220alone variable reference or if it is a function call expression. It 221handles this by checking to see if the token after the identifier is a 222'(' token, constructing either a ``Ast.Variable`` or ``Ast.Call`` node 223as appropriate. 224 225We finish up by raising an exception if we received a token we didn't 226expect: 227 228.. code-block:: ocaml 229 230 | [< >] -> raise (Stream.Error "unknown token when expecting an expression.") 231 232Now that basic expressions are handled, we need to handle binary 233expressions. They are a bit more complex. 234 235Binary Expression Parsing 236========================= 237 238Binary expressions are significantly harder to parse because they are 239often ambiguous. For example, when given the string "x+y\*z", the parser 240can choose to parse it as either "(x+y)\*z" or "x+(y\*z)". With common 241definitions from mathematics, we expect the later parse, because "\*" 242(multiplication) has higher *precedence* than "+" (addition). 243 244There are many ways to handle this, but an elegant and efficient way is 245to use `Operator-Precedence 246Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_. 247This parsing technique uses the precedence of binary operators to guide 248recursion. To start with, we need a table of precedences: 249 250.. code-block:: ocaml 251 252 (* binop_precedence - This holds the precedence for each binary operator that is 253 * defined *) 254 let binop_precedence:(char, int) Hashtbl.t = Hashtbl.create 10 255 256 (* precedence - Get the precedence of the pending binary operator token. *) 257 let precedence c = try Hashtbl.find binop_precedence c with Not_found -> -1 258 259 ... 260 261 let main () = 262 (* Install standard binary operators. 263 * 1 is the lowest precedence. *) 264 Hashtbl.add Parser.binop_precedence '<' 10; 265 Hashtbl.add Parser.binop_precedence '+' 20; 266 Hashtbl.add Parser.binop_precedence '-' 20; 267 Hashtbl.add Parser.binop_precedence '*' 40; (* highest. *) 268 ... 269 270For the basic form of Kaleidoscope, we will only support 4 binary 271operators (this can obviously be extended by you, our brave and intrepid 272reader). The ``Parser.precedence`` function returns the precedence for 273the current token, or -1 if the token is not a binary operator. Having a 274``Hashtbl.t`` makes it easy to add new operators and makes it clear that 275the algorithm doesn't depend on the specific operators involved, but it 276would be easy enough to eliminate the ``Hashtbl.t`` and do the 277comparisons in the ``Parser.precedence`` function. (Or just use a 278fixed-size array). 279 280With the helper above defined, we can now start parsing binary 281expressions. The basic idea of operator precedence parsing is to break 282down an expression with potentially ambiguous binary operators into 283pieces. Consider ,for example, the expression "a+b+(c+d)\*e\*f+g". 284Operator precedence parsing considers this as a stream of primary 285expressions separated by binary operators. As such, it will first parse 286the leading primary expression "a", then it will see the pairs [+, b] 287[+, (c+d)] [\*, e] [\*, f] and [+, g]. Note that because parentheses are 288primary expressions, the binary expression parser doesn't need to worry 289about nested subexpressions like (c+d) at all. 290 291To start, an expression is a primary expression potentially followed by 292a sequence of [binop,primaryexpr] pairs: 293 294.. code-block:: ocaml 295 296 (* expression 297 * ::= primary binoprhs *) 298 and parse_expr = parser 299 | [< lhs=parse_primary; stream >] -> parse_bin_rhs 0 lhs stream 300 301``Parser.parse_bin_rhs`` is the function that parses the sequence of 302pairs for us. It takes a precedence and a pointer to an expression for 303the part that has been parsed so far. Note that "x" is a perfectly valid 304expression: As such, "binoprhs" is allowed to be empty, in which case it 305returns the expression that is passed into it. In our example above, the 306code passes the expression for "a" into ``Parser.parse_bin_rhs`` and the 307current token is "+". 308 309The precedence value passed into ``Parser.parse_bin_rhs`` indicates the 310*minimal operator precedence* that the function is allowed to eat. For 311example, if the current pair stream is [+, x] and 312``Parser.parse_bin_rhs`` is passed in a precedence of 40, it will not 313consume any tokens (because the precedence of '+' is only 20). With this 314in mind, ``Parser.parse_bin_rhs`` starts with: 315 316.. code-block:: ocaml 317 318 (* binoprhs 319 * ::= ('+' primary)* *) 320 and parse_bin_rhs expr_prec lhs stream = 321 match Stream.peek stream with 322 (* If this is a binop, find its precedence. *) 323 | Some (Token.Kwd c) when Hashtbl.mem binop_precedence c -> 324 let token_prec = precedence c in 325 326 (* If this is a binop that binds at least as tightly as the current binop, 327 * consume it, otherwise we are done. *) 328 if token_prec < expr_prec then lhs else begin 329 330This code gets the precedence of the current token and checks to see if 331if is too low. Because we defined invalid tokens to have a precedence of 332-1, this check implicitly knows that the pair-stream ends when the token 333stream runs out of binary operators. If this check succeeds, we know 334that the token is a binary operator and that it will be included in this 335expression: 336 337.. code-block:: ocaml 338 339 (* Eat the binop. *) 340 Stream.junk stream; 341 342 (* Parse the primary expression after the binary operator *) 343 let rhs = parse_primary stream in 344 345 (* Okay, we know this is a binop. *) 346 let rhs = 347 match Stream.peek stream with 348 | Some (Token.Kwd c2) -> 349 350As such, this code eats (and remembers) the binary operator and then 351parses the primary expression that follows. This builds up the whole 352pair, the first of which is [+, b] for the running example. 353 354Now that we parsed the left-hand side of an expression and one pair of 355the RHS sequence, we have to decide which way the expression associates. 356In particular, we could have "(a+b) binop unparsed" or "a + (b binop 357unparsed)". To determine this, we look ahead at "binop" to determine its 358precedence and compare it to BinOp's precedence (which is '+' in this 359case): 360 361.. code-block:: ocaml 362 363 (* If BinOp binds less tightly with rhs than the operator after 364 * rhs, let the pending operator take rhs as its lhs. *) 365 let next_prec = precedence c2 in 366 if token_prec < next_prec 367 368If the precedence of the binop to the right of "RHS" is lower or equal 369to the precedence of our current operator, then we know that the 370parentheses associate as "(a+b) binop ...". In our example, the current 371operator is "+" and the next operator is "+", we know that they have the 372same precedence. In this case we'll create the AST node for "a+b", and 373then continue parsing: 374 375.. code-block:: ocaml 376 377 ... if body omitted ... 378 in 379 380 (* Merge lhs/rhs. *) 381 let lhs = Ast.Binary (c, lhs, rhs) in 382 parse_bin_rhs expr_prec lhs stream 383 end 384 385In our example above, this will turn "a+b+" into "(a+b)" and execute the 386next iteration of the loop, with "+" as the current token. The code 387above will eat, remember, and parse "(c+d)" as the primary expression, 388which makes the current pair equal to [+, (c+d)]. It will then evaluate 389the 'if' conditional above with "\*" as the binop to the right of the 390primary. In this case, the precedence of "\*" is higher than the 391precedence of "+" so the if condition will be entered. 392 393The critical question left here is "how can the if condition parse the 394right hand side in full"? In particular, to build the AST correctly for 395our example, it needs to get all of "(c+d)\*e\*f" as the RHS expression 396variable. The code to do this is surprisingly simple (code from the 397above two blocks duplicated for context): 398 399.. code-block:: ocaml 400 401 match Stream.peek stream with 402 | Some (Token.Kwd c2) -> 403 (* If BinOp binds less tightly with rhs than the operator after 404 * rhs, let the pending operator take rhs as its lhs. *) 405 if token_prec < precedence c2 406 then parse_bin_rhs (token_prec + 1) rhs stream 407 else rhs 408 | _ -> rhs 409 in 410 411 (* Merge lhs/rhs. *) 412 let lhs = Ast.Binary (c, lhs, rhs) in 413 parse_bin_rhs expr_prec lhs stream 414 end 415 416At this point, we know that the binary operator to the RHS of our 417primary has higher precedence than the binop we are currently parsing. 418As such, we know that any sequence of pairs whose operators are all 419higher precedence than "+" should be parsed together and returned as 420"RHS". To do this, we recursively invoke the ``Parser.parse_bin_rhs`` 421function specifying "token\_prec+1" as the minimum precedence required 422for it to continue. In our example above, this will cause it to return 423the AST node for "(c+d)\*e\*f" as RHS, which is then set as the RHS of 424the '+' expression. 425 426Finally, on the next iteration of the while loop, the "+g" piece is 427parsed and added to the AST. With this little bit of code (14 428non-trivial lines), we correctly handle fully general binary expression 429parsing in a very elegant way. This was a whirlwind tour of this code, 430and it is somewhat subtle. I recommend running through it with a few 431tough examples to see how it works. 432 433This wraps up handling of expressions. At this point, we can point the 434parser at an arbitrary token stream and build an expression from it, 435stopping at the first token that is not part of the expression. Next up 436we need to handle function definitions, etc. 437 438Parsing the Rest 439================ 440 441The next thing missing is handling of function prototypes. In 442Kaleidoscope, these are used both for 'extern' function declarations as 443well as function body definitions. The code to do this is 444straight-forward and not very interesting (once you've survived 445expressions): 446 447.. code-block:: ocaml 448 449 (* prototype 450 * ::= id '(' id* ')' *) 451 let parse_prototype = 452 let rec parse_args accumulator = parser 453 | [< 'Token.Ident id; e=parse_args (id::accumulator) >] -> e 454 | [< >] -> accumulator 455 in 456 457 parser 458 | [< 'Token.Ident id; 459 'Token.Kwd '(' ?? "expected '(' in prototype"; 460 args=parse_args []; 461 'Token.Kwd ')' ?? "expected ')' in prototype" >] -> 462 (* success. *) 463 Ast.Prototype (id, Array.of_list (List.rev args)) 464 465 | [< >] -> 466 raise (Stream.Error "expected function name in prototype") 467 468Given this, a function definition is very simple, just a prototype plus 469an expression to implement the body: 470 471.. code-block:: ocaml 472 473 (* definition ::= 'def' prototype expression *) 474 let parse_definition = parser 475 | [< 'Token.Def; p=parse_prototype; e=parse_expr >] -> 476 Ast.Function (p, e) 477 478In addition, we support 'extern' to declare functions like 'sin' and 479'cos' as well as to support forward declaration of user functions. These 480'extern's are just prototypes with no body: 481 482.. code-block:: ocaml 483 484 (* external ::= 'extern' prototype *) 485 let parse_extern = parser 486 | [< 'Token.Extern; e=parse_prototype >] -> e 487 488Finally, we'll also let the user type in arbitrary top-level expressions 489and evaluate them on the fly. We will handle this by defining anonymous 490nullary (zero argument) functions for them: 491 492.. code-block:: ocaml 493 494 (* toplevelexpr ::= expression *) 495 let parse_toplevel = parser 496 | [< e=parse_expr >] -> 497 (* Make an anonymous proto. *) 498 Ast.Function (Ast.Prototype ("", [||]), e) 499 500Now that we have all the pieces, let's build a little driver that will 501let us actually *execute* this code we've built! 502 503The Driver 504========== 505 506The driver for this simply invokes all of the parsing pieces with a 507top-level dispatch loop. There isn't much interesting here, so I'll just 508include the top-level loop. See `below <#code>`_ for full code in the 509"Top-Level Parsing" section. 510 511.. code-block:: ocaml 512 513 (* top ::= definition | external | expression | ';' *) 514 let rec main_loop stream = 515 match Stream.peek stream with 516 | None -> () 517 518 (* ignore top-level semicolons. *) 519 | Some (Token.Kwd ';') -> 520 Stream.junk stream; 521 main_loop stream 522 523 | Some token -> 524 begin 525 try match token with 526 | Token.Def -> 527 ignore(Parser.parse_definition stream); 528 print_endline "parsed a function definition."; 529 | Token.Extern -> 530 ignore(Parser.parse_extern stream); 531 print_endline "parsed an extern."; 532 | _ -> 533 (* Evaluate a top-level expression into an anonymous function. *) 534 ignore(Parser.parse_toplevel stream); 535 print_endline "parsed a top-level expr"; 536 with Stream.Error s -> 537 (* Skip token for error recovery. *) 538 Stream.junk stream; 539 print_endline s; 540 end; 541 print_string "ready> "; flush stdout; 542 main_loop stream 543 544The most interesting part of this is that we ignore top-level 545semicolons. Why is this, you ask? The basic reason is that if you type 546"4 + 5" at the command line, the parser doesn't know whether that is the 547end of what you will type or not. For example, on the next line you 548could type "def foo..." in which case 4+5 is the end of a top-level 549expression. Alternatively you could type "\* 6", which would continue 550the expression. Having top-level semicolons allows you to type "4+5;", 551and the parser will know you are done. 552 553Conclusions 554=========== 555 556With just under 300 lines of commented code (240 lines of non-comment, 557non-blank code), we fully defined our minimal language, including a 558lexer, parser, and AST builder. With this done, the executable will 559validate Kaleidoscope code and tell us if it is grammatically invalid. 560For example, here is a sample interaction: 561 562.. code-block:: bash 563 564 $ ./toy.byte 565 ready> def foo(x y) x+foo(y, 4.0); 566 Parsed a function definition. 567 ready> def foo(x y) x+y y; 568 Parsed a function definition. 569 Parsed a top-level expr 570 ready> def foo(x y) x+y ); 571 Parsed a function definition. 572 Error: unknown token when expecting an expression 573 ready> extern sin(a); 574 ready> Parsed an extern 575 ready> ^D 576 $ 577 578There is a lot of room for extension here. You can define new AST nodes, 579extend the language in many ways, etc. In the `next 580installment <OCamlLangImpl3.html>`_, we will describe how to generate 581LLVM Intermediate Representation (IR) from the AST. 582 583Full Code Listing 584================= 585 586Here is the complete code listing for this and the previous chapter. 587Note that it is fully self-contained: you don't need LLVM or any 588external libraries at all for this. (Besides the ocaml standard 589libraries, of course.) To build this, just compile with: 590 591.. code-block:: bash 592 593 # Compile 594 ocamlbuild toy.byte 595 # Run 596 ./toy.byte 597 598Here is the code: 599 600\_tags: 601 :: 602 603 <{lexer,parser}.ml>: use_camlp4, pp(camlp4of) 604 605token.ml: 606 .. code-block:: ocaml 607 608 (*===----------------------------------------------------------------------=== 609 * Lexer Tokens 610 *===----------------------------------------------------------------------===*) 611 612 (* The lexer returns these 'Kwd' if it is an unknown character, otherwise one of 613 * these others for known things. *) 614 type token = 615 (* commands *) 616 | Def | Extern 617 618 (* primary *) 619 | Ident of string | Number of float 620 621 (* unknown *) 622 | Kwd of char 623 624lexer.ml: 625 .. code-block:: ocaml 626 627 (*===----------------------------------------------------------------------=== 628 * Lexer 629 *===----------------------------------------------------------------------===*) 630 631 let rec lex = parser 632 (* Skip any whitespace. *) 633 | [< ' (' ' | '\n' | '\r' | '\t'); stream >] -> lex stream 634 635 (* identifier: [a-zA-Z][a-zA-Z0-9] *) 636 | [< ' ('A' .. 'Z' | 'a' .. 'z' as c); stream >] -> 637 let buffer = Buffer.create 1 in 638 Buffer.add_char buffer c; 639 lex_ident buffer stream 640 641 (* number: [0-9.]+ *) 642 | [< ' ('0' .. '9' as c); stream >] -> 643 let buffer = Buffer.create 1 in 644 Buffer.add_char buffer c; 645 lex_number buffer stream 646 647 (* Comment until end of line. *) 648 | [< ' ('#'); stream >] -> 649 lex_comment stream 650 651 (* Otherwise, just return the character as its ascii value. *) 652 | [< 'c; stream >] -> 653 [< 'Token.Kwd c; lex stream >] 654 655 (* end of stream. *) 656 | [< >] -> [< >] 657 658 and lex_number buffer = parser 659 | [< ' ('0' .. '9' | '.' as c); stream >] -> 660 Buffer.add_char buffer c; 661 lex_number buffer stream 662 | [< stream=lex >] -> 663 [< 'Token.Number (float_of_string (Buffer.contents buffer)); stream >] 664 665 and lex_ident buffer = parser 666 | [< ' ('A' .. 'Z' | 'a' .. 'z' | '0' .. '9' as c); stream >] -> 667 Buffer.add_char buffer c; 668 lex_ident buffer stream 669 | [< stream=lex >] -> 670 match Buffer.contents buffer with 671 | "def" -> [< 'Token.Def; stream >] 672 | "extern" -> [< 'Token.Extern; stream >] 673 | id -> [< 'Token.Ident id; stream >] 674 675 and lex_comment = parser 676 | [< ' ('\n'); stream=lex >] -> stream 677 | [< 'c; e=lex_comment >] -> e 678 | [< >] -> [< >] 679 680ast.ml: 681 .. code-block:: ocaml 682 683 (*===----------------------------------------------------------------------=== 684 * Abstract Syntax Tree (aka Parse Tree) 685 *===----------------------------------------------------------------------===*) 686 687 (* expr - Base type for all expression nodes. *) 688 type expr = 689 (* variant for numeric literals like "1.0". *) 690 | Number of float 691 692 (* variant for referencing a variable, like "a". *) 693 | Variable of string 694 695 (* variant for a binary operator. *) 696 | Binary of char * expr * expr 697 698 (* variant for function calls. *) 699 | Call of string * expr array 700 701 (* proto - This type represents the "prototype" for a function, which captures 702 * its name, and its argument names (thus implicitly the number of arguments the 703 * function takes). *) 704 type proto = Prototype of string * string array 705 706 (* func - This type represents a function definition itself. *) 707 type func = Function of proto * expr 708 709parser.ml: 710 .. code-block:: ocaml 711 712 (*===---------------------------------------------------------------------=== 713 * Parser 714 *===---------------------------------------------------------------------===*) 715 716 (* binop_precedence - This holds the precedence for each binary operator that is 717 * defined *) 718 let binop_precedence:(char, int) Hashtbl.t = Hashtbl.create 10 719 720 (* precedence - Get the precedence of the pending binary operator token. *) 721 let precedence c = try Hashtbl.find binop_precedence c with Not_found -> -1 722 723 (* primary 724 * ::= identifier 725 * ::= numberexpr 726 * ::= parenexpr *) 727 let rec parse_primary = parser 728 (* numberexpr ::= number *) 729 | [< 'Token.Number n >] -> Ast.Number n 730 731 (* parenexpr ::= '(' expression ')' *) 732 | [< 'Token.Kwd '('; e=parse_expr; 'Token.Kwd ')' ?? "expected ')'" >] -> e 733 734 (* identifierexpr 735 * ::= identifier 736 * ::= identifier '(' argumentexpr ')' *) 737 | [< 'Token.Ident id; stream >] -> 738 let rec parse_args accumulator = parser 739 | [< e=parse_expr; stream >] -> 740 begin parser 741 | [< 'Token.Kwd ','; e=parse_args (e :: accumulator) >] -> e 742 | [< >] -> e :: accumulator 743 end stream 744 | [< >] -> accumulator 745 in 746 let rec parse_ident id = parser 747 (* Call. *) 748 | [< 'Token.Kwd '('; 749 args=parse_args []; 750 'Token.Kwd ')' ?? "expected ')'">] -> 751 Ast.Call (id, Array.of_list (List.rev args)) 752 753 (* Simple variable ref. *) 754 | [< >] -> Ast.Variable id 755 in 756 parse_ident id stream 757 758 | [< >] -> raise (Stream.Error "unknown token when expecting an expression.") 759 760 (* binoprhs 761 * ::= ('+' primary)* *) 762 and parse_bin_rhs expr_prec lhs stream = 763 match Stream.peek stream with 764 (* If this is a binop, find its precedence. *) 765 | Some (Token.Kwd c) when Hashtbl.mem binop_precedence c -> 766 let token_prec = precedence c in 767 768 (* If this is a binop that binds at least as tightly as the current binop, 769 * consume it, otherwise we are done. *) 770 if token_prec < expr_prec then lhs else begin 771 (* Eat the binop. *) 772 Stream.junk stream; 773 774 (* Parse the primary expression after the binary operator. *) 775 let rhs = parse_primary stream in 776 777 (* Okay, we know this is a binop. *) 778 let rhs = 779 match Stream.peek stream with 780 | Some (Token.Kwd c2) -> 781 (* If BinOp binds less tightly with rhs than the operator after 782 * rhs, let the pending operator take rhs as its lhs. *) 783 let next_prec = precedence c2 in 784 if token_prec < next_prec 785 then parse_bin_rhs (token_prec + 1) rhs stream 786 else rhs 787 | _ -> rhs 788 in 789 790 (* Merge lhs/rhs. *) 791 let lhs = Ast.Binary (c, lhs, rhs) in 792 parse_bin_rhs expr_prec lhs stream 793 end 794 | _ -> lhs 795 796 (* expression 797 * ::= primary binoprhs *) 798 and parse_expr = parser 799 | [< lhs=parse_primary; stream >] -> parse_bin_rhs 0 lhs stream 800 801 (* prototype 802 * ::= id '(' id* ')' *) 803 let parse_prototype = 804 let rec parse_args accumulator = parser 805 | [< 'Token.Ident id; e=parse_args (id::accumulator) >] -> e 806 | [< >] -> accumulator 807 in 808 809 parser 810 | [< 'Token.Ident id; 811 'Token.Kwd '(' ?? "expected '(' in prototype"; 812 args=parse_args []; 813 'Token.Kwd ')' ?? "expected ')' in prototype" >] -> 814 (* success. *) 815 Ast.Prototype (id, Array.of_list (List.rev args)) 816 817 | [< >] -> 818 raise (Stream.Error "expected function name in prototype") 819 820 (* definition ::= 'def' prototype expression *) 821 let parse_definition = parser 822 | [< 'Token.Def; p=parse_prototype; e=parse_expr >] -> 823 Ast.Function (p, e) 824 825 (* toplevelexpr ::= expression *) 826 let parse_toplevel = parser 827 | [< e=parse_expr >] -> 828 (* Make an anonymous proto. *) 829 Ast.Function (Ast.Prototype ("", [||]), e) 830 831 (* external ::= 'extern' prototype *) 832 let parse_extern = parser 833 | [< 'Token.Extern; e=parse_prototype >] -> e 834 835toplevel.ml: 836 .. code-block:: ocaml 837 838 (*===----------------------------------------------------------------------=== 839 * Top-Level parsing and JIT Driver 840 *===----------------------------------------------------------------------===*) 841 842 (* top ::= definition | external | expression | ';' *) 843 let rec main_loop stream = 844 match Stream.peek stream with 845 | None -> () 846 847 (* ignore top-level semicolons. *) 848 | Some (Token.Kwd ';') -> 849 Stream.junk stream; 850 main_loop stream 851 852 | Some token -> 853 begin 854 try match token with 855 | Token.Def -> 856 ignore(Parser.parse_definition stream); 857 print_endline "parsed a function definition."; 858 | Token.Extern -> 859 ignore(Parser.parse_extern stream); 860 print_endline "parsed an extern."; 861 | _ -> 862 (* Evaluate a top-level expression into an anonymous function. *) 863 ignore(Parser.parse_toplevel stream); 864 print_endline "parsed a top-level expr"; 865 with Stream.Error s -> 866 (* Skip token for error recovery. *) 867 Stream.junk stream; 868 print_endline s; 869 end; 870 print_string "ready> "; flush stdout; 871 main_loop stream 872 873toy.ml: 874 .. code-block:: ocaml 875 876 (*===----------------------------------------------------------------------=== 877 * Main driver code. 878 *===----------------------------------------------------------------------===*) 879 880 let main () = 881 (* Install standard binary operators. 882 * 1 is the lowest precedence. *) 883 Hashtbl.add Parser.binop_precedence '<' 10; 884 Hashtbl.add Parser.binop_precedence '+' 20; 885 Hashtbl.add Parser.binop_precedence '-' 20; 886 Hashtbl.add Parser.binop_precedence '*' 40; (* highest. *) 887 888 (* Prime the first token. *) 889 print_string "ready> "; flush stdout; 890 let stream = Lexer.lex (Stream.of_channel stdin) in 891 892 (* Run the main "interpreter loop" now. *) 893 Toplevel.main_loop stream; 894 ;; 895 896 main () 897 898`Next: Implementing Code Generation to LLVM IR <OCamlLangImpl3.html>`_ 899 900