1=============================================================== 2Tutorial for building tools using LibTooling and LibASTMatchers 3=============================================================== 4 5This document is intended to show how to build a useful source-to-source 6translation tool based on Clang's `LibTooling <LibTooling.html>`_. It is 7explicitly aimed at people who are new to Clang, so all you should need 8is a working knowledge of C++ and the command line. 9 10In order to work on the compiler, you need some basic knowledge of the 11abstract syntax tree (AST). To this end, the reader is encouraged to 12skim the :doc:`Introduction to the Clang 13AST <IntroductionToTheClangAST>` 14 15Step 0: Obtaining Clang 16======================= 17 18As Clang is part of the LLVM project, you'll need to download LLVM's 19source code first. Both Clang and LLVM are in the same git repository, 20under different directories. For further information, see the `getting 21started guide <https://llvm.org/docs/GettingStarted.html>`_. 22 23.. code-block:: console 24 25 cd ~/clang-llvm 26 git clone https://github.com/llvm/llvm-project.git 27 28Next you need to obtain the CMake build system and Ninja build tool. 29 30.. code-block:: console 31 32 cd ~/clang-llvm 33 git clone https://github.com/martine/ninja.git 34 cd ninja 35 git checkout release 36 ./bootstrap.py 37 sudo cp ninja /usr/bin/ 38 39 cd ~/clang-llvm 40 git clone git://cmake.org/stage/cmake.git 41 cd cmake 42 git checkout next 43 ./bootstrap 44 make 45 sudo make install 46 47Okay. Now we'll build Clang! 48 49.. code-block:: console 50 51 cd ~/clang-llvm 52 mkdir build && cd build 53 cmake -G Ninja ../llvm -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra" -DLLVM_BUILD_TESTS=ON # Enable tests; default is off. 54 ninja 55 ninja check # Test LLVM only. 56 ninja clang-test # Test Clang only. 57 ninja install 58 59And we're live. 60 61All of the tests should pass. 62 63Finally, we want to set Clang as its own compiler. 64 65.. code-block:: console 66 67 cd ~/clang-llvm/build 68 ccmake ../llvm 69 70The second command will bring up a GUI for configuring Clang. You need 71to set the entry for ``CMAKE_CXX_COMPILER``. Press ``'t'`` to turn on 72advanced mode. Scroll down to ``CMAKE_CXX_COMPILER``, and set it to 73``/usr/bin/clang++``, or wherever you installed it. Press ``'c'`` to 74configure, then ``'g'`` to generate CMake's files. 75 76Finally, run ninja one last time, and you're done. 77 78Step 1: Create a ClangTool 79========================== 80 81Now that we have enough background knowledge, it's time to create the 82simplest productive ClangTool in existence: a syntax checker. While this 83already exists as ``clang-check``, it's important to understand what's 84going on. 85 86First, we'll need to create a new directory for our tool and tell CMake 87that it exists. As this is not going to be a core clang tool, it will 88live in the ``clang-tools-extra`` repository. 89 90.. code-block:: console 91 92 cd ~/clang-llvm 93 mkdir clang-tools-extra/loop-convert 94 echo 'add_subdirectory(loop-convert)' >> clang-tools-extra/CMakeLists.txt 95 vim clang-tools-extra/loop-convert/CMakeLists.txt 96 97CMakeLists.txt should have the following contents: 98 99:: 100 101 set(LLVM_LINK_COMPONENTS support) 102 103 add_clang_executable(loop-convert 104 LoopConvert.cpp 105 ) 106 target_link_libraries(loop-convert 107 PRIVATE 108 clangTooling 109 clangBasic 110 clangASTMatchers 111 ) 112 113With that done, Ninja will be able to compile our tool. Let's give it 114something to compile! Put the following into 115``clang-tools-extra/loop-convert/LoopConvert.cpp``. A detailed explanation of 116why the different parts are needed can be found in the `LibTooling 117documentation <LibTooling.html>`_. 118 119.. code-block:: c++ 120 121 // Declares clang::SyntaxOnlyAction. 122 #include "clang/Frontend/FrontendActions.h" 123 #include "clang/Tooling/CommonOptionsParser.h" 124 #include "clang/Tooling/Tooling.h" 125 // Declares llvm::cl::extrahelp. 126 #include "llvm/Support/CommandLine.h" 127 128 using namespace clang::tooling; 129 using namespace llvm; 130 131 // Apply a custom category to all command-line options so that they are the 132 // only ones displayed. 133 static llvm::cl::OptionCategory MyToolCategory("my-tool options"); 134 135 // CommonOptionsParser declares HelpMessage with a description of the common 136 // command-line options related to the compilation database and input files. 137 // It's nice to have this help message in all tools. 138 static cl::extrahelp CommonHelp(CommonOptionsParser::HelpMessage); 139 140 // A help message for this specific tool can be added afterwards. 141 static cl::extrahelp MoreHelp("\nMore help text...\n"); 142 143 int main(int argc, const char **argv) { 144 CommonOptionsParser OptionsParser(argc, argv, MyToolCategory); 145 ClangTool Tool(OptionsParser.getCompilations(), 146 OptionsParser.getSourcePathList()); 147 return Tool.run(newFrontendActionFactory<clang::SyntaxOnlyAction>().get()); 148 } 149 150And that's it! You can compile our new tool by running ninja from the 151``build`` directory. 152 153.. code-block:: console 154 155 cd ~/clang-llvm/build 156 ninja 157 158You should now be able to run the syntax checker, which is located in 159``~/clang-llvm/build/bin``, on any source file. Try it! 160 161.. code-block:: console 162 163 echo "int main() { return 0; }" > test.cpp 164 bin/loop-convert test.cpp -- 165 166Note the two dashes after we specify the source file. The additional 167options for the compiler are passed after the dashes rather than loading 168them from a compilation database - there just aren't any options needed 169right now. 170 171Intermezzo: Learn AST matcher basics 172==================================== 173 174Clang recently introduced the :doc:`ASTMatcher 175library <LibASTMatchers>` to provide a simple, powerful, and 176concise way to describe specific patterns in the AST. Implemented as a 177DSL powered by macros and templates (see 178`ASTMatchers.h <../doxygen/ASTMatchers_8h_source.html>`_ if you're 179curious), matchers offer the feel of algebraic data types common to 180functional programming languages. 181 182For example, suppose you wanted to examine only binary operators. There 183is a matcher to do exactly that, conveniently named ``binaryOperator``. 184I'll give you one guess what this matcher does: 185 186.. code-block:: c++ 187 188 binaryOperator(hasOperatorName("+"), hasLHS(integerLiteral(equals(0)))) 189 190Shockingly, it will match against addition expressions whose left hand 191side is exactly the literal 0. It will not match against other forms of 1920, such as ``'\0'`` or ``NULL``, but it will match against macros that 193expand to 0. The matcher will also not match against calls to the 194overloaded operator ``'+'``, as there is a separate ``operatorCallExpr`` 195matcher to handle overloaded operators. 196 197There are AST matchers to match all the different nodes of the AST, 198narrowing matchers to only match AST nodes fulfilling specific criteria, 199and traversal matchers to get from one kind of AST node to another. For 200a complete list of AST matchers, take a look at the `AST Matcher 201References <LibASTMatchersReference.html>`_ 202 203All matcher that are nouns describe entities in the AST and can be 204bound, so that they can be referred to whenever a match is found. To do 205so, simply call the method ``bind`` on these matchers, e.g.: 206 207.. code-block:: c++ 208 209 variable(hasType(isInteger())).bind("intvar") 210 211Step 2: Using AST matchers 212========================== 213 214Okay, on to using matchers for real. Let's start by defining a matcher 215which will capture all ``for`` statements that define a new variable 216initialized to zero. Let's start with matching all ``for`` loops: 217 218.. code-block:: c++ 219 220 forStmt() 221 222Next, we want to specify that a single variable is declared in the first 223portion of the loop, so we can extend the matcher to 224 225.. code-block:: c++ 226 227 forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl())))) 228 229Finally, we can add the condition that the variable is initialized to 230zero. 231 232.. code-block:: c++ 233 234 forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl( 235 hasInitializer(integerLiteral(equals(0)))))))) 236 237It is fairly easy to read and understand the matcher definition ("match 238loops whose init portion declares a single variable which is initialized 239to the integer literal 0"), but deciding that every piece is necessary 240is more difficult. Note that this matcher will not match loops whose 241variables are initialized to ``'\0'``, ``0.0``, ``NULL``, or any form of 242zero besides the integer 0. 243 244The last step is giving the matcher a name and binding the ``ForStmt`` 245as we will want to do something with it: 246 247.. code-block:: c++ 248 249 StatementMatcher LoopMatcher = 250 forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl( 251 hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop"); 252 253Once you have defined your matchers, you will need to add a little more 254scaffolding in order to run them. Matchers are paired with a 255``MatchCallback`` and registered with a ``MatchFinder`` object, then run 256from a ``ClangTool``. More code! 257 258Add the following to ``LoopConvert.cpp``: 259 260.. code-block:: c++ 261 262 #include "clang/ASTMatchers/ASTMatchers.h" 263 #include "clang/ASTMatchers/ASTMatchFinder.h" 264 265 using namespace clang; 266 using namespace clang::ast_matchers; 267 268 StatementMatcher LoopMatcher = 269 forStmt(hasLoopInit(declStmt(hasSingleDecl(varDecl( 270 hasInitializer(integerLiteral(equals(0)))))))).bind("forLoop"); 271 272 class LoopPrinter : public MatchFinder::MatchCallback { 273 public : 274 virtual void run(const MatchFinder::MatchResult &Result) { 275 if (const ForStmt *FS = Result.Nodes.getNodeAs<clang::ForStmt>("forLoop")) 276 FS->dump(); 277 } 278 }; 279 280And change ``main()`` to: 281 282.. code-block:: c++ 283 284 int main(int argc, const char **argv) { 285 CommonOptionsParser OptionsParser(argc, argv, MyToolCategory); 286 ClangTool Tool(OptionsParser.getCompilations(), 287 OptionsParser.getSourcePathList()); 288 289 LoopPrinter Printer; 290 MatchFinder Finder; 291 Finder.addMatcher(LoopMatcher, &Printer); 292 293 return Tool.run(newFrontendActionFactory(&Finder).get()); 294 } 295 296Now, you should be able to recompile and run the code to discover for 297loops. Create a new file with a few examples, and test out our new 298handiwork: 299 300.. code-block:: console 301 302 cd ~/clang-llvm/llvm/llvm_build/ 303 ninja loop-convert 304 vim ~/test-files/simple-loops.cc 305 bin/loop-convert ~/test-files/simple-loops.cc 306 307Step 3.5: More Complicated Matchers 308=================================== 309 310Our simple matcher is capable of discovering for loops, but we would 311still need to filter out many more ourselves. We can do a good portion 312of the remaining work with some cleverly chosen matchers, but first we 313need to decide exactly which properties we want to allow. 314 315How can we characterize for loops over arrays which would be eligible 316for translation to range-based syntax? Range based loops over arrays of 317size ``N`` that: 318 319- start at index ``0`` 320- iterate consecutively 321- end at index ``N-1`` 322 323We already check for (1), so all we need to add is a check to the loop's 324condition to ensure that the loop's index variable is compared against 325``N`` and another check to ensure that the increment step just 326increments this same variable. The matcher for (2) is straightforward: 327require a pre- or post-increment of the same variable declared in the 328init portion. 329 330Unfortunately, such a matcher is impossible to write. Matchers contain 331no logic for comparing two arbitrary AST nodes and determining whether 332or not they are equal, so the best we can do is matching more than we 333would like to allow, and punting extra comparisons to the callback. 334 335In any case, we can start building this sub-matcher. We can require that 336the increment step be a unary increment like this: 337 338.. code-block:: c++ 339 340 hasIncrement(unaryOperator(hasOperatorName("++"))) 341 342Specifying what is incremented introduces another quirk of Clang's AST: 343Usages of variables are represented as ``DeclRefExpr``'s ("declaration 344reference expressions") because they are expressions which refer to 345variable declarations. To find a ``unaryOperator`` that refers to a 346specific declaration, we can simply add a second condition to it: 347 348.. code-block:: c++ 349 350 hasIncrement(unaryOperator( 351 hasOperatorName("++"), 352 hasUnaryOperand(declRefExpr()))) 353 354Furthermore, we can restrict our matcher to only match if the 355incremented variable is an integer: 356 357.. code-block:: c++ 358 359 hasIncrement(unaryOperator( 360 hasOperatorName("++"), 361 hasUnaryOperand(declRefExpr(to(varDecl(hasType(isInteger()))))))) 362 363And the last step will be to attach an identifier to this variable, so 364that we can retrieve it in the callback: 365 366.. code-block:: c++ 367 368 hasIncrement(unaryOperator( 369 hasOperatorName("++"), 370 hasUnaryOperand(declRefExpr(to( 371 varDecl(hasType(isInteger())).bind("incrementVariable")))))) 372 373We can add this code to the definition of ``LoopMatcher`` and make sure 374that our program, outfitted with the new matcher, only prints out loops 375that declare a single variable initialized to zero and have an increment 376step consisting of a unary increment of some variable. 377 378Now, we just need to add a matcher to check if the condition part of the 379``for`` loop compares a variable against the size of the array. There is 380only one problem - we don't know which array we're iterating over 381without looking at the body of the loop! We are again restricted to 382approximating the result we want with matchers, filling in the details 383in the callback. So we start with: 384 385.. code-block:: c++ 386 387 hasCondition(binaryOperator(hasOperatorName("<")) 388 389It makes sense to ensure that the left-hand side is a reference to a 390variable, and that the right-hand side has integer type. 391 392.. code-block:: c++ 393 394 hasCondition(binaryOperator( 395 hasOperatorName("<"), 396 hasLHS(declRefExpr(to(varDecl(hasType(isInteger()))))), 397 hasRHS(expr(hasType(isInteger()))))) 398 399Why? Because it doesn't work. Of the three loops provided in 400``test-files/simple.cpp``, zero of them have a matching condition. A 401quick look at the AST dump of the first for loop, produced by the 402previous iteration of loop-convert, shows us the answer: 403 404:: 405 406 (ForStmt 0x173b240 407 (DeclStmt 0x173afc8 408 0x173af50 "int i = 409 (IntegerLiteral 0x173afa8 'int' 0)") 410 <<>> 411 (BinaryOperator 0x173b060 '_Bool' '<' 412 (ImplicitCastExpr 0x173b030 'int' 413 (DeclRefExpr 0x173afe0 'int' lvalue Var 0x173af50 'i' 'int')) 414 (ImplicitCastExpr 0x173b048 'int' 415 (DeclRefExpr 0x173b008 'const int' lvalue Var 0x170fa80 'N' 'const int'))) 416 (UnaryOperator 0x173b0b0 'int' lvalue prefix '++' 417 (DeclRefExpr 0x173b088 'int' lvalue Var 0x173af50 'i' 'int')) 418 (CompoundStatement ... 419 420We already know that the declaration and increments both match, or this 421loop wouldn't have been dumped. The culprit lies in the implicit cast 422applied to the first operand (i.e. the LHS) of the less-than operator, 423an L-value to R-value conversion applied to the expression referencing 424``i``. Thankfully, the matcher library offers a solution to this problem 425in the form of ``ignoringParenImpCasts``, which instructs the matcher to 426ignore implicit casts and parentheses before continuing to match. 427Adjusting the condition operator will restore the desired match. 428 429.. code-block:: c++ 430 431 hasCondition(binaryOperator( 432 hasOperatorName("<"), 433 hasLHS(ignoringParenImpCasts(declRefExpr( 434 to(varDecl(hasType(isInteger())))))), 435 hasRHS(expr(hasType(isInteger()))))) 436 437After adding binds to the expressions we wished to capture and 438extracting the identifier strings into variables, we have array-step-2 439completed. 440 441Step 4: Retrieving Matched Nodes 442================================ 443 444So far, the matcher callback isn't very interesting: it just dumps the 445loop's AST. At some point, we will need to make changes to the input 446source code. Next, we'll work on using the nodes we bound in the 447previous step. 448 449The ``MatchFinder::run()`` callback takes a 450``MatchFinder::MatchResult&`` as its parameter. We're most interested in 451its ``Context`` and ``Nodes`` members. Clang uses the ``ASTContext`` 452class to represent contextual information about the AST, as the name 453implies, though the most functionally important detail is that several 454operations require an ``ASTContext*`` parameter. More immediately useful 455is the set of matched nodes, and how we retrieve them. 456 457Since we bind three variables (identified by ConditionVarName, 458InitVarName, and IncrementVarName), we can obtain the matched nodes by 459using the ``getNodeAs()`` member function. 460 461In ``LoopConvert.cpp`` add 462 463.. code-block:: c++ 464 465 #include "clang/AST/ASTContext.h" 466 467Change ``LoopMatcher`` to 468 469.. code-block:: c++ 470 471 StatementMatcher LoopMatcher = 472 forStmt(hasLoopInit(declStmt( 473 hasSingleDecl(varDecl(hasInitializer(integerLiteral(equals(0)))) 474 .bind("initVarName")))), 475 hasIncrement(unaryOperator( 476 hasOperatorName("++"), 477 hasUnaryOperand(declRefExpr( 478 to(varDecl(hasType(isInteger())).bind("incVarName")))))), 479 hasCondition(binaryOperator( 480 hasOperatorName("<"), 481 hasLHS(ignoringParenImpCasts(declRefExpr( 482 to(varDecl(hasType(isInteger())).bind("condVarName"))))), 483 hasRHS(expr(hasType(isInteger())))))).bind("forLoop"); 484 485And change ``LoopPrinter::run`` to 486 487.. code-block:: c++ 488 489 void LoopPrinter::run(const MatchFinder::MatchResult &Result) { 490 ASTContext *Context = Result.Context; 491 const ForStmt *FS = Result.Nodes.getNodeAs<ForStmt>("forLoop"); 492 // We do not want to convert header files! 493 if (!FS || !Context->getSourceManager().isWrittenInMainFile(FS->getForLoc())) 494 return; 495 const VarDecl *IncVar = Result.Nodes.getNodeAs<VarDecl>("incVarName"); 496 const VarDecl *CondVar = Result.Nodes.getNodeAs<VarDecl>("condVarName"); 497 const VarDecl *InitVar = Result.Nodes.getNodeAs<VarDecl>("initVarName"); 498 499 if (!areSameVariable(IncVar, CondVar) || !areSameVariable(IncVar, InitVar)) 500 return; 501 llvm::outs() << "Potential array-based loop discovered.\n"; 502 } 503 504Clang associates a ``VarDecl`` with each variable to represent the variable's 505declaration. Since the "canonical" form of each declaration is unique by 506address, all we need to do is make sure neither ``ValueDecl`` (base class of 507``VarDecl``) is ``NULL`` and compare the canonical Decls. 508 509.. code-block:: c++ 510 511 static bool areSameVariable(const ValueDecl *First, const ValueDecl *Second) { 512 return First && Second && 513 First->getCanonicalDecl() == Second->getCanonicalDecl(); 514 } 515 516If execution reaches the end of ``LoopPrinter::run()``, we know that the 517loop shell that looks like 518 519.. code-block:: c++ 520 521 for (int i= 0; i < expr(); ++i) { ... } 522 523For now, we will just print a message explaining that we found a loop. 524The next section will deal with recursively traversing the AST to 525discover all changes needed. 526 527As a side note, it's not as trivial to test if two expressions are the same, 528though Clang has already done the hard work for us by providing a way to 529canonicalize expressions: 530 531.. code-block:: c++ 532 533 static bool areSameExpr(ASTContext *Context, const Expr *First, 534 const Expr *Second) { 535 if (!First || !Second) 536 return false; 537 llvm::FoldingSetNodeID FirstID, SecondID; 538 First->Profile(FirstID, *Context, true); 539 Second->Profile(SecondID, *Context, true); 540 return FirstID == SecondID; 541 } 542 543This code relies on the comparison between two 544``llvm::FoldingSetNodeIDs``. As the documentation for 545``Stmt::Profile()`` indicates, the ``Profile()`` member function builds 546a description of a node in the AST, based on its properties, along with 547those of its children. ``FoldingSetNodeID`` then serves as a hash we can 548use to compare expressions. We will need ``areSameExpr`` later. Before 549you run the new code on the additional loops added to 550test-files/simple.cpp, try to figure out which ones will be considered 551potentially convertible. 552