1Wed Jun 25 15:13:51 CDT 2003 2 3First-level instrumentation 4--------------------------- 5 6We use opt to do Bytecode-to-bytecode instrumentation. Look at 7back-edges and insert llvm_first_trigger() function call which takes 8no arguments and no return value. This instrumentation is designed to 9be easy to remove, for instance by writing a NOP over the function 10call instruction. 11 12Keep count of every call to llvm_first_trigger(), and maintain 13counters in a map indexed by return address. If the trigger count 14exceeds a threshold, we identify a hot loop and perform second-level 15instrumentation on the hot loop region (the instructions between the 16target of the back-edge and the branch that causes the back-edge). We 17do not move code across basic-block boundaries. 18 19 20Second-level instrumentation 21--------------------------- 22 23We remove the first-level instrumentation by overwriting the CALL to 24llvm_first_trigger() with a NOP. 25 26The reoptimizer maintains a map between machine-code basic blocks and 27LLVM BasicBlock*s. We only keep track of paths that start at the 28first machine-code basic block of the hot loop region. 29 30How do we keep track of which edges to instrument, and which edges are 31exits from the hot region? 3 step process. 32 331) Do a DFS from the first machine-code basic block of the hot loop 34region and mark reachable edges. 35 362) Do a DFS from the last machine-code basic block of the hot loop 37region IGNORING back edges, and mark the edges which are reachable in 381) and also in 2) (i.e., must be reachable from both the start BB and 39the end BB of the hot region). 40 413) Mark BBs which end in edges that exit the hot region; we need to 42instrument these differently. 43 44Assume that there is 1 free register. On SPARC we use %g1, which LLC 45has agreed not to use. Shift a 1 into it at the beginning. At every 46edge which corresponds to a conditional branch, we shift 0 for not 47taken and 1 for taken into a register. This uniquely numbers the paths 48through the hot region. Silently fail if we need more than 64 bits. 49 50At the end BB we call countPath and increment the counter based on %g1 51and the return address of the countPath call. We keep track of the 52number of iterations and the number of paths. We only run this 53version 30 or 40 times. 54 55Find the BBs that total 90% or more of execution, and aggregate them 56together to form our trace. But we do not allow more than 5 paths; if 57we have more than 5 we take the ones that are executed the most. We 58verify our assumption that we picked a hot back-edge in first-level 59instrumentation, by making sure that the number of times we took an 60exit edge from the hot trace is less than 10% of the number of 61iterations. 62 63LLC has been taught to recognize llvm_first_trigger() calls and NOT 64generate saves and restores of caller-saved registers around these 65calls. 66 67 68Phase behavior 69-------------- 70 71We turn off llvm_first_trigger() calls with NOPs, but this would hide 72phase behavior from us (when some funcs/traces stop being hot and 73others become hot.) 74 75We have a SIGALRM timer that counts time for us. Every time we get a 76SIGALRM we look at our priority queue of locations where we have 77removed llvm_first_trigger() calls. Each location is inserted along 78with a time when we will next turn instrumentation back on for that 79call site. If the time has arrived for a particular call site, we pop 80that off the prio. queue and turn instrumentation back on for that 81call site. 82 83 84Generating traces 85----------------- 86 87When we finally generate an optimized trace we first copy the code 88into the trace cache. This leaves us with 3 copies of the code: the 89original code, the instrumented code, and the optimized trace. The 90optimized trace does not have instrumentation. The original code and 91the instrumented code are modified to have a branch to the trace 92cache, where the optimized traces are kept. 93 94We copy the code from the original to the instrumentation version 95by tracing the LLVM-to-Machine code basic block map and then copying 96each machine code basic block we think is in the hot region into the 97trace cache. Then we instrument that code. The process is similar for 98generating the final optimized trace; we copy the same basic blocks 99because we might need to put in fixup code for exit BBs. 100 101LLVM basic blocks are not typically used in the Reoptimizer except 102for the mapping information. 103 104We are restricted to using single instructions to branch between the 105original code, trace, and instrumented code. So we have to keep the 106code copies in memory near the original code (they can't be far enough 107away that a single pc-relative branch would not work.) Malloc() or 108data region space is too far away. this impacts the design of the 109trace cache. 110 111We use a dummy function that is full of a bunch of for loops which we 112overwrite with trace-cache code. The trace manager keeps track of 113whether or not we have enough space in the trace cache, etc. 114 115The trace insertion routine takes an original start address, a vector 116of machine instructions representing the trace, index of branches and 117their corresponding absolute targets, and index of calls and their 118corresponding absolute targets. 119 120The trace insertion routine is responsible for inserting branches from 121the beginning of the original code to the beginning of the optimized 122trace. This is because at some point the trace cache may run out of 123space and it may have to evict a trace, at which point the branch to 124the trace would also have to be removed. It uses a round-robin 125replacement policy; we have found that this is almost as good as LRU 126and better than random (especially because of problems fitting the new 127trace in.) 128 129We cannot deal with discontiguous trace cache areas. The trace cache 130is supposed to be cache-line-aligned, but it is not page-aligned. 131 132We generate instrumentation traces and optimized traces into separate 133trace caches. We keep the instrumented code around because you don't 134want to delete a trace when you still might have to return to it 135(i.e., return from a llvm_first_trigger() or countPath() call.) 136 137 138