1Register allocation in Subzero 2============================== 3 4Introduction 5------------ 6 7`Subzero 8<https://chromium.googlesource.com/native_client/pnacl-subzero/+/master/docs/DESIGN.rst>`_ 9is a fast code generator that translates architecture-independent `PNaCl bitcode 10<https://developer.chrome.com/native-client/reference/pnacl-bitcode-abi>`_ into 11architecture-specific machine code. PNaCl bitcode is LLVM bitcode that has been 12simplified (e.g. weird-width primitive types like 57-bit integers are not 13allowed) and has had architecture-independent optimizations already applied. 14Subzero aims to generate high-quality code as fast as practical, and as such 15Subzero needs to make tradeoffs between compilation speed and output code 16quality. 17 18In Subzero, we have found register allocation to be one of the most important 19optimizations toward achieving the best code quality, which is in tension with 20register allocation's reputation for being complex and expensive. Linear-scan 21register allocation is a modern favorite for getting fairly good register 22allocation at relatively low cost. Subzero uses linear-scan for its core 23register allocator, with a few internal modifications to improve its 24effectiveness (specifically: register preference, register overlap, and register 25aliases). Subzero also does a few high-level things on top of its core register 26allocator to improve overall effectiveness (specifically: repeat until 27convergence, delayed phi lowering, and local live range splitting). 28 29What we describe here are techniques that have worked well for Subzero, in the 30context of its particular intermediate representation (IR) and compilation 31strategy. Some of these techniques may not be applicable to another compiler, 32depending on its particular IR and compilation strategy. Some key concepts in 33Subzero are the following: 34 35- Subzero's ``Variable`` operand is an operand that resides either on the stack 36 or in a physical register. A Variable can be tagged as *must-have-register* 37 or *must-not-have-register*, but its default is *may-have-register*. All uses 38 of the Variable map to the same physical register or stack location. 39 40- Basic lowering is done before register allocation. Lowering is the process of 41 translating PNaCl bitcode instructions into native instructions. Sometimes a 42 native instruction, like the x86 ``add`` instruction, allows one of its 43 Variable operands to be either in a physical register or on the stack, in 44 which case the lowering is relatively simple. But if the lowered instruction 45 requires the operand to be in a physical register, we generate code that 46 copies the Variable into a *must-have-register* temporary, and then use that 47 temporary in the lowered instruction. 48 49- Instructions within a basic block appear in a linearized order (as opposed to 50 something like a directed acyclic graph of dependency edges). 51 52- An instruction has 0 or 1 *dest* Variables and an arbitrary (but usually 53 small) number of *source* Variables. Assuming SSA form, the instruction 54 begins the live range of the dest Variable, and may end the live range of one 55 or more of the source Variables. 56 57Executive summary 58----------------- 59 60- Liveness analysis and live range construction are prerequisites for register 61 allocation. Without careful attention, they can be potentially very 62 expensive, especially when the number of variables and basic blocks gets very 63 large. Subzero uses some clever data structures to take advantage of the 64 sparsity of the data, resulting in stable performance as function size scales 65 up. This means that the proportion of compilation time spent on liveness 66 analysis stays roughly the same. 67 68- The core of Subzero's register allocator is taken from `Mössenböck and 69 Pfeiffer's paper <ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_ on 70 linear-scan register allocation. 71 72- We enhance the core algorithm with a good automatic preference mechanism when 73 more than one register is available, to try to minimize register shuffling. 74 75- We also enhance it to allow overlapping live ranges to share the same 76 register, when one variable is recognized as a read-only copy of the other. 77 78- We deal with register aliasing in a clean and general fashion. Register 79 aliasing is when e.g. 16-bit architectural registers share some bits with 80 their 32-bit counterparts, or 64-bit registers are actually pairs of 32-bit 81 registers. 82 83- We improve register allocation by rerunning the algorithm on likely candidates 84 that didn't get registers during the previous iteration, without imposing much 85 additional cost. 86 87- The main register allocation is done before phi lowering, because phi lowering 88 imposes early and unnecessary ordering constraints on the resulting 89 assigments, which create spurious interferences in live ranges. 90 91- Within each basic block, we aggressively split each variable's live range at 92 every use, so that portions of the live range can get registers even if the 93 whole live range can't. Doing this separately for each basic block avoids 94 merge complications, and keeps liveness analysis and register allocation fast 95 by fitting well into the sparsity optimizations of their data structures. 96 97Liveness analysis 98----------------- 99 100With respect to register allocation, the main purpose of liveness analysis is to 101calculate the live range of each variable. The live range is represented as a 102set of instruction number ranges. Instruction numbers within a basic block must 103be monotonically increasing, and the instruction ranges of two different basic 104blocks must not overlap. 105 106Basic liveness 107^^^^^^^^^^^^^^ 108 109Liveness analysis is a straightforward dataflow algorithm. For each basic 110block, we keep track of the live-in and live-out set, i.e. the set of variables 111that are live coming into or going out of the basic block. Processing of a 112basic block starts by initializing a temporary set as the union of the live-in 113sets of the basic block's successor blocks. (This basic block's live-out set is 114captured as the initial value of the temporary set.) Then each instruction of 115the basic block is processed in reverse order. All the source variables of the 116instruction are marked as live, by adding them to the temporary set, and the 117dest variable of the instruction (if any) is marked as not-live, by removing it 118from the temporary set. When we finish processing all of the block's 119instructions, we add/union the temporary set into the basic block's live-in set. 120If this changes the basic block's live-in set, then we mark all of this basic 121block's predecessor blocks to be reprocessed. Then we repeat for other basic 122blocks until convergence, i.e. no more basic blocks are marked to be 123reprocessed. If basic blocks are processed in reverse topological order, then 124the number of times each basic block need to be reprocessed is generally its 125loop nest depth. 126 127The result of this pass is the live-in and live-out set for each basic block. 128 129With so many set operations, choice of data structure is crucial for 130performance. We tried a few options, and found that a simple dense bit vector 131works best. This keeps the per-instruction cost very low. However, we found 132that when the function gets very large, merging and copying bit vectors at basic 133block boundaries dominates the cost. This is due to the number of variables 134(hence the bit vector size) and the number of basic blocks getting large. 135 136A simple enhancement brought this under control in Subzero. It turns out that 137the vast majority of variables are referenced, and therefore live, only in a 138single basic block. This is largely due to the SSA form of PNaCl bitcode. To 139take advantage of this, we partition variables by single-block versus 140multi-block liveness. Multi-block variables get lower-numbered bit vector 141indexes, and single-block variables get higher-number indexes. Single-block bit 142vector indexes are reused across different basic blocks. As such, the size of 143live-in and live-out bit vectors is limited to the number of multi-block 144variables, and the temporary set's size can be limited to that plus the largest 145number of single-block variables across all basic blocks. 146 147For the purpose of live range construction, we also need to track definitions 148(LiveBegin) and last-uses (LiveEnd) of variables used within instructions of the 149basic block. These are easy to detect while processing the instructions; data 150structure choices are described below. 151 152Live range construction 153^^^^^^^^^^^^^^^^^^^^^^^ 154 155After the live-in and live-out sets are calculated, we construct each variable's 156live range (as an ordered list of instruction ranges, described above). We do 157this by considering the live-in and live-out sets, combined with LiveBegin and 158LiveEnd information. This is done separately for each basic block. 159 160As before, we need to take advantage of sparsity of variable uses across basic 161blocks, to avoid overly copying/merging data structures. The following is what 162worked well for Subzero (after trying several other options). 163 164The basic liveness pass, described above, keeps track of when a variable's live 165range begins or ends within the block. LiveBegin and LiveEnd are unordered 166vectors where each element is a pair of the variable and the instruction number, 167representing that the particular variable's live range begins or ends at the 168particular instruction. When the liveness pass finds a variable whose live 169range begins or ends, it appends and entry to LiveBegin or LiveEnd. 170 171During live range construction, the LiveBegin and LiveEnd vectors are sorted by 172variable number. Then we iterate across both vectors in parallel. If a 173variable appears in both LiveBegin and LiveEnd, then its live range is entirely 174within this block. If it appears in only LiveBegin, then its live range starts 175here and extends through the end of the block. If it appears in only LiveEnd, 176then its live range starts at the beginning of the block and ends here. (Note 177that this only covers the live range within this block, and this process is 178repeated across all blocks.) 179 180It is also possible that a variable is live within this block but its live range 181does not begin or end in this block. These variables are identified simply by 182taking the intersection of the live-in and live-out sets. 183 184As a result of these data structures, performance of liveness analysis and live 185range construction tend to be stable across small, medium, and large functions, 186as measured by a fairly consistent proportion of total compilation time spent on 187the liveness passes. 188 189Linear-scan register allocation 190------------------------------- 191 192The basis of Subzero's register allocator is the allocator described by 193Hanspeter Mössenböck and Michael Pfeiffer in `Linear Scan Register Allocation in 194the Context of SSA Form and Register Constraints 195<ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_. It allows live ranges to 196be a union of intervals rather than a single conservative interval, and it 197allows pre-coloring of variables with specific physical registers. 198 199The paper suggests an approach of aggressively coalescing variables across Phi 200instructions (i.e., trying to force Phi source and dest variables to have the 201same register assignment), but we omit that in favor of the more natural 202preference mechanism described below. 203 204We found the paper quite remarkable in that a straightforward implementation of 205its pseudo-code led to an entirely correct register allocator. The only thing 206we found in the specification that was even close to a mistake is that it was 207too aggressive in evicting inactive ranges in the "else" clause of the 208AssignMemLoc routine. An inactive range only needs to be evicted if it actually 209overlaps the current range being considered, whereas the paper evicts it 210unconditionally. (Search for "original paper" in Subzero's register allocator 211source code.) 212 213Register preference 214------------------- 215 216The linear-scan algorithm from the paper talks about choosing an available 217register, but isn't specific on how to choose among several available registers. 218The simplest approach is to just choose the first available register, e.g. the 219lowest-numbered register. Often a better choice is possible. 220 221Specifically, if variable ``V`` is assigned in an instruction ``V=f(S1,S2,...)`` 222with source variables ``S1,S2,...``, and that instruction ends the live range of 223one of those source variables ``Sn``, and ``Sn`` was assigned a register, then 224``Sn``'s register is usually a good choice for ``V``. This is especially true 225when the instruction is a simple assignment, because an assignment where the 226dest and source variables end up with the same register can be trivially elided, 227reducing the amount of register-shuffling code. 228 229This requires being able to find and inspect a variable's defining instruction, 230which is not an assumption in the original paper but is easily tracked in 231practice. 232 233Register overlap 234---------------- 235 236Because Subzero does register allocation after basic lowering, the lowering has 237to be prepared for the possibility of any given program variable not getting a 238physical register. It does this by introducing *must-have-register* temporary 239variables, and copies the program variable into the temporary to ensure that 240register requirements in the target instruction are met. 241 242In many cases, the program variable and temporary variable have overlapping live 243ranges, and would be forced to have different registers even if the temporary 244variable is effectively a read-only copy of the program variable. We recognize 245this when the program variable has no definitions within the temporary 246variable's live range, and the temporary variable has no definitions within the 247program variable's live range with the exception of the copy assignment. 248 249This analysis is done as part of register preference detection. 250 251The main impact on the linear-scan implementation is that instead of 252setting/resetting a boolean flag to indicate whether a register is free or in 253use, we increment/decrement a number-of-uses counter. 254 255Register aliases 256---------------- 257 258Sometimes registers of different register classes partially overlap. For 259example, in x86, registers ``al`` and ``ah`` alias ``ax`` (though they don't 260alias each other), and all three alias ``eax`` and ``rax``. And in ARM, 261registers ``s0`` and ``s1`` (which are single-precision floating-point 262registers) alias ``d0`` (double-precision floating-point), and registers ``d0`` 263and ``d1`` alias ``q0`` (128-bit vector register). The linear-scan paper 264doesn't address this issue; it assumes that all registers are independent. A 265simple solution is to essentially avoid aliasing by disallowing a subset of the 266registers, but there is obviously a reduction in code quality when e.g. half of 267the registers are taken away. 268 269Subzero handles this more elegantly. For each register, we keep a bitmask 270indicating which registers alias/conflict with it. For example, in x86, 271``ah``'s alias set is ``ah``, ``ax``, ``eax``, and ``rax`` (but not ``al``), and 272in ARM, ``s1``'s alias set is ``s1``, ``d0``, and ``q0``. Whenever we want to 273mark a register as being used or released, we do the same for all of its 274aliases. 275 276Before implementing this, we were a bit apprehensive about the compile-time 277performance impact. Instead of setting one bit in a bit vector or decrementing 278one counter, this generally needs to be done in a loop that iterates over all 279aliases. Fortunately, this seemed to make very little difference in 280performance, as the bulk of the cost ends up being in live range overlap 281computations, which are not affected by register aliasing. 282 283Repeat until convergence 284------------------------ 285 286Sometimes the linear-scan algorithm makes a register assignment only to later 287revoke it in favor of a higher-priority variable, but it turns out that a 288different initial register choice would not have been revoked. For relatively 289low compile-time cost, we can give those variables another chance. 290 291During register allocation, we keep track of the revoked variables and then do 292another round of register allocation targeted only to that set. We repeat until 293no new register assignments are made, which is usually just a handful of 294successively cheaper iterations. 295 296Another approach would be to repeat register allocation for *all* variables that 297haven't had a register assigned (rather than variables that got a register that 298was subsequently revoked), but our experience is that this greatly increases 299compile-time cost, with little or no code quality gain. 300 301Delayed Phi lowering 302-------------------- 303 304The linear-scan algorithm works for phi instructions as well as regular 305instructions, but it is tempting to lower phi instructions into non-SSA 306assignments before register allocation, so that register allocation need only 307happen once. 308 309Unfortunately, simple phi lowering imposes an arbitrary ordering on the 310resulting assignments that can cause artificial overlap/interference between 311lowered assignments, and can lead to worse register allocation decisions. As a 312simple example, consider these two phi instructions which are semantically 313unordered:: 314 315 A = phi(B) // B's live range ends here 316 C = phi(D) // D's live range ends here 317 318A straightforward lowering might yield:: 319 320 A = B // end of B's live range 321 C = D // end of D's live range 322 323The potential problem here is that A and D's live ranges overlap, and so they 324are prevented from having the same register. Swapping the order of lowered 325assignments fixes that (but then B and C would overlap), but we can't really 326know which is better until after register allocation. 327 328Subzero deals with this by doing the main register allocation before phi 329lowering, followed by phi lowering, and finally a special register allocation 330pass limited to the new lowered assignments. 331 332Phi lowering considers the phi operands separately for each predecessor edge, 333and starts by finding a topological sort of the Phi instructions, such that 334assignments can be executed in that order without violating dependencies on 335registers or stack locations. If a topological sort is not possible due to a 336cycle, the cycle is broken by introducing a temporary, e.g. ``A=B;B=C;C=A`` can 337become ``T=A;A=B;B=C;C=T``. The topological order is tuned to favor freeing up 338registers early to reduce register pressure. 339 340It then lowers the linearized assignments into machine instructions (which may 341require extra physical registers e.g. to copy from one stack location to 342another), and finally runs the register allocator limited to those instructions. 343 344In rare cases, the register allocator may fail on some *must-have-register* 345variable, because register pressure is too high to satisfy requirements arising 346from cycle-breaking temporaries and registers required for stack-to-stack 347copies. In this case, we have to find a register with no active uses within the 348variable's live range, and actively spill/restore that register around the live 349range. This makes the code quality suffer and may be slow to implement 350depending on compiler data structures, but in practice we find the situation to 351be vanishingly rare and so not really worth optimizing. 352 353Local live range splitting 354-------------------------- 355 356The basic linear-scan algorithm has an "all-or-nothing" policy: a variable gets 357a register for its entire live range, or not at all. This is unfortunate when a 358variable has many uses close together, but ultimately a large enough live range 359to prevent register assignment. Another bad example is on x86 where all vector 360and floating-point registers (the ``xmm`` registers) are killed by call 361instructions, per the x86 call ABI, so such variables are completely prevented 362from having a register when their live ranges contain a call instruction. 363 364The general solution here is some policy for splitting live ranges. A variable 365can be split into multiple copies and each can be register-allocated separately. 366The complication comes in finding a sane policy for where and when to split 367variables such that complexity doesn't explode, and how to join the different 368values at merge points. 369 370Subzero implements aggressive block-local splitting of variables. Each basic 371block is handled separately and independently. Within the block, we maintain a 372table ``T`` that maps each variable ``V`` to its split version ``T[V]``, with 373every variable ``V``'s initial state set (implicitly) as ``T[V]=V``. For each 374instruction in the block, and for each *may-have-register* variable ``V`` in the 375instruction, we do the following: 376 377- Replace all uses of ``V`` in the instruction with ``T[V]``. 378 379- Create a new split variable ``V'``. 380 381- Add a new assignment ``V'=T[V]``, placing it adjacent to (either immediately 382 before or immediately after) the current instruction. 383 384- Update ``T[V]`` to be ``V'``. 385 386This leads to a chain of copies of ``V`` through the block, linked by assignment 387instructions. The live ranges of these copies are usually much smaller, and 388more likely to get registers. In fact, because of the preference mechanism 389described above, they are likely to get the same register whenever possible. 390 391One obvious question comes up: won't this proliferation of new variables cause 392an explosion in the running time of liveness analysis and register allocation? 393As it turns out, not really. 394 395First, for register allocation, the cost tends to be dominated by live range 396overlap computations, whose cost is roughly proportional to the size of the live 397range. All the new variable copies' live ranges sum up to the original 398variable's live range, so the cost isn't vastly greater. 399 400Second, for liveness analysis, the cost is dominated by merging bit vectors 401corresponding to the set of variables that have multi-block liveness. All the 402new copies are guaranteed to be single-block, so the main additional cost is 403that of processing the new assignments. 404 405There's one other key issue here. The original variable and all of its copies 406need to be "linked", in the sense that all of these variables that get a stack 407slot (because they didn't get a register) are guaranteed to have the same stack 408slot. This way, we can avoid generating any code related to ``V'=V`` when 409neither gets a register. In addition, we can elide instructions that write a 410value to a stack slot that is linked to another stack slot, because that is 411guaranteed to be just rewriting the same value to the stack. 412