Design of the Subzero fast code generator
=========================================

Introduction
------------

The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes
compiler technology based on `LLVM <http://llvm.org/>`_. The developer uses the
PNaCl toolchain to compile their application to architecture-neutral PNaCl
bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as
possible. The ``.pexe`` file is downloaded to the user's browser where the
PNaCl translator (a component of Chrome) compiles the ``.pexe`` file to
`sandboxed
<https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_
native code. The translator uses architecture-specific optimizations as much as
practical to generate good native code.

The native code can be cached by the browser to avoid repeating translation on
future page loads. However, first-time user experience is hampered by long
translation times. The LLVM-based PNaCl translator is quite slow, even when
using ``-O0`` to minimize optimizations, so delays are especially noticeable on
slow browser platforms such as ARM-based Chromebooks.

Translator slowness can be mitigated or hidden in a number of ways.

- Parallel translation. However, the slow machines where this matters most,
  e.g. ARM-based Chromebooks, are likely to have fewer cores to parallelize
  across, and are likely to have less memory available for multiple translation
  threads to use.

- Streaming translation, i.e. starting translation as soon as the download
  starts. This doesn't help much when translation speed is 10× slower than
  download speed, or when the ``.pexe`` file is already cached but the
  translated binary was flushed from the cache.

- Arrange the web page such that translation is done in parallel with
  downloading large assets.

- Arrange the web page to distract the user with `cat videos
  <https://www.youtube.com/watch?v=tLt5rBfNucc>`_ while translation is in
  progress.

Or, improve translator performance to something more reasonable.

This document describes Subzero's attempt to improve translation speed by an
order of magnitude while rivaling LLVM's code quality. Subzero does this
through minimal IR layering, lean data structures and passes, and a careful
selection of fast optimization passes. It has two optimization recipes: full
optimizations (``O2``) and minimal optimizations (``Om1``). The recipes are the
following (described in more detail below):

+----------------------------------------+-----------------------------+
| O2 recipe                              | Om1 recipe                  |
+========================================+=============================+
| Parse .pexe file                       | Parse .pexe file            |
+----------------------------------------+-----------------------------+
| Loop nest analysis                     |                             |
+----------------------------------------+-----------------------------+
| Local common subexpression elimination |                             |
+----------------------------------------+-----------------------------+
| Address mode inference                 |                             |
+----------------------------------------+-----------------------------+
| Read-modify-write (RMW) transform      |                             |
+----------------------------------------+-----------------------------+
| Basic liveness analysis                |                             |
+----------------------------------------+-----------------------------+
| Load optimization                      |                             |
+----------------------------------------+-----------------------------+
|                                        | Phi lowering (simple)       |
+----------------------------------------+-----------------------------+
| Target lowering                        | Target lowering             |
+----------------------------------------+-----------------------------+
| Full liveness analysis                 |                             |
+----------------------------------------+-----------------------------+
| Register allocation                    | Minimal register allocation |
+----------------------------------------+-----------------------------+
| Phi lowering (advanced)                |                             |
+----------------------------------------+-----------------------------+
| Post-phi register allocation           |                             |
+----------------------------------------+-----------------------------+
| Branch optimization                    |                             |
+----------------------------------------+-----------------------------+
| Code emission                          | Code emission               |
+----------------------------------------+-----------------------------+

Goals
=====

Translation speed
-----------------

We'd like to be able to translate a ``.pexe`` file as fast as download speed.
Any faster is in a sense wasted effort. Download speed varies greatly, but
we'll arbitrarily say 1 MB/sec. We'll pick the ARM A15 CPU as the example of a
slow machine. We observe a 3× single-thread performance difference between the
A15 and a high-end x86 Xeon E5-2690 based workstation, and aggressively assume
that a ``.pexe`` file could be compressed to 50% on the web server using gzip
transport compression, so we set the translation speed goal to 6 MB/sec on the
high-end Xeon workstation.

Currently, at the ``-O0`` level, the LLVM-based PNaCl translator translates at
⅒ the target rate. The ``-O2`` mode takes 3× as long as the ``-O0`` mode.

In other words, Subzero's goal is to improve over LLVM's translation speed by
10×.

Code quality
------------

Subzero's initial goal is to produce code that meets or exceeds LLVM's ``-O0``
code quality. The stretch goal is to approach LLVM ``-O2`` code quality. On
average, LLVM ``-O2`` performs twice as well as LLVM ``-O0``.

It's important to note that the quality of Subzero-generated code depends on
target-neutral optimizations and simplifications being run beforehand in the
developer environment. The ``.pexe`` file reflects these optimizations.
For example, Subzero assumes that the basic blocks are ordered topologically
where possible (which makes liveness analysis converge fastest), and Subzero
does not do any function inlining because it should already have been done.

Translator size
---------------

The current LLVM-based translator binary (``pnacl-llc``) is about 10 MB in
size. We think 1 MB is a more reasonable size, especially for a component that
is distributed to a billion Chrome users. Thus we target a 10× reduction in
binary size.

For development, Subzero can be built with all target architectures, and all
debugging and diagnostic options enabled. For a smaller translator, we restrict
to a single target architecture, and define a ``MINIMAL`` build where
unnecessary features are compiled out.

Subzero leverages some data structures from LLVM's ``ADT`` and ``Support``
include directories, which have little impact on translator size. It also uses
some of LLVM's bitcode decoding code (for binary-format ``.pexe`` files), again
with little size impact. In non-``MINIMAL`` builds, the translator size is much
larger due to including code for parsing text-format bitcode files and forming
LLVM IR.

Memory footprint
----------------

The current LLVM-based translator suffers from an issue in which some
function-specific data has to be retained in memory until all translation
completes, and therefore the memory footprint grows without bound. Large
``.pexe`` files can lead to the translator process holding hundreds of MB of
memory by the end. The translator runs in a separate process, so this memory
growth doesn't *directly* affect other processes, but it does dirty the
physical memory and contributes to a perception of bloat and sometimes a
reality of out-of-memory tab killing, especially noticeable on weaker systems.

Subzero should maintain a stable memory footprint throughout translation.
It's not really practical to set a specific limit, because there is not really
a practical limit on a single function's size, but the footprint should be
"reasonable" and be proportional to the largest input function size, not the
total ``.pexe`` file size. Simply put, Subzero should not have memory leaks or
inexorable memory growth. (We use ASAN builds to test for leaks.)

Multithreaded translation
-------------------------

It should be practical to translate different functions concurrently and see
good scalability. Some locking may be needed, such as when accessing output
buffers or constant pools, but that should be fairly minimal. In contrast, LLVM
was only designed for module-level parallelism, and as such, the PNaCl
translator internally splits a ``.pexe`` file into several modules for
concurrent translation. All output needs to be deterministic regardless of the
level of multithreading, i.e. functions and data should always be output in the
same order.

Target architectures
--------------------

Initial target architectures are x86-32, x86-64, ARM32, and MIPS32. Future
targets include ARM64 and MIPS64, though these targets currently lack NaCl
support, including a sandbox model and a validator.

The first implementation is for x86-32, because it was expected to be
particularly challenging, and thus more likely to draw out any design problems
early:

- There are a number of special cases, asymmetries, and warts in the x86
  instruction set.

- Complex addressing modes may be leveraged for better code quality.

- 64-bit integer operations have to be lowered into longer sequences of 32-bit
  operations.

- The paucity of physical registers may reveal code quality issues early in the
  design.

Detailed design
===============

Intermediate representation - ICE
---------------------------------

Subzero's IR is called ICE.
It is designed to be reasonably similar to LLVM's IR, which is reflected in the
``.pexe`` file's bitcode structure. It has a representation of global variables
and initializers, and a set of functions. Each function contains a list of
basic blocks, and each basic block contains a list of instructions.
Instructions that operate on stack and register variables do so using static
single assignment (SSA) form.

The ``.pexe`` file is translated one function at a time (or in parallel by
multiple translation threads). The recipe of optimization passes depends on the
specific target and optimization level, and is described in detail below.
Global variables (types and initializers) are simply and directly translated to
object code, without any meaningful attempts at optimization.

A function's control flow graph (CFG) is represented by the ``Ice::Cfg`` class.
Its key contents include:

- A list of ``CfgNode`` pointers, generally held in topological order.

- A list of ``Variable`` pointers corresponding to local variables used in the
  function plus compiler-generated temporaries.

A basic block is represented by the ``Ice::CfgNode`` class. Its key contents
include:

- A linear list of instructions, in the same style as LLVM. The last
  instruction of the list is always a terminator instruction: branch, switch,
  return, or unreachable.

- A list of Phi instructions, also in the same style as LLVM. They are held as
  a linear list for convenience, though per Phi semantics, they are executed
  "in parallel" without dependencies on each other.

- An unordered list of ``CfgNode`` pointers corresponding to incoming edges,
  and another list for outgoing edges.

- The node's unique, 0-based index into the CFG's node list.

An instruction is represented by the ``Ice::Inst`` class. Its key contents
include:

- A list of source operands.

- Its destination variable, if the instruction produces a result in an
  ``Ice::Variable``.

- A bitvector indicating which variables' live ranges this instruction ends.
  This is computed during liveness analysis.

Instruction kinds are divided into high-level ICE instructions and low-level
ICE instructions. High-level instructions consist of the PNaCl/LLVM bitcode
instruction kinds. Each target architecture implementation extends the
instruction space with its own set of low-level instructions. Generally,
low-level instructions correspond to individual machine instructions. The
high-level ICE instruction space includes a few additional instruction kinds
that are not part of LLVM but are generally useful (e.g., an Assignment
instruction), or are useful across targets (e.g., BundleLock and BundleUnlock
instructions for sandboxing).

Specifically, the high-level ICE instructions that derive from LLVM (but with
PNaCl ABI restrictions as documented in the `PNaCl Bitcode Reference Manual
<https://developer.chrome.com/native-client/reference/pnacl-bitcode-abi>`_) are
the following:

- Alloca: allocate data on the stack

- Arithmetic: binary operations of the form ``A = B op C``

- Br: conditional or unconditional branch

- Call: function call

- Cast: unary type-conversion operations

- ExtractElement: extract a scalar element from a vector-type value

- Fcmp: floating-point comparison

- Icmp: integer comparison

- Intrinsic: invoke a known intrinsic

- InsertElement: insert a scalar element into a vector-type value

- Load: load a value from memory

- Phi: implement the SSA phi node

- Ret: return from the function

- Select: essentially the C language operation of the form ``X = C ? Y : Z``

- Store: store a value into memory

- Switch: generalized branch to multiple possible locations

- Unreachable: indicate that this portion of the code is unreachable

The additional high-level ICE instructions are the following:

- Assign: a simple ``A=B`` assignment. This is useful for e.g. lowering Phi
  instructions to non-SSA assignments, before lowering to machine code.

- BundleLock, BundleUnlock. These are markers used for sandboxing, but are
  common across all targets and so they are elevated to the high-level
  instruction set.

- FakeDef, FakeUse, FakeKill. These are tools used to preserve consistency in
  liveness analysis, elevated to the high-level instruction set because they
  are used by all targets. They are described in more detail at the end of
  this section.

- JumpTable: this represents the result of switch optimization analysis, where
  some switch instructions may use jump tables instead of cascading
  compare/branches.

An operand is represented by the ``Ice::Operand`` class. In high-level ICE, an
operand is either an ``Ice::Constant`` or an ``Ice::Variable``. Constants
include scalar integer constants, scalar floating point constants, Undef (an
unspecified constant of a particular scalar or vector type), and symbol
constants (essentially addresses of globals). Note that the PNaCl ABI does not
include vector-type constants besides Undef, and as such, Subzero (so far) has
no reason to represent vector-type constants internally. A variable represents
a value allocated on the stack (though not including alloca-derived storage).
Among other things, a variable holds its unique, 0-based index into the CFG's
variable list.

Each target can extend the ``Constant`` and ``Variable`` classes for its own
needs. In addition, the ``Operand`` class may be extended, e.g.
to define an x86 ``MemOperand`` that encodes a base register, an index
register, an index register shift amount, and a constant offset.

Register allocation and liveness analysis are restricted to Variable operands.
Because of the importance of register allocation to code quality, and the
translation-time cost of liveness analysis, Variable operands get some special
treatment in ICE. Most notably, a frequent pattern in Subzero is to iterate
across all the Variables of an instruction. An instruction holds a list of
operands, but an operand may contain 0, 1, or more Variables. As such, the
``Operand`` class specially holds a list of the Variables contained within, for
quick access.

A Subzero transformation pass may work by deleting an existing instruction and
replacing it with zero or more new instructions. Instead of actually deleting
the existing instruction, we generally mark it as deleted and insert the new
instructions right after the deleted instruction. When printing the IR for
debugging, this is a big help because it makes it much clearer how the
non-deleted instructions came about.

Subzero has a few special instructions to help with liveness analysis
consistency.

- The FakeDef instruction gives a fake definition of some variable. For
  example, on x86-32, a divide instruction defines both ``%eax`` and ``%edx``
  but an ICE instruction can represent only one destination variable. This is
  similar for multiply instructions, and for function calls that return a
  64-bit integer result in the ``%edx:%eax`` pair. Also, using the
  ``xor %eax, %eax`` trick to set ``%eax`` to 0 requires an initial FakeDef of
  ``%eax``.

- The FakeUse instruction registers a use of a variable, typically to prevent
  an earlier assignment to that variable from being dead-code eliminated.
  For example, lowering an operation like ``x=cc?y:z`` may be done using x86's
  conditional move (cmov) instruction: ``mov z, x; cmov_cc y, x``. Without a
  FakeUse of ``x`` between the two instructions, the liveness analysis pass may
  dead-code eliminate the first instruction.

- The FakeKill instruction is added after a call instruction, and is a quick
  way of indicating that caller-save registers are invalidated.

Pexe parsing
------------

Subzero includes an integrated PNaCl bitcode parser for ``.pexe`` files. It
parses the ``.pexe`` file function by function, ultimately constructing an ICE
CFG for each function. After a function is parsed, its CFG is handed off to
the translation phase. The bitcode parser also parses global initializer data
and hands it off to be translated to data sections in the object file.

Subzero has another parsing strategy for testing/debugging. LLVM libraries can
be used to parse a module into LLVM IR (though very slowly relative to
Subzero's native parsing). Then we iterate across the LLVM IR and construct
high-level ICE, handing off each CFG to the translation phase.

Overview of lowering
--------------------

In general, translation goes like this:

- Parse the next function from the ``.pexe`` file and construct a CFG
  consisting of high-level ICE.

- Do analysis and transformation passes on the high-level ICE, as desired.

- Lower each high-level ICE instruction into a sequence of zero or more
  low-level ICE instructions. Each high-level instruction is generally lowered
  independently, though the target lowering is allowed to look ahead in the
  CfgNode's instruction list if desired.

- Do more analysis and transformation passes on the low-level ICE, as desired.

- Assemble the low-level CFG into an ELF object file (or alternatively, a
  textual assembly file that is later assembled by some external tool).

- Repeat for all functions, and also produce object code for data such as
  global initializers and internal constant pools.

Currently there are two optimization levels: ``O2`` and ``Om1``. For ``O2``,
the intention is to apply all available optimizations to get the best code
quality (though the initial code quality goal is measured against LLVM's
``O0`` code quality). For ``Om1``, the intention is to apply as few
optimizations as possible and produce code as quickly as possible, accepting
poor code quality. ``Om1`` is short for "O-minus-one", i.e. "worse than O0",
or in other words, "sub-zero".

High-level debuggability of generated code is so far not a design requirement.
Subzero doesn't really do transformations that would obfuscate debugging; the
main thing might be that register allocation (including stack slot coalescing
for stack-allocated variables whose live ranges don't overlap) may render a
variable's value unobtainable after its live range ends. This would not be an
issue for ``Om1`` since it doesn't register-allocate program-level variables,
nor does it coalesce stack slots. That said, fully supporting debuggability
would require a few additions:

- DWARF support would need to be added to Subzero's ELF file emitter. Subzero
  propagates global symbol names, local variable names, and function-internal
  label names that are present in the ``.pexe`` file. This would allow a
  debugger to map addresses back to symbols in the ``.pexe`` file.

- To map ``.pexe`` file symbols back to meaningful source-level symbol names,
  file names, line numbers, etc., Subzero would need to handle `LLVM bitcode
  metadata <http://llvm.org/docs/LangRef.html#metadata>`_ and ``llvm.dbg``
  `intrinsics <http://llvm.org/docs/LangRef.html#dbg-intrinsics>`_.

- The PNaCl toolchain explicitly strips all this information from the
  ``.pexe`` file, and so the toolchain would need to be modified to preserve
  it.

Our experience so far is that ``Om1`` translates twice as fast as ``O2``, but
produces code with one-third the code quality. ``Om1`` is good for testing and
debugging -- during translation, it tends to expose errors in the basic
lowering that might otherwise have been hidden by the register allocator or
other optimization passes. It also helps determine whether a code correctness
problem is a fundamental problem in the basic lowering, or an error in another
optimization pass.

The implementation of target lowering also controls the recipe of passes used
for ``Om1`` and ``O2`` translation. For example, address mode inference may
only be relevant for x86.

Lowering strategy
-----------------

The core of Subzero's lowering from high-level ICE to low-level ICE is to lower
each high-level instruction down to a sequence of low-level target-specific
instructions, in a largely context-free setting. That is, each high-level
instruction conceptually has a simple template expansion into low-level
instructions, and lowering can in theory be done in any order. This may sound
like a small effort, but quite a large number of templates may be needed
because of the number of PNaCl types and instruction variants. Furthermore,
there may be optimized templates, e.g. to take advantage of operator
commutativity (for example, ``x=x+1`` might allow a better lowering than
``x=1+x``). This is similar to other template-based approaches in fast code
generation or interpretation, though some decisions are deferred until after
some global analysis passes, mostly related to register allocation, stack slot
assignment, and specific choice of instruction variant and addressing mode.
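
To make the template idea concrete, here is a minimal, hypothetical Python
sketch of context-free template expansion (Subzero's actual implementation is
in C++ and operates on ICE instructions; the tuple encoding, function names,
and ``T{n}`` temporary naming here are illustrative assumptions):

```python
# Hypothetical sketch of context-free template lowering, not Subzero's
# actual implementation. A high-level three-operand instruction is
# expanded via a per-opcode template into two-operand target instructions,
# using a fresh temporary so each expansion is independent of context.

def lower_arith(dest, op, src0, src1, new_temp):
    """Expand 'dest = src0 op src1' for a two-operand target."""
    t = new_temp()  # temporary that must later receive a register
    return [
        ("mov", t, src0),   # t = src0
        (op,    t, src1),   # t op= src1
        ("mov", dest, t),   # dest = t
    ]

def lower_block(instrs):
    """Lower each high-level instruction independently, in order."""
    counter = [0]
    def new_temp():
        counter[0] += 1
        return f"T{counter[0]}"
    low = []
    for dest, op, src0, src1 in instrs:
        low.extend(lower_arith(dest, op, src0, src1, new_temp))
    return low

print(lower_block([("A", "add", "B", "C")]))
# [('mov', 'T1', 'B'), ('add', 'T1', 'C'), ('mov', 'A', 'T1')]
```

Each expansion touches only its own fresh temporary, which is what makes the
lowering order-independent in the way the paragraph above describes.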

The key idea for a lowering template is to produce valid low-level instructions
that are guaranteed to meet the address mode and other structural requirements
of the instruction set. For example, on x86, the source operand of an integer
store instruction must be an immediate or a physical register; a shift
instruction's shift amount must be an immediate or in register ``%cl``; a
function's integer return value is in ``%eax``; most x86 instructions are
two-operand, in contrast to the corresponding three-operand high-level
instructions; etc.

Because target lowering runs before register allocation, there is no way to
know whether a given ``Ice::Variable`` operand lives on the stack or in a
physical register. When the low-level instruction calls for a physical
register operand, the target lowering can create an infinite-weight Variable.
This tells the register allocator to assign infinite weight when making
decisions, effectively guaranteeing some physical register. Variables can also
be pre-colored to a specific physical register (``cl`` in the shift example
above), which also gives infinite weight.

To illustrate, consider a high-level arithmetic instruction on 32-bit integer
operands::

  A = B + C

X86 target lowering might produce the following::

  T.inf = B  // mov instruction
  T.inf += C // add instruction
  A = T.inf  // mov instruction

Here, ``T.inf`` is an infinite-weight temporary. As long as ``T.inf`` has a
physical register, the three lowered instructions are all encodable regardless
of whether ``B`` and ``C`` are physical registers, memory, or immediates, and
whether ``A`` is a physical register or in memory.

In this example, ``A`` must be a Variable and one may be tempted to simplify
the lowering sequence by setting ``A`` as infinite-weight and using::

  A = B  // mov instruction
  A += C // add instruction

This has two problems.
First, if the original instruction was actually ``A = B + A``, the result
would be incorrect. Second, assigning ``A`` a physical register applies
throughout ``A``'s entire live range. This is probably not what is intended,
and may ultimately lead to a failure to allocate a register for an
infinite-weight variable.

This style of lowering leads to many temporaries being generated, so in ``O2``
mode, we rely on the register allocator to clean things up. For example, in
the example above, if ``B`` ends up getting a physical register and its live
range ends at this instruction, the register allocator is likely to reuse that
register for ``T.inf``. This leads to ``T.inf=B`` being a redundant register
copy, which is removed as an emission-time peephole optimization.

O2 lowering
-----------

Currently, the ``O2`` lowering recipe is the following:

- Loop nest analysis

- Local common subexpression elimination

- Address mode inference

- Read-modify-write (RMW) transformation

- Basic liveness analysis

- Load optimization

- Target lowering

- Full liveness analysis

- Register allocation

- Phi instruction lowering (advanced)

- Post-phi lowering register allocation

- Branch optimization

These passes are described in more detail below.

Om1 lowering
------------

Currently, the ``Om1`` lowering recipe is the following:

- Phi instruction lowering (simple)

- Target lowering

- Register allocation (infinite-weight and pre-colored only)

Optimization passes
-------------------

Liveness analysis
^^^^^^^^^^^^^^^^^

Liveness analysis is a standard dataflow optimization, implemented as follows.
For each node (basic block), its live-out set is computed as the union of the
live-in sets of its successor nodes.
Then the node's instructions are processed in reverse order, updating the live
set, until the beginning of the node is reached, and the node's live-in set is
recorded. If this iteration has changed the node's live-in set, the node's
predecessors are marked for reprocessing. This continues until no more nodes
need reprocessing. If nodes are processed in reverse topological order, the
number of iterations over the CFG is generally equal to the maximum loop nest
depth.

To implement this, each node records its live-in and live-out sets,
initialized to the empty set. Each instruction records which of its Variables'
live ranges end in that instruction, initialized to the empty set. A side
effect of liveness analysis is dead instruction elimination. Each instruction
can be marked as tentatively dead, and after the algorithm converges, the
tentatively dead instructions are permanently deleted.

Optionally, after this liveness analysis completes, we can do live range
construction, in which we calculate the live range of each variable in terms
of instruction numbers. A live range is represented as a union of segments,
where the segment endpoints are instruction numbers. Instruction numbers are
required to be unique across the CFG, and monotonically increasing within a
basic block. As a union of segments, live ranges can contain "gaps" and are
therefore precise. Because of SSA properties, a variable's live range can
start at most once in a basic block, and can end at most once in a basic
block. Liveness analysis keeps track of which variable/instruction tuples
begin live ranges and which end live ranges, and combined with the live-in and
live-out sets, we can efficiently build up the live ranges of all variables
across all basic blocks.

A lot of care is taken to try to make liveness analysis fast and efficient.
Because of the lowering strategy, the number of variables is generally
proportional to the number of instructions, leading to an O(N^2) complexity
algorithm if implemented naively. To improve things based on sparsity, we note
that most variables are "local" and referenced in at most one basic block (in
contrast to the "global" variables with multi-block usage), and therefore
cannot be live across basic blocks. Therefore, the live-in and live-out sets,
typically represented as bit vectors, can be limited to the set of global
variables, and the intra-block liveness bit vector can be compacted to hold
the global variables plus the local variables for that block.

Register allocation
^^^^^^^^^^^^^^^^^^^

Subzero implements a simple linear-scan register allocator, based on the
allocator described by Hanspeter Mössenböck and Michael Pfeiffer in `Linear
Scan Register Allocation in the Context of SSA Form and Register Constraints
<ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_. This allocator has
several nice features:

- Live ranges are represented as unions of segments, as described above,
  rather than as a single start/end tuple.

- It allows pre-coloring of variables with specific physical registers.

- It applies equally well to pre-lowered Phi instructions.

The paper suggests an approach of aggressively coalescing variables across Phi
instructions (i.e., trying to force Phi source and destination variables to
have the same register assignment), but we reject that in favor of the more
natural preference mechanism described below.

We enhance the algorithm in the paper with the capability of automatic
inference of register preference, and with the capability of allowing
overlapping live ranges to safely share the same register in certain
circumstances.
If we are considering register allocation for variable ``A``, and ``A`` has a
single defining instruction ``A=B+C``, then the preferred register for ``A``,
if available, would be the register assigned to ``B`` or ``C``, if any,
provided that ``B``'s or ``C``'s live range does not overlap ``A``'s live
range. In this way we infer a good register preference for ``A``.

We allow overlapping live ranges to get the same register in certain cases.
Suppose a high-level instruction like::

  A = unary_op(B)

has been target-lowered like::

  T.inf = B
  A = unary_op(T.inf)

Further, assume that ``B``'s live range continues beyond this instruction
sequence, and that ``B`` has already been assigned some register. Normally, we
might want to infer ``B``'s register as a good candidate for ``T.inf``, but it
turns out that ``T.inf``'s and ``B``'s live ranges overlap, requiring them to
have different registers. But ``T.inf`` is just a read-only copy of ``B`` that
is guaranteed to be in a register, so in theory these overlapping live ranges
could safely have the same register. Our implementation allows this overlap as
long as ``T.inf`` is never modified within ``B``'s live range, and ``B`` is
never modified within ``T.inf``'s live range.

Subzero's register allocator can be run in three configurations.

- Normal mode. All Variables are considered for register allocation. It
  requires full liveness analysis and live range construction as a
  prerequisite. This is used by ``O2`` lowering.

- Minimal mode. Only infinite-weight or pre-colored Variables are considered.
  All other Variables are stack-allocated. It does not require liveness
  analysis; instead, it quickly scans the instructions and records the first
  definition and last use of each relevant Variable, using that to construct a
  single-segment live range.
  Although this includes most of the Variables, the live ranges are mostly
  simple, short, and rarely overlapping, which the register allocator handles
  efficiently. This is used by ``Om1`` lowering.

- Post-phi lowering mode. Advanced phi lowering is done after normal-mode
  register allocation, and may result in new infinite-weight Variables that
  need registers. One would like to just run something like minimal mode to
  assign registers to the new Variables while respecting existing register
  allocation decisions. However, it sometimes happens that there are no free
  registers. In this case, some register needs to be forcibly spilled to the
  stack and temporarily reassigned to the new Variable, and reloaded at the
  end of the new Variable's live range. The register must be one that has no
  explicit references during the Variable's live range. Since Subzero
  currently doesn't track def/use chains (though it does record the CfgNode
  where a Variable is defined), we just do a brute-force search across the
  CfgNode's instruction list for the instruction numbers of interest. This
  situation happens very rarely, so there's little point for now in improving
  its performance.

The basic linear-scan algorithm may, as it proceeds, rescind an early register
allocation decision, leaving that Variable to be stack-allocated. In some of
these cases, it turns out that the Variable could have been given a different
register without conflict, but by then it's too late. The literature
recognizes this situation and describes "second-chance bin-packing", which
Subzero can do. We can rerun the register allocator in a mode that respects
existing register allocation decisions, and sometimes it finds new
non-conflicting opportunities. In fact, we can repeatedly run the register
allocator until convergence.
Unfortunately, in the current implementation, these subsequent register
allocation passes end up being extremely expensive. This is because of the
treatment of the "unhandled pre-colored" Variable set, which is normally very
small but ends up being quite large on subsequent passes. Its performance can
probably be made acceptable with a better choice of data structures, but for
now this second-chance mechanism is disabled.

Future work is to implement LLVM's `Greedy
<http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_
register allocator as a replacement for the basic linear-scan algorithm, given
LLVM's experience with its improvement in code quality. (The blog post claims
that the Greedy allocator also improved maintainability because a lot of hacks
could be removed, but Subzero is probably not yet to that level of hacks, and
is less likely to see that particular benefit.)

Local common subexpression elimination
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The Local CSE implementation goes through each instruction and records a
portion of each ``Seen`` instruction in a hashset-like container. That portion
consists of the entire instruction except for any dest variable. That means
``A = X + Y`` and ``B = X + Y`` will be considered to be 'equal' for this
purpose. This allows us to detect common subexpressions.

Whenever a repetition is detected, the redundant variables are stored in a
container mapping the replacee to the replacement. In the case above, it would
be ``MAP[B] = A``, assuming ``B = X + Y`` comes after ``A = X + Y``.

At any point, if a variable that has an entry in the replacement table is
encountered, it is replaced with the variable it is mapped to. This ensures
that the redundant variables will not have any uses in the basic block,
allowing dead code elimination to clean up the redundant instruction.
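
The scheme above can be sketched in a few lines of Python. This is only an
illustration: the ``(dest, op, src1, src2)`` tuple encoding and the
``local_cse`` name are assumptions, and the sketch drops the redundant
instruction directly, whereas Subzero keeps it and lets dead code elimination
delete it later.

```python
# Minimal sketch of local CSE over one basic block, with instructions encoded
# as (dest, op, src1, src2) tuples. Hypothetical encoding, not Subzero's
# actual classes.
def local_cse(block):
    seen = {}      # (op, src1, src2), i.e. instruction minus dest -> first dest
    replace = {}   # redundant variable -> replacement variable
    out = []
    for dest, op, src1, src2 in block:
        # Rewrite operands through the replacement table first.
        src1 = replace.get(src1, src1)
        src2 = replace.get(src2, src2)
        key = (op, src1, src2)
        if key in seen:
            replace[dest] = seen[key]  # e.g. MAP[B] = A
            continue                   # simplification: drop instead of DCE
        seen[key] = dest
        out.append((dest, op, src1, src2))
    return out

block = [
    ("A", "add", "X", "Y"),
    ("B", "add", "X", "Y"),  # redundant: same as A's computation
    ("C", "add", "B", "Z"),  # use of B is rewritten to A
]
# local_cse(block) == [("A", "add", "X", "Y"), ("C", "add", "A", "Z")]
```

Note that a real implementation would also canonicalize commutative operands
so that ``X + Y`` and ``Y + X`` hash equally.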

With SSA, the information stored is never invalidated. However, non-SSA input
is supported with the ``-lcse=no-ssa`` option. This mode has to maintain some
extra dependency information to ensure proper invalidation on variable
assignment. It is not rigorously tested, because the pass is run at an early
stage where it is safe to assume variables have a single definition, and it
is not enabled by default because it bumps the compile-time overhead from 2%
to 6%.

Loop-invariant code motion
^^^^^^^^^^^^^^^^^^^^^^^^^^

This pass uses the loop analysis information to hoist invariant instructions
to loop pre-headers. A loop must have a single entry node (header), and that
node must have a single external predecessor, for this optimization to work as
currently implemented.

The pass works by iterating over all instructions in the loop until the set of
invariant instructions converges. In each iteration, any instruction not yet
marked invariant whose operands are all constants or variables already known
to be invariant is added to the result set. The destination variable of that
instruction is then added to the set of variables known to be invariant (which
is initialized with the constant args).

Improving the loop-analysis infrastructure is likely to have significant
impact on this optimization. Inserting an extra node to act as the pre-header
when the header has multiple incoming edges from outside could also be a good
idea. Expanding the initial invariant variable set to contain all variables
that do not have definitions inside the loop does not seem to improve
anything.

Short circuit evaluation
^^^^^^^^^^^^^^^^^^^^^^^^

Short circuit evaluation splits nodes and introduces early jumps when the
result of a logical operation can be determined early and there are no
observable side effects of skipping the rest of the computation.
The instructions are considered backwards from the end of the basic block.
When a definition of a variable involved in a conditional jump is found, an
extra jump can be inserted at that location, moving the rest of the
instructions in the node to a newly inserted node. Consider this example::

  __N:
    a = <something>
    Instruction 1 without side effect
    ... b = <something> ...
    Instruction N without side effect
    t1 = or a b
    br t1 __X __Y

which is transformed to::

  __N:
    a = <something>
    br a __X __N_ext

  __N_ext:
    Instruction 1 without side effect
    ... b = <something> ...
    Instruction N without side effect
    br b __X __Y

The logic for AND is analogous; the only difference is that the early jump is
taken on a ``false`` value instead of ``true``.

Global Variable Splitting
^^^^^^^^^^^^^^^^^^^^^^^^^

Global variable splitting (``-split-global-vars``) is run after register
allocation. It works on the variables that did not manage to get registers
(but are allowed to) and decomposes their live ranges into the individual
segments (each of which spans a single node at most). New variables are
created (but not yet used) with these smaller live ranges, and the register
allocator is run again. This is not expensive, as the old variables that
already had registers are now considered pre-colored.

The new variables that get registers replace their parent variables for their
portion of the parent's live range. A copy from the old variable to the new
one is introduced before the first use, and the reverse copy after the last
def in the live range.

Basic phi lowering
^^^^^^^^^^^^^^^^^^

The simplest phi lowering strategy works as follows (this is how LLVM ``-O0``
implements it). Consider this example::

  L1:
    ...
    br L3
  L2:
    ...
    br L3
  L3:
    A = phi [B, L1], [C, L2]
    X = phi [Y, L1], [Z, L2]

For each destination of a phi instruction, we can create a temporary and
insert the temporary's assignment at the end of the predecessor block::

  L1:
    ...
    A' = B
    X' = Y
    br L3
  L2:
    ...
    A' = C
    X' = Z
    br L3
  L3:
    A = A'
    X = X'

This transformation is very simple and reliable. It can be done before target
lowering and register allocation, and it easily avoids the classic lost-copy
and related problems. ``Om1`` lowering uses this strategy.

However, it has the disadvantage of initializing temporaries even for branches
not taken, though that could be mitigated by splitting non-critical edges and
putting assignments in the edge-split nodes. Another problem is that without
extra machinery, the assignments to ``A``, ``A'``, ``X``, and ``X'`` are given
a specific ordering even though phi semantics are that the assignments are
parallel or unordered. This sometimes imposes false live range overlaps and
leads to poorer register allocation.

Advanced phi lowering
^^^^^^^^^^^^^^^^^^^^^

``O2`` lowering defers phi lowering until after register allocation to avoid
the problem of false live range overlaps. It works as follows. We split each
incoming edge and move the (parallel) phi assignments into the split nodes.
We linearize each set of assignments by finding a safe, topological ordering
of the assignments, respecting register assignments as well. For example::

  A = B
  X = Y

Normally these assignments could be executed in either order, but if ``B`` and
``X`` are assigned the same physical register, we would want to use the above
ordering. Dependency cycles are broken by introducing a temporary.
For example::

  A = B
  B = A

Here, a temporary breaks the cycle::

  t = A
  A = B
  B = t

Finally, we use the existing target lowering to lower the assignments in this
basic block, and once that is done for all basic blocks, we run the post-phi
variant of register allocation on the edge-split basic blocks.

When computing a topological order, we try to schedule first the assignments
whose source has a physical register, and to schedule last the assignments
whose destination has a physical register. This helps reduce register
pressure.

X86 address mode inference
^^^^^^^^^^^^^^^^^^^^^^^^^^

We try to take advantage of the x86 addressing mode that includes a base
register, an index register, an index register scale amount, and an immediate
offset. We do this through simple pattern matching. Starting with a load or
store instruction where the address is a variable, we initialize the base
register to that variable, and look up the instruction where that variable is
defined. If that is an add instruction of two variables and the index
register hasn't been set, we replace the base and index register with those
two variables. If instead it is an add instruction of a variable and a
constant, we replace the base register with the variable and add the constant
to the immediate offset.

There are several more patterns that can be matched. This pattern matching
continues on the load or store instruction until no more matches are found.
Because a program typically has few load and store instructions (not to be
confused with instructions that manipulate stack variables), this address mode
inference pass is fast.
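
The matching loop can be sketched as follows. The names and the
``(op, a, b)`` instruction encoding are hypothetical, and the real pass also
matches more patterns (e.g. folding shifts into the scale) and has to respect
liveness constraints, all omitted here.

```python
# Illustrative sketch of x86 address mode inference. `defs` maps a variable
# name to its defining instruction, encoded as (op, operand_a, operand_b);
# constants are Python ints, variables are strings.
def infer_addr_mode(addr_var, defs):
    base, index, scale, offset = addr_var, None, 1, 0
    changed = True
    while changed:
        changed = False
        inst = defs.get(base)
        if inst is None:
            break
        op, a, b = inst
        if op == "add" and isinstance(b, int):
            # base = a + const: fold the constant into the immediate offset.
            base, offset = a, offset + b
            changed = True
        elif op == "add" and index is None:
            # base = a + b: use a as the base and b as the index register.
            base, index = a, b
            changed = True
    return base, index, scale, offset

# x = p + 8; y = x + i; load [y]  becomes  load [p + i*1 + 8]
defs = {"y": ("add", "x", "i"), "x": ("add", "p", 8)}
# infer_addr_mode("y", defs) == ("p", "i", 1, 8)
```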

X86 read-modify-write inference
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A reasonably common bitcode pattern is a non-atomic update of a memory
location::

  x = load addr
  y = add x, 1
  store y, addr

On x86, with good register allocation, the Subzero passes described above
generate code of only this quality::

  mov [%ebx], %eax
  add $1, %eax
  mov %eax, [%ebx]

However, x86 allows for this kind of code::

  add $1, [%ebx]

which requires fewer instructions, but perhaps more importantly, requires
fewer physical registers.

It's also important to note that this transformation only makes sense if the
store instruction ends ``x``'s live range.

Subzero's ``O2`` recipe includes an early pass to find read-modify-write (RMW)
opportunities via simple pattern matching. The only problem is that it is run
before liveness analysis, which is needed to determine whether ``x``'s live
range ends after the RMW. Since liveness analysis is one of the most
expensive passes, it's not attractive to run it an extra time just for RMW
analysis. Instead, we essentially generate both the RMW and the non-RMW
versions, and then during lowering, the RMW version deletes itself if it finds
``x`` still live.

X86 compare-branch inference
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the LLVM instruction set, the compare/branch pattern works like this::

  cond = icmp eq a, b
  br cond, target

The result of the icmp instruction is a single bit, and a conditional branch
tests that bit. By contrast, most target architectures use this pattern::

  cmp a, b      // implicitly sets various bits of the FLAGS register
  br eq, target // branch on a particular FLAGS bit

A naive lowering sequence conditionally sets ``cond`` to 0 or 1, then tests
``cond`` and conditionally branches.
Subzero has a pass that identifies boolean-based operations like this and
folds them into a single compare/branch-like operation. It is set up for more
than just cmp/br, though. Boolean producers include icmp (integer compare),
fcmp (floating-point compare), and trunc (integer truncation when the
destination has bool type). Boolean consumers include branch, select (the
ternary operator from the C language), and sign-extend and zero-extend when
the source has bool type.

Sandboxing
^^^^^^^^^^

Native Client's sandbox model uses software fault isolation (SFI) to provide
safety when running untrusted code in a browser or other environment. Subzero
implements Native Client's `sandboxing
<https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_
to enable Subzero-translated executables to be run inside Chrome. Subzero
also provides a fairly simple framework for investigating alternative sandbox
models or other restrictions on the sandbox model.

Sandboxing in Subzero is not actually implemented as a separate pass, but is
integrated into lowering and assembly.

- Indirect branches, including the ret instruction, are masked to a bundle
  boundary and bundle-locked.

- Call instructions are aligned to the end of the bundle so that the return
  address is bundle-aligned.

- Indirect branch targets, including function entry and targets in a switch
  statement jump table, are bundle-aligned.

- The intrinsic for reading the thread pointer is inlined appropriately.

- For x86-64, non-stack memory accesses are with respect to the reserved
  sandbox base register. We reduce the aggressiveness of address mode
  inference to leave room for the sandbox base register during lowering.
  There are no memory sandboxing changes for x86-32.
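
The indirect-branch masking in the first bullet reduces to simple address
arithmetic. A sketch, assuming Native Client's 32-byte x86 bundles; this
only illustrates the arithmetic, not Subzero's actual lowering:

```python
# NaCl-style bundle masking arithmetic. Clearing the low bits of a branch
# target forces it to a bundle boundary, so an indirect jump can only land
# on bundle-aligned code.
BUNDLE_SIZE = 32  # bytes; NaCl's x86 bundle size

def mask_to_bundle(target_addr):
    # The arithmetic behind a lowered `and $-32, %reg` before `jmp *%reg`.
    return target_addr & ~(BUNDLE_SIZE - 1)
```

For example, ``mask_to_bundle(0x1234567)`` yields ``0x1234560``, while an
already-aligned target is unchanged.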

Code emission
-------------

Subzero's integrated assembler is derived from Dart's `assembler code
<https://github.com/dart-lang/sdk/tree/master/runtime/vm>`_. There is a pass
that iterates through the low-level ICE instructions and invokes the relevant
assembler functions. Placeholders are added for later fixup of branch target
offsets. (Backward branches use short offsets if possible; forward branches
generally use long offsets unless it is an intra-block branch of "known" short
length.) The assembler emits into a staging buffer. Once emission into the
staging buffer for a function is complete, the data is emitted to the output
file as an ELF object file, and metadata such as relocations, symbol table,
and string table are accumulated for emission at the end. Global data
initializers are emitted similarly. A key point is that at this point, the
staging buffer can be deallocated, and only a minimum of data needs to be
held until the end.

As a debugging alternative, Subzero can emit textual assembly code which can
then be run through an external assembler. This is of course super slow, but
quite valuable when bringing up a new target.

As another debugging option, the staging buffer can be emitted as textual
assembly, primarily in the form of ".byte" lines. This allows the assembler
to be tested separately from the ELF-related code.

Memory management
-----------------

Where possible, we allocate from a ``CfgLocalAllocator`` which derives from
LLVM's ``BumpPtrAllocator``. This is an arena-style allocator where objects
allocated from the arena are never actually freed; instead, when the CFG
translation completes and the CFG is deleted, the entire arena memory is
reclaimed at once. This style of allocation works well in an environment like
a compiler where there are distinct phases with only easily-identifiable
objects living across phases.
It frees the developer from having to manage object deletion, and it amortizes
deletion costs across just a single arena deletion at the end of the phase.
Furthermore, it helps scalability by allocating entirely from thread-local
memory pools, and minimizing global locking of the heap.

Instructions are probably the most heavily allocated complex class in Subzero.
We represent an instruction list as an intrusive doubly linked list, allocate
all instructions from the ``CfgLocalAllocator``, and we make sure each
instruction subclass is basically `POD
<http://en.cppreference.com/w/cpp/concept/PODType>`_ (Plain Old Data) with a
trivial destructor. This way, when the CFG is finished, we don't need to
individually deallocate every instruction. We do the same for Variables,
which is probably the second most heavily allocated complex class.

There are some situations where passes need to use some `STL container class
<http://en.cppreference.com/w/cpp/container>`_. Subzero has a way of using
the ``CfgLocalAllocator`` as the container allocator if this is needed.

Multithreaded translation
-------------------------

Subzero is designed to be able to translate functions in parallel. With the
``-threads=N`` command-line option, there is a 3-stage producer-consumer
pipeline:

- A single thread parses the ``.pexe`` file and produces a sequence of work
  units. A work unit can be either a fully constructed CFG, or a set of
  global initializers. The work unit includes its sequence number denoting
  its parse order. Each work unit is added to the translation queue.

- There are N translation threads that draw work units from the translation
  queue and lower them into assembler buffers. Each assembler buffer is added
  to the emitter queue, tagged with its sequence number. The CFG and its
  ``CfgLocalAllocator`` are disposed of at this point.

- A single thread draws assembler buffers from the emitter queue and appends
  them to the output file. It uses the sequence numbers to reintegrate the
  assembler buffers according to the original parse order, such that output
  order is always deterministic.

This means that with ``-threads=N``, there are actually ``N+1`` spawned
threads, for a total of ``N+2`` execution threads, taking the parser and
emitter threads into account. For the special case of ``N=0``, execution is
entirely sequential -- the same thread parses, translates, and emits, one
function at a time. This is useful for performance measurements.

Ideally, we would like to get near-linear scalability as the number of
translation threads increases. We expect that ``-threads=1`` should be
slightly faster than ``-threads=0``, as the small amount of time spent
parsing and emitting is done largely in parallel with translation. With
perfect scalability, we would see ``-threads=N`` translating ``N`` times as
fast as ``-threads=1``, up until the point where parsing or emitting becomes
the bottleneck, or ``N+2`` exceeds the number of CPU cores. In reality,
memory performance would become a bottleneck and efficiency might peak at,
say, 75%.

Currently, parsing takes about 11% of total sequential time. If translation
ever becomes so fast and scalable that parsing is the bottleneck, it should
be possible to make parsing multithreaded as well.

Internally, all shared, mutable data is held in the GlobalContext object, and
access to each field is guarded by a mutex.

Security
--------

Subzero includes a number of security features in the generated code, as well
as in the Subzero translator itself, which run on top of the existing Native
Client sandbox as well as Chrome's OS-level sandbox.

Sandboxed translator
^^^^^^^^^^^^^^^^^^^^

When running inside the browser, the Subzero translator executes as
sandboxed, untrusted code that is initially checked by the validator, just
like the LLVM-based ``pnacl-llc`` translator. As such, the Subzero binary
should be no more or less secure than the translator it replaces, from the
point of view of the Chrome sandbox. That said, Subzero is much smaller than
``pnacl-llc`` and was designed from the start with security in mind, so one
expects fewer attacker opportunities here.

Fuzzing
^^^^^^^

We have started fuzz-testing the ``.pexe`` files input to Subzero, using a
combination of `afl-fuzz <http://lcamtuf.coredump.cx/afl/>`_, LLVM's
`libFuzzer <http://llvm.org/docs/LibFuzzer.html>`_, and custom tooling. The
purpose is to find and fix cases where Subzero crashes or otherwise
ungracefully fails on unexpected inputs, and to do so automatically over a
large range of unexpected inputs. By fixing bugs that arise from fuzz
testing, we reduce the possibility of an attacker exploiting these bugs.

Most of the problems found so far are ones most appropriately handled in the
parser. However, there have been a couple that have identified problems in
the lowering, or otherwise inappropriately triggered assertion failures and
fatal errors. We continue to dig into this area.

Future security work
^^^^^^^^^^^^^^^^^^^^

Subzero is well-positioned to explore other future security enhancements,
e.g.:

- Tightening the Native Client sandbox. ABI changes, such as the previous
  work on `hiding the sandbox base address
  <https://docs.google.com/document/d/1eskaI4353XdsJQFJLRnZzb_YIESQx4gNRzf31dqXVG8>`_
  in x86-64, are easy to experiment with in Subzero.

- Making the executable code section read-only.
  This would prevent a PNaCl application from inspecting its own binary and
  trying to find ROP gadgets even after code diversification has been
  performed. It may still be susceptible to `blind ROP
  <http://www.scs.stanford.edu/brop/bittau-brop.pdf>`_ attacks, but security
  is still improved overall.

- Instruction selection diversification. It may be possible to lower a given
  instruction in several largely equivalent ways, which gives more
  opportunities for code randomization.

Chrome integration
------------------

Currently Subzero is available in Chrome for the x86-32 architecture, but
under a flag. When the flag is enabled, Subzero is used when the `manifest
file
<https://developer.chrome.com/native-client/reference/nacl-manifest-format>`_
linking to the ``.pexe`` file specifies the ``O0`` optimization level.

The next step is to remove the flag, i.e. invoke Subzero as the only
translator for ``O0``-specified manifest files.

Ultimately, Subzero might produce code rivaling LLVM ``O2`` quality, in which
case Subzero could be used for all PNaCl translation.

Command line options
--------------------

Subzero has a number of command-line options for debugging and diagnostics.
Among the more interesting are the following.

- Using the ``-verbose`` flag, Subzero will dump the CFG, or produce other
  diagnostic output, with various levels of detail after each pass.
  Instruction numbers can be printed or suppressed. Deleted instructions can
  be printed or suppressed (they are retained in the instruction list, as
  discussed earlier, because they can help explain how lower-level
  instructions originated). Liveness information can be printed when
  available. Details of register allocation can be printed as register
  allocator decisions are made. And more.

- Running Subzero with any level of verbosity produces an enormous amount of
  output. When debugging a single function, verbose output can be suppressed
  except for a particular function. The ``-verbose-focus`` flag suppresses
  verbose output except for the specified function.

- Subzero has a ``-timing`` option that prints a breakdown of pass-level
  timing at exit. Timing markers can be placed in the Subzero source code to
  demarcate logical operations or passes of interest. Basic timing
  information plus call-stack type timing information is printed at the end.

- Along with ``-timing``, the user can instead get a report on the overall
  translation time for each function, to help focus on timing outliers.
  Also, ``-timing-focus`` limits the ``-timing`` reporting to a single
  function, instead of aggregating pass timing across all functions.

- The ``-szstats`` option reports various statistics on each function, such
  as stack frame size, static instruction count, etc. It may be helpful to
  track these stats over time as Subzero is improved, as an approximate
  measure of code quality.

- The flag ``-asm-verbose``, in conjunction with emitting textual assembly
  output, annotates the assembly output with register-focused liveness
  information. In particular, each basic block is annotated with which
  registers are live-in and live-out, and each instruction is annotated with
  which registers' and stack locations' live ranges end at that instruction.
  This is really useful when studying the generated code to find
  opportunities for code quality improvements.

Testing and debugging
---------------------

LLVM lit tests
^^^^^^^^^^^^^^

For basic testing, Subzero uses LLVM's `lit
<http://llvm.org/docs/CommandGuide/lit.html>`_ framework for running tests.
We have a suite of hundreds of small functions where we test for particular
assembly code patterns across different target architectures.

Cross tests
^^^^^^^^^^^

Unfortunately, the lit tests don't do a great job of precisely testing the
correctness of the output. Much better are the cross tests, which are
execution tests that compare Subzero-translated and ``pnacl-llc``-translated
bitcode across a wide variety of interesting inputs. Each cross test
consists of a set of C, C++, and/or low-level bitcode files. The C and C++
source files are compiled down to bitcode. The bitcode files are translated
by ``pnacl-llc`` and also by Subzero. Subzero mangles global symbol names
with a special prefix to avoid duplicate symbol errors. A driver program
invokes both versions on a large set of interesting inputs, and reports when
the Subzero and ``pnacl-llc`` results differ. Cross tests turn out to be an
excellent way of testing the basic lowering patterns, but they are less
useful for testing more global things like liveness analysis and register
allocation.

Bisection debugging
^^^^^^^^^^^^^^^^^^^

Sometimes with a new application, Subzero will end up producing incorrect
code that either crashes at runtime or otherwise produces the wrong results.
When this happens, we need to narrow it down to a single function (or small
set of functions) that yields incorrect behavior. For this, we have a
bisection debugging framework. Here, we initially translate the entire
application once with Subzero and once with ``pnacl-llc``. We then use
``objdump`` to selectively weaken symbols based on a list provided on the
command line. The two object files can then be linked together without link
errors, with the desired version of each method "winning". Then the binary
is tested, and bisection proceeds based on whether the binary produces
correct output.
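
The bisection loop itself amounts to a binary search over the set of
Subzero-translated functions. A sketch, with a hypothetical ``run_test``
callback that links the named functions from the Subzero object (and
everything else from ``pnacl-llc``) and returns True when the binary's output
is correct; the sketch assumes a single offending function, whereas the real
framework also copes with combinations:

```python
# Binary-search for the Subzero-translated function that makes the linked
# binary fail, assuming exactly one function is responsible.
def bisect(functions, run_test):
    assert not run_test(functions)  # the all-Subzero build must fail to start
    suspects = functions
    while len(suspects) > 1:
        half = suspects[: len(suspects) // 2]
        if not run_test(half):
            # This half alone still reproduces the failure; keep it.
            suspects = half
        else:
            # Otherwise the offender must be in the other half.
            suspects = suspects[len(suspects) // 2 :]
    return suspects

# Simulated run: the binary is correct iff "f7" is not Subzero-translated.
funcs = ["f%d" % i for i in range(10)]
# bisect(funcs, lambda fs: "f7" not in fs) == ["f7"]
```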

When the bisection completes, we are left with a minimal set of
Subzero-translated functions that cause the failure. Usually it is a single
function, though sometimes it might require a combination of several
functions to cause a failure; this may be due to an incorrect call ABI, for
example. However, Murphy's Law implies that the single failing function is
enormous and impractical to debug. In that case, we can restart the
bisection, explicitly ignoring the enormous function, and try to find another
candidate to debug. (Future work is to automate this to find all minimal
sets of functions, so that debugging can focus on the simplest example.)

Fuzz testing
^^^^^^^^^^^^

As described above, we try to find internal Subzero bugs using fuzz testing
techniques.

Sanitizers
^^^^^^^^^^

Subzero can be built with `AddressSanitizer
<http://clang.llvm.org/docs/AddressSanitizer.html>`_ (ASan) or
`ThreadSanitizer <http://clang.llvm.org/docs/ThreadSanitizer.html>`_ (TSan)
support. This is done using something as simple as ``make ASAN=1`` or ``make
TSAN=1``. So far, multithreading has been simple enough that TSan hasn't
found any bugs, but ASan has found at least one memory leak, which was
subsequently fixed. `UndefinedBehaviorSanitizer
<http://clang.llvm.org/docs/UsersManual.html#controlling-code-generation>`_
(UBSan) support is in progress. `Control flow integrity sanitization
<http://clang.llvm.org/docs/ControlFlowIntegrity.html>`_ is also under
consideration.

Current status
==============

Target architectures
--------------------

Subzero is currently more or less complete for the x86-32 target. It has
been refactored and extended to handle x86-64 as well, and that is mostly
complete at this point.

ARM32 work is in progress.
It currently lacks the testing level of x86, at least in part because
Subzero's register allocator needs modifications to handle ARM's aliasing of
floating point and vector registers. Specifically, a 64-bit register is
actually a gang of two consecutive and aligned 32-bit registers, and a
128-bit register is a gang of four consecutive and aligned 32-bit registers.
ARM64 work has not started; when it does, it will be native-only, since the
Native Client sandbox model, validator, and other tools have never been
defined for ARM64.

An external contributor is adding MIPS support, in large part by following
the ARM work.

Translator performance
----------------------

Single-threaded translation speed is currently about 5× the ``pnacl-llc``
translation speed. For a large ``.pexe`` file, the time breaks down as:

- 11% for parsing and initial IR building

- 4% for emitting to /dev/null

- 27% for liveness analysis (two liveness passes plus live range
  construction)

- 15% for linear-scan register allocation

- 9% for basic lowering

- 10% for advanced phi lowering

- ~11% for other minor analysis

- ~10% measurement overhead to acquire these numbers

Some improvements could undoubtedly be made, but it will be hard to increase
the speed to 10× that of ``pnacl-llc`` while keeping acceptable code quality.
With ``-Om1`` (lack of) optimization, we do actually achieve roughly 10×
``pnacl-llc`` translation speed, but code quality drops by a factor of 3.

Code quality
------------

Measured across 16 components of spec2k, Subzero's code quality is uniformly
better than ``pnacl-llc`` ``-O0`` code quality, and in many cases solidly
between ``pnacl-llc`` ``-O0`` and ``-O2``.
Translator size
---------------

When built in MINIMAL mode, the x86-64 native translator size for the x86-32
target is about 700 KB, not including the size of functions referenced in
dynamically-linked libraries. The sandboxed version of Subzero is a bit over 1
MB, and it is statically linked and also includes nop padding for bundling as
well as indirect branch masking.

Translator memory footprint
---------------------------

It's hard to draw firm conclusions about memory footprint, since the footprint
is at least proportional to the input function size, and there is no real limit
on the size of functions in the ``.pexe`` file.

That said, we looked at the memory footprint over time as Subzero translated
``pnacl-llc.pexe``, which is the largest ``.pexe`` file (7.2 MB) at our
disposal. One of LLVM's libraries that Subzero uses can report the current
malloc heap usage. With single-threaded translation, Subzero tends to hover
around 15 MB of memory usage. There are a couple of monstrous functions where
Subzero grows to around 100 MB, but then it drops back down after those
functions finish translating. In contrast, ``pnacl-llc`` grows larger and
larger throughout translation, reaching several hundred MB by the time it
completes.

It's a bit more interesting when we enable multithreaded translation. When
there are N translation threads, Subzero implements a policy that limits the
size of the translation queue to N entries -- if it is "full" when the parser
tries to add a new CFG, the parser blocks until one of the translation threads
removes a CFG. This means the number of in-memory CFGs can (and generally does)
reach 2*N+1, and so the memory footprint rises in proportion to the number of
threads.
Adding to the pressure is the observation that the monstrous functions
also take proportionally longer to translate, so there's a good chance that many
of the monstrous functions will be active at the same time with multithreaded
translation. As a result, for N=32, Subzero's memory footprint peaks at about
260 MB, but drops back down as the large functions finish translating.

If this peak memory size becomes a problem, it might be possible for the parser
to resequence the functions to try to spread out the larger functions, or to
throttle the translation queue to prevent too many in-flight large functions.
It may also be possible to throttle based on memory pressure signaling from
Chrome.

Translator scalability
----------------------

Currently scalability is "not very good". Multiple translation threads lead to
faster translation, but not to the degree desired. We haven't dug in to
investigate yet.

There are a few areas to investigate. First, there may be contention on the
constant pool, which all threads access, and which requires locked access even
for reading. This could be mitigated by keeping a CFG-local cache of the most
common constants.

Second, there may be contention on memory allocation. While almost all CFG
objects are allocated from the CFG-local allocator, some passes use temporary
STL containers that use the default allocator, which may require global locking.
This could be mitigated by switching these containers to the CFG-local
allocator.

Third, multithreading may make the default allocator strategy more expensive.
In a single-threaded environment, a pass will allocate its containers, run the
pass, and deallocate the containers. This results in stack-like allocation
behavior and makes the heap free list easier to manage, with less heap
fragmentation.
But when multithreading is added, the allocations and
deallocations become much less stack-like, making allocation and deallocation
operations individually more expensive. Again, this could be mitigated by
switching these containers to the CFG-local allocator.
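The CFG-local allocation idea running through the mitigations above can be
illustrated with a minimal arena sketch. This is our own sketch, not Subzero's
actual API (the ``CfgArena`` name and methods are hypothetical): every
temporary object a pass creates is owned by a per-CFG arena, and the whole
arena is released in one shot when the CFG finishes translating.

```python
# Minimal sketch of a CFG-local (arena) allocation scheme.
# Names are illustrative, not Subzero's actual classes.

class CfgArena:
    """Owns every object allocated while translating one CFG."""

    def __init__(self):
        self._objects = []

    def alloc(self, factory, *args):
        # Construct the object and record it as arena-owned.
        obj = factory(*args)
        self._objects.append(obj)
        return obj

    def reset(self):
        # Bulk deallocation: drop all arena-owned objects at once,
        # with no per-object frees interleaved across threads.
        count = len(self._objects)
        self._objects.clear()
        return count

def translate_function(arena):
    # A pass allocates its temporary containers from the CFG's arena
    # instead of the (globally locked) default allocator.
    live_ranges = arena.alloc(list)
    live_ranges.append((0, 10))
    worklist = arena.alloc(dict)
    worklist["entry"] = True
    return len(live_ranges) + len(worklist)

arena = CfgArena()
result = translate_function(arena)
freed = arena.reset()  # one bulk release when the CFG is done
```

The point of the design is that deallocation becomes a single bulk operation
per CFG, so translation threads rarely touch a shared global heap, restoring
the cheap stack-like behavior described above even under multithreading.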