# XLA GPU Backend

<!--* freshness: { owner: "cheshire" reviewed: "2022-08-04" } *-->

## Compile time

At compile time,
[`GpuCompiler`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/service/gpu/gpu_compiler.h)
generates
[`GpuExecutable`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/service/gpu/gpu_executable.h),
whose `ExecuteOnStream` interface will be called by the XLA service at runtime.
The figure below shows the workflow of `GpuCompiler`.

```dot
strict digraph {
  compound=true;
  rankdir=LR
  graph [autosize=false, size="7!,7!", resolution=72];

  {
    rank = same
    unopt_hlo [id=googlegreen shape=oval label="Unoptimized\nHLO"]
    hlo_passes [id=googleblue shape=box label="HLO passes"]
    opt_hlo [id=googlegreen shape=oval label="Optimized and\nCanonicalized HLO"]
  }

  {
    rank=same
    buffer_assigner [id=googleblue shape=box label=BufferAssigner]
    buffer_assignment [id=googlegreen shape=oval label=BufferAssignment]
    lmhlo [id=googlegreen shape=oval label=LMHLO]
  }
  ir_emitter [id=googleblue shape=box label=IREmitter]
  gpu_ir [id=googlegreen shape=oval label="LLVM IR\n(GPU)"]
  llvm_gpu [id=googleblue shape=box label="LLVM JIT\n(GPU)"]

  subgraph cluster_gpu_executable {
    label="GpuExecutable"
    ptx [id=googlegreen shape=oval label=PTX]

    subgraph cluster_thunk_sequence {
      label="ThunkSequence"
      thunk0 [id=googlegreen shape=oval label=Thunk]
      thunk1 [id=googlegreen shape=oval label=Thunk]
    }
  }

  unopt_hlo -> hlo_passes -> opt_hlo -> { lmhlo buffer_assigner }
  buffer_assigner -> buffer_assignment -> lmhlo -> ir_emitter
  ir_emitter -> gpu_ir -> llvm_gpu -> ptx
  ir_emitter -> { thunk0 thunk1 }
}
```

<center><img style="width:25%" src="./images/gpu_backend_chart.svg"></img></center>

### Optimization

`GpuCompiler` runs a pipeline of target-independent and target-dependent
optimizations on the input HLO graph. For example, it folds
[`Transpose`](https://www.tensorflow.org/xla/operation_semantics#transpose) into
[`Dot`](https://www.tensorflow.org/xla/operation_semantics#dot) in certain
situations so that the `Transpose` instruction can be elided in a cuBLAS gemm
call.

### Canonicalization

After HLO optimizations, `GpuCompiler` runs canonicalization transformations to
ensure that `IrEmitter` can emit valid IR. Canonicalization makes later IR
emission easier, because `IrEmitter` currently works on one HLO instruction at a
time without a global view of the entire graph.

### Buffer Analysis

The buffer assignment pass assigns a buffer, if necessary, to store the result
of each HLO instruction. Actual buffer allocation happens at runtime. Therefore,
at compile time, the buffer analysis assigns `BufferAllocation`s, which contain
the metadata (such as the index and shape of the buffer) that `GpuExecutable`
needs to allocate and deallocate buffers.
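The real `BufferAssignment` and `BufferAllocation` classes live in the XLA
service code linked above; the following self-contained C++ sketch uses made-up
names (`AllocationSlice`, `SimpleBufferAssignment`, `ResolveSlice`) purely to
illustrate the split: buffer analysis records metadata at compile time, and
addresses are resolved only at run time.

```c++
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Compile-time description of a buffer: no device memory is owned here, only
// enough metadata for the executable to find the value later.
struct AllocationSlice {
  int64_t allocation_index;  // which allocation the value lives in
  int64_t offset_bytes;      // offset of the value within that allocation
  int64_t size_bytes;        // size of the value
};

// Result of a simplified buffer analysis: one slice per HLO value, plus the
// total size of each allocation the executable must create at run time.
struct SimpleBufferAssignment {
  std::unordered_map<std::string, AllocationSlice> slice_for_value;
  std::vector<int64_t> allocation_sizes;
};

// At run time, allocation indices are turned into base pointers (e.g. device
// memory obtained through StreamExecutor) and the recorded offsets are added.
void* ResolveSlice(const AllocationSlice& slice,
                   const std::vector<void*>& allocation_bases) {
  char* base = static_cast<char*>(allocation_bases[slice.allocation_index]);
  return base + slice.offset_bytes;
}

int main() {
  // Describe three hypothetical f32[256] values (256 * 4 bytes each), one
  // allocation per value for simplicity.
  SimpleBufferAssignment assignment;
  assignment.allocation_sizes = {1024, 1024, 1024};
  assignment.slice_for_value = {
      {"Param0", {0, 0, 1024}},
      {"Param1", {1, 0, 1024}},
      {"Add", {2, 0, 1024}},
  };

  // Host memory stands in for the device allocations made at run time.
  std::vector<std::vector<char>> storage;
  storage.reserve(assignment.allocation_sizes.size());
  std::vector<void*> bases;
  for (int64_t size : assignment.allocation_sizes) {
    storage.emplace_back(size);
    bases.push_back(storage.back().data());
  }
  void* add_result = ResolveSlice(assignment.slice_for_value.at("Add"), bases);
  return add_result != nullptr ? 0 : 1;
}
```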
### LMHLO Conversion

`GpuCompiler` takes the optimized HLO and the `BufferAssignment`, and converts
them to the MLIR dialect
[`LMHLO`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/mlir_hlo/include/mlir-hlo/Dialect/lhlo/IR/lhlo_ops.td).

The `LMHLO` dialect is a graph consisting of `LMHLO` ops. `LMHLO` ops are
buffer-based and sequentially ordered. The sequential order reflects the
execution order.

In `LMHLO`, direct operand-user information is stripped away, as each op is
connected only to its buffers, not to the ops that generate those buffers.

Notice that some `LMHLO` ops, e.g. `lmhlo.fusion` or `lmhlo.reduce`, contain
[`MHLO`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/mlir_hlo/include/mlir-hlo/Dialect/mhlo/IR/hlo_ops.td)-based
regions. These regions are tensor-based `MHLO` because the ops in them have no
buffers associated with them.

The code that converts XLA HLO to `LMHLO` is
[here](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/mlir/xla/transforms/mhlo_to_lhlo_with_xla.h).

Currently, lowering those `MHLO` regions takes a detour:

*   First, the `MHLO` regions get converted back to XLA HLO graphs.
*   Then the converted XLA HLO graphs are handled by
    [`FusedIrEmitter`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/service/llvm_ir/fused_ir_emitter.h)
    and
    [`ElementalIrEmitter`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/service/elemental_ir_emitter.h).

### IR Emission

`IrEmitter` emits CUDA kernels in LLVM IR to implement most `LMHLO` operations
in the input graph. `GpuCompiler` then compiles the emitted LLVM IR to PTX,
using LLVM as a JIT compiler. `IrEmitter` does not need to emit IR for some
instructions. For example, `GpuCompiler` implements certain `Dot` instructions
as cuBLAS gemms, which do not require any customized kernels.

`IrEmitter` has two subclasses:

*   `IrEmitterUnnested`, which emits code for all `LMHLO` instructions, and
*   `IrEmitterNested`, which handles instructions in nested computations (e.g.
    the scalar computations in `Map` and `Reduce`).

`IrEmitterUnnested` emits zero or more global functions for each `LMHLO`
instruction. In contrast, `IrEmitterNested` emits a device function for each HLO
instruction. These device functions, if small, are likely to be inlined into
kernels.

### Thunk Building

Besides emitting LLVM IR, `IrEmitter` also generates a sequence of
[thunks](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/service/gpu/thunk.h).
Each thunk contains the metadata `GpuExecutable` needs to invoke an HLO
instruction at runtime. For HLO instructions implemented as cuBLAS gemms,
`IrEmitter` generates `GemmThunk`s whose `ExecuteOnStream` interface calls a
cuBLAS gemm via
[StreamExecutor](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/stream_executor)
APIs. For instructions implemented as customized kernels, `IrEmitter` generates
`KernelThunk`s which contain the arguments necessary for launching the kernels.

<center><img style="width:25%" src="./images/kernel_thunk.svg"></img></center>

For instance, the figure above shows an HLO graph that performs an elementwise
add on two input arrays of shape `f32[256]`. Suppose the buffer analysis assigns
buffers 0, 1, and 2 to `Param0`, `Param1`, and `Add`, respectively. Also suppose
`IrEmitter` emits a kernel named "add" for the `Add` instruction. In this case,
`IrEmitter` generates

```
KernelThunk { input buffers = [0, 1], output buffer = [2], kernel name = "add" }
```

At runtime, `GpuExecutable` launches the kernel named "add" with the base
addresses of buffers 0, 1, and 2.
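The following C++ sketch is not the real `Thunk` class hierarchy (see the
header linked above for that); it is a minimal illustration, with simplified
names and signatures, of the kind of state a `KernelThunk` carries and how an
`ExecuteOnStream`-style call could turn buffer indices into kernel arguments.

```c++
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Stand-in for a StreamExecutor stream; the real APIs are much richer.
struct Stream {};

// Minimal thunk interface: each thunk knows how to run one HLO op on a stream,
// given the base addresses of all allocated buffers.
class Thunk {
 public:
  virtual ~Thunk() = default;
  virtual void ExecuteOnStream(Stream* stream,
                               const std::vector<void*>& buffer_bases) = 0;
};

// A kernel thunk records which buffers the kernel reads and writes plus the
// name of the emitted kernel; the actual kernel launch is left abstract here.
class KernelThunk : public Thunk {
 public:
  KernelThunk(std::vector<int> arg_buffer_indices, std::string kernel_name)
      : arg_buffer_indices_(std::move(arg_buffer_indices)),
        kernel_name_(std::move(kernel_name)) {}

  void ExecuteOnStream(Stream* /*stream*/,
                       const std::vector<void*>& buffer_bases) override {
    // Gather the base addresses of the thunk's buffers in order; a real
    // implementation would pass them to a CUDA kernel launch on the stream.
    std::vector<void*> kernel_args;
    for (int index : arg_buffer_indices_) {
      kernel_args.push_back(buffer_bases[index]);
    }
    std::printf("launching %s with %zu buffer arguments\n",
                kernel_name_.c_str(), kernel_args.size());
  }

 private:
  std::vector<int> arg_buffer_indices_;
  std::string kernel_name_;
};

int main() {
  // The "add" example from the text: buffers 0 and 1 are the inputs, buffer 2
  // is the output, and the emitted kernel is named "add".
  KernelThunk add_thunk({0, 1, 2}, "add");
  std::vector<void*> fake_bases(3, nullptr);  // placeholders for device memory
  Stream stream;
  add_thunk.ExecuteOnStream(&stream, fake_bases);
  return 0;
}
```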
### Constructing GpuExecutable

Finally, `GpuCompiler` constructs a `GpuExecutable` object that wraps the PTX
assembly and the thunk sequence generated by the `IrEmitter`.

## Runtime

At runtime, `GpuExecutable` does the following:

1.  Allocates all buffers assigned by the buffer analysis.
2.  Invokes all the thunks in its thunk sequence by calling their
    `ExecuteOnStream` interface. The base addresses of the allocated buffers are
    passed as an array of `void*` to the kernels and device functions emitted by
    the `IrEmitter`.
3.  Deallocates all buffers that do not live out of the computation.
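As a rough, hypothetical sketch of these three steps (host `malloc`/`free`
stands in for device allocation through StreamExecutor, and a plain callable
stands in for `Thunk::ExecuteOnStream`; none of this is the actual
`GpuExecutable` code):

```c++
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <functional>
#include <vector>

// A thunk reduced to a callable taking the buffer base addresses.
using ThunkFn = std::function<void(const std::vector<void*>&)>;

// Runs the three runtime steps described above. Returns the live-out buffers,
// whose ownership passes to the caller.
std::vector<void*> RunExecutable(const std::vector<std::size_t>& allocation_sizes,
                                 const std::vector<ThunkFn>& thunk_sequence,
                                 const std::vector<bool>& lives_out) {
  // 1. Allocate every buffer assigned by the buffer analysis.
  std::vector<void*> buffer_bases;
  for (std::size_t size : allocation_sizes) {
    buffer_bases.push_back(std::malloc(size));
  }

  // 2. Invoke every thunk in order, passing the buffer base addresses.
  for (const ThunkFn& thunk : thunk_sequence) {
    thunk(buffer_bases);
  }

  // 3. Deallocate the buffers that do not live out of the computation.
  std::vector<void*> live_out;
  for (std::size_t i = 0; i < buffer_bases.size(); ++i) {
    if (lives_out[i]) {
      live_out.push_back(buffer_bases[i]);
    } else {
      std::free(buffer_bases[i]);
    }
  }
  return live_out;
}

int main() {
  // Three hypothetical f32[256] buffers; only the result (buffer 2) lives out.
  std::vector<std::size_t> sizes = {1024, 1024, 1024};
  std::vector<bool> lives_out = {false, false, true};
  std::vector<ThunkFn> thunks = {[](const std::vector<void*>& bases) {
    std::printf("running a thunk over %zu buffers\n", bases.size());
  }};
  std::vector<void*> results = RunExecutable(sizes, thunks, lives_out);
  for (void* result : results) std::free(result);  // caller owns live-out buffers
  return 0;
}
```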