# XLA GPU Backend

<!--* freshness: { owner: "cheshire" reviewed: "2022-08-04" } *-->

## Compile time

At compile time,
[`GpuCompiler`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/service/gpu/gpu_compiler.h)
generates
[`GpuExecutable`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/service/gpu/gpu_executable.h),
whose `ExecuteOnStream` interface will be called by the XLA service at runtime.
The figure below shows the workflow of `GpuCompiler`.

```dot
strict digraph {
  compound=true;
  rankdir=LR
  graph [autosize=false, size="7!,7!", resolution=72];

  {
    rank = same
    unopt_hlo [id=googlegreen shape=oval label="Unoptimized\nHLO"]
    hlo_passes [id=googleblue shape=box label="HLO passes"]
    opt_hlo [id=googlegreen shape=oval label="Optimized and\nCanonicalized HLO"]
  }

  {
    rank=same
    buffer_assigner [id=googleblue shape=box label=BufferAssigner]
    buffer_assignment [id=googlegreen shape=oval label=BufferAssignment]
    lmhlo [id=googlegreen shape=oval label=LMHLO]
  }
  ir_emitter [id=googleblue shape=box label=IREmitter]
  gpu_ir [id=googlegreen shape=oval label="LLVM IR\n(GPU)"]
  llvm_gpu [id=googleblue shape=box label="LLVM JIT\n(GPU)"]

  subgraph cluster_gpu_executable {
    label="GpuExecutable"
    ptx [id=googlegreen shape=oval label=PTX]

    subgraph cluster_thunk_sequence {
      label="ThunkSequence"
      thunk0 [id=googlegreen shape=oval label=Thunk]
      thunk1 [id=googlegreen shape=oval label=Thunk]
    }
  }

  unopt_hlo -> hlo_passes -> opt_hlo -> { lmhlo buffer_assigner }
  buffer_assigner -> buffer_assignment -> lmhlo -> ir_emitter
  ir_emitter -> gpu_ir -> llvm_gpu -> ptx
  ir_emitter -> { thunk0 thunk1 }
}
```

<center><img style="width:25%" src="./images/gpu_backend_chart.svg"></img></center>

### Optimization

`GpuCompiler` runs a pipeline of target-independent and target-dependent
optimizations on the input HLO graph. For example, it folds
[`Transpose`](https://www.tensorflow.org/xla/operation_semantics#transpose) into
[`Dot`](https://www.tensorflow.org/xla/operation_semantics#dot) in certain
situations so that the `Transpose` instruction can be elided in a cuBLAS gemm
call.
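
The snippet below is a toy C++ illustration (not XLA code) of why the fold is
safe: a gemm that reads its left-hand operand with swapped indices, which is
what passing a transpose flag to cuBLAS amounts to, computes the same result as
materializing `transpose(A)` first.

```c++
#include <array>
#include <cassert>

int main() {
  constexpr int n = 2;
  std::array<std::array<float, n>, n> a = {{{1, 2}, {3, 4}}};
  std::array<std::array<float, n>, n> b = {{{5, 6}, {7, 8}}};
  std::array<std::array<float, n>, n> at{}, c1{}, c2{};

  // Path 1: materialize A^T in its own buffer, then multiply.
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j) at[i][j] = a[j][i];
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
      for (int k = 0; k < n; ++k) c1[i][j] += at[i][k] * b[k][j];

  // Path 2: no transpose buffer; the "gemm" reads A with swapped indices.
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
      for (int k = 0; k < n; ++k) c2[i][j] += a[k][i] * b[k][j];

  // Both paths agree, so the explicit Transpose instruction can be elided.
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j) assert(c1[i][j] == c2[i][j]);
  return 0;
}
```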

### Canonicalization

After HLO optimizations, `GpuCompiler` runs canonicalization transformations to
ensure `IrEmitter` can emit valid IR. Canonicalization makes later IR emission
easier, because `IrEmitter` currently works on one HLO instruction at a time
without a global view of the entire graph.

### Buffer Analysis

The buffer assignment pass assigns a buffer, if necessary, to store the result of
each HLO instruction. Actual buffer allocation happens at runtime. Therefore, at
compile time, the buffer analysis assigns `BufferAllocation`s, which contain
metadata (such as the index and shape of the buffer) for `GpuExecutable` to
allocate and deallocate buffers.
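
As a rough, self-contained sketch (the real classes live in
`buffer_assignment.h` and carry much more information), the output of buffer
analysis can be thought of as a table of allocation records that only describe
memory, without touching the device:

```c++
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for XLA's BufferAllocation metadata.
struct BufferAllocationInfo {
  int64_t index;       // stable index used by thunks to refer to this buffer
  int64_t size_bytes;  // bytes GpuExecutable must allocate at runtime
  std::string shape;   // e.g. "f32[256]"; kept as a string for illustration
};

// Compile time only records metadata; no device memory is allocated here.
std::vector<BufferAllocationInfo> AssignBuffers() {
  return {
      {/*index=*/0, /*size_bytes=*/256 * sizeof(float), "f32[256]"},  // Param0
      {/*index=*/1, /*size_bytes=*/256 * sizeof(float), "f32[256]"},  // Param1
      {/*index=*/2, /*size_bytes=*/256 * sizeof(float), "f32[256]"},  // Add
  };
}
```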

### LMHLO Conversion

`GpuCompiler` takes the optimized HLO and the `BufferAssignment`, and converts them
to the MLIR dialect
[`LMHLO`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/mlir_hlo/include/mlir-hlo/Dialect/lhlo/IR/lhlo_ops.td).

The `LMHLO` dialect is a graph consisting of `LMHLO` ops. `LMHLO` ops are
buffer-based and sequentially ordered. The sequential order reflects the
execution order.

In `LMHLO`, direct operand-user information is stripped away: each op is only
connected to its buffers, not to the ops that produce those buffers.
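
To make the contrast concrete, here is an illustrative C++ sketch (these structs
are not real XLA or MLIR types) of the difference between a tensor-based
representation, where ops point at the ops that produce their operands, and the
buffer-based `LMHLO` representation, where ops only point at buffer slots:

```c++
#include <string>
#include <vector>

// Tensor-based (HLO/MHLO-like): an op's operands are the producing ops, so the
// def-use graph is explicit.
struct TensorOp {
  std::string name;
  std::vector<const TensorOp*> operands;
};

// Buffer-based (LMHLO-like): an op only names the buffer slots it reads and
// writes; which op filled an input buffer is no longer recorded.
struct BufferOp {
  std::string name;
  std::vector<int> operand_buffers;  // indices into the buffer assignment
  std::vector<int> result_buffers;
};

int main() {
  // add = Add(param0, param1) in the tensor-based view...
  TensorOp param0{"param0", {}}, param1{"param1", {}};
  TensorOp add{"add", {&param0, &param1}};

  // ...and the same op in the buffer-based view, assuming buffers 0, 1, and 2.
  BufferOp lmhlo_add{"add", /*operand_buffers=*/{0, 1}, /*result_buffers=*/{2}};
  return 0;
}
```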

Notice that some `LMHLO` ops, e.g. `lmhlo.fusion` or `lmhlo.reduce`, contain
[`MHLO`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/mlir_hlo/include/mlir-hlo/Dialect/mhlo/IR/hlo_ops.td)-based
regions. These regions remain tensor-based `MHLO` because the ops inside them
have no buffers associated with them.

The code that converts XLA HLO to `LMHLO` is
[here](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/mlir/xla/transforms/mhlo_to_lhlo_with_xla.h).

Currently, lowering of those `MHLO` regions takes a twist:

*   First, `MHLO` regions get converted back to XLA HLO graphs.
*   Then the converted XLA HLO graphs are handled by
    [`FusedIrEmitter`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/service/llvm_ir/fused_ir_emitter.h)
    and
    [`ElementalIrEmitter`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/service/elemental_ir_emitter.h).

### IR Emission

`IrEmitter` emits CUDA kernels in LLVM IR to implement most `LMHLO` operations
in the input graph. `GpuCompiler` then compiles the emitted LLVM IR to PTX using
LLVM as a JIT compiler. `IrEmitter` does not need to emit IR for some
instructions. For example, `GpuCompiler` implements certain `Dot` instructions as
cuBLAS gemms, which do not require any customized kernels.

`IrEmitter` has two subclasses:

*   `IrEmitterUnnested`, which emits code for all `LMHLO` instructions, and
*   `IrEmitterNested`, which handles instructions in nested computations (e.g.
    those scalar computations in `Map` and `Reduce`).

`IrEmitterUnnested` emits zero or more global functions for each `LMHLO`
instruction. In contrast, `IrEmitterNested` emits a device function for each HLO
instruction. These device functions, if small, are likely to be inlined into
kernels.
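
As a conceptual analogy in plain C++ (the real output is LLVM IR compiled to
PTX, and the names below are made up), the unnested emitter produces a kernel
entry point per op, while the nested emitter produces a small callee for the
op's inner computation that the optimizer can later inline:

```c++
// "Nested" emission: the scalar body of, say, a Map instruction becomes a
// small device-side function. (Hypothetical body: y = 2x + 1.)
inline float map_body(float x) { return 2.0f * x + 1.0f; }

// "Unnested" emission: one kernel entry point for the whole Map op. In the real
// generated kernel each GPU thread handles a slice of the loop below; here it
// is written as a plain loop for illustration.
void map_kernel(const float* in, float* out, int n) {
  for (int i = 0; i < n; ++i) {
    out[i] = map_body(in[i]);  // likely inlined into the kernel
  }
}
```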

### Thunk Building

Besides emitting LLVM IR, `IrEmitter` also generates a sequence of
[thunks](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/service/gpu/thunk.h).
Each thunk contains metadata for `GpuExecutable` to invoke an HLO instruction at
runtime. For HLO instructions implemented as cuBLAS gemms, `IrEmitter` generates
`GemmThunk`s whose `ExecuteOnStream` interface calls a cuBLAS gemm via
[StreamExecutor](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/stream_executor)
APIs. For instructions implemented as customized kernels, `IrEmitter` generates
`KernelThunk`s which contain the necessary arguments for launching kernels.

<center><img style="width:25%" src="./images/kernel_thunk.svg"></img></center>

For instance, the figure above shows an HLO graph that performs an elementwise
add on two input arrays of shape `f32[256]`. Suppose the buffer analysis assigns
buffers 0, 1, and 2 to `Param0`, `Param1`, and `Add`, respectively. Also suppose
`IrEmitter` emits a kernel named "add" for the `Add` instruction. In this case,
`IrEmitter` generates

```
KernelThunk { input buffers = [0, 1], output buffer = [2], kernel name = "add" }
```

At runtime, `GpuExecutable` launches the kernel named "add" with the base
addresses of buffers 0, 1, and 2.
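
A simplified sketch of what this thunk carries and how it is consumed (the real
`KernelThunk` and launch path go through StreamExecutor; the types below are
illustrative only):

```c++
#include <string>
#include <vector>

// Illustrative stand-in for the metadata a KernelThunk records at compile time.
struct KernelThunkInfo {
  std::vector<int> input_buffers;   // buffer indices, e.g. {0, 1}
  std::vector<int> output_buffers;  // buffer indices, e.g. {2}
  std::string kernel_name;          // e.g. "add"
};

// At runtime, the thunk resolves buffer indices to the base addresses of the
// allocations made by GpuExecutable and passes them to the kernel as void*s.
std::vector<void*> PackKernelArguments(const KernelThunkInfo& thunk,
                                       const std::vector<void*>& buffer_bases) {
  std::vector<void*> args;
  for (int index : thunk.input_buffers) args.push_back(buffer_bases[index]);
  for (int index : thunk.output_buffers) args.push_back(buffer_bases[index]);
  return args;  // handed to the "add" kernel launch, e.g. via StreamExecutor
}
```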

### Constructing GpuExecutable

Finally, `GpuCompiler` constructs a `GpuExecutable` object that wraps the PTX
assembly and the thunk sequence generated by the `IrEmitter`.

## Runtime

At runtime, `GpuExecutable` does the following (a simplified sketch follows the
list):

1.  Allocates all buffers assigned by the buffer analysis.
2.  Invokes all the thunks in its thunk sequence by calling their
    `ExecuteOnStream` interface. The base addresses of the allocated buffers are
    passed as an array of `void*` to the kernels and device functions emitted by
    the `IrEmitter`.
3.  Deallocates all buffers that do not live out.
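
A self-contained sketch of that loop, using illustrative stand-in types rather
than the real XLA classes (`Thunk`, `BufferAllocation`, and StreamExecutor carry
much more state than this):

```c++
#include <cstdlib>
#include <vector>

// Illustrative stand-ins for a buffer allocation record and a thunk.
struct BufferAllocationInfo {
  long long size_bytes;
  bool lives_out;  // true if the buffer holds a result returned to the caller
};

struct ThunkInfo {
  std::vector<int> buffer_indices;  // buffers this thunk reads and writes
  // The real Thunk::ExecuteOnStream launches a kernel or a library call (for
  // example, a cuBLAS gemm) on the stream; here we only model the signature.
  void (*execute)(const std::vector<void*>& buffer_bases);
};

void Execute(const std::vector<BufferAllocationInfo>& allocations,
             const std::vector<ThunkInfo>& thunk_sequence) {
  // 1. Allocate every buffer assigned by the buffer analysis.
  std::vector<void*> buffer_bases;
  for (const auto& alloc : allocations)
    buffer_bases.push_back(std::malloc(alloc.size_bytes));  // device memory in reality

  // 2. Run the thunks in order, handing each the buffer base addresses.
  for (const auto& thunk : thunk_sequence) thunk.execute(buffer_bases);

  // 3. Free the buffers that do not live out of the computation.
  for (size_t i = 0; i < allocations.size(); ++i)
    if (!allocations[i].lives_out) std::free(buffer_bases[i]);
}
```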