IR3 NOTES
=========

Some notes about ir3, the compiler and machine-specific IR for the shader ISA introduced with adreno a3xx. The same shader ISA is present, with some small differences, in adreno a4xx.

Compared to the previous generation a2xx ISA (ir2), the a3xx ISA is a "simple" scalar instruction set. However, the compiler is responsible, in most cases, for scheduling the instructions. The hardware does not try to hide the shader core pipeline stages. For example, a common (cat2) ALU instruction takes four cycles, so a subsequent cat2 instruction which uses the result must have three intervening instructions (or nops). When operating on vec4's, the corresponding scalar instructions for the remaining three components can typically fill those slots. However, that leaves a lot of edge cases where things fall over, like:

::

  ADD TEMP[0], TEMP[1], TEMP[2]
  MUL TEMP[0], TEMP[1], TEMP[0].wzyx

Here, the second instruction needs the output of the first group of scalar instructions in the wrong order, resulting in not enough instruction spots between the ``add r0.w, r1.w, r2.w`` and ``mul r0.x, r1.x, r0.w``. This is why the original (old) compiler, which translated nearly literally from TGSI to ir3, had a strong tendency to fall over.

So the current compiler instead generates, in the frontend, a directed acyclic graph of instructions and basic blocks, which goes through various additional passes to eventually schedule and do register assignment.

For additional documentation about the hardware, see wiki: `a3xx ISA
<https://github.com/freedreno/freedreno/wiki/A3xx-shader-instruction-set-architecture>`_.

External Structure
------------------

``ir3_shader``
   A single vertex/fragment/etc shader from gallium perspective (i.e.
   maps to a single TGSI shader). It manages a set of shader variants
   which are generated on demand based on the shader key.

``ir3_shader_key``
   The configuration key that identifies a shader variant. Different
   shader variants are generated based on other GL state
   (two-sided-color, render-to-alpha, etc) or render stages
   (binning-pass vertex shader).

``ir3_shader_variant``
   The actual hw shader generated based on input TGSI and shader key.

``ir3_compiler``
   Compiler frontend which generates ir3 and runs the various backend
   stages to schedule and do register assignment.

The IR
------

The ir3 IR maps quite directly to the hardware, in that instruction opcodes map directly to hardware opcodes, and dst/src register(s) map directly to the hardware dst/src register(s). But there are a few extensions, in the form of meta_ instructions. Additionally, for normal (non-const, etc) src registers, the ``IR3_REG_SSA`` flag is set and ``reg->instr`` points to the source instruction which produced that value. So, for example, the following TGSI shader:

::

  VERT
  DCL IN[0]
  DCL IN[1]
  DCL OUT[0], POSITION
  DCL TEMP[0], LOCAL
    1: DP3 TEMP[0].x, IN[0].xyzz, IN[1].xyzz
    2: MOV OUT[0], TEMP[0].xxxx
    3: END

eventually generates:

.. graphviz::

   digraph G {
     rankdir=RL;
     nodesep=0.25;
     ranksep=1.5;
     subgraph clusterdce198 {
       label="vert";
       inputdce198 [shape=record,label="inputs|<in0> i0.x|<in1> i0.y|<in2> i0.z|<in4> i1.x|<in5> i1.y|<in6> i1.z"];
       instrdcf348 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
       instrdcedd0 [shape=record,style=filled,fillcolor=lightgrey,label="{mad.f32|<dst0>|<src0> |<src1> |<src2> }"];
       inputdce198:<in2>:w -> instrdcedd0:<src0>
       inputdce198:<in6>:w -> instrdcedd0:<src1>
       instrdcec30 [shape=record,style=filled,fillcolor=lightgrey,label="{mad.f32|<dst0>|<src0> |<src1> |<src2> }"];
       inputdce198:<in1>:w -> instrdcec30:<src0>
       inputdce198:<in5>:w -> instrdcec30:<src1>
       instrdceb60 [shape=record,style=filled,fillcolor=lightgrey,label="{mul.f|<dst0>|<src0> |<src1> }"];
       inputdce198:<in0>:w -> instrdceb60:<src0>
       inputdce198:<in4>:w -> instrdceb60:<src1>
       instrdceb60:<dst0> -> instrdcec30:<src2>
       instrdcec30:<dst0> -> instrdcedd0:<src2>
       instrdcedd0:<dst0> -> instrdcf348:<src0>
       instrdcf400 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
       instrdcedd0:<dst0> -> instrdcf400:<src0>
       instrdcf4b8 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
       instrdcedd0:<dst0> -> instrdcf4b8:<src0>
       outputdce198 [shape=record,label="outputs|<out0> o0.x|<out1> o0.y|<out2> o0.z|<out3> o0.w"];
       instrdcf348:<dst0> -> outputdce198:<out0>:e
       instrdcf400:<dst0> -> outputdce198:<out1>:e
       instrdcf4b8:<dst0> -> outputdce198:<out2>:e
       instrdcedd0:<dst0> -> outputdce198:<out3>:e
     }
   }

(after scheduling, etc, but before register assignment).

Internal Structure
~~~~~~~~~~~~~~~~~~

``ir3_block``
   Represents a basic block.

   TODO: currently blocks are nested, but I think I need to change that
   to a more conventional arrangement before implementing proper flow
   control.
   Currently the only flow control handled is if/else, which
   gets flattened out, with results chosen by ``sel`` instructions.

``ir3_instruction``
   Represents a machine instruction or meta_ instruction. Has pointers
   to dst register (``regs[0]``) and src register(s) (``regs[1..n]``),
   as needed.

``ir3_register``
   Represents a src or dst register, flags indicate const/relative/etc.
   If ``IR3_REG_SSA`` is set on a src register, the actual register
   number (name) has not been assigned yet, and instead the ``instr``
   field points to the source instruction.

In addition there are various util macros/functions to simplify manipulation/traversal of the graph:

``foreach_src(srcreg, instr)``
   Iterate each instruction's source ``ir3_register``\s

``foreach_src_n(srcreg, n, instr)``
   Like ``foreach_src``, also setting ``n`` to the source number (starting
   with ``0``).

``foreach_ssa_src(srcinstr, instr)``
   Iterate each instruction's SSA source ``ir3_instruction``\s. This skips
   non-SSA sources (consts, etc), but includes virtual sources (such as the
   address register if `relative addressing`_ is used).

``foreach_ssa_src_n(srcinstr, n, instr)``
   Like ``foreach_ssa_src``, also setting ``n`` to the source number.

For example:

.. code-block:: c

   /* take the largest delay required by any SSA source of instr */
   foreach_ssa_src_n(src, i, instr) {
      unsigned d = delay_calc_srcn(ctx, src, instr, i);
      delay = MAX2(delay, d);
   }


TODO probably other helper/util stuff worth mentioning here

.. _meta:

Meta Instructions
~~~~~~~~~~~~~~~~~

**input**
   Used for shader inputs (registers configured in the command-stream
   to hold particular input values, written by the shader core before
   start of execution). Also used for connecting up values within a
   basic block to an output of a previous block.

**output**
   Used to hold outputs of a basic block.

**flow**
   TODO

**phi**
   TODO

**fanin**
   Groups registers which need to be assigned to consecutive scalar
   registers, for example ``sam`` (texture fetch) src instructions (see
   `register groups`_) or array element dereference
   (see `relative addressing`_).

**fanout**
   The counterpart to **fanin**. When an instruction such as ``sam``
   writes multiple components, it splits the result into individual
   scalar components to be consumed by other instructions.


.. _`flow control`:

Flow Control
~~~~~~~~~~~~

TODO


.. _`register groups`:

Register Groups
~~~~~~~~~~~~~~~

Certain instructions, such as texture sample instructions, consume multiple consecutive scalar registers via a single src register encoded in the instruction, and/or write multiple consecutive scalar registers. In the simplest example:

::

  sam (f32)(xyz)r2.x, r0.z, s#0, t#0

for a 2d texture, would read ``r0.zw`` to get the coordinate, and write ``r2.xyz``.

Before register assignment, to group the two components of the texture src together:

.. graphviz::

   digraph G {
     { rank=same;
       fanin;
     };
     { rank=same;
       coord_x;
       coord_y;
     };
     sam -> fanin [label="regs[1]"];
     fanin -> coord_x [label="regs[1]"];
     fanin -> coord_y [label="regs[2]"];
     coord_x -> coord_y [label="right",style=dotted];
     coord_y -> coord_x [label="left",style=dotted];
     coord_x [label="coord.x"];
     coord_y [label="coord.y"];
   }

The frontend sets up the SSA ptrs from the ``sam`` source register to the ``fanin`` meta instruction, which in turn points to the instructions producing the ``coord.x`` and ``coord.y`` values. The grouping_ pass then sets up the ``left`` and ``right`` neighbor pointers of the ``fanin``\'s sources, used later by the `register assignment`_ pass to assign blocks of scalar registers.
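
As a rough illustration of the neighbor pointers described above, the following sketch links the sources of a ``fanin`` as ``left`` / ``right`` neighbors and then hands out consecutive scalar registers to the whole group. It is not the actual ir3 code: ``struct sketch_instr``, ``group_fanin_srcs()`` and ``assign_group()`` are hypothetical names made up for this example, and only the ``left`` / ``right`` pointer idea comes from these notes.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>
   #include <stddef.h>

   /* Hypothetical, simplified stand-in for ir3_instruction: only the
    * fields needed to illustrate neighbor grouping.
    */
   struct sketch_instr {
      const char *name;
      struct sketch_instr *left;   /* neighbor in the next-lower scalar reg */
      struct sketch_instr *right;  /* neighbor in the next-higher scalar reg */
      int reg;                     /* assigned scalar register, -1 if none */
   };

   /* Link the sources of a fanin as left/right neighbors, in order.
    * Assumes the sources are distinct instructions.  Returns false,
    * without changing anything, if a source already has a different
    * neighbor; per the notes, that is the case where an extra mov
    * would be needed.
    */
   static bool
   group_fanin_srcs(struct sketch_instr **srcs, size_t n)
   {
      /* first pass: look for conflicting neighbors */
      for (size_t i = 0; i + 1 < n; i++) {
         struct sketch_instr *l = srcs[i], *r = srcs[i + 1];
         if ((l->right && l->right != r) || (r->left && r->left != l))
            return false;
      }
      /* second pass: set up the neighbor pointers */
      for (size_t i = 0; i + 1 < n; i++) {
         srcs[i]->right = srcs[i + 1];
         srcs[i + 1]->left = srcs[i];
      }
      return true;
   }

   /* Assign consecutive scalar registers, starting at base, to the
    * whole group that instr belongs to.  Returns the group size.
    */
   static int
   assign_group(struct sketch_instr *instr, int base)
   {
      assert(instr != NULL);
      /* walk to the left-most member of the group */
      while (instr->left)
         instr = instr->left;
      int n = 0;
      for (; instr; instr = instr->right)
         instr->reg = base + n++;
      return n;
   }

With ``coord.x`` and ``coord.y`` grouped this way, assigning from scalar register 2 gives them registers 2 and 3, which corresponds to ``r0.zw`` in the ``sam`` example if scalar registers are numbered as ``4 * n + component`` (an assumption of this sketch).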

And likewise, for the consecutive scalar registers for the destination:

.. graphviz::

   digraph {
     { rank=same;
       A;
       B;
       C;
     };
     { rank=same;
       fanout_0;
       fanout_1;
       fanout_2;
     };
     A -> fanout_0;
     B -> fanout_1;
     C -> fanout_2;
     fanout_0 [label="fanout\noff=0"];
     fanout_0 -> sam;
     fanout_1 [label="fanout\noff=1"];
     fanout_1 -> sam;
     fanout_2 [label="fanout\noff=2"];
     fanout_2 -> sam;
     fanout_0 -> fanout_1 [label="right",style=dotted];
     fanout_1 -> fanout_0 [label="left",style=dotted];
     fanout_1 -> fanout_2 [label="right",style=dotted];
     fanout_2 -> fanout_1 [label="left",style=dotted];
     sam;
   }

.. _`relative addressing`:

Relative Addressing
~~~~~~~~~~~~~~~~~~~

Most instructions support addressing indirectly (relative to address register) into const or gpr register file in some or all of their src/dst registers. In this case the register accessed is taken from ``r<a0.x + n>`` or ``c<a0.x + n>``, i.e. address register (``a0.x``) value plus ``n``, where ``n`` is encoded in the instruction (rather than the absolute register number).

   Note that cat5 (texture sample) instructions are the notable exception, not
   supporting relative addressing of src or dst.

Relative addressing of the const file (for example, a uniform array) is relatively simple. We don't do register assignment of the const file, so all that is required is to schedule things properly. I.e. the instruction that writes the address register must be scheduled first, and we cannot have two different address register values live at one time.

But relative addressing of the gpr file (which can be a src or dst) has additional restrictions on register assignment (i.e. the array elements must be assigned to consecutive scalar registers).
And in the case of relative dst, subsequent instructions now depend on both the relative write, as well as the previous instruction which wrote that register, since we do not know at compile time which actual register was written.

Each instruction has an optional ``address`` pointer, to capture the dependency on the address register value when relative addressing is used for any of the src/dst register(s). This behaves as an additional virtual src register, i.e. ``foreach_ssa_src()`` will also iterate the address register (last).

   Note that ``nop``\'s for timing constraints, type specifiers (e.g.
   ``add.f`` vs ``add.u``), etc, are omitted for brevity in the examples.

::

  mova a0.x, hr1.y
  sub r1.y, r2.x, r3.x
  add r0.x, r1.y, c<a0.x + 2>

results in:

.. graphviz::

   digraph {
     rankdir=LR;
     sub;
     const [label="const file"];
     add;
     mova;
     add -> mova;
     add -> sub;
     add -> const [label="off=2"];
   }

The scheduling pass has some smarts to schedule things such that only a single ``a0.x`` value is used at any one time.

To implement variable arrays, values are stored in consecutive scalar registers. This has some overlap with `register groups`_, in that ``fanin`` and ``fanout`` are used to help group things for the `register assignment`_ pass.

Using a variable array as a src register is a slight variation of what is done for a const array src. The instruction src is a ``fanin`` instruction that groups all the array members:

::

  mova a0.x, hr1.y
  sub r1.y, r2.x, r3.x
  add r0.x, r1.y, r<a0.x + 2>

results in:

.. graphviz::

   digraph {
     a0 [label="r0.z"];
     a1 [label="r0.w"];
     a2 [label="r1.x"];
     a3 [label="r1.y"];
     sub;
     fanin;
     mova;
     add;
     add -> sub;
     add -> fanin [label="off=2"];
     add -> mova;
     fanin -> a0;
     fanin -> a1;
     fanin -> a2;
     fanin -> a3;
   }

TODO better describe how actual deref offset is derived, i.e. based on array base register.

To do an indirect write to a variable array, a ``fanout`` is used. Say the array was assigned to registers ``r0.z`` through ``r1.y`` (hence the constant offset of 2):

   Note that only cat1 (mov) can do indirect write.

::

  mova a0.x, hr1.y
  min r2.x, r2.x, c0.x
  mov r<a0.x + 2>, r2.x
  mul r0.x, r0.z, c0.z


In this case, the ``mov`` instruction does not write all elements of the array (compared to usage of ``fanout`` for ``sam`` instructions in grouping_). But the ``mov`` instruction does need an additional dependency (via ``fanin``) on the instructions that last wrote the array element members, to ensure that they get scheduled before the ``mov`` in the scheduling_ stage (which also serves to group the array elements for the `register assignment`_ stage).

.. graphviz::

   digraph {
     a0 [label="r0.z"];
     a1 [label="r0.w"];
     a2 [label="r1.x"];
     a3 [label="r1.y"];
     min;
     mova;
     mov;
     mul;
     fanout [label="fanout\noff=0"];
     mul -> fanout;
     fanout -> mov;
     fanin;
     fanin -> a0;
     fanin -> a1;
     fanin -> a2;
     fanin -> a3;
     mov -> min;
     mov -> mova;
     mov -> fanin;
   }

Note that there would in fact be ``fanout`` nodes generated for each array element (although only the reachable ones will be scheduled, etc).



Shader Passes
-------------

After the frontend has generated the use-def graph of instructions, the instructions are run through various passes which include scheduling_ and `register assignment`_.
Because inserting ``mov`` instructions after scheduling would also require inserting additional ``nop`` instructions (since it is too late to reschedule to try and fill the bubbles), the earlier stages try to ensure (at least given an infinite supply of registers) that `register assignment`_ after scheduling_ cannot fail.

   Note that we essentially have ~256 scalar registers in the
   architecture (although larger register usage will at some thresholds
   limit the number of threads which can run in parallel). And at some
   point we will have to deal with spilling.

.. _flatten:

Flatten
~~~~~~~

In this stage, simple if/else blocks are flattened into a single block with ``phi`` nodes converted into ``sel`` instructions. The a3xx ISA has very few predicated instructions, and we would prefer not to use branches for simple if/else.


.. _`copy propagation`:

Copy Propagation
~~~~~~~~~~~~~~~~

Currently the frontend inserts ``mov``\s in various cases, because certain categories of instructions have limitations on const regs as sources. The CP pass simply removes all simple ``mov``\s (i.e. src-type is same as dst-type, no abs/neg flags, etc).

The eventual plan is to invert that, with the frontend inserting no ``mov``\s and CP legalizing things.


.. _grouping:

Grouping
~~~~~~~~

In the grouping pass, instructions which need to be grouped (for ``fanin``\s, etc) have their ``left`` / ``right`` neighbor pointers set up. In cases where there is a conflict (i.e. one instruction cannot have two unique left or right neighbors), an additional ``mov`` instruction is inserted. This ensures that there is some possible valid `register assignment`_ at the later stages.


.. _depth:

Depth
~~~~~

In the depth pass, a depth is calculated for each instruction node within its basic block.
An instruction's depth is its required cycles (the delay slots needed between two instructions, plus one) plus the maximum depth of any of its source instructions. (meta_ instructions don't add to the depth.) As an instruction's depth is calculated, it is inserted into a per-block list sorted by deepest instruction. Unreachable instructions and inputs are marked.

   TODO: we should probably calculate both hard and soft depths (?) to
   try to coax additional instructions to fit in places where we need
   to use sync bits, such as after a texture fetch or SFU.

.. _scheduling:

Scheduling
~~~~~~~~~~

After the grouping_ pass, there are no more instructions to insert or remove. Start scheduling each basic block from the deepest node in the depth-sorted list created by the depth_ pass, recursively trying to schedule each instruction after its source instructions plus delay slots. Insert ``nop``\s as required.

.. _`register assignment`:

Register Assignment
~~~~~~~~~~~~~~~~~~~

TODO