1r600-sb 2======= 3 4* * * * * 5 6Debugging 7--------- 8 9### Environment variables 10 11- **R600\_DEBUG** 12 13 There are new flags: 14 15 - **sb** - Enable optimization of graphics shaders 16 - **sbcl** - Enable optimization of compute shaders (experimental) 17 - **sbdry** - Dry run, optimize but use source bytecode - 18 useful if you only want to check shader dumps 19 without the risk of lockups and other problems 20 - **sbstat** - Print optimization statistics (only time so far) 21 - **sbdump** - Print IR after some passes. 22 23### Regression debugging 24 25If there are any regressions as compared to the default backend 26(R600\_SB=0), it's possible to use the following environment variables 27to find the incorrectly optimized shader that causes the regression. 28 29- **R600\_SB\_DSKIP\_MODE** - allows to skip optimization for some 30 shaders 31 - 0 - disabled (default) 32 - 1 - skip optimization for the shaders in the range 33 [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END], that is, 34 optimize only the shaders that are not in this range 35 - 2 - optimize only the shaders in the range 36 [R600\_SB\_DSKIP\_START; R600\_SB\_DSKIP\_END] 37 38- **R600\_SB\_DSKIP\_START** - start of the range (1-based) 39 40- **R600\_SB\_DSKIP\_END** - end of the range (1-based) 41 42Example - optimize only the shaders 5, 6, and 7: 43 44 R600_SB_DSKIP_START=5 R600_SB_DSKIP_END=7 R600_SB_DSKIP_MODE=2 45 46All shaders compiled by the application are numbered starting from 1, 47the number of shaders used by the application may be obtained by running 48it with "R600_DEBUG=sb,sbstat" - it will print "sb: shader \#index\#" 49for each compiled shader. 50 51After figuring out the total number of shaders used by the application, 52the variables above allow to use bisection to find the shader that is 53the cause of regression. E.g. if the application uses 100 shaders, we 54can divide the range [1; 100] and run the application with the 55optimization enabled only for the first half of the shaders: 56 57 R600_SB_DSKIP_START=1 R600_SB_DSKIP_END=50 R600_SB_DSKIP_MODE=2 <app> 58 59If the regression is reproduced with these parameters, then the failing 60shader is in the range [1; 50], if it's not reproduced - then it's in 61the range [51; 100]. Then we can divide the new range again and repeat 62the testing, until we'll reduce the range to a single failing shader. 63 64*NOTE: This method relies on the assumption that the application 65produces the same sequence of the shaders on each run. It's not always 66true - some applications may produce different sequences of the shaders, 67in such cases the tools like apitrace may be used to record the trace 68with the application, then this method may be applied when replaying the 69trace - also this may be faster and/or more convenient than testing the 70application itself.* 71 72* * * * * 73 74Intermediate Representation 75--------------------------- 76 77### Values 78 79All kinds of the operands (literal constants, references to kcache 80constants, references to GPRs, etc) are currently represented by the 81**value** class (possibly it makes sense to switch to hierarchy of 82classes derived from **value** instead, to save some memory). 83 84All values (except some pseudo values like the exec\_mask or predicate 85register) represent 32bit scalar values - there are no vector values, 86CF/FETCH instructions use groups of 4 values for src and dst operands. 87 88### Nodes 89 90Shader programs are represented using the tree data structure, some 91nodes contain a list of subnodes. 92 93#### Control flow nodes 94 95Control flow information is represented using four special node types 96(based on the ideas from [[1]](#references) ) 97 98- **region\_node** - single-entry, single-exit region. 99 100 All loops and if's in the program are enclosed in region nodes. 101 Region nodes have two containers for phi nodes - 102 region\_node::loop\_phi contains the phi expressions to be executed 103 at the region entry, region\_node::phi contains the phi expressions 104 to be executed at the region exit. It's the only type of the node 105 that contains associated phi expressions. 106 107- **depart\_node** - "depart region \$id after { ... }" 108 109 Depart target region (jump to exit point) after executing contained 110 code. 111 112- **repeat\_node** - "repeat region \$id after { ... }" 113 114 Repeat target region (jump to entry point) after executing contained 115 code. 116 117- **if\_node** - "if (cond) { ... }" 118 119 Execute contained code if condition is true. The difference from 120 [[1]](#references) is that we don't have associated phi expressions 121 for the **if\_node**, we enclose **if\_node** in the 122 **region\_node** and store corresponding phi's in the 123 **region\_node**, this allows more uniform handling. 124 125The target region of depart and repeat nodes is always the region where 126they are located (possibly in the nested region), there are no arbitrary 127jumps/goto's - control flow in the program is always structured. 128 129Typical control flow constructs can be represented as in the following 130examples: 131 132GLSL: 133 134 if (cond) { 135 < 1 > 136 } else { 137 < 2 > 138 } 139 140IR: 141 142 region #0 { 143 depart region #0 after { 144 if (cond) { 145 depart region #0 after { 146 < 1 > 147 } 148 } 149 < 2 > 150 } 151 <region #0 phi nodes > 152 } 153 154GLSL: 155 156 while (cond) { 157 < 1 > 158 } 159 160IR: 161 162 region #0 { 163 <region #0 loop_phi nodes> 164 repeat region #0 after { 165 region #1 { 166 depart region #1 after { 167 if (!cond) { 168 depart region #0 169 } 170 } 171 } 172 < 1 > 173 } 174 <region #0 phi nodes> 175 } 176 177'Break' and 'continue' inside the loops are directly translated to the 178depart and repeat nodes for the corresponding loop region. 179 180This may look a bit too complicated, but in fact this allows more simple 181and uniform handling of the control flow. 182 183All loop\_phi and phi nodes for some region always have the same number 184of source operands. The number of source operands for 185region\_node::loop\_phi nodes is 1 + number of repeat nodes that 186reference this region as a target. The number of source operands for 187region\_node::phi nodes is equal to the number of depart nodes that 188reference this region as a target. All depart/repeat nodes for the 189region have unique indices equal to the index of source operand for 190phi/loop\_phi nodes. 191 192First source operand for region\_node::loop\_phi nodes (src[0]) is an 193incoming value that enters the region from the outside. Each remaining 194source operand comes from the corresponding repeat node. 195 196More complex example: 197 198GLSL: 199 200 a = 1; 201 while (a < 5) { 202 a = a * 2; 203 if (b == 3) { 204 continue; 205 } else { 206 a = 6; 207 } 208 if (c == 4) 209 break; 210 a = a + 1; 211 } 212 213IR with SSA form: 214 215 a.1 = 1; 216 region #0 { 217 // loop phi values: src[0] - incoming, src[1] - from repeat_1, src[2] - from repeat_2 218 region#0 loop_phi: a.2 = phi a.1, a.6, a.3 219 220 repeat_1 region #0 after { 221 a.3 = a.2 * 2; 222 cond1 = (b == 3); 223 region #1 { 224 depart_0 region #1 after { 225 if (cond1) { 226 repeat_2 region #0; 227 } 228 } 229 a.4 = 6; 230 231 region #1 phi: a.5 = phi a.4; // src[0] - from depart_0 232 } 233 cond2 = (c == 4); 234 region #2 { 235 depart_0 region #2 after { 236 if (cond2) { 237 depart_0 region #0; 238 } 239 } 240 } 241 a.6 = a.5 + 1; 242 } 243 244 region #0 phi: a.7 = phi a.5 // src[0] from depart_0 245 } 246 247Phi nodes with single source operand are just copies, they are not 248really necessary, but this allows to handle all **depart\_node**s in the 249uniform way. 250 251#### Instruction nodes 252 253Instruction nodes represent different kinds of instructions - 254**alu\_node**, **cf\_node**, **fetch\_node**, etc. Each of them contains 255the "bc" structure where all fields of the bytecode are stored (the type 256is **bc\_alu** for **alu\_node**, etc). The operands are represented 257using the vectors of pointers to **value** class (node::src, node::dst) 258 259#### SSA-specific nodes 260 261Phi nodes currently don't have special node class, they are stored as 262**node**. Destination vector contains a single destination value, source 263vector contains 1 or more source values. 264 265Psi nodes [[5], [6]](#references) also don't have a special node class 266and stored as **node**. Source vector contains 3 values for each source 267operand - the **value** of predicate, **value** of corresponding 268PRED\_SEL field, and the source **value** itself. 269 270### Indirect addressing 271 272Special kind of values (VLK\_RELREG) is used to represent indirect 273operands. These values don't have SSA versions. The representation is 274mostly based on the [[2]](#references). Indirect operand contains the 275"offset/address" value (value::rel), (e.g. some SSA version of the AR 276register value, though after some passes it may be any value - constant, 277register, etc), also it contains the maydef and mayuse vectors of 278pointers to **value**s (similar to dst/src vectors in the **node**) to 279represent the effects of aliasing in the SSA form. 280 281E.g. if we have the array R5.x ... R8.x and the following instruction : 282 283 MOV R0.x, R[5 + AR].x 284 285then source indirect operand is represented with the VLK\_RELREG value, 286value::rel is AR, value::maydef is empty (in fact it always contain the 287same number of elements as mayuse to simplify the handling, but they are 288NULLs), value::mayuse contains [R5.x, R6.x, R7.x, R8.x] (or the 289corresponding SSA versions after ssa\_rename). 290 291Additional "virtual variables" as in [HSSA [2]](#references) are not 292used, also there is no special handling for "zero versions". Typical 293programs in our case are small, indirect addressing is rare, array sizes 294are limited by max gpr number, so we don't really need to use special 295tricks to avoid the explosion of value versions. Also this allows more 296precise liveness computation for array elements without modifications to 297the algorithms. 298 299With the following instruction: 300 301 MOV R[5+AR].x, R0.x 302 303we'll have both maydef and mayuse vectors for dst operand filled with 304array values initially: [R5.x, R6.x, R7.x, R8.x]. After the ssa\_rename 305pass mayuse will contain previous versions, maydef will contain new 306potentially-defined versions. 307 308* * * * * 309 310Passes 311------ 312 313- **bc\_parser** - creates the IR from the source bytecode, 314 initializes src and dst value vectors for instruction nodes. Most 315 ALU nodes have one dst operand and the number of source operands is 316 equal to the number of source operands for the ISA instruction. 317 Nodes for PREDSETxx instructions have 3 dst operands - dst[0] is dst 318 gpr as in the original instruction, other two are pseudo-operands 319 that represent possibly updated predicate and exec\_mask. Predicate 320 values are used in the predicated alu instructions (node::pred), 321 exec\_mask values are used in the if\_nodes (if\_node::cond). Each 322 vector operand in the CF/TEX/VTX instructions is represented with 4 323 values - components of the vector. 324 325- **ssa\_prepare** - creates phi expressions. 326 327- **ssa\_rename** - renames the values (assigns versions). 328 329- **liveness** - liveness computation, sets 'dead' flag for unused 330 nodes and values, optionally computes interference information for 331 the values. 332 333- **dce\_cleanup** - eliminates 'dead' nodes, also removes some 334 unnecessary nodes created by bc\_parser, e.g. the nodes for the JUMP 335 instructions in the source, containers for ALU groups (they were 336 only needed for the ssa\_rename pass) 337 338- **if\_conversion** - converts control flow with if\_nodes to the 339 data flow in cases where it can improve performance (small alu-only 340 branches). Both branches are executed speculatively and the phi 341 expressions are replaced with conditional moves (CNDxx) to select 342 the final value using the same condition predicate as was used by 343 the original if\_node. E.g. **if\_node** used dst[2] from PREDSETxx 344 instruction, CNDxx now uses dst[0] from the same PREDSETxx 345 instruction. 346 347- **peephole** - peephole optimizations 348 349- **gvn** - Global Value Numbering [[2]](#references), 350 [[3]](#references) 351 352- **gcm** - Global Code Motion [[3]](#references). Also performs 353 grouping of the instructions of the same kind (CF/FETCH/ALU). 354 355- register allocation passes, some ideas are used from 356 [[4]](#references), but implementation is simplified to make it more 357 efficient in terms of the compilation speed (e.g. no recursive 358 recoloring) while achieving good enough results. 359 360 - **ra\_split** - prepares the program to register allocation. 361 Splits live ranges for constrained values by inserting the 362 copies to/from temporary values, so that the live range of the 363 constrained values becomes minimal. 364 365 - **ra\_coalesce** - performs global allocation on registers used 366 in CF/FETCH instructions. It's performed first to make sure they 367 end up in the same GPR. Also tries to allocate all values 368 involved in copies (inserted by the ra\_split pass) to the same 369 register, so that the copies may be eliminated. 370 371 - **ra\_init** - allocates gpr arrays (if indirect addressing is 372 used), and remaining values. 373 374- **post\_scheduler** - ALU scheduler, handles VLIW packing and 375 performs the final register allocation for local values inside ALU 376 clauses. Eliminates all coalesced copies (if src and dst of the copy 377 are allocated to the same register). 378 379- **ra\_checker** - optional debugging pass that tries to catch basic 380 errors of the scheduler or regalloc, 381 382- **bc\_finalize** - propagates the regalloc information from values 383 in node::src and node::dst vectors to the bytecode fields, converts 384 control flow structure (region/depart/repeat) to the target 385 instructions (JUMP/ELSE/POP, 386 LOOP\_START/LOOP\_END/LOOP\_CONTINUE/LOOP\_BREAK). 387 388- **bc\_builder** - builds final bytecode, 389 390* * * * * 391 392References 393---------- 394 395[1] ["Tree-Based Code Optimization. A Thesis Proposal", Carl 396McConnell](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.38.4210&rep=rep1&type=pdf) 397 398[2] ["Effective Representation of Aliases and Indirect Memory Operations 399in SSA Form", Fred Chow, Sun Chan, Shin-Ming Liu, Raymond Lo, Mark 400Streich](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.6974&rep=rep1&type=pdf) 401 402[3] ["Global Code Motion. Global Value Numbering.", Cliff 403Click](http://www.cs.washington.edu/education/courses/cse501/06wi/reading/click-pldi95.pdf) 404 405[4] ["Register Allocation for Programs in SSA Form", Sebastian 406Hack](http://digbib.ubka.uni-karlsruhe.de/volltexte/documents/6532) 407 408[5] ["An extension to the SSA representation for predicated code", 409Francois de 410Ferriere](http://www.cdl.uni-saarland.de/ssasem/talks/Francois.de.Ferriere.pdf) 411 412[6] ["Improvements to the Psi-SSA Representation", F. de 413Ferriere](http://www.scopesconf.org/scopes-07/presentations/3_Presentation.pdf) 414