=====================
Adreno Five Microcode
=====================

.. contents::

.. _afuc-introduction:

Introduction
============

Adreno GPUs prior to 6xx use two microcontrollers to parse the command-stream,
set up the hardware for draws (or compute jobs), and do various GPU
housekeeping. They are relatively simple (basically glorified
register writers), and essentially all of their state is held in a collection
of registers. That is, there is no stack and no memory assigned to
them; any global state, such as which bank of context registers is to
be used in the next draw, is stored in a register.

The setup is similar to radeon; in fact, Adreno 2xx through 4xx used
basically the same instruction set as r600. There is a "PFP"
(Prefetch Parser) and an "ME" (Micro Engine, also confusingly referred
to as "PM4"). Together these make up the "CP" ("Command Parser"). The
PFP runs ahead of the ME, with some PM4 packets handled entirely
in the PFP. Between the PFP and ME is a FIFO ("MEQ"). In the
generations prior to Adreno 5xx, the PFP and ME had different
instruction sets.

Starting with Adreno 5xx, a new microcontroller with a unified
instruction set was introduced, although the overall architecture
and purpose of the two microcontrollers remain the same.

For lack of a better name, this new instruction set is called
"Adreno Five MicroCode" or "afuc". (No idea what Qualcomm calls
it internally.)

With Adreno 6xx, the separate PFP and ME are replaced with a single
SQE microcontroller using the same instruction set as 5xx.

.. _afuc-overview:

Instruction Set Overview
========================

A 32-bit instruction set with basic arithmetic ops that can take
either two source registers or one source register and a 16-bit immediate.
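
Because registers are 32 bits wide, 64-bit values (such as GPU addresses)
live in register pairs, with the ``add``/``addhi`` pair from the ALU
Instructions section handling the carry between the halves. The Python
sketch below models that split. Passing the carry explicitly from ``add``
to ``addhi`` is an assumption (the hardware may keep it in hidden state),
zero extension of the 16-bit immediate is also assumed, and all helper
names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def alu_add(src1, src2):
    """32-bit add: returns (result, carry_out)."""
    total = (src1 & MASK32) + (src2 & MASK32)
    return total & MASK32, total >> 32


def alu_addhi(src1, src2, carry_in):
    """Assumed addhi semantics: add both sources plus the carry
    produced by the preceding add (for the upper 32 bits)."""
    return ((src1 & MASK32) + (src2 & MASK32) + carry_in) & MASK32


def imm16(value):
    """Immediate operands are limited to 16 bits (zero extension assumed)."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("immediate does not fit in 16 bits")
    return value


def add64(lo_a, hi_a, lo_b, hi_b):
    """Add two 64-bit values held as (lo, hi) register pairs."""
    lo, carry = alu_add(lo_a, lo_b)      # add   $lo, $lo_a, $lo_b
    hi = alu_addhi(hi_a, hi_b, carry)    # addhi $hi, $hi_a, $hi_b
    return lo, hi


# Advance a 64-bit address by 0x10 bytes; the low half wraps and carries.
lo, hi = add64(0xFFFFFFF8, 0x00000001, imm16(0x10), 0)
print(hex(lo), hex(hi))  # 0x8 0x2
```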

There are 32 registers, although some are special purpose:

- ``$00`` - always reads zero, otherwise seems to be the PC
- ``$01`` - current PM4 packet header
- ``$1c`` - alias ``$rem``, remaining data in packet
- ``$1d`` - alias ``$addr``
- ``$1f`` - alias ``$data``

Branch instructions have a delay slot, so the following instruction
is always executed regardless of whether the branch is taken.


.. _afuc-alu:

ALU Instructions
================

The following instructions are available:

- ``add`` - add
- ``addhi`` - add + carry (for the upper 32 bits of a 64-bit value)
- ``sub`` - subtract
- ``subhi`` - subtract + carry (for the upper 32 bits of a 64-bit value)
- ``and`` - bitwise AND
- ``or`` - bitwise OR
- ``xor`` - bitwise XOR
- ``not`` - bitwise NOT (no src1)
- ``shl`` - shift-left
- ``ushr`` - unsigned shift-right
- ``ishr`` - signed shift-right
- ``rot`` - rotate-left (like shift-left with wrap-around)
- ``mul8`` - multiply the low 8 bits of the two sources
- ``min`` - minimum
- ``max`` - maximum
- ``cmp`` - compare two values

The ALU instructions take either two source registers, or one source
register plus a 16-bit immediate as the second source, for example::

   add $dst, $src, 0x1234    ; src2 is immediate
   add $dst, $src1, $src2    ; src2 is register

The ``not`` instruction only takes a single source::

   not $dst, $src
   not $dst, 0x1234

.. _afuc-alu-cmp:

The ``cmp`` instruction returns:

- ``0x00`` if src1 > src2
- ``0x2b`` if src1 == src2
- ``0x1e`` if src1 < src2

See the explanation in :ref:`afuc-branch`.


.. _afuc-branch:

Branch Instructions
===================

The following branch/jump instructions are available:

- ``brne`` - branch if not equal (or bit not set)
- ``breq`` - branch if equal (or bit set)
- ``jump`` - unconditional jump

Both ``brne`` and ``breq`` have two forms, comparing the source register
against either a small immediate (up to 5 bits) or a specific bit::

   breq $src, b3, #somelabel    ; branch if src & (1 << 3)
   breq $src, 0x3, #somelabel   ; branch if src == 3

The branch instructions are encoded with a 16-bit relative offset.
Since ``$00`` always reads back zero, it can be used to construct
an unconditional relative jump.

The :ref:`cmp <afuc-alu-cmp>` instruction can be paired with the
bit-test variants of ``brne``/``breq`` to implement gt/ge/lt/le,
due to the bit pattern it returns. For example, bit 1 is set in both
``0x2b`` and ``0x1e`` but not in ``0x00``, so the following sequence::

   cmp $04, $02, $03
   breq $04, b1, #somelabel

will branch if ``$02`` is less than or equal to ``$03``.


.. _afuc-call:

Call/Return
===========

Simple subroutines can be implemented with ``call``/``ret``. The
jump instruction encodes a fixed offset.

   TODO: not sure how many levels deep function calls can be nested.
   There isn't really a stack. There definitely seem to be multiple
   levels of function calls, see in PFP: CP_CONTEXT_SWITCH_YIELD -> f13 ->
   f22.


.. _afuc-control:

Config Instructions
===================

These seem to read/write config state in other parts of the CP. In at
least some cases I expect these map to CP registers (but possibly
not directly).

- ``cread $dst, [$off + addr], flags``
- ``cwrite $src, [$off + addr], flags``

In cases where no offset is needed, ``$00`` is frequently used as
the offset.
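
The ``[$off + addr]`` operand presumably adds a register value to the
immediate, which is why the always-zero ``$00`` works as "no offset". The
hypothetical Python model below shows that address calculation, plus the
``(reg & 0x3) << 2`` offset computed from ``$18`` in the
``CP_INDIRECT_BUFFER`` example. Dropping writes to ``$00`` is an assumption
of the model, and ``RegFile``/``control_addr`` are made-up names.

```python
MASK32 = 0xFFFFFFFF


class RegFile:
    """Hypothetical model of the 32 afuc GPRs; $00 always reads as zero."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, n):
        return 0 if n == 0 else self.regs[n]

    def write(self, n, value):
        # Writes to $00 are simply dropped in this model.
        if n != 0:
            self.regs[n] = value & MASK32


def control_addr(rf, off_reg, addr):
    """Effective address of a cread/cwrite operand [$off + addr]."""
    return (rf.read(off_reg) + addr) & MASK32


rf = RegFile()
print(hex(control_addr(rf, 0, 0x0b8)))  # [$00 + 0x0b8] -> 0xb8

# and $05, $18, 0x0003 ; shl $05, $05, 0x0002 ; cwrite ..., [$05 + 0x0b0]
for low_bits in range(3):
    rf.write(0x18, low_bits)
    rf.write(0x05, (rf.read(0x18) & 0x3) << 2)
    print(hex(control_addr(rf, 0x05, 0x0b0)))  # 0xb0, then 0xb4, then 0xb8
```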

For example, the following sequence loads the ``CP_INDIRECT_BUFFER``
parameters and writes them to ``CP_IBn_BASE_LO/HI/BUFSIZE``::

   ; load CP_INDIRECT_BUFFER parameters from cmdstream:
   mov $02, $data   ; low 32b of IB target address
   mov $03, $data   ; high 32b of IB target
   mov $04, $data   ; IB size in dwords

   ; sanity check # of dwords:
   breq $04, 0x0, #l23 (#69, 04a2)

   ; this seems to have something to do with figuring out whether
   ; we are going from RB->IB1 or IB1->IB2 (i.e. so the
   ; cwrite instructions below update either
   ; CP_IB1_BASE_LO/HI/BUFSIZE or CP_IB2_BASE_LO/HI/BUFSIZE)
   and $05, $18, 0x0003
   shl $05, $05, 0x0002

   ; update CP_IBn_BASE_LO/HI/BUFSIZE:
   cwrite $02, [$05 + 0x0b0], 0x8
   cwrite $03, [$05 + 0x0b1], 0x8
   cwrite $04, [$05 + 0x0b2], 0x8


.. _afuc-reg-access:

Register Access
===============

The special registers ``$addr`` and ``$data`` can be used to write GPU
registers, for example::

   mov $addr, CP_SCRATCH_REG[0x2] ; set register to write
   mov $data, $03                 ; CP_SCRATCH_REG[0x2]
   mov $data, $04                 ; CP_SCRATCH_REG[0x3]
   ...

Each write to ``$data`` increments the address of the register to
write, so a sequence of consecutive registers can be written.

To read::

   mov $addr, CP_SCRATCH_REG[0x2]
   mov $03, $addr
   mov $04, $addr

Many registers that are updated frequently have two banks, so they can be
updated without stalling for the previous draw to finish. These banks are
arranged so that bit 11 is zero for bank 0 and one for bank 1. The ME firmware
(at least the version I'm looking at) stores this in ``$17``, so to update
these registers from the ME::

   or $addr, $17, VFD_INDEX_OFFSET
   mov $data, $03
   ...
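
The two addressing rules above can be sketched in Python: consecutive
``$data`` writes land in consecutive registers, and ORing the bank bit
(bit 11, which the ME firmware keeps in ``$17``) into a register offset
selects bank 1. ``EXAMPLE_REG`` is a made-up offset standing in for a banked
register such as ``VFD_INDEX_OFFSET``; its real offset is not given here.

```python
BANK_BIT = 1 << 11      # bit 11 selects bank 1 of a double-banked register
EXAMPLE_REG = 0x0100    # hypothetical offset; bit 11 must be clear (bank 0)


def bank_select(bank):
    """Value assumed to be held in $17 for the given bank."""
    return BANK_BIT if bank else 0


def write_regs(regs, start, values):
    """Model of: mov $addr, start; mov $data, v0; mov $data, v1; ...
    Each $data write goes to the current address, which then increments."""
    addr = start
    for value in values:
        regs[addr] = value & 0xFFFFFFFF
        addr += 1
    return regs


regs = {}
# or $addr, $17, EXAMPLE_REG   (with $17 selecting bank 1)
write_regs(regs, bank_select(1) | EXAMPLE_REG, [0x1234, 0x5678])
print({hex(k): hex(v) for k, v in regs.items()})
# {'0x900': '0x1234', '0x901': '0x5678'}
```

Using OR rather than ADD only works because bank-0 offsets have bit 11
clear, which is what the banking arrangement described above guarantees.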

Note that the PFP doesn't seem to use this approach; instead it does something
like::

   mov $0c, CP_SCRATCH_REG[0x7]
   mov $02, 0x789a   ; value
   cwrite $0c, [$00 + 0x010], 0x8
   cwrite $02, [$00 + 0x011], 0x8

As with the ``$addr``/``$data`` approach, the destination register address
increments on each write.

.. _afuc-mem:

Memory Access
=============

There are no load/store instructions as such. The microcontrollers
only have indirect memory access via GPU registers. There are two
possible mechanisms.

Read/Write via CP_NRT Registers
-------------------------------

This seems to be used only by the ME. If the PFP also used it, the two would
race with each other. It seems to be used primarily for small reads.

- ``CP_ME_NRT_ADDR_LO``/``_HI`` - write to set the address to read or write
- ``CP_ME_NRT_DATA`` - write to trigger a write to the address in ``CP_ME_NRT_ADDR``

The address register increments with successive reads or writes.

Memory write example::

   ; store 64b value in $04+$05 to 64b address in $02+$03
   mov $addr, CP_ME_NRT_ADDR_LO
   mov $data, $02
   mov $data, $03
   mov $addr, CP_ME_NRT_DATA
   mov $data, $04
   mov $data, $05

Memory read example::

   ; load 64b value from address in $02+$03 into $04+$05
   mov $addr, CP_ME_NRT_ADDR_LO
   mov $data, $02
   mov $data, $03
   mov $04, $addr
   mov $05, $addr


Read via Control Instructions
-----------------------------

This is used by the PFP whenever it needs to read memory. It also seems to be
used by the ME for streaming reads (larger amounts of data). The DMA access
seems to be done by the ROQ.

   TODO: might also be possible for write access

   TODO: some of the control commands might be synchronizing access
   between PFP and ME?
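
As a rough mental model of this read path, the Python sketch below assumes
the firmware programs a 64-bit address as LO/HI dwords plus a dword count,
after which each read returns the next dword in order. ``StreamingRead``
and its methods are hypothetical and do not describe the actual ROQ
interface; the memory contents are made up.

```python
MASK32 = 0xFFFFFFFF


class StreamingRead:
    """Hypothetical model of a streaming memory read set up via control
    registers: program a 64-bit address and a dword count, then consume
    dwords in order."""

    def __init__(self, memory):
        self.memory = memory    # byte address -> 32-bit dword
        self.addr = 0
        self.remaining = 0

    def set_addr(self, lo, hi):
        self.addr = ((hi & MASK32) << 32) | (lo & MASK32)

    def set_size(self, ndwords):
        self.remaining = ndwords

    def read(self):
        if self.remaining == 0:
            raise RuntimeError("read past the programmed dword count")
        value = self.memory[self.addr]
        self.addr += 4          # advance one dword
        self.remaining -= 1
        return value


# A 4-dword record at 0x1_0000_1000 (values are made up).
mem = {0x100001000 + 4 * i: v for i, v in enumerate([3, 1, 42, 0])}
rd = StreamingRead(mem)
rd.set_addr(0x00001000, 0x1)
rd.set_size(4)
print([rd.read() for _ in range(4)])  # [3, 1, 42, 0]
```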

An example from the ``CP_DRAW_INDIRECT`` packet handler::

   mov $07, 0x0004  ; # of dwords to read from draw-indirect buffer
   ; load address of indirect buffer from cmdstream:
   cwrite $data, [$00 + 0x0b8], 0x8
   cwrite $data, [$00 + 0x0b9], 0x8
   ; set # of dwords to read:
   cwrite $07, [$00 + 0x0ba], 0x8
   ...
   ; read parameters from draw-indirect buffer:
   mov $09, $addr
   mov $07, $addr
   cread $12, [$00 + 0x040], 0x8
   ; the start parameter gets written into MEQ, which ME writes
   ; to VFD_INDEX_OFFSET register:
   mov $data, $addr


A6XX NOTES
==========

The ``$14`` register holds global flags, which are set or checked by the
following packets:

- ``CP_SKIP_IB2_ENABLE_LOCAL`` - b8
- ``CP_SKIP_IB2_ENABLE_GLOBAL`` - b9
- ``CP_SET_MARKER``

  - ``MODE=GMEM`` - sets b15
  - ``MODE=BLIT2D`` - clears b15, b12, b7

- ``CP_SET_MODE`` - b29+b30
- ``CP_SET_VISIBILITY_OVERRIDE`` - b11, b21, b30?
- ``CP_SET_DRAW_STATE`` - checks b29+b30

``CP_COND_REG_EXEC`` checks b10, which is possibly the predicate flag.
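
The ``$14`` bit assignments above can be collected into a small decoder.
This Python sketch encodes only the bits listed here (the labels are
informal, not field names from any register header) and models the two
``CP_SET_MARKER`` modes as the set/clear operations described. The function
names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

# Informal labels for the $14 bits listed above (the uncertain b30 use
# by CP_SET_VISIBILITY_OVERRIDE is not labeled).
FLAG_BITS = {
    7: "cleared by SET_MARKER(BLIT2D)",
    8: "CP_SKIP_IB2_ENABLE_LOCAL",
    9: "CP_SKIP_IB2_ENABLE_GLOBAL",
    10: "checked by CP_COND_REG_EXEC (predicate?)",
    11: "CP_SET_VISIBILITY_OVERRIDE",
    12: "cleared by SET_MARKER(BLIT2D)",
    15: "SET_MARKER MODE=GMEM",
    21: "CP_SET_VISIBILITY_OVERRIDE",
    29: "CP_SET_MODE / checked by CP_SET_DRAW_STATE",
    30: "CP_SET_MODE / checked by CP_SET_DRAW_STATE",
}


def set_marker_gmem(flags):
    """CP_SET_MARKER MODE=GMEM sets b15."""
    return (flags | (1 << 15)) & MASK32


def set_marker_blit2d(flags):
    """CP_SET_MARKER MODE=BLIT2D clears b15, b12 and b7."""
    return flags & ~((1 << 15) | (1 << 12) | (1 << 7)) & MASK32


def decode(flags):
    """List the known label for every set bit in a $14 value."""
    return [(bit, FLAG_BITS.get(bit, "unknown"))
            for bit in range(32) if flags & (1 << bit)]


flags = set_marker_gmem((1 << 12) | (1 << 8))
print(decode(flags))
flags = set_marker_blit2d(flags)
print(decode(flags))  # [(8, 'CP_SKIP_IB2_ENABLE_LOCAL')]
```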