1============================= 2User Guide for AMDGPU Backend 3============================= 4 5.. contents:: 6 :local: 7 8Introduction 9============ 10 11The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the 12R600 family up until the current GCN families. It lives in the 13``lib/Target/AMDGPU`` directory. 14 15LLVM 16==== 17 18.. _amdgpu-target-triples: 19 20Target Triples 21-------------- 22 23Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to 24specify the target triple: 25 26 .. table:: AMDGPU Architectures 27 :name: amdgpu-architecture-table 28 29 ============ ============================================================== 30 Architecture Description 31 ============ ============================================================== 32 ``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders. 33 ``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders. 34 ============ ============================================================== 35 36 .. table:: AMDGPU Vendors 37 :name: amdgpu-vendor-table 38 39 ============ ============================================================== 40 Vendor Description 41 ============ ============================================================== 42 ``amd`` Can be used for all AMD GPU usage. 43 ``mesa3d`` Can be used if the OS is ``mesa3d``. 44 ============ ============================================================== 45 46 .. table:: AMDGPU Operating Systems 47 :name: amdgpu-os-table 48 49 ============== ============================================================ 50 OS Description 51 ============== ============================================================ 52 *<empty>* Defaults to the *unknown* OS. 53 ``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes 54 such as AMD's ROCm [AMD-ROCm]_. 55 ``amdpal`` Graphic shaders and compute kernels executed on AMD PAL 56 runtime. 
57 ``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D 58 runtime. 59 ============== ============================================================ 60 61 .. table:: AMDGPU Environments 62 :name: amdgpu-environment-table 63 64 ============ ============================================================== 65 Environment Description 66 ============ ============================================================== 67 *<empty>* Default. 68 ============ ============================================================== 69 70.. _amdgpu-processors: 71 72Processors 73---------- 74 75Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The 76names from both the *Processor* and *Alternative Processor* can be used. 77 78 .. table:: AMDGPU Processors 79 :name: amdgpu-processor-table 80 81 =========== =============== ============ ===== ========= ======= ================== 82 Processor Alternative Target dGPU/ Target ROCm Example 83 Processor Triple APU Features Support Products 84 Architecture Supported 85 [Default] 86 =========== =============== ============ ===== ========= ======= ================== 87 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_ 88 ----------------------------------------------------------------------------------- 89 ``r600`` ``r600`` dGPU 90 ``r630`` ``r600`` dGPU 91 ``rs880`` ``r600`` dGPU 92 ``rv670`` ``r600`` dGPU 93 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_ 94 ----------------------------------------------------------------------------------- 95 ``rv710`` ``r600`` dGPU 96 ``rv730`` ``r600`` dGPU 97 ``rv770`` ``r600`` dGPU 98 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_ 99 ----------------------------------------------------------------------------------- 100 ``cedar`` ``r600`` dGPU 101 ``cypress`` ``r600`` dGPU 102 ``juniper`` ``r600`` dGPU 103 ``redwood`` ``r600`` dGPU 104 ``sumo`` ``r600`` dGPU 105 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_ 106 
----------------------------------------------------------------------------------- 107 ``barts`` ``r600`` dGPU 108 ``caicos`` ``r600`` dGPU 109 ``cayman`` ``r600`` dGPU 110 ``turks`` ``r600`` dGPU 111 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_ 112 ----------------------------------------------------------------------------------- 113 ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU 114 ``gfx601`` - ``hainan`` ``amdgcn`` dGPU 115 - ``oland`` 116 - ``pitcairn`` 117 - ``verde`` 118 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_ 119 ----------------------------------------------------------------------------------- 120 ``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000 121 - A6 Pro-7050B 122 - A8-7100 123 - A8 Pro-7150B 124 - A10-7300 125 - A10 Pro-7350B 126 - FX-7500 127 - A8-7200P 128 - A10-7400P 129 - FX-7600P 130 ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100 131 - FirePro W9100 132 - FirePro S9150 133 - FirePro S9170 134 ``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290 135 - Radeon R9 290x 136 - Radeon R390 137 - Radeon R390x 138 ``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100 139 - ``mullins`` - E1-2200 140 - E1-2500 141 - E2-3000 142 - E2-3800 143 - A4-5000 144 - A4-5100 145 - A6-5200 146 - A4 Pro-3340B 147 ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790 148 - Radeon HD 8770 149 - R7 260 150 - R7 260X 151 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_ 152 ----------------------------------------------------------------------------------- 153 ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P 154 [on] - Pro A6-8500B 155 - A8-8600P 156 - Pro A8-8600B 157 - FX-8800P 158 - Pro A12-8800B 159 \ ``amdgcn`` APU - xnack ROCm - A10-8700P 160 [on] - Pro A10-8700B 161 - A10-8780P 162 \ ``amdgcn`` APU - xnack - A10-9600P 163 [on] - A10-9630P 164 - A12-9700P 165 - A12-9730P 166 - FX-9800P 167 - FX-9830P 168 \ ``amdgcn`` APU - xnack - E2-9010 169 [on] - A6-9210 170 - A9-9410 171 ``gfx802`` - ``iceland`` ``amdgcn`` dGPU - xnack ROCm - 
FirePro S7150 172 - ``tonga`` [off] - FirePro S7100 173 - FirePro W7100 174 - Radeon R285 175 - Radeon R9 380 176 - Radeon R9 385 177 - Mobile FirePro 178 M7170 179 ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano 180 [off] - Radeon R9 Fury 181 - Radeon R9 FuryX 182 - Radeon Pro Duo 183 - FirePro S9300x2 184 - Radeon Instinct MI8 185 \ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470 186 [off] - Radeon RX 480 187 - Radeon Instinct MI6 188 \ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460 189 [off] 190 ``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack 191 [on] 192 **GCN GFX9** [AMD-GCN-GFX9]_ 193 ----------------------------------------------------------------------------------- 194 ``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega 195 [off] Frontier Edition 196 - Radeon RX Vega 56 197 - Radeon RX Vega 64 198 - Radeon RX Vega 64 199 Liquid 200 - Radeon Instinct MI25 201 ``gfx902`` ``amdgcn`` APU - xnack - Ryzen 3 2200G 202 [on] - Ryzen 5 2400G 203 ``gfx904`` ``amdgcn`` dGPU - xnack *TBA* 204 [off] 205 .. TODO 206 Add product 207 names. 208 ``gfx906`` ``amdgcn`` dGPU - xnack *TBA* 209 [off] 210 .. TODO 211 Add product 212 names. 213 =========== =============== ============ ===== ========= ======= ================== 214 215.. _amdgpu-target-features: 216 217Target Features 218--------------- 219 220Target features control how code is generated to support certain 221processor specific features. Not all target features are supported by 222all processors. The runtime must ensure that the features supported by 223the device used to execute the code match the features enabled when 224generating the code. A mismatch of features may result in incorrect 225execution, or a reduction in performance. 226 227The target features supported by each processor, and the default value 228used if not specified explicitly, is listed in 229:ref:`amdgpu-processor-table`. 

Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
target features.

For example:

``-mxnack``
  Enable the ``xnack`` feature.
``-mno-xnack``
  Disable the ``xnack`` feature.

  .. table:: AMDGPU Target Features
     :name: amdgpu-target-feature-table

     ============== ==================================================
     Target Feature Description
     ============== ==================================================
     -m[no-]xnack   Enable/disable generating code that has
                    memory clauses that are compatible with
                    having XNACK replay enabled.

                    This is used for demand paging and page
                    migration. If XNACK replay is enabled in
                    the device, then if a page fault occurs
                    the code may execute incorrectly if the
                    ``xnack`` feature is not enabled. Executing
                    code that has the feature enabled on a
                    device that does not have XNACK replay
                    enabled will execute correctly, but may
                    be less performant than code with the
                    feature disabled.
     ============== ==================================================

.. _amdgpu-address-spaces:

Address Spaces
--------------

The AMDGPU backend uses the following address space mappings.

The memory space names used in the table, aside from the region memory space,
are from the OpenCL standard.

The LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).

  .. table:: Address Space Mapping
     :name: amdgpu-address-space-mapping-table

     ================== =================
     LLVM Address Space Memory Space
     ================== =================
     0                  Generic (Flat)
     1                  Global
     2                  Region (GDS)
     3                  Local (group/LDS)
     4                  Constant
     5                  Private (Scratch)
     6                  Constant 32-bit
     ================== =================

..
_amdgpu-memory-scopes:

Memory Scopes
-------------

This section describes the LLVM memory synchronization scopes supported by the
AMDGPU backend memory model when the target triple OS is ``amdhsa`` (see
:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).

The memory model supported is based on the HSA memory model [HSA]_ which is
based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
relation is transitive over the synchronizes-with relation independent of scope,
and synchronizes-with allows the memory scope instances to be inclusive (see
table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).

This differs from the OpenCL [OpenCL]_ memory model, which does not have scope
inclusion and requires the memory scopes to match exactly. However, this
is conservatively correct for OpenCL.

  .. table:: AMDHSA LLVM Sync Scopes
     :name: amdgpu-amdhsa-llvm-sync-scopes-table

     ================ ==========================================================
     LLVM Sync Scope  Description
     ================ ==========================================================
     *none*           The default: ``system``.

                      Synchronizes with, and participates in modification and
                      seq_cst total orderings with, other operations (except
                      image operations) for all address spaces (except private,
                      or generic that accesses private) provided the other
                      operation's sync scope is:

                      - ``system``.
                      - ``agent`` and executed by a thread on the same agent.
                      - ``workgroup`` and executed by a thread in the same
                        workgroup.
                      - ``wavefront`` and executed by a thread in the same
                        wavefront.
329 330 ``agent`` Synchronizes with, and participates in modification and 331 seq_cst total orderings with, other operations (except 332 image operations) for all address spaces (except private, 333 or generic that accesses private) provided the other 334 operation's sync scope is: 335 336 - ``system`` or ``agent`` and executed by a thread on the 337 same agent. 338 - ``workgroup`` and executed by a thread in the same 339 workgroup. 340 - ``wavefront`` and executed by a thread in the same 341 wavefront. 342 343 ``workgroup`` Synchronizes with, and participates in modification and 344 seq_cst total orderings with, other operations (except 345 image operations) for all address spaces (except private, 346 or generic that accesses private) provided the other 347 operation's sync scope is: 348 349 - ``system``, ``agent`` or ``workgroup`` and executed by a 350 thread in the same workgroup. 351 - ``wavefront`` and executed by a thread in the same 352 wavefront. 353 354 ``wavefront`` Synchronizes with, and participates in modification and 355 seq_cst total orderings with, other operations (except 356 image operations) for all address spaces (except private, 357 or generic that accesses private) provided the other 358 operation's sync scope is: 359 360 - ``system``, ``agent``, ``workgroup`` or ``wavefront`` 361 and executed by a thread in the same wavefront. 362 363 ``singlethread`` Only synchronizes with, and participates in modification 364 and seq_cst total orderings with, other operations (except 365 image operations) running in the same thread for all 366 address spaces (for example, in signal handlers). 367 ================ ========================================================== 368 369AMDGPU Intrinsics 370----------------- 371 372The AMDGPU backend implements the following LLVM IR intrinsics. 373 374*This section is WIP.* 375 376.. 
TODO
   List AMDGPU intrinsics

AMDGPU Attributes
-----------------

The AMDGPU backend supports the following LLVM IR attributes.

  .. table:: AMDGPU LLVM IR Attributes
     :name: amdgpu-llvm-ir-attributes-table

     ======================================= ==========================================================
     LLVM Attribute                          Description
     ======================================= ==========================================================
     "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
                                             will be specified when the kernel is dispatched. Generated
                                             by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
     "amdgpu-implicitarg-num-bytes"="n"      Number of kernel argument bytes to add to the kernel
                                             argument block size for the implicit arguments. This
                                             varies by OS and language (for OpenCL see
                                             :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
     "amdgpu-max-work-group-size"="n"        Specify the maximum work-group size that will be specified
                                             when the kernel is dispatched.
     "amdgpu-num-sgpr"="n"                   Specifies the number of SGPRs to use. Generated by
                                             the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
     "amdgpu-num-vgpr"="n"                   Specifies the number of VGPRs to use. Generated by the
                                             ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
     "amdgpu-waves-per-eu"="m,n"             Specify the minimum and maximum number of waves per
                                             execution unit. Generated by the ``amdgpu_waves_per_eu``
                                             CLANG attribute [CLANG-ATTR]_.
     ======================================= ==========================================================

Code Object
===========

The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
can be linked by ``lld`` to produce a standard ELF shared code object which can
be loaded and executed on an AMDGPU target.
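
As a rough illustration of the ELF header fields described in the Header
section of this guide, the following Python sketch decodes the identification
bytes, ``e_machine`` and ``e_flags`` of a 64-bit little-endian code object. The
values of ``EM_AMDGPU``, ``ELFOSABI_AMDGPU_HSA``, ``EF_AMDGPU_MACH``,
``EF_AMDGPU_XNACK`` and ``EF_AMDGPU_MACH_AMDGCN_GFX902`` are taken from the
tables in this guide. The helper names ``describe_amdgpu_header`` and
``make_example_header`` are hypothetical and not part of LLVM, and the header
bytes are synthesized for the example rather than read from a real code object.

```python
import struct

# Values taken from the tables in this guide.
EM_AMDGPU = 224
ELFOSABI_AMDGPU_HSA = 64
EF_AMDGPU_MACH = 0x000000FF
EF_AMDGPU_XNACK = 0x00000100
EF_AMDGPU_MACH_AMDGCN_GFX902 = 0x02D

# Standard ELF constants.
ELFCLASS64 = 2
ELFDATA2LSB = 1
ET_DYN = 3

# Layout of the 64-byte Elf64_Ehdr, little-endian, no padding.
ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")


def describe_amdgpu_header(data):
    """Hypothetical helper: decode the header fields discussed in this guide."""
    if len(data) < ELF64_HEADER.size:
        raise ValueError("too short to hold an ELF64 header")
    fields = ELF64_HEADER.unpack_from(data)
    ident, e_type, e_machine, _e_version, e_entry = fields[:5]
    e_flags = fields[7]
    if ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if ident[4] != ELFCLASS64 or ident[5] != ELFDATA2LSB:
        raise ValueError("expected ELFCLASS64 and ELFDATA2LSB")
    if e_machine != EM_AMDGPU:
        raise ValueError("e_machine is not EM_AMDGPU")
    return {
        "osabi": ident[7],        # e_ident[EI_OSABI]
        "abiversion": ident[8],   # e_ident[EI_ABIVERSION]
        "type": e_type,
        "entry": e_entry,
        "mach": e_flags & EF_AMDGPU_MACH,
        "xnack": bool(e_flags & EF_AMDGPU_XNACK),
    }


def make_example_header():
    """Build a synthetic header for a gfx902 shared code object with xnack."""
    ident = (b"\x7fELF"
             + bytes([ELFCLASS64, ELFDATA2LSB, 1, ELFOSABI_AMDGPU_HSA, 1])
             + bytes(7))
    e_flags = EF_AMDGPU_MACH_AMDGCN_GFX902 | EF_AMDGPU_XNACK
    return ELF64_HEADER.pack(ident, ET_DYN, EM_AMDGPU, 1, 0, 0, 0,
                             e_flags, ELF64_HEADER.size, 0, 0, 0, 0, 0)


info = describe_amdgpu_header(make_example_header())
print(info)
```

A real tool would read the first 64 bytes of a code object file instead of
calling ``make_example_header``, and would typically rely on an ELF library
rather than unpacking the header by hand.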
414 415Header 416------ 417 418The AMDGPU backend uses the following ELF header: 419 420 .. table:: AMDGPU ELF Header 421 :name: amdgpu-elf-header-table 422 423 ========================== =============================== 424 Field Value 425 ========================== =============================== 426 ``e_ident[EI_CLASS]`` ``ELFCLASS64`` 427 ``e_ident[EI_DATA]`` ``ELFDATA2LSB`` 428 ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE`` 429 - ``ELFOSABI_AMDGPU_HSA`` 430 - ``ELFOSABI_AMDGPU_PAL`` 431 - ``ELFOSABI_AMDGPU_MESA3D`` 432 ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA`` 433 - ``ELFABIVERSION_AMDGPU_PAL`` 434 - ``ELFABIVERSION_AMDGPU_MESA3D`` 435 ``e_type`` - ``ET_REL`` 436 - ``ET_DYN`` 437 ``e_machine`` ``EM_AMDGPU`` 438 ``e_entry`` 0 439 ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table` 440 ========================== =============================== 441 442.. 443 444 .. table:: AMDGPU ELF Header Enumeration Values 445 :name: amdgpu-elf-header-enumeration-values-table 446 447 =============================== ===== 448 Name Value 449 =============================== ===== 450 ``EM_AMDGPU`` 224 451 ``ELFOSABI_NONE`` 0 452 ``ELFOSABI_AMDGPU_HSA`` 64 453 ``ELFOSABI_AMDGPU_PAL`` 65 454 ``ELFOSABI_AMDGPU_MESA3D`` 66 455 ``ELFABIVERSION_AMDGPU_HSA`` 1 456 ``ELFABIVERSION_AMDGPU_PAL`` 0 457 ``ELFABIVERSION_AMDGPU_MESA3D`` 0 458 =============================== ===== 459 460``e_ident[EI_CLASS]`` 461 The ELF class is: 462 463 * ``ELFCLASS32`` for ``r600`` architecture. 464 465 * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64 466 bit applications. 467 468``e_ident[EI_DATA]`` 469 All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering. 470 471``e_ident[EI_OSABI]`` 472 One of the following AMD GPU architecture specific OS ABIs 473 (see :ref:`amdgpu-os-table`): 474 475 * ``ELFOSABI_NONE`` for *unknown* OS. 476 477 * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS. 478 479 * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS. 

  * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS.

``e_ident[EI_ABIVERSION]``
  The ABI version of the AMD GPU architecture specific OS ABI to which the code
  object conforms:

  * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
    runtime ABI.

  * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
    runtime ABI.

  * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
    3D runtime ABI.

``e_type``
  Can be one of the following values:


  ``ET_REL``
    The type produced by the AMD GPU backend compiler as it is a relocatable
    code object.

  ``ET_DYN``
    The type produced by the linker as it is a shared code object.

  The AMD HSA runtime loader requires an ``ET_DYN`` code object.

``e_machine``
  The value ``EM_AMDGPU`` is used for the machine for all processors supported
  by the ``r600`` and ``amdgcn`` architectures (see
  :ref:`amdgpu-processor-table`). The specific processor is specified in the
  ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
  :ref:`amdgpu-elf-header-e_flags-table`).

``e_entry``
  The entry point is 0 as the entry points for individual kernels must be
  selected in order to invoke them through AQL packets.

``e_flags``
  The AMDGPU backend uses the following ELF header flags:

  .. table:: AMDGPU ELF Header ``e_flags``
     :name: amdgpu-elf-header-e_flags-table

     ================================= ========== =============================
     Name                              Value      Description
     ================================= ========== =============================
     **AMDGPU Processor Flag**                    See :ref:`amdgpu-processor-table`.
     -------------------------------------------- -----------------------------
     ``EF_AMDGPU_MACH``                0x000000ff AMDGPU processor selection
                                                  mask for
                                                  ``EF_AMDGPU_MACH_xxx`` values
                                                  defined in
                                                  :ref:`amdgpu-ef-amdgpu-mach-table`.
536 ``EF_AMDGPU_XNACK`` 0x00000100 Indicates if the ``xnack`` 537 target feature is 538 enabled for all code 539 contained in the code object. 540 If the processor 541 does not support the 542 ``xnack`` target 543 feature then must 544 be 0. 545 See 546 :ref:`amdgpu-target-features`. 547 ================================= ========== ============================= 548 549 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values 550 :name: amdgpu-ef-amdgpu-mach-table 551 552 ================================= ========== ============================= 553 Name Value Description (see 554 :ref:`amdgpu-processor-table`) 555 ================================= ========== ============================= 556 ``EF_AMDGPU_MACH_NONE`` 0x000 *not specified* 557 ``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600`` 558 ``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630`` 559 ``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880`` 560 ``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670`` 561 ``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710`` 562 ``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730`` 563 ``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770`` 564 ``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar`` 565 ``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress`` 566 ``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper`` 567 ``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood`` 568 ``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo`` 569 ``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts`` 570 ``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos`` 571 ``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman`` 572 ``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks`` 573 *reserved* 0x011 - Reserved for ``r600`` 574 0x01f architecture processors. 
575 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600`` 576 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601`` 577 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700`` 578 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701`` 579 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702`` 580 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703`` 581 ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704`` 582 *reserved* 0x027 Reserved. 583 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801`` 584 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802`` 585 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803`` 586 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810`` 587 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900`` 588 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902`` 589 ``EF_AMDGPU_MACH_AMDGCN_GFX904`` 0x02e ``gfx904`` 590 ``EF_AMDGPU_MACH_AMDGCN_GFX906`` 0x02f ``gfx906`` 591 *reserved* 0x030 Reserved. 592 ================================= ========== ============================= 593 594Sections 595-------- 596 597An AMDGPU target ELF code object has the standard ELF sections which include: 598 599 .. 
table:: AMDGPU ELF Sections
     :name: amdgpu-elf-sections-table

     ================== ================ =================================
     Name               Type             Attributes
     ================== ================ =================================
     ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
     ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
     ``.dynstr``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.dynsym``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
     ``.note``          ``SHT_NOTE``     *none*
     ``.rela``\ *name*  ``SHT_RELA``     *none*
     ``.rela.dyn``      ``SHT_RELA``     *none*
     ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.shstrtab``      ``SHT_STRTAB``   *none*
     ``.strtab``        ``SHT_STRTAB``   *none*
     ``.symtab``        ``SHT_SYMTAB``   *none*
     ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
     ================== ================ =================================

These sections have their standard meanings (see [ELF]_) and are only generated
if needed.

``.debug_``\ *\**
  The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
  DWARF produced by the AMDGPU backend.

``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
  The standard sections used by a dynamic loader.

``.note``
  See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
  backend.

``.rela``\ *name*, ``.rela.dyn``
  For relocatable code objects, *name* is the name of the section to which the
  relocation records apply. For example, ``.rela.text`` is the section name for
  relocation records associated with the ``.text`` section.

  For linked shared code objects, ``.rela.dyn`` contains all the relocation
  records from each of the relocatable code object's ``.rela``\ *name* sections.

  See :ref:`amdgpu-relocation-records` for the relocation records supported by
  the AMDGPU backend.

``.text``
  The executable machine code for the kernels and functions they call. Generated
  as position independent code. See :ref:`amdgpu-code-conventions` for
  information on conventions used in the ISA generation.

.. _amdgpu-note-records:

Note Records
------------

As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
aligned. In addition, minimal zero byte padding must be generated to ensure the
``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
``.note`` section must be at least 4 to indicate at least 4 byte alignment.

The AMDGPU backend code object uses the following ELF note records in the
``.note`` section. The *Description* column specifies the layout of the note
record's ``desc`` field. All fields are consecutive bytes. Note records with
variable size strings have a corresponding ``*_size`` field that specifies the
number of bytes, including the terminating null character, in the string. The
string(s) come immediately after the preceding fields.

Additional note records can be present.

  .. table:: AMDGPU ELF Note Records
     :name: amdgpu-elf-note-records-table

     ===== ============================== ======================================
     Name  Type                           Description
     ===== ============================== ======================================
     "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
     ===== ============================== ======================================

..

  ..
table:: AMDGPU ELF Note Record Enumeration Values
     :name: amdgpu-elf-note-record-enumeration-values-table

     ============================== =====
     Name                           Value
     ============================== =====
     *reserved*                     0-9
     ``NT_AMD_AMDGPU_HSA_METADATA`` 10
     *reserved*                     11
     ============================== =====

``NT_AMD_AMDGPU_HSA_METADATA``
  Specifies extensible metadata associated with the code objects executed on HSA
  [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
  the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
  :ref:`amdgpu-amdhsa-code-object-metadata` for the syntax of the code
  object metadata string.

.. _amdgpu-symbols:

Symbols
-------

Symbols include the following:

  .. table:: AMDGPU ELF Symbols
     :name: amdgpu-elf-symbols-table

     ===================== ============== ============= ==================
     Name                  Type           Section       Description
     ===================== ============== ============= ==================
     *link-name*           ``STT_OBJECT`` - ``.data``   Global variable
                                          - ``.rodata``
                                          - ``.bss``
     *link-name*\ ``.kd``  ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
     *link-name*           ``STT_FUNC``   - ``.text``   Kernel entry point
     ===================== ============== ============= ==================

Global variable
  Global variables both used and defined by the compilation unit.

  If the symbol is defined in the compilation unit then it is allocated in the
  appropriate section according to whether it has initialized data or is
  read-only.

  If the symbol is external then its section is ``SHN_UNDEF`` and the loader
  will resolve relocations using the definition provided by another code object
  or explicitly defined by the runtime.

  All global symbols, whether defined in the compilation unit or external, are
  accessed by the machine code indirectly through a GOT table entry.
  This allows them to be preemptable. The GOT table is only supported when the
  target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).

  .. TODO
     Add description of linked shared object symbols. Seems undefined symbols
     are marked as STT_NOTYPE.

Kernel descriptor
  Every HSA kernel has an associated kernel descriptor. It is the address of the
  kernel descriptor that is used in the AQL dispatch packet used to invoke the
  kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
  defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.

Kernel entry point
  Every HSA kernel also has a symbol for its machine code entry point.

.. _amdgpu-relocation-records:

Relocation Records
------------------

The AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
relocatable fields are:

``word32``
  This specifies a 32-bit field occupying 4 bytes with arbitrary byte
  alignment. These values use the same byte order as other word values in the
  AMD GPU architecture.

``word64``
  This specifies a 64-bit field occupying 8 bytes with arbitrary byte
  alignment. These values use the same byte order as other word values in the
  AMD GPU architecture.

The following notations are used for specifying relocation calculations:

**A**
  Represents the addend used to compute the value of the relocatable field.

**G**
  Represents the offset into the global offset table at which the relocation
  entry's symbol will reside during execution.

**GOT**
  Represents the address of the global offset table.

**P**
  Represents the place (section offset for ``ET_REL`` or address for ``ET_DYN``)
  of the storage unit being relocated (computed using ``r_offset``).

**S**
  Represents the value of the symbol whose index resides in the relocation
  entry.
Relocations not using this must specify a symbol index of ``STN_UNDEF``. 787 788**B** 789 Represents the base address of a loaded executable or shared object which is 790 the difference between the ELF address and the actual load address. Relocations 791 using this are only valid in executable or shared objects. 792 793The following relocation types are supported: 794 795 .. table:: AMDGPU ELF Relocation Records 796 :name: amdgpu-elf-relocation-records-table 797 798 ========================== ======= ===== ========== ============================== 799 Relocation Type Kind Value Field Calculation 800 ========================== ======= ===== ========== ============================== 801 ``R_AMDGPU_NONE`` 0 *none* *none* 802 ``R_AMDGPU_ABS32_LO`` Static, 1 ``word32`` (S + A) & 0xFFFFFFFF 803 Dynamic 804 ``R_AMDGPU_ABS32_HI`` Static, 2 ``word32`` (S + A) >> 32 805 Dynamic 806 ``R_AMDGPU_ABS64`` Static, 3 ``word64`` S + A 807 Dynamic 808 ``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P 809 ``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P 810 ``R_AMDGPU_ABS32`` Static, 6 ``word32`` S + A 811 Dynamic 812 ``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P 813 ``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF 814 ``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32 815 ``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF 816 ``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32 817 *reserved* 12 818 ``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A 819 ========================== ======= ===== ========== ============================== 820 821``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by 822the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``. 823 824There is no current OS loader support for 32 bit programs and so 825``R_AMDGPU_ABS32`` is not used. 826 827.. _amdgpu-dwarf: 828 829DWARF 830----- 831 832Standard DWARF [DWARF]_ Version 5 sections can be generated. 
These contain
information that maps the code object executable code and data to the source
language constructs. It can be used by tools such as debuggers and profilers.

Address Space Mapping
~~~~~~~~~~~~~~~~~~~~~

The following address space mapping is used:

  .. table:: AMDGPU DWARF Address Space Mapping
     :name: amdgpu-dwarf-address-space-mapping-table

     =================== =================
     DWARF Address Space Memory Space
     =================== =================
     1                   Private (Scratch)
     2                   Local (group/LDS)
     *omitted*           Global
     *omitted*           Constant
     *omitted*           Generic (Flat)
     *not supported*     Region (GDS)
     =================== =================

See :ref:`amdgpu-address-spaces` for information on the memory space terminology
used in the table.

An ``address_class`` attribute is generated on pointer type DIEs to specify the
DWARF address space of the value of the pointer when it is in the *private* or
*local* address space. Otherwise the attribute is omitted.

An ``XDEREF`` operation is generated in location list expressions for variables
that are allocated in the *private* or *local* address space. Otherwise no
``XDEREF`` is generated.

Register Mapping
~~~~~~~~~~~~~~~~

*This section is WIP.*

.. TODO
   Define DWARF register enumeration.

   If want to present a wavefront state then should expose vector registers as
   64 wide (rather than per work-item view that LLVM uses). Either as separate
   registers, or a 64x4 byte single register. In either case use a new LANE op
   (akin to XDEREF) to select the current lane usage in a location
   expression. This would also allow scalar register spilling to vector register
   lanes to be expressed (currently no debug information is being generated for
   spilling).
   If choose a wide single register approach then use LANE in
   conjunction with PIECE operation to select the dword part of the register for
   the current lane. If the separate register approach then use LANE to select
   the register.

Source Text
~~~~~~~~~~~

Source text for online-compiled programs (e.g. those compiled by the OpenCL
runtime) may be embedded into the DWARF v5 line table using the ``clang
-gembed-source`` option, described in table :ref:`amdgpu-debug-options`.

For example:

``-gembed-source``
  Enable the embedded source DWARF v5 extension.
``-gno-embed-source``
  Disable the embedded source DWARF v5 extension.

  .. table:: AMDGPU Debug Options
     :name: amdgpu-debug-options

     ==================== ==================================================
     Debug Flag           Description
     ==================== ==================================================
     -g[no-]embed-source  Enable/disable embedding source text in DWARF
                          debug sections. Useful for environments where
                          source cannot be written to disk, such as
                          when performing online compilation.
     ==================== ==================================================

This option enables one extended content type in the DWARF v5 Line Number
Program Header, which is used to encode embedded source.

  .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types
     :name: amdgpu-dwarf-extended-content-types

     ============================ ======================
     Content Type                 Form
     ============================ ======================
     ``DW_LNCT_LLVM_source``      ``DW_FORM_line_strp``
     ============================ ======================

The source field will contain the UTF-8 encoded, null-terminated source text
with ``'\n'`` line endings. When the source field is present, consumers can use
the embedded source instead of attempting to discover the source on disk.
When the source field is absent, consumers can access the file to get the
source text.

The above content type appears in the ``file_name_entry_format`` field of the
line table prologue, and its corresponding value appears in the ``file_names``
field. The current encoding of the content type is documented in table
:ref:`amdgpu-dwarf-extended-content-types-encoding`.

  .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding
     :name: amdgpu-dwarf-extended-content-types-encoding

     ============================ ====================
     Content Type                 Value
     ============================ ====================
     ``DW_LNCT_LLVM_source``      0x2001
     ============================ ====================

.. _amdgpu-code-conventions:

Code Conventions
================

This section provides code conventions used for each supported target triple OS
(see :ref:`amdgpu-target-triples`).

AMDHSA
------

This section provides code conventions used when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`).

.. _amdgpu-amdhsa-code-object-target-identification:

Code Object Target Identification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The AMDHSA OS uses the following syntax to specify the code object
target as a single string:

  ``<Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>``

Where:

  - ``<Architecture>``, ``<Vendor>``, ``<OS>`` and ``<Environment>``
    are the same as the *Target Triple* (see
    :ref:`amdgpu-target-triples`).

  - ``<Processor>`` is the same as the *Processor* (see
    :ref:`amdgpu-processors`).

  - ``<Target Features>`` is a list of the enabled *Target Features*
    (see :ref:`amdgpu-target-features`), each prefixed by a plus, that
    apply to *Processor*. The list must be in the same order as listed
    in the table :ref:`amdgpu-target-feature-table`.
    Note that *Target Features* must be included in the list if they are
    enabled even if that is the default for *Processor*.

For example:

  ``"amdgcn-amd-amdhsa--gfx902+xnack"``

.. _amdgpu-amdhsa-code-object-metadata:

Code Object Metadata
~~~~~~~~~~~~~~~~~~~~

The code object metadata specifies extensible metadata associated with the code
objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
information necessary to support the ROCm kernel queries, for example, the
segment sizes needed in a dispatch packet. In addition, a high level language
runtime may require other information to be included. For example, the AMD
OpenCL runtime records kernel argument information.

The metadata is specified as a YAML formatted string (see [YAML]_ and
:doc:`YamlIO`).

.. TODO
   Is the string null terminated? It probably should not if YAML allows it to
   contain null characters, otherwise it should be.

The metadata is represented as a single YAML document consisting of the mapping
defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
referenced tables.

For boolean values, the string values of ``false`` and ``true`` are used for
false and true respectively.

Additional information can be added to the mappings. To avoid conflicts, any
non-AMD key names should be prefixed by "*vendor-name*.".

  .. table:: AMDHSA Code Object Metadata Mapping
     :name: amdgpu-amdhsa-code-object-metadata-mapping-table

     ========== ============== ========= =======================================
     String Key Value Type     Required?
Description 1024 ========== ============== ========= ======================================= 1025 "Version" sequence of Required - The first integer is the major 1026 2 integers version. Currently 1. 1027 - The second integer is the minor 1028 version. Currently 0. 1029 "Printf" sequence of Each string is encoded information 1030 strings about a printf function call. The 1031 encoded information is organized as 1032 fields separated by colon (':'): 1033 1034 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString`` 1035 1036 where: 1037 1038 ``ID`` 1039 A 32 bit integer as a unique id for 1040 each printf function call 1041 1042 ``N`` 1043 A 32 bit integer equal to the number 1044 of arguments of printf function call 1045 minus 1 1046 1047 ``S[i]`` (where i = 0, 1, ... , N-1) 1048 32 bit integers for the size in bytes 1049 of the i-th FormatString argument of 1050 the printf function call 1051 1052 FormatString 1053 The format string passed to the 1054 printf function call. 1055 "Kernels" sequence of Required Sequence of the mappings for each 1056 mapping kernel in the code object. See 1057 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table` 1058 for the definition of the mapping. 1059 ========== ============== ========= ======================================= 1060 1061.. 1062 1063 .. table:: AMDHSA Code Object Kernel Metadata Mapping 1064 :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table 1065 1066 ================= ============== ========= ================================ 1067 String Key Value Type Required? Description 1068 ================= ============== ========= ================================ 1069 "Name" string Required Source name of the kernel. 1070 "SymbolName" string Required Name of the kernel 1071 descriptor ELF symbol. 1072 "Language" string Source language of the kernel. 
1073 Values include: 1074 1075 - "OpenCL C" 1076 - "OpenCL C++" 1077 - "HCC" 1078 - "OpenMP" 1079 1080 "LanguageVersion" sequence of - The first integer is the major 1081 2 integers version. 1082 - The second integer is the 1083 minor version. 1084 "Attrs" mapping Mapping of kernel attributes. 1085 See 1086 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table` 1087 for the mapping definition. 1088 "Args" sequence of Sequence of mappings of the 1089 mapping kernel arguments. See 1090 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table` 1091 for the definition of the mapping. 1092 "CodeProps" mapping Mapping of properties related to 1093 the kernel code. See 1094 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table` 1095 for the mapping definition. 1096 ================= ============== ========= ================================ 1097 1098.. 1099 1100 .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping 1101 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table 1102 1103 =================== ============== ========= ============================== 1104 String Key Value Type Required? Description 1105 =================== ============== ========= ============================== 1106 "ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values 1107 3 integers must be >=1 and the dispatch 1108 work-group size X, Y, Z must 1109 correspond to the specified 1110 values. Defaults to 0, 0, 0. 1111 1112 Corresponds to the OpenCL 1113 ``reqd_work_group_size`` 1114 attribute. 1115 "WorkGroupSizeHint" sequence of The dispatch work-group size 1116 3 integers X, Y, Z is likely to be the 1117 specified values. 1118 1119 Corresponds to the OpenCL 1120 ``work_group_size_hint`` 1121 attribute. 1122 "VecTypeHint" string The name of a scalar or vector 1123 type. 1124 1125 Corresponds to the OpenCL 1126 ``vec_type_hint`` attribute. 
1127 1128 "RuntimeHandle" string The external symbol name 1129 associated with a kernel. 1130 OpenCL runtime allocates a 1131 global buffer for the symbol 1132 and saves the kernel's address 1133 to it, which is used for 1134 device side enqueueing. Only 1135 available for device side 1136 enqueued kernels. 1137 =================== ============== ========= ============================== 1138 1139.. 1140 1141 .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping 1142 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table 1143 1144 ================= ============== ========= ================================ 1145 String Key Value Type Required? Description 1146 ================= ============== ========= ================================ 1147 "Name" string Kernel argument name. 1148 "TypeName" string Kernel argument type name. 1149 "Size" integer Required Kernel argument size in bytes. 1150 "Align" integer Required Kernel argument alignment in 1151 bytes. Must be a power of two. 1152 "ValueKind" string Required Kernel argument kind that 1153 specifies how to set up the 1154 corresponding argument. 1155 Values include: 1156 1157 "ByValue" 1158 The argument is copied 1159 directly into the kernarg. 1160 1161 "GlobalBuffer" 1162 A global address space pointer 1163 to the buffer data is passed 1164 in the kernarg. 1165 1166 "DynamicSharedPointer" 1167 A group address space pointer 1168 to dynamically allocated LDS 1169 is passed in the kernarg. 1170 1171 "Sampler" 1172 A global address space 1173 pointer to a S# is passed in 1174 the kernarg. 1175 1176 "Image" 1177 A global address space 1178 pointer to a T# is passed in 1179 the kernarg. 1180 1181 "Pipe" 1182 A global address space pointer 1183 to an OpenCL pipe is passed in 1184 the kernarg. 1185 1186 "Queue" 1187 A global address space pointer 1188 to an OpenCL device enqueue 1189 queue is passed in the 1190 kernarg. 
1191 1192 "HiddenGlobalOffsetX" 1193 The OpenCL grid dispatch 1194 global offset for the X 1195 dimension is passed in the 1196 kernarg. 1197 1198 "HiddenGlobalOffsetY" 1199 The OpenCL grid dispatch 1200 global offset for the Y 1201 dimension is passed in the 1202 kernarg. 1203 1204 "HiddenGlobalOffsetZ" 1205 The OpenCL grid dispatch 1206 global offset for the Z 1207 dimension is passed in the 1208 kernarg. 1209 1210 "HiddenNone" 1211 An argument that is not used 1212 by the kernel. Space needs to 1213 be left for it, but it does 1214 not need to be set up. 1215 1216 "HiddenPrintfBuffer" 1217 A global address space pointer 1218 to the runtime printf buffer 1219 is passed in kernarg. 1220 1221 "HiddenDefaultQueue" 1222 A global address space pointer 1223 to the OpenCL device enqueue 1224 queue that should be used by 1225 the kernel by default is 1226 passed in the kernarg. 1227 1228 "HiddenCompletionAction" 1229 A global address space pointer 1230 to help link enqueued kernels into 1231 the ancestor tree for determining 1232 when the parent kernel has finished. 1233 1234 "ValueType" string Required Kernel argument value type. Only 1235 present if "ValueKind" is 1236 "ByValue". For vector data 1237 types, the value is for the 1238 element type. Values include: 1239 1240 - "Struct" 1241 - "I8" 1242 - "U8" 1243 - "I16" 1244 - "U16" 1245 - "F16" 1246 - "I32" 1247 - "U32" 1248 - "F32" 1249 - "I64" 1250 - "U64" 1251 - "F64" 1252 1253 .. TODO 1254 How can it be determined if a 1255 vector type, and what size 1256 vector? 1257 "PointeeAlign" integer Alignment in bytes of pointee 1258 type for pointer type kernel 1259 argument. Must be a power 1260 of 2. Only present if 1261 "ValueKind" is 1262 "DynamicSharedPointer". 1263 "AddrSpaceQual" string Kernel argument address space 1264 qualifier. Only present if 1265 "ValueKind" is "GlobalBuffer" or 1266 "DynamicSharedPointer". 
                                                Values are:

                                                - "Private"
                                                - "Global"
                                                - "Constant"
                                                - "Local"
                                                - "Generic"
                                                - "Region"

                                                .. TODO
                                                   Is GlobalBuffer only Global
                                                   or Constant? Is
                                                   DynamicSharedPointer always
                                                   Local? Can HCC allow Generic?
                                                   How can Private or Region
                                                   ever happen?
     "AccQual"         string                   Kernel argument access
                                                qualifier. Only present if
                                                "ValueKind" is "Image" or
                                                "Pipe". Values
                                                are:

                                                - "ReadOnly"
                                                - "WriteOnly"
                                                - "ReadWrite"

                                                .. TODO
                                                   Does this apply to
                                                   GlobalBuffer?
     "ActualAccQual"   string                   The actual memory accesses
                                                performed by the kernel on the
                                                kernel argument. Only present if
                                                "ValueKind" is "GlobalBuffer",
                                                "Image", or "Pipe". This may be
                                                more restrictive than indicated
                                                by "AccQual" to reflect what the
                                                kernel actually does. If not
                                                present then the runtime must
                                                assume what is implied by
                                                "AccQual" and "IsConst". Values
                                                are:

                                                - "ReadOnly"
                                                - "WriteOnly"
                                                - "ReadWrite"

     "IsConst"         boolean                  Indicates if the kernel argument
                                                is const qualified. Only present
                                                if "ValueKind" is
                                                "GlobalBuffer".

     "IsRestrict"      boolean                  Indicates if the kernel argument
                                                is restrict qualified. Only
                                                present if "ValueKind" is
                                                "GlobalBuffer".

     "IsVolatile"      boolean                  Indicates if the kernel argument
                                                is volatile qualified. Only
                                                present if "ValueKind" is
                                                "GlobalBuffer".

     "IsPipe"          boolean                  Indicates if the kernel argument
                                                is pipe qualified. Only present
                                                if "ValueKind" is "Pipe".

                                                .. TODO
                                                   Can GlobalBuffer be pipe
                                                   qualified?
     ================= ============== ========= ================================

..

  ..
table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping 1340 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table 1341 1342 ============================ ============== ========= ===================== 1343 String Key Value Type Required? Description 1344 ============================ ============== ========= ===================== 1345 "KernargSegmentSize" integer Required The size in bytes of 1346 the kernarg segment 1347 that holds the values 1348 of the arguments to 1349 the kernel. 1350 "GroupSegmentFixedSize" integer Required The amount of group 1351 segment memory 1352 required by a 1353 work-group in 1354 bytes. This does not 1355 include any 1356 dynamically allocated 1357 group segment memory 1358 that may be added 1359 when the kernel is 1360 dispatched. 1361 "PrivateSegmentFixedSize" integer Required The amount of fixed 1362 private address space 1363 memory required for a 1364 work-item in 1365 bytes. If the kernel 1366 uses a dynamic call 1367 stack then additional 1368 space must be added 1369 to this value for the 1370 call stack. 1371 "KernargSegmentAlign" integer Required The maximum byte 1372 alignment of 1373 arguments in the 1374 kernarg segment. Must 1375 be a power of 2. 1376 "WavefrontSize" integer Required Wavefront size. Must 1377 be a power of 2. 1378 "NumSGPRs" integer Required Number of scalar 1379 registers used by a 1380 wavefront for 1381 GFX6-GFX9. This 1382 includes the special 1383 SGPRs for VCC, Flat 1384 Scratch (GFX7-GFX9) 1385 and XNACK (for 1386 GFX8-GFX9). It does 1387 not include the 16 1388 SGPR added if a trap 1389 handler is 1390 enabled. It is not 1391 rounded up to the 1392 allocation 1393 granularity. 1394 "NumVGPRs" integer Required Number of vector 1395 registers used by 1396 each work-item for 1397 GFX6-GFX9 1398 "MaxFlatWorkGroupSize" integer Required Maximum flat 1399 work-group size 1400 supported by the 1401 kernel in work-items. 
1402 Must be >=1 and 1403 consistent with 1404 ReqdWorkGroupSize if 1405 not 0, 0, 0. 1406 "NumSpilledSGPRs" integer Number of stores from 1407 a scalar register to 1408 a register allocator 1409 created spill 1410 location. 1411 "NumSpilledVGPRs" integer Number of stores from 1412 a vector register to 1413 a register allocator 1414 created spill 1415 location. 1416 ============================ ============== ========= ===================== 1417 1418.. 1419 1420Kernel Dispatch 1421~~~~~~~~~~~~~~~ 1422 1423The HSA architected queuing language (AQL) defines a user space memory interface 1424that can be used to control the dispatch of kernels, in an agent independent 1425way. An agent can have zero or more AQL queues created for it using the ROCm 1426runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the 1427*HSA Platform System Architecture Specification* [HSA]_ for the AQL queue 1428mechanics and packet layouts. 1429 1430The packet processor of a kernel agent is responsible for detecting and 1431dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the 1432packet processor is implemented by the hardware command processor (CP), 1433asynchronous dispatch controller (ADC) and shader processor input controller 1434(SPI). 1435 1436The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel 1437mode driver to initialize and register the AQL queue with CP. 1438 1439To dispatch a kernel the following actions are performed. This can occur in the 1440CPU host program, or from an HSA kernel executing on a GPU. 1441 14421. A pointer to an AQL queue for the kernel agent on which the kernel is to be 1443 executed is obtained. 14442. A pointer to the kernel descriptor (see 1445 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is 1446 obtained. 
   It must be for a kernel that is contained in a code object that
   was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
   associated.
3. Space is allocated for the kernel arguments using the ROCm runtime allocator
   for a memory region with the kernarg property for the kernel agent that will
   execute the kernel. It must be at least 16 byte aligned.
4. Kernel argument values are assigned to the kernel argument memory
   allocation. The layout is defined in the *HSA Programmer's Language Reference*
   [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
   memory in the same way constant memory is accessed. (Note that the HSA
   specification allows an implementation to copy the kernel argument contents to
   another location that is accessed by the kernel.)
5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
   API uses 64 bit atomic operations to reserve space in the AQL queue for the
   packet. The packet must be set up, and the final write must use an atomic
   store release to set the packet kind to ensure the packet contents are
   visible to the kernel agent. AQL defines a doorbell signal mechanism to
   notify the kernel agent that the AQL queue has been updated. These rules, and
   the layout of the AQL queue and kernel dispatch packet, are defined in the
   *HSA System Architecture Specification* [HSA]_.
6. A kernel dispatch packet includes information about the actual dispatch,
   such as grid and work-group size, together with information from the code
   object about the kernel, such as segment sizes. The ROCm runtime queries on
   the kernel symbol can be used to obtain the code object values which are
   recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
7. CP executes micro-code and is responsible for detecting and setting up the
   GPU to execute the wavefronts of a kernel dispatch.
8.
   CP ensures that when a wavefront starts executing the kernel machine
   code, the scalar general purpose registers (SGPR) and vector general purpose
   registers (VGPR) are set up as required by the machine code. The required
   setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
   register state is defined in
   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
9. The prolog of the kernel machine code (see
   :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
   before continuing executing the machine code that corresponds to the kernel.
10. When the kernel dispatch has completed execution, CP signals the completion
    signal specified in the kernel dispatch packet if not 0.

.. _amdgpu-amdhsa-memory-spaces:

Memory Spaces
~~~~~~~~~~~~~

The memory space properties are:

  .. table:: AMDHSA Memory Spaces
     :name: amdgpu-amdhsa-memory-spaces-table

     ================= =========== ======== ======= ==================
     Memory Space Name HSA Segment Hardware Address NULL Value
                       Name        Name     Size
     ================= =========== ======== ======= ==================
     Private           private     scratch  32      0x00000000
     Local             group       LDS      32      0xFFFFFFFF
     Global            global      global   64      0x0000000000000000
     Constant          constant    *same as 64      0x0000000000000000
                                   global*
     Generic           flat        flat     64      0x0000000000000000
     Region            N/A         GDS      32      *not implemented
                                                    for AMDHSA*
     ================= =========== ======== ======= ==================

The global and constant memory spaces both use global virtual addresses, which
are the same virtual address space used by the CPU. However, some virtual
addresses may only be accessible to the CPU, some only accessible by the GPU,
and some by both.

Using the constant memory space indicates that the data will not change during
the execution of the kernel.
This allows scalar read instructions to be used. Volatile data is invalidated
in the vector and scalar L1 caches before each kernel dispatch execution to
allow constant memory to change values between kernel dispatches.

The local memory space uses the hardware Local Data Store (LDS) which is
automatically allocated when the hardware creates work-groups of wavefronts, and
freed when all the wavefronts of a work-group have terminated. The data store
(DS) instructions can be used to access it.

The private memory space uses the hardware scratch memory support. If the kernel
uses scratch, then the hardware allocates memory that is accessed using
wavefront lane dword (4 byte) interleaving. The mapping used from private
address to physical address is:

  ``wavefront-scratch-base +
  (private-address * wavefront-size * 4) +
  (wavefront-lane-id * 4)``

There are different ways that the wavefront scratch base address is determined
by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
memory can be accessed in an interleaved manner using buffer instructions with
the scratch buffer descriptor and per wavefront scratch offset, by the scratch
instructions, or by flat instructions. If each lane of a wavefront accesses the
same private address, the interleaving results in adjacent dwords being accessed
and hence requires fewer cache lines to be fetched. Multi-dword access is not
supported except by flat and scratch instructions in GFX9.

The generic address space uses the hardware flat address support available in
GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
local apertures), that are outside the range of addressable global memory, to
map from a flat address to a private or local address.
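
The private address mapping given above can be sketched in a few lines of
Python. This is only an illustration, not code from the backend or the ROCm
runtime: the function name, the default wavefront size of 64 and the base
address are assumptions chosen for the example, and the formula is evaluated
exactly as written, so each step of ``private_address`` advances by one dword
for every lane of the wavefront.

```python
DWORD_BYTES = 4  # scratch is interleaved between lanes in 4 byte units


def scratch_physical_address(wavefront_scratch_base, private_address,
                             wavefront_lane_id, wavefront_size=64):
    """Evaluate the documented mapping formula term by term.

    The name and the default wavefront_size are illustrative assumptions.
    """
    return (wavefront_scratch_base
            + private_address * wavefront_size * DWORD_BYTES
            + wavefront_lane_id * DWORD_BYTES)


# Hypothetical scratch base, used only to show the interleaving.
base = 0x1000

# Four lanes reading the same private address land on adjacent dwords.
lane_addresses = [scratch_physical_address(base, 0, lane) for lane in range(4)]
print([hex(address) for address in lane_addresses])
```

Lanes that use the same private address differ only in the final term, so
their physical addresses are 4 bytes apart, which is why such accesses touch
fewer cache lines.
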

FLAT instructions can take a flat address and access global, private (scratch)
and group (LDS) memory depending on whether the address is within one of the
aperture ranges. Flat access to scratch requires hardware aperture setup and
setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
(see :ref:`amdgpu-amdhsa-m0`).

To convert between a segment address and a flat address the base addresses of
the apertures can be used. For GFX7-GFX8 these are available in the
:ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
GFX9 the aperture base addresses are directly available as inline constant
registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32
which makes it easier to convert from flat to segment or segment to flat.

Image and Samplers
~~~~~~~~~~~~~~~~~~

Image and sampler handles created by the ROCm runtime are 64 bit addresses of a
hardware 32 byte V# and 48 byte S# object respectively. In order to support the
HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
enumeration values for the queries that are not trivially deducible from the S#
representation.

HSA Signals
~~~~~~~~~~~

HSA signal handles created by the ROCm runtime are 64 bit addresses of a
structure allocated in memory accessible from both the CPU and GPU. The
structure is defined by the ROCm runtime and subject to change between releases
(see [AMD-ROCm-github]_).

..
_amdgpu-amdhsa-hsa-aql-queue: 1582 1583HSA AQL Queue 1584~~~~~~~~~~~~~ 1585 1586The HSA AQL queue structure is defined by the ROCm runtime and subject to change 1587between releases (see [AMD-ROCm-github]_). For some processors it contains 1588fields needed to implement certain language features such as the flat address 1589aperture bases. It also contains fields used by CP such as managing the 1590allocation of scratch memory. 1591 1592.. _amdgpu-amdhsa-kernel-descriptor: 1593 1594Kernel Descriptor 1595~~~~~~~~~~~~~~~~~ 1596 1597A kernel descriptor consists of the information needed by CP to initiate the 1598execution of a kernel, including the entry point address of the machine code 1599that implements the kernel. 1600 1601Kernel Descriptor for GFX6-GFX9 1602+++++++++++++++++++++++++++++++ 1603 1604CP microcode requires the Kernel descriptor to be allocated on 64 byte 1605alignment. 1606 1607 .. table:: Kernel Descriptor for GFX6-GFX9 1608 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table 1609 1610 ======= ======= =============================== ============================ 1611 Bits Size Field Name Description 1612 ======= ======= =============================== ============================ 1613 31:0 4 bytes GROUP_SEGMENT_FIXED_SIZE The amount of fixed local 1614 address space memory 1615 required for a work-group 1616 in bytes. This does not 1617 include any dynamically 1618 allocated local address 1619 space memory that may be 1620 added when the kernel is 1621 dispatched. 1622 63:32 4 bytes PRIVATE_SEGMENT_FIXED_SIZE The amount of fixed 1623 private address space 1624 memory required for a 1625 work-item in bytes. If 1626 is_dynamic_callstack is 1 1627 then additional space must 1628 be added to this value for 1629 the call stack. 1630 127:64 8 bytes Reserved, must be 0. 
1631 191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET Byte offset (possibly 1632 negative) from base 1633 address of kernel 1634 descriptor to kernel's 1635 entry point instruction 1636 which must be 256 byte 1637 aligned. 1638 383:192 24 Reserved, must be 0. 1639 bytes 1640 415:384 4 bytes COMPUTE_PGM_RSRC1 Compute Shader (CS) 1641 program settings used by 1642 CP to set up 1643 ``COMPUTE_PGM_RSRC1`` 1644 configuration 1645 register. See 1646 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. 1647 447:416 4 bytes COMPUTE_PGM_RSRC2 Compute Shader (CS) 1648 program settings used by 1649 CP to set up 1650 ``COMPUTE_PGM_RSRC2`` 1651 configuration 1652 register. See 1653 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. 1654 448 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the 1655 _BUFFER SGPR user data registers 1656 (see 1657 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). 1658 1659 The total number of SGPR 1660 user data registers 1661 requested must not exceed 1662 16 and match value in 1663 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``. 1664 Any requests beyond 16 1665 will be ignored. 1666 449 1 bit ENABLE_SGPR_DISPATCH_PTR *see above* 1667 450 1 bit ENABLE_SGPR_QUEUE_PTR *see above* 1668 451 1 bit ENABLE_SGPR_KERNARG_SEGMENT_PTR *see above* 1669 452 1 bit ENABLE_SGPR_DISPATCH_ID *see above* 1670 453 1 bit ENABLE_SGPR_FLAT_SCRATCH_INIT *see above* 1671 454 1 bit ENABLE_SGPR_PRIVATE_SEGMENT *see above* 1672 _SIZE 1673 455 1 bit Reserved, must be 0. 1674 511:456 8 bytes Reserved, must be 0. 1675 512 **Total size 64 bytes.** 1676 ======= ==================================================================== 1677 1678.. 1679 1680 .. 
table:: compute_pgm_rsrc1 for GFX6-GFX9 1681 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table 1682 1683 ======= ======= =============================== =========================================================================== 1684 Bits Size Field Name Description 1685 ======= ======= =============================== =========================================================================== 1686 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector register 1687 blocks used by each work-item; 1688 granularity is device 1689 specific: 1690 1691 GFX6-GFX9 1692 - vgprs_used 0..256 1693 - max(0, ceil(vgprs_used / 4) - 1) 1694 1695 Where vgprs_used is defined 1696 as the highest VGPR number 1697 explicitly referenced plus 1698 one. 1699 1700 Used by CP to set up 1701 ``COMPUTE_PGM_RSRC1.VGPRS``. 1702 1703 The 1704 :ref:`amdgpu-assembler` 1705 calculates this 1706 automatically for the 1707 selected processor from 1708 values provided to the 1709 `.amdhsa_kernel` directive 1710 by the 1711 `.amdhsa_next_free_vgpr` 1712 nested directive (see 1713 :ref:`amdhsa-kernel-directives-table`). 1714 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register 1715 blocks used by a wavefront; 1716 granularity is device 1717 specific: 1718 1719 GFX6-GFX8 1720 - sgprs_used 0..112 1721 - max(0, ceil(sgprs_used / 8) - 1) 1722 GFX9 1723 - sgprs_used 0..112 1724 - 2 * max(0, ceil(sgprs_used / 16) - 1) 1725 1726 Where sgprs_used is 1727 defined as the highest 1728 SGPR number explicitly 1729 referenced plus one, plus 1730 a target-specific number 1731 of additional special 1732 SGPRs for VCC, 1733 FLAT_SCRATCH (GFX7+) and 1734 XNACK_MASK (GFX8+), and 1735 any additional 1736 target-specific 1737 limitations. It does not 1738 include the 16 SGPRs added 1739 if a trap handler is 1740 enabled. 

                                                     The target-specific
                                                     limitations and special
                                                     SGPR layout are defined in
                                                     the hardware
                                                     documentation, which can
                                                     be found in the
                                                     :ref:`amdgpu-processors`
                                                     table.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.SGPRS``.

                                                     The
                                                     :ref:`amdgpu-assembler`
                                                     calculates this
                                                     automatically for the
                                                     selected processor from
                                                     values provided to the
                                                     ``.amdhsa_kernel`` directive
                                                     by the
                                                     ``.amdhsa_next_free_sgpr``
                                                     and ``.amdhsa_reserve_*``
                                                     nested directives (see
                                                     :ref:`amdhsa-kernel-directives-table`).
     11:10   2 bits  PRIORITY                        Must be 0.

                                                     Start executing wavefront
                                                     at the specified priority.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.PRIORITY``.
     13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
                                                     with specified rounding
                                                     mode for single precision
                                                     (32 bit) floating point
                                                     operations.

                                                     Floating point rounding
                                                     mode values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
                                                     with specified rounding
                                                     mode for half/double
                                                     precision (16 and 64 bit)
                                                     floating point
                                                     operations.

                                                     Floating point rounding
                                                     mode values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution
                                                     with specified denorm mode
                                                     for single precision
                                                     (32 bit) floating point
                                                     operations.

                                                     Floating point denorm mode
                                                     values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution
                                                     with specified denorm mode
                                                     for half/double precision
                                                     (16 and 64 bit) floating
                                                     point operations.

                                                     Floating point denorm mode
                                                     values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     20      1 bit   PRIV                            Must be 0.

                                                     Start executing wavefront
                                                     in privilege trap handler
                                                     mode.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.PRIV``.
     21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution
                                                     with DX10 clamp mode
                                                     enabled. Used by the vector
                                                     ALU to force DX10 style
                                                     treatment of NaNs (when
                                                     set, clamp NaN to zero,
                                                     otherwise pass NaN
                                                     through).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
     22      1 bit   DEBUG_MODE                      Must be 0.

                                                     Start executing wavefront
                                                     in single step mode.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
     23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution
                                                     with IEEE mode
                                                     enabled. Floating point
                                                     opcodes that support
                                                     exception flag gathering
                                                     will quiet and propagate
                                                     signaling-NaN inputs per
                                                     IEEE 754-2008. Min_dx10 and
                                                     max_dx10 become IEEE
                                                     754-2008 compliant due to
                                                     signaling-NaN propagation
                                                     and quieting.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
     24      1 bit   BULKY                           Must be 0.

                                                     Only one work-group allowed
                                                     to execute on a compute
                                                     unit.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.BULKY``.
     25      1 bit   CDBG_USER                       Must be 0.

                                                     Flag that can be used to
                                                     control debugging code.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.CDBG_USER``.
     26      1 bit   FP16_OVFL                       GFX6-GFX8
                                                       Reserved, must be 0.
                                                     GFX9
                                                       Wavefront starts execution
                                                       with specified fp16 overflow
                                                       mode.

                                                       - If 0, fp16 overflow generates
                                                         +/-INF values.
                                                       - If 1, fp16 overflow that is the
                                                         result of a +/-INF input value
                                                         or divide by 0 produces a +/-INF,
                                                         otherwise clamps computed
                                                         overflow to +/-MAX_FP16 as
                                                         appropriate.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
     31:27   5 bits                                  Reserved, must be 0.
     32                                              **Total size 4 bytes.**
     ======= ===================================================================================================================

..

  .. table:: compute_pgm_rsrc2 for GFX6-GFX9
     :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     0       1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     Enable the setup of the
                     _WAVEFRONT_OFFSET               SGPR wavefront scratch offset
                                                     system register (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
     5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR
                                                     user data registers
                                                     requested. This number must
                                                     match the number of user
                                                     data registers enabled.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.USER_SGPR``.
     6       1 bit   ENABLE_TRAP_HANDLER             Must be 0.

                                                     This bit represents
                                                     ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
                                                     which is set by the CP if
                                                     the runtime has installed a
                                                     trap handler.
     7       1 bit   ENABLE_SGPR_WORKGROUP_ID_X      Enable the setup of the
                                                     system SGPR register for
                                                     the work-group id in the X
                                                     dimension (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
     8       1 bit   ENABLE_SGPR_WORKGROUP_ID_Y      Enable the setup of the
                                                     system SGPR register for
                                                     the work-group id in the Y
                                                     dimension (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
     9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the
                                                     system SGPR register for
                                                     the work-group id in the Z
                                                     dimension (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
     10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the
                                                     system SGPR register for
                                                     work-group information (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
     12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the
                                                     VGPR system registers used
                                                     for the work-item ID.
                                                     :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
                                                     defines the values.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
     13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0.

                                                     Wavefront starts execution
                                                     with address watch
                                                     exceptions enabled which
                                                     are generated when L1 has
                                                     witnessed a thread access
                                                     an *address of
                                                     interest*.

                                                     CP is responsible for
                                                     filling in the address
                                                     watch bit in
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
                                                     according to what the
                                                     runtime requests.
     14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0.

                                                     Wavefront starts execution
                                                     with memory violation
                                                     exceptions
                                                     enabled which are generated
                                                     when a memory violation has
                                                     occurred for this wavefront from
                                                     L1 or LDS
                                                     (write-to-read-only-memory,
                                                     mis-aligned atomic, LDS
                                                     address out of range,
                                                     illegal address, etc.).

                                                     CP sets the memory
                                                     violation bit in
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
                                                     according to what the
                                                     runtime requests.
     23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.

                                                     CP uses the rounded value
                                                     from the dispatch packet,
                                                     not this value, as the
                                                     dispatch may contain
                                                     dynamically allocated group
                                                     segment memory. CP writes
                                                     directly to
                                                     ``COMPUTE_PGM_RSRC2.LDS_SIZE``.

                                                     Amount of group segment
                                                     (LDS) to allocate for each
                                                     work-group. Granularity is
                                                     device specific:

                                                     GFX6:
                                                       roundup(lds-size / (64 * 4))
                                                     GFX7-GFX9:
                                                       roundup(lds-size / (128 * 4))

     24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
                     _INVALID_OPERATION              with specified exceptions
                                                     enabled.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN``
                                                     (set from bits 0..6).

                                                     IEEE 754 FP Invalid
                                                     Operation
     25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal: one or more
                     _SOURCE                         input operands is a
                                                     denormal number
     26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
                     _DIVISION_BY_ZERO               Zero
     27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Overflow
                     _OVERFLOW
     28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
                     _UNDERFLOW
     29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
                     _INEXACT
     30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
                     _ZERO                           (rcp_iflag_f32 instruction
                                                     only)
     31      1 bit                                   Reserved, must be 0.
     32                                              **Total size 4 bytes.**
     ======= ===================================================================================================================

..

  .. table:: Floating Point Rounding Mode Enumeration Values
     :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table

     ====================================== ===== ==============================
     Enumeration Name                       Value Description
     ====================================== ===== ==============================
     FLOAT_ROUND_MODE_NEAR_EVEN             0     Round Ties To Even
     FLOAT_ROUND_MODE_PLUS_INFINITY         1     Round Toward +infinity
     FLOAT_ROUND_MODE_MINUS_INFINITY        2     Round Toward -infinity
     FLOAT_ROUND_MODE_ZERO                  3     Round Toward 0
     ====================================== ===== ==============================

..

  .. table:: Floating Point Denorm Mode Enumeration Values
     :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table

     ====================================== ===== ==============================
     Enumeration Name                       Value Description
     ====================================== ===== ==============================
     FLOAT_DENORM_MODE_FLUSH_SRC_DST        0     Flush Source and Destination
                                                  Denorms
     FLOAT_DENORM_MODE_FLUSH_DST            1     Flush Output Denorms
     FLOAT_DENORM_MODE_FLUSH_SRC            2     Flush Source Denorms
     FLOAT_DENORM_MODE_FLUSH_NONE           3     No Flush
     ====================================== ===== ==============================

..

  .. table:: System VGPR Work-Item ID Enumeration Values
     :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table

     ======================================== ===== ============================
     Enumeration Name                         Value Description
     ======================================== ===== ============================
     SYSTEM_VGPR_WORKITEM_ID_X                0     Set work-item X dimension
                                                    ID.
     SYSTEM_VGPR_WORKITEM_ID_X_Y              1     Set work-item X and Y
                                                    dimensions ID.
     SYSTEM_VGPR_WORKITEM_ID_X_Y_Z            2     Set work-item X, Y and Z
                                                    dimensions ID.
     SYSTEM_VGPR_WORKITEM_ID_UNDEFINED        3     Undefined.
     ======================================== ===== ============================

.. _amdgpu-amdhsa-initial-kernel-execution-state:

Initial Kernel Execution State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section defines the register state that will be set up by the packet
processor prior to the start of execution of every wavefront. This is limited by
the constraints of the hardware controllers of CP/ADC/SPI.

The order of the SGPR registers is defined, but the compiler can specify which
ones are actually set up in the kernel descriptor using the ``enable_sgpr_*``
bit fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers
used for enabled registers are dense starting at SGPR0: the first enabled
register is SGPR0, the next enabled register is SGPR1 etc.; disabled registers
do not have an SGPR number.

The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are
actually initialized. These are then immediately followed by the System SGPRs
that are set up by ADC/SPI and can have different values for each wavefront of
the grid dispatch.

SGPR register initial state is defined in
:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

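The dense numbering rule can be illustrated with a small sketch. The Python
below is not how the hardware or the compiler is implemented; it only walks the
enable fields in the order, and with the register counts, given in the SGPR
Register Set Up Order table. The function name ``assign_sgprs`` is hypothetical,
and starting the System SGPRs at ``min(user_total, 16)`` is one reading of the
16 register limit described above, not a documented algorithm.

.. code-block:: python

  # (enable field, number of SGPRs), in the documented set up order.
  USER_SGPRS = [
      ("enable_sgpr_private_segment_buffer", 4),
      ("enable_sgpr_dispatch_ptr", 2),
      ("enable_sgpr_queue_ptr", 2),
      ("enable_sgpr_kernarg_segment_ptr", 2),
      ("enable_sgpr_dispatch_id", 2),
      ("enable_sgpr_flat_scratch_init", 2),
      ("enable_sgpr_private_segment_size", 1),
      ("enable_sgpr_grid_workgroup_count_X", 1),
      ("enable_sgpr_grid_workgroup_count_Y", 1),
      ("enable_sgpr_grid_workgroup_count_Z", 1),
  ]
  SYSTEM_SGPRS = [
      ("enable_sgpr_workgroup_id_X", 1),
      ("enable_sgpr_workgroup_id_Y", 1),
      ("enable_sgpr_workgroup_id_Z", 1),
      ("enable_sgpr_workgroup_info", 1),
      ("enable_sgpr_private_segment_wavefront_offset", 1),
  ]
  MAX_USER_SGPRS = 16


  def assign_sgprs(enabled):
      """Map each enabled field to (first SGPR number, SGPR count)."""
      layout = {}
      next_sgpr = 0
      for name, count in USER_SGPRS:
          if name in enabled:
              layout[name] = (next_sgpr, count)
              next_sgpr += count
      # Assumption: System SGPRs follow the (at most 16) User SGPRs.
      next_sgpr = min(next_sgpr, MAX_USER_SGPRS)
      for name, count in SYSTEM_SGPRS:
          if name in enabled:
              layout[name] = (next_sgpr, count)
              next_sgpr += count
      return layout

For instance, a kernel that enables only the private segment buffer, the
kernarg segment pointer, the work-group id in X and the scratch wavefront
offset would, under these assumptions, see them in SGPR0-3, SGPR4-5, SGPR6 and
SGPR7 respectively.
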
  .. table:: SGPR Register Set Up Order
     :name: amdgpu-amdhsa-sgpr-register-set-up-order-table

     ========== ========================== ====== ==============================
     SGPR Order Name                       Number Description
                (kernel descriptor enable  of
                field)                     SGPRs
     ========== ========================== ====== ==============================
     First      Private Segment Buffer     4      V# that can be used, together
                (enable_sgpr_private              with Scratch Wavefront Offset
                _segment_buffer)                  as an offset, to access the
                                                  private memory space using a
                                                  segment address.

                                                  CP uses the value provided by
                                                  the runtime.
     then       Dispatch Ptr               2      64 bit address of AQL dispatch
                (enable_sgpr_dispatch_ptr)        packet for kernel dispatch
                                                  actually executing.
     then       Queue Ptr                  2      64 bit address of amd_queue_t
                (enable_sgpr_queue_ptr)           object for AQL queue on which
                                                  the dispatch packet was
                                                  queued.
     then       Kernarg Segment Ptr        2      64 bit address of Kernarg
                (enable_sgpr_kernarg              segment. This is directly
                _segment_ptr)                     copied from the
                                                  kernarg_address in the kernel
                                                  dispatch packet.

                                                  Having CP load it once avoids
                                                  loading it at the beginning of
                                                  every wavefront.
     then       Dispatch Id                2      64 bit Dispatch ID of the
                (enable_sgpr_dispatch_id)         dispatch packet being
                                                  executed.
     then       Flat Scratch Init          2      This is 2 SGPRs:
                (enable_sgpr_flat_scratch
                _init)                            GFX6
                                                    Not supported.
                                                  GFX7-GFX8
                                                    The first SGPR is a 32 bit
                                                    byte offset from
                                                    ``SH_HIDDEN_PRIVATE_BASE_VIMID``
                                                    to per SPI base of memory
                                                    for scratch for the queue
                                                    executing the kernel
                                                    dispatch. CP obtains this
                                                    from the runtime. (The
                                                    Scratch Segment Buffer base
                                                    address is
                                                    ``SH_HIDDEN_PRIVATE_BASE_VIMID``
                                                    plus this offset.)
                                                    The value
                                                    of Scratch Wavefront Offset must
                                                    be added to this offset by
                                                    the kernel machine code,
                                                    right shifted by 8, and
                                                    moved to the FLAT_SCRATCH_HI
                                                    SGPR register.
                                                    FLAT_SCRATCH_HI corresponds
                                                    to SGPRn-4 on GFX7, and
                                                    SGPRn-6 on GFX8 (where SGPRn
                                                    is the highest numbered SGPR
                                                    allocated to the wavefront).
                                                    FLAT_SCRATCH_HI is
                                                    multiplied by 256 (as it is
                                                    in units of 256 bytes) and
                                                    added to
                                                    ``SH_HIDDEN_PRIVATE_BASE_VIMID``
                                                    to calculate the per wavefront
                                                    FLAT SCRATCH BASE in flat
                                                    memory instructions that
                                                    access the scratch
                                                    aperture.

                                                    The second SGPR is the 32 bit
                                                    byte size of a single
                                                    work-item's scratch memory
                                                    usage. CP obtains this from
                                                    the runtime, and it is
                                                    always a multiple of DWORD.
                                                    CP checks that the value in
                                                    the kernel dispatch packet
                                                    Private Segment Byte Size is
                                                    not larger, and requests the
                                                    runtime to increase the
                                                    queue's scratch size if
                                                    necessary. The kernel code
                                                    must move it to
                                                    FLAT_SCRATCH_LO which is
                                                    SGPRn-3 on GFX7 and SGPRn-5
                                                    on GFX8. FLAT_SCRATCH_LO is
                                                    used as the FLAT SCRATCH
                                                    SIZE in flat memory
                                                    instructions. Having CP load
                                                    it once avoids loading it at
                                                    the beginning of every
                                                    wavefront.
                                                  GFX9
                                                    This is the
                                                    64 bit base address of the
                                                    per SPI scratch backing
                                                    memory managed by SPI for
                                                    the queue executing the
                                                    kernel dispatch. CP obtains
                                                    this from the runtime (and
                                                    divides it if there are
                                                    multiple Shader Arrays each
                                                    with its own SPI). The value
                                                    of Scratch Wavefront Offset must
                                                    be added by the kernel
                                                    machine code and the result
                                                    moved to the FLAT_SCRATCH
                                                    SGPR pair, which is SGPRn-6
                                                    and SGPRn-5. It is used as the
                                                    FLAT SCRATCH BASE in flat
                                                    memory instructions.
     then       Private Segment Size       1      The 32 bit byte size of a
                (enable_sgpr_private              single work-item's scratch
                _segment_size)                    memory allocation. This is
                                                  the value from the kernel
                                                  dispatch packet Private
                                                  Segment Byte Size rounded up
                                                  by CP to a multiple of
                                                  DWORD.

                                                  Having CP load it once avoids
                                                  loading it at the beginning of
                                                  every wavefront.

                                                  This is not used for
                                                  GFX7-GFX8 since it is the same
                                                  value as the second SGPR of
                                                  Flat Scratch Init. However, it
                                                  may be needed for GFX9 which
                                                  changes the meaning of the
                                                  Flat Scratch Init value.
     then       Grid Work-Group Count X    1      32 bit count of the number of
                (enable_sgpr_grid                 work-groups in the X dimension
                _workgroup_count_X)               for the grid being
                                                  executed. Computed from the
                                                  fields in the kernel dispatch
                                                  packet as ((grid_size.x +
                                                  workgroup_size.x - 1) /
                                                  workgroup_size.x).
     then       Grid Work-Group Count Y    1      32 bit count of the number of
                (enable_sgpr_grid                 work-groups in the Y dimension
                _workgroup_count_Y &&             for the grid being
                less than 16 previous             executed. Computed from the
                SGPRs)                            fields in the kernel dispatch
                                                  packet as ((grid_size.y +
                                                  workgroup_size.y - 1) /
                                                  workgroup_size.y).

                                                  Only initialized if <16
                                                  previous SGPRs initialized.
     then       Grid Work-Group Count Z    1      32 bit count of the number of
                (enable_sgpr_grid                 work-groups in the Z dimension
                _workgroup_count_Z &&             for the grid being
                less than 16 previous             executed. Computed from the
                SGPRs)                            fields in the kernel dispatch
                                                  packet as ((grid_size.z +
                                                  workgroup_size.z - 1) /
                                                  workgroup_size.z).

                                                  Only initialized if <16
                                                  previous SGPRs initialized.
     then       Work-Group Id X            1      32 bit work-group id in X
                (enable_sgpr_workgroup_id         dimension of grid for
                _X)                               wavefront.
     then       Work-Group Id Y            1      32 bit work-group id in Y
                (enable_sgpr_workgroup_id         dimension of grid for
                _Y)                               wavefront.
     then       Work-Group Id Z            1      32 bit work-group id in Z
                (enable_sgpr_workgroup_id         dimension of grid for
                _Z)                               wavefront.
     then       Work-Group Info            1      {first_wavefront, 14'b0000,
                (enable_sgpr_workgroup            ordered_append_term[10:0],
                _info)                            threadgroup_size_in_wavefronts[5:0]}
     then       Scratch Wavefront Offset   1      32 bit byte offset from base
                (enable_sgpr_private              of scratch base of queue
                _segment_wavefront_offset)        executing the kernel
                                                  dispatch. Must be used as an
                                                  offset with Private
                                                  segment address when using
                                                  Scratch Segment Buffer. It
                                                  must be used to set up FLAT
                                                  SCRATCH for flat addressing
                                                  (see
                                                  :ref:`amdgpu-amdhsa-flat-scratch`).
     ========== ========================== ====== ==============================

The order of the VGPR registers is defined, but the compiler can specify which
ones are actually set up in the kernel descriptor using the ``enable_vgpr*`` bit
fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
for enabled registers are dense starting at VGPR0: the first enabled register is
VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
VGPR number.

VGPR register initial state is defined in
:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.

  .. table:: VGPR Register Set Up Order
     :name: amdgpu-amdhsa-vgpr-register-set-up-order-table

     ========== ========================== ====== ==============================
     VGPR Order Name                       Number Description
                (kernel descriptor enable  of
                field)                     VGPRs
     ========== ========================== ====== ==============================
     First      Work-Item Id X             1      32 bit work item id in X
                (Always initialized)              dimension of work-group for
                                                  wavefront lane.
     then       Work-Item Id Y             1      32 bit work item id in Y
                (enable_vgpr_workitem_id          dimension of work-group for
                > 0)                              wavefront lane.
     then       Work-Item Id Z             1      32 bit work item id in Z
                (enable_vgpr_workitem_id          dimension of work-group for
                > 1)                              wavefront lane.
     ========== ========================== ====== ==============================

The setting of registers is done by GPU CP/ADC/SPI hardware as follows:

1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
   registers.
2. Work-group Id registers X, Y, Z are set by ADC which supports any
   combination including none.
3. Scratch Wavefront Offset is set by SPI on a per wavefront basis which is why
   its value cannot be included with the flat scratch init value which is per
   queue.
4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
   or (X, Y, Z).

The Flat Scratch register pair are adjacent SGPRs so they can be moved as a 64
bit value to the hardware required SGPRn-3 and SGPRn-4 respectively.

The global segment can be accessed either using buffer instructions (GFX6 which
has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
instructions (GFX9).

If buffer operations are used then the compiler can generate a V# with the
following properties:

* base address of 0
* no swizzle
* ATC: 1 if IOMMU present (such as APU)
* ptr64: 1
* MTYPE set to support memory coherence that matches the runtime (such as CC for
  APU and NC for dGPU).

.. _amdgpu-amdhsa-kernel-prolog:

Kernel Prolog
~~~~~~~~~~~~~

.. _amdgpu-amdhsa-m0:

M0
++

GFX6-GFX8
  The M0 register must be initialized with a value at least the total LDS size
  if the kernel may access LDS via DS or flat operations. Total LDS size is
  available in the dispatch packet.
  For M0, it is also possible to use the maximum possible LDS size for the
  given target (0x7FFF for GFX6 and 0xFFFF for GFX7-GFX8).
GFX9
  The M0 register is not used for range checking LDS accesses and so does not
  need to be initialized in the prolog.

.. _amdgpu-amdhsa-flat-scratch:

Flat Scratch
++++++++++++

If the kernel may use flat operations to access scratch memory, the prolog code
must set up the FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI
which are in SGPRn-4/SGPRn-3). Initialization uses the Flat Scratch Init and
Scratch Wavefront Offset SGPR registers (see
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`):

GFX6
  Flat scratch is not supported.

GFX7-GFX8
  1. The low word of Flat Scratch Init is the 32 bit byte offset from
     ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
     being managed by SPI for the queue executing the kernel dispatch. This is
     the same value used in the Scratch Segment Buffer V# base address. The
     prolog must add the value of Scratch Wavefront Offset to get the
     wavefront's byte scratch backing memory offset from
     ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since FLAT_SCRATCH_HI is in units of 256
     bytes, the offset must be right shifted by 8 before moving into
     FLAT_SCRATCH_HI.
  2. The second word of Flat Scratch Init is the 32 bit byte size of a single
     work-item's scratch memory usage. This is directly loaded from the kernel
     dispatch packet Private Segment Byte Size and rounded up to a multiple of
     DWORD. Having CP load it once avoids loading it at the beginning of every
     wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT
     SCRATCH SIZE.

GFX9
  The Flat Scratch Init is the 64 bit address of the base of scratch backing
  memory being managed by SPI for the queue executing the kernel dispatch.
  The prolog must add the value of Scratch Wavefront Offset and move the result
  to the FLAT_SCRATCH pair for use as the flat scratch base in flat memory
  instructions.

.. _amdgpu-amdhsa-memory-model:

Memory Model
~~~~~~~~~~~~

This section describes the mapping of LLVM memory model onto AMDGPU machine code
(see :ref:`memmodel`). *The implementation is WIP.*

.. TODO
   Update when implementation complete.

The AMDGPU backend supports the memory synchronization scopes specified in
:ref:`amdgpu-memory-scopes`.

The code sequences used to implement the memory model are defined in table
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.

The sequences specify the order of instructions that a single thread must
execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
to other memory instructions executed by the same thread. This allows them to be
moved earlier or later which can allow them to be combined with other instances
of the same instruction, or hoisted/sunk out of loops to improve
performance. Only the instructions related to the memory model are given;
additional ``s_waitcnt`` instructions are required to ensure registers are
defined before being used. These may be able to be combined with the memory
model ``s_waitcnt`` instructions as described above.

The AMDGPU backend supports the following memory models:

  HSA Memory Model [HSA]_
    The HSA memory model uses a single happens-before relation for all address
    spaces (see :ref:`amdgpu-address-spaces`).
  OpenCL Memory Model [OpenCL]_
    The OpenCL memory model which has separate happens-before relations for the
    global and local address spaces. Only a fence specifying both global and
    local address space, and seq_cst instructions join the relationships.
    Since the LLVM ``memfence`` instruction does not allow an address space to
    be specified the OpenCL fence has to conservatively assume both local and
    global address space was specified. However, optimizations can often be
    done to eliminate the additional ``s_waitcnt`` instructions when there are
    no intervening memory instructions which access the corresponding address
    space. The code sequences in the table indicate what can be omitted for the
    OpenCL memory model. The target triple environment is used to determine if
    the source language is OpenCL (see :ref:`amdgpu-opencl`).

``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
operations.

``buffer/global/flat_load/store/atomic`` instructions to global memory are
termed vector memory operations.

For GFX6-GFX9:

* Each agent has multiple compute units (CU).
* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same CU but may be
  executed by different SIMDs.
* Each CU has a single LDS memory shared by the wavefronts of the work-groups
  executing on it.
* All LDS operations of a CU are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a
  CU. Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. A
  ``s_waitcnt lgkmcnt(0)`` is required to ensure synchronization between LDS
  operations and vector memory operations between wavefronts of a work-group,
  but not between operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations and
  completion is reported to a wavefront in execution order. The exception is
  that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
  vector memory order if they access LDS memory, and out of LDS operation order
  if they access global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore, no special action is required for coherence between
  the lanes of a single wavefront, or for coherence between wavefronts in the
  same work-group. A ``buffer_wbinvl1_vol`` is required for coherence between
  wavefronts executing in different work-groups as they may be executing on
  different CUs.
* The scalar memory operations access a scalar L1 cache shared by all wavefronts
  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
  scalar operations are used in a restricted way so do not impact the memory
  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory operations use an L2 cache shared by all CUs on
  the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
* Each CU has a separate request queue per channel. Therefore, the vector and
  scalar memory operations performed by wavefronts executing in different
  work-groups (which may be executing on different CUs) of an agent can be
  reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to
  ensure synchronization between vector memory operations of different CUs. It
  ensures a previous vector memory operation has completed before executing a
  subsequent vector memory or LDS operation and so can be used to meet the
  requirements of acquire and release.
* The L2 cache can be kept coherent with other agents on some targets, or ranges
  of virtual addresses can be set up to bypass it to ensure system coherence.

Private address space uses ``buffer_load/store`` using the scratch V#
(GFX6-GFX8), or ``scratch_load/store`` (GFX9). Since only a single thread is
accessing the memory, atomic memory orderings are not meaningful and all
accesses are treated as non-atomic.

Constant address space uses ``buffer/global_load`` instructions (or equivalent
scalar memory instructions). Since the constant address space contents do not
change during the execution of a kernel dispatch it is not legal to perform
stores, and atomic memory orderings are not meaningful and all accesses are
treated as non-atomic.

A memory synchronization scope wider than work-group is not meaningful for the
group (LDS) address space and is treated as work-group.

The memory model does not support the region address space which is treated as
non-atomic.

Acquire memory ordering is not meaningful on store atomic instructions and is
treated as non-atomic.

Release memory ordering is not meaningful on load atomic instructions and is
treated as non-atomic.

Acquire-release memory ordering is not meaningful on load or store atomic
instructions and is treated as acquire and release respectively.

AMDGPU backend only uses scalar memory operations to access memory that is
proven to not change during the execution of the kernel dispatch. This includes
constant address space and global address space for program scope const
variables. Therefore the kernel machine code does not have to maintain the
scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
and vector L1 caches are invalidated between kernel dispatches by CP since
constant address space data may change between kernel dispatch executions.
See :ref:`amdgpu-amdhsa-memory-spaces`.

The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time. If scalar writes are used
then an ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a
function return since the locations may be used for vector memory instructions
by a future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for an
``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.

Scratch backing memory (which is used for the private address space)
is accessed with MTYPE NC_NV (non-coherent non-volatile). Since the private
address space is only accessed by a single thread, and is always
write-before-read, there is never a need to invalidate these entries from the L1
cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
volatile cache lines.

On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
to invalidate the L2 cache. This also causes it to be treated as
non-volatile and so it is not invalidated by ``*_vol``. On APU it is accessed as
CC (cache coherent) and so the L2 cache will be coherent with the CPU and other
agents.

  ..
table:: AMDHSA Memory Model Code Sequences GFX6-GFX9 2595 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table 2596 2597 ============ ============ ============== ========== =============================== 2598 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code 2599 Ordering Sync Scope Address 2600 Space 2601 ============ ============ ============== ========== =============================== 2602 **Non-Atomic** 2603 ----------------------------------------------------------------------------------- 2604 load *none* *none* - global - !volatile & !nontemporal 2605 - generic 2606 - private 1. buffer/global/flat_load 2607 - constant 2608 - volatile & !nontemporal 2609 2610 1. buffer/global/flat_load 2611 glc=1 2612 2613 - nontemporal 2614 2615 1. buffer/global/flat_load 2616 glc=1 slc=1 2617 2618 load *none* *none* - local 1. ds_load 2619 store *none* *none* - global - !nontemporal 2620 - generic 2621 - private 1. buffer/global/flat_store 2622 - constant 2623 - nontemporal 2624 2625 1. buffer/global/flat_stote 2626 glc=1 slc=1 2627 2628 store *none* *none* - local 1. ds_store 2629 **Unordered Atomic** 2630 ----------------------------------------------------------------------------------- 2631 load atomic unordered *any* *any* *Same as non-atomic*. 2632 store atomic unordered *any* *any* *Same as non-atomic*. 2633 atomicrmw unordered *any* *any* *Same as monotonic 2634 atomic*. 2635 **Monotonic Atomic** 2636 ----------------------------------------------------------------------------------- 2637 load atomic monotonic - singlethread - global 1. buffer/global/flat_load 2638 - wavefront - generic 2639 - workgroup 2640 load atomic monotonic - singlethread - local 1. ds_load 2641 - wavefront 2642 - workgroup 2643 load atomic monotonic - agent - global 1. buffer/global/flat_load 2644 - system - generic glc=1 2645 store atomic monotonic - singlethread - global 1. 
buffer/global/flat_store 2646 - wavefront - generic 2647 - workgroup 2648 - agent 2649 - system 2650 store atomic monotonic - singlethread - local 1. ds_store 2651 - wavefront 2652 - workgroup 2653 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic 2654 - wavefront - generic 2655 - workgroup 2656 - agent 2657 - system 2658 atomicrmw monotonic - singlethread - local 1. ds_atomic 2659 - wavefront 2660 - workgroup 2661 **Acquire Atomic** 2662 ----------------------------------------------------------------------------------- 2663 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load 2664 - wavefront - local 2665 - generic 2666 load atomic acquire - workgroup - global 1. buffer/global/flat_load 2667 load atomic acquire - workgroup - local 1. ds_load 2668 2. s_waitcnt lgkmcnt(0) 2669 2670 - If OpenCL, omit. 2671 - Must happen before 2672 any following 2673 global/generic 2674 load/load 2675 atomic/store/store 2676 atomic/atomicrmw. 2677 - Ensures any 2678 following global 2679 data read is no 2680 older than the load 2681 atomic value being 2682 acquired. 2683 load atomic acquire - workgroup - generic 1. flat_load 2684 2. s_waitcnt lgkmcnt(0) 2685 2686 - If OpenCL, omit. 2687 - Must happen before 2688 any following 2689 global/generic 2690 load/load 2691 atomic/store/store 2692 atomic/atomicrmw. 2693 - Ensures any 2694 following global 2695 data read is no 2696 older than the load 2697 atomic value being 2698 acquired. 2699 load atomic acquire - agent - global 1. buffer/global/flat_load 2700 - system glc=1 2701 2. s_waitcnt vmcnt(0) 2702 2703 - Must happen before 2704 following 2705 buffer_wbinvl1_vol. 2706 - Ensures the load 2707 has completed 2708 before invalidating 2709 the cache. 2710 2711 3. buffer_wbinvl1_vol 2712 2713 - Must happen before 2714 any following 2715 global/generic 2716 load/load 2717 atomic/atomicrmw. 2718 - Ensures that 2719 following 2720 loads will not see 2721 stale global data. 
2722 2723 load atomic acquire - agent - generic 1. flat_load glc=1 2724 - system 2. s_waitcnt vmcnt(0) & 2725 lgkmcnt(0) 2726 2727 - If OpenCL omit 2728 lgkmcnt(0). 2729 - Must happen before 2730 following 2731 buffer_wbinvl1_vol. 2732 - Ensures the flat_load 2733 has completed 2734 before invalidating 2735 the cache. 2736 2737 3. buffer_wbinvl1_vol 2738 2739 - Must happen before 2740 any following 2741 global/generic 2742 load/load 2743 atomic/atomicrmw. 2744 - Ensures that 2745 following loads 2746 will not see stale 2747 global data. 2748 2749 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic 2750 - wavefront - local 2751 - generic 2752 atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic 2753 atomicrmw acquire - workgroup - local 1. ds_atomic 2754 2. waitcnt lgkmcnt(0) 2755 2756 - If OpenCL, omit. 2757 - Must happen before 2758 any following 2759 global/generic 2760 load/load 2761 atomic/store/store 2762 atomic/atomicrmw. 2763 - Ensures any 2764 following global 2765 data read is no 2766 older than the 2767 atomicrmw value 2768 being acquired. 2769 2770 atomicrmw acquire - workgroup - generic 1. flat_atomic 2771 2. waitcnt lgkmcnt(0) 2772 2773 - If OpenCL, omit. 2774 - Must happen before 2775 any following 2776 global/generic 2777 load/load 2778 atomic/store/store 2779 atomic/atomicrmw. 2780 - Ensures any 2781 following global 2782 data read is no 2783 older than the 2784 atomicrmw value 2785 being acquired. 2786 2787 atomicrmw acquire - agent - global 1. buffer/global/flat_atomic 2788 - system 2. s_waitcnt vmcnt(0) 2789 2790 - Must happen before 2791 following 2792 buffer_wbinvl1_vol. 2793 - Ensures the 2794 atomicrmw has 2795 completed before 2796 invalidating the 2797 cache. 2798 2799 3. buffer_wbinvl1_vol 2800 2801 - Must happen before 2802 any following 2803 global/generic 2804 load/load 2805 atomic/atomicrmw. 2806 - Ensures that 2807 following loads 2808 will not see stale 2809 global data. 
2810 2811 atomicrmw acquire - agent - generic 1. flat_atomic 2812 - system 2. s_waitcnt vmcnt(0) & 2813 lgkmcnt(0) 2814 2815 - If OpenCL, omit 2816 lgkmcnt(0). 2817 - Must happen before 2818 following 2819 buffer_wbinvl1_vol. 2820 - Ensures the 2821 atomicrmw has 2822 completed before 2823 invalidating the 2824 cache. 2825 2826 3. buffer_wbinvl1_vol 2827 2828 - Must happen before 2829 any following 2830 global/generic 2831 load/load 2832 atomic/atomicrmw. 2833 - Ensures that 2834 following loads 2835 will not see stale 2836 global data. 2837 2838 fence acquire - singlethread *none* *none* 2839 - wavefront 2840 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0) 2841 2842 - If OpenCL and 2843 address space is 2844 not generic, omit. 2845 - However, since LLVM 2846 currently has no 2847 address space on 2848 the fence need to 2849 conservatively 2850 always generate. If 2851 fence had an 2852 address space then 2853 set to address 2854 space of OpenCL 2855 fence flag, or to 2856 generic if both 2857 local and global 2858 flags are 2859 specified. 2860 - Must happen after 2861 any preceding 2862 local/generic load 2863 atomic/atomicrmw 2864 with an equal or 2865 wider sync scope 2866 and memory ordering 2867 stronger than 2868 unordered (this is 2869 termed the 2870 fence-paired-atomic). 2871 - Must happen before 2872 any following 2873 global/generic 2874 load/load 2875 atomic/store/store 2876 atomic/atomicrmw. 2877 - Ensures any 2878 following global 2879 data read is no 2880 older than the 2881 value read by the 2882 fence-paired-atomic. 2883 2884 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) & 2885 - system vmcnt(0) 2886 2887 - If OpenCL and 2888 address space is 2889 not generic, omit 2890 lgkmcnt(0). 2891 - However, since LLVM 2892 currently has no 2893 address space on 2894 the fence need to 2895 conservatively 2896 always generate 2897 (see comment for 2898 previous fence). 
2899 - Could be split into 2900 separate s_waitcnt 2901 vmcnt(0) and 2902 s_waitcnt 2903 lgkmcnt(0) to allow 2904 them to be 2905 independently moved 2906 according to the 2907 following rules. 2908 - s_waitcnt vmcnt(0) 2909 must happen after 2910 any preceding 2911 global/generic load 2912 atomic/atomicrmw 2913 with an equal or 2914 wider sync scope 2915 and memory ordering 2916 stronger than 2917 unordered (this is 2918 termed the 2919 fence-paired-atomic). 2920 - s_waitcnt lgkmcnt(0) 2921 must happen after 2922 any preceding 2923 local/generic load 2924 atomic/atomicrmw 2925 with an equal or 2926 wider sync scope 2927 and memory ordering 2928 stronger than 2929 unordered (this is 2930 termed the 2931 fence-paired-atomic). 2932 - Must happen before 2933 the following 2934 buffer_wbinvl1_vol. 2935 - Ensures that the 2936 fence-paired atomic 2937 has completed 2938 before invalidating 2939 the 2940 cache. Therefore 2941 any following 2942 locations read must 2943 be no older than 2944 the value read by 2945 the 2946 fence-paired-atomic. 2947 2948 2. buffer_wbinvl1_vol 2949 2950 - Must happen before any 2951 following global/generic 2952 load/load 2953 atomic/store/store 2954 atomic/atomicrmw. 2955 - Ensures that 2956 following loads 2957 will not see stale 2958 global data. 2959 2960 **Release Atomic** 2961 ----------------------------------------------------------------------------------- 2962 store atomic release - singlethread - global 1. buffer/global/ds/flat_store 2963 - wavefront - local 2964 - generic 2965 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0) 2966 2967 - If OpenCL, omit. 2968 - Must happen after 2969 any preceding 2970 local/generic 2971 load/store/load 2972 atomic/store 2973 atomic/atomicrmw. 2974 - Must happen before 2975 the following 2976 store. 2977 - Ensures that all 2978 memory operations 2979 to local have 2980 completed before 2981 performing the 2982 store that is being 2983 released. 2984 2985 2. 
buffer/global/flat_store 2986 store atomic release - workgroup - local 1. ds_store 2987 store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0) 2988 2989 - If OpenCL, omit. 2990 - Must happen after 2991 any preceding 2992 local/generic 2993 load/store/load 2994 atomic/store 2995 atomic/atomicrmw. 2996 - Must happen before 2997 the following 2998 store. 2999 - Ensures that all 3000 memory operations 3001 to local have 3002 completed before 3003 performing the 3004 store that is being 3005 released. 3006 3007 2. flat_store 3008 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) & 3009 - system - generic vmcnt(0) 3010 3011 - If OpenCL, omit 3012 lgkmcnt(0). 3013 - Could be split into 3014 separate s_waitcnt 3015 vmcnt(0) and 3016 s_waitcnt 3017 lgkmcnt(0) to allow 3018 them to be 3019 independently moved 3020 according to the 3021 following rules. 3022 - s_waitcnt vmcnt(0) 3023 must happen after 3024 any preceding 3025 global/generic 3026 load/store/load 3027 atomic/store 3028 atomic/atomicrmw. 3029 - s_waitcnt lgkmcnt(0) 3030 must happen after 3031 any preceding 3032 local/generic 3033 load/store/load 3034 atomic/store 3035 atomic/atomicrmw. 3036 - Must happen before 3037 the following 3038 store. 3039 - Ensures that all 3040 memory operations 3041 to memory have 3042 completed before 3043 performing the 3044 store that is being 3045 released. 3046 3047 2. buffer/global/ds/flat_store 3048 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic 3049 - wavefront - local 3050 - generic 3051 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0) 3052 3053 - If OpenCL, omit. 3054 - Must happen after 3055 any preceding 3056 local/generic 3057 load/store/load 3058 atomic/store 3059 atomic/atomicrmw. 3060 - Must happen before 3061 the following 3062 atomicrmw. 3063 - Ensures that all 3064 memory operations 3065 to local have 3066 completed before 3067 performing the 3068 atomicrmw that is 3069 being released. 3070 3071 2. 
buffer/global/flat_atomic 3072 atomicrmw release - workgroup - local 1. ds_atomic 3073 atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0) 3074 3075 - If OpenCL, omit. 3076 - Must happen after 3077 any preceding 3078 local/generic 3079 load/store/load 3080 atomic/store 3081 atomic/atomicrmw. 3082 - Must happen before 3083 the following 3084 atomicrmw. 3085 - Ensures that all 3086 memory operations 3087 to local have 3088 completed before 3089 performing the 3090 atomicrmw that is 3091 being released. 3092 3093 2. flat_atomic 3094 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) & 3095 - system - generic vmcnt(0) 3096 3097 - If OpenCL, omit 3098 lgkmcnt(0). 3099 - Could be split into 3100 separate s_waitcnt 3101 vmcnt(0) and 3102 s_waitcnt 3103 lgkmcnt(0) to allow 3104 them to be 3105 independently moved 3106 according to the 3107 following rules. 3108 - s_waitcnt vmcnt(0) 3109 must happen after 3110 any preceding 3111 global/generic 3112 load/store/load 3113 atomic/store 3114 atomic/atomicrmw. 3115 - s_waitcnt lgkmcnt(0) 3116 must happen after 3117 any preceding 3118 local/generic 3119 load/store/load 3120 atomic/store 3121 atomic/atomicrmw. 3122 - Must happen before 3123 the following 3124 atomicrmw. 3125 - Ensures that all 3126 memory operations 3127 to global and local 3128 have completed 3129 before performing 3130 the atomicrmw that 3131 is being released. 3132 3133 2. buffer/global/ds/flat_atomic 3134 fence release - singlethread *none* *none* 3135 - wavefront 3136 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0) 3137 3138 - If OpenCL and 3139 address space is 3140 not generic, omit. 3141 - However, since LLVM 3142 currently has no 3143 address space on 3144 the fence need to 3145 conservatively 3146 always generate. If 3147 fence had an 3148 address space then 3149 set to address 3150 space of OpenCL 3151 fence flag, or to 3152 generic if both 3153 local and global 3154 flags are 3155 specified. 
3156 - Must happen after 3157 any preceding 3158 local/generic 3159 load/load 3160 atomic/store/store 3161 atomic/atomicrmw. 3162 - Must happen before 3163 any following store 3164 atomic/atomicrmw 3165 with an equal or 3166 wider sync scope 3167 and memory ordering 3168 stronger than 3169 unordered (this is 3170 termed the 3171 fence-paired-atomic). 3172 - Ensures that all 3173 memory operations 3174 to local have 3175 completed before 3176 performing the 3177 following 3178 fence-paired-atomic. 3179 3180 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) & 3181 - system vmcnt(0) 3182 3183 - If OpenCL and 3184 address space is 3185 not generic, omit 3186 lgkmcnt(0). 3187 - If OpenCL and 3188 address space is 3189 local, omit 3190 vmcnt(0). 3191 - However, since LLVM 3192 currently has no 3193 address space on 3194 the fence need to 3195 conservatively 3196 always generate. If 3197 fence had an 3198 address space then 3199 set to address 3200 space of OpenCL 3201 fence flag, or to 3202 generic if both 3203 local and global 3204 flags are 3205 specified. 3206 - Could be split into 3207 separate s_waitcnt 3208 vmcnt(0) and 3209 s_waitcnt 3210 lgkmcnt(0) to allow 3211 them to be 3212 independently moved 3213 according to the 3214 following rules. 3215 - s_waitcnt vmcnt(0) 3216 must happen after 3217 any preceding 3218 global/generic 3219 load/store/load 3220 atomic/store 3221 atomic/atomicrmw. 3222 - s_waitcnt lgkmcnt(0) 3223 must happen after 3224 any preceding 3225 local/generic 3226 load/store/load 3227 atomic/store 3228 atomic/atomicrmw. 3229 - Must happen before 3230 any following store 3231 atomic/atomicrmw 3232 with an equal or 3233 wider sync scope 3234 and memory ordering 3235 stronger than 3236 unordered (this is 3237 termed the 3238 fence-paired-atomic). 3239 - Ensures that all 3240 memory operations 3241 have 3242 completed before 3243 performing the 3244 following 3245 fence-paired-atomic. 
3246 3247 **Acquire-Release Atomic** 3248 ----------------------------------------------------------------------------------- 3249 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic 3250 - wavefront - local 3251 - generic 3252 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0) 3253 3254 - If OpenCL, omit. 3255 - Must happen after 3256 any preceding 3257 local/generic 3258 load/store/load 3259 atomic/store 3260 atomic/atomicrmw. 3261 - Must happen before 3262 the following 3263 atomicrmw. 3264 - Ensures that all 3265 memory operations 3266 to local have 3267 completed before 3268 performing the 3269 atomicrmw that is 3270 being released. 3271 3272 2. buffer/global/flat_atomic 3273 atomicrmw acq_rel - workgroup - local 1. ds_atomic 3274 2. s_waitcnt lgkmcnt(0) 3275 3276 - If OpenCL, omit. 3277 - Must happen before 3278 any following 3279 global/generic 3280 load/load 3281 atomic/store/store 3282 atomic/atomicrmw. 3283 - Ensures any 3284 following global 3285 data read is no 3286 older than the load 3287 atomic value being 3288 acquired. 3289 3290 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0) 3291 3292 - If OpenCL, omit. 3293 - Must happen after 3294 any preceding 3295 local/generic 3296 load/store/load 3297 atomic/store 3298 atomic/atomicrmw. 3299 - Must happen before 3300 the following 3301 atomicrmw. 3302 - Ensures that all 3303 memory operations 3304 to local have 3305 completed before 3306 performing the 3307 atomicrmw that is 3308 being released. 3309 3310 2. flat_atomic 3311 3. s_waitcnt lgkmcnt(0) 3312 3313 - If OpenCL, omit. 3314 - Must happen before 3315 any following 3316 global/generic 3317 load/load 3318 atomic/store/store 3319 atomic/atomicrmw. 3320 - Ensures any 3321 following global 3322 data read is no 3323 older than the load 3324 atomic value being 3325 acquired. 3326 3327 atomicrmw acq_rel - agent - global 1. 
s_waitcnt lgkmcnt(0) & 3328 - system vmcnt(0) 3329 3330 - If OpenCL, omit 3331 lgkmcnt(0). 3332 - Could be split into 3333 separate s_waitcnt 3334 vmcnt(0) and 3335 s_waitcnt 3336 lgkmcnt(0) to allow 3337 them to be 3338 independently moved 3339 according to the 3340 following rules. 3341 - s_waitcnt vmcnt(0) 3342 must happen after 3343 any preceding 3344 global/generic 3345 load/store/load 3346 atomic/store 3347 atomic/atomicrmw. 3348 - s_waitcnt lgkmcnt(0) 3349 must happen after 3350 any preceding 3351 local/generic 3352 load/store/load 3353 atomic/store 3354 atomic/atomicrmw. 3355 - Must happen before 3356 the following 3357 atomicrmw. 3358 - Ensures that all 3359 memory operations 3360 to global have 3361 completed before 3362 performing the 3363 atomicrmw that is 3364 being released. 3365 3366 2. buffer/global/flat_atomic 3367 3. s_waitcnt vmcnt(0) 3368 3369 - Must happen before 3370 following 3371 buffer_wbinvl1_vol. 3372 - Ensures the 3373 atomicrmw has 3374 completed before 3375 invalidating the 3376 cache. 3377 3378 4. buffer_wbinvl1_vol 3379 3380 - Must happen before 3381 any following 3382 global/generic 3383 load/load 3384 atomic/atomicrmw. 3385 - Ensures that 3386 following loads 3387 will not see stale 3388 global data. 3389 3390 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) & 3391 - system vmcnt(0) 3392 3393 - If OpenCL, omit 3394 lgkmcnt(0). 3395 - Could be split into 3396 separate s_waitcnt 3397 vmcnt(0) and 3398 s_waitcnt 3399 lgkmcnt(0) to allow 3400 them to be 3401 independently moved 3402 according to the 3403 following rules. 3404 - s_waitcnt vmcnt(0) 3405 must happen after 3406 any preceding 3407 global/generic 3408 load/store/load 3409 atomic/store 3410 atomic/atomicrmw. 3411 - s_waitcnt lgkmcnt(0) 3412 must happen after 3413 any preceding 3414 local/generic 3415 load/store/load 3416 atomic/store 3417 atomic/atomicrmw. 3418 - Must happen before 3419 the following 3420 atomicrmw. 
3421 - Ensures that all 3422 memory operations 3423 to global have 3424 completed before 3425 performing the 3426 atomicrmw that is 3427 being released. 3428 3429 2. flat_atomic 3430 3. s_waitcnt vmcnt(0) & 3431 lgkmcnt(0) 3432 3433 - If OpenCL, omit 3434 lgkmcnt(0). 3435 - Must happen before 3436 following 3437 buffer_wbinvl1_vol. 3438 - Ensures the 3439 atomicrmw has 3440 completed before 3441 invalidating the 3442 cache. 3443 3444 4. buffer_wbinvl1_vol 3445 3446 - Must happen before 3447 any following 3448 global/generic 3449 load/load 3450 atomic/atomicrmw. 3451 - Ensures that 3452 following loads 3453 will not see stale 3454 global data. 3455 3456 fence acq_rel - singlethread *none* *none* 3457 - wavefront 3458 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0) 3459 3460 - If OpenCL and 3461 address space is 3462 not generic, omit. 3463 - However, 3464 since LLVM 3465 currently has no 3466 address space on 3467 the fence need to 3468 conservatively 3469 always generate 3470 (see comment for 3471 previous fence). 3472 - Must happen after 3473 any preceding 3474 local/generic 3475 load/load 3476 atomic/store/store 3477 atomic/atomicrmw. 3478 - Must happen before 3479 any following 3480 global/generic 3481 load/load 3482 atomic/store/store 3483 atomic/atomicrmw. 3484 - Ensures that all 3485 memory operations 3486 to local have 3487 completed before 3488 performing any 3489 following global 3490 memory operations. 3491 - Ensures that the 3492 preceding 3493 local/generic load 3494 atomic/atomicrmw 3495 with an equal or 3496 wider sync scope 3497 and memory ordering 3498 stronger than 3499 unordered (this is 3500 termed the 3501 acquire-fence-paired-atomic 3502 ) has completed 3503 before following 3504 global memory 3505 operations. This 3506 satisfies the 3507 requirements of 3508 acquire. 
3509 - Ensures that all 3510 previous memory 3511 operations have 3512 completed before a 3513 following 3514 local/generic store 3515 atomic/atomicrmw 3516 with an equal or 3517 wider sync scope 3518 and memory ordering 3519 stronger than 3520 unordered (this is 3521 termed the 3522 release-fence-paired-atomic 3523 ). This satisfies the 3524 requirements of 3525 release. 3526 3527 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) & 3528 - system vmcnt(0) 3529 3530 - If OpenCL and 3531 address space is 3532 not generic, omit 3533 lgkmcnt(0). 3534 - However, since LLVM 3535 currently has no 3536 address space on 3537 the fence need to 3538 conservatively 3539 always generate 3540 (see comment for 3541 previous fence). 3542 - Could be split into 3543 separate s_waitcnt 3544 vmcnt(0) and 3545 s_waitcnt 3546 lgkmcnt(0) to allow 3547 them to be 3548 independently moved 3549 according to the 3550 following rules. 3551 - s_waitcnt vmcnt(0) 3552 must happen after 3553 any preceding 3554 global/generic 3555 load/store/load 3556 atomic/store 3557 atomic/atomicrmw. 3558 - s_waitcnt lgkmcnt(0) 3559 must happen after 3560 any preceding 3561 local/generic 3562 load/store/load 3563 atomic/store 3564 atomic/atomicrmw. 3565 - Must happen before 3566 the following 3567 buffer_wbinvl1_vol. 3568 - Ensures that the 3569 preceding 3570 global/local/generic 3571 load 3572 atomic/atomicrmw 3573 with an equal or 3574 wider sync scope 3575 and memory ordering 3576 stronger than 3577 unordered (this is 3578 termed the 3579 acquire-fence-paired-atomic 3580 ) has completed 3581 before invalidating 3582 the cache. This 3583 satisfies the 3584 requirements of 3585 acquire. 
3586 - Ensures that all 3587 previous memory 3588 operations have 3589 completed before a 3590 following 3591 global/local/generic 3592 store 3593 atomic/atomicrmw 3594 with an equal or 3595 wider sync scope 3596 and memory ordering 3597 stronger than 3598 unordered (this is 3599 termed the 3600 release-fence-paired-atomic 3601 ). This satisfies the 3602 requirements of 3603 release. 3604 3605 2. buffer_wbinvl1_vol 3606 3607 - Must happen before 3608 any following 3609 global/generic 3610 load/load 3611 atomic/store/store 3612 atomic/atomicrmw. 3613 - Ensures that 3614 following loads 3615 will not see stale 3616 global data. This 3617 satisfies the 3618 requirements of 3619 acquire. 3620 3621 **Sequential Consistent Atomic** 3622 ----------------------------------------------------------------------------------- 3623 load atomic seq_cst - singlethread - global *Same as corresponding 3624 - wavefront - local load atomic acquire, 3625 - generic except must generated 3626 all instructions even 3627 for OpenCL.* 3628 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0) 3629 - generic 3630 - Must 3631 happen after 3632 preceding 3633 global/generic load 3634 atomic/store 3635 atomic/atomicrmw 3636 with memory 3637 ordering of seq_cst 3638 and with equal or 3639 wider sync scope. 3640 (Note that seq_cst 3641 fences have their 3642 own s_waitcnt 3643 lgkmcnt(0) and so do 3644 not need to be 3645 considered.) 3646 - Ensures any 3647 preceding 3648 sequential 3649 consistent local 3650 memory instructions 3651 have completed 3652 before executing 3653 this sequentially 3654 consistent 3655 instruction. This 3656 prevents reordering 3657 a seq_cst store 3658 followed by a 3659 seq_cst load. 
(Note 3660 that seq_cst is 3661 stronger than 3662 acquire/release as 3663 the reordering of 3664 load acquire 3665 followed by a store 3666 release is 3667 prevented by the 3668 waitcnt of 3669 the release, but 3670 there is nothing 3671 preventing a store 3672 release followed by 3673 load acquire from 3674 competing out of 3675 order.) 3676 3677 2. *Following 3678 instructions same as 3679 corresponding load 3680 atomic acquire, 3681 except must generated 3682 all instructions even 3683 for OpenCL.* 3684 load atomic seq_cst - workgroup - local *Same as corresponding 3685 load atomic acquire, 3686 except must generated 3687 all instructions even 3688 for OpenCL.* 3689 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) & 3690 - system - generic vmcnt(0) 3691 3692 - Could be split into 3693 separate s_waitcnt 3694 vmcnt(0) 3695 and s_waitcnt 3696 lgkmcnt(0) to allow 3697 them to be 3698 independently moved 3699 according to the 3700 following rules. 3701 - waitcnt lgkmcnt(0) 3702 must happen after 3703 preceding 3704 global/generic load 3705 atomic/store 3706 atomic/atomicrmw 3707 with memory 3708 ordering of seq_cst 3709 and with equal or 3710 wider sync scope. 3711 (Note that seq_cst 3712 fences have their 3713 own s_waitcnt 3714 lgkmcnt(0) and so do 3715 not need to be 3716 considered.) 3717 - waitcnt vmcnt(0) 3718 must happen after 3719 preceding 3720 global/generic load 3721 atomic/store 3722 atomic/atomicrmw 3723 with memory 3724 ordering of seq_cst 3725 and with equal or 3726 wider sync scope. 3727 (Note that seq_cst 3728 fences have their 3729 own s_waitcnt 3730 vmcnt(0) and so do 3731 not need to be 3732 considered.) 3733 - Ensures any 3734 preceding 3735 sequential 3736 consistent global 3737 memory instructions 3738 have completed 3739 before executing 3740 this sequentially 3741 consistent 3742 instruction. This 3743 prevents reordering 3744 a seq_cst store 3745 followed by a 3746 seq_cst load. 
(Note 3747 that seq_cst is 3748 stronger than 3749 acquire/release as 3750 the reordering of 3751 load acquire 3752 followed by a store 3753 release is 3754 prevented by the 3755 waitcnt of 3756 the release, but 3757 there is nothing 3758 preventing a store 3759 release followed by 3760 load acquire from 3761 competing out of 3762 order.) 3763 3764 2. *Following 3765 instructions same as 3766 corresponding load 3767 atomic acquire, 3768 except must generated 3769 all instructions even 3770 for OpenCL.* 3771 store atomic seq_cst - singlethread - global *Same as corresponding 3772 - wavefront - local store atomic release, 3773 - workgroup - generic except must generated 3774 all instructions even 3775 for OpenCL.* 3776 store atomic seq_cst - agent - global *Same as corresponding 3777 - system - generic store atomic release, 3778 except must generated 3779 all instructions even 3780 for OpenCL.* 3781 atomicrmw seq_cst - singlethread - global *Same as corresponding 3782 - wavefront - local atomicrmw acq_rel, 3783 - workgroup - generic except must generated 3784 all instructions even 3785 for OpenCL.* 3786 atomicrmw seq_cst - agent - global *Same as corresponding 3787 - system - generic atomicrmw acq_rel, 3788 except must generated 3789 all instructions even 3790 for OpenCL.* 3791 fence seq_cst - singlethread *none* *Same as corresponding 3792 - wavefront fence acq_rel, 3793 - workgroup except must generated 3794 - agent all instructions even 3795 - system for OpenCL.* 3796 ============ ============ ============== ========== =============================== 3797 3798The memory order also adds the single thread optimization constrains defined in 3799table 3800:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`. 3801 3802 .. 
table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
     :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table

     ============ ==============================================================
     LLVM Memory  Optimization Constraints
     Ordering
     ============ ==============================================================
     unordered    *none*
     monotonic    *none*
     acquire      - If a load atomic/atomicrmw then no following load/load
                    atomic/store/store atomic/atomicrmw/fence instruction can
                    be moved before the acquire.
                  - If a fence then same as load atomic, plus no preceding
                    associated fence-paired-atomic can be moved after the fence.
     release      - If a store atomic/atomicrmw then no preceding load/load
                    atomic/store/store atomic/atomicrmw/fence instruction can
                    be moved after the release.
                  - If a fence then same as store atomic, plus no following
                    associated fence-paired-atomic can be moved before the
                    fence.
     acq_rel      Same constraints as both acquire and release.
     seq_cst      - If a load atomic then same constraints as acquire, plus no
                    preceding sequentially consistent load atomic/store
                    atomic/atomicrmw/fence instruction can be moved after the
                    seq_cst.
                  - If a store atomic then the same constraints as release, plus
                    no following sequentially consistent load atomic/store
                    atomic/atomicrmw/fence instruction can be moved before the
                    seq_cst.
                  - If an atomicrmw/fence then same constraints as acq_rel.
     ============ ==============================================================

Trap Handler ABI
~~~~~~~~~~~~~~~~

For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible
runtimes (such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that
supports the ``s_trap`` instruction with the following usage:

  .. table:: AMDGPU Trap Handler for AMDHSA OS
     :name: amdgpu-trap-handler-for-amdhsa-os-table

     =================== =============== =============== =======================
     Usage               Code Sequence   Trap Handler    Description
                                         Inputs
     =================== =============== =============== =======================
     reserved            ``s_trap 0x00``                 Reserved by hardware.
     ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for HSA
                                           ``queue_ptr`` ``debugtrap``
                                         ``VGPR0``:      intrinsic (not
                                           ``arg``       implemented).
     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes dispatch to be
                                           ``queue_ptr`` terminated and its
                                                         associated queue put
                                                         into the error state.
     ``llvm.debugtrap``  ``s_trap 0x03``                 - If the debugger is
                                                           not installed,
                                                           behaves as a
                                                           no-operation. The
                                                           trap handler is
                                                           entered and
                                                           immediately returns
                                                           to continue
                                                           execution of the
                                                           wavefront.
                                                         - If the debugger is
                                                           installed, causes
                                                           the debug trap to be
                                                           reported by the
                                                           debugger and the
                                                           wavefront is put in
                                                           the halt state until
                                                           resumed by the
                                                           debugger.
     reserved            ``s_trap 0x04``                 Reserved.
     reserved            ``s_trap 0x05``                 Reserved.
     reserved            ``s_trap 0x06``                 Reserved.
     debugger breakpoint ``s_trap 0x07``                 Reserved for debugger
                                                         breakpoints.
     reserved            ``s_trap 0x08``                 Reserved.
     reserved            ``s_trap 0xfe``                 Reserved.
     reserved            ``s_trap 0xff``                 Reserved.
     =================== =============== =============== =======================

AMDPAL
------

This section provides code conventions used when the target triple OS is
``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters
from the application/runtime to each invocation of a hardware shader. These
parameters include both generic, application-controlled parameters called
*user data* as well as system-generated parameters that are a product of the
draw or dispatch execution.

User Data
~~~~~~~~~

Each hardware stage has a set of 32-bit *user data registers* which can be
written from a command buffer and then loaded into SGPRs when waves are launched
via a subsequent dispatch or draw operation. This is the way most arguments are
passed from the application/runtime to a hardware shader.

Compute User Data
~~~~~~~~~~~~~~~~~

Compute shader user data mappings are simpler than those of graphics shaders,
and are fixed.

Note that there are always 10 available *user data entries* in registers;
entries beyond that limit must be fetched from memory (via the spill table
pointer) by the shader.

  .. table:: PAL Compute Shader User Data Registers
     :name: pal-compute-user-data-registers

     ============= ================================
     User Register Description
     ============= ================================
     0             Global Internal Table (32-bit pointer)
     1             Per-Shader Internal Table (32-bit pointer)
     2 - 11        Application-Controlled User Data (10 32-bit values)
     12            Spill Table (32-bit pointer)
     13 - 14       Thread Group Count (64-bit pointer)
     15            GDS Range
     ============= ================================

Graphics User Data
~~~~~~~~~~~~~~~~~~

Graphics pipelines support a much more flexible user data mapping:

  .. table:: PAL Graphics Shader User Data Registers
     :name: pal-graphics-user-data-registers

     ============= ================================
     User Register Description
     ============= ================================
     0             Global Internal Table (32-bit pointer)
     +             Per-Shader Internal Table (32-bit pointer)
     + 1-15        Application-Controlled User Data
                   (1-15 Contiguous 32-bit Values in Registers)
     +             Spill Table (32-bit pointer)
     +             Draw Index (First Stage Only)
     +             Vertex Offset (First Stage Only)
     +             Instance Offset (First Stage Only)
     ============= ================================

  The placement of the global internal table remains fixed in the first *user
  data SGPR register*. Otherwise all parameters are optional, and can be mapped
  to any desired *user data SGPR register*, with the following restrictions:

  * Draw Index, Vertex Offset, and Instance Offset can only be used by the first
    active hardware stage in a graphics pipeline (i.e. where the API vertex
    shader runs).

  * Application-controlled user data must be mapped into a contiguous range of
    user data registers.

  * The application-controlled user data range supports compaction remapping, so
    only *entries* that are actually consumed by the shader must be assigned to
    corresponding *registers*. Note that in order to support an efficient
    runtime implementation, the remapping must pack *registers* in the same
    order as *entries*, with unused *entries* removed.

.. _pal_global_internal_table:

Global Internal Table
~~~~~~~~~~~~~~~~~~~~~

The global internal table is a table of *shader resource descriptors* (SRDs)
that define how certain engine-wide, runtime-managed resources should be
accessed from a shader. The majority of these resources have HW-defined formats,
and it is up to the compiler to write/read data as required by the target
hardware.

The following table illustrates the required format:

  .. table:: PAL Global Internal Table
     :name: pal-git-table

     ============= ================================
     Offset        Description
     ============= ================================
     0-3           Graphics Scratch SRD
     4-7           Compute Scratch SRD
     8-11          ES/GS Ring Output SRD
     12-15         ES/GS Ring Input SRD
     16-19         GS/VS Ring Output #0
     20-23         GS/VS Ring Output #1
     24-27         GS/VS Ring Output #2
     28-31         GS/VS Ring Output #3
     32-35         GS/VS Ring Input SRD
     36-39         Tessellation Factor Buffer SRD
     40-43         Off-Chip LDS Buffer SRD
     44-47         Off-Chip Param Cache Buffer SRD
     48-51         Sample Position Buffer SRD
     52            vaRange::ShadowDescriptorTable High Bits
     ============= ================================

  The pointer to the global internal table passed to the shader as user data
  is a 32-bit pointer. The top 32 bits should be assumed to be the same as
  the top 32 bits of the pipeline, so the shader may use the program
  counter's top 32 bits.

Unspecified OS
--------------

This section provides code conventions used when the target triple OS is
empty (see :ref:`amdgpu-target-triples`).

Trap Handler ABI
~~~~~~~~~~~~~~~~

For code objects generated by the AMDGPU backend for a non-amdhsa OS, the
runtime does not install a trap handler. The ``llvm.trap`` and
``llvm.debugtrap`` intrinsics are handled as follows:

  .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
     :name: amdgpu-trap-handler-for-non-amdhsa-os-table

     =============== =============== ===========================================
     Usage           Code Sequence   Description
     =============== =============== ===========================================
     llvm.trap       s_endpgm        Causes wavefront to be terminated.
     llvm.debugtrap  *none*          Compiler warning given that there is no
                                     trap handler installed.
     =============== =============== ===========================================

Source Languages
================

.. _amdgpu-opencl:

OpenCL
------

When the language is OpenCL the following differences occur:

1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
2. The AMDGPU backend appends additional arguments to the kernel's explicit
   arguments for the AMDHSA OS (see
   :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
3. Additional metadata is generated
   (see :ref:`amdgpu-amdhsa-code-object-metadata`).

  .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
     :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table

     ======== ==== ========= ===========================================
     Position Byte Byte      Description
              Size Alignment
     ======== ==== ========= ===========================================
     1        8    8         OpenCL Global Offset X
     2        8    8         OpenCL Global Offset Y
     3        8    8         OpenCL Global Offset Z
     4        8    8         OpenCL address of printf buffer
     5        8    8         OpenCL address of virtual queue used by
                             enqueue_kernel.
     6        8    8         OpenCL address of AqlWrap struct used by
                             enqueue_kernel.
     ======== ==== ========= ===========================================

.. _amdgpu-hcc:

HCC
---

When the language is HCC the following differences occur:

1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).

.. _amdgpu-assembler:

Assembler
---------

The AMDGPU backend has an LLVM-MC based assembler which is currently in
development. It supports AMDGCN GFX6-GFX9.

This section describes general syntax for instructions and operands.

Instructions
~~~~~~~~~~~~

.. toctree::
   :hidden:

   AMDGPUAsmGFX7
   AMDGPUAsmGFX8
   AMDGPUAsmGFX9
   AMDGPUOperandSyntax

An instruction has the following syntax:

  *<opcode> <operand0>, <operand1>,... <modifier0> <modifier1>...*

Note that operands are normally comma-separated while modifiers are
space-separated.

The order of operands and modifiers is fixed. Most modifiers are optional and
may be omitted.

See detailed instruction syntax description for :doc:`GFX7<AMDGPUAsmGFX7>`,
:doc:`GFX8<AMDGPUAsmGFX8>` and :doc:`GFX9<AMDGPUAsmGFX9>`.

Note that features under development are not included in this description.

For more information about instructions, their semantics and supported
combinations of operands, refer to one of the instruction set architecture
manuals [AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.

Operands
~~~~~~~~

The following syntax for register operands is supported:

* SGPR registers: s0, ... or s[0], ...
* VGPR registers: v0, ... or v[0], ...
* TTMP registers: ttmp0, ... or ttmp[0], ...
* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
* Register index expressions: v[2*2], s[1-1:2-1]
* 'off' indicates that an operand is not enabled

Modifiers
~~~~~~~~~

A detailed description of modifiers may be found
:doc:`here<AMDGPUOperandSyntax>`.

Instruction Examples
~~~~~~~~~~~~~~~~~~~~

DS
++

.. code-block:: nasm

  ds_add_u32 v2, v4 offset:16
  ds_write_src2_b64 v2 offset0:4 offset1:8
  ds_cmpst_f32 v2, v4, v6
  ds_min_rtn_f64 v[8:9], v2, v[4:5]

For a full list of supported instructions, refer to "LDS/GDS instructions" in
the ISA Manual.

FLAT
++++

.. code-block:: nasm

  flat_load_dword v1, v[3:4]
  flat_store_dwordx3 v[3:4], v[5:7]
  flat_atomic_swap v1, v[3:4], v5 glc
  flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
  flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc

For a full list of supported instructions, refer to "FLAT instructions" in the
ISA Manual.

MUBUF
+++++

.. code-block:: nasm

  buffer_load_dword v1, off, s[4:7], s1
  buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
  buffer_store_format_xy v[1:2], off, s[4:7], s1
  buffer_wbinvl1
  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc

For a full list of supported instructions, refer to "MUBUF Instructions" in the
ISA Manual.

SMRD/SMEM
+++++++++

.. code-block:: nasm

  s_load_dword s1, s[2:3], 0xfc
  s_load_dwordx8 s[8:15], s[2:3], s4
  s_load_dwordx16 s[88:103], s[2:3], s4
  s_dcache_inv_vol
  s_memtime s[4:5]

For a full list of supported instructions, refer to "Scalar Memory Operations"
in the ISA Manual.

SOP1
++++

.. code-block:: nasm

  s_mov_b32 s1, s2
  s_mov_b64 s[0:1], 0x80000000
  s_cmov_b32 s1, 200
  s_wqm_b64 s[2:3], s[4:5]
  s_bcnt0_i32_b64 s1, s[2:3]
  s_swappc_b64 s[2:3], s[4:5]
  s_cbranch_join s[4:5]

For a full list of supported instructions, refer to "SOP1 Instructions" in the
ISA Manual.

SOP2
++++

.. code-block:: nasm

  s_add_u32 s1, s2, s3
  s_and_b64 s[2:3], s[4:5], s[6:7]
  s_cselect_b32 s1, s2, s3
  s_andn2_b32 s2, s4, s6
  s_lshr_b64 s[2:3], s[4:5], s6
  s_ashr_i32 s2, s4, s6
  s_bfm_b64 s[2:3], s4, s6
  s_bfe_i64 s[2:3], s[4:5], s6
  s_cbranch_g_fork s[4:5], s[6:7]

For a full list of supported instructions, refer to "SOP2 Instructions" in the
ISA Manual.

SOPC
++++

.. code-block:: nasm

  s_cmp_eq_i32 s1, s2
  s_bitcmp1_b32 s1, s2
  s_bitcmp0_b64 s[2:3], s4
  s_setvskip s3, s5

For a full list of supported instructions, refer to "SOPC Instructions" in the
ISA Manual.

SOPP
++++

.. code-block:: nasm

  s_barrier
  s_nop 2
  s_endpgm
  s_waitcnt 0                                 ; Wait for all counters to be 0
  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
  s_waitcnt vmcnt(1)                          ; Wait for vmcnt counter to be 1.
  s_sethalt 9
  s_sleep 10
  s_sendmsg 0x1
  s_sendmsg sendmsg(MSG_INTERRUPT)
  s_trap 1

For a full list of supported instructions, refer to "SOPP Instructions" in the
ISA Manual.

Unless otherwise mentioned, little verification is performed on the operands
of SOPP Instructions, so it is up to the programmer to be familiar with the
range of acceptable values.

VALU
++++

For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler will automatically use optimal encoding based on its operands.
To force specific encoding, one can add a suffix to the opcode of the instruction:

* _e32 for 32-bit VOP1/VOP2/VOPC
* _e64 for 64-bit VOP3
* _dpp for VOP_DPP
* _sdwa for VOP_SDWA

VOP1/VOP2/VOP3/VOPC examples:

.. code-block:: nasm

  v_mov_b32 v1, v2
  v_mov_b32_e32 v1, v2
  v_nop
  v_cvt_f64_i32_e32 v[1:2], v2
  v_floor_f32_e32 v1, v2
  v_bfrev_b32_e32 v1, v2
  v_add_f32_e32 v1, v2, v3
  v_mul_i32_i24_e64 v1, v2, 3
  v_mul_i32_i24_e32 v1, -3, v3
  v_mul_i32_i24_e32 v1, -100, v3
  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
  v_max_f16_e32 v1, v2, v3

VOP_DPP examples:

.. code-block:: nasm

  v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
  v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_mov_b32 v0, v0 wave_shl:1
  v_mov_b32 v0, v0 row_mirror
  v_mov_b32 v0, v0 row_bcast:31
  v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0

VOP_SDWA examples:

.. code-block:: nasm

  v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
  v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
  v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0

For a full list of supported instructions, refer to "Vector ALU instructions".

.. TODO
   Remove once we switch to code object v3 by default.

HSA Code Object Directives
~~~~~~~~~~~~~~~~~~~~~~~~~~

The AMDGPU ABI defines auxiliary data in the output code object. In assembly
source, one can specify them with assembler directives.

.hsa_code_object_version major, minor
+++++++++++++++++++++++++++++++++++++

*major* and *minor* are integers that specify the version of the HSA code
object that will be generated by the assembler.

.hsa_code_object_isa [major, minor, stepping, vendor, arch]
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

*major*, *minor*, and *stepping* are all integers that describe the instruction
set architecture (ISA) version of the assembly program.

*vendor* and *arch* are quoted strings. *vendor* should always be equal to
"AMD" and *arch* should always be equal to "AMDGPU".

By default, the assembler will derive the ISA version, *vendor*, and *arch*
from the value of the -mcpu option that is passed to the assembler.

.amdgpu_hsa_kernel (name)
+++++++++++++++++++++++++

This directive specifies that the symbol with the given name is a kernel entry
point (label) and that the object should contain a corresponding symbol of type
STT_AMDGPU_HSA_KERNEL.

.amd_kernel_code_t
++++++++++++++++++

This directive marks the beginning of a list of key / value pairs that are used
to specify the amd_kernel_code_t object that will be emitted by the assembler.
The list must be terminated by the *.end_amd_kernel_code_t* directive. For
any amd_kernel_code_t values that are unspecified a default value will be
used. The default value for all keys is 0, with the following exceptions:

- *kernel_code_version_major* defaults to 1.
- *machine_kind* defaults to 1.
- *machine_version_major*, *machine_version_minor*, and
  *machine_version_stepping* are derived from the value of the -mcpu option
  that is passed to the assembler.
- *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6.
- *kernarg_segment_alignment*, *group_segment_alignment*, and
  *private_segment_alignment* default to 4. Note that alignments are specified
  as a power of two, so a value of **n** means an alignment of 2^ **n**.
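
As a rough illustration of the power-of-two encoding noted in the last item
above, the following Python sketch converts between a byte alignment and the
exponent stored in the three segment alignment keys. The helper names
``key_to_alignment`` and ``alignment_to_key`` are hypothetical and are not part
of LLVM or any AMD tool; the sketch assumes the alignment is expressed in bytes.

.. code-block:: python

  def key_to_alignment(n):
      """Return the alignment encoded by exponent n, that is 2 to the power n."""
      return 1 << n


  def alignment_to_key(alignment):
      """Return the exponent n for a power-of-two alignment."""
      if alignment <= 0 or alignment & (alignment - 1):
          raise ValueError("alignment must be a positive power of two")
      return alignment.bit_length() - 1


  # The default key value of 4 corresponds to an alignment of 16.
  default_alignment = key_to_alignment(4)

  # A kernel that needs an alignment of 64 would use a key value of 6.
  kernarg_key = alignment_to_key(64)

Under these assumptions, leaving *kernarg_segment_alignment* unspecified
requests 16-byte alignment, while a 64-byte requirement would be written as
``kernarg_segment_alignment = 6``.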

The *.amd_kernel_code_t* directive must be placed immediately after the
function label and before any instructions.

For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
the comments in lib/Target/AMDGPU/AmdKernelCodeT.h and
test/CodeGen/AMDGPU/hsa.s.

Here is an example of a minimal amd_kernel_code_t specification:

.. code-block:: none

  .hsa_code_object_version 1,0
  .hsa_code_object_isa

  .hsatext
  .globl hello_world
  .p2align 8
  .amdgpu_hsa_kernel hello_world

  hello_world:

     .amd_kernel_code_t
        enable_sgpr_kernarg_segment_ptr = 1
        is_ptr64 = 1
        compute_pgm_rsrc1_vgprs = 0
        compute_pgm_rsrc1_sgprs = 0
        compute_pgm_rsrc2_user_sgpr = 2
        kernarg_segment_byte_size = 8
        wavefront_sgpr_count = 2
        workitem_vgpr_count = 3
    .end_amd_kernel_code_t

    s_load_dwordx2 s[0:1], s[0:1] 0x0
    v_mov_b32 v0, 3.14159
    s_waitcnt lgkmcnt(0)
    v_mov_b32 v1, s0
    v_mov_b32 v2, s1
    flat_store_dword v[1:2], v0
    s_endpgm
  .Lfunc_end0:
       .size   hello_world, .Lfunc_end0-hello_world

Predefined Symbols (-mattr=+code-object-v3)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The AMDGPU assembler defines and updates some symbols automatically. These
symbols do not affect code generation.

.amdgcn.gfx_generation_number
+++++++++++++++++++++++++++++

Set to the GFX generation number of the target being assembled for. For
example, when assembling for a "GFX9" target this will be set to the integer
value "9". The possible GFX generation numbers are presented in
:ref:`amdgpu-processors`.

.amdgcn.next_free_vgpr
++++++++++++++++++++++

Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum VGPR number explicitly
referenced within that instruction then the symbol value is updated to equal
that VGPR number plus one.

May be used to set the ``.amdhsa_next_free_vgpr`` directive in
:ref:`amdhsa-kernel-directives-table`.

May be set at any time, e.g. manually set to zero at the start of each kernel.

.amdgcn.next_free_sgpr
++++++++++++++++++++++

Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum SGPR number explicitly
referenced within that instruction then the symbol value is updated to equal
that SGPR number plus one.

May be used to set the ``.amdhsa_next_free_sgpr`` directive in
:ref:`amdhsa-kernel-directives-table`.

May be set at any time, e.g. manually set to zero at the start of each kernel.

Code Object Directives (-mattr=+code-object-v3)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
architecture processors, and are not OS-specific. Directives which begin with
``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
:ref:`amdgpu-processors`.

.amdgcn_target <target>
+++++++++++++++++++++++

Optional directive which declares the target supported by the containing
assembler source file. Valid values are described in
:ref:`amdgpu-amdhsa-code-object-target-identification`. Used by the assembler
to validate command-line options such as ``-triple``, ``-mcpu``, and those
which specify target features.
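
The update rule given under Predefined Symbols for ``.amdgcn.next_free_vgpr``
and ``.amdgcn.next_free_sgpr`` can be sketched in Python as follows. This is a
simplified model and not the assembler's implementation: the helper names
``max_reg`` and ``track_next_free`` are hypothetical, and the regular
expression only recognizes plain registers (``v3``) and ranges (``v[4:7]``),
ignoring register lists, index expressions and special registers.

.. code-block:: python

  import re


  def max_reg(instruction, prefix):
      """Return the highest register number with the given prefix ('v' or 's')
      explicitly referenced in one instruction, or None if there is none."""
      pattern = re.compile(
          r"\b%s(?:(\d+)|\[(\d+)(?::(\d+))?\])" % re.escape(prefix))
      highest = None
      for single, low, high in pattern.findall(instruction):
          number = int(single or high or low)
          highest = number if highest is None else max(highest, number)
      return highest


  def track_next_free(instructions, prefix):
      """Apply the documented update rule across a list of instructions."""
      next_free = 0  # The symbol is set to zero before assembly begins.
      for instruction in instructions:
          referenced = max_reg(instruction, prefix)
          if referenced is not None and next_free <= referenced:
              next_free = referenced + 1
      return next_free


  # Instruction stream of the hello_world kernel used in this document.
  kernel = [
      "s_load_dwordx2 s[0:1], s[0:1] 0x0",
      "v_mov_b32 v0, 3.14159",
      "s_waitcnt lgkmcnt(0)",
      "v_mov_b32 v1, s0",
      "v_mov_b32 v2, s1",
      "flat_store_dword v[1:2], v0",
      "s_endpgm",
  ]
  next_free_vgpr = track_next_free(kernel, "v")
  next_free_sgpr = track_next_free(kernel, "s")

For this kernel the sketch yields 3 for VGPRs and 2 for SGPRs, which are the
kind of values the ``.amdhsa_next_free_vgpr`` and ``.amdhsa_next_free_sgpr``
directives expect.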

.amdhsa_kernel <name>
+++++++++++++++++++++

Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
``<name>.kd``, in the current location of the current section. Only valid when
the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
instruction to execute, and does not need to be previously defined.

Marks the beginning of a list of directives used to generate the bytes of a
kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
Directives which may appear in this list are described in
:ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
be valid for the target being assembled for, and cannot be repeated. Directives
support the range of values specified by the field they reference in
:ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
assumed to have its default value, unless it is marked as "Required", in which
case it is an error to omit the directive. This list of directives is
terminated by an ``.end_amdhsa_kernel`` directive.

  .. table:: AMDHSA Kernel Assembler Directives
     :name: amdhsa-kernel-directives-table

     ======================================================== ================ ============ ===================
     Directive                                                Default          Supported On Description
     ======================================================== ================ ============ ===================
     ``.amdhsa_group_segment_fixed_size``                     0                GFX6-GFX9    Controls GROUP_SEGMENT_FIXED_SIZE in
                                                                                            :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
     ``.amdhsa_private_segment_fixed_size``                   0                GFX6-GFX9    Controls PRIVATE_SEGMENT_FIXED_SIZE in
                                                                                            :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
     ``.amdhsa_user_sgpr_private_segment_buffer``             0                GFX6-GFX9    Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in
                                                                                            :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
     ``.amdhsa_user_sgpr_dispatch_ptr``                       0                GFX6-GFX9    Controls ENABLE_SGPR_DISPATCH_PTR in
                                                                                            :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
     ``.amdhsa_user_sgpr_queue_ptr``                          0                GFX6-GFX9    Controls ENABLE_SGPR_QUEUE_PTR in
                                                                                            :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
     ``.amdhsa_user_sgpr_kernarg_segment_ptr``                0                GFX6-GFX9    Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in
                                                                                            :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
     ``.amdhsa_user_sgpr_dispatch_id``                        0                GFX6-GFX9    Controls ENABLE_SGPR_DISPATCH_ID in
                                                                                            :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
     ``.amdhsa_user_sgpr_flat_scratch_init``                  0                GFX6-GFX9    Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in
                                                                                            :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
     ``.amdhsa_user_sgpr_private_segment_size``               0                GFX6-GFX9    Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in
                                                                                            :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
     ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0                GFX6-GFX9    Controls ENABLE_SGPR_PRIVATE_SEGMENT_WAVEFRONT_OFFSET in
                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
     ``.amdhsa_system_sgpr_workgroup_id_x``                   1                GFX6-GFX9    Controls ENABLE_SGPR_WORKGROUP_ID_X in
                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
     ``.amdhsa_system_sgpr_workgroup_id_y``                   0                GFX6-GFX9    Controls ENABLE_SGPR_WORKGROUP_ID_Y in
                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
     ``.amdhsa_system_sgpr_workgroup_id_z``                   0                GFX6-GFX9    Controls ENABLE_SGPR_WORKGROUP_ID_Z in
                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
     ``.amdhsa_system_sgpr_workgroup_info``                   0                GFX6-GFX9    Controls ENABLE_SGPR_WORKGROUP_INFO in
                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
     ``.amdhsa_system_vgpr_workitem_id``                      0                GFX6-GFX9    Controls ENABLE_VGPR_WORKITEM_ID in
                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
                                                                                            Possible values are defined in
                                                                                            :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
     ``.amdhsa_next_free_vgpr``                               Required         GFX6-GFX9    Maximum VGPR number explicitly referenced, plus one.
                                                                                            Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in
                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
     ``.amdhsa_next_free_sgpr``                               Required         GFX6-GFX9    Maximum SGPR number explicitly referenced, plus one.
                                                                                            Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
     ``.amdhsa_reserve_vcc``                                  1                GFX6-GFX9    Whether the kernel may use the special VCC SGPR.
                                                                                            Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
     ``.amdhsa_reserve_flat_scratch``                         1                GFX7-GFX9    Whether the kernel may use flat instructions to access
                                                                                            scratch memory. Used to calculate
                                                                                            GRANULATED_WAVEFRONT_SGPR_COUNT in
                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
     ``.amdhsa_reserve_xnack_mask``                           Target           GFX8-GFX9    Whether the kernel may trigger XNACK replay.
                                                              Feature                       Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in
                                                              Specific                      :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
                                                              (+xnack)
     ``.amdhsa_float_round_mode_32``                          0                GFX6-GFX9    Controls FLOAT_ROUND_MODE_32 in
                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
                                                                                            Possible values are defined in
                                                                                            :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
     ``.amdhsa_float_round_mode_16_64``                       0                GFX6-GFX9    Controls FLOAT_ROUND_MODE_16_64 in
                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
                                                                                            Possible values are defined in
                                                                                            :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
     ``.amdhsa_float_denorm_mode_32``                         0                GFX6-GFX9    Controls FLOAT_DENORM_MODE_32 in
                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
                                                                                            Possible values are defined in
                                                                                            :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
     ``.amdhsa_float_denorm_mode_16_64``                      3                GFX6-GFX9    Controls FLOAT_DENORM_MODE_16_64 in
                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
                                                                                            Possible values are defined in
                                                                                            :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
     ``.amdhsa_dx10_clamp``                                   1                GFX6-GFX9    Controls ENABLE_DX10_CLAMP in
                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
     ``.amdhsa_ieee_mode``                                    1                GFX6-GFX9    Controls ENABLE_IEEE_MODE in
                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
     ``.amdhsa_fp16_overflow``                                0                GFX9         Controls FP16_OVFL in
                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
     ``.amdhsa_exception_fp_ieee_invalid_op``                 0                GFX6-GFX9    Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in
                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
     ``.amdhsa_exception_fp_denorm_src``                      0                GFX6-GFX9    Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in
                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
     ``.amdhsa_exception_fp_ieee_div_zero``                   0                GFX6-GFX9    Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in
                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
     ``.amdhsa_exception_fp_ieee_overflow``                   0                GFX6-GFX9    Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in
                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
     ``.amdhsa_exception_fp_ieee_underflow``                  0                GFX6-GFX9    Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in
                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
     ``.amdhsa_exception_fp_ieee_inexact``                    0                GFX6-GFX9    Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in
                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
     ``.amdhsa_exception_int_div_zero``                       0                GFX6-GFX9    Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in
                                                                                            :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
     ======================================================== ================ ============ ===================

Example HSA Source Code (-mattr=+code-object-v3)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here is an example of a minimal assembly source file, defining one HSA kernel:

.. code-block:: nasm

  .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional

  .text
  .globl hello_world
  .p2align 8
  .type hello_world,@function
  hello_world:
    s_load_dwordx2 s[0:1], s[0:1] 0x0
    v_mov_b32 v0, 3.14159
    s_waitcnt lgkmcnt(0)
    v_mov_b32 v1, s0
    v_mov_b32 v2, s1
    flat_store_dword v[1:2], v0
    s_endpgm
  .Lfunc_end0:
    .size   hello_world, .Lfunc_end0-hello_world

  .rodata
  .p2align 6
  .amdhsa_kernel hello_world
    .amdhsa_user_sgpr_kernarg_segment_ptr 1
    .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
    .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
  .end_amdhsa_kernel


Additional Documentation
========================

.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
.. [CLANG-ATTR] `Attributes in Clang <http://clang.llvm.org/docs/AttributeReference.html>`__