// Copyright 2023-2024 The Khronos Group Inc.
//
// SPDX-License-Identifier: CC-BY-4.0

# VK_AMDX_shader_enqueue
:toc: left
:refpage: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/
:sectnums:

This extension adds the ability for developers to enqueue compute workgroups from a shader.

## Problem Statement

Applications are increasingly using more complex renderers, often incorporating multiple compute passes that classify, sort, or otherwise preprocess input data.
These passes may be used to determine how future work is performed on the GPU, but triggering that future GPU work requires either a round trip to the host, or going through buffer memory and using indirect commands.
Host round trips necessarily incur additional system bandwidth and latency, as command buffers need to be built and transmitted back to the GPU.
Indirect commands work well in many cases, but they have little flexibility when it comes to determining what is actually dispatched; they must be enqueued ahead of time, synchronized with heavy API barriers, and execute with a single pre-recorded pipeline.

Whilst latency can sometimes be hidden, and indirect commands can cover cases where additional latency and bandwidth are not acceptable, recent engine developments such as Unreal 5's Nanite technology explicitly require the flexibility of shader selection _and_ low latency.
A desirable solution should provide the flexibility required for these systems, while keeping the execution loop firmly on the GPU.


## Solution Space

Three main possibilities exist:

  . Extend indirect commands
  . VK_NV_device_generated_commands
  . Shader enqueue

More flexible indirect commands could feasibly allow things like shader selection, introduce more complex flow control, or include indirect state setting commands.
The main issue is that these always require parameters to be written through regular buffer memory, and that buffer memory has to be sized for each indirect command to handle the maximum number of possibilities.
As well as the large allocation size causing memory pressure, pushing all that data through buffer memory will reduce the bandwidth available for other operations.
All of this could cause bottlenecks elsewhere in the pipeline.
Hypothetically a new interface for better scheduling/memory management could be introduced, but that starts looking a lot like option 3.

Option 2 - implementing a cross-vendor equivalent of VK_NV_device_generated_commands - would be a workable solution that adds flexibility and avoids a CPU round trip.
The reason it has not enjoyed wider support is due to concerns about how the commands are generated - it uses a tokenised API which has to be processed by the GPU before it can be executed.
For existing GPUs this can mean doing things like running a single compute shader invocation to process each token stream into a runnable command buffer, adding both latency and bandwidth on the GPU.

Option 3 - OpenCL and CUDA have had some form of shader enqueue API for a while, where the focus has typically been on enabling developers and on compute workloads.
From a user interface perspective these APIs have had a decent amount of battle testing, and they provide a popular and flexible interface.

This proposal is built around something like Option 3, but extended to be explicit and performant.


## Proposal

### API Changes

#### Graph Pipelines

In order to facilitate dispatch of multiple shaders from the GPU, the implementation needs some information about how pipelines will be launched and synchronized.
This proposal introduces a new _execution graph pipeline_ that defines execution paths between multiple shaders, and allows dynamic execution of different shaders.

[source,c]
----
VkResult vkCreateExecutionGraphPipelinesAMDX(
    VkDevice                                        device,
    VkPipelineCache                                 pipelineCache,
    uint32_t                                        createInfoCount,
    const VkExecutionGraphPipelineCreateInfoAMDX*    pCreateInfos,
    const VkAllocationCallbacks*                    pAllocator,
    VkPipeline*                                     pPipelines);

typedef struct VkExecutionGraphPipelineCreateInfoAMDX {
    VkStructureType                             sType;
    const void*                                 pNext;
    VkPipelineCreateFlags                       flags;
    uint32_t                                    stageCount;
    const VkPipelineShaderStageCreateInfo*      pStages;
    const VkPipelineLibraryCreateInfoKHR*       pLibraryInfo;
    VkPipelineLayout                            layout;
    VkPipeline                                  basePipelineHandle;
    int32_t                                     basePipelineIndex;
} VkExecutionGraphPipelineCreateInfoAMDX;
----

Shaders defined by `pStages` and any pipelines in `pLibraryInfo->pLibraries` define the possible nodes of the graph.
The linkage between nodes, however, is defined wholly in shader code.

Shaders in `pStages` must be in the `GLCompute` execution model, and may have the *CoalescingAMDX* execution mode.
Pipelines in `pLibraries` can be compute pipelines or other graph pipelines created with the `VK_PIPELINE_CREATE_LIBRARY_BIT_KHR` flag bit.

Each shader in an execution graph is associated with a name and an index, which are used to identify the target shader when dispatching a payload.
The `VkPipelineShaderStageNodeCreateInfoAMDX` structure allows the name and index of a shader node to be specified, and can be chained to the link:{refpage}VkPipelineShaderStageCreateInfo.html[VkPipelineShaderStageCreateInfo] structure.

[source,c]
----
const uint32_t VK_SHADER_INDEX_UNUSED_AMDX = 0xFFFFFFFF;

typedef struct VkPipelineShaderStageNodeCreateInfoAMDX {
    VkStructureType                             sType;
    const void*                                 pNext;
    const char*                                 pName;
    uint32_t                                    index;
} VkPipelineShaderStageNodeCreateInfoAMDX;
----

* `index` sets the index value for a shader.
* `pName` allows applications to override the name specified in SPIR-V by *OpEntryPoint*.

If `pName` is `NULL` then the original name is used, as specified by `VkPipelineShaderStageCreateInfo::pName`.
If `index` is `VK_SHADER_INDEX_UNUSED_AMDX` then the original index is used, either as specified by the `ShaderIndexAMDX` execution mode, or `0` if that too is not specified.
If this structure is not provided, `pName` defaults to `NULL`, and `index` defaults to `VK_SHADER_INDEX_UNUSED_AMDX`.

When dispatching from another shader, the index is dynamic and can be specified in uniform control flow - however the name must be statically declared as a decoration on the payload.
Allowing the index to be set dynamically lets applications stream shaders in and out dynamically, by simply changing constant data and relinking the graph pipeline from new libraries.
Shaders with the same name and different indexes must consume identical payloads and have the same execution model.
Shaders with the same name in an execution graph pipeline must have unique indexes.
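
As an illustration only (not part of the extension text), creating a small graph with one node whose name and index are overridden might look like the following sketch; the `sType` enumerant names are assumed to follow the usual Vulkan naming pattern, and `entryModule`, `materialModule`, and `pipelineLayout` are assumed to already exist:

[source,c]
----
// Hypothetical example: one "entry" node plus one "material_node" node whose
// name and index are overridden via VkPipelineShaderStageNodeCreateInfoAMDX.
VkPipelineShaderStageNodeCreateInfoAMDX nodeInfo = {
    .sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_NODE_CREATE_INFO_AMDX,
    .pName = "material_node",   // overrides the name from OpEntryPoint
    .index = 1,                 // explicit shader index
};

VkPipelineShaderStageCreateInfo stages[2] = {
    { .sType  = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO,
      .stage  = VK_SHADER_STAGE_COMPUTE_BIT,
      .module = entryModule,
      .pName  = "main" },
    { .sType  = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO,
      .pNext  = &nodeInfo,      // chained name/index override
      .stage  = VK_SHADER_STAGE_COMPUTE_BIT,
      .module = materialModule,
      .pName  = "main" },
};

VkExecutionGraphPipelineCreateInfoAMDX createInfo = {
    .sType      = VK_STRUCTURE_TYPE_EXECUTION_GRAPH_PIPELINE_CREATE_INFO_AMDX,
    .stageCount = 2,
    .pStages    = stages,
    .layout     = pipelineLayout,
};

VkPipeline graphPipeline;
vkCreateExecutionGraphPipelinesAMDX(device, VK_NULL_HANDLE, 1, &createInfo,
                                    NULL, &graphPipeline);
----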

#### Scratch Memory

Implementations may need scratch memory to manage dispatch queues or similar when executing a pipeline graph, and this is explicitly managed by the application.

[source,c]
----
typedef struct VkExecutionGraphPipelineScratchSizeAMDX {
    VkStructureType                     sType;
    void*                               pNext;
    VkDeviceSize                        size;
} VkExecutionGraphPipelineScratchSizeAMDX;

VkResult vkGetExecutionGraphPipelineScratchSizeAMDX(
    VkDevice                                device,
    VkPipeline                              executionGraph,
    VkExecutionGraphPipelineScratchSizeAMDX* pSizeInfo);
----

Applications can query the amount of scratch memory required for a given pipeline, and the address of a buffer of that size must be provided when calling `vkCmdDispatchGraphAMDX`.
The amount of scratch memory needed by a given pipeline is related to the number and size of payloads across the whole graph; while the exact relationship is implementation dependent, reducing the number of unique nodes (those with different name strings) and the size of payloads can reduce scratch memory consumption.
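
As a sketch, the query might be issued as follows (the `sType` enumerant name is assumed to follow the usual Vulkan naming pattern):

[source,c]
----
VkExecutionGraphPipelineScratchSizeAMDX scratchSize = {
    .sType = VK_STRUCTURE_TYPE_EXECUTION_GRAPH_PIPELINE_SCRATCH_SIZE_AMDX,
};
vkGetExecutionGraphPipelineScratchSizeAMDX(device, graphPipeline, &scratchSize);
// scratchSize.size now holds the number of bytes of scratch memory required.
----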

Buffers created for this purpose must use the new buffer usage flags:

[source,c]
----
VK_BUFFER_USAGE_EXECUTION_GRAPH_SCRATCH_BIT_AMDX
VK_BUFFER_USAGE_2_EXECUTION_GRAPH_SCRATCH_BIT_AMDX
----

Scratch memory needs to be initialized against a graph pipeline before it can be used with that graph for the first time, using the following command:

[source,c]
----
void vkCmdInitializeGraphScratchMemoryAMDX(
    VkCommandBuffer                             commandBuffer,
    VkDeviceAddress                             scratch);
----

This command initializes the scratch memory at `scratch` for the currently bound execution graph pipeline.
Scratch memory will need to be re-initialized if it is going to be reused with a different execution graph pipeline, but can be used with the same pipeline repeatedly without re-initialization.
Scratch memory initialization can be synchronized using the compute pipeline stage `VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT` and shader write access flag `VK_ACCESS_SHADER_WRITE_BIT`.
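
A minimal sketch of the initialization and a following barrier, assuming the execution graph pipeline has already been bound and `scratch` is the device address of a buffer created with the scratch usage flag:

[source,c]
----
// Initialize scratch memory for the currently bound execution graph pipeline.
vkCmdInitializeGraphScratchMemoryAMDX(commandBuffer, scratch);

// Make the initialization visible to a subsequent graph dispatch.
VkMemoryBarrier barrier = {
    .sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
    .srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT,
    .dstAccessMask = VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_SHADER_WRITE_BIT,
};
vkCmdPipelineBarrier(commandBuffer,
                     VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                     VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                     0, 1, &barrier, 0, NULL, 0, NULL);
----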


#### Dispatch a graph

Once an execution graph has been created and scratch memory has been initialized for it, the following commands can be used to execute the graph:

[source,c]
----
typedef struct VkDispatchGraphInfoAMDX {
    uint32_t                                    nodeIndex;
    uint32_t                                    payloadCount;
    VkDeviceOrHostAddressConstAMDX              payloads;
    uint64_t                                    payloadStride;
} VkDispatchGraphInfoAMDX;

typedef struct VkDispatchGraphCountInfoAMDX {
    uint32_t                                    count;
    VkDeviceOrHostAddressConstAMDX              infos;
    uint64_t                                    stride;
} VkDispatchGraphCountInfoAMDX;

void vkCmdDispatchGraphAMDX(
    VkCommandBuffer                             commandBuffer,
    VkDeviceAddress                             scratch,
    const VkDispatchGraphCountInfoAMDX*         pCountInfo);

void vkCmdDispatchGraphIndirectAMDX(
    VkCommandBuffer                             commandBuffer,
    VkDeviceAddress                             scratch,
    const VkDispatchGraphCountInfoAMDX*         pCountInfo);

void vkCmdDispatchGraphIndirectCountAMDX(
    VkCommandBuffer                             commandBuffer,
    VkDeviceAddress                             scratch,
    VkDeviceAddress                             countInfo);
----

Each of the above commands enqueues an array of nodes in the bound execution graph pipeline with separate payloads, according to the contents of the `VkDispatchGraphCountInfoAMDX` and `VkDispatchGraphInfoAMDX` structures.

`vkCmdDispatchGraphAMDX` takes all of its arguments from host pointers.
`VkDispatchGraphCountInfoAMDX::infos.hostAddress` is a pointer to an array of `VkDispatchGraphInfoAMDX` structures,
with stride equal to `VkDispatchGraphCountInfoAMDX::stride` and `VkDispatchGraphCountInfoAMDX::count` elements.

`vkCmdDispatchGraphIndirectAMDX` consumes most parameters on the host, but uses the device address for `VkDispatchGraphCountInfoAMDX::infos`, and also treats `payloads` parameters as device addresses.

`vkCmdDispatchGraphIndirectCountAMDX` consumes `countInfo` on the device, and all child parameters also use device addresses.

Data consumed via a device address must be from buffers created with the `VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT` and `VK_BUFFER_USAGE_INDIRECT_BUFFER_BIT` flags.
`payloads` is a pointer to a linear array of payloads in memory, with a stride equal to `payloadStride`.
`payloadCount` may be `0`.
`scratch` may be used by the implementation to hold temporary data during graph execution, and can be synchronized using the compute pipeline stage and shader write access flags.

These dispatch commands must not be recorded in protected command buffers or secondary command buffers.
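
As a sketch, a host-driven dispatch of a single node with one payload might look like this; `MyPayload` is a hypothetical application-defined payload type (its layout is sketched after the next paragraphs), and `nodeIndex` is assumed to have been queried with `vkGetExecutionGraphPipelineNodeIndexAMDX`, described below:

[source,c]
----
MyPayload payload = {
    .dispatch = { 64, 1, 1 },   // workgroup counts; see the payload layout below
    // ... user data ...
};

VkDispatchGraphInfoAMDX dispatchInfo = {
    .nodeIndex     = nodeIndex,
    .payloadCount  = 1,
    .payloads      = { .hostAddress = &payload },
    .payloadStride = sizeof(MyPayload),
};

VkDispatchGraphCountInfoAMDX countInfo = {
    .count  = 1,
    .infos  = { .hostAddress = &dispatchInfo },
    .stride = sizeof(VkDispatchGraphInfoAMDX),
};

vkCmdDispatchGraphAMDX(commandBuffer, scratch, &countInfo);
----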

If a selected node does not include a `StaticNumWorkgroupsAMDX` or `CoalescingAMDX` declaration, the first part of each element of `payloads` must be a `VkDispatchIndirectCommand` structure, indicating the number of workgroups to dispatch in each dimension.
If an input payload variable in `NodePayloadAMDX` storage class is defined in the shader, its structure type *must* include link:{refpage}VkDispatchIndirectCommand.html[VkDispatchIndirectCommand] in its first 12 bytes.
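
For example, the hypothetical `MyPayload` type used in the sketch above might be laid out as follows for a node that obeys a payload-specified dispatch size:

[source,c]
----
// Hypothetical payload layout for a node without StaticNumWorkgroupsAMDX or
// CoalescingAMDX: the dispatch size occupies the first 12 bytes, followed by
// whatever user data the node's shader expects.
typedef struct MyPayload {
    VkDispatchIndirectCommand dispatch;   // x, y, z workgroup counts
    uint32_t                  firstItem;  // user-defined data starts here
    float                     scale;
} MyPayload;
----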

If that node does not include a `MaxNumWorkgroupsAMDX` declaration, it is assumed that the node may be dispatched with a grid size up to `VkPhysicalDeviceLimits::maxComputeWorkGroupCount`.

If that node does not include a `CoalescingAMDX` declaration, all data in the payload is broadcast to all workgroups dispatched in this way.
If that node includes a `CoalescingAMDX` declaration, data in the payload will be consumed by exactly one workgroup.
There is no guarantee of how payloads will be consumed by `CoalescingAMDX` nodes.

The `nodeIndex` is a unique integer identifier for a specific shader name and shader index (defined by `VkPipelineShaderStageNodeCreateInfoAMDX`) added to the executable graph pipeline.
`vkGetExecutionGraphPipelineNodeIndexAMDX` can be used to query the identifier for a given node:

[source,c]
----
VkResult vkGetExecutionGraphPipelineNodeIndexAMDX(
    VkDevice                                        device,
    VkPipeline                                      executionGraph,
    const VkPipelineShaderStageNodeCreateInfoAMDX*   pNodeInfo,
    uint32_t*                                       pNodeIndex);
----

`pNodeInfo` specifies the shader name and index as set up when creating the pipeline, with the associated node index returned in `pNodeIndex`.
When used with this function, `pNodeInfo->pName` must not be `NULL`.
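
For example, querying the node index for the hypothetical "material_node" shader used in the earlier creation sketch might look like this:

[source,c]
----
VkPipelineShaderStageNodeCreateInfoAMDX nodeId = {
    .sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_NODE_CREATE_INFO_AMDX,
    .pName = "material_node",
    .index = 1,
};

uint32_t nodeIndex;
vkGetExecutionGraphPipelineNodeIndexAMDX(device, graphPipeline, &nodeId, &nodeIndex);
// nodeIndex can now be used in VkDispatchGraphInfoAMDX::nodeIndex.
----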

[NOTE]
====
To summarize, execution graphs use two kinds of indexes:

. _shader index_ specified in `VkPipelineShaderStageNodeCreateInfoAMDX` and used to enqueue payloads,
. _node index_ specified in `VkDispatchGraphInfoAMDX` and used only for launching the graph from a command buffer.
====

Execution graph pipelines and their resources are bound using a new pipeline bind point:

[source,c]
----
VK_PIPELINE_BIND_POINT_EXECUTION_GRAPH_AMDX
----
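
Binding otherwise follows the existing pattern; as a sketch, assuming `graphPipeline`, `pipelineLayout`, and `descriptorSet` already exist:

[source,c]
----
vkCmdBindPipeline(commandBuffer, VK_PIPELINE_BIND_POINT_EXECUTION_GRAPH_AMDX,
                  graphPipeline);
vkCmdBindDescriptorSets(commandBuffer, VK_PIPELINE_BIND_POINT_EXECUTION_GRAPH_AMDX,
                        pipelineLayout, 0, 1, &descriptorSet, 0, NULL);
----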


#### Properties

The following new properties are added to Vulkan:

[source,c]
----
typedef struct VkPhysicalDeviceShaderEnqueuePropertiesAMDX {
    VkStructureType                     sType;
    void*                               pNext;
    uint32_t                            maxExecutionGraphDepth;
    uint32_t                            maxExecutionGraphShaderOutputNodes;
    uint32_t                            maxExecutionGraphShaderPayloadSize;
    uint32_t                            maxExecutionGraphShaderPayloadCount;
    uint32_t                            executionGraphDispatchAddressAlignment;
} VkPhysicalDeviceShaderEnqueuePropertiesAMDX;
----

Each limit is defined as follows:

  * `maxExecutionGraphDepth` defines the maximum node chain length in the graph, and must be at least 32.
  The dispatched node is at depth 1 and the node enqueued by it is at depth 2, and so on.
  If a node uses tail recursion, each recursive call increases the depth by 1 as well.
  * `maxExecutionGraphShaderOutputNodes` specifies the maximum number of unique nodes that can be dispatched from a single shader, and must be at least 256.
  * `maxExecutionGraphShaderPayloadSize` specifies the maximum total size of payload declarations in a shader, and must be at least 32KB.
  * `maxExecutionGraphShaderPayloadCount` specifies the maximum number of output payloads that can be initialized in a single workgroup, and must be at least 256.
  * `executionGraphDispatchAddressAlignment` specifies the alignment of non-scratch `VkDeviceAddress` arguments consumed by graph dispatch commands, and must be no more than 4 bytes.
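
These can be queried through the usual `VkPhysicalDeviceProperties2` chain; a sketch, assuming the `sType` enumerant name follows the usual Vulkan naming pattern:

[source,c]
----
VkPhysicalDeviceShaderEnqueuePropertiesAMDX enqueueProperties = {
    .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SHADER_ENQUEUE_PROPERTIES_AMDX,
};
VkPhysicalDeviceProperties2 properties2 = {
    .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2,
    .pNext = &enqueueProperties,
};
vkGetPhysicalDeviceProperties2(physicalDevice, &properties2);
----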


#### Features

The following new feature is added to Vulkan:

[source,c]
----
typedef struct VkPhysicalDeviceShaderEnqueueFeaturesAMDX {
    VkStructureType                     sType;
    void*                               pNext;
    VkBool32                            shaderEnqueue;
} VkPhysicalDeviceShaderEnqueueFeaturesAMDX;
----

The `shaderEnqueue` feature enables all functionality in this extension.
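
As with other feature structures, it can be queried via `vkGetPhysicalDeviceFeatures2` and enabled by chaining it into device creation; a sketch, assuming the `sType` enumerant name follows the usual Vulkan naming pattern:

[source,c]
----
VkPhysicalDeviceShaderEnqueueFeaturesAMDX enqueueFeatures = {
    .sType         = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SHADER_ENQUEUE_FEATURES_AMDX,
    .shaderEnqueue = VK_TRUE,
};
VkDeviceCreateInfo deviceCreateInfo = {
    .sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO,
    .pNext = &enqueueFeatures,
    // Queue create infos and enabled extensions (including VK_AMDX_shader_enqueue)
    // are omitted from this sketch.
};
----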


### SPIR-V Changes

A new capability is added:

[cols="1,10,8",options="header"]
|====
2+^.^| Capability | Enabling Capabilities
| 5067 | *ShaderEnqueueAMDX* +
Uses shader enqueue capabilities | *Shader*
|====

A new storage class is added:

[cols="1,10,8",options="header"]
|====
2+^.^| Storage Class | Enabling Capabilities
| 5068 | *NodePayloadAMDX* +
Input payload from a node dispatch. +
In the *GLCompute* execution model with the *CoalescingAMDX* execution mode, it is visible across all functions in all invocations in a workgroup; otherwise it is visible across all functions in all invocations in a dispatch. +
Variables declared with this storage class are read-write, and must not have initializers.
| *ShaderEnqueueAMDX*
| 5076 | *NodeOutputPayloadAMDX* +
Output payload to be used for dispatch. +
Variables declared with this storage class are read-write, must not have initializers, and must be initialized with *OpInitializeNodePayloadsAMDX* before they are accessed. +
Once initialized, a variable declared with this storage class is visible to all invocations in the declared _Scope_. +
Valid in *GLCompute* execution models.
| *ShaderEnqueueAMDX*
|====

An entry point must only declare one variable in the `NodePayloadAMDX` storage class in its interface.

New execution modes are added:

[cols="1,10,3,3,3,8",options="header"]
|====
2+^.^| Execution Mode 3+| Extra Operands | Enabling Capabilities
| 5069 | *CoalescingAMDX* +
Indicates that a GLCompute shader has coalescing semantics. (GLCompute only) +
 +
Must not be declared alongside *StaticNumWorkgroupsAMDX* or *MaxNumWorkgroupsAMDX*.
3+|
|*ShaderEnqueueAMDX*
| 5071 | *MaxNodeRecursionAMDX* +
Maximum number of times a node can enqueue itself.
3+| _<id>_ +
_Number of recursions_
|*ShaderEnqueueAMDX*
| 5072 | *StaticNumWorkgroupsAMDX* +
Statically declare the number of workgroups dispatched for this shader, instead of obeying an API- or payload-specified value. Values are reflected in the NumWorkgroups built-in value. (GLCompute only) +
 +
Must not be declared alongside *CoalescingAMDX* or *MaxNumWorkgroupsAMDX*.
| _<id>_ +
_x size_
| _<id>_ +
_y size_
| _<id>_ +
_z size_
|*ShaderEnqueueAMDX*
| 5077 | *MaxNumWorkgroupsAMDX* +
Declare the maximum number of workgroups dispatched for this shader. Dispatches must not exceed this value. (GLCompute only) +
 +
Must not be declared alongside *CoalescingAMDX* or *StaticNumWorkgroupsAMDX*.
| _<id>_ +
_x size_
| _<id>_ +
_y size_
| _<id>_ +
_z size_
|*ShaderEnqueueAMDX*
| 5073 | *ShaderIndexAMDX* +
Declare the node index for this shader. (GLCompute only) 3+| _<id>_ +
_Shader Index_
|*ShaderEnqueueAMDX*
|====

A shader module declaring the `ShaderEnqueueAMDX` capability must only be used in execution graph pipelines created by
the `vkCreateExecutionGraphPipelinesAMDX` command.

`MaxNodeRecursionAMDX` must be specified if a shader re-enqueues itself, which takes place if that shader
initializes and finalizes a payload for the same node _name_ and _index_. Other forms of recursion are not allowed.

An application must not dispatch the shader with a number of workgroups in any dimension greater than the values specified by `MaxNumWorkgroupsAMDX`.

`StaticNumWorkgroupsAMDX` allows the number of workgroups to dispatch to be coded into the shader itself, which can be useful for optimizing some algorithms. When a compute shader is dispatched using the existing `vkCmdDispatchGraph*` commands, any workgroup counts specified there are overridden. When enqueuing such shaders with a payload, no dispatch size is consumed from the payload before user-specified data begins.

The values of `MaxNumWorkgroupsAMDX` and `StaticNumWorkgroupsAMDX` must be less than or equal to `link:{refpage}VkPhysicalDeviceLimits.html[VkPhysicalDeviceLimits]::maxComputeWorkGroupCount`.

The arguments to each of these execution modes must be a constant 32-bit integer value, and may be supplied via specialization constants.

When a *GLCompute* shader is being used in an execution graph, `NumWorkgroups` must not be used.

When *CoalescingAMDX* is used, it has the following effects on a compute shader's inputs and outputs:

 - The `WorkgroupId` built-in is always `(0,0,0)`
   - NB: This affects related built-ins like `GlobalInvocationId`
   - Similar to `StaticNumWorkgroupsAMDX`, no dispatch size is consumed from the payload
 - The input in the `NodePayloadAMDX` storage class must have a type of *OpTypeArray* or *OpTypeRuntimeArray*.
   - This input must be decorated with `NodeMaxPayloadsAMDX`, indicating the number of payloads that can be received.
   - The number of payloads received is provided in the `CoalescedInputCountAMDX` built-in.
   - If *OpTypeArray* is used, that input's array length must be equal to the size indicated by the `NodeMaxPayloadsAMDX` decoration.
New decorations are added:

[cols="1,10,3,4",options="header"]
|====
2+^.^| Decoration | Extra Operands | Enabling Capabilities
| 5020 | *NodeMaxPayloadsAMDX* +
Must only be used to decorate a variable in the *NodeOutputPayloadAMDX* or *NodePayloadAMDX* storage class. +
 +
Variables in the *NodeOutputPayloadAMDX* storage class must have this decoration.
If such a variable is decorated, the operand indicates the maximum number of payloads in the array +
as well as the maximum number of payloads that can be allocated by a single workgroup for this output. +
 +
Variables in the *NodePayloadAMDX* storage class must have this decoration if the *CoalescingAMDX* execution mode is specified, otherwise they must not.
If such a variable is decorated, the operand indicates the maximum number of payloads in the array. +
| _<id>_ +
_Max number of payloads_
|*ShaderEnqueueAMDX*
| 5019 | *NodeSharesPayloadLimitsWithAMDX* +
Decorates a variable in the *NodeOutputPayloadAMDX* storage class to indicate that it shares output resources with _Payload Array_ when dispatched. +
 +
Without the decoration, each variable's resources are separately allocated against the output limits; by using the decoration only the limit of _Payload Array_ is considered.
Applications must still ensure that at runtime the actual usage does not exceed these limits, as this decoration only relaxes static validation. +
 +
Must only be used to decorate a variable in the *NodeOutputPayloadAMDX* storage class,
_Payload Array_ must be a different variable in the *NodeOutputPayloadAMDX* storage class, and
_Payload Array_ must not be itself decorated with *NodeSharesPayloadLimitsWithAMDX*. +
 +
It is only necessary to decorate one variable to indicate sharing between two node outputs.
Multiple variables can be decorated with the same _Payload Array_ to indicate sharing across multiple node outputs.
| _<id>_ +
_Payload Array_
|*ShaderEnqueueAMDX*
| 5091 | *PayloadNodeNameAMDX* +
Decorates a variable in the *NodeOutputPayloadAMDX* storage class to indicate that the payloads in the array
will be enqueued for the shader with _Node Name_. +
 +
Must only be used to decorate a variable that is initialized by *OpInitializeNodePayloadsAMDX*.
| _Literal_ +
_Node Name_
|*ShaderEnqueueAMDX*
| 5078 | *TrackFinishWritingAMDX* +
Decorates a variable in the *NodeOutputPayloadAMDX* or *NodePayloadAMDX* storage class to indicate that a payload that is first
enqueued and then accessed in a receiving shader, will be used with *OpFinishWritingNodePayloadAMDX* instruction. +
 +
Must only be used to decorate a variable in the *NodeOutputPayloadAMDX* or *NodePayloadAMDX* storage class. +
 +
Must not be used to decorate a variable in the *NodePayloadAMDX* storage class if the shader uses *CoalescingAMDX* execution mode. +
 +
If a variable in *NodeOutputPayloadAMDX* storage class is decorated, then a matching variable with *NodePayloadAMDX* storage class
in the receiving shader must be decorated as well. +
 +
If a variable in *NodePayloadAMDX* storage class is decorated, then a matching variable with *NodeOutputPayloadAMDX* storage class
in the enqueuing shader must be decorated as well. +
|
|*ShaderEnqueueAMDX*
|====

The *NodeSharesPayloadLimitsWithAMDX* decoration allows more control over the `maxExecutionGraphShaderPayloadSize` limit, and can be useful when a shader may output a large number of payloads, but potentially to different nodes.

Two new built-ins are provided:

[cols="1,10,8",options="header"]
|====
2+^.^| BuiltIn | Enabling Capabilities
| 5073 | *ShaderIndexAMDX* +
Index assigned to the current shader.
|*ShaderEnqueueAMDX*
| 5021 | *CoalescedInputCountAMDX* +
Number of valid inputs in the *NodePayloadAMDX* storage class array when using the *CoalescingAMDX* Execution Mode. (GLCompute only)
|*ShaderEnqueueAMDX*
|====

The business of actually allocating and enqueuing payloads is done by *OpInitializeNodePayloadsAMDX*:

[cols="1,2,2,2,2,2"]
|======
5+|[[OpInitializeNodePayloadsAMDX]]*OpInitializeNodePayloadsAMDX* +
 +
Allocate payloads in memory and make them accessible through the _Payload Array_ variable.
The payloads are enqueued for the node shader identified by the _Node Index_ and _Node Name_ in the decoration
*PayloadNodeNameAMDX* on the _Payload Array_ variable. +
 +
_Payload Array_ variable must be an *OpTypePointer* with a _Storage Class_ of _NodeOutputPayloadAMDX_, and a _Type_ of *OpTypeArray* with an _Element Type_ of *OpTypeStruct*. +
 +
The array pointed to by _Payload Array_ variable must have _Payload Count_ elements. +
 +
Payloads are allocated for the _Scope_ indicated by _Visibility_, and are visible to all invocations in that _Scope_. +
 +
_Payload Count_ is the number of payloads to initialize in the _Payload Array_. +
 +
_Payload Count_ must be less than or equal to the *NodeMaxPayloadsAMDX* decoration on the _Payload Array_ variable. +
 +
_Payload Count_ and _Node Index_ must be dynamically uniform within the scope identified by _Visibility_. +
 +
_Visibility_ must only be either _Invocation_ or _Workgroup_. +
 +
This instruction must be called in uniform control flow. +
This instruction must not be called on a _Payload Array_ variable that has previously been initialized.
1+|Capability: +
*ShaderEnqueueAMDX*
| 5 | 5090
| _<id>_ +
_Payload Array_
| _Scope <id>_ +
_Visibility_
| _<id>_ +
_Payload Count_
| _<id>_ +
_Node Index_
|======


Once a payload element is initialized, it will be enqueued to workgroups in the corresponding shader after the calling shader has written all of its values.
Enqueues are performed in the same manner as the `vkCmdDispatchGraph*` API commands.
If the node enqueued has the `CoalescingAMDX` execution mode, there is no guarantee about which set of payloads will be visible to the same workgroup.

The shader must not enqueue payloads to a shader with the same name as this shader unless the index identifies this shader and `MaxNodeRecursionAMDX` is declared with a sufficient depth.
Shaders with the same name and different indexes can each recurse independently.


A shader can explicitly specify that it is done writing to outputs (allowing the enqueue to happen sooner) by calling *OpFinalizeNodePayloadsAMDX*:

[cols="3,1,1"]
|======
2+|[[OpFinalizeNodePayloadsAMDX]]*OpFinalizeNodePayloadsAMDX* +
 +
Optionally indicates that all accesses to an array of output payloads have completed.
 +
_Payload Array_ is a payload array previously initialized by *OpInitializeNodePayloadsAMDX*.
 +
This instruction must be called in uniform control flow.
 +
_Payload Array_ must be an *OpTypePointer* with a _Storage Class_ of _NodeOutputPayloadAMDX_, and a _Type_ of *OpTypeArray* or *OpTypeRuntimeArray* with an _Element Type_ of *OpTypeStruct*.
_Payload Array_ must not have been previously finalized by *OpFinalizeNodePayloadsAMDX*.
1+|Capability: +
*ShaderEnqueueAMDX*
| 2 | 5075
| _<id>_ +
_Payload Array_
|======

Once this has been called, accessing any element of _Payload Array_ is undefined behavior.

[cols="3,1,1,1,1"]
|======
4+|[[OpFinishWritingNodePayloadAMDX]]*OpFinishWritingNodePayloadAMDX* +
 +
Optionally indicates that all writes to the input payload by the current workgroup have completed.
 +
Returns `true` when all workgroups that can access this payload have called this function.

Must not be called if the shader is using *CoalescingAMDX* execution mode,
or if the shader was dispatched with a `vkCmdDispatchGraph*` command, rather than enqueued from another shader.

Must not be called if the input payload is not decorated with *TrackFinishWritingAMDX*.

_Result Type_ must be *OpTypeBool*.
 +
_Payload_ is a variable in the *NodePayloadAMDX* storage class.
1+|Capability: +
*ShaderEnqueueAMDX*
| 4 | 5078
| _<id>_ +
_Result Type_
| _Result_ _<id>_
| _<id>_ +
_Payload_
|======

Once this has been called for a given payload, writing values into that payload by the current invocation/workgroup is undefined behavior.


## Issues

### RESOLVED: For compute nodes, can the input payload be modified? If so, what sees that modification?

Yes, input payloads are writeable, and the *OpFinishWritingNodePayloadAMDX* instruction is provided to indicate that all
workgroups that share the same payload have finished writing to it.

Limitations apply to this functionality. Please refer to the instruction's specification.


### UNRESOLVED: Do we need input from the application to tune the scratch allocation?

For now, no; more research is required to determine what information would actually be useful to know.


### PROPOSED: How does this extension interact with device groups?

It works the same as other dispatch commands - work is replicated to all devices unless applications split the work themselves.
There is no automatic scheduling between devices.
