// Copyright 2021-2023 The Khronos Group, Inc.
//
// SPDX-License-Identifier: CC-BY-4.0

= VK_EXT_shader_tile_image
:toc: left
:refpage: https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/
:sectnums:

`VK_EXT_shader_tile_image` is a device extension that explicitly enables access to on-chip pixel data. For GPUs supporting this extension, it is a replacement for many use-cases for subpasses, which are not available when the `VK_KHR_dynamic_rendering` extension is used.

== Problem Statement

Some implementations, in particular tile-based GPUs, want to allow applications to effectively exploit local, e.g. on-chip, memory.
A classic example would be optimizing G-buffer based deferred shading techniques where the G-buffer is produced and consumed on-chip.

Subpasses were designed to support such use-cases with an API mechanism that was portable across all implementations. In practice, that has led to some problems, including:

 * the high level abstraction is far removed from the mental model an application developer needs to have to be able to optimize for keeping data on-chip
 * the subpass design affects other parts of the API and is seen as a 'tax' on applications that do not target implementations that benefit from on-chip storage
 * developers wanting to optimize for a specific class of GPUs often need to make GPU specific optimization choices, so the abstraction does not add much

These problems motivated `VK_KHR_dynamic_rendering`, which offers an alternative API without subpasses. But keeping data on-chip is still an important optimization for a class of GPUs.

This proposal aims to provide the most essential functionality of subpasses, but in an explicit manner.
The abstractions in this proposal are a closer match to what the underlying GPU implementation does and should make it easier to communicate best practices and performance guarantees to developers.

== Solution Space

=== High-level choices

The solution space can be split along two axes: scope and abstraction level.

The abstraction level is a question of whether we want an API that is only targeted at tile-based GPUs or if we should have a higher-level API that would allow the feature to be supported on a wider range of GPUs.
The main argument for a higher abstraction level is application portability.
Arguments against additional abstractions include:

 * It would be hard for developers to reason about performance expectations, for the same reasons that it is hard to do this for subpasses
 * "Framebuffer fetch" and "programmable blend" semantics are naturally expressed as direct reads from color attachments, and adding abstractions just obfuscates what (some) GPU hardware is doing
 * GPUs that are not tile-based would not gain much from exposing this - at least not unless the scope is expanded - so the abstractions add little practical value

There are two choices broadly based on what the functionality is for, and which GPUs are able to support it:

1. An explicit API to allow certain tile-based GPUs to expose on-chip memory with fast raster order access.
 * Provides framebuffer fetch and Pixel Local Storage functionality and forms the basis for Tile Shader like functionality.
 * This is mainly targeted at GPUs which defer fragment shading into framebuffer tiles where each tile is typically processed just once.
 * This addresses use cases such as keeping G-buffer data on-chip.
 * No DRAM bandwidth is paid for render targets which are cleared on load, consumed within the render pass, and whose content is discarded at the end of the render pass.
 * Raster order access (coherent access) to framebuffer data from the fragment shader is efficient or even "free" - depending on the GPU.
 * No descriptors needed for render target access.

2. A slightly higher level API to enable broad GPU support for framebuffer fetch like functionality within draw calls in dynamic render passes.
 * Provides framebuffer fetch like functionality.
 * This is intended to be supported by a wide range of GPUs. The GPUs in general have optimised support for framebuffer fetch within a render pass.
 * This addresses use cases such as programmable image composition, or programmable resolve.
 * Attachment data is not guaranteed to be on-chip within a render pass and may spill to DRAM. Implementations may opportunistically cache data in their cache hierarchy.
 * Raster order access to framebuffer data from the fragment shader is not "free". Many implementations may prefer non-coherent access with explicit synchronization from applications.
 * Descriptors need to be bound for render target access (at least for some implementations).

This proposal targets the first choice.

The options for scope include:

 * "Framebuffer fetch" equivalent, i.e. enable access to the previously written pixel in the local framebuffer region
 * "Pixel local storage" equivalent, i.e. as above with the addition of pixel format reinterpretation
 * "Tile shader" equivalent, i.e. enable access to a region larger than 1x1 pixels

This proposal targets the first option, but adds building blocks to enable future enhancements.
The reasoning behind this choice is that:

 * It should be possible to support this extension on existing GPUs
 * Many use-cases that benefit from subpasses could be implemented with this functionality
 * Ease of integration; this option requires the fewest changes to rendering engines
 * Time to market; several IHVs would like at least the subpass equivalent functionality to be implemented alongside `VK_KHR_dynamic_rendering`

=== Implementation choices

It is useful to provide tile image access for all attachment types.
But implementations may manage depth/stencil differently than color, which could add constraints.
We will therefore expose separate feature bits for color, depth, and stencil access.

Tile image variables currently have to 'alias' a color attachment location, and their format is implicitly specified to match the color attachment format.

== Proposal

=== Concept

image::{images}/tile_image.svg[align="center",title="Tile Image",opts="{imageopts}"]

Introduce the concept of a 'tile image'. When the extension is enabled, the framebuffer is logically divided into a grid of non-overlapping tiles called tile images.

=== API changes

Add a new feature struct `VkPhysicalDeviceShaderTileImageFeaturesEXT` containing:

 * shaderTileImageColorReadAccess
 * shaderTileImageDepthReadAccess
 * shaderTileImageStencilReadAccess

shaderTileImageColorReadAccess is mandatory if this extension is supported.

shaderTileImageColorReadAccess provides the ability to access current (rasterization order) color values from tile memory via tile images.
There is no support for the storage format to be redefined as part of this feature.
Output data is still written via Fragment Output variables.
Since the framebuffer format is not re-declared, fixed-function blending works as normal.

Existing shaders do not need to be modified to write to color attachments.

Reading color values using the functionality in this extension guarantees that the access is in rasterization order.
See the spec (Fragment Shader Tile Image Reads) for details on which sample reads qualify for coherent read access.

shaderTileImageDepthReadAccess and shaderTileImageStencilReadAccess provide a similar ability to read the depth and stencil values of any sample location covered by the fragment.
Depth and stencil fetches use implicit tile images.
If no depth / stencil attachment is present then the values returned by fetches are undefined.
Early fragment tests are disallowed if depth or stencil fetch is used.

Reading depth/stencil values has similar rasterization order and synchronization guarantees as color.

=== SPIR-V changes

This proposal leverages `OpTypeImage` and makes `TileImageDataEXT` another `Dim` similar to `SubpassData`.

Specifically:

 * `Dim` is extended with `TileImageDataEXT`.
 * `OpTypeImage` gets the additional constraint that if `Dim` is `TileImageDataEXT`:
 ** `Sampled` must be `2`
 ** `Image Format` must be `Unknown` as the format is implicitly specified by the color attachment
 *** (We could relax this in a further extension if we wanted to support format reinterpretation in the shader.)
 ** `Execution Model` must be `Fragment`
 ** `Arrayed` must be `0`
 ** Extend the use of `Location` such that it specifies the color attachment index
 * Add `OpColorAttachmentReadEXT`, which is similar to `OpImageRead` but helps disambiguate between color/depth/stencil.
 * Add `OpDepthAttachmentReadEXT` and `OpStencilAttachmentReadEXT` to read depth/stencil
 ** These take an optional `Sample` parameter for MSAA use-cases
 * Add a `TileImageEXT` Storage Class that is only supported for variables of `OpTypeImage` with `Dim` equal to `TileImageDataEXT`

=== GLSL changes

Main changes:

 * New type: `attachmentEXT`
 * The `location` layout qualifier is used to specify the corresponding color attachment
 * New storage qualifier (supported only in fragment shaders): `tileImageEXT`
 * New functions: `colorAttachmentReadEXT`, `depthAttachmentReadEXT`, `stencilAttachmentReadEXT`

Mapping to SPIR-V:

 * `attachmentEXT` maps to `OpTypeImage` with `Dim` equal to `TileImageDataEXT`
 * `colorAttachmentReadEXT` maps to `OpColorAttachmentReadEXT`
 * `depthAttachmentReadEXT` maps to `OpDepthAttachmentReadEXT`
 * `stencilAttachmentReadEXT` maps to `OpStencilAttachmentReadEXT`

Function signatures:
[source,c]
----
// color
gvec4 colorAttachmentReadEXT(gattachment attachmentEXT);
gvec4 colorAttachmentReadEXT(gattachment attachmentEXT, int sample);

// depth
highp float depthAttachmentReadEXT();
highp float depthAttachmentReadEXT(int sample);

// stencil
lowp uint stencilAttachmentReadEXT();
lowp uint stencilAttachmentReadEXT(int sample);
----

=== HLSL Changes

== Examples

=== Color reads

[source,c]
----
// ------ Subpass Example --------
layout( set = 0, binding = 0, input_attachment_index = 0 ) uniform highp subpassInput color0;
layout( set = 0, binding = 1, input_attachment_index = 1 ) uniform highp subpassInput color1;

layout( location = 0 ) out vec4 fragColor;

void main()
{
    vec4 value = subpassLoad(color0) + subpassLoad(color1);
    fragColor = value;
}

// ----- Equivalent Tile Image approach ------

// NOTES:
// 'tileImageEXT' is a storage qualifier.
// 'attachmentEXT' is an opaque type; similar to subpassInput
// 'aliased' means that the variable shares _tile image_ with the corresponding attachment; there is no in-memory aliasing

layout( location = 0 /* aliased to color attachment 0 */ ) tileImageEXT highp attachmentEXT color0;
layout( location = 1 /* aliased to color attachment 1 */ ) tileImageEXT highp attachmentEXT color1;

layout( location = 0 ) out vec4 fragColor;

void main()
{
    vec4 value = colorAttachmentReadEXT(color0) + colorAttachmentReadEXT(color1);
    fragColor = value;
}
----

=== Depth reads

[source,c]
----
void main()
{
    // read sample 0: works for non-MSAA or MSAA targets
    highp float last_depth = depthAttachmentReadEXT();
}
----

== Alternate Proposals

The following proposals explore alternate ways to expose the functionality for reading color data from tile memory. Depth and stencil reads and the API changes are unchanged from the main proposal.

=== Proposal B: OpTypeTileImage

==== SPIR-V Changes

Add new type: `TileImage`. We have two options for defining `TileImage`:

. `TileImage` variables which are instanced per-pixel (or per-sample in case of multisampled framebuffers)
. `TileImage` defines a 2D array of pixels similar to an image but in tile memory.
.. Note: Defining this as a 2D array fits well for future `Tile Shaders` functionality where tile shader invocations on a tile can access any location within a TileImage on the tile.

Add new instruction: `OpTypeTileImage`. The instruction declares a `tile image`. `Tile image` is an opaque type. `OpTypeTileImage` has the following operands:

* `Image Format`: the image format. This must be set to `Unknown` as the format is implicitly specified by the color attachment.
** (We could relax this in a further extension if we wanted to support format reinterpretation in the shader.)
* `MS`: indicates whether the content is multisampled. 0 - single-sampled. 1 - multisampled.

`Tile image` variables must be decorated with `Location` which specifies the color attachment index.
`Execution Model` must be `Fragment`.

Add `OpTileImageRead`, `OpDepthTileImageRead`, `OpStencilTileImageRead` to read from color, depth, stencil tile images.
Add `Tile` storage class.

==== GLSL Changes

GLSL changes remain the same as in the main proposal except the mapping changes to `OpTypeTileImage` instead of `OpTypeImage`:

 * `tileImage` maps to `OpTypeTileImage`

=== Proposal C: Storage Class / PLS style

==== SPIR-V Changes

Introduce `TileImage` as a new storage class.

* Variables declared with `TileImage` must have the `Location` decoration specified - this specifies the attachment index to alias to.
* If image format reinterpretation is to be supported then a new `Imageformat` decoration is specified.
* `TileImage` storage class variables are multisampled with the sample count of the framebuffer if multisampling is enabled.
* Reading of TileImage variables is done via `OpTileImageRead`.
** `OpTileImageRead` accepts a `sample` parameter for MSAA use cases.

* If aggregate types are to be supported in the `TileImage` storage class, we would need the following:
** `Location` and `Imageformat` must only be applied to non-structure types (that is, scalars or vectors or arrays of scalars or arrays of vectors).

==== GLSL Changes

* New storage class `tileImage`.
* Add support for grouping `tileImage` variable declarations into an interface block.
* layout `location` must be specified for the variables.
* Add new builtin function `tileImageRead`, which accepts an optional parameter `sample`
* If reinterpretation of formats is supported (within the same draw call), then we need `tileImageIn` and `tileImageOut` (or make `tileImage` an auxiliary storage specifier, similar to `patch` so we could use `tileImage in` and `tileImage out`).

== Non-coherent access

Some implementations have a penalty for supporting raster order access to tile image data. To support this functionality on such implementations we would add the following changes to the base proposal:

=== API Changes

* A property bit `shaderTileImagePreferCoherentReadAccess` indicating whether the implementation prefers that coherent read accesses are used.

* Support for specifying the barriers - three broad options (see next section)

* Note: The gains from the tile image feature with raster order access enabled are expected to match the gains from subpasses.

=== Barrier Proposal A: MemoryBarrier via vkCmdPipelineBarrier2

`vkCmdPipelineBarrier2` would be allowed within dynamic render passes to specify a `VkMemoryBarrier2` with some restrictions. The enums `VK_ACCESS_2_COLOR_ATTACHMENT_READ_BIT` and `VK_ACCESS_2_DEPTH_STENCIL_ATTACHMENT_READ_BIT` are reused for tile image read accesses.

This approach would allow synchronizing all color attachments, or the depth/stencil attachment, but does not support synchronizing individual color attachments.

Example synchronizing two draw calls, where the first writes to color attachments and the second reads via the tile image variables.

[source,c]
----
vkCmdDraw(...);

VkMemoryBarrier2 memoryBarrier = {
    ...
    .srcStageMask = VK_PIPELINE_STAGE_2_COLOR_ATTACHMENT_OUTPUT_BIT,
    .srcAccessMask = VK_ACCESS_2_COLOR_ATTACHMENT_WRITE_BIT,
    .dstStageMask = VK_PIPELINE_STAGE_2_FRAGMENT_SHADER_BIT,
    .dstAccessMask = VK_ACCESS_2_COLOR_ATTACHMENT_READ_BIT
};

VkDependencyInfo dependencyInfo = {
    ...
    VK_DEPENDENCY_BY_REGION_BIT, //dependency flags
    1, //memory barrier count
    &memoryBarrier, //memory barrier
    ...
};

vkCmdPipelineBarrier2(commandBuffer, &dependencyInfo);

vkCmdDraw(...);
----

=== Barrier Proposal B: ImageMemoryBarrier via vkCmdPipelineBarrier2

`vkCmdPipelineBarrier2` would be allowed within dynamic render passes to specify a `VkImageMemoryBarrier2` with some restrictions. The enums `VK_ACCESS_2_COLOR_ATTACHMENT_READ_BIT` and `VK_ACCESS_2_DEPTH_STENCIL_ATTACHMENT_READ_BIT` are reused to express tile image read accesses.

This approach would allow synchronizing individual color attachments, or the depth or stencil attachment.

Example synchronizing two draw calls, where the first writes to color attachments and the second reads via the tile image variables.

[source,c]
----
vkCmdDraw(...);

VkImageMemoryBarrier2 imageMemoryBarrier = {
    ...
    .srcStageMask = VK_PIPELINE_STAGE_2_COLOR_ATTACHMENT_OUTPUT_BIT,
    .srcAccessMask = VK_ACCESS_2_COLOR_ATTACHMENT_WRITE_BIT,
    .dstStageMask = VK_PIPELINE_STAGE_2_FRAGMENT_SHADER_BIT,
    .dstAccessMask = VK_ACCESS_2_COLOR_ATTACHMENT_READ_BIT,
    .oldLayout = ..., //layouts not allowed to be changed.
    .newLayout = ...,
    .image = .., //image and subresource identifying the specific attachment.
    .subresourceRange = ..
};

VkDependencyInfo dependencyInfo = {
    ...
    VK_DEPENDENCY_BY_REGION_BIT, //dependency flags
    ...
    1, //image memory barrier count
    &imageMemoryBarrier, //image memory barrier
    ...
};

vkCmdPipelineBarrier2(commandBuffer, &dependencyInfo);

vkCmdDraw(...);
----

=== Barrier Proposal C: New simple API for tile image barriers

New API entry point `vkCmdTileBarrierEXT(..)` where the app can specify which attachments to synchronize. This can be easily extended to tile shaders if an implementation desires explicit barriers - by specifying that all of tile memory needs to be synchronized and explicitly specifying tile-wide synchronization.

[source,c]
----
//New Vulkan function and types
void vkCmdTileBarrierEXT(
    VkCommandBuffer commandBuffer,
    VkDependencyFlags dependencyFlags,
    VkTileMemoryTypeFlagsEXT tileMemoryMask);

typedef enum VkTileMemoryTypeFlagsBitsEXT {
    VK_TILE_IMAGE_COLOR_ATTACHMENTS_BIT = 0x00000001,
    VK_TILE_IMAGE_DEPTH_STENCIL_ATTACHMENT_BIT = 0x00000002,
} VkTileMemoryTypeFlagsBitsEXT;
----

Example synchronizing two draw calls, where the first writes to color attachments and the second reads via the tile image variables.

[source,c]
----
vkCmdDraw(...);

vkCmdTileBarrierEXT(commandBuffer,
                    VK_DEPENDENCY_BY_REGION_BIT,
                    VK_TILE_IMAGE_COLOR_ATTACHMENTS_BIT);

vkCmdDraw(...);
----


=== SPIR-V and GLSL changes

* Tile Image data variables can optionally be specified with the "noncoherent" layout qualifier in GLSL. For Depth and Stencil we could use a special fragment shader layout qualifier (similar to early_fragment_tests) to indicate depth and stencil access is "noncoherent".
* Three new Execution modes in SPIR-V to specify that color, depth or stencil reads via the functionality in this extension are non-coherent (that is, the reads are no longer guaranteed to be in raster order with respect to write operations from prior fragments).

== Issues

=== 1. RESOLVED: Should we allow early fragment tests?

Early fragment tests are disallowed if reading frag depth / stencil.

=== 2. RESOLVED: Should depth / stencil fetch be a separate extension?

Access to depth / stencil is defined differently than color, but we suggest keeping them together - with separate feature bits.

=== 3. RESOLVED: What should we name these variables? What should the extension be named?

Other APIs have similar but not identical concepts, so a unique name is useful.

We call these resources tile images.
On typical implementations supporting this extension, the framebuffer is divided into tiles and fragment processing is deferred such that each framebuffer tile is typically visited just once.
A tile image is a view of a framebuffer attachment, restricted to the tile being processed.

Note that fragment shaders can still only read color, depth, and stencil values from their own fragment location and not from the entire tile.

The extension is called VK_EXT_shader_tile_image.

=== 4. RESOLVED: Are there any non-obvious interactions with the suspend/resume functionality in `VK_KHR_dynamic_rendering`?

Not at present.
If we were to allow non-aliased tile image variables, then implementations would have to be able to guarantee that those variables never have to 'spill' from tile image.

=== 5. RESOLVED: Enable / Disable raster order access

Some implementations pay a performance cost to guarantee raster order access. We need to give them a way to disable raster order access and add support for barriers to explicitly perform synchronization.

Three proposals have been added to the Non-coherent access section in this document. The spec changes currently choose Barrier Proposal A: MemoryBarrier via vkCmdPipelineBarrier2.

Vulkan barriers have been difficult for developers to use, so Barrier Proposal C might offer a simpler API.

Consensus was to keep things consistent with existing barriers in Vulkan, so Barrier Proposal A was chosen.

=== 7. RESOLVED: Should this extension reuse OpTypeImage, or introduce a new type for declaring tile images?

OpTypeImage is reused with a special Dim for tile images, following what was done for subpass attachments.

An alternative would have been to make tile images their own type, and introduce an OpTypeTileImage type.
That would require less special-casing of OpTypeImage, but comes with a higher initial burden in tooling.

=== 8. RESOLVED: Should Color, Depth, and Stencil reads use the same SPIR-V opcode?

No. The extension introduces separate opcodes.

Tile based GPUs which guarantee framebuffer residency in tile memory can offer efficient raster order access to color, depth, stencil data with relatively low overhead.
Some GPU implementations would have a significant performance penalty in raster order access if the implementation cannot determine from the SPIR-V shader whether a specific access is color, depth, or stencil.

This design choice is in line with other API extensions (GL framebuffer fetch and framebuffer fetch depth stencil) and other APIs where depth/stencil access is clearly disambiguated.

=== 9. RESOLVED: Should Depth and Stencil read opcodes consume an image operand specifying the attachment, or should it be implicit?

No operand is necessary, as depth and stencil each uniquely identify their attachment, unlike color.

The other options considered were:

 A. Allow depth and stencil tile images to be declared as variables. Tile images are defined to map to the color attachment specified via the `Location` decoration - some equivalent needs to be defined for depth and stencil. Pixel Local Storage like functionality of supporting format reinterpretation is only supported for color attachments, and hence must be disallowed for depth and stencil. There is very little benefit to declaring the depth and stencil variables given these restrictions.
 B. Depth and stencil tile images are exposed as built-in variables.

Given the design choice made for issue 8, the alternate options do not add any value.

=== 10. RESOLVED: Should this extension reuse the image Dim SubpassData or introduce a new Dim?

The extension introduces a new Dim.

This extension is intended to serve as a foundation for further functionality - for example Pixel Local Storage like format reinterpretation, or to define the tile size and allow tile shaders to access any pixel within the tile.
In SPIR-V, input attachments use images with Dim of SubpassData. We use a new Dim so we can easily distinguish whether an image is an input attachment or a tile image.

=== 11. RESOLVED: Should this extension require applications to create and bind descriptors for tile images?

No.
Some GPUs internally require descriptors to be able to access framebuffer data. The input attachments in Vulkan subpasses help these GPU implementations.

Other GPUs do not require apps to bind such descriptors. The intent with this extension is to provide functionality roughly along the lines of GL_EXT_shader_framebuffer_fetch and GL_EXT_shader_pixel_local_storage, which do not require apps to manage and bind descriptors.

=== 12. RESOLVED: What does 'undefined value' mean for tile image reads?

It simply means that the value has no well-defined meaning to an application. It does _not_ mean that the value is random nor that it could have been leaked from other contexts, processes, or memory other than the framebuffer attachments.

== Further Functionality

=== Fragment Shading Rate interactions

With `VK_KHR_fragment_shading_rate`, multi-pixel fragments read some implementation-defined pixel from the input attachments. We could define stronger requirements in this extension.

=== Allow non-aliased Tile Image variables and/or image format redeclaration

This would provide "Pixel local storage" equivalent functionality.

A possible approach for that would be to specify the format as layout parameter - similar to image access:
[source,c]
----
layout(r11f_g11f_b10f) tile readonly highp tileImage normal;
----

=== Tile Image size query

If we were to allow non-aliased Tile Image variables, we would need to expose some limits on tile image size and tile dimensions so that applications can make performance trade-offs on tile size vs storage requirements.

=== Memoryless attachments

We have lazily allocated images in Vulkan, but they do not guarantee that memory is not allocated.
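
For context, the following is a minimal sketch of how an application might pick a lazily allocated memory type today. It is an illustration only, not part of this proposal. To stay self-contained it avoids the Vulkan headers: `find_lazily_allocated_type` and `LAZILY_ALLOCATED_BIT` are hypothetical names, with the constant mirroring the value of `VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT`. In real code, `property_flags[i]` would be assumed to come from `VkPhysicalDeviceMemoryProperties::memoryTypes[i].propertyFlags`, and `memory_type_bits` from `VkMemoryRequirements::memoryTypeBits` of an image created with `VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT`.

[source,c]
----
#include <stdint.h>

/* Mirrors the value of VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT (assumption:
 * defined locally so the sketch builds without the Vulkan headers). */
#define LAZILY_ALLOCATED_BIT 0x00000010u

/*
 * Hypothetical helper: returns the index of the first memory type that is
 * permitted by memory_type_bits and advertises the lazily allocated
 * property, or -1 if no such type exists.
 */
int32_t find_lazily_allocated_type(const uint32_t *property_flags,
                                   uint32_t type_count,
                                   uint32_t memory_type_bits)
{
    /* Vulkan exposes at most 32 memory types, one per bit of the mask. */
    for (uint32_t i = 0; i < type_count && i < 32u; ++i) {
        int allowed = (memory_type_bits & (1u << i)) != 0u;
        int lazy = (property_flags[i] & LAZILY_ALLOCATED_BIT) != 0u;
        if (allowed && lazy) {
            return (int32_t)i;
        }
    }
    return -1;
}
----

Even when such a memory type is found and bound, the implementation may still commit physical memory, which is why a stronger "memoryless" guarantee is listed here as possible further functionality.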