// Copyright 2023-2024 The Khronos Group Inc.
//
// SPDX-License-Identifier: CC-BY-4.0

# VK_ARM_render_pass_striped
:toc: left
:refpage: https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/
:sectnums:

This document describes a proposal for a new extension that allows the
processing of a render pass instance to be split into stripes.

## Problem Statement

It is common to do post-processing on the images produced by a render pass
instance. This is typically done using additional render pass instances or
compute passes on the same graphics and compute queue - and on the same
physical device.

In some cases, however, the post-processing can be more efficiently done
on specialized hardware that is not Vulkan-capable. The various memory
import and export extensions can support this use-case. But existing
synchronization requires that the Vulkan device has rendered the complete
output image before the external device can access it. In many cases, the
latency would be reduced if the post-processing could start before the
complete image is rendered.

This proposal aims to support this latency reduction by splitting the
processing of a render pass instance into a set of stripes that can be
individually synchronized with an external device.

## Solution Space

The first option to consider is to just use the render area (in
VkRenderingInfo or VkRenderPassBeginInfo) to render each stripe
separately. This would be functionally correct, but requires that the
complete render pass is replicated for each stripe, which adds overhead
for both the application and the implementation.

We want to give the implementation visibility of the stripes such that some
work can be shared across stripes.

The stripes should be specified per render pass. Extending the render pass
structures with that information is straightforward.

There are more options for handling the synchronization.

The intent is to synchronize with an external (to Vulkan) consumer, so the
synchronization primitive must be exportable.

If we use semaphores, when should the semaphore objects be provided?
The options are:

 . Provide the semaphores along with the queue submission commands (e.g. on vkQueueSubmit) as normal.
 . Provide the semaphores along with the render pass information.

Providing the semaphores with the render pass would give the implementation
all the information in one place. But it is unusual to provide semaphores
during recording, and resubmitting command buffers with semaphores embedded
in adds complexity.

This proposal picks the first option and describes a mapping between an array
of semaphores and the render passes in the submitted command buffer.

This proposal does not allow stripes to be specified per subpass for two reasons:

 . It is not necessary since the expected use-case is post-processing of the final image
 . If different subpasses specified different stripeAreas, any subpass merging of those passes would likely have to be disabled

The expectation in this proposal is that the stripes are only used for the
last subpass in a render pass instance.

## Proposal

### Features

```c
typedef struct VkPhysicalDeviceRenderPassStripedFeaturesARM
{
    VkStructureType sType;
    void *pNext;
    VkBool32 renderPassStriped;
} VkPhysicalDeviceRenderPassStripedFeaturesARM;
```

This feature indicates that striped rendering is supported.

### Properties

```c
typedef struct VkPhysicalDeviceRenderPassStripedPropertiesARM
{
    VkStructureType sType;
    void *pNext;
    VkExtent2D renderPassStripeGranularity;
    uint32_t maxRenderPassStripes;
} VkPhysicalDeviceRenderPassStripedPropertiesARM;
```

These properties indicate implementation-defined limits on the maximum number
of stripes that are supported and how fine-grained they can be.
For a
tile-based GPU, the stripe granularity will typically be a multiple of the
tile size.

### Specifying stripes

Stripes can be specified per render pass instance.
The following structure can be added to the pNext chain of `VkRenderingInfoKHR`
or `VkRenderPassBeginInfo`:

```c
typedef struct VkRenderPassStripeBeginInfoARM
{
    VkStructureType sType;
    const void *pNext;
    uint32_t stripeInfoCount;
    VkRenderPassStripeInfoARM *pStripeInfos;
} VkRenderPassStripeBeginInfoARM;
```

`stripeInfoCount` is the number of stripes and also the number of elements in
the `pStripeInfos` array.

The individual stripes are specified by the following structure:

```c
typedef struct VkRenderPassStripeInfoARM
{
    VkStructureType sType;
    const void *pNext;
    VkRect2D stripeArea;
} VkRenderPassStripeInfoARM;
```

`stripeArea` is the region of the stripe.
As a rule, the values of `stripeArea.offset.x` and `stripeArea.extent.width`
must each be a multiple of `renderPassStripeGranularity.width`. But it is difficult
for an application to guarantee that the dimensions of each image are a
multiple of the stripe granularity. We therefore allow an exception to the
general rule for the stripe at the edge, where `stripeArea.extent.width` does
not need to be a multiple of `renderPassStripeGranularity.width` as long as
the sum of `stripeArea.offset.x` and `stripeArea.extent.width` is equal to
the `renderArea.extent.width` of the render pass instance.
The same constraints apply to the values of `stripeArea.offset.y` and
`stripeArea.extent.height`.

In order to synchronize with an external consumer, a semaphore needs to be
signaled when a stripe completes.
These semaphores are specified on queue
submit by including the following structure in the pNext chain of
VkSubmitInfo2->VkCommandBufferSubmitInfo:

```c
typedef struct VkRenderPassStripeSubmitInfoARM
{
    VkStructureType sType;
    const void *pNext;
    uint32_t stripeSemaphoreInfoCount;
    const VkSemaphoreSubmitInfo* pStripeSemaphoreInfos;
} VkRenderPassStripeSubmitInfoARM;
```

`stripeSemaphoreInfoCount` is the number of elements in `pStripeSemaphoreInfos`.
The value of `stripeSemaphoreInfoCount` must be equal to the sum of the
`VkRenderPassStripeBeginInfoARM->stripeInfoCount` parameters that are recorded
in the `VkCommandBufferSubmitInfo->commandBuffer`.

`pStripeSemaphoreInfos` is a pointer to an array of `VkSemaphoreSubmitInfo`
structures describing the render pass stripe signal operations.
The elements of this array are mapped to striped render passes in submission
order and in stripe order within each render pass.
Each semaphore is signaled when the associated stripe is complete.

For example, if `VkCommandBufferSubmitInfo->commandBuffer` contains three
render passes, where the first has two stripes, the second is not striped,
and the third has three stripes, then `stripeSemaphoreInfoCount` must have
the value 5, the first two entries of `pStripeSemaphoreInfos` are associated
with the first render pass and the last three entries are associated with
the third render pass. The semaphore in the first entry of
`pStripeSemaphoreInfos` is signaled when the first stripe of the first
render pass is complete, etc.

## Issues

### PROPOSED: Naming. Do we use "stripe" or "slice"?

The specification uses "slice" to refer to slices of a 3D image.
This proposal wants to express partitions in 2D.
The video extensions, e.g., "VK_EXT_video_decode_h264", use "slice" to mean
something very similar to what this proposal is aiming to achieve.

We use "stripe" to disambiguate from the way slice is used to describe
partitions in 3D.

### PROPOSED: Could we use timeline semaphores for this?

Probably. But timeline semaphores are not yet available on all platforms.

Regular timeline semaphores are not shareable to foreign queues so are
not a good fit for the target use-cases.

External timeline semaphores require the native platform to support
TIMELINE type native fence. There are two ways to export these:

 . OPAQUE_FD: These are of limited use since they can only be shared by the same driver
 . Platform timeline fences: There is currently no way to export these from Vulkan

For now, we require binary semaphores.

### PROPOSED: How do stripes interact with multiview or layered rendering?

The main question is whether a stripe covers individual views (or layers of
the framebuffer) or all of them.

Each view could be post-processed separately - and in that case, one may want
to start the post-processing for a stripe of one view before the next view is
processed. In that case, we would want one semaphore per view.

This extension is specified to complete a stripe for all views (or layers)
before moving on to the next stripe. The main motivation for this choice is
to reduce the number of synchronization operations required.

### PROPOSED: Is striped rendering supported for all renderable formats?

Yes.
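
## Examples

The stripe granularity rule and the semaphore mapping described in the Proposal section can be illustrated with a small sketch. It is not part of the extension and does not call into Vulkan: `Offset2D`, `Extent2D` and `Rect2D` are stand-ins for `VkOffset2D`, `VkExtent2D` and `VkRect2D` so that the code compiles without the Vulkan headers, and the helper names `stripe_axis_ok`, `stripe_area_ok` and `stripe_semaphore_base` are hypothetical. Rejecting negative offsets is an assumption of this sketch; the proposal text does not address them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in types (assumption): real code would use VkOffset2D,
 * VkExtent2D and VkRect2D from the Vulkan headers. */
typedef struct { int32_t x, y; } Offset2D;
typedef struct { uint32_t width, height; } Extent2D;
typedef struct { Offset2D offset; Extent2D extent; } Rect2D;

/* Hypothetical helper: checks one axis of a stripe.
 * The offset must be a multiple of the granularity. The size must also be
 * a multiple, except for an edge stripe whose offset + size equals the
 * render area extent on that axis. */
bool stripe_axis_ok(int32_t offset, uint32_t size,
                    uint32_t granularity, uint32_t render_extent)
{
    assert(granularity > 0);
    /* Assumption of this sketch: negative offsets are rejected. */
    if (offset < 0 || (uint32_t)offset % granularity != 0)
        return false;
    if (size % granularity == 0)
        return true;
    /* Edge exception: the stripe ends exactly at the render area extent. */
    return (uint64_t)offset + size == render_extent;
}

/* Hypothetical helper: applies the same rule to both axes of stripeArea. */
bool stripe_area_ok(Rect2D stripe, Extent2D granularity,
                    Extent2D render_extent)
{
    return stripe_axis_ok(stripe.offset.x, stripe.extent.width,
                          granularity.width, render_extent.width)
        && stripe_axis_ok(stripe.offset.y, stripe.extent.height,
                          granularity.height, render_extent.height);
}

/* Hypothetical helper: index into pStripeSemaphoreInfos of the first
 * semaphore belonging to render pass `pass`. `stripe_counts` holds the
 * stripeInfoCount of each render pass in submission order, with 0 for a
 * render pass that is not striped. Passing the number of render passes
 * as `pass` yields the required stripeSemaphoreInfoCount. */
uint32_t stripe_semaphore_base(const uint32_t *stripe_counts, uint32_t pass)
{
    uint32_t base = 0;
    for (uint32_t i = 0; i < pass; ++i)
        base += stripe_counts[i];
    return base;
}
```

For the three render passes in the example from the Proposal section (two stripes, not striped, three stripes), `stripe_semaphore_base` gives a base index of 0 for the first render pass and 2 for the third, and a total of 5 when called with `pass` equal to 3. The sketch checks only the granularity rule; it does not check `maxRenderPassStripes` or any other requirement.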