Name

    NVX_linked_gpu_multicast

Name Strings

    GL_NVX_linked_gpu_multicast

Contact

    Joshua Schnarr, NVIDIA Corporation (jschnarr 'at' nvidia.com)
    Ingo Esser, NVIDIA Corporation (iesser 'at' nvidia.com)

Contributors

    Christoph Kubisch, NVIDIA
    Mark Kilgard, NVIDIA

Status

    Shipping in NVIDIA release 361 drivers.

Version

    Last Modified Date: July 21, 2016
    NVIDIA Revision: 4

Number

    OpenGL Extension #493

Dependencies

    This extension is written against the OpenGL 4.5 specification (Compatibility Profile),
    dated February 2, 2015.

    This extension interacts with ARB_sparse_buffer.

    This extension interacts with ARB_copy_image.

    This extension interacts with EXT_direct_state_access.

    This extension interacts with ARB_shader_viewport_layer_array.

Overview

    This extension enables novel multi-GPU rendering techniques by providing application
    control over a group of linked GPUs with identical hardware configuration.

    Multi-GPU rendering techniques fall into two categories: implicit and explicit. Existing
    explicit approaches like WGL_NV_gpu_affinity have two main drawbacks: CPU overhead and
    application complexity. An application must manage one context per GPU and multi-pump the
    API stream. Implicit multi-GPU rendering techniques avoid these issues by broadcasting
    rendering from one context to multiple GPUs. Common implicit approaches include
    alternate-frame rendering (AFR), split-frame rendering (SFR), and multi-GPU anti-aliasing.
    They each have drawbacks. AFR scales nicely but interacts poorly with inter-frame
    dependencies. SFR can improve latency but has challenges with offscreen rendering and
    scaling of vertex processing. With multi-GPU anti-aliasing, each GPU renders the same
    content with alternate sample positions and the driver blends the result to improve
    quality. This also has issues with offscreen rendering and can conflict with other
    anti-aliasing techniques.

    These issues with implicit multi-GPU rendering all have the same root cause: the driver
    lacks adequate knowledge to accelerate every application. To resolve this,
    NVX_linked_gpu_multicast provides application control over multiple GPUs with a single
    context.

    Key points:

    - One context controls multiple GPUs. Every GPU in the linked group can access every
      object.

    - Rendering is broadcast. Each draw is repeated across all GPUs in the linked group.

    - Each GPU gets its own instance of all framebuffers and attached textures, allowing
      individualized output for each GPU. Input data can be customized for each GPU using
      buffers created with the storage flag LGPU_SEPARATE_STORAGE_BIT_NVX and a new API,
      LGPUNamedBufferSubDataNVX.

    - Textures can be transferred from one GPU to another using LGPUCopyImageSubDataNVX.
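
    For illustration, availability can be detected like any other core-profile extension
    before taking the multicast code path. This is an informative sketch; the
    has_NVX_linked_gpu_multicast flag is the same one assumed by the Sample Code section
    below.

        GLboolean has_NVX_linked_gpu_multicast = GL_FALSE;
        GLint numExtensions = 0;
        glGetIntegerv(GL_NUM_EXTENSIONS, &numExtensions);
        for (GLint i = 0; i < numExtensions; ++i) {
            if (strcmp((const char *)glGetStringi(GL_EXTENSIONS, i),
                       "GL_NVX_linked_gpu_multicast") == 0) {
                has_NVX_linked_gpu_multicast = GL_TRUE;
                break;
            }
        }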

New Procedures and Functions

    void LGPUNamedBufferSubDataNVX(
        bitfield gpuMask, uint buffer,
        intptr offset, sizeiptr size,
        const void *data);

    void LGPUCopyImageSubDataNVX(
        uint sourceGpu, bitfield destinationGpuMask,
        uint srcName, enum srcTarget,
        int srcLevel,
        int srcX, int srcY, int srcZ,
        uint dstName, enum dstTarget,
        int dstLevel,
        int dstX, int dstY, int dstZ,
        sizei width, sizei height, sizei depth);

    void LGPUInterlockNVX(void);

New Tokens

    Accepted in the <flags> parameter of BufferStorage and
    NamedBufferStorageEXT:

        LGPU_SEPARATE_STORAGE_BIT_NVX                   0x0800

    Accepted by the <pname> parameter of GetBooleanv, GetIntegerv,
    GetInteger64v, GetFloatv, and GetDoublev:

        MAX_LGPU_GPUS_NVX                               0x92BA

Additions to the OpenGL 4.5 Specification (Compatibility Profile)

    (Add a new chapter after chapter 19 "Compute Shaders")

    20 Multicast Rendering

    This chapter specifies commands for using multiple GPUs in a linked group. Commands are
    multicast, or repeated across all linked GPUs. Objects are shared by all GPUs; however,
    each GPU has its own instance (copy) of many resources, including framebuffers. When each
    GPU has its own instance of a resource, it is considered to have per-GPU storage. When all
    GPUs share a single instance of a resource, this is considered GPU-shared storage.

    The mechanism for linking GPUs is implementation specific, as is the process-global
    mechanism for enabling multicast rendering support (if necessary). The number of GPUs
    usable for multicast rendering by a context can be queried by calling GetIntegerv with the
    symbolic constant MAX_LGPU_GPUS_NVX. Individual GPUs are identified using zero-based
    indices in the range [0, n-1], where n is the number of multicast GPUs. GPUs can also be
    identified by bitmasks of the form 2^i, where i is the GPU index. A set of GPUs is
    specified by the union of masks for each GPU in the set.
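
    For example, an informative sketch of querying the number of usable GPUs and forming GPU
    masks (variable names are illustrative only):

        GLint numGpus = 0;
        glGetIntegerv(GL_MAX_LGPU_GPUS_NVX, &numGpus);  // number of GPUs usable for multicast

        // Mask 2^i identifies the GPU with index i; a set of GPUs is the union of such masks.
        GLbitfield gpu0Mask   = 0x1;                    // GPU index 0
        GLbitfield gpu1Mask   = 0x2;                    // GPU index 1 (requires numGpus >= 2)
        GLbitfield allGpuMask = 0;
        for (GLint i = 0; i < numGpus; ++i) {
            allGpuMask |= (GLbitfield)(1u << i);
        }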

    20.1 Multi-GPU Buffer Storage

    Like other resources, buffer objects can have two types of storage, per-GPU storage or
    GPU-shared storage. Per-GPU storage can be explicitly requested using the
    LGPU_SEPARATE_STORAGE_BIT_NVX flag with BufferStorage/NamedBufferStorageEXT. If this flag
    is not set, the type of storage used is undefined. The implementation may use either type
    and transition between them at any time. Client reads of a buffer with per-GPU storage may
    source from any GPU.

    The following rules apply to buffer objects with per-GPU storage:

    When mapped with WRITE_ONLY access, writes apply to all GPUs.
    When bound to UNIFORM_BUFFER, client uniform updates apply to all GPUs.
    When used as the write buffer for CopyBufferSubData or CopyNamedBufferSubData, writes
    apply to all GPUs.

    The following commands affect storage on all GPUs, even if the buffer object has per-GPU
    storage:

    BufferSubData, NamedBufferSubData, ClearBufferSubData, and ClearNamedBufferData

    An INVALID_VALUE error is generated if BufferStorage/NamedBufferStorageEXT is called with
    LGPU_SEPARATE_STORAGE_BIT_NVX set together with MAP_PERSISTENT_BIT or
    SPARSE_STORAGE_BIT_ARB.

    To modify buffer object data on one or more GPUs, the client may use the command

        void LGPUNamedBufferSubDataNVX(
            bitfield gpuMask, uint buffer,
            intptr offset, sizeiptr size,
            const void *data);

    This function operates similarly to NamedBufferSubData, except that it updates the per-GPU
    buffer data on the set of GPUs defined by <gpuMask>.

    An INVALID_VALUE error is generated if <gpuMask> is zero.
    An INVALID_OPERATION error is generated if <buffer> is not the name of an existing buffer
    object.
    An INVALID_VALUE error is generated if <offset> or <size> is negative, or if <offset> +
    <size> is greater than the value of BUFFER_SIZE for the buffer object.
    An INVALID_OPERATION error is generated if any part of the specified buffer range is
    mapped with MapBufferRange or MapBuffer (see section 6.3), unless it was mapped with
    MAP_PERSISTENT_BIT set in the MapBufferRange access flags.
    An INVALID_OPERATION error is generated if the BUFFER_IMMUTABLE_STORAGE flag of the buffer
    object is TRUE and the value of BUFFER_STORAGE_FLAGS for the buffer does not have the
    DYNAMIC_STORAGE_BIT set.
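
    For example, an informative sketch of allocating per-GPU buffer storage and giving each
    GPU different data (the PerGpuData type and the data pointers are placeholders):

        GLuint buf;
        glCreateBuffers(1, &buf);

        // Per-GPU storage; DYNAMIC_STORAGE_BIT permits the sub-data updates below.
        glNamedBufferStorage(buf, sizeof(PerGpuData), NULL,
                             GL_LGPU_SEPARATE_STORAGE_BIT_NVX | GL_DYNAMIC_STORAGE_BIT);

        // Update each GPU's instance of the buffer with its own copy of the data.
        glLGPUNamedBufferSubDataNVX(0x1, buf, 0, sizeof(PerGpuData), &dataForGpu0);  // GPU 0
        glLGPUNamedBufferSubDataNVX(0x2, buf, 0, sizeof(PerGpuData), &dataForGpu1);  // GPU 1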

    20.2 Multi-GPU Framebuffers and Textures

    All buffers in the default framebuffer, as well as renderbuffers and textures bound to
    framebuffer objects, receive per-GPU storage. Storage for other textures is undefined: it
    may be per-GPU or GPU-shared and can transition between the types at any time.

    To copy texel data between GPUs, the client may use the command

        void LGPUCopyImageSubDataNVX(
            uint sourceGpu, bitfield destinationGpuMask,
            uint srcName, enum srcTarget,
            int srcLevel,
            int srcX, int srcY, int srcZ,
            uint dstName, enum dstTarget,
            int dstLevel,
            int dstX, int dstY, int dstZ,
            sizei width, sizei height, sizei depth);

    This function operates similarly to CopyImageSubData, except that it takes a source GPU,
    <sourceGpu>, and a set of destination GPUs defined by <destinationGpuMask>.

    INVALID_ENUM is generated
     * if either <srcTarget> or <dstTarget>
       - is not RENDERBUFFER or a valid non-proxy texture target,
       - is TEXTURE_BUFFER, or
       - is one of the cubemap face selectors described in table 3.17,
     * if the target does not match the type of the object.

    INVALID_OPERATION is generated
     * if either object is a texture and the texture is not complete,
     * if the source and destination formats are not compatible,
     * if the source and destination number of samples do not match,
     * if one image is compressed and the other is uncompressed and the
       block size of the compressed image is not equal to the texel size
       of the uncompressed image.

    INVALID_VALUE is generated
     * if <sourceGpu> is greater than or equal to the value of MAX_LGPU_GPUS_NVX,
     * if <destinationGpuMask> is zero,
     * if either <srcName> or <dstName> does not correspond to a valid
       renderbuffer or texture object according to the corresponding
       target parameter,
     * if the specified level is not a valid level for the image,
     * if the dimensions of either subregion exceed the boundaries of the
       corresponding image object, or
     * if the image format is compressed and the dimensions of the
       subregion fail to meet the alignment constraints of the format.

    20.3 Multi-GPU Synchronization

    LGPUCopyImageSubDataNVX provides implicit synchronization with previous rendering to the
    given texture or renderbuffer on the source GPU. Synchronization of the copy with the
    destination GPU(s) is achieved with the interlock function

        void LGPUInterlockNVX(void)

    This command synchronizes all linked GPUs to the same point in the API stream. To
    guarantee consistency, the interlock command must be used as a barrier between any two
    accesses by multiple GPUs to the same memory when at least one of the accesses is a write.
    For consistent copies between GPUs, synchronization is required before and after each copy
    (see the sketch at the end of this section):

    1. Prior to each call to LGPUCopyImageSubDataNVX, LGPUInterlockNVX() must be called after
       the most recent read or write of the target image by a destination GPU.

    2. After each call to LGPUCopyImageSubDataNVX, LGPUInterlockNVX() must be called prior to
       any subsequent read or write of the target image by a destination GPU.

    GPU writes to and reads from GPU-shared locations require synchronization as well. GPU
    writes such as transform feedback, shader image stores, CopyTexImage, and CopyBufferSubData
    are not automatically synchronized with writes by other GPUs. Nor are GPU reads such as
    texture fetches, shader image loads, and CopyTexImage synchronized with writes by other
    GPUs. Existing barriers such as TextureBarrier and MemoryBarrier only provide consistency
    guarantees for rendering, writes, and reads on a single GPU.
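
    For example, an informative sketch of a consistent inter-GPU copy following rules 1 and 2
    above (the texture name <tex> and the copy region are placeholders):

        // GPU 1 has rendered into its instance of tex; GPU 0 (the destination) may have
        // read or written its own instance earlier in the frame.

        glLGPUInterlockNVX();                       // rule 1: after the destination GPU's
                                                    // most recent access to the image

        glLGPUCopyImageSubDataNVX(1,                // sourceGpu: GPU 1
                                  0x1,              // destinationGpuMask: GPU 0
                                  tex, GL_TEXTURE_2D, 0, 0, 0, 0,
                                  tex, GL_TEXTURE_2D, 0, 0, 0, 0,
                                  width, height, 1);

        glLGPUInterlockNVX();                       // rule 2: before any subsequent access
                                                    // to the image by the destination GPU

        // GPU 0 can now sample or display the copied texels.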

Additions to the AGL/GLX/WGL Specifications

    None

GLX Protocol

    None

Errors

    Relaxation of INVALID_ENUM errors
    ---------------------------------
    GetBooleanv, GetIntegerv, GetInteger64v, GetFloatv, and GetDoublev now accept new tokens
    as described in the "New Tokens" section.

New State

    None

New Implementation Dependent State

    Add to Table 23.82, Implementation-Dependent Values, p. 784

                                                  Minimum
    Get Value          Type  Get Command          Value    Description         Sec.  Attribute
    -----------------  ----  -----------          -------  ------------------  ----  ---------
    MAX_LGPU_GPUS_NVX  Z+    GetIntegerv          2        Maximum number of   6.9   -
                                                           usable GPUs

Sample Code

    Binocular stereo rendering example using NVX_linked_gpu_multicast with a single-GPU
    fallback:

        struct ViewData {
            GLint   viewport_index;
            GLfloat mvp[16];
            GLfloat modelview[16];
        };
        ViewData leftViewData  = { 0, {...}, {...} };
        ViewData rightViewData = { 1, {...}, {...} };

        GLuint ubo[2];
        glCreateBuffers(2, &ubo[0]);

        if (has_NVX_linked_gpu_multicast) {
            glNamedBufferStorage(ubo[0], size, NULL,
                                 GL_LGPU_SEPARATE_STORAGE_BIT_NVX | GL_DYNAMIC_STORAGE_BIT);
            glLGPUNamedBufferSubDataNVX(0x1, ubo[0], 0, size, &leftViewData);
            glLGPUNamedBufferSubDataNVX(0x2, ubo[0], 0, size, &rightViewData);
        } else {
            glNamedBufferStorage(ubo[0], size, &leftViewData, 0);
            glNamedBufferStorage(ubo[1], size, &rightViewData, 0);
        }

        glViewportIndexedf(0,   0, 0, 640, 480);  // left viewport
        glViewportIndexedf(1, 640, 0, 640, 480);  // right viewport
        // The vertex shader sets gl_ViewportIndex according to viewport_index in the UBO.

        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

        if (has_NVX_linked_gpu_multicast) {
            glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo[0]);
            drawScene();
            // Make the copy on GPU 1 wait for the clear and draw above to complete on GPU 0.
            glLGPUInterlockNVX();
            // Copy the right viewport from GPU 1 to GPU 0.
            glLGPUCopyImageSubDataNVX(1, 0x1,
                                      renderBuffer, GL_RENDERBUFFER, 0, 640, 0, 0,
                                      renderBuffer, GL_RENDERBUFFER, 0, 640, 0, 0,
                                      640, 480, 1);
            // Make GPU 0 wait for GPU 1's copy to GPU 0 to complete.
            glLGPUInterlockNVX();
        } else {
            glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo[0]);
            drawScene();
            glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo[1]);
            drawScene();
        }
        // Both viewports are now present in GPU 0's renderbuffer.

Issues

    (1) Should we provide an explicit inter-GPU synchronization API? Will this make the
        implementation easier or harder for the driver and applications?

    RESOLVED. Yes. A naive implementation of implicit synchronization would simply interlock
    the GPUs before and after each copy. Smart implicit synchronization would have to track
    all APIs that can modify buffers and textures, creating an excessive burden for driver
    implementation and maintenance. An application can track dependencies more easily and
    outperform a naive driver implementation using explicit synchronization.

    (2) How does this extension interact with queries (e.g. occlusion queries)?

    RESOLVED. Queries are performed separately on each GPU. The standard GetQueryObject* APIs
    return query results for GPU 0 only. However, GetQueryBufferObject* can be used to
    retrieve query results for all GPUs through a buffer with separate storage
    (LGPU_SEPARATE_STORAGE_BIT_NVX).

    (3) Which textures and buffers have separate storage for each GPU?

    The default framebuffer and framebuffer texture attachments. Also buffers allocated with
    LGPU_SEPARATE_STORAGE_BIT_NVX. Other buffers and textures may or may not have separate
    storage.

    (4) Should we provide a mechanism to modify viewports independently for each GPU?

    RESOLVED. No. This can be achieved using multicast UBOs and
    ARB_shader_viewport_layer_array.
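
    For illustration, an informative sketch of a vertex shader along the lines assumed by the
    Sample Code above (given as a C string for glShaderSource; names are illustrative, and the
    std140 block offsets must match the data the application actually uploads):

        static const char *multicastVertexShaderSource =
            "#version 450 core\n"
            "#extension GL_ARB_shader_viewport_layer_array : require\n"
            "layout(std140, binding = 0) uniform ViewData {\n"
            "    int  viewport_index;   // differs per GPU via LGPUNamedBufferSubDataNVX\n"
            "    mat4 mvp;\n"
            "    mat4 modelview;\n"
            "};\n"
            "layout(location = 0) in vec4 position;\n"
            "void main() {\n"
            "    gl_ViewportIndex = viewport_index;  // selects the left or right viewport\n"
            "    gl_Position = mvp * position;\n"
            "}\n";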

    (5) Should we expose this extension on single-GPU configurations?

    RESOLVED. No. The extension provides no value unless MAX_LGPU_GPUS_NVX > 1. Limiting
    exposure to these configurations guarantees that at least two GPUs will be available when
    the extension is reported.

    (6) Can rendering be enabled/disabled on a specific subset of GPUs?

    This functionality will be added in a future version of this extension.

    (7) Should glGet*BufferParameter* return the LGPU_SEPARATE_STORAGE_BIT_NVX bit when
        BUFFER_STORAGE_FLAGS is queried?

    RESOLVED. Yes. BUFFER_STORAGE_FLAGS must match the flags parameter input to
    *BufferStorage, as specified in table 6.3.

Revision History

    Rev.    Date      Author   Changes
    ----    --------  -------  -----------------------------------------
     4      07/21/16  mjk      Register extension