Name

    NVX_linked_gpu_multicast

Name Strings

    GL_NVX_linked_gpu_multicast

Contact

    Joshua Schnarr, NVIDIA Corporation (jschnarr 'at' nvidia.com)
    Ingo Esser, NVIDIA Corporation (iesser 'at' nvidia.com)

Contributors

    Christoph Kubisch, NVIDIA
    Mark Kilgard, NVIDIA

Status

    Shipping in NVIDIA release 361 drivers.

Version

    Last Modified Date:         July 21, 2016
    NVIDIA Revision:            4

Number

    OpenGL Extension #493

Dependencies

    This extension is written against the OpenGL 4.5 specification (Compatibility Profile), dated
    February 2, 2015.

    This extension interacts with ARB_sparse_buffer.

    This extension interacts with ARB_copy_image.

    This extension interacts with EXT_direct_state_access.

    This extension interacts with ARB_shader_viewport_layer_array.

Overview

    This extension enables novel multi-GPU rendering techniques by providing application control
    over a group of linked GPUs with identical hardware configuration.

    Multi-GPU rendering techniques fall into two categories: implicit and explicit.  Existing
    explicit approaches like WGL_NV_gpu_affinity have two main drawbacks: CPU overhead and
    application complexity.  An application must manage one context per GPU and multi-pump the API
    stream.  Implicit multi-GPU rendering techniques avoid these issues by broadcasting rendering
    from one context to multiple GPUs.  Common implicit approaches include alternate-frame
    rendering (AFR), split-frame rendering (SFR) and multi-GPU anti-aliasing.  Each has
    drawbacks.  AFR scales nicely but interacts poorly with inter-frame dependencies.  SFR can
    improve latency but has challenges with offscreen rendering and scaling of vertex processing.
    With multi-GPU anti-aliasing, each GPU renders the same content with alternate sample
    positions and the driver blends the result to improve quality.  This also has issues with
    offscreen rendering and can conflict with other anti-aliasing techniques.

    These issues with implicit multi-GPU rendering all have the same root cause: the driver lacks
    adequate knowledge to accelerate every application.  To resolve this, NVX_linked_gpu_multicast
    provides application control over multiple GPUs with a single context.

    Key points:

    - One context controls multiple GPUs.  Every GPU in the linked group can access every object.

    - Rendering is broadcast.  Each draw is repeated across all GPUs in the linked group.

    - Each GPU gets its own instance of all framebuffers and attached textures, allowing
      individualized output for each GPU.  Input data can be customized for each GPU using buffers
      created with the storage flag LGPU_SEPARATE_STORAGE_BIT_NVX and the new command
      LGPUNamedBufferSubDataNVX.

    - Textures can be transferred from one GPU to another using LGPUCopyImageSubDataNVX.

New Procedures and Functions

    void LGPUNamedBufferSubDataNVX(
        bitfield gpuMask, uint buffer,
        intptr offset, sizeiptr size,
        const void *data);

    void LGPUCopyImageSubDataNVX(
        uint sourceGpu, bitfield destinationGpuMask,
        uint srcName, enum srcTarget,
        int srcLevel,
        int srcX, int srcY, int srcZ,
        uint dstName, enum dstTarget,
        int dstLevel,
        int dstX, int dstY, int dstZ,
        sizei width, sizei height, sizei depth);

    void LGPUInterlockNVX(void);

New Tokens

    Accepted in the <flags> parameter of BufferStorage and
    NamedBufferStorageEXT:

        LGPU_SEPARATE_STORAGE_BIT_NVX               0x0800

    Accepted by the <pname> parameter of GetBooleanv, GetIntegerv,
    GetInteger64v, GetFloatv, and GetDoublev:

        MAX_LGPU_GPUS_NVX                           0x92BA

Additions to the OpenGL 4.5 Specification (Compatibility Profile)

    (Add a new chapter after chapter 19 "Compute Shaders")

    20 Multicast Rendering

    This chapter specifies commands for using multiple GPUs in a linked group.  Commands are
    multicast, or repeated across all linked GPUs.  Objects are shared by all GPUs; however, each
    GPU has its own instance (copy) of many resources, including framebuffers.  When each GPU has
    its own instance of a resource, the resource is considered to have per-GPU storage.  When all
    GPUs share a single instance of a resource, this is considered GPU-shared storage.

    The mechanism for linking GPUs is implementation-specific, as is the process-global mechanism
    for enabling multicast rendering support (if necessary).  The number of GPUs usable for
    multicast rendering by a context can be queried by calling GetIntegerv with the symbolic
    constant MAX_LGPU_GPUS_NVX.  Individual GPUs are identified using zero-based indices in the
    range [0, n-1], where n is the number of multicast GPUs.  GPUs are also identified by
    bitmasks of the form 2^i, where i is the GPU index.  A set of GPUs is specified by the union
    (bitwise OR) of the masks for each GPU in the set.
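    The index and mask conventions above can be illustrated with a few helpers.  The following C
    sketch is not part of the extension.  The helper names are hypothetical, and the GPU count n is
    passed in explicitly instead of being queried with GetIntegerv(MAX_LGPU_GPUS_NVX), so the code
    runs without a GL context.  The extension itself only defines an error for an empty mask.  The
    stricter in-group check below is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers for the GPU index/mask convention.  n is the number of
 * multicast GPUs (the value of MAX_LGPU_GPUS_NVX), passed in explicitly here. */

/* Mask that identifies the single GPU with index i: 2^i. */
static uint32_t lgpu_mask_for_index(uint32_t i)
{
    assert(i < 32u);
    return UINT32_C(1) << i;
}

/* Union of the masks of all n GPUs in the linked group: 2^n - 1. */
static uint32_t lgpu_all_gpus_mask(uint32_t n)
{
    assert(n >= 1u && n <= 32u);
    return (n == 32u) ? UINT32_MAX : (UINT32_C(1) << n) - 1u;
}

/* True if mask is non-empty and only names GPUs with indices in [0, n-1].
 * (Assumption: the extension only specifies an error for a zero mask.) */
static bool lgpu_mask_within_group(uint32_t mask, uint32_t n)
{
    return mask != 0u && (mask & ~lgpu_all_gpus_mask(n)) == 0u;
}

/* Number of GPUs in the set described by mask. */
static uint32_t lgpu_mask_count(uint32_t mask)
{
    uint32_t count = 0;
    while (mask != 0u) {
        mask &= mask - 1u; /* clear the lowest set bit */
        ++count;
    }
    return count;
}
```

    With two linked GPUs, the masks 0x1 and 0x2 (as used in the sample code) select GPU 0 and
    GPU 1.  lgpu_all_gpus_mask(2) == 0x3 selects both.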

    20.1 Multi-GPU Buffer Storage

    Like other resources, buffer objects can have two types of storage: per-GPU storage or
    GPU-shared storage.  Per-GPU storage can be explicitly requested using the
    LGPU_SEPARATE_STORAGE_BIT_NVX flag with BufferStorage/NamedBufferStorageEXT.  If this flag is
    not set, the type of storage used is undefined.  The implementation may use either type
    and transition between them at any time.  Client reads of a buffer with per-GPU storage may
    return data from any GPU.

    The following rules apply to buffer objects with per-GPU storage:

      When mapped with WRITE_ONLY access, writes apply to all GPUs.
      When bound to UNIFORM_BUFFER, client uniform updates apply to all GPUs.
      When used as the write buffer for CopyBufferSubData or CopyNamedBufferSubData, writes apply to
      all GPUs.

    The following commands affect storage on all GPUs, even if the buffer object has per-GPU
    storage:

      BufferSubData, NamedBufferSubData, ClearBufferSubData, and ClearNamedBufferData

    An INVALID_VALUE error is generated if BufferStorage/NamedBufferStorageEXT is called with
    LGPU_SEPARATE_STORAGE_BIT_NVX set together with MAP_PERSISTENT_BIT or SPARSE_STORAGE_BIT_ARB.
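    The storage-flag restriction can be written as a small check.  This C sketch is illustrative
    only.  The function name is hypothetical, and the sketch models only the LGPU-specific rule,
    not the other BufferStorage errors.  The token values are those defined by OpenGL 4.4,
    ARB_sparse_buffer, and the New Tokens section of this extension.

```c
#include <assert.h>
#include <stdint.h>

/* Token values from OpenGL 4.4, ARB_sparse_buffer, and this extension. */
#define GL_NO_ERROR                       0x0000u
#define GL_INVALID_VALUE                  0x0501u
#define GL_MAP_PERSISTENT_BIT             0x0040u
#define GL_DYNAMIC_STORAGE_BIT            0x0100u
#define GL_SPARSE_STORAGE_BIT_ARB         0x0400u
#define GL_LGPU_SEPARATE_STORAGE_BIT_NVX  0x0800u

/* Hypothetical check of the LGPU-specific BufferStorage/NamedBufferStorageEXT
 * rule only: per-GPU storage cannot be combined with persistent mapping or
 * sparse storage.  Other BufferStorage errors are not modeled. */
static uint32_t lgpu_check_storage_flags(uint32_t flags)
{
    const uint32_t incompatible = GL_MAP_PERSISTENT_BIT | GL_SPARSE_STORAGE_BIT_ARB;

    if ((flags & GL_LGPU_SEPARATE_STORAGE_BIT_NVX) != 0u && (flags & incompatible) != 0u)
        return GL_INVALID_VALUE;
    return GL_NO_ERROR;
}
```

    The sample code's flag combination, LGPU_SEPARATE_STORAGE_BIT_NVX | DYNAMIC_STORAGE_BIT,
    passes this check.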

    To modify buffer object data on one or more GPUs, the client may use the command

    void LGPUNamedBufferSubDataNVX(
        bitfield gpuMask, uint buffer,
        intptr offset, sizeiptr size,
        const void *data);

    This function operates similarly to NamedBufferSubData, except that it updates the per-GPU
    buffer data on the set of GPUs defined by <gpuMask>.
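    The following C sketch contrasts the broadcast update of NamedBufferSubData with the masked
    update of LGPUNamedBufferSubDataNVX.  It models a buffer with per-GPU storage as one byte array
    per GPU.  This is a conceptual model, not driver code.  The type and function names and the
    fixed limits (four GPUs, 64-byte buffers) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_MAX_GPUS 4    /* model limit only, not an extension limit */
#define MODEL_BUF_SIZE 64u  /* bytes per GPU instance in this model */

/* Hypothetical model of a buffer object with per-GPU storage. */
typedef struct {
    uint32_t num_gpus;                                   /* n (MAX_LGPU_GPUS_NVX) */
    unsigned char data[MODEL_MAX_GPUS][MODEL_BUF_SIZE];  /* one instance per GPU */
} PerGpuBuffer;

/* Models NamedBufferSubData: the update is applied to every GPU's instance. */
static void model_named_buffer_sub_data(PerGpuBuffer *buf, size_t offset,
                                        size_t size, const void *src)
{
    assert(buf->num_gpus <= MODEL_MAX_GPUS);
    assert(offset <= MODEL_BUF_SIZE && size <= MODEL_BUF_SIZE - offset);
    for (uint32_t gpu = 0; gpu < buf->num_gpus; ++gpu)
        memcpy(&buf->data[gpu][offset], src, size);
}

/* Models LGPUNamedBufferSubDataNVX: only GPUs whose bit is set in gpu_mask are
 * updated; the other instances keep their previous contents. */
static void model_lgpu_named_buffer_sub_data(PerGpuBuffer *buf, uint32_t gpu_mask,
                                             size_t offset, size_t size,
                                             const void *src)
{
    assert(gpu_mask != 0u); /* a zero mask is an INVALID_VALUE error */
    assert(buf->num_gpus <= MODEL_MAX_GPUS);
    assert(offset <= MODEL_BUF_SIZE && size <= MODEL_BUF_SIZE - offset);
    for (uint32_t gpu = 0; gpu < buf->num_gpus; ++gpu)
        if ((gpu_mask & (UINT32_C(1) << gpu)) != 0u)
            memcpy(&buf->data[gpu][offset], src, size);
}
```

    This mirrors the sample code.  There, the masks 0x1 and 0x2 place the left-eye and right-eye
    view data in GPU 0's and GPU 1's instances of the same uniform buffer.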

    An INVALID_VALUE error is generated if <gpuMask> is zero.
    An INVALID_OPERATION error is generated if <buffer> is not the name of an existing buffer
    object.
    An INVALID_VALUE error is generated if <offset> or <size> is negative, or if <offset> + <size>
    is greater than the value of BUFFER_SIZE for the buffer object.
    An INVALID_OPERATION error is generated if any part of the specified buffer range is mapped
    with MapBufferRange or MapBuffer (see section 6.3), unless it was mapped with
    MAP_PERSISTENT_BIT set in the MapBufferRange access flags.
    An INVALID_OPERATION error is generated if the BUFFER_IMMUTABLE_STORAGE flag of the buffer
    object is TRUE and the value of BUFFER_STORAGE_FLAGS for the buffer does not have the
    DYNAMIC_STORAGE_BIT set.
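    The error rules above can be collected into a single validation routine.  The following C
    sketch is a hypothetical model, not driver code.  BufferState is an assumed snapshot of the
    buffer's state, and the mapping is simplified to a single range.  The order of the checks is
    also an assumption, because the rules above do not specify one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GL_NO_ERROR            0x0000u
#define GL_INVALID_VALUE       0x0501u
#define GL_INVALID_OPERATION   0x0502u
#define GL_MAP_PERSISTENT_BIT  0x0040u
#define GL_DYNAMIC_STORAGE_BIT 0x0100u

/* Hypothetical snapshot of the buffer state that the error rules depend on. */
typedef struct {
    bool     exists;         /* <buffer> names an existing buffer object */
    int64_t  size;           /* BUFFER_SIZE */
    bool     immutable;      /* BUFFER_IMMUTABLE_STORAGE */
    uint32_t storage_flags;  /* BUFFER_STORAGE_FLAGS */
    bool     mapped;         /* a range of the buffer is currently mapped */
    int64_t  map_offset;     /* start of the mapped range */
    int64_t  map_length;     /* length of the mapped range */
    uint32_t map_access;     /* access flags passed to MapBufferRange */
} BufferState;

/* Returns the error LGPUNamedBufferSubDataNVX would generate, or GL_NO_ERROR. */
static uint32_t check_lgpu_named_buffer_sub_data(uint32_t gpu_mask, const BufferState *buf,
                                                 int64_t offset, int64_t size)
{
    if (gpu_mask == 0u)
        return GL_INVALID_VALUE;
    if (!buf->exists)
        return GL_INVALID_OPERATION;
    if (offset < 0 || size < 0 || offset + size > buf->size)
        return GL_INVALID_VALUE;
    /* Any overlap between [offset, offset + size) and a non-persistent mapping. */
    if (buf->mapped && (buf->map_access & GL_MAP_PERSISTENT_BIT) == 0u && size > 0 &&
        offset < buf->map_offset + buf->map_length &&
        buf->map_offset < offset + size)
        return GL_INVALID_OPERATION;
    if (buf->immutable && (buf->storage_flags & GL_DYNAMIC_STORAGE_BIT) == 0u)
        return GL_INVALID_OPERATION;
    return GL_NO_ERROR;
}
```

    The sample code satisfies the last rule by creating its per-GPU uniform buffer with
    DYNAMIC_STORAGE_BIT.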

    20.2 Multi-GPU Framebuffers and Textures

    All buffers in the default framebuffer, as well as renderbuffers and textures attached to
    framebuffer objects, receive per-GPU storage.  Storage for other textures is undefined: it may
    be per-GPU or GPU-shared and can transition between the types at any time.

    To copy texel data between GPUs, the client may use the command

    void LGPUCopyImageSubDataNVX(
        uint sourceGpu, bitfield destinationGpuMask,
        uint srcName, enum srcTarget,
        int srcLevel,
        int srcX, int srcY, int srcZ,
        uint dstName, enum dstTarget,
        int dstLevel,
        int dstX, int dstY, int dstZ,
        sizei width, sizei height, sizei depth);

    This function operates similarly to CopyImageSubData, except that it takes a source GPU
    index, <sourceGpu>, and a set of destination GPUs defined by <destinationGpuMask>.  The
    region is read from the source GPU's instance of the source image and written to each
    destination GPU's instance of the destination image.

    INVALID_ENUM is generated
     * if either <srcTarget> or <dstTarget>
      - is not RENDERBUFFER or a valid non-proxy texture target,
      - is TEXTURE_BUFFER, or
      - is one of the cubemap face selectors described in table 3.17,
     * if the target does not match the type of the object.

    INVALID_OPERATION is generated
     * if either object is a texture and the texture is not complete,
     * if the source and destination formats are not compatible,
     * if the source and destination number of samples do not match,
     * if one image is compressed and the other is uncompressed and the
       block size of the compressed image is not equal to the texel size
       of the uncompressed image.

    INVALID_VALUE is generated
     * if <sourceGpu> is greater than or equal to MAX_LGPU_GPUS_NVX,
     * if <destinationGpuMask> is zero,
     * if either <srcName> or <dstName> does not correspond to a valid
       renderbuffer or texture object according to the corresponding
       target parameter, or
     * if the specified level is not a valid level for the image, or
     * if the dimensions of either subregion exceed the boundaries
       of the corresponding image object, or
     * if the image format is compressed and the dimensions of the
       subregion fail to meet the alignment constraints of the format.
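    The data flow of LGPUCopyImageSubDataNVX can be modeled without a GL context.  The following C
    sketch is a simplified illustration, not an implementation.  It treats an image with per-GPU
    storage as one small 2D array per GPU and handles only level 0, depth 1, and identical formats.
    It checks only the <sourceGpu>, <destinationGpuMask>, and subregion-bounds INVALID_VALUE rules.
    The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MAX_GPUS 4   /* model limit only */
#define IMG_W 8            /* model image width in texels */
#define IMG_H 4            /* model image height in texels */

#define GL_NO_ERROR      0x0000u
#define GL_INVALID_VALUE 0x0501u

/* Hypothetical model of a 2D image with per-GPU storage, such as a
 * renderbuffer attached to a framebuffer object. */
typedef struct {
    uint32_t num_gpus;                              /* n (MAX_LGPU_GPUS_NVX) */
    uint32_t texels[MODEL_MAX_GPUS][IMG_H][IMG_W];  /* one instance per GPU */
} PerGpuImage;

/* Models LGPUCopyImageSubDataNVX for level 0, depth 1: copies a width x height
 * region of source_gpu's instance of src into the instance of dst on every GPU
 * named in dst_mask.  Only some INVALID_VALUE rules are checked. */
static uint32_t model_lgpu_copy_image_2d(uint32_t source_gpu, uint32_t dst_mask,
                                         const PerGpuImage *src, int src_x, int src_y,
                                         PerGpuImage *dst, int dst_x, int dst_y,
                                         int width, int height)
{
    if (source_gpu >= src->num_gpus)
        return GL_INVALID_VALUE;
    if (dst_mask == 0u)
        return GL_INVALID_VALUE;
    if (src_x < 0 || src_y < 0 || dst_x < 0 || dst_y < 0 || width < 0 || height < 0 ||
        src_x + width > IMG_W || src_y + height > IMG_H ||
        dst_x + width > IMG_W || dst_y + height > IMG_H)
        return GL_INVALID_VALUE;

    for (uint32_t gpu = 0; gpu < dst->num_gpus; ++gpu) {
        if ((dst_mask & (UINT32_C(1) << gpu)) == 0u)
            continue;  /* this GPU's instance is not a destination */
        for (int y = 0; y < height; ++y)
            for (int x = 0; x < width; ++x)
                dst->texels[gpu][dst_y + y][dst_x + x] =
                    src->texels[source_gpu][src_y + y][src_x + x];
    }
    return GL_NO_ERROR;
}
```

    A call such as model_lgpu_copy_image_2d(1, 0x1, &img, 4, 0, &img, 4, 0, 4, 4) mirrors the
    sample code.  It copies the right half of the image from GPU 1 to GPU 0, with source and
    destination naming the same renderbuffer.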

    20.3 Multi-GPU Synchronization

    LGPUCopyImageSubDataNVX provides implicit synchronization with previous rendering to the given
    texture or renderbuffer on the source GPU.  Synchronization of the copy with the destination
    GPU(s) is achieved with the interlock function:

      void LGPUInterlockNVX(void);

    This is called to synchronize all linked GPUs to the same point in the API stream.  To
    guarantee consistency, the interlock command must be used as a barrier between any two
    accesses by multiple GPUs to the same memory when at least one of the accesses is a write.
    For consistent copies between GPUs, synchronization is required before and after each copy:

    1. Prior to each call to LGPUCopyImageSubDataNVX, LGPUInterlockNVX() must be called after
       the most recent read or write of the target image by a destination GPU.

    2. After each call to LGPUCopyImageSubDataNVX, LGPUInterlockNVX() must be called
       prior to any future read or write of the target image by a destination GPU.
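    The two rules above can be checked mechanically for a single target image.  The following C
    sketch is a hypothetical validation aid, not part of the extension.  It models a command
    stream as a list of three kinds of operation: accesses to the target image (with the mask of
    GPUs performing them), copies (with their destination mask), and interlocks.  It then reports
    the first command that breaks rule 1 or rule 2.  As an assumed simplification, it tracks whole
    images rather than subregions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the command stream as it affects ONE target image. */
typedef enum { OP_ACCESS, OP_COPY, OP_INTERLOCK } OpKind;

typedef struct {
    OpKind   kind;
    uint32_t gpu_mask;  /* OP_ACCESS: GPUs reading or writing the target image;
                           OP_COPY: destinationGpuMask; OP_INTERLOCK: unused */
} Op;

/* Returns the index of the first op that violates rule 1 or rule 2, or -1 if
 * every copy is correctly fenced by LGPUInterlockNVX calls. */
static ptrdiff_t first_interlock_violation(const Op *ops, size_t count)
{
    uint32_t accessed_since_interlock = 0;   /* GPUs that touched the target */
    uint32_t copied_to_since_interlock = 0;  /* GPUs that received a copy */

    for (size_t i = 0; i < count; ++i) {
        switch (ops[i].kind) {
        case OP_INTERLOCK:
            /* All linked GPUs reach the same point in the API stream. */
            accessed_since_interlock = 0;
            copied_to_since_interlock = 0;
            break;
        case OP_ACCESS:
            /* Rule 2: a destination GPU may not touch the target image after a
             * copy until an interlock has been issued. */
            if ((ops[i].gpu_mask & copied_to_since_interlock) != 0u)
                return (ptrdiff_t)i;
            accessed_since_interlock |= ops[i].gpu_mask;
            break;
        case OP_COPY:
            /* Rule 1: an interlock must separate each destination GPU's most
             * recent access to the target image from the copy. */
            if ((ops[i].gpu_mask & accessed_since_interlock) != 0u)
                return (ptrdiff_t)i;
            copied_to_since_interlock |= ops[i].gpu_mask;
            break;
        }
    }
    return -1;
}
```

    The sample code's multicast branch can be encoded as: glClear and drawScene on both GPUs,
    interlock, copy from GPU 1 to GPU 0, interlock.  That sequence yields no violation.  Dropping
    the first interlock makes the checker report the copy.  Dropping the second makes it report
    any later access to the image by GPU 0.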

    GPU writes and reads to/from GPU-shared locations require synchronization as well.  GPU writes
    such as transform feedback, shader image stores, CopyTexImage, and CopyBufferSubData are not
    automatically synchronized with writes by other GPUs.  Likewise, GPU reads such as texture
    fetches, shader image loads, and CopyTexImage are not synchronized with writes by other GPUs.
    Existing barriers such as TextureBarrier and MemoryBarrier only provide consistency guarantees
    for rendering, writes, and reads on a single GPU.

Additions to the AGL/GLX/WGL Specifications

    None

GLX Protocol

    None

Errors

    Relaxation of INVALID_ENUM errors
    ---------------------------------
    GetBooleanv, GetIntegerv, GetInteger64v, GetFloatv, and GetDoublev now accept new tokens as
    described in the "New Tokens" section.

New State

    None

New Implementation Dependent State

    Add to Table 23.82, Implementation-Dependent Values, p. 784

                                                Minimum
    Get Value               Type   Get Command  Value   Description               Sec.  Attribute
    ----------------------  ----   -----------  ------- -----------------------   ----  ---------
    MAX_LGPU_GPUS_NVX        Z+   GetIntegerv      2    Maximum number of         6.9     -
                                                        usable GPUs

Sample Code

    Binocular stereo rendering example using NVX_linked_gpu_multicast with a single-GPU fallback:

    struct ViewData {
        GLint viewport_index;
        GLfloat mvp[16];
        GLfloat modelview[16];
    };
    ViewData leftViewData = { 0, {...}, {...} };
    ViewData rightViewData = { 1, {...}, {...} };
    const GLsizeiptr size = sizeof(ViewData);

    // Assumed to be provided by the application: has_NVX_linked_gpu_multicast, drawScene(),
    // and renderBuffer, the color renderbuffer of the currently bound framebuffer object.
    GLuint ubo[2];
    glCreateBuffers(2, &ubo[0]);

    if (has_NVX_linked_gpu_multicast) {
        // One UBO with per-GPU storage; each GPU receives its own view data
        glNamedBufferStorage(ubo[0], size, NULL, GL_LGPU_SEPARATE_STORAGE_BIT_NVX | GL_DYNAMIC_STORAGE_BIT);
        glLGPUNamedBufferSubDataNVX(0x1, ubo[0], 0, size, &leftViewData);   // GPU 0: left eye
        glLGPUNamedBufferSubDataNVX(0x2, ubo[0], 0, size, &rightViewData);  // GPU 1: right eye
    } else {
        glNamedBufferStorage(ubo[0], size, &leftViewData, 0);
        glNamedBufferStorage(ubo[1], size, &rightViewData, 0);
    }

    glViewportIndexedf(0, 0, 0, 640, 480);  // left viewport
    glViewportIndexedf(1, 640, 0, 640, 480);  // right viewport
    // Vertex shader sets gl_ViewportIndex according to viewport_index in UBO

    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

    if (has_NVX_linked_gpu_multicast) {
        glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo[0]);
        drawScene();
        // Make GPU 1 wait for glClear above to complete on GPU 0
        glLGPUInterlockNVX();
        // Copy right viewport from GPU 1 to GPU 0
        glLGPUCopyImageSubDataNVX(1, 0x1,
                                  renderBuffer, GL_RENDERBUFFER, 0, 640, 0, 0,
                                  renderBuffer, GL_RENDERBUFFER, 0, 640, 0, 0,
                                  640, 480, 1);
        // Make GPU 0 wait for GPU 1 copy to GPU 0
        glLGPUInterlockNVX();
    } else {
        glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo[0]);
        drawScene();
        glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo[1]);
        drawScene();
    }
    // Both viewports are now present in GPU 0's renderbuffer

Issues

  (1) Should we provide an explicit inter-GPU synchronization API?  Will this make the implementation
    easier or harder for the driver and applications?

    RESOLVED. Yes. A naive implementation of implicit synchronization would simply interlock the
    GPUs before and after each copy.  Smart implicit synchronization would have to track all APIs
    that can modify buffers and textures, creating an excessive burden for driver implementation
    and maintenance.  An application can track dependencies more easily and outperform a naive
    driver implementation using explicit synchronization.

  (2) How does this extension interact with queries (e.g. occlusion queries)?

    RESOLVED. Queries are performed separately on each GPU. The standard GetQueryObject* APIs
    return query results for GPU 0 only. However, GetQueryBufferObject* can be used to retrieve
    query results for all GPUs through a buffer with separate storage
    (LGPU_SEPARATE_STORAGE_BIT_NVX).

  (3) Which textures and buffers have separate storage for each GPU?

    The default framebuffer and the textures and renderbuffers attached to framebuffer objects,
    as well as buffers allocated with LGPU_SEPARATE_STORAGE_BIT_NVX.  Other buffers and textures
    may or may not have separate storage.

  (4) Should we provide a mechanism to modify viewports independently for each GPU?

    RESOLVED. No. This can be achieved using multicast UBOs and ARB_shader_viewport_layer_array.

  (5) Should we expose this extension on single-GPU configurations?

    RESOLVED. No. The extension provides no value unless MAX_LGPU_GPUS_NVX > 1.  Limiting exposure
    to these configurations guarantees that at least two GPUs will be available when the extension
    is reported.

  (6) Can rendering be enabled/disabled on a specific subset of GPUs?

    This functionality will be added in a future version of this extension.

  (7) Should glGet*BufferParameter* return the LGPU_SEPARATE_STORAGE_BIT_NVX bit when
    BUFFER_STORAGE_FLAGS is queried?

    RESOLVED. Yes. BUFFER_STORAGE_FLAGS must match the flags parameter input to *BufferStorage, as
    specified in table 6.3.

Revision History

    Rev.    Date    Author    Changes
    ----  --------  --------  -----------------------------------------
     4    07/21/16  mjk       Register extension