Name

    NV_gpu_multicast

Name Strings

    GL_NV_gpu_multicast

Contact

    Joshua Schnarr, NVIDIA Corporation (jschnarr 'at' nvidia.com)
    Ingo Esser, NVIDIA Corporation (iesser 'at' nvidia.com)

Contributors

    Christoph Kubisch, NVIDIA
    Mark Kilgard, NVIDIA
    Robert Menzel, NVIDIA
    Kevin Lefebvre, NVIDIA
    Ralf Biermann, NVIDIA

Status

    Shipping in NVIDIA release 370.XX drivers and up.

Version

    Last Modified Date:         April 2, 2019
    Revision:                   7

Number

    OpenGL Extension #494

Dependencies

    This extension is written against the OpenGL 4.5 specification
    (Compatibility Profile), dated February 2, 2015.

    This extension requires ARB_copy_image.

    This extension interacts with ARB_sample_locations.

    This extension interacts with ARB_sparse_buffer.

    This extension requires EXT_direct_state_access.

    This extension interacts with EXT_bindable_uniform.

Overview

    This extension enables novel multi-GPU rendering techniques by providing application control
    over a group of linked GPUs with identical hardware configuration.

    Multi-GPU rendering techniques fall into two categories: implicit and explicit.  Existing
    explicit approaches like WGL_NV_gpu_affinity have two main drawbacks: CPU overhead and
    application complexity.  An application must manage one context per GPU and multi-pump the API
    stream.  Implicit multi-GPU rendering techniques avoid these issues by broadcasting rendering
    from one context to multiple GPUs.  Common implicit approaches include alternate-frame
    rendering (AFR), split-frame rendering (SFR) and multi-GPU anti-aliasing.  They each have
    drawbacks.  AFR scales nicely but interacts poorly with inter-frame dependencies.  SFR can
    improve latency but has challenges with offscreen rendering and scaling of vertex processing.
    With multi-GPU anti-aliasing, each GPU renders the same content with alternate sample
    positions and the driver blends the result to improve quality.  This also has issues with
    offscreen rendering and can conflict with other anti-aliasing techniques.

    These issues with implicit multi-GPU rendering all have the same root cause: the driver lacks
    adequate knowledge to accelerate every application.  To resolve this, NV_gpu_multicast
    provides fine-grained, explicit application control over multiple GPUs with a single context.

    Key points:

    - One context controls multiple GPUs.  Every GPU in the linked group can access every object.

    - Rendering is broadcast.  Each draw is repeated across all GPUs in the linked group.

    - Each GPU gets its own instance of all framebuffers, allowing individualized output for each
      GPU.  Input data can be customized for each GPU using buffers created with the storage flag
      PER_GPU_STORAGE_BIT_NV, and a new API, MulticastBufferSubDataNV.

    - New interfaces provide mechanisms to transfer textures and buffers from one GPU to another.

New Procedures and Functions

    void RenderGpuMaskNV(bitfield mask);

    void MulticastBufferSubDataNV(
        bitfield gpuMask, uint buffer,
        intptr offset, sizeiptr size,
        const void *data);

    void MulticastCopyBufferSubDataNV(
        uint readGpu, bitfield writeGpuMask,
        uint readBuffer, uint writeBuffer,
        intptr readOffset, intptr writeOffset, sizeiptr size);

    void MulticastCopyImageSubDataNV(
        uint srcGpu, bitfield dstGpuMask,
        uint srcName, enum srcTarget,
        int srcLevel,
        int srcX, int srcY, int srcZ,
        uint dstName, enum dstTarget,
        int dstLevel,
        int dstX, int dstY, int dstZ,
        sizei srcWidth, sizei srcHeight, sizei srcDepth);

    void MulticastBlitFramebufferNV(uint srcGpu, uint dstGpu,
                                    int srcX0, int srcY0, int srcX1, int srcY1,
                                    int dstX0, int dstY0, int dstX1, int dstY1,
                                    bitfield mask, enum filter);

    void MulticastFramebufferSampleLocationsfvNV(uint gpu, uint framebuffer, uint start,
                                                 sizei count, const float *v);

    void MulticastBarrierNV(void);

    void MulticastWaitSyncNV(uint signalGpu, bitfield waitGpuMask);

    void MulticastGetQueryObjectivNV(uint gpu, uint id, enum pname, int *params);
    void MulticastGetQueryObjectuivNV(uint gpu, uint id, enum pname, uint *params);
    void MulticastGetQueryObjecti64vNV(uint gpu, uint id, enum pname, int64 *params);
    void MulticastGetQueryObjectui64vNV(uint gpu, uint id, enum pname, uint64 *params);

New Tokens

    Accepted in the <flags> parameter of BufferStorage and NamedBufferStorageEXT:

        PER_GPU_STORAGE_BIT_NV                      0x0800

    Accepted by the <pname> parameter of GetBooleanv, GetIntegerv, GetInteger64v, GetFloatv, and
    GetDoublev:

        MULTICAST_GPUS_NV                           0x92BA
        RENDER_GPU_MASK_NV                          0x9558

    Accepted as a value for <pname> for the TexParameter{if}, TexParameter{if}v,
    TextureParameter{if}, TextureParameter{if}v, MultiTexParameter{if}EXT and
    MultiTexParameter{if}vEXT commands and for the <value> parameter of GetTexParameter{if}v,
    GetTextureParameter{if}vEXT and GetMultiTexParameter{if}vEXT:

        PER_GPU_STORAGE_NV                          0x9548

    Accepted by the <pname> parameter of GetMultisamplefv:

        MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV   0x9549

Additions to the OpenGL 4.5 Specification (Compatibility Profile)

    (Add a new chapter after chapter 19 "Compute Shaders")

    20 Multicast Rendering

    Some implementations support multiple linked GPUs driven by a single context.  Often the
    distribution of work to individual GPUs is managed by the GL without client knowledge.  This
    chapter specifies commands for explicitly distributing work across GPUs in a linked group.
    Rendering can be enabled or disabled for specific GPUs.  Draw commands are multicast, or
    repeated across all enabled GPUs.  Objects are shared by all GPUs; however, each GPU has its
    own instance (copy) of many resources, including framebuffers.  When each GPU has its own
    instance of a resource, it is considered to have per-GPU storage.  When all GPUs share a
    single instance of a resource, this is considered GPU-shared storage.

    The mechanism for linking GPUs is implementation specific, as is the mechanism for enabling
    multicast rendering support (if necessary).  The number of GPUs usable for multicast rendering
    by a context can be queried by calling GetIntegerv with the symbolic constant
    MULTICAST_GPUS_NV.  This number is constant for the lifetime of a context.  Individual GPUs
    are identified using zero-based indices in the range [0, n-1], where n is the number of
    multicast GPUs.  GPUs are also identified by bitmasks of the form 2^i, where i is the GPU
    index.  A set of GPUs is specified by the union of masks for each GPU in the set.

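The index/bitmask convention above can be made concrete with a few helpers.  These are purely illustrative (the helper names are not part of the extension): GPU i maps to the mask 2^i, and a GPU set is the bitwise OR of its members' masks.

```c
#include <assert.h>

/* Illustrative helpers for the index <-> bitmask convention described
 * above.  None of these names are part of the extension. */
typedef unsigned int bitfield;

static bitfield gpu_mask(unsigned int gpu_index)
{
    return 1u << gpu_index;            /* mask 2^i for GPU index i */
}

static bitfield all_gpus_mask(unsigned int gpu_count)
{
    return (1u << gpu_count) - 1u;     /* (2^n)-1: every GPU in the linked group */
}

static int mask_contains_gpu(bitfield mask, unsigned int gpu_index)
{
    return (mask >> gpu_index) & 1u;   /* is GPU i a member of the set? */
}
```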
    20.1 Controlling Individual GPUs

    Render commands are restricted to a specific set of GPUs with

      void RenderGpuMaskNV(bitfield mask);

    The following errors apply to RenderGpuMaskNV:

    INVALID_OPERATION is generated
    * if <mask> is zero,
    * if <mask> is not zero and <mask> is greater than or equal to 2^n, where n is equal
      to MULTICAST_GPUS_NV,
    * if issued between BeginConditionalRender and the corresponding EndConditionalRender.

    If the command does not generate an error, RENDER_GPU_MASK_NV is set to <mask>.  The default
    value of RENDER_GPU_MASK_NV is (2^n)-1.

    Render commands are skipped for any GPU that is not present in RENDER_GPU_MASK_NV.  This
    includes draw calls, clears, compute dispatches, and copies or pixel path operations that
    write to a framebuffer (e.g. DrawPixels, BlitFramebuffer).  For a full list of render
    commands see section 2.4 (page 26).  MulticastBlitFramebufferNV is an exception to this
    policy: while it is a rendering command, it specifies its own source and destination GPUs.
    Note that buffer and texture updates are not affected by RENDER_GPU_MASK_NV.

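A minimal sketch of the per-GPU rendering pattern: select one GPU, issue draws that only it executes, then restore the default mask.  Because no GL context is assumed here, glRenderGpuMaskNV and the draw call are stand-in stubs that merely record the mask behavior; a real application would call the loaded GL entry points.

```c
#include <assert.h>

typedef unsigned int GLbitfield;

static GLbitfield current_render_mask;
static int draws_on_gpu[8];            /* per-GPU draw counts, for illustration */

/* Stub: a real application would call the driver's glRenderGpuMaskNV. */
static void glRenderGpuMaskNV(GLbitfield mask) { current_render_mask = mask; }

/* Stub draw: executed only by GPUs present in RENDER_GPU_MASK_NV. */
static void draw_scene(void)
{
    for (int gpu = 0; gpu < 8; ++gpu)
        if ((current_render_mask >> gpu) & 1u)
            draws_on_gpu[gpu]++;
}

/* Split-frame style loop: each GPU renders its own portion, then the
 * mask is restored to the default (2^n)-1. */
static void render_split_frame(unsigned int gpu_count)
{
    for (unsigned int gpu = 0; gpu < gpu_count; ++gpu) {
        glRenderGpuMaskNV(1u << gpu);  /* restrict rendering to one GPU */
        draw_scene();                  /* e.g. this GPU's region of the frame */
    }
    glRenderGpuMaskNV((1u << gpu_count) - 1u);   /* back to all GPUs */
}
```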
    20.2 Multi-GPU Buffer Storage

    Like other resources, buffer objects can have two types of storage, per-GPU storage or
    GPU-shared storage.  Per-GPU storage can be explicitly requested using the
    PER_GPU_STORAGE_BIT_NV flag with BufferStorage/NamedBufferStorageEXT.  If this flag is not
    set, the type of storage used is undefined.  The implementation may use either type and
    transition between them at any time.  Client reads of a buffer with per-GPU storage may source
    from any GPU.

    The following rules apply to buffer objects with per-GPU storage:

      When mapped, updates apply to all GPUs (only WRITE_ONLY access is supported).
      When used as the write buffer for CopyBufferSubData or CopyNamedBufferSubData, writes apply
      to all GPUs.

    The following commands affect storage on all GPUs, even if the buffer object has per-GPU
    storage:

      BufferSubData, NamedBufferSubData, ClearBufferSubData, and ClearNamedBufferData

    An INVALID_VALUE error is generated if BufferStorage/NamedBufferStorageEXT is called with
    PER_GPU_STORAGE_BIT_NV set together with MAP_READ_BIT or SPARSE_STORAGE_BIT_ARB.

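The flag restriction above can be sketched as a small validation check.  The token values below come from the respective specifications (PER_GPU_STORAGE_BIT_NV from this extension, MAP_READ_BIT from core GL, SPARSE_STORAGE_BIT_ARB from ARB_sparse_buffer); the helper itself is illustrative, not part of any API.

```c
#include <assert.h>

#define GL_MAP_READ_BIT            0x0001
#define GL_SPARSE_STORAGE_BIT_ARB  0x0400
#define GL_PER_GPU_STORAGE_BIT_NV  0x0800

/* Returns 1 if <flags> would not trigger the INVALID_VALUE error above:
 * PER_GPU_STORAGE_BIT_NV may not be combined with MAP_READ_BIT or
 * SPARSE_STORAGE_BIT_ARB. */
static int per_gpu_storage_flags_valid(unsigned int flags)
{
    if (flags & GL_PER_GPU_STORAGE_BIT_NV)
        return (flags & (GL_MAP_READ_BIT | GL_SPARSE_STORAGE_BIT_ARB)) == 0;
    return 1;   /* restriction only applies to per-GPU storage requests */
}
```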
    To modify buffer object data on one or more GPUs, the client may use the command

      void MulticastBufferSubDataNV(
          bitfield gpuMask, uint buffer,
          intptr offset, sizeiptr size,
          const void *data);

    This command operates similarly to NamedBufferSubData, except that it updates the per-GPU
    buffer data on the set of GPUs defined by <gpuMask>.  If <buffer> has GPU-shared storage,
    <gpuMask> is ignored and the shared instance of the buffer is updated.

    An INVALID_VALUE error is generated if <gpuMask> is zero or is greater than or equal to 2^n,
    where n is equal to MULTICAST_GPUS_NV.
    An INVALID_OPERATION error is generated if <buffer> is not the name of an existing buffer
    object.
    An INVALID_VALUE error is generated if <offset> or <size> is negative, or if <offset> + <size>
    is greater than the value of BUFFER_SIZE for the buffer object.
    An INVALID_OPERATION error is generated if any part of the specified buffer range is mapped
    with MapBufferRange or MapBuffer (see section 6.3), unless it was mapped with
    MAP_PERSISTENT_BIT set in the MapBufferRange access flags.
    An INVALID_OPERATION error is generated if the BUFFER_IMMUTABLE_STORAGE flag of the buffer
    object is TRUE and the value of BUFFER_STORAGE_FLAGS for the buffer does not have the
    DYNAMIC_STORAGE_BIT set.

    To copy between buffers created with PER_GPU_STORAGE_BIT_NV, the client may use the command

      void MulticastCopyBufferSubDataNV(
        uint readGpu, bitfield writeGpuMask,
        uint readBuffer, uint writeBuffer,
        intptr readOffset, intptr writeOffset, sizeiptr size);

    This command operates similarly to CopyNamedBufferSubData, while adding control over the
    source and destination GPU(s).  The read GPU index is specified by <readGpu> and
    the set of write GPUs is specified by the mask in <writeGpuMask>.

    Implementations may also support this command with buffers not created with
    PER_GPU_STORAGE_BIT_NV.  This support can be determined with one test copy with an error check
    (see error discussion below).  Note that a buffer created without PER_GPU_STORAGE_BIT_NV is
    considered to have undefined storage and the behavior of the command depends on the storage
    type (per-GPU or GPU-shared) currently used for <writeBuffer>.  If <writeBuffer> is using
    GPU-shared storage, the normal error checks apply but the command behaves as if <writeGpuMask>
    includes all GPUs.  If <writeBuffer> is using per-GPU storage, the command behaves as if
    PER_GPU_STORAGE_BIT_NV were set; however, performance may be reduced.

    The following error may apply to MulticastCopyBufferSubDataNV on some implementations and not
    on others.  In earlier revisions of this extension the error was required; therefore,
    applications should perform a test copy using buffers without PER_GPU_STORAGE_BIT_NV before
    relying on that functionality:

    An INVALID_OPERATION error is generated if the value of BUFFER_STORAGE_FLAGS for <readBuffer>
    or <writeBuffer> does not have PER_GPU_STORAGE_BIT_NV set.

    The following errors apply to MulticastCopyBufferSubDataNV:

    An INVALID_OPERATION error is generated if <readBuffer> or <writeBuffer> is not the name of an
    existing buffer object.
    An INVALID_VALUE error is generated if any of <readOffset>, <writeOffset>, or <size> are
    negative, if <readOffset> + <size> exceeds the size of the source buffer object, or if
    <writeOffset> + <size> exceeds the size of the destination buffer object.
    An INVALID_OPERATION error is generated if either the source or destination buffer object is
    mapped, unless it was mapped with MAP_PERSISTENT_BIT set in the Map*BufferRange access
    flags.
    An INVALID_VALUE error is generated if <readGpu> is greater than or equal to
    MULTICAST_GPUS_NV.
    An INVALID_OPERATION error is generated if <writeGpuMask> is zero.  An INVALID_VALUE error is
    generated if <writeGpuMask> is not zero and <writeGpuMask> is greater than or equal to 2^n,
    where n is equal to MULTICAST_GPUS_NV.
    An INVALID_VALUE error is generated if the source and destination are the same buffer object,
    <readGpu> is present in <writeGpuMask>, and the ranges [<readOffset>, <readOffset> + <size>)
    and [<writeOffset>, <writeOffset> + <size>) overlap.

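The final error above (overlapping same-buffer copies when the read GPU is also a write GPU) can be sketched as follows.  Two half-open ranges overlap iff each starts before the other ends; the helper names are illustrative only.

```c
#include <assert.h>

typedef long long GLintptr;
typedef long long GLsizeiptr;

/* [readOffset, readOffset+size) and [writeOffset, writeOffset+size)
 * overlap iff each range starts before the other ends. */
static int ranges_overlap(GLintptr readOffset, GLintptr writeOffset, GLsizeiptr size)
{
    return readOffset < writeOffset + size && writeOffset < readOffset + size;
}

/* Returns 1 when the copy would generate INVALID_VALUE per the rule above:
 * same buffer, read GPU present in the write mask, and overlapping ranges. */
static int same_buffer_copy_invalid(int sameBuffer, unsigned int readGpu,
                                    unsigned int writeGpuMask,
                                    GLintptr readOffset, GLintptr writeOffset,
                                    GLsizeiptr size)
{
    return sameBuffer
        && ((writeGpuMask >> readGpu) & 1u)
        && ranges_overlap(readOffset, writeOffset, size);
}
```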
    20.3 Multi-GPU Framebuffers and Textures

    All buffers in the default framebuffer, as well as renderbuffers, receive per-GPU storage.  By
    default, storage for textures is undefined: it may be per-GPU or GPU-shared and can transition
    between the types at any time.  Per-GPU storage can be specified via
    [Multi]Tex[ture]Parameter{if}[v] with PER_GPU_STORAGE_NV for the <pname> argument and TRUE for
    the value.  For this storage parameter to take effect, it must be specified after the texture
    object is created and before the texture contents are defined by TexImage*, TexStorage* or
    TextureStorage*.

    20.3.1 Copying Image Data Between GPUs

    To copy texel data between GPUs, the client may use the command:

    void MulticastCopyImageSubDataNV(
        uint srcGpu, bitfield dstGpuMask,
        uint srcName, enum srcTarget,
        int srcLevel,
        int srcX, int srcY, int srcZ,
        uint dstName, enum dstTarget,
        int dstLevel,
        int dstX, int dstY, int dstZ,
        sizei srcWidth, sizei srcHeight, sizei srcDepth);

    This command operates equivalently to CopyImageSubData, except that it takes a source GPU and
    a destination GPU set defined by <srcGpu> and <dstGpuMask> (respectively).  Texel data is
    copied from the source GPU to all destination GPUs.  The following errors apply to
    MulticastCopyImageSubDataNV:

    INVALID_ENUM is generated
     * if either <srcTarget> or <dstTarget>
      - is not RENDERBUFFER or a valid non-proxy texture target,
      - is TEXTURE_BUFFER, or
      - is one of the cubemap face selectors described in table 3.17,
     * if the target does not match the type of the object.

    INVALID_OPERATION is generated
     * if either object is a texture and the texture is not complete,
     * if the source and destination formats are not compatible,
     * if the source and destination numbers of samples do not match, or
     * if one image is compressed and the other is uncompressed and the
       block size of the compressed image is not equal to the texel size
       of the uncompressed image.

    INVALID_VALUE is generated
     * if <srcGpu> is greater than or equal to MULTICAST_GPUS_NV,
     * if <dstGpuMask> is zero,
     * if <dstGpuMask> is greater than or equal to 2^n, where n is equal to
       MULTICAST_GPUS_NV,
     * if either <srcName> or <dstName> does not correspond to a valid
       renderbuffer or texture object according to the corresponding
       target parameter,
     * if the specified level is not a valid level for the image,
     * if the dimensions of either subregion exceed the boundaries of the
       corresponding image object, or
     * if the image format is compressed and the dimensions of the
       subregion fail to meet the alignment constraints of the format.

    To copy pixel values from one GPU to another, use the following command:

    void MulticastBlitFramebufferNV(uint srcGpu, uint dstGpu,
                                    int srcX0, int srcY0, int srcX1, int srcY1,
                                    int dstX0, int dstY0, int dstX1, int dstY1,
                                    bitfield mask, enum filter);

    This command operates equivalently to BlitNamedFramebuffer except that it takes a source GPU
    and a destination GPU defined by <srcGpu> and <dstGpu> (respectively).  Pixel values are
    copied from the read framebuffer on the source GPU to the draw framebuffer on the destination
    GPU.

    In addition to the errors generated by BlitNamedFramebuffer (see listing starting on page
    634), calling MulticastBlitFramebufferNV will generate INVALID_VALUE if <srcGpu> or <dstGpu>
    is greater than or equal to MULTICAST_GPUS_NV.

    20.3.2 Per-GPU Sample Locations

    Programmable sample locations can be customized for each GPU and framebuffer using the
    following command:

    void MulticastFramebufferSampleLocationsfvNV(uint gpu, uint framebuffer, uint start,
                                                 sizei count, const float *v);

    An INVALID_OPERATION error is generated by MulticastFramebufferSampleLocationsfvNV if
    <framebuffer> is not the name of an existing framebuffer object.

    An INVALID_VALUE error is generated if the sum of <start> and <count> is greater than
    PROGRAMMABLE_SAMPLE_LOCATION_TABLE_SIZE_ARB.

    An INVALID_VALUE error is generated if <gpu> is greater than or equal to MULTICAST_GPUS_NV.

    This command is equivalent to FramebufferSampleLocationsfvARB except that it sets
    MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV at the appropriate offset for the specified GPU.
    Just as with FramebufferSampleLocationsfvARB, FRAMEBUFFER_PROGRAMMABLE_SAMPLE_LOCATIONS_ARB
    must be enabled for these sample locations to take effect.  FramebufferSampleLocationsfvARB
    and NamedFramebufferSampleLocationsfvARB also set MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV,
    but for the specified sample across all multicast GPUs.  If <gpu> is 0,
    MulticastFramebufferSampleLocationsfvNV updates PROGRAMMABLE_SAMPLE_LOCATION_ARB in addition
    to MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV.

    The programmed sample locations can be retrieved using GetMultisamplefv with <pname> set to
    MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV and indices calculated as follows:

        index_x = gpu * PROGRAMMABLE_SAMPLE_LOCATION_TABLE_SIZE_ARB + 2 * sample_i;
        index_y = gpu * PROGRAMMABLE_SAMPLE_LOCATION_TABLE_SIZE_ARB + 2 * sample_i + 1;

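The index arithmetic above, as a sketch.  TABLE_SIZE stands in for the implementation's PROGRAMMABLE_SAMPLE_LOCATION_TABLE_SIZE_ARB value, which a real application queries at run time; 32 here is an illustrative assumption.

```c
#include <assert.h>

/* Stand-in for the queried PROGRAMMABLE_SAMPLE_LOCATION_TABLE_SIZE_ARB. */
enum { TABLE_SIZE = 32 };

/* Index of the x coordinate for <sample_i> on <gpu>. */
static unsigned int location_index_x(unsigned int gpu, unsigned int sample_i)
{
    return gpu * TABLE_SIZE + 2 * sample_i;
}

/* Index of the y coordinate: the entry immediately after x. */
static unsigned int location_index_y(unsigned int gpu, unsigned int sample_i)
{
    return gpu * TABLE_SIZE + 2 * sample_i + 1;
}
```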
    20.4 Interactions with Other Copy Functions

    Many existing commands can be used to copy between resources with GPU-shared, per-GPU or
    undefined storage.  For example: ReadPixels, GetBufferSubData or TexImage2D with a pixel
    unpack buffer.  The following table defines how the storage of the resource influences the
    behavior of these copies.

    Table 20.1 Behavior of Copy Commands with Multi-GPU Storage

    Source     Destination Behavior
    ---------- ----------- -----------------------------------------------------------------------
    GPU-shared GPU-shared  There is just one source and one destination.  Copy from source to
                           destination.
    GPU-shared per-GPU     There is a single source.  Copy it to the destination on all GPUs.
    GPU-shared undefined   Either of the above behaviors for a GPU-shared source may apply.

    per-GPU    GPU-shared  Copy from the GPU with the lowest index set in RENDER_GPU_MASK_NV to
                           the shared destination.
    per-GPU    per-GPU     Implementations are encouraged to copy from source to destination
                           separately on each GPU.  This is not required.  If and when this is not
                           feasible, the copy should source from the GPU with the lowest index set
                           in RENDER_GPU_MASK_NV.
    per-GPU    undefined   Either of the above behaviors for a per-GPU source may apply.

    undefined  GPU-shared  Either of the above behaviors for a GPU-shared destination may apply.
    undefined  per-GPU     Either of the above behaviors for a per-GPU destination may apply.
    undefined  undefined   Any of the above behaviors may apply.

    20.5 Multi-GPU Synchronization

    MulticastCopyImageSubDataNV and MulticastCopyBufferSubDataNV each provide implicit
    synchronization with previous work on the source GPU.  MulticastBlitFramebufferNV is
    different, providing implicit synchronization with previous work on the destination GPU.
    In both cases, synchronization of the copies can be achieved with calls to the barrier
    command:

      void MulticastBarrierNV(void);

    This command blocks all GPUs until all previous commands have been completed by all GPUs
    and all writes have landed.  To guarantee consistency, synchronization must be placed between
    any two accesses by multiple GPUs to the same memory when at least one of the accesses is a
    write.  This includes accesses to both the source and the destination.  The safest approach is
    to call MulticastBarrierNV immediately before and after each copy that involves multiple GPUs.

    GPU writes to and reads from GPU-shared locations require synchronization as well.  GPU writes
    such as transform feedback, shader image store, CopyTexImage, and CopyBufferSubData are not
    automatically synchronized with writes by other GPUs.  Neither are GPU reads such as texture
    fetches, shader image loads, CopyTexImage, etc. synchronized with writes by other GPUs.
    Existing barriers such as TextureBarrier and MemoryBarrier only provide consistency guarantees
    for rendering, writes and reads on a single GPU.

    In some cases it may be desirable to have one or more GPUs wait for an operation to complete
    on another GPU without synchronizing all GPUs with MulticastBarrierNV.  This can be performed
    with the following command:

      void MulticastWaitSyncNV(uint signalGpu, bitfield waitGpuMask);

    INVALID_VALUE is generated
     * if <signalGpu> is greater than or equal to MULTICAST_GPUS_NV,
     * if <waitGpuMask> is zero,
     * if <waitGpuMask> is greater than or equal to 2^n, where n is equal to
       MULTICAST_GPUS_NV, or
     * if <signalGpu> is present in <waitGpuMask>.

    MulticastWaitSyncNV provides the same consistency guarantees as MulticastBarrierNV, but only
    between the GPUs specified by <signalGpu> and <waitGpuMask>, and in a single direction.  It
    forces the GPUs specified by <waitGpuMask> to wait until the GPU specified by <signalGpu> has
    completed all previous commands and writes associated with those commands.

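The argument checks listed above can be sketched as a single predicate.  gpu_count stands in for the queried MULTICAST_GPUS_NV value; the helper is illustrative, not part of the extension.

```c
#include <assert.h>

/* Returns 1 if MulticastWaitSyncNV(<signalGpu>, <waitGpuMask>) would not
 * generate INVALID_VALUE, given <gpu_count> multicast GPUs. */
static int wait_sync_args_valid(unsigned int signalGpu, unsigned int waitGpuMask,
                                unsigned int gpu_count)
{
    if (signalGpu >= gpu_count)            return 0;  /* invalid signal GPU index */
    if (waitGpuMask == 0)                  return 0;  /* empty wait set */
    if (waitGpuMask >= (1u << gpu_count))  return 0;  /* mask >= 2^n */
    if ((waitGpuMask >> signalGpu) & 1u)   return 0;  /* signal GPU cannot wait on itself */
    return 1;
}
```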
    20.6 Multi-GPU Queries

    Queries are performed across all multicast GPUs.  Each query object stores independent result
    values for each GPU.  The result value for a specific GPU can be queried using one of the
    following commands:

    void MulticastGetQueryObjectivNV(uint gpu, uint id, enum pname, int *params);
    void MulticastGetQueryObjectuivNV(uint gpu, uint id, enum pname, uint *params);
    void MulticastGetQueryObjecti64vNV(uint gpu, uint id, enum pname, int64 *params);
    void MulticastGetQueryObjectui64vNV(uint gpu, uint id, enum pname, uint64 *params);

    The behavior of these commands matches the equivalent GetQueryObject* commands, except that
    they return the result value for the specified GPU.  A query may be available on one GPU but
    not on another, so it may be necessary to check QUERY_RESULT_AVAILABLE for each GPU.
    GetQueryObject* returns query results and availability for GPU 0 only.

    In addition to the errors generated by GetQueryObject* (see the listing in section 4.2 on page
    49), calling MulticastGetQueryObject* will generate INVALID_VALUE if <gpu> is greater than or
    equal to MULTICAST_GPUS_NV.

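The per-GPU availability check described above looks like the following loop.  Because no GL context is assumed, the Multicast* entry point is a stub that returns canned availability data; a real application would call the loaded function and poll until every GPU reports its result available.

```c
#include <assert.h>

#define GL_QUERY_RESULT_AVAILABLE 0x8867   /* core GL token */

/* Canned per-GPU availability for the stub: GPU 0 done, GPU 1 not. */
static unsigned int fake_available[2] = { 1, 0 };

/* Stub: a real application would call the driver's entry point. */
static void glMulticastGetQueryObjectuivNV(unsigned int gpu, unsigned int id,
                                           unsigned int pname, unsigned int *params)
{
    (void)id;
    if (pname == GL_QUERY_RESULT_AVAILABLE)
        *params = fake_available[gpu];
}

/* Check each GPU individually: a result may be available on one GPU
 * before another. */
static unsigned int count_ready_gpus(unsigned int query, unsigned int gpu_count)
{
    unsigned int ready = 0;
    for (unsigned int gpu = 0; gpu < gpu_count; ++gpu) {
        unsigned int available = 0;
        glMulticastGetQueryObjectuivNV(gpu, query, GL_QUERY_RESULT_AVAILABLE, &available);
        ready += available;
    }
    return ready;
}
```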
Additions to Chapter 8 of the OpenGL 4.5 (Compatibility Profile) Specification
(Textures and Samplers)

    Modify Section 8.10 (Texture Parameters)

    Insert the following paragraph before Table 8.25 (Texture parameters and their values):

        If <pname> is PER_GPU_STORAGE_NV, then the state is stored in the texture, but only takes
        effect the next time storage is allocated for a texture using TexImage*, TexStorage* or
        TextureStorage*.  If the value of TEXTURE_IMMUTABLE_FORMAT is TRUE, then
        PER_GPU_STORAGE_NV cannot be changed and an error is generated.

    Additions to Table 8.26 Texture parameters and their values

    Name               Type    Legal values
    ------------------ ------- ------------
    PER_GPU_STORAGE_NV boolean TRUE, FALSE

Additions to Chapter 10 of the OpenGL 4.5 (Compatibility Profile) Specification
(Vertex Specification and Drawing Commands)

    Modify Section 10.9 (Conditional Rendering)

    Replace the following text:

        If the result (SAMPLES_PASSED) of the query is zero, or if the result (ANY_SAMPLES_PASSED
        or ANY_SAMPLES_PASSED_CONSERVATIVE) is FALSE, all rendering commands described in
        section 2.4 are discarded and have no effect when issued between BeginConditionalRender
        and the corresponding EndConditionalRender

    with this text:

        For each active render GPU, if the result (SAMPLES_PASSED) of the query on that GPU is
        zero, or if the result (ANY_SAMPLES_PASSED or ANY_SAMPLES_PASSED_CONSERVATIVE) is FALSE,
        all rendering commands described in section 2.4 are discarded by this GPU and have no
        effect when issued between BeginConditionalRender and the corresponding
        EndConditionalRender

    Similarly replace the following:

        If the result (SAMPLES_PASSED) of the query is non-zero, or if the result
        (ANY_SAMPLES_PASSED or ANY_SAMPLES_PASSED_CONSERVATIVE) is TRUE, such commands are not
        discarded.

    with this:

        For each active render GPU, if the result (SAMPLES_PASSED) of the query on that GPU is
        non-zero, or if the result (ANY_SAMPLES_PASSED or ANY_SAMPLES_PASSED_CONSERVATIVE) is
        TRUE, such commands are not discarded.

    Finally, replace all instances of "the GL" with "each active render GPU".

Additions to Chapter 14 of the OpenGL 4.5 (Compatibility Profile) Specification
(Fixed-Function Primitive Assembly and Rasterization)

    Modify Section 14.3.1 (Multisampling)

    Replace the following text:

        The location for sample <i> is taken from v[2*(i-start)] and v[2*(i-start)+1].

    with the following:

        These commands set the sample locations for all multicast GPUs in
        MULTICAST_FRAMEBUFFER_PROGRAMMABLE_SAMPLE_LOCATIONS_NV.  The location for sample <i> on
        GPU <g> is taken from v[g*N+2*(i-start)] and v[g*N+2*(i-start)+1].

    Replace the following error generated by GetMultisamplefv:

        An INVALID_ENUM error is generated if <pname> is not SAMPLE_LOCATION_ARB or
        PROGRAMMABLE_SAMPLE_LOCATION_ARB.

    with the following:

        An INVALID_ENUM error is generated if <pname> is not SAMPLE_LOCATION_ARB,
        PROGRAMMABLE_SAMPLE_LOCATION_ARB or MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV.

    Add the following to the list of errors generated by GetMultisamplefv:

        An INVALID_VALUE error is generated if <pname> is
        MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV and <index> is greater than or equal to the
        value of PROGRAMMABLE_SAMPLE_LOCATION_TABLE_SIZE_ARB multiplied by the value of
        MULTICAST_GPUS_NV.

    Replace the following pseudocode (in both locations):

        float *table = FRAMEBUFFER_PROGRAMMABLE_SAMPLE_LOCATIONS_ARB;
        sample_location.xy = (table[2*sample_i], table[2*sample_i+1]);

    with the following:

        float *table = MULTICAST_FRAMEBUFFER_PROGRAMMABLE_SAMPLE_LOCATIONS_NV;
        table += PROGRAMMABLE_SAMPLE_LOCATION_TABLE_SIZE_ARB * gpu;
        sample_location.xy = (table[2*sample_i], table[2*sample_i+1]);

Additions to the WGL/GLX/EGL/AGL Specifications

    None

Dependencies on ARB_sample_locations

    If ARB_sample_locations is not supported, section 20.3.2 and any references to
    MulticastFramebufferSampleLocationsfvNV and MULTICAST_PROGRAMMABLE_SAMPLE_LOCATION_NV should
    be removed.  The modifications to Section 14.3.1 (Multisampling) should also be removed.

Dependencies on ARB_sparse_buffer

    If ARB_sparse_buffer is not supported, any reference to SPARSE_STORAGE_BIT_ARB should be
    removed.

Interactions with EXT_bindable_uniform

    When using the functionality of EXT_bindable_uniform and a per-GPU storage buffer is bound
    to a bindable location in a program object, client uniform updates apply to all GPUs.

    An INVALID_OPERATION error is generated if a buffer with PER_GPU_STORAGE_BIT_NV is bound to a
    program object's bindable location and GetUniformfv, GetUniformiv, GetUniformuiv or
    GetUniformdv is called.

Errors

    Relaxation of INVALID_ENUM errors
    ---------------------------------
    GetBooleanv, GetIntegerv, GetInteger64v, GetFloatv, and GetDoublev now accept new tokens as
    described in the "New Tokens" section.

608New State
609
610    Additions to Table 23.4 Rasterization
611                                                   Initial
612    Get Value                   Type  Get Command Value  Description               Sec.  Attribute
613    -------------------------- ------ ----------- -----  -----------------------   ----  ---------
614    RENDER_GPU_MASK_NV           Z+   GetIntegerv   *    Mask of GPUs that have    20.1     -
615                                                           writes enabled
616    * See section 20.1
617
618    Additions to Table 23.19 Textures (state per texture object)
619
620                                                    Initial
621    Get Value                Type   Get Command      Value    Description                  Sec.
622    ---------                ----   -----------      -------  -----------                  ----
623    PER_GPU_STORAGE_NV       B      GetTexParameter  FALSE    Per-GPU storage requested    20.3
624
625
    Additions to Table 23.30 Framebuffer (state per framebuffer object)

    Get Value                Get Command      Type Initial Value    Description          Sec.    Attribute
    ---------                -----------      ---- -------------    -----------          ----    ---------
    MULTICAST_PROGRAMMABLE_- GetMultisamplefv  *    (0.5,0.5)       Programmable sample  20.3.2      -
        SAMPLE_LOCATION_NV                                            locations

    * The type here is "2* x n x 2 x R[0,1]", which is equivalent to PROGRAMMABLE_SAMPLE_LOCATION_ARB
    but with sample locations for all multicast GPUs (one after the other).

New Implementation Dependent State

    Add to Table 23.82, Implementation-Dependent Values, p. 784

                                                     Minimum
    Get Value                     Type   Get Command  Value  Description               Sec.  Attribute
    ---------------------------- ------ ------------- -----  ----------------------    ----  ---------
    MULTICAST_GPUS_NV              Z+    GetIntegerv    1    Number of linked GPUs     20.0     -
                                                             usable for multicast

Backwards Compatibility

    This extension replaces NVX_linked_gpu_multicast.  The enumerant values for MULTICAST_GPUS_NV
    and PER_GPU_STORAGE_BIT_NV match those of MAX_LGPU_GPUS_NVX and LGPU_SEPARATE_STORAGE_BIT_NVX
    (respectively).  MulticastBufferSubDataNV, MulticastCopyImageSubDataNV and MulticastBarrierNV
    behave analogously to LGPUNamedBufferSubDataNVX, LGPUCopyImageSubDataNVX and LGPUInterlockNVX
    (respectively).

Sample Code

    Binocular stereo rendering example using NV_gpu_multicast with single GPU fallback:

    struct ViewData {
        GLint viewport_index;
        GLfloat mvp[16];
        GLfloat modelview[16];
    };
    ViewData leftViewData = { 0, {...}, {...} };
    ViewData rightViewData = { 1, {...}, {...} };

    GLuint ubo[2];
    glCreateBuffers(2, &ubo[0]);

    if (has_NV_gpu_multicast) {
        glNamedBufferStorage(ubo[0], size, NULL, GL_PER_GPU_STORAGE_BIT_NV | GL_DYNAMIC_STORAGE_BIT);
        glMulticastBufferSubDataNV(0x1, ubo[0], 0, size, &leftViewData);
        glMulticastBufferSubDataNV(0x2, ubo[0], 0, size, &rightViewData);
    } else {
        glNamedBufferStorage(ubo[0], size, &leftViewData, 0);
        glNamedBufferStorage(ubo[1], size, &rightViewData, 0);
    }

    glViewportIndexedf(0, 0, 0, 640, 480);  // left viewport
    glViewportIndexedf(1, 640, 0, 640, 480);  // right viewport
    // Vertex shader sets gl_ViewportIndex according to viewport_index in UBO

    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

    if (has_NV_gpu_multicast) {
        glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo[0]);
        drawScene();
        // Make GPU 1 wait for glClear above to complete on GPU 0
        glMulticastWaitSyncNV(0, 0x2);
        // Copy right viewport from GPU 1 to GPU 0
        glMulticastCopyImageSubDataNV(1, 0x1,
                                      renderBuffer, GL_RENDERBUFFER, 0, 640, 0, 0,
                                      renderBuffer, GL_RENDERBUFFER, 0, 640, 0, 0,
                                      640, 480, 1);
        // Make GPU 0 wait for GPU 1 copy to GPU 0
        glMulticastWaitSyncNV(1, 0x1);
    } else {
        glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo[0]);
        drawScene();
        glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo[1]);
        drawScene();
    }
    // Both viewports are now present in GPU 0's renderbuffer

Issues

  (1) Should we provide an explicit inter-GPU synchronization API?  Will this make the
    implementation easier or harder for the driver and applications?

    RESOLVED. Yes. A naive implementation of implicit synchronization would simply synchronize the
    GPUs before and after each copy.  Smart implicit synchronization would have to track all APIs
    that can modify buffers and textures, creating an excessive burden for driver implementation
    and maintenance.  An application can track dependencies more easily and outperform a naive
    driver implementation using explicit synchronization.

  (2) How does this extension interact with queries (e.g. occlusion queries)?

    RESOLVED. Queries are performed separately on each GPU. The standard GetQueryObject* APIs
    return query results for GPU 0 only. However, GetQueryBufferObject* can be used to retrieve
    query results for all GPUs through a buffer with separate storage (PER_GPU_STORAGE_BIT_NV).

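    To sketch this approach (hypothetical application code; drawScene and the fixed-size
    results array are placeholders, and the barriers are a conservative choice for
    cross-GPU visibility):

        GLint  numGpus = 1;
        GLuint query, queryBuf, gatherBuf;
        GLuint results[8];  // illustrative upper bound on GPU count

        glGetIntegerv(GL_MULTICAST_GPUS_NV, &numGpus);
        glGenQueries(1, &query);
        glCreateBuffers(1, &queryBuf);
        glCreateBuffers(1, &gatherBuf);
        glNamedBufferStorage(queryBuf, sizeof(GLuint), NULL, GL_PER_GPU_STORAGE_BIT_NV);
        glNamedBufferStorage(gatherBuf, numGpus * sizeof(GLuint), NULL, GL_PER_GPU_STORAGE_BIT_NV);

        glBeginQuery(GL_SAMPLES_PASSED, query);
        drawScene();
        glEndQuery(GL_SAMPLES_PASSED);

        // Each GPU writes its own result into its instance of queryBuf.
        glGetQueryBufferObjectuiv(query, queryBuf, GL_QUERY_RESULT, 0);
        glMulticastBarrierNV();

        // Gather every GPU's result into GPU 0's instance of gatherBuf.
        for (GLint gpu = 0; gpu < numGpus; ++gpu) {
            glMulticastCopyBufferSubDataNV(gpu, 0x1, queryBuf, gatherBuf,
                                           0, gpu * sizeof(GLuint), sizeof(GLuint));
        }
        glMulticastBarrierNV();

        // Standard buffer reads return GPU 0's instance.
        glGetNamedBufferSubData(gatherBuf, 0, numGpus * sizeof(GLuint), results);
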
  (3) Are copy operations controlled by the render mask?

    RESOLVED. Copies which write to the framebuffer are considered render commands and implicitly
    controlled by the render mask.  Copies between textures and buffers are not considered render
    commands so they are not influenced by the mask.  If masked copies are desired, use
    MulticastCopyImageSubDataNV, MulticastCopyBufferSubDataNV or MulticastBlitFramebufferNV.
    These commands explicitly specify the GPU source and destination and are not influenced by the
    render mask.

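    For example (a sketch; tex, width and height are application-provided):

        glRenderGpuMaskNV(0x1);                  // subsequent rendering on GPU 0 only
        drawScene();
        glRenderGpuMaskNV(0x3);                  // re-enable both GPUs

        // Explicit copy from GPU 0 to GPU 1; unaffected by the render mask.
        glMulticastCopyImageSubDataNV(0, 0x2,
                                      tex, GL_TEXTURE_2D, 0, 0, 0, 0,
                                      tex, GL_TEXTURE_2D, 0, 0, 0, 0,
                                      width, height, 1);
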
  (4) What happens if the MulticastCopyBufferSubDataNV source and destination buffer is the same?

    RESOLVED.  When the source and destination involve the same GPU, MulticastCopyBufferSubDataNV
    matches the behavior of CopyBufferSubData: overlapped copies are not allowed and an
    INVALID_VALUE error results.  When the source and destination do not involve the same GPU,
    overlapping copies are allowed and no error is generated.

  (5) How does this extension interact with CopyTexImage2D?

    RESOLVED.  The behavior depends on the storage type of the target.  See section 20.4.  Since
    CopyTexImage* sources from the framebuffer, the source always has per-GPU storage.

  (6) Should we provide a mechanism to modify viewports independently for each GPU?

    RESOLVED. No. This can be achieved using multicast UBOs and ARB_shader_viewport_layer_array.

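    A vertex shader along these lines can select a per-GPU viewport from a multicast UBO
    (a sketch; it mirrors the ViewData struct from the Sample Code section, ignoring
    std140 padding concerns for brevity):

        #version 450
        #extension GL_ARB_shader_viewport_layer_array : require

        layout(location = 0) in vec3 position;

        layout(binding = 0, std140) uniform ViewData {
            int  viewport_index;   // differs per GPU via the multicast UBO
            mat4 mvp;
            mat4 modelview;
        };

        void main()
        {
            gl_ViewportIndex = viewport_index;
            gl_Position = mvp * vec4(position, 1.0);
        }
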
  (7) Should we add a present API that automatically displays content from a specific GPU? It
    could abstract the transport mechanism, copying when necessary.

    RESOLVED. No. Transfers should be avoided to maximize performance and minimize latency.
    Minimizing transfers requires application awareness of display connectivity to assign
    rendering appropriately.  Hiding transfers behind an API would also prevent some interesting
    multi-GPU rendering techniques (e.g. checkerboard-style split rendering).

    WGL_NV_bridged_display can be used to enable display from multiple GPUs without copies.

  (8) Should we expose the extension on single-GPU configurations?

    RESOLVED.  Yes, this is recommended.  It allows more code sharing between multi-GPU and
    single-GPU code paths.  If there is only one GPU present MULTICAST_GPUS_NV will be 1.  It
    may also be 1 if explicit GPU control is unavailable (e.g. if the active multi-GPU rendering
    mode prevents it).  Note that in revisions 5 and prior of this extension the minimum for
    MULTICAST_GPUS_NV was 2.

  (9) Should glGet*BufferParameter* return the PER_GPU_STORAGE_BIT_NV bit when
    BUFFER_STORAGE_FLAGS is queried?

    RESOLVED. Yes. BUFFER_STORAGE_FLAGS must match the flags parameter input to *BufferStorage, as
    specified in table 6.3.

  (10) Can a query be complete/available on one GPU and not another?

    RESOLVED. Yes. Independent query completion is important for conditional rendering.  It
    allows each GPU to begin conditional rendering in mode QUERY_WAIT without waiting on other
    GPUs.

  (11) How can custom texel data be uploaded to each GPU for a given texture?

    RESOLVED. The easiest way is to create staging textures with the custom texel data and then
    copy it to a texture with per-GPU storage using MulticastCopyImageSubDataNV.

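    A sketch of this approach for two GPUs (w, h and the texel pointers are
    application-provided; since ordinary uploads replicate the staging data to every GPU,
    GPU 0 can serve as the copy source in both cases):

        GLuint staging[2], tex;
        glCreateTextures(GL_TEXTURE_2D, 2, staging);
        glCreateTextures(GL_TEXTURE_2D, 1, &tex);
        glTextureStorage2D(staging[0], 1, GL_RGBA8, w, h);
        glTextureStorage2D(staging[1], 1, GL_RGBA8, w, h);

        // Request per-GPU storage before allocating the destination's storage.
        glTextureParameteri(tex, GL_PER_GPU_STORAGE_NV, GL_TRUE);
        glTextureStorage2D(tex, 1, GL_RGBA8, w, h);

        glTextureSubImage2D(staging[0], 0, 0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, texelsGpu0);
        glTextureSubImage2D(staging[1], 0, 0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, texelsGpu1);

        // Copy a different staging image into each GPU's instance of tex.
        glMulticastCopyImageSubDataNV(0, 0x1, staging[0], GL_TEXTURE_2D, 0, 0, 0, 0,
                                      tex, GL_TEXTURE_2D, 0, 0, 0, 0, w, h, 1);
        glMulticastCopyImageSubDataNV(0, 0x2, staging[1], GL_TEXTURE_2D, 0, 0, 0, 0,
                                      tex, GL_TEXTURE_2D, 0, 0, 0, 0, w, h, 1);
        glMulticastBarrierNV();
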
  (12) Should we allow the waitGpuMask in MulticastWaitSyncNV to include the signal GPU?

    RESOLVED. No. There is no reason for a GPU to wait on itself.  This is effectively a no-op in
    the command stream.  Furthermore it is easy to confuse GPU indices and masks, so it is
    beneficial to explicitly generate an error in this case.

  (13) Will support for NVX_linked_gpu_multicast continue?

    RESOLVED. NVX_linked_gpu_multicast is deprecated and applications should switch to
    NV_gpu_multicast.  However, implementations are encouraged to continue supporting
    NVX_linked_gpu_multicast for backwards compatibility.

  (14) Does RenderGpuMaskNV work with immediate mode rendering?

    RESOLVED. Yes, the render GPU mask applies to immediate mode rendering the same as other
    rendering.  Note that RenderGpuMaskNV is not one of the commands allowed between Begin and End
    (see section 10.7.5) so the render mask must be set before Begin is called.

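    For example (a sketch):

        glRenderGpuMaskNV(0x1);      // set before Begin; renders on GPU 0 only
        glBegin(GL_TRIANGLES);
        glVertex2f(-1.0f, -1.0f);
        glVertex2f( 1.0f, -1.0f);
        glVertex2f( 0.0f,  1.0f);
        glEnd();
        glRenderGpuMaskNV(0x3);      // restore rendering to both GPUs
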
Revision History

    Rev.    Date    Author    Changes
    ----  --------  --------  -----------------------------------------------
     7    04/02/19  jschnarr  clarify that the interactions with uniform APIs only apply to
                              EXT_bindable_uniform (not ARB_uniform_buffer_object).
                              optionally allow MulticastCopyBufferSubDataNV with buffers lacking
                              per-GPU storage
     6    01/03/19  jschnarr  reduce MULTICAST_GPUS_NV minimum to 1
                              clarify that MULTICAST_GPUS_NV is constant for a context
     5    10/07/16  jschnarr  trivial typo fix
     4    07/21/16  mjk       registered
     3    06/15/16  jschnarr  R370 release
