Name

    ARB_compute_shader

Name Strings

    GL_ARB_compute_shader

Contact

    Graham Sellers, AMD (graham.sellers 'at' amd.com)

Contributors

    Pat Brown, NVIDIA
    Daniel Koch, TransGaming
    John Kessenich
    Members of the ARB working group

Notice

    Copyright (c) 2012-2014 The Khronos Group Inc. Copyright terms at
        http://www.khronos.org/registry/speccopyright.html

Specification Update Policy

    Khronos-approved extension specifications are updated in response to
    issues and bugs prioritized by the Khronos OpenGL Working Group. For
    extensions which have been promoted to a core Specification, fixes will
    first appear in the latest version of that core Specification, and will
    eventually be backported to the extension document. This policy is
    described in more detail at
        https://www.khronos.org/registry/OpenGL/docs/update_policy.php

Status

    Complete.
    Approved by the ARB on 2012/06/12.

Version

    Last Modified Date: December 10, 2018
    Revision: 28

Number

    ARB Extension #122

Dependencies

    OpenGL 4.2 is required.

    This extension is written based on the wording of the OpenGL 4.2 (Core
    Profile) specification, and on the wording of the OpenGL Shading Language
    (GLSL) Specification, version 4.20.

    This extension interacts with OpenGL 4.3 and
    ARB_shader_storage_buffer_object.

    This extension interacts with NV_vertex_buffer_unified_memory.

Overview

    Recent graphics hardware has become extremely powerful, and a strong
    desire has emerged to harness this power for work (both graphics and
    non-graphics) that does not fit the traditional graphics pipeline well.
    To address this, this extension adds a new single-stage program type
    known as a compute program. This program may contain one or more compute
    shaders which may be launched in a manner that is essentially stateless.
    This allows arbitrary workloads to be sent to the graphics hardware with
    minimal disturbance to the GL state machine.

    In most respects, a compute program is identical to a traditional OpenGL
    program object, with similar status, uniforms, and other such properties.
    It has access to many of the same resources as fragment and other shader
    types, such as textures, image variables, atomic counters, and so on.
    However, it has no predefined inputs nor any fixed-function outputs. It
    cannot be part of a pipeline and its visible side effects are through its
    actions on images and atomic counters.

    OpenCL is another solution for using graphics processors as generalized
    compute devices. This extension addresses a different need. For example,
    OpenCL is designed to be usable on a wide range of devices ranging from
    CPUs, GPUs, and DSPs through to FPGAs. While one could implement GL on these
    types of devices, the target here is clearly GPUs. Another difference is
    that OpenCL is more full featured and includes features such as multiple
    devices, asynchronous queues and strict IEEE semantics for floating point
    operations. This extension follows the semantics of OpenGL: implicitly
    synchronous, in-order operation with a single-device, single-queue
    logical architecture and somewhat more relaxed numerical precision
    requirements. Although not as feature rich, this extension offers several
    advantages for applications that can tolerate the omission of these
    features. For example, compute shaders are written in GLSL, so code may
    be shared between compute and other shader types. Objects are created and
    owned by the same context as the rest of the GL, and therefore no
    interoperability API is required and objects may be freely used by both
    compute and graphics simultaneously without acquire-release semantics or
    object type translation.

New Procedures and Functions

        void DispatchCompute(uint num_groups_x,
                             uint num_groups_y,
                             uint num_groups_z);

        void DispatchComputeIndirect(intptr indirect);

New Tokens

    Accepted by the <type> parameter of CreateShader and returned in the
    <params> parameter by GetShaderiv:

        COMPUTE_SHADER                                  0x91B9

    Accepted by the <pname> parameter of GetIntegerv, GetBooleanv, GetFloatv,
    GetDoublev and GetInteger64v:

        MAX_COMPUTE_UNIFORM_BLOCKS                      0x91BB
        MAX_COMPUTE_TEXTURE_IMAGE_UNITS                 0x91BC
        MAX_COMPUTE_IMAGE_UNIFORMS                      0x91BD
        MAX_COMPUTE_SHARED_MEMORY_SIZE                  0x8262
        MAX_COMPUTE_UNIFORM_COMPONENTS                  0x8263
        MAX_COMPUTE_ATOMIC_COUNTER_BUFFERS              0x8264
        MAX_COMPUTE_ATOMIC_COUNTERS                     0x8265
        MAX_COMBINED_COMPUTE_UNIFORM_COMPONENTS         0x8266
        MAX_COMPUTE_WORK_GROUP_INVOCATIONS              0x90EB

    Accepted by the <pname> parameter of GetIntegeri_v, GetBooleani_v,
    GetFloati_v, GetDoublei_v and GetInteger64i_v:

        MAX_COMPUTE_WORK_GROUP_COUNT                    0x91BE
        MAX_COMPUTE_WORK_GROUP_SIZE                     0x91BF

    Accepted by the <pname> parameter of GetProgramiv:

        COMPUTE_WORK_GROUP_SIZE                         0x8267

    Accepted by the <pname> parameter of GetActiveUniformBlockiv:

        UNIFORM_BLOCK_REFERENCED_BY_COMPUTE_SHADER      0x90EC

    Accepted by the <pname> parameter of GetActiveAtomicCounterBufferiv:

        ATOMIC_COUNTER_BUFFER_REFERENCED_BY_COMPUTE_SHADER  0x90ED

    Accepted by the <target> parameters of BindBuffer, BufferData,
    BufferSubData, MapBuffer, UnmapBuffer, GetBufferSubData, and
    GetBufferPointerv:

        DISPATCH_INDIRECT_BUFFER                        0x90EE

    Accepted by the <pname> parameter of GetIntegerv, GetBooleanv,
    GetInteger64v, GetFloatv, and GetDoublev:

        DISPATCH_INDIRECT_BUFFER_BINDING                0x90EF

    Accepted by the <stages> parameter of UseProgramStages:

        COMPUTE_SHADER_BIT                              0x00000020

Additions to Chapter 2 of the OpenGL 4.2 (Core Profile) Specification
(OpenGL Operation)

    In section 2.9.1, "Creating and Binding Buffer Objects", add to table 2.8
    (p.43):

                                                                  Described
      Target name                 Purpose                     in section(s)
      -----------------------     -------------------------  ---------------
      DISPATCH_INDIRECT_BUFFER    Indirect compute dispatch       5.5
                                  commands

    Add to the end of section 2.9.8, "Indirect Commands In Buffer Objects"
    (p. 53):

    Arguments to the DispatchComputeIndirect command are stored in buffer
    objects as a group of three unsigned integers.

    A buffer object is bound to DISPATCH_INDIRECT_BUFFER by calling BindBuffer
    with target set to DISPATCH_INDIRECT_BUFFER, and buffer set to the name of
    the buffer object. If no corresponding buffer object exists, one is
    initialized as defined in section 2.9.

    DispatchComputeIndirect sources its arguments from the buffer object whose
    name is bound to DISPATCH_INDIRECT_BUFFER, using the <indirect> parameter as
    an offset into the buffer object in the same fashion as described in
    section 2.9.6. An INVALID_VALUE error is generated if <indirect> is less
    than zero or not a multiple of the size, in basic machine units, of uint.
    An INVALID_OPERATION error is generated if this command sources data
    beyond the end of the buffer object or if zero is bound to
    DISPATCH_INDIRECT_BUFFER.
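
    As an illustration only, the following C sketch models the argument
    layout described above: three tightly packed 32-bit unsigned integers
    read from byte offset <indirect>. DispatchIndirectCommand and
    read_dispatch_command are hypothetical names, not GL API, and the
    function works on a CPU-side copy of the buffer contents rather than on
    a buffer object. It assumes uint is 32 bits wide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the three uint arguments sourced by
 * DispatchComputeIndirect (assumes uint is 32 bits wide). */
typedef struct {
    uint32_t num_groups_x;
    uint32_t num_groups_y;
    uint32_t num_groups_z;
} DispatchIndirectCommand;

/* Copies the command stored at byte offset `indirect` out of a CPU-side copy
 * of the buffer. Returns false, without reading, if the offset is negative,
 * is not a multiple of sizeof(uint32_t), or if the 12-byte command would
 * extend past the end of the data. */
static bool read_dispatch_command(const unsigned char *data, size_t size,
                                  ptrdiff_t indirect,
                                  DispatchIndirectCommand *out)
{
    assert(out != NULL);
    if (indirect < 0 || indirect % (ptrdiff_t)sizeof(uint32_t) != 0)
        return false;
    if ((size_t)indirect > size || size - (size_t)indirect < sizeof *out)
        return false;
    memcpy(out, data + indirect, sizeof *out);
    return true;
}
```

    On the API side, an application would typically place such a struct in a
    buffer object bound to DISPATCH_INDIRECT_BUFFER, for example with
    BufferData, before calling DispatchComputeIndirect.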

    In section 2.11, "Vertex Shaders", modify the introductory text on shaders
    to include compute shaders (second paragraph, p. 56):

    In addition to vertex shaders, tessellation control..., geometry shaders,
    fragment shaders, and compute shaders can be created, compiled, and linked
    into program objects.  ....  (section 3.10).  Compute shaders perform
    general computations for dispatched arrays of shader invocations (section
    5.5), but do not operate on primitives processed by the other shader
    types. ...

    In section 2.11.3, "Program Objects", add to the reasons that LinkProgram
    may fail, p. 61:

        * The program object contains objects to form a compute shader (see
          section 5.5) and objects to form any other type of shader.
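
    The link rule above can be modelled, as a hypothetical sketch, by
    looking at which stage bits a program would cover. The bit values are
    those accepted by UseProgramStages (COMPUTE_SHADER_BIT is listed under
    New Tokens; the others are defined by OpenGL 4.2). stage_mix_links is an
    illustrative helper, not part of the GL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stage bit values as accepted by UseProgramStages. */
enum {
    VERTEX_SHADER_BIT          = 0x00000001,
    FRAGMENT_SHADER_BIT        = 0x00000002,
    GEOMETRY_SHADER_BIT        = 0x00000004,
    TESS_CONTROL_SHADER_BIT    = 0x00000008,
    TESS_EVALUATION_SHADER_BIT = 0x00000010,
    COMPUTE_SHADER_BIT         = 0x00000020
};

/* Hypothetical check: returns false when the stages present in a program
 * mix compute with any other stage, which makes LinkProgram fail. */
static bool stage_mix_links(uint32_t stages)
{
    bool has_compute = (stages & COMPUTE_SHADER_BIT) != 0;
    bool has_other = (stages & ~(uint32_t)COMPUTE_SHADER_BIT) != 0;
    return !(has_compute && has_other);
}
```

    A compute-only program and a conventional vertex plus fragment program
    both pass this check; a program combining compute and fragment objects
    does not.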

    In section 2.11.3, modify the description of active programs (last
    paragraph, p. 61, first paragraph, p. 62):

    ... geometry shader stages, those stages are ignored.  If there is no
    active program for the compute shader stage, compute dispatches will
    generate an error.  The active program for the compute shader stage has no
    effect on the processing of vertices, geometric primitives, and fragments,
    and the active program for all other shader stages has no effect on
    compute dispatches.

    In section 2.11.4, "Program Pipeline Objects", modify the description of
    UseProgramStages, p. 65:

    The executables in a program object... becomes current.  These stages may
    include vertex, tessellation control, tessellation evaluation, geometry,
    fragment, or compute, indicated by VERTEX_SHADER_BIT,
    TESS_CONTROL_SHADER_BIT, TESS_EVALUATION_SHADER_BIT, GEOMETRY_SHADER_BIT,
    FRAGMENT_SHADER_BIT, or COMPUTE_SHADER_BIT, respectively. ...

    In the unnumbered "Validation" section of section 2.11.12 "Shader
    Execution", modify the list of validation errors, pp. 112-113:

    This error is generated by any command that transfers vertices to the GL
    or launches compute work if:

      * (last bullet, p. 112) One program object is active... first program
        object was active.  The active compute shader is ignored for the
        purposes of this test.

      * (2nd bullet, p. 113) There is no current program specified by
        UseProgram, there is a current program pipeline object, and the
        current program for any shader stage has been relinked since...

      * (3rd bullet, p. 113) Any two active samplers in the set of active
        program objects are of different types but refer to the same texture
        image unit.

      * (4th bullet, p. 113) The sum of the number of active samplers for each
        active program exceeds the maximum number of texture image units
        allowed.

    Modify the paragraph describing ValidateProgram, p. 113:

    ... If validation succeeded, ... set to FALSE.  If validation succeeded,
    no INVALID_OPERATION validation error will be generated if <program> were
    made current via UseProgram, given the current state.  If validation
    failed, such errors will be generated under the current state.

    Modify the paragraph describing ValidateProgramPipeline, p. 114:

    ... can be queried with GetProgramPipelineiv (see section 6.1.12).  If
    validation succeeded, no INVALID_OPERATION validation error will be
    generated if <pipeline> were bound and no program were made current via
    UseProgram, given the current state.  If validation failed, such errors
    will be generated under the current state.

    In subsection 2.11.12, "Shader Execution":

        Add to the list of implementation dependent constants under the
    "Texture Access" sub-heading:

        MAX_COMPUTE_TEXTURE_IMAGE_UNITS (for compute shaders),

        Add to the list of implementation dependent constants under the "Atomic
    Counter Access" sub-heading:

        MAX_COMPUTE_ATOMIC_COUNTERS (for compute shaders),

        Add to the list of implementation dependent constants under the "Image
    Access" sub-heading:

        MAX_COMPUTE_IMAGE_UNIFORMS (for compute shaders),

    In section 2.16, "Conditional Rendering", modify the sentence describing
    conditional rendering, starting with "In this case"...

    In this case, all drawing commands (see section 2.8.3), as well as
    Clear and ClearBuffer* (see section 4.2.3), and compute dispatch
    through DispatchCompute* (see section 5.5), have no effect.

    In the "Shared Memory Access Synchronization" subsection of section
    2.11.13, "Shader Memory Access", modify the description of
    COMMAND_BARRIER_BIT (p. 118):

      * COMMAND_BARRIER_BIT:  Command data sourced from buffer objects by
        Draw*Indirect and DispatchComputeIndirect commands ... The buffer
        objects affected by this bit are derived from the DRAW_INDIRECT_BUFFER
        and DISPATCH_INDIRECT_BUFFER bindings.

    In subsection 2.17.7, "Uniform Variables", replace the paragraph beginning
    "If <pname> is UNIFORM_BLOCK_REFERENCED_BY_VERTEX_SHADER,"... with:

        If <pname> is UNIFORM_BLOCK_REFERENCED_BY_VERTEX_SHADER,
    UNIFORM_BLOCK_REFERENCED_BY_TESS_CONTROL_SHADER,
    UNIFORM_BLOCK_REFERENCED_BY_TESS_EVALUATION_SHADER,
    UNIFORM_BLOCK_REFERENCED_BY_GEOMETRY_SHADER,
    UNIFORM_BLOCK_REFERENCED_BY_FRAGMENT_SHADER or
    UNIFORM_BLOCK_REFERENCED_BY_COMPUTE_SHADER, then a boolean value indicating
    whether the uniform block identified by uniformBlockIndex is referenced
    by the vertex, tessellation control, tessellation evaluation, geometry,
    fragment or compute programming stages of <program>, respectively, is
    returned.

    Also in subsection 2.17.7, "Uniform Variables", replace the paragraph
    beginning, "If <pname> is ATOMIC_COUNTER_BUFFER_REFERENCED_BY_VERTEX_SHADER"
    on p.80 with:

        If <pname> is ATOMIC_COUNTER_BUFFER_REFERENCED_BY_VERTEX_SHADER,
    ATOMIC_COUNTER_BUFFER_REFERENCED_BY_TESS_CONTROL_SHADER,
    ATOMIC_COUNTER_BUFFER_REFERENCED_BY_TESS_EVALUATION_SHADER,
    ATOMIC_COUNTER_BUFFER_REFERENCED_BY_GEOMETRY_SHADER,
    ATOMIC_COUNTER_BUFFER_REFERENCED_BY_FRAGMENT_SHADER or
    ATOMIC_COUNTER_BUFFER_REFERENCED_BY_COMPUTE_SHADER, then a single boolean
    value indicating whether the atomic counter buffer identified by
    bufferIndex is referenced by the vertex, tessellation control, tessellation
    evaluation, geometry, fragment or compute programming stages of
    <program>, respectively, is returned.

    Under the sub-heading "Uniform Blocks" in subsection 2.11.17, replace the
    sentence beginning "The limits for vertex, tessellation ..." on p.92
    with:

        The limits for vertex, tessellation, geometry, fragment and compute
    shaders can be obtained by calling GetIntegerv with <pname> set to
    MAX_VERTEX_UNIFORM_BLOCKS, MAX_TESS_CONTROL_UNIFORM_BLOCKS,
    MAX_TESS_EVALUATION_UNIFORM_BLOCKS, MAX_GEOMETRY_UNIFORM_BLOCKS,
    MAX_FRAGMENT_UNIFORM_BLOCKS and MAX_COMPUTE_UNIFORM_BLOCKS, respectively.

    Under the sub-heading "Atomic Counter Buffers" in subsection 2.11.17,
    replace the sentence beginning "The limits for vertex, geometry, ..."
    on p.96 with:

        The limits for vertex, tessellation, geometry, fragment and compute
    shaders can be obtained by calling GetIntegerv with <pname> set to
    MAX_VERTEX_ATOMIC_COUNTER_BUFFERS, MAX_TESS_CONTROL_ATOMIC_COUNTER_BUFFERS,
    MAX_TESS_EVALUATION_ATOMIC_COUNTER_BUFFERS,
    MAX_GEOMETRY_ATOMIC_COUNTER_BUFFERS, MAX_FRAGMENT_ATOMIC_COUNTER_BUFFERS and
    MAX_COMPUTE_ATOMIC_COUNTER_BUFFERS, respectively.

Additions to Chapter 3 of the OpenGL 4.2 (Core Profile) Specification
(Rasterization)

    None.

Additions to Chapter 4 of the OpenGL 4.2 (Core Profile) Specification
(Per-Fragment Operations and the Framebuffer)

    None.

Additions to Chapter 5 of the OpenGL 4.2 (Core Profile) Specification
(Special Functions)

    Add Section 5.5, "Compute Shaders"

        In addition to graphics-oriented shading operations such as vertex,
    tessellation, geometry and fragment shading, generic computation may be
    performed by the GL through the use of compute shaders. The compute pipeline
    is a form of single-stage machine that runs generic shaders. Compute shaders
    are created as described in section 2.11.1 using a <type> parameter of
    COMPUTE_SHADER. They are attached to and used in program objects as
    described in section 2.11.3.

        Compute workloads are formed from groups of work items called
    _workgroups_ and processed by the executable code for a compute program.
    A workgroup is a collection of shader invocations that execute the same code,
    potentially in parallel. An invocation within a workgroup may share data
    with other members of the same workgroup through shared variables and
    issue memory and control barriers to synchronize with other members of the
    same workgroup.  One or more workgroups are launched by calling:

        void DispatchCompute(uint num_groups_x,
                             uint num_groups_y,
                             uint num_groups_z);

        Each workgroup is processed by the active program object for the
    compute shader stage.  The error INVALID_OPERATION will be generated if
    there is no active program object for the compute shader stage.  The
    active program for the compute shader stage will be determined in the same
    manner as the active program for other pipeline stages, as described in
    section 2.11.3.  While the individual shader invocations within a
    workgroup are executed as a unit, workgroups are executed completely
    independently and in unspecified order.

        <num_groups_x>, <num_groups_y> and <num_groups_z> specify the number of
    workgroups that will be dispatched in the X, Y and Z dimensions,
    respectively. The built-in vector variable gl_NumWorkGroups will be
    initialized with the contents of the <num_groups_x>, <num_groups_y> and
    <num_groups_z> parameters. The maximum number of workgroups that may be
    dispatched at one time may be determined by calling GetIntegeri_v with
    <pname> set to MAX_COMPUTE_WORK_GROUP_COUNT and <index> set to zero, one,
    or two, representing the X, Y, and Z dimensions, respectively. The
    values of <num_groups_x>, <num_groups_y> and <num_groups_z> must
    be less than or equal to the maximum workgroup count for the corresponding
    dimension, otherwise an INVALID_VALUE error is generated. If the workgroup
    count in any dimension is zero, no workgroups are dispatched.
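
    A common way to choose <num_groups_x>, <num_groups_y> and <num_groups_z>
    is to round a problem size up to a whole number of workgroups. The sketch
    below uses hypothetical helper names; the 65535-per-dimension limit used
    in the example is the value listed for gl_MaxComputeWorkGroupCount in the
    Section 7.3 additions, and a real application should query
    MAX_COMPUTE_WORK_GROUP_COUNT instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: smallest workgroup count such that
 * count * local_size >= n. Written to avoid overflow of n + local_size - 1. */
static uint32_t groups_to_cover(uint32_t n, uint32_t local_size)
{
    assert(local_size != 0);
    return n / local_size + (n % local_size != 0 ? 1u : 0u);
}

/* Mirrors the DispatchCompute rule: every count must be less than or equal
 * to the per-dimension limit from MAX_COMPUTE_WORK_GROUP_COUNT, otherwise
 * INVALID_VALUE is generated. */
static bool counts_within_limits(const uint32_t counts[3],
                                 const uint32_t max_counts[3])
{
    for (int i = 0; i < 3; ++i) {
        if (counts[i] > max_counts[i])
            return false;
    }
    return true;
}

/* Total workgroups launched; zero in any dimension means none are dispatched. */
static uint64_t total_workgroups(const uint32_t counts[3])
{
    return (uint64_t)counts[0] * counts[1] * counts[2];
}
```

    For example, a 1920 x 1080 image processed with a 16 x 16 local size
    needs 120 x 68 x 1 workgroups. Because 1080 is not a multiple of 16, some
    invocations in the last row of workgroups fall outside the image, and the
    shader would have to skip them.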

        The workgroup size in each dimension is specified at compile time
    using an input layout qualifier in one or more of the compute shaders
    attached to the program (see Section 4 of the OpenGL Shading Language
    Specification). After the program has been linked, the workgroup size
    of the program may be retrieved by calling GetProgramiv with <pname> set to
    COMPUTE_WORK_GROUP_SIZE. This will return an array of three integers
    containing the workgroup size of the compute program as specified by
    its input layout qualifier(s). If <program> is the name of a program that
    has not been successfully linked, or is the name of a linked program object
    that contains no compute shaders, then an INVALID_OPERATION error is
    generated.

        The maximum size of a workgroup may be determined by calling
    GetIntegeri_v with <pname> set to MAX_COMPUTE_WORK_GROUP_SIZE
    and <index> set to 0, 1, or 2 to retrieve the maximum workgroup size in the
    X, Y and Z dimensions, respectively. Furthermore, the maximum number of
    invocations in a single workgroup (i.e., the product of the three
    dimensions) may be determined by calling GetIntegerv with <pname> set to
    MAX_COMPUTE_WORK_GROUP_INVOCATIONS.
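
    The two limits above combine as follows: each dimension is bounded
    individually, and the product of all three dimensions is bounded by the
    invocation limit. The sketch is hypothetical; local_size_fits is not a
    GL function, and the limit values it receives would come from the
    queries just described.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check of a workgroup size against the per-dimension limits
 * from MAX_COMPUTE_WORK_GROUP_SIZE and the invocation limit from
 * MAX_COMPUTE_WORK_GROUP_INVOCATIONS (which bounds x * y * z). */
static bool local_size_fits(const uint32_t size[3],
                            const uint32_t max_size[3],
                            uint32_t max_invocations)
{
    uint64_t product = 1;
    for (int i = 0; i < 3; ++i) {
        if (size[i] > max_size[i])
            return false;
        product *= size[i]; /* cannot overflow: both factors fit in 32 bits */
        if (product > max_invocations)
            return false;
    }
    return true;
}
```

    With per-dimension limits of {1024, 1024, 64} (the values listed for
    gl_MaxComputeWorkGroupSize in the Section 7.3 additions) and an assumed
    invocation limit of 1024, a 32 x 32 x 1 workgroup fits exactly, while
    32 x 32 x 2 exceeds the invocation limit even though each dimension is
    individually within range.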

        The command

        void DispatchComputeIndirect(intptr indirect);

    is equivalent (assuming no errors are generated) to calling
    DispatchCompute with <num_groups_x>, <num_groups_y> and <num_groups_z>
    initialized with the three uint values contained in the buffer currently
    bound to the DISPATCH_INDIRECT_BUFFER binding at an offset, in basic
    machine units, specified by <indirect>.  The error INVALID_VALUE is
    generated if <indirect> is less than zero or is not a multiple of four.
    The error INVALID_OPERATION is generated if no buffer is bound to
    DISPATCH_INDIRECT_BUFFER, if the command would source data beyond the end
    of the buffer object, or if there is no active program for the compute
    shader stage.  If any of <num_groups_x>, <num_groups_y> or <num_groups_z>
    is greater than MAX_COMPUTE_WORK_GROUP_COUNT for the corresponding
    dimension, then the results are undefined.
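
    The error rules for DispatchComputeIndirect in the paragraph above can be
    summarized by a small classification function. This is a hypothetical
    model: the enum mirrors the GL error values, the 12-byte command size
    assumes three 32-bit uints, and the order in which the checks are applied
    when several conditions hold at once is an assumption, since the text
    does not rank them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical error codes; the nonzero values match GL's INVALID_VALUE
 * (0x0501) and INVALID_OPERATION (0x0502). */
typedef enum {
    ERR_NONE              = 0,
    ERR_INVALID_VALUE     = 0x0501,
    ERR_INVALID_OPERATION = 0x0502
} DispatchError;

#define INDIRECT_COMMAND_BYTES 12 /* three 32-bit uints (assumption) */

static DispatchError dispatch_indirect_error(ptrdiff_t indirect,
                                             bool buffer_bound,
                                             size_t buffer_size,
                                             bool compute_program_active)
{
    /* Negative or misaligned offset: INVALID_VALUE. */
    if (indirect < 0 || indirect % 4 != 0)
        return ERR_INVALID_VALUE;
    /* No buffer bound, or the command would read past the end. */
    if (!buffer_bound || (size_t)indirect > buffer_size ||
        buffer_size - (size_t)indirect < INDIRECT_COMMAND_BYTES)
        return ERR_INVALID_OPERATION;
    /* No active program for the compute shader stage. */
    if (!compute_program_active)
        return ERR_INVALID_OPERATION;
    return ERR_NONE;
}
```

    Note that the sketch does not model the case where a sourced count
    exceeds MAX_COMPUTE_WORK_GROUP_COUNT, because the text leaves those
    results undefined rather than generating an error.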

    Add Subsection 5.5.1, "Compute Shader Variables"

        Compute shaders can access variables belonging to the current program
    object. The amount of storage in the default uniform block accessed by a
    compute shader is specified by the value of the implementation dependent
    constant MAX_COMPUTE_UNIFORM_COMPONENTS. The total amount of
    combined storage available for uniform variables in all uniform blocks
    accessed by a compute shader (including the default uniform block) is
    specified by the implementation dependent constant
    MAX_COMBINED_COMPUTE_UNIFORM_COMPONENTS.

        There is a limit to the total size of all variables declared as
    <shared> in a single program object. This limit, expressed in basic
    machine units, may be queried as the value of
    MAX_COMPUTE_SHARED_MEMORY_SIZE.
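
    To get a feel for the shared-memory limit, the sketch below adds up the
    size of each <shared> declaration. It is only an estimate under stated
    assumptions: implementations may pad or align shared variables, so the
    real footprint can differ, and SharedDecl and the helper names are
    hypothetical. The limit is passed in, as it would be after querying
    MAX_COMPUTE_SHARED_MEMORY_SIZE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one <shared> declaration. */
typedef struct {
    size_t element_bytes; /* e.g. 4 for float or uint */
    size_t element_count; /* 1 for a scalar, product of dimensions for an array */
} SharedDecl;

/* Naive estimate: the sum of each declaration's size, ignoring any padding
 * or alignment an implementation may add. */
static size_t shared_bytes_estimate(const SharedDecl *decls, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; ++i)
        total += decls[i].element_bytes * decls[i].element_count;
    return total;
}

/* Compares the estimate with a limit queried from
 * MAX_COMPUTE_SHARED_MEMORY_SIZE. */
static bool estimate_within_limit(const SharedDecl *decls, size_t n,
                                  size_t max_shared_bytes)
{
    return shared_bytes_estimate(decls, n) <= max_shared_bytes;
}
```

    For example, a 16 x 16 array of 4-byte floats plus one 4-byte counter
    comes to an estimated 1028 basic machine units.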

Additions to Chapter 6 of the OpenGL 4.2 (Core Profile) Specification
(State and State Requests)

    None.

Additions to Chapter 2 of the OpenGL Shading Language Specification, Version
4.20 (Overview of OpenGL Shading)

    Replace the last sentence of the first paragraph of the overview with
    the following:

    "Currently, these processors are the vertex, tessellation control,
     tessellation evaluation, geometry, fragment, and compute processors."

    Replace the last sentence of the second paragraph of the overview with
    the following:

    "The specific languages will be referred to by the name of the processor
     they target: vertex, tessellation control, tessellation evaluation,
     geometry, fragment, or compute."

    Add a new Section 2.6 titled "Compute Processor" with the following text:

    "The <compute processor> is a programmable unit that operates independently
    from the other shader processors. Compilation units written in the OpenGL
    Shading Language to run on this processor are called <compute shaders>.
    When a complete set of compute shaders is compiled and linked, it
    results in a <compute shader executable> that runs on the compute processor.

    A compute shader has access to many of the same resources as fragment and
    other shader processors, such as textures, buffers, image variables,
    atomic counters, and so on. It does not have any predefined inputs
    or any fixed-function outputs.  It is not part of the graphics pipeline
    and its visible side effects are through actions on images, storage
    buffers, and atomic counters.

    A compute shader operates on a group of work items called a workgroup.
    A workgroup is a collection of shader invocations that execute the same
    code, potentially in parallel. An invocation within a workgroup may share
    data with other members of the same workgroup through shared variables and
    issue memory and control barriers to synchronize with other members of the
    same workgroup."

Additions to Chapter 4 of the OpenGL Shading Language Specification, Version
4.20 (Variables and Types)

    Modify section 4.4.1, second paragraph from

    "All shaders allow input layout qualifiers on input variable declarations."

    to

    "All shaders, except compute shaders, allow input layout location
     qualifiers on input variable declarations."

    Modify Section 4.3. Add to the table at the start of Section 4.3:

    +-------------------+-----------------------------------------------------------+
    | Storage Qualifier | Meaning                                                   |
    +-------------------+-----------------------------------------------------------+
    | <shared>          | variable storage is shared across all work items in a     |
    |                   | workgroup for compute shaders                             |
    +-------------------+-----------------------------------------------------------+

    Add the following paragraph to Section 4.3.4, "Input Variables"

        Compute shaders do not permit user-defined input variables and do not
    form a formal interface with any other shader stage. See section 7.1
    for a description of built-in compute shader input variables. All other
    input to a compute shader is retrieved explicitly through image loads,
    texture fetches, loads from uniforms or uniform buffers, or other
    user-supplied code. Redeclaration of built-in input variables in compute
    shaders is not permitted.

    Add the following paragraph to Section 4.3.6, "Output Variables"

        Compute shaders have no built-in output variables, do not support
    user-defined output variables and do not form a formal interface with any
    other shader stage. All outputs from a compute shader take the form of
    side effects such as image stores and operations on atomic counters.

    Add Section 4.3.7, "Shared", renumber subsequent sections

        The <shared> qualifier is used to declare variables that have storage
    shared between all work items of a compute shader workgroup.
    Variables declared as <shared> may only be used in compute shaders
    (see Section 5.5, "Compute Shaders"). Shared variables are implicitly
    coherent. That is, writes to shared variables from one shader invocation
    will eventually be seen by other invocations within the same workgroup.

        Variables declared as <shared> may not have initializers and their
    contents are undefined at the beginning of shader execution. Any data
    written to <shared> variables will be visible to other invocations
    executing the same shader within the same workgroup. Order of execution
    with regard to reads and writes to the same <shared> variables by different
    invocations of a shader is not defined. In order to achieve ordering with
    respect to reads and writes to <shared> variables, memory barriers must be
    employed using the barrier() function (see Section 8.15).
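
    The ordering requirement above is easiest to see in a workgroup
    reduction. The C sketch below is a serial model, not GLSL: each pass of
    the outer loop stands for the code between two barrier() calls, so every
    invocation finishes one phase before any invocation starts the next. In
    a real compute shader the invocations run concurrently and barrier() is
    what provides that guarantee. LOCAL_SIZE and workgroup_sum are
    hypothetical names, and the sketch assumes a power-of-two workgroup size.

```c
#include <assert.h>
#include <stddef.h>

#define LOCAL_SIZE 8 /* hypothetical local_size_x */
_Static_assert((LOCAL_SIZE & (LOCAL_SIZE - 1)) == 0,
               "this sketch assumes a power-of-two workgroup size");

/* Serial model of a one-dimensional workgroup summing its inputs through a
 * shared array. Each pass of the outer loop stands for one barrier phase. */
static float workgroup_sum(const float input[LOCAL_SIZE])
{
    float shared_data[LOCAL_SIZE]; /* stands in for a <shared> array */

    /* Phase 0: invocation `id` stores its own input. A barrier() must follow,
     * because the next phase reads elements written by other invocations. */
    for (size_t id = 0; id < LOCAL_SIZE; ++id)
        shared_data[id] = input[id];

    /* Tree reduction: each phase halves the number of active invocations. */
    for (size_t stride = LOCAL_SIZE / 2; stride > 0; stride /= 2) {
        for (size_t id = 0; id < stride; ++id)
            shared_data[id] += shared_data[id + stride];
        /* barrier(): the next phase reads sums written in this one. */
    }
    return shared_data[0];
}
```

    Within one phase, invocation id writes shared_data[id] and reads
    shared_data[id + stride]; since id < stride, no two invocations in the
    same phase touch the same element, so only the phase boundaries need
    synchronization.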

        There is a limit to the total size of all variables declared as
    <shared> in a single program object. This limit, expressed in basic
    machine units, may be determined by using the OpenGL API to query the
    value of MAX_COMPUTE_SHARED_MEMORY_SIZE.

    Add Section 4.4.1.4, "Compute-Shader Inputs"

    There are no layout location qualifiers for compute shader inputs.

    Layout qualifier identifiers for compute shader inputs are the workgroup
    size qualifiers:

        layout-qualifier-id
            local_size_x = integer-constant
            local_size_y = integer-constant
            local_size_z = integer-constant

    <local_size_x>, <local_size_y>, and <local_size_z> are used to define the
    local size of the kernel defined by the compute shader in the first,
    second, and third dimension, respectively. The default size in each
    dimension is 1. If a shader does not specify a size for one of the
    dimensions, that dimension will have a size of 1.

    For example, the following declaration in a compute shader

        layout (local_size_x = 32, local_size_y = 32) in;

    declares a two-dimensional compute shader with a local size of 32 x 32
    elements, which is equivalent to a three-dimensional compute shader whose
    third dimension is one element deep.

    As another example, the declaration

        layout (local_size_x = 8) in;

    effectively specifies that a one-dimensional compute shader is being
    compiled, and its size is 8 elements.

        If the local size of the shader in any dimension is greater than the
    maximum size supported by the implementation for that dimension, a
    compile-time error results. Also, if such a layout qualifier is declared more
    than once in the same shader, all those declarations must indicate the same
    workgroup size; otherwise a compile-time error results. If multiple compute
    shaders attached to a single program object declare the workgroup size,
    the declarations must be identical; otherwise a link-time error results.
    Furthermore, if a program object contains any compute shaders, at
    least one must contain an input layout qualifier specifying the
    workgroup sizes of the program, or a link-time error will occur.
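
    The defaulting and consistency rules above can be sketched as follows.
    The types and functions are hypothetical, and an omitted qualifier is
    represented as 0 purely as a convention of this sketch. One further
    assumption: declarations are compared after the default of 1 has been
    applied, so "local_size_x = 8" and "local_size_x = 8, local_size_y = 1"
    count as the same size; a stricter compiler might also require identical
    qualifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one "layout (local_size_x = ...) in;" declaration.
 * A field of 0 means that qualifier was omitted (a convention of this sketch). */
typedef struct {
    uint32_t x, y, z;
} LocalSizeDecl;

/* Applies the default size of 1 to every omitted dimension. */
static LocalSizeDecl resolve_local_size(LocalSizeDecl d)
{
    LocalSizeDecl r = { d.x ? d.x : 1u, d.y ? d.y : 1u, d.z ? d.z : 1u };
    return r;
}

/* Merges all size declarations seen across a program's compute shaders.
 * Returns false for what the text makes a compile- or link-time error:
 * declarations that disagree, or no declaration at all. */
static bool merge_local_size(const LocalSizeDecl *decls, size_t count,
                             LocalSizeDecl *out)
{
    assert(out != NULL);
    if (count == 0)
        return false;
    LocalSizeDecl first = resolve_local_size(decls[0]);
    for (size_t i = 1; i < count; ++i) {
        LocalSizeDecl d = resolve_local_size(decls[i]);
        if (d.x != first.x || d.y != first.y || d.z != first.z)
            return false;
    }
    *out = first;
    return true;
}
```

    Resolving "layout (local_size_x = 8) in;" this way yields a size of
    8 x 1 x 1, matching the one-dimensional example above.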
602
603Additions to Chapter 7 of the OpenGL Shading Language Specification, Version
6044.20 (Built-in Variables)
605
606    Add to the start of Section 7.1, "Built-In Language Variables", before the
607    description of the vertex language built-in variables:
608
609        In the compute language, the built-in variables are declared as follows:
610
611        // workgroup dimensions
612        in    uvec3 gl_NumWorkGroups;
613        const uvec3 gl_WorkGroupSize;
614
615        // workgroup and invocation IDs
616        in    uvec3 gl_WorkGroupID;
617        in    uvec3 gl_LocalInvocationID;
618
619        // derived variables
620        in    uvec3 gl_GlobalInvocationID;
621        in    uint  gl_LocalInvocationIndex;
622
623    Add the end of Section 7.1, before Section 7.1.1:
624
625        The built-in variable <gl_NumWorkGroups> is a compute-shader input
626    variable containing the total number of global work items in each
627    dimension of the workgroup that will execute the compute shader.
628    Its content is equal to the values specified in the <num_groups_x>,
629    <num_groups_y>, and <num_groups_z> parameters passed to the
630    DispatchCompute API entry point.
631
632        The built-in constant <gl_WorkGroupSize> is a compute-shader constant
633    containing the workgroup size of the shader. The size of the workgroup
634    in the X, Y, and Z dimensions is stored in the x, y, and z components.
635    The values stored in <gl_WorkGroupSize> match those specified in the
636    required <local_size_x>, <local_size_y>, and <local_size_z> layout
637    qualifiers for the current shader. This value is constant so that
638    it can be used to size arrays of memory that can be shared within
639    the workgroup.
640
641        The built-in variable <gl_WorkGroupID> is a compute-shader input
642    variable containing the 3-dimensional index of the global workgroup
643    that the current invocation is executing in. The possible values range
644    across the parameters passed into DispatchCompute, i.e., from (0, 0, 0) to
645    (gl_NumWorkGroups.x - 1, gl_NumWorkGroups.y - 1, gl_NumWorkGroups.z - 1).
646
647        The built-in variable <gl_LocalInvocationID> is a compute-shader input
648    variable containing the 3-dimensional index of the workgroup
649    within the global workgroup that the current invocation is executing in.
650    The possible values for this variable range across the workgroup
651    size, i.e. (0,0,0) to (gl_WorkGroupSize.x - 1, gl_WorkGroupSize.y - 1,
652    gl_WorkGroupSize.z - 1).
653
654        The built-in variable <gl_GlobalInvocationID> is a compute shader input
655    variable containing the global index of the current work item.  This
656    value uniquely identifies this invocation from all other invocations
657    across all workgroups initiated by the current
658    DispatchCompute call.  This is computed as:
659
660        gl_GlobalInvocationID =
661            gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID.
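    As an illustration only (not part of the specification), the C sketch
    below evaluates the formula above on the host.  The uvec3_t type and the
    function names are hypothetical stand-ins for the GLSL built-ins.  The
    sketch also computes the total number of invocations launched by a
    dispatch, which is the component-wise product of gl_NumWorkGroups and
    gl_WorkGroupSize.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the GLSL uvec3 type. */
typedef struct { uint32_t x, y, z; } uvec3_t;

/* Component-wise evaluation of
 *   gl_GlobalInvocationID = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID
 * Precondition: each local ID component is smaller than the workgroup size. */
static uvec3_t global_invocation_id(uvec3_t work_group_id,
                                    uvec3_t work_group_size,
                                    uvec3_t local_invocation_id)
{
    assert(local_invocation_id.x < work_group_size.x &&
           local_invocation_id.y < work_group_size.y &&
           local_invocation_id.z < work_group_size.z);
    uvec3_t g = {
        work_group_id.x * work_group_size.x + local_invocation_id.x,
        work_group_id.y * work_group_size.y + local_invocation_id.y,
        work_group_id.z * work_group_size.z + local_invocation_id.z,
    };
    return g;
}

/* Number of invocations launched by DispatchCompute(x, y, z), where
 * num_work_groups holds the three DispatchCompute arguments. */
static uint64_t total_invocations(uvec3_t num_work_groups, uvec3_t work_group_size)
{
    return (uint64_t)num_work_groups.x * work_group_size.x *
           (uint64_t)num_work_groups.y * work_group_size.y *
           (uint64_t)num_work_groups.z * work_group_size.z;
}
```

    For example, with an 8 x 8 x 1 workgroup, the invocation with local ID
    (7, 0, 0) in workgroup (2, 3, 4) has global ID (23, 24, 4), and
    DispatchCompute(5, 5, 5) launches 40 * 40 * 5 = 8000 invocations.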

        The built-in variable <gl_LocalInvocationIndex> is a compute-shader
    input variable that contains the 1-dimensional representation of
    <gl_LocalInvocationID>. This is useful for selecting a region of shared
    memory within the workgroup that is unique to this invocation. This is
    computed as:

        gl_LocalInvocationIndex =
            gl_LocalInvocationID.z * gl_WorkGroupSize.x * gl_WorkGroupSize.y +
            gl_LocalInvocationID.y * gl_WorkGroupSize.x +
            gl_LocalInvocationID.x;
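    The following C sketch (illustrative only; all names are hypothetical)
    evaluates this formula and its inverse.  It also checks that every
    invocation in a workgroup receives a distinct index in the range
    [0, gl_WorkGroupSize.x * gl_WorkGroupSize.y * gl_WorkGroupSize.z).  That
    property is what makes the index suitable for addressing per-invocation
    slots in a shared array.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the GLSL uvec3 type. */
typedef struct { uint32_t x, y, z; } uvec3_t;

/* gl_LocalInvocationIndex from gl_LocalInvocationID; x varies fastest. */
static uint32_t local_invocation_index(uvec3_t local_id, uvec3_t size)
{
    assert(local_id.x < size.x && local_id.y < size.y && local_id.z < size.z);
    return local_id.z * size.x * size.y + local_id.y * size.x + local_id.x;
}

/* Inverse mapping: recovers gl_LocalInvocationID from the flattened index. */
static uvec3_t local_invocation_id(uint32_t index, uvec3_t size)
{
    assert(index < size.x * size.y * size.z);
    uvec3_t id = {
        index % size.x,
        (index / size.x) % size.y,
        index / (size.x * size.y),
    };
    return id;
}

/* Returns 1 if walking the workgroup in z, y, x order yields the indices
 * 0, 1, 2, ... in sequence and each index maps back to its local ID. */
static int index_mapping_is_bijective(uvec3_t size)
{
    uint32_t expected = 0;
    for (uint32_t z = 0; z < size.z; ++z)
        for (uint32_t y = 0; y < size.y; ++y)
            for (uint32_t x = 0; x < size.x; ++x) {
                uvec3_t id = {x, y, z};
                uint32_t i = local_invocation_index(id, size);
                uvec3_t back = local_invocation_id(i, size);
                if (i != expected++ || back.x != x || back.y != y || back.z != z)
                    return 0;
            }
    return 1;
}
```

    For a 4 x 4 x 2 workgroup, local ID (3, 2, 1) maps to index
    1*16 + 2*4 + 3 = 27, and index 27 maps back to (3, 2, 1).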

    Add to the list of built-in constants in Section 7.3:

        const ivec3 gl_MaxComputeWorkGroupCount = { 65535, 65535, 65535 };
        const ivec3 gl_MaxComputeWorkGroupSize = { 1024, 1024, 64 };
        const int gl_MaxComputeUniformComponents = 512;
        const int gl_MaxComputeTextureImageUnits = 16;
        const int gl_MaxComputeImageUniforms = 8;
        const int gl_MaxComputeAtomicCounters = 8;
        const int gl_MaxComputeAtomicCounterBuffers = 1;

Additions to Chapter 8 of the OpenGL Shading Language Specification, Version
4.20 (Built-in Functions)

    Insert "Atomic Memory Functions" section after Section 8.10, Atomic
    Counter Functions (p. 149).  Atomic memory operations are supported on
    shared variables; the set of operations and their definitions are similar
    to those for the imageAtomic*() functions.  These functions are fully
    documented in the ARB_shader_storage_buffer_object extension (see
    dependencies).

    Modify the first paragraph of Section 8.15, "Shader Invocation Control
    Functions" to read:

        The shader invocation control function is only available in tessellation
    control shaders and compute shaders. It is used to control the relative
    execution order of multiple shader invocations used to process a patch
    (in the case of tessellation control shaders) or a workgroup (in the
    case of compute shaders), which are otherwise executed with an undefined
    order.

    +----------------+--------------------------------------------------------------------------+
    | Syntax         | Description                                                              |
    +----------------+--------------------------------------------------------------------------+
    | barrier        | For any given static instance of barrier() appearing in a tessellation   |
    |                | control shader or compute shader, all invocations for a single patch     |
    |                | or workgroup, respectively, must enter it before any will continue       |
    |                | beyond it.                                                               |
    +----------------+--------------------------------------------------------------------------+

    Modify the second paragraph as follows:

    ... Because invocations may execute in an undefined order between these
    barrier calls, the values of a per-vertex or per-patch output variable in
    a tessellation control shader or shared variables for compute shaders
    will be undefined in a number of cases enumerated in Section 4.3.7 "Output
    Variables" (for tessellation control shaders) and Section 4.3.6 "Shared
    Variables" (for compute shaders).

    Replace the third paragraph with the following:

    For tessellation control shaders, the barrier() function may only be
    placed inside the function main() of the tessellation control shader and
    may not be called within any control flow. Barriers are also disallowed
    after a return statement in the function main(). Any such misplaced
    barriers result in a compile-time error.

    For compute shaders, the barrier() function may be placed within flow
    control, but that flow control must be uniform flow control. That is, all
    the controlling expressions that lead to execution of the barrier must be
    dynamically uniform expressions. This ensures that if any shader
    invocation enters a conditional statement, then all invocations will enter
    it. While compilers are encouraged to give warnings if they can detect
    this might not happen, compilers cannot completely determine this. Hence,
    it is the author's responsibility to ensure barrier() only exists inside
    uniform flow control. Otherwise, some shader invocations will stall
    indefinitely, waiting for a barrier that is never reached by other
    invocations.
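    A typical use of barrier() in uniform flow control is a tree reduction
    over a shared array.  The loop that contains barrier() has a trip count
    that depends only on gl_WorkGroupSize, so it is dynamically uniform.  The
    C sketch below is a sequential model of one workgroup, not GPU code.
    Each phase runs the loop body for every invocation before any invocation
    starts the next phase, which is the property barrier() provides.  The
    workgroup size of 64 and the function name are assumptions for
    illustration.  In real shaders, writes to the shared array are commonly
    also made visible with memoryBarrierShared() before calling barrier().

```c
#include <assert.h>
#include <stdint.h>

#define WORK_GROUP_SIZE 64u  /* hypothetical local_size_x; a power of two here */

/* Sequential model of a shared-memory sum reduction in one workgroup.
 * Each outer-loop iteration stands for one step between two barrier()
 * calls; its bounds depend only on the workgroup size, so in a shader
 * every invocation would reach every barrier(). */
static uint32_t workgroup_sum(const uint32_t input[WORK_GROUP_SIZE])
{
    uint32_t shared_data[WORK_GROUP_SIZE];  /* models a <shared> array */

    /* Phase 0: every invocation copies its element into shared memory. */
    for (uint32_t lid = 0; lid < WORK_GROUP_SIZE; ++lid)
        shared_data[lid] = input[lid];
    /* barrier(): no invocation reads until all copies are done. */

    for (uint32_t stride = WORK_GROUP_SIZE / 2; stride > 0; stride /= 2) {
        /* Only invocations with lid < stride add; the others idle but would
         * still execute the barrier() that ends this step. Reads touch only
         * indices >= stride, which no invocation writes in this step. */
        for (uint32_t lid = 0; lid < stride; ++lid)
            shared_data[lid] += shared_data[lid + stride];
        /* barrier(): all invocations finish this step before the next. */
    }
    return shared_data[0];
}
```

    With the inputs 1, 2, ..., 64 the reduction returns 64 * 65 / 2 = 2080.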

    Modify the table of memory control functions on p. 160 as follows:

    +-----------------------------------+----------------------------------------------------------------------------------------+
    | Syntax                            | Description                                                                            |
    +-----------------------------------+----------------------------------------------------------------------------------------+
    | void memoryBarrier()              | Control the ordering of all memory transactions issued by a single shader invocation.  |
    +-----------------------------------+----------------------------------------------------------------------------------------+
    | void memoryBarrierAtomicCounter() | Control the ordering of accesses to atomic counter variables issued by a single shader |
    |                                   | invocation.                                                                            |
    +-----------------------------------+----------------------------------------------------------------------------------------+
    | void memoryBarrierBuffer()        | Control the ordering of memory transactions to buffer variables issued within a        |
    |                                   | single shader invocation.                                                              |
    +-----------------------------------+----------------------------------------------------------------------------------------+
    | void memoryBarrierImage()         | Control the ordering of memory transactions to images issued within a single shader    |
    |                                   | invocation.                                                                            |
    +-----------------------------------+----------------------------------------------------------------------------------------+
    | void memoryBarrierShared()        | Control the ordering of memory transactions to shared variables issued within a single |
    |                                   | shader invocation.                                                                     |
    |                                   | Only available in compute shaders.                                                     |
    +-----------------------------------+----------------------------------------------------------------------------------------+
    | void groupMemoryBarrier()         | Control the ordering of all memory transactions issued within a single shader          |
    |                                   | invocation, as viewed by other invocations in the same workgroup.                      |
    |                                   | Only available in compute shaders.                                                     |
    +-----------------------------------+----------------------------------------------------------------------------------------+

    Modify the subsequent paragraph as follows:

    The memory barrier built-in functions can be used to order reads and
    writes to variables stored in memory accessible to other shader
    invocations.  When called, these functions will wait for the completion of
    all reads and writes previously performed by the caller that access
    selected variable types, and then return with no other effect.  The
    built-in functions memoryBarrierAtomicCounter(), memoryBarrierBuffer(),
    memoryBarrierImage(), and memoryBarrierShared() wait for the completion of
    accesses to atomic counter, buffer, image, and shared variables,
    respectively.  The built-in functions memoryBarrier() and
    groupMemoryBarrier() wait for the completion of accesses to all of the
    above variable types.  The functions memoryBarrierShared() and
    groupMemoryBarrier() are available only in compute shaders; the other
    functions are available in all shader types.

    When these functions return, any memory stores performed using coherent
    variables prior to the call will be visible to any future coherent access
    to the same memory performed by any other shader invocation.  In
    particular, the values written this way in one shader stage are guaranteed
    to be visible to coherent memory accesses performed by shader invocations
    in subsequent stages when those invocations were triggered by the
    execution of the original shader invocation (e.g., fragment shader
    invocations for a primitive resulting from a particular geometry shader
    invocation).

    Additionally, memory barrier functions order stores performed by the
    calling invocation, as observed by other shader invocations.  Without
    memory barriers, if one shader invocation performs two stores to coherent
    variables, a second shader invocation might see the values written by the
    second store prior to seeing those written by the first.  However, if the
    first shader invocation calls a memory barrier function between the two
    stores, selected other shader invocations will never see the results of
    the second store before seeing those of the first.  When using the
    function groupMemoryBarrier(), this ordering guarantee applies only to
    other shader invocations in the same compute shader workgroup; all other
    memory barrier functions provide the guarantee to all other shader
    invocations.  No memory barrier is required to guarantee the order of
    memory stores as observed by the invocation performing the stores; an
    invocation reading from a variable that it previously wrote will always
    see the most recently written value unless another shader invocation also
    wrote to the same memory.
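    The store-ordering guarantee above resembles the message-passing pattern
    of the C11 memory model.  The sketch below is an analogy only: two host
    threads stand in for two shader invocations, and a C11 release fence
    plays the role of the memory barrier between the producer's two stores.
    Unlike the GLSL text above, C11 also requires an acquire operation on the
    reader side, which the consumer supplies with an acquire fence.  The
    sketch assumes a platform with POSIX threads and <stdatomic.h>.  All
    function names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* State written by the "producer invocation". */
static int payload;       /* first store: plain data */
static atomic_int ready;  /* second store: flag published after the data */

static void *producer(void *arg)
{
    (void)arg;
    payload = 42;                                           /* first store */
    atomic_thread_fence(memory_order_release);              /* analogous to memoryBarrier() */
    atomic_store_explicit(&ready, 1, memory_order_relaxed); /* second store */
    return NULL;
}

static void *consumer(void *arg)
{
    int *observed = arg;
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        ;                                      /* wait until the second store is seen */
    atomic_thread_fence(memory_order_acquire); /* reader-side ordering required by C11 */
    *observed = payload;                       /* guaranteed to see the first store */
    return NULL;
}

/* Runs one producer/consumer pair and returns the value the consumer read. */
static int run_message_passing(void)
{
    pthread_t prod, cons;
    int observed = -1;
    payload = 0;
    atomic_store(&ready, 0);
    int rc = pthread_create(&cons, NULL, consumer, &observed);
    assert(rc == 0);
    rc = pthread_create(&prod, NULL, producer, NULL);
    assert(rc == 0);
    pthread_join(prod, NULL);
    pthread_join(cons, NULL);
    return observed;
}
```

    Without the two fences, the C11 program would contain a data race on
    payload, much as the text above says a second invocation might observe
    the second store before the first.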

Dependencies on OpenGL 4.3 and ARB_shader_storage_buffer_object

    If neither OpenGL 4.3 nor ARB_shader_storage_buffer_object is supported,
    the spec language adding the built-in functions atomicAdd(), atomicMin(),
    atomicMax(), atomicAnd(), atomicOr(), atomicXor(), atomicExchange(), and
    atomicCompSwap() should be considered to be incorporated into this
    extension as-is, except that buffer variables will not be supported and
    thus cannot be used with these functions.  No "#extension" directive is
    necessary to use these functions in compute shaders.

    If neither OpenGL 4.3 nor ARB_shader_storage_buffer_object is supported,
    references to the GLSL built-in function memoryBarrierBuffer() should be
    removed.

Dependencies on NV_vertex_buffer_unified_memory

    If NV_vertex_buffer_unified_memory is supported, a new buffer address
    range and enable are provided to permit DispatchComputeIndirect to use a
    resident buffer object without requiring that it be bound to the
    DISPATCH_INDIRECT_BUFFER target.  The following additional edits apply:

826    If NV_vertex_buffer_unified_memory is supported, a new buffer address
827    range and enable is provided to permit the use with
828    DispatchComputeIndirect with a resident buffer object without requiring
829    that it be bound to the DISPATCH_INDIRECT_BUFFER target.  The following
830    additional edits apply:
831
832    Accepted by the <cap> parameter of GetBufferParameterui64vNV:
833
834        DISPATCH_INDIRECT_BUFFER                        (defined above)
835
836    Accepted by the <cap> parameter of Disable, Enable, and IsEnabled, and by
837    the <pname> parameter of GetIntegerv, GetBooleanv, GetFloatv, GetDoublev
838    and GetInteger64v:
839
840        DISPATCH_INDIRECT_UNIFIED_NV                    0x90FD
841
842    Accepted by the <pname> parameter of BufferAddressRangeNV
843    and the <value> parameter of GetIntegerui64vNV:
844
845        DISPATCH_INDIRECT_ADDRESS_NV                    0x90FE
846
847    Accepted by the <value> parameter of GetIntegerv:
848
849        DISPATCH_INDIRECT_LENGTH_NV                     0x90FF
850
851    Add to the end of Section 5.5, after discussion of
852    DispatchComputeIndirect:
853
854    If DISPATCH_INDIRECT_UNIFIED_NV is enabled, DispatchComputeIndirect does
855    not use the buffer bound to DISPATCH_INDIRECT_BUFFER.  Instead, it sources
856    its arguments from the GPU address range specified by calling
857    BufferAddressRangeNV with a <pname> of DISPATCH_INDIRECT_ADDRESS_NV and an
858    <index> of zero.  The address is obtained by adding the <indirect>
859    parameter to the base address of the range, specified by the <address>
860    parameter of BufferAddressRangeNV.  If the command sources data outside
861    the specified address range, the error INVALID_OPERATION will be
862    generated.  The DISPATCH_INDIRECT_BUFFER binding will be ignored in this
863    case, and no errors will be generated due to the use of this binding.  The
864    error INVALID_VALUE will still be generated if <indirect> is negative.  No
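    The address checks described above can be sketched in C as follows.  The
    sketch is hypothetical: the enum values only mirror the GL error names,
    the function name is invented, and the 12-byte size assumes the indirect
    command consists of three GLuint values (num_groups_x, num_groups_y,
    num_groups_z).  The order in which the checks are applied is also an
    assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result codes mirroring the GL errors named in the text. */
enum dispatch_error { DISPATCH_OK, DISPATCH_INVALID_VALUE, DISPATCH_INVALID_OPERATION };

/* Assumed size of the indirect command: three GLuint values. */
#define DISPATCH_INDIRECT_COMMAND_SIZE 12u

/* Validates <indirect> for DispatchComputeIndirect while
 * DISPATCH_INDIRECT_UNIFIED_NV is enabled.  range_base and range_length
 * describe the range set with BufferAddressRangeNV using
 * DISPATCH_INDIRECT_ADDRESS_NV and an index of zero. */
static enum dispatch_error check_unified_indirect(uint64_t range_base,
                                                  uint64_t range_length,
                                                  int64_t indirect)
{
    if (indirect < 0)
        return DISPATCH_INVALID_VALUE;      /* still an error in this mode */

    uint64_t offset = (uint64_t)indirect;
    uint64_t address = range_base + offset; /* effective address */
    if (address % 4u != 0)
        return DISPATCH_INVALID_OPERATION;  /* alignment applies to the address */

    if (offset > range_length ||
        range_length - offset < DISPATCH_INDIRECT_COMMAND_SIZE)
        return DISPATCH_INVALID_OPERATION;  /* would read outside the range */

    return DISPATCH_OK;
}
```

    Note that an unaligned <indirect> such as 2 is accepted when the range
    base is itself offset by 2, because only the effective address must be a
    multiple of four.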

    Add the following to the "Compute Dispatch State" table defined in this
    extension:

    Get Value                           Type    Get Command         Initial Value   Sec     Attribute
    ---------                           ----    -----------         -------------   ---     ---------
    DISPATCH_INDIRECT_UNIFIED_NV         B      IsEnabled               FALSE       5.5     none
    DISPATCH_INDIRECT_ADDRESS_NV        Z64+    GetIntegerui64vNV         0         5.5     none
    DISPATCH_INDIRECT_LENGTH_NV          Z+     GetIntegerv               0         5.5     none

Errors

    INVALID_OPERATION is generated by DispatchCompute or
    DispatchComputeIndirect if there is no active program for the compute
    shader stage.

    INVALID_VALUE is generated by DispatchCompute if any of <num_groups_x>,
    <num_groups_y> or <num_groups_z> is greater than the value of
    MAX_COMPUTE_WORK_GROUP_COUNT for the corresponding dimension.

    INVALID_VALUE is generated by DispatchComputeIndirect if <indirect> is
    less than zero or not a multiple of four.

    INVALID_OPERATION is generated by DispatchComputeIndirect if no buffer is
    bound to DISPATCH_INDIRECT_BUFFER or if the command would source data
    beyond the end of the bound buffer object.
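    The DispatchCompute and DispatchComputeIndirect errors listed above can
    be modeled with the C sketch below.  The state structure, function names,
    and error enum are hypothetical.  The order of the checks is an
    assumption, and the 12-byte command size assumes three GLuint values are
    read starting at <indirect>.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result codes mirroring the GL errors named in the text. */
enum dispatch_error { DISPATCH_OK, DISPATCH_INVALID_VALUE, DISPATCH_INVALID_OPERATION };

/* Hypothetical snapshot of the state these checks depend on. */
struct dispatch_state {
    bool has_active_compute_program;  /* active program for the compute stage */
    uint32_t max_work_group_count[3]; /* MAX_COMPUTE_WORK_GROUP_COUNT per dimension */
    bool indirect_buffer_bound;       /* buffer bound to DISPATCH_INDIRECT_BUFFER */
    int64_t indirect_buffer_size;     /* size of that buffer in bytes */
};

static enum dispatch_error check_dispatch(const struct dispatch_state *s,
                                          uint32_t x, uint32_t y, uint32_t z)
{
    if (!s->has_active_compute_program)
        return DISPATCH_INVALID_OPERATION;
    if (x > s->max_work_group_count[0] || y > s->max_work_group_count[1] ||
        z > s->max_work_group_count[2])
        return DISPATCH_INVALID_VALUE;
    return DISPATCH_OK;
}

static enum dispatch_error check_dispatch_indirect(const struct dispatch_state *s,
                                                   int64_t indirect)
{
    if (!s->has_active_compute_program)
        return DISPATCH_INVALID_OPERATION;
    if (indirect < 0 || indirect % 4 != 0)
        return DISPATCH_INVALID_VALUE;
    /* The command reads three GLuint values (12 bytes) starting at <indirect>. */
    if (!s->indirect_buffer_bound || indirect + 12 > s->indirect_buffer_size)
        return DISPATCH_INVALID_OPERATION;
    return DISPATCH_OK;
}
```

    With a 24-byte indirect buffer, an offset of 12 reads exactly the last
    command that fits, while an offset of 16 would read past the end.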

    INVALID_OPERATION is generated by GetProgramiv if <pname> is
    COMPUTE_WORK_GROUP_SIZE and either the program has not been linked
    successfully, or has been linked but contains no compute shaders.

    LinkProgram will fail if <program> contains a combination of compute and
    non-compute shaders.

New State

    None.

New Implementation Dependent State

    Add to Table 6.31, "Program Pipeline Object State"

    +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+
    | Get Value                                          | Type      | Get Command             | Initial Value | Description                                                           | Sec.    |
    +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+
    | COMPUTE_SHADER                                     | Z+        | GetProgramPipelineiv    | 0             | Name of current compute shader program object                         | 2.11.4  |
    +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+

    Add to Table 6.32, "Program Object State"

    +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+
    | Get Value                                          | Type      | Get Command             | Initial Value | Description                                                           | Sec.    |
    +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+
    | COMPUTE_WORK_GROUP_SIZE                            | 3 x Z+    | GetProgramiv            | { 0, ... }    | Workgroup size of a linked compute program                            | 5.5     |
    | UNIFORM_BLOCK_REFERENCED_BY_COMPUTE_SHADER         | B         | GetActiveUniformBlockiv | FALSE         | True if uniform block is referenced by the compute stage              | 2.17.7  |
    | ATOMIC_COUNTER_BUFFER_REFERENCED_BY_COMPUTE_SHADER | B         | GetActiveAtomicCounter- | FALSE         | AACB has a counter used by compute shaders                            | 2.17.7  |
    |                                                    |           |   Bufferiv              |               |                                                                       |         |
    +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+

    Insert new table named "Compute Dispatch State", after Table 6.46 "Hints":

    +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+
    | Get Value                                          | Type      | Get Command             | Initial Value | Description                                                           | Sec.    |
    +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+
    | DISPATCH_INDIRECT_BUFFER_BINDING                   | Z+        | GetIntegerv             | 0             | Indirect dispatch buffer binding                                      | 5.5     |
    +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+

    Insert Table 6.50, "Implementation Dependent Compute Shader Limits",
    and renumber subsequent tables.

    +-----------------------------------------+-----------+---------------+---------------------+-----------------------------------------------------------------------+---------+
    | Get Value                               | Type      | Get Command   | Minimum Value       | Description                                                           | Sec.    |
    +-----------------------------------------+-----------+---------------+---------------------+-----------------------------------------------------------------------+---------+
    | MAX_COMPUTE_WORK_GROUP_COUNT            | 3 x Z+    | GetIntegeri_v | 65535               | Maximum number of workgroups that may be dispatched by a single       | 5.5     |
    |                                         |           |               |                     | dispatch command (per dimension)                                      |         |
    | MAX_COMPUTE_WORK_GROUP_SIZE             | 3 x Z+    | GetIntegeri_v | 1024 (x, y), 64 (z) | Maximum local size of a compute workgroup (per dimension)             | 5.5     |
    | MAX_COMPUTE_WORK_GROUP_INVOCATIONS      | Z+        | GetIntegerv   | 1024                | Maximum total compute shader invocations in a single workgroup        | 5.5     |
    | MAX_COMPUTE_UNIFORM_BLOCKS              | Z+        | GetIntegerv   | 12                  | Maximum number of uniform blocks per compute program                  | 2.11.7  |
    | MAX_COMPUTE_TEXTURE_IMAGE_UNITS         | Z+        | GetIntegerv   | 16                  | Maximum number of texture image units accessible by a compute shader  | 2.11.12 |
    | MAX_COMPUTE_ATOMIC_COUNTER_BUFFERS      | Z+        | GetIntegerv   | 8                   | Number of atomic counter buffers accessed by a compute shader         | 2.11.17 |
    | MAX_COMPUTE_ATOMIC_COUNTERS             | Z+        | GetIntegerv   | 8                   | Number of atomic counters accessed by a compute shader                | 2.11.12 |
    | MAX_COMPUTE_SHARED_MEMORY_SIZE          | Z+        | GetIntegerv   | 32768               | Maximum total storage size of all variables declared as <shared> in   |         |
    |                                         |           |               |                     | all compute shaders linked into a single program object               |         |
    | MAX_COMPUTE_UNIFORM_COMPONENTS          | Z+        | GetIntegerv   | 512                 | Number of components for compute shader uniform variables             | 5.5.1   |
    | MAX_COMPUTE_IMAGE_UNIFORMS              | Z+        | GetIntegerv   | 8                   | Number of image variables in compute shaders                          | 2.11.12 |
    | MAX_COMBINED_COMPUTE_UNIFORM_COMPONENTS | Z+        | GetIntegerv   | *                   | Number of words for compute shader uniform variables in all uniform   | 5.5.1   |
    |                                         |           |               |                     | blocks, including the default                                         |         |
    +-----------------------------------------+-----------+---------------+---------------------+-----------------------------------------------------------------------+---------+

    Modify Table 6.55, increasing the following minimum values:

           MAX_COMBINED_TEXTURE_IMAGE_UNITS     96 (6*16), was 80
           MAX_UNIFORM_BUFFER_BINDINGS          72 (6*12), was 60
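    As a sketch, an application targeting only the guaranteed minimums in
    the limits table above could check a workgroup configuration as shown
    below.  The function and constant names are hypothetical.  A real
    application would query MAX_COMPUTE_WORK_GROUP_SIZE,
    MAX_COMPUTE_WORK_GROUP_INVOCATIONS, and MAX_COMPUTE_SHARED_MEMORY_SIZE at
    run time, since implementations may report larger values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Minimum maximums from the limits table; implementations may allow more. */
static const uint32_t kMinMaxWorkGroupSize[3] = {1024, 1024, 64};
static const uint32_t kMinMaxWorkGroupInvocations = 1024;
static const uint32_t kMinMaxSharedMemorySize = 32768; /* bytes */

/* Returns true if a compute shader with the given local size and total
 * <shared> storage fits within the guaranteed minimum limits. */
static bool fits_minimum_limits(uint32_t x, uint32_t y, uint32_t z,
                                uint32_t shared_bytes)
{
    if (x == 0 || y == 0 || z == 0)
        return false; /* each local size must be at least one */
    if (x > kMinMaxWorkGroupSize[0] || y > kMinMaxWorkGroupSize[1] ||
        z > kMinMaxWorkGroupSize[2])
        return false; /* per-dimension MAX_COMPUTE_WORK_GROUP_SIZE */
    if ((uint64_t)x * y * z > kMinMaxWorkGroupInvocations)
        return false; /* MAX_COMPUTE_WORK_GROUP_INVOCATIONS */
    return shared_bytes <= kMinMaxSharedMemorySize; /* MAX_COMPUTE_SHARED_MEMORY_SIZE */
}
```

    For instance, a 1-dimensional workgroup of 256 x 1 x 1 passes, while
    1024 x 1024 x 1 fails: each dimension is within its own limit, but the
    1048576 invocations exceed the minimum MAX_COMPUTE_WORK_GROUP_INVOCATIONS
    of 1024.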

Issues

    1) Should <shared> variables be usable only in compute shaders, or in other
       stages too?

       RESOLVED:  Support only in compute shaders.  While some hardware may be
       able to support shared variables in shader stages other than compute,
       it is difficult to clearly define what the semantics are as far as
       sharing. For example, what is the equivalent for a workgroup for
       vertex shaders?

    2) Can we expose atomics on <shared> variables?

       RESOLVED:  Yes.  The existing atomics in OpenGL 4.2 (via image
       variables) don't map well to the <shared> declaration.  Instead, we've
       defined new atomic functions that take a variable as a first input.
       These functions are specified in the ARB_shader_storage_buffer_object
       extension and are incorporated into this extension via the interaction
       described above.  We could have also chosen to define operators +=, &=,
       etc. to be atomic when applied to <shared> variables, but shaders may
       want to use such variables in cases where atomic access (and the
       related overhead) is not required.

    3) Should the local size and dimensions of the workgroup be specified at
       compile time? What are the default local dimensions?

       RESOLVED: Dimension is always 3 and a workgroup size declaration is
       compulsory at compile time. There is no default. The value used is
       queryable.  To use a 1- or 2-dimensional workgroup, the extra
       dimension(s) can be set to 1.

    4) Do we need the local_work_size parameter in dispatch if the local size
       may be specified at compile time in the shader?

       RESOLVED: The specification of the workgroup size is now mandatory in
       the shader source at compile time and the local_work_size may no longer
       be specified at dispatch time.

    5) How do multiple shaders attached to a single program object work?

       RESOLVED:  Just as with any other shader stage. Exactly one of the
       shaders must provide the 'main' entry point. All shaders attached to a
       program object effectively get compiled into a single, large program at
       link time.  The program is dispatched as one big entity. Über shader
       type functionality can be achieved through the use of subroutine
       uniforms, which also work exactly as for other shader stages.

    6) Should compute dispatch honor conditional rendering?

       RESOLVED: Yes, it does honor conditional rendering.

    7) Is it possible to pass compute programs to UseProgram, etc.?

       RESOLVED: Yes, compute programs can be made current via UseProgram and
       can be made current in a program pipeline object via UseProgramStages.
       Note that a compute program must be linked with PROGRAM_SEPARABLE set
       to TRUE to be passed to UseProgramStages, even though the compute
       pipeline has only a single shader stage.

       The active compute program that will be used by DispatchCompute will be
       determined in the same manner as the active program for any other
       program stage:

         * If there is a current program specified via UseProgram, that
           program is considered current for all stages, including compute.

         * Otherwise, if there is a current program pipeline object, the
           program current for the compute stage of the pipeline object is
           considered current for the compute stage.

         * If neither of the former apply, no program is current for the
           compute stage.

       The program that is current for the compute stage is considered to be
       active if and only if it has a compute shader executable.  For example,
       if a non-compute program is made current via UseProgram, it will also
       be considered "current" for the compute stage, but won't be considered
       active.
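       The selection rules above can be summarized with a small C model.  The
       structures are hypothetical simplifications of program and program
       pipeline objects, not GL API types.  A NULL result from
       active_compute_program() corresponds to the case in which
       DispatchCompute generates INVALID_OPERATION.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified models of program and pipeline objects. */
struct program {
    bool has_compute_executable;
};

struct pipeline {
    const struct program *compute_stage; /* set via UseProgramStages(COMPUTE_SHADER_BIT) */
};

/* Program current for the compute stage, following the three rules above. */
static const struct program *current_compute_program(const struct program *use_program,
                                                     const struct pipeline *bound_pipeline)
{
    if (use_program != NULL)
        return use_program;                   /* UseProgram takes precedence */
    if (bound_pipeline != NULL)
        return bound_pipeline->compute_stage; /* may itself be NULL */
    return NULL;                              /* no program is current */
}

/* The current program is active only if it has a compute executable. */
static const struct program *active_compute_program(const struct program *use_program,
                                                    const struct pipeline *bound_pipeline)
{
    const struct program *p = current_compute_program(use_program, bound_pipeline);
    return (p != NULL && p->has_compute_executable) ? p : NULL;
}
```

       As in the example in the text, a graphics-only program made current
       with UseProgram is current for the compute stage but not active, and
       it hides any compute program in the bound pipeline.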

       When using program pipeline objects, it's possible to switch between
       graphics and compute work without switching programs.  For example, in:

         glBindProgramPipeline(pipeline);
         glUseProgramStages(pipeline, GL_VERTEX_SHADER_BIT, programA);
         glUseProgramStages(pipeline, GL_FRAGMENT_SHADER_BIT, programB);
         glUseProgramStages(pipeline, GL_COMPUTE_SHADER_BIT, programC);
         glDrawArrays(GL_TRIANGLES, 0, 900);
         glDispatchCompute(5, 5, 5);

       the triangles will be processed by programA and programB, while the
       compute dispatch will be processed by programC.  Similarly,

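         // all graphics stages; ~GL_COMPUTE_SHADER_BIT would set undefined bits
         glUseProgramStages(pipeline, GL_VERTEX_SHADER_BIT |
                            GL_TESS_CONTROL_SHADER_BIT | GL_TESS_EVALUATION_SHADER_BIT |
                            GL_GEOMETRY_SHADER_BIT | GL_FRAGMENT_SHADER_BIT, programAB);
         glUseProgramStages(pipeline, GL_COMPUTE_SHADER_BIT, programC);
         glDrawArrays(GL_TRIANGLES, 0, 900);
         glDispatchCompute(5, 5, 5);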

       will have the triangles processed by the multi-stage programAB.

    8) What happens if you try to dispatch compute work with no active compute
       program?

       RESOLVED:  An INVALID_OPERATION error is generated if there is no
       active program for the compute shader stage.

    9) Should we increase minimums on certain replicated state bindings
       (texture image units, uniform buffer bindings) to reflect the addition
       of a sixth shader stage?

       RESOLVED:  Yes, for MAX_COMBINED_TEXTURE_IMAGE_UNITS and
       MAX_UNIFORM_BUFFER_BINDINGS.  These limits permit applications to
       statically partition the shared set of texture bindings into six
       separate sets, one per shader stage.

       The limit MAX_COMBINED_UNIFORM_BLOCKS is not increased, because it
       reflects the sum of the number of uniform blocks used in each stage of
       a single program.  Since no single program can have more than five
       stages, this limit doesn't need to be increased.
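       The static partition mentioned above can be expressed as simple
       offset arithmetic.  The stage ordering used below (vertex,
       tessellation control, tessellation evaluation, geometry, fragment,
       compute) and the function names are assumptions for illustration; any
       fixed order works.

```c
#include <assert.h>

#define NUM_SHADER_STAGES 6u         /* five graphics stages plus compute */
#define TEXTURE_UNITS_PER_STAGE 16u  /* minimum per-stage texture image units */
#define UNIFORM_BLOCKS_PER_STAGE 12u /* minimum per-stage uniform blocks */

/* Hypothetical stage order: 0 = vertex, 1 = tess control, 2 = tess evaluation,
 * 3 = geometry, 4 = fragment, 5 = compute. */
static unsigned first_texture_unit(unsigned stage)
{
    assert(stage < NUM_SHADER_STAGES);
    return stage * TEXTURE_UNITS_PER_STAGE;
}

static unsigned first_uniform_buffer_binding(unsigned stage)
{
    assert(stage < NUM_SHADER_STAGES);
    return stage * UNIFORM_BLOCKS_PER_STAGE;
}
```

       The last stage's block ends at 5 * 16 + 16 = 96 texture image units
       and 5 * 12 + 12 = 72 uniform buffer bindings, matching the new minimums.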
1076       separate sets, one per shader stage.
1077
1078       The limit MAX_COMBINED_UNIFORM_BLOCKS is not increased, because it
1079       reflects the sum of the number of uniform blocks used in each stage of
1080       a single program.  Since no single program can have more than five
1081       stages, these limits don't need to be increased.
1082
1083    10) How do the shader built-in variables relate to DirectCompute's
1084       built-in system values (SV_*)?
1085
1086        OpenGL Compute             DirectCompute
1087        --------------------------------------------------
1088        gl_NumWorkGroups           --
1089        gl_WorkGroupSize           --
1090        gl_WorkGroupID             SV_GroupID
1091        gl_LocalInvocationID       SV_GroupThreadID
1092        gl_GlobalInvocationID      SV_DispatchThreadID
1093        gl_LocalInvocationIndex    SV_GroupIndex

    11) How does "program validation" (checking the active programs against
        the current state) apply to DispatchCompute?

      RESOLVED:  The same program validation logic will be applied to both
      graphics primitives (e.g., DrawArrays) and compute dispatches.
      Conditions that will cause validation errors for graphics primitives
      will also cause validation errors for compute dispatches, even if the
      conditions wouldn't otherwise affect compute, for example:

        * Mis-configured program pipeline objects (e.g., inserting a geometry
          program A between the linked vertex and fragment shaders of
          program B).

        * A graphics program has a vertex shader that uses a 2D texture from
          texture image unit 0 and a fragment shader that uses a 3D texture
          from texture image unit 0.

      Similarly, validation errors specific to the compute shader executable
      (e.g., using different targets on a single texture image unit in a
      compute program) will generate validation errors for graphics Draw*
      calls.

      We chose to specify this behavior for two reasons.  First, using the
      same logic in both places ensures a single result for ValidateProgram
      and ValidateProgramPipeline (a single VALIDATE_STATUS value wouldn't be
      good enough if the result could be different for compute and graphics).
      Second, a single test allows implementations to set up state and
      perform validation tests for compute and graphics operations at the same
      time, without requiring additional irregular graphics- or
      compute-specific logic.

    12) We specify an INVALID_OPERATION error for DispatchCompute when there
        is no active program on the compute stage.  Should we specify similar
        errors for Draw* calls if the current program specified by UseProgram
        is a compute program?

      RESOLVED:  Not in the current spec.  If a compute program is made
      current with UseProgram, there is no active program for the vertex or
      fragment stage.  In this case, the results of vertex and fragment
      processing are undefined, but no error is generated.  This behavior is
      already specified in unextended OpenGL 4.2.

      We don't generate errors in this case for two reasons:

        * For the compatibility profile, fixed-function vertex and fragment
          processing is available, and INVALID_OPERATION wouldn't make sense
          there.

        * Even in the core profile, there are cases where no active fragment
          shader is needed (e.g., primitives with RASTERIZER_DISCARD enabled).

      Although drawing with only a compute program current is never useful in
      the core profile, we chose to keep the same undefined behavior that is
      already in place.

    13) Should we provide any additional support extending the memoryBarrier()
        GLSL built-in function provided by ARB_shader_image_load_store and
        GLSL 4.20?

      RESOLVED:  Yes.  The memoryBarrier() function provided by GLSL 4.20
      requires the following:

        (a) synchronizing all memory transactions that might be visible to
            other shader invocations;

        (b) ordering memory transactions so that no other shader invocation
            ever sees stores issued after the barrier before it sees stores
            issued before the barrier.

      Hardware implementations of GLSL 4.20 may have a high degree of
      parallelism.  The memory subsystem that services shader loads and
      stores may have multiple independent sub-units, and the shader
      invocations themselves may run in parallel on many shader cores.  As a
      result, memoryBarrier() may be fairly heavyweight, requiring
      synchronization with all memory sub-units and shader cores.

      We provide two kinds of new functions that might serve as lighter-weight
      alternatives to memoryBarrier().  First, we provide four new functions

        void memoryBarrierAtomicCounter();
        void memoryBarrierBuffer();
        void memoryBarrierImage();
        void memoryBarrierShared();

      that each order transactions of only a specific memory type, and so
      might require synchronization with fewer sub-units of the memory
      subsystem.  Second, we provide a new function

        void groupMemoryBarrier();

      that orders transactions only as viewed by other shader invocations in
      the same workgroup, which might not require synchronization with other
      shader cores.  Since shared memory is only accessible to invocations
      within a single workgroup, memoryBarrierShared() also only requires
      synchronization with other invocations in the same workgroup.

Revision History

    Rev.    Date    Author    Changes
    ----  --------  --------- -----------------------------------------
    28    12/10/18  Jon Leech Use 'workgroup' consistently throughout (Bug
                              11723, internal API issue 87).
    27    07/24/14  Jon Leech Change value of GLSL limit
                              gl_MaxComputeUniformComponents to 512 for
                              consistency with the API (Bug 12370).
    26    01/30/14  Jon Leech Add table 6.31 COMPUTE_SHADER entry for
                              program pipeline objects (Bug 11539).
    25    10/23/12  pbrown    Remove the restriction forbidding the use of
                              barrier() inside potentially divergent flow
                              control.  Instead, we will allow barrier() to
                              be executed anywhere, but specify undefined
                              results (including hangs or program termination)
                              if the flow control is divergent (bug 9367).
    24    07/01/12  Jon Leech Fix typo (bug 8984).
    23    06/28/12  johnk     Remove two other references to "thread", add
                              "Only available in compute shaders" to the table
                              for memoryBarrierShared() and groupMemoryBarrier(),
                              and fix a typo.
    22    06/22/12  pbrown    Add a new built-in memoryBarrierBuffer() as an
                              interaction with ARB_shader_storage_buffer_object.
                              Add a new built-in groupMemoryBarrier() that
                              orders memory transactions only as observed by
                              other shader invocations in the same work group.
                              Enhance the description of the GLSL memory
                              barrier functions.  Add issue 13 about the new
                              memory barrier functions added in this extension
                              (bug 9199).  Mark issues 11 and 12 as resolved.
                              Add NV_vertex_buffer_unified_memory interaction
                              allowing DispatchComputeIndirect to read its
                              arguments from any resident buffer object
                              instead of the single bound indirect dispatch
                              buffer.
    21    06/21/12  gsellers  Clarify that there are no built-in inputs or
                              outputs in compute shaders (bug 9200).
    20    06/21/12  gsellers  Throw INVALID_OPERATION if querying
                              COMPUTE_WORK_GROUP_SIZE from unlinked program or
                              program with no compute shader (bug 9117).
    19    06/18/12  pbrown    DispatchComputeIndirect throws INVALID_VALUE
                              if <indirect> is negative or misaligned (bug
                              9181).
    18    06/17/12  pbrown    Clarify that compute-only programs can be used
                              by both UseProgram and UseProgramStages, and add
                              a COMPUTE_SHADER_BIT for UseProgramStages (bug
                              9155).  Specify that validation errors checking
                              programs against each other and the GL state
                              apply equally to graphics primitives (Draw*) and
                              compute dispatches.  Update issue 7; add new
                              issues 11 and 12.  Clarify that compute shader
                              invocations in a workgroup are run "potentially
                              in parallel", but not "in lockstep" (bug 9151).
                              Other minor wording improvements.
    17    06/15/12  johnk     Don't allow location layout qualifiers for
                              compute shader inputs.
    16    06/15/12  johnk     In the intro material, allow work groups to
                              only potentially execute in parallel, and use
                              control barriers to synchronize.  Other minor
                              fixes.
    15    06/15/12  dgkoch    Added Additions to Ch.2 of Shading Language.
                              Renamed shader built-in variables, explained
                              them better, made them uvec3 instead of int[3].
                              Added derived shading language variables.
                              Renamed and changed built-in constants for
                              consistency with the variables. Removed
                              gl_MaxComputeWorkDimensions since it is no
                              longer necessary. Renamed API constants to
                              be consistent with shading language terminology.
                              Remove a few rogue references to variable
                              number of dispatch arguments. Added Issue 10.
                              (bugs 9151, 9167)
    14    06/14/12  pbrown    Modify DispatchComputeIndirect to accept an
                              "intptr"-typed offset instead of a "void *",
                              since it doesn't accept pointers to client memory.
                              Modify DispatchComputeIndirect to use a new
                              buffer binding (DISPATCH_INDIRECT_BUFFER)
                              instead of sharing the binding used by
                              Draw*Indirect.  Add missing entries in the "New
                              Tokens" section and assign values.  Update
                              documentation of COMMAND_BARRIER_BIT to reflect
                              the new dispatch indirect binding.  Document
                              DispatchComputeIndirect errors for offsets that
                              are negative, misaligned, or run off the end of
                              the bound buffer.  Increase minimums for
                              combined texture image units and uniform buffer
                              bindings to reflect the new stage.  Update
                              various issues, add new issue 9 (bug 9130).
    13    06/14/12  Jon Leech Copy description of MAX_COMPUTE_SHARED_MEMORY_SIZE
                              into API spec from GLSL spec (bug 9069).
    12    05/14/12  pbrown    Add interaction with
                              ARB_shader_storage_buffer_object.  The built-in
                              functions provided there for atomic memory
                              operations on buffer variables are also
                              supported for the shared variables provided
                              here.  The functions themselves are documented
                              fully in the other specification.
    11    05/14/12  johnk     Keep the previous logical contents of the last
                              paragraph of the memory shader control functions.
    10    04/26/12  gsellers  Count max compute shared variable size in bytes.
                              Make shared variables implicitly coherent.
                              Add MAX_COMPUTE_UNIFORM_COMPONENTS.
                              Clean up MAX_COMPUTE_IMAGE_UNIFORMS.
     9    04/25/12  gsellers  Add UNIFORM_BLOCK_REFERENCED_BY_COMPUTE_SHADER and
                              ATOMIC_COUNTER_BUFFER_REFERENCED_BY_COMPUTE_SHADER.
                              Remove <program> from dispatch APIs.  Add
                              memoryBarrier{Image,Shared,AtomicCounter}().
     8    04/05/12  gsellers  Remove ARB suffixes.
     7    02/02/12  gsellers  Require OpenGL 4.2.
                              Add issue 8.
                              Increase various minimums.
                              Remove variable dimensionality.
     6    01/24/12  gsellers  Require OpenGL 3.0.
                              Incorporate feedback from bmerry.
                              Add compute shader constants to sec. 7.7.
                              Add modifications to sec. 8.15 of the GLSL spec.
                              Add issue 7.
     5    01/20/12  gsellers  Make compute dispatch honor conditional
                              rendering.  Add indirect dispatch.
                              Change 'global work size' to 'num work groups',
                              make global size in multiples of work group size.
     4    01/10/12  gsellers  Fix typos and other small corrections.
                              Make specification of work group size at compile
                              time compulsory.
                              Add COMPUTE_WORK_DIMENSION_ARB and
                              COMPUTE_LOCAL_WORK_SIZE_ARB queries.
                              Add issue (5), resolve issues (3) and (4).
     3    01/09/12  gsellers  Change from AMD to ARB.
                              Update to be relative to OpenGL 4.2 (+GLSL 4.20).
                              Add <shared> variables.
                              Add issues (1) - (4).
                              Add link failure for programs that contain
                              compute and non-compute shaders.
     2    06/10/11  gsellers  Add error behavior.
                              Shading language changes.
                              Add global_offset parameter.
                              Add implementation-dependent limits.
     1    09/24/10  gsellers  Initial revision
