Name

    ARB_compute_shader

Name Strings

    GL_ARB_compute_shader

Contact

    Graham Sellers, AMD (graham.sellers 'at' amd.com)

Contributors

    Pat Brown, NVIDIA
    Daniel Koch, TransGaming
    John Kessenich
    Members of the ARB working group

Notice

    Copyright (c) 2012-2014 The Khronos Group Inc. Copyright terms at
        http://www.khronos.org/registry/speccopyright.html

Specification Update Policy

    Khronos-approved extension specifications are updated in response to
    issues and bugs prioritized by the Khronos OpenGL Working Group. For
    extensions which have been promoted to a core Specification, fixes will
    first appear in the latest version of that core Specification, and will
    eventually be backported to the extension document. This policy is
    described in more detail at
        https://www.khronos.org/registry/OpenGL/docs/update_policy.php

Status

    Complete.
    Approved by the ARB on 2012/06/12.

Version

    Last Modified Date: December 10, 2018
    Revision: 28

Number

    ARB Extension #122

Dependencies

    OpenGL 4.2 is required.

    This extension is written based on the wording of the OpenGL 4.2 (Core
    Profile) specification, and on the wording of the OpenGL Shading Language
    (GLSL) Specification, version 4.20.

    This extension interacts with OpenGL 4.3 and
    ARB_shader_storage_buffer_object.

    This extension interacts with NV_vertex_buffer_unified_memory.

Overview

    Recent graphics hardware has become extremely powerful and a strong desire
    to harness this power for work (both graphics and non-graphics) that does
    not fit the traditional graphics pipeline well has emerged. To address
    this, this extension adds a new single-stage program type known as a
    compute program. This program may contain one or more compute shaders
    which may be launched in a manner that is essentially stateless.
    This allows arbitrary workloads to be sent to the graphics hardware with
    minimal disturbance to the GL state machine.

    In most respects, a compute program is identical to a traditional OpenGL
    program object, with similar status, uniforms, and other such properties.
    It has access to many of the same resources as fragment and other shader
    types, such as textures, image variables, atomic counters, and so on.
    However, it has no predefined inputs nor any fixed-function outputs. It
    cannot be part of a pipeline and its visible side effects are through its
    actions on images and atomic counters.

    OpenCL is another solution for using graphics processors as generalized
    compute devices. This extension addresses a different need. For example,
    OpenCL is designed to be usable on a wide range of devices ranging from
    CPUs, GPUs, and DSPs through to FPGAs. While one could implement GL on these
    types of devices, the target here is clearly GPUs. Another difference is
    that OpenCL is more full featured and includes features such as multiple
    devices, asynchronous queues and strict IEEE semantics for floating point
    operations. This extension follows the semantics of OpenGL - implicitly
    synchronous, in-order operation with single-device, single-queue
    logical architecture and somewhat more relaxed numerical precision
    requirements. Although not as feature rich, this extension offers several
    advantages for applications that can tolerate the omission of these
    features. Compute shaders are written in GLSL, for example, and so code may
    be shared between compute and other shader types. Objects are created and
    owned by the same context as the rest of the GL, and therefore no
    interoperability API is required and objects may be freely used by both
    compute and graphics simultaneously without acquire-release semantics or
    object type translation.

New Procedures and Functions

    void DispatchCompute(uint num_groups_x,
                         uint num_groups_y,
                         uint num_groups_z);

    void DispatchComputeIndirect(intptr indirect);

New Tokens

    Accepted by the <type> parameter of CreateShader and returned in the
    <params> parameter by GetShaderiv:

        COMPUTE_SHADER    0x91B9

    Accepted by the <pname> parameter of GetIntegerv, GetBooleanv, GetFloatv,
    GetDoublev and GetInteger64v:

        MAX_COMPUTE_UNIFORM_BLOCKS                 0x91BB
        MAX_COMPUTE_TEXTURE_IMAGE_UNITS            0x91BC
        MAX_COMPUTE_IMAGE_UNIFORMS                 0x91BD
        MAX_COMPUTE_SHARED_MEMORY_SIZE             0x8262
        MAX_COMPUTE_UNIFORM_COMPONENTS             0x8263
        MAX_COMPUTE_ATOMIC_COUNTER_BUFFERS         0x8264
        MAX_COMPUTE_ATOMIC_COUNTERS                0x8265
        MAX_COMBINED_COMPUTE_UNIFORM_COMPONENTS    0x8266
        MAX_COMPUTE_WORK_GROUP_INVOCATIONS         0x90EB

    Accepted by the <pname> parameter of GetIntegeri_v, GetBooleani_v,
    GetFloati_v, GetDoublei_v and GetInteger64i_v:

        MAX_COMPUTE_WORK_GROUP_COUNT    0x91BE
        MAX_COMPUTE_WORK_GROUP_SIZE     0x91BF

    Accepted by the <pname> parameter of GetProgramiv:

        COMPUTE_WORK_GROUP_SIZE    0x8267

    Accepted by the <pname> parameter of GetActiveUniformBlockiv:

        UNIFORM_BLOCK_REFERENCED_BY_COMPUTE_SHADER    0x90EC

    Accepted by the <pname> parameter of GetActiveAtomicCounterBufferiv:

        ATOMIC_COUNTER_BUFFER_REFERENCED_BY_COMPUTE_SHADER    0x90ED

    Accepted by the <target> parameters of BindBuffer, BufferData,
    BufferSubData, MapBuffer, UnmapBuffer, GetBufferSubData, and
    GetBufferPointerv:

        DISPATCH_INDIRECT_BUFFER    0x90EE

    Accepted by the <value> parameter of GetIntegerv, GetBooleanv,
    GetInteger64v, GetFloatv, and GetDoublev:

        DISPATCH_INDIRECT_BUFFER_BINDING    0x90EF

    Accepted by the <stages> parameter of UseProgramStages:

        COMPUTE_SHADER_BIT    0x00000020

Additions to Chapter 2 of the OpenGL 4.2 (Core Profile) Specification
(OpenGL Operation)

    In section 2.9.1, "Creating and Binding Buffer Objects", add to table 2.8
    (p.43):

                                                            Described
      Target name               Purpose                     in section(s)
      ------------------------  --------------------------  -------------
      DISPATCH_INDIRECT_BUFFER  Indirect compute dispatch   5.5
                                commands

    Add to the end of section 2.9.8, "Indirect Commands In Buffer Objects"
    (p. 53):

      Arguments to the DispatchComputeIndirect command are stored in buffer
      objects as a group of three unsigned integers.

      A buffer object is bound to DISPATCH_INDIRECT_BUFFER by calling BindBuffer
      with target set to DISPATCH_INDIRECT_BUFFER, and buffer set to the name of
      the buffer object. If no corresponding buffer object exists, one is
      initialized as defined in section 2.9.

      DispatchComputeIndirect sources its arguments from the buffer object whose
      name is bound to DISPATCH_INDIRECT_BUFFER, using the <indirect> parameter as
      an offset into the buffer object in the same fashion as described in
      section 2.9.6. An INVALID_OPERATION error is generated if this command
      sources data beyond the end of the buffer object, if zero is bound to
      DISPATCH_INDIRECT_BUFFER, or if <indirect> is less than zero or not a
      multiple of the size, in basic machine units, of uint.

    In section 2.11, "Vertex Shaders", modify the introductory text on shaders
    to include compute shaders (second paragraph, p. 56):

      In addition to vertex shaders, tessellation control..., geometry shaders,
      fragment shaders, and compute shaders can be created, compiled, and linked
      into program objects. .... (section 3.10). Compute shaders perform
      general computations for dispatched arrays of shader invocations (section
      5.5), but do not operate on primitives processed by the other shader
      types. ...

    In section 2.11.3, "Program Objects", add to the reasons that LinkProgram
    may fail, p.
    61:

      * The program object contains objects to form a compute shader (see
        section 5.5) and objects to form any other type of shader.

    In section 2.11.3, modify the description of active programs (last
    paragraph, p. 61, first paragraph, p. 62):

      ... geometry shader stages, those stages are ignored. If there is no
      active program for the compute shader stage, compute dispatches will
      generate an error. The active program for the compute shader stage has no
      effect on the processing of vertices, geometric primitives, and fragments,
      and the active program for all other shader stages has no effect on
      compute dispatches.

    In section 2.11.4, "Program Pipeline Objects", modify the description of
    UseProgramStages, p. 65:

      The executables in a program object... becomes current. These stages may
      include vertex, tessellation control, tessellation evaluation, geometry,
      fragment, or compute, indicated by VERTEX_SHADER_BIT,
      TESS_CONTROL_SHADER_BIT, TESS_EVALUATION_SHADER_BIT, GEOMETRY_SHADER_BIT,
      FRAGMENT_SHADER_BIT, or COMPUTE_SHADER_BIT, respectively. ...

    In the unnumbered "Validation" section of section 2.11.12 "Shader
    Execution", modify the list of validation errors, pp. 112-113:

      This error is generated by any command that transfers vertices to the GL
      or launches compute work if:

      * (last bullet, p. 112) One program object is active... first program
        object was active. The active compute shader is ignored for the
        purposes of this test.

      * (2nd bullet, p. 113) There is no current program specified by
        UseProgram, there is a current program pipeline object, and the
        current program for any shader stage has been relinked since...

      * (3rd bullet, p. 113) Any two active samplers in the set of active
        program objects are of different types but refer to the same texture
        image unit.

      * (4th bullet, p.
        113) The sum of the number of active samplers for each
        active program exceeds the maximum number of texture image units
        allowed.

    Modify the paragraph describing ValidateProgram, p. 113:

      ... If validation succeeded, ... set to FALSE. If validation succeeded,
      no INVALID_OPERATION validation error will be generated if <program> were
      made current via UseProgram, given the current state. If validation
      failed, such errors will be generated under the current state.

    Modify the paragraph describing ValidateProgramPipeline, p. 114:

      ... can be queried with GetProgramPipelineiv (see section 6.1.12). If
      validation succeeded, no INVALID_OPERATION validation error will be
      generated if <pipeline> were bound and no program were made current via
      UseProgram, given the current state. If validation failed, such errors
      will be generated under the current state.

    In subsection 2.11.12, "Shader Execution":

    Add to the list of implementation dependent constants under the
    "Texture Access" sub-heading:

      MAX_COMPUTE_TEXTURE_IMAGE_UNITS (for compute shaders),

    Add to the list of implementation dependent constants under the "Atomic
    Counter Access" sub-heading:

      MAX_COMPUTE_ATOMIC_COUNTERS (for compute shaders),

    Add to the list of implementation dependent constants under the "Image
    Access" sub-heading:

      MAX_COMPUTE_IMAGE_UNIFORMS (for compute shaders),

    In section 2.16, "Conditional Rendering", modify the sentence describing
    conditional rendering, starting with "In this case"...

      In this case, all drawing commands (see section 2.8.3), as well as
      Clear and ClearBuffer* (see section 4.2.3), and compute dispatch
      through DispatchCompute* (see section 5.5), have no effect.

    In the "Shared Memory Access Synchronization" subsection of section
    2.11.13, "Shader Memory Access", modify the description of
    COMMAND_BARRIER_BIT (p.
    118):

      * COMMAND_BARRIER_BIT: Command data sourced from buffer objects by
        Draw*Indirect and DispatchComputeIndirect commands ... The buffer
        objects affected by this bit are derived from the DRAW_INDIRECT_BUFFER
        and DISPATCH_INDIRECT_BUFFER bindings.

    In subsection 2.17.7, "Uniform Variables", replace the paragraph beginning
    "If <pname> is UNIFORM_BLOCK_REFERENCED_BY_VERTEX_SHADER,"... with:

      If <pname> is UNIFORM_BLOCK_REFERENCED_BY_VERTEX_SHADER,
      UNIFORM_BLOCK_REFERENCED_BY_TESS_CONTROL_SHADER,
      UNIFORM_BLOCK_REFERENCED_BY_TESS_EVALUATION_SHADER,
      UNIFORM_BLOCK_REFERENCED_BY_GEOMETRY_SHADER,
      UNIFORM_BLOCK_REFERENCED_BY_FRAGMENT_SHADER or
      UNIFORM_BLOCK_REFERENCED_BY_COMPUTE_SHADER, then a boolean value indicating
      whether the uniform block identified by uniformBlockIndex is referenced
      by the vertex, tessellation control, tessellation evaluation, geometry,
      fragment or compute programming stages of <program>, respectively, is
      returned.

    Also in subsection 2.17.7, "Uniform Variables", replace the paragraph
    beginning, "If <pname> is ATOMIC_COUNTER_BUFFER_REFERENCED_BY_VERTEX_SHADER"
    on p.80 with:

      If <pname> is ATOMIC_COUNTER_BUFFER_REFERENCED_BY_VERTEX_SHADER,
      ATOMIC_COUNTER_BUFFER_REFERENCED_BY_TESS_CONTROL_SHADER,
      ATOMIC_COUNTER_BUFFER_REFERENCED_BY_TESS_EVALUATION_SHADER,
      ATOMIC_COUNTER_BUFFER_REFERENCED_BY_GEOMETRY_SHADER,
      ATOMIC_COUNTER_BUFFER_REFERENCED_BY_FRAGMENT_SHADER or
      ATOMIC_COUNTER_BUFFER_REFERENCED_BY_COMPUTE_SHADER, then a single boolean
      value indicating whether the atomic counter buffer identified by
      bufferIndex is referenced by the vertex, tessellation control, tessellation
      evaluation, geometry, fragment or compute programming stages of
      <program>, respectively, is returned.

    Under the sub-heading "Uniform Blocks" in subsection 2.11.17, replace the
    sentence beginning "The limits for vertex, tessellation ..."
    on p.92 with:

      The limits for vertex, tessellation, geometry, fragment and compute
      shaders can be obtained by calling GetIntegerv with <pname> set to
      MAX_VERTEX_UNIFORM_BLOCKS, MAX_TESS_CONTROL_UNIFORM_BLOCKS,
      MAX_TESS_EVALUATION_UNIFORM_BLOCKS, MAX_GEOMETRY_UNIFORM_BLOCKS,
      MAX_FRAGMENT_UNIFORM_BLOCKS and MAX_COMPUTE_UNIFORM_BLOCKS, respectively.

    Under the sub-heading "Atomic Counter Buffers" in subsection 2.11.17,
    replace the sentence beginning "The limits for vertex, geometry, ..."
    on p.96 with:

      The limits for vertex, tessellation, geometry, fragment and compute
      shaders can be obtained by calling GetIntegerv with <pname> set to
      MAX_VERTEX_ATOMIC_COUNTER_BUFFERS, MAX_TESS_CONTROL_ATOMIC_COUNTER_BUFFERS,
      MAX_TESS_EVALUATION_ATOMIC_COUNTER_BUFFERS,
      MAX_GEOMETRY_ATOMIC_COUNTER_BUFFERS, MAX_FRAGMENT_ATOMIC_COUNTER_BUFFERS and
      MAX_COMPUTE_ATOMIC_COUNTER_BUFFERS, respectively.

Additions to Chapter 3 of the OpenGL 4.2 (Core Profile) Specification
(Rasterization)

    None.

Additions to Chapter 4 of the OpenGL 4.2 (Core Profile) Specification
(Per-Fragment Operations and the Framebuffer)

    None.

Additions to Chapter 5 of the OpenGL 4.2 (Core Profile) Specification
(Special Functions)

    Add Section 5.5, "Compute Shaders"

    In addition to graphics-oriented shading operations such as vertex,
    tessellation, geometry and fragment shading, generic computation may be
    performed by the GL through the use of compute shaders. The compute pipeline
    is a form of single-stage machine that runs generic shaders. Compute shaders
    are created as described in section 2.11.1 using a <type> parameter of
    COMPUTE_SHADER. They are attached to and used in program objects as
    described in section 2.11.3.

    Compute workloads are formed from groups of work items called
    _workgroups_ and processed by the executable code for a compute program.
    A workgroup is a collection of shader invocations that execute the same code,
    potentially in parallel. An invocation within a workgroup may share data
    with other members of the same workgroup through shared variables and
    issue memory and control barriers to synchronize with other members of the
    same workgroup. One or more workgroups are launched by calling:

        void DispatchCompute(uint num_groups_x,
                             uint num_groups_y,
                             uint num_groups_z);

    Each workgroup is processed by the active program object for the
    compute shader stage. The error INVALID_OPERATION will be generated if
    there is no active program object for the compute shader stage. The
    active program for the compute shader stage will be determined in the same
    manner as the active program for other pipeline stages, as described in
    section 2.11.3. While the individual shader invocations within a
    workgroup are executed as a unit, workgroups are executed completely
    independently and in unspecified order.

    <num_groups_x>, <num_groups_y> and <num_groups_z> specify the number of
    workgroups that will be dispatched in the X, Y and Z dimensions,
    respectively. The builtin vector variable gl_NumWorkGroups will be
    initialized with the contents of the <num_groups_x>, <num_groups_y> and
    <num_groups_z> parameters. The maximum number of workgroups that may be
    dispatched at one time may be determined by calling GetIntegeri_v with
    <pname> set to MAX_COMPUTE_WORK_GROUP_COUNT and <index> set to zero, one,
    or two, representing the X, Y, and Z dimensions, respectively. The
    values of <num_groups_x>, <num_groups_y> and <num_groups_z> must each
    be less than or equal to the maximum workgroup count for the corresponding
    dimension, otherwise an INVALID_VALUE error is generated. If the workgroup
    count in any dimension is zero, no workgroups are dispatched.
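
    As an illustrative sketch only (not part of this extension), an
    application might derive the counts passed to DispatchCompute from a
    problem size by ceiling division, and compare them against the
    per-dimension limits described above before dispatching. The helper
    names group_count and dispatch_within_limits are hypothetical, and the
    limits are passed in as a plain array standing in for values that would
    be queried with GetIntegeri_v and MAX_COMPUTE_WORK_GROUP_COUNT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not a GL entry point): number of workgroups needed
 * to cover `items` work items when each workgroup holds `local_size`
 * invocations along one dimension. Rounds up so no item is left out. */
static uint32_t group_count(uint32_t items, uint32_t local_size)
{
    assert(local_size > 0);
    return items / local_size + (items % local_size != 0 ? 1u : 0u);
}

/* Hypothetical helper: true when every requested count is within the
 * per-dimension limit. `max_count` stands in for the three values an
 * application would read with GetIntegeri_v and
 * MAX_COMPUTE_WORK_GROUP_COUNT (index 0, 1, 2). */
static bool dispatch_within_limits(const uint32_t groups[3],
                                   const uint32_t max_count[3])
{
    for (int i = 0; i < 3; ++i) {
        if (groups[i] > max_count[i])
            return false;   /* DispatchCompute would raise INVALID_VALUE */
    }
    return true;
}
```

    With a workgroup size of 64 in X, group_count(1000, 64) yields 16
    workgroups. The last workgroup is then only partially used, so a shader
    written this way would typically compare its global invocation index
    against the item count before touching memory.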

    The workgroup size in each dimension is specified at compile time
    using an input layout qualifier in one or more of the compute shaders
    attached to the program (see Section 4 of the OpenGL Shading Language
    Specification). After the program has been linked, the workgroup size
    of the program may be retrieved by calling GetProgramiv with <pname> set to
    COMPUTE_WORK_GROUP_SIZE. This will return an array of three integers
    containing the workgroup size of the compute program as specified by
    its input layout qualifier(s). If <program> is the name of a program that
    has not been successfully linked, or is the name of a linked program object
    that contains no compute shaders, then an INVALID_OPERATION error is
    generated.

    The maximum size of a workgroup may be determined by calling
    GetIntegeri_v with <pname> set to MAX_COMPUTE_WORK_GROUP_SIZE
    and <index> set to 0, 1, or 2 to retrieve the maximum work size in the
    X, Y and Z dimension, respectively. Furthermore, the maximum number of
    invocations in a single workgroup (i.e., the product of the three
    dimensions) may be determined by calling GetIntegerv with <pname> set to
    MAX_COMPUTE_WORK_GROUP_INVOCATIONS.

    The command

        void DispatchComputeIndirect(intptr indirect);

    is equivalent (assuming no errors are generated) to calling
    DispatchCompute with <num_groups_x>, <num_groups_y> and <num_groups_z>
    initialized with the three uint values contained in the buffer currently
    bound to the DISPATCH_INDIRECT_BUFFER binding at an offset, in basic
    machine units, specified by <indirect>. The error INVALID_VALUE is
    generated if <indirect> is less than zero or is not a multiple of four.
    The error INVALID_OPERATION is generated if no buffer is bound to
    DISPATCH_INDIRECT_BUFFER, if the command would source data beyond the end
    of the buffer object, or if there is no active program for the compute
    shader stage.
    If any of <num_groups_x>, <num_groups_y> or <num_groups_z>
    is greater than MAX_COMPUTE_WORK_GROUP_COUNT for the corresponding
    dimension then the results are undefined.

    Add Subsection 5.5.1, "Compute Shader Variables"

    Compute shaders can access variables belonging to the current program
    object. The amount of storage in the default uniform block accessed by a
    compute shader is specified by the value of the implementation dependent
    constant MAX_COMPUTE_UNIFORM_COMPONENTS. The total amount of
    combined storage available for uniform variables in all uniform blocks
    accessed by a compute shader (including the default uniform block) is
    specified by the implementation dependent constant
    MAX_COMBINED_COMPUTE_UNIFORM_COMPONENTS.

    There is a limit to the total size of all variables declared as
    <shared> in a single program object. This limit, expressed in
    basic machine units, may be queried as the value of
    MAX_COMPUTE_SHARED_MEMORY_SIZE.

Additions to Chapter 6 of the OpenGL 4.2 (Core Profile) Specification
(State and State Requests)

    None.

Additions to Chapter 2 of the OpenGL Shading Language Specification, Version
4.20 (Overview of OpenGL Shading)

    Replace the last sentence of the first paragraph of the overview with
    the following:

      "Currently, these processors are the vertex, tessellation control,
      tessellation evaluation, geometry, fragment, and compute processors."

    Replace the last sentence of the second paragraph of the overview with
    the following:

      "The specific languages will be referred to by the name of the processor
      they target: vertex, tessellation control, tessellation evaluation,
      geometry, fragment, or compute."

    Add a new Section 2.6 titled "Compute Processor" with the following text:

      "The <compute processor> is a programmable unit that operates independently
      from the other shader processors.
      Compilation units written in the OpenGL
      Shading Language to run on this processor are called <compute shaders>.
      When a complete set of compute shaders is compiled and linked, the
      result is a <compute shader executable> that runs on the compute processor.

      A compute shader has access to many of the same resources as fragment and
      other shader processors, such as textures, buffers, image variables,
      atomic counters, and so on. It does not have any predefined inputs
      nor any fixed-function outputs. It is not part of the graphics pipeline
      and its visible side effects are through actions on images, storage
      buffers, and atomic counters.

      A compute shader operates on a group of work items called a workgroup.
      A workgroup is a collection of shader invocations that execute the same
      code, potentially in parallel. An invocation within a workgroup may
      share data with other members of the same workgroup through shared
      variables and issue memory and control barriers to synchronize with
      other members of the same workgroup."

Additions to Chapter 4 of the OpenGL Shading Language Specification, Version
4.20 (Variables and Types)

    Modify section 4.4.1, second paragraph from

      "All shaders allow input layout qualifiers on input variable declarations."

    to

      "All shaders, except compute shaders, allow input layout location
      qualifiers on input variable declarations."

    Modify Section 4.3.
    Add to the table at the start of Section 4.3:

    +-------------------+-----------------------------------------------------------+
    | Storage Qualifier | Meaning                                                   |
    +-------------------+-----------------------------------------------------------+
    | <shared>          | variable storage is shared across all work items in a     |
    |                   | workgroup for compute shaders                             |
    +-------------------+-----------------------------------------------------------+

    Add the following paragraph to Section 4.3.4, "Input Variables"

      Compute shaders do not permit user-defined input variables and do not
      form a formal interface with any other shader stage. See section 7.1
      for a description of built-in compute shader input variables. All other
      input to a compute shader is retrieved explicitly through image loads,
      texture fetches, loads from uniforms or uniform buffers, or other user
      supplied code. Redeclaration of built-in input variables in compute
      shaders is not permitted.

    Add the following paragraph to Section 4.3.6, "Output Variables"

      Compute shaders have no built-in output variables, do not support
      user-defined output variables and do not form a formal interface with any
      other shader stage. All outputs from a compute shader take the form of
      side effects such as image stores and operations on atomic counters.

    Add Section 4.3.7, "Shared", renumber subsequent sections

      The <shared> qualifier is used to declare variables that have storage
      shared between all work items of a compute shader workgroup.
      Variables declared as <shared> may only be used in compute shaders
      (see Section 5.5, "Compute Shaders"). Shared variables are implicitly
      coherent. That is, writes to shared variables from one shader invocation
      will eventually be seen by other invocations within the same workgroup.

      Variables declared as <shared> may not have initializers and their
      contents are undefined at the beginning of shader execution. Any data
      written to <shared> variables will be visible to other invocations
      executing the same shader within the same workgroup. Order of execution
      with regard to reads and writes to the same <shared> variables by different
      invocations of a shader is not defined. In order to achieve ordering with
      respect to reads and writes to <shared> variables, memory barriers must be
      employed using the barrier() function (see Section 8.15).

      There is a limit to the total size of all variables declared as
      <shared> in a single program object. This limit, expressed in
      basic machine units, may be determined by using the OpenGL API to query
      the value of MAX_COMPUTE_SHARED_MEMORY_SIZE.

    Add Section 4.4.1.4, "Compute-Shader Inputs"

      There are no layout location qualifiers for compute shader inputs.

      Layout qualifier identifiers for compute shader inputs are the workgroup
      size qualifiers:

        layout-qualifier-id
            local_size_x = integer-constant
            local_size_y = integer-constant
            local_size_z = integer-constant

      <local_size_x>, <local_size_y>, and <local_size_z> are used to define the
      local size of the kernel defined by the compute shader in the first,
      second, and third dimension, respectively. The default size in each
      dimension is 1. If a shader does not specify a size for one of the
      dimensions, that dimension will have a size of 1.

      For example, the following declaration in a compute shader

        layout (local_size_x = 32, local_size_y = 32) in;

      is used to declare a two-dimensional compute shader with a local size of
      32 x 32 elements, which is equivalent to a three-dimensional compute
      shader where the third dimension is one element deep.
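
      The defaulting rule above can be modelled on the host side. The
      following is a minimal sketch, not text from the Shading Language
      specification: the type local_size and the helpers resolve_local_size
      and invocations_per_group are hypothetical, and the value 0 is used
      here purely as an assumed marker for "dimension not specified in the
      layout qualifier".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side model of a workgroup size declaration. */
typedef struct {
    uint32_t x, y, z;
} local_size;

/* Applies the rule that any dimension left unspecified defaults to 1.
 * Assumption for this sketch: 0 marks "not specified". */
static local_size resolve_local_size(uint32_t x, uint32_t y, uint32_t z)
{
    local_size s;
    s.x = (x == 0) ? 1u : x;
    s.y = (y == 0) ? 1u : y;
    s.z = (z == 0) ? 1u : z;
    return s;
}

/* Total invocations in one workgroup: the product of the three
 * dimensions, computed in 64 bits to avoid overflow. */
static uint64_t invocations_per_group(local_size s)
{
    assert(s.x > 0 && s.y > 0 && s.z > 0);
    return (uint64_t)s.x * s.y * s.z;
}
```

      For the 32 x 32 declaration above, resolve_local_size(32, 32, 0) gives
      a size of (32, 32, 1), and invocations_per_group reports 1024
      invocations per workgroup, the figure an application would compare
      against MAX_COMPUTE_WORK_GROUP_INVOCATIONS.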

      As another example, the declaration

        layout (local_size_x = 8) in;

      effectively specifies that a one-dimensional compute shader is being
      compiled, and its size is 8 elements.

      If the local size of the shader in any dimension is greater than the
      maximum size supported by the implementation for that dimension, a
      compile-time error results. Also, if such a layout qualifier is declared more
      than once in the same shader, all those declarations must indicate the same
      workgroup size; otherwise a compile-time error results. If multiple compute
      shaders attached to a single program object declare the workgroup size,
      the declarations must be identical; otherwise a link-time error results.
      Furthermore, if a program object contains any compute shaders, at
      least one must contain an input layout qualifier specifying the
      workgroup sizes of the program, or a link-time error will occur.

Additions to Chapter 7 of the OpenGL Shading Language Specification, Version
4.20 (Built-in Variables)

    Add to the start of Section 7.1, "Built-In Language Variables", before the
    description of the vertex language built-in variables:

      In the compute language, the built-in variables are declared as follows:

        // workgroup dimensions
        in    uvec3 gl_NumWorkGroups;
        const uvec3 gl_WorkGroupSize;

        // workgroup and invocation IDs
        in    uvec3 gl_WorkGroupID;
        in    uvec3 gl_LocalInvocationID;

        // derived variables
        in    uvec3 gl_GlobalInvocationID;
        in    uint  gl_LocalInvocationIndex;

    Add to the end of Section 7.1, before Section 7.1.1:

      The built-in variable <gl_NumWorkGroups> is a compute-shader input
      variable containing the number of workgroups in each dimension
      that will execute the compute shader.
      Its content is equal to the values specified in the <num_groups_x>,
      <num_groups_y>, and <num_groups_z> parameters passed to the
      DispatchCompute API entry point.

      The built-in constant <gl_WorkGroupSize> is a compute-shader constant
      containing the workgroup size of the shader. The size of the workgroup
      in the X, Y, and Z dimensions is stored in the x, y, and z components.
      The values stored in <gl_WorkGroupSize> match those specified in the
      required <local_size_x>, <local_size_y>, and <local_size_z> layout
      qualifiers for the current shader. This value is constant so that
      it can be used to size arrays of memory that can be shared within
      the workgroup.

      The built-in variable <gl_WorkGroupID> is a compute-shader input
      variable containing the 3-dimensional index of the global workgroup
      that the current invocation is executing in. The possible values range
      across the parameters passed into DispatchCompute, i.e., from (0, 0, 0) to
      (gl_NumWorkGroups.x - 1, gl_NumWorkGroups.y - 1, gl_NumWorkGroups.z - 1).

      The built-in variable <gl_LocalInvocationID> is a compute-shader input
      variable containing the 3-dimensional index of the current invocation
      within the workgroup that it is executing in.
      The possible values for this variable range across the workgroup
      size, i.e., from (0, 0, 0) to (gl_WorkGroupSize.x - 1,
      gl_WorkGroupSize.y - 1, gl_WorkGroupSize.z - 1).

      The built-in variable <gl_GlobalInvocationID> is a compute shader input
      variable containing the global index of the current work item. This
      value uniquely identifies this invocation from all other invocations
      across all workgroups initiated by the current
      DispatchCompute call. This is computed as:

        gl_GlobalInvocationID =
            gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID.
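
      The component-wise arithmetic in this formula can be checked on the
      host. The sketch below is illustrative only: the uvec3 struct and the
      function global_invocation_id are hypothetical C stand-ins for the
      GLSL type and built-in variable, which are normally supplied by the
      implementation rather than computed by the application.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C stand-in for the GLSL uvec3 type. */
typedef struct {
    uint32_t x, y, z;
} uvec3;

/* Mirrors gl_GlobalInvocationID =
 *     gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID,
 * applied separately to the x, y and z components. */
static uvec3 global_invocation_id(uvec3 work_group_id,
                                  uvec3 work_group_size,
                                  uvec3 local_invocation_id)
{
    /* A local ID always lies inside the workgroup size. */
    assert(local_invocation_id.x < work_group_size.x);
    assert(local_invocation_id.y < work_group_size.y);
    assert(local_invocation_id.z < work_group_size.z);

    uvec3 g;
    g.x = work_group_id.x * work_group_size.x + local_invocation_id.x;
    g.y = work_group_id.y * work_group_size.y + local_invocation_id.y;
    g.z = work_group_id.z * work_group_size.z + local_invocation_id.z;
    return g;
}
```

      For instance, the invocation with local ID (5, 7, 0) in workgroup
      (2, 1, 0) of a 32 x 32 x 1 shader has global ID (69, 39, 0).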

      The built-in variable <gl_LocalInvocationIndex> is a compute shader
      input variable that contains the 1-dimensional representation of the
      gl_LocalInvocationID. This is useful for identifying a
      unique region of shared memory within the workgroup for this
      invocation to use. This is computed as:

        gl_LocalInvocationIndex =
            gl_LocalInvocationID.z * gl_WorkGroupSize.x * gl_WorkGroupSize.y +
            gl_LocalInvocationID.y * gl_WorkGroupSize.x +
            gl_LocalInvocationID.x;

    Add to the list of built-in constants in Section 7.3:

        const ivec3 gl_MaxComputeWorkGroupCount = { 65535, 65535, 65535 };
        const ivec3 gl_MaxComputeWorkGroupSize = { 1024, 1024, 64 };
        const int gl_MaxComputeUniformComponents = 512;
        const int gl_MaxComputeTextureImageUnits = 16;
        const int gl_MaxComputeImageUniforms = 8;
        const int gl_MaxComputeAtomicCounters = 8;
        const int gl_MaxComputeAtomicCounterBuffers = 1;

Additions to Chapter 8 of the OpenGL Shading Language Specification, Version
4.20 (Built-in Functions)

    Insert "Atomic Memory Functions" section after Section 8.10, Atomic
    Counter Functions (p. 149). Atomic memory operations are supported on
    shared variables; the set of operations and their definitions are similar
    to those for the imageAtomic*() functions. These functions are fully
    documented in the ARB_shader_storage_buffer_object extension (see
    dependencies).

    Modify the first paragraph of Section 8.15, "Shader Invocation Control
    Functions" to read:

      The shader invocation control function is only available in tessellation
      control shaders and compute shaders. It is used to control the relative
      execution order of multiple shader invocations used to process a patch
      (in the case of tessellation control shaders) or a workgroup (in the
      case of compute shaders), which are otherwise executed with an undefined
      order.
703 704 +----------------+--------------------------------------------------------------------------+ 705 | Syntax | Description | 706 +----------------+--------------------------------------------------------------------------+ 707 | barrier | For any given static instance of barrier() appearing in a tessellation | 708 | | control shader or compute shader, all invocations for a single patch | 709 | | or workgroup, respectively, must enter it before any will continue | 710 | | beyond it. | 711 +----------------+--------------------------------------------------------------------------+ 712 713 Modify the second paragraph as follows: 714 715 ... Because invocations may execute in an undefined order between these 716 barrier calls, the values of a per-vertex or per-patch output variable in 717 a tessellation control shader or shared variables for compute shaders 718 will be undefined in a number of cases enumerated in Section 4.3.7 "Output 719 Variables" (for tessellation control shaders) and Section 4.3.6 "Shared 720 Variables" (for compute shaders). 721 722 Replace the third paragraph with the following: 723 724 For tessellation control shaders, the barrier() function may only be 725 placed inside the function main() of the tessellation control shader and 726 may not be called within any control flow. Barriers are also disallowed 727 after a return statement in the function main(). Any such misplaced 728 barriers result in a compile-time error. 729 730 For compute shaders, the barrier() function may be placed within flow 731 control, but that flow control must be uniform flow control. That is, all 732 the controlling expressions that lead to execution of the barrier must be 733 dynamically uniform expressions. This ensures that if any shader 734 invocation enters a conditional statement, then all invocations will enter 735 it. While compilers are encouraged to give warnings if they can detect 736 this might not happen, compilers cannot completely determine this. 
Hence, 737 it is the author's responsibility to ensure barrier() only exists inside 738 uniform flow control. Otherwise, some shader invocations will stall 739 indefinitely, waiting for a barrier that is never reached by other 740 invocations. 741 742 Modify the table of memory control functions on p.160, 743 744 +-----------------------------------+----------------------------------------------------------------------------------------+ 745 | Syntax | Description | 746 +-----------------------------------+----------------------------------------------------------------------------------------+ 747 | void memoryBarrier() | Control the ordering of all memory transactions issued by a single shader invocation. | 748 +-----------------------------------+----------------------------------------------------------------------------------------+ 749 | void memoryBarrierAtomicCounter() | Control the ordering of accesses to atomic counter variables issued by a single shader | 750 | | invocation. | 751 +-----------------------------------+----------------------------------------------------------------------------------------+ 752 | void memoryBarrierBuffer() | Control the ordering of memory transactions to buffer variables issued within a | 753 | | single shader invocation. | 754 +-----------------------------------+----------------------------------------------------------------------------------------+ 755 | void memoryBarrierImage() | Control the ordering of memory transactions to images issued within a single shader | 756 | | invocation. | 757 +-----------------------------------+----------------------------------------------------------------------------------------+ 758 | void memoryBarrierShared() | Control the ordering of memory transactions to shared variables issued within a single | 759 | | shader invocation. | 760 | | Only available in compute shaders. 
| 761 +-----------------------------------+----------------------------------------------------------------------------------------+ 762 | void groupMemoryBarrier() | Control the ordering of all memory transactions issued within a single shader | 763 | | invocation, as viewed by other invocations in the same workgroup. | 764 | | Only available in compute shaders. | 765 +-----------------------------------+----------------------------------------------------------------------------------------+ 766 767 Modify the subsequent paragraph as follows: 768 769 The memory barrier built-in functions can be used to order reads and 770 writes to variables stored in memory accessible to other shader 771 invocations. When called, these functions will wait for the completion of 772 all reads and writes previously performed by the caller that access 773 selected variable types, and then return with no other effect. The 774 built-in functions memoryBarrierAtomicCounter(), memoryBarrierBuffer(), 775 memoryBarrierImage(), and memoryBarrierShared() wait for the completion of 776 accesses to atomic counter, buffer, image, and shared variables, 777 respectively. The built-in functions memoryBarrier() and 778 groupMemoryBarrier() wait for the completion of accesses to all of the 779 above variable types. The functions memoryBarrierShared() and 780 groupMemoryBarrier() are available only in compute shaders; the other 781 functions are available in all shader types. 782 783 When these functions return, any memory stores performed using coherent 784 variables prior to the call will be visible to any future coherent access 785 to the same memory performed by any other shader invocation. 
In 786 particular, the values written this way in one shader stage are guaranteed 787 to be visible to coherent memory accesses performed by shader invocations 788 in subsequent stages when those invocations were triggered by the 789 execution of the original shader invocation (e.g., fragment shader 790 invocations for a primitive resulting from a particular geometry shader 791 invocation). 792 793 Additionally, memory barrier functions order stores performed by the 794 calling invocation, as observed by other shader invocations. Without 795 memory barriers, if one shader invocation performs two stores to coherent 796 variables, a second shader invocation might see the values written by the 797 second store prior to seeing those written by the first. However, if the 798 first shader invocation calls a memory barrier function between the two 799 stores, selected other shader invocations will never see the results of 800 the second store before seeing those of the first. When using the 801 function groupMemoryBarrier(), this ordering guarantee applies only to 802 other shader invocations in the same compute shader workgroup; all other 803 memory barrier functions provide the guarantee to all other shader 804 invocations. No memory barrier is required to guarantee the order of 805 memory stores as observed by the invocation performing the stores; an 806 invocation reading from a variable that it previously wrote will always 807 see the most recently written value unless another shader invocation also 808 wrote to the same memory. 

Dependencies on OpenGL 4.3 and ARB_shader_storage_buffer_object

    If OpenGL 4.3 and ARB_shader_storage_buffer_object are not supported, the
    spec language adding the built-in functions atomicAdd(), atomicMin(),
    atomicMax(), atomicAnd(), atomicOr(), atomicXor(), atomicExchange(), and
    atomicCompSwap() should be considered to be incorporated into this
    extension as-is, except that buffer variables will not be supported and
    thus cannot be used with these functions. No "#extension" directive is
    necessary to use these functions in compute shaders.

    If OpenGL 4.3 and ARB_shader_storage_buffer_object are not supported,
    references to the GLSL built-in function memoryBarrierBuffer() should be
    removed.

Dependencies on NV_vertex_buffer_unified_memory

    If NV_vertex_buffer_unified_memory is supported, a new buffer address
    range and enable are provided to permit the use of
    DispatchComputeIndirect with a resident buffer object without requiring
    that it be bound to the DISPATCH_INDIRECT_BUFFER target. The following
    additional edits apply:

    Accepted by the <cap> parameter of GetBufferParameterui64vNV:

        DISPATCH_INDIRECT_BUFFER                        (defined above)

    Accepted by the <cap> parameter of Disable, Enable, and IsEnabled, and by
    the <pname> parameter of GetIntegerv, GetBooleanv, GetFloatv, GetDoublev
    and GetInteger64v:

        DISPATCH_INDIRECT_UNIFIED_NV                    0x90FD

    Accepted by the <pname> parameter of BufferAddressRangeNV
    and the <value> parameter of GetIntegerui64vNV:

        DISPATCH_INDIRECT_ADDRESS_NV                    0x90FE

    Accepted by the <value> parameter of GetIntegerv:

        DISPATCH_INDIRECT_LENGTH_NV                     0x90FF

    Add to the end of Section 5.5, after discussion of
    DispatchComputeIndirect:

    If DISPATCH_INDIRECT_UNIFIED_NV is enabled, DispatchComputeIndirect does
    not use the buffer bound to DISPATCH_INDIRECT_BUFFER.
Instead, it sources 856 its arguments from the GPU address range specified by calling 857 BufferAddressRangeNV with a <pname> of DISPATCH_INDIRECT_ADDRESS_NV and an 858 <index> of zero. The address is obtained by adding the <indirect> 859 parameter to the base address of the range, specified by the <address> 860 parameter of BufferAddressRangeNV. If the command sources data outside 861 the specified address range, the error INVALID_OPERATION will be 862 generated. The DISPATCH_INDIRECT_BUFFER binding will be ignored in this 863 case, and no errors will be generated due to the use of this binding. The 864 error INVALID_VALUE will still be generated if <indirect> is negative. No 865 INVALID_VALUE error will be generated if <indirect> is not a multiple of 866 four, but INVALID_OPERATION will be generated if the effective address is 867 not a multiple of four. If the indirect dispatch address range does not 868 belong to a buffer object that is resident at the time of the 869 DispatchComputeIndirect call, undefined results, possibly including 870 program termination, may occur. 871 872 Add the following to the "Compute Dispatch State" table defined in this 873 extension: 874 875 Get Value Type Get Command Initial Value Sec Attribute 876 --------- ---- ----------- ------------- --- --------- 877 DISPATCH_INDIRECT_UNIFIED_NV B IsEnabled FALSE 5.5 none 878 DISPATCH_INDIRECT_ADDRESS_NV Z64+ GetIntegerui64vNV 0 5.5 none 879 DISPATCH_INDIRECT_LENGTH_NV Z+ GetIntegerv 0 5.5 none 880 881Errors 882 883 INVALID_OPERATION is generated by DispatchCompute or 884 DispatchComputeIndirect if there is no active program for the compute 885 shader stage. 886 887 INVALID_VALUE is generated by DispatchCompute if any of <num_groups_x>, 888 <num_groups_y> or <num_groups_z> is greater than the value of 889 MAX_COMPUTE_WORK_GROUP_COUNT for the corresponding dimension. 890 891 INVALID_VALUE is generated by DispatchComputeIndirect if <indirect> is 892 less than zero or not a multiple of four. 
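
    The three dispatch errors described above can be modeled with a small C
    sketch. It is only an illustration: the function names, the
    has_active_compute_program flag, and the order in which the checks are
    made are assumptions of this example, and the ERR_* constants are local
    stand-ins whose values mirror the standard GL error enums.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins; the numeric values match GL_NO_ERROR, GL_INVALID_VALUE
 * and GL_INVALID_OPERATION. */
enum {
    ERR_NONE              = 0,
    ERR_INVALID_VALUE     = 0x0501,
    ERR_INVALID_OPERATION = 0x0502
};

/* Hypothetical model of the DispatchCompute checks. max_count holds the
 * per-dimension values of MAX_COMPUTE_WORK_GROUP_COUNT. */
static int check_dispatch(int has_active_compute_program,
                          const unsigned num_groups[3],
                          const unsigned max_count[3])
{
    int i;

    assert(num_groups != NULL && max_count != NULL);
    if (!has_active_compute_program)
        return ERR_INVALID_OPERATION;
    for (i = 0; i < 3; ++i) {
        if (num_groups[i] > max_count[i])
            return ERR_INVALID_VALUE;
    }
    return ERR_NONE;
}

/* Hypothetical model of the <indirect> offset checks made by
 * DispatchComputeIndirect (buffer binding and bounds are not modeled). */
static int check_indirect_offset(int has_active_compute_program,
                                 long long indirect)
{
    if (!has_active_compute_program)
        return ERR_INVALID_OPERATION;
    if (indirect < 0 || indirect % 4 != 0)
        return ERR_INVALID_VALUE;
    return ERR_NONE;
}
```

    With all three limits at the minimum value of 65535, a dispatch of
    (5, 70000, 5) would be rejected with INVALID_VALUE in this model, as
    would an <indirect> offset of 6.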

    INVALID_OPERATION is generated by DispatchComputeIndirect if no buffer is
    bound to DISPATCH_INDIRECT_BUFFER or if the command would source data
    beyond the end of the bound buffer object.

    INVALID_OPERATION is generated by GetProgramiv if <pname> is
    COMPUTE_WORK_GROUP_SIZE and either the program has not been linked
    successfully, or has been linked but contains no compute shaders.

    LinkProgram will fail if <program> contains a combination of compute and
    non-compute shaders.

New State

    None.

New Implementation Dependent State

    Add to Table 6.31, "Program Pipeline Object State"

    +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+
    | Get Value                                          | Type      | Get Command             | Initial Value | Description                                                           | Sec.    |
    +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+
    | COMPUTE_SHADER                                     | Z+        | GetProgramPipelineiv    | 0             | Name of current compute shader program object                         | 2.11.4  |
    +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+

    Add to Table 6.32, "Program Object State"

    +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+
    | Get Value                                          | Type      | Get Command             | Initial Value | Description                                                           | Sec.
| 923 +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+ 924 | COMPUTE_WORK_GROUP_SIZE | 3 x Z+ | GetProgramiv | { 0, ... } | Workgroup size of a linked compute program | 5.5 | 925 | UNIFORM_BLOCK_REFERENCED_BY_COMPUTE_SHADER | B | GetActiveUniformBlockiv | FALSE | True if uniform block is referenced by the compute stage | 2.17.7 | 926 | ATOMIC_COUNTER_BUFFER_REFERENCED_BY_COMPUTE_SHADER | B | GetActiveAtomicCounter- | FALSE | AACB has a counter used by compute shaders | 2.17.7 | 927 | | | Bufferiv | FALSE | | | 928 +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+ 929 930 Insert new table named "Compute Dispatch State", after Table 6.46 "Hints": 931 932 +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+ 933 | Get Value | Type | Get Command | Initial Value | Description | Sec. | 934 +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+ 935 | DISPATCH_INDIRECT_BUFFER_BINDING | Z+ | GetIntegerv | 0 | Indirect dispatch buffer binding | 5.5 | 936 +----------------------------------------------------+-----------+-------------------------+---------------+-----------------------------------------------------------------------+---------+ 937 938 Insert Table 6.50, "Implementation Dependent Compute Shader Limits", 939 renumber subsequent tables. 
940 941 +-----------------------------------------+-----------+---------------+---------------------+-----------------------------------------------------------------------+---------+ 942 | Get Value | Type | Get Command | Minimum Value | Description | Sec. | 943 +-----------------------------------------+-----------+---------------+---------------------+-----------------------------------------------------------------------+---------+ 944 | MAX_COMPUTE_WORK_GROUP_COUNT | 3 x Z+ | GetIntegeri_v | 65535 | Maximum number of workgroups that may be dispatched by a single | 5.5 | 945 | | | | | dispatch command (per dimension) | | 946 | MAX_COMPUTE_WORK_GROUP_SIZE | 3 x Z+ | GetIntegeri_v | 1024 (x, y), 64 (z) | Maximum local size of a compute workgroup (per dimension) | 5.5 | 947 | MAX_COMPUTE_WORK_GROUP_INVOCATIONS | Z+ | GetIntegerv | 1024 | Maximum total compute shader invocations in a single workgroup | 5.5 | 948 | MAX_COMPUTE_UNIFORM_BLOCKS | Z+ | GetIntegerv | 12 | Maximum number of uniform blocks per compute program | 2.11.7 | 949 | MAX_COMPUTE_TEXTURE_IMAGE_UNITS | Z+ | GetIntegerv | 16 | Maximum number of texture image units accessible by a compute shader | 2.11.12 | 950 | MAX_COMPUTE_ATOMIC_COUNTER_BUFFERS | Z+ | GetIntegerv | 8 | Number of atomic counter buffers accessed by a compute shader | 2.11.17 | 951 | MAX_COMPUTE_ATOMIC_COUNTERS | Z+ | GetIntegerv | 8 | Number of atomic counters accessed by a compute shader | 2.11.12 | 952 | MAX_COMPUTE_SHARED_MEMORY_SIZE | Z+ | GetIntegerv | 32768 | Maximum total storage size of all variables declared as <shared> in | | 953 | | | | | all compute shaders linked into a single program object | | 954 | MAX_COMPUTE_UNIFORM_COMPONENTS | Z+ | GetIntegerv | 512 | Number of components for compute shader uniform variables | 5.5.1 | 955 | MAX_COMPUTE_IMAGE_UNIFORMS | Z+ | GetIntegerv | 8 | Number of image variables in compute shaders | 2.11.12 | 956 | MAX_COMBINED_COMPUTE_UNIFORM_COMPONENTS | Z+ | GetIntegerv | * | Number of 
words for compute shader uniform variables in all uniform | 5.5.1 | 957 | | | | | blocks, including the default | | 958 +-----------------------------------------+-----------+---------------+---------------------+-----------------------------------------------------------------------+---------+ 959 960 Modify Table 6.55, increasing the following minimum values: 961 962 MAX_COMBINED_TEXTURE_IMAGE_UNITS 96 (6*16), was 80 963 MAX_UNIFORM_BUFFER_BINDINGS 72 (6*12), was 60 964 965Issues 966 967 1) Should <shared> variables be usable only in compute shaders, or in other 968 stages too? 969 970 RESOLVED: Support only in compute shaders. While some hardware may be 971 able to support shared variables in shader stages other than compute, 972 it is difficult to clearly define what the semantics are as far as 973 sharing. For example, what is the equivalent for a workgroup for 974 vertex shaders? 975 976 2) Can we expose atomics on <shared> variables? 977 978 RESOLVED: Yes. The existing atomics in OpenGL 4.2 (via image 979 variables) don't map well to the <shared> declaration. Instead, we've 980 defined new atomic functions that take a variable as a first input. 981 These functions are specified in the ARB_shader_storage_buffer_object 982 extension and are incorporated into this extension via the interaction 983 described above. We could have also chosen to define operators +=, &=, 984 etc. to be atomic when applied to <shared> variables, but shaders may 985 want to use such variables in cases where atomic access (and the 986 related overhead) is not required. 987 988 3) Should the local size and dimensions of the workgroup be specified at 989 compile time? What are the default local dimensions? 990 991 RESOLVED: Dimension is always 3 and a workgroup size declaration is 992 compulsory at compile time. There is no default. The value used is 993 queriable. To use a 1- or 2-dimensional workgroup, the extra 994 dimension(s) can be set to 1. 
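
       As an illustration of the resolution above: because the workgroup
       size is fixed in the shader, an application covering a 1-dimensional
       data set only chooses how many workgroups to dispatch. The C sketch
       below shows one way to compute that count; groups_for() is a
       hypothetical helper written for this example, not a GL entry point.

```c
#include <assert.h>

/* Hypothetical helper: number of workgroups needed along one dimension to
 * cover `items` work items when the shader declares `local_size`
 * invocations per workgroup in that dimension. The result is rounded up,
 * so the last workgroup may be only partly used. */
static unsigned groups_for(unsigned items, unsigned local_size)
{
    assert(local_size > 0);
    /* Written without (items + local_size - 1) to avoid overflow. */
    return items / local_size + (items % local_size != 0 ? 1u : 0u);
}
```

       For 1000 items with local_size_x = 64 and local_size_y =
       local_size_z = 1, this yields 16 workgroups, so a call such as
       glDispatchCompute(16, 1, 1) would cover the data. The 24 surplus
       invocations in the last workgroup would typically be filtered out in
       the shader by comparing gl_GlobalInvocationID.x against the item
       count.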

    4) Do we need the local_work_size parameter in dispatch if the local size
       may be specified at compile time in the shader?

       RESOLVED: The specification of the workgroup size is now mandatory in
       the shader source at compile time and the local_work_size may no longer
       be specified at dispatch time.

    5) How do multiple shaders attached to a single program object work?

       RESOLVED: Just as with any other shader stage. Exactly one of the
       shaders must provide the 'main' entry point. All shaders attached to a
       program object effectively get compiled into a single, large program at
       link time. The program is dispatched as one big entity. Über-shader
       style functionality can be achieved through the use of subroutine
       uniforms, which also work exactly as for other shader stages.

    6) Should compute dispatch honor conditional rendering?

       RESOLVED: Yes, it does honor conditional rendering.

    7) Is it possible to pass compute programs to UseProgram, etc.?

       RESOLVED: Yes, compute programs can be made current via UseProgram and
       can be made current in a program pipeline object via UseProgramStages.
       Note that a compute program must be linked with PROGRAM_SEPARABLE set
       to TRUE to be passed to UseProgramStages, even though the compute
       pipeline has only a single shader stage.

       The active compute program that will be used by DispatchCompute will be
       determined in the same manner as the active program for any other
       program stage:

         * If there is a current program specified via UseProgram, that
           program is considered current for all stages, including compute.

         * Otherwise, if there is a current program pipeline object, the
           program current for the compute stage of the pipeline object is
           considered current for the compute stage.

         * If neither of the above applies, no program is current for the
           compute stage.
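
       The three rules above can be sketched in C as follows. This is only a
       model of the selection order: the gl_state_model structure and the
       current_compute_program() function are hypothetical names introduced
       for this example, and 0 is used for "no object" in the same way GL
       uses 0 as the null object name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical snapshot of the context state that matters here. */
typedef struct {
    unsigned use_program;       /* program made current via UseProgram     */
    unsigned bound_pipeline;    /* pipeline bound via BindProgramPipeline  */
    unsigned pipeline_compute;  /* compute-stage program of that pipeline  */
} gl_state_model;

/* Returns the program that is current for the compute stage, or 0. */
static unsigned current_compute_program(const gl_state_model *s)
{
    assert(s != NULL);
    if (s->use_program != 0)
        return s->use_program;       /* UseProgram wins for all stages */
    if (s->bound_pipeline != 0)
        return s->pipeline_compute;  /* fall back to the pipeline stage */
    return 0;                        /* nothing is current for compute */
}
```

       In this model, a context with UseProgram set to program 7 and a bound
       pipeline whose compute stage holds program 9 resolves to program 7;
       clearing the UseProgram binding makes it resolve to program 9.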

       The program that is current for the compute stage is considered to be
       active if and only if it has a compute shader executable. For example,
       if a non-compute program is made current via UseProgram, it will also
       be considered "current" for the compute stage, but won't be considered
       active.

       When using program pipeline objects, it's possible to switch between
       graphics and compute work without switching programs. For example, in:

         glBindProgramPipeline(pipeline);
         glUseProgramStages(pipeline, GL_VERTEX_SHADER_BIT, programA);
         glUseProgramStages(pipeline, GL_FRAGMENT_SHADER_BIT, programB);
         glUseProgramStages(pipeline, GL_COMPUTE_SHADER_BIT, programC);
         glDrawArrays(GL_TRIANGLES, 0, 900);
         glDispatchCompute(5, 5, 5);

       the triangles will be processed by programA and programB, while the
       compute dispatch will be processed by programC. Similarly,

         glUseProgramStages(pipeline, ~GL_COMPUTE_SHADER_BIT, programAB);
         glUseProgramStages(pipeline, GL_COMPUTE_SHADER_BIT, programC);
         glDrawArrays(GL_TRIANGLES, 0, 900);
         glDispatchCompute(5, 5, 5);

       will have the triangles processed by the multi-stage programAB.

    8) What happens if you try to dispatch with no active compute program?

       RESOLVED: An INVALID_OPERATION error is generated if there is no
       active program for the compute shader stage.

    9) Should we increase minimums on certain replicated state bindings
       (texture image units, uniform buffer bindings) to reflect the addition
       of a sixth shader stage?

       RESOLVED: Yes, for MAX_COMBINED_TEXTURE_IMAGE_UNITS and
       MAX_UNIFORM_BUFFER_BINDINGS. These limits permit applications to
       statically partition the shared set of texture bindings into six
       separate sets, one per shader stage.

       The limit MAX_COMBINED_UNIFORM_BLOCKS is not increased, because it
       reflects the sum of the number of uniform blocks used in each stage of
       a single program. Since no single program can have more than five
       stages, this limit doesn't need to be increased.

    10) How do the shader built-in variables relate to DirectCompute's
        built-in system values (SV_*)?

          OpenGL Compute              DirectCompute
          --------------------------------------------------
          gl_NumWorkGroups            --
          gl_WorkGroupSize            --
          gl_WorkGroupID              SV_GroupID
          gl_LocalInvocationID        SV_GroupThreadID
          gl_GlobalInvocationID       SV_DispatchThreadID
          gl_LocalInvocationIndex     SV_GroupIndex

    11) How does "program validation" (checking the active programs against
        the current state) apply to DispatchCompute?

        RESOLVED: The same program validation logic will be applied to both
        graphics primitives (e.g., DrawArrays) and compute dispatches.
        Conditions that will cause validation errors for graphics primitives
        will also cause validation errors for compute dispatch, even if the
        conditions wouldn't otherwise affect compute, for example:

          * Mis-configured program pipeline objects (e.g., inserting a geometry
            program A between the linked vertex and fragment shaders of
            program B).

          * A graphics program has a vertex shader that uses a 2D texture from
            texture image unit 0 and a fragment shader that uses a 3D texture
            from texture image unit 0.

        Similarly, validation errors specific to the compute shader executable
        (e.g., using different targets on a single texture image unit in a
        compute program) will generate validation errors for graphics Draw*
        calls.

        We chose to specify this behavior for several reasons.
        First, using the same logic in both places ensures a single result
        for ValidateProgram and ValidateProgramPipeline (a single
        VALIDATE_STATUS value wouldn't be good enough if the result could be
        different for compute and graphics). Additionally, a single test
        allows implementations to set up state and perform validation tests
        for compute and graphics operations at the same time, without
        requiring additional irregular graphics- or compute-specific logic.

    12) We specify an INVALID_OPERATION error for DispatchCompute when there
        is no active program on the compute stage. Should we specify similar
        errors for Draw* calls if the current program specified by UseProgram
        is a compute program?

        RESOLVED: Not in the current spec. If a compute program is made
        current with UseProgram, there will be no active program for either
        the vertex or the fragment stage. In this case, the results of vertex
        and fragment processing are undefined, but no error is generated. This
        behavior is already specified in unextended OpenGL 4.2.

        We don't generate errors in this case for several reasons:

          * For the compatibility profile, fixed-function vertex and fragment
            processing is available, and INVALID_OPERATION wouldn't make sense
            there.

          * Even in the core profile, there are cases where no active fragment
            shader is needed (e.g., primitives with RASTERIZER_DISCARD enabled).

        While there is no case where having only a compute program makes sense,
        at least in the core profile, we chose to keep the same undefined
        behavior that's already in place.

    13) Should we provide any additional support extending the memoryBarrier()
        GLSL built-in function provided by ARB_shader_image_load_store and
        GLSL 4.20?

        RESOLVED: Yes.
        The memoryBarrier() function provided by GLSL 4.20
        requires (a) synchronizing all memory transactions that might be visible
        to other shader invocations and (b) ordering memory transactions so that
        all other shader invocations never see stores issued after the barrier
        before seeing stores issued before the barrier. Hardware
        implementations of GLSL 4.20 may have a high degree of parallelism,
        where the memory subsystem servicing shader loads and stores may have
        multiple independent sub-units, and where the shader invocations
        themselves may be executed in parallel on many shader cores. The
        memoryBarrier() command may be fairly heavyweight, requiring
        synchronization with all memory sub-units and shader cores.

        We provide new functions in two different directions that might serve as
        lighter weight alternatives to memoryBarrier(). In particular, we
        provide four new functions

          void memoryBarrierAtomicCounter();
          void memoryBarrierBuffer();
          void memoryBarrierImage();
          void memoryBarrierShared();

        that order transactions of only a specific memory type and might require
        synchronization with fewer sub-units of the memory subsystem, and a new
        function

          void groupMemoryBarrier();

        that orders transactions only as viewed by other invocations in the
        same workgroup, which might not require synchronization with other
        shader cores. Since shared memory is only accessible to invocations
        within a single workgroup, memoryBarrierShared() also only requires
        synchronization with other invocations in the same workgroup.

Revision History

    Rev.  Date      Author     Changes
    ----  --------  ---------  -----------------------------------------
    28    12/10/18  Jon Leech  Use 'workgroup' consistently throughout (Bug
                               11723, internal API issue 87).
1193 27 07/24/14 Jon Leech Change value of GLSL limit 1194 gl_MaxComputeUniformComponents to 512 for 1195 consistency with the API (Bug 12370). 1196 26 01/30/14 Jon Leech Add table 6.31 COMPUTE_SHADER entry for 1197 program pipeline objects (Bug 11539). 1198 25 10/23/12 pbrown Remove the restriction forbidding the use of 1199 barrier() inside potentially divergent flow 1200 control. Instead, we will allow barrier() to 1201 be executed anywhere, but specify undefined 1202 results (including hangs or program termination) 1203 if the flow control is divergent (bug 9367). 1204 24 07/01/12 Jon Leech Fix typo (bug 8984). 1205 23 06/28/12 johnk Remove two other references to "thread", add 1206 "Only available in compute shaders" to the table 1207 for memoryBarrierShared() and groupMemoryBarrier(), 1208 fixed a typo. 1209 22 06/22/12 pbrown Add a new built-in memoryBarrierBuffer() as an 1210 interaction with ARB_shader_storage_buffer. Add 1211 a new built-in groupMemoryBarrier() that orders 1212 memory transactions only as observed by other 1213 shader invocations in the same work group. 1214 Enhance the description of the GLSL memory 1215 barrier functions. Add issue 13 about the new 1216 memory barrier functions added in this extension 1217 (bug 9199). Mark issues 11 and 12 as resolved. 1218 Add NV_vertex_buffer_unified_memory interaction 1219 allowing DispatchComputeIndirect to read its 1220 arguments from any resident buffer object 1221 instead of the single bound indirect dispatch 1222 buffer. 1223 21 06/21/12 gsellers Clarify that there are no built-in inputs or 1224 outputs in compute shaders (bug 9200). 1225 20 06/21/12 gsellers Throw INVALID_OPERATION if querying 1226 COMPUTE_WORK_GROUP_SIZE from unlinked program or 1227 program with no compute shader (bug 9117). 1228 19 06/18/12 pbrown DispatchComputeIndirect throws INVALID_VALUE 1229 if <indirect> is negative or misaligned (bug 1230 9181). 
1231 18 06/17/12 pbrown Clarify that compute-only programs can be used 1232 by both UseProgram and UseProgramStages, and add 1233 a COMPUTE_SHADER_BIT for UseProgramStages (bug 1234 9155). Specify that validation errors checking 1235 programs against each other and the GL state 1236 apply equally to graphics primitives (Draw*) and 1237 compute dispatches. Update issue 7; add new 1238 issues 11 and 12. Clarify that compute shader 1239 invocations in a workgroup are run "potentially 1240 in parallel", but not "in lockstep" (bug 9151). 1241 Other minor wording improvements. 1242 17 06/15/12 johnk Don't allow location layout qualifiers for 1243 compute shader inputs. 1244 16 06/15/12 johnk In the intro material, allow work groups to 1245 only potentially execute in parallel, and use 1246 control barriers to synchronize. Other minor 1247 fixes. 1248 15 06/15/12 dgkoch Added Additions to Ch.2 of Shading Language. 1249 Renamed shader built-in variables, explained 1250 them better, made them uvec3 instead of int[3]. 1251 Added derived shading language variables. 1252 Renamed and changed built-in constants for 1253 consistency with the variables. Removed 1254 gl_MaxComputeWorkDimensions since it is no 1255 longer necessary. Renamed API constants to 1256 be consistent with shading language terminology. 1257 Remove a few rogue references to variable 1258 number of dispatch arguments. Added Issue 10. 1259 (bugs 9151, 9167) 1260 14 06/14/12 pbrown Modify DispatchComputeIndirect to accept an 1261 "intptr"-typed offset instead of a "void *", 1262 since doesn't accept pointers to client memory. 1263 Modify DispatchComputeIndirect to use a new 1264 buffer binding (DISPATCH_INDIRECT_BUFFER) 1265 instead of sharing the binding used by 1266 Draw*Indirect. Add missing entries in the "New 1267 Tokens" section and assign values. Update 1268 documentation of COMMAND_BARRIER_BIT to reflect 1269 the new dispatch indirect binding. 
Document 1270 DispatchComputeIndirect errors for offsets that 1271 are negative, misaligned, or run off the end of 1272 the bound buffer. Increase minimums for 1273 combined texture image units and uniform buffer 1274 bindings to reflect the new stage. Update 1275 various issues, add new issue 9 (bug 9130). 1276 13 06/14/12 Jon Leech Copy description of MAX_COMPUTE_SHARED_MEMORY_SIZE 1277 into API spec from GLSL spec (bug 9069). 1278 12 05/14/12 pbrown Add interaction with ARB_shader_storage_buffer_ 1279 object. The built-in functions provided there 1280 for atomic memory operations on buffer variables 1281 are also supported for the shared variables 1282 provided here. The functions themselves are 1283 documented fully in the other specification. 1284 11 05/14/12 johnk Keep the previous logical contents of the last 1285 paragraph of the memory shader control functions. 1286 10 04/26/12 gsellers Count max compute shared variable size in bytes. 1287 Make shared variables implicitly coherent. 1288 Add MAX_COMPUTE_UNIFORM_COMPONENTS. 1289 Clean up MAX_COMPUTE_IMAGE_UNIFORMS. 1290 9 04/25/12 gsellers Add UNIFORM_BLOCK_REFERENCED_BY_COMPUTE_SHADER 1291 and ATOMIC_COUNTER_BUFFER_REFERENCED_BY_- 1292 COMPUTE_SHADER. Remove <program> from dispatch 1293 APIs. Add memoryBarrier{Image,Shared, 1294 AtomicCounter}(). 1295 8 04/05/12 gsellers Remove ARB suffixes. 1296 7 02/02/12 gsellers Require OpenGL 4.2. 1297 Add issue 8. 1298 Up various minimums. 1299 Remove variable dimensionality. 1300 6 01/24/12 gsellers Require OpenGL 3.0. 1301 Incorporate feedback from bmerry. 1302 Add compute shader constants to sec. 7.7. 1303 Add modifications to sec. 8.15 of the GLSL spec. 1304 Add issue 7. 1305 5 01/20/12 gsellers Make compute dispatch honor conditional 1306 rendering. Add indirect dispatch. 1307 Change 'global work size' to 'num work groups', 1308 make global size in multiples of work group size. 1309 4 01/10/12 gsellers Fix typos and other small corrections. 
1310 Make specification of work group size at compile 1311 time compulsory. 1312 Add COMPUTE_WORK_DIMENSION_ARB and 1313 COMPUTE_LOCAL_WORK_SIZE_ARB queries. 1314 Add issue (5), resolve issues (3) and (4). 1315 3 01/09/12 gsellers Change from AMD to ARB. 1316 Update to be relative to OpenGL 4.2 (+GLSL 4.20). 1317 Add <shared> variables. 1318 Add issues (1) - (4). 1319 Add link failure for programs that contain 1320 compute and non-compute shaders. 1321 2 06/10/11 gsellers Add error behavior. 1322 Shading language changes. 1323 Add global_offset parameter. 1324 Add implementation dependent limits. 1325 1 09/24/10 gsellers Initial revision 1326