Name

    ARB_fragment_shader_interlock

Name Strings

    GL_ARB_fragment_shader_interlock

Contact

    Slawomir Grajewski, Intel (slawomir.grajewski 'at' intel.com)

Contributors

    Contributors to INTEL_fragment_shader_ordering
    Contributors to NV_fragment_shader_interlock

Notice

    Copyright (c) 2015 The Khronos Group Inc. Copyright terms at
        http://www.khronos.org/registry/speccopyright.html

Specification Update Policy

    Khronos-approved extension specifications are updated in response to
    issues and bugs prioritized by the Khronos OpenGL Working Group. For
    extensions which have been promoted to a core Specification, fixes will
    first appear in the latest version of that core Specification, and will
    eventually be backported to the extension document. This policy is
    described in more detail at
        https://www.khronos.org/registry/OpenGL/docs/update_policy.php

Status

    Complete. Approved by the ARB on June 26, 2015.
    Ratified by the Khronos Board of Promoters on August 7, 2015.

Version

    Last Modified Date: May 7, 2015
    Revision: 2

Number

    ARB Extension #177

Dependencies

    This extension is written against the OpenGL 4.5 (Core Profile)
    Specification.

    This extension is written against version 4.50 (revision 5) of the
    OpenGL Shading Language Specification.

    OpenGL 4.2 or ARB_shader_image_load_store is required; GLSL 4.20 is
    required.

Overview

    In unextended OpenGL 4.5, applications may produce a large number of
    fragment shader invocations that perform loads and stores to memory
    using image uniforms, atomic counter uniforms, buffer variables, or
    pointers. The order in which loads and stores to common addresses are
    performed by different fragment shader invocations is largely
    undefined.
    For algorithms that use shader writes and touch the same pixels more
    than once, one or more of the following techniques may be required to
    ensure proper execution ordering:

      * inserting Finish or WaitSync commands to drain the pipeline
        between different "passes" or "layers";

      * using only atomic memory operations to write to shader memory
        (which may be relatively slow and limits how memory may be
        updated); or

      * injecting spin loops into shaders to prevent multiple shader
        invocations from touching the same memory concurrently.

    This extension provides new GLSL built-in functions
    beginInvocationInterlockARB() and endInvocationInterlockARB() that
    delimit a critical section of fragment shader code. For pairs of
    shader invocations with "overlapping" coverage in a given pixel, the
    OpenGL implementation will guarantee that the critical section of the
    fragment shader will be executed for only one fragment at a time.

    There are four different interlock modes supported by this extension,
    which are identified by layout qualifiers. The qualifiers
    "pixel_interlock_ordered" and "pixel_interlock_unordered" provide
    mutual exclusion in the critical section for any pair of fragments
    corresponding to the same pixel. When using multisampling, the
    qualifiers "sample_interlock_ordered" and "sample_interlock_unordered"
    only provide mutual exclusion for pairs of fragments that both cover
    at least one common sample in the same pixel; these are recommended
    for performance if shaders use per-sample data structures.

    Additionally, when the "pixel_interlock_ordered" or
    "sample_interlock_ordered" layout qualifier is used, the interlock
    also guarantees that the critical section for multiple shader
    invocations with "overlapping" coverage will be executed in the order
    in which the primitives were processed by the GL.
    Such a guarantee is useful for applications like blending in the
    fragment shader, where an application requires that fragment values be
    composited in the framebuffer in primitive order.

    This extension can be useful for algorithms that need to access
    per-pixel data structures via shader loads and stores. Such algorithms
    using this extension can access such data structures in the critical
    section without worrying about other invocations for the same pixel
    accessing the data structures concurrently. Additionally, the ordering
    guarantees are useful for cases where the API ordering of fragments is
    meaningful. For example, applications may be able to execute
    programmable blending operations in the fragment shader, where the
    destination buffer is read via image loads and the final value is
    written via image stores.

New Procedures and Functions

    None.

New Tokens

    None.

Modifications to the OpenGL Shading Language Specification, Version 4.50

    Including the following line in a shader can be used to control the
    language features described in this extension:

      #extension GL_ARB_fragment_shader_interlock : <behavior>

    where <behavior> is as specified in section 3.3.

    New preprocessor #defines are added to the OpenGL Shading Language:

      #define GL_ARB_fragment_shader_interlock 1


    Modify Section 4.4.1.3, Fragment Shader Inputs (p. 63)

    (add to the list of layout qualifiers containing
    "early_fragment_tests", p. 63, and modify the surrounding language to
    reflect that multiple layout qualifiers are supported on "in")

      layout-qualifier-id
        pixel_interlock_ordered
        pixel_interlock_unordered
        sample_interlock_ordered
        sample_interlock_unordered

    (add to the end of the section, p. 63)

    The identifiers "pixel_interlock_ordered", "pixel_interlock_unordered",
    "sample_interlock_ordered", and "sample_interlock_unordered" control
    the ordering of the execution of shader invocations between calls to
    the built-in functions beginInvocationInterlockARB() and
    endInvocationInterlockARB(), as described in section 8.13.3. A compile
    or link error will be generated if more than one of these layout
    qualifiers is specified in shader code. If a program containing a
    fragment shader includes none of these layout qualifiers, it is as
    though "pixel_interlock_ordered" were specified.

    Add to the end of Section 8.13, Fragment Processing Functions (p. 170)

    8.13.3, Fragment Shader Execution Ordering Functions

    By default, fragment shader invocations are generally executed in
    undefined order. Multiple fragment shader invocations may be executed
    concurrently, including multiple invocations corresponding to a single
    pixel. Additionally, fragment shader invocations for a single pixel
    might not be processed in the order in which the primitives generating
    the fragments were specified in the OpenGL API.

    The paired functions beginInvocationInterlockARB() and
    endInvocationInterlockARB() allow shaders to specify a critical
    section, inside which stronger execution ordering is guaranteed. When
    using the "pixel_interlock_ordered" or "pixel_interlock_unordered"
    qualifier, ordering guarantees are provided for any pair of fragment
    shader invocations X and Y triggered by fragments A and B
    corresponding to the same pixel. When using the
    "sample_interlock_ordered" or "sample_interlock_unordered" qualifier,
    ordering guarantees are provided for any pair of fragment shader
    invocations X and Y triggered by fragments A and B that correspond to
    the same pixel, where at least one sample of the pixel is covered by
    both fragments.
    No ordering guarantees are provided for pairs of fragment shader
    invocations corresponding to different pixels. Additionally, no
    ordering guarantees are provided for pairs of fragment shader
    invocations corresponding to the same fragment. When multisampling is
    enabled and the framebuffer has sample buffers, multiple fragment
    shader invocations may result from a single fragment due to the use of
    the "sample" auxiliary storage qualifier, OpenGL API commands forcing
    multiple shader invocations per fragment, or for other
    implementation-dependent reasons.

    When using the "pixel_interlock_unordered" or
    "sample_interlock_unordered" qualifier, the interlock will ensure that
    the critical sections of fragment shader invocations X and Y with
    overlapping coverage will never execute concurrently. That is,
    invocation X is guaranteed to complete its call to
    endInvocationInterlockARB() before invocation Y completes its call to
    beginInvocationInterlockARB(), or vice versa.

    When using the "pixel_interlock_ordered" or "sample_interlock_ordered"
    layout qualifier, the critical sections of invocations X and Y with
    overlapping coverage will be executed in a specific order, based on
    the relative order assigned to their fragments A and B. If fragment A
    is considered to precede fragment B, the critical section of
    invocation X is guaranteed to complete before the critical section of
    invocation Y begins. When a pair of fragments A and B have overlapping
    coverage, fragment A is considered to precede fragment B if

      * the OpenGL API command producing fragment A was called prior to
        the command producing B, or

      * the point, line, triangle, [[compatibility profile:
        quadrilateral, polygon,]] or patch primitive producing fragment A
        appears earlier in the same strip, loop, fan, or independent
        primitive list than the primitive producing fragment B.
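    As a non-normative illustration of the functionality described above,
    a fragment shader performing programmable "over" blending inside an
    ordered critical section might be sketched as follows; the image
    uniform name, input name, and blend math are hypothetical:

      ```glsl
      #version 450
      #extension GL_ARB_fragment_shader_interlock : require

      // Request mutual exclusion plus primitive-order execution of the
      // critical section for overlapping fragments of the same pixel.
      layout(pixel_interlock_ordered) in;

      // Hypothetical destination buffer; "coherent" makes stores by one
      // invocation visible to reads by later invocations.
      layout(binding = 0, rgba8) coherent uniform image2D colorBuf;

      in vec4 srcColor;

      void main()
      {
          ivec2 coord = ivec2(gl_FragCoord.xy);

          beginInvocationInterlockARB();
          // Critical section: read-modify-write of the destination
          // pixel, executed by at most one overlapping invocation at a
          // time, in primitive order.
          vec4 dst = imageLoad(colorBuf, coord);
          imageStore(colorBuf, coord, srcColor + dst * (1.0 - srcColor.a));
          endInvocationInterlockARB();
      }
      ```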
    When [[compatibility profile: decomposing quadrilateral or polygon
    primitives or]] tessellating a single patch primitive, multiple
    primitives may be generated in an undefined implementation-dependent
    order. When fragments A and B are generated from such unordered
    primitives, their ordering is also implementation-dependent.

    If fragment shader X completes its critical section before fragment
    shader Y begins its critical section, all stores to memory performed
    in the critical section of invocation X using a pointer, image
    uniform, atomic counter uniform, or buffer variable qualified by
    "coherent" are guaranteed to be visible to any reads of the same types
    of variable performed in the critical section of invocation Y.

    If multisampling is disabled, or if the framebuffer does not include
    sample buffers, fragment coverage is computed per-pixel. In this case,
    the "sample_interlock_ordered" or "sample_interlock_unordered" layout
    qualifiers are treated as "pixel_interlock_ordered" or
    "pixel_interlock_unordered", respectively.

    Syntax:

      void beginInvocationInterlockARB(void);
      void endInvocationInterlockARB(void);

    Description:

    The beginInvocationInterlockARB() and endInvocationInterlockARB()
    functions may only be placed inside the function main() of a fragment
    shader and may not be called within any flow control. These functions
    may not be called after a return statement in the function main(), but
    may be called after a discard statement. A compile- or link-time error
    will be generated if main() calls either function more than once,
    contains a call to one function without a matching call to the other,
    or calls endInvocationInterlockARB() before calling
    beginInvocationInterlockARB().

Additions to the AGL/GLX/WGL Specifications

    None.

Errors

    None.

New State

    None.
New Implementation Dependent State

    None.

Issues

    (1) When using multisampling, the OpenGL specification permits
    multiple fragment shader invocations to be generated for a single
    fragment. For example, the "sample" auxiliary storage qualifier or the
    MinSampleShading() OpenGL API command can be used to force per-sample
    shading. What execution ordering guarantees are provided between
    fragment shader invocations generated from the same fragment?

    RESOLVED: We don't provide any ordering guarantees in this extension.
    This implies that when using multisampling, there is no guarantee that
    two fragment shader invocations for the same fragment won't be
    executing their critical sections concurrently. This could cause
    problems for algorithms sharing data structures between all the
    samples of a pixel unless accesses to these data structures are
    performed atomically.

    When using per-sample shading, the interlock we provide *does*
    guarantee that no two invocations corresponding to the same sample
    execute the critical section concurrently. If a separate set of data
    structures is provided for each sample, no conflicts should occur
    within the critical section.

    Note that in addition to the per-sample shading options in the shading
    language and API, implementations may provide multisample antialiasing
    modes where the implementation can't simply run the fragment shader
    once and broadcast results to a large set of covered samples.

    (2) What performance differences are expected between shaders using
    the "pixel" and "sample" layout qualifier variants in this extension
    (e.g., "pixel_interlock_ordered" and "sample_interlock_ordered")?
    RESOLVED: We expect that shaders using "sample" qualifiers may have
    higher performance, since the implementation need not order pairs of
    fragments that touch the same pixel with "complementary" coverage.
    Such situations are fairly common: when two adjacent triangles combine
    to cover a given pixel, two fragments will be generated for the pixel
    but no sample will be covered by both. When using "sample" qualifiers,
    the invocations for both fragments can run concurrently. When using
    "pixel" qualifiers, the critical section for one fragment must wait
    until the critical section for the other fragment completes.

    (3) What performance differences are expected between shaders using
    the "ordered" and "unordered" layout qualifier variants in this
    extension (e.g., "pixel_interlock_ordered" and
    "pixel_interlock_unordered")?

    RESOLVED: We expect that shaders using "unordered" may have higher
    performance, since the critical section implementation doesn't need to
    ensure that all previous invocations with overlapping coverage have
    completed their critical sections. Some algorithms (e.g., building
    data structures in order-independent transparency algorithms) will
    require mutual exclusion when updating per-pixel data structures, but
    do not require that shaders execute in a specific order.

    (4) Are fragment shaders using this extension allowed to write
    outputs? If so, is there any guarantee on the order in which such
    outputs are written to the framebuffer?

    RESOLVED: Yes, fragment shaders with critical sections may still write
    outputs. If fragment shader outputs are written, they are stored or
    blended into the framebuffer in API order, as is the case for fragment
    shaders not using this extension.

    (5) What considerations apply when using this extension to implement a
    programmable form of conventional blending using image stores?
    RESOLVED: Per-fragment operations performed in the pipeline following
    fragment shader execution obviously have no effect on image stores
    executing during fragment shader execution. In particular, multisample
    operations such as broadcasting a single fragment output to multiple
    samples or modifying the coverage with alpha-to-coverage or a shader
    coverage mask output value have no effect. Fragments cannot be killed
    before fragment shader blending using the fixed-function alpha test or
    using the depth test with a Z value produced by the shader. Fragments
    will normally not be killed by fixed-function depth or stencil tests,
    but those tests can be enabled before fragment shader invocations
    using the layout qualifier "early_fragment_tests". Any required
    fixed-function features that need to be handled before programmable
    blending that aren't enabled by "early_fragment_tests" would need to
    be emulated in the shader.

    Note also that performing blend computations in the shader is not
    guaranteed to produce results that are bit-identical to those produced
    by fixed-function blending hardware, even if mathematically equivalent
    algorithms are used.

    (6) For operations accessing shared per-pixel data structures in the
    critical section, what operations (if any) must be performed in shader
    code to ensure that stores from one shader invocation are visible to
    the next?

    RESOLVED: The "coherent" qualifier is required in the declaration of
    the shared data structures to ensure that writes performed by one
    invocation are visible to reads performed by another invocation.
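    For example (an illustrative sketch with hypothetical names), the
    per-pixel head-pointer image and fragment list of an
    order-independent transparency algorithm might be declared with
    "coherent" so that critical-section stores are visible across
    invocations:

      ```glsl
      // Hypothetical per-pixel data structures; "coherent" makes stores
      // by one invocation visible to reads by later invocations in the
      // critical section.
      layout(binding = 0, r32ui) coherent uniform uimage2D headPointers;
      layout(binding = 1, std430) coherent buffer FragmentList {
          uvec4 nodes[];   // packed color/depth/next records
      };
      ```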
    In shaders that don't use the interlock, "coherent" is not sufficient
    as there is no guarantee of the ordering of fragment shader
    invocations -- even if invocation A can see the values written by
    another invocation B, there is no general guarantee that invocation
    A's read will be performed before invocation B's write. The built-in
    function memoryBarrier() can be used to generate a weak ordering by
    which threads can communicate, but it doesn't order memory
    transactions between two separate invocations. With the interlock,
    execution ordering between two threads from the same pixel is
    well-defined as long as the loads and stores are performed inside the
    critical section, and the use of "coherent" ensures that stores done
    by one invocation are visible to other invocations.

    (7) Should we provide an explicit mechanism for shaders to indicate a
    critical section? Or should we just automatically infer a critical
    section by analyzing shader code? Or should we just wrap the entire
    fragment shader in a critical section?

    RESOLVED: Provide an explicit critical section.

    We definitely don't want to wrap the entire shader in a critical
    section when a smaller section will suffice. Doing so would hold off
    the execution of any other fragment shader invocation with the same
    (x,y) for the entire (potentially long) life of the fragment shader.
    Hardware would need to track a large number of fragments awaiting
    execution, and may be so backed up that further fragments will be
    blocked even if they don't overlap with any fragments currently
    executing. Providing a smaller critical section reduces the amount of
    time other fragments are blocked and allows implementations to perform
    useful work for conflicting fragments before they hit the critical
    section.
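    As a sketch of the point above (the image uniform and helper function
    are hypothetical), expensive per-fragment work can be kept outside the
    critical section, entering the interlock only for the shared
    read-modify-write:

      ```glsl
      #version 450
      #extension GL_ARB_fragment_shader_interlock : require

      layout(pixel_interlock_unordered) in;
      layout(binding = 0, rgba16f) coherent uniform image2D colorBuf;

      vec4 computeLighting();   // hypothetical expensive helper

      void main()
      {
          // Expensive, non-conflicting work stays outside the critical
          // section, so overlapping fragments are not blocked while it
          // runs.
          vec4 src = computeLighting();

          beginInvocationInterlockARB();
          // Only the shared per-pixel read-modify-write is interlocked.
          ivec2 p = ivec2(gl_FragCoord.xy);
          vec4 dst = imageLoad(colorBuf, p);
          imageStore(colorBuf, p, src + dst * (1.0 - src.a));
          endInvocationInterlockARB();
      }
      ```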
    While a compiler could analyze the code and wrap a critical section
    around all memory accesses, it may be difficult to determine which
    accesses actually require mutual exclusion and ordering, and which
    accesses are safe to perform with no protection. Requiring shaders to
    explicitly identify a critical section doesn't seem overwhelmingly
    burdensome, and allows an application to exclude memory accesses that
    it knows to be "safe".

    (8) What restrictions should be imposed on the use of the
    beginInvocationInterlockARB() and endInvocationInterlockARB()
    functions delimiting a critical section?

    RESOLVED: We impose restrictions similar to those on the barrier()
    built-in function in tessellation control shaders to ensure that any
    shader using this functionality has a single critical section that can
    be easily identified during compilation. In particular, we require
    that these functions be called in main() and don't permit them to be
    called in conditional flow control.

    These restrictions ensure that there is always exactly one call to the
    "begin" and "end" functions in a predictable location in the compiled
    shader code, and ensure that the compiler and hardware don't have to
    deal with unusual cases (like entering a critical section and never
    leaving, leaving a critical section without entering it, or trying to
    enter a critical section more than once).

Revision History

    Rev.  Date      Author        Changes
    ----  --------  ------------  ----------------------------------------
    1     04/01/15  S.Grajewski   Initial version merging
                                  INTEL_fragment_shader_ordering with
                                  NV_fragment_shader_interlock

    2     05/07/15  S.Grajewski   Built-in functions
                                  beginInvocationInterlockARB() and
                                  endInvocationInterlockARB() now have ARB
                                  suffixes.