Name

    NV_fragment_shader_interlock

Name Strings

    GL_NV_fragment_shader_interlock

Contact

    Pat Brown, NVIDIA Corporation (pbrown 'at' nvidia.com)

Contributors

    Jeff Bolz, NVIDIA Corporation
    Mathias Heyer, NVIDIA Corporation

Status

    Shipping

Version

    Last Modified Date:         March 27, 2015
    NVIDIA Revision:            2

Number

    OpenGL Extension #468
    OpenGL ES Extension #230

Dependencies

    This extension is written against the OpenGL 4.3 Specification
    (Compatibility Profile, dated February 14, 2013) and the
    OpenGL ES 3.1.0 Specification (dated March 17, 2014).

    This extension is written against the OpenGL Shading Language
    Specification (version 4.30, revision 8) and the OpenGL ES Shading
    Language Specification (version 3.10, revision 2).

    OpenGL 4.3 and GLSL 4.30 are required in an OpenGL implementation.
    OpenGL ES 3.1 and GLSL ES 3.10 are required in an OpenGL ES
    implementation.

    This extension interacts with NV_shader_buffer_load and
    NV_shader_buffer_store.

    This extension interacts with NV_gpu_program4 and NV_gpu_program5.

    This extension interacts with EXT_tessellation_shader.

    This extension interacts with OES_sample_shading.

    This extension interacts with OES_shader_multisample_interpolation.

    This extension interacts with OES_shader_image_atomic.

Overview

    In unextended OpenGL 4.3 or OpenGL ES 3.1, applications may produce a
    large number of fragment shader invocations that perform loads and
    stores to memory using image uniforms, atomic counter uniforms,
    buffer variables, or pointers. The order in which loads and stores
    to common addresses are performed by different fragment shader
    invocations is largely undefined.
    For algorithms that use shader writes and touch the same pixels more
    than once, one or more of the following techniques may be required to
    ensure proper execution ordering:

      * inserting Finish or WaitSync commands to drain the pipeline between
        different "passes" or "layers";

      * using only atomic memory operations to write to shader memory (which
        may be relatively slow and limits how memory may be updated); or

      * injecting spin loops into shaders to prevent multiple shader
        invocations from touching the same memory concurrently.

    This extension provides new GLSL built-in functions
    beginInvocationInterlockNV() and endInvocationInterlockNV() that delimit a
    critical section of fragment shader code. For pairs of shader invocations
    with "overlapping" coverage in a given pixel, the OpenGL implementation
    will guarantee that the critical section of the fragment shader will be
    executed for only one fragment at a time.

    There are four different interlock modes supported by this extension,
    which are identified by layout qualifiers. The qualifiers
    "pixel_interlock_ordered" and "pixel_interlock_unordered" provide mutual
    exclusion in the critical section for any pair of fragments corresponding
    to the same pixel. When using multisampling, the qualifiers
    "sample_interlock_ordered" and "sample_interlock_unordered" only provide
    mutual exclusion for pairs of fragments that both cover at least one
    common sample in the same pixel; these are recommended for performance if
    shaders use per-sample data structures.

    Additionally, when the "pixel_interlock_ordered" or
    "sample_interlock_ordered" layout qualifier is used, the interlock also
    guarantees that the critical section for multiple shader invocations with
    "overlapping" coverage will be executed in the order in which the
    primitives were processed by the GL.
    Such a guarantee is useful for
    applications like blending in the fragment shader, where an application
    requires fragment values to be composited in the framebuffer in
    primitive order.

    This extension can be useful for algorithms that need to access per-pixel
    data structures via shader loads and stores. Such algorithms using this
    extension can access such data structures in the critical section without
    worrying about other invocations for the same pixel accessing the data
    structures concurrently. Additionally, the ordering guarantees are useful
    for cases where the API ordering of fragments is meaningful. For example,
    applications may be able to execute programmable blending operations in
    the fragment shader, where the destination buffer is read via image loads
    and the final value is written via image stores.

New Procedures and Functions

    None.

New Tokens

    None.

Modifications to the OpenGL 4.3 Specification (Compatibility Profile)

    None.

Modifications to the OpenGL Shading Language Specification, Version 4.30

    Including the following line in a shader can be used to control the
    language features described in this extension:

      #extension GL_NV_fragment_shader_interlock : <behavior>

    where <behavior> is as specified in section 3.3.

    New preprocessor #defines are added to the OpenGL Shading Language:

      #define GL_NV_fragment_shader_interlock           1


    Modify Section 4.4.1.3, Fragment Shader Inputs (p. 58)

    (add to the list of layout qualifiers containing "early_fragment_tests",
    p. 59, and modify the surrounding language to reflect that multiple
    layout qualifiers are supported on "in")

      layout-qualifier-id
        pixel_interlock_ordered
        pixel_interlock_unordered
        sample_interlock_ordered
        sample_interlock_unordered

    (add to the end of the section, p.
    59)

    The identifiers "pixel_interlock_ordered", "pixel_interlock_unordered",
    "sample_interlock_ordered", and "sample_interlock_unordered" control the
    ordering of the execution of shader invocations between calls to the
    built-in functions beginInvocationInterlockNV() and
    endInvocationInterlockNV(), as described in section 8.13.3. A
    compile or link error will be generated if more than one of these layout
    qualifiers is specified in shader code. If a program containing a
    fragment shader includes none of these layout qualifiers, it is as
    though "pixel_interlock_ordered" were specified.

    Add to the end of Section 8.13, Fragment Processing Functions (p. 168)

    8.13.3, Fragment Shader Execution Ordering Functions

    By default, fragment shader invocations are generally executed in
    undefined order. Multiple fragment shader invocations may be executed
    concurrently, including multiple invocations corresponding to a single
    pixel. Additionally, fragment shader invocations for a single pixel might
    not be processed in the order in which the primitives generating the
    fragments were specified in the OpenGL API.

    The paired functions beginInvocationInterlockNV() and
    endInvocationInterlockNV() allow shaders to specify a critical section,
    inside which stronger execution ordering is guaranteed. When using the
    "pixel_interlock_ordered" or "pixel_interlock_unordered" qualifier,
    ordering guarantees are provided for any pair of fragment shader
    invocations X and Y triggered by fragments A and B corresponding to the
    same pixel. When using the "sample_interlock_ordered" or
    "sample_interlock_unordered" qualifier, ordering guarantees are provided
    for any pair of fragment shader invocations X and Y triggered by fragments
    A and B that correspond to the same pixel, where at least one sample of
    the pixel is covered by both fragments.
    No ordering guarantees are
    provided for pairs of fragment shader invocations corresponding to
    different pixels. Additionally, no ordering guarantees are provided for
    pairs of fragment shader invocations corresponding to the same fragment.
    When multisampling is enabled and the framebuffer has sample buffers,
    multiple fragment shader invocations may result from a single fragment due
    to the use of the "sample" auxiliary storage qualifier, OpenGL API
    commands forcing multiple shader invocations per fragment, or for other
    implementation-dependent reasons.

    When using the "pixel_interlock_unordered" or "sample_interlock_unordered"
    qualifier, the interlock will ensure that the critical sections of
    fragment shader invocations X and Y with overlapping coverage will never
    execute concurrently. That is, invocation X is guaranteed to complete its
    call to endInvocationInterlockNV() before invocation Y completes its call
    to beginInvocationInterlockNV(), or vice versa.

    When using the "pixel_interlock_ordered" or "sample_interlock_ordered"
    layout qualifier, the critical sections of invocations X and Y with
    overlapping coverage will be executed in a specific order, based on the
    relative order assigned to their fragments A and B. If fragment A is
    considered to precede fragment B, the critical section of invocation X is
    guaranteed to complete before the critical section of invocation Y begins.
    When a pair of fragments A and B have overlapping coverage, fragment A is
    considered to precede fragment B if

      * the OpenGL API command producing fragment A was called prior to the
        command producing B, or

      * the point, line, triangle, [[compatibility profile: quadrilateral,
        polygon,]] or patch primitive producing fragment A appears earlier in
        the same strip, loop, fan, or independent primitive list producing
        fragment B.
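
    As a non-normative illustration of the ordered mode described above, the
    following sketch performs a "source over" blend in the fragment shader.
    The names destImage and srcColor, the rgba8 format, and binding point 0
    are assumptions made for this example only; they are not part of the
    extension. Because "pixel_interlock_ordered" is declared, the
    read-modify-write of each pixel should be applied in primitive order.

```glsl
#version 430
#extension GL_NV_fragment_shader_interlock : require

// Ordered, per-pixel interlock (also the default when no qualifier is given).
layout(pixel_interlock_ordered) in;

// Hypothetical destination image; "coherent" makes stores from one
// invocation visible to loads in a later critical section.
layout(binding = 0, rgba8) coherent uniform image2D destImage;

// Hypothetical input carrying the color to blend.
in vec4 srcColor;

void main()
{
    ivec2 coord = ivec2(gl_FragCoord.xy);

    // Both calls are in main() and outside any flow control, as required.
    beginInvocationInterlockNV();

    vec4 dst = imageLoad(destImage, coord);
    vec4 blended = vec4(srcColor.rgb * srcColor.a + dst.rgb * (1.0 - srcColor.a),
                        srcColor.a + dst.a * (1.0 - srcColor.a));
    imageStore(destImage, coord, blended);

    endInvocationInterlockNV();
}
```

    Swapping the qualifier for "pixel_interlock_unordered" would keep the
    mutual exclusion but drop the primitive-order guarantee, which is
    generally not what order-dependent blending needs.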

    When [[compatibility profile: decomposing quadrilateral or polygon
    primitives or]] tessellating a single patch primitive, multiple
    primitives may be generated in an undefined implementation-dependent
    order. When fragments A and B are generated from such unordered
    primitives, their ordering is also implementation-dependent.

    If fragment shader X completes its critical section before fragment shader
    Y begins its critical section, all stores to memory performed in the
    critical section of invocation X using a pointer, image uniform, atomic
    counter uniform, or buffer variable qualified by "coherent" are guaranteed
    to be visible to any reads of the same types of variable performed in the
    critical section of invocation Y.

    If multisampling is disabled, or if the framebuffer does not include
    sample buffers, fragment coverage is computed per-pixel. In this case,
    the "sample_interlock_ordered" or "sample_interlock_unordered" layout
    qualifiers are treated as "pixel_interlock_ordered" or
    "pixel_interlock_unordered", respectively.


    Syntax:

      void beginInvocationInterlockNV(void);
      void endInvocationInterlockNV(void);

    Description:

    The beginInvocationInterlockNV() and endInvocationInterlockNV() functions
    may only be placed inside the function main() of a fragment shader and
    may not be called within any flow control. These functions may not be
    called after a return statement in the function main(), but may be called
    after a discard statement. A compile- or link-time error will be
    generated if main() calls either function more than once, contains a call
    to one function without a matching call to the other, or calls
    endInvocationInterlockNV() before calling beginInvocationInterlockNV().

Additions to the AGL/GLX/WGL Specifications

    None.

Errors

    None.

New State

    None.

New Implementation Dependent State

    None.

Interactions with OpenGL ES 3.1

    Disabling multisample rasterization is not available on OpenGL ES;
    it is always enabled.


Dependencies on EXT_tessellation_shader

    If this extension is implemented on OpenGL ES and EXT_tessellation_shader
    is not supported, remove language referring to tessellation of patch
    primitives.


Dependencies on OES_sample_shading

    If this extension is implemented on OpenGL ES and OES_sample_shading
    is not supported, remove references to per-sample shading via
    MinSampleShading[OES]().


Dependencies on OES_shader_image_atomic

    If this extension is implemented on OpenGL ES and OES_shader_image_atomic
    is not supported, disregard language referring to atomic memory operations.


Dependencies on OES_shader_multisample_interpolation

    If this extension is implemented on OpenGL ES and
    OES_shader_multisample_interpolation is not supported, ignore language
    about the "sample" auxiliary storage qualifier.


Dependencies on NV_shader_buffer_load and NV_shader_buffer_store

    If NV_shader_buffer_load and NV_shader_buffer_store are not supported,
    references to ordering memory accesses using pointers should be deleted.


Dependencies on NV_gpu_program4 and NV_fragment_program4

    Modify Section 2.X.2, Program Grammar, of the NV_fragment_program4
    specification (which modifies the NV_gpu_program4 base grammar)

      <SpecialInstruction>    ::= "FSIB"
                                | "FSIE"


    Modify Section 2.X.4, Program Execution Environment

    (add to the opcode table)

                  Modifiers
      Instruction F I C S H D  Out Inputs    Description
      ----------- - - - - - -  --- --------  --------------------------------
      FSIB        - - - - - -  -   -         begin fragment shader interlock
      FSIE        - - - - - -  -   -         end fragment shader interlock


    Modify Section 2.X.6, Program Options

    + Fragment Shader Interlock (NV_pixel_interlock_ordered,
      NV_pixel_interlock_unordered, NV_sample_interlock_ordered, and
      NV_sample_interlock_unordered)

    If a fragment program specifies the "NV_pixel_interlock_ordered",
    "NV_pixel_interlock_unordered", "NV_sample_interlock_ordered", or
    "NV_sample_interlock_unordered" options, it will configure a critical
    section using the FSIB (fragment shader interlock begin) and FSIE
    (fragment shader interlock end) opcodes. The execution of the critical
    sections will be ordered for pairs of program invocations corresponding to
    the same pixel, as described in Section 8.13.3 of the OpenGL Shading
    Language Specification, where the four options are considered to specify
    layout qualifiers with names matching the program options.

    A program will fail to load if it specifies more than one of these program
    options, if it specifies exactly one of these options but does not contain
    exactly one FSIB instruction and one FSIE instruction, or if it contains
    an FSIB or FSIE instruction without specifying any of these options.


    Add the following subsections to section 2.X.8, Program Instruction Set


    Section 2.X.8.Z, FSIB: Fragment Shader Interlock Begin

    The FSIB instruction specifies the beginning of a critical section in a
    fragment program, where execution of the critical section is ordered
    relative to other fragments. This instruction has no other effect.

    The FSIB instruction is not allowed in arbitrary locations in a program.
    A program will fail to load if it includes an FSIB instruction inside an
    IF/ELSE/ENDIF block, inside a REP/ENDREP block, or inside any subroutine
    block other than the one labeled "main". Additionally, a program will
    fail to load if it contains more than one FSIB instruction, or if its one
    FSIB instruction is not followed by an FSIE instruction.

    FSIB has no operands and generates no result.


    Section 2.X.8.Z, FSIE: Fragment Shader Interlock End

    The FSIE instruction specifies the end of a critical section in a fragment
    program, where execution of the critical section is ordered relative to
    other fragments. This instruction has no other effect.

    The FSIE instruction is not allowed in arbitrary locations in a program.
    A program will fail to load if it includes an FSIE instruction inside an
    IF/ELSE/ENDIF block, inside a REP/ENDREP block, or inside any subroutine
    block other than the one labeled "main". Additionally, a program will
    fail to load if it contains more than one FSIE instruction, or if its one
    FSIE instruction is not preceded by an FSIB instruction.

    FSIE has no operands and generates no result.

Issues

    (1) What should this extension be called?

      RESOLVED: NV_fragment_shader_interlock. The
      beginInvocationInterlockNV() and endInvocationInterlockNV() commands
      identify a critical section during which other invocations with
      overlapping coverage are locked out until the critical section
      completes.

    (2) When using multisampling, the OpenGL specification permits
        multiple fragment shader invocations to be generated for a single
        fragment. For example, per-sample shading using the "sample"
        auxiliary storage qualifier or the MinSampleShading() OpenGL API
        command can be used to force per-sample shading. What execution
        ordering guarantees are provided between fragment shader invocations
        generated from the same fragment?

      RESOLVED: We don't provide any ordering guarantees in this extension.
      This implies that when using multisampling, there is no guarantee that
      two fragment shader invocations for the same fragment won't be executing
      their critical sections concurrently. This could cause problems for
      algorithms sharing data structures between all the samples of a pixel
      unless accesses to these data structures are performed atomically.

      When using per-sample shading, the interlock we provide *does* guarantee
      that no two invocations corresponding to the same sample execute the
      critical section concurrently. If a separate set of data structures is
      provided for each sample, no conflicts should occur within the critical
      section.

      Note that in addition to the per-sample shading options in the shading
      language and API, implementations may provide multisample antialiasing
      modes where the implementation can't simply run the fragment shader once
      and broadcast results to a large set of covered samples.

    (3) What performance differences are expected between shaders using the
        "pixel" and "sample" layout qualifier variants in this extension (e.g.,
        "pixel_interlock_ordered" and "sample_interlock_ordered")?

      RESOLVED: We expect that shaders using "sample" qualifiers may have
      higher performance, since the implementation need not order pairs of
      fragments that touch the same pixel with "complementary" coverage.
      Such
      situations are fairly common: when two adjacent triangles combine to
      cover a given pixel, two fragments will be generated for the pixel but
      no sample will be covered by both. When using "sample" qualifiers, the
      invocations for both fragments can run concurrently. When using "pixel"
      qualifiers, the critical section for one fragment must wait until the
      critical section for the other fragment completes.

    (4) What performance differences are expected between shaders using the
        "ordered" and "unordered" layout qualifier variants in this extension
        (e.g., "pixel_interlock_ordered" and "pixel_interlock_unordered")?

      RESOLVED: We expect that shaders using "unordered" may have higher
      performance, since the critical section implementation doesn't need to
      ensure that all previous invocations with overlapping coverage have
      completed their critical sections. Some algorithms (e.g., building data
      structures in order-independent transparency algorithms) will require
      mutual exclusion when updating per-pixel data structures, but do not
      require that shaders execute in a specific ordering.

    (5) Are fragment shaders using this extension allowed to write outputs?
        If so, is there any guarantee on the order in which such outputs are
        written to the framebuffer?

      RESOLVED: Yes, fragment shaders with critical sections may still write
      outputs. If fragment shader outputs are written, they are stored or
      blended into the framebuffer in API order, as is the case for fragment
      shaders not using this extension.

    (6) What considerations apply when using this extension to implement a
        programmable form of conventional blending using image stores?

      RESOLVED: Per-fragment operations performed in the pipeline following
      fragment shader execution obviously have no effect on image stores
      executing during fragment shader execution.
      In particular, multisample
      operations such as broadcasting a single fragment output to multiple
      samples or modifying the coverage with alpha-to-coverage or a shader
      coverage mask output value have no effect. Fragments cannot be killed
      before fragment shader blending using the fixed-function alpha test or
      using the depth test with a Z value produced by the shader. Fragments
      will normally not be killed by fixed-function depth or stencil tests,
      but those tests can be enabled before fragment shader invocations using
      the layout qualifier "early_fragment_tests". Any required
      fixed-function features that need to be handled before programmable
      blending that aren't enabled by "early_fragment_tests" would need to be
      emulated in the shader.

      Note also that performing blend computations in the shader is not
      guaranteed to produce results that are bit-identical to those produced
      by fixed-function blending hardware, even if mathematically equivalent
      algorithms are used.

    (7) For operations accessing shared per-pixel data structures in the
        critical section, what operations (if any) must be performed in shader
        code to ensure that stores from one shader invocation are visible to
        the next?

      RESOLVED: The "coherent" qualifier is required in the declaration of
      the shared data structures to ensure that writes performed by one
      invocation are visible to reads performed by another invocation.

      In shaders that don't use the interlock, "coherent" is not sufficient as
      there is no guarantee of the ordering of fragment shader invocations --
      even if invocation A can see the values written by another invocation B,
      there is no general guarantee that invocation A's read will be performed
      before invocation B's write.
      The built-in function memoryBarrier() can
      be used to generate a weak ordering by which threads can communicate,
      but it doesn't order memory transactions between two separate
      invocations. With the interlock, execution ordering between two threads
      from the same pixel is well-defined as long as the loads and stores are
      performed inside the critical section, and the use of "coherent" ensures
      that stores done by one invocation are visible to other invocations.

    (8) Should we provide an explicit mechanism for shaders to indicate a
        critical section? Or should we just automatically infer a critical
        section by analyzing shader code? Or should we just wrap the entire
        fragment shader in a critical section?

      RESOLVED: Provide an explicit critical section.

      We definitely don't want to wrap the entire shader in a critical section
      when a smaller section will suffice. Doing so would hold off the
      execution of any other fragment shader invocation with the same (x,y)
      for the entire (potentially long) life of the fragment shader. Hardware
      would need to track a large number of fragments awaiting execution, and
      may be so backed up that further fragments will be blocked even if they
      don't overlap with any fragments currently executing. Providing a
      smaller critical section reduces the amount of time other fragments are
      blocked and allows implementations to perform useful work for
      conflicting fragments before they hit the critical section.

      While a compiler could analyze the code and wrap a critical section
      around all memory accesses, it may be difficult to determine which
      accesses actually require mutual exclusion and ordering, and which
      accesses are safe to do with no protection. Requiring shaders to
      explicitly identify a critical section doesn't seem overwhelmingly
      burdensome, and allows applications to exclude memory accesses that they
      know to be "safe".
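
      The resolutions of issues (7) and (8) can be sketched together as
      follows. This is only an illustration under stated assumptions: the
      image names nearestDepth and nearestColor, their formats and bindings,
      the inputs normal and baseColor, and the fixed light direction are all
      hypothetical, and the application is assumed to have cleared
      nearestDepth to 1.0. The shading work stays outside the critical
      section, and only the two-image update that needs mutual exclusion is
      placed inside it.

```glsl
#version 430
#extension GL_NV_fragment_shader_interlock : require

// Mutual exclusion only; this update does not depend on primitive order.
layout(pixel_interlock_unordered) in;

// Hypothetical per-pixel data, declared "coherent" so that stores are
// visible to loads in other invocations' critical sections.
layout(binding = 0, r32f)  coherent uniform image2D nearestDepth;
layout(binding = 1, rgba8) coherent uniform image2D nearestColor;

// Hypothetical inputs from the previous stage.
in vec3 normal;
in vec4 baseColor;

void main()
{
    // Work that touches no shared memory runs before the critical section,
    // so conflicting fragments are not held up by it.
    float diffuse = max(dot(normalize(normal), vec3(0.0, 0.0, 1.0)), 0.0);
    vec4 shaded = vec4(baseColor.rgb * diffuse, baseColor.a);

    ivec2 coord = ivec2(gl_FragCoord.xy);
    float depth = gl_FragCoord.z;

    beginInvocationInterlockNV();

    // Flow control inside the section is used here; only the begin/end
    // calls themselves are placed outside of flow control.
    float stored = imageLoad(nearestDepth, coord).x;
    if (depth < stored) {
        imageStore(nearestDepth, coord, vec4(depth));
        imageStore(nearestColor, coord, shaded);
    }

    endInvocationInterlockNV();
}
```

      A single atomic operation could not keep the depth and color images
      consistent with each other, which is why this sketch uses the interlock
      rather than atomics.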

    (9) What restrictions should be imposed on the use of the
        beginInvocationInterlockNV() and endInvocationInterlockNV() functions
        delimiting a critical section?

      RESOLVED: We impose restrictions similar to those on the barrier()
      built-in function in tessellation control shaders to ensure that any
      shader using this functionality has a single critical section that can
      be easily identified during compilation. In particular, we require that
      these functions be called in main() and don't permit them to be called
      in conditional flow control.

      These restrictions ensure that there is always exactly one call to the
      "begin" and "end" functions in a predictable location in the compiled
      shader code, and ensure that the compiler and hardware don't have to
      deal with unusual cases (like entering a critical section and never
      leaving, leaving a critical section without entering it, or trying to
      enter a critical section more than once).

Revision History

    Revision 2, 2015/03/27
      - Add ES interactions

    Revision 1
      - Internal revisions