Name

    NV_fragment_shader_interlock

Name Strings

    GL_NV_fragment_shader_interlock

Contact

    Pat Brown, NVIDIA Corporation (pbrown 'at' nvidia.com)

Contributors

    Jeff Bolz, NVIDIA Corporation
    Mathias Heyer, NVIDIA Corporation

Status

    Shipping

Version

    Last Modified Date:         March 27, 2015
    NVIDIA Revision:            2

Number

    OpenGL Extension #468
    OpenGL ES Extension #230

Dependencies

    This extension is written against the OpenGL 4.3
    (Compatibility Profile, dated February 14, 2013), and the
    OpenGL ES 3.1.0 (dated March 17, 2014) Specifications.

    This extension is written against the OpenGL Shading Language
    Specification (version 4.30, revision 8) and the OpenGL ES Shading
    Language Specification (version 3.10, revision 2).

    OpenGL 4.3 and GLSL 4.30 are required in an OpenGL implementation.
    OpenGL ES 3.1 and GLSL ES 3.10 are required in an OpenGL ES
    implementation.

    This extension interacts with NV_shader_buffer_load and
    NV_shader_buffer_store.

    This extension interacts with NV_gpu_program4 and NV_gpu_program5.

    This extension interacts with EXT_tessellation_shader.

    This extension interacts with OES_sample_shading.

    This extension interacts with OES_shader_multisample_interpolation.

    This extension interacts with OES_shader_image_atomic.

Overview

    In unextended OpenGL 4.3 or OpenGL ES 3.1, applications may produce a
    large number of fragment shader invocations that perform loads and
    stores to memory using image uniforms, atomic counter uniforms,
    buffer variables, or pointers. The order in which loads and stores
    to common addresses are performed by different fragment shader
    invocations is largely undefined.  For algorithms that use shader
    writes and touch the same pixels more than once, one or more of the
    following techniques may be required to ensure proper execution ordering:

      * inserting Finish or WaitSync commands to drain the pipeline between
        different "passes" or "layers";

      * using only atomic memory operations to write to shader memory (which
        may be relatively slow and limits how memory may be updated); or

      * injecting spin loops into shaders to prevent multiple shader
        invocations from touching the same memory concurrently.

    This extension provides new GLSL built-in functions
    beginInvocationInterlockNV() and endInvocationInterlockNV() that delimit a
    critical section of fragment shader code.  For pairs of shader invocations
    with "overlapping" coverage in a given pixel, the OpenGL implementation
    will guarantee that the critical section of the fragment shader will be
    executed for only one fragment at a time.

    There are four different interlock modes supported by this extension,
    which are identified by layout qualifiers.  The qualifiers
    "pixel_interlock_ordered" and "pixel_interlock_unordered" provide mutual
    exclusion in the critical section for any pair of fragments corresponding
    to the same pixel.  When using multisampling, the qualifiers
    "sample_interlock_ordered" and "sample_interlock_unordered" only provide
    mutual exclusion for pairs of fragments that both cover at least one
    common sample in the same pixel; these are recommended for performance if
    shaders use per-sample data structures.

    Additionally, when the "pixel_interlock_ordered" or
    "sample_interlock_ordered" layout qualifier is used, the interlock also
    guarantees that the critical section for multiple shader invocations with
    "overlapping" coverage will be executed in the order in which the
    primitives were processed by the GL.  Such a guarantee is useful for
    applications like blending in the fragment shader, where an application
    requires fragment values to be composited in the framebuffer in
    primitive order.

    This extension can be useful for algorithms that need to access per-pixel
    data structures via shader loads and stores.  Such algorithms using this
    extension can access such data structures in the critical section without
    worrying about other invocations for the same pixel accessing the data
    structures concurrently.  Additionally, the ordering guarantees are useful
    for cases where the API ordering of fragments is meaningful.  For example,
    applications may be able to execute programmable blending operations in
    the fragment shader, where the destination buffer is read via image loads
    and the final value is written via image stores.

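    As an illustration of the programmable blending use case, the following
    is a minimal sketch of a fragment shader that performs a "source over"
    blend with image loads and stores.  The image name "destImage", its
    binding point and format, and the input "srcColor" are assumptions made
    for this example and are not defined by this extension.

```glsl
#version 430
#extension GL_NV_fragment_shader_interlock : require

// Per-pixel mutual exclusion, with critical sections run in primitive order.
layout(pixel_interlock_ordered) in;

// Destination buffer accessed via image loads and stores.  The binding point
// and format are assumptions of this sketch.
layout(binding = 0, rgba8) coherent uniform image2D destImage;

in vec4 srcColor;   // hypothetical input from the previous shader stage

void main()
{
    ivec2 pixel = ivec2(gl_FragCoord.xy);

    beginInvocationInterlockNV();
    vec4 dst = imageLoad(destImage, pixel);
    // Conventional "source over" blend, written out by hand.
    vec4 result = vec4(srcColor.rgb * srcColor.a + dst.rgb * (1.0 - srcColor.a),
                       srcColor.a + dst.a * (1.0 - srcColor.a));
    imageStore(destImage, pixel, result);
    endInvocationInterlockNV();
}
```

    The ordered qualifier is chosen here because the result of this blend
    depends on the order in which overlapping fragments are composited.
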
New Procedures and Functions

    None.

New Tokens

    None.

Modifications to the OpenGL 4.3 Specification (Compatibility Profile)

    None.

Modifications to the OpenGL Shading Language Specification, Version 4.30

    Including the following line in a shader can be used to control the
    language features described in this extension:

      #extension GL_NV_fragment_shader_interlock : <behavior>

    where <behavior> is as specified in section 3.3.

    New preprocessor #defines are added to the OpenGL Shading Language:

      #define GL_NV_fragment_shader_interlock           1

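    A shader that should still compile on implementations without this
    extension can test the preprocessor define.  The sketch below assumes
    the "enable" behavior and uses hypothetical macro names
    (BEGIN_CRITICAL, END_CRITICAL) and a hypothetical image "counterImage".
    The fallback branch provides no mutual exclusion at all, so an
    application relying on it would need one of the other techniques listed
    in the Overview.

```glsl
#version 430
#extension GL_NV_fragment_shader_interlock : enable

#ifdef GL_NV_fragment_shader_interlock
layout(pixel_interlock_unordered) in;
#define BEGIN_CRITICAL() beginInvocationInterlockNV()
#define END_CRITICAL()   endInvocationInterlockNV()
#else
// Extension not available: the macros expand to nothing, so the section
// below is NOT protected against concurrent access.
#define BEGIN_CRITICAL()
#define END_CRITICAL()
#endif

// Hypothetical per-pixel counter image.
layout(binding = 0, r32ui) coherent uniform uimage2D counterImage;

void main()
{
    ivec2 pixel = ivec2(gl_FragCoord.xy);

    BEGIN_CRITICAL();
    uint value = imageLoad(counterImage, pixel).x;
    imageStore(counterImage, pixel, uvec4(value + 1u, 0u, 0u, 0u));
    END_CRITICAL();
}
```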

    Modify Section 4.4.1.3, Fragment Shader Inputs (p. 58)

    (add to the list of layout qualifiers containing "early_fragment_tests",
     p. 59, and modify the surrounding language to reflect that multiple
     layout qualifiers are supported on "in")

      layout-qualifier-id
        pixel_interlock_ordered
        pixel_interlock_unordered
        sample_interlock_ordered
        sample_interlock_unordered

    (add to the end of the section, p. 59)

    The identifiers "pixel_interlock_ordered", "pixel_interlock_unordered",
    "sample_interlock_ordered", and "sample_interlock_unordered" control the
    ordering of the execution of shader invocations between calls to the
    built-in functions beginInvocationInterlockNV() and
    endInvocationInterlockNV(), as described in section 8.13.3. A
    compile or link error will be generated if more than one of these layout
    qualifiers is specified in shader code. If a program containing a
    fragment shader includes none of these layout qualifiers, it is as
    though "pixel_interlock_ordered" were specified.

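    The following sketch shows how an interlock qualifier might be declared
    alongside "early_fragment_tests".  Writing the two qualifiers as
    separate "in" declarations is an assumption of this example; the text
    above only states that multiple layout qualifiers are supported on "in".

```glsl
#version 430
#extension GL_NV_fragment_shader_interlock : require

// Existing GLSL qualifier: run depth and stencil tests before the shader.
layout(early_fragment_tests) in;

// Exactly one interlock qualifier may appear in shader code.
layout(sample_interlock_unordered) in;

// Uncommenting the next line would name a second interlock qualifier,
// which the text above says produces a compile or link error.
// layout(pixel_interlock_ordered) in;

void main()
{
    beginInvocationInterlockNV();
    // ... loads and stores to per-sample data structures go here ...
    endInvocationInterlockNV();
}
```

    Omitting the interlock declaration entirely would select
    "pixel_interlock_ordered", the most conservative of the four modes.
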
    Add to the end of Section 8.13, Fragment Processing Functions (p. 168)

    8.13.3, Fragment Shader Execution Ordering Functions

    By default, fragment shader invocations are generally executed in
    undefined order. Multiple fragment shader invocations may be executed
    concurrently, including multiple invocations corresponding to a single
    pixel. Additionally, fragment shader invocations for a single pixel might
    not be processed in the order in which the primitives generating the
    fragments were specified in the OpenGL API.

    The paired functions beginInvocationInterlockNV() and
    endInvocationInterlockNV() allow shaders to specify a critical section,
    inside which stronger execution ordering is guaranteed.  When using the
    "pixel_interlock_ordered" or "pixel_interlock_unordered" qualifier,
    ordering guarantees are provided for any pair of fragment shader
    invocations X and Y triggered by fragments A and B corresponding to the
    same pixel. When using the "sample_interlock_ordered" or
    "sample_interlock_unordered" qualifier, ordering guarantees are provided
    for any pair of fragment shader invocations X and Y triggered by fragments
    A and B that correspond to the same pixel, where at least one sample of
    the pixel is covered by both fragments. No ordering guarantees are
    provided for pairs of fragment shader invocations corresponding to
    different pixels. Additionally, no ordering guarantees are provided for
    pairs of fragment shader invocations corresponding to the same fragment.
    When multisampling is enabled and the framebuffer has sample buffers,
    multiple fragment shader invocations may result from a single fragment due
    to the use of the "sample" auxiliary storage qualifier, OpenGL API
    commands forcing multiple shader invocations per fragment, or for other
    implementation-dependent reasons.

    When using the "pixel_interlock_unordered" or "sample_interlock_unordered"
    qualifier, the interlock will ensure that the critical sections of
    fragment shader invocations X and Y with overlapping coverage will never
    execute concurrently. That is, invocation X is guaranteed to complete its
    call to endInvocationInterlockNV() before invocation Y completes its call
    to beginInvocationInterlockNV(), or vice versa.

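    Mutual exclusion without ordering is enough when the update is
    commutative.  The sketch below counts how many fragments touched each
    pixel using a plain (non-atomic) read-modify-write.  The image name
    "overdrawCount", its binding, and its format are assumptions of this
    example, and the sketch assumes one shader invocation per fragment.

```glsl
#version 430
#extension GL_NV_fragment_shader_interlock : require

// Critical sections for the same pixel never overlap in time, but may run
// in any order.
layout(pixel_interlock_unordered) in;

// Hypothetical per-pixel counter, cleared to zero by the application.
layout(binding = 0, r32ui) coherent uniform uimage2D overdrawCount;

void main()
{
    ivec2 pixel = ivec2(gl_FragCoord.xy);

    beginInvocationInterlockNV();
    // Without the interlock, two invocations could both read the same old
    // value and one increment would be lost.
    uint count = imageLoad(overdrawCount, pixel).x;
    imageStore(overdrawCount, pixel, uvec4(count + 1u, 0u, 0u, 0u));
    endInvocationInterlockNV();
}
```

    Because addition is commutative, the final count does not depend on
    which invocation enters its critical section first.
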
    When using the "pixel_interlock_ordered" or "sample_interlock_ordered"
    layout qualifier, the critical sections of invocations X and Y with
    overlapping coverage will be executed in a specific order, based on the
    relative order assigned to their fragments A and B.  If fragment A is
    considered to precede fragment B, the critical section of invocation X is
    guaranteed to complete before the critical section of invocation Y begins.
    When a pair of fragments A and B have overlapping coverage, fragment A is
    considered to precede fragment B if

      * the OpenGL API command producing fragment A was called prior to the
        command producing B, or

      * the point, line, triangle, [[compatibility profile: quadrilateral,
        polygon,]] or patch primitive producing fragment A appears earlier in
        the same strip, loop, fan, or independent primitive list producing
        fragment B.

    When [[compatibility profile: decomposing quadrilateral or polygon
    primitives or]] tessellating a single patch primitive, multiple
    primitives may be generated in an undefined implementation-dependent
    order.  When fragments A and B are generated from such unordered
    primitives, their ordering is also implementation-dependent.

    If fragment shader X completes its critical section before fragment shader
    Y begins its critical section, all stores to memory performed in the
    critical section of invocation X using a pointer, image uniform, atomic
    counter uniform, or buffer variable qualified by "coherent" are guaranteed
    to be visible to any reads of the same types of variable performed in the
    critical section of invocation Y.

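    The ordering and visibility rules above can be sketched with a
    "coherent" buffer variable.  In this example each pixel records the ID
    of the last primitive that covered it and counts how often a smaller ID
    follows a larger one.  The block, struct, and uniform names are
    hypothetical; the sketch assumes a single draw call of independent
    primitives with no geometry or tessellation shader, so that
    gl_PrimitiveID increases in primitive order, and assumes the application
    initializes "lastPrimitive" to -1.

```glsl
#version 430
#extension GL_NV_fragment_shader_interlock : require

layout(pixel_interlock_ordered) in;

struct PixelRecord {
    int  lastPrimitive;    // ID of the most recent primitive for this pixel
    uint outOfOrderCount;  // times a smaller ID followed a larger one
};

// "coherent" is what makes stores from one critical section visible to
// reads in a later one.  Block name and binding are assumptions.
layout(std430, binding = 0) coherent buffer PixelRecords {
    PixelRecord records[];
};

uniform int viewportWidth;   // hypothetical: width of the render target

void main()
{
    ivec2 pixel = ivec2(gl_FragCoord.xy);
    int index = pixel.y * viewportWidth + pixel.x;

    beginInvocationInterlockNV();
    int previous = records[index].lastPrimitive;
    if (previous > gl_PrimitiveID)
        records[index].outOfOrderCount += 1u;
    records[index].lastPrimitive = gl_PrimitiveID;
    endInvocationInterlockNV();
}
```

    Under the stated assumptions, the ordered qualifier should leave
    "outOfOrderCount" at zero, whereas an unordered qualifier would make no
    such promise.
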
    If multisampling is disabled, or if the framebuffer does not include
    sample buffers, fragment coverage is computed per-pixel. In this case,
    the "sample_interlock_ordered" or "sample_interlock_unordered" layout
    qualifiers are treated as "pixel_interlock_ordered" or
    "pixel_interlock_unordered", respectively.


      Syntax:

        void beginInvocationInterlockNV(void);
        void endInvocationInterlockNV(void);

      Description:

    The beginInvocationInterlockNV() and endInvocationInterlockNV() functions
    may only be placed inside the function main() of a fragment shader and
    may not be called within any flow control.  These functions may not be
    called after a return statement in the function main(), but may be called
    after a discard statement.  A compile- or link-time error will be
    generated if main() calls either function more than once, contains a call
    to one function without a matching call to the other, or calls
    endInvocationInterlockNV() before calling beginInvocationInterlockNV().

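    The placement rules can be sketched as follows.  The shader shows one
    arrangement that the description above permits; patterns it rejects are
    listed only as comments.  The names "destImage", "srcColor", and
    "alphaThreshold" are hypothetical.

```glsl
#version 430
#extension GL_NV_fragment_shader_interlock : require

layout(pixel_interlock_ordered) in;
layout(binding = 0, rgba8) coherent uniform image2D destImage;  // hypothetical

in vec4 srcColor;               // hypothetical input
uniform float alphaThreshold;   // hypothetical

void main()
{
    // A discard statement ahead of the critical section is permitted.
    if (srcColor.a < alphaThreshold)
        discard;

    // Both calls sit directly in main(), outside any flow control, exactly
    // once each, with "begin" before "end".
    beginInvocationInterlockNV();
    imageStore(destImage, ivec2(gl_FragCoord.xy), srcColor);
    endInvocationInterlockNV();

    // Patterns rejected by the rules above (shown as comments only):
    //   if (cond) { beginInvocationInterlockNV(); }    // inside flow control
    //   void helper() { beginInvocationInterlockNV(); } // outside main()
    //   return; beginInvocationInterlockNV();           // after a return
    //   calling beginInvocationInterlockNV() twice      // more than once
    //   endInvocationInterlockNV() with no begin call   // unmatched call
}
```
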
Additions to the AGL/GLX/WGL Specifications

    None.

Errors

    None.

New State

    None.

New Implementation Dependent State

    None.

Interactions with OpenGL ES 3.1

    Disabling multisample rasterization is not available on OpenGL ES;
    it is always enabled.


Dependencies on EXT_tessellation_shader

    If this extension is implemented on OpenGL ES and EXT_tessellation_shader
    is not supported, remove language referring to tessellation of patch
    primitives.


Dependencies on OES_sample_shading

    If this extension is implemented on OpenGL ES and OES_sample_shading
    is not supported, remove references to per-sample shading via
    MinSampleShading[OES]().


Dependencies on OES_shader_image_atomic

    If this extension is implemented on OpenGL ES and OES_shader_image_atomic
    is not supported, disregard language referring to atomic memory operations.


Dependencies on OES_shader_multisample_interpolation

    If this extension is implemented on OpenGL ES and
    OES_shader_multisample_interpolation is not supported, ignore language
    about the "sample" auxiliary storage qualifier.


Dependencies on NV_shader_buffer_load and NV_shader_buffer_store

    If NV_shader_buffer_load and NV_shader_buffer_store are not supported,
    references to ordering memory accesses using pointers should be deleted.


Dependencies on NV_gpu_program4 and NV_fragment_program4

    Modify Section 2.X.2, Program Grammar, of the NV_fragment_program4
    specification (which modifies the NV_gpu_program4 base grammar)

      <SpecialInstruction>    ::= "FSIB"
                                | "FSIE"


    Modify Section 2.X.4, Program Execution Environment

    (add to the opcode table)

                  Modifiers
      Instruction F I C S H D  Out Inputs    Description
      ----------- - - - - - -  --- --------  --------------------------------
      FSIB        - - - - - -  -   -         begin fragment shader interlock
      FSIE        - - - - - -  -   -         end fragment shader interlock


    Modify Section 2.X.6, Program Options

    + Fragment Shader Interlock (NV_pixel_interlock_ordered,
      NV_pixel_interlock_unordered, NV_sample_interlock_ordered, and
      NV_sample_interlock_unordered)

    If a fragment program specifies the "NV_pixel_interlock_ordered",
    "NV_pixel_interlock_unordered", "NV_sample_interlock_ordered", or
    "NV_sample_interlock_unordered" options, it will configure a critical
    section using the FSIB (fragment shader interlock begin) and FSIE
    (fragment shader interlock end) opcodes.  The execution of the critical
    sections will be ordered for pairs of program invocations corresponding to
    the same pixel, as described in Section 8.13.3 of the OpenGL Shading
    Language Specification, where the four options are considered to specify
    layout qualifiers with names equivalent to the matching program option.

    A program will fail to load if it specifies more than one of these program
    options, if it specifies exactly one of these options but does not contain
    exactly one FSIB instruction and one FSIE instruction, or if it contains
    an FSIB or FSIE instruction without specifying any of these options.


    Add the following subsections to section 2.X.8, Program Instruction Set


    Section 2.X.8.Z, FSIB:  Fragment Shader Interlock Begin

    The FSIB instruction specifies the beginning of a critical section in a
    fragment program, where execution of the critical section is ordered
    relative to other fragments.  This instruction has no other effect.

    The FSIB instruction is not allowed in arbitrary locations in a program.
    A program will fail to load if it includes an FSIB instruction inside an
    IF/ELSE/ENDIF block, inside a REP/ENDREP block, or inside any subroutine
    block other than the one labeled "main".  Additionally, a program will
    fail to load if it contains more than one FSIB instruction, or if its one
    FSIB instruction is not followed by an FSIE instruction.

    FSIB has no operands and generates no result.


    Section 2.X.8.Z, FSIE:  Fragment Shader Interlock End

    The FSIE instruction specifies the end of a critical section in a fragment
    program, where execution of the critical section is ordered relative to
    other fragments.  This instruction has no other effect.

    The FSIE instruction is not allowed in arbitrary locations in a program.
    A program will fail to load if it includes an FSIE instruction inside an
    IF/ELSE/ENDIF block, inside a REP/ENDREP block, or inside any subroutine
    block other than the one labeled "main".  Additionally, a program will
    fail to load if it contains more than one FSIE instruction, or if its one
    FSIE instruction is not preceded by an FSIB instruction.

    FSIE has no operands and generates no result.

Issues

    (1) What should this extension be called?

      RESOLVED:  NV_fragment_shader_interlock.  The
      beginInvocationInterlockNV() and endInvocationInterlockNV() commands
      identify a critical section during which other invocations with
      overlapping coverage are locked out until the critical section
      completes.

    (2) When using multisampling, the OpenGL specification permits
        multiple fragment shader invocations to be generated for a single
        fragment.  For example, the "sample" auxiliary storage qualifier or
        the MinSampleShading() OpenGL API command can be used to force
        per-sample shading.  What execution ordering guarantees are provided
        between fragment shader invocations generated from the same
        fragment?

      RESOLVED:  We don't provide any ordering guarantees in this extension.
      This implies that when using multisampling, there is no guarantee that
      two fragment shader invocations for the same fragment won't be executing
      their critical sections concurrently.  This could cause problems for
      algorithms sharing data structures between all the samples of a pixel
      unless accesses to these data structures are performed atomically.

      When using per-sample shading, the interlock we provide *does* guarantee
      that no two invocations corresponding to the same sample execute the
      critical section concurrently.  If a separate set of data structures is
      provided for each sample, no conflicts should occur within the critical
      section.

      Note that in addition to the per-sample shading options in the shading
      language and API, implementations may provide multisample antialiasing
      modes where the implementation can't simply run the fragment shader once
      and broadcast results to a large set of covered samples.

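      A per-sample data structure of the kind described in issue (2) might
      be sketched as an array image with one layer per sample.  The image
      name, format, and layer layout are assumptions of this example, as is
      the input "srcColor".

```glsl
#version 430
#extension GL_NV_fragment_shader_interlock : require

// Only fragments that share a covered sample are ordered against each other.
layout(sample_interlock_ordered) in;

// Hypothetical storage: layer N holds the data for sample N of each pixel.
layout(binding = 0, rgba16f) coherent uniform image2DArray perSampleColor;

in vec4 srcColor;   // hypothetical input

void main()
{
    // A static use of gl_SampleID causes the shader to be evaluated once per
    // sample, so each invocation touches only its own layer.
    ivec3 texel = ivec3(ivec2(gl_FragCoord.xy), gl_SampleID);

    beginInvocationInterlockNV();
    vec4 accumulated = imageLoad(perSampleColor, texel);
    imageStore(perSampleColor, texel, accumulated + srcColor);
    endInvocationInterlockNV();
}
```

      Invocations generated from the same fragment write to different
      layers here, so the lack of ordering between them does not cause a
      conflict.
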
    (3) What performance differences are expected between shaders using the
       "pixel" and "sample" layout qualifier variants in this extension (e.g.,
       "pixel_interlock_ordered" and "sample_interlock_ordered")?

      RESOLVED:  We expect that shaders using "sample" qualifiers may have
      higher performance, since the implementation need not order pairs of
      fragments that touch the same pixel with "complementary" coverage.  Such
      situations are fairly common:  when two adjacent triangles combine to
      cover a given pixel, two fragments will be generated for the pixel but
      no sample will be covered by both.  When using "sample" qualifiers, the
      invocations for both fragments can run concurrently.  When using "pixel"
      qualifiers, the critical section for one fragment must wait until the
      critical section for the other fragment completes.

    (4) What performance differences are expected between shaders using the
       "ordered" and "unordered" layout qualifier variants in this extension
       (e.g., "pixel_interlock_ordered" and "pixel_interlock_unordered")?

      RESOLVED:  We expect that shaders using "unordered" may have higher
      performance, since the critical section implementation doesn't need to
      ensure that all previous invocations with overlapping coverage have
      completed their critical sections.  Some algorithms (e.g., building data
      structures in order-independent transparency algorithms) will require
      mutual exclusion when updating per-pixel data structures, but do not
      require that shaders execute in a specific ordering.

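      A build pass for order-independent transparency of the kind mentioned
      in issue (4) might look like the sketch below, which appends each
      fragment to a fixed-capacity per-pixel list.  The capacity, image
      names, formats, and record layout are all assumptions of this example;
      a separate resolve pass (not shown) would sort and composite the
      stored fragments.

```glsl
#version 430
#extension GL_NV_fragment_shader_interlock : require

// Mutual exclusion only: the resolve pass sorts by depth, so insertion
// order does not matter.
layout(pixel_interlock_unordered) in;

const uint MAX_FRAGMENTS = 8u;   // hypothetical per-pixel capacity

// Hypothetical storage: a per-pixel count, plus one array layer per slot.
layout(binding = 0, r32ui)  coherent uniform uimage2D      fragmentCount;
layout(binding = 1, rg32ui) coherent uniform uimage2DArray fragmentData;

in vec4 srcColor;   // hypothetical input

void main()
{
    ivec2 pixel = ivec2(gl_FragCoord.xy);

    // Pack the record outside the critical section: color in x, depth in y.
    uvec4 record = uvec4(packUnorm4x8(srcColor),
                         floatBitsToUint(gl_FragCoord.z), 0u, 0u);

    beginInvocationInterlockNV();
    uint slot = imageLoad(fragmentCount, pixel).x;
    if (slot < MAX_FRAGMENTS) {
        imageStore(fragmentData, ivec3(pixel, int(slot)), record);
        imageStore(fragmentCount, pixel, uvec4(slot + 1u, 0u, 0u, 0u));
    }
    endInvocationInterlockNV();
}
```

      Fragments beyond the assumed capacity are silently dropped in this
      sketch; a real application would need its own overflow policy.
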
    (5) Are fragment shaders using this extension allowed to write outputs?
        If so, is there any guarantee on the order in which such outputs are
        written to the framebuffer?

      RESOLVED:  Yes, fragment shaders with critical sections may still write
      outputs.  If fragment shader outputs are written, they are stored or
      blended into the framebuffer in API order, as is the case for fragment
      shaders not using this extension.

    (6) What considerations apply when using this extension to implement a
        programmable form of conventional blending using image stores?

      RESOLVED:  Per-fragment operations performed in the pipeline following
      fragment shader execution obviously have no effect on image stores
      executing during fragment shader execution.  In particular, multisample
      operations such as broadcasting a single fragment output to multiple
      samples or modifying the coverage with alpha-to-coverage or a shader
      coverage mask output value have no effect.  Fragments cannot be killed
      before fragment shader blending using the fixed-function alpha test or
      using the depth test with a Z value produced by the shader.  Fragments
      will normally not be killed by fixed-function depth or stencil tests,
      but those tests can be enabled before fragment shader invocations using
      the layout qualifier "early_fragment_tests".  Any required
      fixed-function features that need to be handled before programmable
      blending that aren't enabled by "early_fragment_tests" would need to be
      emulated in the shader.

      Note also that performing blend computations in the shader is not
      guaranteed to produce results that are bit-identical to those produced
      by fixed-function blending hardware, even if mathematically equivalent
      algorithms are used.

    (7) For operations accessing shared per-pixel data structures in the
        critical section, what operations (if any) must be performed in shader
        code to ensure that stores from one shader invocation are visible to
        the next?

      RESOLVED:  The "coherent" qualifier is required in the declaration of
      the shared data structures to ensure that writes performed by one
      invocation are visible to reads performed by another invocation.

      In shaders that don't use the interlock, "coherent" is not sufficient as
      there is no guarantee of the ordering of fragment shader invocations --
      even if invocation A can see the values written by another invocation B,
      there is no general guarantee that invocation A's read will be performed
      before invocation B's write.  The built-in function memoryBarrier() can
      be used to generate a weak ordering by which threads can communicate,
      but it doesn't order memory transactions between two separate
      invocations.  With the interlock, execution ordering between two threads
      from the same pixel is well-defined as long as the loads and stores are
      performed inside the critical section, and the use of "coherent" ensures
      that stores done by one invocation are visible to other invocations.

    (8) Should we provide an explicit mechanism for shaders to indicate a
        critical section?  Or should we just automatically infer a critical
        section by analyzing shader code?  Or should we just wrap the entire
        fragment shader in a critical section?

      RESOLVED:  Provide an explicit critical section.

      We definitely don't want to wrap the entire shader in a critical section
      when a smaller section will suffice.  Doing so would hold off the
      execution of any other fragment shader invocation with the same (x,y)
      for the entire (potentially long) life of the fragment shader.  Hardware
      would need to track a large number of fragments awaiting execution, and
      may be so backed up that further fragments will be blocked even if they
      don't overlap with any fragments currently executing.  Providing a
      smaller critical section reduces the amount of time other fragments are
      blocked and allows implementations to perform useful work for
      conflicting fragments before they hit the critical section.

      While a compiler could analyze the code and wrap a critical section
      around all memory accesses, it may be difficult to determine which
      accesses actually require mutual exclusion and ordering, and which
      accesses are safe to do with no protection.  Requiring shaders to
      explicitly identify a critical section doesn't seem overwhelmingly
      burdensome, and allows applications to exclude memory accesses that they
      know to be "safe".

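      The benefit of a small critical section discussed in issue (8) can be
      sketched as follows: texturing and lighting run before the interlock
      is taken, and only the read-modify-write of the shared image sits
      inside it.  All uniform, input, and image names are hypothetical.

```glsl
#version 430
#extension GL_NV_fragment_shader_interlock : require

layout(pixel_interlock_ordered) in;
layout(binding = 0, rgba16f) coherent uniform image2D destImage;  // hypothetical

uniform sampler2D albedoMap;   // hypothetical
uniform vec3 lightDir;         // hypothetical: direction the light travels
in vec2 texCoord;              // hypothetical
in vec3 normal;                // hypothetical

void main()
{
    ivec2 pixel = ivec2(gl_FragCoord.xy);

    // Potentially long-running work stays outside the critical section, so
    // conflicting fragments can still make progress here concurrently.
    vec4 albedo = texture(albedoMap, texCoord);
    float diffuse = max(dot(normalize(normal), normalize(-lightDir)), 0.0);
    vec4 src = vec4(albedo.rgb * diffuse, albedo.a);

    // Only the accesses that need mutual exclusion and ordering are
    // protected.
    beginInvocationInterlockNV();
    vec4 dst = imageLoad(destImage, pixel);
    imageStore(destImage, pixel, mix(dst, src, src.a));
    endInvocationInterlockNV();
}
```
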
    (9) What restrictions should be imposed on the use of the
        beginInvocationInterlockNV() and endInvocationInterlockNV() functions
        delimiting a critical section?

      RESOLVED:  We impose restrictions similar to those on the barrier()
      built-in function in tessellation control shaders to ensure that any
      shader using this functionality has a single critical section that can
      be easily identified during compilation.  In particular, we require that
      these functions be called in main() and don't permit them to be called
      in conditional flow control.

      These restrictions ensure that there is always exactly one call to the
      "begin" and "end" functions in a predictable location in the compiled
      shader code, and ensure that the compiler and hardware don't have to
      deal with unusual cases (like entering a critical section and never
      leaving, leaving a critical section without entering it, or trying to
      enter a critical section more than once).

Revision History

    Revision 2, 2015/03/27
      - Add ES interactions

    Revision 1
      - Internal revisions
