Name

    ARB_fragment_shader_interlock

Name Strings

    GL_ARB_fragment_shader_interlock

Contact

    Slawomir Grajewski, Intel  (slawomir.grajewski 'at' intel.com)

Contributors

    Contributors to INTEL_fragment_shader_ordering
    Contributors to NV_fragment_shader_interlock

Notice

    Copyright (c) 2015 The Khronos Group Inc. Copyright terms at
        http://www.khronos.org/registry/speccopyright.html

Specification Update Policy

    Khronos-approved extension specifications are updated in response to
    issues and bugs prioritized by the Khronos OpenGL Working Group. For
    extensions which have been promoted to a core Specification, fixes will
    first appear in the latest version of that core Specification, and will
    eventually be backported to the extension document. This policy is
    described in more detail at
        https://www.khronos.org/registry/OpenGL/docs/update_policy.php

Status

    Complete. Approved by the ARB on June 26, 2015.
    Ratified by the Khronos Board of Promoters on August 7, 2015.

Version

    Last Modified Date:        May 7, 2015
    Revision:                  2

Number

    ARB Extension #177

Dependencies

    This extension is written against the OpenGL 4.5 (Core Profile)
    Specification.

    This extension is written against version 4.50 (revision 5) of the OpenGL
    Shading Language Specification.

    OpenGL 4.2 or ARB_shader_image_load_store is required; GLSL 4.20 is
    required.

Overview

    In unextended OpenGL 4.5, applications may produce a
    large number of fragment shader invocations that perform loads and
    stores to memory using image uniforms, atomic counter uniforms,
    buffer variables, or pointers. The order in which loads and stores
    to common addresses are performed by different fragment shader
    invocations is largely undefined.  For algorithms that use shader
    writes and touch the same pixels more than once, one or more of the
    following techniques may be required to ensure proper execution ordering:

      * inserting Finish or WaitSync commands to drain the pipeline between
        different "passes" or "layers";

      * using only atomic memory operations to write to shader memory (which
        may be relatively slow and limits how memory may be updated); or

      * injecting spin loops into shaders to prevent multiple shader
        invocations from touching the same memory concurrently.

    This extension provides new GLSL built-in functions
    beginInvocationInterlockARB() and endInvocationInterlockARB() that delimit
    a critical section of fragment shader code.  For pairs of shader
    invocations with "overlapping" coverage in a given pixel, the OpenGL
    implementation will guarantee that the critical section of the fragment
    shader will be executed for only one fragment at a time.

    There are four different interlock modes supported by this extension,
    which are identified by layout qualifiers.  The qualifiers
    "pixel_interlock_ordered" and "pixel_interlock_unordered" provide mutual
    exclusion in the critical section for any pair of fragments corresponding
    to the same pixel.  When using multisampling, the qualifiers
    "sample_interlock_ordered" and "sample_interlock_unordered" only provide
    mutual exclusion for pairs of fragments that both cover at least one
    common sample in the same pixel; these are recommended for performance if
    shaders use per-sample data structures.
94
95    Additionally, when the "pixel_interlock_ordered" or
96    "sample_interlock_ordered" layout qualifier is used, the interlock also
97    guarantees that the critical section for multiple shader invocations with
98    "overlapping" coverage will be executed in the order in which the
99    primitives were processed by the GL.  Such a guarantee is useful for
100    applications like blending in the fragment shader, where an application
101    requires that fragment values to be composited in the framebuffer in
102    primitive order.

    This extension can be useful for algorithms that need to access per-pixel
    data structures via shader loads and stores.  Such algorithms using this
    extension can access such data structures in the critical section without
    worrying about other invocations for the same pixel accessing the data
    structures concurrently.  Additionally, the ordering guarantees are useful
    for cases where the API ordering of fragments is meaningful.  For example,
    applications may be able to execute programmable blending operations in
    the fragment shader, where the destination buffer is read via image loads
    and the final value is written via image stores.

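    As an informative illustration only (not part of the extension text), the
    following fragment shader sketches such a programmable blend.  The image
    binding, input name, and blend math are arbitrary choices for the example:

      #version 450
      #extension GL_ARB_fragment_shader_interlock : require

      layout(pixel_interlock_ordered) in;

      // Illustrative resource; binding and format are arbitrary.
      layout(binding = 0, rgba8) coherent uniform image2D colorBuf;

      in vec4 srcColor;

      void main()
      {
          ivec2 coord = ivec2(gl_FragCoord.xy);

          beginInvocationInterlockARB();
          // Critical section: for this pixel, no other invocation with
          // overlapping coverage runs this code concurrently, and critical
          // sections execute in primitive order.
          vec4 dst     = imageLoad(colorBuf, coord);
          vec4 blended = vec4(srcColor.rgb * srcColor.a +
                              dst.rgb * (1.0 - srcColor.a), 1.0);
          imageStore(colorBuf, coord, blended);
          endInvocationInterlockARB();
      }
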
New Procedures and Functions

    None.

New Tokens

    None.

Modifications to the OpenGL Shading Language Specification, Version 4.50

    Including the following line in a shader can be used to control the
    language features described in this extension:

      #extension GL_ARB_fragment_shader_interlock : <behavior>

    where <behavior> is as specified in section 3.3.

    New preprocessor #defines are added to the OpenGL Shading Language:

      #define GL_ARB_fragment_shader_interlock           1

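    As an informative example, a shader can test this macro to select an
    interlock-based code path and fall back when the extension is unavailable;
    the structure below is only a usage sketch:

      #extension GL_ARB_fragment_shader_interlock : enable

      #ifdef GL_ARB_fragment_shader_interlock
        // interlock-based path using the critical section
      #else
        // fallback path, e.g. using only atomic operations
      #endif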

    Modify Section 4.4.1.3, Fragment Shader Inputs (p. 63)

    (add to the list of layout qualifiers containing "early_fragment_tests",
     p. 63, and modify the surrounding language to reflect that multiple
     layout qualifiers are supported on "in")

      layout-qualifier-id
        pixel_interlock_ordered
        pixel_interlock_unordered
        sample_interlock_ordered
        sample_interlock_unordered

    (add to the end of the section, p. 63)

    The identifiers "pixel_interlock_ordered", "pixel_interlock_unordered",
    "sample_interlock_ordered", and "sample_interlock_unordered" control the
    ordering of the execution of shader invocations between calls to the
    built-in functions beginInvocationInterlockARB() and
    endInvocationInterlockARB(), as described in section 8.13.3.  A compile-
    or link-time error will be generated if more than one of these layout
    qualifiers is specified in shader code.  If a program containing a
    fragment shader includes none of these layout qualifiers, it is as
    though "pixel_interlock_ordered" were specified.

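    For example (informative), a fragment shader selects one of the interlock
    modes with a single input layout declaration:

      layout(sample_interlock_unordered) in;

    whereas declaring more than one of these qualifiers, for instance

      layout(pixel_interlock_ordered) in;
      layout(sample_interlock_ordered) in;   // error: more than one qualifier

    results in a compile- or link-time error.
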
    Add to the end of Section 8.13, Fragment Processing Functions (p. 170)

    8.13.3, Fragment Shader Execution Ordering Functions

    By default, fragment shader invocations are generally executed in
    undefined order. Multiple fragment shader invocations may be executed
    concurrently, including multiple invocations corresponding to a single
    pixel. Additionally, fragment shader invocations for a single pixel might
    not be processed in the order in which the primitives generating the
    fragments were specified in the OpenGL API.

    The paired functions beginInvocationInterlockARB() and
    endInvocationInterlockARB() allow shaders to specify a critical section,
    inside which stronger execution ordering is guaranteed.  When using the
    "pixel_interlock_ordered" or "pixel_interlock_unordered" qualifier,
    ordering guarantees are provided for any pair of fragment shader
    invocations X and Y triggered by fragments A and B corresponding to the
    same pixel. When using the "sample_interlock_ordered" or
    "sample_interlock_unordered" qualifier, ordering guarantees are provided
    for any pair of fragment shader invocations X and Y triggered by fragments
    A and B that correspond to the same pixel, where at least one sample of
    the pixel is covered by both fragments. No ordering guarantees are
    provided for pairs of fragment shader invocations corresponding to
    different pixels. Additionally, no ordering guarantees are provided for
    pairs of fragment shader invocations corresponding to the same fragment.
    When multisampling is enabled and the framebuffer has sample buffers,
    multiple fragment shader invocations may result from a single fragment due
    to the use of the "sample" auxiliary storage qualifier, OpenGL API
    commands forcing multiple shader invocations per fragment, or for other
    implementation-dependent reasons.

    When using the "pixel_interlock_unordered" or "sample_interlock_unordered"
    qualifier, the interlock will ensure that the critical sections of
    fragment shader invocations X and Y with overlapping coverage will never
    execute concurrently. That is, invocation X is guaranteed to complete its
    call to endInvocationInterlockARB() before invocation Y completes its call
    to beginInvocationInterlockARB(), or vice versa.

    When using the "pixel_interlock_ordered" or "sample_interlock_ordered"
    layout qualifier, the critical sections of invocations X and Y with
    overlapping coverage will be executed in a specific order, based on the
    relative order assigned to their fragments A and B.  If fragment A is
    considered to precede fragment B, the critical section of invocation X is
    guaranteed to complete before the critical section of invocation Y begins.
    When a pair of fragments A and B have overlapping coverage, fragment A is
    considered to precede fragment B if

      * the OpenGL API command producing fragment A was called prior to the
        command producing B, or

      * the point, line, triangle, [[compatibility profile: quadrilateral,
        polygon,]] or patch primitive producing fragment A appears earlier in
        the same strip, loop, fan, or independent primitive list producing
        fragment B.

    When [[compatibility profile: decomposing quadrilateral or polygon
    primitives or]] tessellating a single patch primitive, multiple
    primitives may be generated in an undefined implementation-dependent
    order.  When fragments A and B are generated from such unordered
    primitives, their ordering is also implementation-dependent.

    If fragment shader invocation X completes its critical section before
    fragment shader invocation Y begins its critical section, all stores to
    memory performed in the critical section of invocation X using a pointer,
    image uniform, atomic counter uniform, or buffer variable qualified by
    "coherent" are guaranteed to be visible to any reads of the same types of
    variable performed in the critical section of invocation Y.

    If multisampling is disabled, or if the framebuffer does not include
    sample buffers, fragment coverage is computed per-pixel. In this case,
    the "sample_interlock_ordered" or "sample_interlock_unordered" layout
    qualifiers are treated as "pixel_interlock_ordered" or
    "pixel_interlock_unordered", respectively.

      Syntax:

        void beginInvocationInterlockARB(void);
        void endInvocationInterlockARB(void);

      Description:

    The beginInvocationInterlockARB() and endInvocationInterlockARB()
    functions may only be placed inside the function main() of a fragment
    shader and may not be called within any flow control.  These functions may
    not be called after a return statement in the function main(), but may be
    called after a discard statement.  A compile- or link-time error will be
    generated if main() calls either function more than once, contains a call
    to one function without a matching call to the other, or calls
    endInvocationInterlockARB() before calling beginInvocationInterlockARB().

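    As an informative illustration of these rules, the sketch below marks
    legal and illegal placements; the surrounding shader code is hypothetical:

      void main()
      {
          // ... unprotected work ...
          beginInvocationInterlockARB();   // legal: called once, directly in main()
          // ... critical section: loads/stores to shared per-pixel data ...
          endInvocationInterlockARB();     // legal: matching call after begin
          // ... more unprotected work ...
      }

      // Illegal placements (compile- or link-time error), for example:
      //   if (cond) beginInvocationInterlockARB();   // inside flow control
      //   calling beginInvocationInterlockARB() twice in main()
      //   calling endInvocationInterlockARB() without a preceding begin
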
Additions to the AGL/GLX/WGL Specifications

    None.

Errors

    None.

New State

    None.

New Implementation Dependent State

    None.

Issues

    (1) When using multisampling, the OpenGL specification permits
        multiple fragment shader invocations to be generated for a single
        fragment.  For example, the "sample" auxiliary storage qualifier or
        the MinSampleShading() OpenGL API command can be used to force
        per-sample shading.  What execution ordering guarantees are provided
        between fragment shader invocations generated from the same fragment?

      RESOLVED:  We don't provide any ordering guarantees in this extension.
      This implies that when using multisampling, there is no guarantee that
      two fragment shader invocations for the same fragment won't be executing
      their critical sections concurrently.  This could cause problems for
      algorithms sharing data structures between all the samples of a pixel
      unless accesses to these data structures are performed atomically.

      When using per-sample shading, the interlock we provide *does* guarantee
      that no two invocations corresponding to the same sample execute the
      critical section concurrently.  If a separate set of data structures is
      provided for each sample, no conflicts should occur within the critical
      section.

      Note that in addition to the per-sample shading options in the shading
      language and API, implementations may provide multisample antialiasing
      modes where the implementation can't simply run the fragment shader once
      and broadcast results to a large set of covered samples.

    (2) What performance differences are expected between shaders using the
        "pixel" and "sample" layout qualifier variants in this extension (e.g.,
        "pixel_interlock_ordered" and "sample_interlock_ordered")?

      RESOLVED:  We expect that shaders using "sample" qualifiers may have
      higher performance, since the implementation need not order pairs of
      fragments that touch the same pixel with "complementary" coverage.  Such
      situations are fairly common:  when two adjacent triangles combine to
      cover a given pixel, two fragments will be generated for the pixel but
      no sample will be covered by both.  When using "sample" qualifiers, the
      invocations for both fragments can run concurrently.  When using "pixel"
      qualifiers, the critical section for one fragment must wait until the
      critical section for the other fragment completes.

    (3) What performance differences are expected between shaders using the
        "ordered" and "unordered" layout qualifier variants in this extension
        (e.g., "pixel_interlock_ordered" and "pixel_interlock_unordered")?

      RESOLVED:  We expect that shaders using "unordered" may have higher
      performance, since the critical section implementation doesn't need to
      ensure that all previous invocations with overlapping coverage have
      completed their critical sections.  Some algorithms (e.g., building data
      structures in order-independent transparency algorithms) will require
      mutual exclusion when updating per-pixel data structures, but do not
      require that shaders execute in a specific ordering.

    (4) Are fragment shaders using this extension allowed to write outputs?
        If so, is there any guarantee on the order in which such outputs are
        written to the framebuffer?

      RESOLVED:  Yes, fragment shaders with critical sections may still write
      outputs.  If fragment shader outputs are written, they are stored or
      blended into the framebuffer in API order, as is the case for fragment
      shaders not using this extension.

    (5) What considerations apply when using this extension to implement a
        programmable form of conventional blending using image stores?

      RESOLVED:  Per-fragment operations performed in the pipeline following
      fragment shader execution obviously have no effect on image stores
      executing during fragment shader execution.  In particular, multisample
      operations such as broadcasting a single fragment output to multiple
      samples or modifying the coverage with alpha-to-coverage or a shader
      coverage mask output value have no effect.  Because they occur after the
      shader runs, the fixed-function alpha test and a depth test using a Z
      value produced by the shader cannot kill fragments before the shader
      performs its blending.  Fragments will normally not be killed by
      fixed-function depth or stencil tests either, but those tests can be
      enabled before fragment shader invocations using the layout qualifier
      "early_fragment_tests".  Any required fixed-function features that need
      to be handled before programmable blending and that aren't enabled by
      "early_fragment_tests" would need to be emulated in the shader.

      Note also that performing blend computations in the shader is not
      guaranteed to produce results that are bit-identical to those produced
      by fixed-function blending hardware, even if mathematically equivalent
      algorithms are used.

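      As an informative illustration of the early_fragment_tests point, a
      shader performing programmable blending could request that fixed-function
      depth and stencil testing run before its invocations by combining the
      qualifiers:

        layout(early_fragment_tests) in;
        layout(pixel_interlock_ordered) in;
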
    (6) For operations accessing shared per-pixel data structures in the
        critical section, what operations (if any) must be performed in shader
        code to ensure that stores from one shader invocation are visible to
        the next?

      RESOLVED:  The "coherent" qualifier is required in the declaration of
      the shared data structures to ensure that writes performed by one
      invocation are visible to reads performed by another invocation.

      In shaders that don't use the interlock, "coherent" is not sufficient,
      as there is no guarantee of the ordering of fragment shader invocations --
      even if invocation A can see the values written by another invocation B,
      there is no general guarantee that invocation A's read will be performed
      before invocation B's write.  The built-in function memoryBarrier() can
      be used to generate a weak ordering by which invocations can communicate,
      but it doesn't order memory transactions between two separate
      invocations.  With the interlock, execution ordering between two
      invocations for the same pixel is well-defined as long as the loads and
      stores are performed inside the critical section, and the use of
      "coherent" ensures that stores done by one invocation are visible to
      other invocations.

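      For example (informative), per-pixel data structures accessed inside the
      critical section would be declared with "coherent"; the bindings, formats,
      and names below are arbitrary:

        layout(binding = 0, r32ui) coherent uniform uimage2D headPointers;

        layout(std430, binding = 1) coherent buffer NodeBuffer {
            uvec4 nodes[];
        };
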
    (7) Should we provide an explicit mechanism for shaders to indicate a
        critical section?  Or should we just automatically infer a critical
        section by analyzing shader code?  Or should we just wrap the entire
        fragment shader in a critical section?

      RESOLVED:  Provide an explicit critical section.

      We definitely don't want to wrap the entire shader in a critical section
      when a smaller section will suffice.  Doing so would hold off the
      execution of any other fragment shader invocation with the same (x,y)
      for the entire (potentially long) life of the fragment shader.  Hardware
      would need to track a large number of fragments awaiting execution, and
      may be so backed up that further fragments will be blocked even if they
      don't overlap with any fragments currently executing.  Providing a
      smaller critical section reduces the amount of time other fragments are
      blocked and allows implementations to perform useful work for
      conflicting fragments before they hit the critical section.

      While a compiler could analyze the code and wrap a critical section
      around all memory accesses, it may be difficult to determine which
      accesses actually require mutual exclusion and ordering, and which
      accesses are safe to do with no protection.  Requiring shaders to
      explicitly identify a critical section doesn't seem overwhelmingly
      burdensome, and it allows an application to exclude memory accesses
      that it knows to be "safe".

    (8) What restrictions should be imposed on the use of the
        beginInvocationInterlockARB() and endInvocationInterlockARB() functions
        delimiting a critical section?

      RESOLVED:  We impose restrictions similar to those on the barrier()
      built-in function in tessellation control shaders to ensure that any
      shader using this functionality has a single critical section that can
      be easily identified during compilation.  In particular, we require that
      these functions be called in main() and don't permit them to be called
      in conditional flow control.

      These restrictions ensure that there is always exactly one call to the
      "begin" and "end" functions in a predictable location in the compiled
      shader code, and ensure that the compiler and hardware don't have to
      deal with unusual cases (like entering a critical section and never
      leaving, leaving a critical section without entering it, or trying to
      enter a critical section more than once).

Revision History

    Rev.  Date      Author       Changes
    ----  --------  --------     -----------------------------------------
     1    04/01/15  S.Grajewski  Initial version merging
                                 INTEL_fragment_shader_ordering with
                                 NV_fragment_shader_interlock

     2    05/07/15  S.Grajewski  Built-in functions
                                 beginInvocationInterlockARB() and
                                 endInvocationInterlockARB() now have ARB
                                 suffixes.
