Name

    NV_shader_thread_group

Name Strings

    GL_NV_shader_thread_group

Contributors

    Jeannot Breton, NVIDIA
    Pat Brown, NVIDIA
    Eric Werness, NVIDIA
    Mark Kilgard, NVIDIA

Contact

    Jeannot Breton, NVIDIA Corporation (jbreton 'at' nvidia.com)

Status

    Shipping.

Version

    Last Modified Date:         7/21/2015
    NVIDIA Revision:            4

Number

    OpenGL Extension #447

Dependencies

    This extension is written against the OpenGL 4.3 (Compatibility Profile)
    Specification.

    This extension is written against version 4.30 (revision 07) of the OpenGL
    Shading Language Specification.

    OpenGL 4.3 and GLSL 4.3 are required.

    This extension interacts with NV_gpu_program5

    This extension interacts with NV_compute_program5

    This extension interacts with NV_tessellation_program5

Overview

    Implementations of the OpenGL Shading Language may, but are not required
    to, run multiple shader threads for a single stage as a SIMD thread group,
    where individual execution threads are assigned to thread groups in an
    undefined, implementation-dependent order.  This extension provides a set
    of new features to the OpenGL Shading Language to query thread states and
    to share data between fragments within a 2x2 pixel quad.

    More specifically, the following functionality was added:

    *   New uniform variables and tokens to query the number of threads in a
        warp, the number of warps running on an SM and the number of SMs on
        the GPU.

    *   New shader inputs to query the thread id, the warp id and the SM id.

    *   New shader inputs to query if a fragment shader thread is a helper
        thread.

    *   New shader built-in functions to query the state of a Boolean condition
        over all threads in a thread group.

    *   New shader built-in functions to query which threads are active within
        a thread group.

    *   New fragment shader built-in functions to share data between fragments
        within a 2x2 pixel quad.

    Shaders using the new functionality provided by this extension should
    enable it via the construct

        #extension GL_NV_shader_thread_group : require     (or enable)

    This extension also specifies some modifications to the program assembly
    language to support the thread state query and thread data sharing
    functionality.

    Note that in this extension specification warp and thread group have the
    same meaning.  A warp is a group of threads that execute in lockstep.
    Each thread in a warp executes the same instruction of a program, but on
    different data.

New Procedures and Functions

    None


New Tokens

    Accepted by the <pname> parameter of GetBooleanv, GetIntegerv,
    GetFloatv, and GetDoublev:

        WARP_SIZE_NV                                    0x9339
        WARPS_PER_SM_NV                                 0x933A
        SM_COUNT_NV                                     0x933B


Modifications to The OpenGL Shading Language Specification, Version 4.30
(Revision 07)

    Including the following line in a shader can be used to control the
    language features described in this extension:

      #extension GL_NV_shader_thread_group : <behavior>

    where <behavior> is as specified in section 3.3.

    New preprocessor #defines are added to the OpenGL Shading Language:

      #define GL_NV_shader_thread_group         1

    Modify Section 7.1, Built-In Language Variables, p. 110

    (Add to the list of built-in variables for the compute, vertex, geometry,
     tessellation control, tessellation evaluation and fragment languages)

        in uint  gl_ThreadInWarpNV;
        in uint  gl_ThreadEqMaskNV;
        in uint  gl_ThreadGeMaskNV;
        in uint  gl_ThreadGtMaskNV;
        in uint  gl_ThreadLeMaskNV;
        in uint  gl_ThreadLtMaskNV;
        in uint  gl_WarpIDNV;
        in uint  gl_SMIDNV;

    (Add to the list of built-in variables for the fragment language)

        in bool  gl_HelperThreadNV;

    (Add these paragraphs at the end of this section)

    The variable gl_ThreadInWarpNV holds the id of the thread within the
    thread group (or warp).  This variable is in the range 0 to
    gl_WarpSizeNV-1, where gl_WarpSizeNV is the total number of threads in a
    warp.

    The variable gl_ThreadEqMaskNV is a bitfield in which the bit equal to the
    current thread id is set.  The variable gl_ThreadGeMaskNV is a bitfield in
    which bits greater than or equal to the current thread id are set.  The
    variable gl_ThreadGtMaskNV is a bitfield in which bits greater than the
    current thread id are set.  The variable gl_ThreadLeMaskNV is a bitfield
    in which bits lower than or equal to the current thread id are set.  The
    variable gl_ThreadLtMaskNV is a bitfield in which bits lower than the
    current thread id are set.

    The values of gl_ThreadEqMaskNV, gl_ThreadGeMaskNV, gl_ThreadGtMaskNV,
    gl_ThreadLeMaskNV and gl_ThreadLtMaskNV are derived from the value of
    gl_ThreadInWarpNV using simple bit-shift arithmetic; they do not take
    into account the value of the thread group active mask.  For example, if
    the application wants a bitfield in which bits lower than or equal to the
    current thread id are set only for active threads, the value of
    gl_ThreadLeMaskNV must be ANDed with the thread group active mask.

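    The bit-shift derivation described above can be modeled on the CPU.  The
    following C sketch is illustrative only: the names ThreadMasks,
    thread_masks and active_le_mask are hypothetical and not part of this
    extension, and a 32-thread warp is assumed (consistent with the 32-bit
    bitfields used by this extension).  The last function shows the ANDing
    step needed to restrict a gl_ThreadLeMaskNV-style mask to active threads.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the five masks; not defined by the extension. */
typedef struct {
    uint32_t eq, ge, gt, le, lt;
} ThreadMasks;

/* Derive the masks from a thread id, assuming a warp of 32 threads. */
ThreadMasks thread_masks(uint32_t thread_in_warp)
{
    ThreadMasks m;
    assert(thread_in_warp < 32u);
    m.eq = UINT32_C(1) << thread_in_warp; /* only the current thread's bit */
    m.lt = m.eq - 1u;                     /* bits below the thread id      */
    m.le = m.lt | m.eq;                   /* bits at or below the id       */
    m.ge = ~m.lt;                         /* bits at or above the id       */
    m.gt = ~m.le;                         /* bits above the thread id      */
    return m;
}

/* The masks ignore the active mask, so AND it in explicitly when needed. */
uint32_t active_le_mask(uint32_t thread_in_warp, uint32_t active_mask)
{
    return thread_masks(thread_in_warp).le & active_mask;
}
```

    Under this model, thread 5 gets an Eq mask of 0x00000020 and an Le mask
    of 0x0000003F.
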
    The variable gl_WarpIDNV holds the warp id of the executing thread.  This
    variable is in the range 0 to gl_WarpsPerSMNV-1, where gl_WarpsPerSMNV is
    the maximum number of warps executing on an SM.

    The variable gl_SMIDNV holds the SM id of the executing thread.  This
    variable is in the range 0 to gl_SMCountNV-1, where gl_SMCountNV is the
    number of SMs on the GPU.

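    Given the ranges above, one possible use of these ids is to combine them
    into a flat slot index, for example to address per-warp scratch storage.
    The C sketch below only models the arithmetic; the function names are
    hypothetical, and it assumes (this specification does not state it) that
    a given (SM id, warp id) pair is held by at most one warp at any instant.
    The same slot may be reused by a different warp later.

```c
#include <assert.h>
#include <stdint.h>

/* Flat index in [0, sm_count * warps_per_sm) built from the two ids.
 * sm_count and warps_per_sm stand in for gl_SMCountNV and gl_WarpsPerSMNV. */
uint32_t global_warp_slot(uint32_t sm_id, uint32_t warp_id,
                          uint32_t sm_count, uint32_t warps_per_sm)
{
    assert(sm_id < sm_count);        /* gl_SMIDNV range   */
    assert(warp_id < warps_per_sm);  /* gl_WarpIDNV range */
    return sm_id * warps_per_sm + warp_id;
}

/* Number of slots a buffer indexed by global_warp_slot() would need. */
uint32_t warp_slot_count(uint32_t sm_count, uint32_t warps_per_sm)
{
    return sm_count * warps_per_sm;
}
```

    With made-up values of 16 SMs and 64 warps per SM, the warp with SM id 3
    and warp id 5 maps to slot 197 out of 1024.
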
    The variable gl_HelperThreadNV specifies if the current thread is a helper
    thread.  In implementations supporting this extension, fragment shader
    invocations may be arranged in SIMD thread groups of 2x2 fragments called
    "quads".  When a fragment shader instruction is executed on a quad, it's
    possible that some fragments within the quad will execute the instruction
    even if they are not covered by the primitive.  Those threads are called
    helper threads.  Their outputs will be discarded and they will not execute
    global store functions, but the intermediate values they compute can still
    be used by thread group sharing functions or by fragment derivative
    functions like dFdx and dFdy.


    Modify Section 7.4, Built-In Uniform State, p. 125

    (Add to the list of built-in uniform variable declarations)

        uniform uint  gl_WarpSizeNV;
        uniform uint  gl_WarpsPerSMNV;
        uniform uint  gl_SMCountNV;

    (Add this paragraph at the end of this section)

    The variable gl_WarpSizeNV is the total number of threads in a warp.  The
    variable gl_WarpsPerSMNV is the maximum number of warps executing on an
    SM.  The variable gl_SMCountNV is the number of SMs on the GPU.


    Modify Section 8.3, Common Functions, p. 133

    (add a function to query which threads are active within a thread group)

    Syntax:

      uint  activeThreadsNV(void)

    In the value returned by activeThreadsNV(), bit <N> is set to 1 if the
    corresponding thread in the SIMD thread group is executing the call to
    activeThreadsNV() and 0 otherwise.  A bit in the return value may be set
    to zero due to conditional flow control (e.g., returning from a function,
    executing the "else" part of an "if" statement) or because the SIMD
    thread group was dispatched without a full collection of threads.

    (add a function to query the state of a Boolean condition over all the
    threads in a thread group)

    Syntax:

      uint  ballotThreadNV(bool value)

    The function ballotThreadNV() computes a 32-bit bitfield.  It looks at the
    condition <value> for each active thread of a thread group and sets to 1
    each bit for which the condition in the corresponding thread is true.  Bits
    for threads with a false condition are set to 0.  Bits for inactive
    threads are also set to 0.  The active thread mask can be queried by
    calling the function activeThreadsNV().

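    The ballot semantics described above can be modeled on the CPU as follows.
    This C sketch is an illustration under stated assumptions, not the
    implementation: ballot_model, popcount32 and rank_in_ballot are
    hypothetical names, and a 32-thread warp is assumed.  rank_in_ballot shows
    one common idiom that is not defined by this extension: ANDing the ballot
    with a gl_ThreadLtMaskNV-style mask to count the voting threads below the
    current one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WARP_SIZE 32u /* assumed; matches the 32-bit ballot result */

/* Bit t is set only if thread t is active AND its condition is true. */
uint32_t ballot_model(const bool value[WARP_SIZE], uint32_t active_mask)
{
    uint32_t result = 0u;
    for (uint32_t t = 0u; t < WARP_SIZE; ++t) {
        bool active = ((active_mask >> t) & 1u) != 0u;
        if (active && value[t])
            result |= UINT32_C(1) << t;
    }
    return result;
}

/* Count the set bits of x by clearing the lowest set bit each iteration. */
uint32_t popcount32(uint32_t x)
{
    uint32_t n = 0u;
    while (x != 0u) {
        x &= x - 1u;
        ++n;
    }
    return n;
}

/* Number of voting threads with an id lower than thread_in_warp. */
uint32_t rank_in_ballot(uint32_t ballot, uint32_t thread_in_warp)
{
    assert(thread_in_warp < WARP_SIZE);
    uint32_t lt_mask = (UINT32_C(1) << thread_in_warp) - 1u;
    return popcount32(ballot & lt_mask);
}
```

    If threads 1, 4 and 9 pass a true condition and all 32 threads are active,
    the model returns 0x00000212; with only threads 0 to 3 active it returns
    0x00000002.
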
    (add functions to share data between fragments in a quad)

    Syntax:

        float  quadSwizzle0NV(float swizzledValue, [float unswizzledValue])
        vec2   quadSwizzle0NV(vec2  swizzledValue, [vec2  unswizzledValue])
        vec3   quadSwizzle0NV(vec3  swizzledValue, [vec3  unswizzledValue])
        vec4   quadSwizzle0NV(vec4  swizzledValue, [vec4  unswizzledValue])

        float  quadSwizzle1NV(float swizzledValue, [float unswizzledValue])
        vec2   quadSwizzle1NV(vec2  swizzledValue, [vec2  unswizzledValue])
        vec3   quadSwizzle1NV(vec3  swizzledValue, [vec3  unswizzledValue])
        vec4   quadSwizzle1NV(vec4  swizzledValue, [vec4  unswizzledValue])

        float  quadSwizzle2NV(float swizzledValue, [float unswizzledValue])
        vec2   quadSwizzle2NV(vec2  swizzledValue, [vec2  unswizzledValue])
        vec3   quadSwizzle2NV(vec3  swizzledValue, [vec3  unswizzledValue])
        vec4   quadSwizzle2NV(vec4  swizzledValue, [vec4  unswizzledValue])

        float  quadSwizzle3NV(float swizzledValue, [float unswizzledValue])
        vec2   quadSwizzle3NV(vec2  swizzledValue, [vec2  unswizzledValue])
        vec3   quadSwizzle3NV(vec3  swizzledValue, [vec3  unswizzledValue])
        vec4   quadSwizzle3NV(vec4  swizzledValue, [vec4  unswizzledValue])

        float  quadSwizzleXNV(float swizzledValue, [float unswizzledValue])
        vec2   quadSwizzleXNV(vec2  swizzledValue, [vec2  unswizzledValue])
        vec3   quadSwizzleXNV(vec3  swizzledValue, [vec3  unswizzledValue])
        vec4   quadSwizzleXNV(vec4  swizzledValue, [vec4  unswizzledValue])

        float  quadSwizzleYNV(float swizzledValue, [float unswizzledValue])
        vec2   quadSwizzleYNV(vec2  swizzledValue, [vec2  unswizzledValue])
        vec3   quadSwizzleYNV(vec3  swizzledValue, [vec3  unswizzledValue])
        vec4   quadSwizzleYNV(vec4  swizzledValue, [vec4  unswizzledValue])

    In implementations supporting this extension, if a primitive covers a
    fragment at (x,y), its fragment shader invocation will be arranged in a
    SIMD thread group with fragment shader invocations corresponding to three
    neighboring pixels.  These four invocations are arranged in a 2x2 grid,
    called a "quad".  If the neighbors of a fragment are not covered by the
    primitive, fragment shader invocations will still be generated.  The
    implementation may compute differences between values in these threads to
    estimate derivatives for dFdx(), dFdy(), and for texture lookups with
    automatic LOD calculations.

    Fragments may have different locations in the quads based on the type of
    render target.

    When rendering to a window, fragments within a quad follow this pattern:

        ---------------------------------------------------
        | gl_ThreadInWarpNV 4N+0 | gl_ThreadInWarpNV 4N+1 |
        |     pixel (X+0,Y+1)    |     pixel (X+1,Y+1)    |
        ---------------------------------------------------
        | gl_ThreadInWarpNV 4N+2 | gl_ThreadInWarpNV 4N+3 |
        |     pixel (X+0,Y+0)    |     pixel (X+1,Y+0)    |
        ---------------------------------------------------


    When rendering to a framebuffer object, fragments within a quad follow this
    pattern:

        ---------------------------------------------------
        | gl_ThreadInWarpNV 4N+2 | gl_ThreadInWarpNV 4N+3 |
        |     pixel (X+0,Y+1)    |     pixel (X+1,Y+1)    |
        ---------------------------------------------------
        | gl_ThreadInWarpNV 4N+0 | gl_ThreadInWarpNV 4N+1 |
        |     pixel (X+0,Y+0)    |     pixel (X+1,Y+0)    |
        ---------------------------------------------------

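    The two layouts above differ only in which pair of threads sits on the
    upper row.  The C sketch below encodes both diagrams as a mapping from
    gl_ThreadInWarpNV to the pixel offset relative to (X,Y).  QuadOffset and
    quad_offset are hypothetical names introduced for illustration, and a
    32-thread warp is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Offset of a fragment from the quad's (X,Y) corner. */
typedef struct {
    uint32_t dx, dy;
} QuadOffset;

/* Map a thread id to its pixel offset for a window or an FBO target. */
QuadOffset quad_offset(uint32_t thread_in_warp, bool render_to_fbo)
{
    QuadOffset o;
    assert(thread_in_warp < 32u);
    uint32_t lane = thread_in_warp & 3u; /* position 0..3 within the quad  */
    uint32_t pair = lane >> 1;           /* 0 for 4N+0/4N+1, 1 for 4N+2/3  */
    o.dx = lane & 1u;                    /* odd threads are at X+1         */
    /* Window: threads 4N+0/4N+1 are at Y+1.  FBO: they are at Y+0. */
    o.dy = render_to_fbo ? pair : 1u - pair;
    return o;
}
```

    For instance, thread 4N+0 maps to offset (0,1) when rendering to a window
    and to (0,0) when rendering to a framebuffer object.
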
    There are six quadSwizzle functions that allow fragments within a quad to
    exchange data.  Each of these functions reads a floating point operand
    <swizzledValue>, which can come from any fragment in the quad.  Another
    optional floating point operand <unswizzledValue>, which comes from the
    current fragment, can be added to <swizzledValue>.  The only difference
    between the quadSwizzle functions is the location within the 2x2 pixel
    quad from which they read the <swizzledValue> operand.

    quadSwizzle0NV will read the <swizzledValue> operand from fragment 0:

        result[thread N] = swizzledValue[thread 0] + unswizzledValue[thread N]


    quadSwizzle1NV will read the <swizzledValue> operand from fragment 1:

        result[thread N] = swizzledValue[thread 1] + unswizzledValue[thread N]


    quadSwizzle2NV will read the <swizzledValue> operand from fragment 2:

        result[thread N] = swizzledValue[thread 2] + unswizzledValue[thread N]


    quadSwizzle3NV will read the <swizzledValue> operand from fragment 3:

        result[thread N] = swizzledValue[thread 3] + unswizzledValue[thread N]


    quadSwizzleXNV will read the <swizzledValue> operand for each fragment
    from its neighbor in X:

        result[thread 0] = swizzledValue[thread 1] + unswizzledValue[thread 0]
        result[thread 1] = swizzledValue[thread 0] + unswizzledValue[thread 1]
        result[thread 2] = swizzledValue[thread 3] + unswizzledValue[thread 2]
        result[thread 3] = swizzledValue[thread 2] + unswizzledValue[thread 3]


    quadSwizzleYNV will read the <swizzledValue> operand for each fragment
    from its neighbor in Y:

        result[thread 0] = swizzledValue[thread 2] + unswizzledValue[thread 0]
        result[thread 1] = swizzledValue[thread 3] + unswizzledValue[thread 1]
        result[thread 2] = swizzledValue[thread 0] + unswizzledValue[thread 2]
        result[thread 3] = swizzledValue[thread 1] + unswizzledValue[thread 3]


    If any thread in a 2x2 pixel quad is inactive, the quad is divergent.  In
    this case quadSwizzle will return 0 for all fragments in the quad.


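    The result tables above, including the divergent-quad rule, can be
    condensed into a small CPU model.  This C sketch is illustrative only: the
    enum and function names are hypothetical, only the scalar float variants
    are modeled, and an omitted <unswizzledValue> is modeled by passing
    zeros, which is an assumption about the optional operand.  In this model
    the X and Y neighbors are obtained by flipping bit 0 and bit 1 of the
    thread's position in the quad, which reproduces the tables above.

```c
#include <assert.h>
#include <stdint.h>

/* QSWZ_0..QSWZ_3 have the values 0..3, i.e. the source fragment index. */
typedef enum { QSWZ_0, QSWZ_1, QSWZ_2, QSWZ_3, QSWZ_X, QSWZ_Y } QuadSwizzleMode;

/* Which fragment of the quad supplies swizzledValue for fragment `lane`. */
uint32_t quad_swizzle_source(QuadSwizzleMode mode, uint32_t lane)
{
    assert(lane < 4u);
    switch (mode) {
    case QSWZ_X: return lane ^ 1u;       /* neighbor in X: 0<->1, 2<->3 */
    case QSWZ_Y: return lane ^ 2u;       /* neighbor in Y: 0<->2, 1<->3 */
    default:     return (uint32_t)mode;  /* fixed fragment 0, 1, 2 or 3 */
    }
}

/* Evaluate one quadSwizzle call for all four fragments of a quad.
 * The low 4 bits of active_mask say which fragments execute the call. */
void quad_swizzle_model(QuadSwizzleMode mode, uint32_t active_mask,
                        const float swizzled[4], const float unswizzled[4],
                        float result[4])
{
    int divergent = (active_mask & 0xFu) != 0xFu;
    for (uint32_t lane = 0u; lane < 4u; ++lane) {
        uint32_t src = quad_swizzle_source(mode, lane);
        /* A divergent quad yields 0 for every fragment. */
        result[lane] = divergent ? 0.0f : swizzled[src] + unswizzled[lane];
    }
}
```

    With swizzled values {1, 2, 4, 8} and unswizzled values {10, 20, 30, 40},
    the X variant yields {12, 21, 38, 44} and the Y variant yields
    {14, 28, 31, 42} in this model.
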
Dependencies on NV_gpu_program5

    If NV_gpu_program5 is supported and "OPTION NV_shader_thread_group" is
    specified in an assembly program, the following edits are made to extend
    the assembly programming model documented in the NV_gpu_program4 extension
    and extended by NV_gpu_program5.

    If NV_gpu_program5 is not supported, or if "OPTION NV_shader_thread_group"
    is not specified in an assembly program, the contents of this dependencies
    section should be ignored.

    Modify Section 2.X.2, Program Grammar

    (add the following rules to the NV_gpu_program4 and
     NV_gpu_program5 base grammars)

    <VECTORop>              ::= "TGBALLOT"

    <stateSingleItem>       ::= "state" "." <stateThreadItem>

    <stateThreadItem>       ::= "thread" "." <stateThreadProperty>

    <stateThreadProperty>   ::= "warpsize"
                              | "warpspersm"
                              | "smcount"

    (add/change the following rules to the NV_fragment_program4 and
     NV_gpu_program5 base grammars)

    <VECTORop>              ::= "QSWZ0"
                              | "QSWZ1"
                              | "QSWZ2"
                              | "QSWZ3"
                              | "QSWZX"
                              | "QSWZY"

    <attribBasic>           ::= <fragPrefix> "threadid"
                              | <fragPrefix> "threadeqmask"
                              | <fragPrefix> "threadltmask"
                              | <fragPrefix> "threadlemask"
                              | <fragPrefix> "threadgtmask"
                              | <fragPrefix> "threadgemask"
                              | <fragPrefix> "warpid"
                              | <fragPrefix> "smid"
                              | <fragPrefix> "helperthread"

    (add/change the following rules to the NV_vertex_program4 and
     NV_gpu_program5 base grammars)

    <attribBasic>           ::= <vtxPrefix> "threadid"
                              | <vtxPrefix> "threadeqmask"
                              | <vtxPrefix> "threadltmask"
                              | <vtxPrefix> "threadlemask"
                              | <vtxPrefix> "threadgtmask"
                              | <vtxPrefix> "threadgemask"
                              | <vtxPrefix> "warpid"
                              | <vtxPrefix> "smid"

    (add/change the following rules to the NV_geometry_program4 and
     NV_gpu_program5 base grammars)

    <attribBasic>           ::= <primPrefix> "threadid"
                              | <primPrefix> "threadeqmask"
                              | <primPrefix> "threadltmask"
                              | <primPrefix> "threadlemask"
                              | <primPrefix> "threadgtmask"
                              | <primPrefix> "threadgemask"
                              | <primPrefix> "warpid"
                              | <primPrefix> "smid"

    Modify Section 2.X.3.2 of the NV_gpu_program4 specification, Program
    Attribute Variables.

    (Add the table entries and relevant text describing the fragment program
     input variables used to query thread states.)

      Fragment Attribute Binding  Components  Underlying State
      --------------------------  ----------  ----------------------------
      ...
      fragment.threadid           (id,-,-,-)  id of the current thread
      fragment.threadeqmask       (m,-,-,-)   mask with the current thread
      fragment.threadltmask       (m,-,-,-)   mask with lower thread
      fragment.threadlemask       (m,-,-,-)   mask with lower or equal thread
      fragment.threadgtmask       (m,-,-,-)   mask with greater thread
      fragment.threadgemask       (m,-,-,-)   mask with greater or equal thread
      fragment.warpid             (id,-,-,-)  warp id of the current thread
      fragment.smid               (id,-,-,-)  SM id of the current thread
      fragment.helperthread       (k,-,-,-)   current thread is a helper thread
      ...

    If a fragment attribute binding matches "fragment.threadid", the "x"
    component is filled with the thread id of the current thread.  The thread
    id is an unsigned integer in the range 0 to 31.

    If a fragment attribute binding matches "fragment.threadeqmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which the
    bit equal to the current thread id is set.

    If a fragment attribute binding matches "fragment.threadltmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    lower than the current thread id are set.

    If a fragment attribute binding matches "fragment.threadlemask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    lower than or equal to the current thread id are set.

    If a fragment attribute binding matches "fragment.threadgtmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    greater than the current thread id are set.

    If a fragment attribute binding matches "fragment.threadgemask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    greater than or equal to the current thread id are set.

    If a fragment attribute binding matches "fragment.warpid", the "x"
    component is filled with the warp id of the current thread.  The warp id is
    an unsigned integer whose range is hardware dependent.

    If a fragment attribute binding matches "fragment.smid", the "x" component
    is filled with the SM id of the current thread.  The SM id is an unsigned
    integer whose range is hardware dependent.

    If a fragment attribute binding matches "fragment.helperthread", the "x"
    component is an integer value equal to -1 when the current thread is a
    helper thread and 0 otherwise.  In implementations supporting this
    extension, fragment program invocations may be arranged in SIMD thread
    groups of 2x2 fragments called "quads".  When a fragment program
    instruction is executed on a quad, it's possible that some fragments
    within the quad will execute the instruction even if they are not covered
    by the primitive.  Those threads are called helper threads.  Their outputs
    will be discarded and they will not execute global store instructions, but
    the intermediate values they compute can still be used by thread group
    sharing instructions or by fragment derivative instructions like DDX and
    DDY.

    (Add the table entries and relevant text describing the vertex program
     attribute variables used to query thread states.)

      Vertex Attribute Binding  Components  Underlying State
      ------------------------  ----------  ----------------------------
      ...
      vertex.threadid           (id,-,-,-)  id of the current thread
      vertex.threadeqmask       (m,-,-,-)   mask with the current thread
      vertex.threadltmask       (m,-,-,-)   mask with lower thread
      vertex.threadlemask       (m,-,-,-)   mask with lower or equal thread
      vertex.threadgtmask       (m,-,-,-)   mask with greater thread
      vertex.threadgemask       (m,-,-,-)   mask with greater or equal thread
      vertex.warpid             (id,-,-,-)  warp id of the current thread
      vertex.smid               (id,-,-,-)  SM id of the current thread
      ...

    If a vertex attribute binding matches "vertex.threadid", the "x" component
    is filled with the thread id of the current thread.  The thread id is an
    unsigned integer in the range 0 to 31.

    If a vertex attribute binding matches "vertex.threadeqmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which the
    bit equal to the current thread id is set.

    If a vertex attribute binding matches "vertex.threadltmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    lower than the current thread id are set.

    If a vertex attribute binding matches "vertex.threadlemask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    lower than or equal to the current thread id are set.

    If a vertex attribute binding matches "vertex.threadgtmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    greater than the current thread id are set.

    If a vertex attribute binding matches "vertex.threadgemask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    greater than or equal to the current thread id are set.

    If a vertex attribute binding matches "vertex.warpid", the "x" component is
    filled with the warp id of the current thread.  The warp id is an unsigned
    integer whose range is hardware dependent.

    If a vertex attribute binding matches "vertex.smid", the "x" component
    is filled with the SM id of the current thread.  The SM id is an unsigned
    integer whose range is hardware dependent.


    (Add the table entries and relevant text describing the geometry program
     attribute variables used to query thread states.)

      Geometry Attribute Binding  Components  Underlying State
      --------------------------  ----------  ----------------------------
      ...
      primitive.threadid          (id,-,-,-)  id of the current thread
      primitive.threadeqmask      (m,-,-,-)   mask with the current thread
      primitive.threadltmask      (m,-,-,-)   mask with lower thread
      primitive.threadlemask      (m,-,-,-)   mask with lower or equal thread
      primitive.threadgtmask      (m,-,-,-)   mask with greater thread
      primitive.threadgemask      (m,-,-,-)   mask with greater or equal thread
      primitive.warpid            (id,-,-,-)  warp id of the current thread
      primitive.smid              (id,-,-,-)  SM id of the current thread
      ...

    If a geometry attribute binding matches "primitive.threadid", the "x"
    component is filled with the thread id of the current thread.  The thread
    id is an unsigned integer in the range 0 to 31.

    If a geometry attribute binding matches "primitive.threadeqmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which the
    bit equal to the current thread id is set.

    If a geometry attribute binding matches "primitive.threadltmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    lower than the current thread id are set.

    If a geometry attribute binding matches "primitive.threadlemask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    lower than or equal to the current thread id are set.

    If a geometry attribute binding matches "primitive.threadgtmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    greater than the current thread id are set.

    If a geometry attribute binding matches "primitive.threadgemask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    greater than or equal to the current thread id are set.

    If a geometry attribute binding matches "primitive.warpid", the "x"
    component is filled with the warp id of the current thread.  The warp id is
    an unsigned integer whose range is hardware dependent.

    If a geometry attribute binding matches "primitive.smid", the "x" component
    is filled with the SM id of the current thread.  The SM id is an unsigned
    integer whose range is hardware dependent.


    (add the following subsection to section 2.X.3.3, Parameters)

    Thread Group Property Bindings

      Binding                        Components  Underlying State
      -----------------------------  ----------  ----------------------------
      state.thread.warpsize          (x,-,-,-)   total number of threads in a
                                                 warp
      state.thread.warpspersm        (x,-,-,-)   maximum number of warps
                                                 executing on an SM
      state.thread.smcount           (x,-,-,-)   number of SMs on the GPU

    If a program parameter binding matches "state.thread.warpsize", the "x"
    component of the program parameter variable is filled with an integer value
    indicating the total number of threads in a warp.  The "y", "z", and "w"
    components are undefined.

    If a program parameter binding matches "state.thread.warpspersm", the "x"
    component of the program parameter variable is filled with an integer value
    indicating the maximum number of warps executing on an SM.  The "y", "z",
    and "w" components are undefined.

    If a program parameter binding matches "state.thread.smcount", the "x"
    component of the program parameter variable is filled with an integer value
    indicating the number of SMs on the GPU.  The "y", "z", and "w" components
    are undefined.


    Modify Section 2.X.4, Program Execution Environment

    (Add the table entries and relevant text describing the program
     instruction to query thread conditions.)

      Instr-      Modifiers
      uction   V  F I C S H D  Out Inputs    Description
      -------  -- - - - - - -  --- --------  --------------------------------
      ...
      TGBALLOT 50 X X X X - - F  vu  v        query a boolean in thread group
      ...


    (Add the table entries and relevant text describing the fragment program
     instructions to exchange data between threads.)

      Instr-      Modifiers
      uction   V  F I C S H D  Out Inputs    Description
      -------  -- - - - - - -  --- --------  --------------------------------
      ...
      QSWZ0    50 X - - - - - F  v   v,v      add fragment 0 in a quad
      QSWZ1    50 X - - - - - F  v   v,v      add fragment 1 in a quad
      QSWZ2    50 X - - - - - F  v   v,v      add fragment 2 in a quad
      QSWZ3    50 X - - - - - F  v   v,v      add fragment 3 in a quad
      QSWZX    50 X - - - - - F  v   v,v      add fragments horizontally
      QSWZY    50 X - - - - - F  v   v,v      add fragments vertically
      ...


    (Add to "Section 2.X.6, Program Options" of the NV_gpu_program4 extension,
     as extended by NV_gpu_program5)

    + Shader thread group (NV_shader_thread_group)

    If a fragment program specifies the "NV_shader_thread_group" option, it
    may use the "fragment.threadid", "fragment.threadeqmask",
    "fragment.threadltmask", "fragment.threadlemask", "fragment.threadgtmask",
    "fragment.threadgemask", "fragment.warpid", "fragment.smid",
    "fragment.helperthread", "state.thread.warpsize", "state.thread.warpspersm"
    and "state.thread.smcount" bindings.  It may also use the "TGBALLOT",
    "QSWZ0", "QSWZ1", "QSWZ2", "QSWZ3", "QSWZX" and "QSWZY" instructions.  If
    this option is not specified, a program will fail to compile if it uses
    those instructions or bindings.

    If a vertex program specifies the "NV_shader_thread_group" option, it may
    use the "vertex.threadid", "vertex.threadeqmask", "vertex.threadltmask",
    "vertex.threadlemask", "vertex.threadgtmask", "vertex.threadgemask",
    "vertex.warpid", "vertex.smid", "state.thread.warpsize",
    "state.thread.warpspersm" and "state.thread.smcount" bindings.  It may also
    use the "TGBALLOT" instruction.  If this option is not specified, a program
    will fail to compile if it uses those instructions or bindings.

    If a geometry program specifies the "NV_shader_thread_group" option, it
    may use the "primitive.threadid", "primitive.threadeqmask",
    "primitive.threadltmask", "primitive.threadlemask",
    "primitive.threadgtmask", "primitive.threadgemask", "primitive.warpid",
    "primitive.smid", "state.thread.warpsize", "state.thread.warpspersm" and
    "state.thread.smcount" bindings.  It may also use the "TGBALLOT"
    instruction.  If this option is not specified, a program will fail to
    compile if it uses those instructions or bindings.

    Section 2.X.8.Z, QSWZ0:  add fragment 0 data to all fragments in a quad

    The QSWZ0 instruction produces a floating point result by adding the
    first operand, a floating point value from fragment 0, to the second
    operand, another floating point value from the current fragment.

    quadSwizzle0NV is the GLSL function that implements the same functionality
    as the QSWZ0 assembly instruction.  Section 8.3 of the OpenGL Shading
    Language Specification describes quadSwizzle0NV in more detail; that
    additional information also applies to QSWZ0.
676
677
    Section 2.X.8.Z, QSWZ1:  add fragment 1 data to all fragments in a quad

    The QSWZ1 instruction produces a floating point result by adding the
    first operand, a floating point value from fragment 1, to the second
    operand, another floating point value from the current fragment.

    quadSwizzle1NV is the GLSL function that implements the same functionality
    as the QSWZ1 assembly instruction.  Section 8.3 of the OpenGL Shading
    Language Specification describes quadSwizzle1NV in more detail; that
    additional information also applies to QSWZ1.
688
689
    Section 2.X.8.Z, QSWZ2:  add fragment 2 data to all fragments in a quad

    The QSWZ2 instruction produces a floating point result by adding the
    first operand, a floating point value from fragment 2, to the second
    operand, another floating point value from the current fragment.

    quadSwizzle2NV is the GLSL function that implements the same functionality
    as the QSWZ2 assembly instruction.  Section 8.3 of the OpenGL Shading
    Language Specification describes quadSwizzle2NV in more detail; that
    additional information also applies to QSWZ2.
700
701
    Section 2.X.8.Z, QSWZ3:  add fragment 3 data to all fragments in a quad

    The QSWZ3 instruction produces a floating point result by adding the
    first operand, a floating point value from fragment 3, to the second
    operand, another floating point value from the current fragment.

    quadSwizzle3NV is the GLSL function that implements the same functionality
    as the QSWZ3 assembly instruction.  Section 8.3 of the OpenGL Shading
    Language Specification describes quadSwizzle3NV in more detail; that
    additional information also applies to QSWZ3.
712
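    The QSWZ0 through QSWZ3 descriptions share one pattern: the first operand
    is read from a fixed fragment of the quad and the second operand from the
    current fragment.  The C sketch below emulates that pattern on the CPU,
    for illustration only.  The QuadValues type and the quad_swizzle_n helper
    are hypothetical names that are not part of this extension, and the
    sketch assumes that all four fragments of the quad hold valid values.

```c
#include <assert.h>

/* One operand value per fragment of a 2x2 quad, indexed by fragment 0..3.
 * Hypothetical type, used only for this illustration. */
typedef struct {
    float v[4];
} QuadValues;

/* Emulates QSWZ<src> as seen by fragment `cur`: the first operand is taken
 * from fragment `src`, the second operand from the current fragment. */
float quad_swizzle_n(const QuadValues *op0, const QuadValues *op1,
                     unsigned src, unsigned cur)
{
    assert(src < 4 && cur < 4);
    return op0->v[src] + op1->v[cur];
}
```

    For example, with first-operand values {1, 2, 3, 4} and second-operand
    values {10, 20, 30, 40}, quad_swizzle_n(&op0, &op1, 0, 3) models QSWZ0
    executing in fragment 3 and yields 1 + 40 = 41.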
713
    Section 2.X.8.Z, QSWZX:  add fragments in a quad horizontally

    The QSWZX instruction produces a floating point result by adding the
    first operand, a floating point value from the current fragment's
    horizontal (X) neighbor in the quad, to the second operand, another
    floating point value from the current fragment.

    quadSwizzleXNV is the GLSL function that implements the same functionality
    as the QSWZX assembly instruction.  Section 8.3 of the OpenGL Shading
    Language Specification describes quadSwizzleXNV in more detail; that
    additional information also applies to QSWZX.
725
726
    Section 2.X.8.Z, QSWZY:  add fragments in a quad vertically

    The QSWZY instruction produces a floating point result by adding the
    first operand, a floating point value from the current fragment's
    vertical (Y) neighbor in the quad, to the second operand, another
    floating point value from the current fragment.

    quadSwizzleYNV is the GLSL function that implements the same functionality
    as the QSWZY assembly instruction.  Section 8.3 of the OpenGL Shading
    Language Specification describes quadSwizzleYNV in more detail; that
    additional information also applies to QSWZY.
738
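    One way to picture QSWZX and QSWZY is as a two-pass reduction: a
    horizontal pass followed by a vertical pass leaves every fragment holding
    the sum of all four values in the quad.  The C sketch below emulates this
    on the CPU.  It ASSUMES that fragments are numbered so that bit 0 of the
    fragment index selects the column and bit 1 selects the row; the
    authoritative layout is the one given with the GLSL quadSwizzle functions.
    All function names are hypothetical.

```c
#include <assert.h>

/* ASSUMPTION: bit 0 of the fragment index selects the column of the quad and
 * bit 1 selects the row, so flipping a bit yields the neighbor on that axis. */
unsigned neighbor_x(unsigned cur) { assert(cur < 4); return cur ^ 1u; }
unsigned neighbor_y(unsigned cur) { assert(cur < 4); return cur ^ 2u; }

/* QSWZX: first operand from the horizontal neighbor, second from `cur`. */
float qswzx(const float op0[4], const float op1[4], unsigned cur)
{
    return op0[neighbor_x(cur)] + op1[cur];
}

/* QSWZY: first operand from the vertical neighbor, second from `cur`. */
float qswzy(const float op0[4], const float op1[4], unsigned cur)
{
    return op0[neighbor_y(cur)] + op1[cur];
}

/* Sum of all four values of the quad as seen by fragment `cur`: every
 * fragment first adds its horizontal neighbor, then the vertical neighbor's
 * partial sum is added on top. */
float quad_sum(const float v[4], unsigned cur)
{
    float row[4];
    unsigned i;
    for (i = 0; i < 4; ++i)
        row[i] = qswzx(v, v, i);
    return qswzy(row, row, cur);
}
```

    With values {1, 2, 3, 4}, the horizontal pass produces {3, 3, 7, 7} and
    the vertical pass produces 10 in every fragment.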
739
740    Section 2.X.8.Z, TGBALLOT:  query a boolean condition over a thread group
741
    The TGBALLOT instruction produces a result vector by reading a vector
    operand for each active thread in the current thread group and comparing
    each component to zero.  Each result vector component contains an integer
    bitmask (described below) in which a bit is set if the corresponding
    component of the operand vector is non-zero for the corresponding thread,
    and not set otherwise.

    Sometimes, when the instruction is in a conditional control flow block or
    when it is not possible to completely fill a thread group, only a subset
    of the threads in the thread group will be active and will execute the
    TGBALLOT instruction.  Each bit in the bitfield corresponding to an
    inactive thread will be set to 0.  The active thread mask can be queried
    by executing TGBALLOT with 1 as the first operand.
755
756      tmp = VectorLoad(op0);
757      result = { 0, 0, 0, 0 };
758      for (all active threads) {
759        if ([thread]tmp.x != 0) result.x |= 1 << thread;
760        if ([thread]tmp.y != 0) result.y |= 1 << thread;
761        if ([thread]tmp.z != 0) result.z |= 1 << thread;
762        if ([thread]tmp.w != 0) result.w |= 1 << thread;
763      }
764
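    The TGBALLOT pseudocode above can be emulated on the CPU as in the C
    sketch below, which is illustrative only.  It ASSUMES a thread group of
    32 threads (matching the 0 to 31 thread id range used in this extension)
    and integer operands; the IVec4 and UVec4 types, the active_mask
    parameter and the tgballot function are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define GROUP_SIZE 32  /* ASSUMPTION: 32 threads per thread group */

typedef struct { int32_t  x, y, z, w; } IVec4;  /* operand seen by one thread */
typedef struct { uint32_t x, y, z, w; } UVec4;  /* one bitmask per component  */

/* op0[t] is the operand value held by thread t; bit t of active_mask is set
 * if thread t is active and executes the instruction. */
UVec4 tgballot(const IVec4 op0[GROUP_SIZE], uint32_t active_mask)
{
    UVec4 result = {0, 0, 0, 0};
    unsigned t;

    assert(op0 != 0);
    for (t = 0; t < GROUP_SIZE; ++t) {
        uint32_t bit = (uint32_t)1 << t;
        if (!(active_mask & bit))
            continue;                 /* inactive threads contribute 0 */
        if (op0[t].x != 0) result.x |= bit;
        if (op0[t].y != 0) result.y |= bit;
        if (op0[t].z != 0) result.z |= bit;
        if (op0[t].w != 0) result.w |= bit;
    }
    return result;
}
```

    Passing an operand whose "x" component is 1 in every thread returns the
    active thread mask in result.x, as described above.
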
Dependencies on NV_tessellation_program5
766
767    If NV_tessellation_program5 is supported and
768    "OPTION NV_shader_thread_group" is specified in an assembly program, the
769    following edits are made to extend the assembly programming model
770    documented in the NV_gpu_program4 extension and extended by NV_gpu_program5
771    and NV_tessellation_program5.
772
773    If NV_tessellation_program5 is not supported, or if
774    "OPTION NV_shader_thread_group" is not specified in an assembly program,
775    the contents of this dependencies section should be ignored.
776
777
778    Modify Section 2.X.2, Program Grammar
779
780    (add/change the following rules to the NV_gpu_program5 base grammars for
781     tessellation control programs)
782
783    <attribBasic>           ::= <primPrefix> "threadid"
784                              | <primPrefix> "threadeqmask"
785                              | <primPrefix> "threadltmask"
786                              | <primPrefix> "threadlemask"
787                              | <primPrefix> "threadgtmask"
788                              | <primPrefix> "threadgemask"
789                              | <primPrefix> "warpid"
790                              | <primPrefix> "smid"
791
792    (add/change the following rules to the NV_gpu_program5 base grammars for
793     tessellation evaluation programs)
794
795    <attribBasic>           ::= <primPrefix> "threadid"
796                              | <primPrefix> "threadeqmask"
797                              | <primPrefix> "threadltmask"
798                              | <primPrefix> "threadlemask"
799                              | <primPrefix> "threadgtmask"
800                              | <primPrefix> "threadgemask"
801                              | <primPrefix> "warpid"
802                              | <primPrefix> "smid"
803
804
805    Modify Section 2.X.3.2 of the NV_tessellation_program5 specification,
806    Program Attribute Variables.
807
    (Add the table entries and relevant text describing the tessellation
     control and evaluation program attribute variables used to query thread
     states.)
811
812
      Primitive Binding Suffix    Components  Underlying State
      --------------------------  ----------  ---------------------------------
      ...
      primitive.threadid          (id,-,-,-)  id of the current thread
      primitive.threadeqmask      (m,-,-,-)   mask with the current thread
      primitive.threadltmask      (m,-,-,-)   mask with lower thread
      primitive.threadlemask      (m,-,-,-)   mask with lower or equal thread
      primitive.threadgtmask      (m,-,-,-)   mask with greater thread
      primitive.threadgemask      (m,-,-,-)   mask with greater or equal thread
      primitive.warpid            (id,-,-,-)  warp id of the current thread
      primitive.smid              (id,-,-,-)  SM id of the current thread
      ...
825
    If an attribute binding matches "primitive.threadid", the "x" component is
    filled with the thread id of the current thread.  The thread id is an
    unsigned integer in the range 0 to 31.

    If an attribute binding matches "primitive.threadeqmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which the
    bit whose position equals the current thread id is set.

    If an attribute binding matches "primitive.threadltmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which the
    bits lower than the current thread id are set.

    If an attribute binding matches "primitive.threadlemask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which the
    bits lower than or equal to the current thread id are set.

    If an attribute binding matches "primitive.threadgtmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which the
    bits greater than the current thread id are set.

    If an attribute binding matches "primitive.threadgemask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which the
    bits greater than or equal to the current thread id are set.

    If an attribute binding matches "primitive.warpid", the "x" component is
    filled with the warp id of the current thread.  The warp id is an unsigned
    integer whose range is hardware dependent.

    If an attribute binding matches "primitive.smid", the "x" component is
    filled with the SM id of the current thread.  The SM id is an unsigned
    integer whose range is hardware dependent.
857
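    The five thread mask bindings can all be derived from the thread id.  The
    C sketch below shows one way to compute them, for illustration only.  It
    ASSUMES a thread id in the range 0 to 31 and that "lower" and "greater"
    refer to bit positions below and above the thread id; the function names
    are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Only the bit at position `id` is set. */
uint32_t thread_eq_mask(unsigned id)
{
    assert(id < 32);
    return (uint32_t)1 << id;
}

/* Bits at positions lower than `id` are set. */
uint32_t thread_lt_mask(unsigned id)
{
    return thread_eq_mask(id) - 1u;
}

/* Bits at positions lower than or equal to `id` are set. */
uint32_t thread_le_mask(unsigned id)
{
    return thread_lt_mask(id) | thread_eq_mask(id);
}

/* Bits at positions greater than `id` are set. */
uint32_t thread_gt_mask(unsigned id)
{
    return (uint32_t)~thread_le_mask(id);
}

/* Bits at positions greater than or equal to `id` are set. */
uint32_t thread_ge_mask(unsigned id)
{
    return (uint32_t)~thread_lt_mask(id);
}
```

    For thread id 5, these helpers return 0x00000020, 0x0000001F, 0x0000003F,
    0xFFFFFFC0 and 0xFFFFFFE0 respectively.
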
858    (Add to "Section 2.X.6, Program Options" of the NV_gpu_program4 extension,
859     as extended by NV_gpu_program5 and NV_tessellation_program5)
860
861    + Shader thread group (NV_shader_thread_group)
862
    If a program specifies the "NV_shader_thread_group" option, it may use
    the "primitive.threadid", "primitive.threadeqmask",
    "primitive.threadltmask", "primitive.threadlemask",
    "primitive.threadgtmask", "primitive.threadgemask", "primitive.warpid",
    "primitive.smid", "state.thread.warpsize", "state.thread.warpspersm" and
    "state.thread.smcount" bindings.  It may also use the "TGBALLOT"
    instruction.  If this option is not specified, a program will fail to
    compile if it uses that instruction or those bindings.
871
872
Dependencies on NV_compute_program5
874
875    If NV_compute_program5 is supported and "OPTION NV_shader_thread_group" is
876    specified in an assembly program, the following edits are made to extend
877    the assembly programming model documented in the NV_gpu_program4 extension
878    and extended by NV_gpu_program5 and NV_compute_program5.
879
880    If NV_compute_program5 is not supported, or if
881    "OPTION NV_shader_thread_group" is not specified in an assembly program,
882    the contents of this dependencies section should be ignored.
883
884    Section 2.X.2, Program Grammar
885
886    (add the following rules to the grammar)
887
888    <attribBasic>           ::= "invocation" "." "threadid"
889                              | "invocation" "." "threadeqmask"
890                              | "invocation" "." "threadltmask"
891                              | "invocation" "." "threadlemask"
892                              | "invocation" "." "threadgtmask"
893                              | "invocation" "." "threadgemask"
894                              | "invocation" "." "warpid"
895                              | "invocation" "." "smid"
896
897    Modify Section 2.X.3.2 of the NV_compute_program5 specification, Program
898    Attribute Variables.
899
    (Add the table entries and relevant text describing the compute program
     input variables used to query thread states.)
902
903      Attribute Binding           Components  Underlying State
904      --------------------------  ----------  ----------------------------
905      ...
906      invocation.threadid         (id,-,-,-)  id of the current thread
907      invocation.threadeqmask     (m,-,-,-)   mask with the current thread
908      invocation.threadltmask     (m,-,-,-)   mask with lower thread
909      invocation.threadlemask     (m,-,-,-)   mask with lower or equal thread
910      invocation.threadgtmask     (m,-,-,-)   mask with greater thread
911      invocation.threadgemask     (m,-,-,-)   mask with greater or equal thread
912      invocation.warpid           (id,-,-,-)  warp id of the current thread
913      invocation.smid             (id,-,-,-)  SM id of the current thread
914      ...
915
916    If a compute attribute binding matches "invocation.threadid", the "x"
917    component is filled with the thread id of the current thread.  The thread
918    id is an unsigned integer in the range 0 to 31.
919
920    If a compute attribute binding matches "invocation.threadeqmask", the "x"
921    component is filled with a 32-bit unsigned integer bitfield in which the
922    bit equal to the current thread id is set.
923
924    If a compute attribute binding matches "invocation.threadltmask", the "x"
925    component is filled with a 32-bit unsigned integer bitfield in which bits
926    lower than the current thread id are set.
927
    If a compute attribute binding matches "invocation.threadlemask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    lower than or equal to the current thread id are set.

    If a compute attribute binding matches "invocation.threadgtmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    greater than the current thread id are set.

    If a compute attribute binding matches "invocation.threadgemask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    greater than or equal to the current thread id are set.

    If a compute attribute binding matches "invocation.warpid", the "x"
    component is filled with the warp id of the current thread.  The warp id is
    an unsigned integer whose range is hardware dependent.

    If a compute attribute binding matches "invocation.smid", the "x" component
    is filled with the SM id of the current thread.  The SM id is an unsigned
    integer whose range is hardware dependent.
947
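    As an illustration of how these bindings might be combined, a thread can
    find its rank among the threads that voted non-zero in a TGBALLOT result
    by counting the ballot bits selected by its "invocation.threadltmask"
    value.  This use case is not described by the extension itself; the C
    sketch below emulates it on the CPU, ASSUMING a 32-thread group, and the
    function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Portable population count: clears the lowest set bit until none remain. */
unsigned popcount32(uint32_t v)
{
    unsigned n = 0;
    while (v != 0) {
        v &= v - 1u;
        ++n;
    }
    return n;
}

/* Number of voting threads with an id lower than `thread_id`, that is, the
 * slot this thread would take in a densely packed output.  `ballot` stands
 * for one component of a TGBALLOT result. */
unsigned compacted_index(uint32_t ballot, unsigned thread_id)
{
    uint32_t lt_mask;

    assert(thread_id < 32);
    lt_mask = ((uint32_t)1 << thread_id) - 1u;  /* same value as threadltmask */
    return popcount32(ballot & lt_mask);
}
```

    With a ballot of 0x2D (threads 0, 2, 3 and 5 voting), those threads obtain
    the indices 0, 1, 2 and 3.
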
948    (Add to "Section 2.X.6, Program Options" of the NV_gpu_program4 extension,
949     as extended by NV_gpu_program5 and NV_compute_program5)
950
951
952    + Shader thread group (NV_shader_thread_group)
953
    If a program specifies the "NV_shader_thread_group" option, it may use the
    "invocation.threadid", "invocation.threadeqmask",
    "invocation.threadltmask", "invocation.threadlemask",
    "invocation.threadgtmask", "invocation.threadgemask", "invocation.warpid",
    "invocation.smid", "state.thread.warpsize", "state.thread.warpspersm" and
    "state.thread.smcount" bindings.  It may also use the "TGBALLOT"
    instruction.  If this option is not specified, a program will fail to
    compile if it uses that instruction or those bindings.
962
963
Errors
965
966    None.
967
New State
969
970    None.
971
New Implementation Dependent State
973
                                                             Minimum
    Get Value                         Type  Get Command       Value   Description           Sec.     Attrib
    --------------------------------  ----  ---------------  -------  --------------------- -------  ------
    WARP_SIZE_NV                       Z+   GetIntegerv        1      total number of       2.X.3.3  -
                                                                      threads in a warp.

    WARPS_PER_SM_NV                    Z+   GetIntegerv        1      maximum number of     2.X.3.3  -
                                                                      warps executing on
                                                                      an SM.

    SM_COUNT_NV                        Z+   GetIntegerv        1      number of SMs on the  2.X.3.3  -
                                                                      GPU.
986
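    The three values in this table can be multiplied to estimate an upper
    bound on the number of threads that can be resident on the GPU at one
    time.  The C sketch below shows the arithmetic only.  It ASSUMES the
    three arguments were obtained with GetIntegerv using WARP_SIZE_NV,
    WARPS_PER_SM_NV and SM_COUNT_NV; the function name is hypothetical, and
    the result is an estimate rather than a guarantee made by this extension.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on simultaneously resident threads:
 * threads per warp * warps per SM * SMs on the GPU.
 * Each value has a minimum of 1 according to the state table. */
uint64_t max_resident_threads(int warp_size, int warps_per_sm, int sm_count)
{
    assert(warp_size >= 1 && warps_per_sm >= 1 && sm_count >= 1);
    return (uint64_t)warp_size * (uint64_t)warps_per_sm * (uint64_t)sm_count;
}
```

    For example, hypothetical values of 32 threads per warp, 64 warps per SM
    and 16 SMs give 32 * 64 * 16 = 32768 threads.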
987
Issues
989
990    None
991
992
Revision History
994
995    Rev.    Date    Author    Changes
996    ----  --------  --------  -----------------------------------------
997     4     7/21/15  jbreton    Update the layout of threads within a quad for
998                               window and framebuffer object rendering.
999     3     2/14/14  jbreton    Rename the extension from NVX to NV.
1000     2      9/4/13  jbreton    Add helperThread attribute binding.
1001     1    12/19/12  jbreton    Internal revisions.
1002