
Name

    NV_shader_thread_group

Name Strings

    GL_NV_shader_thread_group

Contributors

    Jeannot Breton, NVIDIA
    Pat Brown, NVIDIA
    Eric Werness, NVIDIA
    Mark Kilgard, NVIDIA

Contact

    Jeannot Breton, NVIDIA Corporation (jbreton 'at' nvidia.com)

Status

    Shipping.

Version

    Last Modified Date:         7/21/2015
    NVIDIA Revision:            4

Number

    OpenGL Extension #447

Dependencies

    This extension is written against the OpenGL 4.3 (Compatibility Profile)
    Specification.

    This extension is written against version 4.30 (revision 07) of the OpenGL
    Shading Language Specification.

    OpenGL 4.3 and GLSL 4.3 are required.

    This extension interacts with NV_gpu_program5.

    This extension interacts with NV_compute_program5.

    This extension interacts with NV_tessellation_program5.

Overview

    Implementations of the OpenGL Shading Language may, but are not required
    to, run multiple shader threads for a single stage as a SIMD thread group,
    where individual execution threads are assigned to thread groups in an
    undefined, implementation-dependent order.  This extension provides a set
    of new features to the OpenGL Shading Language to query thread states and
    to share data between fragments within a 2x2 pixel quad.

    More specifically, this extension adds the following functionality:

    *   New uniform variables and tokens to query the number of threads in a
        warp, the maximum number of warps running on an SM, and the number
        of SMs on the GPU.

    *   New shader inputs to query the thread id, the warp id and the SM id.

    *   A new shader input to query whether a fragment shader thread is a
        helper thread.

    *   New shader built-in functions to query the state of a Boolean condition
        over all threads in a thread group.

    *   New shader built-in functions to query which threads are active within
        a thread group.

    *   New fragment shader built-in functions to share data between fragments
        within a 2x2 pixel quad.

    Shaders using the new functionality provided by this extension should
    enable it via the construct

        #extension GL_NV_shader_thread_group : require     (or enable)

    This extension also specifies some modifications to the program assembly
    language to support the thread state query and thread data sharing
    functionalities.

    Note that in this extension specification, "warp" and "thread group" have
    the same meaning.  A warp is a group of threads that are executed in
    lockstep.  Each thread in a warp executes the same instruction of a
    program, but on different data.

New Procedures and Functions

    None


New Tokens

    Accepted by the <pname> parameter of GetBooleanv, GetIntegerv,
    GetFloatv, and GetDoublev:

        WARP_SIZE_NV                                    0x9339
        WARPS_PER_SM_NV                                 0x933A
        SM_COUNT_NV                                     0x933B
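
    As a non-normative illustration, the following C sketch reads the three
    values on the host.  The GL_-prefixed macro names follow the usual header
    convention for the tokens above.  The GetIntegervFn typedef is assumed to
    be compatible with glGetIntegerv on the target platform, and
    fake_get_integerv, together with the values it returns, is a made-up
    stand-in for a driver so that the sketch can run without a GL context.

```c
#include <assert.h>

/* Token values from the New Tokens table of this extension. */
#define GL_WARP_SIZE_NV     0x9339
#define GL_WARPS_PER_SM_NV  0x933A
#define GL_SM_COUNT_NV      0x933B

/* Assumption: same shape as glGetIntegerv(GLenum pname, GLint *data). */
typedef void (*GetIntegervFn)(unsigned int pname, int *data);

/* Hypothetical container for the three queried values. */
typedef struct {
    int warp_size;
    int warps_per_sm;
    int sm_count;
} ThreadGroupInfo;

/* Issues one integer query per token and collects the results. */
static ThreadGroupInfo query_thread_group_info(GetIntegervFn get_integerv)
{
    ThreadGroupInfo info = {0, 0, 0};
    get_integerv(GL_WARP_SIZE_NV, &info.warp_size);
    get_integerv(GL_WARPS_PER_SM_NV, &info.warps_per_sm);
    get_integerv(GL_SM_COUNT_NV, &info.sm_count);
    return info;
}

/* Made-up driver stand-in; the returned numbers are illustrative only. */
static void fake_get_integerv(unsigned int pname, int *data)
{
    switch (pname) {
    case GL_WARP_SIZE_NV:    *data = 32; break;
    case GL_WARPS_PER_SM_NV: *data = 64; break;
    case GL_SM_COUNT_NV:     *data = 16; break;
    default: break; /* unknown token: leave *data untouched */
    }
}
```

    In a real application, glGetIntegerv (or a pointer obtained from the
    platform's loader) would be passed in place of fake_get_integerv.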


Modifications to The OpenGL Shading Language Specification, Version 4.30
(Revision 07)

    Including the following line in a shader can be used to control the
    language features described in this extension:

      #extension GL_NV_shader_thread_group : <behavior>

    where <behavior> is as specified in section 3.3.

    New preprocessor #defines are added to the OpenGL Shading Language:

      #define GL_NV_shader_thread_group         1

    Modify Section 7.1, Built-In Language Variables, p. 110

    (Add to the list of built-in variables for the compute, vertex, geometry,
     tessellation control, tessellation evaluation and fragment languages)

        in uint  gl_ThreadInWarpNV;
        in uint  gl_ThreadEqMaskNV;
        in uint  gl_ThreadGeMaskNV;
        in uint  gl_ThreadGtMaskNV;
        in uint  gl_ThreadLeMaskNV;
        in uint  gl_ThreadLtMaskNV;
        in uint  gl_WarpIDNV;
        in uint  gl_SMIDNV;

    (Add to the list of built-in variables for the fragment language)

        in bool  gl_HelperThreadNV;

    (Add the following paragraphs to the end of this section)

    The variable gl_ThreadInWarpNV holds the id of the thread within the
    thread group (or warp).  This variable is in the range 0 to
    gl_WarpSizeNV-1, where gl_WarpSizeNV is the total number of threads in a
    warp.

    The variable gl_ThreadEqMaskNV is a bitfield in which the bit equal to the
    current thread id is set.  The variable gl_ThreadGeMaskNV is a bitfield in
    which bits greater than or equal to the current thread id are set.  The
    variable gl_ThreadGtMaskNV is a bitfield in which bits greater than the
    current thread id are set.  The variable gl_ThreadLeMaskNV is a bitfield
    in which bits lower than or equal to the current thread id are set.  The
    variable gl_ThreadLtMaskNV is a bitfield in which bits lower than the
    current thread id are set.

    The values of gl_ThreadEqMaskNV, gl_ThreadGeMaskNV, gl_ThreadGtMaskNV,
    gl_ThreadLeMaskNV and gl_ThreadLtMaskNV are derived from the value of
    gl_ThreadInWarpNV using simple bit-shift arithmetic; they do not take
    into account the value of the thread group active mask.  For example, if
    the application wants a bitfield in which bits lower than or equal to the
    current thread id are set only for active threads, the value of
    gl_ThreadLeMaskNV will need to be ANDed with the thread group active mask.
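
    The bit-shift arithmetic described above can be sketched on the CPU as
    follows.  This is a non-normative C model, assuming 32-bit masks and a
    thread id below 32; the names ThreadMasks, thread_masks and
    active_le_mask are hypothetical, and an implementation may derive the
    values differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical CPU-side model of the five mask built-ins. */
typedef struct {
    uint32_t eq;  /* models gl_ThreadEqMaskNV */
    uint32_t ge;  /* models gl_ThreadGeMaskNV */
    uint32_t gt;  /* models gl_ThreadGtMaskNV */
    uint32_t le;  /* models gl_ThreadLeMaskNV */
    uint32_t lt;  /* models gl_ThreadLtMaskNV */
} ThreadMasks;

/* Derives all five masks from the thread id alone (no active mask). */
static ThreadMasks thread_masks(uint32_t thread_in_warp)
{
    ThreadMasks m;
    assert(thread_in_warp < 32u);  /* shifting by 32 or more is undefined in C */
    m.eq = UINT32_C(1) << thread_in_warp;
    m.lt = m.eq - 1u;              /* all bits below the current thread */
    m.le = m.lt | m.eq;
    m.gt = (uint32_t)~m.le;
    m.ge = (uint32_t)~m.lt;
    return m;
}

/* The "ANDed with the active mask" example from the text above. */
static uint32_t active_le_mask(uint32_t thread_in_warp, uint32_t active_mask)
{
    return thread_masks(thread_in_warp).le & active_mask;
}
```

    For thread id 5, this model gives eq = 0x00000020, lt = 0x0000001F,
    le = 0x0000003F, gt = 0xFFFFFFC0 and ge = 0xFFFFFFE0.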

    The variable gl_WarpIDNV holds the warp id of the executing thread.  This
    variable is in the range 0 to gl_WarpsPerSMNV-1, where gl_WarpsPerSMNV is
    the maximum number of warps executing on an SM.

    The variable gl_SMIDNV holds the SM id of the executing thread.  This
    variable is in the range 0 to gl_SMCountNV-1, where gl_SMCountNV is the
    number of SMs on the GPU.

    The variable gl_HelperThreadNV specifies whether the current thread is a
    helper thread.  In implementations supporting this extension, fragment
    shader invocations may be arranged in SIMD thread groups of 2x2 fragments
    called "quads".  When a fragment shader instruction is executed on a quad,
    it's possible that some fragments within the quad will execute the
    instruction even if they are not covered by the primitive.  Those threads
    are called helper threads.  Their outputs will be discarded and they will
    not execute global store functions, but the intermediate values they
    compute can still be used by thread group sharing functions or by fragment
    derivative functions like dFdx and dFdy.


    Modify Section 7.4, Built-In Uniform State, p. 125

    (Add to the list of built-in uniform variable declarations)

        uniform uint  gl_WarpSizeNV;
        uniform uint  gl_WarpsPerSMNV;
        uniform uint  gl_SMCountNV;

    (Add this paragraph at the end of this section)

    The variable gl_WarpSizeNV is the total number of threads in a warp.  The
    variable gl_WarpsPerSMNV is the maximum number of warps executing on an
    SM.  The variable gl_SMCountNV is the number of SMs on the GPU.
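
    As a non-normative illustration of how these three values combine with
    the ranges of gl_WarpIDNV and gl_SMIDNV, the C sketch below computes an
    upper bound on the number of simultaneously resident threads and a flat
    index for a (SM id, warp id) pair.  The type and function names are
    hypothetical, and the flat index is only one possible convention (for
    example, for addressing a per-warp statistics buffer).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of gl_WarpSizeNV, gl_WarpsPerSMNV and gl_SMCountNV. */
typedef struct {
    uint32_t warp_size;
    uint32_t warps_per_sm;
    uint32_t sm_count;
} ThreadGroupLimits;

/* Upper bound on threads resident at once: threads/warp * warps/SM * SMs. */
static uint64_t max_resident_threads(ThreadGroupLimits l)
{
    return (uint64_t)l.warp_size * l.warps_per_sm * l.sm_count;
}

/* Maps (SM id, warp id) to a unique slot in 0 .. sm_count*warps_per_sm-1. */
static uint32_t flat_warp_slot(ThreadGroupLimits l,
                               uint32_t sm_id, uint32_t warp_id)
{
    assert(sm_id < l.sm_count);        /* range of gl_SMIDNV   */
    assert(warp_id < l.warps_per_sm);  /* range of gl_WarpIDNV */
    return sm_id * l.warps_per_sm + warp_id;
}
```

    With illustrative limits of 32 threads per warp, 64 warps per SM and 16
    SMs, the bound is 32768 threads and the pair (SM 3, warp 5) maps to slot
    197.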


    Modify Section 8.3, Common Functions, p. 133

    (add a function to query which threads are active within a thread group)

    Syntax:

      uint  activeThreadsNV(void)

    In the value returned by activeThreadsNV(), bit <N> is set to 1 if the
    corresponding thread in the SIMD thread group is executing the call to
    activeThreadsNV() and 0 otherwise.  A bit in the return value may be set
    to zero due to conditional flow control (e.g., returning from a function,
    executing the "else" part of an "if" statement) or because the SIMD thread
    group was dispatched without a full collection of threads.

    (add a function to query the state of a Boolean condition over all the
    threads in a thread group)

    Syntax:

      uint  ballotThreadNV(bool value)

    The function ballotThreadNV() computes a 32-bit bitfield.  It looks at the
    condition <value> for each active thread of a thread group and sets to 1
    each bit for which the condition in the corresponding thread is true.  Bits
    for threads with a false condition are set to 0.  Bits for inactive threads
    are also set to 0.  The active thread mask can be queried by calling the
    function activeThreadsNV().
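
    The ballot rule above can be modeled on the CPU as follows.  This is a
    non-normative C sketch, assuming a 32-thread group; ballot_model and
    rank_in_ballot are hypothetical names.  The second function shows one
    common use of a ballot, counting the voting threads with a lower id than
    the current one, which corresponds to a population count of the ballot
    ANDed with the "lower than" thread mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WARP_SIZE 32  /* assumed here; a shader would read gl_WarpSizeNV */

/* Bit t of the result is 1 only if thread t is active and its value is true. */
static uint32_t ballot_model(const bool value[WARP_SIZE], uint32_t active_mask)
{
    uint32_t result = 0;
    for (unsigned t = 0; t < WARP_SIZE; ++t) {
        bool active = ((active_mask >> t) & 1u) != 0u;
        if (active && value[t])
            result |= UINT32_C(1) << t;
    }
    return result;
}

/* Number of set ballot bits strictly below thread_in_warp. */
static unsigned rank_in_ballot(uint32_t ballot, unsigned thread_in_warp)
{
    assert(thread_in_warp < 32u);
    uint32_t lower = ballot & ((UINT32_C(1) << thread_in_warp) - 1u);
    unsigned count = 0;
    while (lower != 0u) {
        lower &= lower - 1u;  /* clear the lowest set bit */
        ++count;
    }
    return count;
}
```

    If threads 0, 3 and 4 vote true, the model returns 0x19 when all threads
    are active and 0x09 when only threads 0 to 3 are active.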

    (add a function to share data between fragments in a quad)

    Syntax:

        float  quadSwizzle0NV(float swizzledValue, [float unswizzledValue])
        vec2   quadSwizzle0NV(vec2  swizzledValue, [vec2  unswizzledValue])
        vec3   quadSwizzle0NV(vec3  swizzledValue, [vec3  unswizzledValue])
        vec4   quadSwizzle0NV(vec4  swizzledValue, [vec4  unswizzledValue])

        float  quadSwizzle1NV(float swizzledValue, [float unswizzledValue])
        vec2   quadSwizzle1NV(vec2  swizzledValue, [vec2  unswizzledValue])
        vec3   quadSwizzle1NV(vec3  swizzledValue, [vec3  unswizzledValue])
        vec4   quadSwizzle1NV(vec4  swizzledValue, [vec4  unswizzledValue])

        float  quadSwizzle2NV(float swizzledValue, [float unswizzledValue])
        vec2   quadSwizzle2NV(vec2  swizzledValue, [vec2  unswizzledValue])
        vec3   quadSwizzle2NV(vec3  swizzledValue, [vec3  unswizzledValue])
        vec4   quadSwizzle2NV(vec4  swizzledValue, [vec4  unswizzledValue])

        float  quadSwizzle3NV(float swizzledValue, [float unswizzledValue])
        vec2   quadSwizzle3NV(vec2  swizzledValue, [vec2  unswizzledValue])
        vec3   quadSwizzle3NV(vec3  swizzledValue, [vec3  unswizzledValue])
        vec4   quadSwizzle3NV(vec4  swizzledValue, [vec4  unswizzledValue])

        float  quadSwizzleXNV(float swizzledValue, [float unswizzledValue])
        vec2   quadSwizzleXNV(vec2  swizzledValue, [vec2  unswizzledValue])
        vec3   quadSwizzleXNV(vec3  swizzledValue, [vec3  unswizzledValue])
        vec4   quadSwizzleXNV(vec4  swizzledValue, [vec4  unswizzledValue])

        float  quadSwizzleYNV(float swizzledValue, [float unswizzledValue])
        vec2   quadSwizzleYNV(vec2  swizzledValue, [vec2  unswizzledValue])
        vec3   quadSwizzleYNV(vec3  swizzledValue, [vec3  unswizzledValue])
        vec4   quadSwizzleYNV(vec4  swizzledValue, [vec4  unswizzledValue])

    In implementations supporting this extension, if a primitive covers a
    fragment at (x,y), its fragment shader invocation will be arranged in a
    SIMD thread group with fragment shader invocations corresponding to three
    neighboring pixels.  These four invocations are arranged in a 2x2 grid,
    called a "quad".  If the neighbors of a fragment are not covered by the
    primitive, fragment shader invocations will still be generated.  The
    implementation may compute differences between values in these threads to
    estimate derivatives for dFdx(), dFdy(), and for texture lookups with
    automatic LOD calculations.

    Fragments may have different locations in the quads based on the type of
    render target.

    When rendering to a window, fragments within a quad follow this pattern:

        ---------------------------------------------------
        | gl_ThreadInWarpNV 4N+0 | gl_ThreadInWarpNV 4N+1 |
        |     pixel (X+0,Y+1)    |     pixel (X+1,Y+1)    |
        ---------------------------------------------------
        | gl_ThreadInWarpNV 4N+2 | gl_ThreadInWarpNV 4N+3 |
        |     pixel (X+0,Y+0)    |     pixel (X+1,Y+0)    |
        ---------------------------------------------------


    When rendering to a framebuffer object, fragments within a quad follow this
    pattern:

        ---------------------------------------------------
        | gl_ThreadInWarpNV 4N+2 | gl_ThreadInWarpNV 4N+3 |
        |     pixel (X+0,Y+1)    |     pixel (X+1,Y+1)    |
        ---------------------------------------------------
        | gl_ThreadInWarpNV 4N+0 | gl_ThreadInWarpNV 4N+1 |
        |     pixel (X+0,Y+0)    |     pixel (X+1,Y+0)    |
        ---------------------------------------------------
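
    The two layouts above can be summarized by a small mapping from
    gl_ThreadInWarpNV to a pixel offset within the quad.  The C sketch below
    is non-normative; PixelOffset, quad_pixel_offset and the to_window flag
    are hypothetical names, and the returned offsets are relative to the
    pixel (X,Y) shown in the diagrams.

```c
#include <assert.h>
#include <stdbool.h>

/* Offset of a fragment from pixel (X,Y), the quad's lower-left pixel. */
typedef struct {
    int dx;
    int dy;
} PixelOffset;

/* to_window: true when rendering to a window, false for a framebuffer object. */
static PixelOffset quad_pixel_offset(unsigned thread_in_warp, bool to_window)
{
    unsigned lane = thread_in_warp & 3u;  /* position 4N+lane within the quad */
    int row = (int)(lane >> 1);           /* 0 for lanes 0-1, 1 for lanes 2-3 */
    PixelOffset o;
    o.dx = (int)(lane & 1u);              /* odd lanes sit at X+1 */
    /* Window: lanes 0-1 are the Y+1 row.  FBO: lanes 0-1 are the Y+0 row. */
    o.dy = to_window ? 1 - row : row;
    return o;
}
```

    For example, thread 4N+0 maps to offset (0,1) when rendering to a window
    and to (0,0) when rendering to a framebuffer object.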

    There are six quadSwizzle functions that allow fragments within a quad to
    exchange data.  Each function reads a floating point operand
    <swizzledValue>, which can come from any fragment in the quad.  An
    optional floating point operand <unswizzledValue>, which comes from the
    current fragment, can be added to <swizzledValue>.  The only difference
    between the quadSwizzle functions is the location within the 2x2 pixel
    quad from which they read the <swizzledValue> operand.

    quadSwizzle0NV reads the <swizzledValue> operand from fragment 0:

        result[thread N] = swizzledValue[thread 0] + unswizzledValue[thread N]


    quadSwizzle1NV reads the <swizzledValue> operand from fragment 1:

        result[thread N] = swizzledValue[thread 1] + unswizzledValue[thread N]


    quadSwizzle2NV reads the <swizzledValue> operand from fragment 2:

        result[thread N] = swizzledValue[thread 2] + unswizzledValue[thread N]


    quadSwizzle3NV reads the <swizzledValue> operand from fragment 3:

        result[thread N] = swizzledValue[thread 3] + unswizzledValue[thread N]


    quadSwizzleXNV will read the <swizzledValue> operand for each fragment
    from its neighbor in X:

        result[thread 0] = swizzledValue[thread 1] + unswizzledValue[thread 0]
        result[thread 1] = swizzledValue[thread 0] + unswizzledValue[thread 1]
        result[thread 2] = swizzledValue[thread 3] + unswizzledValue[thread 2]
        result[thread 3] = swizzledValue[thread 2] + unswizzledValue[thread 3]


    quadSwizzleYNV will read the <swizzledValue> operand for each fragment
    from its neighbor in Y:

        result[thread 0] = swizzledValue[thread 2] + unswizzledValue[thread 0]
        result[thread 1] = swizzledValue[thread 3] + unswizzledValue[thread 1]
        result[thread 2] = swizzledValue[thread 0] + unswizzledValue[thread 2]
        result[thread 3] = swizzledValue[thread 1] + unswizzledValue[thread 3]


    If any thread in a 2x2 pixel quad is inactive, the quad is divergent.  In
    this case quadSwizzle will return 0 for all fragments in the quad.
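
    The six result tables and the divergence rule above can be captured in
    one CPU-side model.  The C sketch below is non-normative; QuadSwizzle and
    quad_swizzle_model are hypothetical names, the four-bit active_mask
    parameter is an assumed way of describing which quad threads are active,
    and passing zeros for unswizzled is assumed to model a call that omits
    the optional operand.

```c
#include <assert.h>

/* Selects which quadSwizzle variant to model. */
typedef enum {
    QSWZ_0 = 0,  /* read from fragment 0 */
    QSWZ_1 = 1,  /* read from fragment 1 */
    QSWZ_2 = 2,  /* read from fragment 2 */
    QSWZ_3 = 3,  /* read from fragment 3 */
    QSWZ_X = 4,  /* read from the neighbor in X */
    QSWZ_Y = 5   /* read from the neighbor in Y */
} QuadSwizzle;

/* Computes result[n] for the four threads of one quad. */
static void quad_swizzle_model(QuadSwizzle mode,
                               const float swizzled[4],
                               const float unswizzled[4],
                               unsigned active_mask,  /* bit n: thread n active */
                               float result[4])
{
    for (unsigned n = 0; n < 4; ++n) {
        unsigned src;
        switch (mode) {
        case QSWZ_X: src = n ^ 1u; break;            /* 0<->1, 2<->3 */
        case QSWZ_Y: src = n ^ 2u; break;            /* 0<->2, 1<->3 */
        default:     src = (unsigned)mode; break;    /* fixed fragment 0..3 */
        }
        if ((active_mask & 0xFu) != 0xFu)
            result[n] = 0.0f;                        /* divergent quad */
        else
            result[n] = swizzled[src] + unswizzled[n];
    }
}
```

    With swizzled = {1, 2, 4, 8} and unswizzled = {10, 20, 30, 40}, the X
    variant yields {12, 21, 38, 44} and the Y variant yields {14, 28, 31, 42}
    when all four threads are active.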


Dependencies on NV_gpu_program5

    If NV_gpu_program5 is supported and "OPTION NV_shader_thread_group" is
    specified in an assembly program, the following edits are made to extend
    the assembly programming model documented in the NV_gpu_program4 extension
    and extended by NV_gpu_program5.

    If NV_gpu_program5 is not supported, or if "OPTION NV_shader_thread_group"
    is not specified in an assembly program, the contents of this dependencies
    section should be ignored.

    Modify Section 2.X.2, Program Grammar

    (add the following rules to the NV_gpu_program4 and
     NV_gpu_program5 base grammars)

    <VECTORop>              ::= "TGBALLOT"

    <stateSingleItem>       ::= "state" "." <stateThreadItem>

    <stateThreadItem>       ::= "thread" "." <stateThreadProperty>

    <stateThreadProperty>   ::= "warpsize"
                              | "warpspersm"
                              | "smcount"

    (add/change the following rules to the NV_fragment_program4 and
     NV_gpu_program5 base grammars)

    <VECTORop>              ::= "QSWZ0"
                              | "QSWZ1"
                              | "QSWZ2"
                              | "QSWZ3"
                              | "QSWZX"
                              | "QSWZY"

    <attribBasic>           ::= <fragPrefix> "threadid"
                              | <fragPrefix> "threadeqmask"
                              | <fragPrefix> "threadltmask"
                              | <fragPrefix> "threadlemask"
                              | <fragPrefix> "threadgtmask"
                              | <fragPrefix> "threadgemask"
                              | <fragPrefix> "warpid"
                              | <fragPrefix> "smid"
                              | <fragPrefix> "helperthread"

    (add/change the following rules to the NV_vertex_program4 and
     NV_gpu_program5 base grammars)

    <attribBasic>           ::= <vtxPrefix> "threadid"
                              | <vtxPrefix> "threadeqmask"
                              | <vtxPrefix> "threadltmask"
                              | <vtxPrefix> "threadlemask"
                              | <vtxPrefix> "threadgtmask"
                              | <vtxPrefix> "threadgemask"
                              | <vtxPrefix> "warpid"
                              | <vtxPrefix> "smid"

    (add/change the following rules to the NV_geometry_program4 and
     NV_gpu_program5 base grammars)

    <attribBasic>           ::= <primPrefix> "threadid"
                              | <primPrefix> "threadeqmask"
                              | <primPrefix> "threadltmask"
                              | <primPrefix> "threadlemask"
                              | <primPrefix> "threadgtmask"
                              | <primPrefix> "threadgemask"
                              | <primPrefix> "warpid"
                              | <primPrefix> "smid"

    Modify Section 2.X.3.2 of the NV_gpu_program4 specification, Program
    Attribute Variables.

    (Add the table entries and relevant text describing the fragment program
     input variables used to query thread states.)

      Fragment Attribute Binding  Components  Underlying State
      --------------------------  ----------  ----------------------------
      ...
      fragment.threadid           (id,-,-,-)  id of the current thread
      fragment.threadeqmask       (m,-,-,-)   mask with the current thread
      fragment.threadltmask       (m,-,-,-)   mask with lower thread
      fragment.threadlemask       (m,-,-,-)   mask with lower or equal thread
      fragment.threadgtmask       (m,-,-,-)   mask with greater thread
      fragment.threadgemask       (m,-,-,-)   mask with greater or equal thread
      fragment.warpid             (id,-,-,-)  warp id of the current thread
      fragment.smid               (id,-,-,-)  SM id of the current thread
      fragment.helperthread       (k,-,-,-)   current thread is a helper thread
      ...

    If a fragment attribute binding matches "fragment.threadid", the "x"
    component is filled with the thread id of the current thread.  The thread
    id is an unsigned integer in the range 0 to 31.

    If a fragment attribute binding matches "fragment.threadeqmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which the
    bit equal to the current thread id is set.

    If a fragment attribute binding matches "fragment.threadltmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    lower than the current thread id are set.

    If a fragment attribute binding matches "fragment.threadlemask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    lower than or equal to the current thread id are set.

    If a fragment attribute binding matches "fragment.threadgtmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    greater than the current thread id are set.

    If a fragment attribute binding matches "fragment.threadgemask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    greater than or equal to the current thread id are set.

    If a fragment attribute binding matches "fragment.warpid", the "x"
    component is filled with the warp id of the current thread.  The warp id is
    an unsigned integer; the range of this value is hardware dependent.

    If a fragment attribute binding matches "fragment.smid", the "x" component
    is filled with the SM id of the current thread.  The SM id is an unsigned
    integer; the range of this value is hardware dependent.

    If a fragment attribute binding matches "fragment.helperthread", the "x"
    component is an integer value equal to -1 when the current thread is a
    helper thread and 0 otherwise.  In implementations supporting this
    extension, fragment program invocations may be arranged in SIMD thread
    groups of 2x2 fragments called "quads".  When a fragment program
    instruction is executed on a quad, it's possible that some fragments
    within the quad will execute the instruction even if they are not covered
    by the primitive.  Those threads are called helper threads.  Their outputs
    will be discarded and they will not execute global store instructions, but
    the intermediate values they compute can still be used by thread group
    sharing instructions or by fragment derivative instructions like DDX and
    DDY.

    (Add the table entries and relevant text describing the vertex program
     attribute variables used to query thread states.)

      Vertex Attribute Binding  Components  Underlying State
      ------------------------  ----------  ----------------------------
      ...
      vertex.threadid           (id,-,-,-)  id of the current thread
      vertex.threadeqmask       (m,-,-,-)   mask with the current thread
      vertex.threadltmask       (m,-,-,-)   mask with lower thread
      vertex.threadlemask       (m,-,-,-)   mask with lower or equal thread
      vertex.threadgtmask       (m,-,-,-)   mask with greater thread
      vertex.threadgemask       (m,-,-,-)   mask with greater or equal thread
      vertex.warpid             (id,-,-,-)  warp id of the current thread
      vertex.smid               (id,-,-,-)  SM id of the current thread
      ...

    If a vertex attribute binding matches "vertex.threadid", the "x" component
    is filled with the thread id of the current thread.  The thread id is an
    unsigned integer in the range 0 to 31.

    If a vertex attribute binding matches "vertex.threadeqmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which the
    bit equal to the current thread id is set.

    If a vertex attribute binding matches "vertex.threadltmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    lower than the current thread id are set.

    If a vertex attribute binding matches "vertex.threadlemask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    lower than or equal to the current thread id are set.

    If a vertex attribute binding matches "vertex.threadgtmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    greater than the current thread id are set.

    If a vertex attribute binding matches "vertex.threadgemask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    greater than or equal to the current thread id are set.

    If a vertex attribute binding matches "vertex.warpid", the "x" component is
    filled with the warp id of the current thread.  The warp id is an unsigned
    integer; the range of this value is hardware dependent.

    If a vertex attribute binding matches "vertex.smid", the "x" component
    is filled with the SM id of the current thread.  The SM id is an unsigned
    integer; the range of this value is hardware dependent.


    (Add the table entries and relevant text describing the geometry program
     attribute variables used to query thread states.)

      Geometry Attribute Binding  Components  Underlying State
      --------------------------  ----------  ----------------------------
      ...
      primitive.threadid          (id,-,-,-)  id of the current thread
      primitive.threadeqmask      (m,-,-,-)   mask with the current thread
      primitive.threadltmask      (m,-,-,-)   mask with lower thread
      primitive.threadlemask      (m,-,-,-)   mask with lower or equal thread
      primitive.threadgtmask      (m,-,-,-)   mask with greater thread
      primitive.threadgemask      (m,-,-,-)   mask with greater or equal thread
      primitive.warpid            (id,-,-,-)  warp id of the current thread
      primitive.smid              (id,-,-,-)  SM id of the current thread
      ...

    If a geometry attribute binding matches "primitive.threadid", the "x"
    component is filled with the thread id of the current thread.  The thread
    id is an unsigned integer in the range 0 to 31.

    If a geometry attribute binding matches "primitive.threadeqmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which the
    bit equal to the current thread id is set.

    If a geometry attribute binding matches "primitive.threadltmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    lower than the current thread id are set.

    If a geometry attribute binding matches "primitive.threadlemask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    lower than or equal to the current thread id are set.

    If a geometry attribute binding matches "primitive.threadgtmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    greater than the current thread id are set.

    If a geometry attribute binding matches "primitive.threadgemask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    greater than or equal to the current thread id are set.

    If a geometry attribute binding matches "primitive.warpid", the "x"
    component is filled with the warp id of the current thread.  The warp id is
    an unsigned integer; the range of this value is hardware dependent.

    If a geometry attribute binding matches "primitive.smid", the "x" component
    is filled with the SM id of the current thread.  The SM id is an unsigned
    integer; the range of this value is hardware dependent.


    (add the following subsection to section 2.X.3.3, Parameters)

    Thread Group Property Bindings

      Binding                        Components  Underlying State
      -----------------------------  ----------  ----------------------------
      state.thread.warpsize          (x,-,-,-)   total number of threads in a
                                                 warp
      state.thread.warpspersm        (x,-,-,-)   maximum number of warps
                                                 executing on an SM
      state.thread.smcount           (x,-,-,-)   number of SMs on the GPU

    If a program parameter binding matches "state.thread.warpsize", the "x"
    component of the program parameter variable is filled with an integer value
    indicating the total number of threads in a warp.  The "y", "z", and "w"
    components are undefined.

    If a program parameter binding matches "state.thread.warpspersm", the "x"
    component of the program parameter variable is filled with an integer value
    indicating the maximum number of warps executing on an SM.  The "y", "z",
    and "w" components are undefined.

    If a program parameter binding matches "state.thread.smcount", the "x"
    component of the program parameter variable is filled with an integer value
    indicating the number of SMs on the GPU.  The "y", "z", and "w" components
    are undefined.


    Modify Section 2.X.4, Program Execution Environment

    (Add the table entries and relevant text describing the program
     instruction to query thread conditions.)

      Instr-      Modifiers
      uction   V  F I C S H D  Out Inputs    Description
      -------  -- - - - - - -  --- --------  --------------------------------
      ...
      TGBALLOT 50 X X X X - - F  vu  v        query a boolean in thread group
      ...


    (Add the table entries and relevant text describing the fragment program
     instructions to exchange data between threads.)

      Instr-      Modifiers
      uction   V  F I C S H D  Out Inputs    Description
      -------  -- - - - - - -  --- --------  --------------------------------
      ...
      QSWZ0    50 X - - - - - F  v   v,v      add fragment 0 in a quad
      QSWZ1    50 X - - - - - F  v   v,v      add fragment 1 in a quad
      QSWZ2    50 X - - - - - F  v   v,v      add fragment 2 in a quad
      QSWZ3    50 X - - - - - F  v   v,v      add fragment 3 in a quad
      QSWZX    50 X - - - - - F  v   v,v      add fragments horizontally
      QSWZY    50 X - - - - - F  v   v,v      add fragments vertically
      ...


    (Add to "Section 2.X.6, Program Options" of the NV_gpu_program4 extension,
     as extended by NV_gpu_program5)

    + Shader thread group (NV_shader_thread_group)

    If a fragment program specifies the "NV_shader_thread_group" option, it
    may use the "fragment.threadid", "fragment.threadeqmask",
    "fragment.threadltmask", "fragment.threadlemask", "fragment.threadgtmask",
    "fragment.threadgemask", "fragment.warpid", "fragment.smid",
    "fragment.helperthread", "state.thread.warpsize", "state.thread.warpspersm"
    and "state.thread.smcount" bindings.  It may also use the "TGBALLOT",
    "QSWZ0", "QSWZ1", "QSWZ2", "QSWZ3", "QSWZX" and "QSWZY" instructions.  If
    this option is not specified, a program will fail to compile if it uses
    those instructions or bindings.

    If a vertex program specifies the "NV_shader_thread_group" option, it may
    use the "vertex.threadid", "vertex.threadeqmask", "vertex.threadltmask",
    "vertex.threadlemask", "vertex.threadgtmask", "vertex.threadgemask",
    "vertex.warpid", "vertex.smid", "state.thread.warpsize",
    "state.thread.warpspersm" and "state.thread.smcount" bindings.  It may also
    use the "TGBALLOT" instruction.  If this option is not specified, a program
    will fail to compile if it uses those instructions or bindings.

    If a geometry program specifies the "NV_shader_thread_group" option, it
    may use the "primitive.threadid", "primitive.threadeqmask",
    "primitive.threadltmask", "primitive.threadlemask",
    "primitive.threadgtmask", "primitive.threadgemask", "primitive.warpid",
    "primitive.smid", "state.thread.warpsize", "state.thread.warpspersm" and
    "state.thread.smcount" bindings.  It may also use the "TGBALLOT"
    instruction.  If this option is not specified, a program will fail to
    compile if it uses those instructions or bindings.

    Section 2.X.8.Z, QSWZ0:  add fragment 0 data to all fragments in a quad

    The QSWZ0 instruction produces a floating point result by adding the
    first operand, a floating point value from fragment 0, to the second
    operand, another floating point value from the current fragment.

    quadSwizzle0NV is the GLSL function that implements the same functionality
    as the QSWZ0 assembly instruction.  Section 8.3 of the OpenGL Shading
    Language Specification describes the behavior of quadSwizzle0NV in more
    detail.  That description also applies to QSWZ0.


    Section 2.X.8.Z, QSWZ1:  add fragment 1 data to all fragments in a quad

    The QSWZ1 instruction produces a floating point result by adding the
    first operand, a floating point value from fragment 1, to the second
    operand, another floating point value from the current fragment.

    quadSwizzle1NV is the GLSL function that implements the same functionality
    as the QSWZ1 assembly instruction.  Section 8.3 of the OpenGL Shading
    Language Specification describes the behavior of quadSwizzle1NV in more
    detail.  That description also applies to QSWZ1.


    Section 2.X.8.Z, QSWZ2:  add fragment 2 data to all fragments in a quad

    The QSWZ2 instruction produces a floating point result by adding the
    first operand, a floating point value from fragment 2, to the second
    operand, another floating point value from the current fragment.

    quadSwizzle2NV is the GLSL function that implements the same functionality
    as the QSWZ2 assembly instruction.  Section 8.3 of the OpenGL Shading
    Language Specification describes the behavior of quadSwizzle2NV in more
    detail.  That description also applies to QSWZ2.


    Section 2.X.8.Z, QSWZ3:  add fragment 3 data to all fragments in a quad

    The QSWZ3 instruction produces a floating-point result by adding the
    first operand, a floating-point value read from fragment 3 of the quad,
    to the second operand, a floating-point value read from the current
    fragment.

    quadSwizzle3NV is the GLSL function that implements the same functionality
    as the QSWZ3 assembly instruction.  Section 8.3 of the OpenGL Shading
    Language Specification describes quadSwizzle3NV in more detail; that
    additional information also applies to QSWZ3.
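
    As an illustration of the indexed swizzle instructions QSWZ0 through
    QSWZ3, the following C sketch models a single quad on the CPU.  It is not
    part of the extension: the function name quad_swizzle_index and the
    array-based representation of the four fragments are assumptions made for
    this example, operands are treated as scalars, and all four fragments of
    the quad are assumed to be active.

```c
#include <assert.h>

#define QUAD_SIZE 4

/* CPU model of QSWZ0..QSWZ3 for one quad (illustrative only).
 * swizzled[i]   : first operand as held by fragment i
 * unswizzled[i] : second operand as held by fragment i
 * src           : fragment whose first operand is broadcast (0..3)
 * result[i]     : swizzled[src] + unswizzled[i] */
static void quad_swizzle_index(const float swizzled[QUAD_SIZE],
                               const float unswizzled[QUAD_SIZE],
                               int src,
                               float result[QUAD_SIZE])
{
    assert(src >= 0 && src < QUAD_SIZE);
    for (int i = 0; i < QUAD_SIZE; ++i)
        result[i] = swizzled[src] + unswizzled[i];
}
```

    Calling the sketch with src = 2 models QSWZ2: every fragment receives
    fragment 2's first operand added to its own second operand.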


    Section 2.X.8.Z, QSWZX:  add fragments in a quad horizontally

    The QSWZX instruction produces a floating-point result by adding the
    first operand, a floating-point value read from the current fragment's
    horizontal (X) neighbor in the quad, to the second operand, a
    floating-point value read from the current fragment.

    quadSwizzleXNV is the GLSL function that implements the same functionality
    as the QSWZX assembly instruction.  Section 8.3 of the OpenGL Shading
    Language Specification describes quadSwizzleXNV in more detail; that
    additional information also applies to QSWZX.


    Section 2.X.8.Z, QSWZY:  add fragments in a quad vertically

    The QSWZY instruction produces a floating-point result by adding the
    first operand, a floating-point value read from the current fragment's
    vertical (Y) neighbor in the quad, to the second operand, a
    floating-point value read from the current fragment.

    quadSwizzleYNV is the GLSL function that implements the same functionality
    as the QSWZY assembly instruction.  Section 8.3 of the OpenGL Shading
    Language Specification describes quadSwizzleYNV in more detail; that
    additional information also applies to QSWZY.
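
    The neighbor swizzles QSWZX and QSWZY can be modeled on the CPU with the
    following C sketch.  It assumes that bit 0 of a fragment's index within
    the quad selects its X position and bit 1 selects its Y position, so that
    the horizontal neighbor is (index ^ 1) and the vertical neighbor is
    (index ^ 2).  That indexing is an assumption of this example; the actual
    layout of threads within a quad is defined in the GLSL portion of this
    extension and should be checked there.  The names quad_axis and
    quad_swizzle_neighbor are hypothetical.

```c
#include <assert.h>

/* The enum value is the XOR distance to the neighbor along that axis,
 * under the assumed layout (bit 0 = X position, bit 1 = Y position). */
enum quad_axis { QUAD_AXIS_X = 1, QUAD_AXIS_Y = 2 };

/* CPU model of QSWZX / QSWZY for one quad (illustrative only).
 * result[i] = swizzled[neighbor of i along axis] + unswizzled[i] */
static void quad_swizzle_neighbor(const float swizzled[4],
                                  const float unswizzled[4],
                                  enum quad_axis axis,
                                  float result[4])
{
    assert(axis == QUAD_AXIS_X || axis == QUAD_AXIS_Y);
    for (int i = 0; i < 4; ++i)
        result[i] = swizzled[i ^ (int)axis] + unswizzled[i];
}
```

    Passing the same array for both operands yields, in each fragment, the
    sum of its own value and its neighbor's value, which is a common first
    step of a reduction across the quad.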


    Section 2.X.8.Z, TGBALLOT:  query a boolean condition over a thread group

    The TGBALLOT instruction produces a result vector by reading a vector
    operand for each active thread in the current thread group and comparing
    each component to zero.  Each component of the result vector contains an
    integer bitmask (described below) in which the bit corresponding to a
    thread is set if that component of the operand vector is non-zero for
    that thread, and is not set otherwise.

    Sometimes, when the instruction is in a conditional control flow block or
    when it is not possible to completely fill a thread group, only a subset
    of the threads in the thread group will be active and will execute the
    TGBALLOT instruction.  Each bit in the bitfield corresponding to an
    inactive thread will be set to 0.  It is possible to query the active
    thread mask by executing TGBALLOT with 1 as the first operand.

      tmp = VectorLoad(op0);
      result = { 0, 0, 0, 0 };
      for (all active threads) {
        if ([thread]tmp.x != 0) result.x |= 1 << thread;
        if ([thread]tmp.y != 0) result.y |= 1 << thread;
        if ([thread]tmp.z != 0) result.z |= 1 << thread;
        if ([thread]tmp.w != 0) result.w |= 1 << thread;
      }
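
    The TGBALLOT pseudo-code can also be written as a small CPU model in C,
    which makes the handling of inactive threads explicit.  This is a sketch
    only: the names tgballot, ivec4, uvec4 and TG_WARP_SIZE are hypothetical,
    the thread group size of 32 is assumed from the 0 to 31 thread id range,
    and the active-thread mask, which the hardware tracks implicitly, is
    passed in as a parameter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TG_WARP_SIZE 32   /* assumption: thread ids range from 0 to 31 */

struct ivec4 { int32_t  x, y, z, w; };
struct uvec4 { uint32_t x, y, z, w; };

/* operand[t]  : the vector operand as seen by thread t
 * active_mask : bit t is set when thread t executes the instruction
 * Returns one bitmask per component; bits of inactive threads stay 0. */
static struct uvec4 tgballot(const struct ivec4 operand[TG_WARP_SIZE],
                             uint32_t active_mask)
{
    struct uvec4 result = { 0, 0, 0, 0 };

    assert(operand != NULL);
    for (unsigned thread = 0; thread < TG_WARP_SIZE; ++thread) {
        uint32_t bit = UINT32_C(1) << thread;
        if ((active_mask & bit) == 0)
            continue;                      /* inactive thread: bit stays 0 */
        if (operand[thread].x != 0) result.x |= bit;
        if (operand[thread].y != 0) result.y |= bit;
        if (operand[thread].z != 0) result.z |= bit;
        if (operand[thread].w != 0) result.w |= bit;
    }
    return result;
}
```

    When every thread supplies 1 as the operand, each result component equals
    active_mask, which is the active-thread query described above.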

Dependencies on NV_tessellation_program5

    If NV_tessellation_program5 is supported and
    "OPTION NV_shader_thread_group" is specified in an assembly program, the
    following edits are made to extend the assembly programming model
    documented in the NV_gpu_program4 extension and extended by NV_gpu_program5
    and NV_tessellation_program5.

    If NV_tessellation_program5 is not supported, or if
    "OPTION NV_shader_thread_group" is not specified in an assembly program,
    the contents of this dependencies section should be ignored.


    Modify Section 2.X.2, Program Grammar

    (add/change the following rules to the NV_gpu_program5 base grammars for
     tessellation control programs)

    <attribBasic>           ::= <primPrefix> "threadid"
                              | <primPrefix> "threadeqmask"
                              | <primPrefix> "threadltmask"
                              | <primPrefix> "threadlemask"
                              | <primPrefix> "threadgtmask"
                              | <primPrefix> "threadgemask"
                              | <primPrefix> "warpid"
                              | <primPrefix> "smid"

    (add/change the following rules to the NV_gpu_program5 base grammars for
     tessellation evaluation programs)

    <attribBasic>           ::= <primPrefix> "threadid"
                              | <primPrefix> "threadeqmask"
                              | <primPrefix> "threadltmask"
                              | <primPrefix> "threadlemask"
                              | <primPrefix> "threadgtmask"
                              | <primPrefix> "threadgemask"
                              | <primPrefix> "warpid"
                              | <primPrefix> "smid"


    Modify Section 2.X.3.2 of the NV_tessellation_program5 specification,
    Program Attribute Variables.

    (Add the table entries and relevant text describing the tessellation
     control and evaluation program attribute variables used to query thread
     states.)


      Primitive Binding Suffix    Components  Underlying State
      --------------------------  ----------  ----------------------------
      ...
      primitive.threadid          (id,-,-,-)  id of the current thread
      primitive.threadeqmask      (m,-,-,-)   mask with the current thread
      primitive.threadltmask      (m,-,-,-)   mask with lower thread
      primitive.threadlemask      (m,-,-,-)   mask with lower or equal thread
      primitive.threadgtmask      (m,-,-,-)   mask with greater thread
      primitive.threadgemask      (m,-,-,-)   mask with greater or equal thread
      primitive.warpid            (id,-,-,-)  warp id of the current thread
      primitive.smid              (id,-,-,-)  SM id of the current thread
      ...

    If an attribute binding matches "primitive.threadid", the "x" component is
    filled with the thread id of the current thread.  The thread id is an
    unsigned integer in the range 0 to 31.

    If an attribute binding matches "primitive.threadeqmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which the
    bit whose index equals the current thread id is set.

    If an attribute binding matches "primitive.threadltmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which the
    bits at positions lower than the current thread id are set.

    If an attribute binding matches "primitive.threadlemask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which the
    bits at positions lower than or equal to the current thread id are set.

    If an attribute binding matches "primitive.threadgtmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which the
    bits at positions greater than the current thread id are set.

    If an attribute binding matches "primitive.threadgemask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which the
    bits at positions greater than or equal to the current thread id are set.

    If an attribute binding matches "primitive.warpid", the "x" component is
    filled with the warp id of the current thread.  The warp id is an unsigned
    integer whose range is hardware dependent.

    If an attribute binding matches "primitive.smid", the "x" component is
    filled with the SM id of the current thread.  The SM id is an unsigned
    integer whose range is hardware dependent.
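
    The five thread mask bindings are all functions of the thread id.  The
    following C sketch computes them on the CPU, reading "bits lower than the
    current thread id" as bit positions below the thread id.  The struct and
    function names are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdint.h>

struct thread_masks {
    uint32_t eq;   /* models threadeqmask */
    uint32_t lt;   /* models threadltmask */
    uint32_t le;   /* models threadlemask */
    uint32_t gt;   /* models threadgtmask */
    uint32_t ge;   /* models threadgemask */
};

/* Derives all five masks from a thread id in the range 0..31. */
static struct thread_masks thread_masks_for(unsigned thread_id)
{
    struct thread_masks m;

    assert(thread_id < 32);
    m.eq = UINT32_C(1) << thread_id;   /* only the bit of this thread      */
    m.lt = m.eq - 1;                   /* every bit below this thread      */
    m.le = m.lt | m.eq;                /* bits below, plus this thread     */
    m.gt = ~m.le;                      /* every bit above this thread      */
    m.ge = ~m.lt;                      /* this thread, plus bits above     */
    return m;
}
```

    For thread id 5 the sketch gives eq = 0x00000020, lt = 0x0000001F,
    le = 0x0000003F, gt = 0xFFFFFFC0 and ge = 0xFFFFFFE0.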

    (Add to "Section 2.X.6, Program Options" of the NV_gpu_program4 extension,
     as extended by NV_gpu_program5 and NV_tessellation_program5)

    + Shader thread group (NV_shader_thread_group)

    If a program specifies the "NV_shader_thread_group" option, it may use
    the "primitive.threadid", "primitive.threadeqmask",
    "primitive.threadltmask", "primitive.threadlemask",
    "primitive.threadgtmask", "primitive.threadgemask", "primitive.warpid",
    "primitive.smid", "state.thread.warpsize", "state.thread.warpspersm" and
    "state.thread.smcount" bindings.  It may also use the "TGBALLOT"
    instruction.  If this option is not specified, a program will fail to
    compile if it uses those bindings.


Dependencies on NV_compute_program5

    If NV_compute_program5 is supported and "OPTION NV_shader_thread_group" is
    specified in an assembly program, the following edits are made to extend
    the assembly programming model documented in the NV_gpu_program4 extension
    and extended by NV_gpu_program5 and NV_compute_program5.

    If NV_compute_program5 is not supported, or if
    "OPTION NV_shader_thread_group" is not specified in an assembly program,
    the contents of this dependencies section should be ignored.

    Modify Section 2.X.2, Program Grammar

    (add the following rules to the grammar)

    <attribBasic>           ::= "invocation" "." "threadid"
                              | "invocation" "." "threadeqmask"
                              | "invocation" "." "threadltmask"
                              | "invocation" "." "threadlemask"
                              | "invocation" "." "threadgtmask"
                              | "invocation" "." "threadgemask"
                              | "invocation" "." "warpid"
                              | "invocation" "." "smid"

    Modify Section 2.X.3.2 of the NV_compute_program5 specification, Program
    Attribute Variables.

    (Add the table entries and relevant text describing the compute program
     input variables used to query thread states.)

      Attribute Binding           Components  Underlying State
      --------------------------  ----------  ----------------------------
      ...
      invocation.threadid         (id,-,-,-)  id of the current thread
      invocation.threadeqmask     (m,-,-,-)   mask with the current thread
      invocation.threadltmask     (m,-,-,-)   mask with lower thread
      invocation.threadlemask     (m,-,-,-)   mask with lower or equal thread
      invocation.threadgtmask     (m,-,-,-)   mask with greater thread
      invocation.threadgemask     (m,-,-,-)   mask with greater or equal thread
      invocation.warpid           (id,-,-,-)  warp id of the current thread
      invocation.smid             (id,-,-,-)  SM id of the current thread
      ...

    If a compute attribute binding matches "invocation.threadid", the "x"
    component is filled with the thread id of the current thread.  The thread
    id is an unsigned integer in the range 0 to 31.

    If a compute attribute binding matches "invocation.threadeqmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which the
    bit whose index equals the current thread id is set.

    If a compute attribute binding matches "invocation.threadltmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which the
    bits at positions lower than the current thread id are set.

    If a compute attribute binding matches "invocation.threadlemask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which the
    bits at positions lower than or equal to the current thread id are set.

    If a compute attribute binding matches "invocation.threadgtmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which the
    bits at positions greater than the current thread id are set.

    If a compute attribute binding matches "invocation.threadgemask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which the
    bits at positions greater than or equal to the current thread id are set.

    If a compute attribute binding matches "invocation.warpid", the "x"
    component is filled with the warp id of the current thread.  The warp id
    is an unsigned integer whose range is hardware dependent.

    If a compute attribute binding matches "invocation.smid", the "x"
    component is filled with the SM id of the current thread.  The SM id is an
    unsigned integer whose range is hardware dependent.
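
    One way the mask bindings are typically combined with a TGBALLOT result
    is to give each thread its rank among the threads that voted true, for
    example to assign compacted output slots.  This usage pattern is not
    described by the extension itself; the C sketch below models it on the
    CPU, with hypothetical function names and a mask computed locally in
    place of the "invocation.threadltmask" binding.

```c
#include <assert.h>
#include <stdint.h>

/* Counts the set bits of v (clears the lowest set bit each iteration). */
static unsigned popcount32(uint32_t v)
{
    unsigned n = 0;
    while (v != 0) {
        v &= v - 1;
        ++n;
    }
    return n;
}

/* ballot    : one component of a TGBALLOT-style result
 * thread_id : id of the calling thread, 0..31
 * Returns how many lower-numbered threads have their ballot bit set. */
static unsigned rank_among_voters(uint32_t ballot, unsigned thread_id)
{
    uint32_t ltmask;

    assert(thread_id < 32);
    ltmask = (UINT32_C(1) << thread_id) - 1;   /* stands in for threadltmask */
    return popcount32(ballot & ltmask);
}
```

    With a ballot of 0x2D (threads 0, 2, 3 and 5 voted true), threads 0, 2, 3
    and 5 obtain ranks 0, 1, 2 and 3 respectively.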

    (Add to "Section 2.X.6, Program Options" of the NV_gpu_program4 extension,
     as extended by NV_gpu_program5 and NV_compute_program5)


    + Shader thread group (NV_shader_thread_group)

    If a program specifies the "NV_shader_thread_group" option, it may use the
    "invocation.threadid", "invocation.threadeqmask",
    "invocation.threadltmask", "invocation.threadlemask",
    "invocation.threadgtmask", "invocation.threadgemask", "invocation.warpid",
    "invocation.smid", "state.thread.warpsize", "state.thread.warpspersm" and
    "state.thread.smcount" bindings.  It may also use the "TGBALLOT"
    instruction.  If this option is not specified, a program will fail to
    compile if it uses those bindings.


Errors

    None.

New State

    None.

New Implementation Dependent State

                                                             Minimum
    Get Value                         Type  Get Command       Value   Description           Sec.   Attrib
    --------------------------------  ----  ---------------  -------  --------------------- ------ ------
    WARP_SIZE_NV                       Z+   GetIntegerv        1       total number of      2.X.3.3  -
                                                                       threads in a warp.

    WARPS_PER_SM_NV                    Z+   GetIntegerv        1       maximum number of    2.X.3.3  -
                                                                       warps executing on
                                                                       an SM.

    SM_COUNT_NV                        Z+   GetIntegerv        1       number of SMs on     2.X.3.3  -
                                                                       the GPU.
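
    As a usage illustration, the three values can be multiplied to obtain an
    upper bound on the number of threads resident on the GPU at one time.
    The C sketch below takes the query function as a parameter so that it can
    be exercised without a GL context; in an application the parameter would
    be glGetIntegerv or a thin wrapper around it.  The helper names are
    hypothetical, the numbers returned by example_get_integerv are arbitrary
    placeholders rather than values of any real GPU, and the token values are
    those believed to be published for this extension and should be verified
    against the glext.h header in use.

```c
#include <assert.h>

/* Token values as believed published for this extension; verify them
 * against your glext.h before relying on them. */
#ifndef GL_WARP_SIZE_NV
#define GL_WARP_SIZE_NV     0x9339
#define GL_WARPS_PER_SM_NV  0x933A
#define GL_SM_COUNT_NV      0x933B
#endif

/* Same parameter shape as glGetIntegerv (GLenum pname, GLint *data). */
typedef void (*get_integerv_fn)(unsigned int pname, int *data);

/* Upper bound on simultaneously resident threads:
 * threads per warp * max warps per SM * number of SMs. */
static long long max_resident_threads(get_integerv_fn get_integerv)
{
    int warp_size = 0, warps_per_sm = 0, sm_count = 0;

    assert(get_integerv != 0);
    get_integerv(GL_WARP_SIZE_NV, &warp_size);
    get_integerv(GL_WARPS_PER_SM_NV, &warps_per_sm);
    get_integerv(GL_SM_COUNT_NV, &sm_count);
    return (long long)warp_size * warps_per_sm * sm_count;
}

/* Stand-in for glGetIntegerv with placeholder values (not real data). */
static void example_get_integerv(unsigned int pname, int *data)
{
    switch (pname) {
    case GL_WARP_SIZE_NV:    *data = 32; break;
    case GL_WARPS_PER_SM_NV: *data = 64; break;
    case GL_SM_COUNT_NV:     *data = 16; break;
    default:                 *data = 0;  break;
    }
}
```

    With the placeholder values above, max_resident_threads returns
    32 * 64 * 16 = 32768.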


Issues

    None


Revision History

    Rev.    Date    Author    Changes
    ----  --------  --------  -----------------------------------------
     4     7/21/15  jbreton    Update the layout of threads within a quad for
                               window and framebuffer object rendering.
     3     2/14/14  jbreton    Rename the extension from NVX to NV.
     2      9/4/13  jbreton    Add helperThread attribute binding.
     1    12/19/12  jbreton    Internal revisions.