extensions/NV/NV_compute_program5.txt

Name

    NV_compute_program5

Name Strings

    GL_NV_compute_program5

Contact

    Pat Brown, NVIDIA Corporation (pbrown 'at' nvidia.com)

Status

    Complete

Version

    Last Modified Date:         10/23/2012
    NVIDIA Revision:            2

Number

    421

Dependencies

    OpenGL 4.0 (Core or Compatibiity Profile) is required.

    This extension is written against the OpenGL 4.2 Specification
    (Compatibility Profile).

    NV_gpu_program4 and NV_gpu_program5 are required.

    ARB_compute_shader is required.

    This specification interacts with NV_shader_atomic_float.

    This specification interacts with EXT_shader_image_load_store.

Overview

    This extension builds on the ARB_compute_shader extension to provide new
    assembly compute program capability for OpenGL.  ARB_compute_shader adds
    the basic functionality, including the ability to dispatch compute work.
    This extension provides the ability to write a compute program in
    assembly, using the same basic syntax and capability set found in the
    NV_gpu_program4 and NV_gpu_program5 extensions.

New Procedures and Functions

    None.

New Tokens

    Accepted by the <cap> parameter of Disable, Enable, and IsEnabled,
    by the <pname> parameter of GetBooleanv, GetIntegerv, GetFloatv,
    and GetDoublev, and by the <target> parameter of ProgramStringARB,
    BindProgramARB, ProgramEnvParameter4[df][v]ARB,
    ProgramLocalParameter4[df][v]ARB, GetProgramEnvParameter[df]vARB,
    GetProgramLocalParameter[df]vARB, GetProgramivARB and
    GetProgramStringARB:

        COMPUTE_PROGRAM_NV                              0x90FB

    Accepted by the <target> parameter of ProgramBufferParametersfvNV,
    ProgramBufferParametersIivNV, and ProgramBufferParametersIuivNV,
    BindBufferRangeNV, BindBufferOffsetNV, BindBufferBaseNV, and BindBuffer
    and the <value> parameter of GetIntegerIndexedvEXT:

        COMPUTE_PROGRAM_PARAMETER_BUFFER_NV             0x90FC

    (Note:  Various enumerants from ARB_compute_shader will also be used by
     this extension.)

Additions to Chapter 2 of the OpenGL 4.2 (Compatibility Profile) Specification
(OpenGL Operation)

    Modify Section 2.X, GPU Programs, of NV_gpu_program4 (as modified by
    NV_gpu_program5)

    (insert after second paragraph)

    Compute Programs

    Compute programs are used to perform general purpose computations using a
    three-dimensional array of program invocations (threads).  The compute
    shader invocations are arranged into work groups specified by the
    mandatory GROUP_SIZE declaration, each of which comprises a fixed-size,
    three-dimensional array of program invocations.  One or more work groups
    are scheduled for execution using the DispatchCompute or
    DispatchComputeIndirect commands.

    Each work group scheduled for execution will launch a separate program
    invocation for each work group member.  While the program invocations in a
    work group are launched together, they run independently after launch.
    The BAR (barrier) instruction is available to synchronize program
    invocations; an invocation stops at each BAR instruction until all
    invocations in the work group have executed the BAR instruction.  Each
    work group has an optional shared memory allocation (specified by the
    SHARED_MEMORY declaration) that can be read or written by any invocations
    of the work group.

    Unlike other program types, compute program invocations have no inputs or
    outputs interfacing with the rest of the pipeline.  Compute programs may
    obtain inputs using mechanisms such as global loads, image loads, atomic
    counter reads, shader storage buffer reads, and program parameters.
    Built-in inputs are also provided to allow a compute shader invocation to
    determine its position in the work group, the position of its work group
    in the full dispatch, as well as the work group and full dispatch sizes.
    Compute program results are expected to be written to globally accessible
    memory using mechanisms such as global stores, image stores, atomic
    counters, and shader storage buffers.


    Modify Section 2.X.2, Program Grammar

    (replace third paragraph)

    Compute programs are required to begin with the header string "!!NVcp5.0".
    This header string identifies the subsequent program body as being a
    compute program and indicates that it should be parsed according to the
    base NV_gpu_program5 grammar plus the additions below.  Program string
    parsing begins with the character immediately following the header string.

    (add the following grammar rules to the NV_gpu_program5 base grammar for
     compute programs)

    <declSequence>          ::= <declaration> <declSequence>

    <instruction>           ::= <SpecialInstruction>

    <opModifier>            ::= "CTA"

    <namingStatement>       ::= <SHARED_statement>

    <SHARED_statement>      ::= "SHARED" <establishName> <sharedSingleInit>
                              | "SHARED" <establishName> <optArraySize>
                                <sharedMultipleInit>

    <sharedSingleInit>      ::= "=" <sharedUseDS>

    <sharedMultipleInit>    ::= "=" "{" <sharedItemList> "}"

    <sharedItemList>        ::= <sharedUseDM>
                              | <sharedUseDM> "," <sharedItemList>

    <sharedUseV>            ::= <sharedVarName> <optArrayMem>

    <sharedUseDS>           ::= <sharedBaseBinding> <arrayMemAbs>

    <sharedUseDM>           ::= <sharedUseDS>
                              | <sharedBaseBinding> <arrayRange>

    <sharedBaseBinding>     ::= "program" "." "sharedmem"

    <SpecialInstruction>    ::= "BAR"
                              | "ATOMS" <opModifiers> <instResult> ","
                                <instOperandV> "," <sharedUseV>
                              | "LDS" <opModifiers> <instResult> ","
                                <sharedUseV>
                              | "STS" <opModifiers> <instOperandV> ","
                                <sharedUseV>

    <declaration>           ::= "GROUP_SIZE" <int>
                              | "GROUP_SIZE" <int> <int>
                              | "GROUP_SIZE" <int> <int> <int>
                              | "SHARED_MEMORY" <int>

    <attribBasic>           ::= "invocation" "." "localid"
                              | "invocation" "." "globalid"
                              | "invocation" "." "groupid"
                              | "invocation" "." "groupcount"
                              | "invocation" "." "groupsize"
                              | "invocation" "." "localindex"


    (add the following subsection to Section 2.X.3.2, Program Attribute
     Variables)

    Compute program attribute variables describe the attributes of the current
    program invocation.  Each DispatchCompute command produces a set of
    program invocations arranged as a one-, two-, or three-dimensional array.
    Figure X.1 illustrates a two-dimensional dispatch with a local work group
    size of 8x4, and a total dispatch of 5x4 local workgroups.  Each
    individual program invocation has a global one-, two-, or
    three-dimensional global coordinate, which can be further decomposed into
    a work group offset (in fixed-size work groups) and a local offset
    relative to the origin of an invocation's work group.

                +-------+-------+-------+-------+-------+
                |       |       | work  |       |       |
                |       |       | group |       |       |
                |       |       | (2,3) |       |       |
         (0,12) +-------+-------+-------+-------+-------+
                |       |       |       |       |       |
                |       |       |       |       |       |
                |       | *     |       |       |       |
          (0,8) +-------+-------+-------+-------+-------+
                |       |       |       |       | work  |
                |       |       |       |       | group |
                |       |       |       |       | (4,1) |
          (0,4) +-------+-------+-------+-------+-------+
                | work  |       |       |       |       |
                | group |       |       |       |       |
                | (0,0) |       |       |       |       |
                +-------+-------+-------+-------+-------+
              (0,0)   (8,0)   (16,0)  (24,0)  (32,0)

      Figure X.1, Compute Dispatch.  The single invocation at the location
      labeled "*" has a location (invocation.globalid) of (10,9).  The offset
      relative to its local work group (invocation.localid) is (2,1).  Its
      local work group has an offset (invocation.groupid) of (1,2), in units
      of work groups.

    The set of available compute program attribute bindings is enumerated in
    Table X.1.  All bindings are considered four-component unsigned integer
    vectors with the value of the fourth component undefined.

      Attribute Binding          Components  Underlying State
      -------------------------  ----------  ------------------------------
      invocation.localid         (x,y,z,-)   offset relative to base of
                                             work group

      invocation.globalid        (x,y,z,-)   offset relative to the base
                                             of the dispatched work

      invocation.groupid         (x,y,z,-)   offset (in groups) of local work
                                             group

      invocation.groupcount      (x,y,z,-)   total local work group count

      invocation.groupsize       (x,y,z,-)   number of invocations in each
                                             dimension of the local work group

      invocation.localindex      (x,-,-,-)   one-dimensional (flattened) index
                                             in local workgroup

      Table X.1, Compute Program Attribute Bindings.

    If a compute attribute binding matches "invocation.localid", the "x", "y",
    and "z" components of the invocation attribute variable are filled with
    the "x", "y", "z" components, respectively, of the offset of the
    invocation relative to the base of its local workgroup.  The "w" component
    of the attribute is undefined.

    If a compute attribute binding matches "invocation.globalid", the "x",
    "y", and "z" components of the invocation attribute variable are filled
    with the "x", "y", "z" components, respectively, of the offset of the
    invocation relative to the full compute dispatch.  The "w" component of
    the attribute is undefined.

    If a compute attribute binding matches "invocation.groupid", the "x", "y",
    and "z" components of the invocation attribute variable are filled with
    the "x", "y", "z" components, respectively, of the offset of the local
    work group (in groups) relative to the full compute dispatch.  The "w"
    component of the attribute is undefined.

    If a compute attribute binding matches "invocation.groupcount", the "x",
    "y", and "z" components of the invocation attribute variable are filled
    the "x", "y", and "z" dimensions, respectively, in local work groups of
    the full compute dispatch.  The "w" component of the attribute is
    undefined.

    If a compute attribute binding matches "invocation.groupsize", the "x",
    "y", and "z" components of the invocation attribute variable are filled
    the "x", "y", and "z" dimensions, respectively, of the local work group,
    as specified by the GROUP_SIZE declaration.  The "w" component of the
    attribute is undefined.

    If a compute attribute binding matches "invocation.localindex", the "x",
    components of the invocation attribute variable is filled with a flattened
    one-dimensional index of the invocation, which is derived as:

      invocation.localid.z * invocation.groupsize.x * invocation.groupsize.y +
      invocation.localid.y * invocation.groupsize.x +
      invocation.localid.x

    The "y", "z", and "w" components of the attribute are undefined.

    For one-dimensional dispatches, the "y" components of
    "invocation.localid", "invocation.globalid", and "invocation.groupid" will
    be zero.  For one- and two- dimensional dispatches, the "z" components of
    "invocation.localid", "invocation.globalid", and "invocation.groupid" will
    be zero.  The same components of "invocation.groupcount" and
    "invocation.groupsize" will be one in these cases.


    (add the following subsection to section 2.X.3.5, Program Results.)

    Compute programs have no result variables; all shader results must be
    written to memory.


    Add New Section 2.X.3.Y, Compute Program Shared Memory, after Section
    2.X.3.6, Program Parameter Buffers

    Compute program shared memory variables are arrays of basic machine units
    from which data can be read or written using the LDS and STS instructions.
    Compute program shared memory also supports atomic memory operations using
    the ATOMS instruction.  The GL allocates a single block of shared memory
    for each local work group, whose size in basic machine units is specified
    by the "SHARED_MEMORY" statement.  The contents of compute program shared
    memory are undefined when program execution for the local work group
    begins and can be changed only by using the ATOMS or STS instructions.
    Compute program shared memory variables are shared between all invocations
    of a local work group.  Writes performed by one invocation will be visible
    for any reads of the same memory from any other invocation executed after
    the write.  Note that the order of reads and writes between different
    invocations in a local work group is largely undefined, although the BAR
    instruction can be used to introduce synchronization points for all
    invocations in a local work group.

    Shared memory variables may only be used as operands in the ATOMS, LDS,
    and STS instructions; they may not be used by used as results or operands
    in general instructions.  Shared memory variables must be declared
    explicitly via the <SHARED_statement> grammar rule.  Shared memory
    bindings can not be used directly in executable instructions.

    Shader storage buffer variables may be declared as arrays, but all
    bindings assigned to the array must use the same binding point(s) and must
    increase consecutively.

      Binding                        Components  Underlying State
      -----------------------------  ----------  -----------------------------
      program.sharedmem[a]           (x,x,x,x)   compute shared memory,
                                                   element a
      program.sharedmem[a..b]        (x,x,x,x)   compute shared memory,
                                                   elements a through b
      program.sharedmem              (x,x,x,x)   compute shared memory,
                                                   all elements

      Table X.3: Shared Memory Bindings.  <a> and <b> indicate individual
      elements of shared memory.

    If a shared memory binding matches "program.sharedmem[a]", the shared
    memory variable is associated with basic machine element <a> of compute
    shared memory.

    For shared memory declarations, "program.sharedmem[a..b]" is equivalent to
    specifying elements <a> through <b> of compute shared memory in order.

    For shared memory declarations, "program.sharedmem" is equivalent to
    specifying elements zero through <N>-1 of compute shared memory in order,
    where <N> is the total shared memory size declared by the "SHARED_MEMORY"
    statement.


    Modify Section 2.X.4, Program Execution Environment

    (add to the opcode table)

                  Modifiers
      Instruction F I C S H D  Out Inputs    Description
      ----------- - - - - - -  --- --------  --------------------------------
      ATOMS       - - X - - -  s   v,su      atomic transaction to shared mem
      BAR         - - - - - -  -   -         work group execution barrier
      LDS         - - X X - F  v   su        load from shared memory
      STS         - - - - - -  -   v,su      store to shared memory


    Modify Section 2.X.4.1, Program Instruction Modifiers

      Modifier  Description
      --------  -----------------------------------------------
      CTA       Memory barrier orders only memory transactions
                relative to invocations within local work group

    (add to descriptions of opcode modifiers)

    For the MEMBAR (memory barrier) instruction, the "CTA" modifier specifies
    that memory transactions before and after the barrier are strongly ordered
    as observed by any other shader invocation in the local work group.


    Modify Section 2.X.4.5, Program Memory Access, from NV_gpu_program5

    (add to the end of the first paragraph) ... Additionally programs may load
    from or store to shared memory via the ATOMS (atomic shared memory
    operation), LDS (load from shared memory), and STS (store to shared
    memory) instructions.

    (modify miscellaneous other language referring to "buffer object memory"
    to instead refer to "buffer object and shared memory")

    (add hypothetical built-in functions SharedMemoryLoad() and
    SharedMemoryStore() that behave similarly to BufferMemoryLoad() and
    BufferMemoryStore(), except that they access local work group shared
    memory instead of buffer object memory)


    Add the following subsection to section 2.X.7, Program Declarations

    Section 2.X.7.Y, Compute Program Declarations

    Compute programs support two types of declaration statement, as described
    below.

    - Shader Thread Group Size (GROUP_SIZE)

    The GROUP_SIZE statement declares the number of shader threads in a one-,
    two-, or three-dimensional local work group.  The statement must have one
    to three unsigned integer arguments.  Each argument must be less than or
    equal to the value of the implementation-dependent limit
    MAX_COMPUTE_LOCAL_WORK_SIZE for its corresponding dimension (X, Y, or Z).
    A program will fail to load unless it contains exactly one GROUP_SIZE
    declaration.


    - Shared Memory Storage Size (SHARED_MEMORY)

    The SHARED_MEMORY statement declares the size of the shared memory, in
    basic machine units, available to the threads of each local work group.
    The SHARED_MEMORY statement is optional, but a program will fail to load
    if it includes multiple SHARED_MEMORY declarations, if it uses the the
    ATOMS, LDS, or STS instructions in a program without a SHARED_MEMORY
    declaration, if uses these instructions with an offset that would access
    memory beyond the declared shared memory size, or if the declared shared
    memory size is greater than the implementation-dependent limit
    MAX_COMPUTE_SHARED_VARIABLE_SIZE.


    (add the following subsection to section 2.X.8, Program Instruction Set.)

    Section 2.X.8.Z, ATOMS:  Atomic Memory Operation (Shared Memory)

    The ATOMS instruction performs an atomic memory operation by reading from
    shared memory specified by the second unsigned integer scalar operand,
    computing a new value based on the value read from memory and the first
    (vector) operand, and then writing the result back to the same memory
    address.  The memory transaction is atomic, guaranteeing that no other
    write to the memory accessed will occur between the time it is read and
    written by the ATOMS instruction.  The result of the ATOMS instruction is
    the scalar value read from memory.  The second operand used for the ATOMS
    instruction must correspond to a shared memory variable declared using the
    "SHARED" statement; a program will fail to load if any other type of
    operand is used for the second operand of an ATOMS instruction.

    The ATOMS instruction has two required instruction modifiers.  The atomic
    modifier specifies the type of operation to be performed.  The storage
    modifier specifies the size and data type of the operand read from memory
    and the base data type of the operation used to compute the value to be
    written to memory.

      atomic     storage
      modifier   modifiers            operation
      --------   ------------------   --------------------------------------
       ADD       U32, S32, U64, F32   compute a sum
       MIN       U32, S32             compute minimum
       MAX       U32, S32             compute maximum
       IWRAP     U32                  increment memory, wrapping at operand
       DWRAP     U32                  decrement memory, wrapping at operand
       AND       U32, S32             compute bit-wise AND
       OR        U32, S32             compute bit-wise OR
       XOR       U32, S32             compute bit-wise XOR
       EXCH      U32, S32, U64, F32   exchange memory with operand
       CSWAP     U32, S32, U64        compare-and-swap

     Table X.Y, Supported atomic and storage modifiers for the ATOM
     instruction.

    Not all storage modifiers are supported by ATOMS, and the set of modifiers
    allowed for any given instruction depends on the atomic modifier
    specified.  Table X.Y enumerates the set of atomic modifiers supported by
    the ATOMS instruction, and the storage modifiers allowed for each.

      tmp0 = VectorLoad(op0);
      result = SharedMemoryLoad(op1, storageModifier);
      switch (atomicModifier) {
      case ADD:
        writeval = tmp0.x + result;
        break;
      case MIN:
        writeval = min(tmp0.x, result);
        break;
      case MAX:
        writeval = max(tmp0.x, result);
        break;
      case IWRAP:
        writeval = (result >= tmp0.x) ? 0 : result+1;
        break;
      case DWRAP:
        writeval = (result == 0 || result > tmp0.x) ? tmp0.x : result-1;
        break;
      case AND:
        writeval = tmp0.x & result;
        break;
      case OR:
        writeval = tmp0.x | result;
        break;
      case XOR:
        writeval = tmp0.x ^ result;
        break;
      case EXCH:
        break;
      case CSWAP:
        if (result == tmp0.x) {
          writeval = tmp0.y;
        } else {
          return result;  // no memory store
        }
        break;
      }
      SharedMemoryStore(op1, writeval, storageModifier);

    ATOMS performs a scalar atomic operation.  The <y>, <z>, and <w>
    components of the result vector are undefined.

    ATOMS supports no base data type modifiers, but requires exactly one
    storage modifier.  The base data types of the result vector, and the first
    (vector) operand are derived from the storage modifier.  The second
    operand is always interpreted as a scalar unsigned integer.


    Section 2.X.8.Z, BAR:  Execution Barrier

    The BAR instruction synchronizes the execution of compute shader
    invocations within a local work group.  When a compute shader invocation
    executes the BAR instruction, it pauses until the same BAR instruction has
    been executed by all invocations in the current local work group.  Once
    all invocations have executed the BAR instruction, processing continues
    with the instruction following the BAR instruction.

    There is no compile-time restriction on the locations in a program where
    BAR is allowed.  However, BAR instructions are not allowed in divergent
    flow control; if any compute shader invocation in the work group executes
    the BAR instruction, all compute shaders invocations must execute the
    instruction.  Results of executing a BAR instruction are undefined and can
    result in application hangs and/or program termination if the instruction
    is issued:

      * inside any IF/ELSE/ENDIF block where the results of the condition
        evaluated by the IF instruction are not identical across the work
        group;

      * inside any iteration of REP/ENDREP block where at least one invocation
        in the work group has skipped to the next iteration using the CONT
        instruction, exited the loop using a BRK or RET instruction, or exited
        the loop due to having completed the requested number of loop
        iterations; or

      * inside any subroutine (including main) where at least one invocation
        in the work group has exited the subroutine using the RET instruction.

    BAR has no operands and generates no result.


    Section 2.X.8.Z, LDS:  Load from Shared Memory

    The LDS instruction generates a result vector by fetching data from the
    shared memory for the current local work group identified by the first
    operand, as described in Section 2.X.4.5.  The single operand for the LDS
    instruction must correspond to a shader shared memory variable declared
    using the "SHARED" statement; a program will fail to load if any other
    type of operand is used in an LDS instruction.

      result = SharedMemoryLoad(op0, storageModifier);

    LDS supports no base data type modifiers, but requires exactly one storage
    modifier.  The base data type of the result vector is derived from the
    storage modifier.


    Replace Section 2.X.8.Z, MEMBAR:  Memory Barrier, as added by
    EXT_shader_image_load_store

    The MEMBAR instruction synchronizes memory transactions to ensure that
    memory transactions resulting from any instruction executed by the thread
    prior to the MEMBAR instruction complete prior to any memory transactions
    issued after the instruction, as observed by other shader invocations.

    The MEMBAR instruction has one optional instruction modifier.  If the CTA
    instruction modifier is specified, memory transactions before and after
    the barrier will be strongly ordered as observed by other shader
    invocations in the same local work group.  However, it does not order
    transactions as viewed by any other shader.  With the CTA modifier,
    shaders not in the local work group may observe the results of memory
    transactions issued after the MEMBAR instruction before those issued
    before the MEMBAR instruction.  If the CTA instruction modifier is not
    specified, all shader invocations will see the results of any memory
    transaction issued before the MEMBAR instruction before those issued after
    the MEMBAR instruction.

    MEMBAR has no operands and generates no result.


    Section 2.X.8.Z, STS:  Store to Shared Memory

    The STS instruction writes the contents of the first vector operand to
    shared memory for the current local work group identified by the second
    operand, as described in Section 2.X.4.5.  This instruction generates no
    result.  The second operand for the STS instruction must correspond to a
    shared memory variable declared using the "SHARED" statement; a program
    will fail to load if any other type of operand is used in an STS
    instruction.

      tmp0 = VectorLoad(op0);
      SharedMemoryStore(op1, tmp0, storageModifier);

    STS supports no base data type modifiers, but requires exactly one storage
    modifier.  The base data type of the vector components of the first
    operand is derived from the storage modifier.


Additions to Chapter 3 of the OpenGL 4.2 (Compatibility Profile) Specification
(Rasterization)

    None.

Additions to Chapter 4 of the OpenGL 4.2 (Compatibility Profile) Specification
(Per-Fragment Operations and the Frame Buffer)

    None.

Additions to Chapter 5 of the OpenGL 4.2 (Compatibility Profile) Specification
(Special Functions)

    None.

Additions to Chapter 6 of the OpenGL 4.2 (Compatibility Profile) Specification
(State and State Requests)

    None.

Additions to the AGL/GLX/WGL Specifications

    None.

GLX Protocol

    None.

Dependencies on NV_shader_atomic_float

    If NV_shader_atomic_float is not supported, the ADD and EXCH atomic
    operations in the ATOMS instruction do not support the "F32" storage
    modifier.

Dependencies on EXT_shader_image_load_store

    If EXT_shader_image_load_store is not supported, language describing the
    "CTA" instruction modifier and modifying the MEMBAR instruction (as added
    by EXT_shader_image_load_store) should be removed.

Errors

    None.

New State

    (Modify ARB_vertex_program, Table X.6 -- Program State)

                                                      Initial
    Get Value                    Type    Get Command  Value   Description               Sec.    Attribute
    ---------                    ------- -----------  ------- ------------------------  ------  ---------
    COMPUTE_PROGRAM_PARAMETER_   Z+      GetIntegerv  0       Active compute program    2.14.1  -
      BUFFER_NV                                               buffer object binding
    COMPUTE_PROGRAM_PARAMETER_   nxZ+    GetInteger-  0       Buffer objects bound for  2.14.1  -
      BUFFER_NV                          IndexedvEXT          compute program use

    Also shares buffer bindings and other state with the ARB_compute_shader
    extension.

New Implementation Dependent State

    None, but shares implementation-dependent state with the
    ARB_compute_shader extension.

Issues

    None.

Revision History

    Rev.    Date    Author    Changes
    ----  --------  --------  --------------------------------------------
     2    10/23/12  pbrown    Remove the restriction forbidding the use of BAR
                              inside potentially divergent flow control.
                              Instead, we will allow BAR to be executed
                              anywhere, but specify undefined results
                              (including hangs or program termination) if the
                              flow control is divergent (bug 9367).

     1              pbrown    Internal spec development.