1Name 2 3 NV_gpu_program5 4 5Name Strings 6 7 GL_NV_gpu_program5 8 GL_NV_gpu_program_fp64 9 10Contact 11 12 Pat Brown, NVIDIA Corporation (pbrown 'at' nvidia.com) 13 14Status 15 16 Shipping. 17 18Version 19 20 Last Modified Date: 09/11/2014 21 NVIDIA Revision: 7 22 23Number 24 25 388 26 27Dependencies 28 29 OpenGL 2.0 is required. 30 31 This extension is written against the OpenGL 3.0 specification. 32 33 NV_gpu_program4 and NV_gpu_program4_1 are required. 34 35 NV_shader_buffer_load is required. 36 37 NV_shader_buffer_store is required. 38 39 This extension is written against and interacts with the NV_gpu_program4, 40 NV_vertex_program4, NV_geometry_program4, and NV_fragment_program4 41 specifications. 42 43 This extension interacts with NV_tessellation_program5. 44 45 This extension interacts with ARB_transform_feedback3. 46 47 This extension interacts trivially with NV_shader_buffer_load. 48 49 This extension interacts trivially with NV_shader_buffer_store. 50 51 This extension interacts trivially with NV_parameter_buffer_object2. 52 53 This extension interacts trivially with OpenGL 3.3, ARB_texture_swizzle, 54 and EXT_texture_swizzle. 55 56 This extension interacts trivially with ARB_blend_func_extended. 57 58 This extension interacts trivially with EXT_shader_image_load_store. 59 60 This extension interacts trivially with ARB_shader_subroutine. 61 62 If the 64-bit floating-point portion of this extension is not supported, 63 "GL_NV_gpu_program_fp64" will not be found in the extension string. 64 65Overview 66 67 This specification documents the common instruction set and basic 68 functionality provided by NVIDIA's 5th generation of assembly instruction 69 sets supporting programmable graphics pipeline stages. 70 71 The instruction set builds upon the basic framework provided by the 72 ARB_vertex_program and ARB_fragment_program extensions to expose 73 considerably more capable hardware. In addition to new capabilities for 74 vertex and fragment programs, this extension provides new functionality 75 for geometry programs as originally described in the NV_geometry_program4 76 specification, and serves as the basis for the new tessellation control 77 and evaluation programs described in the NV_tessellation_program5 78 extension. 79 80 Programs using the functionality provided by this extension should begin 81 with the program headers "!!NVvp5.0" (vertex programs), "!!NVtcp5.0" 82 (tessellation control programs), "!!NVtep5.0" (tessellation evaluation 83 programs), "!!NVgp5.0" (geometry programs), and "!!NVfp5.0" (fragment 84 programs). 
85 86 This extension provides a variety of new features, including: 87 88 * support for 64-bit integer operations; 89 90 * the ability to dynamically index into an array of texture units or 91 program parameter buffers; 92 93 * extending texel offset support to allow loading texel offsets from 94 regular integer operands computed at run-time, instead of requiring 95 that the offsets be constants encoded in texture instructions; 96 97 * extending TXG (texture gather) support to return the 2x2 footprint 98 from any component of the texture image instead of always returning 99 the first (x) component; 100 101 * extending TXG to support shadow comparisons in conjunction with a 102 depth texture, via the SHADOW* targets; 103 104 * further extending texture gather support to provide a new opcode 105 (TXGO) that applies a separate texel offset vector to each of the four 106 samples returned by the instruction; 107 108 * bit manipulation instructions, including ones to find the position of 109 the most or least significant set bit, bitfield insertion and 110 extraction, and bit reversal; 111 112 * a general data conversion instruction (CVT) supporting conversion 113 between any two data types supported by this extension; and 114 115 * new instructions to compute the composite of a set of boolean 116 conditions a group of shader threads. 117 118 This extension also provides some new capabilities for individual program 119 types, including: 120 121 * support for instanced geometry programs, where a geometry program may 122 be run multiple times for each primitive; 123 124 * support for emitting vertices in a geometry program where each vertex 125 emitted may be directed at a specified vertex stream and captured 126 using the ARB_transform_feedback3 extension; 127 128 * support for interpolating an attribute at a programmable offset 129 relative to the pixel center (IPAO), at a programmable sample number 130 (IPAS), or at the fragment's centroid location (IPAC) in a fragment 131 program; 132 133 * support for reading a mask of covered samples in a fragment program; 134 135 * support for reading a point sprite coordinate directly in a fragment 136 program, without overriding a texture coordinate; 137 138 * support for reading patch primitives and per-patch attributes 139 (introduced by ARB_tessellation_shader) in a geometry program; and 140 141 * support for multiple output vectors for a single color output in a 142 fragment program (as used by ARB_blend_func_extended). 143 144 This extension also provides optional support for 64-bit-per-component 145 variables and 64-bit floating-point arithmetic. These features are 146 supported if and only if "NV_gpu_program_fp64" is found in the extension 147 string. 148 149 This extension incorporates the memory access operations from the 150 NV_shader_buffer_load and NV_parameter_buffer_object2 extensions, 151 originally built as add-ons to NV_gpu_program4. 
It also provides the 152 following new capabilities: 153 154 * support for the features without requiring a separate OPTION keyword; 155 156 * support for indexing into an array of constant buffers using the LDC 157 opcode added by NV_parameter_buffer_object2; 158 159 * support for storing into buffer objects at a specified GPU address 160 using the STORE opcode, an allowing applications to create READ_WRITE 161 and WRITE_ONLY mappings when making a buffer object resident using the 162 API mechanisms in the NV_shader_buffer_store extension; 163 164 * storage instruction modifiers to allow loading and storing 64-bit 165 component values; 166 167 * support for atomic memory transactions using the ATOM opcode, where 168 the instruction atomically reads the memory pointed to by a pointer, 169 performs a specified computation, stores the results of that 170 computation, and returns the original value read; 171 172 * support for memory barrier transactions using the MEMBAR opcode, which 173 ensures that all memory stores issued prior to the opcode complete 174 prior to any subsequent memory transactions; and 175 176 * a fragment program option to specify that depth and stencil tests are 177 performed prior to fragment program execution. 178 179 Additionally, the assembly program languages supported by this extension 180 include support for reading, writing, and performing atomic memory 181 operations on texture image data using the opcodes and mechanisms 182 documented in the "Dependencies on NV_gpu_program5" section of the 183 EXT_shader_image_load_store extension. 184 185New Procedures and Functions 186 187 None. 188 189New Tokens 190 191 Accepted by the <pname> parameter of GetBooleanv, GetIntegerv, 192 GetFloatv, and GetDoublev: 193 194 MAX_GEOMETRY_PROGRAM_INVOCATIONS_NV 0x8E5A 195 MIN_FRAGMENT_INTERPOLATION_OFFSET_NV 0x8E5B 196 MAX_FRAGMENT_INTERPOLATION_OFFSET_NV 0x8E5C 197 FRAGMENT_PROGRAM_INTERPOLATION_OFFSET_BITS_NV 0x8E5D 198 MIN_PROGRAM_TEXTURE_GATHER_OFFSET_NV 0x8E5E 199 MAX_PROGRAM_TEXTURE_GATHER_OFFSET_NV 0x8E5F 200 201 202Additions to Chapter 2 of the OpenGL 3.0 Specification (OpenGL Operation) 203 204 Modify Section 2.X.2 of NV_fragment_program4, Program Grammar 205 206 (modify the section, updating the program header string for the extended 207 instruction set) 208 209 Fragment programs are required to begin with the header string 210 "!!NVfp5.0". This header string identifies the subsequent program body as 211 being a fragment program and indicates that it should be parsed according 212 to the base NV_gpu_program5 grammar plus the additions below. Program 213 string parsing begins with the character immediately following the header 214 string. 215 216 (add/change the following rules to the NV_fragment_program4 and 217 NV_gpu_program5 base grammars) 218 219 <SpecialInstruction> ::= "IPAC" <opModifiers> <instResult> "," 220 <instOperandV> 221 | "IPAO" <opModifiers> <instResult> "," 222 <instOperandV> "," <instOperandV> 223 | "IPAS" <opModifiers> <instResult> "," 224 <instOperandV> "," <instOperandS> 225 226 <interpModifier> ::= "SAMPLE" 227 228 <attribBasic> ::= <fragPrefix> "sampleid" 229 | <fragPrefix> "samplemask" 230 | <fragPrefix> "pointcoord" 231 232 <resultBasic> ::= <resPrefix> "color" <resultOptColorNum> 233 <resultOptColorType> 234 | <resPrefix> "samplemask" 235 236 <resultOptColorType> ::= "" 237 | "." 
<colorType> 238 239 240 Modify Section 2.X.2 of NV_geometry_program4, Program Grammar 241 242 (modify the section, updating the program header string for the extended 243 instruction set) 244 245 Geometry programs are required to begin with the header string 246 "!!NVgp5.0". This header string identifies the subsequent program body as 247 being a geometry program and indicates that it should be parsed according 248 to the base NV_gpu_program5 grammar plus the additions below. Program 249 string parsing begins with the character immediately following the header 250 string. 251 252 (add the following rules to the NV_geometry_program4 and NV_gpu_program5 253 base grammars) 254 255 <declaration> ::= "INVOCATIONS" <int> 256 257 <declPrimInType> ::= "PATCHES" 258 259 <SpecialInstruction> ::= "EMITS" <instOperandS> 260 261 <attribBasic> ::= <primPrefix> "invocation" 262 | <primPrefix> "vertexcount" 263 | <attribTessOuter> <optArrayMemAbs> 264 | <attribTessInner> <optArrayMemAbs> 265 | <attribPatchGeneric> <optArrayMemAbs> 266 267 <attribMulti> ::= <attribTessOuter> <arrayRange> 268 | <attribTessInner> <arrayRange> 269 | <attribPatchGeneric> <arrayRange> 270 271 <attribTessOuter> ::= <primPrefix> "." "tessouter" 272 273 <attribTessInner> ::= <primPrefix> "." "tessinner" 274 275 <attribPatchGeneric> ::= <primPrefix> "." "patch" "." "attrib" 276 277 278 Modify Section 2.X.2 of NV_vertex_program4, Program Grammar 279 280 (modify the section, updating the program header string for the extended 281 instruction set) 282 283 Vertex programs are required to begin with the header string "!!NVvp5.0". 284 This header string identifies the subsequent program body as being a 285 vertex program and indicates that it should be parsed according to the 286 base NV_gpu_program5 grammar plus the additions below. Program string 287 parsing begins with the character immediately following the header string. 
288 289 290 Modify Section 2.X.2 of NV_gpu_program4, Program Grammar 291 292 (add the following grammar rules to the NV_gpu_program4 base grammar; 293 additional grammar rules usable for assembly programs are documented in 294 the EXT_shader_image_load_store and ARB_shader_subroutine specifications) 295 296 <instruction> ::= <MemInstruction> 297 298 <MemInstruction> ::= <ATOMop_instruction> 299 | <STOREop_instruction> 300 | <MEMBARop_instruction> 301 302 <VECTORop> ::= "BFR" 303 | "BTC" 304 | "BTFL" 305 | "BTFM" 306 | "PK64" 307 | "LDC" 308 | "CVT" 309 | "TGALL" 310 | "TGANY" 311 | "TGEQ" 312 | "UP64" 313 314 <SCALARop> ::= "LOAD" 315 316 <BINop> ::= "BFE" 317 318 <TRIop> ::= "BFI" 319 320 <TEXop_instruction> ::= <TEXop> <opModifiers> <instResult> "," 321 <instOperandV> "," <instOperandV> "," 322 <texAccess> 323 324 <TEXop> ::= "TXG" 325 | "LOD" 326 327 <TXDop> ::= "TXGO" 328 329 <ATOMop_instruction> ::= <ATOMop> <opModifiers> <instResult> "," 330 <instOperandV> "," <instOperandS> 331 332 <ATOMop> ::= "ATOM" 333 334 <STOREop_instruction> ::= <STOREop> <opModifiers> <instOperandV> "," 335 <instOperandS> 336 337 <STOREop> ::= "STORE" 338 339 <MEMBARop_instruction> ::= <MEMBARop> <opModifiers> 340 341 <MEMBARop> ::= "MEMBAR" 342 343 <opModifier> ::= "F16" 344 | "F32" 345 | "F64" 346 | "F32X2" 347 | "F32X4" 348 | "F64X2" 349 | "F64X4" 350 | "S8" 351 | "S16" 352 | "S32" 353 | "S32X2" 354 | "S32X4" 355 | "S64" 356 | "S64X2" 357 | "S64X4" 358 | "U8" 359 | "U16" 360 | "U32" 361 | "U32X2" 362 | "U32X4" 363 | "U64" 364 | "U64X2" 365 | "U64X4" 366 | "ADD" 367 | "MIN" 368 | "MAX" 369 | "IWRAP" 370 | "DWRAP" 371 | "AND" 372 | "OR" 373 | "XOR" 374 | "EXCH" 375 | "CSWAP" 376 | "COH" 377 | "ROUND" 378 | "CEIL" 379 | "FLR" 380 | "TRUNC" 381 | "PREC" 382 | "VOL" 383 384 <texAccess> ::= <textureUseS> "," <texTarget> <optTexOffset> 385 | <textureUseV> "," <texTarget> <optTexOffset> 386 387 <texTarget> ::= "ARRAYCUBE" 388 | "SHADOWARRAYCUBE" 389 390 <optTexOffset> ::= /* empty */ 391 | <texOffset> 392 393 <texOffset> ::= "offset" "(" <instOperandV> ")" 394 395 <namingStatement> ::= <TEXTURE_statement> 396 397 <BUFFER_statement> ::= <bufferDeclType> <establishName> 398 <optArraySize> <optArraySize> "=" 399 <bufferMultInit> 400 401 <bufferDeclType> ::= "CBUFFER" 402 403 <TEXTURE_statement> ::= "TEXTURE" <establishName> <texSingleInit> 404 | "TEXTURE" <establishName> <optArraySize> 405 <texMultipleInit> 406 407 <texSingleInit> ::= "=" <textureUseDS> 408 409 <texMultipleInit> ::= "=" "{" <texItemList> "}" 410 411 <texItemList> ::= <textureUseDM> 412 | <textureUseDM> "," <texItemList> 413 414 <bufferBinding> ::= "program" "." "buffer" <arrayRange> 415 416 <textureUseS> ::= <textureUseV> <texImageUnitComp> 417 418 <textureUseV> ::= <texImageUnit> 419 | <texVarName> <optArrayMem> 420 421 <textureUseDS> ::= "texture" <arrayMemAbs> 422 423 <textureUseDM> ::= <textureUseDS> 424 | "texture" <arrayRange> 425 426 <texImageUnitComp> ::= <scalarSuffix> 427 428 429 Modify Section 2.X.3.1, Program Variable Types 430 431 (IGNORE if GL_NV_gpu_program_fp64 is not found in the extension string. 432 Otherwise modify storage size modifiers to guarantee that "LONG" 433 variables are at least 64 bits in size.) 434 435 Explicitly declared variables may optionally have one storage size 436 modifier. Variables decared as "SHORT" will be represented using at least 437 16 bits per component. "SHORT" floating-point values will have at least 5 438 bits of exponent and 10 bits of mantissa. 
Variables declared as "LONG" 439 will be represented with at least 64 bits per component. "LONG" 440 floating-point values will have at least 11 bits of exponent and 52 bits 441 of mantissa. If no size modifier is provided, the GL will automatically 442 select component sizes. Implementations are not required to support more 443 than one component size, so "SHORT", "LONG", and the default could all 444 refer to the same component size. The "LONG" modifier is supported only 445 for declarations of temporary variables ("TEMP"), and attribute variables 446 ("ATTRIB") in vertex programs. The "SHORT" modifier is supported only 447 for declarations of temporary variables and result variables ("OUTPUT"). 448 449 450 Modify Section 2.X.3.2 of the NV_fragment_program4 specification, Program 451 Attribute Variables. 452 453 (Add a table entry and relevant text describing the fragment program 454 input sample mask variable.) 455 456 Fragment Attribute Binding Components Underlying State 457 -------------------------- ---------- ---------------------------- 458 fragment.samplemask (m,-,-,-) fragment coverage mask 459 fragment.pointcoord (s,t,-,-) fragment point sprite coordinate 460 461 If a fragment attribute binding matches "fragment.samplemask", the "x" 462 component is filled with a coverage mask indicating the set of samples 463 covered by this fragment. The coverage mask is a bitfield, where bit <n> 464 is one if the sample number <n> is covered and zero otherwise. If 465 multisample buffers are not available (SAMPLE_BUFFERS is zero), bit zero 466 indicates if the center of the pixel corresponding to the fragment is 467 covered. 468 469 If a fragment attribute binding matches "fragment.pointcoord", the "x" and 470 "y" components are filled with the s and t point sprite coordinates 471 (section 3.3.1), respectively. The "z" and "w" components are undefined. 472 If the fragment is generated by any primitive other than a point, or if 473 point sprites are disabled, all four components of the binding are 474 undefined. 475 476 Modify Section 2.X.3.2 of the NV_geometry_program4 specification, Program 477 Attribute Variables. 478 479 (Add a table entry and relevant text describing the geometry program 480 invocation attribute and per-patch attributes.) 481 482 Geometry Vertex Binding Components Description 483 ----------------------------- ---------- ---------------------------- 484 ... 485 primitive.invocation (id,-,-,-) geometry program invocation 486 primitive.tessouter[n] (x,-,-,-) outer tess. level n 487 primitive.tessinner[n] (x,-,-,-) inner tess. level n 488 primitive.patch.attrib[n] (x,y,z,w) generic patch attribute n 489 primitive.tessouter[n..o] (x,-,-,-) outer tess. levels n to o 490 primitive.tessinner[n..o] (x,-,-,-) inner tess. levels n to o 491 primitive.patch.attrib[n..o] (x,y,z,w) generic patch attrib n to o 492 primitive.vertexcount (c,-,-,-) vertices in primitive 493 494 ... 495 496 If a geometry attribute binding matches "primitive.invocation", the "x" 497 component is filled with an integer giving the number of previous 498 invocations of the geometry program on the primitive being processed. If 499 the geometry program is invoked only once per primitive (default), this 500 component will always be zero. If the program is invoked multiple times 501 (via the INVOCATIONS declaration), the component will be zero on the first 502 invocation, one on the second, and so forth. The "y", "z", and "w" 503 components of the variable are always undefined. 
504 505 If an attribute binding matches "primitive.tessouter[n]", the "x" 506 component is filled with the per-patch outer tessellation level numbered 507 <n> of the input patch. <n> must be less than four. The "y", "z", and 508 "w" components are always undefined. A program will fail to load if this 509 attribute binding is used and the input primitive type is not PATCHES. 510 511 If an attribute binding matches "primitive.tessinner[n]", the "x" 512 component is filled with the per-patch inner tessellation level numbered 513 <n> of the input patch. <n> must be less than two. The "y", "z", and "w" 514 components are always undefined. A program will fail to load if this 515 attribute binding is used and the input primitive type is not PATCHES. 516 517 If an attribute binding matches "primitive.patch.attrib[n]", the "x", "y", 518 "z", and "w" components are filled with the corresponding components of 519 the per-patch generic attribute numbered <n> of the input patch. A 520 program will fail to load if this attribute binding is used and the input 521 primitive type is not PATCHES. 522 523 If an attribute binding matches "primitive.tessouter[n..o]", 524 "primitive.tessinner[n..o]", or "primitive.patch.attrib[n..o]", a sequence 525 of 1+<o>-<n> outer tessellation level, inner tessellation level, or 526 per-patch generic attribute bindings is created. For per-patch generic 527 attribute bindings, it is as though the sequence 528 "primitive.patch.attrib[n], primitive.patch.attrib[n+1], ... 529 primitive.patch.attrib[o]" were specfied. These bindings are available 530 only in explicit declarations of array variables. A program will fail to 531 load if <n> is greater than <o> or the input primitive type is not 532 PATCHES. 533 534 If a geometry attribute binding matches "primitive.vertexcount", the "x" 535 component is filled with the number of vertices in the input primitive 536 being processed. The "y", "z", and "w" components of the variable are 537 always undefined. 538 539 540 Modify Section 2.X.3.5, Program Results 541 542 (modify Table X.X) 543 544 Binding Components Description 545 ----------------------------- ---------- ---------------------------- 546 result.color[n].primary (r,g,b,a) primary color n (SRC_COLOR) 547 result.color[n].secondary (r,g,b,a) secondary color n (SRC1_COLOR) 548 549 Table X.X: Fragment Result Variable Bindings. Components labeled "*" 550 are unused. "[n]" is optional -- color <n> is used if specified; color 551 0 is used otherwise. 552 553 (add after third paragraph) 554 555 If a result variable binding matches "result.color[n].primary" or 556 "result.color[n].secondary" and the ARB_blend_func_extended option is 557 specified, updates to the "x", "y", "z", and "w" components of these color 558 result variables modify the "r", "g", "b", and "a" components of the 559 SRC_COLOR and SRC1_COLOR color outputs, respectively, for the fragment 560 output color numbered <n>. If the ARB_blend_func_extended program option 561 is not specified, the "result.color[n].primary" and 562 "result.color[n].secondary" bindings are unavailable. 563 564 565 Modify Section 2.X.3.6, Program Parameter Buffers 566 567 (modify the description of parameter buffer arrays to require that all 568 bindings in an array declaration must use the same single buffer *or* 569 buffer range) 570 571 ... Program parameter buffer variables may be declared as arrays, but all 572 bindings assigned to the array must use the same binding point or binding 573 point range, and must increase consecutively. 
574 575 (add to the end of the section) 576 577 In explicit variable declarations, the bindings in Table X.12.1 of the 578 form "program.buffer[a..b]" may also be used, and indicate the variable 579 spans multiple buffer binding points. Such variables must be accessed as 580 an arrays, with the first index specifying an offset into the range of 581 buffer object binding points. A buffer index of zero identifies binding 582 point <a>; an index of <b>-<a>-1 identifies binding point <b>. If such a 583 variable is declared as an array, a second index must be provided to 584 identify the individual array element. A program will fail to compile if 585 such bindings are used when <a> or <b> is negative or greater than or 586 equal to the number of buffer binding points supported for the program 587 type, or if <a> is greater than <b>. The bindings in Table X.12.1 may not 588 be used in implicit variable declarations. 589 590 Binding Components Underlying State 591 ----------------------------- ---------- ----------------------------- 592 program.buffer[a..b][c] (x,x,x,x) program parameter buffers a 593 through b, element c 594 program.buffer[a..b][c..d] (x,x,x,x) program parameter buffers a 595 through b, elements b 596 through c 597 program.buffer[a..b] (x,x,x,x) program parameter buffers a 598 through b, all elements 599 600 Table X.12.1: Program Parameter Buffer Array Bindings. <a> and <b> 601 indicate buffer numbers, <c> and <d> indicate individual elements. 602 603 When bindings beginning with "program.buffer[a..b]" are used in a variable 604 declaration, they behave identically to corresponding beginning with 605 "program.buffer[a]", except that the variable is filled with a separate 606 set of values for each buffer binding point from <a> to <b> inclusive. 607 608 (add new section after Section 2.X.3.7, Program Condition Code Registers 609 and renumber subsequent sections accordingly) 610 611 Section 2.X.3.8, Program Texture Variables 612 613 Program texture variables are used as constants during program execution 614 and refer the texture objects bound to to one or more texture image units. 615 All texture variables have associated bindings and are read-only during 616 program execution. Texture variables retain their values across program 617 invocations, and the set of texture image units to which they refer is 618 constant. The texture object a variable refers to may be changed by 619 binding a new texture object to the appropriate target of the 620 corresponding texture image unit. Texture variables may only be used to 621 identify a texture object in texture instructions, and may not be used as 622 operands in any other instruction. Texture variables may be declared 623 explicitly via the <TEXTURE_statement> grammar rule, or implicitly by 624 using a texture image unit binding in an instruction. 625 626 Texture array variables may be declared as arrays, but the list of 627 texture image units assigned to the array must increase consectively. 628 629 Texture variables identify only a texture image unit; the corresponding 630 texture target (e.g., 1D, 2D, CUBE) and texture object is identified by 631 the <texTarget> grammar rule in instructions using the texture variable. 632 633 Binding Components Underlying State 634 --------------- ---------- ------------------------------------------ 635 texture[a] x texture object bound to image unit a 636 texture[a..b] x texture objects bound to image units a 637 through b 638 639 Table X.12.2: Texture Image Unit Bindings. 
<a> and <b> indicate 640 texture image unit numbers. 641 642 If a texture binding matches "texture[a]", the texture variable is filled 643 with a single integer referring to texture image unit <a>. 644 645 If a texture binding matches "texture[a..b]", the texture variable is 646 filled with an array of integers referring to texture image units <a> 647 through <b>, inclusive. A program will fail to compile if <a> or <b> is 648 negative or greater than or equal to the number of texture image units 649 supported, or if <a> is greater than <b>. 650 651 652 Modify Section 2.X.4, Program Execution Environment 653 654 (Update the instruction set table to include new columns to indicate the 655 first ISA supporting the instruction, and to indicate whether the 656 instruction supports 64-bit floating-point modifiers.) 657 658 Instr- Modifiers 659 uction V F I C S H D Out Inputs Description 660 ------- -- - - - - - - --- -------- -------------------------------- 661 ABS 40 6 6 X X X F v v absolute value 662 ADD 40 6 6 X X X F v v,v add 663 AND 40 - 6 X - - S v v,v bitwise and 664 ATOM 50 - - X - - - s v,su atomic memory transaction 665 BFE 50 - X X - - S v v,v bitfield extract 666 BFI 50 - X X - - S v v,v,v bitfield insert 667 BFR 50 - X X - - S v v bitfield reverse 668 BRK 40 - - - - - - - c break out of loop instruction 669 BTC 50 - X X - - S v v bit count 670 BTFL 50 - X X - - S v v find least significant bit 671 BTFM 50 - X X - - S v v find most significant bit 672 CAL 40 - - - - - - - c subroutine call 673 CEIL 40 6 6 X X X F v vf ceiling 674 CMP 40 6 6 X X X F v v,v,v compare 675 CONT 40 - - - - - - - c continue with next loop interation 676 COS 40 X - X X X F s s cosine with reduction to [-PI,PI] 677 CVT 50 - - X X - F v v general data type conversion 678 DDX 40 X - X X X F v v derivative relative to X (fp-only) 679 DDY 40 X - X X X F v v derivative relative to Y (fp-only) 680 DIV 40 6 6 X X X F v v,s divide vector components by scalar 681 DP2 40 X - X X X F s v,v 2-component dot product 682 DP2A 40 X - X X X F s v,v,v 2-comp. 
dot product w/scalar add 683 DP3 40 X - X X X F s v,v 3-component dot product 684 DP4 40 X - X X X F s v,v 4-component dot product 685 DPH 40 X - X X X F s v,v homogeneous dot product 686 DST 40 X - X X X F v v,v distance vector 687 ELSE 40 - - - - - - - - start if test else block 688 EMIT 40 - - - - - - - - emit vertex stream 0 (gp-only) 689 EMITS 50 - X - - - S - s emit vertex to stream (gp-only) 690 ENDIF 40 - - - - - - - - end if test block 691 ENDPRIM 40 - - - - - - - - end of primitive (gp-only) 692 ENDREP 40 - - - - - - - - end of repeat block 693 EX2 40 X - X X X F s s exponential base 2 694 FLR 40 6 6 X X X F v vf floor 695 FRC 40 6 - X X X F v v fraction 696 I2F 40 - 6 X - - S vf v integer to float 697 IF 40 - - - - - - - c start of if test block 698 IPAC 50 X - X X - F v v interpolate at centroid (fp-only) 699 IPAO 50 X - X X - F v v,v interpolate w/offset (fp-only) 700 IPAS 50 X - X X - F v v,su interpolate at sample (fp-only) 701 KIL 40 X X - - X F - vc kill fragment 702 LDC 40 - - X X - F v v load from constant buffer 703 LG2 40 X - X X X F s s logarithm base 2 704 LIT 40 X - X X X F v v compute lighting coefficients 705 LOAD 40 - - X X - F v su global load 706 LOD 41 X - X X - F v vf,t compute texture LOD 707 LRP 40 X - X X X F v v,v,v linear interpolation 708 MAD 40 6 6 X X X F v v,v,v multiply and add 709 MAX 40 6 6 X X X F v v,v maximum 710 MEMBAR 50 - - - - - - - - memory barrier 711 MIN 40 6 6 X X X F v v,v minimum 712 MOD 40 - 6 X - - S v v,s modulus vector components by scalar 713 MOV 40 6 6 X X X F v v move 714 MUL 40 6 6 X X X F v v,v multiply 715 NOT 40 - 6 X - - S v v bitwise not 716 NRM 40 X - X X X F v v normalize 3-component vector 717 OR 40 - 6 X - - S v v,v bitwise or 718 PK2H 40 X X - - - F s vf pack two 16-bit floats 719 PK2US 40 X X - - - F s vf pack two floats as unsigned 16-bit 720 PK4B 40 X X - - - F s vf pack four floats as signed 8-bit 721 PK4UB 40 X X - - - F s vf pack four floats as unsigned 8-bit 722 PK64 50 X X - - - F v v pack 4x32-bit vectors to 2x64 723 POW 40 X - X X X F s s,s exponentiate 724 RCC 40 X - X X X F s s reciprocal (clamped) 725 RCP 40 6 - X X X F s s reciprocal 726 REP 40 6 6 - - X F - v start of repeat block 727 RET 40 - - - - - - - c subroutine return 728 RFL 40 X - X X X F v v,v reflection vector 729 ROUND 40 6 6 X X X F v vf round to nearest integer 730 RSQ 40 6 - X X X F s s reciprocal square root 731 SAD 40 - 6 X - - S vu v,v,vu sum of absolute differences 732 SCS 40 X - X X X F v s sine/cosine without reduction 733 SEQ 40 6 6 X X X F v v,v set on equal 734 SFL 40 6 6 X X X F v v,v set on false 735 SGE 40 6 6 X X X F v v,v set on greater than or equal 736 SGT 40 6 6 X X X F v v,v set on greater than 737 SHL 40 - 6 X - - S v v,s shift left 738 SHR 40 - 6 X - - S v v,s shift right 739 SIN 40 X - X X X F s s sine with reduction to [-PI,PI] 740 SLE 40 6 6 X X X F v v,v set on less than or equal 741 SLT 40 6 6 X X X F v v,v set on less than 742 SNE 40 6 6 X X X F v v,v set on not equal 743 SSG 40 6 - X X X F v v set sign 744 STORE 50 - - - - - - - v,su global store 745 STR 40 6 6 X X X F v v,v set on true 746 SUB 40 6 6 X X X F v v,v subtract 747 SWZ 40 X - X X X F v v extended swizzle 748 TEX 40 X X X X - F v vf,t texture sample 749 TGALL 50 X X X X - F v v test all non-zero in thread group 750 TGANY 50 X X X X - F v v test any non-zero in thread group 751 TGEQ 50 X X X X - F v v test all equal in thread group 752 TRUNC 40 6 6 X X X F v vf truncate (round toward zero) 753 TXB 40 X X X X - F v vf,t texture sample with bias 754 
TXD 40 X X X X - F v vf,vf,vf,t texture sample w/partials 755 TXF 40 X X X X - F v vs,t texel fetch 756 TXFMS 40 X X X X - F v vs,t multisample texel fetch 757 TXG 41 X X X X - F v vf,t texture gather 758 TXGO 50 X X X X - F v vf,vs,vs,t texture gather w/per-texel offsets 759 TXL 40 X X X X - F v vf,t texture sample w/LOD 760 TXP 40 X X X X - F v vf,t texture sample w/projection 761 TXQ 40 - - - - - S vs vs,t texture info query 762 UP2H 40 X X X X - F vf s unpack two 16-bit floats 763 UP2US 40 X X X X - F vf s unpack two unsigned 16-bit integers 764 UP4B 40 X X X X - F vf s unpack four signed 8-bit integers 765 UP4UB 40 X X X X - F vf s unpack four unsigned 8-bit integers 766 UP64 50 X X X X - F v v unpack 2x64 vectors to 4x32 767 X2D 40 X - X X X F v v,v,v 2D coordinate transformation 768 XOR 40 - 6 X - - S v v,v exclusive or 769 XPD 40 X - X X X F v v,v cross product 770 771 Table X.13: Summary of NV_gpu_program5 instructions. 772 773 The "V" column indicates the first assembly language in the 774 NV_gpu_program4 family (if any) supporting the opcode. "41" and "50" 775 indicate NV_gpu_program4_1 and NV_gpu_program5, respectively. 776 777 The "Modifiers" columns specify the set of modifiers allowed for the 778 instruction: 779 780 F = floating-point data type modifiers 781 I = signed and unsigned integer data type modifiers 782 C = condition code update modifiers 783 S = clamping (saturation) modifiers 784 H = half-precision float data type suffix 785 D = default data type modifier (F, U, or S) 786 787 For the "F" and "I" columns, an "X" indicates support for both unsized 788 type modifiers and sized type modifiers with fewer than 64 bits. A "6" 789 indicates support for all modifiers, including 64-bit versions (when 790 supported). 791 792 The input and output columns describe the formats of the operands and 793 results of the instruction. 794 795 v: 4-component vector (data type is inherited from operation) 796 vf: 4-component vector (data type is always floating-point) 797 vs: 4-component vector (data type is always signed integer) 798 vu: 4-component vector (data type is always unsigned integer) 799 s: scalar (replicated if written to a vector destination; 800 data type is inherited from operation) 801 su: scalar (data type is always unsigned integer) 802 c: condition code test result (e.g., "EQ", "GT1.x") 803 vc: 4-component vector or condition code test 804 t: texture 805 806 Instructions labeled "fp-only" and "gp-only" are supported only for 807 fragment and geometry programs, respectively. 808 809 810 Modify Section 2.X.4.1, Program Instruction Modifiers 811 812 (Update the discussion of instruction precision modifiers. If 813 GL_NV_gpu_program_fp64 is not found in the extension string, the "F64" 814 instruction modifier described below is not supported.) 815 816 (add to Table X.14 of the NV_gpu_program4 specification.) 817 818 Modifier Description 819 -------- --------------------------------------------------- 820 F Floating-point operation 821 U Fixed-point operation, unsigned operands 822 S Fixed-point operation, signed operands 823 ... 
824 F32 Floating-point operation, 32-bit precision or 825 access one 32-bit floating-point value 826 F64 Floating-point operation, 64-bit precision or 827 access one 64-bit floating-point value 828 S32 Fixed-point operation, signed 32-bit operands or 829 access one 32-bit signed integer value 830 S64 Fixed-point operation, signed 64-bit operands or 831 access one 64-bit signed integer value 832 U32 Fixed-point operation, unsigned 32-bit operands or 833 access one 32-bit unsigned integer value 834 U64 Fixed-point operation, unsigned 64-bit operands or 835 access one 64-bit unsigned integer value 836 ... 837 F32X2 Access two 32-bit floating-point values 838 F32X4 Access four 32-bit floating-point values 839 F64X2 Access two 64-bit floating-point values 840 F64X4 Access four 64-bit floating-point values 841 S8 Access one 8-bit signed integer value 842 S16 Access one 16-bit signed integer value 843 S32X2 Access two 32-bit signed integer values 844 S32X4 Access four 32-bit signed integer values 845 S64 Access one 64-bit signed integer value 846 S64X2 Access two 64-bit signed integer values 847 S64X4 Access four 64-bit signed integer values 848 U8 Access one 8-bit unsigned integer value 849 U16 Access one 16-bit unsigned integer value 850 U32 Access one 32-bit unsigned integer value 851 U32X2 Access two 32-bit unsigned integer values 852 U32X4 Access four 32-bit unsigned integer values 853 U64 Access one 64-bit unsigned integer value 854 U64X2 Access two 64-bit unsigned integer values 855 U64X4 Access four 64-bit unsigned integer values 856 857 ADD Perform add operation for ATOM 858 MIN Perform minimum operation for ATOM 859 MAX Perform maximum operation for ATOM 860 IWRAP Perform wrapping increment for ATOM 861 DWRAP Perform wrapping decrment for ATOM 862 AND Perform logical AND operation for ATOM 863 OR Perform logical OR operation for ATOM 864 XOR Perform logical XOR operation for ATOM 865 EXCH Perform exchange operation for ATOM 866 CSWAP Perform compare-and-swap operation for ATOM 867 868 COH Make LOAD and STORE operations use coherent caching 869 VOL Make LOAD and STORE operations treat memory as volatile 870 871 PREC Instruction results should be precise 872 873 ROUND Inexact conversion results round to nearest value (even) 874 CEIL Inexact conversion results round to larger value 875 FLR Inexact conversion results round to smaller value 876 TRUNC Inexact conversion results round to value closest to zero 877 878 879 "F", "U", and "S" modifiers are base data type modifiers and specify that 880 the instruction should operate on floating-point, unsigned integer, or 881 signed integer values, respectively. For example, "ADD.F", "ADD.U", and 882 "ADD.S" specify component-wise addition of floating-point, unsigned 883 integer, or signed integer vectors, respectively. While these modifiers 884 specify a data type, they do not specify an exact precision at which the 885 operation is performed. Floating-point and fixed-point operations will 886 typically be carried out at 32-bit precision, unless otherwise described 887 in the instruction documentation or overridden by the precision modifiers. 888 If all operands are represented with less than 32-bit precision (e.g., 889 variables with the "SHORT" component size modifier), operations may be 890 carried out at a precision no less than the precision of the largest 891 operand used by the instruction. 
For some instructions, the data type of 892 some operands or the result are fixed; in these cases, the data type 893 modifier specifies the data type of the remaining values. 894 895 Operands represented with fewer bits than used to perform the instruction 896 will be promoted to a larger data type. Signed integer operands will be 897 sign-extended, where the most significant bits are filled with ones if the 898 operand is negative and zero otherwise. Unsigned integer operands will be 899 zero-extended, where the most significant bits are always filled with 900 zeroes. Operands represented with more bits than used to perform the 901 instruction will be converted to lower precision. Floating-point 902 overflows result in IEEE infinity encodings; integer overflows result in 903 the truncation of the most significant bits. 904 905 For arithmetic operations, the "F32", "F64", "U32", "U64", "S32", and 906 "S64" modifiers are precision-specific data type modifiers that specify 907 that floating-point, unsigned integer, or signed integer operations be 908 carried out with an internal precision of no less than 32 or 64 bits per 909 component, respectively. The "F64", "U64", and "S64" modifiers are 910 supported on only a subset of instructions, as documented in the 911 instruction table. The base data type of the instruction is trivially 912 derived from a precision-specific data type modifiers, and an instruction 913 may not specify both base and precision-specific data type modifiers. 914 915 ... 916 917 "SAT" and "SSAT" are clamping modifiers that generally specify that the 918 floating-point components of the instruction result should be clamped to 919 [0,1] or [-1,1], respectively, before updating the condition code and the 920 destination variable. If no clamping suffix is specified, unclamped 921 results will be used for condition code updates (if any) and destination 922 variable writes. Clamping modifiers are not supported on instructions 923 that do not produce floating-point results, with one exception. 924 925 ... 926 927 For load and store operations, the "F32", "F32X2", "F32X4", "F64", 928 "F64X2", "F64X4", "S8", "S16", "S32", "S32X2", "S32X4", "S64", "S64X2", 929 "S64X4", "U8", "U16", "U32", "U32X2", "U32X4", "U64", "U64X2", and "U64X4" 930 storage modifiers control how data are loaded from or stored to memory. 931 Storage modifiers are supported by the ATOM, LDC, LOAD, and STORE 932 instructions and are covered in more detail in the descriptions of these 933 instructions. These instructions must specify exactly one of these 934 modifiers, and may not specify any of the base data type modifiers (F,U,S) 935 described above. The base data types of the result vector of a load 936 instruction or the first operand of a store instruction are trivially 937 derived from the storage modifier. 938 939 For atomic memory operations performed by the ATOM instruction, the "ADD", 940 "MIN", "MAX", "IWRAP", "DWRAP", "AND", "OR", "XOR", "EXCH", and "CSWAP" 941 modifiers specify the operation to perform on the memory being accessed, 942 and are described in more detail in the description of this instruction. 943 944 For load and store operations, the "COH" modifier controls whether the 945 operation uses a coherent level of the cache hierarchy, as described in 946 Section 2.X.4.5. 947 948 For load and store operations, the "VOL" modifier controls whether the 949 operation treats the memory being read or written as volatile. 
950 Instructions modified with "VOL" will always read or write the underlying 951 memory, whether or not previous or subsequent loads and stores access the 952 same memory. 953 954 For arithmetic and logical operations, the "PREC" modifier controls 955 whether the instruction result should be treated as precise. For 956 instructions not qualified with ".PREC", the implementation may rearrange 957 the computations specified by the program instructions to execute more 958 efficiently, even if it may generate slightly different results in some 959 cases. For example, an implementation may combine a MUL instruction with 960 a dependent ADD instruction and generate code to execute a MAD 961 (multiply-add) instruction instead. The difference in rounding may 962 produce unacceptable artifacts for some algorithms. When ".PREC" is 963 specified, the instruction will be executed in a manner that always 964 generates the same result regardless of the program instructions that 965 precede or follow the instruction. Note that a ".PREC" modifier does not 966 affect the processing of any other instruction. For example, tagging an 967 instruction with ".PREC" does not mean that the instructions used to 968 generate the instruction's operands will be treated as precise unless 969 those instructions are also qualified with ".PREC". 970 971 For the CVT (data type conversion) instruction, the "F16", "F32", "F64", 972 "S8", "S16", "S32", "S64", "U8", "U16", "U32", and "U64" storage modifiers 973 specify the data type of the vector operand and the converted result. Two 974 storage modifiers must be provided, which specify the data type of the 975 result and the operand, respectively. 976 977 For the CVT (data type conversion) instruction, the "ROUND", "CEIL", 978 "FLR", and "TRUNC" modifiers specify how to round converted results that 979 are not directly representable using the data type of the result. 980 981 982 Modify Section 2.X.4.4, Program Texture Access 983 984 (Extend the language describing the operation of texel offsets to cover 985 the new capability to load texel offsets from a register. Otherwise, 986 this functionality is unchanged from previous extensions.) 987 988 <offset> is a 3-component signed integer vector, which can be specified 989 using constants embedded in the texture instruction according to the 990 <texOffsetImmed> grammar rule, or taken from a vector operand according to 991 the <texOffsetVar> grammar rule. The three components of the offset 992 vector are added to the computed <u>, <v>, and <w> texel locations prior 993 to sampling. When using a constant offset, one, two, or three components 994 may be specified in the instruction; if fewer than three are specified, 995 the remaining offset components are zero. If no offsets are specified, 996 all three components of the offset are treated as zero. A limited range 997 of offset values are supported; the minimum and maximum <texOffset> values 998 are implementation-dependent and given by MIN_PROGRAM_TEXEL_OFFSET_EXT and 999 MAX_PROGRAM_TEXEL_OFFSET_EXT, respectively. 
A program will fail to load: 1000 1001 * if the texture target specified in the instruction is 1D, ARRAY1D, 1002 SHADOW1D, or SHADOWARRAY1D, and the second or third component of a 1003 constant offset vector is non-zero; 1004 1005 * if the texture target specified in the instruction is 2D, RECT, 1006 ARRAY2D, SHADOW2D, SHADOWRECT, or SHADOWARRAY2D, and the third 1007 component of a constant offset vector is non-zero; 1008 1009 * if the texture target is CUBE, SHADOWCUBE, ARRAYCUBE, or 1010 SHADOWARRAYCUBE, and any component of a constant offset vector is 1011 non-zero -- texel offsets are not supported for cube map or buffer 1012 textures; 1013 1014 * if any component of the constant offset vector of a TXGO instruction 1015 is non-zero -- non-constant offsets are provided in separate operands; 1016 1017 * if any component of a constant offset vector is less than 1018 MIN_PROGRAM_TEXEL_OFFSET_EXT or greater than 1019 MAX_PROGRAM_TEXEL_OFFSET_EXT; 1020 1021 * if a TXD or TXGO instruction specifies a non-constant texel offset 1022 according to the <texOffsetVar> grammar rule; or 1023 1024 * if any instruction specifies a non-constant texel offset according 1025 to the <texOffsetVar> grammar rule and the texture target is CUBE, 1026 SHADOWCUBE, ARRAYCUBE, or SHADOWARRAYCUBE. 1027 1028 The implementation-dependent minimum and maximum texel offset values apply 1029 to texel offsets are taken from a vector operand, but out-of-bounds or 1030 invalid component values will not prevent program loading since the 1031 offsets may not be computed until the program is executed. Components of 1032 the vector operand not needed for the texture target are ignored. The W 1033 component of the offset vector is always ignored; the Z component of the 1034 offset vector is ignored unless the target is 3D; the Y component is 1035 ignored if the target is 1D, ARRAY1D, SHADOW1D, or SHADOWARRAY1D. If the 1036 value of any non-ignored component of the vector operand is outside 1037 implementation-dependent limits, the results of the texture lookup are 1038 undefined. For all instructions except TXGO, the limits are 1039 MIN_PROGRAM_TEXEL_OFFSET_EXT and MAX_PROGRAM_TEXEL_OFFSET_EXT. For the 1040 TXGO instruction, the limits are MIN_PROGRAM_TEXTURE_GATHER_OFFSET_NV and 1041 MAX_PROGRAM_TEXTURE_GATHER_OFFSET_NV. 1042 1043 1044 (Modify language describing how the check for using multiple targets on a 1045 single texture image unit works, to account for texture array variables 1046 where a single instruction may access one of multiple textures and the 1047 texture used is not known when the program is loaded.) 1048 1049 A program will fail to load if it attempts to sample from multiple texture 1050 targets (including the SHADOW pseudo-targets) on the same texture image 1051 unit. For example, a program containing any two the following 1052 instructions will fail to load: 1053 1054 TEX out, coord, texture[0], 1D; 1055 TEX out, coord, texture[0], 2D; 1056 TEX out, coord, texture[0], ARRAY2D; 1057 TEX out, coord, texture[0], SHADOW2D; 1058 TEX out, coord, texture[0], 3D; 1059 1060 For the purposes of this test, sampling using a texture variable declared 1061 as an array is treated as though all texture image units bound to the 1062 variable were accessed. 
A program containing the following 1063 instructions would fail to load: 1064 1065 TEXTURE textures[] = { texture[0..3] }; 1066 TEX out, coord, textures[2], 2D; # acts as if all textures are used 1067 TEX out, coord, texture[1], 3D; 1068 1069 (Add language describing texture gather component selection) 1070 1071 The TXG and TXGO instructions provide the ability to assemble a 1072 four-component vector by taking the value of a single component of a 1073 multi-component texture from each of four texels. The component selected 1074 is identified by the <texImageUnitComp> grammar rule. Component selection 1075 is not supported for any other instruction, and a program will fail to 1076 load if <texImageUnitComp> is matched for any texture instruction other 1077 than TXG or TXGO. 1078 1079 1080 Add New Section 2.X.4.5, Program Memory Access 1081 1082 Programs may load from or store to buffer object memory via the ATOM 1083 (atomic global memory operation), LDC (load constant), LOAD (global load), 1084 and STORE (global store) instructions. 1085 1086 Load instructions read 8, 16, 32, 64, 128, or 256 bits of data from a 1087 source address to produce a four-component vector, according to the 1088 storage modifier specified with the instruction. The storage modifier has 1089 three parts: 1090 1091 - a base data type, "F", "S", or "U", specifying that the instruction 1092 fetches floating-point, signed integer, or unsigned integer values, 1093 respectively; 1094 1095 - a component size, specifying that the components fetched by the 1096 instruction have 8, 16, 32, or 64 bits; and 1097 1098 - an optional component count, where "X2" and "X4" indicate that two or 1099 four components be fetched, and no count indicates a single component 1100 fetch. 1101 1102 When the storage modifier specifies that fewer than four components should 1103 be fetched, remaining components are filled with zeroes. When performing 1104 an atomic memory operation (ATOM) or a global load (LOAD), the GPU address 1105 is specified as an instruction operand. When performing a constant buffer 1106 load (LDC), the GPU address is derived by adding the base address of the 1107 bound buffer object to an offset specified as an instruction operand. 
1108 Given a GPU address <address> and a storage modifier <modifier>, the 1109 memory load can be described by the following code: 1110 1111 result_t_vec BufferMemoryLoad(char *address, OpModifier modifier) 1112 { 1113 result_t_vec result = { 0, 0, 0, 0 }; 1114 switch (modifier) { 1115 case F32: 1116 result.x = ((float32_t *)address)[0]; 1117 break; 1118 case F32X2: 1119 result.x = ((float32_t *)address)[0]; 1120 result.y = ((float32_t *)address)[1]; 1121 break; 1122 case F32X4: 1123 result.x = ((float32_t *)address)[0]; 1124 result.y = ((float32_t *)address)[1]; 1125 result.z = ((float32_t *)address)[2]; 1126 result.w = ((float32_t *)address)[3]; 1127 break; 1128 case F64: 1129 result.x = ((float64_t *)address)[0]; 1130 break; 1131 case F64X2: 1132 result.x = ((float64_t *)address)[0]; 1133 result.y = ((float64_t *)address)[1]; 1134 break; 1135 case F64X4: 1136 result.x = ((float64_t *)address)[0]; 1137 result.y = ((float64_t *)address)[1]; 1138 result.z = ((float64_t *)address)[2]; 1139 result.w = ((float64_t *)address)[3]; 1140 break; 1141 case S8: 1142 result.x = ((int8_t *)address)[0]; 1143 break; 1144 case S16: 1145 result.x = ((int16_t *)address)[0]; 1146 break; 1147 case S32: 1148 result.x = ((int32_t *)address)[0]; 1149 break; 1150 case S32X2: 1151 result.x = ((int32_t *)address)[0]; 1152 result.y = ((int32_t *)address)[1]; 1153 break; 1154 case S32X4: 1155 result.x = ((int32_t *)address)[0]; 1156 result.y = ((int32_t *)address)[1]; 1157 result.z = ((int32_t *)address)[2]; 1158 result.w = ((int32_t *)address)[3]; 1159 break; 1160 case S64: 1161 result.x = ((int64_t *)address)[0]; 1162 break; 1163 case S64X2: 1164 result.x = ((int64_t *)address)[0]; 1165 result.y = ((int64_t *)address)[1]; 1166 break; 1167 case S64X4: 1168 result.x = ((int64_t *)address)[0]; 1169 result.y = ((int64_t *)address)[1]; 1170 result.z = ((int64_t *)address)[2]; 1171 result.w = ((int64_t *)address)[3]; 1172 break; 1173 case U8: 1174 result.x = ((uint8_t *)address)[0]; 1175 break; 1176 case U16: 1177 result.x = ((uint16_t *)address)[0]; 1178 break; 1179 case U32: 1180 result.x = ((uint32_t *)address)[0]; 1181 break; 1182 case U32X2: 1183 result.x = ((uint32_t *)address)[0]; 1184 result.y = ((uint32_t *)address)[1]; 1185 break; 1186 case U32X4: 1187 result.x = ((uint32_t *)address)[0]; 1188 result.y = ((uint32_t *)address)[1]; 1189 result.z = ((uint32_t *)address)[2]; 1190 result.w = ((uint32_t *)address)[3]; 1191 break; 1192 case U64: 1193 result.x = ((uint64_t *)address)[0]; 1194 break; 1195 case U64X2: 1196 result.x = ((uint64_t *)address)[0]; 1197 result.y = ((uint64_t *)address)[1]; 1198 break; 1199 case U64X4: 1200 result.x = ((uint64_t *)address)[0]; 1201 result.y = ((uint64_t *)address)[1]; 1202 result.z = ((uint64_t *)address)[2]; 1203 result.w = ((uint64_t *)address)[3]; 1204 break; 1205 } 1206 return result; 1207 } 1208 1209 Store instructions write the contents of a four-component vector operand 1210 into 8, 16, 32, 64, 128, or 256 bits, according to the storage modifier 1211 specified with the instruction. The storage modifiers supported by stores 1212 are identical to those supported for loads. 
Given a GPU address 1213 <address>, a vector operand <operand> containing the data to be stored, 1214 and a storage modifier <modifier>, the memory store can be described by 1215 the following code: 1216 1217 void BufferMemoryStore(char *address, operand_t_vec operand, 1218 OpModifier modifier) 1219 { 1220 switch (modifier) { 1221 case F32: 1222 ((float32_t *)address)[0] = operand.x; 1223 break; 1224 case F32X2: 1225 ((float32_t *)address)[0] = operand.x; 1226 ((float32_t *)address)[1] = operand.y; 1227 break; 1228 case F32X4: 1229 ((float32_t *)address)[0] = operand.x; 1230 ((float32_t *)address)[1] = operand.y; 1231 ((float32_t *)address)[2] = operand.z; 1232 ((float32_t *)address)[3] = operand.w; 1233 break; 1234 case F64: 1235 ((float64_t *)address)[0] = operand.x; 1236 break; 1237 case F64X2: 1238 ((float64_t *)address)[0] = operand.x; 1239 ((float64_t *)address)[1] = operand.y; 1240 break; 1241 case F64X4: 1242 ((float64_t *)address)[0] = operand.x; 1243 ((float64_t *)address)[1] = operand.y; 1244 ((float64_t *)address)[2] = operand.z; 1245 ((float64_t *)address)[3] = operand.w; 1246 break; 1247 case S8: 1248 ((int8_t *)address)[0] = operand.x; 1249 break; 1250 case S16: 1251 ((int16_t *)address)[0] = operand.x; 1252 break; 1253 case S32: 1254 ((int32_t *)address)[0] = operand.x; 1255 break; 1256 case S32X2: 1257 ((int32_t *)address)[0] = operand.x; 1258 ((int32_t *)address)[1] = operand.y; 1259 break; 1260 case S32X4: 1261 ((int32_t *)address)[0] = operand.x; 1262 ((int32_t *)address)[1] = operand.y; 1263 ((int32_t *)address)[2] = operand.z; 1264 ((int32_t *)address)[3] = operand.w; 1265 break; 1266 case S64: 1267 ((int64_t *)address)[0] = operand.x; 1268 break; 1269 case S64X2: 1270 ((int64_t *)address)[0] = operand.x; 1271 ((int64_t *)address)[1] = operand.y; 1272 break; 1273 case S64X4: 1274 ((int64_t *)address)[0] = operand.x; 1275 ((int64_t *)address)[1] = operand.y; 1276 ((int64_t *)address)[2] = operand.z; 1277 ((int64_t *)address)[3] = operand.w; 1278 break; 1279 case U8: 1280 ((uint8_t *)address)[0] = operand.x; 1281 break; 1282 case U16: 1283 ((uint16_t *)address)[0] = operand.x; 1284 break; 1285 case U32: 1286 ((uint32_t *)address)[0] = operand.x; 1287 break; 1288 case U32X2: 1289 ((uint32_t *)address)[0] = operand.x; 1290 ((uint32_t *)address)[1] = operand.y; 1291 break; 1292 case U32X4: 1293 ((uint32_t *)address)[0] = operand.x; 1294 ((uint32_t *)address)[1] = operand.y; 1295 ((uint32_t *)address)[2] = operand.z; 1296 ((uint32_t *)address)[3] = operand.w; 1297 break; 1298 case U64: 1299 ((uint64_t *)address)[0] = operand.x; 1300 break; 1301 case U64X2: 1302 ((uint64_t *)address)[0] = operand.x; 1303 ((uint64_t *)address)[1] = operand.y; 1304 break; 1305 case U64X4: 1306 ((uint64_t *)address)[0] = operand.x; 1307 ((uint64_t *)address)[1] = operand.y; 1308 ((uint64_t *)address)[2] = operand.z; 1309 ((uint64_t *)address)[3] = operand.w; 1310 break; 1311 } 1312 } 1313 1314 If a global load or store accesses a memory address that does not 1315 correspond to a buffer object made resident by MakeBufferResidentNV, the 1316 results of the operation are undefined and may produce a fault resulting 1317 in application termination. If a load accesses a buffer object made 1318 resident with an <access> parameter of WRITE_ONLY, or if a store accesses 1319 a buffer object made resident with an <access> parameter of READ_ONLY, the 1320 results of the operation are also undefined and may lead to application 1321 termination. 
1322 1323 The address used for global memory loads or stores or offset used for 1324 constant buffer loads must be aligned to the fetch size corresponding to 1325 the storage opcode modifier. For S8 and U8, the offset has no alignment 1326 requirements. For S16 and U16, the offset must be a multiple of two basic 1327 machine units. For F32, S32, and U32, the offset must be a multiple of 1328 four. For F32X2, F64, S32X2, S64, U32X2, and U64, the offset must be a 1329 multiple of eight. For F32X4, F64X2, S32X4, S64X2, U32X4, and U64X2, the 1330 offset must be a multiple of sixteen. For F64X4, S64X4, and U64X4, the 1331 offset must be a multiple of thirty-two. If an offset is not correctly 1332 aligned, the values returned by a buffer memory load will be undefined, 1333 and the effects of a buffer memory store will also be undefined. 1334 1335 Global and image memory accesses in assembly programs are weakly ordered 1336 and may require synchronization relative to other operations in the OpenGL 1337 pipeline. The ordering and synchronization mehcanisms described in 1338 Section 2.14.X (of the EXT_shader_image_load_store extension 1339 specification) for shaders using the OpenGL Shading Language apply equally 1340 to loads, stores, and atomics performed in assembly programs. 1341 1342 1343 Modify Section 2.X.6.Y of the NV_fragment_program4 specification 1344 1345 (add new option section) 1346 1347 + Early Per-Fragment Tests (NV_early_fragment_tests) 1348 1349 If a fragment program specifies the "NV_early_fragment_tests" option, the 1350 depth and stencil tests will be performed prior to fragment program 1351 invocation, as described in Section 3.X. 1352 1353 1354 Modify Section 2.X.7.Y of the NV_geometry_program4 specification 1355 1356 (Simply add the new input primitive type "PATCHES" to the list of tokens 1357 allowed by the "PRIMITIVE_IN" declaration.) 1358 1359 - Input Primitive Type (PRIMITIVE_IN) 1360 1361 The PRIMITIVE_IN statement declares the type of primitives seen by a 1362 geometry program. The single argument must be one of "POINTS", "LINES", 1363 "LINES_ADJACENCY", "TRIANGLES", "TRIANGLES_ADJACENCY", or "PATCHES". 1364 1365 1366 (Add a new optional program declaration to declare a geometry shader that 1367 is run <N> times per primitive.) 1368 1369 Geometry programs support three types of mandatory declaration statements, 1370 as described below. Each of the three must be included exactly once in 1371 the geometry program. 1372 1373 ... 1374 1375 Geometry programs also support one optional declaration statement. 1376 1377 - Program Invocation Count (INVOCATIONS) 1378 1379 The INVOCATIONS statement declares the number of times the geometry 1380 program is run on each primitive processed. The single argument must be a 1381 positive integer less than or equal to the value of the 1382 implementation-dependent limit MAX_GEOMETRY_PROGRAM_INVOCATIONS_NV. Each 1383 invocation of the geometry program will have the same inputs and outputs 1384 except for the built-in input variable "primitive.invocation". This 1385 variable will be an integer between 0 and <n>-1, where <n> is the declared 1386 number of invocations. If omitted, the program invocation count is one. 
    Section 2.X.8.Z, ATOM: Atomic Global Memory Operation

    The ATOM instruction performs an atomic global memory operation by reading
    from memory at the address specified by the second unsigned integer scalar
    operand, computing a new value based on the value read from memory and the
    first (vector) operand, and then writing the result back to the same
    memory address.  The memory transaction is atomic, guaranteeing that no
    other write to the memory accessed will occur between the time it is read
    and written by the ATOM instruction.  The result of the ATOM instruction
    is the scalar value read from memory.

    The ATOM instruction has two required instruction modifiers.  The atomic
    modifier specifies the type of operation to be performed.  The storage
    modifier specifies the size and data type of the operand read from memory
    and the base data type of the operation used to compute the value to be
    written to memory.

      atomic     storage
      modifier   modifiers           operation
      --------   ------------------  --------------------------------------
      ADD        U32, S32, U64       compute a sum
      MIN        U32, S32            compute minimum
      MAX        U32, S32            compute maximum
      IWRAP      U32                 increment memory, wrapping at operand
      DWRAP      U32                 decrement memory, wrapping at operand
      AND        U32, S32            compute bit-wise AND
      OR         U32, S32            compute bit-wise OR
      XOR        U32, S32            compute bit-wise XOR
      EXCH       U32, S32, U64       exchange memory with operand
      CSWAP      U32, S32, U64       compare-and-swap

      Table X.Y, Supported atomic and storage modifiers for the ATOM
      instruction.

    Not all storage modifiers are supported by ATOM, and the set of modifiers
    allowed for any given instruction depends on the atomic modifier
    specified.  Table X.Y enumerates the set of atomic modifiers supported by
    the ATOM instruction, and the storage modifiers allowed for each.

      tmp0 = VectorLoad(op0);
      address = ScalarLoad(op1);
      result = BufferMemoryLoad(address, storageModifier);
      switch (atomicModifier) {
      case ADD:
        writeval = tmp0.x + result;
        break;
      case MIN:
        writeval = min(tmp0.x, result);
        break;
      case MAX:
        writeval = max(tmp0.x, result);
        break;
      case IWRAP:
        writeval = (result >= tmp0.x) ? 0 : result+1;
        break;
      case DWRAP:
        writeval = (result == 0 || result > tmp0.x) ? tmp0.x : result-1;
        break;
      case AND:
        writeval = tmp0.x & result;
        break;
      case OR:
        writeval = tmp0.x | result;
        break;
      case XOR:
        writeval = tmp0.x ^ result;
        break;
      case EXCH:
        writeval = tmp0.x;
        break;
      case CSWAP:
        if (result == tmp0.x) {
          writeval = tmp0.y;
        } else {
          return result;  // no memory store
        }
        break;
      }
      BufferMemoryStore(address, writeval, storageModifier);

    ATOM performs a scalar atomic operation.  The <y>, <z>, and <w> components
    of the result vector are undefined.

    ATOM supports no base data type modifiers, but requires exactly one
    storage modifier.  The base data types of the result vector and the first
    (vector) operand are derived from the storage modifier.  The second
    operand is always interpreted as a scalar unsigned integer.


    Section 2.X.8.Z, BFE: Bitfield Extract

    The BFE instruction performs a component-wise bitfield extraction of the
    second vector operand to yield a result vector.
    For each component, the number of bits extracted is given by the x
    component of the first vector operand, and the bit number of the least
    significant bit extracted is given by the y component of the first vector
    operand.

      tmp0 = VectorLoad(op0);
      tmp1 = VectorLoad(op1);
      result.x = BitfieldExtract(tmp0.x, tmp0.y, tmp1.x);
      result.y = BitfieldExtract(tmp0.x, tmp0.y, tmp1.y);
      result.z = BitfieldExtract(tmp0.x, tmp0.y, tmp1.z);
      result.w = BitfieldExtract(tmp0.x, tmp0.y, tmp1.w);

    If the number of bits to extract is zero, zero is returned.  The results
    of bitfield extraction are undefined

      * if the number of bits to extract or the starting offset is negative,
      * if the sum of the number of bits to extract and the starting offset
        is greater than the total number of bits in the operand/result, or
      * if the starting offset is greater than or equal to the total number of
        bits in the operand/result.

      Type BitfieldExtract(Type bits, Type offset, Type value)
      {
        if (bits < 0 || offset < 0 || offset >= TotalBits(Type) ||
            bits + offset > TotalBits(Type)) {
          /* result undefined */
        } else if (bits == 0) {
          return 0;
        } else {
          return (value << (TotalBits(Type) - (bits+offset))) >>
                 (TotalBits(Type) - bits);
        }
      }

    BFE supports only signed and unsigned integer data type modifiers.  For
    signed integer data types, the extracted value is sign-extended (i.e.,
    filled with ones if the most significant bit extracted is one and filled
    with zeroes otherwise).  For unsigned integer data types, the extracted
    value is zero-extended.


    Section 2.X.8.Z, BFI: Bitfield Insert

    The BFI instruction performs a component-wise bitfield insertion of the
    second vector operand into the third vector operand to yield a result
    vector.  For each component, the <n> least significant bits are extracted
    from the corresponding component of the second vector operand, where <n>
    is given by the x component of the first vector operand.  Those bits are
    merged into the corresponding component of the third vector operand,
    replacing bits <b> through <b>+<n>-1, to produce the result.  The bit
    offset <b> is specified by the y component of the first operand.

      tmp0 = VectorLoad(op0);
      tmp1 = VectorLoad(op1);
      tmp2 = VectorLoad(op2);
      result.x = BitfieldInsert(tmp0.x, tmp0.y, tmp1.x, tmp2.x);
      result.y = BitfieldInsert(tmp0.x, tmp0.y, tmp1.y, tmp2.y);
      result.z = BitfieldInsert(tmp0.x, tmp0.y, tmp1.z, tmp2.z);
      result.w = BitfieldInsert(tmp0.x, tmp0.y, tmp1.w, tmp2.w);

    The results of bitfield insertion are undefined

      * if the number of bits to insert or the starting offset is negative,
      * if the sum of the number of bits to insert and the starting offset
        is greater than the total number of bits in the operand/result, or
      * if the starting offset is greater than or equal to the total number of
        bits in the operand/result.
1547 1548 Type BitfieldInsert(Type bits, Type offset, Type src, Type dst) 1549 { 1550 if (bits < 0 || offset < 0 || offset >= TotalBits(type) || 1551 bits + offset > TotalBits(Type)) { 1552 /* result undefined */ 1553 } else if (bits == TotalBits(Type)) { 1554 return src; 1555 } else { 1556 Type mask = ((1 << bits) - 1) << offset; 1557 return ((src << offset) & mask) | (dst & (~mask)); 1558 } 1559 } 1560 1561 BFI supports only signed and unsigned integer data type modifiers. If no 1562 type modifier is specified, the operand and result vectors are treated as 1563 signed integers. 1564 1565 1566 Section 2.X.8.Z, BFR: Bitfield Reverse 1567 1568 The BFR instruction performs a component-wise bit reversal of the single 1569 vector operand to produce a result vector. Bit reversal is performed by 1570 exchanging the most and least significant bits, the second-most and 1571 second-least significant bits, and so on. 1572 1573 tmp0 = VectorLoad(op0); 1574 result.x = BitReverse(tmp0.x); 1575 result.y = BitReverse(tmp0.y); 1576 result.z = BitReverse(tmp0.z); 1577 result.w = BitReverse(tmp0.w); 1578 1579 BFR supports only signed and unsigned integer data type modifiers. If no 1580 type modifier is specified, the operand and result vectors are treated as 1581 signed integers. 1582 1583 1584 Section 2.X.8.Z, BTC: Bit Count 1585 1586 The BTC instruction performs a component-wise bit count of the single 1587 source vector to yield a result vector. Each component of the result 1588 vector contains the number of one bits in the corresponding component of 1589 the source vector. 1590 1591 tmp0 = VectorLoad(op0); 1592 result.x = BitCount(tmp0.x); 1593 result.y = BitCount(tmp0.y); 1594 result.z = BitCount(tmp0.z); 1595 result.w = BitCount(tmp0.w); 1596 1597 BTC supports only signed and unsigned integer data type modifiers. If no 1598 type modifier is specified, both operands and the result are treated as 1599 signed integers. 1600 1601 1602 Section 2.X.8.Z, BTFL: Find Least Significant Bit 1603 1604 The BTFL instruction searches for the least significant bit of each 1605 component of the single source vector, yielding a result vector comprising 1606 the bit number of the located bit for each component. 1607 1608 tmp0 = VectorLoad(op0); 1609 result.x = FindLSB(tmp0.x); 1610 result.y = FindLSB(tmp0.y); 1611 result.z = FindLSB(tmp0.z); 1612 result.w = FindLSB(tmp0.w); 1613 1614 BTFL supports only signed and unsigned integer data type modifiers. For 1615 unsigned integer data types, the search will yield the bit number of the 1616 least significant one bit in each component, or the maximum integer (all 1617 bits are ones) if the source vector component is zero. For signed data 1618 types, the search will yield the bit number of the least significant one 1619 bit in each component, or -1 if the source vector component is zero. If 1620 no type modifier is specified, both operands and the result are treated as 1621 signed integers. 1622 1623 1624 Section 2.X.8.Z, BTFM: Find Most Significant Bit 1625 1626 The BTFM instruction searches for the most significant bit of each 1627 component of the single source vector, yielding a result vector comprising 1628 the bit number of the located bit for each component. 1629 1630 tmp0 = VectorLoad(op0); 1631 result.x = FindMSB(tmp0.x); 1632 result.y = FindMSB(tmp0.y); 1633 result.z = FindMSB(tmp0.z); 1634 result.w = FindMSB(tmp0.w); 1635 1636 BTFM supports only signed and unsigned integer data type modifiers. 
    For unsigned integer data types, the search will yield the bit number of
    the most significant one bit in each component, or the maximum integer
    (all bits are ones) if the source vector component is zero.  For signed
    data types, the search will yield the bit number of the most significant
    one bit if the source value is positive, the bit number of the most
    significant zero bit if the source value is negative, or -1 if the source
    value is zero.  If no type modifier is specified, both operands and the
    result are treated as signed integers.


    Section 2.X.8.Z, CVT: Data Type Conversion

    The CVT instruction converts each component of the single source vector
    from one specified data type to another to yield a result vector.

      tmp0 = VectorLoad(op0);
      result = DataTypeConvert(tmp0);

    The CVT instruction requires two storage modifiers.  The first specifies
    the data type of the result components; the second specifies the data type
    of the operand components.  The supported storage modifiers are F16, F32,
    F64, S8, S16, S32, S64, U8, U16, U32, and U64.  A storage modifier of
    "F16" indicates a source or destination that is treated as having a
    floating-point type, but whose sixteen least significant bits describe a
    16-bit floating-point value using the encoding provided in Section 2.1.2.

    If the component size of the source register doesn't match the size of the
    specified operand data type, the source register components are first
    interpreted as a value with the same base data type as the operand and
    converted to the operand data type.  The operand components are then
    converted to the result data type.  Finally, if the component size of the
    destination register doesn't match the specified result data type, the
    result components are converted to values of the same base data type with
    a size matching the result register's component size.

    Data type conversion is performed by first converting the source
    components to an infinite-precision value of the destination data type,
    and then converting to the result data type.  When converting between
    floating-point and integer values, integer values are never interpreted as
    being normalized to [0,1] or [-1,+1].  Converting the floating-point
    special values -INF, +INF, and NaN to integers will yield undefined
    results.

    When converting from a non-integral floating-point value to an integer,
    one of the two integers closest in value to the floating-point value is
    chosen according to the rounding instruction modifier.  If "CEIL" or "FLR"
    is specified, the larger or smaller value, respectively, is chosen.  If
    "TRUNC" is specified, the value nearest to zero is chosen.  If "ROUND" is
    specified, the integer nearer in value to the original floating-point
    value is chosen; if both are equally near, the even integer is chosen.
    "ROUND" is used if no rounding modifier is specified.

    When converting from the infinite-precision intermediate value to the
    destination data type:

      * Floating-point values not exactly representable in the destination
        data type are rounded to one of the two nearest values in the
        destination type according to the rounding modifier.  Note that the
        results of float-to-float conversion are not automatically rounded to
        integer values, even if a rounding modifier such as CEIL or FLR is
        specified.
      * Integer values are clamped to the closest value representable in the
        result data type if the "SAT" (saturation) modifier is specified.

      * Integer values drop the most significant bits if the "SAT" modifier is
        not specified.

    Negation and absolute value operators are not supported on the source
    operand; a program using such operators will fail to compile.

    CVT supports no data type modifiers; the type of the operand and result
    vectors is fully specified by the required storage modifiers.


    Section 2.X.8.Z, EMIT: Emit Vertex

    (Modify the description of the EMIT opcode to deal with the interaction
    with multiple vertex streams added by ARB_transform_feedback3.  For more
    information on vertex streams, see ARB_transform_feedback3.)

    The EMIT instruction emits a new vertex to be added to the current output
    primitive for vertex stream zero.  The attributes of the emitted vertex
    are given by the current values of the vertex result variables.  After the
    EMIT instruction completes, a new vertex is started and all result
    variables become undefined.


    Section 2.X.8.Z, EMITS: Emit Vertex to Stream

    (Add new geometry program opcode; the EMITS instruction is not supported
    for any other program types.  For more information on vertex streams, see
    ARB_transform_feedback3.)

    The EMITS instruction emits a new vertex to be added to the current output
    primitive for the vertex stream specified by the single signed integer
    scalar operand.  The attributes of the emitted vertex are given by the
    current values of the vertex result variables.  After the EMITS
    instruction completes, a new vertex is started and all result variables
    become undefined.

    If the specified stream is negative or greater than or equal to the
    implementation-dependent number of vertex streams
    (MAX_VERTEX_STREAMS_NV), the results of the instruction are undefined.


    Section 2.X.8.Z, IPAC: Interpolate at Centroid

    The IPAC instruction generates a result vector by evaluating the fragment
    attribute named by the single vector operand at the centroid location.
    The result vector would be identical to the value obtained by a MOV
    instruction if the attribute variable were declared using the CENTROID
    modifier.

    When interpolating an attribute variable with this instruction, the
    CENTROID and SAMPLE attribute variable modifiers are ignored.  The FLAT
    and NOPERSPECTIVE variable modifiers operate normally.

      tmp0 = Interpolate(op0, x_pixel + x_centroid, y_pixel + y_centroid);
      result = tmp0;

    IPAC supports only floating-point data type modifiers.  A program will
    fail to load if it contains an IPAC instruction whose single operand is
    not a fragment program attribute variable or matches the "fragment.facing"
    or "primitive.id" binding.


    Section 2.X.8.Z, IPAO: Interpolate with Offset

    The IPAO instruction generates a result vector by evaluating the fragment
    attribute named by the first vector operand at an offset from the pixel
    center given by the x and y components of the second vector operand.  The
    z and w components of the second vector operand are ignored.
    The (x,y) position used for interpolating the attribute variable is
    obtained by adding the (x,y) offsets in the second vector operand to the
    (x,y) position of the pixel center.

    The range of offsets supported by the IPAO instruction is
    implementation-dependent.  The position used to interpolate the attribute
    variable is undefined if the x or y component of the second operand is
    less than MIN_FRAGMENT_INTERPOLATION_OFFSET_NV or greater than
    MAX_FRAGMENT_INTERPOLATION_OFFSET_NV.  Additionally, the granularity of
    offsets may be limited.  The (x,y) value may be snapped to a fixed
    sub-pixel grid with the number of subpixel bits given by
    FRAGMENT_PROGRAM_INTERPOLATION_OFFSET_BITS_NV.

    When interpolating an attribute variable with this instruction, the
    CENTROID and SAMPLE attribute variable modifiers are ignored.  The FLAT
    and NOPERSPECTIVE variable modifiers operate normally.

      tmp1 = VectorLoad(op1);
      tmp0 = Interpolate(op0, x_pixel + tmp1.x, y_pixel + tmp1.y);
      result = tmp0;

    IPAO supports only floating-point data type modifiers.  A program will
    fail to load if it contains an IPAO instruction whose first operand is not
    a fragment program attribute variable or matches the "fragment.facing" or
    "primitive.id" binding.


    Section 2.X.8.Z, IPAS: Interpolate at Sample Location

    The IPAS instruction generates a result vector by evaluating the fragment
    attribute named by the first vector operand at the location of the pixel's
    sample whose sample number is given by the second integer scalar operand.
    If multisample buffers are not available (SAMPLE_BUFFERS is zero), the
    attribute will be evaluated at the pixel center.  If the sample number
    given by the second operand does not exist, the position used to
    interpolate the attribute is undefined.

    When interpolating an attribute variable with this instruction, the
    CENTROID and SAMPLE attribute variable modifiers are ignored.  The FLAT
    and NOPERSPECTIVE variable modifiers operate normally.

      sample = ScalarLoad(op1);
      tmp1 = SampleOffset(sample);
      tmp0 = Interpolate(op0, x_pixel + tmp1.x, y_pixel + tmp1.y);
      result = tmp0;

    IPAS supports only floating-point data type modifiers.  A program will
    fail to load if it contains an IPAS instruction whose first operand is not
    a fragment program attribute variable or matches the "fragment.facing" or
    "primitive.id" binding.


    Section 2.X.8.Z, LDC: Load from Constant Buffer

    The LDC instruction loads a vector operand from a buffer object to yield a
    result vector.  The operand used for the LDC instruction must correspond
    to a parameter buffer variable declared using the "CBUFFER" statement; a
    program will fail to load if any other type of operand is used in an LDC
    instruction.

      result = BufferMemoryLoad(&op0, storageModifier);

    A base operand vector is fetched from memory as described in Section
    2.X.4.5, with the GPU address derived from the binding corresponding to
    the operand.  A final operand vector is derived from the base operand
    vector by applying swizzle, negation, and absolute value operand modifiers
    as described in Section 2.X.4.2.

    The amount of memory in any given buffer object binding accessible by the
    LDC instruction may be limited.
    If any component fetched by the LDC instruction extends 4*<n> or more
    basic machine units from the beginning of the buffer object binding, where
    <n> is the implementation-dependent constant
    MAX_PROGRAM_PARAMETER_BUFFER_SIZE_NV, the value fetched for that component
    will be undefined.

    LDC supports no base data type modifiers, but requires exactly one storage
    modifier.  The base data types of the operand and result vectors are
    derived from the storage modifier.


    Section 2.X.8.Z, LOAD: Global Load

    The LOAD instruction generates a result vector by reading an address from
    the single unsigned integer scalar operand and fetching data from buffer
    object memory, as described in Section 2.X.4.5.

      address = ScalarLoad(op0);
      result = BufferMemoryLoad(address, storageModifier);

    LOAD supports no base data type modifiers, but requires exactly one
    storage modifier.  The base data type of the result vector is derived from
    the storage modifier.  The single scalar operand is always interpreted as
    an unsigned integer.


    Section 2.X.8.Z, MEMBAR: Memory Barrier

    The MEMBAR instruction synchronizes memory transactions to ensure that
    memory transactions resulting from any instruction executed by the thread
    prior to the MEMBAR instruction complete prior to any memory transactions
    issued after the instruction.

    MEMBAR has no operands and generates no result.


    Section 2.X.8.Z, PK64: Pack 64-Bit Component

    The PK64 instruction reads the four components of the single vector
    operand as 32-bit values, packs the bit representations of these into a
    pair of 64-bit values, and replicates those to produce a four-component
    result vector.  The "x" and "y" components of the operand are packed to
    produce the "x" and "z" components of the result vector; the "z" and "w"
    components of the operand are packed to produce the "y" and "w" components
    of the result vector.  The PK64 instruction can be reversed by the UP64
    instruction below.

    This instruction is intended to allow a program to reconstruct 64-bit
    integer or floating-point values generated by the application but passed
    to the GL as two 32-bit values taken from adjacent words in memory.  The
    ability to use this technique depends on how the 64-bit value is stored in
    memory.  For "little-endian" processors, the first 32-bit value holds the
    least significant 32 bits of the 64-bit value.  For "big-endian"
    processors, the first 32-bit value holds the most significant 32 bits of
    the 64-bit value.  This reconstruction assumes that the first 32-bit word
    comes from the x component of the operand and the second 32-bit word comes
    from the y component.  The method used to construct a 64-bit value from a
    pair of 32-bit values depends on the processor type.
      tmp = VectorLoad(op0);

      if (underlying system is little-endian) {
        result.x = RawBits(tmp.x) | (RawBits(tmp.y) << 32);
        result.y = RawBits(tmp.z) | (RawBits(tmp.w) << 32);
        result.z = RawBits(tmp.x) | (RawBits(tmp.y) << 32);
        result.w = RawBits(tmp.z) | (RawBits(tmp.w) << 32);
      } else {
        result.x = RawBits(tmp.y) | (RawBits(tmp.x) << 32);
        result.y = RawBits(tmp.w) | (RawBits(tmp.z) << 32);
        result.z = RawBits(tmp.y) | (RawBits(tmp.x) << 32);
        result.w = RawBits(tmp.w) | (RawBits(tmp.z) << 32);
      }

    PK64 supports integer and floating-point data type modifiers, which
    specify the base data type of the operand and result.  The single vector
    operand is always treated as having 32-bit components, and the result is
    treated as a vector with 64-bit components.  The encoding performed by
    PK64 can be reversed using the UP64 instruction.

    A program will fail to load if it contains a PK64 instruction that writes
    its results to a variable not declared as "LONG".


    Section 2.X.8.Z, STORE: Global Store

    The STORE instruction reads an address from the second unsigned integer
    scalar operand and writes the contents of the first vector operand to
    buffer object memory at that address, as described in Section 2.X.4.5.
    This instruction generates no result.

      tmp0 = VectorLoad(op0);
      address = ScalarLoad(op1);
      BufferMemoryStore(address, tmp0, storageModifier);

    STORE supports no base data type modifiers, but requires exactly one
    storage modifier.  The base data type of the vector components of the
    first operand is derived from the storage modifier.  The second operand is
    always interpreted as an unsigned integer scalar.


    Section 2.X.8.Z, TEX: Texture Sample

    (Modify the instruction pseudo-code to account for texel offsets no longer
    needing to be immediate arguments.)

      tmp = VectorLoad(op0);
      if (instruction has variable texel offset) {
        itmp = VectorLoad(op1);
      } else {
        itmp = instruction.texelOffset;
      }
      ddx = ComputePartialsX(tmp);
      ddy = ComputePartialsY(tmp);
      lambda = ComputeLOD(ddx, ddy);
      result = TextureSample(tmp, lambda, ddx, ddy, itmp);


    Section 2.X.8.Z, TGALL: Test for All Non-Zero in a Thread Group

    The TGALL instruction produces a result vector by reading a vector operand
    for each active thread in the current thread group and comparing each
    component to zero.  A result vector component contains a TRUE value
    (described below) if the value of the corresponding component in the
    operand vector is non-zero for all active threads, and a FALSE value
    otherwise.

    An implementation may choose to arrange program threads into thread
    groups, and execute an instruction simultaneously for each thread in the
    group.  If the TGALL instruction is contained inside conditional flow
    control blocks and not all threads in the group execute the instruction,
    the operand values for threads not executing the instruction have no
    bearing on the value returned.  The method used to arrange threads into
    groups is undefined.
      tmp = VectorLoad(op0);
      result = { TRUE, TRUE, TRUE, TRUE };
      for (all active threads) {
        if ([thread]tmp.x == 0) result.x = FALSE;
        if ([thread]tmp.y == 0) result.y = FALSE;
        if ([thread]tmp.z == 0) result.z = FALSE;
        if ([thread]tmp.w == 0) result.w = FALSE;
      }

    TGALL supports all data type modifiers.  For floating-point data types,
    the TRUE value is 1.0 and the FALSE value is 0.0.  For signed integer data
    types, the TRUE value is -1 and the FALSE value is 0.  For unsigned
    integer data types, the TRUE value is the maximum integer value (all bits
    are ones) and the FALSE value is zero.


    Section 2.X.8.Z, TGANY: Test for Any Non-Zero in a Thread Group

    The TGANY instruction produces a result vector by reading a vector operand
    for each active thread in the current thread group and comparing each
    component to zero.  A result vector component contains a TRUE value
    (described below) if the value of the corresponding component in the
    operand vector is non-zero for any active thread, and a FALSE value
    otherwise.

    An implementation may choose to arrange program threads into thread
    groups, and execute an instruction simultaneously for each thread in the
    group.  If the TGANY instruction is contained inside conditional flow
    control blocks and not all threads in the group execute the instruction,
    the operand values for threads not executing the instruction have no
    bearing on the value returned.  The method used to arrange threads into
    groups is undefined.

      tmp = VectorLoad(op0);
      result = { FALSE, FALSE, FALSE, FALSE };
      for (all active threads) {
        if ([thread]tmp.x != 0) result.x = TRUE;
        if ([thread]tmp.y != 0) result.y = TRUE;
        if ([thread]tmp.z != 0) result.z = TRUE;
        if ([thread]tmp.w != 0) result.w = TRUE;
      }

    TGANY supports all data type modifiers.  For floating-point data types,
    the TRUE value is 1.0 and the FALSE value is 0.0.  For signed integer data
    types, the TRUE value is -1 and the FALSE value is 0.  For unsigned
    integer data types, the TRUE value is the maximum integer value (all bits
    are ones) and the FALSE value is zero.


    Section 2.X.8.Z, TGEQ: Test for All Equal Values in a Thread Group

    The TGEQ instruction produces a result vector by reading a vector operand
    for each active thread in the current thread group and comparing each
    component to zero.  A result vector component contains a TRUE value
    (described below) if the value of the corresponding component in the
    operand vector is the same for all active threads, and a FALSE value
    otherwise.

    An implementation may choose to arrange program threads into thread
    groups, and execute an instruction simultaneously for each thread in the
    group.  If the TGEQ instruction is contained inside conditional flow
    control blocks and not all threads in the group execute the instruction,
    the operand values for threads not executing the instruction have no
    bearing on the value returned.  The method used to arrange threads into
    groups is undefined.
2037 2038 tmp = VectorLoad(op0); 2039 tgall = { TRUE, TRUE, TRUE, TRUE }; 2040 tgany = { FALSE, FALSE, FALSE, FALSE }; 2041 for (all active threads) { 2042 if ([thread]tmp.x == 0) tgall.x = FALSE; else tgany.x = TRUE; 2043 if ([thread]tmp.y == 0) tgall.y = FALSE; else tgany.y = TRUE; 2044 if ([thread]tmp.z == 0) tgall.z = FALSE; else tgany.z = TRUE; 2045 if ([thread]tmp.w == 0) tgall.w = FALSE; else tgany.w = TRUE; 2046 } 2047 result.x = (tgall.x == tgany.x) ? TRUE : FALSE; 2048 result.y = (tgall.y == tgany.y) ? TRUE : FALSE; 2049 result.z = (tgall.z == tgany.z) ? TRUE : FALSE; 2050 result.w = (tgall.w == tgany.w) ? TRUE : FALSE; 2051 2052 TGEQ supports all data type modifiers. For floating-point data types, the 2053 TRUE value is 1.0 and the FALSE value is 0.0. For signed integer data 2054 types, the TRUE value is -1 and the FALSE value is 0. For unsigned 2055 integer data types, the TRUE value is the maximum integer value (all bits 2056 are ones) and the FALSE value is zero. 2057 2058 2059 Section 2.X.8.Z, TXB: Texture Sample with Bias 2060 2061 (Modify the instruction pseudo-code to account for texel offsets no 2062 longer need to be immediate arguments.) 2063 2064 tmp = VectorLoad(op0); 2065 if (instruction has variable texel offset) { 2066 itmp = VectorLoad(op1); 2067 } else { 2068 itmp = instruction.texelOffset; 2069 } 2070 ddx = ComputePartialsX(tmp); 2071 ddy = ComputePartialsY(tmp); 2072 lambda = ComputeLOD(ddx, ddy); 2073 result = TextureSample(tmp, lambda + tmp.w, ddx, ddy, itmp); 2074 2075 Section 2.X.8.Z, TXG: Texture Gather 2076 2077 (Update the TXG opcode description from NV_gpu_program4_1 specification. 2078 This version adds two capabilities: any component of a multi-component 2079 texture can be selected by tacking on a component name to the texture 2080 variable passed to identify the texture unit, and depth compares are 2081 supported if a SHADOW target is specified.) 2082 2083 The TXG instruction takes the four components of a single floating-point 2084 vector operand as a texture coordinate, determines a set of four texels to 2085 sample from the base level of detail of the specified texture image, and 2086 returns one component from each texel in a four-component result vector. 2087 To determine the four texels to sample, the minification and magnification 2088 filters are ignored and the rules for LINEAR filter are applied to the 2089 base level of the texture image to determine the texels T_i0_j1, T_i1_j1, 2090 T_i1_j0, and T_i0_j0, as defined in equations 3.23 through 3.25. The 2091 texels are then converted to texture source colors (Rs,Gs,Bs,As) according 2092 to table 3.21, followed by application of the texture swizzle as described 2093 in section 3.8.13. A four-component vector is returned by taking one of 2094 the four components of the swizzled texture source colors from each of the 2095 four selected texels. The component is selected using the 2096 <texImageUnitComp> grammar rule, by adding a scalar suffix 2097 (".x", ".y", ".z", ".w") to the identified texture; if no scalar suffix 2098 is provided, the first component is selected. 2099 2100 TXG only operates on 2D, SHADOW2D, CUBE, SHADOWCUBE, ARRAY2D, 2101 SHADOWARRAY2D, ARRAYCUBE, SHADOWARRAYCUBE, RECT, and SHADOWRECT texture 2102 targets; a program will fail to compile if any other texture target is 2103 used. 2104 2105 When using a "SHADOW" texture target, component selection is ignored. 
2106 Instead, depth comparisons are performed on the depth values for each of 2107 the four selected texels, and 0/1 values are returned based on the results 2108 of the comparison. 2109 2110 As with other texture accesses, the results of a texture gather operation 2111 are undefined if the texture target in the instruction is incompatible 2112 with the selected texture's base internal format and depth compare mode. 2113 2114 tmp = VectorLoad(op0); 2115 ddx = (0,0,0); 2116 ddy = (0,0,0); 2117 lambda = 0; 2118 if (instruction has variable texel offset) { 2119 itmp = VectorLoad(op1); 2120 } else { 2121 itmp = instruction.texelOffset; 2122 } 2123 result.x = TextureSample_i0j1(tmp, lambda, ddx, ddy, itmp).<comp>; 2124 result.y = TextureSample_i1j1(tmp, lambda, ddx, ddy, itmp).<comp>; 2125 result.z = TextureSample_i1j0(tmp, lambda, ddx, ddy, itmp).<comp>; 2126 result.w = TextureSample_i0j0(tmp, lambda, ddx, ddy, itmp).<comp>; 2127 2128 In this pseudocode, "<comp>" refers to the texel component selected by the 2129 <texImageUnitComp> grammar rule, as described above. 2130 2131 TXG supports all three data type modifiers. The single operand is always 2132 treated as a floating-point vector; the results are interpreted according 2133 to the data type modifier. 2134 2135 2136 Section 2.X.8.Z, TXGO: Texture Gather with Per-Texel Offsets 2137 2138 Like the TXG instruction, the TXGO instruction takes the four components 2139 of its first floating-point vector operand as a texture coordinate, 2140 determines a set of four texels to sample from the base level of detail of 2141 the specified texture image, and returns one component from each texel in 2142 a four-component result vector. The second and third vector operands are 2143 taken as signed four-component integer vectors providing the x and y 2144 components of the offsets, respectively, used to determine the location of 2145 each of the four texels. To determine the four texels to sample, each of 2146 the four independent offsets is used in conjunction with the specified 2147 texture coordinate to select a texel. The minification and magnification 2148 filters are ignored and the rules for LINEAR filtering are used to select 2149 the texel T_i0_j0, as defined in equations 3.23 through 3.25, from the 2150 base level of the texture image. The texels are then converted to texture 2151 source colors (Rs,Gs,Bs,As) according to table 3.21, followed by 2152 application of the texture swizzle as described in section 3.8.13. A 2153 four-component vector is returned by taking one of the four components 2154 of the swizzled texture source colors from each of the four selected 2155 texels. The component is selected using the <texImageUnitComp> grammar 2156 rule, by adding a scalar suffix (".x", ".y", ".z", ".w") to the identified 2157 texture; if no scalar suffix is provided, the first component is selected. 2158 2159 TXGO only operates on 2D, SHADOW2D, ARRAY2D, SHADOWARRAY2D, RECT, and 2160 SHADOWRECT texture targets; a program will fail to compile if any other 2161 texture target is used. 2162 2163 When using a "SHADOW" texture target, component selection is ignored. 2164 Instead, depth comparisons are performed on the depth values for each of 2165 the four selected texels, and 0/1 values are returned based on the results 2166 of the comparison. 
    As with other texture accesses, the results of a texture gather operation
    are undefined if the texture target in the instruction is incompatible
    with the selected texture's base internal format and depth compare mode.

      tmp = VectorLoad(op0);
      itmp1 = VectorLoad(op1);
      itmp2 = VectorLoad(op2);
      ddx = (0,0,0);
      ddy = (0,0,0);
      lambda = 0;
      itmp = (itmp1.x, itmp2.x);
      result.x = TextureSample_i0j0(tmp, lambda, ddx, ddy, itmp).<comp>;
      itmp = (itmp1.y, itmp2.y);
      result.y = TextureSample_i0j0(tmp, lambda, ddx, ddy, itmp).<comp>;
      itmp = (itmp1.z, itmp2.z);
      result.z = TextureSample_i0j0(tmp, lambda, ddx, ddy, itmp).<comp>;
      itmp = (itmp1.w, itmp2.w);
      result.w = TextureSample_i0j0(tmp, lambda, ddx, ddy, itmp).<comp>;

    In this pseudocode, "<comp>" refers to the texel component selected by the
    <texImageUnitComp> grammar rule, as described above.

    If TEXTURE_WRAP_S or TEXTURE_WRAP_T are either CLAMP or MIRROR_CLAMP_EXT,
    the results of the TXGO instruction are undefined.

    Note:  The TXG instruction is equivalent to the TXGO instruction with X
    and Y offset vectors of (0,1,1,0) and (0,0,-1,-1), respectively.

    TXGO supports all three data type modifiers.  The first operand is always
    treated as a floating-point vector and the second and third operands are
    always treated as signed integer vectors; the results are interpreted
    according to the data type modifier.


    Section 2.X.8.Z, TXL: Texture Sample with LOD

    (Modify the instruction pseudo-code to account for texel offsets no longer
    needing to be immediate arguments.)

      tmp = VectorLoad(op0);
      if (instruction has variable texel offset) {
        itmp = VectorLoad(op1);
      } else {
        itmp = instruction.texelOffset;
      }
      ddx = (0,0,0);
      ddy = (0,0,0);
      result = TextureSample(tmp, tmp.w, ddx, ddy, itmp);


    Section 2.X.8.Z, TXP: Texture Sample with Projection

    (Modify the instruction pseudo-code to account for texel offsets no longer
    needing to be immediate arguments.)

      tmp0 = VectorLoad(op0);
      tmp0.x = tmp0.x / tmp0.w;
      tmp0.y = tmp0.y / tmp0.w;
      tmp0.z = tmp0.z / tmp0.w;
      if (instruction has variable texel offset) {
        itmp = VectorLoad(op1);
      } else {
        itmp = instruction.texelOffset;
      }
      ddx = ComputePartialsX(tmp0);
      ddy = ComputePartialsY(tmp0);
      lambda = ComputeLOD(ddx, ddy);
      result = TextureSample(tmp0, lambda, ddx, ddy, itmp);


    Section 2.X.8.Z, UP64: Unpack 64-bit Component

    The UP64 instruction produces a vector result with 32-bit components by
    unpacking the bits of the "x" and "y" components of a 64-bit vector
    operand.  The "x" component of the operand is unpacked to produce the "x"
    and "y" components of the result vector; the "y" component is unpacked to
    produce the "z" and "w" components of the result vector.

    This instruction is intended to allow a program to pass 64-bit integer or
    floating-point values to an application using two 32-bit values stored in
    adjacent words in memory, which will be read by the application as single
    64-bit values.  The ability to use this technique depends on how the
    64-bit value is stored in memory.  For "little-endian" processors, the
    first 32-bit value holds the least significant 32 bits of the 64-bit
    value.
    For "big-endian" processors, the first 32-bit value holds the most
    significant 32 bits of the 64-bit value.  This decomposition assumes that
    the first 32-bit word comes from the "x" component of the result and the
    second 32-bit word comes from the "y" component.  The method used to
    unpack a 64-bit value into a pair of 32-bit values depends on the
    processor type.

      tmp = VectorLoad(op0);
      if (underlying system is little-endian) {
        result.x = (RawBits(tmp.x) >> 0)  & 0xFFFFFFFF;
        result.y = (RawBits(tmp.x) >> 32) & 0xFFFFFFFF;
        result.z = (RawBits(tmp.y) >> 0)  & 0xFFFFFFFF;
        result.w = (RawBits(tmp.y) >> 32) & 0xFFFFFFFF;
      } else {
        result.x = (RawBits(tmp.x) >> 32) & 0xFFFFFFFF;
        result.y = (RawBits(tmp.x) >> 0)  & 0xFFFFFFFF;
        result.z = (RawBits(tmp.y) >> 32) & 0xFFFFFFFF;
        result.w = (RawBits(tmp.y) >> 0)  & 0xFFFFFFFF;
      }

    UP64 supports integer and floating-point data type modifiers, which
    specify the base data type of the operand and result.  The single operand
    vector always has 64-bit components.  The result is treated as a vector
    with 32-bit components.  The encoding performed by UP64 can be reversed
    using the PK64 instruction.

    A program will fail to load if it contains a UP64 instruction whose
    operand is a variable not declared as "LONG".


    Modify Section 2.14.6.1 of the NV_geometry_program4 specification,
    Geometry Program Input Primitives

    (add patches to the list of supported input primitive types)

    The supported input primitive types are: ...

    Patches (PATCHES)

    Geometry programs that operate on patches are valid only for the
    PATCHES_NV primitive type.  There are a variable number of vertices
    available for each program invocation, depending on the number of input
    vertices in the primitive itself.  For a patch with <n> vertices,
    "vertex[0]" refers to the first vertex of the patch, and "vertex[<n>-1]"
    refers to the last vertex.


    Modify Section 2.14.6.2 of the NV_geometry_program4 specification,
    Geometry Program Output Primitives

    (Add a new paragraph limiting the use of the EMITS opcode to geometry
    programs with a POINTS output primitive type at the end of the section.
    This limitation may be removed in future specifications.)

    Geometry programs may write to multiple vertex streams only if the
    specified output primitive type is POINTS.  A program will fail to load if
    it contains an EMITS instruction and the output primitive type specified
    by the PRIMITIVE_OUT declaration is not POINTS.

    Modify Section 2.14.6.4 of the NV_geometry_program4 specification,
    Geometry Program Output Limits

    (Modify the limitation on the total number of components emitted by a
    geometry program from NV_gpu_program4 to be per-invocation.  If that limit
    is 4096 and a program has 16 invocations, each of the 16 program
    invocations can emit up to 4096 total components.)

    There are two implementation-dependent limits on the total number of
    vertices that each invocation of a program can emit.  First, the vertex
    limit may not exceed the value of MAX_PROGRAM_OUTPUT_VERTICES_NV.
Second, 2322 product of the vertex limit and the number of result variable components 2323 written by the program (PROGRAM_RESULT_COMPONENTS_NV, as described in 2324 section 2.X.3.5 of NV_gpu_program4) may not exceed the value of 2325 MAX_PROGRAM_TOTAL_OUTPUT_COMPONENTS_NV. A geometry program will fail to 2326 load if its maximum vertex count or maximum total component count exceeds 2327 the implementation-dependent limit. The limits may be queried by calling 2328 GetProgramiv with a <target> of GEOMETRY_PROGRAM_NV. Note that the 2329 maximum number of vertices that a geometry program can emit may be much 2330 lower than MAX_PROGRAM_OUTPUT_VERTICES_NV if the program writes a large 2331 number of result variable components. If a geometry program has multiple 2332 invocations (via the "INVOCATIONS" declaration), the program will load 2333 successfully as long as no single invocation exceeds the total component 2334 count limit, even if the total output of all invocations combined exceeds 2335 the limit. 2336 2337 2338Additions to Chapter 3 of the OpenGL 3.0 Specification (Rasterization) 2339 2340 Modify Section 3.X, Early Per-Fragment Tests, as documented in the 2341 EXT_shader_image_load_store specification 2342 2343 (add new paragraph at the end of a section, describing how early fragment 2344 tests work when assembly fragment programs are active) 2345 2346 If an assembly fragment program is active, early depth tests are 2347 considered enabled if and only if the fragment program source included the 2348 NV_early_fragment_tests option. 2349 2350 2351 Add to Section 3.11.4.5 of ARB_fragment_program (Fragment Program): 2352 2353 Section 3.11.4.5.3, ARB_blend_func_extended Option 2354 2355 If a fragment program specifies the "ARB_blend_func_extended" option, dual 2356 source color outputs as described in ARB_blend_func_extended are made 2357 available through the use of the "result.color[n].primary" and 2358 "result.color[n].secondary" result bindings, corresponding to SRC_COLOR 2359 and SRC1_COLOR, respectively, for the fragment color output numbered <n>. 2360 2361 2362Additions to Chapter 4 of the OpenGL 3.0 Specification (Per-Fragment 2363Operations and the Frame Buffer) 2364 2365 Modify Section 4.4.3, Rendering When an Image of a Bound Texture Object 2366 is Also Attached to the Framebuffer, p. 288 2367 2368 (Replace the complicated set of conditions with the following) 2369 2370 Specifically, the values of rendered fragments are undefined if any 2371 shader stage fetches texels from a given mipmap level, cubemap face, and 2372 array layer of a texture if that same mipmap level, cubemap face, and 2373 array layer of the texture can be written to via fragment shader outputs, 2374 even if the reads and writes are not in the same Draw call. However, an 2375 application can insert MemoryBarrier(TEXTURE_FETCH_BARRIER_BIT_NV) between 2376 Draw calls that have such read/write hazards in order to guarantee that 2377 writes have completed and caches have been invalidated, as described in 2378 section 2.20.X. 2379 2380 2381Additions to Chapter 5 of the OpenGL 3.0 Specification (Special Functions) 2382 2383 None. 2384 2385Additions to Chapter 6 of the OpenGL 3.0 Specification (State and 2386State Requests) 2387 2388 None. 2389 2390Additions to Appendix A of the OpenGL 3.0 Specification (Invariance) 2391 2392 None. 2393 2394Additions to the AGL/GLX/WGL Specifications 2395 2396 None. 2397 2398GLX Protocol 2399 2400 None. 
2401 2402Errors 2403 2404 None, other than new conditions by which a program string would fail to 2405 load. 2406 2407New State 2408 2409 None. 2410 2411 2412New Implementation Dependent State 2413 2414 Minimum 2415 Get Value Type Get Command Value Description Sec. Attrib 2416 -------------------------------- ---- --------------- ------- --------------------- ------ ------ 2417 MAX_GEOMETRY_PROGRAM_ Z+ GetIntegerv 32 Maximum number of GP 2.X.6.Y - 2418 INVOCATIONS_NV invocations per prim. 2419 MIN_FRAGMENT_INTERPOLATION_ R GetFloatv -0.5 Max. negative offset 2.X.8.Z - 2420 OFFSET_NV for IPAO instruction. 2421 MAX_FRAGMENT_INTERPOLATION_ R GetFloatv +0.5 Max. positive offset 2.X.8.Z - 2422 OFFSET_NV for IPAO instruction. 2423 FRAGMENT_PROGRAM_INTERPOLATION_ Z+ GetIntegerv 4 Subpixel bit count 2.X.8.Z - 2424 OFFSET_BITS_NV for IPAO instruction 2425 2426 2427Dependencies on NV_gpu_program4, NV_vertex_program4, NV_geometry_program4, and 2428NV_fragment_program4 2429 2430 This extension is written against the NV_gpu_program4 family of 2431 extensions, and introduces new instruction set features and inputs/outputs 2432 described here. These features are available only if the extension is 2433 supported and the appropriate program header string is used ("!!NVvp5.0" 2434 for vertex programs, "!!NVgp5.0" for geometry programs, and "!!NVfp5.0" 2435 for fragment programs.) When loading a program with an older header (e.g., 2436 "!!NVvp4.0"), the instruction set features described in this extension are 2437 not available. The features in this extension build upon those documented 2438 in full in NV_gpu_program4. 2439 2440Dependencies on NV_tessellation_program5 2441 2442 This extension provides the basic assembly instruction set constructs for 2443 tessellation programs. If this extension is supported, tessellation 2444 control and evaluation programs are supported, as described in the 2445 NV_tessellation_program5 specification. There is no separate extension 2446 string for tessellation programs; such support is implied by this 2447 extension. 2448 2449Dependencies on ARB_transform_feedback3 2450 2451 The concept of multiple vertex streams emitted by a geometry shader is 2452 introduced by ARB_transform_feedback3, as is the description of how they 2453 operate and implementation-dependent limits on the number of streams. 2454 This extension simply provides a mechanism to emit a vertex to more than 2455 one stream. If ARB_transform_feedback3 is not supported, language 2456 describing the EMITS opcode and the restriction on PRIMITIVE_OUT when 2457 EMITS is used should be removed. 2458 2459Dependencies on NV_shader_buffer_load 2460 2461 The programmability functionality provided by NV_shader_buffer_load is 2462 also incorporated by this extension. Any assembly program using a program 2463 header corresponding to this or any subsequent extension (e.g., 2464 "!!NVfp5.0") may use the LOAD opcode without needing to declare "OPTION 2465 NV_shader_buffer_load". 2466 2467 NV_shader_buffer_load is required by this extension, which means that the 2468 API mechanisms documented there allowing applications to make a buffer 2469 resident and query its GPU address are available to any applications using 2470 this extension. 2471 2472 In addition to the basic functionality in NV_shader_buffer_load, this 2473 extension provides the ability to load 64-bit integers and floating-point 2474 values using the "S64", "S64X2", "S64X4", "U64", "U64X2", "U64X4", "F64", 2475 "F64X2", and "F64X4" opcode modifiers. 
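    As a non-normative illustration of the API side of this dependency, the
    sketch below uses the NV_shader_buffer_load entry points to make a buffer
    object resident and query the GPU address that a program could then
    consume with the LOAD opcode.  The buffer object name "buffer" is assumed
    to exist already; how the queried address is delivered to the program
    (for example, through program parameters, possibly as two 32-bit halves
    reassembled with PK64) is not shown.

      /* Illustrative only; "buffer" is an existing buffer object holding
         data the program will read with LOAD. */
      GLuint64EXT gpuAddress = 0;
      glBindBuffer(GL_ARRAY_BUFFER, buffer);
      glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);
      glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV,
                                  &gpuAddress);
      /* gpuAddress plus any required offset (suitably aligned for the
         storage modifier used) can now be passed to the program and used as
         the address operand of LOAD. */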
2476 2477Dependencies on NV_shader_buffer_store 2478 2479 This extension provides assembly programmability support for the 2480 NV_shader_buffer_store, which provides the API mechanisms allowing buffer 2481 object to be stored to. NV_shader_buffer_store does not have a separate 2482 extension string entry, and will always be supported if this extension is 2483 present. 2484 2485Dependencies on NV_parameter_buffer_object2 2486 2487 The programmability functionality provided by NV_parameter_buffer_object2 2488 is also incorporated by this extension. Any assembly program using a 2489 program header corresponding to this or any subsequent extension (e.g., 2490 "!!NVfp5.0") may use the LDC opcode without needing to declare "OPTION 2491 NV_parameter_buffer_object2". 2492 2493 In addition to the basic functionality in NV_parameter_buffer_object2, 2494 this extension provides the ability to load 64-bit integers and 2495 floating-point values using the "S64", "S64X2", "S64X4", "U64", "U64X2", 2496 "U64X4", "F64", "F64X2", and "F64X4" opcode modifiers. 2497 2498Dependencies on OpenGL 3.3, ARB_texture_swizzle, and EXT_texture_swizzle 2499 2500 If OpenGL 3.3, ARB_texture_swizzle, and EXT_texture_swizzle are not 2501 supported, remove the swizzling step from the definition of TXG and TXGO. 2502 2503Dependencies on ARB_blend_func_extended 2504 2505 If ARB_blend_func_extended is not supported, references to the dual source 2506 color output bindings (result.color.primary and result.color.secondary) 2507 should be removed. 2508 2509Dependencies on EXT_shader_image_load_store 2510 2511 EXT_shader_image_load_store provides OpenGL Shading Language mechanisms to 2512 load/store to buffer and texture image memory, including spec language 2513 describing memory access ordering and synchronization, a built-in function 2514 (MemoryBarrierEXT) controlling synchronization of memory operations, and 2515 spec language describing early fragment tests that can be enabled via GLSL 2516 fragment shader source. These sections of the EXT_shader_image_load_store 2517 specification apply equally to the assembly program memory accesses 2518 provided by this extension. If EXT_shader_image_load_store is not 2519 supported, the sections of that specification describing these features 2520 should be considered to be added to this extension. 2521 2522 EXT_shader_image_load_store additionally provides and documents assembly 2523 language support for image loads, stores, and atomics as described in the 2524 "Dependencies on NV_gpu_program5" section of EXT_shader_image_load_store. 2525 The features described there are automatically supported for all 2526 NV_gpu_program5 assembly programs without requiring any additional 2527 "OPTION" line. 2528 2529Dependencies on ARB_shader_subroutine 2530 2531 ARB_shader_subroutine provides and documents assembly language support for 2532 subroutines as described in the "Dependencies on NV_gpu_program5" section 2533 of ARB_shader_subroutine. The features described there are automatically 2534 supported for all NV_gpu_program5 assembly programs without requiring any 2535 additional "OPTION" line. 2536 2537 2538Issues 2539 2540 (1) Are there any restrictions or performance concerns involving the 2541 support for indexing textures or parameter buffers? 2542 2543 RESOLVED: There are no significant functional limitations. Textures 2544 and parameter buffers accessed with an index must be declared as arrays, 2545 so the assembler knows which textures might be accessed this way. 
    Additionally, accessing an array of textures or parameter buffers with an
      out-of-bounds index will yield undefined results.

      In particular, there is no limitation on the values used for indexing --
      they are not required to be true constants and are not required to have
      the same value for all vertices/fragments in a primitive.  However,
      using divergent texture or parameter buffer indices may raise
      performance concerns.  We expect that GPU implementations of this
      extension will run multiple program threads in parallel (SIMD).  If
      different threads in a thread group have different indices, it will be
      necessary to do lookups in more than one texture at once.  This is
      likely to result in some thread serialization.  We expect that indexed
      texture or parameter buffer access where all indices in a thread group
      match will perform identically to non-indexed accesses.

    (2) Which texture instructions support programmable texel offsets, and
        what offset limits apply?

      RESOLVED:  Most texture instructions (TEX, TXB, TXF, TXG, TXL, TXP)
      support both constant texel offsets as provided by NV_gpu_program4 and
      programmable texel offsets.  TXD supports only constant offsets.  TXGO
      does not support non-zero or programmable offsets in the texture portion
      of the instruction, but provides full support for programmable offsets
      via two of the three vector arguments in the regular instruction.

      For example,

        TEX result, coord, texture[0], 2D, (-1,-1);

      uses the NV_gpu_program4 mechanism to apply a constant texel offset of
      (-1,-1) to the texture coordinates.  With programmable offsets, the
      following code applies the same offset.

        TEMP offxy;
        MOV offxy, {-1, -1};
        TEX result, coord, texture[0], offset(offxy);

      Of course, the programmable form allows the offsets to be computed in
      the program and does not require constant values.

      For most texture instructions, the range of allowable offsets is
      [MIN_PROGRAM_TEXEL_OFFSET_EXT, MAX_PROGRAM_TEXEL_OFFSET_EXT] for both
      constant and programmable texel offsets.  Constant offsets can be
      checked when the program is loaded, and out-of-bounds offsets cause the
      program to fail to load.  Programmable offsets can not have a load-time
      range check; out-of-bounds offsets produce undefined results.

      Additionally, the new TXGO instruction has a separate (likely larger)
      allowable offset range, [MIN_PROGRAM_TEXTURE_GATHER_OFFSET_NV,
      MAX_PROGRAM_TEXTURE_GATHER_OFFSET_NV], that applies to the offset
      vectors passed in its second and third operands.

      In the initial implementation of this extension, the range limits are
      [-8,+7] for most instructions and [-32,+31] for TXGO.

    (3) What is TXGO (texture gather with separate offsets) good for?

      RESOLVED:  TXGO allows for efficiently sampling a single-component
      texture with a variety of offsets that need not be contiguous.

      For example, a shadow mapping algorithm using a high-resolution shadow
      map may have pixels whose footprint covers a large number of texels in
      the shadow map.  Such pixels could do a single lookup into a
      lower-resolution texture (using mipmapping), but quality problems will
      arise.  Alternately, a shader could perform a large number of texture
      lookups using either NEAREST or LINEAR filtering from the
      high-resolution texture.
      NEAREST filtering will require a separate lookup for each texel
      accessed; LINEAR filtering may require somewhat fewer lookups, but all
      accesses cover a 2x2 portion of the texture.  The TXG instruction
      added to NV_gpu_program4_1 allows a 2x2 block of texels to be returned
      in a single instruction, in case the program wants to do something
      other than linear filtering with the samples.  The TXGO instruction
      allows a program to do semi-random sampling of the texture without
      requiring that each sample cover a 2x2 block of texels.  For example,
      the TXGO instruction would allow a program to sample the four texels
      A, H, J, and O from the 4x4 block depicted below:

        TXGO result, coord, {-1,+2,0,+1}, {-1,0,+1,+2}, texture[0], 2D;

      The "equivalent" TXG instruction would only sample the four center
      texels F, G, J, and K:

        TXG result, coord, texture[0], 2D;

      All sixteen texels of the footprint could be sampled with four TXG
      instructions,

        TXG result0, coord, texture[0], 2D, (-1,-1);
        TXG result1, coord, texture[0], 2D, (-1,+1);
        TXG result2, coord, texture[0], 2D, (+1,-1);
        TXG result3, coord, texture[0], 2D, (+1,+1);

      but accessing a smaller number of samples spread across the footprint
      with fewer instructions may produce results that are good enough.

      The figure here depicts a texture with texel (0,0) shown in the
      upper-left corner.  If you insist on a lower-left origin, please look
      at this figure while standing on your head.

        (0,0) +-+-+-+-+
              |A|B|C|D|
              +-+-+-+-+
              |E|F|G|H|
              +-+-+-+-+
              |I|J|K|L|
              +-+-+-+-+
              |M|N|O|P|
              +-+-+-+-+ (4,4)

    (4) Why are the results of TXGO (texture gather with separate offsets)
        undefined if the wrap mode is CLAMP or MIRROR_CLAMP_EXT?

      RESOLVED:  The CLAMP and MIRROR_CLAMP_EXT wrap modes are fairly
      different from other wrap modes.  After adding any instruction
      offsets, the spec says to pre-clamp the (u,v) coordinates to
      [0,texture_size] before generating the footprint.  If such clamping
      occurs on one edge for a normal texture filtering operation, the
      footprint ends up being half border texels, half edge texels, and the
      clamping effectively forces the interpolation weights used for texture
      filtering to 50/50.

      We expect the TXG instruction to be used in cases where an application
      may want to do custom filtering, and is in control of its own
      filtering weights.  Coordinate clamping as above will affect the
      footprint used for filtering, but not the weights.  In the
      NV_gpu_program4_1 spec, we defined the TXG/CLAMP combination to simply
      return the "normal" footprint produced after the pre-clamp operation
      above.  Any adjustment of weights due to clamping is the
      responsibility of the application.  We don't expect this to be a
      common operation, because CLAMP_TO_EDGE or CLAMP_TO_BORDER are much
      more sensible wrap modes.

      The hardware implementing TXGO is anticipated to extract all four
      samples in a single pass.  However, the spec language is defined for
      simplicity to perform four separate "gather" operations with the four
      provided offsets, extract a single sample from each, and combine the
      four samples into a vector.  This would require four separate
      pre-clamp operations, which was deemed too costly to implement in
      hardware for a wrap mode that doesn't work well with texture gather
      operations.
      Even if such hardware were built, it still wouldn't obtain a footprint
      resembling the half-border, half-edge footprint for simple TXGO
      offsets -- that would require different per-texel clamping rules for
      the four samples.  We chose to leave the results of this operation
      undefined.

    (5) Should double-precision floating-point support be required or
        optional?  If optional, how?

      RESOLVED:  Double-precision floating-point support will be optional,
      in case low-end GPUs supporting the remainder of this extension's
      instruction set features choose to cut costs by removing the silicon
      necessary to implement 64-bit floating-point arithmetic.

    (6) While this extension supports double-precision computation, how can
        you provide high-precision inputs and outputs to the GPU programs?

      RESOLVED:  The underlying hardware implementing this extension does
      not provide full support for 64-bit floats, even though DOUBLE is a
      standard data type provided by the GL.  For example, when specifying a
      vertex array with a data type of DOUBLE, the vertex attribute
      components will end up being converted to 32-bit floats (FLOAT) by the
      driver before being passed to the hardware, and the extra precision in
      the original 64-bit float values will be lost.

      For vertex attributes, the EXT_vertex_attrib_64bit and
      NV_vertex_attrib_integer_64bit extensions provide the ability to
      specify 64-bit vertex attribute components using the VertexAttribL*
      and VertexAttribLPointer APIs.  Such attributes can be read in a
      vertex program using a "LONG ATTRIB" declaration:

        LONG ATTRIB vector64;

      The LONG modifier can only be used for vertex program inputs; it can
      not be used for inputs of any other program type or for outputs of any
      program type.

      For other cases, this extension provides the PK64 and UP64
      instructions, which allow 64-bit components to be passed using
      consecutive 32-bit components.  For example, a 3-component vector with
      64-bit components can be passed to a vertex program using multiple
      vertex attributes, without using the VertexAttribL APIs, with the
      following code:

        /* Pass the X/Y components in vertex attribute 0 (X/Y/Z/W).  Use
           stride to skip over Z. */
        glVertexAttribPointer(0, 4, GL_FLOAT, GL_FALSE, 3*sizeof(GLdouble),
                              (GLdouble *) buffer);

        /* Pass the Z components in vertex attribute 1 (X/Y).  Use stride to
           skip over the original X/Y components. */
        glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, 3*sizeof(GLdouble),
                              (GLdouble *) buffer + 2);

      In this example, the vertex program would use the PK64 instruction to
      reconstruct the 64-bit value for each component as follows:

        LONG TEMP reconstructed;
        PK64 reconstructed.xy, vertex.attrib[0];
        PK64 reconstructed.z, vertex.attrib[1];

      A similar technique can be used to pass back 64-bit values computed by
      a GPU program, using transform feedback or writes to a color buffer.
      The UP64 instruction would be used to convert the 64-bit computed
      value into two 32-bit values, which would be written to adjacent
      components.

      Note also that the original hardware implementation of this extension
      does not support interpolation of 64-bit floating-point values.
      If an application desires to pass a 64-bit floating-point value from a
      vertex or geometry program to a fragment program, and doesn't require
      interpolation, the PK64/UP64 techniques can be combined.  For example,
      the vertex program could unpack a 3-component vector with 64-bit
      components into a four-component and a two-component 32-bit vector:

        LONG TEMP result64;
        RESULT result32[2] = { result.attrib[0..1] };
        UP64 result32[0], result64.xyxy;
        UP64 result32[1].xy, result64.z;

      The fragment program would read and reconstruct using PK64:

        LONG TEMP input64;
        FLAT ATTRIB input32[2] = { fragment.attrib[0..1] };
        PK64 input64.xy, input32[0];
        PK64 input64.z, input32[1];

      Note that such inputs must be declared as "FLAT" in the fragment
      program to prevent the hardware from trying to do floating-point
      interpolation on the separate 32-bit halves of the value being passed.
      Such interpolation would produce complete garbage.

    (7) What are instanced geometry programs useful for?

      RESOLVED:  Instanced geometry programs allow geometry programs that
      perform regular operations to run more efficiently.

      Consider a simple example of an algorithm that uses geometry programs
      to render primitives to a cube map in a single pass.  Without
      instanced geometry programs, the geometry program to render triangles
      to the cube map would do something like:

        for (face = 0; face < 6; face++) {
          for (vertex = 0; vertex < 3; vertex++) {
            project vertex <vertex> onto face <face>, output position
            compute/copy attributes of emitted <vertex> to outputs
            output <face> to result.layer
            emit the projected vertex
          }
          end the primitive (next triangle)
        }

      This algorithm would output 18 vertices per input triangle, three for
      each cube face.  The six triangles emitted would be rasterized, one
      per face.  Geometry programs that emit a large number of attributes
      have often posed performance challenges, since all the attributes must
      be stored somewhere until the emitted primitives are processed.  Large
      storage requirements may limit the number of threads that can be run
      in parallel and reduce overall performance.

      Instanced geometry programs allow this example to be restructured to
      run with six separate threads, one per face.  Each thread projects the
      triangle onto only a single face (identified by the invocation number)
      and emits only 3 vertices.  The reduced storage requirements allow
      more geometry program threads to be run in parallel, with greater
      overall efficiency.

      Additionally, the total number of attribute components that can be
      emitted by a single geometry program invocation is limited.  However,
      for instanced geometry programs, that limit applies to each of the <N>
      program invocations, which allows for a larger total output.  For
      example, if the GL implementation supports only 1024 components of
      output per program invocation, the 18-vertex algorithm above could
      emit no more than 56 components per vertex.  The same algorithm
      implemented as a 3-vertex, 6-invocation geometry program could
      theoretically allow for 341 components per vertex.
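
      As a rough sketch of how the cube map example might be restructured as
      an instanced geometry program, the outline below assumes an
      "INVOCATIONS" declaration and a "primitive.invocation" binding for the
      invocation number; both names should be checked against the main body
      of this specification and NV_geometry_program4, and the per-vertex
      projection code is elided.

        !!NVgp5.0
        # Sketch only:  the INVOCATIONS declaration and the
        # primitive.invocation binding are assumptions; see the main body
        # for the authoritative declarations and bindings.
        PRIMITIVE_IN TRIANGLES;
        PRIMITIVE_OUT TRIANGLE_STRIP;
        VERTICES_OUT 3;
        INVOCATIONS 6;                 # six instances per input triangle
        INT TEMP face;
        MOV.S face.x, primitive.invocation.x;
        # for each of the three input vertices:
        #   project the vertex onto face <face.x>, write result.position
        #   MOV result.layer.x, face.x;
        #   EMIT;
        # ENDPRIM;
        END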

    (8) What are the special interpolation opcodes (IPAC, IPAO, IPAS) good
        for, and how do they work?

      RESOLVED:  The interpolation opcodes allow programs to control the
      frequency and location at which fragment inputs are sampled.  Some
      control over interpolation was provided in previous extensions, but
      that support was more limited.  NV_gpu_program4 had an interpolation
      modifier (CENTROID) that allowed attributes to be sampled inside the
      primitive, but that was a per-attribute modifier -- you could only
      sample any given attribute at one location.  NV_gpu_program4_1 added a
      new interpolation modifier (SAMPLE) that directed that fragment
      programs be run once per sample, and that the specified attributes be
      interpolated at the sample location.  Per-sample interpolation can
      produce higher quality, but the performance cost is significant since
      more fragment program invocations are required.

      This extension provides additional control over interpolation, and
      allows programs to interpolate attributes at different locations
      without necessarily requiring the performance hit of per-sample
      invocation.

      The IPAC instruction allows an attribute to be sampled at the centroid
      location, while still allowing the same attribute to be sampled
      elsewhere.  The IPAS instruction allows the attribute to be sampled at
      a numbered sample location, as per-sample interpolation would do.
      Multiple IPAS instructions with different sample numbers allow a
      program to sample an attribute at multiple sample points in the pixel
      and then combine the samples in a programmable manner, which may allow
      for higher quality than simply interpolating at a single
      representative point in the pixel.  The IPAO instruction allows the
      attribute to be sampled at an arbitrary (x,y) offset relative to the
      pixel center.  The range of supported (x,y) values is limited, and the
      limits in the initial implementation are not large enough to permit
      sampling the attribute outside the pixel.

      Note that previous instruction sets allowed shaders to fake IPAC,
      IPAS, and IPAO with a sequence such as:

        TEMP ddx, ddy, offset, interp;
        MOV interp, fragment.attrib[0];     # start with center
        DDX ddx, fragment.attrib[0];
        MAD interp, offset.x, ddx, interp;  # add offset.x * dA/dx
        DDY ddy, fragment.attrib[0];
        MAD interp, offset.y, ddy, interp;  # add offset.y * dA/dy

      However, this method does not apply perspective correction.  The
      quality of the results may be unacceptable, particularly for
      primitives that are nearly perpendicular to the screen.

      The semantics of the first operand of these instructions are different
      from those of normal assembly instructions.  Operands are normally
      evaluated by loading the value of the corresponding variable and
      applying any swizzle/negation/absolute value modifier before the
      instruction is executed.  In the IPAC/IPAO/IPAS instructions, the
      value of the attribute is evaluated by the instruction itself.
      Swizzles, negation, and absolute value modifiers are still allowed,
      and are applied after the attribute values are interpolated.
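
      For illustration, a fragment program might combine these opcodes along
      the following lines.  The operand forms shown (an attribute operand,
      plus an offset vector for IPAO or a sample number for IPAS) and the
      register names are assumptions for this sketch; the authoritative
      operand descriptions are in the main body.

        # Sketch only; operand forms and names are assumptions.
        TEMP offset, atCentroid, atOffset, atSample3;
        MOV offset, { 0.25, -0.25, 0, 0 };
        IPAC atCentroid, fragment.attrib[0];        # at the centroid
        IPAO atOffset, fragment.attrib[0], offset;  # at center + offset.xy
        IPAS atSample3, fragment.attrib[0], 3;      # at sample number 3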

    (9) When using a program that issues global stores (via the STORE
        instruction), what amount of execution ordering is guaranteed?  How
        can an application ensure that writes executed in a shader have
        completed and will be visible to other operations using the buffer
        object in question?

      RESOLVED:  There are very few automatic guarantees for potential
      write/read or write/write conflicts.  Program invocations will
      generally run in arbitrary order, and applications can't rely on
      read/write order to match primitive order.

      To get consistent results when buffers are read and written using
      multiple pipeline stages, manual synchronization using the
      MemoryBarrierEXT() API documented in EXT_shader_image_load_store or
      some other synchronization primitive is necessary.

    (10) Unlike most other shader features, the STORE opcode allows for
         externally-visible side effects from executing a program.  How does
         this capability interact with other features of the GL?

      RESOLVED:  First, some GL implementations support a variety of "early
      Z" optimizations designed to minimize unnecessary fragment processing
      work, such as executing an expensive fragment program on a fragment
      that will eventually fail the depth test.  Such optimizations have
      been valid because fragment programs had no side effects.  That is no
      longer the case, and such optimizations may not be employed if the
      fragment program performs a global store.  However, we provide a new
      "early depth and stencil test" enable that allows applications to
      deterministically control depth and stencil testing.  If enabled,
      depth and stencil testing is always performed prior to fragment
      program execution.  Fragment programs will never be run on fragments
      that fail any of these tests.

      Second, we are permitting global stores in all program types; however,
      the number of program invocations is not well-defined for some program
      types.  For example, a GL implementation may choose to combine
      multiple instances of identical vertices (e.g., duplicate indices in
      DrawElements, immediate-mode vertices with identical data) into one
      single vertex program invocation, or it may run a vertex program on
      each separately.  Similarly, the tessellation primitive generator will
      generate independent primitives with duplicated vertices, which may or
      may not be combined for tessellation evaluation program execution.
      Fragment program execution also has several issues, described in more
      detail below.

    (11) What issues arise when running fragment programs doing global
         stores?

      RESOLVED:  The order of per-fragment operations in the existing OpenGL
      3.0 specification can be fairly loose, because previously-defined
      fragment programs, shaders, and fixed-function fragment processing had
      no side effects.  With side effects, the order of operations must be
      defined more tightly.  In particular, the pixel ownership and scissor
      tests are specified to be performed prior to fragment program
      execution, and we provide an option to perform depth and stencil tests
      early as well.

      OpenGL implementations sometimes run fragment programs on "helper"
      pixels that have no coverage in order to be able to compute sane
      partial derivatives for fragment program instructions (DDX, DDY) or
      automatic level-of-detail calculation for texturing.  In this
      approach, derivatives are approximated by computing the difference in
      a quantity computed for a given fragment at (x,y) and a fragment at a
      neighboring pixel.  When a fragment program is executed on a "helper"
      pixel, global stores have no effect.
      Helper pixels aren't explicitly mentioned in the spec body; instead,
      partial derivatives are obtained by magic.

      If a fragment program contains a KIL instruction, compilers may not
      reorder code such that an ATOM or STORE instruction is executed before
      a KIL instruction that logically precedes it in flow control.  Once a
      fragment is killed, subsequent atomics or stores should never be
      executed.

      Multisample rasterization poses several issues for fragment programs
      with global stores.  The number of times a fragment program is
      executed for multisample rendering is not fully specified, which gives
      implementations a number of different choices -- pure multisample
      (only runs once), pure supersample (runs once per covered sample), or
      modes in between.  There are some ways for an application to
      indirectly control the behavior -- for example, fragment programs
      specifying per-sample attribute interpolation are guaranteed to run
      once per covered sample.

      Note that when rendering to a multisample buffer, a pair of adjacent
      triangles may cause a fragment program to be executed more than once
      at a given (x,y) with different sets of samples covered.  This can
      also occur in the interior of a quadrilateral or polygon primitive.
      Implementations are permitted to split quads and polygons with >3
      vertices into triangles, creating interior edges that split a pixel.

    (12) What happens if early fragment tests are enabled, the early depth
         test passes, and a fragment program that computes a new depth value
         is executed?

      RESOLVED:  The depth value produced by the fragment program has no
      effect if early fragment tests are enabled.  The depth value computed
      by a fragment program is used only by the post-fragment-program
      stencil and depth tests, and those tests have no effect when early
      depth testing is enabled.

    (13) How do early fragment tests interact with occlusion queries?

      RESOLVED:  When early fragment tests are enabled, sample counting for
      occlusion queries also happens prior to fragment program execution.
      Enabling early fragment tests can change the overall sample count,
      because samples killed by alpha test and alpha-to-coverage will still
      be counted if early fragment tests are enabled.

    (14) What happens if a program performs a global store to a GPU address
         corresponding to a read-only buffer mapping?  What if it performs a
         global load from a write-only mapping?

      RESOLVED:  Implementations may choose to implement full memory
      protection, in which case accesses using the wrong type of memory
      mapping will fault and lead to termination of the application.

      However, full memory protection is not required in this extension --
      implementations may choose to substitute a read-write mapping in place
      of a read-only or write-only mapping.  As a result, we specify the
      result of such invalid loads and stores to be undefined.

      Note that if a program erroneously writes to nominally read-only
      mappings, the results may be weird.  If the implementation substitutes
      a read-write mapping, such invalid writes are likely to proceed
      normally.
      However, if the application later makes a buffer object non-resident
      and the memory manager of the GL implementation needs to move the
      buffer, the GL may assume that the contents of the buffer have not
      been modified and thus discard the new values written by the (invalid)
      global store instructions.

    (15) What performance considerations apply to atomics?

      RESOLVED:  Atomics can be useful for operations like locking, or for
      maintaining counters.  Note that high-performance GPUs may have
      hundreds of program threads in flight at once, and may also have some
      SIMD characteristics (where threads are grouped and run as a unit).
      Using ATOM instructions with a single memory address to implement a
      critical section will result in serial execution -- only one of the
      hundreds of threads can execute code in the critical section at a
      time.

      When a global operation would be done under a lock, it may be possible
      to improve performance if the algorithm can be parallelized to have
      multiple critical sections.  For example, an application could
      allocate an array of shared resources, each protected by its own lock,
      and use the LSBs of the primitive ID or some function of the
      screen-space (x,y) to determine which resource in the array to use.

    (16) The atomic instruction ATOM returns the old contents of memory into
         the result register.  Should we provide a version of this opcode
         that doesn't return a value?

      RESOLVED:  No.  In theory, atomics that don't return any values can
      perform better (because the program may not need to allocate resources
      to hold a result or wait for the result).  However, a new opcode isn't
      required to obtain this behavior -- a compiler can recognize that the
      result of an ATOM instruction is written to a "dummy" temporary that
      isn't read by subsequent instructions:

        TEMP junk;
        ATOM.ADD.U32 junk, address, 1;

      The compiler can also recognize that the result will always be
      discarded if a conditional write mask of "(FL)" is used:

        ATOM.ADD.U32 not_junk (FL), address, 1;

    (17) How do we ensure that memory accesses made by multiple program
         invocations of possibly different types are coherent?

      RESOLVED:  Atomic instructions allow program invocations to coordinate
      using shared global memory addresses.  However, memory transactions,
      including atomics, are not guaranteed to land in the order specified
      in the program; they may be reordered by the compiler, cached in
      different memory hierarchies, and stored in a distributed memory
      system where later stores to one "partition" might be completed prior
      to earlier stores to another.  The MEMBAR instruction helps control
      memory transaction ordering by ensuring that all memory transactions
      prior to the barrier complete before any after the barrier.
      Additionally, the ".COH" modifier ensures that memory transactions
      using the modifier are cached coherently and will be visible to other
      shader invocations.
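
      As a rough illustration of how MEMBAR and the ".COH" modifier might be
      combined, a producing invocation could publish a result along the
      following lines.  This is a sketch only:  the address registers are
      placeholders, and the STORE operand order and the placement of the
      ".COH" and data type modifiers are assumptions to be checked against
      the instruction descriptions in the main body.

        # Sketch only; operand order and modifier placement are assumptions.
        STORE.COH.U32 payload.x, dataAddress;  # write the result coherently
        MEMBAR;                                # prior stores complete first
        TEMP old;
        ATOM.ADD.U32 old, counterAddress, 1;   # then bump a "ready" counter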

    (18) How do the TXG and TXGO opcodes work with sRGB textures?

      RESOLVED:  Gamma-correction is applied to the texture source color
      before "gathering" and hence applies to all four components, unless
      the texture swizzle of the selected component is ALPHA, in which case
      no gamma-correction is applied.

    (19) How can render-to-texture algorithms take advantage of
         MemoryBarrierEXT, nominally provided for global memory
         transactions?

      RESOLVED:  Many algorithms use RTT to ping-pong between two
      allocations, using the result of one rendering pass as the input to
      the next.  Existing mechanisms require expensive FBO binds, DrawBuffer
      changes, or FBO attachment changes to safely swap the render target
      and texture.  With memory barriers, layered geometry shader rendering,
      and texture arrays, an application can very cheaply ping-pong between
      two layers of a single texture, i.e.:

        X = 0;
        // Bind the array texture to a texture unit
        // Attach the array texture to an FBO using FramebufferTextureARB
        while (!done) {
          // Stuff X in a constant, vertex attrib, etc.
          Draw -
            Texturing from layer X;
            Writing gl_Layer = 1 - X in the geometry shader;

          MemoryBarrierEXT(TEXTURE_FETCH_BARRIER_BIT_EXT);
          X = 1 - X;
        }

      However, be warned that this requires geometry shaders and hence adds
      the overhead that all geometry must pass through an additional program
      stage, so an application using large amounts of geometry could become
      geometry-limited or more shader-limited.

    (20) What is the ".PREC" instruction modifier good for?

      RESOLVED:  ".PREC" provides some invariance guarantees that are useful
      for certain algorithms.  Using ".PREC", it is possible to ensure that
      an algorithm can be written to produce identical results on subtly
      different inputs.  For example, the order of vertices visible to a
      geometry or tessellation shader used to subdivide primitive edges
      might present an edge shared between two primitives in one direction
      for one primitive and in the other direction for the adjacent
      primitive.  Even if the weights are identical in the two cases, there
      may be cracking if the computations are done in an order-dependent
      manner.  If the position of a new vertex were evaluated with the code
      below using limited-precision floating-point math, it's not
      necessarily the case that we would get the same result for inputs
      (a,b,c) and (c,b,a):

        ADD result, a, b;
        ADD result, result, c;

      There are two problems with this code:  the rounding errors will be
      different, and the implementation is free to rearrange the computation
      order.  The code can be rewritten as follows with ".PREC" and a
      symmetric evaluation order to ensure a precise result with the inputs
      reversed:

        ADD result, a, c;
        ADD.PREC result, result, b;

      Note that in this example, the first instruction doesn't need the
      ".PREC" qualifier because the second instruction requires that the
      implementation compute <a>+<c>, which will be done reliably if <a> and
      <c> are inputs.  If <a> and <c> were results of other computations,
      the first add and possibly the dependent computations may also need to
      be tagged with ".PREC" to ensure reliable results.

      The ".PREC" modifier will disable certain optimizations and thus
      carries a performance cost.

    (21) What are the TGALL, TGANY, and TGEQ instructions good for?

      RESOLVED:  If an implementation performs SIMD thread execution,
      divergent branching may result in reduced performance if the "if" and
      "else" blocks of an "if" statement are executed sequentially.
      For example, an algorithm may have both a "fast path" that performs a
      computation quickly for a subset of all cases and a "slow path" that
      handles all cases correctly, but more slowly.  When performing SIMD
      execution, code like the following:

        SNE.S.CC cc.x, condition.x;
        IF NE.x;
          # do fast path
        ELSE;
          # do slow path
        ENDIF;

      may end up executing *both* the fast and slow paths for a SIMD thread
      group if <condition> diverges, and may execute more slowly than simply
      executing the slow path unconditionally.  These instructions allow
      code like:

        # Condition code matches NE if and only if condition.x is non-zero
        # for all threads.
        TGALL.S.CC cc.x, condition.x;
        IF NE.x;
          # do fast path
        ELSE;
          # do slow path
        ENDIF;

      that executes the fast path if and only if it can be used for *all*
      threads in the group.  For thread groups where <condition> diverges,
      this algorithm would unconditionally run the slow path, but would
      never run both in sequence.


Revision History

    Rev.    Date    Author    Changes
    ----  --------  --------  ----------------------------------------------
     7    09/11/14  pbrown    Minor typo fixes.

     6    07/04/13  pbrown    Add missing language describing the
                              <texImageUnitComp> grammar rule for component
                              selection in TXG and TXGO instructions.

     5    09/23/10  pbrown    Add missing constants for {MIN,MAX}_PROGRAM_
                              TEXTURE_GATHER_OFFSET_NV (same as ARB/core).
                              Add missing description for "su" in the opcode
                              table; fix a couple of operand order bugs for
                              STORE.

     4    06/22/10  pbrown    Specify that the y/z/w components of the ATOM
                              results are undefined, as is the case with
                              ATOMIM from EXT_shader_image_load_store.

     3    04/13/10  pbrown    Remove F32 support from ATOM.ADD.

     2    03/22/10  pbrown    Various wording updates to the spec overview,
                              dependencies, issues, and body.  Remove various
                              spec language that has been refactored into the
                              EXT_shader_image_load_store specification.

     1              pbrown    Internal revisions.