1Name
2
3    NV_gpu_program5
4
5Name Strings
6
7    GL_NV_gpu_program5
8    GL_NV_gpu_program_fp64
9
10Contact
11
12    Pat Brown, NVIDIA Corporation (pbrown 'at' nvidia.com)
13
14Status
15
16    Shipping.
17
18Version
19
20    Last Modified Date:         09/11/2014
21    NVIDIA Revision:            7
22
23Number
24
25    388
26
27Dependencies
28
29    OpenGL 2.0 is required.
30
31    This extension is written against the OpenGL 3.0 specification.
32
33    NV_gpu_program4 and NV_gpu_program4_1 are required.
34
35    NV_shader_buffer_load is required.
36
37    NV_shader_buffer_store is required.
38
39    This extension is written against and interacts with the NV_gpu_program4,
40    NV_vertex_program4, NV_geometry_program4, and NV_fragment_program4
41    specifications.
42
43    This extension interacts with NV_tessellation_program5.
44
45    This extension interacts with ARB_transform_feedback3.
46
47    This extension interacts trivially with NV_shader_buffer_load.
48
49    This extension interacts trivially with NV_shader_buffer_store.
50
51    This extension interacts trivially with NV_parameter_buffer_object2.
52
53    This extension interacts trivially with OpenGL 3.3, ARB_texture_swizzle,
54    and EXT_texture_swizzle.
55
56    This extension interacts trivially with ARB_blend_func_extended.
57
58    This extension interacts trivially with EXT_shader_image_load_store.
59
60    This extension interacts trivially with ARB_shader_subroutine.
61
62    If the 64-bit floating-point portion of this extension is not supported,
63    "GL_NV_gpu_program_fp64" will not be found in the extension string.
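
    As an illustration of the extension-string check described above, the
    following C sketch tests whether a name appears as a complete,
    space-delimited token.  The helper name has_extension is hypothetical, and
    the sketch assumes the caller has already obtained the extension string
    (for example, from glGetString(GL_EXTENSIONS)).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Returns true if `name` appears as a complete, space-delimited token in
 * `extensions`.  A plain strstr() is not enough, because it would also
 * match a name that is only a prefix of a longer extension name. */
bool has_extension(const char *extensions, const char *name)
{
    assert(name != NULL);
    size_t len = strlen(name);
    if (extensions == NULL || len == 0)
        return false;

    for (const char *p = extensions; (p = strstr(p, name)) != NULL; p += len) {
        bool starts_token = (p == extensions) || (p[-1] == ' ');
        bool ends_token = (p[len] == ' ') || (p[len] == '\0');
        if (starts_token && ends_token)
            return true;
    }
    return false;
}
```

    An application could call has_extension(ext, "GL_NV_gpu_program_fp64")
    before relying on any of the 64-bit floating-point features.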
64
65Overview
66
67    This specification documents the common instruction set and basic
68    functionality provided by NVIDIA's 5th generation of assembly instruction
69    sets supporting programmable graphics pipeline stages.
70
71    The instruction set builds upon the basic framework provided by the
72    ARB_vertex_program and ARB_fragment_program extensions to expose
73    considerably more capable hardware.  In addition to new capabilities for
74    vertex and fragment programs, this extension provides new functionality
75    for geometry programs as originally described in the NV_geometry_program4
76    specification, and serves as the basis for the new tessellation control
77    and evaluation programs described in the NV_tessellation_program5
78    extension.
79
80    Programs using the functionality provided by this extension should begin
81    with the program headers "!!NVvp5.0" (vertex programs), "!!NVtcp5.0"
82    (tessellation control programs), "!!NVtep5.0" (tessellation evaluation
83    programs), "!!NVgp5.0" (geometry programs), and "!!NVfp5.0" (fragment
84    programs).
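
    The header strings listed above can be used by tools to decide which
    program type a source string targets.  The C sketch below is one possible
    way to do that; the enum and the function name classify_program are
    hypothetical and not part of this extension.

```c
#include <assert.h>
#include <string.h>

typedef enum {
    PROGRAM_UNKNOWN,
    PROGRAM_VERTEX,
    PROGRAM_TESS_CONTROL,
    PROGRAM_TESS_EVAL,
    PROGRAM_GEOMETRY,
    PROGRAM_FRAGMENT
} program_type;

/* Maps the leading header string of an assembly program to its type.
 * Only the five headers named in the Overview are recognized. */
program_type classify_program(const char *source)
{
    static const struct {
        const char *header;
        program_type type;
    } table[] = {
        { "!!NVvp5.0",  PROGRAM_VERTEX },
        { "!!NVtcp5.0", PROGRAM_TESS_CONTROL },
        { "!!NVtep5.0", PROGRAM_TESS_EVAL },
        { "!!NVgp5.0",  PROGRAM_GEOMETRY },
        { "!!NVfp5.0",  PROGRAM_FRAGMENT },
    };

    assert(source != NULL);
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strncmp(source, table[i].header, strlen(table[i].header)) == 0)
            return table[i].type;
    }
    return PROGRAM_UNKNOWN;
}
```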
85
86    This extension provides a variety of new features, including:
87
88      * support for 64-bit integer operations;
89
90      * the ability to dynamically index into an array of texture units or
91        program parameter buffers;
92
93      * extending texel offset support to allow loading texel offsets from
94        regular integer operands computed at run-time, instead of requiring
95        that the offsets be constants encoded in texture instructions;
96
97      * extending TXG (texture gather) support to return the 2x2 footprint
98        from any component of the texture image instead of always returning
99        the first (x) component;
100
101      * extending TXG to support shadow comparisons in conjunction with a
102        depth texture, via the SHADOW* targets;
103
104      * further extending texture gather support to provide a new opcode
105        (TXGO) that applies a separate texel offset vector to each of the four
106        samples returned by the instruction;
107
108      * bit manipulation instructions, including ones to find the position of
109        the most or least significant set bit, bitfield insertion and
110        extraction, and bit reversal;
111
112      * a general data conversion instruction (CVT) supporting conversion
113        between any two data types supported by this extension; and
114
115      * new instructions to compute the composite of a set of boolean
116        conditions across a group of shader threads.
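
    The last item in the list above refers to the TGALL, TGANY, and TGEQ
    opcodes that appear in the grammar.  Their precise definitions are given
    in the instruction descriptions; the C sketch below only assumes the
    reading suggested by the opcode names (all true, any true, all equal),
    and models a thread group as a plain array of booleans.  The function
    names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if the condition holds for every thread in the group. */
bool group_all(const bool *cond, size_t n)
{
    assert(cond != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (!cond[i])
            return false;
    return true;
}

/* True if the condition holds for at least one thread in the group. */
bool group_any(const bool *cond, size_t n)
{
    assert(cond != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (cond[i])
            return true;
    return false;
}

/* True if every thread in the group has the same condition value. */
bool group_eq(const bool *cond, size_t n)
{
    assert(cond != NULL || n == 0);
    for (size_t i = 1; i < n; i++)
        if (cond[i] != cond[0])
            return false;
    return true;
}
```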
117
118    This extension also provides some new capabilities for individual program
119    types, including:
120
121      * support for instanced geometry programs, where a geometry program may
122        be run multiple times for each primitive;
123
124      * support for emitting vertices in a geometry program where each vertex
125        emitted may be directed at a specified vertex stream and captured
126        using the ARB_transform_feedback3 extension;
127
128      * support for interpolating an attribute at a programmable offset
129        relative to the pixel center (IPAO), at a programmable sample number
130        (IPAS), or at the fragment's centroid location (IPAC) in a fragment
131        program;
132
133      * support for reading a mask of covered samples in a fragment program;
134
135      * support for reading a point sprite coordinate directly in a fragment
136        program, without overriding a texture coordinate;
137
138      * support for reading patch primitives and per-patch attributes
139        (introduced by ARB_tessellation_shader) in a geometry program; and
140
141      * support for multiple output vectors for a single color output in a
142        fragment program (as used by ARB_blend_func_extended).
143
144    This extension also provides optional support for 64-bit-per-component
145    variables and 64-bit floating-point arithmetic.  These features are
146    supported if and only if "GL_NV_gpu_program_fp64" is found in the
147    extension string.
148
149    This extension incorporates the memory access operations from the
150    NV_shader_buffer_load and NV_parameter_buffer_object2 extensions,
151    originally built as add-ons to NV_gpu_program4.  It also provides the
152    following new capabilities:
153
154      * support for the features without requiring a separate OPTION keyword;
155
156      * support for indexing into an array of constant buffers using the LDC
157        opcode added by NV_parameter_buffer_object2;
158
159      * support for storing into buffer objects at a specified GPU address
160        using the STORE opcode, and allowing applications to create READ_WRITE
161        and WRITE_ONLY mappings when making a buffer object resident using the
162        API mechanisms in the NV_shader_buffer_store extension;
163
164      * storage instruction modifiers to allow loading and storing 64-bit
165        component values;
166
167      * support for atomic memory transactions using the ATOM opcode, where
168        the instruction atomically reads the memory pointed to by a pointer,
169        performs a specified computation, stores the results of that
170        computation, and returns the original value read;
171
172      * support for memory barrier transactions using the MEMBAR opcode, which
173        ensures that all memory stores issued prior to the opcode complete
174        prior to any subsequent memory transactions; and
175
176      * a fragment program option to specify that depth and stencil tests are
177        performed prior to fragment program execution.
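
    The read-modify-write behavior described for ATOM in the list above
    (read the memory, compute, store the result, return the original value)
    has a close CPU-side analogue in C11 atomics.  The sketch below is only
    that analogy, not GPU code; the function name atom_add_u32 is
    hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Atomically adds `value` to the word at `address` and returns the value
 * that was stored there before the addition, mirroring the "returns the
 * original value read" behavior described for ATOM. */
uint32_t atom_add_u32(_Atomic uint32_t *address, uint32_t value)
{
    assert(address != NULL);
    return atomic_fetch_add(address, value);
}
```

    Because the original value is returned, concurrent callers each observe
    a distinct pre-update value, which is what makes the pattern usable for
    tasks such as allocating slots in a shared buffer.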
178
179    Additionally, the assembly program languages supported by this extension
180    include support for reading, writing, and performing atomic memory
181    operations on texture image data using the opcodes and mechanisms
182    documented in the "Dependencies on NV_gpu_program5" section of the
183    EXT_shader_image_load_store extension.
184
185New Procedures and Functions
186
187    None.
188
189New Tokens
190
191    Accepted by the <pname> parameter of GetBooleanv, GetIntegerv,
192    GetFloatv, and GetDoublev:
193
194        MAX_GEOMETRY_PROGRAM_INVOCATIONS_NV             0x8E5A
195        MIN_FRAGMENT_INTERPOLATION_OFFSET_NV            0x8E5B
196        MAX_FRAGMENT_INTERPOLATION_OFFSET_NV            0x8E5C
197        FRAGMENT_PROGRAM_INTERPOLATION_OFFSET_BITS_NV   0x8E5D
198        MIN_PROGRAM_TEXTURE_GATHER_OFFSET_NV            0x8E5E
199        MAX_PROGRAM_TEXTURE_GATHER_OFFSET_NV            0x8E5F
200
201
202Additions to Chapter 2 of the OpenGL 3.0 Specification (OpenGL Operation)
203
204    Modify Section 2.X.2 of NV_fragment_program4, Program Grammar
205
206    (modify the section, updating the program header string for the extended
207     instruction set)
208
209    Fragment programs are required to begin with the header string
210    "!!NVfp5.0".  This header string identifies the subsequent program body as
211    being a fragment program and indicates that it should be parsed according
212    to the base NV_gpu_program5 grammar plus the additions below.  Program
213    string parsing begins with the character immediately following the header
214    string.
215
216    (add/change the following rules to the NV_fragment_program4 and
217     NV_gpu_program5 base grammars)
218
219    <SpecialInstruction>    ::= "IPAC" <opModifiers> <instResult> ","
220                                <instOperandV>
221                              | "IPAO" <opModifiers> <instResult> ","
222                                <instOperandV> "," <instOperandV>
223                              | "IPAS" <opModifiers> <instResult> ","
224                                <instOperandV> "," <instOperandS>
225
226    <interpModifier>        ::= "SAMPLE"
227
228    <attribBasic>           ::= <fragPrefix> "sampleid"
229                              | <fragPrefix> "samplemask"
230                              | <fragPrefix> "pointcoord"
231
232    <resultBasic>           ::= <resPrefix> "color" <resultOptColorNum>
233                                <resultOptColorType>
234                              | <resPrefix> "samplemask"
235
236    <resultOptColorType>    ::= ""
237                              | "." <colorType>
238
239
240    Modify Section 2.X.2 of NV_geometry_program4, Program Grammar
241
242    (modify the section, updating the program header string for the extended
243     instruction set)
244
245    Geometry programs are required to begin with the header string
246    "!!NVgp5.0".  This header string identifies the subsequent program body as
247    being a geometry program and indicates that it should be parsed according
248    to the base NV_gpu_program5 grammar plus the additions below.  Program
249    string parsing begins with the character immediately following the header
250    string.
251
252    (add the following rules to the NV_geometry_program4 and NV_gpu_program5
253     base grammars)
254
255    <declaration>           ::= "INVOCATIONS" <int>
256
257    <declPrimInType>        ::= "PATCHES"
258
259    <SpecialInstruction>    ::= "EMITS" <instOperandS>
260
261    <attribBasic>           ::= <primPrefix> "invocation"
262                              | <primPrefix> "vertexcount"
263                              | <attribTessOuter> <optArrayMemAbs>
264                              | <attribTessInner> <optArrayMemAbs>
265                              | <attribPatchGeneric> <optArrayMemAbs>
266
267    <attribMulti>           ::= <attribTessOuter> <arrayRange>
268                              | <attribTessInner> <arrayRange>
269                              | <attribPatchGeneric> <arrayRange>
270
271    <attribTessOuter>       ::= <primPrefix> "." "tessouter"
272
273    <attribTessInner>       ::= <primPrefix> "." "tessinner"
274
275    <attribPatchGeneric>    ::= <primPrefix> "." "patch" "." "attrib"
276
277
278    Modify Section 2.X.2 of NV_vertex_program4, Program Grammar
279
280    (modify the section, updating the program header string for the extended
281     instruction set)
282
283    Vertex programs are required to begin with the header string "!!NVvp5.0".
284    This header string identifies the subsequent program body as being a
285    vertex program and indicates that it should be parsed according to the
286    base NV_gpu_program5 grammar plus the additions below.  Program string
287    parsing begins with the character immediately following the header string.
288
289
290    Modify Section 2.X.2 of NV_gpu_program4, Program Grammar
291
292    (add the following grammar rules to the NV_gpu_program4 base grammar;
293     additional grammar rules usable for assembly programs are documented in
294     the EXT_shader_image_load_store and ARB_shader_subroutine specifications)
295
296    <instruction>           ::= <MemInstruction>
297
298    <MemInstruction>        ::= <ATOMop_instruction>
299                              | <STOREop_instruction>
300                              | <MEMBARop_instruction>
301
302    <VECTORop>              ::= "BFR"
303                              | "BTC"
304                              | "BTFL"
305                              | "BTFM"
306                              | "PK64"
307                              | "LDC"
308                              | "CVT"
309                              | "TGALL"
310                              | "TGANY"
311                              | "TGEQ"
312                              | "UP64"
313
314    <SCALARop>              ::= "LOAD"
315
316    <BINop>                 ::= "BFE"
317
318    <TRIop>                 ::= "BFI"
319
320    <TEXop_instruction>     ::= <TEXop> <opModifiers> <instResult> ","
321                                <instOperandV> "," <instOperandV> ","
322                                <texAccess>
323
324    <TEXop>                 ::= "TXG"
325                              | "LOD"
326
327    <TXDop>                 ::= "TXGO"
328
329    <ATOMop_instruction>    ::= <ATOMop> <opModifiers> <instResult> ","
330                                <instOperandV> "," <instOperandS>
331
332    <ATOMop>                ::= "ATOM"
333
334    <STOREop_instruction>   ::= <STOREop> <opModifiers> <instOperandV> ","
335                                <instOperandS>
336
337    <STOREop>               ::= "STORE"
338
339    <MEMBARop_instruction>  ::= <MEMBARop> <opModifiers>
340
341    <MEMBARop>              ::= "MEMBAR"
342
343    <opModifier>            ::= "F16"
344                              | "F32"
345                              | "F64"
346                              | "F32X2"
347                              | "F32X4"
348                              | "F64X2"
349                              | "F64X4"
350                              | "S8"
351                              | "S16"
352                              | "S32"
353                              | "S32X2"
354                              | "S32X4"
355                              | "S64"
356                              | "S64X2"
357                              | "S64X4"
358                              | "U8"
359                              | "U16"
360                              | "U32"
361                              | "U32X2"
362                              | "U32X4"
363                              | "U64"
364                              | "U64X2"
365                              | "U64X4"
366                              | "ADD"
367                              | "MIN"
368                              | "MAX"
369                              | "IWRAP"
370                              | "DWRAP"
371                              | "AND"
372                              | "OR"
373                              | "XOR"
374                              | "EXCH"
375                              | "CSWAP"
376                              | "COH"
377                              | "ROUND"
378                              | "CEIL"
379                              | "FLR"
380                              | "TRUNC"
381                              | "PREC"
382                              | "VOL"
383
384    <texAccess>             ::= <textureUseS> "," <texTarget> <optTexOffset>
385                              | <textureUseV> "," <texTarget> <optTexOffset>
386
387    <texTarget>             ::= "ARRAYCUBE"
388                              | "SHADOWARRAYCUBE"
389
390    <optTexOffset>          ::= /* empty */
391                              | <texOffset>
392
393    <texOffset>             ::= "offset" "(" <instOperandV> ")"
394
395    <namingStatement>       ::= <TEXTURE_statement>
396
397    <BUFFER_statement>      ::= <bufferDeclType> <establishName>
398                                <optArraySize> <optArraySize> "="
399                                <bufferMultInit>
400
401    <bufferDeclType>        ::= "CBUFFER"
402
403    <TEXTURE_statement>     ::= "TEXTURE" <establishName> <texSingleInit>
404                              | "TEXTURE" <establishName> <optArraySize>
405                                <texMultipleInit>
406
407    <texSingleInit>         ::= "=" <textureUseDS>
408
409    <texMultipleInit>       ::= "=" "{" <texItemList> "}"
410
411    <texItemList>           ::= <textureUseDM>
412                              | <textureUseDM> "," <texItemList>
413
414    <bufferBinding>         ::= "program" "." "buffer" <arrayRange>
415
416    <textureUseS>           ::= <textureUseV> <texImageUnitComp>
417
418    <textureUseV>           ::= <texImageUnit>
419                              | <texVarName> <optArrayMem>
420
421    <textureUseDS>          ::= "texture" <arrayMemAbs>
422
423    <textureUseDM>          ::= <textureUseDS>
424                              | "texture" <arrayRange>
425
426    <texImageUnitComp>      ::= <scalarSuffix>
427
428
429    Modify Section 2.X.3.1, Program Variable Types
430
431    (IGNORE if GL_NV_gpu_program_fp64 is not found in the extension string.
432     Otherwise modify storage size modifiers to guarantee that "LONG"
433     variables are at least 64 bits in size.)
434
435    Explicitly declared variables may optionally have one storage size
436    modifier.  Variables declared as "SHORT" will be represented using at least
437    16 bits per component.  "SHORT" floating-point values will have at least 5
438    bits of exponent and 10 bits of mantissa.  Variables declared as "LONG"
439    will be represented with at least 64 bits per component.  "LONG"
440    floating-point values will have at least 11 bits of exponent and 52 bits
441    of mantissa.  If no size modifier is provided, the GL will automatically
442    select component sizes.  Implementations are not required to support more
443    than one component size, so "SHORT", "LONG", and the default could all
444    refer to the same component size.  The "LONG" modifier is supported only
445    for declarations of temporary variables ("TEMP"), and attribute variables
446    ("ATTRIB") in vertex programs.  The "SHORT" modifier is supported only
447    for declarations of temporary variables and result variables ("OUTPUT").
448
449
450    Modify Section 2.X.3.2 of the NV_fragment_program4 specification, Program
451    Attribute Variables.
452
453    (Add a table entry and relevant text describing the fragment program
454     input sample mask variable.)
455
456      Fragment Attribute Binding  Components  Underlying State
457      --------------------------  ----------  ----------------------------
458      fragment.samplemask         (m,-,-,-)   fragment coverage mask
459      fragment.pointcoord         (s,t,-,-)   fragment point sprite coordinate
460
461    If a fragment attribute binding matches "fragment.samplemask", the "x"
462    component is filled with a coverage mask indicating the set of samples
463    covered by this fragment.  The coverage mask is a bitfield, where bit <n>
464    is one if the sample number <n> is covered and zero otherwise.  If
465    multisample buffers are not available (SAMPLE_BUFFERS is zero), bit zero
466    indicates if the center of the pixel corresponding to the fragment is
467    covered.
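
    The coverage mask layout described above (bit <n> set when sample <n> is
    covered) can be decoded as in the following C sketch.  The sketch assumes
    the mask has been copied into a 32-bit unsigned integer; the function
    names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if bit `n` of the coverage mask is one, i.e. if sample
 * number `n` is covered by the fragment. */
bool sample_covered(uint32_t mask, unsigned n)
{
    assert(n < 32);
    return ((mask >> n) & 1u) != 0;
}

/* Counts the covered samples by clearing the lowest set bit until the
 * mask is empty. */
unsigned covered_sample_count(uint32_t mask)
{
    unsigned count = 0;
    while (mask != 0) {
        mask &= mask - 1;
        count++;
    }
    return count;
}
```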
468
469    If a fragment attribute binding matches "fragment.pointcoord", the "x" and
470    "y" components are filled with the s and t point sprite coordinates
471    (section 3.3.1), respectively.  The "z" and "w" components are undefined.
472    If the fragment is generated by any primitive other than a point, or if
473    point sprites are disabled, all four components of the binding are
474    undefined.
475
476    Modify Section 2.X.3.2 of the NV_geometry_program4 specification, Program
477    Attribute Variables.
478
479    (Add a table entry and relevant text describing the geometry program
480    invocation attribute and per-patch attributes.)
481
482      Geometry Vertex Binding         Components  Description
483      -----------------------------   ----------  ----------------------------
484      ...
485      primitive.invocation            (id,-,-,-)  geometry program invocation
486      primitive.tessouter[n]          (x,-,-,-)   outer tess. level n
487      primitive.tessinner[n]          (x,-,-,-)   inner tess. level n
488      primitive.patch.attrib[n]       (x,y,z,w)   generic patch attribute n
489      primitive.tessouter[n..o]       (x,-,-,-)   outer tess. levels n to o
490      primitive.tessinner[n..o]       (x,-,-,-)   inner tess. levels n to o
491      primitive.patch.attrib[n..o]    (x,y,z,w)   generic patch attrib n to o
492      primitive.vertexcount           (c,-,-,-)   vertices in primitive
493
494    ...
495
496    If a geometry attribute binding matches "primitive.invocation", the "x"
497    component is filled with an integer giving the number of previous
498    invocations of the geometry program on the primitive being processed.  If
499    the geometry program is invoked only once per primitive (default), this
500    component will always be zero.  If the program is invoked multiple times
501    (via the INVOCATIONS declaration), the component will be zero on the first
502    invocation, one on the second, and so forth.  The "y", "z", and "w"
503    components of the variable are always undefined.
504
505    If an attribute binding matches "primitive.tessouter[n]", the "x"
506    component is filled with the per-patch outer tessellation level numbered
507    <n> of the input patch.  <n> must be less than four.  The "y", "z", and
508    "w" components are always undefined.  A program will fail to load if this
509    attribute binding is used and the input primitive type is not PATCHES.
510
511    If an attribute binding matches "primitive.tessinner[n]", the "x"
512    component is filled with the per-patch inner tessellation level numbered
513    <n> of the input patch.  <n> must be less than two.  The "y", "z", and "w"
514    components are always undefined.  A program will fail to load if this
515    attribute binding is used and the input primitive type is not PATCHES.
516
517    If an attribute binding matches "primitive.patch.attrib[n]", the "x", "y",
518    "z", and "w" components are filled with the corresponding components of
519    the per-patch generic attribute numbered <n> of the input patch.  A
520    program will fail to load if this attribute binding is used and the input
521    primitive type is not PATCHES.
522
523    If an attribute binding matches "primitive.tessouter[n..o]",
524    "primitive.tessinner[n..o]", or "primitive.patch.attrib[n..o]", a sequence
525    of 1+<o>-<n> outer tessellation level, inner tessellation level, or
526    per-patch generic attribute bindings is created.  For per-patch generic
527    attribute bindings, it is as though the sequence
528    "primitive.patch.attrib[n], primitive.patch.attrib[n+1], ...
529    primitive.patch.attrib[o]" were specified.  These bindings are available
530    only in explicit declarations of array variables.  A program will fail to
531    load if <n> is greater than <o> or the input primitive type is not
532    PATCHES.
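
    The range rules above can be summarized in a short C sketch: a range
    [n..o] yields 1+<o>-<n> bindings and is rejected when <n> is greater than
    <o>.  The sketch additionally assumes that the per-element limits stated
    earlier (less than four for "tessouter", less than two for "tessinner")
    apply to both ends of a range; that assumption and the function name
    range_binding_count are not taken from the text.

```c
/* Returns the number of bindings created by the range [n..o], or -1 if
 * the range is invalid.  `limit` is the exclusive upper bound on indices
 * (for example, 4 for tessouter and 2 for tessinner). */
int range_binding_count(int n, int o, int limit)
{
    if (n < 0 || n > o || o >= limit)
        return -1;
    return 1 + o - n;
}
```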
533
534    If a geometry attribute binding matches "primitive.vertexcount", the "x"
535    component is filled with the number of vertices in the input primitive
536    being processed.  The "y", "z", and "w" components of the variable are
537    always undefined.
538
539
540    Modify Section 2.X.3.5, Program Results
541
542    (modify Table X.X)
543
544      Binding                        Components  Description
545      -----------------------------  ----------  ----------------------------
546      result.color[n].primary        (r,g,b,a)   primary color n (SRC_COLOR)
547      result.color[n].secondary      (r,g,b,a)   secondary color n (SRC1_COLOR)
548
549      Table X.X:  Fragment Result Variable Bindings. Components labeled "*"
550      are unused. "[n]" is optional -- color <n> is used if specified; color
551      0 is used otherwise.
552
553    (add after third paragraph)
554
555    If a result variable binding matches "result.color[n].primary" or
556    "result.color[n].secondary" and the ARB_blend_func_extended option is
557    specified, updates to the "x", "y", "z", and "w" components of these color
558    result variables modify the "r", "g", "b", and "a" components of the
559    SRC_COLOR and SRC1_COLOR color outputs, respectively, for the fragment
560    output color numbered <n>.  If the ARB_blend_func_extended program option
561    is not specified, the "result.color[n].primary" and
562    "result.color[n].secondary" bindings are unavailable.
563
564
565    Modify Section 2.X.3.6, Program Parameter Buffers
566
567    (modify the description of parameter buffer arrays to require that all
568    bindings in an array declaration must use the same single buffer *or*
569    buffer range)
570
571    ...  Program parameter buffer variables may be declared as arrays, but all
572    bindings assigned to the array must use the same binding point or binding
573    point range, and must increase consecutively.
574
575    (add to the end of the section)
576
577    In explicit variable declarations, the bindings in Table X.12.1 of the
578    form "program.buffer[a..b]" may also be used, and indicate the variable
579    spans multiple buffer binding points.  Such variables must be accessed as
580    an array, with the first index specifying an offset into the range of
581    buffer object binding points.  A buffer index of zero identifies binding
582    point <a>; an index of <b>-<a> identifies binding point <b>.  If such a
583    variable is declared as an array, a second index must be provided to
584    identify the individual array element.  A program will fail to compile if
585    such bindings are used when <a> or <b> is negative or greater than or
586    equal to the number of buffer binding points supported for the program
587    type, or if <a> is greater than <b>.  The bindings in Table X.12.1 may not
588    be used in implicit variable declarations.
589
590      Binding                        Components  Underlying State
591      -----------------------------  ----------  -----------------------------
592      program.buffer[a..b][c]        (x,x,x,x)   program parameter buffers a
593                                                   through b, element c
594      program.buffer[a..b][c..d]     (x,x,x,x)   program parameter buffers a
595                                                   through b, elements c
596                                                   through d
597      program.buffer[a..b]           (x,x,x,x)   program parameter buffers a
598                                                   through b, all elements
599
600      Table X.12.1:  Program Parameter Buffer Array Bindings.  <a> and <b>
601      indicate buffer numbers, <c> and <d> indicate individual elements.
602
603    When bindings beginning with "program.buffer[a..b]" are used in a variable
604    declaration, they behave identically to corresponding bindings beginning with
605    "program.buffer[a]", except that the variable is filled with a separate
606    set of values for each buffer binding point from <a> to <b> inclusive.
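
    To make the indexing of "program.buffer[a..b]" variables concrete, the C
    sketch below checks the range against the compile-time rules of this
    section and maps the first array index to a buffer binding point.  It
    relies on index zero identifying binding point <a> and on the range being
    inclusive of <b>.  The function names and the max_bindings parameter
    (standing in for the number of buffer binding points supported for the
    program type) are hypothetical.

```c
#include <stdbool.h>

/* Mirrors the failure conditions: <a> or <b> negative, <a> or <b> greater
 * than or equal to the number of binding points, or <a> greater than <b>. */
bool buffer_range_valid(int a, int b, int max_bindings)
{
    return a >= 0 && b >= 0 &&
           a < max_bindings && b < max_bindings &&
           a <= b;
}

/* Maps the first array index of a program.buffer[a..b] variable to a
 * buffer binding point.  Returns -1 if the index lies outside the range. */
int buffer_binding_point(int a, int b, int index)
{
    if (index < 0 || index > b - a)
        return -1;
    return a + index;
}
```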
607
608    (add new section after Section 2.X.3.7, Program Condition Code Registers
609    and renumber subsequent sections accordingly)
610
611    Section 2.X.3.8, Program Texture Variables
612
613    Program texture variables are used as constants during program execution
614    and refer to the texture objects bound to one or more texture image units.
615    All texture variables have associated bindings and are read-only during
616    program execution.  Texture variables retain their values across program
617    invocations, and the set of texture image units to which they refer is
618    constant.  The texture object a variable refers to may be changed by
619    binding a new texture object to the appropriate target of the
620    corresponding texture image unit.  Texture variables may only be used to
621    identify a texture object in texture instructions, and may not be used as
622    operands in any other instruction.  Texture variables may be declared
623    explicitly via the <TEXTURE_statement> grammar rule, or implicitly by
624    using a texture image unit binding in an instruction.
625
626    Texture variables may be declared as arrays, but the list of
627    texture image units assigned to the array must increase consecutively.
628
629    Texture variables identify only a texture image unit; the corresponding
630    texture target (e.g., 1D, 2D, CUBE) and texture object are identified by
631    the <texTarget> grammar rule in instructions using the texture variable.
632
633      Binding          Components  Underlying State
634      ---------------  ----------  ------------------------------------------
635      texture[a]           x      texture object bound to image unit a
636      texture[a..b]        x      texture objects bound to image units a
637                                     through b
638
639      Table X.12.2:  Texture Image Unit Bindings.  <a> and <b> indicate
640      texture image unit numbers.
641
642    If a texture binding matches "texture[a]", the texture variable is filled
643    with a single integer referring to texture image unit <a>.
644
645    If a texture binding matches "texture[a..b]", the texture variable is
646    filled with an array of integers referring to texture image units <a>
647    through <b>, inclusive.  A program will fail to compile if <a> or <b> is
648    negative or greater than or equal to the number of texture image units
649    supported, or if <a> is greater than <b>.
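
    The way a "texture[a..b]" binding fills a texture variable can be
    sketched in C as follows.  The function expand_texture_range and its
    max_units parameter (standing in for the number of texture image units
    supported) are hypothetical; returning zero here models the case where
    the program would fail to compile.

```c
#include <assert.h>
#include <stddef.h>

/* Fills `units` with the image unit numbers a, a+1, ..., b and returns
 * how many were written.  Returns 0 if the range violates the rules
 * above or does not fit in `capacity` entries. */
size_t expand_texture_range(int a, int b, int max_units,
                            int *units, size_t capacity)
{
    assert(units != NULL || capacity == 0);
    if (a < 0 || b < 0 || a >= max_units || b >= max_units || a > b)
        return 0;

    size_t count = (size_t)(b - a) + 1;
    if (count > capacity)
        return 0;

    for (size_t i = 0; i < count; i++)
        units[i] = a + (int)i;
    return count;
}
```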
650
651
652    Modify Section 2.X.4, Program Execution Environment
653
654    (Update the instruction set table to include new columns to indicate the
655     first ISA supporting the instruction, and to indicate whether the
656     instruction supports 64-bit floating-point modifiers.)
657
658      Instr-      Modifiers
659      uction  V  F I C S H D  Out Inputs    Description
660      ------- -- - - - - - -  --- --------  --------------------------------
661      ABS     40 6 6 X X X F  v   v         absolute value
662      ADD     40 6 6 X X X F  v   v,v       add
663      AND     40 - 6 X - - S  v   v,v       bitwise and
664      ATOM    50 - - X - - -  s   v,su      atomic memory transaction
665      BFE     50 - X X - - S  v   v,v       bitfield extract
666      BFI     50 - X X - - S  v   v,v,v     bitfield insert
667      BFR     50 - X X - - S  v   v         bitfield reverse
668      BRK     40 - - - - - -  -   c         break out of loop instruction
669      BTC     50 - X X - - S  v   v         bit count
670      BTFL    50 - X X - - S  v   v         find least significant bit
671      BTFM    50 - X X - - S  v   v         find most significant bit
672      CAL     40 - - - - - -  -   c         subroutine call
673      CEIL    40 6 6 X X X F  v   vf        ceiling
674      CMP     40 6 6 X X X F  v   v,v,v     compare
675      CONT    40 - - - - - -  -   c         continue with next loop iteration
676      COS     40 X - X X X F  s   s         cosine with reduction to [-PI,PI]
677      CVT     50 - - X X - F  v   v         general data type conversion
678      DDX     40 X - X X X F  v   v         derivative relative to X (fp-only)
679      DDY     40 X - X X X F  v   v         derivative relative to Y (fp-only)
680      DIV     40 6 6 X X X F  v   v,s       divide vector components by scalar
681      DP2     40 X - X X X F  s   v,v       2-component dot product
682      DP2A    40 X - X X X F  s   v,v,v     2-comp. dot product w/scalar add
683      DP3     40 X - X X X F  s   v,v       3-component dot product
684      DP4     40 X - X X X F  s   v,v       4-component dot product
685      DPH     40 X - X X X F  s   v,v       homogeneous dot product
686      DST     40 X - X X X F  v   v,v       distance vector
687      ELSE    40 - - - - - -  -   -         start if test else block
688      EMIT    40 - - - - - -  -   -         emit vertex stream 0 (gp-only)
689      EMITS   50 - X - - - S  -   s         emit vertex to stream (gp-only)
690      ENDIF   40 - - - - - -  -   -         end if test block
691      ENDPRIM 40 - - - - - -  -   -         end of primitive (gp-only)
692      ENDREP  40 - - - - - -  -   -         end of repeat block
693      EX2     40 X - X X X F  s   s         exponential base 2
694      FLR     40 6 6 X X X F  v   vf        floor
695      FRC     40 6 - X X X F  v   v         fraction
696      I2F     40 - 6 X - - S  vf  v         integer to float
697      IF      40 - - - - - -  -   c         start of if test block
698      IPAC    50 X - X X - F  v   v         interpolate at centroid (fp-only)
699      IPAO    50 X - X X - F  v   v,v       interpolate w/offset (fp-only)
700      IPAS    50 X - X X - F  v   v,su      interpolate at sample (fp-only)
701      KIL     40 X X - - X F  -   vc        kill fragment
702      LDC     40 - - X X - F  v   v         load from constant buffer
703      LG2     40 X - X X X F  s   s         logarithm base 2
704      LIT     40 X - X X X F  v   v         compute lighting coefficients
705      LOAD    40 - - X X - F  v   su        global load
706      LOD     41 X - X X - F  v   vf,t      compute texture LOD
707      LRP     40 X - X X X F  v   v,v,v     linear interpolation
708      MAD     40 6 6 X X X F  v   v,v,v     multiply and add
709      MAX     40 6 6 X X X F  v   v,v       maximum
710      MEMBAR  50 - - - - - -  -   -         memory barrier
711      MIN     40 6 6 X X X F  v   v,v       minimum
712      MOD     40 - 6 X - - S  v   v,s       modulus vector components by scalar
713      MOV     40 6 6 X X X F  v   v         move
714      MUL     40 6 6 X X X F  v   v,v       multiply
715      NOT     40 - 6 X - - S  v   v         bitwise not
716      NRM     40 X - X X X F  v   v         normalize 3-component vector
717      OR      40 - 6 X - - S  v   v,v       bitwise or
718      PK2H    40 X X - - - F  s   vf        pack two 16-bit floats
719      PK2US   40 X X - - - F  s   vf        pack two floats as unsigned 16-bit
720      PK4B    40 X X - - - F  s   vf        pack four floats as signed 8-bit
721      PK4UB   40 X X - - - F  s   vf        pack four floats as unsigned 8-bit
722      PK64    50 X X - - - F  v   v         pack 4x32-bit vectors to 2x64
723      POW     40 X - X X X F  s   s,s       exponentiate
724      RCC     40 X - X X X F  s   s         reciprocal (clamped)
725      RCP     40 6 - X X X F  s   s         reciprocal
726      REP     40 6 6 - - X F  -   v         start of repeat block
727      RET     40 - - - - - -  -   c         subroutine return
728      RFL     40 X - X X X F  v   v,v       reflection vector
729      ROUND   40 6 6 X X X F  v   vf        round to nearest integer
730      RSQ     40 6 - X X X F  s   s         reciprocal square root
731      SAD     40 - 6 X - - S  vu  v,v,vu    sum of absolute differences
732      SCS     40 X - X X X F  v   s         sine/cosine without reduction
733      SEQ     40 6 6 X X X F  v   v,v       set on equal
734      SFL     40 6 6 X X X F  v   v,v       set on false
735      SGE     40 6 6 X X X F  v   v,v       set on greater than or equal
736      SGT     40 6 6 X X X F  v   v,v       set on greater than
737      SHL     40 - 6 X - - S  v   v,s       shift left
738      SHR     40 - 6 X - - S  v   v,s       shift right
739      SIN     40 X - X X X F  s   s         sine with reduction to [-PI,PI]
740      SLE     40 6 6 X X X F  v   v,v       set on less than or equal
741      SLT     40 6 6 X X X F  v   v,v       set on less than
742      SNE     40 6 6 X X X F  v   v,v       set on not equal
743      SSG     40 6 - X X X F  v   v         set sign
744      STORE   50 - - - - - -  -   v,su      global store
745      STR     40 6 6 X X X F  v   v,v       set on true
746      SUB     40 6 6 X X X F  v   v,v       subtract
747      SWZ     40 X - X X X F  v   v         extended swizzle
748      TEX     40 X X X X - F  v   vf,t      texture sample
749      TGALL   50 X X X X - F  v   v         test all non-zero in thread group
750      TGANY   50 X X X X - F  v   v         test any non-zero in thread group
751      TGEQ    50 X X X X - F  v   v         test all equal in thread group
752      TRUNC   40 6 6 X X X F  v   vf        truncate (round toward zero)
753      TXB     40 X X X X - F  v   vf,t      texture sample with bias
754      TXD     40 X X X X - F  v vf,vf,vf,t  texture sample w/partials
755      TXF     40 X X X X - F  v   vs,t      texel fetch
756      TXFMS   40 X X X X - F  v   vs,t      multisample texel fetch
757      TXG     41 X X X X - F  v   vf,t      texture gather
758      TXGO    50 X X X X - F  v vf,vs,vs,t  texture gather w/per-texel offsets
759      TXL     40 X X X X - F  v   vf,t      texture sample w/LOD
760      TXP     40 X X X X - F  v   vf,t      texture sample w/projection
761      TXQ     40 - - - - - S  vs  vs,t      texture info query
762      UP2H    40 X X X X - F  vf  s         unpack two 16-bit floats
763      UP2US   40 X X X X - F  vf  s         unpack two unsigned 16-bit integers
764      UP4B    40 X X X X - F  vf  s         unpack four signed 8-bit integers
765      UP4UB   40 X X X X - F  vf  s         unpack four unsigned 8-bit integers
766      UP64    50 X X X X - F  v   v         unpack 2x64 vectors to 4x32
767      X2D     40 X - X X X F  v   v,v,v     2D coordinate transformation
768      XOR     40 - 6 X - - S  v   v,v       exclusive or
769      XPD     40 X - X X X F  v   v,v       cross product
770
771          Table X.13:  Summary of NV_gpu_program5 instructions.
772
773      The "V" column indicates the first assembly language in the
774      NV_gpu_program4 family (if any) supporting the opcode.  "40", "41", and
775      "50" indicate NV_gpu_program4, NV_gpu_program4_1, and NV_gpu_program5.
776
777      The "Modifiers" columns specify the set of modifiers allowed for the
778      instruction:
779
780        F = floating-point data type modifiers
781        I = signed and unsigned integer data type modifiers
782        C = condition code update modifiers
783        S = clamping (saturation) modifiers
784        H = half-precision float data type suffix
785        D = default data type modifier (F, U, or S)
786
787      For the "F" and "I" columns, an "X" indicates support for both unsized
788      type modifiers and sized type modifiers with fewer than 64 bits.  A "6"
789      indicates support for all modifiers, including 64-bit versions (when
790      supported).
791
792      The input and output columns describe the formats of the operands and
793      results of the instruction.
794
795        v:  4-component vector (data type is inherited from operation)
796        vf: 4-component vector (data type is always floating-point)
797        vs: 4-component vector (data type is always signed integer)
798        vu: 4-component vector (data type is always unsigned integer)
799        s:  scalar (replicated if written to a vector destination;
800                    data type is inherited from operation)
801        su: scalar (data type is always unsigned integer)
802        c:  condition code test result (e.g., "EQ", "GT1.x")
803        vc: 4-component vector or condition code test
804        t:  texture
805
806      Instructions labeled "fp-only" and "gp-only" are supported only for
807      fragment and geometry programs, respectively.
808
809
810    Modify Section 2.X.4.1, Program Instruction Modifiers
811
812    (Update the discussion of instruction precision modifiers.  If
813     GL_NV_gpu_program_fp64 is not found in the extension string, the "F64"
814     instruction modifier described below is not supported.)
815
816    (add to Table X.14 of the NV_gpu_program4 specification.)
817
818      Modifier  Description
819      --------  ---------------------------------------------------
820      F         Floating-point operation
821      U         Fixed-point operation, unsigned operands
822      S         Fixed-point operation, signed operands
823      ...
824      F32       Floating-point operation, 32-bit precision or
825                  access one 32-bit floating-point value
826      F64       Floating-point operation, 64-bit precision or
827                  access one 64-bit floating-point value
828      S32       Fixed-point operation, signed 32-bit operands or
829                  access one 32-bit signed integer value
830      S64       Fixed-point operation, signed 64-bit operands or
831                  access one 64-bit signed integer value
832      U32       Fixed-point operation, unsigned 32-bit operands or
833                  access one 32-bit unsigned integer value
834      U64       Fixed-point operation, unsigned 64-bit operands or
835                  access one 64-bit unsigned integer value
836      ...
837      F32X2     Access two 32-bit floating-point values
838      F32X4     Access four 32-bit floating-point values
839      F64X2     Access two 64-bit floating-point values
840      F64X4     Access four 64-bit floating-point values
841      S8        Access one 8-bit signed integer value
842      S16       Access one 16-bit signed integer value
843      S32X2     Access two 32-bit signed integer values
844      S32X4     Access four 32-bit signed integer values
845      S64       Access one 64-bit signed integer value
846      S64X2     Access two 64-bit signed integer values
847      S64X4     Access four 64-bit signed integer values
848      U8        Access one 8-bit unsigned integer value
849      U16       Access one 16-bit unsigned integer value
850      U32       Access one 32-bit unsigned integer value
851      U32X2     Access two 32-bit unsigned integer values
852      U32X4     Access four 32-bit unsigned integer values
853      U64       Access one 64-bit unsigned integer value
854      U64X2     Access two 64-bit unsigned integer values
855      U64X4     Access four 64-bit unsigned integer values
856
857      ADD       Perform add operation for ATOM
858      MIN       Perform minimum operation for ATOM
859      MAX       Perform maximum operation for ATOM
860      IWRAP     Perform wrapping increment for ATOM
861      DWRAP     Perform wrapping decrement for ATOM
862      AND       Perform logical AND operation for ATOM
863      OR        Perform logical OR operation for ATOM
864      XOR       Perform logical XOR operation for ATOM
865      EXCH      Perform exchange operation for ATOM
866      CSWAP     Perform compare-and-swap operation for ATOM
867
868      COH       Make LOAD and STORE operations use coherent caching
869      VOL       Make LOAD and STORE operations treat memory as volatile
870
871      PREC      Instruction results should be precise
872
873      ROUND     Inexact conversion results round to nearest value (even)
874      CEIL      Inexact conversion results round to larger value
875      FLR       Inexact conversion results round to smaller value
876      TRUNC     Inexact conversion results round to value closest to zero
877
878
879    "F", "U", and "S" modifiers are base data type modifiers and specify that
880    the instruction should operate on floating-point, unsigned integer, or
881    signed integer values, respectively.  For example, "ADD.F", "ADD.U", and
882    "ADD.S" specify component-wise addition of floating-point, unsigned
883    integer, or signed integer vectors, respectively.  While these modifiers
884    specify a data type, they do not specify an exact precision at which the
885    operation is performed.  Floating-point and fixed-point operations will
886    typically be carried out at 32-bit precision, unless otherwise described
887    in the instruction documentation or overridden by the precision modifiers.
888    If all operands are represented with less than 32-bit precision (e.g.,
889    variables with the "SHORT" component size modifier), operations may be
890    carried out at a precision no less than the precision of the largest
891    operand used by the instruction.  For some instructions, the data type of
892    some operands or the result is fixed; in these cases, the data type
893    modifier specifies the data type of the remaining values.
894
895    Operands represented with fewer bits than used to perform the instruction
896    will be promoted to a larger data type.  Signed integer operands will be
897    sign-extended, where the most significant bits are filled with ones if the
898    operand is negative and zeroes otherwise.  Unsigned integer operands will be
899    zero-extended, where the most significant bits are always filled with
900    zeroes.  Operands represented with more bits than used to perform the
901    instruction will be converted to lower precision.  Floating-point
902    overflows result in IEEE infinity encodings; integer overflows result in
903    the truncation of the most significant bits.
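
    The promotion and narrowing rules above behave like ordinary C integer
    conversions between fixed-width types.  The following sketch assumes
    16-bit operands promoted to a 32-bit operation and a 64-bit value
    narrowed to 32 bits; the function names are hypothetical.

```c
#include <stdint.h>

/* Sign-extend a 16-bit signed operand to 32 bits: the upper 16 bits are
 * filled with ones if the operand is negative and with zeroes otherwise. */
static int32_t promote_s16(int16_t v) { return (int32_t)v; }

/* Zero-extend a 16-bit unsigned operand to 32 bits: the upper 16 bits are
 * always filled with zeroes. */
static uint32_t promote_u16(uint16_t v) { return (uint32_t)v; }

/* Narrow a 64-bit unsigned value to 32 bits: the most significant bits
 * are truncated. */
static uint32_t narrow_u64(uint64_t v) { return (uint32_t)v; }
```

    For example, the signed 16-bit value -2 (0xFFFE) promotes to 0xFFFFFFFE,
    while the unsigned 16-bit value 0xFFFE promotes to 0x0000FFFE.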
904
905    For arithmetic operations, the "F32", "F64", "U32", "U64", "S32", and
906    "S64" modifiers are precision-specific data type modifiers that specify
907    that floating-point, unsigned integer, or signed integer operations be
908    carried out with an internal precision of no less than 32 or 64 bits per
909    component, respectively.  The "F64", "U64", and "S64" modifiers are
910    supported on only a subset of instructions, as documented in the
911    instruction table.  The base data type of the instruction is trivially
912    derived from a precision-specific data type modifier, and an instruction
913    may not specify both base and precision-specific data type modifiers.
914
915    ...
916
917    "SAT" and "SSAT" are clamping modifiers that generally specify that the
918    floating-point components of the instruction result should be clamped to
919    [0,1] or [-1,1], respectively, before updating the condition code and the
920    destination variable.  If no clamping suffix is specified, unclamped
921    results will be used for condition code updates (if any) and destination
922    variable writes.  Clamping modifiers are not supported on instructions
923    that do not produce floating-point results, with one exception.
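
    As a minimal sketch, the two clamping modifiers correspond to the
    following per-component operations.  The function names are
    hypothetical, and the treatment of NaN inputs is not addressed by this
    sketch (a NaN simply passes through the comparisons unchanged).

```c
/* "SAT": clamp a floating-point result component to [0,1]. */
static float clamp_sat(float x)
{
    return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
}

/* "SSAT": clamp a floating-point result component to [-1,1]. */
static float clamp_ssat(float x)
{
    return x < -1.0f ? -1.0f : (x > 1.0f ? 1.0f : x);
}
```

    The clamped value is what would then be used for both the condition
    code update and the destination write.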
924
925    ...
926
927    For load and store operations, the "F32", "F32X2", "F32X4", "F64",
928    "F64X2", "F64X4", "S8", "S16", "S32", "S32X2", "S32X4", "S64", "S64X2",
929    "S64X4", "U8", "U16", "U32", "U32X2", "U32X4", "U64", "U64X2", and "U64X4"
930    storage modifiers control how data are loaded from or stored to memory.
931    Storage modifiers are supported by the ATOM, LDC, LOAD, and STORE
932    instructions and are covered in more detail in the descriptions of these
933    instructions.  These instructions must specify exactly one of these
934    modifiers, and may not specify any of the base data type modifiers (F,U,S)
935    described above.  The base data types of the result vector of a load
936    instruction or the first operand of a store instruction are trivially
937    derived from the storage modifier.
938
939    For atomic memory operations performed by the ATOM instruction, the "ADD",
940    "MIN", "MAX", "IWRAP", "DWRAP", "AND", "OR", "XOR", "EXCH", and "CSWAP"
941    modifiers specify the operation to perform on the memory being accessed,
942    and are described in more detail in the description of this instruction.
943
944    For load and store operations, the "COH" modifier controls whether the
945    operation uses a coherent level of the cache hierarchy, as described in
946    Section 2.X.4.5.
947
948    For load and store operations, the "VOL" modifier controls whether the
949    operation treats the memory being read or written as volatile.
950    Instructions modified with "VOL" will always read or write the underlying
951    memory, whether or not previous or subsequent loads and stores access the
952    same memory.
953
954    For arithmetic and logical operations, the "PREC" modifier controls
955    whether the instruction result should be treated as precise.  For
956    instructions not qualified with ".PREC", the implementation may rearrange
957    the computations specified by the program instructions to execute more
958    efficiently, even if it may generate slightly different results in some
959    cases.  For example, an implementation may combine a MUL instruction with
960    a dependent ADD instruction and generate code to execute a MAD
961    (multiply-add) instruction instead.  The difference in rounding may
962    produce unacceptable artifacts for some algorithms.  When ".PREC" is
963    specified, the instruction will be executed in a manner that always
964    generates the same result regardless of the program instructions that
965    precede or follow the instruction.  Note that a ".PREC" modifier does not
966    affect the processing of any other instruction.  For example, tagging an
967    instruction with ".PREC" does not mean that the instructions used to
968    generate the instruction's operands will be treated as precise unless
969    those instructions are also qualified with ".PREC".
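
    The MUL/ADD versus MAD difference mentioned above can be reproduced on
    a CPU.  The sketch below is an illustration under stated assumptions,
    not a description of any GPU's behavior:  the function names are
    hypothetical, and the single-rounding path is emulated by carrying the
    product in double precision (exact for two 32-bit floats), which is not
    a general fused multiply-add.

```c
/* Separate multiply and add: the product is rounded to 32-bit float
 * before the add.  The volatile temporary keeps a C compiler from fusing
 * the two operations or holding the product at higher precision. */
static float mul_then_add(float a, float b, float c)
{
    volatile float product = a * b;
    return product + c;
}

/* Emulated multiply-add with no intermediate rounding of the product.
 * The product of two floats is exactly representable in double. */
static float fused_like(float a, float b, float c)
{
    return (float)((double)a * (double)b + (double)c);
}
```

    With a = b = 1 + 2^-12 and c = -(1 + 2^-11), the exact product is
    1 + 2^-11 + 2^-24.  Rounding the product to 32 bits discards the 2^-24
    term, so mul_then_add() yields 0, while fused_like() yields 2^-24.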
970
971    For the CVT (data type conversion) instruction, the "F16", "F32", "F64",
972    "S8", "S16", "S32", "S64", "U8", "U16", "U32", and "U64" storage modifiers
973    specify the data type of the vector operand and the converted result.  Two
974    storage modifiers must be provided, which specify the data type of the
975    result and the operand, respectively.
976
977    For the CVT (data type conversion) instruction, the "ROUND", "CEIL",
978    "FLR", and "TRUNC" modifiers specify how to round converted results that
979    are not directly representable using the data type of the result.
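
    The four rounding modifiers can be illustrated for a conversion from a
    32-bit float to a signed 32-bit integer.  This is a sketch only:  the
    enum and function names are hypothetical, and the input is assumed to
    be finite and within the range of a signed 32-bit integer.

```c
#include <stdint.h>

typedef enum { RM_ROUND, RM_CEIL, RM_FLR, RM_TRUNC } RoundMode;

/* Convert x to a signed 32-bit integer using the given rounding mode. */
static int32_t cvt_s32_f32(float x, RoundMode mode)
{
    int32_t t = (int32_t)x;          /* C conversion truncates toward zero */
    float ft = (float)t;
    if (ft == x)
        return t;                    /* exactly representable: no rounding */

    int32_t lo = (x < ft) ? t - 1 : t;   /* largest integer below x  */
    int32_t hi = lo + 1;                 /* smallest integer above x */

    switch (mode) {
    case RM_TRUNC: return t;         /* toward zero          */
    case RM_FLR:   return lo;        /* toward smaller value */
    case RM_CEIL:  return hi;        /* toward larger value  */
    case RM_ROUND: {
        float frac = x - (float)lo;  /* strictly between 0 and 1 */
        if (frac > 0.5f) return hi;
        if (frac < 0.5f) return lo;
        return (lo & 1) ? hi : lo;   /* tie: choose the even neighbor */
    }
    }
    return t;
}
```

    For example, 2.5 and 3.5 convert to 2 and 4 under "ROUND", and -2.25
    converts to -2, -3, and -2 under "CEIL", "FLR", and "TRUNC".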
980
981
982    Modify Section 2.X.4.4, Program Texture Access
983
984    (Extend the language describing the operation of texel offsets to cover
985     the new capability to load texel offsets from a register.  Otherwise,
986     this functionality is unchanged from previous extensions.)
987
988    <offset> is a 3-component signed integer vector, which can be specified
989    using constants embedded in the texture instruction according to the
990    <texOffsetImmed> grammar rule, or taken from a vector operand according to
991    the <texOffsetVar> grammar rule.  The three components of the offset
992    vector are added to the computed <u>, <v>, and <w> texel locations prior
993    to sampling.  When using a constant offset, one, two, or three components
994    may be specified in the instruction; if fewer than three are specified,
995    the remaining offset components are zero.  If no offsets are specified,
996    all three components of the offset are treated as zero.  A limited range
997    of offset values is supported; the minimum and maximum <texOffset> values
998    are implementation-dependent and given by MIN_PROGRAM_TEXEL_OFFSET_EXT and
999    MAX_PROGRAM_TEXEL_OFFSET_EXT, respectively.  A program will fail to load:
1000
1001      * if the texture target specified in the instruction is 1D, ARRAY1D,
1002        SHADOW1D, or SHADOWARRAY1D, and the second or third component of a
1003        constant offset vector is non-zero;
1004
1005      * if the texture target specified in the instruction is 2D, RECT,
1006        ARRAY2D, SHADOW2D, SHADOWRECT, or SHADOWARRAY2D, and the third
1007        component of a constant offset vector is non-zero;
1008
1009      * if the texture target is CUBE, SHADOWCUBE, ARRAYCUBE, or
1010        SHADOWARRAYCUBE, and any component of a constant offset vector is
1011        non-zero -- texel offsets are not supported for cube map or buffer
1012        textures;
1013
1014      * if any component of the constant offset vector of a TXGO instruction
1015        is non-zero -- non-constant offsets are provided in separate operands;
1016
1017      * if any component of a constant offset vector is less than
1018        MIN_PROGRAM_TEXEL_OFFSET_EXT or greater than
1019        MAX_PROGRAM_TEXEL_OFFSET_EXT;
1020
1021      * if a TXD or TXGO instruction specifies a non-constant texel offset
1022        according to the <texOffsetVar> grammar rule; or
1023
1024      * if any instruction specifies a non-constant texel offset according
1025        to the <texOffsetVar> grammar rule and the texture target is CUBE,
1026        SHADOWCUBE, ARRAYCUBE, or SHADOWARRAYCUBE.
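
    The load-time rules for constant offsets listed above can be sketched
    as a single predicate.  The grouping of targets into classes, the
    function name, and the parameters are hypothetical; <min_off> and
    <max_off> stand in for the queried values of
    MIN_PROGRAM_TEXEL_OFFSET_EXT and MAX_PROGRAM_TEXEL_OFFSET_EXT.  The
    rules for non-constant offsets are not covered by this sketch.

```c
#include <stdbool.h>

/* Hypothetical grouping of texture targets by offset dimensionality:
 *   TC_1D:   1D, ARRAY1D, SHADOW1D, SHADOWARRAY1D
 *   TC_2D:   2D, RECT, ARRAY2D, SHADOW2D, SHADOWRECT, SHADOWARRAY2D
 *   TC_3D:   3D
 *   TC_CUBE: CUBE, SHADOWCUBE, ARRAYCUBE, SHADOWARRAYCUBE */
typedef enum { TC_1D, TC_2D, TC_3D, TC_CUBE } TargetClass;

/* Returns true if a constant offset (x, y, z) is acceptable at load time. */
static bool const_offset_ok(TargetClass cls, bool is_txgo,
                            int x, int y, int z, int min_off, int max_off)
{
    /* Every component must lie within the implementation limits. */
    if (x < min_off || x > max_off) return false;
    if (y < min_off || y > max_off) return false;
    if (z < min_off || z > max_off) return false;

    /* TXGO takes its offsets from separate operands. */
    if (is_txgo && (x != 0 || y != 0 || z != 0)) return false;

    switch (cls) {
    case TC_1D:   return y == 0 && z == 0;
    case TC_2D:   return z == 0;
    case TC_3D:   return true;
    case TC_CUBE: return x == 0 && y == 0 && z == 0;
    }
    return false;
}
```

    With example limits of -8 and 7, an offset of (1,-1,0) is accepted for
    a 2D target but rejected for a 1D or cube map target.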
1027
1028    The implementation-dependent minimum and maximum texel offset values apply
1029    to texel offsets taken from a vector operand, but out-of-bounds or
1030    invalid component values will not prevent program loading since the
1031    offsets may not be computed until the program is executed.  Components of
1032    the vector operand not needed for the texture target are ignored.  The W
1033    component of the offset vector is always ignored; the Z component of the
1034    offset vector is ignored unless the target is 3D; the Y component is
1035    ignored if the target is 1D, ARRAY1D, SHADOW1D, or SHADOWARRAY1D.  If the
1036    value of any non-ignored component of the vector operand is outside
1037    implementation-dependent limits, the results of the texture lookup are
1038    undefined.  For all instructions except TXGO, the limits are
1039    MIN_PROGRAM_TEXEL_OFFSET_EXT and MAX_PROGRAM_TEXEL_OFFSET_EXT.  For the
1040    TXGO instruction, the limits are MIN_PROGRAM_TEXTURE_GATHER_OFFSET_NV and
1041    MAX_PROGRAM_TEXTURE_GATHER_OFFSET_NV.
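
    The run-time handling of offsets taken from a vector operand can be
    sketched as follows.  The names are hypothetical; <min_off> and
    <max_off> stand in for whichever pair of limits applies to the
    instruction (the TEXEL_OFFSET limits, or the TEXTURE_GATHER_OFFSET
    limits for TXGO).

```c
#include <stdbool.h>

/* Hypothetical grouping of targets by the offset components they use. */
typedef enum { OFFS_1D, OFFS_2D, OFFS_3D } OffsetTargetClass;

static bool in_range(int v, int lo, int hi) { return v >= lo && v <= hi; }

/* Returns true if the texture lookup has defined results for the offset
 * operand (x, y, z, w).  Ignored components never affect the outcome. */
static bool var_offset_defined(OffsetTargetClass cls,
                               int x, int y, int z, int w,
                               int min_off, int max_off)
{
    (void)w;                                  /* W is always ignored      */
    bool ok = in_range(x, min_off, max_off);
    if (cls != OFFS_1D)                       /* Y is ignored for 1D-like */
        ok = ok && in_range(y, min_off, max_off);
    if (cls == OFFS_3D)                       /* Z is used only for 3D    */
        ok = ok && in_range(z, min_off, max_off);
    return ok;
}
```

    Unlike the constant-offset case, a failing check here does not prevent
    the program from loading; it only means the lookup result is undefined.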
1042
1043
1044    (Modify language describing how the check for using multiple targets on a
1045     single texture image unit works, to account for texture array variables
1046     where a single instruction may access one of multiple textures and the
1047     texture used is not known when the program is loaded.)
1048
1049    A program will fail to load if it attempts to sample from multiple texture
1050    targets (including the SHADOW pseudo-targets) on the same texture image
1051    unit.  For example, a program containing any two of the following
1052    instructions will fail to load:
1053
1054      TEX out, coord, texture[0], 1D;
1055      TEX out, coord, texture[0], 2D;
1056      TEX out, coord, texture[0], ARRAY2D;
1057      TEX out, coord, texture[0], SHADOW2D;
1058      TEX out, coord, texture[0], 3D;
1059
1060    For the purposes of this test, sampling using a texture variable declared
1061    as an array is treated as though all texture image units bound to the
1062    variable were accessed.  A program containing the following
1063    instructions would fail to load:
1064
1065      TEXTURE textures[] = { texture[0..3] };
1066      TEX out, coord, textures[2], 2D;     # acts as if all textures are used
1067      TEX out, coord, texture[1], 3D;
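
    One way an assembler might implement this test is to record the target
    used on each texture image unit and to treat an access through an array
    variable as an access to every unit bound to that variable.  The sketch
    below is an assumption about a possible implementation; all names and
    the unit count are hypothetical.

```c
#include <stdbool.h>

#define MAX_UNITS 32   /* assumed unit count, for illustration only */

typedef enum { TGT_NONE = 0, TGT_1D, TGT_2D, TGT_3D } Target;

typedef struct { Target used[MAX_UNITS]; } UnitTargets;

/* Record that units first..last (inclusive) are sampled with target t.
 * For a plain "texture[n]" access, first == last == n.  For an access
 * through an array variable, pass the full range bound to the variable.
 * Returns false if any unit was already used with a different target. */
static bool record_access(UnitTargets *ut, int first, int last, Target t)
{
    for (int u = first; u <= last; u++) {
        if (ut->used[u] != TGT_NONE && ut->used[u] != t)
            return false;
        ut->used[u] = t;
    }
    return true;
}

/* Mirrors the array example above: textures[] is bound to units 0..3 and
 * sampled as 2D, then texture[other_unit] is sampled as 3D. */
static bool example_program_loads(int other_unit)
{
    UnitTargets ut = { { TGT_NONE } };
    if (!record_access(&ut, 0, 3, TGT_2D))
        return false;
    return record_access(&ut, other_unit, other_unit, TGT_3D);
}
```

    Here example_program_loads(1) reports a failure, matching the example
    above, while a 3D access on a unit outside the array's range, such as
    unit 4, does not conflict.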
1068
1069    (Add language describing texture gather component selection)
1070
1071    The TXG and TXGO instructions provide the ability to assemble a
1072    four-component vector by taking the value of a single component of a
1073    multi-component texture from each of four texels.  The component selected
1074    is identified by the <texImageUnitComp> grammar rule.  Component selection
1075    is not supported for any other instruction, and a program will fail to
1076    load if <texImageUnitComp> is matched for any texture instruction other
1077    than TXG or TXGO.
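
    Component selection for a gather can be pictured as follows.  This is
    a sketch only:  the types, names, and sample data are hypothetical, and
    the order in which the four texels map to the result components is
    defined by the TXG and TXGO instruction descriptions, not by this
    example, which simply uses array order.

```c
typedef struct { float c[4]; } Texel;         /* one four-component texel */
typedef struct { float x, y, z, w; } Vec4;

/* Build a four-component result by taking component comp (0..3) from
 * each of four texels. */
static Vec4 gather_component(const Texel t[4], int comp)
{
    Vec4 r = { t[0].c[comp], t[1].c[comp], t[2].c[comp], t[3].c[comp] };
    return r;
}

/* Made-up sample footprint used only to illustrate the selection. */
static const Texel demo_texels[4] = {
    { { 10.0f, 11.0f, 12.0f, 13.0f } },
    { { 20.0f, 21.0f, 22.0f, 23.0f } },
    { { 30.0f, 31.0f, 32.0f, 33.0f } },
    { { 40.0f, 41.0f, 42.0f, 43.0f } },
};
```

    Selecting component 1 from the sample footprint produces
    (11, 21, 31, 41).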
1078
1079
1080    Add New Section 2.X.4.5, Program Memory Access
1081
1082    Programs may load from or store to buffer object memory via the ATOM
1083    (atomic global memory operation), LDC (load constant), LOAD (global load),
1084    and STORE (global store) instructions.
1085
1086    Load instructions read 8, 16, 32, 64, 128, or 256 bits of data from a
1087    source address to produce a four-component vector, according to the
1088    storage modifier specified with the instruction.  The storage modifier has
1089    three parts:
1090
1091      - a base data type, "F", "S", or "U", specifying that the instruction
1092        fetches floating-point, signed integer, or unsigned integer values,
1093        respectively;
1094
1095      - a component size, specifying that the components fetched by the
1096        instruction have 8, 16, 32, or 64 bits; and
1097
1098      - an optional component count, where "X2" and "X4" indicate that two or
1099        four components be fetched, and no count indicates a single component
1100        fetch.
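
    The three-part structure of a storage modifier can be illustrated with
    a small parser.  This is a sketch under stated assumptions:  the type
    and function names are hypothetical, and the set of accepted modifiers
    is taken from the storage modifier list in Section 2.X.4.1 (8- and
    16-bit sizes are accepted only for single integer components).

```c
#include <stdbool.h>

typedef struct {
    char base;   /* 'F', 'S', or 'U'   */
    int  bits;   /* 8, 16, 32, or 64   */
    int  count;  /* 1, 2, or 4         */
} StorageModifier;

/* Parse a modifier such as "U32X4".  Returns false if the string is not
 * one of the listed storage modifiers. */
static bool parse_storage_modifier(const char *s, StorageModifier *out)
{
    char base = s[0];
    if (base != 'F' && base != 'S' && base != 'U')
        return false;
    s++;

    int bits = 0;                                /* component size */
    while (*s >= '0' && *s <= '9' && bits < 1000)
        bits = bits * 10 + (*s++ - '0');

    int count = 1;                               /* optional X2 or X4 */
    if (s[0] == 'X' && (s[1] == '2' || s[1] == '4')) {
        count = s[1] - '0';
        s += 2;
    }
    if (*s != '\0')
        return false;

    if (bits != 8 && bits != 16 && bits != 32 && bits != 64)
        return false;
    if (bits < 32 && (base == 'F' || count != 1))
        return false;

    out->base = base;
    out->bits = bits;
    out->count = count;
    return true;
}

/* Total number of bits accessed, or 0 for an unrecognized modifier. */
static int storage_modifier_bits(const char *s)
{
    StorageModifier m;
    return parse_storage_modifier(s, &m) ? m.bits * m.count : 0;
}
```

    For example, "U32X4" accesses 128 bits and "F64X4" accesses 256 bits,
    the largest access size listed above.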
1101
1102    When the storage modifier specifies that fewer than four components should
1103    be fetched, remaining components are filled with zeroes.  When performing
1104    an atomic memory operation (ATOM) or a global load (LOAD), the GPU address
1105    is specified as an instruction operand.  When performing a constant buffer
1106    load (LDC), the GPU address is derived by adding the base address of the
1107    bound buffer object to an offset specified as an instruction operand.
1108    Given a GPU address <address> and a storage modifier <modifier>, the
1109    memory load can be described by the following code:
1110
1111      result_t_vec BufferMemoryLoad(char *address, OpModifier modifier)
1112      {
1113        result_t_vec result = { 0, 0, 0, 0 };
1114        switch (modifier) {
1115        case F32:
1116            result.x = ((float32_t *)address)[0];
1117            break;
1118        case F32X2:
1119            result.x = ((float32_t *)address)[0];
1120            result.y = ((float32_t *)address)[1];
1121            break;
1122        case F32X4:
1123            result.x = ((float32_t *)address)[0];
1124            result.y = ((float32_t *)address)[1];
1125            result.z = ((float32_t *)address)[2];
1126            result.w = ((float32_t *)address)[3];
1127            break;
1128        case F64:
1129            result.x = ((float64_t *)address)[0];
1130            break;
1131        case F64X2:
1132            result.x = ((float64_t *)address)[0];
1133            result.y = ((float64_t *)address)[1];
1134            break;
1135        case F64X4:
1136            result.x = ((float64_t *)address)[0];
1137            result.y = ((float64_t *)address)[1];
1138            result.z = ((float64_t *)address)[2];
1139            result.w = ((float64_t *)address)[3];
1140            break;
1141        case S8:
1142            result.x = ((int8_t *)address)[0];
1143            break;
1144        case S16:
1145            result.x = ((int16_t *)address)[0];
1146            break;
1147        case S32:
1148            result.x = ((int32_t *)address)[0];
1149            break;
1150        case S32X2:
1151            result.x = ((int32_t *)address)[0];
1152            result.y = ((int32_t *)address)[1];
1153            break;
1154        case S32X4:
1155            result.x = ((int32_t *)address)[0];
1156            result.y = ((int32_t *)address)[1];
1157            result.z = ((int32_t *)address)[2];
1158            result.w = ((int32_t *)address)[3];
1159            break;
1160        case S64:
1161            result.x = ((int64_t *)address)[0];
1162            break;
1163        case S64X2:
1164            result.x = ((int64_t *)address)[0];
1165            result.y = ((int64_t *)address)[1];
1166            break;
1167        case S64X4:
1168            result.x = ((int64_t *)address)[0];
1169            result.y = ((int64_t *)address)[1];
1170            result.z = ((int64_t *)address)[2];
1171            result.w = ((int64_t *)address)[3];
1172            break;
1173        case U8:
1174            result.x = ((uint8_t *)address)[0];
1175            break;
1176        case U16:
1177            result.x = ((uint16_t *)address)[0];
1178            break;
1179        case U32:
1180            result.x = ((uint32_t *)address)[0];
1181            break;
1182        case U32X2:
1183            result.x = ((uint32_t *)address)[0];
1184            result.y = ((uint32_t *)address)[1];
1185            break;
1186        case U32X4:
1187            result.x = ((uint32_t *)address)[0];
1188            result.y = ((uint32_t *)address)[1];
1189            result.z = ((uint32_t *)address)[2];
1190            result.w = ((uint32_t *)address)[3];
1191            break;
1192        case U64:
1193            result.x = ((uint64_t *)address)[0];
1194            break;
1195        case U64X2:
1196            result.x = ((uint64_t *)address)[0];
1197            result.y = ((uint64_t *)address)[1];
1198            break;
1199        case U64X4:
1200            result.x = ((uint64_t *)address)[0];
1201            result.y = ((uint64_t *)address)[1];
1202            result.z = ((uint64_t *)address)[2];
1203            result.w = ((uint64_t *)address)[3];
1204            break;
1205        }
1206        return result;
1207      }
1208
1209    Store instructions write the contents of a four-component vector operand
1210    into 8, 16, 32, 64, 128, or 256 bits, according to the storage modifier
1211    specified with the instruction.  The storage modifiers supported by stores
1212    are identical to those supported for loads.  Given a GPU address
1213    <address>, a vector operand <operand> containing the data to be stored,
1214    and a storage modifier <modifier>, the memory store can be described by
1215    the following code:
1216
1217      void BufferMemoryStore(char *address, operand_t_vec operand,
1218                             OpModifier modifier)
1219      {
1220        switch (modifier) {
1221        case F32:
1222            ((float32_t *)address)[0] = operand.x;
1223            break;
1224        case F32X2:
1225            ((float32_t *)address)[0] = operand.x;
1226            ((float32_t *)address)[1] = operand.y;
1227            break;
1228        case F32X4:
1229            ((float32_t *)address)[0] = operand.x;
1230            ((float32_t *)address)[1] = operand.y;
1231            ((float32_t *)address)[2] = operand.z;
1232            ((float32_t *)address)[3] = operand.w;
1233            break;
1234        case F64:
1235            ((float64_t *)address)[0] = operand.x;
1236            break;
1237        case F64X2:
1238            ((float64_t *)address)[0] = operand.x;
1239            ((float64_t *)address)[1] = operand.y;
1240            break;
1241        case F64X4:
1242            ((float64_t *)address)[0] = operand.x;
1243            ((float64_t *)address)[1] = operand.y;
1244            ((float64_t *)address)[2] = operand.z;
1245            ((float64_t *)address)[3] = operand.w;
1246            break;
1247        case S8:
1248            ((int8_t *)address)[0] = operand.x;
1249            break;
1250        case S16:
1251            ((int16_t *)address)[0] = operand.x;
1252            break;
1253        case S32:
1254            ((int32_t *)address)[0] = operand.x;
1255            break;
1256        case S32X2:
1257            ((int32_t *)address)[0] = operand.x;
1258            ((int32_t *)address)[1] = operand.y;
1259            break;
1260        case S32X4:
1261            ((int32_t *)address)[0] = operand.x;
1262            ((int32_t *)address)[1] = operand.y;
1263            ((int32_t *)address)[2] = operand.z;
1264            ((int32_t *)address)[3] = operand.w;
1265            break;
1266        case S64:
1267            ((int64_t *)address)[0] = operand.x;
1268            break;
1269        case S64X2:
1270            ((int64_t *)address)[0] = operand.x;
1271            ((int64_t *)address)[1] = operand.y;
1272            break;
1273        case S64X4:
1274            ((int64_t *)address)[0] = operand.x;
1275            ((int64_t *)address)[1] = operand.y;
1276            ((int64_t *)address)[2] = operand.z;
1277            ((int64_t *)address)[3] = operand.w;
1278            break;
1279        case U8:
1280            ((uint8_t *)address)[0] = operand.x;
1281            break;
1282        case U16:
1283            ((uint16_t *)address)[0] = operand.x;
1284            break;
1285        case U32:
1286            ((uint32_t *)address)[0] = operand.x;
1287            break;
1288        case U32X2:
1289            ((uint32_t *)address)[0] = operand.x;
1290            ((uint32_t *)address)[1] = operand.y;
1291            break;
1292        case U32X4:
1293            ((uint32_t *)address)[0] = operand.x;
1294            ((uint32_t *)address)[1] = operand.y;
1295            ((uint32_t *)address)[2] = operand.z;
1296            ((uint32_t *)address)[3] = operand.w;
1297            break;
1298        case U64:
1299            ((uint64_t *)address)[0] = operand.x;
1300            break;
1301        case U64X2:
1302            ((uint64_t *)address)[0] = operand.x;
1303            ((uint64_t *)address)[1] = operand.y;
1304            break;
1305        case U64X4:
1306            ((uint64_t *)address)[0] = operand.x;
1307            ((uint64_t *)address)[1] = operand.y;
1308            ((uint64_t *)address)[2] = operand.z;
1309            ((uint64_t *)address)[3] = operand.w;
1310            break;
1311        }
1312      }
1313
1314    If a global load or store accesses a memory address that does not
1315    correspond to a buffer object made resident by MakeBufferResidentNV, the
1316    results of the operation are undefined and may produce a fault resulting
1317    in application termination.  If a load accesses a buffer object made
1318    resident with an <access> parameter of WRITE_ONLY, or if a store accesses
1319    a buffer object made resident with an <access> parameter of READ_ONLY, the
1320    results of the operation are also undefined and may lead to application
1321    termination.
1322
1323    The address used for global memory loads or stores or offset used for
1324    constant buffer loads must be aligned to the fetch size corresponding to
1325    the storage opcode modifier.  For S8 and U8, the offset has no alignment
1326    requirements.  For S16 and U16, the offset must be a multiple of two basic
1327    machine units.  For F32, S32, and U32, the offset must be a multiple of
1328    four.  For F32X2, F64, S32X2, S64, U32X2, and U64, the offset must be a
1329    multiple of eight.  For F32X4, F64X2, S32X4, S64X2, U32X4, and U64X2, the
1330    offset must be a multiple of sixteen.  For F64X4, S64X4, and U64X4, the
1331    offset must be a multiple of thirty-two.  If an offset is not correctly
1332    aligned, the values returned by a buffer memory load will be undefined,
1333    and the effects of a buffer memory store will also be undefined.
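
    The alignment rules above can be sketched as a small lookup in C.  This
    is an illustration only: the StorageModifier enum, fetch_size(), and
    is_aligned() are hypothetical names that are not defined by this
    extension, and one basic machine unit is assumed to be one byte.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical enum mirroring the storage modifiers named in the text. */
typedef enum {
    MOD_F32, MOD_F32X2, MOD_F32X4, MOD_F64, MOD_F64X2, MOD_F64X4,
    MOD_S8,  MOD_S16,   MOD_S32,   MOD_S32X2, MOD_S32X4,
    MOD_S64, MOD_S64X2, MOD_S64X4,
    MOD_U8,  MOD_U16,   MOD_U32,   MOD_U32X2, MOD_U32X4,
    MOD_U64, MOD_U64X2, MOD_U64X4
} StorageModifier;

/* Fetch size in basic machine units (assumed to be bytes). */
static size_t fetch_size(StorageModifier m)
{
    switch (m) {
    case MOD_S8:    case MOD_U8:
        return 1;
    case MOD_S16:   case MOD_U16:
        return 2;
    case MOD_F32:   case MOD_S32:   case MOD_U32:
        return 4;
    case MOD_F32X2: case MOD_F64:   case MOD_S32X2:
    case MOD_S64:   case MOD_U32X2: case MOD_U64:
        return 8;
    case MOD_F32X4: case MOD_F64X2: case MOD_S32X4:
    case MOD_S64X2: case MOD_U32X4: case MOD_U64X2:
        return 16;
    case MOD_F64X4: case MOD_S64X4: case MOD_U64X4:
        return 32;
    }
    return 0;  /* not a known modifier */
}

/* True if the address (or constant buffer offset) satisfies the
   alignment requirement of the given storage modifier. */
static bool is_aligned(uint64_t address, StorageModifier m)
{
    size_t size = fetch_size(m);
    return size != 0 && (address % size) == 0;
}
```

    For example, an offset of 24 is acceptable for F32X2 (a multiple of
    eight) but not for F32X4 (not a multiple of sixteen).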
1334
1335    Global and image memory accesses in assembly programs are weakly ordered
1336    and may require synchronization relative to other operations in the OpenGL
1337    pipeline.  The ordering and synchronization mechanisms described in
1338    Section 2.14.X (of the EXT_shader_image_load_store extension
1339    specification) for shaders using the OpenGL Shading Language apply equally
1340    to loads, stores, and atomics performed in assembly programs.
1341
1342
1343    Modify Section 2.X.6.Y of the NV_fragment_program4 specification
1344
1345    (add new option section)
1346
1347    + Early Per-Fragment Tests (NV_early_fragment_tests)
1348
1349    If a fragment program specifies the "NV_early_fragment_tests" option, the
1350    depth and stencil tests will be performed prior to fragment program
1351    invocation, as described in Section 3.X.
1352
1353
1354    Modify Section 2.X.7.Y of the NV_geometry_program4 specification
1355
1356    (Simply add the new input primitive type "PATCHES" to the list of tokens
1357     allowed by the "PRIMITIVE_IN" declaration.)
1358
1359    - Input Primitive Type (PRIMITIVE_IN)
1360
1361    The PRIMITIVE_IN statement declares the type of primitives seen by a
1362    geometry program.  The single argument must be one of "POINTS", "LINES",
1363    "LINES_ADJACENCY", "TRIANGLES", "TRIANGLES_ADJACENCY", or "PATCHES".
1364
1365
1366    (Add a new optional program declaration to declare a geometry shader that
1367     is run <N> times per primitive.)
1368
1369    Geometry programs support three types of mandatory declaration statements,
1370    as described below.  Each of the three must be included exactly once in
1371    the geometry program.
1372
1373    ...
1374
1375    Geometry programs also support one optional declaration statement.
1376
1377    - Program Invocation Count (INVOCATIONS)
1378
1379    The INVOCATIONS statement declares the number of times the geometry
1380    program is run on each primitive processed.  The single argument must be a
1381    positive integer less than or equal to the value of the
1382    implementation-dependent limit MAX_GEOMETRY_PROGRAM_INVOCATIONS_NV.  Each
1383    invocation of the geometry program will have the same inputs and outputs
1384    except for the built-in input variable "primitive.invocation".  This
1385    variable will be an integer between 0 and <n>-1, where <n> is the declared
1386    number of invocations.  If omitted, the program invocation count is one.
1387
1388
1389    Section 2.X.8.Z, ATOM:  Atomic Global Memory Operation
1390
1391    The ATOM instruction performs an atomic global memory operation by reading
1392    from memory at the address specified by the second unsigned integer scalar
1393    operand, computing a new value based on the value read from memory and the
1394    first (vector) operand, and then writing the result back to the same
1395    memory address.  The memory transaction is atomic, guaranteeing that no
1396    other write to the memory accessed will occur between the time it is read
1397    and written by the ATOM instruction.  The result of the ATOM instruction
1398    is the scalar value read from memory.
1399
1400    The ATOM instruction has two required instruction modifiers.  The atomic
1401    modifier specifies the type of operation to be performed.  The storage
1402    modifier specifies the size and data type of the operand read from memory
1403    and the base data type of the operation used to compute the value to be
1404    written to memory.
1405
1406      atomic     storage
1407      modifier   modifiers            operation
1408      --------   ------------------   --------------------------------------
1409       ADD       U32, S32, U64        compute a sum
1410       MIN       U32, S32             compute minimum
1411       MAX       U32, S32             compute maximum
1412       IWRAP     U32                  increment memory, wrapping at operand
1413       DWRAP     U32                  decrement memory, wrapping at operand
1414       AND       U32, S32             compute bit-wise AND
1415       OR        U32, S32             compute bit-wise OR
1416       XOR       U32, S32             compute bit-wise XOR
1417       EXCH      U32, S32, U64        exchange memory with operand
1418       CSWAP     U32, S32, U64        compare-and-swap
1419
1420     Table X.Y, Supported atomic and storage modifiers for the ATOM
1421     instruction.
1422
1423    Not all storage modifiers are supported by ATOM, and the set of modifiers
1424    allowed for any given instruction depends on the atomic modifier
1425    specified.  Table X.Y enumerates the set of atomic modifiers supported by
1426    the ATOM instruction, and the storage modifiers allowed for each.
1427
1428      tmp0 = VectorLoad(op0);
1429      address = ScalarLoad(op1);
1430      result = BufferMemoryLoad(address, storageModifier);
1431      switch (atomicModifier) {
1432      case ADD:
1433        writeval = tmp0.x + result;
1434        break;
1435      case MIN:
1436        writeval = min(tmp0.x, result);
1437        break;
1438      case MAX:
1439        writeval = max(tmp0.x, result);
1440        break;
1441      case IWRAP:
1442        writeval = (result >= tmp0.x) ? 0 : result+1;
1443        break;
1444      case DWRAP:
1445        writeval = (result == 0 || result > tmp0.x) ? tmp0.x : result-1;
1446        break;
1447      case AND:
1448        writeval = tmp0.x & result;
1449        break;
1450      case OR:
1451        writeval = tmp0.x | result;
1452        break;
1453      case XOR:
1454        writeval = tmp0.x ^ result;
1455        break;
1456      case EXCH:
1457        writeval = tmp0.x; break;
1458      case CSWAP:
1459        if (result == tmp0.x) {
1460          writeval = tmp0.y;
1461        } else {
1462          return result;  // no memory store
1463        }
1464        break;
1465      }
1466      BufferMemoryStore(address, writeval, storageModifier);
1467
1468    ATOM performs a scalar atomic operation.  The <y>, <z>, and <w> components
1469    of the result vector are undefined.
1470
1471    ATOM supports no base data type modifiers, but requires exactly one
1472    storage modifier.  The base data types of the result vector, and the first
1473    (vector) operand are derived from the storage modifier.  The second
1474    operand is always interpreted as a scalar unsigned integer.
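
    As a rough host-side illustration of the IWRAP, DWRAP, and CSWAP
    semantics for the U32 storage modifier, the following sketch emulates
    them with C11 atomics.  The function names are hypothetical, and the
    compare-exchange retry loop is only one way to model the single atomic
    transaction described above; it is not a statement about how any GPU
    implements ATOM.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Value ATOM.IWRAP writes, given the value read and the wrap operand. */
static uint32_t iwrap_next(uint32_t old, uint32_t wrap)
{
    return (old >= wrap) ? 0u : old + 1u;
}

/* Value ATOM.DWRAP writes, given the value read and the wrap operand. */
static uint32_t dwrap_next(uint32_t old, uint32_t wrap)
{
    return (old == 0u || old > wrap) ? wrap : old - 1u;
}

/* Emulates ATOM.IWRAP.U32; returns the value read from memory. */
static uint32_t atom_iwrap_u32(_Atomic uint32_t *mem, uint32_t wrap)
{
    uint32_t old = atomic_load(mem);
    /* On failure the call refreshes 'old' with the current contents,
       so the new value is recomputed on every retry. */
    while (!atomic_compare_exchange_weak(mem, &old, iwrap_next(old, wrap))) {
    }
    return old;
}

/* Emulates ATOM.DWRAP.U32; returns the value read from memory. */
static uint32_t atom_dwrap_u32(_Atomic uint32_t *mem, uint32_t wrap)
{
    uint32_t old = atomic_load(mem);
    while (!atomic_compare_exchange_weak(mem, &old, dwrap_next(old, wrap))) {
    }
    return old;
}

/* Emulates ATOM.CSWAP.U32: 'compare' plays the role of tmp0.x and
   'swap' the role of tmp0.y.  Memory is written only on a match. */
static uint32_t atom_cswap_u32(_Atomic uint32_t *mem,
                               uint32_t compare, uint32_t swap)
{
    uint32_t expected = compare;
    atomic_compare_exchange_strong(mem, &expected, swap);
    return expected;  /* always the value that was in memory */
}
```

    With memory holding 2 and a wrap operand of 3, two successive IWRAP
    operations return 2 and then 3, leaving 0 in memory.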
1475
1476
1477    Section 2.X.8.Z, BFE:  Bitfield Extract
1478
1479    The BFE instruction performs a component-wise
1480    bit extraction of the second vector operand to yield a result vector.  For
1481    each component, the number of bits extracted is given by the x component
1482    of the first vector operand, and the bit number of the least significant
1483    bit extracted is given by the y component of the first vector operand.
1484
1485      tmp0 = VectorLoad(op0);
1486      tmp1 = VectorLoad(op1);
1487      result.x = BitfieldExtract(tmp0.x, tmp0.y, tmp1.x);
1488      result.y = BitfieldExtract(tmp0.x, tmp0.y, tmp1.y);
1489      result.z = BitfieldExtract(tmp0.x, tmp0.y, tmp1.z);
1490      result.w = BitfieldExtract(tmp0.x, tmp0.y, tmp1.w);
1491
1492    If the number of bits to extract is zero, zero is returned.  The results
1493    of bitfield extraction are undefined
1494
1495      * if the number of bits to extract or the starting offset is negative,
1496      * if the sum of the number of bits to extract and the starting offset
1497        is greater than the total number of bits in the operand/result, or
1498      * if the starting offset is greater than or equal to the total number of
1499        bits in the operand/result.
1500
1501      Type BitfieldExtract(Type bits, Type offset, Type value)
1502      {
1503        if (bits < 0 || offset < 0 || offset >= TotalBits(Type) ||
1504            bits + offset > TotalBits(Type)) {
1505          /* result undefined */
1506        } else if (bits == 0) {
1507          return 0;
1508        } else {
1509          return (value << (TotalBits(Type) - (bits+offset))) >>
1510                   (TotalBits(Type) - bits);
1511        }
1512      }
1513
1514    BFE supports only signed and unsigned integer data type modifiers.  For
1515    signed integer data types, the extracted value is sign-extended (i.e.,
1516    filled with ones if the most significant bit extracted is one and filled
1517    with zeroes otherwise).  For unsigned integer data types, the extracted
1518    value is zero-extended.
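
    A minimal C sketch of the extraction for 32-bit components follows.
    The names bfe_u32() and bfe_s32() are hypothetical.  Returning zero for
    the cases the text leaves undefined is an arbitrary choice made so the
    sketch has well-defined behavior, and the final conversion to int32_t
    assumes the usual two's complement wrap-around.

```c
#include <stdint.h>

/* Unsigned extract (zero-extended).  Returns 0 when bits == 0 and,
   as an arbitrary choice, in the undefined cases as well. */
static uint32_t bfe_u32(int bits, int offset, uint32_t value)
{
    if (bits <= 0 || offset < 0 || offset >= 32 || bits + offset > 32)
        return 0;
    uint32_t shifted = value >> offset;
    if (bits == 32)
        return shifted;  /* offset is necessarily 0 here */
    return shifted & ((UINT32_C(1) << bits) - 1u);
}

/* Signed extract: the field is sign-extended from its top bit. */
static int32_t bfe_s32(int bits, int offset, int32_t value)
{
    uint32_t field = bfe_u32(bits, offset, (uint32_t)value);
    if (bits <= 0 || bits >= 32)
        return (int32_t)field;
    uint32_t sign = UINT32_C(1) << (bits - 1);
    /* Flip the sign bit, then subtract it back out: this fills the
       upper bits with copies of the field's most significant bit. */
    return (int32_t)((field ^ sign) - sign);
}
```

    Extracting the four bits starting at bit 12 from 0x0000F000 yields 15
    as an unsigned value and -1 as a signed value.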
1519
1520
1521    Section 2.X.8.Z, BFI:  Bitfield Insert
1522
1523    The BFI instruction performs a component-wise bitfield insertion of the
1524    second vector operand into the third vector operand to yield a result
1525    vector.  For each component, the <n> least significant bits are extracted
1526    from the corresponding component of the second vector operand, where <n>
1527    is given by the x component of the first vector operand.  Those bits are
1528    merged into the corresponding component of the third vector operand,
1529    replacing bits <b> through <b>+<n>-1, to produce the result.  The bit
1530    offset <b> is specified by the y component of the first operand.
1531
1532      tmp0 = VectorLoad(op0);
1533      tmp1 = VectorLoad(op1);
1534      tmp2 = VectorLoad(op2);
1535      result.x = BitfieldInsert(tmp0.x, tmp0.y, tmp1.x, tmp2.x);
1536      result.y = BitfieldInsert(tmp0.x, tmp0.y, tmp1.y, tmp2.y);
1537      result.z = BitfieldInsert(tmp0.x, tmp0.y, tmp1.z, tmp2.z);
1538      result.w = BitfieldInsert(tmp0.x, tmp0.y, tmp1.w, tmp2.w);
1539
1540    The results of bitfield insertion are undefined
1541
1542      * if the number of bits to insert or the starting offset is negative,
1543      * if the sum of the number of bits to insert and the starting offset
1544        is greater than the total number of bits in the operand/result, or
1545      * if the starting offset is greater than or equal to the total number of
1546        bits in the operand/result.
1547
1548      Type BitfieldInsert(Type bits, Type offset, Type src, Type dst)
1549      {
1550        if (bits < 0 || offset < 0 || offset >= TotalBits(Type) ||
1551            bits + offset > TotalBits(Type)) {
1552          /* result undefined */
1553        } else if (bits == TotalBits(Type)) {
1554          return src;
1555        } else {
1556          Type mask = ((1 << bits) - 1) << offset;
1557          return ((src << offset) & mask) | (dst & (~mask));
1558        }
1559      }
1560
1561    BFI supports only signed and unsigned integer data type modifiers.  If no
1562    type modifier is specified, the operand and result vectors are treated as
1563    signed integers.
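
    The insertion can be sketched in C for 32-bit unsigned components as
    follows.  The name bfi_u32() is hypothetical, and returning <dst>
    unchanged in the undefined cases is an arbitrary choice that keeps the
    sketch well defined.

```c
#include <stdint.h>

/* Insert the 'bits' least significant bits of src into dst, replacing
   bits offset .. offset+bits-1 of dst. */
static uint32_t bfi_u32(int bits, int offset, uint32_t src, uint32_t dst)
{
    if (bits < 0 || offset < 0 || offset >= 32 || bits + offset > 32)
        return dst;  /* undefined in the text; arbitrary choice here */
    if (bits == 32)
        return src;  /* avoids shifting a 32-bit value by 32 */
    uint32_t mask = ((UINT32_C(1) << bits) - 1u) << offset;
    return ((src << offset) & mask) | (dst & ~mask);
}
```

    For example, inserting the low eight bits of 0xAB at bit offset 8 into
    0x11223344 produces 0x1122AB44.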
1564
1565
1566    Section 2.X.8.Z, BFR:  Bitfield Reverse
1567
1568    The BFR instruction performs a component-wise bit reversal of the single
1569    vector operand to produce a result vector.  Bit reversal is performed by
1570    exchanging the most and least significant bits, the second-most and
1571    second-least significant bits, and so on.
1572
1573      tmp0 = VectorLoad(op0);
1574      result.x = BitReverse(tmp0.x);
1575      result.y = BitReverse(tmp0.y);
1576      result.z = BitReverse(tmp0.z);
1577      result.w = BitReverse(tmp0.w);
1578
1579    BFR supports only signed and unsigned integer data type modifiers.  If no
1580    type modifier is specified, the operand and result vectors are treated as
1581    signed integers.
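
    One common way to reverse the bits of a 32-bit value is to swap
    progressively larger groups of bits.  The sketch below uses that
    approach; bit_reverse_u32() is a hypothetical name, and the text does
    not prescribe any particular algorithm.

```c
#include <stdint.h>

/* Reverse the order of the 32 bits of v. */
static uint32_t bit_reverse_u32(uint32_t v)
{
    v = ((v >> 1) & 0x55555555u) | ((v & 0x55555555u) << 1);  /* bits    */
    v = ((v >> 2) & 0x33333333u) | ((v & 0x33333333u) << 2);  /* pairs   */
    v = ((v >> 4) & 0x0F0F0F0Fu) | ((v & 0x0F0F0F0Fu) << 4);  /* nibbles */
    v = ((v >> 8) & 0x00FF00FFu) | ((v & 0x00FF00FFu) << 8);  /* bytes   */
    return (v >> 16) | (v << 16);                             /* halves  */
}
```

    Applying the function twice returns the original value.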
1582
1583
1584    Section 2.X.8.Z, BTC:  Bit Count
1585
1586    The BTC instruction performs a component-wise bit count of the single
1587    source vector to yield a result vector.  Each component of the result
1588    vector contains the number of one bits in the corresponding component of
1589    the source vector.
1590
1591      tmp0 = VectorLoad(op0);
1592      result.x = BitCount(tmp0.x);
1593      result.y = BitCount(tmp0.y);
1594      result.z = BitCount(tmp0.z);
1595      result.w = BitCount(tmp0.w);
1596
1597    BTC supports only signed and unsigned integer data type modifiers.  If no
1598    type modifier is specified, the operand and the result are treated as
1599    signed integers.
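
    A simple C sketch of the per-component count for 32-bit values is
    shown below.  bit_count_u32() is a hypothetical name, and the loop is
    one of several well-known ways to count one bits.

```c
#include <stdint.h>

/* Count the one bits in v by clearing the lowest set bit each pass. */
static int bit_count_u32(uint32_t v)
{
    int n = 0;
    while (v != 0) {
        v &= v - 1u;  /* clears the least significant one bit */
        n++;
    }
    return n;
}
```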
1600
1601
1602    Section 2.X.8.Z, BTFL:  Find Least Significant Bit
1603
1604    The BTFL instruction searches for the least significant one bit of each
1605    component of the single source vector, yielding a result vector comprising
1606    the bit number of the located bit for each component.
1607
1608      tmp0 = VectorLoad(op0);
1609      result.x = FindLSB(tmp0.x);
1610      result.y = FindLSB(tmp0.y);
1611      result.z = FindLSB(tmp0.z);
1612      result.w = FindLSB(tmp0.w);
1613
1614    BTFL supports only signed and unsigned integer data type modifiers.  For
1615    unsigned integer data types, the search will yield the bit number of the
1616    least significant one bit in each component, or the maximum integer (all
1617    bits are ones) if the source vector component is zero.  For signed data
1618    types, the search will yield the bit number of the least significant one
1619    bit in each component, or -1 if the source vector component is zero.  If
1620    no type modifier is specified, the operand and the result are treated as
1621    signed integers.
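
    The signed and unsigned results differ only in how "not found" is
    reported, which the following C sketch for 32-bit components tries to
    make concrete.  The function names are hypothetical.

```c
#include <stdint.h>

/* Bit number of the least significant one bit, or -1 if v is zero. */
static int find_lsb_core(uint32_t v)
{
    if (v == 0)
        return -1;
    int n = 0;
    while ((v & 1u) == 0) {
        v >>= 1;
        n++;
    }
    return n;
}

/* Signed form: a zero source yields -1. */
static int32_t btfl_s32(int32_t v)
{
    return find_lsb_core((uint32_t)v);
}

/* Unsigned form: a zero source yields the maximum integer (all ones),
   which is what -1 becomes when converted to uint32_t. */
static uint32_t btfl_u32(uint32_t v)
{
    return (uint32_t)find_lsb_core(v);
}
```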
1622
1623
1624    Section 2.X.8.Z, BTFM:  Find Most Significant Bit
1625
1626    The BTFM instruction searches for the most significant bit of each
1627    component of the single source vector, yielding a result vector comprising
1628    the bit number of the located bit for each component.
1629
1630      tmp0 = VectorLoad(op0);
1631      result.x = FindMSB(tmp0.x);
1632      result.y = FindMSB(tmp0.y);
1633      result.z = FindMSB(tmp0.z);
1634      result.w = FindMSB(tmp0.w);
1635
1636    BTFM supports only signed and unsigned integer data type modifiers.  For
1637    unsigned integer data types, the search will yield the bit number of the
1638    most significant one bit in each component, or the maximum integer (all
1639    bits are ones) if the source vector component is zero.  For signed data
1640    types, the search will yield the bit number of the most significant one
1641    bit if the source value is positive, the bit number of the most
1642    significant zero bit if the source value is negative, or -1 if the source
1643    value is zero.  If no type modifier is specified, the operand and the
1644    result are treated as signed integers.
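
    The sketch below illustrates the search for 32-bit components in C,
    using hypothetical function names.  For negative signed values it
    complements the bits so that the most significant zero bit is found.
    The text above does not say what a signed source of -1 (which has no
    zero bit) produces; this sketch returns -1 in that case, which is what
    the GLSL findMSB built-in does.

```c
#include <stdint.h>

/* Bit number of the most significant one bit, or -1 if v is zero. */
static int find_msb_core(uint32_t v)
{
    int n = -1;
    while (v != 0) {
        v >>= 1;
        n++;
    }
    return n;
}

/* Unsigned form: a zero source yields the maximum integer (all ones). */
static uint32_t btfm_u32(uint32_t v)
{
    return (uint32_t)find_msb_core(v);
}

/* Signed form: for negative values, look for the most significant
   zero bit by searching the complemented bit pattern. */
static int32_t btfm_s32(int32_t v)
{
    uint32_t bits = (uint32_t)v;
    if (v < 0)
        bits = ~bits;
    return find_msb_core(bits);  /* -1 for sources of 0 and -1 */
}
```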
1645
1646
1647    Section 2.X.8.Z, CVT:  Data Type Conversion
1648
1649    The CVT instruction converts each component of the single source vector
1650    from one specified data type to another to yield a result vector.
1651
1652      tmp0 = VectorLoad(op0);
1653      result = DataTypeConvert(tmp0);
1654
1655    The CVT instruction requires two storage modifiers.  The first specifies
1656    the data type of the result components; the second specifies the data type
1657    of the operand components.  The supported storage modifiers are F16, F32,
1658    F64, S8, S16, S32, S64, U8, U16, U32, and U64.  A storage modifier of
1659    "F16" indicates a source or destination that is treated as having a
1660    floating-point type, but whose sixteen least significant bits describe a
1661    16-bit floating-point value using the encoding provided in Section 2.1.2.
1662
1663    If the component size of the source register doesn't match the size of the
1664    specified operand data type, the source register components are first
1665    interpreted as a value with the same base data type as the operand and
1666    converted to the operand data type.  The operand components are then
1667    converted to the result data type.  Finally, if the component size of the
1668    destination register doesn't match the specified result data type, the
1669    result components are converted to values of the same base data type with
1670    a size matching the result register's component size.
1671
1672    Data type conversion is performed by first converting the source
1673    components to an infinite-precision value of the destination data type,
1674    and then converting to the result data type.  When converting between
1675    floating-point and integer values, integer values are never interpreted as
1676    being normalized to [0,1] or [-1,+1].  Converting the floating-point
1677    special values -INF, +INF, and NaN to integers will yield undefined
1678    results.
1679
1680    When converting from a non-integral floating-point value to an integer,
1681    one of the two integers closest in value to the floating-point value is
1682    chosen according to the rounding instruction modifier.  If "CEIL" or "FLR"
1683    is specified, the larger or smaller value, respectively, is chosen.  If
1684    "TRUNC" is specified, the value nearest to zero is chosen.  If "ROUND" is
1685    specified, if one integer is nearer in value to the original
1686    floating-point value, it is chosen; otherwise, the even integer is chosen.
1687    "ROUND" is used if no rounding modifier is specified.
1688
1689    When converting from the infinite-precision intermediate value to the
1690    destination data type:
1691
1692      * Floating-point values not exactly representable in the destination
1693        data type are rounded to one of the two nearest values in the destination
1694        type according to the rounding modifier.  Note that the results of
1695        float-to-float conversion are not automatically rounded to integer
1696        values, even if a rounding modifier such as CEIL or FLR is specified.
1697
1698      * Integer values are clamped to the closest value representable in the
1699        result data type if the "SAT" (saturation) modifier is specified.
1700
1701      * Integer values drop the most significant bits if the "SAT" modifier is
1702        not specified.
1703
1704    Negation and absolute value operators are not supported on the source
1705    operand; a program using such operators will fail to compile.
1706
1707    CVT supports no data type modifiers; the type of the operand and result
1708    vectors is fully specified by the required storage modifiers.
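
    To make the rounding and saturation rules concrete, here is a C sketch
    of a conversion from a 32-bit float to a signed 8-bit integer.  The
    Rounding enum and both function names are hypothetical.  The sketch
    assumes a finite input whose magnitude is small enough to fit in a
    64-bit integer; as noted above, conversions of -INF, +INF, and NaN are
    undefined and are not handled.

```c
#include <stdint.h>

typedef enum { RND_ROUND, RND_CEIL, RND_FLR, RND_TRUNC } Rounding;

/* Round a finite value to an integer according to the rounding modifier.
   Assumes |f| is well below 2^53 so every step is exact. */
static int64_t round_to_integer(double f, Rounding mode)
{
    int64_t t = (int64_t)f;       /* C conversion truncates toward zero */
    double frac = f - (double)t;  /* in (-1, 1), same sign as f */
    switch (mode) {
    case RND_TRUNC:
        return t;
    case RND_CEIL:
        return (frac > 0.0) ? t + 1 : t;
    case RND_FLR:
        return (frac < 0.0) ? t - 1 : t;
    case RND_ROUND: {
        double mag = (frac < 0.0) ? -frac : frac;
        int64_t away = (frac < 0.0) ? t - 1 : t + 1;
        if (mag > 0.5) return away;
        if (mag < 0.5) return t;
        return (t % 2 == 0) ? t : away;  /* tie: choose the even integer */
    }
    }
    return t;
}

/* Convert to S8: clamp if 'sat' is nonzero, otherwise drop the most
   significant bits and keep the low eight. */
static int8_t cvt_s8_f32(float f, Rounding mode, int sat)
{
    int64_t i = round_to_integer((double)f, mode);
    if (sat) {
        if (i > INT8_MAX) i = INT8_MAX;
        if (i < INT8_MIN) i = INT8_MIN;
        return (int8_t)i;
    }
    int low = (int)((uint64_t)i & 0xFFu);  /* low eight bits, 0..255 */
    return (int8_t)(low >= 128 ? low - 256 : low);
}
```

    With these definitions, 300.0 converts to 127 when saturation is
    requested and to 44 (the low eight bits of 0x12C) when it is not.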
1709
1710
1711    Section 2.X.8.Z, EMIT:  Emit Vertex
1712
1713    (Modify the description of the EMIT opcode to deal with the interaction
1714     with multiple vertex streams added by ARB_transform_feedback3.  For more
1715     information on vertex streams, see ARB_transform_feedback3.)
1716
1717    The EMIT instruction emits a new vertex to be added to the current output
1718    primitive for vertex stream zero.  The attributes of the emitted vertex
1719    are given by the current values of the vertex result variables.  After the
1720    EMIT instruction completes, a new vertex is started and all result
1721    variables become undefined.
1722
1723
1724    Section 2.X.8.Z, EMITS:  Emit Vertex to Stream
1725
1726    (Add new geometry program opcode; the EMITS instruction is not supported
1727     for any other program types.  For more information on vertex streams, see
1728     ARB_transform_feedback3.)
1729
1730    The EMITS instruction emits a new vertex to be added to the current output
1731    primitive for the vertex stream specified by the single signed integer
1732    scalar operand.  The attributes of the emitted vertex are given by the
1733    current values of the vertex result variables.  After the EMITS
1734    instruction completes, a new vertex is started and all result variables
1735    become undefined.
1736
1737    If the specified stream is negative or greater than or equal to the
1738    implementation-dependent number of vertex streams
1739    (MAX_VERTEX_STREAMS_NV), the results of the instruction are undefined.
1740
1741
1742    Section 2.X.8.Z, IPAC:  Interpolate at Centroid
1743
1744    The IPAC instruction generates a result vector by evaluating the fragment
1745    attribute named by the single vector operand at the centroid location.
1746    The result vector would be identical to the value obtained by a MOV
1747    instruction if the attribute variable were declared using the CENTROID
1748    modifier.
1749
1750    When interpolating an attribute variable with this instruction, the
1751    CENTROID and SAMPLE attribute variable modifiers are ignored.  The FLAT
1752    and NOPERSPECTIVE variable modifiers operate normally.
1753
1754     tmp0 = Interpolate(op0, x_pixel + x_centroid, y_pixel + y_centroid);
1755     result = tmp0;
1756
1757    IPAC supports only floating-point data type modifiers.  A program will
1758    fail to load if it contains an IPAC instruction whose single operand is
1759    not a fragment program attribute variable or matches the "fragment.facing"
1760    or "primitive.id" binding.
1761
1762
1763    Section 2.X.8.Z, IPAO:  Interpolate with Offset
1764
1765    The IPAO instruction generates a result vector by evaluating the fragment
1766    attribute named by the first vector operand at an offset from the pixel
1767    center given by the x and y components of the second vector operand.  The
1768    z and w components of the second vector operand are ignored.  The (x,y)
1769    position used for interpolating the attribute variable is obtained by
1770    adding the (x,y) offsets in the second vector operand to the (x,y)
1771    position of the pixel center.
1772
1773    The range of offsets supported by the IPAO instruction is
1774    implementation-dependent.  The position used to interpolate the attribute
1775    variable is undefined if the x or y component of the second operand is
1776    less than MIN_FRAGMENT_INTERPOLATION_OFFSET_NV or greater than
1777    MAX_FRAGMENT_INTERPOLATION_OFFSET_NV.  Additionally, the granularity of
1778    offsets may be limited.  The (x,y) value may be snapped to a fixed
1779    sub-pixel grid with the number of subpixel bits given by
1780    FRAGMENT_PROGRAM_INTERPOLATION_OFFSET_BITS_NV.
1781
1782    When interpolating an attribute variable with this instruction, the
1783    CENTROID and SAMPLE attribute variable modifiers are ignored.  The FLAT
1784    and NOPERSPECTIVE variable modifiers operate normally.
1785
1786     tmp1 = VectorLoad(op1);
1787     tmp0 = Interpolate(op0, x_pixel + tmp1.x, y_pixel + tmp1.y);
1788     result = tmp0;
1789
1790    IPAO supports only floating-point data type modifiers.  A program will
1791    fail to load if it contains an IPAO instruction whose first operand is not
1792    a fragment program attribute variable or matches the "fragment.facing" or
1793    "primitive.id" binding.
1794
1795
1796    Section 2.X.8.Z, IPAS:  Interpolate at Sample Location
1797
1798    The IPAS instruction generates a result vector by evaluating the fragment
1799    attribute named by the first vector operand at the location of the
1800    pixel's sample whose sample number is given by the second integer scalar
1801    operand.  If multisample buffers are not available (SAMPLE_BUFFERS is
1802    zero), the attribute will be evaluated at the pixel center.  If the sample
1803    number given by the second operand does not exist, the position used to
1804    interpolate the attribute is undefined.
1805
1806    When interpolating an attribute variable with this instruction, the
1807    CENTROID and SAMPLE attribute variable modifiers are ignored.  The FLAT
1808    and NOPERSPECTIVE variable modifiers operate normally.
1809
1810     sample = ScalarLoad(op1);
1811     tmp1 = SampleOffset(sample);
1812     tmp0 = Interpolate(op0, x_pixel + tmp1.x, y_pixel + tmp1.y);
1813     result = tmp0;
1814
1815    IPAS supports only floating-point data type modifiers.  A program will
1816    fail to load if it contains an IPAS instruction whose first operand is not
1817    a fragment program attribute variable or matches the "fragment.facing" or
1818    "primitive.id" binding.
1819
1820
1821    Section 2.X.8.Z, LDC:  Load from Constant Buffer
1822
1823    The LDC instruction loads a vector operand from a buffer object to yield a
1824    result vector.  The operand used for the LDC instruction must correspond
1825    to a parameter buffer variable declared using the "CBUFFER" statement; a
1826    program will fail to load if any other type of operand is used in an LDC
1827    instruction.
1828
1829      result = BufferMemoryLoad(&op0, storageModifier);
1830
1831    A base operand vector is fetched from memory as described in Section
1832    2.X.4.5, with the GPU address derived from the binding corresponding to
1833    the operand.  A final operand vector is derived from the base operand
1834    vector by applying swizzle, negation, and absolute value operand modifiers
1835    as described in Section 2.X.4.2.
1836
1837    The amount of memory in any given buffer object binding accessible by the
1838    LDC instruction may be limited.  If any component fetched by the LDC
1839    instruction extends 4*<n> or more basic machine units from the beginning
1840    of the buffer object binding, where <n> is the implementation-dependent
1841    constant MAX_PROGRAM_PARAMETER_BUFFER_SIZE_NV, the value fetched for that
1842    component will be undefined.
1843
1844    LDC supports no base data type modifiers, but requires exactly one storage
1845    modifier.  The base data types of the operand and result vectors are
1846    derived from the storage modifier.
1847
1848
1849    Section 2.X.8.Z, LOAD:  Global Load
1850
1851    The LOAD instruction generates a result vector by reading an address from
1852    the single unsigned integer scalar operand and fetching data from buffer
1853    object memory, as described in Section 2.X.4.5.
1854
1855      address = ScalarLoad(op0);
1856      result = BufferMemoryLoad(address, storageModifier);
1857
1858    LOAD supports no base data type modifiers, but requires exactly one
1859    storage modifier.  The base data type of the result vector is derived from
1860    the storage modifier.  The single scalar operand is always interpreted as
1861    an unsigned integer.
1862
1863
1864    Section 2.X.8.Z, MEMBAR:  Memory Barrier
1865
1866    The MEMBAR instruction synchronizes memory transactions to ensure that
1867    memory transactions resulting from any instruction executed by the thread
1868    prior to the MEMBAR instruction complete prior to any memory transactions
1869    issued after the instruction.
1870
1871    MEMBAR has no operands and generates no result.
1872
1873
1874    Section 2.X.8.Z, PK64:  Pack 64-Bit Component
1875
1876    The PK64 instruction reads the four components of the single vector
1877    operand as 32-bit values, packs the bit representations of these into a
1878    pair of 64-bit values, and replicates those to produce a four-component
1879    result vector.  The "x" and "y" components of the operand are packed to
1880    produce the "x" and "z" components of the result vector; the "z" and "w"
1881    components of the operand are packed to produce the "y" and "w" components
1882    of the result vector.  The PK64 instruction can be reversed by the UP64
1883    instruction below.
1884
1885    This instruction is intended to allow a program to reconstruct 64-bit
1886    integer or floating-point values generated by the application but passed
1887    to the GL as two 32-bit values taken from adjacent words in memory.  The
1888    ability to use this technique depends on how the 64-bit value is stored in
1889    memory.  For "little-endian" processors, the first 32-bit value holds the
1890    least significant 32 bits of the 64-bit value.  For "big-endian"
1891    processors, the first 32-bit value holds the most significant 32 bits of
1892    the 64-bit value.  This reconstruction assumes that the first 32-bit word
1893    comes from the x component of the operand and the second 32-bit word comes
1894    from the y component.  The method used to construct a 64-bit value from a
1895    pair of 32-bit values depends on the processor type.
1896
1897      tmp = VectorLoad(op0);
1898
1899      if (underlying system is little-endian) {
1900        result.x = RawBits(tmp.x) | (RawBits(tmp.y) << 32);
1901        result.y = RawBits(tmp.z) | (RawBits(tmp.w) << 32);
1902        result.z = RawBits(tmp.x) | (RawBits(tmp.y) << 32);
1903        result.w = RawBits(tmp.z) | (RawBits(tmp.w) << 32);
1904      } else {
1905        result.x = RawBits(tmp.y) | (RawBits(tmp.x) << 32);
1906        result.y = RawBits(tmp.w) | (RawBits(tmp.z) << 32);
1907        result.z = RawBits(tmp.y) | (RawBits(tmp.x) << 32);
1908        result.w = RawBits(tmp.w) | (RawBits(tmp.z) << 32);
1909      }
1910
1911    PK64 supports integer and floating-point data type modifiers, which
1912    specify the base data type of the operand and result.  The single vector
1913    operand is always treated as having 32-bit components, and the result is
1914    treated as a vector with 64-bit components.  The encoding performed by
1915    PK64 can be reversed using the UP64 instruction.
1916
1917    A program will fail to load if it contains a PK64 instruction that writes
1918    its results to a variable not declared as "LONG".
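
    The endian-dependent packing can be illustrated on the host with the
    C sketch below.  All function names are hypothetical.  The round-trip
    helper splits a 64-bit value into two adjacent 32-bit words, as an
    application might before passing them to the GL, and then repacks them
    using the host's byte order.

```c
#include <stdint.h>
#include <string.h>

/* Pack two 32-bit words, where 'first' is the word at the lower memory
   address (the x component) and 'second' the next word (the y component). */
static uint64_t pk64_pair(uint32_t first, uint32_t second, int little_endian)
{
    return little_endian
        ? ((uint64_t)first  | ((uint64_t)second << 32))
        : ((uint64_t)second | ((uint64_t)first  << 32));
}

/* Detect the host byte order by inspecting the first byte of a probe. */
static int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    uint8_t first_byte;
    memcpy(&first_byte, &probe, 1);
    return first_byte == 1;
}

/* Split a 64-bit value into adjacent 32-bit words and repack them. */
static uint64_t pk64_roundtrip(uint64_t value)
{
    uint32_t words[2];
    memcpy(words, &value, sizeof value);
    return pk64_pair(words[0], words[1], host_is_little_endian());
}
```

    On either kind of host, pk64_roundtrip() is expected to return its
    argument unchanged, which mirrors the reconstruction described above.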
1919
1920
1921    Section 2.X.8.Z, STORE:  Global Store
1922
1923    The STORE instruction reads an address from the second unsigned integer
1924    scalar operand and writes the contents of the first vector operand to
1925    buffer object memory at that address, as described in Section 2.X.4.5.
1926    This instruction generates no result.
1927
1928      tmp0 = VectorLoad(op0);
1929      address = ScalarLoad(op1);
1930      BufferMemoryStore(address, tmp0, storageModifier);
1931
1932    STORE supports no base data type modifiers, but requires exactly one
1933    storage modifier.  The base data type of the vector components of the
1934    first operand is derived from the storage modifier.  The second operand is
1935    always interpreted as an unsigned integer scalar.
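
    As a rough host-side model of the STORE pseudocode, the sketch below
    treats buffer object memory as a bytearray and the address as a byte
    offset into it.  The STORAGE_FORMATS table is a hypothetical subset of the
    storage modifiers, and both the little-endian layout and the rule that an
    "X2" or "X4" modifier stores the first two or four components are
    assumptions made for illustration.

```python
import struct

# Hypothetical subset: modifier -> (struct format code, component count).
STORAGE_FORMATS = {
    "F32":   ("f", 1),
    "F32X2": ("f", 2),
    "F32X4": ("f", 4),
    "U32":   ("I", 1),
    "S32X4": ("i", 4),
    "F64":   ("d", 1),
}

def buffer_memory_store(memory, address, vec, modifier):
    """Write the leading components of vec to memory at a byte offset."""
    code, count = STORAGE_FORMATS[modifier]
    # The storage modifier alone decides the type and number of components.
    struct.pack_into("<%d%s" % (count, code), memory, address, *vec[:count])
```

    Calling buffer_memory_store(mem, 8, (1.0, 2.0, 3.0, 4.0), "F32X2") writes
    eight bytes starting at offset 8 and leaves the rest of mem untouched.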
1936
1937
1938    Section 2.X.8.Z, TEX:  Texture Sample
1939
1940    (Modify the instruction pseudo-code to account for texel offsets no
1941     longer needing to be immediate arguments.)
1942
1943      tmp = VectorLoad(op0);
1944      if (instruction has variable texel offset) {
1945        itmp = VectorLoad(op1);
1946      } else {
1947        itmp = instruction.texelOffset;
1948      }
1949      ddx = ComputePartialsX(tmp);
1950      ddy = ComputePartialsY(tmp);
1951      lambda = ComputeLOD(ddx, ddy);
1952      result = TextureSample(tmp, lambda, ddx, ddy, itmp);
1953
1954
1955    Section 2.X.8.Z, TGALL:  Test for All Non-Zero in a Thread Group
1956
1957    The TGALL instruction produces a result vector by reading a vector operand
1958    for each active thread in the current thread group and comparing each
1959    component to zero.  A result vector component contains a TRUE value
1960    (described below) if the value of the corresponding component in the
1961    operand vector is non-zero for all active threads, and a FALSE value
1962    otherwise.
1963
1964    An implementation may choose to arrange program threads into thread
1965    groups, and execute an instruction simultaneously for each thread in the
1966    group.  If the TGALL instruction is contained inside conditional flow
1967    control blocks and not all threads in the group execute the instruction,
1968    the operand values for threads not executing the instruction have no
1969    bearing on the value returned.  The method used to arrange threads into
1970    groups is undefined.
1971
1972      tmp = VectorLoad(op0);
1973      result = { TRUE, TRUE, TRUE, TRUE };
1974      for (all active threads) {
1975        if ([thread]tmp.x == 0) result.x = FALSE;
1976        if ([thread]tmp.y == 0) result.y = FALSE;
1977        if ([thread]tmp.z == 0) result.z = FALSE;
1978        if ([thread]tmp.w == 0) result.w = FALSE;
1979      }
1980
1981    TGALL supports all data type modifiers.  For floating-point data types,
1982    the TRUE value is 1.0 and the FALSE value is 0.0.  For signed integer data
1983    types, the TRUE value is -1 and the FALSE value is 0.  For unsigned
1984    integer data types, the TRUE value is the maximum integer value (all bits
1985    are ones) and the FALSE value is zero.
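
    The TGALL reduction can be sketched on the host as follows.  The thread
    group is modeled as a list of four-component operands plus a parallel
    list of "active" flags; both representations are assumptions of this
    sketch, and the result is returned as Python booleans rather than the
    type-dependent TRUE/FALSE encodings.

```python
def tgall(operands, active):
    """Return per-component flags: non-zero in every active thread."""
    result = [True, True, True, True]
    for vec, is_active in zip(operands, active):
        if not is_active:
            continue  # inactive threads have no bearing on the result
        for c in range(4):
            if vec[c] == 0:
                result[c] = False
    return tuple(result)
```

    With operands [(1, 2, 0, 4), (5, 0, 0, 8), (0, 0, 0, 0)] and the third
    thread inactive, only the x and w components are non-zero in every active
    thread.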
1986
1987
1988    Section 2.X.8.Z, TGANY:  Test for Any Non-Zero in a Thread Group
1989
1990    The TGANY instruction produces a result vector by reading a vector operand
1991    for each active thread in the current thread group and comparing each
1992    component to zero.  A result vector component contains a TRUE value
1993    (described below) if the value of the corresponding component in the
1994    operand vector is non-zero for any active thread, and a FALSE value
1995    otherwise.
1996
1997    An implementation may choose to arrange program threads into thread
1998    groups, and execute an instruction simultaneously for each thread in the
1999    group.  If the TGANY instruction is contained inside conditional flow
2000    control blocks and not all threads in the group execute the instruction,
2001    the operand values for threads not executing the instruction have no
2002    bearing on the value returned.  The method used to arrange threads into
2003    groups is undefined.
2004
2005      tmp = VectorLoad(op0);
2006      result = { FALSE, FALSE, FALSE, FALSE };
2007      for (all active threads) {
2008        if ([thread]tmp.x != 0) result.x = TRUE;
2009        if ([thread]tmp.y != 0) result.y = TRUE;
2010        if ([thread]tmp.z != 0) result.z = TRUE;
2011        if ([thread]tmp.w != 0) result.w = TRUE;
2012      }
2013
2014    TGANY supports all data type modifiers.  For floating-point data types,
2015    the TRUE value is 1.0 and the FALSE value is 0.0.  For signed integer data
2016    types, the TRUE value is -1 and the FALSE value is 0.  For unsigned
2017    integer data types, the TRUE value is the maximum integer value (all bits
2018    are ones) and the FALSE value is zero.
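
    The sketch below models TGANY together with the type-dependent TRUE and
    FALSE encodings described above.  The "kind" strings and the "bits"
    parameter are hypothetical stand-ins for the data type modifier, chosen
    only for this illustration.

```python
def encode_bool(flag, kind="float", bits=32):
    """Encode a flag as the TRUE/FALSE value for the given base data type."""
    if kind == "float":
        return 1.0 if flag else 0.0
    if kind == "signed":
        return -1 if flag else 0
    # Unsigned: TRUE is the maximum integer value (all bits are ones).
    return (1 << bits) - 1 if flag else 0

def tgany(operands, active, kind="float", bits=32):
    """Return per-component TRUE where any active thread is non-zero."""
    result = [False, False, False, False]
    for vec, is_active in zip(operands, active):
        if is_active:
            for c in range(4):
                if vec[c] != 0:
                    result[c] = True
    return tuple(encode_bool(flag, kind, bits) for flag in result)
```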
2019
2020
2021    Section 2.X.8.Z, TGEQ:  Test for All Equal Values in a Thread Group
2022
2023    The TGEQ instruction produces a result vector by reading a vector operand
2024    for each active thread in the current thread group and comparing each
2025    component to zero.  A result vector component contains a TRUE value
2026    (described below) if the value of the corresponding component in the
2027    operand vector is the same for all active threads, and a FALSE value
2028    otherwise.
2029
2030    An implementation may choose to arrange program threads into thread
2031    groups, and execute an instruction simultaneously for each thread in the
2032    group.  If the TGEQ instruction is contained inside conditional flow
2033    control blocks and not all threads in the group execute the instruction,
2034    the operand values for threads not executing the instruction have no
2035    bearing on the value returned.  The method used to arrange threads into
2036    groups is undefined.
2037
2038      tmp = VectorLoad(op0);
2039      tgall = { TRUE, TRUE, TRUE, TRUE };
2040      tgany = { FALSE, FALSE, FALSE, FALSE };
2041      for (all active threads) {
2042        if ([thread]tmp.x == 0) tgall.x = FALSE; else tgany.x = TRUE;
2043        if ([thread]tmp.y == 0) tgall.y = FALSE; else tgany.y = TRUE;
2044        if ([thread]tmp.z == 0) tgall.z = FALSE; else tgany.z = TRUE;
2045        if ([thread]tmp.w == 0) tgall.w = FALSE; else tgany.w = TRUE;
2046      }
2047      result.x = (tgall.x == tgany.x) ? TRUE : FALSE;
2048      result.y = (tgall.y == tgany.y) ? TRUE : FALSE;
2049      result.z = (tgall.z == tgany.z) ? TRUE : FALSE;
2050      result.w = (tgall.w == tgany.w) ? TRUE : FALSE;
2051
2052    TGEQ supports all data type modifiers.  For floating-point data types, the
2053    TRUE value is 1.0 and the FALSE value is 0.0.  For signed integer data
2054    types, the TRUE value is -1 and the FALSE value is 0.  For unsigned
2055    integer data types, the TRUE value is the maximum integer value (all bits
2056    are ones) and the FALSE value is zero.
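
    A direct transcription of the TGEQ pseudocode is sketched here, using
    Python booleans for the result and the same assumed list-based model of a
    thread group.  As written, the pseudocode reports TRUE for a component
    when every active thread is zero or every active thread is non-zero, so
    two different non-zero values still count as matching in this model.

```python
def tgeq(operands, active):
    """Return per-component flags following the TGEQ pseudocode."""
    all_nonzero = [True, True, True, True]
    any_nonzero = [False, False, False, False]
    for vec, is_active in zip(operands, active):
        if not is_active:
            continue
        for c in range(4):
            if vec[c] == 0:
                all_nonzero[c] = False
            else:
                any_nonzero[c] = True
    # TRUE when the zero/non-zero state agrees across all active threads.
    return tuple(all_nonzero[c] == any_nonzero[c] for c in range(4))
```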
2057
2058
2059    Section 2.X.8.Z, TXB:  Texture Sample with Bias
2060
2061    (Modify the instruction pseudo-code to account for texel offsets no
2062     longer needing to be immediate arguments.)
2063
2064      tmp = VectorLoad(op0);
2065      if (instruction has variable texel offset) {
2066        itmp = VectorLoad(op1);
2067      } else {
2068        itmp = instruction.texelOffset;
2069      }
2070      ddx = ComputePartialsX(tmp);
2071      ddy = ComputePartialsY(tmp);
2072      lambda = ComputeLOD(ddx, ddy);
2073      result = TextureSample(tmp, lambda + tmp.w, ddx, ddy, itmp);
2074
2075    Section 2.X.8.Z, TXG:  Texture Gather
2076
2077    (Update the TXG opcode description from NV_gpu_program4_1 specification.
2078     This version adds two capabilities:  any component of a multi-component
2079     texture can be selected by tacking on a component name to the texture
2080     variable passed to identify the texture unit, and depth compares are
2081     supported if a SHADOW target is specified.)
2082
2083    The TXG instruction takes the four components of a single floating-point
2084    vector operand as a texture coordinate, determines a set of four texels to
2085    sample from the base level of detail of the specified texture image, and
2086    returns one component from each texel in a four-component result vector.
2087    To determine the four texels to sample, the minification and magnification
2088    filters are ignored and the rules for LINEAR filter are applied to the
2089    base level of the texture image to determine the texels T_i0_j1, T_i1_j1,
2090    T_i1_j0, and T_i0_j0, as defined in equations 3.23 through 3.25. The
2091    texels are then converted to texture source colors (Rs,Gs,Bs,As) according
2092    to table 3.21, followed by application of the texture swizzle as described
2093    in section 3.8.13.  A four-component vector is returned by taking one of
2094    the four components of the swizzled texture source colors from each of the
2095    four selected texels.  The component is selected using the
2096    <texImageUnitComp> grammar rule, by adding a scalar suffix
2097    (".x", ".y", ".z", ".w") to the identified texture; if no scalar suffix
2098    is provided, the first component is selected.
2099
2100    TXG only operates on 2D, SHADOW2D, CUBE, SHADOWCUBE, ARRAY2D,
2101    SHADOWARRAY2D, ARRAYCUBE, SHADOWARRAYCUBE, RECT, and SHADOWRECT texture
2102    targets; a program will fail to compile if any other texture target is
2103    used.
2104
2105    When using a "SHADOW" texture target, component selection is ignored.
2106    Instead, depth comparisons are performed on the depth values for each of
2107    the four selected texels, and 0/1 values are returned based on the results
2108    of the comparison.
2109
2110    As with other texture accesses, the results of a texture gather operation
2111    are undefined if the texture target in the instruction is incompatible
2112    with the selected texture's base internal format and depth compare mode.
2113
2114      tmp = VectorLoad(op0);
2115      ddx = (0,0,0);
2116      ddy = (0,0,0);
2117      lambda = 0;
2118      if (instruction has variable texel offset) {
2119        itmp = VectorLoad(op1);
2120      } else {
2121        itmp = instruction.texelOffset;
2122      }
2123      result.x = TextureSample_i0j1(tmp, lambda, ddx, ddy, itmp).<comp>;
2124      result.y = TextureSample_i1j1(tmp, lambda, ddx, ddy, itmp).<comp>;
2125      result.z = TextureSample_i1j0(tmp, lambda, ddx, ddy, itmp).<comp>;
2126      result.w = TextureSample_i0j0(tmp, lambda, ddx, ddy, itmp).<comp>;
2127
2128    In this pseudocode, "<comp>" refers to the texel component selected by the
2129    <texImageUnitComp> grammar rule, as described above.
2130
2131    TXG supports all three data type modifiers.  The single operand is always
2132    treated as a floating-point vector; the results are interpreted according
2133    to the data type modifier.
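
    The texel selection and component ordering of TXG can be illustrated with
    a small host-side sketch.  It assumes a 2D texture stored as rows of RGBA
    tuples (texture[j][i]), normalized coordinates, clamp-to-edge wrapping,
    and no swizzle or depth comparison; the function name and data layout are
    hypothetical.

```python
import math

def txg(texture, s, t, comp=0):
    """Gather one component from the 2x2 LINEAR footprint around (s, t)."""
    height, width = len(texture), len(texture[0])
    u, v = s * width, t * height
    i0 = math.floor(u - 0.5)
    j0 = math.floor(v - 0.5)

    def clamp(x, n):
        return max(0, min(n - 1, x))

    i0c, i1c = clamp(i0, width), clamp(i0 + 1, width)
    j0c, j1c = clamp(j0, height), clamp(j0 + 1, height)
    # Result order follows the pseudocode: i0j1, i1j1, i1j0, i0j0.
    footprint = [(i0c, j1c), (i1c, j1c), (i1c, j0c), (i0c, j0c)]
    return tuple(texture[j][i][comp] for i, j in footprint)
```

    For a 2x2 texture sampled at its center, the four results are the
    lower-left, lower-right, upper-right, and upper-left texels when row
    index j increases downward.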
2134
2135
2136    Section 2.X.8.Z, TXGO:  Texture Gather with Per-Texel Offsets
2137
2138    Like the TXG instruction, the TXGO instruction takes the four components
2139    of its first floating-point vector operand as a texture coordinate,
2140    determines a set of four texels to sample from the base level of detail of
2141    the specified texture image, and returns one component from each texel in
2142    a four-component result vector.  The second and third vector operands are
2143    taken as signed four-component integer vectors providing the x and y
2144    components of the offsets, respectively, used to determine the location of
2145    each of the four texels.  To determine the four texels to sample, each of
2146    the four independent offsets is used in conjunction with the specified
2147    texture coordinate to select a texel.  The minification and magnification
2148    filters are ignored and the rules for LINEAR filtering are used to select
2149    the texel T_i0_j0, as defined in equations 3.23 through 3.25, from the
2150    base level of the texture image.  The texels are then converted to texture
2151    source colors (Rs,Gs,Bs,As) according to table 3.21, followed by
2152    application of the texture swizzle as described in section 3.8.13.  A
2153    four-component vector is returned by taking one of the four components
2154    of the swizzled texture source colors from each of the four selected
2155    texels.  The component is selected using the <texImageUnitComp> grammar
2156    rule, by adding a scalar suffix (".x", ".y", ".z", ".w") to the identified
2157    texture; if no scalar suffix is provided, the first component is selected.
2158
2159    TXGO only operates on 2D, SHADOW2D, ARRAY2D, SHADOWARRAY2D, RECT, and
2160    SHADOWRECT texture targets; a program will fail to compile if any other
2161    texture target is used.
2162
2163    When using a "SHADOW" texture target, component selection is ignored.
2164    Instead, depth comparisons are performed on the depth values for each of
2165    the four selected texels, and 0/1 values are returned based on the results
2166    of the comparison.
2167
2168    As with other texture accesses, the results of a texture gather operation
2169    are undefined if the texture target in the instruction is incompatible
2170    with the selected texture's base internal format and depth compare mode.
2171
2172      tmp = VectorLoad(op0);
2173      itmp1 = VectorLoad(op1);
2174      itmp2 = VectorLoad(op2);
2175      ddx = (0,0,0);
2176      ddy = (0,0,0);
2177      lambda = 0;
2178      itmp = (itmp1.x, itmp2.x);
2179      result.x = TextureSample_i0j0(tmp, lambda, ddx, ddy, itmp).<comp>;
2180      itmp = (itmp1.y, itmp2.y);
2181      result.y = TextureSample_i0j0(tmp, lambda, ddx, ddy, itmp).<comp>;
2182      itmp = (itmp1.z, itmp2.z);
2183      result.z = TextureSample_i0j0(tmp, lambda, ddx, ddy, itmp).<comp>;
2184      itmp = (itmp1.w, itmp2.w);
2185      result.w = TextureSample_i0j0(tmp, lambda, ddx, ddy, itmp).<comp>;
2186
2187    In this pseudocode, "<comp>" refers to the texel component selected by the
2188    <texImageUnitComp> grammar rule, as described above.
2189
2190    If TEXTURE_WRAP_S or TEXTURE_WRAP_T is either CLAMP or MIRROR_CLAMP_EXT,
2191    the results of the TXGO instruction are undefined.
2192
2193    Note:  The TXG instruction is equivalent to the TXGO instruction with X
2194    and Y offset vectors of (0,1,1,0) and (0,0,-1,-1), respectively.
2195
2196    TXGO supports all three data type modifiers.  The first operand is always
2197    treated as a floating-point vector and the second and third operands are
2198    always treated as signed integer vectors; the results are interpreted
2199    according to the data type modifier.
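
    A minimal host-side sketch of the per-texel offset selection in TXGO
    follows.  It assumes a 2D texture stored as rows of RGBA tuples
    (texture[j][i]), normalized coordinates, REPEAT wrapping, and no swizzle
    or depth comparison; the function name and data layout are hypothetical.

```python
import math

def txgo(texture, s, t, offsets_x, offsets_y, comp=0):
    """Fetch one component from four texels chosen by per-texel offsets."""
    height, width = len(texture), len(texture[0])
    # T_i0_j0 of the LINEAR footprint for the unmodified coordinate.
    i0 = math.floor(s * width - 0.5)
    j0 = math.floor(t * height - 0.5)
    result = []
    for dx, dy in zip(offsets_x, offsets_y):
        i = (i0 + dx) % width    # REPEAT wrap (assumption of this sketch)
        j = (j0 + dy) % height
        result.append(texture[j][i][comp])
    return tuple(result)
```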
2200
2201
2202    Section 2.X.8.Z, TXL:  Texture Sample with LOD
2203
2204    (Modify the instruction pseudo-code to account for texel offsets no
2205     longer needing to be immediate arguments.)
2206
2207      tmp = VectorLoad(op0);
2208      if (instruction has variable texel offset) {
2209        itmp = VectorLoad(op1);
2210      } else {
2211        itmp = instruction.texelOffset;
2212      }
2213      ddx = (0,0,0);
2214      ddy = (0,0,0);
2215      result = TextureSample(tmp, tmp.w, ddx, ddy, itmp);
2216
2217
2218    Section 2.X.8.Z, TXP:  Texture Sample with Projection
2219
2220    (Modify the instruction pseudo-code to account for texel offsets no
2221     longer needing to be immediate arguments.)
2222
2223      tmp0 = VectorLoad(op0);
2224      tmp0.x = tmp0.x / tmp0.w;
2225      tmp0.y = tmp0.y / tmp0.w;
2226      tmp0.z = tmp0.z / tmp0.w;
2227      if (instruction has variable texel offset) {
2228        itmp = VectorLoad(op1);
2229      } else {
2230        itmp = instruction.texelOffset;
2231      }
2232      ddx = ComputePartialsX(tmp0);
2233      ddy = ComputePartialsY(tmp0);
2234      lambda = ComputeLOD(ddx, ddy);
2235      result = TextureSample(tmp0, lambda, ddx, ddy, itmp);
2236
2237
2238    Section 2.X.8.Z, UP64:  Unpack 64-bit Component
2239
2240    The UP64 instruction produces a vector result with 32-bit components by
2241    unpacking the bits of the "x" and "y" components of a 64-bit vector
2242    operand.  The "x" component of the operand is unpacked to produce the "x"
2243    and "y" components of the result vector; the "y" component is unpacked to
2244    produce the "z" and "w" components of the result vector.
2245
2246    This instruction is intended to allow a program to pass 64-bit integer or
2247    floating-point values to an application using two 32-bit values stored in
2248    adjacent words in memory, which will be read by the application as single
2249    64-bit values.  The ability to use this technique depends on how the
2250    64-bit value is stored in memory.  For "little-endian" processors, the
2251    first 32-bit value holds the least significant 32 bits of
2252    the 64-bit value.  For "big-endian" processors, the first 32-bit value
2253    holds the most significant 32 bits of the 64-bit value.  This
2254    unpacking assumes that the first 32-bit word is written to the "x"
2255    component of the result and the second 32-bit word to the "y"
2256    component.  The method used to unpack a 64-bit value into a pair of 32-bit
2257    values depends on the processor type.
2258
2259      tmp = VectorLoad(op0);
2260      if (underlying system is little-endian) {
2261        result.x = (RawBits(tmp.x) >>  0) & 0xFFFFFFFF;
2262        result.y = (RawBits(tmp.x) >> 32) & 0xFFFFFFFF;
2263        result.z = (RawBits(tmp.y) >>  0) & 0xFFFFFFFF;
2264        result.w = (RawBits(tmp.y) >> 32) & 0xFFFFFFFF;
2265      } else {
2266        result.x = (RawBits(tmp.x) >> 32) & 0xFFFFFFFF;
2267        result.y = (RawBits(tmp.x) >>  0) & 0xFFFFFFFF;
2268        result.z = (RawBits(tmp.y) >> 32) & 0xFFFFFFFF;
2269        result.w = (RawBits(tmp.y) >>  0) & 0xFFFFFFFF;
2270      }
2271
2272    UP64 supports integer and floating-point data type modifiers, which
2273    specify the base data type of the operand and result.  The single operand
2274    vector always has 64-bit components.  The result is treated as a vector
2275    with 32-bit components.  The encoding performed by UP64 can be reversed
2276    using the PK64 instruction.
2277
2278    A program will fail to load if it contains a UP64 instruction whose
2279    operand is a variable not declared as "LONG".
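
    The UP64 pseudocode, and its inverse relationship with PK64, can be
    checked with a small host-side sketch.  The helper names (up64, pack_pair)
    are illustrative only, and 64-bit raw values are assumed to be given as
    Python integers.

```python
MASK32 = 0xFFFFFFFF

def up64(op, little_endian=True):
    """Unpack the 64-bit x and y components of op into four 32-bit words."""
    x, y = op[0], op[1]
    if little_endian:
        return (x & MASK32, (x >> 32) & MASK32,
                y & MASK32, (y >> 32) & MASK32)
    return ((x >> 32) & MASK32, x & MASK32,
            (y >> 32) & MASK32, y & MASK32)

def pack_pair(first, second, little_endian=True):
    """Rebuild one 64-bit value from two adjacent words, as PK64 would."""
    if little_endian:
        return first | (second << 32)
    return second | (first << 32)
```

    Unpacking a value and feeding each pair of words back through pack_pair
    with the same endianness setting returns the original 64-bit pattern.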
2280
2281
2282    Modify Section 2.14.6.1 of the NV_geometry_program4 specification,
2283    Geometry Program Input Primitives
2284
2285    (add patches to the list of supported input primitive types)
2286
2287    The supported input primitive types are: ...
2288
2289    Patches (PATCHES)
2290
2291    Geometry programs that operate on patches are valid only for the
2292    PATCHES_NV primitive type.  There are a variable number of vertices
2293    available for each program invocation, depending on the number of input
2294    vertices in the primitive itself.  For a patch with <n> vertices,
2295    "vertex[0]" refers to the first vertex of the patch, and "vertex[<n>-1]"
2296    refers to the last vertex.
2297
2298
2299    Modify Section 2.14.6.2 of the NV_geometry_program4 specification,
2300    Geometry Program Output Primitives
2301
2302    (Add a new paragraph limiting the use of the EMITS opcode to geometry
2303     programs with a POINTS output primitive type at the end of the section.
2304     This limitation may be removed in future specifications.)
2305
2306    Geometry programs may write to multiple vertex streams only if the
2307    specified output primitive type is POINTS.  A program will fail to load if
2308    it contains an EMITS instruction and the output primitive type specified
2309    by the PRIMITIVE_OUT declaration is not POINTS.
2310
2311    Modify Section 2.14.6.4 of the NV_geometry_program4 specification,
2312    Geometry Program Output Limits
2313
2314    (Modify the limitation on the total number of components emitted by a
2315     geometry program from NV_gpu_program4 to be per-invocation.  If that
2316     limit is 4096 and a program has 16 invocations, each of the 16 program
2317     invocations can emit up to 4096 total components.)
2318
2319    There are two implementation-dependent limits that limit the total number
2320    of vertices that each invocation of a program can emit.  First, the vertex
2321    limit may not exceed the value of MAX_PROGRAM_OUTPUT_VERTICES_NV.  Second,
2322    the product of the vertex limit and the number of result variable components
2323    written by the program (PROGRAM_RESULT_COMPONENTS_NV, as described in
2324    section 2.X.3.5 of NV_gpu_program4) may not exceed the value of
2325    MAX_PROGRAM_TOTAL_OUTPUT_COMPONENTS_NV.  A geometry program will fail to
2326    load if its maximum vertex count or maximum total component count exceeds
2327    the implementation-dependent limit.  The limits may be queried by calling
2328    GetProgramiv with a <target> of GEOMETRY_PROGRAM_NV.  Note that the
2329    maximum number of vertices that a geometry program can emit may be much
2330    lower than MAX_PROGRAM_OUTPUT_VERTICES_NV if the program writes a large
2331    number of result variable components.  If a geometry program has multiple
2332    invocations (via the "INVOCATIONS" declaration), the program will load
2333    successfully as long as no single invocation exceeds the total component
2334    count limit, even if the total output of all invocations combined exceeds
2335    the limit.
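
    The two per-invocation limits can be expressed as a short check.  This is
    a sketch only: the function and parameter names are hypothetical, and the
    limit values would come from GetProgramiv queries rather than from
    constants.

```python
def geometry_program_loads(vertex_limit, result_components,
                           max_output_vertices, max_total_components):
    """Return True if one invocation stays within both output limits."""
    if vertex_limit > max_output_vertices:
        return False
    # The invocation count does not enter this check; the limit is
    # applied to each invocation separately.
    return vertex_limit * result_components <= max_total_components

def max_emittable_vertices(result_components,
                           max_output_vertices, max_total_components):
    """Largest vertex limit that satisfies both limits."""
    return min(max_output_vertices,
               max_total_components // result_components)
```

    With a total component limit of 4096 and 64 result components written
    per vertex, an invocation can emit at most 64 vertices even if the vertex
    limit alone would allow more.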
2336
2337
2338Additions to Chapter 3 of the OpenGL 3.0 Specification (Rasterization)
2339
2340    Modify Section 3.X, Early Per-Fragment Tests, as documented in the
2341    EXT_shader_image_load_store specification
2342
2343    (add new paragraph at the end of the section, describing how early fragment
2344     tests work when assembly fragment programs are active)
2345
2346    If an assembly fragment program is active, early depth tests are
2347    considered enabled if and only if the fragment program source included the
2348    NV_early_fragment_tests option.
2349
2350
2351    Add to Section 3.11.4.5 of ARB_fragment_program (Fragment Program):
2352
2353    Section 3.11.4.5.3, ARB_blend_func_extended Option
2354
2355    If a fragment program specifies the "ARB_blend_func_extended" option, dual
2356    source color outputs as described in ARB_blend_func_extended are made
2357    available through the use of the "result.color[n].primary" and
2358    "result.color[n].secondary" result bindings, corresponding to SRC_COLOR
2359    and SRC1_COLOR, respectively, for the fragment color output numbered <n>.
2360
2361
2362Additions to Chapter 4 of the OpenGL 3.0 Specification (Per-Fragment
2363Operations and the Frame Buffer)
2364
2365    Modify Section 4.4.3, Rendering When an Image of a Bound Texture Object
2366    is Also Attached to the Framebuffer, p. 288
2367
2368    (Replace the complicated set of conditions with the following)
2369
2370    Specifically, the values of rendered fragments are undefined if any
2371    shader stage fetches texels from a given mipmap level, cubemap face, and
2372    array layer of a texture if that same mipmap level, cubemap face, and
2373    array layer of the texture can be written to via fragment shader outputs,
2374    even if the reads and writes are not in the same Draw call. However, an
2375    application can insert MemoryBarrier(TEXTURE_FETCH_BARRIER_BIT_NV) between
2376    Draw calls that have such read/write hazards in order to guarantee that
2377    writes have completed and caches have been invalidated, as described in
2378    section 2.20.X.
2379
2380
2381Additions to Chapter 5 of the OpenGL 3.0 Specification (Special Functions)
2382
2383    None.
2384
2385Additions to Chapter 6 of the OpenGL 3.0 Specification (State and
2386State Requests)
2387
2388    None.
2389
2390Additions to Appendix A of the OpenGL 3.0 Specification (Invariance)
2391
2392    None.
2393
2394Additions to the AGL/GLX/WGL Specifications
2395
2396    None.
2397
2398GLX Protocol
2399
2400    None.
2401
2402Errors
2403
2404    None, other than new conditions by which a program string would fail to
2405    load.
2406
2407New State
2408
2409    None.
2410
2411
2412New Implementation Dependent State
2413
2414                                                             Minimum
2415    Get Value                         Type  Get Command       Value   Description            Sec.     Attrib
2416    --------------------------------  ----  ---------------  -------  ---------------------  -------  ------
2417    MAX_GEOMETRY_PROGRAM_              Z+   GetIntegerv        32     Maximum number of GP   2.X.6.Y    -
2418      INVOCATIONS_NV                                                  invocations per prim.
2419    MIN_FRAGMENT_INTERPOLATION_        R    GetFloatv         -0.5    Max. negative offset   2.X.8.Z    -
2420      OFFSET_NV                                                       for IPAO instruction.
2421    MAX_FRAGMENT_INTERPOLATION_        R    GetFloatv         +0.5    Max. positive offset   2.X.8.Z    -
2422      OFFSET_NV                                                       for IPAO instruction.
2423    FRAGMENT_PROGRAM_INTERPOLATION_    Z+   GetIntegerv         4     Subpixel bit count     2.X.8.Z    -
2424      OFFSET_BITS_NV                                                  for IPAO instruction.
2425
2426
2427Dependencies on NV_gpu_program4, NV_vertex_program4, NV_geometry_program4, and
2428NV_fragment_program4
2429
2430    This extension is written against the NV_gpu_program4 family of
2431    extensions, and introduces new instruction set features and inputs/outputs
2432    described here.  These features are available only if the extension is
2433    supported and the appropriate program header string is used ("!!NVvp5.0"
2434    for vertex programs, "!!NVgp5.0" for geometry programs, and "!!NVfp5.0"
2435    for fragment programs).  When loading a program with an older header (e.g.,
2436    "!!NVvp4.0"), the instruction set features described in this extension are
2437    not available.  The features in this extension build upon those documented
2438    in full in NV_gpu_program4.
2439
2440Dependencies on NV_tessellation_program5
2441
2442    This extension provides the basic assembly instruction set constructs for
2443    tessellation programs.  If this extension is supported, tessellation
2444    control and evaluation programs are supported, as described in the
2445    NV_tessellation_program5 specification.  There is no separate extension
2446    string for tessellation programs; such support is implied by this
2447    extension.
2448
2449Dependencies on ARB_transform_feedback3
2450
2451    The concept of multiple vertex streams emitted by a geometry shader is
2452    introduced by ARB_transform_feedback3, as is the description of how they
2453    operate and implementation-dependent limits on the number of streams.
2454    This extension simply provides a mechanism to emit a vertex to more than
2455    one stream.  If ARB_transform_feedback3 is not supported, language
2456    describing the EMITS opcode and the restriction on PRIMITIVE_OUT when
2457    EMITS is used should be removed.
2458
2459Dependencies on NV_shader_buffer_load
2460
2461    The programmability functionality provided by NV_shader_buffer_load is
2462    also incorporated by this extension.  Any assembly program using a program
2463    header corresponding to this or any subsequent extension (e.g.,
2464    "!!NVfp5.0") may use the LOAD opcode without needing to declare "OPTION
2465    NV_shader_buffer_load".
2466
2467    NV_shader_buffer_load is required by this extension, which means that the
2468    API mechanisms documented there allowing applications to make a buffer
2469    resident and query its GPU address are available to any applications using
2470    this extension.
2471
2472    In addition to the basic functionality in NV_shader_buffer_load, this
2473    extension provides the ability to load 64-bit integers and floating-point
2474    values using the "S64", "S64X2", "S64X4", "U64", "U64X2", "U64X4", "F64",
2475    "F64X2", and "F64X4" opcode modifiers.
2476
2477Dependencies on NV_shader_buffer_store
2478
2479    This extension provides assembly programmability support for
2480    NV_shader_buffer_store, which provides the API mechanisms allowing buffer
2481    objects to be stored to.  NV_shader_buffer_store does not have a separate
2482    extension string entry, and will always be supported if this extension is
2483    present.
2484
2485Dependencies on NV_parameter_buffer_object2
2486
2487    The programmability functionality provided by NV_parameter_buffer_object2
2488    is also incorporated by this extension.  Any assembly program using a
2489    program header corresponding to this or any subsequent extension (e.g.,
2490    "!!NVfp5.0") may use the LDC opcode without needing to declare "OPTION
2491    NV_parameter_buffer_object2".
2492
2493    In addition to the basic functionality in NV_parameter_buffer_object2,
2494    this extension provides the ability to load 64-bit integers and
2495    floating-point values using the "S64", "S64X2", "S64X4", "U64", "U64X2",
2496    "U64X4", "F64", "F64X2", and "F64X4" opcode modifiers.
2497
2498Dependencies on OpenGL 3.3, ARB_texture_swizzle, and EXT_texture_swizzle
2499
2500    If none of OpenGL 3.3, ARB_texture_swizzle, or EXT_texture_swizzle is
2501    supported, remove the swizzling step from the definition of TXG and TXGO.
2502
2503Dependencies on ARB_blend_func_extended
2504
2505    If ARB_blend_func_extended is not supported, references to the dual source
2506    color output bindings (result.color.primary and result.color.secondary)
2507    should be removed.
2508
2509Dependencies on EXT_shader_image_load_store
2510
2511    EXT_shader_image_load_store provides OpenGL Shading Language mechanisms to
2512    load/store to buffer and texture image memory, including spec language
2513    describing memory access ordering and synchronization, a built-in function
2514    (MemoryBarrierEXT) controlling synchronization of memory operations, and
2515    spec language describing early fragment tests that can be enabled via GLSL
2516    fragment shader source.  These sections of the EXT_shader_image_load_store
2517    specification apply equally to the assembly program memory accesses
2518    provided by this extension.  If EXT_shader_image_load_store is not
2519    supported, the sections of that specification describing these features
2520    should be considered to be added to this extension.
2521
2522    EXT_shader_image_load_store additionally provides and documents assembly
2523    language support for image loads, stores, and atomics as described in the
2524    "Dependencies on NV_gpu_program5" section of EXT_shader_image_load_store.
2525    The features described there are automatically supported for all
2526    NV_gpu_program5 assembly programs without requiring any additional
2527    "OPTION" line.
2528
2529Dependencies on ARB_shader_subroutine
2530
2531    ARB_shader_subroutine provides and documents assembly language support for
2532    subroutines as described in the "Dependencies on NV_gpu_program5" section
2533    of ARB_shader_subroutine.  The features described there are automatically
2534    supported for all NV_gpu_program5 assembly programs without requiring any
2535    additional "OPTION" line.
2536
2537
2538Issues
2539
2540    (1) Are there any restrictions or performance concerns involving the
2541        support for indexing textures or parameter buffers?
2542
2543      RESOLVED:  There are no significant functional limitations.  Textures
2544      and parameter buffers accessed with an index must be declared as arrays,
2545      so the assembler knows which textures might be accessed this way.
2546      Additionally, accessing an array of textures or parameter buffers with
2547      an out-of-bounds index will yield undefined results.
2548
2549      In particular, there is no limitation on the values used for indexing --
2550      they are not required to be true constants and are not required to have
2551      the same value for all vertices/fragments in a primitive.  However,
2552      using divergent texture or parameter buffer indices may have performance
2553      concerns.  We expect that GPU implementations of this extension will run
2554      multiple program threads in parallel (SIMD).  If different threads in a
2555      thread group have different indices, it will be necessary to do lookups
2556      in more than one texture at once.  This is likely to result in some
2557      thread serialization.  We expect that indexed texture or parameter
2558      buffer access where all indices in a thread group match will perform
2559      identically to non-indexed accesses.
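
      As a rough illustration of the serialization concern, the following C
      sketch counts how many distinct indices appear within one thread group.
      Treating that count as the number of lookup passes is a simplifying
      assumption made for illustration only, not a documented property of any
      implementation; the function name and the notion of a fixed group size
      are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Count the distinct texture (or parameter buffer) indices used by the
   threads of one group.  Under the simplified model assumed here, a count
   of 1 behaves like a non-indexed access and larger counts imply some
   serialization. */
size_t distinct_index_count(const int *indices, size_t num_threads)
{
    size_t count = 0;

    assert(indices != NULL || num_threads == 0);
    for (size_t i = 0; i < num_threads; i++) {
        int seen = 0;
        for (size_t j = 0; j < i; j++) {
            if (indices[j] == indices[i]) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            count++;
    }
    return count;
}
```

      A group whose threads all use the same index yields a count of one,
      which corresponds to the case expected to perform like a non-indexed
      access.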
2560
2561    (2) Which texture instructions support programmable texel offsets, and
2562        what offset limits apply?
2563
2564      RESOLVED:  Most texture instructions (TEX, TXB, TXF, TXG, TXL, TXP)
2565      support both constant texel offsets as provided by NV_gpu_program4 and
2566      programmable texel offsets.  TXD supports only constant offsets.  TXGO
2567      does not support non-zero or programmable offsets in the texture portion
2568      of the instruction, but provides full support for programmable offsets
2569      via two of the three vector arguments in the regular instruction.
2570
2571      For example,
2572
2573        TEX result, coord, texture[0], 2D, (-1,-1);
2574
      uses the NV_gpu_program4 mechanism to apply a constant texel offset of
2576      (-1,-1) to the texture coordinates.  With programmable offsets, the
2577      following code applies the same offset.
2578
2579        TEMP offxy;
2580        MOV offxy, {-1, -1};
        TEX result, coord, texture[0], 2D, offset(offxy);
2582
2583      Of course, the programmable form allows the offsets to be computed in
2584      the program and does not require constant values.
2585
2586      For most texture instructions, the range of allowable offsets is
2587      [MIN_PROGRAM_TEXEL_OFFSET_EXT, MAX_PROGRAM_TEXEL_OFFSET_EXT] for both
2588      constant and programmable texel offsets.  Constant offsets can be
2589      checked when the program is loaded, and out-of-bounds offsets cause the
2590      program to fail to load.  Programmable offsets can not have a
2591      load-time range check; out-of-bounds offsets produce undefined results.
2592
2593      Additionally, the new TXGO instruction has a separate (likely larger)
2594      allowable offset range, [MIN_PROGRAM_TEXTURE_GATHER_OFFSET_NV,
2595      MAX_PROGRAM_TEXTURE_GATHER_OFFSET_NV], that applies to the offset
      vectors passed in its second and third operands.
2597
2598      In the initial implementation of this extension, the range limits are
2599      [-8,+7] for most instructions and [-32,+31] for TXGO.
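
      Because out-of-bounds programmable offsets are undefined, a program
      that computes offsets may want to range-check or clamp them first.  The
      C sketch below shows that arithmetic using the initial-implementation
      limits quoted above as constants.  The helper names are hypothetical,
      and hard-coding the limits is an assumption; the actual limits are
      implementation-dependent and should be queried.

```c
#include <assert.h>

/* Limits quoted above for the initial implementation.  Hard-coding them
   is an assumption of this sketch; real limits are implementation-
   dependent. */
enum {
    TEXEL_OFFSET_MIN  = -8,  TEXEL_OFFSET_MAX  = 7,   /* most instructions */
    GATHER_OFFSET_MIN = -32, GATHER_OFFSET_MAX = 31   /* TXGO */
};

/* Returns 1 if <offset> lies in the closed range [lo, hi], else 0. */
int offset_in_range(int offset, int lo, int hi)
{
    assert(lo <= hi);
    return offset >= lo && offset <= hi;
}

/* Clamp a computed offset into [lo, hi] so that a programmable offset
   never reaches the undefined out-of-bounds case. */
int clamp_offset(int offset, int lo, int hi)
{
    assert(lo <= hi);
    if (offset < lo) return lo;
    if (offset > hi) return hi;
    return offset;
}
```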
2600
2601    (3) What is TXGO (texture gather with separate offsets) good for?
2602
2603      RESOLVED:  TXGO allows for efficiently sampling a single-component
2604      texture with a variety of offsets that need not be contiguous.
2605
2606      For example, a shadow mapping algorithm using a high-resolution shadow
      map may have pixels whose footprint covers a large number of texels in
2608      the shadow map.  Such pixels could do a single lookup into a
2609      lower-resolution texture (using mipmapping), but quality problems will
2610      arise.  Alternately, a shader could perform a large number of texture
2611      lookups using either NEAREST or LINEAR filtering from the
2612      high-resolution texture.  NEAREST filtering will require a separate
2613      lookup for each texel accessed; LINEAR filtering may require somewhat
2614      fewer lookups, but all accesses cover a 2x2 portion of the texture.  The
2615      TXG instruction added to NV_gpu_program4_1 allows a 2x2 block of texels
2616      to be returned in a single instruction in case the program wants to do
      something other than linear filtering with the samples.  The TXGO
      instruction allows a program to do semi-random sampling of the texture
      without requiring that each sample cover a 2x2 block of texels.  For
      example, the TXGO instruction would allow a program to sample the four
      texels A, H, J, and O from the 4x4 block depicted in the figure below:
2622
2623        TXGO result, coord, {-1,+2,0,+1}, {-1,0,+1,+2}, texture[0], 2D;
2624
2625      The "equivalent" TXG instruction would only sample the four center
      texels F, G, J, and K:
2627
2628        TXG result, coord, texture[0], 2D;
2629
2630      All sixteen texels of the footprint could be sampled with four TXG
2631      instructions,
2632
2633        TXG result0, coord, texture[0], 2D, (-1,-1);
2634        TXG result1, coord, texture[0], 2D, (-1,+1);
2635        TXG result2, coord, texture[0], 2D, (+1,-1);
2636        TXG result3, coord, texture[0], 2D, (+1,+1);
2637
2638      but accessing a smaller number of samples spread across the footprint
2639      with fewer instructions may produce results that are good enough.
2640
2641      The figure here depicts a texture with texel (0,0) shown in the
2642      upper-left corner.  If you insist on a lower-left origin, please look at
2643      this figure while standing on your head.
2644
2645       (0,0) +-+-+-+-+
2646             |A|B|C|D|
2647             +-+-+-+-+
2648             |E|F|G|H|
2649             +-+-+-+-+
2650             |I|J|K|L|
2651             +-+-+-+-+
2652             |M|N|O|P|
2653             +-+-+-+-+ (4,4)
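
      The mapping from the TXGO offset vectors to the texels in the figure
      can be checked with a small C sketch.  It assumes that the texel
      selected at zero offset is F, at (1,1), which is a choice made here
      because it reproduces the A, H, J, O example; the function and array
      names are hypothetical.

```c
#include <assert.h>

/* The 4x4 block from the figure, row-major, texel (0,0) in the upper-left. */
static const char texel_grid[4][5] = { "ABCD", "EFGH", "IJKL", "MNOP" };

/* Write the four texels selected by the per-sample (x,y) offsets into
   out[0..3] and NUL-terminate.  (base_x, base_y) is the texel selected at
   zero offset; using F = (1,1) is an assumption chosen to match the
   example. */
void txgo_select(int base_x, int base_y,
                 const int off_x[4], const int off_y[4], char out[5])
{
    for (int i = 0; i < 4; i++) {
        int x = base_x + off_x[i];
        int y = base_y + off_y[i];
        assert(x >= 0 && x < 4 && y >= 0 && y < 4);
        out[i] = texel_grid[y][x];
    }
    out[4] = '\0';
}
```

      Passing the x offsets {-1,+2,0,+1} and the y offsets {-1,0,+1,+2} from
      the TXGO example with a base texel of (1,1) selects A, H, J, and O in
      that order.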
2654
2655    (4) Why are the results of TXGO (texture gather with separate offsets)
2656        undefined if the wrap mode is CLAMP or MIRROR_CLAMP_EXT?
2657
2658      RESOLVED:  The CLAMP and MIRROR_CLAMP_EXT wrap modes are fairly
2659      different from other wrap modes.  After adding any instruction offsets,
2660      the spec says to pre-clamp the (u,v) coordinates to [0,texture_size]
2661      before generating the footprint.  If such clamping occurs on one edge
2662      for a normal texture filtering operation, the footprint ends up being
2663      half border texels, half edge texels, and the clamping effectively
2664      forces the interpolation weights used for texture filtering to 50/50.
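
      The 50/50 weighting can be seen in one dimension with the C sketch
      below.  It assumes the usual LINEAR filtering rule, where the footprint
      starts at floor(u - 0.5) and the weight is the fractional part of
      u - 0.5, applied after the pre-clamp.  The structure and function names
      are hypothetical.

```c
#include <assert.h>

typedef struct {
    int    i0;   /* index of the first texel in the footprint  */
    int    i1;   /* index of the second texel (i0 + 1)         */
    double w1;   /* weight of texel i1; texel i0 gets 1 - w1   */
} LinearFootprint1D;

/* One-dimensional LINEAR footprint for an unnormalized coordinate <u> on a
   texture of <size> texels, with the CLAMP-style pre-clamp to [0, size].
   Index -1 and index <size> denote border texels. */
LinearFootprint1D clamped_linear_footprint(double u, int size)
{
    LinearFootprint1D f;
    double s;

    assert(size > 0);
    if (u < 0.0)          u = 0.0;
    if (u > (double)size) u = (double)size;

    s = u - 0.5;                      /* shift to texel centers     */
    f.i0 = (int)s;                    /* truncates toward zero...   */
    if ((double)f.i0 > s) f.i0--;     /* ...so fix up to floor(s)   */
    f.i1 = f.i0 + 1;
    f.w1 = s - (double)f.i0;          /* fractional part of s       */
    return f;
}
```

      Any coordinate that is clamped to 0 produces the footprint (-1, 0) with
      a weight of 0.5, that is, half border texel and half edge texel.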
2665
2666      We expect the TXG instruction to be used in cases where an application
2667      may want to do custom filtering, and is in control of its own filtering
2668      weights.  Coordinate clamping as above will affect the footprint used
2669      for filtering, but not the weights.  In the NV_gpu_program4_1 spec, we
2670      defined the TXG/CLAMP combination to simply return the "normal"
2671      footprint produced after the pre-clamp operation above.  Any adjustment
2672      of weights due to clamping is the responsibility of the application.  We
2673      don't expect this to be a common operation, because CLAMP_TO_EDGE or
2674      CLAMP_TO_BORDER are much more sensible wrap modes.
2675
2676      The hardware implementing TXGO is anticipated to extract all four
2677      samples in a single pass.  However, the spec language is defined for
2678      simplicity to perform four separate "gather" operations with the four
2679      provided offsets, extract a single sample from each, and combine the
2680      four samples into a vector.  This would require four separate pre-clamp
2681      operations, which was deemed too costly to implement in hardware for a
2682      wrap mode that doesn't work well with texture gather operations.  Even
2683      if such hardware were built, it still wouldn't obtain a footprint
2684      resembling the half-border, half-edge footprint for simple TXGO offsets
2685      -- that would require different per-texel clamping rules for the four
2686      samples.  We chose to leave the results of this operation undefined.
2687
2688    (5) Should double-precision floating-point support be required or
2689        optional?  If optional, how?
2690
2691      RESOLVED:  Double-precision floating-point support will be optional in
2692      case low-end GPUs supporting the remainder of these instruction features
2693      choose to cut costs by removing the silicon necessary to implement
2694      64-bit floating-point arithmetic.
2695
2696    (6) While this extension supports double-precision computation, how can
2697        you provide high-precision inputs and outputs to the GPU programs?
2698
2699      RESOLVED:  The underlying hardware implementing this extension does not
2700      provide full support for 64-bit floats, even though DOUBLE is a standard
2701      data type provided by the GL.  For example, when specifying a vertex
2702      array with a data type of DOUBLE, the vertex attribute components will
2703      end up being converted to 32-bit floats (FLOAT) by the driver before
2704      being passed to the hardware, and the extra precision in the original
2705      64-bit float values will be lost.
2706
2707      For vertex attributes, the EXT_vertex_attrib_64bit and
2708      NV_vertex_attrib_integer_64bit extensions provide the ability to specify
2709      64-bit vertex attribute components using the VertexAttribL* and
2710      VertexAttribLPointer APIs.  Such attributes can be read in a vertex
2711      program using a "LONG ATTRIB" declaration:
2712
2713        LONG ATTRIB vector64;
2714
      The LONG modifier can only be used on vertex program inputs, and can not
      be used for inputs of any other program type or for outputs of any
      program type.
2717
2718      For other cases, this extension provides the PK64 and UP64 instructions
2719      that provide a mechanism to pass 64-bit components using consecutive
2720      32-bit components.  For example, a 3-component vector with 64-bit
2721      components can be passed to a vertex shader using multiple vertex
2722      attributes without using the VertexAttribL APIs with the following code:
2723
2724        /* Pass the X/Y components in vertex attribute 0 (X/Y/Z/W).  Use
2725           stride to skip over Z. */
2726        glVertexAttribPointer(0, 4, GL_FLOAT, GL_FALSE, 3*sizeof(GLdouble),
2727                              (GLdouble *) buffer);
2728
2729        /* Pass the Z components in vertex attribute 1 (X/Y).  Use stride to
2730           skip over original X/Y components. */
2731        glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, 3*sizeof(GLdouble),
2732                              (GLdouble *) buffer + 2);
2733
2734      In this example, the vertex program would use the PK64 instruction to
2735      reconstruct the 64-bit value for each component as follows:
2736
2737        LONG TEMP reconstructed;
2738        PK64 reconstructed.xy, vertex.attrib[0];
2739        PK64 reconstructed.z,  vertex.attrib[1];
2740
2741      A similar technique can be used to pass 64-bit values computed by a GPU
2742      program, using transform feedback or writes to a color buffer.  The UP64
2743      instruction would be used to convert the 64-bit computed value into two
2744      32-bit values, which would be written to adjacent components.
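
      On the host side, the same idea of carrying one 64-bit value in two
      32-bit components can be sketched in C as follows.  The function names
      are hypothetical, and placing the least significant 32 bits in the
      first word is an assumption of this sketch rather than a statement
      about the component order used by PK64 and UP64.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Split a 64-bit double into two 32-bit words, in the spirit of UP64.
   Placing the least significant word first is an assumption here. */
void split_double(double value, uint32_t words[2])
{
    uint64_t bits;

    assert(sizeof(double) == sizeof(uint64_t));
    memcpy(&bits, &value, sizeof bits);      /* reinterpret, no conversion */
    words[0] = (uint32_t)(bits & 0xFFFFFFFFu);
    words[1] = (uint32_t)(bits >> 32);
}

/* Rebuild the double from the two words, in the spirit of PK64. */
double join_double(const uint32_t words[2])
{
    uint64_t bits = ((uint64_t)words[1] << 32) | (uint64_t)words[0];
    double value;

    memcpy(&value, &bits, sizeof value);
    return value;
}
```

      Because the words are copied bit for bit and never converted to 32-bit
      floats, splitting and rejoining reproduces the original value exactly.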
2745
2746      Note also that the original hardware implementation of this extension
2747      does not support interpolation of 64-bit floating-point values.  If an
2748      application desires to pass a 64-bit floating-point value from a vertex
2749      or geometry program to a fragment program, and doesn't require
2750      interpolation, the PK64/UP64 techniques can be combined.  For example,
2751      the vertex shader could unpack a 3-component vector with 64-bit
2752      components into a four-component and a two-component 32-bit vector:
2753
2754        LONG TEMP result64;
        OUTPUT result32[2] = { result.attrib[0..1] };
2756        UP64 result32[0],    result64.xyxy;
2757        UP64 result32[1].xy, result64.z;
2758
2759      The fragment program would read and reconstruct using PK64:
2760
2761        LONG TEMP input64;
        FLAT ATTRIB input32[2] = { fragment.attrib[0..1] };
2763        PK64 input64.xy, input32[0];
2764        PK64 input64.z,  input32[1];
2765
2766      Note that such inputs must be declared as "FLAT" in the fragment program
2767      to prevent the hardware from trying to do floating-point interpolation
2768      on the separate 32-bit halves of the value being passed.  Such
2769      interpolation would produce complete garbage.
2770
2771    (7) What are instanced geometry programs useful for?
2772
2773      RESOLVED:  Instanced geometry programs allow geometry programs that
2774      perform regular operations to run more efficiently.
2775
2776      Consider a simple example of an algorithm that uses geometry programs to
2777      render primitives to a cube map in a single pass.  Without instanced
2778      geometry programs, the geometry program to render triangles to the cube
2779      map would do something like:
2780
2781        for (face = 0; face < 6; face++) {
2782          for (vertex = 0; vertex < 3; vertex++) {
2783            project vertex <vertex> onto face <face>, output position
2784            compute/copy attributes of emitted <vertex> to outputs
2785            output <face> to result.layer
2786            emit the projected vertex
2787          }
2788          end the primitive (next triangle)
2789        }
2790
2791      This algorithm would output 18 vertices per input triangle, three for
2792      each cube face.  The six triangles emitted would be rasterized, one per
2793      face.  Geometry programs that emit a large number of attributes have
2794      often posed performance challenges, since all the attributes must be
      stored somewhere until the emitted primitives are processed.  Large
      storage requirements may limit the number of threads that can be run in
      parallel and reduce overall performance.
2798
2799      Instanced geometry programs allow this example to be restructured to run
2800      with six separate threads, one per face.  Each thread projects the
2801      triangle to only a single face (identified by the invocation number) and
2802      emits only 3 vertices.  The reduced storage requirements allow more
2803      geometry program threads to be run in parallel, with greater overall
2804      efficiency.
2805
2806      Additionally, the total number of attributes that can be emitted by a
2807      single geometry program invocation is limited.  However, for instanced
2808      geometry shaders, that limit applies to each of <N> program invocations
2809      which allows for a larger total output.  For example, if the GL
2810      implementation supports only 1024 components of output per program
2811      invocation, the 18-vertex algorithm above could emit no more than 56
2812      components per vertex.  The same algorithm implemented as a 3-vertex
2813      6-invocation geometry program could theoretically allow for 341
2814      components per vertex.
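
      The per-vertex figures above follow from integer division of the
      per-invocation output limit by the number of vertices emitted.  A
      minimal C sketch of that arithmetic, with a hypothetical function name:

```c
#include <assert.h>

/* Largest number of components each vertex may carry when one program
   invocation emits <vertices> vertices and may output at most
   <max_components> components in total. */
int max_components_per_vertex(int max_components, int vertices)
{
    assert(max_components >= 0 && vertices > 0);
    return max_components / vertices;   /* integer division rounds down */
}
```

      With a limit of 1024 components, 18 vertices per invocation allow 56
      components per vertex, while 3 vertices per invocation allow 341.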
2815
2816    (8) What are the special interpolation opcodes (IPAC, IPAO, IPAS) good
2817        for, and how do they work?
2818
2819      RESOLVED:  The interpolation opcodes allow programs to control the
      frequency and location at which fragment inputs are sampled.  Some
2821      control has been provided in previous extensions, but the support was
2822      more limited.  NV_gpu_program4 had an interpolation modifier (CENTROID)
2823      that allowed attributes to be sampled inside the primitive, but that was
2824      a per-attribute modifier -- you could only sample any given attribute at
2825      one location.  NV_gpu_program4_1 added a new interpolation modifier
2826      (SAMPLE) that directed that fragment programs be run once per sample,
2827      and that the specified attributes be interpolated at the sample
2828      location.  Per-sample interpolation can produce higher quality, but the
2829      performance cost is significant since more fragment program invocations
2830      are required.
2831
2832      This extension provides additional control over interpolation, and
2833      allows programs to interpolate attributes at different locations without
2834      necessarily requiring the performance hit of per-sample invocation.
2835
2836      The IPAC instruction allows an attribute to be sampled at the centroid
2837      location, while still allowing the same attribute to be sampled
2838      elsewhere.  The IPAS instruction allows the attribute to be sampled at a
      numbered sample location, as per-sample interpolation would do.  Multiple
      IPAS instructions with different sample numbers allow a program to
2841      sample an attribute at multiple sample points in the pixel and then
2842      combine the samples in a programmable manner, which may allow for higher
2843      quality than simply interpolating at a single representative point in
2844      the pixel.  The IPAO instruction allows the attribute to be sampled at
2845      an arbitrary (x,y) offset relative to the pixel center.  The range of
2846      supported (x,y) values is limited, and the limits in the initial
2847      implementation are not large enough to permit sampling the attribute
2848      outside the pixel.
2849
2850      Note that previous instruction sets allowed shaders to fake IPAC,
2851      IPAS, and IPAO by a sequence such as:
2852
2853        TEMP ddx, ddy, offset, interp;
2854        MOV interp, fragment.attrib[0];          # start with center
2855        DDX ddx, fragment.attrib[0];
2856        MAD interp, offset.x, ddx, interp;       # add offset.x * dA/dx
        DDY ddy, fragment.attrib[0];
2858        MAD interp, offset.y, ddy, interp;       # add offset.y * dA/dy
2859
2860      However, this method does not apply perspective correction.  The quality
2861      of the results may be unacceptable, particularly for primitives that are
2862      nearly perpendicular to the screen.
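
      The size of the error can be illustrated in one dimension with the C
      sketch below.  It compares a first-order extrapolation, built the same
      way as the sequence above, against the perspective-correct value at
      the offset location.  The span parameterization, the <w> values, and
      the function names are assumptions made for this illustration.

```c
#include <assert.h>

/* Perspective-correct value of an attribute at screen-space parameter t
   along a span whose endpoints have attribute values a0, a1 and clip-space
   w values w0, w1. */
double perspective_interp(double a0, double w0, double a1, double w1,
                          double t)
{
    double num, den;

    assert(w0 > 0.0 && w1 > 0.0);
    num = (1.0 - t) * a0 / w0 + t * a1 / w1;
    den = (1.0 - t) / w0 + t / w1;
    return num / den;
}

/* First-order estimate in the style of the MOV/DDX/MAD sequence: take the
   value at t, approximate the derivative by differencing against a
   neighbor at t + h, and extrapolate by <offset>. */
double first_order_estimate(double a0, double w0, double a1, double w1,
                            double t, double h, double offset)
{
    double center = perspective_interp(a0, w0, a1, w1, t);
    double slope  = (perspective_interp(a0, w0, a1, w1, t + h) - center) / h;

    return center + offset * slope;
}
```

      When both endpoints have the same <w>, the attribute is linear in
      screen space and the two results agree.  When <w> differs strongly
      across the span, as for a primitive seen nearly edge-on, the estimate
      departs noticeably from the perspective-correct value.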
2863
2864      The semantics of the first operand of these instructions is different
2865      from normal assembly instructions.  Operands are normally evaluated by
2866      loading the value of the corresponding variable and applying any
2867      swizzle/negation/absolute value modifier before the instruction is
2868      executed.  In the IPAC/IPAO/IPAS instructions, the value of the
2869      attribute is evaluated by the instruction itself.  Swizzles, negation,
2870      and absolute value modifiers are still allowed, and are applied after
2871      the attribute values are interpolated.
2872
2873    (9) When using a program that issues global stores (via the STORE
2874        instruction), what amount of execution ordering is guaranteed?  How
2875        can an application ensure that writes executed in a shader have
2876        completed and will be visible to other operations using the buffer
2877        object in question?
2878
2879      RESOLVED:  There are very few automatic guarantees for potential
      write/read or write/write conflicts.  Program invocations will
      generally run in arbitrary order, and applications can't rely on
2882      read/write order to match primitive order.
2883
2884      To get consistent results when buffers are read and written using
2885      multiple pipeline stages, manual synchronization using the
2886      MemoryBarrierEXT() API documented in EXT_shader_image_load_store or some
2887      other synchronization primitive is necessary.
2888
2889    (10) Unlike most other shader features, the STORE opcode allows for
2890         externally-visible side effects from executing a program.  How does
2891         this capability interact with other features of the GL?
2892
2893      RESOLVED:  First, some GL implementations support a variety of "early Z"
2894      optimizations designed to minimize unnecessary fragment processing work,
2895      such as executing an expensive fragment program on a fragment that will
2896      eventually fail the depth test.  Such optimizations have been valid
2897      because fragment programs had no side effects.  That is no longer the
2898      case, and such optimizations may not be employed if the fragment program
2899      performs a global store.  However, we provide a new "early depth and
2900      stencil test" enable that allows applications to deterministically
      control depth and stencil testing.  If enabled, depth and stencil
      testing are always performed prior to fragment program execution.
      Fragment programs will never be run on fragments that fail any of these
      tests.
2904
2905      Second, we are permitting global stores in all program types; however,
2906      the number of program invocations is not well-defined for some program
2907      types.  For example, a GL implementation may choose to combine multiple
2908      instances of identical vertices (e.g., duplicate indices in
2909      DrawElements, immediate-mode vertices with identical data) into one
2910      single vertex program invocation, or it may run a vertex program on each
2911      separately.  Similarly, the tessellation primitive generator will
2912      generate independent primitives with duplicated vertices, which may or
2913      may not be combined for tessellation evaluation program execution.
2914      Fragment program execution also has several issues described in more
2915      detail below.
2916
2917    (11) What issues arise when running fragment programs doing global stores?
2918
2919      RESOLVED:  The order of per-fragment operations in the existing OpenGL
2920      3.0 specification can be fairly loose, because previously-defined
2921      fragment programs, shaders, and fixed-function fragment processing had
2922      no side effects.  With side effects, the order of operations must be
2923      defined more tightly.  In particular, the pixel ownership and scissor
2924      tests are specified to be performed prior to fragment program execution,
2925      and we provide an option to perform depth and stencil tests early as
2926      well.
2927
2928      OpenGL implementations sometimes run fragment programs on "helper"
2929      pixels that have no coverage in order to be able to compute sane partial
      derivatives for fragment program instructions (DDX, DDY) or automatic
2931      level-of-detail calculation for texturing.  In this approach,
2932      derivatives are approximated by computing the difference in a quantity
2933      computed for a given fragment at (x,y) and a fragment at a neighboring
2934      pixel.  When a fragment program is executed on a "helper" pixel, global
2935      stores have no effect.  Helper pixels aren't explicitly mentioned in the
2936      spec body; instead, partial derivatives are obtained by magic.
2937
2938      If a fragment program contains a KIL instruction, compilers may not
      reorder code such that an ATOM or STORE instruction is executed before
      a KIL instruction that logically precedes it in flow control.  Once a
      fragment is killed, subsequent atomics or stores should never be
      executed.
2942
2943      Multisample rasterization poses several issues for fragment programs
2944      with global stores.  The number of times a fragment program is executed
2945      for multisample rendering is not fully specified, which gives
2946      implementations a number of different choices -- pure multisample (only
2947      runs once), pure supersample (runs once per covered sample), or modes in
2948      between.  There are some ways for an application to indirectly control
2949      the behavior -- for example, fragment programs specifying per-sample
2950      attribute interpolation are guaranteed to run once per covered sample.
2951
2952      Note that when rendering to a multisample buffer, a pair of adjacent
2953      triangles may cause a fragment program to be executed more than once at
2954      a given (x,y) with different sets of samples covered.  This can also
2955      occur in the interior of a quadrilateral or polygon primitive.
2956      Implementations are permitted to split quads and polygons with >3
2957      vertices into triangles, creating interior edges that split a pixel.
2958
2959    (12) What happens if early fragment tests are enabled, the early depth
2960         test passes, and a fragment program that computes a new depth value
2961         is executed?
2962
2963      RESOLVED:  The depth value produced by the fragment program has no
2964      effect if early fragment tests are enabled.  The depth value computed by
2965      a fragment program is used only by the post-fragment program stencil and
2966      depth tests, and those tests always have no effect when early depth
2967      testing is enabled.
2968
2969    (13) How do early fragment tests interact with occlusion queries?
2970
2971      RESOLVED:  When early fragment tests are enabled, sample counting for
2972      occlusion queries also happens prior to fragment program execution.
2973      Enabling early fragment tests can change the overall sample count,
2974      because samples killed by alpha test and alpha to coverage will still be
2975      counted if early fragment tests are enabled.
2976
2977    (14) What happens if a program performs a global store to a GPU address
2978         corresponding to a read-only buffer mapping?  What if it performs a
2979         global read to a write-only mapping?
2980
      RESOLVED:  Implementations may choose to implement full memory
      protection, in which case accesses using the wrong type of memory
      mapping will fault and lead to termination of the application.
2984
2985      However, full memory protection is not required in this extension --
2986      implementations may choose to substitute a read-write mapping in place
2987      of a read-only or write-only mapping.  As a result, we specify the
2988      result of such invalid loads and stores to be undefined.
2989
2990      Note that if a program erroneously writes to nominally read-only
2991      mappings, the results may be weird.  If the implementation substitutes a
2992      read-write mapping, such invalid writes are likely to proceed normally.
2993      However, if the application later makes a buffer object non-resident and
2994      the memory manager of the GL implementation needs to move the buffer,
2995      the GL may assume that the contents of the buffer have not been modified
2996      and thus discard the new values written by the (invalid) global store
2997      instructions.
2998
2999    (15) What performance considerations apply to atomics?
3000
3001      RESOLVED:  Atomics can be useful for operations like locking, or for
3002      maintaining counters.  Note that high-performance GPUs may have hundreds
3003      of program threads in flight at once, and may also have some SIMD
3004      characteristics (where threads are grouped and run as a unit).  Using
3005      ATOM instructions with a single memory address to implement a critical
3006      section will result in serial execution -- only one of the hundreds of
3007      threads can execute code in the critical section at a time.
3008
3009      When a global operation would be done under a lock, it may be possible
3010      to improve performance if the algorithm can be parallelized to have
3011      multiple critical sections.  For example, an application could allocate
3012      an array of shared resources, each protected by its own lock, and use
3013      the LSBs of the primitive ID or some function of the screen-space (x,y)
3014      to determine which resource in the array to use.
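
      The selection of a resource from the array might look like the
      following C sketch.  Both helpers are hypothetical; the power-of-two
      lock count and the tile-based mapping of (x,y) are assumptions chosen
      to keep the arithmetic simple, not requirements of this extension.

```c
#include <assert.h>
#include <stdint.h>

/* Pick one of <num_locks> locks from the low bits of the primitive ID.
   <num_locks> must be a power of two so the mask keeps only the LSBs. */
uint32_t lock_from_primitive_id(uint32_t primitive_id, uint32_t num_locks)
{
    assert(num_locks != 0 && (num_locks & (num_locks - 1)) == 0);
    return primitive_id & (num_locks - 1);
}

/* Alternative: pick a lock from the screen-space position, spreading
   neighboring pixels across a tile_w x tile_h grid of locks. */
uint32_t lock_from_position(uint32_t x, uint32_t y,
                            uint32_t tile_w, uint32_t tile_h)
{
    assert(tile_w != 0 && tile_h != 0);
    return (y % tile_h) * tile_w + (x % tile_w);
}
```

      Either mapping spreads contending threads across several independent
      critical sections, so fewer threads wait on any single lock.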
3015
3016    (16) The atomic instruction ATOM returns the old contents of memory into
         the result register.  Should we provide a version of this opcode
3018         that doesn't return a value?
3019
3020      RESOLVED:  No.  In theory, atomics that don't return any values can
3021      perform better (because the program may not need to allocate resources
      to hold a result or wait for the result).  However, a new opcode isn't
3023      required to obtain this behavior -- a compiler can recognize that the
3024      result of an ATOM instruction is written to a "dummy" temporary that
3025      isn't read by subsequent instructions:
3026
3027        TEMP junk;
3028        ATOM.ADD.U32 junk, address, 1;
3029
3030      The compiler can also recognize that the result will always be discarded
3031      if a conditional write mask of "(FL)" is used.
3032
3033        ATOM.ADD.U32 not_junk (FL), address, 1;
3034
    (17) How do we ensure that memory accesses made by multiple program
3036         invocations of possibly different types are coherent?
3037
3038      RESOLVED:  Atomic instructions allow program invocations to coordinate
3039      using shared global memory addresses.  However, memory transactions,
3040      including atomics, are not guaranteed to land in the order specified in
3041      the program; they may be reordered by the compiler, cached in different
3042      memory hierarchies, and stored in a distributed memory system where
3043      later stores to one "partition" might be completed prior to earlier
3044      stores to another.  The MEMBAR instruction helps control memory
3045      transaction ordering by ensuring that all memory transactions prior to
3046      the barrier complete before any after the barrier.  Additionally the
3047      ".COH" modifier ensures that memory transactions using the modifier are
3048      cached coherently and will be visible to other shader invocations.
3049
3050    (18) How do the TXG and TXGO opcodes work with sRGB textures?
3051
      RESOLVED:  Gamma-correction is applied to the texture source color
      before "gathering" and hence applies to all four components, unless
      the texture swizzle of the selected component is ALPHA, in which case
      no gamma-correction is applied.
3056
3057    (19) How can render-to-texture algorithms take advantage of
3058         MemoryBarrierEXT, nominally provided for global memory transactions?
3059
3060      RESOLVED: Many algorithms use RTT to ping-pong between two allocations,
3061      using the result of one rendering pass as the input to the next.
3062      Existing mechanisms require expensive FBO Binds, DrawBuffer changes, or
3063      FBO attachment changes to safely swap the render target and texture. With
3064      memory barriers, layered geometry shader rendering, and texture arrays,
3065      an application can very cheaply ping-pong between two layers of a single
      texture.  For example:
3067
3068        X = 0;
3069        // Bind the array texture to a texture unit
3070        // Attach the array texture to an FBO using FramebufferTextureARB
3071        while (!done) {
3072          // Stuff X in a constant, vertex attrib, etc.
3073          Draw -
3074            Texturing from layer X;
3075            Writing gl_Layer = 1 - X in the geometry shader;
3076
          MemoryBarrierEXT(TEXTURE_FETCH_BARRIER_BIT_EXT);
3078          X = 1 - X;
3079        }
3080
3081      However, be warned that this requires geometry shaders and hence adds
3082      the overhead that all geometry must pass through an additional program
3083      stage, so an application using large amounts of geometry could become
3084      geometry-limited or more shader-limited.
3085
3086    (20) What is the ".PREC" instruction modifier good for?
3087
      RESOLVED:  ".PREC" provides some invariance guarantees that are useful
      for certain algorithms.  Using ".PREC", it is possible to ensure that an
3090      algorithm can be written to produce identical results on subtly
3091      different inputs.  For example, the order of vertices visible to a
3092      geometry or tessellation shader used to subdivide primitive edges might
3093      present an edge shared between two primitives in one direction for one
3094      primitive and the other direction for the adjacent primitive.  Even if
3095      the weights are identical in the two cases, there may be cracking if the
3096      computations are being done in an order-dependent manner.  If the
3097      position of a new vertex were evaluated using limited-precision
3098      floating-point math, it's not necessarily the case that we will get
3099      the same result for inputs (a,b,c) and (c,b,a) in the following
3100      code:
3101
3102          ADD result, a, b;
3103          ADD result, result, c;
3104
3105      There are two problems with this code:  the rounding errors will be
3106      different and the implementation is free to rearrange the computation
3107      order.  The code can be rewritten as follows with ".PREC" and a
3108      symmetric evaluation order to ensure an identical result when the
3109      inputs are reversed:
3110
3111          ADD result, a, c;
3112          ADD.PREC result, result, b;
3113
3114      Note that in this example, the first instruction doesn't need the
3115      ".PREC" qualifier because the second instruction requires that the
3116      implementation compute <a>+<c>, which will be done reliably if <a> and
3117      <c> are inputs.  If <a> and <c> were results of other computations, the
3118      first add and possibly the dependent computations may also need to be
3119      tagged with ".PREC" to ensure reliable results.
3120
3121      The ".PREC" modifier will disable certain optimizations and thus
3122      carries a performance cost.
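
      The rounding half of the problem described in this issue can be
      reproduced on the host.  The Python sketch below is an illustration
      only:  it uses double-precision floats, the function names and input
      values are assumptions chosen to expose the effect, and Python
      evaluates each expression exactly as written.  It therefore models the
      order-dependent rounding but not the instruction reordering that
      ".PREC" is meant to prevent.

```python
# Illustration of order-dependent rounding; not a model of ".PREC" itself.
# sum_in_order and sum_symmetric are hypothetical helper names.

def sum_in_order(a, b, c):
    """Mirrors 'ADD result, a, b; ADD result, result, c': (a + b) + c."""
    return (a + b) + c


def sum_symmetric(a, b, c):
    """Mirrors 'ADD result, a, c; ADD.PREC result, result, b': (a + c) + b."""
    return (a + c) + b


# Illustrative inputs: one large value and two small ones.
a, b, c = 1.0e16, 1.0, 1.0

# Reversing the inputs changes the result of the first form ...
order_dependent = sum_in_order(a, b, c) != sum_in_order(c, b, a)

# ... but not the second, because a + c and c + a round identically.
symmetric_matches = sum_symmetric(a, b, c) == sum_symmetric(c, b, a)

print(order_dependent, symmetric_matches)
```

      The symmetric form relies on floating-point addition being
      commutative even though it is not associative, which is why only the
      outer operands are exchanged when the inputs are reversed.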
3123
3124    (21) What are the TGALL, TGANY, TGEQ instructions good for?
3125
3126      RESOLVED:  If an implementation performs SIMD thread execution,
3127      divergent branching may result in reduced performance if the "if" and
3128      "else" blocks of an "if" statement are executed sequentially.  For
3129      example, an algorithm may have both a "fast path" that performs a
3130      computation quickly for a subset of all cases and a "slow path" that
3131      performs a computation more slowly but correctly.  When performing SIMD
3132      execution, code like the following:
3133
3134        SNE.S.CC cc.x, condition.x;
3135        IF NE.x;
3136          # do fast path
3137        ELSE;
3138          # do slow path
3139        ENDIF;
3140
3141      may end up executing *both* the fast and slow paths for a SIMD thread
3142      group if <condition> diverges, and may execute more slowly than simply
3143      executing the slow path unconditionally.  These instructions allow code
3144      like:
3145
3146        # Condition code matches NE if and only if condition.x is non-zero
3147        # for all threads.
3148        TGALL.S.CC cc.x, condition.x;
3149        IF NE.x;
3150          # do fast path
3151        ELSE;
3152          # do slow path
3153        ENDIF;
3154
3155      that executes the fast path if and only if it can be used for *all*
3156      threads in the group.  For thread groups where <condition> diverges,
3157      this algorithm would unconditionally run the slow path, but would never
3158      run both in sequence.
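
      The trade-off described above can be sketched with a toy cost model.
      The Python code below is an illustration only:  the cost constants and
      function names are assumptions, and a thread group is modeled as a
      list of per-thread boolean conditions.  It compares a per-thread
      branch, where a divergent group pays for both paths, with a
      group-wide vote in the style of TGALL, where a group runs exactly one
      path.

```python
# Toy cost model for one SIMD thread group; all numbers are hypothetical.
FAST_COST = 1   # assumed cost of the fast path, in arbitrary units
SLOW_COST = 4   # assumed cost of the slow path, in arbitrary units


def cost_divergent_branch(conditions):
    """Per-thread branch (the SNE/IF pattern): the group executes every
    path that at least one of its threads takes, one after the other."""
    cost = 0
    if any(conditions):
        cost += FAST_COST
    if not all(conditions):
        cost += SLOW_COST
    return cost


def cost_group_vote(conditions):
    """Group-wide vote (the TGALL/IF pattern): the fast path runs only if
    the condition holds for every thread; otherwise only the slow path."""
    return FAST_COST if all(conditions) else SLOW_COST


uniform = [True, True, True, True]
diverged = [True, False, True, True]
print(cost_divergent_branch(uniform), cost_group_vote(uniform))
print(cost_divergent_branch(diverged), cost_group_vote(diverged))
```

      Under these assumed costs the two approaches are identical for
      uniform groups and differ only when the condition diverges, where the
      vote avoids paying for the fast path on top of the slow one.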
3159
3160
3161Revision History
3162
3163    Rev.    Date    Author    Changes
3164    ----  --------  --------  -----------------------------------------
3165     7    09/11/14  pbrown    Minor typo fixes.
3166
3167     6    07/04/13  pbrown    Add missing language describing the
3168                              <texImageUnitComp> grammar rule for component
3169                              selection in TXG and TXGO instructions.
3170
3171     5    09/23/10  pbrown    Add missing constants for {MIN,MAX}_PROGRAM_
3172                              TEXTURE_GATHER_OFFSET_NV (same as ARB/core).
3173                              Add missing description for "su" in the opcode
3174                              table; fix a couple operand order bugs for
3175                              STORE.
3176
3177     4    06/22/10  pbrown    Specify that the y/z/w components of the ATOM
3178                              results are undefined, as is the case with
3179                              ATOMIM from EXT_shader_image_load_store.
3180
3181     3    04/13/10  pbrown    Remove F32 support from ATOM.ADD.
3182
3183     2    03/22/10  pbrown    Various wording updates to the spec overview,
3184                              dependencies, issues, and body.  Remove various
3185                              spec language that has been refactored into the
3186                              EXT_shader_image_load_store specification.
3187
3188     1              pbrown    Internal revisions.
3189