Name

    NV_parameter_buffer_object2

Name Strings

    GL_NV_parameter_buffer_object2

Contact

    Pat Brown, NVIDIA Corporation (pbrown 'at' nvidia.com)

Status

    Shipping (July 2009, Release 190)

Version

    Last Modified Date:         09/09/09
    NVIDIA Revision:            2

Number

    378

Dependencies

    OpenGL 2.0 is required.

    NV_gpu_program4 is required.

    NV_parameter_buffer_object is required.

    This extension is written against the NV_gpu_program4 specification.

    NV_shader_buffer_load trivially affects the definition of this extension.

Overview

    This extension builds on the NV_parameter_buffer_object extension to
    provide additional flexibility in sourcing data from buffer objects.

    The original NV_parameter_buffer_object (PaBO) extension provided the
    ability to bind buffer objects to a set of numbered binding points and
    access them in assembly programs as though they were arrays of 32-bit
    scalars (via the BUFFER variable type) or arrays of four-component vectors
    with 32-bit scalar components (via the BUFFER4 variable type).  However,
    the functionality it provided had some significant limits on flexibility.
    Since any given buffer binding point could be used either as a BUFFER or
    BUFFER4, but not both, programs couldn't do both 32- and 128-bit fetches
    from a single binding point.  Additionally, no support was provided for
    8-, 16-, or 64-bit fetches, though they could be emulated using larger
    loads, with bitfield operations and/or write masking to put components in
    the right places.  Indexing was supported, but strides were limited to 4-
    and 16-byte multiples, depending on whether BUFFER or BUFFER4 was used.

    This new extension provides the buffer variable declaration type CBUFFER
    to specify a buffer that is treated as an array of bytes, rather than an
    array of words or vectors.  The LDC instruction allows programs to extract
    a vector of data from a CBUFFER variable, using a size and component count
    specified in the opcode modifier.  1-, 2-, and 4-component fetches are
    supported.  The LDC instruction supports byte offsets using normal array
    indexing mechanisms; both run-time and immediate offsets are supported.
    Offsets used for a buffer object fetch are required to be aligned to the
    size of the fetch (1, 2, 4, 8, or 16 bytes).

New Procedures and Functions

    None.

New Tokens

    None.

Additions to Chapter 2 of the OpenGL 3.0 Specification (OpenGL Operation)

    (All modifications are relative to Section 2.X, GPU Programs, from the
     NV_gpu_program4 specification.)

    Modify Section 2.X.2, Program Grammar

    (add after the long list of grammar rules) If a program specifies the
    NV_parameter_buffer_object2 program option, the following rules are added
    to the NV_gpu_program4 base program grammar:

    <VECTORop>              ::= "LDC"

    <opModifier>            ::= "F32"
                              | "F32X2"
                              | "F32X4"
                              | "S8"
                              | "S16"
                              | "S32"
                              | "S32X2"
                              | "S32X4"
                              | "U8"
                              | "U16"
                              | "U32"
                              | "U32X2"
                              | "U32X4"

    <bufferDeclType>        ::= "CBUFFER"


    Modify Section 2.X.3.6, Program Parameter Buffers

    (modify the paragraph describing the different types of parameter buffer
    variable declarations to include support for "CBUFFER".)

    Program parameter buffer variables are treated as an array of
    single-component words if the <bufferDeclType> grammar rule matches
    "BUFFER" or as an array of four-component vectors if it matches "BUFFER4".
    Program parameter buffers may also be declared as an array of basic
    machine units from which data can be extracted using the LDC (load
    constant) instruction, if <bufferDeclType> matches "CBUFFER".  Parameter
    buffer variables declared using "CBUFFER" may not be used as an operand in
    any instruction other than LDC, while "BUFFER" and "BUFFER4" variables may
    not be used with LDC.  A program will fail to load if a variable declared
    as "BUFFER" and another variable declared as "BUFFER4" use the same buffer
    binding point.  There is no limitation on the use of "CBUFFER" variables
    in conjunction with "BUFFER" or "BUFFER4" variables using the same buffer
    binding point.

    (modify/restructure the paragraph describing basic program parameter
     bindings to handle the byte bindings provided by "CBUFFER" variables)

    If a program parameter buffer binding matches "program.buffer[a][b]", the
    program parameter variable corresponds to element <b> of the buffer object
    bound to binding point <a>.  Each element of the bound buffer object is
    treated as:

      * a single basic machine unit of data, if the variable is declared using
        "CBUFFER";

      * a single word of data that can hold an integer or floating-point
        value, if the variable is declared as "BUFFER"; or

      * four words of data that can hold integer or floating-point values, if
        the variable is declared as "BUFFER4".

    When a binding corresponding to a "BUFFER" variable is used as an operand,
    the selected word is broadcast to all four components of the variable.
    When a binding corresponding to a "BUFFER4" variable is used as an
    operand, the four components of the selected buffer element are loaded
    into the variable.  A binding corresponding to a "CBUFFER" variable may be
    used only in the LDC instruction, and will be used there as a pointer to
    extract operand values from buffer memory.  If no buffer object is bound
    to binding point <a>, or the bound buffer object is not large enough to
    hold element <b>, the values used are undefined.  The binding point <a>
    must be a nonnegative integer constant.


    Modify Section 2.X.4, Program Execution Environment

    (Add to the set of opcodes in Table X.13)

                  Modifiers
      Instruction F I C S H D  Out Inputs    Description
      ----------- - - - - - -  --- --------  --------------------------------
      LDC         X X X X - F  v   v         load from constant buffer


    Modify Section 2.X.4.1, Program Instruction Modifiers

    (Add to Table X.14, Instruction Modifiers, and to the corresponding
    description following the table)

      Modifier  Description
      --------  -----------------------------------------------
      F32       Access one 32-bit floating-point value
      F32X2     Access two 32-bit floating-point values
      F32X4     Access four 32-bit floating-point values
      S8        Access one 8-bit signed integer value
      S16       Access one 16-bit signed integer value
      S32       Access one 32-bit signed integer value
      S32X2     Access two 32-bit signed integer values
      S32X4     Access four 32-bit signed integer values
      U8        Access one 8-bit unsigned integer value
      U16       Access one 16-bit unsigned integer value
      U32       Access one 32-bit unsigned integer value
      U32X2     Access two 32-bit unsigned integer values
      U32X4     Access four 32-bit unsigned integer values

    For memory load operations, the "F32", "F32X2", "F32X4", "S8", "S16",
    "S32", "S32X2", "S32X4", "U8", "U16", "U32", "U32X2", and "U32X4" storage
    modifiers control how data are loaded from memory.  Storage modifiers are
    supported by the LDC and LOAD instructions and are covered in more detail
    in the descriptions of these instructions.  These instructions must
    specify exactly one of these modifiers, and may not specify any of the
    base data type modifiers (F, U, S) described above.  The base data type of
    the result vector of a LOAD or LDC instruction is trivially derived from
    the storage modifier.


    Add New Section 2.X.4.5, Program Memory Access

    Programs may load from buffer object memory via the LDC (load constant)
    and LOAD (global load) instructions.

    Load instructions read 8, 16, 32, 64, or 128 bits of data from a source
    address to produce a four-component vector, according to the storage
    modifier specified with the instruction.  The storage modifier has three
    parts:

      - a base data type, "F", "S", or "U", specifying that the instruction
        fetches floating-point, signed integer, or unsigned integer values,
        respectively;

      - a component size, specifying that the components fetched by the
        instruction have 8, 16, or 32 bits; and

      - an optional component count, where "X2" and "X4" indicate that two or
        four components be fetched, and no count indicates a single component
        fetch.
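
    As an illustration only, the three parts of a storage modifier can be
    modeled with a small lookup table.  The StorageModifier struct and the
    decode_storage_modifier and fetch_size_bytes helpers below are
    hypothetical names that do not appear in this specification; the table
    entries are the thirteen modifiers listed in Table X.14.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical decoded form of a storage modifier; not part of the spec. */
typedef struct {
    char base;   /* 'F', 'S', or 'U' */
    int  bits;   /* component size in bits: 8, 16, or 32 */
    int  count;  /* components fetched: 1, 2, or 4 */
} StorageModifier;

/* Decode one of the thirteen modifiers from Table X.14.
 * Returns false for any string that is not in that list. */
static bool decode_storage_modifier(const char *s, StorageModifier *out)
{
    static const struct { const char *name; StorageModifier m; } table[] = {
        { "F32",   { 'F', 32, 1 } }, { "F32X2", { 'F', 32, 2 } },
        { "F32X4", { 'F', 32, 4 } }, { "S8",    { 'S',  8, 1 } },
        { "S16",   { 'S', 16, 1 } }, { "S32",   { 'S', 32, 1 } },
        { "S32X2", { 'S', 32, 2 } }, { "S32X4", { 'S', 32, 4 } },
        { "U8",    { 'U',  8, 1 } }, { "U16",   { 'U', 16, 1 } },
        { "U32",   { 'U', 32, 1 } }, { "U32X2", { 'U', 32, 2 } },
        { "U32X4", { 'U', 32, 4 } },
    };
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strcmp(s, table[i].name) == 0) {
            *out = table[i].m;
            return true;
        }
    }
    return false;
}

/* Total number of bytes read by the fetch: 1, 2, 4, 8, or 16. */
static int fetch_size_bytes(StorageModifier m)
{
    return m.bits / 8 * m.count;
}
```

    With this model, "U32X2" decodes to an unsigned 32-bit component with a
    count of two (an 8-byte fetch), while a string such as "S8X4" is rejected
    because the grammar defines no multi-component 8-bit modifier.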

    When the storage modifier specifies that fewer than four components should
    be fetched, remaining components are filled with zeroes.  When performing
    a global load (LOAD), the GPU address is specified as an instruction
    operand.  When performing a constant buffer load (LDC), the GPU address is
    derived by adding the base address of the bound buffer object to an offset
    specified as an instruction operand.  Given a GPU address <address> and a
    storage modifier <modifier>, the memory load can be described by the
    following code:

      result_t_vec BufferMemoryLoad(char *address, OpModifier modifier)
      {
        result_t_vec result = { 0, 0, 0, 0 };
        switch (modifier) {
        case F32:
            result.x = ((float32_t *)address)[0];
            break;
        case F32X2:
            result.x = ((float32_t *)address)[0];
            result.y = ((float32_t *)address)[1];
            break;
        case F32X4:
            result.x = ((float32_t *)address)[0];
            result.y = ((float32_t *)address)[1];
            result.z = ((float32_t *)address)[2];
            result.w = ((float32_t *)address)[3];
            break;
        case S8:
            result.x = ((int8_t *)address)[0];
            break;
        case S16:
            result.x = ((int16_t *)address)[0];
            break;
        case S32:
            result.x = ((int32_t *)address)[0];
            break;
        case S32X2:
            result.x = ((int32_t *)address)[0];
            result.y = ((int32_t *)address)[1];
            break;
        case S32X4:
            result.x = ((int32_t *)address)[0];
            result.y = ((int32_t *)address)[1];
            result.z = ((int32_t *)address)[2];
            result.w = ((int32_t *)address)[3];
            break;
        case U8:
            result.x = ((uint8_t *)address)[0];
            break;
        case U16:
            result.x = ((uint16_t *)address)[0];
            break;
        case U32:
            result.x = ((uint32_t *)address)[0];
            break;
        case U32X2:
            result.x = ((uint32_t *)address)[0];
            result.y = ((uint32_t *)address)[1];
            break;
        case U32X4:
            result.x = ((uint32_t *)address)[0];
            result.y = ((uint32_t *)address)[1];
            result.z = ((uint32_t *)address)[2];
            result.w = ((uint32_t *)address)[3];
            break;
        }
        return result;
      }

    The offset used for the constant buffer loads must be aligned to the fetch
    size corresponding to the storage opcode modifier.  For S8 and U8, the
    offset has no alignment requirements.  For S16 and U16, the offset must be
    a multiple of two basic machine units.  For F32, S32, and U32, the offset
    must be a multiple of four.  For F32X2, S32X2, and U32X2, the offset must
    be a multiple of eight.  For F32X4, S32X4, and U32X4, the offset must be a
    multiple of sixteen.  If an offset is not correctly aligned, the values
    returned by a constant buffer load will be undefined.
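
    The alignment rule reduces to a single divisibility test on the byte
    offset.  The following is a minimal sketch; the function name
    ldc_offset_is_aligned is hypothetical, and <fetch_bytes> is assumed to
    be the total fetch size in basic machine units (1, 2, 4, 8, or 16).

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if a constant buffer load of fetch_bytes bytes at the given
 * byte offset satisfies the alignment rule.  fetch_bytes is the component
 * size in bytes times the component count, so it is 1, 2, 4, 8, or 16. */
static bool ldc_offset_is_aligned(uint32_t offset, uint32_t fetch_bytes)
{
    return offset % fetch_bytes == 0;
}
```

    For example, an offset of 12 is acceptable for an F32 fetch (4 bytes) but
    not for an F32X4 fetch (16 bytes), and any offset is acceptable for S8 or
    U8.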


    Modify Section 2.X.6, Program Options

    + Extended Parameter Buffer Object Support (NV_parameter_buffer_object2)

    If a program specifies the "NV_parameter_buffer_object2" option, it may
    use the CBUFFER statement to declare program parameter buffer variables
    and the LDC instruction to load data from parameter buffer variables using
    arbitrary offsets.


    Modify Section 2.X.8, Program Instruction Set

    Section 2.X.8.Z, LDC:  Load from Constant Buffer

    The LDC instruction loads a vector operand from a buffer object to yield a
    result vector.  The operand used for the LDC instruction must correspond
    to a parameter buffer variable declared using the "CBUFFER" statement; a
    program will fail to load if any other type of operand is used in an LDC
    instruction.

      result = BufferMemoryLoad(&op0, storageModifier);

    A base operand vector is fetched from memory as described in Section
    2.X.4.5, with the GPU address derived from the binding corresponding to
    the operand.  A final operand vector is derived from the base operand
    vector by applying swizzle, negation, and absolute value operand modifiers
    as described in Section 2.X.4.2.

    The amount of memory in any given buffer object binding accessible by the
    LDC instruction may be limited.  If any component fetched by the LDC
    instruction extends 4*<n> or more basic machine units from the beginning
    of the buffer object binding, where <n> is the implementation-dependent
    constant MAX_PROGRAM_PARAMETER_BUFFER_SIZE_NV, the value fetched for that
    component will be undefined.
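
    One possible reading of this limit is sketched below: a component is
    treated as defined only if all of its bytes lie within the first 4*<n>
    basic machine units of the binding.  That reading is an assumption, as
    are the function name ldc_component_is_defined and the parameter
    <max_words>, which stands in for the queried value of
    MAX_PROGRAM_PARAMETER_BUFFER_SIZE_NV.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if a component of component_bytes bytes starting at the
 * given byte offset lies entirely within the first 4*max_words bytes of
 * the buffer binding.  max_words is a stand-in for the value of
 * MAX_PROGRAM_PARAMETER_BUFFER_SIZE_NV (assumed, not queried here). */
static bool ldc_component_is_defined(uint64_t offset,
                                     uint32_t component_bytes,
                                     uint32_t max_words)
{
    uint64_t limit = 4ull * max_words;   /* accessible size in bytes */
    return offset + component_bytes <= limit;
}
```

    With <max_words> equal to 16384, a 32-bit component at byte offset 65532
    is the last one that is fully accessible; a component starting at offset
    65536 is not.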

    LDC supports no base data type modifiers, but requires exactly one storage
    modifier.  The base data types of the operand and result vectors are
    derived from the storage modifier.


Additions to Chapter 3 of the OpenGL 3.0 Specification (Rasterization)

    None.

Additions to Chapter 4 of the OpenGL 3.0 Specification (Per-Fragment
Operations and the Frame Buffer)

    None.

Additions to Chapter 5 of the OpenGL 3.0 Specification (Special Functions)

    None.

Additions to Chapter 6 of the OpenGL 3.0 Specification (State and
State Requests)

    None.

Additions to Appendix A of the OpenGL 3.0 Specification (Invariance)

    None.

Additions to the AGL/GLX/WGL Specifications

    None.

Errors

    No new errors.

Dependencies on NV_shader_buffer_load

    If NV_shader_buffer_load (or equivalent functionality) is not supported,
    references to the "LOAD" opcode in the description of the opcode modifiers
    for "LDC" should be removed.

New State

    None.

New Implementation Dependent State

    None.

Issues

    (1) What sort of alignment requirements, if any, should be imposed on the
        operand provided to the LDC instruction?

      RESOLVED:  The offset of the operand must be aligned according to the
      size of the fetch.  For 1-, 2-, and 4-component fetches, the offset must
      be a multiple of <N>, 2*<N>, and 4*<N>, where <N> is the size in bytes
      of the components being fetched.

    (2) NV_parameter_buffer_object provides an implementation-dependent limit
        on the portion of a buffer object that may be fetched via BUFFER and
        BUFFER4 variables.  Should the same limits apply to the LDC
        instruction?

      RESOLVED:  Yes.  On currently shipping NVIDIA GPUs, the maximum program
      parameter buffer size is 16384 32-bit words, or 64KB.  Buffers larger
      than 64KB may be used, but any fetches accessing memory beyond the first
      64KB of a buffer binding will return undefined values.

    (3) Should we support fetches of 3-component vectors?  If so, what should
        be the minimum alignment for the specified offset?

      RESOLVED:  No, we'll leave 3-component vectors out of this extension.
      This limitation can be worked around either by doing three separate
      single-component fetches or by doing a four-component fetch with an
      appropriate write mask.  The former approach supports indexing in a
      tightly packed array of 3-component vectors; the latter would require
      that array elements be padded to four components.
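
      The first workaround can be modeled on the CPU as three 4-byte reads
      from a tightly packed array with a 12-byte stride.  This is a sketch
      only; the Vec3 type and the fetch_packed_vec3 helper are hypothetical,
      and each memcpy stands in for one LDC.F32 fetch.  Because every offset
      used is a multiple of four, each fetch meets the F32 alignment rule.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 3-component result type; not part of the spec. */
typedef struct { float x, y, z; } Vec3;

/* Models three single-component F32 fetches at byte offsets
 * 12*i, 12*i + 4, and 12*i + 8 within a byte-addressed buffer. */
static Vec3 fetch_packed_vec3(const unsigned char *buffer, uint32_t i)
{
    Vec3 v;
    uint32_t base = 12u * i;                 /* stride of a packed vec3 */
    memcpy(&v.x, buffer + base,     sizeof(float));
    memcpy(&v.y, buffer + base + 4, sizeof(float));
    memcpy(&v.z, buffer + base + 8, sizeof(float));
    return v;
}
```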

    (4) Should we support fetches of 8- and 16-bit components?

      RESOLVED:  Yes, we will support fetches of 8- and 16-bit signed and
      unsigned integers.

      Fetches of vectors of 8- and 16-bit integers are not supported but may
      be emulated by performing shift/mask operations on the results of 32-bit
      fetches.
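
      The shift/mask emulation can be sketched in C as follows.  The helper
      names unpack_u8x4 and unpack_s8x4 are hypothetical, and the code
      assumes little-endian byte order, so that byte <k> of the fetched word
      occupies bits 8*<k> through 8*<k>+7; this specification does not state
      a byte order.

```c
#include <stdint.h>

/* Split one 32-bit word into four unsigned 8-bit components.
 * Assumes byte k of the word sits in bits 8k..8k+7 (little-endian). */
static void unpack_u8x4(uint32_t word, uint32_t out[4])
{
    for (int k = 0; k < 4; k++)
        out[k] = (word >> (8 * k)) & 0xFFu;
}

/* Same split, but each byte is sign-extended to a 32-bit signed value. */
static void unpack_s8x4(uint32_t word, int32_t out[4])
{
    for (int k = 0; k < 4; k++) {
        int32_t b = (int32_t)((word >> (8 * k)) & 0xFFu);
        out[k] = (b >= 128) ? b - 256 : b;
    }
}
```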

      Fetches of 16-bit floating-point values, or floating-point vectors
      thereof, are not supported.  A single fp16 fetch may be emulated using a
      16-bit unsigned integer fetch and the UP2H instruction to convert the 16
      LSBs of the fetch to a floating-point value.  The encoding of 16-bit
      floating-point values is described in section 2.1.2 of the OpenGL 3.0
      specification.

    (5) Should we support fetches of 64-bit components?

      RESOLVED:  No; the instruction set provided by NV_gpu_program4 does not
      support 64-bit components anywhere.  If future instructions support
      64-bit components, this restriction should be removed.

    (6) How should the operands of the LDC instruction be specified?

      RESOLVED:  We will create a new type of buffer variable ("CBUFFER"),
      which defines an array of bytes to be fetched from.  The type of fetch
      to perform is specified by a storage modifier (as in
      NV_shader_buffer_load).  An offset relative to the buffer binding (in
      bytes) may be specified using normal array indexing syntax, and an index
      computed at run-time is supported.

      Some examples:

        CBUFFER buffer[] = { program.buffer[0] };
        TEMP      i;
        MOV.S     i, 32;                  # computed offset of 32B
        LDC.F32   result, buffer[12];     # (x,0,0,0) from bytes 12..15
        LDC.F32X4 result, buffer[16];     # (x,y,z,w) from bytes 16..31
        LDC.U8    result, buffer[i.x+3];  # (x,0,0,0) from byte 35
        LDC.S32   result, buffer[i.x+12]; # (x,0,0,0) from bytes 44..47
        LDC.U32X2 result, buffer[i.x+8];  # (x,y,0,0) from bytes 40..47
        LDC.S16   result, buffer[i.x+2];  # (x,0,0,0) from bytes 34..35

      We chose to provide the new buffer variable type (CBUFFER) rather than
      reusing BUFFER or BUFFER4.  For CBUFFER variables, "buffer[12]"
      unambiguously specifies a 12-byte offset.  For BUFFER or BUFFER4
      variables, an operand of "buffer[12]" already has an existing meaning,
      implying an offset of 12 words or vectors, which would be 48 or 192
      bytes, respectively.  Because we want to be able to fetch 8- and 16-bit
      units, having an offset multiplied by four doesn't make sense.  We could
      have had LDC simply ignore the type of binding and always interpret an
      index as a byte offset, but chose the new declaration type to avoid
      confusion.
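
      The difference in how an array index maps to a byte offset for the
      three declaration types can be summarized with a short sketch.  The
      BufferDeclType enum and the index_to_byte_offset function are
      hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Hypothetical tag for the three buffer declaration types. */
typedef enum { DECL_BUFFER, DECL_BUFFER4, DECL_CBUFFER } BufferDeclType;

/* Byte offset implied by an array index for each declaration type:
 * BUFFER indexes 4-byte words, BUFFER4 indexes 16-byte vectors, and
 * CBUFFER indexes single bytes. */
static uint32_t index_to_byte_offset(BufferDeclType type, uint32_t index)
{
    switch (type) {
    case DECL_BUFFER:  return index * 4u;
    case DECL_BUFFER4: return index * 16u;
    case DECL_CBUFFER: return index;
    }
    return 0;  /* not reached for valid enum values */
}
```

      For an index of 12, this yields 48, 192, and 12 bytes for BUFFER,
      BUFFER4, and CBUFFER, respectively, matching the figures above.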

      We also considered an approach where the buffer and offset were
      specified in separate operands.  That would be similar to texture
      instructions, where the coordinates and texture are specified
      separately.  The first operand would have been interpreted as an
      unsigned scalar specifying a byte offset, the second operand would have
      specified a buffer variable binding, and a pointer would be obtained by
      adding the two operands.  This would have looked something like:

        BUFFER buffer[] = { program.buffer[0] };
        LDC.S32X2 result, offset.x, buffer;

      We chose not to implement this approach mainly because this syntax would
      require specifying a new type of instruction; the syntax we adopted
      simply reuses existing vector operand and indexing mechanisms.
      Additionally, the syntax in this extension provides immediate offsets
      for "free", which the offset-buffer syntax would not support directly
      without additional new syntax.  For example, to load a structure with a
      pair of two-component vectors using offset-buffer syntax, you would have
      to do something like:

        BUFFER buffer[] = { program.buffer[0] };
        TEMP offset;
        LDC.S32X2 result1, offset.x, buffer;
        ADD.U offset.x, offset.x, 8;            # bump offset to second vector
        LDC.S32X2 result2, offset.x, buffer;

    (7) How should the fetches in the LDC instruction interact with other
        operand modifiers (swizzle, absolute value, negation)?  With result
        modifiers (condition codes, saturation)?

      RESOLVED:  These features will be orthogonal.  When any of these
      modifiers are specified, the base data type to which they apply comes
      from the storage modifier of the LDC instruction.

      The LDC instruction is defined to produce a "base operand vector" from a
      memory fetch.  This isn't particularly different from normal operands,
      where a base operand vector is derived from the binding corresponding to
      the operand.  In both cases, the components of this vector are swizzled
      and have optional absolute value and negation operations performed to
      produce a final vector operand, as is the case with other vector
      operands.

      If condition code operations or saturation are specified for the result
      vector, these operations are performed using the appropriate data types.

    (8) What happens if a non-zero base offset is specified for a CBUFFER
        variable?

      RESOLVED:  A subset of the bytes in a buffer object can be specified
      using range syntax like the following:

        CBUFFER buffer[] = { program.buffer[0][16..31] };

      The sub-range need not start at the beginning of the buffer object; in
      the example above, it starts 16 bytes into the buffer.  When accessing a
      parameter buffer variable corresponding to such a sub-range, an array
      index is relative to the base of the sub-range.  So the offset of the
      sub-range is effectively added to the index used for the LDC operand:

        LDC.F32   result, buffer[12];     # (x,0,0,0) from bytes 28..31

    (9) What happens if a non-array CBUFFER variable is used?

      RESOLVED:  A non-array variable may be used with LDC.  However, array
      indexing isn't supported with non-array variables, so all LDC loads
      using that variable will fetch using the same base address.

        CBUFFER bufferElement = program.buffer[0][32];
        LDC.U8    result, bufferElement;     # (x,0,0,0) from byte 32
        LDC.S16   result, bufferElement;     # (x,0,0,0) from bytes 32..33
        LDC.F32   result, bufferElement;     # (x,0,0,0) from bytes 32..35
        LDC.F32X4 result, bufferElement;     # (x,y,z,w) from bytes 32..47

    (10) Should single-component fetches from LDC smear their results across
         all four components of the result vector, to allow packing multiple
         non-vectors into a single vector?

      RESOLVED:  No.  However, swizzle suffixes on the operand will provide
      this capability for free.  For example, let's say you wanted to fetch
      four scalars from a buffer and pack the results into a single temporary
      vector.  The swizzle syntax lets you do this by smearing the real
      component (always fetched in "x") into the other components:

        CBUFFER buffer[] = { program.buffer[0] };
        LDC.F32 temp.x, buffer[16];
        LDC.F32 temp.y, buffer[28].x;
        LDC.F32 temp.z, buffer[32].x;
        LDC.F32 temp.w, buffer[40].x;
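
      A small CPU-side model may help show why the ".x" swizzle matters when
      writing a component other than "x".  All names below (Vec4, ldc_f32,
      swizzle_xxxx, write_masked) are hypothetical, and ldc_f32 takes the
      fetched value directly rather than reading buffer memory.

```c
/* Hypothetical four-component vector; c[0..3] correspond to x, y, z, w. */
typedef struct { float c[4]; } Vec4;

/* Model of the base operand produced by LDC.F32: only x holds the fetched
 * value; y, z, and w are zero. */
static Vec4 ldc_f32(float value)
{
    Vec4 v = {{ value, 0.0f, 0.0f, 0.0f }};
    return v;
}

/* Model of the ".x" scalar swizzle: replicate x into all four components. */
static Vec4 swizzle_xxxx(Vec4 v)
{
    Vec4 r = {{ v.c[0], v.c[0], v.c[0], v.c[0] }};
    return r;
}

/* Model of a single-component write mask: copy component k of src to dst. */
static void write_masked(Vec4 *dst, int k, Vec4 src)
{
    dst->c[k] = src.c[k];
}
```

      In this model, writing component "y" from an unswizzled fetch stores
      zero, since only "x" was fetched; applying swizzle_xxxx first stores
      the fetched scalar instead.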


Revision History

    Rev.    Date    Author    Changes
    ----  --------  --------  -----------------------------------------
     1              pbrown    Internal revisions.
     2    09/09/09  mjk       Assigned number
