Name

    NV_parameter_buffer_object2

Name Strings

    GL_NV_parameter_buffer_object2

Contact

    Pat Brown, NVIDIA Corporation (pbrown 'at' nvidia.com)

Status

    Shipping (July 2009, Release 190)

Version

    Last Modified Date:         09/09/09
    NVIDIA Revision:            2

Number

    378

Dependencies

    OpenGL 2.0 is required.

    NV_gpu_program4 is required.

    NV_parameter_buffer_object is required.

    This extension is written against the NV_gpu_program4 specification.

    NV_shader_buffer_load trivially affects the definition of this extension.

Overview

    This extension builds on the NV_parameter_buffer_object extension to
    provide additional flexibility in sourcing data from buffer objects.

    The original NV_parameter_buffer_object (PaBO) extension provided the
    ability to bind buffer objects to a set of numbered binding points and
    access them in assembly programs as though they were arrays of 32-bit
    scalars (via the BUFFER variable type) or arrays of four-component vectors
    with 32-bit scalar components (via the BUFFER4 variable type). However,
    the functionality it provided had some significant limits on flexibility.
    Since any given buffer binding point could be used either as a BUFFER or
    BUFFER4, but not both, programs couldn't do both 32- and 128-bit fetches
    from a single binding point. Additionally, no support was provided for
    8-, 16-, or 64-bit fetches, though they could be emulated using larger
    loads, with bitfield operations and/or write masking to put components in
    the right places. Indexing was supported, but strides were limited to 4-
    and 16-byte multiples, depending on whether BUFFER or BUFFER4 was used.

    This new extension provides the buffer variable declaration type CBUFFER
    to specify a buffer that is treated as an array of bytes, rather than an
    array of words or vectors.
    The LDC instruction allows programs to extract a vector of data from a
    CBUFFER variable, using a size and component count specified in the
    opcode modifier. 1-, 2-, and 4-component fetches are supported. The LDC
    instruction supports byte offsets using normal array indexing mechanisms;
    both run-time and immediate offsets are supported. Offsets used for a
    buffer object fetch are required to be aligned to the size of the fetch
    (1, 2, 4, 8, or 16 bytes).

New Procedures and Functions

    None.

New Tokens

    None.

Additions to Chapter 2 of the OpenGL 3.0 Specification (OpenGL Operation)

    (All modifications are relative to Section 2.X, GPU Programs, from the
    NV_gpu_program4 specification.)

    Modify Section 2.X.2, Program Grammar

    (add after the long list of grammar rules) If a program specifies the
    NV_parameter_buffer_object2 program option, the following rules are added
    to the NV_gpu_program4 base program grammar:

      <VECTORop>              ::= "LDC"

      <opModifier>            ::= "F32";
                                | "F32X2";
                                | "F32X4";
                                | "S8";
                                | "S16";
                                | "S32";
                                | "S32X2";
                                | "S32X4";
                                | "U8";
                                | "U16";
                                | "U32";
                                | "U32X2";
                                | "U32X4";

      <bufferDeclType>        ::= "CBUFFER"


    Modify Section 2.X.3.6, Program Parameter Buffers

    (modify the paragraph describing the different type of parameter buffer
    variable declarations to include support for "CBUFFER".)

    Program parameter buffer variables are treated as an array of
    single-component words if the <bufferDeclType> grammar rule matches
    "BUFFER" or as an array of four-component vectors if it matches "BUFFER4".
    Program parameter buffers may also be declared as an array of basic
    machine units from which data can be extracted using the LDC (load
    constant) instruction, if <bufferDeclType> matches "CBUFFER".
    Parameter buffer variables declared using "CBUFFER" may not be used as an
    operand in any instruction other than LDC, while "BUFFER" and "BUFFER4"
    variables may not be used with LDC. A program will fail to load if a
    variable declared as "BUFFER" and another variable declared as "BUFFER4"
    use the same buffer binding point. There is no limitation on the use of
    "CBUFFER" variables in conjunction with "BUFFER" or "BUFFER4" variables
    using the same buffer binding point.

    (modify/restructure the paragraph describing basic program parameter
    bindings to handle the byte bindings provided by "CBUFFER" variables)

    If a program parameter buffer binding matches "program.buffer[a][b]", the
    program parameter variable corresponds to element <b> of the buffer object
    bound to binding point <a>. Each element of the bound buffer object is
    treated as:

      * a single basic machine unit of data, if the variable is declared using
        "CBUFFER";

      * a single word of data that can hold an integer or floating-point
        value, if the variable is declared as "BUFFER"; or

      * four words of data that can hold integer or floating-point values, if
        the variable is declared as "BUFFER4".

    When a binding corresponding to a "BUFFER" variable is used as an operand,
    the selected word is broadcast to all four components of the variable.
    When a binding corresponding to a "BUFFER4" variable is used as an
    operand, the four components of the selected buffer element are loaded
    into the variable. A binding corresponding to a "CBUFFER" variable may be
    used only in the LDC instruction, and will be used there as a pointer to
    extract operand values from buffer memory. If no buffer object is bound
    to binding point <a>, or the bound buffer object is not large enough to
    hold element <b>, the values used are undefined. The binding point <a>
    must be a nonnegative integer constant.
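
    As an informal illustration of the element sizes listed above, the
    following C sketch computes the byte offset of element <b> for each
    declaration type and checks whether that element fits in a bound buffer.
    It assumes 32-bit words and 8-bit basic machine units. The names
    BufferDeclType, ElementSize, ElementByteOffset, and ElementInBounds are
    hypothetical and are not defined by this extension.

```c
#include <stddef.h>

/* Hypothetical names, for illustration only; not part of the extension. */
typedef enum { DECL_CBUFFER, DECL_BUFFER, DECL_BUFFER4 } BufferDeclType;

/* Size of one element in basic machine units (bytes), assuming 32-bit words. */
size_t ElementSize(BufferDeclType type)
{
    switch (type) {
    case DECL_CBUFFER: return 1;   /* one basic machine unit */
    case DECL_BUFFER:  return 4;   /* one 32-bit word */
    case DECL_BUFFER4: return 16;  /* four 32-bit words */
    }
    return 0;                      /* unknown declaration type */
}

/* Byte offset of element <b> from the start of the buffer binding. */
size_t ElementByteOffset(BufferDeclType type, size_t b)
{
    return b * ElementSize(type);
}

/* Returns 1 if element <b> lies entirely inside a buffer of bufferSize bytes. */
int ElementInBounds(BufferDeclType type, size_t b, size_t bufferSize)
{
    size_t size = ElementSize(type);
    return size != 0 && bufferSize >= size && b <= (bufferSize - size) / size;
}
```

    With these definitions, index 12 corresponds to byte offset 12 for a
    CBUFFER variable, 48 for BUFFER, and 192 for BUFFER4. For CBUFFER the
    bounds check covers only the first byte addressed; an LDC fetch may read
    further bytes depending on its storage modifier.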


    Modify Section 2.X.4, Program Execution Environment

    (Add to the set of opcodes in Table X.13)

                  Modifiers
      Instruction F I C S H D  Out Inputs    Description
      ----------- - - - - - -  --- --------  --------------------------------
      LDC         X X X X - F  v   v         load from constant buffer


    Modify Section 2.X.4.1, Program Instruction Modifiers

    (Add to Table X.14, Instruction Modifiers, and to the corresponding
    description following the table)

      Modifier  Description
      --------  -----------------------------------------------
      F32       Access one 32-bit floating-point value
      F32X2     Access two 32-bit floating-point values
      F32X4     Access four 32-bit floating-point values
      S8        Access one 8-bit signed integer value
      S16       Access one 16-bit signed integer value
      S32       Access one 32-bit signed integer value
      S32X2     Access two 32-bit signed integer values
      S32X4     Access four 32-bit signed integer values
      U8        Access one 8-bit unsigned integer value
      U16       Access one 16-bit unsigned integer value
      U32       Access one 32-bit unsigned integer value
      U32X2     Access two 32-bit unsigned integer values
      U32X4     Access four 32-bit unsigned integer values

    For memory load operations, the "F32", "F32X2", "F32X4", "S8", "S16",
    "S32", "S32X2", "S32X4", "U8", "U16", "U32", "U32X2", and "U32X4" storage
    modifiers control how data are loaded from memory. Storage modifiers are
    supported by the LDC and LOAD instructions and are covered in more detail
    in the descriptions of these instructions. These instructions must
    specify exactly one of these modifiers, and may not specify any of the
    base data type modifiers (F,U,S) described above. The base data type of
    the result vector of a LOAD or LDC instruction is trivially derived from
    the storage modifier.
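
    One way to picture how the base data type, component size, and component
    count follow from a storage modifier is a small lookup table. The C
    sketch below is an informal illustration only; StorageModifier,
    StorageInfo, DecodeStorageModifier, and FetchSizeInBytes are hypothetical
    names that do not appear in this extension.

```c
/* Hypothetical names, for illustration only; not part of the extension. */
typedef enum {
    F32, F32X2, F32X4,
    S8, S16, S32, S32X2, S32X4,
    U8, U16, U32, U32X2, U32X4
} StorageModifier;

typedef struct {
    char base;   /* 'F' = float, 'S' = signed int, 'U' = unsigned int */
    int  bits;   /* component size: 8, 16, or 32 */
    int  count;  /* component count: 1, 2, or 4 */
} StorageInfo;

/* Split a storage modifier into base type, component size, and count. */
StorageInfo DecodeStorageModifier(StorageModifier m)
{
    static const StorageInfo table[] = {
        [F32] = { 'F', 32, 1 }, [F32X2] = { 'F', 32, 2 }, [F32X4] = { 'F', 32, 4 },
        [S8]  = { 'S',  8, 1 }, [S16]   = { 'S', 16, 1 },
        [S32] = { 'S', 32, 1 }, [S32X2] = { 'S', 32, 2 }, [S32X4] = { 'S', 32, 4 },
        [U8]  = { 'U',  8, 1 }, [U16]   = { 'U', 16, 1 },
        [U32] = { 'U', 32, 1 }, [U32X2] = { 'U', 32, 2 }, [U32X4] = { 'U', 32, 4 },
    };
    return table[m];
}

/* Total number of bytes read from memory for one load. */
int FetchSizeInBytes(StorageModifier m)
{
    StorageInfo info = DecodeStorageModifier(m);
    return (info.bits / 8) * info.count;
}
```

    Under this sketch, the thirteen modifiers yield fetch sizes of 1, 2, 4,
    8, or 16 bytes, matching the fetch sizes named in the Overview.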


    Add New Section 2.X.4.5, Program Memory Access

    Programs may load from buffer object memory via the LDC (load constant)
    and LOAD (global load) instructions.

    Load instructions read 8, 16, 32, 64, or 128 bits of data from a source
    address to produce a four-component vector, according to the storage
    modifier specified with the instruction. The storage modifier has three
    parts:

      - a base data type, "F", "S", or "U", specifying that the instruction
        fetches floating-point, signed integer, or unsigned integer values,
        respectively;

      - a component size, specifying that the components fetched by the
        instruction have 8, 16, or 32 bits; and

      - an optional component count, where "X2" and "X4" indicate that two or
        four components be fetched, and no count indicates a single component
        fetch.

    When the storage modifier specifies that fewer than four components should
    be fetched, remaining components are filled with zeroes. When performing
    a global load (LOAD), the GPU address is specified as an instruction
    operand. When performing a constant buffer load (LDC), the GPU address is
    derived by adding the base address of the bound buffer object to an offset
    specified as an instruction operand.
    Given a GPU address <address> and a storage modifier <modifier>, the
    memory load can be described by the following code:

      result_t_vec BufferMemoryLoad(char *address, OpModifier modifier)
      {
        result_t_vec result = { 0, 0, 0, 0 };
        switch (modifier) {
        case F32:
            result.x = ((float32_t *)address)[0];
            break;
        case F32X2:
            result.x = ((float32_t *)address)[0];
            result.y = ((float32_t *)address)[1];
            break;
        case F32X4:
            result.x = ((float32_t *)address)[0];
            result.y = ((float32_t *)address)[1];
            result.z = ((float32_t *)address)[2];
            result.w = ((float32_t *)address)[3];
            break;
        case S8:
            result.x = ((int8_t *)address)[0];
            break;
        case S16:
            result.x = ((int16_t *)address)[0];
            break;
        case S32:
            result.x = ((int32_t *)address)[0];
            break;
        case S32X2:
            result.x = ((int32_t *)address)[0];
            result.y = ((int32_t *)address)[1];
            break;
        case S32X4:
            result.x = ((int32_t *)address)[0];
            result.y = ((int32_t *)address)[1];
            result.z = ((int32_t *)address)[2];
            result.w = ((int32_t *)address)[3];
            break;
        case U8:
            result.x = ((uint8_t *)address)[0];
            break;
        case U16:
            result.x = ((uint16_t *)address)[0];
            break;
        case U32:
            result.x = ((uint32_t *)address)[0];
            break;
        case U32X2:
            result.x = ((uint32_t *)address)[0];
            result.y = ((uint32_t *)address)[1];
            break;
        case U32X4:
            result.x = ((uint32_t *)address)[0];
            result.y = ((uint32_t *)address)[1];
            result.z = ((uint32_t *)address)[2];
            result.w = ((uint32_t *)address)[3];
            break;
        }
        return result;
      }

    The offset used for the constant buffer loads must be aligned to the fetch
    size corresponding to the storage opcode modifier. For S8 and U8, the
    offset has no alignment requirements. For S16 and U16, the offset must be
    a multiple of two basic machine units.
    For F32, S32, and U32, the offset must be a multiple of four. For F32X2,
    S32X2, and U32X2, the offset must be a multiple of eight. For F32X4,
    S32X4, and U32X4, the offset must be a multiple of sixteen. If an offset
    is not correctly aligned, the values returned by a constant buffer load
    will be undefined.


    Modify Section 2.X.6, Program Options

    + Extended Parameter Buffer Object Support (NV_parameter_buffer_object2)

    If a program specifies the "NV_parameter_buffer_object2" option, it may
    use the CBUFFER statement to declare program parameter buffer variables
    and the LDC instruction to load data from parameter buffer variables using
    arbitrary offsets.


    Modify Section 2.X.8, Program Instruction Set

    Section 2.X.8.Z, LDC: Load from Constant Buffer

    The LDC instruction loads a vector operand from a buffer object to yield a
    result vector. The operand used for the LDC instruction must correspond
    to a parameter buffer variable declared using the "CBUFFER" statement; a
    program will fail to load if any other type of operand is used in an LDC
    instruction.

      result = BufferMemoryLoad(&op0, storageModifier);

    A base operand vector is fetched from memory as described in Section
    2.X.4.5, with the GPU address derived from the binding corresponding to
    the operand. A final operand vector is derived from the base operand
    vector by applying swizzle, negation, and absolute value operand modifiers
    as described in Section 2.X.4.2.

    The amount of memory in any given buffer object binding accessible by the
    LDC instruction may be limited. If any component fetched by the LDC
    instruction extends 4*<n> or more basic machine units from the beginning
    of the buffer object binding, where <n> is the implementation-dependent
    constant MAX_PROGRAM_PARAMETER_BUFFER_SIZE_NV, the value fetched for that
    component will be undefined.
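
    The alignment rule from Section 2.X.4.5 and the size limit above can be
    sketched as two small predicates. This is an informal illustration, not
    part of the extension: LdcOffsetIsAligned and LdcFetchWithinLimit are
    hypothetical names, fetchSize is the total fetch size in bytes (1, 2, 4,
    8, or 16), and maxSizeWords stands for the value queried for
    MAX_PROGRAM_PARAMETER_BUFFER_SIZE_NV.

```c
#include <stdint.h>

/* Hypothetical helpers, for illustration only; not part of the extension. */

/* Returns 1 if a byte offset satisfies the LDC alignment requirement,
 * i.e. it is a multiple of the total fetch size in bytes. */
int LdcOffsetIsAligned(uint32_t offset, uint32_t fetchSize)
{
    return fetchSize != 0 && offset % fetchSize == 0;
}

/* Returns 1 if every byte of the fetch lies within the first
 * 4 * maxSizeWords bytes of the buffer binding. */
int LdcFetchWithinLimit(uint32_t offset, uint32_t fetchSize,
                        uint32_t maxSizeWords)
{
    uint64_t limit = 4ull * maxSizeWords;          /* limit in bytes */
    return (uint64_t)offset + fetchSize <= limit;  /* 64-bit to avoid overflow */
}
```

    For example, with a limit of 16384 words (65536 bytes), a 16-byte fetch
    at offset 65520 stays within the limit, while a 1-byte fetch at offset
    65536 does not. A fetch failing either predicate returns undefined
    values rather than generating an error.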

    LDC supports no base data type modifiers, but requires exactly one storage
    modifier. The base data types of the operand and result vectors are
    derived from the storage modifier.


Additions to Chapter 3 of the OpenGL 3.0 Specification (Rasterization)

    None.

Additions to Chapter 4 of the OpenGL 3.0 Specification (Per-Fragment
Operations and the Frame Buffer)

    None.

Additions to Chapter 5 of the OpenGL 3.0 Specification (Special Functions)

    None.

Additions to Chapter 6 of the OpenGL 3.0 Specification (State and
State Requests)

    None.

Additions to Appendix A of the OpenGL 3.0 Specification (Invariance)

    None.

Additions to the AGL/GLX/WGL Specifications

    None.

Errors

    No new errors.

Dependencies on NV_shader_buffer_load

    If NV_shader_buffer_load (or equivalent functionality) is not supported,
    references to the "LOAD" opcode in the description of the opcode modifiers
    for "LDC" should be removed.

New State

    None.

New Implementation Dependent State

    None.

Issues

    (1) What sort of alignment requirements, if any, should be imposed on the
        operand provided to the LDC instruction?

      RESOLVED: The offset of the operand must be aligned according to the
      size of the fetch. For 1-, 2-, and 4-component fetches, the offset must
      be a multiple of <N>, 2*<N>, and 4*<N>, respectively, where <N> is the
      size in bytes of the components being fetched.

    (2) NV_parameter_buffer_object provides an implementation-dependent limit
        on the portion of a buffer object that may be fetched via BUFFER and
        BUFFER4 variables. Should the same limits apply to the LDC
        instruction?

      RESOLVED: Yes. On currently shipping NVIDIA GPUs, the maximum program
      parameter buffer size is 16384 32-bit words, or 64KB.
      Buffers larger than 64KB may be used, but any fetches accessing memory
      beyond the first 64KB of a buffer binding will return undefined values.

    (3) Should we support fetches of 3-component vectors? If so, what should
        be the minimum alignment for the specified offset?

      RESOLVED: No, we'll leave 3-component vectors out of this extension.
      This limitation can be worked around either by doing three separate
      single-component fetches or a four-component fetch with an appropriate
      write mask. The former approach supports indexing in a tightly packed
      array of 3-component vectors; the latter would require that array
      elements be padded to four components.

    (4) Should we support fetches of 8- and 16-bit components?

      RESOLVED: Yes, we will support fetches of 8- and 16-bit signed and
      unsigned integers.

      Fetches of vectors of 8- and 16-bit integers are not supported but may
      be emulated by performing shift/mask operations on the results of 32-bit
      fetches.

      Fetches of 16-bit floating-point values, or floating-point vectors
      thereof, are not supported. A single fp16 fetch may be emulated using a
      16-bit unsigned integer fetch and the UP2H instruction to convert the 16
      LSBs of the fetch to a floating-point value. The encoding of 16-bit
      floating-point values is described in section 2.1.2 of the OpenGL 3.0
      specification.

    (5) Should we support fetches of 64-bit components?

      RESOLVED: No; the instruction set provided by NV_gpu_program4 does not
      support 64-bit components anywhere. If future instructions support
      64-bit components, this restriction should be removed.

    (6) How should the operands of the LDC instruction be specified?

      RESOLVED: We will create a new type of buffer variable ("CBUFFER"),
      which defines an array of bytes to be fetched from.
      The type of fetch to perform is specified by a storage modifier (as in
      NV_shader_buffer_load). An offset relative to the buffer binding (in
      bytes) may be specified using normal array indexing syntax, and an index
      computed at run-time is supported.

      Some examples:

        CBUFFER buffer[] = { program.buffer[0] };
        TEMP i;
        MOV.S i, 32;                      # computed offset of 32B
        LDC.F32 result, buffer[12];       # (x,0,0,0) from bytes 12..15
        LDC.F32X4 result, buffer[16];     # (x,y,z,w) from bytes 16..31
        LDC.U8 result, buffer[i.x+3];     # (x,0,0,0) from byte 35
        LDC.S32 result, buffer[i.x+12];   # (x,0,0,0) from bytes 44..47
        LDC.U32X2 result, buffer[i.x+8];  # (x,y,0,0) from bytes 40..47
        LDC.S16 result, buffer[i.x+2];    # (x,0,0,0) from bytes 34..35

      We chose to provide the new buffer variable type (CBUFFER) rather than
      reusing BUFFER or BUFFER4. For CBUFFER variables, "buffer[12]"
      unambiguously specifies a 12-byte offset. For BUFFER or BUFFER4
      variables, an operand of "buffer[12]" already has an existing meaning,
      implying an offset of 12 words or vectors, which would be 48 or 192
      bytes, respectively. Because we want to be able to fetch 8- and 16-bit
      units, having an offset multiplied by four doesn't make sense. We could
      have had LDC simply ignore the type of binding and always interpret an
      index as a byte offset, but chose the new declaration type to avoid
      confusion.

      We also considered an approach where the buffer and offset were
      specified in separate operands. That would be similar to texture
      instructions, where the coordinates and texture are specified
      separately. The first operand would have been interpreted as an
      unsigned scalar specifying a byte offset, the second operand would have
      specified a buffer variable binding, and a pointer would be obtained by
      adding the two operands.
      This would have looked something like:

        BUFFER buffer[] = { program.buffer[0] };
        LDC.S32X2 result, offset.x, buffer;

      We chose not to implement this approach mainly because this syntax would
      require specifying a new type of instruction; the syntax we adopted
      simply reuses existing vector operand and indexing mechanisms.
      Additionally, the syntax in this extension provides immediate offsets
      for "free", which the offset-buffer syntax would not support directly
      without additional new syntax. For example, to load a structure with a
      pair of two-component vectors using offset-buffer syntax, you would have
      to do something like:

        BUFFER buffer[] = { program.buffer[0] };
        TEMP offset;
        LDC.S32X2 result1, offset.x, buffer;
        ADD.U offset.x, offset.x, 8;      # bump offset to second vector
        LDC.S32X2 result2, offset.x, buffer;

    (7) How should the fetches in the LDC instruction interact with other
        operand modifiers (swizzle, absolute value, negation)? With result
        modifiers (condition codes, saturation)?

      RESOLVED: These features will be orthogonal. When any of these
      modifiers are specified, the base data type to which they apply comes
      from the storage modifier of the LDC instruction.

      The LDC instruction is defined to produce a "base operand vector" from a
      memory fetch. This isn't particularly different from normal operands,
      where a base operand vector is derived from the binding corresponding to
      the operand. In both cases, the components of this vector are swizzled
      and have optional absolute value and negation operations performed to
      produce a final vector operand, as is the case with other vector
      operands.

      If condition code operations or saturation are specified for the result
      vector, these operations are performed using the appropriate data types.

    (8) What happens if a non-zero base offset is specified for a CBUFFER
        variable?

      RESOLVED: A subset of the bytes in a buffer object can be specified
      using range syntax like the following:

        CBUFFER buffer[] = { program.buffer[0][16..31] };

      The sub-range need not start at the beginning of the buffer object; in
      the example above, it starts 16 bytes into the buffer. When accessing a
      parameter buffer variable corresponding to such a sub-range, an array
      index is relative to the base of the sub-range. So the offset of the
      sub-range is effectively added to the index used for the LDC operand:

        LDC.F32 result, buffer[12];       # (x,0,0,0) from bytes 28..31

    (9) What happens if a non-array CBUFFER variable is used?

      RESOLVED: A non-array variable may be used with LDC. However, array
      indexing isn't supported with non-array variables, so all LDC loads
      using that variable will fetch using the same base address.

        CBUFFER bufferElement = program.buffer[0][32];
        LDC.U8 result, bufferElement;     # (x,0,0,0) from byte 32
        LDC.S16 result, bufferElement;    # (x,0,0,0) from bytes 32..33
        LDC.F32 result, bufferElement;    # (x,0,0,0) from bytes 32..35
        LDC.F32X4 result, bufferElement;  # (x,y,z,w) from bytes 32..47

    (10) Should single-component fetches from LDC smear their results across
         all four components of the result vector, to allow packing multiple
         non-vectors into a single vector?

      RESOLVED: No. However, swizzle suffixes on the operand will provide
      this capability for free. For example, let's say you wanted to fetch
      four scalars from a buffer and pack the results into a single temporary
      vector. The swizzle syntax lets you do this by smearing the real
      component (always fetched in "x") into the other components:

        CBUFFER buffer[] = { program.buffer[0] };
        LDC.F32 temp.x, buffer[16];
        LDC.F32 temp.y, buffer[28].x;
        LDC.F32 temp.z, buffer[32].x;
        LDC.F32 temp.w, buffer[40].x;


Revision History

    Rev.    Date    Author    Changes
    ----  --------  --------  -----------------------------------------
     1              pbrown    Internal revisions.
     2    09/09/09  mjk       Assigned number