1Name 2 3 NV_gpu_program5 4 5Name Strings 6 7 GL_NV_gpu_program5 8 GL_NV_gpu_program_fp64 9 10Contact 11 12 Pat Brown, NVIDIA Corporation (pbrown 'at' nvidia.com) 13 14Status 15 16 Shipping. 17 18Version 19 20 Last Modified Date: 09/11/2014 21 NVIDIA Revision: 7 22 23Number 24 25 388 26 27Dependencies 28 29 OpenGL 2.0 is required. 30 31 This extension is written against the OpenGL 3.0 specification. 32 33 NV_gpu_program4 and NV_gpu_program4_1 are required. 34 35 NV_shader_buffer_load is required. 36 37 NV_shader_buffer_store is required. 38 39 This extension is written against and interacts with the NV_gpu_program4, 40 NV_vertex_program4, NV_geometry_program4, and NV_fragment_program4 41 specifications. 42 43 This extension interacts with NV_tessellation_program5. 44 45 This extension interacts with ARB_transform_feedback3. 46 47 This extension interacts trivially with NV_shader_buffer_load. 48 49 This extension interacts trivially with NV_shader_buffer_store. 50 51 This extension interacts trivially with NV_parameter_buffer_object2. 52 53 This extension interacts trivially with OpenGL 3.3, ARB_texture_swizzle, 54 and EXT_texture_swizzle. 55 56 This extension interacts trivially with ARB_blend_func_extended. 57 58 This extension interacts trivially with EXT_shader_image_load_store. 59 60 This extension interacts trivially with ARB_shader_subroutine. 61 62 If the 64-bit floating-point portion of this extension is not supported, 63 "GL_NV_gpu_program_fp64" will not be found in the extension string. 64 65Overview 66 67 This specification documents the common instruction set and basic 68 functionality provided by NVIDIA's 5th generation of assembly instruction 69 sets supporting programmable graphics pipeline stages. 70 71 The instruction set builds upon the basic framework provided by the 72 ARB_vertex_program and ARB_fragment_program extensions to expose 73 considerably more capable hardware. In addition to new capabilities for 74 vertex and fragment programs, this extension provides new functionality 75 for geometry programs as originally described in the NV_geometry_program4 76 specification, and serves as the basis for the new tessellation control 77 and evaluation programs described in the NV_tessellation_program5 78 extension. 79 80 Programs using the functionality provided by this extension should begin 81 with the program headers "!!NVvp5.0" (vertex programs), "!!NVtcp5.0" 82 (tessellation control programs), "!!NVtep5.0" (tessellation evaluation 83 programs), "!!NVgp5.0" (geometry programs), and "!!NVfp5.0" (fragment 84 programs). 
85 86 This extension provides a variety of new features, including: 87 88 * support for 64-bit integer operations; 89 90 * the ability to dynamically index into an array of texture units or 91 program parameter buffers; 92 93 * extending texel offset support to allow loading texel offsets from 94 regular integer operands computed at run-time, instead of requiring 95 that the offsets be constants encoded in texture instructions; 96 97 * extending TXG (texture gather) support to return the 2x2 footprint 98 from any component of the texture image instead of always returning 99 the first (x) component; 100 101 * extending TXG to support shadow comparisons in conjunction with a 102 depth texture, via the SHADOW* targets; 103 104 * further extending texture gather support to provide a new opcode 105 (TXGO) that applies a separate texel offset vector to each of the four 106 samples returned by the instruction; 107 108 * bit manipulation instructions, including ones to find the position of 109 the most or least significant set bit, bitfield insertion and 110 extraction, and bit reversal; 111 112 * a general data conversion instruction (CVT) supporting conversion 113 between any two data types supported by this extension; and 114 115 * new instructions to compute the composite of a set of boolean 116 conditions a group of shader threads. 117 118 This extension also provides some new capabilities for individual program 119 types, including: 120 121 * support for instanced geometry programs, where a geometry program may 122 be run multiple times for each primitive; 123 124 * support for emitting vertices in a geometry program where each vertex 125 emitted may be directed at a specified vertex stream and captured 126 using the ARB_transform_feedback3 extension; 127 128 * support for interpolating an attribute at a programmable offset 129 relative to the pixel center (IPAO), at a programmable sample number 130 (IPAS), or at the fragment's centroid location (IPAC) in a fragment 131 program; 132 133 * support for reading a mask of covered samples in a fragment program; 134 135 * support for reading a point sprite coordinate directly in a fragment 136 program, without overriding a texture coordinate; 137 138 * support for reading patch primitives and per-patch attributes 139 (introduced by ARB_tessellation_shader) in a geometry program; and 140 141 * support for multiple output vectors for a single color output in a 142 fragment program (as used by ARB_blend_func_extended). 143 144 This extension also provides optional support for 64-bit-per-component 145 variables and 64-bit floating-point arithmetic. These features are 146 supported if and only if "NV_gpu_program_fp64" is found in the extension 147 string. 148 149 This extension incorporates the memory access operations from the 150 NV_shader_buffer_load and NV_parameter_buffer_object2 extensions, 151 originally built as add-ons to NV_gpu_program4. 
It also provides the 152 following new capabilities: 153 154 * support for the features without requiring a separate OPTION keyword; 155 156 * support for indexing into an array of constant buffers using the LDC 157 opcode added by NV_parameter_buffer_object2; 158 159 * support for storing into buffer objects at a specified GPU address 160 using the STORE opcode, an allowing applications to create READ_WRITE 161 and WRITE_ONLY mappings when making a buffer object resident using the 162 API mechanisms in the NV_shader_buffer_store extension; 163 164 * storage instruction modifiers to allow loading and storing 64-bit 165 component values; 166 167 * support for atomic memory transactions using the ATOM opcode, where 168 the instruction atomically reads the memory pointed to by a pointer, 169 performs a specified computation, stores the results of that 170 computation, and returns the original value read; 171 172 * support for memory barrier transactions using the MEMBAR opcode, which 173 ensures that all memory stores issued prior to the opcode complete 174 prior to any subsequent memory transactions; and 175 176 * a fragment program option to specify that depth and stencil tests are 177 performed prior to fragment program execution. 178 179 Additionally, the assembly program languages supported by this extension 180 include support for reading, writing, and performing atomic memory 181 operations on texture image data using the opcodes and mechanisms 182 documented in the "Dependencies on NV_gpu_program5" section of the 183 EXT_shader_image_load_store extension. 184 185New Procedures and Functions 186 187 None. 188 189New Tokens 190 191 Accepted by the <pname> parameter of GetBooleanv, GetIntegerv, 192 GetFloatv, and GetDoublev: 193 194 MAX_GEOMETRY_PROGRAM_INVOCATIONS_NV 0x8E5A 195 MIN_FRAGMENT_INTERPOLATION_OFFSET_NV 0x8E5B 196 MAX_FRAGMENT_INTERPOLATION_OFFSET_NV 0x8E5C 197 FRAGMENT_PROGRAM_INTERPOLATION_OFFSET_BITS_NV 0x8E5D 198 MIN_PROGRAM_TEXTURE_GATHER_OFFSET_NV 0x8E5E 199 MAX_PROGRAM_TEXTURE_GATHER_OFFSET_NV 0x8E5F 200 201 202Additions to Chapter 2 of the OpenGL 3.0 Specification (OpenGL Operation) 203 204 Modify Section 2.X.2 of NV_fragment_program4, Program Grammar 205 206 (modify the section, updating the program header string for the extended 207 instruction set) 208 209 Fragment programs are required to begin with the header string 210 "!!NVfp5.0". This header string identifies the subsequent program body as 211 being a fragment program and indicates that it should be parsed according 212 to the base NV_gpu_program5 grammar plus the additions below. Program 213 string parsing begins with the character immediately following the header 214 string. 215 216 (add/change the following rules to the NV_fragment_program4 and 217 NV_gpu_program5 base grammars) 218 219 <SpecialInstruction> ::= "IPAC" <opModifiers> <instResult> "," 220 <instOperandV> 221 | "IPAO" <opModifiers> <instResult> "," 222 <instOperandV> "," <instOperandV> 223 | "IPAS" <opModifiers> <instResult> "," 224 <instOperandV> "," <instOperandS> 225 226 <interpModifier> ::= "SAMPLE" 227 228 <attribBasic> ::= <fragPrefix> "sampleid" 229 | <fragPrefix> "samplemask" 230 | <fragPrefix> "pointcoord" 231 232 <resultBasic> ::= <resPrefix> "color" <resultOptColorNum> 233 <resultOptColorType> 234 | <resPrefix> "samplemask" 235 236 <resultOptColorType> ::= "" 237 | "." 
<colorType> 238 239 240 Modify Section 2.X.2 of NV_geometry_program4, Program Grammar 241 242 (modify the section, updating the program header string for the extended 243 instruction set) 244 245 Geometry programs are required to begin with the header string 246 "!!NVgp5.0". This header string identifies the subsequent program body as 247 being a geometry program and indicates that it should be parsed according 248 to the base NV_gpu_program5 grammar plus the additions below. Program 249 string parsing begins with the character immediately following the header 250 string. 251 252 (add the following rules to the NV_geometry_program4 and NV_gpu_program5 253 base grammars) 254 255 <declaration> ::= "INVOCATIONS" <int> 256 257 <declPrimInType> ::= "PATCHES" 258 259 <SpecialInstruction> ::= "EMITS" <instOperandS> 260 261 <attribBasic> ::= <primPrefix> "invocation" 262 | <primPrefix> "vertexcount" 263 | <attribTessOuter> <optArrayMemAbs> 264 | <attribTessInner> <optArrayMemAbs> 265 | <attribPatchGeneric> <optArrayMemAbs> 266 267 <attribMulti> ::= <attribTessOuter> <arrayRange> 268 | <attribTessInner> <arrayRange> 269 | <attribPatchGeneric> <arrayRange> 270 271 <attribTessOuter> ::= <primPrefix> "." "tessouter" 272 273 <attribTessInner> ::= <primPrefix> "." "tessinner" 274 275 <attribPatchGeneric> ::= <primPrefix> "." "patch" "." "attrib" 276 277 278 Modify Section 2.X.2 of NV_vertex_program4, Program Grammar 279 280 (modify the section, updating the program header string for the extended 281 instruction set) 282 283 Vertex programs are required to begin with the header string "!!NVvp5.0". 284 This header string identifies the subsequent program body as being a 285 vertex program and indicates that it should be parsed according to the 286 base NV_gpu_program5 grammar plus the additions below. Program string 287 parsing begins with the character immediately following the header string. 
288 289 290 Modify Section 2.X.2 of NV_gpu_program4, Program Grammar 291 292 (add the following grammar rules to the NV_gpu_program4 base grammar; 293 additional grammar rules usable for assembly programs are documented in 294 the EXT_shader_image_load_store and ARB_shader_subroutine specifications) 295 296 <instruction> ::= <MemInstruction> 297 298 <MemInstruction> ::= <ATOMop_instruction> 299 | <STOREop_instruction> 300 | <MEMBARop_instruction> 301 302 <VECTORop> ::= "BFR" 303 | "BTC" 304 | "BTFL" 305 | "BTFM" 306 | "PK64" 307 | "LDC" 308 | "CVT" 309 | "TGALL" 310 | "TGANY" 311 | "TGEQ" 312 | "UP64" 313 314 <SCALARop> ::= "LOAD" 315 316 <BINop> ::= "BFE" 317 318 <TRIop> ::= "BFI" 319 320 <TEXop_instruction> ::= <TEXop> <opModifiers> <instResult> "," 321 <instOperandV> "," <instOperandV> "," 322 <texAccess> 323 324 <TEXop> ::= "TXG" 325 | "LOD" 326 327 <TXDop> ::= "TXGO" 328 329 <ATOMop_instruction> ::= <ATOMop> <opModifiers> <instResult> "," 330 <instOperandV> "," <instOperandS> 331 332 <ATOMop> ::= "ATOM" 333 334 <STOREop_instruction> ::= <STOREop> <opModifiers> <instOperandV> "," 335 <instOperandS> 336 337 <STOREop> ::= "STORE" 338 339 <MEMBARop_instruction> ::= <MEMBARop> <opModifiers> 340 341 <MEMBARop> ::= "MEMBAR" 342 343 <opModifier> ::= "F16" 344 | "F32" 345 | "F64" 346 | "F32X2" 347 | "F32X4" 348 | "F64X2" 349 | "F64X4" 350 | "S8" 351 | "S16" 352 | "S32" 353 | "S32X2" 354 | "S32X4" 355 | "S64" 356 | "S64X2" 357 | "S64X4" 358 | "U8" 359 | "U16" 360 | "U32" 361 | "U32X2" 362 | "U32X4" 363 | "U64" 364 | "U64X2" 365 | "U64X4" 366 | "ADD" 367 | "MIN" 368 | "MAX" 369 | "IWRAP" 370 | "DWRAP" 371 | "AND" 372 | "OR" 373 | "XOR" 374 | "EXCH" 375 | "CSWAP" 376 | "COH" 377 | "ROUND" 378 | "CEIL" 379 | "FLR" 380 | "TRUNC" 381 | "PREC" 382 | "VOL" 383 384 <texAccess> ::= <textureUseS> "," <texTarget> <optTexOffset> 385 | <textureUseV> "," <texTarget> <optTexOffset> 386 387 <texTarget> ::= "ARRAYCUBE" 388 | "SHADOWARRAYCUBE" 389 390 <optTexOffset> ::= /* empty */ 391 | <texOffset> 392 393 <texOffset> ::= "offset" "(" <instOperandV> ")" 394 395 <namingStatement> ::= <TEXTURE_statement> 396 397 <BUFFER_statement> ::= <bufferDeclType> <establishName> 398 <optArraySize> <optArraySize> "=" 399 <bufferMultInit> 400 401 <bufferDeclType> ::= "CBUFFER" 402 403 <TEXTURE_statement> ::= "TEXTURE" <establishName> <texSingleInit> 404 | "TEXTURE" <establishName> <optArraySize> 405 <texMultipleInit> 406 407 <texSingleInit> ::= "=" <textureUseDS> 408 409 <texMultipleInit> ::= "=" "{" <texItemList> "}" 410 411 <texItemList> ::= <textureUseDM> 412 | <textureUseDM> "," <texItemList> 413 414 <bufferBinding> ::= "program" "." "buffer" <arrayRange> 415 416 <textureUseS> ::= <textureUseV> <texImageUnitComp> 417 418 <textureUseV> ::= <texImageUnit> 419 | <texVarName> <optArrayMem> 420 421 <textureUseDS> ::= "texture" <arrayMemAbs> 422 423 <textureUseDM> ::= <textureUseDS> 424 | "texture" <arrayRange> 425 426 <texImageUnitComp> ::= <scalarSuffix> 427 428 429 Modify Section 2.X.3.1, Program Variable Types 430 431 (IGNORE if GL_NV_gpu_program_fp64 is not found in the extension string. 432 Otherwise modify storage size modifiers to guarantee that "LONG" 433 variables are at least 64 bits in size.) 434 435 Explicitly declared variables may optionally have one storage size 436 modifier. Variables decared as "SHORT" will be represented using at least 437 16 bits per component. "SHORT" floating-point values will have at least 5 438 bits of exponent and 10 bits of mantissa. 
Variables declared as "LONG" 439 will be represented with at least 64 bits per component. "LONG" 440 floating-point values will have at least 11 bits of exponent and 52 bits 441 of mantissa. If no size modifier is provided, the GL will automatically 442 select component sizes. Implementations are not required to support more 443 than one component size, so "SHORT", "LONG", and the default could all 444 refer to the same component size. The "LONG" modifier is supported only 445 for declarations of temporary variables ("TEMP"), and attribute variables 446 ("ATTRIB") in vertex programs. The "SHORT" modifier is supported only 447 for declarations of temporary variables and result variables ("OUTPUT"). 448 449 450 Modify Section 2.X.3.2 of the NV_fragment_program4 specification, Program 451 Attribute Variables. 452 453 (Add a table entry and relevant text describing the fragment program 454 input sample mask variable.) 455 456 Fragment Attribute Binding Components Underlying State 457 -------------------------- ---------- ---------------------------- 458 fragment.samplemask (m,-,-,-) fragment coverage mask 459 fragment.pointcoord (s,t,-,-) fragment point sprite coordinate 460 461 If a fragment attribute binding matches "fragment.samplemask", the "x" 462 component is filled with a coverage mask indicating the set of samples 463 covered by this fragment. The coverage mask is a bitfield, where bit <n> 464 is one if the sample number <n> is covered and zero otherwise. If 465 multisample buffers are not available (SAMPLE_BUFFERS is zero), bit zero 466 indicates if the center of the pixel corresponding to the fragment is 467 covered. 468 469 If a fragment attribute binding matches "fragment.pointcoord", the "x" and 470 "y" components are filled with the s and t point sprite coordinates 471 (section 3.3.1), respectively. The "z" and "w" components are undefined. 472 If the fragment is generated by any primitive other than a point, or if 473 point sprites are disabled, all four components of the binding are 474 undefined. 475 476 Modify Section 2.X.3.2 of the NV_geometry_program4 specification, Program 477 Attribute Variables. 478 479 (Add a table entry and relevant text describing the geometry program 480 invocation attribute and per-patch attributes.) 481 482 Geometry Vertex Binding Components Description 483 ----------------------------- ---------- ---------------------------- 484 ... 485 primitive.invocation (id,-,-,-) geometry program invocation 486 primitive.tessouter[n] (x,-,-,-) outer tess. level n 487 primitive.tessinner[n] (x,-,-,-) inner tess. level n 488 primitive.patch.attrib[n] (x,y,z,w) generic patch attribute n 489 primitive.tessouter[n..o] (x,-,-,-) outer tess. levels n to o 490 primitive.tessinner[n..o] (x,-,-,-) inner tess. levels n to o 491 primitive.patch.attrib[n..o] (x,y,z,w) generic patch attrib n to o 492 primitive.vertexcount (c,-,-,-) vertices in primitive 493 494 ... 495 496 If a geometry attribute binding matches "primitive.invocation", the "x" 497 component is filled with an integer giving the number of previous 498 invocations of the geometry program on the primitive being processed. If 499 the geometry program is invoked only once per primitive (default), this 500 component will always be zero. If the program is invoked multiple times 501 (via the INVOCATIONS declaration), the component will be zero on the first 502 invocation, one on the second, and so forth. The "y", "z", and "w" 503 components of the variable are always undefined. 
504 505 If an attribute binding matches "primitive.tessouter[n]", the "x" 506 component is filled with the per-patch outer tessellation level numbered 507 <n> of the input patch. <n> must be less than four. The "y", "z", and 508 "w" components are always undefined. A program will fail to load if this 509 attribute binding is used and the input primitive type is not PATCHES. 510 511 If an attribute binding matches "primitive.tessinner[n]", the "x" 512 component is filled with the per-patch inner tessellation level numbered 513 <n> of the input patch. <n> must be less than two. The "y", "z", and "w" 514 components are always undefined. A program will fail to load if this 515 attribute binding is used and the input primitive type is not PATCHES. 516 517 If an attribute binding matches "primitive.patch.attrib[n]", the "x", "y", 518 "z", and "w" components are filled with the corresponding components of 519 the per-patch generic attribute numbered <n> of the input patch. A 520 program will fail to load if this attribute binding is used and the input 521 primitive type is not PATCHES. 522 523 If an attribute binding matches "primitive.tessouter[n..o]", 524 "primitive.tessinner[n..o]", or "primitive.patch.attrib[n..o]", a sequence 525 of 1+<o>-<n> outer tessellation level, inner tessellation level, or 526 per-patch generic attribute bindings is created. For per-patch generic 527 attribute bindings, it is as though the sequence 528 "primitive.patch.attrib[n], primitive.patch.attrib[n+1], ... 529 primitive.patch.attrib[o]" were specfied. These bindings are available 530 only in explicit declarations of array variables. A program will fail to 531 load if <n> is greater than <o> or the input primitive type is not 532 PATCHES. 533 534 If a geometry attribute binding matches "primitive.vertexcount", the "x" 535 component is filled with the number of vertices in the input primitive 536 being processed. The "y", "z", and "w" components of the variable are 537 always undefined. 538 539 540 Modify Section 2.X.3.5, Program Results 541 542 (modify Table X.X) 543 544 Binding Components Description 545 ----------------------------- ---------- ---------------------------- 546 result.color[n].primary (r,g,b,a) primary color n (SRC_COLOR) 547 result.color[n].secondary (r,g,b,a) secondary color n (SRC1_COLOR) 548 549 Table X.X: Fragment Result Variable Bindings. Components labeled "*" 550 are unused. "[n]" is optional -- color <n> is used if specified; color 551 0 is used otherwise. 552 553 (add after third paragraph) 554 555 If a result variable binding matches "result.color[n].primary" or 556 "result.color[n].secondary" and the ARB_blend_func_extended option is 557 specified, updates to the "x", "y", "z", and "w" components of these color 558 result variables modify the "r", "g", "b", and "a" components of the 559 SRC_COLOR and SRC1_COLOR color outputs, respectively, for the fragment 560 output color numbered <n>. If the ARB_blend_func_extended program option 561 is not specified, the "result.color[n].primary" and 562 "result.color[n].secondary" bindings are unavailable. 563 564 565 Modify Section 2.X.3.6, Program Parameter Buffers 566 567 (modify the description of parameter buffer arrays to require that all 568 bindings in an array declaration must use the same single buffer *or* 569 buffer range) 570 571 ... Program parameter buffer variables may be declared as arrays, but all 572 bindings assigned to the array must use the same binding point or binding 573 point range, and must increase consecutively. 
574 575 (add to the end of the section) 576 577 In explicit variable declarations, the bindings in Table X.12.1 of the 578 form "program.buffer[a..b]" may also be used, and indicate the variable 579 spans multiple buffer binding points. Such variables must be accessed as 580 an arrays, with the first index specifying an offset into the range of 581 buffer object binding points. A buffer index of zero identifies binding 582 point <a>; an index of <b>-<a>-1 identifies binding point <b>. If such a 583 variable is declared as an array, a second index must be provided to 584 identify the individual array element. A program will fail to compile if 585 such bindings are used when <a> or <b> is negative or greater than or 586 equal to the number of buffer binding points supported for the program 587 type, or if <a> is greater than <b>. The bindings in Table X.12.1 may not 588 be used in implicit variable declarations. 589 590 Binding Components Underlying State 591 ----------------------------- ---------- ----------------------------- 592 program.buffer[a..b][c] (x,x,x,x) program parameter buffers a 593 through b, element c 594 program.buffer[a..b][c..d] (x,x,x,x) program parameter buffers a 595 through b, elements b 596 through c 597 program.buffer[a..b] (x,x,x,x) program parameter buffers a 598 through b, all elements 599 600 Table X.12.1: Program Parameter Buffer Array Bindings. <a> and <b> 601 indicate buffer numbers, <c> and <d> indicate individual elements. 602 603 When bindings beginning with "program.buffer[a..b]" are used in a variable 604 declaration, they behave identically to corresponding beginning with 605 "program.buffer[a]", except that the variable is filled with a separate 606 set of values for each buffer binding point from <a> to <b> inclusive. 607 608 (add new section after Section 2.X.3.7, Program Condition Code Registers 609 and renumber subsequent sections accordingly) 610 611 Section 2.X.3.8, Program Texture Variables 612 613 Program texture variables are used as constants during program execution 614 and refer the texture objects bound to to one or more texture image units. 615 All texture variables have associated bindings and are read-only during 616 program execution. Texture variables retain their values across program 617 invocations, and the set of texture image units to which they refer is 618 constant. The texture object a variable refers to may be changed by 619 binding a new texture object to the appropriate target of the 620 corresponding texture image unit. Texture variables may only be used to 621 identify a texture object in texture instructions, and may not be used as 622 operands in any other instruction. Texture variables may be declared 623 explicitly via the <TEXTURE_statement> grammar rule, or implicitly by 624 using a texture image unit binding in an instruction. 625 626 Texture array variables may be declared as arrays, but the list of 627 texture image units assigned to the array must increase consectively. 628 629 Texture variables identify only a texture image unit; the corresponding 630 texture target (e.g., 1D, 2D, CUBE) and texture object is identified by 631 the <texTarget> grammar rule in instructions using the texture variable. 632 633 Binding Components Underlying State 634 --------------- ---------- ------------------------------------------ 635 texture[a] x texture object bound to image unit a 636 texture[a..b] x texture objects bound to image units a 637 through b 638 639 Table X.12.2: Texture Image Unit Bindings. 
<a> and <b> indicate 640 texture image unit numbers. 641 642 If a texture binding matches "texture[a]", the texture variable is filled 643 with a single integer referring to texture image unit <a>. 644 645 If a texture binding matches "texture[a..b]", the texture variable is 646 filled with an array of integers referring to texture image units <a> 647 through <b>, inclusive. A program will fail to compile if <a> or <b> is 648 negative or greater than or equal to the number of texture image units 649 supported, or if <a> is greater than <b>. 650 651 652 Modify Section 2.X.4, Program Execution Environment 653 654 (Update the instruction set table to include new columns to indicate the 655 first ISA supporting the instruction, and to indicate whether the 656 instruction supports 64-bit floating-point modifiers.) 657 658 Instr- Modifiers 659 uction V F I C S H D Out Inputs Description 660 ------- -- - - - - - - --- -------- -------------------------------- 661 ABS 40 6 6 X X X F v v absolute value 662 ADD 40 6 6 X X X F v v,v add 663 AND 40 - 6 X - - S v v,v bitwise and 664 ATOM 50 - - X - - - s v,su atomic memory transaction 665 BFE 50 - X X - - S v v,v bitfield extract 666 BFI 50 - X X - - S v v,v,v bitfield insert 667 BFR 50 - X X - - S v v bitfield reverse 668 BRK 40 - - - - - - - c break out of loop instruction 669 BTC 50 - X X - - S v v bit count 670 BTFL 50 - X X - - S v v find least significant bit 671 BTFM 50 - X X - - S v v find most significant bit 672 CAL 40 - - - - - - - c subroutine call 673 CEIL 40 6 6 X X X F v vf ceiling 674 CMP 40 6 6 X X X F v v,v,v compare 675 CONT 40 - - - - - - - c continue with next loop interation 676 COS 40 X - X X X F s s cosine with reduction to [-PI,PI] 677 CVT 50 - - X X - F v v general data type conversion 678 DDX 40 X - X X X F v v derivative relative to X (fp-only) 679 DDY 40 X - X X X F v v derivative relative to Y (fp-only) 680 DIV 40 6 6 X X X F v v,s divide vector components by scalar 681 DP2 40 X - X X X F s v,v 2-component dot product 682 DP2A 40 X - X X X F s v,v,v 2-comp. 
dot product w/scalar add 683 DP3 40 X - X X X F s v,v 3-component dot product 684 DP4 40 X - X X X F s v,v 4-component dot product 685 DPH 40 X - X X X F s v,v homogeneous dot product 686 DST 40 X - X X X F v v,v distance vector 687 ELSE 40 - - - - - - - - start if test else block 688 EMIT 40 - - - - - - - - emit vertex stream 0 (gp-only) 689 EMITS 50 - X - - - S - s emit vertex to stream (gp-only) 690 ENDIF 40 - - - - - - - - end if test block 691 ENDPRIM 40 - - - - - - - - end of primitive (gp-only) 692 ENDREP 40 - - - - - - - - end of repeat block 693 EX2 40 X - X X X F s s exponential base 2 694 FLR 40 6 6 X X X F v vf floor 695 FRC 40 6 - X X X F v v fraction 696 I2F 40 - 6 X - - S vf v integer to float 697 IF 40 - - - - - - - c start of if test block 698 IPAC 50 X - X X - F v v interpolate at centroid (fp-only) 699 IPAO 50 X - X X - F v v,v interpolate w/offset (fp-only) 700 IPAS 50 X - X X - F v v,su interpolate at sample (fp-only) 701 KIL 40 X X - - X F - vc kill fragment 702 LDC 40 - - X X - F v v load from constant buffer 703 LG2 40 X - X X X F s s logarithm base 2 704 LIT 40 X - X X X F v v compute lighting coefficients 705 LOAD 40 - - X X - F v su global load 706 LOD 41 X - X X - F v vf,t compute texture LOD 707 LRP 40 X - X X X F v v,v,v linear interpolation 708 MAD 40 6 6 X X X F v v,v,v multiply and add 709 MAX 40 6 6 X X X F v v,v maximum 710 MEMBAR 50 - - - - - - - - memory barrier 711 MIN 40 6 6 X X X F v v,v minimum 712 MOD 40 - 6 X - - S v v,s modulus vector components by scalar 713 MOV 40 6 6 X X X F v v move 714 MUL 40 6 6 X X X F v v,v multiply 715 NOT 40 - 6 X - - S v v bitwise not 716 NRM 40 X - X X X F v v normalize 3-component vector 717 OR 40 - 6 X - - S v v,v bitwise or 718 PK2H 40 X X - - - F s vf pack two 16-bit floats 719 PK2US 40 X X - - - F s vf pack two floats as unsigned 16-bit 720 PK4B 40 X X - - - F s vf pack four floats as signed 8-bit 721 PK4UB 40 X X - - - F s vf pack four floats as unsigned 8-bit 722 PK64 50 X X - - - F v v pack 4x32-bit vectors to 2x64 723 POW 40 X - X X X F s s,s exponentiate 724 RCC 40 X - X X X F s s reciprocal (clamped) 725 RCP 40 6 - X X X F s s reciprocal 726 REP 40 6 6 - - X F - v start of repeat block 727 RET 40 - - - - - - - c subroutine return 728 RFL 40 X - X X X F v v,v reflection vector 729 ROUND 40 6 6 X X X F v vf round to nearest integer 730 RSQ 40 6 - X X X F s s reciprocal square root 731 SAD 40 - 6 X - - S vu v,v,vu sum of absolute differences 732 SCS 40 X - X X X F v s sine/cosine without reduction 733 SEQ 40 6 6 X X X F v v,v set on equal 734 SFL 40 6 6 X X X F v v,v set on false 735 SGE 40 6 6 X X X F v v,v set on greater than or equal 736 SGT 40 6 6 X X X F v v,v set on greater than 737 SHL 40 - 6 X - - S v v,s shift left 738 SHR 40 - 6 X - - S v v,s shift right 739 SIN 40 X - X X X F s s sine with reduction to [-PI,PI] 740 SLE 40 6 6 X X X F v v,v set on less than or equal 741 SLT 40 6 6 X X X F v v,v set on less than 742 SNE 40 6 6 X X X F v v,v set on not equal 743 SSG 40 6 - X X X F v v set sign 744 STORE 50 - - - - - - - v,su global store 745 STR 40 6 6 X X X F v v,v set on true 746 SUB 40 6 6 X X X F v v,v subtract 747 SWZ 40 X - X X X F v v extended swizzle 748 TEX 40 X X X X - F v vf,t texture sample 749 TGALL 50 X X X X - F v v test all non-zero in thread group 750 TGANY 50 X X X X - F v v test any non-zero in thread group 751 TGEQ 50 X X X X - F v v test all equal in thread group 752 TRUNC 40 6 6 X X X F v vf truncate (round toward zero) 753 TXB 40 X X X X - F v vf,t texture sample with bias 754 
TXD 40 X X X X - F v vf,vf,vf,t texture sample w/partials 755 TXF 40 X X X X - F v vs,t texel fetch 756 TXFMS 40 X X X X - F v vs,t multisample texel fetch 757 TXG 41 X X X X - F v vf,t texture gather 758 TXGO 50 X X X X - F v vf,vs,vs,t texture gather w/per-texel offsets 759 TXL 40 X X X X - F v vf,t texture sample w/LOD 760 TXP 40 X X X X - F v vf,t texture sample w/projection 761 TXQ 40 - - - - - S vs vs,t texture info query 762 UP2H 40 X X X X - F vf s unpack two 16-bit floats 763 UP2US 40 X X X X - F vf s unpack two unsigned 16-bit integers 764 UP4B 40 X X X X - F vf s unpack four signed 8-bit integers 765 UP4UB 40 X X X X - F vf s unpack four unsigned 8-bit integers 766 UP64 50 X X X X - F v v unpack 2x64 vectors to 4x32 767 X2D 40 X - X X X F v v,v,v 2D coordinate transformation 768 XOR 40 - 6 X - - S v v,v exclusive or 769 XPD 40 X - X X X F v v,v cross product 770 771 Table X.13: Summary of NV_gpu_program5 instructions. 772 773 The "V" column indicates the first assembly language in the 774 NV_gpu_program4 family (if any) supporting the opcode. "41" and "50" 775 indicate NV_gpu_program4_1 and NV_gpu_program5, respectively. 776 777 The "Modifiers" columns specify the set of modifiers allowed for the 778 instruction: 779 780 F = floating-point data type modifiers 781 I = signed and unsigned integer data type modifiers 782 C = condition code update modifiers 783 S = clamping (saturation) modifiers 784 H = half-precision float data type suffix 785 D = default data type modifier (F, U, or S) 786 787 For the "F" and "I" columns, an "X" indicates support for both unsized 788 type modifiers and sized type modifiers with fewer than 64 bits. A "6" 789 indicates support for all modifiers, including 64-bit versions (when 790 supported). 791 792 The input and output columns describe the formats of the operands and 793 results of the instruction. 794 795 v: 4-component vector (data type is inherited from operation) 796 vf: 4-component vector (data type is always floating-point) 797 vs: 4-component vector (data type is always signed integer) 798 vu: 4-component vector (data type is always unsigned integer) 799 s: scalar (replicated if written to a vector destination; 800 data type is inherited from operation) 801 su: scalar (data type is always unsigned integer) 802 c: condition code test result (e.g., "EQ", "GT1.x") 803 vc: 4-component vector or condition code test 804 t: texture 805 806 Instructions labeled "fp-only" and "gp-only" are supported only for 807 fragment and geometry programs, respectively. 808 809 810 Modify Section 2.X.4.1, Program Instruction Modifiers 811 812 (Update the discussion of instruction precision modifiers. If 813 GL_NV_gpu_program_fp64 is not found in the extension string, the "F64" 814 instruction modifier described below is not supported.) 815 816 (add to Table X.14 of the NV_gpu_program4 specification.) 817 818 Modifier Description 819 -------- --------------------------------------------------- 820 F Floating-point operation 821 U Fixed-point operation, unsigned operands 822 S Fixed-point operation, signed operands 823 ... 
824 F32 Floating-point operation, 32-bit precision or 825 access one 32-bit floating-point value 826 F64 Floating-point operation, 64-bit precision or 827 access one 64-bit floating-point value 828 S32 Fixed-point operation, signed 32-bit operands or 829 access one 32-bit signed integer value 830 S64 Fixed-point operation, signed 64-bit operands or 831 access one 64-bit signed integer value 832 U32 Fixed-point operation, unsigned 32-bit operands or 833 access one 32-bit unsigned integer value 834 U64 Fixed-point operation, unsigned 64-bit operands or 835 access one 64-bit unsigned integer value 836 ... 837 F32X2 Access two 32-bit floating-point values 838 F32X4 Access four 32-bit floating-point values 839 F64X2 Access two 64-bit floating-point values 840 F64X4 Access four 64-bit floating-point values 841 S8 Access one 8-bit signed integer value 842 S16 Access one 16-bit signed integer value 843 S32X2 Access two 32-bit signed integer values 844 S32X4 Access four 32-bit signed integer values 845 S64 Access one 64-bit signed integer value 846 S64X2 Access two 64-bit signed integer values 847 S64X4 Access four 64-bit signed integer values 848 U8 Access one 8-bit unsigned integer value 849 U16 Access one 16-bit unsigned integer value 850 U32 Access one 32-bit unsigned integer value 851 U32X2 Access two 32-bit unsigned integer values 852 U32X4 Access four 32-bit unsigned integer values 853 U64 Access one 64-bit unsigned integer value 854 U64X2 Access two 64-bit unsigned integer values 855 U64X4 Access four 64-bit unsigned integer values 856 857 ADD Perform add operation for ATOM 858 MIN Perform minimum operation for ATOM 859 MAX Perform maximum operation for ATOM 860 IWRAP Perform wrapping increment for ATOM 861 DWRAP Perform wrapping decrment for ATOM 862 AND Perform logical AND operation for ATOM 863 OR Perform logical OR operation for ATOM 864 XOR Perform logical XOR operation for ATOM 865 EXCH Perform exchange operation for ATOM 866 CSWAP Perform compare-and-swap operation for ATOM 867 868 COH Make LOAD and STORE operations use coherent caching 869 VOL Make LOAD and STORE operations treat memory as volatile 870 871 PREC Instruction results should be precise 872 873 ROUND Inexact conversion results round to nearest value (even) 874 CEIL Inexact conversion results round to larger value 875 FLR Inexact conversion results round to smaller value 876 TRUNC Inexact conversion results round to value closest to zero 877 878 879 "F", "U", and "S" modifiers are base data type modifiers and specify that 880 the instruction should operate on floating-point, unsigned integer, or 881 signed integer values, respectively. For example, "ADD.F", "ADD.U", and 882 "ADD.S" specify component-wise addition of floating-point, unsigned 883 integer, or signed integer vectors, respectively. While these modifiers 884 specify a data type, they do not specify an exact precision at which the 885 operation is performed. Floating-point and fixed-point operations will 886 typically be carried out at 32-bit precision, unless otherwise described 887 in the instruction documentation or overridden by the precision modifiers. 888 If all operands are represented with less than 32-bit precision (e.g., 889 variables with the "SHORT" component size modifier), operations may be 890 carried out at a precision no less than the precision of the largest 891 operand used by the instruction. 
For some instructions, the data type of 892 some operands or the result are fixed; in these cases, the data type 893 modifier specifies the data type of the remaining values. 894 895 Operands represented with fewer bits than used to perform the instruction 896 will be promoted to a larger data type. Signed integer operands will be 897 sign-extended, where the most significant bits are filled with ones if the 898 operand is negative and zero otherwise. Unsigned integer operands will be 899 zero-extended, where the most significant bits are always filled with 900 zeroes. Operands represented with more bits than used to perform the 901 instruction will be converted to lower precision. Floating-point 902 overflows result in IEEE infinity encodings; integer overflows result in 903 the truncation of the most significant bits. 904 905 For arithmetic operations, the "F32", "F64", "U32", "U64", "S32", and 906 "S64" modifiers are precision-specific data type modifiers that specify 907 that floating-point, unsigned integer, or signed integer operations be 908 carried out with an internal precision of no less than 32 or 64 bits per 909 component, respectively. The "F64", "U64", and "S64" modifiers are 910 supported on only a subset of instructions, as documented in the 911 instruction table. The base data type of the instruction is trivially 912 derived from a precision-specific data type modifiers, and an instruction 913 may not specify both base and precision-specific data type modifiers. 914 915 ... 916 917 "SAT" and "SSAT" are clamping modifiers that generally specify that the 918 floating-point components of the instruction result should be clamped to 919 [0,1] or [-1,1], respectively, before updating the condition code and the 920 destination variable. If no clamping suffix is specified, unclamped 921 results will be used for condition code updates (if any) and destination 922 variable writes. Clamping modifiers are not supported on instructions 923 that do not produce floating-point results, with one exception. 924 925 ... 926 927 For load and store operations, the "F32", "F32X2", "F32X4", "F64", 928 "F64X2", "F64X4", "S8", "S16", "S32", "S32X2", "S32X4", "S64", "S64X2", 929 "S64X4", "U8", "U16", "U32", "U32X2", "U32X4", "U64", "U64X2", and "U64X4" 930 storage modifiers control how data are loaded from or stored to memory. 931 Storage modifiers are supported by the ATOM, LDC, LOAD, and STORE 932 instructions and are covered in more detail in the descriptions of these 933 instructions. These instructions must specify exactly one of these 934 modifiers, and may not specify any of the base data type modifiers (F,U,S) 935 described above. The base data types of the result vector of a load 936 instruction or the first operand of a store instruction are trivially 937 derived from the storage modifier. 938 939 For atomic memory operations performed by the ATOM instruction, the "ADD", 940 "MIN", "MAX", "IWRAP", "DWRAP", "AND", "OR", "XOR", "EXCH", and "CSWAP" 941 modifiers specify the operation to perform on the memory being accessed, 942 and are described in more detail in the description of this instruction. 943 944 For load and store operations, the "COH" modifier controls whether the 945 operation uses a coherent level of the cache hierarchy, as described in 946 Section 2.X.4.5. 947 948 For load and store operations, the "VOL" modifier controls whether the 949 operation treats the memory being read or written as volatile. 
950 Instructions modified with "VOL" will always read or write the underlying 951 memory, whether or not previous or subsequent loads and stores access the 952 same memory. 953 954 For arithmetic and logical operations, the "PREC" modifier controls 955 whether the instruction result should be treated as precise. For 956 instructions not qualified with ".PREC", the implementation may rearrange 957 the computations specified by the program instructions to execute more 958 efficiently, even if it may generate slightly different results in some 959 cases. For example, an implementation may combine a MUL instruction with 960 a dependent ADD instruction and generate code to execute a MAD 961 (multiply-add) instruction instead. The difference in rounding may 962 produce unacceptable artifacts for some algorithms. When ".PREC" is 963 specified, the instruction will be executed in a manner that always 964 generates the same result regardless of the program instructions that 965 precede or follow the instruction. Note that a ".PREC" modifier does not 966 affect the processing of any other instruction. For example, tagging an 967 instruction with ".PREC" does not mean that the instructions used to 968 generate the instruction's operands will be treated as precise unless 969 those instructions are also qualified with ".PREC". 970 971 For the CVT (data type conversion) instruction, the "F16", "F32", "F64", 972 "S8", "S16", "S32", "S64", "U8", "U16", "U32", and "U64" storage modifiers 973 specify the data type of the vector operand and the converted result. Two 974 storage modifiers must be provided, which specify the data type of the 975 result and the operand, respectively. 976 977 For the CVT (data type conversion) instruction, the "ROUND", "CEIL", 978 "FLR", and "TRUNC" modifiers specify how to round converted results that 979 are not directly representable using the data type of the result. 980 981 982 Modify Section 2.X.4.4, Program Texture Access 983 984 (Extend the language describing the operation of texel offsets to cover 985 the new capability to load texel offsets from a register. Otherwise, 986 this functionality is unchanged from previous extensions.) 987 988 <offset> is a 3-component signed integer vector, which can be specified 989 using constants embedded in the texture instruction according to the 990 <texOffsetImmed> grammar rule, or taken from a vector operand according to 991 the <texOffsetVar> grammar rule. The three components of the offset 992 vector are added to the computed <u>, <v>, and <w> texel locations prior 993 to sampling. When using a constant offset, one, two, or three components 994 may be specified in the instruction; if fewer than three are specified, 995 the remaining offset components are zero. If no offsets are specified, 996 all three components of the offset are treated as zero. A limited range 997 of offset values are supported; the minimum and maximum <texOffset> values 998 are implementation-dependent and given by MIN_PROGRAM_TEXEL_OFFSET_EXT and 999 MAX_PROGRAM_TEXEL_OFFSET_EXT, respectively. 
A program will fail to load: 1000 1001 * if the texture target specified in the instruction is 1D, ARRAY1D, 1002 SHADOW1D, or SHADOWARRAY1D, and the second or third component of a 1003 constant offset vector is non-zero; 1004 1005 * if the texture target specified in the instruction is 2D, RECT, 1006 ARRAY2D, SHADOW2D, SHADOWRECT, or SHADOWARRAY2D, and the third 1007 component of a constant offset vector is non-zero; 1008 1009 * if the texture target is CUBE, SHADOWCUBE, ARRAYCUBE, or 1010 SHADOWARRAYCUBE, and any component of a constant offset vector is 1011 non-zero -- texel offsets are not supported for cube map or buffer 1012 textures; 1013 1014 * if any component of the constant offset vector of a TXGO instruction 1015 is non-zero -- non-constant offsets are provided in separate operands; 1016 1017 * if any component of a constant offset vector is less than 1018 MIN_PROGRAM_TEXEL_OFFSET_EXT or greater than 1019 MAX_PROGRAM_TEXEL_OFFSET_EXT; 1020 1021 * if a TXD or TXGO instruction specifies a non-constant texel offset 1022 according to the <texOffsetVar> grammar rule; or 1023 1024 * if any instruction specifies a non-constant texel offset according 1025 to the <texOffsetVar> grammar rule and the texture target is CUBE, 1026 SHADOWCUBE, ARRAYCUBE, or SHADOWARRAYCUBE. 1027 1028 The implementation-dependent minimum and maximum texel offset values apply 1029 to texel offsets are taken from a vector operand, but out-of-bounds or 1030 invalid component values will not prevent program loading since the 1031 offsets may not be computed until the program is executed. Components of 1032 the vector operand not needed for the texture target are ignored. The W 1033 component of the offset vector is always ignored; the Z component of the 1034 offset vector is ignored unless the target is 3D; the Y component is 1035 ignored if the target is 1D, ARRAY1D, SHADOW1D, or SHADOWARRAY1D. If the 1036 value of any non-ignored component of the vector operand is outside 1037 implementation-dependent limits, the results of the texture lookup are 1038 undefined. For all instructions except TXGO, the limits are 1039 MIN_PROGRAM_TEXEL_OFFSET_EXT and MAX_PROGRAM_TEXEL_OFFSET_EXT. For the 1040 TXGO instruction, the limits are MIN_PROGRAM_TEXTURE_GATHER_OFFSET_NV and 1041 MAX_PROGRAM_TEXTURE_GATHER_OFFSET_NV. 1042 1043 1044 (Modify language describing how the check for using multiple targets on a 1045 single texture image unit works, to account for texture array variables 1046 where a single instruction may access one of multiple textures and the 1047 texture used is not known when the program is loaded.) 1048 1049 A program will fail to load if it attempts to sample from multiple texture 1050 targets (including the SHADOW pseudo-targets) on the same texture image 1051 unit. For example, a program containing any two the following 1052 instructions will fail to load: 1053 1054 TEX out, coord, texture[0], 1D; 1055 TEX out, coord, texture[0], 2D; 1056 TEX out, coord, texture[0], ARRAY2D; 1057 TEX out, coord, texture[0], SHADOW2D; 1058 TEX out, coord, texture[0], 3D; 1059 1060 For the purposes of this test, sampling using a texture variable declared 1061 as an array is treated as though all texture image units bound to the 1062 variable were accessed. 
A program containing the following 1063 instructions would fail to load: 1064 1065 TEXTURE textures[] = { texture[0..3] }; 1066 TEX out, coord, textures[2], 2D; # acts as if all textures are used 1067 TEX out, coord, texture[1], 3D; 1068 1069 (Add language describing texture gather component selection) 1070 1071 The TXG and TXGO instructions provide the ability to assemble a 1072 four-component vector by taking the value of a single component of a 1073 multi-component texture from each of four texels. The component selected 1074 is identified by the <texImageUnitComp> grammar rule. Component selection 1075 is not supported for any other instruction, and a program will fail to 1076 load if <texImageUnitComp> is matched for any texture instruction other 1077 than TXG or TXGO. 1078 1079 1080 Add New Section 2.X.4.5, Program Memory Access 1081 1082 Programs may load from or store to buffer object memory via the ATOM 1083 (atomic global memory operation), LDC (load constant), LOAD (global load), 1084 and STORE (global store) instructions. 1085 1086 Load instructions read 8, 16, 32, 64, 128, or 256 bits of data from a 1087 source address to produce a four-component vector, according to the 1088 storage modifier specified with the instruction. The storage modifier has 1089 three parts: 1090 1091 - a base data type, "F", "S", or "U", specifying that the instruction 1092 fetches floating-point, signed integer, or unsigned integer values, 1093 respectively; 1094 1095 - a component size, specifying that the components fetched by the 1096 instruction have 8, 16, 32, or 64 bits; and 1097 1098 - an optional component count, where "X2" and "X4" indicate that two or 1099 four components be fetched, and no count indicates a single component 1100 fetch. 1101 1102 When the storage modifier specifies that fewer than four components should 1103 be fetched, remaining components are filled with zeroes. When performing 1104 an atomic memory operation (ATOM) or a global load (LOAD), the GPU address 1105 is specified as an instruction operand. When performing a constant buffer 1106 load (LDC), the GPU address is derived by adding the base address of the 1107 bound buffer object to an offset specified as an instruction operand. 
1108 Given a GPU address <address> and a storage modifier <modifier>, the 1109 memory load can be described by the following code: 1110 1111 result_t_vec BufferMemoryLoad(char *address, OpModifier modifier) 1112 { 1113 result_t_vec result = { 0, 0, 0, 0 }; 1114 switch (modifier) { 1115 case F32: 1116 result.x = ((float32_t *)address)[0]; 1117 break; 1118 case F32X2: 1119 result.x = ((float32_t *)address)[0]; 1120 result.y = ((float32_t *)address)[1]; 1121 break; 1122 case F32X4: 1123 result.x = ((float32_t *)address)[0]; 1124 result.y = ((float32_t *)address)[1]; 1125 result.z = ((float32_t *)address)[2]; 1126 result.w = ((float32_t *)address)[3]; 1127 break; 1128 case F64: 1129 result.x = ((float64_t *)address)[0]; 1130 break; 1131 case F64X2: 1132 result.x = ((float64_t *)address)[0]; 1133 result.y = ((float64_t *)address)[1]; 1134 break; 1135 case F64X4: 1136 result.x = ((float64_t *)address)[0]; 1137 result.y = ((float64_t *)address)[1]; 1138 result.z = ((float64_t *)address)[2]; 1139 result.w = ((float64_t *)address)[3]; 1140 break; 1141 case S8: 1142 result.x = ((int8_t *)address)[0]; 1143 break; 1144 case S16: 1145 result.x = ((int16_t *)address)[0]; 1146 break; 1147 case S32: 1148 result.x = ((int32_t *)address)[0]; 1149 break; 1150 case S32X2: 1151 result.x = ((int32_t *)address)[0]; 1152 result.y = ((int32_t *)address)[1]; 1153 break; 1154 case S32X4: 1155 result.x = ((int32_t *)address)[0]; 1156 result.y = ((int32_t *)address)[1]; 1157 result.z = ((int32_t *)address)[2]; 1158 result.w = ((int32_t *)address)[3]; 1159 break; 1160 case S64: 1161 result.x = ((int64_t *)address)[0]; 1162 break; 1163 case S64X2: 1164 result.x = ((int64_t *)address)[0]; 1165 result.y = ((int64_t *)address)[1]; 1166 break; 1167 case S64X4: 1168 result.x = ((int64_t *)address)[0]; 1169 result.y = ((int64_t *)address)[1]; 1170 result.z = ((int64_t *)address)[2]; 1171 result.w = ((int64_t *)address)[3]; 1172 break; 1173 case U8: 1174 result.x = ((uint8_t *)address)[0]; 1175 break; 1176 case U16: 1177 result.x = ((uint16_t *)address)[0]; 1178 break; 1179 case U32: 1180 result.x = ((uint32_t *)address)[0]; 1181 break; 1182 case U32X2: 1183 result.x = ((uint32_t *)address)[0]; 1184 result.y = ((uint32_t *)address)[1]; 1185 break; 1186 case U32X4: 1187 result.x = ((uint32_t *)address)[0]; 1188 result.y = ((uint32_t *)address)[1]; 1189 result.z = ((uint32_t *)address)[2]; 1190 result.w = ((uint32_t *)address)[3]; 1191 break; 1192 case U64: 1193 result.x = ((uint64_t *)address)[0]; 1194 break; 1195 case U64X2: 1196 result.x = ((uint64_t *)address)[0]; 1197 result.y = ((uint64_t *)address)[1]; 1198 break; 1199 case U64X4: 1200 result.x = ((uint64_t *)address)[0]; 1201 result.y = ((uint64_t *)address)[1]; 1202 result.z = ((uint64_t *)address)[2]; 1203 result.w = ((uint64_t *)address)[3]; 1204 break; 1205 } 1206 return result; 1207 } 1208 1209 Store instructions write the contents of a four-component vector operand 1210 into 8, 16, 32, 64, 128, or 256 bits, according to the storage modifier 1211 specified with the instruction. The storage modifiers supported by stores 1212 are identical to those supported for loads. 
Given a GPU address 1213 <address>, a vector operand <operand> containing the data to be stored, 1214 and a storage modifier <modifier>, the memory store can be described by 1215 the following code: 1216 1217 void BufferMemoryStore(char *address, operand_t_vec operand, 1218 OpModifier modifier) 1219 { 1220 switch (modifier) { 1221 case F32: 1222 ((float32_t *)address)[0] = operand.x; 1223 break; 1224 case F32X2: 1225 ((float32_t *)address)[0] = operand.x; 1226 ((float32_t *)address)[1] = operand.y; 1227 break; 1228 case F32X4: 1229 ((float32_t *)address)[0] = operand.x; 1230 ((float32_t *)address)[1] = operand.y; 1231 ((float32_t *)address)[2] = operand.z; 1232 ((float32_t *)address)[3] = operand.w; 1233 break; 1234 case F64: 1235 ((float64_t *)address)[0] = operand.x; 1236 break; 1237 case F64X2: 1238 ((float64_t *)address)[0] = operand.x; 1239 ((float64_t *)address)[1] = operand.y; 1240 break; 1241 case F64X4: 1242 ((float64_t *)address)[0] = operand.x; 1243 ((float64_t *)address)[1] = operand.y; 1244 ((float64_t *)address)[2] = operand.z; 1245 ((float64_t *)address)[3] = operand.w; 1246 break; 1247 case S8: 1248 ((int8_t *)address)[0] = operand.x; 1249 break; 1250 case S16: 1251 ((int16_t *)address)[0] = operand.x; 1252 break; 1253 case S32: 1254 ((int32_t *)address)[0] = operand.x; 1255 break; 1256 case S32X2: 1257 ((int32_t *)address)[0] = operand.x; 1258 ((int32_t *)address)[1] = operand.y; 1259 break; 1260 case S32X4: 1261 ((int32_t *)address)[0] = operand.x; 1262 ((int32_t *)address)[1] = operand.y; 1263 ((int32_t *)address)[2] = operand.z; 1264 ((int32_t *)address)[3] = operand.w; 1265 break; 1266 case S64: 1267 ((int64_t *)address)[0] = operand.x; 1268 break; 1269 case S64X2: 1270 ((int64_t *)address)[0] = operand.x; 1271 ((int64_t *)address)[1] = operand.y; 1272 break; 1273 case S64X4: 1274 ((int64_t *)address)[0] = operand.x; 1275 ((int64_t *)address)[1] = operand.y; 1276 ((int64_t *)address)[2] = operand.z; 1277 ((int64_t *)address)[3] = operand.w; 1278 break; 1279 case U8: 1280 ((uint8_t *)address)[0] = operand.x; 1281 break; 1282 case U16: 1283 ((uint16_t *)address)[0] = operand.x; 1284 break; 1285 case U32: 1286 ((uint32_t *)address)[0] = operand.x; 1287 break; 1288 case U32X2: 1289 ((uint32_t *)address)[0] = operand.x; 1290 ((uint32_t *)address)[1] = operand.y; 1291 break; 1292 case U32X4: 1293 ((uint32_t *)address)[0] = operand.x; 1294 ((uint32_t *)address)[1] = operand.y; 1295 ((uint32_t *)address)[2] = operand.z; 1296 ((uint32_t *)address)[3] = operand.w; 1297 break; 1298 case U64: 1299 ((uint64_t *)address)[0] = operand.x; 1300 break; 1301 case U64X2: 1302 ((uint64_t *)address)[0] = operand.x; 1303 ((uint64_t *)address)[1] = operand.y; 1304 break; 1305 case U64X4: 1306 ((uint64_t *)address)[0] = operand.x; 1307 ((uint64_t *)address)[1] = operand.y; 1308 ((uint64_t *)address)[2] = operand.z; 1309 ((uint64_t *)address)[3] = operand.w; 1310 break; 1311 } 1312 } 1313 1314 If a global load or store accesses a memory address that does not 1315 correspond to a buffer object made resident by MakeBufferResidentNV, the 1316 results of the operation are undefined and may produce a fault resulting 1317 in application termination. If a load accesses a buffer object made 1318 resident with an <access> parameter of WRITE_ONLY, or if a store accesses 1319 a buffer object made resident with an <access> parameter of READ_ONLY, the 1320 results of the operation are also undefined and may lead to application 1321 termination. 
1322 1323 The address used for global memory loads or stores or offset used for 1324 constant buffer loads must be aligned to the fetch size corresponding to 1325 the storage opcode modifier. For S8 and U8, the offset has no alignment 1326 requirements. For S16 and U16, the offset must be a multiple of two basic 1327 machine units. For F32, S32, and U32, the offset must be a multiple of 1328 four. For F32X2, F64, S32X2, S64, U32X2, and U64, the offset must be a 1329 multiple of eight. For F32X4, F64X2, S32X4, S64X2, U32X4, and U64X2, the 1330 offset must be a multiple of sixteen. For F64X4, S64X4, and U64X4, the 1331 offset must be a multiple of thirty-two. If an offset is not correctly 1332 aligned, the values returned by a buffer memory load will be undefined, 1333 and the effects of a buffer memory store will also be undefined. 1334 1335 Global and image memory accesses in assembly programs are weakly ordered 1336 and may require synchronization relative to other operations in the OpenGL 1337 pipeline. The ordering and synchronization mehcanisms described in 1338 Section 2.14.X (of the EXT_shader_image_load_store extension 1339 specification) for shaders using the OpenGL Shading Language apply equally 1340 to loads, stores, and atomics performed in assembly programs. 1341 1342 1343 Modify Section 2.X.6.Y of the NV_fragment_program4 specification 1344 1345 (add new option section) 1346 1347 + Early Per-Fragment Tests (NV_early_fragment_tests) 1348 1349 If a fragment program specifies the "NV_early_fragment_tests" option, the 1350 depth and stencil tests will be performed prior to fragment program 1351 invocation, as described in Section 3.X. 1352 1353 1354 Modify Section 2.X.7.Y of the NV_geometry_program4 specification 1355 1356 (Simply add the new input primitive type "PATCHES" to the list of tokens 1357 allowed by the "PRIMITIVE_IN" declaration.) 1358 1359 - Input Primitive Type (PRIMITIVE_IN) 1360 1361 The PRIMITIVE_IN statement declares the type of primitives seen by a 1362 geometry program. The single argument must be one of "POINTS", "LINES", 1363 "LINES_ADJACENCY", "TRIANGLES", "TRIANGLES_ADJACENCY", or "PATCHES". 1364 1365 1366 (Add a new optional program declaration to declare a geometry shader that 1367 is run <N> times per primitive.) 1368 1369 Geometry programs support three types of mandatory declaration statements, 1370 as described below. Each of the three must be included exactly once in 1371 the geometry program. 1372 1373 ... 1374 1375 Geometry programs also support one optional declaration statement. 1376 1377 - Program Invocation Count (INVOCATIONS) 1378 1379 The INVOCATIONS statement declares the number of times the geometry 1380 program is run on each primitive processed. The single argument must be a 1381 positive integer less than or equal to the value of the 1382 implementation-dependent limit MAX_GEOMETRY_PROGRAM_INVOCATIONS_NV. Each 1383 invocation of the geometry program will have the same inputs and outputs 1384 except for the built-in input variable "primitive.invocation". This 1385 variable will be an integer between 0 and <n>-1, where <n> is the declared 1386 number of invocations. If omitted, the program invocation count is one. 
    Section 2.X.8.Z, ATOM: Atomic Global Memory Operation

    The ATOM instruction performs an atomic global memory operation by reading
    from memory at the address specified by the second unsigned integer scalar
    operand, computing a new value based on the value read from memory and the
    first (vector) operand, and then writing the result back to the same
    memory address.  The memory transaction is atomic, guaranteeing that no
    other write to the memory accessed will occur between the time it is read
    and written by the ATOM instruction.  The result of the ATOM instruction
    is the scalar value read from memory.

    The ATOM instruction has two required instruction modifiers.  The atomic
    modifier specifies the type of operation to be performed.  The storage
    modifier specifies the size and data type of the operand read from memory
    and the base data type of the operation used to compute the value to be
    written to memory.

      atomic     storage
      modifier   modifiers           operation
      --------   ------------------  --------------------------------------
      ADD        U32, S32, U64       compute a sum
      MIN        U32, S32            compute minimum
      MAX        U32, S32            compute maximum
      IWRAP      U32                 increment memory, wrapping at operand
      DWRAP      U32                 decrement memory, wrapping at operand
      AND        U32, S32            compute bit-wise AND
      OR         U32, S32            compute bit-wise OR
      XOR        U32, S32            compute bit-wise XOR
      EXCH       U32, S32, U64       exchange memory with operand
      CSWAP      U32, S32, U64       compare-and-swap

      Table X.Y, Supported atomic and storage modifiers for the ATOM
      instruction.

    Not all storage modifiers are supported by ATOM, and the set of modifiers
    allowed for any given instruction depends on the atomic modifier
    specified.  Table X.Y enumerates the set of atomic modifiers supported by
    the ATOM instruction, and the storage modifiers allowed for each.

      tmp0 = VectorLoad(op0);
      address = ScalarLoad(op1);
      result = BufferMemoryLoad(address, storageModifier);
      switch (atomicModifier) {
      case ADD:
        writeval = tmp0.x + result;
        break;
      case MIN:
        writeval = min(tmp0.x, result);
        break;
      case MAX:
        writeval = max(tmp0.x, result);
        break;
      case IWRAP:
        writeval = (result >= tmp0.x) ? 0 : result+1;
        break;
      case DWRAP:
        writeval = (result == 0 || result > tmp0.x) ? tmp0.x : result-1;
        break;
      case AND:
        writeval = tmp0.x & result;
        break;
      case OR:
        writeval = tmp0.x | result;
        break;
      case XOR:
        writeval = tmp0.x ^ result;
        break;
      case EXCH:
        writeval = tmp0.x;
        break;
      case CSWAP:
        if (result == tmp0.x) {
          writeval = tmp0.y;
        } else {
          return result;  // no memory store
        }
        break;
      }
      BufferMemoryStore(address, writeval, storageModifier);

    ATOM performs a scalar atomic operation.  The <y>, <z>, and <w> components
    of the result vector are undefined.

    ATOM supports no base data type modifiers, but requires exactly one
    storage modifier.  The base data types of the result vector and the first
    (vector) operand are derived from the storage modifier.  The second
    operand is always interpreted as a scalar unsigned integer.


    Section 2.X.8.Z, BFE: Bitfield Extract

    The BFE instruction performs a component-wise bitfield extraction of the
    second vector operand to yield a result vector.
    For each component, the number of bits extracted is given by the x
    component of the first vector operand, and the bit number of the least
    significant bit extracted is given by the y component of the first vector
    operand.

      tmp0 = VectorLoad(op0);
      tmp1 = VectorLoad(op1);
      result.x = BitfieldExtract(tmp0.x, tmp0.y, tmp1.x);
      result.y = BitfieldExtract(tmp0.x, tmp0.y, tmp1.y);
      result.z = BitfieldExtract(tmp0.x, tmp0.y, tmp1.z);
      result.w = BitfieldExtract(tmp0.x, tmp0.y, tmp1.w);

    If the number of bits to extract is zero, zero is returned.  The results
    of bitfield extraction are undefined

      * if the number of bits to extract or the starting offset is negative,
      * if the sum of the number of bits to extract and the starting offset
        is greater than the total number of bits in the operand/result, or
      * if the starting offset is greater than or equal to the total number of
        bits in the operand/result.

      Type BitfieldExtract(Type bits, Type offset, Type value)
      {
        if (bits < 0 || offset < 0 || offset >= TotalBits(Type) ||
            bits + offset > TotalBits(Type)) {
          /* result undefined */
        } else if (bits == 0) {
          return 0;
        } else {
          return (value << (TotalBits(Type) - (bits+offset))) >>
                 (TotalBits(Type) - bits);
        }
      }

    BFE supports only signed and unsigned integer data type modifiers.  For
    signed integer data types, the extracted value is sign-extended (i.e.,
    filled with ones if the most significant bit extracted is one and filled
    with zeroes otherwise).  For unsigned integer data types, the extracted
    value is zero-extended.


    Section 2.X.8.Z, BFI: Bitfield Insert

    The BFI instruction performs a component-wise bitfield insertion of the
    second vector operand into the third vector operand to yield a result
    vector.  For each component, the <n> least significant bits are extracted
    from the corresponding component of the second vector operand, where <n>
    is given by the x component of the first vector operand.  Those bits are
    merged into the corresponding component of the third vector operand,
    replacing bits <b> through <b>+<n>-1, to produce the result.  The bit
    offset <b> is specified by the y component of the first operand.

      tmp0 = VectorLoad(op0);
      tmp1 = VectorLoad(op1);
      tmp2 = VectorLoad(op2);
      result.x = BitfieldInsert(tmp0.x, tmp0.y, tmp1.x, tmp2.x);
      result.y = BitfieldInsert(tmp0.x, tmp0.y, tmp1.y, tmp2.y);
      result.z = BitfieldInsert(tmp0.x, tmp0.y, tmp1.z, tmp2.z);
      result.w = BitfieldInsert(tmp0.x, tmp0.y, tmp1.w, tmp2.w);

    The results of bitfield insertion are undefined

      * if the number of bits to insert or the starting offset is negative,
      * if the sum of the number of bits to insert and the starting offset
        is greater than the total number of bits in the operand/result, or
      * if the starting offset is greater than or equal to the total number of
        bits in the operand/result.
1547 1548 Type BitfieldInsert(Type bits, Type offset, Type src, Type dst) 1549 { 1550 if (bits < 0 || offset < 0 || offset >= TotalBits(type) || 1551 bits + offset > TotalBits(Type)) { 1552 /* result undefined */ 1553 } else if (bits == TotalBits(Type)) { 1554 return src; 1555 } else { 1556 Type mask = ((1 << bits) - 1) << offset; 1557 return ((src << offset) & mask) | (dst & (~mask)); 1558 } 1559 } 1560 1561 BFI supports only signed and unsigned integer data type modifiers. If no 1562 type modifier is specified, the operand and result vectors are treated as 1563 signed integers. 1564 1565 1566 Section 2.X.8.Z, BFR: Bitfield Reverse 1567 1568 The BFR instruction performs a component-wise bit reversal of the single 1569 vector operand to produce a result vector. Bit reversal is performed by 1570 exchanging the most and least significant bits, the second-most and 1571 second-least significant bits, and so on. 1572 1573 tmp0 = VectorLoad(op0); 1574 result.x = BitReverse(tmp0.x); 1575 result.y = BitReverse(tmp0.y); 1576 result.z = BitReverse(tmp0.z); 1577 result.w = BitReverse(tmp0.w); 1578 1579 BFR supports only signed and unsigned integer data type modifiers. If no 1580 type modifier is specified, the operand and result vectors are treated as 1581 signed integers. 1582 1583 1584 Section 2.X.8.Z, BTC: Bit Count 1585 1586 The BTC instruction performs a component-wise bit count of the single 1587 source vector to yield a result vector. Each component of the result 1588 vector contains the number of one bits in the corresponding component of 1589 the source vector. 1590 1591 tmp0 = VectorLoad(op0); 1592 result.x = BitCount(tmp0.x); 1593 result.y = BitCount(tmp0.y); 1594 result.z = BitCount(tmp0.z); 1595 result.w = BitCount(tmp0.w); 1596 1597 BTC supports only signed and unsigned integer data type modifiers. If no 1598 type modifier is specified, both operands and the result are treated as 1599 signed integers. 1600 1601 1602 Section 2.X.8.Z, BTFL: Find Least Significant Bit 1603 1604 The BTFL instruction searches for the least significant bit of each 1605 component of the single source vector, yielding a result vector comprising 1606 the bit number of the located bit for each component. 1607 1608 tmp0 = VectorLoad(op0); 1609 result.x = FindLSB(tmp0.x); 1610 result.y = FindLSB(tmp0.y); 1611 result.z = FindLSB(tmp0.z); 1612 result.w = FindLSB(tmp0.w); 1613 1614 BTFL supports only signed and unsigned integer data type modifiers. For 1615 unsigned integer data types, the search will yield the bit number of the 1616 least significant one bit in each component, or the maximum integer (all 1617 bits are ones) if the source vector component is zero. For signed data 1618 types, the search will yield the bit number of the least significant one 1619 bit in each component, or -1 if the source vector component is zero. If 1620 no type modifier is specified, both operands and the result are treated as 1621 signed integers. 1622 1623 1624 Section 2.X.8.Z, BTFM: Find Most Significant Bit 1625 1626 The BTFM instruction searches for the most significant bit of each 1627 component of the single source vector, yielding a result vector comprising 1628 the bit number of the located bit for each component. 1629 1630 tmp0 = VectorLoad(op0); 1631 result.x = FindMSB(tmp0.x); 1632 result.y = FindMSB(tmp0.y); 1633 result.z = FindMSB(tmp0.z); 1634 result.w = FindMSB(tmp0.w); 1635 1636 BTFM supports only signed and unsigned integer data type modifiers. 
    For unsigned integer data types, the search will yield the bit number of
    the most significant one bit in each component, or the maximum integer
    (all bits are ones) if the source vector component is zero.  For signed
    data types, the search will yield the bit number of the most significant
    one bit if the source value is positive, the bit number of the most
    significant zero bit if the source value is negative, or -1 if the source
    value is zero.  If no type modifier is specified, both operands and the
    result are treated as signed integers.


    Section 2.X.8.Z, CVT: Data Type Conversion

    The CVT instruction converts each component of the single source vector
    from one specified data type to another to yield a result vector.

      tmp0 = VectorLoad(op0);
      result = DataTypeConvert(tmp0);

    The CVT instruction requires two storage modifiers.  The first specifies
    the data type of the result components; the second specifies the data type
    of the operand components.  The supported storage modifiers are F16, F32,
    F64, S8, S16, S32, S64, U8, U16, U32, and U64.  A storage modifier of
    "F16" indicates a source or destination that is treated as having a
    floating-point type, but whose sixteen least significant bits describe a
    16-bit floating-point value using the encoding provided in Section 2.1.2.

    If the component size of the source register doesn't match the size of the
    specified operand data type, the source register components are first
    interpreted as a value with the same base data type as the operand and
    converted to the operand data type.  The operand components are then
    converted to the result data type.  Finally, if the component size of the
    destination register doesn't match the specified result data type, the
    result components are converted to values of the same base data type with
    a size matching the result register's component size.

    Data type conversion is performed by first converting the source
    components to an infinite-precision value of the destination data type,
    and then converting to the result data type.  When converting between
    floating-point and integer values, integer values are never interpreted as
    being normalized to [0,1] or [-1,+1].  Converting the floating-point
    special values -INF, +INF, and NaN to integers will yield undefined
    results.

    When converting from a non-integral floating-point value to an integer,
    one of the two integers closest in value to the floating-point value is
    chosen according to the rounding instruction modifier.  If "CEIL" or "FLR"
    is specified, the larger or smaller value, respectively, is chosen.  If
    "TRUNC" is specified, the value nearest to zero is chosen.  If "ROUND" is
    specified, the integer nearer in value to the original floating-point
    value is chosen; if both are equally near, the even integer is chosen.
    "ROUND" is used if no rounding modifier is specified.

    When converting from the infinite-precision intermediate value to the
    destination data type:

      * Floating-point values not exactly representable in the destination
        data type are rounded to one of the two nearest values in the
        destination type according to the rounding modifier.  Note that the
        results of float-to-float conversion are not automatically rounded to
        integer values, even if a rounding modifier such as CEIL or FLR is
        specified.
      * Integer values are clamped to the closest value representable in the
        result data type if the "SAT" (saturation) modifier is specified.

      * Integer values drop the most significant bits if the "SAT" modifier is
        not specified.

    Negation and absolute value operators are not supported on the source
    operand; a program using such operators will fail to compile.

    CVT supports no data type modifiers; the type of the operand and result
    vectors is fully specified by the required storage modifiers.


    Section 2.X.8.Z, EMIT: Emit Vertex

    (Modify the description of the EMIT opcode to deal with the interaction
    with multiple vertex streams added by ARB_transform_feedback3.  For more
    information on vertex streams, see ARB_transform_feedback3.)

    The EMIT instruction emits a new vertex to be added to the current output
    primitive for vertex stream zero.  The attributes of the emitted vertex
    are given by the current values of the vertex result variables.  After the
    EMIT instruction completes, a new vertex is started and all result
    variables become undefined.


    Section 2.X.8.Z, EMITS: Emit Vertex to Stream

    (Add new geometry program opcode; the EMITS instruction is not supported
    for any other program types.  For more information on vertex streams, see
    ARB_transform_feedback3.)

    The EMITS instruction emits a new vertex to be added to the current output
    primitive for the vertex stream specified by the single signed integer
    scalar operand.  The attributes of the emitted vertex are given by the
    current values of the vertex result variables.  After the EMITS
    instruction completes, a new vertex is started and all result variables
    become undefined.

    If the specified stream is negative or greater than or equal to the
    implementation-dependent number of vertex streams
    (MAX_VERTEX_STREAMS_NV), the results of the instruction are undefined.


    Section 2.X.8.Z, IPAC: Interpolate at Centroid

    The IPAC instruction generates a result vector by evaluating the fragment
    attribute named by the single vector operand at the centroid location.
    The result vector would be identical to the value obtained by a MOV
    instruction if the attribute variable were declared using the CENTROID
    modifier.

    When interpolating an attribute variable with this instruction, the
    CENTROID and SAMPLE attribute variable modifiers are ignored.  The FLAT
    and NOPERSPECTIVE variable modifiers operate normally.

      tmp0 = Interpolate(op0, x_pixel + x_centroid, y_pixel + y_centroid);
      result = tmp0;

    IPAC supports only floating-point data type modifiers.  A program will
    fail to load if it contains an IPAC instruction whose single operand is
    not a fragment program attribute variable or matches the "fragment.facing"
    or "primitive.id" binding.


    Section 2.X.8.Z, IPAO: Interpolate with Offset

    The IPAO instruction generates a result vector by evaluating the fragment
    attribute named by the first vector operand at an offset from the pixel
    center given by the x and y components of the second vector operand.  The
    z and w components of the second vector operand are ignored.
    The (x,y) position used for interpolating the attribute variable is
    obtained by adding the (x,y) offsets in the second vector operand to the
    (x,y) position of the pixel center.

    The range of offsets supported by the IPAO instruction is
    implementation-dependent.  The position used to interpolate the attribute
    variable is undefined if the x or y component of the second operand is
    less than MIN_FRAGMENT_INTERPOLATION_OFFSET_NV or greater than
    MAX_FRAGMENT_INTERPOLATION_OFFSET_NV.  Additionally, the granularity of
    offsets may be limited.  The (x,y) value may be snapped to a fixed
    sub-pixel grid with the number of subpixel bits given by
    FRAGMENT_PROGRAM_INTERPOLATION_OFFSET_BITS_NV.

    When interpolating an attribute variable with this instruction, the
    CENTROID and SAMPLE attribute variable modifiers are ignored.  The FLAT
    and NOPERSPECTIVE variable modifiers operate normally.

      tmp1 = VectorLoad(op1);
      tmp0 = Interpolate(op0, x_pixel + tmp1.x, y_pixel + tmp1.y);
      result = tmp0;

    IPAO supports only floating-point data type modifiers.  A program will
    fail to load if it contains an IPAO instruction whose first operand is not
    a fragment program attribute variable or matches the "fragment.facing" or
    "primitive.id" binding.


    Section 2.X.8.Z, IPAS: Interpolate at Sample Location

    The IPAS instruction generates a result vector by evaluating the fragment
    attribute named by the first vector operand at the location of the pixel's
    sample whose sample number is given by the second integer scalar operand.
    If multisample buffers are not available (SAMPLE_BUFFERS is zero), the
    attribute will be evaluated at the pixel center.  If the sample number
    given by the second operand does not exist, the position used to
    interpolate the attribute is undefined.

    When interpolating an attribute variable with this instruction, the
    CENTROID and SAMPLE attribute variable modifiers are ignored.  The FLAT
    and NOPERSPECTIVE variable modifiers operate normally.

      sample = ScalarLoad(op1);
      tmp1 = SampleOffset(sample);
      tmp0 = Interpolate(op0, x_pixel + tmp1.x, y_pixel + tmp1.y);
      result = tmp0;

    IPAS supports only floating-point data type modifiers.  A program will
    fail to load if it contains an IPAS instruction whose first operand is not
    a fragment program attribute variable or matches the "fragment.facing" or
    "primitive.id" binding.


    Section 2.X.8.Z, LDC: Load from Constant Buffer

    The LDC instruction loads a vector operand from a buffer object to yield a
    result vector.  The operand used for the LDC instruction must correspond
    to a parameter buffer variable declared using the "CBUFFER" statement; a
    program will fail to load if any other type of operand is used in an LDC
    instruction.

      result = BufferMemoryLoad(&op0, storageModifier);

    A base operand vector is fetched from memory as described in Section
    2.X.4.5, with the GPU address derived from the binding corresponding to
    the operand.  A final operand vector is derived from the base operand
    vector by applying swizzle, negation, and absolute value operand modifiers
    as described in Section 2.X.4.2.

    The amount of memory in any given buffer object binding accessible by the
    LDC instruction may be limited.
    If any component fetched by the LDC instruction extends 4*<n> or more
    basic machine units from the beginning of the buffer object binding, where
    <n> is the implementation-dependent constant
    MAX_PROGRAM_PARAMETER_BUFFER_SIZE_NV, the value fetched for that component
    will be undefined.

    LDC supports no base data type modifiers, but requires exactly one storage
    modifier.  The base data types of the operand and result vectors are
    derived from the storage modifier.


    Section 2.X.8.Z, LOAD: Global Load

    The LOAD instruction generates a result vector by reading an address from
    the single unsigned integer scalar operand and fetching data from buffer
    object memory, as described in Section 2.X.4.5.

      address = ScalarLoad(op0);
      result = BufferMemoryLoad(address, storageModifier);

    LOAD supports no base data type modifiers, but requires exactly one
    storage modifier.  The base data type of the result vector is derived from
    the storage modifier.  The single scalar operand is always interpreted as
    an unsigned integer.


    Section 2.X.8.Z, MEMBAR: Memory Barrier

    The MEMBAR instruction synchronizes memory transactions to ensure that
    memory transactions resulting from any instruction executed by the thread
    prior to the MEMBAR instruction complete prior to any memory transactions
    issued after the instruction.

    MEMBAR has no operands and generates no result.


    Section 2.X.8.Z, PK64: Pack 64-Bit Component

    The PK64 instruction reads the four components of the single vector
    operand as 32-bit values, packs the bit representations of these into a
    pair of 64-bit values, and replicates those to produce a four-component
    result vector.  The "x" and "y" components of the operand are packed to
    produce the "x" and "z" components of the result vector; the "z" and "w"
    components of the operand are packed to produce the "y" and "w" components
    of the result vector.  The PK64 instruction can be reversed by the UP64
    instruction below.

    This instruction is intended to allow a program to reconstruct 64-bit
    integer or floating-point values generated by the application but passed
    to the GL as two 32-bit values taken from adjacent words in memory.  The
    ability to use this technique depends on how the 64-bit value is stored in
    memory.  For "little-endian" processors, the first 32-bit value holds the
    least significant 32 bits of the 64-bit value.  For "big-endian"
    processors, the first 32-bit value holds the most significant 32 bits of
    the 64-bit value.  This reconstruction assumes that the first 32-bit word
    comes from the x component of the operand and the second 32-bit word comes
    from the y component.  The method used to construct a 64-bit value from a
    pair of 32-bit values depends on the processor type.
      tmp = VectorLoad(op0);

      if (underlying system is little-endian) {
        result.x = RawBits(tmp.x) | (RawBits(tmp.y) << 32);
        result.y = RawBits(tmp.z) | (RawBits(tmp.w) << 32);
        result.z = RawBits(tmp.x) | (RawBits(tmp.y) << 32);
        result.w = RawBits(tmp.z) | (RawBits(tmp.w) << 32);
      } else {
        result.x = RawBits(tmp.y) | (RawBits(tmp.x) << 32);
        result.y = RawBits(tmp.w) | (RawBits(tmp.z) << 32);
        result.z = RawBits(tmp.y) | (RawBits(tmp.x) << 32);
        result.w = RawBits(tmp.w) | (RawBits(tmp.z) << 32);
      }

    PK64 supports integer and floating-point data type modifiers, which
    specify the base data type of the operand and result.  The single vector
    operand is always treated as having 32-bit components, and the result is
    treated as a vector with 64-bit components.  The encoding performed by
    PK64 can be reversed using the UP64 instruction.

    A program will fail to load if it contains a PK64 instruction that writes
    its results to a variable not declared as "LONG".


    Section 2.X.8.Z, STORE: Global Store

    The STORE instruction reads an address from the second unsigned integer
    scalar operand and writes the contents of the first vector operand to
    buffer object memory at that address, as described in Section 2.X.4.5.
    This instruction generates no result.

      tmp0 = VectorLoad(op0);
      address = ScalarLoad(op1);
      BufferMemoryStore(address, tmp0, storageModifier);

    STORE supports no base data type modifiers, but requires exactly one
    storage modifier.  The base data type of the vector components of the
    first operand is derived from the storage modifier.  The second operand is
    always interpreted as an unsigned integer scalar.


    Section 2.X.8.Z, TEX: Texture Sample

    (Modify the instruction pseudo-code to account for texel offsets no longer
    needing to be immediate arguments.)

      tmp = VectorLoad(op0);
      if (instruction has variable texel offset) {
        itmp = VectorLoad(op1);
      } else {
        itmp = instruction.texelOffset;
      }
      ddx = ComputePartialsX(tmp);
      ddy = ComputePartialsY(tmp);
      lambda = ComputeLOD(ddx, ddy);
      result = TextureSample(tmp, lambda, ddx, ddy, itmp);


    Section 2.X.8.Z, TGALL: Test for All Non-Zero in a Thread Group

    The TGALL instruction produces a result vector by reading a vector operand
    for each active thread in the current thread group and comparing each
    component to zero.  A result vector component contains a TRUE value
    (described below) if the value of the corresponding component in the
    operand vector is non-zero for all active threads, and a FALSE value
    otherwise.

    An implementation may choose to arrange program threads into thread
    groups, and execute an instruction simultaneously for each thread in the
    group.  If the TGALL instruction is contained inside conditional flow
    control blocks and not all threads in the group execute the instruction,
    the operand values for threads not executing the instruction have no
    bearing on the value returned.  The method used to arrange threads into
    groups is undefined.
      tmp = VectorLoad(op0);
      result = { TRUE, TRUE, TRUE, TRUE };
      for (all active threads) {
        if ([thread]tmp.x == 0) result.x = FALSE;
        if ([thread]tmp.y == 0) result.y = FALSE;
        if ([thread]tmp.z == 0) result.z = FALSE;
        if ([thread]tmp.w == 0) result.w = FALSE;
      }

    TGALL supports all data type modifiers.  For floating-point data types,
    the TRUE value is 1.0 and the FALSE value is 0.0.  For signed integer data
    types, the TRUE value is -1 and the FALSE value is 0.  For unsigned
    integer data types, the TRUE value is the maximum integer value (all bits
    are ones) and the FALSE value is zero.


    Section 2.X.8.Z, TGANY: Test for Any Non-Zero in a Thread Group

    The TGANY instruction produces a result vector by reading a vector operand
    for each active thread in the current thread group and comparing each
    component to zero.  A result vector component contains a TRUE value
    (described below) if the value of the corresponding component in the
    operand vector is non-zero for any active thread, and a FALSE value
    otherwise.

    An implementation may choose to arrange program threads into thread
    groups, and execute an instruction simultaneously for each thread in the
    group.  If the TGANY instruction is contained inside conditional flow
    control blocks and not all threads in the group execute the instruction,
    the operand values for threads not executing the instruction have no
    bearing on the value returned.  The method used to arrange threads into
    groups is undefined.

      tmp = VectorLoad(op0);
      result = { FALSE, FALSE, FALSE, FALSE };
      for (all active threads) {
        if ([thread]tmp.x != 0) result.x = TRUE;
        if ([thread]tmp.y != 0) result.y = TRUE;
        if ([thread]tmp.z != 0) result.z = TRUE;
        if ([thread]tmp.w != 0) result.w = TRUE;
      }

    TGANY supports all data type modifiers.  For floating-point data types,
    the TRUE value is 1.0 and the FALSE value is 0.0.  For signed integer data
    types, the TRUE value is -1 and the FALSE value is 0.  For unsigned
    integer data types, the TRUE value is the maximum integer value (all bits
    are ones) and the FALSE value is zero.


    Section 2.X.8.Z, TGEQ: Test for All Equal Values in a Thread Group

    The TGEQ instruction produces a result vector by reading a vector operand
    for each active thread in the current thread group and comparing each
    component to zero.  A result vector component contains a TRUE value
    (described below) if the value of the corresponding component in the
    operand vector is the same for all active threads, and a FALSE value
    otherwise.

    An implementation may choose to arrange program threads into thread
    groups, and execute an instruction simultaneously for each thread in the
    group.  If the TGEQ instruction is contained inside conditional flow
    control blocks and not all threads in the group execute the instruction,
    the operand values for threads not executing the instruction have no
    bearing on the value returned.  The method used to arrange threads into
    groups is undefined.
2037 2038 tmp = VectorLoad(op0); 2039 tgall = { TRUE, TRUE, TRUE, TRUE }; 2040 tgany = { FALSE, FALSE, FALSE, FALSE }; 2041 for (all active threads) { 2042 if ([thread]tmp.x == 0) tgall.x = FALSE; else tgany.x = TRUE; 2043 if ([thread]tmp.y == 0) tgall.y = FALSE; else tgany.y = TRUE; 2044 if ([thread]tmp.z == 0) tgall.z = FALSE; else tgany.z = TRUE; 2045 if ([thread]tmp.w == 0) tgall.w = FALSE; else tgany.w = TRUE; 2046 } 2047 result.x = (tgall.x == tgany.x) ? TRUE : FALSE; 2048 result.y = (tgall.y == tgany.y) ? TRUE : FALSE; 2049 result.z = (tgall.z == tgany.z) ? TRUE : FALSE; 2050 result.w = (tgall.w == tgany.w) ? TRUE : FALSE; 2051 2052 TGEQ supports all data type modifiers. For floating-point data types, the 2053 TRUE value is 1.0 and the FALSE value is 0.0. For signed integer data 2054 types, the TRUE value is -1 and the FALSE value is 0. For unsigned 2055 integer data types, the TRUE value is the maximum integer value (all bits 2056 are ones) and the FALSE value is zero. 2057 2058 2059 Section 2.X.8.Z, TXB: Texture Sample with Bias 2060 2061 (Modify the instruction pseudo-code to account for texel offsets no 2062 longer need to be immediate arguments.) 2063 2064 tmp = VectorLoad(op0); 2065 if (instruction has variable texel offset) { 2066 itmp = VectorLoad(op1); 2067 } else { 2068 itmp = instruction.texelOffset; 2069 } 2070 ddx = ComputePartialsX(tmp); 2071 ddy = ComputePartialsY(tmp); 2072 lambda = ComputeLOD(ddx, ddy); 2073 result = TextureSample(tmp, lambda + tmp.w, ddx, ddy, itmp); 2074 2075 Section 2.X.8.Z, TXG: Texture Gather 2076 2077 (Update the TXG opcode description from NV_gpu_program4_1 specification. 2078 This version adds two capabilities: any component of a multi-component 2079 texture can be selected by tacking on a component name to the texture 2080 variable passed to identify the texture unit, and depth compares are 2081 supported if a SHADOW target is specified.) 2082 2083 The TXG instruction takes the four components of a single floating-point 2084 vector operand as a texture coordinate, determines a set of four texels to 2085 sample from the base level of detail of the specified texture image, and 2086 returns one component from each texel in a four-component result vector. 2087 To determine the four texels to sample, the minification and magnification 2088 filters are ignored and the rules for LINEAR filter are applied to the 2089 base level of the texture image to determine the texels T_i0_j1, T_i1_j1, 2090 T_i1_j0, and T_i0_j0, as defined in equations 3.23 through 3.25. The 2091 texels are then converted to texture source colors (Rs,Gs,Bs,As) according 2092 to table 3.21, followed by application of the texture swizzle as described 2093 in section 3.8.13. A four-component vector is returned by taking one of 2094 the four components of the swizzled texture source colors from each of the 2095 four selected texels. The component is selected using the 2096 <texImageUnitComp> grammar rule, by adding a scalar suffix 2097 (".x", ".y", ".z", ".w") to the identified texture; if no scalar suffix 2098 is provided, the first component is selected. 2099 2100 TXG only operates on 2D, SHADOW2D, CUBE, SHADOWCUBE, ARRAY2D, 2101 SHADOWARRAY2D, ARRAYCUBE, SHADOWARRAYCUBE, RECT, and SHADOWRECT texture 2102 targets; a program will fail to compile if any other texture target is 2103 used. 2104 2105 When using a "SHADOW" texture target, component selection is ignored. 
2106 Instead, depth comparisons are performed on the depth values for each of 2107 the four selected texels, and 0/1 values are returned based on the results 2108 of the comparison. 2109 2110 As with other texture accesses, the results of a texture gather operation 2111 are undefined if the texture target in the instruction is incompatible 2112 with the selected texture's base internal format and depth compare mode. 2113 2114 tmp = VectorLoad(op0); 2115 ddx = (0,0,0); 2116 ddy = (0,0,0); 2117 lambda = 0; 2118 if (instruction has variable texel offset) { 2119 itmp = VectorLoad(op1); 2120 } else { 2121 itmp = instruction.texelOffset; 2122 } 2123 result.x = TextureSample_i0j1(tmp, lambda, ddx, ddy, itmp).<comp>; 2124 result.y = TextureSample_i1j1(tmp, lambda, ddx, ddy, itmp).<comp>; 2125 result.z = TextureSample_i1j0(tmp, lambda, ddx, ddy, itmp).<comp>; 2126 result.w = TextureSample_i0j0(tmp, lambda, ddx, ddy, itmp).<comp>; 2127 2128 In this pseudocode, "<comp>" refers to the texel component selected by the 2129 <texImageUnitComp> grammar rule, as described above. 2130 2131 TXG supports all three data type modifiers. The single operand is always 2132 treated as a floating-point vector; the results are interpreted according 2133 to the data type modifier. 2134 2135 2136 Section 2.X.8.Z, TXGO: Texture Gather with Per-Texel Offsets 2137 2138 Like the TXG instruction, the TXGO instruction takes the four components 2139 of its first floating-point vector operand as a texture coordinate, 2140 determines a set of four texels to sample from the base level of detail of 2141 the specified texture image, and returns one component from each texel in 2142 a four-component result vector. The second and third vector operands are 2143 taken as signed four-component integer vectors providing the x and y 2144 components of the offsets, respectively, used to determine the location of 2145 each of the four texels. To determine the four texels to sample, each of 2146 the four independent offsets is used in conjunction with the specified 2147 texture coordinate to select a texel. The minification and magnification 2148 filters are ignored and the rules for LINEAR filtering are used to select 2149 the texel T_i0_j0, as defined in equations 3.23 through 3.25, from the 2150 base level of the texture image. The texels are then converted to texture 2151 source colors (Rs,Gs,Bs,As) according to table 3.21, followed by 2152 application of the texture swizzle as described in section 3.8.13. A 2153 four-component vector is returned by taking one of the four components 2154 of the swizzled texture source colors from each of the four selected 2155 texels. The component is selected using the <texImageUnitComp> grammar 2156 rule, by adding a scalar suffix (".x", ".y", ".z", ".w") to the identified 2157 texture; if no scalar suffix is provided, the first component is selected. 2158 2159 TXGO only operates on 2D, SHADOW2D, ARRAY2D, SHADOWARRAY2D, RECT, and 2160 SHADOWRECT texture targets; a program will fail to compile if any other 2161 texture target is used. 2162 2163 When using a "SHADOW" texture target, component selection is ignored. 2164 Instead, depth comparisons are performed on the depth values for each of 2165 the four selected texels, and 0/1 values are returned based on the results 2166 of the comparison. 
    As with other texture accesses, the results of a texture gather operation
    are undefined if the texture target in the instruction is incompatible
    with the selected texture's base internal format and depth compare mode.

      tmp = VectorLoad(op0);
      itmp1 = VectorLoad(op1);
      itmp2 = VectorLoad(op2);
      ddx = (0,0,0);
      ddy = (0,0,0);
      lambda = 0;
      itmp = (itmp1.x, itmp2.x);
      result.x = TextureSample_i0j0(tmp, lambda, ddx, ddy, itmp).<comp>;
      itmp = (itmp1.y, itmp2.y);
      result.y = TextureSample_i0j0(tmp, lambda, ddx, ddy, itmp).<comp>;
      itmp = (itmp1.z, itmp2.z);
      result.z = TextureSample_i0j0(tmp, lambda, ddx, ddy, itmp).<comp>;
      itmp = (itmp1.w, itmp2.w);
      result.w = TextureSample_i0j0(tmp, lambda, ddx, ddy, itmp).<comp>;

    In this pseudocode, "<comp>" refers to the texel component selected by the
    <texImageUnitComp> grammar rule, as described above.

    If TEXTURE_WRAP_S or TEXTURE_WRAP_T are either CLAMP or MIRROR_CLAMP_EXT,
    the results of the TXGO instruction are undefined.

    Note:  The TXG instruction is equivalent to the TXGO instruction with X
    and Y offset vectors of (0,1,1,0) and (0,0,-1,-1), respectively.

    TXGO supports all three data type modifiers.  The first operand is always
    treated as a floating-point vector and the second and third operands are
    always treated as signed integer vectors; the results are interpreted
    according to the data type modifier.


    Section 2.X.8.Z, TXL: Texture Sample with LOD

    (Modify the instruction pseudo-code to account for texel offsets no longer
    needing to be immediate arguments.)

      tmp = VectorLoad(op0);
      if (instruction has variable texel offset) {
        itmp = VectorLoad(op1);
      } else {
        itmp = instruction.texelOffset;
      }
      ddx = (0,0,0);
      ddy = (0,0,0);
      result = TextureSample(tmp, tmp.w, ddx, ddy, itmp);


    Section 2.X.8.Z, TXP: Texture Sample with Projection

    (Modify the instruction pseudo-code to account for texel offsets no longer
    needing to be immediate arguments.)

      tmp0 = VectorLoad(op0);
      tmp0.x = tmp0.x / tmp0.w;
      tmp0.y = tmp0.y / tmp0.w;
      tmp0.z = tmp0.z / tmp0.w;
      if (instruction has variable texel offset) {
        itmp = VectorLoad(op1);
      } else {
        itmp = instruction.texelOffset;
      }
      ddx = ComputePartialsX(tmp0);
      ddy = ComputePartialsY(tmp0);
      lambda = ComputeLOD(ddx, ddy);
      result = TextureSample(tmp0, lambda, ddx, ddy, itmp);


    Section 2.X.8.Z, UP64: Unpack 64-bit Component

    The UP64 instruction produces a vector result with 32-bit components by
    unpacking the bits of the "x" and "y" components of a 64-bit vector
    operand.  The "x" component of the operand is unpacked to produce the "x"
    and "y" components of the result vector; the "y" component is unpacked to
    produce the "z" and "w" components of the result vector.

    This instruction is intended to allow a program to pass 64-bit integer or
    floating-point values to an application using two 32-bit values stored in
    adjacent words in memory, which will be read by the application as single
    64-bit values.  The ability to use this technique depends on how the
    64-bit value is stored in memory.  For "little-endian" processors, the
    first 32-bit value holds the least significant 32 bits of the 64-bit
    value.
    For "big-endian" processors, the first 32-bit value holds the most
    significant 32 bits of the 64-bit value.  This decomposition assumes that
    the first 32-bit word comes from the "x" component of the result and the
    second 32-bit word comes from the "y" component.  The method used to
    unpack a 64-bit value into a pair of 32-bit values depends on the
    processor type.

      tmp = VectorLoad(op0);
      if (underlying system is little-endian) {
        result.x = (RawBits(tmp.x) >> 0)  & 0xFFFFFFFF;
        result.y = (RawBits(tmp.x) >> 32) & 0xFFFFFFFF;
        result.z = (RawBits(tmp.y) >> 0)  & 0xFFFFFFFF;
        result.w = (RawBits(tmp.y) >> 32) & 0xFFFFFFFF;
      } else {
        result.x = (RawBits(tmp.x) >> 32) & 0xFFFFFFFF;
        result.y = (RawBits(tmp.x) >> 0)  & 0xFFFFFFFF;
        result.z = (RawBits(tmp.y) >> 32) & 0xFFFFFFFF;
        result.w = (RawBits(tmp.y) >> 0)  & 0xFFFFFFFF;
      }

    UP64 supports integer and floating-point data type modifiers, which
    specify the base data type of the operand and result.  The single operand
    vector always has 64-bit components.  The result is treated as a vector
    with 32-bit components.  The encoding performed by UP64 can be reversed
    using the PK64 instruction.

    A program will fail to load if it contains a UP64 instruction whose
    operand is a variable not declared as "LONG".


    Modify Section 2.14.6.1 of the NV_geometry_program4 specification,
    Geometry Program Input Primitives

    (add patches to the list of supported input primitive types)

    The supported input primitive types are: ...

    Patches (PATCHES)

    Geometry programs that operate on patches are valid only for the
    PATCHES_NV primitive type.  There are a variable number of vertices
    available for each program invocation, depending on the number of input
    vertices in the primitive itself.  For a patch with <n> vertices,
    "vertex[0]" refers to the first vertex of the patch, and "vertex[<n>-1]"
    refers to the last vertex.


    Modify Section 2.14.6.2 of the NV_geometry_program4 specification,
    Geometry Program Output Primitives

    (Add a new paragraph limiting the use of the EMITS opcode to geometry
    programs with a POINTS output primitive type at the end of the section.
    This limitation may be removed in future specifications.)

    Geometry programs may write to multiple vertex streams only if the
    specified output primitive type is POINTS.  A program will fail to load if
    it contains an EMITS instruction and the output primitive type specified
    by the PRIMITIVE_OUT declaration is not POINTS.

    Modify Section 2.14.6.4 of the NV_geometry_program4 specification,
    Geometry Program Output Limits

    (Modify the limitation on the total number of components emitted by a
    geometry program from NV_gpu_program4 to be per-invocation.  If that limit
    is 4096 and a program has 16 invocations, each of the 16 program
    invocations can emit up to 4096 total components.)

    There are two implementation-dependent limits on the total number of
    vertices that each invocation of a program can emit.  First, the vertex
    limit may not exceed the value of MAX_PROGRAM_OUTPUT_VERTICES_NV.
Second, 2322 product of the vertex limit and the number of result variable components 2323 written by the program (PROGRAM_RESULT_COMPONENTS_NV, as described in 2324 section 2.X.3.5 of NV_gpu_program4) may not exceed the value of 2325 MAX_PROGRAM_TOTAL_OUTPUT_COMPONENTS_NV. A geometry program will fail to 2326 load if its maximum vertex count or maximum total component count exceeds 2327 the implementation-dependent limit. The limits may be queried by calling 2328 GetProgramiv with a <target> of GEOMETRY_PROGRAM_NV. Note that the 2329 maximum number of vertices that a geometry program can emit may be much 2330 lower than MAX_PROGRAM_OUTPUT_VERTICES_NV if the program writes a large 2331 number of result variable components. If a geometry program has multiple 2332 invocations (via the "INVOCATIONS" declaration), the program will load 2333 successfully as long as no single invocation exceeds the total component 2334 count limit, even if the total output of all invocations combined exceeds 2335 the limit. 2336 2337 2338Additions to Chapter 3 of the OpenGL 3.0 Specification (Rasterization) 2339 2340 Modify Section 3.X, Early Per-Fragment Tests, as documented in the 2341 EXT_shader_image_load_store specification 2342 2343 (add new paragraph at the end of a section, describing how early fragment 2344 tests work when assembly fragment programs are active) 2345 2346 If an assembly fragment program is active, early depth tests are 2347 considered enabled if and only if the fragment program source included the 2348 NV_early_fragment_tests option. 2349 2350 2351 Add to Section 3.11.4.5 of ARB_fragment_program (Fragment Program): 2352 2353 Section 3.11.4.5.3, ARB_blend_func_extended Option 2354 2355 If a fragment program specifies the "ARB_blend_func_extended" option, dual 2356 source color outputs as described in ARB_blend_func_extended are made 2357 available through the use of the "result.color[n].primary" and 2358 "result.color[n].secondary" result bindings, corresponding to SRC_COLOR 2359 and SRC1_COLOR, respectively, for the fragment color output numbered <n>. 2360 2361 2362Additions to Chapter 4 of the OpenGL 3.0 Specification (Per-Fragment 2363Operations and the Frame Buffer) 2364 2365 Modify Section 4.4.3, Rendering When an Image of a Bound Texture Object 2366 is Also Attached to the Framebuffer, p. 288 2367 2368 (Replace the complicated set of conditions with the following) 2369 2370 Specifically, the values of rendered fragments are undefined if any 2371 shader stage fetches texels from a given mipmap level, cubemap face, and 2372 array layer of a texture if that same mipmap level, cubemap face, and 2373 array layer of the texture can be written to via fragment shader outputs, 2374 even if the reads and writes are not in the same Draw call. However, an 2375 application can insert MemoryBarrier(TEXTURE_FETCH_BARRIER_BIT_NV) between 2376 Draw calls that have such read/write hazards in order to guarantee that 2377 writes have completed and caches have been invalidated, as described in 2378 section 2.20.X. 2379 2380 2381Additions to Chapter 5 of the OpenGL 3.0 Specification (Special Functions) 2382 2383 None. 2384 2385Additions to Chapter 6 of the OpenGL 3.0 Specification (State and 2386State Requests) 2387 2388 None. 2389 2390Additions to Appendix A of the OpenGL 3.0 Specification (Invariance) 2391 2392 None. 2393 2394Additions to the AGL/GLX/WGL Specifications 2395 2396 None. 2397 2398GLX Protocol 2399 2400 None. 
2401 2402Errors 2403 2404 None, other than new conditions by which a program string would fail to 2405 load. 2406 2407New State 2408 2409 None. 2410 2411 2412New Implementation Dependent State 2413 2414 Minimum 2415 Get Value Type Get Command Value Description Sec. Attrib 2416 -------------------------------- ---- --------------- ------- --------------------- ------ ------ 2417 MAX_GEOMETRY_PROGRAM_ Z+ GetIntegerv 32 Maximum number of GP 2.X.6.Y - 2418 INVOCATIONS_NV invocations per prim. 2419 MIN_FRAGMENT_INTERPOLATION_ R GetFloatv -0.5 Max. negative offset 2.X.8.Z - 2420 OFFSET_NV for IPAO instruction. 2421 MAX_FRAGMENT_INTERPOLATION_ R GetFloatv +0.5 Max. positive offset 2.X.8.Z - 2422 OFFSET_NV for IPAO instruction. 2423 FRAGMENT_PROGRAM_INTERPOLATION_ Z+ GetIntegerv 4 Subpixel bit count 2.X.8.Z - 2424 OFFSET_BITS_NV for IPAO instruction 2425 2426 2427Dependencies on NV_gpu_program4, NV_vertex_program4, NV_geometry_program4, and 2428NV_fragment_program4 2429 2430 This extension is written against the NV_gpu_program4 family of 2431 extensions, and introduces new instruction set features and inputs/outputs 2432 described here. These features are available only if the extension is 2433 supported and the appropriate program header string is used ("!!NVvp5.0" 2434 for vertex programs, "!!NVgp5.0" for geometry programs, and "!!NVfp5.0" 2435 for fragment programs.) When loading a program with an older header (e.g., 2436 "!!NVvp4.0"), the instruction set features described in this extension are 2437 not available. The features in this extension build upon those documented 2438 in full in NV_gpu_program4. 2439 2440Dependencies on NV_tessellation_program5 2441 2442 This extension provides the basic assembly instruction set constructs for 2443 tessellation programs. If this extension is supported, tessellation 2444 control and evaluation programs are supported, as described in the 2445 NV_tessellation_program5 specification. There is no separate extension 2446 string for tessellation programs; such support is implied by this 2447 extension. 2448 2449Dependencies on ARB_transform_feedback3 2450 2451 The concept of multiple vertex streams emitted by a geometry shader is 2452 introduced by ARB_transform_feedback3, as is the description of how they 2453 operate and implementation-dependent limits on the number of streams. 2454 This extension simply provides a mechanism to emit a vertex to more than 2455 one stream. If ARB_transform_feedback3 is not supported, language 2456 describing the EMITS opcode and the restriction on PRIMITIVE_OUT when 2457 EMITS is used should be removed. 2458 2459Dependencies on NV_shader_buffer_load 2460 2461 The programmability functionality provided by NV_shader_buffer_load is 2462 also incorporated by this extension. Any assembly program using a program 2463 header corresponding to this or any subsequent extension (e.g., 2464 "!!NVfp5.0") may use the LOAD opcode without needing to declare "OPTION 2465 NV_shader_buffer_load". 2466 2467 NV_shader_buffer_load is required by this extension, which means that the 2468 API mechanisms documented there allowing applications to make a buffer 2469 resident and query its GPU address are available to any applications using 2470 this extension. 2471 2472 In addition to the basic functionality in NV_shader_buffer_load, this 2473 extension provides the ability to load 64-bit integers and floating-point 2474 values using the "S64", "S64X2", "S64X4", "U64", "U64X2", "U64X4", "F64", 2475 "F64X2", and "F64X4" opcode modifiers. 
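    As a non-normative illustration of the API side of this dependency, the
    sketch below uses the NV_shader_buffer_load entry points to make a buffer
    object resident and query the GPU address that a program could then
    consume with the LOAD opcode.  The buffer object name "buffer" is assumed
    to exist already; how the queried address is delivered to the program
    (for example, through program parameters, possibly as two 32-bit halves
    reassembled with PK64) is not shown.

      /* Illustrative only; "buffer" is an existing buffer object holding
         data the program will read with LOAD. */
      GLuint64EXT gpuAddress = 0;
      glBindBuffer(GL_ARRAY_BUFFER, buffer);
      glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);
      glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV,
                                  &gpuAddress);
      /* gpuAddress plus any required offset (suitably aligned for the
         storage modifier used) can now be passed to the program and used as
         the address operand of LOAD. */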
2476 2477Dependencies on NV_shader_buffer_store 2478 2479 This extension provides assembly programmability support for the 2480 NV_shader_buffer_store, which provides the API mechanisms allowing buffer 2481 object to be stored to. NV_shader_buffer_store does not have a separate 2482 extension string entry, and will always be supported if this extension is 2483 present. 2484 2485Dependencies on NV_parameter_buffer_object2 2486 2487 The programmability functionality provided by NV_parameter_buffer_object2 2488 is also incorporated by this extension. Any assembly program using a 2489 program header corresponding to this or any subsequent extension (e.g., 2490 "!!NVfp5.0") may use the LDC opcode without needing to declare "OPTION 2491 NV_parameter_buffer_object2". 2492 2493 In addition to the basic functionality in NV_parameter_buffer_object2, 2494 this extension provides the ability to load 64-bit integers and 2495 floating-point values using the "S64", "S64X2", "S64X4", "U64", "U64X2", 2496 "U64X4", "F64", "F64X2", and "F64X4" opcode modifiers. 2497 2498Dependencies on OpenGL 3.3, ARB_texture_swizzle, and EXT_texture_swizzle 2499 2500 If OpenGL 3.3, ARB_texture_swizzle, and EXT_texture_swizzle are not 2501 supported, remove the swizzling step from the definition of TXG and TXGO. 2502 2503Dependencies on ARB_blend_func_extended 2504 2505 If ARB_blend_func_extended is not supported, references to the dual source 2506 color output bindings (result.color.primary and result.color.secondary) 2507 should be removed. 2508 2509Dependencies on EXT_shader_image_load_store 2510 2511 EXT_shader_image_load_store provides OpenGL Shading Language mechanisms to 2512 load/store to buffer and texture image memory, including spec language 2513 describing memory access ordering and synchronization, a built-in function 2514 (MemoryBarrierEXT) controlling synchronization of memory operations, and 2515 spec language describing early fragment tests that can be enabled via GLSL 2516 fragment shader source. These sections of the EXT_shader_image_load_store 2517 specification apply equally to the assembly program memory accesses 2518 provided by this extension. If EXT_shader_image_load_store is not 2519 supported, the sections of that specification describing these features 2520 should be considered to be added to this extension. 2521 2522 EXT_shader_image_load_store additionally provides and documents assembly 2523 language support for image loads, stores, and atomics as described in the 2524 "Dependencies on NV_gpu_program5" section of EXT_shader_image_load_store. 2525 The features described there are automatically supported for all 2526 NV_gpu_program5 assembly programs without requiring any additional 2527 "OPTION" line. 2528 2529Dependencies on ARB_shader_subroutine 2530 2531 ARB_shader_subroutine provides and documents assembly language support for 2532 subroutines as described in the "Dependencies on NV_gpu_program5" section 2533 of ARB_shader_subroutine. The features described there are automatically 2534 supported for all NV_gpu_program5 assembly programs without requiring any 2535 additional "OPTION" line. 2536 2537 2538Issues 2539 2540 (1) Are there any restrictions or performance concerns involving the 2541 support for indexing textures or parameter buffers? 2542 2543 RESOLVED: There are no significant functional limitations. Textures 2544 and parameter buffers accessed with an index must be declared as arrays, 2545 so the assembler knows which textures might be accessed this way. 
    Additionally, accessing an array of textures or parameter buffers with an
      out-of-bounds index will yield undefined results.

      In particular, there is no limitation on the values used for indexing --
      they are not required to be true constants and are not required to have
      the same value for all vertices/fragments in a primitive.  However,
      using divergent texture or parameter buffer indices may raise
      performance concerns.  We expect that GPU implementations of this
      extension will run multiple program threads in parallel (SIMD).  If
      different threads in a thread group have different indices, it will be
      necessary to do lookups in more than one texture at once.  This is
      likely to result in some thread serialization.  We expect that indexed
      texture or parameter buffer access where all indices in a thread group
      match will perform identically to non-indexed accesses.

    (2) Which texture instructions support programmable texel offsets, and
        what offset limits apply?

      RESOLVED:  Most texture instructions (TEX, TXB, TXF, TXG, TXL, TXP)
      support both constant texel offsets as provided by NV_gpu_program4 and
      programmable texel offsets.  TXD supports only constant offsets.  TXGO
      does not support non-zero or programmable offsets in the texture portion
      of the instruction, but provides full support for programmable offsets
      via two of the three vector arguments in the regular instruction.

      For example,

        TEX result, coord, texture[0], 2D, (-1,-1);

      uses the NV_gpu_program4 mechanism to apply a constant texel offset of
      (-1,-1) to the texture coordinates.  With programmable offsets, the
      following code applies the same offset.

        TEMP offxy;
        MOV offxy, {-1, -1};
        TEX result, coord, texture[0], offset(offxy);

      Of course, the programmable form allows the offsets to be computed in
      the program and does not require constant values.

      For most texture instructions, the range of allowable offsets is
      [MIN_PROGRAM_TEXEL_OFFSET_EXT, MAX_PROGRAM_TEXEL_OFFSET_EXT] for both
      constant and programmable texel offsets.  Constant offsets can be
      checked when the program is loaded, and out-of-bounds offsets cause the
      program to fail to load.  Programmable offsets can not have a load-time
      range check; out-of-bounds offsets produce undefined results.

      Additionally, the new TXGO instruction has a separate (likely larger)
      allowable offset range, [MIN_PROGRAM_TEXTURE_GATHER_OFFSET_NV,
      MAX_PROGRAM_TEXTURE_GATHER_OFFSET_NV], that applies to the offset
      vectors passed in its second and third operands.

      In the initial implementation of this extension, the range limits are
      [-8,+7] for most instructions and [-32,+31] for TXGO.

    (3) What is TXGO (texture gather with separate offsets) good for?

      RESOLVED:  TXGO allows for efficiently sampling a single-component
      texture with a variety of offsets that need not be contiguous.

      For example, a shadow mapping algorithm using a high-resolution shadow
      map may have pixels whose footprint covers a large number of texels in
      the shadow map.  Such pixels could do a single lookup into a
      lower-resolution texture (using mipmapping), but quality problems will
      arise.  Alternately, a shader could perform a large number of texture
      lookups using either NEAREST or LINEAR filtering from the
      high-resolution texture.
      NEAREST filtering will require a separate lookup for each texel
      accessed; LINEAR filtering may require somewhat fewer lookups, but all
      accesses cover a 2x2 portion of the texture.  The TXG instruction
      added to NV_gpu_program4_1 allows a 2x2 block of texels to be returned
      in a single instruction, in case the program wants to do something
      other than linear filtering with the samples.  The TXGO instruction
      allows a program to do semi-random sampling of the texture without
      requiring that each sample cover a 2x2 block of texels.  For example,
      the TXGO instruction would allow a program to sample the four texels
      A, H, J, and O from the 4x4 block depicted below:

        TXGO result, coord, {-1,+2,0,+1}, {-1,0,+1,+2}, texture[0], 2D;

      The "equivalent" TXG instruction would only sample the four center
      texels F, G, J, and K:

        TXG result, coord, texture[0], 2D;

      All sixteen texels of the footprint could be sampled with four TXG
      instructions,

        TXG result0, coord, texture[0], 2D, (-1,-1);
        TXG result1, coord, texture[0], 2D, (-1,+1);
        TXG result2, coord, texture[0], 2D, (+1,-1);
        TXG result3, coord, texture[0], 2D, (+1,+1);

      but accessing a smaller number of samples spread across the footprint
      with fewer instructions may produce results that are good enough.

      The figure here depicts a texture with texel (0,0) shown in the
      upper-left corner.  If you insist on a lower-left origin, please look
      at this figure while standing on your head.

        (0,0) +-+-+-+-+
              |A|B|C|D|
              +-+-+-+-+
              |E|F|G|H|
              +-+-+-+-+
              |I|J|K|L|
              +-+-+-+-+
              |M|N|O|P|
              +-+-+-+-+ (4,4)

    (4) Why are the results of TXGO (texture gather with separate offsets)
        undefined if the wrap mode is CLAMP or MIRROR_CLAMP_EXT?

      RESOLVED:  The CLAMP and MIRROR_CLAMP_EXT wrap modes are fairly
      different from other wrap modes.  After adding any instruction
      offsets, the spec says to pre-clamp the (u,v) coordinates to
      [0,texture_size] before generating the footprint.  If such clamping
      occurs on one edge for a normal texture filtering operation, the
      footprint ends up being half border texels, half edge texels, and the
      clamping effectively forces the interpolation weights used for texture
      filtering to 50/50.

      We expect the TXG instruction to be used in cases where an application
      may want to do custom filtering, and is in control of its own
      filtering weights.  Coordinate clamping as above will affect the
      footprint used for filtering, but not the weights.  In the
      NV_gpu_program4_1 spec, we defined the TXG/CLAMP combination to simply
      return the "normal" footprint produced after the pre-clamp operation
      above.  Any adjustment of weights due to clamping is the
      responsibility of the application.  We don't expect this to be a
      common operation, because CLAMP_TO_EDGE or CLAMP_TO_BORDER are much
      more sensible wrap modes.

      The hardware implementing TXGO is anticipated to extract all four
      samples in a single pass.  However, the spec language is defined for
      simplicity to perform four separate "gather" operations with the four
      provided offsets, extract a single sample from each, and combine the
      four samples into a vector.  This would require four separate
      pre-clamp operations, which was deemed too costly to implement in
      hardware for a wrap mode that doesn't work well with texture gather
      operations.
      Even if such hardware were built, it still wouldn't obtain a footprint
      resembling the half-border, half-edge footprint for simple TXGO
      offsets -- that would require different per-texel clamping rules for
      the four samples.  We chose to leave the results of this operation
      undefined.

    (5) Should double-precision floating-point support be required or
        optional?  If optional, how?

      RESOLVED:  Double-precision floating-point support will be optional,
      in case low-end GPUs supporting the remainder of this extension's
      instruction set features choose to cut costs by removing the silicon
      necessary to implement 64-bit floating-point arithmetic.

    (6) While this extension supports double-precision computation, how can
        you provide high-precision inputs and outputs to the GPU programs?

      RESOLVED:  The underlying hardware implementing this extension does
      not provide full support for 64-bit floats, even though DOUBLE is a
      standard data type provided by the GL.  For example, when specifying a
      vertex array with a data type of DOUBLE, the vertex attribute
      components will end up being converted to 32-bit floats (FLOAT) by the
      driver before being passed to the hardware, and the extra precision in
      the original 64-bit float values will be lost.

      For vertex attributes, the EXT_vertex_attrib_64bit and
      NV_vertex_attrib_integer_64bit extensions provide the ability to
      specify 64-bit vertex attribute components using the VertexAttribL*
      and VertexAttribLPointer APIs.  Such attributes can be read in a
      vertex program using a "LONG ATTRIB" declaration:

        LONG ATTRIB vector64;

      The LONG modifier can only be used for vertex program inputs; it can
      not be used for inputs of any other program type or for outputs of any
      program type.

      For other cases, this extension provides the PK64 and UP64
      instructions, which allow 64-bit components to be passed using
      consecutive 32-bit components.  For example, a 3-component vector with
      64-bit components can be passed to a vertex program using multiple
      vertex attributes, without using the VertexAttribL APIs, with the
      following code:

        /* Pass the X/Y components in vertex attribute 0 (X/Y/Z/W).  Use
           stride to skip over Z. */
        glVertexAttribPointer(0, 4, GL_FLOAT, GL_FALSE, 3*sizeof(GLdouble),
                              (GLdouble *) buffer);

        /* Pass the Z components in vertex attribute 1 (X/Y).  Use stride to
           skip over the original X/Y components. */
        glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, 3*sizeof(GLdouble),
                              (GLdouble *) buffer + 2);

      In this example, the vertex program would use the PK64 instruction to
      reconstruct the 64-bit value for each component as follows:

        LONG TEMP reconstructed;
        PK64 reconstructed.xy, vertex.attrib[0];
        PK64 reconstructed.z, vertex.attrib[1];

      A similar technique can be used to pass back 64-bit values computed by
      a GPU program, using transform feedback or writes to a color buffer.
      The UP64 instruction would be used to convert the 64-bit computed
      value into two 32-bit values, which would be written to adjacent
      components.

      Note also that the original hardware implementation of this extension
      does not support interpolation of 64-bit floating-point values.
      If an application desires to pass a 64-bit floating-point value from a
      vertex or geometry program to a fragment program, and doesn't require
      interpolation, the PK64/UP64 techniques can be combined.  For example,
      the vertex program could unpack a 3-component vector with 64-bit
      components into a four-component and a two-component 32-bit vector:

        LONG TEMP result64;
        RESULT result32[2] = { result.attrib[0..1] };
        UP64 result32[0], result64.xyxy;
        UP64 result32[1].xy, result64.z;

      The fragment program would read and reconstruct using PK64:

        LONG TEMP input64;
        FLAT ATTRIB input32[2] = { fragment.attrib[0..1] };
        PK64 input64.xy, input32[0];
        PK64 input64.z, input32[1];

      Note that such inputs must be declared as "FLAT" in the fragment
      program to prevent the hardware from trying to do floating-point
      interpolation on the separate 32-bit halves of the value being passed.
      Such interpolation would produce complete garbage.

    (7) What are instanced geometry programs useful for?

      RESOLVED:  Instanced geometry programs allow geometry programs that
      perform regular operations to run more efficiently.

      Consider a simple example of an algorithm that uses geometry programs
      to render primitives to a cube map in a single pass.  Without
      instanced geometry programs, the geometry program to render triangles
      to the cube map would do something like:

        for (face = 0; face < 6; face++) {
          for (vertex = 0; vertex < 3; vertex++) {
            project vertex <vertex> onto face <face>, output position
            compute/copy attributes of emitted <vertex> to outputs
            output <face> to result.layer
            emit the projected vertex
          }
          end the primitive (next triangle)
        }

      This algorithm would output 18 vertices per input triangle, three for
      each cube face.  The six triangles emitted would be rasterized, one
      per face.  Geometry programs that emit a large number of attributes
      have often posed performance challenges, since all the attributes must
      be stored somewhere until the emitted primitives are processed.  Large
      storage requirements may limit the number of threads that can be run
      in parallel and reduce overall performance.

      Instanced geometry programs allow this example to be restructured to
      run with six separate threads, one per face.  Each thread projects the
      triangle onto only a single face (identified by the invocation number)
      and emits only 3 vertices.  The reduced storage requirements allow
      more geometry program threads to be run in parallel, with greater
      overall efficiency.

      Additionally, the total number of attribute components that can be
      emitted by a single geometry program invocation is limited.  However,
      for instanced geometry programs, that limit applies to each of the <N>
      program invocations, which allows for a larger total output.  For
      example, if the GL implementation supports only 1024 components of
      output per program invocation, the 18-vertex algorithm above could
      emit no more than 56 components per vertex.  The same algorithm
      implemented as a 3-vertex, 6-invocation geometry program could
      theoretically allow for 341 components per vertex.
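
      As a rough sketch of how the cube map example might be restructured as
      an instanced geometry program, the outline below assumes an
      "INVOCATIONS" declaration and a "primitive.invocation" binding for the
      invocation number; both names should be checked against the main body
      of this specification and NV_geometry_program4, and the per-vertex
      projection code is elided.

        !!NVgp5.0
        # Sketch only:  the INVOCATIONS declaration and the
        # primitive.invocation binding are assumptions; see the main body
        # for the authoritative declarations and bindings.
        PRIMITIVE_IN TRIANGLES;
        PRIMITIVE_OUT TRIANGLE_STRIP;
        VERTICES_OUT 3;
        INVOCATIONS 6;                 # six instances per input triangle
        INT TEMP face;
        MOV.S face.x, primitive.invocation.x;
        # for each of the three input vertices:
        #   project the vertex onto face <face.x>, write result.position
        #   MOV result.layer.x, face.x;
        #   EMIT;
        # ENDPRIM;
        END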

    (8) What are the special interpolation opcodes (IPAC, IPAO, IPAS) good
        for, and how do they work?

      RESOLVED:  The interpolation opcodes allow programs to control the
      frequency and location at which fragment inputs are sampled.  Some
      control over interpolation was provided in previous extensions, but
      that support was more limited.  NV_gpu_program4 had an interpolation
      modifier (CENTROID) that allowed attributes to be sampled inside the
      primitive, but that was a per-attribute modifier -- you could only
      sample any given attribute at one location.  NV_gpu_program4_1 added a
      new interpolation modifier (SAMPLE) that directed that fragment
      programs be run once per sample, and that the specified attributes be
      interpolated at the sample location.  Per-sample interpolation can
      produce higher quality, but the performance cost is significant since
      more fragment program invocations are required.

      This extension provides additional control over interpolation, and
      allows programs to interpolate attributes at different locations
      without necessarily requiring the performance hit of per-sample
      invocation.

      The IPAC instruction allows an attribute to be sampled at the centroid
      location, while still allowing the same attribute to be sampled
      elsewhere.  The IPAS instruction allows the attribute to be sampled at
      a numbered sample location, as per-sample interpolation would do.
      Multiple IPAS instructions with different sample numbers allow a
      program to sample an attribute at multiple sample points in the pixel
      and then combine the samples in a programmable manner, which may allow
      for higher quality than simply interpolating at a single
      representative point in the pixel.  The IPAO instruction allows the
      attribute to be sampled at an arbitrary (x,y) offset relative to the
      pixel center.  The range of supported (x,y) values is limited, and the
      limits in the initial implementation are not large enough to permit
      sampling the attribute outside the pixel.

      Note that previous instruction sets allowed shaders to fake IPAC,
      IPAS, and IPAO with a sequence such as:

        TEMP ddx, ddy, offset, interp;
        MOV interp, fragment.attrib[0];     # start with center
        DDX ddx, fragment.attrib[0];
        MAD interp, offset.x, ddx, interp;  # add offset.x * dA/dx
        DDY ddy, fragment.attrib[0];
        MAD interp, offset.y, ddy, interp;  # add offset.y * dA/dy

      However, this method does not apply perspective correction.  The
      quality of the results may be unacceptable, particularly for
      primitives that are nearly perpendicular to the screen.

      The semantics of the first operand of these instructions are different
      from those of normal assembly instructions.  Operands are normally
      evaluated by loading the value of the corresponding variable and
      applying any swizzle/negation/absolute value modifier before the
      instruction is executed.  In the IPAC/IPAO/IPAS instructions, the
      value of the attribute is evaluated by the instruction itself.
      Swizzles, negation, and absolute value modifiers are still allowed,
      and are applied after the attribute values are interpolated.
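
      For illustration, a fragment program might combine these opcodes along
      the following lines.  The operand forms shown (an attribute operand,
      plus an offset vector for IPAO or a sample number for IPAS) and the
      register names are assumptions for this sketch; the authoritative
      operand descriptions are in the main body.

        # Sketch only; operand forms and names are assumptions.
        TEMP offset, atCentroid, atOffset, atSample3;
        MOV offset, { 0.25, -0.25, 0, 0 };
        IPAC atCentroid, fragment.attrib[0];        # at the centroid
        IPAO atOffset, fragment.attrib[0], offset;  # at center + offset.xy
        IPAS atSample3, fragment.attrib[0], 3;      # at sample number 3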

    (9) When using a program that issues global stores (via the STORE
        instruction), what amount of execution ordering is guaranteed?  How
        can an application ensure that writes executed in a shader have
        completed and will be visible to other operations using the buffer
        object in question?

      RESOLVED:  There are very few automatic guarantees for potential
      write/read or write/write conflicts.  Program invocations will
      generally run in arbitrary order, and applications can't rely on
      read/write order to match primitive order.

      To get consistent results when buffers are read and written using
      multiple pipeline stages, manual synchronization using the
      MemoryBarrierEXT() API documented in EXT_shader_image_load_store or
      some other synchronization primitive is necessary.

    (10) Unlike most other shader features, the STORE opcode allows for
         externally-visible side effects from executing a program.  How does
         this capability interact with other features of the GL?

      RESOLVED:  First, some GL implementations support a variety of "early
      Z" optimizations designed to minimize unnecessary fragment processing
      work, such as executing an expensive fragment program on a fragment
      that will eventually fail the depth test.  Such optimizations have
      been valid because fragment programs had no side effects.  That is no
      longer the case, and such optimizations may not be employed if the
      fragment program performs a global store.  However, we provide a new
      "early depth and stencil test" enable that allows applications to
      deterministically control depth and stencil testing.  If enabled,
      depth and stencil testing is always performed prior to fragment
      program execution.  Fragment programs will never be run on fragments
      that fail any of these tests.

      Second, we are permitting global stores in all program types; however,
      the number of program invocations is not well-defined for some program
      types.  For example, a GL implementation may choose to combine
      multiple instances of identical vertices (e.g., duplicate indices in
      DrawElements, immediate-mode vertices with identical data) into one
      single vertex program invocation, or it may run a vertex program on
      each separately.  Similarly, the tessellation primitive generator will
      generate independent primitives with duplicated vertices, which may or
      may not be combined for tessellation evaluation program execution.
      Fragment program execution also has several issues, described in more
      detail below.

    (11) What issues arise when running fragment programs doing global
         stores?

      RESOLVED:  The order of per-fragment operations in the existing OpenGL
      3.0 specification can be fairly loose, because previously-defined
      fragment programs, shaders, and fixed-function fragment processing had
      no side effects.  With side effects, the order of operations must be
      defined more tightly.  In particular, the pixel ownership and scissor
      tests are specified to be performed prior to fragment program
      execution, and we provide an option to perform depth and stencil tests
      early as well.

      OpenGL implementations sometimes run fragment programs on "helper"
      pixels that have no coverage in order to be able to compute sane
      partial derivatives for fragment program instructions (DDX, DDY) or
      automatic level-of-detail calculation for texturing.  In this
      approach, derivatives are approximated by computing the difference in
      a quantity computed for a given fragment at (x,y) and a fragment at a
      neighboring pixel.  When a fragment program is executed on a "helper"
      pixel, global stores have no effect.
      Helper pixels aren't explicitly mentioned in the spec body; instead,
      partial derivatives are obtained by magic.

      If a fragment program contains a KIL instruction, compilers may not
      reorder code such that an ATOM or STORE instruction is executed before
      a KIL instruction that logically precedes it in flow control.  Once a
      fragment is killed, subsequent atomics or stores should never be
      executed.

      Multisample rasterization poses several issues for fragment programs
      with global stores.  The number of times a fragment program is
      executed for multisample rendering is not fully specified, which gives
      implementations a number of different choices -- pure multisample
      (only runs once), pure supersample (runs once per covered sample), or
      modes in between.  There are some ways for an application to
      indirectly control the behavior -- for example, fragment programs
      specifying per-sample attribute interpolation are guaranteed to run
      once per covered sample.

      Note that when rendering to a multisample buffer, a pair of adjacent
      triangles may cause a fragment program to be executed more than once
      at a given (x,y) with different sets of samples covered.  This can
      also occur in the interior of a quadrilateral or polygon primitive.
      Implementations are permitted to split quads and polygons with >3
      vertices into triangles, creating interior edges that split a pixel.

    (12) What happens if early fragment tests are enabled, the early depth
         test passes, and a fragment program that computes a new depth value
         is executed?

      RESOLVED:  The depth value produced by the fragment program has no
      effect if early fragment tests are enabled.  The depth value computed
      by a fragment program is used only by the post-fragment-program
      stencil and depth tests, and those tests have no effect when early
      depth testing is enabled.

    (13) How do early fragment tests interact with occlusion queries?

      RESOLVED:  When early fragment tests are enabled, sample counting for
      occlusion queries also happens prior to fragment program execution.
      Enabling early fragment tests can change the overall sample count,
      because samples killed by alpha test and alpha-to-coverage will still
      be counted if early fragment tests are enabled.

    (14) What happens if a program performs a global store to a GPU address
         corresponding to a read-only buffer mapping?  What if it performs a
         global load from a write-only mapping?

      RESOLVED:  Implementations may choose to implement full memory
      protection, in which case accesses using the wrong type of memory
      mapping will fault and lead to termination of the application.

      However, full memory protection is not required in this extension --
      implementations may choose to substitute a read-write mapping in place
      of a read-only or write-only mapping.  As a result, we specify the
      result of such invalid loads and stores to be undefined.

      Note that if a program erroneously writes to nominally read-only
      mappings, the results may be weird.  If the implementation substitutes
      a read-write mapping, such invalid writes are likely to proceed
      normally.
      However, if the application later makes a buffer object non-resident
      and the memory manager of the GL implementation needs to move the
      buffer, the GL may assume that the contents of the buffer have not
      been modified and thus discard the new values written by the (invalid)
      global store instructions.

    (15) What performance considerations apply to atomics?

      RESOLVED:  Atomics can be useful for operations like locking, or for
      maintaining counters.  Note that high-performance GPUs may have
      hundreds of program threads in flight at once, and may also have some
      SIMD characteristics (where threads are grouped and run as a unit).
      Using ATOM instructions with a single memory address to implement a
      critical section will result in serial execution -- only one of the
      hundreds of threads can execute code in the critical section at a
      time.

      When a global operation would be done under a lock, it may be possible
      to improve performance if the algorithm can be parallelized to have
      multiple critical sections.  For example, an application could
      allocate an array of shared resources, each protected by its own lock,
      and use the LSBs of the primitive ID or some function of the
      screen-space (x,y) to determine which resource in the array to use.

    (16) The atomic instruction ATOM returns the old contents of memory into
         the result register.  Should we provide a version of this opcode
         that doesn't return a value?

      RESOLVED:  No.  In theory, atomics that don't return any values can
      perform better (because the program may not need to allocate resources
      to hold a result or wait for the result).  However, a new opcode isn't
      required to obtain this behavior -- a compiler can recognize that the
      result of an ATOM instruction is written to a "dummy" temporary that
      isn't read by subsequent instructions:

        TEMP junk;
        ATOM.ADD.U32 junk, address, 1;

      The compiler can also recognize that the result will always be
      discarded if a conditional write mask of "(FL)" is used:

        ATOM.ADD.U32 not_junk (FL), address, 1;

    (17) How do we ensure that memory accesses made by multiple program
         invocations of possibly different types are coherent?

      RESOLVED:  Atomic instructions allow program invocations to coordinate
      using shared global memory addresses.  However, memory transactions,
      including atomics, are not guaranteed to land in the order specified
      in the program; they may be reordered by the compiler, cached in
      different memory hierarchies, and stored in a distributed memory
      system where later stores to one "partition" might be completed prior
      to earlier stores to another.  The MEMBAR instruction helps control
      memory transaction ordering by ensuring that all memory transactions
      prior to the barrier complete before any after the barrier.
      Additionally, the ".COH" modifier ensures that memory transactions
      using the modifier are cached coherently and will be visible to other
      shader invocations.
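
      As a rough illustration of how MEMBAR and the ".COH" modifier might be
      combined, a producing invocation could publish a result along the
      following lines.  This is a sketch only:  the address registers are
      placeholders, and the STORE operand order and the placement of the
      ".COH" and data type modifiers are assumptions to be checked against
      the instruction descriptions in the main body.

        # Sketch only; operand order and modifier placement are assumptions.
        STORE.COH.U32 payload.x, dataAddress;  # write the result coherently
        MEMBAR;                                # prior stores complete first
        TEMP old;
        ATOM.ADD.U32 old, counterAddress, 1;   # then bump a "ready" counter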

    (18) How do the TXG and TXGO opcodes work with sRGB textures?

      RESOLVED:  Gamma-correction is applied to the texture source color
      before "gathering" and hence applies to all four components, unless
      the texture swizzle of the selected component is ALPHA, in which case
      no gamma-correction is applied.

    (19) How can render-to-texture algorithms take advantage of
         MemoryBarrierEXT, nominally provided for global memory
         transactions?

      RESOLVED:  Many algorithms use RTT to ping-pong between two
      allocations, using the result of one rendering pass as the input to
      the next.  Existing mechanisms require expensive FBO binds, DrawBuffer
      changes, or FBO attachment changes to safely swap the render target
      and texture.  With memory barriers, layered geometry shader rendering,
      and texture arrays, an application can very cheaply ping-pong between
      two layers of a single texture, i.e.:

        X = 0;
        // Bind the array texture to a texture unit
        // Attach the array texture to an FBO using FramebufferTextureARB
        while (!done) {
          // Stuff X in a constant, vertex attrib, etc.
          Draw -
            Texturing from layer X;
            Writing gl_Layer = 1 - X in the geometry shader;

          MemoryBarrierEXT(TEXTURE_FETCH_BARRIER_BIT_EXT);
          X = 1 - X;
        }

      However, be warned that this requires geometry shaders and hence adds
      the overhead that all geometry must pass through an additional program
      stage, so an application using large amounts of geometry could become
      geometry-limited or more shader-limited.

    (20) What is the ".PREC" instruction modifier good for?

      RESOLVED:  ".PREC" provides some invariance guarantees that are useful
      for certain algorithms.  Using ".PREC", it is possible to ensure that
      an algorithm can be written to produce identical results on subtly
      different inputs.  For example, the order of vertices visible to a
      geometry or tessellation shader used to subdivide primitive edges
      might present an edge shared between two primitives in one direction
      for one primitive and in the other direction for the adjacent
      primitive.  Even if the weights are identical in the two cases, there
      may be cracking if the computations are done in an order-dependent
      manner.  If the position of a new vertex were evaluated with the code
      below using limited-precision floating-point math, it's not
      necessarily the case that we would get the same result for inputs
      (a,b,c) and (c,b,a):

        ADD result, a, b;
        ADD result, result, c;

      There are two problems with this code:  the rounding errors will be
      different, and the implementation is free to rearrange the computation
      order.  The code can be rewritten as follows with ".PREC" and a
      symmetric evaluation order to ensure a precise result with the inputs
      reversed:

        ADD result, a, c;
        ADD.PREC result, result, b;

      Note that in this example, the first instruction doesn't need the
      ".PREC" qualifier because the second instruction requires that the
      implementation compute <a>+<c>, which will be done reliably if <a> and
      <c> are inputs.  If <a> and <c> were results of other computations,
      the first add and possibly the dependent computations may also need to
      be tagged with ".PREC" to ensure reliable results.

      The ".PREC" modifier will disable certain optimizations and thus
      carries a performance cost.

    (21) What are the TGALL, TGANY, and TGEQ instructions good for?

      RESOLVED:  If an implementation performs SIMD thread execution,
      divergent branching may result in reduced performance if the "if" and
      "else" blocks of an "if" statement are executed sequentially.
      For example, an algorithm may have both a "fast path" that performs a
      computation quickly for a subset of all cases and a "slow path" that
      handles all cases correctly, but more slowly.  When performing SIMD
      execution, code like the following:

        SNE.S.CC cc.x, condition.x;
        IF NE.x;
          # do fast path
        ELSE;
          # do slow path
        ENDIF;

      may end up executing *both* the fast and slow paths for a SIMD thread
      group if <condition> diverges, and may execute more slowly than simply
      executing the slow path unconditionally.  These instructions allow
      code like:

        # Condition code matches NE if and only if condition.x is non-zero
        # for all threads.
        TGALL.S.CC cc.x, condition.x;
        IF NE.x;
          # do fast path
        ELSE;
          # do slow path
        ENDIF;

      that executes the fast path if and only if it can be used for *all*
      threads in the group.  For thread groups where <condition> diverges,
      this algorithm would unconditionally run the slow path, but would
      never run both in sequence.


Revision History

    Rev.    Date    Author    Changes
    ----  --------  --------  ----------------------------------------------
     7    09/11/14  pbrown    Minor typo fixes.

     6    07/04/13  pbrown    Add missing language describing the
                              <texImageUnitComp> grammar rule for component
                              selection in TXG and TXGO instructions.

     5    09/23/10  pbrown    Add missing constants for {MIN,MAX}_PROGRAM_
                              TEXTURE_GATHER_OFFSET_NV (same as ARB/core).
                              Add missing description for "su" in the opcode
                              table; fix a couple of operand order bugs for
                              STORE.

     4    06/22/10  pbrown    Specify that the y/z/w components of the ATOM
                              results are undefined, as is the case with
                              ATOMIM from EXT_shader_image_load_store.

     3    04/13/10  pbrown    Remove F32 support from ATOM.ADD.

     2    03/22/10  pbrown    Various wording updates to the spec overview,
                              dependencies, issues, and body.  Remove various
                              spec language that has been refactored into the
                              EXT_shader_image_load_store specification.

     1              pbrown    Internal revisions.