Name

    NV_shader_thread_group

Name Strings

    GL_NV_shader_thread_group

Contributors

    Jeannot Breton, NVIDIA
    Pat Brown, NVIDIA
    Eric Werness, NVIDIA
    Mark Kilgard, NVIDIA

Contact

    Jeannot Breton, NVIDIA Corporation (jbreton 'at' nvidia.com)

Status

    Shipping.

Version

    Last Modified Date:     7/21/2015
    NVIDIA Revision:        4

Number

    OpenGL Extension #447

Dependencies

    This extension is written against the OpenGL 4.3 (Compatibility Profile)
    Specification.

    This extension is written against version 4.30 (revision 07) of the OpenGL
    Shading Language Specification.

    OpenGL 4.3 and GLSL 4.3 are required.

    This extension interacts with NV_gpu_program5

    This extension interacts with NV_compute_program5

    This extension interacts with NV_tessellation_program5

Overview

    Implementations of the OpenGL Shading Language may, but are not required
    to, run multiple shader threads for a single stage as a SIMD thread group,
    where individual execution threads are assigned to thread groups in an
    undefined, implementation-dependent order. This extension provides a set
    of new features to the OpenGL Shading Language to query thread states and
    to share data between fragments within a 2x2 pixel quad.

    More specifically the following functionalities were added:

    * New uniform variables and tokens to query the number of threads in a
      warp, the number of warps running on a SM and the number of SMs on the
      GPU.

    * New shader inputs to query the thread id, the warp id and the SM id.

    * New shader inputs to query if a fragment shader thread is a helper
      thread.

    * New shader built-in functions to query the state of a Boolean condition
      over all threads in a thread group.

    * New shader built-in functions to query which threads are active within
      a thread group.

    * New fragment shader built-in functions to share data between fragments
      within a 2x2 pixel quad.

    Shaders using the new functionalities provided by this extension should
    enable this functionality via the construct

      #extension GL_NV_shader_thread_group : require     (or enable)

    This extension also specifies some modifications to the program assembly
    language to support the thread state query and thread data sharing
    functionalities.

    Note that in this extension specification warp and thread group have the
    same meaning. A warp is a group of threads that get executed in lockstep.
    Each thread in a warp executes the same instruction of a program, but on
    different data.

New Procedures and Functions

    None


New Tokens

    Accepted by the <pname> parameter of GetBooleanv, GetIntegerv,
    GetFloatv, and GetDoublev:

        WARP_SIZE_NV                                    0x9339
        WARPS_PER_SM_NV                                 0x933A
        SM_COUNT_NV                                     0x933B


Modifications to The OpenGL Shading Language Specification, Version 4.30
(Revision 07)

    Including the following line in a shader can be used to control the
    language features described in this extension:

      #extension GL_NV_shader_thread_group : <behavior>

    where <behavior> is as specified in section 3.3.

    New preprocessor #defines are added to the OpenGL Shading Language:

      #define GL_NV_shader_thread_group    1

    Modify Section 7.1, Built-In Language Variables, p. 110

    (Add to the list of built-in variables for the compute, vertex, geometry,
    tessellation control, tessellation evaluation and fragment languages)

        in uint  gl_ThreadInWarpNV;
        in uint  gl_ThreadEqMaskNV;
        in uint  gl_ThreadGeMaskNV;
        in uint  gl_ThreadGtMaskNV;
        in uint  gl_ThreadLeMaskNV;
        in uint  gl_ThreadLtMaskNV;
        in uint  gl_WarpIDNV;
        in uint  gl_SMIDNV;

    (Add to the list of built-in variables for the fragment language)

        in bool  gl_HelperThreadNV;

    (Add these paragraphs at the end of this section)

    The variable gl_ThreadInWarpNV holds the id of the thread within the
    thread group (or warp). This variable is in the range 0 to
    gl_WarpSizeNV-1, where gl_WarpSizeNV is the total number of threads in a
    warp.

    The variable gl_ThreadEqMaskNV is a bitfield in which the bit equal to the
    current thread id is set. The variable gl_ThreadGeMaskNV is a bitfield in
    which bits greater than or equal to the current thread id are set. The
    variable gl_ThreadGtMaskNV is a bitfield in which bits greater than the
    current thread id are set. The variable gl_ThreadLeMaskNV is a bitfield in
    which bits lower than or equal to the current thread id are set. The
    variable gl_ThreadLtMaskNV is a bitfield in which bits lower than the
    current thread id are set.

    The values of gl_ThreadEqMaskNV, gl_ThreadGeMaskNV, gl_ThreadGtMaskNV,
    gl_ThreadLeMaskNV and gl_ThreadLtMaskNV are derived from the value of
    gl_ThreadInWarpNV using simple bit-shift arithmetic; they don't take into
    account the value of the thread group active mask. For example, if the
    application wants a bitfield in which bits lower than or equal to the
    current thread id are set only for active threads, the result of
    gl_ThreadLeMaskNV will need to be ANDed with the thread group active mask.

    The variable gl_WarpIDNV holds the warp id of the executing thread.
    This variable is in the range 0 to gl_WarpsPerSMNV-1, where
    gl_WarpsPerSMNV is the maximum number of warps executing on a SM.

    The variable gl_SMIDNV holds the SM id of the executing thread. This
    variable is in the range 0 to gl_SMCountNV-1, where gl_SMCountNV is the
    number of SMs on the GPU.

    The variable gl_HelperThreadNV specifies if the current thread is a helper
    thread. In implementations supporting this extension, fragment shader
    invocations may be arranged in SIMD thread groups of 2x2 fragments called
    "quads". When a fragment shader instruction is executed on a quad, it's
    possible that some fragments within the quad will execute the instruction
    even if they are not covered by the primitive. Those threads are called
    helper threads. Their outputs will be discarded and they will not execute
    global store functions, but the intermediate values they compute can still
    be used by thread group sharing functions or by fragment derivative
    functions like dFdx and dFdy.


    Modify Section 7.4, Built-In Uniform State, p. 125

    (Add to the list of built-in uniform variable declarations)

        uniform uint  gl_WarpSizeNV;
        uniform uint  gl_WarpsPerSMNV;
        uniform uint  gl_SMCountNV;

    (Add this paragraph at the end of this section)

    The variable gl_WarpSizeNV is the total number of threads in a warp. The
    variable gl_WarpsPerSMNV is the maximum number of warps executing on a SM.
    The variable gl_SMCountNV is the number of SMs on the GPU.


    Modify Section 8.3, Common Functions, p. 133

    (add a function to query which threads are active within a thread group)

    Syntax:

        uint activeThreadsNV(void)

    In the value returned by activeThreadsNV(), bit <N> is set to 1 if the
    corresponding thread in the SIMD thread group is executing the call to
    activeThreadsNV() and 0 otherwise.
    A bit in the return value may be set to zero due to conditional flow
    control (e.g., returning from a function, executing the "else" part of an
    "if" statement) or because the SIMD thread group was dispatched without a
    full collection of threads.

    (add a function to query the state of a Boolean condition over all the
    threads in a thread group)

    Syntax:

        uint ballotThreadNV(bool value)

    The function ballotThreadNV() computes a 32-bit bitfield. It looks at the
    condition <value> for each active thread of a thread group and sets to 1
    each bit for which the condition in the corresponding thread is true. Bits
    for threads with a false condition are set to 0. Bits for inactive threads
    are also set to 0. It's possible to query the active thread mask by
    calling the function activeThreadsNV.

    (add a function to share data between fragments in a quad)

    Syntax:

        float quadSwizzle0NV(float swizzledValue, [float unswizzledValue])
        vec2  quadSwizzle0NV(vec2  swizzledValue, [vec2  unswizzledValue])
        vec3  quadSwizzle0NV(vec3  swizzledValue, [vec3  unswizzledValue])
        vec4  quadSwizzle0NV(vec4  swizzledValue, [vec4  unswizzledValue])

        float quadSwizzle1NV(float swizzledValue, [float unswizzledValue])
        vec2  quadSwizzle1NV(vec2  swizzledValue, [vec2  unswizzledValue])
        vec3  quadSwizzle1NV(vec3  swizzledValue, [vec3  unswizzledValue])
        vec4  quadSwizzle1NV(vec4  swizzledValue, [vec4  unswizzledValue])

        float quadSwizzle2NV(float swizzledValue, [float unswizzledValue])
        vec2  quadSwizzle2NV(vec2  swizzledValue, [vec2  unswizzledValue])
        vec3  quadSwizzle2NV(vec3  swizzledValue, [vec3  unswizzledValue])
        vec4  quadSwizzle2NV(vec4  swizzledValue, [vec4  unswizzledValue])

        float quadSwizzle3NV(float swizzledValue, [float unswizzledValue])
        vec2  quadSwizzle3NV(vec2  swizzledValue, [vec2  unswizzledValue])
        vec3  quadSwizzle3NV(vec3  swizzledValue, [vec3  unswizzledValue])
        vec4  quadSwizzle3NV(vec4  swizzledValue, [vec4  unswizzledValue])

        float quadSwizzleXNV(float swizzledValue, [float unswizzledValue])
        vec2  quadSwizzleXNV(vec2  swizzledValue, [vec2  unswizzledValue])
        vec3  quadSwizzleXNV(vec3  swizzledValue, [vec3  unswizzledValue])
        vec4  quadSwizzleXNV(vec4  swizzledValue, [vec4  unswizzledValue])

        float quadSwizzleYNV(float swizzledValue, [float unswizzledValue])
        vec2  quadSwizzleYNV(vec2  swizzledValue, [vec2  unswizzledValue])
        vec3  quadSwizzleYNV(vec3  swizzledValue, [vec3  unswizzledValue])
        vec4  quadSwizzleYNV(vec4  swizzledValue, [vec4  unswizzledValue])

    In implementations supporting this extension, if a primitive covers a
    fragment at (x,y), its fragment shader invocation will be arranged in a
    SIMD thread group with fragment shader invocations corresponding to three
    neighboring pixels. These four invocations are arranged in a 2x2 grid,
    called a "quad". If the neighbors of a fragment are not covered by the
    primitive, fragment shader invocations will still be generated. The
    implementation may compute differences between values in these threads to
    estimate derivatives for dFdx(), dFdy(), and for texture lookups with
    automatic LOD calculations.

    Fragments may have different locations in the quads based on the type of
    render target.

    When rendering to a window, fragments within a quad follow this pattern:

       ---------------------------------------------------
       | gl_ThreadInWarpNV 4N+0 | gl_ThreadInWarpNV 4N+1 |
       |    pixel (X+0,Y+1)     |    pixel (X+1,Y+1)     |
       ---------------------------------------------------
       | gl_ThreadInWarpNV 4N+2 | gl_ThreadInWarpNV 4N+3 |
       |    pixel (X+0,Y+0)     |    pixel (X+1,Y+0)     |
       ---------------------------------------------------


    When rendering to a framebuffer object, fragments within a quad follow this
    pattern:

       ---------------------------------------------------
       | gl_ThreadInWarpNV 4N+2 | gl_ThreadInWarpNV 4N+3 |
       |    pixel (X+0,Y+1)     |    pixel (X+1,Y+1)     |
       ---------------------------------------------------
       | gl_ThreadInWarpNV 4N+0 | gl_ThreadInWarpNV 4N+1 |
       |    pixel (X+0,Y+0)     |    pixel (X+1,Y+0)     |
       ---------------------------------------------------

    There are 6 quadSwizzle functions that allow fragments within a quad to
    exchange data. All those functions will read a floating point
    operand <swizzledValue>, which can come from any fragment in the quad.
    Another optional floating point operand <unswizzledValue>, which comes from
    the current fragment, can be added to <swizzledValue>. The only difference
    between all those quadSwizzle functions is the location where they get the
    <swizzledValue> operand within the 2x2 pixel quad.
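
    The source selection performed by the six quadSwizzle functions can be
    modelled on the CPU. The C sketch below is only an illustration of the
    rules stated in this section (one fixed source fragment for the 0 to 3
    variants, the X or Y neighbor for the other two, and a result of 0 for
    every fragment when the quad is divergent); it is not shader code. The
    names quad_swizzle, source_lane, active_mask and the QS_* enumerants are
    hypothetical and are not defined by this extension.

```c
/* CPU-side model of one 2x2 quad. Lanes are numbered 0..3, matching
   gl_ThreadInWarpNV modulo 4. All names are illustrative assumptions. */
enum quad_swizzle { QS_0, QS_1, QS_2, QS_3, QS_X, QS_Y };

/* Lane that supplies <swizzledValue> to lane `lane`. */
static int source_lane(enum quad_swizzle mode, int lane)
{
    switch (mode) {
    case QS_X: return lane ^ 1;   /* neighbor in X: 0<->1, 2<->3 */
    case QS_Y: return lane ^ 2;   /* neighbor in Y: 0<->2, 1<->3 */
    default:   return (int)mode;  /* QS_0..QS_3: fixed source lane */
    }
}

/* active_mask has bit i set when lane i is active. If any lane is
   inactive the quad is divergent and every lane receives 0. */
static void quad_swizzle(enum quad_swizzle mode, unsigned active_mask,
                         const float swizzled[4], const float unswizzled[4],
                         float result[4])
{
    int divergent = (active_mask & 0xFu) != 0xFu;

    for (int lane = 0; lane < 4; ++lane) {
        if (divergent)
            result[lane] = 0.0f;
        else
            result[lane] = swizzled[source_lane(mode, lane)]
                         + unswizzled[lane];
    }
}
```

    With swizzled values {10, 20, 30, 40}, unswizzled values {1, 2, 3, 4} and
    all four lanes active, the QS_X mode of this model yields
    {21, 12, 43, 34} and the QS_Y mode yields {31, 42, 13, 24}.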

    quadSwizzle0NV will read the <swizzledValue> operand from fragment 0:

        result[thread N] = swizzledValue[thread 0] + unswizzledValue[thread N]


    quadSwizzle1NV will read the <swizzledValue> operand from fragment 1:

        result[thread N] = swizzledValue[thread 1] + unswizzledValue[thread N]


    quadSwizzle2NV will read the <swizzledValue> operand from fragment 2:

        result[thread N] = swizzledValue[thread 2] + unswizzledValue[thread N]


    quadSwizzle3NV will read the <swizzledValue> operand from fragment 3:

        result[thread N] = swizzledValue[thread 3] + unswizzledValue[thread N]


    quadSwizzleXNV will read the <swizzledValue> operand for each fragment
    from its neighbor in X:

        result[thread 0] = swizzledValue[thread 1] + unswizzledValue[thread 0]
        result[thread 1] = swizzledValue[thread 0] + unswizzledValue[thread 1]
        result[thread 2] = swizzledValue[thread 3] + unswizzledValue[thread 2]
        result[thread 3] = swizzledValue[thread 2] + unswizzledValue[thread 3]


    quadSwizzleYNV will read the <swizzledValue> operand for each fragment
    from its neighbor in Y:

        result[thread 0] = swizzledValue[thread 2] + unswizzledValue[thread 0]
        result[thread 1] = swizzledValue[thread 3] + unswizzledValue[thread 1]
        result[thread 2] = swizzledValue[thread 0] + unswizzledValue[thread 2]
        result[thread 3] = swizzledValue[thread 1] + unswizzledValue[thread 3]


    If any thread in a 2x2 pixel quad is inactive, the quad is divergent. In
    this case the quadSwizzle functions will return 0 for all fragments in the
    quad.


Dependencies on NV_gpu_program5

    If NV_gpu_program5 is supported and "OPTION NV_shader_thread_group" is
    specified in an assembly program, the following edits are made to extend
    the assembly programming model documented in the NV_gpu_program4 extension
    and extended by NV_gpu_program5.

    If NV_gpu_program5 is not supported, or if "OPTION NV_shader_thread_group"
    is not specified in an assembly program, the contents of this dependencies
    section should be ignored.

    Modify Section 2.X.2, Program Grammar

    (add the following rules to the NV_gpu_program4 and
    NV_gpu_program5 base grammars)

    <VECTORop>              ::= "TGBALLOT"

    <stateSingleItem>       ::= "state" "." <stateThreadItem>

    <stateThreadItem>       ::= "thread" "." <stateThreadProperty>

    <stateThreadProperty>   ::= "warpsize"
                              | "warpspersm"
                              | "smcount"

    (add/change the following rules to the NV_fragment_program4 and
    NV_gpu_program5 base grammars)

    <VECTORop>              ::= "QSWZ0"
                              | "QSWZ1"
                              | "QSWZ2"
                              | "QSWZ3"
                              | "QSWZX"
                              | "QSWZY"

    <attribBasic>           ::= <fragPrefix> "threadid"
                              | <fragPrefix> "threadeqmask"
                              | <fragPrefix> "threadltmask"
                              | <fragPrefix> "threadlemask"
                              | <fragPrefix> "threadgtmask"
                              | <fragPrefix> "threadgemask"
                              | <fragPrefix> "warpid"
                              | <fragPrefix> "smid"
                              | <fragPrefix> "helperthread"

    (add/change the following rules to the NV_vertex_program4 and
    NV_gpu_program5 base grammars)

    <attribBasic>           ::= <vtxPrefix> "threadid"
                              | <vtxPrefix> "threadeqmask"
                              | <vtxPrefix> "threadltmask"
                              | <vtxPrefix> "threadlemask"
                              | <vtxPrefix> "threadgtmask"
                              | <vtxPrefix> "threadgemask"
                              | <vtxPrefix> "warpid"
                              | <vtxPrefix> "smid"

    (add/change the following rules to the NV_geometry_program4 and
    NV_gpu_program5 base grammars)

    <attribBasic>           ::= <primPrefix> "threadid"
                              | <primPrefix> "threadeqmask"
                              | <primPrefix> "threadltmask"
                              | <primPrefix> "threadlemask"
                              | <primPrefix> "threadgtmask"
                              | <primPrefix> "threadgemask"
                              | <primPrefix> "warpid"
                              | <primPrefix> "smid"

    Modify Section 2.X.3.2 of the NV_gpu_program4 specification, Program
    Attribute Variables.

    (Add the table entries and relevant text describing the fragment program
    input variables used to query thread states.)

      Fragment Attribute Binding  Components  Underlying State
      --------------------------  ----------  ----------------------------
      ...
      fragment.threadid           (id,-,-,-)  id of the current thread
      fragment.threadeqmask       (m,-,-,-)   mask with the current thread
      fragment.threadltmask       (m,-,-,-)   mask with lower thread
      fragment.threadlemask       (m,-,-,-)   mask with lower or equal thread
      fragment.threadgtmask       (m,-,-,-)   mask with greater thread
      fragment.threadgemask       (m,-,-,-)   mask with greater or equal thread
      fragment.warpid             (id,-,-,-)  warp id of the current thread
      fragment.smid               (id,-,-,-)  SM id of the current thread
      fragment.helperthread       (k,-,-,-)   current thread is a helper thread
      ...

    If a fragment attribute binding matches "fragment.threadid", the "x"
    component is filled with the thread id of the current thread. The thread
    id is an unsigned integer in the range 0 to 31.

    If a fragment attribute binding matches "fragment.threadeqmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which the
    bit equal to the current thread id is set.

    If a fragment attribute binding matches "fragment.threadltmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    lower than the current thread id are set.

    If a fragment attribute binding matches "fragment.threadlemask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    lower than or equal to the current thread id are set.

    If a fragment attribute binding matches "fragment.threadgtmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    greater than the current thread id are set.

    If a fragment attribute binding matches "fragment.threadgemask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    greater than or equal to the current thread id are set.

    If a fragment attribute binding matches "fragment.warpid", the "x"
    component is filled with the warp id of the current thread. The warp id is
    an unsigned integer; the range of this value is hardware dependent.

    If a fragment attribute binding matches "fragment.smid", the "x" component
    is filled with the SM id of the current thread. The SM id is an unsigned
    integer; the range of this value is hardware dependent.

    If a fragment attribute binding matches "fragment.helperthread", the "x"
    component is an integer value equal to -1 when the current thread is a
    helper thread and 0 otherwise. In implementations supporting this
    extension, fragment program invocations may be arranged in SIMD thread
    groups of 2x2 fragments called "quads". When a fragment program
    instruction is executed on a quad, it's possible that some fragments
    within the quad will execute the instruction even if they are not covered
    by the primitive. Those threads are called helper threads. Their outputs
    will be discarded and they will not execute global store instructions, but
    the intermediate values they compute can still be used by thread group
    sharing instructions or by fragment derivative instructions like DDX and
    DDY.

    (Add the table entries and relevant text describing the vertex program
    attribute variables used to query thread states.)

      Vertex Attribute Binding  Components  Underlying State
      ------------------------  ----------  ----------------------------
      ...
      vertex.threadid           (id,-,-,-)  id of the current thread
      vertex.threadeqmask       (m,-,-,-)   mask with the current thread
      vertex.threadltmask       (m,-,-,-)   mask with lower thread
      vertex.threadlemask       (m,-,-,-)   mask with lower or equal thread
      vertex.threadgtmask       (m,-,-,-)   mask with greater thread
      vertex.threadgemask       (m,-,-,-)   mask with greater or equal thread
      vertex.warpid             (id,-,-,-)  warp id of the current thread
      vertex.smid               (id,-,-,-)  SM id of the current thread
      ...

    If a vertex attribute binding matches "vertex.threadid", the "x" component
    is filled with the thread id of the current thread. The thread id is an
    unsigned integer in the range 0 to 31.

    If a vertex attribute binding matches "vertex.threadeqmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which the
    bit equal to the current thread id is set.

    If a vertex attribute binding matches "vertex.threadltmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    lower than the current thread id are set.

    If a vertex attribute binding matches "vertex.threadlemask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    lower than or equal to the current thread id are set.

    If a vertex attribute binding matches "vertex.threadgtmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    greater than the current thread id are set.

    If a vertex attribute binding matches "vertex.threadgemask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    greater than or equal to the current thread id are set.

    If a vertex attribute binding matches "vertex.warpid", the "x" component is
    filled with the warp id of the current thread. The warp id is an unsigned
    integer; the range of this value is hardware dependent.

    If a vertex attribute binding matches "vertex.smid", the "x" component
    is filled with the SM id of the current thread. The SM id is an unsigned
    integer; the range of this value is hardware dependent.


    (Add the table entries and relevant text describing the geometry program
    attribute variables used to query thread states.)

      Geometry Attribute Binding  Components  Underlying State
      --------------------------  ----------  ----------------------------
      ...
      primitive.threadid          (id,-,-,-)  id of the current thread
      primitive.threadeqmask      (m,-,-,-)   mask with the current thread
      primitive.threadltmask      (m,-,-,-)   mask with lower thread
      primitive.threadlemask      (m,-,-,-)   mask with lower or equal thread
      primitive.threadgtmask      (m,-,-,-)   mask with greater thread
      primitive.threadgemask      (m,-,-,-)   mask with greater or equal thread
      primitive.warpid            (id,-,-,-)  warp id of the current thread
      primitive.smid              (id,-,-,-)  SM id of the current thread
      ...

    If a geometry attribute binding matches "primitive.threadid", the "x"
    component is filled with the thread id of the current thread. The thread
    id is an unsigned integer in the range 0 to 31.

    If a geometry attribute binding matches "primitive.threadeqmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which the
    bit equal to the current thread id is set.

    If a geometry attribute binding matches "primitive.threadltmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    lower than the current thread id are set.

    If a geometry attribute binding matches "primitive.threadlemask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    lower than or equal to the current thread id are set.

    If a geometry attribute binding matches "primitive.threadgtmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    greater than the current thread id are set.

    If a geometry attribute binding matches "primitive.threadgemask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which bits
    greater than or equal to the current thread id are set.

    If a geometry attribute binding matches "primitive.warpid", the "x"
    component is filled with the warp id of the current thread. The warp id is
    an unsigned integer; the range of this value is hardware dependent.

    If a geometry attribute binding matches "primitive.smid", the "x" component
    is filled with the SM id of the current thread. The SM id is an unsigned
    integer; the range of this value is hardware dependent.


    (add the following subsection to section 2.X.3.3, Parameters)

    Thread Group Property Bindings

      Binding                        Components  Underlying State
      -----------------------------  ----------  ----------------------------
      state.thread.warpsize          (x,-,-,-)   total number of threads in a
                                                 warp
      state.thread.warpspersm        (x,-,-,-)   maximum number of warps
                                                 executing on a SM
      state.thread.smcount           (x,-,-,-)   number of SMs on the GPU

    If a program parameter binding matches "state.thread.warpsize", the "x"
    component of the program parameter variable is filled with an integer value
    indicating the total number of threads in a warp. The "y", "z", and "w"
    components are undefined.

    If a program parameter binding matches "state.thread.warpspersm", the "x"
    component of the program parameter variable is filled with an integer value
    indicating the maximum number of warps executing on a SM. The "y", "z",
    and "w" components are undefined.

    If a program parameter binding matches "state.thread.smcount", the "x"
    component of the program parameter variable is filled with an integer value
    indicating the number of SMs on the GPU. The "y", "z", and "w" components
    are undefined.


    Modify Section 2.X.4, Program Execution Environment

    (Add the table entries and relevant text describing the program
    instruction to query thread conditions.)

      Instr-   Modifiers
      uction   V  F I C S H D  Out Inputs    Description
      -------  -- - - - - - -  --- --------  --------------------------------
      ...
      TGBALLOT 50 X X X X - - F  vu  v       query a boolean in thread group
      ...


    (Add the table entries and relevant text describing the fragment program
    instructions to exchange data between threads.)

      Instr-   Modifiers
      uction   V  F I C S H D  Out Inputs    Description
      -------  -- - - - - - -  --- --------  --------------------------------
      ...
      QSWZ0    50 X - - - - - F  v   v,v     add fragment 0 in a quad
      QSWZ1    50 X - - - - - F  v   v,v     add fragment 1 in a quad
      QSWZ2    50 X - - - - - F  v   v,v     add fragment 2 in a quad
      QSWZ3    50 X - - - - - F  v   v,v     add fragment 3 in a quad
      QSWZX    50 X - - - - - F  v   v,v     add fragments horizontally
      QSWZY    50 X - - - - - F  v   v,v     add fragments vertically
      ...


    (Add to "Section 2.X.6, Program Options" of the NV_gpu_program4 extension,
    as extended by NV_gpu_program5)

    + Shader thread group (NV_shader_thread_group)

    If a fragment program specifies the "NV_shader_thread_group" option, it
    may use the "fragment.threadid", "fragment.threadeqmask",
    "fragment.threadltmask", "fragment.threadlemask", "fragment.threadgtmask",
    "fragment.threadgemask", "fragment.warpid", "fragment.smid",
    "fragment.helperthread", "state.thread.warpsize", "state.thread.warpspersm"
    and "state.thread.smcount" bindings.
    It may also use the "TGBALLOT", "QSWZ0", "QSWZ1", "QSWZ2", "QSWZ3",
    "QSWZX" and "QSWZY" instructions. If this option is not specified, a
    program will fail to compile if it uses those instructions or bindings.

    If a vertex program specifies the "NV_shader_thread_group" option, it may
    use the "vertex.threadid", "vertex.threadeqmask", "vertex.threadltmask",
    "vertex.threadlemask", "vertex.threadgtmask", "vertex.threadgemask",
    "vertex.warpid", "vertex.smid", "state.thread.warpsize",
    "state.thread.warpspersm" and "state.thread.smcount" bindings. It may also
    use the "TGBALLOT" instruction. If this option is not specified, a program
    will fail to compile if it uses those instructions or bindings.

    If a geometry program specifies the "NV_shader_thread_group" option, it
    may use the "primitive.threadid", "primitive.threadeqmask",
    "primitive.threadltmask", "primitive.threadlemask",
    "primitive.threadgtmask", "primitive.threadgemask", "primitive.warpid",
    "primitive.smid", "state.thread.warpsize", "state.thread.warpspersm" and
    "state.thread.smcount" bindings. It may also use the "TGBALLOT"
    instruction. If this option is not specified, a program will fail to
    compile if it uses those instructions or bindings.

    Section 2.X.8.Z, QSWZ0: add fragment 0 data to all fragments in a quad

    The QSWZ0 instruction produces a floating point result by adding the
    first operand, a floating point value from fragment 0, to the second
    operand, another floating point value from the current fragment.

    quadSwizzle0NV is the GLSL function that implements the same functionality
    as the QSWZ0 assembly instruction. Section 8.3 of the OpenGL Shading
    Language Specification has more detail about the implementation of
    quadSwizzle0NV. This additional information also applies to QSWZ0.


    Section 2.X.8.Z, QSWZ1: add fragment 1 data to all fragments in a quad

    The QSWZ1 instruction produces a floating point result by adding the
    first operand, a floating point value from fragment 1, to the second
    operand, another floating point value from the current fragment.

    quadSwizzle1NV is the GLSL function that implements the same functionality
    as the QSWZ1 assembly instruction. Section 8.3 of the OpenGL Shading
    Language Specification has more detail about the implementation of
    quadSwizzle1NV. This additional information also applies to QSWZ1.


    Section 2.X.8.Z, QSWZ2: add fragment 2 data to all fragments in a quad

    The QSWZ2 instruction produces a floating point result by adding the
    first operand, a floating point value from fragment 2, to the second
    operand, another floating point value from the current fragment.

    quadSwizzle2NV is the GLSL function that implements the same functionality
    as the QSWZ2 assembly instruction. Section 8.3 of the OpenGL Shading
    Language Specification has more detail about the implementation of
    quadSwizzle2NV. This additional information also applies to QSWZ2.


    Section 2.X.8.Z, QSWZ3: add fragment 3 data to all fragments in a quad

    The QSWZ3 instruction produces a floating point result by adding the
    first operand, a floating point value from fragment 3, to the second
    operand, another floating point value from the current fragment.

    quadSwizzle3NV is the GLSL function that implements the same functionality
    as the QSWZ3 assembly instruction. Section 8.3 of the OpenGL Shading
    Language Specification has more detail about the implementation of
    quadSwizzle3NV. This additional information also applies to QSWZ3.
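
    As an illustration of why the four broadcast instructions take a second
    operand, the C sketch below models QSWZ0 through QSWZ3 on the CPU and
    chains them so that every fragment ends up with the sum of one value over
    the whole quad. This is one possible use under the semantics described
    above, not a pattern prescribed by this extension. The sketch assumes a
    fully active (non-divergent) quad, and the names qswz_broadcast and
    quad_sum are hypothetical.

```c
/* Model of QSWZ<k> for k in 0..3: every lane reads lane k's first operand
   and adds its own second operand. Assumes all four lanes are active. */
static void qswz_broadcast(int k, const float op0[4], const float op1[4],
                           float result[4])
{
    float from_k = op0[k];

    /* op1[lane] is read before result[lane] is written, so result may
       refer to the same array as op1. */
    for (int lane = 0; lane < 4; ++lane)
        result[lane] = from_k + op1[lane];
}

/* Leaves v[0] + v[1] + v[2] + v[3] in every lane of sum. */
static void quad_sum(const float v[4], float sum[4])
{
    const float zero[4] = { 0.0f, 0.0f, 0.0f, 0.0f };

    qswz_broadcast(0, v, zero, sum);  /* sum = v[0]                  */
    qswz_broadcast(1, v, sum, sum);   /* sum = v[0] + v[1]           */
    qswz_broadcast(2, v, sum, sum);   /* sum = v[0] + v[1] + v[2]    */
    qswz_broadcast(3, v, sum, sum);   /* sum = total over the quad   */
}
```

    For the per-fragment values {1, 2, 3, 4}, quad_sum in this model leaves 10
    in all four lanes.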


    Section 2.X.8.Z, QSWZX: add fragments in a quad horizontally

    The QSWZX instruction produces a floating point result by adding the
    first operand, a floating point value from the fragment neighbor in X to
    the current fragment, to the second operand, another floating point value
    from the current fragment.

    quadSwizzleXNV is the GLSL function that implements the same functionality
    as the QSWZX assembly instruction. Section 8.3 of the OpenGL Shading
    Language Specification has more detail about the implementation of
    quadSwizzleXNV. This additional information also applies to QSWZX.


    Section 2.X.8.Z, QSWZY: add fragments in a quad vertically

    The QSWZY instruction produces a floating point result by adding the
    first operand, a floating point value from the fragment neighbor in Y to
    the current fragment, to the second operand, another floating point value
    from the current fragment.

    quadSwizzleYNV is the GLSL function that implements the same functionality
    as the QSWZY assembly instruction. Section 8.3 of the OpenGL Shading
    Language Specification has more detail about the implementation of
    quadSwizzleYNV. This additional information also applies to QSWZY.


    Section 2.X.8.Z, TGBALLOT: query a boolean condition over a thread group

    The TGBALLOT instruction produces a result vector by reading a vector
    operand for each active thread in the current thread group and comparing
    each component to zero. Each result vector component contains an integer
    bitmask (described below) in which a bit is set if the corresponding
    component of the operand vector is non-zero for the corresponding thread,
    and not set otherwise.

    Sometimes, when the instruction is inside a conditional control flow
    block or when it is not possible to completely fill a thread group, only
    a subset of the threads in the thread group will be active and will
    execute the TGBALLOT instruction. Each bit in the bitfield corresponding
    to an inactive thread will be set to 0. It is possible to query the
    active thread mask by executing TGBALLOT with 1 as the first operand.

      tmp = VectorLoad(op0);
      result = { 0, 0, 0, 0 };
      for (all active threads) {
        if ([thread]tmp.x != 0) result.x |= 1 << thread;
        if ([thread]tmp.y != 0) result.y |= 1 << thread;
        if ([thread]tmp.z != 0) result.z |= 1 << thread;
        if ([thread]tmp.w != 0) result.w |= 1 << thread;
      }

Dependencies on NV_tessellation_program5

    If NV_tessellation_program5 is supported and
    "OPTION NV_shader_thread_group" is specified in an assembly program, the
    following edits are made to extend the assembly programming model
    documented in the NV_gpu_program4 extension and extended by NV_gpu_program5
    and NV_tessellation_program5.

    If NV_tessellation_program5 is not supported, or if
    "OPTION NV_shader_thread_group" is not specified in an assembly program,
    the contents of this dependencies section should be ignored.


    Modify Section 2.X.2, Program Grammar

    (add/change the following rules to the NV_gpu_program5 base grammars for
    tessellation control programs)

      <attribBasic>           ::= <primPrefix> "threadid"
                                | <primPrefix> "threadeqmask"
                                | <primPrefix> "threadltmask"
                                | <primPrefix> "threadlemask"
                                | <primPrefix> "threadgtmask"
                                | <primPrefix> "threadgemask"
                                | <primPrefix> "warpid"
                                | <primPrefix> "smid"

    (add/change the following rules to the NV_gpu_program5 base grammars for
    tessellation evaluation programs)

      <attribBasic>           ::= <primPrefix> "threadid"
                                | <primPrefix> "threadeqmask"
                                | <primPrefix> "threadltmask"
                                | <primPrefix> "threadlemask"
                                | <primPrefix> "threadgtmask"
                                | <primPrefix> "threadgemask"
                                | <primPrefix> "warpid"
                                | <primPrefix> "smid"


    Modify Section 2.X.3.2 of the NV_tessellation_program5 specification,
    Program Attribute Variables.

    (Add the table entries and relevant text describing the tessellation
    control and evaluation program attribute variables used to query thread
    states.)


      Primitive Binding Suffix    Components  Underlying State
      --------------------------  ----------  ----------------------------------
      ...
      primitive.threadid          (id,-,-,-)  id of the current thread
      primitive.threadeqmask      (m,-,-,-)   mask with the current thread
      primitive.threadltmask      (m,-,-,-)   mask with lower threads
      primitive.threadlemask      (m,-,-,-)   mask with lower or equal threads
      primitive.threadgtmask      (m,-,-,-)   mask with greater threads
      primitive.threadgemask      (m,-,-,-)   mask with greater or equal threads
      primitive.warpid            (id,-,-,-)  warp id of the current thread
      primitive.smid              (id,-,-,-)  SM id of the current thread
      ...

    If an attribute binding matches "primitive.threadid", the "x" component is
    filled with the thread id of the current thread. The thread id is an
    unsigned integer in the range 0 to 31.

    If an attribute binding matches "primitive.threadeqmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which only
    the bit at the position of the current thread id is set.

    If an attribute binding matches "primitive.threadltmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which all
    bits at positions lower than the current thread id are set.

    If an attribute binding matches "primitive.threadlemask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which all
    bits at positions lower than or equal to the current thread id are set.

    If an attribute binding matches "primitive.threadgtmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which all
    bits at positions greater than the current thread id are set.

    If an attribute binding matches "primitive.threadgemask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which all
    bits at positions greater than or equal to the current thread id are set.

    If an attribute binding matches "primitive.warpid", the "x" component is
    filled with the warp id of the current thread. The warp id is an unsigned
    integer whose range is hardware-dependent.

    If an attribute binding matches "primitive.smid", the "x" component is
    filled with the SM id of the current thread. The SM id is an unsigned
    integer whose range is hardware-dependent.
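
    The five mask bindings described above are all functions of the thread
    id. As an illustration only, the following C sketch derives them on the
    CPU for a given thread id. It is not part of this extension: the struct
    thread_masks and the function thread_masks_for are hypothetical names,
    and a 32-thread warp is assumed, consistent with the 0 to 31 thread id
    range stated above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the five per-thread masks. */
typedef struct {
    uint32_t eq; /* only the bit of the current thread          */
    uint32_t lt; /* bits of all lower thread ids                */
    uint32_t le; /* bits of all lower or equal thread ids       */
    uint32_t gt; /* bits of all greater thread ids              */
    uint32_t ge; /* bits of all greater or equal thread ids     */
} thread_masks;

/* Derive the masks for a thread id in the range 0 to 31. */
static thread_masks thread_masks_for(uint32_t thread_id)
{
    assert(thread_id < 32u);
    thread_masks m;
    m.eq = UINT32_C(1) << thread_id; /* unsigned shift, safe for id 31 */
    m.lt = m.eq - 1u;                /* all bits below the thread's bit */
    m.le = m.lt | m.eq;
    m.ge = ~m.lt;                    /* complement of "lower than" */
    m.gt = ~m.le;                    /* complement of "lower or equal" */
    return m;
}
```

    Under these assumptions, thread id 5 yields eq = 0x00000020,
    lt = 0x0000001F, le = 0x0000003F, gt = 0xFFFFFFC0, and ge = 0xFFFFFFE0.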

    (Add to "Section 2.X.6, Program Options" of the NV_gpu_program4 extension,
    as extended by NV_gpu_program5 and NV_tessellation_program5)

    + Shader thread group (NV_shader_thread_group)

    If a program specifies the "NV_shader_thread_group" option, it may use
    the "primitive.threadid", "primitive.threadeqmask",
    "primitive.threadltmask", "primitive.threadlemask",
    "primitive.threadgtmask", "primitive.threadgemask", "primitive.warpid",
    "primitive.smid", "state.thread.warpsize", "state.thread.warpspersm" and
    "state.thread.smcount" bindings. It may also use the "TGBALLOT"
    instruction. If this option is not specified, a program that uses those
    bindings will fail to compile.


Dependencies on NV_compute_program5

    If NV_compute_program5 is supported and "OPTION NV_shader_thread_group" is
    specified in an assembly program, the following edits are made to extend
    the assembly programming model documented in the NV_gpu_program4 extension
    and extended by NV_gpu_program5 and NV_compute_program5.

    If NV_compute_program5 is not supported, or if
    "OPTION NV_shader_thread_group" is not specified in an assembly program,
    the contents of this dependencies section should be ignored.

    Section 2.X.2, Program Grammar

    (add the following rules to the grammar)

      <attribBasic>           ::= "invocation" "." "threadid"
                                | "invocation" "." "threadeqmask"
                                | "invocation" "." "threadltmask"
                                | "invocation" "." "threadlemask"
                                | "invocation" "." "threadgtmask"
                                | "invocation" "." "threadgemask"
                                | "invocation" "." "warpid"
                                | "invocation" "." "smid"

    Modify Section 2.X.3.2 of the NV_compute_program5 specification, Program
    Attribute Variables.

    (Add the table entries and relevant text describing the compute program
    input variables used to query thread states.)

      Attribute Binding           Components  Underlying State
      --------------------------  ----------  ----------------------------------
      ...
      invocation.threadid         (id,-,-,-)  id of the current thread
      invocation.threadeqmask     (m,-,-,-)   mask with the current thread
      invocation.threadltmask     (m,-,-,-)   mask with lower threads
      invocation.threadlemask     (m,-,-,-)   mask with lower or equal threads
      invocation.threadgtmask     (m,-,-,-)   mask with greater threads
      invocation.threadgemask     (m,-,-,-)   mask with greater or equal threads
      invocation.warpid           (id,-,-,-)  warp id of the current thread
      invocation.smid             (id,-,-,-)  SM id of the current thread
      ...

    If a compute attribute binding matches "invocation.threadid", the "x"
    component is filled with the thread id of the current thread. The thread
    id is an unsigned integer in the range 0 to 31.

    If a compute attribute binding matches "invocation.threadeqmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which only
    the bit at the position of the current thread id is set.

    If a compute attribute binding matches "invocation.threadltmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which all
    bits at positions lower than the current thread id are set.

    If a compute attribute binding matches "invocation.threadlemask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which all
    bits at positions lower than or equal to the current thread id are set.

    If a compute attribute binding matches "invocation.threadgtmask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which all
    bits at positions greater than the current thread id are set.

    If a compute attribute binding matches "invocation.threadgemask", the "x"
    component is filled with a 32-bit unsigned integer bitfield in which all
    bits at positions greater than or equal to the current thread id are set.

    If a compute attribute binding matches "invocation.warpid", the "x"
    component is filled with the warp id of the current thread. The warp id is
    an unsigned integer whose range is hardware-dependent.

    If a compute attribute binding matches "invocation.smid", the "x" component
    is filled with the SM id of the current thread. The SM id is an unsigned
    integer whose range is hardware-dependent.

    (Add to "Section 2.X.6, Program Options" of the NV_gpu_program4 extension,
    as extended by NV_gpu_program5 and NV_compute_program5)


    + Shader thread group (NV_shader_thread_group)

    If a program specifies the "NV_shader_thread_group" option, it may use the
    "invocation.threadid", "invocation.threadeqmask",
    "invocation.threadltmask", "invocation.threadlemask",
    "invocation.threadgtmask", "invocation.threadgemask", "invocation.warpid",
    "invocation.smid", "state.thread.warpsize", "state.thread.warpspersm" and
    "state.thread.smcount" bindings. It may also use the "TGBALLOT"
    instruction. If this option is not specified, a program that uses those
    bindings will fail to compile.


Errors

    None.

New State

    None.

New Implementation Dependent State

                                                             Minimum
    Get Value                         Type  Get Command      Value    Description            Sec.     Attrib
    --------------------------------  ----  ---------------  -------  ---------------------  -------  ------
    WARP_SIZE_NV                      Z+    GetIntegerv      1        total number of        2.X.3.3  -
                                                                      threads in a warp.

    WARPS_PER_SM_NV                   Z+    GetIntegerv      1        maximum number of      2.X.3.3  -
                                                                      warps executing on
                                                                      an SM.

    SM_COUNT_NV                       Z+    GetIntegerv      1        number of SMs on the   2.X.3.3  -
                                                                      GPU.


Issues

    None


Revision History

    Rev.  Date      Author    Changes
    ----  --------  --------  -----------------------------------------
    4     7/21/15   jbreton   Update the layout of threads within a quad for
                              window and framebuffer object rendering.
    3     2/14/14   jbreton   Rename the extension from NVX to NV.
    2     9/4/13    jbreton   Add helperThread attribute binding.
    1     12/19/12  jbreton   Internal revisions.