Name

    NV_shader_thread_shuffle

Name Strings

    GL_NV_shader_thread_shuffle

Contributors

    Jeannot Breton, NVIDIA
    Pat Brown, NVIDIA
    Eric Werness, NVIDIA
    Mark Kilgard, NVIDIA

Contact

    Jeannot Breton, NVIDIA Corporation (jbreton 'at' nvidia.com)

Status

    Shipping.

Version

    Last Modified Date:         2/14/2014
    NVIDIA Revision:            3

Number

    OpenGL Extension #448

Dependencies

    This extension is written against the OpenGL 4.3 (Compatibility Profile)
    Specification.

    This extension is written against version 4.30 (revision 07) of the OpenGL
    Shading Language Specification.

    OpenGL 4.3 and GLSL 4.3 are required.

    This extension interacts with NV_gpu_program5.

Overview

    Implementations of the OpenGL Shading Language may, but are not required
    to, run multiple shader threads for a single stage as a SIMD thread group,
    where individual execution threads are assigned to thread groups in an
    undefined, implementation-dependent order. This extension provides a set
    of new features to the OpenGL Shading Language to share data between
    multiple threads within a thread group.

    Shaders using the new functionality provided by this extension should
    enable it via the construct

      #extension GL_NV_shader_thread_shuffle : require     (or enable)

    This extension also specifies some modifications to the program assembly
    language to support the thread data sharing functionality.

New Procedures and Functions

    None


New Tokens

    None


Modifications to The OpenGL Shading Language Specification, Version 4.30
(Revision 07)

    Including the following line in a shader can be used to control the
    language features described in this extension:

      #extension GL_NV_shader_thread_shuffle : <behavior>

    where <behavior> is as specified in section 3.3.

    New preprocessor #defines are added to the OpenGL Shading Language:

      #define GL_NV_shader_thread_shuffle 1


    Modify Section 8.3, Common Functions, p. 133

    (add a function to share data between threads in a thread group)

      Syntax:

        int   shuffleDownNV(int data, uint index, uint width,
                            [out bool threadIdValid])
        ivec2 shuffleDownNV(ivec2 data, uint index, uint width,
                            [out bool threadIdValid])
        ivec3 shuffleDownNV(ivec3 data, uint index, uint width,
                            [out bool threadIdValid])
        ivec4 shuffleDownNV(ivec4 data, uint index, uint width,
                            [out bool threadIdValid])

        uint  shuffleDownNV(uint data, uint index, uint width,
                            [out bool threadIdValid])
        uvec2 shuffleDownNV(uvec2 data, uint index, uint width,
                            [out bool threadIdValid])
        uvec3 shuffleDownNV(uvec3 data, uint index, uint width,
                            [out bool threadIdValid])
        uvec4 shuffleDownNV(uvec4 data, uint index, uint width,
                            [out bool threadIdValid])

        float shuffleDownNV(float data, uint index, uint width,
                            [out bool threadIdValid])
        vec2  shuffleDownNV(vec2 data, uint index, uint width,
                            [out bool threadIdValid])
        vec3  shuffleDownNV(vec3 data, uint index, uint width,
                            [out bool threadIdValid])
        vec4  shuffleDownNV(vec4 data, uint index, uint width,
                            [out bool threadIdValid])

        bool  shuffleDownNV(bool data, uint index, uint width,
                            [out bool threadIdValid])
        bvec2 shuffleDownNV(bvec2 data, uint index, uint width,
                            [out bool threadIdValid])
        bvec3 shuffleDownNV(bvec3 data, uint index, uint width,
                            [out bool threadIdValid])
        bvec4 shuffleDownNV(bvec4 data, uint index, uint width,
                            [out bool threadIdValid])


        int   shuffleUpNV(int data, uint index, uint width,
                          [out bool threadIdValid])
        ivec2 shuffleUpNV(ivec2 data, uint index, uint width,
                          [out bool threadIdValid])
        ivec3 shuffleUpNV(ivec3 data, uint index, uint width,
                          [out bool threadIdValid])
        ivec4 shuffleUpNV(ivec4 data, uint index, uint width,
                          [out bool threadIdValid])

        uint  shuffleUpNV(uint data, uint index, uint width,
                          [out bool threadIdValid])
        uvec2 shuffleUpNV(uvec2 data, uint index, uint width,
                          [out bool threadIdValid])
        uvec3 shuffleUpNV(uvec3 data, uint index, uint width,
                          [out bool threadIdValid])
        uvec4 shuffleUpNV(uvec4 data, uint index, uint width,
                          [out bool threadIdValid])

        float shuffleUpNV(float data, uint index, uint width,
                          [out bool threadIdValid])
        vec2  shuffleUpNV(vec2 data, uint index, uint width,
                          [out bool threadIdValid])
        vec3  shuffleUpNV(vec3 data, uint index, uint width,
                          [out bool threadIdValid])
        vec4  shuffleUpNV(vec4 data, uint index, uint width,
                          [out bool threadIdValid])

        bool  shuffleUpNV(bool data, uint index, uint width,
                          [out bool threadIdValid])
        bvec2 shuffleUpNV(bvec2 data, uint index, uint width,
                          [out bool threadIdValid])
        bvec3 shuffleUpNV(bvec3 data, uint index, uint width,
                          [out bool threadIdValid])
        bvec4 shuffleUpNV(bvec4 data, uint index, uint width,
                          [out bool threadIdValid])


        int   shuffleXorNV(int data, uint index, uint width,
                           [out bool threadIdValid])
        ivec2 shuffleXorNV(ivec2 data, uint index, uint width,
                           [out bool threadIdValid])
        ivec3 shuffleXorNV(ivec3 data, uint index, uint width,
                           [out bool threadIdValid])
        ivec4 shuffleXorNV(ivec4 data, uint index, uint width,
                           [out bool threadIdValid])

        uint  shuffleXorNV(uint data, uint index, uint width,
                           [out bool threadIdValid])
        uvec2 shuffleXorNV(uvec2 data, uint index, uint width,
                           [out bool threadIdValid])
        uvec3 shuffleXorNV(uvec3 data, uint index, uint width,
                           [out bool threadIdValid])
        uvec4 shuffleXorNV(uvec4 data, uint index, uint width,
                           [out bool threadIdValid])

        float shuffleXorNV(float data, uint index, uint width,
                           [out bool threadIdValid])
        vec2  shuffleXorNV(vec2 data, uint index, uint width,
                           [out bool threadIdValid])
        vec3  shuffleXorNV(vec3 data, uint index, uint width,
                           [out bool threadIdValid])
        vec4  shuffleXorNV(vec4 data, uint index, uint width,
                           [out bool threadIdValid])

        bool  shuffleXorNV(bool data, uint index, uint width,
                           [out bool threadIdValid])
        bvec2 shuffleXorNV(bvec2 data, uint index, uint width,
                           [out bool threadIdValid])
        bvec3 shuffleXorNV(bvec3 data, uint index, uint width,
                           [out bool threadIdValid])
        bvec4 shuffleXorNV(bvec4 data, uint index, uint width,
                           [out bool threadIdValid])


        int   shuffleNV(int data, uint index, uint width,
                        [out bool threadIdValid])
        ivec2 shuffleNV(ivec2 data, uint index, uint width,
                        [out bool threadIdValid])
        ivec3 shuffleNV(ivec3 data, uint index, uint width,
                        [out bool threadIdValid])
        ivec4 shuffleNV(ivec4 data, uint index, uint width,
                        [out bool threadIdValid])

        uint  shuffleNV(uint data, uint index, uint width,
                        [out bool threadIdValid])
        uvec2 shuffleNV(uvec2 data, uint index, uint width,
                        [out bool threadIdValid])
        uvec3 shuffleNV(uvec3 data, uint index, uint width,
                        [out bool threadIdValid])
        uvec4 shuffleNV(uvec4 data, uint index, uint width,
                        [out bool threadIdValid])

        float shuffleNV(float data, uint index, uint width,
                        [out bool threadIdValid])
        vec2  shuffleNV(vec2 data, uint index, uint width,
                        [out bool threadIdValid])
        vec3  shuffleNV(vec3 data, uint index, uint width,
                        [out bool threadIdValid])
        vec4  shuffleNV(vec4 data, uint index, uint width,
                        [out bool threadIdValid])

        bool  shuffleNV(bool data, uint index, uint width,
                        [out bool threadIdValid])
        bvec2 shuffleNV(bvec2 data, uint index, uint width,
                        [out bool threadIdValid])
        bvec3 shuffleNV(bvec3 data, uint index, uint width,
                        [out bool threadIdValid])
        bvec4 shuffleNV(bvec4 data, uint index, uint width,
                        [out bool threadIdValid])

    Shuffle functions allow active threads within a thread group
    to exchange data using 4 different modes (up, down, xor, indexed). They
    all load the operand <data>, which can differ per thread, and return a
    value read from the source thread at an address computed from the <index>
    and <width> operands.

    <index> is a 5-bit value in the range 0 to 31; higher-order bits are
    ignored. <threadIdValid> is an optional operand. It holds the value of
    the predicate that specifies whether the source thread from which the
    current thread reads data is in range.

    <width> is used for segmenting the thread group into multiple segments.
    The segments must be of equal size, so <width> needs to be a power of 2
    in the range 2 to 32. Using a <width> of 32 would treat the thread group
    as a single segment. A <width> of 8 would divide the thread group into
    4 segments of size 8. Using a <width> that is not a power of 2, or that
    is lower than 2 or larger than 32, will return an undefined value.

    Threads can only share data within their own segment. Each thread
    executing the built-in shuffle function will determine the ID of another
    thread by combining its value of gl_ThreadInWarpNV with its value of
    <index> as described below. Such threads will attempt to read the value
    of <data> in the computed other thread and return that value to the
    caller.

    When a shuffle function attempts to access the value of <data> from
    another thread, it determines whether the other thread is in accessible
    range or not. If it is in range, true will be returned in the optional
    <threadIdValid> parameter, if provided by the caller. If it is out of
    range, false will be returned in <threadIdValid>, if provided by the
    caller, and the value returned by the function will come from the current
    thread.
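
    The segmentation and range rules above can be modelled on the CPU. The
    Python sketch below is an illustration only, not how an implementation
    executes the built-ins. Its assumptions: the thread group holds 32
    threads (suggested by the 2 to 32 range of <width>), every thread is
    active, thread ids passed to read_or_keep are relative to the start of
    one segment, and any source id outside 0..width-1 counts as out of
    range. The names GROUP_SIZE, is_valid_width, segments and read_or_keep
    are hypothetical and are not part of this extension.

```python
GROUP_SIZE = 32  # assumed thread group size (not stated as a constant in the spec)


def is_valid_width(width):
    """<width> must be a power of 2 in the range 2 to 32."""
    return 2 <= width <= GROUP_SIZE and (width & (width - 1)) == 0


def segments(width):
    """Split thread ids 0..31 into equal segments of <width> threads."""
    if not is_valid_width(width):
        # The spec says the shuffle result is undefined here; the model refuses.
        raise ValueError("shuffle result is undefined for this width")
    return [list(range(start, start + width))
            for start in range(0, GROUP_SIZE, width)]


def read_or_keep(data, current, src, width):
    """Return (value, threadIdValid) for one thread of a single segment.

    `data` holds the <data> value of each thread in the segment. `current`
    and `src` are thread ids relative to the segment start (an assumption
    of this sketch).
    """
    valid = 0 <= src < width
    # Out-of-range reads fall back to the calling thread's own value.
    return (data[src] if valid else data[current]), valid
```

    With a <width> of 8, segments(8) yields 4 segments, and a thread whose
    computed source id falls outside its segment gets its own <data> back
    together with a false <threadIdValid>.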


    The 4 modes use the following logic to compute the source thread index
    and the <threadIdValid> value:

    shuffleNV computes the source index using <index> as an absolute address
    within the thread group segment.

      srcThreadId = <index>
      <threadIdValid> = <index> < <width>

      For example, with this thread group segment:

                          -----------------
          Thread Id       |0|1|2|3|4|5|6|7|
                          -----------------
          Thread <data>   |a|b|c|d|e|f|g|h|
                          -----------------

      If <index> is 2

                          -----------------
          src thread Id   |2|2|2|2|2|2|2|2|
                          -----------------
          <threadIdValid> |1|1|1|1|1|1|1|1|
                          -----------------
          result          |c|c|c|c|c|c|c|c|
                          -----------------

      If <index> is 9

                          -----------------
          src thread Id   |9|9|9|9|9|9|9|9|
                          -----------------
          <threadIdValid> |0|0|0|0|0|0|0|0|
                          -----------------
          result          |a|b|c|d|e|f|g|h|
                          -----------------


    shuffleUpNV subtracts <index> from the current thread id to get the
    source thread id. This has the effect of shifting the segment up by
    <index> threads. Source thread ids do not wrap around, so the lowest
    thread ids are left unchanged.

      srcThreadId = currentThreadId - <index>
      <threadIdValid> = srcThreadId >= 0

      For example, with this thread group segment:

                          -----------------
          Thread Id       |0|1|2|3|4|5|6|7|
                          -----------------
          Thread <data>   |a|b|c|d|e|f|g|h|
                          -----------------

      If <index> is 1

                          ------------------
          src thread Id   |-1|0|1|2|3|4|5|6|
                          ------------------
          <threadIdValid> |0 |1|1|1|1|1|1|1|
                          ------------------
          result          |a |a|b|c|d|e|f|g|
                          ------------------


    shuffleDownNV adds <index> to the current thread id to get the source
    thread id. This has the effect of shifting the segment down by
    <index> threads. Source thread ids do not wrap around, so the highest
    thread ids are left unchanged.

      srcThreadId = currentThreadId + <index>
      <threadIdValid> = srcThreadId < <width>

      For example, with this thread group segment:

                          -----------------
          Thread Id       |0|1|2|3|4|5|6|7|
                          -----------------
          Thread <data>   |a|b|c|d|e|f|g|h|
                          -----------------

      If <index> is 2

                          -----------------
          src thread Id   |2|3|4|5|6|7|8|9|
                          -----------------
          <threadIdValid> |1|1|1|1|1|1|0|0|
                          -----------------
          result          |c|d|e|f|g|h|g|h|
                          -----------------


    shuffleXorNV does a bitwise xor between <index> and the current
    thread id to get the source thread id:

      srcThreadId = currentThreadId ^ <index>
      <threadIdValid> = srcThreadId < <width>

      For example, with this thread group segment:

                          -----------------
          Thread Id       |0|1|2|3|4|5|6|7|
                          -----------------
          Thread <data>   |a|b|c|d|e|f|g|h|
                          -----------------

      If <index> is 0x1

                          -----------------
          src thread Id   |1|0|3|2|5|4|7|6|
                          -----------------
          <threadIdValid> |1|1|1|1|1|1|1|1|
                          -----------------
          result          |b|a|d|c|f|e|h|g|
                          -----------------

Dependencies on NV_gpu_program5

    If NV_gpu_program5 is supported and "OPTION NV_shader_thread_shuffle" is
    specified in an assembly program, the following edits are made to extend
    the assembly programming model documented in the NV_gpu_program4 extension
    and extended by NV_gpu_program5.

    If NV_gpu_program5 is not supported, or if
    "OPTION NV_shader_thread_shuffle" is not specified in an assembly program,
    the contents of this dependencies section should be ignored.

    Section 2.X.2, Program Grammar

    (add the following rules to the grammar)

      <VECTORop> ::= "SHFDOWN"
                   | "SHFIDX"
                   | "SHFUP"
                   | "SHFXOR"


    Modify Section 2.X.4, Program Execution Environment

    (Add the table entries and relevant text describing the program
    instructions to exchange data between threads.)

      Instr-      Modifiers
      uction  V  F I C S H D  Out Inputs    Description
      ------- -- - - - - - -  --- --------  --------------------------------
      ...
      SHFDOWN 50 X X - - - - F v v,vu,vu    warp shuffle with added index
      SHFIDX  50 X X - - - - F v v,vu,vu    warp shuffle with absolute index
      SHFUP   50 X X - - - - F v v,vu,vu    warp shuffle with subtracted index
      SHFXOR  50 X X - - - - F v v,vu,vu    warp shuffle with XORed index
      ...


    (Add to "Section 2.X.6, Program Options" of the NV_gpu_program4 extension,
    as extended by NV_gpu_program5)

    + Shader thread shuffle (NV_shader_thread_shuffle)

    If a program specifies the "NV_shader_thread_shuffle" option, it may use
    the "SHFXOR", "SHFDOWN", "SHFIDX" and "SHFUP" instructions. If this option
    is not specified, a program will fail to compile if it uses those
    instructions.


    Section 2.X.8.Z, SHFDOWN: warp shuffle with added index

    The SHFDOWN instruction allows a 32-bit scalar value to be exchanged
    between multiple threads within a thread group. The instruction has 3
    operands as input. The first operand is a 32-bit scalar. This value will
    be shared between threads; it can be a float, a signed integer or an
    unsigned integer. The second operand is an unsigned integer index in the
    range 0 to 31. It is used to compute the thread from which the current
    thread will read the 32-bit scalar value. For the SHFDOWN instruction,
    the source thread id is the id of the current thread plus the index
    operand.

    The last operand is an unsigned integer mask. The mask is used for
    segmenting the thread group and limiting the source thread index. Bits 0
    to 4 of <mask> are a clamp value that limits the source thread index and
    bits 8 to 12 are a segmentation mask used to split the thread group into
    multiple smaller groups.
    Together the clamp value and the segmentation mask will generate 2
    internal values, the minThreadId and the maxThreadId, using the following
    logic:

      minThreadId = current thread id & segmentationMask

      maxThreadId = minThreadId | (clamp & ~segmentationMask)

    Those 2 values will segment the thread group by restricting the address
    range a specific thread can access.

    SHFDOWN returns a 2-component vector. The first component is a predicate
    that is TRUE when the computed source thread id is in range and FALSE when
    it is out of bounds. For SHFDOWN, the source thread id is in range when it
    is lower than maxThreadId. The second component holds a 32-bit value.
    When the source thread id is in range, this value comes from the source
    thread. When the source thread id is out of range, the value is read from
    the current thread. If the source thread id refers to an inactive
    thread, the returned result will be undefined.

    SHFDOWN supports all data type modifiers. For floating-point data types,
    the TRUE value is 1.0 and the FALSE value is 0.0. For signed integer data
    types, the TRUE value is -1 and the FALSE value is 0. For unsigned integer
    data types, the TRUE value is the maximum integer value (all bits are ones)
    and the FALSE value is zero.


    Section 2.X.8.Z, SHFIDX: warp shuffle with absolute index

    The SHFIDX instruction allows a 32-bit scalar value to be exchanged between
    multiple threads within a thread group. The instruction has 3 operands as
    input. The first operand is a 32-bit scalar. This value will be shared
    between threads; it can be a float, a signed integer or an unsigned
    integer. The second operand is an unsigned integer index in the range 0 to
    31. It is used to compute the thread from which the current thread will
    read the 32-bit scalar value.
    For the SHFIDX instruction, this source thread id is computed using the
    following operation:

      source thread id = (index operand & ~segmentationMask) | minThreadId

    The last operand is an unsigned integer mask. The mask is used for
    segmenting the thread group and limiting the source thread index. Bits 0
    to 4 of <mask> are a clamp value that limits the source thread index and
    bits 8 to 12 are a segmentation mask used to split the thread group into
    multiple smaller groups. Together the clamp value and the segmentation
    mask will generate 2 internal values, the minThreadId and the maxThreadId,
    using the following logic:

      minThreadId = current thread id & segmentationMask

      maxThreadId = minThreadId | (clamp & ~segmentationMask)

    Those 2 values will segment the thread group by restricting the address
    range a specific thread can access.

    SHFIDX returns a 2-component vector. The first component is a predicate
    that is TRUE when the computed source thread id is in range and FALSE when
    it is out of bounds. For SHFIDX, the source thread id is in range when it
    is lower than maxThreadId. The second component holds a 32-bit value.
    When the source thread id is in range, this value comes from the source
    thread. When the source thread id is out of range, the value is read from
    the current thread. If the source thread id refers to an inactive
    thread, the returned result will be undefined.

    SHFIDX supports all data type modifiers. For floating-point data types,
    the TRUE value is 1.0 and the FALSE value is 0.0. For signed integer data
    types, the TRUE value is -1 and the FALSE value is 0. For unsigned integer
    data types, the TRUE value is the maximum integer value (all bits are ones)
    and the FALSE value is zero.
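
    The mask decoding and the SHFIDX source thread computation described
    above can be traced with a small Python model. This is a sketch under
    stated assumptions, not a description of the hardware: the function
    names are hypothetical, the index is masked to 5 bits because the text
    gives its range as 0 to 31, and the example mask 0x181F (segmentation
    mask 0x18, clamp 0x1F, giving segments of 8 threads) is an illustrative
    value chosen for this sketch and does not come from this extension.

```python
def decode_mask(mask):
    """Split <mask> into (clamp, segmentationMask).

    Bits 0 to 4 hold the clamp value, bits 8 to 12 the segmentation mask.
    """
    clamp = mask & 0x1F
    segmentation_mask = (mask >> 8) & 0x1F
    return clamp, segmentation_mask


def thread_bounds(thread_id, mask):
    """Compute (minThreadId, maxThreadId) for the current thread."""
    clamp, seg = decode_mask(mask)
    min_id = thread_id & seg
    # The 0x1F keeps the complement within the 5 bits used for thread ids.
    max_id = min_id | (clamp & ~seg & 0x1F)
    return min_id, max_id


def shfidx_source(thread_id, index, mask):
    """Source thread id for SHFIDX: (index & ~segmentationMask) | minThreadId."""
    _, seg = decode_mask(mask)
    min_id, _ = thread_bounds(thread_id, mask)
    return ((index & 0x1F) & ~seg) | min_id


# Illustrative mask (an assumption of this sketch): segments of 8 threads.
EXAMPLE_MASK = (0x18 << 8) | 0x1F
```

    With EXAMPLE_MASK, thread 13 gets minThreadId 8 and maxThreadId 15, and
    an index operand of 2 selects source thread 10, the third thread of that
    thread's own segment.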


    Section 2.X.8.Z, SHFUP: warp shuffle with subtracted index

    The SHFUP instruction allows a 32-bit scalar value to be exchanged between
    multiple threads within a thread group. The instruction has 3 operands as
    input. The first operand is a 32-bit scalar. This value will be shared
    between threads; it can be a float, a signed integer or an unsigned
    integer. The second operand is an unsigned integer index in the range 0 to
    31. It is used to compute the thread from which the current thread will
    read the 32-bit scalar value. For the SHFUP instruction, the source thread
    id is the id of the current thread minus the index operand.

    The last operand is an unsigned integer mask. The mask is used for
    segmenting the thread group and limiting the source thread index. Bits 0
    to 4 of <mask> are a clamp value that limits the source thread index and
    bits 8 to 12 are a segmentation mask used to split the thread group into
    multiple smaller groups. Together the clamp value and the segmentation
    mask will generate 2 internal values, the minThreadId and the maxThreadId,
    using the following logic:

      minThreadId = current thread id & segmentationMask

      maxThreadId = minThreadId | (clamp & ~segmentationMask)

    Those 2 values will segment the thread group by restricting the address
    range a specific thread can access.

    SHFUP returns a 2-component vector. The first component is a predicate
    that is TRUE when the computed source thread id is in range and FALSE when
    it is out of bounds. For SHFUP, the source thread id is in range when it
    is greater than maxThreadId. The second component holds a 32-bit value.
    When the source thread id is in range, this value comes from the source
    thread. When the source thread id is out of range, the value is read from
    the current thread. If the source thread id refers to an inactive
    thread, the returned result will be undefined.

    SHFUP supports all data type modifiers. For floating-point data types,
    the TRUE value is 1.0 and the FALSE value is 0.0. For signed integer data
    types, the TRUE value is -1 and the FALSE value is 0. For unsigned integer
    data types, the TRUE value is the maximum integer value (all bits are ones)
    and the FALSE value is zero.


    Section 2.X.8.Z, SHFXOR: warp shuffle with XORed index

    The SHFXOR instruction allows a 32-bit scalar value to be exchanged
    between multiple threads within a thread group. The instruction has 3
    operands as input. The first operand is a 32-bit scalar. This value will
    be shared between threads; it can be a float, a signed integer or an
    unsigned integer. The second operand is an unsigned integer index in the
    range 0 to 31. It is used to compute the thread from which the current
    thread will read the 32-bit scalar value. For the SHFXOR instruction, the
    source thread id is the id of the current thread XORed with the index
    operand.

    The last operand is an unsigned integer mask. The mask is used for
    segmenting the thread group and limiting the source thread index. Bits 0
    to 4 of <mask> are a clamp value that limits the source thread index and
    bits 8 to 12 are a segmentation mask used to split the thread group into
    multiple smaller groups. Together the clamp value and the segmentation
    mask will generate 2 internal values, the minThreadId and the maxThreadId,
    using the following logic:

      minThreadId = current thread id & segmentationMask

      maxThreadId = minThreadId | (clamp & ~segmentationMask)

    Those 2 values will segment the thread group by restricting the address
    range a specific thread can access.

    SHFXOR returns a 2-component vector. The first component is a predicate
    that is TRUE when the computed source thread id is in range and FALSE when
    it is out of bounds.
    For SHFXOR, the source thread id is in range when it is lower than
    maxThreadId. The second component holds a 32-bit value. When the source
    thread id is in range, this value comes from the source thread. When the
    source thread id is out of range, the value is read from the current
    thread. If the source thread id refers to an inactive thread, the
    returned result will be undefined.

    SHFXOR supports all data type modifiers. For floating-point data types,
    the TRUE value is 1.0 and the FALSE value is 0.0. For signed integer data
    types, the TRUE value is -1 and the FALSE value is 0. For unsigned integer
    data types, the TRUE value is the maximum integer value (all bits are ones)
    and the FALSE value is zero.

Errors

    None.

New State

    None.

New Implementation Dependent State

    None.

Issues

    None


Revision History

    Rev.    Date    Author    Changes
    ----  --------  --------  -----------------------------------------
     3    2/14/14   jbreton   Rename the extension from NVX to NV.
     2    9/4/13    jbreton   Replace mask by width in the shuffle functions.
     1    11/27/12  jbreton   Internal revisions.