Name

    NV_shader_thread_shuffle

Name Strings

    GL_NV_shader_thread_shuffle

Contributors

    Jeannot Breton, NVIDIA
    Pat Brown, NVIDIA
    Eric Werness, NVIDIA
    Mark Kilgard, NVIDIA

Contact

    Jeannot Breton, NVIDIA Corporation (jbreton 'at' nvidia.com)

Status

    Shipping.

Version

    Last Modified Date:         2/14/2014
    NVIDIA Revision:            3

Number

    OpenGL Extension #448

Dependencies

    This extension is written against the OpenGL 4.3 (Compatibility Profile)
    Specification.

    This extension is written against version 4.30 (revision 07) of the OpenGL
    Shading Language Specification.

    OpenGL 4.3 and GLSL 4.3 are required.

    This extension interacts with NV_gpu_program5.

Overview

    Implementations of the OpenGL Shading Language may, but are not required
    to, run multiple shader threads for a single stage as a SIMD thread group,
    where individual execution threads are assigned to thread groups in an
    undefined, implementation-dependent order.  This extension adds a set of
    built-in functions to the OpenGL Shading Language for sharing data between
    multiple threads within a thread group.

    Shaders using the functionality provided by this extension should enable
    it via the construct

        #extension GL_NV_shader_thread_shuffle : require     (or enable)

    This extension also specifies some modifications to the program assembly
    language to support the thread data sharing functionality.

New Procedures and Functions

    None


New Tokens

    None


Modifications to The OpenGL Shading Language Specification, Version 4.30
(Revision 07)

    Including the following line in a shader can be used to control the
    language features described in this extension:

      #extension GL_NV_shader_thread_shuffle : <behavior>

    where <behavior> is as specified in section 3.3.

    New preprocessor #defines are added to the OpenGL Shading Language:

      #define GL_NV_shader_thread_shuffle         1


    Modify Section 8.3, Common Functions, p. 133

    (add a function to share data between threads in a thread group)

    Syntax:

        int    shuffleDownNV(int data,   uint index, uint width,
                            [out bool threadIdValid])
        ivec2  shuffleDownNV(ivec2 data, uint index, uint width,
                            [out bool threadIdValid])
        ivec3  shuffleDownNV(ivec3 data, uint index, uint width,
                            [out bool threadIdValid])
        ivec4  shuffleDownNV(ivec4 data, uint index, uint width,
                            [out bool threadIdValid])

        uint   shuffleDownNV(uint  data, uint index, uint width,
                            [out bool threadIdValid])
        uvec2  shuffleDownNV(uvec2 data, uint index, uint width,
                            [out bool threadIdValid])
        uvec3  shuffleDownNV(uvec3 data, uint index, uint width,
                            [out bool threadIdValid])
        uvec4  shuffleDownNV(uvec4 data, uint index, uint width,
                            [out bool threadIdValid])

        float  shuffleDownNV(float data, uint index, uint width,
                            [out bool threadIdValid])
        vec2   shuffleDownNV(vec2  data, uint index, uint width,
                            [out bool threadIdValid])
        vec3   shuffleDownNV(vec3  data, uint index, uint width,
                            [out bool threadIdValid])
        vec4   shuffleDownNV(vec4  data, uint index, uint width,
                            [out bool threadIdValid])

        bool   shuffleDownNV(bool  data, uint index, uint width,
                            [out bool threadIdValid])
        bvec2  shuffleDownNV(bvec2 data, uint index, uint width,
                            [out bool threadIdValid])
        bvec3  shuffleDownNV(bvec3 data, uint index, uint width,
                            [out bool threadIdValid])
        bvec4  shuffleDownNV(bvec4 data, uint index, uint width,
                            [out bool threadIdValid])


        int    shuffleUpNV(int data,   uint index, uint width,
                            [out bool threadIdValid])
        ivec2  shuffleUpNV(ivec2 data, uint index, uint width,
                            [out bool threadIdValid])
        ivec3  shuffleUpNV(ivec3 data, uint index, uint width,
                            [out bool threadIdValid])
        ivec4  shuffleUpNV(ivec4 data, uint index, uint width,
                            [out bool threadIdValid])

        uint   shuffleUpNV(uint  data, uint index, uint width,
                            [out bool threadIdValid])
        uvec2  shuffleUpNV(uvec2 data, uint index, uint width,
                            [out bool threadIdValid])
        uvec3  shuffleUpNV(uvec3 data, uint index, uint width,
                            [out bool threadIdValid])
        uvec4  shuffleUpNV(uvec4 data, uint index, uint width,
                            [out bool threadIdValid])

        float  shuffleUpNV(float data, uint index, uint width,
                            [out bool threadIdValid])
        vec2   shuffleUpNV(vec2  data, uint index, uint width,
                            [out bool threadIdValid])
        vec3   shuffleUpNV(vec3  data, uint index, uint width,
                            [out bool threadIdValid])
        vec4   shuffleUpNV(vec4  data, uint index, uint width,
                            [out bool threadIdValid])

        bool   shuffleUpNV(bool  data, uint index, uint width,
                            [out bool threadIdValid])
        bvec2  shuffleUpNV(bvec2 data, uint index, uint width,
                            [out bool threadIdValid])
        bvec3  shuffleUpNV(bvec3 data, uint index, uint width,
                            [out bool threadIdValid])
        bvec4  shuffleUpNV(bvec4 data, uint index, uint width,
                            [out bool threadIdValid])


        int    shuffleXorNV(int data,   uint index, uint width,
                            [out bool threadIdValid])
        ivec2  shuffleXorNV(ivec2 data, uint index, uint width,
                            [out bool threadIdValid])
        ivec3  shuffleXorNV(ivec3 data, uint index, uint width,
                            [out bool threadIdValid])
        ivec4  shuffleXorNV(ivec4 data, uint index, uint width,
                            [out bool threadIdValid])

        uint   shuffleXorNV(uint  data, uint index, uint width,
                            [out bool threadIdValid])
        uvec2  shuffleXorNV(uvec2 data, uint index, uint width,
                            [out bool threadIdValid])
        uvec3  shuffleXorNV(uvec3 data, uint index, uint width,
                            [out bool threadIdValid])
        uvec4  shuffleXorNV(uvec4 data, uint index, uint width,
                            [out bool threadIdValid])

        float  shuffleXorNV(float data, uint index, uint width,
                            [out bool threadIdValid])
        vec2   shuffleXorNV(vec2  data, uint index, uint width,
                            [out bool threadIdValid])
        vec3   shuffleXorNV(vec3  data, uint index, uint width,
                            [out bool threadIdValid])
        vec4   shuffleXorNV(vec4  data, uint index, uint width,
                            [out bool threadIdValid])

        bool   shuffleXorNV(bool  data, uint index, uint width,
                            [out bool threadIdValid])
        bvec2  shuffleXorNV(bvec2 data, uint index, uint width,
                            [out bool threadIdValid])
        bvec3  shuffleXorNV(bvec3 data, uint index, uint width,
                            [out bool threadIdValid])
        bvec4  shuffleXorNV(bvec4 data, uint index, uint width,
                            [out bool threadIdValid])


        int    shuffleNV(int data,   uint index, uint width,
                            [out bool threadIdValid])
        ivec2  shuffleNV(ivec2 data, uint index, uint width,
                            [out bool threadIdValid])
        ivec3  shuffleNV(ivec3 data, uint index, uint width,
                            [out bool threadIdValid])
        ivec4  shuffleNV(ivec4 data, uint index, uint width,
                            [out bool threadIdValid])

        uint   shuffleNV(uint  data, uint index, uint width,
                            [out bool threadIdValid])
        uvec2  shuffleNV(uvec2 data, uint index, uint width,
                            [out bool threadIdValid])
        uvec3  shuffleNV(uvec3 data, uint index, uint width,
                            [out bool threadIdValid])
        uvec4  shuffleNV(uvec4 data, uint index, uint width,
                            [out bool threadIdValid])

        float  shuffleNV(float data, uint index, uint width,
                            [out bool threadIdValid])
        vec2   shuffleNV(vec2  data, uint index, uint width,
                            [out bool threadIdValid])
        vec3   shuffleNV(vec3  data, uint index, uint width,
                            [out bool threadIdValid])
        vec4   shuffleNV(vec4  data, uint index, uint width,
                            [out bool threadIdValid])

        bool   shuffleNV(bool  data, uint index, uint width,
                            [out bool threadIdValid])
        bvec2  shuffleNV(bvec2 data, uint index, uint width,
                            [out bool threadIdValid])
        bvec3  shuffleNV(bvec3 data, uint index, uint width,
                            [out bool threadIdValid])
        bvec4  shuffleNV(bvec4 data, uint index, uint width,
                            [out bool threadIdValid])

    Shuffle functions allow active threads within a thread group to exchange
    data using 4 different modes (up, down, xor, indexed).  They all load
    the operand <data>, which can be different per thread, and return a value
    read from the source thread at an address computed from the <index> and
    <width> operands.

    <index> is a 5-bit value in the range 0 to 31; higher-order bits are
    ignored.  <threadIdValid> is an optional operand.  It holds the value of
    the predicate that specifies whether the source thread from which the
    current thread reads data is in range.

    <width> is used for dividing the thread group into multiple segments.  The
    segments must be of equal size, so <width> needs to be a power of 2 in the
    range 2 to 32.  Using a <width> of 32 treats the thread group as a single
    segment.  A <width> of 8 divides the thread group into 4 segments of size
    8.  Using a <width> that is not a power of 2, or that is lower than 2 or
    larger than 32, will return an undefined value.
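
    As an informal illustration of the <width> rule above (not part of this
    specification), the following C sketch checks whether a width gives a
    defined result and how many segments it produces.  The helper names are
    hypothetical, and a thread group of 32 threads is assumed, matching the
    0 to 31 index range.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when <width> is a power of 2 in the range
   2 to 32, the only widths for which the shuffle result is defined. */
static bool shuffle_width_is_valid(uint32_t width)
{
    return width >= 2u && width <= 32u && (width & (width - 1u)) == 0u;
}

/* Number of equal segments in an assumed 32-thread group, or 0 when
   <width> is not a valid width. */
static uint32_t shuffle_segment_count(uint32_t width)
{
    return shuffle_width_is_valid(width) ? 32u / width : 0u;
}
```

    For instance, shuffle_segment_count(8) yields 4, while
    shuffle_segment_count(12) yields 0 because 12 is not a power of 2.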

    Threads can only share data within their own segment.  Each thread
    executing the built-in shuffle function will determine the ID of another
    thread by combining its value of gl_ThreadInWarpNV with its value of
    <index> as described below.  Such threads will attempt to read the value of
    <data> in the computed other thread and return that value to the caller.

    When a shuffle function attempts to access the value of <data> from another
    thread, it determines whether the other thread is in accessible range or
    not.  If it is in range, true will be returned in the optional
    <threadIdValid> parameter, if provided by the caller.  If it's out of
    range, false will be returned in <threadIdValid>, if provided by the
    caller, and the value returned by the function will come from the current
    thread.


    The 4 modes use the following logic to compute the source thread index and
    the <threadIdValid> value:

    shuffleNV computes the source index using <index> as an absolute address
    within the thread group segment.

        srcThreadId = <index>
        <threadIdValid> = <index> < <width>

      For example, with this thread group segment:

                        -----------------
       Thread Id        |0|1|2|3|4|5|6|7|
                        -----------------
       Thread <data>    |a|b|c|d|e|f|g|h|
                        -----------------

      If <index> is 2

                        -----------------
       src thread Id    |2|2|2|2|2|2|2|2|
                        -----------------
       <threadIdValid>  |1|1|1|1|1|1|1|1|
                        -----------------
       result           |c|c|c|c|c|c|c|c|
                        -----------------

      If <index> is 9

                        -----------------
       src thread Id    |9|9|9|9|9|9|9|9|
                        -----------------
       <threadIdValid>  |0|0|0|0|0|0|0|0|
                        -----------------
       result           |a|b|c|d|e|f|g|h|
                        -----------------
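
    The indexed mode can be modelled on the host with a few lines of C.  This
    is only a sketch for reasoning about the formulas above, not an
    implementation of the built-in: it covers a single segment, takes thread
    ids relative to the start of that segment, and the function name is
    hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Host-side model of shuffleNV for ONE segment of `width` threads.
   data[tid] is the <data> value of thread tid; thread ids are taken
   relative to the start of the segment (an assumption of this sketch).
   result[] and valid[] must each hold at least `width` entries. */
static void model_shuffle_indexed(const char *data, uint32_t width,
                                  uint32_t index, char *result, bool *valid)
{
    index &= 31u;                       /* only the low 5 bits are used */
    for (uint32_t tid = 0; tid < width; ++tid) {
        uint32_t src = index;           /* srcThreadId = <index> */
        valid[tid] = src < width;       /* <threadIdValid> = <index> < <width> */
        /* Out of range: fall back to the current thread's own value. */
        result[tid] = valid[tid] ? data[src] : data[tid];
    }
}
```

    Called with the data "abcdefgh" and a width of 8, an index of 2 makes
    every thread read the value held by thread 2, while an index of 9 leaves
    every thread with its own value and a false <threadIdValid>.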


    shuffleUpNV subtracts <index> from the current thread id to get the source
    thread id.  This has the effect of shifting the segment up by <index>
    threads.  Source thread ids do not wrap around, so the <index>
    lowest-numbered threads keep their own value.

        srcThreadId = currentThreadId - <index>
        <threadIdValid> = srcThreadId >= 0

      For example, with this thread group segment:

                        -----------------
       Thread Id        |0|1|2|3|4|5|6|7|
                        -----------------
       Thread <data>    |a|b|c|d|e|f|g|h|
                        -----------------

      If <index> is 1

                        ------------------
       src thread Id    |-1|0|1|2|3|4|5|6|
                        ------------------
       <threadIdValid>  |0 |1|1|1|1|1|1|1|
                        ------------------
       result           |a |a|b|c|d|e|f|g|
                        ------------------
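
    A common use of the up mode is an inclusive prefix sum across a segment.
    The C sketch below models one segment on the host to show the idea; it is
    illustrative only.  The function names are hypothetical, thread ids are
    relative to the segment start, and the width is assumed to be at most 32.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MAX_WIDTH 32u

/* Host-side model of shuffleUpNV for one segment of `width` threads. */
static void model_shuffle_up(const int *data, uint32_t width, uint32_t index,
                             int *result, bool *valid)
{
    index &= 31u;                                    /* low 5 bits only */
    for (uint32_t tid = 0; tid < width; ++tid) {
        int32_t src = (int32_t)tid - (int32_t)index; /* may go negative */
        valid[tid] = src >= 0;                       /* no wrap-around */
        result[tid] = valid[tid] ? data[src] : data[tid];
    }
}

/* Inclusive prefix sum built from shuffle-up steps with offsets 1, 2, 4, ...
   Each thread adds the neighbour's partial sum only when it was in range. */
static void model_inclusive_scan(int *values, uint32_t width)
{
    int other[MODEL_MAX_WIDTH];
    bool valid[MODEL_MAX_WIDTH];
    for (uint32_t offset = 1u; offset < width; offset <<= 1) {
        model_shuffle_up(values, width, offset, other, valid);
        for (uint32_t tid = 0; tid < width; ++tid) {
            if (valid[tid]) {
                values[tid] += other[tid];
            }
        }
    }
}
```

    The <threadIdValid> predicate is what keeps the low threads from adding
    their own value twice once the offset reaches past the segment start.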


    shuffleDownNV adds <index> to the current thread id to get the source
    thread id.  This has the effect of shifting the segment down by
    <index> threads.  Source thread ids do not wrap around, so the <index>
    highest-numbered threads keep their own value.

        srcThreadId = currentThreadId + <index>
        <threadIdValid> = srcThreadId < <width>

      For example, with this thread group segment:

                        -----------------
       Thread Id        |0|1|2|3|4|5|6|7|
                        -----------------
       Thread <data>    |a|b|c|d|e|f|g|h|
                        -----------------

      If <index> is 2

                        -----------------
       src thread Id    |2|3|4|5|6|7|8|9|
                        -----------------
       <threadIdValid>  |1|1|1|1|1|1|0|0|
                        -----------------
       result           |c|d|e|f|g|h|g|h|
                        -----------------
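
    The down mode is often used to reduce a segment to a single value held by
    its first thread.  The following host-side C sketch models that pattern
    for one segment; it is an illustration under stated assumptions, not a
    prescribed algorithm.  The function names are hypothetical, thread ids
    are relative to the segment start, and the width is assumed to be a power
    of 2 no larger than 32.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MAX_WIDTH 32u

/* Host-side model of shuffleDownNV for one segment of `width` threads. */
static void model_shuffle_down(const int *data, uint32_t width, uint32_t index,
                               int *result, bool *valid)
{
    index &= 31u;                         /* low 5 bits only */
    for (uint32_t tid = 0; tid < width; ++tid) {
        uint32_t src = tid + index;       /* srcThreadId = currentThreadId + <index> */
        valid[tid] = src < width;         /* no wrap-around */
        result[tid] = valid[tid] ? data[src] : data[tid];
    }
}

/* Sum of all values in the segment, as seen by thread 0 after halving the
   shuffle distance at every step (width/2, width/4, ..., 1). */
static int model_reduce_to_thread0(const int *values, uint32_t width)
{
    int partial[MODEL_MAX_WIDTH];
    int other[MODEL_MAX_WIDTH];
    bool valid[MODEL_MAX_WIDTH];
    for (uint32_t tid = 0; tid < width; ++tid) {
        partial[tid] = values[tid];
    }
    for (uint32_t offset = width / 2u; offset > 0u; offset >>= 1) {
        model_shuffle_down(partial, width, offset, other, valid);
        for (uint32_t tid = 0; tid < width; ++tid) {
            if (valid[tid]) {
                partial[tid] += other[tid];
            }
        }
    }
    return partial[0];
}
```

    Only thread 0 is guaranteed to hold the full sum at the end; the other
    threads hold partial sums of the values above them.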


    shuffleXorNV does a bitwise xor between <index> and the current
    thread id to get the source thread id:

        srcThreadId = currentThreadId ^ <index>
        <threadIdValid> = srcThreadId < <width>

      For example, with this thread group segment:

                        -----------------
       Thread Id        |0|1|2|3|4|5|6|7|
                        -----------------
       Thread <data>    |a|b|c|d|e|f|g|h|
                        -----------------

      If <index> is 0x1

                        -----------------
       src thread Id    |1|0|3|2|5|4|7|6|
                        -----------------
       <threadIdValid>  |1|1|1|1|1|1|1|1|
                        -----------------
       result           |b|a|d|c|f|e|h|g|
                        -----------------
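
    Because the xor mode pairs threads symmetrically, it lends itself to a
    butterfly reduction in which every thread of a segment ends up with the
    same total.  The C sketch below models this on the host for one segment.
    It is illustrative only: the function names are hypothetical, thread ids
    are relative to the segment start, and the width is assumed to be a power
    of 2 no larger than 32.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MAX_WIDTH 32u

/* Host-side model of shuffleXorNV for one segment of `width` threads. */
static void model_shuffle_xor(const int *data, uint32_t width, uint32_t index,
                              int *result, bool *valid)
{
    index &= 31u;                         /* low 5 bits only */
    for (uint32_t tid = 0; tid < width; ++tid) {
        uint32_t src = tid ^ index;       /* srcThreadId = currentThreadId ^ <index> */
        valid[tid] = src < width;
        result[tid] = valid[tid] ? data[src] : data[tid];
    }
}

/* Butterfly all-reduce: after log2(width) exchange steps every thread
   holds the sum of the whole segment. */
static void model_xor_all_reduce(int *values, uint32_t width)
{
    int other[MODEL_MAX_WIDTH];
    bool valid[MODEL_MAX_WIDTH];
    for (uint32_t index = width / 2u; index > 0u; index >>= 1) {
        /* index < width, so every partner is inside the segment. */
        model_shuffle_xor(values, width, index, other, valid);
        for (uint32_t tid = 0; tid < width; ++tid) {
            values[tid] += other[tid];
        }
    }
}
```

    With index values of width/2, width/4, ..., 1 the partner thread always
    lies inside the segment, so the predicate does not need to be consulted
    in the reduction loop.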

Dependencies on NV_gpu_program5

    If NV_gpu_program5 is supported and "OPTION NV_shader_thread_shuffle" is
    specified in an assembly program, the following edits are made to extend
    the assembly programming model documented in the NV_gpu_program4 extension
    and extended by NV_gpu_program5.

    If NV_gpu_program5 is not supported, or if
    "OPTION NV_shader_thread_shuffle" is not specified in an assembly program,
    the contents of this dependencies section should be ignored.

    Section 2.X.2, Program Grammar

    (add the following rules to the grammar)

    <VECTORop>              ::= "SHFDOWN"
                              | "SHFIDX"
                              | "SHFUP"
                              | "SHFXOR"


    Modify Section 2.X.4, Program Execution Environment

    (Add the table entries and relevant text describing the program
     instructions to exchange data between threads.)

      Instr-      Modifiers
      uction  V  F I C S H D  Out   Inputs    Description
      ------- -- - - - - - -  ---   --------  --------------------------------
      ...
      SHFDOWN 50 X X - - - - F  v   v,vu,vu   warp shuffle with added index
      SHFIDX  50 X X - - - - F  v   v,vu,vu   warp shuffle with absolute index
      SHFUP   50 X X - - - - F  v   v,vu,vu   warp shuffle with subtracted index
      SHFXOR  50 X X - - - - F  v   v,vu,vu   warp shuffle with XORed index
      ...


    (Add to "Section 2.X.6, Program Options" of the NV_gpu_program4 extension,
     as extended by NV_gpu_program5)

    + Shader thread shuffle (NV_shader_thread_shuffle)

    If a program specifies the "NV_shader_thread_shuffle" option, it may use
    the "SHFXOR", "SHFDOWN", "SHFIDX" and "SHFUP" instructions.  If this option
    is not specified, a program will fail to compile if it uses those
    instructions.


    Section 2.X.8.Z, SHFDOWN:  warp shuffle with added index

    The SHFDOWN instruction allows a 32-bit scalar value to be exchanged
    between multiple threads within a thread group.  The instruction has 3
    input operands.  The first operand is a 32-bit scalar.  This is the value
    shared between threads; it can be a float, a signed integer or an unsigned
    integer.  The second operand is an unsigned integer index in the range 0 to
    31.  It is used to compute the thread from which the current thread will
    read the 32-bit scalar value.  For the SHFDOWN instruction, the source
    thread id is the id of the current thread plus the index operand.

    The last operand is an unsigned integer mask.  The mask is used for
    segmenting the thread group and limiting the source thread index.  Bits 0
    to 4 of <mask> are a clamp value that limits the source thread index, and
    bits 8 to 12 are a segmentation mask used to divide the thread group into
    multiple smaller groups.  Together, the clamp value and the segmentation
    mask generate 2 internal values, minThreadId and maxThreadId, using the
    following logic:

      minThreadId = current thread id & segmentationMask

      maxThreadId = minThreadId | (clamp & ~segmentationMask)

    Those 2 values segment the thread group by restricting the address
    range a specific thread can access.
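
    The derivation of minThreadId and maxThreadId from <mask> can be followed
    with a small host-side C sketch.  It is offered as an aid to reading the
    formulas above, not as part of the assembly language; the type and
    function names are hypothetical, and the example mask value 0x181F
    (segmentation mask 0x18, clamp 0x1F) is an illustrative choice that
    yields segments of 8 threads.

```c
#include <stdint.h>

/* Hypothetical container for the two internal values. */
typedef struct {
    uint32_t min_id;
    uint32_t max_id;
} shuffle_range;

/* Decode <mask> and apply the minThreadId / maxThreadId formulas. */
static shuffle_range shuffle_range_from_mask(uint32_t thread_id, uint32_t mask)
{
    uint32_t clamp   = mask & 0x1Fu;         /* bits 0 to 4  */
    uint32_t segmask = (mask >> 8) & 0x1Fu;  /* bits 8 to 12 */
    shuffle_range r;
    r.min_id = thread_id & segmask;
    r.max_id = r.min_id | (clamp & ~segmask);
    return r;
}
```

    For thread 13 and a mask of 0x181F this gives minThreadId 8 and
    maxThreadId 15; with a mask of 0x001F (no segmentation) the same thread
    gets minThreadId 0 and maxThreadId 31.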

    SHFDOWN returns a 2-component vector.  The first component is a predicate
    that is TRUE when the computed source thread id is in range and FALSE when
    it's out of bounds.  For SHFDOWN, the source thread id is in range when it
    is lower than maxThreadId.  The second component holds a 32-bit value.
    When the source thread id is in range, this value comes from the source
    thread.  When the source thread id is out of range, the value is read from
    the current thread.  If the source thread id refers to an inactive
    thread, the returned result will be undefined.

    SHFDOWN supports all data type modifiers.  For floating-point data types,
    the TRUE value is 1.0 and the FALSE value is 0.0.  For signed integer data
    types, the TRUE value is -1 and the FALSE value is 0.  For unsigned integer
    data types, the TRUE value is the maximum integer value (all bits are ones)
    and the FALSE value is zero.
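
    The predicate encodings listed above can be summarised as 32-bit
    patterns.  The C sketch below is illustrative only: the enumerator and
    function names are hypothetical, and it assumes that float is an
    IEEE 754 single-precision value on the host.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical labels for the three families of data type modifiers. */
typedef enum { MOD_FLOAT, MOD_SIGNED, MOD_UNSIGNED } data_modifier;

/* Bit pattern of the predicate component for a given data type. */
static uint32_t predicate_bits(data_modifier mod, bool in_range)
{
    if (!in_range) {
        return 0u;          /* 0.0, 0 and 0u share the all-zero pattern */
    }
    if (mod == MOD_FLOAT) {
        float one = 1.0f;
        uint32_t bits;
        memcpy(&bits, &one, sizeof bits);   /* 1.0f is 0x3F800000 */
        return bits;
    }
    /* -1 (signed) and the maximum unsigned value are both all ones. */
    return UINT32_MAX;
}
```

    The signed and unsigned TRUE values are the same bit pattern; only the
    floating-point encoding differs.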


    Section 2.X.8.Z, SHFIDX:  warp shuffle with absolute index

    The SHFIDX instruction allows a 32-bit scalar value to be exchanged between
    multiple threads within a thread group.  The instruction has 3 input
    operands.  The first operand is a 32-bit scalar.  This is the value shared
    between threads; it can be a float, a signed integer or an unsigned
    integer.  The second operand is an unsigned integer index in the range 0
    to 31.  It is used to compute the thread from which the current thread
    will read the 32-bit scalar value.  For the SHFIDX instruction, the source
    thread id is computed using the following operation, where
    segmentationMask and minThreadId are derived from the last operand as
    described below:

      source thread id = (index operand & ~segmentationMask) | minThreadId

    The last operand is an unsigned integer mask.  The mask is used for
    segmenting the thread group and limiting the source thread index.  Bits 0
    to 4 of <mask> are a clamp value that limits the source thread index, and
    bits 8 to 12 are a segmentation mask used to divide the thread group into
    multiple smaller groups.  Together, the clamp value and the segmentation
    mask generate 2 internal values, minThreadId and maxThreadId, using the
    following logic:

      minThreadId = current thread id & segmentationMask

      maxThreadId = minThreadId | (clamp & ~segmentationMask)

    Those 2 values segment the thread group by restricting the address
    range a specific thread can access.
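
    The SHFIDX source thread computation can be traced with the host-side C
    sketch below.  It only restates the formula above in code; the function
    name is hypothetical and the mask value 0x181F used in the note that
    follows is an illustrative choice giving segments of 8 threads.

```c
#include <stdint.h>

/* Source thread id for SHFIDX:
   (index operand & ~segmentationMask) | minThreadId */
static uint32_t shfidx_source_id(uint32_t thread_id, uint32_t index,
                                 uint32_t mask)
{
    uint32_t segmask = (mask >> 8) & 0x1Fu;   /* bits 8 to 12 of <mask> */
    uint32_t min_id  = thread_id & segmask;   /* minThreadId */
    return ((index & 0x1Fu) & ~segmask) | min_id;
}
```

    For thread 13 with a mask of 0x181F, an index operand of 2 selects thread
    10, the third thread of the segment that starts at thread 8.  With a mask
    of 0x001F the index operand is used unchanged.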

    SHFIDX returns a 2-component vector.  The first component is a predicate
    that is TRUE when the computed source thread id is in range and FALSE when
    it's out of bounds.  For SHFIDX, the source thread id is in range when it
    is lower than maxThreadId.  The second component holds a 32-bit value.
    When the source thread id is in range, this value comes from the source
    thread.  When the source thread id is out of range, the value is read from
    the current thread.  If the source thread id refers to an inactive
    thread, the returned result will be undefined.

    SHFIDX supports all data type modifiers.  For floating-point data types,
    the TRUE value is 1.0 and the FALSE value is 0.0.  For signed integer data
    types, the TRUE value is -1 and the FALSE value is 0.  For unsigned integer
    data types, the TRUE value is the maximum integer value (all bits are ones)
    and the FALSE value is zero.


    Section 2.X.8.Z, SHFUP:  warp shuffle with subtracted index

    The SHFUP instruction allows a 32-bit scalar value to be exchanged between
    multiple threads within a thread group.  The instruction has 3 input
    operands.  The first operand is a 32-bit scalar.  This is the value shared
    between threads; it can be a float, a signed integer or an unsigned
    integer.  The second operand is an unsigned integer index in the range 0
    to 31.  It is used to compute the thread from which the current thread
    will read the 32-bit scalar value.  For the SHFUP instruction, the source
    thread id is the id of the current thread minus the index operand.

    The last operand is an unsigned integer mask.  The mask is used for
    segmenting the thread group and limiting the source thread index.  Bits 0
    to 4 of <mask> are a clamp value that limits the source thread index, and
    bits 8 to 12 are a segmentation mask used to divide the thread group into
    multiple smaller groups.  Together, the clamp value and the segmentation
    mask generate 2 internal values, minThreadId and maxThreadId, using the
    following logic:

      minThreadId = current thread id & segmentationMask

      maxThreadId = minThreadId | (clamp & ~segmentationMask)

    Those 2 values segment the thread group by restricting the address
    range a specific thread can access.

    SHFUP returns a 2-component vector.  The first component is a predicate
    that is TRUE when the computed source thread id is in range and FALSE when
    it's out of bounds.  For SHFUP, the source thread id is in range when it
    is greater than maxThreadId.  The second component holds a 32-bit value.
    When the source thread id is in range, this value comes from the source
    thread.  When the source thread id is out of range, the value is read from
    the current thread.  If the source thread id refers to an inactive
    thread, the returned result will be undefined.

    SHFUP supports all data type modifiers.  For floating-point data types,
    the TRUE value is 1.0 and the FALSE value is 0.0.  For signed integer data
    types, the TRUE value is -1 and the FALSE value is 0.  For unsigned integer
    data types, the TRUE value is the maximum integer value (all bits are ones)
    and the FALSE value is zero.


    Section 2.X.8.Z, SHFXOR:  warp shuffle with XORed index

    The SHFXOR instruction allows a 32-bit scalar value to be exchanged
    between multiple threads within a thread group.  The instruction has 3
    input operands.  The first operand is a 32-bit scalar.  This is the value
    shared between threads; it can be a float, a signed integer or an unsigned
    integer.  The second operand is an unsigned integer index in the range 0 to
    31.  It is used to compute the thread from which the current thread will
    read the 32-bit scalar value.  For the SHFXOR instruction, the source
    thread id is the id of the current thread XORed with the index operand.

    The last operand is an unsigned integer mask.  The mask is used for
    segmenting the thread group and limiting the source thread index.  Bits 0
    to 4 of <mask> are a clamp value that limits the source thread index, and
    bits 8 to 12 are a segmentation mask used to divide the thread group into
    multiple smaller groups.  Together, the clamp value and the segmentation
    mask generate 2 internal values, minThreadId and maxThreadId, using the
    following logic:

      minThreadId = current thread id & segmentationMask

      maxThreadId = minThreadId | (clamp & ~segmentationMask)

    Those 2 values segment the thread group by restricting the address
    range a specific thread can access.

    SHFXOR returns a 2-component vector.  The first component is a predicate
    that is TRUE when the computed source thread id is in range and FALSE when
    it's out of bounds.  For SHFXOR, the source thread id is in range when it
    is lower than maxThreadId.  The second component holds a 32-bit value.
    When the source thread id is in range, this value comes from the source
    thread.  When the source thread id is out of range, the value is read from
    the current thread.  If the source thread id refers to an inactive
    thread, the returned result will be undefined.

    SHFXOR supports all data type modifiers.  For floating-point data types,
    the TRUE value is 1.0 and the FALSE value is 0.0.  For signed integer data
    types, the TRUE value is -1 and the FALSE value is 0.  For unsigned integer
    data types, the TRUE value is the maximum integer value (all bits are ones)
    and the FALSE value is zero.

Errors

    None.

New State

    None.

New Implementation Dependent State

    None.

Issues

    None


Revision History

    Rev.    Date    Author    Changes
    ----  --------  --------  -----------------------------------------
     3     2/14/14  jbreton    Rename the extension from NVX to NV.
     2      9/4/13  jbreton    Replace mask by width in the shuffle functions.
     1    11/27/12  jbreton    Internal revisions.
