Name

    NV_shader_thread_shuffle

Name Strings

    GL_NV_shader_thread_shuffle

Contributors

    Jeannot Breton, NVIDIA
    Pat Brown, NVIDIA
    Eric Werness, NVIDIA
    Mark Kilgard, NVIDIA

Contact

    Jeannot Breton, NVIDIA Corporation (jbreton 'at' nvidia.com)

Status

    Shipping.

Version

    Last Modified Date:         2/14/2014
    NVIDIA Revision:            3

Number

    OpenGL Extension #448

Dependencies

    This extension is written against the OpenGL 4.3 (Compatibility Profile)
    Specification.

    This extension is written against version 4.30 (revision 07) of the OpenGL
    Shading Language Specification.

    OpenGL 4.3 and GLSL 4.3 are required.

    This extension interacts with NV_gpu_program5.

Overview

    Implementations of the OpenGL Shading Language may, but are not required
    to, run multiple shader threads for a single stage as a SIMD thread group,
    where individual execution threads are assigned to thread groups in an
    undefined, implementation-dependent order.  This extension adds a set of
    new features to the OpenGL Shading Language for sharing data between
    multiple threads within a thread group.

    Shaders using the new functionality provided by this extension should
    enable it via the construct

        #extension GL_NV_shader_thread_shuffle : require     (or enable)

    This extension also specifies some modifications to the program assembly
    language to support the thread data sharing functionality.

New Procedures and Functions

    None


New Tokens

    None


Modifications to The OpenGL Shading Language Specification, Version 4.30
(Revision 07)

    Including the following line in a shader can be used to control the
    language features described in this extension:

      #extension GL_NV_shader_thread_shuffle : <behavior>

    where <behavior> is as specified in section 3.3.

    New preprocessor #defines are added to the OpenGL Shading Language:

      #define GL_NV_shader_thread_shuffle         1


    Modify Section 8.3, Common Functions, p. 133

    (add a function to share data between threads in a thread group)

    Syntax:

        int    shuffleDownNV(int data,   uint index, uint width,
                            [out bool threadIdValid])
        ivec2  shuffleDownNV(ivec2 data, uint index, uint width,
                            [out bool threadIdValid])
        ivec3  shuffleDownNV(ivec3 data, uint index, uint width,
                            [out bool threadIdValid])
        ivec4  shuffleDownNV(ivec4 data, uint index, uint width,
                            [out bool threadIdValid])

        uint   shuffleDownNV(uint  data, uint index, uint width,
                            [out bool threadIdValid])
        uvec2  shuffleDownNV(uvec2 data, uint index, uint width,
                            [out bool threadIdValid])
        uvec3  shuffleDownNV(uvec3 data, uint index, uint width,
                            [out bool threadIdValid])
        uvec4  shuffleDownNV(uvec4 data, uint index, uint width,
                            [out bool threadIdValid])

        float  shuffleDownNV(float data, uint index, uint width,
                            [out bool threadIdValid])
        vec2   shuffleDownNV(vec2  data, uint index, uint width,
                            [out bool threadIdValid])
        vec3   shuffleDownNV(vec3  data, uint index, uint width,
                            [out bool threadIdValid])
        vec4   shuffleDownNV(vec4  data, uint index, uint width,
                            [out bool threadIdValid])

        bool   shuffleDownNV(bool  data, uint index, uint width,
                            [out bool threadIdValid])
        bvec2  shuffleDownNV(bvec2 data, uint index, uint width,
                            [out bool threadIdValid])
        bvec3  shuffleDownNV(bvec3 data, uint index, uint width,
                            [out bool threadIdValid])
        bvec4  shuffleDownNV(bvec4 data, uint index, uint width,
                            [out bool threadIdValid])


        int    shuffleUpNV(int data,   uint index, uint width,
                            [out bool threadIdValid])
        ivec2  shuffleUpNV(ivec2 data, uint index, uint width,
                            [out bool threadIdValid])
        ivec3  shuffleUpNV(ivec3 data, uint index, uint width,
                            [out bool threadIdValid])
        ivec4  shuffleUpNV(ivec4 data, uint index, uint width,
                            [out bool threadIdValid])

        uint   shuffleUpNV(uint  data, uint index, uint width,
                            [out bool threadIdValid])
        uvec2  shuffleUpNV(uvec2 data, uint index, uint width,
                            [out bool threadIdValid])
        uvec3  shuffleUpNV(uvec3 data, uint index, uint width,
                            [out bool threadIdValid])
        uvec4  shuffleUpNV(uvec4 data, uint index, uint width,
                            [out bool threadIdValid])

        float  shuffleUpNV(float data, uint index, uint width,
                            [out bool threadIdValid])
        vec2   shuffleUpNV(vec2  data, uint index, uint width,
                            [out bool threadIdValid])
        vec3   shuffleUpNV(vec3  data, uint index, uint width,
                            [out bool threadIdValid])
        vec4   shuffleUpNV(vec4  data, uint index, uint width,
                            [out bool threadIdValid])

        bool   shuffleUpNV(bool  data, uint index, uint width,
                            [out bool threadIdValid])
        bvec2  shuffleUpNV(bvec2 data, uint index, uint width,
                            [out bool threadIdValid])
        bvec3  shuffleUpNV(bvec3 data, uint index, uint width,
                            [out bool threadIdValid])
        bvec4  shuffleUpNV(bvec4 data, uint index, uint width,
                            [out bool threadIdValid])


        int    shuffleXorNV(int data,   uint index, uint width,
                            [out bool threadIdValid])
        ivec2  shuffleXorNV(ivec2 data, uint index, uint width,
                            [out bool threadIdValid])
        ivec3  shuffleXorNV(ivec3 data, uint index, uint width,
                            [out bool threadIdValid])
        ivec4  shuffleXorNV(ivec4 data, uint index, uint width,
                            [out bool threadIdValid])

        uint   shuffleXorNV(uint  data, uint index, uint width,
                            [out bool threadIdValid])
        uvec2  shuffleXorNV(uvec2 data, uint index, uint width,
                            [out bool threadIdValid])
        uvec3  shuffleXorNV(uvec3 data, uint index, uint width,
                            [out bool threadIdValid])
        uvec4  shuffleXorNV(uvec4 data, uint index, uint width,
                            [out bool threadIdValid])

        float  shuffleXorNV(float data, uint index, uint width,
                            [out bool threadIdValid])
        vec2   shuffleXorNV(vec2  data, uint index, uint width,
                            [out bool threadIdValid])
        vec3   shuffleXorNV(vec3  data, uint index, uint width,
                            [out bool threadIdValid])
        vec4   shuffleXorNV(vec4  data, uint index, uint width,
                            [out bool threadIdValid])

        bool   shuffleXorNV(bool  data, uint index, uint width,
                            [out bool threadIdValid])
        bvec2  shuffleXorNV(bvec2 data, uint index, uint width,
                            [out bool threadIdValid])
        bvec3  shuffleXorNV(bvec3 data, uint index, uint width,
                            [out bool threadIdValid])
        bvec4  shuffleXorNV(bvec4 data, uint index, uint width,
                            [out bool threadIdValid])


        int    shuffleNV(int data,   uint index, uint width,
                            [out bool threadIdValid])
        ivec2  shuffleNV(ivec2 data, uint index, uint width,
                            [out bool threadIdValid])
        ivec3  shuffleNV(ivec3 data, uint index, uint width,
                            [out bool threadIdValid])
        ivec4  shuffleNV(ivec4 data, uint index, uint width,
                            [out bool threadIdValid])

        uint   shuffleNV(uint  data, uint index, uint width,
                            [out bool threadIdValid])
        uvec2  shuffleNV(uvec2 data, uint index, uint width,
                            [out bool threadIdValid])
        uvec3  shuffleNV(uvec3 data, uint index, uint width,
                            [out bool threadIdValid])
        uvec4  shuffleNV(uvec4 data, uint index, uint width,
                            [out bool threadIdValid])

        float  shuffleNV(float data, uint index, uint width,
                            [out bool threadIdValid])
        vec2   shuffleNV(vec2  data, uint index, uint width,
                            [out bool threadIdValid])
        vec3   shuffleNV(vec3  data, uint index, uint width,
                            [out bool threadIdValid])
        vec4   shuffleNV(vec4  data, uint index, uint width,
                            [out bool threadIdValid])

        bool   shuffleNV(bool  data, uint index, uint width,
                            [out bool threadIdValid])
        bvec2  shuffleNV(bvec2 data, uint index, uint width,
                            [out bool threadIdValid])
        bvec3  shuffleNV(bvec3 data, uint index, uint width,
                            [out bool threadIdValid])
        bvec4  shuffleNV(bvec4 data, uint index, uint width,
                            [out bool threadIdValid])

    Shuffle functions allow active threads within a thread group to exchange
    data using 4 different modes (up, down, xor, indexed).  They all load
    the operand <data>, which can differ per thread, and return a value
    read from the source thread at an address computed from the <index> and
    <width> operands.

    <index> is a 5-bit value in the range 0 to 31; higher-order bits are
    ignored.  <threadIdValid> is an optional operand.  It holds the value of
    the predicate that specifies whether the source thread from which the
    current thread reads data is in range or not.

    <width> is used for segmenting the thread group into multiple segments.
    The segments need to be of equal size, so <width> needs to be a power of
    2 in the range 2 to 32.  Using a <width> of 32 would keep the thread
    group as a single segment.  A <width> of 8 would divide the thread group
    into 4 segments of size 8.  Using a <width> that is not a power of 2, or
    that is lower than 2 or larger than 32, will return an undefined value.
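
    As a rough illustration of the <width> rule, the following C sketch
    checks whether a width is usable and how many segments it yields.  It is
    a CPU-side model and not part of this extension; the names
    THREAD_GROUP_SIZE, is_valid_width and segment_count are hypothetical, and
    a thread group size of 32 is assumed from the stated limit on <width>.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed thread group size, inferred from the upper limit on <width>. */
enum { THREAD_GROUP_SIZE = 32 };

/* True when width is a power of 2 in the range 2..32. */
static bool is_valid_width(unsigned width)
{
    return width >= 2u && width <= 32u && (width & (width - 1u)) == 0u;
}

/* Number of equal-sized segments the thread group is split into. */
static unsigned segment_count(unsigned width)
{
    assert(is_valid_width(width));
    return (unsigned)THREAD_GROUP_SIZE / width;
}
```

    Under these assumptions, segment_count(32) gives 1 and segment_count(8)
    gives 4, matching the two cases described in the text.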

    Threads can only share data within their own segment.  Each thread
    executing the built-in shuffle function will determine the ID of another
    thread by combining its value of gl_ThreadInWarpNV with its value of
    <index> as described below.  Each such thread will attempt to read the
    value of <data> from that other thread and return it to the caller.

    When a shuffle function attempts to access the value of <data> from another
    thread, it determines whether the other thread is in accessible range or
    not.  If it is in range, true will be returned in the optional
    <threadIdValid> parameter, if provided by the caller.  If it is out of
    range, false will be returned in <threadIdValid>, if provided by the
    caller, and the value returned by the function will come from the current
    thread.
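
    The in-range rule can be sketched on the CPU as follows.  This is only a
    model of the behavior described above for a single segment whose thread
    ids start at 0; the type shuffle_result and the function resolve_shuffle
    are hypothetical names, and the source thread id is assumed to have been
    computed already by one of the four modes.

```c
#include <assert.h>
#include <stdbool.h>

/* Outcome of one thread's shuffle call in this model. */
typedef struct {
    int  value;            /* value returned to the caller        */
    bool thread_id_valid;  /* what <threadIdValid> would receive  */
} shuffle_result;

/* data[] holds <data> for each of the `width` threads of one segment.
 * src is the computed source thread id and may be out of range. */
static shuffle_result resolve_shuffle(const int *data, unsigned width,
                                      unsigned current, int src)
{
    shuffle_result r;
    assert(current < width);
    r.thread_id_valid = (src >= 0 && (unsigned)src < width);
    /* Out-of-range reads fall back to the calling thread's own data. */
    r.value = r.thread_id_valid ? data[src] : data[current];
    return r;
}
```

    For example, with 8 threads holding 10..17, thread 3 reading from source
    id 5 obtains 15 with a true predicate, while source id 9 or -1 yields
    the thread's own value and a false predicate.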


    The 4 modes use the following logic to compute the source thread index and
    the <threadIdValid> value:

    shuffleNV computes the source index using <index> as an absolute address
    within the thread group segment.

        srcThreadId = <index>
        <threadIdValid> = <index> < <width>

      For example, with this thread group segment:

                        -----------------
       Thread Id        |0|1|2|3|4|5|6|7|
                        -----------------
       Thread <data>    |a|b|c|d|e|f|g|h|
                        -----------------

      If <index> is 2

                        -----------------
       src thread Id    |2|2|2|2|2|2|2|2|
                        -----------------
       <threadIdValid>  |1|1|1|1|1|1|1|1|
                        -----------------
       result           |c|c|c|c|c|c|c|c|
                        -----------------

      If <index> is 9

                        -----------------
       src thread Id    |9|9|9|9|9|9|9|9|
                        -----------------
       <threadIdValid>  |0|0|0|0|0|0|0|0|
                        -----------------
       result           |a|b|c|d|e|f|g|h|
                        -----------------
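
    A minimal CPU-side sketch of the indexed mode, assuming a single segment
    whose thread ids start at 0 as in the example.  The function name
    model_shuffle_nv is hypothetical and only mirrors the two formulas given
    for shuffleNV.

```c
#include <assert.h>
#include <stdbool.h>

/* Models shuffleNV for one thread of a segment of `width` threads.
 * data[] holds <data> for every thread of that segment. */
static char model_shuffle_nv(const char *data, unsigned width,
                             unsigned current, unsigned index, bool *valid)
{
    unsigned src = index & 31u;      /* only the low 5 bits are used */
    assert(current < width);
    *valid = src < width;            /* <threadIdValid> = <index> < <width> */
    return *valid ? data[src] : data[current];
}
```

    With the segment data a..h, an <index> of 2 returns thread 2's value 'c'
    to every caller, and an <index> of 9 returns each caller's own value with
    a false predicate.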


    shuffleUpNV subtracts <index> from the current thread id to get the source
    thread id.  This has the effect of shifting the segment up by <index>
    threads.  Source thread ids do not wrap around, so the <index> lowest
    thread ids keep their own value.

        srcThreadId = currentThreadId - <index>
        <threadIdValid> = srcThreadId >= 0

      For example, with this thread group segment:

                        -----------------
       Thread Id        |0|1|2|3|4|5|6|7|
                        -----------------
       Thread <data>    |a|b|c|d|e|f|g|h|
                        -----------------

      If <index> is 1

                        ------------------
       src thread Id    |-1|0|1|2|3|4|5|6|
                        ------------------
       <threadIdValid>  |0 |1|1|1|1|1|1|1|
                        ------------------
       result           |a |a|b|c|d|e|f|g|
                        ------------------
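
    The shuffleUpNV rule can be modeled for a whole segment at once.  The
    sketch below is a CPU-side illustration, assuming one segment whose
    thread ids start at 0; model_shuffle_up is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>

/* Fills out[] and valid[] for every thread of one segment of `width`
 * threads, following the shuffleUpNV formulas. */
static void model_shuffle_up(const char *data, unsigned width, unsigned index,
                             char *out, bool *valid)
{
    unsigned current;
    assert(width >= 2u && width <= 32u);
    index &= 31u;                               /* low 5 bits only */
    for (current = 0; current < width; ++current) {
        int src = (int)current - (int)index;    /* may become negative */
        valid[current] = src >= 0;
        out[current] = valid[current] ? data[src] : data[current];
    }
}
```

    Applied to the data a..h with an <index> of 1, this model produces
    a,a,b,c,d,e,f,g, with only thread 0 reporting a false predicate.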


    shuffleDownNV adds <index> to the current thread id to get the source
    thread id.  This has the effect of shifting the segment down by
    <index> threads.  Source thread ids do not wrap around, so the <index>
    highest thread ids keep their own value.

        srcThreadId = currentThreadId + <index>
        <threadIdValid> = srcThreadId < <width>

      For example, with this thread group segment:

                        -----------------
       Thread Id        |0|1|2|3|4|5|6|7|
                        -----------------
       Thread <data>    |a|b|c|d|e|f|g|h|
                        -----------------

      If <index> is 2

                        -----------------
       src thread Id    |2|3|4|5|6|7|8|9|
                        -----------------
       <threadIdValid>  |1|1|1|1|1|1|0|0|
                        -----------------
       result           |c|d|e|f|g|h|g|h|
                        -----------------
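
    The same kind of CPU-side model applies to shuffleDownNV.  As before,
    this is a sketch for a single segment whose thread ids start at 0, and
    model_shuffle_down is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>

/* Fills out[] and valid[] for every thread of one segment of `width`
 * threads, following the shuffleDownNV formulas. */
static void model_shuffle_down(const char *data, unsigned width,
                               unsigned index, char *out, bool *valid)
{
    unsigned current;
    assert(width >= 2u && width <= 32u);
    index &= 31u;                          /* low 5 bits only */
    for (current = 0; current < width; ++current) {
        unsigned src = current + index;    /* may run past the segment */
        valid[current] = src < width;
        out[current] = valid[current] ? data[src] : data[current];
    }
}
```

    Applied to the data a..h with an <index> of 2, this model produces
    c,d,e,f,g,h,g,h, with threads 6 and 7 reporting a false predicate.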


    shuffleXorNV performs a bitwise xor between <index> and the current
    thread id to get the source thread id:

        srcThreadId = currentThreadId ^ <index>
        <threadIdValid> = srcThreadId < <width>

      For example, with this thread group segment:

                        -----------------
       Thread Id        |0|1|2|3|4|5|6|7|
                        -----------------
       Thread <data>    |a|b|c|d|e|f|g|h|
                        -----------------

      If <index> is 0x1

                        -----------------
       src thread Id    |1|0|3|2|5|4|7|6|
                        -----------------
       <threadIdValid>  |1|1|1|1|1|1|1|1|
                        -----------------
       result           |b|a|d|c|f|e|h|g|
                        -----------------
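
    One common use of an xor exchange, not described by this extension
    itself, is a butterfly reduction: by halving the xor offset at each
    step, every thread of a segment ends up with the sum of all values.  The
    C sketch below models that pattern on the CPU for a single segment; the
    names xor_exchange_add and butterfly_sum are hypothetical.

```c
#include <assert.h>

/* One xor exchange step for a whole segment: every thread adds the value
 * held by thread (current ^ offset), as a shuffleXorNV call would fetch. */
static void xor_exchange_add(int *values, unsigned width, unsigned offset)
{
    int next[32];
    unsigned current;
    /* width must be a power of 2 in 2..32 so that current ^ offset
     * always stays inside the segment. */
    assert(width >= 2u && width <= 32u && (width & (width - 1u)) == 0u);
    assert(offset < width);
    for (current = 0; current < width; ++current)
        next[current] = values[current] + values[current ^ offset];
    for (current = 0; current < width; ++current)
        values[current] = next[current];
}

/* Butterfly reduction: after log2(width) steps every thread holds the
 * sum of the segment. */
static void butterfly_sum(int *values, unsigned width)
{
    unsigned offset;
    for (offset = width / 2u; offset > 0u; offset >>= 1)
        xor_exchange_add(values, width, offset);
}
```

    For 8 threads holding 1..8, three steps (offsets 4, 2, 1) leave 36 in
    every thread.  Because the offset is always smaller than the width, the
    source thread id never leaves the segment in this pattern.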

Dependencies on NV_gpu_program5

    If NV_gpu_program5 is supported and "OPTION NV_shader_thread_shuffle" is
    specified in an assembly program, the following edits are made to extend
    the assembly programming model documented in the NV_gpu_program4 extension
    and extended by NV_gpu_program5.

    If NV_gpu_program5 is not supported, or if
    "OPTION NV_shader_thread_shuffle" is not specified in an assembly program,
    the contents of this dependencies section should be ignored.

    Section 2.X.2, Program Grammar

    (add the following rules to the grammar)

    <VECTORop>              ::= "SHFDOWN"
                              | "SHFIDX"
                              | "SHFUP"
                              | "SHFXOR"


    Modify Section 2.X.4, Program Execution Environment

    (Add the table entries and relevant text describing the program
     instructions to exchange data between threads.)

      Instr-      Modifiers
      uction  V  F I C S H D  Out   Inputs    Description
      ------- -- - - - - - -  ---   --------  --------------------------------
      ...
      SHFDOWN 50 X X - - - - F  v   v,vu,vu   warp shuffle with added index
      SHFIDX  50 X X - - - - F  v   v,vu,vu   warp shuffle with absolute index
      SHFUP   50 X X - - - - F  v   v,vu,vu   warp shuffle with subtracted index
      SHFXOR  50 X X - - - - F  v   v,vu,vu   warp shuffle with XORed index
      ...


    (Add to "Section 2.X.6, Program Options" of the NV_gpu_program4 extension,
     as extended by NV_gpu_program5)

    + Shader thread shuffle (NV_shader_thread_shuffle)

    If a program specifies the "NV_shader_thread_shuffle" option, it may use
    the "SHFXOR", "SHFDOWN", "SHFIDX" and "SHFUP" instructions.  If this option
    is not specified, a program will fail to compile if it uses those
    instructions.


    Section 2.X.8.Z, SHFDOWN:  warp shuffle with added index

    The SHFDOWN instruction allows a 32-bit scalar value to be exchanged
    between multiple threads within a thread group.  The instruction takes 3
    input operands.  The first operand is a 32-bit scalar.  This is the value
    shared between threads; it can be a float, a signed integer, or an
    unsigned integer.  The second operand is an unsigned integer index in the
    range 0 to 31.  It is used to compute the thread from which the current
    thread will read the 32-bit scalar value.  For the SHFDOWN instruction,
    the source thread id is the id of the current thread plus the index
    operand.

    The last operand is an unsigned integer mask.  The mask is used for
    segmenting the thread group and limiting the source thread index.  Bits 0
    to 4 of <mask> are a clamp value that limits the source thread index, and
    bits 8 to 12 are a segmentation mask used to split the thread group into
    multiple smaller groups.  Together, the clamp value and the segmentation
    mask generate 2 internal values, minThreadId and maxThreadId, using the
    following logic:

      minThreadId = current thread id & segmentationMask

      maxThreadId = minThreadId | (clamp & ~segmentationMask)

    These 2 values segment the thread group by restricting the range of
    thread ids a specific thread can access.
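
    The two formulas can be tried out with a small C sketch.  The type
    thread_range and the function shuffle_range are hypothetical names, and
    the mask value 0x181F used in the note after the code (segmentation mask
    0x18, clamp 0x1F) is an assumed example encoding for segments of 8
    threads, not a value taken from this specification.

```c
#include <assert.h>

/* Range of thread ids reachable by one thread in this model. */
typedef struct {
    unsigned min_id;  /* minThreadId */
    unsigned max_id;  /* maxThreadId */
} thread_range;

/* Derives minThreadId and maxThreadId from the current thread id and
 * the <mask> operand (clamp in bits 0..4, segmentation mask in 8..12). */
static thread_range shuffle_range(unsigned current, unsigned mask)
{
    unsigned clamp    = mask & 0x1Fu;
    unsigned seg_mask = (mask >> 8) & 0x1Fu;
    thread_range r;
    assert(current < 32u);
    r.min_id = current & seg_mask;
    r.max_id = r.min_id | (clamp & ~seg_mask);
    return r;
}
```

    With the assumed mask 0x181F, thread 13 gets minThreadId 8 and
    maxThreadId 15.  With a zero segmentation mask and a clamp of 0x1F
    (mask 0x001F), every thread gets the range 0 to 31.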
4625bd8deadSopenharmony_ci
4635bd8deadSopenharmony_ci    SHFDOWN returns a 2-component vector.  The first component is a predicate
4645bd8deadSopenharmony_ci    that is TRUE when the computed source thread id is in range and FALSE when 
4655bd8deadSopenharmony_ci    it's out of bounds.  For SHFDOWN, the source thread id is in range when it
4665bd8deadSopenharmony_ci    is lower than maxThreadId.  The second component holds a 32-bit value.
4675bd8deadSopenharmony_ci    When the source thread id is in range, this value comes from the source 
4685bd8deadSopenharmony_ci    thread.  When the source thread id is out of range, it read the value from
4695bd8deadSopenharmony_ci    the current thread.  If the source thread id reference to an inactive
4705bd8deadSopenharmony_ci    thread, the returned result will be undefined.
    
    SHFDOWN supports all data type modifiers.  For floating-point data types,
    the TRUE value is 1.0 and the FALSE value is 0.0.  For signed integer data
    types, the TRUE value is -1 and the FALSE value is 0.  For unsigned integer
    data types, the TRUE value is the maximum integer value (all bits are ones)
    and the FALSE value is zero.
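    

    The TRUE and FALSE encodings can be tabulated in a few lines.  The sketch
    below is illustrative only; the table keys and the helper names are
    assumptions.  It also shows the 32-bit pattern behind each encoding: the
    signed and unsigned TRUE values share the pattern 0xFFFFFFFF, while the
    floating-point TRUE value 1.0 has the pattern 0x3F800000.

```python
import struct

# TRUE/FALSE encodings from the text, keyed by data type.  The keys "F32",
# "S32" and "U32" are labels chosen for this sketch.
PREDICATE_VALUES = {
    "F32": {True: 1.0, False: 0.0},
    "S32": {True: -1, False: 0},
    "U32": {True: 0xFFFFFFFF, False: 0},
}

# struct format codes for a 32-bit float, signed int and unsigned int.
PACK_FORMATS = {"F32": "<f", "S32": "<i", "U32": "<I"}


def encode_predicate(flag, data_type):
    """Value stored in the first result component for the given data type."""
    return PREDICATE_VALUES[data_type][bool(flag)]


def bits32(value, data_type):
    """Raw 32-bit pattern of an encoded predicate value."""
    packed = struct.pack(PACK_FORMATS[data_type], value)
    return struct.unpack("<I", packed)[0]


for data_type in ("F32", "S32", "U32"):
    true_value = encode_predicate(True, data_type)
    print(data_type, true_value, hex(bits32(true_value, data_type)))
```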
    

    Section 2.X.8.Z, SHFIDX:  warp shuffle with absolute index

    The SHFIDX instruction allows a 32-bit scalar value to be exchanged between
    multiple threads within a thread group.  The instruction takes 3 operands
    as input.  The first operand is a 32-bit scalar.  This is the value shared
    between threads; it can be a float, a signed integer, or an unsigned
    integer.  The second operand is an unsigned integer index in the range 0
    to 31.  It is used to compute the thread from which the current thread
    reads the 32-bit scalar value.  For the SHFIDX instruction, this source
    thread id is computed using the following operation:

      source thread id = (index operand & ~segmentationMask) | minThreadId

    The last operand is an unsigned integer mask.  The mask is used for
    segmenting the thread group and limiting the source thread index.  Bits 0
    to 4 of <mask> are a clamp value that limits the source thread index and
    bits 8 to 12 are a segmentation mask used to split the thread group into
    multiple smaller groups.  Together, the clamp value and the segmentation
    mask generate two internal values, minThreadId and maxThreadId, using the
    following logic:

      minThreadId = current thread id & segmentationMask

      maxThreadId = minThreadId | (clamp & ~segmentationMask)

    These two values segment the thread group by restricting the range of
    thread ids that a given thread can read from.
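    

    The SHFIDX source thread id operation can be sketched as follows.  The
    helper name shfidx_source and the example mask are assumptions made for
    illustration; the sketch covers only the source id computation and leaves
    out the in-range test.

```python
def shfidx_source(thread_id, index, mask):
    """Source thread id computed by SHFIDX, before the in-range test."""
    segmentation_mask = (mask >> 8) & 0x1F   # bits 8 to 12 of <mask>
    min_thread_id = thread_id & segmentation_mask
    return (index & ~segmentation_mask) | min_thread_id


mask = (0x18 << 8) | 0x1F    # four segments of 8 threads (illustrative)
sources = [shfidx_source(t, 2, mask) for t in range(32)]

print(sources[:8])                  # [2, 2, 2, 2, 2, 2, 2, 2]
print(sources[8:16])                # [10, 10, 10, 10, 10, 10, 10, 10]
print(shfidx_source(13, 9, mask))   # 9
```

    Every thread in a segment computes the same source, so the operation acts
    as a broadcast within each segment.  The index bits covered by the
    segmentation mask are discarded, which is why index 9 selects the second
    thread of the segment (thread 9 for thread 13).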
    
    SHFIDX returns a 2-component vector.  The first component is a predicate
    that is TRUE when the computed source thread id is in range and FALSE when
    it is out of bounds.  For SHFIDX, the source thread id is in range when it
    is lower than maxThreadId.  The second component holds a 32-bit value.
    When the source thread id is in range, this value comes from the source
    thread.  When the source thread id is out of range, the value is read from
    the current thread.  If the source thread id refers to an inactive
    thread, the returned result is undefined.

    SHFIDX supports all data type modifiers.  For floating-point data types,
    the TRUE value is 1.0 and the FALSE value is 0.0.  For signed integer data
    types, the TRUE value is -1 and the FALSE value is 0.  For unsigned integer
    data types, the TRUE value is the maximum integer value (all bits are ones)
    and the FALSE value is zero.
    

    Section 2.X.8.Z, SHFUP:  warp shuffle with subtracted index

    The SHFUP instruction allows a 32-bit scalar value to be exchanged between
    multiple threads within a thread group.  The instruction takes 3 operands
    as input.  The first operand is a 32-bit scalar.  This is the value shared
    between threads; it can be a float, a signed integer, or an unsigned
    integer.  The second operand is an unsigned integer index in the range 0
    to 31.  It is used to compute the thread from which the current thread
    reads the 32-bit scalar value.  For the SHFUP instruction, this source
    thread id is the id of the current thread minus the index operand.

    The last operand is an unsigned integer mask.  The mask is used for
    segmenting the thread group and limiting the source thread index.  Bits 0
    to 4 of <mask> are a clamp value that limits the source thread index and
    bits 8 to 12 are a segmentation mask used to split the thread group into
    multiple smaller groups.  Together, the clamp value and the segmentation
    mask generate two internal values, minThreadId and maxThreadId, using the
    following logic:

      minThreadId = current thread id & segmentationMask

      maxThreadId = minThreadId | (clamp & ~segmentationMask)

    These two values segment the thread group by restricting the range of
    thread ids that a given thread can read from.

    SHFUP returns a 2-component vector.  The first component is a predicate
    that is TRUE when the computed source thread id is in range and FALSE when
    it is out of bounds.  For SHFUP, the source thread id is in range when it
    is greater than maxThreadId.  The second component holds a 32-bit value.
    When the source thread id is in range, this value comes from the source
    thread.  When the source thread id is out of range, the value is read from
    the current thread.  If the source thread id refers to an inactive
    thread, the returned result is undefined.
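    

    A rough model of SHFUP over a fully active thread group is sketched
    below.  The function name shfup is hypothetical, "greater than" is read
    literally as a strict comparison, and the example mask uses a clamp of 0
    so that maxThreadId equals the first thread id of each segment; that
    choice of clamp is an assumption of this sketch, not something the text
    prescribes.

```python
def shfup(values, index, mask):
    """Model SHFUP; values[i] is the first operand of thread i."""
    clamp = mask & 0x1F                      # bits 0 to 4 of <mask>
    segmentation_mask = (mask >> 8) & 0x1F   # bits 8 to 12 of <mask>
    results = []
    for thread_id in range(len(values)):
        min_thread_id = thread_id & segmentation_mask
        max_thread_id = min_thread_id | (clamp & ~segmentation_mask)
        source = thread_id - index           # current id minus index
        in_range = source > max_thread_id    # literal reading of the text
        value = values[source] if in_range else values[thread_id]
        results.append((in_range, value))
    return results


values = [100 + i for i in range(32)]
mask = (0x18 << 8) | 0x00    # segments of 8 threads, clamp 0 (illustrative)
results = shfup(values, 3, mask)

print(results[13])   # (True, 110): thread 10 is inside segment 8..15
print(results[9])    # (False, 109): thread 6 lies in the previous segment
print(results[2])    # (False, 102): the source id would be negative
```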
    
    SHFUP supports all data type modifiers.  For floating-point data types,
    the TRUE value is 1.0 and the FALSE value is 0.0.  For signed integer data
    types, the TRUE value is -1 and the FALSE value is 0.  For unsigned integer
    data types, the TRUE value is the maximum integer value (all bits are ones)
    and the FALSE value is zero.


    Section 2.X.8.Z, SHFXOR:  warp shuffle with XORed index

    The SHFXOR instruction allows a 32-bit scalar value to be exchanged
    between multiple threads within a thread group.  The instruction takes 3
    operands as input.  The first operand is a 32-bit scalar.  This is the
    value shared between threads; it can be a float, a signed integer, or an
    unsigned integer.  The second operand is an unsigned integer index in the
    range 0 to 31.  It is used to compute the thread from which the current
    thread reads the 32-bit scalar value.  For the SHFXOR instruction, this
    source thread id is the id of the current thread XORed with the index
    operand.
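    

    The XOR pairing described above can be illustrated with a short sketch.
    The helper name shfxor_source is an assumption, and the sketch computes
    only the candidate source thread id, without any range check.

```python
def shfxor_source(thread_id, index):
    """Candidate source thread id for SHFXOR: current id XOR index."""
    return thread_id ^ index


partners = [shfxor_source(t, 1) for t in range(32)]
print(partners[:6])   # [1, 0, 3, 2, 5, 4]

# XOR is its own inverse, so the pairing is symmetric: when thread a
# computes source b, thread b computes source a.
symmetric = all(shfxor_source(partners[t], 1) == t for t in range(32))
print(symmetric)      # True
```

    Because the index operand is limited to 0 to 31, the XOR of a thread id
    in that range with the index also stays within 0 to 31.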
    
    The last operand is an unsigned integer mask.  The mask is used for
    segmenting the thread group and limiting the source thread index.  Bits 0
    to 4 of <mask> are a clamp value that limits the source thread index and
    bits 8 to 12 are a segmentation mask used to split the thread group into
    multiple smaller groups.  Together, the clamp value and the segmentation
    mask generate two internal values, minThreadId and maxThreadId, using the
    following logic:

      minThreadId = current thread id & segmentationMask

      maxThreadId = minThreadId | (clamp & ~segmentationMask)

    These two values segment the thread group by restricting the range of
    thread ids that a given thread can read from.

    SHFXOR returns a 2-component vector.  The first component is a predicate
    that is TRUE when the computed source thread id is in range and FALSE when
    it is out of bounds.  For SHFXOR, the source thread id is in range when it
    is lower than maxThreadId.  The second component holds a 32-bit value.
    When the source thread id is in range, this value comes from the source
    thread.  When the source thread id is out of range, the value is read from
    the current thread.  If the source thread id refers to an inactive
    thread, the returned result is undefined.

    SHFXOR supports all data type modifiers.  For floating-point data types,
    the TRUE value is 1.0 and the FALSE value is 0.0.  For signed integer data
    types, the TRUE value is -1 and the FALSE value is 0.  For unsigned integer
    data types, the TRUE value is the maximum integer value (all bits are ones)
    and the FALSE value is zero.
    
Errors

    None.

New State

    None.

New Implementation Dependent State

    None.

Issues

    None


Revision History

    Rev.    Date    Author    Changes
    ----  --------  --------  -----------------------------------------
     3     2/14/14  jbreton    Rename the extension from NVX to NV.
     2      9/4/13  jbreton    Replace mask by width in the shuffle functions.
     1    11/27/12  jbreton    Internal revisions.