Name

    NV_shader_thread_shuffle

Name Strings

    GL_NV_shader_thread_shuffle

Contributors

    Jeannot Breton, NVIDIA
    Pat Brown, NVIDIA
    Eric Werness, NVIDIA
    Mark Kilgard, NVIDIA

Contact

    Jeannot Breton, NVIDIA Corporation (jbreton 'at' nvidia.com)

Status

    Shipping.

Version

    Last Modified Date:      2/14/2014
    NVIDIA Revision:         3

Number

    OpenGL Extension #448

Dependencies

    This extension is written against the OpenGL 4.3 (Compatibility Profile)
    Specification.

    This extension is written against version 4.30 (revision 07) of the OpenGL
    Shading Language Specification.

    OpenGL 4.3 and GLSL 4.3 are required.

    This extension interacts with NV_gpu_program5.

Overview

    Implementations of the OpenGL Shading Language may, but are not required
    to, run multiple shader threads for a single stage as a SIMD thread group,
    where individual execution threads are assigned to thread groups in an
    undefined, implementation-dependent order.  This extension provides a set
    of new features to the OpenGL Shading Language to share data between
    multiple threads within a thread group.

    Shaders using the new functionality provided by this extension should
    enable it via the construct

      #extension GL_NV_shader_thread_shuffle : require     (or enable)

    This extension also specifies some modifications to the program assembly
    language to support the thread data sharing functionality.

New Procedures and Functions

    None


New Tokens

    None


Modifications to The OpenGL Shading Language Specification, Version 4.30
(Revision 07)

    Including the following line in a shader can be used to control the
    language features described in this extension:

      #extension GL_NV_shader_thread_shuffle : <behavior>

    where <behavior> is as specified in section 3.3.

    New preprocessor #defines are added to the OpenGL Shading Language:

      #define GL_NV_shader_thread_shuffle                1


    Modify Section 8.3, Common Functions, p. 133

    (add a function to share data between threads in a thread group)

      Syntax:

        int shuffleDownNV(int data, uint index, uint width,
                        [out bool threadIdValid])
        ivec2 shuffleDownNV(ivec2 data, uint index, uint width,
                        [out bool threadIdValid])
        ivec3 shuffleDownNV(ivec3 data, uint index, uint width,
                        [out bool threadIdValid])
        ivec4 shuffleDownNV(ivec4 data, uint index, uint width,
                        [out bool threadIdValid])

        uint shuffleDownNV(uint data, uint index, uint width,
                        [out bool threadIdValid])
        uvec2 shuffleDownNV(uvec2 data, uint index, uint width,
                        [out bool threadIdValid])
        uvec3 shuffleDownNV(uvec3 data, uint index, uint width,
                        [out bool threadIdValid])
        uvec4 shuffleDownNV(uvec4 data, uint index, uint width,
                        [out bool threadIdValid])

        float shuffleDownNV(float data, uint index, uint width,
                        [out bool threadIdValid])
        vec2 shuffleDownNV(vec2 data, uint index, uint width,
                        [out bool threadIdValid])
        vec3 shuffleDownNV(vec3 data, uint index, uint width,
                        [out bool threadIdValid])
        vec4 shuffleDownNV(vec4 data, uint index, uint width,
                        [out bool threadIdValid])

        bool shuffleDownNV(bool data, uint index, uint width,
                        [out bool threadIdValid])
        bvec2 shuffleDownNV(bvec2 data, uint index, uint width,
                        [out bool threadIdValid])
        bvec3 shuffleDownNV(bvec3 data, uint index, uint width,
                        [out bool threadIdValid])
        bvec4 shuffleDownNV(bvec4 data, uint index, uint width,
                        [out bool threadIdValid])


        int shuffleUpNV(int data, uint index, uint width,
                        [out bool threadIdValid])
        ivec2 shuffleUpNV(ivec2 data, uint index, uint width,
                        [out bool threadIdValid])
        ivec3 shuffleUpNV(ivec3 data, uint index, uint width,
                        [out bool threadIdValid])
        ivec4 shuffleUpNV(ivec4 data, uint index, uint width,
                        [out bool threadIdValid])

        uint shuffleUpNV(uint data, uint index, uint width,
                        [out bool threadIdValid])
        uvec2 shuffleUpNV(uvec2 data, uint index, uint width,
                        [out bool threadIdValid])
        uvec3 shuffleUpNV(uvec3 data, uint index, uint width,
                        [out bool threadIdValid])
        uvec4 shuffleUpNV(uvec4 data, uint index, uint width,
                        [out bool threadIdValid])

        float shuffleUpNV(float data, uint index, uint width,
                        [out bool threadIdValid])
        vec2 shuffleUpNV(vec2 data, uint index, uint width,
                        [out bool threadIdValid])
        vec3 shuffleUpNV(vec3 data, uint index, uint width,
                        [out bool threadIdValid])
        vec4 shuffleUpNV(vec4 data, uint index, uint width,
                        [out bool threadIdValid])

        bool shuffleUpNV(bool data, uint index, uint width,
                        [out bool threadIdValid])
        bvec2 shuffleUpNV(bvec2 data, uint index, uint width,
                        [out bool threadIdValid])
        bvec3 shuffleUpNV(bvec3 data, uint index, uint width,
                        [out bool threadIdValid])
        bvec4 shuffleUpNV(bvec4 data, uint index, uint width,
                        [out bool threadIdValid])


        int shuffleXorNV(int data, uint index, uint width,
                        [out bool threadIdValid])
        ivec2 shuffleXorNV(ivec2 data, uint index, uint width,
                        [out bool threadIdValid])
        ivec3 shuffleXorNV(ivec3 data, uint index, uint width,
                        [out bool threadIdValid])
        ivec4 shuffleXorNV(ivec4 data, uint index, uint width,
                        [out bool threadIdValid])

        uint shuffleXorNV(uint data, uint index, uint width,
                        [out bool threadIdValid])
        uvec2 shuffleXorNV(uvec2 data, uint index, uint width,
                        [out bool threadIdValid])
        uvec3 shuffleXorNV(uvec3 data, uint index, uint width,
                        [out bool threadIdValid])
        uvec4 shuffleXorNV(uvec4 data, uint index, uint width,
                        [out bool threadIdValid])

        float shuffleXorNV(float data, uint index, uint width,
                        [out bool threadIdValid])
        vec2 shuffleXorNV(vec2 data, uint index, uint width,
                        [out bool threadIdValid])
        vec3 shuffleXorNV(vec3 data, uint index, uint width,
                        [out bool threadIdValid])
        vec4 shuffleXorNV(vec4 data, uint index, uint width,
                        [out bool threadIdValid])

        bool shuffleXorNV(bool data, uint index, uint width,
                        [out bool threadIdValid])
        bvec2 shuffleXorNV(bvec2 data, uint index, uint width,
                        [out bool threadIdValid])
        bvec3 shuffleXorNV(bvec3 data, uint index, uint width,
                        [out bool threadIdValid])
        bvec4 shuffleXorNV(bvec4 data, uint index, uint width,
                        [out bool threadIdValid])


        int shuffleNV(int data, uint index, uint width,
                        [out bool threadIdValid])
        ivec2 shuffleNV(ivec2 data, uint index, uint width,
                        [out bool threadIdValid])
        ivec3 shuffleNV(ivec3 data, uint index, uint width,
                        [out bool threadIdValid])
        ivec4 shuffleNV(ivec4 data, uint index, uint width,
                        [out bool threadIdValid])

        uint shuffleNV(uint data, uint index, uint width,
                        [out bool threadIdValid])
        uvec2 shuffleNV(uvec2 data, uint index, uint width,
                        [out bool threadIdValid])
        uvec3 shuffleNV(uvec3 data, uint index, uint width,
                        [out bool threadIdValid])
        uvec4 shuffleNV(uvec4 data, uint index, uint width,
                        [out bool threadIdValid])

        float shuffleNV(float data, uint index, uint width,
                        [out bool threadIdValid])
        vec2 shuffleNV(vec2 data, uint index, uint width,
                        [out bool threadIdValid])
        vec3 shuffleNV(vec3 data, uint index, uint width,
                        [out bool threadIdValid])
        vec4 shuffleNV(vec4 data, uint index, uint width,
                        [out bool threadIdValid])

        bool shuffleNV(bool data, uint index, uint width,
                        [out bool threadIdValid])
        bvec2 shuffleNV(bvec2 data, uint index, uint width,
                        [out bool threadIdValid])
        bvec3 shuffleNV(bvec3 data, uint index, uint width,
                        [out bool threadIdValid])
        bvec4 shuffleNV(bvec4 data, uint index, uint width,
                        [out bool threadIdValid])

    Shuffle functions allow active threads within a thread group to exchange
    data using 4 different modes (up, down, xor, indexed).  They all load
    the operand <data>, which can be different per thread, and return a value
    read from the source thread at an address computed with the <index> and
    the <width> operands.

    <index> is a 5-bit value in the range 0 to 31; MSBs are ignored.
    <threadIdValid> is an optional operand.  It holds the value of the
    predicate that specifies whether the source thread from which the current
    thread reads data is in range or not.

    <width> is used for segmenting the thread group into multiple segments.
    The segments need to be subdivided equally, so <width> needs to be a power
    of 2 in the range 2 to 32.  Using a <width> of 32 would divide the thread
    group in a single segment.  A <width> of 8 would divide the thread group
    in 4 segments of size 8.  Using a <width> that is not a power of 2, or
    that is lower than 2 or larger than 32, will return an undefined value.

    Threads can only share data within their own segment.  Each thread
    executing the built-in shuffle function will determine the ID of another
    thread by combining its value of gl_ThreadInWarpNV with its value of
    <index> as described below.  Such threads will attempt to read the value
    of <data> in the computed other thread and return that value to the
    caller.

    When a shuffle function attempts to access the value of <data> from
    another thread, it determines whether the other thread is in accessible
    range or not.  If it is in range, true will be returned in the optional
    <threadIdValid> parameter, if provided by the caller.  If it is out of
    range, false will be returned in <threadIdValid>, if provided by the
    caller, and the value returned by the function will come from the current
    thread.
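
    As an illustration only, the segment and range rules can be modeled on
    the CPU.  The C sketch below is not part of this extension, and the names
    shuffle_mode, shuffle_width_ok, shuffle_source_thread and
    shuffle_thread_id_valid are hypothetical.  It assumes a thread group of
    32 threads and assumes that the thread ids used by the per-mode rules in
    this section are relative to the first thread of the caller's segment.
    It returns the absolute id of the thread whose <data> a given thread
    would read, which is the thread's own id when the source is out of range.
    Because an invalid <width> yields an undefined value, the model simply
    asserts on it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define THREAD_GROUP_SIZE 32  /* assumed thread group (warp) size */

/* Hypothetical names for the four shuffle modes. */
typedef enum { SHUFFLE_IDX, SHUFFLE_UP, SHUFFLE_DOWN, SHUFFLE_XOR } shuffle_mode;

/* <width> must be a power of 2 in the range 2 to 32. */
static bool shuffle_width_ok(int width)
{
    return width >= 2 && width <= THREAD_GROUP_SIZE &&
           (width & (width - 1)) == 0;
}

/* Returns the absolute id of the thread that `thread` reads <data> from.
 * If the computed source is out of range, the thread reads its own value,
 * so its own id is returned.  thread_id_valid may be NULL (optional). */
static int shuffle_source_thread(shuffle_mode mode, int thread, unsigned index,
                                 int width, bool *thread_id_valid)
{
    assert(thread >= 0 && thread < THREAD_GROUP_SIZE);
    assert(shuffle_width_ok(width));  /* result is undefined otherwise */

    int idx = (int)(index & 31u);      /* 5-bit value, MSBs are ignored */
    int base = thread & ~(width - 1);  /* first thread of this segment */
    int self = thread - base;          /* id relative to the segment (assumption) */
    int src = 0;
    bool valid = false;

    switch (mode) {
    case SHUFFLE_IDX:  src = idx;        valid = src < width; break;
    case SHUFFLE_UP:   src = self - idx; valid = src >= 0;    break;
    case SHUFFLE_DOWN: src = self + idx; valid = src < width; break;
    case SHUFFLE_XOR:  src = self ^ idx; valid = src < width; break;
    }

    if (thread_id_valid != NULL)
        *thread_id_valid = valid;
    return valid ? base + src : thread;
}

/* Convenience wrapper returning only the <threadIdValid> predicate. */
static bool shuffle_thread_id_valid(shuffle_mode mode, int thread,
                                    unsigned index, int width)
{
    bool valid = false;
    (void)shuffle_source_thread(mode, thread, index, width, &valid);
    return valid;
}
```

    Under these assumptions, with a <width> of 8, thread 13 in XOR mode with
    an <index> of 1 reads from thread 12, and thread 6 in down mode with an
    <index> of 2 is out of range and keeps its own value.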


    The 4 modes use the following logic to compute the source thread index and
    the <threadIdValid> value:

    shuffleNV computes the source index using <index> as an absolute address
    within the thread group segment.

        srcThreadId = <index>
        <threadIdValid> = <index> < <width>

        For example, with this thread group segment:

                         -----------------
        Thread Id        |0|1|2|3|4|5|6|7|
                         -----------------
        Thread <data>    |a|b|c|d|e|f|g|h|
                         -----------------

        If <index> is 2

                         -----------------
        src thread Id    |2|2|2|2|2|2|2|2|
                         -----------------
        <threadIdValid>  |1|1|1|1|1|1|1|1|
                         -----------------
        result           |c|c|c|c|c|c|c|c|
                         -----------------

        If <index> is 9

                         -----------------
        src thread Id    |9|9|9|9|9|9|9|9|
                         -----------------
        <threadIdValid>  |0|0|0|0|0|0|0|0|
                         -----------------
        result           |a|b|c|d|e|f|g|h|
                         -----------------


    shuffleUpNV subtracts <index> from the current thread id to get the source
    thread id.  This has the effect of shifting up the segment by <index>
    threads.  Source thread ids do not wrap around, so the lower thread ids
    will be left unchanged.

        srcThreadId = currentThreadId - <index>
        <threadIdValid> = srcThreadId >= 0

        For example, with this thread group segment:

                         -----------------
        Thread Id        |0|1|2|3|4|5|6|7|
                         -----------------
        Thread <data>    |a|b|c|d|e|f|g|h|
                         -----------------

        If <index> is 1

                         ------------------
        src thread Id    |-1|0|1|2|3|4|5|6|
                         ------------------
        <threadIdValid>  |0 |1|1|1|1|1|1|1|
                         ------------------
        result           |a |a|b|c|d|e|f|g|
                         ------------------


    shuffleDownNV adds <index> to the current thread id to get the source
    thread id.  This has the effect of shifting down the segment by
    <index> threads.  Source thread ids do not wrap around, so the higher
    thread ids will be left unchanged.

        srcThreadId = currentThreadId + <index>
        <threadIdValid> = srcThreadId < <width>

        For example, with this thread group segment:

                         -----------------
        Thread Id        |0|1|2|3|4|5|6|7|
                         -----------------
        Thread <data>    |a|b|c|d|e|f|g|h|
                         -----------------

        If <index> is 2

                         -----------------
        src thread Id    |2|3|4|5|6|7|8|9|
                         -----------------
        <threadIdValid>  |1|1|1|1|1|1|0|0|
                         -----------------
        result           |c|d|e|f|g|h|g|h|
                         -----------------


    shuffleXorNV does a bitwise xor between the <index> and the current
    thread id to get the source thread id:

        srcThreadId = currentThreadId ^ <index>
        <threadIdValid> = srcThreadId < <width>

        For example, with this thread group segment:

                         -----------------
        Thread Id        |0|1|2|3|4|5|6|7|
                         -----------------
        Thread <data>    |a|b|c|d|e|f|g|h|
                         -----------------

        If <index> is 0x1

                         -----------------
        src thread Id    |1|0|3|2|5|4|7|6|
                         -----------------
        <threadIdValid>  |1|1|1|1|1|1|1|1|
                         -----------------
        result           |b|a|d|c|f|e|h|g|
                         -----------------

Dependencies on NV_gpu_program5

    If NV_gpu_program5 is supported and "OPTION NV_shader_thread_shuffle" is
    specified in an assembly program, the following edits are made to extend
    the assembly programming model documented in the NV_gpu_program4 extension
    and extended by NV_gpu_program5.

    If NV_gpu_program5 is not supported, or if
    "OPTION NV_shader_thread_shuffle" is not specified in an assembly program,
    the contents of this dependencies section should be ignored.

    Section 2.X.2, Program Grammar

    (add the following rules to the grammar)

      <VECTORop>              ::= "SHFDOWN"
                                | "SHFIDX"
                                | "SHFUP"
                                | "SHFXOR"


    Modify Section 2.X.4, Program Execution Environment

    (Add the table entries and relevant text describing the program
    instructions to exchange data between threads.)

      Instr-      Modifiers
      uction  V  F I C S H D  Out Inputs    Description
      ------- -- - - - - - -  --- --------  --------------------------------
      ...
4195bd8deadSopenharmony_ci SHFDOWN 50 X X - - - - F v v,vu,vu warp shuffle with added index 4205bd8deadSopenharmony_ci SHFIDX 50 X X - - - - F v v,vu,vu warp shuffle with absolute index 4215bd8deadSopenharmony_ci SHFUP 50 X X - - - - F v v,vu,vu warp shuffle with subtracted index 4225bd8deadSopenharmony_ci SHFXOR 50 X X - - - - F v v,vu,vu warp shuffle with XORed index 4235bd8deadSopenharmony_ci ... 4245bd8deadSopenharmony_ci 4255bd8deadSopenharmony_ci 4265bd8deadSopenharmony_ci (Add to "Section 2.X.6, Program Options" of the NV_gpu_program4 extension, 4275bd8deadSopenharmony_ci as extended by NV_gpu_program5) 4285bd8deadSopenharmony_ci 4295bd8deadSopenharmony_ci + Shader thread shuffle (NV_shader_thread_shuffle) 4305bd8deadSopenharmony_ci 4315bd8deadSopenharmony_ci If a program specifies the "NV_shader_thread_shuffle" option, it may use 4325bd8deadSopenharmony_ci the "SHFXOR", "SHFDOWN", "SHFIDX" and "SHFUP" instructions. If this option 4335bd8deadSopenharmony_ci is not specified, a program will fail to compile if it uses those 4345bd8deadSopenharmony_ci instructions. 4355bd8deadSopenharmony_ci 4365bd8deadSopenharmony_ci 4375bd8deadSopenharmony_ci Section 2.X.8.Z, SHFDOWN: warp shuffle with added index 4385bd8deadSopenharmony_ci 4395bd8deadSopenharmony_ci The SHFDOWN instruction allows a 32-bit scalar value to be exchanged 4405bd8deadSopenharmony_ci between multiple thread within a thread group. The instruction has 3 4415bd8deadSopenharmony_ci operands as input. The first operand is a 32-bit scalar. This value will 4425bd8deadSopenharmony_ci be shared between thread, it can be a float, a signed or an unsigned 4435bd8deadSopenharmony_ci integer. The second operand is an unsigned integer index in the range 0 to 4445bd8deadSopenharmony_ci 31. It is used to compute from which thread the current thread will read 4455bd8deadSopenharmony_ci the 32-bit scalar value. 
For the SHFDOWN instruction this source thread is 4465bd8deadSopenharmony_ci the id of the current thread added with the index operand. 4475bd8deadSopenharmony_ci 4485bd8deadSopenharmony_ci The last operand is an unsigned integer mask. The mask is used for 4495bd8deadSopenharmony_ci segmenting the thread group and limiting the source thread index. Bits 0 4505bd8deadSopenharmony_ci to 4 of <mask> are a clamp value that limits the source thread index and 4515bd8deadSopenharmony_ci bits 8 to 12 a segmentation mask used to segment the thread group in 4525bd8deadSopenharmony_ci multiple smaller groups. Together the clamp value and the segmentation 4535bd8deadSopenharmony_ci mask will generate 2 internal values, the minThreadId and the maxThreadId, 4545bd8deadSopenharmony_ci using the following logic: 4555bd8deadSopenharmony_ci 4565bd8deadSopenharmony_ci minThreadId = current thread id & segmentationMask 4575bd8deadSopenharmony_ci 4585bd8deadSopenharmony_ci maxThreadId = minThreadId | (clamp & ~segmentationMask) 4595bd8deadSopenharmony_ci 4605bd8deadSopenharmony_ci Those 2 values will segment the thread group by restricting the address 4615bd8deadSopenharmony_ci range a specific thread can access. 4625bd8deadSopenharmony_ci 4635bd8deadSopenharmony_ci SHFDOWN returns a 2-component vector. The first component is a predicate 4645bd8deadSopenharmony_ci that is TRUE when the computed source thread id is in range and FALSE when 4655bd8deadSopenharmony_ci it's out of bounds. For SHFDOWN, the source thread id is in range when it 4665bd8deadSopenharmony_ci is lower than maxThreadId. The second component holds a 32-bit value. 4675bd8deadSopenharmony_ci When the source thread id is in range, this value comes from the source 4685bd8deadSopenharmony_ci thread. When the source thread id is out of range, it read the value from 4695bd8deadSopenharmony_ci the current thread. 
If the source thread id reference to an inactive 4705bd8deadSopenharmony_ci thread, the returned result will be undefined. 4715bd8deadSopenharmony_ci 4725bd8deadSopenharmony_ci SHFDOWN supports all data type modifiers. For floating-point data types, 4735bd8deadSopenharmony_ci the TRUE value is 1.0 and the FALSE value is 0.0. For signed integer data 4745bd8deadSopenharmony_ci types, the TRUE value is -1 and the FALSE value is 0. For unsigned integer 4755bd8deadSopenharmony_ci data types, the TRUE value is the maximum integer value (all bits are ones) 4765bd8deadSopenharmony_ci and the FALSE value is zero. 4775bd8deadSopenharmony_ci 4785bd8deadSopenharmony_ci 4795bd8deadSopenharmony_ci Section 2.X.8.Z, SHFIDX: warp shuffle with absolute index 4805bd8deadSopenharmony_ci 4815bd8deadSopenharmony_ci The SHFIDX instruction allows a 32-bit scalar value to be exchanged between 4825bd8deadSopenharmony_ci multiple thread within a thread group. The instruction has 3 operands as 4835bd8deadSopenharmony_ci input. The first operand is a 32-bit scalar. This value will be shared 4845bd8deadSopenharmony_ci between thread, it can be a float, a signed or an unsigned integer. The 4855bd8deadSopenharmony_ci second operand is an unsigned integer index in the range 0 to 31. It is 4865bd8deadSopenharmony_ci used to compute from which thread the current thread will read the 4875bd8deadSopenharmony_ci 32-bit scalar value. For the SHFIDX instruction, this source thread id is 4885bd8deadSopenharmony_ci computed using the following operation: 4895bd8deadSopenharmony_ci 4905bd8deadSopenharmony_ci source thread id =( index operand & ~segmentationMask) | minThreadId 4915bd8deadSopenharmony_ci 4925bd8deadSopenharmony_ci The last operand is an unsigned integer mask. The mask is used for 4935bd8deadSopenharmony_ci segmenting the thread group and limiting the source thread index. 
Bits 0 4945bd8deadSopenharmony_ci to 4 of <mask> are a clamp value that limits the source thread index and 4955bd8deadSopenharmony_ci bits 8 to 12 a segmentation mask used to segment the thread group in 4965bd8deadSopenharmony_ci multiple smaller groups. Together the clamp value and the segmentation 4975bd8deadSopenharmony_ci mask will generate 2 internal values, the minThreadId and the maxThreadId, 4985bd8deadSopenharmony_ci using the following logic: 4995bd8deadSopenharmony_ci 5005bd8deadSopenharmony_ci minThreadId = current thread id & segmentationMask 5015bd8deadSopenharmony_ci 5025bd8deadSopenharmony_ci maxThreadId = minThreadId | (clamp & ~segmentationMask) 5035bd8deadSopenharmony_ci 5045bd8deadSopenharmony_ci Those 2 values will segment the thread group by restricting the address 5055bd8deadSopenharmony_ci range a specific thread can access. 5065bd8deadSopenharmony_ci 5075bd8deadSopenharmony_ci SHFIDX returns a 2-component vector. The first component is a predicate 5085bd8deadSopenharmony_ci that is TRUE when the computed source thread id is in range and FALSE when 5095bd8deadSopenharmony_ci it's out of bounds. For SHFIDX, the source thread id is in range when it 5105bd8deadSopenharmony_ci is lower than maxThreadId. The second component holds a 32-bit value. 5115bd8deadSopenharmony_ci When the source thread id is in range, this value comes from the source 5125bd8deadSopenharmony_ci thread. When the source thread id is out of range, it read the value from 5135bd8deadSopenharmony_ci the current thread. If the source thread id reference to an inactive 5145bd8deadSopenharmony_ci thread, the returned result will be undefined. 5155bd8deadSopenharmony_ci 5165bd8deadSopenharmony_ci SHFIDX supports all data type modifiers. For floating-point data types, 5175bd8deadSopenharmony_ci the TRUE value is 1.0 and the FALSE value is 0.0. For signed integer data 5185bd8deadSopenharmony_ci types, the TRUE value is -1 and the FALSE value is 0. 
For unsigned integer 5195bd8deadSopenharmony_ci data types, the TRUE value is the maximum integer value (all bits are ones) 5205bd8deadSopenharmony_ci and the FALSE value is zero. 5215bd8deadSopenharmony_ci 5225bd8deadSopenharmony_ci 5235bd8deadSopenharmony_ci Section 2.X.8.Z, SHFUP: warp shuffle with subtracted index 5245bd8deadSopenharmony_ci 5255bd8deadSopenharmony_ci The SHFUP instruction allows a 32-bit scalar value to be exchanged between 5265bd8deadSopenharmony_ci multiple thread within a thread group. The instruction has 3 operands as 5275bd8deadSopenharmony_ci input. The first operand is a 32-bit scalar. This value will be shared 5285bd8deadSopenharmony_ci between thread, it can be a float, a signed or an unsigned integer. The 5295bd8deadSopenharmony_ci second operand is an unsigned integer index in the range 0 to 31. It is 5305bd8deadSopenharmony_ci used to compute from which thread the current thread will read the 32-bit 5315bd8deadSopenharmony_ci scalar value. For the SHFUP instruction this source thread is the id of 5325bd8deadSopenharmony_ci the current thread subtracted with the index operand. 5335bd8deadSopenharmony_ci 5345bd8deadSopenharmony_ci The last operand is an unsigned integer mask. The mask is used for 5355bd8deadSopenharmony_ci segmenting the thread group and limiting the source thread index. Bits 0 5365bd8deadSopenharmony_ci to 4 of <mask> are a clamp value that limits the source thread index and 5375bd8deadSopenharmony_ci bits 8 to 12 a segmentation mask used to segment the thread group in 5385bd8deadSopenharmony_ci multiple smaller groups. 
    Together the clamp value and the segmentation mask generate two internal
    values, minThreadId and maxThreadId, using the following logic:

        minThreadId = current thread id & segmentationMask

        maxThreadId = minThreadId | (clamp & ~segmentationMask)

    These two values segment the thread group by restricting the address
    range a specific thread can access.

    SHFUP returns a 2-component vector. The first component is a predicate
    that is TRUE when the computed source thread id is in range and FALSE when
    it is out of bounds. For SHFUP, the source thread id is in range when it
    is greater than maxThreadId. The second component holds a 32-bit value.
    When the source thread id is in range, this value comes from the source
    thread. When the source thread id is out of range, the value is read from
    the current thread. If the source thread id refers to an inactive
    thread, the returned result is undefined.

    SHFUP supports all data type modifiers. For floating-point data types,
    the TRUE value is 1.0 and the FALSE value is 0.0. For signed integer data
    types, the TRUE value is -1 and the FALSE value is 0. For unsigned integer
    data types, the TRUE value is the maximum integer value (all bits are ones)
    and the FALSE value is zero.
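
    As an informal illustration, the <mask> decoding and the minThreadId /
    maxThreadId logic above can be modeled on the host. The sketch below is
    written in Python rather than in the assembly language, and the names
    decode_mask, thread_bounds, and shfup_source are hypothetical helpers
    introduced only for this example. It models the bit layout and the
    subtraction only; the in-range predicate and inactive threads are not
    modeled.

```python
def decode_mask(mask):
    """Split <mask> into (clamp, segmentation_mask).

    Bits 0 to 4 hold the clamp value, bits 8 to 12 the segmentation mask.
    """
    clamp = mask & 0x1F
    segmentation_mask = (mask >> 8) & 0x1F
    return clamp, segmentation_mask


def thread_bounds(thread_id, mask):
    """Return (minThreadId, maxThreadId) for one thread, per the logic above."""
    clamp, segmentation_mask = decode_mask(mask)
    min_thread_id = thread_id & segmentation_mask
    max_thread_id = min_thread_id | (clamp & ~segmentation_mask)
    return min_thread_id, max_thread_id


def shfup_source(thread_id, index):
    """Raw SHFUP source thread id: current thread id minus the index operand.

    The result may be negative; range checking is left out of this sketch.
    """
    return thread_id - index
```

    For example, with a segmentation mask of 0x18 (segments of 8 threads) and
    a clamp of 0x1F, <mask> is 0x181F and thread_bounds(13, 0x181F) evaluates
    to (8, 15). With the same segmentation mask and a clamp of 0,
    thread_bounds(13, 0x1800) evaluates to (8, 8). shfup_source(13, 2)
    evaluates to 11.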


    Section 2.X.8.Z, SHFXOR: warp shuffle with XORed index

    The SHFXOR instruction allows a 32-bit scalar value to be exchanged
    between multiple threads within a thread group. The instruction has 3
    operands as input. The first operand is a 32-bit scalar. This value will
    be shared between threads; it can be a float, a signed integer, or an
    unsigned integer. The second operand is an unsigned integer index in the
    range 0 to 31. It is used to compute from which thread the current thread
    will read the 32-bit scalar value. For the SHFXOR instruction this source
    thread is the id of the current thread XORed with the index operand.

    The last operand is an unsigned integer mask. The mask is used for
    segmenting the thread group and limiting the source thread index. Bits 0
    to 4 of <mask> are a clamp value that limits the source thread index, and
    bits 8 to 12 are a segmentation mask used to segment the thread group into
    multiple smaller groups.
    Together the clamp value and the segmentation mask generate two internal
    values, minThreadId and maxThreadId, using the following logic:

        minThreadId = current thread id & segmentationMask

        maxThreadId = minThreadId | (clamp & ~segmentationMask)

    These two values segment the thread group by restricting the address
    range a specific thread can access.

    SHFXOR returns a 2-component vector. The first component is a predicate
    that is TRUE when the computed source thread id is in range and FALSE when
    it is out of bounds. For SHFXOR, the source thread id is in range when it
    is lower than maxThreadId. The second component holds a 32-bit value.
    When the source thread id is in range, this value comes from the source
    thread. When the source thread id is out of range, the value is read from
    the current thread. If the source thread id refers to an inactive
    thread, the returned result is undefined.

    SHFXOR supports all data type modifiers. For floating-point data types,
    the TRUE value is 1.0 and the FALSE value is 0.0. For signed integer data
    types, the TRUE value is -1 and the FALSE value is 0. For unsigned integer
    data types, the TRUE value is the maximum integer value (all bits are ones)
    and the FALSE value is zero.
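
    A common use of an XORed source index is a butterfly reduction, in which
    each thread repeatedly combines its value with that of the thread whose
    id differs in one bit. The Python sketch below simulates this on the host
    for a whole thread group at once. It assumes a thread group of 32
    threads (suggested by the 0 to 31 index range, but not stated here), that
    all threads are active, and that every source thread id is in range. The
    names shfxor_source and butterfly_sum are hypothetical and exist only in
    this example.

```python
def shfxor_source(thread_id, index):
    """Raw SHFXOR source thread id: current thread id XORed with the index."""
    return thread_id ^ index


def butterfly_sum(values):
    """Simulate a butterfly sum across a thread group.

    values[t] is the value held by thread t. The group size must be a power
    of two. Every thread is assumed active and every source id in range, so
    each step reads the partner's value rather than the thread's own.
    """
    group_size = len(values)
    if group_size == 0 or group_size & (group_size - 1):
        raise ValueError("group size must be a power of two")
    current = list(values)
    offset = group_size // 2
    while offset >= 1:
        # All threads read their partner's value from the previous step.
        current = [current[t] + current[shfxor_source(t, offset)]
                   for t in range(group_size)]
        offset //= 2
    return current
```

    After the last step every simulated thread holds the sum of all inputs;
    butterfly_sum(list(range(32))) yields a list of 32 entries that are all
    496.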

Errors

    None.

New State

    None.

New Implementation Dependent State

    None.

Issues

    None.


Revision History

    Rev.    Date    Author    Changes
    ----  --------  --------  -----------------------------------------
     3    2/14/14   jbreton   Rename the extension from NVX to NV.
     2    9/4/13    jbreton   Replace mask by width in the shuffle functions.
     1    11/27/12  jbreton   Internal revisions.