Name

    NV_compute_program5

Name Strings

    GL_NV_compute_program5

Contact

    Pat Brown, NVIDIA Corporation (pbrown 'at' nvidia.com)

Status

    Complete

Version

    Last Modified Date:         10/23/2012
    NVIDIA Revision:            2

Number

    421

Dependencies

    OpenGL 4.0 (Core or Compatibility Profile) is required.

    This extension is written against the OpenGL 4.2 Specification
    (Compatibility Profile).

    NV_gpu_program4 and NV_gpu_program5 are required.

    ARB_compute_shader is required.

    This specification interacts with NV_shader_atomic_float.

    This specification interacts with EXT_shader_image_load_store.

Overview

    This extension builds on the ARB_compute_shader extension to provide new
    assembly compute program capability for OpenGL. ARB_compute_shader adds
    the basic functionality, including the ability to dispatch compute work.
    This extension provides the ability to write a compute program in
    assembly, using the same basic syntax and capability set found in the
    NV_gpu_program4 and NV_gpu_program5 extensions.

New Procedures and Functions

    None.

New Tokens

    Accepted by the <cap> parameter of Disable, Enable, and IsEnabled,
    by the <pname> parameter of GetBooleanv, GetIntegerv, GetFloatv,
    and GetDoublev, and by the <target> parameter of ProgramStringARB,
    BindProgramARB, ProgramEnvParameter4[df][v]ARB,
    ProgramLocalParameter4[df][v]ARB, GetProgramEnvParameter[df]vARB,
    GetProgramLocalParameter[df]vARB, GetProgramivARB, and
    GetProgramStringARB:

        COMPUTE_PROGRAM_NV                              0x90FB

    Accepted by the <target> parameter of ProgramBufferParametersfvNV,
    ProgramBufferParametersIivNV, ProgramBufferParametersIuivNV,
    BindBufferRangeNV, BindBufferOffsetNV, BindBufferBaseNV, and BindBuffer,
    and by the <value> parameter of GetIntegerIndexedvEXT:

        COMPUTE_PROGRAM_PARAMETER_BUFFER_NV             0x90FC

    (Note: Various enumerants from ARB_compute_shader will also be used by
    this extension.)

Additions to Chapter 2 of the OpenGL 4.2 (Compatibility Profile) Specification
(OpenGL Operation)

    Modify Section 2.X, GPU Programs, of NV_gpu_program4 (as modified by
    NV_gpu_program5)

    (insert after second paragraph)

    Compute Programs

    Compute programs are used to perform general purpose computations using a
    three-dimensional array of program invocations (threads). The program
    invocations are arranged into work groups, each of which comprises a
    fixed-size, three-dimensional array of program invocations whose
    dimensions are specified by the mandatory GROUP_SIZE declaration. One or
    more work groups are scheduled for execution using the DispatchCompute or
    DispatchComputeIndirect commands.

    Each work group scheduled for execution will launch a separate program
    invocation for each work group member. While the program invocations in a
    work group are launched together, they run independently after launch.
    The BAR (barrier) instruction is available to synchronize program
    invocations; an invocation stops at each BAR instruction until all
    invocations in the work group have executed the BAR instruction. Each
    work group has an optional shared memory allocation (specified by the
    SHARED_MEMORY declaration) that can be read or written by any invocation
    of the work group.
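
    The launch, barrier, and shared memory behavior described above can be
    modeled on the CPU. The following is a minimal Python sketch, not part of
    the extension: each thread stands in for one program invocation,
    threading.Barrier stands in for BAR, and a plain list stands in for the
    work group's shared memory. The function name run_work_group and the
    values stored are illustrative assumptions.

```python
import threading


def run_work_group(group_size):
    """Model one work group: every invocation stores, waits, then loads."""
    shared = [None] * group_size          # stands in for work group shared memory
    bar = threading.Barrier(group_size)   # stands in for the BAR instruction
    results = [None] * group_size

    def invocation(local_index):
        # Each invocation writes its own element (an STS-like store).
        shared[local_index] = local_index * 10
        # BAR: no invocation continues until all have reached this point.
        bar.wait()
        # Reading a neighbor's element is only well-defined after the barrier.
        neighbor = (local_index + 1) % group_size
        results[local_index] = shared[neighbor]

    threads = [threading.Thread(target=invocation, args=(i,))
               for i in range(group_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

    With a group of four invocations, run_work_group(4) returns
    [10, 20, 30, 0]. Without the barrier, an invocation could read its
    neighbor's element before it had been written.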

    Unlike other program types, compute program invocations have no inputs or
    outputs interfacing with the rest of the pipeline. Compute programs may
    obtain inputs using mechanisms such as global loads, image loads, atomic
    counter reads, shader storage buffer reads, and program parameters.
    Built-in inputs are also provided to allow a compute shader invocation to
    determine its position in the work group, the position of its work group
    in the full dispatch, as well as the work group and full dispatch sizes.
    Compute program results are expected to be written to globally accessible
    memory using mechanisms such as global stores, image stores, atomic
    counters, and shader storage buffers.


    Modify Section 2.X.2, Program Grammar

    (replace third paragraph)

    Compute programs are required to begin with the header string "!!NVcp5.0".
    This header string identifies the subsequent program body as being a
    compute program and indicates that it should be parsed according to the
    base NV_gpu_program5 grammar plus the additions below. Program string
    parsing begins with the character immediately following the header string.
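
    As an illustration of the header rule, the following Python sketch checks
    for the header string and returns the text that would be handed to the
    parser. It is only a model of the rule stated above; the function name
    split_compute_program and the sample program text are hypothetical and do
    not correspond to any GL entry point.

```python
COMPUTE_HEADER = "!!NVcp5.0"


def split_compute_program(program_string):
    """Return the program body that follows the compute program header.

    Raises ValueError if the string does not begin with the header.
    """
    if not program_string.startswith(COMPUTE_HEADER):
        raise ValueError("not a compute program: missing " + COMPUTE_HEADER)
    # Parsing begins with the character immediately following the header.
    return program_string[len(COMPUTE_HEADER):]


# Illustrative program text only; the body is not validated here.
body = split_compute_program("!!NVcp5.0\nGROUP_SIZE 8 4;\nEND\n")
```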

    (add the following grammar rules to the NV_gpu_program5 base grammar for
    compute programs)

    <declSequence>          ::= <declaration> <declSequence>

    <instruction>           ::= <SpecialInstruction>

    <opModifier>            ::= "CTA"

    <namingStatement>       ::= <SHARED_statement>

    <SHARED_statement>      ::= "SHARED" <establishName> <sharedSingleInit>
                              | "SHARED" <establishName> <optArraySize>
                                <sharedMultipleInit>

    <sharedSingleInit>      ::= "=" <sharedUseDS>

    <sharedMultipleInit>    ::= "=" "{" <sharedItemList> "}"

    <sharedItemList>        ::= <sharedUseDM>
                              | <sharedUseDM> "," <sharedItemList>

    <sharedUseV>            ::= <sharedVarName> <optArrayMem>

    <sharedUseDS>           ::= <sharedBaseBinding> <arrayMemAbs>

    <sharedUseDM>           ::= <sharedUseDS>
                              | <sharedBaseBinding> <arrayRange>

    <sharedBaseBinding>     ::= "program" "." "sharedmem"

    <SpecialInstruction>    ::= "BAR"
                              | "ATOMS" <opModifiers> <instResult> ","
                                <instOperandV> "," <sharedUseV>
                              | "LDS" <opModifiers> <instResult> ","
                                <sharedUseV>
                              | "STS" <opModifiers> <instOperandV> ","
                                <sharedUseV>

    <declaration>           ::= "GROUP_SIZE" <int>
                              | "GROUP_SIZE" <int> <int>
                              | "GROUP_SIZE" <int> <int> <int>
                              | "SHARED_MEMORY" <int>

    <attribBasic>           ::= "invocation" "." "localid"
                              | "invocation" "." "globalid"
                              | "invocation" "." "groupid"
                              | "invocation" "." "groupcount"
                              | "invocation" "." "groupsize"
                              | "invocation" "." "localindex"


    (add the following subsection to Section 2.X.3.2, Program Attribute
    Variables)

    Compute program attribute variables describe the attributes of the current
    program invocation. Each DispatchCompute command produces a set of
    program invocations arranged as a one-, two-, or three-dimensional array.
    Figure X.1 illustrates a two-dimensional dispatch with a local work group
    size of 8x4, and a total dispatch of 5x4 local workgroups.
    Each individual program invocation has a one-, two-, or
    three-dimensional global coordinate, which can be further decomposed into
    a work group offset (in fixed-size work groups) and a local offset
    relative to the origin of an invocation's work group.

               +-------+-------+-------+-------+-------+
               |       |       | work  |       |       |
               |       |       | group |       |       |
               |       |       | (2,3) |       |       |
        (0,12) +-------+-------+-------+-------+-------+
               |       |       |       |       |       |
               |       |       |       |       |       |
               |       |  *    |       |       |       |
        (0,8)  +-------+-------+-------+-------+-------+
               |       |       |       |       | work  |
               |       |       |       |       | group |
               |       |       |       |       | (4,1) |
        (0,4)  +-------+-------+-------+-------+-------+
               | work  |       |       |       |       |
               | group |       |       |       |       |
               | (0,0) |       |       |       |       |
               +-------+-------+-------+-------+-------+
             (0,0)   (8,0)  (16,0)  (24,0)  (32,0)

      Figure X.1, Compute Dispatch. The single invocation at the location
      labeled "*" has a location (invocation.globalid) of (10,9). The offset
      relative to its local work group (invocation.localid) is (2,1). Its
      local work group has an offset (invocation.groupid) of (1,2), in units
      of work groups.

    The set of available compute program attribute bindings is enumerated in
    Table X.1.
    All bindings are considered four-component unsigned integer
    vectors with the value of the fourth component undefined.

      Attribute Binding          Components  Underlying State
      -------------------------  ----------  ------------------------------
      invocation.localid         (x,y,z,-)   offset relative to base of
                                             work group

      invocation.globalid        (x,y,z,-)   offset relative to the base
                                             of the dispatched work

      invocation.groupid         (x,y,z,-)   offset (in groups) of local work
                                             group

      invocation.groupcount      (x,y,z,-)   total local work group count

      invocation.groupsize       (x,y,z,-)   number of invocations in each
                                             dimension of the local work group

      invocation.localindex      (x,-,-,-)   one-dimensional (flattened) index
                                             in local workgroup

      Table X.1, Compute Program Attribute Bindings.

    If a compute attribute binding matches "invocation.localid", the "x", "y",
    and "z" components of the invocation attribute variable are filled with
    the "x", "y", "z" components, respectively, of the offset of the
    invocation relative to the base of its local workgroup. The "w" component
    of the attribute is undefined.

    If a compute attribute binding matches "invocation.globalid", the "x",
    "y", and "z" components of the invocation attribute variable are filled
    with the "x", "y", and "z" components, respectively, of the offset of the
    invocation relative to the full compute dispatch. The "w" component of
    the attribute is undefined.

    If a compute attribute binding matches "invocation.groupid", the "x", "y",
    and "z" components of the invocation attribute variable are filled with
    the "x", "y", and "z" components, respectively, of the offset of the local
    work group (in groups) relative to the full compute dispatch. The "w"
    component of the attribute is undefined.

    If a compute attribute binding matches "invocation.groupcount", the "x",
    "y", and "z" components of the invocation attribute variable are filled
    with the "x", "y", and "z" dimensions, respectively, in local work groups
    of the full compute dispatch. The "w" component of the attribute is
    undefined.

    If a compute attribute binding matches "invocation.groupsize", the "x",
    "y", and "z" components of the invocation attribute variable are filled
    with the "x", "y", and "z" dimensions, respectively, of the local work
    group, as specified by the GROUP_SIZE declaration. The "w" component of
    the attribute is undefined.
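
    The relationship between these bindings can be checked against the
    example in Figure X.1. The Python sketch below is an illustration only;
    the helper names decompose_global_id and compose_global_id are
    hypothetical, and the third component is padded with 0 (offsets) or 1
    (sizes) as the two-dimensional example requires.

```python
def decompose_global_id(global_id, group_size):
    """Split a global offset into (groupid, localid), component by component."""
    group_id = tuple(g // s for g, s in zip(global_id, group_size))
    local_id = tuple(g % s for g, s in zip(global_id, group_size))
    return group_id, local_id


def compose_global_id(group_id, local_id, group_size):
    """Rebuild the global offset: groupid * groupsize + localid."""
    return tuple(gid * s + lid
                 for gid, lid, s in zip(group_id, local_id, group_size))


# Figure X.1: work group size 8x4, invocation "*" at globalid (10,9).
group_id, local_id = decompose_global_id((10, 9, 0), (8, 4, 1))
# group_id is (1, 2, 0) and local_id is (2, 1, 0), matching the caption.
```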

    If a compute attribute binding matches "invocation.localindex", the "x"
    component of the invocation attribute variable is filled with a flattened
    one-dimensional index of the invocation, which is derived as:

      invocation.localid.z * invocation.groupsize.x * invocation.groupsize.y +
      invocation.localid.y * invocation.groupsize.x +
      invocation.localid.x

    The "y", "z", and "w" components of the attribute are undefined.

    For one-dimensional dispatches, the "y" components of
    "invocation.localid", "invocation.globalid", and "invocation.groupid" will
    be zero. For one- and two-dimensional dispatches, the "z" components of
    "invocation.localid", "invocation.globalid", and "invocation.groupid" will
    be zero. The same components of "invocation.groupcount" and
    "invocation.groupsize" will be one in these cases.


    (add the following subsection to section 2.X.3.5, Program Results.)

    Compute programs have no result variables; all shader results must be
    written to memory.
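
    Because results must be written to memory, a program commonly needs a
    per-invocation slot to write to, and the flattened index defined for
    "invocation.localindex" is one way to obtain it. The Python sketch below
    simply evaluates that formula; the function name local_index is a
    hypothetical helper used for illustration.

```python
def local_index(local_id, group_size):
    """Flatten a 3D local offset as defined for invocation.localindex."""
    x, y, z = local_id
    size_x, size_y, _size_z = group_size
    return z * size_x * size_y + y * size_x + x


# For the 8x4 work group of Figure X.1, localid (2,1,0) flattens to 10,
# and the last invocation of the group, (7,3,0), flattens to 31.
index_of_star = local_index((2, 1, 0), (8, 4, 1))
```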


    Add New Section 2.X.3.Y, Compute Program Shared Memory, after Section
    2.X.3.6, Program Parameter Buffers

    Compute program shared memory variables are arrays of basic machine units
    from which data can be read or written using the LDS and STS instructions.
    Compute program shared memory also supports atomic memory operations using
    the ATOMS instruction. The GL allocates a single block of shared memory
    for each local work group, whose size in basic machine units is specified
    by the "SHARED_MEMORY" statement. The contents of compute program shared
    memory are undefined when program execution for the local work group
    begins and can be changed only by using the ATOMS or STS instructions.
    Compute program shared memory variables are shared between all invocations
    of a local work group. Writes performed by one invocation will be visible
    for any reads of the same memory from any other invocation executed after
    the write. Note that the order of reads and writes between different
    invocations in a local work group is largely undefined, although the BAR
    instruction can be used to introduce synchronization points for all
    invocations in a local work group.

    Shared memory variables may only be used as operands in the ATOMS, LDS,
    and STS instructions; they may not be used as results or operands
    in general instructions.
    Shared memory variables must be declared
    explicitly via the <SHARED_statement> grammar rule. Shared memory
    bindings can not be used directly in executable instructions.

    Shader storage buffer variables may be declared as arrays, but all
    bindings assigned to the array must use the same binding point(s) and must
    increase consecutively.

      Binding                        Components  Underlying State
      -----------------------------  ----------  -----------------------------
      program.sharedmem[a]           (x,x,x,x)   compute shared memory,
                                                 element a
      program.sharedmem[a..b]        (x,x,x,x)   compute shared memory,
                                                 elements a through b
      program.sharedmem              (x,x,x,x)   compute shared memory,
                                                 all elements

      Table X.3: Shared Memory Bindings. <a> and <b> indicate individual
      elements of shared memory.

    If a shared memory binding matches "program.sharedmem[a]", the shared
    memory variable is associated with basic machine element <a> of compute
    shared memory.

    For shared memory declarations, "program.sharedmem[a..b]" is equivalent to
    specifying elements <a> through <b> of compute shared memory in order.
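
    The binding forms of Table X.3 can be read as shorthand for lists of
    shared memory elements. The Python sketch below expands a binding string
    into the elements it names. It is an illustration under stated
    assumptions: the function name expand_binding is hypothetical, no
    whitespace is accepted inside the binding, and rejecting a range whose
    first element exceeds its last is a choice made for this sketch rather
    than a rule quoted from the text.

```python
import re

# Matches program.sharedmem, program.sharedmem[a], and program.sharedmem[a..b].
_BINDING = re.compile(r"program\.sharedmem(?:\[(\d+)(?:\.\.(\d+))?\])?")


def expand_binding(binding, shared_memory_size):
    """Return the list of shared memory elements named by a binding."""
    match = _BINDING.fullmatch(binding)
    if match is None:
        raise ValueError("not a shared memory binding: " + binding)
    first, last = match.group(1), match.group(2)
    if first is None:
        # Bare "program.sharedmem": every element of the declared block.
        return list(range(shared_memory_size))
    a = int(first)
    b = a if last is None else int(last)
    if a > b:
        raise ValueError("range must not decrease: " + binding)
    return list(range(a, b + 1))
```

    For example, expand_binding("program.sharedmem[4..7]", 16) returns
    [4, 5, 6, 7], and expand_binding("program.sharedmem", 4) returns
    [0, 1, 2, 3].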

    For shared memory declarations, "program.sharedmem" is equivalent to
    specifying elements zero through <N>-1 of compute shared memory in order,
    where <N> is the total shared memory size declared by the "SHARED_MEMORY"
    statement.


    Modify Section 2.X.4, Program Execution Environment

    (add to the opcode table)

                  Modifiers
      Instruction F I C S H D  Out Inputs    Description
      ----------- - - - - - -  --- --------  --------------------------------
      ATOMS       - - X - - -  s   v,su      atomic transaction to shared mem
      BAR         - - - - - -  -   -         work group execution barrier
      LDS         - - X X - F  v   su        load from shared memory
      STS         - - - - - -  -   v,su      store to shared memory


    Modify Section 2.X.4.1, Program Instruction Modifiers

      Modifier  Description
      --------  -----------------------------------------------
      CTA       Memory barrier orders only memory transactions
                relative to invocations within local work group

    (add to descriptions of opcode modifiers)

    For the MEMBAR (memory barrier) instruction, the "CTA" modifier specifies
    that memory transactions before and after the barrier are strongly ordered
    as observed by any other shader invocation in the local work group.


    Modify Section 2.X.4.5, Program Memory Access, from NV_gpu_program5

    (add to the end of the first paragraph) ... Additionally, programs may
    load from or store to shared memory via the ATOMS (atomic shared memory
    operation), LDS (load from shared memory), and STS (store to shared
    memory) instructions.

    (modify miscellaneous other language referring to "buffer object memory"
    to instead refer to "buffer object and shared memory")

    (add hypothetical built-in functions SharedMemoryLoad() and
    SharedMemoryStore() that behave similarly to BufferMemoryLoad() and
    BufferMemoryStore(), except that they access local work group shared
    memory instead of buffer object memory)


    Add the following subsection to section 2.X.7, Program Declarations

    Section 2.X.7.Y, Compute Program Declarations

    Compute programs support two types of declaration statements, as
    described below.

    - Shader Thread Group Size (GROUP_SIZE)

    The GROUP_SIZE statement declares the number of shader threads in a one-,
    two-, or three-dimensional local work group. The statement must have one
    to three unsigned integer arguments.
    Each argument must be less than or
    equal to the value of the implementation-dependent limit
    MAX_COMPUTE_LOCAL_WORK_SIZE for its corresponding dimension (X, Y, or Z).
    A program will fail to load unless it contains exactly one GROUP_SIZE
    declaration.


    - Shared Memory Storage Size (SHARED_MEMORY)

    The SHARED_MEMORY statement declares the size of the shared memory, in
    basic machine units, available to the threads of each local work group.
    The SHARED_MEMORY statement is optional, but a program will fail to load
    if it includes multiple SHARED_MEMORY declarations, if it uses the
    ATOMS, LDS, or STS instructions without a SHARED_MEMORY declaration, if
    it uses these instructions with an offset that would access memory
    beyond the declared shared memory size, or if the declared shared
    memory size is greater than the implementation-dependent limit
    MAX_COMPUTE_SHARED_VARIABLE_SIZE.


    (add the following subsection to section 2.X.8, Program Instruction Set.)

    Section 2.X.8.Z, ATOMS: Atomic Memory Operation (Shared Memory)

    The ATOMS instruction performs an atomic memory operation by reading from
    shared memory specified by the second unsigned integer scalar operand,
    computing a new value based on the value read from memory and the first
    (vector) operand, and then writing the result back to the same memory
    address. The memory transaction is atomic, guaranteeing that no other
    write to the memory accessed will occur between the time it is read and
    written by the ATOMS instruction. The result of the ATOMS instruction is
    the scalar value read from memory. The second operand used for the ATOMS
    instruction must correspond to a shared memory variable declared using the
    "SHARED" statement; a program will fail to load if any other type of
    operand is used for the second operand of an ATOMS instruction.

    The ATOMS instruction has two required instruction modifiers. The atomic
    modifier specifies the type of operation to be performed. The storage
    modifier specifies the size and data type of the operand read from memory
    and the base data type of the operation used to compute the value to be
    written to memory.

      atomic     storage
      modifier   modifiers            operation
      --------   ------------------   --------------------------------------
      ADD        U32, S32, U64, F32   compute a sum
      MIN        U32, S32             compute minimum
      MAX        U32, S32             compute maximum
      IWRAP      U32                  increment memory, wrapping at operand
      DWRAP      U32                  decrement memory, wrapping at operand
      AND        U32, S32             compute bit-wise AND
      OR         U32, S32             compute bit-wise OR
      XOR        U32, S32             compute bit-wise XOR
      EXCH       U32, S32, U64, F32   exchange memory with operand
      CSWAP      U32, S32, U64        compare-and-swap

      Table X.Y, Supported atomic and storage modifiers for the ATOMS
      instruction.

    Not all storage modifiers are supported by ATOMS, and the set of modifiers
    allowed for any given instruction depends on the atomic modifier
    specified. Table X.Y enumerates the set of atomic modifiers supported by
    the ATOMS instruction, and the storage modifiers allowed for each.
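
    Table X.Y can be read as a lookup from atomic modifier to the set of
    storage modifiers it accepts. The Python sketch below transcribes the
    table as listed and checks a modifier pair against it. The names
    ATOMS_STORAGE_MODIFIERS and atoms_modifiers_allowed are hypothetical and
    exist only for this illustration.

```python
# Transcription of Table X.Y: atomic modifier -> allowed storage modifiers.
ATOMS_STORAGE_MODIFIERS = {
    "ADD":   {"U32", "S32", "U64", "F32"},
    "MIN":   {"U32", "S32"},
    "MAX":   {"U32", "S32"},
    "IWRAP": {"U32"},
    "DWRAP": {"U32"},
    "AND":   {"U32", "S32"},
    "OR":    {"U32", "S32"},
    "XOR":   {"U32", "S32"},
    "EXCH":  {"U32", "S32", "U64", "F32"},
    "CSWAP": {"U32", "S32", "U64"},
}


def atoms_modifiers_allowed(atomic_modifier, storage_modifier):
    """Return True if Table X.Y lists this atomic/storage modifier pair."""
    allowed = ATOMS_STORAGE_MODIFIERS.get(atomic_modifier, set())
    return storage_modifier in allowed
```

    For example, the pair (ADD, F32) is listed, while (MIN, U64) and
    (CSWAP, F32) are not.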

      tmp0 = VectorLoad(op0);
      result = SharedMemoryLoad(op1, storageModifier);
      switch (atomicModifier) {
      case ADD:
        writeval = tmp0.x + result;
        break;
      case MIN:
        writeval = min(tmp0.x, result);
        break;
      case MAX:
        writeval = max(tmp0.x, result);
        break;
      case IWRAP:
        writeval = (result >= tmp0.x) ? 0 : result+1;
        break;
      case DWRAP:
        writeval = (result == 0 || result > tmp0.x) ? tmp0.x : result-1;
        break;
      case AND:
        writeval = tmp0.x & result;
        break;
      case OR:
        writeval = tmp0.x | result;
        break;
      case XOR:
        writeval = tmp0.x ^ result;
        break;
      case EXCH:
        writeval = tmp0.x;
        break;
      case CSWAP:
        if (result == tmp0.x) {
          writeval = tmp0.y;
        } else {
          return result;  // no memory store
        }
        break;
      }
      SharedMemoryStore(op1, writeval, storageModifier);

    ATOMS performs a scalar atomic operation. The <y>, <z>, and <w>
    components of the result vector are undefined.
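
    The wrapping and compare-and-swap cases are easier to follow with
    concrete values. The Python sketch below is a single-threaded model of
    the ATOMS semantics for the U32 storage modifier only. It does not model
    atomicity, since nothing else runs concurrently, and it makes several
    assumptions for illustration: shared memory is a list of 32-bit words
    indexed by word, the function name atoms_u32 is hypothetical, ADD wraps
    modulo 2^32, and EXCH stores the operand as the table entry "exchange
    memory with operand" describes.

```python
U32_MASK = 0xFFFFFFFF


def atoms_u32(shared, address, atomic_modifier, x, y=0):
    """Apply one ATOMS-style U32 operation; return the value read from memory."""
    old = shared[address]
    if atomic_modifier == "ADD":
        new = (x + old) & U32_MASK
    elif atomic_modifier == "MIN":
        new = min(x, old)
    elif atomic_modifier == "MAX":
        new = max(x, old)
    elif atomic_modifier == "IWRAP":
        # Increment, wrapping to zero once the operand is reached.
        new = 0 if old >= x else old + 1
    elif atomic_modifier == "DWRAP":
        # Decrement, wrapping to the operand at zero or when out of range.
        new = x if (old == 0 or old > x) else old - 1
    elif atomic_modifier == "AND":
        new = x & old
    elif atomic_modifier == "OR":
        new = x | old
    elif atomic_modifier == "XOR":
        new = x ^ old
    elif atomic_modifier == "EXCH":
        new = x
    elif atomic_modifier == "CSWAP":
        if old != x:
            return old            # comparison failed: no memory store
        new = y
    else:
        raise ValueError("unknown atomic modifier: " + atomic_modifier)
    shared[address] = new
    return old
```

    For example, with memory holding 8, an IWRAP with operand 8 returns 8 and
    leaves 0 in memory; with memory holding 0, a DWRAP with operand 8 returns
    0 and leaves 8 in memory.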

    ATOMS supports no base data type modifiers, but requires exactly one
    storage modifier. The base data types of the result vector and the first
    (vector) operand are derived from the storage modifier. The second
    operand is always interpreted as a scalar unsigned integer.


    Section 2.X.8.Z, BAR: Execution Barrier

    The BAR instruction synchronizes the execution of compute shader
    invocations within a local work group. When a compute shader invocation
    executes the BAR instruction, it pauses until the same BAR instruction has
    been executed by all invocations in the current local work group. Once
    all invocations have executed the BAR instruction, processing continues
    with the instruction following the BAR instruction.

    There is no compile-time restriction on the locations in a program where
    BAR is allowed. However, BAR instructions are not allowed in divergent
    flow control; if any compute shader invocation in the work group executes
    the BAR instruction, all compute shader invocations must execute the
    instruction.
    Results of executing a BAR instruction are undefined and can
    result in application hangs and/or program termination if the instruction
    is issued:

      * inside any IF/ELSE/ENDIF block where the results of the condition
        evaluated by the IF instruction are not identical across the work
        group;

      * inside any iteration of a REP/ENDREP block where at least one
        invocation in the work group has skipped to the next iteration using
        the CONT instruction, exited the loop using a BRK or RET instruction,
        or exited the loop due to having completed the requested number of
        loop iterations; or

      * inside any subroutine (including main) where at least one invocation
        in the work group has exited the subroutine using the RET instruction.

    BAR has no operands and generates no result.


    Section 2.X.8.Z, LDS: Load from Shared Memory

    The LDS instruction generates a result vector by fetching data from the
    shared memory for the current local work group identified by the first
    operand, as described in Section 2.X.4.5. The single operand for the LDS
    instruction must correspond to a shader shared memory variable declared
    using the "SHARED" statement; a program will fail to load if any other
    type of operand is used in an LDS instruction.

      result = SharedMemoryLoad(op0, storageModifier);

    LDS supports no base data type modifiers, but requires exactly one storage
    modifier. The base data type of the result vector is derived from the
    storage modifier.


    Replace Section 2.X.8.Z, MEMBAR: Memory Barrier, as added by
    EXT_shader_image_load_store

    The MEMBAR instruction synchronizes memory transactions to ensure that
    memory transactions resulting from any instruction executed by the thread
    prior to the MEMBAR instruction complete prior to any memory transactions
    issued after the instruction, as observed by other shader invocations.

    The MEMBAR instruction has one optional instruction modifier. If the CTA
    instruction modifier is specified, memory transactions before and after
    the barrier will be strongly ordered as observed by other shader
    invocations in the same local work group. However, it does not order
    transactions as viewed by any other shader. With the CTA modifier,
    shaders not in the local work group may observe the results of memory
    transactions issued after the MEMBAR instruction before those issued
    before the MEMBAR instruction.
    If the CTA instruction modifier is not
    specified, all shader invocations will see the results of any memory
    transaction issued before the MEMBAR instruction before those issued after
    the MEMBAR instruction.

    MEMBAR has no operands and generates no result.


    Section 2.X.8.Z, STS: Store to Shared Memory

    The STS instruction writes the contents of the first vector operand to
    shared memory for the current local work group identified by the second
    operand, as described in Section 2.X.4.5. This instruction generates no
    result. The second operand for the STS instruction must correspond to a
    shared memory variable declared using the "SHARED" statement; a program
    will fail to load if any other type of operand is used in an STS
    instruction.

      tmp0 = VectorLoad(op0);
      SharedMemoryStore(op1, tmp0, storageModifier);

    STS supports no base data type modifiers, but requires exactly one storage
    modifier. The base data type of the vector components of the first
    operand is derived from the storage modifier.


Additions to Chapter 3 of the OpenGL 4.2 (Compatibility Profile) Specification
(Rasterization)

    None.

Additions to Chapter 4 of the OpenGL 4.2 (Compatibility Profile) Specification
(Per-Fragment Operations and the Frame Buffer)

    None.

Additions to Chapter 5 of the OpenGL 4.2 (Compatibility Profile) Specification
(Special Functions)

    None.

Additions to Chapter 6 of the OpenGL 4.2 (Compatibility Profile) Specification
(State and State Requests)

    None.

Additions to the AGL/GLX/WGL Specifications

    None.

GLX Protocol

    None.

Dependencies on NV_shader_atomic_float

    If NV_shader_atomic_float is not supported, the ADD and EXCH atomic
    operations in the ATOMS instruction do not support the "F32" storage
    modifier.

Dependencies on EXT_shader_image_load_store

    If EXT_shader_image_load_store is not supported, language describing the
    "CTA" instruction modifier and modifying the MEMBAR instruction (as added
    by EXT_shader_image_load_store) should be removed.

Errors

    None.

New State

    (Modify ARB_vertex_program, Table X.6 -- Program State)

                                                      Initial
    Get Value                   Type     Get Command  Value    Description               Sec.    Attribute
    ---------                   -------  -----------  -------  ------------------------  ------  ---------
    COMPUTE_PROGRAM_PARAMETER_  Z+       GetIntegerv  0        Active compute program    2.14.1  -
      BUFFER_NV                                                buffer object binding
    COMPUTE_PROGRAM_PARAMETER_  nxZ+     GetInteger-  0        Buffer objects bound for  2.14.1  -
      BUFFER_NV                          IndexedvEXT           compute program use

    Also shares buffer bindings and other state with the ARB_compute_shader
    extension.

New Implementation Dependent State

    None, but shares implementation-dependent state with the
    ARB_compute_shader extension.

Issues

    None.

Revision History

    Rev.  Date      Author    Changes
    ----  --------  --------  --------------------------------------------
     2    10/23/12  pbrown    Remove the restriction forbidding the use of BAR
                              inside potentially divergent flow control.
                              Instead, we will allow BAR to be executed
                              anywhere, but specify undefined results
                              (including hangs or program termination) if the
                              flow control is divergent (bug 9367).

     1              pbrown    Internal spec development.