# Notes on opcodes

_Notes mainly by Connor Abbott, extracted from the disassembler_

LOG_FREXPM:

        // From the ARM patent US20160364209A1:
        // "Decompose v (the input) into numbers x1 and s such that v = x1 * 2^s,
        // and x1 is a floating point value in a predetermined range where the
        // value 1 is within the range and not at one extremity of the range (e.g.
        // choose a range where 1 is towards middle of range)."
        //
        // This computes x1.

FRCP_FREXPM:

        // Given a floating point number m * 2^e, returns m * 2^{-1}. This is
        // exactly the same as the mantissa part of frexp().
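
As a rough illustration (not taken from the disassembler), the relationship with frexp() can be checked in Python. The sketch assumes m is normalized to [1, 2), which the note implies but does not state, and it only handles finite nonzero inputs. `decompose` and `frcp_frexpm` are hypothetical helper names.

```python
import math
from typing import Tuple


def decompose(x: float) -> Tuple[float, int]:
    """Split finite nonzero x into (m, e) with x = m * 2**e and 1 <= |m| < 2.

    The [1, 2) normalization is an assumption of this sketch.
    """
    f, p = math.frexp(x)  # x = f * 2**p with 0.5 <= |f| < 1
    return 2.0 * f, p - 1


def frcp_frexpm(x: float) -> float:
    """Model of FRCP_FREXPM as described in the note: m * 2**-1."""
    m, _e = decompose(x)
    return m * 0.5
```

Under that normalization, `frcp_frexpm(x)` and `math.frexp(x)[0]` agree for any finite nonzero `x`.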

FSQRT_FREXPM:
        // Given a floating point number m * 2^e, returns m * 2^{-2} if e is even,
        // and m * 2^{-1} if e is odd. In other words, scales by powers of 4 until
        // within the range [0.25, 1). Used for square-root and reciprocal
        // square-root.
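
The "scale by powers of 4" reading can be sketched directly as a loop. This is only a software model for positive finite inputs; the function name is hypothetical and the hardware presumably works on the exponent bits rather than looping.

```python
def fsqrt_frexpm(x: float) -> float:
    """Scale a positive finite x by powers of 4 until it lies in [0.25, 1)."""
    if not (0.0 < x < float("inf")):
        raise ValueError("this sketch only models positive finite inputs")
    while x >= 1.0:
        x /= 4.0  # exact: only the exponent changes
    while x < 0.25:
        x *= 4.0
    return x
```

For example, 8 = 1.0 * 2^3 has an odd exponent, so the result is 1.0 * 2^{-1} = 0.5, while 4 = 1.0 * 2^2 has an even exponent and gives 1.0 * 2^{-2} = 0.25.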

FRCP_FREXPE:
        // Given a floating point number m * 2^e, computes -e - 1 as an integer.
        // Zero and infinity/NaN return 0.

FSQRT_FREXPE:
        // Computes floor(e/2) + 1.

FRSQ_FREXPE:
        // Given a floating point number m * 2^e, computes -floor(e/2) - 1 as an
        // integer.
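
To see how the mantissa and exponent halves fit together, here is a minimal Python sketch that recombines them with ldexp(). It assumes m is normalized to [1, 2) and that the input is positive and finite (apart from the special cases FRCP_FREXPE is documented to return 0 for). The plain floating-point division and square root stand in for whatever table lookup and refinement the hardware sequence actually uses, and all function names are hypothetical.

```python
import math


def split(x):
    """Return (m, e) with x = m * 2**e and 1 <= m < 2, for finite x > 0."""
    f, p = math.frexp(x)  # x = f * 2**p with 0.5 <= f < 1
    return 2.0 * f, p - 1


def frcp_frexpm(x):
    m, _ = split(x)
    return m * 0.5


def fsqrt_frexpm(x):
    m, e = split(x)
    return m * (0.25 if e % 2 == 0 else 0.5)


def frcp_frexpe(x):
    if x == 0.0 or not math.isfinite(x):
        return 0  # zero, infinity and NaN return 0
    _, e = split(abs(x))
    return -e - 1


def fsqrt_frexpe(x):
    _, e = split(x)
    return e // 2 + 1  # Python's // is floor division


def frsq_frexpe(x):
    _, e = split(x)
    return -(e // 2) - 1


def model_rcp(x):
    return math.ldexp(1.0 / frcp_frexpm(x), frcp_frexpe(x))


def model_sqrt(x):
    return math.ldexp(math.sqrt(fsqrt_frexpm(x)), fsqrt_frexpe(x))


def model_rsqrt(x):
    return math.ldexp(1.0 / math.sqrt(fsqrt_frexpm(x)), frsq_frexpe(x))
```

The exponents work out because x = (m * 2^{-1}) * 2^{e+1} for the reciprocal, and x = m' * 2^{2 * (floor(e/2) + 1)} for the square-root variants, where m' is the FSQRT_FREXPM result.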

LSHIFT_ADD_LOW32:
        // These instructions in the FMA slot, together with LSHIFT_ADD_HIGH32.i32
        // in the ADD slot, allow one to do a 64-bit addition with an extra small
        // shift on one of the sources. There are three possible scenarios:
        //
        // 1) Full 64-bit addition. Do:
        // out.x = LSHIFT_ADD_LOW32.i64 src1.x, src2.x, shift
        // out.y = LSHIFT_ADD_HIGH32.i32 src1.y, src2.y
        //
        // The shift amount is applied to src2 before adding. The shift amount, and
        // any extra bits from src2 plus the overflow bit, are sent directly from
        // FMA to ADD instead of being passed explicitly. Hence, these two must be
        // bundled together into the same instruction.
        //
        // 2) Add a 64-bit value src1 to a zero-extended 32-bit value src2. Do:
        // out.x = LSHIFT_ADD_LOW32.u32 src1.x, src2, shift
        // out.y = LSHIFT_ADD_HIGH32.i32 src1.y, 0
        //
        // Note that in this case, the second argument to LSHIFT_ADD_HIGH32 is
        // ignored, so it can actually be anything. As before, the shift is applied
        // to src2 before adding.
        //
        // 3) Add a 64-bit value src1 to a sign-extended 32-bit value src2. Do:
        // out.x = LSHIFT_ADD_LOW32.i32 src1.x, src2, shift
        // out.y = LSHIFT_ADD_HIGH32.i32 src1.y, 0
        //
        // The only difference is the .i32 instead of .u32. Otherwise, this is
        // exactly the same as before.
        //
        // In all these instructions, the shift amount is stored where the third
        // source would be, so the shift has to be a small immediate from 0 to 7.
        // This is fine for the expected use-case of these instructions, which is
        // manipulating 64-bit pointers.
        //
        // These instructions can also be combined with various load/store
        // instructions which normally take a 64-bit pointer in order to add a
        // 32-bit or 64-bit offset to the pointer before doing the operation,
        // optionally shifting the offset. The load/store op implicitly does
        // LSHIFT_ADD_HIGH32.i32 internally. Letting ptr be the pointer, and offset
        // the desired offset, the cases go as follows:
        //
        // 1) Add a 64-bit offset:
        // LSHIFT_ADD_LOW32.i64 ptr.x, offset.x, shift
        // ld_st_op ptr.y, offset.y, ...
        //
        // Note that the output of LSHIFT_ADD_LOW32.i64 is not used, instead being
        // implicitly sent to the load/store op to serve as the low 32 bits of the
        // pointer.
        //
        // 2) Add a 32-bit unsigned offset:
        // temp = LSHIFT_ADD_LOW32.u32 ptr.x, offset, shift
        // ld_st_op temp, ptr.y, ...
        //
        // Now the low 32 bits of (offset << shift) + ptr are passed explicitly to
        // the ld_st_op, to match the case where there is no offset and ld_st_op is
        // called directly.
        //
        // 3) Add a 32-bit signed offset:
        // temp = LSHIFT_ADD_LOW32.i32 ptr.x, offset, shift
        // ld_st_op temp, ptr.y, ...
        //
        // Again, the same as the unsigned case, except that the offset is
        // sign-extended.
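
The split between the two halves can be modelled in a few lines of Python. This is a sketch of the arithmetic only, under the assumption that the side-band data forwarded from FMA to ADD consists of the shift amount, the mode, and everything above bit 31 of the low-half sum (shifted-out bits of src2 plus the carry). The `Sideband` type, the function names, and the `mode` strings are hypothetical and do not correspond to real encodings.

```python
from typing import NamedTuple, Tuple

MASK32 = 0xFFFFFFFF


class Sideband(NamedTuple):
    """Hypothetical model of what FMA forwards to ADD."""
    shift: int
    upper: int         # bits above bit 31 of the low sum (may be negative for .i32)
    use_src2_hi: bool  # only the .i64 form consumes the second HIGH32 source


def sign_extend32(value: int) -> int:
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def lshift_add_low32(src1_lo: int, src2: int, shift: int, mode: str) -> Tuple[int, Sideband]:
    if not 0 <= shift <= 7:
        raise ValueError("shift must be a small immediate from 0 to 7")
    if mode == "i32":
        operand = sign_extend32(src2)
    elif mode in ("u32", "i64"):
        operand = src2 & MASK32
    else:
        raise ValueError("mode must be 'i64', 'u32' or 'i32'")
    total = (src1_lo & MASK32) + (operand << shift)
    return total & MASK32, Sideband(shift, total >> 32, mode == "i64")


def lshift_add_high32(src1_hi: int, src2_hi: int, sideband: Sideband) -> int:
    extra = (src2_hi << sideband.shift) if sideband.use_src2_hi else 0
    return (src1_hi + extra + sideband.upper) & MASK32


def add64(src1: int, src2: int, shift: int, mode: str) -> int:
    """Compute src1 + (src2 << shift) modulo 2**64 using the two 32-bit halves."""
    lo, sideband = lshift_add_low32(src1 & MASK32, src2 & MASK32, shift, mode)
    hi = lshift_add_high32((src1 >> 32) & MASK32, (src2 >> 32) & MASK32, sideband)
    return (hi << 32) | lo
```

For instance, `add64(0x0000000100000000, 0xFFFFFFFF, 3, "i32")` treats the offset as -1, shifts it to -8, and yields `0x00000000FFFFFFF8`, whereas the `"u32"` form of the same call yields `0x00000008FFFFFFF8`.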

---

ADD ops:

F16_TO_F32.X: // take the low  16 bits, and expand it to a 32-bit float
F16_TO_F32.Y: // take the high 16 bits, and expand it to a 32-bit float
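
A small Python sketch of the two component selects, using the `struct` module's IEEE half-precision format code. The function name and the `"x"`/`"y"` component argument are hypothetical. Python floats are doubles, but every half-precision value is exactly representable as a 32-bit float, so the numeric result is the same.

```python
import struct


def f16_to_f32(word: int, component: str) -> float:
    """Expand the low ("x") or high ("y") 16 bits of a 32-bit word from fp16."""
    if component == "x":
        half_bits = word & 0xFFFF
    elif component == "y":
        half_bits = (word >> 16) & 0xFFFF
    else:
        raise ValueError("component must be 'x' or 'y'")
    (value,) = struct.unpack("<e", struct.pack("<H", half_bits))
    return value
```

With `word = 0x40003C00`, the `.X` variant sees `0x3C00` (1.0) and the `.Y` variant sees `0x4000` (2.0).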

MOV:
        // Logically, this should be SWZ.XY, but that's equivalent to a move, and
        // this seems to be the canonical way the blob generates a MOV.

FRCP_FREXPM:
        // Given a floating point number m * 2^e, returns m * 2^{-1}.

FLOG_FREXPE:
        // From the ARM patent US20160364209A1:
        // "Decompose v (the input) into numbers x1 and s such that v = x1 * 2^s,
        // and x1 is a floating point value in a predetermined range where the
        // value 1 is within the range and not at one extremity of the range (e.g.
        // choose a range where 1 is towards middle of range)."
        //
        // This computes s.
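
The decomposition quoted from the patent can be sketched as follows. The range [0.75, 1.5) is purely an assumption chosen so that 1 sits away from both ends; the notes do not say which range the hardware uses. `log_frexp` is a hypothetical name, and only positive finite inputs are modelled.

```python
import math
from typing import Tuple


def log_frexp(v: float, low: float = 0.75) -> Tuple[float, int]:
    """Return (x1, s) with v = x1 * 2**s and low <= x1 < 2 * low.

    The default range [0.75, 1.5) is an assumption of this sketch.
    """
    if not (0.0 < v < float("inf")) or not (0.5 < low <= 1.0):
        raise ValueError("sketch only handles positive finite v and 0.5 < low <= 1")
    x1, s = math.frexp(v)  # x1 in [0.5, 1)
    if x1 < low:
        x1 *= 2.0          # move [0.5, low) up to [1, 2 * low)
        s -= 1
    return x1, s
```

Here x1 corresponds to what the mantissa op computes and s to what the exponent op computes, so log2(v) = log2(x1) + s with x1 close to 1.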

LD_UBO.v4i32:
        // src0 = offset, src1 = binding

FRCP_FAST.f32:
        // *_FAST does not exist on G71 (added to G51, G72, and everything after)

FRCP_TABLE:
        // Given a floating point number m * 2^e, produces a table-based
        // approximation of 2/m using the top 17 bits. Includes special cases for
        // infinity, NaN, and zero, and copies the sign bit.

FRCP_FAST.f16.X:
        // Exists on G71

FRSQ_TABLE:
        // A similar table for inverse square root, using the high 17 bits of the
        // mantissa as well as the low bit of the exponent.

FRCP_APPROX:
        // Used in the argument reduction for log. Given a floating-point number
        // m * 2^e, uses the top 4 bits of m to produce an approximation to 1/m
        // with the exponent forced to 0 and only the top 5 bits nonzero. 0,
        // infinity, and NaN all return 1.0.
        // See the ARM patent for more information.

MUX:
        // For each bit i, return src2[i] ? src0[i] : src1[i]. In other words, this
        // is the same as (src2 & src0) | (~src2 & src1).
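
The bitwise select is easy to check in Python. The 32-bit register width is an assumption of this sketch, and `mux` is just an illustrative name.

```python
MASK32 = 0xFFFFFFFF


def mux(src0: int, src1: int, src2: int) -> int:
    """Per-bit select: take src0 where src2 has a 1 bit, src1 where it has a 0 bit."""
    return ((src2 & src0) | (~src2 & src1)) & MASK32
```

For example, `mux(0xAAAAAAAA, 0x55555555, 0xFFFF0000)` takes the upper half from src0 and the lower half from src1, giving `0xAAAA5555`.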

ST_VAR:
        // store a varying given the address and datatype from LD_VAR_ADDR

LD_VAR_ADDR:
        // Compute varying address and datatype (for storing in the vertex shader),
        // and store the vec3 result in the data register. The result is passed as
        // the 3 normal arguments to ST_VAR.

DISCARD:
        // Implements conditional discards (discard_if in NIR). Compares the first
        // two sources and discards if the result is true.

ATEST.f32:
        // Implements alpha-to-coverage, as well as possibly the late depth and
        // stencil tests. The first source is the existing sample mask in R60
        // (possibly modified by gl_SampleMask), and the second source is the alpha
        // value. The sample mask is written right away based on the
        // alpha-to-coverage result using the normal register write mechanism,
        // since that doesn't need to read from any memory, and then written again
        // later based on the result of the stencil and depth tests using the
        // special register.

BLEND:
        // This takes the sample coverage mask (computed by ATEST above) as a
        // regular argument, in addition to the vec4 color in the special register.