# Unofficial GCN/RDNA ISA reference errata

## `v_sad_u32`

The Vega ISA reference describes its behaviour as:

```
D.u = abs(S0.i - S1.i) + S2.u.
```

This is incorrect. The actual behaviour is what is written in the GCN3 reference
guide:

```
ABS_DIFF (A,B) = (A>B) ? (A-B) : (B-A)
D.u = ABS_DIFF (S0.u,S1.u) + S2.u
```

The instruction doesn't subtract S1 from S0 and take the absolute value of the
result (the _signed_ distance); it uses the _unsigned_ distance between the
operands. So `v_sad_u32(-5, 0, 0)` returns `4294967291` (`-5` interpreted as
unsigned), not `5`.
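
The difference between the two descriptions can be sketched in Python. This is only a software model of the formulas quoted above, not a description of the hardware; the function names and the 32-bit wrap-around of the final add are assumptions made for the illustration.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value: int) -> int:
    """Reinterpret the low 32 bits of value as a signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def v_sad_u32(s0: int, s1: int, s2: int) -> int:
    """GCN3 formula: D.u = ABS_DIFF(S0.u, S1.u) + S2.u."""
    a = s0 & MASK32  # operands are treated as unsigned 32-bit values
    b = s1 & MASK32
    abs_diff = a - b if a > b else b - a
    # Assumption: the result wraps around to 32 bits.
    return (abs_diff + (s2 & MASK32)) & MASK32


def vega_doc_formula(s0: int, s1: int, s2: int) -> int:
    """The incorrect Vega formula: D.u = abs(S0.i - S1.i) + S2.u."""
    signed_diff = abs(to_signed32(s0) - to_signed32(s1))
    return (signed_diff + (s2 & MASK32)) & MASK32
```

With these definitions, `v_sad_u32(-5, 0, 0)` evaluates to `4294967291`, while `vega_doc_formula(-5, 0, 0)` evaluates to `5`.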

## `s_bfe_*`

The RDNA, Vega and GCN3 ISA references all state that these instructions don't
write SCC. They do.

## `v_bcnt_u32_b32`

The Vega ISA reference describes its behaviour as:

```
D.u = 0;
for i in 0 ... 31 do
D.u += (S0.u[i] == 1 ? 1 : 0);
endfor.
```

This is incorrect. The actual behaviour (and number of operands) is what
is written in the GCN3 reference guide:

```
D.u = CountOneBits(S0.u) + S1.u.
```
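
A minimal Python model of the GCN3 formula, for illustration only. The function name and the 32-bit wrap-around of the add are assumptions, not something the reference guides spell out.

```python
MASK32 = 0xFFFFFFFF


def v_bcnt_u32_b32(s0: int, s1: int) -> int:
    """GCN3 formula: D.u = CountOneBits(S0.u) + S1.u."""
    one_bits = bin(s0 & MASK32).count("1")  # population count of the low 32 bits
    # Assumption: the result wraps around to 32 bits.
    return (one_bits + (s1 & MASK32)) & MASK32
```

The second operand acts as an accumulator, so `v_bcnt_u32_b32(0b1011, 10)` gives `13` rather than `3`.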

## `v_alignbyte_b32`

All versions of the ISA document are vague about it, but after some trial and
error we discovered that only the low 2 bits of the third operand are used.
Therefore, this instruction can't shift by more than 24 bits.

The correct description of `v_alignbyte_b32` is probably the following:

```
D.u = ({S0, S1} >> (8 * S2.u[1:0])) & 0xffffffff
```
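
The probable behaviour above can be modelled in a few lines of Python. This sketch assumes that `{S0, S1}` is a 64-bit concatenation with S0 in the high dword, as the notation suggests; the function name is made up for the example.

```python
MASK32 = 0xFFFFFFFF


def v_alignbyte_b32(s0: int, s1: int, s2: int) -> int:
    """D.u = ({S0, S1} >> (8 * S2.u[1:0])) & 0xffffffff"""
    # Assumption: S0 forms the high dword and S1 the low dword of {S0, S1}.
    combined = ((s0 & MASK32) << 32) | (s1 & MASK32)
    shift = 8 * (s2 & 0x3)  # only the low 2 bits of S2 are used
    return (combined >> shift) & MASK32
```

Because only two bits of the third operand matter, a shift operand of `4` behaves like `0` and `7` behaves like `3` in this model.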

## SMEM stores

The Vega ISA reference doesn't say this (or doesn't make it clear), but
the offset for SMEM stores must be in m0 if IMM == 0.

The RDNA ISA reference doesn't mention SMEM stores at all, but they seem to be
supported by the chip and are present in LLVM. AMD devs, however, highly
recommend avoiding these instructions.

## SMEM atomics

As with SMEM stores, the RDNA ISA reference pretends they don't exist, but they
are present in LLVM.

## VMEM stores

All reference guides say (under "Vector Memory Instruction Data Dependencies"):

> When a VM instruction is issued, the address is immediately read out of VGPRs
> and sent to the texture cache. Any texture or buffer resources and samplers
> are also sent immediately. However, write-data is not immediately sent to the
> texture cache.

Reading that, one might think that waitcnts need to be added when writing to
the registers used for a VMEM store's data. Experimentation has shown that this
does not seem to be the case on GFX8 and GFX9 (GFX6 and GFX7 are untested). It
also seems unlikely, since NOPs are apparently needed in a subset of these
situations.

## MIMG opcodes on GFX8/GCN3

The `image_atomic_{swap,cmpswap,add,sub}` opcodes in the GCN3 ISA reference
guide are incorrect. The Vega ISA reference guide has the correct ones.

## VINTRP encoding

The Vega ISA doc says the encoding should be `110010`, but `110101` works.

## VOP1 instructions encoded as VOP3

The RDNA ISA doc says that `0x140` should be added to the opcode, but that
doesn't work. What works is adding `0x180`, which is also what LLVM does.
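
As a small illustration of the opcode translation, the sketch below applies the offset that works in practice. The constant and function names are hypothetical, and the opcode value in the usage note is an arbitrary example rather than a statement about any particular instruction.

```python
# Offset that works in practice and that LLVM also uses.
VOP3_OFFSET_FOR_VOP1 = 0x180
# Offset given in the RDNA ISA doc, which does not work.
DOCUMENTED_OFFSET_FOR_VOP1 = 0x140


def vop1_opcode_as_vop3(vop1_opcode: int) -> int:
    """Return the VOP3 opcode for a VOP1 instruction on RDNA."""
    return vop1_opcode + VOP3_OFFSET_FOR_VOP1
```

For example, a VOP1 opcode of `0x01` would be encoded as `0x181` in VOP3, not `0x141`.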

## FLAT, Scratch, Global instructions

The NV bit was removed in RDNA, but some parts of the doc still mention it.

Section 13.8.1 of the RDNA ISA doc says that SADDR should be set to 0x7f when
ADDR is used, but section 9.3.1 says it should be set to NULL. We assume 9.3.1
is correct and set it to SGPR_NULL.

## Legacy instructions

Some instructions have a `_LEGACY` variant which implements "DX9 rules", in which
the zero "wins" in multiplications, i.e. `0.0*x` is always `0.0`. The Vega ISA
mentions `V_MAC_LEGACY_F32`, but this instruction doesn't actually exist on Vega.
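
The "zero wins" rule can be illustrated with a short Python model. This sketch only captures the zero handling described above: it uses Python's double-precision floats rather than f32, and returning positive zero regardless of operand signs is an assumption of the example.

```python
import math


def mul_legacy(a: float, b: float) -> float:
    """Multiply with DX9 rules: a zero operand always yields 0.0."""
    if a == 0.0 or b == 0.0:
        # Assumption: the result is +0.0 whatever the signs of the operands.
        return 0.0
    return a * b


def mul_ieee(a: float, b: float) -> float:
    """Ordinary IEEE multiplication, for comparison."""
    return a * b
```

The two differ when the other operand is infinite or NaN: `mul_ieee(0.0, math.inf)` is NaN, whereas `mul_legacy(0.0, math.inf)` is `0.0`.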

## `m0` with LDS instructions on Vega and newer

The Vega ISA doc (both the old one and the "7nm" one) claims that LDS instructions
use the `m0` register for address clamping like older GPUs, but this is not the case.

In reality, only the `_addtid` variants of LDS instructions use `m0` on Vega and
newer GPUs, so the relevant section of the RDNA ISA doc seems to apply.
LLVM also doesn't emit any initialization of `m0` for LDS instructions, and this
was also confirmed by AMD devs.

## RDNA L0, L1 cache and DLC, GLC bits

The old L1 cache was renamed to L0, and a new L1 cache was added to RDNA. There
is one L1 cache per shader array. Some instruction encodings have DLC and
GLC bits that interact with the cache.

* DLC ("device level coherent") bit: controls the L1 cache
* GLC ("globally coherent") bit: controls the L0 cache

The recommendation from AMD devs is to always set these two bits at the same time,
as it doesn't make much sense to set them independently, aside from some
circumstances (e.g. we needn't set DLC when only one shader array is used).

Stores and atomics always bypass the L1 cache, so they don't support the DLC bit,
and it shouldn't be set in these cases. Setting the DLC bit in these cases can
result in graphical glitches or hangs.
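
The rules above can be summarised as a small decision helper. This is a hypothetical sketch, not code from any driver: the function name and the `"load"`, `"store"` and `"atomic"` labels are invented for the example, and it ignores the single-shader-array exception.

```python
def dlc_bit(kind: str, glc: bool) -> bool:
    """Decide the DLC bit for a memory instruction, given its GLC bit.

    kind is one of "load", "store" or "atomic" (hypothetical labels).
    """
    if kind not in ("load", "store", "atomic"):
        raise ValueError(f"unknown instruction kind: {kind}")
    if kind != "load":
        # Stores and atomics bypass L1 and must never set DLC.
        return False
    # For loads, follow the recommendation to set DLC together with GLC.
    return glc
```

Here `dlc_bit("load", True)` is `True`, while `dlc_bit("store", True)` and `dlc_bit("atomic", True)` are both `False`.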

## RDNA `s_dcache_wb`

`s_dcache_wb` is not mentioned in the RDNA ISA doc, but it is needed in order
to achieve correct behavior in some SSBO CTS tests.

## RDNA subvector mode

The documentation of `s_subvector_loop_begin` and `s_subvector_loop_end` is not clear
on what sort of addressing should be used, but it says that it
"is equivalent to an `S_CBRANCH` with extra math", so the subvector loop handling
in ACO is done according to the `s_cbranch` doc.

## RDNA early rasterization

The ISA documentation says about `s_endpgm`:

> The hardware implicitly executes S_WAITCNT 0 and S_WAITCNT_VSCNT 0
> before executing this instruction.

What the doc doesn't say is that in the case of NGG (and legacy VS), when there
are no param exports, the driver sets `NO_PC_EXPORT=1` for optimal performance.
When this is set, the hardware will start clipping and rasterization
as soon as it encounters a position export with `DONE=1`, without waiting
for the NGG (or VS) to finish.

It can even launch PS waves before NGG (or VS) ends.

When this happens, any store performed by a VS is not guaranteed
to be complete when PS tries to load it, so we need to manually
insert wait instructions before the position exports.

## A16 and G16

On GFX9, the A16 field enables both 16-bit addresses and derivatives.
On GFX10+ these are fully independent of each other: A16 controls 16-bit addresses
and the G16 opcodes control 16-bit derivatives. A16 without G16 uses 32-bit derivatives.

# Hardware Bugs

## SMEM corrupts VCCZ on SI/CI

[See this LLVM source.](https://github.com/llvm/llvm-project/blob/acb089e12ae48b82c0b05c42326196a030df9b82/llvm/lib/Target/AMDGPU/SIInsertWaits.cpp#L580-L616)

After issuing an SMEM instruction, we need to wait for it to finish and then
write to vcc (for example, `s_mov_b64 vcc, vcc`) to correct vccz.

Currently, we don't do this.

## SGPR offset on MUBUF prevents addr clamping on SI/CI

[See this LLVM source.](https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp#L1917-L1922)

This leads to wrong bounds checking; using a VGPR offset fixes it.

## GCN / GFX6 hazards

### VINTRP followed by a read with `v_readfirstlane` or `v_readlane`

One wait state must be inserted when the destination VGPR of any `v_interp_*`
is subsequently read with `v_readfirstlane` or `v_readlane`, to fix GPU hangs
on GFX6. Note that `v_writelane_*` is apparently not affected. This hazard
isn't documented anywhere, but AMD confirmed it.

## RDNA / GFX10 hazards

### SMEM store followed by a load with the same address

We found that an `s_buffer_load` will produce incorrect results if it is preceded
by an `s_buffer_store` with the same address. Inserting an `s_nop` between them
does not mitigate the issue, so an `s_waitcnt lgkmcnt(0)` must be inserted.
This is not mentioned by LLVM among the other GFX10 bugs, but LLVM doesn't use
SMEM stores, so it's not surprising that they didn't notice it.

### VMEMtoScalarWriteHazard

Triggered by:
VMEM/FLAT/GLOBAL/SCRATCH/DS instruction reads an SGPR (or EXEC, or M0).
Then, a SALU/SMEM instruction writes the same SGPR.

Mitigated by:
A VALU instruction or an `s_waitcnt` between the two instructions.
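
The trigger and mitigation rules above can be expressed as a simple linear scan over an instruction list. The following is a hypothetical sketch for illustration and does not reflect how ACO or LLVM implement the check: the `Instr` type, the instruction kind labels and the register names are all invented, and branches are ignored.

```python
from dataclasses import dataclass

VMEM_LIKE = {"VMEM", "FLAT", "GLOBAL", "SCRATCH", "DS"}
SCALAR_WRITERS = {"SALU", "SMEM"}
MITIGATIONS = {"VALU", "S_WAITCNT"}


@dataclass
class Instr:
    kind: str                        # hypothetical label, e.g. "VMEM" or "SALU"
    reads: frozenset = frozenset()   # scalar registers read (SGPRs, EXEC, M0)
    writes: frozenset = frozenset()  # scalar registers written


def find_vmem_to_scalar_write_hazards(program):
    """Return the indices of instructions that trigger the hazard."""
    hazards = []
    pending = set()  # registers read by VMEM-like instructions, not yet mitigated
    for index, instr in enumerate(program):
        if instr.kind in MITIGATIONS:
            pending.clear()
        elif instr.kind in VMEM_LIKE:
            pending |= instr.reads
        elif instr.kind in SCALAR_WRITERS and pending & instr.writes:
            hazards.append(index)
    return hazards
```

Given a VMEM instruction that reads `s4` followed directly by a SALU instruction that writes `s4`, the scan reports index `1`; placing a VALU instruction or an `s_waitcnt` between them makes the result empty.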

### SMEMtoVectorWriteHazard

Triggered by:
An SMEM instruction reads an SGPR. Then, a VALU instruction writes that same SGPR.

Mitigated by:
Any non-SOPP SALU instruction (except `s_setvskip`, `s_version`, and any non-lgkmcnt `s_waitcnt`).

### Offset3fBug

Any branch that is located at offset 0x3f will be buggy. Just insert some NOPs to make sure no branch
is located at this offset.

### InstFwdPrefetchBug

According to LLVM, the `s_inst_prefetch` instruction can cause a hang.
There are no further details.

### LdsMisalignedBug

When there is a misaligned multi-dword FLAT load/store instruction in WGP mode,
it needs to be split into multiple single-dword FLAT instructions.

ACO doesn't use FLAT load/store on GFX10, so it is unaffected.

### FlatSegmentOffsetBug

The 12-bit immediate OFFSET field of FLAT instructions must always be 0.
GLOBAL and SCRATCH are unaffected.

ACO doesn't use FLAT load/store on GFX10, so it is unaffected.

### VcmpxPermlaneHazard

Triggered by:
Any permlane instruction that follows any VOPC instruction.
AMD devs confirmed that, despite the name, this doesn't only affect `v_cmpx`.

Mitigated by: any VALU instruction except `v_nop`.

### VcmpxExecWARHazard

Triggered by:
Any non-VALU instruction reads the EXEC mask. Then, any VALU instruction writes the EXEC mask.

Mitigated by:
A VALU instruction that writes an SGPR (or has a valid SDST operand), or `s_waitcnt_depctr 0xfffe`.
Note: `s_waitcnt_depctr` is an internal instruction, so there is no further information
about what it does or what its operand means.

### LdsBranchVmemWARHazard

Triggered by:
VMEM/GLOBAL/SCRATCH instruction, then a branch, then a DS instruction,
or vice versa: DS instruction, then a branch, then a VMEM/GLOBAL/SCRATCH instruction.

Mitigated by:
Only `s_waitcnt_vscnt null, 0`. Needed even if the first instruction is a load.

### NSAClauseBug

"MIMG-NSA in a hard clause has unpredictable results on GFX10.1"

### NSAMaxSize5

NSA MIMG instructions should be limited to 3 dwords before GFX10.3 to avoid
stability issues: https://reviews.llvm.org/D103348