# Unofficial GCN/RDNA ISA reference errata

## `v_sad_u32`

The Vega ISA reference writes its behaviour as:

```
D.u = abs(S0.i - S1.i) + S2.u.
```

This is incorrect. The actual behaviour is what is written in the GCN3 reference
guide:

```
ABS_DIFF (A,B) = (A>B) ? (A-B) : (B-A)
D.u = ABS_DIFF (S0.u,S1.u) + S2.u
```

The instruction doesn't subtract S1 from S0 and take the absolute value (the
_signed_ distance); it uses the _unsigned_ distance between the operands. So
`v_sad_u32(-5, 0, 0)` would return `4294967291` (`-5` interpreted as unsigned),
not `5`.

## `s_bfe_*`

The RDNA, Vega and GCN3 ISA references all state that these instructions don't
write SCC. They do.

## `v_bcnt_u32_b32`

The Vega ISA reference writes its behaviour as:

```
D.u = 0;
for i in 0 ... 31 do
D.u += (S0.u[i] == 1 ? 1 : 0);
endfor.
```

This is incorrect.
The actual behaviour (and number of operands) is what
is written in the GCN3 reference guide:

```
D.u = CountOneBits(S0.u) + S1.u.
```

## `v_alignbyte_b32`

All versions of the ISA document are vague about it, but after some trial and
error we discovered that only 2 bits of the 3rd operand are used.
Therefore, this instruction can't shift more than 24 bits.

The correct description of `v_alignbyte_b32` is probably the following:

```
D.u = ({S0, S1} >> (8 * S2.u[1:0])) & 0xffffffff
```

## SMEM stores

The Vega ISA reference doesn't say this (or doesn't make it clear), but
the offset for SMEM stores must be in m0 if IMM == 0.

The RDNA ISA doesn't mention SMEM stores at all, but they seem to be supported
by the chip and are present in LLVM. AMD devs however highly recommend avoiding
these instructions.

## SMEM atomics

As with SMEM stores, the RDNA ISA doc pretends they don't exist, but they
are present in LLVM.
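
Looking back at the ALU errata above (`v_sad_u32`, `v_bcnt_u32_b32` and `v_alignbyte_b32`), the corrected pseudocode can be modelled in a few lines of Python. This is only a sketch for checking expectations on paper, not how ACO or the hardware implements them. The function names simply mirror the instruction names, and truncating the result to 32 bits on overflow is an assumption that the pseudocode does not spell out.

```python
MASK32 = 0xFFFFFFFF  # results are assumed to wrap to 32 bits


def v_sad_u32(s0, s1, s2):
    # Operands are treated as unsigned, as in the GCN3 description.
    a, b = s0 & MASK32, s1 & MASK32
    abs_diff = a - b if a > b else b - a
    return (abs_diff + s2) & MASK32


def v_bcnt_u32_b32(s0, s1):
    # Two operands: population count of S0 plus S1.
    return (bin(s0 & MASK32).count("1") + s1) & MASK32


def v_alignbyte_b32(s0, s1, s2):
    # {S0, S1} is the 64-bit concatenation with S0 in the high dword.
    combined = ((s0 & MASK32) << 32) | (s1 & MASK32)
    # Only the low 2 bits of S2 are used, so the shift is at most 24 bits.
    return (combined >> (8 * (s2 & 0x3))) & MASK32
```

With this model, `v_sad_u32(-5, 0, 0)` evaluates to `4294967291`, matching the example in the `v_sad_u32` section, and a shift amount of 5 passed to `v_alignbyte_b32` behaves like a shift amount of 1.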

## VMEM stores

All reference guides say (under "Vector Memory Instruction Data Dependencies"):

> When a VM instruction is issued, the address is immediately read out of VGPRs
> and sent to the texture cache. Any texture or buffer resources and samplers
> are also sent immediately. However, write-data is not immediately sent to the
> texture cache.

Reading that, one might think that waitcnts need to be added when writing to
the registers used for a VMEM store's data. Experimentation has shown that this
does not seem to be the case on GFX8 and GFX9 (GFX6 and GFX7 are untested). It
also seems unlikely, since NOPs are apparently needed in a subset of these
situations.

## MIMG opcodes on GFX8/GCN3

The `image_atomic_{swap,cmpswap,add,sub}` opcodes in the GCN3 ISA reference
guide are incorrect. The Vega ISA reference guide has the correct ones.

## VINTRP encoding

VEGA ISA doc says the encoding should be `110010` but `110101` works.

## VOP1 instructions encoded as VOP3

RDNA ISA doc says that `0x140` should be added to the opcode, but that doesn't
work. What works is adding `0x180`, which LLVM also does.
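
As a minimal sketch of the opcode adjustment just described, the following shows the arithmetic with the offset that was found to work. The helper name and the example opcode value are hypothetical and exist only for illustration.

```python
# Offset reported to work on RDNA (the ISA doc claims 0x140 instead).
VOP1_AS_VOP3_OFFSET = 0x180


def vop1_opcode_as_vop3(vop1_opcode):
    # Hypothetical helper: maps a VOP1 opcode to its VOP3-encoded opcode.
    return vop1_opcode + VOP1_AS_VOP3_OFFSET
```

For example, a VOP1 opcode of `0x01` would be emitted as `0x181` when encoded as VOP3.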

## FLAT, Scratch, Global instructions

The NV bit was removed in RDNA, but some parts of the doc still mention it.

RDNA ISA doc 13.8.1 says that SADDR should be set to 0x7f when ADDR is used, but
9.3.1 says it should be set to NULL. We assume 9.3.1 is correct and set it to
SGPR_NULL.

## Legacy instructions

Some instructions have a `_LEGACY` variant which implements "DX9 rules", in which
the zero "wins" in multiplications, i.e. `0.0*x` is always `0.0`. The Vega ISA
mentions `V_MAC_LEGACY_F32` but this instruction is not really there on Vega.

## `m0` with LDS instructions on Vega and newer

The Vega ISA doc (both the old one and the "7nm" one) claims that LDS instructions
use the `m0` register for address clamping like older GPUs, but this is not the case.

In reality, only the `_addtid` variants of LDS instructions use `m0` on Vega and
newer GPUs, so the relevant section of the RDNA ISA doc seems to apply.
LLVM also doesn't emit any initialization of `m0` for LDS instructions, and this
was also confirmed by AMD devs.

## RDNA L0, L1 cache and DLC, GLC bits

The old L1 cache was renamed to L0, and a new L1 cache was added to RDNA. There
is one L1 cache per shader array.
Some instruction encodings have DLC and
GLC bits that interact with the cache.

* DLC ("device level coherent") bit: controls the L1 cache
* GLC ("globally coherent") bit: controls the L0 cache

The recommendation from AMD devs is to always set these two bits at the same time,
as it doesn't make too much sense to set them independently, aside from some
circumstances (e.g. we needn't set DLC when only one shader array is used).

Stores and atomics always bypass the L1 cache, so they don't support the DLC bit,
and it shouldn't be set in these cases. Setting DLC in these cases can result
in graphical glitches or hangs.

## RDNA `s_dcache_wb`

`s_dcache_wb` is not mentioned in the RDNA ISA doc, but it is needed in order
to achieve correct behavior in some SSBO CTS tests.

## RDNA subvector mode

The documentation of `s_subvector_loop_begin` and `s_subvector_loop_end` is not clear
on what sort of addressing should be used, but it says that it
"is equivalent to an `S_CBRANCH` with extra math", so the subvector loop handling
in ACO is done according to the `s_cbranch` doc.

## RDNA early rasterization

The ISA documentation says about `s_endpgm`:

> The hardware implicitly executes S_WAITCNT 0 and S_WAITCNT_VSCNT 0
> before executing this instruction.

What the doc doesn't say is that in case of NGG (and legacy VS) when there
are no param exports, the driver sets `NO_PC_EXPORT=1` for optimal performance,
and when this is set, the hardware will start clipping and rasterization
as soon as it encounters a position export with `DONE=1`, without waiting
for the NGG (or VS) to finish.

It can even launch PS waves before NGG (or VS) ends.

When this happens, any store performed by a VS is not guaranteed
to be complete when PS tries to load it, so we need to manually
make sure to insert wait instructions before the position exports.

## A16 and G16

On GFX9, the A16 field enables both 16 bit addresses and derivatives.
Since GFX10, these are fully independent of each other: A16 controls 16 bit addresses
and G16 opcodes control 16 bit derivatives. A16 without G16 uses 32 bit derivatives.

# Hardware Bugs

## SMEM corrupts VCCZ on SI/CI

[See this LLVM source.](https://github.com/llvm/llvm-project/blob/acb089e12ae48b82c0b05c42326196a030df9b82/llvm/lib/Target/AMDGPU/SIInsertWaits.cpp#L580-L616)

After issuing an SMEM instruction, we need to wait for the SMEM instruction to
finish and then write to vcc (for example, `s_mov_b64 vcc, vcc`) to correct vccz.

Currently, we don't do this.

## SGPR offset on MUBUF prevents addr clamping on SI/CI

[See this LLVM source.](https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp#L1917-L1922)

This leads to wrong bounds checking; using a VGPR offset fixes it.

## GCN / GFX6 hazards

### VINTRP followed by a read with `v_readfirstlane` or `v_readlane`

It's required to insert 1 wait state if the dst VGPR of any `v_interp_*` is
followed by a read with `v_readfirstlane` or `v_readlane` to fix GPU hangs on GFX6.
Note that `v_writelane_*` is apparently not affected. This hazard isn't
documented anywhere but AMD confirmed it.
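
The GFX6 VINTRP hazard above could be handled by a small pass along the following lines. This is a hedged sketch, not ACO's actual hazard code: the `Instr` record, the `insert_vintrp_readlane_nops` function and the register-set representation are all hypothetical. It also assumes that the hazard only concerns two back-to-back instructions and that a single `s_nop` provides the required wait state.

```python
from dataclasses import dataclass, field


@dataclass
class Instr:
    # Hypothetical, simplified instruction record for illustration only.
    opcode: str
    dst_vgprs: set = field(default_factory=set)
    src_vgprs: set = field(default_factory=set)


READLANE_OPS = ("v_readfirstlane_b32", "v_readlane_b32")


def insert_vintrp_readlane_nops(instrs):
    """Insert an s_nop between a v_interp_* and an immediately following
    v_readfirstlane/v_readlane that reads the interpolated VGPR."""
    out = []
    prev = None
    for ins in instrs:
        hazard = (
            prev is not None
            and prev.opcode.startswith("v_interp_")
            and ins.opcode in READLANE_OPS
            and prev.dst_vgprs & ins.src_vgprs
        )
        if hazard:
            out.append(Instr("s_nop"))  # assumed to give one wait state
        out.append(ins)
        prev = ins
    return out
```

A `v_interp_p1_f32` writing `v0` followed directly by a `v_readfirstlane_b32` reading `v0` gets an `s_nop` placed between them, while a read of a different VGPR is left untouched.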

## RDNA / GFX10 hazards

### SMEM store followed by a load with the same address

We found that an `s_buffer_load` will produce incorrect results if it is preceded
by an `s_buffer_store` with the same address. Inserting an `s_nop` between them
does not mitigate the issue, so an `s_waitcnt lgkmcnt(0)` must be inserted.
This is not mentioned by LLVM among the other GFX10 bugs, but LLVM doesn't use
SMEM stores, so it's not surprising that they didn't notice it.

### VMEMtoScalarWriteHazard

Triggered by:
VMEM/FLAT/GLOBAL/SCRATCH/DS instruction reads an SGPR (or EXEC, or M0).
Then, a SALU/SMEM instruction writes the same SGPR.

Mitigated by:
A VALU instruction or an `s_waitcnt` between the two instructions.

### SMEMtoVectorWriteHazard

Triggered by:
An SMEM instruction reads an SGPR. Then, a VALU instruction writes that same SGPR.

Mitigated by:
Any non-SOPP SALU instruction (except `s_setvskip`, `s_version`, and any non-lgkmcnt `s_waitcnt`).

### Offset3fBug

Any branch that is located at offset 0x3f will be buggy. Just insert some NOPs to make sure no branch
is located at this offset.

### InstFwdPrefetchBug

According to LLVM, the `s_inst_prefetch` instruction can cause a hang.
There are no further details.

### LdsMisalignedBug

When there is a misaligned multi-dword FLAT load/store instruction in WGP mode,
it needs to be split into multiple single-dword FLAT instructions.

ACO doesn't use FLAT load/store on GFX10, so is unaffected.

### FlatSegmentOffsetBug

The 12-bit immediate OFFSET field of FLAT instructions must always be 0.
GLOBAL and SCRATCH are unaffected.

ACO doesn't use FLAT load/store on GFX10, so is unaffected.

### VcmpxPermlaneHazard

Triggered by:
Any permlane instruction that follows any VOPC instruction.
Confirmed by AMD devs that despite the name, this doesn't only affect `v_cmpx`.

Mitigated by: any VALU instruction except `v_nop`.

### VcmpxExecWARHazard

Triggered by:
Any non-VALU instruction reads the EXEC mask. Then, any VALU instruction writes the EXEC mask.

Mitigated by:
A VALU instruction that writes an SGPR (or has a valid SDST operand), or `s_waitcnt_depctr 0xfffe`.
Note: `s_waitcnt_depctr` is an internal instruction, so there is no further information
about what it does or what its operand means.

### LdsBranchVmemWARHazard

Triggered by:
VMEM/GLOBAL/SCRATCH instruction, then a branch, then a DS instruction,
or vice versa: DS instruction, then a branch, then a VMEM/GLOBAL/SCRATCH instruction.

Mitigated by:
Only `s_waitcnt_vscnt null, 0`. Needed even if the first instruction is a load.

### NSAClauseBug

"MIMG-NSA in a hard clause has unpredictable results on GFX10.1"

### NSAMaxSize5

NSA MIMG instructions should be limited to 3 dwords before GFX10.3 to avoid
stability issues: https://reviews.llvm.org/D103348