# Unofficial GCN/RDNA ISA reference errata

## `v_sad_u32`

The Vega ISA reference writes its behaviour as:

```
D.u = abs(S0.i - S1.i) + S2.u.
```

This is incorrect. The actual behaviour is what is written in the GCN3 reference
guide:

```
ABS_DIFF (A,B) = (A>B) ? (A-B) : (B-A)
D.u = ABS_DIFF (S0.u,S1.u) + S2.u
```

The instruction doesn't subtract S0 and S1 and take the absolute value (the
_signed_ distance); it uses the _unsigned_ distance between the operands. So
`v_sad_u32(-5, 0, 0)` would return `4294967291` (`-5` interpreted as unsigned),
not `5`.

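
The difference between the two descriptions can be checked with a small software model. This is only a sketch: the function names are made up for illustration, and the wrap-around of the final sum to 32 bits is an assumption, not something the references above state.

```python
MASK32 = 0xFFFFFFFF


def v_sad_u32(s0, s1, s2):
    """Model of the GCN3 description: unsigned absolute difference plus S2."""
    # Reinterpret all operands as 32-bit unsigned values.
    a, b, c = s0 & MASK32, s1 & MASK32, s2 & MASK32
    abs_diff = a - b if a > b else b - a
    # Assumption: the sum wraps around at 32 bits.
    return (abs_diff + c) & MASK32


def v_sad_u32_vega_doc(s0, s1, s2):
    """Model of the (incorrect) Vega pseudocode, which treats S0/S1 as signed."""
    def to_signed(x):
        x &= MASK32
        return x - (1 << 32) if x & 0x80000000 else x

    return (abs(to_signed(s0) - to_signed(s1)) + (s2 & MASK32)) & MASK32


print(v_sad_u32(-5, 0, 0))           # 4294967291
print(v_sad_u32_vega_doc(-5, 0, 0))  # 5
```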
## `s_bfe_*`

The RDNA, Vega and GCN3 ISA references all state that these instructions don't
write SCC. They do.

## `v_bcnt_u32_b32`

The Vega ISA reference writes its behaviour as:

```
D.u = 0;
for i in 0 ... 31 do
D.u += (S0.u[i] == 1 ? 1 : 0);
endfor.
```

This is incorrect. The actual behaviour (and number of operands) is what
is written in the GCN3 reference guide:

```
D.u = CountOneBits(S0.u) + S1.u.
```

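
A minimal software model of the GCN3 description, for illustration only. The function name mirrors the instruction, and the 32-bit wrap-around of the sum is an assumption.

```python
MASK32 = 0xFFFFFFFF


def v_bcnt_u32_b32(s0, s1):
    """Count the set bits of S0 and add S1 (two operands, per the GCN3 guide)."""
    popcount = bin(s0 & MASK32).count("1")
    # Assumption: the result wraps around at 32 bits.
    return (popcount + (s1 & MASK32)) & MASK32


print(v_bcnt_u32_b32(0xF0F0, 0))       # 8
print(v_bcnt_u32_b32(0xFFFFFFFF, 10))  # 42
```

With the second operand, the result of one bit count can be fed into the next to accumulate a total over several dwords.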
## `v_alignbyte_b32`

All versions of the ISA document are vague about it, but after some trial and
error we discovered that only 2 bits of the 3rd operand are used.
Therefore, this instruction can't shift more than 24 bits.

The correct description of `v_alignbyte_b32` is probably the following:

```
D.u = ({S0, S1} >> (8 * S2.u[1:0])) & 0xffffffff
```

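
The formula above can be sketched in software as follows. This models the "probable" description only and assumes that `{S0, S1}` places S0 in the high dword; it has not been validated against hardware here.

```python
MASK32 = 0xFFFFFFFF


def v_alignbyte_b32(s0, s1, s2):
    """Shift the 64-bit value {S0, S1} right by 0 to 3 bytes, keep the low dword."""
    combined = ((s0 & MASK32) << 32) | (s1 & MASK32)  # assumption: S0 is the high dword
    shift = 8 * (s2 & 0x3)                            # only S2[1:0] is used
    return (combined >> shift) & MASK32


print(hex(v_alignbyte_b32(0x11223344, 0xAABBCCDD, 1)))  # 0x44aabbcc
print(hex(v_alignbyte_b32(0x11223344, 0xAABBCCDD, 3)))  # 0x223344aa
print(hex(v_alignbyte_b32(0x11223344, 0xAABBCCDD, 4)))  # 0xaabbccdd (bit 2 is ignored)
```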
## SMEM stores

The Vega ISA reference doesn't say this (or doesn't make it clear), but
the offset for SMEM stores must be in m0 if IMM == 0.

The RDNA ISA doc doesn't mention SMEM stores at all, but they seem to be supported
by the chip and are present in LLVM. However, AMD devs highly recommend avoiding
these instructions.

## SMEM atomics

As with SMEM stores, the RDNA ISA doc pretends they don't exist, but they
are present in LLVM.

## VMEM stores

All reference guides say (under "Vector Memory Instruction Data Dependencies"):

> When a VM instruction is issued, the address is immediately read out of VGPRs
> and sent to the texture cache. Any texture or buffer resources and samplers
> are also sent immediately. However, write-data is not immediately sent to the
> texture cache.

Reading that, one might think that waitcnts need to be added when writing to
the registers used for a VMEM store's data. Experimentation has shown that this
does not seem to be the case on GFX8 and GFX9 (GFX6 and GFX7 are untested). Such
a requirement also seems unlikely, since NOPs are apparently needed in a subset
of these situations.

## MIMG opcodes on GFX8/GCN3

The `image_atomic_{swap,cmpswap,add,sub}` opcodes in the GCN3 ISA reference
guide are incorrect. The Vega ISA reference guide has the correct ones.

## VINTRP encoding

The Vega ISA doc says the encoding should be `110010`, but `110101` works.

## VOP1 instructions encoded as VOP3

The RDNA ISA doc says that `0x140` should be added to the opcode, but that doesn't
work. What works is adding `0x180`, which LLVM also does.

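
As a sketch, the opcode translation amounts to the following. The helper name and the range check (an 8-bit VOP1 opcode field) are assumptions made for illustration, and the opcode `0x01` is just an example input.

```python
VOP1_AS_VOP3_OFFSET = 0x180  # the value that works; the RDNA ISA doc says 0x140


def vop1_as_vop3_opcode(vop1_opcode):
    """Return the VOP3 opcode for a VOP1 instruction (hypothetical helper)."""
    # Assumption: VOP1 opcodes fit in 8 bits.
    if not 0 <= vop1_opcode <= 0xFF:
        raise ValueError("VOP1 opcode out of range")
    return vop1_opcode + VOP1_AS_VOP3_OFFSET


print(hex(vop1_as_vop3_opcode(0x01)))  # 0x181
```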
## FLAT, Scratch, Global instructions

The NV bit was removed in RDNA, but some parts of the doc still mention it.

Section 13.8.1 of the RDNA ISA doc says that SADDR should be set to 0x7f when
ADDR is used, but section 9.3.1 says it should be set to NULL. We assume 9.3.1
is correct and set it to SGPR_NULL.

## Legacy instructions

Some instructions have a `_LEGACY` variant which implements "DX9 rules", in which
the zero "wins" in multiplications, i.e. `0.0*x` is always `0.0`. The Vega ISA doc
mentions `V_MAC_LEGACY_F32`, but this instruction doesn't actually exist on Vega.

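
The "zero wins" rule matters when the other operand is infinity or NaN. The sketch below contrasts it with an ordinary IEEE multiply. The function names are hypothetical, Python floats are 64-bit rather than 32-bit, and returning positive zero regardless of operand signs is an assumption.

```python
import math


def mul_ieee(a, b):
    """Ordinary IEEE multiplication: 0.0 * inf and 0.0 * nan give nan."""
    return a * b


def mul_legacy(a, b):
    """DX9-style multiplication: if either operand is zero, the result is zero."""
    if a == 0.0 or b == 0.0:
        return 0.0  # assumption: always +0.0, sign handling is not modelled
    return a * b


print(mul_ieee(0.0, math.inf))    # nan
print(mul_legacy(0.0, math.inf))  # 0.0
print(mul_legacy(math.nan, 0.0))  # 0.0
```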
## `m0` with LDS instructions on Vega and newer

The Vega ISA doc (both the old one and the "7nm" one) claims that LDS instructions
use the `m0` register for address clamping like older GPUs, but this is not the case.

In reality, only the `_addtid` variants of LDS instructions use `m0` on Vega and
newer GPUs, so the relevant section of the RDNA ISA doc seems to apply.
LLVM also doesn't emit any initialization of `m0` for LDS instructions, and this
was also confirmed by AMD devs.

## RDNA L0, L1 cache and DLC, GLC bits

The old L1 cache was renamed to L0, and a new L1 cache was added to RDNA. There
is one L1 cache per shader array. Some instruction encodings have DLC and
GLC bits that interact with the cache.

* DLC ("device level coherent") bit: controls the L1 cache
* GLC ("globally coherent") bit: controls the L0 cache

The recommendation from AMD devs is to always set these two bits at the same time,
as it doesn't make much sense to set them independently, aside from some
circumstances (e.g. we needn't set DLC when only one shader array is used).

Stores and atomics always bypass the L1 cache, so they don't support the DLC bit,
and it shouldn't be set in these cases. Setting the DLC bit in these cases can
result in graphical glitches or hangs.

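
The rules above can be condensed into a small decision helper. This is a hypothetical sketch: `MemOp` and `dlc_bit` are made-up names, and the single-shader-array exception is not modelled.

```python
from enum import Enum


class MemOp(Enum):
    LOAD = "load"
    STORE = "store"
    ATOMIC = "atomic"


def dlc_bit(op, glc):
    """Decide the DLC bit for an access whose GLC bit has already been chosen.

    Loads: DLC follows GLC, per the recommendation to set both together.
    Stores and atomics: they bypass L1, so DLC must never be set.
    """
    return glc and op is MemOp.LOAD


print(dlc_bit(MemOp.LOAD, True))    # True
print(dlc_bit(MemOp.STORE, True))   # False
print(dlc_bit(MemOp.ATOMIC, True))  # False
```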
## RDNA `s_dcache_wb`

`s_dcache_wb` is not mentioned in the RDNA ISA doc, but it is needed in order
to achieve correct behavior in some SSBO CTS tests.

## RDNA subvector mode

The documentation of `s_subvector_loop_begin` and `s_subvector_loop_end` is not clear
on what sort of addressing should be used, but it says that it
"is equivalent to an `S_CBRANCH` with extra math", so the subvector loop handling
in ACO is done according to the `s_cbranch` doc.

## RDNA early rasterization

The ISA documentation says about `s_endpgm`:

> The hardware implicitly executes S_WAITCNT 0 and S_WAITCNT_VSCNT 0
> before executing this instruction.

What the doc doesn't say is that in case of NGG (and legacy VS) when there
are no param exports, the driver sets `NO_PC_EXPORT=1` for optimal performance,
and when this is set, the hardware will start clipping and rasterization
as soon as it encounters a position export with `DONE=1`, without waiting
for the NGG (or VS) to finish.

It can even launch PS waves before NGG (or VS) ends.

When this happens, any store performed by a VS is not guaranteed
to be complete when PS tries to load it, so we need to manually
insert wait instructions before the position exports.

## A16 and G16

On GFX9, the A16 field enables both 16-bit addresses and derivatives.
On GFX10+ these are fully independent of each other: A16 controls 16-bit addresses
and G16 opcodes control 16-bit derivatives. A16 without G16 uses 32-bit derivatives.

# Hardware Bugs

## SMEM corrupts VCCZ on SI/CI

[See this LLVM source.](https://github.com/llvm/llvm-project/blob/acb089e12ae48b82c0b05c42326196a030df9b82/llvm/lib/Target/AMDGPU/SIInsertWaits.cpp#L580-L616)

After issuing an SMEM instruction, we need to wait for it to finish and then
write to vcc (for example, `s_mov_b64 vcc, vcc`) to correct vccz.

Currently, we don't do this.

## SGPR offset on MUBUF prevents addr clamping on SI/CI

[See this LLVM source.](https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp#L1917-L1922)

This leads to wrong bounds checking; using a VGPR offset instead fixes it.

## GCN / GFX6 hazards

### VINTRP followed by a read with `v_readfirstlane` or `v_readlane`

To fix GPU hangs on GFX6, 1 wait state must be inserted if the dst VGPR of any
`v_interp_*` is subsequently read by `v_readfirstlane` or `v_readlane`.
Note that `v_writelane_*` is apparently not affected. This hazard isn't
documented anywhere, but AMD confirmed it.

## RDNA / GFX10 hazards

### SMEM store followed by a load with the same address

We found that an `s_buffer_load` will produce incorrect results if it is preceded
by an `s_buffer_store` with the same address. Inserting an `s_nop` between them
does not mitigate the issue, so an `s_waitcnt lgkmcnt(0)` must be inserted.
This is not mentioned by LLVM among the other GFX10 bugs, but LLVM doesn't use
SMEM stores, so it's not surprising that they didn't notice it.

### VMEMtoScalarWriteHazard

Triggered by:
VMEM/FLAT/GLOBAL/SCRATCH/DS instruction reads an SGPR (or EXEC, or M0).
Then, a SALU/SMEM instruction writes the same SGPR.

Mitigated by:
A VALU instruction or an `s_waitcnt` between the two instructions.

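
A simplified detector for the VMEMtoScalarWriteHazard pattern might look like the sketch below. It is not how ACO or LLVM implement the workaround: the `Instr` record, the instruction-kind strings and the register names are all hypothetical, and it treats any `s_waitcnt` as a mitigation, as the description above does.

```python
from dataclasses import dataclass

# Instruction kinds whose SGPR reads can trigger the hazard.
VMEM_LIKE = {"VMEM", "FLAT", "GLOBAL", "SCRATCH", "DS"}


@dataclass
class Instr:
    kind: str                             # e.g. "VMEM", "SALU", "SMEM", "VALU", "WAITCNT"
    sgpr_reads: frozenset = frozenset()   # SGPRs (or "exec", "m0") read
    sgpr_writes: frozenset = frozenset()  # SGPRs written


def find_vmem_to_scalar_write_hazards(instrs):
    """Return the indices of SALU/SMEM instructions that hit the hazard."""
    pending = set()  # SGPRs read by VMEM-like instructions, not yet mitigated
    hazards = []
    for index, ins in enumerate(instrs):
        if ins.kind in ("VALU", "WAITCNT"):
            pending.clear()  # either one mitigates the hazard
        elif ins.kind in ("SALU", "SMEM") and pending & ins.sgpr_writes:
            hazards.append(index)
        if ins.kind in VMEM_LIKE:
            pending |= ins.sgpr_reads
    return hazards


load = Instr("VMEM", sgpr_reads=frozenset({"s4"}))
write = Instr("SALU", sgpr_writes=frozenset({"s4"}))
print(find_vmem_to_scalar_write_hazards([load, write]))                 # [1]
print(find_vmem_to_scalar_write_hazards([load, Instr("VALU"), write]))  # []
```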
### SMEMtoVectorWriteHazard

Triggered by:
An SMEM instruction reads an SGPR. Then, a VALU instruction writes that same SGPR.

Mitigated by:
Any non-SOPP SALU instruction (except `s_setvskip`, `s_version`, and any non-lgkmcnt `s_waitcnt`).

### Offset3fBug

Any branch that is located at offset 0x3f will be buggy. Just insert some NOPs to make sure no branch
is located at this offset.

### InstFwdPrefetchBug

According to LLVM, the `s_inst_prefetch` instruction can cause a hang.
There are no further details.

### LdsMisalignedBug

When there is a misaligned multi-dword FLAT load/store instruction in WGP mode,
it needs to be split into multiple single-dword FLAT instructions.

ACO doesn't use FLAT load/store on GFX10, so it is unaffected.

### FlatSegmentOffsetBug

The 12-bit immediate OFFSET field of FLAT instructions must always be 0.
GLOBAL and SCRATCH are unaffected.

ACO doesn't use FLAT load/store on GFX10, so it is unaffected.

### VcmpxPermlaneHazard

Triggered by:
Any permlane instruction that follows any VOPC instruction.
AMD devs confirmed that, despite the name, this doesn't only affect `v_cmpx`.

Mitigated by: any VALU instruction except `v_nop`.

### VcmpxExecWARHazard

Triggered by:
Any non-VALU instruction reads the EXEC mask. Then, any VALU instruction writes the EXEC mask.

Mitigated by:
A VALU instruction that writes an SGPR (or has a valid SDST operand), or `s_waitcnt_depctr 0xfffe`.
Note: `s_waitcnt_depctr` is an internal instruction, so there is no further information
about what it does or what its operand means.

### LdsBranchVmemWARHazard

Triggered by:
VMEM/GLOBAL/SCRATCH instruction, then a branch, then a DS instruction,
or vice versa: DS instruction, then a branch, then a VMEM/GLOBAL/SCRATCH instruction.

Mitigated by:
Only `s_waitcnt_vscnt null, 0`. Needed even if the first instruction is a load.

### NSAClauseBug

"MIMG-NSA in a hard clause has unpredictable results on GFX10.1"

### NSAMaxSize5

NSA MIMG instructions should be limited to 3 dwords before GFX10.3 to avoid
stability issues: https://reviews.llvm.org/D103348