IR3 NOTES
=========

Some notes about ir3, the compiler and machine-specific IR for the shader ISA introduced with adreno a3xx. The same shader ISA is present, with some small differences, in adreno a4xx.

Compared to the previous generation a2xx ISA (ir2), the a3xx ISA is a "simple" scalar instruction set. However, the compiler is responsible, in most cases, for scheduling the instructions. The hardware does not try to hide the shader core pipeline stages. For example, a common (cat2) ALU instruction takes four cycles, so a subsequent cat2 instruction which uses the result must have three intervening instructions (or NOPs). When operating on vec4's, the corresponding scalar instructions for the remaining three components can typically fill those slots. But that leaves a lot of edge cases where things fall over, like:

::

  ADD TEMP[0], TEMP[1], TEMP[2]
  MUL TEMP[0], TEMP[1], TEMP[0].wzyx

Here, the second instruction needs the output of the first group of scalar instructions in the wrong order, resulting in not enough instruction slots between the ``add r0.w, r1.w, r2.w`` and ``mul r0.x, r1.x, r0.w``. This is why the original (old) compiler, which merely translated nearly literally from TGSI to ir3, had a strong tendency to fall over.

So the current compiler instead, in the frontend, generates a directed-acyclic-graph of instructions and basic blocks, which go through various additional passes to eventually schedule and do register assignment.
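
As a rough illustration of the delay-slot arithmetic described above, the following sketch counts how many ``nop``\s would be needed between a producer and a consumer. The function name ``nops_needed`` and the notion of an "issue slot" index are assumptions made for this example only; they are not part of ir3.

.. code-block:: c

  #include <assert.h>

  /* Hypothetical helper, not an ir3 function.
   *
   * producer_ip: issue slot of the instruction that writes the value
   * consumer_ip: issue slot the consumer would get with no padding
   * delay:       required number of intervening instructions
   *              (3 for cat2 -> cat2, per the text above)
   *
   * Returns the number of nops to insert before the consumer. */
  unsigned nops_needed(unsigned producer_ip, unsigned consumer_ip,
                       unsigned delay)
  {
    assert(consumer_ip > producer_ip);
    /* instructions already sitting between producer and consumer */
    unsigned between = consumer_ip - producer_ip - 1;
    return between >= delay ? 0 : delay - between;
  }

With the four scalar ``add``\s in slots 0 to 3 and the first ``mul`` in slot 4, a ``mul`` reading ``r0.x`` (written in slot 0) needs no padding, while the swizzled ``mul`` reading ``r0.w`` (written in slot 3) needs three ``nop``\s.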

For additional documentation about the hardware, see wiki: `a3xx ISA
<https://github.com/freedreno/freedreno/wiki/A3xx-shader-instruction-set-architecture>`_.

External Structure
------------------

``ir3_shader``
  A single vertex/fragment/etc shader from gallium perspective (i.e.
  maps to a single TGSI shader), and manages a set of shader variants
  which are generated on demand based on the shader key.

``ir3_shader_key``
  The configuration key that identifies a shader variant. I.e. based
  on other GL state (two-sided-color, render-to-alpha, etc) or render
  stages (binning-pass vertex shader) different shader variants are
  generated.

``ir3_shader_variant``
  The actual hw shader generated based on input TGSI and shader key.

``ir3_compiler``
  Compiler frontend which generates ir3 and runs the various backend
  stages to schedule and do register assignment.

The IR
------

The ir3 IR maps quite directly to the hardware, in that instruction opcodes map directly to hardware opcodes, and that dst/src register(s) map directly to the hardware dst/src register(s). But there are a few extensions, in the form of meta_ instructions.
And additionally, for normal (non-const, etc) src registers, the ``IR3_REG_SSA`` flag is set and ``reg->instr`` points to the source instruction which produced that value. So, for example, the following TGSI shader:

::

  VERT
  DCL IN[0]
  DCL IN[1]
  DCL OUT[0], POSITION
  DCL TEMP[0], LOCAL
    1: DP3 TEMP[0].x, IN[0].xyzz, IN[1].xyzz
    2: MOV OUT[0], TEMP[0].xxxx
    3: END

eventually generates:

.. graphviz::

  digraph G {
  rankdir=RL;
  nodesep=0.25;
  ranksep=1.5;
  subgraph clusterdce198 {
    label="vert";
    inputdce198 [shape=record,label="inputs|<in0> i0.x|<in1> i0.y|<in2> i0.z|<in4> i1.x|<in5> i1.y|<in6> i1.z"];
    instrdcf348 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
    instrdcedd0 [shape=record,style=filled,fillcolor=lightgrey,label="{mad.f32|<dst0>|<src0> |<src1> |<src2> }"];
    inputdce198:<in2>:w -> instrdcedd0:<src0>
    inputdce198:<in6>:w -> instrdcedd0:<src1>
    instrdcec30 [shape=record,style=filled,fillcolor=lightgrey,label="{mad.f32|<dst0>|<src0> |<src1> |<src2> }"];
    inputdce198:<in1>:w -> instrdcec30:<src0>
    inputdce198:<in5>:w -> instrdcec30:<src1>
    instrdceb60 [shape=record,style=filled,fillcolor=lightgrey,label="{mul.f|<dst0>|<src0> |<src1> }"];
    inputdce198:<in0>:w -> instrdceb60:<src0>
    inputdce198:<in4>:w -> instrdceb60:<src1>
    instrdceb60:<dst0> -> instrdcec30:<src2>
    instrdcec30:<dst0> -> instrdcedd0:<src2>
    instrdcedd0:<dst0> -> instrdcf348:<src0>
    instrdcf400 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
    instrdcedd0:<dst0> -> instrdcf400:<src0>
    instrdcf4b8 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
    instrdcedd0:<dst0> -> instrdcf4b8:<src0>
    outputdce198 [shape=record,label="outputs|<out0> o0.x|<out1> o0.y|<out2> o0.z|<out3> o0.w"];
    instrdcf348:<dst0> -> outputdce198:<out0>:e
    instrdcf400:<dst0> -> outputdce198:<out1>:e
    instrdcf4b8:<dst0> -> outputdce198:<out2>:e
    instrdcedd0:<dst0> -> outputdce198:<out3>:e
  }
  }

(after scheduling, etc, but before register assignment).

Internal Structure
~~~~~~~~~~~~~~~~~~

``ir3_block``
  Represents a basic block.

  TODO: currently blocks are nested, but I think I need to change that
  to a more conventional arrangement before implementing proper flow
  control. Currently the only flow control handled is if/else, which
  gets flattened out and results chosen with ``sel`` instructions.

``ir3_instruction``
  Represents a machine instruction or meta_ instruction. Has pointers
  to dst register (``regs[0]``) and src register(s) (``regs[1..n]``),
  as needed.

``ir3_register``
  Represents a src or dst register, flags indicate const/relative/etc.
  If ``IR3_REG_SSA`` is set on a src register, the actual register
  number (name) has not been assigned yet, and instead the ``instr``
  field points to the src instruction.

In addition there are various util macros/functions to simplify manipulation/traversal of the graph:

``foreach_src(srcreg, instr)``
  Iterate each instruction's source ``ir3_register``\s

``foreach_src_n(srcreg, n, instr)``
  Like ``foreach_src``, also setting ``n`` to the source number (starting
  with ``0``).

``foreach_ssa_src(srcinstr, instr)``
  Iterate each instruction's SSA source ``ir3_instruction``\s. This skips
  non-SSA sources (consts, etc), but includes virtual sources (such as the
  address register if `relative addressing`_ is used).

``foreach_ssa_src_n(srcinstr, n, instr)``
  Like ``foreach_ssa_src``, also setting ``n`` to the source number.

For example:

.. code-block:: c

  /* take the largest delay required by any SSA source of instr */
  foreach_ssa_src_n(src, i, instr) {
    unsigned d = delay_calc_srcn(ctx, src, instr, i);
    delay = MAX2(delay, d);
  }


TODO probably other helper/util stuff worth mentioning here

.. _meta:

Meta Instructions
~~~~~~~~~~~~~~~~~

**input**
  Used for shader inputs (registers configured in the command-stream
  to hold particular input values, written by the shader core before
  start of execution). Also used for connecting up values within a
  basic block to an output of a previous block.

**output**
  Used to hold outputs of a basic block.

**flow**
  TODO

**phi**
  TODO

**collect**
  Groups registers which need to be assigned to consecutive scalar
  registers, for example ``sam`` (texture fetch) src instructions (see
  `register groups`_) or array element dereference
  (see `relative addressing`_).

**split**
  The counterpart to **collect**, when an instruction such as ``sam``
  writes multiple components, splits the result into individual
  scalar components to be consumed by other instructions.


.. _`flow control`:

Flow Control
~~~~~~~~~~~~

TODO


.. _`register groups`:

Register Groups
~~~~~~~~~~~~~~~

Certain instructions, such as texture sample instructions, consume multiple consecutive scalar registers via a single src register encoded in the instruction, and/or write multiple consecutive scalar registers. In the simplest example:

::

  sam (f32)(xyz)r2.x, r0.z, s#0, t#0

for a 2d texture, would read ``r0.zw`` to get the coordinate, and write ``r2.xyz``.

Before register assignment, to group the two components of the texture src together:

.. graphviz::

  digraph G {
    { rank=same;
      collect;
    };
    { rank=same;
      coord_x;
      coord_y;
    };
    sam -> collect [label="regs[1]"];
    collect -> coord_x [label="regs[1]"];
    collect -> coord_y [label="regs[2]"];
    coord_x -> coord_y [label="right",style=dotted];
    coord_y -> coord_x [label="left",style=dotted];
    coord_x [label="coord.x"];
    coord_y [label="coord.y"];
  }

The frontend sets up the SSA ptrs from ``sam`` source register to the ``collect`` meta instruction, which in turn points to the instructions producing the ``coord.x`` and ``coord.y`` values. And the grouping_ pass sets up the ``left`` and ``right`` neighbor pointers to the ``collect``\'s sources, used later by the `register assignment`_ pass to assign blocks of scalar registers.

And likewise, for the consecutive scalar registers for the destination:

.. graphviz::

  digraph {
    { rank=same;
      A;
      B;
      C;
    };
    { rank=same;
      split_0;
      split_1;
      split_2;
    };
    A -> split_0;
    B -> split_1;
    C -> split_2;
    split_0 [label="split\noff=0"];
    split_0 -> sam;
    split_1 [label="split\noff=1"];
    split_1 -> sam;
    split_2 [label="split\noff=2"];
    split_2 -> sam;
    split_0 -> split_1 [label="right",style=dotted];
    split_1 -> split_0 [label="left",style=dotted];
    split_1 -> split_2 [label="right",style=dotted];
    split_2 -> split_1 [label="left",style=dotted];
    sam;
  }

.. _`relative addressing`:

Relative Addressing
~~~~~~~~~~~~~~~~~~~

Most instructions support addressing indirectly (relative to address register) into const or gpr register file in some or all of their src/dst registers. In this case the register accessed is taken from ``r<a0.x + n>`` or ``c<a0.x + n>``, i.e. address register (``a0.x``) value plus ``n``, where ``n`` is encoded in the instruction (rather than the absolute register number).
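
To make the ``<a0.x + n>`` notation concrete, here is a toy model of a relative read. It treats a register file as a flat array of scalar slots; the function name ``toy_read_relative``, the signed type chosen for the address register value, and the bounds check are all assumptions of this sketch and say nothing about how the hardware behaves on out-of-range accesses.

.. code-block:: c

  #include <assert.h>
  #include <stddef.h>

  /* Toy model only, not ir3 code.
   *
   * file:      scalar slots of a const or gpr file
   * file_size: number of slots in file
   * a0_x:      run-time value of the address register
   * n:         offset encoded in the instruction
   *
   * Returns the value in slot <a0.x + n>. */
  unsigned toy_read_relative(const unsigned *file, size_t file_size,
                             int a0_x, unsigned n)
  {
    long idx = (long)a0_x + (long)n;
    assert(idx >= 0 && (size_t)idx < file_size);
    return file[idx];
  }

The point of the model is that the slot actually read depends on a value only known at run time, which is why the compiler must treat the address register write as a dependency of the access.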

  Note that cat5 (texture sample) instructions are the notable exception, not
  supporting relative addressing of src or dst.

Relative addressing of the const file (for example, a uniform array) is relatively simple. We don't do register assignment of the const file, so all that is required is to schedule things properly. I.e. the instruction that writes the address register must be scheduled first, and we cannot have two different address register values live at one time.

But relative addressing of gpr file (which can be as src or dst) has additional restrictions on register assignment (i.e. the array elements must be assigned to consecutive scalar registers). And in the case of relative dst, subsequent instructions now depend on both the relative write, as well as the previous instruction which wrote that register, since we do not know at compile time which actual register was written.

Each instruction has an optional ``address`` pointer, to capture the dependency on the address register value when relative addressing is used for any of the src/dst register(s). This behaves as an additional virtual src register, i.e. ``foreach_ssa_src()`` will also iterate the address register (last).

  Note that ``nop``\'s for timing constraints, type specifiers (i.e.
  ``add.f`` vs ``add.u``), etc, are omitted for brevity in examples.

::

  mova a0.x, hr1.y
  sub r1.y, r2.x, r3.x
  add r0.x, r1.y, c<a0.x + 2>

results in:

.. graphviz::

  digraph {
    rankdir=LR;
    sub;
    const [label="const file"];
    add;
    mova;
    add -> mova;
    add -> sub;
    add -> const [label="off=2"];
  }

The scheduling pass has some smarts to schedule things such that only a single ``a0.x`` value is used at any one time.

To implement variable arrays, the NIR registers are stored as an ``ir3_array``,
which will be register allocated to consecutive hardware registers. The array
access uses the id field in the ``ir3_register`` to map to the array being
accessed, and the offset field for the fixed offset within the array. A NIR
indirect register read such as:

::

  decl_reg vec2 32 r0[2]
  ...
  vec2 32 ssa_19 = mov r0[0 + ssa_9]


results in:

::

  0000:0000:001: shl.b hssa_19, hssa_17, himm[0.000000,1,0x1]
  0000:0000:002: mov.s16s16 hr61.x, hssa_19
  0000:0000:002: mov.u32u32 ssa_21, arr[id=1, offset=0, size=4, ssa_12], address=_[0000:0000:002: mov.s16s16]
  0000:0000:002: mov.u32u32 ssa_22, arr[id=1, offset=1, size=4, ssa_12], address=_[0000:0000:002: mov.s16s16]


Array writes write to the array in ``instr->regs[0]->array.id``. A NIR indirect
register write such as:

::

  decl_reg vec2 32 r0[2]
  ...
  r0[0 + ssa_12] = mov ssa_13

results in:

::

  0000:0000:001: shl.b hssa_29, hssa_27, himm[0.000000,1,0x1]
  0000:0000:002: mov.s16s16 hr61.x, hssa_29
  0000:0000:001: mov.u32u32 arr[id=1, offset=0, size=4, ssa_17], c2.y, address=_[0000:0000:002: mov.s16s16]
  0000:0000:004: mov.u32u32 arr[id=1, offset=1, size=4, ssa_31], c2.z, address=_[0000:0000:002: mov.s16s16]

Note that only cat1 (mov) can do indirect write, and thus NIR register stores
may need to introduce an extra mov.

ir3 array accesses in the DAG get serialized by ``instr->barrier_class``
containing ``IR3_BARRIER_ARRAY_W`` or ``IR3_BARRIER_ARRAY_R``.

Shader Passes
-------------

After the frontend has generated the use-def graph of instructions, they are run through various passes which include scheduling_ and `register assignment`_. Because inserting ``mov`` instructions after scheduling would also require inserting additional ``nop`` instructions (since it is too late to reschedule to try and fill the bubbles), the earlier stages try to ensure (at least given an infinite supply of registers) that `register assignment`_ after scheduling_ cannot fail.

  Note that we essentially have ~256 scalar registers in the
  architecture (although larger register usage will at some thresholds
  limit the number of threads which can run in parallel). And at some
  point we will have to deal with spilling.

.. _flatten:

Flatten
~~~~~~~

In this stage, simple if/else blocks are flattened into a single block with ``phi`` nodes converted into ``sel`` instructions. The a3xx ISA has very few predicated instructions, and we would prefer not to use branches for simple if/else.


.. _`copy propagation`:

Copy Propagation
~~~~~~~~~~~~~~~~

Currently the frontend inserts ``mov``\s in various cases, because certain categories of instructions have limitations about const regs as sources. And the CP pass simply removes all simple ``mov``\s (i.e. src-type is same as dst-type, no abs/neg flags, etc).

The eventual plan is to invert that, with the front-end inserting no ``mov``\s and CP legalizing things.


.. _grouping:

Grouping
~~~~~~~~

In the grouping pass, instructions which need to be grouped (for ``collect``\s, etc) have their ``left`` / ``right`` neighbor pointers set up. In cases where there is a conflict (i.e. one instruction cannot have two unique left or right neighbors), an additional ``mov`` instruction is inserted. This ensures that there is some possible valid `register assignment`_ at the later stages.


.. _depth:

Depth
~~~~~

In the depth pass, a depth is calculated for each instruction node within its basic block. The depth is the sum of the required cycles (delay slots needed between two instructions plus one) of each instruction plus the max depth of any of its source instructions. (meta_ instructions don't add to the depth). As an instruction's depth is calculated, it is inserted into a per block list sorted by deepest instruction. Unreachable instructions and inputs are marked.
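
The depth rule above can be sketched on a toy graph as follows. This is only an illustration under stated assumptions: ``struct toy_instr``, ``toy_depth`` and the ``cycles`` field are hypothetical names for this example, the per-block sorted list is left out, and meta instructions are modelled simply as nodes with ``cycles == 0``.

.. code-block:: c

  #include <assert.h>
  #include <stddef.h>

  #define TOY_MAX_SRCS 3

  /* Hypothetical stand-in for an instruction node, not ir3_instruction. */
  struct toy_instr {
    unsigned cycles;                      /* delay slots + 1; 0 for meta */
    unsigned depth;                       /* valid once visited is set */
    int visited;
    size_t nsrcs;
    struct toy_instr *srcs[TOY_MAX_SRCS]; /* SSA sources */
  };

  /* depth = own cycles + max depth over all sources (0 if no sources) */
  unsigned toy_depth(struct toy_instr *instr)
  {
    if (instr->visited)
      return instr->depth;

    unsigned max_src = 0;
    for (size_t i = 0; i < instr->nsrcs; i++) {
      unsigned d = toy_depth(instr->srcs[i]);
      if (d > max_src)
        max_src = d;
    }

    instr->depth = instr->cycles + max_src;
    instr->visited = 1;
    return instr->depth;
  }

For a ``mul`` feeding a ``mad``, each costing four cycles and both reading a meta input, this gives the ``mul`` a depth of 4 and the ``mad`` a depth of 8, so the ``mad`` would sit ahead of the ``mul`` in a deepest-first list.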

  TODO: we should probably calculate both hard and soft depths (?) to
  try to coax additional instructions to fit in places where we need
  to use sync bits, such as after a texture fetch or SFU.

.. _scheduling:

Scheduling
~~~~~~~~~~

After the grouping_ pass, there are no more instructions to insert or remove. Start scheduling each basic block from the deepest node in the depth sorted list created by the depth_ pass, recursively trying to schedule each instruction after its source instructions plus delay slots. Insert ``nop``\s as required.

.. _`register assignment`:

Register Assignment
~~~~~~~~~~~~~~~~~~~

TODO