IR3 NOTES
=========

Some notes about ir3, the compiler and machine-specific IR for the shader ISA introduced with adreno a3xx.  The same shader ISA is present, with some small differences, in adreno a4xx.

Compared to the previous generation a2xx ISA (ir2), the a3xx ISA is a "simple" scalar instruction set.  However, the compiler is responsible, in most cases, for scheduling the instructions.  The hardware does not try to hide the shader core pipeline stages.  For example, a common (cat2) ALU instruction takes four cycles, so a subsequent cat2 instruction which uses the result must have three intervening instructions (or NOPs).  When operating on vec4's, the corresponding scalar instructions for operating on the remaining three components could typically fit.  However, that results in a lot of edge cases where things fall over, like:

::

  ADD TEMP[0], TEMP[1], TEMP[2]
  MUL TEMP[0], TEMP[1], TEMP[0].wzyx

Here, the second instruction needs the output of the first group of scalar instructions in the wrong order, resulting in not enough instruction slots between the ``add r0.w, r1.w, r2.w`` and ``mul r0.x, r1.x, r0.w``.  This is why the original (old) compiler, which merely translated nearly literally from TGSI to ir3, had a strong tendency to fall over.

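The timing problem can be made concrete with a small sketch.  It is not ir3 code: ``nops_needed`` and ``CAT2_DELAY_SLOTS`` are hypothetical names, and the model assumes only what is stated above (three intervening slots between a cat2 producer and its cat2 consumer) plus a naive scalarization that issues the components of each vec4 op in x, y, z, w order.  With an identity swizzle no padding is needed, while ``.wzyx`` forces three nops before the first ``mul``.

```c
#include <assert.h>

/* Assumption taken from the text: a cat2 result may be consumed by another
 * cat2 instruction only after three intervening instruction slots. */
#define CAT2_DELAY_SLOTS 3

/*
 * Hypothetical helper, not part of ir3.  A vec4 op is scalarized into four
 * instructions issued in slots 0..3 (x,y,z,w), directly followed by a second
 * vec4 op whose component i reads component swz[i] of the first result.
 * Returns the number of nops needed so that no consumer issues too early.
 */
static unsigned nops_needed(const unsigned swz[4])
{
    unsigned nops = 0;

    for (unsigned i = 0; i < 4; i++) {
        assert(swz[i] < 4);

        unsigned producer = swz[i];                      /* producer's slot */
        unsigned consumer = 4 + i + nops;                /* slot after padding */
        unsigned earliest = producer + CAT2_DELAY_SLOTS + 1;

        if (consumer < earliest)
            nops += earliest - consumer;
    }
    return nops;
}
```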
To avoid this, the current compiler instead generates, in the frontend, a directed acyclic graph (DAG) of instructions and basic blocks, which go through various additional passes to eventually schedule and do register assignment.

For additional documentation about the hardware, see wiki: `a3xx ISA
<https://github.com/freedreno/freedreno/wiki/A3xx-shader-instruction-set-architecture>`_.

External Structure
------------------

``ir3_shader``
    A single vertex/fragment/etc shader from the gallium perspective
    (i.e. maps to a single TGSI shader).  It manages a set of shader
    variants which are generated on demand based on the shader key.

``ir3_shader_key``
    The configuration key that identifies a shader variant.  I.e.
    different shader variants are generated based on other GL state
    (two-sided-color, render-to-alpha, etc) or render stages
    (binning-pass vertex shader).

``ir3_shader_variant``
    The actual hw shader generated based on input TGSI and shader key.

``ir3_compiler``
    Compiler frontend which generates ir3 and runs the various backend
    stages to schedule and do register assignment.

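The relationship between shader, key and variant can be sketched as a lookup that compiles on first use.  This is only an illustration: all ``toy_*`` names and the two key bits are hypothetical stand-ins, and the real structures carry far more state.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified stand-ins for the real structures. */
struct toy_key {
    unsigned binning_pass : 1;
    unsigned color_two_side : 1;
};

struct toy_variant {
    struct toy_key key;
    struct toy_variant *next;
    /* the compiled hardware shader would live here */
};

struct toy_shader {
    struct toy_variant *variants;   /* variants generated so far */
    unsigned num_compiles;          /* only here to make the sketch testable */
};

static int key_equal(struct toy_key a, struct toy_key b)
{
    return a.binning_pass == b.binning_pass &&
           a.color_two_side == b.color_two_side;
}

/* Return the variant matching `key`, generating it on demand. */
static struct toy_variant *
toy_shader_variant(struct toy_shader *shader, struct toy_key key)
{
    for (struct toy_variant *v = shader->variants; v; v = v->next)
        if (key_equal(v->key, key))
            return v;

    struct toy_variant *v = calloc(1, sizeof(*v));
    if (!v)
        return NULL;
    v->key = key;                   /* a real compiler would run here */
    v->next = shader->variants;
    shader->variants = v;
    shader->num_compiles++;
    return v;
}
```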
The IR
------

The ir3 IR maps quite directly to the hardware, in that instruction opcodes map directly to hardware opcodes, and that dst/src register(s) map directly to the hardware dst/src register(s).  But there are a few extensions, in the form of meta_ instructions.  Additionally, for normal (non-const, etc) src registers, the ``IR3_REG_SSA`` flag is set and ``reg->instr`` points to the source instruction which produced that value.  So, for example, the following TGSI shader:

::

  VERT
  DCL IN[0]
  DCL IN[1]
  DCL OUT[0], POSITION
  DCL TEMP[0], LOCAL
    1: DP3 TEMP[0].x, IN[0].xyzz, IN[1].xyzz
    2: MOV OUT[0], TEMP[0].xxxx
    3: END

eventually generates:

.. graphviz::

  digraph G {
  rankdir=RL;
  nodesep=0.25;
  ranksep=1.5;
  subgraph clusterdce198 {
  label="vert";
  inputdce198 [shape=record,label="inputs|<in0> i0.x|<in1> i0.y|<in2> i0.z|<in4> i1.x|<in5> i1.y|<in6> i1.z"];
  instrdcf348 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
  instrdcedd0 [shape=record,style=filled,fillcolor=lightgrey,label="{mad.f32|<dst0>|<src0> |<src1> |<src2> }"];
  inputdce198:<in2>:w -> instrdcedd0:<src0>
  inputdce198:<in6>:w -> instrdcedd0:<src1>
  instrdcec30 [shape=record,style=filled,fillcolor=lightgrey,label="{mad.f32|<dst0>|<src0> |<src1> |<src2> }"];
  inputdce198:<in1>:w -> instrdcec30:<src0>
  inputdce198:<in5>:w -> instrdcec30:<src1>
  instrdceb60 [shape=record,style=filled,fillcolor=lightgrey,label="{mul.f|<dst0>|<src0> |<src1> }"];
  inputdce198:<in0>:w -> instrdceb60:<src0>
  inputdce198:<in4>:w -> instrdceb60:<src1>
  instrdceb60:<dst0> -> instrdcec30:<src2>
  instrdcec30:<dst0> -> instrdcedd0:<src2>
  instrdcedd0:<dst0> -> instrdcf348:<src0>
  instrdcf400 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
  instrdcedd0:<dst0> -> instrdcf400:<src0>
  instrdcf4b8 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
  instrdcedd0:<dst0> -> instrdcf4b8:<src0>
  outputdce198 [shape=record,label="outputs|<out0> o0.x|<out1> o0.y|<out2> o0.z|<out3> o0.w"];
  instrdcf348:<dst0> -> outputdce198:<out0>:e
  instrdcf400:<dst0> -> outputdce198:<out1>:e
  instrdcf4b8:<dst0> -> outputdce198:<out2>:e
  instrdcedd0:<dst0> -> outputdce198:<out3>:e
  }
  }

(after scheduling, etc, but before register assignment).

Internal Structure
~~~~~~~~~~~~~~~~~~

``ir3_block``
    Represents a basic block.

    TODO: currently blocks are nested, but I think I need to change that
    to a more conventional arrangement before implementing proper flow
    control.  Currently the only flow control handled is if/else, which
    gets flattened out, with the results chosen by ``sel`` instructions.

``ir3_instruction``
    Represents a machine instruction or meta_ instruction.  Has pointers
    to dst register (``regs[0]``) and src register(s) (``regs[1..n]``),
    as needed.

``ir3_register``
    Represents a src or dst register, flags indicate const/relative/etc.
    If ``IR3_REG_SSA`` is set on a src register, the actual register
    number (name) has not been assigned yet, and instead the ``instr``
    field points to the src instruction.

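As a rough mental model of the layout described above, a minimal sketch follows.  The ``toy_*`` types and flag values are hypothetical simplifications, not the real definitions; the sketch only mirrors the ``regs[0]`` dst / ``regs[1..n]`` src convention and the SSA pointer from a src register back to its producer.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_REG_CONST 0x1   /* hypothetical flag values */
#define TOY_REG_SSA   0x2

struct toy_instr;

struct toy_reg {
    unsigned flags;
    unsigned num;               /* register number, meaningful once assigned */
    struct toy_instr *instr;    /* producer, valid if TOY_REG_SSA is set */
};

struct toy_instr {
    const char *name;
    unsigned regs_count;
    struct toy_reg *regs[4];    /* regs[0] = dst, regs[1..n] = srcs */
};

/* Count the sources of `instr` that still refer to a producing instruction
 * rather than to an assigned register or a const. */
static unsigned count_ssa_srcs(const struct toy_instr *instr)
{
    unsigned n = 0;

    for (unsigned i = 1; i < instr->regs_count; i++)
        if (instr->regs[i]->flags & TOY_REG_SSA)
            n++;
    return n;
}
```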
In addition there are various util macros/functions to simplify manipulation/traversal of the graph:

``foreach_src(srcreg, instr)``
    Iterate over each of the instruction's source ``ir3_register``\s.

``foreach_src_n(srcreg, n, instr)``
    Like ``foreach_src``, also setting ``n`` to the source number (starting
    with ``0``).

``foreach_ssa_src(srcinstr, instr)``
    Iterate over each of the instruction's SSA source ``ir3_instruction``\s.
    This skips non-SSA sources (consts, etc), but includes virtual sources
    (such as the address register if `relative addressing`_ is used).

``foreach_ssa_src_n(srcinstr, n, instr)``
    Like ``foreach_ssa_src``, also setting ``n`` to the source number.

For example:

.. code-block:: c

  foreach_ssa_src_n(src, i, instr) {
    unsigned d = delay_calc_srcn(ctx, src, instr, i);
    delay = MAX2(delay, d);
  }


TODO probably other helper/util stuff worth mentioning here

.. _meta:

Meta Instructions
~~~~~~~~~~~~~~~~~

**input**
    Used for shader inputs (registers configured in the command-stream
    to hold particular input values, written by the shader core before
    start of execution).  Also used for connecting up values within a
    basic block to an output of a previous block.

**output**
    Used to hold outputs of a basic block.

**flow**
    TODO

**phi**
    TODO

**collect**
    Groups registers which need to be assigned to consecutive scalar
    registers, for example ``sam`` (texture fetch) src instructions (see
    `register groups`_) or array element dereference
    (see `relative addressing`_).

**split**
    The counterpart to **collect**: when an instruction such as ``sam``
    writes multiple components, splits the result into individual
    scalar components to be consumed by other instructions.


.. _`flow control`:

Flow Control
~~~~~~~~~~~~

TODO


.. _`register groups`:

Register Groups
~~~~~~~~~~~~~~~

Certain instructions, such as texture sample instructions, consume multiple consecutive scalar registers via a single src register encoded in the instruction, and/or write multiple consecutive scalar registers.  In the simplest example:

::

  sam (f32)(xyz)r2.x, r0.z, s#0, t#0

for a 2d texture, would read ``r0.zw`` to get the coordinate, and write ``r2.xyz``.

Before register assignment, to group the two components of the texture src together:

.. graphviz::

  digraph G {
    { rank=same;
      collect;
    };
    { rank=same;
      coord_x;
      coord_y;
    };
    sam -> collect [label="regs[1]"];
    collect -> coord_x [label="regs[1]"];
    collect -> coord_y [label="regs[2]"];
    coord_x -> coord_y [label="right",style=dotted];
    coord_y -> coord_x [label="left",style=dotted];
    coord_x [label="coord.x"];
    coord_y [label="coord.y"];
  }

The frontend sets up the SSA pointers from the ``sam`` source register to the ``collect`` meta instruction, which in turn points to the instructions producing the ``coord.x`` and ``coord.y`` values.  The grouping_ pass then sets up the ``left`` and ``right`` neighbor pointers on the ``collect``\'s sources, used later by the `register assignment`_ pass to assign blocks of scalar registers.

And likewise, for the consecutive scalar registers for the destination:

.. graphviz::

  digraph {
    { rank=same;
      A;
      B;
      C;
    };
    { rank=same;
      split_0;
      split_1;
      split_2;
    };
    A -> split_0;
    B -> split_1;
    C -> split_2;
    split_0 [label="split\noff=0"];
    split_0 -> sam;
    split_1 [label="split\noff=1"];
    split_1 -> sam;
    split_2 [label="split\noff=2"];
    split_2 -> sam;
    split_0 -> split_1 [label="right",style=dotted];
    split_1 -> split_0 [label="left",style=dotted];
    split_1 -> split_2 [label="right",style=dotted];
    split_2 -> split_1 [label="left",style=dotted];
    sam;
  }

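How the neighbor pointers could later be consumed can be sketched as follows.  This is a toy model rather than the actual register assignment code: the ``toy_*`` names are hypothetical, and scalar registers are assumed to be numbered ``4 * reg + component`` (so ``r0.z`` is 2).  Starting from any member of a group, the sketch walks to the leftmost neighbor and hands out consecutive scalar registers.

```c
#include <assert.h>
#include <stddef.h>

struct toy_instr {
    struct toy_instr *left, *right;  /* neighbors set up by grouping */
    int reg;                         /* assigned scalar register, -1 if none */
};

/* Link srcs[0..n-1] into one chain, as would be done for a collect. */
static void toy_group(struct toy_instr **srcs, unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        srcs[i]->left  = (i > 0)     ? srcs[i - 1] : NULL;
        srcs[i]->right = (i + 1 < n) ? srcs[i + 1] : NULL;
    }
}

/* Assign consecutive scalar registers to the whole group containing `instr`,
 * starting at `base` for the leftmost member.  Returns the group size. */
static unsigned toy_assign_group(struct toy_instr *instr, int base)
{
    unsigned n = 0;

    while (instr->left)
        instr = instr->left;
    for (; instr; instr = instr->right)
        instr->reg = base + (int)n++;
    return n;
}
```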
.. _`relative addressing`:

Relative Addressing
~~~~~~~~~~~~~~~~~~~

Most instructions support addressing indirectly (relative to the address register) into the const or gpr register file in some or all of their src/dst registers.  In this case the register accessed is taken from ``r<a0.x + n>`` or ``c<a0.x + n>``, i.e. the address register (``a0.x``) value plus ``n``, where ``n`` is encoded in the instruction (rather than the absolute register number).

    Note that cat5 (texture sample) instructions are the notable exception, not
    supporting relative addressing of src or dst.

Relative addressing of the const file (for example, a uniform array) is relatively simple.  We don't do register assignment of the const file, so all that is required is to schedule things properly.  I.e. the instruction that writes the address register must be scheduled first, and we cannot have two different address register values live at one time.

But relative addressing of the gpr file (which can be as src or dst) has additional restrictions on register assignment (i.e. the array elements must be assigned to consecutive scalar registers).  And in the case of relative dst, subsequent instructions now depend on both the relative write and the previous instruction which wrote that register, since we do not know at compile time which actual register was written.

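The double dependency created by a relative dst can be seen in a toy register file.  The sketch below is an illustration under assumptions: ``toy_regs`` and its helpers are hypothetical, and registers hold plain integers.  The same ``mov r<a0.x + 1>`` either overwrites ``r2`` or leaves it alone depending on the runtime value of ``a0.x``, so a later read of ``r2`` has to be ordered after both the relative write and the earlier absolute write.

```c
#include <assert.h>

#define TOY_NUM_GPRS 8

struct toy_regs {
    int gpr[TOY_NUM_GPRS];   /* scalar general purpose registers */
    int a0_x;                /* address register */
};

/* Index of r<a0.x + n>; `n` is the offset encoded in the instruction. */
static int toy_rel_index(const struct toy_regs *r, int n)
{
    int idx = r->a0_x + n;

    assert(idx >= 0 && idx < TOY_NUM_GPRS);
    return idx;
}

/* Models: mov r<a0.x + n>, value */
static void toy_write_rel(struct toy_regs *r, int n, int value)
{
    r->gpr[toy_rel_index(r, n)] = value;
}

/* Value that a later absolute read of r<reg> observes. */
static int toy_read(const struct toy_regs *r, int reg)
{
    assert(reg >= 0 && reg < TOY_NUM_GPRS);
    return r->gpr[reg];
}
```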
Each instruction has an optional ``address`` pointer, to capture the dependency on the address register value when relative addressing is used for any of the src/dst register(s).  This behaves as an additional virtual src register, i.e. ``foreach_ssa_src()`` will also iterate the address register (last).

    Note that ``nop``\'s for timing constraints, type specifiers (i.e.
    ``add.f`` vs ``add.u``), etc, are omitted for brevity in examples.

::

  mova a0.x, hr1.y
  sub r1.y, r2.x, r3.x
  add r0.x, r1.y, c<a0.x + 2>

results in:

.. graphviz::

  digraph {
    rankdir=LR;
    sub;
    const [label="const file"];
    add;
    mova;
    add -> mova;
    add -> sub;
    add -> const [label="off=2"];
  }

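The "address as a virtual last source" behavior can be mimicked with a small iterator.  This is a sketch, not the real macro: ``toy_foreach_ssa_src_n`` and the ``toy_instr`` layout are hypothetical, with non-SSA sources modeled as ``NULL``.  For the ``add`` above, the iteration visits ``sub`` first and ``mova`` last, and skips the const source.

```c
#include <assert.h>
#include <stddef.h>

struct toy_instr {
    unsigned num_srcs;
    struct toy_instr *srcs[3];   /* NULL for non-SSA sources (consts, etc) */
    struct toy_instr *address;   /* writer of a0.x, or NULL if unused */
};

/* Slot n of the iteration: regular sources first, then the address. */
static struct toy_instr *toy_ssa_src_n(const struct toy_instr *instr, unsigned n)
{
    if (n < instr->num_srcs)
        return instr->srcs[n];
    if (n == instr->num_srcs)
        return instr->address;
    return NULL;
}

/* Hypothetical stand-in for foreach_ssa_src_n(): skips NULL slots. */
#define toy_foreach_ssa_src_n(src, n, instr)                  \
    for (unsigned n = 0; n <= (instr)->num_srcs; n++)         \
        if (((src) = toy_ssa_src_n((instr), n)) != NULL)

static unsigned count_deps(const struct toy_instr *instr)
{
    struct toy_instr *src;
    unsigned count = 0;

    toy_foreach_ssa_src_n(src, i, instr)
        count++;
    return count;
}

/* Returns the dependency visited last, or NULL if there is none. */
static struct toy_instr *last_dep(const struct toy_instr *instr)
{
    struct toy_instr *src, *last = NULL;

    toy_foreach_ssa_src_n(src, i, instr)
        last = src;
    return last;
}
```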
The scheduling pass has some smarts to schedule things such that only a single ``a0.x`` value is used at any one time.

To implement variable arrays, the NIR registers are stored as an ``ir3_array``,
which will be register allocated to consecutive hardware registers.  The array
access uses the id field in the ``ir3_register`` to map to the array being
accessed, and the offset field for the fixed offset within the array.  A NIR
indirect register read such as:

::

  decl_reg vec2 32 r0[2]
  ...
  vec2 32 ssa_19 = mov r0[0 + ssa_9]


results in:

::

  0000:0000:001:  shl.b hssa_19, hssa_17, himm[0.000000,1,0x1]
  0000:0000:002:  mov.s16s16 hr61.x, hssa_19
  0000:0000:002:  mov.u32u32 ssa_21, arr[id=1, offset=0, size=4, ssa_12], address=_[0000:0000:002:  mov.s16s16]
  0000:0000:002:  mov.u32u32 ssa_22, arr[id=1, offset=1, size=4, ssa_12], address=_[0000:0000:002:  mov.s16s16]


Array writes write to the array in ``instr->regs[0]->array.id``.  A NIR indirect
register write such as:

::

  decl_reg vec2 32 r0[2]
  ...
  r0[0 + ssa_12] = mov ssa_13

results in:

::

  0000:0000:001:  shl.b hssa_29, hssa_27, himm[0.000000,1,0x1]
  0000:0000:002:  mov.s16s16 hr61.x, hssa_29
  0000:0000:001:  mov.u32u32 arr[id=1, offset=0, size=4, ssa_17], c2.y, address=_[0000:0000:002:  mov.s16s16]
  0000:0000:004:  mov.u32u32 arr[id=1, offset=1, size=4, ssa_31], c2.z, address=_[0000:0000:002:  mov.s16s16]

Note that only cat1 (mov) can do indirect writes, and thus NIR register stores
may need to introduce an extra mov.

ir3 array accesses in the DAG get serialized by the ``instr->barrier_class``
containing ``IR3_BARRIER_ARRAY_W`` or ``IR3_BARRIER_ARRAY_R``.

Shader Passes
-------------

After the frontend has generated the use-def graph of instructions, they are run through various passes which include scheduling_ and `register assignment`_.  Because inserting ``mov`` instructions after scheduling would also require inserting additional ``nop`` instructions (since it is too late to reschedule to try to fill the bubbles), the earlier stages try to ensure (at least given an infinite supply of registers) that `register assignment`_ after scheduling_ cannot fail.

    Note that we essentially have ~256 scalar registers in the
    architecture (although larger register usage will at some thresholds
    limit the number of threads which can run in parallel).  And at some
    point we will have to deal with spilling.

.. _flatten:

Flatten
~~~~~~~

In this stage, simple if/else blocks are flattened into a single block with ``phi`` nodes converted into ``sel`` instructions.  The a3xx ISA has very few predicated instructions, and we would prefer not to use branches for simple if/else.

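The effect of flattening can be illustrated in plain C.  This is a conceptual sketch only: ``toy_sel`` is a hypothetical stand-in for a select, and its operand order is chosen for readability rather than taken from the ISA.  Both sides of the branch are computed unconditionally, and the value that a ``phi`` would merge is picked by the select instead.

```c
#include <assert.h>

/* Branchy form: what the shader source expresses. */
static float branchy(int cond, float a, float b)
{
    float r;

    if (cond)
        r = a * 2.0f;
    else
        r = b + 1.0f;
    return r;
}

/* Toy model of a select: picks one of two already computed values. */
static float toy_sel(float if_true, int cond, float if_false)
{
    return cond ? if_true : if_false;
}

/* Flattened form: a single block, with the phi replaced by a select. */
static float flattened(int cond, float a, float b)
{
    float then_val = a * 2.0f;   /* former "then" block */
    float else_val = b + 1.0f;   /* former "else" block */

    return toy_sel(then_val, cond, else_val);
}
```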

.. _`copy propagation`:

Copy Propagation
~~~~~~~~~~~~~~~~

Currently the frontend inserts ``mov``\s in various cases, because certain categories of instructions have limitations about const regs as sources.  The CP pass then simply removes all simple ``mov``\s (i.e. src-type is same as dst-type, no abs/neg flags, etc).

The eventual plan is to invert that, with the frontend inserting no ``mov``\s and CP legalizing things.

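A minimal sketch of the "remove simple movs" idea follows.  It is not the actual CP pass: the ``toy_*`` types are hypothetical, and a ``mov`` is treated as simple only when its src and dst types match, it carries no modifier flags, and it reads an SSA value.  Users are rewired to look through any chain of such movs, which leaves the movs themselves without users.

```c
#include <assert.h>
#include <stddef.h>

enum toy_opc { TOY_MOV, TOY_ADD };
enum toy_type { TOY_F32, TOY_F16 };

struct toy_instr {
    enum toy_opc opc;
    enum toy_type src_type, dst_type;  /* only meaningful for mov */
    unsigned src_flags;                /* abs/neg modifiers, 0 if none */
    unsigned num_srcs;
    struct toy_instr *srcs[2];         /* NULL for non-SSA sources */
};

static int is_simple_mov(const struct toy_instr *instr)
{
    return instr->opc == TOY_MOV &&
           instr->src_type == instr->dst_type &&
           instr->src_flags == 0 &&
           instr->num_srcs == 1 && instr->srcs[0] != NULL;
}

/* Make each source of `instr` point past any chain of simple movs. */
static void toy_cp(struct toy_instr *instr)
{
    for (unsigned i = 0; i < instr->num_srcs; i++) {
        struct toy_instr *src = instr->srcs[i];

        while (src && is_simple_mov(src))
            src = src->srcs[0];
        instr->srcs[i] = src;
    }
}
```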

.. _grouping:

Grouping
~~~~~~~~

In the grouping pass, instructions which need to be grouped (for ``collect``\s, etc) have their ``left`` / ``right`` neighbor pointers set up.  In cases where there is a conflict (i.e. one instruction cannot have two unique left or right neighbors), an additional ``mov`` instruction is inserted.  This ensures that there is some possible valid `register assignment`_ at the later stages.

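The conflict handling can be sketched as below.  The sketch uses a deliberately simpler rule than the text describes, and says so: any source that already belongs to a group is treated as a conflict and replaced by a freshly allocated ``mov`` copy, whereas the real pass only needs a ``mov`` when the neighbors actually differ.  All names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_instr {
    struct toy_instr *left, *right;  /* neighbors within a group */
    struct toy_instr *mov_src;       /* non-NULL for a mov inserted here */
    int grouped;                     /* already a member of some group */
};

/*
 * Set up neighbor pointers for the sources of one collect.  A source that is
 * already grouped is replaced in srcs[] by a new mov of it (simplified
 * conflict rule, see above).  Returns the number of movs inserted, or -1 if
 * allocation fails.  The caller owns the inserted movs.
 */
static int toy_group(struct toy_instr **srcs, unsigned n)
{
    int movs = 0;

    for (unsigned i = 0; i < n; i++) {
        if (srcs[i]->grouped) {
            struct toy_instr *mov = calloc(1, sizeof(*mov));

            if (!mov)
                return -1;
            mov->mov_src = srcs[i];
            srcs[i] = mov;
            movs++;
        }
        srcs[i]->grouped = 1;
        if (i > 0) {
            srcs[i - 1]->right = srcs[i];
            srcs[i]->left = srcs[i - 1];
        }
    }
    return movs;
}
```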

.. _depth:

Depth
~~~~~

In the depth pass, a depth is calculated for each instruction node within its basic block.  The depth is the sum of the required cycles (delay slots needed between two instructions plus one) of each instruction plus the max depth of any of its source instructions.  (meta_ instructions don't add to the depth).  As an instruction's depth is calculated, it is inserted into a per-block list sorted by deepest instruction.  Unreachable instructions and inputs are marked.

    TODO: we should probably calculate both hard and soft depths (?) to
    try to coax additional instructions to fit in places where we need
    to use sync bits, such as after a texture fetch or SFU.

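The depth rule translates almost directly into a recursive function.  The following is a sketch under assumptions: the ``toy_instr`` layout is hypothetical, the per-instruction ``cycles`` value is supplied by the caller (0 for meta instructions), and the sorted per-block list and unreachable marking are left out.

```c
#include <assert.h>
#include <stddef.h>

struct toy_instr {
    unsigned cycles;             /* delay slots + 1; 0 for meta instructions */
    unsigned num_srcs;
    struct toy_instr *srcs[3];
    unsigned depth;              /* valid once `visited` is set */
    int visited;
};

/* depth(instr) = cycles(instr) + max over sources of depth(src) */
static unsigned toy_depth(struct toy_instr *instr)
{
    if (instr->visited)
        return instr->depth;

    unsigned max_src = 0;

    for (unsigned i = 0; i < instr->num_srcs; i++) {
        unsigned d = toy_depth(instr->srcs[i]);

        if (d > max_src)
            max_src = d;
    }
    instr->depth = instr->cycles + max_src;
    instr->visited = 1;
    return instr->depth;
}
```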
.. _scheduling:

Scheduling
~~~~~~~~~~

After the grouping_ pass, there are no more instructions to insert or remove.  Start scheduling each basic block from the deepest node in the depth sorted list created by the depth_ pass, recursively trying to schedule each instruction after its source instructions plus delay slots.  Insert ``nop``\s as required.

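A stripped-down version of the recursive idea is sketched below.  It is an assumption-laden toy, not the real pass: the ``toy_*`` names are hypothetical, each instruction carries the number of delay slots its consumers must respect, and the scheduler makes no attempt to pick the best candidate from a depth sorted list.  Sources are scheduled first, and nops are counted whenever the next free slot is earlier than the earliest legal slot.

```c
#include <assert.h>
#include <stddef.h>

struct toy_instr {
    unsigned delay;              /* slots required before a consumer may issue */
    unsigned num_srcs;
    struct toy_instr *srcs[3];
    int slot;                    /* issue slot, -1 while unscheduled */
};

struct toy_sched {
    int next_slot;               /* next free issue slot */
    unsigned nops;               /* nops inserted so far */
};

/* Schedule `instr` after all of its sources plus their delay slots. */
static void toy_schedule(struct toy_sched *s, struct toy_instr *instr)
{
    if (instr->slot >= 0)
        return;                  /* already scheduled */

    int earliest = 0;

    for (unsigned i = 0; i < instr->num_srcs; i++) {
        struct toy_instr *src = instr->srcs[i];

        toy_schedule(s, src);

        int ready = src->slot + (int)src->delay + 1;

        if (ready > earliest)
            earliest = ready;
    }
    if (earliest > s->next_slot) {
        /* Pad the bubble with nops. */
        s->nops += (unsigned)(earliest - s->next_slot);
        s->next_slot = earliest;
    }
    instr->slot = s->next_slot++;
}
```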
.. _`register assignment`:

Register Assignment
~~~~~~~~~~~~~~~~~~~

TODO
