IR3 NOTES
=========

Some notes about ir3, the compiler and machine-specific IR for the shader ISA introduced with adreno a3xx.  The same shader ISA is present, with some small differences, in adreno a4xx.

Compared to the previous generation a2xx ISA (ir2), the a3xx ISA is a "simple" scalar instruction set.  However, the compiler is responsible, in most cases, for scheduling the instructions.  The hardware does not try to hide the shader core pipeline stages.  For example, a common (cat2) ALU instruction takes four cycles, so a subsequent cat2 instruction which uses the result must have three intervening instructions (or NOPs).  When operating on vec4's, the corresponding scalar instructions for the remaining three components can typically fill those slots.  However, that results in a lot of edge cases where things fall over, like:

::

  ADD TEMP[0], TEMP[1], TEMP[2]
  MUL TEMP[0], TEMP[1], TEMP[0].wzyx

Here, the second instruction needs the output of the first group of scalar instructions in the wrong order, leaving not enough instruction slots between the ``add r0.w, r1.w, r2.w`` and ``mul r0.x, r1.x, r0.w``.  This is why the original (old) compiler, which translated nearly literally from TGSI to ir3, had a strong tendency to fall over.
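
The arithmetic behind this failure can be sketched in C.  The helper below is
purely illustrative: ``nops_needed`` and the slot numbering are assumptions of
this sketch, not part of ir3.  It takes the issue slots of a producer and a
consumer and returns how many NOPs would be needed for a cat2 result, which
requires three intervening instructions.

.. code-block:: c

  #include <assert.h>

  /* Illustrative only: a cat2 ALU result needs three intervening
   * instructions before a dependent cat2 instruction can issue. */
  #define CAT2_DELAY_SLOTS 3

  /* Number of nops needed between a producer issued at slot `producer`
   * and a consumer that would otherwise issue at slot `consumer`. */
  static unsigned
  nops_needed(unsigned producer, unsigned consumer)
  {
     assert(consumer > producer);
     unsigned intervening = consumer - producer - 1;
     return intervening >= CAT2_DELAY_SLOTS ? 0 : CAT2_DELAY_SLOTS - intervening;
  }

If the four scalar ``add``\s occupy slots 0 to 3 and the four ``mul``\s slots 4
to 7, an unswizzled ``mul r0.x`` reads ``add r0.x`` (slot 0) and needs no NOPs,
whereas with the ``.wzyx`` swizzle it reads ``add r0.w`` (slot 3) and needs
three.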

So the current compiler instead generates, in the frontend, a directed acyclic graph of instructions and basic blocks, which goes through various additional passes to eventually schedule and do register assignment.

For additional documentation about the hardware, see the wiki: `a3xx ISA
<https://github.com/freedreno/freedreno/wiki/A3xx-shader-instruction-set-architecture>`_.

External Structure
------------------

``ir3_shader``
    A single vertex/fragment/etc shader from the gallium perspective
    (i.e. maps to a single TGSI shader).  It manages a set of shader
    variants which are generated on demand based on the shader key.

``ir3_shader_key``
    The configuration key that identifies a shader variant.  I.e.
    different shader variants are generated based on other GL state
    (two-sided-color, render-to-alpha, etc) or render stages
    (binning-pass vertex shader).

``ir3_shader_variant``
    The actual hw shader generated based on input TGSI and shader key.

``ir3_compiler``
    Compiler frontend which generates ir3 and runs the various backend
    stages to schedule and do register assignment.
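
The relationship between shader, key and variant can be pictured as a small
cache lookup.  The following is a minimal sketch with hypothetical, heavily
reduced types (every ``toy_`` name and both key fields are assumptions made
for illustration, not the real ir3 structures): a variant is looked up by key
and created the first time that key is seen.

.. code-block:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stddef.h>

  /* Hypothetical, heavily reduced stand-ins for the real structures. */
  struct toy_shader_key {
     bool binning_pass;
     bool color_two_side;
  };

  struct toy_shader_variant {
     struct toy_shader_key key;
     int id;   /* stands in for the compiled hw shader */
  };

  #define TOY_MAX_VARIANTS 8

  struct toy_shader {
     struct toy_shader_variant variants[TOY_MAX_VARIANTS];
     size_t count;
  };

  static bool
  toy_key_equal(const struct toy_shader_key *a, const struct toy_shader_key *b)
  {
     return a->binning_pass == b->binning_pass &&
            a->color_two_side == b->color_two_side;
  }

  /* Return the variant matching `key`, "compiling" it on first use.
   * Returns NULL if the toy table is full. */
  static struct toy_shader_variant *
  toy_shader_get_variant(struct toy_shader *shader, struct toy_shader_key key)
  {
     assert(shader != NULL);
     for (size_t i = 0; i < shader->count; i++)
        if (toy_key_equal(&shader->variants[i].key, &key))
           return &shader->variants[i];

     if (shader->count == TOY_MAX_VARIANTS)
        return NULL;

     struct toy_shader_variant *v = &shader->variants[shader->count];
     v->key = key;
     v->id = (int)shader->count;
     shader->count++;
     return v;
  }

Asking twice for the same key returns the same variant, while a different key
(for example the binning-pass flag set) yields a second one.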

The IR
------

The ir3 IR maps quite directly to the hardware, in that instruction opcodes map directly to hardware opcodes, and that dst/src register(s) map directly to the hardware dst/src register(s).  But there are a few extensions, in the form of meta_ instructions.  And additionally, for normal (non-const, etc) src registers, the ``IR3_REG_SSA`` flag is set and ``reg->instr`` points to the source instruction which produced that value.  So, for example, the following TGSI shader:

::

  VERT
  DCL IN[0]
  DCL IN[1]
  DCL OUT[0], POSITION
  DCL TEMP[0], LOCAL
    1: DP3 TEMP[0].x, IN[0].xyzz, IN[1].xyzz
    2: MOV OUT[0], TEMP[0].xxxx
    3: END

eventually generates:

.. graphviz::

  digraph G {
  rankdir=RL;
  nodesep=0.25;
  ranksep=1.5;
  subgraph clusterdce198 {
  label="vert";
  inputdce198 [shape=record,label="inputs|<in0> i0.x|<in1> i0.y|<in2> i0.z|<in4> i1.x|<in5> i1.y|<in6> i1.z"];
  instrdcf348 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
  instrdcedd0 [shape=record,style=filled,fillcolor=lightgrey,label="{mad.f32|<dst0>|<src0> |<src1> |<src2> }"];
  inputdce198:<in2>:w -> instrdcedd0:<src0>
  inputdce198:<in6>:w -> instrdcedd0:<src1>
  instrdcec30 [shape=record,style=filled,fillcolor=lightgrey,label="{mad.f32|<dst0>|<src0> |<src1> |<src2> }"];
  inputdce198:<in1>:w -> instrdcec30:<src0>
  inputdce198:<in5>:w -> instrdcec30:<src1>
  instrdceb60 [shape=record,style=filled,fillcolor=lightgrey,label="{mul.f|<dst0>|<src0> |<src1> }"];
  inputdce198:<in0>:w -> instrdceb60:<src0>
  inputdce198:<in4>:w -> instrdceb60:<src1>
  instrdceb60:<dst0> -> instrdcec30:<src2>
  instrdcec30:<dst0> -> instrdcedd0:<src2>
  instrdcedd0:<dst0> -> instrdcf348:<src0>
  instrdcf400 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
  instrdcedd0:<dst0> -> instrdcf400:<src0>
  instrdcf4b8 [shape=record,style=filled,fillcolor=lightgrey,label="{mov.f32f32|<dst0>|<src0> }"];
  instrdcedd0:<dst0> -> instrdcf4b8:<src0>
  outputdce198 [shape=record,label="outputs|<out0> o0.x|<out1> o0.y|<out2> o0.z|<out3> o0.w"];
  instrdcf348:<dst0> -> outputdce198:<out0>:e
  instrdcf400:<dst0> -> outputdce198:<out1>:e
  instrdcf4b8:<dst0> -> outputdce198:<out2>:e
  instrdcedd0:<dst0> -> outputdce198:<out3>:e
  }
  }

(after scheduling, etc, but before register assignment).

Internal Structure
~~~~~~~~~~~~~~~~~~

``ir3_block``
    Represents a basic block.

    TODO: currently blocks are nested, but I think I need to change that
    to a more conventional arrangement before implementing proper flow
    control.  Currently the only flow control handled is if/else, which
    gets flattened out and results chosen with ``sel`` instructions.

``ir3_instruction``
    Represents a machine instruction or meta_ instruction.  Has pointers
    to dst register (``regs[0]``) and src register(s) (``regs[1..n]``),
    as needed.

``ir3_register``
    Represents a src or dst register; flags indicate const/relative/etc.
    If ``IR3_REG_SSA`` is set on a src register, the actual register
    number (name) has not been assigned yet, and instead the ``instr``
    field points to the src instruction.

In addition there are various util macros/functions to simplify manipulation/traversal of the graph:

``foreach_src(srcreg, instr)``
    Iterate each instruction's source ``ir3_register``\s

``foreach_src_n(srcreg, n, instr)``
    Like ``foreach_src``, also setting ``n`` to the source number (starting
    with ``0``).

``foreach_ssa_src(srcinstr, instr)``
    Iterate each instruction's SSA source ``ir3_instruction``\s.  This skips
    non-SSA sources (consts, etc), but includes virtual sources (such as the
    address register if `relative addressing`_ is used).

``foreach_ssa_src_n(srcinstr, n, instr)``
    Like ``foreach_ssa_src``, also setting ``n`` to the source number.

For example:

.. code-block:: c

  foreach_ssa_src_n(src, i, instr) {
    unsigned d = delay_calc_srcn(ctx, src, instr, i);
    delay = MAX2(delay, d);
  }
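
To make the ``regs[0]`` / ``regs[1..n]`` layout and the iteration helpers more
concrete, here is a self-contained sketch.  All ``toy_`` names, the flag value
and the fixed-size ``regs`` array are assumptions of this example; the macro is
a simplified imitation of ``foreach_src_n``, not its actual definition.

.. code-block:: c

  #include <assert.h>
  #include <stddef.h>

  #define TOY_REG_SSA 0x1u   /* hypothetical stand-in for IR3_REG_SSA */

  struct toy_instruction;

  struct toy_register {
     unsigned flags;
     struct toy_instruction *instr;   /* producer, valid if TOY_REG_SSA */
  };

  struct toy_instruction {
     unsigned regs_count;             /* dst + srcs */
     struct toy_register *regs[4];    /* regs[0] = dst, regs[1..n] = srcs */
  };

  /* Simplified take on foreach_src_n(): visit regs[1..n], with `n`
   * counting sources from 0. */
  #define toy_foreach_src_n(srcreg, n, instr)                    \
     for (unsigned n = 0;                                        \
          (n) + 1 < (instr)->regs_count &&                       \
          ((srcreg) = (instr)->regs[(n) + 1], 1);                \
          (n)++)

  /* Count how many sources of `instr` are still in SSA form. */
  static unsigned
  toy_count_ssa_srcs(struct toy_instruction *instr)
  {
     struct toy_register *src;
     unsigned count = 0;

     assert(instr->regs_count <= 4);
     toy_foreach_src_n(src, i, instr) {
        if (src->flags & TOY_REG_SSA)
           count++;
     }
     return count;
  }

For an instruction with one SSA source and one const source this returns 1;
the destination in ``regs[0]`` is never visited.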


TODO probably other helper/util stuff worth mentioning here

.. _meta:

Meta Instructions
~~~~~~~~~~~~~~~~~

**input**
    Used for shader inputs (registers configured in the command-stream
    to hold particular input values, written by the shader core before
    start of execution).  Also used for connecting up values within a
    basic block to an output of a previous block.

**output**
    Used to hold outputs of a basic block.

**flow**
    TODO

**phi**
    TODO

**collect**
    Groups registers which need to be assigned to consecutive scalar
    registers, for example ``sam`` (texture fetch) src instructions (see
    `register groups`_) or array element dereference
    (see `relative addressing`_).

**split**
    The counterpart to **collect**: when an instruction such as ``sam``
    writes multiple components, splits the result into individual
    scalar components to be consumed by other instructions.


.. _`flow control`:

Flow Control
~~~~~~~~~~~~

TODO


.. _`register groups`:

Register Groups
~~~~~~~~~~~~~~~

Certain instructions, such as texture sample instructions, consume multiple consecutive scalar registers via a single src register encoded in the instruction, and/or write multiple consecutive scalar registers.  In the simplest example:

::

  sam (f32)(xyz)r2.x, r0.z, s#0, t#0

for a 2d texture, would read ``r0.zw`` to get the coordinate, and write ``r2.xyz``.

Before register assignment, to group the two components of the texture src together:

.. graphviz::

  digraph G {
    { rank=same;
      collect;
    };
    { rank=same;
      coord_x;
      coord_y;
    };
    sam -> collect [label="regs[1]"];
    collect -> coord_x [label="regs[1]"];
    collect -> coord_y [label="regs[2]"];
    coord_x -> coord_y [label="right",style=dotted];
    coord_y -> coord_x [label="left",style=dotted];
    coord_x [label="coord.x"];
    coord_y [label="coord.y"];
  }

The frontend sets up the SSA ptrs from ``sam`` source register to the ``collect`` meta instruction, which in turn points to the instructions producing the ``coord.x`` and ``coord.y`` values.  And the grouping_ pass sets up the ``left`` and ``right`` neighbor pointers to the ``collect``\'s sources, used later by the `register assignment`_ pass to assign blocks of scalar registers.

And likewise, for the consecutive scalar registers for the destination:

.. graphviz::

  digraph {
    { rank=same;
      A;
      B;
      C;
    };
    { rank=same;
      split_0;
      split_1;
      split_2;
    };
    A -> split_0;
    B -> split_1;
    C -> split_2;
    split_0 [label="split\noff=0"];
    split_0 -> sam;
    split_1 [label="split\noff=1"];
    split_1 -> sam;
    split_2 [label="split\noff=2"];
    split_2 -> sam;
    split_0 -> split_1 [label="right",style=dotted];
    split_1 -> split_0 [label="left",style=dotted];
    split_1 -> split_2 [label="right",style=dotted];
    split_2 -> split_1 [label="left",style=dotted];
    sam;
  }
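
How the ``left`` / ``right`` neighbor pointers translate into a block of
consecutive registers can be sketched as follows.  This is a toy model: the
``toy_`` names and the idea of numbering scalar registers with plain integers
are assumptions of the example, and the real register assignment pass is not
described in these notes.

.. code-block:: c

  #include <assert.h>
  #include <stddef.h>

  /* Hypothetical node carrying only what the sketch needs. */
  struct toy_node {
     struct toy_node *left, *right;   /* neighbor pointers */
     int reg;                         /* assigned scalar register, -1 if none */
  };

  /* Link `n` nodes so that nodes[i] sits directly left of nodes[i + 1]. */
  static void
  toy_group(struct toy_node **nodes, size_t n)
  {
     for (size_t i = 0; i < n; i++) {
        nodes[i]->left = i > 0 ? nodes[i - 1] : NULL;
        nodes[i]->right = i + 1 < n ? nodes[i + 1] : NULL;
     }
  }

  /* Assign consecutive scalar registers to the whole group that `node`
   * belongs to, starting at `base` for the leftmost member. */
  static void
  toy_assign_group(struct toy_node *node, int base)
  {
     assert(node != NULL);
     while (node->left)
        node = node->left;
     for (; node; node = node->right)
        node->reg = base++;
  }

Grouping ``coord.x`` and ``coord.y`` and assigning from base 2 gives them
registers 2 and 3, which would correspond to ``r0.z`` and ``r0.w`` if scalar
registers are numbered four per vec4 (again an assumption of the sketch).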

.. _`relative addressing`:

Relative Addressing
~~~~~~~~~~~~~~~~~~~

Most instructions support addressing indirectly (relative to the address register) into the const or gpr register file in some or all of their src/dst registers.  In this case the register accessed is taken from ``r<a0.x + n>`` or ``c<a0.x + n>``, i.e. address register (``a0.x``) value plus ``n``, where ``n`` is encoded in the instruction (rather than the absolute register number).

    Note that cat5 (texture sample) instructions are the notable exception, not
    supporting relative addressing of src or dst.
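
In other words, the effective register is simply the address register value
plus the encoded offset.  A minimal sketch, modelling a register file as a
plain C array (the function name and the bounds check are assumptions of this
example):

.. code-block:: c

  #include <assert.h>

  /* Illustrative only: read r<a0.x + n> or c<a0.x + n> from a register
   * file modelled as an array of `file_size` scalar entries. */
  static float
  toy_read_relative(const float *file, int file_size, int a0_x, int n)
  {
     int idx = a0_x + n;   /* address register value plus encoded offset */
     assert(idx >= 0 && idx < file_size);
     return file[idx];
  }

With ``a0.x`` holding 3, a source written as ``c<a0.x + 2>`` reads the sixth
scalar entry (index 5) of the const file.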

Relative addressing of the const file (for example, a uniform array) is relatively simple.  We don't do register assignment of the const file, so all that is required is to schedule things properly.  I.e. the instruction that writes the address register must be scheduled first, and we cannot have two different address register values live at one time.

But relative addressing of the gpr file (which can be as src or dst) has additional restrictions on register assignment (i.e. the array elements must be assigned to consecutive scalar registers).  And in the case of relative dst, subsequent instructions now depend on both the relative write and the previous instruction which wrote that register, since we do not know at compile time which actual register was written.

Each instruction has an optional ``address`` pointer, to capture the dependency on the address register value when relative addressing is used for any of the src/dst register(s).  This behaves as an additional virtual src register, i.e. ``foreach_ssa_src()`` will also iterate the address register (last).

    Note that ``nop``\'s for timing constraints, type specifiers (i.e.
    ``add.f`` vs ``add.u``), etc, are omitted for brevity in the examples.

::

  mova a0.x, hr1.y
  sub r1.y, r2.x, r3.x
  add r0.x, r1.y, c<a0.x + 2>

results in:

.. graphviz::

  digraph {
    rankdir=LR;
    sub;
    const [label="const file"];
    add;
    mova;
    add -> mova;
    add -> sub;
    add -> const [label="off=2"];
  }

The scheduling pass has some smarts to schedule things such that only a single ``a0.x`` value is used at any one time.

To implement variable arrays, the NIR registers are stored as an ``ir3_array``,
which will be register allocated to consecutive hardware registers.  The array
access uses the id field in the ``ir3_register`` to map to the array being
accessed, and the offset field for the fixed offset within the array.  A NIR
indirect register read such as:

::

  decl_reg vec2 32 r0[2]
  ...
  vec2 32 ssa_19 = mov r0[0 + ssa_9]


results in:

::

  0000:0000:001:  shl.b hssa_19, hssa_17, himm[0.000000,1,0x1]
  0000:0000:002:  mov.s16s16 hr61.x, hssa_19
  0000:0000:002:  mov.u32u32 ssa_21, arr[id=1, offset=0, size=4, ssa_12], address=_[0000:0000:002:  mov.s16s16]
  0000:0000:002:  mov.u32u32 ssa_22, arr[id=1, offset=1, size=4, ssa_12], address=_[0000:0000:002:  mov.s16s16]


Array writes write to the array in ``instr->regs[0]->array.id``.  A NIR indirect
register write such as:

::

  decl_reg vec2 32 r0[2]
  ...
  r0[0 + ssa_12] = mov ssa_13

results in:

::

  0000:0000:001:  shl.b hssa_29, hssa_27, himm[0.000000,1,0x1]
  0000:0000:002:  mov.s16s16 hr61.x, hssa_29
  0000:0000:001:  mov.u32u32 arr[id=1, offset=0, size=4, ssa_17], c2.y, address=_[0000:0000:002:  mov.s16s16]
  0000:0000:004:  mov.u32u32 arr[id=1, offset=1, size=4, ssa_31], c2.z, address=_[0000:0000:002:  mov.s16s16]

Note that only cat1 (mov) can do indirect write, and thus NIR register stores
may need to introduce an extra mov.

ir3 array accesses in the DAG get serialized via ``instr->barrier_class``,
which contains ``IR3_BARRIER_ARRAY_W`` or ``IR3_BARRIER_ARRAY_R``.
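
One way to picture how such barrier classes keep array accesses ordered is the
sketch below.  The flag values, the ``toy_`` names and the conflict rule
(reads only need ordering against writes, writes against everything) are
assumptions made for illustration; the notes above only state that the classes
serialize array accesses.

.. code-block:: c

  #include <assert.h>
  #include <stdbool.h>

  /* Hypothetical flag values; only the names mirror the barrier classes
   * mentioned above. */
  #define TOY_BARRIER_ARRAY_R 0x1u
  #define TOY_BARRIER_ARRAY_W 0x2u

  /* Classes that an access of class `barrier_class` must stay ordered
   * against (assumption: reads only conflict with writes). */
  static unsigned
  toy_barrier_conflict(unsigned barrier_class)
  {
     unsigned conflict = 0;
     if (barrier_class & TOY_BARRIER_ARRAY_R)
        conflict |= TOY_BARRIER_ARRAY_W;
     if (barrier_class & TOY_BARRIER_ARRAY_W)
        conflict |= TOY_BARRIER_ARRAY_R | TOY_BARRIER_ARRAY_W;
     return conflict;
  }

  /* True if `later` must not be reordered ahead of `earlier`. */
  static bool
  toy_must_serialize(unsigned earlier, unsigned later)
  {
     assert(earlier != 0 && later != 0);
     return (toy_barrier_conflict(later) & earlier) != 0;
  }

Under this rule two array reads may be freely reordered, while any pair
involving a write keeps its original order.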

Shader Passes
-------------

After the frontend has generated the use-def graph of instructions, they are run through various passes which include scheduling_ and `register assignment`_.  Because inserting ``mov`` instructions after scheduling would also require inserting additional ``nop`` instructions (since it is too late to reschedule to try and fill the bubbles), the earlier stages try to ensure (at least given an infinite supply of registers) that `register assignment`_ after scheduling_ cannot fail.

    Note that we essentially have ~256 scalar registers in the
    architecture (although larger register usage will at some thresholds
    limit the number of threads which can run in parallel).  And at some
    point we will have to deal with spilling.

.. _flatten:

Flatten
~~~~~~~

In this stage, simple if/else blocks are flattened into a single block with ``phi`` nodes converted into ``sel`` instructions.  The a3xx ISA has very few predicated instructions, and we would prefer not to use branches for simple if/else.
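
The effect of flattening can be illustrated with ordinary C.  Both functions
below are made-up examples (not compiler code): the first uses a branch, the
second computes both sides unconditionally and picks the result with a select,
which is the role the ``sel`` instruction plays after this pass.

.. code-block:: c

  /* Original control flow: only one side is evaluated. */
  static float
  toy_branchy(int cond, float a, float b)
  {
     float r;
     if (cond)
        r = a * 2.0f;
     else
        r = b + 1.0f;
     return r;
  }

  /* Flattened form: both sides are computed unconditionally and the phi
   * becomes a select on the condition. */
  static float
  toy_flattened(int cond, float a, float b)
  {
     float then_val = a * 2.0f;
     float else_val = b + 1.0f;
     return cond ? then_val : else_val;   /* models the sel instruction */
  }

The two forms return the same value for any input; the flattened one trades a
few extra ALU instructions for the absence of a branch.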


.. _`copy propagation`:

Copy Propagation
~~~~~~~~~~~~~~~~

Currently the frontend inserts ``mov``\s in various cases, because certain categories of instructions have limitations about const regs as sources.  And the CP pass simply removes all simple ``mov``\s (i.e. src-type is same as dst-type, no abs/neg flags, etc).

The eventual plan is to invert that, with the frontend inserting no ``mov``\s and the CP pass legalizing things.
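
The notion of a "simple" ``mov`` can be sketched as a predicate plus a loop
that skips over such instructions.  The types, flag values and function names
below are hypothetical and much reduced; they only model the criteria named
above (same src and dst type, no abs/neg modifiers).

.. code-block:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stddef.h>

  enum toy_type { TOY_F32, TOY_U32 };
  enum toy_opc { TOY_MOV, TOY_ADD };

  #define TOY_REG_ABS 0x1u   /* hypothetical src modifier flags */
  #define TOY_REG_NEG 0x2u

  struct toy_instr {
     enum toy_opc opc;
     enum toy_type src_type, dst_type;   /* only meaningful for mov */
     unsigned src_flags;                 /* modifiers on the first source */
     struct toy_instr *src;              /* SSA producer of the first source */
  };

  /* A mov is "simple" if it neither converts types nor applies modifiers. */
  static bool
  toy_is_simple_mov(const struct toy_instr *instr)
  {
     return instr->opc == TOY_MOV &&
            instr->src_type == instr->dst_type &&
            !(instr->src_flags & (TOY_REG_ABS | TOY_REG_NEG));
  }

  /* Follow a chain of simple movs back to the instruction that really
   * produced the value. */
  static struct toy_instr *
  toy_copy_propagate(struct toy_instr *instr)
  {
     assert(instr != NULL);
     while (toy_is_simple_mov(instr) && instr->src)
        instr = instr->src;
     return instr;
  }

A consumer would then point its SSA source at the returned instruction instead
of at the ``mov``.  A ``mov`` that negates or converts its source is left in
place.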


.. _grouping:

Grouping
~~~~~~~~

In the grouping pass, instructions which need to be grouped (for ``collect``\s, etc) have their ``left`` / ``right`` neighbor pointers set up.  In cases where there is a conflict (i.e. one instruction cannot have two unique left or right neighbors), an additional ``mov`` instruction is inserted.  This ensures that there is some possible valid `register assignment`_ at the later stages.
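
The conflict check can be sketched as below.  ``toy_member`` and
``toy_try_pair`` are hypothetical names for this example; the function only
detects the conflict and leaves the insertion of the extra ``mov`` to the
caller.

.. code-block:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stddef.h>

  struct toy_member {
     struct toy_member *left, *right;   /* neighbor pointers */
  };

  /* Try to record that `l` must sit directly left of `r`.  Returns false
   * when either node already has a different neighbor on that side, which
   * is the situation where the pass would insert an extra mov. */
  static bool
  toy_try_pair(struct toy_member *l, struct toy_member *r)
  {
     assert(l != NULL && r != NULL);
     if ((l->right && l->right != r) || (r->left && r->left != l))
        return false;
     l->right = r;
     r->left = l;
     return true;
  }

If a value is already the left neighbor of one instruction, asking for it to
also be the left neighbor of another fails, and a copy of the value would be
grouped instead.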


.. _depth:

Depth
~~~~~

In the depth pass, a depth is calculated for each instruction node within its basic block.  The depth is the sum of the required cycles (delay slots needed between two instructions plus one) of each instruction plus the max depth of any of its source instructions.  (meta_ instructions don't add to the depth).  As an instruction's depth is calculated, it is inserted into a per block list sorted by deepest instruction.  Unreachable instructions and inputs are marked.

    TODO: we should probably calculate both hard and soft depths (?) to
    try to coax additional instructions to fit in places where we need
    to use sync bits, such as after a texture fetch or SFU.
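
One reading of the depth calculation, written out as a recursive function.
This is a sketch under simplifying assumptions: the ``toy_`` types are
hypothetical, each producer has a single fixed delay regardless of which
source slot consumes it, and neither memoization nor the sorted per-block list
is modelled.

.. code-block:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stddef.h>

  #define TOY_MAX_SRCS 3

  struct toy_instr {
     bool is_meta;                          /* meta instructions add no depth */
     unsigned delay;                        /* delay slots a consumer must wait */
     unsigned nsrcs;
     struct toy_instr *srcs[TOY_MAX_SRCS];  /* SSA producers */
  };

  /* Depth = deepest (source depth + that source's delay slots), plus one
   * cycle for the instruction itself unless it is a meta instruction. */
  static unsigned
  toy_depth(const struct toy_instr *instr)
  {
     unsigned depth = 0;

     assert(instr->nsrcs <= TOY_MAX_SRCS);
     for (unsigned i = 0; i < instr->nsrcs; i++) {
        const struct toy_instr *src = instr->srcs[i];
        unsigned d = toy_depth(src) + src->delay;
        if (d > depth)
           depth = d;
     }
     return instr->is_meta ? depth : depth + 1;
  }

For a ``mul`` fed only by inputs the depth is 1.  For a ``mad`` that consumes
that ``mul`` with three delay slots it is 1 + 3 + 1 = 5.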

.. _scheduling:

Scheduling
~~~~~~~~~~

After the grouping_ pass, there are no more instructions to insert or remove.  Start scheduling each basic block from the deepest node in the depth sorted list created by the depth_ pass, recursively trying to schedule each instruction after its source instructions plus delay slots.  Insert ``nop``\s as required.
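
The ``nop`` insertion part can be modelled in isolation.  The sketch below does
not reorder anything and does not walk the depth sorted list; it takes
instructions in an already valid order, each with at most one producer, and
only works out the issue slots and the number of ``nop``\s required.  All names
are hypothetical.

.. code-block:: c

  #include <assert.h>
  #include <stddef.h>

  struct toy_instr {
     int src;          /* index of the producing instruction, or -1 */
     unsigned delay;   /* delay slots needed after the producer */
  };

  /* Walk `instrs` in order, assigning each an issue slot in `slot[]` such
   * that at least `delay` slots lie between it and its producer.
   * Returns the total number of nops that had to be inserted. */
  static unsigned
  toy_schedule(const struct toy_instr *instrs, size_t n, unsigned *slot)
  {
     unsigned next = 0, nops = 0;

     for (size_t i = 0; i < n; i++) {
        unsigned earliest = next;
        if (instrs[i].src >= 0) {
           assert((size_t)instrs[i].src < i);
           unsigned ready = slot[instrs[i].src] + instrs[i].delay + 1;
           if (ready > earliest)
              earliest = ready;
        }
        nops += earliest - next;   /* bubbles that nothing else filled */
        slot[i] = earliest;
        next = earliest + 1;
     }
     return nops;
  }

For a strictly serial chain of four instructions (such as the ``mul``,
``mad``, ``mad``, ``mov`` sequence of the DP3 example), and assuming three
delay slots on every edge, the instructions land in slots 0, 4, 8 and 12 with
nine ``nop``\s in between.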

.. _`register assignment`:

Register Assignment
~~~~~~~~~~~~~~~~~~~

TODO