<!--
  Copyright (C) 2021 Collabora Ltd.

  Permission is hereby granted, free of charge, to any person obtaining a
  copy of this software and associated documentation files (the "Software"),
  to deal in the Software without restriction, including without limitation
  the rights to use, copy, modify, merge, publish, distribute, sublicense,
  and/or sell copies of the Software, and to permit persons to whom the
  Software is furnished to do so, subject to the following conditions:

  The above copyright notice and this permission notice (including the next
  paragraph) shall be included in all copies or substantial portions of the
  Software.

  THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
  IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
  FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
  THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
  LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
  OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  SOFTWARE.
-->

<valhall>
  <lut name="Immediates">
    <desc>
      These immediates are accessible in (almost) any instruction, provided the
      immediate mode is kept to the default. They optimize for the most common
      immediate values; any immediate listed here may be used without taking up
      a uniform slot or a register. Most integer instructions can access
      separate half-words and individual bytes via swizzles on the source.
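<!--
Editorial sketch, not part of the ISA description: one way to see how a single
32-bit immediate serves as a vector of bytes or half-words. It assumes lane 0
is the least significant byte or half, which matches how the table pairs
0x03020100 with the integers (0, 1, 2, 3). The helper names are hypothetical.

```python
import struct

def byte_lanes(word):
    """Split a 32-bit word into four unsigned bytes, lane 0 first."""
    return tuple((word >> (8 * lane)) & 0xFF for lane in range(4))

def half_lanes(word):
    """Split a 32-bit word into two 16-bit halves, lane 0 first."""
    return (word & 0xFFFF, (word >> 16) & 0xFFFF)

def half_to_float(bits):
    """Reinterpret 16 bits as an IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

print(byte_lanes(0x03020100))
print([half_to_float(h) for h in half_lanes(0x5C005BF8)])
```

The second print decodes the entry described as the half-float pair
(255.0, 256.0).
-->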
    </desc>
    <constant desc="Zero">0x00000000</constant>
    <constant desc="All ones; integer $-1$">0xFFFFFFFF</constant>
    <constant desc="Maximum integer; floating-point NaN">0x7FFFFFFF</constant>
    <constant desc="Integers $(-2, -3, -4, -5)$">0xFBFCFDFE</constant>
    <constant desc="16-bit integer $2^8$">0x01000000</constant>
    <constant desc="Multiples of 16 $(0, 32, 0, 128)$">0x80002000</constant>
    <constant desc="Multiples of 16 $(48, 80, 96, 112)$">0x70605030</constant>
    <constant desc="Multiples of 16 $(144, 160, 176, 192)$">0xC0B0A090</constant>
    <constant desc="Integers $(0, 1, 2, 3)$">0x03020100</constant>
    <constant desc="Integers $(4, 5, 6, 7)$">0x07060504</constant>
    <constant desc="Integers $(8, 9, 10, 11)$">0x0B0A0908</constant>
    <constant desc="Integers $(12, 13, 14, 15)$">0x0F0E0D0C</constant>
    <constant desc="Integers $(16, 17, 18, 19)$">0x13121110</constant>
    <constant desc="Integers $(20, 21, 22, 23)$">0x17161514</constant>
    <constant desc="Integers $(24, 25, 26, 27)$">0x1B1A1918</constant>
    <constant desc="Integers $(28, 29, 30, 31)$">0x1F1E1D1C</constant>
    <constant desc="Float $1.0$">0x3F800000</constant>
    <constant desc="Float $0.1$">0x3DCCCCCD</constant>
    <constant desc="Float $1 / \pi$">0x3EA2F983</constant>
    <constant desc="Float $\log(2)$">0x3F317218</constant>
    <constant desc="Float $\pi$">0x40490FDB</constant>
    <constant desc="Float $0.0$">0x00000000</constant>
    <constant desc="Float $65535.0 = 2^{16} - 1$">0x477FFF00</constant>
    <constant desc="Half-float $(255.0, 256.0) = (2^8 - 1, 2^8)$">0x5C005BF8</constant>
    <constant desc="Half-float $0.1 = 1 / 10$">0x2E660000</constant>
    <constant desc="Half-float $0.25 = 2^{-2}$">0x34000000</constant>
    <constant desc="Half-float $0.5 = 2^{-1}$">0x38000000</constant>
    <constant desc="Half-float $1.0 = 2^0$">0x3C000000</constant>
    <constant desc="Half-float $2.0 = 2^1$">0x40000000</constant>
    <constant desc="Half-float $4.0 =
2^2$">0x44000000</constant>
    <constant desc="Half-float $8.0 = 2^3$">0x48000000</constant>
    <constant desc="Half-float $\pi$">0x42480000</constant>
  </lut>

  <enum name="Flow">
    <desc>
      Every Valhall instruction can wait on dependency
      slots. A few special flows are available, specified in the instruction
      metadata from this enum. The `wait0126` flow is required to wait on
      dependency slot #6 and should be set on the instruction immediately
      preceding `ATEST`. The `wait` flow should be set for barriers.
      The `discard` flow only applies to fragment shaders and is used to
      terminate helper invocations; it should be set as early as possible after
      helper invocations are no longer needed, as determined by data flow
      analysis. The `end` flow is used to terminate the shader, although it
      may be overloaded by the `BLEND` instruction.

      The `reconverge` flow is required on any instruction immediately
      preceding a possible change to the mask of active threads in a subgroup.
      This includes all divergent branches, but it also includes the final
      instruction at the end of any basic block where the immediate successor
      (fallthrough) is the target of a divergent branch.
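<!--
Editorial sketch, not part of the ISA description: how a compiler might pick a
wait flow for a set of dependency slots. It assumes the values of this enum are
numbered from zero in file order, so that positions 0 to 7 behave as a bitmask
over slots 0, 1 and 2. It also assumes that waiting on extra slots is merely
conservative. The function name `wait_flow` is hypothetical.

```python
# Flow names in file order; None marks a reserved encoding.
FLOWS = [
    "none", "wait0", "wait1", "wait01", "wait2", "wait02", "wait12",
    "wait012", "wait0126", "wait", "reconverge", None, None, "discard",
    None, "end",
]

def wait_flow(slots):
    """Pick the narrowest listed flow that waits on every slot in `slots`."""
    slots = set(slots)
    if slots.issubset({0, 1, 2}):
        # Assumed: positions 0..7 act as a bitmask over slots 0, 1 and 2.
        return FLOWS[sum(1 << s for s in slots)]
    if slots.issubset({0, 1, 2, 6}):
        return "wait0126"
    if slots.issubset({0, 1, 2, 6, 7}):
        return "wait"
    raise ValueError("no flow waits on slots %r" % sorted(slots))

print(wait_flow({0, 2}))
print(wait_flow({6}))
```

Slot #6 has no dedicated flow in the list, so the sketch falls back to
`wait0126` for it.
-->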
    </desc>
    <value name="None" default="true">none</value>
    <value name="Wait on slot 0">wait0</value>
    <value name="Wait on slot 1">wait1</value>
    <value name="Wait on slots 0, 1">wait01</value>
    <value name="Wait on slot 2">wait2</value>
    <value name="Wait on slots 0, 2">wait02</value>
    <value name="Wait on slots 1, 2">wait12</value>
    <value name="Wait on slots 0, 1, 2">wait012</value>
    <value name="Wait on slots 0, 1, 2, 6">wait0126</value>
    <value name="Wait on slots 0, 1, 2, 6, 7">wait</value>
    <value name="Perform branch reconverge">reconverge</value>
    <reserved/>
    <reserved/>
    <value name="Terminate discarded threads">discard</value>
    <reserved/>
    <value name="Return from shader">end</value>
  </enum>

  <enum name="FAU special page 0">
    <desc>
      Situated between the immediates hard-coded in the hardware and the
      uniforms defined purely in software, Valhall has some special
      "constants" passing through data structures. These are encoded like the
      table of immediates, as if special constant $i$ were lookup table entry
      $32 + i$.
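<!--
Editorial sketch, not part of the ISA description: the "entry 32 + i" rule
applied to this page. It assumes special constant i is the position of the
value in this enum, counted from zero with reserved entries included. How a
page is selected is not described here and is left out. The names `PAGE0` and
`table_entry` are hypothetical.

```python
# Values of "FAU special page 0" in file order; None marks a reserved entry.
PAGE0 = [
    None, None, "warp_id", None, "framebuffer_size", "atest_datum",
    "sample", None,
] + ["blend_descriptor_%d" % i for i in range(8)]

def table_entry(name, page=PAGE0):
    """Return the lookup table entry for a named special constant."""
    return 32 + page.index(name)

print(table_entry("warp_id"))
print(table_entry("blend_descriptor_0"))
```
-->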
    </desc>
    <reserved/>
    <reserved/>
    <value desc="Warp ID and warps/core - 1">warp_id</value>
    <reserved/>
    <value desc="Bounding box maximum X/Y">framebuffer_size</value>
    <value desc="ATEST datum">atest_datum</value>
    <value desc="Sample positions">sample</value>
    <reserved/>
    <value desc="Blend descriptor 0">blend_descriptor_0</value>
    <value desc="Blend descriptor 1">blend_descriptor_1</value>
    <value desc="Blend descriptor 2">blend_descriptor_2</value>
    <value desc="Blend descriptor 3">blend_descriptor_3</value>
    <value desc="Blend descriptor 4">blend_descriptor_4</value>
    <value desc="Blend descriptor 5">blend_descriptor_5</value>
    <value desc="Blend descriptor 6">blend_descriptor_6</value>
    <value desc="Blend descriptor 7">blend_descriptor_7</value>
  </enum>

  <enum name="FAU special page 1">
    <desc>
      Situated between the immediates hard-coded in the hardware and the
      uniforms defined purely in software, Valhall has some special
      "constants" passing through data structures. These are encoded like the
      table of immediates, as if special constant $i$ were lookup table entry
      $32 + i$.
    </desc>
    <reserved/>
    <value desc="Thread local storage base pointer">thread_local_pointer</value>
    <reserved/>
    <value desc="Workgroup local storage base pointer">workgroup_local_pointer</value>
    <reserved/>
    <reserved/>
    <reserved/>
    <value desc="Shader resource table base pointer">resource_table_pointer</value>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
  </enum>

  <enum name="FAU special page 3">
    <desc>
      Situated between the immediates hard-coded in the hardware and the
      uniforms defined purely in software, Valhall has some special
      "constants" passing through data structures.
      These are encoded like the
      table of immediates, as if special constant $i$ were lookup table entry
      $32 + i$.
    </desc>
    <reserved/>
    <value desc="Lane ID">lane_id</value>
    <reserved/>
    <value desc="Core ID">core_id</value>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <value desc="Program counter">program_counter</value>
  </enum>

  <enum name="Swizzles (8-bit)">
    <value default="true">b0123</value>
    <value>b3210</value>
    <value>b0101</value>
    <value>b2323</value>
    <value>b0000</value>
    <value>b1111</value>
    <value>b2222</value>
    <value>b3333</value>
    <value>b2301</value>
    <value>b1032</value>
    <value>b0011</value>
    <value>b2233</value>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
  </enum>

  <enum name="Lanes (8-bit)">
    <desc>Used to select the 2 bytes for shifts of 16-bit vectors</desc>
    <value>b02</value>
    <reserved/>
    <reserved/>
    <reserved/>
    <value>b00</value>
    <value>b11</value>
    <value>b22</value>
    <value>b33</value>
    <reserved/>
    <reserved/>
    <value>b01</value>
    <value>b23</value>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
  </enum>

  <enum name="Half-swizzles (8-bit)">
    <desc>
      Used to select the 2 bytes to convert for conversions from 8-bit vectors
      to 16-bit vectors
    </desc>
    <value>b00</value>
    <value>b10</value>
    <value>b20</value>
    <value>b30</value>
    <value>b01</value>
    <value>b11</value>
    <value>b21</value>
    <value>b31</value>
    <value>b02</value>
    <value>b12</value>
    <value>b22</value>
    <value>b32</value>
    <value>b03</value>
    <value>b13</value>
    <value>b23</value>
    <value>b33</value>
  </enum>

  <enum name="Swizzles (16-bit)">
    <value>h00</value> <!-- 0,2 -->
    <value>h10</value>
    <value default="true">h01</value>
    <value>h11</value>
    <value>b00</value> <!-- 0,0 -->
    <value>b20</value> <!-- 1,1 -->
    <value>b02</value> <!-- 2,2 -->
    <value>b22</value> <!-- 3,3 -->
    <value>b11</value>
    <value>b31</value>
    <value>b13</value> <!-- 0,1 -->
    <value>b33</value> <!-- 2,3 -->
    <value>b01</value>
    <value>b23</value>
    <reserved/>
    <reserved/>
  </enum>

  <enum name="Swizzles (32-bit)">
    <value default="true">none</value>
    <reserved/>
    <value>h0</value>
    <value>h1</value>
    <value>b0</value>
    <value>b1</value>
    <value>b2</value>
    <value>b3</value>
  </enum>

  <enum name="Swizzles (64-bit)">
    <value default="true">none</value>
    <reserved/>
    <value>h0</value>
    <value>h1</value>
    <value>b0</value>
    <value>b1</value>
    <value>b2</value>
    <value>b3</value>
    <value>w0</value>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
  </enum>

  <enum name="Lane (8-bit)" implied="true">
    <value>b0</value>
    <value>b1</value>
    <value>b2</value>
    <value>b3</value>
  </enum>

  <enum name="Combine">
    <desc>
      Used for the lane select of `BRANCHZ`. To use an 8-bit condition, a
      separate `ICMP` is required to cast to 16-bit.
    </desc>
    <value default="true">none</value>
    <value>h0</value>
    <value>h1</value>
    <value>and</value>
    <value>lowbits</value>
  </enum>

  <enum name="Lane (16-bit)" implied="true">
    <value>h0</value>
    <value>h1</value>
  </enum>

  <enum name="Load lane (8-bit)">
    <value default="true">b0</value>
    <value>b1</value>
    <value>b2</value>
    <value>b3</value>
    <value desc="Zero-extend to 16-bit, low-half">h0</value>
    <value desc="Zero-extend to 16-bit, high-half">h1</value>
    <value desc="Zero-extend to 32-bit">w0</value>
    <value desc="Zero-extend to 64-bit">d0</value>
  </enum>

  <enum name="Load lane (16-bit)">
    <value desc="Low half" default="true">h0</value>
    <value desc="High half">h1</value>
    <value desc="Zero-extend to 32-bit">w0</value>
    <value desc="Zero-extend to 64-bit">d0</value>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
  </enum>

  <enum name="Load lane (24-bit)" implied="true">
    <value default="true">identity</value>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
  </enum>

  <enum name="Load lane (32-bit)">
    <value default="true">w0</value>
    <value desc="Zero-extend to 64-bit">d0</value>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
  </enum>

  <enum name="Load lane (48-bit)">
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <value default="true">identity</value>
    <reserved/>
    <reserved/>
    <reserved/>
  </enum>

  <enum name="Load lane (64-bit)">
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <value default="true">identity</value>
  </enum>

  <enum name="Load lane (96-bit)">
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <value default="true">identity</value>
    <reserved/>
  </enum>

  <enum name="Load lane (128-bit)">
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <value default="true">identity</value>
  </enum>

  <enum name="Round mode">
    <desc>Corresponds to IEEE 754 rounding modes</desc>
    <value desc="Round to nearest even" default="true">rte</value>
    <value desc="Round to positive infinity">rtp</value>
    <value desc="Round to negative infinity">rtn</value>
    <value desc="Round to zero">rtz</value>
  </enum>

  <enum name="Result type">
    <desc>
      Comparison instructions like `FCMP` return a boolean but may encode this
      boolean in a variety of ways. `i1` gives an OpenGL-style `0/1` boolean.
      `m1` gives a Direct3D-style `0/~0` boolean. `f1` gives a floating-point
      `0.0f / 1.0f` boolean. Switching between these modes is useful to fold a
      boolean type convert into a comparison. `u1` is used internally to
      implement 64-bit comparisons.
    </desc>
    <value desc="Integer 1">i1</value>
    <value desc="Float 1">f1</value>
    <value desc="Minus 1">m1</value>
    <value desc="Low half of 64-bit compare">u1</value>
  </enum>

  <enum name="Widen">
    <value default="true">none</value>
    <value>h0</value>
    <value>h1</value>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
  </enum>

  <enum name="Clamp">
    <desc>
      Clamp applied to the destination of a floating-point instruction. Note the
      clamps may be decomposed as two independent bits for `clamp_0_inf` and
      `clamp_m1_1`, with `clamp_0_1` arising as the composition of `clamp_0_inf`
      and `clamp_m1_1` in either order.

      Clamps are implemented per the SPIR-V specification:

      $$\text{clamp} \; (x, \ell, h) = \min( \max( x, \ell ), h)$$

      The min/max functions return the other operand if one operand is NaN, and
      compare $-0 &lt; +0$.
      That means the following identities hold for Valhall
      clamps:

      \begin{align*}
      \text{clamp}(-0.0, 0.0, 1.0) &amp; = +0.0 \\
      \text{clamp}(-\text{NaN}, 0.0, 1.0) &amp; = +0.0 \\
      \text{clamp}(\text{NaN}, 0.0, 1.0) &amp; = +0.0 \\
      &amp; \\
      \text{clamp}(-0.0, -1.0, 1.0) &amp; = -0.0 \\
      \text{clamp}(\text{NaN}, -1.0, 1.0) &amp; = -1.0 \\
      \text{clamp}(-\text{NaN}, -1.0, 1.0) &amp; = -1.0 \\
      &amp; \\
      \max(\text{NaN}, 0.0) &amp; = +0.0 \\
      \max(-\text{NaN}, 0.0) &amp; = +0.0 \\
      \max(-0.0, 0.0) &amp; = +0.0 \\
      \end{align*}

      This behaviour is consistent with the FMin/FMax/FClamp and
      NMin/NMax/NClamp rules prescribed by SPIR-V and governed by IEEE 754. As
      a consequence, substituting these clamps for equivalent minimum/maximum
      expressions is legal even with strict floating point rules.
    </desc>
    <value default="true" desc="Identity">none</value>
    <value desc="Clamp positive">clamp_0_inf</value>
    <value desc="Clamp to $[-1, 1]$">clamp_m1_1</value>
    <value desc="Clamp to $[0, 1]$">clamp_0_1</value>
  </enum>

  <enum name="Condition">
    <desc>
      Condition code. Type must be inferred from the instruction. IEEE 754 total
      ordering only applies to floating point compares. "Not equal" and "greater
      than or less than" are distinguished by NaN behaviour conforming to
      the IEEE 754 specification.
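<!--
Editorial sketch, not part of the ISA description: the NaN distinction between
"not equal" and "greater than or less than", modelled with Python floats. It
assumes `ne` is the IEEE 754 unordered-or-unequal predicate and `gtlt` is the
ordered-and-unequal predicate. The function names are hypothetical and this is
not a model of the hardware.

```python
import math

def cond_ne(a, b):
    """Not equal: true when either operand is NaN."""
    return a != b

def cond_gtlt(a, b):
    """Greater than or less than: false when either operand is NaN."""
    return a > b or a < b

for a, b in [(1.0, 2.0), (1.0, 1.0), (math.nan, 1.0)]:
    print(a, b, cond_ne(a, b), cond_gtlt(a, b))
```

The two predicates agree on ordered inputs and differ only when an operand is
NaN.
-->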
    </desc>
    <value desc="Equal">eq</value>
    <value desc="Greater than">gt</value>
    <value desc="Greater than or equal">ge</value>
    <value desc="Not equal">ne</value>
    <value desc="Less than">lt</value>
    <value desc="Less than or equal">le</value>
    <value desc="Greater than or less than">gtlt</value>
    <value desc="Totally ordered">total</value>
  </enum>

  <enum name="Dimension">
    <desc>Texture dimension.</desc>
    <value desc="1D or buffer">1d</value>
    <value desc="2D or 2D array">2d</value>
    <value desc="3D or 3D array">3d</value>
    <value desc="Cube map or cube map array">cube</value>
  </enum>

  <enum name="LOD mode">
    <desc>Level-of-detail selection mode in a texture instruction.</desc>
    <value desc="Set to zero">zero</value>
    <value desc="Computed based on neighboring fragments">computed</value>
    <reserved/>
    <reserved/>
    <value desc="Explicitly specified in a register">explicit</value>
    <value desc="Computed based on neighboring fragments added with bias in a register">computed_bias</value>
    <value desc="Derived from a gradient descriptor in registers">grdesc</value>
    <reserved/>
  </enum>

  <enum name="Register format">
    <desc>Format of data loaded to / stored from registers for general memory access.</desc>
    <value desc="32-bit type based on descriptor format">auto</value>
    <reserved/>
    <value desc="32-bit floats">f32</value>
    <value desc="16-bit floats">f16</value>
    <value desc="32-bit signed integers">s32</value>
    <value desc="16-bit signed integers">s16</value>
    <value desc="32-bit unsigned integers">u32</value>
    <value desc="16-bit unsigned integers">u16</value>
  </enum>

  <enum name="Staging register count" implied="true">
    <value>sr0</value>
    <value>sr1</value>
    <value>sr2</value>
    <value>sr3</value>
    <value>sr4</value>
    <value>sr5</value>
    <value>sr6</value>
    <value>sr7</value>
  </enum>

  <enum
name="Staging register write count" implied="true">
    <value>write1</value>
    <value>write2</value>
    <value>write3</value>
    <value>write4</value>
    <value>write5</value>
    <value>write6</value>
    <value>write7</value>
    <value>write8</value>
  </enum>

  <enum name="Write mask">
    <reserved/>
    <value>r</value>
    <value>g</value>
    <value>rg</value>
    <value>b</value>
    <value>rb</value>
    <value>gb</value>
    <value>rgb</value>
    <value>a</value>
    <value>ra</value>
    <value>ga</value>
    <value>rga</value>
    <value>ba</value>
    <value>rba</value>
    <value>gba</value>
    <value default="true">rgba</value>
  </enum>

  <enum name="Fetch component">
    <value desc="Red">gather4_r</value>
    <value desc="Green">gather4_g</value>
    <value desc="Blue">gather4_b</value>
    <value desc="Alpha">gather4_a</value>
  </enum>

  <enum name="Register type">
    <desc>Unsized type, part of a register format.</desc>
    <reserved/>
    <value name="Float">f</value>
    <value name="Unsigned">u</value>
    <value name="Signed">s</value>
  </enum>

  <enum name="Register width">
    <desc>Untyped size, part of a register format.</desc>
    <value>16</value>
    <value>32</value>
  </enum>

  <enum name="Varying texture register width">
    <desc>
      Size of results for varying texture instructions. For dual 16-bit results
      use "16-bit".
    </desc>
    <value desc="16-bit">16</value>
    <value desc="32-bit">32</value>
    <value desc="16-bit, 32-bit">16.32</value>
    <value desc="32-bit, 32-bit">32.32</value>
  </enum>

  <enum name="Vector size">
    <desc>Number of channels loaded/stored for general memory access.</desc>
    <value default="true" desc="Scalar">none</value>
    <value desc="2 channels">v2</value>
    <value desc="3 channels">v3</value>
    <value desc="4 channels">v4</value>
  </enum>

  <enum name="Slot">
    <desc>
      Dependency slot set on a message-passing instruction that writes to
      registers. Before reading the destination, a future instruction must wait
      on the specified slot. Slot #7 is for `BARRIER` instructions only.
    </desc>
    <value desc="Slot #0">slot0</value>
    <value desc="Slot #1">slot1</value>
    <value desc="Slot #2">slot2</value>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <value desc="Slot #7">slot7</value>
  </enum>

  <enum name="Memory access">
    <desc>Memory access hint for a `LOAD` or `STORE` instruction.</desc>
    <value desc="No hint (global)" default="true">none</value>
    <value desc="Internally streaming (position output)">istream</value>
    <value desc="Externally streaming (varying output)">estream</value>
    <value desc="Force access in discarded threads (thread local storage)">force</value>
  </enum>

  <enum name="Subgroup size">
    <desc>
      Selects the effective subgroup size for subgroup operations. The hardware
      warps are sixteen threads on Valhall, but subdividing a warp may be useful
      for API requirements. In particular, derivatives may be calculated with
      quads (four threads).
    </desc>
    <value desc="Two threads">subgroup2</value>
    <value desc="Four threads">subgroup4</value>
    <value desc="Eight threads">subgroup8</value>
    <value desc="Sixteen threads" default="true">subgroup16</value>
  </enum>

  <enum name="Lane operation">
    <desc>
      Acts as a modifier on the lane specifier for a `CLPER` instruction. The
      `accumulate` mode is required for efficient subgroup reductions.
    </desc>
    <value name="No operation" default="true">none</value>
    <value name="Exclusive-or">xor</value>
    <value name="Accumulate">accumulate</value>
    <value name="Shift">shift</value>
  </enum>

  <enum name="Inactive result">
    <desc>
      Access to inactive lanes (due to divergence) in a subgroup is generally
      undefined in APIs. However, the results of permuting with an inactive lane
      with `CLPER.i32` are well-defined in Valhall: they return one of the
      following values, as specified in the `CLPER.i32` instruction. Sometimes
      certain values enable small optimizations.
    </desc>
    <value name="0x00000000" default="true">zero</value>
    <value name="0xFFFFFFFF">umax</value>
    <value name="0x00000001">i1</value>
    <value name="0x00010001">v2i1</value>
    <value name="0x80000000">smin</value>
    <value name="0x7FFFFFFF">smax</value>
    <value name="0x80008000">v2smin</value>
    <value name="0x7FFF7FFF">v2smax</value>
    <value name="0x80808080">v4smin</value>
    <value name="0x7F7F7F7F">v4smax</value>
    <value name="0x3F800000">f1</value>
    <value name="0x3C003C00">v2f1</value>
    <value name="0xFF800000">infn</value>
    <value name="0x7F800000">inf</value>
    <value name="0xFC00FC00">v2infn</value>
    <value name="0x7C007C00">v2inf</value>
  </enum>

  <enum name="Mux">
    <desc>
      Condition to use for a `MUX` instruction.
      `neg` checks the sign bit,
      `int_zero` compares to `0x00000000`, `fp_zero` compares to $\pm 0.0$ as
      an IEEE 754 float, and `bit` checks each bit separately. The `bit` mode
      acts like an imaginary `CSEL.v32u1` instruction, and implements
      `bitselect()` in OpenCL.
    </desc>
    <value desc="Negative">neg</value>
    <value desc="Integer zero" default="true">int_zero</value>
    <value desc="Floating point zero">fp_zero</value>
    <value desc="Bitwise">bit</value>
  </enum>

  <enum name="Sample mode">
    <desc>
      Varying interpolation mode, for choosing the correct sample to
      interpolate at, allowing the `sample` and `centroid` qualifiers to be
      implemented, as well as the `interpolateAt*` functions.
    </desc>
    <value desc="Center">center</value>
    <value desc="Centroid">centroid</value>
    <value desc="Sample">sample</value>
    <value desc="Explicit">explicit</value>
  </enum>

  <enum name="Update mode">
    <desc>
      The Valhall GPU maintains hidden state when interpolating varyings, to
      allow reusing sample location calculations. The update mode of a varying
      load controls this hidden state.
    </desc>
    <value desc="Store interpolation position">store</value>
    <value desc="Retrieve interpolation position">retrieve</value>
    <reserved/>
    <value desc="Clobber saved position">clobber</value>
  </enum>

  <enum name="Sample and update mode">
    <desc>
      For fused varying/texture instructions, only the following specific
      combinations of sample and update modes are permitted.
    </desc>
    <value desc="Center, store">center_store</value>
    <value desc="Centroid, store">centroid_store</value>
    <value desc="Sample, store">sample_store</value>
    <value desc="Explicit, store">explicit_store</value>
    <value desc="Center, clobber">center_clobber</value>
    <reserved/>
    <value desc="Sample, clobber">sample_clobber</value>
    <value desc="Retrieve previous state">retrieve</value>
  </enum>

  <enum name="Source format">
    <desc>
      In-memory format of varyings.

      Note: src_flat32 is only valid with 32-bit varying instructions and
      src_flat16 is only valid with 16-bit varying instructions.
    </desc>
    <value desc="Uninterpreted 32-bit values">src_flat32</value>
    <value desc="Uninterpreted 16-bit values">src_flat16</value>
    <value desc="Interpolated 32-bit floats">src_f32</value>
    <value desc="Interpolated 16-bit floats">src_f16</value>
  </enum>

  <enum name="Atomic operation">
    <desc>
      Operation performed in a general computational atomic instruction.
    </desc>
    <reserved/>
    <reserved/>
    <value desc="Add">aadd</value>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <reserved/>
    <value desc="Signed minimum">asmin</value>
    <value desc="Signed maximum">asmax</value>
    <value desc="Unsigned minimum">aumin</value>
    <value desc="Unsigned maximum">aumax</value>
    <value desc="Bitwise and">aand</value>
    <value desc="Bitwise or">aor</value>
    <value desc="Bitwise exclusive-or">axor</value>
    <value desc="Exchange (must return the value)">axchg</value>
  </enum>

  <enum name="Atomic operation with 1">
    <desc>
      Operation performed in a computational atomic-with-1 instruction.
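<!--
Editorial sketch, not part of the ISA description: a reference model of the
atomic-with-1 operations listed in this enum, read directly from their short
descriptions. It assumes 32-bit operands with wraparound arithmetic and models
only the value left in memory, not the returned value or any atomicity. The
helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Reinterpret an unsigned 32-bit value as two's complement."""
    return x - (1 << 32) if x & 0x80000000 else x

def atomic_with_1(op, old):
    """Return the new 32-bit memory value after applying `op` to `old`."""
    if op == "ainc":
        return (old + 1) & MASK32
    if op == "adec":
        return (old - 1) & MASK32
    if op == "aumax1":
        return max(old, 1)
    if op == "asmax1":
        return max(to_signed(old), 1) & MASK32
    if op == "aor1":
        return old | 1
    raise ValueError(op)

print(hex(atomic_with_1("adec", 0)))
print(hex(atomic_with_1("asmax1", 0xFFFFFFFF)))
```

The signed and unsigned maximum differ only for values with the top bit set,
which the second print illustrates.
-->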
    </desc>
    <value desc="Increment">ainc</value>
    <value desc="Decrement">adec</value>
    <value desc="Unsigned maximum with 1">aumax1</value>
    <value desc="Signed maximum with 1">asmax1</value>
    <value desc="Set bottom bit">aor1</value>
  </enum>

  <ins name="NOP" title="No operation" dests="0" opcode="0x00" unit="CVT">
    <desc>
      Do nothing. Useful at the start of a block for waiting on slots required
      by the first actual instruction of the block, to reconcile dependencies
      after a branch. Also useful as the sole instruction of an empty shader.
    </desc>
  </ins>

  <ins name="BRANCHZ" title="Compare to zero and branch" dests="0" opcode="0x1F" unit="CVT">
    <desc>
      Branches to a specified relative offset if its source is nonzero (default)
      or if its source is zero (if `.eq` is set). The offset is 27 bits and
      sign-extended, giving an effective range of ±26 bits. The offset is
      specified in units of instructions, relative to the *next* instruction.
      Positive offsets may be interpreted as "number of instructions to skip".
      Since Valhall instructions are 8 bytes, this operates as:

      $$PC := \begin{cases} PC + 8 \cdot (\text{offset} \; + 1) &amp; \text{if} \;
      \text{src} \stackrel{?}{=} 0 \\ PC + 8 &amp; \text{otherwise} \end{cases}$$

      Used with comparison instructions to implement control flow. Tie the
      source to a nonzero constant to implement a jump. May introduce
      divergence, so generally requires `.reconverge` flow control.
    </desc>
    <src combine="true">Value to compare against zero</src>
    <imm name="offset" start="8" size="27" signed="true"/>
    <conservative/>
    <mod name="eq" start="36" size="1"/>
  </ins>

  <ins name="DISCARD.f32" title="Discard fragment" dests="0" opcode="0x20" unit="CVT">
    <desc>
      Evaluates the given condition, and if it passes, discards the current
      fragment and terminates the thread. Only valid in a **fragment** shader.
    </desc>
    <cmp/>
    <src absneg="true" swizzle="true">Left value to compare</src>
    <src absneg="true" swizzle="true">Right value to compare</src>
  </ins>

  <ins name="BRANCHZI" title="Compare to zero and branch indirect" opcode="0x2F" unit="CVT">
    <desc>
      Jump to an indirectly specified (absolute or relative) address. Used to
      jump to blend shaders at the end of a fragment shader.
    </desc>
    <src combine="true">Value to compare against zero</src>
    <src>Branch target</src>
    <conservative/>
    <mod name="eq" start="36" size="1"/>
    <mod name="absolute" start="40" size="1"/>
  </ins>

  <ins name="BARRIER" title="Execution and memory barrier" opcode="0x45" unit="NONE">
    <desc>
      General-purpose barrier. Must use slot #7. Must be paired with a
      `.wait` flow on the instruction.
    </desc>
    <slot/>
  </ins>

  <group name="CSEL" title="Floating-point conditional select" dests="1" unit="CVT">
    <ins name="CSEL.f32" opcode="0x154"/>
    <ins name="CSEL.v2f16" opcode="0x155"/>
    <desc>
      Evaluates the given condition and outputs either the true source or the
      false source.
    </desc>
    <cmp/>
    <src float="true">Left value to compare</src>
    <src float="true">Right value to compare</src>
    <src float="true">Return value if true</src>
    <src float="true">Return value if false</src>
  </group>

  <group name="CSEL" title="Integer conditional select" dests="1" unit="CVT">
    <ins name="CSEL.u32" opcode="0x150"/>
    <ins name="CSEL.v2u16" opcode="0x151"/>
    <ins name="CSEL.s32" opcode="0x158"/>
    <ins name="CSEL.v2s16" opcode="0x159"/>
    <desc>
      Evaluates the given condition and outputs either the true source or the
      false source.

      Valhall lacks integer minimum/maximum instructions. `CSEL` instructions
      with tied operands form the canonical implementations of these
      instructions.
      Similarly, the integer $\text{sign}$ function is canonically
      implemented with a pair of `CSEL` instructions.
    </desc>
    <cmp/>
    <src>Left value to compare</src>
    <src>Right value to compare</src>
    <src>Return value if true</src>
    <src>Return value if false</src>
  </group>

  <ins name="LD_VAR_SPECIAL" title="Load special varying" opcode="0x56" unit="V">
    <sr write="true"/>
    <sr_count/>
    <vecsize/>
    <regfmt/>
    <sample/>
    <update/>
    <slot/>
    <src/>
    <imm name="index" start="12" size="4"/> <!-- 0 for pointx, 1 for pointy, 2 for fragw, 3 for fragz -->
  </ins>

  <group name="LD_VAR_BUF_IMM" title="Load immediate varying" unit="V">
    <desc>Interpolates a given varying from a hardware buffer</desc>
    <ins name="LD_VAR_BUF_IMM.f32" opcode="0x5C"/>
    <ins name="LD_VAR_BUF_IMM.f16" opcode="0x5D"/>
    <slot/>
    <vecsize/>
    <source_format/>
    <sample/>
    <update/>
    <sr write="true"/>
    <sr_count/>
    <src/>
    <imm name="index" start="16" size="8"/>
  </group>

  <group name="LD_VAR_BUF" title="Load indirect varying" unit="V">
    <desc>Interpolates a given varying from a hardware buffer</desc>
    <ins name="LD_VAR_BUF.f32" opcode="0x6C"/>
    <ins name="LD_VAR_BUF.f16" opcode="0x6D"/>
    <slot/>
    <vecsize/>
    <source_format/>
    <sample/>
    <update/>
    <sr write="true"/>
    <sr_count/>
    <src/>
    <src/>
  </group>

  <ins name="LD_VAR" title="Load indirect varying" unit="V" opcode="0x64">
    <desc>Interpolates a given varying from a software buffer</desc>
    <slot/>
    <vecsize/>
    <regfmt/>
    <sample/>
    <update/>
    <sr write="true"/>
    <sr_count/>
    <src/>
    <src>Varying index and table</src>
  </ins>

  <ins name="LD_VAR_IMM" title="Load immediate varying" unit="V" opcode="0x54">
    <desc>Interpolates a given varying from a software buffer</desc>
    <slot/>
    <vecsize/>
    <regfmt/>
    <sample/>
    <update/>
    <sr
write="true"/> 939 <sr_count/> 940 <src/> 941 <imm name="table" start="8" size="4"/> 942 <imm name="index" start="12" size="8"/> 943 </ins> 944 945 <ins name="LD_VAR_FLAT" title="Load indirect varying" unit="V" opcode="0x55"> 946 <desc>Fetches a given varying from a software buffer</desc> 947 <slot/> 948 <vecsize/> 949 <regfmt/> 950 <sr write="true"/> 951 <sr_count/> 952 <src>Varying index and table</src> 953 </ins> 954 955 <ins name="LD_VAR_FLAT_IMM" title="Load immediate varying" unit="V" opcode="0x41"> 956 <desc>Fetches a given varying from a software buffer</desc> 957 <slot/> 958 <vecsize/> 959 <regfmt/> 960 <sr write="true"/> 961 <sr_count/> 962 <imm name="table" start="8" size="4"/> 963 <imm name="index" start="12" size="8"/> 964 </ins> 965 966 <ins name="LD_ATTR_IMM" title="Load immediate attribute" opcode="0x66" opcode2="0" unit="LS"> 967 <desc> 968 Load `vecsize` components from the attribute descriptor at entry `index` 969 of resource table `table` at index (vertex ID, instance ID), converting 970 to the specified register format. 971 </desc> 972 <sr_count/> 973 <vecsize/> 974 <regfmt/> 975 <slot/> 976 <mod name="descriptor_type" start="128" size="1" implied="true"/> 977 <sr write="true"/> 978 <src>Vertex ID</src> 979 <src>Instance ID</src> 980 <imm name="index" start="20" size="4"/> 981 <imm name="table" start="16" size="4"/> 982 </ins> 983 984 <ins name="LD_ATTR" title="Load indirect attribute" opcode="0x76" opcode2="0" unit="LS"> 985 <desc> 986 Load `vecsize` components from the attribute descriptor at the specified 987 location at index (vertex ID, instance ID), converting 988 to the specified register format. 989 990 The index must not diverge within a warp. 
991 </desc> 992 <sr_count/> 993 <vecsize/> 994 <regfmt/> 995 <slot/> 996 <mod name="descriptor_type" start="128" size="1" implied="true"/> 997 <sr write="true"/> 998 <src>Vertex ID</src> 999 <src>Instance ID</src> 1000 <src>Index and table</src> 1001 </ins> 1002 1003 <ins name="LD_TEX_IMM" title="Load immediate texture" opcode="0x66" opcode2="1" unit="LS"> 1004 <desc> 1005 Load `vecsize` components from the texture descriptor at entry `index` 1006 of resource table `table`, converting 1007 to the specified register format. 1008 </desc> 1009 <sr_count/> 1010 <vecsize/> 1011 <regfmt/> 1012 <slot/> 1013 <mod name="descriptor_type" start="128" size="1" implied="true"/> 1014 <sr write="true"/> 1015 <src>X/Y coordinates (16:16)</src> 1016 <src>Z/W coordinates (16:16)</src> 1017 <imm name="index" start="20" size="4"/> 1018 <imm name="table" start="16" size="4"/> 1019 </ins> 1020 1021 <ins name="LD_TEX" title="Load indirect texture" opcode="0x76" opcode2="1" unit="LS"> 1022 <desc> 1023 Load `vecsize` components from the texture descriptor at the specified 1024 location at index, converting 1025 to the specified register format. 1026 </desc> 1027 <sr_count/> 1028 <vecsize/> 1029 <regfmt/> 1030 <slot/> 1031 <mod name="descriptor_type" start="128" size="1" implied="true"/> 1032 <sr write="true"/> 1033 <src>X/Y coordinates (16:16)</src> 1034 <src>Z/W coordinates (16:16)</src> 1035 <src>Index and table</src> 1036 </ins> 1037 1038 <ins name="LEA_ATTR_IMM" title="Load effective address of image texel" opcode="0x67" opcode2="0" unit="LS"> 1039 <desc> 1040 Load the effective address of an attribute specified with the 1041 given immediate index. Returns three staging registers: the low/high 1042 32 bits of the address and the internal conversion descriptor.
1043 </desc> 1044 <slot/> 1045 <sr_count/> 1046 <mod name="descriptor_type" start="128" size="1" implied="true"/> 1047 <sr write="true"/> 1048 <src>Vertex index</src> 1049 <src>Instance index</src> 1050 <imm name="table" start="16" size="4"/> 1051 <imm name="index" start="20" size="4"/> 1052 </ins> 1053 1054 <ins name="LEA_ATTR" title="Load effective address of image texel" opcode="0x77" opcode2="0" unit="LS"> 1055 <desc> 1056 Load the effective address of an attribute specified with the 1057 given index. Returns three staging registers: the low/high 1058 32 bits of the address and the internal conversion descriptor. 1059 </desc> 1060 <vecsize/> 1061 <slot/> 1062 <sr_count/> 1063 <mod name="descriptor_type" start="128" size="1" implied="true"/> 1064 <sr write="true"/> 1065 <src>Vertex index</src> 1066 <src>Instance index</src> 1067 <src>Attribute index and table</src> 1068 </ins> 1069 1070 <ins name="LEA_TEX_IMM" title="Load effective address of image texel" opcode="0x67" opcode2="1" unit="LS"> 1071 <desc> 1072 Load the effective address of a texel from the image specified with the 1073 given immediate index. Returns three staging registers: the low/high 1074 32 bits of the address and the internal conversion descriptor. The format 1075 of the internal conversion descriptor is compatible with Bifrost but 1076 omits the register format, as this is specified with the ST_CVT 1077 instruction on Valhall. 1078 1079 Coordinates are specified as 16-bit integers, packed into 32-bit sources.
1080 </desc> 1081 <slot/> 1082 <sr_count/> 1083 <mod name="descriptor_type" start="128" size="1" implied="true"/> 1084 <sr write="true"/> 1085 <src>X/Y coordinates (16:16)</src> 1086 <src>Z/W coordinates (16:16)</src> 1087 <imm name="table" start="16" size="4"/> 1088 <imm name="index" start="20" size="4"/> 1089 </ins> 1090 1091 <ins name="LEA_TEX" title="Load effective address of image texel" opcode="0x77" opcode2="1" unit="LS"> 1092 <desc> 1093 Load the effective address of a texel from the image specified with the 1094 given index. Returns three staging registers: the low/high 1095 32 bits of the address and the internal conversion descriptor. The format 1096 of the internal conversion descriptor is compatible with Bifrost but 1097 omits the register format, as this is specified with the ST_CVT 1098 instruction on Valhall. 1099 1100 Coordinates are specified as 16-bit integers, packed into 32-bit sources. 1101 </desc> 1102 <vecsize/> 1103 <slot/> 1104 <sr_count/> 1105 <mod name="descriptor_type" start="128" size="1" implied="true"/> 1106 <sr write="true"/> 1107 <src size="16">X/Y coordinates (16:16)</src> 1108 <src>Z/W coordinates (16:16)</src> 1109 <src>Index and table</src> 1110 </ins> 1111 1112 <ins name="LD_BUFFER.i8" title="Global memory load" opcode="0x6a" opcode2="0" unit="LS"> 1113 <desc> 1114 Loads a buffer descriptor. If bits 25...31 of the mode descriptor are 1115 all-ones, load from the buffer descriptors in the table indexed by the 1116 bottom byte of the mode descriptor. If they are all zeroes, load the 1117 contents of the buffer in the first table indexed by the bottom byte of 1118 the mode descriptor.
1119 </desc> 1120 <sr write="true"/> 1121 <sr_count/> 1122 <mod name="load_lane_8_bit" start="36" size="3"/> 1123 <mod name="unsigned" start="39" size="1"/> 1124 <slot/> 1125 <src size="32">Address to load from after adding offset</src> 1126 <src size="32">Mode descriptor</src> 1127 </ins> 1128 1129 <ins name="LD_BUFFER.i16" title="Global memory load" opcode="0x6a" opcode2="1" unit="LS"> 1130 <desc> 1131 Loads a buffer descriptor. If bits 25...31 of the mode descriptor are 1132 all-ones, load from the buffer descriptors in the table indexed by the 1133 bottom byte of the mode descriptor. If they are all zeroes, load the 1134 contents of the buffer in the first table indexed by the bottom byte of 1135 the mode descriptor. 1136 </desc> 1137 <sr write="true"/> 1138 <sr_count/> 1139 <mod name="load_lane_16_bit" start="36" size="3"/> 1140 <mod name="unsigned" start="39" size="1"/> 1141 <slot/> 1142 <src size="32">Byte offset</src> 1143 <src size="32">Mode descriptor</src> 1144 </ins> 1145 1146 <ins name="LD_BUFFER.i24" title="Global memory load" opcode="0x6a" opcode2="2" unit="LS"> 1147 <desc> 1148 Loads a buffer descriptor. If bits 25...31 of the mode descriptor are 1149 all-ones, load from the buffer descriptors in the table indexed by the 1150 bottom byte of the mode descriptor. If they are all zeroes, load the 1151 contents of the buffer in the first table indexed by the bottom byte of 1152 the mode descriptor. 1153 </desc> 1154 <sr write="true"/> 1155 <sr_count/> 1156 <mod name="load_lane_24_bit" start="36" size="3"/> 1157 <mod name="unsigned" start="39" size="1"/> 1158 <slot/> 1159 <src size="32">Byte offset</src> 1160 <src size="32">Mode descriptor</src> 1161 </ins> 1162 1163 <ins name="LD_BUFFER.i32" title="Global memory load" opcode="0x6a" opcode2="3" unit="LS"> 1164 <desc> 1165 Loads a buffer descriptor. 
If bits 25...31 of the mode descriptor are 1166 all-ones, load from the buffer descriptors in the table indexed by the 1167 bottom byte of the mode descriptor. If they are all zeroes, load the 1168 contents of the buffer in the first table indexed by the bottom byte of 1169 the mode descriptor. 1170 </desc> 1171 <sr write="true"/> 1172 <sr_count/> 1173 <mod name="load_lane_32_bit" start="36" size="3"/> 1174 <mod name="unsigned" start="39" size="1"/> 1175 <slot/> 1176 <src size="32">Byte offset</src> 1177 <src size="32">Mode descriptor</src> 1178 </ins> 1179 1180 <ins name="LD_BUFFER.i48" title="Global memory load" opcode="0x6a" opcode2="4" unit="LS"> 1181 <desc> 1182 Loads a buffer descriptor. If bits 25...31 of the mode descriptor are 1183 all-ones, load from the buffer descriptors in the table indexed by the 1184 bottom byte of the mode descriptor. If they are all zeroes, load the 1185 contents of the buffer in the first table indexed by the bottom byte of 1186 the mode descriptor. 1187 </desc> 1188 <sr write="true"/> 1189 <sr_count/> 1190 <mod name="load_lane_48_bit" start="36" size="3"/> 1191 <mod name="unsigned" start="39" size="1"/> 1192 <slot/> 1193 <src size="32">Byte offset</src> 1194 <src size="32">Mode descriptor</src> 1195 </ins> 1196 1197 <ins name="LD_BUFFER.i64" title="Global memory load" opcode="0x6a" opcode2="5" unit="LS"> 1198 <desc> 1199 Loads a buffer descriptor. If bits 25...31 of the mode descriptor are 1200 all-ones, load from the buffer descriptors in the table indexed by the 1201 bottom byte of the mode descriptor. If they are all zeroes, load the 1202 contents of the buffer in the first table indexed by the bottom byte of 1203 the mode descriptor. 
1204 </desc> 1205 <sr write="true"/> 1206 <sr_count/> 1207 <mod name="load_lane_64_bit" start="36" size="3"/> 1208 <mod name="unsigned" start="39" size="1"/> 1209 <slot/> 1210 <src size="32">Byte offset</src> 1211 <src size="32">Mode descriptor</src> 1212 </ins> 1213 1214 <ins name="LD_BUFFER.i96" title="Global memory load" opcode="0x6a" opcode2="6" unit="LS"> 1215 <desc> 1216 Loads a buffer descriptor. If bits 25...31 of the mode descriptor are 1217 all-ones, load from the buffer descriptors in the table indexed by the 1218 bottom byte of the mode descriptor. If they are all zeroes, load the 1219 contents of the buffer in the first table indexed by the bottom byte of 1220 the mode descriptor. 1221 </desc> 1222 <sr write="true"/> 1223 <sr_count/> 1224 <mod name="load_lane_96_bit" start="36" size="3"/> 1225 <mod name="unsigned" start="39" size="1"/> 1226 <slot/> 1227 <src size="32">Byte offset</src> 1228 <src size="32">Mode descriptor</src> 1229 </ins> 1230 1231 <ins name="LD_BUFFER.i128" title="Global memory load" opcode="0x6a" opcode2="7" unit="LS"> 1232 <desc> 1233 Loads a buffer descriptor. If bits 25...31 of the mode descriptor are 1234 all-ones, load from the buffer descriptors in the table indexed by the 1235 bottom byte of the mode descriptor. If they are all zeroes, load the 1236 contents of the buffer in the first table indexed by the bottom byte of 1237 the mode descriptor. 1238 </desc> 1239 <sr write="true"/> 1240 <sr_count/> 1241 <mod name="load_lane_128_bit" start="36" size="3"/> 1242 <mod name="unsigned" start="39" size="1"/> 1243 <slot/> 1244 <src size="32">Byte offset</src> 1245 <src size="32">Mode descriptor</src> 1246 </ins> 1247 1248 <ins name="LEA_BUF_IMM" title="Load buffer effective address" opcode="0x5E" unit="LS"> 1249 <desc> 1250 Load effective address of a buffer with an immediate offset added. 
1251 </desc> 1252 <sr write="true"/> 1253 <sr_count/> 1254 <slot/> 1255 <imm name="table" start="8" size="4"/> 1256 <imm name="index" start="12" size="8"/> 1257 <src>Linear ID</src> 1258 </ins> 1259 1260 <ins name="LOAD.i8" title="Global memory load" opcode="0x60" opcode2="0" unit="LS"> 1261 <desc>Loads from main memory</desc> 1262 <sr write="true"/> 1263 <memory_access/> 1264 <sr_count/> 1265 <mod name="load_lane_8_bit" start="36" size="3"/> 1266 <mod name="unsigned" start="39" size="1"/> 1267 <slot/> 1268 <src size="64">Address to load from after adding offset</src> 1269 <imm name="offset" start="8" size="16" signed="true"/> 1270 </ins> 1271 1272 <ins name="LOAD.i16" title="Global memory load" opcode="0x60" opcode2="1" unit="LS"> 1273 <desc>Loads from main memory</desc> 1274 <sr write="true"/> 1275 <memory_access/> 1276 <sr_count/> 1277 <mod name="load_lane_16_bit" start="36" size="3"/> 1278 <mod name="unsigned" start="39" size="1"/> 1279 <slot/> 1280 <src size="64">Address to load from after adding offset</src> 1281 <imm name="offset" start="8" size="16" signed="true"/> 1282 </ins> 1283 1284 <ins name="LOAD.i24" title="Global memory load" opcode="0x60" opcode2="2" unit="LS"> 1285 <desc>Loads from main memory</desc> 1286 <sr write="true"/> 1287 <memory_access/> 1288 <sr_count/> 1289 <mod name="load_lane_24_bit" start="36" size="3"/> 1290 <mod name="unsigned" start="39" size="1"/> 1291 <slot/> 1292 <src size="64">Address to load from after adding offset</src> 1293 <imm name="offset" start="8" size="16" signed="true"/> 1294 </ins> 1295 1296 <ins name="LOAD.i32" title="Global memory load" opcode="0x60" opcode2="3" unit="LS"> 1297 <desc>Loads from main memory</desc> 1298 <sr write="true"/> 1299 <memory_access/> 1300 <sr_count/> 1301 <mod name="load_lane_32_bit" start="36" size="3"/> 1302 <mod name="unsigned" start="39" size="1"/> 1303 <slot/> 1304 <src size="64">Address to load from after adding offset</src> 1305 <imm name="offset" start="8" size="16" signed="true"/> 
1306 </ins> 1307 1308 <ins name="LOAD.i48" title="Global memory load" opcode="0x60" opcode2="4" unit="LS"> 1309 <desc>Loads from main memory</desc> 1310 <sr write="true"/> 1311 <memory_access/> 1312 <sr_count/> 1313 <mod name="load_lane_48_bit" start="36" size="3"/> 1314 <mod name="unsigned" start="39" size="1"/> 1315 <slot/> 1316 <src size="64">Address to load from after adding offset</src> 1317 <imm name="offset" start="8" size="16" signed="true"/> 1318 </ins> 1319 1320 <ins name="LOAD.i64" title="Global memory load" opcode="0x60" opcode2="5" unit="LS"> 1321 <desc>Loads from main memory</desc> 1322 <sr write="true"/> 1323 <memory_access/> 1324 <sr_count/> 1325 <mod name="load_lane_64_bit" start="36" size="3"/> 1326 <mod name="unsigned" start="39" size="1"/> 1327 <slot/> 1328 <src size="64">Address to load from after adding offset</src> 1329 <imm name="offset" start="8" size="16" signed="true"/> 1330 </ins> 1331 1332 <ins name="LOAD.i96" title="Global memory load" opcode="0x60" opcode2="6" unit="LS"> 1333 <desc>Loads from main memory</desc> 1334 <sr write="true"/> 1335 <memory_access/> 1336 <sr_count/> 1337 <mod name="load_lane_96_bit" start="36" size="3"/> 1338 <mod name="unsigned" start="39" size="1"/> 1339 <slot/> 1340 <src size="64">Address to load from after adding offset</src> 1341 <imm name="offset" start="8" size="16" signed="true"/> 1342 </ins> 1343 1344 <ins name="LOAD.i128" title="Global memory load" opcode="0x60" opcode2="7" unit="LS"> 1345 <desc>Loads from main memory</desc> 1346 <sr write="true"/> 1347 <memory_access/> 1348 <sr_count/> 1349 <mod name="load_lane_128_bit" start="36" size="3"/> 1350 <mod name="unsigned" start="39" size="1"/> 1351 <slot/> 1352 <src size="64">Address to load from after adding offset</src> 1353 <imm name="offset" start="8" size="16" signed="true"/> 1354 </ins> 1355 1356 <group name="STORE" title="Global memory store" opcode="0x61" unit="LS"> 1357 <desc>Stores to main memory</desc> 1358 <sr read="true"/> 1359 <ins 
name="STORE.i8" opcode2="0x0"/> 1360 <ins name="STORE.i16" opcode2="0x1"/> 1361 <ins name="STORE.i24" opcode2="0x2"/> 1362 <ins name="STORE.i32" opcode2="0x3"/> 1363 <ins name="STORE.i48" opcode2="0x4"/> 1364 <ins name="STORE.i64" opcode2="0x5"/> 1365 <ins name="STORE.i96" opcode2="0x6"/> 1366 <ins name="STORE.i128" opcode2="0x7"/> 1367 <sr_count/> 1368 <memory_access/> 1369 <slot/> 1370 <src size="64">Address to store to after adding offset</src> 1371 <imm name="offset" start="8" size="16" signed="true"/> 1372 </group> 1373 1374 <ins name="ST_CVT" title="Store with conversion" opcode="0x71" unit="LS"> 1375 <desc> 1376 Store to memory with data conversion. The address to store to is given in 1377 the first source, which must be a 64-bit register (a pair of 32-bit 1378 registers). The other source is the conversion descriptor used for the store. 1379 1380 Used with LEA_TEX_IMM to implement image stores. 1381 </desc> 1382 <slot/> 1383 <mod name="memory_access" start="37" size="3"/> 1384 <vecsize/> 1385 <regfmt/> 1386 <sr read="true"/> 1387 <sr_count/> 1388 <src size="64">64-bit address to store to</src> 1389 <imm name="offset" start="8" size="8"/> 1390 <src>Internal conversion descriptor</src> 1391 </ins> 1392 1393 <ins name="LD_TILE" title="Load from tilebuffer" opcode="0x78" unit="NONE"> 1394 <desc> 1395 Loads a given render target, specified in the pixel indices descriptor, at 1396 a given location and sample, and converts it to the format specified in the 1397 internal conversion descriptor. Used to implement EXT_framebuffer_fetch 1398 and internally in blend shaders.
1399 </desc> 1400 <sr write="true"/> 1401 <sr_count/> 1402 <vecsize/> 1403 <regfmt/> 1404 <slot/> 1405 <src>Pixel indices descriptor</src> 1406 <src>Coverage mask</src> 1407 <src>Conversion descriptor</src> 1408 </ins> 1409 1410 <ins name="ST_TILE" title="Store to tilebuffer" opcode="0x79" unit="NONE"> 1411 <desc> 1412 Stores to a given render target, specified in the pixel indices descriptor, at 1413 a given location and sample, converting to the format specified in the 1414 internal conversion descriptor. Used internally in blend shaders. 1415 </desc> 1416 <sr read="true"/> 1417 <sr_count/> 1418 <vecsize/> 1419 <regfmt/> 1420 <slot/> 1421 <src>Pixel indices descriptor</src> 1422 <src>Coverage mask</src> 1423 <src>Conversion descriptor</src> 1424 </ins> 1425 1426 <ins name="BLEND" title="Blend render target" opcode="0x7F" unit="NONE"> 1427 <desc> 1428 Blends a given render target. This loads the API-specified blend state for 1429 the render target from the first source. Blend descriptors are available 1430 as special immediates. It then reads the colour to be blended from the 1431 first staging register, with the specified vector size and register format 1432 as desired. The resulting coverage mask is stored to the second set of 1433 staging registers. 1434 1435 In the fixed-function path, `BLEND` sends the colour to the blender to be 1436 written to the tilebuffer. Then, if the instruction's flow control 1437 specifies termination, the fragment program is ended. If it does not 1438 specify termination, `BLEND` acts as a relative branch, branching with the 1439 offset specified as `target`. This allows the subsequent instructions to 1440 be skipped when fixed-function blending is used. Note this implicit branch 1441 can never introduce divergence, so `.reconverge` is not required. 1442 1443 In the blend shader path, `BLEND` ignores the specified flow control and 1444 does not branch to the specified offset.
Instead, execution continues 1445 normally with the next instruction. The compiler should insert code for 1446 calling a blend shader after the `BLEND` instruction unless it is known 1447 that a blend shader will never be required. 1448 1449 The indirection is required to support both fixed-function and blend 1450 shaders efficiently and without shader variants. 1451 </desc> 1452 <sr read="true"/> 1453 <src size="64">Blend descriptor</src> 1454 <src>Sample coverage</src> 1455 <imm name="target" start="8" size="8"/> 1456 <slot/> 1457 <sr_count/> 1458 <vecsize/> 1459 <regfmt/> 1460 </ins> 1461 1462 <ins name="ATEST" title="Alpha test" opcode="0x7D" unit="NONE"> 1463 <desc> 1464 Does alpha-to-coverage testing, updating the sample coverage mask. ATEST 1465 does not do an implicit discard. It should be executed before the first 1466 ZS_EMIT or BLEND instruction. 1467 </desc> 1468 <sr write="true">Updated coverage mask</sr> 1469 <src>Input coverage mask</src> 1470 <src swizzle="true">Alpha value (render target 0)</src> 1471 <src/> 1472 <sr_count/> 1473 </ins> 1474 1475 <ins name="ZS_EMIT" title="Depth/stencil write" opcode="0x7E" unit="NONE"> 1476 <desc> 1477 Programmatically writes out depth, stencil, or both, depending on which 1478 modifiers are set. Used to implement gl_FragDepth and gl_FragStencil. 1479 </desc> 1480 <mod name="z" start="25" size="1"/> 1481 <mod name="stencil" start="24" size="1"/> 1482 <sr write="true">Updated coverage mask</sr> 1483 <src>Depth value</src> 1484 <src>Stencil value</src> 1485 <src>Input coverage mask</src> 1486 <sr_count/> 1487 <slot/> 1488 </ins> 1489 1490 <group name="CONVERT" title="Data conversions" dests="1" opcode="0x90" unit="CVT"> 1491 <desc> 1492 Performs the given data conversion. Note that floating-point rounding is 1493 handled via the same hardware and therefore shares an encoding. Round mode 1494 is specified where it makes sense.
1495 </desc> 1496 1497 <ins name="V2S16_TO_V2F16" opcode2="0x7"/> 1498 1499 <ins name="S32_TO_F32" opcode2="0x9"/> 1500 1501 <ins name="V2U16_TO_V2F16" opcode2="0x17"/> 1502 1503 <ins name="U32_TO_F32" opcode2="0x19"/> 1504 1505 <roundmode/> 1506 <src widen="true">Value to convert</src> 1507 </group> 1508 1509 <group name="CONVERT" title="16->32 integer data conversions" dests="1" opcode="0x90" unit="CVT"> 1510 <desc> 1511 Performs the given data conversion. 1512 </desc> 1513 1514 <ins name="S16_TO_S32" opcode2="0x4"/> 1515 <ins name="S16_TO_F32" opcode2="0x5"/> 1516 <ins name="U16_TO_U32" opcode2="0x14"/> 1517 <ins name="U16_TO_F32" opcode2="0x15"/> 1518 1519 <src swizzle="true" size="16">Value to convert</src> 1520 </group> 1521 1522 <group name="CONVERT" title="Float-to-int data conversions" dests="1" opcode="0x90" unit="CVT"> 1523 <desc>Performs the given data conversion.</desc> 1524 <ins name="F32_TO_S32" opcode2="0xC"/> 1525 <ins name="F32_TO_U32" opcode2="0x1C"/> 1526 <roundmode/> 1527 <src absneg="true">Value to convert</src> 1528 </group> 1529 1530 <group name="CONVERT" title="Float-to-int data conversions" dests="1" opcode="0x90" unit="CVT"> 1531 <desc>Performs the given data conversion.</desc> 1532 <ins name="V2F16_TO_V2S16" opcode2="0xE"/> 1533 <ins name="V2F16_TO_V2U16" opcode2="0x1E"/> 1534 <ins name="F16_TO_S32" opcode2="0xA"/> 1535 <ins name="F16_TO_U32" opcode2="0x1A"/> 1536 <roundmode/> 1537 <src swizzle="true" absneg="true" size="16">Value to convert</src> 1538 </group> 1539 1540 <ins name="F16_TO_F32" title="16-bit float to 32-bit float conversion" dests="1" opcode="0x90" opcode2="0xB" unit="CVT"> 1541 <desc>Converts up with the specified round mode.</desc> 1542 <roundmode/> 1543 <src lane="28" size="16" absneg="true">Value to convert</src> 1544 </ins> 1545 1546 <group name="CONVERT" title="8-bit to 32-bit data conversions" dests="1" opcode="0x90" unit="CVT"> 1547 <desc> 1548 Performs the given data conversion. 
1549 </desc> 1550 1551 <ins name="S8_TO_S32" opcode2="0x0"/> 1552 <ins name="S8_TO_F32" opcode2="0x1"/> 1553 1554 <ins name="U8_TO_U32" opcode2="0x10"/> 1555 <ins name="U8_TO_F32" opcode2="0x11"/> 1556 1557 <src lane="28" size="8">Value to convert</src> 1558 </group> 1559 1560 <group name="CONVERT" title="8-bit to 16-bit data conversions" dests="1" opcode="0x90" unit="CVT"> 1561 <desc> 1562 Performs the given data conversion. 1563 </desc> 1564 1565 <ins name="V2S8_TO_V2S16" opcode2="0x2"/> 1566 <ins name="V2S8_TO_V2F16" opcode2="0x3"/> 1567 1568 <ins name="V2U8_TO_V2U16" opcode2="0x12"/> 1569 <ins name="V2U8_TO_V2F16" opcode2="0x13"/> 1570 1571 <src halfswizzle="true" size="8">Value to convert</src> 1572 </group> 1573 1574 <group name="FROUND" title="Floating-point rounding" dests="1" opcode="0x90" unit="CVT"> 1575 <desc> 1576 Performs the given rounding, using the convert unit. 1577 </desc> 1578 1579 <ins name="FROUND.f32" opcode2="0xD"/> 1580 <ins name="FROUND.v2f16" opcode2="0xF"/> 1581 1582 <roundmode/> 1583 <src swizzle="true" absneg="true">Value to convert</src> 1584 </group> 1585 1586 <ins name="MOV.i32" title="Register move" dests="1" opcode="0x91" opcode2="0x0" unit="CVT"> 1587 <desc>Canonical register-to-register move.</desc> 1588 <src/> 1589 </ins> 1590 1591 <ins name="CLZ.u32" title="Count leading zeroes" dests="1" opcode="0x91" opcode2="0x4" unit="CVT"> 1592 <desc> 1593 Used as a primitive for various bitwise operations. 1594 </desc> 1595 <src/> 1596 </ins> 1597 1598 <ins name="CLZ.v2u16" title="Count leading zeroes" dests="1" opcode="0x91" opcode2="0x5" unit="CVT"> 1599 <desc> 1600 Used as a primitive for various bitwise operations. 1601 </desc> 1602 <src/> 1603 </ins> 1604 1605 <ins name="CLZ.v4u8" title="Count leading zeroes" dests="1" opcode="0x91" opcode2="0x6" unit="CVT"> 1606 <desc> 1607 Used as a primitive for various bitwise operations. 
1608 </desc> 1609 <src/> 1610 </ins> 1611 1612 <ins name="IABS.s32" title="Absolute value" dests="1" opcode="0x91" opcode2="0x8" unit="CVT"> 1613 <desc> 1614 64-bit abs may be constructed in 4 instructions (5 clocks) by checking the 1615 sign with `ICMP.s32.lt.m1 hi, 0` and negating based on the result with 1616 `IADD.s64` and `LSHIFT_XOR.i32` on each half. 1617 </desc> 1618 <src widen="true"/> 1619 </ins> 1620 1621 <ins name="IABS.v2s16" title="Absolute value" dests="1" opcode="0x91" opcode2="0x9" unit="CVT"> 1622 <src widen="true"/> 1623 </ins> 1624 1625 <ins name="IABS.v4s8" title="Absolute value" dests="1" opcode="0x91" opcode2="0xa" unit="CVT"> 1626 <src/> 1627 </ins> 1628 1629 <ins name="POPCOUNT.i32" title="Population count" dests="1" opcode="0x91" opcode2="0xC" unit="SFU"> 1630 <desc> 1631 Only available as 32-bit. Smaller bitsizes require explicit conversions. 1632 64-bit popcount may be constructed in 3 clocks by separate 32-bit 1633 popcounts of each half and a 32-bit add, which is guaranteed not to 1634 overflow. 1635 </desc> 1636 <src/> 1637 </ins> 1638 1639 <ins name="BITREV.i32" title="Bitwise reverse" dests="1" opcode="0x91" opcode2="0xD" unit="SFU"> 1640 <desc> 1641 Only available as 32-bit. Other bitsizes may be derived with swizzles. 1642 </desc> 1643 <src/> 1644 </ins> 1645 1646 <ins name="NOT_OLD.i32" title="Bitwise complement" dests="1" opcode="0x91" opcode2="0xE" unit="SFU"> 1647 <desc> 1648 For fully featured bitwise operation, see the shift opcodes. 1649 </desc> 1650 <src/> 1651 </ins> 1652 1653 <ins name="NOT_OLD.i64" title="Bitwise complement" dests="1" opcode="0x191" opcode2="0xE" unit="SFU"> 1654 <desc> 1655 For fully featured bitwise operation, see the shift opcodes. 1656 </desc> 1657 <src/> 1658 </ins> 1659 1660 <ins name="WMASK" title="Warp mask" dests="1" opcode="0x95" unit="CVT"> 1661 <desc> 1662 Returns the mask of lanes ever active within the warp (subgroup), such 1663 that the source is nonzero. 
The number of work-items in a subgroup is 1664 given as the popcount of this value with a nonzero input. 1665 1666 An `all()` subgroup operation may be constructed as `WMASK` of the input 1667 compared for equality with `WMASK` of a nonzero value. 1668 1669 An `any()` subgroup operation may be constructed as `WMASK` of the input 1670 compared against zero. 1671 </desc> 1672 <src/> 1673 <subgroup/> 1674 </ins> 1675 1676 <group name="FREXP" title="Fraction/exponent extract" dests="1" opcode="0x99" unit="CVT"> 1677 <ins name="FREXPM.f32" opcode2="0"/> 1678 <ins name="FREXPM.v2f16" opcode2="1"/> 1679 <ins name="FREXPE.f32" opcode2="2"/> 1680 <ins name="FREXPE.v2f16" opcode2="3"/> 1681 <desc> 1682 Breaks up the floating-point input into its fractional (mantissa) and 1683 exponent parts. By default, this is compatible with the `frexp()` function 1684 in APIs. With the log/sqrt modifiers, the floating-point format is 1685 adjusted to be compatible with Valhall's argument reduction for logarithm 1686 and square root computation respectively. 1687 </desc> 1688 <mod name="sqrt" start="24" size="1"/> 1689 <mod name="log" start="25" size="1"/> 1690 <src float="true" swizzle="true"/> 1691 </group> 1692 1693 <group name="SFU" title="Special function unit" dests="1" opcode="0x9C" unit="SFU"> 1694 <ins name="FRCP.f32" opcode2="0"/> 1695 <ins name="FRCP.f16" opcode2="1"/> 1696 <ins name="FRSQ.f32" opcode2="2"/> 1697 <ins name="FRSQ.f16" opcode2="3"/> 1698 <ins name="FLOGD.f32" opcode2="8"/> 1699 <ins name="FPCLASS.f32" opcode2="10"/> 1700 <ins name="FPCLASS.f16" opcode2="11"/> 1701 <ins name="FLOG_TABLE.f32" opcode2="12"/> 1702 <ins name="FRCP_APPROX.f32" opcode2="14"/> 1703 <ins name="FRSQ_APPROX.f32" opcode2="15"/> 1704 <desc> 1705 Performs a given special function. The floating-point reciprocal (`FRCP`) 1706 and reciprocal square root (`FRSQ`) instructions may be freely used as-is. 1707 The logarithm instruction (`FLOGD.f32`) requires an argument 1708 reduction.
See the transcendentals section for more information. Like the 1709 Bifrost op, `FRSQ_APPROX.f32` does an implicit `FREXPM.f32.sqrt` on the 1710 source. 1711 </desc> 1712 <src float="true" swizzle="true" absneg="true"/> 1713 </group> 1714 1715 <group name="SFU" title="Special function unit" dests="1" opcode="0x9C" unit="SFU"> 1716 <ins name="FSIN_TABLE.u6" opcode2="4"/> 1717 <ins name="FCOS_TABLE.u6" opcode2="5"/> 1718 <ins name="FSINCOS_OFFSET.u6" opcode2="6"/> 1719 <ins name="FEXP_TABLE.u4" opcode2="13"/> 1720 <desc> 1721 Performs a given special function. The trigonometric tables 1722 (`FSIN_TABLE.u6` and `FCOS_TABLE.u6`) are crude, requiring both an 1723 argument reduction and postprocessing. 1724 </desc> 1725 <src/> 1726 </group> 1727 1728 <group name="FADD" title="Floating-point add" dests="1" opcode2="0" unit="FMA"> 1729 <ins name="FADD.f32" opcode="0xA4"/> 1730 <ins name="FADD.v2f16" opcode="0xA5"/> 1731 <desc>$A + B$</desc> 1732 <clamp/> 1733 <src absneg="true" swizzle="true">A</src> 1734 <src absneg="true" swizzle="true">B</src> 1735 </group> 1736 1737 <group name="FMIN" title="Floating-point minimum" dests="1" opcode2="2" unit="CVT"> 1738 <ins name="FMIN.f32" opcode="0xA4"/> 1739 <ins name="FMIN.v2f16" opcode="0xA5"/> 1740 <desc>$\min \{ A, B \}$</desc> 1741 <clamp/> 1742 <src absneg="true" swizzle="true">A</src> 1743 <src absneg="true" swizzle="true">B</src> 1744 </group> 1745 1746 <group name="FMAX" title="Floating-point maximum" dests="1" opcode2="3" unit="CVT"> 1747 <ins name="FMAX.f32" opcode="0xA4"/> 1748 <ins name="FMAX.v2f16" opcode="0xA5"/> 1749 <desc>$\max \{ A, B \}$</desc> 1750 <clamp/> 1751 <src absneg="true" swizzle="true">A</src> 1752 <src absneg="true" swizzle="true">B</src> 1753 </group> 1754 1755 <group name="V2F32_TO_V2F16" title="Vectorized floating-point conversion" dests="1" opcode2="4" unit="CVT"> 1756 <ins name="V2F32_TO_V2F16" opcode="0xA5"/> 1757 <desc> 1758 Given a pair of 32-bit floats, output a pair of 16-bit floats packed 
into 1759 a 32-bit destination. 1760 </desc> 1761 <clamp/> 1762 <roundmode/> 1763 <src absneg="true">A</src> 1764 <src absneg="true">B</src> 1765 </group> 1766 1767 <group name="LDEXP" title="Floating-point rescaling" dests="1" opcode2="6" unit="FMA"> 1768 <ins name="LDEXP.f32" opcode="0xA4"/> 1769 <ins name="LDEXP.v2f16" opcode="0xA5"/> 1770 <desc> 1771 Computes $A \cdot 2^B$ by adding B to the exponent of A. Used to calculate 1772 various special functions, particularly base-2 exponents. Special case 1773 handling differs from an actual floating-point multiply, so this should 1774 not be used outside fixed instruction sequences. 1775 </desc> 1776 <src absneg="true" swizzle="true">A</src> 1777 <src/> 1778 <roundmode/> <!-- Also has rtna --> 1779 <!-- Also has infinity handling for arctan --> 1780 </group> 1781 1782 <ins name="FEXP.f32" title="Floating-point exponent" dests="1" opcode="0xA4" opcode2="8" unit="SFU"> 1783 <desc> 1784 Calculates the base-2 exponential of an argument specified as an 8:24 1785 fixed-point value. The original argument is passed as well for correct handling 1786 of special cases. 1787 </desc> 1788 <clamp/> 1789 <src>Input as 8:24 fixed-point</src> 1790 <src absneg="true">Input as 32-bit float</src> 1791 </ins> 1792 1793 <ins name="FADD_LSCALE.f32" title="Floating-point add with logarithm scale" dests="1" opcode="0xA4" opcode2="9" unit="FMA"> 1794 <desc> 1795 Performs a floating-point addition specialized for logarithm computation. 1796 </desc> 1797 <clamp/> 1798 <src absneg="true">A</src> 1799 <src absneg="true">B</src> 1800 </ins> 1801 1802 <ins name="FATAN_ASSIST.f32" title="ATAN calculation helper" dests="1" opcode="0xA4" opcode2="14" unit="SFU"> 1803 <desc> 1804 Used for `atan2()` implementation. Destination is two 16-bit 1805 values (int and float) for the first form, and a single 32-bit float when 1806 `.second` is set (indicating the FATAN_TABLE.f32 instruction).
1807 </desc> 1808 <mod name="second" start="24" size="1"/> 1809 <src>A</src> 1810 <src>B</src> 1811 </ins> 1812 1813 <group name="IADD" title="Integer addition" dests="1" opcode2="0" unit="CVT"> 1814 <desc> 1815 $A + B$ with optional saturation. 1816 1817 As Valhall lacks swizzle instructions, `IADD.v2i16` with zero is the 1818 canonical lowering for swizzles. 1819 </desc> 1820 <ins name="IADD.u32" opcode="0xA0"/> 1821 <ins name="IADD.v2u16" opcode="0xA1"/> 1822 <ins name="IADD.v4u8" opcode="0xA2"/> 1823 <ins name="IADD.s32" opcode="0xA8"/> 1824 <ins name="IADD.v2s16" opcode="0xA9"/> 1825 <ins name="IADD.v4s8" opcode="0x1A2"/> 1826 <ins name="IADD.u64" opcode="0x1A3"/> 1827 <ins name="IADD.s64" opcode="0x1AB"/> 1828 <!-- <ins name="IADD.s32" opcode="0x1A0"/> --> 1829 <src widen="true">A</src> 1830 <src widen="true">B</src> 1831 <saturate/> 1832 </group> 1833 1834 <ins name="MKVEC.v2i16" title="Make 16-bit vector" dests="1" opcode="0xA1" opcode2="0x5" unit="CVT"> 1835 <desc>Calculates $A | (B \ll 16)$. Used to implement `(ushort2)(A, B)`</desc> 1836 <src swizzle="true">A</src> 1837 <src swizzle="true">B</src> 1838 </ins> 1839 1840 <group name="ISUB" title="Integer subtract" dests="1" opcode2="1" unit="CVT"> 1841 <ins name="ISUB.u32" opcode="0xA0"/> 1842 <ins name="ISUB.v2u16" opcode="0xA1"/> 1843 <ins name="ISUB.v4u8" opcode="0xA2"/> 1844 <ins name="ISUB.s32" opcode="0xA8"/> 1845 <ins name="ISUB.v2s16" opcode="0xA9"/> 1846 <ins name="ISUB.v4s8" opcode="0x1A2"/> 1847 <ins name="ISUB.u64" opcode="0x1A3"/> 1848 <ins name="ISUB.s64" opcode="0x1AB"/> 1849 <desc>$A - B$ with optional saturation</desc> 1850 <src widen="true">A</src> 1851 <src widen="true">B</src> 1852 <saturate/> 1853 </group> 1854 1855 <group name="SEG_ADD" title="Segment addition" dests="1" opcode2="6" unit="CVT"> 1856 <desc> 1857 Similar to SHADDX, but especially used for loading offsets into 1858 WLS. 
      Usually this is only required for atomic operations, which cannot
      directly use wls_pointer as an address.

      `.neg` indicates SEG_SUB instead.
    </desc>
    <ins name="SEG_ADD.u64" opcode="0x1A3"/>
    <mod name="neg" start="38" size="1"/>
    <mod name="preserve_null" start="39" size="1"/>
    <src>A</src>
    <src widen="true">B</src>
  </group>

  <group name="SHADDX" title="Shift, extend, and 64-bit add" dests="1" opcode2="7" unit="CVT">
    <desc>
      Sign or zero extend B to 64 bits, left-shift by `shift`, and add the
      64-bit value A. These instructions accelerate address arithmetic, but may
      be used in full generality for 64-bit integer arithmetic.
    </desc>
    <ins name="SHADDX.u64" opcode="0x1A3"/>
    <ins name="SHADDX.s64" opcode="0x1AB"/>
    <imm name="shift" start="20" size="3"/>
    <src>A</src>
    <src widen="true">B</src>
  </group>

  <group name="IMUL" title="Integer multiply" dests="1" opcode2="0x0A" unit="SFU">
    <ins name="IMUL.i32" opcode="0xA0"/>
    <ins name="IMUL.v2i16" opcode="0xA1"/>
    <ins name="IMUL.v4i8" opcode="0xA2"/>
    <ins name="IMUL.s32" opcode="0xA8"/>
    <ins name="IMUL.v2s16" opcode="0xA9"/>
    <ins name="IMUL.v4s8" opcode="0x1A2"/>
    <ins name="IMULD.u64" opcode="0x1A3"/>
    <!-- <ins name="IMUL.s32" opcode="0x1A0"/> -->
    <desc>
      $A \cdot B$ with optional saturation. Note that the multipliers can only
      handle up to 32-bit by 32-bit multiplies. The 64-bit "multiply" acts like
      IMUL.u32 but additionally writes the high half of the product to the high
      half of the 64-bit destination. Along with IADD.u32 and IADD.u64, this
      allows the construction of a 64-bit multiply in 5 instructions (6 clocks).
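      <!--
      The 5-instruction construction of a 64-bit multiply can be sketched as
      the Python model below. This is an illustration under assumptions, not
      the sequence any real compiler emits: the helper names imul_u32,
      imuld_u64 and mul64 are hypothetical, and the mapping of each step to an
      instruction is a guess based only on the description above. It relies on
      the identity that the low 64 bits of a product need the full low-by-low
      product plus the low 32 bits of the two cross products.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def imul_u32(a, b):
    """Model of a 32-bit multiply: only the low 32 bits of the product."""
    return (a * b) & MASK32


def imuld_u64(a, b):
    """Model of the 64-bit "multiply": full 64-bit product of 32-bit inputs."""
    return (a & MASK32) * (b & MASK32)


def mul64(a, b):
    """Low 64 bits of a * b, built from five modelled operations."""
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32
    low = imuld_u64(a_lo, b_lo)             # 1: 32x32 product, 64-bit result
    cross0 = imul_u32(a_lo, b_hi)           # 2: low half of first cross term
    cross1 = imul_u32(a_hi, b_lo)           # 3: low half of second cross term
    cross = (cross0 + cross1) & MASK32      # 4: 32-bit add of the cross terms
    return (low + (cross << 32)) & MASK64   # 5: 64-bit add into the high half
```

      The product of the two high halves never reaches the low 64 bits, which
      is why three multiplies suffice.
      -->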
1898 </desc> 1899 <src widen="true">A</src> 1900 <src widen="true">B</src> 1901 <saturate/> 1902 </group> 1903 1904 <group name="HADD" title="Integer half-add" dests="1" opcode2="0x0B" unit="CVT"> 1905 <ins name="HADD.u32" opcode="0xA0"/> 1906 <ins name="HADD.v2u16" opcode="0xA1"/> 1907 <ins name="HADD.v4u8" opcode="0xA2"/> 1908 <ins name="HADD.s32" opcode="0xA8"/> 1909 <ins name="HADD.v2s16" opcode="0xA9"/> 1910 <ins name="HADD.v4s8" opcode="0x1A2"/> 1911 <mod name="rhadd" start="30" size="1"/> 1912 <src widen="true">A</src> 1913 <src widen="true">B</src> 1914 <desc> 1915 $(A + B) \gg 1$ without intermediate overflow, corresponding to `hadd()` in 1916 OpenCL. With the `.rhadd` modifier set, it instead calculates 1917 $(A + B + 1) \gg 1$ corresponding to `rhadd()` in OpenCL. 1918 </desc> 1919 </group> 1920 1921 <group name="CLPER" title="Cross-lane permute" dests="1" opcode2="0xF" unit="SFU"> 1922 <ins name="CLPER.i32" opcode="0xA0"/> 1923 <ins name="CLPER.v2u16" opcode="0xA1"/> 1924 <ins name="CLPER.v4u8" opcode="0xA2"/> 1925 <ins name="CLPER.s32" opcode="0xA8"/> 1926 <ins name="CLPER.v2s16" opcode="0xA9"/> 1927 <ins name="CLPER.v4s8" opcode="0x1A2"/> 1928 <ins name="CLPER.u64" opcode="0x1A3"/> 1929 <ins name="CLPER.s64" opcode="0x1AB"/> 1930 <!-- <ins name="CLPER.s32" opcode="0x1A0"/> --> 1931 <desc> 1932 Selects the value of A in the subgroup lane given by B. This implements 1933 subgroup broadcasts. It may be used as a primitive for screen space 1934 derivatives in fragment shaders. 
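      <!--
      As a rough model of the lane selection described above, each lane reads
      A from the lane whose index it supplies in B. The sketch below is a
      simplification under assumptions: the function name clper is
      hypothetical, every lane is assumed active, all indices are assumed to
      be in range, and the subgroup size, lane operation and inactive result
      modifiers are ignored.

```python
def clper(a_lanes, b_lanes):
    """For every lane i, return the value of A held by lane b_lanes[i]."""
    return [a_lanes[index] for index in b_lanes]


def broadcast(a_lanes, lane):
    """Subgroup broadcast: every lane reads A from the same source lane."""
    return clper(a_lanes, [lane] * len(a_lanes))


def horizontal_difference(a_lanes):
    """Difference against the horizontal neighbour (lane index XOR 1)."""
    neighbour = clper(a_lanes, [i ^ 1 for i in range(len(a_lanes))])
    return [n - a for n, a in zip(neighbour, a_lanes)]
```

      The neighbour difference hints at how a derivative could be built on
      top of the permute. The real quad layout and sign conventions are not
      specified here.
      -->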
1935 </desc> 1936 <src>A</src> 1937 <src widen="true">B</src> 1938 <subgroup/> 1939 <lane_op/> 1940 <inactive_result/> 1941 </group> 1942 1943 <group name="FMA" title="Fused floating-point multiply add" dests="1" unit="FMA"> 1944 <ins name="FMA.f32" opcode="0xB2"/> 1945 <ins name="FMA.v2f16" opcode="0xB3"/> 1946 <desc>$A \cdot B + C$</desc> 1947 <clamp/> 1948 <src absneg="true" swizzle="true">A</src> 1949 <src absneg="true" swizzle="true">B</src> 1950 <src absneg="true" swizzle="true">C</src> 1951 </group> 1952 1953 <group name="LSHIFT_AND" title="Left shift and bitwise AND" dests="1" opcode2="0x100" unit="SFU"> 1954 <ins name="LSHIFT_AND.i32" opcode="0xB4"/> 1955 <ins name="LSHIFT_AND.v2i16" opcode="0xB5"/> 1956 <ins name="LSHIFT_AND.v4i8" opcode="0xB6"/> 1957 <ins name="LSHIFT_AND.i64" opcode="0x1B7"/> 1958 <mod name="left" start="128" size="1" implied="true"/> 1959 <desc> 1960 Left shifts its first source by a specified amount and bitwise ANDs it with the 1961 second source, optionally inverting the second source or the result. 1962 </desc> 1963 <not_result/> 1964 <src widen="true">A</src> 1965 <src lanes="true" size="8">shift</src> 1966 <src not="true">B</src> 1967 </group> 1968 1969 <group name="RSHIFT_AND" title="Right shift and bitwise AND" dests="1" opcode2="0x000" unit="SFU"> 1970 <ins name="RSHIFT_AND.i32" opcode="0xB4"/> 1971 <ins name="RSHIFT_AND.v2i16" opcode="0xB5"/> 1972 <ins name="RSHIFT_AND.v4i8" opcode="0xB6"/> 1973 <ins name="RSHIFT_AND.i64" opcode="0x1B7"/> 1974 <mod name="left" start="128" size="1" implied="true"/> 1975 <desc> 1976 Right shifts its first source by a specified amount and bitwise ANDs it with the 1977 second source, optionally inverting the second source or the result. If 1978 `signed` is set, the hardware performs an arithmetic right shift; otherwise, 1979 it performs an unsigned right shift. 
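      <!--
      The shift-then-AND behaviour, including the `signed` modifier, can be
      modelled for the 32-bit case roughly as below. This is a sketch under
      assumptions: the function and parameter names are hypothetical, the
      shift amount is assumed to lie between 0 and 31, and not_b and
      not_result stand for the optional inversions of the second source and
      of the result.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Reinterpret a 32-bit pattern as a two's complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def rshift_and_i32(a, shift, b, signed=False, not_b=False, not_result=False):
    """Right shift A, then AND with B, returning a 32-bit pattern."""
    source = to_signed32(a) if signed else (a & MASK32)
    # Python's >> is arithmetic for negative integers and logical otherwise.
    shifted = (source >> shift) & MASK32
    operand = (~b if not_b else b) & MASK32
    result = shifted & operand
    return (~result if not_result else result) & MASK32
```

      With B set to all ones the function reduces to a plain logical or
      arithmetic right shift.
      -->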
1980 </desc> 1981 <mod name="signed" start="34" size="1"/> 1982 <not_result/> 1983 <src widen="true">A</src> 1984 <src lanes="true" size="8">shift</src> 1985 <src not="true">B</src> 1986 </group> 1987 1988 <group name="LSHIFT_OR" title="Left shift and bitwise OR" dests="1" opcode2="0x101" unit="SFU"> 1989 <ins name="LSHIFT_OR.i32" opcode="0xB4"/> 1990 <ins name="LSHIFT_OR.v2i16" opcode="0xB5"/> 1991 <ins name="LSHIFT_OR.v4i8" opcode="0xB6"/> 1992 <ins name="LSHIFT_OR.i64" opcode="0x1B7"/> 1993 <mod name="left" start="128" size="1" implied="true"/> 1994 <desc> 1995 Left shifts its first source by a specified amount and bitwise ORs it with the 1996 second source, optionally inverting the second source or the result. 1997 </desc> 1998 <not_result/> 1999 <src widen="true">A</src> 2000 <src lanes="true" size="8">shift</src> 2001 <src not="true">B</src> 2002 </group> 2003 2004 <group name="RSHIFT_OR" title="Right shift and bitwise OR" dests="1" opcode2="0x001" unit="SFU"> 2005 <ins name="RSHIFT_OR.i32" opcode="0xB4"/> 2006 <ins name="RSHIFT_OR.v2i16" opcode="0xB5"/> 2007 <ins name="RSHIFT_OR.v4i8" opcode="0xB6"/> 2008 <ins name="RSHIFT_OR.i64" opcode="0x1B7"/> 2009 <mod name="left" start="128" size="1" implied="true"/> 2010 <desc> 2011 Right shifts its first source by a specified amount and bitwise ORs it with the 2012 second source, optionally inverting the second source or the result. If 2013 `signed` is set, the hardware performs an arithmetic right shift; otherwise, 2014 it performs an unsigned right shift. 
2015 </desc> 2016 <mod name="signed" start="34" size="1"/> 2017 <not_result/> 2018 <src widen="true">A</src> 2019 <src lanes="true" size="8">shift</src> 2020 <src not="true">B</src> 2021 </group> 2022 2023 <group name="LSHIFT_XOR" title="Left shift and bitwise XOR" dests="1" opcode2="0x102" unit="SFU"> 2024 <ins name="LSHIFT_XOR.i32" opcode="0xB4"/> 2025 <ins name="LSHIFT_XOR.v2i16" opcode="0xB5"/> 2026 <ins name="LSHIFT_XOR.v4i8" opcode="0xB6"/> 2027 <ins name="LSHIFT_XOR.i64" opcode="0x1B7"/> 2028 <mod name="left" start="128" size="1" implied="true"/> 2029 <desc> 2030 Left shifts its first source by a specified amount and bitwise XORs it with the 2031 second source, optionally inverting the second source or the result. 2032 </desc> 2033 <not_result/> 2034 <src widen="true">A</src> 2035 <src lanes="true" size="8">shift</src> 2036 <src not="true">B</src> 2037 </group> 2038 2039 <group name="RSHIFT_XOR" title="Right shift and bitwise XOR" dests="1" opcode2="0x002" unit="SFU"> 2040 <ins name="RSHIFT_XOR.i32" opcode="0xB4"/> 2041 <ins name="RSHIFT_XOR.v2i16" opcode="0xB5"/> 2042 <ins name="RSHIFT_XOR.v4i8" opcode="0xB6"/> 2043 <ins name="RSHIFT_XOR.i64" opcode="0x1B7"/> 2044 <mod name="left" start="128" size="1" implied="true"/> 2045 <desc> 2046 Right shifts its first source by a specified amount and bitwise XORs it with the 2047 second source, optionally inverting the second source or the result. If 2048 `signed` is set, the hardware performs an arithmetic right shift; otherwise, 2049 it performs an unsigned right shift. 2050 </desc> 2051 <mod name="signed" start="34" size="1"/> 2052 <not_result/> 2053 <src widen="true">A</src> 2054 <src lanes="true" size="8">shift</src> 2055 <src not="true">B</src> 2056 </group> 2057 2058 <ins name="MUX.i32" title="Mux" dests="1" opcode="0xB8" unit="SFU"> 2059 <desc> 2060 Mux between A and B based on the provided mask. The condition specified 2061 as the `mux` modifier is evaluated on the mask. 
      If true, `A` is chosen,
      else `B` is chosen. The `bit` modifier acts bitwise, equivalent to
      `bitselect()` in OpenCL, so `MUX.i32.bit A, B, mask` calculates
      `(A &amp; mask) | (B &amp; ~mask)`.
    </desc>
    <mod name="mux" start="32" size="2"/>
    <src>A</src>
    <src>B</src>
    <src>Mask</src>
  </ins>

  <ins name="MUX.v2i16" title="Mux" dests="1" opcode="0xB9" unit="SFU">
    <desc>
      Mux between A and B based on the provided mask. The condition specified
      as the `mux` modifier is evaluated on the mask. If true, `A` is chosen,
      else `B` is chosen. The `bit` modifier acts bitwise, equivalent to
      `bitselect()` in OpenCL, so `MUX.i32.bit A, B, mask` calculates
      `(A &amp; mask) | (B &amp; ~mask)`.
    </desc>
    <mod name="mux" start="32" size="2"/>
    <src swizzle="true">A</src>
    <src swizzle="true">B</src>
    <src swizzle="true">Mask</src>
  </ins>

  <ins name="MUX.v4i8" title="Mux" dests="1" opcode="0xBA" unit="SFU">
    <desc>
      Mux between A and B based on the provided mask. The condition specified
      as the `mux` modifier is evaluated on the mask. If true, `A` is chosen,
      else `B` is chosen. The `bit` modifier acts bitwise, equivalent to
      `bitselect()` in OpenCL, so `MUX.i32.bit A, B, mask` calculates
      `(A &amp; mask) | (B &amp; ~mask)`.
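      <!--
      A minimal model of the `bit` mode follows from the rule that A is chosen
      where the condition holds and B otherwise: each result bit comes from A
      where the mask bit is set and from B where it is clear. The function
      name mux_bit_i32 is hypothetical. In OpenCL terms this corresponds to
      bitselect(B, A, mask), since bitselect takes the value selected by set
      mask bits as its second argument.

```python
MASK32 = 0xFFFFFFFF


def mux_bit_i32(a, b, mask):
    """Bitwise mux: bits of A where mask is 1, bits of B where mask is 0."""
    return ((a & mask) | (b & ~mask)) & MASK32
```

      An all-ones mask returns A unchanged and an all-zeros mask returns B.
      -->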
2093 </desc> 2094 <mod name="mux" start="32" size="2"/> 2095 <src>A</src> 2096 <src>B</src> 2097 <src>Mask</src> 2098 </ins> 2099 2100 <ins name="CUBE_SSEL" title="Cube S-coordinate select" dests="1" opcode="0xBC" opcode2="0" unit="SFU"> 2101 <desc>During a cube map transform, select the S coordinate given a selected face.</desc> 2102 <src absneg="true">Z coordinate as 32-bit floating point</src> 2103 <src absneg="true">X coordinate as 32-bit floating point</src> 2104 <src>Cube face index</src> 2105 </ins> 2106 2107 <ins name="CUBE_TSEL" title="Cube T-coordinate select" dests="1" opcode="0xBC" opcode2="1" unit="SFU"> 2108 <desc>During a cube map transform, select the T coordinate given a selected face.</desc> 2109 <src absneg="true">Y coordinate as 32-bit floating point</src> 2110 <src absneg="true">Z coordinate as 32-bit floating point</src> 2111 <src>Cube face index</src> 2112 </ins> 2113 2114 <ins name="MKVEC.v2i8" title="Make 8-bit vector" dests="1" opcode="0xBD" unit="CVT"> 2115 <desc> 2116 Calculates $A | (B \ll 8) | (CD \ll 16)$ for 8-bit A and B and 16-bit CD. 2117 2118 To implement `(uchar4) (A, B, C, D)` in full generality, use the sequence 2119 `MKVEC.v2i8 CD, C, D, #0; MKVEC.v2i8 out, A, B, CD` 2120 2121 `MKVEC.v2i8` also allows zero extending arbitrary 8-bit lanes. For 2122 example, to extend `r0.b3` to `r1`, use `MKVEC.v2i8 r1, r0.b3, 0x0.b0, 0x0`. 
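      <!--
      The packing formula and the two-instruction `(uchar4)` sequence can be
      checked with a small model. The names mkvec_v2i8, uchar4 and
      zero_extend_byte are hypothetical, and masking CD to its low 16 bits is
      an assumption about how the third source is consumed.

```python
def mkvec_v2i8(a, b, cd):
    """A | (B << 8) | (CD << 16) for 8-bit A, B and 16-bit CD."""
    return (a & 0xFF) | ((b & 0xFF) << 8) | ((cd & 0xFFFF) << 16)


def uchar4(a, b, c, d):
    """Model of the sequence MKVEC.v2i8 CD, C, D, #0; MKVEC.v2i8 out, A, B, CD."""
    cd = mkvec_v2i8(c, d, 0)
    return mkvec_v2i8(a, b, cd)


def zero_extend_byte(word, lane):
    """Model of zero extending one 8-bit lane of a 32-bit word."""
    return mkvec_v2i8((word >> (8 * lane)) & 0xFF, 0, 0)
```

      The first call places C and D in the low 16 bits, which the second call
      then shifts into the upper half of the result.
      -->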
2123 </desc> 2124 <src lane="true">A</src> 2125 <src lane="true">B</src> 2126 <src>CD</src> 2127 </ins> 2128 2129 <ins name="CUBEFACE1" title="Cube map transform step 1" dests="1" opcode="0xC0" unit="SFU"> 2130 <desc>Select the maximum absolute value of its arguments.</desc> 2131 <src absneg="true">X coordinate as 32-bit floating point</src> 2132 <src absneg="true">Y coordinate as 32-bit floating point</src> 2133 <src absneg="true">Z coordinate as 32-bit floating point</src> 2134 </ins> 2135 2136 <ins name="CUBEFACE2" title="Cube map transform step 2" dests="1" opcode="0xC1" unit="SFU"> 2137 <desc>Select the cube face index corresponding to the arguments.</desc> 2138 <src absneg="true">X coordinate as 32-bit floating point</src> 2139 <src absneg="true">Y coordinate as 32-bit floating point</src> 2140 <src absneg="true">Z coordinate as 32-bit floating point</src> 2141 </ins> 2142 2143 <group name="IDP" title="8-bit dot product" dests="1" opcode="0xC2" unit="FMA"> 2144 <desc> 2145 8-bit integer dot product between 4 channel vectors, intended for machine 2146 learning. Available in both unsigned and signed variants, controlling 2147 sign-extension/zero-extension behaviour to the final 32-bit destination. 2148 Saturation is available. Corresponds to the `cl_arm_integer_dot_product_*` 2149 family of OpenCL extensions. Not for actual use, just for completeness. 2150 Instead, use your platform's neural accelerator. 2151 2152 For $A, B \in \{ 0, \ldots, 255 \}^4$ and $\text{Accumulator} \in 2153 \mathbb{Z}$, calculates $(A \cdot B) + \text{Accumulator}$ and optionally 2154 saturates. 
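      <!--
      The dot product with accumulator can be modelled as below. This is a
      sketch under assumptions: the function name idp_v4 is hypothetical, the
      accumulator and the result are ordinary Python integers in the signed or
      unsigned 32-bit range, and wrapping to 32 bits when saturation is off is
      assumed rather than taken from the description.

```python
def idp_v4(a, b, accumulator, signed=True, saturate=False):
    """Dot product of two packed 4x8-bit vectors plus an accumulator."""

    def lanes(word):
        values = []
        for i in range(4):
            byte = (word >> (8 * i)) & 0xFF
            if signed and byte & 0x80:
                byte -= 0x100  # sign-extend the 8-bit lane
            values.append(byte)
        return values

    total = accumulator + sum(x * y for x, y in zip(lanes(a), lanes(b)))
    if saturate:
        if signed:
            low, high = -(1 << 31), (1 << 31) - 1
        else:
            low, high = 0, (1 << 32) - 1
        return max(low, min(high, total))
    total &= 0xFFFFFFFF  # assumed wraparound without saturation
    if signed and total & 0x80000000:
        total -= 1 << 32
    return total
```

      The `signed` flag only changes how the 8-bit lanes are extended and
      which 32-bit range the result is clamped or wrapped to.
      -->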
    </desc>
    <ins name="IDP.v4s8" opcode2="0"/>
    <ins name="IDP.v4u8" opcode2="1"/>
    <src>A</src>
    <src>B</src>
    <src>Accumulator</src>
    <saturate/>
  </group>

  <group name="ICMP" title="Unsigned integer compare" dests="1" unit="CVT">
    <desc>
      Evaluates the given condition, performs a logical and/or with the
      condition in the result source (the third source), and returns the
      outcome in the given result type (integer one, integer minus one, or
      floating-point one). The third source is useful
      for chaining together conditions without intermediate bitwise arithmetic;
      when this is not desired, tie it to zero and use the OR combine mode (do
      not set the `.and` modifier).

      The sequence modifier `.seq` is used to construct 64-bit compares in 2
      `ICMP.u32` instructions, in conjunction with the `u1` result type on the
      low half, the `m1` result type on the high half, and the result of the low
      half comparison passed as the third source. For comparisons other than
      64-bit, do not set the `.seq` modifier and do not use the `u1` result
      type.
    </desc>
    <ins name="ICMP.u32" opcode="0xF0"/>
    <ins name="ICMP.v2u16" opcode="0xF1"/>
    <ins name="ICMP.v4u8" opcode="0xF2"/>
    <cmp/>
    <result_type/>
    <mod name="and" start="24" size="1"/>
    <mod name="seq" start="25" size="1"/>
    <src widen="true">A</src>
    <src widen="true">B</src>
    <src>C</src>
  </group>

  <group name="FCMP" title="Floating-point compare" dests="1" unit="CVT">
    <desc>
      Evaluates the given condition, performs a logical and/or with the
      condition in the result source (the third source), and returns the
      outcome in the given result type (integer one, integer minus one, or
      floating-point one). The third source is useful
      for chaining together conditions without intermediate bitwise arithmetic;
      when this is not desired, tie it to zero and use the OR combine mode (do
      not set the `.and` modifier).
    </desc>
    <ins name="FCMP.f32" opcode="0xF4"/>
    <ins name="FCMP.v2f16" opcode="0xF5"/>
    <cmp/>
    <result_type/>
    <mod name="and" start="24" size="1"/>
    <src absneg="true" swizzle="true">A</src>
    <src absneg="true" swizzle="true">B</src>
    <src>C</src>
  </group>

  <group name="ICMP" title="Signed integer compare" dests="1" unit="CVT">
    <desc>
      Evaluates the given condition, performs a logical and/or with the
      condition in the result source (the third source), and returns the
      outcome in the given result type (integer one, integer minus one, or
      floating-point one). The third source is useful
      for chaining together conditions without intermediate bitwise arithmetic;
      when this is not desired, tie it to zero and use the OR combine mode (do
      not set the `.and` modifier).

      The sequence modifier `.seq` is used to construct signed 64-bit compares
      in 1 `ICMP.u32` and 1 `ICMP.s32` instruction, in conjunction with the `u1`
      result type on the low half, the `m1` result type on the high half, and
      the result of the low half comparison passed as the third source. For
      comparisons other than 64-bit, do not set the `.seq` modifier and do not
      use the `u1` result type.
    </desc>
    <ins name="ICMP.s32" opcode="0xF8"/>
    <ins name="ICMP.v2s16" opcode="0xF9"/>
    <ins name="ICMP.v4s8" opcode="0xFA"/>
    <cmp/>
    <result_type/>
    <mod name="and" start="24" size="1"/>
    <mod name="seq" start="25" size="1"/>
    <src widen="true">A</src>
    <src widen="true">B</src>
    <src>C</src>
  </group>

  <ins name="IADD_IMM.i32" title="Integer addition with immediate" dests="1" opcode="0x110" unit="CVT">
    <desc>
      Adds an arbitrary 32-bit immediate embedded within the instruction stream.
      If no modifiers are required, this is preferred to `IADD.i32` with a
      constant accessed as a uniform. However, if the constant is available
      inline, `IADD.i32` is preferred.
2245 2246 `IADD_IMM.i32` with the source tied to zero is the canonical immediate move. 2247 </desc> 2248 <src>A</src> 2249 <imm name="constant" start="8" size="32"/> 2250 </ins> 2251 2252 <ins name="IADD_IMM.v2i16" title="Integer addition with immediate" dests="1" opcode="0x111" unit="CVT"> 2253 <desc> 2254 Adds an arbitrary pair of 16-bit immediates embedded within the 2255 instruction stream. If no modifiers are required, this is preferred to 2256 `IADD.v2i16` with a constant accessed as a uniform. However, if the 2257 constant is available inline, `IADD.v2i16` is preferred. Adding only a 2258 single 16-bit constant requires replication of the constant. 2259 </desc> 2260 <src>A</src> 2261 <imm name="constant" start="8" size="32"/> 2262 </ins> 2263 2264 <ins name="IADD_IMM.v4i8" title="Integer addition with immediate" dests="1" opcode="0x112" unit="CVT"> 2265 <desc> 2266 Adds an arbitrary quad of 8-bit immediates embedded within the 2267 instruction stream. If no modifiers are required, this is preferred to 2268 `IADD.v4i8` with a constant accessed as a uniform. However, if the 2269 constant is available inline, `IADD.v4i8` is preferred. Adding only a 2270 single 8-bit constant requires replication of the constant. 2271 </desc> 2272 <src>A</src> 2273 <imm name="constant" start="8" size="32"/> 2274 </ins> 2275 2276 <ins name="FADD_IMM.f32" title="Floating-point addition with immediate" dests="1" opcode="0x114" unit="FMA"> 2277 <desc> 2278 Adds an arbitrary 32-bit immediate embedded within the instruction stream. 2279 If no modifiers are required, this is preferred to `FADD.f32` with a 2280 constant accessed as a uniform. However, if the constant is available 2281 inline, `FADD.f32` is preferred. 
2282 </desc> 2283 <src>A</src> 2284 <imm name="constant" start="8" size="32"/> 2285 </ins> 2286 2287 <ins name="FADD_IMM.v2f16" title="Floating-point addition with immediate" dests="1" opcode="0x115" unit="FMA"> 2288 <desc> 2289 Adds an arbitrary pair of 16-bit immediates embedded within the 2290 instruction stream. If no modifiers are required, this is preferred to 2291 `FADD.v2f16` with a constant accessed as a uniform. However, if the 2292 constant is available inline, `FADD.v2f16` is preferred. Adding only a 2293 single 16-bit constant requires replication of the constant. 2294 </desc> 2295 <src float="true">A</src> 2296 <imm name="constant" start="8" size="32"/> 2297 </ins> 2298 2299 <ins name="ATOM1_RETURN.i32" title="Atomic operations on memory with 1" opcode="0x69" opcode2="3" unit="LS"> 2300 <slot/> 2301 <sr_count/> 2302 <atom_opc_1/> 2303 <mod name="memory_width" start="128" size="1" implied="true"/> 2304 2305 <!-- Optional for ATOM1.i32, in which sr_count must be 0 --> 2306 <sr write="true"/> 2307 <src size="64">64-bit address to operate on</src> 2308 <imm name="offset" start="8" size="8"/> 2309 </ins> 2310 2311 <ins name="ATOM1_RETURN.i64" title="Atomic operations on memory with 1" opcode="0x69" opcode2="5" unit="LS"> 2312 <slot/> 2313 <sr_count/> 2314 <atom_opc_1/> 2315 <mod name="memory_width" start="128" size="1" implied="true"/> 2316 2317 <!-- Optional for ATOM1.i64, in which sr_count must be 0 --> 2318 <sr write="true"/> 2319 <src size="64">64-bit address to operate on</src> 2320 <imm name="offset" start="8" size="8"/> 2321 </ins> 2322 2323 <ins name="ATOM.i32" title="Atomic operations on memory" opcode="0x68" opcode2="3" unit="LS"> 2324 <slot/> 2325 <sr_count/> 2326 <atom_opc/> 2327 <mod name="memory_width" start="128" size="1" implied="true"/> 2328 2329 <sr read="true"/> 2330 <src size="64">64-bit address to operate on</src> 2331 <imm name="offset" start="8" size="8"/> 2332 </ins> 2333 2334 <ins name="ATOM.i64" title="Atomic operations on memory" 
opcode="0x68" opcode2="5" unit="LS"> 2335 <slot/> 2336 <sr_count/> 2337 <atom_opc/> 2338 <mod name="memory_width" start="128" size="1" implied="true"/> 2339 2340 <sr read="true"/> 2341 <src size="64">64-bit address to operate on</src> 2342 <imm name="offset" start="8" size="8"/> 2343 </ins> 2344 2345 <ins name="ATOM_RETURN.i32" title="Atomic operations on memory" opcode="0x120" opcode2="3" unit="LS"> 2346 <slot/> 2347 <sr_count/> 2348 <sr_write_count/> 2349 2350 <!-- Only valid with .xchg to implement ACMPXCHG --> 2351 <mod name="compare" start="26" size="1"/> 2352 2353 <atom_opc/> 2354 <mod name="memory_width" start="128" size="1" implied="true"/> 2355 2356 <sr write="true" flags="false"/> 2357 <sr read="true" flags="rw"/> 2358 <src size="64">64-bit address to operate on</src> 2359 <imm name="offset" start="8" size="8"/> 2360 </ins> 2361 2362 <ins name="ATOM_RETURN.i64" title="Atomic operations on memory" opcode="0x120" opcode2="5" unit="LS"> 2363 <slot/> 2364 <sr_count/> 2365 <sr_write_count/> 2366 <mod name="compare" start="26" size="1"/> 2367 <atom_opc/> 2368 <mod name="memory_width" start="128" size="1" implied="true"/> 2369 2370 <sr write="true" flags="false"/> 2371 <sr read="true" flags="rw"/> 2372 <src size="64">64-bit address to operate on</src> 2373 <imm name="offset" start="8" size="8"/> 2374 </ins> 2375 2376 <ins name="TEX_FETCH" title="Texel fetch" opcode="0x125" unit="T"> 2377 <desc>Unfiltered textured instruction.</desc> 2378 <slot/> 2379 <skip/> 2380 <register_type/> 2381 <register_width/> 2382 <write_mask/> 2383 <dimension/> 2384 <wide_indices/> 2385 <array_enable/> 2386 <texel_offset/> 2387 2388 <!-- Leave secondary_register_width as 0 --> 2389 <sr_count/> 2390 <sr_write_count/> 2391 2392 <sr write="true" flags="false"/> 2393 <sr read="true" flags="false"/> 2394 <src size="64">Image to read from</src> 2395 </ins> 2396 2397 <ins name="TEX_SINGLE" title="Texture load" opcode="0x128" unit="T"> 2398 <desc>Ordinary texturing instruction using a 
sampler.</desc> 2399 <slot/> 2400 <skip/> 2401 <register_type/> 2402 <register_width/> 2403 <write_mask/> 2404 <dimension/> 2405 <wide_indices/> 2406 <array_enable/> 2407 <texel_offset/> 2408 <shadow/> 2409 <lod_mode/> 2410 2411 <!-- Leave secondary_register_width as 0 --> 2412 <sr_count/> 2413 <sr_write_count/> 2414 2415 <sr write="true" flags="false"/> 2416 <sr read="true" flags="false"/> 2417 <src size="64">Image to read from</src> 2418 </ins> 2419 2420 <ins name="TEX_GATHER" title="Texel gather" opcode="0x129" unit="T"> 2421 <desc>Texture gather instruction.</desc> 2422 <slot/> 2423 <skip/> 2424 <register_type/> 2425 <register_width/> 2426 <write_mask/> 2427 <dimension/> 2428 <wide_indices/> 2429 <array_enable/> 2430 <texel_offset/> 2431 <integer_coordinates/> 2432 <fetch_component/> 2433 <shadow/> 2434 2435 <!-- Leave secondary_register_width as 0 --> 2436 <sr_count/> 2437 <sr_write_count/> 2438 2439 <sr write="true" flags="false"/> 2440 <sr read="true" flags="false"/> 2441 <src size="64">Image to read from</src> 2442 </ins> 2443 2444 <ins name="TEX_DUAL" title="Dual texture" opcode="0x12F" unit="T"> 2445 <desc>Pair of texture instructions.</desc> 2446 <slot/> 2447 <skip/> 2448 <register_type/> 2449 <register_width/> 2450 <secondary_register_width/> 2451 <write_mask/> 2452 <dimension/> 2453 <wide_indices/> 2454 <array_enable/> 2455 <texel_offset/> 2456 <shadow/> 2457 <lod_mode/> 2458 2459 <sr_count/> 2460 <sr_write_count/> 2461 2462 <sr write="true" flags="false"/> 2463 <sr read="true" flags="false"/> 2464 <src size="64">Image to read from</src> 2465 </ins> 2466 2467 <ins name="VAR_TEX_BUF_SINGLE" title="Fused varying-texturing" opcode="0x130" unit="VT"> 2468 <desc> 2469 Only works for FP32 varyings. Performance characteristics are similar 2470 to LD_VAR_BUF_IMM_F32.v2.f32 followed by TEX, using both V and T units. 
2471 </desc> 2472 <slot/> 2473 <skip/> 2474 <sample_and_update/> 2475 <register_type/> 2476 <vartex_register_width/> 2477 <dimension/> 2478 <array_enable/> 2479 <shadow/> 2480 <lod_mode/> 2481 2482 <sr_write_count/> 2483 2484 <sr write="true"/> 2485 <src size="64">Image to read from</src> 2486 <src>Varying offset</src> 2487 </ins> 2488 2489 <ins name="VAR_TEX_BUF_GATHER" title="Fused varying-texturing" opcode="0x131" unit="VT"> 2490 <desc> 2491 Only works for FP32 varyings. Performance characteristics are similar 2492 to LD_VAR_BUF_IMM_F32.v2.f32 followed by TEX, using both V and T units. 2493 </desc> 2494 <slot/> 2495 <skip/> 2496 <sample_and_update/> 2497 <register_type/> 2498 <vartex_register_width/> 2499 <dimension/> 2500 <array_enable/> 2501 <integer_coordinates/> 2502 <fetch_component/> 2503 <shadow/> 2504 2505 <sr_write_count/> 2506 2507 <sr write="true"/> 2508 <src size="64">Image to read from</src> 2509 <src>Varying offset</src> 2510 </ins> 2511 2512 <ins name="VAR_TEX_BUF_GRADIENT" title="Fused varying-texturing" opcode="0x132" unit="VT"> 2513 <desc> 2514 Only works for FP32 varyings. Performance characteristics are similar 2515 to LD_VAR_BUF_IMM_F32.v2.f32 followed by TEX, using both V and T units. 2516 </desc> 2517 <slot/> 2518 <skip/> 2519 <sample_and_update/> 2520 <register_type/> 2521 <vartex_register_width/> 2522 <dimension/> 2523 <array_enable/> 2524 <shadow/> 2525 <lod_bias_disable/> 2526 <lod_clamp_disable/> 2527 2528 <sr_write_count/> 2529 2530 <sr write="true"/> 2531 <src size="64">Image to read from</src> 2532 <src>Varying offset</src> 2533 </ins> 2534 2535 <ins name="VAR_TEX_BUF_DUAL" title="Fused varying-texturing" opcode="0x137" unit="VT"> 2536 <desc> 2537 Only works for FP32 varyings. Performance characteristics are similar 2538 to LD_VAR_BUF_IMM_F32.v2.f32 followed by TEX_DUAL, using both V and T units. 
2539 </desc> 2540 <slot/> 2541 <skip/> 2542 <sample_and_update/> 2543 <register_type/> 2544 <vartex_register_width/> 2545 <dimension/> 2546 <array_enable/> 2547 <shadow/> 2548 <lod_mode/> 2549 2550 <sr_write_count/> 2551 2552 <sr write="true"/> 2553 <src size="64">Image to read from</src> 2554 <src>Varying offset</src> 2555 </ins> 2556 2557 <ins name="VAR_TEX_SINGLE" title="Fused varying-texturing" opcode="0x138" unit="VT"> 2558 <desc> 2559 Only works for FP32 varyings. Performance characteristics are similar 2560 to LD_VAR_IMM_F32.v2.f32 followed by TEX, using both V and T units. 2561 </desc> 2562 <slot/> 2563 <skip/> 2564 <sample_and_update/> 2565 <register_type/> 2566 <vartex_register_width/> 2567 <dimension/> 2568 <array_enable/> 2569 <shadow/> 2570 <lod_mode/> 2571 2572 <sr_write_count/> 2573 2574 <sr write="true"/> 2575 <src size="64">Image to read from</src> 2576 <src>Varying offset</src> 2577 </ins> 2578 2579 <ins name="VAR_TEX_GATHER" title="Fused varying-texturing" opcode="0x139" unit="VT"> 2580 <desc> 2581 Only works for FP32 varyings. Performance characteristics are similar 2582 to LD_VAR_IMM_F32.v2.f32 followed by TEX, using both V and T units. 2583 </desc> 2584 <slot/> 2585 <skip/> 2586 <sample_and_update/> 2587 <register_type/> 2588 <vartex_register_width/> 2589 <dimension/> 2590 <array_enable/> 2591 <integer_coordinates/> 2592 <fetch_component/> 2593 <shadow/> 2594 2595 <sr_write_count/> 2596 2597 <sr write="true"/> 2598 <src size="64">Image to read from</src> 2599 <src>Varying offset</src> 2600 </ins> 2601 2602 <ins name="VAR_TEX_GRADIENT" title="Fused varying-texturing" opcode="0x13A" unit="VT"> 2603 <desc> 2604 Only works for FP32 varyings. Performance characteristics are similar 2605 to LD_VAR_IMM_F32.v2.f32 followed by TEX, using both V and T units. 
2606 </desc> 2607 <slot/> 2608 <skip/> 2609 <sample_and_update/> 2610 <register_type/> 2611 <vartex_register_width/> 2612 <dimension/> 2613 <array_enable/> 2614 <shadow/> 2615 <lod_bias_disable/> 2616 <lod_clamp_disable/> 2617 2618 <sr_write_count/> 2619 2620 <sr write="true"/> 2621 <src size="64">Image to read from</src> 2622 <src>Varying offset</src> 2623 </ins> 2624 2625 <ins name="VAR_TEX_DUAL" title="Fused varying-texturing" opcode="0x13F" unit="VT"> 2626 <desc> 2627 Only works for FP32 varyings. Performance characteristics are similar 2628 to LD_VAR_IMM_F32.v2.f32 followed by TEX_DUAL, using both V and T units. 2629 </desc> 2630 <slot/> 2631 <skip/> 2632 <sample_and_update/> 2633 <register_type/> 2634 <vartex_register_width/> 2635 <dimension/> 2636 <array_enable/> 2637 <shadow/> 2638 <lod_mode/> 2639 2640 <sr_write_count/> 2641 2642 <sr write="true"/> 2643 <src size="64">Image to read from</src> 2644 <src>Varying offset</src> 2645 </ins> 2646 2647 <ins name="FMA_RSCALE.f32" title="Fused floating-point multiply add with exponent bias" dests="1" opcode="0x160" unit="FMA"> 2648 <desc> 2649 First calculates $A \cdot B + C$ and then biases the exponent by D. Used in 2650 special transcendental function sequences. It should not be used for 2651 general code as its special case handling differs from two back-to-back 2652 `FMA.f32` operations. Equivalent to `FMA.f32` back-to-back with 2653 `LDEXP.f32` 2654 </desc> 2655 <clamp/> 2656 <src absneg="true">A</src> 2657 <src absneg="true">B</src> 2658 <src absneg="true">C</src> 2659 <src>D</src> 2660 </ins> 2661 2662 <ins name="FMA_RSCALE_N.f32" title="Fused floating-point multiply add with exponent bias and zero override" dests="1" opcode="0x161" unit="FMA"> 2663 <desc> 2664 First calculates $A \cdot B + C$ and then biases the exponent by D. If $A 2665 = 0$ or $B = 0$, the multiply $A \cdot B$ is treated as zero even if an 2666 ordinary multiply would return NaN. Used in special transcendental 2667 function sequences. 
It should not be used for general code as its special 2668 case handling differs from two back-to-back `FMA.f32` operations. 2669 Equivalent to `FMA.f32` back-to-back with `LDEXP.f32` 2670 </desc> 2671 <clamp/> 2672 <src absneg="true">A</src> 2673 <src absneg="true">B</src> 2674 <src absneg="true">C</src> 2675 <src>D</src> 2676 </ins> 2677 2678 <ins name="FMA_RSCALE_LEFT.f32" title="Fused floating-point multiply add with exponent bias and asymmetric zero handling" dests="1" opcode="0x162" unit="FMA"> 2679 <desc> 2680 First calculates $A \cdot B + C$ and then biases the exponent by D. If $A 2681 = 0$ or $B = 0$, the multiply is treated as $A$ even if an 2682 ordinary multiply would return NaN. Used in special transcendental 2683 function sequences. It should not be used for general code as its special 2684 case handling differs from two back-to-back `FMA.f32` operations. 2685 Equivalent to `FMA.f32` back-to-back with `LDEXP.f32` 2686 </desc> 2687 <clamp/> 2688 <src absneg="true">A</src> 2689 <src absneg="true">B</src> 2690 <src absneg="true">C</src> 2691 <src>D</src> 2692 </ins> 2693 2694 <ins name="FMA_RSCALE_SCALE16.f32" title="Fused floating-point multiply add with 16-bit exponent bias" dests="1" opcode="0x163" unit="FMA"> 2695 <desc> 2696 First calculates $A \cdot B + C$ and then biases the exponent by D, 2697 interpreted as a 16-bit value. Used in special transcendental function 2698 sequences. It should not be used for general code as its special case 2699 handling differs from two back-to-back `FMA.f32` operations. Equivalent 2700 to `FMA.f32` back-to-back with `LDEXP.f32` 2701 </desc> 2702 <clamp/> 2703 <src absneg="true">A</src> 2704 <src absneg="true">B</src> 2705 <src absneg="true">C</src> 2706 <src>D</src> 2707 </ins> 2708 2709</valhall> 2710