cat3 src1 and src2, some parts are similar to cat2/cat4 src encoding, but with a few extra bits trimmed out to squeeze in the 3rd src register (dropping (abs), immed encoding, and moving a few other bits elsewhere)

{HALF}{SRC}
00000
src

{IMMED_ENCODING}
{IMMED}
1
{HALF}c{CONST}.{SWIZ}
10
src->num >> 2
src->num & 0x3
extract_reg_uim(src)

01
src->array.offset

{HALF}r<a0.x + {OFFSET}>
0

{HALF}c<a0.x + {OFFSET}>
1

{SY}{SS}{JP}{SAT}(nop{NOP}) {UL}{NAME} {DST_HALF}{DST}, {SRC1_NEG}{SRC1}, {SRC2_NEG}{HALF}{SRC2}, {SRC3_NEG}{SRC3}
{SY}{SS}{JP}{SAT}{REPEAT}{UL}{NAME} {DST_HALF}{DST}, {SRC1_NEG}{SRC1_R}{SRC1}, {SRC2_NEG}{SRC2_R}{HALF}{SRC2}, {SRC3_NEG}{SRC3_R}{SRC3}
011
!!(src->srcs[0]->flags & (IR3_REG_FNEG | IR3_REG_SNEG | IR3_REG_BNOT))
extract_SRC1_R(src)
extract_SRC2_R(src)
!!(src->srcs[2]->flags & IR3_REG_R)
!!(src->srcs[1]->flags & (IR3_REG_FNEG | IR3_REG_SNEG | IR3_REG_BNOT))
!!(src->srcs[2]->flags & (IR3_REG_FNEG | IR3_REG_SNEG | IR3_REG_BNOT))
src->srcs[0]

0
The source precision is determined by the instruction opcode. If {DST_CONV} the result is widened/narrowed to the opposite precision.
((src->dsts[0]->num >> 2) == 62) ? 0 : !!((src->srcs[0]->flags ^ src->dsts[0]->flags) & IR3_REG_HALF)

The difference is that this cat3 version does not support plain const registers as src1/src3 but does support immediate values. On the other hand, it still supports relative gpr and consts.
1
src->srcs[2]
!(src->srcs[1]->flags & IR3_REG_HALF)
((src->dsts[0]->num >> 2) == 62) ? 0 : !!((src->srcs[1]->flags ^ src->dsts[0]->flags) & IR3_REG_HALF)

0000
0001
0010
0011
0100
0101
0110
0111
1000
1001
1010
1011
1100
1101
1110
1111

(src2 >> src1) & src3
1000

(src2 << src1) & src3
1001

(src2 >> src1) | src3
1010

(src2 << src1) | src3
1011

(src2 & src1) | src3
1100

{SY}{SS}{JP}{SAT}(nop{NOP}) {UL}{NAME}{SRC_SIGN}{SRC_PACK} {DST}, {SRC1}, {SRC2}, {SRC3_NEG}{SRC3}
1
src->srcs[2]
src->cat3.signedness
src->cat3.packed

Given:
  SRC1 is an i8vec2 or u8vec2
  SRC2 is a u8vec2
  SRC1 and SRC2 are packed into low or high halves of the registers.
  SRC3 is an int32_t or uint32_t
Do:
  DST = dot(SRC1, SRC2) + SRC3
0
1101

Same as dp2acc but for vec4 instead of vec2. Corresponds to packed variants of OpUDotKHR and OpSUDotKHR.
1
1101

(!{DST_FULL})
1
src->srcs[2]
!(src->srcs[0]->flags & IR3_REG_HALF)
((src->dsts[0]->num >> 2) == 62) ? 1 : !(src->dsts[0]->flags & IR3_REG_HALF)

Given:
  SRC1 = (x_1, x_2, x_3, x_4) - 4 consecutive registers
  SRC2 = (y_1, y_2, y_3, y_4) - 4 consecutive registers
  SRC3 is an immediate in range of [0, 160]
Do:
  float y_sum = y_1 + y_2 + y_3 + y_4
  vec4 result = (x_1 * y_sum, x_2 * y_sum, x_3 * y_sum, x_4 * y_sum)
  Starting from DST reg duplicate *result* into consecutive registers (1 << (SRC3 / 32)) times.
0
1110

Same as wmm, but instead of overwriting DST the result is added to the DST registers; however, the first reg of the result is always overwritten.
1
1110
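
The wmm description above can be read as the following C sketch. It is only a software model under stated assumptions: registers are represented as a plain float array, the names wmm_copies and wmm_model are hypothetical and do not come from the encoding, and each duplicate of *result* is assumed to occupy four consecutive registers. The accumulating variant is not modeled.

```c
#include <assert.h>

/* Hypothetical helper: number of times the vec4 result is duplicated,
 * taken directly from the documented formula (1 << (SRC3 / 32)). */
unsigned wmm_copies(unsigned src3)
{
    assert(src3 <= 160);          /* documented immediate range is [0, 160] */
    return 1u << (src3 / 32);     /* integer division, so 0..31 -> 1 copy */
}

/* Hypothetical model of wmm on a float "register file".
 * x and y stand for the 4 consecutive registers of SRC1 and SRC2.
 * Assumption: every copy takes 4 consecutive registers, so dst must have
 * room for 4 * wmm_copies(src3) floats. */
void wmm_model(float *dst, const float x[4], const float y[4], unsigned src3)
{
    float y_sum = y[0] + y[1] + y[2] + y[3];
    unsigned copies = wmm_copies(src3);

    for (unsigned c = 0; c < copies; c++) {
        for (unsigned i = 0; i < 4; i++)
            dst[4 * c + i] = x[i] * y_sum;   /* result[i] = x_i * y_sum */
    }
}
```

With SRC3 = 32 the model writes two copies, so calling wmm_model(dst, x, y, 32) needs a dst of at least 8 floats; at the top of the range, SRC3 = 160 gives 32 copies.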