Allow early z-test and early-lrz (if applicable)
Disable early z-test and early-lrz test (if applicable)
A special mode that allows early-lrz test but disables
early-z test. This might sound a bit odd, since lrz-test
happens before z-test, but as long as a couple of
conditions are maintained this allows using lrz-test in
cases where the fragment shader has kill/discard:
1) Disable lrz-write in cases where it is uncertain during
binning pass that a fragment will pass. Ie. if frag
shader has-kill, writes-z, or alpha/stencil test is
enabled. (For correctness, lrz-write must be disabled
when blend is enabled.) This is analogous to how a
z-prepass works.
2) Disable both lrz-write and lrz-test if a depth-test
direction reversal is detected.
Due to condition (1), the contents of the lrz buffer are a
conservative estimation of the depth buffer during the draw
pass. This means that geometry that we know for certain will
not be visible will not pass lrz-test, but geometry which may
be visible (or contributes to blend) will pass the lrz-test.
This allows us to keep early-lrz-test in cases where the frag
shader does not write-z (ie. we know the z-value before FS)
and does not have side-effects (image/ssbo writes, etc), but
does have kill/discard. Which turns out to be a common
enough case that it is useful to keep early-lrz test against
the conservative lrz buffer to discard fragments that we
know will definitely not be visible.
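The mode selection and condition (1) above can be sketched roughly as follows. This is only an illustration: the enum, the struct and both function names are hypothetical, and sending shaders that write z or have side-effects to the fully-late mode is an assumption that the text above does not spell out.

```c
#include <stdbool.h>

/* Hypothetical names for the three modes described above. */
enum z_mode {
    Z_MODE_EARLY,            /* early z-test and early-lrz allowed */
    Z_MODE_LATE,             /* early z-test and early-lrz disabled */
    Z_MODE_EARLY_LRZ_LATE_Z, /* early-lrz test only, z-test is late */
};

/* Hypothetical summary of the state that matters for the decision. */
struct draw_state {
    bool fs_has_kill;
    bool fs_writes_z;
    bool fs_has_side_effects; /* image/ssbo writes, etc */
    bool alpha_test;
    bool stencil_test;
    bool blend;
};

enum z_mode pick_z_mode(const struct draw_state *s)
{
    /* Assumption: z-value unknown before FS, or side-effects present,
     * means everything has to be late. */
    if (s->fs_writes_z || s->fs_has_side_effects)
        return Z_MODE_LATE;
    /* kill/discard only: keep testing against the conservative lrz buffer */
    if (s->fs_has_kill)
        return Z_MODE_EARLY_LRZ_LATE_Z;
    return Z_MODE_EARLY;
}

/* Condition (1): only write lrz when the fragment is certain to pass. */
bool lrz_write_allowed(const struct draw_state *s)
{
    return !(s->fs_has_kill || s->fs_writes_z || s->alpha_test ||
             s->stencil_test || s->blend);
}
```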
b0..7 seems to contain the size of buffered but not yet processed
RB level cmdstream. It's possible that it is a low threshold
and b8..15 is a high threshold?
b16..23 identifies where IB1 data starts (and RB data ends?)
b24..31 identifies where IB2 data starts (and IB1 data ends)
low bits identify where CP_SET_DRAW_STATE stateobj
processing starts (and IB2 data ends). I'm guessing
b8 is part of this since (from downstream kgsl):
/* ROQ sizes are twice as big on a640/a680 than on a630 */
if (adreno_is_a640(adreno_dev) || adreno_is_a680(adreno_dev)) {
kgsl_regwrite(device, A6XX_CP_ROQ_THRESHOLDS_2, 0x02000140);
kgsl_regwrite(device, A6XX_CP_ROQ_THRESHOLDS_1, 0x8040362C);
} ...
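As a rough aid for reading the byte fields described for the first thresholds register, the sketch below splits a 32-bit value into its four bytes. The struct, its field names and the function name are hypothetical, and the meaning attached to each byte is only the guess recorded above.

```c
#include <stdint.h>

/* Hypothetical field names; the meanings are guesses, see the notes above. */
struct roq_thresholds_1 {
    uint8_t rb_low;    /* b0..7: buffered RB cmdstream (low threshold?) */
    uint8_t rb_high;   /* b8..15: possibly a high threshold */
    uint8_t ib1_start; /* b16..23: where IB1 data starts */
    uint8_t ib2_start; /* b24..31: where IB2 data starts */
};

struct roq_thresholds_1 decode_roq_thresholds_1(uint32_t val)
{
    struct roq_thresholds_1 t;

    t.rb_low    = (uint8_t)(val & 0xff);
    t.rb_high   = (uint8_t)((val >> 8) & 0xff);
    t.ib1_start = (uint8_t)((val >> 16) & 0xff);
    t.ib2_start = (uint8_t)((val >> 24) & 0xff);
    return t;
}
```

With the a640/a680 value quoted from kgsl, 0x8040362C, the four bytes come out as 0x2c, 0x36, 0x40 and 0x80.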
number of remaining dwords incl current dword being consumed?
number of remaining dwords incl current dword being consumed?
number of dwords that have already been read but haven't been consumed by $addr
Configures the mapping between VSC_PIPE buffer and
bin, X/Y specify the bin index in the horiz/vert
direction (0,0 is upper left, 0,1 is leftmost bin
on second row, and so on). W/H specify the number
of bins assigned to this VSC_PIPE in the horiz/vert
dimension.
Seems to be a bitmap of which tiles mapped to the VSC
pipe contain geometry.
I suppose we can connect a maximum of 32 tiles to a
single VSC pipe.
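A small sketch of how the X/Y/W/H fields and the 32-tile limit fit together. It assumes that X/Y name the upper-left bin covered by the pipe and that the bitmap is a single 32-bit word; both are assumptions, and the struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the X/Y/W/H fields, in units of bins. */
struct vsc_pipe_cfg {
    uint32_t x, y; /* assumed: first (upper-left) bin of the pipe */
    uint32_t w, h; /* number of bins horizontally/vertically */
};

/* Does bin (bx, by) fall inside the rectangle assigned to this pipe? */
bool vsc_pipe_contains_bin(const struct vsc_pipe_cfg *p,
                           uint32_t bx, uint32_t by)
{
    return bx >= p->x && bx < p->x + p->w &&
           by >= p->y && by < p->y + p->h;
}

/* A 32-bit bitmap can only describe up to 32 bins per pipe. */
bool vsc_pipe_fits_bitmap(const struct vsc_pipe_cfg *p)
{
    return (uint64_t)p->w * p->h <= 32;
}
```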
Has the size of data written to corresponding VSC_PRIM_STRM
buffer.
Has the size of data written to corresponding VSC pipe, ie.
same thing that is written out to VSC_DRAW_STRM_SIZE_ADDRESS_LO/HI
In addition to FLUSH_PER_OVERLAP, guarantee that UCHE
and CCU don't get out of sync when fetching the previous
value for the current pixel. With NO_FLUSH, there's the
possibility that the flags for the current pixel are
flushed before the data or vice-versa, leading to
texture fetches via UCHE getting out-of-sync values.
This mode should eliminate that. It's used in bypass
mode for coherent blending
(GL_KHR_blend_equation_advanced_coherent) as well as
non-coherent blending.
Invalidate UCHE and wait for any pending work to finish
if there was possibly an overlapping primitive prior to
the current one. This is similar to a combination of
GRAS_SC_CONTROL::INJECT_L2_INVALIDATE_EVENT and
WAIT_RB_IDLE_ALL_TRI on a3xx. It's used in GMEM mode for
coherent blending
(GL_KHR_blend_equation_advanced_coherent).
LRZ write also disabled for blend/etc.
update MAX instead of MIN value, ie. GL_GREATER/GL_GEQUAL
Z_READ_ENABLE bit is set for zfunc other than GL_ALWAYS or GL_NEVER
also set when Z_BOUNDS_ENABLE is set
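The Z_READ_ENABLE rule above, written out as a predicate. The enum is a hypothetical stand-in for the hardware compare-function encoding and is not meant to match the real values.

```c
#include <stdbool.h>

/* Hypothetical compare-function enum, values are illustrative only. */
enum zfunc {
    ZFUNC_NEVER,
    ZFUNC_LESS,
    ZFUNC_EQUAL,
    ZFUNC_LEQUAL,
    ZFUNC_GREATER,
    ZFUNC_NOTEQUAL,
    ZFUNC_GEQUAL,
    ZFUNC_ALWAYS,
};

/* Depth must be read if the compare depends on the stored value,
 * or if the depth-bounds test is enabled. */
bool z_read_enable(enum zfunc func, bool z_bounds_enable)
{
    return z_bounds_enable ||
           (func != ZFUNC_ALWAYS && func != ZFUNC_NEVER);
}
```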
For clearing depth/stencil
1 - depth
2 - stencil
3 - depth+stencil
For clearing color buffer:
then probably a component mask, I always see 0xf
num of varyings plus four for gl_Position (plus one if gl_PointSize)
plus # of transform-feedback (streamout) varyings if using the
hw streamout (rather than stg instructions in shader)
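The count described above is a plain sum. The sketch below assumes everything is counted in scalar components (hence four for gl_Position); the function and parameter names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: all counts are in scalar components. */
uint32_t vpc_output_count(uint32_t num_varyings, bool writes_point_size,
                          uint32_t num_hw_streamout)
{
    uint32_t n = num_varyings + 4; /* four for gl_Position */

    if (writes_point_size)
        n += 1;                    /* one for gl_PointSize */

    /* only counted when the hw streamout path is used */
    return n + num_hw_streamout;
}
```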
The number of extra copies of POSITION, i.e.
number of views minus one when multi-position
output is enabled, otherwise 0.
This VPC location will be overwritten with
ViewID when multiview is enabled. It's used when
fragment shaders read ViewID. It's only
strictly required for multi-position output,
where the same VS invocation is used for all the
views at once, but it can be used when multi-pos
output is disabled too, to avoid having to pass
ViewID through the VS.
num of varyings plus four for gl_Position (plus one if gl_PointSize)
plus # of transform-feedback (streamout) varyings if using the
hw streamout (rather than stg instructions in shader)
geometry shader
size in vec4s of per-primitive storage for gs. TODO: not actually in VPC
Multi-position output lets the last geometry
stage shader write multiple copies of
gl_Position. If disabled then the VS is run once
for each view, and ViewID is passed as a
register to the VS.
Possibly not really "initiating" the draw but the layout is similar
to VGT_DRAW_INITIATOR on older gens
Written by CP_SET_VISIBILITY_OVERRIDE handler
This is the ID of the current patch within the
subdraw, used to calculate the offset of the
patch within the HS->DS buffers. When a draw is
split into multiple subdraws then this differs
from gl_PrimitiveID on the second, third, etc.
subdraws.
The size of memory that ldp/stp can address.
Seems to be the same as a3xx. The maximum stack
size is in units of 4 calls, so a call depth of 7
would result in a value of 2.
TODO: What's the actual size per call, i.e. the
size of the PC? a3xx docs say it's 16 bits
there, but the length register now takes 28 bits
so it's probably been bumped to 32 bits.
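The "units of 4 calls" encoding amounts to a round-up division, as in this minimal sketch (the function name is hypothetical):

```c
#include <stdint.h>

/* Round the maximum call depth up to a whole number of 4-call units. */
uint32_t hw_stack_size_units(uint32_t max_call_depth)
{
    return (max_call_depth + 3) / 4;
}
```

A call depth of 7 gives 2, matching the example above.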
There are four indices used to compute the
private memory location for an access:
- stp/ldp offset
- fiber id
- wavefront id (a swizzled version of what "getwid" returns)
- SP ID (the same as what "getspid" returns)
The stride for the SP ID is always set by
TOTALPVTMEMSIZE. In the per-wave layout, the
indices are used in this order:
- offset % 4 (offset within dword)
- fiber id
- offset / 4
- wavefront id
- SP ID
and the stride for the wavefront ID is
MEMSIZEPERITEM, multiplied by 128 (fibers per
wavefront). In the per-fiber layout, the indices
are used in this order:
- offset
- fiber id % 4
- wavefront id
- fiber id / 4
- SP ID
and the stride for the fiber id/wavefront id
combo is MEMSIZEPERITEM.
Note: Accesses of more than 1 dword do not work
with per-fiber layout. The blob will fall back
to per-wave instead.
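One possible reading of the per-wave layout, as a byte-address computation. It assumes that the indices nest densely from innermost to outermost (so the fiber stride is one dword and consecutive dwords of one fiber are 128 dwords apart), that the ldp/stp offset is in bytes, and that MEMSIZEPERITEM and TOTALPVTMEMSIZE are already converted to bytes. None of this is confirmed; the per-fiber layout is left out because its outer strides are not given above.

```c
#include <stdint.h>

#define FIBERS_PER_WAVE 128u

/* Per-wave layout sketch; every stride below the wavefront level is an
 * assumption derived from the index order listed above. */
uint64_t pvt_per_wave_byte_addr(uint32_t offset, uint32_t fiber_id,
                                uint32_t wave_id, uint32_t sp_id,
                                uint32_t mem_size_per_item,
                                uint64_t total_pvt_mem_size)
{
    uint64_t addr = offset % 4;                           /* byte within dword */

    addr += (uint64_t)fiber_id * 4;                       /* one dword per fiber */
    addr += (uint64_t)(offset / 4) * 4 * FIBERS_PER_WAVE; /* next dword index */
    addr += (uint64_t)wave_id * mem_size_per_item * FIBERS_PER_WAVE;
    addr += (uint64_t)sp_id * total_pvt_mem_size;
    return addr;
}
```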
This seems to be the equivalent of HWSTACKOFFSET in
a3xx. The ldp/stp offset formula above isn't affected by
HWSTACKSIZEPERTHREAD at all, so the HW return address
stack seems to be after all the normal per-SP private
memory.
Normally the size of the output of the last stage in
dwords. It should be programmed as follows:
size less than 63 - size
size of 63 (?) or 64 - 63
size greater than 64 - 64
What to program when the size is 61-63 is a guess, but
both the blob and ir3 align the size to 4 dwords so it
doesn't matter in practice.
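The table above as a small helper. The function name is hypothetical, and the handling of 63 carries over the same uncertainty that is noted above.

```c
#include <stdint.h>

/* Map the last-stage output size (in dwords) to the programmed value. */
uint32_t encode_last_stage_out_size(uint32_t size_dwords)
{
    if (size_dwords < 63)
        return size_dwords;
    if (size_dwords <= 64) /* 63 (a guess) or 64 */
        return 63;
    return 64;
}
```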
per MRT
If 0 - all 32k of shared storage is enabled, otherwise
(SHARED_SIZE + 1) * 1k is enabled.
The ldl/stl offset seems to be rewritten to 0 when it is beyond
this limit. This is different from ldlw/stlw, which wraps at
64k (and has 36k of storage on A640 - reads between 36k-64k
always return 0)
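Decoding the SHARED_SIZE field into a byte count, as described above (the function name is hypothetical, and "k" is taken to mean 1024 bytes):

```c
#include <stdint.h>

/* 0 means all 32k is enabled, otherwise (SHARED_SIZE + 1) * 1k. */
uint32_t shared_storage_bytes(uint32_t shared_size)
{
    if (shared_size == 0)
        return 32u * 1024u;
    return (shared_size + 1u) * 1024u;
}
```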
This register clears pending loads queued up by
CP_LOAD_STATE6. Each bit resets one or more particular
kinds of CP_LOAD_STATE6.
Shared constants are intended to be used for Vulkan push
constants. When enabled, 8 vec4s are reserved in the FS
const pool and 16 in the geometry const pool, although
only 8 are actually used (why?), and they are mapped to
c504-c511 in each stage. Both VS and FS shared consts
are written using ST6_CONSTANTS/SB6_IBO, so that both
the geometry and FS shared consts can be written at once
by using CP_LOAD_STATE6 rather than
CP_LOAD_STATE6_FRAG/CP_LOAD_STATE6_GEOM. In addition,
DST_OFF and NUM_UNIT are in units of dwords instead of
vec4s.
There is also a separate shared constant pool for CS,
which is loaded through CP_LOAD_STATE6_FRAG with
ST6_UBO/ST6_IBO. However the only real difference for CS
is the dword units.
Texture sampler dwords
Texture constant dwords
Pitch in bytes (so actually stride)
Pitch in bytes (so actually stride)