Lines Matching refs:into
12 Ultimately, once values are loaded into CPU registers, they cost nothing to
16 more data from memory into registers. This means that a GEMM implementation
29 This is achieved by subdividing the matrices into blocks sized to fit in L2
30 cache, and subdividing these blocks into sub-blocks sized to fit in L1 cache,
37 loading into SIMD vector registers by the kernel.
65 a block of the result in int32 accumulators and then we "unpack" it into the
75 3. Unpack the result block into the output matrix.
96 // new: unpack int32 accums into destination matrix
136 The files in `internal/` fall into a few categories:
143 They both call into pack/compute/unpack stages (see [kernel.md](kernel.md) and
149 * This in turn calls into [internal/output.h](../internal/output.h) for
158 * This in turn calls into
161 The compute stage contains generic code in compute.h that only calls into