Lines Matching refs:into
16 Ultimately, once values are loaded into CPU registers, they cost nothing to
20 more data from memory into registers. This means that
33 This is achieved by subdividing the matrices into blocks sized to fit in L2
34 cache, and subdividing these blocks into sub-blocks sized to fit in L1 cache,
41 and 2) simple loading into SIMD vector registers by the kernel.
69 a block of the result in int32 accumulators and then we "unpack" it into the
79 3. Unpack the result block into the output matrix.
99 // new: unpack int32 accums into destination matrix
136 The files in internal/ fall into a few categories:
142 They both call into pack/compute/unpack stages implemented in the following files:
152 The compute stage contains generic code in compute.h that only calls into