1# Kernels in gemmlowp 2 3## Kernels provide an inner-loop implementation, and a format 4 5Here we assume familiarity with the concepts of kernels and of packing as 6explained in [design.md](design.md). 7 8gemmlowp is designed to be easily extensible to different architectures and 9other low-level details, while achieving high performance. Thus a line had to be 10drawn between the generic GEMM code and the specific parts that need to be 11manually designed for each architecture, etc. The design choice made in gemmlowp 12is to have easily swappable GEMM kernels. 13 14In itself, a GEMM kernel is just an implementation of the inner-most loop in a 15GEMM (That inner-most loop has to be over the 'depth' dimension so as to be able 16to accumulate into a small enough number of accumulators to fit in registers). 17 18Thus, by itself, a GEMM kernel should be just a function computing a block of 19GEMM. 20 21However, GEMM kernels may need to differ not just in how they implement this 22computation, but also in the format of data that they operate on. Indeed, in 23order to maximize the ratio of arithmetic instructions to memory access 24instructions, GEMM kernels want to handle blocks as wide as possible given the 25number of registers of the CPU architecture. 26 27Thus, in order to allow efficient specialization to diverse architectures, 28gemmlowp allows each GEMM kernel to dictate the format of data that it expects, 29in addition to providing its inner-loop implementation. 30 31The former is given by a 'Format' typedef, and the latter by a 'Run' method. 32 33A good example is to look at internal/kernel_neon.h, and specifically at the 34NEONKernel12x4Depth2 kernel, which specifies its format as 35 36``` 37 typedef KernelFormat<KernelSideFormat<CellFormat<4, 2>, 3>, 38 KernelSideFormat<CellFormat<4, 2>, 1> > Format; 39``` 40 41The meaning of these terms is explained in the lengthy comment at the top of 42internal/kernel.h. Here, they mean that this kernel handles at each iteration 43(along the depth dimension): - 3 'cells' of size 4x2 each of the lhs, so a total 44lhs block of size 12x2 - 1 'cell' of size 2x4 of the rhs. In other words, this 45kernel handles 12 rows of the lhs and 4 columns of the rhs, and handles two 46levels of depth at once. The 'cells' and `CellFormat` detail the layout of these 4712x2 and 2x4 blocks. 48 49This kernel then loads these 12x2 and 2x4 blocks and computes the corresponding 5012x4 GEMM; for ease of reference let us paste the critical comment and code 51here: 52 53``` 54"loop_NEONKernel12x4Depth2_%=:\n" 55 56// Overview of register layout: 57// 58// A 2x4 cell of Rhs is stored in 16bit in d0--d1 (q0). 59// A 12x2 block of 3 4x2 cells Lhs is stored in 16bit in d2--d7 60// (q1--q3). 61// A 12x4 block of accumulators is stored in 32bit in q4--q15. 62// 63// +-----+-----+-----+-----+ 64// |d0[0]|d0[1]|d0[2]|d0[3]| 65// Rhs +-----+-----+-----+-----+ 66// |d1[0]|d1[1]|d1[2]|d1[3]| 67// +-----+-----+-----+-----+ 68// 69// | | | | | 70// 71// Lhs | | | | | 72// 73// +--+--+ - - - - +-----+-----+-----+-----+ 74// |d2|d3| | q4 | q5 | q6 | q7 | 75// |d2|d3| | q4 | q5 | q6 | q7 | 76// |d2|d3| | q4 | q5 | q6 | q7 | 77// |d2|d3| | q4 | q5 | q6 | q7 | 78// +--+--+ - - - - +-----+-----+-----+-----+ 79// |d4|d5| | q8 | q9 | q10 | q11 | 80// |d4|d5| | q8 | q9 | q10 | q11 | 81// |d4|d5| | q8 | q9 | q10 | q11 | 82// |d4|d5| | q8 | q9 | q10 | q11 | 83// +--+--+ - - - - +-----+-----+-----+-----+ 84// |d6|d7| | q12 | q13 | q14 | q15 | 85// |d6|d7| | q12 | q13 | q14 | q15 | 86// |d6|d7| | q12 | q13 | q14 | q15 | 87// |d6|d7| | q12 | q13 | q14 | q15 | 88// +--+--+ - - - - +-----+-----+-----+-----+ 89// 90// Accumulator 91 92// Load 1 Rhs cell of size 2x4 93"vld1.8 {d0}, [%[rhs_ptr]:64]!\n" 94 95// Load 3 Lhs cells of size 4x2 each 96"vld1.8 {d2}, [%[lhs_ptr]:64]!\n" 97"vld1.8 {d4}, [%[lhs_ptr]:64]!\n" 98"vld1.8 {d6}, [%[lhs_ptr]:64]!\n" 99 100// Expand Lhs/Rhs cells to 16 bit. 101"vmovl.u8 q0, d0\n" 102"vmovl.u8 q1, d2\n" 103"vmovl.u8 q2, d4\n" 104"vmovl.u8 q3, d6\n" 105 106// Multiply-accumulate, level of depth 0 107"vmlal.u16 q4, d2, d0[0]\n" 108"vmlal.u16 q5, d2, d0[1]\n" 109"vmlal.u16 q6, d2, d0[2]\n" 110"vmlal.u16 q7, d2, d0[3]\n" 111"vmlal.u16 q8, d4, d0[0]\n" 112"vmlal.u16 q9, d4, d0[1]\n" 113"vmlal.u16 q10, d4, d0[2]\n" 114"vmlal.u16 q11, d4, d0[3]\n" 115"vmlal.u16 q12, d6, d0[0]\n" 116"vmlal.u16 q13, d6, d0[1]\n" 117"vmlal.u16 q14, d6, d0[2]\n" 118"vmlal.u16 q15, d6, d0[3]\n" 119 120// Multiply-accumulate, level of depth 1 121"vmlal.u16 q4, d3, d1[0]\n" 122"vmlal.u16 q5, d3, d1[1]\n" 123"vmlal.u16 q6, d3, d1[2]\n" 124"vmlal.u16 q7, d3, d1[3]\n" 125"vmlal.u16 q8, d5, d1[0]\n" 126"vmlal.u16 q9, d5, d1[1]\n" 127"vmlal.u16 q10, d5, d1[2]\n" 128"vmlal.u16 q11, d5, d1[3]\n" 129"vmlal.u16 q12, d7, d1[0]\n" 130"vmlal.u16 q13, d7, d1[1]\n" 131"vmlal.u16 q14, d7, d1[2]\n" 132"vmlal.u16 q15, d7, d1[3]\n" 133 134// Loop. Decrement loop index (depth) by 2, since we just handled 2 135// levels of depth (Kernel::kDepth=2). 136"subs %[run_depth], #2\n" 137"bne loop_NEONKernel12x4Depth2_%=\n" 138``` 139 140## Packing code adapts to the format chosen by the kernel 141 142As explained in [design.md](design.md), gemmlowp starts by packing blocks of the 143lhs and rhs matrices for optimally efficient traversal by the kernel. This 144depends on fine details of the kernel format, in ways that can only be 145efficiently handled by knowing these kernel format details at compile-time. 146 147This is the reason why all the code in [internal/pack.h](../internal/pack.h) is 148templated in the corresponding kernel format. 149 150The code in internal/pack.h isn't tightly optimized by itself, but it is 151structured in such a way that the critical code is in a template, 152`PackingRegisterBlock`, that can easily be specialized to override the slow 153generic code with fast specific packing code for specific formats, on specific 154platforms. 155 156See [internal/pack_neon.h](../internal/pack_neon.h) which provides NEON 157specializations of the packing code for the particular kernel formats that are 158used by the NEON kernels in [internal/kernel_neon.h](../internal/kernel_neon.h). 159 160## Wrapping up: how to optimize gemmlowp for a CPU architecture 161 162In conclusion, the key feature of gemmlowp when it comes to efficiently 163supporting a specific CPU architecture, is that it allows to freely replace the 164inner loop of the GEMM by providing one's own GEMM kernel, which is also free to 165dictate its required data layout; each data layout then also needs optimized 166packing code. The steps are thus: 167 1681. Freely design a GEMM kernel with a freely chosen data layout. 1692. Implement the GEMM kernel, similar to 170 [internal/kernel_neon.h](../internal/kernel_neon.h). 1713. Implement the optimized packing code, similar to 172 [internal/pack_neon.h](../internal/pack_neon.h). 173