meta - OpenGrok cross reference for /external/gemmlowp/meta/

METAPROGRAMMED GEMM
===================

The two main goals of this library are:
- providing a new matrix multiplication kernel.
- providing the optimized codepaths for as many possible user scenarios without
  enforcing additional input data constraints (padding, sizes, strides, layout)

To enable this code add -DGEMMLOWP_USE_META_FASTPATH to your build setup.

The new kernel
--------------

The multiplication kernel - the most inner loop of the matrix multiplication,
which is responsible for the row/column products was rewritten. The new code
produces a 3x3 result patch and processes the row/column arrays in 8 element
packs (the kernel 'shape' is 3x3x8 compared to the previous 12x4x2). By using
specialized 8bit multiplication, aggregating to vector aggregators and then
reduction with parallel horizontal addition we devised code that achieved
higher arithmetical density (arithmetical operation per assembly instruction).
The arithmetical performance of the new kernel exceeds 18 GOps/s on a vanilla
Nexus 5 phone (which is practically peak for this device).

In order to feed the kernel with input data and minimize the number of
instructions other than the arithmetical operations a different packing
scheme was used. Three rows (columns) are interweaved every 8 elements so that
they can be read from continuous memory in one op inside the kernel. Additional
memory preload hint operations are inserted into the kernel to hide memory
latency behind arithmetical operations.

Generated code
--------------

The basic kernel used in this approach is of shape 3x3x8. Obviously this
kernel can be easily applied to multipications where matrix sizes are:
M x K, K x N where M and N are multiplies of 3 and K is a multiply of 8.

We rejected two obvious solutions of: padding the matrix sizes to appropriate
sizes, or using the reference implementation for the leftovers. Neither did
we consider enforcing extra constraints on the caller.

In order to allow all matrix sizes the kernels processing all combinations of
1, 2 or 3 rows and 1, 2 or 3 columns are required. Similarily to allow all
possible depths the leftover values (up to 7 elements) needed to be handled.

Instead of writing those kernels by hand we decided to generate them with
some python scripts. 9 Versions of the multiplication kernel were prepared.
Additionally packing and unpacking code for different row/column counts and
depth leftovers was generated. Finally different code was generated for
aligned memory reads/writes and unaligned memory reads/writes.

Using those multiplication and packing/unpacking primitives 144 gemm function
versions were prepared. On top of them one high level gemm function that would
switch to one of those preoptimized versions.

This approach allowed moving all unnecessary branching and conditional execution
outside of the inner loops. It also allowed removing of all short loops required
for leftover handling. Finally aligned memory reads/writes are used everywhere
where the provided input data allows.

Results
-------

The library shows up to 35% faster gemm execution in some cases (e.g. ImageNet
benchmark).

Files
-----

single_thread_gemm.h
-- generated ARM/NEON 8bit x 8bit gemm implementation. Contains all the
   optimized, unrolled and curried pack/unpack, and multiply procedures and
   a single gemm function that switches between the optimized versions based
   on the runtime parameters.

multi_thread_gemm.h
-- a simple parallelization scheme for the gemm function.

generators/gemm_NxMxK_neon.py
-- script that generates the single_thread_gemm.h header library.
   Usage: python gemm_NxMxK_neon > single_thread_gemm.h
Name		Date	Size	#Lines	LOC
..		-	-
generators/		03-May-2024	-	6,467	4,998
README	D	03-May-2024	3.6 KiB	82	63
base.h	D	03-May-2024	3.9 KiB	146	98
legacy_multi_thread_common.h	D	03-May-2024	5.3 KiB	152	115
legacy_multi_thread_gemm.h	D	03-May-2024	11.1 KiB	261	215
legacy_multi_thread_gemv.h	D	03-May-2024	6.8 KiB	169	129
legacy_operations_common.h	D	03-May-2024	1.8 KiB	62	40
legacy_single_thread_gemm.h	D	03-May-2024	9.4 KiB	300	236
multi_thread_common.h	D	03-May-2024	1.6 KiB	57	33
multi_thread_gemm.h	D	03-May-2024	5.1 KiB	145	107
multi_thread_transform.h	D	03-May-2024	3.4 KiB	99	67
quantized_mul_kernels.h	D	03-May-2024	5.6 KiB	178	145
quantized_mul_kernels_arm_32.h	D	03-May-2024	128.3 KiB	4,289	3,359
quantized_mul_kernels_arm_64.h	D	03-May-2024	127.1 KiB	4,107	3,177
single_thread_gemm.h	D	03-May-2024	25.1 KiB	689	556
single_thread_transform.h	D	03-May-2024	2.9 KiB	91	65
streams.h	D	03-May-2024	10.8 KiB	313	260
streams_arm_32.h	D	03-May-2024	381.6 KiB	12,249	10,772
streams_arm_64.h	D	03-May-2024	401.1 KiB	12,274	10,797
test_gemm_correctness.cc	D	03-May-2024	18.1 KiB	522	455
test_streams_correctness.cc	D	03-May-2024	5.6 KiB	183	143
test_transform_benchmark.cc	D	03-May-2024	5 KiB	152	110
test_transform_correctness.cc	D	03-May-2024	9.9 KiB	286	231
transform_kernels.h	D	03-May-2024	7.1 KiB	245	203
transform_kernels_arm_32.h	D	03-May-2024	241.6 KiB	8,110	7,048
transform_kernels_arm_64.h	D	03-May-2024	249.5 KiB	7,966	6,904