# gemmlowp: a small self-contained low-precision GEMM library

[![Build Status](https://secure.travis-ci.org/google/gemmlowp.png)](http://travis-ci.org/google/gemmlowp)

This is not a full linear algebra library, only a GEMM library: it only does
general matrix multiplication ("GEMM").

The meaning of "low precision" is detailed in this document:
[doc/low-precision.md](doc/low-precision.md)

Some of the general design is explained in [doc/design.md](doc/design.md).

**Warning:** This library runs very slowly if compiled incorrectly; see the
compiler flags under "Architecture-specific optimized code paths" below.

## Disclaimer

This is not an official Google product (experimental or otherwise); it is just
code that happens to be owned by Google.

## Mailing list

gemmlowp-related discussion, about either development or usage, is welcome on
this Google Group (mailing list / forum):

https://groups.google.com/forum/#!forum/gemmlowp

## Portability, target platforms/architectures

gemmlowp should be portable to any platform with some C++11 and POSIX support.
Optimized code paths for specific architectures are optional.

Required:

* C++11 (a small conservative subset of it)

Required for some features:

* Some POSIX interfaces:
    * pthreads (for multi-threaded operation and for profiling).
    * sysconf (for multi-threaded operation, to detect the number of cores;
      may be bypassed).

Optional:

* Architecture-specific code paths use intrinsics or inline assembly. See
  "Architecture-specific optimized code paths" below.

## Architecture-specific optimized code paths

We have some optimized code paths for specific instruction sets. Some are
written in inline assembly, some are written in C++ using intrinsics. Both GCC
and Clang are supported.

Current optimized code paths:

* ARM with NEON (both 32bit and 64bit).
* Intel x86 with SSE 4.1 (both 32bit and 64bit).

When building for x86, it's very important to pass `-msse4.1` to the compiler;
otherwise gemmlowp will use slow reference code. Bazel users can compile by
running `bazel build --copt=-msse4.1 //gemmlowp:all`. The compiled binary should
work on all Intel CPUs since 2008 (including low power microarchitectures) as
well as AMD CPUs since 2011.

Note that when compiling binaries that don't need to be distributed, it's
generally a better idea to pass `-march=native` to the compiler. That flag
implies `-msse4.1`, along with other flags that might be helpful, provided the
host machine supports those instructions. Bazel users should prefer to run
`bazel build --config=opt //gemmlowp:all` instead.

Details of what it takes to make an efficient port of gemmlowp, namely writing a
suitable GEMM kernel and accompanying packing code, are explained in this file:
[doc/kernel.md](doc/kernel.md).

## Public interfaces

### The gemmlowp public interface

gemmlowp's main public interface is in the `public/` subdirectory.

This is a header-only library, so there is nothing to link to.

Usage documentation, and comments on the deprecation status of each public entry
point, may be found in [doc/public.md](doc/public.md).

A full, self-contained usage example, showing how to quantize float matrices and
perform a quantized matrix multiplication approximating a float matrix
multiplication, is given in
[doc/quantization_example.cc](doc/quantization_example.cc).

### Old EightBitIntGemm legacy deprecated interface

The `eight_bit_int_gemm/` subdirectory contains an alternate interface that
should be considered purely legacy and deprecated; it will be removed at some
point in the future.

## Building

### Building by manually invoking your compiler

Because gemmlowp is so simple, working with it involves only single-command-line
compiler invocations. Therefore we expect that most people working with gemmlowp
will either manually invoke their compiler, or write their own rules for their
own preferred build system.

Keep in mind (see "Public interfaces" above) that gemmlowp itself is a
header-only library, so there is nothing to build.

For an Android gemmlowp development workflow, the `scripts/` directory contains
a script to build and run a program on an Android device:

```
scripts/test-android.sh
```

### Building using Bazel

That being said, we also maintain a Bazel BUILD system as part of gemmlowp. Its
usage is not mandatory at all and is only one possible way that gemmlowp
libraries and tests may be built. If you are interested, Bazel's home page is
http://bazel.build/ and you can get started with using Bazel to build gemmlowp
targets by first creating an empty WORKSPACE file in a parent directory, for
instance:

```
$ cd gemmlowp/..   # change to parent directory containing gemmlowp/
$ touch WORKSPACE  # declare that to be our workspace root
$ bazel build gemmlowp:all
```

## Testing

### Testing by manually building and running tests

The `test/` directory contains unit tests. The primary unit test is

```
test/test.cc
```

Since it also covers the EightBitIntGemm interface, it needs to be linked
against

```
eight_bit_int_gemm/eight_bit_int_gemm.cc
```

It also uses realistic data, captured from a neural network run, which is
stored in

```
test/test_data.cc
```

Thus you'll want to pass the following list of source files to your
compiler/linker:

```
test/test.cc
eight_bit_int_gemm/eight_bit_int_gemm.cc
test/test_data.cc
```

The `scripts/` directory contains a script to build and run a program on an
Android device:

```
scripts/test-android.sh
```

It expects the `CXX` environment variable to point to an Android toolchain's C++
compiler, and expects source files (and optionally, cflags) as command-line
parameters. To build and run the above-mentioned main unit test, first set
`CXX`, e.g.:

```
$ export CXX=/some/toolchains/arm-linux-androideabi-4.8/bin/arm-linux-androideabi-g++
```

Then run:

```
$ ./scripts/test-android.sh \
test/test.cc \
eight_bit_int_gemm/eight_bit_int_gemm.cc \
test/test_data.cc
```

### Testing using Bazel

Alternatively, you can use Bazel to build and run tests. See the Bazel
instructions in the section on building above. Once your Bazel workspace is set
up, you can for instance do:

```
$ bazel test gemmlowp:all
```

## Troubleshooting Compilation

If you're having trouble finding the compiler, follow these instructions to
build a standalone toolchain:
https://developer.android.com/ndk/guides/standalone_toolchain.html

Here's an example of setting up Clang 3.5:

```
$ export INSTALL_DIR=~/toolchains/clang-21-stl-gnu
$ $NDK/build/tools/make-standalone-toolchain.sh \
--toolchain=arm-linux-androideabi-clang3.5 --platform=android-21 \
--install-dir=$INSTALL_DIR
$ export CXX="$INSTALL_DIR/bin/arm-linux-androideabi-g++ \
--sysroot=$INSTALL_DIR/sysroot"
```

Some compilers (e.g. the default clang++ in the same bin directory) don't
support NEON assembly. The benchmark build process will issue a warning if
support isn't detected, and you should make sure you're using a compiler like
arm-linux-androideabi-g++ that does include NEON.

## Benchmarking

The main benchmark is

```
test/benchmark.cc
```

It doesn't need to be linked to any other source file. We recommend building
with assertions disabled (`-DNDEBUG`).

For example, the benchmark can be built and run on an Android device by doing:

```
$ ./scripts/test-android.sh test/benchmark.cc -DNDEBUG
```

If `GEMMLOWP_TEST_PROFILE` is defined, then the benchmark will be built with
profiling instrumentation (which makes it slower) and will dump profiles. See
the section on profiling below.

## Profiling

The `profiling/` subdirectory offers a very simple, naive, inaccurate,
non-interrupting sampling profiler that only requires pthreads (no signals).

It relies on source code being instrumented with pseudo-stack labels. See
`profiling/instrumentation.h`. A full example of using this profiler is given in
the top comment of `profiling/profiler.h`.

## Contributing

Contribution-related discussion is always welcome on the gemmlowp mailing list
(see above).

We try to keep a current list of TODO items in the `todo/` directory.
Prospective contributors are welcome to pick one to work on, and communicate
about it on the gemmlowp mailing list.

Details of the contributing process, including legalese, are in CONTRIBUTING.

## Performance goals

Our performance goals differ from typical GEMM performance goals in the
following ways:

1. We care not only about speed, but also about minimizing power usage. We
   specifically care about battery charge usage in mobile/embedded devices.
   This implies that we care doubly about minimizing memory bandwidth usage: we
   care about it, like any GEMM, because of the impact on speed, and we also
   care about it because it is a key factor in power usage.

2. Most GEMMs are optimized primarily for large dense matrix sizes (>= 1000).
   We do care about large sizes, but we also care specifically about the
   typically smaller matrix sizes encountered in various mobile applications.
   This means that we have to optimize for all sizes, not just for large enough
   sizes.