# Gemm Tuner

## Introduction

This is a set of tools for tuning the performance of OpenCL GEMM kernels. Specifically, we tune 3 GEMM kernels, each
with a different implementation **strategy** of the GEMM operation: **native**, **reshaped**, **reshaped only rhs**.
The details of these strategies can be found in the documentation of the corresponding kernels:
**CLGEMMMatrixMultiplyNativeKernel**, **CLGEMMMatrixMultiplyReshapedKernel** and
**CLGEMMMatrixMultiplyReshapedOnlyRHSKernel**.

The Tuner consists of 2 scripts and 3 binaries:
* benchmark_gemm_examples.sh and GemmTuner.py under examples/gemm_tuner, and
* benchmark_cl_gemm_native, benchmark_cl_gemm_reshaped_rhs_only and benchmark_cl_gemm_reshaped under
  build/tests/gemm_tuner (you'll need to build the library first)

The inputs to the Tuner are a list of 4-valued tuples that we call **GEMM shapes** or **GEMMParams** (M, N, K, B, and
possibly data type). They define the "shape" and other parameters (e.g. data type) of a GEMM operation:
```
LHS x RHS = DST
```
where LHS is of shape MxK, RHS is of shape KxN, DST is of shape MxN, and B is the batch size.
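As a concrete illustration of these dimensions, the following minimal Python sketch derives the three matrix shapes from one GEMM shape. The `gemm_operand_shapes` helper is hypothetical and not part of the Tuner, and showing B as a leading batch dimension is an assumption made for illustration only.

```python
def gemm_operand_shapes(m, n, k, b):
    """Return the (LHS, RHS, DST) shapes implied by a GEMM shape <M, N, K, B>.

    Hypothetical helper for illustration. B is shown as a leading batch
    dimension, which is an assumption and not a statement about how the
    kernels lay out batches in memory.
    """
    if min(m, n, k, b) < 1:
        raise ValueError("M, N, K and B must be positive integers")
    lhs = (b, m, k)  # LHS is MxK
    rhs = (b, k, n)  # RHS is KxN
    dst = (b, m, n)  # DST is MxN
    return lhs, rhs, dst


lhs, rhs, dst = gemm_operand_shapes(100, 100, 30, 3)
print(lhs, rhs, dst)  # (3, 100, 30) (3, 30, 100) (3, 100, 100)
```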

The outputs of the tuning process are 4 json files:
1. gemm_type_selection.json: selects which kernel type is the best for each GEMMParam
2. gemm_config_native.json: selects a list of best **GEMMConfigs** of the native kernel for each GEMMParam
3. gemm_config_reshapedonlyrhs.json: selects a list of best GEMMConfigs of the reshaped_only_rhs kernel for each GEMMParam
4. gemm_config_reshaped.json: selects a list of best GEMMConfigs of the reshaped kernel for each GEMMParam

These 4 files are the current representation of what we call the **heuristics** of a GEMM op: given a GEMMParam,
which kernel, and subsequently which configurations for that kernel, are the most performant.

## Step-by-step example

### Step 1: Prepare the shape and configs files
1. We first need to identify the shapes that we are interested in and store them in a csv file, say *gemm_shapes.csv*.
2. Then we need to specify a set of good GEMMConfig candidates for each kernel in 3 separate csv files (this requires
   some prior heuristics, but can be provided by the Compute Library developers upon request, based on your target device).

   Say we have *gemm_configs_native.csv*, *gemm_configs_reshaped.csv* and *gemm_configs_reshaped_only_rhs.csv*.

   Please refer to the Prerequisite section for more details.
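If you want to draft these files yourself, a minimal Python sketch such as the one below writes a shape file and a native config file in the headerless, comma-separated format described in the Prerequisite section. The shapes and the candidate subset are arbitrary placeholders rather than recommended values, and `write_rows` is a hypothetical helper.

```python
import csv
import itertools


def write_rows(path, rows):
    """Write rows to a headerless, comma-separated csv file with LF line endings."""
    with open(path, "w", newline="") as f:
        # csv.writer defaults to CRLF; plain LF is friendlier to shell scripts.
        csv.writer(f, lineterminator="\n").writerows(rows)


# Placeholder shapes <M, N, K, B>; replace with the shapes you care about.
shapes = [(100, 100, 30, 1), (100, 100, 30, 3)]

# An arbitrary illustrative subset of native candidates <m0, n0, k0>.
native_configs = list(itertools.product([1, 2, 4], [4, 8], [4, 8]))

write_rows("gemm_shapes.csv", shapes)
write_rows("gemm_configs_native.csv", native_configs)
```

The same helper can be reused for the other two strategies once their longer rows (with the extra block and boolean fields) are built.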

### Step 2: Push relevant files to the target device
All the files that need to be present on the target device are:
* benchmark script: \<ComputeLibrary\>/examples/gemm_tuner/benchmark_gemm_examples.sh
* shapes and configs csv files: gemm_shapes.csv, gemm_configs_native.csv, gemm_configs_reshaped_only_rhs.csv, gemm_configs_reshaped.csv
* example benchmark binaries: \<ComputeLibrary\>/build/tests/gemm_tuner/benchmark_cl_gemm*

### Step 3: Collect benchmark data
With these files on the device, we can collect benchmark data using the script. Assume all the example binaries are pushed
to a folder called *gemm_tuner*. While logged onto the device:
```
# Native
./benchmark_gemm_examples.sh -s native -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_native.csv -o results/native
# Reshaped Only RHS
./benchmark_gemm_examples.sh -s reshaped_rhs_only -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped_only_rhs.csv -o results/reshaped_only_rhs
# Reshaped
./benchmark_gemm_examples.sh -s reshaped -e ./gemm_tuner -g ./gemm_shapes.csv -c ./gemm_configs_reshaped.csv -o results/reshaped
```
You can repeat the 3 commands above to add some redundancy to your benchmark data (measurements are noisy),
but you may need to change the output folder for each repeat.
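One way to keep repeated runs apart is to generate the command lines with a distinct output folder per repeat. The Python sketch below only prints the commands, which you would then run on the device. The `<strategy>_run<N>` folder naming and the `benchmark_commands` helper are assumptions made for illustration; the flags themselves are the ones used in the commands above.

```python
import shlex

# (strategy passed to -s, config csv, output sub-folder), as in the commands above.
RUNS = [
    ("native", "gemm_configs_native.csv", "native"),
    ("reshaped_rhs_only", "gemm_configs_reshaped_only_rhs.csv", "reshaped_only_rhs"),
    ("reshaped", "gemm_configs_reshaped.csv", "reshaped"),
]


def benchmark_commands(repeats, results_root="results"):
    """Return one command line per strategy and repeat, each with its own -o folder."""
    commands = []
    for rep in range(repeats):
        for strategy, config, subdir in RUNS:
            # Hypothetical naming scheme: one folder per strategy and repeat.
            out_dir = f"{results_root}/{subdir}_run{rep}"
            args = ["./benchmark_gemm_examples.sh", "-s", strategy,
                    "-e", "./gemm_tuner", "-g", "./gemm_shapes.csv",
                    "-c", f"./{config}", "-o", out_dir]
            commands.append(" ".join(shlex.quote(a) for a in args))
    return commands


for line in benchmark_commands(repeats=2):
    print(line)
```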

### Step 4: Generate the heuristics
1. After benchmarking, we pull the benchmark data, the *results* folder, from the target device to our host machine.
2. We use the GemmTuner.py script to give us the heuristics:
   ```
   python3 <ComputeLibrary>/examples/gemm_tuner/GemmTuner.py -b ./results -o heuristics
   ```
   When it's finished, there should be 4 json files in the *heuristics* folder.

One thing to notice is that the config heuristics might give more than one recommendation for each GEMMParam, because
we accept all good GEMMConfigs within a tolerance. If you want fewer recommendations, you can decrease the tolerance by
passing a lower value with *-t \<tolerance\>* to the GemmTuner.py script.
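To see why a lower tolerance yields fewer recommendations, consider the following minimal Python sketch. It keeps every config whose measured time lies within the tolerance of the fastest one. This is an illustration of the idea only: the selection rule, the `configs_within_tolerance` helper and the timings are all made up here, and GemmTuner.py may compare configs differently.

```python
def configs_within_tolerance(timings_ms, tolerance_ms):
    """Return every config whose time is within tolerance_ms of the fastest one.

    timings_ms maps a GEMMConfig tuple to a measured time in milliseconds.
    Illustrative rule only; not taken from GemmTuner.py.
    """
    best = min(timings_ms.values())
    return sorted(cfg for cfg, t in timings_ms.items() if t - best <= tolerance_ms)


# Made-up timings for three native configs <m0, n0, k0>.
timings = {(4, 4, 4): 1.00, (2, 8, 4): 1.04, (1, 4, 8): 1.30}
print(configs_within_tolerance(timings, 0.1))   # [(2, 8, 4), (4, 4, 4)]
print(configs_within_tolerance(timings, 0.01))  # [(4, 4, 4)]
```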

## Prerequisite
* A target device to be tuned, plus the following on the device:
    * Android or Linux OS
    * Bash shell
    * Built Compute Library with benchmark examples binaries
    * benchmark_gemm_examples.sh script
    * gemm shape file

        A csv file containing the **GEMMParam search list**. This is the list of GEMMParams/gemm shapes that we're
        interested in (For more details see Approach section). The default list is prepared by Compute Library developers in advance
        and can be provided on request.

        The format is described as:

        A headerless csv file with fields separated by commas.

        A gemm shape is a list of 4 positive integers \<M, N, K, B\> describing the shapes of the two matrices (LHS and
        RHS) with:

            M - Number of lhs matrix rows
            N - Number of rhs matrix columns
            K - Number of lhs matrix columns/rhs matrix rows
            B - Batch size

        An example gemm shape file looks like:
        ```
        100,100,30,1
        100,100,30,3
        ...
        ```
    * gemm config file

        A csv file containing the **GEMMConfig search list**. This is the list of candidate GEMMConfigs among which we
        search for the optimal one. **Note that we have a different list for each strategy.**
        The default lists are prepared by Compute Library developers in advance and can be provided on request.

        The format of the file for each strategy is the same:

        A headerless csv file with fields separated by commas.

        However, the fields of GEMMConfig differ for each strategy:

        * Strategy **native**:

            A gemm config is a list of 3 positive integers \<m0, n0, k0\>, with:

                m0 - Number of rows processed by the matrix multiplication
                n0 - Number of columns processed by the matrix multiplication
                k0 - Number of partial accumulations performed by the matrix multiplication

            Only the following configurations of M0, N0 and K0 are currently supported:

                M0 = 1, 2, 3, 4, 5, 6, 7, 8
                N0 = 2, 3, 4, 8, 16
                K0 = 2, 3, 4, 8, 16

            An example gemm config file looks like:
            ```
            1,4,4
            2,3,8
            ...
            ```
        * Strategy **reshaped_rhs_only**:

            A gemm config is a list of 4 positive integers \<m0, n0, k0, h0\> and 3 boolean values:

                m0 - Number of rows processed by the matrix multiplication
                n0 - Number of columns processed by the matrix multiplication
                k0 - Number of partial accumulations performed by the matrix multiplication
                h0 - Number of horizontal blocks of size (k0xn0) stored on the same output row
                interleave_rhs - Interleave rhs matrix (1) / Do not interleave rhs matrix (0)
                transpose_rhs - Transpose rhs matrix (1) / Do not transpose rhs matrix (0)
                export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0). Can only be true
                                         with certain combinations of the GEMMParams and other configs. Please refer to CLGEMMReshapeRHSMatrixKernel
                                         for more details

            Only the following configurations of M0, N0, K0 and H0 are currently supported:

                M0 = 1, 2, 3, 4, 5, 6, 7, 8
                N0 = 2, 3, 4, 8, 16
                K0 = 2, 3, 4, 8, 16
                H0 >= 1

            An example gemm config file looks like:
            ```
            4,4,4,1,1,1,0
            4,4,4,3,1,0,1
            ...
            ```
        * Strategy **reshaped**:

            A gemm config is a list of 5 positive integers \<m0, n0, k0, v0, h0\> and 4 boolean values:

                m0 - Number of rows processed by the matrix multiplication
                n0 - Number of columns processed by the matrix multiplication
                k0 - Number of partial accumulations performed by the matrix multiplication
                v0 - Number of vertical blocks of size (m0xk0) stored on the same output row
                h0 - Number of horizontal blocks of size (k0xn0) stored on the same output row
                interleave_lhs - Interleave lhs matrix (1) / Do not interleave lhs matrix (0)
                interleave_rhs - Interleave rhs matrix (1) / Do not interleave rhs matrix (0)
                transpose_rhs - Transpose rhs matrix but not lhs matrix (1) / Do not transpose rhs matrix but do transpose lhs matrix (0)
                export_to_cl_image_rhs - Export rhs matrix to cl_image (1) / Do not export rhs matrix to cl_image (0). Can only be true
                                         with certain combinations of the GEMMParams and other configs. Please refer to CLGEMMReshapeRHSMatrixKernel
                                         for more details

            If the rhs matrix is transposed, only the following configurations are currently supported:

                M0 = 2, 3, 4, 5, 6, 7, 8
                N0 = 2, 3, 4, 8, 16
                K0 = 2, 3, 4, 8, 16
                V0 >= 1
                H0 >= 1

            If the lhs matrix is transposed, only the following configurations are currently supported:

                M0 = 2, 3, 4, 8
                N0 = 2, 3, 4, 8, 16
                K0 = 2, 3, 4, 8, 16
                V0 >= 1
                H0 >= 1

            An example gemm config file looks like:
            ```
            4,4,4,1,3,1,1,1,0
            4,4,4,3,3,1,1,0,1
            ...
            ```
* A host machine, plus these on the machine:
    * python >= 3.6
    * GemmTuner.py script
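The field counts and supported values listed for each strategy above can be checked on the host before the config files are pushed to the device. The following Python sketch is a hypothetical pre-flight check, not part of the Tuner. It encodes only the constraints stated in this section and does not attempt to validate when export_to_cl_image_rhs may be set.

```python
N0_K0_VALUES = {2, 3, 4, 8, 16}
INT_FIELDS = {"native": 3, "reshaped_rhs_only": 4, "reshaped": 5}
BOOL_FIELDS = {"native": 0, "reshaped_rhs_only": 3, "reshaped": 4}


def check_config_row(strategy, row):
    """Return a list of problems found in one config row (empty if none found)."""
    n_int, n_bool = INT_FIELDS[strategy], BOOL_FIELDS[strategy]
    if len(row) != n_int + n_bool:
        return [f"expected {n_int + n_bool} fields, got {len(row)}"]
    ints, bools = row[:n_int], row[n_int:]
    problems = []
    if any(flag not in (0, 1) for flag in bools):
        problems.append("boolean fields must be 0 or 1")
    m0, n0, k0 = ints[:3]
    if strategy == "reshaped":
        # Field order: interleave_lhs, interleave_rhs, transpose_rhs, export_to_cl_image_rhs
        transpose_rhs = bools[2]
        # transpose_rhs=1 transposes rhs; transpose_rhs=0 transposes lhs instead.
        m0_values = set(range(2, 9)) if transpose_rhs == 1 else {2, 3, 4, 8}
    else:
        m0_values = set(range(1, 9))
    if m0 not in m0_values:
        problems.append(f"unsupported m0={m0}")
    if n0 not in N0_K0_VALUES:
        problems.append(f"unsupported n0={n0}")
    if k0 not in N0_K0_VALUES:
        problems.append(f"unsupported k0={k0}")
    if any(block < 1 for block in ints[3:]):
        problems.append("v0/h0 must be >= 1")
    return problems


print(check_config_row("native", [1, 4, 4]))                      # []
print(check_config_row("reshaped", [4, 4, 4, 3, 3, 1, 1, 0, 1]))  # []
print(check_config_row("reshaped", [5, 4, 4, 1, 1, 1, 1, 0, 0]))  # ['unsupported m0=5']
```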

## Usage
The usage of the 2 scripts:

1. benchmark_gemm_examples.sh

   Run the shell script (**benchmark_gemm_examples.sh**) on your **target device**. Note that all the built benchmark
   examples (build/tests/gemm_tuner/benchmark_cl_gemm*) have to be present on your target device prior to running.
   The benchmark results will be saved to json files in an output directory.
   ```
   Usage: benchmark_gemm_examples.sh [-h] -s <strategy> -e <example_binary_dir> -g <gemm_shape_file>
     -c <gemm_config_file> [-d <data_type>] [-o <out_dir>]

   Options:
           -h
           Print help messages. If a strategy is specified with -s <strategy>, then only display messages relevant to that
           strategy. Otherwise if no strategy is specified, display messages for all available strategies.

           -s <strategy>
           Strategy option.
           Options: ${ALL_STRATEGY_OPTIONS[@]}.

           -e <example_binary_dir>
           Path to directory that holds all example binaries

           -g <gemm_shape_file>
           Path to gemm shape csv file

           -c <gemm_config_file>
           Path to gemm config csv file

           -d <data_type>
           Data type option with which to run benchmark examples
           Default: ${DEFAULT_DATA_TYPE}
           Supported options:
           Strategy            :    Data Types
           Native              :    F32
           Reshaped            :    F16, F32
           Reshaped RHS Only   :    F16, F32

           -o <out_dir>
           Path to output directory that holds output json files
           Default: ${DEFAULT_OUT_DIR}
   ```
2. GemmTuner.py

   Run the python script (**GemmTuner.py**) on your **host machine**.
   You'll need to transfer all the benchmark result json files generated from the previous step to your host machine
   beforehand. The script will output the best kernel and gemm configurations for each gemm param in the 4 output json files.
   ```
   Usage: GemmTuner.py [-h] -b PATH [-o PATH] [-t TOLERANCE] [-D]

   CL GEMM Tuner
   optional arguments:
     -h, --help            show this help message and exit
     -b PATH, --benchmark_results PATH
                           Path to benchmark result directory, where benchmark
                           result json files have a file extension of
                           'gemmtuner_benchmark'
     -o PATH, --output_dir PATH
                           Path to directory that holds output json files.
     -t TOLERANCE, --tolerance TOLERANCE
                           For testing if two GEMMConfigs are equivalent in terms
                           of performance. The tolerance is OpenCL timer in
                           milliseconds. Recommended value: <= 0.1 ms

     -D, --debug           Enable script debugging output

   ```
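Before running GemmTuner.py, it can be useful to confirm that the benchmark results were actually pulled to the host. The sketch below counts files per folder, assuming that the extension mentioned in the help text means file names ending in `.gemmtuner_benchmark`. The `count_result_files` helper is hypothetical, and the file contents are not inspected.

```python
from collections import Counter
from pathlib import Path


def count_result_files(results_dir, extension="gemmtuner_benchmark"):
    """Count files ending in .<extension>, grouped by their parent folder name."""
    paths = Path(results_dir).rglob(f"*.{extension}")
    return Counter(p.parent.name for p in paths if p.is_file())


if __name__ == "__main__":
    # "./results" is the folder pulled from the device in the step-by-step example.
    for folder, count in sorted(count_result_files("./results").items()):
        print(f"{folder}: {count} result file(s)")
```

A folder that is missing from the printed list, or that has far fewer files than the others, suggests that a benchmark run did not complete or was not copied over.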