# TFLite Model Benchmark Tool with C++ Binary

## Description

A simple C++ binary to benchmark a TFLite model and its individual operators,
both on desktop machines and on Android. The binary takes a TFLite model,
generates random inputs, and then repeatedly runs the model for a specified
number of runs. Aggregate latency statistics are reported after running the
benchmark.

The instructions below are for running the binary on desktop and Android; for
iOS, please use the
[iOS benchmark app](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark/ios).

An experimental Android APK wrapper for the benchmark model utility offers more
faithful execution behavior on Android (via a foreground Activity). It is
located
[here](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark/android).

## Parameters

The binary takes the following required parameters:

*   `graph`: `string` \
    The path to the TFLite model file.

and the following optional parameters (see the example invocation after this
list):

*   `num_threads`: `int` (default=-1) \
    The number of threads to use for running the TFLite interpreter. By
    default, this is set to -1, meaning the platform default value is used.
*   `warmup_runs`: `int` (default=1) \
    The number of warmup runs to do before starting the benchmark.
*   `num_runs`: `int` (default=50) \
    The number of runs. Increase this to reduce variance.
*   `run_delay`: `float` (default=-1.0) \
    The delay in seconds between subsequent benchmark runs. Non-positive values
    mean no delay is used.
*   `run_frequency`: `float` (default=-1.0) \
    The frequency of benchmark runs, expressed as the number of prorated runs
    per second. If the targeted rate per second cannot be reached, the
    benchmark starts the next run immediately, trying its best to catch up. If
    set, this overrides the `run_delay` parameter. A non-positive value means
    there is no delay between subsequent runs.
*   `enable_op_profiling`: `bool` (default=false) \
    Whether to enable per-operator profiling measurement.
*   `max_profiling_buffer_entries`: `int` (default=1024) \
    The maximum number of profiling events that will be stored during each
    inference run. It is only meaningful when `enable_op_profiling` is set to
    `true`. Note that the actual value of this parameter will be adjusted if
    the model has more nodes than the specified value.
*   `profiling_output_csv_file`: `str` (default="") \
    File path to export profile data to as CSV. Requires `enable_op_profiling`
    to be `true` and the path to include the name of the output CSV file;
    otherwise, results are printed to `stdout`.
*   `print_preinvoke_state`: `bool` (default=false) \
    Whether to print out the TfLite interpreter internals just before calling
    `tflite::Interpreter::Invoke`. The internals include the allocated memory
    size of each tensor, etc. Enabling this can help you understand the TfLite
    graph and memory usage.
*   `print_postinvoke_state`: `bool` (default=false) \
    Whether to print out the TfLite interpreter internals just before the
    benchmark completes (i.e. after all repeated `Invoke` calls complete). The
    internals include the allocated memory size of each tensor, etc. Enabling
    this can help you understand the TfLite graph and memory usage,
    particularly when there are dynamic-shaped tensors in the graph.
*   `report_peak_memory_footprint`: `bool` (default=false) \
    Whether to report the peak memory footprint by periodically checking the
    memory footprint. Internally, a separate thread is spawned for this
    periodic check, so the performance benchmark result could be affected.
*   `memory_footprint_check_interval_ms`: `int` (default=50) \
    The interval in milliseconds between two consecutive memory footprint
    checks. This is only used when `report_peak_memory_footprint` is set to
    `true`.
*   `dry_run`: `bool` (default=false) \
    Whether to run the tool by just loading the model, allocating tensors,
    etc., without actually invoking any op kernels.
*   `verbose`: `bool` (default=false) \
    Whether to log parameters whose values are not set. By default, only
    parameters that are set by parsing their values from the commandline flags
    are logged.
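
For example, a typical invocation that combines several of these optional
flags might look like the following sketch (the model path is a placeholder):

```
bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model \
  --graph=/path/to/your_model.tflite \
  --num_threads=2 \
  --warmup_runs=5 \
  --num_runs=100 \
  --report_peak_memory_footprint=true
```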

### Model input parameters

By default, the tool uses randomized data for model inputs. The following
parameters allow users to specify customized input values to the model when
running the benchmark tool (a combined example follows the list):

*   `input_layer`: `string` \
    A comma-separated list of input layer names, e.g. 'input1,input2'. Note
    that all inputs of the model graph need to be specified. However, the input
    name does not need to match that encoded in the model. Additionally, the
    order of input layer names specified here is assumed to be the same as the
    order seen by the TensorFlow Lite interpreter. This is a bit inconvenient,
    but the
    [visualization tool](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tools/visualize.py)
    should help you find this order.
*   `input_layer_shape`: `string` \
    A colon-separated list of input layer shapes, where each shape is a
    comma-separated list, e.g. '1,30:1,10'. Similar to `input_layer`, this
    parameter also requires the shapes of all inputs to be specified, in the
    same order as seen by the interpreter.
*   `input_layer_value_range`: `string` \
    A map-like string representing the value range for *integer* input layers.
    Each item is separated by ':', and the item value consists of an input
    layer name and integer-only range values (both low and high are inclusive)
    separated by ',', e.g. 'input1,1,2:input2,0,254'. Note that the input layer
    name must exist in the list of names specified by `input_layer`.
*   `input_layer_value_files`: `string` \
    A map-like string representing files that contain input values. Each item
    is separated by ',', and the item value consists of an input layer name and
    the file path separated by ':',
    e.g. 'input1:file_path1,input2:file_path2'. If an input name appears in
    both `input_layer_value_range` and `input_layer_value_files`, the
    corresponding input value range specified by `input_layer_value_range` will
    be ignored. The file format is binary, and the content should be either a
    byte array or null-separated strings. Note that the input layer name must
    also exist in the list of names specified by `input_layer`.
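
For instance, a sketch of an invocation for a hypothetical model with two
inputs (the layer names and shapes below are placeholders) could be:

```
bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model \
  --graph=/path/to/your_model.tflite \
  --input_layer=input1,input2 \
  --input_layer_shape=1,224,224,3:1,10 \
  --input_layer_value_range=input2,0,254
```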

### TFLite delegate parameters

The tool supports all runtime/delegate parameters introduced by
[the delegate registrar](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/delegates).
The following simply lists the names of these parameters and additional notes
where applicable. For details about each parameter, please refer to
[this page](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tools/delegates/README.md#tflite-delegate-registrar).

#### Common parameters

*   `max_delegated_partitions`: `int` (default=0)
*   `min_nodes_per_partition`: `int` (default=0)
*   `delegate_serialize_dir`: `str` (default="")
*   `delegate_serialize_token`: `str` (default="")

#### GPU delegate

*   `use_gpu`: `bool` (default=false)
*   `gpu_precision_loss_allowed`: `bool` (default=true)
*   `gpu_experimental_enable_quant`: `bool` (default=true)
*   `gpu_inference_for_sustained_speed`: `bool` (default=false)
*   `gpu_backend`: `string` (default="")
*   `gpu_wait_type`: `str` (default="")

#### NNAPI delegate

*   `use_nnapi`: `bool` (default=false) \
    Note that some Android P devices will fail to use NNAPI for models in
    `/data/local/tmp/`, and this benchmark tool will not correctly use NNAPI.
*   `nnapi_execution_preference`: `str` (default="") \
    Should be one of: `fast_single_answer`, `sustained_speed`, `low_power`,
    `undefined`.
*   `nnapi_execution_priority`: `str` (default="") \
    Note that this requires Android 11+.
*   `nnapi_accelerator_name`: `str` (default="") \
    Note that this requires Android 10+.
*   `disable_nnapi_cpu`: `bool` (default=true)
*   `nnapi_allow_fp16`: `bool` (default=false)
*   `nnapi_allow_dynamic_dimensions`: `bool` (default=false)
*   `nnapi_use_burst_mode`: `bool` (default=false)
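
As a sketch, benchmarking with the NNAPI delegate on-device (assuming the
binary and model have already been pushed to `/data/local/tmp` as described in
the build/install/run section below) could look like:

```
adb shell /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/mobilenet_quant_v1_224.tflite \
  --use_nnapi=true \
  --nnapi_execution_preference=sustained_speed
```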

#### Hexagon delegate

*   `use_hexagon`: `bool` (default=false)
*   `hexagon_profiling`: `bool` (default=false) \
    Note that enabling this option will not produce profiling results unless
    `enable_op_profiling` is also turned on. When both parameters are set to
    `true`, the profile of ops on the Hexagon DSP will be added to the profile
    table. Note that the reported data on Hexagon is in cycles, not in ms as on
    the CPU.

#### XNNPACK delegate

*   `use_xnnpack`: `bool` (default=false) \
    Note that if this option is explicitly set to `false`, the TfLite runtime
    will use its original CPU kernels for model execution. In other words, if
    the TfLite runtime is built with the XNNPACK delegate applied by default,
    explicitly setting this flag to `false` will cause the benchmark tool to
    disable that feature at runtime and to use the original non-delegated CPU
    execution path for model benchmarking.

#### CoreML delegate

*   `use_coreml`: `bool` (default=false)
*   `coreml_version`: `int` (default=0)

#### External delegate

*   `external_delegate_path`: `string` (default="")
*   `external_delegate_options`: `string` (default="")

As some delegates are only available on certain platforms, specifying `--help`
when running the benchmark tool on a particular platform will print out all
supported parameters.

### Use multiple delegates

When multiple delegates are specified in the commandline flags, the order in
which they are applied to the TfLite runtime is the same as the order in which
their enabling commandline flags are specified. For example,
"--use_xnnpack=true --use_gpu=true" means applying the XNNPACK delegate first
and then the GPU delegate second. In comparison, "--use_gpu=true
--use_xnnpack=true" means applying the GPU delegate first and then the XNNPACK
delegate second.
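
As an illustrative sketch (reusing the example model from the run instructions
below), the following applies the XNNPACK delegate first and then the GPU
delegate:

```
adb shell /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/mobilenet_quant_v1_224.tflite \
  --use_xnnpack=true \
  --use_gpu=true
```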

## To build/install/run

### On Android:

(0) Refer to https://www.tensorflow.org/lite/guide/build_android to edit the
`WORKSPACE` to configure the Android NDK/SDK.

(1) Build for your specific platform, e.g.:

```
bazel build -c opt \
  --config=android_arm64 \
  tensorflow/lite/tools/benchmark:benchmark_model
```

(2) Connect your phone. Push the binary to your phone with adb push (make the
directory if required):

```
adb push bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model /data/local/tmp
```

(3) Make the binary executable.

```
adb shell chmod +x /data/local/tmp/benchmark_model
```

(4) Push the compute graph that you need to test. For example:

```
adb push mobilenet_quant_v1_224.tflite /data/local/tmp
```

(5) Optionally, install Hexagon libraries on the device.

This step is only needed when using the Hexagon delegate.

```
bazel build --config=android_arm64 \
  tensorflow/lite/delegates/hexagon/hexagon_nn:libhexagon_interface.so
adb push bazel-bin/tensorflow/lite/delegates/hexagon/hexagon_nn/libhexagon_interface.so /data/local/tmp
adb push libhexagon_nn_skel*.so /data/local/tmp
```
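
If you installed the Hexagon libraries above, a sketch of the corresponding
benchmark invocation (reusing the model pushed in step (4)) could be:

```
adb shell /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/mobilenet_quant_v1_224.tflite \
  --use_hexagon=true
```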

(6) Run the benchmark. For example:

```
adb shell /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/mobilenet_quant_v1_224.tflite \
  --num_threads=4
```

### On desktop:

(1) Build the binary:

```
bazel build -c opt tensorflow/lite/tools/benchmark:benchmark_model
```

(2) Run on your compute graph, similar to the Android case but without the
need for `adb shell`. For example:

```
bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model \
  --graph=mobilenet_quant_v1_224.tflite \
  --num_threads=4
```

The MobileNet graph used as an example here may be downloaded from
[here](https://storage.googleapis.com/download.tensorflow.org/models/tflite/mobilenet_v1_224_android_quant_2017_11_08.zip).

## Reducing variance between runs on Android

Most modern Android phones use the
[ARM big.LITTLE](https://en.wikipedia.org/wiki/ARM_big.LITTLE) architecture,
where some cores are more power hungry but faster than other cores. When
running benchmarks on these phones, there can be significant variance between
different runs of the benchmark. One way to reduce variance between runs is to
set the [CPU affinity](https://en.wikipedia.org/wiki/Processor_affinity)
before running the benchmark. On Android this can be done using the `taskset`
command. E.g., for running the benchmark on big cores on Pixel 2 with a single
thread, one can use the following command:

```
adb shell taskset f0 /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/mobilenet_quant_v1_224.tflite \
  --num_threads=1
```

where `f0` is the affinity mask for big cores on Pixel 2.
Note: The affinity mask varies with the device.

## Profiling model operators

The benchmark model binary also allows you to profile operators and report the
execution time of each operator. To do this, pass the flag
`--enable_op_profiling=true` to `benchmark_model` during invocation, e.g.,

```
adb shell taskset f0 /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/mobilenet_quant_v1_224.tflite \
  --enable_op_profiling=true
```

When enabled, the `benchmark_model` binary will produce detailed statistics for
each operation similar to those shown below:

```

============================== Run Order ==============================
	             [node type]	  [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
	                 CONV_2D	    0.000	    4.269	    4.269	  0.107%	  0.107%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_0/Relu6]
	       DEPTHWISE_CONV_2D	    4.270	    2.150	    2.150	  0.054%	  0.161%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_1_depthwise/Relu6]
	                 CONV_2D	    6.421	    6.107	    6.107	  0.153%	  0.314%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_1_pointwise/Relu6]
	       DEPTHWISE_CONV_2D	   12.528	    1.366	    1.366	  0.034%	  0.348%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_2_depthwise/Relu6]
	                 CONV_2D	   13.895	    4.195	    4.195	  0.105%	  0.454%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_2_pointwise/Relu6]
	       DEPTHWISE_CONV_2D	   18.091	    1.260	    1.260	  0.032%	  0.485%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_3_depthwise/Relu6]
	                 CONV_2D	   19.352	    6.652	    6.652	  0.167%	  0.652%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Relu6]
	       DEPTHWISE_CONV_2D	   26.005	    0.698	    0.698	  0.018%	  0.670%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_4_depthwise/Relu6]
	                 CONV_2D	   26.703	    3.344	    3.344	  0.084%	  0.754%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_4_pointwise/Relu6]
	       DEPTHWISE_CONV_2D	   30.047	    0.646	    0.646	  0.016%	  0.770%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_5_depthwise/Relu6]
	                 CONV_2D	   30.694	    5.800	    5.800	  0.145%	  0.915%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_5_pointwise/Relu6]
	       DEPTHWISE_CONV_2D	   36.495	    0.331	    0.331	  0.008%	  0.924%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_6_depthwise/Relu6]
	                 CONV_2D	   36.826	    2.838	    2.838	  0.071%	  0.995%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_6_pointwise/Relu6]
	       DEPTHWISE_CONV_2D	   39.665	    0.439	    0.439	  0.011%	  1.006%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_7_depthwise/Relu6]
	                 CONV_2D	   40.105	    5.293	    5.293	  0.133%	  1.139%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_7_pointwise/Relu6]
	       DEPTHWISE_CONV_2D	   45.399	    0.352	    0.352	  0.009%	  1.147%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_8_depthwise/Relu6]
	                 CONV_2D	   45.752	    5.322	    5.322	  0.133%	  1.281%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_8_pointwise/Relu6]
	       DEPTHWISE_CONV_2D	   51.075	    0.357	    0.357	  0.009%	  1.290%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_9_depthwise/Relu6]
	                 CONV_2D	   51.432	    5.693	    5.693	  0.143%	  1.433%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_9_pointwise/Relu6]
	       DEPTHWISE_CONV_2D	   57.126	    0.366	    0.366	  0.009%	  1.442%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_10_depthwise/Relu6]
	                 CONV_2D	   57.493	    5.472	    5.472	  0.137%	  1.579%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_10_pointwise/Relu6]
	       DEPTHWISE_CONV_2D	   62.966	    0.364	    0.364	  0.009%	  1.588%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_11_depthwise/Relu6]
	                 CONV_2D	   63.330	    5.404	    5.404	  0.136%	  1.724%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_11_pointwise/Relu6]
	       DEPTHWISE_CONV_2D	   68.735	    0.155	    0.155	  0.004%	  1.728%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_12_depthwise/Relu6]
	                 CONV_2D	   68.891	    2.970	    2.970	  0.074%	  1.802%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_12_pointwise/Relu6]
	       DEPTHWISE_CONV_2D	   71.862	    0.206	    0.206	  0.005%	  1.807%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_13_depthwise/Relu6]
	                 CONV_2D	   72.069	    5.888	    5.888	  0.148%	  1.955%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_13_pointwise/Relu6]
	         AVERAGE_POOL_2D	   77.958	    0.036	    0.036	  0.001%	  1.956%	     0.000	        0	[MobilenetV1/Logits/AvgPool_1a/AvgPool]
	                 CONV_2D	   77.994	    1.445	    1.445	  0.036%	  1.992%	     0.000	        0	[MobilenetV1/Logits/Conv2d_1c_1x1/BiasAdd]
	                 RESHAPE	   79.440	    0.002	    0.002	  0.000%	  1.992%	     0.000	        0	[MobilenetV1/Predictions/Reshape]
	                 SOFTMAX	   79.443	    0.029	    0.029	  0.001%	  1.993%	     0.000	        0	[MobilenetV1/Predictions/Softmax]

============================== Top by Computation Time ==============================
	             [node type]	  [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
	                 CONV_2D	   19.352	    6.652	    6.652	  0.167%	  0.167%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Relu6]
	                 CONV_2D	    6.421	    6.107	    6.107	  0.153%	  0.320%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_1_pointwise/Relu6]
	                 CONV_2D	   72.069	    5.888	    5.888	  0.148%	  0.468%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_13_pointwise/Relu6]
	                 CONV_2D	   30.694	    5.800	    5.800	  0.145%	  0.613%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_5_pointwise/Relu6]
	                 CONV_2D	   51.432	    5.693	    5.693	  0.143%	  0.756%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_9_pointwise/Relu6]
	                 CONV_2D	   57.493	    5.472	    5.472	  0.137%	  0.893%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_10_pointwise/Relu6]
	                 CONV_2D	   63.330	    5.404	    5.404	  0.136%	  1.029%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_11_pointwise/Relu6]
	                 CONV_2D	   45.752	    5.322	    5.322	  0.133%	  1.162%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_8_pointwise/Relu6]
	                 CONV_2D	   40.105	    5.293	    5.293	  0.133%	  1.295%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_7_pointwise/Relu6]
	                 CONV_2D	    0.000	    4.269	    4.269	  0.107%	  1.402%	     0.000	        0	[MobilenetV1/MobilenetV1/Conv2d_0/Relu6]

Number of nodes executed: 31
============================== Summary by node type ==============================
	             [Node type]	  [count]	  [avg ms]	    [avg %]	    [cdf %]	  [mem KB]	[times called]
	                 CONV_2D	       15	     1.406	    89.270%	    89.270%	     0.000	        0
	       DEPTHWISE_CONV_2D	       13	     0.169	    10.730%	   100.000%	     0.000	        0
	                 SOFTMAX	        1	     0.000	     0.000%	   100.000%	     0.000	        0
	                 RESHAPE	        1	     0.000	     0.000%	   100.000%	     0.000	        0
	         AVERAGE_POOL_2D	        1	     0.000	     0.000%	   100.000%	     0.000	        0

Timings (microseconds): count=50 first=79449 curr=81350 min=77385 max=88213 avg=79732 std=1929
Memory (bytes): count=0
31 nodes observed


Average inference timings in us: Warmup: 83235, Init: 38467, Inference: 79760.9
```
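
To export the per-operator profile as CSV instead of printing it to `stdout`,
combine `--enable_op_profiling` with `--profiling_output_csv_file` (the output
path below is a placeholder):

```
adb shell /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/mobilenet_quant_v1_224.tflite \
  --enable_op_profiling=true \
  --profiling_output_csv_file=/data/local/tmp/profile.csv
```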

## Benchmark multiple performance options in a single run

A convenient and simple C++ binary is also provided to benchmark multiple
performance options in a single run. This binary is built based on the
aforementioned benchmark tool, which can only benchmark a single performance
option at a time. They share the same build/install/run process, but the BUILD
target name of this binary is `benchmark_model_performance_options`, and it
takes some additional parameters as detailed below.

### Additional Parameters

*   `perf_options_list`: `string` (default='all') \
    A comma-separated list of TFLite performance options to benchmark.
*   `option_benchmark_run_delay`: `float` (default=-1.0) \
    The delay, in seconds, between two consecutive runs of benchmarking
    performance options.
*   `random_shuffle_benchmark_runs`: `bool` (default=true) \
    Whether to perform all benchmark runs, each of which has different
    performance options, in a random order.
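
For example, a sketch of running this binary on desktop with the default set
of performance options (the model path is a placeholder) could be:

```
bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model_performance_options \
  --graph=/path/to/your_model.tflite \
  --perf_options_list=all
```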

## Build the benchmark tool with TensorFlow ops support

You can build the benchmark tool with
[TensorFlow operators support](https://www.tensorflow.org/lite/guide/ops_select).

### How to build

To build the tool, you need to use the `benchmark_model_plus_flex` target with
the `--config=monolithic` option.

```
bazel build -c opt \
  --config=monolithic \
  tensorflow/lite/tools/benchmark:benchmark_model_plus_flex
```

### How to benchmark a tflite model with TensorFlow ops

TensorFlow ops support just works when the benchmark tool is built with
TensorFlow ops support; it doesn't require any additional option to use.

```
bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model_plus_flex \
  --graph=model_converted_with_TF_ops.tflite
```