# TFLite Model Benchmark Tool with C++ Binary

## Description

A simple C++ binary to benchmark a TFLite model and its individual operators,
both on desktop machines and on Android. The binary takes a TFLite model,
generates random inputs, and then repeatedly runs the model for the specified
number of runs. Aggregate latency statistics are reported after running the
benchmark.

The instructions below are for running the binary on desktop and Android. For
iOS, please use the
[iOS benchmark app](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark/ios).

An experimental Android APK wrapper for the benchmark model utility offers more
faithful execution behavior on Android (via a foreground Activity). It is
located
[here](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark/android).

## Parameters

The binary takes the following required parameters:

*   `graph`: `string` \
    The path to the TFLite model file.

and the following optional parameters:

*   `num_threads`: `int` (default=-1) \
    The number of threads to use for running the TFLite interpreter. By
    default, this is set to the platform default value -1.
*   `warmup_runs`: `int` (default=1) \
    The number of warmup runs to do before starting the benchmark.
*   `num_runs`: `int` (default=50) \
    The number of runs. Increase this to reduce variance.
*   `run_delay`: `float` (default=-1.0) \
    The delay in seconds between subsequent benchmark runs. Non-positive values
    mean use no delay.
*   `run_frequency`: `float` (default=-1.0) \
    The frequency of running a benchmark run, as the number of prorated runs
    per second. If the targeted rate per second cannot be reached, the
    benchmark starts the next run immediately, trying its best to catch up. If
    set, this overrides the `run_delay` parameter. A non-positive value means
    there is no delay between subsequent runs.
*   `enable_op_profiling`: `bool` (default=false) \
    Whether to enable per-operator profiling measurement.
*   `max_profiling_buffer_entries`: `int` (default=1024) \
    The maximum number of profiling events that will be stored during each
    inference run. It is only meaningful when `enable_op_profiling` is set to
    `true`. Note that the actual value of this parameter will be adjusted if
    the model has more nodes than the value specified here.
*   `profiling_output_csv_file`: `str` (default="") \
    File path to export profile data to as CSV. Requires `enable_op_profiling`
    to be `true` and the path to include the name of the output CSV file;
    otherwise, results are printed to `stdout`.
*   `print_preinvoke_state`: `bool` (default=false) \
    Whether to print out the TfLite interpreter internals just before calling
    `tflite::Interpreter::Invoke`. The internals include the allocated memory
    size of each tensor, etc. Enabling this can help in understanding the
    TfLite graph and its memory usage.
*   `print_postinvoke_state`: `bool` (default=false) \
    Whether to print out the TfLite interpreter internals just before the
    benchmark completes (i.e. after all repeated `Invoke` calls complete). The
    internals include the allocated memory size of each tensor, etc. Enabling
    this can help in understanding the TfLite graph and its memory usage,
    particularly when there are dynamic-shaped tensors in the graph.
*   `report_peak_memory_footprint`: `bool` (default=false) \
    Whether to report the peak memory footprint by periodically checking the
    memory footprint. Internally, a separate thread is spawned for this
    periodic check, so the performance benchmark result could be affected.
*   `memory_footprint_check_interval_ms`: `int` (default=50) \
    The interval in milliseconds between two consecutive memory footprint
    checks. This is only used when `report_peak_memory_footprint` is set to
    true.
*   `dry_run`: `bool` (default=false) \
    Whether to run the tool by just loading the model, allocating tensors,
    etc., but without actually invoking any op kernels.
*   `verbose`: `bool` (default=false) \
    Whether to log parameters whose values are not set. By default, only
    parameters whose values are set by parsing the command-line flags are
    logged.
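For example, a single desktop invocation combining several of the optional
flags above might look like the following sketch; the model path and flag
values are placeholders:

```
# Hypothetical example: 100 measured runs after 5 warmup runs, using 4 threads,
# with per-operator profiling enabled. The model path is a placeholder.
bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model \
  --graph=/tmp/my_model.tflite \
  --num_threads=4 \
  --warmup_runs=5 \
  --num_runs=100 \
  --enable_op_profiling=true
```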
### Model input parameters

By default, the tool uses randomized data for model inputs. The following
parameters allow users to specify customized input values to the model when
running the benchmark tool:

*   `input_layer`: `string` \
    A comma-separated list of input layer names, e.g. 'input1,input2'. Note
    that all inputs of the model graph need to be specified; however, the input
    names do not need to match those encoded in the model. Additionally, the
    order of the input layer names specified here is assumed to be the same as
    the order seen by the TensorFlow Lite interpreter. This is a bit
    inconvenient, but the
    [visualization tool](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tools/visualize.py)
    should help to find this order.
*   `input_layer_shape`: `string` \
    A colon-separated list of input layer shapes, where each shape is a
    comma-separated list, e.g. '1,30:1,10'. Similar to `input_layer`, this
    parameter requires the shapes of all inputs to be specified, in the same
    order as seen by the interpreter.
*   `input_layer_value_range`: `string` \
    A map-like string representing the value range for *integer* input layers.
    Each item is separated by ':', and each item consists of an input layer
    name and integer-only range values (both low and high are inclusive)
    separated by ',', e.g. 'input1,1,2:input2,0,254'. Note that the input layer
    name must exist in the list of names specified by `input_layer`.
*   `input_layer_value_files`: `string` \
    A map-like string representing files that contain input values. Each item
    is separated by ',', and each item consists of an input layer name and a
    file path separated by ':', e.g. 'input1:file_path1,input2:file_path2'. If
    an input name appears in both `input_layer_value_range` and
    `input_layer_value_files`, the corresponding input value range specified by
    `input_layer_value_range` will be ignored. The file format is binary, and
    the content should be either a byte array or null-separated strings. Note
    that the input layer name must also exist in the list of names specified by
    `input_layer`.
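For instance, a sketch that fixes the shapes and value ranges of two
hypothetical integer inputs named `input1` and `input2`; the names, shapes, and
ranges below are placeholders and must match the actual model:

```
# Hypothetical example: customized input shapes and integer value ranges.
bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model \
  --graph=/tmp/my_model.tflite \
  --input_layer=input1,input2 \
  --input_layer_shape=1,30:1,10 \
  --input_layer_value_range=input1,1,2:input2,0,254
```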
### TFLite delegate parameters

The tool supports all runtime/delegate parameters introduced by
[the delegate registrar](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/delegates).
The following simply lists the names of these parameters and additional notes
where applicable. For details about each parameter, please refer to
[this page](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tools/delegates/README.md#tflite-delegate-registrar).

#### Common parameters

*   `max_delegated_partitions`: `int` (default=0)
*   `min_nodes_per_partition`: `int` (default=0)
*   `delegate_serialize_dir`: `str` (default="")
*   `delegate_serialize_token`: `str` (default="")

#### GPU delegate

*   `use_gpu`: `bool` (default=false)
*   `gpu_precision_loss_allowed`: `bool` (default=true)
*   `gpu_experimental_enable_quant`: `bool` (default=true)
*   `gpu_inference_for_sustained_speed`: `bool` (default=false)
*   `gpu_backend`: `string` (default="")
*   `gpu_wait_type`: `str` (default="")

#### NNAPI delegate

*   `use_nnapi`: `bool` (default=false) \
    Note that some Android P devices will fail to use NNAPI for models in
    `/data/local/tmp/`, and this benchmark tool will not correctly use NNAPI.
*   `nnapi_execution_preference`: `str` (default="") \
    Should be one of: `fast_single_answer`, `sustained_speed`, `low_power`,
    `undefined`.
*   `nnapi_execution_priority`: `str` (default="") \
    Note this requires Android 11+.
*   `nnapi_accelerator_name`: `str` (default="") \
    Note this requires Android 10+.
*   `disable_nnapi_cpu`: `bool` (default=true)
*   `nnapi_allow_fp16`: `bool` (default=false)
*   `nnapi_allow_dynamic_dimensions`: `bool` (default=false)
*   `nnapi_use_burst_mode`: `bool` (default=false)

#### Hexagon delegate

*   `use_hexagon`: `bool` (default=false)
*   `hexagon_profiling`: `bool` (default=false) \
    Note that enabling this option will not produce profiling results unless
    `enable_op_profiling` is also turned on. When both parameters are set to
    true, the profile of ops on the Hexagon DSP will be added to the profile
    table. Note that the reported data for Hexagon is in cycles, not in ms as
    on the CPU.

#### XNNPACK delegate

*   `use_xnnpack`: `bool` (default=false) \
    Note that if this option is explicitly set to `false`, the TfLite runtime
    will use its original CPU kernels for model execution. In other words, once
    the feature of applying the XNNPACK delegate by default is enabled in the
    TfLite runtime, explicitly setting this flag to `false` causes the
    benchmark tool to disable that feature at runtime and to use the original
    non-delegated CPU execution path for model benchmarking.

#### CoreML delegate

*   `use_coreml`: `bool` (default=false)
*   `coreml_version`: `int` (default=0)

#### External delegate

*   `external_delegate_path`: `string` (default="")
*   `external_delegate_options`: `string` (default="")

As some delegates are only available on certain platforms, when running the
benchmark tool on a particular platform, specifying `--help` will print out all
supported parameters.

### Use multiple delegates

When multiple delegates are specified in the command-line flags, they are
applied to the TfLite runtime in the same order as their enabling flags appear.
For example, "--use_xnnpack=true --use_gpu=true" means applying the XNNPACK
delegate first and then the GPU delegate, while "--use_gpu=true
--use_xnnpack=true" means applying the GPU delegate first and then the XNNPACK
delegate.
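To make this concrete, a sketch of an on-device run that applies the GPU
delegate first and then the XNNPACK delegate, reusing the Android paths from
the build/run steps below:

```
# Hypothetical example: the GPU delegate is applied first, then XNNPACK,
# following the ordering rule described above.
adb shell /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/mobilenet_quant_v1_224.tflite \
  --use_gpu=true \
  --use_xnnpack=true
```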
## To build/install/run

### On Android:

(0) Refer to https://www.tensorflow.org/lite/guide/build_android to edit the
`WORKSPACE` to configure the Android NDK/SDK.

(1) Build for your specific platform, e.g.:

```
bazel build -c opt \
  --config=android_arm64 \
  tensorflow/lite/tools/benchmark:benchmark_model
```

(2) Connect your phone. Push the binary to your phone with `adb push`
(make the directory if required):

```
adb push bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model /data/local/tmp
```

(3) Make the binary executable.

```
adb shell chmod +x /data/local/tmp/benchmark_model
```

(4) Push the compute graph that you need to test. For example:

```
adb push mobilenet_quant_v1_224.tflite /data/local/tmp
```

(5) Optionally, install the Hexagon libraries on the device. This step is only
needed when using the Hexagon delegate.

```
bazel build --config=android_arm64 \
  tensorflow/lite/delegates/hexagon/hexagon_nn:libhexagon_interface.so
adb push bazel-bin/tensorflow/lite/delegates/hexagon/hexagon_nn/libhexagon_interface.so /data/local/tmp
adb push libhexagon_nn_skel*.so /data/local/tmp
```

(6) Run the benchmark. For example:

```
adb shell /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/mobilenet_quant_v1_224.tflite \
  --num_threads=4
```

### On desktop:

(1) Build the binary.

```
bazel build -c opt tensorflow/lite/tools/benchmark:benchmark_model
```

(2) Run on your compute graph, similar to the Android case but without the need
for `adb shell`. For example:

```
bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model \
  --graph=mobilenet_quant_v1_224.tflite \
  --num_threads=4
```

The MobileNet graph used as an example here may be downloaded from
[here](https://storage.googleapis.com/download.tensorflow.org/models/tflite/mobilenet_v1_224_android_quant_2017_11_08.zip).

## Reducing variance between runs on Android

Most modern Android phones use the
[ARM big.LITTLE](https://en.wikipedia.org/wiki/ARM_big.LITTLE) architecture,
where some cores are more power-hungry but faster than other cores. When
running benchmarks on these phones, there can be significant variance between
different runs of the benchmark. One way to reduce variance between runs is to
set the [CPU affinity](https://en.wikipedia.org/wiki/Processor_affinity) before
running the benchmark. On Android this can be done using the `taskset` command.
E.g., to run the benchmark on the big cores of a Pixel 2 with a single thread,
one can use the following command:

```
adb shell taskset f0 /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/mobilenet_quant_v1_224.tflite \
  --num_threads=1
```

where `f0` is the affinity mask for big cores on Pixel 2.
Note: The affinity mask varies with the device.
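To find a suitable mask for another device, one approach (a sketch; the sysfs
layout is device- and kernel-dependent) is to compare the maximum frequencies
of the CPU cores and build a hex bitmask from the indices of the fastest ones:

```
# List each core's maximum frequency; cores reporting the highest value are
# typically the "big" cores. One bit per core index: if, say, cores 4-7 are
# the big ones, the mask is binary 11110000, i.e. f0.
adb shell 'for f in /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_max_freq; do echo "$f: $(cat $f)"; done'
```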
## Profiling model operators

The benchmark model binary also allows you to profile operators and report the
execution time of each operator. To do this, pass the flag
`--enable_op_profiling=true` to `benchmark_model` during invocation, e.g.:

```
adb shell taskset f0 /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/mobilenet_quant_v1_224.tflite \
  --enable_op_profiling=true
```

When enabled, the `benchmark_model` binary will produce detailed statistics for
each operation similar to those shown below:

```

============================== Run Order ==============================
             [node type]  [start]  [first] [avg ms]      [%]   [cdf%]  [mem KB]  [times called] [Name]
                 CONV_2D    0.000    4.269    4.269   0.107%   0.107%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_0/Relu6]
       DEPTHWISE_CONV_2D    4.270    2.150    2.150   0.054%   0.161%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_1_depthwise/Relu6]
                 CONV_2D    6.421    6.107    6.107   0.153%   0.314%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_1_pointwise/Relu6]
       DEPTHWISE_CONV_2D   12.528    1.366    1.366   0.034%   0.348%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_2_depthwise/Relu6]
                 CONV_2D   13.895    4.195    4.195   0.105%   0.454%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_2_pointwise/Relu6]
       DEPTHWISE_CONV_2D   18.091    1.260    1.260   0.032%   0.485%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_3_depthwise/Relu6]
                 CONV_2D   19.352    6.652    6.652   0.167%   0.652%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Relu6]
       DEPTHWISE_CONV_2D   26.005    0.698    0.698   0.018%   0.670%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_4_depthwise/Relu6]
                 CONV_2D   26.703    3.344    3.344   0.084%   0.754%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_4_pointwise/Relu6]
       DEPTHWISE_CONV_2D   30.047    0.646    0.646   0.016%   0.770%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_5_depthwise/Relu6]
                 CONV_2D   30.694    5.800    5.800   0.145%   0.915%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_5_pointwise/Relu6]
       DEPTHWISE_CONV_2D   36.495    0.331    0.331   0.008%   0.924%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_6_depthwise/Relu6]
                 CONV_2D   36.826    2.838    2.838   0.071%   0.995%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_6_pointwise/Relu6]
       DEPTHWISE_CONV_2D   39.665    0.439    0.439   0.011%   1.006%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_7_depthwise/Relu6]
                 CONV_2D   40.105    5.293    5.293   0.133%   1.139%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_7_pointwise/Relu6]
       DEPTHWISE_CONV_2D   45.399    0.352    0.352   0.009%   1.147%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_8_depthwise/Relu6]
                 CONV_2D   45.752    5.322    5.322   0.133%   1.281%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_8_pointwise/Relu6]
       DEPTHWISE_CONV_2D   51.075    0.357    0.357   0.009%   1.290%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_9_depthwise/Relu6]
                 CONV_2D   51.432    5.693    5.693   0.143%   1.433%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_9_pointwise/Relu6]
       DEPTHWISE_CONV_2D   57.126    0.366    0.366   0.009%   1.442%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_10_depthwise/Relu6]
                 CONV_2D   57.493    5.472    5.472   0.137%   1.579%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_10_pointwise/Relu6]
       DEPTHWISE_CONV_2D   62.966    0.364    0.364   0.009%   1.588%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_11_depthwise/Relu6]
                 CONV_2D   63.330    5.404    5.404   0.136%   1.724%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_11_pointwise/Relu6]
       DEPTHWISE_CONV_2D   68.735    0.155    0.155   0.004%   1.728%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_12_depthwise/Relu6]
                 CONV_2D   68.891    2.970    2.970   0.074%   1.802%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_12_pointwise/Relu6]
       DEPTHWISE_CONV_2D   71.862    0.206    0.206   0.005%   1.807%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_13_depthwise/Relu6]
                 CONV_2D   72.069    5.888    5.888   0.148%   1.955%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_13_pointwise/Relu6]
         AVERAGE_POOL_2D   77.958    0.036    0.036   0.001%   1.956%     0.000               0 [MobilenetV1/Logits/AvgPool_1a/AvgPool]
                 CONV_2D   77.994    1.445    1.445   0.036%   1.992%     0.000               0 [MobilenetV1/Logits/Conv2d_1c_1x1/BiasAdd]
                 RESHAPE   79.440    0.002    0.002   0.000%   1.992%     0.000               0 [MobilenetV1/Predictions/Reshape]
                 SOFTMAX   79.443    0.029    0.029   0.001%   1.993%     0.000               0 [MobilenetV1/Predictions/Softmax]

============================== Top by Computation Time ==============================
             [node type]  [start]  [first] [avg ms]      [%]   [cdf%]  [mem KB]  [times called] [Name]
                 CONV_2D   19.352    6.652    6.652   0.167%   0.167%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_3_pointwise/Relu6]
                 CONV_2D    6.421    6.107    6.107   0.153%   0.320%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_1_pointwise/Relu6]
                 CONV_2D   72.069    5.888    5.888   0.148%   0.468%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_13_pointwise/Relu6]
                 CONV_2D   30.694    5.800    5.800   0.145%   0.613%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_5_pointwise/Relu6]
                 CONV_2D   51.432    5.693    5.693   0.143%   0.756%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_9_pointwise/Relu6]
                 CONV_2D   57.493    5.472    5.472   0.137%   0.893%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_10_pointwise/Relu6]
                 CONV_2D   63.330    5.404    5.404   0.136%   1.029%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_11_pointwise/Relu6]
                 CONV_2D   45.752    5.322    5.322   0.133%   1.162%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_8_pointwise/Relu6]
                 CONV_2D   40.105    5.293    5.293   0.133%   1.295%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_7_pointwise/Relu6]
                 CONV_2D    0.000    4.269    4.269   0.107%   1.402%     0.000               0 [MobilenetV1/MobilenetV1/Conv2d_0/Relu6]

Number of nodes executed: 31
============================== Summary by node type ==============================
             [Node type]  [count]  [avg ms]    [avg %]     [cdf %]  [mem KB]  [times called]
                 CONV_2D       15     1.406    89.270%     89.270%     0.000               0
       DEPTHWISE_CONV_2D       13     0.169    10.730%    100.000%     0.000               0
                 SOFTMAX        1     0.000     0.000%    100.000%     0.000               0
                 RESHAPE        1     0.000     0.000%    100.000%     0.000               0
         AVERAGE_POOL_2D        1     0.000     0.000%    100.000%     0.000               0

Timings (microseconds): count=50 first=79449 curr=81350 min=77385 max=88213 avg=79732 std=1929
Memory (bytes): count=0
31 nodes observed


Average inference timings in us: Warmup: 83235, Init: 38467, Inference: 79760.9
```
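If machine-readable output is preferred, the same per-operator profile can be
exported as CSV via the `profiling_output_csv_file` flag described earlier; the
output path below is a placeholder:

```
# Hypothetical example: write the per-operator profile to a CSV file on the
# device instead of printing it to stdout, then pull it to the host.
adb shell /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/mobilenet_quant_v1_224.tflite \
  --enable_op_profiling=true \
  --profiling_output_csv_file=/data/local/tmp/profile.csv
adb pull /data/local/tmp/profile.csv .
```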
## Benchmark multiple performance options in a single run

A convenient and simple C++ binary is also provided to benchmark multiple
performance options in a single run. This binary is built on top of the
aforementioned benchmark tool, which can only benchmark a single performance
option at a time. They share the same build/install/run process, but the BUILD
target name of this binary is `benchmark_model_performance_options`, and it
takes some additional parameters as detailed below.

### Additional Parameters

*   `perf_options_list`: `string` (default='all') \
    A comma-separated list of TFLite performance options to benchmark.
*   `option_benchmark_run_delay`: `float` (default=-1.0) \
    The delay, in seconds, between two consecutive runs of benchmarking
    performance options.
*   `random_shuffle_benchmark_runs`: `bool` (default=true) \
    Whether to perform all benchmark runs, each of which uses different
    performance options, in a random order.
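A sketch of such a run on Android, assuming the
`benchmark_model_performance_options` binary has been built and pushed the same
way as `benchmark_model` above; the option names passed to
`--perf_options_list` are illustrative and may vary by platform:

```
# Hypothetical example: benchmark the CPU and GPU performance options in a
# single run; 'all' (the default) benchmarks every supported option.
adb shell /data/local/tmp/benchmark_model_performance_options \
  --graph=/data/local/tmp/mobilenet_quant_v1_224.tflite \
  --perf_options_list=cpu,gpu
```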
## Build the benchmark tool with TensorFlow ops support

You can build the benchmark tool with
[TensorFlow operators support](https://www.tensorflow.org/lite/guide/ops_select).

### How to build

To build the tool, you need to use the 'benchmark_model_plus_flex' target with
the '--config=monolithic' option.

```
bazel build -c opt \
  --config=monolithic \
  tensorflow/lite/tools/benchmark:benchmark_model_plus_flex
```

### How to benchmark a tflite model with TensorFlow ops

TensorFlow ops support just works when the benchmark tool is built with
TensorFlow ops support; no additional option is required to use it.

```
bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model_plus_flex \
  --graph=model_converted_with_TF_ops.tflite
```
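For on-device benchmarking of such models, it seems plausible to combine the
Android build config from the earlier section with '--config=monolithic' and
reuse the same push/run steps; the following is an unverified sketch:

```
# Hypothetical sketch: build the flex variant for Android, push it, and run it.
# The config combination is an assumption and may need adjustment.
bazel build -c opt \
  --config=android_arm64 \
  --config=monolithic \
  tensorflow/lite/tools/benchmark:benchmark_model_plus_flex
adb push bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model_plus_flex /data/local/tmp
adb shell chmod +x /data/local/tmp/benchmark_model_plus_flex
adb shell /data/local/tmp/benchmark_model_plus_flex \
  --graph=/data/local/tmp/model_converted_with_TF_ops.tflite
```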