# Model optimization

Edge devices often have limited memory or computational power. Various
optimizations can be applied to models so that they can be run within these
constraints. In addition, some optimizations allow the use of specialized
hardware for accelerated inference.

TensorFlow Lite and the
[TensorFlow Model Optimization Toolkit](https://www.tensorflow.org/model_optimization)
provide tools to minimize the complexity of optimizing inference.

It's recommended that you consider model optimization during your application
development process. This document outlines some best practices for optimizing
TensorFlow models for deployment to edge hardware.

## Why models should be optimized

There are several main ways model optimization can help with application
development.

### Size reduction

Some forms of optimization can be used to reduce the size of a model. Smaller
models have the following benefits:

- **Smaller storage size:** Smaller models occupy less storage space on your
  users' devices. For example, an Android app using a smaller model will take
  up less storage space on a user's mobile device.
- **Smaller download size:** Smaller models require less time and bandwidth to
  download to users' devices.
- **Less memory usage:** Smaller models use less RAM when they are run, which
  frees up memory for other parts of your application to use, and can
  translate to better performance and stability.

Quantization can reduce the size of a model in all of these cases, potentially
at the expense of some accuracy. Pruning and clustering can reduce the size of a
model for download by making it more easily compressible.

### Latency reduction

*Latency* is the amount of time it takes to run a single inference with a given
model. Some forms of optimization can reduce the amount of computation required
to run inference using a model, resulting in lower latency. Latency can also
have an impact on power consumption.

Currently, quantization can be used to reduce latency by simplifying the
calculations that occur during inference, potentially at the expense of some
accuracy.

### Accelerator compatibility

Some hardware accelerators, such as the
[Edge TPU](https://cloud.google.com/edge-tpu/), can run inference extremely fast
with models that have been correctly optimized.

Generally, these types of devices require models to be quantized in a specific
way. See each hardware accelerator's documentation to learn more about their
requirements.

## Trade-offs

Optimizations can potentially result in changes in model accuracy, which must be
considered during the application development process.

The accuracy changes depend on the individual model being optimized, and are
difficult to predict ahead of time. Generally, models that are optimized for
size or latency will lose a small amount of accuracy. Depending on your
application, this may or may not impact your users' experience. In rare cases,
certain models may gain some accuracy as a result of the optimization process.

## Types of optimization

TensorFlow Lite currently supports optimization via quantization, pruning and
clustering.

These are part of the
[TensorFlow Model Optimization Toolkit](https://www.tensorflow.org/model_optimization),
which provides resources for model optimization techniques that are compatible
with TensorFlow Lite.
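
Each of these techniques is applied either during training with the toolkit or
when converting a model to the TensorFlow Lite format. As an orienting example,
the following is a minimal sketch of the most common entry point: enabling the
default optimization flag on the TensorFlow Lite converter. The SavedModel path
and output filename are placeholders, not part of this guide's examples.

```python
import tensorflow as tf

# Placeholder path; point this at your own trained SavedModel.
saved_model_dir = "/tmp/my_saved_model"

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
# Optimize.DEFAULT enables post-training quantization (dynamic range by
# default), the simplest of the techniques described below.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# Save the optimized model so it can be bundled with your application.
with open("model_optimized.tflite", "wb") as f:
    f.write(tflite_model)
```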

### Quantization

[Quantization](https://www.tensorflow.org/model_optimization/guide/quantization/post_training)
works by reducing the precision of the numbers used to represent a model's
parameters, which by default are 32-bit floating point numbers. This results in
a smaller model size and faster computation.

The following types of quantization are available in TensorFlow Lite:

Technique | Data requirements | Size reduction | Accuracy | Supported hardware
--------- | ----------------- | -------------- | -------- | ------------------
[Post-training float16 quantization](post_training_float16_quant.ipynb) | No data | Up to 50% | Insignificant accuracy loss | CPU, GPU
[Post-training dynamic range quantization](post_training_quant.ipynb) | No data | Up to 75% | Accuracy loss | CPU, GPU (Android)
[Post-training integer quantization](post_training_integer_quant.ipynb) | Unlabelled representative sample | Up to 75% | Smaller accuracy loss | CPU, GPU (Android), EdgeTPU, Hexagon DSP
[Quantization-aware training](http://www.tensorflow.org/model_optimization/guide/quantization/training) | Labelled training data | Up to 75% | Smallest accuracy loss | CPU, GPU (Android), EdgeTPU, Hexagon DSP

The following decision tree helps you select the quantization scheme you might
want to use for your model, based simply on the expected model size and
accuracy.

Below are the latency and accuracy results for post-training quantization and
quantization-aware training on a few models. All latency numbers are measured on
Pixel 2 devices using a single big core CPU. As the toolkit improves, so will
the numbers here:

<figure>
  <table>
    <tr>
      <th>Model</th>
      <th>Top-1 Accuracy (Original)</th>
      <th>Top-1 Accuracy (Post Training Quantized)</th>
      <th>Top-1 Accuracy (Quantization Aware Training)</th>
      <th>Latency (Original) (ms)</th>
      <th>Latency (Post Training Quantized) (ms)</th>
      <th>Latency (Quantization Aware Training) (ms)</th>
      <th>Size (Original) (MB)</th>
      <th>Size (Optimized) (MB)</th>
    </tr>
    <tr><td>Mobilenet-v1-1-224</td><td>0.709</td><td>0.657</td><td>0.70</td>
      <td>124</td><td>112</td><td>64</td><td>16.9</td><td>4.3</td></tr>
    <tr><td>Mobilenet-v2-1-224</td><td>0.719</td><td>0.637</td><td>0.709</td>
      <td>89</td><td>98</td><td>54</td><td>14</td><td>3.6</td></tr>
    <tr><td>Inception_v3</td><td>0.78</td><td>0.772</td><td>0.775</td>
      <td>1130</td><td>845</td><td>543</td><td>95.7</td><td>23.9</td></tr>
    <tr><td>Resnet_v2_101</td><td>0.770</td><td>0.768</td><td>N/A</td>
      <td>3973</td><td>2868</td><td>N/A</td><td>178.3</td><td>44.9</td></tr>
  </table>
  <figcaption>
    <b>Table 1</b> Benefits of model quantization for select CNN models
  </figcaption>
</figure>

### Full integer quantization with int16 activations and int8 weights

[Quantization with int16 activations](https://www.tensorflow.org/model_optimization/guide/quantization/post_training)
is a full integer quantization scheme with activations in int16 and weights in
int8. This mode can improve the accuracy of the quantized model in comparison to
the full integer quantization scheme with both activations and weights in int8,
while keeping a similar model size. It is recommended when activations are
sensitive to quantization.
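
The following is a minimal sketch of requesting this mode during conversion via
the converter's experimental 16x8 op set. The SavedModel path, input shape and
random representative data are placeholders; in practice, feed a small set of
real input samples, and note that the experimental op set's availability can
vary across TensorFlow versions.

```python
import numpy as np
import tensorflow as tf

saved_model_dir = "/tmp/my_saved_model"  # Placeholder path to your SavedModel.

def representative_dataset():
    # Yield a few samples matching the model's input signature; an input of
    # shape 1x224x224x3 is assumed here purely for illustration.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Request the 16x8 scheme: int16 activations with int8 weights.
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8
]
tflite_16x8_model = converter.convert()
```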

<i>NOTE:</i> Currently, only non-optimized reference kernel implementations are
available in TFLite for this quantization scheme, so by default the performance
will be slow compared to int8 kernels. The full advantages of this mode can
currently be accessed via specialised hardware or custom software.

Below are the accuracy results for some models that benefit from this mode.

<figure>
  <table>
    <tr>
      <th>Model</th>
      <th>Accuracy metric type</th>
      <th>Accuracy (float32 activations)</th>
      <th>Accuracy (int8 activations)</th>
      <th>Accuracy (int16 activations)</th>
    </tr>
    <tr><td>Wav2letter</td><td>WER</td><td>6.7%</td><td>7.7%</td><td>7.2%</td></tr>
    <tr><td>DeepSpeech 0.5.1 (unrolled)</td><td>CER</td><td>6.13%</td><td>43.67%</td><td>6.52%</td></tr>
    <tr><td>YoloV3</td><td>mAP (IOU=0.5)</td><td>0.577</td><td>0.563</td><td>0.574</td></tr>
    <tr><td>MobileNetV1</td><td>Top-1 Accuracy</td><td>0.7062</td><td>0.694</td><td>0.6936</td></tr>
    <tr><td>MobileNetV2</td><td>Top-1 Accuracy</td><td>0.718</td><td>0.7126</td><td>0.7137</td></tr>
    <tr><td>MobileBert</td><td>F1 (Exact match)</td><td>88.81 (81.23)</td><td>2.08 (0)</td><td>88.73 (81.15)</td></tr>
  </table>
  <figcaption>
    <b>Table 2</b> Benefits of model quantization with int16 activations
  </figcaption>
</figure>

### Pruning

[Pruning](https://www.tensorflow.org/model_optimization/guide/pruning) works by
removing parameters within a model that have only a minor impact on its
predictions. Pruned models are the same size on disk, and have the same runtime
latency, but can be compressed more effectively. This makes pruning a useful
technique for reducing model download size.

In the future, TensorFlow Lite will provide latency reduction for pruned models.

### Clustering

[Clustering](https://www.tensorflow.org/model_optimization/guide/clustering)
works by grouping the weights of each layer in a model into a predefined number
of clusters, then sharing the centroid values for the weights belonging to each
individual cluster. This reduces the number of unique weight values in a model,
thus reducing its complexity.

As a result, clustered models can be compressed more effectively, providing
deployment benefits similar to pruning.

## Development workflow

As a starting point, check if the models in
[hosted models](../guide/hosted_models.md) can work for your application. If
not, we recommend starting with the
[post-training quantization tool](post_training_quantization.md), since it is
broadly applicable and does not require training data.

For cases where the accuracy and latency targets are not met, or hardware
accelerator support is important,
[quantization-aware training](https://www.tensorflow.org/model_optimization/guide/quantization/training){:.external}
is the better option. See additional optimization techniques under the
[TensorFlow Model Optimization Toolkit](https://www.tensorflow.org/model_optimization).

If you want to further reduce your model size, you can try [pruning](#pruning)
and/or [clustering](#clustering) prior to quantizing your models.
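
As an illustration of that last step, here is a minimal sketch of pruning a
Keras model with the TensorFlow Model Optimization Toolkit and then quantizing
it during conversion. The toy model, sparsity target and (commented-out)
fine-tuning call are placeholders; in practice, fine-tune the pruned model on
your own training data before exporting it.

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Toy stand-in for your trained model.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])

# Wrap the model so that low-magnitude weights are zeroed out during
# fine-tuning; 50% sparsity is an arbitrary example target.
pruning_params = {
    "pruning_schedule": tfmot.sparsity.keras.ConstantSparsity(
        target_sparsity=0.5, begin_step=0),
}
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(model, **pruning_params)
pruned_model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])

# Fine-tune on your own data; the UpdatePruningStep callback is required:
# pruned_model.fit(x_train, y_train, epochs=2,
#                  callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Remove the pruning wrappers, then quantize during conversion as usual.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
converter = tf.lite.TFLiteConverter.from_keras_model(final_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
```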