# Model optimization

Edge devices often have limited memory or computational power. Various
optimizations can be applied to models so that they can be run within these
constraints. In addition, some optimizations allow the use of specialized
hardware for accelerated inference.

TensorFlow Lite and the
[TensorFlow Model Optimization Toolkit](https://www.tensorflow.org/model_optimization)
provide tools to minimize the complexity of optimizing inference.

It's recommended that you consider model optimization during your application
development process. This document outlines some best practices for optimizing
TensorFlow models for deployment to edge hardware.

## Why models should be optimized

There are several main ways model optimization can help with application
development.

### Size reduction

Some forms of optimization can be used to reduce the size of a model. Smaller
models have the following benefits:

-   **Smaller storage size:** Smaller models occupy less storage space on your
    users' devices. For example, an Android app using a smaller model will take
    up less storage space on a user's mobile device.
-   **Smaller download size:** Smaller models require less time and bandwidth to
    download to users' devices.
-   **Less memory usage:** Smaller models use less RAM when they are run, which
    frees up memory for other parts of your application to use, and can
    translate to better performance and stability.

Quantization can reduce the size of a model in all of these cases, potentially
at the expense of some accuracy. Pruning and clustering can reduce the size of a
model for download by making it more easily compressible.

### Latency reduction

*Latency* is the amount of time it takes to run a single inference with a given
model. Some forms of optimization can reduce the amount of computation required
to run inference using a model, resulting in lower latency. Latency can also
have an impact on power consumption.

Currently, quantization can be used to reduce latency by simplifying the
calculations that occur during inference, potentially at the expense of some
accuracy.

### Accelerator compatibility

Some hardware accelerators, such as the
[Edge TPU](https://cloud.google.com/edge-tpu/), can run inference extremely fast
with models that have been correctly optimized.

Generally, these types of devices require models to be quantized in a specific
way. See each hardware accelerator's documentation to learn more about their
requirements.

## Trade-offs

Optimizations can potentially result in changes in model accuracy, which must be
considered during the application development process.

The accuracy changes depend on the individual model being optimized, and are
difficult to predict ahead of time. Generally, models that are optimized for
size or latency will lose a small amount of accuracy. Depending on your
application, this may or may not impact your users' experience. In rare cases,
certain models may gain some accuracy as a result of the optimization process.

## Types of optimization

TensorFlow Lite currently supports optimization via quantization, pruning and
clustering.

These are part of the
[TensorFlow Model Optimization Toolkit](https://www.tensorflow.org/model_optimization),
which provides resources for model optimization techniques that are compatible
with TensorFlow Lite.

### Quantization

[Quantization](https://www.tensorflow.org/model_optimization/guide/quantization/post_training)
works by reducing the precision of the numbers used to represent a model's
parameters, which by default are 32-bit floating point numbers. This results in
a smaller model size and faster computation.

The following types of quantization are available in TensorFlow Lite:

Technique | Data requirements | Size reduction | Accuracy | Supported hardware
--------- | ----------------- | -------------- | -------- | ------------------
[Post-training float16 quantization](post_training_float16_quant.ipynb) | No data | Up to 50% | Insignificant accuracy loss | CPU, GPU
[Post-training dynamic range quantization](post_training_quant.ipynb) | No data | Up to 75% | Accuracy loss | CPU, GPU (Android)
[Post-training integer quantization](post_training_integer_quant.ipynb) | Unlabelled representative sample | Up to 75% | Smaller accuracy loss | CPU, GPU (Android), EdgeTPU, Hexagon DSP
[Quantization-aware training](http://www.tensorflow.org/model_optimization/guide/quantization/training) | Labelled training data | Up to 75% | Smallest accuracy loss | CPU, GPU (Android), EdgeTPU, Hexagon DSP
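
For instance, the post-training techniques that require no data can be applied
entirely at conversion time. The following is a minimal sketch rather than the
exact code from the linked notebooks; it assumes you have a SavedModel at the
hypothetical path `my_saved_model`, and shows both dynamic range and float16
quantization.

```python
import tensorflow as tf

# Dynamic range quantization: weights are quantized to 8 bits at conversion time.
converter = tf.lite.TFLiteConverter.from_saved_model("my_saved_model")  # hypothetical path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model_dynamic_range.tflite", "wb") as f:  # hypothetical output file
    f.write(tflite_model)

# Float16 quantization instead: keep Optimize.DEFAULT and restrict supported types.
fp16_converter = tf.lite.TFLiteConverter.from_saved_model("my_saved_model")
fp16_converter.optimizations = [tf.lite.Optimize.DEFAULT]
fp16_converter.target_spec.supported_types = [tf.float16]
tflite_fp16_model = fp16_converter.convert()
```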

The following decision tree helps you select the quantization schemes you might
want to use for your model, based on the expected model size and accuracy.

![quantization-decision-tree](images/quantization_decision_tree.png)
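
If the decision tree points you toward full integer quantization, the converter
additionally needs a representative dataset to calibrate activation ranges. The
sketch below is illustrative only: it assumes an image model with a 224x224x3
float input at the hypothetical path `my_saved_model`, and uses random data as a
stand-in for real calibration samples.

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Yield on the order of 100 samples shaped like the model's input.
    # Replace the random data with real, unlabelled examples from your dataset.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("my_saved_model")  # hypothetical path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset

# Enforce integer-only ops so the model can run on accelerators such as the Edge TPU.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_int8_model = converter.convert()
```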

Below are the latency and accuracy results for post-training quantization and
quantization-aware training on a few models. All latency numbers are measured on
Pixel 2 devices using a single big core CPU. As the toolkit improves, so will
the numbers here:

<figure>
  <table>
    <tr>
      <th>Model</th>
      <th>Top-1 Accuracy (Original) </th>
      <th>Top-1 Accuracy (Post Training Quantized) </th>
      <th>Top-1 Accuracy (Quantization Aware Training) </th>
      <th>Latency (Original) (ms) </th>
      <th>Latency (Post Training Quantized) (ms) </th>
      <th>Latency (Quantization Aware Training) (ms) </th>
      <th> Size (Original) (MB)</th>
      <th> Size (Optimized) (MB)</th>
    </tr>
    <tr><td>Mobilenet-v1-1-224</td><td>0.709</td><td>0.657</td><td>0.70</td>
      <td>124</td><td>112</td><td>64</td><td>16.9</td><td>4.3</td></tr>
    <tr><td>Mobilenet-v2-1-224</td><td>0.719</td><td>0.637</td><td>0.709</td>
      <td>89</td><td>98</td><td>54</td><td>14</td><td>3.6</td></tr>
    <tr><td>Inception_v3</td><td>0.78</td><td>0.772</td><td>0.775</td>
      <td>1130</td><td>845</td><td>543</td><td>95.7</td><td>23.9</td></tr>
    <tr><td>Resnet_v2_101</td><td>0.770</td><td>0.768</td><td>N/A</td>
      <td>3973</td><td>2868</td><td>N/A</td><td>178.3</td><td>44.9</td></tr>
  </table>
  <figcaption>
    <b>Table 1</b> Benefits of model quantization for select CNN models
  </figcaption>
</figure>

### Full integer quantization with int16 activations and int8 weights

[Quantization with int16 activations](https://www.tensorflow.org/model_optimization/guide/quantization/post_training)
is a full integer quantization scheme with activations in int16 and weights in
int8. This mode can improve accuracy of the quantized model in comparison to the
full integer quantization scheme with both activations and weights in int8,
while keeping a similar model size. It is recommended when activations are
sensitive to quantization.

<i>NOTE:</i> Currently only non-optimized reference kernel implementations are
available in TFLite for this quantization scheme, so by default the performance
will be slow compared to int8 kernels. The full advantages of this mode can
currently be accessed via specialized hardware, or custom software.
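
To try this mode, the converter is configured much like full integer
quantization but with the int16x8 ops set. This is a minimal sketch under the
same assumptions as above (hypothetical model path, random stand-in calibration
data); adding `tf.lite.OpsSet.TFLITE_BUILTINS` to `supported_ops` would let any
op without an int16x8 implementation fall back to float.

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Replace with real, unlabelled samples matching your model's input shape.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("my_saved_model")  # hypothetical path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8
]
tflite_16x8_model = converter.convert()
```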

Below are the accuracy results for some models that benefit from this mode.

<figure>
  <table>
    <tr>
      <th>Model</th>
      <th>Accuracy metric type </th>
      <th>Accuracy (float32 activations) </th>
      <th>Accuracy (int8 activations) </th>
      <th>Accuracy (int16 activations) </th>
    </tr>
    <tr><td>Wav2letter</td><td>WER</td><td>6.7%</td><td>7.7%</td>
      <td>7.2%</td></tr>
    <tr><td>DeepSpeech 0.5.1 (unrolled)</td><td>CER</td><td>6.13%</td><td>43.67%</td>
      <td>6.52%</td></tr>
    <tr><td>YoloV3</td><td>mAP(IOU=0.5)</td><td>0.577</td><td>0.563</td>
      <td>0.574</td></tr>
    <tr><td>MobileNetV1</td><td>Top-1 Accuracy</td><td>0.7062</td><td>0.694</td>
      <td>0.6936</td></tr>
    <tr><td>MobileNetV2</td><td>Top-1 Accuracy</td><td>0.718</td><td>0.7126</td>
      <td>0.7137</td></tr>
    <tr><td>MobileBert</td><td>F1(Exact match)</td><td>88.81(81.23)</td><td>2.08(0)</td>
      <td>88.73(81.15)</td></tr>
  </table>
  <figcaption>
    <b>Table 2</b> Benefits of model quantization with int16 activations
  </figcaption>
</figure>

### Pruning

[Pruning](https://www.tensorflow.org/model_optimization/guide/pruning) works by
removing parameters within a model that have only a minor impact on its
predictions. Pruned models are the same size on disk, and have the same runtime
latency, but can be compressed more effectively. This makes pruning a useful
technique for reducing model download size.

In the future, TensorFlow Lite will provide latency reduction for pruned models.
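
A minimal sketch of applying pruning with the Model Optimization Toolkit is
shown below. The Keras `model`, training data, and schedule values are
placeholders, not part of this guide: the model is fine-tuned while sparsity is
ramped up, and the pruning wrappers are stripped before export so the saved
model keeps the sparse weights.

```python
import tensorflow_model_optimization as tfmot

# `model`, `train_images`, and `train_labels` are assumed to exist already.
pruning_params = {
    "pruning_schedule": tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0,   # illustrative values
        final_sparsity=0.5,
        begin_step=0,
        end_step=1000,
    )
}

model_for_pruning = tfmot.sparsity.keras.prune_low_magnitude(model, **pruning_params)
model_for_pruning.compile(optimizer="adam",
                          loss="sparse_categorical_crossentropy",
                          metrics=["accuracy"])

# Fine-tune; UpdatePruningStep keeps the pruning schedule in sync with training.
model_for_pruning.fit(train_images, train_labels, epochs=2,
                      callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Remove the pruning wrappers before exporting or compressing the model.
final_model = tfmot.sparsity.keras.strip_pruning(model_for_pruning)
```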

### Clustering

[Clustering](https://www.tensorflow.org/model_optimization/guide/clustering)
works by grouping the weights of each layer in a model into a predefined number
of clusters, then sharing the centroid values for the weights belonging to each
individual cluster. This reduces the number of unique weight values in a model,
thus reducing its complexity.

As a result, clustered models can be compressed more effectively, providing
deployment benefits similar to pruning.
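
As with pruning, clustering is applied through the Model Optimization Toolkit
and is typically followed by a short fine-tuning pass. The sketch below is
illustrative; the Keras `model` and the number of clusters are assumptions, and
fewer clusters generally compress better at a greater risk to accuracy.

```python
import tensorflow_model_optimization as tfmot

# `model` is an existing trained Keras model (assumption).
clustering_params = {
    "number_of_clusters": 16,  # illustrative value
    "cluster_centroids_init": tfmot.clustering.keras.CentroidInitialization.LINEAR,
}

clustered_model = tfmot.clustering.keras.cluster_weights(model, **clustering_params)

# Fine-tune `clustered_model` as usual, then strip the clustering wrappers so the
# exported model keeps only the shared centroid values.
final_model = tfmot.clustering.keras.strip_clustering(clustered_model)
```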

## Development workflow

As a starting point, check if the models in
[hosted models](../guide/hosted_models.md) can work for your application. If
not, we recommend starting with the
[post-training quantization tool](post_training_quantization.md) since this is
broadly applicable and does not require training data.

For cases where the accuracy and latency targets are not met, or hardware
accelerator support is important,
[quantization-aware training](https://www.tensorflow.org/model_optimization/guide/quantization/training){:.external}
is the better option. See additional optimization techniques under the
[TensorFlow Model Optimization Toolkit](https://www.tensorflow.org/model_optimization).
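
When you move to quantization-aware training, the Model Optimization Toolkit
wraps an existing Keras model so that training simulates int8 precision. A
minimal sketch, assuming you already have a trained float `model` and training
data (`train_images`, `train_labels` are placeholders):

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Wrap the float model with fake-quantization ops so training adapts to int8.
q_aware_model = tfmot.quantization.keras.quantize_model(model)
q_aware_model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
q_aware_model.fit(train_images, train_labels, epochs=1)  # short fine-tuning pass

# Convert the quantization-aware model to an actually quantized TFLite model.
converter = tf.lite.TFLiteConverter.from_keras_model(q_aware_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_qat_model = converter.convert()
```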

If you want to further reduce your model size, you can try [pruning](#pruning)
and/or [clustering](#clustering) prior to quantizing your models.