# Model optimization

The *TensorFlow Model Optimization Toolkit* minimizes the complexity of
optimizing machine learning models for inference. Inference efficiency is a
critical concern when deploying models to mobile devices because of
constraints on model size, latency, and power consumption.

Computational demand for *training* grows with the number of models trained
on different architectures, whereas the computational demand for *inference*
grows in proportion to the number of users.

## Use cases

Model optimization is useful for:

* Deploying models to edge devices with restrictions on processing, memory,
  or power consumption. For example, mobile and Internet of Things (IoT)
  devices.
* Reducing the payload size for over-the-air model updates.
* Executing on hardware constrained to fixed-point operations.
* Optimizing models for special-purpose hardware accelerators.

## Optimization methods

Model optimization uses multiple techniques:

* Reducing the parameter count with pruning and structured pruning.
* Reducing representational precision with quantization.
* Updating the original model topology to a more efficient one with reduced
  parameters or faster execution, for example with tensor decomposition
  methods and distillation.

We support quantization and are working to add support for other techniques.

## Model quantization

Quantizing deep neural networks uses techniques that allow for reduced-precision
representations of weights and, optionally, activations for both storage and
computation. Quantization provides several benefits:

* Support on existing CPU platforms.
* Quantizing activations reduces the memory access costs for reading and
  storing intermediate activations.
* Many CPU and hardware accelerator implementations provide SIMD instruction
  capabilities, which are especially beneficial for quantization.

TensorFlow Lite provides several levels of support for quantization.

* [Post-training quantization](post_training_quantization.md) quantizes
  weights and activations after training and is very easy to use (see the
  sketch after this list).
* [Quantization-aware training](https://github.com/tensorflow/tensorflow/tree/r1.13/tensorflow/contrib/quantize){:.external}
  allows for training networks that can be quantized with minimal accuracy
  drop, and is only available for a subset of convolutional neural network
  architectures.
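As a minimal sketch of the post-training path, the converter can be asked to
quantize a model during conversion. The `my_saved_model/` directory below is a
placeholder for your own exported SavedModel; recent releases expose the
`optimizations` flag shown here, while some older 1.x releases use
`converter.post_training_quantize = True` instead:

```python
import tensorflow as tf

# Load a trained model for conversion. The "my_saved_model/" path is a
# placeholder; point it at your own exported SavedModel directory.
converter = tf.lite.TFLiteConverter.from_saved_model("my_saved_model/")

# Apply the converter's default optimizations, which quantize the weights
# to 8-bit integers at conversion time.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

tflite_quantized_model = converter.convert()

# Write the quantized flatbuffer to disk for deployment.
with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_quantized_model)
```

No retraining or training data is involved, which is what makes this path easy
to use; see [Post-training quantization](post_training_quantization.md) for
the authoritative walkthrough.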
### Latency and accuracy results

Below are the latency and accuracy results for post-training quantization and
quantization-aware training on a few models. All latency numbers are measured
on Pixel 2 devices using a single big core. As the toolkit improves, so will
the numbers here:

<figure>
  <table>
    <tr>
      <th>Model</th>
      <th>Top-1 Accuracy (Original)</th>
      <th>Top-1 Accuracy (Post-Training Quantized)</th>
      <th>Top-1 Accuracy (Quantization-Aware Training)</th>
      <th>Latency (Original) (ms)</th>
      <th>Latency (Post-Training Quantized) (ms)</th>
      <th>Latency (Quantization-Aware Training) (ms)</th>
      <th>Size (Original) (MB)</th>
      <th>Size (Optimized) (MB)</th>
    </tr>
    <tr>
      <td>Mobilenet-v1-1-224</td><td>0.709</td><td>0.657</td><td>0.70</td>
      <td>124</td><td>112</td><td>64</td><td>16.9</td><td>4.3</td>
    </tr>
    <tr>
      <td>Mobilenet-v2-1-224</td><td>0.719</td><td>0.637</td><td>0.709</td>
      <td>89</td><td>98</td><td>54</td><td>14</td><td>3.6</td>
    </tr>
    <tr>
      <td>Inception_v3</td><td>0.78</td><td>0.772</td><td>0.775</td>
      <td>1130</td><td>845</td><td>543</td><td>95.7</td><td>23.9</td>
    </tr>
    <tr>
      <td>Resnet_v2_101</td><td>0.770</td><td>0.768</td><td>N/A</td>
      <td>3973</td><td>2868</td><td>N/A</td><td>178.3</td><td>44.9</td>
    </tr>
  </table>
  <figcaption>
    <b>Table 1</b> Benefits of model quantization for select CNN models
  </figcaption>
</figure>

## Choice of quantization tool

As a starting point, check if the models in [hosted models](../guide/hosted_models.md)
can work for your application. If not, we recommend that users start with the
[post-training quantization tool](post_training_quantization.md), since it is
broadly applicable and does not require training data. For cases where the
accuracy and latency targets are not met, or hardware accelerator support is
important, [quantization-aware training](https://github.com/tensorflow/tensorflow/tree/r1.13/tensorflow/contrib/quantize){:.external}
is the better option (a minimal sketch of its graph rewrites follows below).

Note: Quantization-aware training supports a subset of convolutional neural
network architectures.
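For orientation, here is a sketch of the quantization-aware training rewrites
from the `tf.contrib.quantize` package linked above. It assumes a TF 1.x
environment and a hypothetical user-supplied `model_fn` that builds the
network; the rewrites insert fake-quantization ops so training simulates
8-bit inference:

```python
import tensorflow as tf

# Build the ordinary float training graph first. `model_fn` is a hypothetical
# user-supplied function that constructs the network and returns its loss.
train_graph = tf.Graph()
with train_graph.as_default():
    loss = model_fn(is_training=True)

    # Rewrite the graph to insert fake-quantization ops on weights and
    # activations. `quant_delay` defers quantization for the first training
    # steps so optimization begins from stable float behavior.
    tf.contrib.quantize.create_training_graph(input_graph=train_graph,
                                              quant_delay=2000000)

    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

# For export, build an inference graph and apply the eval rewrite. The result
# can be frozen and converted to a fully quantized TensorFlow Lite model.
eval_graph = tf.Graph()
with eval_graph.as_default():
    logits = model_fn(is_training=False)
    tf.contrib.quantize.create_eval_graph(input_graph=eval_graph)
```

Because the rewrites only recognize certain layer patterns, this path is
limited to the subset of convolutional architectures noted above; consult the
contrib/quantize documentation for the supported set.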