# Post-training quantization

Post-training quantization is a general technique to reduce model size while also
providing up to 3x lower latency with little degradation in model accuracy. Post-training
quantization quantizes weights from floating point to 8 bits of precision. This technique
is enabled as an option in the [TensorFlow Lite converter](../convert/):

```
import tensorflow as tf

# Load the SavedModel and enable the size-focused optimization flag.
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_quant_model = converter.convert()
```

At inference, weights are converted from 8 bits of precision to floating point and
computed using floating-point kernels. This conversion is done once and cached to reduce latency.

To further improve latency, hybrid operators dynamically quantize activations to 8 bits and
perform computations with 8-bit weights and activations. This optimization provides latencies
close to fully fixed-point inference. However, the outputs are still stored using
floating point, so the speedup with hybrid ops is less than that of a full fixed-point
computation. Hybrid ops are available for the most compute-intensive operators in a network:

* [tf.contrib.layers.fully_connected](https://www.tensorflow.org/api_docs/python/tf/contrib/layers/fully_connected)
* [tf.nn.conv2d](https://www.tensorflow.org/api_docs/python/tf/nn/conv2d)
* [tf.nn.embedding_lookup](https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup)
* [BasicRNN](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/BasicRNNCell)
* [tf.nn.bidirectional_dynamic_rnn for BasicRNNCell type](https://www.tensorflow.org/api_docs/python/tf/nn/bidirectional_dynamic_rnn)
* [tf.nn.dynamic_rnn for LSTM and BasicRNN cell types](https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn)

Since weights are quantized post training, there could be an accuracy loss, particularly for
smaller networks. Pre-trained fully quantized models are provided for specific networks in
the [TensorFlow Lite model repository](../models/). It is important to check the accuracy of the
quantized model to verify that any degradation in accuracy is within acceptable limits. There is
a tool to evaluate [TensorFlow Lite model accuracy](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tools/accuracy/README.md){:.external}.

If the accuracy drop is too high, consider using [quantization aware training](https://github.com/tensorflow/tensorflow/tree/r1.13/tensorflow/contrib/quantize){:.external}.

### Representation for quantized tensors

TensorFlow approaches the conversion of floating-point arrays of numbers into
8-bit representations as a compression problem, since the weights and activation
tensors in trained neural network models tend to have values that are distributed
across comparatively small ranges (e.g. -15 to +15 for weights or -500 to
1000 for image model activations).

Since neural networks tend to be robust at handling noise, the error introduced
by quantizing to a small set of values keeps the precision of the overall
results within an acceptable threshold. A chosen representation must also allow
fast calculations, especially the large matrix multiplications that comprise
the bulk of the computation when running a model.

The range is represented with two floats that store the overall minimum and maximum
values, corresponding to the lowest and highest quantized values. Each entry in the
quantized array represents a float value in that range, distributed linearly
between the minimum and maximum.
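To make this mapping concrete, here is a minimal NumPy sketch of min/max quantization. It is
illustrative only; the helper names are ours, and the actual TensorFlow Lite kernels are more
involved:

```
import numpy as np

def quantize(values, range_min, range_max):
    """Map floats in [range_min, range_max] linearly onto the integers 0..255."""
    scale = (range_max - range_min) / 255.0
    quantized = np.round((values - range_min) / scale)
    return np.clip(quantized, 0, 255).astype(np.uint8), scale

def dequantize(quantized, range_min, scale):
    """Recover approximate float values from the 8-bit representation."""
    return quantized.astype(np.float32) * scale + range_min

weights = np.array([-10.0, 0.0, 12.5, 30.0], dtype=np.float32)
q, scale = quantize(weights, weights.min(), weights.max())
print(q)                                    # [  0  64 143 255]
print(dequantize(q, weights.min(), scale))  # [-10.  0.04  12.43  30.] approximately
```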
With our post-training quantization tooling, we use symmetric quantization for
our weights, meaning we expand the represented range and force the min and max
to be the negative of each other.

For example, with an overall minimum of -10.0 and a maximum of 30.0, we instead
represent a minimum of -30.0 and a maximum of 30.0. In an 8-bit array, the
quantized values would be represented as follows:

<figure>
  <table>
    <tr><th>Quantized</th><th>Float</th></tr>
    <tr><td>-42</td><td>-10.0</td></tr>
    <tr><td>0</td><td>0</td></tr>
    <tr><td>127</td><td>30.0</td></tr>
    <tr><td>-127</td><td>-30.0 (this value does not ever show up)</td></tr>
  </table>
  <figcaption>
    <b>Table 2</b>: Quantized value range example
  </figcaption>
</figure>

The advantages of this representation format are:

* It efficiently represents ranges of arbitrary magnitude.
* The linear spread makes multiplications straightforward.
* A symmetric range for weights enables downstream hardware optimizations.
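As a quick check of the symmetric example in Table 2, the sketch below (again with illustrative
helper names, and again NumPy rather than the actual TensorFlow Lite implementation) derives the
scale from the largest absolute value and reproduces the quantized values above:

```
import numpy as np

def quantize_symmetric(values):
    """Symmetric 8-bit quantization: the scale comes from the largest absolute
    value, which forces the represented min and max to be negatives of each other."""
    max_abs = np.abs(values).max()   # 30.0 for the [-10.0, 30.0] example above
    scale = max_abs / 127.0
    return np.round(values / scale).astype(np.int8), scale

weights = np.array([-10.0, 0.0, 30.0], dtype=np.float32)
q, scale = quantize_symmetric(weights)
print(q)          # [-42   0 127], matching Table 2
print(q * scale)  # [-9.92  0.  30.], the approximate dequantized values
```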