# Post-training quantization

Post-training quantization is a general technique that reduces model size while also
providing up to 3x lower latency with little degradation in model accuracy. It
quantizes weights from floating point to 8 bits of precision. The technique
is enabled as an option in the [TensorFlow Lite converter](../convert/):

```
import tensorflow as tf

# Convert a SavedModel, enabling weight quantization via the size optimization flag.
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_quant_model = converter.convert()
```

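The converted model is returned as a flatbuffer byte string. A common follow-up step,
with an illustrative filename, is to write it to disk for deployment:

```
# Write the quantized flatbuffer to disk; the path is an example.
with open("model_quant.tflite", "wb") as f:
    f.write(tflite_quant_model)
```
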
At inference, weights are converted from 8 bits of precision to floating point and
computed using floating-point kernels. This conversion is done once and cached to reduce latency.

To further improve latency, hybrid operators dynamically quantize activations to 8 bits and
perform computations with 8-bit weights and activations. This optimization provides latencies
close to fully fixed-point inference. However, the outputs are still stored using
floating point, so the speedup with hybrid ops is less than that of a full fixed-point computation.
Hybrid ops are available for the most compute-intensive operators in a network
(see the sketch after the list below):

*  [tf.contrib.layers.fully_connected](https://www.tensorflow.org/api_docs/python/tf/contrib/layers/fully_connected)
*  [tf.nn.conv2d](https://www.tensorflow.org/api_docs/python/tf/nn/conv2d)
*  [tf.nn.embedding_lookup](https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup)
*  [BasicRNN](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/BasicRNNCell)
*  [tf.nn.bidirectional_dynamic_rnn for BasicRNNCell type](https://www.tensorflow.org/api_docs/python/tf/nn/bidirectional_dynamic_rnn)
*  [tf.nn.dynamic_rnn for LSTM and BasicRNN Cell types](https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn)
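The sketch below illustrates the idea behind a hybrid op in plain NumPy. It is a
simplified illustration of dynamic-range quantization, not TensorFlow Lite's actual
kernel: weights are quantized once ahead of time, activations are quantized on the
fly from their observed range, and the integer accumulator result is rescaled back
to float:

```
import numpy as np

def quantize_symmetric(x):
    # Pick the scale from the observed range (done at runtime for activations).
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

# Weights are quantized once, post training.
w_q, w_scale = quantize_symmetric(np.random.randn(64, 32).astype(np.float32))

# Activations arrive at inference time and are quantized dynamically.
a_q, a_scale = quantize_symmetric(np.random.randn(1, 64).astype(np.float32))

# 8-bit values multiply into a 32-bit accumulator, then rescale to float output.
acc = a_q.astype(np.int32) @ w_q.astype(np.int32)
y = acc.astype(np.float32) * (a_scale * w_scale)
```
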
Since weights are quantized after training, there can be an accuracy loss, particularly for
smaller networks. Pre-trained fully quantized models are provided for specific networks in
the [TensorFlow Lite model repository](../models/). It is important to check the accuracy of
the quantized model to verify that any degradation in accuracy is within acceptable limits.
There is a tool to evaluate [TensorFlow Lite model accuracy](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tools/accuracy/README.md){:.external}.

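As a quick sanity check before reaching for the full evaluation tool, you can run the
quantized model directly with the `tf.lite.Interpreter` Python API and compare its
outputs against the float model on the same inputs. This is a minimal sketch; the
random input stands in for real evaluation data:

```
import numpy as np
import tensorflow as tf

# Load the quantized flatbuffer produced by the converter above.
interpreter = tf.lite.Interpreter(model_content=tflite_quant_model)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# Placeholder input; substitute representative evaluation data.
data = np.random.randn(*input_details["shape"]).astype(np.float32)
interpreter.set_tensor(input_details["index"], data)
interpreter.invoke()
quant_output = interpreter.get_tensor(output_details["index"])
# Compare quant_output against the float model's output on the same data.
```
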
If the accuracy drop is too high, consider using [quantization-aware training](https://github.com/tensorflow/tensorflow/tree/r1.13/tensorflow/contrib/quantize){:.external}.

### Representation for quantized tensors

TensorFlow approaches the conversion of floating-point arrays of numbers into
8-bit representations as a compression problem. The weights and activation
tensors in trained neural network models tend to have values that are distributed
across comparatively small ranges (for example, -15 to +15 for weights or -500 to
1000 for image model activations).

Since neural networks tend to be robust to noise, the error introduced
by quantizing to a small set of values keeps the precision of the overall
results within an acceptable threshold. A chosen representation must also support
fast calculations, especially the large matrix multiplications that make up
the bulk of the computation when running a model.

The quantized representation stores two floats that record the overall minimum and maximum
values corresponding to the lowest and highest quantized values. Each entry in the
quantized array represents a float value in that range, distributed linearly
between the minimum and maximum.

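As an illustration, here is a minimal sketch of this scheme, assuming an asymmetric
8-bit (uint8) encoding with 256 linear steps between the stored minimum and maximum:

```
import numpy as np

def quantize(x, range_min, range_max):
    # Map floats in [range_min, range_max] linearly onto the integers 0..255.
    scale = (range_max - range_min) / 255.0
    return np.clip(np.round((x - range_min) / scale), 0, 255).astype(np.uint8)

def dequantize(q, range_min, range_max):
    # Recover the approximate float value from its 8-bit code.
    scale = (range_max - range_min) / 255.0
    return range_min + q.astype(np.float32) * scale
```
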
With our post-training quantization tooling, we use symmetric quantization for
weights, meaning we expand the represented range and force the minimum and maximum
to be the negatives of each other.

For example, with an overall minimum of -10.0 and a maximum
of 30.0, we instead represent a minimum of -30.0 and a maximum of 30.0. In an
8-bit array, the quantized values would be represented as follows:

<figure>
  <table>
    <tr><th>Quantized</th><th>Float</th></tr>
    <tr><td>-42</td><td>-10.0</td></tr>
    <tr><td>0</td><td>0</td></tr>
    <tr><td>127</td><td>30.0</td></tr>
    <tr><td>-127</td><td>-30.0 (this value never shows up, since the actual minimum is -10.0)</td></tr>
  </table>
  <figcaption>
    <b>Table 2</b>: Quantized value range example
  </figcaption>
</figure>

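These values follow from a single scale factor that maps the symmetric float range
[-30.0, 30.0] onto the integer range [-127, 127]. A quick check of the table:

```
scale = 30.0 / 127.0           # float units per quantized step
print(round(-10.0 / scale))    # -42
print(round(30.0 / scale))     # 127
print(round(-30.0 / scale))    # -127; below the actual minimum of -10.0, so unused
```
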
The advantages of this representation format are:

* It can efficiently represent ranges of arbitrary magnitude.
* The linear spread makes multiplications straightforward.
* A symmetric range for weights enables downstream hardware optimizations.