# TFLite Quantize Weights Tool

## Recommended usage

The Quantize Weights transformation is integrated with
[tflite_convert](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/g3doc/convert/cmdline_reference.md#transformation-flags).

The recommended way of invoking this tool is by simply adding the
`--post_training_quantize` flag to your original tflite_convert invocation. For
example,

```
tflite_convert \
  --output_file=/tmp/foo.tflite \
  --saved_model_dir=/tmp/saved_model \
  --post_training_quantize
```

## Overview

The Quantize Weights tool provides a simple way to quantize the weights for a
float TFLite model.

TODO(raghuramank): Add link to weight quantization tutorial.

### Size reduction

float32 weights will be converted to 8 bit integers. This results in a model
that is around 1/4th the size of the original model.

### Latency reduction

TFLite also has "hybrid" kernels implemented for many operations. These "hybrid"
kernels take 8 bit integer weights and float inputs, dynamically quantize the
input tensors (based on each input tensor's min and max elements), and do the
computations using the 8 bit integer values. This results in a 2-4x reduction in
latency for "hybrid" kernels. In this mode the inference type is still FLOAT
since the inputs and outputs of each operation are still float.
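
For intuition, a rough sketch of the dynamic input quantization step is shown
below. This is not the actual TFLite kernel code; the symmetric 8 bit scheme and
the helper name `DynamicallyQuantize` are assumptions made for illustration.

```
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sketch: map a float input tensor to 8 bit integers using a
// scale derived from the tensor's min and max, as a hybrid kernel would do at
// runtime before its integer computation.
void DynamicallyQuantize(const std::vector<float>& input,
                         std::vector<int8_t>* quantized, float* scale) {
  if (input.empty()) return;
  const auto minmax = std::minmax_element(input.begin(), input.end());
  // Use a symmetric range so that 0.0f maps exactly to integer 0.
  const float abs_max =
      std::max(std::fabs(*minmax.first), std::fabs(*minmax.second));
  *scale = abs_max > 0.0f ? abs_max / 127.0f : 1.0f;
  quantized->resize(input.size());
  for (std::size_t i = 0; i < input.size(); ++i) {
    const float value = std::round(input[i] / *scale);
    (*quantized)[i] =
        static_cast<int8_t>(std::max(-127.0f, std::min(127.0f, value)));
  }
}
```

The integer results are then rescaled back to float using this runtime scale
together with the stored weight scale, which is why the operation's inputs and
outputs remain float.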

For operations that do not yet have "hybrid" kernels implemented, we introduce a
Dequantize operation after the 8 bit integer weights. This converts the weights
back to float32 during inference so that the original float32 kernels can run.
Since we cache dequantized results, the result of each of these dequantized
paths will be on par with the original float model.
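
Conceptually, the inserted Dequantize op just inverts the affine quantization of
the weights. A minimal sketch is below; the helper name `Dequantize` and the
per-tensor scale/zero-point scheme are assumptions made for illustration.

```
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sketch: recover approximate float32 weights from 8 bit values so
// that the original float32 kernel can consume them.
std::vector<float> Dequantize(const std::vector<int8_t>& quantized_weights,
                              float scale, int32_t zero_point) {
  std::vector<float> weights(quantized_weights.size());
  for (std::size_t i = 0; i < quantized_weights.size(); ++i) {
    weights[i] =
        scale * (static_cast<int32_t>(quantized_weights[i]) - zero_point);
  }
  return weights;
}
```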

TODO(yunluli): Fill in latency results from latency experiments.

### Accuracy

Since this technique quantizes weights after the model has already been trained,
there can be accuracy drops depending on the model. For common CNN networks, the
observed accuracy drops are small and can be seen below.

TODO(yunluli): Fill in accuracy results from accuracy experiments.

## Direct usage

Quantize Weights can also be invoked directly from C++ when you have a float
`::tflite::Model` that you want to convert. You must provide a
`flatbuffers::FlatBufferBuilder`, which owns the underlying buffer of the
created model. Here is an example invocation:

```
// The float input model; how it is obtained is left out here.
const ::tflite::Model* input_model = ...;
// The builder owns the buffer that will hold the quantized output model.
flatbuffers::FlatBufferBuilder builder;
TfLiteStatus status = ::tflite::optimize::QuantizeWeights(&builder, input_model);
CHECK_EQ(status, kTfLiteOk);
const uint8_t* buffer = builder.GetBufferPointer();
const ::tflite::Model* output_model = ::tflite::GetModel(buffer);
```
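
If you then want to persist the quantized model, the builder's buffer can be
written out with the standard FlatBuffers accessors. A minimal sketch follows;
the output path is only an example, and it assumes the `builder` from the
snippet above plus `#include <fstream>`.

```
// Write the quantized model owned by `builder` to disk.
std::ofstream output("/tmp/foo_quantized.tflite", std::ios::binary);
output.write(reinterpret_cast<const char*>(builder.GetBufferPointer()),
             builder.GetSize());
```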