# TFLite Quantize Weights Tool

## Recommended usage

The Quantize Weights transformation is integrated with
[tflite_convert](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/g3doc/convert/cmdline_reference.md#transformation-flags).

The recommended way of invoking this tool is by simply adding the
`--post_training_quantize` flag to your original tflite_convert invocation. For
example:

```
tflite_convert \
  --output_file=/tmp/foo.tflite \
  --saved_model_dir=/tmp/saved_model \
  --post_training_quantize
```

## Overview

The Quantize Weights tool provides a simple way to quantize the weights of a
float TFLite model.

TODO(raghuramank): Add link to weight quantization tutorial.

### Size reduction

float32 weights are converted to 8 bit integers. This results in a model that
is around 1/4 the size of the original model.

### Latency reduction

TFLite also has "hybrid" kernels implemented for many operations. These
"hybrid" kernels take 8 bit integer weights and float inputs, dynamically
quantize the input tensor (based on its min and max elements), and perform the
computation using the 8 bit integer values. This results in a 2-4x reduction in
latency for "hybrid" kernels. In this mode the inference type is still FLOAT,
since the inputs and outputs of each operation are still float.

For operations that do not yet have "hybrid" kernels implemented, we introduce
a Dequantize operation after the 8 bit integer weights. It converts the weights
back to float32 during inference so that the original float32 kernels can run.
Since we cache the dequantized results, the performance of each such
dequantized path will be on par with the original float model.

TODO(yunluli): Fill in latency results from latency experiments.

### Accuracy

Since this technique quantizes weights after the model has already been
trained, there can be accuracy drops depending on the model.
For common CNN networks, the observed accuracy drops are small and can be seen
below.

TODO(yunluli): Fill in accuracy results from accuracy experiments.

## Direct usage

One can also invoke Quantize Weights directly via C++ on a float
`::tflite::Model` to be converted. The caller must provide a
`flatbuffers::FlatBufferBuilder`, which owns the underlying buffer of the
created model. Here is an example invocation:

```
::tflite::Model* input_model = ...;
flatbuffers::FlatBufferBuilder builder;
TfLiteStatus status = ::tflite::optimize::QuantizeWeights(&builder, input_model);
CHECK(status == kTfLiteOk);
const uint8_t* buffer = builder.GetBufferPointer();
const tflite::Model* output_model = ::tflite::GetModel(buffer);
```