# XNNPACK backend for TensorFlow Lite

XNNPACK is a highly optimized library of neural network inference operators for
ARM, x86, and WebAssembly architectures in Android, iOS, Windows, Linux, macOS,
and Emscripten environments. This document describes how to use the XNNPACK
library as an inference engine for TensorFlow Lite.

## Using XNNPACK engine with TensorFlow Lite interpreter

XNNPACK integrates with the TensorFlow Lite interpreter through the delegation
mechanism. TensorFlow Lite supports several methods to enable XNNPACK
for floating-point inference.

### Enable XNNPACK via Java API on Android (recommended on Android)

Pre-built [nightly TensorFlow Lite binaries for Android](https://www.tensorflow.org/lite/guide/android#use_the_tensorflow_lite_aar_from_mavencentral)
include XNNPACK, although it is disabled by default. Use the `setUseXNNPACK`
method in the `Interpreter.Options` class to enable it:

```java
Interpreter.Options interpreterOptions = new Interpreter.Options();
interpreterOptions.setUseXNNPACK(true);
Interpreter interpreter = new Interpreter(model, interpreterOptions);
```

### Enable XNNPACK via Swift/Objective-C API on iOS (recommended on iOS)

Pre-built [nightly TensorFlow Lite CocoaPods](https://www.tensorflow.org/lite/guide/ios#specifying_versions)
include XNNPACK, but do not enable it by default. Swift developers can use the
`InterpreterOptions` object to enable XNNPACK:

```swift
var options = InterpreterOptions()
options.isXNNPackEnabled = true
var interpreter = try Interpreter(modelPath: "model/path", options: options)
```

Objective-C developers can enable XNNPACK via a new property in the
`TFLInterpreterOptions` class:

```objc
TFLInterpreterOptions *options = [[TFLInterpreterOptions alloc] init];
options.useXNNPACK = YES;
NSError *error;
TFLInterpreter *interpreter =
    [[TFLInterpreter alloc] initWithModelPath:@"model/path"
                                      options:options
                                        error:&error];
```

### Enable XNNPACK via Bazel build flags (recommended on desktop)

When building TensorFlow Lite with Bazel, add
`--define tflite_with_xnnpack=true`, and the TensorFlow Lite interpreter will
use the XNNPACK engine by default.

The exact command depends on the target platform, e.g. for the Android AAR you
would use

```
bazel build -c opt --fat_apk_cpu=x86,x86_64,arm64-v8a,armeabi-v7a \
  --host_crosstool_top=@bazel_tools//tools/cpp:toolchain \
  --define tflite_with_xnnpack=true \
  //tensorflow/lite/java:tensorflow-lite
```

Note that in this case the `Interpreter::SetNumThreads` invocation does not take
effect on the number of threads used by the XNNPACK engine. In order to specify
the number of threads available for the XNNPACK engine you should manually pass
the value when constructing the interpreter. The snippet below illustrates this,
assuming you are using `InterpreterBuilder` to construct the interpreter:

```c++
// Load model
tflite::Model* model;
...

// Construct the interpreter
tflite::ops::builtin::BuiltinOpResolver resolver;
std::unique_ptr<tflite::Interpreter> interpreter;

TfLiteStatus res =
    tflite::InterpreterBuilder(model, resolver)(&interpreter, num_threads);
```

**XNNPACK engine used by TensorFlow Lite interpreter uses a single thread for
inference by default.**

### Enable XNNPACK via additional dependency

Another way to enable XNNPACK is to build and link the
`//tensorflow/lite:tflite_with_xnnpack` target into your application alongside
the TensorFlow Lite framework.

This method works on platforms which support POSIX-style weak symbols (Android,
iOS, Linux, Mac, but **NOT** Windows).
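
With Bazel, this amounts to adding the target to your application's
dependencies. A minimal sketch, assuming a Bazel-built application (the target
and source file names are hypothetical):

```
# Application BUILD file (target and file names are hypothetical).
cc_binary(
    name = "my_app",
    srcs = ["my_app.cc"],
    deps = [
        "//tensorflow/lite:framework",
        # Linking this target enables XNNPACK in the default interpreter
        # through weak-symbol overrides; no code changes are required.
        "//tensorflow/lite:tflite_with_xnnpack",
    ],
)
```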

### Enable XNNPACK via low-level delegate API (not recommended)

While it is possible to use the low-level delegate API to enable XNNPACK, this
method is **NOT RECOMMENDED** unless you need to use TensorFlow Lite both with
and without XNNPACK (e.g. for benchmarking).

With the low-level delegate API, users create an XNNPACK delegate with the
`TfLiteXNNPackDelegateCreate` function, and then call
`Interpreter::ModifyGraphWithDelegate` to delegate supported parts of
the model to the XNNPACK delegate. The users must destroy the delegate with
`TfLiteXNNPackDelegateDelete` **after** releasing the TensorFlow Lite
interpreter. The snippet below illustrates the typical usage:

```c++
// Build the interpreter
std::unique_ptr<tflite::Interpreter> interpreter;
...

// IMPORTANT: initialize options with TfLiteXNNPackDelegateOptionsDefault() for
// API-compatibility with future extensions of the TfLiteXNNPackDelegateOptions
// structure.
TfLiteXNNPackDelegateOptions xnnpack_options =
    TfLiteXNNPackDelegateOptionsDefault();
xnnpack_options.num_threads = num_threads;

TfLiteDelegate* xnnpack_delegate =
    TfLiteXNNPackDelegateCreate(&xnnpack_options);
if (interpreter->ModifyGraphWithDelegate(xnnpack_delegate) != kTfLiteOk) {
  // Report error and fall back to another delegate, or the default backend.
}

...

// Run inference using XNNPACK
interpreter->Invoke();

...

// IMPORTANT: release the interpreter before destroying the delegate
interpreter.reset();
TfLiteXNNPackDelegateDelete(xnnpack_delegate);
```

### Using the XNNPACK weights cache

XNNPACK internally packs static weights for operations (like convolutions) in
order to make accessing weights more memory friendly. XNNPACK needs to allocate
memory internally to hold these packed weights. If you are starting multiple
TFLite interpreter instances based on the same model, there can be multiple
copies of the same packed weights in each instance. This can cause high memory
usage. The weights cache can be used to share packed weights between multiple
TFLite instances.

```c++
// Create 2 interpreters which share the same model.
std::unique_ptr<tflite::Interpreter> interpreter1;
std::unique_ptr<tflite::Interpreter> interpreter2;

// Create a weights cache that you can pass to the XNNPACK delegate.
TfLiteXNNPackDelegateWeightsCache* weights_cache =
    TfLiteXNNPackDelegateWeightsCacheCreate();

// Like using the low-level API above, initialize options, and pass this cache
// to the XNNPACK delegate via the options.
TfLiteXNNPackDelegateOptions xnnpack_options =
    TfLiteXNNPackDelegateOptionsDefault();
xnnpack_options.weights_cache = weights_cache;

// Modify graph with delegate, as above...
// Static weights will be packed and written into weights_cache.
TfLiteDelegate* delegate1 = TfLiteXNNPackDelegateCreate(&xnnpack_options);
if (interpreter1->ModifyGraphWithDelegate(delegate1) != kTfLiteOk) {
  // Report error.
}

// XNNPACK will reuse packed weights if they can be found in the weights cache.
TfLiteDelegate* delegate2 = TfLiteXNNPackDelegateCreate(&xnnpack_options);
if (interpreter2->ModifyGraphWithDelegate(delegate2) != kTfLiteOk) {
  // Report error.
}

// Finalize the weights cache.
// Hard finalization has the lowest memory overhead, but requires that all
// TFLite interpreter instances be created up front, before any finalization
// and inference.
TfLiteXNNPackDelegateWeightsCacheFinalizeHard(weights_cache);

// Alternatively, soft-finalize the weights cache. This is useful if more
// delegates using the same model will be created after finalization.
// TfLiteXNNPackDelegateWeightsCacheFinalizeSoft(weights_cache);

// Later, after all the interpreters and XNNPACK delegates using the cache are
// destroyed, release the weights cache.
TfLiteXNNPackDelegateWeightsCacheDelete(weights_cache);
```

The weights cache is a contents-based cache. Every time XNNPACK has to pack
weights, it first packs into a temporary buffer, then looks up whether the same
packed weights can be found in the weights cache, based on the contents of the
packed weights. If they are found, the packed weights in the cache are used for
subsequent operations, and the temporary buffer is freed. Otherwise, the packed
weights are added to the cache.

The weights cache has to be finalized before any inference; running inference
with an unfinalized cache is an error. The choice between hard and soft
finalization depends on whether new XNNPACK delegate instances will be created
after finalization. Hard finalization does not allow new instances to be
created, and has lower memory overhead. Soft finalization allows new instances
to be created, and has higher memory overhead (up to the size of the largest
packed weights, rounded up to page alignment).

## Profiling

When TfLite profiling is enabled, XNNPACK will time each operator and report the
results to TfLite, which will print them as part of the overall execution
profile.
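
One way to enable TfLite profiling is through the TFLite benchmark tool. The
command below is only a sketch: the tool, its flags, and the model path are
assumptions, not something described elsewhere in this document.

```
bazel run -c opt //tensorflow/lite/tools/benchmark:benchmark_model -- \
  --graph=/path/to/model.tflite \
  --use_xnnpack=true \
  --enable_op_profiling=true
```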

## Limitations and supported operators

XNNPACK delegate is a work-in-progress, and currently supports a limited set of
operators. Unsupported operators will fall back to the default implementations,
so models using a combination of supported and unsupported operators can still
benefit from XNNPACK delegate.

### Floating-Point (IEEE FP32) Operators

Below is the list of currently supported floating-point operators:

#### `ABS`

* Inputs and outputs must be in 32-bit floating-point format.

#### `ADD`

* Inputs and outputs must be in 32-bit floating-point format.
* Only addition with two inputs is supported.
* Fused `NONE`, `RELU`, `RELU_N1_TO_1`, and `RELU6` activations are supported,
  but fused `TANH` and `SIGN_BIT` activations are not.

#### `AVERAGE_POOL_2D`

* Inputs and outputs must be in 32-bit floating-point format.
* 1x1 pooling with non-unit stride is not supported.
* Fused `NONE`, `RELU`, `RELU_N1_TO_1`, and `RELU6` activations are supported,
  but fused `TANH` and `SIGN_BIT` activations are not.

#### `CEIL`

* Inputs and outputs must be in 32-bit floating-point format.

#### `CONCATENATION`

* Inputs and outputs must be in 32-bit floating-point format.
* Only concatenation with two, three, or four inputs is supported.

#### `CONV_2D`

* Inputs and outputs must be in 32-bit floating-point format.
* Bias is mandatory.
* Both filter and bias must be static (use `kTfLiteMmapRo` allocation type).
* Fused `NONE`, `RELU`, `RELU_N1_TO_1`, and `RELU6` activations are supported,
  but fused `TANH` and `SIGN_BIT` activations are not.

#### `DEPTH_TO_SPACE`

* Inputs and outputs must be in 32-bit floating-point format.
* Block size must be greater than 1.

#### `DEPTHWISE_CONV_2D`

* Inputs and outputs must be in 32-bit floating-point format.
* Bias is mandatory.
* Both filter and bias must be static (use `kTfLiteMmapRo` allocation type).
* Fused `NONE`, `RELU`, `RELU_N1_TO_1`, and `RELU6` activations are supported,
  but fused `TANH` and `SIGN_BIT` activations are not.

#### `DIV`

* Inputs and outputs must be in 32-bit floating-point format.
* Fused `NONE`, `RELU`, `RELU_N1_TO_1`, and `RELU6` activations are supported,
  but fused `TANH` and `SIGN_BIT` activations are not.

#### `ELU`

* Inputs and outputs must be in 32-bit floating-point format.

#### `FULLY_CONNECTED`

* Inputs and outputs must be in 32-bit floating-point format.
* Both filter and bias must be static (use `kTfLiteMmapRo` allocation type).
* Fused `NONE`, `RELU`, `RELU_N1_TO_1`, and `RELU6` activations are supported,
  but fused `TANH` and `SIGN_BIT` activations are not.

#### `FLOOR`

* Inputs and outputs must be in 32-bit floating-point format.

#### `HARD_SWISH`

* Inputs and outputs must be in 32-bit floating-point format.

#### `LEAKY_RELU`

* Inputs and outputs must be in 32-bit floating-point format.

#### `LOGISTIC`

* Inputs and outputs must be in 32-bit floating-point format.

#### `MAX_POOL_2D`

* Inputs and outputs must be in 32-bit floating-point format.
* 1x1 pooling with non-unit stride is not supported.
* Fused `NONE`, `RELU`, `RELU_N1_TO_1`, and `RELU6` activations are supported,
  but fused `TANH` and `SIGN_BIT` activations are not.

#### `MAXIMUM`

* Inputs and outputs must be in 32-bit floating-point format.

#### `MEAN`

* The first input and the output must be 4D tensors in 32-bit
  floating-point format.
* The second input (the input with the axes specification) must be static
  (use `kTfLiteMmapRo` allocation type).
* Only [1, 2], [2, 1], and [2] axes specification (i.e. reduction across either
  both spatial dimensions or across the width dimension) is supported.

#### `MINIMUM`

* Inputs and outputs must be in 32-bit floating-point format.

#### `MUL`

* Inputs and outputs must be in 32-bit floating-point format.
* Fused `NONE`, `RELU`, `RELU_N1_TO_1`, and `RELU6` activations are supported,
  but fused `TANH` and `SIGN_BIT` activations are not.

#### `NEG`

* Inputs and outputs must be in 32-bit floating-point format.

#### `PAD`

* The first input and the output must be in 32-bit floating-point format.
* The second input (the input with the padding specification) must be static
  (use `kTfLiteMmapRo` allocation type).
* The numbers of padding elements must be non-negative.

#### `PRELU`

* Inputs and outputs must be in 32-bit floating-point format.
* Slope must be static (use `kTfLiteMmapRo` allocation type).
* Slope must be either a 1D tensor, or have all its non-channel dimensions
  equal 1.

#### `RELU`

* Inputs and outputs must be in 32-bit floating-point format.

#### `RELU6`

* Inputs and outputs must be in 32-bit floating-point format.

#### `RELU_N1_TO_1`

* Inputs and outputs must be in 32-bit floating-point format.

#### `RESHAPE`

* The first input and the output must be in 32-bit floating-point format.
* The second input (the input with the new shape specification) must be either
  static (use `kTfLiteMmapRo` allocation type), or absent (with the new shape
  specified via `ReshapeOptions` table).

#### `RESIZE_BILINEAR`

* The first input and the output must be 4D tensors in 32-bit floating-point
  format.
* The second input (the input with the new shape specification) must be
  static (use `kTfLiteMmapRo` allocation type).

#### `ROUND`

* Inputs and outputs must be in 32-bit floating-point format.

#### `SPLIT`

* Inputs and outputs must be in 32-bit floating-point format.
* Only split into two, three, or four outputs is supported.

#### `SOFTMAX`

* Inputs and outputs must be in 32-bit floating-point format.
* Only `beta = 1.0` is supported.

#### `SQRT`

* Inputs and outputs must be in 32-bit floating-point format.

#### `SQUARE`

* Inputs and outputs must be in 32-bit floating-point format.

#### `SQUARED_DIFFERENCE`

* Inputs and outputs must be in 32-bit floating-point format.

#### `SUB`

* Inputs and outputs must be in 32-bit floating-point format.
* Fused `NONE`, `RELU`, `RELU_N1_TO_1`, and `RELU6` activations are supported,
  but fused `TANH` and `SIGN_BIT` activations are not.

#### `TRANSPOSE`

* The first input and the output must be in 32-bit floating-point format.
* The second input (the input with the permutation specification) must be
  static (use `kTfLiteMmapRo` allocation type).

#### `TRANSPOSE_CONV`

* Input, filter, bias (if present) and output tensors must be in 32-bit
  floating-point format.
* Output size, filter and bias (if present) must be static (use
  `kTfLiteMmapRo` allocation type).

### Floating-Point (IEEE FP16) Operators (experimental)

XNNPACK supports half-precision (using IEEE FP16 format) inference for a subset
of floating-point operators. XNNPACK automatically enables half-precision
inference when the following conditions are met:

* XNNPACK runs on hardware that natively supports computations in IEEE FP16
  format. Currently, this hardware is limited to ARM64 devices with the ARMv8.2
  FP16 arithmetic extension, and includes Android phones starting with Pixel 3,
  Galaxy S9 (Snapdragon SoC), Galaxy S10 (Exynos SoC), iOS devices with A11 or
  newer SoCs, and all Apple Silicon Macs.

* IEEE FP16 inference is supported for every floating-point operator in the
  model.

* The model's "reduced_precision_support" metadata indicates that the model
  is compatible with FP16 inference.

When the above conditions are met, XNNPACK replaces FP32 operators with their
FP16 equivalents, and inserts additional operators to convert model inputs
from FP32 to FP16 and convert model outputs back from FP16 to FP32. If the
above conditions are not met, XNNPACK will perform model inference with FP32
calculations.

Additionally, the XNNPACK delegate provides an option to force FP16 inference
regardless of model metadata. This option is intended for development workflows,
and in particular for testing end-to-end accuracy of a model when FP16 inference
is used.
Forcing FP16 inference has several effects:

* Besides ARM64 devices with the ARMv8.2 FP16 arithmetic extension, forced FP16
  inference is supported on x86/x86-64 devices with the AVX2 extension in
  emulation mode: all elementary floating-point operations are computed in FP32,
  then converted to FP16 and back to FP32. Note that such simulation is not
  exactly equivalent to native FP16 inference, but simulates the effects of the
  restricted mantissa precision and exponent range of native FP16 arithmetic.

* On devices that support neither native FP16 arithmetic (ARM64 devices with
  the ARMv8.2 FP16 arithmetic extension) nor emulation (x86/x86-64 devices with
  the AVX2 extension), inference will fail rather than fall back to FP32.

* If any floating-point operator offloaded to XNNPACK is not supported for FP16
  inference, inference will fail rather than fall back to FP32.

To force FP16 inference, either build the delegate with the
`--define xnnpack_force_float_precision=fp16` option, or add the
`TFLITE_XNNPACK_DELEGATE_FLAG_FORCE_FP16` flag to the
`TfLiteXNNPackDelegateOptions.flags` bitmask passed into
the `TfLiteXNNPackDelegateCreate` call:

```c
TfLiteXNNPackDelegateOptions xnnpack_options =
    TfLiteXNNPackDelegateOptionsDefault();
...
xnnpack_options.flags |= TFLITE_XNNPACK_DELEGATE_FLAG_FORCE_FP16;
TfLiteDelegate* xnnpack_delegate =
    TfLiteXNNPackDelegateCreate(&xnnpack_options);
```

Below is the list of operators supported in IEEE FP16 inference:

#### `ABS`

* Must satisfy constraints on the floating-point (FP32) operator.

#### `ADD`

* Must satisfy constraints on the floating-point (FP32) operator.
* Neither of the inputs can be static (use `kTfLiteMmapRo` allocation type).

#### `AVERAGE_POOL_2D`

* Must satisfy constraints on the floating-point (FP32) operator.

#### `CEIL`

* Must satisfy constraints on the floating-point (FP32) operator.

#### `CONV_2D`

* Must satisfy constraints on the floating-point (FP32) operator.

#### `CONCATENATION`

* Must satisfy constraints on the floating-point (FP32) operator.
* None of the inputs can be static (use `kTfLiteMmapRo` allocation type).

#### `DEPTH_TO_SPACE`

* Must satisfy constraints on the floating-point (FP32) operator.

#### `DEPTHWISE_CONV_2D`

* Must satisfy constraints on the floating-point (FP32) operator.

#### `DIV`

* Must satisfy constraints on the floating-point (FP32) operator.
* Neither of the inputs can be static (use `kTfLiteMmapRo` allocation type).

#### `FLOOR`

* Must satisfy constraints on the floating-point (FP32) operator.

#### `FULLY_CONNECTED`

* Must satisfy constraints on the floating-point (FP32) operator.

#### `HARD_SWISH`

* Must satisfy constraints on the floating-point (FP32) operator.

#### `LEAKY_RELU`

* Must satisfy constraints on the floating-point (FP32) operator.

#### `LOGISTIC`

* Must satisfy constraints on the floating-point (FP32) operator.

#### `MAX_POOL_2D`

* Must satisfy constraints on the floating-point (FP32) operator.

#### `MAXIMUM`

* Must satisfy constraints on the floating-point (FP32) operator.
* Neither of the inputs can be static (use `kTfLiteMmapRo` allocation type).

#### `MEAN`

* Must satisfy constraints on the floating-point (FP32) operator.

#### `MINIMUM`

* Must satisfy constraints on the floating-point (FP32) operator.
* Neither of the inputs can be static (use `kTfLiteMmapRo` allocation type).

#### `MUL`

* Must satisfy constraints on the floating-point (FP32) operator.
* Neither of the inputs can be static (use `kTfLiteMmapRo` allocation type).

#### `NEG`

* Must satisfy constraints on the floating-point (FP32) operator.

#### `PAD`

* Must satisfy constraints on the floating-point (FP32) operator.

#### `PRELU`

* Must satisfy constraints on the floating-point (FP32) operator.

#### `RELU`

* Must satisfy constraints on the floating-point (FP32) operator.

#### `RELU6`

* Must satisfy constraints on the floating-point (FP32) operator.

#### `RELU_N1_TO_1`

* Must satisfy constraints on the floating-point (FP32) operator.

#### `RESHAPE`

* Must satisfy constraints on the floating-point (FP32) operator.

#### `RESIZE_BILINEAR`

* Must satisfy constraints on the floating-point (FP32) operator.

#### `ROUND`

* Must satisfy constraints on the floating-point (FP32) operator.

#### `SPLIT`

* Must satisfy constraints on the floating-point (FP32) operator.

#### `SOFTMAX`

* Must satisfy constraints on the floating-point (FP32) operator.

#### `SQRT`

* Must satisfy constraints on the floating-point (FP32) operator.

#### `SQUARE`

* Must satisfy constraints on the floating-point (FP32) operator.

#### `SQUARED_DIFFERENCE`

* Must satisfy constraints on the floating-point (FP32) operator.
* Neither of the inputs can be static (use `kTfLiteMmapRo` allocation type).

#### `SUB`

* Must satisfy constraints on the floating-point (FP32) operator.
* Neither of the inputs can be static (use `kTfLiteMmapRo` allocation type).

#### `TRANSPOSE`

* Must satisfy constraints on the floating-point (FP32) operator.

#### `TRANSPOSE_CONV`

* Must satisfy constraints on the floating-point (FP32) operator.

### Quantized Operators

By default, quantized inference in XNNPACK delegate is disabled, and XNNPACK is
used only for floating-point models. Support for quantized inference in XNNPACK
must be enabled by adding extra Bazel flags when building TensorFlow Lite.

* `--define tflite_with_xnnpack_qs8=true` flag enables XNNPACK inference for
  quantized operators using signed quantization schema. This schema is used by
  models produced by [Model Optimization
  Toolkit](https://www.tensorflow.org/model_optimization) through either
  post-training integer quantization or quantization-aware training.
  Post-training dynamic range quantization is not supported in XNNPACK.

* `--define tflite_with_xnnpack_qu8=true` flag enables XNNPACK inference for
  quantized operators using unsigned quantization schema, produced via the
  legacy TensorFlow 1.X quantization tooling. This option is experimental and
  may perform suboptimally on mobile processors with NEON DOT product
  instructions.
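
For example, combining this with the Android AAR build command shown earlier,
inference for signed-quantized operators would be enabled as sketched below
(same target and flags as above, with the extra define added):

```
bazel build -c opt --fat_apk_cpu=x86,x86_64,arm64-v8a,armeabi-v7a \
  --host_crosstool_top=@bazel_tools//tools/cpp:toolchain \
  --define tflite_with_xnnpack=true \
  --define tflite_with_xnnpack_qs8=true \
  //tensorflow/lite/java:tensorflow-lite
```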

Below is the list of currently supported quantized operators:

#### `ADD`

* Inputs and outputs must be in 8-bit quantized format.
* Only addition with two inputs is supported.
* Fused `NONE`, `RELU`, `RELU_N1_TO_1`, and `RELU6` activations are supported,
  but fused `TANH` and `SIGN_BIT` activations are not.

#### `CONCATENATION`

* Inputs and outputs must be in 8-bit quantized format.
* Only concatenation with two, three, or four inputs is supported.

#### `CONV_2D`

* Inputs and outputs must be in 8-bit quantized format (bias must be in 32-bit
  quantized format).
* Bias is mandatory.
* Both filter and bias must be static (use `kTfLiteMmapRo` allocation type),
  and can use either per-tensor or per-channel quantization parameters.
* Fused `NONE`, `RELU`, `RELU_N1_TO_1`, and `RELU6` activations are supported,
  but fused `TANH` and `SIGN_BIT` activations are not.

#### `DEPTH_TO_SPACE`

* Inputs and outputs must be in 8-bit quantized format.
* Block size must be greater than 1.

#### `DEPTHWISE_CONV_2D`

* Inputs and outputs must be in 8-bit quantized format (bias must be in
  32-bit quantized format).
* Bias is mandatory.
* Both filter and bias must be static (use `kTfLiteMmapRo` allocation type),
  and can use either per-tensor or per-channel quantization parameters.
* Fused `NONE`, `RELU`, `RELU_N1_TO_1`, and `RELU6` activations are supported,
  but fused `TANH` and `SIGN_BIT` activations are not.

#### `DEQUANTIZE`

* Input tensor must be in 8-bit quantized format without per-channel
  quantization.
* Output tensor must be in 32-bit floating-point format.

#### `ELU`

* Inputs and outputs must be in 8-bit signed quantized format.

#### `FULLY_CONNECTED`

* Inputs and outputs must be in 8-bit quantized format (bias, if present, must
  be in 32-bit quantized format).
* Both filter and bias must be static (use `kTfLiteMmapRo` allocation type).
* Fused `NONE`, `RELU`, `RELU_N1_TO_1`, and `RELU6` activations are supported,
  but fused `TANH` and `SIGN_BIT` activations are not.

#### `LEAKY_RELU`

* Inputs and outputs must be in 8-bit quantized format.
* The ratio of input scale to output scale must be within [1/256, 128].
* The product of the negative slope and the ratio of input scale to output
  scale must be within either the [-127.99609375, -1/256] range or the
  [1/256, 128] range.

#### `LOGISTIC`

* Inputs and outputs must be in 8-bit quantized format.

#### `MAX_POOL_2D`

* Inputs and outputs must be in 8-bit quantized format.
* 1x1 pooling with non-unit stride is not supported.
* Fused `NONE`, `RELU`, `RELU_N1_TO_1`, and `RELU6` activations are supported,
  but fused `TANH` and `SIGN_BIT` activations are not.

#### `MEAN`

* The first input and the output must be 4D tensors in 8-bit quantized format.
* The second input (the input with the axes specification) must be static
  (use `kTfLiteMmapRo` allocation type).
* Only [1, 2], [2, 1], and [2] axes specification (i.e. reduction across either
  both spatial dimensions or across the width dimension) is supported.

#### `MUL`

* Inputs and outputs must be in 8-bit quantized format.
* Fused `NONE`, `RELU`, `RELU_N1_TO_1`, and `RELU6` activations are supported,
  but fused `TANH` and `SIGN_BIT` activations are not.

#### `PAD`

* The first input and the output must be in 8-bit quantized format.
* The second input (the input with the padding specification) must be static
  (use `kTfLiteMmapRo` allocation type).
* The numbers of padding elements must be non-negative.

#### `QUANTIZE`

* Input tensor must be in 32-bit floating-point format or in 8-bit quantized
  format.
* Output tensor must be in 8-bit quantized format without per-channel
  quantization.
* If inputs are in 8-bit quantized format, they must have the same signedness
  as the outputs, and the ratio of input scale to output scale must be in the
  [2**-8, 2**7] range.

#### `RESIZE_BILINEAR`

* The first input and the output must be 4D tensors in 8-bit quantized format.
* The second input (the input with the new shape specification) must be
  static (use `kTfLiteMmapRo` allocation type).

#### `SPLIT`

* Inputs and outputs must be in 8-bit quantized format.
* Only split into two, three, or four outputs is supported.

#### `SUB`

* Inputs and outputs must be in 8-bit quantized format.
* Fused `NONE`, `RELU`, `RELU_N1_TO_1`, and `RELU6` activations are supported,
  but fused `TANH` and `SIGN_BIT` activations are not.

#### `TRANSPOSE`

* The first input and the output must be in 8-bit quantized format.
* The second input (the input with the permutation specification) must be
  static (use `kTfLiteMmapRo` allocation type).

#### `TRANSPOSE_CONV`

* Input, filter, and output tensors must be in 8-bit quantized format (bias, if
  present, must be in 32-bit quantized format).
* Output size, filter and bias (if present) must be static (use
  `kTfLiteMmapRo` allocation type).

### Sparse Inference

XNNPACK backend supports sparse inference for CNN models described in the
[Fast Sparse ConvNets](https://arxiv.org/abs/1911.09723) paper. Sparse
inference is restricted to subgraphs with the following operators:

* Sparse subgraph must store its weights in sparse representation (using
  `DENSIFY` operators in the TensorFlow Lite schema).
* Sparse subgraph must start with a 3x3 stride-2 `CONV_2D` operator with
  padding 1 on each side, no dilation, and 3 input channels.
* Sparse subgraph must end with either a `MEAN` operator with reduction across
  spatial axes, or a `DEPTH_TO_SPACE` operator.
* Sparse subgraph may contain the following operators:
  * `CONV_2D` with 1x1 kernel and no padding. At least 2/3rd of filter weights
    in the 1x1 `CONV_2D` operators across the sparse subgraph must be zeroes
    to enable sparse inference.
  * `DEPTHWISE_CONV_2D` with 3x3 kernel, stride 1, no dilation, and padding 1
    on each side.
  * `DEPTHWISE_CONV_2D` with 3x3 kernel, stride 2, no dilation, and padding 1
    on each side.
  * `DEPTHWISE_CONV_2D` with 5x5 kernel, stride 1, no dilation, and padding 2
    on each side.
  * `DEPTHWISE_CONV_2D` with 5x5 kernel, stride 2, no dilation, and padding 2
    on each side.
  * `RESIZE_BILINEAR` operator with output dimensions greater than 1.
  * `MEAN` operator with reduction across spatial axes.
  * `ADD` and `MUL` operators where both inputs are 4D tensors. If one of the
    inputs to `ADD` or `MUL` is a constant tensor, it must be representable as
    either a scalar, or a 1D vector.
  * Unary elementwise operators `ABS`, `CEIL`, `ELU`, `FLOOR`, `HARD_SWISH`,
    `LEAKY_RELU`, `LOGISTIC`, `NEG`, `RELU`, `RELU6`, `RELU_N1_TO_1`, `ROUND`,
    `SIGMOID`, and `SQUARE`.

Pre-trained [Fast Sparse ConvNets models](https://github.com/google-research/google-research/tree/master/fastconvnets)
provide examples that satisfy these constraints.

### Other limitations

* Dynamically allocated (with `kTfLiteDynamic` allocation type) inputs and
  outputs are not supported.
* Resizing model inputs (via `Interpreter::ResizeInputTensor`) is supported, but
  causes a complete reinitialization of the delegate instance, which has
  considerable overhead; see the sketch below.
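
A minimal sketch of an input resize, assuming an `interpreter` that already has
the XNNPACK delegate applied and a model whose first input accepts the new
shape (the shape values here are hypothetical):

```c++
// Resize the first model input, then re-allocate tensors. With the XNNPACK
// delegate applied, this re-initializes the delegate state, so avoid doing it
// on every inference.
std::vector<int> new_shape = {4, 224, 224, 3};  // hypothetical input shape
if (interpreter->ResizeInputTensor(interpreter->inputs()[0], new_shape) !=
    kTfLiteOk) {
  // Report error.
}
if (interpreter->AllocateTensors() != kTfLiteOk) {
  // Report error.
}
```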