# TensorFlow Lite on GPU

[TensorFlow Lite](https://www.tensorflow.org/mobile/tflite/) supports several
hardware accelerators. This document describes how to use the GPU backend using
the TensorFlow Lite delegate APIs on Android (requires OpenGL ES 3.1 or higher)
and iOS (requires iOS 8 or later).

## Benefits of GPU Acceleration

### Speed

GPUs are designed to have high throughput for massively parallelizable
workloads. Thus, they are well-suited for deep neural nets, which consist of a
huge number of operators, each working on some input tensor(s) that can be
easily divided into smaller workloads and carried out in parallel. This
parallelism typically results in lower latency. In the best scenario, inference
on the GPU may run fast enough to become suitable for real-time applications
that were not previously possible.

### Accuracy

GPUs do their computation with 16-bit or 32-bit floating point numbers and
(unlike the CPUs) do not require quantization for optimal performance. If
decreased accuracy made quantization untenable for your models, running your
neural network on a GPU may eliminate this concern.

### Energy Efficiency

Another benefit that comes with GPU inference is its power efficiency. A GPU
carries out computations in a very efficient and optimized way, consuming less
power and generating less heat than the same task run on a CPU.

## Supported Ops

TensorFlow Lite on GPU supports the following ops in 16-bit and 32-bit float
precision:

* `ADD v1`
* `AVERAGE_POOL_2D v1`
* `CONCATENATION v1`
* `CONV_2D v1`
* `DEPTHWISE_CONV_2D v1-2`
* `FULLY_CONNECTED v1`
* `LOGISTIC v1`
* `MAX_POOL_2D v1`
* `MUL v1`
* `PAD v1`
* `PRELU v1`
* `RELU v1`
* `RELU6 v1`
* `RESHAPE v1`
* `RESIZE_BILINEAR v1`
* `SOFTMAX v1`
* `STRIDED_SLICE v1`
* `SUB v1`
* `TRANSPOSE_CONV v1`

## Basic Usage

### Android

Run TensorFlow Lite on GPU with `TfLiteDelegate`.
In Java, you can specify the
GpuDelegate through `Interpreter.Options`.

```java
// NEW: Prepare GPU delegate.
GpuDelegate delegate = new GpuDelegate();
Interpreter.Options options = (new Interpreter.Options()).addDelegate(delegate);

// Set up interpreter.
Interpreter interpreter = new Interpreter(model, options);

// Run inference.
writeToInputTensor(inputTensor);
interpreter.run(inputTensor, outputTensor);
readFromOutputTensor(outputTensor);

// Clean up.
delegate.close();
```

### iOS

To use TensorFlow Lite on GPU, get the GPU delegate via `NewGpuDelegate()` and
then pass it to `Interpreter::ModifyGraphWithDelegate()` (instead of calling
`Interpreter::AllocateTensors()`).

```c++
// Set up interpreter.
auto model = FlatBufferModel::BuildFromFile(model_path);
if (!model) return false;
tflite::ops::builtin::BuiltinOpResolver op_resolver;
std::unique_ptr<Interpreter> interpreter;
InterpreterBuilder(*model, op_resolver)(&interpreter);

// NEW: Prepare GPU delegate.
// The enumerator name matches the GpuDelegateOptions declaration shown under
// "Delegate Options for iOS".
const GpuDelegateOptions options = {
  .allow_precision_loss = false,
  .wait_type = GpuDelegateOptions::WaitType::kPassive,
};

// NewGpuDelegate() takes a pointer to the options (nullptr selects defaults).
auto* delegate = NewGpuDelegate(&options);
if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) return false;

// Run inference.
WriteToInputTensor(interpreter->typed_input_tensor<float>(0));
if (interpreter->Invoke() != kTfLiteOk) return false;
ReadFromOutputTensor(interpreter->typed_output_tensor<float>(0));

// Clean up.
DeleteGpuDelegate(delegate);
```

Note: When calling `Interpreter::ModifyGraphWithDelegate()` or
`Interpreter::Invoke()`, the caller must have an `EGLContext` in the current
thread and `Interpreter::Invoke()` must be called from the same `EGLContext`.
If an `EGLContext` does not exist, the delegate will internally create one, but
then the developer must ensure that `Interpreter::Invoke()` is always called
from the same thread in which `Interpreter::ModifyGraphWithDelegate()` was
called.

## Advanced Usage

### Delegate Options for iOS

`NewGpuDelegate()` accepts a `struct` of options.

```c++
struct GpuDelegateOptions {
  // Allows quantizing tensors, downcasting values, processing in float16, etc.
  bool allow_precision_loss;

  enum class WaitType {
    // waitUntilCompleted
    kPassive,
    // Minimize latency. It uses active spinning instead of mutex and consumes
    // additional CPU resources.
    kActive,
    // Useful when the output is then used with a GPU pipeline, or if an
    // external command encoder is set.
    kDoNotWait,
  };
  WaitType wait_type;
};
```

Passing `nullptr` into `NewGpuDelegate()` sets the default options (which are
spelled out in the Basic Usage example above).

```c++
// THIS:
const GpuDelegateOptions options = {
  .allow_precision_loss = false,
  .wait_type = GpuDelegateOptions::WaitType::kPassive,
};

auto* delegate = NewGpuDelegate(&options);

// IS THE SAME AS THIS:
auto* delegate = NewGpuDelegate(nullptr);
```

While it is convenient to use `nullptr`, we recommend that you explicitly set
the options, to avoid any unexpected behavior if default values are changed in
the future.

### Input/Output Buffers

To do computation on the GPU, data must be made available to the GPU. This often
requires performing a memory copy. It is desirable not to cross the CPU/GPU
memory boundary if possible, as this can take up a significant amount of time.
Usually, such crossing is inevitable, but in some special cases, one or the
other can be omitted.

If the network's input is an image already loaded in the GPU memory (for
example, a GPU texture containing the camera feed), it can stay in the GPU memory
without ever entering the CPU memory. Similarly, if the network's output is in
the form of a renderable image (for example,
[image style transfer](https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Gatys_Image_Style_Transfer_CVPR_2016_paper.pdf)),
it can be directly displayed on the screen.

To achieve best performance, TensorFlow Lite makes it possible for users to
directly read from and write to the TensorFlow hardware buffer and bypass
avoidable memory copies.

#### Android

Assuming the image input is in the GPU memory, it must first be converted to an
OpenGL Shader Storage Buffer Object (SSBO). You can associate a TfLiteTensor to
a user-prepared SSBO with `GpuDelegate.bindGlBufferToTensor()`. Note that
`GpuDelegate.bindGlBufferToTensor()` must be called before
`Interpreter.modifyGraphWithDelegate()`.

```java
// Ensure a valid EGL rendering context.
EGLContext eglContext = eglGetCurrentContext();
if (eglContext.equals(EGL_NO_CONTEXT)) return false;

// Create an SSBO.
int[] id = new int[1];
glGenBuffers(id.length, id, 0);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, id[0]);
glBufferData(GL_SHADER_STORAGE_BUFFER, inputSize, null, GL_STREAM_COPY);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);  // unbind
int inputSsboId = id[0];

// Create interpreter.
Interpreter interpreter = new Interpreter(tfliteModel);
Tensor inputTensor = interpreter.getInputTensor(0);
GpuDelegate gpuDelegate = new GpuDelegate();
// The buffer must be bound before the delegate is installed.
gpuDelegate.bindGlBufferToTensor(inputTensor, inputSsboId);
interpreter.modifyGraphWithDelegate(gpuDelegate);

// Run inference; the null input argument indicates use of the bound buffer for input.
fillSsboWithCameraImageTexture(inputSsboId);
float[] outputArray = new float[outputSize];
interpreter.run(null, outputArray);
```

A similar approach can be applied to the output tensor. In that case,
`Interpreter.Options.setAllowBufferHandleOutput(true)` should be set, to
disable the default copying of the network's output from GPU memory to CPU
memory.

```java
// Ensure a valid EGL rendering context.
EGLContext eglContext = eglGetCurrentContext();
if (eglContext.equals(EGL_NO_CONTEXT)) return false;

// Create an SSBO.
int[] id = new int[1];
glGenBuffers(id.length, id, 0);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, id[0]);
glBufferData(GL_SHADER_STORAGE_BUFFER, outputSize, null, GL_STREAM_COPY);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);  // unbind
int outputSsboId = id[0];

// Create interpreter.
Interpreter.Options options = (new Interpreter.Options()).setAllowBufferHandleOutput(true);
Interpreter interpreter = new Interpreter(tfliteModel, options);
Tensor outputTensor = interpreter.getOutputTensor(0);
GpuDelegate gpuDelegate = new GpuDelegate();
// The buffer must be bound before the delegate is installed.
gpuDelegate.bindGlBufferToTensor(outputTensor, outputSsboId);
interpreter.modifyGraphWithDelegate(gpuDelegate);

// Run inference; the null output argument indicates use of the bound buffer for output.
ByteBuffer input = getCameraImageByteBuffer();
interpreter.run(input, null);
renderOutputSsbo(outputSsboId);
```

#### iOS

Assuming the image input is in GPU memory, it must first be converted to a
`MTLBuffer` object for Metal. You can associate a TfLiteTensor to a
user-prepared `MTLBuffer` with `BindMetalBufferToTensor()`. Note that
`BindMetalBufferToTensor()` must be called before
`Interpreter::ModifyGraphWithDelegate()`.
Additionally, the inference output is,
by default, copied from GPU memory to CPU memory. This behavior can be turned
off by calling `Interpreter::SetAllowBufferHandleOutput(true)` during
initialization.

```c++
// Prepare GPU delegate.
auto* delegate = NewGpuDelegate(nullptr);
interpreter->SetAllowBufferHandleOutput(true);  // disable default gpu->cpu copy
if (!BindMetalBufferToTensor(delegate, interpreter->inputs()[0], user_provided_input_buffer)) return false;
if (!BindMetalBufferToTensor(delegate, interpreter->outputs()[0], user_provided_output_buffer)) return false;
if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) return false;

// Run inference.
if (interpreter->Invoke() != kTfLiteOk) return false;
```

Note: Once the default behavior is turned off, copying the inference output from
GPU memory to CPU memory requires an explicit call to
`Interpreter::EnsureTensorDataIsReadable()` for each output tensor.

## Tips and Tricks

* Some operations that are trivial on the CPU may be costly on a GPU. One
  class of such operations includes various forms of reshape operations
  (including `BATCH_TO_SPACE`, `SPACE_TO_BATCH`, `SPACE_TO_DEPTH`, and similar
  operations). If these operations are not required (for example, they were
  inserted to help the network architect reason about the system but do not
  otherwise affect output), it is worth removing them for performance.

* On a GPU, tensor data is sliced into 4-channel groups. Thus, a computation on a
  tensor of shape `[B, H, W, 5]` will perform about the same as on a tensor of
  shape `[B, H, W, 8]`, but significantly worse than on `[B, H, W, 4]`.

  * For example, if the camera hardware supports image frames in RGBA,
    feeding that 4-channel input is significantly faster, because a memory
    copy (from 3-channel RGB to 4-channel RGBX) can be avoided.

* For best performance, do not hesitate to re-train your classifier with a
  mobile-optimized network architecture. That is a significant part of
  optimization for on-device inference.
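
The 4-channel slicing tip above can be made concrete with a little arithmetic. The sketch below is not part of the TensorFlow Lite API: the class `ChannelSlices` and its methods are hypothetical names, and it assumes that the channel dimension is simply rounded up to the next multiple of 4 (ceiling division), which is consistent with the `[B, H, W, 5]` versus `[B, H, W, 8]` comparison but is not confirmed in detail by this document. It also shows the kind of RGB to RGBX padding copy that feeding RGBA input avoids.

```java
// Hypothetical helper, for illustration only; not part of TensorFlow Lite.
class ChannelSlices {
    /** Number of 4-channel slices needed to hold the given channel count (assumed ceiling division). */
    public static int sliceCount(int channels) {
        return (channels + 3) / 4;
    }

    /** Channel count after padding up to the next multiple of 4. */
    public static int paddedChannels(int channels) {
        return sliceCount(channels) * 4;
    }

    /** Copies interleaved RGB pixels into an RGBX array, leaving the fourth channel of each pixel at 0. */
    public static float[] rgbToRgbx(float[] rgb) {
        int pixels = rgb.length / 3;
        float[] rgbx = new float[pixels * 4];
        for (int p = 0; p < pixels; p++) {
            rgbx[4 * p] = rgb[3 * p];
            rgbx[4 * p + 1] = rgb[3 * p + 1];
            rgbx[4 * p + 2] = rgb[3 * p + 2];
            // rgbx[4 * p + 3] keeps its default value of 0 (the unused X channel).
        }
        return rgbx;
    }

    public static void main(String[] args) {
        // 4 channels fit in 1 slice; 5 and 8 channels both need 2 slices.
        for (int c : new int[] {4, 5, 8}) {
            System.out.println(c + " channels -> " + sliceCount(c)
                + " slice(s), padded to " + paddedChannels(c) + " channels");
        }
    }
}
```

Under this assumption, a 5-channel tensor occupies as many slices as an 8-channel one, so trimming a layer from 5 to 4 output channels halves the slices processed, while trimming from 8 to 5 changes nothing.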