# TensorFlow Lite on GPU

[TensorFlow Lite](https://www.tensorflow.org/mobile/tflite/) supports several
hardware accelerators. This document describes how to use the GPU backend
through the TensorFlow Lite delegate APIs on Android (requires OpenGL ES 3.1 or
higher) and iOS (requires iOS 8 or later).

## Benefits of GPU Acceleration

### Speed

GPUs are designed to have high throughput for massively parallelizable
workloads. Thus, they are well-suited for deep neural nets, which consist of a
huge number of operators, each working on some input tensor(s) that can be
easily divided into smaller workloads and carried out in parallel. This
parallelism typically results in lower latency. In the best case, inference
on the GPU may run fast enough to enable real-time applications that were not
previously possible.

### Accuracy

GPUs do their computation with 16-bit or 32-bit floating point numbers and
(unlike CPUs) do not require quantization for optimal performance. If
decreased accuracy made quantization untenable for your models, running your
neural network on a GPU may eliminate this concern.

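To make the accuracy trade-off concrete, the sketch below round-trips a single value through 8-bit affine quantization and prints the rounding error that quantization introduces. It is a minimal illustration in plain Java, not TensorFlow Lite code; the class name, the scale, and the zero point are assumptions chosen for the example.

```java
// Illustrative only: 8-bit affine quantization round trip (not TensorFlow Lite code).
final class QuantizationRoundTrip {
    // Map a real value to an unsigned 8-bit code: q = round(x / scale) + zeroPoint.
    static int quantize(float x, float scale, int zeroPoint) {
        int q = Math.round(x / scale) + zeroPoint;
        return Math.max(0, Math.min(255, q)); // clamp to the uint8 range
    }

    // Map an 8-bit code back to a real value.
    static float dequantize(int q, float scale, int zeroPoint) {
        return (q - zeroPoint) * scale;
    }

    public static void main(String[] args) {
        float scale = 2.0f / 255.0f; // hypothetical: roughly [-1, 1] spread over 256 levels
        int zeroPoint = 128;         // hypothetical zero point
        float x = 0.3141f;
        float restored = dequantize(quantize(x, scale, zeroPoint), scale, zeroPoint);
        System.out.printf("original=%.6f restored=%.6f error=%.6f%n",
                x, restored, Math.abs(x - restored));
    }
}
```

For values inside the representable range, the round-trip error is bounded by half the scale. Whether that is acceptable depends on the model.
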
### Energy Efficiency

Another benefit that comes with GPU inference is its power efficiency. A GPU
carries out computations in a very efficient and optimized way, consuming less
power and generating less heat than the same task run on a CPU.

## Supported Ops

TensorFlow Lite on GPU supports the following ops in 16-bit and 32-bit float
precision:

* `ADD v1`
* `AVERAGE_POOL_2D v1`
* `CONCATENATION v1`
* `CONV_2D v1`
* `DEPTHWISE_CONV_2D v1-2`
* `FULLY_CONNECTED v1`
* `LOGISTIC v1`
* `MAX_POOL_2D v1`
* `MUL v1`
* `PAD v1`
* `PRELU v1`
* `RELU v1`
* `RELU6 v1`
* `RESHAPE v1`
* `RESIZE_BILINEAR v1`
* `SOFTMAX v1`
* `STRIDED_SLICE v1`
* `SUB v1`
* `TRANSPOSE_CONV v1`

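One way to use this list is as a pre-flight check on the ops a model contains. The sketch below is plain Java with no TensorFlow Lite dependency: the class name, the method name, and the sample op list are hypothetical, and version suffixes are ignored for simplicity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class GpuOpCheck {
    // Op names copied from the supported-ops list above (version suffixes omitted).
    static final Set<String> GPU_SUPPORTED_OPS = Set.of(
        "ADD", "AVERAGE_POOL_2D", "CONCATENATION", "CONV_2D", "DEPTHWISE_CONV_2D",
        "FULLY_CONNECTED", "LOGISTIC", "MAX_POOL_2D", "MUL", "PAD", "PRELU", "RELU",
        "RELU6", "RESHAPE", "RESIZE_BILINEAR", "SOFTMAX", "STRIDED_SLICE", "SUB",
        "TRANSPOSE_CONV");

    // Returns the ops in modelOps that are not in the supported list, in input order.
    static List<String> unsupportedOps(List<String> modelOps) {
        List<String> missing = new ArrayList<>();
        for (String op : modelOps) {
            if (!GPU_SUPPORTED_OPS.contains(op)) {
                missing.add(op);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Hypothetical op list; a real one would come from inspecting your model.
        List<String> modelOps = List.of("CONV_2D", "RELU6", "SPACE_TO_DEPTH", "SOFTMAX");
        System.out.println("Not in the GPU list: " + unsupportedOps(modelOps));
    }
}
```
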
## Basic Usage

### Android

Run TensorFlow Lite on GPU with `TfLiteDelegate`. In Java, you can specify the
`GpuDelegate` through `Interpreter.Options`.

```java
// NEW: Prepare GPU delegate.
GpuDelegate delegate = new GpuDelegate();
Interpreter.Options options = (new Interpreter.Options()).addDelegate(delegate);

// Set up interpreter.
Interpreter interpreter = new Interpreter(model, options);

// Run inference.
writeToInputTensor(inputTensor);
interpreter.run(inputTensor, outputTensor);
readFromOutputTensor(outputTensor);

// Clean up.
delegate.close();
```

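The cleanup step above runs only if inference completes without throwing. A possible way to make it unconditional is Java's try-with-resources, sketched here with stand-in classes so that it runs without the TensorFlow Lite library. `FakeDelegate` and `FakeInterpreter` are hypothetical placeholders; whether the real classes can be used this way depends on them implementing `AutoCloseable` in the version you use.

```java
import java.util.ArrayList;
import java.util.List;

final class DelegateCleanupSketch {
    // Records the order of events so the close order can be inspected.
    static final List<String> LOG = new ArrayList<>();

    // Stand-in for a GPU delegate; hypothetical, only records when it is closed.
    static final class FakeDelegate implements AutoCloseable {
        @Override public void close() { LOG.add("delegate closed"); }
    }

    // Stand-in for an interpreter that uses the delegate; also hypothetical.
    static final class FakeInterpreter implements AutoCloseable {
        FakeInterpreter(FakeDelegate delegate) {}
        void run() { LOG.add("inference"); }
        @Override public void close() { LOG.add("interpreter closed"); }
    }

    static void runOnce() {
        // Resources are closed in reverse order of declaration, so the interpreter
        // is released before the delegate it depends on, even if run() throws.
        try (FakeDelegate delegate = new FakeDelegate();
             FakeInterpreter interpreter = new FakeInterpreter(delegate)) {
            interpreter.run();
        }
    }

    public static void main(String[] args) {
        runOnce();
        System.out.println(LOG);
    }
}
```
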
### iOS

To use TensorFlow Lite on GPU, get the GPU delegate via `NewGpuDelegate()` and
then pass it to `Interpreter::ModifyGraphWithDelegate()` (instead of calling
`Interpreter::AllocateTensors()`).

```c++
// Set up interpreter.
auto model = FlatBufferModel::BuildFromFile(model_path);
if (!model) return false;
tflite::ops::builtin::BuiltinOpResolver op_resolver;
std::unique_ptr<Interpreter> interpreter;
InterpreterBuilder(*model, op_resolver)(&interpreter);

// NEW: Prepare GPU delegate.
const GpuDelegateOptions options = {
  .allow_precision_loss = false,
  .wait_type = GpuDelegateOptions::WaitType::kPassive,
};

// The options are passed by pointer; nullptr selects the defaults.
auto* delegate = NewGpuDelegate(&options);
if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) return false;

// Run inference.
WriteToInputTensor(interpreter->typed_input_tensor<float>(0));
if (interpreter->Invoke() != kTfLiteOk) return false;
ReadFromOutputTensor(interpreter->typed_output_tensor<float>(0));

// Clean up.
DeleteGpuDelegate(delegate);
```

Note: On Android, when calling `Interpreter::ModifyGraphWithDelegate()` or
`Interpreter::Invoke()`, the caller must have an `EGLContext` in the current
thread, and `Interpreter::Invoke()` must be called from the same `EGLContext`.
If an `EGLContext` does not exist, the delegate will create one internally, but
then the developer must ensure that `Interpreter::Invoke()` is always called
from the same thread in which `Interpreter::ModifyGraphWithDelegate()` was
called.

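One way to satisfy the same-thread requirement is to funnel delegate setup and every inference call through a dedicated single-thread executor. The sketch below shows only the threading pattern in plain Java. The class and method names are hypothetical, and the two placeholder tasks stand in for installing the delegate and invoking the interpreter.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class SingleThreadInference {
    // All delegate setup and inference calls are funneled through this one thread.
    private final ExecutorService gpuThread = Executors.newSingleThreadExecutor();

    // Runs a task on the dedicated thread and returns the name of the thread that ran it.
    String runOnGpuThread(Runnable task) throws Exception {
        Future<String> result = gpuThread.submit(() -> {
            task.run();
            return Thread.currentThread().getName();
        });
        return result.get(); // block until the task has finished
    }

    void shutdown() {
        gpuThread.shutdown();
    }

    public static void main(String[] args) throws Exception {
        SingleThreadInference runner = new SingleThreadInference();
        // Placeholders: a real app would install the delegate and invoke the interpreter here.
        String setupThread = runner.runOnGpuThread(() -> System.out.println("install delegate"));
        String invokeThread = runner.runOnGpuThread(() -> System.out.println("invoke"));
        System.out.println("same thread: " + setupThread.equals(invokeThread));
        runner.shutdown();
    }
}
```
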
## Advanced Usage

### Delegate Options for iOS

`NewGpuDelegate()` accepts a `struct` of options.

```c++
struct GpuDelegateOptions {
  // Allows quantizing tensors, downcasting values, processing in float16, etc.
  bool allow_precision_loss;

  enum class WaitType {
    // waitUntilCompleted
    kPassive,
    // Minimize latency. It uses active spinning instead of mutex and consumes
    // additional CPU resources.
    kActive,
    // Useful when the output is subsequently used by a GPU pipeline, or if an
    // external command encoder is set.
    kDoNotWait,
  };
  WaitType wait_type;
};
```

Passing `nullptr` into `NewGpuDelegate()` selects the default options (which are
spelled out in the Basic Usage example above).

```c++
// THIS:
const GpuDelegateOptions options = {
  .allow_precision_loss = false,
  .wait_type = GpuDelegateOptions::WaitType::kPassive,
};

auto* delegate = NewGpuDelegate(&options);

// IS THE SAME AS THIS:
auto* delegate = NewGpuDelegate(nullptr);
```

While it is convenient to use `nullptr`, we recommend that you explicitly set
the options, to avoid any unexpected behavior if default values are changed in
the future.

### Input/Output Buffers

To do computation on the GPU, data must be made available to the GPU. This often
requires performing a memory copy. It is desirable not to cross the CPU/GPU
memory boundary if possible, as this can take up a significant amount of time.
Usually, such crossing is inevitable, but in some special cases, one or the
other copy can be omitted.

If the network's input is an image already loaded in the GPU memory (for
example, a GPU texture containing the camera feed), it can stay in the GPU
memory without ever entering the CPU memory. Similarly, if the network's output
is in the form of a renderable image (for example,
[image style transfer](https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Gatys_Image_Style_Transfer_CVPR_2016_paper.pdf)),
it can be directly displayed on the screen.

To achieve best performance, TensorFlow Lite makes it possible for users to
directly read from and write to the TensorFlow hardware buffer and bypass
avoidable memory copies.

#### Android

Assuming the image input is in the GPU memory, it must first be converted to an
OpenGL Shader Storage Buffer Object (SSBO). You can associate a `TfLiteTensor`
with a user-prepared SSBO using `GpuDelegate.bindGlBufferToTensor()`. Note that
`GpuDelegate.bindGlBufferToTensor()` must be called before
`Interpreter.modifyGraphWithDelegate()`.

```java
// Ensure a valid EGL rendering context.
EGLContext eglContext = eglGetCurrentContext();
if (eglContext.equals(EGL_NO_CONTEXT)) return false;

// Create an SSBO.
int[] id = new int[1];
glGenBuffers(id.length, id, 0);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, id[0]);
glBufferData(GL_SHADER_STORAGE_BUFFER, inputSize, null, GL_STREAM_COPY);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);  // unbind
int inputSsboId = id[0];

// Create interpreter.
Interpreter interpreter = new Interpreter(tfliteModel);
Tensor inputTensor = interpreter.getInputTensor(0);
GpuDelegate gpuDelegate = new GpuDelegate();
// The buffer must be bound before the delegate is installed.
gpuDelegate.bindGlBufferToTensor(inputTensor, inputSsboId);
interpreter.modifyGraphWithDelegate(gpuDelegate);

// Run inference; the null input argument indicates use of the bound buffer for input.
fillSsboWithCameraImageTexture(inputSsboId);
float[] outputArray = new float[outputSize];
interpreter.run(null, outputArray);
```

A similar approach can be applied to the output tensor. In that case,
`Interpreter.Options.setAllowBufferHandleOutput(true)` should be set, to
disable the default copying of the network's output from GPU memory to CPU
memory.

```java
// Ensure a valid EGL rendering context.
EGLContext eglContext = eglGetCurrentContext();
if (eglContext.equals(EGL_NO_CONTEXT)) return false;

// Create an SSBO.
int[] id = new int[1];
glGenBuffers(id.length, id, 0);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, id[0]);
glBufferData(GL_SHADER_STORAGE_BUFFER, outputSize, null, GL_STREAM_COPY);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);  // unbind
int outputSsboId = id[0];

// Create interpreter.
Interpreter.Options options = (new Interpreter.Options()).setAllowBufferHandleOutput(true);
Interpreter interpreter = new Interpreter(tfliteModel, options);
Tensor outputTensor = interpreter.getOutputTensor(0);
GpuDelegate gpuDelegate = new GpuDelegate();
// The buffer must be bound before the delegate is installed.
gpuDelegate.bindGlBufferToTensor(outputTensor, outputSsboId);
interpreter.modifyGraphWithDelegate(gpuDelegate);

// Run inference; the null output argument indicates use of the bound buffer for output.
ByteBuffer input = getCameraImageByteBuffer();
interpreter.run(input, null);
renderOutputSsbo(outputSsboId);
```

#### iOS

Assuming the image input is in GPU memory, it must first be converted to a
`MTLBuffer` object for Metal. You can associate a `TfLiteTensor` with a
user-prepared `MTLBuffer` using `BindMetalBufferToTensor()`. Note that
`BindMetalBufferToTensor()` must be called before
`Interpreter::ModifyGraphWithDelegate()`. Additionally, the inference output is,
by default, copied from GPU memory to CPU memory. This behavior can be turned
off by calling `Interpreter::SetAllowBufferHandleOutput(true)` during
initialization.

```c++
// Prepare GPU delegate.
auto* delegate = NewGpuDelegate(nullptr);
interpreter->SetAllowBufferHandleOutput(true);  // disable default gpu->cpu copy
if (!BindMetalBufferToTensor(delegate, interpreter->inputs()[0], user_provided_input_buffer)) return false;
if (!BindMetalBufferToTensor(delegate, interpreter->outputs()[0], user_provided_output_buffer)) return false;
if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) return false;

// Run inference.
if (interpreter->Invoke() != kTfLiteOk) return false;
```

Note: Once the default behavior is turned off, copying the inference output from
GPU memory to CPU memory requires an explicit call to
`Interpreter::EnsureTensorDataIsReadable()` for each output tensor.

## Tips and Tricks

*   Some operations that are trivial on the CPU may be costly on a GPU. One
    class of such operations is the various forms of reshape operations
    (including `BATCH_TO_SPACE`, `SPACE_TO_BATCH`, `SPACE_TO_DEPTH`, and similar
    operations). If these operations are not required (for example, they were
    inserted to help the network architect reason about the system but do not
    otherwise affect output), it is worth removing them for performance.

*   On a GPU, tensor data is sliced into 4-channel groups. Thus, a computation
    on a tensor of shape `[B, H, W, 5]` will perform about the same as on a
    tensor of shape `[B, H, W, 8]`, but significantly worse than on
    `[B, H, W, 4]`.

    *   For example, if the camera hardware supports image frames in RGBA,
        feeding that 4-channel input is significantly faster, because a memory
        copy (from 3-channel RGB to 4-channel RGBX) can be avoided.

*   For best performance, do not hesitate to re-train your classifier with a
    mobile-optimized network architecture. That is a significant part of
    optimization for on-device inference.

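The channel-slicing tip can be checked with a little arithmetic. The sketch below computes how many 4-channel slices a given channel count occupies, and shows the kind of RGB to RGBX expansion that a 4-channel camera frame would avoid. It is plain Java written for illustration; the class and method names are hypothetical, and filling the fourth channel with zero is an arbitrary choice for the example.

```java
final class ChannelPadding {
    // Number of 4-channel slices needed to hold the given number of channels.
    static int slices(int channels) {
        return (channels + 3) / 4;
    }

    // Channel count after padding up to the next multiple of 4.
    static int paddedChannels(int channels) {
        return slices(channels) * 4;
    }

    // Expands tightly packed RGB pixels to RGBX by appending a zero fourth channel.
    static float[] rgbToRgbx(float[] rgb) {
        int pixels = rgb.length / 3;
        float[] rgbx = new float[pixels * 4];
        for (int p = 0; p < pixels; p++) {
            rgbx[4 * p] = rgb[3 * p];         // R
            rgbx[4 * p + 1] = rgb[3 * p + 1]; // G
            rgbx[4 * p + 2] = rgb[3 * p + 2]; // B
            // rgbx[4 * p + 3] stays 0 (the unused X channel)
        }
        return rgbx;
    }

    public static void main(String[] args) {
        for (int c : new int[] {3, 4, 5, 8}) {
            System.out.println(c + " channels -> " + slices(c)
                    + " slice(s), padded to " + paddedChannels(c));
        }
    }
}
```

Under this model, 5 and 8 channels both occupy two slices, which matches the observation that `[B, H, W, 5]` and `[B, H, W, 8]` perform about the same.
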