# TensorFlow Lite GPU delegate

[TensorFlow Lite](https://www.tensorflow.org/lite) supports several hardware
accelerators. This document describes how to preview the experimental GPU
backend using the TensorFlow Lite delegate APIs on Android and iOS.

GPUs are designed to have high throughput for massively parallelizable
workloads. Thus, they are well-suited for deep neural nets, which consist of a
huge number of operators, each working on some input tensor(s) that can be
easily divided into smaller workloads and carried out in parallel, typically
resulting in lower latency. In the best scenario, inference on the GPU may run
fast enough for real-time applications that were previously not possible.

Unlike CPUs, GPUs compute with 16-bit or 32-bit floating point numbers and do
not require quantization for optimal performance.

Another benefit of GPU inference is power efficiency. GPUs carry out
computations in a very efficient and optimized manner, so they consume less
power and generate less heat than when the same task is run on CPUs.
## Demo App Tutorials

The easiest way to try out the experimental GPU delegate is to follow the
tutorials below, which walk through building our classification demo
applications with GPU support. The GPU code is only available in binary form
for now; it will be open-sourced soon. Once you understand how to get our
demos working, you can try this out on your own custom models.
### Android (with Android Studio)

For a step-by-step tutorial, watch the
[Experimental GPU Delegate for Android](https://youtu.be/Xkhgre8r5G0) video.

Note: This requires OpenGL ES 3.1 or higher.

#### Step 1. Clone the TensorFlow source code and open it in Android Studio

```
git clone https://github.com/tensorflow/tensorflow
```
#### Step 2. Edit `app/build.gradle` to use the experimental GPU AAR

Replace the existing `tensorflow-lite` package in the `dependencies` block:

```
dependencies {
    ...
    // implementation 'org.tensorflow:tensorflow-lite:0.0.0-nightly'
    implementation 'org.tensorflow:tensorflow-lite:0.0.0-gpu-experimental'
}
```

#### Step 3. Build and run

Run → Run ‘app’. When you run the application, you will see a button for
enabling the GPU. Change from the quantized model to a float model, and then
click GPU to run on the GPU.

![running android gpu demo and switch to gpu](images/android_gpu_demo.gif)

### iOS (with Xcode)

For a step-by-step tutorial, watch the
[Experimental GPU Delegate for iOS](https://youtu.be/a5H4Zwjp49c) video.

Note: This requires Xcode v10.1 or later.

#### Step 1. Get the demo source code and make sure it compiles

Follow our iOS Demo App [tutorial](https://www.tensorflow.org/lite/demo_ios).
This will get you to a point where the unmodified iOS camera demo is working
on your phone.

#### Step 2. Modify the Podfile to use the TensorFlow Lite GPU CocoaPod

We have built a binary CocoaPod that includes the GPU delegate. To switch the
project to use it, modify the
`tensorflow/tensorflow/lite/examples/ios/camera/Podfile` file to use
the `TensorFlowLiteGpuExperimental` pod instead of `TensorFlowLite`.

```
target 'YourProjectName'
  # pod 'TensorFlowLite', '1.12.0'
  pod 'TensorFlowLiteGpuExperimental'
```

#### Step 3. Enable the GPU Delegate

You will need to change two `#define` flags in `CameraExampleViewController.h`
to enable the GPU delegate. First, change `TFLITE_USE_CONTRIB_LITE` from 1 to 0
since TensorFlow Lite has moved from TensorFlow contrib into core.

```c
#define TFLITE_USE_CONTRIB_LITE 0
```

Next, change `TFLITE_USE_GPU_DELEGATE` from 0 to 1, to enable the code that
will use the GPU delegate.

```c
#define TFLITE_USE_GPU_DELEGATE 1
```

#### Step 4. Build and run the demo app

After following the previous step, you should be able to run the app.

#### Step 5. Release mode

In Step 4 you ran the app in debug mode. To get better performance, you should
change to a release build with the appropriate optimal Metal settings. To edit
these settings, go to `Product > Scheme > Edit Scheme...`. Select `Run`. On the
`Info` tab, change `Build Configuration` from `Debug` to `Release` and uncheck
`Debug executable`.

![setting up release](images/iosdebug.png)

Then click the `Options` tab and change `GPU Frame Capture` to `Disabled` and
`Metal API Validation` to `Disabled`.

![setting up metal options](images/iosmetal.png)

Lastly, make sure Release builds only for 64-bit architectures. Under `Project
navigator -> tflite_camera_example -> PROJECT -> tflite_camera_example -> Build
Settings`, set `Build Active Architecture Only > Release` to `Yes`.

![setting up release options](images/iosrelease.png)

## Trying the GPU Delegate on your own model

### Android

Look at the demo to see how to add the delegate. In your application, add the
AAR as above, import the `org.tensorflow.lite.experimental.GpuDelegate` module,
and use the `addDelegate` function to register the GPU delegate with the
interpreter:

```java
import org.tensorflow.lite.Interpreter;
import org.tensorflow.lite.experimental.GpuDelegate;

// Initialize interpreter with GPU delegate
GpuDelegate delegate = new GpuDelegate();
Interpreter.Options options = (new Interpreter.Options()).addDelegate(delegate);
Interpreter interpreter = new Interpreter(model, options);

// Run inference
while (true) {
  writeToInput(input);
  interpreter.run(input, output);
  readFromOutput(output);
}

// Clean up
delegate.close();
```
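
The delegate owns GPU resources, so it is worth guaranteeing cleanup even when
inference throws. A minimal sketch of one way to do this with `try`/`finally`
(`model`, `input`, and `output` are placeholders, as in the snippet above):

```java
// Sketch: release both the interpreter and the delegate no matter how
// inference exits. Interpreter.close() frees the interpreter's own resources.
GpuDelegate delegate = new GpuDelegate();
Interpreter interpreter =
    new Interpreter(model, new Interpreter.Options().addDelegate(delegate));
try {
  interpreter.run(input, output);
} finally {
  interpreter.close();
  delegate.close();
}
```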

### iOS

In your application code, include the GPU delegate header and call the
`Interpreter::ModifyGraphWithDelegate` function to register the GPU delegate
with the interpreter:

```cpp
#import "tensorflow/lite/delegates/gpu/metal_delegate.h"

// Initialize interpreter with GPU delegate
std::unique_ptr<Interpreter> interpreter;
InterpreterBuilder(*model, resolver)(&interpreter);
auto* delegate = NewGpuDelegate(nullptr);  // default config
if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) return false;

// Run inference
while (true) {
  WriteToInputTensor(interpreter->typed_input_tensor<float>(0));
  if (interpreter->Invoke() != kTfLiteOk) return false;
  ReadFromOutputTensor(interpreter->typed_output_tensor<float>(0));
}

// Clean up
interpreter = nullptr;
DeleteGpuDelegate(delegate);
```

## Supported Models and Ops

With the release of the GPU delegate, we included a handful of models that can
be run on the backend:

* [MobileNet v1 (224x224) image classification](https://ai.googleblog.com/2017/06/mobilenets-open-source-models-for.html) [[download]](https://storage.googleapis.com/download.tensorflow.org/models/tflite/gpu/mobilenet_v1_1.0_224.tflite)
<br /><i>(image classification model designed for mobile and embedded vision applications)</i>
* [DeepLab segmentation (257x257)](https://ai.googleblog.com/2018/03/semantic-image-segmentation-with.html) [[download]](https://storage.googleapis.com/download.tensorflow.org/models/tflite/gpu/deeplabv3_257_mv_gpu.tflite)
<br /><i>(image segmentation model that assigns semantic labels (e.g., dog, cat, car) to every pixel in the input image)</i>
* [MobileNet SSD object detection](https://ai.googleblog.com/2018/07/accelerated-training-and-inference-with.html) [[download]](https://storage.googleapis.com/download.tensorflow.org/models/tflite/gpu/mobile_ssd_v2_float_coco.tflite)
<br /><i>(object detection model that detects multiple objects with bounding boxes)</i>
* [PoseNet for pose estimation](https://github.com/tensorflow/tfjs-models/tree/master/posenet) [[download]](https://storage.googleapis.com/download.tensorflow.org/models/tflite/gpu/multi_person_mobilenet_v1_075_float.tflite)
<br /><i>(vision model that estimates the poses of one or more persons in an image or video)</i>

To see a full list of supported ops, please see the
[advanced documentation](gpu_advanced.md).

## Non-supported models and ops

If some of the ops are not supported by the GPU delegate, the framework will
only run part of the graph on the GPU and run the remaining part on the CPU.
Due to the high cost of CPU/GPU synchronization, a split execution mode like
this will often result in slower performance than running the whole network on
the CPU alone. In this case, the user will get a warning like:

```
WARNING: op code #42 cannot be handled by this delegate.
```

We did not provide a callback for this failure, as it is not a true run-time
failure, but something that the developer can observe while trying to get the
network to run on the delegate.
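
If split execution turns out to be slower than CPU-only execution for your
model, one option is to fall back to the plain CPU interpreter. A minimal
sketch of that pattern on Android (the fallback policy is up to your app, and
`model` is a placeholder as in the earlier snippets):

```java
import org.tensorflow.lite.Interpreter;
import org.tensorflow.lite.experimental.GpuDelegate;

// Sketch: try to build a GPU-delegated interpreter first, and fall back to a
// CPU-only interpreter if delegate setup fails for this model.
Interpreter interpreter;
GpuDelegate delegate = null;
try {
  delegate = new GpuDelegate();
  interpreter = new Interpreter(model, new Interpreter.Options().addDelegate(delegate));
} catch (Exception e) {
  if (delegate != null) {
    delegate.close();  // release GPU resources before falling back
    delegate = null;
  }
  interpreter = new Interpreter(model);  // CPU-only fallback
}
```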

## Tips for optimization

Some operations that are trivial on the CPU may have a high cost on the GPU.
One class of such operations is the various forms of reshape, including
`BATCH_TO_SPACE`, `SPACE_TO_BATCH`, `SPACE_TO_DEPTH`, and so forth. If those
ops are in the network only for the network architect's logical convenience,
it is worth removing them for performance.

On the GPU, tensor data is sliced into 4-channel chunks. Thus, a computation on
a tensor of shape `[B,H,W,5]` will perform about the same as on a tensor of
shape `[B,H,W,8]`, but significantly worse than on `[B,H,W,4]`.
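
The reason is that per-tensor work grows roughly with the number of 4-channel
slices, i.e. with `ceil(C / 4)`. A toy illustration of that arithmetic (this is
not part of the TensorFlow Lite API):

```java
// Toy illustration: 5 channels need two 4-channel slices, just like 8
// channels, while 4 channels need only one.
public final class ChannelSlices {
  static int slices(int channels) {
    return (channels + 3) / 4;  // integer ceil(channels / 4)
  }

  public static void main(String[] args) {
    for (int c : new int[] {4, 5, 8}) {
      System.out.println(c + " channels -> " + slices(c) + " slice(s)");
    }
  }
}
```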

In that sense, if the camera hardware supports image frames in RGBA, feeding
that 4-channel input directly is significantly faster, because a memory copy
(from 3-channel RGB to 4-channel RGBX) can be avoided.
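
As a sketch of what feeding a 4-channel frame can look like (assuming the
camera already delivers tightly packed RGBA bytes and the model expects a
`[1, H, W, 4]` float input; the helper name and the normalization to `[0, 1]`
are illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public final class RgbaInput {
  // Copies an RGBA frame straight into a float input buffer, with no
  // RGB -> RGBX repacking step.
  static ByteBuffer rgbaFrameToInput(byte[] rgbaBytes) {
    ByteBuffer input = ByteBuffer.allocateDirect(rgbaBytes.length * Float.BYTES)
        .order(ByteOrder.nativeOrder());
    for (byte b : rgbaBytes) {
      input.putFloat((b & 0xFF) / 255.0f);  // normalize each channel to [0, 1]
    }
    input.rewind();
    return input;
  }
}
```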

For best performance, do not hesitate to retrain your classifier with a
mobile-optimized network architecture. That is a significant part of
optimization for on-device inference.
237