# TensorFlow Lite GPU delegate

[TensorFlow Lite](https://www.tensorflow.org/lite) supports several hardware
accelerators. This document describes how to preview the experimental GPU
backend using the TensorFlow Lite delegate APIs on Android and iOS.

GPUs are designed to have high throughput for massively parallelizable
workloads. Thus, they are well-suited for deep neural nets, which consist of a
huge number of operators, each working on some input tensor(s) that can be
easily divided into smaller workloads and carried out in parallel, typically
resulting in lower latency. In the best case, GPU inference may now run fast
enough for real-time applications that were previously out of reach.

Unlike CPUs, GPUs compute with 16-bit or 32-bit floating point numbers and do
not require quantization for optimal performance.

Another benefit of GPU inference is its power efficiency. GPUs carry out
computations in a very efficient and optimized manner, consuming less power
and generating less heat than the same task run on a CPU.

## Demo App Tutorials

The easiest way to try out the experimental GPU delegate is to follow the
tutorials below, which walk through building our classification demo
applications with GPU support. The GPU code is only binary for now; it will be
open-sourced soon. Once you understand how to get our demos working, you can
try this out on your own custom models.

### Android (with Android Studio)

For a step-by-step tutorial, watch the
[Experimental GPU Delegate for Android](https://youtu.be/Xkhgre8r5G0) video.

Note: This requires OpenGL ES 3.1 or higher.

#### Step 1. Clone the TensorFlow source code and open it in Android Studio

```
git clone https://github.com/tensorflow/tensorflow
```

#### Step 2. Edit `app/build.gradle` to use the experimental GPU AAR

Replace the existing `tensorflow-lite` dependency in the `dependencies` block:

```
dependencies {
    ...
    // implementation 'org.tensorflow:tensorflow-lite:0.0.0-nightly'
    implementation 'org.tensorflow:tensorflow-lite:0.0.0-gpu-experimental'
}
```

#### Step 3. Build and run

Select Run → Run ‘app’. When you run the application, you will see a button
for enabling the GPU. Switch from the quantized model to a float model, then
tap GPU to run inference on the GPU.

![running android gpu demo and switch to gpu](images/android_gpu_demo.gif)

### iOS (with Xcode)

For a step-by-step tutorial, watch the
[Experimental GPU Delegate for iOS](https://youtu.be/a5H4Zwjp49c) video.

Note: This requires Xcode 10.1 or later.

#### Step 1. Get the demo source code and make sure it compiles

Follow our iOS Demo App [tutorial](https://www.tensorflow.org/lite/demo_ios).
This will get you to a point where the unmodified iOS camera demo is working
on your phone.

#### Step 2. Modify the Podfile to use the TensorFlow Lite GPU CocoaPod

We have built a binary CocoaPod that includes the GPU delegate. To switch the
project to use it, modify the
`tensorflow/tensorflow/lite/examples/ios/camera/Podfile` file to use
the `TensorFlowLiteGpuExperimental` pod instead of `TensorFlowLite`.

```
target 'YourProjectName'
  # pod 'TensorFlowLite', '1.12.0'
  pod 'TensorFlowLiteGpuExperimental'
```
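After editing the Podfile, reinstall the pods so the workspace picks up the
new dependency, and reopen the generated `.xcworkspace` rather than the
`.xcodeproj`. This is standard CocoaPods usage; the path below assumes the
camera demo location given above:

```
cd tensorflow/tensorflow/lite/examples/ios/camera
pod install
```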
#### Step 3. Enable the GPU Delegate

You will need to change two `#define` flags in `CameraExampleViewController.h`
to enable the GPU delegate. First, change `TFLITE_USE_CONTRIB_LITE` from 1 to
0, since TensorFlow Lite has moved from TensorFlow contrib into core.

```c
#define TFLITE_USE_CONTRIB_LITE 0
```

Next, change `TFLITE_USE_GPU_DELEGATE` from 0 to 1 to enable the code that
will use the GPU delegate.

```c
#define TFLITE_USE_GPU_DELEGATE 1
```

#### Step 4. Build and run the demo app

After following the previous step, you should be able to run the app.

#### Step 5. Release mode

Step 4 ran the app in debug mode. To get better performance, switch to a
release build with the appropriate optimal Metal settings. To edit these
settings, go to `Product > Scheme > Edit Scheme...`. Select `Run`. On the
`Info` tab, change `Build Configuration` from `Debug` to `Release` and uncheck
`Debug executable`.

![setting up release](images/iosdebug.png)

Then click the `Options` tab and change `GPU Frame Capture` to `Disabled` and
`Metal API Validation` to `Disabled`.

![setting up metal options](images/iosmetal.png)

Lastly, make sure Release builds only for 64-bit architectures. Under `Project
navigator -> tflite_camera_example -> PROJECT -> tflite_camera_example ->
Build Settings`, set `Build Active Architecture Only > Release` to `Yes`.

![setting up release options](images/iosrelease.png)

## Trying the GPU Delegate on your own model

### Android

Look at the demo to see how to add the delegate. In your application, add the
AAR as above, import the `org.tensorflow.lite.experimental.GpuDelegate` class,
and use the `addDelegate` function to register the GPU delegate with the
interpreter:

```java
import org.tensorflow.lite.Interpreter;
import org.tensorflow.lite.experimental.GpuDelegate;

// Initialize interpreter with GPU delegate
GpuDelegate delegate = new GpuDelegate();
Interpreter.Options options = (new Interpreter.Options()).addDelegate(delegate);
Interpreter interpreter = new Interpreter(model, options);

// Run inference
while (true) {
  writeToInput(input);
  interpreter.run(input, output);
  readFromOutput(output);
}

// Clean up
delegate.close();
```
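Not every device has a GPU driver the delegate can use, so it can be worth
falling back to plain CPU execution when creating or applying the delegate
fails. Below is a minimal sketch of that pattern, assuming the same `model`
buffer as above; the try/catch structure is illustrative, not a prescribed
API:

```java
import org.tensorflow.lite.Interpreter;
import org.tensorflow.lite.experimental.GpuDelegate;

// Prefer the GPU delegate, but fall back to the default CPU path if the
// delegate cannot be created or applied on this device.
Interpreter interpreter;
GpuDelegate delegate = null;
try {
  delegate = new GpuDelegate();
  Interpreter.Options options = (new Interpreter.Options()).addDelegate(delegate);
  interpreter = new Interpreter(model, options);
} catch (Exception e) {
  // GPU path unavailable; release the delegate and run on the CPU instead.
  if (delegate != null) {
    delegate.close();
    delegate = null;
  }
  interpreter = new Interpreter(model);
}
```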
### iOS

In your application code, include the GPU delegate header and call the
`Interpreter::ModifyGraphWithDelegate` function to register the GPU delegate
with the interpreter:

```cpp
#import "tensorflow/lite/delegates/gpu/metal_delegate.h"

// Initialize interpreter with GPU delegate
std::unique_ptr<Interpreter> interpreter;
InterpreterBuilder(*model, resolver)(&interpreter);
auto* delegate = NewGpuDelegate(nullptr);  // default config
if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) return false;

// Run inference
while (true) {
  WriteToInputTensor(interpreter->typed_input_tensor<float>(0));
  if (interpreter->Invoke() != kTfLiteOk) return false;
  ReadFromOutputTensor(interpreter->typed_output_tensor<float>(0));
}

// Clean up
interpreter = nullptr;
DeleteGpuDelegate(delegate);
```

## Supported Models and Ops

With the release of the GPU delegate, we included a handful of models that can
be run on the backend:

* [MobileNet v1 (224x224) image classification](https://ai.googleblog.com/2017/06/mobilenets-open-source-models-for.html) [[download]](https://storage.googleapis.com/download.tensorflow.org/models/tflite/gpu/mobilenet_v1_1.0_224.tflite)
<br /><i>(image classification model designed for mobile and embedded vision applications)</i>
* [DeepLab segmentation (257x257)](https://ai.googleblog.com/2018/03/semantic-image-segmentation-with.html) [[download]](https://storage.googleapis.com/download.tensorflow.org/models/tflite/gpu/deeplabv3_257_mv_gpu.tflite)
<br /><i>(image segmentation model that assigns semantic labels (e.g., dog, cat, car) to every pixel in the input image)</i>
* [MobileNet SSD object detection](https://ai.googleblog.com/2018/07/accelerated-training-and-inference-with.html) [[download]](https://storage.googleapis.com/download.tensorflow.org/models/tflite/gpu/mobile_ssd_v2_float_coco.tflite)
<br /><i>(object detection model that detects multiple objects with bounding boxes)</i>
* [PoseNet for pose estimation](https://github.com/tensorflow/tfjs-models/tree/master/posenet) [[download]](https://storage.googleapis.com/download.tensorflow.org/models/tflite/gpu/multi_person_mobilenet_v1_075_float.tflite)
<br /><i>(vision model that estimates the poses of one or more persons in an image or video)</i>

To see a full list of supported ops, please see the
[advanced documentation](gpu_advanced.md).

## Non-supported models and ops

If some of the ops are not supported by the GPU delegate, the framework will
run only part of the graph on the GPU and the remainder on the CPU. Due to the
high cost of CPU/GPU synchronization, such a split execution mode often
results in slower performance than running the whole network on the CPU
alone. In this case, the user will see a warning like:

```
WARNING: op code #42 cannot be handled by this delegate.
```

We did not provide a callback for this case, as it is not a true runtime
failure, but something the developer can observe while getting the network to
run on the delegate.

## Tips for optimization

Some operations that are trivial on the CPU may have a high cost on the GPU.
One class of such operations is the various forms of reshape, including
`BATCH_TO_SPACE`, `SPACE_TO_BATCH`, `SPACE_TO_DEPTH`, and so forth. If those
ops are in the network only for the network architect's logical convenience,
it is worth removing them for performance.

On the GPU, tensor data is sliced into 4-channel slices. Thus, a computation
on a tensor of shape `[B,H,W,5]` will perform about the same as on a tensor of
shape `[B,H,W,8]`, but significantly worse than on `[B,H,W,4]`; the sketch at
the end of this document makes the arithmetic concrete.

For the same reason, if the camera hardware supports image frames in RGBA,
feeding that 4-channel input is significantly faster, since a memory copy
(from 3-channel RGB to 4-channel RGBX) can be avoided.

For best performance, do not hesitate to retrain your classifier with a
mobile-optimized network architecture. That is a significant part of
optimization for on-device inference.
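To make the 4-channel slicing above concrete, here is a small illustrative
snippet (plain arithmetic, not a TensorFlow Lite API) showing how many slices
a given channel count occupies:

```java
public class ChannelSlices {
  // The GPU backend packs the channel dimension into 4-channel slices,
  // so per-tensor cost grows roughly with ceil(channels / 4).
  static int slicesFor(int channels) {
    return (channels + 3) / 4;
  }

  public static void main(String[] args) {
    System.out.println(slicesFor(4)); // 1: [B,H,W,4] fits in one slice
    System.out.println(slicesFor(5)); // 2: [B,H,W,5] pads up to two slices
    System.out.println(slicesFor(8)); // 2: same cost as [B,H,W,5]
  }
}
```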