# Target Aware Conversion (TAC)

Different hardware backends have different capabilities and restrictions.

TAC is designed to leverage these capabilities to:

* Perform device-specific optimizations (such as unsupported-op lowering,
  layout transformations, etc.)
* Partition the graph based on hardware cost models.
* Support general import/export: you can hook up your own importer/exporter
  from any format to MLIR and export MLIR to anything.

For more details, please check out the
[TAC workflow](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/compiler/mlir/lite/experimental/tac/README.md#tac-workflow)
section.

## How to use

Once you have a converted TFLite model ready, you can use the following command
to run TAC on your model:

```
bazel run -c opt //tensorflow/compiler/mlir/lite/experimental/tac:tac-translate -- <PATH_TO_YOUR_MODEL> -o=<OUTPUT_PATH> -device-specs=<HARDWARE_BACKENDS>
```

`-device-specs` is a comma-separated list of the names of the desired hardware
backends, e.g., "GPU,CPU".

If you're interested in which subgraphs are explored for the different
backends, you can pass `-output-mlir -inline-subgraphs=false` and check the
output MLIR file.

## How to add a hardware backend

If you want to add a hardware backend for TAC, you can start with the
`SimpleHardware` interface.

For example:

```
class FooHardware : public SimpleHardware {
 public:
  static constexpr char kId[] = "FOO";

  mlir::RewritePatternSet GetTransformations(
      MLIRContext* context) const override {
    mlir::RewritePatternSet patterns;
    // Pick the transformations that we want to perform.
    // We can add other transformations here as well.
    patterns.add<LowerPackIntoConcatReshape, UnrollSplit, UnrollSplitV,
                 PadSlice>(context);
    return patterns;
  }

  mlir::TypeID GetTypeId() const override {
    return mlir::TypeID::get<FooHardware>();
  }

  // We can specify which ops are not supported here.
  bool IsNotSupportedOp(mlir::Operation* op) const override { return false; }

  // How fast this hardware is compared to the CPU.
  // The larger the value, the better.
  float AdvantageOverCPU() const override { return 5.0; }
};
```

Then we need to register our hardware like below:

```
std::unique_ptr<TargetHardware> CreateFooHardware() {
  return std::make_unique<FooHardware>();
}

TargetHardwareRegistration<FooHardware> foo_hardware(
    "Target device for FOO", CreateFooHardware);
```

### Advanced user

For advanced users (e.g., if you already have your own hardware dialect
defined), just use `TargetHardware` directly. See the following code snippet
for reference.

```
class MyCustomHardware : public TargetHardware {
 public:
  static constexpr char kId[] = "MY_CUSTOM_HARDWARE";

  mlir::TypeID GetTypeId() const override {
    return mlir::TypeID::get<MyCustomHardware>();
  }

  bool IsOpSupported(mlir::Operation* op) const override {
    // Check whether the op is supported; if you have your own dialect,
    // this can be the target-dialect legalization process.
  }

  double GetHardwareSwitchingCost(const TargetHardware* from,
                                  size_t buffer_size) const override {
    // Get the hardware switching cost from the source hardware.
  }

  double GetOpCost(mlir::Operation* op) const override {
    // Call the customized cost model.
  }

  mlir::RewritePatternSet GetTransformations(
      MLIRContext* context) const override {
    // Customized transformation patterns: op lowering/fusion, layout
    // transformation, etc.
  }
};
```

## TAC workflow

The workflow of target-aware conversion is as follows:

1.  Try to break down the whole graph into several subgraphs based on the
    hardware backends' capabilities. See the diagram below: let's say our
    desired target backends are "GPU" and "CPU", and currently "C" is not
    supported on "GPU" but the rest of the ops are. So we will end up with 3
    subgraphs as shown in the diagram.

2.  Perform op lowering & target-specific optimizations for the different
    hardware backends. As shown in the diagram below, the red & yellow
    subgraphs will be duplicated as "alternative subgraph views" for "CPU".
    The "C" op can be lowered into "G" + "H" ops, which are supported by
    "GPU".

3.  Estimate the cost of each subgraph (and its alternative views) based on
    the hardware cost model. See the following diagram.

4.  Pick the proper subgraphs from the alternative views for execution based
    on the costs (computation costs, transfer costs, quant/dequant costs). As
    shown in the diagram below, since the cross-device data transfer cost is
    high, we will still pick the "G" + "H" subgraph even if "G" + "H" running
    on "GPU" may be less efficient than "C" running on "CPU".

The final graph looks like below:

## TAC components

### Hardwares

Hardwares are used to model target-device capabilities as well as op costs for
the target devices.

We have already modeled `cpu_hardware` & `gpu_hardware` as well as the
`nnapi_hardware`.

### Passes

#### Target Annotation Pass

In this pass, every op will be annotated with the user-specified targets based
on the device capabilities. For example, if the user-specified desired targets
are "GPU" and "CPU": `conv2d` can run on both "GPU" and "CPU", so we will
annotate it with "GPU" since that is preferred; `pack` can only run on "CPU",
so we will annotate it with "CPU" since "GPU" does not support this op.
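To make this concrete, here is a minimal sketch (not the actual pass
implementation) of the per-op decision; the helper `AnnotateOp` and the ordered
`device_specs` pair list are illustrative assumptions, while `IsOpSupported`
and the `tac.device` attribute come from the interface and examples in this
document:

```
// A minimal sketch, for illustration only: annotate `op` with the first
// user-requested hardware (ordered by preference, e.g. {"GPU", "CPU"}) that
// supports it. `AnnotateOp` and the name/hardware pair list are hypothetical.
void AnnotateOp(
    mlir::Operation* op,
    llvm::ArrayRef<std::pair<llvm::StringRef, const TargetHardware*>>
        device_specs,
    mlir::OpBuilder& builder) {
  for (const auto& [name, hardware] : device_specs) {
    if (hardware->IsOpSupported(op)) {
      // Record the chosen target on the op itself; later passes group ops by
      // this attribute.
      op->setAttr("tac.device", builder.getStringAttr(name));
      return;
    }
  }
}
```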
#### Raise Target Subgraphs Pass

In this pass, ops will be broken down into subgraphs: ops that share the same
target annotation will be raised into subgraphs.

In this pass, a subgraph is actually implemented as a `FuncOp`.

Take the following code as an example:

```
func @simpleTest(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>, %arg2: tensor<1xf32>, %arg3: tensor<1xf32>) -> tensor<2x1xf32> {
  %0 = "tfl.add"(%arg0, %arg1) {tac.device = "GPU", fused_activation_function = "RELU6", tac.inference_type = "FLOAT"} : (tensor<1xf32>, tensor<1xf32>) -> tensor<1xf32>
  %1 = "tfl.mul"(%0, %arg2) {tac.device = "GPU", fused_activation_function = "RELU6", tac.inference_type = "FLOAT"} : (tensor<1xf32>, tensor<1xf32>) -> tensor<1xf32>
  %2 = "tfl.add"(%arg0, %arg3) {tac.device = "GPU", fused_activation_function = "RELU6", tac.inference_type = "FLOAT"} : (tensor<1xf32>, tensor<1xf32>) -> tensor<1xf32>
  %3 = "tfl.pack"(%1, %2) {tac.device = "CPU", tac.inference_type = "FLOAT", axis = 0 : i32, values_count = 2 : i32} : (tensor<1xf32>, tensor<1xf32>) -> tensor<2x1xf32>
  return %3 : tensor<2x1xf32>
}
```

In this code, `%3` is annotated with "CPU" while the other ops are annotated
with "GPU", so `%3` will be raised into a separate function like below:

```
 func private @func_2_CPU_FLOAT(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<2x1xf32> attributes {tac.device = "CPU", tac.inference_type = "FLOAT", tac.interface_name = "func_2"} {
    %0 = "tfl.pack"(%arg0, %arg1) {axis = 0 : i32, tac.device = "CPU", tac.inference_type = "FLOAT", values_count = 2 : i32} : (tensor<1xf32>, tensor<1xf32>) -> tensor<2x1xf32>
    return %0 : tensor<2x1xf32>
  }
```

And the rest of the ops will be raised as below:

```
func private @func_0_GPU_FLOAT(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>, %arg2: tensor<1xf32>) -> tensor<1xf32> attributes {tac.device = "GPU", tac.inference_type = "FLOAT", tac.interface_name = "func_0"} {
    %0 = tfl.add %arg0, %arg1 {fused_activation_function = "RELU6", tac.device = "GPU", tac.inference_type = "FLOAT"} : tensor<1xf32>
    %1 = tfl.mul %0, %arg2 {fused_activation_function = "RELU6", tac.device = "GPU", tac.inference_type = "FLOAT"} : tensor<1xf32>
    return %1 : tensor<1xf32>
  }

 func private @func_1_GPU_FLOAT(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<1xf32> attributes {tac.device = "GPU", tac.inference_type = "FLOAT", tac.interface_name = "func_1"} {
    %0 = tfl.add %arg0, %arg1 {fused_activation_function = "RELU6", tac.device = "GPU", tac.inference_type = "FLOAT"} : tensor<1xf32>
    return %0 : tensor<1xf32>
  }
```

And the original function will be replaced by `CallOps` to those `FuncOps`:

```
func @simpleTest(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>, %arg2: tensor<1xf32>, %arg3: tensor<1xf32>) -> tensor<2x1xf32> {
  %0 = call @func_0_GPU_FLOAT(%arg0, %arg1, %arg2) {tac.device = "GPU", tac.inference_type = "FLOAT", tac.interface_name = "func_0"} : (tensor<1xf32>, tensor<1xf32>, tensor<1xf32>) -> tensor<1xf32>
  %1 = call @func_1_GPU_FLOAT(%arg0, %arg3) {tac.device = "GPU", tac.inference_type = "FLOAT", tac.interface_name = "func_1"} : (tensor<1xf32>, tensor<1xf32>) -> tensor<1xf32>
  %2 = call @func_2_CPU_FLOAT(%0, %1) {tac.device = "CPU", tac.inference_type = "FLOAT", tac.interface_name = "func_2"} : (tensor<1xf32>, tensor<1xf32>) -> tensor<2x1xf32>
  return %2 : tensor<2x1xf32>
}
```

Why do we need to raise these ops into `FuncOp`s? Please see the following
section.
226 227#### Get Alternative Subgraph View Pass 228In the Get Alternative Subgraph View Pass, we will essentially duplicate those 229`FuncOps` and perform unsupported ops lowering & target-specific optimization. 230 231For example, `Pack` is not supported by "GPU", but it can be lowered into 232`Concat` + `Reshape` which can be supported by "GPU". 233 234So the original example: 235 236``` 237 func private @func_1_GPU_FLOAT(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<1xf32> attributes {tac.device = "GPU", tac.inference_type = "FLOAT", tac.interface_name = "func_1"} { 238 %0 = tfl.add %arg0, %arg1 {fused_activation_function = "RELU6", tac.device = "GPU", tac.inference_type = "FLOAT"} : tensor<1xf32> 239 return %0 : tensor<1xf32> 240 } 241``` 242 243Will be transformed into: 244 245``` 246 func private @func_2_CPU_FLOAT(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<2x1xf32> attributes {tac.device = "CPU", tac.inference_type = "FLOAT", tac.interface_name = "func_2"} { 247 %0 = "tfl.pack"(%arg0, %arg1) {axis = 0 : i32, tac.device = "CPU", tac.inference_type = "FLOAT", values_count = 2 : i32} : (tensor<1xf32>, tensor<1xf32>) -> tensor<2x1xf32> 248 return %0 : tensor<2x1xf32> 249 } 250 251func private @func_2_GPU_FLOAT(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<2x1xf32> attributes {tac.device = "GPU", tac.inference_type = "FLOAT", tac.interface_name = "func_2"} { 252 %cst = arith.constant dense<1> : tensor<4xi32> 253 %cst_0 = arith.constant dense<2> : tensor<1xi32> 254 %cst_1 = arith.constant dense<[2, 1]> : tensor<2xi32> 255 %0 = "tfl.reshape"(%arg0, %cst) {tac.device = "GPU", tac.inference_type = "FLOAT"} : (tensor<1xf32>, tensor<4xi32>) -> tensor<1x1x1x1xf32> 256 %1 = "tfl.reshape"(%arg1, %cst) {tac.device = "GPU", tac.inference_type = "FLOAT"} : (tensor<1xf32>, tensor<4xi32>) -> tensor<1x1x1x1xf32> 257 %2 = "tfl.concatenation"(%0, %1) {axis = 3 : i32, fused_activation_function = "NONE", tac.device = "GPU", tac.inference_type = "FLOAT"} : (tensor<1x1x1x1xf32>, tensor<1x1x1x1xf32>) -> tensor<1x1x1x2xf32> 258 %3 = "tfl.reshape"(%2, %cst_0) {tac.device = "GPU", tac.inference_type = "FLOAT"} : (tensor<1x1x1x2xf32>, tensor<1xi32>) -> tensor<2xf32> 259 %4 = "tfl.reshape"(%3, %cst_1) {tac.device = "GPU", tac.inference_type = "FLOAT"} : (tensor<2xf32>, tensor<2xi32>) -> tensor<2x1xf32> 260 return %4 : tensor<2x1xf32> 261 } 262``` 263 264#### Compute Costs Pass 265In the compute cost pass, we will essentially compute the cost of each op within 266the `FuncOp` based on the target-device cost model and sum them together. 267 268 269#### Pick Subgraphs Pass 270In the pick subgraphs pass, we will pick those subgraphs which can minimize the 271global costs (we will take the tensor transferring costs as well). 272