# Target Aware Conversion (TAC)

Different hardware has different capabilities and restrictions.

TAC is designed to leverage hardware capabilities to:

*   Perform device-specific optimizations (such as unsupported ops lowering,
    layout transformations, etc.).
*   Partition the graph based on hardware cost models.
*   Support general import/export, so you can hook up your own
    importer/exporter from any format to MLIR and export MLIR to anything.

For more details, please check out the
[TAC workflow](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/compiler/mlir/lite/experimental/tac/README.md#tac-workflow)
section.

## How to use

Once you have a converted TFLite model ready, you can use the following command
to optimize your model with TAC:

```
bazel run -c opt //tensorflow/compiler/mlir/lite/experimental/tac:tac-translate -- <PATH_TO_YOUR_MODEL> -o=<OUTPUT_PATH> -device-specs=<HARDWARE_BACKENDS>
```

`-device-specs` is a comma-separated list of the names of the desired hardware
backends, e.g., "GPU,CPU".

If you're interested in which subgraphs are explored for the different
backends, pass `-output-mlir -inline-subgraphs=false` and inspect the output
MLIR file.

## How to add a hardware backend

If you want to add a hardware backend for TAC, you can start with the
`SimpleHardware` interface.

For example:

```
class FooHardware : public SimpleHardware {
 public:
  static constexpr char kId[] = "FOO";

  mlir::RewritePatternSet GetTransformations(
      MLIRContext* context) const override {
    mlir::RewritePatternSet patterns(context);
    // Pick the transformations that we want to perform;
    // other transformations can be added here as well.
    patterns.add<LowerPackIntoConcatReshape, UnrollSplit, UnrollSplitV,
                 PadSlice>(context);
    return patterns;
  }

  mlir::TypeID GetTypeId() const override {
    return mlir::TypeID::get<FooHardware>();
  }

  // We can specify which ops are not supported here.
  bool IsNotSupportedOp(mlir::Operation* op) const override { return false; }

  // How fast this hardware is compared to the CPU.
  // The larger the value, the better.
  float AdvantageOverCPU() const override { return 5.0; }
};
```

Then we need to register our hardware as follows:

```
std::unique_ptr<TargetHardware> CreateFooHardware() {
  return std::make_unique<FooHardware>();
}

TargetHardwareRegistration<FooHardware> foo_hardware(
    "Target device for FOO", CreateFooHardware);
```

### Advanced user

For advanced users (e.g., if you already have your own hardware dialect
defined), just use `TargetHardware` directly. See the following code snippet
for reference.

```
class MyCustomHardware : public TargetHardware {
 public:
  static constexpr char kId[] = "MY_CUSTOM_HARDWARE";

  mlir::TypeID GetTypeId() const override {
    return mlir::TypeID::get<MyCustomHardware>();
  }

  bool IsOpSupported(mlir::Operation* op) const override {
    // Check whether the op is supported. If you have your own dialect,
    // this can be the target-dialect legalization process.
  }

  double GetHardwareSwitchingCost(const TargetHardware* from,
                                  size_t buffer_size) const override {
    // Get the hardware switching cost from the source hardware.
  }

  double GetOpCost(mlir::Operation* op) const override {
    // Call the customized cost model.
  }

  mlir::RewritePatternSet GetTransformations(
      MLIRContext* context) const override {
    // Customized transformation patterns: ops lowering/fusion, layout
    // transformation, etc.
  }
};
```

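As a concrete illustration of the switching-cost hook, here is a minimal,
standalone sketch of one way such a cost could be modeled. The formula and
constants are purely hypothetical and are not TAC's actual cost model:

```
#include <cstddef>

// Hypothetical model: a fixed per-switch overhead plus a per-byte transfer
// term for the buffer that has to cross the device boundary. Both constants
// are made up for this example.
double ExampleSwitchingCost(size_t buffer_size_bytes) {
  constexpr double kFixedOverhead = 10.0;  // hypothetical fixed cost
  constexpr double kCostPerByte = 0.005;   // hypothetical per-byte cost
  return kFixedOverhead + kCostPerByte * static_cast<double>(buffer_size_bytes);
}
```

The important property is that the switching cost grows with the amount of
data crossing the device boundary, which is what later makes the
pick-subgraphs step reluctant to switch devices for large tensors.
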
## TAC workflow

The workflow of target-aware conversion is as follows:

1.  Try to break the whole graph down into several subgraphs based on the
    hardware capabilities. In the diagram below, say our desired target
    backends are "GPU" and "CPU", and currently "C" is not supported on "GPU"
    while the rest of the ops are. We end up with 3 subgraphs, as shown in the
    diagram.

    ![Target Annotation](g3doc/images/target_annotation.png)

2.  Perform ops lowering & target-specific optimizations for the different
    hardware backends. As shown in the diagram below, the red & yellow
    subgraphs are duplicated as an "alternative subgraph view" for "CPU", and
    the "C" op can be lowered into "G" + "H" ops, which are supported by "GPU".

    ![Target Optimization](g3doc/images/target_optimization.png)

3.  Estimate the costs for each subgraph (and its alternative views) based on
    the hardware cost model. See the following diagram.

    ![Estimate costs](g3doc/images/compute_cost.png)

4.  Pick the proper subgraphs from the alternative views for execution based
    on costs (computation costs, transfer costs, quant/dequant costs). As
    shown in the diagram below, since the cross-device data transfer cost is
    high, even if "G" + "H" running on "GPU" is less efficient than "C"
    running on "CPU", we still pick the "G" + "H" subgraph to avoid the
    transfer.

    ![Pick subgraphs](g3doc/images/pick_subgraphs.png)

The final graph looks like this:

![Final graph](g3doc/images/final_graph.png)

## TAC components

### Hardwares

Hardwares are used to model target-device capabilities as well as op costs on
those devices.

We have already modeled `cpu_hardware` & `gpu_hardware` as well as
`nnapi_hardware`.

### Passes

#### Target Annotation Pass
In this pass, every op is annotated with one of the user-specified targets
based on the device capabilities. For example, if the user-specified targets
are "GPU" and "CPU", and `conv2d` can run on both "GPU" and "CPU", we annotate
the op `conv2d` with "GPU" since it's preferred; `pack` can only run on "CPU",
so we annotate the op with "CPU" since "GPU" does not support it.

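The selection rule described above can be summarized with a small sketch.
`IsOpSupportedOn` is a hypothetical stand-in for the per-hardware support
query, and the real pass works on TAC's hardware classes rather than plain
strings:

```
#include <optional>
#include <string>
#include <vector>

#include "mlir/IR/Operation.h"

// Hypothetical helper standing in for the per-hardware "is this op supported"
// query (e.g. backed by IsOpSupported / IsNotSupportedOp on the hardware).
bool IsOpSupportedOn(mlir::Operation* op, const std::string& device);

// Annotate an op with the first device (in preference order, e.g.
// {"GPU", "CPU"}) that supports it.
std::optional<std::string> PickDevice(mlir::Operation* op,
                                      const std::vector<std::string>& devices) {
  for (const std::string& device : devices) {
    if (IsOpSupportedOn(op, device)) return device;
  }
  return std::nullopt;  // Supported nowhere; needs lowering or fails.
}
```
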
#### Raise Target Subgraphs Pass

In this pass, ops are broken down into subgraphs: ops that have the same
target annotation are raised together as a subgraph.

In this pass, a subgraph is actually implemented as a `FuncOp`.

Take the following code as an example:

```
func @simpleTest(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>, %arg2: tensor<1xf32>, %arg3: tensor<1xf32>) -> tensor<2x1xf32> {
  %0 = "tfl.add"(%arg0, %arg1) {tac.device = "GPU", fused_activation_function = "RELU6", tac.inference_type = "FLOAT"} : (tensor<1xf32>, tensor<1xf32>) -> tensor<1xf32>
  %1 = "tfl.mul"(%0, %arg2) {tac.device = "GPU", fused_activation_function = "RELU6", tac.inference_type = "FLOAT"} : (tensor<1xf32>, tensor<1xf32>) -> tensor<1xf32>
  %2 = "tfl.add"(%arg0, %arg3) {tac.device = "GPU", fused_activation_function = "RELU6", tac.inference_type = "FLOAT"} : (tensor<1xf32>, tensor<1xf32>) -> tensor<1xf32>
  %3 = "tfl.pack"(%1, %2) {tac.device = "CPU", tac.inference_type = "FLOAT", axis = 0 : i32, values_count = 2 : i32} : (tensor<1xf32>, tensor<1xf32>) -> tensor<2x1xf32>
  return %3 : tensor<2x1xf32>
}
```

In this code, `%3` is annotated with "CPU" while the others are annotated with
"GPU", so the ops will be raised into separate "GPU" and "CPU" functions. For
example, the `tfl.add` that defines `%2` is raised into its own "GPU" function:

```
 func private @func_1_GPU_FLOAT(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<1xf32> attributes {tac.device = "GPU", tac.inference_type = "FLOAT", tac.interface_name = "func_1"} {
    %0 = tfl.add %arg0, %arg1 {fused_activation_function = "RELU6", tac.device = "GPU", tac.inference_type = "FLOAT"} : tensor<1xf32>
    return %0 : tensor<1xf32>
  }
```

And the rest of the ops will be raised as below:

```
 func private @func_2_CPU_FLOAT(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<2x1xf32> attributes {tac.device = "CPU", tac.inference_type = "FLOAT", tac.interface_name = "func_2"} {
    %0 = "tfl.pack"(%arg0, %arg1) {axis = 0 : i32, tac.device = "CPU", tac.inference_type = "FLOAT", values_count = 2 : i32} : (tensor<1xf32>, tensor<1xf32>) -> tensor<2x1xf32>
    return %0 : tensor<2x1xf32>
  }

func private @func_0_GPU_FLOAT(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>, %arg2: tensor<1xf32>) -> tensor<1xf32> attributes {tac.device = "GPU", tac.inference_type = "FLOAT", tac.interface_name = "func_0"} {
    %0 = tfl.add %arg0, %arg1 {fused_activation_function = "RELU6", tac.device = "GPU", tac.inference_type = "FLOAT"} : tensor<1xf32>
    %1 = tfl.mul %0, %arg2 {fused_activation_function = "RELU6", tac.device = "GPU", tac.inference_type = "FLOAT"} : tensor<1xf32>
    return %1 : tensor<1xf32>
  }
```

And the original function will be replaced by `CallOp`s to those `FuncOp`s:

```
func @simpleTest(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>, %arg2: tensor<1xf32>, %arg3: tensor<1xf32>) -> tensor<2x1xf32> {
    %0 = call @func_0_GPU_FLOAT(%arg0, %arg1, %arg2) {tac.device = "GPU", tac.inference_type = "FLOAT", tac.interface_name = "func_0"} : (tensor<1xf32>, tensor<1xf32>, tensor<1xf32>) -> tensor<1xf32>
    %1 = call @func_1_GPU_FLOAT(%arg0, %arg3) {tac.device = "GPU", tac.inference_type = "FLOAT", tac.interface_name = "func_1"} : (tensor<1xf32>, tensor<1xf32>) -> tensor<1xf32>
    %2 = call @func_2_CPU_FLOAT(%0, %1) {tac.device = "CPU", tac.inference_type = "FLOAT", tac.interface_name = "func_2"} : (tensor<1xf32>, tensor<1xf32>) -> tensor<2x1xf32>
    return %2 : tensor<2x1xf32>
  }
```

Why do we need to raise those ops into `FuncOp`s? Please see the following
section.

#### Get Alternative Subgraph View Pass
In the Get Alternative Subgraph View Pass, we will essentially duplicate those
`FuncOp`s and perform unsupported-ops lowering & target-specific optimizations.

For example, `Pack` is not supported by "GPU", but it can be lowered into
`Concat` + `Reshape`, which are supported by "GPU".

So the original "CPU" subgraph (`func_2_CPU_FLOAT`) from the example above:

```
 func private @func_2_CPU_FLOAT(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<2x1xf32> attributes {tac.device = "CPU", tac.inference_type = "FLOAT", tac.interface_name = "func_2"} {
    %0 = "tfl.pack"(%arg0, %arg1) {axis = 0 : i32, tac.device = "CPU", tac.inference_type = "FLOAT", values_count = 2 : i32} : (tensor<1xf32>, tensor<1xf32>) -> tensor<2x1xf32>
    return %0 : tensor<2x1xf32>
  }
```

Will be duplicated and transformed into:

```
 func private @func_2_CPU_FLOAT(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<2x1xf32> attributes {tac.device = "CPU", tac.inference_type = "FLOAT", tac.interface_name = "func_2"} {
    %0 = "tfl.pack"(%arg0, %arg1) {axis = 0 : i32, tac.device = "CPU", tac.inference_type = "FLOAT", values_count = 2 : i32} : (tensor<1xf32>, tensor<1xf32>) -> tensor<2x1xf32>
    return %0 : tensor<2x1xf32>
  }

func private @func_2_GPU_FLOAT(%arg0: tensor<1xf32>, %arg1: tensor<1xf32>) -> tensor<2x1xf32> attributes {tac.device = "GPU", tac.inference_type = "FLOAT", tac.interface_name = "func_2"} {
    %cst = arith.constant dense<1> : tensor<4xi32>
    %cst_0 = arith.constant dense<2> : tensor<1xi32>
    %cst_1 = arith.constant dense<[2, 1]> : tensor<2xi32>
    %0 = "tfl.reshape"(%arg0, %cst) {tac.device = "GPU", tac.inference_type = "FLOAT"} : (tensor<1xf32>, tensor<4xi32>) -> tensor<1x1x1x1xf32>
    %1 = "tfl.reshape"(%arg1, %cst) {tac.device = "GPU", tac.inference_type = "FLOAT"} : (tensor<1xf32>, tensor<4xi32>) -> tensor<1x1x1x1xf32>
    %2 = "tfl.concatenation"(%0, %1) {axis = 3 : i32, fused_activation_function = "NONE", tac.device = "GPU", tac.inference_type = "FLOAT"} : (tensor<1x1x1x1xf32>, tensor<1x1x1x1xf32>) -> tensor<1x1x1x2xf32>
    %3 = "tfl.reshape"(%2, %cst_0) {tac.device = "GPU", tac.inference_type = "FLOAT"} : (tensor<1x1x1x2xf32>, tensor<1xi32>) -> tensor<2xf32>
    %4 = "tfl.reshape"(%3, %cst_1) {tac.device = "GPU", tac.inference_type = "FLOAT"} : (tensor<2xf32>, tensor<2xi32>) -> tensor<2x1xf32>
    return %4 : tensor<2x1xf32>
  }
```

#### Compute Costs Pass
In the compute costs pass, we will essentially compute the cost of each op
within the `FuncOp` based on the target-device cost model and sum the costs
together.

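A minimal sketch of the idea (not the actual pass implementation): walk every
op nested inside the raised `FuncOp` and accumulate the per-op cost reported
by the target hardware's cost model, here passed in as a callback to keep the
example self-contained:

```
#include <functional>

#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/IR/Operation.h"

// Sum the per-op costs of all ops nested inside the subgraph's FuncOp.
// `op_cost` stands in for something like TargetHardware::GetOpCost.
// Note: this simplified walk also visits the FuncOp and its terminator;
// a real implementation would filter those out.
double SubgraphCost(mlir::func::FuncOp func,
                    const std::function<double(mlir::Operation*)>& op_cost) {
  double total = 0.0;
  func.walk([&](mlir::Operation* op) { total += op_cost(op); });
  return total;
}
```
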
#### Pick Subgraphs Pass
In the pick subgraphs pass, we will pick the subgraphs that minimize the
global cost (taking the tensor transfer costs into account as well).

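At a high level, the quantity being minimized can be sketched as the sum of
the chosen per-subgraph costs plus a transfer cost for every edge whose
producer and consumer end up on different devices. Everything below is
illustrative and hypothetical rather than the pass's real data structures:

```
#include <string>
#include <utility>
#include <vector>

// One chosen alternative view per subgraph (hypothetical representation).
struct Choice {
  std::string device;  // e.g. "GPU" or "CPU"
  double op_cost;      // cost of the chosen view, e.g. from a cost model
};

// Global cost of a plan: subgraph costs plus a transfer penalty on every
// subgraph-to-subgraph edge that crosses a device boundary.
double GlobalCost(const std::vector<Choice>& plan,
                  const std::vector<std::pair<int, int>>& edges,
                  double transfer_cost) {
  double total = 0.0;
  for (const Choice& choice : plan) total += choice.op_cost;
  for (const auto& edge : edges) {
    if (plan[edge.first].device != plan[edge.second].device) {
      total += transfer_cost;
    }
  }
  return total;
}
```

The pass picks the combination of views that minimizes this kind of global
objective, which is why a locally slower view can still win when it avoids an
expensive device switch (as in workflow step 4 above).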
272