# Chapter 5: Partial Lowering to Lower-Level Dialects for Optimization

[TOC]

At this point, we are eager to generate actual code and see our Toy language
take life. We will use LLVM to generate code, but just showing the LLVM builder
interface here wouldn't be very exciting. Instead, we will show how to perform
progressive lowering through a mix of dialects coexisting in the same function.

To make it more interesting, in this chapter we will consider that we want to
reuse existing optimizations implemented in a dialect optimizing affine
transformations: `Affine`. This dialect is tailored to the computation-heavy
part of the program and is limited: it doesn't support representing our
`toy.print` builtin, for instance, nor should it! Instead, we can target
`Affine` for the computation-heavy part of Toy and, in the
[next chapter](Ch-6.md), directly target the `LLVM IR` dialect for lowering
`print`. As part of this lowering, we will be lowering from the
[TensorType](../../LangRef.md#tensor-type) that `Toy` operates on to the
[MemRefType](../../LangRef.md#memref-type) that is indexed via an affine
loop nest. Tensors represent an abstract value-typed sequence of data, meaning
that they don't live in any memory. MemRefs, on the other hand, represent
lower-level buffer access, as they are concrete references to a region of
memory.

## Dialect Conversions

MLIR has many different dialects, so it is important to have a unified
framework for [converting](../../../getting_started/Glossary.md#conversion)
between them. This is where the `DialectConversion` framework comes into play.
This framework allows for transforming a set of *illegal* operations to a set
of *legal* ones.
To use this framework, we need to provide two things (and an optional third):

*   A [Conversion Target](../../DialectConversion.md#conversion-target)

    -   This is the formal specification of what operations or dialects are
        legal for the conversion. Operations that aren't legal will require
        rewrite patterns to perform
        [legalization](../../../getting_started/Glossary.md#legalization).

*   A set of
    [Rewrite Patterns](../../DialectConversion.md#rewrite-pattern-specification)

    -   This is the set of [patterns](../QuickstartRewrites.md) used to
        convert *illegal* operations into a set of zero or more *legal* ones.

*   Optionally, a [Type Converter](../../DialectConversion.md#type-conversion).

    -   If provided, this is used to convert the types of block arguments. We
        won't be needing this for our conversion.

## Conversion Target

For our purposes, we want to convert the compute-intensive `Toy` operations
into a combination of operations from the `Affine` and `Standard` dialects for
further optimization. To start off the lowering, we first define our
conversion target:

```c++
void ToyToAffineLoweringPass::runOnFunction() {
  // The first thing to define is the conversion target. This will define the
  // final target for this lowering.
  mlir::ConversionTarget target(getContext());

  // We define the specific operations, or dialects, that are legal targets for
  // this lowering. In our case, we are lowering to a combination of the
  // `Affine` and `Standard` dialects.
  target.addLegalDialect<mlir::AffineDialect, mlir::StandardOpsDialect>();

  // We also define the Toy dialect as illegal so that the conversion will fail
  // if any of these operations are *not* converted. Given that we actually
  // want a partial lowering, we explicitly mark the Toy operations that we
  // don't want to lower, `toy.print`, as *legal*.
  target.addIllegalDialect<ToyDialect>();
  target.addLegalOp<PrintOp>();
  ...
}
```

Above, we first set the Toy dialect to illegal, and then the print operation as
legal. We could have done this the other way around: individual operations
always take precedence over the (more generic) dialect definitions, so the
order doesn't matter. See `ConversionTarget::getOpInfo` for the details.

## Conversion Patterns

After the conversion target has been defined, we can define how to convert the
*illegal* operations into *legal* ones. Similarly to the canonicalization
framework introduced in [chapter 3](Ch-3.md), the
[`DialectConversion` framework](../../DialectConversion.md) also uses
[RewritePatterns](../QuickstartRewrites.md) to perform the conversion logic.
These patterns may be the `RewritePatterns` seen before, or a new type of
pattern specific to the conversion framework: `ConversionPattern`.
`ConversionPatterns` are different from traditional `RewritePatterns` in that
they accept an additional `operands` parameter containing the operands that
have been remapped/replaced. This is used when dealing with type conversions,
as the pattern will want to operate on values of the new type but match
against the old. For our lowering, this invariant will be useful as it
translates from the [TensorType](../../LangRef.md#tensor-type) currently being
operated on to the [MemRefType](../../LangRef.md#memref-type). Let's look at a
snippet of lowering the `toy.transpose` operation:

```c++
/// Lower the `toy.transpose` operation to an affine loop nest.
struct TransposeOpLowering : public mlir::ConversionPattern {
  TransposeOpLowering(mlir::MLIRContext *ctx)
      : mlir::ConversionPattern(TransposeOp::getOperationName(), 1, ctx) {}

  /// Match and rewrite the given `toy.transpose` operation, with the given
  /// operands that have been remapped from `tensor<...>` to `memref<...>`.
  mlir::LogicalResult
  matchAndRewrite(mlir::Operation *op, ArrayRef<mlir::Value> operands,
                  mlir::ConversionPatternRewriter &rewriter) const final {
    auto loc = op->getLoc();

    // Call to a helper function that will lower the current operation to a
    // set of affine loops. We provide a functor that operates on the remapped
    // operands, as well as the loop induction variables for the innermost
    // loop body.
    lowerOpToLoops(
        op, operands, rewriter,
        [loc](mlir::PatternRewriter &rewriter,
              ArrayRef<mlir::Value> memRefOperands,
              ArrayRef<mlir::Value> loopIvs) {
          // Generate an adaptor for the remapped operands of the TransposeOp.
          // This allows for using the nice named accessors that are generated
          // by the ODS. This adaptor is automatically provided by the ODS
          // framework.
          TransposeOpAdaptor transposeAdaptor(memRefOperands);
          mlir::Value input = transposeAdaptor.input();

          // Transpose the elements by generating a load from the reverse
          // indices.
          SmallVector<mlir::Value, 2> reverseIvs(llvm::reverse(loopIvs));
          return rewriter.create<mlir::AffineLoadOp>(loc, input, reverseIvs);
        });
    return success();
  }
};
```

Now we can prepare the list of patterns to use during the lowering process:

```c++
void ToyToAffineLoweringPass::runOnFunction() {
  ...

  // Now that the conversion target has been defined, we just need to provide
  // the set of patterns that will lower the Toy operations.
  mlir::OwningRewritePatternList patterns;
  patterns.insert<..., TransposeOpLowering>(&getContext());

  ...
```

## Partial Lowering

Once the patterns have been defined, we can perform the actual lowering. The
`DialectConversion` framework provides several different modes of lowering,
but for our purposes we will perform a partial lowering, as we will not
convert `toy.print` at this time.
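Before wiring this up, it may help to see the legality mechanics in miniature. The following is a self-contained C++ sketch, not MLIR code (the function and names here are hypothetical stand-ins): patterns rewrite whatever they can, and the conversion fails only if an *illegal* operation survives, which is exactly why marking `toy.print` legal lets it pass through untouched.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a partial conversion driver: ops whose names
// appear in `legal` may survive untouched; every other op must be rewritten
// by a pattern from `patterns` (mapping illegal name -> legal name).
bool partialConversionSketch(std::vector<std::string> &ops,
                             const std::map<std::string, std::string> &patterns,
                             const std::vector<std::string> &legal) {
  // Run each pattern: an illegal op with a matching pattern is rewritten.
  for (std::string &op : ops) {
    auto it = patterns.find(op);
    if (it != patterns.end())
      op = it->second;
  }
  // The conversion succeeds only if no illegal operation remains.
  return std::all_of(ops.begin(), ops.end(), [&](const std::string &op) {
    return std::find(legal.begin(), legal.end(), op) != legal.end();
  });
}
```

For example, with a pattern rewriting `toy.transpose` to an affine loop and `toy.print` in the legal set, the sketch succeeds even though `toy.print` is never lowered; drop `toy.print` from the legal set without adding a pattern for it and the conversion fails, mirroring the pass signaling failure.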

```c++
void ToyToAffineLoweringPass::runOnFunction() {
  ...

  // With the target and rewrite patterns defined, we can now attempt the
  // conversion. The conversion will signal failure if any of our *illegal*
  // operations were not converted successfully.
  auto function = getFunction();
  if (mlir::failed(mlir::applyPartialConversion(function, target, patterns)))
    signalPassFailure();
}
```

### Design Considerations With Partial Lowering

Before diving into the result of our lowering, this is a good time to discuss
potential design considerations when it comes to partial lowering. In our
lowering, we transform from a value type, TensorType, to an allocated
(buffer-like) type, MemRefType. However, given that we do not lower the
`toy.print` operation, we need to temporarily bridge these two worlds. There
are many ways to go about this, each with their own tradeoffs:

*   Generate `load` operations from the buffer

    One option is to generate `load` operations from the buffer type to
    materialize an instance of the value type. This allows for the definition
    of the `toy.print` operation to remain unchanged. The downside to this
    approach is that the optimizations on the `affine` dialect are limited,
    because the `load` will actually involve a full copy that is only visible
    *after* our optimizations have been performed.

*   Generate a new version of `toy.print` that operates on the lowered type

    Another option would be to have another, lowered, variant of `toy.print`
    that operates on the lowered type. The benefit of this option is that
    there is no hidden, unnecessary copy to the optimizer. The downside is
    that another operation definition is needed that may duplicate many
    aspects of the first. Defining a base class in
    [ODS](../../OpDefinitions.md) may simplify this, but you still need to
    treat these operations separately.

*   Update `toy.print` to allow for operating on the lowered type

    A third option is to update the current definition of `toy.print` to allow
    for operating on the lowered type. The benefit of this approach is that it
    is simple, does not introduce an additional hidden copy, and does not
    require another operation definition. The downside to this option is that
    it requires mixing abstraction levels in the `Toy` dialect.

For the sake of simplicity, we will use the third option for this lowering.
This involves updating the type constraints on the PrintOp in the operation
definition file:

```tablegen
def PrintOp : Toy_Op<"print"> {
  ...

  // The print operation takes an input tensor to print.
  // We also allow a F64MemRef to enable interop during partial lowering.
  let arguments = (ins AnyTypeOf<[F64Tensor, F64MemRef]>:$input);
}
```

## Complete Toy Example

Let's take a concrete example:

```mlir
func @main() {
  %0 = toy.constant dense<[[1.000000e+00, 2.000000e+00, 3.000000e+00], [4.000000e+00, 5.000000e+00, 6.000000e+00]]> : tensor<2x3xf64>
  %2 = toy.transpose(%0 : tensor<2x3xf64>) to tensor<3x2xf64>
  %3 = toy.mul %2, %2 : tensor<3x2xf64>
  toy.print %3 : tensor<3x2xf64>
  toy.return
}
```

With affine lowering added to our pipeline, we can now generate:

```mlir
func @main() {
  %cst = constant 1.000000e+00 : f64
  %cst_0 = constant 2.000000e+00 : f64
  %cst_1 = constant 3.000000e+00 : f64
  %cst_2 = constant 4.000000e+00 : f64
  %cst_3 = constant 5.000000e+00 : f64
  %cst_4 = constant 6.000000e+00 : f64

  // Allocating buffers for the inputs and outputs.
  %0 = alloc() : memref<3x2xf64>
  %1 = alloc() : memref<3x2xf64>
  %2 = alloc() : memref<2x3xf64>

  // Initialize the input buffer with the constant values.
  affine.store %cst, %2[0, 0] : memref<2x3xf64>
  affine.store %cst_0, %2[0, 1] : memref<2x3xf64>
  affine.store %cst_1, %2[0, 2] : memref<2x3xf64>
  affine.store %cst_2, %2[1, 0] : memref<2x3xf64>
  affine.store %cst_3, %2[1, 1] : memref<2x3xf64>
  affine.store %cst_4, %2[1, 2] : memref<2x3xf64>

  // Load the transpose value from the input buffer and store it into the
  // next input buffer.
  affine.for %arg0 = 0 to 3 {
    affine.for %arg1 = 0 to 2 {
      %3 = affine.load %2[%arg1, %arg0] : memref<2x3xf64>
      affine.store %3, %1[%arg0, %arg1] : memref<3x2xf64>
    }
  }

  // Multiply and store into the output buffer.
  affine.for %arg0 = 0 to 3 {
    affine.for %arg1 = 0 to 2 {
      %3 = affine.load %1[%arg0, %arg1] : memref<3x2xf64>
      %4 = affine.load %1[%arg0, %arg1] : memref<3x2xf64>
      %5 = mulf %3, %4 : f64
      affine.store %5, %0[%arg0, %arg1] : memref<3x2xf64>
    }
  }

  // Print the value held by the buffer.
  toy.print %0 : memref<3x2xf64>
  dealloc %2 : memref<2x3xf64>
  dealloc %1 : memref<3x2xf64>
  dealloc %0 : memref<3x2xf64>
  return
}
```

## Taking Advantage of Affine Optimization

Our naive lowering is correct, but it leaves a lot to be desired with regards
to efficiency. For example, the lowering of `toy.mul` has generated some
redundant loads. Let's look at how adding a few existing optimizations to the
pipeline can help clean this up. Adding the `LoopFusion` and
`MemRefDataFlowOpt` passes to the pipeline gives the following result:

```mlir
func @main() {
  %cst = constant 1.000000e+00 : f64
  %cst_0 = constant 2.000000e+00 : f64
  %cst_1 = constant 3.000000e+00 : f64
  %cst_2 = constant 4.000000e+00 : f64
  %cst_3 = constant 5.000000e+00 : f64
  %cst_4 = constant 6.000000e+00 : f64

  // Allocating buffers for the inputs and outputs.
  %0 = alloc() : memref<3x2xf64>
  %1 = alloc() : memref<2x3xf64>

  // Initialize the input buffer with the constant values.
  affine.store %cst, %1[0, 0] : memref<2x3xf64>
  affine.store %cst_0, %1[0, 1] : memref<2x3xf64>
  affine.store %cst_1, %1[0, 2] : memref<2x3xf64>
  affine.store %cst_2, %1[1, 0] : memref<2x3xf64>
  affine.store %cst_3, %1[1, 1] : memref<2x3xf64>
  affine.store %cst_4, %1[1, 2] : memref<2x3xf64>

  affine.for %arg0 = 0 to 3 {
    affine.for %arg1 = 0 to 2 {
      // Load the transpose value from the input buffer.
      %2 = affine.load %1[%arg1, %arg0] : memref<2x3xf64>

      // Multiply and store into the output buffer.
      %3 = mulf %2, %2 : f64
      affine.store %3, %0[%arg0, %arg1] : memref<3x2xf64>
    }
  }

  // Print the value held by the buffer.
  toy.print %0 : memref<3x2xf64>
  dealloc %1 : memref<2x3xf64>
  dealloc %0 : memref<3x2xf64>
  return
}
```

Here, we can see that a redundant allocation was removed, the two loop nests
were fused, and some unnecessary `load`s were removed. You can build
`toyc-ch5` and try it yourself:
`toyc-ch5 test/Examples/Toy/Ch5/affine-lowering.mlir -emit=mlir-affine`. We
can also check our optimizations by adding `-opt`.

In this chapter we explored some aspects of partial lowering, with the intent
to optimize. In the [next chapter](Ch-6.md) we will continue the discussion
about dialect conversion by targeting LLVM for code generation.
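As a final sanity check, the effect of `LoopFusion` and `MemRefDataFlowOpt` on this example can be mimicked in plain C++ (an illustrative analogue, not MLIR or tutorial code; all names here are made up for the sketch): the naive form materializes the transpose into a temporary buffer and re-loads it, while the fused form forwards each loaded value straight into the multiply. Both compute the same 3x2 result.

```cpp
#include <array>
#include <cassert>

using Mat2x3 = std::array<std::array<double, 3>, 2>;
using Mat3x2 = std::array<std::array<double, 2>, 3>;

// Naive analogue: two separate loop nests with an intermediate buffer,
// mirroring the unoptimized lowering above.
Mat3x2 transposeMulNaive(const Mat2x3 &in) {
  Mat3x2 tmp{}, out{};
  for (int i = 0; i < 3; ++i)   // transpose into a temporary buffer
    for (int j = 0; j < 2; ++j)
      tmp[i][j] = in[j][i];
  for (int i = 0; i < 3; ++i)   // multiply, re-loading from the buffer
    for (int j = 0; j < 2; ++j)
      out[i][j] = tmp[i][j] * tmp[i][j];
  return out;
}

// Fused analogue: a single loop nest, mirroring the optimized lowering.
Mat3x2 transposeMulFused(const Mat2x3 &in) {
  Mat3x2 out{};
  for (int i = 0; i < 3; ++i)
    for (int j = 0; j < 2; ++j) {
      double v = in[j][i];      // load from the reversed indices
      out[i][j] = v * v;        // no temporary buffer, no redundant load
    }
  return out;
}
```

The fused version allocates one fewer buffer and performs one load per element instead of three (a store plus two loads), which is exactly the improvement visible in the optimized IR.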