# Bufferization on MLIR

The general task of bufferization is to move SSA values (like tensors) into
allocated memory buffers that have to be freed when they are no longer needed.
This also involves the placement of copies to clone the contents of allocated
memory buffers at specific locations (similar to register allocation). On the
one hand, these copies are needed to ensure correct program behavior; on the
other hand, introducing several aliases for a buffer can lead to incorrect
freeing of buffers. Therefore, we have to take the aliases and the program
structure into account. The introduction of copies solves this problem, and
unnecessary copies introduced during this process can be eliminated
afterwards.

```mlir
func @func_on_tensors(%source: tensor<4xf32>) -> tensor<4xf32> {
  %0 = mhlo.exp %source : (tensor<4xf32>) -> (tensor<4xf32>)
  return %0 : tensor<4xf32>
}
```

Will be transformed to:

```mlir
func @func_on_buffers(%source: memref<4xf32>, %target: memref<4xf32>) {
  %0 = alloc() : memref<4xf32>
  lmhlo.exp %source, %0 : (memref<4xf32>, memref<4xf32>) -> ()
  lmhlo.copy %0, %target : (memref<4xf32>, memref<4xf32>) -> ()
  dealloc %0 : memref<4xf32>
  return
}
```

In general, bufferization is split into three separate phases: a preparation
phase, a bufferization phase, and a post-processing phase. The assignment
process happens during dialect conversion and allocates buffers for each value
that should be moved into a memory buffer. This has to be implemented by each
dialect using the tools and patterns described below. In this process, all
operations that are moved onto memory buffers have to be changed to use
`memref<T>` types (see Preparation Phase). Afterwards, the placement
transformation (see BufferDeallocation) automatically adds all required
deallocation operations, temporary buffers, and copy operations.

## Preparation Phase

In order to apply the BufferDeallocation code transformation, the input MLIR
program needs to leverage allocations (buffers in general) and `memref<T>`
types (as outlined above). If your input program does not work on buffers, you
need to perform this preparation step in order to port it to the “buffer
world”. This is a user-defined preparation step that is intended to be applied
during dialect conversion. The user has to take care of the correct conversion
by providing conversion patterns that rely on a type converter to assign
buffers.

A necessary step is to apply type and function signature conversion.
Furthermore, after changing all function signatures, the associated return and
call operations must comply with the new corresponding function signatures. For
this purpose, three operation conversion patterns are introduced:

* BufferizeFuncOpConverter
* BufferizeReturnOpConverter
* BufferizeCallOpConverter

In order to use these conversion patterns, the user needs to define a custom
BufferizeTypeConverter implementation.

### BufferizeTypeConverter

The BufferizeTypeConverter is an extension to the TypeConverter class that
provides additional functionality for dialect developers to decide on the
signature of a function. The extra features allow users to:

* use `setResultConversionKind` to decide whether a function result with a
specific type should be appended to the function argument list as an output
argument or remain a function result after the conversion.
* define their own callback functions for type or value unpacking.

### ResultConversionKind

ResultConversionKind is an enum with two values,

* AppendToArgumentsList
* KeepAsFunctionResult

that defines how BufferizeFuncOpConverter should handle function results in
order to convert the signature of the function. The other two operation
conversion patterns also use ResultConversionKind to adapt themselves to the
new function signature.

`ResultConversionKind` can be set using `setResultConversionKind`, which takes
two template parameters that correspond to the types before and after type
conversion. This mapping specifies whether the resulting type should stay a
function result or should be appended to the argument list after the
conversion is done. Note that the default value for unspecified mappings is
`KeepAsFunctionResult`. For instance, the following call configures the
`BufferizeTypeConverter` instance such that all MemRefType function results
(converted from `RankedTensorType`s) are appended to the function argument
list by BufferizeFuncOpConverter:

```c++
converter.setResultConversionKind<RankedTensorType, MemRefType>(
    BufferizeTypeConverter::AppendToArgumentsList);
```

### Callbacks for Unpacking Types

```mlir
func @func_on_tensors(%arg0: tuple<i1,f32>) -> (tuple<tensor<10xf32>, tensor<5xf32>>) {
  ...
}
```

Will be transformed to:

```mlir
func @func_on_buffers(%arg0: i1, %arg1: f32, %target0: memref<10xf32>, %target1: memref<5xf32>) {
  ...
}
```

BufferizeFuncOpConverter is also able to unpack the types of arguments and
results of a function during function signature conversion. In the example
above, it unwraps the tuple type and converts the type of each constituent.

For this purpose, users can provide custom callbacks by using
`addDecomposeTypeConversion` on the `BufferizeTypeConverter` instance to show
how a specific type (e.g. TupleType) can be unpacked. However, when a type
decomposition is provided, there are two additional callbacks that have to be
defined as well. They specify how to pack or unpack values of that particular
type. These two callbacks can be provided via the `addArgumentMaterialization`
(packing) **and** `addDecomposeValueConversion` (unpacking) functions.

The following piece of code demonstrates this functionality by flattening a
`TupleType`.

```c++
// Decompose a TupleType into its flattened constituent types.
converter.addDecomposeTypeConversion(
    [](TupleType tupleType, SmallVectorImpl<Type> &types) {
      tupleType.getFlattenedTypes(types);
      return success();
    });

// Pack a list of decomposed values back into a single tuple value.
converter.addArgumentMaterialization(
    [](OpBuilder &builder, TupleType resultType, ValueRange inputs,
       Location loc) -> Optional<Value> {
      if (inputs.size() == 1)
        return llvm::None;
      TypeRange inputTypes = inputs.getTypes();
      SmallVector<Type, 2> types(inputTypes.begin(), inputTypes.end());
      TupleType tuple = TupleType::get(types, builder.getContext());
      Value value = builder.create<MakeTupleOp>(loc, tuple, inputs);
      return value;
    });

// Unpack a tuple value into its constituent values.
converter.addDecomposeValueConversion([](OpBuilder &builder, Location loc,
                                         TupleType resultType, Value value,
                                         SmallVectorImpl<Value> &values) {
  for (unsigned i = 0, e = resultType.size(); i < e; ++i) {
    Value res = builder.create<GetTupleElementOp>(
        loc, resultType.getType(i), value, builder.getI32IntegerAttr(i));
    values.push_back(res);
  }
  return success();
});
```

In the scope of these callback functions, the elements of a tuple value can be
decomposed using `GetTupleElementOp`. Conversely, `MakeTupleOp` is used to pack
a list of values into a single tuple value.

### Bufferization Operation Conversion Patterns

The following conversion patterns can be used to conveniently transform the
signature of a function as well as its return and call operations:

* `BufferizeFuncOpConverter`
* `BufferizeReturnOpConverter`
* `BufferizeCallOpConverter`

Any combination of these conversion patterns can be specified by the user. If
you need to apply all of these operation converters, you can use
`populateWithBufferizeOpConversionPatterns`, which sets up all converters.

### BufferizeFuncOpConverter

The BufferizeFuncOpConverter is the actual function operation converter that
applies signature conversion by using a previously defined
`BufferizeTypeConverter`.

In the following example, we configure a `BufferizeTypeConverter` instance
such that

* all RankedTensorTypes are converted to MemRefTypes.
* all function results that are the result of a type conversion from
RankedTensorType to MemRefType are appended to the function argument list.
* all TupleTypes are flattened and decomposed into their constituents.

```c++
converter.addConversion([](RankedTensorType type) {
  return (Type)MemRefType::get(type.getShape(), type.getElementType());
});
converter.setResultConversionKind<RankedTensorType, MemRefType>(
    BufferizeTypeConverter::AppendToArgumentsList);
converter.addDecomposeTypeConversion(
    [](TupleType tupleType, SmallVectorImpl<Type> &types) {
      tupleType.getFlattenedTypes(types);
      return success();
    });
```

Consider the following signature conversion:

```mlir
func @on_tensors(%arg0: tuple<i1,f32>) -> (tuple<memref<10xf32>, tensor<5xf32>>) {
  ...
}
```

Will be transformed to:

```mlir
func @on_buffers(%arg0: i1, %arg1: f32, %out: memref<5xf32>) -> memref<10xf32> {
  ...
}
```

Using the presented converter setup, all TupleType arguments and results are
decomposed first. The tensor<5xf32> result is converted to a memref<5xf32> type
and appended to the argument list. There is no conversion for the types
memref<10xf32>, i1, and f32.
Therefore, the memref<10xf32> result is kept as it is and also remains a
function result, since there is no ResultConversionKind mapping from a
MemRefType to a MemRefType. However, if we change the result-conversion
behavior via

```c++
converter.setResultConversionKind<RankedTensorType, MemRefType>(
    BufferizeTypeConverter::KeepAsFunctionResult);
```

the output will be:

```mlir
func @on_buffers(%arg0: i1, %arg1: f32) -> (memref<10xf32>, memref<5xf32>) {
  ...
}
```

### BufferizeReturnOpConverter

When changing the signature of a function, the return operands must match the
results of the corresponding function if buffer-typed results have been
configured to be appended to the function argument list. This matching
consists of two separate steps. First, we have to remove the operands that
have been appended to the argument list as output arguments. Second, we have
to introduce additional copies for these operands. However, since each dialect
has its own dialect-dependent return and copy operations, this conversion
pattern comes with three template parameters: the original return operation,
the target return operation, and the copy operation kind.

In the following example, two conversion patterns are inserted into the
pattern list. The `BufferizeReturnOpConverter` is set up to replace a standard
return operation with an operation of the same type.

```c++
patterns->insert<
    BufferizeFuncOpConverter,
    BufferizeReturnOpConverter<mlir::ReturnOp, mlir::ReturnOp, linalg::CopyOp>
    >(...);
```

Consider the following input/output program using a single return:

```mlir
func @on_tensors(%arg0: tensor<5xf32>, %arg1: i1) -> (tensor<5xf32>, i1) {
  return %arg0, %arg1 : tensor<5xf32>, i1
}
```

Will be transformed to:

```mlir
func @on_buffers(%arg0: memref<5xf32>, %arg1: i1, %out: memref<5xf32>) -> i1 {
  linalg.copy(%arg0, %out) : memref<5xf32>, memref<5xf32>
  return %arg1 : i1
}
```

Based on our previously configured `BufferizeTypeConverter` instance, which
requires buffer-typed function results to be appended to the function argument
list, the new `on_buffers` function signature is created. The operands of the
return operation must be adapted to the new function signature. Therefore, the
buffer-typed operand is removed from the operand list of the new return
operation. Instead, a copy operation is inserted right before the return
operation to copy the content of the operand buffer to the target buffer,
which yields the output shown above.

### BufferizeCallOpConverter

The BufferizeCallOpConverter is a call operation converter that transforms and
matches the operands and results of a call operation with the arguments and
results of the callee. Besides converting operand and result types, it
allocates a buffer for each buffer-based result of the called function that is
appended to the argument list (if buffer-typed results have been configured to
be appended to the function argument list).

The following piece of code shows a sample call site, based on our previously
configured `BufferizeTypeConverter`:

```mlir
func @callee(%arg0: tensor<5xf32>) -> (tensor<5xf32>) {
  return %arg0 : tensor<5xf32>
}

func @caller(%arg0: tensor<5xf32>) -> tensor<5xf32> {
  %x = call @callee(%arg0) : (tensor<5xf32>) -> tensor<5xf32>
  return %x : tensor<5xf32>
}
```

Will be transformed to:

```mlir
func @callee(%arg0: memref<5xf32>, %out: memref<5xf32>) {
  linalg.copy(%arg0, %out) : memref<5xf32>, memref<5xf32>
  return
}

func @caller(%arg0: memref<5xf32>, %out: memref<5xf32>) {
  %0 = alloc() : memref<5xf32>
  call @callee(%arg0, %0) : (memref<5xf32>, memref<5xf32>) -> ()
  linalg.copy(%0, %out) : memref<5xf32>, memref<5xf32>
  return
}
```

### Summarizing Example

To summarize all preparation converters, the following sample is a complete
listing of an input IR program and its output after applying all converters:

```mlir
func @callee(%arg0: tuple<tensor<5xf32>,i1>) -> tuple<tensor<5xf32>,i1> {
  return %arg0 : tuple<tensor<5xf32>,i1>
}

func @caller(%arg0: tuple<tensor<5xf32>,i1>) -> tuple<tensor<5xf32>,i1> {
  %x = call @callee(%arg0) : (tuple<tensor<5xf32>,i1>) -> tuple<tensor<5xf32>,i1>
  return %x : tuple<tensor<5xf32>,i1>
}
```

Will be transformed to:

```mlir
func @callee(%arg0: memref<5xf32>, %arg1: i1, %arg2: memref<5xf32>) -> i1 {
  %0 = "test.make_tuple"(%arg0, %arg1) : (memref<5xf32>, i1) -> tuple<memref<5xf32>, i1>
  %1 = "test.get_tuple_element"(%0) {index = 0 : i32} : (tuple<memref<5xf32>, i1>) -> memref<5xf32>
  %2 = "test.get_tuple_element"(%0) {index = 1 : i32} : (tuple<memref<5xf32>, i1>) -> i1
  linalg.copy(%1, %arg2) : memref<5xf32>, memref<5xf32>
  return %2 : i1
}

func @caller(%arg0: memref<5xf32>, %arg1: i1, %arg2: memref<5xf32>) -> i1 {
  %0 = "test.make_tuple"(%arg0, %arg1) : (memref<5xf32>, i1) -> tuple<memref<5xf32>, i1>
  %1 = "test.get_tuple_element"(%0) {index = 0 : i32} : (tuple<memref<5xf32>, i1>) -> memref<5xf32>
  %2 = "test.get_tuple_element"(%0) {index = 1 : i32} : (tuple<memref<5xf32>, i1>) -> i1
  %3 = alloc() : memref<5xf32>
  %4 = call @callee(%1, %2, %3) : (memref<5xf32>, i1, memref<5xf32>) -> i1
  %5 = "test.make_tuple"(%3, %4) : (memref<5xf32>, i1) -> tuple<memref<5xf32>, i1>
  %6 = "test.get_tuple_element"(%5) {index = 0 : i32} : (tuple<memref<5xf32>, i1>) -> memref<5xf32>
  %7 = "test.get_tuple_element"(%5) {index = 1 : i32} : (tuple<memref<5xf32>, i1>) -> i1
  linalg.copy(%6, %arg2) : memref<5xf32>, memref<5xf32>
  return %7 : i1
}
```

## Buffer Deallocation - Internal Functionality

This section covers the internal functionality of the BufferDeallocation
transformation, which consists of several passes. The main pass, called
BufferDeallocation, can be applied via “-buffer-deallocation” on MLIR
programs. In addition, there are three optimization passes that move allocs
and convert AllocOps to AllocaOps where possible. The first and second passes
can be applied using “-buffer-hoisting” and “-buffer-loop-hoisting”, the third
one using “-promote-buffers-to-stack”. These optimizations must be applied
before running the BufferDeallocation pass.
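
Assuming these passes are registered with `mlir-opt` under the flags quoted
above, a typical pipeline invocation would therefore run the optimizations
first, e.g.
`mlir-opt -buffer-hoisting -buffer-loop-hoisting -promote-buffers-to-stack -buffer-deallocation input.mlir`.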

### Requirements

In order to use BufferDeallocation on an arbitrary dialect, several
control-flow interfaces have to be implemented when using custom operations.
This is particularly important to understand the implicit control-flow
dependencies between different parts of the input program. Without
implementing the following interfaces, control-flow relations cannot be
discovered properly and the resulting program can become invalid:

* Branch-like terminators should implement the `BranchOpInterface` to query
and manipulate associated operands.
* Operations involving structured control flow have to implement the
`RegionBranchOpInterface` to model inter-region control flow.
* Terminators yielding values to their parent operation (in particular in the
scope of nested regions within `RegionBranchOpInterface`-based operations)
should implement the `ReturnLike` trait to represent logical “value returns”.

The “std” and “scf” dialects are examples of dialects that are fully
compatible with respect to all of these interfaces.

### Detection of Buffer Allocations

The first step of the BufferDeallocation transformation is to identify
manageable allocation operations that implement the `SideEffects` interface.
Furthermore, these ops need to apply the effect `MemoryEffects::Allocate` to a
particular result value while not using the resource
`SideEffects::AutomaticAllocationScopeResource` (since it is currently
reserved for allocations, like `Alloca`, that will be automatically
deallocated by a parent scope). Allocations that have not been detected in
this phase will not be tracked internally, and thus, not deallocated
automatically. However, BufferDeallocation is fully compatible with “hybrid”
setups in which tracked and untracked allocations are mixed:

```mlir
func @mixedAllocation(%arg0: i1) {
  %0 = alloca() : memref<2xf32> // aliases: %2
  %1 = alloc() : memref<2xf32> // aliases: %2
  cond_br %arg0, ^bb1, ^bb2
^bb1:
  use(%0)
  br ^bb3(%0 : memref<2xf32>)
^bb2:
  use(%1)
  br ^bb3(%1 : memref<2xf32>)
^bb3(%2: memref<2xf32>):
  ...
}
```

This is an example of a conditional branch that intermixes alloc and alloca.
BufferDeallocation can detect and handle these different allocation types.

Note: the current version does not support allocation operations returning
multiple result buffers.

### Conversion from AllocOp to AllocaOp

The PromoteBuffersToStack pass converts AllocOps to AllocaOps, if possible. In
some cases, it can be useful to use such stack-based buffers instead of
heap-based buffers. The conversion is subject to several constraints:

* Control flow
* Buffer size
* Dynamic size

If a buffer leaves a block, we are not allowed to convert it into an alloca.
If the size of the buffer is large, we could convert it, but with regard to
stack overflows it makes sense to limit the size of these buffers and convert
only small ones. The size limit can be set via a pass option; the current
default value is 1KB. Furthermore, we cannot convert buffers with dynamic
size, since their dimensions are not known a priori.
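
To illustrate these constraints, consider the following sketch (written in the
style of the examples above, with `use` standing for an arbitrary operation).
It contrasts a buffer that qualifies for the conversion with one that does
not:

```mlir
func @promotion_candidates(%arg0: index) {
  // Small, statically sized, and does not leave this block:
  // -promote-buffers-to-stack can turn this alloc into an alloca.
  %0 = alloc() : memref<2xf32>
  use(%0)
  // Dynamically sized: the extent is not known a priori, so this
  // allocation has to remain heap-based.
  %1 = alloc(%arg0) : memref<?xf32>
  use(%1)
  return
}
```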

### Movement and Placement of Allocations

Using the buffer hoisting pass, all buffer allocations are moved as far
upwards as possible in order to group them and make upcoming optimizations
easier by limiting the search space. Such a movement is shown in the following
graphs. In addition, we are able to statically free an alloc if we move it
into a dominator of all of its uses. This simplifies further optimizations
(e.g. buffer fusion) in the future. However, the movement of allocations is
limited by external data dependencies (in particular in the case of
allocations of dynamically shaped types). Furthermore, allocations can be
moved out of nested regions, if necessary. In order to move allocations to
valid locations with respect to their uses only, we leverage liveness
information.

The following code snippet shows a conditional branch before running the
BufferHoisting pass:

![branch_example_pre_move](/includes/img/branch_example_pre_move.svg)

```mlir
func @condBranch(%arg0: i1, %arg1: memref<2xf32>, %arg2: memref<2xf32>) {
  cond_br %arg0, ^bb1, ^bb2
^bb1:
  br ^bb3(%arg1 : memref<2xf32>)
^bb2:
  %0 = alloc() : memref<2xf32> // aliases: %1
  use(%0)
  br ^bb3(%0 : memref<2xf32>)
^bb3(%1: memref<2xf32>): // %1 could be %0 or %arg1
  "linalg.copy"(%1, %arg2) : (memref<2xf32>, memref<2xf32>) -> ()
  return
}
```

Applying the BufferHoisting pass on this program results in the following
piece of code:

![branch_example_post_move](/includes/img/branch_example_post_move.svg)

```mlir
func @condBranch(%arg0: i1, %arg1: memref<2xf32>, %arg2: memref<2xf32>) {
  %0 = alloc() : memref<2xf32> // moved to bb0
  cond_br %arg0, ^bb1, ^bb2
^bb1:
  br ^bb3(%arg1 : memref<2xf32>)
^bb2:
  use(%0)
  br ^bb3(%0 : memref<2xf32>)
^bb3(%1: memref<2xf32>):
  "linalg.copy"(%1, %arg2) : (memref<2xf32>, memref<2xf32>) -> ()
  return
}
```

The alloc is moved from bb2 to the beginning of the function and is passed as
an argument to bb3.

The following example demonstrates an allocation using dynamically shaped
types. Due to the data dependency of the allocation on %0, we cannot move the
allocation out of bb2 in this case:

```mlir
func @condBranchDynamicType(
  %arg0: i1,
  %arg1: memref<?xf32>,
  %arg2: memref<?xf32>,
  %arg3: index) {
  cond_br %arg0, ^bb1, ^bb2(%arg3: index)
^bb1:
  br ^bb3(%arg1 : memref<?xf32>)
^bb2(%0: index):
  %1 = alloc(%0) : memref<?xf32> // cannot be moved upwards due to the data
                                 // dependency on %0
  use(%1)
  br ^bb3(%1 : memref<?xf32>)
^bb3(%2: memref<?xf32>):
  "linalg.copy"(%2, %arg2) : (memref<?xf32>, memref<?xf32>) -> ()
  return
}
```

### Introduction of Copies

In order to guarantee that all allocated buffers are freed properly, we have
to pay attention to the control flow and all potential aliases a buffer
allocation can have. Since not all allocations can be safely freed with
respect to their aliases (see the following code snippet), it is often
required to introduce copies to eliminate them. Consider the following example
in which the allocations have already been placed:

```mlir
func @branch(%arg0: i1) {
  %0 = alloc() : memref<2xf32> // aliases: %2
  cond_br %arg0, ^bb1, ^bb2
^bb1:
  %1 = alloc() : memref<2xf32> // resides here for demonstration purposes
                               // aliases: %2
  br ^bb3(%1 : memref<2xf32>)
^bb2:
  use(%0)
  br ^bb3(%0 : memref<2xf32>)
^bb3(%2: memref<2xf32>):
  …
  return
}
```

The first alloc can be safely freed at the end of its post-dominator block
(bb3).
The alloc in bb1 has an alias %2 in bb3 that also keeps this buffer alive
until the end of bb3. Since we cannot determine the actual branches that will
be taken at runtime, we have to ensure that all buffers are freed correctly in
bb3 regardless of the branches we take to reach the exit block. This makes it
necessary to introduce a copy for %2, which allows us to free %1 in bb1 and %0
in bb3 without the risk of freeing a buffer twice. Afterwards, we can continue
processing all aliases of %2 (none in this case) and we can safely free %2 at
the end of the sample program. This sample demonstrates that not all
allocations can be safely freed in their associated post-dominator blocks.
Instead, we have to pay attention to all of their aliases.

Applying the BufferDeallocation pass to the program above yields the following
result:

```mlir
func @branch(%arg0: i1) {
  %0 = alloc() : memref<2xf32>
  cond_br %arg0, ^bb1, ^bb2
^bb1:
  %1 = alloc() : memref<2xf32>
  %3 = alloc() : memref<2xf32> // temp copy for %1
  "linalg.copy"(%1, %3) : (memref<2xf32>, memref<2xf32>) -> ()
  dealloc %1 : memref<2xf32> // %1 can be safely freed here
  br ^bb3(%3 : memref<2xf32>)
^bb2:
  use(%0)
  %4 = alloc() : memref<2xf32> // temp copy for %0
  "linalg.copy"(%0, %4) : (memref<2xf32>, memref<2xf32>) -> ()
  br ^bb3(%4 : memref<2xf32>)
^bb3(%2: memref<2xf32>):
  …
  dealloc %2 : memref<2xf32> // free temp buffer %2
  dealloc %0 : memref<2xf32> // %0 can be safely freed here
  return
}
```

Note that a temporary buffer for %2 was introduced to free all allocations
properly. Note further that the unnecessary allocation of %3 can easily be
removed using one of the post-pass transformations.

Reconsider the previously introduced sample demonstrating dynamically shaped
types:

```mlir
func @condBranchDynamicType(
  %arg0: i1,
  %arg1: memref<?xf32>,
  %arg2: memref<?xf32>,
  %arg3: index) {
  cond_br %arg0, ^bb1, ^bb2(%arg3: index)
^bb1:
  br ^bb3(%arg1 : memref<?xf32>)
^bb2(%0: index):
  %1 = alloc(%0) : memref<?xf32> // aliases: %2
  use(%1)
  br ^bb3(%1 : memref<?xf32>)
^bb3(%2: memref<?xf32>):
  "linalg.copy"(%2, %arg2) : (memref<?xf32>, memref<?xf32>) -> ()
  return
}
```

In the presence of dynamically shaped types, we have to parameterize the
allocations with additional dimension information of the source buffers we
want to copy from.
BufferDeallocation automatically introduces all required operations to extract
dimension specifications and wires them up with the associated allocations:

```mlir
func @condBranchDynamicType(
  %arg0: i1,
  %arg1: memref<?xf32>,
  %arg2: memref<?xf32>,
  %arg3: index) {
  cond_br %arg0, ^bb1, ^bb2(%arg3 : index)
^bb1:
  %c0 = constant 0 : index
  %0 = dim %arg1, %c0 : memref<?xf32> // dimension operation to parameterize
                                      // the following temp allocation
  %1 = alloc(%0) : memref<?xf32>
  "linalg.copy"(%arg1, %1) : (memref<?xf32>, memref<?xf32>) -> ()
  br ^bb3(%1 : memref<?xf32>)
^bb2(%2: index):
  %3 = alloc(%2) : memref<?xf32>
  use(%3)
  %c0_0 = constant 0 : index
  %4 = dim %3, %c0_0 : memref<?xf32> // dimension operation to parameterize
                                     // the following temp allocation
  %5 = alloc(%4) : memref<?xf32>
  "linalg.copy"(%3, %5) : (memref<?xf32>, memref<?xf32>) -> ()
  dealloc %3 : memref<?xf32> // %3 can be safely freed here
  br ^bb3(%5 : memref<?xf32>)
^bb3(%6: memref<?xf32>):
  "linalg.copy"(%6, %arg2) : (memref<?xf32>, memref<?xf32>) -> ()
  dealloc %6 : memref<?xf32> // %6 can be safely freed here
  return
}
```

BufferDeallocation performs a fix-point iteration taking all aliases of all
tracked allocations into account. We initialize the general iteration process
using all tracked allocations and their associated aliases. As soon as we
encounter an alias that is not properly dominated by our allocation, we mark
this alias as _critical_ (it needs to be freed and tracked by the internal
fix-point iteration). The following sample demonstrates the presence of
critical and non-critical aliases:

![nested_branch_example_pre_move](/includes/img/nested_branch_example_pre_move.svg)

```mlir
func @condBranchDynamicTypeNested(
  %arg0: i1,
  %arg1: memref<?xf32>, // aliases: %3, %4
  %arg2: memref<?xf32>,
  %arg3: index) {
  cond_br %arg0, ^bb1, ^bb2(%arg3: index)
^bb1:
  br ^bb6(%arg1 : memref<?xf32>)
^bb2(%0: index):
  %1 = alloc(%0) : memref<?xf32> // cannot be moved upwards due to the data
                                 // dependency on %0
                                 // aliases: %2, %3, %4
  use(%1)
  cond_br %arg0, ^bb3, ^bb4
^bb3:
  br ^bb5(%1 : memref<?xf32>)
^bb4:
  br ^bb5(%1 : memref<?xf32>)
^bb5(%2: memref<?xf32>): // non-crit. alias of %1, since %1 dominates %2
  br ^bb6(%2 : memref<?xf32>)
^bb6(%3: memref<?xf32>): // crit. alias of %arg1 and %2 (in other words %1)
  br ^bb7(%3 : memref<?xf32>)
^bb7(%4: memref<?xf32>): // non-crit. alias of %3, since %3 dominates %4
  "linalg.copy"(%4, %arg2) : (memref<?xf32>, memref<?xf32>) -> ()
  return
}
```

Applying BufferDeallocation yields the following output:

![nested_branch_example_post_move](/includes/img/nested_branch_example_post_move.svg)

```mlir
func @condBranchDynamicTypeNested(
  %arg0: i1,
  %arg1: memref<?xf32>,
  %arg2: memref<?xf32>,
  %arg3: index) {
  cond_br %arg0, ^bb1, ^bb2(%arg3 : index)
^bb1:
  %c0 = constant 0 : index
  %d0 = dim %arg1, %c0 : memref<?xf32>
  %5 = alloc(%d0) : memref<?xf32> // temp buffer required due to alias %3
  "linalg.copy"(%arg1, %5) : (memref<?xf32>, memref<?xf32>) -> ()
  br ^bb6(%5 : memref<?xf32>)
^bb2(%0: index):
  %1 = alloc(%0) : memref<?xf32>
  use(%1)
  cond_br %arg0, ^bb3, ^bb4
^bb3:
  br ^bb5(%1 : memref<?xf32>)
^bb4:
  br ^bb5(%1 : memref<?xf32>)
^bb5(%2: memref<?xf32>):
  %c0_0 = constant 0 : index
  %d1 = dim %2, %c0_0 : memref<?xf32>
  %6 = alloc(%d1) : memref<?xf32> // temp buffer required due to alias %3
  "linalg.copy"(%1, %6) : (memref<?xf32>, memref<?xf32>) -> ()
  dealloc %1 : memref<?xf32>
  br ^bb6(%6 : memref<?xf32>)
^bb6(%3: memref<?xf32>):
  br ^bb7(%3 : memref<?xf32>)
^bb7(%4: memref<?xf32>):
  "linalg.copy"(%4, %arg2) : (memref<?xf32>, memref<?xf32>) -> ()
  dealloc %3 : memref<?xf32> // free %3, since %4 is a non-crit. alias of %3
  return
}
```

Since %3 is a critical alias, BufferDeallocation introduces an additional
temporary copy in all predecessor blocks. %3 has an additional (non-critical)
alias %4 that extends the live range until the end of bb7. Therefore, we can
free %3 after its last use, while taking all aliases into account. Note that
%4 does not need to be freed, since we did not introduce a copy for it.

The actual introduction of buffer copies is done after the fix-point iteration
has terminated and all critical aliases have been detected. A critical alias
can be either a block argument or another value that is returned by an
operation. Copies for block arguments are handled by analyzing all predecessor
blocks. This is primarily done by querying the `BranchOpInterface` of the
associated branch terminators that can jump to the current block. Consider the
following example which involves a simple branch and the critical block
argument %2:

```mlir
  custom.br ^bb1(..., %0, ...)
  ...
  custom.br ^bb1(..., %1, ...)
  ...
^bb1(%2: memref<2xf32>):
  ...
```

The `BranchOpInterface` allows us to determine the actual values that will be
passed to block bb1 and its argument %2 by analyzing its predecessor blocks.
Once we have resolved the values %0 and %1 (which are associated with %2 in
this sample), we can introduce a temporary buffer and clone its contents into
the new buffer. Afterwards, we rewire the branch operands to use the newly
allocated buffer instead. However, blocks can have implicitly defined
predecessors by parent ops that implement the `RegionBranchOpInterface`. This
can be the case if this block argument belongs to the entry block of a region.
In this setting, we have to identify all predecessor regions defined by the
parent operation. For every region, we need to get all terminator operations
implementing the `ReturnLike` trait, indicating that they can branch to our
current block.
Finally, we can use similar functionality as described above to add the
temporary copy. This time, we can modify the terminator operands directly
without touching a high-level interface.

Consider the following inner-region control-flow sample that uses an imaginary
“custom.region_if” operation. It either executes the “then” or “else” region
and always continues with the “join” region. The “custom.region_if_yield”
operation returns a result to the parent operation. This sample demonstrates
the use of the `RegionBranchOpInterface` to determine predecessors in order to
infer the high-level control flow:

```mlir
func @inner_region_control_flow(
  %arg0 : index,
  %arg1 : index) -> memref<?x?xf32> {
  %0 = alloc(%arg0, %arg0) : memref<?x?xf32>
  %1 = custom.region_if %0 : memref<?x?xf32> -> (memref<?x?xf32>)
    then(%arg2 : memref<?x?xf32>) { // aliases: %arg4, %1
      custom.region_if_yield %arg2 : memref<?x?xf32>
    } else(%arg3 : memref<?x?xf32>) { // aliases: %arg4, %1
      custom.region_if_yield %arg3 : memref<?x?xf32>
    } join(%arg4 : memref<?x?xf32>) { // aliases: %1
      custom.region_if_yield %arg4 : memref<?x?xf32>
    }
  return %1 : memref<?x?xf32>
}
```

![region_branch_example_pre_move](/includes/img/region_branch_example_pre_move.svg)

Non-block arguments (other values) can become aliases when they are returned
by dialect-specific operations. BufferDeallocation supports this behavior via
the `RegionBranchOpInterface`. Consider the following example that uses an
“scf.if” operation to determine the value of %2 at runtime, which creates an
alias:

```mlir
func @nested_region_control_flow(%arg0 : index, %arg1 : index) -> memref<?x?xf32> {
  %0 = cmpi "eq", %arg0, %arg1 : index
  %1 = alloc(%arg0, %arg0) : memref<?x?xf32>
  %2 = scf.if %0 -> (memref<?x?xf32>) {
    scf.yield %1 : memref<?x?xf32> // %2 will be an alias of %1
  } else {
    %3 = alloc(%arg0, %arg1) : memref<?x?xf32> // nested allocation in a
                                               // divergent branch
    use(%3)
    scf.yield %1 : memref<?x?xf32> // %2 will be an alias of %1
  }
  return %2 : memref<?x?xf32>
}
```

In this example, a dealloc is inserted to release the buffer within the else
block, since it cannot be accessed by the remainder of the program. Accessing
the `RegionBranchOpInterface` allows us to infer that %2 is a non-critical
alias of %1, which does not need to be tracked.

```mlir
func @nested_region_control_flow(%arg0: index, %arg1: index) -> memref<?x?xf32> {
  %0 = cmpi "eq", %arg0, %arg1 : index
  %1 = alloc(%arg0, %arg0) : memref<?x?xf32>
  %2 = scf.if %0 -> (memref<?x?xf32>) {
    scf.yield %1 : memref<?x?xf32>
  } else {
    %3 = alloc(%arg0, %arg1) : memref<?x?xf32>
    use(%3)
    dealloc %3 : memref<?x?xf32> // %3 can be safely freed here
    scf.yield %1 : memref<?x?xf32>
  }
  return %2 : memref<?x?xf32>
}
```

Analogous to the previous case, we have to detect all terminator operations in
all attached regions of “scf.if” that provide a value to their parent
operation (in this sample via scf.yield). Querying the
`RegionBranchOpInterface` allows us to determine the regions that “return” a
result to their parent operation. Like before, we have to update all
`ReturnLike` terminators as described above.

Reconsider a slightly adapted version of the “custom.region_if” example from
above that uses a nested allocation:

```mlir
func @inner_region_control_flow_div(
  %arg0 : index,
  %arg1 : index) -> memref<?x?xf32> {
  %0 = alloc(%arg0, %arg0) : memref<?x?xf32>
  %1 = custom.region_if %0 : memref<?x?xf32> -> (memref<?x?xf32>)
    then(%arg2 : memref<?x?xf32>) { // aliases: %arg4, %1
      custom.region_if_yield %arg2 : memref<?x?xf32>
    } else(%arg3 : memref<?x?xf32>) {
      %2 = alloc(%arg0, %arg1) : memref<?x?xf32> // aliases: %arg4, %1
      custom.region_if_yield %2 : memref<?x?xf32>
    } join(%arg4 : memref<?x?xf32>) { // aliases: %1
      custom.region_if_yield %arg4 : memref<?x?xf32>
    }
  return %1 : memref<?x?xf32>
}
```

Since the allocation %2 happens in a divergent branch and cannot be safely
deallocated in a post-dominator, %arg4 will be considered a critical alias.
Furthermore, %arg4 is returned to its parent operation and has an alias %1.
This causes BufferDeallocation to introduce additional copies:

```mlir
func @inner_region_control_flow_div(
  %arg0 : index,
  %arg1 : index) -> memref<?x?xf32> {
  %0 = alloc(%arg0, %arg0) : memref<?x?xf32>
  %1 = custom.region_if %0 : memref<?x?xf32> -> (memref<?x?xf32>)
    then(%arg2 : memref<?x?xf32>) {
      %c0 = constant 0 : index // determine dimension extents for temp allocation
      %2 = dim %arg2, %c0 : memref<?x?xf32>
      %c1 = constant 1 : index
      %3 = dim %arg2, %c1 : memref<?x?xf32>
      %4 = alloc(%2, %3) : memref<?x?xf32> // temp buffer required due to
                                           // critical alias %arg4
      linalg.copy(%arg2, %4) : memref<?x?xf32>, memref<?x?xf32>
      custom.region_if_yield %4 : memref<?x?xf32>
    } else(%arg3 : memref<?x?xf32>) {
      %2 = alloc(%arg0, %arg1) : memref<?x?xf32>
      %c0 = constant 0 : index // determine dimension extents for temp allocation
      %3 = dim %2, %c0 : memref<?x?xf32>
      %c1 = constant 1 : index
      %4 = dim %2, %c1 : memref<?x?xf32>
      %5 = alloc(%3, %4) : memref<?x?xf32> // temp buffer required due to
                                           // critical alias %arg4
      linalg.copy(%2, %5) : memref<?x?xf32>, memref<?x?xf32>
      dealloc %2 : memref<?x?xf32>
      custom.region_if_yield %5 : memref<?x?xf32>
    } join(%arg4: memref<?x?xf32>) {
      %c0 = constant 0 : index // determine dimension extents for temp allocation
      %2 = dim %arg4, %c0 : memref<?x?xf32>
      %c1 = constant 1 : index
      %3 = dim %arg4, %c1 : memref<?x?xf32>
      %4 = alloc(%2, %3) : memref<?x?xf32> // this allocation will be removed
                                           // by applying the copy removal pass
      linalg.copy(%arg4, %4) : memref<?x?xf32>, memref<?x?xf32>
      dealloc %arg4 : memref<?x?xf32>
      custom.region_if_yield %4 : memref<?x?xf32>
    }
  dealloc %0 : memref<?x?xf32> // %0 can be safely freed here
  return %1 : memref<?x?xf32>
}
```

### Placement of Deallocs

After introducing allocs and copies, deallocs have to be placed to free
allocated memory and avoid memory leaks. The deallocation needs to take place
after the last use of the given value. The position can be determined by
calculating the common post-dominator of all uses of a value and its remaining
non-critical aliases. A special case is the presence of back edges, since such
edges can cause memory leaks when a newly allocated buffer flows back to
another part of the program. In these cases, we need to free the associated
buffer instances from the previous iteration by inserting additional deallocs.

Consider the following “scf.for” use case containing a nested structured
control-flow if:

```mlir
func @loop_nested_if(
  %lb: index,
  %ub: index,
  %step: index,
  %buf: memref<2xf32>,
  %res: memref<2xf32>) {
  %0 = scf.for %i = %lb to %ub step %step
    iter_args(%iterBuf = %buf) -> memref<2xf32> {
    %1 = cmpi "eq", %i, %ub : index
    %2 = scf.if %1 -> (memref<2xf32>) {
      %3 = alloc() : memref<2xf32> // makes %2 a critical alias due to a
                                   // divergent allocation
      use(%3)
      scf.yield %3 : memref<2xf32>
    } else {
      scf.yield %iterBuf : memref<2xf32>
    }
    scf.yield %2 : memref<2xf32>
  }
  "linalg.copy"(%0, %res) : (memref<2xf32>, memref<2xf32>) -> ()
  return
}
```

In this example, the _then_ branch of the nested “scf.if” operation returns a
newly allocated buffer.

Since this allocation happens in the scope of a divergent branch, %2 becomes a
critical alias that needs to be handled. As before, we have to insert
additional copies (of %3 and %iterBuf) to eliminate this alias. This
guarantees that %2 will always be a newly allocated buffer that is returned in
each iteration. However, “returning” %2 to its alias %iterBuf turns %iterBuf
into a critical alias as well. In other words, we have to create a copy of %2
to pass it to %iterBuf. Since this jump represents a back edge, and %2 will
always be a new buffer, we have to free the buffer from the previous iteration
to avoid memory leaks:

```mlir
func @loop_nested_if(
  %lb: index,
  %ub: index,
  %step: index,
  %buf: memref<2xf32>,
  %res: memref<2xf32>) {
  %4 = alloc() : memref<2xf32>
  "linalg.copy"(%buf, %4) : (memref<2xf32>, memref<2xf32>) -> ()
  %0 = scf.for %i = %lb to %ub step %step
    iter_args(%iterBuf = %4) -> memref<2xf32> {
    %1 = cmpi "eq", %i, %ub : index
    %2 = scf.if %1 -> (memref<2xf32>) {
      %3 = alloc() : memref<2xf32> // makes %2 a critical alias
      use(%3)
      %5 = alloc() : memref<2xf32> // temp copy due to crit. alias %2
      "linalg.copy"(%3, %5) : (memref<2xf32>, memref<2xf32>) -> ()
      dealloc %3 : memref<2xf32>
      scf.yield %5 : memref<2xf32>
    } else {
      %6 = alloc() : memref<2xf32> // temp copy due to crit. alias %2
      "linalg.copy"(%iterBuf, %6) : (memref<2xf32>, memref<2xf32>) -> ()
      scf.yield %6 : memref<2xf32>
    }
    %7 = alloc() : memref<2xf32> // temp copy due to crit. alias %iterBuf
    "linalg.copy"(%2, %7) : (memref<2xf32>, memref<2xf32>) -> ()
    dealloc %2 : memref<2xf32>
    dealloc %iterBuf : memref<2xf32> // free backedge iteration variable
    scf.yield %7 : memref<2xf32>
  }
  "linalg.copy"(%0, %res) : (memref<2xf32>, memref<2xf32>) -> ()
  dealloc %0 : memref<2xf32> // free temp copy %0
  return
}
```

This is an example of loop-like control flow: the CFG contains back edges that
have to be handled to avoid memory leaks. The bufferization is able to free
the backedge iteration variable %iterBuf.

### Private Analyses Implementations

The BufferDeallocation transformation relies on one primary control-flow
analysis: BufferPlacementAliasAnalysis. Furthermore, we also use dominance and
liveness information to place and move nodes. The liveness analysis determines
the live range of a given value. Within this range, a value is alive and can
or will be used in the course of the program. After this range, the value is
dead and can be discarded - in our case, the buffer can be freed.
To place the allocs, we need to know from which position on a value will be
alive. The allocs have to be placed in front of this position. However, the
most important analysis is the alias analysis that is needed to introduce
copies and to place all deallocations.

## Post Phase

In order to limit the complexity of the BufferDeallocation transformation,
some small code-polishing/optimization transformations are not applied
on-the-fly during placement. Currently, there is only the CopyRemoval
transformation to remove unnecessary copy and allocation operations.

Note: further transformations might be added to the post-pass phase in the
future.

### CopyRemoval Pass

A common pattern that arises during placement is the introduction of
unnecessary temporary copies that are used instead of the original source
buffer. For this reason, there is a post-pass transformation that removes
these allocations and copies via `-copy-removal`. This pass, besides removing
unnecessary copy operations, will also remove the dead allocations and their
corresponding deallocation operations. The CopyRemoval pass can currently be
applied to operations that implement the `CopyOpInterface` in either of these
two situations:

* reusing the source buffer of the copy operation.
* reusing the target buffer of the copy operation.

### Reusing the Source Buffer of the Copy Operation

In this case, the source of the copy operation can be used instead of the
target. The unused allocation and deallocation operations that are defined for
this copy operation are also removed. Here is a working example generated by
the BufferDeallocation pass that allocates a buffer with dynamic size. A
deeper analysis of this sample reveals that the temporary allocation %10, the
copy, and the deallocation of %7 are redundant and can be removed.

```mlir
func @dynamic_allocation(%arg0: index, %arg1: index) -> memref<?x?xf32> {
  %7 = alloc(%arg0, %arg1) : memref<?x?xf32>
  %c0_0 = constant 0 : index
  %8 = dim %7, %c0_0 : memref<?x?xf32>
  %c1_1 = constant 1 : index
  %9 = dim %7, %c1_1 : memref<?x?xf32>
  %10 = alloc(%8, %9) : memref<?x?xf32>
  linalg.copy(%7, %10) : memref<?x?xf32>, memref<?x?xf32>
  dealloc %7 : memref<?x?xf32>
  return %10 : memref<?x?xf32>
}
```

Will be transformed to:

```mlir
func @dynamic_allocation(%arg0: index, %arg1: index) -> memref<?x?xf32> {
  %7 = alloc(%arg0, %arg1) : memref<?x?xf32>
  %c0_0 = constant 0 : index
  %8 = dim %7, %c0_0 : memref<?x?xf32>
  %c1_1 = constant 1 : index
  %9 = dim %7, %c1_1 : memref<?x?xf32>
  return %7 : memref<?x?xf32>
}
```

In this case, the copy target %10 can be replaced by its original source
buffer %7. This also removes the associated dealloc operation of %7.

To limit the complexity of this transformation, it only removes copy
operations when the following constraints are met:

* The copy operation, the defining operation for the target value, and the
deallocation of the source value lie in the same block.
* There are no users/aliases of the target value between the defining
operation of the target value and its copy operation.
* There are no users/aliases of the source value between its associated copy
operation and the deallocation of the source value.
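
To illustrate the last constraint, consider the following sketch (again using
the `use` convention from the examples above). The copy cannot be removed
here, because the source buffer %0 has another user between the copy operation
and its deallocation, so replacing %1 with %0 could change the observable
buffer contents:

```mlir
func @copy_is_kept(%arg0: index) -> memref<?xf32> {
  %0 = alloc(%arg0) : memref<?xf32>
  use(%0)
  %1 = alloc(%arg0) : memref<?xf32>
  linalg.copy(%0, %1) : memref<?xf32>, memref<?xf32>
  use(%0) // a user of the source value between the copy and the dealloc
          // violates the third constraint; %1 is not replaced by %0
  dealloc %0 : memref<?xf32>
  return %1 : memref<?xf32>
}
```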

### Reusing the Target Buffer of the Copy Operation

In this case, the target buffer of the copy operation can be used instead of
its source. The unused allocation and deallocation operations that are defined
for this copy operation are also removed.

Consider the following example where a generic linalg operation writes its
result to %temp and then copies %temp to %result. However, these two
operations can be merged into a single step. Copy removal removes the copy
operation and %temp, and replaces the uses of %temp with %result:

```mlir
func @reuseTarget(%arg0: memref<2xf32>, %result: memref<2xf32>) {
  %temp = alloc() : memref<2xf32>
  linalg.generic {
    args_in = 1 : i64,
    args_out = 1 : i64,
    indexing_maps = [#map0, #map0],
    iterator_types = ["parallel"]} %arg0, %temp {
  ^bb0(%gen2_arg0: f32, %gen2_arg1: f32):
    %tmp2 = exp %gen2_arg0 : f32
    linalg.yield %tmp2 : f32
  }: memref<2xf32>, memref<2xf32>
  "linalg.copy"(%temp, %result) : (memref<2xf32>, memref<2xf32>) -> ()
  dealloc %temp : memref<2xf32>
  return
}
```

Will be transformed to:

```mlir
func @reuseTarget(%arg0: memref<2xf32>, %result: memref<2xf32>) {
  linalg.generic {
    args_in = 1 : i64,
    args_out = 1 : i64,
    indexing_maps = [#map0, #map0],
    iterator_types = ["parallel"]} %arg0, %result {
  ^bb0(%gen2_arg0: f32, %gen2_arg1: f32):
    %tmp2 = exp %gen2_arg0 : f32
    linalg.yield %tmp2 : f32
  }: memref<2xf32>, memref<2xf32>
  return
}
```

Like before, several constraints apply to this transformation:

* The copy operation, the defining operation of the source value, and the
deallocation of the source value lie in the same block.
* There are no users/aliases of the target value between the defining
operation of the source value and the copy operation.
* There are no users/aliases of the source value between the copy operation
and the deallocation of the source value.

## Known Limitations

BufferDeallocation introduces additional copies using allocations from the
“std” dialect (“std.alloc”). Analogously, all deallocations use the free
operation “std.dealloc” from the “std” dialect. The actual copy process is
realized using “linalg.copy”. Furthermore, buffers are essentially immutable
after their creation in a block. Further limitations are known in the case of
unstructured control flow.