# Bufferization on MLIR

The general task of bufferization is to move SSA values (like tensors) into
allocated memory buffers that have to be freed when they are no longer needed.
This also involves the placement of copies to clone the contents of allocated
buffers at specific locations (similar to register allocation). On the one
hand, these copies are needed to ensure the correct behavior of a program; on
the other hand, introducing several aliases for a buffer can lead to a wrong
freeing of buffers. Therefore, we have to take both the aliases and the
program structure into account. The introduction of copies resolves this
problem, and unnecessary copies introduced during this process can be
eliminated afterwards.

```mlir
func @func_on_tensors(%source: tensor<4xf32>) -> tensor<4xf32> {
  %0 = mhlo.exp %source : (tensor<4xf32>) -> (tensor<4xf32>)
  return %0 : tensor<4xf32>
}
```

Will be transformed to:

```mlir
func @func_on_buffers(%source: memref<4xf32>, %target: memref<4xf32>) {
  %0 = alloc() : memref<4xf32>
  lmhlo.exp %source, %0 : (memref<4xf32>, memref<4xf32>) -> ()
  lmhlo.copy %0, %target : (memref<4xf32>, memref<4xf32>) -> ()
  dealloc %0 : memref<4xf32>
  return
}
```

In general, bufferization is split into three separate phases: a preparation
phase, a bufferization phase, and a post-processing phase. The assignment
process happens during dialect conversion and allocates buffers for each value
that should be moved into a memory buffer. This has to be implemented by each
dialect using the tools and patterns described below. In the course of this,
all operations have to be converted to work on `memref<T>` types (see
Preparation Phase). Afterwards, the placement transformation (see
BufferDeallocation) adds all required deallocation operations, temporary
buffers and copy operations automatically.

## Preparation Phase

In order to apply the BufferDeallocation code transformation, the input MLIR
program needs to make use of allocations (buffers in general) and `memref<T>`
types (as outlined above). If your input program does not work on buffers, you
need to perform this preparation step in order to port it to the “buffer
world”. This is a user-defined preparation step that is intended to be applied
during dialect conversion. The user has to ensure the right conversion by
providing conversion patterns that rely on a type converter to assign buffers.

A necessary step is to apply type and function signature conversion.
Furthermore, after changing all function signatures, the associated return and
call operations must comply with the new corresponding function signatures. For
this purpose, three operation conversion patterns are introduced:

* BufferizeFuncOpConverter
* BufferizeReturnOpConverter
* BufferizeCallOpConverter

In order to use these conversion patterns, the user needs to define a custom
BufferizeTypeConverter implementation.

### BufferizeTypeConverter

The BufferizeTypeConverter is an extension of the TypeConverter class that
provides additional functionality for dialect developers to decide on the
signature of a function. The extra features include:

* `setResultConversionKind` to decide whether a function result with a specific
conversion type should be appended to the function argument list as an output
argument or remain a function result.
* custom callback functions for type or value unpacking.

### ResultConversionKind

ResultConversionKind is an enum with two values

* AppendToArgumentsList
* KeepAsFunctionResult

that defines how BufferizeFuncOpConverter should handle function results when
converting the signature of a function. The other two operation conversion
patterns also use ResultConversionKind to adapt themselves to the new function
signature.

`ResultConversionKind` can be set using `setResultConversionKind`, which takes
two template parameters that correspond to the types before and after type
conversion. This mapping specifies whether the resulting type should stay a
function result or should be appended to the argument list after the
conversion is done. Note that the default value for unspecified mappings is
`KeepAsFunctionResult`. For instance, the following call configures a
`BufferizeTypeConverter` instance such that all MemRefType function results
(converted from `RankedTensorType`s) are appended to the function argument
list by BufferizeFuncOpConverter:

```c++
converter.setResultConversionKind<RankedTensorType, MemRefType>(
    BufferizeTypeConverter::AppendToArgumentsList);
```

### Callbacks for Unpacking Types

```mlir
func @func_on_tensors(%arg0: tuple<i1,f32>) -> (tuple<tensor<10xf32>, tensor<5xf32>>) {
  ...
}
```

Will be transformed to:

```mlir
func @func_on_buffers(%arg0: i1, %arg1: f32, %target0: memref<10xf32>, %target1: memref<5xf32>) {
  ...
}
```

BufferizeFuncOpConverter is also able to unpack the types of arguments and
results of a function during function signature conversion. In the example
above, it unwraps the tuple type and converts the type of each constituent.

For this purpose, users can provide custom callbacks via
`addDecomposeTypeConversion` on the `BufferizeTypeConverter` instance to
specify how a particular type (e.g. TupleType) can be unpacked. However, when
a type decomposition is provided, two additional callbacks have to be defined
as well. They specify how to pack or unpack values of that particular type.
These two callbacks can be provided by the `addArgumentMaterialization`
(packing) **and** `addDecomposeValueConversion` (unpacking) functions.

The following piece of code demonstrates this functionality by flattening a
`TupleType`.

```c++
converter.addDecomposeTypeConversion(
    [](TupleType tupleType, SmallVectorImpl<Type> &types) {
      tupleType.getFlattenedTypes(types);
      return success();
    });

converter.addArgumentMaterialization(
    [](OpBuilder &builder, TupleType resultType, ValueRange inputs,
       Location loc) -> Optional<Value> {
      if (inputs.size() == 1)
        return llvm::None;
      TypeRange inputTypes = inputs.getTypes();
      SmallVector<Type, 2> types(inputTypes.begin(), inputTypes.end());
      TupleType tuple = TupleType::get(types, builder.getContext());
      mlir::Value value = builder.create<MakeTupleOp>(loc, tuple, inputs);
      return value;
    });

converter.addDecomposeValueConversion([](OpBuilder &builder, Location loc,
                                         TupleType resultType, Value value,
                                         SmallVectorImpl<Value> &values) {
  for (unsigned i = 0, e = resultType.size(); i < e; ++i) {
    Value res = builder.create<GetTupleElementOp>(
        loc, resultType.getType(i), value, builder.getI32IntegerAttr(i));
    values.push_back(res);
  }
  return success();
});
```

In the scope of these callback functions, the elements of a tuple value can be
decomposed using `GetTupleElementOp`. Conversely, `MakeTupleOp` is used to pack
a list of values into a single tuple value.

### Bufferization Operation Conversion Patterns

The following conversion patterns can be used to conveniently transform the
signature of a function and its return and call operations:

* `BufferizeFuncOpConverter`
* `BufferizeReturnOpConverter`
* `BufferizeCallOpConverter`

Any combination of these conversion patterns can be specified by the user. If
you need to apply all of these operation converters, you can use
`populateWithBufferizeOpConversionPatterns`, which sets up all converters; a
sketch of such a setup is shown below.
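
The following is a minimal sketch of how such a conversion pass might be
assembled. It is illustrative only: the exact signature of
`populateWithBufferizeOpConversionPatterns` and the chosen legality rule are
assumptions, not a definitive recipe:

```c++
void runBufferizePreparation(ModuleOp module) {
  MLIRContext *context = module.getContext();

  // Configure the type converter: tensors become buffers.
  BufferizeTypeConverter converter;
  converter.addConversion([](RankedTensorType type) {
    return (Type)MemRefType::get(type.getShape(), type.getElementType());
  });

  // Populate all three operation converters (assumed template arguments:
  // source return op, target return op, and copy op).
  OwningRewritePatternList patterns;
  populateWithBufferizeOpConversionPatterns<mlir::ReturnOp, mlir::ReturnOp,
                                            linalg::CopyOp>(context, converter,
                                                            patterns);

  // Functions are legal once their signatures operate on buffer types only.
  ConversionTarget target(*context);
  target.addDynamicallyLegalOp<FuncOp>([&](FuncOp op) {
    return converter.isSignatureLegal(op.getType());
  });

  (void)applyFullConversion(module, target, patterns);
}
```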

### BufferizeFuncOpConverter

The BufferizeFuncOpConverter is the actual function operation converter that
applies signature conversion by using a previously defined
`BufferizeTypeConverter`.

In the following example, we configure a `BufferizeTypeConverter` instance such
that

* all RankedTensorTypes should be converted to MemRefTypes.
* all function results that are results of type conversion from
RankedTensorTypes to MemRefTypes should be appended to the function argument
list.
* all TupleTypes should be flattened and decomposed into their constituents.

```c++
converter.addConversion([](RankedTensorType type) {
  return (Type)MemRefType::get(type.getShape(), type.getElementType());
});
converter.setResultConversionKind<RankedTensorType, MemRefType>(
    BufferizeTypeConverter::AppendToArgumentsList);
converter.addDecomposeTypeConversion(
    [](TupleType tupleType, SmallVectorImpl<Type> &types) {
      tupleType.getFlattenedTypes(types);
      return success();
    });
```

Consider the following signature conversion:

```mlir
func @on_tensors(%arg1: tuple<i1,f32>) -> (tuple<memref<10xf32>, tensor<5xf32>>) {
  ...
}
```

Will be transformed to:

```mlir
func @on_buffers(%arg0: i1, %arg1: f32, %out: memref<5xf32>) -> memref<10xf32> {
  ...
}
```

Using the presented converter setup, all TupleType arguments and results are
decomposed first. The tensor<5xf32> result is converted to a memref<5xf32> type
and appended to the argument list. There is no conversion for the types memref,
i1, and f32. Therefore, the memref<10xf32> result is kept as it is and will
also be kept as a function result since there is no ResultConversionKind
mapping from a MemRefType to a MemRefType. However, if we change the
result-conversion behavior via

```c++
converter.setResultConversionKind<RankedTensorType, MemRefType>(
    BufferizeTypeConverter::KeepAsFunctionResult);
```

the output will be:

```mlir
func @on_buffers(%arg0: i1, %arg1: f32) -> (memref<10xf32>, memref<5xf32>) {
  ...
}
```

### BufferizeReturnOpConverter

When changing the signature of a function, the return operands must match the
results of the corresponding function if buffer-typed results have been
configured to be appended to the function argument list. This matching consists
of two separate steps. First, we have to remove the operands that have been
appended to the argument list as output arguments. Second, we have to introduce
additional copies for each of these operands. However, since each dialect has
its own dialect-dependent return and copy operations, this conversion pattern
comes with three template parameters: the kinds of the original return
operation, the target return operation, and the copy operation.

In the following example, two conversion patterns are inserted into the pattern
list. The `BufferizeReturnOpConverter` is set up to replace a standard return
operation with the same operation type.

```c++
patterns->insert<
    BufferizeFuncOpConverter,
    BufferizeReturnOpConverter<mlir::ReturnOp, mlir::ReturnOp, linalg::CopyOp>
    >(...);
```

Consider the following input program using a single return:

```mlir
func @on_tensors(%arg0: tensor<5xf32>, %arg1: i1) -> (tensor<5xf32>, i1) {
  return %arg0, %arg1 : tensor<5xf32>, i1
}
```

Will be transformed to:

```mlir
func @on_buffers(%arg0: memref<5xf32>, %arg1: i1, %out: memref<5xf32>) -> i1 {
  linalg.copy(%arg0, %out) : memref<5xf32>, memref<5xf32>
  return %arg1 : i1
}
```

Based on our previously configured `BufferizeTypeConverter` instance, which
requires buffer-typed function results to be appended to the function argument
list, the new `on_buffers` function signature is created. The operands of the
return operation must be adapted to the new function signature. Therefore, the
buffer-typed operand is removed from the operand list of the new return
operation. Instead, a copy operation is inserted right before the return
operation to copy the content of the operand buffer to the target buffer,
yielding the output shown above.

### BufferizeCallOpConverter

The BufferizeCallOpConverter is a call operation converter that transforms and
matches the operands and results of a call operation with the arguments and
results of the callee. Besides converting operand and result types, it
allocates a buffer for each buffer-typed result of the called function that is
appended to the argument list (if buffer-typed results have been configured to
be appended to the function argument list).

The following piece of code shows a sample call site, based on our previously
configured `BufferizeTypeConverter`:

```mlir
func @callee(%arg0: tensor<5xf32>) -> (tensor<5xf32>) {
  return %arg0 : tensor<5xf32>
}

func @caller(%arg0: tensor<5xf32>) -> tensor<5xf32> {
  %x = call @callee(%arg0) : (tensor<5xf32>) -> tensor<5xf32>
  return %x : tensor<5xf32>
}
```

Will be transformed to:

```mlir
func @callee(%arg0: memref<5xf32>, %out: memref<5xf32>) {
  linalg.copy(%arg0, %out) : memref<5xf32>, memref<5xf32>
  return
}

func @caller(%arg0: memref<5xf32>, %out: memref<5xf32>) {
  %0 = alloc() : memref<5xf32>
  call @callee(%arg0, %0) : (memref<5xf32>, memref<5xf32>) -> ()
  linalg.copy(%0, %out) : memref<5xf32>, memref<5xf32>
  return
}
```

### Summarizing Example

To summarize all preparation converters, the following sample is a complete
listing of an input IR program and its output after applying all converters:

```mlir
func @callee(%arg0: tuple<tensor<5xf32>,i1>) -> tuple<tensor<5xf32>,i1> {
  return %arg0 : tuple<tensor<5xf32>,i1>
}

func @caller(%arg0: tuple<tensor<5xf32>,i1>) -> tuple<tensor<5xf32>,i1> {
  %x = call @callee(%arg0) : (tuple<tensor<5xf32>,i1>) -> tuple<tensor<5xf32>,i1>
  return %x : tuple<tensor<5xf32>,i1>
}
```

Will be transformed to:

```mlir
func @callee(%arg0: memref<5xf32>, %arg1: i1, %arg2: memref<5xf32>) -> i1 {
  %0 = "test.make_tuple"(%arg0, %arg1) : (memref<5xf32>, i1) -> tuple<memref<5xf32>, i1>
  %1 = "test.get_tuple_element"(%0) {index = 0 : i32} : (tuple<memref<5xf32>, i1>) -> memref<5xf32>
  %2 = "test.get_tuple_element"(%0) {index = 1 : i32} : (tuple<memref<5xf32>, i1>) -> i1
  linalg.copy(%1, %arg2) : memref<5xf32>, memref<5xf32>
  return %2 : i1
}

func @caller(%arg0: memref<5xf32>, %arg1: i1, %arg2: memref<5xf32>) -> i1 {
  %0 = "test.make_tuple"(%arg0, %arg1) : (memref<5xf32>, i1) -> tuple<memref<5xf32>, i1>
  %1 = "test.get_tuple_element"(%0) {index = 0 : i32} : (tuple<memref<5xf32>, i1>) -> memref<5xf32>
  %2 = "test.get_tuple_element"(%0) {index = 1 : i32} : (tuple<memref<5xf32>, i1>) -> i1
  %3 = alloc() : memref<5xf32>
  %4 = call @callee(%1, %2, %3) : (memref<5xf32>, i1, memref<5xf32>) -> i1
  %5 = "test.make_tuple"(%3, %4) : (memref<5xf32>, i1) -> tuple<memref<5xf32>, i1>
  %6 = "test.get_tuple_element"(%5) {index = 0 : i32} : (tuple<memref<5xf32>, i1>) -> memref<5xf32>
  %7 = "test.get_tuple_element"(%5) {index = 1 : i32} : (tuple<memref<5xf32>, i1>) -> i1
  linalg.copy(%6, %arg2) : memref<5xf32>, memref<5xf32>
  return %7 : i1
}
```

## Buffer Deallocation - Internal Functionality

This section covers the internal functionality of the BufferDeallocation
transformation. The transformation consists of several passes. The main pass,
called BufferDeallocation, can be applied via “-buffer-deallocation” on MLIR
programs. Currently, there are three optimization passes that move allocs and
convert AllocOps to AllocaOps, if possible. The first two can be applied via
“-buffer-hoisting” and “-buffer-loop-hoisting”, the third one via
“-promote-buffers-to-stack”. Note that these optimizations must be applied
before running the BufferDeallocation pass.
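
For example, assuming a standalone `mlir-opt` invocation, a pipeline that
respects this ordering could look like the following sketch (the input file
name is a placeholder):

```
mlir-opt -buffer-hoisting -buffer-loop-hoisting -promote-buffers-to-stack \
    -buffer-deallocation input.mlir
```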

### Requirements

In order to use BufferDeallocation on an arbitrary dialect, several
control-flow interfaces have to be implemented when using custom operations.
This is particularly important to understand the implicit control-flow
dependencies between different parts of the input program. Without implementing
the following interfaces, control-flow relations cannot be discovered properly
and the resulting programs can become invalid:

* Branch-like terminators should implement the `BranchOpInterface` to query and
manipulate associated operands.
* Operations involving structured control flow have to implement the
`RegionBranchOpInterface` to model inter-region control flow.
* Terminators yielding values to their parent operation (in particular in the
scope of nested regions within `RegionBranchOpInterface`-based operations)
should implement the `ReturnLike` trait to represent logical “value returns”.

Examples of dialects that implement all required interfaces are the “std” and
“scf” dialects.

### Detection of Buffer Allocations

The first step of the BufferDeallocation transformation is to identify
manageable allocation operations that implement the `SideEffects` interface.
Furthermore, these ops need to apply the effect `MemoryEffects::Allocate` to a
particular result value while not using the resource
`SideEffects::AutomaticAllocationScopeResource` (since it is currently reserved
for allocations, like `Alloca`, that will be automatically deallocated by a
parent scope). Allocations that have not been detected in this phase will not
be tracked internally, and thus not deallocated automatically. However,
BufferDeallocation is fully compatible with “hybrid” setups in which tracked
and untracked allocations are mixed:

```mlir
func @mixedAllocation(%arg0: i1) {
  %0 = alloca() : memref<2xf32>  // aliases: %2
  %1 = alloc() : memref<2xf32>   // aliases: %2
  cond_br %arg0, ^bb1, ^bb2
^bb1:
  use(%0)
  br ^bb3(%0 : memref<2xf32>)
^bb2:
  use(%1)
  br ^bb3(%1 : memref<2xf32>)
^bb3(%2: memref<2xf32>):
  ...
}
```

Example of using a conditional branch with alloc and alloca. BufferDeallocation
can detect and handle the different allocation types that might be intermixed.

Note: the current version does not support allocation operations returning
multiple result buffers.

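In terms of the interfaces above, the detection step can be thought of as a
predicate like the following hedged sketch (the helper name
`isManagedAllocation` is hypothetical, not part of the transformation's API):

```c++
// Returns true if `op` is a "manageable" allocation: it applies the
// MemoryEffects::Allocate effect to a result and does not allocate in an
// automatic allocation scope (like alloca).
bool isManagedAllocation(Operation *op) {
  auto effectOp = dyn_cast<MemoryEffectOpInterface>(op);
  if (!effectOp)
    return false;
  SmallVector<MemoryEffects::EffectInstance, 2> effects;
  effectOp.getEffects(effects);
  return llvm::any_of(effects, [](MemoryEffects::EffectInstance &instance) {
    return isa<MemoryEffects::Allocate>(instance.getEffect()) &&
           instance.getResource() !=
               SideEffects::AutomaticAllocationScopeResource::get();
  });
}
```
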
### Conversion from AllocOp to AllocaOp

The PromoteBuffersToStack pass converts AllocOps to AllocaOps, if possible. In
some cases, it can be useful to use such stack-based buffers instead of
heap-based buffers. The conversion is subject to several constraints:

* Control flow
* Buffer size
* Dynamic size

If a buffer leaves a block, we are not allowed to convert it into an alloca. A
large buffer could technically be converted, but to avoid stack overflows it
makes sense to limit the size of these buffers and only convert small ones.
The size limit can be set via a pass option; the current default value is 1KB.
Furthermore, we cannot convert buffers with dynamic size, since their
dimensions are not known a priori.

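The following minimal sketch illustrates the effect of the pass under these
constraints; `use` stands for an arbitrary operation reading the buffer. The
buffer is small, statically sized and does not escape its block, so the
allocation can be promoted to the stack:

```mlir
func @promote_buffer() {
  %0 = alloc() : memref<2xf32>  // small, static size, does not escape
  use(%0)
  return
}
```

Will be transformed to:

```mlir
func @promote_buffer() {
  %0 = alloca() : memref<2xf32>  // stack-based, freed by the parent scope
  use(%0)
  return
}
```
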
### Movement and Placement of Allocations

Using the buffer hoisting pass, all buffer allocations are moved as far upwards
as possible in order to group them and make upcoming optimizations easier by
limiting the search space. Such a movement is shown in the following graphs.
In addition, we are able to statically free an alloc if we move it into a
dominator of all of its uses. This simplifies further optimizations (e.g.
buffer fusion) in the future. However, the movement of allocations is limited
by external data dependencies (in particular in the case of allocations of
dynamically shaped types). Furthermore, allocations can be moved out of nested
regions, if necessary. In order to move allocations to valid locations with
respect to their uses, we leverage liveness information.

The following code snippet shows a conditional branch before running the
BufferHoisting pass:

![branch_example_pre_move](/includes/img/branch_example_pre_move.svg)

```mlir
func @condBranch(%arg0: i1, %arg1: memref<2xf32>, %arg2: memref<2xf32>) {
  cond_br %arg0, ^bb1, ^bb2
^bb1:
  br ^bb3(%arg1 : memref<2xf32>)
^bb2:
  %0 = alloc() : memref<2xf32>  // aliases: %1
  use(%0)
  br ^bb3(%0 : memref<2xf32>)
^bb3(%1: memref<2xf32>):  // %1 could be %0 or %arg1
  "linalg.copy"(%1, %arg2) : (memref<2xf32>, memref<2xf32>) -> ()
  return
}
```

Applying the BufferHoisting pass on this program results in the following piece
of code:

![branch_example_post_move](/includes/img/branch_example_post_move.svg)

```mlir
func @condBranch(%arg0: i1, %arg1: memref<2xf32>, %arg2: memref<2xf32>) {
  %0 = alloc() : memref<2xf32>  // moved to bb0
  cond_br %arg0, ^bb1, ^bb2
^bb1:
  br ^bb3(%arg1 : memref<2xf32>)
^bb2:
  use(%0)
  br ^bb3(%0 : memref<2xf32>)
^bb3(%1: memref<2xf32>):
  "linalg.copy"(%1, %arg2) : (memref<2xf32>, memref<2xf32>) -> ()
  return
}
```

The alloc is moved from bb2 to the beginning of the function and passed as an
argument to bb3.

The following example demonstrates an allocation using dynamically shaped
types. Due to the data dependency of the allocation on %0, we cannot move the
allocation out of bb2 in this case:

```mlir
func @condBranchDynamicType(
  %arg0: i1,
  %arg1: memref<?xf32>,
  %arg2: memref<?xf32>,
  %arg3: index) {
  cond_br %arg0, ^bb1, ^bb2(%arg3: index)
^bb1:
  br ^bb3(%arg1 : memref<?xf32>)
^bb2(%0: index):
  %1 = alloc(%0) : memref<?xf32>  // cannot be moved upwards due to the data
                                  // dependency on %0
  use(%1)
  br ^bb3(%1 : memref<?xf32>)
^bb3(%2: memref<?xf32>):
  "linalg.copy"(%2, %arg2) : (memref<?xf32>, memref<?xf32>) -> ()
  return
}
```

### Introduction of Copies

In order to guarantee that all allocated buffers are freed properly, we have to
pay attention to the control flow and to all potential aliases a buffer
allocation can have. Since not all allocations can be safely freed with respect
to their aliases (see the following code snippet), it is often required to
introduce copies to eliminate such aliases. Consider the following example in
which the allocations have already been placed:

```mlir
func @branch(%arg0: i1) {
  %0 = alloc() : memref<2xf32>  // aliases: %2
  cond_br %arg0, ^bb1, ^bb2
^bb1:
  %1 = alloc() : memref<2xf32>  // resides here for demonstration purposes
                                // aliases: %2
  br ^bb3(%1 : memref<2xf32>)
^bb2:
  use(%0)
  br ^bb3(%0 : memref<2xf32>)
^bb3(%2: memref<2xf32>):
  return
}
```

The first alloc can be safely freed after the live range of its post-dominator
block (bb3). The alloc in bb1 has an alias %2 in bb3 that also keeps this
buffer alive until the end of bb3. Since we cannot determine the actual
branches that will be taken at runtime, we have to ensure that all buffers are
freed correctly in bb3 regardless of the branches we will take to reach the
exit block. This makes it necessary to introduce a copy for %2, which allows us
to free %1 in bb1 and %0 in bb3. Afterwards, we can continue processing all
aliases of %2 (none in this case) and we can safely free %2 at the end of the
sample program. This sample demonstrates that not all allocations can be safely
freed in their associated post-dominator blocks. Instead, we have to pay
attention to all of their aliases.

Applying the BufferDeallocation pass to the program above yields the following
result:

```mlir
func @branch(%arg0: i1) {
  %0 = alloc() : memref<2xf32>
  cond_br %arg0, ^bb1, ^bb2
^bb1:
  %1 = alloc() : memref<2xf32>
  %3 = alloc() : memref<2xf32>  // temp copy for %1
  "linalg.copy"(%1, %3) : (memref<2xf32>, memref<2xf32>) -> ()
  dealloc %1 : memref<2xf32>  // %1 can be safely freed here
  br ^bb3(%3 : memref<2xf32>)
^bb2:
  use(%0)
  %4 = alloc() : memref<2xf32>  // temp copy for %0
  "linalg.copy"(%0, %4) : (memref<2xf32>, memref<2xf32>) -> ()
  br ^bb3(%4 : memref<2xf32>)
^bb3(%2: memref<2xf32>):
  dealloc %2 : memref<2xf32>  // free temp buffer %2
  dealloc %0 : memref<2xf32>  // %0 can be safely freed here
  return
}
```

Note that a temporary buffer for %2 was introduced to free all allocations
properly. Note further that the unnecessary allocation of %3 can be easily
removed using one of the post-pass transformations.

Reconsider the previously introduced sample demonstrating dynamically shaped
types:

```mlir
func @condBranchDynamicType(
  %arg0: i1,
  %arg1: memref<?xf32>,
  %arg2: memref<?xf32>,
  %arg3: index) {
  cond_br %arg0, ^bb1, ^bb2(%arg3: index)
^bb1:
  br ^bb3(%arg1 : memref<?xf32>)
^bb2(%0: index):
  %1 = alloc(%0) : memref<?xf32>  // aliases: %2
  use(%1)
  br ^bb3(%1 : memref<?xf32>)
^bb3(%2: memref<?xf32>):
  "linalg.copy"(%2, %arg2) : (memref<?xf32>, memref<?xf32>) -> ()
  return
}
```

In the presence of dynamically shaped types, we have to parameterize the
allocations with additional dimension information of the source buffers we want
to copy from. BufferDeallocation automatically introduces all required
operations to extract dimension specifications and wires them up with the
associated allocations:

```mlir
func @condBranchDynamicType(
  %arg0: i1,
  %arg1: memref<?xf32>,
  %arg2: memref<?xf32>,
  %arg3: index) {
  cond_br %arg0, ^bb1, ^bb2(%arg3 : index)
^bb1:
  %c0 = constant 0 : index
  %0 = dim %arg1, %c0 : memref<?xf32>   // dimension operation to parameterize
                                        // the following temp allocation
  %1 = alloc(%0) : memref<?xf32>
  "linalg.copy"(%arg1, %1) : (memref<?xf32>, memref<?xf32>) -> ()
  br ^bb3(%1 : memref<?xf32>)
^bb2(%2: index):
  %3 = alloc(%2) : memref<?xf32>
  use(%3)
  %c0_0 = constant 0 : index
  %4 = dim %3, %c0_0 : memref<?xf32>  // dimension operation to parameterize
                                      // the following temp allocation
  %5 = alloc(%4) : memref<?xf32>
  "linalg.copy"(%3, %5) : (memref<?xf32>, memref<?xf32>) -> ()
  dealloc %3 : memref<?xf32>  // %3 can be safely freed here
  br ^bb3(%5 : memref<?xf32>)
^bb3(%6: memref<?xf32>):
  "linalg.copy"(%6, %arg2) : (memref<?xf32>, memref<?xf32>) -> ()
  dealloc %6 : memref<?xf32>  // %6 can be safely freed here
  return
}
```

BufferDeallocation performs a fix-point iteration that takes all aliases of all
tracked allocations into account. We initialize the general iteration process
using all tracked allocations and their associated aliases. As soon as we
encounter an alias that is not properly dominated by our allocation, we mark
this alias as _critical_ (it needs to be freed and tracked by the internal
fix-point iteration). The following sample demonstrates the presence of
critical and non-critical aliases:

![nested_branch_example_pre_move](/includes/img/nested_branch_example_pre_move.svg)

```mlir
func @condBranchDynamicTypeNested(
  %arg0: i1,
  %arg1: memref<?xf32>,  // aliases: %3, %4
  %arg2: memref<?xf32>,
  %arg3: index) {
  cond_br %arg0, ^bb1, ^bb2(%arg3: index)
^bb1:
  br ^bb6(%arg1 : memref<?xf32>)
^bb2(%0: index):
  %1 = alloc(%0) : memref<?xf32>   // cannot be moved upwards due to the data
                                   // dependency on %0
                                   // aliases: %2, %3, %4
  use(%1)
  cond_br %arg0, ^bb3, ^bb4
^bb3:
  br ^bb5(%1 : memref<?xf32>)
^bb4:
  br ^bb5(%1 : memref<?xf32>)
^bb5(%2: memref<?xf32>):  // non-crit. alias of %1, since %1 dominates %2
  br ^bb6(%2 : memref<?xf32>)
^bb6(%3: memref<?xf32>):  // crit. alias of %arg1 and %2 (in other words %1)
  br ^bb7(%3 : memref<?xf32>)
^bb7(%4: memref<?xf32>):  // non-crit. alias of %3, since %3 dominates %4
  "linalg.copy"(%4, %arg2) : (memref<?xf32>, memref<?xf32>) -> ()
  return
}
```

Applying BufferDeallocation yields the following output:

![nested_branch_example_post_move](/includes/img/nested_branch_example_post_move.svg)

```mlir
func @condBranchDynamicTypeNested(
  %arg0: i1,
  %arg1: memref<?xf32>,
  %arg2: memref<?xf32>,
  %arg3: index) {
  cond_br %arg0, ^bb1, ^bb2(%arg3 : index)
^bb1:
  %c0 = constant 0 : index
  %d0 = dim %arg1, %c0 : memref<?xf32>
  %5 = alloc(%d0) : memref<?xf32>  // temp buffer required due to alias %3
  "linalg.copy"(%arg1, %5) : (memref<?xf32>, memref<?xf32>) -> ()
  br ^bb6(%5 : memref<?xf32>)
^bb2(%0: index):
  %1 = alloc(%0) : memref<?xf32>
  use(%1)
  cond_br %arg0, ^bb3, ^bb4
^bb3:
  br ^bb5(%1 : memref<?xf32>)
^bb4:
  br ^bb5(%1 : memref<?xf32>)
^bb5(%2: memref<?xf32>):
  %c0_0 = constant 0 : index
  %d1 = dim %2, %c0_0 : memref<?xf32>
  %6 = alloc(%d1) : memref<?xf32>  // temp buffer required due to alias %3
  "linalg.copy"(%1, %6) : (memref<?xf32>, memref<?xf32>) -> ()
  dealloc %1 : memref<?xf32>
  br ^bb6(%6 : memref<?xf32>)
^bb6(%3: memref<?xf32>):
  br ^bb7(%3 : memref<?xf32>)
^bb7(%4: memref<?xf32>):
  "linalg.copy"(%4, %arg2) : (memref<?xf32>, memref<?xf32>) -> ()
  dealloc %3 : memref<?xf32>  // free %3, since %4 is a non-crit. alias of %3
  return
}
```

Since %3 is a critical alias, BufferDeallocation introduces an additional
temporary copy in all predecessor blocks. %3 has an additional (non-critical)
alias %4 that extends the live range until the end of bb7. Therefore, we can
free %3 after its last use, while taking all aliases into account. Note that %4
does not need to be freed, since we did not introduce a copy for it.

The actual introduction of buffer copies is done after the fix-point iteration
has terminated and all critical aliases have been detected. A critical alias
can be either a block argument or another value that is returned by an
operation. Copies for block arguments are handled by analyzing all predecessor
blocks. This is primarily done by querying the `BranchOpInterface` of the
associated branch terminators that can jump to the current block. Consider the
following example which involves a simple branch and the critical block
argument %2:

```mlir
  custom.br ^bb1(..., %0 : ...)
  ...
  custom.br ^bb1(..., %1 : ...)
  ...
^bb1(%2: memref<2xf32>):
  ...
```

The `BranchOpInterface` allows us to determine the actual values that will be
passed to block bb1 and its argument %2 by analyzing its predecessor blocks.
Once we have resolved the values %0 and %1 (which are associated with %2 in
this sample), we can introduce a temporary buffer and clone its contents into
the new buffer. Afterwards, we rewire the branch operands to use the newly
allocated buffer instead. However, blocks can have implicitly defined
predecessors by parent ops that implement the `RegionBranchOpInterface`. This
can be the case if a block argument belongs to the entry block of a region. In
this setting, we have to identify all predecessor regions defined by the parent
operation. For every region, we need to get all terminator operations
implementing the `ReturnLike` trait, indicating that they can branch to our
current block. Finally, we can use functionality similar to that described
above to add the temporary copy. This time, we can modify the terminator
operands directly without touching a high-level interface.

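To make the `BranchOpInterface` query concrete, the following hedged sketch
resolves the value that a predecessor terminator forwards to a successor block
argument. It assumes the `Optional<OperandRange>`-returning
`getSuccessorOperands` of the interface; the helper name and index parameters
are illustrative:

```c++
// Returns the operand that `terminator` forwards to the block argument with
// number `argIndex` of its successor number `succIndex`, if this can be
// determined statically.
Value getForwardedOperand(Operation *terminator, unsigned succIndex,
                          unsigned argIndex) {
  if (auto branch = dyn_cast<BranchOpInterface>(terminator))
    if (auto operands = branch.getSuccessorOperands(succIndex))
      return (*operands)[argIndex];
  return Value();  // The predecessor value is unknown.
}
```
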
Consider the following inner-region control-flow sample that uses an imaginary
“custom.region_if” operation. It either executes the “then” or “else” region
and always continues to the “join” region. The “custom.region_if_yield”
operation returns a result to the parent operation. This sample demonstrates
the use of the `RegionBranchOpInterface` to determine predecessors in order to
infer the high-level control flow:

```mlir
func @inner_region_control_flow(
  %arg0 : index,
  %arg1 : index) -> memref<?x?xf32> {
  %0 = alloc(%arg0, %arg0) : memref<?x?xf32>
  %1 = custom.region_if %0 : memref<?x?xf32> -> (memref<?x?xf32>)
   then(%arg2 : memref<?x?xf32>) {  // aliases: %arg4, %1
    custom.region_if_yield %arg2 : memref<?x?xf32>
   } else(%arg3 : memref<?x?xf32>) {  // aliases: %arg4, %1
    custom.region_if_yield %arg3 : memref<?x?xf32>
   } join(%arg4 : memref<?x?xf32>) {  // aliases: %1
    custom.region_if_yield %arg4 : memref<?x?xf32>
   }
  return %1 : memref<?x?xf32>
}
```

![region_branch_example_pre_move](/includes/img/region_branch_example_pre_move.svg)

Non-block arguments (i.e., other values) can become aliases when they are
returned by dialect-specific operations. BufferDeallocation supports this
behavior via the `RegionBranchOpInterface`. Consider the following example that
uses an “scf.if” operation to determine the value of %2 at runtime, which
creates an alias:

```mlir
func @nested_region_control_flow(%arg0 : index, %arg1 : index) -> memref<?x?xf32> {
  %0 = cmpi "eq", %arg0, %arg1 : index
  %1 = alloc(%arg0, %arg0) : memref<?x?xf32>
  %2 = scf.if %0 -> (memref<?x?xf32>) {
    scf.yield %1 : memref<?x?xf32>   // %2 will be an alias of %1
  } else {
    %3 = alloc(%arg0, %arg1) : memref<?x?xf32>  // nested allocation in a
                                                // divergent branch
    use(%3)
    scf.yield %1 : memref<?x?xf32>   // %2 will be an alias of %1
  }
  return %2 : memref<?x?xf32>
}
```

In this example, a dealloc is inserted to release the buffer within the else
block since it cannot be accessed by the remainder of the program. Accessing
the `RegionBranchOpInterface` allows us to infer that %2 is a non-critical
alias of %1 which does not need to be tracked.

```mlir
func @nested_region_control_flow(%arg0: index, %arg1: index) -> memref<?x?xf32> {
  %0 = cmpi "eq", %arg0, %arg1 : index
  %1 = alloc(%arg0, %arg0) : memref<?x?xf32>
  %2 = scf.if %0 -> (memref<?x?xf32>) {
    scf.yield %1 : memref<?x?xf32>
  } else {
    %3 = alloc(%arg0, %arg1) : memref<?x?xf32>
    use(%3)
    dealloc %3 : memref<?x?xf32>  // %3 can be safely freed here
    scf.yield %1 : memref<?x?xf32>
  }
  return %2 : memref<?x?xf32>
}
```

Analogous to the previous case, we have to detect all terminator operations in
all attached regions of “scf.if” that provide a value to the parent operation
(in this sample via scf.yield). Querying the `RegionBranchOpInterface` allows
us to determine the regions that “return” a result to their parent operation.
Like before, we have to update all `ReturnLike` terminators as described above.
Reconsider a slightly adapted version of the “custom.region_if” example from
above that uses a nested allocation:

```mlir
func @inner_region_control_flow_div(
  %arg0 : index,
  %arg1 : index) -> memref<?x?xf32> {
  %0 = alloc(%arg0, %arg0) : memref<?x?xf32>
  %1 = custom.region_if %0 : memref<?x?xf32> -> (memref<?x?xf32>)
   then(%arg2 : memref<?x?xf32>) {  // aliases: %arg4, %1
    custom.region_if_yield %arg2 : memref<?x?xf32>
   } else(%arg3 : memref<?x?xf32>) {
    %2 = alloc(%arg0, %arg1) : memref<?x?xf32>  // aliases: %arg4, %1
    custom.region_if_yield %2 : memref<?x?xf32>
   } join(%arg4 : memref<?x?xf32>) {  // aliases: %1
    custom.region_if_yield %arg4 : memref<?x?xf32>
   }
  return %1 : memref<?x?xf32>
}
```

Since the allocation %2 happens in a divergent branch and cannot be safely
deallocated in a post-dominator, %arg4 will be considered a critical alias.
Furthermore, %arg4 is returned to its parent operation and has an alias %1.
This causes BufferDeallocation to introduce additional copies:

```mlir
func @inner_region_control_flow_div(
  %arg0 : index,
  %arg1 : index) -> memref<?x?xf32> {
  %0 = alloc(%arg0, %arg0) : memref<?x?xf32>
  %1 = custom.region_if %0 : memref<?x?xf32> -> (memref<?x?xf32>)
   then(%arg2 : memref<?x?xf32>) {
    %c0 = constant 0 : index  // determine dimension extents for temp allocation
    %2 = dim %arg2, %c0 : memref<?x?xf32>
    %c1 = constant 1 : index
    %3 = dim %arg2, %c1 : memref<?x?xf32>
    %4 = alloc(%2, %3) : memref<?x?xf32>  // temp buffer required due to critic.
                                          // alias %arg4
    linalg.copy(%arg2, %4) : memref<?x?xf32>, memref<?x?xf32>
    custom.region_if_yield %4 : memref<?x?xf32>
   } else(%arg3 : memref<?x?xf32>) {
    %2 = alloc(%arg0, %arg1) : memref<?x?xf32>
    %c0 = constant 0 : index  // determine dimension extents for temp allocation
    %3 = dim %2, %c0 : memref<?x?xf32>
    %c1 = constant 1 : index
    %4 = dim %2, %c1 : memref<?x?xf32>
    %5 = alloc(%3, %4) : memref<?x?xf32>  // temp buffer required due to critic.
                                          // alias %arg4
    linalg.copy(%2, %5) : memref<?x?xf32>, memref<?x?xf32>
    dealloc %2 : memref<?x?xf32>
    custom.region_if_yield %5 : memref<?x?xf32>
   } join(%arg4: memref<?x?xf32>) {
    %c0 = constant 0 : index  // determine dimension extents for temp allocation
    %2 = dim %arg4, %c0 : memref<?x?xf32>
    %c1 = constant 1 : index
    %3 = dim %arg4, %c1 : memref<?x?xf32>
    %4 = alloc(%2, %3) : memref<?x?xf32>  // this allocation will be removed by
                                          // applying the copy removal pass
    linalg.copy(%arg4, %4) : memref<?x?xf32>, memref<?x?xf32>
    dealloc %arg4 : memref<?x?xf32>
    custom.region_if_yield %4 : memref<?x?xf32>
   }
  dealloc %0 : memref<?x?xf32>  // %0 can be safely freed here
  return %1 : memref<?x?xf32>
}
```

### Placement of Deallocs

After introducing allocs and copies, deallocs have to be placed to free
allocated memory and avoid memory leaks. The deallocation needs to take place
after the last use of the given value. The position can be determined by
calculating the common post-dominator of all values using their remaining
non-critical aliases. A special case is the presence of back edges, since such
edges can cause memory leaks when a newly allocated buffer flows back to
another part of the program. In these cases, we need to free the associated
buffer instances from the previous iteration by inserting additional deallocs.

Consider the following “scf.for” use case containing a nested structured
control-flow if:

```mlir
func @loop_nested_if(
  %lb: index,
  %ub: index,
  %step: index,
  %buf: memref<2xf32>,
  %res: memref<2xf32>) {
  %0 = scf.for %i = %lb to %ub step %step
    iter_args(%iterBuf = %buf) -> memref<2xf32> {
    %1 = cmpi "eq", %i, %ub : index
    %2 = scf.if %1 -> (memref<2xf32>) {
      %3 = alloc() : memref<2xf32>  // makes %2 a critical alias due to a
                                    // divergent allocation
      use(%3)
      scf.yield %3 : memref<2xf32>
    } else {
      scf.yield %iterBuf : memref<2xf32>
    }
    scf.yield %2 : memref<2xf32>
  }
  "linalg.copy"(%0, %res) : (memref<2xf32>, memref<2xf32>) -> ()
  return
}
```

In this example, the _then_ branch of the nested “scf.if” operation returns a
newly allocated buffer.

Since this allocation happens in the scope of a divergent branch, %2 becomes a
critical alias that needs to be handled. As before, we have to insert
additional copies to eliminate this alias using copies of %3 and %iterBuf. This
guarantees that %2 will be a newly allocated buffer that is returned in each
iteration. However, “returning” %2 to its alias %iterBuf turns %iterBuf into a
critical alias as well. In other words, we have to create a copy of %2 to pass
it to %iterBuf. Since this jump represents a back edge, and %2 will always be a
new buffer, we have to free the buffer from the previous iteration to avoid
memory leaks:

```mlir
func @loop_nested_if(
  %lb: index,
  %ub: index,
  %step: index,
  %buf: memref<2xf32>,
  %res: memref<2xf32>) {
  %4 = alloc() : memref<2xf32>
  "linalg.copy"(%buf, %4) : (memref<2xf32>, memref<2xf32>) -> ()
  %0 = scf.for %i = %lb to %ub step %step
    iter_args(%iterBuf = %4) -> memref<2xf32> {
    %1 = cmpi "eq", %i, %ub : index
    %2 = scf.if %1 -> (memref<2xf32>) {
      %3 = alloc() : memref<2xf32>  // makes %2 a critical alias
      use(%3)
      %5 = alloc() : memref<2xf32>  // temp copy due to crit. alias %2
      "linalg.copy"(%3, %5) : (memref<2xf32>, memref<2xf32>) -> ()
      dealloc %3 : memref<2xf32>
      scf.yield %5 : memref<2xf32>
    } else {
      %6 = alloc() : memref<2xf32>  // temp copy due to crit. alias %2
      "linalg.copy"(%iterBuf, %6) : (memref<2xf32>, memref<2xf32>) -> ()
      scf.yield %6 : memref<2xf32>
    }
    %7 = alloc() : memref<2xf32>  // temp copy due to crit. alias %iterBuf
    "linalg.copy"(%2, %7) : (memref<2xf32>, memref<2xf32>) -> ()
    dealloc %2 : memref<2xf32>
    dealloc %iterBuf : memref<2xf32>  // free backedge iteration variable
    scf.yield %7 : memref<2xf32>
  }
  "linalg.copy"(%0, %res) : (memref<2xf32>, memref<2xf32>) -> ()
  dealloc %0 : memref<2xf32>  // free temp copy %0
  return
}
```

Example of loop-like control flow. The CFG contains back edges that have to be
handled to avoid memory leaks. The bufferization is able to free the backedge
iteration variable %iterBuf.

### Private Analyses Implementations

The BufferDeallocation transformation relies on one primary control-flow
analysis: BufferPlacementAliasAnalysis. Furthermore, we also use dominance and
liveness information to place and move nodes. The liveness analysis determines
the live range of a given value. Within this range, a value is alive and can or
will be used in the course of the program. After this range, the value is dead
and can be discarded; in our case, the buffer can be freed. To place the
allocs, we need to know from which position onwards a value will be alive. The
allocs have to be placed in front of this position. However, the most important
analysis is the alias analysis that is needed to introduce copies and to place
all deallocations.
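
As a hedged illustration of how liveness information can be queried, the
following sketch uses MLIR's `Liveness` analysis to find the operation after
which a dealloc could be placed; the helper `findLastUse` is hypothetical and
only covers the simple single-block case:

```c++
#include "mlir/Analysis/Liveness.h"

// Returns the last user of `buffer` within its live range, i.e. the point
// after which a dealloc could be inserted in the single-block case.
Operation *findLastUse(Value buffer, FuncOp func) {
  Liveness liveness(func);
  Operation *lastUse = nullptr;
  for (Operation *user : buffer.getUsers())
    if (liveness.isLastUse(buffer, user))
      lastUse = user;
  return lastUse;
}
```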

## Post Phase

In order to limit the complexity of the BufferDeallocation transformation, some
small code-polishing and optimization transformations are not applied
on-the-fly during placement. Currently, there is only the CopyRemoval
transformation to remove unnecessary copy and allocation operations.

Note: further transformations might be added to the post-pass phase in the
future.

1040
1041A common pattern that arises during placement is the introduction of
1042unnecessary temporary copies that are used instead of the original source
1043buffer. For this reason, there is a post-pass transformation that removes these
1044allocations and copies via `-copy-removal`. This pass, besides removing
1045unnecessary copy operations, will also remove the dead allocations and their
1046corresponding deallocation operations. The CopyRemoval pass can currently be
1047applied to operations that implement the `CopyOpInterface` in any of these two
1048situations which are
1049
1050* reusing the source buffer of the copy operation.
1051* reusing the target buffer of the copy operation.
1052
### Reusing the Source Buffer of the Copy Operation

In this case, the source of the copy operation can be used instead of the
target. The unused allocation and deallocation operations that are defined for
this copy operation are also removed. The following is a working example
generated by the BufferDeallocation pass that allocates a buffer with dynamic
size. A deeper analysis of this sample reveals that the allocation of %10, the
copy, and the dealloc of %7 are redundant and can be removed.

```mlir
func @dynamic_allocation(%arg0: index, %arg1: index) -> memref<?x?xf32> {
  %7 = alloc(%arg0, %arg1) : memref<?x?xf32>
  %c0_0 = constant 0 : index
  %8 = dim %7, %c0_0 : memref<?x?xf32>
  %c1_1 = constant 1 : index
  %9 = dim %7, %c1_1 : memref<?x?xf32>
  %10 = alloc(%8, %9) : memref<?x?xf32>
  linalg.copy(%7, %10) : memref<?x?xf32>, memref<?x?xf32>
  dealloc %7 : memref<?x?xf32>
  return %10 : memref<?x?xf32>
}
```

Will be transformed to:

```mlir
func @dynamic_allocation(%arg0: index, %arg1: index) -> memref<?x?xf32> {
  %7 = alloc(%arg0, %arg1) : memref<?x?xf32>
  %c0_0 = constant 0 : index
  %8 = dim %7, %c0_0 : memref<?x?xf32>
  %c1_1 = constant 1 : index
  %9 = dim %7, %c1_1 : memref<?x?xf32>
  return %7 : memref<?x?xf32>
}
```

In this case, the additional copy %10 can be replaced with its original source
buffer %7. This also applies to the associated dealloc operation of %7.

To limit the complexity of this transformation, it only removes copy operations
when the following constraints are met:

* The copy operation, the defining operation for the target value, and the
deallocation of the source value lie in the same block.
* There are no users/aliases of the target value between the defining operation
of the target value and its copy operation.
* There are no users/aliases of the source value between its associated copy
operation and the deallocation of the source value.

### Reusing the Target Buffer of the Copy Operation

In this case, the target buffer of the copy operation can be used instead of
its source. The unused allocation and deallocation operations that are defined
for this copy operation are also removed.

Consider the following example where a generic linalg operation writes the
result to %temp and then copies %temp to %result. However, these two operations
can be merged into a single step. Copy removal removes the copy operation and
%temp, and replaces the uses of %temp with %result:

```mlir
func @reuseTarget(%arg0: memref<2xf32>, %result: memref<2xf32>) {
  %temp = alloc() : memref<2xf32>
  linalg.generic {
    args_in = 1 : i64,
    args_out = 1 : i64,
    indexing_maps = [#map0, #map0],
    iterator_types = ["parallel"]} %arg0, %temp {
  ^bb0(%gen2_arg0: f32, %gen2_arg1: f32):
    %tmp2 = exp %gen2_arg0 : f32
    linalg.yield %tmp2 : f32
  }: memref<2xf32>, memref<2xf32>
  "linalg.copy"(%temp, %result) : (memref<2xf32>, memref<2xf32>) -> ()
  dealloc %temp : memref<2xf32>
  return
}
```

Will be transformed to:

```mlir
func @reuseTarget(%arg0: memref<2xf32>, %result: memref<2xf32>) {
  linalg.generic {
    args_in = 1 : i64,
    args_out = 1 : i64,
    indexing_maps = [#map0, #map0],
    iterator_types = ["parallel"]} %arg0, %result {
  ^bb0(%gen2_arg0: f32, %gen2_arg1: f32):
    %tmp2 = exp %gen2_arg0 : f32
    linalg.yield %tmp2 : f32
  }: memref<2xf32>, memref<2xf32>
  return
}
```

Like before, several constraints apply in order to use this transformation:

* The copy operation, the defining operation of the source value, and the
deallocation of the source value lie in the same block.
* There are no users/aliases of the target value between the defining operation
of the source value and the copy operation.
* There are no users/aliases of the source value between the copy operation and
the deallocation of the source value.

## Known Limitations

BufferDeallocation introduces additional copies using allocations from the
“std” dialect (“std.alloc”). Analogously, all deallocations use the “std”
dialect operation “std.dealloc”. The actual copy process is realized using
“linalg.copy”. Furthermore, buffers are essentially immutable after their
creation in a block. Other limitations are known in the case of using
unstructured control flow.