[mlir][NFC] Improve bufferization documentation (#89495)

* Add example for `test-analysis-only` and `print-conflicts`.
* Mention other bufferization-related passes.
* Update outdated documentation.
Matthias Springer 2024-05-07 16:15:09 +02:00 committed by GitHub
parent 30cfe2b2ac
commit 099417d617
3 changed files with 215 additions and 89 deletions


@@ -5,35 +5,45 @@
## Overview
Bufferization in MLIR is the process of converting ops with `tensor` semantics
to ops with `memref` semantics. There are multiple MLIR passes that are related
to bufferization. These passes typically run as one of the last steps in a
pass pipeline, right before lowering `memref` ops to LLVM. That is because
many transformations are easier or only supported in tensor land; e.g.,
[tile/fuse/… on tensors first](https://llvm.discourse.group/t/rfc-linalg-on-tensors-update-and-comprehensive-bufferization-rfc/3373),
then bufferize the remaining IR.
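As a small illustration of what bufferization does, consider a tensor-level insertion and a possible buffer-level equivalent (a sketch; the actual output depends on the analysis results and on bufferization options):

```mlir
// Tensor semantics: %0 is a new SSA value; %t itself is never modified.
%0 = tensor.insert %f into %t[%idx] : tensor<5xf32>

// Memref semantics (assuming the analysis decided to bufferize in-place):
// the store mutates the buffer of %t directly and no copy is made.
memref.store %f, %m[%idx] : memref<5xf32>
```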
![bufferization passes](/includes/img/bufferization_passes.svg)
The most important bufferization pass is *One-Shot Bufferize*: This pass
rewrites `tensor` IR to `memref` IR. There are additional helper passes that
preprocess IR (e.g., so that IR can be bufferized more efficiently), perform
buffer-level optimizations such as allocation hoisting, and
[insert buffer deallocation ops](OwnershipBasedBufferDeallocation.md) so that
the resulting `memref` IR has no memory leaks.
## Deprecated Passes
The old dialect conversion-based bufferization passes have been deprecated and
should not be used anymore. Most of those passes have already been removed from
MLIR. One-Shot Bufferize produces better bufferization results with fewer
memory allocations and buffer copies.
The buffer deallocation pass has been deprecated in favor of the ownership-based
buffer deallocation pipeline. The deprecated pass has some limitations that may
cause memory leaks in the resulting IR.
## What is One-Shot Bufferize?
One-Shot Bufferize is a tensor bufferization pass designed for IR in
[destination-passing style](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/11/dps-fhpc17.pdf),
and with aggressive in-place bufferization.
One-Shot Bufferize is:
* **Monolithic**: A single MLIR pass does the entire work.
* **Extensible** via an op interface: All ops that implement
`BufferizableOpInterface` can be bufferized.
* A **whole-function at a time analysis**. In-place bufferization decisions
are made by analyzing SSA use-def chains on tensors. Op interface
@@ -41,10 +51,7 @@ One-Shot Bufferize is:
ops, but also helper methods for One-Shot Bufferize's analysis to query
information about an op's bufferization/memory semantics.
* **2-Phase**: Bufferization is internally broken down into 2 steps: First,
analyze the entire IR and make bufferization decisions. Then, bufferize
(rewrite) the IR. The analysis has access to exact SSA use-def information.
It incrementally builds alias and equivalence sets and does not rely on a
@@ -60,27 +67,17 @@ One-Shot Bufferize is:
of `AnalysisState` that implements a small number of virtual functions can
serve as a custom analysis. It is even possible to run One-Shot Bufferize
without any analysis (`AlwaysCopyAnalysisState`), in which case One-Shot
Bufferize copies every buffer before writing to it.
From an architecture perspective, One-Shot Bufferize consists of
[BufferizableOpInterface](https://github.com/llvm/llvm-project/blob/17a68065c378da74805e4e1b9a5b78cc9f83e580/mlir/include/mlir/Dialect/Bufferization/IR/BufferizableOpInterface.td)
(and its implementations) and an
[analysis](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Bufferization/Transforms/OneShotAnalysis.h#L164)
of tensor SSA values that decides if a buffer can be used directly or must be
copied. The `bufferize` method of the op interface inspects analysis results and
rewrites tensor ops into memref ops.
Note that One-Shot Bufferize does not deallocate buffers. That is done by the
[Ownership-based Buffer Deallocation passes](OwnershipBasedBufferDeallocation.md).
## Goals of Bufferization
The high-level goal of every bufferization technique is to:
1. Use as little memory as possible.
2. Copy as little memory as possible.
This implies reusing already allocated buffers when possible, turning
bufferization into an algorithmically complex problem with similarities to
@@ -102,40 +99,46 @@ choosing an already existing buffer, we must be careful not to accidentally
overwrite data that is still needed later in the program.
To simplify this problem, One-Shot Bufferize was designed to take advantage of
*destination-passing style* (DPS). In MLIR, DPS ops should implement the
[`DestinationStyleOpInterface`](https://github.com/llvm/llvm-project/blob/792d437b56adfb3416daf8105942d4899fb82763/mlir/include/mlir/Interfaces/DestinationStyleOpInterface.td).
DPS exists in itself independently of bufferization and is tied to SSA
semantics: many ops are "updating" a part of their input SSA variables. For
example the LLVM instruction
[`insertelement`](https://llvm.org/docs/LangRef.html#insertelement-instruction)
is inserting an element inside a vector. Since SSA values are immutable, the
operation returns a copy of the input vector with the element inserted.
Another example in MLIR is `linalg.generic` on tensors, which always has an
extra `outs` operand for each result, which provides the initial values to
update (for example when the operation is doing a reduction).
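For example, a sum reduction can be written as a `linalg.generic` whose `outs` operand carries the initial accumulator value (a sketch; `%t` is assumed to be an existing `tensor<5xf32>` value):

```mlir
// Initialize the accumulator with 0.0; this tensor is the "destination".
%zero = arith.constant 0.0 : f32
%empty = tensor.empty() : tensor<f32>
%init = linalg.fill ins(%zero : f32) outs(%empty : tensor<f32>) -> tensor<f32>

// The reduction reads %t and updates the value provided via `outs`.
%sum = linalg.generic
    {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> ()>],
     iterator_types = ["reduction"]}
    ins(%t : tensor<5xf32>) outs(%init : tensor<f32>) {
^bb0(%in: f32, %acc: f32):
  %add = arith.addf %in, %acc : f32
  linalg.yield %add : f32
} -> tensor<f32>
```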
`outs` operands are referred to as "destinations" in the following (quotes are
important as this operand isn't modified in place but copied) and comes into
place in the context of bufferization as a possible "anchor" for the
bufferization algorithm. This allows the user to shape the input in a form that
guarantees close to optimal bufferization result when carefully choosing the
SSA value used as "destination".
For every tensor result, a DPS op has a corresponding tensor operand. If there
aren't any other conflicting uses of this tensor, the bufferization can alias
it with the op result and perform the operation "in-place" by reusing the buffer
allocated for this "destination" input.
As an example, consider the following op: `%r = tensor.insert %f into
%t[%idx] : tensor<5xf32>`
![tensor.insert example](/includes/img/bufferization_tensor_insert_dst.svg)
`%t` is the "destination" in this example. When choosing a buffer for the result
`%r`, denoted as `buffer(%r)`, One-Shot Bufferize considers only two options:
1. `buffer(%r) = buffer(%t)`: store the result in the existing `buffer(%t)`.
Note that this is not always possible. E.g., if the old contents of
`buffer(%t)` are still needed. One-Shot Bufferize's main task is to detect
such cases and fall back to the second option when necessary.
2. `buffer(%r)` is a newly allocated buffer.
There may be other buffers in the same function that could potentially be used
for `buffer(%r)`, but those are not considered by One-Shot Bufferize to keep the
bufferization simple. One-Shot Bufferize could be extended to consider such
buffers in the future to achieve a better quality of bufferization.
@@ -151,7 +154,7 @@ memory allocation. E.g.:
```
The result of `tensor.generate` does not have a "destination" operand, so
bufferization allocates a new buffer. This could be avoided by instead using an
op such as `linalg.generic`, which can express the same computation with a
"destination" operand, as specified behind outputs (`outs`):
@@ -198,12 +201,61 @@ e.g.:
```mlir
%0 = "my_dialect.some_op"(%t) : (tensor<?xf32>) -> (tensor<?xf32>)
%1 = "my_dialect.another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>)
// "yet_another_op" likely needs to read the data of %0, so "another_op" cannot
// in-place write to buffer(%0).
%2 = "my_dialect.yet_another_op"(%0) : (tensor<?xf32>) -> (tensor<?xf32>)
```
One-Shot Bufferize has debug flags (`test-analysis-only` and `print-conflicts`)
that print the results of the analysis and explain to the user why buffer
copies were inserted.
## Tensor / MemRef Boundary
The bufferization dialect provides a few helper ops to connect tensor IR (that
should be bufferized) with existing buffers (that may be allocated/provided by
a different runtime/library/etc.).
`bufferization.to_memref %t` returns the future buffer of a tensor SSA value.
`bufferization.to_tensor %m` returns a tensor SSA value for a given MemRef
buffer. `bufferization.materialize_in_destination` indicates that a tensor value
should materialize in a certain buffer.
Consider the following example, where a TOSA matmul result should materialize in
an existing buffer `%C`:
```mlir
// Batched TOSA matrix multiplication. %A and %B are the
// inputs, %C is the output.
func.func @test_matmul(%A: memref<1x17x19xf32>,
%B: memref<1x19x29xf32>,
%C: memref<1x17x29xf32>) {
%A_tensor = bufferization.to_tensor %A restrict : memref<1x17x19xf32>
%B_tensor = bufferization.to_tensor %B restrict : memref<1x19x29xf32>
%0 = tosa.matmul %A_tensor, %B_tensor
: (tensor<1x17x19xf32>, tensor<1x19x29xf32>) ->
tensor<1x17x29xf32>
bufferization.materialize_in_destination
%0 in restrict writable %C
: (tensor<1x17x29xf32>, memref<1x17x29xf32>) -> ()
return
}
```
Note that all bufferization ops in this example have the `restrict` unit
attribute set. This attribute is similar to the C restrict keyword and indicates
that there is no other `to_tensor` or `materialize_in_destination` op with
the same or an aliasing MemRef operand. Only such
`to_tensor`/`materialize_in_destination` ops are supported. The `restrict`
attribute gives strong aliasing guarantees to the bufferization analysis and
allows us to look only at the tensor IR in a program. (Ops that do not operate
on tensors are ignored by One-Shot Bufferize.)
Also note that `tosa.matmul` cannot be bufferized as is: there is no
`BufferizableOpInterface` implementation for that op. However, the op can be
lowered to a combination of `tensor.empty` and `linalg.matmul`, which can be
bufferized.
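A sketch of such a lowering (since the `tosa.matmul` operands in this example are batched 3-D tensors, `linalg.batch_matmul` is used; the zero-initialization via `linalg.fill` is illustrative):

```mlir
// Create a "destination" tensor and fill it with the additive identity.
%empty = tensor.empty() : tensor<1x17x29xf32>
%zero = arith.constant 0.0 : f32
%filled = linalg.fill ins(%zero : f32) outs(%empty : tensor<1x17x29xf32>)
    -> tensor<1x17x29xf32>

// The DPS matmul can now bufferize in-place into buffer(%filled).
%0 = linalg.batch_matmul
    ins(%A_tensor, %B_tensor : tensor<1x17x19xf32>, tensor<1x19x29xf32>)
    outs(%filled : tensor<1x17x29xf32>) -> tensor<1x17x29xf32>
```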
## Using One-Shot Bufferize
@@ -221,17 +273,14 @@ By default, One-Shot Bufferize fails when it encounters an op with tensor
semantics (i.e., tensor result or tensor operand) that is not bufferizable
(i.e., does not implement `BufferizableOpInterface`). This can be avoided with
`allow-unknown-ops`. In that case, One-Shot Bufferize inserts
`to_memref`/`to_tensor` ops around the bufferization boundary.
One-Shot Bufferize can be configured to bufferize only ops from a set of
dialects with `dialect-filter`. This can be useful for gradually migrating from
dialect conversion-based bufferization to One-Shot Bufferize. One-Shot Bufferize
must run first in such a case, because dialect conversion-based bufferization
generates `to_tensor` ops without the `restrict` unit attribute, which One-Shot
Bufferize cannot analyze.
One-Shot Bufferize can also be called programmatically with
[`bufferization::runOneShotBufferize`](https://github.com/llvm/llvm-project/blob/ae2764e835a26bad9774803eca0a6530df2a3e2d/mlir/include/mlir/Dialect/Bufferization/Transforms/OneShotAnalysis.h#L167).
@@ -240,6 +289,14 @@ Alternatively,
skips the analysis and inserts a copy on every buffer write, just like the
dialect conversion-based bufferization.
By default, function boundaries are not bufferized. This is because there are
currently limitations around function graph bufferization: recursive
calls are not supported. As long as there are no recursive calls, function
boundary bufferization can be enabled with `bufferize-function-boundaries`. Each
tensor function argument and tensor function result is then turned into a
memref. The layout map of the memref type can be controlled with
`function-boundary-type-conversion`.
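For example, with `bufferize-function-boundaries`, a function with tensor arguments and results could bufferize roughly as follows (a sketch, assuming the default fully dynamic layout maps):

```mlir
// Before: tensor function boundary.
func.func @foo(%t: tensor<?xf32>) -> tensor<?xf32> {
  return %t : tensor<?xf32>
}

// After: each tensor argument/result becomes a memref. The fully dynamic
// strided layout makes the function callable with arbitrary strided buffers.
func.func @foo(%m: memref<?xf32, strided<[?], offset: ?>>)
    -> memref<?xf32, strided<[?], offset: ?>> {
  return %m : memref<?xf32, strided<[?], offset: ?>>
}
```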
## Memory Layouts
One-Shot Bufferize bufferizes ops from top to bottom. This works well when all
@@ -319,6 +376,11 @@ To get a better intuition of the interface methods, we invite users to take a
look at existing implementations in MLIR, e.g., the implementation of
`tensor.insert` or `tensor.extract`.
Interface implementations of DPS ops (that implement
`DestinationStyleOpInterface`) can derive from
`DstBufferizableOpInterfaceExternalModel`, which provides all necessary
method implementations except for `bufferize`.
## Debugging Buffer Copies
To get a better understanding of why One-Shot Bufferize introduced a buffer
@@ -338,14 +400,90 @@ There are two reasons why a buffer copy may be inserted.
In the first case, `print-conflicts` illustrates the conflict in the form of a
("read", "conflicting write", "last write") tuple.
## Understanding the SSA Use-Def Chain Analysis
A RaW conflict consists of three parts, in the following order according to
op dominance:
1. **Definition:** A tensor `%t` is defined.
2. **Conflicting Write:** An operation writes to `buffer(%t)`.
3. **Read:** An operation reads `%t`.
When such a RaW conflict is detected during the analysis phase, One-Shot
Bufferize will insert a buffer copy for the conflicting write.
**Example**
```mlir
// RUN: mlir-opt %s -one-shot-bufferize="bufferize-function-boundaries test-analysis-only print-conflicts"
func.func @test(%arg0: f32, %arg1: f32, %arg2: index, %arg3: index) -> (f32, tensor<3xf32>) {
// Create a new tensor with [%arg0, %arg0, %arg0].
%0 = tensor.from_elements %arg0, %arg0, %arg0 : tensor<3xf32>
// Insert something into the new tensor.
%1 = tensor.insert %arg1 into %0[%arg2] : tensor<3xf32>
// Read from the old tensor.
%r = tensor.extract %0[%arg3] : tensor<3xf32>
// Return the extracted value and the result of the insertion.
func.return %r, %1 : f32, tensor<3xf32>
}
```
The output IR is as follows:
```mlir
func.func @test(%arg0: f32, %arg1: f32, %arg2: index, %arg3: index) -> (f32, tensor<3xf32>) {
%from_elements = tensor.from_elements %arg0, %arg0, %arg0 {"C_0[DEF: result 0]"} : tensor<3xf32>
%inserted = tensor.insert %arg1 into %from_elements[%arg2] {"C_0[CONFL-WRITE: 1]", __inplace_operands_attr__ = ["none", "false", "none"]} : tensor<3xf32>
%extracted = tensor.extract %from_elements[%arg3] {"C_0[READ: 0]", __inplace_operands_attr__ = ["true", "none"]} : tensor<3xf32>
return {__inplace_operands_attr__ = ["none", "true"]} %extracted, %inserted : f32, tensor<3xf32>
}
```
Note that the IR was not bufferized. It was merely annotated with the results
of the bufferization analysis. Every operation with tensor semantics has a
`__inplace_operands_attr__` attribute with one value per operand. If an operand
is not a tensor, the respective value is `none`. Otherwise, if the operand was
decided to be bufferized in-place, the value is `true`. A value of `false`
indicates a buffer copy. In the above example, a buffer copy would be inserted
for `tensor.insert`, so that it does not overwrite `buffer(%from_elements)`,
which is still needed for `tensor.extract`.
For each RaW conflict (there is only one in the example), three `C_i`
attributes were added:
* `C_0[DEF: result 0]`: A tensor is defined: 0-th result of
`tensor.from_elements`.
* `C_0[CONFL-WRITE: 1]`: An operation (if bufferized in-place) would write into
the future buffer of the defined tensor: 1-st operand of `tensor.insert`.
* `C_0[READ: 0]`: An operation reads the tensor definition: 0-th operand of
`tensor.extract`.
The fully bufferized IR (with the inserted buffer copy) is as follows:
```mlir
func.func @test(%arg0: f32, %arg1: f32, %arg2: index, %arg3: index) -> (f32, memref<3xf32>) {
%c2 = arith.constant 2 : index
%c1 = arith.constant 1 : index
%c0 = arith.constant 0 : index
%alloc = memref.alloc() {alignment = 64 : i64} : memref<3xf32>
memref.store %arg0, %alloc[%c0] : memref<3xf32>
memref.store %arg0, %alloc[%c1] : memref<3xf32>
memref.store %arg0, %alloc[%c2] : memref<3xf32>
%alloc_0 = memref.alloc() {alignment = 64 : i64} : memref<3xf32>
memref.copy %alloc, %alloc_0 : memref<3xf32> to memref<3xf32>
memref.store %arg1, %alloc_0[%arg2] : memref<3xf32>
%0 = memref.load %alloc[%arg3] : memref<3xf32>
return %0, %alloc_0 : f32, memref<3xf32>
}
```
To get a better understanding of the SSA Use-Def Chain Analysis and the RaW
conflict detection algorithm, interested users may want to refer to:
* [Original design document](https://discourse.llvm.org/uploads/short-url/5kckJ3DftYwQokG252teFgw3sYa.pdf)
* [ODM talk](https://youtu.be/TXEo59CYS9A) ([slides](https://mlir.llvm.org/OpenMeetings/2022-01-13-One-Shot-Bufferization.pdf))
* [LLVM Dev Meeting 2023 tutorial slides](https://m-sp.org/downloads/llvm_dev_2023.pdf)
## Migrating from Dialect Conversion-based Bufferization
@@ -356,20 +494,6 @@ One-Shot Bufferize must run first because it cannot analyze those boundary ops.
To update existing code step-by-step, it may be useful to specify a dialect
filter for One-Shot Bufferize, so that dialects can be switched over one-by-one.
## Dialect Conversion-based Bufferization
Disclaimer: Most dialect conversion-based bufferization has been migrated to
