Add support for import and translation of block addresses.
MLIR does not support using basic block references outside a function
(as LLVM does). This PR does not propose changing MLIR in that respect;
instead, it introduces two new ops: `llvm.blockaddress` and
`llvm.blocktag`. Here's an example:
```
llvm.func @ba() -> !llvm.ptr {
%0 = llvm.blockaddress <function = @ba, tag = <id = 1>> : !llvm.ptr
llvm.br ^bb1
^bb1: // pred: ^bb0
llvm.blocktag <id = 1>
llvm.return %0 : !llvm.ptr
}
```
Value `%0` holds the address of the block tagged as `id = 1` in function
`@ba`. Block tags must be unique within a function, and every use of
`llvm.blockaddress` requires a matching tag on an `llvm.blocktag` op.
This fixes a validation pass assert when processing ops with quantized
element types.
The failure case is added to invalid.mlir.
The fix re-orders the validation checks so that only ops with int/float
operands and results pass the first stage of the validation pass; the
remaining checks then do not need to handle quantized data types.
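For illustration, a hypothetical instance of the failure case (the exact op and quantization parameters in invalid.mlir may differ):
```mlir
// Hypothetical: operands carry quantized element types, so the op must
// not reach the int/float-only stages of the validation pass.
%0 = tosa.add %a, %b : (tensor<4x!quant.uniform<i8:f32, 0.015:128>>,
                        tensor<4x!quant.uniform<i8:f32, 0.015:128>>)
                       -> tensor<4x!quant.uniform<i8:f32, 0.015:128>>
```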
Signed-off-by: Tai Ly <tai.ly@arm.com>
The current implementation iterates over the list of arguments while
modifying it. Depending on the number of arguments, this triggers an
assert: `assert(index < arguments.size())`. This change replaces the
loop with a range-based erasure.
This change standardises the naming convention for the argument
representing the value to store in various vector operations.
Specifically, it ensures that all vector ops storing a value, whether
into memory, a tensor, or another vector, use `valueToStore` for the
corresponding argument name.
Updated operations:
* `vector.transfer_write`, `vector.insert`, `vector.scalable_insert`,
`vector.insert_strided_slice`.
For reference, here are operations that currently use `valueToStore`:
* `vector.store`, `vector.scatter`, `vector.compressstore`,
`vector.maskedstore`.
This is a non-functional change (NFC); the behavior of these operations
is unaffected.
Implements #131602
Check the memory space before lowering allocation ops, instead of
starting the lowering and then rolling back the pattern when the memory
space was found to be incompatible with LLVM.
Note: This is in preparation of the One-Shot Dialect Conversion
refactoring.
Note: `isConvertibleAndHasIdentityMaps` now also checks the memory
space.
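For illustration, a hypothetical allocation that is now rejected up front (the string memory space here is made up):
```mlir
// Hypothetical: a memory space with no LLVM address-space counterpart;
// isConvertibleAndHasIdentityMaps now fails before any IR is modified,
// so nothing has to be rolled back.
%0 = memref.alloc() : memref<4xf32, "device_local">
```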
Fixes the following warning after the changes in
https://github.com/llvm/llvm-project/pull/134309:
```
llvm-project/mlir/lib/Target/LLVMIR/Dialect/NVVM/NVVMToLLVMIRTranslation.cpp:134:3: warning: default label in switch which covers all enumeration values [-Wcovered-switch-default]
default:
^
1 warning generated.
```
Add support for lowering of convolution operations where the `acc_type`
attribute differs from the result type of the operation. The only such
case for convolutions in the TOSA-v1.0 specification is an fp16
convolution which internally uses an fp32 accumulator; all other
operations have accumulator types that match their output/result types.
Add lit tests for the fp16 convolution with fp32 accumulator case
described above.
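A sketch of the newly supported case (the exact operand and attribute list depends on the TOSA dialect version; shapes are chosen for illustration):
```mlir
// fp16 inputs and result, with an fp32 accumulator selected via acc_type.
%0 = tosa.conv2d %input, %weights, %bias {
       acc_type = f32,
       dilation = array<i64: 1, 1>,
       pad = array<i64: 0, 0, 0, 0>,
       stride = array<i64: 1, 1>}
     : (tensor<1x8x8x4xf16>, tensor<16x1x1x4xf16>, tensor<16xf16>)
     -> tensor<1x8x8x16xf16>
```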
Signed-off-by: Jack Frankland <jack.frankland@arm.com>
Diagnostics at unknown locations can now be verified with
`-verify-diagnostics`.
Example:
```
// expected-error@unknown {{something went wrong}}
```
Also clean up some MemRefToLLVM conversion tests that had to redirect
all errors to stdout in order to FileCheck them. All of those tests can
now be stored in a single `invalid.mlir`. That was not possible before.
Relax the assumption that an alloc op always has its allocation at
`getResult(0)`; this allows the `optimize-allocation-liveness` pass to
be used with custom ops that have more than one result. Ops with
multiple allocations are not handled here yet.
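As a sketch, the pass can now handle an op like the following (a hypothetical op in generic form, with a made-up `!test.token` type):
```mlir
// The allocated memref is result #1, not result #0, which the pass
// previously assumed.
%token, %buffer = "test.alloc_with_token"() : () -> (!test.token, memref<8xf32>)
```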
This commit extends the LLVM dialect inliner interface to respect the
call op's noinline attribute. This is a follow-up to
https://github.com/llvm/llvm-project/pull/133726
which added the noinline attribute to the LLVM dialect call op.
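For illustration, a call site the inliner now leaves alone might look like this (the attribute spelling below is an assumption based on that PR):
```mlir
llvm.func @callee()

llvm.func @caller() {
  // Assumed spelling of the attribute added in #133726; the inliner
  // interface now refuses to inline this call site.
  llvm.call @callee() {no_inline} : () -> ()
  llvm.return
}
```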
The proper layering here is that Inliner depends on InliningUtils, and
not the other way round. Maybe it's time to give InliningUtils a less
terrible file name.
The current inliner disables inlining when the caller is in a region
with the SingleBlock trait while the callee function contains multiple
blocks. The SingleBlock trait is used in operations such as do/while
loops, for example fir.do_loop, fir.iterate_while, and fir.if.
Typically, calls within loops are good candidates for inlining, but
functions with multiple blocks are also common; for example, any
function with an "if () then return" will result in multiple blocks in
MLIR.
This change gives a customized inliner the flexibility to handle such
cases.
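For illustration, a minimal sketch of the previously rejected shape, using `scf.if` (which has the SingleBlock trait) and upstream dialects in place of the fir ops:
```mlir
// The callee has multiple blocks because of the early return.
func.func @callee(%c: i1) -> i32 {
  cf.cond_br %c, ^bb1, ^bb2
^bb1:
  %a = arith.constant 1 : i32
  return %a : i32
^bb2:
  %b = arith.constant 2 : i32
  return %b : i32
}

// The call sits in a single-block region, so the default inliner bails.
func.func @caller(%c: i1) -> i32 {
  %r = scf.if %c -> (i32) {
    %v = func.call @callee(%c) : (i1) -> i32
    scf.yield %v : i32
  } else {
    %zero = arith.constant 0 : i32
    scf.yield %zero : i32
  }
  return %r : i32
}
```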
Two hooks are added:
doClone: clones instructions and other information from the callee
function into the caller function.
canHandleMultipleBlocks: checks whether functions with multiple blocks
can be inlined into a region with the SingleBlock trait.
The default behavior of the inliner remains unchanged.
---------
Co-authored-by: jeanPerier <jean.perier.polytechnique@gmail.com>
Co-authored-by: Mehdi Amini <joker.eph@gmail.com>
When folding an op during a conversion, first try to legalize all
generated constants, then replace the original operation. This is
slightly more efficient because fewer rewrites must be rolled back in
case a generated constant could not be legalized.
Note: This is in preparation of the One-Shot Dialect Conversion
refactoring.
Compute the result types and bail out before modifying any IR. That is
more efficient when type conversion fails, because no modifications
must be rolled back.
Note: This is in preparation of the One-Shot Dialect Conversion
refactoring.
After https://github.com/llvm/llvm-project/pull/133220, some empty
complex literals (`tensor<0xcomplex<f32>>`) failed to parse.
This was largely due to the ambiguity between `shape.empty()` meaning
splat (`dense<1>`) or empty literal (`dense<>`). The type's element
count is now used to disambiguate during verification.
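For example, this minimal case now round-trips again:
```mlir
// Zero elements: `dense<>` is an empty literal, not a splat.
%0 = arith.constant dense<> : tensor<0xcomplex<f32>>
```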
This PR fixes a bug in a canonicalization pattern for the operation
shape.shape_of (shape of reshape):
```
// Before
func.func @f(%arg0: tensor<?x1xf32>, %arg1: tensor<3xi32>) -> tensor<3xindex> {
%reshape = tensor.reshape %arg0(%arg1) : (tensor<?x1xf32>, tensor<3xi32>) -> tensor<?x1x1xf32>
%0 = shape.shape_of %reshape : tensor<?x1x1xf32> -> tensor<3xindex>
return %0 : tensor<3xindex>
}
// This will error out as follows:
error: 'tensor.cast' op operand type 'tensor<3xi32>' and result type 'tensor<3xindex>' are cast incompatible
%0 = shape.shape_of %reshape : tensor<?x1x1xf32> -> tensor<3xindex>
^
note: see current operation: %0 = "tensor.cast"(%arg1) : (tensor<3xi32>) -> tensor<3xindex>
```
```
// After
func.func @f(%arg0: tensor<?x1xf32>, %arg1: tensor<3xi32>) -> tensor<3xindex> {
%0 = arith.index_cast %arg1 : tensor<3xi32> to tensor<3xindex>
return %0 : tensor<3xindex>
}
```
See the file canonicalize.mlir in the change list for an example.
For context, this bug was found while running a test on Keras 3: the
canonicalizer errored out due to an invalid tensor.cast operation when
the batch size is dynamic.
The operands of the op are tensor<3xi32> cast to tensor<3xindex>.
This change is related to a previous PR:
https://github.com/llvm/llvm-project/pull/98531
---------
Co-authored-by: Alaa Ali <alaaali@ah-alaaali-l.dhcp.mathworks.com>
Co-authored-by: Mehdi Amini <joker.eph@gmail.com>
PTX source files are expected to contain only ASCII text
(https://docs.nvidia.com/cuda/parallel-thread-execution/#source-format) and no null terminators.
`ptxas` has so far not enforced this but is moving towards doing so.
This revealed a problem where the null terminator was printed into the
output file in the MLIR path when outputting PTX directly. Only add the null terminator on the assembly output path for JIT, instead of in the output of `moduleToObject`.
`tensor.insert_slice` needs to have read semantics on its destination
operand. Since it has a return value, its semantics are
- Copy dest to result
- Copy source to subview of destination.
`tensor.parallel_insert_slice`, though, has no result, so it does not
need read semantics. The op description
[here](a3ac318e5f/mlir/include/mlir/Dialect/Tensor/IR/TensorOps.td (L1524))
also says that it is expected to lower to a `memref.subview`, which does
not have read semantics on the destination (it's just a view).
This patch drops the read semantics for the destination of
`tensor.parallel_insert_slice` but also makes the `shared_outs` operands
of `scf.forall` have read semantics. Earlier, the read semantics for
`shared_outs` were propagated indirectly through the read semantics of
the destination operand of `tensor.parallel_insert_slice`. Now that is
specified more directly.
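A minimal sketch of the ops in question (`test.produce` is a hypothetical producer):
```mlir
// `shared_outs` now carries the read semantics for %dest itself, rather
// than inheriting them from the parallel_insert_slice destination.
%res = scf.forall (%i) in (4) shared_outs(%o = %dest) -> (tensor<16xf32>) {
  %off = affine.apply affine_map<(d0) -> (d0 * 4)>(%i)
  %slice = "test.produce"() : () -> tensor<4xf32>
  scf.forall.in_parallel {
    tensor.parallel_insert_slice %slice into %o[%off] [4] [1]
        : tensor<4xf32> into tensor<16xf32>
  }
}
```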
Fixes #133964
---------
Signed-off-by: MaheshRavishankar <mahesh.ravishankar@gmail.com>
This patch updates the handling of target regions to set trip counts and
kernel execution modes properly, based on clang's behavior. This fixes a
race condition on `target teams distribute` constructs with no `parallel
do` loop inside.
This is how kernels are classified after the changes introduced in this
patch:
```f90
! Exec mode: SPMD.
! Trip count: Set.
!$omp target teams distribute parallel do
do i=...
end do
! Exec mode: Generic-SPMD.
! Trip count: Set (outer loop).
!$omp target teams distribute
do i=...
!$omp parallel do private(idx, y)
do j=...
end do
end do
! Exec mode: Generic-SPMD.
! Trip count: Set (outer loop).
!$omp target teams distribute
do i=...
!$omp parallel
...
!$omp end parallel
end do
! Exec mode: Generic.
! Trip count: Set.
!$omp target teams distribute
do i=...
end do
! Exec mode: SPMD.
! Trip count: Not set.
!$omp target parallel do
do i=...
end do
! Exec mode: Generic.
! Trip count: Not set.
!$omp target
...
!$omp end target
```
For the split `target teams distribute + parallel do` case, clang
produces a Generic kernel which gets promoted to Generic-SPMD by the
openmp-opt pass. We can't currently replicate that behavior in flang
because our codegen for these constructs results in the introduction of
calls to the `kmpc_distribute_static_loop` family of functions, instead
of `kmpc_distribute_static_init`, which currently prevent promotion of
the kernel to Generic-SPMD.
For the time being, instead of relying on the openmp-opt pass, we look
at the MLIR representation to find the Generic-SPMD pattern and directly
tag the kernel as such during codegen. This is what we were already
doing, but incorrectly matching other kinds of kernels as such in the
process.
This patch updates Flang lowering and kernel flags identification in
MLIR so that loop bounds on `target teams loop` constructs are evaluated
on the host, making the trip count available to the corresponding
`__tgt_target_kernel` call emitted for the target region.
This is necessary in order to properly execute these constructs as
`target teams distribute parallel do`.
Co-authored-by: Kareem Ergawy <kareem.ergawy@amd.com>
Remove the test in the convolution verifier that checks that the input
and output element types of convolution operations conform to the
constraints imposed by the TOSA 1.0 specification.
These checks are too strict for users of the TOSA dialect who wish to
allow more types than those allowed by the spec, and they pose
compatibility issues with earlier TOSA implementations which allowed
more type combinations.
Users who do wish to constrain the convolution type combinations to only
those allowed by the TOSA 1.0 spec should run the TOSA validation pass,
which already performs these checks.
Signed-off-by: Jack Frankland <jack.frankland@arm.com>
We currently do not have masked vectorization support for `tensor.pad`
with low padding. However, we can allow this in the special case where
the result dimension after padding is a unit dim. The reason is that
when we actually have a low pad on a unit dim, the input size of that
dimension will be (or should be, for correct IR) dynamically zero, so we
will create a zero mask, which is correct. If the low pad is dynamically
zero then the lowering is correct as well.
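A sketch of the newly allowed case (shapes chosen for illustration; `%input` and `%cst` are assumed to be defined):
```mlir
// Low pad of 1 on a unit result dim: for this IR to be correct, the
// dynamic input extent along dim 0 must be 0 at runtime, so the
// vectorizer's mask for that dimension is all-zero.
%padded = tensor.pad %input low[1, 0] high[0, 0] {
^bb0(%i: index, %j: index):
  tensor.yield %cst : f32
} : tensor<?x8xf32> to tensor<1x8xf32>
```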
---------
Signed-off-by: Nirvedh <nirvedh@gmail.com>
This patch corrects an invalid condition in `getEffectsOnResource` used
to identify relevant "resources":
```cpp
return it.getResource() != resource;
```
The current implementation assumes that only one instance of each
resource will exist, so comparing raw pointers is both safe and
sufficient. This assumption stems from constructs like:
```cpp
static DerivedResource *get() {
static DerivedResource instance;
return &instance;
}
```
i.e., resource instances returned via static singleton methods.
However, as discussed in
* https://github.com/llvm/llvm-project/issues/129216,
this assumption breaks in practice — notably on macOS (Apple Silicon)
when built with:
* `-DBUILD_SHARED_LIBS=On`.
In such cases, multiple instances of the same logical resource may exist
across shared library boundaries, leading to incorrect behavior and
causing failures in tests like:
* test/Dialect/Transform/check-use-after-free.mlir
This patch replaces the pointer comparison with a comparison based on
resource identity:
```cpp
return it.getResource()->getResourceID() != resource->getResourceID();
```
This approach aligns better with the intent of `getEffectsOnResource`,
which is to:
```cpp
/// Collect all of the effect instances that operate on the provided
/// resource (...)
```
Fixes #129216
Add a pattern that bubbles up tensor.extract_slice through
tensor.collapse_shape.
The pattern is registered in a pattern population function that is used
by the transform op
transform.apply_patterns.tensor.bubble_up_extract_slice and by the
transform op transform.structured.fuse as a cleanup pattern.
This pattern enables tiling and fusing op chains which contain
tensor.collapse_shape when it is added as a cleanup pattern of the
tile-and-fuse utility. Without this pattern that would not be possible,
as tensor.collapse_shape does not implement the tiling interface. This
is an additional pattern to the one added in PR #126898
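A sketch of the rewrite, for a slice that stays within a single element of the collapsed group:
```mlir
// Before: the slice is applied to the collapsed tensor.
%c = tensor.collapse_shape %in [[0, 1], [2]]
    : tensor<2x4x8xf32> into tensor<8x8xf32>
%s = tensor.extract_slice %c[0, 0] [4, 8] [1, 1]
    : tensor<8x8xf32> to tensor<4x8xf32>

// After: the slice is bubbled up above the collapse.
%s2 = tensor.extract_slice %in[0, 0, 0] [1, 4, 8] [1, 1, 1]
    : tensor<2x4x8xf32> to tensor<1x4x8xf32>
%c2 = tensor.collapse_shape %s2 [[0, 1], [2]]
    : tensor<1x4x8xf32> into tensor<4x8xf32>
```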
This fixes the current lowering of `arith.ceildivsi` in the arith-expand
pass, which was previously incorrect. The new version is based on the
lowering of `arith.floordivsi`, and will not introduce new undefined
behavior or poison during the lowering. It also replaces one division
with a multiplication.
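A sketch of an expansion with these properties (an assumed shape, not necessarily the exact sequence the pass emits): compute the truncating quotient, recover the remainder with a multiply and subtract, then add one when the remainder is nonzero and the operands share a sign.
```mlir
// ceildivsi(%n, %m) without the overflowing `n + x` adjustment:
%q    = arith.divsi %n, %m : i32        // truncating quotient
%p    = arith.muli %q, %m : i32         // the division that became a multiply
%r    = arith.subi %n, %p : i32         // remainder
%zero = arith.constant 0 : i32
%rnz  = arith.cmpi ne, %r, %zero : i32  // remainder nonzero?
%nneg = arith.cmpi slt, %n, %zero : i32
%mneg = arith.cmpi slt, %m, %zero : i32
%same = arith.cmpi eq, %nneg, %mneg : i1 // operands share a sign?
%adj  = arith.andi %rnz, %same : i1
%one  = arith.extui %adj : i1 to i32
%res  = arith.addi %q, %one : i32
```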
The previous lowering of `ceildivsi(n, m)` was the following:
```
x = (m > 0) ? -1 : 1
(n*m>0) ? ((n+x) / m) + 1 : - (-n / m)
```
This caused two problems:
* In the case where `n` is INT_MIN and `m` is positive, the result would
be poison instead of an actual value
* In the case where `n` is INT_MAX and `m` is `-1`, this would trigger
undefined behavior, while the original code wouldn't. This is because
`n+x` would be equal to `INT_MIN` (`INT_MAX + 1`), so the `(n+x) / m`
division would overflow and trigger UB.
There are cases in SPIR-V shaders where values need to be yielded from
the selection region to produce valid MLIR. For example (part of a
SPIR-V shader decompiled to GLSL):
```
bool _115;
if (_107)
{
// ...
float _200 = fma(...);
// ...
_115 = _200 < _174;
}
else
{
_115 = _107;
}
bool _123;
if (_115)
{
// ...
float _213 = fma(...);
// ...
_123 = _213 < _174;
}
else
{
_123 = _115;
}
```
This patch extends `mlir.selection` so it can return values, with
`mlir.merge` used as a "yield" operation. This maintains compatibility
with code that does not yield any values, as well as the assumption
that `mlir.merge` is the only operation in the merge block of the
selection region.
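A sketch of what the extended form might look like (the exact syntax here is an assumption, not taken verbatim from the patch):
```mlir
// The selection yields a value; both branches forward their result to
// the merge block, whose only op is spirv.mlir.merge.
%r = spirv.mlir.selection -> i32 {
  spirv.BranchConditional %cond, ^then, ^else
^then:
  spirv.Branch ^merge(%a : i32)
^else:
  spirv.Branch ^merge(%b : i32)
^merge(%v : i32):
  spirv.mlir.merge %v : i32
}
```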