Compute the result types and bail out before modifying any IR. That is
more efficient when type conversion fails, because no modifications
have to be rolled back.
Note: This is in preparation for the One-Shot Dialect Conversion
refactoring.
After https://github.com/llvm/llvm-project/pull/133220, some empty
complex literals (`tensor<0xcomplex<f32>>`) failed to parse.
This was largely due to the ambiguity between `shape.empty()` meaning a
splat (`dense<1>`) or an empty literal (`dense<>`). We now use the
type's element count to disambiguate during verification.
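For illustration, a minimal sketch of the two forms being disambiguated (hedged; the exact regression tests may differ):
```mlir
// Zero-element literal: `dense<>` now parses, since the type's element
// count (0) rules out the splat reading.
%empty = arith.constant dense<> : tensor<0xcomplex<f32>>
// Splat literal for comparison: one (real, imag) pair broadcast to all
// elements.
%splat = arith.constant dense<(1.0, 2.0)> : tensor<4xcomplex<f32>>
```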
This PR fixes a bug in a canonicalization pattern for the
`shape.shape_of` operation (shape of reshape):
```
// Before
func.func @f(%arg0: tensor<?x1xf32>, %arg1: tensor<3xi32>) -> tensor<3xindex> {
%reshape = tensor.reshape %arg0(%arg1) : (tensor<?x1xf32>, tensor<3xi32>) -> tensor<?x1x1xf32>
%0 = shape.shape_of %reshape : tensor<?x1x1xf32> -> tensor<3xindex>
return %0 : tensor<3xindex>
}
// This previously errored out as follows:
error: 'tensor.cast' op operand type 'tensor<3xi32>' and result type 'tensor<3xindex>' are cast incompatible
%0 = shape.shape_of %reshape : tensor<?x1x1xf32> -> tensor<3xindex>
^
note: see current operation: %0 = "tensor.cast"(%arg1) : (tensor<3xi32>) -> tensor<3xindex>
```
```
// After
func.func @f(%arg0: tensor<?x1xf32>, %arg1: tensor<3xi32>) -> tensor<3xindex> {
%0 = arith.index_cast %arg1 : tensor<3xi32> to tensor<3xindex>
return %0 : tensor<3xindex>
}
```
See the file canonicalize.mlir in the change list for an example.
For context, this bug was found while running a test on Keras 3: the
canonicalizer errored out due to an invalid tensor.cast operation when
the batch size is dynamic. The operands of that op are a tensor<3xi32>
cast to tensor<3xindex>.
This change is related to a previous PR:
https://github.com/llvm/llvm-project/pull/98531
---------
Co-authored-by: Alaa Ali <alaaali@ah-alaaali-l.dhcp.mathworks.com>
Co-authored-by: Mehdi Amini <joker.eph@gmail.com>
PTX source files are expected to contain only ASCII text
(https://docs.nvidia.com/cuda/parallel-thread-execution/#source-format) and no null terminators.
`ptxas` has so far not enforced this but is moving towards doing so.
This revealed a problem where the null terminator was printed into the
output file on the MLIR path when outputting PTX directly. Only add the null terminator on the assembly output path for JIT, instead of in the output of `moduleToObject`.
`tensor.insert_slice` needs to have read semantics on its destination
operand. Since it has a return value, its semantics are:
- Copy dest to result.
- Copy source to a subview of the destination.
`tensor.parallel_insert_slice`, though, has no result, so it does not
need read semantics. The op description
[here](a3ac318e5f/mlir/include/mlir/Dialect/Tensor/IR/TensorOps.td (L1524))
also says that it is expected to lower to a `memref.subview`, which does
not have read semantics on the destination (it's just a view).
This patch drops the read semantics for the destination of
`tensor.parallel_insert_slice`, but also gives the `shared_outs`
operands of `scf.forall` read semantics. Earlier, the read semantics of
`shared_outs` were propagated indirectly from the read semantics of the
destination operand of `tensor.parallel_insert_slice`; now they are
specified directly.
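As a minimal sketch (hypothetical shapes and names, not the actual test), the structure in question looks like this; the destination `%out` is only written inside `in_parallel`, and its read semantics now come from `shared_outs` itself:
```mlir
func.func @tile_copy(%src: tensor<4xf32>, %dest: tensor<16xf32>) -> tensor<16xf32> {
  %res = scf.forall (%i) in (4) shared_outs(%out = %dest) -> (tensor<16xf32>) {
    %off = affine.apply affine_map<(d0) -> (d0 * 4)>(%i)
    scf.forall.in_parallel {
      // Write-only use of the destination %out.
      tensor.parallel_insert_slice %src into %out[%off] [4] [1]
          : tensor<4xf32> into tensor<16xf32>
    }
  }
  return %res : tensor<16xf32>
}
```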
Fixes #133964
---------
Signed-off-by: MaheshRavishankar <mahesh.ravishankar@gmail.com>
This patch updates the handling of target regions to set trip counts and
kernel execution modes properly, based on clang's behavior. This fixes a
race condition on `target teams distribute` constructs with no `parallel
do` loop inside.
This is how kernels are classified, after changes introduced in this
patch:
```f90
! Exec mode: SPMD.
! Trip count: Set.
!$omp target teams distribute parallel do
do i=...
end do

! Exec mode: Generic-SPMD.
! Trip count: Set (outer loop).
!$omp target teams distribute
do i=...
  !$omp parallel do private(idx, y)
  do j=...
  end do
end do

! Exec mode: Generic-SPMD.
! Trip count: Set (outer loop).
!$omp target teams distribute
do i=...
  !$omp parallel
  ...
  !$omp end parallel
end do

! Exec mode: Generic.
! Trip count: Set.
!$omp target teams distribute
do i=...
end do

! Exec mode: SPMD.
! Trip count: Not set.
!$omp target parallel do
do i=...
end do

! Exec mode: Generic.
! Trip count: Not set.
!$omp target
...
!$omp end target
```
For the split `target teams distribute + parallel do` case, clang
produces a Generic kernel which gets promoted to Generic-SPMD by the
openmp-opt pass. We can't currently replicate that behavior in flang
because our codegen for these constructs introduces calls to the
`kmpc_distribute_static_loop` family of functions instead of
`kmpc_distribute_static_init`, which currently prevents promotion of
the kernel to Generic-SPMD.
For the time being, instead of relying on the openmp-opt pass, we look
at the MLIR representation to find the Generic-SPMD pattern and directly
tag the kernel as such during codegen. This is what we were already
doing, but we were previously also matching other kinds of kernels as
Generic-SPMD in the process.
This patch updates Flang lowering and kernel flags identification in
MLIR so that loop bounds on `target teams loop` constructs are evaluated
on the host, making the trip count available to the corresponding
`__tgt_target_kernel` call emitted for the target region.
This is necessary in order to properly execute these constructs as
`target teams distribute parallel do`.
Co-authored-by: Kareem Ergawy <kareem.ergawy@amd.com>
Remove the check in the convolution verifier that requires the input and
output element types of convolution operations to conform to the
constraints imposed by the TOSA 1.0 specification.
These checks are too strict for users of the TOSA dialect who wish to
allow more types than those allowed by the spec, and they cause
compatibility issues with earlier TOSA implementations that allowed more
type combinations.
Users who do wish to constrain the convolution type combinations to only
those allowed by the TOSA 1.0 spec should run the TOSA validation pass,
which already performs these checks.
Signed-off-by: Jack Frankland <jack.frankland@arm.com>
We currently do not have masked vectorization support for tensor.pad
with low padding. However, we can allow it in the special case where the
result dimension after padding is a unit dim. The reason is that when we
actually have a low pad on a unit dim, the input size of that dimension
will be (or, for correct IR, should be) dynamically zero, and hence we
will create a zero mask, which is correct. If the low pad is dynamically
zero, the lowering is correct as well.
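A hedged sketch of the special case (illustrative shapes and names):
```mlir
func.func @pad_unit_dim(%src: tensor<?x8xf32>, %low: index) -> tensor<1x8xf32> {
  %cst = arith.constant 0.0 : f32
  // The padded dim 0 of the result is a unit dim. For correct IR, a
  // nonzero %low implies the dynamic size of dim 0 of %src is zero, so
  // masked vectorization generates a zero mask that reads no elements.
  %padded = tensor.pad %src low[%low, 0] high[0, 0] {
  ^bb0(%i: index, %j: index):
    tensor.yield %cst : f32
  } : tensor<?x8xf32> to tensor<1x8xf32>
  return %padded : tensor<1x8xf32>
}
```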
---------
Signed-off-by: Nirvedh <nirvedh@gmail.com>
This patch corrects an invalid condition in `getEffectsOnResource` used
to identify relevant "resources":
```cpp
return it.getResource() != resource;
```
The current implementation assumes that only one instance of each
resource will exist, so comparing raw pointers is both safe and
sufficient. This assumption stems from constructs like:
```cpp
static DerivedResource *get() {
static DerivedResource instance;
return &instance;
}
```
i.e., resource instances returned via static singleton methods.
However, as discussed in
https://github.com/llvm/llvm-project/issues/129216, this assumption
breaks in practice, notably on macOS (Apple Silicon) when built with
`-DBUILD_SHARED_LIBS=On`.
In such cases, multiple instances of the same logical resource may exist
across shared library boundaries, leading to incorrect behavior and
causing failures in tests like:
* test/Dialect/Transform/check-use-after-free.mlir
This patch replaces the pointer comparison with a comparison based on
resource identity:
```cpp
return it.getResource()->getResourceID() != resource->getResourceID();
```
This approach aligns better with the intent of `getEffectsOnResource`,
which is to:
```cpp
/// Collect all of the effect instances that operate on the provided
/// resource (...)
```
Fixes #129216
Add a pattern that bubbles up tensor.extract_slice through
tensor.collapse_shape.
The pattern is registered in a pattern population function that is used
by the transform op
transform.apply_patterns.tensor.bubble_up_extract_slice and by the
transform op transform.structured.fuse as a cleanup pattern.
This pattern enables tiling and fusing op chains that contain
tensor.collapse_shape when added as a cleanup pattern of the
tile-and-fuse utility. Without this pattern that would not be possible,
as tensor.collapse_shape does not implement the tiling interface. This
is an additional pattern to the one added in PR #126898.
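As a rough sketch of what bubbling up achieves (illustrative shapes; the actual pattern handles more general cases):
```mlir
// Before: the slice consumes the collapse, which cannot be tiled.
func.func @before(%src: tensor<4x8xf32>) -> tensor<8xf32> {
  %c = tensor.collapse_shape %src [[0, 1]] : tensor<4x8xf32> into tensor<32xf32>
  %s = tensor.extract_slice %c[0] [8] [1] : tensor<32xf32> to tensor<8xf32>
  return %s : tensor<8xf32>
}

// After: slice first, then collapse the smaller tensor.
func.func @after(%src: tensor<4x8xf32>) -> tensor<8xf32> {
  %s = tensor.extract_slice %src[0, 0] [1, 8] [1, 1]
      : tensor<4x8xf32> to tensor<1x8xf32>
  %c = tensor.collapse_shape %s [[0, 1]] : tensor<1x8xf32> into tensor<8xf32>
  return %c : tensor<8xf32>
}
```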
This fixes the lowering of `arith.ceildivsi` in the arith-expand pass,
which was previously incorrect. The new version is based on the lowering
of `arith.floordivsi` and will not introduce new undefined behavior or
poison during the lowering. It also replaces one division with a
multiplication.
The previous lowering of `ceildivsi(n, m)` was the following:
```
x = (m > 0) ? -1 : 1
(n*m>0) ? ((n+x) / m) + 1 : - (-n / m)
```
This caused two problems:
* In the case where `n` is INT_MIN and `m` is positive, the result would
be poison instead of an actual value.
* In the case where `n` is INT_MAX and `m` is `-1`, this would trigger
undefined behavior, while the original code wouldn't. This is because
`n+x` would be equal to `INT_MIN` (`INT_MAX + 1`), so the `(n+x) / m`
division would overflow and trigger UB.
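Tracing the old lowering on the second case makes the issue concrete (the expansion computes both select arms eagerly, so the division executes even though its branch is not taken):
```
n = INT_MAX, m = -1
x = (m > 0) ? -1 : 1    // x = 1
n + x                   // INT_MAX + 1 wraps to INT_MIN
(n + x) / m             // INT_MIN / -1 overflows: UB
```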
There are cases in SPIR-V shaders where values need to be yielded from
the selection region to make valid MLIR. For example (part of a SPIR-V
shader decompiled to GLSL):
```
bool _115;
if (_107)
{
// ...
float _200 = fma(...);
// ...
_115 = _200 < _174;
}
else
{
_115 = _107;
}
bool _123;
if (_115)
{
// ...
float _213 = fma(...);
// ...
_123 = _213 < _174;
}
else
{
_123 = _115;
}
```
This patch extends `mlir.selection` so it can return values, with
`mlir.merge` used as a "yield" operation. This maintains compatibility
with code that does not yield any values, as well as the assumption that
`mlir.merge` is the only operation in the merge block of the selection
region.
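A hedged sketch of the extended op, written with the full `spirv.` prefix (the exact result syntax is assumed from this summary, and the surrounding values are presumed defined elsewhere):
```mlir
%r = spirv.mlir.selection -> i32 {
  spirv.BranchConditional %cond, ^then, ^merge(%b : i32)
^then:
  spirv.Branch ^merge(%a : i32)
^merge(%v : i32):
  // The merge op acts as the "yield" for the selection region.
  spirv.mlir.merge %v : i32
}
```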
This patch fixes the following bugs:
- In SparseBackwardAnalysis, the setToExitState function should
propagate changes if it modifies the lattice. Previously, this issue was
masked because multi-block scenarios were not tested and the traversal
order of backward dataflow analysis starts from the end of the program.
- The liveness analysis method for determining whether the non-forwarded
operand in branch/region-branch operations is live was incorrect, which
could cause variables that are actually live to be marked as not live.
The `match` and `rewrite` functions have been deprecated in #130031.
This commit deletes them entirely.
Note for LLVM integration: Update your patterns to use `matchAndRewrite`
instead of separate `match` / `rewrite`.
Canonicalizes a chain of `linalg.unpack -> tensor.extract_slice` into a
`linalg.unpack` with reduced destination sizes. This only happens when
the unpack op's only user is a non-rank-reducing slice with zero offsets
and unit strides.
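Roughly, the rewrite looks like this (illustrative shapes; the real pattern handles general tiles and permutations):
```mlir
// Before: unpack to the full destination, then slice.
%full = linalg.unpack %src inner_dims_pos = [0] inner_tiles = [8]
    into %dest : tensor<4x8xf32> -> tensor<32xf32>
%out = tensor.extract_slice %full[0] [30] [1] : tensor<32xf32> to tensor<30xf32>

// After: unpack directly into a reduced destination.
%small = tensor.empty() : tensor<30xf32>
%out2 = linalg.unpack %src inner_dims_pos = [0] inner_tiles = [8]
    into %small : tensor<4x8xf32> -> tensor<30xf32>
```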
---------
Signed-off-by: Max Dawkins <max.dawkins@gmail.com>
Signed-off-by: Max Dawkins <maxdawkins19@gmail.com>
Co-authored-by: Max Dawkins <maxdawkins19@gmail.com>
This commit extends the lowering of amdgpu.mfma to handle the new
double-rate MFMAs in gfx950 and adds tests for these operations.
It also adds support for MFMAs on small floats (f6 and f4), which are
implemented using the "scaled" MFMA intrinsic with a scale value of 0 in
order to have an unscaled MFMA.
This commit does not add an `amdgpu.scaled_mfma` operation, as that is
future work.
---------
Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
Add `no_inline` and `always_inline` attributes for CallOps in MLIR, so
that inlining of an individual call can be forced or forbidden without
putting the attribute on the `FuncOp`.
These attributes will be used in a future PR in Flang (the `[NO]INLINE`
directive).
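For instance, something like the following (a sketch; the exact attribute spelling and the ops that accept it are assumed from this summary, not verified):
```mlir
// Hypothetical call sites: forbid or force inlining per call.
%0 = llvm.call @foo() {no_inline} : () -> i32
%1 = llvm.call @bar() {always_inline} : () -> i32
```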
During the transition from debug intrinsics to debug records, we used
several different command line options to customise handling: the
printing of debug records to bitcode and textual IR could be independent
of how the debug-info was represented inside a module, and whether the
autoupgrader ran could be customised. This was all valuable during
development, but now that totally removing debug intrinsics is coming
up, this patch removes those options in favour of a single flag
(experimental-debuginfo-iterators), which enables autoupgrade, in-memory
debug records, and debug record printing to bitcode and textual IR.
We need to do this ahead of removing the
experimental-debuginfo-iterators flag, to reduce the amount of
test-juggling that happens at that time.
There are quite a number of weird test behaviours related to this --
some of which I simply delete in this commit. Things like
print-non-instruction-debug-info.ll: the test suite now checks for
debug records in all tests, and we don't want to check that we can print
as intrinsics. Or the update_test_checks tests -- these are duplicated
with write-experimental-debuginfo=false to ensure file writing for
intrinsics is correct, but that's something we're imminently going to
delete.
A short survey of curious test changes:
* free-intrinsics.ll: we don't need to test that debug-info is a
zero-cost intrinsic, because we won't be using intrinsics in the future.
* undef-dbg-val.ll: apparently we pinned this to non-RemoveDIs in-memory
mode while we sorted something out; it works now either way.
* salvage-cast-debug-info.ll: was testing that intrinsics-in-memory get
salvaged; that isn't necessary now.
* localize-constexpr-debuginfo.ll: was producing "dead metadata"
intrinsics for optimised-out variable values; dbg-records take the
(correct) representation of poison/undef as an operand. Looks like we
didn't update this in the past to avoid spurious test differences.
* Transforms/Scalarizer/dbginfo.ll: this test was explicitly testing
that debug-info affected codegen, and we deferred updating the tests
until now. This is just one of those silent codegen-change issues that
get fixed by RemoveDIs.
Finally: I've added a bitcode test, dbg-intrinsics-autoupgrade.ll.bc,
that checks we can autoupgrade debug intrinsics that are in bitcode into
the new debug records.
Targeted rewrite of a linalg.copy on memrefs to a memref.copy.
This is useful when bufferizing copies to a linalg.copy, applying some
transformations, and then rewriting the copy into a memref.copy.
If the element types of the source and destination differ, or if the
source is a scalar, the transform produces a silenceable failure.
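That is, roughly (illustrative types):
```mlir
// The transform rewrites this copy on memrefs ...
linalg.copy ins(%src : memref<8xf32>) outs(%dst : memref<8xf32>)
// ... into this:
memref.copy %src, %dst : memref<8xf32> to memref<8xf32>
```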
- For splat dense attributes, the number of parsed elements must be 2.
- For non-splat dense attributes, the number of parsed elements must be
twice the number of elements in the type.
Fixes #132859.
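Reading the counts, this rule appears aimed at complex element types (cf. the referenced issue); for example (illustrative):
```mlir
// Splat: exactly 2 parsed values, the (real, imag) pair.
%s = arith.constant dense<(1.0, 2.0)> : tensor<4xcomplex<f32>>
// Non-splat: 2 * numel values; here 2 * 2 = 4.
%n = arith.constant dense<[(1.0, 2.0), (3.0, 4.0)]> : tensor<2xcomplex<f32>>
```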
Minor non-functional change of the dialect to better align with the
operator order from the TOSA specification:
https://www.mlplatform.org/tosa/tosa_spec.html
Signed-off-by: Jerry Ge <jerry.ge@arm.com>
This replaces #125361.
- communicator is mandatory
- new mpi.comm_world
- new mpi.comm_split
- lowering and test
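A heavily hedged sketch of the new surface (result and type spellings are assumed from this summary, not checked against the dialect):
```mlir
// Obtain the world communicator, then split it by color/key.
%world = mpi.comm_world : !mpi.comm
%new = mpi.comm_split(%world, %color, %key) : !mpi.comm
```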
---------
Co-authored-by: Sergio Sánchez Ramírez <sergio.sanchez.ramirez+git@bsc.es>
Drop the small size to make the vector types match the generic helper
`getMixedValues` in `StaticValueUtils.h`.
This saves some needless vector copies. I didn't find any local
variables that need updating.