- When the operand type of an operation changes to a profile-dependent
type, the compliance metadata must be updated. Update the compliance
checks for the following:
- CONV2D, CONV3D, DEPTHWISE_CONV2D, and TRANSPOSE_CONV2D, as zero points
have changed to variable inputs.
- PAD, because pad_const has been changed to a variable input.
- GATHER and SCATTER, as the indices operand has changed to index_t.
- Add an int16 extension check for CONCAT.
- Add a compliance check for COND_IF, WHILE_LOOP, VARIABLE,
VARIABLE_READ, and VARIABLE_WRITE.
- Correct the profile requirements for IDENTITY, TABLE, MATMUL and
LOGICAL-like operations.
- Remove unnecessary checks for non-v1.0 operations.
- Add condition requirements (anyOf and allOf) to the type mode of
metadata for modes that have multiple profile/extension considerations.
This patch introduces the `MemorySpaceAttrInterface` interface. This
interface is responsible for handling the semantics of `ptr` operations.
For example, this interface can be used to create read-only memory
spaces, making any operation other than a load a verification error; see
`TestConstMemorySpaceAttr` for a possible implementation of this concept.
This patch also introduces the enum dependencies `AtomicOrdering` and
`AtomicBinOp`; both enumerations are clones of the enums with the same
names in the LLVM dialect.
Also, see:
- [[RFC] `ptr` dialect & modularizing ptr ops in the LLVM
dialect](https://discourse.llvm.org/t/rfc-ptr-dialect-modularizing-ptr-ops-in-the-llvm-dialect/75142)
for rationale.
- https://github.com/llvm/llvm-project/pull/73057 for a prototype
implementation of the full change.
**Note: ignore the first commit; it is being reviewed in
https://github.com/llvm/llvm-project/pull/86860.**
If a `target` directive is nested in a host OpenMP directive (e.g.
parallel, task, or a worksharing loop), flang currently crashes if the
target directive-related MLIR ops (e.g. `omp.map.bounds` and
`omp.map.info`) depend on SSA values defined inside the parent host
OpenMP directives/ops.
This PR tries to solve this problem by treating these parent OpenMP ops
as "SSA scopes". Whenever we are translating for the device, instead of
completely translating host ops, we just translate their MLIR ops as pure
SSA values.
Fixes #121289
Given the following MLIR, where a variable `x` is `private` for the
`omp.parallel` op and a `depend` for the `omp.task` op:
```mlir
omp.private {type = private} @_QFEx_private_i32 : i32
llvm.func @nested_task_with_deps() {
%0 = llvm.mlir.constant(1 : i64) : i64
%1 = llvm.alloca %0 x i32 {bindc_name = "x"} : (i64) -> !llvm.ptr
omp.parallel private(@_QFEx_private_i32 %1 -> %arg0 : !llvm.ptr) {
omp.task depend(taskdependout -> %arg0 : !llvm.ptr) {
omp.terminator
}
omp.terminator
}
llvm.return
}
```
Before the fix proposed by this PR, the IR builder would emit the
allocation and the initialization logic for the task's dependency info
struct in the parent function's alloc region:
```llvm
define void @nested_task_with_deps() {
....
%.dep.arr.addr = alloca [1 x %struct.kmp_dep_info], align 8
%2 = getelementptr inbounds [1 x %struct.kmp_dep_info], ptr %.dep.arr.addr, i64 0, i64 0
%3 = getelementptr inbounds nuw %struct.kmp_dep_info, ptr %2, i32 0, i32 0
%4 = ptrtoint ptr %omp.private.alloc to i64
store i64 %4, ptr %3, align 4
....
br label %entry
omp.par.entry: ; preds = %entry
....
%omp.private.alloc = alloca i32, align 4
```
Note the following:
- The private value `x` is allocated where it should be, in the parallel
op's entry region.
- However, since the private value is also a dependency of the task, and
since allocation and initialization of the task dependency info struct
both happen in the alloc region, the private value ends up being
referenced before it is actually defined.
This PR fixes the issue by only allocating the task dependency info
struct in the alloc region, while initialization happens at the current
IP of the function together with the rest of the logic that depends on it.
Adds checks in `isPermutationVector` for indices that are out of bounds
and removes the assert.
Signed-off-by: Ian Wood <ianwood2024@u.northwestern.edu>
Add runtime verification for `memref.dim`: check that the index is in
bounds.
Also simplify the pass pipeline for all memref runtime verification
checks.
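As a minimal sketch (function name and shapes are made up for illustration), the new runtime check guards against an out-of-bounds index like this one:
```mlir
// Hypothetical input: %c3 is out of bounds for a rank-2 memref, so the
// runtime verification pass inserts a check that reports the violation
// before `memref.dim` executes.
func.func @oob_dim(%m: memref<?x4xf32>) -> index {
  %c3 = arith.constant 3 : index
  %d = memref.dim %m, %c3 : memref<?x4xf32>
  return %d : index
}
```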
This was lacking a bitcast from the shifted integer type into a float.
Non-struct types other than integers and floats will still not be
Mem2Reg'ed.
Also adds special handling for constants so they are emitted directly as
constants rather than relying on follow-up canonicalization patterns
(`memset` of zero is a case that can appear in the wild).
In corner cases, the vectorization pass may be asked to lower a conv2d
op and then assert in a completely unrelated location in the
vectorizeConvolution() subroutine.
~~This PR rejects the conv2d op early and makes the asserting routine
return failure as a defensive workaround.~~
In addressing this, the PR moves all condition checks out of the
`Conv1dGenerator` into the `convOpPreconditionCheck()` function. This
rejects unsupported ops such as conv2d early and leaves a cleaner
`Conv1dGenerator` constructor.
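A minimal sketch of a 2-D convolution that falls outside what `Conv1dGenerator` supports (function name and shapes are chosen arbitrarily); with this change it is rejected by the precondition check instead of tripping an assert:
```mlir
// Hypothetical example: vectorizing this conv2d is unsupported, so the
// precondition check fails gracefully instead of asserting later in
// vectorizeConvolution().
func.func @conv2d(%in: tensor<1x8x8x4xf32>, %filter: tensor<3x3x4x8xf32>,
                  %out: tensor<1x6x6x8xf32>) -> tensor<1x6x6x8xf32> {
  %0 = linalg.conv_2d_nhwc_hwcf
         {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>}
         ins(%in, %filter : tensor<1x8x8x4xf32>, tensor<3x3x4x8xf32>)
         outs(%out : tensor<1x6x6x8xf32>) -> tensor<1x6x6x8xf32>
  return %0 : tensor<1x6x6x8xf32>
}
```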
This commit updates the following operations (operands/results) to be of
at least rank 1 so that they align with the expectations of the
specification (see the sketch after this list):
- ARGMAX (input)
- REDUCE_ALL (input/output)
- REDUCE_ANY (input/output)
- REDUCE_MAX (input/output)
- REDUCE_MIN (input/output)
- REDUCE_PRODUCT (input/output)
- REDUCE_SUM (input/output)
- CONCAT (each input in input1/output)
- PAD (input1/output)
- REVERSE (input1/output)
- SLICE (input1/output)
- TILE (input1/output)
- TRANSPOSE (input1/output)
In addition to this change, PAD has been updated to allow unranked
tensors for input1/output, in line with other operations.
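For illustration, a minimal sketch (function name and shapes are arbitrary) of an ARGMAX whose input satisfies the new rank requirement; a rank-0 input would now be rejected:
```mlir
// Hypothetical example: the input is rank 2, which satisfies the new
// "at least rank 1" requirement on the ARGMAX input.
func.func @argmax_rank_ok(%arg0: tensor<8x4xf32>) -> tensor<4xi32> {
  %0 = tosa.argmax %arg0 {axis = 0 : i32} : (tensor<8x4xf32>) -> tensor<4xi32>
  return %0 : tensor<4xi32>
}
```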
This is a code cleanup. Update a few places in MLIR that should use
`hasSingleElement`/`getSingleElement`.
Note: `hasSingleElement` is faster than `.getSize() == 1` when it is
used with linked lists etc.
Depends on #131508.
This commit adds 1:N support to `SignatureConversion::remapInputs`. This
API allows users to replace a block argument with multiple replacement
values. (And the block argument is dropped.) The API already supported
"bbarg --> multiple bbargs" mappings, but "bbarg --> multiple SSA
values" was missing.
---------
Co-authored-by: Markus Böck <markus.boeck02@gmail.com>
Currently, there is no common mechanism for supported intrinsics to be
generically annotated with arg and ret attributes. Since there are many
supported intrinsics across different dialects, the amount of work to
teach them all about these attributes is not trivial (though it would be
nice in the long term).
This PR adds a new flag `-prefer-unregistered-intrinsics` that can be
used alongside `--import-llvm` to always use `llvm.intrinsic_call`
during import time (ignoring dialect hooks for custom intrinsic
support).
Using this flag allows us to round-trip the LLVM IR while eliminating a
whole set of differences coming from the lack of arg/ret attributes on
supported intrinsics.
Note that `convertIntrinsic` has to be moved to an implementation file
because it queries `moduleImport` state, which is a forward declaration
in `LLVMImportInterface.h`.
Import and translation support.
Note that existing support (prior to this PR) already covers enough in
translation specifically to emit "Debug Info Version". Also, the debug
info version metadata is being emitted even though the imported IR has
no such information, and it shows up in some tests (will fix that in
another PR).
---------
Co-authored-by: Tobias Gysi <tobias.gysi@nextsilicon.com>
Co-authored-by: Henrich Lauko <xlauko@mail.muni.cz>
This commit adds a concept of the 'dynamic' extension in the Dialect and
checks that compile time constant (CTC) operands for each operator are
constant if the dynamic extension is not loaded.
Operands labeled as CTC in the specification that are of tosa.shape
(shape_t in the specification) type are not checked as they are always
expected to be constant. This requirement is checked elsewhere in the
dialect.
Signed-off-by: Luke Hutton <luke.hutton@arm.com>
With `tensor.expand_shape` allowing the expansion of a dynamic dimension
into multiple dynamic dimensions, adapt the reshape propagation through
expansion to handle cases where one dynamic dimension is expanded into
multiple dynamic dimensions.
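A minimal sketch of the case in question (function name and shapes are illustrative): the single dynamic source dimension is expanded into two dynamic result dimensions, with the dynamic extents supplied via `output_shape`:
```mlir
// Hypothetical example: dimension 0 of the source (dynamic) is expanded
// into two dynamic result dimensions whose extents are %d0 and %d1.
func.func @expand_dynamic(%t: tensor<?x32xf32>, %d0: index, %d1: index)
    -> tensor<?x?x32xf32> {
  %e = tensor.expand_shape %t [[0, 1], [2]] output_shape [%d0, %d1, 32]
      : tensor<?x32xf32> into tensor<?x?x32xf32>
  return %e : tensor<?x?x32xf32>
}
```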
---------
Signed-off-by: MaheshRavishankar <mahesh.ravishankar@gmail.com>
Originally introduced in #130240 and reverted in #131364.
Reproduced the issue locally on Linux by doing a shared library build.
Fixes include adding the missing LINK_LIBS.
**Original commit message:**
This PR adds the SG map propagation step of the XeGPU SIMT distribution.
SG map propagation is a sparse backward dataflow analysis that propagates
the sg_map backward, starting from the operands of certain operations
(DPAS, store, etc.).
This is the first step of XeGPU subgroup distribution. The analysis
result is used to attach layout information to each XeGPU SIMD subgroup
op. The lowering patterns in XeGPUSubgroupDistribute will consume this
layout information to distribute SIMD ops into SIMT ops that work on
work-item-level data fragments.
### Summary of Lowering XeGPU SIMD -> SIMT
1. Subgroup map propagation (this PR)
2. Attach `sg_map` to each op and move all ops inside the
`gpu.warp_execute_on_lane0` region.
3. Distribute each op using `sg_map`.
4. Additional legalization steps to align more with Xe HW.
Add support for importing `dereferenceable` and `dereferenceable_or_null` metadata into LLVM dialect. Add a new attribute which models these two metadata nodes and a new OpInterface.
Computing the bound of an affine op
(`ValueBoundsConstraintSet::computeBound`) crashes when an invalid dim
value is given to the op. The pass needs to check that the dim attribute
is not greater than the rank of the input type.
Fixes https://github.com/llvm/llvm-project/issues/128807
This commit ensures the storage type is retrieved correctly which fixes
a crash when creating a quantized pad const tensor.
Testing is completed via the `tosa-optional-decompositions` pass which
makes use of the `createPadConstTensor` function.
Also includes some cleanup.
- Change to make sure architectures < gfx950 emulate bf16
buffer_atomic_fadd
- Add tests for bf16 buffer_atomic_fadd and architectures: gfx12, gfx942
and gfx950
---------
Co-authored-by: Jakub Kuderski <kubakuderski@gmail.com>
This pattern to bubble up collapse shapes was missing in
`populateFoldReshapeOpsByCollapsingPatterns`.
Signed-off-by: Nirvedh Meshram <nirvedh@gmail.com>
Currently, `DistinctAttr` uses an allocator wrapped in a
`ThreadLocalCache` to manage attribute storage allocations. This ensures
all allocations are freed when the allocator is destroyed.
However, this setup can cause use-after-free errors when
`mlir::PassManager` runs its passes on a separate thread as a result of
crash reproduction being enabled. Distinct attribute storages are
created in the child thread's local storage and freed once the thread
joins. Attempting to access these attributes after this can result in
segmentation faults, such as during printing or alias analysis.
Example: This invocation of `mlir-opt` demonstrates the segfault issue
due to distinct attributes being created in a child thread and their
storage being freed once the thread joins:
```
mlir-opt --mlir-pass-pipeline-crash-reproducer=. --test-distinct-attrs mlir/test/IR/test-builtin-distinct-attrs.mlir
```
This pull request changes the distinct attribute allocator to use
different allocators depending on whether threading is enabled and
whether the pass manager is running its passes in a separate thread. If
multithreading is disabled, a non-thread-local allocator is used. If
threading remains enabled and the pass manager invokes its pass
pipelines in a child thread, then a non-thread-local but synchronised
allocator is used. This ensures that the lifetime of allocated storage
persists beyond the lifetime of the child thread.
I have added two tests for the `-test-distinct-attrs` and
`-enable-debug-info-on-llvm-scope` passes that run them with crash
reproduction enabled.
Previously, the BufferizableOpInterface implementation for
'tensor.reshape' listed the 'shape' operand as an alias for the result
tensor, causing unnecessary conflicts with ops that "write" to the shape
operand.
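A minimal sketch of the op in question (function name and shapes are illustrative); the `shape` operand only supplies the target sizes and does not alias the result, so bufferization no longer needs to treat writes to it as conflicts:
```mlir
// Hypothetical example: %shape feeds the result sizes of tensor.reshape
// but is not an alias of %r, so it should not create read/write conflicts.
func.func @reshape(%t: tensor<?xf32>, %shape: tensor<2xindex>) -> tensor<?x?xf32> {
  %r = tensor.reshape %t(%shape) : (tensor<?xf32>, tensor<2xindex>) -> tensor<?x?xf32>
  return %r : tensor<?x?xf32>
}
```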
This commit adds the following checks to the avg_pool2d and max_pool2d
TOSA operations (see the sketch after this list):
- check kernel values are >= 1
- check stride values are >= 1
- check padding values are >= 0
- check padding values are less than kernel sizes
- check output shape matches the expected output shape
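A minimal sketch of a max_pool2d that satisfies the new checks (function name, shapes, and attribute values are illustrative): kernel and stride are >= 1, padding is >= 0 and smaller than the kernel, and the output shape matches the expected one:
```mlir
// Hypothetical example: a 2x2 kernel with stride 1 and zero padding on a
// 1x8x8x3 input yields the expected 1x7x7x3 output, so all checks pass.
func.func @max_pool_ok(%arg0: tensor<1x8x8x3xf32>) -> tensor<1x7x7x3xf32> {
  %0 = tosa.max_pool2d %arg0 {kernel = array<i64: 2, 2>,
                              stride = array<i64: 1, 1>,
                              pad = array<i64: 0, 0, 0, 0>}
      : (tensor<1x8x8x3xf32>) -> tensor<1x7x7x3xf32>
  return %0 : tensor<1x7x7x3xf32>
}
```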
Signed-off-by: Luke Hutton <luke.hutton@arm.com>
The `vector.contract` op on two matrices A and B will be lowered to
individual dot products of each row and column of A and B respectively.
The existing lowering will extract each column of B for each row of A,
which leads to multiple values in the IR representing the same columns
of B.
This PR makes changes to the `ContractOpToDotLowering` to make sure that
the columns of B are only ever extracted once, so the SSA values
representing the extracted columns are re-used in the IR for later dot
products.
I have updated the existing vector-contract-to-dot-transforms test.
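For reference, a minimal sketch of the kind of contraction this lowering handles (function name and shapes are illustrative); after the change, each column of %b is extracted once and the extracted value is reused across the per-row dot products:
```mlir
// Hypothetical matmul-style contraction: the lowering extracts rows of %a
// and columns of %b; the columns are now extracted only once and reused.
#map_a = affine_map<(d0, d1, d2) -> (d0, d2)>
#map_b = affine_map<(d0, d1, d2) -> (d2, d1)>
#map_c = affine_map<(d0, d1, d2) -> (d0, d1)>
func.func @matmul(%a: vector<2x4xf32>, %b: vector<4x3xf32>,
                  %c: vector<2x3xf32>) -> vector<2x3xf32> {
  %0 = vector.contract {indexing_maps = [#map_a, #map_b, #map_c],
                        iterator_types = ["parallel", "parallel", "reduction"],
                        kind = #vector.kind<add>}
       %a, %b, %c : vector<2x4xf32>, vector<4x3xf32> into vector<2x3xf32>
  return %0 : vector<2x3xf32>
}
```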
205c5325b3
added a transform utility that moved all SSA dependences of an operation
before an insertion point. Similar to that, this PR adds a transform
utility function, `moveValueDefinitions`, to move the slice of operations
that define all values in a `ValueRange` before the insertion point.
While very similar to `moveOperationDependencies`, this method differs
in a few ways:
1. When computing the backward slice, since the start of the slice is a
value, the computed slice needs to be inclusive.
2. The combined backward slice needs to be sorted topologically before
moving the ops, to avoid SSA use-def violations while moving individual
ops.
The PR also adds a new transform op to test this new utility function.
---------
Signed-off-by: MaheshRavishankar <mahesh.ravishankar@gmail.com>
Affine parallelization should take explicitly parallel loops into
account when computing loop depth for dependency analysis purposes. This
was previously not the case, potentially leading to loops incorrectly
being marked as parallel due to depth mismatch.
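As an illustrative sketch (made-up example): the outer loop is already explicitly parallel, and the dependence analysis for the inner `affine.for` must count it when computing loop depth:
```mlir
// Hypothetical nest: %i comes from an explicitly parallel loop; the depth
// used for dependence analysis on the inner loop must account for it.
func.func @nest(%A: memref<100x100xf32>) {
  affine.parallel (%i) = (0) to (100) {
    affine.for %j = 0 to 100 {
      %v = affine.load %A[%i, %j] : memref<100x100xf32>
      affine.store %v, %A[%i, %j] : memref<100x100xf32>
    }
  }
  return
}
```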
One fusion pattern for collapse_shape -> expand_shape was added in
a95ad2da36.
However, if the intermediate tensor between the collapse and the expand
is a 0-D tensor, the `reassociation_map`s for these two ops are special
cases and cannot be fused generically in
`BubbleUpExpandThroughParallelCollapse`.
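A minimal sketch of the corner case (function name and shapes are illustrative): the intermediate tensor is 0-D, so both reassociation maps are empty and the generic fusion logic does not apply:
```mlir
// Hypothetical example: collapsing to a 0-D intermediate and expanding
// back; both ops carry empty reassociation maps, which the generic
// bubble-up path must not try to fuse.
func.func @zero_d_intermediate(%t: tensor<1x1xf32>) -> tensor<1x1x1xf32> {
  %c = tensor.collapse_shape %t [] : tensor<1x1xf32> into tensor<f32>
  %e = tensor.expand_shape %c [] output_shape [1, 1, 1]
      : tensor<f32> into tensor<1x1x1xf32>
  return %e : tensor<1x1x1xf32>
}
```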
This patch adds a default memory space attribute to the DL and adds
methods to query the attribute. This is required as MLIR has no
well-defined default memory space, unlike LLVM, which has 0. While
`nullptr` is a candidate for the default memory space, the `ptr` dialect
will remove the possibility of `nullptr` memory spaces to avoid undefined
semantics.
This patch also modifies `DataLayoutTypeInterface::areCompatible` to
include the new DL spec and merged entries, as these are needed to query
the default memory space.
---------
Co-authored-by: Christian Ulmann <christianulmann@gmail.com>