- Add packed conversions fp8/bf8->bf16 for gfx950 and fp8/bf8->fp32 for
gfx942 in ROCDL dialect
- Update the amdgpu.ext_packed_fp8 lowering to use ROCDL packed fp8/bf8->f32
conversions for vector target types and ROCDL scalar fp8/bf8->f32 conversions
for the scalar target type.
---------
Co-authored-by: Jungwook Park <jungwook.park@amd.com>
When inlining a `callee` with a call site debug location, the inlining
infrastructure was trivially combining the `callee` and the `caller`
locations, forming a "tree" of call stacks. Because of this, the remarks
were printing an incomplete inlining stack.
This commit handles this case and appends the `caller` location at the
end of the `callee`'s stack, extending the chain.
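A rough illustration using MLIR location aliases (file names and line numbers
are hypothetical): if the callee op already carries a call-stack location, the
caller's call site should be appended at the innermost end of that chain so the
remark sees one linear stack.
```mlir
#leaf   = loc("leaf.mlir":1:5)
#callee = loc("callee.mlir":5:3)
#caller = loc("caller.mlir":10:7)

// What the callee op carries before inlining: an existing call stack.
#callee_stack = loc(callsite(#leaf at #callee))

// After inlining, the caller's call site is appended at the end of the
// callee's chain instead of wrapping the whole stack as a sibling.
#callee_at_caller = loc(callsite(#callee at #caller))
#combined         = loc(callsite(#leaf at #callee_at_caller))
```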
This PR adds the Vector transfer_read-to-load rewrite pattern. The
pattern lowers a vector transfer_read op to a combination of
`vector.load`, `arith.select` and `vector.broadcast` if:
- The transfer op is masked.
- The memref is in the buffer address space.
- The other conditions introduced by `TransferReadToVectorLoadLowering` hold.
The motivation for this PR is the lack of masked-load support in the AMDGPU
backend: `llvm.intr.masked.load` lowers to a series of conditional scalar
loads (see the `scalarize-masked-mem-intrin` pass). This PR makes it possible
for a masked transfer_read to be lowered to a buffer load with a bounds check,
giving a more optimized global load access pattern than the existing lowering
of `llvm.intr.masked.load` on vectors (see the sketch below).
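A rough before/after sketch of the rewrite (types, values and the buffer
address-space attribute are illustrative only, not taken from the actual
tests):
```mlir
// Before: a masked transfer_read from a memref in the buffer address space.
%v = vector.transfer_read %mem[%i], %pad, %mask {in_bounds = [true]}
    : memref<64xf32, #amdgpu.address_space<fat_raw_buffer>>, vector<4xf32>

// After (conceptually): an unmasked load plus a select against the broadcast
// padding value; out-of-bounds lanes are handled by the buffer load's
// hardware bounds check.
%loaded = vector.load %mem[%i]
    : memref<64xf32, #amdgpu.address_space<fat_raw_buffer>>, vector<4xf32>
%padv   = vector.broadcast %pad : f32 to vector<4xf32>
%v2     = arith.select %mask, %loaded, %padv : vector<4xi1>, vector<4xf32>
```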
DenseSet, SmallPtrSet, SmallSet, SetVector, and StringSet recently
gained C++23-style insert_range. This patch replaces:
Dest.insert(Src.begin(), Src.end());
with:
Dest.insert_range(Src);
This patch does not touch custom begin functions such as succ_begin for now.
* `PyRegionList` is now sliceable. The dialect bindings generator seems
to assume it is sliceable already (!), yet accessing e.g. `cases` on
`scf.IndexedSwitchOp` raises a `TypeError` at runtime.
* `PyBlockList` and `PyOperationList` support negative indexing. It is
common for containers to do that in Python, and most containers in the
MLIR Python bindings already allow the index to be negative.
Fixes a bug introduced by
https://github.com/llvm/llvm-project/pull/130078.
For non-BlockArgOpenMPOpInterface ops, we also want to map their entry
block arguments to their operands, if any. For the current support in
the OpenMP dialect, the table below lists all ops that have arguments
(SSA operands and/or attributes) and are not target-related. Of all these
ops, we only need to process `omp.atomic.update`, since it is the only op
that has both SSA operands and an attached region. Therefore, the region's
entry block arguments must be mapped to the op's operands in case they
are referenced inside the region (see the sketch after the table).
| op                | operands? | region(s)? | parent is func? | processed? |
|-------------------|-----------|------------|-----------------|------------|
| atomic.read       | yes       | no         | yes             | no         |
| atomic.write      | yes       | no         | yes             | no         |
| atomic.update     | yes       | yes        | yes             | yes        |
| critical          | no        | no         | yes             | no         |
| declare_mapper    | no        | yes        | no              | no         |
| declare_reduction | no        | yes        | no              | no         |
| flush             | yes       | no         | yes             | no         |
| private           | no        | yes        | yes             | no         |
| threadprivate     | yes       | no         | yes             | no         |
| yield             | yes       | no         | yes             | no         |
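As a minimal sketch of the one op that needs processing (the pointer type and
the update computation are arbitrary, and `%x`/`%incr` are assumed to be
defined earlier): the region's entry block argument corresponds to the op's
SSA operand and may be referenced inside the region, so it must be mapped.
```mlir
omp.atomic.update %x : !llvm.ptr {
^bb0(%old: i32):  // %old represents the current value of the operand %x.
  %new = llvm.add %old, %incr : i32
  omp.yield(%new : i32)
}
```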
This patch makes the `map_type` and `map_capture_type` arguments of the
`omp.map.info` operation required, which was already an invariant being
verified by its users via `verifyMapClause()`. This makes it clearer, as
getters no longer return misleading `std::optional` values.
Checks for the `mapper_id` argument are moved to a verifier for the
operation, rather than being checked by users.
Functionally NFC, but not marked as such due to a reordering of
arguments in the assembly format of `omp.map.info`.
This patch makes a few changes to unify the conversion process from the
'omp' to the 'llvm' dialect. The main goal of this change is to
consolidate the logic used to identify legal and illegal ops, and to
consolidate the conversion logic into a single class.
Changes introduced are the following:
- Removal of `getNumVariableOperands()` and `getVariableOperand()` extra
class declarations from OpenMP operations. These are redundant, as they
are equivalent to `mlir::Operation::getNumOperands()` and
`mlir::Operation::getOperands()`, respectively.
- Consolidation of `RegionOpConversion`,
`RegionLessOpWithVarOperandsConversion`,
`RegionOpWithVarOperandsConversion`, `RegionLessOpConversion`,
`AtomicReadOpConversion`, `MapInfoOpConversion`,
`DeclMapperOpConversion` and `MultiRegionOpConversion` into a single
`OpenMPOpConversion` class. This is possible because all of the previous
classes performed parts of the same set of operations, depending on whether
the op defined any regions, took operands, type attributes, etc.
- Update of `mlir::configureOpenMPToLLVMConversionLegality` to use a
single generic set of checks for all operations, removing the need to
list every operation manually.
- Update of `mlir::populateOpenMPToLLVMConversionPatterns` to
automatically populate the list of patterns to include all dialect operations.
This PR changes the type of the command-line arguments representing
`LayoutMapOption` from `std::string` to the enum with the same name.
This allows the values of programmatic uses of the corresponding options to
be checked at compile time.
- When the operand type of an operation changes to a profile-dependent
type, the compliance metadata must be updated. Update the compliance checks
for the following:
- CONV2D, CONV3D, DEPTHWISE_CONV2D, and TRANSPOSE_CONV2D, as zero points
have changed to variable inputs.
- PAD, because pad_const has been changed to a variable input.
- GATHER and SCATTER, as the indices operand has changed to index_t.
- Add an int16 extension check for CONCAT.
- Add a compliance check for COND_IF, WHILE_LOOP, VARIABLE,
VARIABLE_READ, and VARIABLE_WRITE.
- Correct the profile requirements for IDENTITY, TABLE, MATMUL and
LOGICAL-like operations.
- Remove unnecessary checks for non-v1.0 operations.
- Add condition requirements (anyOf and allOf) to the type mode of
metadata for modes that have multiple profile/extension considerations.
This patch introduces the `MemorySpaceAttrInterface` interface. This
interface is responsible for handling the semantics of `ptr` operations.
For example, this interface can be used to create read-only memory
spaces, making any operation other than a load a verification error;
see `TestConstMemorySpaceAttr` for a possible implementation of
this concept.
This patch also introduces the enum dependencies `AtomicOrdering` and
`AtomicBinOp`; both enumerations are clones of the enums with the same
name in the LLVM dialect.
Also, see:
- [[RFC] `ptr` dialect & modularizing ptr ops in the LLVM
dialect](https://discourse.llvm.org/t/rfc-ptr-dialect-modularizing-ptr-ops-in-the-llvm-dialect/75142)
for rationale.
- https://github.com/llvm/llvm-project/pull/73057 for a prototype
implementation of the full change.
**Note: Ignore the first commit, that's being reviewed in
https://github.com/llvm/llvm-project/pull/86860 .**
If a `target` directive is nested in a host OpenMP directive (e.g.
parallel, task, or a worksharing loop), flang currently crashes if the
target directive-related MLIR ops (e.g. `omp.map.bounds` and
`omp.map.info`) depend on SSA values defined inside the parent host
OpenMP directives/ops.
This PR tries to solve this problem by treating these parent OpenMP ops
as "SSA scopes". Whenever we are translating for the device, instead of
completely translating host ops, we just translate their MLIR ops as pure
SSA values.
Adds checks in `isPermutationVector` for indices that are out of
bounds and removes the assert.
Signed-off-by: Ian Wood <ianwood2024@u.northwestern.edu>
This patch fixes:
mlir/lib/Target/LLVMIR/Dialect/NVVM/NVVMToLLVMIRTranslation.cpp:121:3:
error: default label in switch which covers all enumeration values
[-Werror,-Wcovered-switch-default]
Add runtime verification for `memref.dim`: check that the index is in
bounds.
Also simplify the pass pipeline for all memref runtime verification
checks.
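A minimal sketch of the kind of guard this adds (the exact ops and the error
message emitted by the pass may differ):
```mlir
// Op being verified: %idx must be smaller than the rank of %m.
%d = memref.dim %m, %idx : memref<?x?xf32>

// Conceptual check inserted before it and evaluated at runtime:
%rank = arith.constant 2 : index
%ok   = arith.cmpi ult, %idx, %rank : index
cf.assert %ok, "memref.dim: index is out of bounds"
```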
Why? This option can lead to incorrect IR if used in isolation, for
example, consider the IR below:
```mlir
func.func @loop_with_aliasing(%arg0: tensor<5xf32>, %arg1: index, %arg2: index) -> tensor<5xf32> {
  %c1 = arith.constant 1 : index
  %cst = arith.constant 1.000000e+00 : f32
  %0 = tensor.empty() : tensor<5xf32>
  %1 = linalg.fill ins(%cst : f32) outs(%0 : tensor<5xf32>) -> tensor<5xf32>
  // The BufferizableOpInterface says that %2 must alias with %arg0 or be a
  // newly allocated buffer.
  %2 = scf.for %arg3 = %arg1 to %arg2 step %c1 iter_args(%arg4 = %arg0) -> (tensor<5xf32>) {
    scf.yield %1 : tensor<5xf32>
  }
  %cst_0 = arith.constant 1.000000e+00 : f32
  %inserted = tensor.insert %cst_0 into %1[%c1] : tensor<5xf32>
  return %2 : tensor<5xf32>
}
```
If we bufferize with `enforce-aliasing-invariants=false`, we get:
```mlir
func.func @loop_with_aliasing(%arg0: memref<5xf32, strided<[?], offset: ?>>, %arg1: index, %arg2: index) -> memref<5xf32, strided<[?], offset: ?>> {
  %c1 = arith.constant 1 : index
  %cst = arith.constant 1.000000e+00 : f32
  %alloc = memref.alloc() {alignment = 64 : i64} : memref<5xf32>
  linalg.fill ins(%cst : f32) outs(%alloc : memref<5xf32>)
  %0 = scf.for %arg3 = %arg1 to %arg2 step %c1 iter_args(%arg4 = %arg0) -> (memref<5xf32, strided<[?], offset: ?>>) {
    %cast = memref.cast %alloc : memref<5xf32> to memref<5xf32, strided<[?], offset: ?>>
    scf.yield %cast : memref<5xf32, strided<[?], offset: ?>>
  }
  %cst_0 = arith.constant 1.000000e+00 : f32
  memref.store %cst_0, %alloc[%c1] : memref<5xf32>
  return %0 : memref<5xf32, strided<[?], offset: ?>>
}
```
Which is not correct IR since the loop yields the allocation.
I am using this option. What do I need to do now?
If you are using this option in isolation, you are possibly generating
incorrect IR, so you need to revisit your bufferization strategy. If you
are using it together with `copyBeforeWrite`, you simply need to retire
the `enforceAliasingInvariants` option.
Co-authored-by: Matthias Springer <mspringer@nvidia.com>
This was lacking a bitcast from the shifted integer type into a float.
Non-struct types other than integers and floats will still not be
Mem2Reg'ed.
Also adds special handling for constants so that they are emitted directly
as constants rather than relying on follow-up canonicalization patterns
(a `memset` of zero is a case that can appear in the wild).
In corner cases, the vectorization pass may attempt to lower a conv2d
op and hit an assertion in a completely unrelated location in the
`vectorizeConvolution()` subroutine.
~~This PR rejects the conv2d op early and makes the asserting routine
return failure as a defensive workaround.~~
In addressing this, the PR moves all condition checks out of the
`Conv1dGenerator` and into the `convOpPreconditionCheck()` function. This
makes unsupported ops such as conv2d be rejected early and leaves a
cleaner `Conv1dGenerator` constructor.
This commit updates the following operations (operands/results) to be of
at least rank 1 such that it aligns with the expectations of the
specification:
- ARGMAX (input)
- REDUCE_ALL (input/output)
- REDUCE_ANY (input/output)
- REDUCE_MAX (input/output)
- REDUCE_MIN (input/output)
- REDUCE_PRODUCT (input/output)
- REDUCE_SUM (input/output)
- CONCAT (each input in input1/output)
- PAD (input1/output)
- REVERSE (input1/output)
- SLICE (input1/output)
- TILE (input1/output)
- TRANSPOSE (input1/output)
In addition to this change, PAD has been updated to allow unranked
tensors for input1/output, in line with other operations.
This is a code cleanup. Update a few places in MLIR that should use
`hasSingleElement`/`getSingleElement`.
Note: `hasSingleElement` is faster than `.getSize() == 1` when it is
used with linked lists etc.
Depends on #131508.
This PR refactors `alignedConversionPrecondition` from
VectorEmulateNarrowType.cpp and adds new helper hooks.
**Update `alignedConversionPrecondition` (1)**
This method doesn't require the vector type for the "container" argument. The
underlying element type is sufficient. The corresponding argument has been
renamed to `containerTy`; it is meant to be the multi-byte container element
type (`i8`, `i16`, `i32`, etc.). With this change, the updated invocations of
`alignedConversionPrecondition` (e.g. in `RewriteAlignedSubByteIntExt`) make it
clear that the container element type is assumed to be `i8`.
**Update `alignedConversionPrecondition` (2)**
The final check in `alignedConversionPrecondition` has been replaced with a new
helper method, `isSubByteVecFittable`. This helper hook is now also re-used in
`ConvertVectorTransferRead` (to improve code re-use); see the sketch below.
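For intuition about the "fitting" condition only (this is not the rewrite the
pass performs): a sub-byte vector fits an `i8` container element when its
total bit width is a multiple of 8, which is what makes a plain bitcast such
as the one below legal.
```mlir
// 4 x i2 = 8 bits, so the vector fits exactly into one i8 element.
%packed = vector.bitcast %v : vector<4xi2> to vector<1xi8>
```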
**Other updates**
Extended + unified comments.
**Implements**: https://github.com/llvm/llvm-project/issues/123630
This commit adds 1:N support to `SignatureConversion::remapInputs`. This
API allows users to replace a block argument with multiple replacement
values. (And the block argument is dropped.) The API already supported
"bbarg --> multiple bbargs" mappings, but "bbarg --> multiple SSA
values" was missing.
---------
Co-authored-by: Markus Böck <markus.boeck02@gmail.com>
Currently, there is no common mechanism for supported intrinsics to be
generically annotated with arg and ret attributes. Since there are many
supported intrinsics across different dialects, the amount of work to
teach them all about these attributes is not trivial (though it would be
nice in the long term).
This PR adds a new flag `-prefer-unregistered-intrinsics` that can be
used alongside `--import-llvm` to always use `llvm.intrinsic_call`
during import time (ignoring dialect hooks for custom intrinsic
support).
Using this flag allows us to round-trip the LLVM IR while eliminating a
whole set of differences coming from the lack of arg/ret attributes on
supported intrinsics.
Note that `convertIntrinsic` has to be moved to an implementation file
because it queries `moduleImport` state, which is only a forward declaration
in `LLVMImportInterface.h`.
Import and translation support.
Note that existing support (prior to this PR) already covers enough in
translation specifically to emit "Debug Info Version". Also, the debug
info version metadata is emitted even though the imported IR has no debug
information, and it shows up in some tests (that will be fixed in another
PR).
---------
Co-authored-by: Tobias Gysi <tobias.gysi@nextsilicon.com>
Co-authored-by: Henrich Lauko <xlauko@mail.muni.cz>
This patch fixes:
mlir/lib/Dialect/XeGPU/Transforms/XeGPUSubgroupDistribute.cpp:54:3:
error: definition of implicit copy assignment operator for 'Layout'
is deprecated because it has a user-declared copy constructor
[-Werror,-Wdeprecated-copy]
mlir/lib/Dialect/XeGPU/Transforms/XeGPUSubgroupDistribute.cpp:103:3:
error: definition of implicit copy assignment operator for 'SGMap'
is deprecated because it has a user-declared copy constructor
[-Werror,-Wdeprecated-copy]
This commit adds the concept of the 'dynamic' extension to the dialect and
checks that compile-time constant (CTC) operands for each operator are
constant if the dynamic extension is not loaded.
Operands labeled as CTC in the specification that are of tosa.shape
(shape_t in the specification) type are not checked as they are always
expected to be constant. This requirement is checked elsewhere in the
dialect.
Signed-off-by: Luke Hutton <luke.hutton@arm.com>
With `tensor.expand_shape` allowing the expansion of a dynamic dimension
into multiple dynamic dimensions, adapt the reshape propagation through
expansion to handle cases where one dynamic dimension is expanded into
multiple dynamic dimensions, as shown below.
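For example (shapes and value names hypothetical), a single dynamic dimension
can now be expanded into two dynamic dimensions, with the result sizes passed
as SSA values:
```mlir
%expanded = tensor.expand_shape %t [[0, 1]] output_shape [%d0, %d1]
    : tensor<?xf32> into tensor<?x?xf32>
```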
---------
Signed-off-by: MaheshRavishankar <mahesh.ravishankar@gmail.com>
Originally introduced in #130240 and reverted in #131364
Reproduced the issue locally on Linux by doing a shared-library build. The
fixes include adding the missing LINK_LIBS.
**Original commit message:**
This PR adds the SG map propagation step of the XeGPU SIMT distribution.
SG map propagation is a sparse backward dataflow analysis that propagates
the sg_map backward, starting from the operands of certain operations
(DPAS, store, etc.).
This is the first step of XeGPU subgroup distribution. This analysis
result is used to attach layout information to each XeGPU SIMD subgroup
op. The lowering patterns in XeGPUSubgroupDistribute will consume this
layout information to distribute SIMD ops into SIMT ops that work on
work-item-level data fragments.
### Summary of Lowering XeGPU SIMD -> SIMT
1. Subgroup map propagation (This PR)
2. Attach `sg_map` to each op and move all ops inside a
`gpu.warp_execute_on_lane0` region.
3. Distribute each op using `sg_map`
4. Additional legalization steps to align more with Xe HW.
The patch abstracts the sending and receiving of JSON messages in
`JSONTransport` to allow custom implementations. For example, one
concrete implementation can use pipes without needing to convert a file
descriptor to a `FILE` object.
This patch avoids calling `TargetOp::getInnermostCapturedOmpOp` multiple
times during initialization of default and runtime target attributes in
MLIR to LLVM IR translation of `omp.target` operations. This is a
potentially expensive operation, so this change should help keep compile
times lower.
Add support for importing `dereferenceable` and `dereferenceable_or_null` metadata into the LLVM dialect. Add a new attribute that models these two metadata nodes, and a new OpInterface.