This commit adds an overload of `replaceOpWithMultiple` that accepts
additional container types. This was requested by users of the new
`replaceOpWithMultiple` API; in particular, one missing container type
was `SmallVector<SmallVector<Value>>`. The "default" `ArrayRef<ValueRange>`
container type can lead to use-after-scope errors in cases such as:
```c++
// Compute the replacement value ranges. Some replacements are single
// values, some are value ranges.
SmallVector<ValueRange> repl;
repl.push_back(someValueRange); // OK
for (...) {
  // push_back(Value) triggers an implicit conversion to ValueRange,
  // which does not own the range.
  repl.push_back(someValue); // triggers use-after-scope later
}
rewriter.replaceOpWithMultiple(op, repl);
```
In this example, users should use `SmallVector<SmallVector<Value>>
repl;`.
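A minimal sketch of the recommended pattern, assuming the new overload accepts owning containers such as `SmallVector<SmallVector<Value>>` (the names `op`, `rewriter`, `someValueRange`, and `singleValues` are illustrative):
```c++
// Each inner SmallVector owns its replacement values, so no ValueRange
// can dangle after the loop.
SmallVector<SmallVector<Value>> repl;
repl.push_back(llvm::to_vector(someValueRange)); // copy the range into owning storage
for (Value v : singleValues)
  repl.push_back({v}); // single-value replacement, no implicit ValueRange
rewriter.replaceOpWithMultiple(op, std::move(repl));
```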
Adding wide integer emulation support for `arith.fpto*i` operations. As
with the other emulated operations, the upper and lower `N` bits of the
`i2N` integer result are emitted separately.
For the unsigned case we use the following emulation
```c
// example is 64 -> 32 bit emulation, but the implementation is generalized to any 2N -> N case
const double TWO_POW_N = (uint64_t(1) << N); // 2^N, N is the bitwidth of the widest int supported
// f is a floating-point value representing the input of the fptoui op.
uint32_t hi = (uint32_t)(f / TWO_POW_N); // Truncates the division result
uint32_t lo = (uint32_t)(f - hi * TWO_POW_N); // Subtracts to get the lower bits.
```
For the signed case, we defer the emulation of the absolute value to
`fptoui` and handle the sign:
```
fptosi(fp) = sign(fp) * fptoui(abs(fp))
```
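A rough C++ sketch of the signed path, under the same 64 -> 32 bit assumptions as above (the function name and constant are illustrative, not the pass's actual code):
```c++
#include <cmath>
#include <cstdint>

// Defer the magnitude to the unsigned split, then reapply the sign.
// Edge cases (NaN, +-inf, overflow) are UB, as noted below.
int64_t emulatedFptosi(double f) {
  const double twoPow32 = double(uint64_t(1) << 32); // 2^32
  double a = std::fabs(f);
  uint32_t hi = uint32_t(a / twoPow32);              // upper 32 bits of |f|
  uint32_t lo = uint32_t(a - double(hi) * twoPow32); // lower 32 bits of |f|
  int64_t mag = (int64_t(hi) << 32) | int64_t(lo);
  return std::signbit(f) ? -mag : mag;
}
```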
The edge cases of `NaN`, `+-inf`, and overflow/underflow are undefined
behaviour; the resulting numbers are whatever the lower-bitwidth
operations produce for their own UB inputs. These operations also
propagate poison values.
Signed-off-by: Ege Beysel <beysel@roofline.ai>
This commit pulls apart the inherent attribute dependence of classes
like EnumAttrInfo and EnumAttrCase, factoring them out into simpler
EnumCase and EnumInfo variants. This allows specifying the cases of an
enum without needing to make the cases, or the EnumInfo itself, a
subclass of SignlessIntegerAttrBase.
The existing classes are retained as subclasses of the new ones, both
for backwards compatibility and to allow attribute-specific information.
In addition, the new BitEnum class changes the default printer/parser
behavior: when multiple keywords appear, as when both nuw and nsw are
set in overflow flags, the output is no longer quoted by operator<<, and
the FieldParser instance now expects multiple keywords. All instances of
BitEnumAttr retain the old behavior.
Currently, only the ctor/dtor lists and their priorities are supported.
This PR adds support for the missing data field.
A few implementation notes:
- The assembly printer has a fixed form because the previous `attr_dict`
would sort the dict by key name, making global_dtor and global_ctor
differ in the order of printed arguments.
- LLVM's `ptr null` is converted to `#llvm.zero`; otherwise, we would
have to create a region to use the default operation conversion for
`ptr null`, which is silly given that the field only supports null or a
symbol.
Adds wide integer emulation support for the `arith.subi` op. `(i2N, i2N)
-> (i2N)` ops are emulated as `(vector<2xiN>, vector<2xiN>) ->
(vector<2xiN>)`, just like the other emulation patterns.
The emulation uses the following scheme:
```
resLow = lhsLow - rhsLow;   // carry = 1 if rhsLow > lhsLow
resHigh = lhsHigh - carry - rhsHigh;
```
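A scalar C++ model of the scheme (illustrative only; the actual pattern operates on `vector<2xiN>` values):
```c++
#include <cstdint>

// Each i64 value is modeled as (low, high) i32 halves; the borrow out of
// the low half is subtracted from the high half.
struct WideInt { uint32_t low, high; };

WideInt emulatedSubi(WideInt lhs, WideInt rhs) {
  uint32_t resLow = lhs.low - rhs.low;          // wraps modulo 2^32
  uint32_t carry = rhs.low > lhs.low ? 1u : 0u; // borrow out of the low half
  uint32_t resHigh = lhs.high - carry - rhs.high;
  return {resLow, resHigh};
}
```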
Signed-off-by: Ege Beysel <beysel@roofline.ai>
The current usage of alignstack is restricted to LLVM pointer types,
but when it is used on parameters it can apply to other types as well;
see examples like `{i8, i8}`, `[2 x float]`, etc. in `llvm/test/CodeGen`.
This PR lifts the restriction and adds test cases.
When the TMA descriptor is transferred from host memory to global memory
using cudaMemcpy, each thread block must insert a fence before any
thread accesses the updated tensor map in global memory. Once the tensor
map has been accessed, no additional fences are needed by that block
unless the map is modified again.
[Example from cuda programming
guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#using-tma-to-transfer-multi-dimensional-arrays).
The `tma.fence.descriptor` basically implements
`ptx::fence_proxy_tensormap_generic`.
```
#include <cuda.h>
#include <cuda/ptx>
namespace ptx = cuda::ptx;
__device__ CUtensorMap global_tensor_map;
__global__ void kernel(CUtensorMap *tensor_map)
{
  // Fence acquire tensor map:
  ptx::n32_t<128> size_bytes;
  // Since the tensor map was modified from the host using cudaMemcpy,
  // the scope should be .sys.
  ptx::fence_proxy_tensormap_generic(
      ptx::sem_acquire, ptx::scope_sys, tensor_map, size_bytes
  );
  // Safe to use tensor_map after fence inside this thread..
}
int main() {
  CUtensorMap local_tensor_map;
  // [ ..Initialize map.. ]
  cudaMemcpy(&global_tensor_map, &local_tensor_map, sizeof(CUtensorMap), cudaMemcpyHostToDevice);
  kernel<<<1, 1>>>(global_tensor_map);
}
```
The `mbarrier.create` op can create multiple mbarrier objects, and other
mbarrier-related ops can access an mbarrier using a dynamic SSA value.
This is especially useful when using mbarriers in dynamic loops.
This PR adds the `mbarrier.get` op, which returns a pointer to a
specific mbarrier object from a group of barriers created by the
nvgpu.mbarrier.create operation. It is useful when composing the NVGPU
and NVVM dialects.
Example:
```
%mbars = nvgpu.mbarrier.create
    -> !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>, num_barriers = 10>
%mbar_pointer = nvgpu.mbarrier.get %mbars[%c2]
    : !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>>
    -> i32
```
The `ForallOp::ensureTerminator` override ensures that the terminator
`InParallelOp` has a region. However, if the terminator of `scf.forall`
is not an `InParallelOp`, calling `ensureTerminator` causes a crash. Since
the `InParallelOp` builder already guarantees the existence of a region,
`ForallOp::ensureTerminator` is redundant and can be safely removed.
Fixes #130019.
This moves the EnumAttrCase and EnumAttr classes from Attribute.h/.cpp
to a new EnumInfo.h/cpp and renames them to EnumCase and EnumInfo,
respectively.
This doesn't change any of the tablegen files or any user-facing aspects
of the enum attribute generation system, just reorganizes code in order
to make the main PR (#132148) shorter.
We can use *Set::insert_range to collapse:

  for (auto Elem : Range)
    Set.insert(Elem);

down to:

  Set.insert_range(Range);

In some cases, we can further fold that into the set declaration.
Debug value/declare operations imported before landing pad operations at
the bb start break invoke op verification:
```
error: first operation in unwind destination should be a llvm.landingpad operation
```
This fixes the issue by making the placement slightly smarter.
Reverts llvm/llvm-project#131876
GPU integration tests are broken by this PR, e.g.
`mlir/test/Integration/GPU/CUDA/sm90/gemm_f32_f16_f16_128x128x128.mlir`.
Refactors the following pairs of helper hooks:
* `dynamicallyInsertSubVector` + `staticallyInsertSubVector`
* `dynamicallyExtractSubVector` + `staticallyExtractSubVector`
These hooks are very similar, so I have unified the variable names and
various conditions to make the actual differences clearer.
Propagate `Extract(Elementwise(...))` -> `Elementwise(Extract...)`.
Currently limited to the case where the extract is the single use of the
elementwise op, to avoid introducing additional elementwise ops.
* Improve the verifier of `memref.subview` to detect out-of-bounds
extractions (see the sketch after this list).
* Improve the documentation of `memref.subview` to make clear that
out-of-bounds extractions are not allowed. Rewrite examples to use the
new `strided<>` notation instead of `affine_map` layout maps. Also
remove all unrelated operations (`memref.alloc`) from the examples.
* Fix various test cases where `memref.subview` ops ran out-of-bounds.
* Update canonicalization patterns to ensure that they do not fold IR
if doing so would generate IR that no longer verifies.
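The per-dimension bound check from the first bullet amounts to the following sketch (illustrative C++, not the verifier's actual code; `offset`, `size`, and `stride` are the static subview parameters for one dimension):
```c++
#include <cstdint>

// True if the last element touched by the subview stays within the source
// dimension. Dynamic offsets/sizes/strides cannot be checked statically.
bool staticallyInBounds(int64_t offset, int64_t size, int64_t stride,
                        int64_t sourceDimSize) {
  if (size == 0)
    return true; // empty slice: nothing is read
  int64_t lastIndex = offset + (size - 1) * stride;
  return offset >= 0 && lastIndex < sourceDimSize;
}
```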
Related discussion on Discourse:
https://discourse.llvm.org/t/out-of-bounds-semantics-of-memref-subview/85293
This commit checks whether the operands/results of an operator can be
found in the profile compliance mapping; if they cannot, the operator is
considered invalid. As a result, operator data type combinations that
are not listed under the "Supported Data Types" section of the TOSA
specification are disallowed and the validation pass fails.
Signed-off-by: Luke Hutton <luke.hutton@arm.com>
Handle dense resource attributes in the TOSA transpose folder.
Currently, their interface does not align with the rest of
`ElementsAttr` when it comes to data access, hence the special
handling.
Signed-off-by: Georgios Pinitas <georgios.pinitas@arm.com>
This gets the consumer fusion method in sync with the corresponding
producer fusion method `tileAndFuseProducerOfSlice`. Not taking the
surrounding loops as input required the use of complicated analysis to
retrieve them, which was very fragile. Just like the producer fusion
method, the loops now need to be passed in as an argument, typically the
loops created by the tiling methods.
Some utilities are added to check that the loops passed in are perfectly
nested (in the case of an `scf.for` loop nest).
This is change 1 of N to simplify the implementation of tile and fuse
consumers.
---------
Signed-off-by: MaheshRavishankar <mahesh.ravishankar@gmail.com>
Create a new operation `DSOLocalEquivalentOp`, following the approach of
other constants.
This is similar to `AddressOfOp` but with specific semantics: it only
supports functions and function aliases (no globals), and extern_weak
linkage is not allowed.
An alternative approach is to use a new `UnitAttr` in `AddressOfOp` and
check that attribute to enforce specific semantics in the verifiers. The
drawback is going against what other constants do and having to add more
attributes in the future when we introduce `no_cfi`, `blockaddress`,
etc.
While here, improve the error message for other missing constants.
Since #130487, `tensor.extract_slice` and `tensor.insert_slice` ops that
are statically detected to go out of bounds are rejected by the
verifier.
This commit fixes canonicalization patterns that currently fold
dynamically out-of-bounds ops (valid IR) to statically out-of-bounds ops
(invalid IR).
For vector.extract, the folder always canonicalizes to a vector.extract
operation, while the rewrite pattern canonicalizes to a vector.broadcast
except in the case of 0-rank vectors.
Remove this special casing, and instead handle the 0-rank vector case in
the folder.
A number of places in our codebase special case to use
extractelement/insertelement for 0D vectors, because extract/insert did
not support 0D vectors previously. Since insert/extract support 0D
vectors now, use them instead of special casing.
This patch updates the definition of vector.scatter to be the
counterpart of vector.gather.
All of the changes in this patch make vector.scatter match
vector.gather's multi-dimensional definition.
Unrolling for vector.scatter will be implemented in subsequent patches.
Discourse Discussion:
https://discourse.llvm.org/t/rfc-improving-gather-codegen-for-vector-dialect/85011/13
This patch decouples unrolling vector.gather from lowering vector.gather
to llvm.masked.gather.
This is consistent with how vector.load, vector.store,
vector.maskedload, vector.maskedstore lower to LLVM.
Some interesting test changes from this patch:
- Tests for lowering 2D vector.gather to LLVM are deleted. This is
consistent with other memory load/store ops.
- There are still tests for 2D vector.gather, but the constant mask for
these tests is modified. This is because, with the updated lowering, one
of the unrolled vector.gather ops disappears because it is masked off
(which also demonstrates why this is a better lowering path).
Overall, this makes vector.gather take the same consistent path for
lowering to LLVM as other load/store ops.
Discourse Discussion:
https://discourse.llvm.org/t/rfc-improving-gather-codegen-for-vector-dialect/85011/13
The timer should be cleared properly when the target format is
`offload`, so as to avoid output like this:
```
===-------------------------------------------------------------------------===
Miscellaneous Ungrouped Timers
===-------------------------------------------------------------------------===
---Wall Time--- --- Name ---
----- Timer for perf llvm-ir -> isa and isa -> binary.
...
```
Co-authored-by: Guray Ozen <guray.ozen@gmail.com>
This is an implementation for [RFC: Supporting Sub-Channel Quantization
in
MLIR](https://discourse.llvm.org/t/rfc-supporting-sub-channel-quantization-in-mlir/82694).
In order to make the review process easier, the PR has been divided into
the following commit labels:
1. **Add implementation for sub-channel type:** Includes the class
design for `UniformQuantizedSubChannelType`, printer/parser and bytecode
read/write support. The existing types (per-tensor and per-axis) are
unaltered.
2. **Add lowering for sub-channel type:** Lowering of
`quant.qcast` and `quant.dcast` operations to Linalg operations.
3. **Adding C/Python APIs:** We first define the C APIs and build the
Python APIs on top of those.
4. **Add pass to normalize generic ....:** This pass normalizes
sub-channel quantized types to per-tensor per-axis types, if possible.
A design note:
- **Explicitly storing the `quantized_dimensions`, even when they can be
derived for ranked tensors.**
While it's possible to infer quantized dimensions from the static shape
of the scales (or zero-points) tensor for ranked
data tensors
([ref](https://discourse.llvm.org/t/rfc-supporting-sub-channel-quantization-in-mlir/82694/3)
for background), there are cases where this can lead to ambiguity and
issues with round-tripping.
```
Consider the example: tensor<2x4x!quant.uniform<i8:f32:{0:2, 0:2}, {{s00:z00, s01:z01}}>>
```
The shape of the scales tensor is [1, 2], which might suggest that only
axis 1 is quantized. While this inference is technically correct, as the
block size for axis 0 is a degenerate case (equal to the dimension
size), it can cause problems with round-tripping. Therefore, even for
ranked tensors, we are explicitly storing the quantized dimensions.
Suggestions welcome!
PS: I understand that the upcoming holidays may impact your schedule, so
please take your time with the review. There's no rush.
Previously, the encodings were unconditionally dropped during shape
inference. This revision adds support for preserving the encodings in
the linalg ops.
---------
Signed-off-by: hanhanW <hanhan0912@gmail.com>
- Add packed conversions fp8/bf8->bf16 for gfx950 and fp8/bf8->fp32 for
gfx942 in the ROCDL dialect.
- Update the amdgpu.ext_packed_fp8 lowering to use ROCDL packed
fp8/bf8->f32 conversions for vector target types and ROCDL scalar
fp8/bf8->fp32 for scalar target types.
---------
Co-authored-by: Jungwook Park <jungwook.park@amd.com>
When inlining a `callee` with a call site debug location, the inlining
infrastructure was trivially combining the `callee` and the `caller`
locations, forming a "tree" of call stacks. Because of this, the remarks
were printing an incomplete inlining stack.
This commit handles this case and appends the `caller` location at the
end of the `callee`'s stack, extending the chain.
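A rough sketch of the intended chaining (illustrative only, not the patch's actual code; `appendCaller` is a hypothetical helper):
```c++
#include "mlir/IR/Location.h"
using namespace mlir;

// Walk down the callee's call-site chain and append callerLoc at its end,
// instead of nesting the entire callee stack under callerLoc.
Location appendCaller(Location calleeLoc, Location callerLoc) {
  if (auto callSite = dyn_cast<CallSiteLoc>(calleeLoc))
    return CallSiteLoc::get(callSite.getCallee(),
                            appendCaller(callSite.getCaller(), callerLoc));
  return CallSiteLoc::get(calleeLoc, callerLoc);
}
```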
This PR adds a Vector transfer_read to load rewrite pattern. The
pattern lowers a vector transfer read op to a combination of
`vector.load`, `arith.select` and `vector.broadcast` (sketched after the
list below) if:
- The transfer op is masked.
- The memref is in the buffer address space.
- Other conditions introduced by `TransferReadToVectorLoadLowering` hold.
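A rough scalar model of the lowered form (illustrative only; it ignores the hardware bounds checking that the buffer address space provides for out-of-bounds lanes):
```c++
#include <array>
#include <cstddef>

// Plain load of the full vector, then a per-lane select against the padding
// value: this models vector.load + vector.broadcast(%padding) + arith.select.
template <size_t N>
std::array<float, N> maskedTransferRead(const float *base, size_t offset,
                                        const std::array<bool, N> &mask,
                                        float padding) {
  std::array<float, N> result;
  for (size_t i = 0; i < N; ++i) {
    float loaded = base[offset + i];        // vector.load (bounds-checked on buffers)
    result[i] = mask[i] ? loaded : padding; // arith.select against the broadcast padding
  }
  return result;
}
```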
The motivation for this PR is the lack of masked load support in the
AMDGPU backend: `llvm.intr.masked.load` lowers to a series of
conditional scalar loads (see the `scalarize-masked-mem-intrin` pass).
This PR makes it possible for a masked transfer_read to be lowered to a
buffer load with bounds checking, allowing a more optimized global load
access pattern compared with the existing lowering of
`llvm.intr.masked.load` on vectors.