Update the `generate-tests.py` script to create new tests for `atomicrmw
{fadd,fmin,fmax}` and test these with `half`, `float`, `bfloat` and
`double`.
Generate the fp auto-tests to check codegen both with and without
`+lsfe`, so that when #125686 is merged, the `+lsfe` runs will use a
single atomic floating-point instruction.
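For reference, the operations under test look like this when emitted through the C++ IRBuilder API (a minimal sketch; the actual tests are textual IR produced by the script, and the names below are illustrative):

```cpp
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// Emit the three atomicrmw variants the generated tests cover; Ptr and
// Val stand in for a pointer operand and an fp value of the tested type
// (half, float, bfloat or double).
void emitAtomicFPOps(IRBuilder<> &B, Value *Ptr, Value *Val) {
  B.CreateAtomicRMW(AtomicRMWInst::FAdd, Ptr, Val, MaybeAlign(),
                    AtomicOrdering::SequentiallyConsistent);
  B.CreateAtomicRMW(AtomicRMWInst::FMin, Ptr, Val, MaybeAlign(),
                    AtomicOrdering::SequentiallyConsistent);
  B.CreateAtomicRMW(AtomicRMWInst::FMax, Ptr, Val, MaybeAlign(),
                    AtomicOrdering::SequentiallyConsistent);
}
```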
A collection of fixes to the mesh dialect
- allow constants in sharding propagation/spmdization
- fix tensor replication (e.g. 0d tensors)
- improve canonicalization
- fix sharding propagation incorrectly generating too many ShardOps
A new operation, `mesh.GetShardOp`, enables exchanging sharding
information (e.g. on function boundaries)
`arith.bitcast` is allowed on memrefs, and such code can actually be
generated by IREE's `ConvertBf16ArithToF32Pass`.
`LLVM::detail::vectorOneToOneRewrite` doesn't properly check its types
and will generate a bitcast between structs, which is illegal.
With opaque pointers the bitcast is a no-op for memrefs, so we can just
add a type check in `LLVM::detail::vectorOneToOneRewrite` and add a
separate pattern which removes the op if the converted types are the
same.
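The extra cleanup pattern could look roughly like this (a sketch against the MLIR C++ pattern API; the class name is illustrative, not the patch's actual code):

```cpp
#include "mlir/Dialect/LLVMIR/LLVMDialect.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

// Removes an LLVM dialect bitcast whose converted source and result
// types are identical, i.e. a no-op under opaque pointers.
struct EraseTrivialBitcast : OpRewritePattern<LLVM::BitcastOp> {
  using OpRewritePattern::OpRewritePattern;

  LogicalResult matchAndRewrite(LLVM::BitcastOp op,
                                PatternRewriter &rewriter) const override {
    if (op.getArg().getType() != op.getType())
      return failure();
    rewriter.replaceOp(op, op.getArg());
    return success();
  }
};
```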
This commit adds the new analyzer option
`assume-at-least-one-iteration`, which is `false` by default, but can be
set to `true` to ensure that the analyzer always assumes at least one
iteration in loops.
In some situations this "loop is skipped" execution path is an important
corner case that may evade the notice of the developer and hide
significant bugs -- however, there are also many situations where it's
guaranteed that at least one iteration will happen (e.g. some data
structure is always nonempty), but the analyzer cannot realize this and
will produce false positives when it assumes that the loop is skipped.
This commit refactors some logic around the implementation of the new
feature, but the only functional change is introducing the new analyzer
option. If the new option is left in its default state (false), then the
analysis is functionally equivalent to an analysis done with a version
before this commit.
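As a hypothetical illustration (not from the patch), this is the shape of code where the skipped-loop assumption produces a false positive:

```cpp
#include <vector>

// Callers guarantee `v` is nonempty, but the analyzer cannot know that:
// it also explores the path where the loop body never runs, on which
// `denom` stays 0 and the division below is flagged as a division by
// zero -- a false positive.
int average(const std::vector<int> &v) {
  int sum = 0;
  int denom = 0;
  for (int x : v) {
    sum += x;
    ++denom;
  }
  return sum / denom;
}
```

With the option enabled (e.g. via `-analyzer-config assume-at-least-one-iteration=true`), the zero-iteration path is not explored and the report disappears.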
lowerShuffleAsBroadcast only matches a known-splat shuffle mask, but we
can use the isShuffleEquivalent/IsElementEquivalent helpers to attempt
to find a hidden broadcastable shuffle pattern.
This requires an extension to IsElementEquivalent to peek through
bitcasts to match against wider shuffles - these typically appear
during shuffle lowering, where we've widened a preceding shuffle, often
into a vector concatenation etc.
Amazingly I hit this while yak shaving #126033...
We currently handle sequences of fixed-length arrays properly by **not**
emitting length parameters for `embox` ops inside the `omp.private` op.
However, we do not handle the scalar case. This PR extends
`getLengthParameters` defined in `PrivateReductionUtils.cpp` to handle
such cases.
Fixes issue reported in #125732.
The command already supported disassembling multiple ranges, among
other reasons because inline functions can be discontinuous. The main
thing that was missing was the ability to retrieve the function ranges
from the top-level function object.
The output of the command for the case where the function entry point is
not its lowest address is somewhat confusing (we're showing negative
offsets), but it is correct.
The LLVM dialect lowers globals using IRBuilder, relying on it creating
constant expressions where possible. As we remove support for more
constant expressions (per
https://discourse.llvm.org/t/rfc-remove-most-constant-expressions/63179),
this can cause issues for cases where the constant expression is no
longer supported, and the operation cannot be constant folded without
DataLayout being available. In particular, I ran into this issue with
flang and the removal of mul constant expressions.
Address this by using TargetFolder when creating globals, which will
perform DL-aware constant folding. I think it would make sense to also
do this in general, but I'm starting with globals where not doing this
can result in translation failures.
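Concretely, the change amounts to constructing the builder with a DataLayout-aware folder; a minimal sketch of the technique (not the exact translation code):

```cpp
#include "llvm/Analysis/TargetFolder.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Instead of the default ConstantFolder, instantiate IRBuilder with
// TargetFolder so that expressions like mul-of-constants are folded
// using the module's DataLayout rather than emitted as constant exprs.
void buildWithTargetFolder(Module &M) {
  const DataLayout &DL = M.getDataLayout();
  IRBuilder<TargetFolder> Builder(M.getContext(), TargetFolder(DL));
  // ... use Builder to build global initializers as usual ...
}
```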
Ideally, globals with these problematic expressions would never be
generated in the first place, but there has been little movement on
fixing this (https://github.com/llvm/llvm-project/issues/96047).
This is typically handled by SimplifyDemandedVectorElts, but that will
fail when there are multiple uses of the subv_broadcast_load node. If
there's just one use of the load result (and the rest are uses of the
memory chain), we can still replace it with a load and update the chain
accordingly.
Noticed on #126517
This commit moves the implementations of conversion builtins to the CLC
library. It keeps the dichotomy of regular vs. clspv implementations of
the conversions. However, for the sake of a consistent interface all CLC
conversion routines are built, even the ones that clspv opts out of in
the user-facing OpenCL layer.
It simultaneously updates the Python script to use f-strings for
formatting.
Add pretty printer/parser for fir.call argument/result attributes and
propagate them to llvm.call.
This will allow implementing the TODO about ABI-relevant argument
attributes in indirect calls.
Use LLVM's getMainExecutable() helper instead of rolling our own. This
will result in standard behavior across platforms, such as making sure
that symlinks are always resolved.
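For reference, the usual calling pattern looks like this (a sketch; the surrounding function is illustrative):

```cpp
#include "llvm/Support/FileSystem.h"

#include <cstdint>
#include <string>

// getMainExecutable needs the address of some symbol defined in the
// main executable to locate it; the enclosing function works.
static std::string getExecutablePath(const char *Argv0) {
  void *MainAddr = (void *)(intptr_t)&getExecutablePath;
  return llvm::sys::fs::getMainExecutable(Argv0, MainAddr);
}
```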
- This reverts commit
0c6c4a9993.
- Add '-mcode-object-version=5' to explicitly use code object version
5, matching the 'FAIL' diagnostic.
- Add a REQUIRES directive so the lit test runs on platforms registered
with x86_64 and amdgpu.
They have the same semantics. NonUniqueID is friendlier for the
isUnique implementation in MCSectionELF.
History: 97837b7 added support for unique IDs in sections and added
GenericSectionID. Later, 1dc16c7 added NonUniqueID.
This has been deprecated since a479be0f39a3301e9ca634d37cf6454b6d3865c6
from September 2023, before LLVM 18. Surely enough release cycles have
now passed that it can be removed upstream.
This pull request is the second part of an ongoing effort to extend PGO
instrumentation to GPU device code and depends on #76587. This PR makes
the following changes:
- Introduces `__llvm_write_custom_profile` to PGO compiler-rt library.
This is an external function that can be used to write profiles with
custom data to target-specific files.
- Adds `__llvm_write_custom_profile` as weak symbol to libomptarget so
that it can write the collected data to a profraw file.
- Adds a `PGODump` debug flag and only displays the dump when that
flag is set
This is another attempt at #88496 to keep mask operands in SSA after
instruction selection.
Previously we selected the mask operands into vmv0, a singleton register
class with exactly one register, V0.
But the register allocator doesn't really support singleton register
classes and we ran into errors like "ran out of registers during
register allocation in function".
This patch avoids the problem by introducing a pass just before
register allocation that converts any use of vmv0 to a copy to $v0,
i.e. what isel does today.
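In MachineFunction terms, the conversion amounts to something like the following (a simplified sketch with assumed helper parameters, not the actual pass):

```cpp
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"
#include "llvm/CodeGen/TargetInstrInfo.h"

using namespace llvm;

// For every use of a vmv0-class virtual register, insert a COPY into
// the physical register V0 right before the user and rewrite the
// operand to read V0 directly, i.e. what isel used to emit itself.
static void eliminateVMV0Uses(MachineFunction &MF, const TargetInstrInfo &TII,
                              const TargetRegisterClass &VMV0RC,
                              MCRegister V0) {
  MachineRegisterInfo &MRI = MF.getRegInfo();
  for (MachineBasicBlock &MBB : MF)
    for (MachineInstr &MI : MBB)
      for (MachineOperand &MO : MI.uses()) {
        if (!MO.isReg() || !MO.getReg().isVirtual())
          continue;
        if (MRI.getRegClass(MO.getReg()) != &VMV0RC)
          continue;
        BuildMI(MBB, MI, MI.getDebugLoc(), TII.get(TargetOpcode::COPY), V0)
            .addReg(MO.getReg());
        MO.setReg(V0);
      }
}
```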
That way the register allocator doesn't need to deal with the singleton
register class, but we get the benefits of having the mask registers in
SSA throughout the backend:
- This allows RISCVVLOptimizer to reduce the VLs of instructions that
define mask registers
- It enables CSE and code sinking in more places
- It removes the need to peek through mask copies in RISCVISelDAGToDAG
and keep track of V0 defs in RISCVVectorPeephole
This patch initially eliminates uses of vmv0s after RISCVVectorPeephole
to keep the diff to a minimum, and a follow-up patch will move it past
the other MachineInstr SSA passes.
Note that it doesn't try to remove any defs of vmv0, since no
instructions should produce vmv0 outputs.
As a further follow up, we can move the elimination pass to after phi
elimination and outside of SSA, which would unblock the pre-RA scheduler
around masked pseudos. This might also help the issue that
RISCVVectorMaskDAGMutation tries to solve.
This patch adds a function, handlePairwiseShadowOrIntrinsic, that ORs
pairs of adjacent shadow values; this is suitable for propagating shadow
for 1- or 2-vector intrinsics that combine adjacent fields. It then
applies handlePairwiseShadowOrIntrinsic to the Arm NEON pairwise adds:
llvm.aarch64.neon.{addhn, raddhn} (currently incorrectly handled) and
llvm.aarch64.neon.{saddlp, uaddlp} (currently suboptimally handled).
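The shadow computation itself boils down to splitting the operand's shadow into even and odd lanes and OR-ing them; a simplified sketch of that idea (the helper name and signature are illustrative, not the patch's actual code):

```cpp
#include "llvm/IR/IRBuilder.h"

#include <vector>

using namespace llvm;

// Given the shadow of a <2N x iK> operand, compute the shadow of a
// result whose element i combines operand elements 2i and 2i+1:
// shadow(result[i]) = shadow(op[2i]) | shadow(op[2i+1]).
Value *pairwiseShadowOr(IRBuilder<> &B, Value *Shadow, unsigned NumPairs) {
  std::vector<int> EvenMask, OddMask;
  for (unsigned i = 0; i != NumPairs; ++i) {
    EvenMask.push_back(2 * i);
    OddMask.push_back(2 * i + 1);
  }
  Value *Even = B.CreateShuffleVector(Shadow, EvenMask);
  Value *Odd = B.CreateShuffleVector(Shadow, OddMask);
  return B.CreateOr(Even, Odd);
}
```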
Updates the tests from https://github.com/llvm/llvm-project/pull/125820.
Follow up of #125168.
This patch adds endian-related macros to `endian.h`. We use compiler
built-ins for the byte-swap functions, which are already available in
our minimum supported compiler versions.
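For illustration, such macros can be built directly on the GCC/Clang builtins that the minimum supported compilers guarantee (names here are illustrative, not necessarily those added to `endian.h`):

```cpp
#include <cstdint>

// Endianness detection via predefined compiler macros.
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
#define IS_LITTLE_ENDIAN 1
#else
#define IS_LITTLE_ENDIAN 0
#endif

// Byte swaps map directly onto compiler builtins; no fallback is
// needed once the minimum compiler version guarantees them.
inline uint16_t byteswap16(uint16_t v) { return __builtin_bswap16(v); }
inline uint32_t byteswap32(uint32_t v) { return __builtin_bswap32(v); }
inline uint64_t byteswap64(uint64_t v) { return __builtin_bswap64(v); }
```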
[mlir][vector] Fix typos in tests (nfc)
Fix typos in `{insert|extract}_scalar_from_vec_2d_f32_dynamic_idxs_compile_time_constant` - the intention was to use `f32` rather than `i32`.
When querying the lower and upper bounds of a loop to update the value
range of a loop iteration variable, the program point to depend on
should be the block corresponding to the iteration variable rather than
the loop operation.
Update the error message to be explicit that this is likely due to
memory corruption.
In addition, check if the chunk header is all zero, which could mean
corruption or an attempt to free a pointer after the memory has been
released to the kernel. This case results in a slightly different error
message that also indicates this could still be a double free.
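A rough sketch of the added check, with hypothetical types and reporting hooks (the real allocator's header layout and report machinery differ):

```cpp
#include <cstring>

// Hypothetical chunk-header type; the real allocator's layout differs.
struct ChunkHeader {
  unsigned char Raw[16];
};

// An all-zero header can mean corruption, or a free() after the memory
// containing the chunk was already released back to the kernel; it
// could also still be a double free, so the message mentions both.
inline void checkHeaderOnFree(const ChunkHeader &H,
                              void (*Report)(const char *)) {
  ChunkHeader Zero;
  std::memset(&Zero, 0, sizeof(Zero));
  if (std::memcmp(&H, &Zero, sizeof(H)) == 0)
    Report("invalid chunk state: header is zero, likely memory corruption, "
           "a double free, or a free after the memory was released");
}
```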
The lowering for GEP didn't properly support the case where the pointer
argument was being implicitly broadcast by a vector of indices. Fix
that.
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>