This is intended to solve a problem with lowering atomics in OpenMP and C++ common to AMDGPU and NVPTX.

In OpenCL and CUDA, it is undefined behavior for an atomic instruction to modify an object in thread-private memory. In OpenMP, it is defined. Correspondingly, the hardware does not handle this correctly. For AMDGPU, 32-bit atomics work and 64-bit atomics are silently dropped. We therefore need to codegen this by inserting a runtime address-space check, performing the private case without atomics, and falling back to issuing the real atomic otherwise.

Handle this by introducing metadata intended to be applied to atomicrmw, indicating that the instruction cannot access the forbidden address space. This metadata allows us to avoid the extra check and branch.
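As a hedged sketch of how such a marker could be attached (the metadata kind name below is made up for illustration; it is not the name this change introduces):

```c++
// Illustration only: a frontend that knows an atomicrmw cannot touch the
// forbidden address space could tag it, letting the expansion skip the
// runtime address-space check. The kind string is hypothetical.
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Metadata.h"

using namespace llvm;

static void markAtomicAsNotPrivate(AtomicRMWInst &RMW) {
  LLVMContext &Ctx = RMW.getContext();
  // An empty MDNode is enough to act as a boolean marker on the instruction.
  RMW.setMetadata("amdgpu.no.private.atomic", MDNode::get(Ctx, {}));
}
```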
[APFloat::getSmallest](915df1ae41/llvm/include/llvm/ADT/APFloat.h (L1060))
(and similarly `APFloat::getLargest`)
```c++
APFloat getSmallest(const fltSemantics &Sem, bool Negative = false);
```
return a positive number when the default value for the second argument is used.
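For example, a minimal standalone check of that behaviour (the `IEEEsingle` semantics are chosen purely for illustration, not taken from the quantization code):

```c++
// With the default Negative=false, both getSmallest and getLargest return
// positive values, so a [minScale, maxScale] range built from them is
// strictly positive.
#include "llvm/ADT/APFloat.h"
#include <cassert>

int main() {
  const llvm::fltSemantics &Sem = llvm::APFloat::IEEEsingle();
  double MinScale = llvm::APFloat::getSmallest(Sem).convertToFloat();
  double MaxScale = llvm::APFloat::getLargest(Sem).convertToFloat();
  assert(MinScale > 0.0 && MaxScale > 0.0); // both positive by default
  (void)MinScale;
  (void)MaxScale;
  return 0;
}
```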
Consequently, the check
[QuantTypes.cpp#L325](96f37ae453/mlir/lib/Dialect/Quant/IR/QuantTypes.cpp (L325))
```c++
if (scale <= 0.0 || std::isinf(scale) || std::isnan(scale))
return emitError() << "illegal scale: " << scale;
```
is already covered by the check which follows
[QuantTypes.cpp#L327](96f37ae453/mlir/lib/Dialect/Quant/IR/QuantTypes.cpp (L327))
```c++
if (scale < minScale || scale > maxScale)
return emitError() << "scale out of expressed type range [" << minScale
<< ", " << maxScale << "]";
```
given that the range `[positive-smallest-finite-number, positive-largest-finite-number]` does not include `inf` or `nan`s.
I propose to remove the redundant check. Any suggestions for improving the error message are welcome.
Updated the `HeaderFilterRegex` description to reference
`--header-filter` instead of the incorrect `--header-filter-regex` in
the clang-tidy documentation.
1. This commit adds LLDB_TEST_PLATFORM_URL, LLDB_TEST_SYSROOT, LLDB_TEST_PLATFORM_WORKING_DIR, and LLDB_SHELL_TESTS_DISABLE_REMOTE CMake flags to pass arguments for cross-compilation and remote running of both Shell and API tests.
2. To run Shell tests remotely, it adds 'platform select' and 'platform connect' commands to the %lldb substitution.
3. A 'remote-linux' feature is added to lit to disable tests that fail with remote execution.
4. A separate working directory is assigned to each test to avoid
conflicts during parallel test execution.
5. Remote Shell testing is run only when LLDB_TEST_SYSROOT is set for
building test sources. The recommended compiler for that is Clang.
---------
Co-authored-by: Vladimir Vereschaka <vvereschaka@accesssoftek.com>
[Retry 110696 with a proper rebase.]
Seed collection will assemble instructions to be vectorized into
SeedBundles. This data structure is not intended to be used directly,
but will be the basis for load bundles, store bundles, and so on.
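As a rough sketch of the idea (names and shape are illustrative, not the actual interface): a common container of candidate instructions, with load/store flavours derived from it.

```c++
// Illustration only: a bare-bones shape for a seed bundle, a shared
// container of candidate instructions that load/store bundles build on.
#include <utility>
#include <vector>

struct Instruction; // stand-in for the vectorizer's instruction type

class SeedBundle {
public:
  explicit SeedBundle(std::vector<Instruction *> Seeds)
      : Seeds(std::move(Seeds)) {}
  const std::vector<Instruction *> &seeds() const { return Seeds; }

protected:
  std::vector<Instruction *> Seeds;
};

// The specialized bundles reuse the common storage and add their own logic.
class LoadSeedBundle : public SeedBundle { using SeedBundle::SeedBundle; };
class StoreSeedBundle : public SeedBundle { using SeedBundle::SeedBundle; };
```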
Found by inspecting AMDGPU assembly, so the arithmetic ops created there were definitely making their way into the target ISA. An `LLVM::BitcastOp` seems equivalent, and evaporates as expected in the target asm.
Along the way, I thought that this helper function `mfmaConcatIfNeeded`
could be renamed to `convertMFMAVectorOperand` to better convey its
contract; so I don't need to think about whether a bitcast is a
legitimate "concat" :-)
---------
Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>
We have a dedicated test to check the target-features for FMV (clang/test/CodeGen/aarch64-fmv-dependencies.c), so I am removing the autogenerated checks from irrelevant tests, since the noise makes it harder to review actual codegen changes.
Fix an obvious typo in these tests to get them passing, and also fix the
-Wimplicit-fallthrough warning that fires when trying to build.
Reverting #110616 was tricky because of dependencies, so I'm just doing
the easy fix directly here.
This adds support for the new EH instructions (`try_table` and `throw_ref`) to the type checker.
One thing I'd like to improve is the locations in the errors for `catch_***` clauses. Currently they just point to the starting column of the `try_table` instruction itself. To figure out where catch clauses start, you need to traverse the `OperandVector` and check `WebAssemblyOperand::isCatchList` on each operand to see which one is the catch-list operand, but the `WebAssemblyOperand` class is in AsmParser and AsmTypeCheck does not have access to it:
cdfdc857cb/llvm/lib/Target/WebAssembly/AsmParser/WebAssemblyAsmParser.cpp (L43-L204)
And even if AsmTypeCheck had access to it, it currently treats the list of catch clauses as a single `WebAssemblyOperand`, so there is no way to get the starting location of each `catch_***` clause in the current structure.
This also renames `valTypeToStackType` to `valTypesToStackTypes`, given
that it takes two type lists.
Recently, Solaris bootstrap got broken because Solaris uses a
non-standard mangling of `std::tm` and a few others. This was fixed with
a hack in PR #100724. The Solaris ABI requires mangling `std::tm` as
`tm` and similarly for `std::div_t`, `std::ldiv_t`, and `std::lconv`,
which is what this patch implements. The hack needs to stay in place to
allow building with older versions of `clang`.
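For illustration (not lifted from the patch), here is the kind of difference involved, assuming the platform headers declare `tm` directly inside namespace `std`:

```c++
// Illustration only: the first mangled string is the generic Itanium form
// for a type declared as a member of namespace std; the Solaris ABI instead
// requires std::tm to mangle as the global `tm`, which is what this patch
// teaches the mangler to do.
#include <ctime>

void use(std::tm *);
// Mangling if std::tm is treated as a member of namespace std: _Z3usePSt2tm
// Mangling required by the Solaris ABI (std::tm mangled as `tm`): _Z3useP2tm
```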
Tested on `amd64-pc-solaris2.11`, `sparcv9-sun-solaris2.11` (2-stage
builds with both `clang-19` and `gcc-14` as build compiler), and
`x86_64-pc-linux-gnu`.
This is a part of #109477 that I'm making into its own patch. Here we remove logic from the DYLD that prevents it from running if the main executable already has a load address. Instead we let the DYLD fully determine what should be loaded and what shouldn't.
Reverts llvm/llvm-project#110800
`llvm\test\CodeGen\DirectX\radians.ll` is failing after this change.
@adam-yang please send a new PR with the issue resolved once you've had
time to investigate.
We've noticed that for large builds, executing the thin-link can take on the order of tens of minutes. We are only using a single thread to write the sharded indices and import files for each input bitcode file. While we need to ensure the index file produced lists modules in a deterministic order, that doesn't prevent us from executing the rest of the work in parallel.
In this change we use a thread pool to execute as much of the backend's
work as possible in parallel. In local testing on a machine with 80
cores, this change makes a thin-link for ~100,000 input files run in ~2
minutes. Without this change it takes upwards of 10 minutes.
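As a plain-C++ sketch of the pattern (not the actual LLVM code; the helper names are made up): the order-sensitive module list is still written serially, while the per-module work fans out to worker threads.

```c++
// Illustration only: keep the deterministic index writing serial, and run
// the per-module shard/import-file writes in parallel.
#include <future>
#include <iostream>
#include <string>
#include <vector>

// Stand-ins for the real per-module work.
static void recordModuleInIndex(const std::string &M) {
  std::cout << "index entry: " << M << "\n"; // deterministic, in input order
}
static void writeShardAndImports(const std::string &M) {
  // ... write the sharded summary and import list for M ...
}

void runThinLink(const std::vector<std::string> &Modules) {
  // Order-sensitive output first, on a single thread.
  for (const std::string &M : Modules)
    recordModuleInIndex(M);

  // Order-insensitive output in parallel.
  std::vector<std::future<void>> Jobs;
  for (const std::string &M : Modules)
    Jobs.push_back(std::async(std::launch::async, writeShardAndImports, M));
  for (auto &J : Jobs)
    J.get();
}
```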
---------
Co-authored-by: Nuri Amari <nuriamari@fb.com>
Previously this would fail if the default target enabled the loop
terminator folding pass (currently just RISC-V), as it runs after loop
strength reduction.
This seems to be an artifact from the initial import in 2012, but even if not, folks are better off reading the LICENSE.TXT file for the full details if they need them.

Fixes #109968
Hack around the register scavenger doing the wrong thing. It does not find the result register as available in the case where the frame index add isn't also reading the dest register. This is the quick fix for a regression where the scavenger would create a broken spill of an SGPR to memory. I believe this is still broken for cases where we cannot use the result register.

I'm confused about what position the scavenger iterator is supposed to be in, and what RestoreAfter is for. The scavenger is missing a full set of forward/backward APIs, and there seems to be an off-by-one somewhere.
According to https://developer.arm.com/documentation/102105/latest (Arm Architecture Reference Manual for A-profile architecture: Known issues), issue 2.206 D22789:
In section C5.2.25 "SSBS, Speculative Store Bypass Safe", under the
heading 'Configurations', the text that reads:
"This register is present only when FEAT_SSBS is implemented. Otherwise,
direct accesses to SSBS are UNDEFINED."
is changed to read:
"This register is present only when FEAT_SSBS2 is implemented.
Otherwise, direct accesses to SSBS are UNDEFINED."
This suggests that it's not worth splitting FEAT_SSBS2 from FEAT_SSBS in
the compiler, since FEAT_SSBS cannot be used for predicating the MRS/MSR
instructions. Those can access PSTATE.SSBS only when FEAT_SSBS2 is
available. Moreover, there are no hardware implementations which
implement FEAT_SSBS without FEAT_SSBS2, therefore unifying these
features in the specification should not be a regression for feature
detection.
Approved in ACLE as https://github.com/ARM-software/acle/pull/350
Here I'm splitting up the existing "if" statement into two. Mixing
hasDefinition() and insert() in one "if" condition would be extremely
confusing as hasDefinition() doesn't change anything while insert()
does.
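A hedged, self-contained illustration of the split (all names are hypothetical, not the actual code):

```c++
// Illustration only: the side-effect-free query and the mutating insert()
// no longer share one condition.
#include <set>

struct Decl {
  bool Defined = false;
  bool hasDefinition() const { return Defined; } // pure query, no mutation
};

static std::set<const Decl *> Seen;

void before(const Decl *D) {
  // Confusing: one `if` mixes a query with a call that mutates `Seen`.
  if (!D->hasDefinition() && Seen.insert(D).second) {
    // handle a not-yet-seen declaration without a definition
  }
}

void after(const Decl *D) {
  // Clearer: the query stands alone; the mutation is its own `if`.
  if (!D->hasDefinition()) {
    if (Seen.insert(D).second) {
      // handle a not-yet-seen declaration without a definition
    }
  }
}
```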
This caused assertion failures:
```
llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp:7736:
SDValue getMemsetValue(SDValue, EVT, SelectionDAG &, const SDLoc &):
Assertion `C->getAPIntValue().getBitWidth() == 8' failed.
```
See comment on the PR for a reproducer.
> repstosb and repstosd are the same size, but stosd is only done for 0
> because the process of multiplying the constant so that it is copied
> across the bytes of the 32-bit number adds extra instructions that cause
> the size to increase. For 0, we do not need to do that at all.
>
> For memcpy, the same goes, and as a result the minsize check was moved
> ahead because a jmp to memcpy encoded takes more bytes than repmovsb.
This reverts commit 6de5305b3d7a4a19a29b35d481a8090e2a6d3a7e.
This is an initial patch to enable constexpr support on the more basic SSE1 intrinsics, such as initialization, arithmetic, logic and fixed shuffles.

The plan is to incrementally extend this for SSE2/AVX etc., initially for the equivalent basic intrinsics, but we can add support for some of the ia32 builtins as well, as the need arises.
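For example, with a clang that includes this change, usage along these lines should now be accepted (a sketch, not taken from the tests):

```c++
// Illustrative usage: basic SSE1 initialization and arithmetic evaluated at
// compile time. Assumes a clang with this change and SSE enabled.
#include <xmmintrin.h>

constexpr __m128 V = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f); // lanes: 1,2,3,4
constexpr __m128 W = _mm_add_ps(V, V);                    // lanes: 2,4,6,8
static_assert(W[0] == 2.0f && W[3] == 8.0f, "constexpr SSE1 arithmetic");
```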
Reverts llvm/llvm-project#108939
When `AVX` is available but `-mprefer-vector-width=128` is set, some of the `mov` instructions turn into the x86 `rep;movsb` instruction, leading to poor performance on "old" architectures (sandybridge, haswell). The possible solutions are: get rid of the `-mprefer-vector-width` option, or use smaller static copy sizes in `inline_memcpy_x86_sse2_ge64_sw_prefetching`. Right now a copy size of 3 cache lines (192B) relying exclusively on xmm registers gets turned into `rep;movsb`.