From GFX10 onwards it is possible to employ benevolent scheduling of
waves. This patch unconditionally enables, for the `amdhsa` OS, the bit
which controls that capability, as it is beneficial for algorithms that
rely on more complex concurrent coordination and it is generally
performance neutral otherwise.
The PTX target allows an alignment to be specified on both return values
and parameters to allow for more efficient vectorized stores. Currently
we represent these parameter alignments via the "alignstack" attribute,
but must fall back to metadata for the return value. This PR allows
"alignstack" on return values as well.
This patch adds a section pointing out how permissions should be done
within Github workflows. I believe all of our workflows are currently
compliant with this, but it helps to have something to point to
documenting the practice and especially the motivation.
This is meant as a preparation for PR #130988 "[AMDGPU] Implement IR
expansion for frem instruction" which implements the expansion of
another instruction in this pass. The more general name seems more
appropriate given this change and quite reasonable even without it.
…elative
The semantics of `llvm.type.checked.load.relative` seem to be a little
different from that of `llvm.load.relative`. It looks like the semantics
for `llvm.type.checked.load.relative` is `ptr + offset + *(ptr +
offset)` whereas the semantics for `llvm.load.relative` is `ptr + *(ptr
+ offset)`. That is, the offset for the former is added to the offset
address whereas the later has the offset added to the original pointer.
It really feels like the checked intrinsic was meant to match the
semantics of the non-checked intrinsic, but I think for all cases the
checked intrinsic is used (swift being the only use I know of), the
calculation just happens to be the same because swift always uses an
offset of zero. Likewise, all llvm tests for this intrinsic happen to
use an offset of zero.
Relative vtables in clang happens to be the first time where we're using
this intrinsic and using it with non-zero values. This updates the
semantics of the checked intrinsic to match the non-checked one.
Effectively this shouldn't change any codegen by any users of this since
all current users seem to use a zero offset.
This PR also updates some tests with non-zero offsets.
The Xqcili extension includes a two instructions that load large
immediates than is available with the base RISC-V ISA.
The current spec can be found at:
https://github.com/quic/riscv-unified-db/releases/tag/Xqci-0.7.0
This patch adds assembler only support.
The grammar is `!match(str, regex)` and this operator produces 1
if the `str` matches the regular expression `regex`.
The format of `regex` is ERE (Extended POSIX Regular Expressions).
* Add multi dimensional array support
* Make maximum vector size tunable
* Make ratio of VGPRs used for vector promotion tunable
* Maximum array size now based on VGPR count (32b) instead of element count
The document has had a few minor tweaks over the years, but the last
major piece of work on it was 2016, after first being introduced in
2013. My aim is to provide a clear and clean recipe for cross-compiling
LLVM that:
* Should be achievable for anyone on common variants of Linux
(_including_ the step of acquiring a working sysroot).
* I think I've kept the coverage of setting up acquiring a Debian
sysroot minimal enough that it can reasonably be included. `debootstrap`
is packaged for most common Linux distributions including non-Debian
derived distributions like Arch Linux and Fedora.
* Describes a setup that we can reasonably support within the community.
* I realise with the ninja symlink canonicalisation issue I haven't
completely avoided hacks, but I look particularly to point 2 under hacks
in the current docs which talks about libraries on the host being found
by CMake and adding `-L` and `-I` to try to hack around this. We've all
been there and made these kind of temporary workarounds to see if we can
get further, but it's very hard to support someone who has problems with
a setup that's improperly leaking between the host and target like this.
The approach I describe with a clean sysroot and setting appropriate
`CMAKE_FIND_ROOT_PATH_MODE_*` settings doesn't have this issue.
* Cuts down on extraneous / outdated information, especially where it is
better covered elsewhere (e.g. detailed descriptions of CMake options
not directly relevant to cross compilation).
I've run through the instructions for AArch64, RISC-V (64-bit), and
armhf.
The Xqci 0.7.0 spec just came out, with some updates to Xqciint,
bringing it to v0.4. The main update of any relevance is that
`qc.c.mienter` and `qc.c.mienter.nest` now update both the stack pointer
and the frame pointer (before, they only updated the stack pointer).
They both remain compatible with the frame pointer convention.
This change bumps the Xqciint version, and ensures that we don't emit
the unneeded frame pointer adjustment instruction after
`qc.c.mienter(.nest)`.
The previous implementation wasn't maintaining a faithful IR
representation of how this really works. The value returned by
createEnqueuedBlockKernel wasn't actually used as a function, and
hacked up later to be a pointer to the runtime handle global
variable. In reality, the enqueued block is a struct where the first
field is a pointer to the kernel descriptor, not the kernel itself. We
were also relying on passing around a reference to a global using a
string attribute containing its name. It's better to base this on a
proper IR symbol reference during final emission.
This now avoids using a function attribute on kernels and avoids using
the additional "runtime-handle" attribute to populate the final
metadata. Instead, associate the runtime handle reference to the
kernel with the !associated global metadata. We can then get a final,
correctly mangled name at the end.
I couldn't figure out how to get rename-with-external-symbol behavior
using a combination of comdats and aliases, so leaves an IR pass to
externalize the runtime handles for codegen. If anything breaks, it's
most likely this, so leave avoiding this for a later step. Use a
special section name to enable this behavior. This also means it's
possible to declare enqueuable kernels in source without going through
the dedicated block syntax or other dedicated compiler support.
We could move towards initializing the runtime handle in the
compiler/linker. I have a working patch where the linker sets up the
first field of the handle, avoiding the need to export the block
kernel symbol for the runtime. We would need new relocations to get
the private and group sizes, but that would avoid the runtime's
special case handling that requires the device_enqueue_symbol metadata
field.
https://reviews.llvm.org/D141700
My previous patch to add DISubrangeType (#126772) had a couple of minor
errors. This patch corrects them.
1. When using a DISubrangeType as an array index type, the wrong tag was
written into the DIE.
2. I'd intended for subranges to use bit strides, not byte strides --
but neglected to actually implement this. Ada needs bit strides.
This patch adds a new test that checks both these things.
Finally, this patch adds some documentation for DISubrangeType.
This change means that llvm-strip no longer exits immediately upon
encountering an error when modifying a file and will instead continue
modifying the other inputs. Fixes#129412
Spec can be found here https://github.com/intel/llvm/pull/15225
TODO for future patches:
- During spec review need to decide whether only FunctionCall or Atomic
instructions can be decorated and if not - move the code around adding
handling for other instructions;
- Handle optional string metadata;
- Handle LLVM atomic instructions;
- Handle SPIR-V friendly atomic calls returning via sret argument.
Signed-off-by: Sidorov, Dmitry <dmitry.sidorov@intel.com>
From what I’ve observed, some contributors are still unaware that in LLVM, the
PR summary - not the commit messages - is used as the final commit message when
merging. This is especially unclear to those without commit access, as only
users with commit access can edit the commit message before merging.
This PR clarifies that policy and consolidates all relevant information into
`GitHub.rst`, ensuring it is no longer split between `GitHub.rst` and
`Contributing.rst`.
Note, a big part of this change is merely moving text between the docs.
LLVM currently expects `__float128` to be both passed and returned in
xmm registers on Windows. However, this disagrees with the Windows
x86-64 calling convention [1], which indicates values larger than 64
bits should be passed indirectly.
Update LLVM's default Windows calling convention to pass `fp128`
directly. Returning in xmm0 is unchanged since this seems like a
reasonable extrapolation of the ABI. With this patch, the calling
convention for `i128` and `f128` is the same.
GCC passes `__float128` indirectly, which this also matches. However, it
also returns indirectly, which is not done here. I intend to attempt a
GCC change to also return in `xmm0` rather than making that change here,
given the consistency with `i128`.
This corresponds to the frontend change in [2], see more details there.
[1]:
https://learn.microsoft.com/en-us/cpp/build/x64-calling-convention?view=msvc-170
[2]: https://github.com/llvm/llvm-project/pull/115052
This extension adds thirty eight bit manipulation instructions.
The current spec can be found at:
https://github.com/quic/riscv-unified-db/releases/tag/Xqci-0.6
This patch adds assembler only support.
Co-authored-by: Sudharsan Veeravalli <quic_svs@quicinc.com>
This performs the minimal replacment of amdgpu-no-agpr to
amdgpu-agpr-alloc=0. Most of the test diffs are due to the new
attribute sorting later alphabetically.
We could do better by trying to perform range merging in the attributor,
and trying to pick non-0 values.
This provides a range to decide how to subdivide the vector register
budget on gfx90a+. A single value declares the minimum AGPRs that
should be allocatable. Eventually this should replace amdgpu-no-agpr.
I want this primarily for testing agpr allocation behavior. We should
have a heuristic try to detect a reasonable number of AGPRs to keep
allocatable.
This adds support for Xqccmp to the following passes:
- Prolog Epilog Insertion - reusing much of the existing push/pop logic,
but extending it to cope with frame pointers and reorder the CFI
information correctly.
- Move Merger - extending it to support the `qc.` variants of the
double-move instructions.
- Push/Pop Optimizer - extending it to support the `qc.` variants of the
pop instructions.
The testing is based on existing Zcmp tests, but I have put them in
separate files as some of the Zcmp tests were getting quite long.
This is something that Tom and I have discussed briefly for a while now,
a doc that lists out all of the best practices we want to adhere to
surrounding CI things along with their associated motivations/any other
relevant info. This patch adds that doc along with three best practices
surrounding Github Workflows that we try and adhere to (although more
work needs to be done to get 100% adherance).
[NVPTX] Add Intrinsics for discard.*
This PR adds intrinsics for all variations of discard.*
* These intrinsics supports generic or global for all variations.
* The lowering is handled from nvvm to nvptx tablegen directly.
* Lit tests are added as part of discard.ll
* The generated PTX is verified with a 12.3 ptxas executable.
* Added docs for these intrinsics in NVPTXUsage.rst.
For more information, refer to the PTX ISA
<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-discard>_.
---------
Co-authored-by: abmajumder <abmajumder@nvidia.com>
Co-authored-by: gonzalobg <65027571+gonzalobg@users.noreply.github.com>
Langref for `roundeven` implies that the C standard function `roundeven`
may raise floating point exceptions. However, this is not correct; C23
does not mention exceptions for `roundeven`, and per [1] `FE_INEXACT` is
never raised.
Clarify that LLVM's `roundeven` behaves the same.
[1]: https://en.cppreference.com/w/c/numeric/math/roundeven
\[NVPTX\] Add ApplyPriority intrinsics
This PR adds applypriority.\* intrinsics with relevant eviction
priorities.
* The lowering is handled from nvvm to nvptx tablegen directly.
* Lit tests are added as part of applypriority.ll
* The generated PTX is verified with a 12.3 ptxas executable.
* Added docs for these intrinsics in NVPTXUsage.rst.
For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-applypriority>`_.
---------
Co-authored-by: abmajumder <abmajumder@nvidia.com>
This commit adds support for tcgen05.{ld, st} instructions with lit
tests under tcgen05-ld.ll and tcgen05-st.ll and intrinsics documentation
under NVPTXUsage.rst
Xqccmp is a new spec by Qualcomm that makes a vendor-specific effort to
solve the push/pop + frame pointers issue. Broadly, it takes the Zcmp
instructions and reverse the order they push/pop registers in, which
ends up matching the frame pointer convention.
This extension adds a new instruction not present in Zcmp,
`qc.cm.pushfp`, which will set `fp` to the incoming `sp` value after it
has pushed the registers.
This change duplicates the Zcmp implementation, with minor changes to
mnemonics (for the `qc.` prefix), predicates, and the addition of
`qc.cm.pushfp`. There is also new logic to prevent combining Xqccmp and
Zcmp. Xqccmp is kept separate to Xqci for decoding/encoding etc, as the
specs are separate today.
Specification:
https://github.com/quic/riscv-unified-db/releases/tag/Xqccmp_extension-0.1.0
The documents claims that it ignores sNaN, while in the current code it
may be different.
- as the finally callback, it use libc call fmin(3)/fmax(3). while C23
clarifies that fmin(3)/fmax(3) should return NaN for sNaN vs NUM.
- on some architectures, such as aarch64, it converts to `fmaxnm`, which
returns qNaN for sNaN vs NUM.
- on RISC-V (SPEC 2019+), it converts to `fmax`, which returns NUM for
sNaN vs NUM.
Since we have introduced llvm.minimumnum and llvm.maximumnum, which
follow IEEE 754-2019's minimumNumber/maximumNumber.
So, it's time for us to clarify llvm.minnum and llvm.maxnum. Since the
final fallback of llvm.minnum and llvm.maxnum is
fmin(3)/fmax(3), so that it is reasonable to follow the behaviors of
fmin(3)/fmax(3).
Although C23 clarified the behavior about sNaN and +0.0/-0.0:
(NUM or NaN) vs sNaN -> qNaN
+0.0 vs -0.0 -> either one of +0.0/-0.0
It is the same the IEEE754-2008's maxNUM and minNUM.
Not all implementation work as expected.
Since some architectures such as aarch64/MIPSr6/LoongArch, have
instructions that implements +0.0>-0.0.
So Let's define llvm.minnum and llvm.maxnum to IEEE754-2008 with
+0.0>-0.0.
The architectures without such instructions can implements `NSZ` flavor
to speed up,
and the frontend, such as clang, can call them with `nsz` attribute.
This implements assembler support for the XRivosVisni custom/vendor
extension from Rivos Inc. which is defined in:
https://github.com/rivosinc/rivos-custom-extensions (See
src/xrivosvisni.adoc)
Codegen support will follow in separate changes.
Replace some more nvvm.annotations with function attributes,
auto-upgrading the annotations as needed. These new attributes will be
more idiomatic and compile-time efficient than the annotations.
- !"maxntid[xyz]" -> "nvvm.maxntid"
- !"reqntid[xyz]" -> "nvvm.reqntid"
- !"cluster_dim_[xyz]" -> "nvvm.cluster_dim"
There had been concern raised about possible confusion with "rvv". After
internal discussion, we decided to go with an alternate prefix to reduce
possible confusion going forward. The specification document
(https://github.com/rivosinc/rivos-custom-extensions) has been updated.
And also add the XRivosVizip extension to the documentation. I'd missed
that in the initial commit.
This meeting never quite took off the way I had hoped, and I haven't had
time for it in quite a while, so I am removing it from the Getting
Involved page.