[NVPTX] Add ApplyPriority intrinsics
This PR adds applypriority.* intrinsics with the relevant eviction
priorities.
* The lowering from NVVM to NVPTX is handled directly in tablegen.
* Lit tests are added as part of applypriority.ll
* The generated PTX is verified with a 12.3 ptxas executable.
* Added docs for these intrinsics in NVPTXUsage.rst.
For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-applypriority>`_.
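For reference, a rough IR-level shape of the new intrinsics (the exact
names and signatures are the ones exercised in applypriority.ll; treat
the spellings below as illustrative):
```
; Apply the evict_normal priority to %size bytes starting at the given
; pointer, for the generic and global address spaces respectively.
declare void @llvm.nvvm.applypriority.L2.evict.normal(ptr, i64)
declare void @llvm.nvvm.applypriority.global.L2.evict.normal(ptr addrspace(1), i64)
```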
---------
Co-authored-by: abmajumder <abmajumder@nvidia.com>
This commit adds support for the tcgen05.{ld, st} instructions, with
lit tests under tcgen05-ld.ll and tcgen05-st.ll and intrinsics
documentation in NVPTXUsage.rst.
Xqccmp is a new spec by Qualcomm that makes a vendor-specific effort to
solve the push/pop + frame pointers issue. Broadly, it takes the Zcmp
instructions and reverses the order they push/pop registers in, which
ends up matching the frame pointer convention.
This extension adds a new instruction not present in Zcmp,
`qc.cm.pushfp`, which will set `fp` to the incoming `sp` value after it
has pushed the registers.
This change duplicates the Zcmp implementation, with minor changes to
mnemonics (for the `qc.` prefix), predicates, and the addition of
`qc.cm.pushfp`. There is also new logic to prevent combining Xqccmp and
Zcmp. Xqccmp is kept separate from Xqci for decoding/encoding etc., as the
specs are separate today.
Specification:
https://github.com/quic/riscv-unified-db/releases/tag/Xqccmp_extension-0.1.0
The documentation claims that llvm.minnum and llvm.maxnum ignore sNaN,
while the current code may behave differently:
- As the final fallback, the lowering uses the libc calls
  fmin(3)/fmax(3), and C23 clarifies that fmin(3)/fmax(3) return NaN for
  sNaN vs NUM.
- On some architectures, such as AArch64, it converts to `fmaxnm`, which
  returns qNaN for sNaN vs NUM.
- On RISC-V (2019 spec and later), it converts to `fmax`, which returns
  NUM for sNaN vs NUM.
Now that we have introduced llvm.minimumnum and llvm.maximumnum, which
follow IEEE 754-2019's minimumNumber/maximumNumber, it is time to
clarify llvm.minnum and llvm.maxnum. Since their final fallback is
fmin(3)/fmax(3), it is reasonable to follow the behavior of
fmin(3)/fmax(3).
C23 clarifies the behavior for sNaN and +0.0/-0.0:
(NUM or NaN) vs sNaN -> qNaN
+0.0 vs -0.0 -> either one of +0.0/-0.0
This matches IEEE 754-2008's maxNum and minNum, although not every
implementation works as expected.
Since some architectures, such as AArch64, MIPSr6, and LoongArch, have
instructions that implement +0.0 > -0.0, let's define llvm.minnum and
llvm.maxnum as IEEE 754-2008 with +0.0 > -0.0.
Architectures without such instructions can implement an `NSZ` flavor to
speed things up, and frontends such as Clang can call the intrinsics
with the `nsz` fast-math flag.
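A small IR illustration of the clarified semantics and the `nsz` escape
hatch (the clamp example itself is mine, not taken from the patch):
```
declare float @llvm.maxnum.f32(float, float)
declare float @llvm.minnum.f32(float, float)

define float @clamp01(float %x) {
  ; Under the clarified semantics, +0.0 > -0.0 and an sNaN operand yields
  ; a qNaN result, matching fmin(3)/fmax(3).
  %lo = call float @llvm.maxnum.f32(float %x, float 0.0)
  ; Callers that do not care about signed-zero ordering can relax the
  ; +0.0 > -0.0 requirement with the 'nsz' fast-math flag.
  %hi = call nsz float @llvm.minnum.f32(float %lo, float 1.0)
  ret float %hi
}
```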
This implements assembler support for the XRivosVisni custom/vendor
extension from Rivos Inc., which is defined in:
https://github.com/rivosinc/rivos-custom-extensions (See
src/xrivosvisni.adoc)
Codegen support will follow in separate changes.
Replace some more nvvm.annotations with function attributes,
auto-upgrading the annotations as needed. These new attributes will be
more idiomatic and compile-time efficient than the annotations.
- !"maxntid[xyz]" -> "nvvm.maxntid"
- !"reqntid[xyz]" -> "nvvm.reqntid"
- !"cluster_dim_[xyz]" -> "nvvm.cluster_dim"
There had been concern raised about possible confusion with "rvv". After
internal discussion, we decided to go with an alternate prefix to reduce
possible confusion going forward. The specification document
(https://github.com/rivosinc/rivos-custom-extensions) has been updated.
This also adds the XRivosVizip extension to the documentation; I'd
missed that in the initial commit.
This meeting never quite took off the way I had hoped, and I haven't had
time for it in quite a while, so I am removing it from the Getting
Involved page.
- Rename anonymous namespace section and rework it to
cover visibility more broadly.
- Add language suggesting restricting visibility as much as
possible, using various C++ facilities.
---------
Co-authored-by: Aaron Ballman <aaron@aaronballman.com>
This patch adds intrinsics for the tcgen05.cp and
tcgen05.shift instructions.
Lit tests are added and verified with a
ptxas-12.8 executable.
Docs are updated in the NVPTXUsage.rst file.
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
gfx940 and gfx941 are no longer supported. This is the last one of a
series of PRs to remove them from the code base.
The ISA documentation still contains a lot of links and file names with
the "gfx940" identifier. Changing them to "gfx942" is probably not worth
the cost of breaking all URLs to these pages that users might have saved
in the past.
For SWDEV-512631
gfx940 and gfx941 are no longer supported. This is one of a series of
PRs to remove them from the code base.
This PR removes all documentation occurrences of gfx940/gfx941 except
for the gfx940 ISA description, which will be the subject of a separate
PR.
For SWDEV-512631
gfx940 and gfx941 are no longer supported. This is one of a series of
PRs to remove them from the code base.
This PR removes all non-documentation occurrences of gfx940/gfx941 from
the llvm directory, and the remaining occurrences in clang.
Documentation changes will follow.
For SWDEV-512631
Attempting to pass a `ptr addrspace(7)` to functions that take `ptr`
arguments produces undesirable `addrspacecast(addrspacecast(p8 x to p7)
to p0) => addrspacecast(p8 x to p0)` folds. This results in illegal GEP
operations on buffer resources, which can't be GEP'd. (However, note
that, while unimplemented, an addrspacecast from ptr addrspace(7) to ptr
is legal - it's just an effective address computation.)
To resolve this problem, and thus prevent illegal
`getelementptr T, ptr addrspace(8) %x, ...` instructions from being
produced, this commit extends amdgcn.make.buffer.rsrc so that it is also
overloaded on its result type, auto-upgrading old manglings.
The logic for handling a make.buffer.rsrc in instruction selection
remains untouched and expects the output type to be a ptr addrspace(8),
as does the Clang lowering for its builtin (the pointer-to-pointer
version might want a different name in Clang). LowerBufferFatPointers
has been updated to lower
amdgcn.make.buffer.rsrc.p7.p* to amdgcn.make.buffer.rsrc.p8.p*.
This'll also make exposing buffer fat pointers in Clang easier, since
you don't have to cast between a `__amdgcn_rsrc_t` and a pointer.
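Roughly, in IR (operand order and types follow the existing
make.buffer.rsrc signature; treat the details as illustrative):
```
; Resource (addrspace 8) result, as before.
declare ptr addrspace(8) @llvm.amdgcn.make.buffer.rsrc.p8.p1(ptr addrspace(1), i16, i32, i32)
; New fat-pointer (addrspace 7) result, so callers can GEP and load directly.
declare ptr addrspace(7) @llvm.amdgcn.make.buffer.rsrc.p7.p1(ptr addrspace(1), i16, i32, i32)

define float @load_elt(ptr addrspace(1) %base, i32 %numrecords) {
  %fat = call ptr addrspace(7) @llvm.amdgcn.make.buffer.rsrc.p7.p1(
      ptr addrspace(1) %base, i16 0, i32 %numrecords, i32 0)
  ; GEPs stay on the fat pointer; LowerBufferFatPointers later rewrites
  ; this in terms of the addrspace(8) resource.
  %elt = getelementptr float, ptr addrspace(7) %fat, i32 4
  %v = load float, ptr addrspace(7) %elt
  ret float %v
}
```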
The last use was removed in:
commit fa6ea7a419f37befbed04368bcb8af4c718facbb
Author: Arthur Eubanks <aeubanks@google.com>
Date: Mon Mar 20 11:18:35 2023 -0700
Update LangRef and code using `Dereferenceable` in assume bundles to
only use the information if it is safe at the point of use.
`Dereferenceable` in an assume bundle is only guaranteed at the point of
the assumption, but may not be guaranteed at later points, because the
pointer may have been freed.
Update code using `Dereferenceable` to only use it if the pointer cannot
be freed. This can further be refined to check if the pointer could be
freed between assume and use.
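For example (a sketch using the standard assume operand bundle syntax):
```
declare void @llvm.assume(i1)

define i32 @f(ptr %p, i1 %c) {
  ; %p is known dereferenceable for 4 bytes at this point...
  call void @llvm.assume(i1 true) ["dereferenceable"(ptr %p, i64 4)]
  br i1 %c, label %later, label %exit

later:
  ; ...but if %p may have been freed on the way here, the assumption must
  ; not be used to speculate or hoist this load.
  %v = load i32, ptr %p
  ret i32 %v

exit:
  ret i32 0
}
```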
This follows up on https://github.com/llvm/llvm-project/pull/123196.
With that change, it should be safe to expose dereferenceable
assumptions more widely as in
https://github.com/llvm/llvm-project/pull/121789
PR: https://github.com/llvm/llvm-project/pull/126117
By far the most important part of this patch is updating
GettingInvolved.rst to include the invite link, but I've grepped for any
other discord.com links.
I'm no Discord expert, but from my experience (confirmed via @preames
kindly testing as well) the direct channel links provide a confusing
experience if you haven't already found and used an invite link to the
LLVM Discord server. If you're logged into Discord but not a member of
LLVM's server, the web app opens and then...nothing. No channel opens, no
prompt to join the server or even a hint that you need to find an invite
link (and if you're not used to Discord, you likely don't even know
that's necessary).
This patch addresses the issue by providing the invite link where
Discord is mentioned.
Model C/C++ `errno` macro by adding a corresponding `errno`
memory location kind to the IR. Preliminary work to separate
`errno` writes from other memory accesses, to the benefit of
alias analyses and optimization correctness.
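A sketch of what this enables at the IR level (the `errnomem` location
name and attribute spelling below are assumptions based on this
description, not a confirmed final syntax):
```
; A libm-style routine that computes its result purely from its argument
; but may set errno: every other memory location is unaffected.
declare double @cbrt(double) memory(errnomem: write)
```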
Previous discussion: https://discourse.llvm.org/t/rfc-modelling-errno-memory-effects/82972.
- Change InstrInfoEmitter to emit OpName as an enum class
instead of an anonymous enum in the OpName namespace.
- This will help clearly distinguish between values that are
OpNames vs just operand indices and should help avoid
bugs due to confusion between the two.
- Rename OpName::OPERAND_LAST to NUM_OPERAND_NAMES.
- Emit declaration of getOperandIdx() along with the OpName
enum so it doesn't have to be repeated in various headers.
- Also updated AMDGPU, RISCV, and WebAssembly backends
to conform to the new definition of OpName (mostly
mechanical changes).
Replace some more nvvm.annotations with function attributes,
auto-upgrading the annotations as needed. These new attributes will be
more idiomatic and compile-time efficient than the annotations.
- !"maxclusterrank" / !"cluster_max_blocks" -> "nvvm.maxclusterrank"
- !"minctasm" -> "nvvm.minctasm"
- !"maxnreg" -> "nvvm.maxnreg"
This adds the `llvm.sincospi` intrinsic, legalization, and lowering
(mostly reusing the lowering for sincos and frexp).
The `llvm.sincospi` intrinsic takes a floating-point value and returns
both the sine and cosine of the value multiplied by pi. It computes the
result more accurately than the naive approach of doing the
multiplication ahead of time, especially for large input values.
```
declare { float, float } @llvm.sincospi.f32(float %Val)
declare { double, double } @llvm.sincospi.f64(double %Val)
declare { x86_fp80, x86_fp80 } @llvm.sincospi.f80(x86_fp80 %Val)
declare { fp128, fp128 } @llvm.sincospi.f128(fp128 %Val)
declare { ppc_fp128, ppc_fp128 } @llvm.sincospi.ppcf128(ppc_fp128 %Val)
declare { <4 x float>, <4 x float> } @llvm.sincospi.v4f32(<4 x float> %Val)
```
Currently, the default lowering of this intrinsic relies on the
`sincospi[f|l]` functions being available in the target's runtime (e.g.
libc).
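A minimal usage sketch (the function and value names are mine):
```
declare { float, float } @llvm.sincospi.f32(float)

define void @sincospi_f32(float %x, ptr %sinp, ptr %cosp) {
  %r = call { float, float } @llvm.sincospi.f32(float %x)
  %s = extractvalue { float, float } %r, 0
  %c = extractvalue { float, float } %r, 1
  store float %s, ptr %sinp
  store float %c, ptr %cosp
  ret void
}
```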
[NVPTX] Add Prefetch intrinsics
This PR adds prefetch intrinsics with the relevant eviction priorities.
* Lit tests are added as part of prefetch.ll
* The generated PTX is verified with a 12.3 ptxas executable.
* Added docs for these intrinsics in NVPTXUsage.rst.
For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-prefetch-prefetchu>`_.
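For reference, a rough IR-level shape of the new intrinsics (the full
set of names and address spaces is defined in prefetch.ll; the spellings
below are illustrative):
```
; Prefetch into L1/L2 for the generic and global address spaces, with an
; optional L2 eviction-priority variant.
declare void @llvm.nvvm.prefetch.L1(ptr)
declare void @llvm.nvvm.prefetch.global.L2(ptr addrspace(1))
declare void @llvm.nvvm.prefetch.global.L2.evict.last(ptr addrspace(1))
```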
---------
Co-authored-by: abmajumder <abmajumder@nvidia.com>
* Drop ".Z" in milestone name since we've been doing X.Y releases
instead of X.Y.Z releases since LLVM 18
* Add "LLVM" prefix since that's what release milestones are named
* Use a numbered list to make it clearer that there are two steps
needed, and add some more details to the first step
Based on some feedback in Discord about a PR where a reviewer asked the
author to move the formatting changes to a new PR, which appears to
contradict the current form of this document.
I've added an explanation here, before the point where the author would
be committing any of the formatting changes.
There are other ways this can go, for example some projects don't want
the churn of formatting, or you can pre-emptively send a formatting PR,
but I don't think enumerating them all here will help the audience for
this text.
So I've recommended one path that will start them off well, and they
can branch off if the reviewers make requests.
There are two ways we can fix this problem, depending on how the
semantics of byval and initializes should interact:
* Don't infer initializes on byval arguments. initializes on byval
refers to the original caller memory (or having both attributes is made
a verifier error).
* Infer initializes on byval, but don't use it in DSE. initializes on
byval refers to the callee copy. This matches the semantics of readonly
on byval. This is slightly more powerful, for example, we could do a
backend optimization where byval + initializes will allocate the full
size of byval on the stack but not copy over the parts covered by
initializes.
I went with the second variant here, skipping byval + initializes in DSE
(FunctionAttrs already doesn't propagate initializes past byval). I'm
open to going in the other direction though.
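A sketch of the case DSE now has to be careful about (the helper and the
exact attribute spelling are illustrative, following my reading of the
LangRef syntax):
```
; With the second variant, 'initializes' on a byval parameter describes
; the callee's private copy, not the caller's memory.
declare void @init(ptr byval([8 x i8]) initializes((0, 8)))

define void @caller() {
  %buf = alloca [8 x i8]
  store i8 1, ptr %buf
  ; The byval copy reads %buf at the call site, so the store above is not
  ; dead even though the callee initializes bytes [0, 8) of its own copy.
  call void @init(ptr byval([8 x i8]) %buf)
  ret void
}
```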
Fixes https://github.com/llvm/llvm-project/issues/126181.
This patch adds intrinsics for the tcgen05 wait,
fence, and commit PTX instructions.
Lit tests are added and verified with a
ptxas-12.8 executable.
Docs are updated in the NVPTXUsage.rst file.
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>