194 Commits

Author SHA1 Message Date
Joseph Huber
13dcc95dcd
[Offload] Rework offloading entry type to be more generic (#124018)
Summary:
The previous offloading entry type did not fit the current use-cases
very well. This widens it and adds a version to prevent further
annoyances. It also includes the kind to better sort who's using it.

The first 64-bytes are reserved as zero so the OpenMP runtime can detect
the old format for binary compatibilitry.
2025-01-28 07:26:13 -06:00
Joseph Huber
760a786d15
[Clang] Prevent mlink-builtin-bitcode from internalizing the RPC client (#118661)
Summary:
Currently, we only use `-mlink-builtin-bitcode` for non-LTO NVIDIA
compiliations. This has the problem that it will internalize the RPC
client symbol which needs to be visible to the host. To counteract that,
I put `retain` on it, but this also prevents optimizations on the global
itself, so the passes we have that remove the symbol don't work on
OpenMP anymore. This patch does the dumbest solution, adding a special
string check for it in clang. Not the best solution, the runner up would
be to have a clang attribute for `externally_initialized` because those
can't be internalized, but that might have some unfortunate
side-effects. Alternatively we could make NVIDIA compilations do LTO all
the time, but that would affect some users and it's harder than I
thought.
2025-01-27 19:30:59 -06:00
Joseph Huber
38b3f45a81 [Offload] Fix offload-info interface
Summary:
The offload info tool doesn't initialize things properly, just check
this first instead.
2025-01-27 10:36:09 -06:00
Joseph Huber
f07505849c [Offload] Fix server thread from being shut down if unused 2025-01-27 08:29:41 -06:00
Joseph Huber
e7592d83e0 [Offload][NFC] Make sure the thread is not running already 2025-01-27 08:06:29 -06:00
Joseph Huber
bd8a818128 [Offload] Add cuLaunchHostFunc to dynamic cuda
Summary:
This was missing, causing non-directly linked builds to fail.
2025-01-24 11:41:20 -06:00
Joseph Huber
134401deea
[Offload] Move RPC server handling to a dedicated thread (#112988)
Summary:
Handling the RPC server requires running through list of jobs that the
device has requested to be done. Currently this is handled by the thread
that does the waiting for the kernel to finish. However, this is not
sound on NVIDIA architectures and only works for async launches in the
OpenMP model that uses helper threads.

However, we also don't want to have this thread doing work
unnnecessarily. For this reason we track the execution of kernels and
cause the thread to sleep via a condition variable (usually backed by
some kind of futex or other intelligent sleeping mechanism) so that the
thread will be idle while no kernels are running.
2025-01-24 11:36:45 -06:00
hidekisaito
ed512710a5
[Offload] Make MemoryManager threshold ENV var size_t type. (#124063) 2025-01-23 11:46:56 -06:00
Joseph Huber
6518b121f0
[Offload][NFC] Factor out and rename the __tgt_offload_entry struct (#123785)
Summary:
This patch is an NFC renaming to make using the offloading entry type
more portable between other targets. Right now this is just moving its
definition to LLVM so others can use it. Future work will rework the
struct layout.
2025-01-21 12:05:24 -06:00
Joseph Huber
f233a54ae8
[OpenMP] Remove usage of pointer-to-member in lookup (#123671)
Summary:
This is buggy and is currently being tracked in
https://github.com/llvm/llvm-project/issues/123241. For now, replace it
with a macro so that we can use address spaces directly.
2025-01-21 07:50:40 -06:00
Joseph Huber
3274bf6b42
[OpenMP] Make each atomic helper take an atomic scope argument (#122786)
Summary:
Right now we just default to device for each type, and mix an ad-hoc
scope with the one used by the compiler's builtins. Unify this can make
each version take the scope optionally.

For @ronlieb, this will remove the need for `add_system` in the fork as
well as the extra `cas` with system scope, just pass `system`.
2025-01-20 21:58:27 -06:00
Joseph Huber
2d9f406943
[OpenMP] Adjust 'printf' handling in the OpenMP runtime (#123670)
Summary:
We used to avoid a lot of this stuff because we didn't properly handle
variadics in device code. That's been solved for now, so we can just
make an internal printf handler that forwards to the external `vprintf`
function. This is either provided by NVIDIA's SDK or by the GPU libc
implementation.

The main reason for doing this is because it prevents the stupid AMDGPU
printf pass from mangling our beautiful printfs!
2025-01-20 21:56:46 -06:00
Joseph Huber
723a3e746a [OpenMP] Fix mispelled attribute and warning
Summary:
This is spelled `ompx_aligned_barrier` when used directly, but wasn't
included in the list of known assumptions. Fix that so now th test
works.
2025-01-20 08:40:19 -06:00
Joseph Huber
58af82b462
[OpenMP] Remove 'omp assumes' scopes now that we have no inline ASM (#123611)
Summary:
We used this globally scoped `ext_no_call_asm` as a sort of hack around
the compiler that allowed the attributor to optimize out inline assembly
calls to PTX instructions. Quite some time ago I got rid of every inline
assembly call and replaced it with a builitin, so this can just be
deleted.

Furthermore, I use the `[[omp::assume]]` attribute directly for the
aligned barrier usage. This prints an unknown assumption warning (even
though it isn't) so I'm just silencing that for now until I fix it
later.

---------

Co-authored-by: Michael Kruse <github@meinersbur.de>
2025-01-20 08:11:06 -06:00
Jan Patrick Lehr
4b3c17850b
[Offload] Enable shared-libs; compiler-rt as default RTLIB (#123568)
This is the next step to move the CMake cache file builder closer to the
build configuration we care about downstream.
2025-01-20 10:23:41 +01:00
Jan Patrick Lehr
61f94ebc9e
[NFC][Offload] Structure/Readability of CMake cache (#123328)
Preparing to add more config options and want to group them all from
most-common to project / component specific.
2025-01-17 13:01:25 +01:00
Joseph Huber
1c00d0d776
[OpenMP] Remove hack around missing atomic load (#122781)
Summary:
We used to do a fetch add of zero to approximate a load. This is because
the NVPTX backend didn't handle this properly. It's not an issue anymore
so simply use the proper atomic builtin.
2025-01-16 15:17:15 -06:00
Jinsong Ji
8d1d67ec4d
[Offload][PGO] Fix dump of array in ProfData (#122039)
Exposed by -Warray-bounds:

In file included from
../../../../../../../llvm/offload/plugins-nextgen/common/src/GlobalHandler.cpp:252:

../../../../../../../llvm/llvm/include/llvm/ProfileData/InstrProfData.inc:109:1:
error: array index 4 is past the end of the array (that has type 'const
std::remove_const<const uint16_t>::type[4]' (aka 'const unsigned
short[4]')) [-Werror,-Warray-bounds]
109 | INSTR_PROF_DATA(const uint16_t, Int16ArrayTy,
NumValueSites[IPVK_Last+1], \
| ^ ~~~~~~~~~~~

../../../../../../../llvm/offload/plugins-nextgen/common/src/GlobalHandler.cpp:250:15:
note: expanded from macro 'INSTR_PROF_DATA'
250 | outs() << ProfData.Name << " "; \
      |               ^        ~~~~

../../../../../../../llvm/llvm/include/llvm/ProfileData/InstrProfData.inc:109:1:
note: array 'NumValueSites' declared here
109 | INSTR_PROF_DATA(const uint16_t, Int16ArrayTy,
NumValueSites[IPVK_Last+1], \
      | ^

../../../../../../../llvm/offload/plugins-nextgen/common/include/GlobalHandler.h:62:3:
note: expanded from macro 'INSTR_PROF_DATA'
   62 |   std::remove_const<Type>::type Name;

Avoid accessing out-of-bound data, but skip printing array data for now.
As there is no simple way to do this without hardcoding the
NumValueSites field.

---------

Co-authored-by: Ethan Luis McDonough <ethanluismcdonough@gmail.com>
2025-01-14 15:46:27 -05:00
Joseph Huber
74d5373f49 [OpenMP] Fix missing type getter for SFINAE helper
Summary:
This didn't get the type, which made using this always return false.
2025-01-10 19:35:41 -06:00
Joseph Huber
f53cb84df6
[OpenMP] Use __builtin_bit_cast instead of UB type punning (#122325)
Summary:
Use a normal bitcast, remove from the shared utils since it's not
available in
GCC 7.4
2025-01-09 13:59:21 -06:00
Joseph Huber
b57c0bac81
[OpenMP] Update atomic helpers to just use headers (#122185)
Summary:
Previously we had some indirection here, this patch updates these
utilities to just be normal template functions. We use SFINAE to manage
the special case handling for floats. Also this strips address spaces so
it can be used more generally.
2025-01-09 13:57:39 -06:00
Johannes Doerfert
1739ba9bb5
[OpenMP][FIX] Adjust test to be non-flaky (#122331)
The test runs asynchronous kernels and depending on the timing the
output is slightly different. We now only check for the common parts of
the output.
2025-01-09 11:57:11 -08:00
agozillon
fa56e8bb64
[OpenMP][MLIR] Fix threadprivate lowering when compiling for target when target operations are in use (#119310)
Currently the compiler will ICE in programs like the following on the
device lowering pass:

```
program main
    implicit none

    type i1_t
       integer :: val(1000)
    end type i1_t
    integer :: i
    type(i1_t), pointer :: newi1
    type(i1_t), pointer :: tab=>null()

    integer, dimension(:), pointer :: tabval

!$omp THREADPRIVATE(tab)

allocate(newi1)

tab=>newi1
tab%val(:)=1
tabval=>tab%val

!$omp target teams distribute parallel do
  do i = 1, 1000
   tabval(i) = i
 end do
!$omp end target teams distribute parallel do

end program main
```

This is due to the fact that THREADPRIVATE returns a result operation,
and this operation can actually be used by other LLVM dialect (or other
dialect) operations. However, we currently skip the lowering of
threadprivate, so we effectively never generate and bind an LLVM-IR
result to the threadprivate operation result. So when we later go on to
lower dependent LLVM dialect operations, we are missing the required
LLVM-IR result, try to access and use it and then ICE. The fix in this
particular PR is to allow compilation of threadprivate for device as
well as host, and simply treat the device compilation as a no-op,
binding the LLVM-IR result of threadprivate with no alterations and
binding it, which will allow the rest of the compilation to proceed,
where we'll eventually discard the host segment in any case.

The other possible solution to this I can think of, is doing something
similar to Flang's passes that occur prior to CodeGen to the LLVM
dialect, where they erase/no-op certain unrequired operations or
transform them to lower level series of operations. And we would
erase/no-op threadprivate on device as we'd never have these in target
regions.

The main issues I can see with this are that we currently do not
specialise this stage based on wether we're compiling for device or
host, so it's setting a precedent and adding another point of having to
understand the separation between target and host compilation. I am also
not sure we'd necessarily want to enforce this at a dialect level incase
someone else wishes to add a different lowering flow or translation
flow. Another possible issue is that a target operation we have/utilise
would depend on the result of threadprivate, meaning we'd not be allowed
to entirely erase/no-op it, I am not sure of any situations where this
may be an issue currently though.
2025-01-03 18:01:01 +01:00
agozillon
5137c209f0
[Flang][OpenMP] Fix allocating arrays with size intrinisic (#119226)
Attempt to address the following example from causing an assert or ICE:

```
   subroutine test(a)
        implicit none
        integer :: i
        real(kind=real64), dimension(:) :: a
        real(kind=real64), dimension(size(a, 1)) :: b

!$omp target map(tofrom: b)
        do i = 1, 10
            b(i) = i
        end do
!$omp end target
end subroutine
```

Where we utilise a Fortran intrinsic (size) to calculate the size of
allocatable arrays and then map it to device.
2025-01-03 16:46:15 +01:00
Joseph Huber
34f8573a51
[OpenMP] Use generic IR for the OpenMP DeviceRTL (#119091)
Summary:
We previously built this for every single architecture to deal with
incompatibility. This patch updates it to use the 'generic' IR that
`libc` and other projects use. Who knows if this will have any
side-effects, probably worth testing more but it passes the tests I
expect to pass on my side.
2024-12-24 18:05:28 -06:00
Kareem Ergawy
e532241b02
Re-apply (#117867): [flang][OpenMP] Implicitly map allocatable record fields (#120374)
This re-applies #117867 with a small fix that hopefully prevents build
bot failures. The fix is avoiding `dyn_cast` for the result of
`getOperation()`. Instead we can assign the result to `mlir::ModuleOp`
directly since the type of the operation is known statically (`OpT` in
`OperationPass`).
2024-12-18 09:19:45 +01:00
Kareem Ergawy
dc936f3c19
Revert "[flang][OpenMP] Implicitly map allocatable record fields (#117867)" (#120360) 2024-12-18 06:52:24 +01:00
Kareem Ergawy
db09014a07
[flang][OpenMP] Implicitly map allocatable record fields (#117867)
This is a starting PR to implicitly map allocatable record fields.

This PR contains the following changes:
1. Re-purposes some of the utils used in `Lower/OpenMP.cpp` so that
   these utils work on the `mlir::Value` level rather than the
   `semantics::Symbol` level. This takes one step towards to enabling
   MLIR passes to more easily do some lowering themselves (e.g. creating
   `omp.map.bounds` ops for implicitely caputured data like this PR
   does).
2. Adds support for implicitely capturing and mapping allocatable fields
   in record types.

There is quite some distant to still cover to have full support for
this. I added a number of todos to guide further development.

Co-authored-by: Andrew Gozillon <andrew.gozillon@amd.com>

Co-authored-by: Andrew Gozillon <andrew.gozillon@amd.com>
2024-12-18 05:37:58 +01:00
Joseph Huber
b0fbddde38 [OpenMP] Only put retain for NVPTX so it can be optimized out for AMD
Summary:
This is a hack that only NVPTX needs.
2024-12-17 15:16:51 -06:00
wanglei
bdf727065b
[Offload] Add support for loongarch64 to host plugin
This adds support for the loongarch64 architecture to the offload host
plugin.

Similar to #115773

To fix some test issues, I've had to add the LoongArch64 target to:

- CompilerInvocation::ParseLangArgs
- linkDevice in ClangLinuxWrapper.cpp
- OMPContext::OMPContext (to set the device_kind_cpu trait)

Reviewed By: jhuber6

Pull Request: https://github.com/llvm/llvm-project/pull/120173
2024-12-17 19:06:10 +08:00
Jinsong Ji
e85a9f5540
libc: Prefix RPC Status code to avoid conflict in windows build (#119991)
Somehow conflict with define in wingdi.h.

Fix build failures:

[ 52%] Building CXX object
projects/offload/plugins-nextgen/common/CMakeFiles/PluginCommon.dir/src/RPC.cpp.obj
In file included from
...llvm\offload\plugins-nextgen\common\src\RPC.cpp:16:
...\llvm\libc\shared\rpc.h(48,3): error: expected identifier
   48 |   ERROR = 0x1000,
      |   ^
c:\Program files (x86)\Windows
Kits\10\include\10.0.22000.0\um\wingdi.h(118,29): note: expanded from
macro 'ERROR'
  118 | #define ERROR               0
      |                             ^
...\llvm\offload\plugins-nextgen\common\src\RPC.cpp(75,17): error:
expected unqualified-id
   75 |     return rpc::ERROR;
      |                 ^
c:\Program files (x86)\Windows
Kits\10\include\10.0.22000.0\um\wingdi.h(118,29): note: expanded from
macro 'ERROR'
  118 | #define ERROR               0
      |                             ^
2 errors generated.
2024-12-15 09:35:44 -05:00
Joseph Huber
f4ee5a673f
[OpenMP] Replace AMDGPU fences with generic scoped fences (#119619)
Summary:
This is simpler and more common. I would've replaced the CUDA uses and
made this the same but currently it doesn't codegen these fences fully
and just emits a full system wide barrier as a fallback.
2024-12-12 07:54:51 -06:00
Jan Patrick Lehr
74486dcf41
[Offload] Add CMake cache to be used in AMDGPU bot (#119369)
Adds initial CMake cache definition that is similar to what we use in
one of our production buidlbots. The goal is to consolidate the
configurations and make them accessible.
This cache file is a first step and to prepare for full pipeline testing
once the new bot comes online.
2024-12-10 17:50:02 +01:00
hidekisaito
f2bceb2311
[Offload][AMDGPU] accept generic target (#118919)
Enables generic ISA, e.g., "--offload-arch=gfx11-generic" device code to
run on gfx11-generic ISA capable device.

Executable may contain one ELF that has specific target ISA and another
ELF that has compatible generic ISA.
Under that circumstance, this code should say both ELFs are compatible,
leaving the rest to PluginManager to handle.
Suggestions on how best to address that is welcome.
2024-12-09 19:11:38 -05:00
Michał Górny
69227a11fe
[offload] Support LIBOMPTARGET_DEVICE_ARCHITECTURES={amdgpu|nvptx} (#119070)
Add two more special values for LIBOMPTARGET_DEVICE_ARCHITECTURES:
`amdgpu` and `nvptx`, to support building for all AMDGPU and NVPTX
targets respectively. This can be used in place of `all` when offload is
built with one of the GPU plugins only.
2024-12-07 15:37:28 +00:00
Michał Górny
4a44e4b192
[offload] Remove bogus offload-tblgen check for standalone build (#119004)
fd3907ccb583df99e9c19d2fe84e4e7c52d75de9 introduced a check for system
offload-tblgen executable when doing a standalone build. This check is
bogus, since offload-tblgen is built as part of offload and not some
other preinstalled component. The path is also overwritten below, so the
check only causes tests to be disabled unnecessarily.
2024-12-06 18:21:12 +00:00
Shilei Tian
92376c3ff5
[Offload][OMPX] Add the runtime support for multi-dim grid and block (#118042) 2024-12-06 09:07:50 -05:00
Michał Górny
b54ba5361e
[offload] Add gfx1012 (Navi 14) to AMDGPU models list (#118857)
Fixes #118824
2024-12-06 03:24:55 +00:00
Shilei Tian
eb49788bd9
[Offload][AMDGPU] Allow COV6 images (#118909) 2024-12-05 20:05:39 -05:00
Callum Fare
fd3907ccb5
Reland #118503: [Offload] Introduce offload-tblgen and initial new API implementation (#118614)
Reland #118503. Added a fix for builds with `-DBUILD_SHARED_LIBS=ON`
(see last commit). Otherwise the changes are identical.

---


### New API

Previous discussions at the LLVM/Offload meeting have brought up the
need for a new API for exposing the functionality of the plugins. This
change introduces a very small subset of a new API, which is primarily
for testing the offload tooling and demonstrating how a new API can fit
into the existing code base without being too disruptive. Exact designs
for these entry points and future additions can be worked out over time.

The new API does however introduce the bare minimum functionality to
implement device discovery for Unified Runtime and SYCL. This means that
the `urinfo` and `sycl-ls` tools can be used on top of Offload. A
(rough) implementation of a Unified Runtime adapter (aka plugin) for
Offload is available
[here](https://github.com/callumfare/unified-runtime/tree/offload_adapter).
Our intention is to maintain this and use it to implement and test
Offload API changes with SYCL.

### Demoing the new API

```sh
# From the runtime build directory
$ ninja LibomptUnitTests
$ OFFLOAD_TRACE=1 ./offload/unittests/OffloadAPI/offload.unittests 
```


### Open questions and future work
* Only some of the available device info is exposed, and not all the
possible device queries needed for SYCL are implemented by the plugins.
A sensible next step would be to refactor and extend the existing device
info queries in the plugins. The existing info queries are all strings,
but the new API introduces the ability to return any arbitrary type.
* It may be sensible at some point for the plugins to implement the new
API directly, and the higher level code on top of it could be made
generic, but this is more of a long-term possibility.
2024-12-05 09:34:04 +01:00
hidekisaito
8c36a823c2
Fix to account for multiple ISA enumeration (#118676) 2024-12-04 12:11:16 -06:00
Michał Górny
04b26f0eb7
[offload] Standalone build fixes (#118173)
A fair number of fixes to get standalone builds of offload working —
mostly copying missing bits from openmp. It's almost ready — I still
need to figure out why some of the tsts aren't linking to the right
libraries.
2024-12-04 11:27:37 +00:00
Shilei Tian
68bcba6d7a Revert "[AMDGPU] Use COV6 by default (#118515)"
This reverts commit 410cbe3cf28913cca2fc61b3437306b841d08172 because some
buildbots are not ready yet.
2024-12-03 20:17:06 -05:00
Shilei Tian
410cbe3cf2
[AMDGPU] Use COV6 by default (#118515) 2024-12-03 19:38:35 -05:00
Jan Patrick Lehr
51553227f0
Revert "Reland of #108413: [Offload] Introduce offload-tblgen and initial new API implementation" (#118541)
Reverts llvm/llvm-project#118503

Broke bot
https://lab.llvm.org/staging/#/builders/131/builds/9701/steps/5/logs/stdio
2024-12-03 21:42:38 +01:00
Callum Fare
8da490320f
Reland of #108413: [Offload] Introduce offload-tblgen and initial new API implementation (#118503)
This is another attempt to reland the changes from #108413

The previous two attempts introduced regressions and were reverted. This
PR has been more thoroughly tested with various configurations so
shouldn't cause any problems this time. If anyone is aware of any likely
remaining problems then please let me know.

The changes are identical other than the fixes contained in the last 5
commits.

---


### New API

Previous discussions at the LLVM/Offload meeting have brought up the
need for a new API for exposing the functionality of the plugins. This
change introduces a very small subset of a new API, which is primarily
for testing the offload tooling and demonstrating how a new API can fit
into the existing code base without being too disruptive. Exact designs
for these entry points and future additions can be worked out over time.

The new API does however introduce the bare minimum functionality to
implement device discovery for Unified Runtime and SYCL. This means that
the `urinfo` and `sycl-ls` tools can be used on top of Offload. A
(rough) implementation of a Unified Runtime adapter (aka plugin) for
Offload is available
[here](https://github.com/callumfare/unified-runtime/tree/offload_adapter).
Our intention is to maintain this and use it to implement and test
Offload API changes with SYCL.

### Demoing the new API

```sh
# From the runtime build directory
$ ninja LibomptUnitTests
$ OFFLOAD_TRACE=1 ./offload/unittests/OffloadAPI/offload.unittests 
```


### Open questions and future work
* Only some of the available device info is exposed, and not all the
possible device queries needed for SYCL are implemented by the plugins.
A sensible next step would be to refactor and extend the existing device
info queries in the plugins. The existing info queries are all strings,
but the new API introduces the ability to return any arbitrary type.
* It may be sensible at some point for the plugins to implement the new
API directly, and the higher level code on top of it could be made
generic, but this is more of a long-term possibility.
2024-12-03 16:28:35 +00:00
Jan Patrick Lehr
c7babfa6a3
[Offload] Find libc relative to DeviceRTL path (#118497)
This was discussed as a potential solution in
https://github.com/llvm/llvm-project/pull/118173
2024-12-03 16:37:57 +01:00
Joseph Huber
a6ef0debb1 [libc][NFC] Rename RPC opcodes to better reflect their usage
Summary:
RPC_ is a generic prefix here, use LIBC_ to indicate that these are
opcodes used to implement the C library
2024-12-02 15:35:08 -06:00
Joseph Huber
91f5f974cb
[OpenMP] Unconditionally provide an RPC client interface for OpenMP (#117933)
Summary:
This patch adds an RPC interface that lives directly in the OpenMP
device runtime. This allows OpenMP to implement custom opcodes.
Currently this is only providing the host call interface, which is the
raw version of reverse offloading. Previously this lived in `libc/` as
an extension which is not the correct place.

The interface here uses a weak symbol for the RPC client by the same
name that the `libc` interface uses. This means that it will defer to
the libc one if both are present so we don't need to set up multiple
instances.

The presense of this symbol is what controls whether or not we set up
the RPC server. Because this is an external symbol it normally won't be
optimized out, so there's a special pass in OpenMPOpt that deletes this
symbol if it is unused during linking. That means at `O0` the RPC server
will always be present now, but will be removed trivially if it's not
used at O1 and higher.
2024-12-02 14:31:51 -06:00
Shilei Tian
8cb44859cc FIX: Fix test failure in offload/test/mapping/power_of_two_alignment.c
A C file with `template` in it. Awesome.
2024-12-01 21:40:46 -05:00