The AMDGPU target can only emit LLVM-IR, so we can always rely on LTO to
link the static version of the runtime optimally. Using only the static
library has a few advantages. Namely, it avoids several known bugs
and allows us to optimize out more functions. This is legal since the
changes in D142486 and D142484.
Depends on D142486 D142484
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D142491
GCC doesn't support `-fopenmp-version`, causing test failure if the compiler used
for testing is GCC.
GCC's OpenMP 5.2 support is still very limited. Disable the tests requiring 5.2
features for GCC as well.
We might want to take a look at all `libomp` tests and mark those tests that
don't support GCC yet.
Reviewed By: ABataev
Differential Revision: https://reviews.llvm.org/D142173
This patch fixes issues reported for Ubuntu and possibly other platforms:
https://github.com/llvm/llvm-project/issues/45290
The latest comment on this issue points out that using dlsym rather than
the weak symbol approach to call TSan annotation functions fixes the issue
for Ubuntu.
Differential Revision: https://reviews.llvm.org/D142378
The next-gen plugins are complete drop-in replacements for the old
versions. We should strive to replace the old ones as quickly as
possible now that we have a viable alternative.
The only test failing is the `prelock.cpp` test as the support has not landed in
the next-gen plugins.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D142399
Add free functions llvm::CodeGenOpt::{getLevel,getID,parseLevel} to
provide common implementations for functionality that has been
duplicated in many places across the codebase.
Differential Revision: https://reviews.llvm.org/D141968
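A minimal standalone sketch of what such helpers could look like. The
namespace `sketch`, the enum values, and the bool/out-parameter signatures are
assumptions made for illustration only; the real declarations live in the LLVM
headers and may differ.

```cpp
#include <cassert>

namespace sketch {
// Hypothetical mirror of an optimization-level enum; not the LLVM definition.
enum class Level { None = 0, Less = 1, Default = 2, Aggressive = 3 };

// Map a numeric ID (0-3) to a Level. Returns false for out-of-range IDs.
inline bool getLevel(int ID, Level &Out) {
  if (ID < 0 || ID > 3)
    return false;
  Out = static_cast<Level>(ID);
  return true;
}

// Map a Level back to its numeric ID.
inline int getID(Level L) { return static_cast<int>(L); }

// Parse a single character, such as the '2' in "-O2".
inline bool parseLevel(char C, Level &Out) {
  if (C < '0' || C > '9')
    return false;
  return getLevel(C - '0', Out);
}
} // namespace sketch
```

Keeping the range check in one place means every caller rejects an invalid
level the same way instead of re-implementing the switch.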
Summary:
Recently AMD moved the "hsa.h" include to "hsa/hsa.h". This causes
several warnings. This patch checks to see if we can include that one
instead. This should hopefully keep things backwards compatible while
silencing the warnings.
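One way to express such a check is with the `__has_include` preprocessor
operator. The fragment below is a sketch of that idea and not necessarily the
exact code in the patch; `SKETCH_HSA_HEADER` and `chosenHsaHeader` are
hypothetical names that only record which branch was taken.

```cpp
#include <cassert>
#include <cstring>

#if defined(__has_include)
#if __has_include("hsa/hsa.h")
#include "hsa/hsa.h" // new location, preferred when present
#define SKETCH_HSA_HEADER "hsa/hsa.h"
#elif __has_include("hsa.h")
#include "hsa.h" // old location, kept for backwards compatibility
#define SKETCH_HSA_HEADER "hsa.h"
#else
// Neither header is installed; a real build would report an error here.
#define SKETCH_HSA_HEADER "not found"
#endif
#else
// Without __has_include we cannot probe, so assume the old location.
#include "hsa.h"
#define SKETCH_HSA_HEADER "hsa.h"
#endif

// Hypothetical helper reporting which header path was selected.
inline const char *chosenHsaHeader() { return SKETCH_HSA_HEADER; }
```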
This patch makes preparation for a series that will enable per-kernel information
used in both host and device runtime. Some variables/enums, such as `OMPTgtExecModeFlags`,
have to be shared by both of them. A new header `OMPDeviceConstants.h` is added,
containing code that will be shared by them. We will introduce more variables soon.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D142320
Dynamic memory allows users to allocate fast shared memory when a kernel
is launched. We support a single size for all kernels via the
`LIBOMPTARGET_SHARED_MEMORY_SIZE` environment variable but now we can
control it per kernel invocation, which allows computed values.
Note: Only the nextgen plugins will allocate memory based on the clause,
the old plugins will silently miscompile.
Differential Revision: https://reviews.llvm.org/D141233
Clang passes `KernelArgs.NumArgs` to the runtime but not all are kernel
arguments. This ensures we fall back to the old logic. In a follow up we
should introduce a new `KernelArgs.NumKernelArgs` field and set it in
the runtime.
We already created a versioned `__tgt_kernel_arguments` struct but it
was only briefly used and its content was passed in isolation anyway.
This makes it hard to add more information in the future. With this
patch we fully embrace the struct as means to pass information from the
compiler to the plugin as part of a kernel launch.
The patch also extends and renames the struct, bumping the version
number to 2. Version 1 entries are auto-upgraded. This is in preparation
for "bare" kernel launches, per kernel dynamic shared memory, CUDA/HIP
lowering, etc.
The `__tgt_target_kernel_nowait` interface was deprecated as it was
unused. Once we actually implement support for something like that, we
can add an appropriate API.
Note: Only plugins with the `launch_kernel` interface are now supported.
That means that a new clang won't be able to use an old runtime.
An old clang can still use the new runtime since the libomptarget
interface did not change.
Differential Revision: https://reviews.llvm.org/D141232
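The auto-upgrade of version 1 entries can be pictured with the sketch below.
All struct layouts and field names here (`KernelArgsV1`, `KernelArgsV2`,
`DynSharedMemBytes`, `upgrade`) are hypothetical and chosen for illustration;
they are not the actual `__tgt_kernel_arguments` definition.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

namespace sketch {
// Hypothetical version-1 layout: a version tag and an argument count.
struct KernelArgsV1 {
  std::uint32_t Version; // always 1
  std::uint32_t NumArgs;
};

// Hypothetical version-2 layout: same prefix plus one new field.
struct KernelArgsV2 {
  std::uint32_t Version; // always 2
  std::uint32_t NumArgs;
  std::uint32_t DynSharedMemBytes; // illustrative addition
};

// Both layouts start with the version tag, so the callee reads it first and
// upgrades old entries, filling the new fields with defaults.
inline bool upgrade(const void *Raw, KernelArgsV2 &Out) {
  std::uint32_t Version;
  std::memcpy(&Version, Raw, sizeof(Version));
  if (Version == 1) {
    KernelArgsV1 Old;
    std::memcpy(&Old, Raw, sizeof(Old));
    Out = KernelArgsV2{2, Old.NumArgs, 0};
    return true;
  }
  if (Version == 2) {
    std::memcpy(&Out, Raw, sizeof(Out));
    return true;
  }
  return false; // unknown version, reject
}
} // namespace sketch
```

Leading with the version tag is what lets the callee accept entries produced
against an older layout.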
Move plugin initialization to libomptarget initialization.
Removes the call_once control, probably fractionally faster overall.
Fixes issue 60119 because the plugin initialization, which might
try to dlopen unrelated shared libraries, is no longer nested within
a call from application code.
Fixes #60119
Reviewed By: Maetveis, jhuber6
Differential Revision: https://reviews.llvm.org/D142249
Summary:
This patch removes a tool that was never finished and has no plans of
being picked up again. It does not need to live in LLVM source in an
unusable state.
Distributed barrier was found to cause hangs in some test cases. Found
that a section updating the barrier size was improperly shifted to a
different code section during patching. Restored to original
location, all tests run to completion.
Differential Revision: https://reviews.llvm.org/D141618
The test `openmp/runtime/test/atomic/kmp_atomic_float10_max_min.c` uses a compiler
flag `-mlong-double-80` that might not be supported by all targets. Currently it
requires `x86-registered-target`, but that requirement can be true when LLVM
supports X86 while the actual `libomp` arch is not X86. For example, when LLVM
is built on AArch64 with all targets enabled, `x86-registered-target` can be met.
If `libomp` is built for the native target, i.e., AArch64, the test will still be enabled,
causing a test failure.
This patch only enables the test if the actual target is X86. The actual target
is determined by `LIBOMP_ARCH`.
Fix #53696.
Reviewed By: jlpeyton
Differential Revision: https://reviews.llvm.org/D142172
This patch fixes a segfault that was appearing when the plugin fails to
initialize and then is deinitialized. Also, do not call hsa_shut_down if
the hsa_init failed.
Differential Revision: https://reviews.llvm.org/D142145
The entries inside a "target data end" are processed in three steps:
1. Query internal data maps for the entries and dispatch any necessary
device-side operations (i.e., data retrieval);
2. Synchronize such operations;
3. Update the host-side pointers and remove any entry whose reference
counter reached zero.
Such steps may be executed by multiple threads which may even operate on
the same entries. The current implementation (D121058) tries to
synchronize these threads by tracking the "owner" for the deletion of
each entry using their thread ID. Unfortunately, it may fail to do so
for the following reasons:
1. The owner is always assigned at the first step only if the
reference count is 0 when the map is queried. This does not work
when such an owner thread is faster than a previous one that is also
processing the same entry on another "target data end", leading to
use-after-free problems.
2. The entry is only added for post-processing (step 3) if its
reference count was 0 at query time (step 1). This does not allow
for threads to exchange responsibility for the deletion, leading
again to use-after-free problems.
3. An entry may appear multiple times in the arguments array of a
"target data end", which may lead to deleting the entry
prematurely, leading, again, to use-after-free problems.
This patch addresses these problems by tracking all the threads that are
using an entry in a "target data end" region through a counter, ensuring
only the last one deletes it when needed. It also ensures that all
entries that are successfully found inside the data maps in step 1 are
also processed in step 3, regardless of whether their reference count was
zero at query time. This ensures the deletion ownership may be passed
to any thread that is using such an entry.
Reviewed By: ye-luo
Differential Revision: https://reviews.llvm.org/D132676
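The counter-based scheme described above can be sketched as follows. This is
a simplified model, not the libomptarget implementation: the names `Entry`,
`DataEndUsers`, `queryForDataEnd`, and `finishDataEnd` are hypothetical, and
the locking that a real runtime needs around both steps is omitted.

```cpp
#include <cassert>

namespace sketch {
// Hypothetical stand-in for a host-to-device map entry. All fields are
// assumed to be protected by a lock that is left out of this sketch.
struct Entry {
  int RefCount;     // mapping reference count
  int DataEndUsers; // threads between step 1 and step 3
  bool Deleted;
};

// Step 1: every thread that finds the entry registers itself and drops one
// reference, whether or not the count reaches zero.
inline void queryForDataEnd(Entry &E) {
  ++E.DataEndUsers;
  if (E.RefCount > 0)
    --E.RefCount;
}

// Step 3: every registered thread comes back. Only the last one to leave
// deletes the entry, and only if no references remain.
inline bool finishDataEnd(Entry &E) {
  bool Last = (--E.DataEndUsers == 0);
  if (Last && E.RefCount == 0) {
    E.Deleted = true; // a real runtime would release device memory here
    return true;
  }
  return false;
}
} // namespace sketch
```

With two threads A and B on an entry whose count starts at 2, B is the one
that drives the count to zero, yet if B finishes step 3 first it must not
delete; A, finishing last, performs the deletion.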
D91464 introduced verbose tool loading, but the test check only considers Linux.
On macOS, the outputs are totally different, causing the regression afterwards.
This patch simply sets the test to XFAIL on macOS.
Fix #56833.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D142045
The next gen plugin adds the definition of `DEBUG_PREFIX` in CMake, causing
a compiler warning that `DEBUG_PREFIX` is defined multiple times. This patch simply
guards the macro definition.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D142064
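Guarding a macro definition usually takes the form below. The fallback string
and the `debugPrefix` helper are placeholders for illustration, not the values
used in the plugin sources.

```cpp
#include <cassert>
#include <cstring>

// The build system may already pass -DDEBUG_PREFIX=...; only define a
// fallback when it did not, so the macro is never defined twice.
#ifndef DEBUG_PREFIX
#define DEBUG_PREFIX "TARGET SKETCH" // placeholder value
#endif

// Hypothetical helper that exposes the effective prefix.
inline const char *debugPrefix() { return DEBUG_PREFIX; }
```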
This patch fixes the inconsistent task state when the hot team is not used.
When the primary thread executes `__kmp_join_call`, it calls `__kmp_free_team`,
where worker threads get destroyed if the hot team is not in use. Destroying the
worker threads also resets their task state. However, the primary thread's state is not
reset. When the next parallel region is encountered, the task state of each thread
is flipped in `__kmp_task_team_sync`. Since the state of the primary thread was not
reset, it is still 1, while all the worker threads are at 0. This leads to an
inconsistent task state, causing those threads to use completely different
task teams.
Fix #59190.
Reviewed By: tlwilmar
Differential Revision: https://reviews.llvm.org/D141979
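The mismatch can be reproduced in a toy model. The sketch below is a heavy
simplification that assumes the task state is a single 0/1 value flipped at
the start of each parallel region; `Thread`, `flip`, and `statesAgree` are
hypothetical names and do not correspond to runtime functions.

```cpp
#include <cassert>

namespace sketch {
struct Thread {
  unsigned char TaskState = 0;
};

// Models the flip performed when a new parallel region starts.
inline void flip(Thread &T) {
  T.TaskState = static_cast<unsigned char>(1 - T.TaskState);
}

// Runs two back-to-back parallel regions without a hot team and reports
// whether primary and worker agree on the task state in the second region.
inline bool statesAgree(bool ResetPrimaryAtJoin) {
  Thread Primary, Worker;
  flip(Primary);
  flip(Worker);      // first region: both become 1
  Worker = Thread(); // join: worker destroyed, its replacement starts at 0
  if (ResetPrimaryAtJoin)
    Primary.TaskState = 0; // the behavior this patch adds, in this model
  flip(Primary);
  flip(Worker); // second region
  return Primary.TaskState == Worker.TaskState;
}
} // namespace sketch
```

In this model `statesAgree(false)` yields false and `statesAgree(true)` yields
true, which matches the inconsistency and the fix described above.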
This patch enables storing bitcode images when JIT is enabled for the record-and-replay functionality (see https://reviews.llvm.org/D138931). Credits to @jdoerfert for refactoring the code.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D141986
There are plenty of assumptions in `libomptarget` and the device runtime
about the pointer size or `size_t`, etc. 32-bit systems are not supported. There
is no point in refining everything to make it portable. This patch simply disables
building on 32-bit systems.
Fix #60121.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D142023
Currently we build tests as long as the libraries are found on the
machine. This doesn't necessarily mean there is a GPU to use though.
This patch changes it so that we only build the tests if we find
a compatible GPU via `nvptx-arch` or `amdgpu-arch`.
The only downside I can see is if someone were to build LLVM on a
home node of a cluster and then wished to run the tests after switching
to a compute node. For this I think we should allow it to be overridden.
I think that's better than allowing us to run tests that will fail by
default.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D142018
This patch adds functionality for recording and replaying the execution of OpenMP offload kernels, based on an original implementation by Steve Rangel. The patch extends libomptarget to extract a json description of the kernel, the device image binary, and a device memory snapshot before and after the execution of a recorded kernel. Kernel recording/replaying in libomptarget is controlled through env vars (LIBOMPTARGET_RECORD, LIBOMPTARGET_REPLAY). It provides a tool, llvm-omp-kernel-replay, for replaying a kernel using the extracted information with the ability to verify replayed execution using the post-execution device memory snapshot, also supporting changing the number of teams/threads for replaying.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D138931
This variable is used by the runtime. Before kernel launch we set it to
indicate several configuration options from the host. This patch renames
it to be more in line with the rest of the names exported from the
runtime. This is better because this is the only symbol visible to the
host from the runtime, so it should have a reserved name.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D141960
This method to look up the CUDA architecture is deprecated in newer
versions of CMake. We also have our own way to query this information
that we control now via the `nvptx-arch` program, which should always be
present in LLVM builds with clang going forward. This is currently only
used for testing so I think we should be okay with the dependency.
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D141933
Each time a thread gets a new affinity assigned, it will not
only assign its mask, but also topology information including
which socket, core, thread and core-attributes (if available)
it is now assigned to. This occurs for all non-disabled KMP_AFFINITY
values as well as OMP_PLACES/OMP_PROC_BIND.
The information regarding which socket, core, etc. can take on three
values:
1) The actual ID of the unit (0 - (N-1)), given N units
2) UNKNOWN_ID (-1) which indicates it does not know which ID
3) MULTIPLE_ID (-2) which indicates the thread is spread across
multiple of this unit (e.g., affinity mask is spread across
multiple hardware threads)
This new information is stored in the th_topology_ids[] array. For example,
to get the socket ID, one would read th_topology_ids[KMP_HW_SOCKET].
This could be expanded in the future to something more descriptive for
the "multiple" case, like a range of values. For now, the single
value suffices.
The information regarding the core attributes can take on two values:
1) The actual core-type or core-eff
2) KMP_HW_CORE_TYPE_UNKNOWN if the core type is unknown, and
UNKNOWN_CORE_EFF (-1) if the core eff is unknown.
This new information is stored in th_topology_attrs. For example,
to get the core type, one would read
th_topology_attrs.core_type.
Differential Revision: https://reviews.llvm.org/D139854
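A consumer of these IDs has to handle the two sentinel values. The sketch
below shows one way to do that; the enum `HwLayer`, the struct `ThreadInfo`,
and the `describe` helper are hypothetical stand-ins, and only the sentinel
values (-1 and -2) are taken from the description above.

```cpp
#include <cassert>
#include <string>

namespace sketch {
// Hypothetical subset of hardware layers used as array indices.
enum HwLayer { HW_SOCKET = 0, HW_CORE, HW_THREAD, HW_LAST };

const int UNKNOWN_ID = -1;  // the runtime does not know the ID
const int MULTIPLE_ID = -2; // the thread spans several units of this layer

// Hypothetical per-thread record holding one ID per hardware layer.
struct ThreadInfo {
  int topology_ids[HW_LAST];
};

// Turn the ID stored for one layer into a human-readable string.
inline std::string describe(const ThreadInfo &T, HwLayer Layer) {
  int Id = T.topology_ids[Layer];
  if (Id == UNKNOWN_ID)
    return "unknown";
  if (Id == MULTIPLE_ID)
    return "multiple";
  return std::to_string(Id);
}
} // namespace sketch
```

A thread pinned to socket 1 but allowed to float across its cores would carry
an actual ID for the socket layer and `MULTIPLE_ID` for the core layer.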
This patch fixes the wrong format string used in `__kmpc_error`, which could
cause a segmentation fault at runtime.
Reviewed By: jlpeyton
Differential Revision: https://reviews.llvm.org/D141889
When building the library with icc and using it on macOS 12,
the library destruction process is skipped, which causes many OMPT tests
to fail on macOS 12. This change registers the
__kmp_internal_end_library() call for atexit() which will be a
harmless, redundant call for macOS 11 and below and the only destructor
called for macOS 12+.
Differential Revision: https://reviews.llvm.org/D139857
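The reason a redundant call is harmless is that the teardown is idempotent.
The sketch below illustrates the pattern with `std::atexit`; the names
`internalEndLibrary`, `registerShutdown`, and the two globals are hypothetical
and do not reproduce the runtime's actual teardown logic.

```cpp
#include <cassert>
#include <cstdlib>

namespace sketch {
int ShutdownRuns = 0;         // how many times real teardown work happened
bool AlreadyShutDown = false; // set once the library has been torn down

// Idempotent teardown: a second call is a harmless no-op.
void internalEndLibrary() {
  if (AlreadyShutDown)
    return;
  AlreadyShutDown = true;
  ++ShutdownRuns;
}

// Register the teardown so it runs at process exit even if the usual
// destructor path is skipped. Returns true if registration succeeded.
bool registerShutdown() { return std::atexit(internalEndLibrary) == 0; }
} // namespace sketch
```

Whichever path reaches the teardown first does the work, and any later call
returns immediately.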
The JIT is a great debugging tool since we can modify the IR manually
before launching it in an existing test case. The new flags allow us to
skip optimizations, to use the exact given IR, as well as to provide a
finished object file. The latter is useful to try out different backend
options and to have complete freedom with pass pipelines.
Documentation is included. Minimal refactoring was performed to make the
second object fit in nicely.
The JIT interface was somewhat irregular as it used multiple global
functions. It also did not cache the results of the JIT, hence multiple
GPU systems would perform the work multiple times. Finally, there might
have been races on the state if we have multi-threaded initialization of
different embedded images, or one image initialized on multiple devices.
This patch tries to rectify all of the above. The JITEngine is now a
part of the GenericPluginTy and tied to one target triple. To support
multiple "ComputeUnitKind"s (previously confusingly called Arch or
[M]CPU) and to avoid re-jitting for the same ComputeUnitKind, we keep a
map of JIT results per ComputeUnitKind. All interaction with the JIT
happens through the JITEngine directly, two functions are exposed. Both
use (shared) locks to avoid races and cache the result. All JIT-related
environment variables are now defined together.
Differential Revision: https://reviews.llvm.org/D141081
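The per-ComputeUnitKind caching can be sketched as a map guarded by a lock.
This is an illustration under simplifying assumptions: `JITCache`,
`getOrCompile`, and the string-based "compile" step are hypothetical, and a
plain `std::mutex` stands in for whatever locking the JITEngine uses.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>

namespace sketch {
class JITCache {
  std::mutex Mtx;
  std::map<std::string, std::string> Results; // ComputeUnitKind -> image
  int CompileCount = 0;

  // Placeholder for the expensive JIT step.
  std::string compile(const std::string &IR, const std::string &Kind) {
    ++CompileCount;
    return IR + "@" + Kind;
  }

public:
  // Return the cached image for Kind, compiling it only on first request.
  const std::string &getOrCompile(const std::string &IR,
                                  const std::string &Kind) {
    std::lock_guard<std::mutex> Lock(Mtx);
    auto It = Results.find(Kind);
    if (It == Results.end())
      It = Results.emplace(Kind, compile(IR, Kind)).first;
    return It->second;
  }

  int compileCount() const { return CompileCount; }
};
} // namespace sketch
```

With several devices of the same kind, only the first request pays for
compilation; the others reuse the stored result.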