This patch changes `__kmpc_free_shared` to take an additional argument
corresponding to the associated allocation's size. This makes it easier to
implement the allocator in the runtime.
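For illustration, a rough sketch of the interface change (the exact
parameter names and integer types in the device runtime may differ):
```
/* Sketch only; parameter names/types are illustrative.              */
/* Before this patch the free routine only received the pointer:     */
/*   void __kmpc_free_shared(void *Ptr);                             */
/* Now the caller also passes the allocation's size, so the runtime  */
/* allocator does not need to track per-pointer sizes itself:        */
void *__kmpc_alloc_shared(unsigned long Bytes);
void __kmpc_free_shared(void *Ptr, unsigned long Bytes);
```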
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106496
Parallel regions are outlined as functions whose captured variables are explicitly generated as distinct parameters in the function's argument list. That complicates the fork_call interface in the OpenMP runtime: (1) the fork_call is variadic since a variable number of arguments must be forwarded to the outlined function, (2) wrapping/unwrapping of arguments happens in the OpenMP runtime, which is sub-optimal, has been a source of ABI bugs, and imposes a hardcoded limit (16) on the number of arguments, and (3) forwarded arguments must be cast to pointer types, which complicates debugging. This patch avoids those issues by aggregating the captured arguments in a struct that is passed to the fork_call.
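A minimal sketch of the idea (the struct and variable names here are
illustrative, not the names the compiler emits):
```
/* Before: each captured variable was forwarded as its own variadic  */
/* argument through the fork_call to the outlined function.          */
/* After: the captures are aggregated and a single pointer is passed.*/
struct captured_args {
  int *a;
  double *b;
};

void sketch(int *a, double *b) {
  struct captured_args args = { a, b };
  /* Conceptually: __kmpc_fork_call(&loc, 1, outlined_fn, &args);    */
  /* instead of:   __kmpc_fork_call(&loc, 2, outlined_fn, a, b);     */
  (void)args;
}
```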
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D102107
In the spirit of TRegions [0], this patch provides a simpler and uniform
interface for a kernel to set up the device runtime. The OMPIRBuilder is
used so that the logic can be reused from Flang. A custom state machine
will be generated in a follow-up patch.
The "surplus" threads of the "master warp" will not exit early anymore
so we need to use non-aligned barriers. The new runtime will not have an
extra warp but also require these non-aligned barriers.
[0] https://link.springer.com/chapter/10.1007/978-3-030-28596-8_11
This was in parts extracted from D59319.
Reviewed By: ABataev, JonChesterfield
Differential Revision: https://reviews.llvm.org/D101976
Summary:
Memory globalization is required to maintain OpenMP standard semantics for data sharing between
worker and master threads. The GPU cannot share data between its threads, so it must allocate global or
shared memory to store the data in. Currently this is implemented fully in the frontend using the
`__kmpc_data_sharing_push_stack` and `__kmpc_data_sharing_pop_stack` functions to emulate standard
CPU stack sharing. The frontend scans the target region for variables that escape the region and
must be shared between the threads. Each variable then has a field created for it in a global record
type.
This patch replaces this functionality with a single allocation call, effectively mimicking an
alloca instruction for the variables that must be shared between the threads. This will be much
slower than the current solution, but it makes optimization much easier because we can analyze each
variable independently and determine whether it is captured. In the future, we can replace these
calls with an `alloca`, and small allocations can be pushed to shared memory.
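As a rough illustration (the source snippet and the lowered form are
assumptions for exposition, not code from the patch):
```
void sketch(void) {
  #pragma omp target teams
  {
    int x = 0;                /* escapes into the parallel region    */
    #pragma omp parallel
    {
      #pragma omp atomic
      x += 1;
    }
  }
  /* Conceptual lowering of 'x' with the new scheme, replacing the   */
  /* global-record field pushed via __kmpc_data_sharing_push_stack:  */
  /*   int *x = __kmpc_alloc_shared(sizeof(int));                    */
  /*   ... outlined parallel region accesses *x ...                  */
  /*   __kmpc_free_shared(x);                                        */
}
```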
Reviewed By: tianshilei1992
Differential Revision: https://reviews.llvm.org/D97680
The original change was reverted because it was discovered
that clang mishandles thunks: they receive the wrong
attributes for their this/return types - the ones for the function
they will call, not the ones they have.
While I have tried to fix this in https://reviews.llvm.org/D100388,
that patch has been up and stuck for a month now,
with little sign of progress.
So while it will be good to solve this for real,
for now we can simply avoid introducing the bug
by not annotating this/return for thunks.
This reverts commit 6270b3a1eafaba4279e021418c5a2c5a35abc002,
relanding 0aa0458f1429372038ca6a4edc7e94c96cd9a753.
This patch fixes various issues with our prior `declare target` handling
and extends it to support `omp begin declare target` as well.
This started with PR49649 in mind, trying to provide a way for users to
avoid the "ref" global use introduced for globals with internal linkage.
From there it went down the rabbit hole; e.g., all variables, even
`nohost` ones, were emitted into the device code, so it was impossible
to determine late in the game (based on the name only) whether "ref"
was needed.
To make it really useful, `begin declare target` was needed as it can
carry the `device_type`. Not emitting variables eagerly had a ripple
effect. Finally, the precedence of the (explicit) declare target list
items needed to be taken into account, which meant we could not just
look for any declare target attribute to make a decision. This caused
the handling of functions to require fixup as well.
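For illustration, the kind of construct this enables (a sketch, not
taken from the patch's tests):
```
#pragma omp begin declare target device_type(nohost)
/* Emitted only for the device; omitted from the host image. */
static int device_only = 0;
void bump(void) { device_only++; }
#pragma omp end declare target
```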
I tried to clean up things while I was at it, e.g., we should not "parse
declarations and definitions" as part of OpenMP parsing, as this will
always break at some point. Instead, we keep track of what region we are
in and act on definitions and declarations there, which is what we
already do for declare variant and other begin/end directives.
Highlights:
- new diagnostics for restrictions specified in the standard,
- delayed emission of globals not mentioned in an explicit
list of a declare target,
- omission of `nohost` globals on the host and `host` globals on the
device,
- no explicit parsing of declarations in-between `omp [begin] declare
variant` and the corresponding end anymore, regular parsing instead,
- precedence for explicit mentions in `declare target` lists over
implicit mentions in the declaration-definition-seq, and
- `omp allocate` declarations will now replace an earlier emitted
global, if necessary.
---
Notes:
The patch is larger than I hoped, but it turns out that most changes,
taken on their own, lead to "inconsistent states", which seemed less
desirable overall.
After working through this I feel the standard should remove the
explicit declare target forms as the delayed emission is horrible.
That said, while we delay things anyway, it seems to me we check too
often for the current status even though that is often not sufficient to
act upon. There seems to be a lot of duplication that can probably be
trimmed down. Eagerly emitting some things seems pretty weak as an
argument to keep so much logic around.
---
Reviewed By: ABataev
Differential Revision: https://reviews.llvm.org/D101030
This patch refactors a subset of Clang OpenMP tests, generating checklines using the update_cc_test_checks script. This refactoring facilitates updating the Clang OpenMP code generation codebase by automating test generation.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D101849
This patch renames the replace-function-regex to replace-value-regex to indicate that the existing regex replacement functionality can replace any IR value, not just functions.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D101934
This patch refactors a subset of Clang OpenMP tests, generating checklines using the update_cc_test_checks script. This refactoring facilitates updating the Clang OpenMP code generation codebase by automating test generation.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D101849
This revision simplifies Clang codegen for parallel regions in OpenMP GPU target offloading, with corresponding changes in libomptarget: SPMD and non-SPMD parallel calls are unified under a single `kmpc_parallel_51` runtime entry point for parallel regions (which will be commonized between target and host-side parallel regions), and data sharing is internalized to the runtime. Tests have been auto-generated using `update_cc_test_checks.py`. The revision also contains changes to OpenMPOpt for remark creation on target offloading regions.
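For orientation, the unified entry point has roughly the following shape
(a sketch; the authoritative signature lives in the runtime headers):
```
/* Sketch of the unified parallel entry point; parameter names and   */
/* exact types are illustrative.                                     */
void __kmpc_parallel_51(void *ident, int global_tid, int if_expr,
                        int num_threads, int proc_bind, void *fn,
                        void *wrapper_fn, void **args, long nargs);
```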
Reviewed By: jdoerfert, Meinersbur
Differential Revision: https://reviews.llvm.org/D95976
A basic block needs to be inserted during generation of the target
region to avoid a crash and so that the GPU is always able to call a
cleanup action. This cleanup action is required for the correct emission
of the target region for the GPU.
Differential Revision: https://reviews.llvm.org/D99445
There are various runtime calls in the device runtime with unused, or
always fixed, arguments. This is bad for all sorts of reasons. Clean up
two of them beforehand, as we now match them in OpenMPOpt.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D83268
Summary:
SYCL and OpenMP prohibit thread local storage in device code,
so this commit ensures that an error is emitted for device code but not
for host code when the host target supports it.
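A small sketch of the kind of code affected (illustrative only):
```
_Thread_local int Counter = 0;  /* TLS is not supported in device code */

void use_in_target(void) {
  #pragma omp target
  Counter += 1;  /* expected to be diagnosed when compiling for the    */
                 /* device, but accepted when the host supports TLS    */
}
```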
Reviewers: jdoerfert, erichkeane, bader
Reviewed By: jdoerfert, erichkeane
Subscribers: guansong, riccibruno, ABataev, yaxunl, ebevhan, Anastasia, sstefan1, cfe-commits
Tags: #clang
Differential Revision: https://reviews.llvm.org/D81641
If we're going to assume references are dereferenceable, we should also
assume they're aligned: otherwise, we can't actually dereference them.
See also D80072.
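For example, a reference parameter would now carry an alignment
assumption alongside dereferenceability (a sketch; the exact attribute
set depends on the type):
```
/* C++ source:  void f(int &r);                                       */
/* Emitted IR (sketch): the pointer argument is annotated roughly as  */
/*   define void @f(i32* nonnull align 4 dereferenceable(4) %r)       */
```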
Differential Revision: https://reviews.llvm.org/D80166
Summary:
Currently, we ignore all locality attributes/info when building for
the device, and thus all symbols are externally visible and can be
preempted at runtime. It may lead to incorrect results. We need to
follow the same logic the compiler uses for static/pie builds. But in
some cases changing the dso locality may lead to problems with codegen,
so instead mark external symbols as hidden in the device code.
Reviewers: jdoerfert
Subscribers: guansong, caomhin, kkwli0, cfe-commits
Tags: #clang
Differential Revision: https://reviews.llvm.org/D70549
When the parallel region is called directly in the sequential region,
the zeroed tid/bound id are used. But they must point to different
memory locations since the parameters are marked as noalias.
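Roughly, the fix amounts to the following (the names are illustrative):
```
/* Illustrative outlined-function prototype; the real one is generated. */
void outlined_fn(int *global_tid, int *bound_tid);

void call_directly(void) {
  /* Two distinct zeroed locations, because the parameters are noalias. */
  int zero_tid = 0, zero_bound_tid = 0;
  outlined_fn(&zero_tid, &zero_bound_tid);
}
```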
llvm-svn: 375017
Currently, we ignore all dso locality attributes/info when building for
the device, and thus all symbols are externally visible and can be
preempted at runtime. It may lead to incorrect results. We need to
follow the same logic the compiler uses for static/pie builds.
llvm-svn: 361283
Internally generated functions must be marked as always_inline in most
cases. Patch marks some extra reduction functions and outlined parallel
functions as always_inline for better performance, but only if the
optimization is requested.
llvm-svn: 361269
After the previous patch, with the more correct handling of the number
of threads in parallel regions, parallel regions with num_threads
clauses can be executed in SPMD mode.
llvm-svn: 358445
Use runtime functions instead of direct calls to the nvvm intrinsics.
This prevents some dangerous LLVM optimizations that break the code for
the NVPTX target.
llvm-svn: 350328
nvlink seems to have a bug in its support for weakly linked symbols: it
does not allow defining several shared memory buffers with different
sizes, even with weak linkage. Instead, we always use a 128-byte buffer
to prevent nvlink from emitting an error.
llvm-svn: 349540
If an array section is based on a pointer, and this section is mapped in
a target region and then used in an inner parallel region, it must also
be globalized because the pointer itself is passed by value, not by
reference.
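A sketch of the pattern this addresses (illustrative source only):
```
void sketch(int *p) {
  #pragma omp target map(p[0:10])
  {
    #pragma omp parallel
    p[0] = 1;  /* 'p' itself is passed by value, so it must be        */
               /* globalized to be shared with the parallel threads   */
  }
}
```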
llvm-svn: 348492
If the region is inside a target|teams|distribute region, we can emit
the locations with the correct info for the execution mode and runtime
mode. Patch adds this ability to the NVPTX codegen to help the optimizer
produce better code.
llvm-svn: 347583
Coalesced memory access requires use of the new function
`__kmpc_data_sharing_coalesced_push_stack` instead of
`__kmpc_data_sharing_push_stack`.
llvm-svn: 345991
If the function has globalized variables and is called in the context of
target/teams/distribute regions, it does not need to globalize 32
copies of the same variables for memory coalescing; it is enough to
have just one copy, because there is no parallel region.
Patch does this by adding a call to the `__kmpc_parallel_level` function
and checking its return value. If the code sees that the parallel level
is 0, then only one variable is allocated, not 32.
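Conceptually, the emitted check looks like this (a sketch with an
illustrative prototype, not the exact codegen):
```
/* Illustrative prototype of the runtime query; the real declaration  */
/* lives in the device runtime.                                       */
unsigned __kmpc_parallel_level(void *loc, int global_tid);

void globalize_sketch(void *loc, int gtid) {
  if (__kmpc_parallel_level(loc, gtid) == 0) {
    /* Outside any parallel region: one copy of the variable suffices. */
  } else {
    /* Inside a parallel region: keep 32 copies indexed by (tid & 31). */
  }
}
```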
llvm-svn: 344356
Added support for memory coalescing of globalized variables for better
performance. From now on, all globalized variables are represented as
arrays of 32 elements, and each thread accesses these elements using
`tid & 31` as the index.
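A minimal sketch of the layout and access pattern (names are
illustrative):
```
/* Each globalized variable becomes a 32-element array, and each      */
/* thread picks an element via its lane index, so adjacent threads    */
/* touch adjacent elements (coalesced accesses).                      */
struct globalized_record {
  int x[32];
};

int read_slot(struct globalized_record *rec, unsigned tid) {
  return rec->x[tid & 31];
}
```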
llvm-svn: 344049
Worker threads fork off to the compiler-generated worker function
directly after entering the kernel function. Hence, there is no
need to check whether the current thread is the master if we are
outside of a parallel region (neither SPMD nor parallel_level > 0).
Differential Revision: https://reviews.llvm.org/D52732
llvm-svn: 343618
If the target construct can be executed in SPMD mode and it is a loop-
based directive with static scheduling, we can use lightweight runtime
support.
llvm-svn: 340953
The first argument of the parallel outlined functions, when they are
called as serialized parallel regions, should be a pointer to the global
thread id, which is always 0.
llvm-svn: 337957
If an orphaned parallel region is found, the following code must be
emitted:
```
if (__kmpc_is_spmd_exec_mode() || __kmpc_parallel_level(loc, gtid))
  Serialized execution.
else if (IsMasterThread())
  Prepare and signal worker.
else
  Outlined function call.
```
llvm-svn: 333301
If the orphaned directive is executed in SPMD mode, we need to emit a
check for SPMD mode and run the orphaned parallel directive in
sequential mode.
llvm-svn: 332467
In generic data-sharing mode we do not need to globalize
variables/parameters of reference/pointer types. They are already placed
in global memory.
llvm-svn: 332380
Added initial codegen for level 2, 3, etc. parallelism. Currently, all
second-level, third-level, etc. parallel regions will run sequentially.
llvm-svn: 331642
Added NUW flags for all the add|mul|sub operations and replaced sdiv
with udiv, as we operate only on unsigned values (addresses converted to
integers).
llvm-svn: 329411
In the future the compiler will analyze whether the OpenMP
runtime needs to be (fully) initialized and avoid that overhead
if possible. The functions already take an argument to transfer
that information to the runtime, so pass in the default value 1.
(This is needed for binary compatibility with libomptarget-nvptx
currently being upstreamed.)
Differential Revision: https://reviews.llvm.org/D40354
llvm-svn: 318836
If a variable is marked as a TLS variable and the target device does not
support TLS, an error is emitted for the variable even if it is not
used in target regions. Patch fixes this and allows using the values of
TLS variables in target regions.
llvm-svn: 303768
This patch adds codegen for the 'target parallel' directive on the NVPTX
device. We term offload OpenMP directives such as 'target parallel' and
'target teams distribute parallel for' as SPMD constructs. SPMD constructs,
in contrast to Generic ones like the plain 'target', can never contain
a serial region.
SPMD constructs can be handled more efficiently on the GPU and do not
require the Warp Loop of the Generic codegen scheme. This patch adds
SPMD codegen support for 'target parallel' on the NVPTX device and can
be reused for other SPMD constructs.
Reviewers: ABataev
Differential Revision: https://reviews.llvm.org/D28755
llvm-svn: 292428
This patch includes updates for codegen of the target region for the NVPTX
device. It moves initializers from the compiler to the runtime and updates
the worker loop to assume parallel work is retrieved from the runtime. A
subsequent patch will update the codegen to retrieve the parallel work using
calls to the runtime. It includes the removal of the inline attribute
for the worker loop and disabling debug info in it.
This allows codegen for a target directive and serial execution on the
NVPTX device.
Reviewers: ABataev
Differential Revision: https://reviews.llvm.org/D28125
llvm-svn: 291121
This patch includes updates for codegen of the target region for the NVPTX
device. It moves initializers from the compiler to the runtime and updates
the worker loop to assume parallel work is retrieved from the runtime. A
subsequent patch will update the codegen to retrieve the parallel work using
calls to the runtime. It includes the removal of the inline attribute
for the worker loop and disabling debug info in it.
This allows codegen for a target directive and serial execution on the
NVPTX device.
Reviewers: ABataev
Differential Revision: https://reviews.llvm.org/D28125
llvm-svn: 290983
Offload-related code is not quite ready yet, but some simple examples
must not crash the compiler. Patch fixes the problem in offloading code
with exceptions.
llvm-svn: 290364
Summary: This patch changes the options used by offloading to start with -fopenmp instead of -fomp. This makes the option naming more consistent and materializes a suggestion by Richard Smith in http://reviews.llvm.org/D9888.
Reviewers: hfinkel, carlo.bertolli, arpith-jacob, ABataev
Subscribers: kkwli0, cfe-commits, caomhin
Differential Revision: http://reviews.llvm.org/D21841
llvm-svn: 274283
Summary:
This patch adds base support for codegen of the target directive on the NVPTX device.
Reviewers: ABataev
Differential Revision: http://reviews.llvm.org/D17877
Reworked test case after buildbot failure on Windows.
Updated patch to integrate r263837 and test case nvptx_target_firstprivate_codegen.cpp.
llvm-svn: 264018