This PR adds an initial implementation for the map modifiers close,
present and ompx_hold, primarily just required adding the appropriate
map type flags to the map type bits. In the case of ompx_hold it
required adding the map type to the OpenMP dialect. Close has a bit of a
problem when utilised with the ALWAYS map type on descriptors, so it is
likely we'll have to make sure close and always are not applied to the
descriptor simultaneously in the future when we apply always to the
descriptors to facilitate movement of descriptor information to device
for consistency, however, we may find an alternative to this with
further investigation. For the moment, it is a TODO/Note to keep track
of it.
The module currently stores the target triple as a string. This means
that any code that wants to actually use the triple first has to
instantiate a Triple, which is somewhat expensive. The change in #121652
caused a moderate compile-time regression due to this. While it would be
easy enough to work around, I think that architecturally, it makes more
sense to store the parsed Triple in the module, so that it can always be
directly queried.
For this change, I've opted not to add any magic conversions between
std::string and Triple for backwards-compatibilty purses, and instead
write out needed Triple()s or str()s explicitly. This is because I think
a decent number of them should be changed to work on Triple as well, to
avoid unnecessary conversions back and forth.
The only interesting part in this patch is that the default triple is
Triple("") instead of Triple() to preserve existing behavior. The former
defaults to using the ELF object format instead of unknown object
format. We should fix that as well.
The ffi_cif structure defined in the wrapper header is smaller than the
actual structure in libffi which results in other structures being
overwritten when libffi is called, and finally in a segfault.
The patch updates the structure to the correct layout as specified in
ffi.h
This patch adds OpenMPToLLVMIRTranslation support for the OpenMP Declare
Mapper directive.
Since both MLIR and Clang now support custom mappers, I've changed the
respective function params to no longer be optional as well.
Depends on #121005
Summary:
The test failed because it no longer passed Rpass by default without
LTO. I think that's desirable as it matches the standard behavior.
This reverts commit 6fd99de31864a5ef84ae8613b3a9034e05293461.
This pull request is the second part of an ongoing effort to extends PGO
instrumentation to GPU device code and depends on #76587. This PR makes
the following changes:
- Introduces `__llvm_write_custom_profile` to PGO compiler-rt library.
This is an external function that can be used to write profiles with
custom data to target-specific files.
- Adds `__llvm_write_custom_profile` as weak symbol to libomptarget so
that it can write the collected data to a profraw file.
- Adds `PGODump` debug flag and only displays dump when the
aforementioned flag is set
Summary:
If the user deallocates an RPC device this can sometimes fail if the RPC
server is still running. This will happen if the modification happens
while the server is still checking it. This patch adds a mutex to guard
modifications to it.
Summary:
We originally wanted `-stdlib=libc++` by default so that it could use
offloading support in libc++, however this causes issues with out the
Offloading proejct itself is built. Is the user builds the LLVM libs
with libstdc++ then uses this cache it will enable this option by
default for the ensuing build of the offloading libraries with the newly
build clang. This will cause a lot of linker failured because the C++
library doesn't match.
Long term I think the proper solution to this is to make better use of
clang configuration files, but I don't know a good way to do that by
default. For now just make it build right.
Summary:
This definition is more portable since it defines the correct value for
the target. I got rid of the helper mostly because I think it's easy
enough to use now that it's a type and being explicit about what's
`undef` or `poison` is good.
In #124846 the symbolizer was changed to ignore 0-column entries, which
lead to a slightly different representation in the stack traces. This
patch addresses these differences.
Not sure if the difference in kernel_trap.c is also a result of this
change or not.
Can be tracked separate from this, after the bots are back to green.
Summary:
This patch just changes the interface to make starting the thread
multiple times permissable since it will only be done the first time.
Note that this does not refcount it or anything, so it's onto the user
to make sure that they don't shut down the thread before everyone is
done using it. That is the case today because the shutDown portion is
run by a single thread in the destructor phase.
Another question is if we should make this thread truly global state,
because currently it will be private to each plugin instance, so if you
have an AMD and NVIDIA image there will be two, similarly if you have
those inside of a shared library.
Summary:
This patch unifies the existing offloading entires into a single section
called `llvm_offload_entires`. This lets us use a more unified
offloading infrastructure so that all targets share the same handling.
The effect is that people in the runtimes now need to check if the kind
is what they expect, but the expectation is that you can combine
multiple potential providers into a compile job. Doesn't fully work
yet because of other runtime issues, but some day. Mostly this helps the
future of liboffload where we want to handle different languages than
OpenMP.
Summary:
Pretty dumb mistake of me, forgot that this is run per-device and
per-plugin, which fell through the cracks with my testing because I have
two GPUs that use different plugins.
Summary:
This removes the use of OpenMP offloading to build the device runtime.
The main benefit here is that we no longer need to rely on offloading
semantics to build a device only runtime. Things like variants are now
no longer needed and can just be simple if-defs. In the future, I will
remove most of the special handling here and fold it into calls to the
`<gpuintrin.h>` functions instead. Additionally I will rework the
compilation to make this a separate runtime.
The current plan is to have this, but make including OpenMP and
offloading either automatically add it, or print a warning if it's
missing. This will allow us to use a normal CMake workflow and delete
all the weird 'lets pull the clang binary out of the build' business.
```
-DRUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES=offload
-DLLVM_RUNTIME_TARGETS=amdgcn-amd-amdhsa
```
After that, linking the OpenMP device runtime will be `-Xoffload-linker
-lomp`. I.e. no more fat binary business.
Only look at the most recent commit since this includes the two
dependencies
(fix to AMDGPUEmitPrintfBinding and the PointerToMember bug).
Use the test compiler ID to verify whether tests can be run rather than
the host compiler. This makes it possible to run tests (with Clang)
while the library itself was built with GCC.
Currently if we generate code for the below target data map that uses an
optional mapping:
!$omp target data if(present(a)) map(alloc:a)
do i = 1, 10
a(i) = i
end do
!$omp end target data
We yield an LLVM-IR error as the branch for the else path is not
generated. This occurs because we enter the NoDupPriv path of the call
back function when generating the else branch, however, the emitBranch
function needs to be set to a block for it to functionally generate and
link in a follow up branch. The NoDupPriv path currently doesn't do
this, while it's not supposed to generate anything (as far as I am
aware) we still need to at least set the builders placement back so that
it emits the appropriate follow up branch. This avoids the missing
terminator LLVM-IR verification error by correctly generating the follow
up branch.
This PR aims to fix a mapping error when trying to map nullary elements
of a record type (primary example is allocatables/pointer types in
Fortran at the moment). This should be legal to map, just not write to
without pointing to anything within the target region. A common Fortran
OpenMP idiom/example where this is useful can be found in the added
Fortran offload example.
The runtime error arises when we try to map the pointer member utilising
a prescribed constant size that we receive from the lowered type,
resulting in mapping of data that will be non-existent when there is no
allocated data. The fix in this case is to emit a runtime check to see
if the data has been allocated, if it hasn't been we select a size of 0,
if it has we emit the usual type size.
Summary:
The previous offloading entry type did not fit the current use-cases
very well. This widens it and adds a version to prevent further
annoyances. It also includes the kind to better sort who's using it.
The first 64-bytes are reserved as zero so the OpenMP runtime can detect
the old format for binary compatibilitry.
Summary:
Currently, we only use `-mlink-builtin-bitcode` for non-LTO NVIDIA
compiliations. This has the problem that it will internalize the RPC
client symbol which needs to be visible to the host. To counteract that,
I put `retain` on it, but this also prevents optimizations on the global
itself, so the passes we have that remove the symbol don't work on
OpenMP anymore. This patch does the dumbest solution, adding a special
string check for it in clang. Not the best solution, the runner up would
be to have a clang attribute for `externally_initialized` because those
can't be internalized, but that might have some unfortunate
side-effects. Alternatively we could make NVIDIA compilations do LTO all
the time, but that would affect some users and it's harder than I
thought.
Summary:
Handling the RPC server requires running through list of jobs that the
device has requested to be done. Currently this is handled by the thread
that does the waiting for the kernel to finish. However, this is not
sound on NVIDIA architectures and only works for async launches in the
OpenMP model that uses helper threads.
However, we also don't want to have this thread doing work
unnnecessarily. For this reason we track the execution of kernels and
cause the thread to sleep via a condition variable (usually backed by
some kind of futex or other intelligent sleeping mechanism) so that the
thread will be idle while no kernels are running.
Summary:
This patch is an NFC renaming to make using the offloading entry type
more portable between other targets. Right now this is just moving its
definition to LLVM so others can use it. Future work will rework the
struct layout.
Summary:
Right now we just default to device for each type, and mix an ad-hoc
scope with the one used by the compiler's builtins. Unify this can make
each version take the scope optionally.
For @ronlieb, this will remove the need for `add_system` in the fork as
well as the extra `cas` with system scope, just pass `system`.
Summary:
We used to avoid a lot of this stuff because we didn't properly handle
variadics in device code. That's been solved for now, so we can just
make an internal printf handler that forwards to the external `vprintf`
function. This is either provided by NVIDIA's SDK or by the GPU libc
implementation.
The main reason for doing this is because it prevents the stupid AMDGPU
printf pass from mangling our beautiful printfs!
Summary:
This is spelled `ompx_aligned_barrier` when used directly, but wasn't
included in the list of known assumptions. Fix that so now th test
works.
Summary:
We used this globally scoped `ext_no_call_asm` as a sort of hack around
the compiler that allowed the attributor to optimize out inline assembly
calls to PTX instructions. Quite some time ago I got rid of every inline
assembly call and replaced it with a builitin, so this can just be
deleted.
Furthermore, I use the `[[omp::assume]]` attribute directly for the
aligned barrier usage. This prints an unknown assumption warning (even
though it isn't) so I'm just silencing that for now until I fix it
later.
---------
Co-authored-by: Michael Kruse <github@meinersbur.de>
Summary:
We used to do a fetch add of zero to approximate a load. This is because
the NVPTX backend didn't handle this properly. It's not an issue anymore
so simply use the proper atomic builtin.