Summary:
When we were first porting to COV5, this lead to some ABI issues due to
a change in how we looked up the work group size. Bitcode libraries
relied on the builtins to emit code, but this was changed between
versions. This prevented the bitcode libraries, like OpenMP or libc,
from being used for both COV4 and COV5. The solution was to have this
'none' functionality which effectively emitted code that branched off of
a global to resolve to either version.
This isn't a great solution because it forced every TU to have this
variable in it. The patch in
https://github.com/llvm/llvm-project/pull/131033 removed support for
COV4 from OpenMP, which was the only consumer of this functionality.
Other users like HIP and OpenCL did not use this because they linked the
ROCm Device Library directly which has its own handling (The name was
borrowed from it after all).
So, now that we don't need to worry about backward compatibility with
COV4, we can remove this special handling. Users can still emit COV4
code, this simply removes the special handling used to make the OpenMP
device runtime bitcode version agnostic.
When building with asserts enabled, this can actually cause strange
miscompilations because an incorrect llvm.assume is generated at the
point of the assertion.
Summary:
We conditionally allocate the implicit arguments, so they possibly are
null. The flang compiler seems to hit this case, even though it
shouldn't when it's supposed to conform to the HSA code object. For now
guard this to fix the regression and cover a case in the future where
someone rolls a fully custom implementatation.
Fixes: https://github.com/llvm/llvm-project/issues/132982
This pull request is the third part of an ongoing effort to extends PGO
instrumentation to GPU device code and depends on
https://github.com/llvm/llvm-project/pull/93365. This PR makes the
following changes:
- Allows PGO flags to be supplied to GPU targets
- Pulls version global from device
- Modifies `__llvm_write_custom_profile` and `lprofWriteDataImpl` to
allow the PGO version to be overridden
Spirv doesn't have implicit conversions between address spaces (at least
at present, we might need to change that) and address space qualified
*this pointers are not handled well by clang. This commit changes the
single instance of the smartstack to be explicitly a singleton, for
fractionally simpler IR generation (no this pointer) and to sidestep the
work in progress spirv64-- openmp target not being able to compile the
original version.
Summary:
This patch moves the RPC server handling to be a header only utility
stored in the `shared/` directory. This is intended to be shared within
LLVM for the loaders and `offload/` handling.
Generally, this makes it easier to share code without weird
cross-project binaries being plucked out of the build system. It also
allows us to soon move the loader interface out of the `libc` project so
that we don't need to bootstrap those and can build them in LLVM.
Enable the LLVM_ENABLE_RUNTIMES=flang-rt build of the Fortran runtime
for the amdgpu-offload-* buildbots. This pre-population cmake cache
files is referred to by the llvm-zorg annotated builder factory
[script](872f477610/zorg/buildbot/builders/annotated/amdgpu-offload-cmake.py (L26)).
The corresponding change in llvm-zorg is
llvm/llvm-zorg#402
This reverts commit e296fb8ff6255b97db9ff6cd941acc730164b38f.
The worker of amdgpu-offload-rhel-8-cmake-build-only has been updated
with a newer version of Ninja that supports Fortran.
Improve the check for whether a type can be passed by copy. Currently,
passing by copy is done via the OMP_MAP_LITERAL mapping, which can only
transfer as much data as can be contained in a pointer representation.
The HAS_DEVICE_ADDR indicates that the object(s) listed exists at an
address that is a valid device address. Specifically,
`has_device_addr(x)` means that (in C/C++ terms) `&x` is a device
address.
When entering a target region, `x` does not need to be allocated on the
device, or have its contents copied over (in the absence of additional
mapping clauses). Passing its address verbatim to the region for use is
sufficient, and is the intended goal of the clause.
Some Fortran objects use descriptors in their in-memory representation.
If `x` had a descriptor, both the descriptor and the contents of `x`
would be located in the device memory. However, the descriptors are
managed by the compiler, and can be regenerated at various points as
needed. The address of the effective descriptor may change, hence it's
not safe to pass the address of the descriptor to the target region.
Instead, the descriptor itself is always copied, but for objects like
`x`, no further mapping takes place (as this keeps the storage pointer
in the descriptor unchanged).
---------
Co-authored-by: Sergio Afonso <safonsof@amd.com>
This PR adds an initial implementation for the map modifiers close,
present and ompx_hold, primarily just required adding the appropriate
map type flags to the map type bits. In the case of ompx_hold it
required adding the map type to the OpenMP dialect. Close has a bit of a
problem when utilised with the ALWAYS map type on descriptors, so it is
likely we'll have to make sure close and always are not applied to the
descriptor simultaneously in the future when we apply always to the
descriptors to facilitate movement of descriptor information to device
for consistency, however, we may find an alternative to this with
further investigation. For the moment, it is a TODO/Note to keep track
of it.
The module currently stores the target triple as a string. This means
that any code that wants to actually use the triple first has to
instantiate a Triple, which is somewhat expensive. The change in #121652
caused a moderate compile-time regression due to this. While it would be
easy enough to work around, I think that architecturally, it makes more
sense to store the parsed Triple in the module, so that it can always be
directly queried.
For this change, I've opted not to add any magic conversions between
std::string and Triple for backwards-compatibilty purses, and instead
write out needed Triple()s or str()s explicitly. This is because I think
a decent number of them should be changed to work on Triple as well, to
avoid unnecessary conversions back and forth.
The only interesting part in this patch is that the default triple is
Triple("") instead of Triple() to preserve existing behavior. The former
defaults to using the ELF object format instead of unknown object
format. We should fix that as well.
The ffi_cif structure defined in the wrapper header is smaller than the
actual structure in libffi which results in other structures being
overwritten when libffi is called, and finally in a segfault.
The patch updates the structure to the correct layout as specified in
ffi.h
This patch adds OpenMPToLLVMIRTranslation support for the OpenMP Declare
Mapper directive.
Since both MLIR and Clang now support custom mappers, I've changed the
respective function params to no longer be optional as well.
Depends on #121005
Summary:
The test failed because it no longer passed Rpass by default without
LTO. I think that's desirable as it matches the standard behavior.
This reverts commit 6fd99de31864a5ef84ae8613b3a9034e05293461.
This pull request is the second part of an ongoing effort to extends PGO
instrumentation to GPU device code and depends on #76587. This PR makes
the following changes:
- Introduces `__llvm_write_custom_profile` to PGO compiler-rt library.
This is an external function that can be used to write profiles with
custom data to target-specific files.
- Adds `__llvm_write_custom_profile` as weak symbol to libomptarget so
that it can write the collected data to a profraw file.
- Adds `PGODump` debug flag and only displays dump when the
aforementioned flag is set
Summary:
If the user deallocates an RPC device this can sometimes fail if the RPC
server is still running. This will happen if the modification happens
while the server is still checking it. This patch adds a mutex to guard
modifications to it.
Summary:
We originally wanted `-stdlib=libc++` by default so that it could use
offloading support in libc++, however this causes issues with out the
Offloading proejct itself is built. Is the user builds the LLVM libs
with libstdc++ then uses this cache it will enable this option by
default for the ensuing build of the offloading libraries with the newly
build clang. This will cause a lot of linker failured because the C++
library doesn't match.
Long term I think the proper solution to this is to make better use of
clang configuration files, but I don't know a good way to do that by
default. For now just make it build right.
Summary:
This definition is more portable since it defines the correct value for
the target. I got rid of the helper mostly because I think it's easy
enough to use now that it's a type and being explicit about what's
`undef` or `poison` is good.
In #124846 the symbolizer was changed to ignore 0-column entries, which
lead to a slightly different representation in the stack traces. This
patch addresses these differences.
Not sure if the difference in kernel_trap.c is also a result of this
change or not.
Can be tracked separate from this, after the bots are back to green.
Summary:
This patch just changes the interface to make starting the thread
multiple times permissable since it will only be done the first time.
Note that this does not refcount it or anything, so it's onto the user
to make sure that they don't shut down the thread before everyone is
done using it. That is the case today because the shutDown portion is
run by a single thread in the destructor phase.
Another question is if we should make this thread truly global state,
because currently it will be private to each plugin instance, so if you
have an AMD and NVIDIA image there will be two, similarly if you have
those inside of a shared library.
Summary:
This patch unifies the existing offloading entires into a single section
called `llvm_offload_entires`. This lets us use a more unified
offloading infrastructure so that all targets share the same handling.
The effect is that people in the runtimes now need to check if the kind
is what they expect, but the expectation is that you can combine
multiple potential providers into a compile job. Doesn't fully work
yet because of other runtime issues, but some day. Mostly this helps the
future of liboffload where we want to handle different languages than
OpenMP.
Summary:
Pretty dumb mistake of me, forgot that this is run per-device and
per-plugin, which fell through the cracks with my testing because I have
two GPUs that use different plugins.
Summary:
This removes the use of OpenMP offloading to build the device runtime.
The main benefit here is that we no longer need to rely on offloading
semantics to build a device only runtime. Things like variants are now
no longer needed and can just be simple if-defs. In the future, I will
remove most of the special handling here and fold it into calls to the
`<gpuintrin.h>` functions instead. Additionally I will rework the
compilation to make this a separate runtime.
The current plan is to have this, but make including OpenMP and
offloading either automatically add it, or print a warning if it's
missing. This will allow us to use a normal CMake workflow and delete
all the weird 'lets pull the clang binary out of the build' business.
```
-DRUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES=offload
-DLLVM_RUNTIME_TARGETS=amdgcn-amd-amdhsa
```
After that, linking the OpenMP device runtime will be `-Xoffload-linker
-lomp`. I.e. no more fat binary business.
Only look at the most recent commit since this includes the two
dependencies
(fix to AMDGPUEmitPrintfBinding and the PointerToMember bug).
Use the test compiler ID to verify whether tests can be run rather than
the host compiler. This makes it possible to run tests (with Clang)
while the library itself was built with GCC.
Currently if we generate code for the below target data map that uses an
optional mapping:
!$omp target data if(present(a)) map(alloc:a)
do i = 1, 10
a(i) = i
end do
!$omp end target data
We yield an LLVM-IR error as the branch for the else path is not
generated. This occurs because we enter the NoDupPriv path of the call
back function when generating the else branch, however, the emitBranch
function needs to be set to a block for it to functionally generate and
link in a follow up branch. The NoDupPriv path currently doesn't do
this, while it's not supposed to generate anything (as far as I am
aware) we still need to at least set the builders placement back so that
it emits the appropriate follow up branch. This avoids the missing
terminator LLVM-IR verification error by correctly generating the follow
up branch.
This PR aims to fix a mapping error when trying to map nullary elements
of a record type (primary example is allocatables/pointer types in
Fortran at the moment). This should be legal to map, just not write to
without pointing to anything within the target region. A common Fortran
OpenMP idiom/example where this is useful can be found in the added
Fortran offload example.
The runtime error arises when we try to map the pointer member utilising
a prescribed constant size that we receive from the lowered type,
resulting in mapping of data that will be non-existent when there is no
allocated data. The fix in this case is to emit a runtime check to see
if the data has been allocated, if it hasn't been we select a size of 0,
if it has we emit the usual type size.
Summary:
The previous offloading entry type did not fit the current use-cases
very well. This widens it and adds a version to prevent further
annoyances. It also includes the kind to better sort who's using it.
The first 64-bytes are reserved as zero so the OpenMP runtime can detect
the old format for binary compatibilitry.
Summary:
Currently, we only use `-mlink-builtin-bitcode` for non-LTO NVIDIA
compiliations. This has the problem that it will internalize the RPC
client symbol which needs to be visible to the host. To counteract that,
I put `retain` on it, but this also prevents optimizations on the global
itself, so the passes we have that remove the symbol don't work on
OpenMP anymore. This patch does the dumbest solution, adding a special
string check for it in clang. Not the best solution, the runner up would
be to have a clang attribute for `externally_initialized` because those
can't be internalized, but that might have some unfortunate
side-effects. Alternatively we could make NVIDIA compilations do LTO all
the time, but that would affect some users and it's harder than I
thought.