Summary:
One of the downsides of the linker wrapper is that it made debugging
more difficult. It is very powerful in that it can resolve a lot of
input matching and library handling that could not be done before.
However, the old method allowed users to simply copy-paste the script
files to modify the output and test it.
This patch attempts to make it easier to debug changes by letting the
user override all the linker inputs. That is, we provide a user-created
binary that is treated like the final output of the device link step.
The intended use-case is for using `-save-temps` to get some IR, then
modifying the IR and sticking it back in to see if it exhibits the old
failures.
The distinction between the hip and hipv4 offload kinds is historically
based. Originally, these designations might have indicated different
versions of the code object ABI (Application Binary Interface). However,
as the system has evolved, the ABI version is now embedded directly
within the code object itself, making these historical distinctions
irrelevant during the unbundling process. Consequently, hip and hipv4
are treated as compatible in current implementations, facilitating
interchangeable handling of code objects without differentiation based
on offload kind. This change streamlines code management within the
ecosystem.
Summary:
The linker wrapper attempts to maintain consistent semantics with
existing host invocations. Static libraries by default only extract if
there are non-weak symbols that remain undefined. However, we have
situations between linkers that put different meanings on ordering. The
ld.bfd linker requires static libraries to be defined after the symbols,
while `ld.lld` relaxes this rule. The linker wrapper went with the
former as it's the easier solution, however this has caused a lot of
issues as I've had to explain this rule to several people, it also make
it difficult to include things like `libc` in the OpenMP runtime because
it would sometimes be linked before or after.
This patch reworks the logic to more or less perform the following logic
for static libraries.
1. Split library / object inputs.
2. Include every object input and record its undefined symbols
3. Repeatedly try to extract static libraries to resolve these
symbols. If a file is extracted we need to check every library
again to resolve any new undefined symbols.
This allows the following to work and will cause fewer issues when
replacing HIP, which does `--whole-archive` so it's very likely the old
logic will regress.
```console
$ clang -lfoo main.c -fopenmp --offload-arch=native
```
Summary:
The device linking phase only wants to create the necessary commands to
emit the device binary. There were issues where the user's default
config file was being used and passing incompatible arguments to the
device compilation step. Simply disable this since we do not want any
additional arguments to these clang invocations.
The plugin was not getting built as the build_generic_elf64 macro
assumes the LLVM triple processor name matches the CMake processor name,
which is unfortunately not the case for SystemZ.
Fix this by providing two separate arguments instead.
Actually building the plugin exposed a number of other issues causing
various test failures. Specifically, I've had to add the SystemZ target
to
- CompilerInvocation::ParseLangArgs
- linkDevice in ClangLinuxWrapper.cpp
- OMPContext::OMPContext (to set the device_kind_cpu trait)
- LIBOMPTARGET_ALL_TARGETS in libomptarget/CMakeLists.txt
- a check_plugin_target call in libomptarget/src/CMakeLists.txt
Finally, I've had to set a number of test cases to UNSUPPORTED on
s390x-ibm-linux-gnu; all these tests were already marked as UNSUPPORTED
for x86_64-pc-linux-gnu and aarch64-unknown-linux-gnu and are failing on
s390x for what seem to be the same reason.
In addition, this also requires support for BE ELF files in
plugins-nextgen: https://github.com/llvm/llvm-project/pull/85246
Added --offload-compression-level= option to clang and
-compression-level=
option to clang-offload-bundler for controlling compression level.
Added support of long distance matching (LDM) for llvm::zstd which is
off
by default. Enable it for clang-offload-bundler by default since it
improves compression rate in general.
Change default compression level to 3 for zstd for clang-offload-bundler
since it works well for bundle entry size from 1KB to 32MB, which should
cover most of the clang-offload-bundler usage. Users can still specify
compression level by -compression-level= option if necessary.
Summary:
The clang-offload-bundler uses an empty file to control the bundles made
for embedding. Previously this still used `/dev/null` by mistake even on
Windows.
Summary:
The very first version of the `clang-linker-wrapper` used `--` as a
separator for the host and device arguments. I moved away from this
towards a commandline parsing implementation years ago but never got
around to officially removing this.
The plugin was not getting built as the build_generic_elf64 macro
assumes the LLVM triple processor name matches the CMake processor name,
which is unfortunately not the case for SystemZ.
Fix this by providing two separate arguments instead.
Actually building the plugin exposed a number of other issues causing
various test failures. Specifically, I've had to add the SystemZ target
to
- CompilerInvocation::ParseLangArgs
- linkDevice in ClangLinuxWrapper.cpp
- OMPContext::OMPContext (to set the device_kind_cpu trait)
- LIBOMPTARGET_ALL_TARGETS in libomptarget/CMakeLists.txt
- a check_plugin_target call in libomptarget/src/CMakeLists.txt
Finally, I've had to set a number of test cases to UNSUPPORTED on
s390x-ibm-linux-gnu; all these tests were already marked as UNSUPPORTED
for x86_64-pc-linux-gnu and aarch64-unknown-linux-gnu and are failing on
s390x for what seem to be the same reason.
In addition, this also requires support for BE ELF files in
plugins-nextgen: https://github.com/llvm/llvm-project/pull/83976
Summary:
We support standalone compilation for the NVPTX architecture using
'nvlink' as our linker. Because of the special handling required to
transform input files to cubins, as nvlink expects for some reason, we
didn't use the standard AddLinkerInput method. However, this also meant
that we weren't forwarding options passed with -Wl to the linker. Add
this support in for the standalone toolchain path.
Revived from https://reviews.llvm.org/D149978
Summary:
The standard GPU compilation process embeds each intermediate object
file into the host file at the `.llvm.offloading` section so it can be
linked later. We also use a special section called something like
`omp_offloading_entries` to store all the globals that need to be
registered by the runtime. The linker-wrapper's job is to link the
embedded device code stored at this section and then emit code to
register the linked image and the kernels and globals in the offloading
entry section.
One downside to RDC linking is that it can become quite big for very
large projects that wish to make use of static linking. This patch
changes the support for relocatable linking via `-r` to support a kind
of "partial" RDC compilation for offloading languages.
This primarily requires manually editing the embedded data in the
output object file for the relocatable link. We need to rename the
output section to make it distinct from the input sections that will be
merged. We then delete the old embedded object code so it won't be
linked further. We then need to rename the old offloading section so
that it is private to the module. A runtime solution could also be done
to defer entries that don't belong to the given GPU executable, but this
is easier. Note that this does not work with COFF linking, only the ELF
method for handling offloading entries, that could be made to work
similarly.
Given this support, the following compilation path should produce two
distinct images for OpenMP offloading.
```
$ clang foo.c -fopenmp --offload-arch=native -c
$ clang foo.c -lomptarget.devicertl --offload-link -r -o merged.o
$ clang main.c merged.o -fopenmp --offload-arch=native
$ ./a.out
```
Or similarly for HIP to effectively perform non-RDC mode compilation for
a subset of files.
```
$ clang -x hip foo.c --offload-arch=native --offload-new-driver -fgpu-rdc -c
$ clang -x hip foo.c -lomptarget.devicertl --offload-link -r -o merged.o
$ clang -x hip main.c merged.o --offload-arch=native --offload-new-driver -fgpu-rdc
$ ./a.out
```
One question is whether or not this should be the default behavior of
`-r` when run through the linker-wrapper or a special option. Standard
`-r` behavior is still possible if used without invoking the
linker-wrapper and it guaranteed to be correct.
Summary:
A relocatable link through `clang -r` can go through the
clang-linker-wrapper if offloading is enabled. This will have the effect
of linking the device code and creating the wrapper module. It will then
be merged into the final file. This is useful behavior on its own, but
is likely not what is expected for a `-r` job.
This patch makes the linker wrapper ignore the device code when doing a
reloctable link. This has the effect of the linker merging the
`.llvm.offloading` sections in the output object. These will then be
parsed as normal when the executable is finally created.
Even though this doesn't actually perform a reloctable link on the
device code itself, it has a similar effect of combining multiple files
into a single one.
Summary:
The linker wrapper's job is to sort various embedded inputs into a list
of files that participate in a single link job. So far, this has been
completely 1-to-1, that is, each input file participates in exactly one
link job. However, support for AMD's target-id requires that one input
file may participate in multiple link jobs. For example, if given a
`gfx90a` static library and a `gfx90a:xnack+` object file input, we
should link the gfx90a` target into the `gfx90a:xnack+` job. These are
considered separate CPUs that can be mutually linked more or less.
This patch adds the necessary logic to make this happen. It primarily
reworks the logic to copy relevant input files into a separate list. So,
it moves construction of the final list of link jobs into the extraction
phase. We also need to copy the files in the case that it is needed more
than once, as the entire workflow expects ownership of said file.
This patch moves `clang/tools/clang-linker-wrapper/OffloadWrapper.*` to
`llvm/Frontend/Offloading` allowing them to be re-utilized by other
projects.
Additionally, it makes minor modifications to the API to make it more
flexible.
Concretely:
- The `wrap*` methods now have additional arguments `EntryArray`,
`Suffix` and `EmitSurfacesAndTextures` to specify some additional options.
- The `EntryArray` is now constructed by the caller. This change is needed to
enable JIT compilation, as ORC doesn't fully support `__start_` and `__stop_`
symbols. Thus, to JIT the code, the `EntryArray` has to be constructed explicitly in the IR.
- The `Suffix` field is used when emitting the descriptor, registration
methods, etc, to make them more readable. It is empty by default.
- The `EmitSurfacesAndTextures` field controls whether to emit surface
and texture registration code, as those functions were removed from `CUDART`
in CUDA 12. It is true by default.
- The function `getOffloadingEntryInitializer` was added to help create
the `EntryArray`, as it returns the constant initializer and not a global
variable.
Summary:
We use the OffloadBinary to contain bundled offloading objects used to
support many images / targets at the same time. The `__tgt_device_info`
struct used to contain a pointer to this underlying binary format, which
contains information about the triple and architecture. We used to parse
this in the runtime to do image verification.
Recent changes removed the need for this to be used internally, as we
just parse it out of the ELF directly. This patch sets the pointers up
so they point to the ELF without requiring any further parsing.
Summary:
The CPU target currently inherits all the libraries from the normal link
job to ensure that it has access to the same envrionment that the host
does. However, this previously was not respecting argument libraries
that are passed by name rather than `-l` as well as the whole archive
flags. This patch fixes this to allow the CPU linker to correctly pick
up the libraries associated with things like address sanitizers.
Fixes: https://github.com/llvm/llvm-project/issues/75651
This reverts commit 47d9fbc04b91fb03b6da294e82c2fb4bca6b6343.
It creates a segmentation falt on SPEChpc soma on Frontier on a GPU for
the kernel generate_new_beads
Co-authored-by: fel-cab <fel-cab@github.com>
This patch replaces uses of StringRef::{starts,ends}with with
StringRef::{starts,ends}_with for consistency with
std::{string,string_view}::{starts,ends}_with in C++20.
I'm planning to deprecate and eventually remove
StringRef::{starts,ends}with.
Summary:
We provide `-Xoffload-linker` to pass arguments directly to the link
step. Currently this uses `-Wl,` implicitly which prevents us from using
clang options that we otherwise could make use of. This patch removes
that implicit behavior as users can just as easiliy pass
`-Xoffload-linker -Wl,-foo` if needed.
Summary:
This patch adds support for registering texture / surface variables from
CUDA / HIP. Additionally, we now properly track the `extern` and `const`
flags that are also used in these runtime functions.
This does not implement the `managed` variables yet as those seem to
require some extra handling I'm not familiar with. The issue is that the
current offload entry isn't large enough to carry size and alignment
information along with an extra global.
Multiple calls to `PointerType::getUnqual(C)`, and calls to
`Type::getPointerTo(AddrSpace=0)` on them all result in the same type.
Clean them up to re-use the same `PtrTy` variable within function
`createRegisterGlobalsFunction()`.
Summary:
The linker wrapper is a utility used to create offloading programs from
single-source offloading languages such as OpenMP or CUDA. This is done
by embedding device code into the host object, then feeding it into the
linker wrapper which extracts the accelerator object files, links them,
then wraps them in registration code for the target runtime. This
previously has only worked in Linux / ELF platforms.
This patch attempts to hand Windows / COFF inputs by also accepting COFF
forms of certain linker arguments we use internally. The important
arguments are library search paths, so we can identify libraries which
may contain device code, libraries themselves, and the output name used
for intermediate output.
I am not intimately familiar with the semantics here for the semantics
in how a `lib` file is earched. I am simply treating `foo.lib` as the
GNU equivalent `-l:foo.lib` in the search logic. Similarly, I am
assuming that static libraries will be llvm-ar style libraries. I will
need to investigate the actual deficiencies later, but this should be a
good starting point along with
https://github.com/llvm/llvm-project/pull/72697
Summary:
This patch is a simple refactoring of code out of the linker wrapper
into a common location. The main motivation behind this change is to
make it easier to change the handling in the future to accept a triple
to be used to emit entries that function on that target.
Summary:
This accidentally was a dangling reference and caused issues when
actually used. Make sure that the memory is saved before the job is
created.
Summary:
Weak symbols are supposed to have the semantics that they can be
overriden by a strong (i.e. global) definition. This wasn't being
respected by the LTO pass because we simply used the first definition
that was available. This patch fixes that logic by doing a first pass
over the symbols to check for strong resolutions that could override a
weak one.
A lot of fake linker logic is ending up in the linker wrapper. If there
were an option to handle this in `lld` it would be a lot cleaner, but
unfortunately supporting NVPTX is a big restriction as their binaries
require the `nvlink` tool.
Summary:
We use these image wrappers to do runtime specifica registration of
variables and to load the device image that was compiled. This was
intended to support multiple of these running at the same time, e.g. you
can have a CUDA instance running with OpenMP and they should both
function so long as you do not share state between the two. However,
because we did not use a unique name for this file it would cause
conflicts when included. This patch names the image based off of the
language runtime it's using so that they remain separate.
Fixes: https://github.com/llvm/llvm-project/issues/67583
This will make it easy for callers to see issues with and fix up calls
to createTargetMachine after a future change to the params of
TargetMachine.
This matches other nearby enums.
For downstream users, this should be a fairly straightforward
replacement,
e.g. s/CodeGenOpt::Aggressive/CodeGenOptLevel::Aggressive
or s/CGFT_/CodeGenFileType::
SmallString<0> is more flexible and avoids an unneeded copy in
ObjectYAML/OffloadEmitter.cpp.
Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D159335
Update DeviceRTL and the AMDGPU plugin to support code
object version 5. Default is code object version 4.
CodeGen for __builtin_amdgpu_workgroup_size generates code
for cov4 as well as cov5 if -mcode-object-version=none
is specified. DeviceRTL compilation passes this argument
via Xclang option to generate abi-agnostic code.
Generated code for the above builtin uses a clang
control constant "llvm.amdgcn.abi.version" to branch on
the abi version, which is available during linking of
user's OpenMP code. Load of this constant gets eliminated
during linking.
AMDGPU plugin queries the ELF for code object version
and then prepares various implicitargs accordingly.
Differential Revision: https://reviews.llvm.org/D139730
Reviewed By: jhuber6, yaxunl
All command-line tools using `llvm::opt` create an enum of option IDs and a table of `OptTable::Info` object. Most of the tools use the same ID (`OPT_##ID`), kind (`Option::KIND##Class`), group ID (`OPT_##GROUP`) and alias ID (`OPT_##ALIAS`). This patch extracts that common code into canonical macros. This results in fewer changes when tweaking the `OPTION` macros emitted by the TableGen backend.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D157028
Since we no longer support typed LLVM IR pointer types, the code can
be simplified into for example using PointerType::get directly instead
of using Type::getInt8PtrTy and Type::getInt32PtrTy etc.
Differential Revision: https://reviews.llvm.org/D156733
The GPU vendors currently provide bitcode files for their device
runtime. These files need to be handled specially as they are not built
to be linked in with a standard `llvm-link` call or through LTO linking.
This patch adds an alternative to use the existing clang handling of
these libraries that does the necessary magic to make this work.
We do this by causing the LTO backend to emit bitcode before running the
backend. We then pass this through to clang which uses the existing
support which has been fixed to support this by D152391. The backend
will then be run with the merged module.
This patch adds the `--builtin-bitcode=<triple>=file.bc` to specify a single
file, or just `--clang-backend` to let the toolchain handle its defaults
(currently nothing for NVPTX and the ROCm device libs for AMDGPU). This may have
a performance impact due to running the optimizations again, we could
potentially disable optimizations in LTO and only do the linking if this is an
issue.
This should allow us to resolve issues when relying on the `linker-wrapper` to
do a late linking that may depend on vendor libraries.
Depends on D152391
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D152442
The linker wrapper performs its own very basic symbol resolution for the
purpose of supporting standard static library semantics. We do this here
because the Nvidia `nvlink` wrapper does not support static linking and
we have some offloading specific extensions.
Currently, we always place symbols in the "table" even if they aren't
extracted. This caused the logic to fail when many files were used that
referenced the same undefined variable. This patch changes the pass to
only add the symbols to the global "table" if the file is actually
extracted.
Reviewed By: tra
Differential Revision: https://reviews.llvm.org/D151839
The linker wrapper needs to reinvent its own special static library
handling for static libraries containing fatbinaries. This is primarily
because offloading languages expect certain global symbols to be visible
to the host so we must consider them used symbols. However we should be
able to remove this requirement if we are linking in "freestanding" code
that was not created by an offloading language.
The motivation for this is to support the work-in-progress `libc` for
GPUs. It is provided as a static library with no offloading language
set. This logic will let us only import used `libc` symbols always.
Reviewed By: yaxunl
Differential Revision: https://reviews.llvm.org/D146326
None of these tools need to depend on clang to build, instead the test
target should depend on them.
This prevents rebuilding/linking these tools when building the `clang`
target directly.
The forwarding header is left in place because of its use in
`polly/lib/External/isl/interface/extract_interface.cc`, but I have
added a GCC warning about the fact it is deprecated, because it is used
in `isl` from where it is included by Polly.