This does a global rename from `flang-new` to `flang`. I also
removed/changed any TODOs that I found related to making this change.
---------
Co-authored-by: H. Vetinari <h.vetinari@gmx.com>
Co-authored-by: Andrzej Warzynski <andrzej.warzynski@arm.com>
We recently added versioning support to Flang's OpenMP, which restricts
and enables certain things based on the OpenMP specification version.
Currently one of the check-offload tests makes use of a feature that's
at a slightly higher version than the current default causing it to
fail.
This PR basically applies the highest current OpenMP version number as a
default argument for the lit.cfg, if we need more fine grained control
in the future we can expand it to different lit commands for each
relevant version than can then be added in each test. But for now, to
keep it simple, just set the max level version.
Summary:
We can include `stdint.h` just fine as long as we don't allow it to find
system headers, passing `-nostdlibinc` and `-nogpuinc` suppresses these
extra paths so we will just use the clang resource headers for
`stdint.h` and `stddef.h`.
We had three `utils::` namespaces, all with different "meaning" (host,
device, hsa_utils). We should, when we can, keep "include/Shared"
accessible from host and device, thus RefCountTy has been moved to a
separate header. `hsa_utils` was introduced to make `utils::` less
overloaded. And common functionality was de-duplicated, e.g.,
`utils::advance` and `utils::advanceVoidPtr` -> `utils:advancePtr`. Type
punning now checks for the size of the result to make sure it matches
the source type.
No functional change was intended.
This patch fixes the mapping and lowering of arrays with dynamic extents
and adds a new test for the same. The fix discards the incomplete the
dynamic extent information and replacing it with just the base type.
When lowering to llvm later, the bounds information is used instead.
It appears that the RUNTIMES build prefers the x86-64-unknown-linux-gnu
triple notation for the host. This fixes runtime / test breakages when
compiler-rt is used as the CLANG_DEFAULT_RTLIB.
In #92581 the `LibomptargetUitls.cmake` helpers have been removed, but
only uses of `libomptarget_say` were migrated. Migrate the remaining few
warning and error messages so the `check-offload` target would not fail
due to missing `libomptarget_warning_say`.
While at it, update the `check-offload` unavailability message to say
`check-offload` instead of `check-libomptarget`.
Fixes#92581
This PR aims to unify the map argument generation behavior across both
the implicit capture (captured in a target region) and the explicit
capture (process map), currently the varPtr field of the MapInfo for the
same variable will be different depending on how it's captured. This PR
tries to align that across the generations of MapInfoOp in the OpenMP
lowering.
Currently, I have opted to utilise the rawInput (input memref to a HLFIR
DeclareInfoOp) as opposed to the addr field which includes more
information. The side affect of this is that we have to deal with
BoxTypes less often, which will result in simpler maps in these cases.
The negative side affect of this is that we don't have access to the
bounds information through the resulting value, however, I believe the
bounds information we require in our case is still appropriately stored
in the map bounds, and this seems to be the case from testing so far.
The other fix is for cases where we end up with a BoxType argument into
a function (certain assumed shape and sizes cases do this) that has no
fir.ref wrapping it. As we need the Box to be a reference type to
actually utilise the operation to access the base address stored inside
and create the correct mappings we currently generate an intermediate
allocation in these cases, and then store into it, and utilise this as
the map argument, as opposed to the original.
However, as we were not sharing the same intermediate allocation across
all of the maps for a variable, this resulted in errors in certain cases
when detatching/attatching the data e.g. via enter and exit. This PR
adjusts this for cases
Currently we only maintain tracking of all intermediate allocations for
the current function scope, as opposed to module. Primarily as the only
case I am aware of that this is required is in cases where we pass
certain types of arguments to functions (so I opted to minimize the
overhead of the pass for now). It could likely be extended to module
scope if required if we find other cases where it's applicable and
causing issues.
This pull request is a revised version of #76587. This pull request
fixes some build issues that were present in the previous version of
this change.
> This pull request is the first part of an ongoing effort to extends
PGO instrumentation to GPU device code. This PR makes the following
changes:
>
> - Adds blank registration functions to device RTL
> - Gives PGO globals protected visibility when targeting a supported
GPU
> - Handles any addrspace casts for PGO calls
> - Implements PGO global extraction in GPU plugins (currently only
dumps info)
>
> These changes can be tested by supplying `-fprofile-instrument=clang`
while targeting a GPU.
Since we can already track allocations, we can diagnose memory faults to
some degree. If the fault happens in a prior allocation (use after free)
or "close but outside" one, we can provide that information to the user.
Note that the fault address might be page aligned, and not all accesses
trigger a fault, especially for allocations that are backed by a
MemoryManager. Still, if people disable the MemoryManager or the
allocation is big enough, we can sometimes provide valueable feedback.
Summary:
Currently, we assign this to private memory. This causes failures on
some SOLLVE tests. The standard isn't clear on the semantics of this
allocation type, but there seems to be a consensus that it's supposed to
be shared memory.
This patch moves utilities from
`offload/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h` to
`llvm/Frontend/Offloading/Utility.h` to be reused by
other projects.
Concretely the following changes were made:
- Rename `KernelMetaDataTy` to `AMDGPUKernelMetaData`.
- Remove unused fields `KernelObject`, `KernelSegmentSize`,
`ExplicitArgumentCount` and `ImplicitArgumentCount` from
`AMDGPUKernelMetaData`.
- Return the produced error if `ELFObj.sections()` failed instead of
using `cantFail`.
- Added `AGPRCount` field to `AMDGPUKernelMetaData`.
- Added a default invalid value to all the fields in
`AMDGPUKernelMetaData`.
Error: CommandLine Error: Option 'attributor-manifest-internal'
registered more than once
During the standalone debug build of offload the above error is seen at
app runtime when using a prebuilt llvm with LLVM_LINK_LLVM_DYLIB=ON.
This is caused by linking both libLLVM.so and various archives that are
found via llvm_map_components_to_libnames for jit support.
Summary:
The 'omp_alloc' function should be callable from a target region. This
patch implemets it by simply calling `malloc` for every non-default
trait value allocator. All the special access modifiers are
unimplemented and return null. The null allocator returns null as the
spec states it should not be usable from the target.
When we use the device, e.g., with an API that interacts with it, we
need to ensure the image is loaded and the constructors are executed.
Two tests are included to verify we 1) load images and run constructors
when needed, and 2) we do so lazily only if the device is actually used.
---------
Co-authored-by: Joseph Huber <huberjn@outlook.com>
Summary;
Now that we use the linker to do LTO / device linking, we need to inform
the `clang` invocation to use `-flto` so it forwards arguments like
`-On` correctly.
Through the new `-foffload-via-llvm` flag, CUDA kernels can now be
lowered to the LLVM/Offload API. On the Clang side, this is simply done
by using the OpenMP offload toolchain and emitting calls to `llvm*`
functions to orchestrate the kernel launch rather than `cuda*`
functions. These `llvm*` functions are implemented on top of the
existing LLVM/Offload API.
As we are about to redefine the Offload API, this wil help us in the
design process as a second offload language.
We do not support any CUDA APIs yet, however, we could:
https://www.osti.gov/servlets/purl/1892137
For proper host execution we need to resurrect/rebase
https://tianshilei.me/wp-content/uploads/2021/12/llpp-2021.pdf
(which was designed for debugging).
```
❯❯❯ cat test.cu
extern "C" {
void *llvm_omp_target_alloc_shared(size_t Size, int DeviceNum);
void llvm_omp_target_free_shared(void *DevicePtr, int DeviceNum);
}
__global__ void square(int *A) { *A = 42; }
int main(int argc, char **argv) {
int DevNo = 0;
int *Ptr = reinterpret_cast<int *>(llvm_omp_target_alloc_shared(4, DevNo));
*Ptr = 7;
printf("Ptr %p, *Ptr %i\n", Ptr, *Ptr);
square<<<1, 1>>>(Ptr);
printf("Ptr %p, *Ptr %i\n", Ptr, *Ptr);
llvm_omp_target_free_shared(Ptr, DevNo);
}
❯❯❯ clang++ test.cu -O3 -o test123 -foffload-via-llvm --offload-arch=native
❯❯❯ llvm-objdump --offloading test123
test123: file format elf64-x86-64
OFFLOADING IMAGE [0]:
kind elf
arch gfx90a
triple amdgcn-amd-amdhsa
producer openmp
❯❯❯ LIBOMPTARGET_INFO=16 ./test123
Ptr 0x155448ac8000, *Ptr 7
Ptr 0x155448ac8000, *Ptr 42
```
This is only for struct containing nested structs with user defined
mappers.
Add four functions:
1>buildImplicitMap: build map for default mapper
2>buildImplicitMapper: build default mapper.
3>hasUserDefinedMapper for given mapper name and mapper type, lookup
user defined map, if found one return true.
4>isImplicitMapperNeeded check if Mapper is needed
During create map, in checkMappableExpressionList, call
isImplicitMapperNeeded when it return true, call buildImplicitMapper to
generate implicit mapper and added to map clause.
https://github.com/llvm/llvm-project/pull/101101
The kernel names for OpenMP are manually mangled and not ideal when we
report something to the user. We demangle them now, providing the
function and line number of the target region, together with the actual
kernel name.
Similar to (de)allocation traces, we can record kernel launch stack
traces and display them in case of an error. However, the AMD GPU plugin
signal handler, which is invoked on memroy faults, cannot pinpoint the
offending kernel. Insteade print `<NUM>`, set via
`OFFLOAD_TRACK_NUM_KERNEL_LAUNCH_TRACES=<NUM>`, many traces. The
recoding/record uses a ring buffer of fixed size (for now 8).
For `trap` errors, we print the actual kernel name, and trace if
recorded.
As a first step towards a GPU sanitizer we now can track allocations and
deallocations in order to report double frees, and other problems during
deallocation.
Summary:
Currently there are several layers to handle `printf`. Since we now have
varargs and an implementation of `printf` this can be heavily
simplified.
1. The frontend renames `printf` into `omp_vprintf` and gives it an
argument buffer.
Removing 1. triggered some code in the AMDGPU backend menat for HIP /
OpenCL, so I hadded an exception to it.
2. Forward this to CUDA vprintf or ignore it.
We no longer need special handling for it since we have varargs. So now
we just forward this to CUDA vprintf if we have libc, otherwise just
leave `printf` as an external function and expect that `libc` will be
linked in.
Summary:
The C library is intended to provide `__assert_fail`, so in the cases
that we have both we should defer to that. This means that if you build
the C library for GPUs you'll get the RPC based asser, and if not you'll
get the trap based one.
The `llvm-omp-device-info` tool is very handy, but broke due to the lazy
evaluation of devices. This repairs the functionality and adds a test.
The tool is also renamed into `llvm-offload-device-info` as `-omp-` is
going away.
In debug mode there is a wrapper (the kernel) around the function in
which we generate the kernel code. We worked around this before to get
the correct kernel name, but now we really distinguish both to attach
the launch bounds to the kernel, not the inner function.
This patch handles dependencies specified by the `depend` clause on an
OpenMP target construct. It does this much the same way clang does it by
materializing an OpenMP `task` that is tagged with the dependencies.
The following functions are relevant to this patch -
1) `createTarget` - This function itself is largely unchanged except
that it now accepts a vector of `DependData` objects that it simply
forwards to `emitTargetCall`
2) `emitTargetCall` - This function has changed now to check if an outer
target-task needs to be materialized (i.e if `target` construct has
`nowait` or has `depend` clause). If yes, it calls `emitTargetTask` to
do all the heavy lifting for creating and dispatching the task.
3) `emitTargetTask` - Bulk of the change is here. See the large comment
explaining what it does at the beginning of this function
Fix another runtime problem when explicit map both pointer and pointee
in target data region.
In #92210, problem is only addressed in target region, but missing for
target data region.
The change just passing AreBothBasePtrAndPteeMapped in
generateInfoForComponentList when processing target data.
---------
Co-authored-by: Alexey Bataev <a.bataev@gmx.com>
Summary:
These functions provide special-case implementations internal to the
OpenMP device runtime. This can potentially conflict with the symbols
pulled in from the actual GPU `libc`. This patch makes these weak, so in
the case that the GPU libc functions exist they will be overridden. This
should not impact performance in the average case because the old
`-mlink-builtin-bitcode` version does internalization, deleting weak,
and the new LTO path will resolve to the strong reference and then
internalize it.
Many tests in the `offload` project have requirements defined by which
targets are not supported rather than which platforms are supported.
This patch aims to streamline the requirement definitions by adding four
new feature tags: `host`, `gpu`, `amdgpu`, and `nvidiagpu`.