Like other target statements, the statement associated with the label in
a legacy ASSIGN statement could be inside a construct. Constructs
containing such a target must therefore be marked as unstructured,
fairly similar to how targets are processed in `markBranchTarget`.
Hi,
This patch implements support for the following directives :
- `!DIR$ NOUNROLL_AND_JAM` to disable unrolling and jamming on a DO
LOOP.
- `!DIR$ NOUNROLL` to disable unrolling on a DO LOOP.
- `!DIR$ NOVECTOR` to disable vectorization on a DO LOOP.
This PR starts the effort to upstream AMD's internal implementation of
`do concurrent` to OpenMP mapping. This replaces #77285 since we
extended this WIP quite a bit on our fork over the past year.
An important part of this PR is a document that describes the current
status downstream, the upstreaming status, and next steps to make this
pass much more useful.
In addition to this document, this PR also contains the skeleton of the
pass (no useful transformations are done yet) and some testing for the
added command line options.
This looks like a huge PR but a lot of the added stuff is documentation.
It is also worth noting that the downstream pass has been validated on
https://github.com/BerkeleyLab/fiats. For the CPU mapping, this achived
performance speed-ups that match pure OpenMP, for GPU mapping we are
still working on extending our support for implicit memory mapping and
locality specifiers.
PR stack:
- https://github.com/llvm/llvm-project/pull/126026 (this PR)
- https://github.com/llvm/llvm-project/pull/127595
- https://github.com/llvm/llvm-project/pull/127633
- https://github.com/llvm/llvm-project/pull/127634
- https://github.com/llvm/llvm-project/pull/127635
Add the implementation of the `PERROR(STRING) ` intrinsic from the GNU
Extension to prints on the stderr a newline-terminated error message
corresponding to the last system error prefixed by `STRING`.
(https://gcc.gnu.org/onlinedocs/gfortran/PERROR.html)
During the transition from debug intrinsics to debug records, we used
several different command line options to customise handling: the
printing of debug records to bitcode and textual could be independent of
how the debug-info was represented inside a module, whether the
autoupgrader ran could be customised. This was all valuable during
development, but now that totally removing debug intrinsics is coming
up, this patch removes those options in favour of a single flag
(experimental-debuginfo-iterators), which enables autoupgrade, in-memory
debug records, and debug record printing to bitcode and textual IR.
We need to do this ahead of removing the
experimental-debuginfo-iterators flag, to reduce the amount of
test-juggling that happens at that time.
There are quite a number of weird test behaviours related to this --
some of which I simply delete in this commit. Things like
print-non-instruction-debug-info.ll , the test suite now checks for
debug records in all tests, and we don't want to check we can print as
intrinsics. Or the update_test_checks tests -- these are duplicated with
write-experimental-debuginfo=false to ensure file writing for intrinsics
is correct, but that's something we're imminently going to delete.
A short survey of curious test changes:
* free-intrinsics.ll: we don't need to test that debug-info is a zero
cost intrinsic, because we won't be using intrinsics in the future.
* undef-dbg-val.ll: apparently we pinned this to non-RemoveDIs in-memory
mode while we sorted something out; it works now either way.
* salvage-cast-debug-info.ll: was testing intrinsics-in-memory get
salvaged, isn't necessary now
* localize-constexpr-debuginfo.ll: was producing "dead metadata"
intrinsics for optimised-out variable values, dbg-records takes the
(correct) representation of poison/undef as an operand. Looks like we
didn't update this in the past to avoid spurious test differences.
* Transforms/Scalarizer/dbginfo.ll: this test was explicitly testing
that debug-info affected codegen, and we deferred updating the tests
until now. This is just one of those silent gnochange issues that get
fixed by RemoveDIs.
Finally: I've added a bitcode test, dbg-intrinsics-autoupgrade.ll.bc,
that checks we can autoupgrade debug intrinsics that are in bitcode into
the new debug records.
The OpenMP standard says that all dependencies in the same set of
inter-dependent tasks must be non-overlapping. This simplification means
that the OpenMP only needs to keep track of the base addresses of
dependency variables. This can be seen in kmp_taskdeps.cpp, which stores
task dependency information in a hash table, using the base address as a
key.
This patch generates a rebox operation to slice boxed arrays, but only
the box data address is used for the task dependency. The extra box is
optimized away by LLVM at O3.
Vector subscripts are TODO (I will address in my next patch).
This also fixes a bug for ordinary subscripts when the symbol was mapped
to a box:
Fixes#132647
Although combining -fveclib=ArmPL with -nostdlib is a rare situation, it
should still be supported correctly and should effect in avoidance of
linking against libm.
The code generation relies on `ShallowCopyDirect` runtime
to copy data between the original and the temporary arrays
(both directions). The allocations are done by the compiler
generated code. The heap allocations could have been passed
to `ShallowCopy` runtime, but I decided to expose the allocations
so that the temporary descriptor passed to `ShallowCopyDirect`
has `nocapture` - maybe this will be better for LLVM optimizations.
This change enables LoopVersioning when `fir.pack_array` is met
in the def-use chain. It fixes a couple of huge performance regressions
caused by enabling `-frepack-arrays`.
Issue: Cray Pointer is not associated to Cray Pointee, leading to
Segmentation fault
Fix: GetUltimate, retrieves the base symbol in the current scope, which
gets passed all the references and returns the original symbol
---------
Co-authored-by: Michael Klemm <michael.klemm@amd.com>
Currently there are two specific overloads: for unary operations, i.e.
`Operation<D, R, O>`, and binary ones `Operation<D, R, LO, RO>`.
This makes it impossible for a derived class to use a single overload to
handle all types of operations: `Operation<D, R, O...>`. Since the base
overloads need to be included in the derived class's scope, via `using
Base::operator()` either one of the specific overloads will always be a
better candidate than the more generic derived one.
```
class MyVisitor : public Traverse<...> {
using Traverse<...>::operator();
template <typename D, typename R, typename... O>
Result operator()(const Operation<D, R, O...> &op) const {
// Will never be used.
}
};
```
This patch replaces the two specific overloads for Operation in Traverse
with a single generic overload, while preserving the existing
functionality, and allowing derived classes to use a single overload as
well.
Summary:
When we were first porting to COV5, this lead to some ABI issues due to
a change in how we looked up the work group size. Bitcode libraries
relied on the builtins to emit code, but this was changed between
versions. This prevented the bitcode libraries, like OpenMP or libc,
from being used for both COV4 and COV5. The solution was to have this
'none' functionality which effectively emitted code that branched off of
a global to resolve to either version.
This isn't a great solution because it forced every TU to have this
variable in it. The patch in
https://github.com/llvm/llvm-project/pull/131033 removed support for
COV4 from OpenMP, which was the only consumer of this functionality.
Other users like HIP and OpenCL did not use this because they linked the
ROCm Device Library directly which has its own handling (The name was
borrowed from it after all).
So, now that we don't need to worry about backward compatibility with
COV4, we can remove this special handling. Users can still emit COV4
code, this simply removes the special handling used to make the OpenMP
device runtime bitcode version agnostic.
Currently only ctor/dtor list and their priorities are supported. This
PR adds support for the missing data field.
Few implementation notes:
- The assembly printer has a fixed form because previous `attr_dict`
will sort the dict by key name, making global_dtor and global_ctor
differ in the order of printed arguments.
- LLVM's `ptr null` is being converted to `#llvm.zero` otherwise we'd
have to create a region to use the default operation conversion from
`ptr null`, which is silly given that the field only support null or a
symbol.
This PR does reduces the verbosity of parser errors for OpenACC block
constructs that do not parse correctly because they are missing their
trailing end block directive by:
- Removing the redundant error messages created by parsing 3 different
styles of directive tokens.
- Providing a general mechanism of configuring the max number of
contexts printed for every syntax error.
- Not printing less specific contexts that are at the same location.
Prior to the changes:
```
$ flang -fc1 -fopenacc -fsyntax-only flang/test/Parser/acc-data-statement.f90 2>&1 | tee acc-data-statement.prior.log | wc -l
262
```
[acc-data-statement.prior.log](https://github.com/user-attachments/files/19298165/acc-data-statement.prior.log)
```
$ flang -fc1 -fopenacc -fsyntax-only flang/test/Parser/acc-data-statement.f90 2>&1 | tee acc-data-statement.prior.log | wc -l
73
```
[acc-data-statement.post.log](https://github.com/user-attachments/files/19298181/acc-data-statement.post.log)
The map of symbols requiring new local aliases for USE association needs
to use the symbols' ultimate resolutions to avoid missing cases that can
arise in convoluted codes with lots of confusing renamings.
Fixes https://github.com/llvm/llvm-project/issues/132435.
flang/include/flang/Runtime/io-api.h was changed into io-api-consts.h,
then wrapped into a new io-api.h that includes io-api-consts.h, does
some redundant includes and declarations, and then declares the
prototype of one function, InquiryKeywordHashDecode.
Make that function static in io-stmt.cpp prior to its sole call site,
then undo the renaming, to reduce confusion and redundancy.
The production buildbot master apparently has not yet been restarted
since https://github.com/llvm/llvm-zorg/pull/393 landed.
This reverts commit 96d1baedefc3581b53bc4389bb171760bec6f191.
Remove the FLANG_INCLUDE_RUNTIME option which was replaced by
LLVM_ENABLE_RUNTIMES=flang-rt.
The FLANG_INCLUDE_RUNTIME option was added in #122336 which disables the
non-runtimes build instructions for the Flang runtime so they do not
conflict with the LLVM_ENABLE_RUNTIMES=flang-rt option added in #110217.
In order to not maintain multiple build instructions for the same thing,
this PR completely removes the old build instructions (effectively
forcing FLANG_INCLUDE_RUNTIME=OFF).
As per discussion in
https://discourse.llvm.org/t/buildbot-changes-with-llvm-enable-runtimes-flang-rt/83571/2
we now implicitly add LLVM_ENABLE_RUNTIMES=flang-rt whenever Flang is
compiled in a bootstrapping (non-standalone) build. Because it is
possible to build Flang-RT separately, this behavior can be disabled
using `-DFLANG_ENABLE_FLANG_RT=OFF`. Also see the discussion an
implicitly adding runtimes/projects in #123964.
The test was missing "-o /dev/null" and inadvertently generating a .ll
file in the test directory.
Signed-off-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
Add -f[no-]slp-vectorize to the flang driver.
Add corresponding -fvectorize-slp to the flang frontend.
Enable -fslp-vectorize at -O2 and higher in flang to match the current
behaviour in clang.
---------
Signed-off-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
When building Flang with Clang, we need to do the same quadmath.h
wrapping as we do for flang-rt. I extracted the CMake code
into FlangCommon.cmake, and cleaned up the arguments passing
to execute_process (note that `-###` was treated as `-` in the original
code, because `#` starts a comment). I believe the Clang command
does not require the input source file, so I removed it as well.
Implement GNU extension intrinsic HOSTNM, both function and subroutine
forms. Add HOSTNM documentation to `flang/docs/Intrinsics.md`. Add
lowering and semantic unit tests.
(This change is modeled after GETCWD implementation.)
Summary:
Currently we use `TestBigEndian` in CMake to determine endianness. This
doesn't work on all platforms and is deprecated since CMake 3.20.
Instead of using CMake, we can just use the GNU/Clang preprocessor
definitions.
The only difficulty is MSVC, mostly because they don't support the same
macros. But, as far as I'm aware, MSVC / Windows targets are always
little endian, and if not we can just override it for that specific
target in the future.
Summary:
This patch adds initial support for compiling `flang-rt` directly for
the GPU. The method used here matches what's already done for `libc` and
`libc++` for the GPU and builds off of those projects.
Mainly this requires setting up some flags and setting the sources that
currently work. This will deposit the resulting library in the
appropriate directory. These files are then intended to be linked via
`-Xoffload-linker` support in the offloading driver.
```
lib/clang/21/lib/nvptx64-nvidia-cuda/libflang_rt.runtime.a
lib/clang/21/lib/amdgcn-amd-amdhsa/libflang_rt.runtime.a
```
This is obviously missing a lot of functions, mainly the `io` support.
Most of what we cannot support is due to using POSIX things that just
don't make sense on the GPU. Stuff like `pthreads` or `sema`.
Getting unit tests to run on this will also be a challenge. We could run
tests the same way we do with `libc`, but the problem there is that the
`libc` test suite is freestanding while `gtest` currently doesn't
compile on the GPU bcause it uses a lot of weird stuff. If the unit
tests were simply `int main` then it would work.
I don't understand the actual runtime code very well, I'd appreciate
some guidance on how to actually support Fortran IO from this interface.
As I understand it, Fortran IO requires a stack-like operation, which
conflicts with the SIMT model GPUs use. Worst case scenario we could
burn some LDS to keep a stack, or serialize it somehow since we can
always just iterate over all the active lanes.
Building this right now looks like this, which depends on the arguments
added in https://github.com/llvm/llvm-project/pull/131695.
```
-DRUNTIMES_nvptx64-nvidia-cuda_LLVM_ENABLE_RUNTIMES=compiler-rt;libc;libcxx;libcxxabi;flang-rt \
-DRUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES=compiler-rt;libc;libcxx;libcxxabi;flang-rt \
-DRUNTIMES_nvptx64-nvidia-cuda_FLANG_RT_LIBC_PROVIDER=llvm \
-DRUNTIMES_nvptx64-nvidia-cuda_FLANG_RT_LIBCXX_PROVIDER=llvm \
-DRUNTIMES_amdgcn-amd-amdhsa_FLANG_RT_LIBC_PROVIDER=llvm \
-DRUNTIMES_amdgcn-amd-amdhsa_FLANG_RT_LIBCXX_PROVIDER=llvm
```
Using LoopNest's indices with ShapeShifts that have non-default
lower bounds results in accesses to incorrect array elements.
To avoid having to adjust each index, a ShapeShift with default
lower bounds can be used instead.
Fixes#131751