Allocatable or pointer module variables with the CUDA managed attribute
are defined with a double descriptor. One on the host and one on the
device. Only the data pointed to by the descriptor will be allocated in
managed memory.
Allow the registration of any allocatable or pointer module variables
like device or constant.
Implement extended intrinsic PUTENV, both function and subroutine forms.
Add PUTENV documentation to flang/docs/Intrinsics.md. Add functional and
semantic unit tests.
No longer require -fopenmp or -fopenacc with -E, unless specific version
number options are also required for predefined macros. This means that
most source can be preprocessed with -E and then later compiled with
-fopenmp, -fopenacc, or neither.
This means that OpenMP conditional compilation lines (!$) are also
passed through to -E output. The tricky part of this patch was dealing
with the fact that those conditional lines can also contain regular
Fortran line continuation, and that now has to be deferred when !$ lines
are interspersed.
The preprocessor can perform macro replacement within identifiers when
they are split up with Fortran line continuation, but is failing to do
macro replacement on a continued identifier when none of its parts are
replaced.
The optional second argument to IEEE_SUPPORT_FLAG (and related functions
from the intrinsic IEEE_ARITHMETIC module) is needed only for its type,
not its value. Restrictions on local objects as arguments to function
references in specification expressions shouldn't apply to it.
Define a new attribute for dummy data object characteristics to
distinguish such arguments, set it for the appropriate intrinsic
function references, and test it during specification expression
validation.
Fortran::runtime::Descriptor::BytesFor() only works for Fortran
intrinsic types for which a C++ type counterpart exists, so it crashes
on some types that are legitimate Fortran types like REAL(2). Move some
logic from Evaluate into a new header in flang/Common, then use it to
avoid this needless dependence on C++.
A function or subroutine can allow an object of the same name to appear
in its scope, so long as the name is not used. This is similar to the
case of a name being imported from multiple distinct modules, and
implemented by the same representation.
It's not clear whether this is conforming behavior or a common
extension.
Add function and subroutine forms of FSEEK and FTELL as intrinsic
procedures. Accept common aliases from legacy compilers as well.
A separate patch to llvm-test-suite will enable tests for these
procedures once this patch has merged.
Depends on https://github.com/llvm/llvm-project/pull/132423; CI builds
will likely fail until that patch is merged and this PR is rebased.
Follow up to #134170.
We should be using the LLVM intrinsics instead of plain fir.calls when
we can. Existing code creates a declaration for the llvm intrinsic and
a regular fir.call, which makes it hard for consumers of the IR to find
all the intrinsic calls.
This patch updates flang to follow clang's behavior when processing the
`-mcode-object-version` option.
It is now used to populate an LLVM module flag called
`amdhsa_code_object_version` expected by the backend and also updates
the driver to add the `--amdhsa-code-object-version` option to the
frontend invocation for device compilation of AMDGPU targets.
Follow-up to #132003, in particular, see
https://github.com/llvm/llvm-project/pull/132003#issuecomment-2739701936.
This PR extends reduction support for `loop` directives. Consider the
following scenario:
```fortran
subroutine bar
implicit none
integer :: x, i
!$omp teams loop reduction(+: x)
DO i = 1, 5
call foo()
END DO
end subroutine
```
Note the following:
* According to the spec, the `reduction` clause will be attached to
`loop` during earlier stages in the compiler.
* Additionally, `loop` cannot be mapped to `distribute parallel for` due
to the call to a foreign function inside the loop's body.
* Therefore, `loop` must be mapped to `distribute`.
* However, `distribute` does not have `reduction` clauses.
* As a result, we have to move the `reduction`s from the `loop` to its
parent `teams` directive, which is what is done by this PR.
This PR implements the nonstandard intrinsic time.
In addition to running the unit tests, I also double checked that the
example code works by manually compiling and running it.
This PR adds the intrinsic `unlink` to flang.
## Test plan
- Added two codegen unit tests and ensured flang-check continues to
pass.
- Manually compiled and ran the example from the documentation.
The string used for intrinsic was not the correct one
"llvm.nvvm.match.any.sync.i32p". There was an extra `p` at the end.
Use the NVVM operation instead so we don't duplicate it.
Flang uses `fir.call <llvm intrinsic>` in a few places. This means
consumers of the IR need to strcmp every fir.call if they want to find a
particular LLVM intrinsic.
Emit LLVM memcpy intrinsics instead.
This patch updates Flang lowering and kernel flags identification in
MLIR so that loop bounds on `target teams loop` constructs are evaluated
on the host, making the trip count available to the corresponding
`__tgt_target_kernel` call emitted for the target region.
This is necessary in order to properly execute these constructs as
`target teams distribute parallel do`.
Co-authored-by: Kareem Ergawy <kareem.ergawy@amd.com>
Consider:
```
function foo()
!$omp declare target(foo) ! This `foo` was a function-result symbol
...
end
```
When resolving symbols, for this case use the symbol corresponding to
the function instead of the symbol corresponding to the function result.
Currently, this will result in an error:
```
error: A variable that appears in a DECLARE TARGET directive must be
declared in the scope of a module or have the SAVE attribute, either
explicitly or implicitly
```
Like other target statements, the statement associated with the label in
a legacy ASSIGN statement could be inside a construct. Constructs
containing such a target must therefore be marked as unstructured,
fairly similar to how targets are processed in `markBranchTarget`.
Hi,
This patch implements support for the following directives :
- `!DIR$ NOUNROLL_AND_JAM` to disable unrolling and jamming on a DO
LOOP.
- `!DIR$ NOUNROLL` to disable unrolling on a DO LOOP.
- `!DIR$ NOVECTOR` to disable vectorization on a DO LOOP.
This PR starts the effort to upstream AMD's internal implementation of
`do concurrent` to OpenMP mapping. This replaces #77285 since we
extended this WIP quite a bit on our fork over the past year.
An important part of this PR is a document that describes the current
status downstream, the upstreaming status, and next steps to make this
pass much more useful.
In addition to this document, this PR also contains the skeleton of the
pass (no useful transformations are done yet) and some testing for the
added command line options.
This looks like a huge PR but a lot of the added stuff is documentation.
It is also worth noting that the downstream pass has been validated on
https://github.com/BerkeleyLab/fiats. For the CPU mapping, this achived
performance speed-ups that match pure OpenMP, for GPU mapping we are
still working on extending our support for implicit memory mapping and
locality specifiers.
PR stack:
- https://github.com/llvm/llvm-project/pull/126026 (this PR)
- https://github.com/llvm/llvm-project/pull/127595
- https://github.com/llvm/llvm-project/pull/127633
- https://github.com/llvm/llvm-project/pull/127634
- https://github.com/llvm/llvm-project/pull/127635
Add the implementation of the `PERROR(STRING) ` intrinsic from the GNU
Extension to prints on the stderr a newline-terminated error message
corresponding to the last system error prefixed by `STRING`.
(https://gcc.gnu.org/onlinedocs/gfortran/PERROR.html)
The OpenMP standard says that all dependencies in the same set of
inter-dependent tasks must be non-overlapping. This simplification means
that the OpenMP only needs to keep track of the base addresses of
dependency variables. This can be seen in kmp_taskdeps.cpp, which stores
task dependency information in a hash table, using the base address as a
key.
This patch generates a rebox operation to slice boxed arrays, but only
the box data address is used for the task dependency. The extra box is
optimized away by LLVM at O3.
Vector subscripts are TODO (I will address in my next patch).
This also fixes a bug for ordinary subscripts when the symbol was mapped
to a box:
Fixes#132647
The code generation relies on `ShallowCopyDirect` runtime
to copy data between the original and the temporary arrays
(both directions). The allocations are done by the compiler
generated code. The heap allocations could have been passed
to `ShallowCopy` runtime, but I decided to expose the allocations
so that the temporary descriptor passed to `ShallowCopyDirect`
has `nocapture` - maybe this will be better for LLVM optimizations.
This change enables LoopVersioning when `fir.pack_array` is met
in the def-use chain. It fixes a couple of huge performance regressions
caused by enabling `-frepack-arrays`.
Issue: Cray Pointer is not associated to Cray Pointee, leading to
Segmentation fault
Fix: GetUltimate, retrieves the base symbol in the current scope, which
gets passed all the references and returns the original symbol
---------
Co-authored-by: Michael Klemm <michael.klemm@amd.com>
Summary:
When we were first porting to COV5, this lead to some ABI issues due to
a change in how we looked up the work group size. Bitcode libraries
relied on the builtins to emit code, but this was changed between
versions. This prevented the bitcode libraries, like OpenMP or libc,
from being used for both COV4 and COV5. The solution was to have this
'none' functionality which effectively emitted code that branched off of
a global to resolve to either version.
This isn't a great solution because it forced every TU to have this
variable in it. The patch in
https://github.com/llvm/llvm-project/pull/131033 removed support for
COV4 from OpenMP, which was the only consumer of this functionality.
Other users like HIP and OpenCL did not use this because they linked the
ROCm Device Library directly which has its own handling (The name was
borrowed from it after all).
So, now that we don't need to worry about backward compatibility with
COV4, we can remove this special handling. Users can still emit COV4
code, this simply removes the special handling used to make the OpenMP
device runtime bitcode version agnostic.
Currently only ctor/dtor list and their priorities are supported. This
PR adds support for the missing data field.
Few implementation notes:
- The assembly printer has a fixed form because previous `attr_dict`
will sort the dict by key name, making global_dtor and global_ctor
differ in the order of printed arguments.
- LLVM's `ptr null` is being converted to `#llvm.zero` otherwise we'd
have to create a region to use the default operation conversion from
`ptr null`, which is silly given that the field only support null or a
symbol.
This PR does reduces the verbosity of parser errors for OpenACC block
constructs that do not parse correctly because they are missing their
trailing end block directive by:
- Removing the redundant error messages created by parsing 3 different
styles of directive tokens.
- Providing a general mechanism of configuring the max number of
contexts printed for every syntax error.
- Not printing less specific contexts that are at the same location.
Prior to the changes:
```
$ flang -fc1 -fopenacc -fsyntax-only flang/test/Parser/acc-data-statement.f90 2>&1 | tee acc-data-statement.prior.log | wc -l
262
```
[acc-data-statement.prior.log](https://github.com/user-attachments/files/19298165/acc-data-statement.prior.log)
```
$ flang -fc1 -fopenacc -fsyntax-only flang/test/Parser/acc-data-statement.f90 2>&1 | tee acc-data-statement.prior.log | wc -l
73
```
[acc-data-statement.post.log](https://github.com/user-attachments/files/19298181/acc-data-statement.post.log)