This patch attempts to provide a fix for an issue that appears when the
`__kmp_dist_for_static_init` function is called from a serialized team.
This is triggered by code generated by flang for `distribute parallel
do` constructs whenever an `if` clause for the `parallel` leaf construct
is present. This results in the introduction of a call to
`__kmpc_fork_call_if` in place of `__kmpc_fork_call`. When it evaluates
to `false`, it defers execution to `__kmp_serialized_parallel`, which
creates a new serial team that is picked up by
`__kmp_dist_for_static_init`, resulting in an incorrect `team` pointer
that causes the `nteams == (kmp_uint32)team->t.t_parent->t.t_nproc`
assertion to fail.
The sequence of calls replicating this issue can be summarized as:
- `__kmpc_fork_teams`
- `__kmpc_fork_call_if`
- `__kmpc_dist_for_static_init_*`
Since I am not familiar with the implementation of the OpenMP runtime,
it is possible that the above sequence of calls is incorrect, or that
the bug can be better fixed in another way, so I am open to discussing
this.
The following Fortran program can be compiled with flang to show the
issue:
```f90
! Compile and run: flang -fopenmp test.f90 -o test && ./test
! Check LLVM IR: flang -fc1 -emit-llvm -fopenmp test.f90 -o -
program main
implicit none
integer, parameter :: n = 10
integer :: i, idx(n)
!$omp teams
!$omp distribute parallel do if(.false.)
do i=1,n
idx(i) = i
end do
!$omp end teams
print *, idx
end program
```
Updating OpenMP runtime taskgraph support(record/replay mechanism):
- Adds a `graph_reset` bit in `kmp_taskgraph_flags_t` to discard
existing TDG records.
- Switches from a strict index-based TDG ID/IDX to a more flexible
integer-based, which can be any integer (e.g. hashed).
- Adds helper functions like `__kmp_find_tdg`, `__kmp_alloc_tdg`, and
`__kmp_free_tdg` to manage TDGs by their IDs.
These changes pave the way for the integration of OpenMP taskgraph (spec
6.0). Taskgraphs are still recorded in an array with a lookup efficiency
reduced to O(n), where n ≤ `__kmp_max_tdgs`. This can be optimized by
moving the TDGs to a hashtable, making lookups more efficient. The
provided helper routines facilitate easier future optimizations.
This PR creates a new member for task data, which is used to identify
the task in its taskgraph (when ompx taskgraph is enabled).
It aims to remove the overloading of the td_task_id member, which was
used both by the debugger and the taskgraph. This resulted in the
identifier's non-unicity in the case of multiple taskgraphs.
Co-authored-by: Rémy Neveu <rem2007@free.fr>
This PR fixes warning which occurs if one compiles OpenMP runtime with
GCC:
```
warning: comparison of integer expressions of different signedness: 'kmp_intptr_t' {aka 'long int'} and 'long unsigned int' [-Wsign-compare]
```
In several cases the flags entries in ompt_frame_t are not initialized.
According to @jdelsign the address provided as reenter and exit address
is the canonical frame address (cfa) rather than a "framepointer". This
patch makes sure that the flags entry is always initialized and changes
the value from ompt_frame_framepointer to ompt_frame_cfa.
The assertion in the tests makes sure that the flags are always set,
when a tool (callback.h in this case) looks at the value.
Fixes#89058
This patch series demonstrates and fixes a bug that causes crashes with
OpenMP 'taskwait' directives in heavily multi-threaded scenarios.
TLDR: The early return from __kmpc_omp_taskwait_deps_51 missed the
synchronization mechanism in place for the late return.
Additional debug assertions check for the implied invariants of the code.
@jpeyton52 found the timing hole as this sequence of events:
>
> 1. THREAD 1: A regular task with dependences is created, call it T1
> 2. THREAD 1: Call into `__kmpc_omp_taskwait_deps_51()` and create a stack
based depnode (`NULL` task), call it T2 (stack)
> 3. THREAD 2: Steals task T1 and executes it getting to
`__kmp_release_deps()` region.
> 4. THREAD 1: During processing of dependences for T2 (stack) (within
`__kmp_check_deps()` region), a link is created T1 -> T2. This increases
T2's (stack) `nrefs` count.
> 5. THREAD 2: Iterates through the successors list: decrement the T2's
(stack) npredecessor count. BUT HASN'T YET `__kmp_node_deref()`-ed it.
> 6. THREAD 1: Now when finished with `__kmp_check_deps()`, it returns false
because npredecessor count is 0, but T2's (stack) `nrefs` count is 2 because
THREAD 2 still references it!
> 7. THREAD 1: Because `__kmp_check_deps()` returns false, early exit.
> _Now the stack based depnode is invalid, but THREAD 2 still references it._
>
> We've reached improper stack referencing behavior. Varied results/crashes/
asserts can occur if THREAD 1 comes back and recreates the exact same depnode
in the exact same stack address during the same time THREAD 2 calls
`__kmp_node_deref()`.
A proposed fix for the issue #95611, [OpenMP][SIMD] ordered has no
effect in a loop SIMD region as of LLVM 18.1.0
Changes:
- Implement new lowering behavior: Conservatively serialize "omp simd"
loops that have `omp simd ordered` directive to prevent incorrect
vectorization (which results in incorrect execution behavior of the
miscompiled program).
Implementation outline:
- We start with the optimistic default initial value of
`LoopStack.setParallel(/Enable=/true);` in
`CodeGenFunction::EmitOMPSimdInit(const OMPLoopDirective &D)`.
- We only disable the loop parallel memory access assumption with `if
(HasOrderedDirective) LoopStack.setParallel(/Enable=/false);` using the
`HasOrderedDirective` (which tests for the presence of an
`OMPOrderedDirective`).
- This results in no longer incorrectly vectorizing the loop when the
`omp simd ordered` directive is present.
Motivation: We'd like to prevent incorrect vectorization of the loops
marked with the `#pragma omp ordered simd` directive which has
previously resulted in miscompiled code.
At the same time, we'd like the usage outside of the `#pragma omp
ordered simd` context to remain unaffected: Note that in the test
"clang/test/OpenMP/ordered_codegen.cpp" we only "lose" the
`!llvm.access.group` metadata in `foo_simd` alone.
This is conservative, in that it's possible some of the loops would be
possible to vectorize, but we prefer to avoid miscompilation of the
loops that are currently illegal to vectorize.
A concrete example follows:
```cpp
// "test.c"
#include <float.h>
#include <math.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int compare_float(float x1, float x2, float scalar) {
const float diff = fabsf(x1 - x2);
x1 = fabsf(x1);
x2 = fabsf(x2);
const float l = (x2 > x1) ? x2 : x1;
if (diff <= l * scalar * FLT_EPSILON)
return 1;
else
return 0;
}
#define ARRAY_SIZE 256
__attribute__((noinline)) void initialization_loop(
float X[ARRAY_SIZE][ARRAY_SIZE], float Y[ARRAY_SIZE][ARRAY_SIZE]) {
const float max = 1000.0;
srand(time(NULL));
for (int r = 0; r < ARRAY_SIZE; r++) {
for (int c = 0; c < ARRAY_SIZE; c++) {
X[r][c] = ((float)rand() / (float)(RAND_MAX)) * max;
Y[r][c] = X[r][c];
}
}
}
__attribute__((noinline)) void omp_simd_loop(float X[ARRAY_SIZE][ARRAY_SIZE]) {
for (int r = 1; r < ARRAY_SIZE; ++r) {
for (int c = 1; c < ARRAY_SIZE; ++c) {
#pragma omp simd
for (int k = 2; k < ARRAY_SIZE; ++k) {
#pragma omp ordered simd
X[r][k] = X[r][k - 2] + sinf((float)(r / c));
}
}
}
}
__attribute__((noinline)) int comparison_loop(float X[ARRAY_SIZE][ARRAY_SIZE],
float Y[ARRAY_SIZE][ARRAY_SIZE]) {
int totalErrors_simd = 0;
const float scalar = 1.0;
for (int r = 1; r < ARRAY_SIZE; ++r) {
for (int c = 1; c < ARRAY_SIZE; ++c) {
for (int k = 2; k < ARRAY_SIZE; ++k) {
Y[r][k] = Y[r][k - 2] + sinf((float)(r / c));
}
}
// check row for simd update
for (int k = 0; k < ARRAY_SIZE; ++k) {
if (!compare_float(X[r][k], Y[r][k], scalar)) {
++totalErrors_simd;
}
}
}
return totalErrors_simd;
}
int main(void) {
float X[ARRAY_SIZE][ARRAY_SIZE];
float Y[ARRAY_SIZE][ARRAY_SIZE];
initialization_loop(X, Y);
omp_simd_loop(X);
const int totalErrors_simd = comparison_loop(X, Y);
if (totalErrors_simd) {
fprintf(stdout, "totalErrors_simd: %d \n", totalErrors_simd);
fprintf(stdout, "%s : %d - FAIL: error in ordered simd computation.\n",
__FILE__, __LINE__);
} else {
fprintf(stdout, "Success!\n");
}
return totalErrors_simd;
}
```
Before:
```
$ clang -fopenmp-simd -O3 -ffast-math -lm test.c -o test && ./test
totalErrors_simd: 15408
test.c : 76 - FAIL: error in ordered simd computation.
```
clang 19.1.0: https://godbolt.org/z/6EvhxqEhe
After:
```
$ clang -fopenmp-simd -O3 -ffast-math test.c -o test && ./test
Success!
```
Co-authored-by: Matt P. Dziubinski <matt-p.dziubinski@hpe.com>
Summary:
Freestanding builds have `stddef.h` and `stdint.h` but not `stdlib.h`.
We don't actually use any `stdlib.h` definitions in the OpenMP headers,
and some definitions from this header are usable without the OpenMP
runtime (allocators) so we should be able to do this. This ignores the
include if possible, removing the implicit include would possibly break
some applications so it stays here.
Instead respect what the toolchain default is (or what the user sets via
CMAKE_CXX_FLAGS).
This fixes builds with libcxx, with mingw toolchains targeting
msvcrt.dll, after 5d8be4c036aa5ce4a94f1f37a9155d5c877e23db; after that
commit, the libcxx public headers reference symbols such as iswspace_l,
which are unavailable when targeting msvcrt.dll on older versions of
Windows (it's only available in msvcrt.dll since Windows Vista).
The `openmp` runtime failed to build on LoP with LLVM18 on LoP due to
the addition of `-Wvla-cxx-extension` as
```
llvm-project/openmp/runtime/src/ompt-general.cpp:711:15: error: variable length arrays in C++ are a Clang extension [-Werror,-Wvla-cxx-extension]
711 | int tmp_ids[ids_size];
| ^~~~~~~~
llvm-project/openmp/runtime/src/ompt-general.cpp:711:15: note: function parameter 'ids_size' with unknown value cannot be used in a constant expression
llvm-project/openmp/runtime/src/ompt-general.cpp:704:65: note: declared here
704 | OMPT_API_ROUTINE int ompt_get_place_proc_ids(int place_num, int ids_size,
| ^
1 error generated.
```
This patch is to ignore the checking against this usage.
Add libgomp.1.dylib for MacOS and libgomp.so.1 for Linux
Linkers on Mac and Linux pick up versioned libgomp dynamic library
files. The existing softlinks (libgomp.dylib for MacOS and libgomp.so
for Linux) are insufficient. This helps alleviate the issue of mixing
libgomp and libomp at runtime.
This patch modifies the signature of the `__kmp_print_tdg_dot` function
in `kmp_tasking.cpp` to include the global thread ID (gtid) as an
argument. The gtid is now correctly passed to the function.
- Updated the function declaration to accept the gtid parameter.
- Modified all calls to `__kmp_print_tdg_dot` to pass the correct gtid
value.
This change addresses issues encountered when compiling with
`OMPX_TASKGRAPH` enabled. No functional changes are expected beyond
successful compilation.
Line ending policies were changed in the parent, dccebddb3b80. To make
it easier to resolve downstream merge conflicts after line-ending
policies are adjusted this is a separate whitespace-only commit. If you
have merge conflicts as a result, you can simply `git add --renormalize
-u && git merge --continue` or `git add --renormalize -u && git rebase
--continue` - depending on your workflow.
On powerpc, physical_package_id may not be available. Currently, this
causes openmp to fall back to flat topology and various affinity tests
fail.
Fix this by parsing core_siblings_list to deterimine which cpus belong
to the same socket. This matches what the testing code does. The code to
parse the CPU list format thankfully already exists.
Fixes https://github.com/llvm/llvm-project/issues/111809.
For `libomp` on AIX, we build shared object `libomp.so` first and then
archive it into libomp.a. This patch changes to use SO version `1` and
name the shared object `libomp.so.1` so that it is consistent with the
naming of other shared objects in AIX libraries, e.g., `libc++.so.1` in
`libc++.a`. With this change, the change made in commit
bde51d9b0d473447ea12fb14924f14ea167eec85 to ensure only `libomp.a` is
published on AIX is no longer necessary and is removed.
For `libomp` on AIX, we build shared object `libomp.so` first and then
archive it into `libomp.a`. Due to a CMake for AIX problem, the install
step also tries to publish `libomp.so`. While we use a script to build
`libomp.a` out-of-tree for Clang and avoided the problem, this chokes
the in-tree build for Flang. The issue will be reported to CMake but
before a fixed CMake is available, this patch ensures only `libomp.a` is
published.
This patch adds support for pause resource with a new enumerator
omp_pause_stop_tool. The expected behavior of this enumerator is
* omp_pause_resource: not allowed
* omp_pause_resource_all: equivalent to omp_pause_hard
Start __kmp_invoke_microtask with PACBTI in order to identify the
function as a valid branch target. Before returning, SP is
authenticated.
Also add the BTI and PAC markers to z_Linux_asm.S.
With this patch, libomp.so can now be generated with DT_AARCH64_BTI_PLT
when built with -mbranch-protection=standard.
The implementation is based on the code available in compiler-rt.
This fixes several of those when building with MSVC on Windows:
```
[3625/7617] Building CXX object
projects\openmp\runtime\src\CMakeFiles\omp.dir\kmp_affinity.cpp.obj
C:\src\git\llvm-project\openmp\runtime\src\kmp_affinity.cpp(2637):
warning C4062: enumerator 'KMP_HW_UNKNOWN' in switch of enum 'kmp_hw_t'
is not handled
C:\src\git\llvm-project\openmp\runtime\src\kmp.h(628): note: see
declaration of 'kmp_hw_t'
```
* Separate wasi and emscripten as they have different constraints and
abilities
* Emscripten mimics Linux/POSIX by statically linking the musl runtime.
This allow nearly all KMP_OS_LINUX code paths to work correctly. There
are only a few places that need to be adjusted related to dynamic
linking (dl_open)
* Internally link openmp globals
* With CommonLinkage it is needed to emit them in an assembly file, now
they are defined and used within each compilation unit
* With ExternalLinkage they suffer from duplicate symbols during linking
for unnamed globals like reduction/critical
* Interestingly this aligns with the TODO comment above this code
On non-hyperthreaded machines, the thread id is not always explicit in
the /proc/cpuinfo file. This patch adds a check to ensure the thread ids
are put in.
These are Intel-specific changes for the CPUID leaf 31 method for
detecting machine topology.
* Cleanup known levels usage in x2apicid topology algorithm
Change to be a constant mask of all Intel topology type values.
* Take unknown ids into account when sorting them
If a hardware id is unknown, then put further down the hardware thread
list so it will take last priority when assigning to threads.
* Have sub ids printed out for hardware thread dump
* Add caches to topology
New` kmp_cache_ids_t` class helps create cache ids which are then put
into the topology table after regular topology type ids have been put
in.
* Allow empty masks in place list creation
Have enumeration information and place list generation take into account
that certain hardware threads may be lacking certain layers
* Allow different procs to have different number of topology levels
Accommodates possible situation where CPUID.1F has different depth for
different hardware threads. Each hardware thread has a topology
description which is just a small set of its topology levels. These
descriptions are tracked to see if the topology is uniform or not.
* Change regular ids with logical ids
Instead of keeping the original sub ids that the x2apicid topology
detection algorithm gives, change each id to its logical id which is a
number: [0, num_items - 1]. This makes inserting new layers into the
topology significantly simpler.
* Insert caches into topology
This change takes into account that most topologies are uniform and
therefore can use the quicker method of inserting caches as equivalent
layers into the topology.
The debug assert is meant to check that the index is a valid which means
the runtime needs to check against the size of the array instead of the
number of threads. A free()-ed thread put back in the thread pool may
index into anywhere inside the task team's available array from 0 to
tt_max_threads potentially.
Fixes: #94260
Add the reverse directive which will be introduced in the upcoming
OpenMP 6.0 specification. A preview has been published in [Technical
Report 12](https://www.openmp.org/wp-content/uploads/openmp-TR12.pdf).
---------
Co-authored-by: Alexey Bataev <a.bataev@outlook.com>
This change makes the runtime use new OMPT state and sync kinds
introduced in OpenMP 5.1 in place of the deprecated implicit state and
sync kinds. Events from implicit barriers use different enumerators for
workshare, parallel, and teams.
Use more specific values from `ompt_work_t` to allow the tool identify
the schedule of a worksharing-loop. With this patch, the runtime will
report the schedule chosen by the runtime rather than necessarily the
schedule literally requested by the clause.
E.g., for guided + just one iteration per thread, the runtime would
choose and report static.
Fixes issue #63904