In libclc, we observe that compiling OpenCL source files to bitcode is
executed sequentially on Windows, which increases debug build time by
about an hour.
add_custom_command may introduce additional implicit dependencies, see
https://gitlab.kitware.com/cmake/cmake/-/issues/17097
This PR adds a target for each command, enabling parallel builds of
OpenCL source files.
CMake 3.27 fixed the above issue with DEPENDS_EXPLICIT_ONLY. When LLVM
upgrades its CMake version to 3.27, we can switch to DEPENDS_EXPLICIT_ONLY.
llvm-diff shows there is no change to amdgcn--amdhsa.bc.
Similar to how cl_khr_fp64 and cl_khr_fp16 implementations are put in the
same file for math built-ins, this PR does the same for atom_* built-ins.
The main motivation is to prevent two files with the same base name from
implementing different built-ins. In a follow-up PR, I'd like to relax
libclc_configure_lib_source to only compare filename instead of path for
overriding, since in our downstream the same category of built-ins, e.g.
math, are organized in several different folders.
clspv is already handling generation of fp16. This implementation is
preventing clspv from making the best choice to use an emulation on top
of fp32-fma, or the native fp16-fma, depending on the command-line
arguments.
This commit moves the shuffle and shuffle2 builtins to the CLC library.
In so doing it makes the headers simpler and re-usable for other builtin
layers to hook into the CLC functions, if they wish.
An additional gentype utility has been made available, which provides a
consistent vector-size-or-1 macro for use.
The existing __CLC_VECSIZE is defined but empty which is useful in
certain applications, such as in concatenation with a type to make a
correctly sized scalar or vector type. However, this isn't usable in the
same preprocessor lines when wanting to check for specific vector sizes,
as e.g., '__CLC_VECSIZE == 2' resolves to '== 2' which is invalid. In
local testing this is also useful for the geometric builtins which are
only available for scalar types and vector types of 2, 3, or 4 elements.
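The difference between the two macros can be sketched in plain C. The
DEMO_* names below are hypothetical stand-ins (the actual name of the new
utility macro is not given here), and the block models a scalar
instantiation only.

```c
#include <assert.h>

/* Hypothetical stand-ins for a scalar gentype instantiation. */
#define DEMO_VECSIZE        /* defined but empty, as __CLC_VECSIZE is */
#define DEMO_VECSIZE_OR_1 1 /* always numeric: 1 for scalars, N for vectors */

/* Concatenating a type with the empty macro yields the scalar type. */
#define DEMO_CAT_(a, b) a##b
#define DEMO_CAT(a, b) DEMO_CAT_(a, b)
typedef DEMO_CAT(float, DEMO_VECSIZE) demo_gentype; /* expands to: float */

/* '#if DEMO_VECSIZE == 2' would expand to '#if == 2', which is invalid.
   The numeric macro works in preprocessor conditionals, e.g. to select
   scalar types and vectors of 2, 3, or 4 elements. */
#if DEMO_VECSIZE_OR_1 >= 1 && DEMO_VECSIZE_OR_1 <= 4
#define DEMO_HAS_GEOMETRIC 1
#else
#define DEMO_HAS_GEOMETRIC 0
#endif

/* Returns 1 when the current (hypothetical) gentype qualifies. */
int demo_has_geometric(void) {
  assert(sizeof(demo_gentype) == sizeof(float));
  return DEMO_HAS_GEOMETRIC;
}
```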
No codegen changes are observed, except the internal shuffle/shuffle2
utility functions are no longer made publicly available.
Some files were accidentally given two copyright headers. Another was
missing one. This commit also converts that file's dos line endings to
unix ones and reformats a comment.
Devices that do not support denormals may compare them equal to zero.
This leads to results that match neither CTS expectation, whether or not
denormals are supported.
For example for 0x1.008p-140 we get {0x1.008p-140, 0} while the CTS
expects {0x1.008p-1, -139} when supporting denormals, or {0, 0} when not
supporting denormals (flushed to zero).
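The expected pair has the shape of a frexp-style {mantissa, exponent}
result. Assuming that is the builtin in question (the text above does not
name it), the denormal-supporting expectation can be reproduced on a host
with the standard C frexpf. This is a sketch, not the libclc
implementation; the demo_* names are hypothetical and the host is assumed
not to flush denormals to zero.

```c
#include <assert.h>
#include <math.h>

/* Result pair in the {mantissa, exponent} shape used in the example. */
struct demo_frexp_result {
  float mantissa;
  int exponent;
};

/* Thin wrapper over the standard C frexpf. */
struct demo_frexp_result demo_frexp(float x) {
  struct demo_frexp_result r;
  r.mantissa = frexpf(x, &r.exponent);
  return r;
}

/* 0x1.008p-140 is a denormal float: 2^-140 + 2^-149. */
void demo_check_denormal(void) {
  struct demo_frexp_result r = demo_frexp(0x1.008p-140f);
  assert(r.mantissa == 0x1.008p-1f);
  assert(r.exponent == -139);
}
```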
Ref #129871
clspv uses a better implementation that does not use a bigger size when
one is not available.
Add a dummy implementation for mul_hi to avoid overriding the
implementation of clspv with the one in libclc.
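One standard way to get the high half of a product without a wider integer
type is to split each operand into 16-bit halves. The sketch below assumes
32-bit unsigned operands and illustrates the general idea only; it is not
clspv's or libclc's actual code, and demo_mul_hi_u32 is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* High 32 bits of a 32x32-bit product, using only 32-bit arithmetic. */
uint32_t demo_mul_hi_u32(uint32_t a, uint32_t b) {
  uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
  uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

  uint32_t lo_lo = a_lo * b_lo; /* contributes to bits 0..31  */
  uint32_t hi_lo = a_hi * b_lo; /* contributes to bits 16..47 */
  uint32_t lo_hi = a_lo * b_hi; /* contributes to bits 16..47 */
  uint32_t hi_hi = a_hi * b_hi; /* contributes to bits 32..63 */

  /* Sum of the bits-16..47 column; this cannot overflow 32 bits. */
  uint32_t cross = (lo_lo >> 16) + (hi_lo & 0xFFFFu) + lo_hi;
  return hi_hi + (hi_lo >> 16) + (cross >> 16);
}

/* Cross-check against a 64-bit reference, used here only for testing. */
void demo_check_mul_hi(uint32_t a, uint32_t b) {
  uint32_t expected = (uint32_t)(((uint64_t)a * b) >> 32);
  assert(demo_mul_hi_u32(a, b) == expected);
}
```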
These are the three remaining native builtins not yet ported.
There are elementwise versions of exp10 and tan which correspond to the
intrinsics, which may be preferable to the current versions which route
through other native builtins. Those could be changed in a follow-up if
desired.
Also enable half-precision variants of tgamma, which were previously
missing.
Note that unlike recent work, these builtins are not vectorized as part
of this commit. Ultimately all three call into lgamma_r, which has heavy
control flow (including switch statements) that would be difficult to
vectorize. Additionally the lgamma_r algorithm is copyrighted to SunPro
so may need a rewrite in the future anyway.
There are no codegen changes (to non-SPIR-V targets) with this commit,
aside from the new half builtins.
This commit moves the 'native' builtins that use asm statements to
generate LLVM intrinsics to the CLC library. In doing so it converts
them to use the appropriate elementwise builtin to generate the same
intrinsic; there are no codegen changes to any target except to AMDGPU
targets where `native_log` is no longer custom implemented and instead
uses the clang elementwise builtin.
This work forms part of #127196 and indeed with this commit there are no
'generic' builtins using/abusing asm statements - the remaining builtins
are specific to the amdgpu and r600 targets.
The function was already nominally in the CLC namespace; this commit
just moves it over.
This commit also vectorizes the builtin to avoid scalarization.
Splitting the 'ln_tbl' into two in db98e292 wasn't done thoroughly
enough as some references to the old table still remained. This commit
fixes the unresolved references by updating to the new split table.
These functions were already nominally in the CLC library.
Similar to others, these builtins are now vectorized and are not broken
down into scalar types.
The libclc build system isn't well set up to pass arbitrary options to
arbitrary source files in a non-intrusive way. There isn't currently any
other motivating example to warrant rewriting the build system just to
satisfy this requirement. So this commit uses a filename-based approach
to inserting this option into the list of compile flags.
These functions were already nominally in the CLC namespace; this commit
just formally moves them over.
Note that 'half' versions of these CLC functions are now provided.
Previously the corresponding OpenCL builtins would forward directly to
the 'float' versions of the CLC builtins. Now the OpenCL builtins call
the 'half' CLC builtins, which themselves call the 'float' CLC versions.
This keeps the interface between the OpenCL and CLC libraries neater and
keeps the CLC library self-contained.
No changes to the generated code for non-SPIR-V targets are observed.
As with other work in this area, these builtins are now vectorized.
A further table has been split into two. There was discrepancy between
comments above the table describing the values as "lead" and "tail" and
variables taken from the table called "head" and "tail", so these have
been unified as head/tail.
These four functions are all related in that they share tables and helper
functions. Furthermore, the acosh and atanh builtins call log1p.
As with other work in this area, these builtins are now vectorized. To
enable this, there are new table accessor functions which return a
vector of table values using a vector of indices. These are internally
scalarized, in the absence of gather operations. Some tables which were
tables of multiple entries (e.g., double2) are split into two separate
"low" and "high" tables. This might affect the performance of memory
operations but is hopefully mitigated by better codegen overall.
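The accessor idea can be sketched in plain C as follows. The struct-based
vector types, the demo_* names, and the table values are all hypothetical
placeholders, not the libclc definitions; the point is only the shape: one
table of pairs becomes two tables, and a vector of indices is gathered one
lane at a time.

```c
#include <assert.h>

/* Hypothetical 4-wide vector types, standing in for int4 and double4. */
typedef struct { int s[4]; } demo_int4;
typedef struct { double s[4]; } demo_double4;

/* A table of {high, low} pairs split into two tables (placeholder values). */
static const double demo_tbl_hi[4] = {1.0, 2.0, 3.0, 4.0};
static const double demo_tbl_lo[4] = {0.125, 0.25, 0.375, 0.5};

/* Scalarized gather: one table load per lane. */
static demo_double4 demo_gather(const double *tbl, demo_int4 idx) {
  demo_double4 r;
  for (int i = 0; i < 4; ++i) {
    assert(idx.s[i] >= 0 && idx.s[i] < 4); /* indices must be in range */
    r.s[i] = tbl[idx.s[i]];
  }
  return r;
}

/* Vector accessors: a vector of indices in, a vector of table values out. */
demo_double4 demo_tbl_hi_at(demo_int4 idx) { return demo_gather(demo_tbl_hi, idx); }
demo_double4 demo_tbl_lo_at(demo_int4 idx) { return demo_gather(demo_tbl_lo, idx); }
```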
Similar to d46a6999, this commit simultaneously moves these three
functions to the CLC library and optimizes them for vector types by
avoiding scalarization.
This commit moves most of the sincos helper functions to the CLC
library. It simultaneously vectorizes them with the aim to increase
performance for vector types by avoiding scalarization.
Some helpers for double types remain as they use various features not
yet ready, like 'fract' which in turn relies on 'fmin'; neither of these
are in the CLC library. They also use table lookups and type punning
which don't translate well to vector versions.
As a proof of concept, float and half versions of the sin and cos
builtins are now vectorized and use the CLC helpers to do so. They
remain in the OpenCL layer but will be simpler to move to the CLC
library when the double versions are ready.
On some implementations, the current implementation leads to slight
accuracy issues.
While the maths behind this implementation is correct, it does not take
into account the accumulation of errors coming from other operators that
do not provide correct rounding (like the exp function).
To avoid it, compute statically exp(-0.5625).
Fixes #124939
Similar to work done in 82912fd6, this commit re-licenses both the
gen_convert.py script and the file it generates.
It previously possessed an MIT license, with three additional individual
copyrights. The file it generated was similar, but named only two of the
three individuals. LLVM's policy is not to accept contributions that
include in-source copyright notices [1]. I'm not aware whether the
individuals concerned signed the re-licensing agreement or not.
It takes the opportunity to update the description(s) in the header
files, since the previous comments were out of date.
[1]
https://llvm.org/docs/DeveloperPolicy.html#embedded-copyright-or-contributed-by-statements
This commit bulk updates all '.h', '.cl', '.inc', and '.cpp' files to
add any missing license headers.
The remaining files are generally CMake, SOURCES, scripts, markdown,
etc.
There are still some '.ll' files which may benefit from a license
header. I can't find an example of an LLVM IR file with a license header
in the rest of LLVM, but unlike most other (sub)projects, libclc has
examples of LLVM IR as source files, compiled and built into the
library.
Currently the link_bc command depends on the bitcode file that is
associated with the custom target builtins.link.clc-arch_suffix.
On Windows we randomly see the following error:
`
Generating builtins.link.clc-${ARCH}--.bc
Generating builtins.link.libspirv-${ARCH}.bc
error : The requested operation cannot be performed on a file with a
user-mapped section open.
`
I suspect that builtins.link.clc-${ARCH}--.bc file is being generated
while it is being used in link_bc.
This PR adds a target-level dependency to ensure
builtins.link.clc-${ARCH}--.bc is generated first.
When the -internalize flag is passed to llvm-link, we only need to link in
the needed symbols. This PR reduces the size of the linked bitcode, e.g. by
removing the following symbols:
_Z12__clc_sw_fmaDv16_fS_S_
_Z12__clc_sw_fmaDv2_fS_S_
_Z12__clc_sw_fmaDv3_fS_S_
_Z12__clc_sw_fmaDv4_fS_S_
_Z12__clc_sw_fmaDv8_fS_S_
_Z12__clc_sw_fmafff
This commit bulk-updates the libclc license headers to the current
Apache-2.0 WITH LLVM-exception license in situations where they were
previously attributed to AMD - and occasionally under an additional
single individual contributor - under an MIT license.
AMD signed the LLVM relicensing agreement and so agreed to license their
past contributions under the new LLVM license.
The LLVM project also has had a long-standing, unwritten, policy of not
adding copyright notices to source code. This policy was recently
written up [1]. This commit therefore also removes these copyright
notices at the same time.
Note that there are outstanding copyright notices attributed to others -
and many files missing copyright headers - which will be dealt with in
future work.
[1]
https://llvm.org/docs/DeveloperPolicy.html#embedded-copyright-or-contributed-by-statements
The libclc headers are an implementation detail and are not intended to
be used by others as OpenCL headers. The only artifacts of libclc we
want to publish are the LLVM bitcode libraries.
As the headers have been incidentally broken by recent changes, this
commit takes the step to stop installing the headers at all. Downstreams
can use clang's own OpenCL headers, and/or its -fdeclare-opencl-builtins
flag.
Fixes #119967.
Also replace some magic constants with named ones.
Checking against FP zero and using isnan and isinf functions allows the
optimizer to create one unified @llvm.is.fpclass intrinsic. This results
in fewer, more canonical IR instructions.
This was already nominally in the CLC library; this commit just formally
moves it over. It simultaneously optimizes it for vector types by
avoiding scalarization.
This also adds missing half variants to certain targets.
It also optimizes some targets' implementations to perform the operation
directly in vector types, as opposed to scalarizing.
This is fairly straightforward for most targets.
We use the element-wise sqrt builtin by default. We also remove a legacy
pre-filtering of the input argument, which the intrinsic now officially
handles.
AMDGPU provides its own implementation of sqrt for double types. This
commit moves this into the implementation of CLC sqrt. It uses weak
linkage on the 'default' CLC sqrt to allow AMDGPU to only override the
builtin for the types it cares about.
There is a long-standing workaround in the libclc build system that
silences a warning about the use of parentheses in bitwise conditional
operations.
In an effort to remove this workaround, this commit re-enables the
warning on the internal CLC library, where most of the bodies of the
builtins will eventually be defined. Thus as we move builtin
implementations into this library, the warnings will trigger and we can
clean up the codebase as we go.
As it happens the only instance in the CLC library which triggered the
warning was in __clc_ldexp.
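For context, this class of warning (presumably clang's
-Wbitwise-conditional-parentheses; the flag is not named above) fires
because a bitwise operator binds tighter than '?:'. The sketch below spells
out both groupings with explicit parentheses; it is an illustration with
hypothetical names, not the actual __clc_ldexp code.

```c
#include <assert.h>

/* How an unparenthesized 'flags | extra ? 4 : 0' is parsed:
   the whole OR expression becomes the condition. */
int demo_as_parsed(int flags, int extra) { return (flags | extra) ? 4 : 0; }

/* A plausible intent: OR the flags with a conditionally set bit. */
int demo_as_intended(int flags, int extra) { return flags | (extra ? 4 : 0); }

/* The two groupings disagree, which is why the warning asks for
   explicit parentheses. */
void demo_check_grouping(void) {
  assert(demo_as_parsed(1, 0) == 4);
  assert(demo_as_intended(1, 0) == 1);
}
```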
This function was already conceptually in the CLC namespace - this just
formally moves it over.
Note however that this commit marks a change in how libclc functions may
be overridden by targets.
Until now we have been using a purely build-system-based approach where
targets could register identically-named files which took responsibility
for the implementation of the builtin in its entirety.
This system wasn't well equipped to deal with AMD's overriding of
__clc_ldexp for only a subset of types, and furthermore conditionally on
a pre-defined macro.
One option for handling this would be to require AMD to duplicate code
for the versions of __clc_ldexp it's *not* interested in overriding. We
could also make it easier for targets to re-define CLC functions through
macros or .inc files. Both of these have obvious downsides. We could
also keep AMD's overriding in the OpenCL layer and bypass CLC
altogether, but this has limited use.
We could use weak linkage on the "base" implementations of CLC
functions, and allow targets to opt-in to providing their own
implementations on a much finer granularity. This commit supports this
as a proof of concept; we could expand it to all CLC builtins if
accepted.
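The weak-linkage idea can be sketched in plain C with the GCC/Clang
`__attribute__((weak))` extension. Only the "base" side fits in a single
file: a target override would be a normal (strong) definition of the same
symbol in a separate object file, which the linker would then prefer. The
function name and its deliberately naive body are hypothetical, not the
libclc implementation.

```c
#include <assert.h>

/* "Base" implementation, marked weak. A strong definition of
   demo_clc_ldexp in another object file would replace it at link time;
   with no override present, this one is used. */
__attribute__((weak)) float demo_clc_ldexp(float x, int n) {
  /* Naive placeholder body: scale by 2 or 0.5, |n| times. */
  float scale = n >= 0 ? 2.0f : 0.5f;
  unsigned steps = n >= 0 ? (unsigned)n : 0u - (unsigned)n;
  for (unsigned i = 0; i < steps; ++i)
    x *= scale;
  return x;
}

/* With no override linked in, calls resolve to the weak base version. */
void demo_check_weak_default(void) {
  assert(demo_clc_ldexp(1.5f, 3) == 12.0f);
  assert(demo_clc_ldexp(8.0f, -2) == 2.0f);
}
```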
Note that the existing filename-based "claiming" approach is still in
effect, so targets have to name their overrides differently to have both
files compiled. This could also be refined.
This commit also enables fp16 log, which was previously missing.
Other than that, no changes to codegen for AMDGPU/Nvidia targets.
Note that for simplicity this commit doesn't try to refactor or optimize
the implementations. Notably, each log is only implemented for scalar
types; vector types are scalarized. It doesn't look too difficult to
make the implementations suitable for vector codegen, so I'll try that
in a future commit.
There's also an unused implementation of log in clc_log_base.h, whereas
the implementation currently used by libclc targets re-uses log2 with an
additional multiplication. That should also be cleaned up as on first
inspection it looks like a more optimal implementation, though it would have
to be checked against the OpenCL CTS for good measure.
Comparing the case where each dimension is used alone, the only codegen
difference is a missed addressing mode fold for the constant offset in the old
version due to an ancient bug.