We have a lot of missing Codesize costs for vector operations. This
patch starts things off by adding codesize costs for getVectorInstrCost,
returning a single cost instead of the VectorInsertExtractBaseCost
(which is typically 2). Inserts of a load are given a cost of 0, as they
use ld1; otherwise the cost is 1.
The main goal of this patch is to improve the
performance of concept subsumption by
- Making sure literal (atomic) clauses are de-duplicated (whether two
atomic constraints are identical is established during the initial normal
form production).
- Eagerly removing duplicated clauses.
This should minimize the risks of exponentially large formulas that can
be produced by a naive {C,D}NF transformation.
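For intuition, the blow-up comes from distributing conjunctions over
disjunctions when converting between normal forms: already for two pairs,
$(a_1 \wedge b_1) \vee (a_2 \wedge b_2) \equiv (a_1 \vee a_2) \wedge (a_1 \vee b_2) \wedge (b_1 \vee a_2) \wedge (b_1 \vee b_2)$,
and a disjunction of $n$ such pairs expands to $2^n$ CNF clauses.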
While at it, I restructured that part of the code to be a bit clearer.
Subsumption of fold expanded constraints is also cached.
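As a reminder of where subsumption matters for users, here is a minimal
sketch (with made-up concept names, not code from the test suite) of the
kind of overload resolution that exercises this machinery:
```
// C2's normal form contains all of C1's atomic constraints, so C2 subsumes
// C1 and the C2 overload is the more constrained one.
template <class T> concept C1 = requires(T t) { t.a(); };
template <class T> concept C2 = C1<T> && requires(T t) { t.b(); };

template <C1 T> int f(T) { return 1; }
template <C2 T> int f(T) { return 2; }  // preferred when both are satisfied

struct S { void a() {} void b() {} };

int main() { return f(S{}); }  // resolves to the C2 overload, returns 2
```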
---
Note that removing duplicated clauses seems to be necessary and
sufficient to have acceptable performance on anything that could be
construed as reasonable code.
Ultimately, the number of clauses is always going to be fairly small
(but $2^{fairly\ small}$ is quickly *fairly large*..).
I went too far down the rabbit hole of Tseitin transformations, etc., which
were much faster but would then require checking satisfiability to
establish subsumption between some constraints (although it was good
enough to pass all but one of our tests...).
It doesn't help that the C++ standard has a very specific definition of
subsumption that is really more of an implication...
While that sort of musing is fascinating, it was ultimately a fool's
errand, at least until such time as there is more motivation for a SAT
solver in clang (clang-tidy can, after all, use z3!).
Here be dragons.
Fixes #122581
C2y adds the `_Countof` operator which returns the number of elements in
an array. As with `sizeof`, `_Countof` either accepts a parenthesized
type name or an expression. Its operand must be (of) an array type. When
passed a constant-size array operand, the operator is a constant
expression which is valid for use as an integer constant expression.
This is being exposed as an extension in earlier C language modes, but
not in C++. C++ already has `std::extent` and `std::size` to cover these
needs, so the operator doesn't seem to give the user enough benefit to
warrant carrying this as an extension.
Fixes #102836
This adds some initial documentation about freestanding requirements for
Clang. The most critical part of the documentation is spelling out that
a conforming freestanding C Standard Library is required; Clang will not
be providing headers (such as <string.h> in C23) which expose a number of
symbols in freestanding mode.
The docs also make it clear that in addition to a conforming
freestanding C standard library, the library must provide some
additional symbols which LLVM requires.
These docs are not comprehensive; this is just getting the bare bones in
place so that they can be expanded later.
This also updates the C status page to make it clear that we don't have
anything to do for WG14 N2524 which adds string interfaces to
freestanding mode.
Add int overloads which cast the various ints to a float and call the
float builtin.
These overloads are conditional on hlsl version 202x or earlier.
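Roughly, the pattern is the following (an illustrative C++ sketch;
`my_builtin` and its signatures are placeholders, not the actual HLSL header
code):
```
#include <cmath>

// Placeholder for an existing float builtin.
inline float my_builtin(float x) { return std::sqrt(x); }

// New integer overloads: cast the argument to float and defer to the float
// builtin.
inline float my_builtin(int x) { return my_builtin(static_cast<float>(x)); }
inline float my_builtin(unsigned int x) {
  return my_builtin(static_cast<float>(x));
}

int main() { return my_builtin(4) == 2.0f ? 0 : 1; }
```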
Add tests and put them in their own files, including some of the tests
added for the double overloads.
Closes #128229
The entrypoints for aarch64 are mostly up to date, but the headers are
not. This patch fixes that, and also makes explicit the dependency from
OSUtils/linux on sys/syscalls.h
Since we've changed what gets generated, we should update the snapshots
of MIR. Otherwise, we end up testing configurations which are no longer
possible from codegen.
Similar to what is done for visitEXTRACT_VECTOR_ELT - if all uses of a
vector are EXTRACT_SUBVECTOR, then determine the accumulated demanded
elts across all users and call SimplifyDemandedVectorElts in
"AssumeSingleUse" use.
optimizeCondBranch isn't allowed to modify the CFG, but it can rewrite
the branch condition freely. However, if we could have folded a conditional
branch to an unconditional one (were it not for that restriction), we can
instead rewrite it into a canonical conditional branch.
Looking at the diffs, the only cases this catches in in-tree tests are
cases where we could have constant folded during lowering from IR, but
didn't. This is inspired by trying to salvage code from
https://github.com/llvm/llvm-project/pull/131684 which might be useful.
Given the test impact, it's of questionable merit. The main advantage
over only the late cleanup pass is that it kills off the LIs for the
constants early - which can help e.g. register allocation.
These functions were already nominally in the CLC namespace; this commit
just formally moves them over.
Note that 'half' versions of these CLC functions are now provided.
Previously the corresponding OpenCL builtins would forward directly to
the 'float' versions of the CLC builtins. Now the OpenCL builtins call
the 'half' CLC builtins, which themselves call the 'float' CLC versions.
This keeps the interface between the OpenCL and CLC libraries neater and
keeps the CLC library self-contained.
No changes to the generated code for non-SPIR-V targets are observed.
This patch removes the old range constructors of SmallSet and
StringSet that do not take the llvm::from_range tag. Since there are
so few uses, this patch directly removes them without going through
the deprecation process.
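For reference, the remaining construction style looks roughly like this (a
sketch; it assumes SmallSet.h pulls in the declaration of `llvm::from_range`):
```
#include "llvm/ADT/SmallSet.h"
#include <vector>

// The tagged range constructor kept by this patch; the removed constructors
// took an iterator pair instead of the llvm::from_range tag.
void example(const std::vector<int> &Vals) {
  llvm::SmallSet<int, 8> S(llvm::from_range, Vals);
  (void)S;
}
```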
The input is an integer which can't be NaN, so the NV (invalid) exception
can't be raised. The conversion is exact, so it can't raise NX (inexact),
UF (underflow), or OF (overflow). The instructions are not divides, so they
can't raise DZ (divide by zero).
Fixes #133192.
When the TMA descriptor is transferred from host memory to global memory
using cudaMemcpy, each thread block must insert a fence before any
thread accesses the updated tensor map in global memory. Once the tensor
map has been accessed, no additional fences are needed by that block
unless the map is modified again.
[Example from cuda programming
guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#using-tma-to-transfer-multi-dimensional-arrays).
The `tma.fence.descriptor` basically implements
`ptx::fence_proxy_tensormap_generic`.
```
#include <cuda.h>
#include <cuda/ptx>
namespace ptx = cuda::ptx;
__device__ CUtensorMap global_tensor_map;
__global__ void kernel(CUtensorMap *tensor_map)
{
  // Fence acquire tensor map:
  ptx::n32_t<128> size_bytes;
  // Since the tensor map was modified from the host using cudaMemcpy,
  // the scope should be .sys.
  ptx::fence_proxy_tensormap_generic(
      ptx::sem_acquire, ptx::scope_sys, tensor_map, size_bytes
  );
  // Safe to use tensor_map after fence inside this thread..
}

int main() {
  CUtensorMap local_tensor_map;
  // [ ..Initialize map.. ]
  cudaMemcpy(&global_tensor_map, &local_tensor_map, sizeof(CUtensorMap),
             cudaMemcpyHostToDevice);
  kernel<<<1, 1>>>(global_tensor_map);
}
```
The `mbarrier.create` op can create multiple mbarrier objects, and other
mbarrier-related ops can access an mbarrier using a dynamic SSA value.
This is especially useful when using mbarriers in dynamic loops.
This PR adds the `mbarrier.get` op, which returns a pointer to a
specific mbarrier object from a group of barriers created by the
nvgpu.mbarrier.create operation. It is useful when composing the NVGPU
and NVVM dialects.
Example:
```
%mbars = nvgpu.mbarrier.create
    -> !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>, num_barriers = 10>

%mbar_pointer = nvgpu.mbarrier.get %mbars[%c2]
    : !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>>
    -> i32
```
Most of the dylib functions inside `<__filesystem/operations.h>` are at
the top of the file. There are a few spread out in the file for some
reason, which this patch fixes.
This puts the base before the offset to match the order we use for base
ISA where the offset is an immediate.
I'm investigating using sub-operands for the base ISA loads and stores
too, so having a consistent operand order will allow more sharing.
We previously handled ConstantExpr scalable splats in
5d929794a87602cfd873381e11cc99149196bb49, but only fpexts.
ConstantExpr fpexts have since been removed, and simultaneously we
didn't handle splats of constants that weren't extended.
This updates it to remove the fpext check and instead see if we can
shrink the result of getSplatValue.
Note that the test case doesn't get completely folded away due to
#132922
Previously, ranges::min_element delegated to ranges::__min_element_impl, which
duplicated the definition of std::__min_element. This patch updates
ranges::min_element to directly call std::__min_element, which allows
removing the redundant code in ranges::__min_element_impl.
Upon removal of ranges::__min_element_impl, the other ranges algorithms
ranges::{min,max,max_element}, which previously delegated to ranges::__min_element_impl,
have been updated to call std::__min_element instead.
This refactoring unifies the implementation across these algorithms,
ensuring that future optimizations or maintenance work only need to be
applied in one place.
Summary:
The default behavior for LTO on other targets does not specify the
number of LTO partitions. Recent changes made this default to 8 on
AMDGPU, which caused some issues with the `libc` project. The option to
disable this is HIP only, so I think for now we should restrict this just
to HIP.
I'm definitely on board with getting some more parallelism here, but I
think it should probably be restricted to just offloading languages. The
new driver goes through the `--target=amdgcn-amd-amdhsa` for its output,
which means we'd need to forward the default somehow.
This has multiple benefits:
- There is a single instance of our hash function, reducing object file
size
- The hash implementation isn't instantiated in every TU anymore,
reducing compile times
- Behind an ABI configuration macro it would be possible to salt the
hash
The override function `ensureTerminator` ensures that the terminator
`InParallelOp` has a region. However, if the terminator of `scf.forall`
is not an `InParallelOp`, calling ensureTerminator causes a crash. Since
the InParallelOp builder already guarantees the existence of a region,
`ForallOp::ensureTerminator` is redundant and can be safely removed.
Fixes #130019.
The main change here is that we're now able to correctly look up plans
for these functions. Previously, due to caching, we could end up with
one entry covering most of the address space (because one part of the
function was at the beginning and another at the end). Now, we can correctly
recognise that the part in between does not belong to that function, and
we can create a different FuncUnwinders instance for it. It doesn't help
the discontinuous function much (its plan will still be garbled), but
we can at least properly unwind out of the simple functions in between.
Fixing the unwind plans for discontinuous functions requires handling
each unwind source specially, and this setup allows us to make the
transition incrementally.
Using argv[0] for this was incorrect. I'm ignoring LaunchInfo::SetArg0,
as that's what darwin and windows launchers do (they use the first
element of the args vector instead).
I picked up the funny unit test re-exec method from the llvm unit tests.
The instantiation of a VarDecl's initializer might be deferred until the
variable is actually used. However, we were still building the
DeclRefExpr with a type that could later be changed by the initializer's
instantiation, which is incorrect when incomplete arrays are involved.
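A hypothetical reduction of the pattern (not the exact test case from the
linked issues):
```
// The array bound of arr<T> is only known once the initializer is
// instantiated, so a DeclRefExpr built earlier must not bake in the
// incomplete type int[].
template <typename T>
constexpr int arr[] = {1, 2, 3};

static_assert(sizeof(arr<int>) == 3 * sizeof(int), "completed to int[3]");
int main() {}
```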
Fixes #79750
Fixes #113936
Fixes #133047
When a pragma for loop transformations is specified, follow-up metadata for
loops is generated after each transformation. On the LLVM side,
follow-up metadata is expected to be a list of properties, such as the
following:
```
!followup = !{!"llvm.loop.vectorize.followup_all", !mp, !isvectorized}
!mp = !{!"llvm.loop.mustprogress"}
!isvectorized = !{!"llvm.loop.isvectorized"}
```
However, on the clang side, the generated metadata contains an MDNode
that has those properties, as shown below:
```
!followup = !{!"llvm.loop.vectorize.followup_all", !loop_id}
!loop_id = distinct !{!loop_id, !mp, !isvectorized}
!mp = !{!"llvm.loop.mustprogress"}
!isvectorized = !{!"llvm.loop.isvectorized"}
```
According to the
[LangRef](https://llvm.org/docs/TransformMetadata.html#transformation-metadata-structure),
the LLVM side is correct. Due to this inconsistency, follow-up metadata
was not interpreted correctly, e.g., only one transformation is applied
when multiple pragmas are used.
This patch fixes the Clang side to emit follow-up metadata in the correct format.
As with other work in this area, these builtins are now vectorized.
A further table has been split into two. There was a discrepancy between
the comments above the table, which described the values as "lead" and
"tail", and the variables taken from the table, which were called "head"
and "tail"; these have been unified as head/tail.
We've started using `_LIBCPP_BEGIN_NAMESPACE_STD` and
`_LIBCPP_END_NAMESPACE_STD` for more than just the namespace for a while
now. For example, we're using it to add visibility annotations to types.
This works very well and avoids a bunch of annotations, but doesn't work
for the few places where we have an unversioned namespace. This adds
`_LIBCPP_BEGIN_UNVERSIONED_NAMESPACE_STD` and
`_LIBCPP_END_UNVERSIONED_NAMESPACE_STD` to make it simpler to add new
annotations consistently across the library as well as making it more
explicit that the unversioned namespace is indeed intended.
If the broadcast is already the full vector width, try to prefer a blend over a vector insertion, which usually has lower latency (and sometimes a lower uop count).
The surrounding code doesn't use them anymore. This removes the internal
usages.
This patch makes the Rows actual values. An alternative would be to make
them unique_ptrs. That would make vector resizes faster at the cost of
more pointer chasing and heap fragmentation. I don't know which one is
better so I picked the simpler option.
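Roughly, the trade-off between the two layouts (an illustrative sketch; `Row`
is a stand-in for the actual type):
```
#include <memory>
#include <vector>

struct Row { /* fields elided */ };  // stand-in for the actual Row type

int main() {
  // Chosen here: rows stored by value; resizes may move Rows, but access is
  // contiguous with no extra indirection.
  std::vector<Row> rows_by_value(16);

  // Alternative considered: rows behind unique_ptr; resizes only shuffle
  // pointers, at the cost of pointer chasing and heap fragmentation.
  std::vector<std::unique_ptr<Row>> rows_by_pointer;
  rows_by_pointer.push_back(std::make_unique<Row>());
}
```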
This parallels the GNU Binutils feature's usage. A hidden command-line
option `--loongarch-no-aliases` is also added, similar to how
`--loongarch-numeric-reg` is for the `numeric` option.
It was excluded from spilling in a263a60, possibly by accident.
In the linked bug, we hit a situation like this:
```
  %s = call @llvm.coro.suspend(...)
               |
          switch (%s)
 case v1:    /   \    case v2:
           ...     ...
            |    suspend point
            |       ...
             \     /
         %x = phi [v1] [v2]
               |
              ...
               |
            use(%x)
```
Instcombine will notice that %x correlates exactly with %s, and so
use(%x) becomes use(%s).
However, corosplit would substitute different values for %s when
splitting the function, so even though %s had a particular value when
control actually passed through the switch, it could have a *different*
value when reaching use(%s).
This illustrates that while IR from the frontend typically does not use
these suspend return values across suspend points, mid-level
optimizations on the presplit coroutine may introduce new uses of
suspend values, so they must be considered eligible to spill to the
coroutine frame.
Fixes: #130326
These four functions are all related in that they share tables and helper
functions. Furthermore, the acosh and atanh builtins call log1p.
As with other work in this area, these builtins are now vectorized. To
enable this, there are new table accessor functions which return a
vector of table values using a vector of indices. These are internally
scalarized, in the absence of gather operations. Some tables which were
tables of multiple entries (e.g., double2) are split into two separate
"low" and "high" tables. This might affect the performance of memory
operations but are hopefully mitigated by better codegen overall.
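The shape of such an accessor, as a generic sketch (not the actual libclc
code; names and types are illustrative):
```
#include <array>
#include <cstddef>

// Given a per-lane index vector, build a per-lane vector of table values by
// scalarizing the lookup (no gather instruction assumed).
template <std::size_t N>
std::array<double, N> table_lookup(const double *table,
                                   const std::array<std::size_t, N> &idx) {
  std::array<double, N> out{};
  for (std::size_t lane = 0; lane < N; ++lane)
    out[lane] = table[idx[lane]];  // one scalar load per lane
  return out;
}

int main() {
  static const double tbl[4] = {1.0, 2.0, 4.0, 8.0};
  auto v = table_lookup<2>(tbl, {1, 3});  // -> {2.0, 8.0}
  (void)v;
}
```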
PR #130587 defined the same SubtargetFeature for the i6400 and i6500 CPUs,
which resulted in the following warning when -mcpu=i6500 was used:
'+i6500' is not a recognized feature for this target (ignoring feature)
This PR fixes the above issue by defining a separate SubtargetFeature for
i6500.