Currently we have code with target hooks in CodeGenModule shared between
X86 and AArch64 for sorting MultiVersionResolverOptions. Those are used
when generating IFunc resolvers for FMV. The RISCV target has different
criteria for sorting, therefore it repeats sorting after calling
CodeGenFunction::EmitMultiVersionResolver.
I am moving the FMV priority logic in TargetInfo, so that it can be
implemented by the TargetParser which then makes it possible to query it
from llvm. Here is an example why this is handy:
https://github.com/llvm/llvm-project/pull/87939
This can be used to inform users when a template should not be
specialized. For example, this is the case for the standard type traits
(except for `common_type` and `common_reference`, which have more
complicated rules).
Implement isSubclass with direct lookup into some tables instead of
nested switches.
Part of the motivation for this is improving compile time when clang-18
is used as a host compiler, since it seems to have trouble with very
large switch statements.
Initial parsing/sema for 'strict' modifier with 'num_tasks' and
‘grainsize’ clause is present in these commits
[grainsize_parsing](ab9eac762c)
and
[num_tasks_parsing](56c1660170 (diff-4184486638e85284c3a2c961a81e7752231022daf97e411007c13a6732b50db9R6545))
. However, this implementation appears incomplete as it lacks code
generation support. A runtime patch was introduced in this runtime
commit
[runtime_patch](540007b427 (diff-5e95f9319910d6965d09c301359dbe6b23f3eef5ce4d262ef2c2d2137875b5c4R374))
, which adds a new API, _kmpc_taskloop_5, to accommodate the strict
modifier.
In this patch I have added codegen support. When the strict modifier is
present alongside the grainsize or num_tasks clauses of taskloop
construct, the code now emits a call to _kmpc_taskloop_5, which includes
an additional parameter of type i32 with the value 1 to indicate the
strict modifier. If the strict modifier is not present, it falls back to
the existing _kmpc_taskloop API call.
---------
Co-authored-by: Chandra Ghale <ghale@pe31.hpc.amslabs.hpecorp.net>
The printing and parsing logic for struct types was still using ad-hoc
functions instead of the more conventional `print` and `parse` methods
whose declarations are automatically generated by TableGen.
This PR effectively renames these functions and uses them directly as
implementations for `print` and `parse` of `LLVMStructType`.
This additionally fixes linking errors when users or auto generated code
may call `print` and `parse` directly.
Fixes https://github.com/llvm/llvm-project/issues/117927
This patch updates the TMA intrinsic definitions to use the "NAME"
macro (inside the multiclass) instead of an empty string.
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
The verifier complains that synthesized IR functions have minsize and
optnone attributes which are incompatible. This patch removes optnone
attribute and updates affected tests as needed.
Link errors on builders:
- llvm-nvptx-nvidia-ubuntu
- llvm-nvptx64-nvidia-ubuntu
Add explicitly references to DebugInfoDWARF and Object.
Compile errors on builders:
- ppc64le-lld-multistage-test
- clang-ppc64le-linux-multistage
- clang-ppc64le-rhel
error: comparison of integers of different signs:
Add to the constants used in the 'EXPECT_EQ' the 'u' postfix.
It's never set to true. Also, using inheritable FDs in a multithreaded
process pretty much guarantees descriptor leaks. It's better to
explicitly pass a specific FD to a specific subprocess, which we already
mostly can do using the ProcessLaunchInfo FileActions.
This tune feature will disable latency scheduling heuristic.
This can reduce the number of spills/reloads but will cause some
regressions on some cores.
CPU may add this tune feature if they find it's profitable.
Reviewers: lukel97, michaelmaitland, asb, preames, mshockwave, topperc
Reviewed By: michaelmaitland, mshockwave, topperc
Pull Request: https://github.com/llvm/llvm-project/pull/115858
Update the test cases contains `any-of` printings from the
precomputeCost().
Origin message:
The any-of reduction contains phi and select instructions.
The select instruction might be optimized and removed in the vplan which
may cause VF difference between legacy and VPlan-based model. But if the
select instruction be removed, planContainsAdditionalSimplifications()
will catch it and disable the assertion.
Therefore, we can just remove the ayn-of reduction calculation in the
precomputeCost().
Recommit "[LV][VPlan] Remove any-of reduction from precomputeCost. NFC
(#117109)"
This patch adds the llvm_reachable() for TMA
reduction opcode printer method, outside the
switch.
We had this inside the default-case leading to
the warning below (and hence was removed):
error: default label in switch which covers all enumeration values
[-Werror,-Wcovered-switch-default]
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
- In the DWARF reader, for those attributes that can have an unsigned
value, allow for the following cases:
* Is an implicit constant
* Is an optional value
- The testing is done by creating a file with generated DWARF, using
`DwarfGenerator` (generate DWARF debug info for unit tests).
Check that we're not reusing any handler tag addresses before installing any
handlers. This ensures that either all of the handlers are installed*, or none
of them are, simplifying error recovery.
* Ignoring handlers whose tags couldn't be resolved at all: these were never
installed.
This PR extends the MLIR representation for `omp.target` ops by adding a
`map_idx` to `private` vars. This annotation stores the index of the map
info operand corresponding to the private var. If the variable does not
have a map operand, the `map_idx` attribute is either not present at all
or its value is `-1`.
This makes matching the private variable to its map info op easier (see
https://github.com/llvm/llvm-project/pull/116576 for usage).
Introduces a new conversion pass that rewrites `omp.loop` ops to their
semantically equivalent op nests bases on the surrounding/binding
context of the `loop` op. Not all forms of `omp.loop` are supported yet.
See `isLoopConversionSupported` for more info on which forms are
supported.
The terms "legal type" and "illegal type" are ambiguous when talking
about materializations. E.g., for target materializations we do not
necessarily convert from illegal to legal types. We convert from the
most recently mapped value to the type that was produced by converting
the original type.
---------
Co-authored-by: Markus Böck <markus.boeck02@gmail.com>
This PR adds a command line argument `--mlir-disable-diagnostic` for
disabling diagnostic information for mlir-opt.
When debugging with mlir-opt, some developers would like to disable the
diagnostic information and focus specifically on the dumped IR. For
example, https://github.com/triton-lang/triton/pull/5250
Summary:
When building a project in a runtime mode, the compilation database is a
separate CMake invocation. So its `compile_commands.json` file will be
placed elsewhere in the `runtimes/runtime-bins` directory. This is
somewhat annoying for ongoing development when a runtimes build is
necessary. This patch adds some CMake magic to merge the two files.
Summary:
We should now use the official™ way to include the files from
`libc/shared`. This required some code to make sure that it's not
included twice if multiple people use it as well as a sanity check on
the directory.
Sampling PGO has already been supported on Windows. This patch adds
/fprofile-sample-use= /fprofile-sample-use: /fno-profile-sample-use and
supports -fprofile-sample-use= for CL.
In C++20, a defaulted but implicitly deleted destructor is constexpr if
and only if the class has no virtual base class. This hasn't been
changed in C++23 by P2448R2.
Constexpr-ness on a deleted destructor affects almost nothing. The
`__is_literal` intrinsic is related, while the corresponding
`std::is_literal_type(_v)` utility has been removed in C++20. A recently
added example in `test/AST/ByteCode/cxx23.cpp` will become valid, and
the example is already accepted by GCC.
Clang currently behaves correctly in C++23 mode, because the
constexpr-ness on defaulted destructor is relaxed by P2448R2. But we
should make similar relaxation for an implicitly deleted destructor.
Fixes#85550.
This PR implements process_mrelease.
A previous PR was merged #117503, but failed on merge due to an issue in
the tests. Namely the failing tests were comparing against return type
as opposed to errno. This is fixed in this PR.
Adds a CMake option MLIR_DISABLE_CONFIGURE_PYTHON_DEV_PACKAGES which
gates doing package discovery and configuration for Python dev packages
by MLIR (this was made opt-out to preserve compatibility with
find_package(MLIR) based uses which do not set the standard options).
The default Python setup that MLIR does has been a problem for
super-projects that include LLVM for a long time because it forces a
very specific package discovery mechanism that is not uniform in all
uses.
When reviewing #117922, I noted that this would effectively be a break
the world event for downstreams, forcing them to adapt their nanobind
dep to the exact way that MLIR does it. Adding the option to just
wholesale skip the built-in configuration heuristics at least gives us a
mechanism to tell downstreams to migrate to, giving them complete
control and not requiring packaging workarounds. This seemed a better
option than (once again) creating a situation where downstreams could
not integrate the dep change without doing tricky infra upgrades, and it
removes the burden from the author of that patch from needing to think
about how this affects super-projects that include MLIR (i.e. they can
just be told to do it themselves as needed vs being in a wedged state
and unable to upgrade).
Use a `DCI` object to actually check the DAG combine level instead of
using the type `i1` because this assumption fails on AVX512 where we
have types like `v8i1` after legalization.
Closes#117684
At the moment Windows 32 bit SEH state stores are only emitted for
throwing calls.
Windows 32 bit SEH state stores should also be emitted before SEH scope
begin and before SEH scope end.
An invalid inline memory access would otherwise not trigger unwinding,
in combination with /EHa.
This fixes#90946
Change the intersect for the anticipated algorithm to ignore unknown
when anticipating. This effectively allows VXRM writes speculatively
because it could do a VXRM write even when there's branches where VXRM
is unneeded.
The importance of this change is because VXRM writes causes pipeline
flushes in some micro-architectures and so it makes sense to allow more
aggressive hoisting even if it causes some degradation for the slow
path.
An example is this code:
```
typedef unsigned char uint8_t;
__attribute__ ((noipa))
void foo (uint8_t *dst, int i_dst_stride,
uint8_t *src1, int i_src1_stride,
uint8_t *src2, int i_src2_stride,
int i_width, int i_height )
{
for( int y = 0; y < i_height; y++ )
{
for( int x = 0; x < i_width; x++ )
dst[x] = ( src1[x] + src2[x] + 1 ) >> 1;
dst += i_dst_stride;
src1 += i_src1_stride;
src2 += i_src2_stride;
}
}
```
With this patch, the code above generates a hoisting VXRM writes out of
the outer loop.
This change matches a subset of vcompress patterns during shuffle
lowering. The subset implemented requires a contiguous prefix of
demanded elements followed by undefs. This subset was chosen for two
reasons: 1) which elements to spurious demand is a non-obvious problem,
and 2) my first several attempts at implementing the general case were
buggy. I decided to go with the simple case to start with.
vcompress scales better with LMUL than a general vrgather, and at least
the SpaceMit X60, has higher throughput even at m1. It also has the
advantage of requiring smaller vector constants at one bit per element
as opposed to vrgather which is a minimum of 8 bits per element. The
downside to using vcompress is that we can't fold a vselect into it, as
there is no masked vcompress variant.
For reference, here are the relevant throughputs from camel-cdr's data
table on BP3 (X60):
vrgather.vv v8,v16,v24 4.0 16.0 64.0 256.0
vcompress.vm v8,v16,v24 3.0 10.0 36.0 136.
vmerge.vvm v8,v16,v24,v0 2.0 4.0 8.0 16.0
The largest concern with the extra vmerge is that we locally increase
register pressure. If we do have masking, we also have a passthru,
without the ability to fold that into the vcompress, we need to keep it
alive a bit longer. This can hurt at e.g. m8 where we have very few
architectural registers. As compared with the vrgather.vv sequence, this
is only one additional m1 VREG - since we no longer need the index
vector. It compares slightly worse against vrgatherie16.vv which can use
index vectors smaller than other operands. Note that we could
potentially fold the vmerge if only tail elements are being preserved; I
haven't investigated this.
It is unfortunately hard given our current lowering structure to know if
we're emitting a shuffle where masking will follow. Thankfully, it
doesn't seem to show up much in practice, so I think we can probably
ignore it.
This patch only handles single source compress idioms at the moment.
This is an effort to avoid interacting with other patches on review for
changing how we canonicalize length changing shuffles.
This patch simplifies and extends the logic used when compressing masks
emitted by `vector.constant_mask` to support extracting 1-D vectors from
multi-dimensional vector loads. It streamlines mask computation, making
it applicable for multi-dimensional mask generation, improving the
overall handling of masked load operations.
Summary:
We can simply include this header from the shared directory now and do
not need to have this level of indirection. Simply stash it with the
other libc opcode handlers.
If we were able to move the printf handlers to the shared directory then
this could just be a header as well, which would HEAVILY simplify the
mess associated with building the RPC server first in the projects
build, then copying it to the runtimes build.
Summary:
We currently have an unnecessary level of indirection when initializing
the RPC client. This is a holdover from when the RPC client was not
trivially copyable and simply makes it more complicated. Here we use the
`asm` syntax to give the C++ variable a valid name so that we can just
copy to it directly.
Another advantage to this, is that if users want to piggy-back on the
same RPC interface they need only declare theirs as extern with the same
symbol name, or make it weak to optionally use it if LIBC isn't
avaialb.e