All non-existing local times in a contiguous range should map to the
same time point. This fixes a bug where the times inside the range were
mapped to the wrong time.
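As a hedged illustration (not from the patch) of the intended behaviour, assuming a tzdb-enabled C++20 `<chrono>` implementation; the zone and date below are arbitrary examples:
```
#include <cassert>
#include <chrono>

int main() {
  using namespace std::chrono;
  // The EU spring-forward gap on 2025-03-30 removes local times in
  // [02:00, 03:00); all of them should resolve to the same sys_time.
  const time_zone *berlin = locate_zone("Europe/Berlin");
  auto a = berlin->to_sys(local_days{2025y / March / 30} + 2h + 1min,
                          choose::earliest);
  auto b = berlin->to_sys(local_days{2025y / March / 30} + 2h + 59min,
                          choose::earliest);
  assert(a == b); // every non-existing time in the gap maps to one point
}
```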
Fixes: #113654
Fixes MLIR lowering passes in x86vector integration tests.
The tests are refactored with a lowering pass bundle which ensures that all
dialects are lowered into the LLVM dialect.
This simplifies the test pipelines and addresses missing arith lowering.
This change folds (setcc ne (lshr x c) 0) for 64-bit types and constants
c >= 32. This fold already existed for other types or smaller constants
but was not applicable to 64-bit types and constants >= 32 due to a
comparison of the constant c with the bit size of the setcc operation.
The type of this operation is legalized to i32, which does not
necessarily match the type of the lshr operation. Use the bit size of
the type of the lshr operation instead for the comparison.
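For context, a rough C++ source-level illustration (not taken from the patch) of where this pattern arises for 64-bit integers:
```
#include <cstdint>

// A 64-bit right shift by a constant >= 32 compared against zero produces the
// (setcc ne (lshr x c) 0) pattern that the fold now also handles.
bool hasBitsAbove40(uint64_t x) {
  return (x >> 40) != 0;
}
```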
Fixes #122380.
In #127378 it was reported that builds without clspv targets enabled
were failing after #124727, as all targets had a dependency on a file
that only clspv targets generated.
A quick fix was merged in #127315 which wasn't correct. It moved the
dependency on those generated files to the spirv targets, instead of
onto the clspv targets. This means a build with spirv targets and
without clspv targets would see the same problems as #127378 reported.
I tried simply removing the requirement to explicitly add dependencies
to the custom command, relying instead on the file-level dependencies.
This didn't seem reliable enough; in some cases on a Makefiles build,
the clang command compiling, e.g., convert.cl would begin before the
file was fully written.
Instead, we keep the target-level dependency but automatically infer it
based on the generated file name, to avoid manual book-keeping of pairs
of files and targets.
This commit also fixes what looks like an unintended bug where, when
ENABLE_RUNTIME_SUBNORMAL was enabled, the OpenCL conversions weren't
being compiled.
- `CodeGen/AMDGPU/spill_more_than_wavesize_csr_sgprs.ll`
- `CodeGen/AMDGPU/call-preserved-registers.ll`
- `CodeGen/AMDGPU/stack-realign.ll`
This is in preparation for another PR.
Concat/unpack the src subvectors together in the bottom 128-bit vector and then extend with a single EXTEND/EXTEND_VECTOR_INREG instruction.
This required the getEXTEND_VECTOR_INREG helper to be tweaked to accept EXTEND_VECTOR_INREG opcodes as well, to avoid having to remap the opcode between both types.
This is a fix for https://github.com/llvm/llvm-project/issues/126949
There are two issues being fixed here.
First, in some cases, OMPIRBuilder generates empty target task proxy functions. This happens
when the target kernel doesn't use any stack-allocated data (either no data or only globals).
The second problem is encountered when the target task, i.e. the code that makes the target call,
spans a single basic block. This usually happens when we do not generate a target or device kernel
launch and instead fall back to the host. In such cases, we end up not outlining the target task entirely.
This can cause us to call the target kernel twice: once via the target task proxy function and a second time
via the host fallback.
This PR fixes both of these problems and updates some tests to catch these problems should this patch fail.
Update VPWidenPHIRecipe to use the predecessors in VPlan to determine
the incoming blocks instead of tracking them separately. This brings
VPWidenPHIRecipe in line with the other phi recipes.
PR: https://github.com/llvm/llvm-project/pull/126388
After #117558 landed, this code would assert "Value is not an N-bit
unsigned value" in getConstant(), from a test case in zig.
Co-authored-by: Craig Topper <craig.topper@sifive.com>
Fixes #127296
This is necessary to enable composing subregisters in peephole-opt.
For now use a brute force table to find the return value. The worst
case target is AMDGPU with a 399 x 399 entry table.
This PR closes #124474.
Calling the `read` or `recv` function on a non-blocking file descriptor
or an invalid file descriptor (`-1`) does not block inside a
critical section.
This commit checks for non-blocking file descriptors assigned by the `open`
function with the `O_NONBLOCK` flag.
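A hedged C++ sketch of the kind of code the checker should now accept without a report (function names and locking style are illustrative only):
```
#include <fcntl.h>
#include <unistd.h>

#include <mutex>

std::mutex m;
char buf[256];

void pollDevice(const char *path) {
  // The descriptor is explicitly non-blocking, so the read below cannot block
  // while the mutex is held; the checker should not warn about it.
  int fd = open(path, O_RDONLY | O_NONBLOCK);
  if (fd < 0)
    return;
  m.lock();                          // enter critical section
  (void)read(fd, buf, sizeof(buf));  // non-blocking read: no report expected
  m.unlock();                        // leave critical section
  close(fd);
}
```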
---------
Co-authored-by: Balazs Benics <benicsbalazs@gmail.com>
Extends generic `loop` directive support by adding support for the `bind`
clause. Since semantic checking does the heavy lifting of verifying the
proper usage of the clause modifier, we can simply enable code-gen for
`teams loop bind(...)` without the need to differentiate between the
values the clause can accept.
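For reference, a hedged C/C++-style OpenMP sketch of the directive shape this enables; the patch itself targets Flang's Fortran lowering, so this is only an illustration of the clause:
```
void saxpy(int n, float a, const float *x, float *y) {
  // Combined construct with an explicit bind clause; semantic checks verify
  // the binding value, so code-gen does not need to special-case it.
  #pragma omp target teams loop bind(teams)
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}
```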
These cannot be static compile errors, and should be treated as
poison. Invalid casts may be introduced which are dynamically dead.
For example:
```
void foo(volatile generic int* x) {
  __builtin_assume(is_shared(x));
  *x = 4;
}

void bar() {
  private int y;
  foo(&y); // violation, wrong address space
}
```
This could produce a compile-time backend error or not, depending on
the optimization level. Similarly, the new test demonstrates a failure
on a lowered atomicrmw which required inserting runtime address
space checks. The invalid cases are dynamically dead, so we should not
error, and the AtomicExpand pass shouldn't have to consider the details
of the incoming pointer to produce valid IR.
This should go to the release branch. This fixes broken -O0 compiles
with 64-bit atomics which would have started failing in
1d0370872f28ec9965448f33db1b105addaf64ae.
This fixes a false positive caused by #114044.
For `GSLPointer*` types, it's less clear whether the lifetime issue is
about the GSLPointer object itself or the owner it points to. To avoid
false positives, we take a conservative approach in our heuristic.
Fixes #127195
(This will be backported to release 20).
On non-Windows platforms, also create a dynamic library version of
the runtime. Building either version of the library can be switched on
using FLANG_RT_ENABLE_STATIC=ON and FLANG_RT_ENABLE_SHARED=ON,
respectively. The default is to build only the static library, consistent
with previous behaviour. This is because, given the way the flang driver
invokes the linker, most linkers choose the dynamic library by default
if it is available. Building the dynamic library therefore causes
flang-built executables to depend on `libflang_rt.so`, unless explicitly
told otherwise.
In the soft floating-point ABI, this function takes the double argument as a
pair of registers, r0 and r1.
The ordering of these two registers follows the endianness rules,
therefore the register on which the bit flipping must happen depends on
the endianness.
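A minimal stand-alone C++ sketch of the underlying idea (not the builtin's actual implementation): the IEEE 754 sign bit sits in the most significant 32-bit half, so which half to touch depends on the byte order.
```
#include <cstdint>
#include <cstring>

// Illustration only: flip the sign bit of a double via its two 32-bit halves,
// mirroring how the value sits in the r0/r1 register pair.
double flipSign(double d) {
  uint32_t w[2];
  std::memcpy(w, &d, sizeof d);
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
  w[0] ^= 0x80000000u; // big endian: the high (sign-carrying) word comes first
#else
  w[1] ^= 0x80000000u; // little endian: the high word comes second
#endif
  std::memcpy(&d, w, sizeof d);
  return d;
}
```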
This patch adds support for describing per-write resource cycle counts
for ReadAdvance records via a new optional field called `tunables`.
This makes it possible to declare ReadAdvance records such as:
def : ReadAdvance<Read_C, 1, [Write_A, Write_B], [2]>;
The above will effectively declare two entries in the ReadAdvance
table for Read_C, one for Write_A with a cycle count of 1+2, and one for
Write_B with a cycle count of 1+0 (omitted values are assumed 0).
The field `tunables` provides a list of deltas relative to the base
`cycle` count of the ReadAdvance. Since the field is optional and
defaults to a list of 0's, this change doesn't affect current targets.
The pattern was returning success() by default, which made the greedy
pattern application act as if the IR was modified even though nothing
had changed; this can prevent it from converging for no legitimate
reason.
The patch makes the rewrite pattern return failure() by default, and
success() if and only if the IR changed.
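A hedged C++ sketch of the convention the patch restores; the pattern and helpers below are illustrative, not the actual code:
```
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

// Illustrative only: a rewrite pattern should return success() if and only if
// it changed the IR; otherwise the greedy driver believes progress was made
// on every iteration and may never converge.
struct SpecializeIfPossible : RewritePattern {
  SpecializeIfPossible(MLIRContext *ctx)
      : RewritePattern(MatchAnyOpTypeTag(), /*benefit=*/1, ctx) {}

  LogicalResult matchAndRewrite(Operation *op,
                                PatternRewriter &rewriter) const override {
    if (!canSpecialize(op))
      return failure();                     // IR untouched: report no match
    rewriteToSpecializedForm(op, rewriter); // mutate the IR
    return success();                       // IR actually changed
  }

  // Hypothetical helpers standing in for the real specialization logic.
  static bool canSpecialize(Operation *op) { return false; }
  static void rewriteToSpecializedForm(Operation *op, PatternRewriter &rewriter) {}
};
```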
An example of the unexpected behavior: running `mlir-opt input.mlir
--linalg-specialize-generic-ops` produces an empty mlir output, with
`input.mlir` as follows:
```
#map = affine_map<(d0) -> (d0)>
func.func @f(%arg0: tensor<8xi32>, %arg1: tensor<8xi32>) -> tensor<8xi32> {
%0 = tensor.empty() : tensor<8xi32>
%1 = linalg.generic {indexing_maps = [#map, #map, #map], iterator_types = ["parallel"]} ins(%arg0, %arg1: tensor<8xi32>, tensor<8xi32>) outs(%0: tensor<8xi32>) {
^bb0(%in: i32, %in_0: i32, %out: i32):
%2 = arith.addi %in, %in_0: i32
linalg.yield %2: i32
} -> tensor<8xi32>
return %1 : tensor<8xi32>
}
```
This PR adds `cmd-options` to the `gpu-lower-to-nvvm-pipeline` pipeline
and the `nvvm-attach-target` pass, allowing users to pass flags to the
downstream compiler, *ptxas*.
Example:
```
mlir-opt -gpu-lower-to-nvvm-pipeline="cubin-chip=sm_80 ptxas-cmd-options='-v --register-usage-level=8'"
```
Moves `PackOp` and `UnPackOp` from the Tensor dialect to Linalg. This change
was discussed in the following RFC:
* https://discourse.llvm.org/t/rfc-move-tensor-pack-and-tensor-unpack-into-linalg
This change involves significant churn but only relocates existing code - no new
functionality is added.
**Note for Downstream Users**
Downstream users must update references to `PackOp` and `UnPackOp` as follows:
* Code: `s/tensor::(Un)PackOp/linalg::(Un)PackOp/g`
* Tests: `s/tensor.(un)pack/linalg.(un)pack/g`
No other modifications should be required.
`computeStaticLoopSizes()` is functionally identical to `getStaticLoopRanges()`.
Replace all uses of `computeStaticLoopSizes()` by `getStaticLoopRanges()` and remove the former.
My previous attempt (#126913) at fixing the flaky case was on a good
track when I used the begin locations as a stable ordering. However, I
forgot to consider the case when the begin locations are the same among
the Exprs.
In an `EXPENSIVE_CHECKS` build, arrays are randomly shuffled prior to
sorting them. This exposed the flaky behavior much more often, basically
breaking the "stability" of the vector - as it should.
Because of this, I had to revert the previous fix attempt in #127034.
To fix this, this time I use `Expr::getID` as a stable ID for an Expr.
Hopefully fixes #126619
Hopefully fixes #126804
These relocations apply to 16-bit Thumb instructions, so reading 16 bits
rather than 32 bits ensures the correct bits are masked and written
back. This fixes the incorrect masking and aligns the relocation logic
with the instruction encoding.
Before this patch, 32 bits were read from the ELF object. This did not
align with the instruction size of 16 bits, but the masking incidentally
made it all work nonetheless. However, this was the case only in
little-endian mode.
In big endian mode, the read 32-bit word had to have its bytes reversed.
With this byte reordering, the masking would be applied to the wrong
bits, hence causing the incorrect encoding to be produced as a result of
the relocation resolution.
The added test checks the result for both little and big endian modes.
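A simplified C++ sketch (not the actual relocation-resolution code) of the halfword-sized read/modify/write this change switches to, using LLVM's endian helpers:
```
#include "llvm/Support/Endian.h"

#include <cstdint>

using namespace llvm;

// Sketch only: apply a relocation to a 16-bit Thumb instruction in place.
// Reading exactly one halfword keeps the mask aligned with the encoding for
// both little- and big-endian objects; a 32-bit read only happened to work
// in little-endian mode.
static void patchThumb16(uint8_t *loc, uint16_t mask, uint16_t bits,
                         endianness e) {
  uint16_t insn = support::endian::read<uint16_t>(loc, e);
  insn = static_cast<uint16_t>((insn & ~mask) | (bits & mask));
  support::endian::write<uint16_t>(loc, insn, e);
}
```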
`let constructor` is deprecated since the TableGen backend emits most
of the glue logic to build a pass. This PR retires the td method for
most passes in the Conversion directory (I need another pass for the rest).
This patch adds initial support for vectorizing literal struct return
values. Currently, this is limited to the case where the struct is
homogeneous (all elements have the same type) and not packed. The users
of the call also must all be `extractvalue` instructions.
The intended use case for this is vectorizing intrinsics such as:
```
declare { float, float } @llvm.sincos.f32(float %x)
```
Mapping them to structure-returning library calls such as:
```
declare { <4 x float>, <4 x float> } @Sleef_sincosf4_u10advsimd(<4 x float>)
```
Or their widened form (such as `@llvm.sincos.v4f32` in this case).
Implementing this required two main changes:
1. Supporting widening `extractvalue`
2. Adding support for vectorized struct types in LV
* This is mostly limited to parts of the cost model and scalarization
Since the supported use case is narrow, the required changes are
relatively small.
The last use was removed in:
commit ac9e67756e0157793d565c2cceaf82e4403f58ba
Author: Yingwei Zheng <dtcxzyw2333@gmail.com>
Date: Mon Feb 26 01:53:16 2024 +0800
Enable the vectorizer to access interleaved memory. This means that,
when it's decided to be profitable, the memory accesses can be
vectorized instead of the value being built up by a sequence of
load_lane instructions. This will often increase the vectorization
factor of the loop, leading to significantly better performance.
I ran a reasonably large collection of benchmarks and most are not
affected by this change, with most performance changes <1%. But I see a
2.5% speedup for the total run time of TSVC, 1% speedup for SPEC2017
x265, 28% speedup for a ResNet workload and 95% for libyuv. This is
running V8 on an AArch64 box.
When lowering EXTEND_VECTOR_INREG, check whether the operand is a
shuffle that is moving the top half of a vector into the lower half. If
so, we can EXTEND_HIGH the input to the shuffle instead.