According to its doc-comment, `isImplicit` is meant to return true if the
expression is an implicit location description (one that describes an object,
or part of an object, which has no actual location but whose value can be
computed from available program state).
There's a brief entry for `DW_OP_LLVM_tag_offset` in the LangRef and there's
some info in the original commit fb9ce100d19be130d004d03088ccd4af295f3435.
From what I can tell it doesn't look like `DW_OP_LLVM_tag_offset` affects
whether or not the location is implicit; the opcode doesn't get included in the
final location description but instead is added as an attribute to the variable.
This was tripping an assertion in the latest application of the fix for #76545,
#78606, where an expression containing a `DW_OP_LLVM_tag_offset` is split into
a fragment (i.e., one that describes a part of the whole variable).
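For reference, a rough sketch (not taken from the failing case, values are hypothetical) of what such an expression looks like in textual IR: a tag offset combined with a fragment operator.
```
; hypothetical values: tag offset 1, fragment covering bits [0, 32)
!0 = !DIExpression(DW_OP_LLVM_tag_offset, 1, DW_OP_LLVM_fragment, 0, 32)
```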
Support for promoted and NF LZCNT/POPCNT/TZCNT was added in #79954.
Because null_frag is used in the patterns for these variants, TableGen cannot
infer mayLoad = 1 for them.
This can be tested by MCA tests, which will be added once
-mcpu=<cpu_with_apx> is supported.
A FileCheck test has been added:
```
./bin/llvm-lit -sv llvm/test/tools/llvm-gsymutil/X86/elf-dwo.yaml
```
Manual test steps:
- Create a binary with split DWARF:
```
clang++ -g -gdwarf-4 -gsplit-dwarf main.cpp -o main_split
```
- Remove or rename the .dwo file so llvm-gsymutil can't find it:
```
mv main_split-main.dwo main_split-main__.dwo
```
- Now run the llvm-gsymutil conversion; it should print a warning both with
and without the `--quiet` flag:
```
$ ./bin/llvm-gsymutil --convert=./main_split
Input file: ./main_split
Output file (x86_64): ./main_split.gsym
warning: Unable to retrieve DWO .debug_info section for main_split-main.dwo
Loaded 0 functions from DWARF.
Loaded 12 functions from symbol table.
Pruned 0 functions, ended with 12 total
```
```
$ ./bin/llvm-gsymutil --convert=./main_split --quiet
Input file: ./main_split
Output file (x86_64): ./main_split.gsym
warning: Unable to retrieve DWO .debug_info section for some object files. (Remove the --quiet flag for full output)
Pruned 0 functions, ended with 12 total
```
We don't have an AMO instruction for Nand, so with the A extension we
use an LR/SC loop. If we have Zacas we can use a CAS loop instead.
According to the Zacas spec, a CAS loop scales to highly parallel
systems better than LR/SC.
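As a rough illustration (not a test from this patch), IR like the following is what gets expanded to an LR/SC loop with the A extension, or to a CAS loop when Zacas is available:
```
define i32 @atomic_nand(ptr %p, i32 %v) {
  ; no native AMO for nand, so this is expanded into an LR/SC or CAS loop
  %old = atomicrmw nand ptr %p, i32 %v seq_cst
  ret i32 %old
}
```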
This is a follow up to an item I noted in my submission comment for
e947f95. I don't have a real world example where this is triggering
unprofitably, but avoiding the transform when profile data indicates the loop
is short-running seems quite reasonable. It's also now come
up as a possibility in a regression twice in two days, so I'd like to
get this in to close out the possibility if nothing else.
The original review dropped the threshold for short trip count loops. I
will return to that in a separate review if this lands.
If we can't produce a large enough index vector in i8, we may need to legalize
the shuffle (via scalarization, which in turn gets lowered into stack usage).
This patch makes two related changes:
* Defer legalization until we actually need to generate the vrgather
instruction. With the new recursive structure, this only happens when
doing the fallback for one of the arms.
* Check the actual mask values for something outside of the representable
range.
Both are covered by recently added tests.
Add TableGen patterns to convert more instructions to boolean
expressions:
- **mul -> and/or**: i1 multiply instructions currently cannot be
selected causing the compiler to crash. See
https://github.com/llvm/llvm-project/issues/57404
- **select -> and/or**: Converting selects to and/or can enable more
optimizations. `InstCombine` cannot do this as aggressively due to
poison semantics.
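An IR-level sketch of the two patterns above (not the TableGen patterns themselves):
```
define i1 @mul_i1(i1 %a, i1 %b) {
  ; an i1 multiply has the same truth table as an 'and'
  %m = mul i1 %a, %b
  ret i1 %m
}

define i1 @select_i1(i1 %c, i1 %b) {
  ; poison aside, a select with a constant true arm acts like an 'or'
  %s = select i1 %c, i1 true, i1 %b
  ret i1 %s
}
```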
Add a new node `AArch64ISD::URSHR_I_PRED`.
`srl(add(X, 1 << (ShiftValue - 1)), ShiftValue)` is transformed to
`urshr`, or to `rshrnb` (as before) if the result is truncated.
`uzp1(rshrnb(uunpklo(X),C), rshrnb(uunpkhi(X), C))` is converted to
`urshr(X, C)` (tested by the wide_trunc tests).
Pattern matching code in `canLowerSRLToRoundingShiftForVT` is taken
from the prior rshrnb code. It returns true if the add has NUW or if the
number of bits used in the return value allows us to ignore the
overflow (tested by the rshrnb test cases).
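A rough IR-level sketch (with hypothetical constants) of the `srl(add(X, 1 << (ShiftValue - 1)), ShiftValue)` pattern, here with ShiftValue = 4 so the rounding constant is 8:
```
define <vscale x 8 x i16> @rounding_srl(<vscale x 8 x i16> %x) {
  ; add the rounding constant 1 << (4 - 1) = 8; nuw lets us ignore overflow
  %add = add nuw <vscale x 8 x i16> %x, shufflevector (<vscale x 8 x i16> insertelement (<vscale x 8 x i16> poison, i16 8, i64 0), <vscale x 8 x i16> poison, <vscale x 8 x i32> zeroinitializer)
  ; shift right by ShiftValue = 4
  %srl = lshr <vscale x 8 x i16> %add, shufflevector (<vscale x 8 x i16> insertelement (<vscale x 8 x i16> poison, i16 4, i64 0), <vscale x 8 x i16> poison, <vscale x 8 x i32> zeroinitializer)
  ret <vscale x 8 x i16> %srl
}
```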
This patch adds support for common and local symbols in the TOC for AIX.
Note that we need to update isVirtualSection: a common symbol in the
TOC will have the symbol type XTY_CM and will be initialized when placed
in the TOC, so sections with this type are no longer virtual.
---------
Co-authored-by: Zaara Syeda <syzaara@ca.ibm.com>
Removes the MaterializationResponsibility::addDependencies and
addDependenciesForAll methods, and transfers dependency registration to
the notifyEmitted operation. The new dependency registration allows
dependencies to be specified for arbitrary subsets of the
MaterializationResponsibility's symbols (rather than just single symbols
or all symbols) via an array of SymbolDependenceGroups (pairs of symbol
sets and corresponding dependencies for that set).
This patch aims to both improve emission performance and simplify
dependence tracking. By eliminating some states (e.g. symbols having
registered dependencies but not yet being resolved or emitted) we make
some errors impossible by construction, and reduce the number of error
cases that we need to check. NonOwningSymbolStringPtrs are used for
dependence tracking under the session lock, which should reduce
ref-counting operations, and intra-emit dependencies are resolved
outside the session lock, which should provide better performance when
JITing concurrently (since some dependence tracking can happen in
parallel).
The Orc C API is updated to account for this change, with the
LLVMOrcMaterializationResponsibilityNotifyEmitted API being modified and
the LLVMOrcMaterializationResponsibilityAddDependencies and
LLVMOrcMaterializationResponsibilityAddDependenciesForAll operations
being removed.
https://github.com/llvm/llvm-project/pull/78171 added support for
non-consecutive local value numbers. This extends that support to global
value numbers (for globals and functions).
This means that it is now possible to delete an unnamed global
definition/declaration without breaking the IR.
This is a lot less common than unnamed local values, but it seems like
something we should support for consistency. (Unnamed globals are used a
lot in Rust though.)
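A small sketch of what is now accepted: a module whose unnamed globals are numbered non-consecutively, e.g. after `@1` was deleted.
```
@0 = global i32 1
@2 = global i32 3   ; the definition of @1 was deleted; the numbering gap now parses
```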
Currently, `PPCMergeStringPool` merges the global variables after the
`AsmPrinter` initializer has added the global variables to its symbol list.
This patch moves the merging work of `PPCMergeStringPool` to its
initializer, just like GlobalMerge does, to avoid adding merged
global variables to the `AsmPrinter` symbol list.
An extra space causes the checks generated by update_mir_test_checks.py to be
unusable.
```
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 4
# RUN: llc -mtriple=x86_64-- -o - %s -run-pass=none -verify-machineinstrs -simplify-mir | FileCheck %s
---
name: foo
body: |
; CHECK-LABEL: name: foo
; CHECK: bb.0:
; CHECK-NEXT: successors:
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: bb.1:
; CHECK-NEXT: RET 0, $eax
bb.0:
successors:
bb.1:
RET 0, $eax
...
```
The failure log is as follows:
```
llvm/test/CodeGen/MIR/X86/unreachable-block-print.mir:9:16: error: CHECK-NEXT: is on the same line as previous match
; CHECK-NEXT: {{ $}}
^
<stdin>:21:13: note: 'next' match was here
successors:
^
<stdin>:21:13: note: previous match ended here
successors:
```
JumpThreading may perform AA queries while the dominator tree is not up
to date, which may result in miscompilations.
Fix this by adding a new AAQI option to disable the use of the dominator
tree in BasicAA.
Fixes https://github.com/llvm/llvm-project/issues/79175.
Update createScalarIVSteps to take an insert point as a parameter. This
ensures that the inserted scalar steps are in the same order as the
recipes they replace (vs in reverse order as currently). This helps to
reduce the diff for follow-up changes.
Broadcasting a single float should not be any slower than
loading 32 bytes using vmovaps, so rematerializing the broadcast can help
reduce register spills when register pressure is high.
Miscompilation arises due to instruction combining of cast pairs of the
form `bitcast bfloat to half` + `<FPOp> half to ...` or `bitcast half to
bfloat` + `<FPOp> bfloat to ...`. For example, `bitcast bfloat to half` +
`fpext half to double` reduces to `fpext bfloat to double`, and `bitcast
half to bfloat` + `fpext bfloat to double` reduces to `fpext half to
double`. These conversions are incorrect: they assume the representations
of `bfloat` and `half` are equivalent because the types have the same
width, and as a consequence miscompilation arises.
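A minimal reproducer sketch of the first case (hypothetical function name):
```
define double @f(bfloat %x) {
  ; must not be folded to "fpext bfloat %x to double": the bitcast
  ; reinterprets the bits of %x, it does not convert the value
  %h = bitcast bfloat %x to half
  %d = fpext half %h to double
  ret double %d
}
```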
Fixes #61984
Calling a `__arm_locally_streaming` function from a function that
is not a streaming-SVE function would lead to incorrect inlining.
The issue didn't surface because the tests were not testing what
they were supposed to test.
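For context, a minimal sketch of the situation at the IR level, assuming the locally-streaming body attribute is spelled `aarch64_pstate_sm_body`:
```
; callee is locally streaming; inlining its body into the non-streaming
; caller would drop the streaming-mode switch around it
define void @callee() "aarch64_pstate_sm_body" {
  ret void
}

define void @caller() {
  call void @callee()
  ret void
}
```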
If `-allow-incomplete-ir` is enabled, automatically insert declarations
for missing globals.
If a global is only used in calls with the same function type, insert a
function declaration with that type.
Otherwise, insert a dummy i8 global. The fallback case could be extended
with various heuristics (e.g. we could look at load/store types), but
I've chosen to keep it simple for now, because I'm unsure to what degree
this would really be useful without more experience. I expect that in most
cases the declaration type doesn't really matter (note that the type of
an external global specifies a *minimum* size only, not a precise size).
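For illustration, a sketch of incomplete IR (hypothetical names) and the declarations that would be inserted for it:
```
; with -allow-incomplete-ir, parsing this module inserts (roughly):
;   declare void @callee(i32)      ; inferred from the call's function type
;   @data = external global i8     ; fallback dummy i8 global
define void @caller(i32 %x) {
  call void @callee(i32 %x)
  %v = load i32, ptr @data
  ret void
}
```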
This is a followup to https://github.com/llvm/llvm-project/pull/78421.
This patch ensures that host runtime functions are not called to handle the
OpenMP teams clause on the device.
GPU code for the pragma `omp target teams distribute parallel do` will
require only one call to the OpenMP loop-worksharing GPU runtime. Support
for it will be added later.
This patch does not include changes required for handling `omp target
teams` for the host side.
The goal of this PR is to tolerate differences between the description of
formal arguments in function metadata (represented by "kernel_arg_type")
and the actual LLVM parameter types. A compiler may use the "kernel_arg_type"
metadata fields to encode detailed type information, whereas LLVM IR may use
a more general type for the actual parameter, in particular an opaque pointer
type. This PR proposes to resolve this by falling back to the actual LLVM
parameter types when lowering formal function arguments in cases where the
type can't be created from the string content of "kernel_arg_type", i.e.,
when "kernel_arg_type" contains a type unknown to the SPIR-V Backend.
An example of the issue is
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/blob/main/test/transcoding/KernelArgTypeInOpString.ll,
where a compiler generates detailed `kernel_arg_type` info for the following
kernel function in the form `!{!"image_kernel_data*", !"myInt",
!"struct struct_name*"}`, while in LLVM IR the same arguments are referred to
as `@foo(ptr addrspace(1) %in, i32 %out, ptr addrspace(1) %outData)`.
Both definitions are correct, and the resulting LLVM IR is correct, but the
lowering stage of the SPIR-V Backend fails to generate a SPIR-V type.
```
typedef int myInt;

typedef struct {
  int width;
  int height;
} image_kernel_data;

struct struct_name {
  int i;
  int y;
};

void kernel foo(__global image_kernel_data* in,
                __global struct struct_name *outData,
                myInt out) {}
```
```
define spir_kernel void @foo(ptr addrspace(1) %in, i32 %out, ptr addrspace(1) %outData) ... !kernel_arg_type !7 ... {
entry:
ret void
}
...
!7 = !{!"image_kernel_data*", !"myInt", !"struct struct_name*"}
```
The PR changes the contract of `SPIRVType *getArgSPIRVType(...)` so that it
may return `nullptr` to signal that the metadata string content is not
recognized; corresponding comments are added and a couple of checks for
`nullptr` are inserted where appropriate.
PAL uses ELF REL (not RELA) relocations which can only store a 32-bit
addend in the instruction, even for reloc types like R_AMDGPU_ABS32_HI
which require the upper 32 bits of a 64-bit address calculation to be
correct. This means that it is not safe to fold an arbitrary offset into
a GlobalAddressSDNode, so stop doing that.
In practice this is mostly a problem for small negative offsets which do
not work as expected because PAL treats the 32-bit addend as unsigned.
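As a rough illustration (hypothetical global), an offset like the one below could previously be folded into the global's relocation addend; since PAL treats the 32-bit REL addend as unsigned, such small negative offsets break.
```
@g = external addrspace(1) global [16 x i32]

define ptr addrspace(1) @before_g() {
  ; a small negative offset from a global's address
  ret ptr addrspace(1) getelementptr (i8, ptr addrspace(1) @g, i64 -8)
}
```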
This patch merges the logic of `cannotBeOrderedLessThanZeroImpl` into
`computeKnownFPClass` to improve the signbit inference.
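For illustration, one kind of sign-bit fact this covers (a sketch, not a test from the patch): the result of `llvm.fabs` is known to have a clear sign bit, so a comparison against a negative bound can fold away.
```
declare float @llvm.fabs.f32(float)

define i1 @fabs_never_negative(float %x) {
  %a = call float @llvm.fabs.f32(float %x)
  ; %a has a known-clear sign bit, so this compare can fold to false
  %c = fcmp olt float %a, 0.0
  ret i1 %c
}
```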
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
This reverts commit f8525030004f907cd108e7c18df255a6d3b23124.
It was supposed to speed things up but llvm-compile-time-tracker.com
showed a slight slow down.
If the demanded bits of an instruction are full, we don't have to
recurse to its users, but we may still have to clear flags on the
instruction itself.
Fixes https://github.com/llvm/llvm-project/issues/80113.
This patch introduces a 'COALESCER_BARRIER', a pseudo node that expands to a
'nop' but stops the register allocator from coalescing a COPY node when its
use/def crosses an SMSTART or SMSTOP instruction.
For example:
%0:fpr64 = COPY killed $d0
undef %2.dsub:zpr = COPY %0 // <- Do not coalesce this COPY
ADJCALLSTACKDOWN 0, 0
MSRpstatesvcrImm1 1, 0, csr_aarch64_smstartstop, implicit-def dead $d0
$d0 = COPY killed %0
BL @use_f64, csr_aarch64_aapcs
If the COPY would be coalesced, that would lead to:
$d0 = COPY killed %0
being replaced by:
$d0 = COPY killed %2.dsub
which means the whole ZPR reg would be live up to the call, causing the
MSRpstatesvcrImm1 (smstop) to spill/reload the ZPR register:
str q0, [sp] // 16-byte Folded Spill
smstop sm
ldr z0, [sp] // 16-byte Folded Reload
bl use_f64
which would be incorrect for two reasons:
1. The program may load more data than it has allocated.
2. If there are other SVE objects on the stack, the compiler might use the
   'mul vl' addressing modes to access the spill location.
By disabling the coalescing, we get the desired results:
str d0, [sp, #8] // 8-byte Folded Spill
smstop sm
ldr d0, [sp, #8] // 8-byte Folded Reload
bl use_f64
Previously we called ignoreCSRForAllocationOrder on every alias of every
CSR which was expensive on targets like AMDGPU which define a very large
number of overlapping register tuples.
On such targets it is simpler and faster to call
ignoreCSRForAllocationOrder once for every physical register.
Differential Revision: https://reviews.llvm.org/D146735
Similar to #78403, but for scalable `vwadd(u).wv`, given that #76785 has been recommitted.
### Code
```
define <vscale x 8 x i64> @vwadd_wv_mask_v8i32(<vscale x 8 x i32> %x, <vscale x 8 x i64> %y) {
%mask = icmp slt <vscale x 8 x i32> %x, shufflevector (<vscale x 8 x i32> insertelement (<vscale x 8 x i32> poison, i32 42, i64 0), <vscale x 8 x i32> poison, <vscale x 8 x i32> zeroinitializer)
%a = select <vscale x 8 x i1> %mask, <vscale x 8 x i32> %x, <vscale x 8 x i32> zeroinitializer
%sa = sext <vscale x 8 x i32> %a to <vscale x 8 x i64>
%ret = add <vscale x 8 x i64> %sa, %y
ret <vscale x 8 x i64> %ret
}
```
### Before this patch
[Compiler Explorer](https://godbolt.org/z/xsoa5xPrd)
```
vwadd_wv_mask_v8i32:
li a0, 42
vsetvli a1, zero, e32, m4, ta, ma
vmslt.vx v0, v8, a0
vmv.v.i v12, 0
vmerge.vvm v24, v12, v8, v0
vwadd.wv v8, v16, v24
ret
```
### After this patch
```
vwadd_wv_mask_v8i32:
li a0, 42
vsetvli a1, zero, e32, m4, ta, ma
vmslt.vx v0, v8, a0
vsetvli zero, zero, e32, m4, tu, mu
vwadd.wv v16, v16, v8, v0.t
vmv8r.v v8, v16
ret
```