The MachineUA now queries the target to determine if a given register holds a
uniform value. This is determined using the corresponding register bank if
available, or by a combination of the register class and value type. This
assumes that the target chooses registers with performance in mind, and that
the target is responsible for any mismatch with the inferred uniformity.
For example, on AMDGPU, an SGPR is now treated as uniform, except if the
register bank is VCC (i.e., the register holds a wave-wide vector of 1-bit
values) or equivalently if it has a value type of s1.
- This does not always work with inline asm, where the register bank or the
value type might not be present. We assume that the SGPR is uniform, because
it is not expected to be s1 in the vast majority of cases.
- The pseudo branch instruction SI_LOOP is now hard-coded to be always
divergent, although its condition is an SGPR.
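As a rough sketch, the register query described above amounts to something
like the following (the hook name and structure are illustrative assumptions,
not the exact interface added by this patch; the register-bank and LLT APIs
are LLVM's existing ones):
```
// Illustrative only: classify a register as uniform from its register bank
// when one is assigned, otherwise from its register class and LLT.
bool isUniformReg(const MachineRegisterInfo &MRI, const SIRegisterInfo &TRI,
                  Register Reg) {
  if (const RegisterBank *RB = MRI.getRegBankOrNull(Reg))
    // Only the SGPR bank is uniform; the VCC bank holds a wave-wide vector
    // of 1-bit values and the VGPR bank is divergent.
    return RB->getID() == AMDGPU::SGPRRegBankID;
  // No register bank: fall back to register class plus value type. An SGPR
  // of type s1 is a lane mask and therefore divergent.
  return TRI.isSGPRReg(MRI, Reg) && MRI.getType(Reg) != LLT::scalar(1);
}
```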
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D150438
Add some debug output to `outline` to assist in debugging + understanding the
code.
This will say
- How many things we found worth turning into outlined functions
- Whether or not candidates were pruned via the outlining algorithm
- The function created (if it was created)
- Where the calls were inserted
- What instruction was used to create the call
Sample output below:
```
NUMBER OF POTENTIAL FUNCTIONS: 5
WALKING FUNCTION LIST
PRUNED: 0/2 candidates
OUTLINE: Expected benefit (12 B) > threshold (1 B)
NEW FUNCTION: OUTLINED_FUNCTION_0
CREATE OUTLINED CALLS
CALL: OUTLINED_FUNCTION_0 in bar:<unknown>
.. BL @OUTLINED_FUNCTION_0, implicit-def $lr, implicit $sp
CALL: OUTLINED_FUNCTION_0 in bar:<unknown>
.. BL @OUTLINED_FUNCTION_0, implicit-def $lr, implicit $sp
PRUNED: 2/2 candidates
SKIP: Expected benefit (0 B) < threshold (1 B)
PRUNED: 0/2 candidates
OUTLINE: Expected benefit (8 B) > threshold (1 B)
NEW FUNCTION: OUTLINED_FUNCTION_1
CREATE OUTLINED CALLS
CALL: OUTLINED_FUNCTION_1 in bar:<unknown>
.. BL @OUTLINED_FUNCTION_1, implicit-def $lr, implicit $sp
CALL: OUTLINED_FUNCTION_1 in bar:<unknown>
.. BL @OUTLINED_FUNCTION_1, implicit-def $lr, implicit $sp
PRUNED: 2/2 candidates
SKIP: Expected benefit (0 B) < threshold (1 B)
PRUNED: 2/2 candidates
SKIP: Expected benefit (0 B) < threshold (1 B)
```
At a cycle C with divergent exits, UA was using a naive traversal of the exiting
edges to locate blocks that may use values defined inside C. But this traversal
fails when it encounters a cycle. This is now replaced with a much simpler
propagation that iterates over every instruction in C and checks any uses that
are outside C. But such an iteration can be expensive when C is very large; the
original strategy may need to be reconsidered if there is a regression in
compilation times.
Also fixed lit tests that should have originally caught the missed propagation
of temporal divergence.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D149646
Prior to this patch, for the DWARF frame base LLVM uses the frame pointer
register if available, otherwise the stack pointer register. If the stack
pointer register is being used and a call or other code modifies the stack
pointer during the body of the function, the emitted locations are wrong and
the debugger displays the wrong values for variables.
By using DW_OP_call_frame_cfa in these situations, the emitted location for
the variable will automatically handle changes in the stack pointer.
The CFA needs to be adjusted for the offset between it and the frame
pointer/stack pointer so that the variable locations themselves can remain
unchanged by this patch.
Reviewed By: #debug-info, scott.linder, jryans
Differential Revision: https://reviews.llvm.org/D143463
Previously, LegalizeVectorOps used the result VT while LegalizeDAG
used the operand VT. This patch makes them both use the operand VT.
This also makes it consistent with how the default cost model works.
I've hacked the AArch64 cost model to maintain old behavior for some
f16 vectors.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D149572
The current logic is pretty limited unless the `Op` is a
constant. This at least covers more of the obvious cases.
Reviewed By: craig.topper, foad
Differential Revision: https://reviews.llvm.org/D149196
Both of these functions call themselves recursively, so it makes sense
to put an upper bound on that recursion.
Differential Revision: https://reviews.llvm.org/D149195
This patch adds a heuristic to skip if-conversion if the condition has a
high chance of being predictable.
If the condition is in a loop, consider it predictable if the condition
itself or all its operands are loop-invariant. E.g. this considers a load
from a loop-invariant address predictable; even though we were unable to prove
that it doesn't alias any of the memory writes in the loop, it is likely to
read the same value multiple times.
This is a relatively crude heuristic, but it helps to prevent excessive
if-conversion in multiple workloads in practice.
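As an illustration, here is a small made-up C++ example (not taken from those
workloads) of the kind of branch the heuristic is meant to leave alone:
```
// The load of *flag cannot be proven not to alias the stores through dst, so
// it stays in the loop, but it will almost certainly read the same value on
// every iteration: the branch is highly predictable, and if-converting it
// (e.g. into a select of both arms) would only add work to the common path.
void scale(int *dst, const int *src, const int *flag, int n) {
  for (int i = 0; i < n; ++i)
    dst[i] = *flag ? src[i] * 2 : src[i];
}
```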
Reviewed By: apostolakis
Differential Revision: https://reviews.llvm.org/D141639
While the original motivation for this patch (address space 7 on
AMDGPU) has been reworked and is not presently planned to reach IR
translation, the incorrect (per the spec) handling of the index offset
width in IR translation and CodeGenPrepare is likely to trip someone up
- possibly future AMD work, since we now have a p7:160:256:256:32 - so we
convert to the other API now.
Reviewed By: aemerson, arsenm
Differential Revision: https://reviews.llvm.org/D143526
Correct the legality of i32 mul_lohi on AArch64.
Previously, AArch64 incorrectly reported i32 mul_lohi as Legal.
This allowed BuildUDIV/SDIV to use them. A later DAGCombiner would
replace them with MULHS/MULHU because only the high half was used.
This conversion does not check the legality of MULHS/MULHU, under
the assumption that LegalizeDAG can turn them back into MUL_LOHI later.
After they were converted to MULHS/MULHU, DAGCombine ran and saw that
these operations aren't supported but an i64 MUL is, so they got
converted to that plus a shift. Without this, LegalizeDAG would
convert back to MUL_LOHI and isel would fail to find a pattern.
This patch teaches BuildUDIV/SDIV to create the wide mul and shift
so that we can report the correct operation legality on AArch64. It
also enables div by constant folding for more cases on VE.
I don't know if VE wants this div by constant optimization or not. If they
don't want it, they can use the isIntDivCheap hook to disable it.
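For intuition, the wide-multiply-and-shift form looks like the following
standalone C++ illustration (division by 5 with its well-known magic constant;
this shows the shape of the expansion, not code from the patch):
```
#include <cstdint>

// Unsigned i32 division by the constant 5: the multiplier 0xCCCCCCCD is
// ceil(2^34 / 5), so (x * m) >> 34 equals x / 5 for every 32-bit x. This is
// what BuildUDIV can now emit directly when i32 MULHU/MUL_LOHI are not legal
// but an i64 MUL is.
uint32_t udiv5(uint32_t x) {
  uint64_t wide = uint64_t(x) * 0xCCCCCCCDu;
  return uint32_t(wide >> 34);
}
```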
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D150333
This commit implements SelectionDAG lowering of dbg.declare intrinsics targeting
swiftasync Arguments, by putting them in the MachineFunction's table of
variables whose location doesn't change throughout the function.
Depends on D149882
Differential Revision: https://reviews.llvm.org/D149883
This commit implements IRTranslator lowering of dbg.declare intrinsics targeting
swiftasync Arguments, by putting them in the MachineFunction's table of
variables whose location doesn't change throughout the function.
Depends on D149881
Differential Revision: https://reviews.llvm.org/D149882
This patch consumes the EntryValueObjects in a MachineFunction's table, using
them to emit the appropriate debug information for these variables.
Depends on D149880
Differential Revision: https://reviews.llvm.org/D149881
Looks like the code assumes that it will always be set, but it's not true:
https://reviews.llvm.org/D150420. This is a temporary suppression to enable
stricter msan on a bot.
We were missing any support for ISD::INTRINSIC_W_CHAIN/INTRINSIC_VOID
used for memory operations.
For ISD::PREFETCH and target memory nodes we didn't add the subclass
data.
This patch handles all MemIntrinsicSDNode in one place and adds the
missing subclass data.
Note: unlike loads/stores, we don't add the memory VT in AddNodeIDCustom or getMemIntrinsicNode. Not sure why.
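Conceptually, the common handling folds the memory-specific state into the
node's CSE profile along these lines (a simplified sketch in the spirit of
AddNodeIDCustom, not the exact diff; `ID` is the FoldingSetNodeID and `N` the
node being profiled):
```
// Give every MemIntrinsicSDNode (INTRINSIC_W_CHAIN, INTRINSIC_VOID, PREFETCH,
// target memory nodes) the same profile treatment, including the previously
// missing subclass data.
if (const auto *MN = dyn_cast<MemIntrinsicSDNode>(N)) {
  ID.AddInteger(MN->getRawSubclassData());
  ID.AddInteger(MN->getPointerInfo().getAddrSpace());
  // The memory VT is deliberately not added, matching the note above.
}
```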
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D150387
This patch adds a check for whether the memory operand is known to be
a jump table and, if so, allows shrinkwrapping to continue. In the
case that we are looking at a jump table, I believe it is safe to
assume that the access will not be to the stack (but please correct me
if I am wrong here).
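The check is conceptually along these lines (a hedged sketch;
`onlyAccessesJumpTable` is a made-up helper name, while the memory-operand and
PseudoSourceValue APIs are LLVM's existing ones):
```
// A memory operand whose pseudo source value is the jump table is known not
// to access the stack, so it does not need to block shrink-wrapping.
bool onlyAccessesJumpTable(const MachineInstr &MI) {
  return llvm::all_of(MI.memoperands(), [](const MachineMemOperand *MMO) {
    const PseudoSourceValue *PSV = MMO->getPseudoValue();
    return PSV && PSV->kind() == PseudoSourceValue::JumpTable;
  });
}
```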
In the test attached, this is helpful in that we are able to generate
only one instruction for each non-default case in the original switch
statement.
Differential Revision: https://reviews.llvm.org/D149886
This patch encapsulates the encoding and decoding logic of basic block metadata into the Metadata struct, and also reduces the decoded size of the `SHT_LLVM_BB_ADDR_MAP` section.
The patch would've looked more readable if we could use designated initializers, but those are a C++20 feature.
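The encapsulation essentially packs the per-block flags into a single integer,
roughly like this (a sketch; the field set is from memory and may not match
the real struct exactly):
```
#include <cstdint>

// Each boolean becomes one bit of the encoded value; decode() reverses it.
struct Metadata {
  bool HasReturn = false;
  bool HasTailCall = false;
  bool IsEHPad = false;
  bool CanFallThrough = false;

  uint32_t encode() const {
    return uint32_t(HasReturn) | (uint32_t(HasTailCall) << 1) |
           (uint32_t(IsEHPad) << 2) | (uint32_t(CanFallThrough) << 3);
  }

  static Metadata decode(uint32_t V) {
    Metadata MD;
    MD.HasReturn = V & 1;
    MD.HasTailCall = V & (1 << 1);
    MD.IsEHPad = V & (1 << 2);
    MD.CanFallThrough = V & (1 << 3);
    return MD;
  }
};
```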
Reviewed By: jhenderson
Differential Revision: https://reviews.llvm.org/D148360
This commit implements the serialization and deserialization of the Machine
Function's EntryValueObjects.
Depends on D149879, D149778
Differential Revision: https://reviews.llvm.org/D149880
Land D42600 with the optimisation disabled by default by setting the 'enable-shrink-wrap-region-split' option.
This just reduces the effort involved in patching and relanding the whole change each time an issue is detected.
MachineFunction keeps a table of variables whose addresses never change
throughout the function. Today, the only kinds of locations it can
handle are stack slots.
However, we could expand this for variables whose address is derived
from the value a register had upon function entry. One case where this
happens is with variables alive across coroutine funclets: these can
be placed in a coroutine frame object whose pointer is placed in a
register that is an argument to coroutine funclets.
```
define void @foo(ptr %frame_ptr) {
  dbg.declare(%frame_ptr, !some_var,
              !DIExpression(EntryValue, <ptr_arithmetic>))
}
```
This is a patch in a series that aims to improve the debug information
generated by the CoroSplit pass in the context of `swiftasync`
arguments. Variables stored in the coroutine frame _must_ be described using
the entry_value of the ABI-defined register containing a pointer to the
coroutine frame. Since these variables have a single location throughout
their lifetime, they are candidates for being stored in the
MachineFunction table.
Differential Revision: https://reviews.llvm.org/D149879
Add support for splitting critical edges coming from an indirect jump
using a jump table ("switch jumps").
This introduces the `TargetInstrInfo::getJumpTableIndex` callback to
allow targets to return an index into `MachineJumpTableInfo` for a
given indirect jump. It also updates
`MachineBasicBlock::SplitCriticalEdge` to allow splitting of critical
edges by rewriting jump table entries.
This is largely based on work done by Zhixuan Huan in D132202.
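A simplified sketch of how the pieces fit together (the real
SplitCriticalEdge logic handles more bookkeeping; `Succ` and `NewMBB` are the
old successor and the newly inserted block, and the hook is assumed to return
-1 when the terminator does not use a jump table):
```
// Ask the target which jump table (if any) the indirect branch reads, then
// retarget the matching jump table entries and the CFG edge at the new block.
int JTI = TII->getJumpTableIndex(*MBB->getFirstTerminator());
if (JTI != -1) {
  MachineJumpTableInfo *MJTI = MF->getJumpTableInfo();
  MJTI->ReplaceMBBInJumpTable(JTI, Succ, NewMBB); // rewrite table entries
  MBB->replaceSuccessor(Succ, NewMBB);            // keep the CFG in sync
}
```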
Differential Revision: https://reviews.llvm.org/D140975
I was seeing a regression when enabling FS discriminators on a non-FS CSSPGO build. This is because a probe can get a zero-valued discriminator at a specific pass, and that could lead to accidentally loading the corresponding base counter in the non-FS profile, while a non-zero discriminator would end up getting zero samples. This could in turn undo the sample distribution effort done by previous BFI maintenance work and the probe distribution factor work for pseudo probes specifically. To mitigate that, I'm disabling loading a non-FS profile against FS discriminators. The problem should also exist with non-CS AutoFDO, so I'm doing this for it too.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D149597
This change enables loading pseudo-probe based profiles on MIR. Different from the IR profile loader, callsites are excluded from MIR profile loading since they are not assigned a FS discriminator. Using zero as the discriminator is not accurate and would undo the distribution work done by the IR loader based on the pseudo probe distribution factor. We rely on block probes only for FS profile loading.
Some refactoring is done to the IR profile loader so that `getProbeWeight` can be shared by both loaders.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D148584
Encode FS discriminators for block probes and decode them correspondingly.
The encoding/decoding of FS discriminators is conditional, i.e., only done for probes with a non-zero discriminator. This saves encoding size and also ensures downwards compatibility.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D147651
Without the fix, gcc complains with:
../lib/CodeGen/GlobalISel/CombinerHelper.cpp:1652:52: warning: suggest parentheses around '&&' within '||' [-Wparentheses]
1652 | SrcDef->getOpcode() == TargetOpcode::G_OR && "Unexpected op");
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~
Reland D147506 after fixing the failure in bot
https://lab.llvm.org/buildbot/#/builders/247/builds/4125
Some debuggers, like DBX on AIX, assume the address in debug line
entries is always incremental. But clang generates two entries (an entry
for the file-scope line and an entry for the prologue end) with the same
address if the prologue is empty.
And if the prologue is empty, it seems the first debug line entry for the
function is unnecessary (i.e. removing the first entry won't impact the
behavior in GDB on Linux), so I implemented this for all debuggers.
Reviewed By: dblaikie
Differential Revision: https://reviews.llvm.org/D147506
While investigating a bug in Clang, I noticed that -Wframe-larger-than
was emitting extra debug information along with the diagnostic. It
turns out that 2e1e2f52f357768186ecfcc5ac53d5fa53d1b094 fixed an issue
with the diagnostic, but accidentally left in some debug code that was
exposed in all builds.
So now we no longer emit things like:
8/4294967304 (0.00%) spills, 4294967296/4294967304 (100.00%) variables
along with the diagnostic
KCFI machine function passes transform indirect calls with a
cfi-type attribute into architecture-specific type checks bundled
together with the calls. Instead of having a separate pass for each
architecture, add a generic machine function pass for KCFI and
move the architecture-specific code that emits the actual check to
TargetLowering. This avoids unnecessary duplication and makes it
easier to add KCFI support to other architectures.
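For context, the check that gets bundled with each call is conceptually the
following (illustrative C++ only; in reality the check is emitted as target
instructions, and the 4-byte prefix offset and the trap shown here are
illustrative, target-specific details):
```
#include <cstdint>
#include <cstring>

// The callee's type hash is stored just before its entry point; the call
// site compares it with the hash expected from the cfi-type attribute before
// making the indirect call.
using Fn = void (*)();

void callChecked(Fn Callee, uint32_t ExpectedHash) {
  const char *Prefix =
      reinterpret_cast<const char *>(reinterpret_cast<uintptr_t>(Callee)) - 4;
  uint32_t ActualHash;
  std::memcpy(&ActualHash, Prefix, sizeof(ActualHash));
  if (ActualHash != ExpectedHash)
    __builtin_trap();
  Callee();
}
```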
Reviewed By: nickdesaulniers
Differential Revision: https://reviews.llvm.org/D149915
Annotation metadata supports adding singular annotation strings to the annotation block. This patch adds the ability to insert a tuple of strings into the metadata array.
The idea here is that each tuple of strings represents a piece of information whose parts are all related. It makes it easier to parse through related metadata information, given it will be contained in one tuple.
For example, in remarks, any pass that implements annotation remarks can have different types of remarks and pass additional information for each.
The original behaviour of annotation remarks is preserved here, and we can mix tuple annotations and single annotations for the same instruction.
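A hedged sketch of what attaching such a tuple might look like through the
metadata APIs (the strings and surrounding code are made up; the patch's
actual helpers may differ):
```
// Attach one !annotation entry that is itself a tuple of related strings,
// rather than three independent single-string annotations.
LLVMContext &Ctx = I.getContext();
Metadata *Parts[] = {MDString::get(Ctx, "auto-init"),
                     MDString::get(Ctx, "memset"),
                     MDString::get(Ctx, "16 bytes")};
MDNode *Tuple = MDTuple::get(Ctx, Parts);
I.setMetadata(LLVMContext::MD_annotation, MDTuple::get(Ctx, {Tuple}));
```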
Reviewed By: paquette
Differential Revision: https://reviews.llvm.org/D148328
There's a target hook that's called in DAGCombiner that we stub here; I'll
implement the equivalent override for AArch64 in a subsequent patch, since it's
used by a different shift combine.
This change by itself has minor code size improvements on arm64 -Os CTMark:
Program                              size.__text
                                     outputg181ppyy  output8av1cxfn    diff
consumer-typeset/consumer-typeset         410648.00       410648.00    0.0%
tramp3d-v4/tramp3d-v4                     364176.00       364176.00    0.0%
kimwitu++/kc                              449216.00       449212.00   -0.0%
7zip/7zip-benchmark                       576128.00       576120.00   -0.0%
sqlite3/sqlite3                           285108.00       285100.00   -0.0%
SPASS/SPASS                               411720.00       411688.00   -0.0%
ClamAV/clamscan                           379868.00       379764.00   -0.0%
Bullet/bullet                             452064.00       451928.00   -0.0%
mafft/pairlocalalign                      246184.00       246108.00   -0.0%
lencod/lencod                             428524.00       428152.00   -0.1%
Geomean difference                                                    -0.0%
Differential Revision: https://reviews.llvm.org/D150086
Prevent optimization of DebugLoc across section boundaries; such optimization would yield incorrect source locations if the memory layout of the sections does not strictly match the Asm file.
Reviewed By: #debug-info, dblaikie, MaskRay
Differential Revision: https://reviews.llvm.org/D149294
This reverts commit 1ddfd1c8186735c62b642df05c505dc4907ffac4.
The original commit causes a Chrome build assertion failure with
ThinLTO: https://crbug.com/1443635
With cb57b7a7, the kill flags are now tracked during the forward search over
the instructions and the call to findRegisterUseOperandIdx() should therefore
only check for killing uses.
As shown with the failing test CodeGen/Hexagon/vector-sint-to-fp.ll, it could
otherwise be the case that an undef use after the instruction that killed the
register would be inserted into MBBKills, and the kill flag would not be
cleared.
In the SimplifyDemandedBits function, there is a fallthrough to the
default case in the case of ISD::ADD, ISD::MUL and ISD::SUB. This
leads to a call to computeKnownBits which is unnecessary as the
calls to SimplifyDemandedBits in the cases themselves handle the
calculation of the known bits. That information currently just ends up in
the Known2 variables and is discarded.
By keeping this information around and calling
KnownBits::mul or KnownBits::computeForAddSub directly, the
unnecessary computation can be avoided. For now, the NSW bit is not
passed through to KnownBits, as this is something that
computeKnownBits does not handle either; doing so would require updating
computeForAddCarry to handle the flag as well.
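The reuse is roughly of the following shape (a sketch; the variable names and
the exact computeForAddSub signature are from memory and may differ between
LLVM versions):
```
// Known and Known2 already hold the operands' known bits from the earlier
// SimplifyDemandedBits calls; combine them directly instead of calling
// computeKnownBits again.
switch (Op.getOpcode()) {
case ISD::MUL:
  Known = KnownBits::mul(Known, Known2);
  break;
case ISD::ADD:
case ISD::SUB:
  Known = KnownBits::computeForAddSub(Op.getOpcode() == ISD::ADD,
                                      /*NSW=*/false, Known, Known2);
  break;
}
```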
Differential Revision: https://reviews.llvm.org/D150110
It was discovered that this pass could be slow on huge functions, meaning 20%
compile time instead of the usual ~0.5% (with a test case spending ~19 mins
just in the backend).
The problem related to the necessary clearing of earlier kill flags when a
redundant instruction is removed. With this patch, the handling of kill flags
is now done by maintaining a map instead of scanning backwards in the
function. This remedies the compile time on the huge file fully.
Reviewed By: vpykhtin, arsenm
Differential Revision: https://reviews.llvm.org/D147532
Resolves https://github.com/llvm/llvm-project/issues/61397
For constant BUILD_VECTORs the operands need to be legal types. This can mean
that when the number of sign bits is calculated, it may look at the entire
widened constant and conservatively report fewer sign bits than it could. For
example, i8 vectors could use i32 elements, for which 0x000000ff has 24 sign
bits as an i32; after adjusting for the implicit truncation to i8 this is
incorrectly limited to 1 sign bit, even though the original i8 value (-1) has
8. This makes the code look at the constant directly, truncated to the correct
type for the element, so that it can correctly return 8.
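In sketch form (simplified from the BUILD_VECTOR handling in
ComputeNumSignBits; `MinSignBits` is just a local name for the running
minimum):
```
// Look at each constant operand truncated to the element's scalar size
// instead of the legalized (wider) operand type, so an i8 element stored as
// the i32 constant 0x000000ff is seen as -1 with 8 sign bits.
unsigned EltBits = Op.getScalarValueSizeInBits();
unsigned MinSignBits = EltBits;
for (SDValue Elt : Op->op_values())
  if (auto *C = dyn_cast<ConstantSDNode>(Elt)) {
    APInt Val = C->getAPIntValue().trunc(EltBits);
    MinSignBits = std::min(MinSignBits, Val.getNumSignBits());
  }
```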
Differential Revision: https://reviews.llvm.org/D149956