AMDGPU has native instructions and target intrinsics for this, but
these really should be subject to legalization and generic
optimizations. This will enable legalization of f16->f32 on targets
without f16 support.
Implement a somewhat horrible inline expansion for targets without
libcall support. This could be better if we could introduce control
flow (GlobalISel version not yet implemented). Support for strictfp
legalization is less complete but works for the simple cases.
Parametrize SampleProfileInference and SampleProfileLoaderBaseImpl by function
type (Function/MachineFunction) instead of block type
(BasicBlock/MachineBasicBlock). Move out specializations to appropriate
locations.
This change makes it possible to use GraphTraits instead of a custom TypeMap and
make SampleProfileInference not dependent on LLVM types, paving the way for
generalizing SampleProfileInference interfaces to BOLT IR types
(BinaryFunction/BinaryBasicBlock) in stale profile matching (D144500).
Reviewed By: hoy
Differential Revision: https://reviews.llvm.org/D152187
If intrinsic `get_fpenv` or `set_fpenv` is lowered to the form where FP
environment is represented as a region in memory, extra moves can
appear. For example the code:
define void @func_01(ptr %ptr) {
%env = call i256 @llvm.get.fpenv.i256()
store i256 %env, ptr %ptr
ret void
}
produces DAG:
ch = get_fpenv_mem ch, memory_region
val: i256, ch = load ch, memory_region
ch = store ch, ptr, val
In this case the extra moves can be avoided if `get_fpenv_mem` got
pointer to the memory where the FP environment should be finally placed.
This change implement such optimization for this use case.
Differential Revision: https://reviews.llvm.org/D150437
This removes BitCasts from isSource in Type Promotion, as I don't believe they
need to be treated as Sources. They will usually be from floats or hoisted
constants, where constants will be handled already.
This fixes#62513, but didn't otherwise cause any differences in the tests I
ran.
Differential Revision: https://reviews.llvm.org/D152112
Currently, a node and its users are added back to the worklist in reverse topological order after it is combined. This diff changes that order to be topological. This is part of a larger migration to get the DAGCombiner to process nodes in topological order.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D127115
The change implements intrinsics 'get_fpenv', 'set_fpenv' and 'reset_fpenv'.
They are used to read floating-point environment, set it or reset to
some default state. They do the same actions as C library functions
'fegetenv' and 'fesetenv'. By default these intrinsics are lowered to calls
to these functions.
The new intrinsics specify FP environment as a value of integer type, it
is convenient of most targets where the FP state is a content of some
register. Some targets however use long representations. On X86 the size
of FP environment is 256 bits, and even half of this size is not a legal
ibteger type. To facilitate legalization in such cases, two sets of DAG
nodes is used. Nodes GET_FPENV and SET_FPENV are used when FP
environment may be represented by a legal integer type. Nodes
GET_FPENV_MEM and SET_FPENV_MEM consider FP environment as a region in
memory, much like `fesetenv` and `fegetenv` do. They are used when
target has long representation for floationg-point state.
Differential Revision: https://reviews.llvm.org/D71742
The lists contain differences between register numbers, not the register
numbers themselves. Since a difference can also be negative, this also
changes its type to signed.
Changing the type to signed exposed a "bug". For AMDGPU, which has many
registers, the first element of a sequence could be as big as ~45k.
The value does not fit into int16_t, but fits into uint16_t. The bug
didn't show up because of unsigned wrapping and truncation of the Val
field in the advance() method.
To fix the issue, I changed the way regunit difflists are encoded. The
4-bit 'scale' field of MCRegisterDesc::RegUnit was replaced by 12-bit
number of the first regunit, and the first element of each of the lists
was removed. The higher 20 bits of RegUnit field contain the initial
offset into DiffLists array.
AMDGPU has 1'409 regunits (2^12 = 4'096), and the biggest offset is
80'041 (2^20 = 1'048'576). That is, there is enough room.
Changing the encoding method also resulted in a smaller array size, the
numbers are below (I omitted targets with less than 100 elements).
```
AMDGPU | 80052 | 78741 | -1,6%
RISCV | 6498 | 6297 | -3,1%
ARM | 4181 | 3966 | -5,1%
AArch64 | 2770 | 2592 | -6,4%
PPC | 1578 | 1441 | -8,7%
Hexagon | 994 | 740 | -25,6%
R600 | 508 | 398 | -21,7%
VE | 471 | 459 | -2,5%
Sparc | 381 | 363 | -4,7%
X86 | 326 | 208 | -36,2%
Mips | 253 | 200 | -20,9%
SystemZ | 186 | 162 | -12,9%
```
Reviewed By: foad, arsenm
Differential Revision: https://reviews.llvm.org/D151036
In some rare corner cases where in between the div/rem pair there's a def of
the second instruction's source (but a different vreg due to the combine's
eqivalence checks), it will place the DIVREM at the first instruction's point,
causing a use-before-def. There wasn't an obvious fix that stood out to me
without doing more involved analysis than a combine should really be doing.
Fixes issue #60516
I'm open to new suggestions on how to approach this, as I'm not too happy
at bailing out here. It's not the first time we run into issues with value liveness
that the DAG world isn't affected by.
Differential Revision: https://reviews.llvm.org/D144336
This patch adds logic for determining RegisterBank size to RegisterBankInfo, which allows accounting for the HwMode of the target. Individual RegisterBanks cannot be constructed with HwMode information as construction is generated by TableGen, but a RegisterBankInfo subclass can provide the HwMode as a constructor argument. The HwMode is used to select the appropriate RegisterBank size from an array relating sizes to RegisterBanks.
Targets simply need to provide the HwMode argument to the <target>GenRegisterBankInfo constructor. The RISC-V RegisterBankInfo constructor has been updated accordingly (plus an unused argument removed).
Reviewed By: simoncook, craig.topper
Differential Revision: https://reviews.llvm.org/D76007
If the ZExt can be lowered to a single ZExt to the next power-of-2 and
the remaining ZExt folded into the user, don't use tbl lowering.
Fixes#62620.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D150482
Use computeOverflowForUnsignedAdd and computeOverflowForSignedAdd
instead. Unfortunately, this recomputes some known bits and sign bits
we may have already computed, but was the easiest fix without a lot
of restructuring.
This recovers the regressions from D151472.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D151858
Sometimes an developer would like to have more control over cmov vs branch. We have unpredictable metadata in LLVM IR, but currently it is ignored by X86 backend. Propagate this metadata and avoid cmov->branch conversion in X86CmovConversion for cmov with this metadata.
Example:
```
int MaxIndex(int n, int *a) {
int t = 0;
for (int i = 1; i < n; i++) {
// cmov is converted to branch by X86CmovConversion
if (a[i] > a[t]) t = i;
}
return t;
}
int MaxIndex2(int n, int *a) {
int t = 0;
for (int i = 1; i < n; i++) {
// cmov is preserved
if (__builtin_unpredictable(a[i] > a[t])) t = i;
}
return t;
}
```
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D118118
FoldSetCC() returns UNDEF in a number of cases. However, the SetCC
result must follow BooleanContents. Unless the type is a
pre-legalization i1 or we have UndefinedBooleanContents, the use of
UNDEF will not uphold the requirement that the top bits are either
zero or match the low bit. In such cases, return zero instead.
Fixes https://github.com/llvm/llvm-project/issues/63055.
Differential Revision: https://reviews.llvm.org/D151883
Split DWARF doesn't handle LTO of any form (roughly there's an
assumption that each dwo file will have one CU - it's not explicitly
documented, nor explicitly handled, so the ecosystem isn't really well
understood/tested/etc).
This had previously been handled by implementing (& disabling by
default) the `-split-dwarf-cross-cu-references` flag, which would
disable use of ref_addr across two dwo CUs.
This worked for a while, at least in LTO (it didn't address Split
DWARF+Full LTO, but that's an unlikely combination, as the benefits of
Split DWARF are more limited in a full LTO build) - because the only
source of cross-CU references was inlined functions, so by making those
non-cross-CU (by moving the referenced inlined function DWARF
description into the referencing CU) the result was one CU per dwo.
But recently the Function Specialization pass was added to the ThinLTO
pipeline, which caused imported functions that may not be inlined to be
emitted by a backend compile. This meant foreign CU entities (not just
abstract origins/cross-CU referenced entities)/standalone foreign CUs
could be emitted by a backend compile.
The end result was, due to a bug* in binutils dwp (I think basically
it saw two CUs in a single dwo and reprocessed the offsets in the shared
debug_str_offsets.dwo section) this situation lead to corrupted strings.
So to make this more robust, I've generalized the definition of the
`-split-dwarf-cross-cu-references` flag (perhaps it should be renamed at
this point, but it's /really/ niche, doubt anyone's using it - more or
less there for experimentation when we get around to figuring out
spec'ing LTO+Split DWARF) to mean "single CU in a dwo file" and added
more general handling for this.
There's certainly some weird corner cases that could come up in terms of
"how do we choose which CU to put everything in" - for now it's "first
come, first served" which is probably going to be OK for ThinLTO - the
base module will have the first functions and first CU, imported
fragments will come after that. For LTO the choice will be fairly
arbitrary - but, again, essentially whichever module comes first.
* Arguably a bug in binutils dwp, but since the feature isn't well
specified, I'd rather avoid dabbling in this uncertain area and ensure
LLVM doesn't produce especially novel DWARF (dwos with multiple CUs)
regardless of whether binutils dwp would/should be fixed. I'm not
confident debuggers could read such a dwo file well, etc.
And also set the SHF_X86_64_LARGE section flag.
gcc only uses the "l" prefix and SHF_X86_64_LARGE in the medium code model for data larger than -mlarge-data-threshold. But it seems more consistent to use it in the large code model as well in case separate parts of the binary aren't compiled with the large code model and also have a .data/.bss/.rodata section.
Reviewed By: MaskRay, tkoeppe
Differential Revision: https://reviews.llvm.org/D148836
Given an insert of a scalar load into a vector shuffle with mask
u,0,1,2,3,4,5,6 or 1,2,3,4,5,6,7,u (depending on the insert index),
it can be more profitable to convert to a single load and avoid the
shuffles. This adds a DAG combine for it, providing the new load is
still fast.
Differential Revision: https://reviews.llvm.org/D151029
Code generated with -Ofast and -O3 -ffp-contract=fast (add
-ffinite-math-only to enable vectorization) can differ significantly.
Code compiled with -O3 can be deinterleaved using patterns as the
instruction order is preserved. However, with the -Ofast flag, there
can be multiple changes in the computation sequence, and even the real
and imaginary parts may not be calculated in parallel.
For more details, refer to
llvm/test/CodeGen/AArch64/complex-deinterleaving-*-fast.ll and
llvm/test/CodeGen/AArch64/complex-deinterleaving-*-contract.ll tests.
This patch implements a more general approach and enables handling most
-Ofast cases.
Differential Revision: https://reviews.llvm.org/D148558
This exposed a miscompile due to incorrect flag preservation in
integer type legalization, which has been fixed in D151472.
-----
This patch is a continuation of D150110. It separates the cases for
ADD and SUB into their own cases so that computeForAddSub can be
directly called and the NSW flag passed. This allows better
optimization when the NSW flag is enabled, and allows fixing up the
TODO that was there previously in SimplifyDemandedBits.
Differential Revision: https://reviews.llvm.org/D150769
This patch updates several functions in LLVM's IR generation code to accept
an IRBuilder object as an argument, rather than an Instruction that indicates
the insertion point for new instructions.
This change is necessary to handle sophisticated -Ofast optimization cases
from D148558 where it's unclear which instructions should be used as the
insertion point for new operations.
Differential Revision: https://reviews.llvm.org/D148703
As discussed in D151436, it's safe to do this as a simple shift (as is
done in LegalizeDAG.cpp) rather than needing a libcall. The added test
cases for RISC-V previously just triggered an assertion.
Codegen for bfloat_to_double will be slightly improved by D151434.
Differential Revision: https://reviews.llvm.org/D151563
Use big obj copy in range for-loop will call copy constructor every time,
which can be avoided by use ref instead.
Reviewed By: skan
Differential Revision: https://reviews.llvm.org/D150024
class InstructionRemover manages resources such as dynamically allocated memory, it's generally a good practice to either implement a custom copy constructor or disable the default one.
Reviewed By: pengfei
Differential Revision: https://reviews.llvm.org/D151543
This reverts commit 9b92f70d4758f75903ce93feaba5098130820d40. The issue
with the re-applied change was an implicit truncation due to the
multiplication. Although the operations were converted to `APInt`, the
values were implicitly converted to `long` due to the typing rules.
Fixes: #59594
Differential Revision: https://reviews.llvm.org/D140347
The generic implementation is umin(TC, VF * vscale).
Lowering to vsetvli for RISC-V will come in a future patch.
This patch is a pre-requisite to be able to CodeGen vectorized code from
D99750.
Reviewed By: reames, frasercrmck
Differential Revision: https://reviews.llvm.org/D149916
For dbg.value intrinsics targeting an llvm::Argument address whose expression
starts with an entry value, we lower this to a DEBUG_VALUE targeting the livein
physical register corresponding to that Argument.
Depends on D151332
Differential Revision: https://reviews.llvm.org/D151333
This will make it easier to add more cases in a subsequent commit and also
better conforms to the coding guidelines.
Depends on D151330
Differential Revision: https://reviews.llvm.org/D151331
Summary:
DbgValue intrinsics whose expression is an entry_value and whose address is
described an llvm::Argument must be lowered to the corresponding livein physical
register for that Argument.
Depends on D151329
Reviewers: aprantl
Subscribers:
For dbg.value intrinsics targeting an llvm::Argument address whose expression
starts with an entry value, we lower this to a DEBUG_VALUE targeting the livein
physical register corresponding to that Argument.
Depends on D151328
Differential Revision: https://reviews.llvm.org/D151329
This was originally added to preserve FMF on SETCC. Unfortunately,
it also incorrectly preserves nuw/nsw on ADD/SUB in some cases.
There's also no guarantee the new opcode is even the same opcode
as the original node.
This patch removes the code and adds code to explicitly preserve
FMF flags in the SETCC promotion function.
The other test changes are from nuw/nsw not being preserved. I
believe for all these tests it was correct to preserve the flags,
so we need new code to preserve the flags when possible. I'll post
another patch for that since it's a riskier change.
This should unblock D150769.
Differential Revision: https://reviews.llvm.org/D151472
When the worklist is initially being formed, there is no need to
consider all nodes for pruning. This is because the first time calling
getNextWorklistEntry will only clear those nodes which have no uses,
with their operands being added to the worklist. However, when the worklist is
created for the first time all nodes are added anyways, so this operation
actually ends up adding no nodes.
This patch adds a parameter IsCandidateForPruning to AddToWorklist with a
default value of true to avoid having to update every call site.
Differential Revision: https://reviews.llvm.org/D151416
Make sure we do not crash in rfindDebugLoc when starting at
instr_rend(). Solution is to see it as we start one MI before the
first MI, so we can start searching forward at instr_begin()
instead.
This behavior is similar to how findPrevDebugLoc(instr_end()) works.
Differential Revision: https://reviews.llvm.org/D150577
- Add some unittests for the findDebugLoc, rfindDebugLoc,
findPrevDebugLoc and rfindPrevDebugLoc helpers in MachineBasicBlock.
- Clean up code comments and code formatting related to the functions
mentioned above.
This was extracted as a pre-commit to D150577, adn some of the tests
are commented out since they would crash/assert in a rather
uncontrolled way.
This will make it easier to add more cases in a subsequent commit and also
better conforms to the coding guidelines.
Differential Revision: https://reviews.llvm.org/D151328
This is an attempt to reland D42600 and enabling this optimisation by default.
This also resolves the issue pointed out in the context of PGO build.
Differential Revision: https://reviews.llvm.org/D42600
D151036 adds an assertions that prohibits iterating over sub- and
super-registers of a null register. This is already the case when
iterating over register units of a null register, and worked by
accident for sub- and super-registers.
The only place where the assertion is currently triggering is in
CriticalAntiDepBreaker::ScanInstruction. Other places are changed
in case new assertions are added and should be harmless otherwise.