On GPU targets, materializing constants is cheap and stores are
expensive, so only doing this for zero vectors was silly.
Most of the new testcases aren't optimally merged, and are for
later improvements.
llvm-svn: 238108
This patch improves support for sign extension of the lower lanes of vectors of integers by making use of the SSE41 pmovsx* sign extension instructions where possible, and optimizing the sign extension by shifts on pre-SSE41 targets (avoiding the use of i64 arithmetic shifts which require scalarization).
It converts SIGN_EXTEND nodes to SIGN_EXTEND_VECTOR_INREG where necessary, that more closely matches the pmovsx* instruction than the default approach of using SIGN_EXTEND_INREG which splits the operation (into an ANY_EXTEND lowered to a shuffle followed by shifts) making instruction matching difficult during lowering. Necessary support for SIGN_EXTEND_VECTOR_INREG has been added to the DAGCombiner.
Differential Revision: http://reviews.llvm.org/D9848
llvm-svn: 237885
DAG.FoldConstantArithmetic() can fail even though both operands are
Constants if OpaqueConstants are involved. Continue trying other combine
possibilities in tis case.
Differential Revision: http://reviews.llvm.org/D6946
Somewhat related to PR21801 / rdar://19211454
llvm-svn: 237822
In CombineToPreIndexedLoadStore, when the offset is a constant, we have code
that looks for other uses of the pointer which are constant offset computations
so that they can be rewritten in terms of the updated pointer so that we don't
need to keep a copy of the base pointer to compute these constant offsets.
Unfortunately, when it iterated over the uses, it did so by SDNodes, and so we
could confuse ourselves if the base pointer was produced by a node that had
multiple results (because we would not immediately exclude uses of the other
node results). This was reported as PR22755. Unfortunately, we don't have a
test case (and I've also been unable to produce one thus far), but at least the
mistake is clear. The right way to fix this problem is to make use of the information
contained in the use iterators to filter out any uses of other results of the
node producing the base pointer.
This should be mostly NFC, but should also fix PR22755 (for which,
unfortunately, we have no in-tree test case).
llvm-svn: 237576
This is a less ambitious version of:
http://reviews.llvm.org/rL236546
because that was reverted in:
http://reviews.llvm.org/rL236600
because it caused memory corruption that wasn't related to FMF
but was actually due to making nodes with 2 operands derive from a
plain SDNode rather than a BinarySDNode.
This patch adds the minimum plumbing necessary to use IR-level
fast-math-flags (FMF) in the backend without actually using
them for anything yet. This is a follow-on to:
http://reviews.llvm.org/rL235997
...which split the existing nsw / nuw / exact flags and FMF
into their own struct.
llvm-svn: 237046
The bug showed up as a compile-time assertion failure:
Assertion `NumBits >= MIN_INT_BITS && "bitwidth too small"' failed
when building msan tests on x86-64.
Prior to r236850, this bug was masked due to a bogus alignment check,
which also accidentally rejected non-byte-sized accesses. Afterwards,
an invalid ElementSizeBytes == 0 got further into the function, and
triggered the assertion failure.
It would probably be a good idea to allow it to handle merging stores
of unusual widths as well, but for now, to un-break it, I'm just
making the minimal fix.
Differential Revision: http://reviews.llvm.org/D9626
llvm-svn: 236927
1) check whether the alignment of the memory is sufficient for the
*merged* store or load to be efficient.
Not doing so can result in some ridiculously poor code generation, if
merging creates a vector operation which must be aligned but isn't.
2) DON'T check that the alignment of each load/store is equal. If
you're merging 2 4-byte stores, the first *might* have 8-byte
alignment, but the second certainly will have 4-byte alignment. We do
want to allow those to be merged.
llvm-svn: 236850
This patch adds the minimum plumbing necessary to use IR-level
fast-math-flags (FMF) in the backend without actually using
them for anything yet. This is a follow-on to:
http://reviews.llvm.org/rL235997
...which split the existing nsw / nuw / exact flags and FMF
into their own struct.
There are 2 structural changes here:
1. The main diff is that we're preparing to extend the optimization
flags to affect more than just binary SDNodes. Eg, IR intrinsics
( https://llvm.org/bugs/show_bug.cgi?id=21290 ) or non-binop nodes
that don't even exist in IR such as FMA, FNEG, etc.
2. The other change is that we're actually copying the FP fast-math-flags
from the IR instructions to SDNodes.
Differential Revision: http://reviews.llvm.org/D8900
llvm-svn: 236546
This patch makes ReplaceExtractVectorEltOfLoadWithNarrowedLoad convert
the element number from getVectorIdxTy() to PtrTy before doing pointer
arithmetic on it. This is needed on z, where element numbers are i32
but pointers are i64.
Original patch by Richard Sandiford.
llvm-svn: 236530
For little-endian, the function would convert (extract_vector_elt (load X), Y)
to X + Y*sizeof(elt). For big-endian it would instead use
X + sizeof(vec) - Y*sizeof(elt). The big-endian case wasn't right since
vector index order always follows memory/array order, even for big-endian.
(Note that the current handling has to be wrong for Y==0 since it would
access beyond the end of the vector.)
Original patch by Richard Sandiford.
llvm-svn: 236529
At the least it should be guarded by some kind of target hook.
It also introduced catastrophic compile time and code quality
regressions on some out of tree targets (test case still being
reduced/sanitized).
Sanjay agreed with reverting this patch until these issues can be
resolved.
llvm-svn: 236199
This is a compromise: with this simple patch, we should always handle a chain of exactly 3
operations optimally, but we're not generating the optimal balanced binary tree for a longer
sequence.
In general, this transform will reduce the dependency chain for a sequence of instructions
using N operands from a worst case N-1 dependent operations to N/2 dependent operations.
The optimal balanced binary tree would reduce the chain to log2(N).
The trade-off for not dealing with longer sequences is: (1) we have less complexity in the
compiler, (2) we avoid unknown compile-time blowup calculating a balanced tree, and (3) we
don't need to worry about the increased register pressure required to parallelize longer
sequences. It also seems unlikely that we would ever encounter really long strings of
dependent ops like that in the wild, but I'm not sure how to verify that speculation.
FWIW, I see no perf difference for test-suite running on btver2 (x86-64) with -ffast-math
and this patch.
We can extend this patch to cover other associative operations such as fmul, fmax, fmin,
integer add, integer mul.
This is a partial fix for:
https://llvm.org/bugs/show_bug.cgi?id=17305
and if extended:
https://llvm.org/bugs/show_bug.cgi?id=21768https://llvm.org/bugs/show_bug.cgi?id=23116
The issue also came up in:
http://reviews.llvm.org/D8941
Differential Revision: http://reviews.llvm.org/D9232
llvm-svn: 236031
This is a preliminary step to using the IR-level floating-point fast-math-flags in the SDAG (D8900).
In this patch, we introduce the optimization flags as their own struct. As noted in the TODO comment,
we should eventually share this data between the IR passes and the backend.
We also switch the existing nsw / nuw / exact bit functionality of the BinaryWithFlagsSDNode class to
use the new struct.
The tradeoff is that instead of using the free but limited space of SDNode's SubclassData, we add a
data member to the subclass. This means we don't have to repeat all of the get/set methods per flag,
but we're potentially adding size to all nodes of this subclassi type.
In practice on 64-bit systems (measured on Linux and MacOS X), there is no size difference between an
SDNode and BinaryWithFlagsSDNode after this change: they're both 80 bytes. This means that we had at
least one free byte to play with due to struct alignment.
Differential Revision: http://reviews.llvm.org/D9325
llvm-svn: 235997
[DebugInfo] Add debug locations to constant SD nodes
This adds debug location to constant nodes of Selection DAG and updates
all places that create constants to pass debug locations
(see PR13269).
Can't guarantee that all locations are correct, but in a lot of cases choice
is obvious, so most of them should be. At least all tests pass.
Tests for these changes do not cover everything, instead just check it for
SDNodes, ARM and AArch64 where it's easy to get incorrect locations on
constants.
This is not complete fix as FastISel contains workaround for wrong debug
locations, which drops locations from instructions on processing constants,
but there isn't currently a way to use debug locations from constants there
as llvm::Constant doesn't cache it (yet). Although this is a bit different
issue, not directly related to these changes.
Differential Revision: http://reviews.llvm.org/D9084
llvm-svn: 235989
This adds debug location to constant nodes of Selection DAG and updates
all places that create constants to pass debug locations
(see PR13269).
Can't guarantee that all locations are correct, but in a lot of cases choice
is obvious, so most of them should be. At least all tests pass.
Tests for these changes do not cover everything, instead just check it for
SDNodes, ARM and AArch64 where it's easy to get incorrect locations on
constants.
This is not complete fix as FastISel contains workaround for wrong debug
locations, which drops locations from instructions on processing constants,
but there isn't currently a way to use debug locations from constants there
as llvm::Constant doesn't cache it (yet). Although this is a bit different
issue, not directly related to these changes.
Differential Revision: http://reviews.llvm.org/D9084
llvm-svn: 235977
right scaling.
In the function canFoldInAddressingMode, VT is computed as the type of the
destination/source of a LOAD/STORE operations, instead of the memory type of the
operation.
On targets with a scaling factor on the offset of the LOAD/STORE operations, the
function may return false for actually valid cases. This may then prevent the
selection of profitable pre or post indexed load/store operations, and instead
select pre or post indexed load/store for unprofitable cases.
Patch by Francois de Ferriere <francois.de-ferriere@st.com>!
Differential Revision: http://reviews.llvm.org/D9146
llvm-svn: 235780
Patch to remove extra bitcasts from shuffles, this is often a legacy of XformToShuffleWithZero being used to combine bitmaskings (of float vectors bitcast to integer vectors) into shuffles: bitcast(shuffle(bitcast(s0),bitcast(s1))) -> shuffle(s0,s1)
Differential Revision: http://reviews.llvm.org/D9097
llvm-svn: 235578
This turned up after r235333, but was a pre-existing bug. The optimization
which transforms select(c, load, load) into a load of a select of the addresses
does not handle indexed loads (pre/post inc/dec). However, it did not check for
them either, leading to a crash if it tried to transform one of them.
llvm-svn: 235497
Summary:
This patch adds legalization support to operate on FP16 as a load/store type
and do operations on it as floats.
Tests for ARM are added to test/CodeGen/ARM/fp16-promote.ll
Reviewers: srhines, t.p.northover
Differential Revision: http://reviews.llvm.org/D8755
llvm-svn: 235215
The only type that isn't an integer, isn't floating point, and isn't
a vector; ladies and gentlemen, the gift that keeps on giving: x86_mmx!
Fixes PR23246.
Original message (reverted in r235062):
[CodeGen] Combine concat_vectors of scalars into build_vector.
Combine something like:
(v8i8 concat_vectors (v2i8 bitcast (i16)) x4)
into:
(v8i8 (bitcast (v4i16 BUILD_VECTOR (i16) x4)))
If any of the scalars are floating point, use that throughout.
Differential Revision: http://reviews.llvm.org/D8948
llvm-svn: 235072
Combine something like:
(v8i8 concat_vectors (v2i8 bitcast (i16)) x4)
into:
(v8i8 (bitcast (v4i16 BUILD_VECTOR (i16) x4)))
If any of the scalars are floating point, use that throughout.
Differential Revision: http://reviews.llvm.org/D8948
llvm-svn: 234809
In case of different types used for the condition of the selects the
select(select) -> select(and) normalisation cannot be performed.
See also: http://reviews.llvm.org/D7622
llvm-svn: 234763
We already do:
concat_vectors(scalar, undef) -> scalar_to_vector(scalar)
When the scalar is legal.
When it's not, but is a truncated legal scalar, we can also do:
concat_vectors(trunc(scalar), undef) -> scalar_to_vector(scalar)
Which is equivalent, since the upper lanes are undef anyway.
While there, teach the combine to look at more than 2 operands.
Differential Revision: http://reviews.llvm.org/D8883
llvm-svn: 234530
The bug manifests when there are two loads and two stores chained as follows in
a DAG,
(ld v3f32) -> (st f32) -> (ld v3f32) -> (st f32)
and the stores' values are extracted from the preceding vector loads.
MergeConsecutiveStores would replace the first store in the chain with the
merged vector store, which would create a cycle between the merged store node
and the last load node that appears in the chain.
This commits fixes the bug by replacing the last store in the chain instead.
rdar://problem/20275084
Differential Revision: http://reviews.llvm.org/D8849
llvm-svn: 234430