See RFC for details:
https://discourse.llvm.org/t/rfc-for-moving-swift-s-merge-function-pass-to-llvm/73778
We will need to refactor the extensions to FunctionComparator/FunctionHash into
StructuralHash. This patch adds a new pass which is ported from Swift;
we will need to discuss how to migrate Swift's pass over after we
land this in LLVM.
Creating this PR to get some early review on the patch.
---------
Co-authored-by: Manman Ren <mren@meta.com>
This reverts commit 86bfeb906e3a95ae428f3e97d78d3d22a7c839f3.
This is a long-time-coming re-application of a change that was originally reverted due to
regressions unrelated to the actual inlining change. These regressions have since
been fixed by another long-in-the-making change, a66051c6, landing.
Original commit message for reference:
---
We have several situations where it's beneficial for code size to ensure that every
call to an always-inline function is inlined before normal inlining decisions are
made. While the normal inliner runs in a "MandatoryOnly" mode to try to do this,
it only does so on a per-SCC basis, rather than for the whole module. Ensuring that
all mandatory inlinings are done before any heuristic-based decisions are made
just makes sense.
Despite being referred to as the "legacy" AlwaysInliner pass, it's already necessary
at -O0 because the CGSCC inliner is too expensive in compile time to run at -O0.
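As a rough sketch (not the actual PassBuilder change), the ordering this establishes in a hand-built pipeline looks like the following: every always_inline call site is handled module-wide before the heuristic (CGSCC) inliner sees the module.
```cpp
#include "llvm/IR/PassManager.h"
#include "llvm/Transforms/IPO/AlwaysInliner.h"
#include "llvm/Transforms/IPO/Inliner.h"

using namespace llvm;

// Sketch only: module-wide mandatory inlining first, heuristic inlining after.
static void addInlinerPasses(ModulePassManager &MPM) {
  MPM.addPass(AlwaysInlinerPass(/*InsertLifetime=*/true));
  MPM.addPass(ModuleInlinerWrapperPass()); // heuristic (CGSCC) inlining afterwards
}
```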
This also fixes an exponential compile time blow up in
https://github.com/llvm/llvm-project/issues/59126
Differential Revision: https://reviews.llvm.org/D143624
---
This patch adds the LLVM changes needed to enable HIP parallel algorithm offload on AMDGPU targets. What we do here is add two passes, one mandatory and one optional:
1. HipStdParAcceleratorCodeSelectionPass is mandatory, depends on CallGraphAnalysis, and implements the following transform:
- Traverse the call-graph and check for functions that are roots for accelerator execution (at the moment, these are GPU kernels exclusively, and would originate in the accelerator-specific algorithm library the toolchain uses as an implementation detail);
- Starting from a root, do a BFS to find all functions that are reachable (called directly or indirectly via a call-chain) and record them;
- After having done the above for all roots in the Module, we have computed the set of reachable functions, which is the union of the roots and the functions reachable from them;
- All functions that are not in the reachable set are removed; for the special case where the reachable set is empty we completely clear the module (a minimal sketch of this selection appears after this list);
2. HipStdParAllocationInterpositionPass is optional, is meant as a fallback with restricted functionality for cases where on-demand paging is unavailable on a platform, and implements the following transform:
- Iterate over all functions in a Module;
- If a function's name is in a predefined set of allocation / deallocation functions that the runtime implementation is allowed and expected to interpose, replace all its uses with the equivalent accelerator-aware function, iff the latter is available;
- If the accelerator-aware equivalent is unavailable we warn, but compilation will go ahead, which means that it is possible to get issues around the accelerator trying to access inaccessible memory at run time;
- We rely on direct name matching as opposed to using the new alloc-kind family of attributes and / or the LibCall analysis pass because some of the legacy functions that need replacing would not carry the former or be identified by the latter.
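For illustration only, here is a minimal sketch of the reachability computation in (1), under two assumptions: roots are identified via the AMDGPU kernel calling convention, and only direct call edges are followed. The in-tree pass relies on CallGraphAnalysis and also handles the additional cases described above.
```cpp
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Module.h"

using namespace llvm;

static void keepOnlyAcceleratorReachable(Module &M) {
  SmallPtrSet<Function *, 32> Reachable;
  SmallVector<Function *, 32> Worklist;

  // Roots: in this sketch, functions with the AMDGPU kernel calling convention.
  for (Function &F : M)
    if (F.getCallingConv() == CallingConv::AMDGPU_KERNEL)
      if (Reachable.insert(&F).second)
        Worklist.push_back(&F);

  // Worklist traversal over direct call edges starting from the roots.
  while (!Worklist.empty()) {
    Function *F = Worklist.pop_back_val();
    for (Instruction &I : instructions(*F))
      if (auto *CB = dyn_cast<CallBase>(&I))
        if (Function *Callee = CB->getCalledFunction())
          if (Reachable.insert(Callee).second)
            Worklist.push_back(Callee);
  }

  // Drop every function definition outside the reachable set.
  for (Function &F : make_early_inc_range(M))
    if (!F.isDeclaration() && !Reachable.contains(&F)) {
      F.replaceAllUsesWith(PoisonValue::get(F.getType()));
      F.eraseFromParent();
    }
}
```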
Reviewed by: JonChesterfield, yaxunl
Differential Revision: https://reviews.llvm.org/D155856
Users can only exercise the LoopVersioningLICM pass via opt, and this PR adds
the option back (it was removed in https://reviews.llvm.org/D137915) so that
it is easy to verify whether the pass is useful for some benchmarks.
Adjust the pipeline slightly to move ConstraintElim just before the loop
simplification pipeline. This increases the number of cases where SCEV
can be preserved in the future.
This also enables slightly more opportunities, by benefiting from
earlier CFG simplifications, which allow more conditions to be added.
Reviewed By: nikic, antoniofrighetto
Differential Revision: https://reviews.llvm.org/D158843
This pass aims to infer alignment for instructions as a separate pass,
to reduce redundant work done by InstCombine running multiple times. It
runs late in the pipeline, just before the back-end passes where this
information is most useful.
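A rough sketch of the kind of work such a pass performs follows; this is not the in-tree InferAlignmentPass implementation, just an illustration of raising the recorded alignment on loads and stores to what value tracking can prove.
```cpp
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Module.h"
#include "llvm/Transforms/Utils/Local.h"

using namespace llvm;

static void inferAlignments(Function &F) {
  const DataLayout &DL = F.getParent()->getDataLayout();
  for (Instruction &I : instructions(F)) {
    if (auto *LI = dyn_cast<LoadInst>(&I)) {
      // With no preferred alignment requested, getOrEnforceKnownAlignment only
      // computes what is provably known about the pointer; never lower it.
      Align Known = getOrEnforceKnownAlignment(LI->getPointerOperand(),
                                               MaybeAlign(), DL, LI);
      if (Known.value() > LI->getAlign().value())
        LI->setAlignment(Known);
    } else if (auto *SI = dyn_cast<StoreInst>(&I)) {
      Align Known = getOrEnforceKnownAlignment(SI->getPointerOperand(),
                                               MaybeAlign(), DL, SI);
      if (Known.value() > SI->getAlign().value())
        SI->setAlignment(Known);
    }
  }
}
```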
Differential Revision: https://reviews.llvm.org/D158529
Currently, the `-fprofile-update` option is ignored when `-fprofile-generate` is in effect. This patch enables `-fprofile-update` for `-fprofile-generate`, continuing the work from https://reviews.llvm.org/D87737, which added `-fprofile-update` in the first place.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D157280
This reverts commit 1f37088679a5c2416707d477093950e48148d430 as it causes a
large regression in x264, and some other regressions in downstream embedded
benchmarks under LTO.
We currently only enable hoisting in the last SimplifyCFG run of
the function simplification pipeline. In particular this happens
after GVN, which means that instructions that were identical (and
thus hoistable) prior to GVN might no longer be so after it ran,
due to equality replacements (see the phase ordering test).
The history here is that D84108 restricted hoisting to the very
late (module optimization) pipeline only. Then D101468 went back
on that, and also performed it at the end of function simplification.
This patch goes one step further and allows it prior to GVN.
Importantly, we still don't perform hoisting before LoopRotate,
which was the original motivation for delaying it.
Differential Revision: https://reviews.llvm.org/D156532
This restores commit b4a82b62258c5f650a1cccf5b179933e6bae4867, reverted
in 3ab7ef28eebf9019eb3d3c4efd7ebfd160106bb1 because it was thought to
cause a bot failure, which ended up being unrelated to this patch set.
Differential Revision: https://reviews.llvm.org/D154856
Previously the MemProf profile was expected to be in the same profile
file as a normal PGO profile, passed via the usual -fprofile-use=
option, and was matched in the same pass. To simplify profile
preparation, since the raw MemProf profile requires the binary for
symbolization and may be simpler to index separately from the raw PGO
profile, and also to enable providing a MemProf profile for a SamplePGO
build, separate out the MemProf feedback option and matching pass.
This patch adds the -fmemory-profile-use=${file} option, and the
provided file is passed down to LLVM and ultimately used in a new
MemProfUsePass which performs the matching of just the memory profile
contents of that file.
Note that a single profile file containing both normal PGO and MemProf
profile data is still supported, and the relevant profile data is
matched by the appropriate matching pass(es) based on which option(s)
the profile is provided with (the same profile file can be supplied to
both feedback options).
Differential Revision: https://reviews.llvm.org/D154856
In the LTO pipeline we run InstCombine after LICM, which is
different to what we normally do without LTO. This has the
effect of undoing all the great work done by LICM to reduce
the cost of the loop when it hoists the fdiv out and replaces
it with fmul. When InstCombine runs after LICM it puts the
fdiv straight back which, on AArch64 at least, is darn
expensive. You can observe this problem in the SPEC2017
benchmark parest if you build with "-Ofast -flto" and the
loop-vectoriser uses an unroll factor of 1, which is what
often happens when tail-folding is enabled.
This is also a problem for scalar loops, or indeed any loop
where there is only one use of the preheader fdiv result in
the loop.
See InstCombinerImpl::visitFMul for the code that sinks the fdiv.
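An illustrative source example (not taken from the patch) of the pattern described above:
```cpp
// 'd' is loop invariant, so with fast-math LICM hoists t = 1.0f / d into the
// preheader and rewrites the body to a multiply; if InstCombine runs
// afterwards and the hoisted reciprocal has a single in-loop use, visitFMul
// folds the multiply back into an fdiv inside the loop.
void scale(float *a, int n, float d) {
  for (int i = 0; i < n; ++i)
    a[i] = a[i] / d;
}
```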
I've attempted to fix this by adding another LICM pass for Full
LTO after InstCombine. The alternative is to stop InstCombine
from sinking the fdiv into loops. See D87479 for a previous
discussion on this issue.
Differential Revision: https://reviews.llvm.org/D143631
Here's a high-level summary of the changes in this patch. For more
information on the rationale, see the RFC.
(https://discourse.llvm.org/t/rfc-a-unified-lto-bitcode-frontend/61774).
- Add config parameter to LTO backend, specifying which LTO mode is
desired when using unified LTO.
- Add unified LTO flag to the summary index for efficiency. Unified
LTO modules can be detected without parsing the module.
- Make sure that the ModuleID is generated by incorporating more types
of symbols.
Differential Revision: https://reviews.llvm.org/D123803
Fat LTO objects contain both LTO compatible IR, as well as generated
object code. This allows users to defer the choice of whether or not to use
LTO until link time. This is a feature that has been available in GCC for some
time, and it makes the existing -ffat-lto-objects flag functional in the same
way as GCC's.
Within LLVM, we add a new EmbedBitcodePass that serializes the module to
the object file, and expose a new pass pipeline for compiling fat
objects. The new pipeline initially clones the module and runs the
selected (Thin)LTOPrelink pipeline, after which it will serialize the
module into a `.llvm.lto` section of an ELF file. When compiling for
(Thin)LTO, this is normally the point at which the compiler would emit an
object file containing the bitcode and metadata.
After that point we compile the original module using the
PerModuleDefaultPipeline used for non-LTO compilation. We generate
standard object files at the end of this pipeline, which contain machine
code and the new `.llvm.lto` section containing bitcode.
Since the two pipelines operate on different copies of the module, we
can be sure that the bitcode in the `.llvm.lto` section and object code
in `.text` are congruent with the existing output produced by the
default and LTO pipelines.
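For illustration, here is a rough sketch, under assumptions, of the flow described above: clone the module, (notionally) run the (Thin)LTO pre-link pipeline over the clone, then stash the clone's bitcode in a global placed in the `.llvm.lto` section of the original module. The in-tree EmbedBitcodePass is the authoritative implementation; the global's name below is illustrative.
```cpp
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/SmallString.h"
#include "llvm/Bitcode/BitcodeWriter.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/raw_ostream.h"
#include "llvm/Transforms/Utils/Cloning.h"
#include "llvm/Transforms/Utils/ModuleUtils.h"

using namespace llvm;

static void embedPrelinkBitcode(Module &M) {
  // Each pipeline gets its own copy of the module.
  std::unique_ptr<Module> Clone = CloneModule(M);
  // ... run the selected (Thin)LTO pre-link pipeline over *Clone here ...

  // Serialize the pre-link module to bitcode.
  SmallString<0> Buf;
  raw_svector_ostream OS(Buf);
  WriteBitcodeToFile(*Clone, OS);

  // Carry the bitcode alongside the machine code in a dedicated section.
  auto *Payload = ConstantDataArray::get(
      M.getContext(),
      ArrayRef<uint8_t>(reinterpret_cast<const uint8_t *>(Buf.data()),
                        Buf.size()));
  auto *GV = new GlobalVariable(M, Payload->getType(), /*isConstant=*/true,
                                GlobalValue::PrivateLinkage, Payload,
                                "llvm.fatlto.payload"); // illustrative name
  GV->setSection(".llvm.lto");
  GlobalValue *Used[] = {GV};
  appendToCompilerUsed(M, Used); // keep the payload alive through later passes
}
```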
Original RFC: https://discourse.llvm.org/t/rfc-ffat-lto-objects-support/63977
Earlier versions of this patch were missing REQUIRES lines for llc
related tests in Transforms/EmbedBitcode. Those tests are now under
CodeGen/X86, which should avoid running the check on unsupported
platforms.
The EmbedBitcodePass also returned PreservedAnalyses::all when adding a
metadata section, which failed expensive checks, since it modified the
module. This is now corrected.
Reviewed By: tejohnson, MaskRay, nikic
Differential Revision: https://reviews.llvm.org/D146776
D63932 added a module flag to indicate that we are executing the regular
LTO post merge pipeline, so that GlobalDCE could perform more aggressive
optimization for Dead Virtual Function Elimination. This caused issues
trying to reuse bitcode that had already been through the LTO pipeline
(see context in D139816).
Instead support this by passing down a parameter flag to the
GlobalDCEPass constructor, which is the more usual way for indicating
this information.
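A minimal sketch of what the post-link pipeline now does; the constructor parameter name below is an assumption made for illustration, see GlobalDCE.h in the patch for the actual signature.
```cpp
#include "llvm/IR/PassManager.h"
#include "llvm/Transforms/IPO/GlobalDCE.h"

using namespace llvm;

static void addPostLinkGlobalDCE(ModulePassManager &MPM) {
  // Regular LTO post-link pipeline: request the aggressive virtual function
  // elimination behavior via the constructor instead of a module flag.
  MPM.addPass(GlobalDCEPass(/*InLTOPostLink=*/true));
}
```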
Most test changes are to remove incidental uses of this flag. Of the 2
real uses, llvm/test/LTO/ARM/lto-linking-metadata.ll is now obsolete and
removed in this patch, and the virtual-functions-visibility-post-lto.ll
test is updated to use the regular LTO default pipeline where this
parameter is set to true.
Differential Revision: https://reviews.llvm.org/D153655
Fat LTO objects contain both LTO compatible IR, as well as generated
object code. This allows users to defer the choice of whether or not to use
LTO until link time. This is a feature that has been available in GCC for some
time, and it makes the existing -ffat-lto-objects flag functional in the same
way as GCC's.
Within LLVM, we add a new EmbedBitcodePass that serializes the module to
the object file, and expose a new pass pipeline for compiling fat
objects. The new pipeline initially clones the module and runs the
selected (Thin)LTOPrelink pipeline, after which it will serialize the
module into a `.llvm.lto` section of an ELF file. When compiling for
(Thin)LTO, this is normally the point at which the compiler would emit an
object file containing the bitcode and metadata.
After that point we compile the original module using the
PerModuleDefaultPipeline used for non-LTO compilation. We generate
standard object files at the end of this pipeline, which contain machine
code and the new `.llvm.lto` section containing bitcode.
Since the two pipelines operate on different copies of the module, we
can be sure that the bitcode in the `.llvm.lto` section and object code
in `.text` are congruent with the existing output produced by the
default and LTO pipelines.
Original RFC: https://discourse.llvm.org/t/rfc-ffat-lto-objects-support/63977
Earlier versions of this patch were missing REQUIRES lines for llc
related tests in Transforms/EmbedBitcode. Those tests are now under
CodeGen/X86, which should avoid running the check on unsupported
platforms.
Reviewed By: tejohnson, MaskRay, nikic
Differential Revision: https://reviews.llvm.org/D146776
There seems to be a problem on arm buildbots. Reverting until I can
investigate. https://lab.llvm.org/buildbot#builders/245/builds/10184
This reverts commit a67208e1c697649ce432e6497f56a93675273dd8
and dependent commit e54a3112cee5ae0a9117359ecbea878e1388f51e.
Fat LTO objects contain both LTO compatible IR, as well as generated
object code. This allows users to defer the choice of whether or not to use
LTO until link time. This is a feature that has been available in GCC for some
time, and it makes the existing -ffat-lto-objects flag functional in the same
way as GCC's.
Within LLVM, we add a new EmbedBitcodePass that serializes the module to
the object file, and expose a new pass pipeline for compiling fat
objects. The new pipeline initially clones the module and runs the
selected (Thin)LTOPrelink pipeline, after which it will serialize the
module into a `.llvm.lto` section of an ELF file. When compiling for
(Thin)LTO, this is normally the point at which the compiler would emit an
object file containing the bitcode and metadata.
After that point we compile the original module using the
PerModuleDefaultPipeline used for non-LTO compilation. We generate
standard object files at the end of this pipeline, which contain machine
code and the new `.llvm.lto` section containing bitcode.
Since the two pipelines operate on different copies of the module, we
can be sure that the bitcode in the `.llvm.lto` section and object code
in `.text` are congruent with the existing output produced by the
default and LTO pipelines.
Original RFC: https://discourse.llvm.org/t/rfc-ffat-lto-objects-support/63977
Reviewed By: tejohnson, MaskRay, nikic
Differential Revision: https://reviews.llvm.org/D146776
First, this removes the invocation of the memprof instrumentation passes from
the end of the module simplification pass builder, where it doesn't
really belong. However, it turns out that this was never being invoked,
as it is guarded by an internal option not used anywhere (not even in tests).
These passes are actually added via clang under the -fmemory-profile
option. Changed this to add them via the EP callback interface, similar to
the sanitizer passes. They are added to the EP for the end of the
optimization pipeline, which is roughly where they were being added
already (end of the pre-LTO link pipelines and non-LTO optimization
pipeline).
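A hedged sketch of the registration, mirroring how the sanitizer passes are hooked in; the callback signature shown matches the PassBuilder interface at the time of this patch, and the exact wiring clang performs may differ.
```cpp
#include "llvm/IR/PassManager.h"
#include "llvm/Passes/PassBuilder.h"
#include "llvm/Transforms/Instrumentation/MemProfiler.h"

using namespace llvm;

static void registerMemProfInstrumentation(PassBuilder &PB) {
  PB.registerOptimizerLastEPCallback(
      [](ModulePassManager &MPM, OptimizationLevel Level) {
        // Function-level instrumentation wrapped for the module pipeline,
        // followed by the module-level pass.
        MPM.addPass(createModuleToFunctionPassAdaptor(MemProfilerPass()));
        MPM.addPass(ModuleMemProfilerPass());
      });
}
```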
Ideally we should plumb the output file through to LLVM and set it up
there, so I have added a TODO.
Differential Revision: https://reviews.llvm.org/D151593
GlobalDCE will only remove functions with available externally
linkage if they are unreferenced. As such, I don't believe there
is any problem with running this pass as part of the ThinLTO pre-link
pipeline. It will only remove functions that are completely dead in
that module, and I don't think there is any benefit to keeping them
around for the post-link phase.
There is no compile-time impact from the additional pass.
This is a followup to one of the side discussions in D146776.
Differential Revision: https://reviews.llvm.org/D149446
Applies ThinLTO cloning decisions made during the thin link and
recorded in the summary index to the IR during the ThinLTO backend.
Depends on D141077.
Differential Revision: https://reviews.llvm.org/D149117
This reverts commit 6f29d1adf29820daae9ea7a01ae2588b67735b9e.
https://reviews.llvm.org/D149768 is causing size regressions for -Oz
with FullLTO, and I'm reverting that one while investigating. This
commit depends on that one, so it needs to be reverted as well.
This is a cheap pass so there's no need to limit to -O3.
This removes some differences between various pipelines.
Code size regressions should be addressed with
https://reviews.llvm.org/D149768.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D148269
The original patch (50b2a113db197a97f60ad2aace8b7382dc9b8c31) ignored the
fact that -ftrivial-auto-var-init could affect function parameters with
the sret attribute.
Just do not move instructions that don't affect an alloca.
Also add a missing test case for volatile instructions.
Differential Revision: https://reviews.llvm.org/D148507
This is effectively a debugging pass to adjust function attributes.
I don't think it makes sense to run it in the post-link pipeline.
Differential Revision: https://reviews.llvm.org/D148904
LICM does not use ORE from the pass manager, it constructs its
own instance. As such, explicitly requiring the analysis in the
pipeline is unnecessary.
This patch allows access to callbacks registered by TargetMachines so that custom pipelines can run those callbacks.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D148561
As pointed out in D148010, these passes are missing from the LTO
post-link pipeline. They are present in the pre-link pipeline,
but LoopSink is completely useless there (it will always be fully
undone by LICM post-link) and DivRemPairs is mostly useless
(I believe most of what it does will be undone by InstCombine).
I've not added RelLookupTableConverterPass, because it's also
disabled in the LTO pre-link pipeline, with a comment that there
is an unresolved issue with full LTO.
Compile-time impact of the extra passes is minimal. Of course,
LoopSink will have a larger impact in PGO builds.
Differential Revision: https://reviews.llvm.org/D148343
This is a cheap pass so there's no need to limit to -O3.
This removes some differences between various pipelines.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D148269
The pre-link pipeline already ran the pass and it only needs to be run once.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D145978
This could also move initialization of sret args, causing actually
initialized parts of such return values to be uninitialized. See
discussion on the code review.
> As a result of -ftrivial-auto-var-init, clang generates instructions to
> set alloca'd memory to a given pattern, right after the allocation site.
> In some cases, this (somehow costly) operation could be delayed, leading
> to conditional execution in some cases.
>
> This is not an uncommon situation: it happens ~500 times on the cPython
> code base, and much more on the LLVM codebase. The benefit greatly
> varies on the execution path, but it should not regress on performance.
>
> This is a recommit of cca01008cc31a891d0ec70aff2201b25d05d8f1b with
> MemorySSA update fixes.
>
> Differential Revision: https://reviews.llvm.org/D137707
This reverts commit 50b2a113db197a97f60ad2aace8b7382dc9b8c31
and follow-up commit ad9ad3735c4821ff4651fab7537a75b8f0bb60f8.
In the non-ThinLTO pipeline this was directly before PipelineStartEP,
in the ThinLTO pipeline it was directly after. I don't think the
specific position matters here, just make sure it's the same for
both pipelines.
Next step after https://reviews.llvm.org/D113179
Recently a set of patches by @anton-afanasyev improved many cases (better and cleaner vectorized code) thanks to improvements to AggressiveInstCombine's TruncInstCombine (which regular InstCombine cannot handle), motivated by real examples in bug reports. There was a discussion that -O2 could benefit from AggressiveInstCombine as well, but the discussion then stalled, so I would like to restart it with new numbers from the LLVM compile-time tracker.
As the -O2 pipeline is not tracked by the LLVM compile-time tracker, I disabled AggressiveInstCombine for -O3 to get an idea of how expensive it is. Without it, I observed that the geomean was circa -0.10%. Given that AggressiveInstCombine seems to be quite cheap and is heavily tested by the -O3 pipeline, I am proposing to also enable it with -O2 to improve the quality of vectorized code.
https://llvm-compile-time-tracker.com/compare.php?from=a1df5abef5f27646c809c7b85cf6170eb68f7735&to=e1ba6068f58c6ca862b920b8750faccb42a5843c&stat=instructions:u
Differential Revision: https://reviews.llvm.org/D147604
Reviewed-By: nikic
As a result of -ftrivial-auto-var-init, clang generates instructions to
set alloca'd memory to a given pattern, right after the allocation site.
In some cases, this (somewhat costly) operation could be delayed, leading
to conditional execution in some cases.
This is not an uncommon situation: it happens ~500 times on the CPython
code base, and much more on the LLVM codebase. The benefit varies greatly
with the execution path, but it should not regress performance.
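An illustrative source example (not from the patch) of the situation described above:
```cpp
// With -ftrivial-auto-var-init=pattern, the pattern stores for 'buf' are
// emitted right after the alloca at function entry, even though 'buf' is only
// needed on one branch; delaying the initialization lets it execute only on
// the path that actually uses it.
int consume(const char *);

int f(bool use_buf) {
  char buf[256];
  if (use_buf)
    return consume(buf);
  return 0;
}
```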
This is a recommit of cca01008cc31a891d0ec70aff2201b25d05d8f1b with
MemorySSA update fixes.
Differential Revision: https://reviews.llvm.org/D137707