This is an immutable analysis that loads and makes the contextual profile available to other passes. This patch introduces the analysis and an analysis printer pass. Subsequent patches will introduce the APIs that IPO passes will call to modify the profile as result of their changes.
There is currently no plan to support contextual profiling use in a non-
ThinLTO scenario.
In the pre-link phase, we only instrument and then immediately bail out
to let the linker group functions under an entrypoint in the same module
as the entrypoint. We don't actually care what the profile contains -
just that we want to use a contextual profile.
After that, in post-thinlink, we require the profile be passed again so
we can actually use it. The earlier instrumentation will be used to
match counter values.
While the feature is in development, we add a hidden flag for the use
scenario, but we can eventually tie it to the `PGOOptions` mechanism. We
will use the same flag in both pre- and post-thinlink, because it
simplifies things - usually the post-thinlink args are the same as the
ones for pre-. This, despite the flag being basically treated as a
boolean in pre-thinlink.
This is re-land of #90310 after making asan skip pre-split coroutines in
#99415.
Skip CoroSplit and CoroCleanup in LTO pre-link pipeline so that
CoroElide can happen after callee coroutine is imported into caller's
module in ThinLTO.
Summary:
This pass expands variadic functions into non-variadic function calls
according to the target ABI. Currently, this is used as the lowering for
the NVPTX and AMDGPU targets.
This pass is currently only run late in the target's backend. However,
during LTO we want to run it before the inliner pass so that the
expanded functions can be inlined using standard heuristics. This pass
is a no-op for unsupported targets, so this won't apply to any code that
isn't already using it.
The `!unpredictable` metadata has been present for a long time, but
it's usage in optimizations is still limited. This patch teaches
`FoldTwoEntryPHINode()` to be more aggressive with an unpredictable
branch to reduce mispredictions.
A TTI interface `getBranchMispredictPenalty()` is added to distinguish
between different hardwares to ensure we don't go too far for simpler
cores. For simplicity, only a naive x86 implementation is included for
the time being.
In comparison to non-instrumented binaries, PGO instrumentation binaries
can be significantly slower. For highly threaded programs, this slowdown
can
reach 10x due to data races or false sharing within counters.
This patch incorporates sampling into the PGO instrumentation process to
enhance the speed of instrumentation binaries. The fundamental concept
is similar to the one proposed in https://reviews.llvm.org/D63949.
Three sampling modes are introduced:
1. Simple Sampling: When '-sampled-instr-bust-duration' is set to 1.
2. Fast Burst Sampling: When not using simple sampling, and
'-sampled-instr-period' is set to 65535. This is the default mode of
sampling.
3. Full Burst Sampling: When neither simple nor fast burst sampling is
used.
Utilizing this sampled instrumentation significantly improves the
binary's
execution speed. Measurements show up to 5x speedup with default
settings. Fast burst sampling now results in only around 20% to 30%
slowdown (compared to 8 to 10x slowdown without sampling).
Out tests show that profile quality remains good with sampling,
with edge counts typically showing more than 90% overlap.
For applications whose behavior changes due to binary speed,
sampling instrumentation can enhance performance.
Observations have shown some apps experiencing up to
a ~2% improvement in PGO.
A potential drawback of this patch is the increased binary size
and compilation time. The Sampling method in this patch does
not improve single threaded program instrumentation binary
speed.
Constmerge can fold switch jump tables, possibly making functions
identical again. It can help mergefunc.
On the other hand, the opposite seems unlikely.
Fixes https://github.com/llvm/llvm-project/issues/92201.
This change is incorrect when thinlto and asan are enabled, and this can
be observed by adding `-fsanitize=address` to the provided
coro-elide-thinlto.cpp test. It results in the error "Coroutines cannot
handle non static allocas yet", and ASan introduces a dynamic alloca.
In other words, we must preserve the invariant that CoroSplit runs
before ASan. If we move CoroSplit to the post post-link compile stage,
ASan has to be moved to the post-link compile stage first. It would
also be correct to make CoroSplit handle dynamic allocas so the pass
ordering doesn't matter, but sanitizer instrumentation really ought to
be last, after coroutine splitting.
This reverts commit bafc5f42c0132171287d7cba7f5c14459be1f7b7.
This reverts commit b1b1bfa7bea0ce489b5ea9134e17a43c695df5ec.
This reverts commit 0232b77e145577ab78e3ed1fdbb7eacc5a7381ab.
This reverts commit fb2d3056618e3d03ba9a695627c7b002458e59f0.
This reverts commit 1cb33713910501c6352d0eb2a15b7a15e6e18695.
This reverts commit cd68d7b3c0ebf6da5e235cfabd5e6381737eb7fe.
Skip CoroSplit and CoroCleanup in LTO pre-link pipeline so that
CoroElide can happen after callee coroutine is imported into caller's
module in ThinLTO.
This means that the pass is unusable without some sort of profile. We
can revisit this decision later if we want to support running this pass
without a profile.
Many profitable optimizations cannot be performed at -Oz, due to
unrotated loops. While this is worse for size (minimally), many of the
optimizations significantly reduce code size, such as memcpy
optimizations and other patterns found by loop idiom recognition.
Related discussion can be found in issue #50308.
This patch adds an experimental, backend-only flag to allow loop header
duplication, regardless of the optimization level. Downstream consumers
can experiment with this flag, and if it is profitable, we can adjust
the compiler's defaults accordingly, and expose any useful frontend
flags to opt into the new behavior.
The pass itself checks whether to apply the optimization based on the
minsize attribute, so there isn't much functional benefit to preventing
the pass from being added. Gating the pass gets added to the pass
pipeline complicates the interaction with -enable-dfa-jump-thread, as
well.
Reviewers: aeubanks
Reviewed By: aeubanks
Pull Request: https://github.com/llvm/llvm-project/pull/83318
The performance of cold functions shouldn't matter too much, so if we
care about binary sizes, add an option to mark cold functions as
optsize/minsize for binary size, or optnone for compile times [1]. Clang
patch will be in a future patch.
This is intended to replace `shouldOptimizeForSize(Function&, ...)`.
We've seen multiple cases where calls to this expensive function, if not
careful, can blow up compile times. I will clean up users of that
function in a followup patch.
Initial version: https://reviews.llvm.org/D149800
[1]
https://discourse.llvm.org/t/rfc-new-feature-proposal-de-optimizing-cold-functions-using-pgo-info/56388
Add a pass to convert jump tables to switches.
The new pass replaces an indirect call with a switch + direct calls if all the functions in the jump table are smaller than the provided threshold.
The pass is currently disabled by default and can be enabled by -enable-jump-table-to-switch.
Test plan: ninja check-all
Currently, the UnifiedLTO pipeline seems to have trouble with several
LTO features, like SplitLTO units, which means we cannot use important
optimizations like Whole Program Devirtualization or security hardening
instrumentation like CFI.
This patch reverts FatLTO to using distinct pipelines for Full LTO and
ThinLTO. It still avoids module cloning, since that was error prone.
- With PGO, indirect call edges are constructed using value profiles, and the profile address is mapped to a function's PGO name. The PGO name is computed using a functions linkage before LTO internalization or global promotion.
- With ThinLTO, local functions [could be
promoted](2663d2cb9c/llvm/lib/Transforms/Utils/FunctionImportUtils.cpp (L288)) to have external linkage; and with
[full](2663d2cb9c/llvm/lib/LTO/LTO.cpp (L1328))
or
[thin](2663d2cb9c/llvm/lib/LTO/LTO.cpp (L448))
LTO, global functions could be internalized. Edge construction should use a function's PGO name before its linkage is updated.
Akin other passes - refactored the name to `InstrProfilingLoweringPass` to better communicate what it does, and split the pass part and the transformation part to avoid needing to initialize object state during `::run`.
A subsequent PR will move `InstrLowering` to the .cpp file and rename it to `InstrLowerer`.
https://github.com/llvm/llvm-project/issues/70703 pointed out that
cloning LLVM modules could lead to miscompiles when using FatLTO.
This is due to an existing issue when cloning modules with labels (see
#55991 and #47769). Since this can lead to miscompilation, we can avoid
cloning the LLVM modules, which was desirable anyway.
This patch modifies the EmbedBitcodePass to no longer clone the module
or run an input pipeline over it. Further, it make FatLTO always perform
UnifiedLTO, so we can still defer the Thin/Full LTO decision to
link-time. Lastly, it removes dead/obsolete code related to now defunct
options that do not work with the EmbedBitcodePass implementation any
longer.
Loosen up the matching so that a missing leaf debug frame in the profile
does not prevent matching an allocation context if we can match further
up the inlined call context. This relies on the pre-inliner, which was
already the default when performing normal PGO feedback along with the
MemProf feedback, but to ensure matching is not affected by the presence
of PGO, enable the pre-inliner for MemProf feedback as well.
See RFC for details:
https://discourse.llvm.org/t/rfc-for-moving-swift-s-merge-function-pass-to-llvm/73778
We will need to refactor extension to FunctionComparator/FunctionHash to
StructuralHash. This patch adds a new pass which is ported from Swift,
and will need to discuss on how to migrate Swift’s pass over after we
land this in llvm.
Create this PR to get some early review on the patch.
---------
Co-authored-by: Manman Ren <mren@meta.com>
This reverts commit 86bfeb906e3a95ae428f3e97d78d3d22a7c839f3.
This is a long time coming re-application that was originally reverted due to
regressions, unrelated to the actual inlining change. These regressions have since
been fixed due to another long-in-the-making change of a66051c6 landing.
Original commit message for reference:
---
We have several situations where it's beneficial for code size to ensure that every
call to always-inline functions are inlined before normal inlining decisions are
made. While the normal inliner runs in a "MandatoryOnly" mode to try to do this,
it only does it on a per-SCC basis, rather than the whole module. Ensuring that
all mandatory inlinings are done before any heuristic based decisions are made
just makes sense.
Despite being referred to the "legacy" AlwaysInliner pass, it's already necessary
for -O0 because the CGSCC inliner is too expensive in compile time to run at -O0.
This also fixes an exponential compile time blow up in
https://github.com/llvm/llvm-project/issues/59126
Differential Revision: https://reviews.llvm.org/D143624
---
This patch adds the LLVM changes needed for enabling HIP parallel algorithm offload on AMDGPU targets. What we do here is add two passes, one mandatory and one optional:
1. HipStdParAcceleratorCodeSelectionPass is mandatory, depends on CallGraphAnalysis, and implements the following transform:
- Traverse the call-graph, and check for functions that are roots for accelerator execution (at the moment, these are GPU kernels exclusively, and would originate in the accelerator specific algorithm library the toolchain uses as an implementation detail);
- Starting from a root, do a BFS to find all functions that are reachable (called directly or indirectly via a call- chain) and record them;
- After having done the above for all roots in the Module, we have the computed the set of reachable functions, which is the union of roots and functions reachable from roots;
- All functions that are not in the reachable set are removed; for the special case where the reachable set is empty we completely clear the module;
2. HipStdParAllocationInterpositionPass is optional, is meant as a fallback with restricted functionality for cases where on-demand paging is unavailable on a platform, and implements the following transform:
- Iterate all functions in a Module;
- If a function's name is in a predefined set of allocation / deallocation that the runtime implementation is allowed and expected to interpose, replace all its uses with the equivalent accelerator aware function, iff the latter is available;
- If the accelerator aware equivalent is unavailable we warn, but compilation will go ahead, which means that it is possible to get issues around the accelerator trying to access inaccessible memory at run time;
- We rely on direct name matching as opposed to using the new alloc-kind family of attributes and / or the LibCall analysis pass because some of the legacy functions that need replacing would not carry the former or be identified by the latter.
Reviewed by: JonChesterfield, yaxunl
Differential Revision: https://reviews.llvm.org/D155856
This patch adds the LLVM changes needed for enabling HIP parallel algorithm offload on AMDGPU targets. What we do here is add two passes, one mandatory and one optional:
1. HipStdParAcceleratorCodeSelectionPass is mandatory, depends on CallGraphAnalysis, and implements the following transform:
- Traverse the call-graph, and check for functions that are roots for accelerator execution (at the moment, these are GPU kernels exclusively, and would originate in the accelerator specific algorithm library the toolchain uses as an implementation detail);
- Starting from a root, do a BFS to find all functions that are reachable (called directly or indirectly via a call- chain) and record them;
- After having done the above for all roots in the Module, we have the computed the set of reachable functions, which is the union of roots and functions reachable from roots;
- All functions that are not in the reachable set are removed; for the special case where the reachable set is empty we completely clear the module;
2. HipStdParAllocationInterpositionPass is optional, is meant as a fallback with restricted functionality for cases where on-demand paging is unavailable on a platform, and implements the following transform:
- Iterate all functions in a Module;
- If a function's name is in a predefined set of allocation / deallocation that the runtime implementation is allowed and expected to interpose, replace all its uses with the equivalent accelerator aware function, iff the latter is available;
- If the accelerator aware equivalent is unavailable we warn, but compilation will go ahead, which means that it is possible to get issues around the accelerator trying to access inaccessible memory at run time;
- We rely on direct name matching as opposed to using the new alloc-kind family of attributes and / or the LibCall analysis pass because some of the legacy functions that need replacing would not carry the former or be identified by the latter.
Reviewed by: JonChesterfield, yaxunl
Differential Revision: https://reviews.llvm.org/D155856
This patch adds the LLVM changes needed for enabling HIP parallel algorithm offload on AMDGPU targets. What we do here is add two passes, one mandatory and one optional:
1. HipStdParAcceleratorCodeSelectionPass is mandatory, depends on CallGraphAnalysis, and implements the following transform:
- Traverse the call-graph, and check for functions that are roots for accelerator execution (at the moment, these are GPU kernels exclusively, and would originate in the accelerator specific algorithm library the toolchain uses as an implementation detail);
- Starting from a root, do a BFS to find all functions that are reachable (called directly or indirectly via a call- chain) and record them;
- After having done the above for all roots in the Module, we have the computed the set of reachable functions, which is the union of roots and functions reachable from roots;
- All functions that are not in the reachable set are removed; for the special case where the reachable set is empty we completely clear the module;
2. HipStdParAllocationInterpositionPass is optional, is meant as a fallback with restricted functionality for cases where on-demand paging is unavailable on a platform, and implements the following transform:
- Iterate all functions in a Module;
- If a function's name is in a predefined set of allocation / deallocation that the runtime implementation is allowed and expected to interpose, replace all its uses with the equivalent accelerator aware function, iff the latter is available;
- If the accelerator aware equivalent is unavailable we warn, but compilation will go ahead, which means that it is possible to get issues around the accelerator trying to access inaccessible memory at run time;
- We rely on direct name matching as opposed to using the new alloc-kind family of attributes and / or the LibCall analysis pass because some of the legacy functions that need replacing would not carry the former or be identified by the latter.
Reviewed by: JonChesterfield, yaxunl
Differential Revision: https://reviews.llvm.org/D155856
User only can use opt to test LoopVersioningLICM pass, and this PR add
the option back(deleted in https://reviews.llvm.org/D137915) so that
it's easy for verifying if it is useful for some benchmarks.
Adjust the pipeline slightly to move ConstraintElim just before the loop
simplification pipeline. This increases the number of cases where SCEV
should can preserved in the future.
This also enables slightly more opportunities, by benefiting from
earlier CFG simplifications, which allow more conditions to be added.
Reviewed By: nikic, antoniofrighetto
Differential Revision: https://reviews.llvm.org/D158843
This pass aims to infer alignment for instructions as a separate pass,
to reduce redundant work done by InstCombine running multiple times. It
runs late in the pipeline, just before the back-end passes where this
information is most useful.
Differential Revision: https://reviews.llvm.org/D158529
Currently, the `-fprofile-udpate` is ignored when `-fprofile-generate` is in effect. This patch enables `-fprofile-update` for `-fprofile-generate`. This patch continues the work from https://reviews.llvm.org/D87737, which added `-fprofile-update` in the first place.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D157280