It is almost always simpler to use {} instead of std::nullopt to
initialize an empty ArrayRef. This patch changes all occurrences I could
find in LLVM itself. In future the ArrayRef(std::nullopt_t) constructor
could be deprecated or removed.
This commit introduces support for outlining functions across modules
using codegen data generated from previous codegen. The codegen data
currently manages the outlined hash tree, which records outlining
instances that occurred locally in the past.
The machine outliner now operates in one of three modes:
1. CGDataMode::None: This is the default outliner mode that uses the
suffix tree to identify (local) outlining candidates within a module.
This mode is also used by (full)LTO to maintain optimal behavior with
the combined module.
2. CGDataMode::Write (`-codegen-data-generate`): This mode is identical
to the default mode, but it also publishes the stable hash sequences of
instructions in the outlined functions into a local outlined hash tree.
It then encodes this into the `__llvm_outline` section, which will be
dead-stripped at link time.
3. CGDataMode::Read (`-codegen-data-use-path={.cgdata}`): This mode
reads a codegen data file (.cgdata) and initializes a global outlined
hash tree. This tree is used to generate global outlining candidates.
Note that the codegen data file has been post-processed with the raw
`__llvm_outline` sections from all native objects using the
`llvm-cgdata` tool (or a linker, `LLD`, or a new ThinLTO pipeline
later).
This depends on https://github.com/llvm/llvm-project/pull/105398. After
this PR, LLD (https://github.com/llvm/llvm-project/pull/90166) and Clang
(https://github.com/llvm/llvm-project/pull/90304) will follow for each
client side support.
This is a patch for
https://discourse.llvm.org/t/rfc-enhanced-machine-outliner-part-2-thinlto-nolto/78753.
When the machine outliner copies instructions from a source function
into an outlined function, it was doing it using `CloneMachineInstr`,
which is documented as not preserving the interior of any instruction
bundle. So outlining code that includes an instruction bundle would
fail, because in the outlined version, the bundle would be empty, so
instructions would go missing in the move.
This occurs when any bundled instruction appears in the outlined code,
so there was no need to construct an unusual test case: I've just copied
a function from the existing `stp-opt-with-renaming.mir`, which happens
to contain an SVE instruction bundle. Including two identical copies of
that function makes the outliner merge them, and then we check that it
didn't destroy the interior of the bundle in the process.
This patch prepares the NFC groundwork for global outlining using
CGData, which will follow
https://github.com/llvm/llvm-project/pull/90074.
- The `MinRepeats` parameter is now explicitly passed to the
`getOutliningCandidateInfo` function, rather than relying on a default
value of 2. For local outlining, the minimum number of repetitions is
typically 2, but for the global outlining (mentioned above), we will
optimistically create a single `Candidate` for each `OutlinedFunction`
if stable hashes match a specific code sequence. This parameter is
adjusted accordingly in global outlining scenarios.
- I have also implemented `unique_ptr` for `OutlinedFunction` to ensure
safe and efficient memory management within `FunctionList`, avoiding
unnecessary implicit copies.
This depends on https://github.com/llvm/llvm-project/pull/101461.
This is a patch for
https://discourse.llvm.org/t/rfc-enhanced-machine-outliner-part-2-thinlto-nolto/78753.
Since `raw_string_ostream` doesn't own the string buffer, it is
desirable (in terms of memory safety) for users to directly reference
the string buffer rather than use `raw_string_ostream::str()`.
Work towards TODO comment to remove `raw_string_ostream::str()`.
This PR depends on https://github.com/llvm/llvm-project/pull/90264
In the current implementation, only leaf children of each internal node
in the suffix tree are included as candidates for outlining. But all
leaf descendants are outlining candidates, which we include in the new
implementation. This is enabled on a flag `outliner-leaf-descendants`
which is default to be true.
The reason for _enabling this on a flag_ is because machine outliner is
not the only pass that uses suffix tree.
The reason for _having this default to be true_ is because including all
leaf descendants show consistent size win.
* For Clang/LLD, it shows around 3% reduction in text segment size when
compared to the baseline `-Oz` linker binary.
* For selected benchmark tests in LLVM test suite
| run (CTMark/) | only leaf children | all leaf descendants | reduction
% |
|------------------|--------------------|----------------------|-------------|
| lencod | 349624 | 348564 | -0.2004% |
| SPASS | 219672 | 218440 | -0.4738% |
| kc | 271956 | 250068 | -0.4506% |
| sqlite3 | 223920 | 222484 | -0.5471% |
| 7zip-benchmark | 405364 | 401244 | -0.3428% |
| bullet | 139820 | 138340 | -0.8315% |
| consumer-typeset | 295684 | 286628 | -1.2295% |
| pairlocalalign | 72236 | 71936 | -0.2164% |
| tramp3d-v4 | 189572 | 183676 | -2.9668% |
This is part of an enhanced version of machine outliner -- see
[RFC](https://discourse.llvm.org/t/rfc-enhanced-machine-outliner-part-1-fulllto-part-2-thinlto-nolto-to-come/78732).
This PR depends on https://github.com/llvm/llvm-project/pull/90260
We changed the order in which functions are outlined in Machine
Outliner.
The formula for priority is found via a black-box Bayesian optimization
toolbox. Using this formula for sorting consistently reduces the
uncompressed size of large real-world mobile apps. We also ran a few
benchmarks using LLVM test suites, and showed that sorting by priority
consistently reduces the text segment size.
|run (CTMark/) |baseline (1)|priority (2)|diff (1 -> 2)|
|----------------|------------|------------|-------------|
|lencod |349624 |349264 |-0.1030% |
|SPASS |219672 |219480 |-0.0874% |
|kc |271956 |251200 |-7.6321% |
|sqlite3 |223920 |223708 |-0.0947% |
|7zip-benchmark |405364 |402624 |-0.6759% |
|bullet |139820 |139500 |-0.2289% |
|consumer-typeset|295684 |290196 |-1.8560% |
|pairlocalalign |72236 |72092 |-0.1993% |
|tramp3d-v4 |189572 |189292 |-0.1477% |
This is part of an enhanced version of machine outliner -- see
[RFC](https://discourse.llvm.org/t/rfc-enhanced-machine-outliner-part-1-fulllto-part-2-thinlto-nolto-to-come/78732).
This reduce the time complexity of the main loop of `findCandidates()`
method from $O(n^2)$ to $O(n \log n)$.
For small $n$, the modification does not regress the build time, but it
helps significantly when $n$ is large.
For one application, this reduces the runtime of the main loop from 120
seconds to 28 seconds.
This is the first commit for an enhanced version of machine outliner --
see
[RFC](https://discourse.llvm.org/t/rfc-enhanced-machine-outliner-part-1-fulllto-part-2-thinlto-nolto-to-come/78732).
This avoids the pitfall where we set the uwtable to none:
```
func.setUWTableKind(llvm::UWTableKind::None)
```
`Attribute::getAsString()` would see an unknown attribute and fail an
assertion. In this patch, we assert that we do not see a None uwtable
kind.
This also skips the check of `UWTableKind::Async`. It is dominated by
the check of `UWTableKind::Default`, which has the same enum value
(nfc).
Make Candidate's front() and back() functions return references to
MachineInstr and introduce begin() and end() returning iterators, the
same way it is usually done in other container-like classes.
This makes possible to iterate over the instructions contained in
Candidate the same way one can iterate over MachineBasicBlock (note that
begin() and end() return bundled iterators, just like MachineBasicBlock
does, but no instr_begin() and instr_end() are defined yet).
Add some debug output to `outline` to assist in debugging + understanding the
code.
This will say
- How many things we found worth turning into outlined functions
- Whether or not candidates were pruned via the outlining algorithm
- The function created (if it was created)
- Where the calls were inserted
- What instruction was used to create the call
Sample output below:
```
NUMBER OF POTENTIAL FUNCTIONS: 5
WALKING FUNCTION LIST
PRUNED: 0/2 candidates
OUTLINE: Expected benefit (12 B) > threshold (1 B)
NEW FUNCTION: OUTLINED_FUNCTION_0
CREATE OUTLINED CALLS
CALL: OUTLINED_FUNCTION_0 in bar:<unknown>
.. BL @OUTLINED_FUNCTION_0, implicit-def $lr, implicit $sp
CALL: OUTLINED_FUNCTION_0 in bar:<unknown>
.. BL @OUTLINED_FUNCTION_0, implicit-def $lr, implicit $sp
PRUNED: 2/2 candidates
SKIP: Expected benefit (0 B) < threshold (1 B)
PRUNED: 0/2 candidates
OUTLINE: Expected benefit (8 B) > threshold (1 B)
NEW FUNCTION: OUTLINED_FUNCTION_1
CREATE OUTLINED CALLS
CALL: OUTLINED_FUNCTION_1 in bar:<unknown>
.. BL @OUTLINED_FUNCTION_1, implicit-def $lr, implicit $sp
CALL: OUTLINED_FUNCTION_1 in bar:<unknown>
.. BL @OUTLINED_FUNCTION_1, implicit-def $lr, implicit $sp
PRUNED: 2/2 candidates
SKIP: Expected benefit (0 B) < threshold (1 B)
PRUNED: 2/2 candidates
SKIP: Expected benefit (0 B) < threshold (1 B)
```
We add a field `IsOutlined` to indicate whether a MachineFunction
is outlined and set it true for outlined functions in MachineOutliner.
Reviewed By: paquette
Differential Revision: https://reviews.llvm.org/D146191
Outlining isn't always a win when the saved instruction count is >= 1.
The overhead of representing a new function in the binary depends on
exception metadata and alignment. So parameterize this for local tuning.
Reviewed By: paquette
Differential Revision: https://reviews.llvm.org/D136774
I think it's good practice to avoid having default ctors unless they're really
valid/useful. For OutlinedFunction the default ctor was used to represent a
bail-out value for getOutliningCandidateInfo(), so I changed the API to return
an optional<getOutliningCandidateInfo> instead which seems a tad cleaner.
Differential Revision: https://reviews.llvm.org/D146375
Add a test for statistics as well.
The mapper size stats were nested in a loop unnecessarily. Move them out.
Give existing stats better names, and add one which also tracks the number of
sentinels added.
This had no debug output. Since it was committed as NFC, it had no testcase.
The me of today was nerdsniped by the me of 6 years ago and decided that this
ought to have a testcase and some debug output.
The MachineOutliner + SuffixTree both used `std::vector` everywhere because I
didn't know any better at the time.
At least for small types, such as `unsigned` and iterators, I can't see any
particular reason to use std::vector over `SmallVector` here.
Recommit with bug fixes + added testcases to the outliner. Also adds some
debug output.
We found a case in the Swift benchmarks where the MachineOutliner introduces
about a 20% compile time overhead in comparison to building without the
MachineOutliner.
The origin of this slowdown is that the benchmark has long blocks which incur
lots of LRU checks for lots of candidates.
Imagine a case like this:
```
bb:
i1
i2
i3
...
i123456
```
Now imagine that all of the outlining candidates appear early in the block, and
that something like, say, NZCV is defined at the end of the block.
The outliner has to check liveness for certain registers across all candidates,
because outlining from areas where those registers are used is unsafe at call
boundaries.
This is fairly wasteful because in the previously-described case, the outlining
candidates will never appear in an area where those registers are live.
To avoid this, precalculate areas where we will consider outlining from.
Anything outside of these areas is mapped to illegal and not included in the
outlining search space. This allows us to reduce the size of the outliner's
suffix tree as well, giving us a potential memory win.
By precalculating areas, we can also optimize other checks too, like whether
or not LR is live across an outlining candidate.
Doing all of this is about a 16% compile time improvement on the case.
This is likely useful for other targets (e.g. ARM + RISCV) as well, but for now,
this only implements the AArch64 path. The original "is the MBB safe" method
still works as before.
Add `nooutline` + update LangRef to say it exists.
This makes it possible to say "don't outline from this function ever."
We want to be able to toggle whether or not a function should be in the search
set regardless of default behaviour.
Add testcases for the IR Outliner + Machine Outliner.
Also remove an unnecessary check for an empty function in the Machine Outliner.
Differential Revision: https://reviews.llvm.org/D140438
This patch mechanically replaces None with std::nullopt where the
compiler would warn if None were deprecated. The intent is to reduce
the amount of manual work required in migrating from Optional to
std::optional.
This is part of an effort to migrate from llvm::Optional to
std::optional:
https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
1. When checking if a candidate contains a CFI instruction, actually
iterate over all of the instructions, instead of stopping halfway
through.
2. Make sure copied CFI directives refer to the correct instruction.
Fixes https://github.com/llvm/llvm-project/issues/55842
Differential Revision: https://reviews.llvm.org/D126930
This reverts commit 7f230feeeac8a67b335f52bd2e900a05c6098f20.
Breaks CodeGenCUDA/link-device-bitcode.cu in check-clang,
and many LLVM tests, see comments on https://reviews.llvm.org/D121169
We found a case in the Swift benchmarks where the MachineOutliner introduces
about a 20% compile time overhead in comparison to building without the
MachineOutliner.
The origin of this slowdown is that the benchmark has long blocks which incur
lots of LRU checks for lots of candidates.
Imagine a case like this:
```
bb:
i1
i2
i3
...
i123456
```
Now imagine that all of the outlining candidates appear early in the block, and
that something like, say, NZCV is defined at the end of the block.
The outliner has to check liveness for certain registers across all candidates,
because outlining from areas where those registers are used is unsafe at call
boundaries.
This is fairly wasteful because in the previously-described case, the outlining
candidates will never appear in an area where those registers are live.
To avoid this, precalculate areas where we will consider outlining from.
Anything outside of these areas is mapped to illegal and not included in the
outlining search space. This allows us to reduce the size of the outliner's
suffix tree as well, giving us a potential memory win.
By precalculating areas, we can also optimize other checks too, like whether
or not LR is live across an outlining candidate.
Doing all of this is about a 16% compile time improvement on the case.
This is likely useful for other targets (e.g. ARM + RISCV) as well, but for now,
this only implements the AArch64 path. The original "is the MBB safe" method
still works as before.
Useful for debugging + evaluating improvements to the outliner.
Stats are the number of illegal, legal, and invisible instructions in the
unsigned vector, and it's total length.
We have the `clang -cc1` command-line option `-funwind-tables=1|2` and
the codegen option `VALUE_CODEGENOPT(UnwindTables, 2, 0) ///< Unwind
tables (1) or asynchronous unwind tables (2)`. However, this is
encoded in LLVM IR by the presence or the absence of the `uwtable`
attribute, i.e. we lose the information whether to generate want just
some unwind tables or asynchronous unwind tables.
Asynchronous unwind tables take more space in the runtime image, I'd
estimate something like 80-90% more, as the difference is adding
roughly the same number of CFI directives as for prologues, only a bit
simpler (e.g. `.cfi_offset reg, off` vs. `.cfi_restore reg`). Or even
more, if you consider tail duplication of epilogue blocks.
Asynchronous unwind tables could also restrict code generation to
having only a finite number of frame pointer adjustments (an example
of *not* having a finite number of `SP` adjustments is on AArch64 when
untagging the stack (MTE) in some cases the compiler can modify `SP`
in a loop).
Having the CFI precise up to an instruction generally also means one
cannot bundle together CFI instructions once the prologue is done,
they need to be interspersed with ordinary instructions, which means
extra `DW_CFA_advance_loc` commands, further increasing the unwind
tables size.
That is to say, async unwind tables impose a non-negligible overhead,
yet for the most common use cases (like C++ exceptions), they are not
even needed.
This patch extends the `uwtable` attribute with an optional
value:
- `uwtable` (default to `async`)
- `uwtable(sync)`, synchronous unwind tables
- `uwtable(async)`, asynchronous (instruction precise) unwind tables
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D114543
This patch implements a new MachineFunction in the ARM backend for
placing BTI instructions. It is similar to the existing AArch64
aarch64-branch-targets pass.
BTI instructions are inserted into basic blocks that:
- Have their address taken
- Are the entry block of a function, if the function has external
linkage or has its address taken
- Are mentioned in jump tables
- Are exception/cleanup landing pads
Each BTI instructions is placed in the beginning of a BB after the
so-called meta instructions (e.g. exception handler labels).
Each outlining candidate and the outlined function need to be in agreement about
whether BTI placement is enabled or not. If branch target enforcement is
disabled for a function, the outliner should not covertly enable it by emitting
a call to an outlined function, which begins with BTI.
The cost mode of the outliner is adjusted to account for the extra BTI
instructions in the outlined function.
The ARM Constant Islands pass will maintain the count of the jump tables, which
reference a block. A `BTI` instruction is removed from a block only if the
reference count reaches zero.
PAC instructions in entry blocks are replaced with PACBTI instructions (tests
for this case will be added in a later patch because the compiler currently does
not generate PAC instructions).
The ARM Constant Island pass is adjusted to handle BTI
instructions correctly.
Functions with static linkage that don't have their address taken can
still be called indirectly by linker-generated veneers and thus their
entry points need be marked with BTI or PACBTI.
The changes are tested using "LLVM IR -> assembly" tests, jump tables
also have a MIR test. Unfortunately it is not possible add MIR tests
for exception handling and computed gotos because of MIR parser
limitations.
This patch is part of a series that adds support for the PACBTI-M extension of
the Armv8.1-M architecture, as detailed here:
https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/armv8-1-m-pointer-authentication-and-branch-target-identification-extension
The PACBTI-M specification can be found in the Armv8-M Architecture Reference
Manual:
https://developer.arm.com/documentation/ddi0553/latest
The following people contributed to this patch:
- Mikhail Maltsev
- Momchil Velikov
- Ties Stuij
Reviewed By: ostannard
Differential Revision: https://reviews.llvm.org/D112426
The compiler needs to mark register $x0 as live in for the following case.
$x1 = ADDXri $sp, 16, 0
BL @spam, csr_darwin_aarch64_aapcs, implicit-def dead $lr, implicit $sp, implicit $x0, implicit killed $x1, implicit-def $sp, implicit-def dead $x0
Reviewed By: paquette
Differential Revision: https://reviews.llvm.org/D95267
The debug location is removed from any outlined instruction. This
causes the MachineVerifier to crash on outlined DBG_VALUE
instructions.
Then, debug instructions are "invisible" to the outliner, that is, two
ranges of instructions from different functions are considered
identical if the only difference is debug instructions. Since a debug
instruction from one function is unlikely to provide sensible debug
information about all functions, sharing an outlined sequence, this
patch just removes debug instructions from the outlined functions.
Differential Revision: https://reviews.llvm.org/D89485
We only need to include MachineInstrBundle.h, but exposes an implicit dependency in MachineOutliner.h.
Also, remove duplicate includes from LiveRegUnits.cpp + MachineOutliner.cpp.
This prevents the outlined functions from pulling in a lot of unnecessary code
in our downstream libraries/linker. Which stops outlining making codesize
worse in c++ code with no-exceptions.
Differential Revision: https://reviews.llvm.org/D57254
This moves the SuffixTree test used in the Machine Outliner and moves it into Support for use in other outliners elsewhere in the compilation pipeline.
Differential Revision: https://reviews.llvm.org/D80586