Add an external interface to register a branch in a function that is in
disassembled state. Allows to make custom modifications to the
disassembler. E.g., a pre-CFG pass can add an instruction and register a
branch that will later be used during the CFG construction.
Move code that sorts TakenBranches right before the branches are used.
We can populate TakenBranches in pre-CFG post-processing and hence have
to postpone the sorting to a later point in the processing pipeline.
Will add such a pass later. For now it's NFC.
To avoid accidentally setting the label twice for the same instruction,
which can lead to a "lost" label, introduce getOrSetInstLabel()
function. Rename existing functions to getInstLabel()/setInstLabel() to
make it explicit that they operate on instruction labels. Add an
assertion in setInstLabel() that the instruction did not have a prior
label set.
Make core BOLT functionality more friendly to being used as a
library instead of in our standalone driver llvm-bolt. To
accomplish this, we augment BinaryContext with journaling streams
that are to be used by most BOLT code whenever something needs to
be logged to the screen. Users of the library can decide if logs
should be printed to a file, no file or to the screen, as
before. To illustrate this, this patch adds a new option
`--log-file` that allows the user to redirect BOLT logging to a
file on disk or completely hide it by using
`--log-file=/dev/null`. Future BOLT code should now use
`BinaryContext::outs()` for printing important messages instead of
`llvm::outs()`. A new test log.test enforces this by verifying that
no strings are print to screen once the `--log-file` option is
used.
In previous patches we also added a new BOLTError class to report
common and fatal errors, so code shouldn't call exit(1) now. To
easily handle problems as before (by quitting with exit(1)),
callers can now use
`BinaryContext::logBOLTErrorsAndQuitOnFatal(Error)` whenever code
needs to deal with BOLT errors. To test this, we have fatal.s
that checks we are correctly quitting and printing a fatal error
to the screen.
Because this is a significant change by itself, not all code was
yet ported. Code from Profiler libs (DataAggregator and friends)
still print errors directly to screen.
Co-authored-by: Rafael Auler <rafaelauler@fb.com>
Test Plan: NFC
As part of the effort to refactor old error handling code that
would directly call exit(1), in this patch continue the migration
on libCore, libRewrite and libPasses to use the new BOLTError
class whenever a failure occurs.
Test Plan: NFC
Co-authored-by: Rafael Auler <rafaelauler@fb.com>
Provide backwards compatibility for YAML profile that uses `std::hash`:
xxh3 hash is the default for newly produced profile (sets `std-hash:
false`),
whereas the profile that doesn't specify `std-hash` will be treated as
`std-hash: true`, preserving old behavior.
Simplify code in fixBranches(). Mostly NFC, accept the x86-specific
check for code fragments now takes into account presence of more than
two fragments. Should only matter when we split code into multiple
fragments and can run fixBranches() more than once.
Also, don't replace a branch target with the same one, as such operation
may allocate memory for extra MCSymbolRefExpr.
std::hash and ADT/Hashing::hash_value are non-deterministic functions
whose
results might vary across implementation/process/execution. Using xxh3
instead
for computing hashes of BinaryFunctions and BinaryBasicBlock for stale
profile
matching.
(A possible alternative is to use ADT/StableHashing.h based on FNV
hashing but
xxh3 seems to be more popular in LLVM)
This is to address https://github.com/llvm/llvm-project/issues/65241.
This is a follow-up to #73076. We need to reset output addresses for
deleted blocks, otherwise the address translation may mistakenly
attribute input address of a deleted block to a non-zero address.
While working on a test case, I've discovered that DWARF output ranges
were already broken for deleted basic blocks: #73428. I will provide a
test case for this PR with a DWARF address range fix PR.
This commit modifies BinaryContext::calculateEmittedSize() to update
the BinaryBasicBlock::OutputAddressRange of each basic block in the
function in place. BinaryBasicBlock::getOutputSize() now gives the
emitted size of the basic block.
Previously HasFixedIndirectBranch was set in BF to set isSimple to false
later because of unreachable bb ellimination pass which might remove the
BB with it's symbols accessed by other instructions than calls. It seems
to be that better solution would be to add extra entry point on target
offset instead of marking BF as non-simple.
Use MCAsmBackend::writeNopData() interface to emit NOP instructions on
x86. There are multiple forms of NOP instruction on x86 with different
sizes. Currently, LLVM's assembly/disassembly does not support all forms
correctly which can lead to a breakage of input code semantics, e.g. if
the program relies on NOP instructions for reserving a patch space.
Add "--keep-nops" option to preserve NOP instructions.
When NOP instructions are used to reserve space in the code, e.g. for
patching, it becomes critical to preserve their original size while
emitting the code. On x86, we rely on "Size" annotation for NOP
instructions size, as the original instruction size is lost in the
disassembly/assembly process.
This change makes instruction size a first-class annotation and is
affectively NFCI. A follow-up diff will use the annotation for code
emission.
When NOP instructions are removed by BOLT and a DWARF address range
falls past the removed instructions, it may lead to invalid DWARF ranges
in the output binary. E.g. the range may fall outside of the basic block
boundaries.
This fix makes sure the modified range fits within the containing basic
block. A proper fix requires tracking instructions within the block and
will come in a different PR.
In 8244ff6739a09cb75e6e7fd1c24b85e2b1397266, I've introduced an
assertion that incorrectly used BasicBlock::empty(). Some basic blocks
may contain only pseudo instructions and thus BB->empty() will evaluate
to false, while the actual code size will be zero.
When annotating MCInst instructions, attach extra annotation operands
directly to the annotated instruction, instead of attaching them to an
instruction pointed to by a special kInst operand.
With this change, it's no longer necessary to allocate MCInst and most
of the first-class annotations come with free memory as currently MCInst
is declared with:
SmallVector<MCOperand, 10> Operands;
i.e. more operands than are normally being used.
We still create a kInst operand with a nullptr instruction value to
designate the beginning of annotation operands. However, this special
operand might not be needed if we can rely on MCInstrDesc::NumOperands.
When we create new code for indirect code promotion optimization, we
should mark it as originating from the indirect jump instruction for
BOLT address translation (BAT) to map it to the original instruction.
Create BinaryFunction::translateInputToOutputRange() and use it for
updating DWARF debug ranges and location lists while de-duplicating the
existing code. Additionally, move DWARF-specific code out of
BinaryFunction and add print functions to facilitate debugging.
Note that this change is deliberately kept "bug-level" compatible with
the existing solution to keep it NFCI and make it easier to track any
possible regressions in the future updates to the ranges-handling code.
Some optimization passes may duplicate basic blocks and assign the same
input offset to a number of different blocks in a function. This is done
e.g. to correctly map debugging ranges for duplicated code.
However, duplicate input offsets present a problem when we use
AddressMap to generate new addresses for basic blocks. The output
address is calculated based on the input offset and will be the same for
blocks with identical offsets. The result is potentially incorrect debug
info and BAT records.
To address the issue, we have to eliminate the dependency on input
offsets while generating output addresses for a basic block. Each block
has a unique label, hence we extend AddressMap to include address lookup
based on MCSymbol and use the new functionality to update block
addresses.
On RISC-V, GNU as produces the following initial instruction in CIE's:
```
DW_CFA_def_cfa_register: r2
```
While I believe it is technically illegal to use this instruction
without first using a `DW_CFA_def_cfa` (since the offset is undefined),
both `readelf` and `llvm-dwarfdump` accept this and implicitly set the
offset to 0.
In BOLT, however, this triggers an assert (in `CFISnapshot::advanceTo`)
as it (correctly) believes the offset is not set. This patch fixes this
by setting the offset to 0 whenever executing `DW_CFA_def_cfa_register`
while the offset is undefined.
Note that this is probably the simplest workaround but it has a
downside: while emitting CFI start, we check if the initial instructions
are contained within `MCAsmInfo::getInitialFrameState` and omit them if
they are. This will not be true for GNU CIE's (since they differ from
LLVM's) which causes an unnecessary `DW_CFA_def_cfa_register` to be
emitted.
While technically correct, it would probably be better to replace the
GNU CIE with the one used by LLVM to avoid this situation. This would
solve the problem this patch solves while also preventing unnecessary
CFI instructions. However, this is a bit trickier to implement correctly
so I propose to keep this for a later time.
Note on testing: the test creates a simple function with three basic
blocks and forces the CFI state of the last one to be different from the
others using an arbitrary CFI instruction. Then,
`--reorder-blocks=reverse` is used to force `CFISnapshot::advanceTo` to
be called. This causes an assert on the current main branch.
Currently minimal alignment of function is hardcoded to 2 bytes.
Add 2 more cases:
1. In case BF is data in code return the alignment of CI as minimal
alignment
2. For aarch64 and riscv platforms return the minimal value of 4 (added
test for aarch64)
Otherwise fallback to returning the 2 as it previously was.
On RISC-V, there are certain relocations that target a specific
instruction instead of a more abstract location like a function or basic
block. Take the following example that loads a value from symbol `foo`:
```
nop
1: auipc t0, %pcrel_hi(foo)
ld t0, %pcrel_lo(1b)(t0)
```
This results in two relocation:
- auipc: `R_RISCV_PCREL_HI20` referencing `foo`;
- ld: `R_RISCV_PCREL_LO12_I` referencing to local label `1` which points
to the auipc instruction.
It is of utmost importance that the `R_RISCV_PCREL_LO12_I` keeps
referring to the auipc instruction; if not, the program will fail to
assemble. However, BOLT currently does not guarantee this.
BOLT currently assumes that all local symbols are jump targets and
always starts a new basic block at symbol locations. The example above
results in a CFG the looks like this:
```
.BB0:
nop
.BB1:
auipc t0, %pcrel_hi(foo)
ld t0, %pcrel_lo(.BB1)(t0)
```
While this currently works (i.e., the `R_RISCV_PCREL_LO12_I` relocation
points to the correct instruction), it has two downsides:
- Too many basic blocks are created (the example above is logically only
one yet two are created);
- If instructions are inserted in `.BB1` (e.g., by instrumentation),
things will break since the label will not point to the auipc anymore.
This patch proposes to fix this issue by teaching BOLT to track labels
that should always point to a specific instruction. This is implemented
as follows:
- Add a new annotation type (`kLabel`) that allows us to annotate
instructions with an `MCSymbol *`;
- Whenever we encounter a relocation type that is used to refer to a
specific instruction (`Relocation::isInstructionReference`), we
register it without a symbol;
- During disassembly, whenever we encounter an instruction with such a
relocation, create a symbol for its target and store it in an offset
to symbol map (to ensure multiple relocations referencing the same
instruction use the same label);
- After disassembly, iterate this map to attach labels to instructions
via the new annotation type;
- During emission, emit these labels right before the instruction.
I believe the use of annotations works quite well for this use case as
it allows us to reliably track instruction labels. If we were to store
them as offsets in basic blocks, it would be error prone to keep them
updated whenever instructions are inserted or removed.
I have chosen to add labels as first-class annotations (as opposed to a
generic one) because the documentation of `MCAnnotation` suggests that
generic annotations are to be used for optional metadata that can be
discarded without affecting correctness. As this is not the case for
labels, a first-class annotation seemed more appropriate.
Long tail calls use the following instruction sequence on RISC-V:
```
1: auipc xi, %pcrel_hi(sym)
jalr zero, %pcrel_lo(1b)(xi)
```
Since the second instruction in isolation looks like an indirect branch,
this confused BOLT and most functions containing a long tail call got
marked with "unknown control flow" and didn't get optimized as a
consequence.
This patch fixes this by detecting long tail call sequence in
`analyzeIndirectBranch`. `FixRISCVCallsPass` also had to be updated to
expand long tail calls to `PseudoTAIL` instead of `PseudoCALL`.
Besides this, this patch also fixes a minor issue with compressed tail
calls (`c.jr`) not being detected.
Note that I had to change `BinaryFunction::postProcessIndirectBranches`
slightly: the documentation of `MCPlusBuilder::analyzeIndirectBranch`
mentions that the [`Begin`, `End`) range contains the instructions
immediately preceding `Instruction`. However, in
`postProcessIndirectBranches`, *all* the instructions in the BB where
passed in the range. This made it difficult to find the preceding
instruction so I made sure *only* the preceding instructions are passed.
When current state is `CFG_Finalized`, function `validateCFG()` should return true directly.
Reviewed By: maksfb, yota9, Kepontry
Differential Revision: https://reviews.llvm.org/D159410
BOLT uses `MCAsmLayout` to calculate the output values of functions and
basic blocks. This means output values are calculated based on a
pre-linking state and any changes to symbol values during linking will
cause incorrect values to be used.
This issue can be triggered by enabling linker relaxation on RISC-V.
Since linker relaxation can remove instructions, symbol values may
change. This causes, among other things, the symbol table created by
BOLT in the output executable to be incorrect.
This patch solves this issue by using `BOLTLinker` to get symbol values
instead of `MCAsmLayout`. This way, output values are calculated based
on a post-linking state. To make sure the linker can update all
necessary symbols, this patch also makes sure all these symbols are not
marked as temporary so that they end-up in the object file's symbol
table.
Note that this patch only deals with symbols of binary functions
(`BinaryFunction::updateOutputValues`). The technique described above
turned out to be too expensive for basic block symbols so those are
handled differently in D155604.
Reviewed By: maksfb
Differential Revision: https://reviews.llvm.org/D154604
BOLT uses MCAsmLayout to calculate the output values of basic blocks.
This means output values are calculated based on a pre-linking state and
any changes to symbol values during linking will cause incorrect values
to be used.
This issue was first addressed in D154604 by adding all basic block
symbols to the symbol table for the linker to resolve them. However, the
runtime overhead of handling this huge symbol table turned out to be
prohibitively large.
This patch solves the issue in a different way. First, a temporary
section containing [input address, output symbol] pairs is emitted to the
intermediary object file. The linker will resolve all these references
so we end up with a section of [input address, output address] pairs.
This section is then parsed and used to:
- Replace BinaryBasicBlock::OffsetTranslationTable
- Replace BinaryFunction::InputOffsetToAddressMap
- Update BinaryBasicBlock::OutputAddressRange
Note that the reason this is more performant than the previous attempt
is that these symbol references do not cause entries to be added to the
symbol table. Instead, section-relative references are used for the
relocations.
Reviewed By: maksfb
Differential Revision: https://reviews.llvm.org/D155604
We identify instructions to be instrumented based on Offset annotation.
BOLT "expands" conditional tail calls into a conditional jump to a basic block
with unconditional tail call. Move Offset annotation from former CTC to the tail
call.
For expanded CTC we keep Offset attached to the original instruction which is
converted into a regular conditional jump, while leaving the newly created tail
call without an Offset annotation. This leads to attempting the instrumentation
of the conditional jump which points to the basic block with an inherited input
offset thus creating an invalid edge description. At the same time, the newly
created tail call is skipped entirely which means we're not creating a call
description for it.
If we instead reassign Offset annotation from the conditional jump to the tail
call we fix both issues. The conditional jump will be skipped not creating an
invalid edge description, while tail call will be handled properly (unformly
with regular calls).
Reviewed By: #bolt, maksfb
Differential Revision: https://reviews.llvm.org/D156389
Adding some logs related to stale profile matching. The new data can be helpful
to understand how "stale" the input profile is and how well the inference is
able to utilize the stale data.
Example of outputs on clang-10 built with LTO (profile collected on a year-old release):
```
BOLT-INFO: inferred profile for 2101 (18.52% of profiled, 100.00% of stale) functions responsible for 30.95% samples (14754697 out of 47670654)
BOLT-INFO: stale inference matched 89.42% of basic blocks (79052 out of 88402 stale) responsible for 76.99% samples (645737 out of 838719 stale)
```
LTO+AutoFDO:
```
BOLT-INFO: inferred profile for 6146 (57.57% of profiled, 100.00% of stale) functions responsible for 90.34% samples (50891403 out of 56330313)
BOLT-INFO: stale inference matched 74.55% of basic blocks (191295 out of 256589 stale) responsible for 57.30% samples (1288632 out of 2248799 stale)
```
Reviewed By: Amir, maksfb
Differential Revision: https://reviews.llvm.org/D154737
A jump table in a split function may contain an entry matching a start
address of another fragment of the function. While converting addresses
to labels, we used to ignore such entries resulting in underpopulated
jump table. Change that, so we always create one label per address.
Reviewed By: Amir
Differential Revision: https://reviews.llvm.org/D156013
Create LinuxKernelRewriter and move kernel-specific code to this class.
Depends on D154023
Reviewed By: Amir
Differential Revision: https://reviews.llvm.org/D154024
Migrate SDT markers processing to the new MetadataRewriter interface.
Depends on D154020
Reviewed By: Amir
Differential Revision: https://reviews.llvm.org/D154021
Just enough features are implemented to process a simple "hello world"
executable and produce something that still runs (including libc calls).
This was mainly a matter of implementing support for various
relocations. Currently, the following are handled:
- R_RISCV_JAL
- R_RISCV_CALL
- R_RISCV_CALL_PLT
- R_RISCV_BRANCH
- R_RISCV_RVC_BRANCH
- R_RISCV_RVC_JUMP
- R_RISCV_GOT_HI20
- R_RISCV_PCREL_HI20
- R_RISCV_PCREL_LO12_I
- R_RISCV_RELAX
- R_RISCV_NONE
Executables linked with linker relaxation will probably fail to be
processed. BOLT relocates .text to a high address while leaving .plt at
its original (low) address. This causes PC-relative PLT calls that were
relaxed to a JAL to not fit their offset in an I-immediate anymore. This
is something that will be addressed in a later patch.
Changes to the BOLT core are relatively minor. Two things were tricky to
implement and needed slightly larger changes. I'll explain those below.
The R_RISCV_CALL(_PLT) relocation is put on the first instruction of a
AUIPC/JALR pair, the second does not get any relocation (unlike other
PCREL pairs). This causes issues with the combinations of the way BOLT
processes binaries and the RISC-V MC-layer handles relocations:
- BOLT reassembles instructions one by one and since the JALR doesn't
have a relocation, it simply gets copied without modification;
- Even though the MC-layer handles R_RISCV_CALL properly (adjusts both
the AUIPC and the JALR), it assumes the immediates of both
instructions are 0 (to be able to or-in a new value). This will most
likely not be the case for the JALR that got copied over.
To handle this difficulty without resorting to RISC-V-specific hacks in
the BOLT core, a new binary pass was added that searches for
AUIPC/JALR pairs and zeroes-out the immediate of the JALR.
A second difficulty was supporting ABS symbols. As far as I can tell,
ABS symbols were not handled at all, causing __global_pointer$ to break.
RewriteInstance::analyzeRelocation was updated to handle these
generically.
Tests are provided for all supported relocations. Note that in order to
test the correct handling of PLT entries, an ELF file produced by GCC
had to be used. While I tried to strip the YAML representation, it's
still quite large. Any suggestions on how to improve this would be
appreciated.
Reviewed By: rafauler
Differential Revision: https://reviews.llvm.org/D145687
In lite mode (default for X86), BOLT optimizes and relocates functions
with profile. The rest of the code is preserved, but if it references
relocated code such references have to be updated. The update is handled
by scanExternalRefs() function. Note that we cannot solely rely on
relocations written by the linker, as not all code references are
exposed to the linker. Additionally, the linker can modify certain
instructions and relocations will no longer match the code.
With this change, start using symbolic disassembler for scanning code
for references in scanExternalRefs(). Unlike the previous approach, the
symbolizer properly detects and creates references for instructions with
multiple/ambiguous symbolic operands and handles cases where a
relocation doesn't match any operand. See test cases for examples.
Reviewed By: Amir
Differential Revision: https://reviews.llvm.org/D152631
This is a first "serious" version of stale profile matching in BOLT. This diff
extends the hash computation for basic blocks so that we can apply a fuzzy
hash-based matching. The idea is to compute several "versions" of a hash value
for a basic block. A loose version of a hash (computed by ignoring instruction
operands) allows to match blocks in functions whose content has been changed,
while stricter hash values (considering instruction opcodes with operands and
even based on hashes of block's successors/predecessors) allow to resolve
collisions. In order to save space and build time, individual hash components
are blended into a single uint64_t.
There are likely numerous ways of improving hash computation but already this
simple variant provides significant perf benefits.
**Perf testing** on the clang binary: collecting data on clang-10 and using it
to optimize clang-11 (with ~1 year of commits in between). Next, we compare
- //stale_clang// (clang-11 optimized with profile collected on clang-10 with **infer-stale-profile=0**)
- //opt_clang// (clang-11 optimized with profile collected on clang-11)
- //infer_clang// (clang-11 optimized with profile collected on clang-10 with **infer-stale-profile=1**)
`LTO-only` mode:
//stale_clang// vs //opt_clang//: task-clock [delta(%): 9.4252 ± 1.6582, p-value: 0.000002]
(That is, there is a ~9.5% perf regression)
//infer_clang// vs //opt_clang//: task-clock [delta(%): 2.1834 ± 1.8158, p-value: 0.040702]
(That is, the regression is reduced to ~2%)
Related BOLT logs:
```
BOLT-INFO: identified 2114 (18.61%) stale functions responsible for 30.96% samples
BOLT-INFO: inferred profile for 2101 (18.52% of all profiled) functions responsible for 30.95% samples
```
`LTO+AutoFDO` mode:
//stale_clang// vs //opt_clang//: task-clock [delta(%): 19.1293 ± 1.4131, p-value: 0.000002]
//infer_clang// vs //opt_clang//: task-clock [delta(%): 7.4364 ± 1.3343, p-value: 0.000002]
Related BOLT logs:
```
BOLT-INFO: identified 5452 (50.27%) stale functions responsible for 85.34% samples
BOLT-INFO: inferred profile for 5442 (50.23% of all profiled) functions responsible for 85.33% samples
```
Reviewed By: Amir
Differential Revision: https://reviews.llvm.org/D146661
Fix debug printing, making it easier to compare two debug logs side by side:
- `BinaryFunction::addRelocation`: print function name instead of `this` ptr,
- `DataAggregator::doTrace`: remove duplicated function name.
Reviewed By: #bolt, maksfb
Differential Revision: https://reviews.llvm.org/D152314
Extending yaml profile format with block hashes, which are used for stale
profile matching. To avoid duplication of the code, created a new class with a
collection of utilities for computing hashes.
Reviewed By: Amir
Differential Revision: https://reviews.llvm.org/D144306
When there is a direct jump right after an indirect one, in
the absence of code jumpting to this direct jump, this is obviously
dead code. However, BOLT was failing to recognize that by mistakenly
placing both jmp instructions in the same basic block, and creating
wrong successor edges. Fix that, so we can safely run UCE on
that. This bug also causes validateCFG to fail and BOLT to crash if it
is running ICP on that function.
Reviewed By: #bolt, Amir
Differential Revision: https://reviews.llvm.org/D148055
All users of MCCodeEmitter::encodeInstruction use a raw_svector_ostream
to encode the instruction into a SmallVector. The raw_ostream however
incurs some overhead for the actual encoding.
This change allows an MCCodeEmitter to directly emit an instruction into
a SmallVector without using a raw_ostream and therefore allow for
performance improvments in encoding. A default path that uses existing
raw_ostream implementations is provided.
Reviewed By: MaskRay, Amir
Differential Revision: https://reviews.llvm.org/D145791
`Function.RawBranchCount` is initialized for fdata profile but not for yaml one.
The diff adds the computation of the field for yaml profiles
Reviewed By: Amir
Differential Revision: https://reviews.llvm.org/D144211