2461 Commits

Author SHA1 Message Date
Maksim Panchenko
d5956fb8f9
[BOLT][AArch64] Add support for short LLD thunks/veneers (#118422)
When a callee function is closer than 256MB from its call site, LLD
linker can strategically create a short thunk for the function with a
single branch instruction (that covers +/-128MB). Detect and convert
such thunks into direct calls in BOLT.
2024-12-03 13:44:51 -08:00
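As a rough illustration of the range arithmetic behind #118422: an AArch64 `B`/`BL` instruction encodes a signed 26-bit word offset, so one branch reaches about +/-128MB, and a call plus a one-branch thunk can span close to 256MB. The sketch below is not BOLT or LLD code; the helper names `in_direct_branch_range` and `short_thunk_can_bridge` are hypothetical and exist only for this example.

```python
# Illustrative sketch only; helper names are hypothetical, not BOLT APIs.
B_RANGE = 128 * 1024 * 1024  # reach of a single AArch64 B/BL: +/-128MB


def in_direct_branch_range(source: int, target: int) -> bool:
    """Return True if one B/BL placed at `source` can reach `target`."""
    offset = target - source
    # imm26 is a signed word offset: [-128MB, +128MB - 4], 4-byte aligned.
    return offset % 4 == 0 and -B_RANGE <= offset <= B_RANGE - 4


def short_thunk_can_bridge(call_site: int, thunk: int, callee: int) -> bool:
    """A short thunk works when the call reaches the thunk and the thunk's
    single branch reaches the callee."""
    return (in_direct_branch_range(call_site, thunk)
            and in_direct_branch_range(thunk, callee))
```

With these definitions, a callee that is out of range for a direct call can still be reached through a thunk placed roughly halfway between the call site and the callee.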
Paschalis Mpeis
51003076eb
Reapply [BOLT] DataAggregator support for binaries with multiple text segments (#118023)
When a binary has multiple text segments, the Size is computed as the
difference of the last address of these segments from the BaseAddress.
The base addresses of all text segments must be the same.

Introduces flag 'perf-script-events' for testing, which allows passing
perf events without BOLT having to parse them by invoking 'perf script'.
The flag is used to pass a mock perf profile that has two memory
mappings for a mock binary that has two text segments. The mapping
size is updated as `parseMMapEvents` now processes all text segments.
2024-12-02 09:20:40 +00:00
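The size rule described in #118023 can be sketched as follows. This is a simplified model, not the `parseMMapEvents` implementation: the `TextMapping` record and its field names are assumptions made for the example.

```python
from typing import NamedTuple


class TextMapping(NamedTuple):
    """Hypothetical record for one text-segment mapping of a binary."""
    base_address: int  # base address derived for the binary from this mapping
    start: int         # address where the segment is mapped
    size: int          # length of the mapping in bytes


def combined_size(mappings):
    """Size = last mapped address of any text segment minus the base address."""
    bases = {m.base_address for m in mappings}
    if len(bases) != 1:
        # The commit requires all text segments to share one base address.
        raise ValueError("text segments disagree on the base address")
    base = bases.pop()
    last_address = max(m.start + m.size for m in mappings)
    return last_address - base
```

For two segments mapped at `0x1000` and `0x5000` with the same base address `0x1000`, the combined size covers the gap between them as well as both segments.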
David Spickett
085e7d2b22
[bolt] Move CODE_OWNERS.txt to Maintainers.txt (#118082)
To align with: https://llvm.org/docs/DeveloperPolicy.html#maintainers

I have not changed the format of the file, my only goal here is that the
project have a `bolt/Maintainers.*` so it is easy to find.
2024-12-02 09:12:57 +00:00
Peter Waller
b5ed375f9d
[BOLT] Skip _init; avoiding GOT breakage for static binaries (#117751)
_init is used during startup of binaries. Unfortunately, its
address can be shared (at least on AArch64 glibc static binaries) with a
data reference that lives in the GOT. The GOT rewriting is currently unable
to distinguish between data addresses and function addresses. This leads
to the data address being incorrectly rewritten, causing a crash on
startup of the binary:

  Unexpected reloc type in static binary.

To avoid this, don't consider _init for being moved, by skipping it.

~We could add further conditions to narrow the skipped case for known
crashes, but as a straw man I thought it'd be best to keep the condition
as simple as possible and see if there are any objections to this.~
(Edit: this broke the test
bolt/test/runtime/X86/retpoline-synthetic.test,
because _init was skipped from the retpoline pass and it has an indirect
call in it, so I include a check for static binaries now, which avoids
the test failure,
but perhaps this could/should be narrowed further?)

For now, skip _init for static binaries on any architecture; we could
add further conditions to narrow the skipped case for known crashes, but
as a straw man I thought it'd be best to keep the condition as simple as
possible and see if there are any objections to this.

Updates #100096.
2024-11-28 14:59:07 +00:00
Sander de Smalen
318c69de52
Reland "[AArch64] Define high bits of FPR and GPR registers (take 2) (#114827)"
The issue with slow compile-time was caused by an assert in
AArch64RegisterInfo.cpp. The assert invokes 'checkAllSuperRegsMarked'
after adding all the reserved registers. This call gets very expensive
after adding the _HI registers due to the way the function searches
in the 'Exception' list, which is expected to be a small list but isn't
(the patch added 190 _HI regs).

It was possible to rewrite the code in such a way that the _HI registers
are marked as reserved after the check. This makes the problem go away
entirely and restores compile-time to what it was before (tested for
`check-runtimes`, which previously showed a ~5x slowdown).

This reverts commits:
  1434d2ab215e3ea9c5f34689d056edd3d4423a78
  2704647fb7986673b89cef1def729e3b022e2607
2024-11-27 13:31:59 +00:00
Enna1
4d2bc0adc6
[BOLT] Extract comparator for sorting functions by index into helper function (#116217)
This change extracts the comparator for sorting functions by index into
a helper function `compareBinaryFunctionByIndex()`.

Not sure why the comparator used in
`BinaryContext::getSortedFunctions()` is not the same as the other two
places. I think they should use the same comparator, so I also change
`BinaryContext::getSortedFunctions()` to use
`compareBinaryFunctionByIndex()` for sorting functions.
2024-11-27 09:01:12 +08:00
Raul Tambre
003b48e0cb
[BOLT][test] enable GNU extensions, use C++ compiler, remove unnecessary target (#117043)
1. With a Clang that doesn't default to GNU extensions they need to be enabled explicitly.
2. The X86 directory lit config sets it already, there's no reason for this test to do it by itself.
3. The C frontend executable will fail if there's for example a Clang resource file for the C++ mode that sets C++-specific options:
```
+ /home/tambre/dev/llvm/build/bin/clang --target=x86_64-unknown-linux-gnu -fPIE -fuse-ld=lld -Wl,--unresolved-symbols=ignore-all -pie -fPIC -shared /home/tambre/dev/llvm/bolt/test/R_ABS.pic.lld.cpp -o /home/tambre/dev/llvm/build/tools/bolt/test/Output/R_ABS.pic.lld.cpp.tmp.so -Wl,-q -fuse-ld=lld
clang: warning: argument unused during compilation: '-pie' [-Wunused-command-line-argument]
error: invalid argument '-std=c23' not allowed with 'C++'
```
2024-11-27 00:14:00 +02:00
Hans Wennborg
537343dea4
Revert "[BOLT] DataAggregator support for binaries with multiple text segments (#92815)"
This caused test failures, see comment on the PR:

  Failed Tests (2):
    BOLT-Unit :: Core/./CoreTests/AArch64/MemoryMapsTester/MultipleSegmentsMismatchedBaseAddress/0
    BOLT-Unit :: Core/./CoreTests/X86/MemoryMapsTester/MultipleSegmentsMismatchedBaseAddress/0

> When a binary has multiple text segments, the Size is computed as the
> difference of the last address of these segments from the BaseAddress.
> The base addresses of all text segments must be the same.
>
> Introduces flag 'perf-script-events' for testing. It allows passing perf events
> without BOLT having to parse them using 'perf script'. The flag is used to
> pass a mock perf profile that has two memory mappings for a mock binary
> that has two text segments. The size of the mapping is updated as this
> change `parseMMapEvents` processes all text segments.

This reverts commit 4b71b3782d217db0138b701c4514bd2168ca1659.
2024-11-26 14:59:30 +01:00
Paschalis Mpeis
957c2ac4f1
[BOLT] Fix for bughunter.sh in offline mode (#116649)
In offline mode, the script sets the 'PASS' variable but does not use it.
The surrounding code suggests using the 'FAIL' variable instead.
2024-11-25 13:13:10 +00:00
Paschalis Mpeis
4b71b3782d
[BOLT] DataAggregator support for binaries with multiple text segments (#92815)
When a binary has multiple text segments, the Size is computed as the
difference of the last address of these segments from the BaseAddress.
The base addresses of all text segments must be the same.

Introduces flag 'perf-script-events' for testing. It allows passing perf events
without BOLT having to parse them using 'perf script'. The flag is used to
pass a mock perf profile that has two memory mappings for a mock binary
that has two text segments. The size of the mapping is updated as
`parseMMapEvents` now processes all text segments.
2024-11-25 13:12:43 +00:00
Maksim Panchenko
2704647fb7
Revert "Fix up MCPlusBuilder.cpp to account for W0_HI on AArch64"
This reverts commit 576865a50e6ccb74196c9491fa79575d6d7f0b0b.

Depends on #114827 that was reverted.
2024-11-22 13:57:30 -08:00
Maksim Panchenko
92301180f7
[BOLT] Use compact EH format for fixed-address executables (#117274)
Use ULEB128 format for emitting LSDAs for fixed-address executables,
similar to what we use for PIEs/DSOs. The main difference is that we don't
use landing pad trampolines when landing pads are not contained in a
single fragment. Instead, we fall back to emitting larger fixed-address
LSDAs, which is still better than adding trampoline instructions.
2024-11-22 00:28:55 -08:00
Maksim Panchenko
105ecd8bb2
[BOLT] Avoid EH trampolines for PIEs/DSOs (#117106)
We used to emit EH trampolines for PIE/DSO whenever a function fragment
contained a landing pad outside of it. However, it is common to have all
landing pads in a cold fragment even when their throwers are in a hot
one.

To reduce the number of trampolines, analyze landing pads for any given
function fragment, and if they all belong to the same (possibly
different) fragment, designate that fragment as a landing pad fragment
for the "thrower" fragment. Later, emit landing pad fragment symbol as
an LPStart for the thrower LSDA.
2024-11-21 18:18:30 -08:00
Maksim Panchenko
3282be1f8d
[BOLT] Use ULEB128 encoding for PIE/DSO exception tables (#116911)
Use ULEB128 encoding for call sites in PIE/DSO binaries. The encoding
reduces the size of the tables compared to sdata4 and is the default
format used by Clang.

Note that for fixed-address executables we still use absolute addressing
to cover cases where landing pads can reside in different function
fragments.

For testing, we rely on runtime EH tests.
2024-11-20 12:29:23 -08:00
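To show why ULEB128 shrinks call site tables relative to sdata4 (#116911), here is a minimal encoder sketch. It follows the standard DWARF ULEB128 definition (7 payload bits per byte, high bit set on every byte except the last); the function name `encode_uleb128` is chosen for this example and is not taken from BOLT.

```python
def encode_uleb128(value: int) -> bytes:
    """Encode a non-negative integer as ULEB128 (little-endian base 128)."""
    if value < 0:
        raise ValueError("ULEB128 encodes unsigned values only")
    out = bytearray()
    while True:
        byte = value & 0x7F   # low 7 bits go into this byte
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte: continuation bit clear
            return bytes(out)
```

An sdata4 field always takes 4 bytes, while offsets below 128 take a single ULEB128 byte and offsets below 16384 take two, which is the common case for call site ranges inside one function.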
Maksim Panchenko
066dd91ad8
[BOLT] Offset LPStart to avoid unnecessary instructions (#116713)
For C++ exception handling, when we write a call site table, we must
avoid emitting 0-value offsets for landing pads unless the call site has
no landing pad. However, 0 can be a real offset from the start of the
FDE if the FDE corresponds to a function fragment that starts with a
landing pad. In such cases, we used to emit a trap instruction at the
start of the fragment to guarantee non-zero LP offset.

To avoid emitting unnecessary trap instructions, we can instead set
LPStart to an offset from the FDE. If we emit it as [FDEStart - 1], then
all real offsets from LPStart in FDE become strictly positive.
2024-11-19 16:45:03 -08:00
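The LPStart adjustment in #116713 comes down to simple arithmetic, sketched below under the assumption that landing pads are given as absolute addresses and that `None` marks a call site without a landing pad. The function name `call_site_lp_offsets` is hypothetical.

```python
def call_site_lp_offsets(fragment_start, landing_pads):
    """Return (LPStart, offsets) with LPStart placed one byte before the
    fragment, so offset 0 stays reserved for 'no landing pad'."""
    lp_start = fragment_start - 1
    offsets = [0 if lp is None else lp - lp_start for lp in landing_pads]
    return lp_start, offsets
```

A landing pad located at the very start of the fragment now gets offset 1 instead of 0, so no trap instruction is needed in front of it.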
Maksim Panchenko
996553228f
[BOLT] Overwrite .eh_frame and .gcc_except_table (#116755)
Under --use-old-text or --strict, we completely rewrite contents of EH
frames and exception tables sections. If new contents of either section
do not exceed the size of the original section, rewrite the section
in-place.
2024-11-19 12:59:05 -08:00
Maksim Panchenko
08ef939637
[BOLT] Overwrite .eh_frame_hdr in-place (#116730)
If the new EH frame header can fit into the original .eh_frame_hdr
section, overwrite it in-place and pad with zeroes.
2024-11-18 20:42:38 -08:00
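The in-place overwrite rule from #116730 (and the analogous size check in #116755) can be sketched as below. This is a byte-level model only; `overwrite_in_place` is a hypothetical helper, and returning `None` stands in for whatever fallback the rewriter uses when the new contents do not fit.

```python
def overwrite_in_place(original: bytes, new: bytes):
    """Return the new section contents padded with zeroes to the original
    size, or None if the new contents do not fit."""
    if len(new) > len(original):
        return None  # caller must place the section elsewhere
    return new + b"\x00" * (len(original) - len(new))
```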
Maksim Panchenko
93a4244523
[BOLT] Use new assembler directives for EH table emission (#116294)
When emitting C++ exception tables (LSDAs), BOLT used to estimate the
size of the tables beforehand. This implementation was necessary as the
assembler/streamer lacked the emitULEB128IntValue() functionality.

As I plan to introduce [u|s]leb128-encoded exception tables in BOLT,
now is a perfect time to switch to the new API and eliminate the need
to pre-compute the size of the tables.
2024-11-17 12:40:07 -08:00
Sander de Smalen
576865a50e
Fix up MCPlusBuilder.cpp to account for W0_HI on AArch64
Landing #114827 broke these tests, because they did not account
for the new artificial registers.
2024-11-14 12:02:14 +00:00
Maksim Panchenko
1b8e0cf090
[BOLT] Never emit "large" functions (#115974)
"Large" functions are functions that are too big to fit into their
original slots after code modifications. CheckLargeFunctions pass is
designed to prevent such functions from emission. Extend this pass to
work with functions with constant islands.

Now that CheckLargeFunctions covers all functions, it guarantees that we
will never see such functions after code emission on all platforms
(previously it was guaranteed on x86 only). Hence, we can get rid of
RewriteInstance extensions that were meant to support "large" functions.
2024-11-13 09:58:44 -08:00
Maksim Panchenko
d922045381
[BOLT] Use AsmInfo for address size. NFCI (#115932)
Use AsmInfo instead of DWARFObj interface for extracting address size
and format.
2024-11-12 11:53:34 -08:00
Maksim Panchenko
be89e794f7
[BOLT][AArch64] Add support for long absolute LLD thunks/veneers (#113408)
Absolute thunks generated by LLD reference function addresses recorded
as data in code. Since they are generated by the linker, they don't have
relocations associated with them and thus the addresses are left
undetected. Use pattern matching to detect such thunks and handle them
in VeneerElimination pass.
2024-11-12 11:27:08 -08:00
Kazu Hirata
06e0869624
[BOLT] Fix warnings
This patch fixes:

  bolt/lib/Profile/StaleProfileMatching.cpp:694:24: error: unused
  variable 'BinHash' [-Werror,-Wunused-variable]

  bolt/lib/Profile/YAMLProfileWriter.cpp:206:61: error: missing field
  'GUID' initializer [-Werror,-Wmissing-field-initializers]

  bolt/lib/Profile/YAMLProfileReader.cpp:840:16: error: unused
  variable 'MatchedWithPseudoProbes' [-Werror,-Wunused-variable]
2024-11-12 09:39:57 -08:00
Shaw Young
9a9af0a23f
[BOLT] Match blocks with pseudo probes (#99891)
Match inline trees first between profile and the binary: by GUID,
checksum, parent, and inline site for inlined functions. Map profile
probes to binary probes via matched inline tree nodes. Each binary probe
has an associated binary basic block. If all probes from one profile
basic block map to the same binary basic block, it’s an exact match,
otherwise the block is determined by majority vote and reported as loose
match.

Pseudo probe matching happens between exact hash matching and call/loose
matching.

Introduce ProbeMatchSpec - a mechanism to match probes belonging to
another binary function. For example, given functions foo and bar:
```
void foo() {
  bar();
}
```
Profiled binary: bar is not inlined, so it has a top-level function bar.
New binary that the profile is applied to: bar is inlined into foo.

Currently, BOLT does 1:1 matching between profile functions and binary
functions based on the name. #100446 will extend this to N:M where
multiple profiles can be matched to one binary function (as in the
example above where binary function foo would use profiles for foo and
bar), and one profile can be matched to multiple binary functions (e.g.
if bar was inlined into multiple functions).

In this diff, ProbeMatchSpecs would only have one BinaryFunctionProfile
(existing name-based matching). 

Test Plan: Added match-blocks-with-pseudo-probes.test

Performance test:
- Setup:
  - Baseline no-BOLT: Clang with pseudo probes, ThinLTO + CSSPGO
  (#79942)
  - BOLT fresh: BOLTed Clang using fresh profile,
  - BOLT stale (hash): BOLTed Clang using stale profile (collected on
    Clang 10K commits back), `-infer-stale-profile` (hash+call block
    matching)
  - BOLT stale (+probe): BOLTed Clang using stale profile,
    `-infer-stale-profile` with `-stale-matching-with-pseudo-probes`
    (hash+call+pseudo probe block matching)
  - 2S Intel SKX Xeon 6138 with 40C/80T and 256GB RAM, using 20C/40T for
    build,
  - BOLT profiles are collected on Clang compiling large preprocessed
    C++ file.
- Benchmark: building Clang (average of 5 runs), see driver in
  aaupov/llvm-devmtg-2022
- Results, wall time, lower is better:
  - Baseline no-BOLT: 429.52 +- 2.61s,
  - BOLT stale (hash): 413.21 +- 2.19s,
  - BOLT stale (+probe): 409.69 +- 1.41s,
  - BOLT fresh: 384.50 +- 1.80s.

---------

Co-authored-by: Amir Ayupov <aaupov@fb.com>
2024-11-12 07:21:03 -08:00
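The exact-versus-loose decision described in #99891 can be sketched as a majority vote. The input here is assumed to be the list of binary basic blocks that the probes of one profile block mapped to; `match_block` is a hypothetical name, and tie-breaking (first block seen wins) is a choice made for this sketch, not something the commit specifies.

```python
from collections import Counter


def match_block(mapped_binary_blocks):
    """Return (binary_block, kind) for one profile basic block, where kind
    is 'exact' if every probe agrees and 'loose' otherwise."""
    if not mapped_binary_blocks:
        return None, None
    votes = Counter(mapped_binary_blocks)
    block, count = votes.most_common(1)[0]
    kind = "exact" if count == len(mapped_binary_blocks) else "loose"
    return block, kind
```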
Daniel Sanders
74003f11b3
[mc] Add CFI directive to emit val_offset() rules (#113971)
These specify that the value of the given register in the previous frame
is the CFA plus some offset. This isn't very common but can be necessary
if the original value is normally reconstructed from the stack/frame
pointer instead of being saved on the stack and reloaded from there.
2024-11-11 11:38:36 -08:00
Amir Ayupov
7ec682b16b
[MC] Use StringRefs from pseudo_probe_desc section if it's mapped
Add `IsMMapped` flag to `buildGUID2FuncDescMap` controlling whether to
allocate a string in `FuncNameAllocator` or use StringRef directly.
Keep it false by default, only set it for BOLT use case because BOLT
keeps file sections in memory while processing them. llvm-profgen
constructs GUID2FuncDescMap and then releases the binary.

For a medium-sized binary with a 0.8 GiB .pseudo_probe_desc section, this
saves 0.7 GiB peak RSS in perf2bolt.

Test Plan: no-op for llvm-profgen, NFC for perf2bolt

Reviewers: maksfb, dcci, wlei-llvm, rafaelauler, ayermolo

Reviewed By: wlei-llvm

Pull Request: https://github.com/llvm/llvm-project/pull/112996
2024-11-08 16:39:33 -08:00
Amir Ayupov
d936924f5e
[BOLT][NFC] Make YamlProfileToFunction a DenseMap (#108712)
YAML function profiles have sparse function IDs, which are taken from the
sequential function IDs of the profiled binary. For example, for one large
binary, the YAML profile has 15K functions, but the highest ID is ~600K,
close to the number of functions in the profiled binary.

In `matchProfileToFunction`, `YamlProfileToFunction` vector was resized
to match function ID, which entails a 40X overcommit. Change the type of
`YamlProfileToFunction` to DenseMap to reduce memory utilization.

#99891 makes use of it for profile lookup associated with a given binary
function.
2024-11-08 15:24:48 -08:00
Amir Ayupov
74e6478f81
[BOLT] Set call to continuation count in pre-aggregated profile
#109683 identified an issue with pre-aggregated profile where a call to
continuation fallthrough edge count is missing (profile discontinuity).

This issue only affects pre-aggregated profile but not perf data since
LBR stack has the necessary information to determine if the trace (fall-
through) starts at call continuation, whereas pre-aggregated fallthrough
lacks this information.

The solution is to look at branch records in pre-aggregated profiles
that correspond to returns and assign counts to call to continuation
fallthrough:
- BranchFrom is in another function or DSO,
- BranchTo may be a call continuation site:
  - not an entry point/landing pad.

Note that we can't directly check if BranchFrom corresponds to a return
instruction if it's in external DSO.

Keep call continuation handling for perf data (`getFallthroughsInTrace`)
[1] as-is due to marginally better performance. The difference is that
return-converted call to continuation fallthrough is slightly more
frequent than other fallthroughs since the former only requires one LBR
address while the latter needs two that belong to the profiled binary.
Hence return-converted fallthroughs have larger "weight" which affects
code layout.

[1] `DataAggregator::getFallthroughsInTrace`
fea18afeed/bolt/lib/Profile/DataAggregator.cpp (L906-L915)

Test Plan: added callcont-fallthru.s

Reviewers: maksfb, ayermolo, ShatianWang, dcci

Reviewed By: maksfb, ShatianWang

Pull Request: https://github.com/llvm/llvm-project/pull/109486
2024-11-07 16:20:19 -08:00
Kazu Hirata
accd8f98be
[BOLT] Fix a warning
This patch fixes:

  bolt/lib/Passes/LongJmp.cpp:830:14: error: variable 'NumIterations'
  set but not used [-Werror,-Wunused-but-set-variable]
2024-11-07 15:09:52 -08:00
Maksim Panchenko
49ee6069db
[BOLT][AArch64] Add support for compact code model (#112110)
Add `--compact-code-model` option that executes alternative branch
relaxation with an assumption that the resulting binary has less than
128MB of code. The relaxation is done in `relaxLocalBranches()`, which
operates on a function level and executes on multiple functions in
parallel.

Running the new option on AArch64 Clang binary produces slightly smaller
code and the relaxation finishes in about 1/10th of the time.

Note that the new `.text` has to be smaller than 128MB, *and* `.plt` has
to be closer than 128MB to `.text`.
2024-11-07 14:51:12 -08:00
Jacob Bramley
16cd5cdf4d
[BOLT] Ignore AArch64 markers outside their sections. (#74106)
AArch64 uses $d and $x symbols to delimit data embedded in code.
However, sometimes we see $d symbols, typically in .eh_frame, with
addresses that belong to different sections. These occasionally fall
inside .text functions and cause BOLT to stop disassembling, which in
turn causes DWARF CFA processing to fail.

As a workaround, we just ignore symbols with addresses outside the
section they belong to. This behaviour is consistent with objdump and
similar tools.
2024-11-07 15:16:14 +03:00
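The workaround in #74106 amounts to a bounds check per marker symbol. The sketch below models symbols as `(name, address, section_name)` tuples and sections as `{name: (start, size)}`; these shapes and the name `keep_valid_markers` are assumptions for illustration, not BOLT data structures.

```python
def keep_valid_markers(markers, sections):
    """Drop $d/$x markers whose address lies outside the section that the
    symbol claims to belong to."""
    kept = []
    for name, address, section_name in markers:
        start, size = sections[section_name]
        if start <= address < start + size:
            kept.append((name, address, section_name))
    return kept
```

A `$d` symbol attributed to `.eh_frame` but carrying an address inside `.text` is filtered out, so it can no longer interrupt disassembly of the function at that address.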
Sergei Barannikov
eeb987f6f3
[MC] Make generated MCInstPrinter::getMnemonic const (NFC) (#114682)
The value returned from the function depends only on the instruction opcode.

As a drive-by, change the type of the argument to const-reference.
2024-11-03 20:37:26 +03:00
Kazu Hirata
41baa69a7e
[BOLT] Fix warnings (#114116)
This patch fixes:

  bolt/lib/Core/BinaryFunction.cpp:2537:13: error: enumeration value
  'OpNegateRAStateWithPC' not handled in switch [-Werror,-Wswitch]

  bolt/lib/Core/BinaryFunction.cpp:2661:13: error: enumeration value
  'OpNegateRAStateWithPC' not handled in switch [-Werror,-Wswitch]

  bolt/lib/Core/BinaryFunction.cpp:2805:13: error: enumeration value
  'OpNegateRAStateWithPC' not handled in switch [-Werror,-Wswitch]
2024-10-29 13:52:22 -07:00
Amir Ayupov
cafd3e10c3
[BOLT][test] Fix NFC check with pre-aggregated-perf.test (#113944)
NFC checks have been failing starting with
https://lab.llvm.org/buildbot/#/builders/92/builds/8567.

NFC testing wrapper (llvm-bolt-wrapper) replaces the call of `perf2bolt`
with `llvm-bolt --aggregate-only --ignore-build-id`.

`show-density` is automatically enabled for perf2bolt only but not for
`llvm-bolt --aggregate-only`. Add the flag to the test to work around
the issue.

Test Plan:
```
cd build
../llvm-project/bolt/utils/nfc-check-setup.py --switch-back --verbose
bin/llvm-lit -a tools/bolt/test/X86/pre-aggregated-perf.test
```
2024-10-28 11:30:30 -07:00
Amir Ayupov
6ee5ff95ab
[BOLT] Add profile density computation
Reuse the definition of profile density from llvm-profgen (#92144):
- the density is computed in perf2bolt using raw samples (perf.data or
  pre-aggregated data),
- function density is the ratio of dynamically executed function bytes
  to the static function size in bytes,
- profile density:
  - functions are sorted by density in decreasing order, accumulating
    their respective sample counts,
  - profile density is the smallest density covering 99% of total sample
    count.

In other words, BOLT binary profile density is the minimum amount of
profile information per function (excluding functions in tail 1% sample
count) which is sufficient to optimize the binary well.

The density threshold of 60 was determined through experiments with
large binaries by reducing the sample count and checking resulting
profile density and performance. The threshold is conservative.

perf2bolt prints a warning if the density is below the threshold
and suggests increasing the sampling duration and/or frequency to reach
a given density, e.g.:
```
BOLT-WARNING: BOLT is estimated to optimize better with 2.8x more samples.
```

Test Plan: updated pre-aggregated-perf.test

Reviewers: maksfb, wlei-llvm, rafaelauler, ayermolo, dcci, WenleiHe

Reviewed By: WenleiHe, wlei-llvm

Pull Request: https://github.com/llvm/llvm-project/pull/101094
2024-10-24 18:30:59 -07:00
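The density definition in #101094 can be sketched as follows. Each function is modeled as an `(executed_bytes, size_bytes, samples)` tuple with a positive size; that input shape and the name `profile_density` are assumptions for this example, and the real computation in perf2bolt may differ in detail.

```python
def profile_density(functions, cutoff_percent=99):
    """Smallest per-function density among the densest functions that
    together cover `cutoff_percent` of all samples."""
    functions = list(functions)
    total_samples = sum(samples for _, _, samples in functions)
    if total_samples == 0:
        return 0.0
    # Function density: executed bytes divided by static size, densest first.
    ranked = sorted(
        ((executed / size, samples) for executed, size, samples in functions),
        key=lambda item: item[0],
        reverse=True,
    )
    covered = 0
    density = 0.0
    for density, samples in ranked:
        covered += samples
        # Integer comparison avoids floating-point rounding at the cutoff.
        if covered * 100 >= cutoff_percent * total_samples:
            break
    return density
```

Comparing the returned value against the threshold of 60 mentioned above would then decide whether to print the warning.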
Amir Ayupov
08916cef7e
[BOLT] Set RawBranchCount in DataAggregator
Align DataAggregator (Linux perf and pre-aggregated profile reader) to
DataReader (fdata profile reader) behavior: set BF->RawBranchCount which
is used in profile density computation (#101094).

Reviewers: ayermolo, maksfb, dcci, rafaelauler, WenleiHe

Reviewed By: WenleiHe

Pull Request: https://github.com/llvm/llvm-project/pull/101093
2024-10-24 18:28:44 -07:00
Kazu Hirata
6803062eb7
[BOLT] Fix a build failure
This patch fixes:

  bolt/lib/Core/DIEBuilder.cpp:285:40: error: too many arguments to
  function call, expected 2, have 3
2024-10-22 10:20:20 -07:00
Kazu Hirata
9f264e4d2f
[BOLT] Avoid repeated hash lookups (NFC) (#112822)
2024-10-18 08:39:31 -07:00
sinan
c3bbc3a57d
[BOLT] Fix logs with no hex conversion (#112650)
Add `utohexstr` to ensure that offsets/addresses are correctly formatted
as hexadecimal values.
2024-10-18 09:46:41 +08:00
Paschalis Mpeis
cb9bacf57d
[AArch64][BOLT] Ensure tentative code layout for cold BBs runs. (#96609)
When split functions is used, BOLT may skip tentative code layout
estimation in some cases, such as:
- when there is no profile data for some blocks (i.e., cold blocks)
- when there are cold functions in lite mode
- when skip functions is used

However, when rewriting the binary we still need to compute PC-relative
distances between hot and cold basic blocks. Without cold layout
estimation, BOLT uses '0x0' as the address of the first cold block,
leading to incorrect estimations of any PC-relative addresses.

This affects large binaries as the relaxStub method expands more
branches than necessary using the short-jump sequence, as it wrongly
believes it has exceeded the branch distance boundary.

This increases code size with both a larger and slower sequence;
however, performance regression is expected to be minimal since this
only affects any called cold code.
Example of such an unnecessary relaxation:
from:
```armasm
b       .Ltmp1234
```
 
to:
```armasm
adrp    x16, .Ltmp1234
add     x16, x16, :lo12:.Ltmp1234
br      x16
```
2024-10-17 08:59:05 +01:00
Amir Ayupov
3c4f00905e
[BOLT] Support perf2bolt-N in the driver
Check invoked tool with `starts_with`.

Addresses the issue where `perf2bolt` invoked using a distro symlink
`perf2bolt-16` fails to run in perf2bolt mode and runs in llvm-bolt mode
instead.

The issue is mentioned in https://vondra.me/posts/playing-with-bolt-and-postgres/

Test Plan:
```
ln -sf perf2bolt perf2bolt-20
perf2bolt-20 clang -p perf.data -o fdata.clang -w yaml.clang
...
PERF2BOLT: wrote 188593 objects and 0 memory objects to fdata.clang
```

Reviewers: ayermolo, rafaelauler, dcci, maksfb

Reviewed By: maksfb

Pull Request: https://github.com/llvm/llvm-project/pull/111072
2024-10-14 10:17:31 -07:00
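The prefix check described for `perf2bolt-N` can be sketched as below. This is a simplified stand-in for the driver logic: `tool_mode` is a hypothetical function, and it only distinguishes the two modes named in the commit message.

```python
import os


def tool_mode(argv0: str) -> str:
    """Pick the mode from the invoked name, accepting suffixed symlinks
    such as perf2bolt-16."""
    name = os.path.basename(argv0)
    # A prefix match (rather than equality) tolerates distro version suffixes.
    return "perf2bolt" if name.startswith("perf2bolt") else "llvm-bolt"
```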
Kazu Hirata
23c834092e
[BOLT] Avoid repeated set lookups (NFC) (#112157)
2024-10-14 06:55:04 -07:00
Kazu Hirata
7928e14f5e
[BOLT] Avoid repeated map lookups (NFC) (#112118)
2024-10-12 22:06:49 -07:00
Kazu Hirata
b192f208d6
[BOLT] Avoid repeated hash lookups (NFC) (#112073)
2024-10-12 08:03:39 -07:00
Amir Ayupov
79d695f049
[BOLT][NFCI] Speedup BAT::writeMaps
For a large binary with BAT section of size 38 MB with ~170k maps,
reduces writeMaps time from 70s down to 1s.

The inefficiency was in the use of std::distance with std::map::iterator
which doesn't provide random access. Use sorted vector for lookups.

Test Plan: NFC

Reviewers: maksfb, rafaelauler, dcci, ayermolo

Reviewed By: maksfb

Pull Request: https://github.com/llvm/llvm-project/pull/112061
2024-10-11 21:40:53 -07:00
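The speedup in the `BAT::writeMaps` change comes from replacing a linear walk (`std::distance` over `std::map` iterators) with an index lookup in a sorted sequence. A Python analogue of that idea, using binary search over a sorted list, might look like this; `index_of` is a hypothetical helper and does not mirror the actual BOLT code.

```python
from bisect import bisect_left


def index_of(sorted_keys, key):
    """Return the position of `key` in a sorted list in O(log n) time."""
    i = bisect_left(sorted_keys, key)
    if i == len(sorted_keys) or sorted_keys[i] != key:
        raise KeyError(key)
    return i
```

With ~170k maps, each lookup takes a few dozen comparisons instead of a walk over up to 170k tree nodes, which is consistent with the reported drop from 70s to 1s.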
Kazu Hirata
1be849c529
[BOLT] Avoid repeated hash lookups (NFC) (#111782)
2024-10-09 20:19:58 -07:00
Maksim Panchenko
0e86e5214c
[BOLT][AArch64] Reduce the number of ADR relaxations (#111577)
If an ADR instruction references the same function, we can skip relaxation
even if the function is split, as long as the ADR is in the main fragment.
2024-10-08 16:15:00 -07:00
ShatianWang
4cab01f072
[BOLT] Profile quality stats -- CFG discontinuity (#109683)
In a perfect profile, each positive-execution-count block in the
function’s CFG should be reachable from a positive-execution-count
function entry block through a positive-execution-count path. This new
pass checks how well the BOLT input profile satisfies this “CFG
continuity” property.

More specifically, for each of the hottest 1000 functions, the pass
calculates the function’s fraction of basic block execution counts that
is “unreachable”. It then reports the 95th percentile of the
distribution of the 1000 unreachable fractions in a single BOLT-INFO
line. The smaller the reported value is, the better the BOLT profile
satisfies the CFG continuity property.

The default value of 1000 above can be changed via the hidden BOLT
option `-num-functions-for-continuity-check=[N]`. If more detailed stats
are needed, `-v=1` can be added to the BOLT invocation: the hottest N
functions will be grouped into 5 equally-sized buckets, from the hottest
to the coldest; for each bucket, various summary statistics of the
distribution of the fractions and the raw unreachable execution counts
will be reported.
2024-10-08 19:07:43 -04:00
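The per-function "unreachable fraction" from #109683 can be sketched as a graph search. The sketch assumes a single entry block, block counts given as `{block: count}`, and edge counts given as `{(src, dst): count}`, and it treats a path as positive-count when every block and edge on it has a positive count. The name `unreachable_fraction` and these input shapes are illustrative, not the pass's real interface.

```python
from collections import deque


def unreachable_fraction(entry, block_counts, edge_counts):
    """Fraction of total block execution count that sits in blocks not
    reachable from the entry via positive-count edges and blocks."""
    total = sum(count for count in block_counts.values() if count > 0)
    if total == 0:
        return 0.0

    # Keep only edges that were actually taken.
    successors = {}
    for (src, dst), count in edge_counts.items():
        if count > 0:
            successors.setdefault(src, []).append(dst)

    # Breadth-first search from the entry, if the entry itself executed.
    reached = set()
    queue = deque()
    if block_counts.get(entry, 0) > 0:
        reached.add(entry)
        queue.append(entry)
    while queue:
        block = queue.popleft()
        for succ in successors.get(block, []):
            if succ not in reached and block_counts.get(succ, 0) > 0:
                reached.add(succ)
                queue.append(succ)

    unreachable = sum(
        count for block, count in block_counts.items()
        if count > 0 and block not in reached
    )
    return unreachable / total
```

The pass then aggregates these per-function fractions over the hottest N functions and reports the 95th percentile.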
Tex Riddell
e237d8aac8
[BOLT] Fix tests broken by abe0dd1 (#110071)
abe0dd195a3b2630afdc5c1c233eb2a068b2d72f (#109553) changed default
llvm-objdump output for consecutive zeros.

This broke two tests:
BOLT :: AArch64/constant_island_pie_update.s
BOLT :: AArch64/update-weak-reference-symbol.s

This fixes the test failures by adding -z to llvm-objdump in RUN line.
2024-09-25 19:34:57 -07:00
Maksim Panchenko
4db0cc4c55
[BOLT] Allow sections in --print-only flag (#109622)
While printing functions, expand --print-only flag to accept section
names. E.g., "--print-only=\.init" will only print functions from
".init" section.
2024-09-25 23:44:06 +02:00