The Global Function Merger
(https://discourse.llvm.org/t/rfc-global-function-merging/82608) pass
optimistically creates merged instances of functions and suffixes their
names with `.Tgm`. Then in the linker, ICF will (hopefully) fold these
`.Tgm` functions. For example, a function `foo` might become a thunk
`foo` that calls a merged function `foo.Tgm`.
Since IRPGO runs before the global merger, we will only have a profile
for `foo`. We want to correlate this profile to both `foo` and `foo.Tgm`
so they can both be ordered to improve startup time.
I built a large binary and found that it increased the number of
functions ordered for startup, as expected.
```
Functions for startup: 12049 -> 12697
Functions for compression: 34733 -> 34707
```
The reason why we don't see a larger improvement is because there are
some cases where the code was accidentally working:
`getRootSymbol("foo.llvm.5555.Tgm")` already returns `foo`.
Reland #120514 after 2f6e3df08a8b7cd29273980e47310cf09c6fdbd8 fixed
iteration order issue and libstdc++/libc++ differences.
---
Both options instruct the linker to optimize section layout with the
following goals:
* `--bp-compression-sort=[data|function|both]`: Improve Lempel-Ziv
compression by grouping similar sections together, resulting in a
smaller compressed app size.
* `--bp-startup-sort=function --irpgo-profile=<file>`: Utilize a
temporal profile file to reduce page faults during program startup.
The linker determines the section order by considering three groups:
* Function sections ordered according to the temporal profile
(`--irpgo-profile=`), prioritizing early-accessed and frequently
accessed functions.
* Function sections. Sections containing similar functions are placed
together, maximizing compression opportunities.
* Data sections. Similar data sections are placed together.
Within each group, the sections are ordered using the Balanced
Partitioning algorithm.
The linker constructs a bipartite graph with two sets of vertices:
sections and utility vertices.
* For profile-guided function sections:
+ The number of utility vertices is determined by the symbol order
within the profile file.
+ If `--bp-compression-sort-startup-functions` is specified, extra
utility vertices are allocated to prioritize nearby function similarity.
* For sections ordered for compression: Utility vertices are determined
by analyzing k-mers of the section content and relocations.
The call graph profile is disabled during this optimization.
When `--symbol-ordering-file=` is specified, sections described in that
file are placed earlier.
Co-authored-by: Pengying Xu <xpy66swsry@gmail.com>
Exposed by the test added in the reverted #120514
* Fix libstdc++/libc++ differences due to nth_element. https://github.com/llvm/llvm-project/pull/125450#issuecomment-2631404178
* Fix LLVM_ENABLE_REVERSE_ITERATION=1 differences
* Fix potential issue in `currentSize += D::getSize(*sections[*sectionIdxs.begin()])` where DenseSet was used, though not covered by a test
The ELF/bp-section-orderer.s test is failing on some buildbots due to
what seems like non-determinism issues, see comments on the original PR
and #125450
Reverting to green the build.
This reverts commit 0154dce8d39d2688b09f4e073fe601099a399365 and
follow-up commits 046dd4b28b9c1a75a96cf63465021ffa9fe1a979 and
c92f20416e6dbbde9790067b80e75ef1ef5d0fa4.
Add new ELF linker options for profile-guided section ordering
optimizations:
- `--irpgo-profile=<file>`: Read IRPGO profile data for use with startup
and compression optimizations
- `--bp-startup-sort={none,function}`: Order sections based on profile
data to improve star tup time
- `--bp-compression-sort={none,function,data,both}`: Order sections
using balanced partitioning to improve compressed size
- `--bp-compression-sort-startup-functions`: Additionally optimize
startup functions for compression
- `--verbose-bp-section-orderer`: Print statistics about balanced
partitioning section ordering
Thanks to the @ellishg, @thevinster, and their team's work.
---------
Co-authored-by: Fangrui Song <i@maskray.me>
PR #117514 refactored BPSectionOrderer to be used by the ELF port
but introduced some inefficiency:
* BPSectionBase/BPSymbol are wrappers around a single pointer.
The numbers of sections and symbols could be huge, and the extra
allocations are memory inefficient.
* Reconstructing the returned DenseMap (since BPSectionBase != InputSectin)
is wasteful.
This patch refactors BPSectionOrderer with Curiously Recurring Template
Pattern and eliminates the inefficiency. In addition,
`symbolToSectionIdxs` is removed and `rootSymbolToSectionIdxs` building
is moved to lld/MachO: while getting sections for symbols is cheap in
Mach-O, it is awkward and inefficient in the ELF port.
While here, add a file-level comment and replace some `StringMap<*>`
(which copies strings) with `DenseMap<CachedHashStringRef, *>`.
Pull Request: https://github.com/llvm/llvm-project/pull/124482
xxHash, inferior to xxh3, is discouraged. We try not to use xxhash in
lld.
Switch to read32le for content hash and xxh3/stable_hash_combine for
relocation hash. Remove the intermediate std::string for relocation
hash.
Change the tail hashing scheme to consider individual bytes instead.
This helps group 0102 and 0201 together. The benefit is negligible,
though.
Pull Request: https://github.com/llvm/llvm-project/pull/121729
--order_file, call graph profile, and BalancedPartitioning currently
build the section order vector by decreasing priority (from SIZE_MAX to
0). However, it's conventional to use an increasing key (see
OutputSection::inputOrder).
Switch to increasing priorities, remove the global variable
highestAvailablePriority, and remove the highestAvailablePriority
parameter from BPSectionOrderer. Change size_t to int.
This improves consistenty with the ELF and COFF ports. The ELF port
utilizes negative priorities for --symbol-ordering-file and call graph
profile, and non-negative priorities for --shuffle-sections (no Mach-O
counterpart yet).
Pull Request: https://github.com/llvm/llvm-project/pull/121727
For COFF and ELF that are mostly free of global states, lld::errs() and
lld::outs() should not be used. This migration change allows us to
remove lld::errs, which uses the global errorHandler().
The current diagnostic functions log/warn/error/fatal lack a context
argument and call the global `lld::errorHandler()`, which prevents
multiple lld instances in one process.
This patch introduces context-aware replacements:
* log => Log(ctx)
* warn => Warn(ctx)
* errorOrWarn => Err(ctx)
* error => ErrAlways(ctx)
* fatal => Fatal(ctx)
Example: `errorOrWarn(toString(f) + "xxx")` => `Err(ctx) << f << "xxx"`.
(`toString(f)` is shortened to `f` as a bonus and may access `ctx`
without accessing the global variable (see `Target.cpp`)).
`ctx.e = &context->e;` can be replaced with a non-global Errorhandler
when `ctx` becomes a local variable.
(For the ELF port, the long term goal is to eliminate `error`. Most can
be straightforwardly converted to `Err(ctx)`.)
lld ELF
[BitcodeFile](a527248a3c/lld/ELF/InputFiles.h (L328))
uses [string
saver](a527248a3c/lld/include/lld/Common/CommonLinkerContext.h (L57))
to keep copies of bitcode symbols. Symbol duplication is very common
when compiling application binaries.
This change proposes to introduce a UniqueStringSaver in lld context and
use it for bitcode symbol parsing. The implementation covers ELF only.
Similar opportunities should exist on other (COFF, MachO, wasm) formats.
For an internal production binary where lto indexing takes ~10GiB
originally, this changes optimizes away ~800MiB (~7.8%), measured by
https://github.com/google/pprof. Flame graph breaks down memory by usage
call stacks and agrees with this measurement.
This reverts commit aa495214b39d475bab24b468de7a7c676ce9e366.
As discussed in https://github.com/llvm/llvm-project/issues/53475 this patch
allows for using LLD-as-a-lib. It also lets clients link only the drivers that
they want (see unit tests).
This also adds the unit test infra as in the other LLVM projects. Among the
test coverage, I've added the original issue from @krzysz00, see:
https://github.com/ROCmSoftwarePlatform/D108850-lld-bug-reproduction
Important note: this doesn't allow (yet) linking in parallel. This will come a
bit later hopefully, in subsequent patches, for COFF at least.
Differential revision: https://reviews.llvm.org/D119049
Allow controlling the CodeGenOpt::Level independent of the LTO
optimization level in LLD via new options for the COFF, ELF, MachO, and
wasm frontends to lld. Most are spelled as --lto-CGO[0-3], but COFF is
spelled as -opt:lldltocgo=[0-3].
See D57422 for discussion surrounding the issue of how to set the CG opt
level. The ultimate goal is to let each function control its CG opt
level, but until then the current default means it is impossible to
specify a CG opt level lower than 2 while using LTO. This option gives
the user a means to control it for as long as it is not handled on a
per-function basis.
Reviewed By: MaskRay, #lld-macho, int3
Differential Revision: https://reviews.llvm.org/D141970
{D116279}, in addition to adding support for other demanglers, also
factored out some of the demangling logic. However, I don't think the
abstraction really carries its weight -- after {D135942}, only the ELF
and WASM backends call it with anything other than a non-constant
`shouldDemangle` argument. The COFF and Mach-O backends were already
doing the should-demangle check before calling `demangle()`.
Reviewed By: MaskRay, #lld-macho
Differential Revision: https://reviews.llvm.org/D135943
makeThreadLocal/makeThreadLocalN are moved from D130810 ([ELF] Parallelize input
section initialization) here to make D130810 more focused on the refactor:
* COFF has some needs for multiple linker contexts. D108850 partially removed
global states from lldCommon but left the global variable `lctx`.
* To the best of my knowledge, all multiple-linker-context feature requests to
ELF are more from user convenience, with no very strong argument.
* In practice, ELF port is very difficult to remove global states without
introducing significant performance regression/hurting code readability.
* Per-thread allocators from D122922/D123879 are too expensive and will not
really benefit ELF.
This patch adds a simple thread_local based makeThreadLocal to
lld/Common/Memory.h. It will enable further optimization in ELF.
This flag suppresses warnings produced by the linker. In ld64 this has
an interesting interaction with -fatal_warnings, it silences the
warnings but the link still fails. Instead of doing that here we still
print the warning and eagerly fail the link in case both are passed,
this seems more reasonable so users can understand why the link fails.
Differential Revision: https://reviews.llvm.org/D127564
The inline `lld::error` expands to two function calls `errorHandler` and `error`
where the latter is opaque. Move the functions to .cpp files to decrease code
size.
My x86-64 lld executable is 9KiB smaller.
Reviewed By: #lld-macho, thakis
Differential Revision: https://reviews.llvm.org/D120002
Main motivation: including `llvm/CodeGen/CommandFlags.h` in
`CommonLinkerContext.h` means that the declaration of `llvm::Reloc` is
visible in any file that includes `CommonLinkerContext.h`. Since our
cpp files have both `using namespace llvm` and `using namespace
lld::macho`, this results in conflicts with `lld::macho::Reloc`.
I suppose we could put `llvm::Reloc` into a nested namespace, but in general,
I think we should avoid transitively including too many header files in
a very widely used header like `CommonLinkerContext.h`.
RegisterCodeGenFlags' ctor initializes a bunch of function-`static`
structures and does nothing else, so it should be fine to "initialize"
it as a temporary stack variable rather than as a file static.
Reviewed By: aganea
Differential Revision: https://reviews.llvm.org/D119913
Having clarified that executing the SerializeToHsaco pass can
depend on a ROCm installation, switch from calling lld as a library to
using the copy of lld guaranteed to be included in a ROCm install.
This removes the workaround introduced in D119277
Reviewed By: whchung
Differential Revision: https://reviews.llvm.org/D119463
Move all variables at file-scope or function-static-scope into a hosting structure (lld::CommonLinkerContext) that lives at lldMain()-scope. Drivers will inherit from this structure and add their own global state, in the same way as for the existing COFFLinkerContext.
See discussion in https://lists.llvm.org/pipermail/llvm-dev/2021-June/151184.html
Differential Revision: https://reviews.llvm.org/D108850
This reverts commit 859ebca744e634dcc89a2294ffa41574f947bd62.
The change contained many unrelated changes and e.g. restored
unit test failes for the old lld port.
This reverts commit 640beb38e7710b939b3cfb3f4c54accc694b1d30.
That commit caused performance degradtion in Quicksilver test QS:sGPU and a functional test failure in (rocPRIM rocprim.device_segmented_radix_sort).
Reverting until we have a better solution to s_cselect_b64 codegen cleanup
Change-Id: Ibf8e397df94001f248fba609f072088a46abae08
Reviewed By: kzhuravl
Differential Revision: https://reviews.llvm.org/D115960
Change-Id: Id169459ce4dfffa857d5645a0af50b0063ce1105
LLVM core library supports demangling other mangled symbols other than itanium,
such as D and Rust. LLD should use those demanglers in order to output pretty
demangled symbols on error messages.
Reviewed By: MaskRay, #lld-macho
Differential Revision: https://reviews.llvm.org/D116279