## Problem
The `safe_thunks` ICF optimization in `lld-macho` was creating thunks
that pointed to `InputSection`s instead of `Symbol`s. While branch
relocations can generally point to either symbols or input sections, in
this case they must point to symbols, because the branch range extension
algorithm that runs later expects branches to always target `Symbol`s.
## Solution
This patch changes the ICF implementation so that safe thunks point to
`Symbol`s rather than `InputSection`s.
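As a rough illustration of the distinction (types simplified; `Reloc::referent` really is a `PointerUnion` of `Symbol` and `InputSection` in lld/MachO, but `setThunkTarget` is a made-up helper):
```
// Types simplified for illustration.
#include "llvm/ADT/PointerUnion.h"

struct Symbol { void *impl; };
struct InputSection { void *impl; };

struct Reloc {
  // A branch relocation may name either a Symbol or an InputSection.
  llvm::PointerUnion<Symbol *, InputSection *> referent;
};

// The branch range extension pass assumes branch targets are Symbols, so a
// safe_thunks thunk must reference the master copy's Symbol, not the
// InputSection that contains it.
void setThunkTarget(Reloc &r, Symbol *masterCopy) { r.referent = masterCopy; }
```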
## Testing
The existing `arm64-thunks.s` test is modified to include
`--icf=safe_thunks` to explicitly verify the interaction between ICF and
branch range extension thunks. Two functions were added that will be
merged together via a thunk. Before this patch, this test would trigger
an assert; now this scenario is handled correctly.
ICF runs before BPSectionOrderer. When a section is ICF'ed, it seems
that the original sections are marked as not live, but are still kept
around. Prior to this patch, those ICF'ed sections would be passed to BP
and ordered before being skipped when writing the output. Now, these
sections are no longer passed to BP, saving runtime and possibly
improving BP's output.
In a large binary, I found that the number of sections ordered using BP
decreased, while the number of duplicate sections drastically decreased
as expected.
```
Functions for startup: 50755 -> 50520
Functions for compression: 165734 -> 105328
Duplicate functions: 1827231 -> 55230
```
Note that PointerUnion::dyn_cast has been soft deprecated in
PointerUnion.h:
// FIXME: Replace the uses of is(), get() and dyn_cast() with
// isa<T>, cast<T> and the llvm::dyn_cast<T>
Literal migration would result in dyn_cast_if_present (see the
definition of PointerUnion::dyn_cast), but this patch uses dyn_cast
because we expect referent to be nonnull.
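A minimal sketch of what this migration looks like, assuming a simplified referent union (the helper function is hypothetical):
```
// Types simplified for illustration; getReferentSymbol is hypothetical.
#include "llvm/ADT/PointerUnion.h"
#include "llvm/Support/Casting.h"

struct Symbol { void *impl; };
struct InputSection { void *impl; };
using Referent = llvm::PointerUnion<Symbol *, InputSection *>;

Symbol *getReferentSymbol(Referent referent) {
  // Previously: referent.dyn_cast<Symbol *>() (member form, soft deprecated).
  // Plain dyn_cast rather than dyn_cast_if_present is fine here because the
  // referent is expected to be nonnull.
  return llvm::dyn_cast<Symbol *>(referent);
}
```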
Exposed by the test added in the reverted #120514
* Fix libstdc++/libc++ differences due to nth_element. https://github.com/llvm/llvm-project/pull/125450#issuecomment-2631404178
* Fix LLVM_ENABLE_REVERSE_ITERATION=1 differences
* Fix potential issue in `currentSize += D::getSize(*sections[*sectionIdxs.begin()])` where DenseSet was used, though not covered by a test
The ELF/bp-section-orderer.s test is failing on some buildbots due to
what seem to be non-determinism issues; see the comments on the original
PR and #125450.
Reverting to get the build back to green.
This reverts commit 0154dce8d39d2688b09f4e073fe601099a399365 and
follow-up commits 046dd4b28b9c1a75a96cf63465021ffa9fe1a979 and
c92f20416e6dbbde9790067b80e75ef1ef5d0fa4.
PR #117514 refactored BPSectionOrderer to be used by the ELF port
but introduced some inefficiency:
* BPSectionBase/BPSymbol are wrappers around a single pointer.
The numbers of sections and symbols could be huge, and the extra
allocations are memory inefficient.
* Reconstructing the returned DenseMap (since BPSectionBase != InputSection)
is wasteful.
This patch refactors BPSectionOrderer with Curiously Recurring Template
Pattern and eliminates the inefficiency. In addition,
`symbolToSectionIdxs` is removed and `rootSymbolToSectionIdxs` building
is moved to lld/MachO: while getting sections for symbols is cheap in
Mach-O, it is awkward and inefficient in the ELF port.
While here, add a file-level comment and replace some `StringMap<*>`
(which copies strings) with `DenseMap<CachedHashStringRef, *>`.
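A minimal CRTP sketch of the shape described above; the names and hooks here are illustrative, not the actual lld interface:
```
// Names illustrative only.
#include <cstdint>
#include <vector>

struct Section { uint64_t size = 0; };

template <class Derived> struct BPOrdererBase {
  // Shared ordering logic lives in the base; port-specific hooks are
  // resolved statically through Derived, so no per-section or per-symbol
  // wrapper objects need to be allocated.
  uint64_t totalSize(const std::vector<Section *> &sections) const {
    uint64_t sum = 0;
    for (const Section *sec : sections)
      sum += Derived::getSize(*sec);
    return sum;
  }
};

struct MachOOrderer : BPOrdererBase<MachOOrderer> {
  static uint64_t getSize(const Section &sec) { return sec.size; }
};
```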
Pull Request: https://github.com/llvm/llvm-project/pull/124482
Note that PointerUnion::dyn_cast has been soft deprecated in
PointerUnion.h:
// FIXME: Replace the uses of is(), get() and dyn_cast() with
// isa<T>, cast<T> and the llvm::dyn_cast<T>
This patch migrates uses of PointerUnion::dyn_cast to
dyn_cast_if_present (see the definition of PointerUnion::dyn_cast).
Note that we cannot use dyn_cast in any of the migrations in this
patch; placing
assert(!X.isNull());
just before any of the dyn_cast_if_present calls in this patch triggers
failures in check-lld.
The symbol string table does not have deduplication.
Here we add code to deduplicate the symbol string table.
This has a rather large size impact (20-30%) on unstripped binaries
(typically debug binaries) but no size impact on stripped
binaries (typically release binaries).
We enable deduplication by default and add a flag to disable it
(`-no-deduplicate-symbol-strings`).
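A minimal sketch of the deduplication idea, assuming a simple offset map keyed by `CachedHashStringRef` (the actual patch may structure this differently):
```
// Assumed shape of the deduplicating string table; offsets are returned so
// duplicate names share one entry.
#include "llvm/ADT/CachedHashString.h"
#include "llvm/ADT/DenseMap.h"
#include <vector>

struct StringTable {
  std::vector<char> data{'\0'}; // offset 0 reserved for the empty string
  llvm::DenseMap<llvm::CachedHashStringRef, uint32_t> offsets;

  // Note: the map keys reference the caller's string storage, which must
  // outlive the table (true for names backed by mmap'd inputs).
  uint32_t add(llvm::StringRef s) {
    auto [it, inserted] =
        offsets.try_emplace(llvm::CachedHashStringRef(s), data.size());
    if (inserted) { // first occurrence: append the string and a terminator
      data.insert(data.end(), s.begin(), s.end());
      data.push_back('\0');
    }
    return it->second; // duplicates reuse the existing offset
  }
};
```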
Note that PointerUnion::dyn_cast has been soft deprecated in
PointerUnion.h:
// FIXME: Replace the uses of is(), get() and dyn_cast() with
// isa<T>, cast<T> and the llvm::dyn_cast<T>
Literal migration would result in dyn_cast_if_present (see the
definition of PointerUnion::dyn_cast), but this patch uses cast
because we expect isa<Symbol *>(rel.referent) to be true.
Note that PointerUnion::dyn_cast has been soft deprecated in
PointerUnion.h:
// FIXME: Replace the uses of is(), get() and dyn_cast() with
// isa<T>, cast<T> and the llvm::dyn_cast<T>
Literal migration would result in dyn_cast_if_present (see the
definition of PointerUnion::dyn_cast), but this patch uses cast
because we expect isa<InputSection *>(reloc.referent) to be true.
Adding the `safe_thunks` option in `Options.td` as it was missing there
- mentioned by @Colibrow in
https://github.com/llvm/llvm-project/pull/106573
Also documenting what the various options mean.
Help now looks like this:
```
..........
--error-limit=<value> Maximum number of errors to print before exiting (default: 20)
--help-hidden Display help for hidden options
--icf=[none,safe,safe_thunks,all]
Set level for identical code folding (default: none). Possible values:
none - Disable ICF
safe - Only folds non-address significant functions (as described by `__addrsig` section)
safe_thunks - Like safe, but replaces address-significant functions with thunks
all - Fold all identical functions
--ignore-auto-link-option=<value>
Ignore a single auto-linked library or framework. Useful to ignore invalid options that ld64 ignores
--irpgo-profile-sort=<profile>
Deprecated. Please use --irpgo-profile and --bp-startup-sort=function
..........
```
The older xxHash (XXH64) is inferior to xxh3 and therefore discouraged;
we try not to use it in lld.
Switch to read32le for content hash and xxh3/stable_hash_combine for
relocation hash. Remove the intermediate std::string for relocation
hash.
Change the tail hashing scheme to consider individual bytes instead.
This helps group 0102 and 0201 together. The benefit is negligible,
though.
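A hedged sketch of the hashing style described above; the real BPSectionOrderer hashes sliding windows of the section data and more relocation fields, and `stable_hash_combine` is itself xxh3-based:
```
// Illustrative only; not the actual BPSectionOrderer code.
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StableHashing.h"
#include "llvm/Support/Endian.h"

using namespace llvm;

// Content hashes: little-endian 32-bit reads over sliding windows, with the
// tail hashed byte by byte so that e.g. tails 01 02 and 02 01 still share
// hash values.
SmallVector<uint64_t> hashContent(ArrayRef<uint8_t> data) {
  SmallVector<uint64_t> hashes;
  for (size_t i = 0; i + 4 <= data.size(); ++i)
    hashes.push_back(support::endian::read32le(data.data() + i));
  for (size_t i = data.size() < 4 ? 0 : data.size() - 3; i < data.size(); ++i)
    hashes.push_back(data[i]);
  return hashes;
}

// Relocation hash: combine the fields directly (stable_hash_combine is
// xxh3-based) instead of formatting them into a temporary std::string.
uint64_t hashReloc(uint8_t type, uint64_t addend, uint64_t referentHash) {
  return stable_hash_combine({uint64_t(type), addend, referentHash});
}
```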
Pull Request: https://github.com/llvm/llvm-project/pull/121729
This commit improves the memory efficiency of the lld-macho linker by
optimizing how thunks are printed in the map file. Previously, merging
vectors of input sections required creating a temporary vector, which
increased memory usage and in some cases caused the linker to run out of
memory as reported in comments on
https://github.com/llvm/llvm-project/pull/120496. The new approach
interleaves the printing of two arrays of ConcatInputSection in sorted
order without allocating additional memory for a merged array.
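A simplified sketch of the interleaved traversal; the real code compares output addresses of `ConcatInputSection`s and prints map-file entries rather than calling a generic callback:
```
// Simplified; Section and print stand in for ConcatInputSection and the
// map-file writer.
#include <cstdint>
#include <vector>

struct Section { uint64_t addr; };

template <class Print>
void printInterleaved(const std::vector<Section *> &a,
                      const std::vector<Section *> &b, Print print) {
  // Both inputs are already sorted by address, so walk them with two
  // cursors instead of materializing a merged temporary vector.
  size_t i = 0, j = 0;
  while (i < a.size() && j < b.size())
    print(a[i]->addr <= b[j]->addr ? a[i++] : b[j++]);
  while (i < a.size())
    print(a[i++]);
  while (j < b.size())
    print(b[j++]);
}
```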
--order_file, call graph profile, and BalancedPartitioning currently
build the section order vector by decreasing priority (from SIZE_MAX to
0). However, it's conventional to use an increasing key (see
OutputSection::inputOrder).
Switch to increasing priorities, remove the global variable
highestAvailablePriority, and remove the highestAvailablePriority
parameter from BPSectionOrderer. Change size_t to int.
This improves consistency with the ELF and COFF ports. The ELF port
utilizes negative priorities for --symbol-ordering-file and call graph
profile, and non-negative priorities for --shuffle-sections (no Mach-O
counterpart yet).
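A minimal illustration of the new convention (names assumed):
```
// Names assumed for illustration.
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/DenseMap.h"

struct InputSection;

llvm::DenseMap<const InputSection *, int>
buildOrder(llvm::ArrayRef<const InputSection *> orderedSections) {
  llvm::DenseMap<const InputSection *, int> sectionPriorities;
  int priority = 0; // increasing key, matching OutputSection::inputOrder
  for (const InputSection *isec : orderedSections)
    sectionPriorities[isec] = priority++;
  return sectionPriorities;
}
```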
Pull Request: https://github.com/llvm/llvm-project/pull/121727
This patch improves the linker’s ability to estimate stub reachability
in the `TextOutputSection::estimateStubsInRangeVA` function. It does so
by including thunks that have already been placed ahead of the current
call site address when calculating the threshold for direct stub calls.
Before this fix, the estimation process overlooked existing forward
thunks. This could result in some thunks not being inserted where
needed. In rare situations, particularly with large and specially
arranged codebases, this might lead to branch instructions being out of
range, causing linking errors.
Although this patch successfully addresses the problem, it is not
feasible to create a test for this issue. The specific layout and order
of thunk creation required to reproduce the corner case are too complex,
making test creation impractical.
Example error messages the issue could generate:
```
ld64.lld: error: banana.o:(symbol OUTLINED_FUNCTION_24949_3875): relocation BRANCH26 is out of range: 134547892 is not in [-134217728, 134217727]; references objc_autoreleaseReturnValue
ld64.lld: error: main.o:(symbol _main+0xc): relocation BRANCH26 is out of range: 134544132 is not in [-134217728, 134217727]; references objc_release
```
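For context, a rough sketch of the reachability check behind these errors: an arm64 `BRANCH26` displacement must fit in a signed 28-bit byte offset, i.e. roughly ±128 MiB (the estimation logic itself is more involved than this):
```
// Illustrative range check only; names are assumed.
#include <cstdint>

constexpr int64_t branchRange = 134217728; // 2^27 bytes, i.e. 128 MiB

bool isDirectlyReachable(uint64_t branchVA, uint64_t targetVA) {
  int64_t delta = int64_t(targetVA) - int64_t(branchVA);
  // BRANCH26 encodes a signed 26-bit instruction offset, giving the
  // [-134217728, 134217727] byte range seen in the errors above.
  return delta >= -branchRange && delta < branchRange;
}
```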
This patch extends the MachO linker's map file generation to include
branch extension thunk symbols. Previously, thunks were omitted from the
map file, making it difficult to understand the final layout of the
binary, especially when debugging issues related to long branch thunks.
This change ensures thunks are included and correctly interleaved with
other symbols based on their address, providing an accurate
representation of the linked output.
The existing comparison does not insert symbols in the intended place.
Closes #120559.
---------
Co-authored-by: Bjorn Pettersson <bjorn.a.pettersson@ericsson.com>
Note that PointerUnion::{is,get} have been soft deprecated in
PointerUnion.h:
// FIXME: Replace the uses of is(), get() and dyn_cast() with
// isa<T>, cast<T> and the llvm::dyn_cast<T>
I'm not touching PointerUnion::dyn_cast for now because it's a bit
complicated; we could blindly migrate it to dyn_cast_if_present, but
we should probably use dyn_cast when the operand is known to be
non-null.
Apologies for the large change, I looked for ways to break this up and
all of the ones I saw added real complexity. This change focuses on the
option's prefixed names and the array of prefixes. These are present in
every option and the dominant source of dynamic relocations for PIE or
PIC users of LLVM and Clang tooling. In some cases, 100s or 1000s of
them for the Clang driver which has a huge number of options.
This PR addresses this by building a string table and a prefixes table
that can be referenced with indices rather than pointers that require
dynamic relocations. This removes almost 7k dynamic relocations from the
`clang` binary, roughly 8% of the remaining dynamic relocations outside
of vtables. For busy-boxing use cases where many different option tables
are linked into the same binary, the savings add up a bit more.
The string table is a straightforward mechanism, but the prefixes
required some subtlety. They are encoded in a Pascal-string fashion with
a size followed by a sequence of offsets. This works relatively well for
the small realistic prefixes arrays in use.
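A hedged sketch of the encoding idea; the actual tables are emitted by TableGen and the types in llvm/Option differ:
```
// Illustrative encoding only; the real llvm/Option types differ.
#include <string_view>

// All prefix spellings live in one string table and are referenced by
// offset, so the data needs no dynamic relocations.
constexpr char strTable[] = "\0-\0--\0/\0";

// Each prefixes entry is Pascal-string shaped: a count followed by that
// many offsets into the string table.
constexpr unsigned prefixesTable[] = {
    2, 1, 3, // entry 0: prefixes "-" and "--"
    1, 6,    // entry 1: prefix "/"
};

constexpr std::string_view getPrefix(unsigned offset) {
  return std::string_view(strTable + offset);
}
```
An option record can then store an index into the prefixes table instead of a pointer to an array of C strings.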
Lots of code has to change in order to land this, though: the option
library code itself has to be updated to use the string table and
prefixes table, and all users of the options library have to be updated
to instantiate the objects correctly.
Some follow-up patches are in the works to provide an abstraction for
this style of code, and to start using the same technique for some of
the other strings here now that the infrastructure is in place.
Add new CLI options for feature parity with ELF w.r.t pass plugins.
Most of the changes are ported directly from
0c86198b27.
With this change, it is now possible to load and run external pass
plugins during the LTO phase.
ld64.lld would previously allow you to link against dylibs linked with
`-allowable_client`, even if the client's name does not match any
allowed client.
This change fixes that. See #114146 for related discussion.
The test binary `liballowable_client.dylib` was created on macOS with:
echo | clang -xc - -dynamiclib -mmacosx-version-min=10.11 -arch x86_64
-Wl,-allowable_client,allowed -o lib/liballowable_client.dylib
Currently when `--icf=safe_thunks` is used, `STABS` entries cannot be
generated for ICF'ed functions. This is because if ICF converts a full
function into a thunk and then we generate a `STABS` entry for the
thunk, `dsymutil` will expect to find the entire function body at the
location of the thunk. Since only a thunk is actually present at the
location of the `STABS` entry, `dsymutil` would generate invalid debug
info in such scenarios.
With this change, if `--icf=safe_thunks` is used and `--keep-icf-stabs`
is also specified, STABS entries will be created for all functions, even
merged ones. However, the STABS entries will point at the actual (full)
function body while having the name of the thunk. This way we still get
program correctness as well as correct DWARF data. When doing this, the
debug data will be identical to the scenario where we're using
`--icf=all` and `--keep-icf-stabs`, but the actual program will also
contain thunks, which won't show up in the DWARF data.
For COFF and ELF, which are mostly free of global state, lld::errs()
and lld::outs() should not be used. This migration allows us to remove
lld::errs, which uses the global errorHandler().
This patch enhances the robustness of lld's Objective-C category
merging. Currently, the category merger assumes it can fully parse and
understand the format of all categories in the input, triggering an
assert if any invalid category data is encountered.
This will end up causing asserts in certain rare corner cases that are
difficult to reproduce in small test cases. The proposed changes modify
the behavior so that if invalid category data is detected, category
merging is skipped for that specific class and all other categories
sharing the same base class. This approach allows the linker to continue
processing other categories without failing entirely due to a single
problematic input.
We also add a LIT test in which we corrupt category data and check that
category merging for that class is skipped but the link still succeeds.
In `--icf=safe_thunks` mode, the linker differentiates `keepUnique`
functions by creating thunks during a post-processing step after
Identical Code Folding (ICF). While this ensures that `keepUnique`
functions themselves are not incorrectly merged, it overlooks functions
that reference these `keepUnique` symbols.
If two functions are identical except for references to different
`keepUnique` functions, the current ICF algorithm incorrectly considers
them identical because it doesn't account for the future differentiation
introduced by thunks. This leads to incorrect deduplication of functions
that should remain distinct.
To address this issue, we modify the ICF comparison to explicitly check
for references to `keepUnique` functions during deduplication. By doing
so, functions that reference different `keepUnique` symbols are
correctly identified as distinct, preventing erroneous merging and
ensuring the correctness of the linked output.
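A simplified sketch of the extra check; real lld compares relocations pairwise inside its ICF equality routines, and the names here are assumed:
```
// Names assumed; real lld folds this into its relocation comparison.
#include "llvm/ADT/ArrayRef.h"

struct Defined {
  bool keepUnique = false; // address-significant under safe_thunks
};

struct Reloc {
  Defined *referent = nullptr;
};

// Two otherwise-identical functions must not fold if they reference
// different keepUnique symbols: those references will later be rewritten
// to point at distinct thunks.
bool relocsFoldable(llvm::ArrayRef<Reloc> a, llvm::ArrayRef<Reloc> b) {
  if (a.size() != b.size())
    return false;
  for (size_t i = 0; i < a.size(); ++i) {
    const Defined *da = a[i].referent, *db = b[i].referent;
    if ((da && da->keepUnique) || (db && db->keepUnique))
      if (da != db)
        return false;
  }
  return true;
}
```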
We've noticed that for large builds executing thin-link can take on the
order of 10s of minutes. We are only using a single thread to write the
sharded indices and import files for each input bitcode file. While we
need to ensure the index file produced lists modules in a deterministic
order, that doesn't prevent us from executing the rest of the work in
parallel.
In this change we use a thread pool to execute as much of the backend's
work as possible in parallel. In local testing on a machine with 80
cores, this change makes a thin-link for ~100,000 input files run in ~2
minutes. Without this change it takes upwards of 10 minutes.
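A hedged sketch of the approach, using `llvm::DefaultThreadPool`; module handling and the deterministic index ordering are simplified:
```
// Sketch only; the real change lives in the ThinLTO index-writing backend.
#include "llvm/Support/ThreadPool.h"
#include <string>
#include <vector>

void emitThinLinkOutputs(const std::vector<std::string> &modules) {
  // The module order recorded in the index stays deterministic because it
  // is derived from the input order; only the per-module work (writing the
  // sharded index and import file) runs in parallel.
  llvm::DefaultThreadPool pool;
  for (const std::string &mod : modules)
    pool.async([&mod] {
      // ... write the sharded index and import file for `mod` ...
      (void)mod;
    });
  pool.wait();
}
```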
---------
Co-authored-by: Nuri Amari <nuriamari@fb.com>
There is a bug in the current implementation of `--icf=safe_thunks`
where a STABS entry is emitted for generated thunks. This is
problematic because it produces invalid DWARF: dsymutil will think the
entire function body is at the thunk location, when in actuality only a
single branch instruction is present there. This ends up causing
overlapping DWARF entries.
To fix this we never generate STABS entries for such thunks.
The existing `--icf=safe_thunks` test is updated to also generate debug
info and we add a check that no corrupt DWARF is generated.
As a future TODO we need to make `--keep-icf-stabs` compatible with
`--icf=safe_thunks`.
* Don't call raw_string_ostream::flush(), which is essentially a no-op.
* Strip calls to raw_string_ostream::str(), to avoid an excess layer of indirection.
Under the Microsoft ABI, only those bit fields can be merged whose
underlying types have the same size.
d175616 (`[lld-macho][arm64] Enhance safe ICF with thunk-based
deduplication`) added an enum field (`identicalCodeFoldingKind`) next to
booleans in the `Defined` class, which increased the size under the MS
ABI. On MinGW targets, this triggered the `static_assert` which checks
the size of `Defined` (for MSVC targets, the check is disabled due to
another problem). Let's store it as a `uint8_t` to allow merging to take
place.
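An illustrative reduction of the layout issue (not the actual `Defined` declaration):
```
// Not the actual Defined declaration; this just shows the packing rule.
#include <cstdint>

enum ICFKind { None, Body, Thunk }; // underlying type defaults to int

struct Grows {
  bool a : 1;
  bool b : 1;
  ICFKind kind : 2; // MS ABI: int-sized, cannot share storage with the bools
};

struct Packs {
  bool a : 1;
  bool b : 1;
  uint8_t kind : 2; // same size as bool, so all three share one byte
};

static_assert(sizeof(Packs) <= sizeof(Grows),
              "storing the enum as uint8_t never increases the size");
```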
Fixes #107511
Currently, our `safe` ICF mode only merges non-address-significant code,
leaving duplicate address-significant functions in the output. This
patch introduces `safe_thunks` ICF mode, which keeps a single master
copy of each function and replaces address-significant duplicates with
thunks that branch to the master copy.
Currently `--icf=safe_thunks` is only supported for `arm64`
architectures.
**Perf stats for a large binary:**
| ICF Option          | Total Size | __text Size | __unwind_info Size | Total size reduction |
|---------------------|------------|-------------|--------------------|----------------------|
| `--icf=none`        | 91.738 MB  | 55.220 MB   | 1.424 MB           | 0%                   |
| `--icf=safe`        | 85.042 MB  | 49.572 MB   | 1.168 MB           | 7.30%                |
| `--icf=safe_thunks` | 84.650 MB  | 49.219 MB   | 1.143 MB           | 7.72%                |
| `--icf=all`         | 82.060 MB  | 48.726 MB   | 1.111 MB           | 10.55%               |
So overall we can expect a `~0.45%` binary size reduction for a typical
large binary compared to the `--icf=safe` option.
**Runtime:**
Linking the above binary took ~10 seconds. Comparing the link
performance of --icf=safe_thunks vs --icf=safe, a ~2% slowdown was
observed.
Refactor some code in `BPSectionOrderer.cpp` in preparation for
https://github.com/llvm/llvm-project/pull/107348.
* Rename `constructNodesForCompression()` -> `getUnsForCompression()`
and return a `SmallVector` directly rather than populating a vector
alias
* Pass `duplicateSectionIdxs` as a pointer to make it possible to skip
finding (nearly) duplicate sections
* Combine `duplicate{Function,Data}SectionIdxs` into one variable
* Compute all `BPFunctionNode` vectors at the end (like
`nodesForStartup`)
There should be no functional change.
...including for catalyst.
The use case for this is to put certain security-critical variables into
a special segment/section that is mapped as read-only most of the time,
and that temporarily gets remapped as writable when these variables are
written to by the program. This protects against them being written to
by heap spraying attacks. This special section should be mapped as
read-only at program start, so using
`-segprot MY_PROTECTED_MEMORY_THINGER rw r`
to mark that segment as rw maxprot and r initprot is exactly what we
want.
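A hedged sketch of the intended usage pattern; the segment/section and variable names are made up (and the segment name shortened to fit Mach-O's 16-character limit), and error handling is omitted:
```
// Segment/section and variable names are made up; error handling omitted.
// Linked with: -segprot MY_PROTECTED rw r
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>

__attribute__((section("MY_PROTECTED,__my_vars")))
uint64_t protectedValue = 0;

void writeProtectedValue(uint64_t v) {
  // The segment is r (initprot) at startup; rw maxprot lets us temporarily
  // remap it writable, update the variable, and then re-seal it.
  uintptr_t page = reinterpret_cast<uintptr_t>(&protectedValue) &
                   ~uintptr_t(getpagesize() - 1);
  mprotect(reinterpret_cast<void *>(page), getpagesize(),
           PROT_READ | PROT_WRITE);
  protectedValue = v;
  mprotect(reinterpret_cast<void *>(page), getpagesize(), PROT_READ);
}
```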
lld has so far rejected mismatching initprot and maxprot.
ld64 doesn't reject this, but silently writes initprot into both fields
(!) It looks like this might not be fully intentional, see
https://crbug.com/41495919#comment5 and
http://crbug.com/41495919#comment8.
In any case, when postprocessing ld64's output to have different values
for initprot and maxprot, the dynamic loader seems to do the right thing
(see also the previous two links).
The same technique also works on Windows, with both link.exe and
lld-link.exe, using `/SECTION:myprotsect,R`.
So, since this is useful, allow it when targeting macOS, and make it do
what you'd expect.
Since loader support for this on iOS is less clear, keep disallowing it
there for now.
See the PR for the program I used to check that this seems to work. (I
only checked on arm64 macOS 14.5 so far; will run this on many more
systems on bots once this is merged and rolled in.)
The only instance where we weren't already passing a `StringRef` with a
known length to `Symbol`'s constructor is where the argument is a string
literal. Even in that case, lazy `strlen` calls don't make sense, as the
compiler can constant-evaluate the `StringRef(const char*)` constructor.
For symbols that go into the symbol table we need the length when
calculating the hash anyway. We could get away with not calling
`getName()` for local symbols, but the total contribution of `strlen` to
the run time is already below 1%, so that would just complicate the code
for a negligible benefit.
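A tiny illustration of the point, with a simplified `Symbol` (the real class stores the name differently):
```
// Simplified; the real Symbol stores its name differently.
#include "llvm/ADT/StringRef.h"

struct Symbol {
  explicit Symbol(llvm::StringRef name) : name(name) {}
  llvm::StringRef name; // length captured eagerly; reused when hashing
};

// Passing a literal costs no runtime strlen: the compiler can
// constant-evaluate the length in StringRef's const char * constructor.
Symbol makeSyntheticSymbol() { return Symbol("<synthetic>"); }
```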