Adding a new flag(`--csprof-max-unsymbolized-context-depth`) to only
limit unsymbolized context depth. Currently,`--csprof-max-context-depth`
applies to both symbolized and unsymbolized profile context, there are
scenarios where `--csprof-max-context-depth` may not be flexible enough,
e.g. if we want to limit the context but still keep all the inlinings
from the leaf frame, we could set the value
csprof-max-unsymbolized-context-depth >= 1.
Replace the map from addresses to list of probes with a flat vector
containing probe references sorted by their addresses.
Reduces pseudo probe parsing time from 9.56s to 8.59s and peak RSS from
9.66 GiB to 9.08 GiB as part of perf2bolt processing a large binary.
Test Plan:
```
bin/llvm-lit -sv test/tools/llvm-profgen
```
Reviewers: maksfb, rafaelauler, dcci, ayermolo, wlei-llvm
Reviewed By: wlei-llvm
Pull Request: https://github.com/llvm/llvm-project/pull/102904
Use #102774 to allocate storage for decoded probes (`PseudoProbeVec`)
and function records (`InlineTreeVec`).
Leverage that to also shrink sizes of `MCDecodedPseudoProbe`:
- Drop Guid since it's accessible via `InlineTree`.
`MCDecodedPseudoProbeInlineTree`:
- Keep track of probes and inlinees using `ArrayRef`s now that probes
and function records belonging to the same function are allocated
contiguously.
This reduces peak RSS from 13.7 GiB to 9.7 GiB and pseudo probe parsing
time (as part of perf2bolt) from 15.3s to 9.6s for a large binary with
400MiB .pseudo_probe section containing 43M probes and 25M function
records.
Depends on:
#102774#102787#102788
Reviewers: maksfb, rafaelauler, dcci, ayermolo, wlei-llvm
Reviewed By: wlei-llvm
Pull Request: https://github.com/llvm/llvm-project/pull/102789
Without `--sample-period`, no assumptions are made about perf profile
sample frequencies. This is useful for comparing relative hotness of
different program locations within the same profile.
With `--sample-period`, LBR- and IP-based profile hit counts are
adjusted to estimate the absolute total event count for each program
location. This makes it reasonable to compare hit counts between
different profiles, e.g., between two LBR-based execution frequency
profiles with different sampling periods or between LBR-based execution
frequency profiles and IP-based branch mispredict profiles.
This functionality is in support of HWPGO[^1], which aims to enable
feedback from a wider range of hardware events.
[^1]:
https://llvm.org/devmtg/2024-04/slides/TechnicalTalks/Xiao-EnablingHW-BasedPGO.pdf
This change introduces two options which may be used to create profiles
of arbitrary PMU events.
1. `--leading-ip-only` provides a simple sample-IP-based profile mode.
This is not useful for building a profile of execution frequency, but it
is useful for building new types of profiles.
For example, to build a profile of unpredictable branches:
perf record -b -e branch-misses:upp -o perf.data ... llvm-profgen
--perfdata perf.data --leading-ip-only ...
2. `--perf-event=event` enables the creation of a profile concerned with
a specific event or set of events. The names given should match the
"event" field as emitted by perf-script(1).
This option has two spellings: `--perf-event` and `--perf-events`. The
plural spelling accepts a comma-separated list. The singular spelling
appends a single event name to the set of events which will be used.
This is meant to accommodate event names containing commas.
Combined, these options allow generating multiple kinds of profiles from
a single `perf record` collection. For example, to generate both
execution frequency and branch mispredict profiles:
perf record -c 1000003 -b -e
br_inst_retired.near_taken:upp,br_misp_retired.all_branches:upp ...
llvm-profgen --output execution.prof
--perf-event=br_inst_retired.near_taken:upp ...
llvm-profgen --leading-ip-only --output unpredictable.prof
--perf-event=br_misp_retired.all_branches:upp ...
These additions are in support of more general HWPGO[^1], allowing
feedback from a wider range of hardware events.
[^1]:
https://llvm.org/devmtg/2024-04/slides/TechnicalTalks/Xiao-EnablingHW-BasedPGO.pdf
---------
Co-authored-by: Tim Creech <tcreech@tcreech.com>
Also some control flow simplifications.
Notably, this doesn't address `sampleprof_error`. I *think* the style
there tries to match `std::error_category`.
Also left `hash_value` as-is, because it matches what we do in Hashing.h
Add the support to handle Linux kernel perf files. The functionality is
under option -kernel. Note that currently only main kernel (in vmlinux)
is handled: kernel modules are not handled.
---------
Co-authored-by: Han Shen <shenhan@google.com>
The profile density feature(the amount of samples in the profile
relative to the program size) is used to identify insufficient sample
issue and provide hints for user to increase sample count. A low-density
profile can be inaccurate due to statistical noise, which can hurt FDO
performance.
This change introduces two improvements to the current density work.
1. The density calculation/definition is changed. Previously, the
density of a profile was calculated as the minimum density for all warm
functions (a function was considered warm if its total samples were
within the top N percent of the profile). However, there is a problem
that a high total sample profile can have a very low density, which
makes the density value unstable.
- Instead, we want to find a density number such that if a function's
density is below this value, it is considered low-density function. We
consider the whole profile is bad if a group of low-density functions
have the sum of samples that exceeds N percent cut-off of the total
samples.
- In implementation, we sort the function profiles by density, iterate
them in descending order and keep accumulating the body samples until
the sum exceeds the (100% - N) percentage of the total_samples, the
profile-density is the last(minimum) function-density of processed
functions. We introduce the a flag(`--profile-density-threshold`) for
this percentage threshold.
2. The density is now calculated based on final(compiler used) profiles
instead of merged context-less profiles.
On Windows, perfscript generated by sep contains CR+LF at the end of
LBR records line. This '\r' will be treated as a LBR record when running
llvm-profgen on Linux and then generate warning.
The temporary perf script files converted from perf data will occupy
lots
of space for large project. This patch removes them when llvm-profgen
exits normally or receives signals.
Intel Vtune/SEP has supported collecting LBR on Windows and generating
perf-script file which is same format as Linux perf script. This patch
teaches llvm-profgen to disassemble COFF binary so that we can do
Sampling based PGO on Windows.
For the built-in local initialization function(`__cxx_global_var_init`,
`__tls_init` prefix), there could be multiple versions of the functions
in the final binary, e.g. `__cxx_global_var_init`, which is a wrapper of
global variable ctors, the compiler could assign suffixes like
`__cxx_global_var_init.N` for different ctors.
However, in the profile generation, we call `getCanonicalFnName` to
canonicalize the names which strip the suffixes. Therefore, samples from
different functions queries the same profile(only
`__cxx_global_var_init`) and the counts are merged. As the functions are
essentially different, entries of the merged profile are ambiguous. In
sample loading, for each version of this function, the IR from one
version would be attributed towards a merged entries, which is
inaccurate, especially for fuzzy profile matching, it gets multiple
callsites(from different function) but using to match one callsite,
which mislead the matching and report a lot of false positives.
Hence, we want to filter them out from the profile map during the
profile generation time. The profiles are all cold functions, it won't
have perf impact.
For the linux kernel, the loadable segments start at 0xffff... and thus
the 32 bit integer here was truncating all the meaningful bits. Grow it
to 64 bits.
This patch replaces uses of StringRef::{starts,ends}with with
StringRef::{starts,ends}_with for consistency with
std::{string,string_view}::{starts,ends}_with in C++20.
I'm planning to deprecate and eventually remove
StringRef::{starts,ends}with.
This is phase 2 of the MD5 refactoring on Sample Profile following
https://reviews.llvm.org/D147740
In previous implementation, when a MD5 Sample Profile is read, the
reader first converts the MD5 values to strings, and then create a
StringRef as if the numerical strings are regular function names, and
later on IPO transformation passes perform string comparison over these
numerical strings for profile matching. This is inefficient since it
causes many small heap allocations.
In this patch I created a class `ProfileFuncRef` that is similar to
`StringRef` but it can represent a hash value directly without any
conversion, and it will be more efficient (I will attach some benchmark
results later) when being used in associative containers.
ProfileFuncRef guarantees the same function name in string form or in
MD5 form has the same hash value, which also fix a few issue in IPO
passes where function matching/lookup only check for function name
string, while returns a no-match if the profile is MD5.
When testing on an internal large profile (> 1 GB, with more than 10
million functions), the full profile load time is reduced from 28 sec to
25 sec in average, and reading function offset table from 0.78s to 0.7s
This will make it possible to add visibility attributes to these
variables. This also fixes some type mismatches between the
declaration and the definition.
Reviewed By: bogner, huangjd
Differential Revision: https://reviews.llvm.org/D156599
Broken debug information can give empty names for an inlined frame, e.g,
```
0x1d605c68: ryKeyINS7_17SmartCounterTypesEEESt10shared_ptrINS7_15AsyncCacheValueIS9_EEESaIhESt6atomicEEE9fetch_subElSt12memory_order at Filename: edata.h
Function start filename: edata.h
Function start line: 266
Function start address: 0x1d605c68
Line: 267
Column: 0
(inlined by) at Filename: edata.h
Function start filename: edata.h
Function start line: 274
Function start address: 0x1d605c68
Line: 275
Column: 0
(inlined by) _EEEmmEv at Filename: arena.c
Function start filename: arena.c
Function start line: 1303
Line: 1308
Column: 0
```
This patch avoids creating a sample context with an empty function name
by stopping tracking at that frame. This prevents a hash failure that
leads to an ICE, where empty context serves at an empty key for the
underlying MapVector
7624de5bea/llvm/lib/ProfileData/SampleProfWriter.cpp (L261)
D152495 makes clang warn on unused variables that are declared in conditions like `if (int var = init) {}`
This patch is an NFC fix to suppress the new warning in llvm,clang,lld builds to pass CI in the above patch.
Differential Revision: https://reviews.llvm.org/D158016
This is phase 1 of multiple planned improvements on the sample profile loader. The major change is to use MD5 hash code ((instead of the function itself) as the key to look up the function offset table and the profiles, which significantly reduce the time it takes to construct the map.
The optimization is based on the fact that many practical sample profiles are using MD5 values for function names to reduce profile size, so we shouldn't need to convert the MD5 to a string and then to a SampleContext and use it as the map's key, because it's extremely slow.
Several changes to note:
(1) For non-CS SampleContext, if it is already MD5 string, the hash value will be its integral value, instead of hashing the MD5 again. In phase 2 this is going to be optimized further using a union to represent MD5 function (without converting it to string) and regular function names.
(2) The SampleProfileMap is a wrapper to *map<uint64_t, FunctionSamples>, while providing interface allowing using SampleContext as key, so that existing code still work. It will check for MD5 collision (unlikely but not too unlikely, since we only takes the lower 64 bits) and handle it to at least guarantee compilation correctness (conflicting old profile is dropped, instead of returning an old profile with inconsistent context). Other code should not try to use MD5 as key to access the map directly, because it will not be able to handle MD5 collision at all. (see exception at (5) )
(3) Any SampleProfileMap::emplace() followed by SampleContext assignment if newly inserted, should be replaced with SampleProfileMap::Create(), which does the same thing.
(4) Previously we ensure an invariant that in SampleProfileMap, the key is equal to the Context of the value, for profile map that is eventually being used for output (as in llvm-profdata/llvm-profgen). Since the key became MD5 hash, only the value keeps the context now, in several places where an intermediate SampleProfileMap is created, each new FunctionSample's context is set immediately after insertion, which is necessary to "remember" the context otherwise irretrievable.
(5) When reading a profile, we cache the MD5 values of all functions, because they are used at least twice (one to index into FuncOffsetTable, the other into SampleProfileMap, more if there are additional sections), in this case the SampleProfileMap is directly accessed with MD5 value so that we don't recalculate it each time (expensive)
Performance impact:
When reading a ~1GB extbinary profile (fixed length MD5, not compressed) with 10 million function names and 2.5 million top level functions (non CS functions, each function has varying nesting level from 0 to 20), this patch improves the function offset table loading time by 20%, and improves full profile read by 5%.
Reviewed By: davidxl, snehasish
Differential Revision: https://reviews.llvm.org/D147740
This is phase 1 of multiple planned improvements on the sample profile loader. The major change is to use MD5 hash code ((instead of the function itself) as the key to look up the function offset table and the profiles, which significantly reduce the time it takes to construct the map.
The optimization is based on the fact that many practical sample profiles are using MD5 values for function names to reduce profile size, so we shouldn't need to convert the MD5 to a string and then to a SampleContext and use it as the map's key, because it's extremely slow.
Several changes to note:
(1) For non-CS SampleContext, if it is already MD5 string, the hash value will be its integral value, instead of hashing the MD5 again. In phase 2 this is going to be optimized further using a union to represent MD5 function (without converting it to string) and regular function names.
(2) The SampleProfileMap is a wrapper to *map<uint64_t, FunctionSamples>, while providing interface allowing using SampleContext as key, so that existing code still work. It will check for MD5 collision (unlikely but not too unlikely, since we only takes the lower 64 bits) and handle it to at least guarantee compilation correctness (conflicting old profile is dropped, instead of returning an old profile with inconsistent context). Other code should not try to use MD5 as key to access the map directly, because it will not be able to handle MD5 collision at all. (see exception at (5) )
(3) Any SampleProfileMap::emplace() followed by SampleContext assignment if newly inserted, should be replaced with SampleProfileMap::Create(), which does the same thing.
(4) Previously we ensure an invariant that in SampleProfileMap, the key is equal to the Context of the value, for profile map that is eventually being used for output (as in llvm-profdata/llvm-profgen). Since the key became MD5 hash, only the value keeps the context now, in several places where an intermediate SampleProfileMap is created, each new FunctionSample's context is set immediately after insertion, which is necessary to "remember" the context otherwise irretrievable.
(5) When reading a profile, we cache the MD5 values of all functions, because they are used at least twice (one to index into FuncOffsetTable, the other into SampleProfileMap, more if there are additional sections), in this case the SampleProfileMap is directly accessed with MD5 value so that we don't recalculate it each time (expensive)
Performance impact:
When reading a ~1GB extbinary profile (fixed length MD5, not compressed) with 10 million function names and 2.5 million top level functions (non CS functions, each function has varying nesting level from 0 to 20), this patch improves the function offset table loading time by 20%, and improves full profile read by 5%.
Reviewed By: davidxl, snehasish
Differential Revision: https://reviews.llvm.org/D147740
Zero-sized functions should be cost-free in term of size budget, so they should be considered during inlining even if we run out of size budget.
This appears to give 0.5% win for one of our internal services.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D153820
The compiler has more insight and knowledge about functions based on their IR and attribures and should make a better inline decision than the offline preinliner does which is purely based on callsites hotness and code size. Therefore I'm making changes to favor previous compiler inline decision by bumping up the callsite allowance.
This should improve the performance by more than 1% according to testing on Meta services.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D153797
This is phase 1 of multiple planned improvements on the sample profile loader. The major change is to use MD5 hash code ((instead of the function itself) as the key to look up the function offset table and the profiles, which significantly reduce the time it takes to construct the map.
The optimization is based on the fact that many practical sample profiles are using MD5 values for function names to reduce profile size, so we shouldn't need to convert the MD5 to a string and then to a SampleContext and use it as the map's key, because it's extremely slow.
Several changes to note:
(1) For non-CS SampleContext, if it is already MD5 string, the hash value will be its integral value, instead of hashing the MD5 again. In phase 2 this is going to be optimized further using a union to represent MD5 function (without converting it to string) and regular function names.
(2) The SampleProfileMap is a wrapper to *map<uint64_t, FunctionSamples>, while providing interface allowing using SampleContext as key, so that existing code still work. It will check for MD5 collision (unlikely but not too unlikely, since we only takes the lower 64 bits) and handle it to at least guarantee compilation correctness (conflicting old profile is dropped, instead of returning an old profile with inconsistent context). Other code should not try to use MD5 as key to access the map directly, because it will not be able to handle MD5 collision at all. (see exception at (5) )
(3) Any SampleProfileMap::emplace() followed by SampleContext assignment if newly inserted, should be replaced with SampleProfileMap::Create(), which does the same thing.
(4) Previously we ensure an invariant that in SampleProfileMap, the key is equal to the Context of the value, for profile map that is eventually being used for output (as in llvm-profdata/llvm-profgen). Since the key became MD5 hash, only the value keeps the context now, in several places where an intermediate SampleProfileMap is created, each new FunctionSample's context is set immediately after insertion, which is necessary to "remember" the context otherwise irretrievable.
(5) When reading a profile, we cache the MD5 values of all functions, because they are used at least twice (one to index into FuncOffsetTable, the other into SampleProfileMap, more if there are additional sections), in this case the SampleProfileMap is directly accessed with MD5 value so that we don't recalculate it each time (expensive)
Performance impact:
When reading a ~1GB extbinary profile (fixed length MD5, not compressed) with 10 million function names and 2.5 million top level functions (non CS functions, each function has varying nesting level from 0 to 20), this patch improves the function offset table loading time by 20%, and improves full profile read by 5%.
Reviewed By: davidxl, snehasish
Differential Revision: https://reviews.llvm.org/D147740
This is phase 1 of multiple planned improvements on the sample profile loader. The major change is to use MD5 hash code ((instead of the function itself) as the key to look up the function offset table and the profiles, which significantly reduce the time it takes to construct the map.
The optimization is based on the fact that many practical sample profiles are using MD5 values for function names to reduce profile size, so we shouldn't need to convert the MD5 to a string and then to a SampleContext and use it as the map's key, because it's extremely slow.
Several changes to note:
(1) For non-CS SampleContext, if it is already MD5 string, the hash value will be its integral value, instead of hashing the MD5 again. In phase 2 this is going to be optimized further using a union to represent MD5 function (without converting it to string) and regular function names.
(2) The SampleProfileMap is a wrapper to *map<uint64_t, FunctionSamples>, while providing interface allowing using SampleContext as key, so that existing code still work. It will check for MD5 collision (unlikely but not too unlikely, since we only takes the lower 64 bits) and handle it to at least guarantee compilation correctness (conflicting old profile is dropped, instead of returning an old profile with inconsistent context). Other code should not try to use MD5 as key to access the map directly, because it will not be able to handle MD5 collision at all. (see exception at (5) )
(3) Any SampleProfileMap::emplace() followed by SampleContext assignment if newly inserted, should be replaced with SampleProfileMap::Create(), which does the same thing.
(4) Previously we ensure an invariant that in SampleProfileMap, the key is equal to the Context of the value, for profile map that is eventually being used for output (as in llvm-profdata/llvm-profgen). Since the key became MD5 hash, only the value keeps the context now, in several places where an intermediate SampleProfileMap is created, each new FunctionSample's context is set immediately after insertion, which is necessary to "remember" the context otherwise irretrievable.
(5) When reading a profile, we cache the MD5 values of all functions, because they are used at least twice (one to index into FuncOffsetTable, the other into SampleProfileMap, more if there are additional sections), in this case the SampleProfileMap is directly accessed with MD5 value so that we don't recalculate it each time (expensive)
Performance impact:
When reading a ~1GB extbinary profile (fixed length MD5, not compressed) with 10 million function names and 2.5 million top level functions (non CS functions, each function has varying nesting level from 0 to 20), this patch improves the function offset table loading time by 20%, and improves full profile read by 5%.
Reviewed By: davidxl, snehasish
Differential Revision: https://reviews.llvm.org/D147740
Llvm-profgen internally uses the llvm libraries and the MCDesc interface to do disassembling and symblization and it never checks against target-specific instruction operators. This makes it quite transparent to targets and a first attempt for an aarch64 binary just works. Therefore I'm removing the unnecessary triple check to unblock for new targets.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D153449
CPU profile indicated memcmp was hot due to the two rfind calls in
getCanonicalFnName. If UseSymbolTable is false, we can avoid the cost entirely.
For CSSPGO profiles I've measured ~5% speedup with this change.
Profile similarity before/after matches 100%.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D151441
This change enables generating pseudo-probe-based FS-AFDO profiles. The change is straightforward based-on previous change {D147651} by just injecting FS-discriminators into various profile generation spot.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D147957