This change allows to trim the profile if it's considered to be cold for baseline AutoFDO. We reuse the cold threshold from `ProfileSummaryBuilder::getColdCountThreshold(..)` which can be set by percent(--profile-summary-cutoff-cold) or by value(--profile-summary-cold-count).
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D113785
Due to the debug info merging, there may have some contexts with zero probe id, we should truncate the context to avoid misleading pre-inliner.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D114284
In order to support generating profile with FS discriminator, three kind of changes are done in llvm-profgen:
1) Dissassemble .rodata section to check if FS discriminator var ('"__llvm_fs_discriminator__"') exists and set the corresponding flag in the binary.
2) Change the discriminator decoding in `getBaseDiscriminator` and `getDuplicationFactor`.
3) set true for `FunctionSamples::ProfileIsFS` to enable FS functionality in ProfileData.
Reviewed By: xur, hoy, wenlei
Differential Revision: https://reviews.llvm.org/D113296
AutoFDO performance is sensitive to profile density, i.e., the amount of samples in the profile relative to the program size, because profiles with insufficient samples could be inaccurate due to statistical noise and thus hurt AutoFDO performance. A previous investigation showed that AutoFDO performed better on MySQL with increased amount of samples. Therefore, we implement a profile-density computation feature to give hints about profile density to users and the compiler.
We define the density of a profile Prof as follows:
- For each function A in the profile, density(A) = total_samples(A) / sizeof(A).
- density(Prof) = min(density(A)) for all functions A that are warm (defined below).
A function is considered warm if its total-samples is within top N percent of the profile. For implementation, we reuse the `ProfileSummaryBuilder::getHotCountThreshold(..)` as threshold which can be set by percent(`--profile-summary-cutoff-hot`) or by value(`--profile-summary-hot-count`).
We also introduce `--hot-function-density-threshold` to set hot function density threshold and will give suggestion if profile density is below it which implies we should increase samples.
This also applies for CS profile with all profiles merged into base.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D113781
Previously we assume there're some non-executing sections at the bottom of the text section so that we won't hit the array's bound. But on BOLTed binary, it turned out .bolt section is at the bottom of text section which can be profiled, then it crash llvm-profgen. This change try to fix it.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D113238
Allow filling zero count for all the function ranges even there is no samples hitting that function. Add a switch for this.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D112858
Like probe-based profile, the total samples is the sum of all its body samples. This patch fix it by a post-processing update for the line-number based profile. Tested it on our internal services, results showed no performance change.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D112672
Previous implementation of populating profile symbol list is wrong, it only included the profiled symbols. Actually it should use all symbols, here this switches to use the symbols from debug info. Also turned the flag off by default.
Reviewed By: wenlei, hoy
Differential Revision: https://reviews.llvm.org/D111824
It happened a bug that some callsite name in the profile is not a real function, it turned out that there're some non-function symbol from the ELF text section, e.g. the global accessible branch label and also recalled that we can have one function being split into multiple ranges. We shouldn't count samples for those are not the entry of the real function.
So this change tried to fix this issue by switching to use the name or ranges from DWARF-based debug info, the range of which assure it's the real function start. For the split functions, we assume that the real entry function's DWARF name should always match the symbol table name.
The switching is also consistent with the body samples' symbol which is from DWARF.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D112282
Adding support to the CS preinliner to trim cold base profiles. This makes trimming consistent with the inline decision made by the preinliner. Also disable the existing profile merger when preinliner is on unless explicitly specified.
Reviewed By: wenlei, wlei
Differential Revision: https://reviews.llvm.org/D112489
We incorrectly use duplication factor for total samples even though we already accumulate samples instead of taking MAX. It causes profile to have bloated total samples for functions with loop unrolled or vectorized. The change fix the issue for total sample, head sample and call target samples.
Differential Revision: https://reviews.llvm.org/D112042
For some transformations like hot-cold split or coro split, it can outline its part of function ranges. Since sample loader is the early stage of backend and no split happens at that time, compiler can't recognize those function, so in llvm-profgen we should attribute the sample to the original function. This is already done for the body range samples since we use the symbols from dwarf which is created before the split.
But for branch samples, the call from master function to its outlined function is actually not a call to the original function, we shouldn't add head/callsie samples for it. So instead of dwarf symbol, we use the symbols from symbol table and ignore those functions with special suffixes(like `.cold` ,`.resume`) for accumulating the callsite/head samples.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D110864
This change adds duplication factor multiplier while accumulating body samples for line-number based profile. The body sample count will be `duplication-factor * count`. Base discriminator and duplication factor is decoded from the raw discriminator, this requires some refactor works.
Differential Revision: https://reviews.llvm.org/D109934
Similar to https://reviews.llvm.org/D110465, we can compute function size on-demand for the functions that's hit by samples.
Here we leverage the raw range samples' address to compute a set of sample hit function. Then `BinarySizeContextTracker` just works on those function range for the size.
Reviewed By: hoy
Differential Revision: https://reviews.llvm.org/D110466
In order to be consistent with compiler that interprets zero count as unexecuted(cold), this change reports zero-value count for unexecuted part of function code. For the implementation, it leverages the range counter, initializes all the executed function range with the zero-value. After all ranges are merged and converted into disjoint ranges, the remaining zero count will indicates the unexecuted(cold) part of the function.
This change also extends the current `findDisjointRanges` method which now can support adding zero-value range.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D109713
This patch introduces non-CS AutoFDO profile generation into LLVM. The profile is supposed to be well consumed by compiler using `-fprofile-sample-use=[profile]`.
After range and branch counters are extracted from the LBR sample, here we go through each addresses for symbolization, create FunctionSamples and populate its sub fields like TotalSamples, BodySamples and HeadSamples etc. For inlined code, as we need to map back to original code, so we always add body samples to the leaf frame's function sample.
Reviewed By: wenlei, hoy
Differential Revision: https://reviews.llvm.org/D109551
It seems we missed one spot to persist `SampleContextFrameVector` into the global table (CSProfileGenerator::populateFunctionBoundarySamples:340) which causes a crash.
This change tried to fix it in a centralized way i. e. where we generate the `FunctionSamples`.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D110275
Without preinliner, we need to tune down the cold count cutoff to merge/trim more context to limit profile size for large components. However it doesn't make sense for cold threshold to be higher than hot threshold, so we now change to use hot threshold as merging/trimming cut off instead.
Differential Revision: https://reviews.llvm.org/D110212
Perf script can sometimes give disordered LBR samples like below.
```
b022500
32de0044
3386e1d1
7f118e05720c
7f118df2d81f
0x2a0b9622/0x2a0b9610/P/-/-/1 0x2a0b79ff/0x2a0b9618/P/-/-/2 0x2a0b7a4a/0x2a0b79e8/P/-/-/1 0x2a0b7a33/0x2a0b7a46/P/-/-/1 0x2a0b7a42/0x2a0b7a23/P/-/-/1 0x2a0b7a21/0x2a0b7a37/P/-/-/2 0x2a0b79e6/0x2a0b7a07/P/-/-/1 0x2a0b79d4/0x2a0b79dc/P/-/-/2 0x2a0b7a03/0x2a0b79aa/P/-/-/1 0x2a0b79a8/0x2a0b7a00/P/-/-/234 0x2a0b9613/0x2a0b7930/P/-/-/1 0x2a0b9622/0x2a0b9610/P/-/-/1 0x2a0b79ff/0x2a0b9618/P/-/-/2 0x2a0b7a4a/0x2aWarning:
Processed 10263226 events and lost 1 chunks!
```
Note that the last LBR record `0x2a0b7a4a/0x2aWarning:` . Currently llvm-profgen does not detect that and as a result an uninitialized branch target value will be used. The uninitialized value can cause creepy instruction ranges created which which in turn will result in a completely wrong profile. An example is like
```
.... @ _ZN5folly13loadUnalignedIsEET_PKv]:18446744073709551615:18446744073709551615
1: 18446744073709551615
!CFGChecksum: 4294967295
!Attributes: 0
```
Reviewed By: wenlei, wlei
Differential Revision: https://reviews.llvm.org/D109637
We merge cold context by default to save profile size. However trimming cold context after merging doesn't save size much, so default to off to reflect how it's commonly used.
Differential Revision: https://reviews.llvm.org/D109166
Adding the compiler support of MD5 CS profile based on pervious context split work D107299. A MD5 CS profile is about 40% smaller than the string-based extbinary profile. As a result, the compilation is 15% faster.
There are a few conversion from real names to md5 names that have been made on the sample loader and context tracker side to get it work.
Reviewed By: wenlei, wmi
Differential Revision: https://reviews.llvm.org/D108342
This change aims at supporting LBR only sample perf script which is used for regular(Non-CS) profile generation. A LBR perf script includes a batch of LBR sample which starts with a frame pointer and a group of 32 LBR entries is followed. The FROM/TO LBR pair and the range between two consecutive entries (the former entry's TO and the latter entry's FROM) will be used to infer function profile info.
An example of LBR perf script(created by `perf script -F ip,brstack -i perf.data`)
```
40062f 0x40062f/0x4005b0/P/-/-/9 0x400645/0x4005ff/P/-/-/1 0x400637/0x400645/P/-/-/1 ...
4005d7 0x4005d7/0x4005e5/P/-/-/8 0x40062f/0x4005b0/P/-/-/6 0x400645/0x4005ff/P/-/-/1 ...
...
```
For implementation:
- Extended a new child class `LBRPerfReader` for the sample parsing, reused all the functionalities in `extractLBRStack` except for an extension to parsing leading instruction pointer.
- `HybridSample` is reused(just leave the call stack empty) and the parsed samples is still aggregated in `AggregatedSamples`. After that, range samples, branch sample, address samples are computed and recorded.
- Reused `ContextSampleCounterMap` to store the raw profile, since it's no need to aggregation by context, here it just registered one sample counter with a fake context key.
- Unified to use `show-raw-profile` instead of `show-unwinder-output` to dump the intermediate raw profile, see the comments of the format of the raw profile. For CS profile, it remains to output the unwinder output.
Profile generation part will come soon.
Differential Revision: https://reviews.llvm.org/D108153
Currently context strings contain a lot of duplicated function names and that significantly increase the profile size. This change split the context into a series of {name, offset, discriminator} tuples so function names used in the context can be replaced by the index into the name table and that significantly reduce the size consumed by context.
A follow-up improvement made in the compiler and profiling tools is to avoid reconstructing full context strings which is time- and memory- consuming. Instead a context vector of `StringRef` is adopted to represent the full context in all scenarios. As a result, the previous prevalent profile map which was implemented as a `StringRef` is now engineered as an unordered map keyed by `SampleContext`. `SampleContext` is reshaped to using an `ArrayRef` to represent a full context for CS profile. For non-CS profile, it falls back to use `StringRef` to represent a contextless function name. Both the `ArrayRef` and `StringRef` objects are underpinned by real array and string objects that are stored in producer buffers. For compiler, they are maintained by the sample reader. For llvm-profgen, they are maintained in `ProfiledBinary` and `ProfileGenerator`. Full context strings can be generated only in those cases of debugging and printing.
When it comes to profile format, nothing has changed to the text format, though internally CS context is implemented as a vector. Extbinary format is only changed for CS profile, with an additional `SecCSNameTable` section which stores all full contexts logically in the form of `vector<int>`, which each element as an offset points to `SecNameTable`. All occurrences of contexts elsewhere are redirected to using the offset of `SecCSNameTable`.
Testing
This is no-diff change in terms of code quality and profile content (for text profile).
For our internal large service (aka ads), the profile generation is cut to half, with a 20x smaller string-based extbinary format generated.
The compile time of ads is dropped by 25%.
Differential Revision: https://reviews.llvm.org/D107299
The change adds a switch to allow sample loader to use global pre-inliner's decision instead. The pre-inliner in llvm-profgen makes inline decision globally based on whole program profile and function byte size as cost proxy.
Since pre-inliner also adjusts/merges context profile based on its inline decision, honoring its inline decision in sample loader would lead to better post-inline profile quality especially for thinlto where cross module profile merging isn't possible without pre-inliner.
Minor fix in profile reader is also included. When pre-inliner is use, we now also turn off the default merging and trimming logic unless it's explicitly asked.
Differential Revision: https://reviews.llvm.org/D108677
This change enables llvm-profgen to use accurate context-sensitive post-optimization function byte size as a cost proxy to drive global preinline decisions.
To do this, BinarySizeContextTracker is introduced to track function byte size under different inline context during disassembling. In preinliner, we can not query context byte size under switch `context-cost-for-preinliner`. The tracker uses a reverse trie to keep size of functions under different context (callee as parent, caller as child), and it can give best/longest possible matching context size for given input context.
The new size cost is off by default. There're a few TODOs that needs to addressed: 1) avoid dangling string from `Offset2LocStackMap`, which will be addressed in split context work; 2) using inlinee's entry probe to make sure we have correct zero size for inlinee that's completely optimized away after inlining. Some tuning is also needed.
Differential Revision: https://reviews.llvm.org/D108180
As we decided to support only one binary each time, this patch cleans up the related code dealing with multiple binaries. We can use `llvm-profdata` to merge profile from multiple binaries.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D108002
Currently we use a centralized string map(StringMap<FunctionSamples> ProfileMap) to store the profile while populating the sample, which might cause the memory usage bottleneck. I saw in an extreme case, there are thousands of samples whose context stack depth is >= 100. The memory consumption can be greater than 100GB.
As here the context is used for inlining, we can assume we won't have so many of inlinees keeping inlined at the same root function, so this change tried to cap the context stack and merge the samples for peak memory reduction and this is done after recursion compression.
The default value is -1 meaning no depth limit, in the future we can tune to a smaller one.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D107800
One performance issue happened in profile generation and it turned out the line 525 loop is the bottleneck.
Moving the code outside of loop scope can fix this issue. The run time is improved from 30+mins to ~30s.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D107529
Migrate pseudo probe decoding logic in llvm-profgen to MC, so other LLVM-base program could reuse existing codes. Redesign object layout of encoded and decoded pseudo probes.
Reviewed By: hoy
Differential Revision: https://reviews.llvm.org/D106861
As a follow-up to https://reviews.llvm.org/D104129, I'm cleaning up the danling probe related code in both the compiler and llvm-profgen.
I'm seeing a 5% size win for the pseudo_probe section for SPEC2017 and 10% for Ciner. Certain benchmark such as 602.gcc has a 20% size win. No obvious difference seen on build time for SPEC2017 and Cinder.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D104477
We were using 0 as an indicator of invalid offset when computing disjoint ranges. In reality, 0 can be an valid code offset which stands for the first function in .text section. I'm using UINT64_MAX as an invalid code offset instead.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D104497
We were using a `StringMap` object to store all profiles to be emitted. The object is basically an unordered hash table, therefore updating it in the process of trasvering it may cause issue since the underlying bucket array could change.
I'm also moving the `csspgo-preinliner` switch around so that no context tri will be constructed (by the constructor of `CSPreInliner`) when the switch is off.
Reviewed By: wenlei
Differential Revision: https://reviews.llvm.org/D104267
Previously dangling samples were represented by INT64_MAX in sample profile while probes never executed were not reported. This was based on an observation that dangling probes were only at a smaller portion than zero-count probes. However, with compiler optimizations, dangling probes end up becoming at large portion of all probes in general and reporting them does not make sense from profile size point of view. This change flips sample reporting by reporting zero-count probes instead. This enabled dangling probe to be represented by none (missing entry in profile). This has a couple benefits:
1. Reducing sample profile size in optimize mode, even when the number of non-executed probes outperform the number of dangling probes, since INT64_MAX takes more space over 0 to encode.
2. Binary size savings. No need to encode dangling probe anymore, since missing probes are treated as dangling in the profile reader.
3. Reducing compiler work to track dangling probes. However, for probes that are real dead and removed, we still need the compiler to identify them so that they can be reported as zero-count, instead of mistreated as dangling probes.
4. Improving counts quality by respecting the counts already collected on the non-dangling copy of a probe. A probe, when duplicated, gets two copies at runtime. If one of them is dangling while the other is not, merging the two probes at profile generation time will cause the real samples collected on the non-dangling one to be discarded. Not reporting the dangling counterpart will keep the real samples.
5. Better readability.
6. Be consistent with non-CS dwarf line number based profile. Zero counts are trusted by the compiler counts inferencer while missing counts will be inferred by the compiler.
Note that the current patch does include any work for #3. There will be follow-up changes.
For #1, I've seen for a large Facebook service, the text profile is reduced by 7%. For extbinary profile, the size of LBRProfileSection is reduced by 35%.
For #4, I have seen general counts quality for SPEC2017 is improved by 10%.
Reviewed By: wenlei, wlei, wmi
Differential Revision: https://reviews.llvm.org/D104129
This change provides the option to merge and aggregate cold context by the last k frames instead of context-less name. By default K = 1 means the context-less one.
This is for better perf tuning. The more selective merging and trimming will rely on llvm-profgen's preinliner.
Reviewed By: wenlei, hoy
Differential Revision: https://reviews.llvm.org/D104131
Make extended binary the default output format for CSSPGO. This avoids having to pass flag every time when generating profile. It also matches llvm-profdata where binary profile is the default (should we switch to extbinary as default for llvm-profdata?).
We plan to compress name table for context profile, which depends on the built-in compression of extbinary.
Differential Revision: https://reviews.llvm.org/D103650
llvm-profgen uses profile summary based cold threshold to merge and trim cold context profile. This is to strike a good balance between profile size and performance.
We've been using 99.9% as the cutoff to save profile size without affecting performance. This change switch to use 99.9% instead of 99.9999% as default cold threshold cutoff for llvm-profgen.
Redundant switch csprof-cold-thres is also removed and tests cleaned up.
Differential Revision: https://reviews.llvm.org/D103071
The change adds support for triming and merging cold context when mergine CSSPGO profiles using llvm-profdata. This is similar to the context profile trimming in llvm-profgen, however the flexibility to trim cold context after profile is generated can be useful.
Differential Revision: https://reviews.llvm.org/D100528
Report dangling probes for frames that have real samples collected. Dangling probes are the probes associated to an empty block. When reported, sample count on a dangling probe will not be trusted by the compiler and we will rely on the counts inference algorithm to get the probe a reasonable count. This actually fixes a bug where previously only those dangling probes with samples collected were reported.
This patch also fixes two existing issues. Pseudo probes are stored in `Address2ProbesMap` and their pointers are used in `PseudoProbeInlineTree`. Previously `std::vector` was used to store probes and the pointers to probes may get obsolete as the vector grows. I'm changing `std::vector` to `std::list` instead.
The other issue is that all outlined functions shared the same inline frame previously due to the unchanged `Index` value as the dummy inlineSite identifier.
Good results seen for SPEC2017 in general regarding profile quality.
Reviewed By: wenlei, wlei
Differential Revision: https://reviews.llvm.org/D100235
This patch fixed the following issues along side with some refactoring:
1. Fix bugs where StringRef for context string out live the underlying std::string. We now keep string table in profile generator to hold std::strings. We also do the same for bracketed context strings in profile writer.
2. Make sure profile output strictly follow (total sample, name) order. Previously, there's inconsistency between ProfileMap's key and FunctionSamples's name, leading to inconsistent ordering. This is now fixed by introducing context profile canonicalization. Assertions are also added to make sure ProfileMap's key and FunctionSamples's name are always consistent.
3. Enhanced error handling for profile writing to make sure we bubble up errors properly for both llvm-profgen and llvm-profdata when string table is not populated correctly for extended binary profile.
4. Keep all internal context representation bracket free. This avoids creating new strings for context trimming, merging and preinline. getNameWithContext API is now simplied accordingly.
5. Factor out the code for context trimming and merging into SampleContextTrimmer in SampleProf.cpp. This enables llvm-profdata to use the trimmer when merging profiles. Changes in llvm-profgen will be in separate patch.
Differential Revision: https://reviews.llvm.org/D100090
This change sets up a framework in llvm-profgen to estimate inline decision and adjust context-sensitive profile based on that. We call it a global pre-inliner in llvm-profgen.
It will serve two purposes:
1) Since context profile for not inlined context will be merged into base profile, if we estimate a context will not be inlined, we can merge the context profile in the output to save profile size.
2) For thinLTO, when a context involving functions from different modules is not inined, we can't merge functions profiles across modules, leading to suboptimal post-inline count quality. By estimating some inline decisions, we would be able to adjust/merge context profiles beforehand as a mitigation.
Compiler inline heuristic uses inline cost which is not available in llvm-profgen. But since inline cost is closely related to size, we could get an estimate through function size from debug info. Because the size we have in llvm-profgen is the final size, it could also be more accurate than the inline cost estimation in the compiler.
This change only has the framework, with a few TODOs left for follow up patches for a complete implementation:
1) We need to retrieve size for funciton//inlinee from debug info for inlining estimation. Currently we use number of samples in a profile as place holder for size estimation.
2) Currently the thresholds are using the values used by sample loader inliner. But they need to be tuned since the size here is fully optimized machine code size, instead of inline cost based on not yet fully optimized IR.
Differential Revision: https://reviews.llvm.org/D99146
Switch to use cold threshold from profile summary for cold context merging and trimming, instead of relying on hard coded values. Minor refactoring included for switch names, etc.
Differential Revision: https://reviews.llvm.org/D98921
This changes adds attribute field for metadata of context profile. Currently we have an inline attribute that indicates whether the leaf frame corresponding to a context profile was inlined in previous build.
This will be used to help estimating inlining and be taken into account when trimming context. Changes for that in llvm-profgen will follow. It will also help tuning.
Differential Revision: https://reviews.llvm.org/D98823
For ThinLTO's prelink compilation, we need to put external inline candidates into an import list attached to function's entry count metadata. This enables ThinLink to treat such cross module callee as hot in summary index, and later helps postlink to import them for profile guided cross module inlining.
For AutoFDO, the import list is retrieved by traversing the nested inlinee functions. For CSSPGO, since profile is flatterned, a few things need to happen for it to work:
- When loading input profile in extended binary format, we need to load all child context profile whose parent is in current module, so context trie for current module includes potential cross module inlinee.
- In order to make the above happen, we need to know whether input profile is CSSPGO profile before start reading function profile, hence a flag for profile summary section is added.
- When searching for cross module inline candidate, we need to walk through the context trie instead of nested inlinee profile (callsite sample of AutoFDO profile).
- Now that we have more accurate counts with CSSPGO, we swtiched to use entry count instead of total count to decided if an external callee is potentially beneficial to inline. This make it consistent with how we determine whether call tagert is potential inline candidate.
Differential Revision: https://reviews.llvm.org/D98590
It appears some instructions doesn't have the debug location info and the symbolizer will return an empty call stack for them which will cause some crash later in profile unwinding. Actually we do not record the sample info for them, so this change just filter out those instruction.
As those instruction would appears at the begin and end of the instruction list, without them we need to add the boundary check for IP `advance` and `backward`.
Also for pseudo probe based profile, we actually don't need the symbolized location info, so here just change to use an empty stack for it. This could save half of the binary loading time.
Differential Revision: https://reviews.llvm.org/D96434