[CSSPGO][llvm-profgen] Context-sensitive profile data generation
This stack of changes introduces the `llvm-profgen` utility, which generates a profile data file from given perf script data files for sample-based PGO. It is part of (but not limited to) the CSSPGO work. Specifically, to support context-sensitive profiles with or without pseudo probes, it implements a series of functionalities including perf trace parsing, instruction symbolization, LBR stack/call frame stack unwinding, pseudo probe decoding, etc. High throughput is achieved through multiple levels of sample aggregation, and a compatible profile format is generated in one stop at the end. Please refer to https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s for the CSSPGO RFC.
This change adds context-sensitive profile data generation to llvm-profgen. With simultaneous sampling of the LBR and the call stack, we can identify the leaf of an LBR sample together with its calling context from the stack sample. While deriving fall-through paths from LBR entries, we unwind the LBR by replaying all the calls and returns (including implicit calls/returns due to inlining) backwards on top of the sampled call stack. The state of the call stack as we unwind through the LBR then always represents the calling context of the current fall-through path.
We have two types of virtual unwinding: 1) LBR unwinding and 2) linear range unwinding.
Specifically, each LBR entry can be classified as a call, a return, or a regular branch. LBR unwinding replays the operation by pushing, popping, or switching the leaf frame of the call stack. Since the initial call stack is the most recently sampled one, the replay proceeds in anti-execution order; i.e., for the regular case, pop the call stack when the LBR entry is a call, and push a frame onto the call stack when it is a return. After each LBR entry is processed, we also need to align with the next entry by walking through the instructions from the previous entry's target to the current entry's source, which we call linear unwinding. As instructions in a linear range can come from different functions due to inlining, linear unwinding splits the range and records counters for each sub-range with the same inline context.
With each fall-through path from LBR unwinding, we aggregate each sample into counters keyed by the calling context and eventually generate a full context-sensitive profile (without relying on inlining) to drive the compiler's PGO/FDO.
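The replay described above can be sketched as a small stack machine. This is a minimal illustrative model under simplified assumptions; the type and function names (`BranchKind`, `LBREntry`, `unwind`) are invented for the example and are not the actual llvm-profgen `VirtualUnwinder` API:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Minimal model of the virtual unwinder's replay loop. Names here are
// invented for illustration, not the actual llvm-profgen classes.
enum class BranchKind { Call, Return, Regular };

struct LBREntry {
  BranchKind Kind;
  std::string SourceFunc; // function containing the branch source
  std::string TargetFunc; // function containing the branch target
};

// Replay LBR entries (ordered most-recent first) backwards on top of the
// sampled call stack. Anti-execution order: a call is undone by popping
// the frame it created; a return is undone by re-pushing the frame it left.
std::vector<std::string> unwind(std::vector<std::string> CallStack,
                                const std::vector<LBREntry> &LBR) {
  for (const LBREntry &E : LBR) {
    switch (E.Kind) {
    case BranchKind::Call:
      // Undo the call: the callee frame leaves the calling context.
      if (!CallStack.empty())
        CallStack.pop_back();
      break;
    case BranchKind::Return:
      // Undo the return: re-enter the callee we had returned from.
      CallStack.push_back(E.SourceFunc);
      break;
    case BranchKind::Regular:
      // Intra-function branch: the leaf frame is unchanged; only the
      // instruction range (handled by linear unwinding) moves.
      break;
    }
  }
  return CallStack;
}
```

At every replay step, the current `CallStack` is the calling context attributed to the fall-through range being recorded.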
A breakdown of noteworthy changes:
- Added a `HybridSample` class as the abstraction of a perf sample, including both the LBR stack and the call stack
- Extended `PerfReader` to auto-detect whether the input perf script output contains CS profile data before parsing; multiple `HybridSample`s are extracted
- Sped up processing by aggregating `HybridSample`s into `AggregatedSamples`
- Added a VirtualUnwinder that consumes aggregated `HybridSample`s and implements unwinding of calls, returns, and linear paths that contain implicit calls/returns from inlining. Range and branch counters are aggregated by calling context.
  Here the calling context is a string; each context frame is a pair of function name and callsite location info, so a whole context looks like `main:1 @ foo:2 @ bar`.
- Added a `ProfileGenerator` that accumulates counters by unfolding ranges or mapping branch targets, then generates context-sensitive function profiles including function body samples, inferred callee head samples, and callsite target samples, eventually recording them into a ProfileMap
- Leveraged the LLVM built-in writer (`SampleProfWriter`) to support different serialization formats in one stop
- Used `getCanonicalFnName` for callee names and names from the ELF section
- Added regression tests for both unwinding and profile generation
Test Plan:
ninja & ninja check-llvm
Reviewed By: hoy, wenlei, wmi
Differential Revision: https://reviews.llvm.org/D89723
2020-10-19 12:55:59 -07:00
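The `main:1 @ foo:2 @ bar` context layout mentioned in the breakdown can be illustrated with a small helper. `Frame` and `contextToString` are names invented for this sketch; the real tool builds these context keys internally:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper showing the context layout: every non-leaf frame is
// rendered as "func:callsite-location" and frames are joined with " @ ".
struct Frame {
  std::string Func;
  unsigned CallsiteLoc; // location of the call into the next frame
};

std::string contextToString(const std::vector<Frame> &Callers,
                            const std::string &LeafFunc) {
  std::string Out;
  for (const Frame &F : Callers)
    Out += F.Func + ":" + std::to_string(F.CallsiteLoc) + " @ ";
  return Out + LeafFunc;
}
```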
//===-- ProfileGenerator.cpp - Profile Generator ---------------*- C++ -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#include "ProfileGenerator.h"
#include "ErrorHandling.h"
#include "MissingFrameInferrer.h"
#include "PerfReader.h"
#include "ProfiledBinary.h"
#include "llvm/DebugInfo/Symbolize/SymbolizableModule.h"
#include "llvm/ProfileData/ProfileCommon.h"
#include <algorithm>
#include <float.h>
#include <unordered_set>
#include <utility>
cl::opt<std::string> OutputFilename("output", cl::value_desc("output"),
                                    cl::Required,
                                    cl::desc("Output profile file"));
static cl::alias OutputA("o", cl::desc("Alias for --output"),
                         cl::aliasopt(OutputFilename));
static cl::opt<SampleProfileFormat> OutputFormat(
    "format", cl::desc("Format of output profile"), cl::init(SPF_Ext_Binary),
    cl::values(
        clEnumValN(SPF_Binary, "binary", "Binary encoding (default)"),
        clEnumValN(SPF_Ext_Binary, "extbinary", "Extensible binary encoding"),
        clEnumValN(SPF_Text, "text", "Text encoding"),
        clEnumValN(SPF_GCC, "gcc",
                   "GCC encoding (only meaningful for -sample)")));

static cl::opt<bool> UseMD5(
    "use-md5", cl::Hidden,
    cl::desc("Use md5 to represent function names in the output profile (only "
             "meaningful for -extbinary)"));

static cl::opt<bool> PopulateProfileSymbolList(
    "populate-profile-symbol-list", cl::init(false), cl::Hidden,
    cl::desc("Populate profile symbol list (only meaningful for -extbinary)"));

static cl::opt<bool> FillZeroForAllFuncs(
    "fill-zero-for-all-funcs", cl::init(false), cl::Hidden,
    cl::desc("Attribute all functions' range with zero count "
             "even it's not hit by any samples."));
[CSSPGO][llvm-profgen] Compress recursive cycles in calling context
This change compresses the context string by removing cycles caused by recursive functions during CS profile generation. Removing recursion cycles is a way to normalize the calling context, which improves sample aggregation and also makes context promotion deterministic.
For the implementation, we recognize adjacent repeated frame sequences as cycles and deduplicate them through multiple rounds of iteration.
For example, consider an input context stack:
["a", "a", "b", "c", "a", "b", "c", "b", "c", "d"]
The first iteration removes all adjacent repeated frame sequences of size 1:
["a", "b", "c", "a", "b", "c", "b", "c", "d"]
The second iteration removes all adjacent repeated frame sequences of size 2:
["a", "b", "c", "a", "b", "c", "d"]
A further iteration removes the adjacent repeated sequence of size 3, so in the end we get the compressed output:
["a", "b", "c", "d"]
Compression is called in two places: once for the sample's context key right after unwinding, and once for the eventual context string id in the ProfileGenerator.
Added a switch `compress-recursion` to control the maximum size of deduplicated frame sequences; the default of -1 means no size limit.
Added unit tests and a regression test for this.
Differential Revision: https://reviews.llvm.org/D93556
2021-01-29 15:00:08 -08:00
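A minimal sketch of the deduplication described above, assuming a simple vector-of-strings context. `compressContext` is a made-up name; the real logic lives inside CSProfileGenerator and operates on frame objects rather than strings:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative sketch of adjacent-repeat compression. Round N removes
// adjacent repeated frame sequences of exactly N frames until none remain;
// MaxDedupSize < 0 means no limit on the sequence size.
std::vector<std::string> compressContext(std::vector<std::string> Ctx,
                                         int32_t MaxDedupSize = -1) {
  for (size_t Size = 1; 2 * Size <= Ctx.size() &&
                        (MaxDedupSize < 0 || Size <= (size_t)MaxDedupSize);
       ++Size) {
    bool Changed = true;
    while (Changed) {
      Changed = false;
      for (size_t I = 0; I + 2 * Size <= Ctx.size(); ++I) {
        // Two back-to-back copies of the same Size-frame sequence: drop one.
        if (std::equal(Ctx.begin() + I, Ctx.begin() + I + Size,
                       Ctx.begin() + I + Size)) {
          Ctx.erase(Ctx.begin() + I, Ctx.begin() + I + Size);
          Changed = true;
          break;
        }
      }
    }
  }
  return Ctx;
}
```

Running this on the example stack from the commit message reproduces the documented rounds and yields `["a", "b", "c", "d"]`.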

static cl::opt<int32_t, true> RecursionCompression(
    "compress-recursion",
    cl::desc("Compressing recursion by deduplicating adjacent frame "
             "sequences up to the specified size. -1 means no size limit."),
    cl::Hidden,
    cl::location(llvm::sampleprof::CSProfileGenerator::MaxCompressionSize));

static cl::opt<bool>
    TrimColdProfile("trim-cold-profile",
                    cl::desc("If the total count of the profile is smaller "
                             "than threshold, it will be trimmed."));

static cl::opt<bool> CSProfMergeColdContext(
    "csprof-merge-cold-context", cl::init(true),
    cl::desc("If the total count of context profile is smaller than "
             "the threshold, it will be merged into context-less base "
             "profile."));

static cl::opt<uint32_t> CSProfMaxColdContextDepth(
    "csprof-max-cold-context-depth", cl::init(1),
    cl::desc("Keep the last K contexts while merging cold profile. 1 means the "
             "context-less base profile"));

static cl::opt<int, true> CSProfMaxContextDepth(
    "csprof-max-context-depth",
    cl::desc("Keep the last K contexts while merging profile. -1 means no "
             "depth limit."),
    cl::location(llvm::sampleprof::CSProfileGenerator::MaxContextDepth));

static cl::opt<double> ProfileDensityThreshold(
    "profile-density-threshold", llvm::cl::init(50),
    llvm::cl::desc("If the profile density is below the given threshold, it "
                   "will be suggested to increase the sampling rate."),
    llvm::cl::Optional);

static cl::opt<bool> ShowDensity("show-density", llvm::cl::init(false),
                                 llvm::cl::desc("show profile density details"),
                                 llvm::cl::Optional);

static cl::opt<int> ProfileDensityCutOffHot(
    "profile-density-cutoff-hot", llvm::cl::init(990000),
    llvm::cl::desc("Total samples cutoff for functions used to calculate "
                   "profile density."));

static cl::opt<bool> UpdateTotalSamples(
    "update-total-samples", llvm::cl::init(false),
    llvm::cl::desc(
        "Update total samples by accumulating all its body samples."),
    llvm::cl::Optional);

static cl::opt<bool> GenCSNestedProfile(
    "gen-cs-nested-profile", cl::Hidden, cl::init(true),
    cl::desc("Generate nested function profiles for CSSPGO"));

cl::opt<bool> InferMissingFrames(
    "infer-missing-frames", llvm::cl::init(true),
    llvm::cl::desc(
        "Infer missing call frames due to compiler tail call elimination."),
    llvm::cl::Optional);
using namespace llvm;
using namespace sampleprof;

namespace llvm {
extern cl::opt<int> ProfileSummaryCutoffHot;
extern cl::opt<bool> UseContextLessSummary;

namespace sampleprof {
// Initialize the MaxCompressionSize to -1 which means no size limit
int32_t CSProfileGenerator::MaxCompressionSize = -1;

int CSProfileGenerator::MaxContextDepth = -1;

bool ProfileGeneratorBase::UseFSDiscriminator = false;

std::unique_ptr<ProfileGeneratorBase>
ProfileGeneratorBase::create(ProfiledBinary *Binary,
                             const ContextSampleCounterMap *SampleCounters,
                             bool ProfileIsCS) {
  std::unique_ptr<ProfileGeneratorBase> Generator;
  if (ProfileIsCS) {
    Generator.reset(new CSProfileGenerator(Binary, SampleCounters));
[CSSPGO][llvm-profgen] Context-sensitive profile data generation
This stack of changes introduces the `llvm-profgen` utility, which generates a profile data file from given perf script data files for sample-based PGO. It’s part of (though not exclusive to) the CSSPGO work. Specifically, to support context-sensitive profiles with or without pseudo probes, it implements a series of functionalities including perf trace parsing, instruction symbolization, LBR stack/call frame stack unwinding, pseudo probe decoding, etc. High throughput is achieved by multiple levels of sample aggregation, and a compatible profile format is generated in one stop at the end. Please refer to https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s for the CSSPGO RFC.
This change adds context-sensitive profile data generation to llvm-profgen. With simultaneous sampling of the LBR and the call stack, we can identify the leaf of an LBR sample with the calling context from the stack sample. While deriving fall-through paths from LBR entries, we unwind the LBR by replaying all the calls and returns (including implicit calls/returns due to inlining) backwards on top of the sampled call stack. The state of the call stack as we unwind through the LBR then always represents the calling context of the current fall-through path.
We have two types of virtual unwinding: 1) LBR unwinding and 2) linear range unwinding.
Specifically, each LBR entry can be classified as a call, a return, or a regular branch. LBR unwinding replays each operation by pushing, popping, or switching the leaf frame on the call stack. Since the initial call stack is the most recently sampled one, the replay runs in anti-execution order: in the regular case, pop the call stack when the LBR entry is a call, and push a frame when it is a return. After each LBR entry is processed, we also need to align with the next entry by walking the instructions from the previous entry's target to the current entry's source, which we call linear unwinding. As instructions in a linear range can come from different functions due to inlining, linear unwinding splits the range and records counters for each sub-range with the same inline context.
With each fall-through path from LBR unwinding, we aggregate the samples into counters by calling context and eventually generate a full context-sensitive profile (without relying on inlining) to drive the compiler's PGO/FDO.
A breakdown of noteworthy changes:
- Added a `HybridSample` class as the abstraction of a perf sample, including the LBR stack and call stack
* Extended `PerfReader` to auto-detect whether the input perf script output contains a CS profile, then parse it; multiple `HybridSample`s are extracted
* Sped up processing by aggregating `HybridSample`s into `AggregatedSamples`
* Added a VirtualUnwinder that consumes aggregated `HybridSample`s and implements unwinding of calls, returns, and linear paths containing implicit calls/returns from inlining. Range and branch counters are aggregated by calling context.
Here the calling context is a string; each frame is a pair of function name and callsite location info, so a whole context looks like `main:1 @ foo:2 @ bar`.
* Added a ProfileGenerator that accumulates counters by range unfolding or branch target mapping, then generates context-sensitive function profiles including function body samples, inferred callee head samples, and callsite target samples, eventually recording them into the ProfileMap.
* Leveraged LLVM's built-in writer (`SampleProfWriter`) to support different serialization formats in one stop
- Used `getCanonicalFnName` for callee names and names from the ELF section
- Added regression tests for both unwinding and profile generation
Test Plan:
ninja & ninja check-llvm
Reviewed By: hoy, wenlei, wmi
Differential Revision: https://reviews.llvm.org/D89723
2020-10-19 12:55:59 -07:00
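The anti-execution replay rule described above can be sketched as follows. The entry kinds, the `unwindLBR` helper, and the plain-string call stack are illustrative assumptions, not llvm-profgen's actual VirtualUnwinder types. Walking LBR entries newest-to-oldest, a call entry pops the frame it created, a return entry pushes back the frame it left, and a regular branch leaves the stack unchanged.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class BranchKind { Call, Return, Regular };

// Illustrative LBR entry: kind plus the function containing each endpoint.
struct LBREntry {
  BranchKind Kind;
  std::string Source; // function containing the branch source
  std::string Target; // function containing the branch target
};

// Replay LBR entries in anti-execution order (newest first) on top of the
// sampled call stack (leaf frame at back()). After processing a prefix of
// the entries, the stack is the calling context of that fall-through path.
void unwindLBR(std::vector<std::string> &CallStack,
               const std::vector<LBREntry> &LBRNewestFirst) {
  for (const LBREntry &E : LBRNewestFirst) {
    switch (E.Kind) {
    case BranchKind::Call:
      CallStack.pop_back(); // undo the call: drop the callee frame
      break;
    case BranchKind::Return:
      CallStack.push_back(E.Source); // undo the return: restore the callee
      break;
    case BranchKind::Regular:
      break; // leaf frame unchanged; only the PC within it moves
    }
  }
}
```

For example, replaying a call `main -> foo` then (older) a return `bar -> main` on a sampled stack `[main, foo]` leaves `[main, bar]`: the context in effect before `bar` returned.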
|
|
|
} else {
|
2021-09-24 11:32:32 -07:00
|
|
|
Generator.reset(new ProfileGenerator(Binary, SampleCounters));
|
|
|
|
}
|
2021-11-04 00:08:37 -07:00
|
|
|
ProfileGeneratorBase::UseFSDiscriminator = Binary->useFSDiscriminator();
|
|
|
|
FunctionSamples::ProfileIsFS = Binary->useFSDiscriminator();
|
|
|
|
|
2021-09-22 20:00:24 -07:00
|
|
|
return Generator;
|
|
|
|
}
|
|
|
|
|
2022-03-30 12:27:10 -07:00
|
|
|
std::unique_ptr<ProfileGeneratorBase>
|
2022-06-23 20:14:47 -07:00
|
|
|
ProfileGeneratorBase::create(ProfiledBinary *Binary, SampleProfileMap &Profiles,
|
2022-04-28 11:31:02 -07:00
|
|
|
bool ProfileIsCS) {
|
2022-03-30 12:27:10 -07:00
|
|
|
std::unique_ptr<ProfileGeneratorBase> Generator;
|
2022-04-28 11:31:02 -07:00
|
|
|
if (ProfileIsCS) {
|
2022-06-23 20:14:47 -07:00
|
|
|
Generator.reset(new CSProfileGenerator(Binary, Profiles));
|
2022-03-30 12:27:10 -07:00
|
|
|
} else {
|
|
|
|
Generator.reset(new ProfileGenerator(Binary, std::move(Profiles)));
|
|
|
|
}
|
|
|
|
ProfileGeneratorBase::UseFSDiscriminator = Binary->useFSDiscriminator();
|
|
|
|
FunctionSamples::ProfileIsFS = Binary->useFSDiscriminator();
|
|
|
|
|
|
|
|
return Generator;
|
|
|
|
}
|
|
|
|
|
2021-09-22 20:00:24 -07:00
|
|
|
void ProfileGeneratorBase::write(std::unique_ptr<SampleProfileWriter> Writer,
|
|
|
|
SampleProfileMap &ProfileMap) {
|
2021-09-29 19:15:37 -07:00
|
|
|
// Populate profile symbol list if extended binary format is used.
|
|
|
|
ProfileSymbolList SymbolList;
|
|
|
|
|
|
|
|
if (PopulateProfileSymbolList && OutputFormat == SPF_Ext_Binary) {
|
2021-10-21 20:56:06 -07:00
|
|
|
Binary->populateSymbolListFromDWARF(SymbolList);
|
2021-09-29 19:15:37 -07:00
|
|
|
Writer->setProfileSymbolList(&SymbolList);
|
|
|
|
}
|
|
|
|
|
2021-04-07 23:06:39 -07:00
|
|
|
if (std::error_code EC = Writer->write(ProfileMap))
|
|
|
|
exitWithError(std::move(EC));
|
2021-02-03 14:13:06 -08:00
|
|
|
}
|
|
|
|
|
2021-09-22 20:00:24 -07:00
|
|
|
void ProfileGeneratorBase::write() {
|
|
|
|
auto WriterOrErr = SampleProfileWriter::create(OutputFilename, OutputFormat);
|
|
|
|
if (std::error_code EC = WriterOrErr.getError())
|
|
|
|
exitWithError(EC, OutputFilename);
|
2021-08-30 10:31:47 -07:00
|
|
|
|
|
|
|
if (UseMD5) {
|
|
|
|
if (OutputFormat != SPF_Ext_Binary)
|
|
|
|
WithColor::warning() << "-use-md5 is ignored. Specify "
|
|
|
|
"--format=extbinary to enable it\n";
|
|
|
|
else
|
|
|
|
WriterOrErr.get()->setUseMD5();
|
|
|
|
}
|
|
|
|
|
2021-02-03 14:13:06 -08:00
|
|
|
write(std::move(WriterOrErr.get()), ProfileMap);
|
|
|
|
}
|
|
|
|
|
2021-11-28 18:42:09 -08:00
|
|
|
void ProfileGeneratorBase::showDensitySuggestion(double Density) {
|
|
|
|
if (Density == 0.0)
|
2024-05-24 14:37:24 -04:00
|
|
|
WithColor::warning() << "The output profile is empty or the "
|
|
|
|
"--profile-density-cutoff-hot option is "
|
2021-11-28 18:42:09 -08:00
|
|
|
"set too low. Please check your command.\n";
|
2024-05-24 14:37:24 -04:00
|
|
|
else if (Density < ProfileDensityThreshold)
|
2021-11-28 18:42:09 -08:00
|
|
|
WithColor::warning()
|
2023-06-25 16:39:16 -07:00
|
|
|
<< "Sample PGO is estimated to optimize better with "
|
2024-05-24 14:37:24 -04:00
|
|
|
<< format("%.1f", ProfileDensityThreshold / Density)
|
2021-11-28 18:42:09 -08:00
|
|
|
<< "x more samples. Please consider increasing sampling rate or "
|
|
|
|
"profiling for longer duration to get more samples.\n";
|
|
|
|
|
|
|
|
if (ShowDensity)
|
2024-05-24 14:37:24 -04:00
|
|
|
outs() << "Functions with density >= " << format("%.1f", Density)
|
|
|
|
<< " account for "
|
2021-11-28 18:42:09 -08:00
|
|
|
<< format("%.2f",
|
2024-05-24 14:37:24 -04:00
|
|
|
static_cast<double>(ProfileDensityCutOffHot) / 10000)
|
|
|
|
<< "% total sample counts.\n";
|
2021-11-28 18:42:09 -08:00
|
|
|
}
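As a rough illustration of the suggestion math above (the helper names and numbers here are made up for the example; the real threshold and cutoff come from llvm-profgen's density options):

```cpp
#include <cassert>

// Mirrors the arithmetic in showDensitySuggestion: a profile whose density
// falls below the threshold would benefit from roughly
// (threshold / density)x more samples.
double suggestedSampleFactor(double Density, double Threshold) {
  return Threshold / Density;
}

// Density of exactly zero means the output profile is empty or the
// hot-density cutoff is set too low.
bool isEmptyOrCutoffTooLow(double Density) { return Density == 0.0; }
```

For instance, a measured density of 10 against a threshold of 50 yields a suggestion of 5.0x more samples.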
|
|
|
|
|
[llvm-profgen] Filter out ambiguous cold profiles during profile generation (#81803)
For built-in local initialization functions (those with the `__cxx_global_var_init` or `__tls_init` prefix), there can be multiple versions of a function in the final binary. For example, for `__cxx_global_var_init`, which is a wrapper for global variable ctors, the compiler may emit suffixed variants like `__cxx_global_var_init.N` for different ctors.
However, during profile generation we call `getCanonicalFnName` to
canonicalize the names, which strips the suffixes. Therefore, samples from
different functions query the same profile (only
`__cxx_global_var_init`) and their counts are merged. As the functions are
essentially different, entries of the merged profile are ambiguous. In
sample loading, the IR from each version would be attributed to the merged
entries, which is inaccurate. This is especially harmful for fuzzy profile
matching: it gets multiple callsites (from different functions) but uses
them to match a single callsite, which misleads the matching and reports a
lot of false positives.
Hence, we want to filter these profiles out of the profile map at profile
generation time. They all belong to cold functions, so removing them won't
have a performance impact.
2024-02-16 14:29:24 -08:00
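The name collision described above can be sketched as follows. This is not llvm-profgen's actual `getCanonicalFnName`, just a hypothetical illustration of the suffix-stripping idea: distinct compiler-generated ctor wrappers collapse to one canonical name, so their samples would merge into a single ambiguous profile.

```cpp
#include <cassert>
#include <string>

// Strip a trailing ".N" numeric suffix, as canonicalization does for
// compiler-generated clones like __cxx_global_var_init.2 (sketch only).
std::string stripNumericSuffix(const std::string &Name) {
  size_t Dot = Name.rfind('.');
  if (Dot != std::string::npos && Dot + 1 < Name.size() &&
      Name.find_first_not_of("0123456789", Dot + 1) == std::string::npos)
    return Name.substr(0, Dot);
  return Name;
}
```

Both `__cxx_global_var_init.2` and `__cxx_global_var_init.15` canonicalize to `__cxx_global_var_init`, which is why their merged profile no longer describes either function accurately.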
|
|
|
bool ProfileGeneratorBase::filterAmbiguousProfile(FunctionSamples &FS) {
|
|
|
|
for (const auto &Prefix : FuncPrefixsToFilter) {
|
|
|
|
if (FS.getFuncName().starts_with(Prefix))
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
// Filter the function profiles for the inlinees. It's useful for fuzzy
|
|
|
|
// profile matching which flattens the profile and inlinees' samples are
|
|
|
|
// merged into top-level function.
|
|
|
|
for (auto &Callees :
|
|
|
|
const_cast<CallsiteSampleMap &>(FS.getCallsiteSamples())) {
|
|
|
|
auto &CalleesMap = Callees.second;
|
|
|
|
for (auto I = CalleesMap.begin(); I != CalleesMap.end();) {
|
|
|
|
auto FS = I++;
|
|
|
|
if (filterAmbiguousProfile(FS->second))
|
|
|
|
CalleesMap.erase(FS);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
// For built-in local initialization functions such as __cxx_global_var_init,
|
|
|
|
// __tls_init prefix function, there could be multiple versions of the functions
|
|
|
|
// in the final binary. However, in the profile generation, we call
|
|
|
|
// getCanonicalFnName to canonicalize the names which strips the suffixes.
|
|
|
|
// Therefore, samples from different functions query the same profile and the
|
|
|
|
// samples are merged. As the functions are essentially different, entries of
|
|
|
|
// the merged profile are ambiguous. In sample loader, the IR from one version
|
|
|
|
// would be attributed towards a merged entry, which is inaccurate. Especially
|
|
|
|
// for fuzzy profile matching, it gets multiple callsites(from different
|
|
|
|
// function) but used to match one callsite, which misleads the matching and
|
|
|
|
// causes a lot of false-positive reports. Hence, we want to filter them out
|
|
|
|
// from the profile map during the profile generation time. The profiles are all
|
|
|
|
// cold functions, so it won't have a perf impact.
|
|
|
|
void ProfileGeneratorBase::filterAmbiguousProfile(SampleProfileMap &Profiles) {
|
|
|
|
for (auto I = ProfileMap.begin(); I != ProfileMap.end();) {
|
|
|
|
auto FS = I++;
|
|
|
|
if (filterAmbiguousProfile(FS->second))
|
|
|
|
ProfileMap.erase(FS);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-09-22 20:00:24 -07:00
|
|
|
void ProfileGeneratorBase::findDisjointRanges(RangeSample &DisjointRanges,
|
|
|
|
const RangeSample &Ranges) {
|
[CSSPGO][llvm-profgen] Context-sensitive profile data generation
This stack of changes introduces the `llvm-profgen` utility, which generates a profile data file from given perf script data files for sample-based PGO. It is part of (though not only) the CSSPGO work. Specifically, to support context-sensitive profiles with or without pseudo probes, it implements a series of functionalities including perf trace parsing, instruction symbolization, LBR stack/call frame stack unwinding, pseudo probe decoding, etc. High throughput is achieved by multiple levels of sample aggregation, and a compatible one-stop format is generated at the end. Please refer to https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s for the CSSPGO RFC.
This change adds context-sensitive profile data generation to llvm-profgen. With simultaneous sampling of LBR and call stack, we can attribute the leaf of an LBR sample to the calling context from the stack sample. While deriving fall-through paths from LBR entries, we unwind the LBR by replaying all the calls and returns (including implicit calls/returns due to inlining) backwards on top of the sampled call stack. The state of the call stack as we unwind through the LBR then always represents the calling context of the current fall-through path.
We have two types of virtual unwinding: 1) LBR unwinding and 2) linear range unwinding.
Specifically, each LBR entry can be classified as a call, a return, or a regular branch. LBR unwinding replays the operation by pushing, popping, or switching the leaf frame of the call stack. Since the initial call stack is the most recently sampled one, the replay runs in anti-execution order: in the regular case, pop the call stack when the LBR entry is a call, and push a frame when it is a return. After each LBR entry is processed, we also need to align with the next entry by walking the instructions from the previous entry's target to the current entry's source, which we call linear unwinding. As instructions in a linear range can come from different functions due to inlining, linear unwinding splits the range and records counters for each sub-range with the same inline context.
With each fall-through path from LBR unwinding, we aggregate each sample into counters by calling context and eventually generate a full context-sensitive profile (without relying on inlining) to drive the compiler's PGO/FDO.
A breakdown of noteworthy changes:
- Added the `HybridSample` class as the abstraction of a perf sample, including the LBR stack and call stack
- Extended `PerfReader` to auto-detect whether the input perf script output contains CS profiles, then do the parsing; multiple `HybridSample`s are extracted
- Sped up processing by aggregating `HybridSample`s into `AggregatedSamples`
- Added `VirtualUnwinder`, which consumes aggregated `HybridSample`s and implements unwinding of calls, returns, and linear paths that contain implicit calls/returns from inlining; range and branch counters are aggregated by calling context.
  Here a calling context is a string; each frame is a pair of function name and callsite location info, and a whole context looks like `main:1 @ foo:2 @ bar`.
- Added `ProfileGenerator`, which accumulates counters by range unfolding or branch target mapping, then generates a context-sensitive function profile including the function body, inferred callee head samples, and callsite target samples, and records it into `ProfileMap`.
- Leveraged LLVM's built-in writer (`SampleProfWriter`) to support different serialization formats in one stop
- Used `getCanonicalFnName` for callee names and names from the ELF section
- Added regression tests for both unwinding and profile generation
Test Plan:
ninja & ninja check-llvm
Reviewed By: hoy, wenlei, wmi
Differential Revision: https://reviews.llvm.org/D89723
2020-10-19 12:55:59 -07:00
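The backwards replay described above can be sketched in a few lines. This is an illustration only, not llvm-profgen's actual API: `LBRKind`, `replayLBR`, and string-valued frames are all hypothetical simplifications (the real unwinder tracks addresses and inline contexts, and also handles the frame-switching case).

```cpp
#include <string>
#include <vector>

// Kinds an LBR entry can be classified into.
enum class LBRKind { Call, Return, Regular };

// Replay one LBR entry on top of the sampled call stack. Because the
// sampled stack reflects the *latest* state, the replay runs in
// anti-execution order: a call is undone by popping, a return by pushing.
// CallStack.back() is the leaf frame; Callee names the frame that
// returned (only used for Return entries).
inline void replayLBR(std::vector<std::string> &CallStack, LBRKind Kind,
                      const std::string &Callee = "") {
  switch (Kind) {
  case LBRKind::Call:
    CallStack.pop_back(); // undo the call: drop the frame it created
    break;
  case LBRKind::Return:
    CallStack.push_back(Callee); // undo the return: re-enter the callee
    break;
  case LBRKind::Regular:
    break; // a regular branch stays within the leaf frame
  }
}
```

After each replay step, the state of `CallStack` is the calling context for the fall-through range associated with that entry.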
/*
  Regions may overlap with each other. Using the boundary info, find all
  disjoint ranges and their sample counts. A BoundaryPoint contains the counts
  of the samples that begin/end at that point.

  |<--100-->|         Sample1
  |<------200------>| Sample2
  A         B       C

  In the example above,
  Sample1 begins at A, ends at B, its value is 100.
  Sample2 begins at A, ends at C, its value is 200.
  For A, BeginCount is the sum of samples that begin at A, which is 300, and
  no samples end at A, so EndCount is 0.
  Then boundary points A, B, and C with begin/end counts are:
  A: (300, 0)
  B: (0, 100)
  C: (0, 200)
*/
struct BoundaryPoint {
  // Sum of sample counts beginning at this point
  uint64_t BeginCount = UINT64_MAX;
  // Sum of sample counts ending at this point
  uint64_t EndCount = UINT64_MAX;
  // Is the begin point of a zero range.
  bool IsZeroRangeBegin = false;
  // Is the end point of a zero range.
  bool IsZeroRangeEnd = false;

  void addBeginCount(uint64_t Count) {
    if (BeginCount == UINT64_MAX)
      BeginCount = 0;
    BeginCount += Count;
  }

  void addEndCount(uint64_t Count) {
    if (EndCount == UINT64_MAX)
      EndCount = 0;
    EndCount += Count;
  }
};
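The UINT64_MAX sentinel distinguishes "no sample begins/ends here" from a genuine zero count. A minimal standalone copy of the struct (named `BP` here purely for illustration) exercises the counting for the Sample1/Sample2 example above, with addresses 0, 9, and 17 standing in for A, B, and C:

```cpp
#include <cstdint>
#include <map>

// Standalone copy of BoundaryPoint for illustration; UINT64_MAX means the
// counter has never been touched (no sample begins/ends at this point).
struct BP {
  uint64_t BeginCount = UINT64_MAX;
  uint64_t EndCount = UINT64_MAX;
  void addBeginCount(uint64_t Count) {
    if (BeginCount == UINT64_MAX)
      BeginCount = 0;
    BeginCount += Count;
  }
  void addEndCount(uint64_t Count) {
    if (EndCount == UINT64_MAX)
      EndCount = 0;
    EndCount += Count;
  }
};

// Fill boundary points for Sample1 [0,9]:100 and Sample2 [0,17]:200.
inline std::map<uint64_t, BP> exampleBoundaries() {
  std::map<uint64_t, BP> Boundaries;
  Boundaries[0].addBeginCount(100); // Sample1 begins at A
  Boundaries[9].addEndCount(100);   // Sample1 ends at B
  Boundaries[0].addBeginCount(200); // Sample2 begins at A
  Boundaries[17].addEndCount(200);  // Sample2 ends at C
  return Boundaries;
}
```

A ends up with BeginCount 300 and an untouched EndCount, matching the (300, 0) pair in the comment above once the sentinel is read as "no samples end here".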
/*
  For the above example, with boundary points, the following logic finds two
  disjoint ranges:

  [A, B]:   300
  [B+1, C]: 200

  If there is a boundary point that both begins and ends samples, the point
  itself becomes a separate disjoint range. For example, if we have the
  original ranges

  |<--- 100 --->|
                |<--- 200 --->|
  A             B             C

  there are three boundary points with begin/end counts of

  A: (100, 0)
  B: (200, 100)
  C: (0, 200)

  and the disjoint ranges would be

  [A, B-1]: 100
  [B, B]:   300
  [B+1, C]: 200

  Example for a zero value range:

    |<--- 100 --->|
                     |<--- 200 --->|
  |<---------------- 0 ---------------->|
  A B             C    D             E  F

  [A, B-1]:   0
  [B, C]:     100
  [C+1, D-1]: 0
  [D, E]:     200
  [E+1, F]:   0
*/
|
|
|
|
std::map<uint64_t, BoundaryPoint> Boundaries;
|
|
|
|
|
2022-01-14 14:16:18 +00:00
|
|
|
for (const auto &Item : Ranges) {
|
2021-09-23 22:53:12 -07:00
|
|
|
assert(Item.first.first <= Item.first.second &&
|
|
|
|
"Invalid instruction range");
|
|
|
|
auto &BeginPoint = Boundaries[Item.first.first];
|
|
|
|
auto &EndPoint = Boundaries[Item.first.second];
|
[CSSPGO][llvm-profgen] Context-sensitive profile data generation
This stack of changes introduces `llvm-profgen` utility which generates a profile data file from given perf script data files for sample-based PGO. It’s part of(not only) the CSSPGO work. Specifically to support context-sensitive with/without pseudo probe profile, it implements a series of functionalities including perf trace parsing, instruction symbolization, LBR stack/call frame stack unwinding, pseudo probe decoding, etc. Also high throughput is achieved by multiple levels of sample aggregation and compatible format with one stop is generated at the end. Please refer to: https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s for the CSSPGO RFC.
This change supports context-sensitive profile data generation into llvm-profgen. With simultaneous sampling for LBR and call stack, we can identify leaf of LBR sample with calling context from stack sample . During the process of deriving fall through path from LBR entries, we unwind LBR by replaying all the calls and returns (including implicit calls/returns due to inlining) backwards on top of the sampled call stack. Then the state of call stack as we unwind through LBR always represents the calling context of current fall through path.
we have two types of virtual unwinding 1) LBR unwinding and 2) linear range unwinding.
Specifically, for each LBR entry which can be classified into call, return, regular branch, LBR unwinding will replay the operation by pushing, popping or switching leaf frame towards the call stack and since the initial call stack is most recently sampled, the replay should be in anti-execution order, i.e. for the regular case, pop the call stack when LBR is call, push frame on call stack when LBR is return. After each LBR processed, it also needs to align with the next LBR by going through instructions from previous LBR's target to current LBR's source, which we named linear unwinding. As instruction from linear range can come from different function by inlining, linear unwinding will do the range splitting and record counters through the range with same inline context.
With each fall through path from LBR unwinding, we aggregate each sample into counters by the calling context and eventually generate full context sensitive profile (without relying on inlining) to driver compiler's PGO/FDO.
A breakdown of noteworthy changes:
- Added `HybridSample` class as the abstraction perf sample including LBR stack and call stack
* Extended `PerfReader` to implement auto-detect whether input perf script output contains CS profile, then do the parsing. Multiple `HybridSample` are extracted
* Speed up by aggregating `HybridSample` into `AggregatedSamples`
* Added VirtualUnwinder that consumes aggregated `HybridSample` and implements unwinding of calls, returns, and linear path that contains implicit call/return from inlining. Ranges and branches counters are aggregated by the calling context.
Here calling context is string type, each context is a pair of function name and callsite location info, the whole context is like `main:1 @ foo:2 @ bar`.
* Added PorfileGenerater that accumulates counters by ranges unfolding or branch target mapping, then generates context-sensitive function profile including function body, inferring callee's head sample, callsite target samples, eventually records into ProfileMap.
* Leveraged LLVM build-in(`SampleProfWriter`) writer to support different serialization format with no stop
- `getCanonicalFnName` for callee name and name from ELF section
- Added regression test for both unwinding and profile generation
Test Plan:
ninja & ninja check-llvm
Reviewed By: hoy, wenlei, wmi
Differential Revision: https://reviews.llvm.org/D89723
2020-10-19 12:55:59 -07:00
|
|
|
uint64_t Count = Item.second;
|
|
|
|
|
2021-09-23 22:53:12 -07:00
|
|
|
BeginPoint.addBeginCount(Count);
|
|
|
|
EndPoint.addEndCount(Count);
|
|
|
|
if (Count == 0) {
|
|
|
|
BeginPoint.IsZeroRangeBegin = true;
|
|
|
|
EndPoint.IsZeroRangeEnd = true;
|
|
|
|
}
|
[CSSPGO][llvm-profgen] Context-sensitive profile data generation
This stack of changes introduces `llvm-profgen` utility which generates a profile data file from given perf script data files for sample-based PGO. It’s part of(not only) the CSSPGO work. Specifically to support context-sensitive with/without pseudo probe profile, it implements a series of functionalities including perf trace parsing, instruction symbolization, LBR stack/call frame stack unwinding, pseudo probe decoding, etc. Also high throughput is achieved by multiple levels of sample aggregation and compatible format with one stop is generated at the end. Please refer to: https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s for the CSSPGO RFC.
This change supports context-sensitive profile data generation into llvm-profgen. With simultaneous sampling for LBR and call stack, we can identify leaf of LBR sample with calling context from stack sample . During the process of deriving fall through path from LBR entries, we unwind LBR by replaying all the calls and returns (including implicit calls/returns due to inlining) backwards on top of the sampled call stack. Then the state of call stack as we unwind through LBR always represents the calling context of current fall through path.
we have two types of virtual unwinding 1) LBR unwinding and 2) linear range unwinding.
Specifically, for each LBR entry which can be classified into call, return, regular branch, LBR unwinding will replay the operation by pushing, popping or switching leaf frame towards the call stack and since the initial call stack is most recently sampled, the replay should be in anti-execution order, i.e. for the regular case, pop the call stack when LBR is call, push frame on call stack when LBR is return. After each LBR processed, it also needs to align with the next LBR by going through instructions from previous LBR's target to current LBR's source, which we named linear unwinding. As instruction from linear range can come from different function by inlining, linear unwinding will do the range splitting and record counters through the range with same inline context.
With each fall through path from LBR unwinding, we aggregate each sample into counters by the calling context and eventually generate full context sensitive profile (without relying on inlining) to driver compiler's PGO/FDO.
A breakdown of noteworthy changes:
- Added `HybridSample` class as the abstraction perf sample including LBR stack and call stack
* Extended `PerfReader` to implement auto-detect whether input perf script output contains CS profile, then do the parsing. Multiple `HybridSample` are extracted
* Speed up by aggregating `HybridSample` into `AggregatedSamples`
* Added VirtualUnwinder that consumes aggregated `HybridSample` and implements unwinding of calls, returns, and linear path that contains implicit call/return from inlining. Ranges and branches counters are aggregated by the calling context.
Here calling context is string type, each context is a pair of function name and callsite location info, the whole context is like `main:1 @ foo:2 @ bar`.
* Added PorfileGenerater that accumulates counters by ranges unfolding or branch target mapping, then generates context-sensitive function profile including function body, inferring callee's head sample, callsite target samples, eventually records into ProfileMap.
* Leveraged LLVM build-in(`SampleProfWriter`) writer to support different serialization format with no stop
- `getCanonicalFnName` for callee name and name from ELF section
- Added regression test for both unwinding and profile generation
Test Plan:
ninja & ninja check-llvm
Reviewed By: hoy, wenlei, wmi
Differential Revision: https://reviews.llvm.org/D89723
2020-10-19 12:55:59 -07:00
}

// Use UINT64_MAX to indicate there is no existing range between BeginAddress
// and the next valid address
uint64_t BeginAddress = UINT64_MAX;
int ZeroRangeDepth = 0;
uint64_t Count = 0;
for (const auto &Item : Boundaries) {
  uint64_t Address = Item.first;
  const BoundaryPoint &Point = Item.second;
  if (Point.BeginCount != UINT64_MAX) {
    if (BeginAddress != UINT64_MAX)
      DisjointRanges[{BeginAddress, Address - 1}] = Count;
    Count += Point.BeginCount;
    BeginAddress = Address;
    ZeroRangeDepth += Point.IsZeroRangeBegin;
  }
  if (Point.EndCount != UINT64_MAX) {
    assert((BeginAddress != UINT64_MAX) &&
           "First boundary point cannot be 'end' point");
    DisjointRanges[{BeginAddress, Address}] = Count;
    assert(Count >= Point.EndCount && "Mismatched live ranges");
    Count -= Point.EndCount;
    BeginAddress = Address + 1;
    ZeroRangeDepth -= Point.IsZeroRangeEnd;
    // If the remaining count is zero and it's no longer in a zero range, this
    // means we consume all the ranges before, thus mark BeginAddress as
    // UINT64_MAX. e.g. supposing we have two non-overlapping ranges:
    //   [<---- 10 ---->]
    //                       [<---- 20 ---->]
    //   A             B     C              D
    // The BeginAddress(B+1) will reset to invalid(UINT64_MAX), so we won't
    // have the [B+1, C-1] zero range.
    if (Count == 0 && ZeroRangeDepth == 0)
      BeginAddress = UINT64_MAX;
  }
}
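The sweep above can be exercised in isolation. The following is a minimal self-contained sketch, with simplified stand-in types (this `BoundaryPoint` and `findDisjointRanges` are not the real llvm-profgen declarations, and the zero-range bookkeeping is omitted), showing how begin/end boundary points are folded into disjoint, non-overlapping ranges:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

// Simplified stand-in for the real BoundaryPoint: UINT64_MAX means
// "no begin/end event at this address".
struct BoundaryPoint {
  uint64_t BeginCount = UINT64_MAX;
  uint64_t EndCount = UINT64_MAX;
};

// Maps an inclusive [begin, end] address pair to its sample count.
using RangeMap = std::map<std::pair<uint64_t, uint64_t>, uint64_t>;

// Sweep sorted boundary points left to right, maintaining the running
// count of live ranges, and emit disjoint [begin, end] pieces.
RangeMap findDisjointRanges(const std::map<uint64_t, BoundaryPoint> &Boundaries) {
  RangeMap DisjointRanges;
  uint64_t BeginAddress = UINT64_MAX; // no open range yet
  uint64_t Count = 0;
  for (const auto &Item : Boundaries) {
    uint64_t Address = Item.first;
    const BoundaryPoint &Point = Item.second;
    if (Point.BeginCount != UINT64_MAX) {
      // Close the piece accumulated so far before the count changes.
      if (BeginAddress != UINT64_MAX)
        DisjointRanges[{BeginAddress, Address - 1}] = Count;
      Count += Point.BeginCount;
      BeginAddress = Address;
    }
    if (Point.EndCount != UINT64_MAX) {
      assert(BeginAddress != UINT64_MAX && "first point cannot be an end");
      DisjointRanges[{BeginAddress, Address}] = Count;
      Count -= Point.EndCount;
      BeginAddress = (Count == 0) ? UINT64_MAX : Address + 1;
    }
  }
  return DisjointRanges;
}
```

For two overlapping ranges [10, 30] with count 5 and [20, 40] with count 3, the sweep emits [10, 19] = 5, [20, 30] = 8, and [31, 40] = 3.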

void ProfileGeneratorBase::updateBodySamplesforFunctionProfile(
    FunctionSamples &FunctionProfile, const SampleContextFrame &LeafLoc,
    uint64_t Count) {
  // Use the maximum count of samples with the same line location
  uint32_t Discriminator = getBaseDiscriminator(LeafLoc.Location.Discriminator);

  // Use the duplication factor to compensate for loop unroll/vectorization.
  // Note that this is only needed when we're taking the MAX of the counts at
  // the location instead of the SUM.
  Count *= getDuplicationFactor(LeafLoc.Location.Discriminator);

  ErrorOr<uint64_t> R =
      FunctionProfile.findSamplesAt(LeafLoc.Location.LineOffset, Discriminator);

  uint64_t PreviousCount = R ? R.get() : 0;
  if (PreviousCount <= Count) {
    FunctionProfile.addBodySamples(LeafLoc.Location.LineOffset, Discriminator,
                                   Count - PreviousCount);
  }
}
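The take-the-maximum behavior above (adding only the positive delta over the previously recorded count, rather than summing) can be modeled with a tiny standalone sketch; `BodySamples` and `updateBodySamples` here are invented stand-ins, not the real `FunctionSamples` API:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

// Invented model: samples keyed by (line offset, discriminator).
using LocationKey = std::pair<uint32_t, uint32_t>;
std::map<LocationKey, uint64_t> BodySamples;

// Mimics updateBodySamplesforFunctionProfile: adding the delta on top of
// the previous count makes the stored value the MAX of all counts seen
// at the location, not their SUM.
void updateBodySamples(uint32_t LineOffset, uint32_t Discriminator,
                       uint64_t Count) {
  uint64_t &Slot = BodySamples[{LineOffset, Discriminator}];
  if (Slot <= Count)
    Slot += Count - Slot; // equivalent to Slot = max(Slot, Count)
}
```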

void ProfileGeneratorBase::updateTotalSamples() {
  for (auto &Item : ProfileMap) {
    FunctionSamples &FunctionProfile = Item.second;
    FunctionProfile.updateTotalSamples();
  }
}

void ProfileGeneratorBase::updateCallsiteSamples() {
  for (auto &Item : ProfileMap) {
    FunctionSamples &FunctionProfile = Item.second;
    FunctionProfile.updateCallsiteSamples();
  }
}

void ProfileGeneratorBase::updateFunctionSamples() {
  updateCallsiteSamples();

  if (UpdateTotalSamples)
    updateTotalSamples();
}

void ProfileGeneratorBase::collectProfiledFunctions() {
  std::unordered_set<const BinaryFunction *> ProfiledFunctions;
  if (collectFunctionsFromRawProfile(ProfiledFunctions))
    Binary->setProfiledFunctions(ProfiledFunctions);
  else if (collectFunctionsFromLLVMProfile(ProfiledFunctions))
    Binary->setProfiledFunctions(ProfiledFunctions);
  else
    llvm_unreachable("Unsupported input profile");
}

bool ProfileGeneratorBase::collectFunctionsFromRawProfile(
    std::unordered_set<const BinaryFunction *> &ProfiledFunctions) {
  if (!SampleCounters)
    return false;
  // Go through all the stacks, ranges and branches in the sample counters;
  // use the start of the range to look up the function it belongs to and
  // record the function.
  for (const auto &CI : *SampleCounters) {
    if (const auto *CtxKey = dyn_cast<AddrBasedCtxKey>(CI.first.getPtr())) {
      for (auto StackAddr : CtxKey->Context) {
        if (FuncRange *FRange = Binary->findFuncRange(StackAddr))
          ProfiledFunctions.insert(FRange->Func);
      }
    }

    for (auto Item : CI.second.RangeCounter) {
      uint64_t StartAddress = Item.first.first;
      if (FuncRange *FRange = Binary->findFuncRange(StartAddress))
        ProfiledFunctions.insert(FRange->Func);
    }

    for (auto Item : CI.second.BranchCounter) {
      uint64_t SourceAddress = Item.first.first;
      uint64_t TargetAddress = Item.first.second;
      if (FuncRange *FRange = Binary->findFuncRange(SourceAddress))
        ProfiledFunctions.insert(FRange->Func);
      if (FuncRange *FRange = Binary->findFuncRange(TargetAddress))
        ProfiledFunctions.insert(FRange->Func);
    }
  }
  return true;
}

bool ProfileGenerator::collectFunctionsFromLLVMProfile(
    std::unordered_set<const BinaryFunction *> &ProfiledFunctions) {
  for (const auto &FS : ProfileMap) {
[llvm-profdata] Do not create numerical strings for MD5 function names read from a Sample Profile. (#66164)
This is phase 2 of the MD5 refactoring on Sample Profile following
https://reviews.llvm.org/D147740
In the previous implementation, when an MD5 Sample Profile was read, the
reader first converted the MD5 values to strings, then created a
StringRef as if the numerical strings were regular function names, and
later on IPO transformation passes performed string comparisons over these
numerical strings for profile matching. This is inefficient since it
causes many small heap allocations.
In this patch I created a class `ProfileFuncRef` that is similar to
`StringRef` but can represent a hash value directly without any
conversion, and it will be more efficient (I will attach some benchmark
results later) when used in associative containers.
ProfileFuncRef guarantees that the same function name in string form or in
MD5 form has the same hash value, which also fixes a few issues in IPO
passes where function matching/lookup only checked the function name
string and returned a no-match if the profile was MD5.
When testing on an internal large profile (> 1 GB, with more than 10
million functions), the full profile load time is reduced from 28 sec to
25 sec on average, and reading the function offset table from 0.78s to 0.7s.
2023-10-17 17:09:39 -04:00
    if (auto *Func = Binary->getBinaryFunction(FS.second.getFunction()))
      ProfiledFunctions.insert(Func);
  }
  return true;
}

bool CSProfileGenerator::collectFunctionsFromLLVMProfile(
    std::unordered_set<const BinaryFunction *> &ProfiledFunctions) {
[CSSPGO][llvm-profgen] Reimplement SampleContextTracker using context trie
This is the follow-up patch to https://reviews.llvm.org/D125246 for the `SampleContextTracker` part. Previously, promotion and merging of contexts were based on the SampleContext (the array of frames), which incurs a large memory cost. This patch detaches the tracker from the array ref and uses the context trie itself instead. This saves a lot of memory and benefits both the compiler's CS inliner and llvm-profgen's pre-inliner.
One structure that needs special treatment is `FuncToCtxtProfiles`, which is used to get all the FunctionSamples for one function for merging and promoting. Previously it searched each function's context and traversed the trie to get the node of the context. Now the profile no longer carries the context, so we instead use an auxiliary map `ProfileToNodeMap`, which is initialized with the FunctionSamples-to-TrieNode relations and kept up to date while promoting and merging nodes.
Moreover, I expected the results before and after to remain the same, but I found that the order of FuncToCtxtProfiles matters and affects the results. This can happen in recursive-context cases, but the difference should be small. Since we no longer have the context, I used a vector to fix the order, so the result is still deterministic.
Measured on one huge (12GB) profile from one of our internal services: the profile similarity is 99.999%, the running time is improved by 3X (debug mode), and memory is reduced from 170GB to 90GB.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D127031
2022-06-27 23:00:05 -07:00
  for (auto *Node : ContextTracker) {
    if (!Node->getFuncName().empty())
      if (auto *Func = Binary->getBinaryFunction(Node->getFuncName()))
        ProfiledFunctions.insert(Func);
  }
  return true;
}
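The memory saving described in the SampleContextTracker commit message comes from sharing common context prefixes in a trie instead of storing a full frame array per profile. A minimal sketch of such a context trie follows; this layout (`ContextTrieNode`, `getOrCreateChild`) is invented for illustration and is not the actual llvm-profgen class:

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Invented trie node: one node per function within a caller context.
// Shared prefixes like "main @ foo" are stored once for every context
// that passes through them.
struct ContextTrieNode {
  std::string FuncName;
  std::map<std::string, std::unique_ptr<ContextTrieNode>> Children;

  // Return the child for Callee, creating it on first use.
  ContextTrieNode &getOrCreateChild(const std::string &Callee) {
    auto &Slot = Children[Callee];
    if (!Slot) {
      Slot = std::make_unique<ContextTrieNode>();
      Slot->FuncName = Callee;
    }
    return *Slot;
  }
};
```

Looking up the same context path twice yields the same node, so per-context state can hang off the trie rather than off duplicated frame arrays.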
|
|
|
|
|
2021-09-22 20:00:24 -07:00
|
|
|
FunctionSamples &
|
[llvm-profdata] Do not create numerical strings for MD5 function names read from a Sample Profile. (#66164)
This is phase 2 of the MD5 refactoring on Sample Profile following
https://reviews.llvm.org/D147740
In the previous implementation, when an MD5 Sample Profile was read, the
reader first converted the MD5 values to strings, then created a
StringRef as if the numerical strings were regular function names, and
later IPO transformation passes performed string comparisons over these
numerical strings for profile matching. This is inefficient since it
causes many small heap allocations.
In this patch I created a class `ProfileFuncRef` that is similar to
`StringRef` but can represent a hash value directly without any
conversion, and it is more efficient (I will attach some benchmark
results later) when used in associative containers.
ProfileFuncRef guarantees that the same function name in string form or in
MD5 form has the same hash value, which also fixes a few issues in IPO
passes where function matching/lookup only checked the function name
string, returning a no-match if the profile was MD5.
When testing on an internal large profile (> 1 GB, with more than 10
million functions), the full profile load time is reduced from 28 sec to
25 sec on average, and reading the function offset table from 0.78s to 0.7s.
2023-10-17 17:09:39 -04:00
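The guarantee described above — the string form and the hash form of a name hashing identically — can be sketched with placeholder types (`FuncRef` and `nameHash` below are illustrative simplifications; the stand-in hash is FNV-1a, not LLVM's MD5):

```cpp
#include <cstdint>
#include <string>

// Placeholder 64-bit FNV-1a; the real implementation hashes with MD5.
inline uint64_t nameHash(const std::string &Name) {
  uint64_t H = 1469598103934665603ULL;
  for (unsigned char C : Name) {
    H ^= C;
    H *= 1099511628211ULL;
  }
  return H;
}

// Sketch of the ProfileFuncRef idea: hold either a real name or a raw
// hash value, and make both forms hash identically so associative-
// container lookups match regardless of how the profile encodes names.
class FuncRef {
  std::string Name; // Empty when only the hash form is known.
  uint64_t Hash;

public:
  explicit FuncRef(const std::string &N) : Name(N), Hash(nameHash(N)) {}
  explicit FuncRef(uint64_t H) : Hash(H) {}
  // Same value whether constructed from the string or the hash form, so
  // no numerical-string conversion is ever needed for map lookups.
  uint64_t getHashCode() const { return Hash; }
  bool operator==(const FuncRef &O) const { return Hash == O.Hash; }
};
```

A profile read in MD5 form and IR symbols read in string form then land on the same key, avoiding both the heap allocations and the string-only matching failure the message describes.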
|
|
|
ProfileGenerator::getTopLevelFunctionProfile(FunctionId FuncName) {
|
2021-09-22 20:00:24 -07:00
|
|
|
SampleContext Context(FuncName);
|
2024-07-09 14:35:49 -07:00
|
|
|
return ProfileMap.create(Context);
|
2021-09-22 20:00:24 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
void ProfileGenerator::generateProfile() {
|
2022-03-23 12:36:44 -07:00
|
|
|
collectProfiledFunctions();
|
2022-03-30 12:27:10 -07:00
|
|
|
|
|
|
|
if (Binary->usePseudoProbes())
|
|
|
|
Binary->decodePseudoProbe();
|
|
|
|
|
|
|
|
if (SampleCounters) {
|
|
|
|
if (Binary->usePseudoProbes()) {
|
|
|
|
generateProbeBasedProfile();
|
|
|
|
} else {
|
|
|
|
generateLineNumBasedProfile();
|
|
|
|
}
|
2021-09-22 20:00:24 -07:00
|
|
|
}
|
2022-03-30 12:27:10 -07:00
|
|
|
|
2021-11-28 18:42:09 -08:00
|
|
|
postProcessProfiles();
|
|
|
|
}
|
|
|
|
|
|
|
|
void ProfileGenerator::postProcessProfiles() {
|
2022-06-23 20:14:47 -07:00
|
|
|
computeSummaryAndThreshold(ProfileMap);
|
2021-11-28 23:43:11 -08:00
|
|
|
trimColdProfiles(ProfileMap, ColdCountThreshold);
|
[llvm-profgen] Filter out ambiguous cold profiles during profile generation (#81803)
For the built-in local initialization functions (`__cxx_global_var_init`,
`__tls_init` prefixes), there can be multiple versions of the function
in the final binary. For example, `__cxx_global_var_init` is a wrapper for
global variable ctors, and the compiler may assign suffixes like
`__cxx_global_var_init.N` to the different ctors.
However, during profile generation we call `getCanonicalFnName` to
canonicalize the names, which strips the suffixes. Therefore, samples from
different functions query the same profile (just
`__cxx_global_var_init`) and the counts are merged. As the functions are
essentially different, the entries of the merged profile are ambiguous. In
sample loading, the IR of each version of the function is attributed to
the merged entries, which is inaccurate. This is especially bad for fuzzy
profile matching: the profile carries multiple callsites (from different
functions) that are used to match a single callsite, which misleads the
matching and reports a lot of false positives.
Hence, we want to filter them out of the profile map at profile
generation time. These profiles are all for cold functions, so this won't
have a perf impact.
2024-02-16 14:29:24 -08:00
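The filtering idea can be sketched on a simplified name-to-count map (the container and helper names below are stand-ins; the real pass operates on `SampleProfileMap` after `getCanonicalFnName` has already stripped the `.N` suffixes):

```cpp
#include <cstdint>
#include <map>
#include <string>

// After canonicalization, any profile whose name starts with one of the
// local-initialization wrapper prefixes may be a merge of several
// distinct functions, so it is treated as ambiguous.
inline bool isAmbiguousInitName(const std::string &Name) {
  return Name.rfind("__cxx_global_var_init", 0) == 0 ||
         Name.rfind("__tls_init", 0) == 0;
}

// Drop every ambiguous profile from the map in one pass.
inline void filterAmbiguous(std::map<std::string, uint64_t> &Profiles) {
  for (auto It = Profiles.begin(); It != Profiles.end();) {
    if (isAmbiguousInitName(It->first))
      It = Profiles.erase(It);
    else
      ++It;
  }
}
```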
|
|
|
filterAmbiguousProfile(ProfileMap);
|
2021-11-28 18:42:09 -08:00
|
|
|
calculateAndShowDensity(ProfileMap);
|
2021-09-22 20:00:24 -07:00
|
|
|
}
|
|
|
|
|
2021-11-28 23:43:11 -08:00
|
|
|
void ProfileGenerator::trimColdProfiles(const SampleProfileMap &Profiles,
|
|
|
|
uint64_t ColdCntThreshold) {
|
|
|
|
if (!TrimColdProfile)
|
|
|
|
return;
|
|
|
|
|
|
|
|
// Move cold profiles into a tmp container.
|
[llvm-profdata] Refactoring Sample Profile Reader to increase FDO build speed using MD5 as key to Sample Profile map
This is phase 1 of multiple planned improvements to the sample profile loader. The major change is to use the MD5 hash code (instead of the function name itself) as the key to look up the function offset table and the profiles, which significantly reduces the time it takes to construct the map.
The optimization is based on the fact that many practical sample profiles use MD5 values for function names to reduce profile size, so we shouldn't need to convert the MD5 to a string and then to a SampleContext and use that as the map's key, because doing so is extremely slow.
Several changes to note:
(1) For a non-CS SampleContext, if it is already an MD5 string, the hash value will be its integral value, instead of hashing the MD5 again. In phase 2 this is going to be optimized further using a union to represent MD5 functions (without converting them to strings) and regular function names.
(2) SampleProfileMap is a wrapper around a map<uint64_t, FunctionSamples> that provides an interface allowing SampleContext to be used as the key, so existing code still works. It checks for MD5 collisions (unlikely but not too unlikely, since we only take the lower 64 bits) and handles them to at least guarantee compilation correctness (a conflicting old profile is dropped, instead of returning an old profile with an inconsistent context). Other code should not use MD5 as the key to access the map directly, because it would not be able to handle MD5 collisions at all (see the exception at (5)).
(3) Any SampleProfileMap::emplace() followed by a SampleContext assignment on new insertion should be replaced with SampleProfileMap::create(), which does the same thing.
(4) Previously we ensured the invariant that in a SampleProfileMap the key equals the Context of the value, for profile maps eventually used for output (as in llvm-profdata/llvm-profgen). Since the key became an MD5 hash, only the value keeps the context now; in several places where an intermediate SampleProfileMap is created, each new FunctionSamples' context is set immediately after insertion, which is necessary to "remember" the otherwise irretrievable context.
(5) When reading a profile, we cache the MD5 values of all functions, because they are used at least twice (once to index into FuncOffsetTable, once into SampleProfileMap, and more if there are additional sections); in this case the SampleProfileMap is directly accessed with the MD5 value so that we don't recalculate it each time (expensive).
Performance impact:
When reading a ~1GB extbinary profile (fixed-length MD5, not compressed) with 10 million function names and 2.5 million top-level functions (non-CS functions, each with a nesting level varying from 0 to 20), this patch improves the function offset table loading time by 20%, and full profile reading by 5%.
2023-08-01 21:37:29 +00:00
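The hash-keyed wrapper idea from items (2)-(4) can be sketched as follows (placeholder hash function and stub profile type; the real `SampleProfileMap` hashes with MD5 and additionally guards against hash collisions, which this sketch omits):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Placeholder 64-bit FNV-1a standing in for the MD5 context hash.
inline uint64_t ctxHash(const std::string &Context) {
  uint64_t H = 1469598103934665603ULL;
  for (unsigned char C : Context) { H ^= C; H *= 1099511628211ULL; }
  return H;
}

struct ProfileStub {
  std::string Context; // Only the value remembers the context now.
  uint64_t TotalSamples = 0;
};

// Storage is keyed by the hash; the context-based interface is preserved
// on top of it, mirroring the SampleProfileMap wrapper.
class HashKeyedProfileMap {
  std::unordered_map<uint64_t, ProfileStub> Map;

public:
  // create(): emplace by hash and set the context immediately on new
  // insertion, replacing the old emplace-then-assign pattern.
  ProfileStub &create(const std::string &Context) {
    auto [It, Inserted] = Map.try_emplace(ctxHash(Context));
    if (Inserted)
      It->second.Context = Context;
    return It->second;
  }
  ProfileStub *find(const std::string &Context) {
    auto It = Map.find(ctxHash(Context));
    return It == Map.end() ? nullptr : &It->second;
  }
};
```

Lookups never materialize a string from an MD5 value; a caller holding only the hash can index the storage directly, which is the caching scheme described in item (5).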
|
|
|
std::vector<hash_code> ColdProfileHashes;
|
2021-11-28 23:43:11 -08:00
|
|
|
for (const auto &I : ProfileMap) {
|
|
|
|
if (I.second.getTotalSamples() < ColdCntThreshold)
|
2023-08-01 21:37:29 +00:00
|
|
|
ColdProfileHashes.emplace_back(I.first);
|
2021-11-28 23:43:11 -08:00
|
|
|
}
|
|
|
|
|
|
|
|
// Remove the cold profile from ProfileMap.
|
2023-08-01 21:37:29 +00:00
|
|
|
for (const auto &I : ColdProfileHashes)
|
2021-11-28 23:43:11 -08:00
|
|
|
ProfileMap.erase(I);
|
|
|
|
}
|
|
|
|
|
2021-09-22 20:00:24 -07:00
|
|
|
void ProfileGenerator::generateLineNumBasedProfile() {
|
2022-03-30 12:27:10 -07:00
|
|
|
assert(SampleCounters->size() == 1 &&
|
2021-09-22 20:00:24 -07:00
|
|
|
"Must have one entry for profile generation.");
|
2022-03-30 12:27:10 -07:00
|
|
|
const SampleCounter &SC = SampleCounters->begin()->second;
|
2021-09-22 20:00:24 -07:00
|
|
|
// Fill in function body samples
|
|
|
|
populateBodySamplesForAllFunctions(SC.RangeCounter);
|
|
|
|
// Fill in boundary sample counts as well as call site samples for calls
|
|
|
|
populateBoundarySamplesForAllFunctions(SC.BranchCounter);
|
2021-10-27 00:25:50 -07:00
|
|
|
|
2022-05-12 22:08:18 -07:00
|
|
|
updateFunctionSamples();
|
2021-09-22 20:00:24 -07:00
|
|
|
}
|
|
|
|
|
2022-03-01 18:43:53 -08:00
|
|
|
void ProfileGenerator::generateProbeBasedProfile() {
|
2022-03-30 12:27:10 -07:00
|
|
|
assert(SampleCounters->size() == 1 &&
|
2022-03-01 18:43:53 -08:00
|
|
|
"Must have one entry for profile generation.");
|
|
|
|
// Enable pseudo probe functionalities in SampleProf
|
|
|
|
FunctionSamples::ProfileIsProbeBased = true;
|
2022-03-30 12:27:10 -07:00
|
|
|
const SampleCounter &SC = SampleCounters->begin()->second;
|
2022-03-01 18:43:53 -08:00
|
|
|
// Fill in function body samples
|
|
|
|
populateBodySamplesWithProbesForAllFunctions(SC.RangeCounter);
|
|
|
|
// Fill in boundary sample counts as well as call site samples for calls
|
|
|
|
populateBoundarySamplesWithProbesForAllFunctions(SC.BranchCounter);
|
|
|
|
|
2022-05-12 22:08:18 -07:00
|
|
|
updateFunctionSamples();
|
2022-03-01 18:43:53 -08:00
|
|
|
}
|
|
|
|
|
|
|
|
void ProfileGenerator::populateBodySamplesWithProbesForAllFunctions(
|
|
|
|
const RangeSample &RangeCounter) {
|
|
|
|
ProbeCounterMap ProbeCounter;
|
2022-03-23 12:36:44 -07:00
|
|
|
// preprocessRangeCounter returns disjoint ranges, so there is no need to redo it
|
|
|
|
// inside extractProbesFromRange.
|
|
|
|
extractProbesFromRange(preprocessRangeCounter(RangeCounter), ProbeCounter,
|
|
|
|
false);
|
2022-03-01 18:43:53 -08:00
|
|
|
|
|
|
|
for (const auto &PI : ProbeCounter) {
|
|
|
|
const MCDecodedPseudoProbe *Probe = PI.first;
|
|
|
|
uint64_t Count = PI.second;
|
|
|
|
SampleContextFrameVector FrameVec;
|
|
|
|
Binary->getInlineContextForProbe(Probe, FrameVec, true);
|
2022-03-23 12:36:44 -07:00
|
|
|
FunctionSamples &FunctionProfile =
|
|
|
|
getLeafProfileAndAddTotalSamples(FrameVec, Count);
|
2023-04-10 11:06:27 -07:00
|
|
|
FunctionProfile.addBodySamples(Probe->getIndex(), Probe->getDiscriminator(),
|
|
|
|
Count);
|
2022-03-01 18:43:53 -08:00
|
|
|
if (Probe->isEntry())
|
|
|
|
FunctionProfile.addHeadSamples(Count);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void ProfileGenerator::populateBoundarySamplesWithProbesForAllFunctions(
|
|
|
|
const BranchSample &BranchCounters) {
|
|
|
|
for (const auto &Entry : BranchCounters) {
|
2022-10-13 20:42:51 -07:00
|
|
|
uint64_t SourceAddress = Entry.first.first;
|
|
|
|
uint64_t TargetAddress = Entry.first.second;
|
2022-03-01 18:43:53 -08:00
|
|
|
uint64_t Count = Entry.second;
|
|
|
|
assert(Count != 0 && "Unexpected zero weight branch");
|
|
|
|
|
2022-10-13 20:42:51 -07:00
|
|
|
StringRef CalleeName = getCalleeNameForAddress(TargetAddress);
|
2022-03-01 18:43:53 -08:00
|
|
|
if (CalleeName.size() == 0)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
const MCDecodedPseudoProbe *CallProbe =
|
|
|
|
Binary->getCallProbeForAddr(SourceAddress);
|
|
|
|
if (CallProbe == nullptr)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
// Record called target sample and its count.
|
|
|
|
SampleContextFrameVector FrameVec;
|
|
|
|
Binary->getInlineContextForProbe(CallProbe, FrameVec, true);
|
|
|
|
|
|
|
|
if (!FrameVec.empty()) {
|
|
|
|
FunctionSamples &FunctionProfile =
|
|
|
|
getLeafProfileAndAddTotalSamples(FrameVec, 0);
|
|
|
|
FunctionProfile.addCalledTargetSamples(
|
2023-04-10 11:06:27 -07:00
|
|
|
FrameVec.back().Location.LineOffset,
|
|
|
|
FrameVec.back().Location.Discriminator,
|
2023-10-17 17:09:39 -04:00
|
|
|
FunctionId(CalleeName), Count);
|
2022-03-01 18:43:53 -08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-12-02 16:51:42 -08:00
|
|
|
FunctionSamples &ProfileGenerator::getLeafProfileAndAddTotalSamples(
|
|
|
|
const SampleContextFrameVector &FrameVec, uint64_t Count) {
|
2021-09-22 20:00:24 -07:00
|
|
|
// Get top level profile
|
|
|
|
FunctionSamples *FunctionProfile =
|
2023-10-17 17:09:39 -04:00
|
|
|
&getTopLevelFunctionProfile(FrameVec[0].Func);
|
2021-12-02 16:51:42 -08:00
|
|
|
FunctionProfile->addTotalSamples(Count);
|
2022-03-01 18:43:53 -08:00
|
|
|
if (Binary->usePseudoProbes()) {
|
2022-03-23 12:36:44 -07:00
|
|
|
const auto *FuncDesc = Binary->getFuncDescForGUID(
|
2023-10-17 17:09:39 -04:00
|
|
|
FunctionProfile->getFunction().getHashCode());
|
2022-03-01 18:43:53 -08:00
|
|
|
FunctionProfile->setFunctionHash(FuncDesc->FuncHash);
|
|
|
|
}
|
2021-09-22 20:00:24 -07:00
|
|
|
|
|
|
|
for (size_t I = 1; I < FrameVec.size(); I++) {
|
2021-10-01 16:58:59 -07:00
|
|
|
LineLocation Callsite(
|
|
|
|
FrameVec[I - 1].Location.LineOffset,
|
|
|
|
getBaseDiscriminator(FrameVec[I - 1].Location.Discriminator));
|
2021-09-22 20:00:24 -07:00
|
|
|
FunctionSamplesMap &SamplesMap =
|
2021-10-01 16:58:59 -07:00
|
|
|
FunctionProfile->functionSamplesAt(Callsite);
|
2021-09-22 20:00:24 -07:00
|
|
|
auto Ret =
|
2023-10-17 17:09:39 -04:00
|
|
|
SamplesMap.emplace(FrameVec[I].Func, FunctionSamples());
|
2021-09-22 20:00:24 -07:00
|
|
|
if (Ret.second) {
|
2023-10-17 17:09:39 -04:00
|
|
|
SampleContext Context(FrameVec[I].Func);
|
2021-09-22 20:00:24 -07:00
|
|
|
Ret.first->second.setContext(Context);
|
|
|
|
}
|
|
|
|
FunctionProfile = &Ret.first->second;
|
2021-12-02 16:51:42 -08:00
|
|
|
FunctionProfile->addTotalSamples(Count);
|
2022-03-01 18:43:53 -08:00
|
|
|
if (Binary->usePseudoProbes()) {
|
2022-03-23 12:36:44 -07:00
|
|
|
const auto *FuncDesc = Binary->getFuncDescForGUID(
|
2023-10-17 17:09:39 -04:00
|
|
|
FunctionProfile->getFunction().getHashCode());
|
2022-03-01 18:43:53 -08:00
|
|
|
FunctionProfile->setFunctionHash(FuncDesc->FuncHash);
|
|
|
|
}
|
2021-09-22 20:00:24 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
return *FunctionProfile;
|
|
|
|
}
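The frame-walking logic of getLeafProfileAndAddTotalSamples can be sketched with stub types: the top-level profile gets the count, then each inline level is entered via its callsite map (keyed here by inlinee name only for brevity; the real code keys callsites by `LineLocation` and nests `FunctionSamples`):

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Stub for a nested profile node; a simplified stand-in for
// FunctionSamples with its CallsiteSamples map.
struct Node {
  uint64_t TotalSamples = 0;
  std::map<std::string, Node> CallsiteSamples;
};

// Walk the inline frame stack (outermost frame first, matching the
// FrameVec order) down to the leaf inlinee, adding the count to the
// total samples of every level on the way.
Node &addTotalAlongFrames(Node &TopLevel,
                          const std::vector<std::string> &Frames,
                          uint64_t Count) {
  Node *Cur = &TopLevel; // Frames[0] is the top-level function itself.
  Cur->TotalSamples += Count;
  for (size_t I = 1; I < Frames.size(); ++I) {
    Cur = &Cur->CallsiteSamples[Frames[I]]; // Create on first visit.
    Cur->TotalSamples += Count;
  }
  return *Cur; // Leaf profile, where body samples are then recorded.
}
```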
|
|
|
|
|
2021-09-23 22:53:12 -07:00
|
|
|
RangeSample
|
|
|
|
ProfileGenerator::preprocessRangeCounter(const RangeSample &RangeCounter) {
|
|
|
|
RangeSample Ranges(RangeCounter.begin(), RangeCounter.end());
|
2021-10-29 16:33:31 -07:00
|
|
|
if (FillZeroForAllFuncs) {
|
|
|
|
for (auto &FuncI : Binary->getAllBinaryFunctions()) {
|
|
|
|
for (auto &R : FuncI.second.Ranges) {
|
2021-11-04 20:51:04 -07:00
|
|
|
Ranges[{R.first, R.second - 1}] += 0;
|
2021-10-29 16:33:31 -07:00
|
|
|
}
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
// For each range, we search for all ranges of the function it belongs to
|
|
|
|
// and initialize them with a zero count, so a range remains zero if it doesn't hit any
|
|
|
|
// samples. This is consistent with the compiler, which interprets a zero count
|
|
|
|
// as unexecuted (cold).
|
2022-01-14 14:16:18 +00:00
|
|
|
for (const auto &I : RangeCounter) {
|
2022-10-13 20:42:51 -07:00
|
|
|
uint64_t StartAddress = I.first.first;
|
|
|
|
for (const auto &Range : Binary->getRanges(StartAddress))
|
2021-10-29 16:33:31 -07:00
|
|
|
Ranges[{Range.first, Range.second - 1}] += 0;
|
|
|
|
}
|
2021-09-23 22:53:12 -07:00
|
|
|
}
|
|
|
|
RangeSample DisjointRanges;
|
|
|
|
findDisjointRanges(DisjointRanges, Ranges);
|
|
|
|
return DisjointRanges;
|
|
|
|
}
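The disjoint-range computation that findDisjointRanges performs can be sketched as a boundary sweep: overlapping [begin, end] ranges are split at their boundaries, and each resulting piece gets the sum of the counts of all ranges covering it (a simplified illustration; the zero-count filling done by preprocessRangeCounter above is omitted here):

```cpp
#include <cstdint>
#include <map>
#include <utility>

using Range = std::pair<uint64_t, uint64_t>; // Inclusive [begin, end].

std::map<Range, uint64_t>
makeDisjoint(const std::map<Range, uint64_t> &Ranges) {
  // Record +count at each range begin and -count just past each end.
  std::map<uint64_t, int64_t> Events;
  for (const auto &[R, Count] : Ranges) {
    Events[R.first] += (int64_t)Count;
    Events[R.second + 1] -= (int64_t)Count;
  }
  // Sweep the sorted boundaries, emitting a piece for every interval
  // where the running count is positive.
  std::map<Range, uint64_t> Out;
  int64_t Cur = 0;
  uint64_t Begin = 0;
  for (const auto &[Addr, Delta] : Events) {
    if (Cur > 0)
      Out[{Begin, Addr - 1}] = (uint64_t)Cur;
    Cur += Delta;
    Begin = Addr;
  }
  return Out;
}
```

For example, {[1,10]:5, [5,20]:3} splits into [1,4]:5, [5,10]:8, and [11,20]:3, so each address is covered by exactly one output range.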
|
|
|
|
|
2021-09-22 20:00:24 -07:00
|
|
|
void ProfileGenerator::populateBodySamplesForAllFunctions(
|
|
|
|
const RangeSample &RangeCounter) {
|
2022-01-14 14:16:18 +00:00
|
|
|
for (const auto &Range : preprocessRangeCounter(RangeCounter)) {
|
2022-10-13 20:42:51 -07:00
|
|
|
uint64_t RangeBegin = Range.first.first;
|
|
|
|
uint64_t RangeEnd = Range.first.second;
|
2021-09-22 20:00:24 -07:00
|
|
|
uint64_t Count = Range.second;
|
|
|
|
|
|
|
|
InstructionPointer IP(Binary, RangeBegin, true);
|
|
|
|
// Disjoint ranges may yield a range lying between two instructions,
|
|
|
|
// e.g. if Instr1 is at Addr1 and Instr2 at Addr2, a disjoint range
|
|
|
|
// can be Addr1+1 to Addr2-1. We should ignore such ranges.
|
2021-11-04 20:51:04 -07:00
|
|
|
if (IP.Address > RangeEnd)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
do {
|
2022-10-25 21:07:55 -07:00
|
|
|
const SampleContextFrameVector FrameVec =
|
2022-10-13 20:42:51 -07:00
|
|
|
Binary->getFrameLocationStack(IP.Address);
|
2021-09-22 20:00:24 -07:00
|
|
|
if (!FrameVec.empty()) {
|
2021-12-02 16:51:42 -08:00
|
|
|
// FIXME: As accumulating total count per instruction caused some
|
|
|
|
// regression, we changed to accumulate total count per byte as a
|
|
|
|
// workaround. Tuning the hotness threshold on the compiler side might be
|
|
|
|
// necessary in the future.
|
|
|
|
FunctionSamples &FunctionProfile = getLeafProfileAndAddTotalSamples(
|
2022-10-13 20:42:51 -07:00
|
|
|
FrameVec, Count * Binary->getInstSize(IP.Address));
|
2021-09-22 20:00:24 -07:00
|
|
|
updateBodySamplesforFunctionProfile(FunctionProfile, FrameVec.back(),
|
2021-10-18 17:44:45 -07:00
|
|
|
Count);
|
2021-09-22 20:00:24 -07:00
|
|
|
}
|
2021-11-04 20:51:04 -07:00
|
|
|
} while (IP.advance() && IP.Address <= RangeEnd);
|
2021-09-22 20:00:24 -07:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2022-10-13 20:42:51 -07:00
|
|
|
StringRef
|
|
|
|
ProfileGeneratorBase::getCalleeNameForAddress(uint64_t TargetAddress) {
|
2021-10-26 14:55:33 -07:00
|
|
|
// Get the function range by branch target if it's a call branch.
|
2022-10-13 20:42:51 -07:00
|
|
|
auto *FRange = Binary->findFuncRangeForStartAddr(TargetAddress);
|
2021-09-16 00:31:57 -07:00
|
|
|
|
2021-10-26 14:55:33 -07:00
|
|
|
// We won't accumulate the sample count for a range whose start is not the real
|
|
|
|
// function entry, such as an outlined function or inner labels.
|
|
|
|
if (!FRange || !FRange->IsFuncEntry)
|
2021-09-16 00:31:57 -07:00
|
|
|
return StringRef();
|
|
|
|
|
2021-10-26 14:55:33 -07:00
|
|
|
return FunctionSamples::getCanonicalFnName(FRange->getFuncName());
|
2021-09-16 00:31:57 -07:00
|
|
|
}

void ProfileGenerator::populateBoundarySamplesForAllFunctions(
    const BranchSample &BranchCounters) {
  for (const auto &Entry : BranchCounters) {
    uint64_t SourceAddress = Entry.first.first;
    uint64_t TargetAddress = Entry.first.second;
    uint64_t Count = Entry.second;
    assert(Count != 0 && "Unexpected zero weight branch");

    StringRef CalleeName = getCalleeNameForAddress(TargetAddress);
    if (CalleeName.size() == 0)
      continue;

    // Record called target sample and its count.
    const SampleContextFrameVector &FrameVec =
        Binary->getCachedFrameLocationStack(SourceAddress);
    if (!FrameVec.empty()) {
      FunctionSamples &FunctionProfile =
          getLeafProfileAndAddTotalSamples(FrameVec, 0);
      FunctionProfile.addCalledTargetSamples(
          FrameVec.back().Location.LineOffset,
          getBaseDiscriminator(FrameVec.back().Location.Discriminator),
          FunctionId(CalleeName), Count);
    }
    // Add head samples for callee.
    FunctionSamples &CalleeProfile =
        getTopLevelFunctionProfile(FunctionId(CalleeName));
    CalleeProfile.addHeadSamples(Count);
  }
}
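The core of the loop above — turning per-branch counters into per-callee head samples — can be sketched without the LLVM types. The `Resolver` map stands in for `getCalleeNameForAddress` and is an assumption for illustration only:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Branch counters keyed by (source, target), as in BranchSample.
using BranchSample = std::map<std::pair<uint64_t, uint64_t>, uint64_t>;

// Hypothetical target-address -> callee-name resolver; an empty string
// means the target is not a real function entry.
using Resolver = std::map<uint64_t, std::string>;

// Accumulate head samples per callee from call-branch counters.
std::map<std::string, uint64_t>
collectHeadSamples(const BranchSample &Branches, const Resolver &CalleeAt) {
  std::map<std::string, uint64_t> HeadSamples;
  for (const auto &Entry : Branches) {
    uint64_t Target = Entry.first.second;
    auto It = CalleeAt.find(Target);
    if (It == CalleeAt.end() || It->second.empty())
      continue; // Not a call into a real function entry.
    HeadSamples[It->second] += Entry.second;
  }
  return HeadSamples;
}
```

Multiple call sites branching to the same entry simply sum into one head-sample count, matching `CalleeProfile.addHeadSamples(Count)` being called once per branch entry.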

void ProfileGeneratorBase::calculateBodySamplesAndSize(
    const FunctionSamples &FSamples, uint64_t &TotalBodySamples,
    uint64_t &FuncBodySize) {
  // Note that ideally the size should be the number of function instructions.
  // However, for a probe-based profile we don't have an accurate instruction
  // count for each probe; instead, the probe sample is the sample count for
  // the block, which is equivalent to
  // total_instruction_samples/num_of_instructions in one block. Hence, we use
  // the number of probes as a proxy for the function's size.
  FuncBodySize += FSamples.getBodySamples().size();

  // The accumulated body samples re-calculated here could differ from the
  // TotalSamples (getTotalSamples) field of FunctionSamples for a line-number
  // based profile. The reason is that TotalSamples is the sum of all the
  // samples of the machine instructions in one source-code line, whereas the
  // entry of BodySamples is only the max of them, so TotalSamples is usually
  // much bigger than the accumulated body samples, as one source-code line
  // can emit many machine instructions. We observed a regression when we
  // switched to using the accumulated body samples (by using
  // -update-total-samples). Hence, it's safer to re-calculate here to avoid
  // such a discrepancy. There is no problem for a probe-based profile, as
  // TotalSamples is exactly the same as the accumulated body samples.
  for (const auto &I : FSamples.getBodySamples())
    TotalBodySamples += I.second.getSamples();

  for (const auto &CallsiteSamples : FSamples.getCallsiteSamples())
    for (const auto &Callee : CallsiteSamples.second) {
      // For binary-level density, the inlinees' samples and size should be
      // included in the calculation.
      calculateBodySamplesAndSize(Callee.second, TotalBodySamples,
                                  FuncBodySize);
    }
}
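The sum-vs-max distinction the comment describes is easy to miss, so here is a minimal numeric sketch (hypothetical data, not the LLVM API): for each source line, `TotalSamples` sums the samples of every machine instruction on that line, while a `BodySamples` entry keeps only the maximum.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Per source line, the samples of each machine instruction on that line.
using LineInstSamples = std::map<int, std::vector<uint64_t>>;

// TotalSamples-style accumulation: sum over every instruction.
uint64_t totalSamples(const LineInstSamples &Lines) {
  uint64_t Sum = 0;
  for (const auto &Line : Lines)
    for (uint64_t S : Line.second)
      Sum += S;
  return Sum;
}

// BodySamples-style accumulation: each line contributes only its max.
uint64_t accumulatedBodySamples(const LineInstSamples &Lines) {
  uint64_t Sum = 0;
  for (const auto &Line : Lines) {
    uint64_t Max = 0;
    for (uint64_t S : Line.second)
      Max = std::max(Max, S);
    Sum += Max;
  }
  return Sum;
}
```

A line emitting three instructions with ten samples each contributes 30 to the first metric but only 10 to the second, which is why the two totals diverge for line-number based profiles.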

// Calculate profile density:
// Calculate the density for each function and sort them in descending order,
// then keep accumulating their total samples until the sum exceeds the
// percentage_threshold (cut-off) of total profile samples. The profile density
// is the last (minimum) function density of the processed functions, which
// means all the functions hot to perf are at a good density if the profile
// density is good. The percentage_threshold (--profile-density-cutoff-hot) is
// configurable depending on how much regression the system wants to tolerate.
double
ProfileGeneratorBase::calculateDensity(const SampleProfileMap &Profiles) {
  double ProfileDensity = 0.0;

  uint64_t TotalProfileSamples = 0;
  // A list of the function profile density and its total samples.
  std::vector<std::pair<double, uint64_t>> FuncDensityList;
  for (const auto &I : Profiles) {
    uint64_t TotalBodySamples = 0;
    uint64_t FuncBodySize = 0;
    calculateBodySamplesAndSize(I.second, TotalBodySamples, FuncBodySize);

    if (FuncBodySize == 0)
      continue;

    double FuncDensity = static_cast<double>(TotalBodySamples) / FuncBodySize;
    TotalProfileSamples += TotalBodySamples;
    FuncDensityList.emplace_back(FuncDensity, TotalBodySamples);
  }

  // Sort by density in descending order.
  llvm::stable_sort(FuncDensityList, [&](const std::pair<double, uint64_t> &A,
                                         const std::pair<double, uint64_t> &B) {
    if (A.first != B.first)
      return A.first > B.first;
    return A.second < B.second;
  });

  uint64_t AccumulatedSamples = 0;
  uint32_t I = 0;
  assert(ProfileDensityCutOffHot <= 1000000 &&
         "The cutoff value is greater than 1000000(100%)");
  while (AccumulatedSamples < TotalProfileSamples *
                                  static_cast<float>(ProfileDensityCutOffHot) /
                                  1000000 &&
         I < FuncDensityList.size()) {
    AccumulatedSamples += FuncDensityList[I].second;
    ProfileDensity = FuncDensityList[I].first;
    I++;
  }

  return ProfileDensity;
}
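The cut-off logic above can be exercised in isolation. This is a simplified sketch of the same algorithm over plain `(density, samples)` pairs, with the cutoff expressed in parts per million as in `ProfileDensityCutOffHot`; it is not the LLVM implementation itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Given (density, total samples) per function, return the minimum density
// among the densest functions that together cover CutOff/1000000 of all
// profile samples.
double profileDensity(std::vector<std::pair<double, uint64_t>> Funcs,
                      uint64_t CutOff /* parts per million */) {
  uint64_t Total = 0;
  for (const auto &F : Funcs)
    Total += F.second;
  // Densest functions first; break density ties by fewer samples.
  std::stable_sort(Funcs.begin(), Funcs.end(),
                   [](const std::pair<double, uint64_t> &A,
                      const std::pair<double, uint64_t> &B) {
                     if (A.first != B.first)
                       return A.first > B.first;
                     return A.second < B.second;
                   });
  double Density = 0.0;
  uint64_t Accumulated = 0;
  for (const auto &F : Funcs) {
    if (Accumulated >= Total * static_cast<double>(CutOff) / 1000000)
      break;
    Accumulated += F.second;
    Density = F.first; // Last function processed => minimum density so far.
  }
  return Density;
}
```

With three functions holding 50, 30, and 20 samples at densities 8.0, 2.0, and 0.5, a 50% cutoff is already covered by the densest function (density 8.0), while a 99% cutoff forces all three to be counted, so the reported density drops to 0.5.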

void ProfileGeneratorBase::calculateAndShowDensity(
    const SampleProfileMap &Profiles) {
  double Density = calculateDensity(Profiles);
  showDensitySuggestion(Density);
}

FunctionSamples *
CSProfileGenerator::getOrCreateFunctionSamples(ContextTrieNode *ContextNode,
                                               bool WasLeafInlined) {
  FunctionSamples *FProfile = ContextNode->getFunctionSamples();
  if (!FProfile) {
    FSamplesList.emplace_back();
    FProfile = &FSamplesList.back();
    FProfile->setFunction(ContextNode->getFuncName());
    ContextNode->setFunctionSamples(FProfile);
  }
  // Update ContextWasInlined attribute for existing contexts.
  // The current function can be called in two ways:
  //   - when processing a probe of the current frame
  //   - when processing the entry probe of an inlinee's frame, which
  //     is then used to update the callsite count of the current frame.
  // The two can happen in any order, hence here we are making sure
  // `ContextWasInlined` is always set as expected.
  // TODO: Note that the former does not always happen if no probes of the
  // current frame have samples, and if the latter happens, we could lose the
  // attribute. This should be fixed.
  if (WasLeafInlined)
    FProfile->getContext().setAttribute(ContextWasInlined);
  return FProfile;
}

ContextTrieNode *
CSProfileGenerator::getOrCreateContextNode(const SampleContextFrames Context,
                                           bool WasLeafInlined) {
  ContextTrieNode *ContextNode =
      ContextTracker.getOrCreateContextPath(Context, true);
  getOrCreateFunctionSamples(ContextNode, WasLeafInlined);
  return ContextNode;
}
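The context trie walked by `getOrCreateContextPath` can be pictured with a minimal stand-alone sketch. The `TrieNode` struct below is a loose, hypothetical model of a context trie keyed by frame name, not the real `ContextTrieNode` class (which also keys on callsite location):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// A minimal context trie: each node is one frame of the calling context.
struct TrieNode {
  std::string FuncName;
  uint64_t Samples = 0;
  std::map<std::string, TrieNode> Children;
};

// Walk the context (root-to-leaf frame names), creating nodes on demand,
// in the spirit of getOrCreateContextPath.
TrieNode &getOrCreateContextNode(TrieNode &Root,
                                 const std::vector<std::string> &Context) {
  TrieNode *Node = &Root;
  for (const std::string &Frame : Context) {
    TrieNode &Child = Node->Children[Frame]; // Created if absent.
    Child.FuncName = Frame;
    Node = &Child;
  }
  return *Node;
}
```

Because lookups for the same context always resolve to the same node, counters aggregated for `main @ foo @ bar` land in one place no matter how many samples carry that context.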

void CSProfileGenerator::generateProfile() {
  FunctionSamples::ProfileIsCS = true;

  collectProfiledFunctions();

  if (Binary->usePseudoProbes()) {
    Binary->decodePseudoProbe();
    if (InferMissingFrames)
      initializeMissingFrameInferrer();
  }

  if (SampleCounters) {
    if (Binary->usePseudoProbes()) {
      generateProbeBasedProfile();
    } else {
      generateLineNumBasedProfile();
    }
  }

  if (Binary->getTrackFuncContextSize())
    computeSizeForProfiledFunctions();

  postProcessProfiles();
}

void CSProfileGenerator::initializeMissingFrameInferrer() {
  Binary->getMissingContextInferrer()->initialize(SampleCounters);
}

void CSProfileGenerator::inferMissingFrames(
    const SmallVectorImpl<uint64_t> &Context,
    SmallVectorImpl<uint64_t> &NewContext) {
  Binary->inferMissingFrames(Context, NewContext);
}
2021-09-24 18:16:36 -07:00
|
|
|
void CSProfileGenerator::computeSizeForProfiledFunctions() {
|
2022-03-23 12:36:44 -07:00
|
|
|
for (auto *Func : Binary->getProfiledFunctions())
|
2022-01-28 15:53:37 -08:00
|
|
|
Binary->computeInlinedContextSizeForFunc(Func);
|
|
|
|
|
|
|
|
// Flush the symbolizer to save memory.
|
|
|
|
Binary->flushSymbolizer();
|
2021-09-24 18:16:36 -07:00
|
|
|
}
|
|
|
|
|
2022-06-27 22:57:22 -07:00
|
|
|
void CSProfileGenerator::updateFunctionSamples() {
|
[CSSPGO][llvm-profgen] Reimplement SampleContextTracker using context trie
This is the followup patch to https://reviews.llvm.org/D125246 for the `SampleContextTracker` part. Before the promotion and merging of the context is based on the SampleContext(the array of frame), this causes a lot of cost to the memory. This patch detaches the tracker from using the array ref instead to use the context trie itself. This can save a lot of memory usage and benefit both the compiler's CS inliner and llvm-profgen's pre-inliner.
One structure needs to be specially treated is the `FuncToCtxtProfiles`, this is used to get all the functionSamples for one function to do the merging and promoting. Before it search each functions' context and traverse the trie to get the node of the context. Now we don't have the context inside the profile, instead we directly use an auxiliary map `ProfileToNodeMap` for profile , it initialize to create the FunctionSamples to TrieNode relations and keep updating it during promoting and merging the node.
Moreover, I was expecting the results before and after remain the same, but I found that the order of FuncToCtxtProfiles matter and affect the results. This can happen on recursive context case, but the difference should be small. Now we don't have the context, so I just used a vector for the order, the result is still deterministic.
Measured on one huge size(12GB) profile from one of our internal service. The profile similarity difference is 99.999%, and the running time is improved by 3X(debug mode) and the memory is reduced from 170GB to 90GB.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D127031
2022-06-27 23:00:05 -07:00
|
|
|
for (auto *Node : ContextTracker) {
|
2022-06-27 22:57:22 -07:00
|
|
|
FunctionSamples *FSamples = Node->getFunctionSamples();
|
|
|
|
if (FSamples) {
|
|
|
|
if (UpdateTotalSamples)
|
|
|
|
FSamples->updateTotalSamples();
|
|
|
|
FSamples->updateCallsiteSamples();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-09-22 20:00:24 -07:00
|
|
|
void CSProfileGenerator::generateLineNumBasedProfile() {
|
2022-03-30 12:27:10 -07:00
|
|
|
for (const auto &CI : *SampleCounters) {
|
2022-01-14 14:10:07 +00:00
|
|
|
const auto *CtxKey = cast<StringBasedCtxKey>(CI.first.getPtr());
|
|
|
|
|
2022-06-27 22:57:22 -07:00
|
|
|
ContextTrieNode *ContextNode = &getRootContext();
|
[llvm-profgen] Decouple artificial branch from LBR parser and fix external address related issues
This patch is fixing two issues for both CS and non-CS.
1) For external-call-internal, the head samples of the the internal function should be recorded.
2) avoid ignoring LBR after meeting the interrupt branch for CS profile
LBR parser is shared between CS and non-CS, we found it's error-prone while dealing with artificial branch inside LBR parser. Since artificial branch is mainly used for CS profile unwinding, this patch tries to simplify LBR parser by decoupling artificial branch code from it, the concept of artificial branch is removed and split into two transitional branches(internal-to-external, external-to-internal). Then we leave all the processing of external branch to unwinder.
Specifically for unwinder, remembering that we introduce external frame in https://reviews.llvm.org/D115550. We can just take external address as a regular address and reuse current unwind function(unwindCall, unwindReturn). For a normal case, the external frame will match an external LBR, and it will be filtered out by `unwindLinear` without losing any context.
The data also shows that the interrupt or standalone LBR pattern(unpaired case) does exist, we choose to handle it by clearing the call stack and keeping unwinding. Here we leverage checking in `unwindLinear`, because a standalone LBR, no matter its type, since it doesn’t have other part to pair, it will eventually cause a wrong linear range, like [external, internal], [internal, external]. Then set the state to invalid there.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D118177
2022-04-24 12:07:54 -07:00
|
|
|
// Sample context will be empty if the jump is an external-to-internal call
|
|
|
|
// pattern, the head samples should be added for the internal function.
|
|
|
|
if (!CtxKey->Context.empty()) {
|
|
|
|
// Get or create function profile for the range
|
2022-06-27 22:57:22 -07:00
|
|
|
ContextNode =
|
|
|
|
getOrCreateContextNode(CtxKey->Context, CtxKey->WasLeafInlined);
|
[llvm-profgen] Decouple artificial branch from LBR parser and fix external address related issues
This patch is fixing two issues for both CS and non-CS.
1) For external-call-internal, the head samples of the the internal function should be recorded.
2) avoid ignoring LBR after meeting the interrupt branch for CS profile
LBR parser is shared between CS and non-CS, we found it's error-prone while dealing with artificial branch inside LBR parser. Since artificial branch is mainly used for CS profile unwinding, this patch tries to simplify LBR parser by decoupling artificial branch code from it, the concept of artificial branch is removed and split into two transitional branches(internal-to-external, external-to-internal). Then we leave all the processing of external branch to unwinder.
Specifically for unwinder, remembering that we introduce external frame in https://reviews.llvm.org/D115550. We can just take external address as a regular address and reuse current unwind function(unwindCall, unwindReturn). For a normal case, the external frame will match an external LBR, and it will be filtered out by `unwindLinear` without losing any context.
The data also shows that the interrupt or standalone LBR pattern(unpaired case) does exist, we choose to handle it by clearing the call stack and keeping unwinding. Here we leverage checking in `unwindLinear`, because a standalone LBR, no matter its type, since it doesn’t have other part to pair, it will eventually cause a wrong linear range, like [external, internal], [internal, external]. Then set the state to invalid there.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D118177
2022-04-24 12:07:54 -07:00
|
|
|
// Fill in function body samples
|
2022-06-27 22:57:22 -07:00
|
|
|
populateBodySamplesForFunction(*ContextNode->getFunctionSamples(),
|
|
|
|
CI.second.RangeCounter);
|
[llvm-profgen] Decouple artificial branch from LBR parser and fix external address related issues
This patch is fixing two issues for both CS and non-CS.
1) For external-call-internal, the head samples of the the internal function should be recorded.
2) avoid ignoring LBR after meeting the interrupt branch for CS profile
LBR parser is shared between CS and non-CS, we found it's error-prone while dealing with artificial branch inside LBR parser. Since artificial branch is mainly used for CS profile unwinding, this patch tries to simplify LBR parser by decoupling artificial branch code from it, the concept of artificial branch is removed and split into two transitional branches(internal-to-external, external-to-internal). Then we leave all the processing of external branch to unwinder.
Specifically for unwinder, remembering that we introduce external frame in https://reviews.llvm.org/D115550. We can just take external address as a regular address and reuse current unwind function(unwindCall, unwindReturn). For a normal case, the external frame will match an external LBR, and it will be filtered out by `unwindLinear` without losing any context.
The data also shows that the interrupt or standalone LBR pattern(unpaired case) does exist, we choose to handle it by clearing the call stack and keeping unwinding. Here we leverage checking in `unwindLinear`, because a standalone LBR, no matter its type, since it doesn’t have other part to pair, it will eventually cause a wrong linear range, like [external, internal], [internal, external]. Then set the state to invalid there.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D118177
2022-04-24 12:07:54 -07:00
|
|
|
}
|
2021-08-11 18:01:37 -07:00
|
|
|
// Fill in boundary sample counts as well as call site samples for calls
|
2022-06-27 22:57:22 -07:00
|
|
|
populateBoundarySamplesForFunction(ContextNode, CI.second.BranchCounter);
|
[CSSPGO] Load context profile for external functions in PreLink and populate ThinLTO import list
For ThinLTO's prelink compilation, we need to put external inline candidates into an import list attached to function's entry count metadata. This enables ThinLink to treat such cross module callee as hot in summary index, and later helps postlink to import them for profile guided cross module inlining.
For AutoFDO, the import list is retrieved by traversing the nested inlinee functions. For CSSPGO, since profile is flatterned, a few things need to happen for it to work:
- When loading input profile in extended binary format, we need to load all child context profile whose parent is in current module, so context trie for current module includes potential cross module inlinee.
- In order to make the above happen, we need to know whether input profile is CSSPGO profile before start reading function profile, hence a flag for profile summary section is added.
- When searching for cross module inline candidate, we need to walk through the context trie instead of nested inlinee profile (callsite sample of AutoFDO profile).
- Now that we have more accurate counts with CSSPGO, we swtiched to use entry count instead of total count to decided if an external callee is potentially beneficial to inline. This make it consistent with how we determine whether call tagert is potential inline candidate.
Differential Revision: https://reviews.llvm.org/D98590
2021-03-13 13:55:28 -08:00
|
|
|
  }

  // Fill in call site value sample for inlined calls and also use context to
  // infer missing samples. Since we don't have call count for inlined
  // functions, we estimate it from inlinee's profile using the entry of the
  // body sample.
  populateInferredFunctionSamples(getRootContext());

  updateFunctionSamples();
}
[CSSPGO][llvm-profgen] Context-sensitive profile data generation
This stack of changes introduces the `llvm-profgen` utility, which generates a profile data file from given perf script data files for sample-based PGO. It is part of (though not limited to) the CSSPGO work. Specifically, to support context-sensitive profiles with and without pseudo probes, it implements a series of functionalities including perf trace parsing, instruction symbolization, LBR stack/call frame stack unwinding, pseudo probe decoding, etc. High throughput is achieved through multiple levels of sample aggregation, and a compatible one-stop format is generated at the end. Please refer to https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s for the CSSPGO RFC.
This change adds context-sensitive profile data generation to llvm-profgen. With simultaneous sampling of the LBR and the call stack, we can identify the leaf of an LBR sample with the calling context from the stack sample. While deriving fall-through paths from LBR entries, we unwind the LBR by replaying all calls and returns (including implicit calls/returns due to inlining) backwards on top of the sampled call stack. The state of the call stack as we unwind through the LBR then always represents the calling context of the current fall-through path.
We have two types of virtual unwinding: 1) LBR unwinding and 2) linear range unwinding.
Specifically, each LBR entry can be classified as a call, a return, or a regular branch. LBR unwinding replays the operation by pushing, popping, or switching the leaf frame of the call stack. Since the initial call stack is the most recently sampled one, the replay runs in anti-execution order, i.e. for the regular case we pop the call stack when the LBR entry is a call and push a frame when it is a return. After each LBR entry is processed, we also need to align with the next one by walking the instructions from the previous LBR's target to the current LBR's source, which we call linear unwinding. Since instructions in a linear range can come from different functions due to inlining, linear unwinding splits the range and records counters per sub-range with the same inline context.
With each fall-through path from LBR unwinding, we aggregate the samples into counters by calling context and eventually generate a full context-sensitive profile (without relying on inlining) to drive the compiler's PGO/FDO.
A breakdown of noteworthy changes:
- Added a `HybridSample` class as the abstraction of a perf sample that includes both an LBR stack and a call stack.
- Extended `PerfReader` to auto-detect whether the input perf script output contains a CS profile, then parse it; multiple `HybridSample`s are extracted.
- Sped up processing by aggregating `HybridSample`s into `AggregatedSamples`.
- Added a `VirtualUnwinder` that consumes aggregated `HybridSample`s and implements unwinding of calls, returns, and linear paths containing implicit calls/returns from inlining. Range and branch counters are aggregated by calling context. Here the calling context is a string; each frame is a pair of function name and callsite location, so a whole context looks like `main:1 @ foo:2 @ bar`.
- Added a `ProfileGenerator` that accumulates counters by range unfolding or branch target mapping, then generates context-sensitive function profiles including function body samples, inferred callee head samples, and callsite target samples, and finally records them into the ProfileMap.
- Leveraged LLVM's built-in writer (`SampleProfWriter`) to support different serialization formats in one stop.
- Used `getCanonicalFnName` for callee names and names from the ELF section.
- Added regression tests for both unwinding and profile generation.
Test Plan:
ninja & ninja check-llvm
Reviewed By: hoy, wenlei, wmi
Differential Revision: https://reviews.llvm.org/D89723
2020-10-19 12:55:59 -07:00
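The anti-execution-order replay described above can be condensed into a small sketch. This is a simplified model, not the real `VirtualUnwinder` API; the `LBREntry`/`replayLBR` names and the string-based call stack are assumptions for illustration only.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified classification of an LBR entry, as the unwinder sees it.
enum class BranchKind { Call, Return, Regular };

struct LBREntry {
  BranchKind Kind;
  std::string Source; // function containing the branch source
  std::string Target; // function containing the branch target
};

// Replay LBR entries (newest first) in anti-execution order on top of the
// sampled call stack: a call is undone by popping the leaf frame (we are
// moving backwards past the point where that frame was entered), while a
// return is undone by pushing the frame we had returned from.
void replayLBR(std::vector<std::string> &CallStack,
               const std::vector<LBREntry> &LBRStack) {
  for (const LBREntry &E : LBRStack) {
    switch (E.Kind) {
    case BranchKind::Call:
      CallStack.pop_back(); // undo the call: leaf frame disappears
      break;
    case BranchKind::Return:
      CallStack.push_back(E.Source); // undo the return: re-enter the callee
      break;
    case BranchKind::Regular:
      break; // intra-function branch: leaf frame unchanged
    }
  }
}
```

For example, with a sampled stack `[main, foo]` and the newest LBR entry being the call into `foo`, the replay pops `foo`, leaving the context `[main]` for the fall-through path preceding the call.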

void CSProfileGenerator::populateBodySamplesForFunction(
    FunctionSamples &FunctionProfile, const RangeSample &RangeCounter) {
  // Compute disjoint ranges first, so we can use MAX
  // for calculating count for each location.
  RangeSample Ranges;
  findDisjointRanges(Ranges, RangeCounter);
  for (const auto &Range : Ranges) {
    uint64_t RangeBegin = Range.first.first;
    uint64_t RangeEnd = Range.first.second;
    uint64_t Count = Range.second;
    // Disjoint ranges may introduce zero-filled gaps that
    // don't belong to the current context; filter them out.
    if (Count == 0)
      continue;

    InstructionPointer IP(Binary, RangeBegin, true);
    // Disjoint ranges may have a range in the middle of two instructions,
    // e.g. if Instr1 is at Addr1 and Instr2 at Addr2, a disjoint range
    // can be Addr1+1 to Addr2-1. We should ignore such ranges.
    if (IP.Address > RangeEnd)
      continue;

    do {
      auto LeafLoc = Binary->getInlineLeafFrameLoc(IP.Address);
      if (LeafLoc) {
        // Recording body sample for this specific context
        updateBodySamplesforFunctionProfile(FunctionProfile, *LeafLoc, Count);
        FunctionProfile.addTotalSamples(Count);
      }
    } while (IP.advance() && IP.Address <= RangeEnd);
  }
}
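The range-splitting step that `findDisjointRanges` performs before the loop above can be illustrated with a sweep-line sketch: add `+Count` at a range's begin and `-Count` just past its end, then every interval between consecutive boundary points becomes a disjoint range carrying the running sum. This is a simplified illustration of the concept, not the real implementation, whose boundary handling may differ in detail.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

// Same shape as the RangeSample used above: (begin, end) -> count.
using RangeSample = std::map<std::pair<uint64_t, uint64_t>, uint64_t>;

void findDisjointRangesSketch(RangeSample &Disjoint,
                              const RangeSample &Ranges) {
  // +Count at Begin, -Count just past End.
  std::map<uint64_t, int64_t> Boundaries;
  for (const auto &R : Ranges) {
    Boundaries[R.first.first] += static_cast<int64_t>(R.second);
    Boundaries[R.first.second + 1] -= static_cast<int64_t>(R.second);
  }
  // Sweep: each interval between consecutive boundary points becomes a
  // disjoint range with the running sum (zero for gaps, which the caller
  // filters out, as the `if (Count == 0) continue;` above does).
  int64_t Running = 0;
  uint64_t PrevAddr = 0;
  bool First = true;
  for (const auto &B : Boundaries) {
    if (!First)
      Disjoint[{PrevAddr, B.first - 1}] += static_cast<uint64_t>(Running);
    Running += B.second;
    PrevAddr = B.first;
    First = false;
  }
}
```

For example, overlapping ranges [10, 20] with count 2 and [15, 25] with count 3 split into [10, 14] -> 2, [15, 20] -> 5, and [21, 25] -> 3.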

void CSProfileGenerator::populateBoundarySamplesForFunction(
    ContextTrieNode *Node, const BranchSample &BranchCounters) {
  for (const auto &Entry : BranchCounters) {
    uint64_t SourceAddress = Entry.first.first;
    uint64_t TargetAddress = Entry.first.second;
    uint64_t Count = Entry.second;
    assert(Count != 0 && "Unexpected zero weight branch");

    StringRef CalleeName = getCalleeNameForAddress(TargetAddress);
    if (CalleeName.size() == 0)
      continue;

    ContextTrieNode *CallerNode = Node;
    LineLocation CalleeCallSite(0, 0);
    if (CallerNode != &getRootContext()) {
[llvm-profgen] Decouple artificial branch from LBR parser and fix external address related issues
This patch fixes two issues for both CS and non-CS:
1) For external-call-internal, the head samples of the internal function should be recorded.
2) Avoid ignoring LBRs after meeting an interrupt branch for CS profiles.
The LBR parser is shared between CS and non-CS, and we found it error-prone to deal with artificial branches inside it. Since artificial branches are mainly used for CS profile unwinding, this patch simplifies the LBR parser by decoupling the artificial branch code from it: the concept of an artificial branch is removed and split into two transitional branches (internal-to-external, external-to-internal). All processing of external branches is then left to the unwinder.
Specifically for the unwinder, recall that we introduced the external frame in https://reviews.llvm.org/D115550. We can treat an external address as a regular address and reuse the current unwind functions (unwindCall, unwindReturn). In the normal case, the external frame will match an external LBR, and it will be filtered out by `unwindLinear` without losing any context.
The data also shows that the interrupt or standalone LBR pattern (the unpaired case) does exist; we handle it by clearing the call stack and continuing to unwind. Here we leverage the checking in `unwindLinear`: a standalone LBR, whatever its type, has no counterpart to pair with, so it will eventually produce a wrong linear range, like [external, internal] or [internal, external], and the state is set to invalid there.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D118177
2022-04-24 12:07:54 -07:00
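The invalid-range check that the message above leans on can be sketched as a small predicate. This is a hypothetical condensation of the idea, not the real `unwindLinear` code; the `ExternalAddr` sentinel and the function name are assumptions for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Assumed sentinel for an address outside the profiled binary.
constexpr uint64_t ExternalAddr = ~0ULL;

// A standalone (unpaired) LBR eventually yields a wrong linear range,
// like [external, internal], [internal, external], or a backwards range;
// the unwinder treats such a range as invalid.
bool isValidLinearRange(uint64_t Begin, uint64_t End) {
  if (Begin == ExternalAddr || End == ExternalAddr)
    return false; // range touches an external boundary
  return Begin <= End; // a backwards range indicates broken pairing
}
```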
      // Record called target sample and its count
      auto LeafLoc = Binary->getInlineLeafFrameLoc(SourceAddress);
      if (LeafLoc) {
        CallerNode->getFunctionSamples()->addCalledTargetSamples(
            LeafLoc->Location.LineOffset,
[llvm-profdata] Do not create numerical strings for MD5 function names read from a Sample Profile. (#66164)
This is phase 2 of the MD5 refactoring on Sample Profile, following https://reviews.llvm.org/D147740.
In the previous implementation, when an MD5 Sample Profile was read, the reader first converted the MD5 values to strings and then created a StringRef as if the numerical strings were regular function names; later, IPO transformation passes performed string comparisons over these numerical strings for profile matching. This is inefficient since it causes many small heap allocations.
In this patch I created a class `ProfileFuncRef` that is similar to `StringRef` but can represent a hash value directly without any conversion, and it will be more efficient (I will attach some benchmark results later) when used in associative containers.
`ProfileFuncRef` guarantees that the same function name in string form or in MD5 form has the same hash value, which also fixes a few issues in IPO passes where function matching/lookup only checked the function name string and returned a no-match if the profile was MD5.
When testing on an internal large profile (> 1 GB, with more than 10 million functions), the full profile load time is reduced from 28 sec to 25 sec on average, and reading the function offset table from 0.78 s to 0.7 s.
2023-10-17 17:09:39 -04:00
            getBaseDiscriminator(LeafLoc->Location.Discriminator),
            FunctionId(CalleeName),
            Count);

        // Record head sample for called target (callee)
        CalleeCallSite = LeafLoc->Location;
      }
    }

    ContextTrieNode *CalleeNode =
        CallerNode->getOrCreateChildContext(CalleeCallSite,
                                            FunctionId(CalleeName));
    FunctionSamples *CalleeProfile = getOrCreateFunctionSamples(CalleeNode);
    CalleeProfile->addHeadSamples(Count);
  }
}
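The `ProfileFuncRef` idea from the MD5 commit message above (hold either a name or its MD5 hash, and let both forms compare equal without building a numerical string) can be sketched as follows. This is a toy version: the class name, the `hashNameSketch` stand-in (the real reader uses an MD5 hash), and the member layout are assumptions, not the real llvm-profdata API.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>

// Stand-in for the MD5 hash of a function name; this sketch only needs
// a deterministic hash function, not MD5 itself.
inline uint64_t hashNameSketch(const std::string &S) {
  return static_cast<uint64_t>(std::hash<std::string>{}(S));
}

// Hold a name when available, otherwise only the hash; both
// representations of the same function hash and compare identically,
// so associative containers never need a numerical-string conversion.
class FuncRefSketch {
  std::string Name; // empty when constructed from a hash only
  uint64_t Hash;

public:
  explicit FuncRefSketch(const std::string &N)
      : Name(N), Hash(hashNameSketch(N)) {}
  explicit FuncRefSketch(uint64_t HashValue) : Hash(HashValue) {}

  uint64_t getHashCode() const { return Hash; } // no string conversion
  bool operator==(const FuncRefSketch &O) const { return Hash == O.Hash; }
};
```

The key property is the one the commit message states: a profile entry read in MD5 form and a function known by its string name land in the same bucket, so lookup no longer misses just because the profile is MD5.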

void CSProfileGenerator::populateInferredFunctionSamples(
    ContextTrieNode &Node) {
  // There is no call/jmp sample between the inliner and the inlinee, so we
  // need to use the inlinee's context to infer the inliner's context, i.e.
  // the parent's (inliner's) sample depends on the child's (inlinee's)
  // sample, so traverse the tree in post-order.
  for (auto &It : Node.getAllChildContext())
    populateInferredFunctionSamples(It.second);

  FunctionSamples *CalleeProfile = Node.getFunctionSamples();
  if (!CalleeProfile)
    return;
  // If we already have head sample counts, we must have value profile
  // for call sites added already. Skip to avoid double counting.
  if (CalleeProfile->getHeadSamples())
    return;
  ContextTrieNode *CallerNode = Node.getParentContext();
  // If we don't have context, there is nothing to do for the caller's call
  // site. This could happen for an entry point function.
  if (CallerNode == &getRootContext())
    return;
  LineLocation CallerLeafFrameLoc = Node.getCallSiteLoc();
  FunctionSamples &CallerProfile = *getOrCreateFunctionSamples(CallerNode);
  // Since we don't have call count for inlined functions, we
  // estimate it from inlinee's profile using entry body sample.
  uint64_t EstimatedCallCount = CalleeProfile->getHeadSamplesEstimate();
  // If we don't have samples with location, use 1 to indicate live.
  if (!EstimatedCallCount && !CalleeProfile->getBodySamples().size())
    EstimatedCallCount = 1;
  CallerProfile.addCalledTargetSamples(CallerLeafFrameLoc.LineOffset,
                                       CallerLeafFrameLoc.Discriminator,
                                       Node.getFuncName(), EstimatedCallCount);
  CallerProfile.addBodySamples(CallerLeafFrameLoc.LineOffset,
                               CallerLeafFrameLoc.Discriminator,
                               EstimatedCallCount);
  CallerProfile.addTotalSamples(EstimatedCallCount);
}
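The estimation rule used above is compact enough to restate as a standalone helper: take the inlinee's head-sample estimate as the call count, and fall back to 1 when the inlinee has neither head samples nor any located body samples, so the inlined call site is still marked live. The `estimateCallCount` name is a hypothetical extraction for this sketch, not a function in the real file.

```cpp
#include <cassert>
#include <cstdint>

// Condensed form of the EstimatedCallCount logic in
// populateInferredFunctionSamples above.
uint64_t estimateCallCount(uint64_t HeadSamplesEstimate,
                           uint64_t NumBodySamples) {
  // No head samples and no located body samples: use 1 to indicate live.
  if (!HeadSamplesEstimate && NumBodySamples == 0)
    return 1;
  return HeadSamplesEstimate;
}
```

Note the asymmetry: an inlinee with body samples but a zero head-sample estimate keeps a count of 0, since the samples prove it was observed without implying an extra entry.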
|
[CSSPGO][llvm-profgen] Context-sensitive profile data generation
This stack of changes introduces the `llvm-profgen` utility, which generates a profile data file from given perf script data files for sample-based PGO. It is part of (though not limited to) the CSSPGO work. Specifically, to support context-sensitive profiles with and without pseudo probes, it implements a series of functionalities including perf trace parsing, instruction symbolization, LBR stack/call frame stack unwinding, pseudo probe decoding, etc. High throughput is achieved by multiple levels of sample aggregation, and a compatible format is generated in one stop at the end. Please refer to https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s for the CSSPGO RFC.
This change adds context-sensitive profile data generation to llvm-profgen. With simultaneous sampling of the LBR and the call stack, we can identify the leaf of an LBR sample with the calling context from the stack sample. While deriving fall-through paths from LBR entries, we unwind the LBR by replaying all the calls and returns (including implicit calls/returns due to inlining) backwards on top of the sampled call stack. The state of the call stack as we unwind through the LBR then always represents the calling context of the current fall-through path.
We have two types of virtual unwinding: 1) LBR unwinding and 2) linear range unwinding.
Specifically, each LBR entry can be classified as a call, a return, or a regular branch, and LBR unwinding replays the operation by pushing, popping, or switching the leaf frame of the call stack. Since the initial call stack is the most recently sampled one, the replay must be in anti-execution order, i.e. for the regular case, pop the call stack when the LBR entry is a call and push a frame when it is a return. After each LBR entry is processed, we also need to align with the next entry by walking the instructions from the previous entry's target to the current entry's source, which we call linear unwinding. As instructions in a linear range can come from different functions due to inlining, linear unwinding splits the range and records counters for each sub-range with the same inline context.
With each fall-through path from LBR unwinding, we aggregate the samples into counters by calling context and eventually generate a full context-sensitive profile (without relying on inlining) to drive the compiler's PGO/FDO.
A breakdown of noteworthy changes:
- Added a `HybridSample` class as the abstraction of a perf sample that includes both an LBR stack and a call stack.
- Extended `PerfReader` to auto-detect whether the input perf script output contains CS profiles, then parse it; multiple `HybridSample`s are extracted.
- Sped up processing by aggregating `HybridSample`s into `AggregatedSamples`.
- Added a `VirtualUnwinder` that consumes aggregated `HybridSample`s and implements unwinding of calls, returns, and linear paths that contain implicit calls/returns from inlining. Range and branch counters are aggregated by calling context. Here the calling context is a string; each frame is a pair of function name and callsite location, and a whole context looks like `main:1 @ foo:2 @ bar`.
- Added a `ProfileGenerator` that accumulates counters by range unfolding or branch target mapping, then generates context-sensitive function profiles including function body samples, inferred callee head samples, and callsite target samples, eventually recording them into a `ProfileMap`.
- Leveraged LLVM's built-in writer (`SampleProfWriter`) to support different serialization formats in one stop.
- Used `getCanonicalFnName` for callee names and names from the ELF section.
- Added regression tests for both unwinding and profile generation.
Test Plan:
ninja & ninja check-llvm
Reviewed By: hoy, wenlei, wmi
Differential Revision: https://reviews.llvm.org/D89723
2020-10-19 12:55:59 -07:00
void CSProfileGenerator::convertToProfileMap(
    ContextTrieNode &Node, SampleContextFrameVector &Context) {
  FunctionSamples *FProfile = Node.getFunctionSamples();
  if (FProfile) {
    Context.emplace_back(Node.getFuncName(), LineLocation(0, 0));
    // Save the new context for future references.
    SampleContextFrames NewContext = *Contexts.insert(Context).first;
    auto Ret = ProfileMap.emplace(NewContext, std::move(*FProfile));
    FunctionSamples &NewProfile = Ret.first->second;
    NewProfile.getContext().setContext(NewContext);
    Context.pop_back();
  }
  for (auto &It : Node.getAllChildContext()) {
    ContextTrieNode &ChildNode = It.second;
    Context.emplace_back(Node.getFuncName(), ChildNode.getCallSiteLoc());
    convertToProfileMap(ChildNode, Context);
    Context.pop_back();
  }
}
void CSProfileGenerator::convertToProfileMap() {
  assert(ProfileMap.empty() &&
         "ProfileMap should be empty before converting from the trie");
  assert(IsProfileValidOnTrie &&
         "Do not convert the trie twice, it's already destroyed");

  SampleContextFrameVector Context;
  for (auto &It : getRootContext().getAllChildContext())
    convertToProfileMap(It.second, Context);

  IsProfileValidOnTrie = false;
}
[CSSPGO][llvm-profgen] Context-sensitive global pre-inliner
This change sets up a framework in llvm-profgen to estimate inline decisions and adjust context-sensitive profiles accordingly. We call it a global pre-inliner in llvm-profgen.
It serves two purposes:
1) Since the context profile of a context that is not inlined will be merged into the base profile, if we estimate that a context will not be inlined, we can merge the context profile in the output to save profile size.
2) For ThinLTO, when a context involving functions from different modules is not inlined, we can't merge function profiles across modules, leading to suboptimal post-inline count quality. By estimating some inline decisions, we can adjust/merge context profiles beforehand as a mitigation.
The compiler's inline heuristic uses inline cost, which is not available in llvm-profgen. But since inline cost is closely related to size, we can get an estimate from function sizes in debug info. Because the size we have in llvm-profgen is the final size, it could even be more accurate than the inline cost estimation in the compiler.
This change only has the framework, with a few TODOs left for follow-up patches to complete the implementation:
1) We need to retrieve function/inlinee sizes from debug info for inlining estimation. Currently we use the number of samples in a profile as a placeholder for size estimation.
2) The thresholds currently reuse the values of the sample loader inliner, but they need to be tuned since the size here is fully optimized machine code size instead of inline cost based on not-yet-fully-optimized IR.
Differential Revision: https://reviews.llvm.org/D99146
2021-03-05 07:50:36 -08:00
void CSProfileGenerator::postProcessProfiles() {
  // Compute hot/cold threshold based on profile. This will be used for cold
  // context profile merging/trimming.
  computeSummaryAndThreshold();

  // Run global pre-inliner to adjust/merge context profile based on estimated
  // inline decisions.
  if (EnableCSPreInliner) {
[CSSPGO][llvm-profgen] Reimplement SampleContextTracker using context trie
This is the follow-up patch to https://reviews.llvm.org/D125246 for the `SampleContextTracker` part. Previously, promotion and merging of contexts were based on the SampleContext (the array of frames), which was very costly in memory. This patch detaches the tracker from the array ref and makes it use the context trie itself, which saves a lot of memory and benefits both the compiler's CS inliner and llvm-profgen's pre-inliner.
One structure that needs special treatment is `FuncToCtxtProfiles`, which is used to get all the FunctionSamples of one function for merging and promotion. Previously it searched each function's context and traversed the trie to get the context's node. Now we don't have the context inside the profile; instead we directly use an auxiliary map `ProfileToNodeMap` for profiles, which is initialized with the FunctionSamples-to-TrieNode relations and kept updated while promoting and merging nodes.
Moreover, I was expecting the results before and after to remain the same, but I found that the order of FuncToCtxtProfiles matters and affects the results. This can happen with recursive contexts, but the difference should be small. Now that we don't have the context, I just used a vector for the order; the result is still deterministic.
Measured on one huge (12GB) profile from one of our internal services: the profile similarity is 99.999%, running time is improved by 3X (debug mode), and memory is reduced from 170GB to 90GB.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D127031
2022-06-27 23:00:05 -07:00
    ContextTracker.populateFuncToCtxtMap();
    CSPreInliner(ContextTracker, *Binary, Summary.get()).run();

    // Turn off the profile merger by default unless it is explicitly enabled.
    if (!CSProfMergeColdContext.getNumOccurrences())
      CSProfMergeColdContext = false;
  }
  convertToProfileMap();

  // Trim and merge cold context profile using cold threshold above.
  if (TrimColdProfile || CSProfMergeColdContext) {
    SampleContextTrimmer(ProfileMap)
        .trimAndMergeColdContextProfiles(
            HotCountThreshold, TrimColdProfile, CSProfMergeColdContext,
            CSProfMaxColdContextDepth, EnableCSPreInliner);
  }

  if (GenCSNestedProfile) {
    ProfileConverter CSConverter(ProfileMap);
    CSConverter.convertCSProfiles();
    FunctionSamples::ProfileIsCS = false;
  }
[llvm-profgen] Filter out ambiguous cold profiles during profile generation (#81803)
For the built-in local initialization functions (`__cxx_global_var_init`, `__tls_init` prefixes), there can be multiple versions of the function in the final binary. For example `__cxx_global_var_init`, which is a wrapper for global variable ctors, can be given suffixed names like `__cxx_global_var_init.N` by the compiler for different ctors.
However, during profile generation we call `getCanonicalFnName` to canonicalize the names, which strips the suffixes. Therefore, samples from different functions query the same profile (just `__cxx_global_var_init`) and the counts are merged. As the functions are essentially different, entries of the merged profile are ambiguous. In sample loading, for each version of the function, the IR of one version is attributed to the merged entries, which is inaccurate. Especially for fuzzy profile matching, it gets multiple callsites (from different functions) but uses them to match one callsite, which misleads the matching and reports a lot of false positives.
Hence, we want to filter them out of the profile map at profile generation time. These profiles are all cold functions, so it won't have a perf impact.
2024-02-16 14:29:24 -08:00
  filterAmbiguousProfile(ProfileMap);
  ProfileGeneratorBase::calculateAndShowDensity(ProfileMap);
}
void ProfileGeneratorBase::computeSummaryAndThreshold(
    SampleProfileMap &Profiles) {
  SampleProfileSummaryBuilder Builder(ProfileSummaryBuilder::DefaultCutoffs);
  Summary = Builder.computeSummaryForProfiles(Profiles);
  HotCountThreshold = ProfileSummaryBuilder::getHotCountThreshold(
      (Summary->getDetailedSummary()));
  ColdCountThreshold = ProfileSummaryBuilder::getColdCountThreshold(
      (Summary->getDetailedSummary()));
}
void CSProfileGenerator::computeSummaryAndThreshold() {
  // Always merge and use context-less profile map to compute summary.
  SampleProfileMap ContextLessProfiles;
  ContextTracker.createContextLessProfileMap(ContextLessProfiles);

  // Set the flag below to avoid merging the profile again in
  // computeSummaryAndThreshold
  FunctionSamples::ProfileIsCS = false;
  assert(
      (!UseContextLessSummary.getNumOccurrences() || UseContextLessSummary) &&
      "Don't set --profile-summary-contextless to false for profile "
      "generation");
  ProfileGeneratorBase::computeSummaryAndThreshold(ContextLessProfiles);
  // Recover the old value.
  FunctionSamples::ProfileIsCS = true;
}
void ProfileGeneratorBase::extractProbesFromRange(
    const RangeSample &RangeCounter, ProbeCounterMap &ProbeCounter,
    bool FindDisjointRanges) {
  const RangeSample *PRanges = &RangeCounter;
  RangeSample Ranges;
  if (FindDisjointRanges) {
    findDisjointRanges(Ranges, RangeCounter);
    PRanges = &Ranges;
  }

  for (const auto &Range : *PRanges) {
    uint64_t RangeBegin = Range.first.first;
    uint64_t RangeEnd = Range.first.second;
    uint64_t Count = Range.second;

    InstructionPointer IP(Binary, RangeBegin, true);
    // Disjoint ranges may produce a range that sits between two instructions,
    // e.g. if Instr1 is at Addr1 and Instr2 at Addr2, a disjoint range can be
    // Addr1+1 to Addr2-1. We should ignore such ranges.
    if (IP.Address > RangeEnd)
      continue;

    do {
      const AddressProbesMap &Address2ProbesMap =
          Binary->getAddress2ProbesMap();
      for (const MCDecodedPseudoProbe &Probe :
           Address2ProbesMap.find(IP.Address)) {
        ProbeCounter[&Probe] += Count;
      }
    } while (IP.advance() && IP.Address <= RangeEnd);
  }
}
static void extractPrefixContextStack(SampleContextFrameVector &ContextStack,
                                      const SmallVectorImpl<uint64_t> &AddrVec,
                                      ProfiledBinary *Binary) {
  SmallVector<const MCDecodedPseudoProbe *, 16> Probes;
  for (auto Address : reverse(AddrVec)) {
    const MCDecodedPseudoProbe *CallProbe =
        Binary->getCallProbeForAddr(Address);
    // These could be the cases when a probe is not found at a callsite.
    // Cutting off the context from here since the inliner will not know how
    // to consume a context with unknown callsites.
    // 1. for functions that are not sampled when
    //    --decode-probe-for-profiled-functions-only is on.
    // 2. for a merged callsite. Callsite merging may cause the loss of
    //    original probe IDs.
    // 3. for an external callsite.
    if (!CallProbe)
      break;
    Probes.push_back(CallProbe);
  }

  std::reverse(Probes.begin(), Probes.end());

  // Extract the context stack for reuse; the leaf context stack will be
  // compressed while looking up the function profile.
  for (const auto *P : Probes) {
    Binary->getInlineContextForProbe(P, ContextStack, true);
  }
}
void CSProfileGenerator::generateProbeBasedProfile() {
  // Enable pseudo probe functionalities in SampleProf
  FunctionSamples::ProfileIsProbeBased = true;
  for (const auto &CI : *SampleCounters) {
    const AddrBasedCtxKey *CtxKey =
        dyn_cast<AddrBasedCtxKey>(CI.first.getPtr());
    // Fill in function body samples from probes, also infer caller's samples
    // from callee's probe
    populateBodySamplesWithProbes(CI.second.RangeCounter, CtxKey);
    // Fill in boundary samples for a call probe
    populateBoundarySamplesWithProbes(CI.second.BranchCounter, CtxKey);
  }
}
void CSProfileGenerator::populateBodySamplesWithProbes(
    const RangeSample &RangeCounter, const AddrBasedCtxKey *CtxKey) {
  ProbeCounterMap ProbeCounter;
  // Extract the top frame probes by looking up each address among the range in
  // the Address2ProbeMap
  extractProbesFromRange(RangeCounter, ProbeCounter);
  std::unordered_map<MCDecodedPseudoProbeInlineTree *,
                     std::unordered_set<FunctionSamples *>>
      FrameSamples;
  for (const auto &PI : ProbeCounter) {
    const MCDecodedPseudoProbe *Probe = PI.first;
    uint64_t Count = PI.second;
    // Disjoint ranges may introduce zero-filled gaps that don't belong to
    // the current context; filter them out.
    if (!Probe->isBlock() || Count == 0)
      continue;

    ContextTrieNode *ContextNode = getContextNodeForLeafProbe(CtxKey, Probe);
    FunctionSamples &FunctionProfile = *ContextNode->getFunctionSamples();
    // Record the current frame and FunctionProfile whenever samples are
    // collected for non-dangling probes. This is for reporting all of the
    // zero count probes of the frame later.
    FrameSamples[Probe->getInlineTreeNode()].insert(&FunctionProfile);
    FunctionProfile.addBodySamples(Probe->getIndex(), Probe->getDiscriminator(),
                                   Count);
    FunctionProfile.addTotalSamples(Count);
    if (Probe->isEntry()) {
      FunctionProfile.addHeadSamples(Count);
      // Look up the caller's function profile
      const auto *InlinerDesc = Binary->getInlinerDescForProbe(Probe);
      ContextTrieNode *CallerNode = ContextNode->getParentContext();
      if (InlinerDesc != nullptr && CallerNode != &getRootContext()) {
        // Since the context id will be compressed, we have to use callee's
        // context id to infer caller's context id to ensure they share the
        // same context prefix.
        uint64_t CallerIndex = ContextNode->getCallSiteLoc().LineOffset;
        uint64_t CallerDiscriminator =
            ContextNode->getCallSiteLoc().Discriminator;
        assert(CallerIndex &&
               "Inferred caller's location index shouldn't be zero!");
        assert(!CallerDiscriminator &&
               "Callsite probe should not have a discriminator!");
        FunctionSamples &CallerProfile =
            *getOrCreateFunctionSamples(CallerNode);
        CallerProfile.setFunctionHash(InlinerDesc->FuncHash);
        CallerProfile.addBodySamples(CallerIndex, CallerDiscriminator, Count);
        CallerProfile.addTotalSamples(Count);
        CallerProfile.addCalledTargetSamples(CallerIndex, CallerDiscriminator,
                                             ContextNode->getFuncName(), Count);
      }
    }
  }

  // Assign zero count for remaining probes without sample hits to
  // differentiate from probes optimized away, of which the counts are unknown
  // and will be inferred by the compiler.
  for (auto &I : FrameSamples) {
    for (auto *FunctionProfile : I.second) {
      for (const MCDecodedPseudoProbe &Probe : I.first->getProbes()) {
        FunctionProfile->addBodySamples(Probe.getIndex(),
                                        Probe.getDiscriminator(), 0);
      }
    }
  }
}
void CSProfileGenerator::populateBoundarySamplesWithProbes(
    const BranchSample &BranchCounter, const AddrBasedCtxKey *CtxKey) {
  for (const auto &BI : BranchCounter) {
    uint64_t SourceAddress = BI.first.first;
    uint64_t TargetAddress = BI.first.second;
    uint64_t Count = BI.second;
    const MCDecodedPseudoProbe *CallProbe =
        Binary->getCallProbeForAddr(SourceAddress);
    if (CallProbe == nullptr)
      continue;
    FunctionSamples &FunctionProfile =
        getFunctionProfileForLeafProbe(CtxKey, CallProbe);
    FunctionProfile.addBodySamples(CallProbe->getIndex(), 0, Count);
    FunctionProfile.addTotalSamples(Count);
    StringRef CalleeName = getCalleeNameForAddress(TargetAddress);
    if (CalleeName.size() == 0)
      continue;
    FunctionProfile.addCalledTargetSamples(CallProbe->getIndex(),
                                           CallProbe->getDiscriminator(),
[llvm-profdata] Do not create numerical strings for MD5 function names read from a Sample Profile. (#66164)
This is phase 2 of the MD5 refactoring on Sample Profile following
https://reviews.llvm.org/D147740
In previous implementation, when a MD5 Sample Profile is read, the
reader first converts the MD5 values to strings, and then create a
StringRef as if the numerical strings are regular function names, and
later on IPO transformation passes perform string comparison over these
numerical strings for profile matching. This is inefficient since it
causes many small heap allocations.
In this patch I created a class `ProfileFuncRef` that is similar to
`StringRef` but it can represent a hash value directly without any
conversion, and it will be more efficient (I will attach some benchmark
results later) when being used in associative containers.
ProfileFuncRef guarantees the same function name in string form or in
MD5 form has the same hash value, which also fix a few issue in IPO
passes where function matching/lookup only check for function name
string, while returns a no-match if the profile is MD5.
When testing on an internal large profile (> 1 GB, with more than 10
million functions), the full profile load time is reduced from 28 sec to
25 sec in average, and reading function offset table from 0.78s to 0.7s
2023-10-17 17:09:39 -04:00
|
|
|
FunctionId(CalleeName), Count);
|
2021-01-11 09:08:39 -08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
ContextTrieNode *CSProfileGenerator::getContextNodeForLeafProbe(
    const AddrBasedCtxKey *CtxKey, const MCDecodedPseudoProbe *LeafProbe) {

  const SmallVectorImpl<uint64_t> *PContext = &CtxKey->Context;
  SmallVector<uint64_t, 16> NewContext;

  if (InferMissingFrames) {
    SmallVector<uint64_t, 16> Context = CtxKey->Context;
    // Append leaf frame for a complete inference.
    Context.push_back(LeafProbe->getAddress());
    inferMissingFrames(Context, NewContext);
    // Pop out the leaf probe that was pushed in above.
    NewContext.pop_back();
    PContext = &NewContext;
  }

  SampleContextFrameVector ContextStack;
  extractPrefixContextStack(ContextStack, *PContext, Binary);

  // Explicitly copy the context for appending the leaf context.
  SampleContextFrameVector NewContextStack(ContextStack.begin(),
                                           ContextStack.end());
  Binary->getInlineContextForProbe(LeafProbe, NewContextStack, true);
  // For a leaf inlined context with the top frame, we should strip off the top
  // frame's probe id, like:
  // Inlined stack: [foo:1, bar:2], the ContextId will be "foo:1 @ bar".
  auto LeafFrame = NewContextStack.back();
  LeafFrame.Location = LineLocation(0, 0);
  NewContextStack.pop_back();
  // Compress the context string except for the leaf frame.
  CSProfileGenerator::compressRecursionContext(NewContextStack);
  CSProfileGenerator::trimContext(NewContextStack);
  NewContextStack.push_back(LeafFrame);

  const auto *FuncDesc = Binary->getFuncDescForGUID(LeafProbe->getGuid());
  bool WasLeafInlined = LeafProbe->getInlineTreeNode()->hasInlineSite();
  ContextTrieNode *ContextNode =
      getOrCreateContextNode(NewContextStack, WasLeafInlined);
  ContextNode->getFunctionSamples()->setFunctionHash(FuncDesc->FuncHash);
  return ContextNode;
}

FunctionSamples &CSProfileGenerator::getFunctionProfileForLeafProbe(
    const AddrBasedCtxKey *CtxKey, const MCDecodedPseudoProbe *LeafProbe) {
  return *getContextNodeForLeafProbe(CtxKey, LeafProbe)->getFunctionSamples();
}

} // end namespace sampleprof
} // end namespace llvm