"BTF" is a debug information format used by LLVM's BPF backend.
The format is much smaller in scope than DWARF, the following info is
available:
- full set of C types used in the binary file;
- types for global values;
- line number / line source code information .
BTF information is embedded in ELF as .BTF and .BTF.ext sections.
Detailed format description could be found as a part of Linux Source
tree, e.g. here: [1].
This commit modifies `llvm-objdump` utility to use line number
information provided by BTF if DWARF information is not available.
E.g., the goal is to make the following to print source code lines,
interleaved with disassembly:
$ clang --target=bpf -g test.c -o test.o
$ llvm-strip --strip-debug test.o
$ llvm-objdump -Sd test.o
test.o: file format elf64-bpf
Disassembly of section .text:
<foo>:
; void foo(void) {
r1 = 0x1
; consume(1);
call -0x1
r1 = 0x2
; consume(2);
call -0x1
; }
exit
A common production use case for BPF programs is to:
- compile separate object files using clang with `-g -c` flags;
- link these files as a final "static" binary using bpftool linker ([2]).
The bpftool linker discards most of the DWARF sections
(line information sections as well) but merges .BTF and .BTF.ext sections.
Hence, having `llvm-objdump` capable to print source code using .BTF.ext
is valuable.
The commit consists of the following modifications:
- llvm/lib/DebugInfo/BTF aka `DebugInfoBTF` component is added to host
the code needed to process BTF (with assumption that BTF support
would be added to some other tools as well, e.g. `llvm-readelf`):
- `DebugInfoBTF` provides `llvm::BTFParser` class, that loads information
from `.BTF` and `.BTF.ext` sections of a given `object::ObjectFile`
instance and allows to query this information.
Currently only line number information is loaded.
- `DebugInfoBTF` also provides `llvm::BTFContext` class, which is an
implementation of `DIContext` interface, used by `llvm-objdump` to
query information about line numbers corresponding to specific
instructions.
- Structure `DILineInfo` is modified with field `LineSource`.
`DIContext` interface uses `DILineInfo` structure to communicate
line number and source code information.
Specifically, `DILineInfo::Source` field encodes full file source code,
if available. BTF only stores source code for selected lines of the
file, not a complete source file. Moreover, stored lines are not
guaranteed to be sorted in a specific order.
To avoid reconstruction of a file source code from a set of
available lines, this commit adds `LineSource` field instead.
- `Symbolize` class is modified to use `BTFContext` instead of
`DWARFContext` when DWARF sections are not available but BTF
sections are present in the object file.
(`Symbolize` is instantiated by `llvm-objdump`).
- Integration and unit tests.
Note, that DWARF has a notion of "instruction sequence".
DWARF implementation of `DIContext::getLineInfoForAddress()` provides
inexact responses if exact address information is not available but
address falls within "instruction sequence" with some known line
information (see `DWARFDebugLine::LineTable::findRowInSeq()`).
BTF does not provide instruction sequence groupings, thus
`getLineInfoForAddress()` queries only return exact matches.
This does not seem to be a big issue in practice, but output
of the `llvm-objdump -Sd` might differ slightly when BTF
is used instead of DWARF.
[1] https://www.kernel.org/doc/html/latest/bpf/btf.html
[2] https://github.com/libbpf/bpftool
Depends on https://reviews.llvm.org/D149501
Reviewed By: MaskRay, yonghong-song, nickdesaulniers, #debug-info
Differential Revision: https://reviews.llvm.org/D149058
The symbolizer markup syntax is structured such that fields require only
previous fields for their interpretation; this was originally intended
to make adding new fields a natural extension mechanism for existing
elements. This codifies this into the spec and makes the behavior of the
llvm-symbolizer match. Extra fields are now warned about, but ignored,
rather than ignoring the whole element.
Reviewed By: mcgrathr
Differential Revision: https://reviews.llvm.org/D153821
GNU addr2line exits immediately if -e (default to a.out) specifies a file that
cannot be open or a directory. llvm-addr2line used to wait for input on if the
input file cannot be open and addresses are not specified in command line.
Replace the D147652 checkFileExists with getOrCreateModuleInfo to avoid
a separate `sys::fs::status` operation.
Reviewed By: sepavloff
Differential Revision: https://reviews.llvm.org/D153595
GNU addr2line exits immediately if it cannot open the file specified as
executable/relocatable. In contrast llvm-addr2line does not exit and, if
addresses are not specified in command line, waits for input on stdin. This
causes the test compiler-rt/test/asan/TestCases/Posix/asan-symbolize-bad-path.cc to block
forever on Gentoo (see https://reviews.llvm.org/rG27c4777f41d2ab204c1cf84ff1cccd5ba41354da#1190273).
To fix this issue the behavior llvm-addr2line now exits if
executable/relocatable file cannot be found.
It fixes https://github.com/llvm/llvm-project/issues/42099 (llvm-addr2line
does not exit when passed a non-existent file).
Differential Revision: https://reviews.llvm.org/D147652
This should be last of the "bottom-up conversions" of various demanglers
to accept std::string_view. After this, D149104 may be revisited.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D152176
D149104 converted llvm::demangle to use std::string_view. Enabling
"expensive checks" (via -DLLVM_ENABLE_EXPENSIVE_CHECKS=ON) causes
lld/test/wasm/why-extract.s to fail. The reason for this is obscure:
Reason #10007 why std::string_view is dangerous:
Consider the following pattern:
std::string_view s = ...;
const char *c = s.data();
std::strlen(c);
Is c a NUL-terminated C style string? It depends; but if it's not then
it's not safe to call std::strlen on the std::string_view::data().
std::string_view::length() should be used instead.
Fixing this fixes the one lone test that caught this.
microsoftDemangle, rustDemangle, and dlangDemangle should get this same
treatment, too. I will do that next.
Reviewed By: MaskRay, efriedma
Differential Revision: https://reviews.llvm.org/D149675
This is a recommit of 75f1f158812d, reverted in 7a443b1c493d, because
it caused compilation error in
compiler-rt/lib/sanitizer_common/symbolizer/sanitizer_symbolize.cpp.
The error was fixed by Kasimir Georgiev in de4c038c7ba2, but this
commit was reverted in de088dd3a0aa, because the initial commit was
reverted.
This commit reverts both the reverting commits, 7a443b1c493d and
de088dd3a0aa.
Original commit message is below.
If llvm-symbolize did not find module, the error looked like:
LLVMSymbolizer: error reading file: No such file or directory
This message does not follow common practice: LLVMSymbolizer is not an
utility name. Also the message did not not contain the name of missed file.
With this change the error message looks differently:
llvm-symbolizer: error: 'abc': No such file or directory
This format is closer to messages produced by other utilities and allow
proper coloring.
Differential Revision: https://reviews.llvm.org/D148032
If llvm-symbolize did not find module, the error looked like:
LLVMSymbolizer: error reading file: No such file or directory
This message does not follow common practice: LLVMSymbolizer is not an
utility name. Also the message did not not contain the name of missed file.
With this change the error message looks differently:
llvm-symbolizer: error: 'abc': No such file or directory
This format is closer to messages produced by other utilities and allow
proper coloring.
Differential Revision: https://reviews.llvm.org/D148032
This makes parsing for build IDs in the markup filter slightly more
permissive, in line with fromHex.
It also removes the distinction between missing build ID and empty build
ID; empty build IDs aren't a useful concept, since their purpose is to
uniquely identify a binary. This removes a layer of indirection wherever
build IDs are obtained.
Reviewed By: jhenderson
Differential Revision: https://reviews.llvm.org/D147485
Move the conversion of DILineInfo to JSON into a separate function, so
it can be used in other places too.
This is a prerequisite patch for implementation of symbol+offset lookup.
Differential Revision: https://reviews.llvm.org/D147112
llvm-symbolizer echoed input if it was not recognized as a valid address.
This behavior was extended to llvm-addr2line as well. GNU addr2line in
this case optputs "??:0". This difference prevents implementation of
symbol+offset lookup available in the recent versions of GNU binutils.
In that case a string that is not an address may be a symbol.
This change make reaction of llvm-addr2line on unrecognized input closer
to GNU addr2line.
On i386 Windows, after C++ names have been Itanium-mangled, the C name
mangling specific to its call convention may also be applied on top.
This change teaches symbolizer to be able to demangle this type of
mangled names.
As part of this change, `demanglePE32ExternCFunc` has also been modified
to fix unwanted stripping for vectorcall names when the demangled name
is supposed to contain a leading `_`. Notice that the vectorcall
mangling does not add either an `_` or `@` prefix. The old code always
tries to strip the prefix first, which for Itanium mangled names in
vectorcall, the leading underscore of the Itanium name gets stripped
instead and breaks the Itanium demangler.
Differential Revision: https://reviews.llvm.org/D144049
This is a fairly large changeset, but it can be broken into a few
pieces:
- `llvm/Support/*TargetParser*` are all moved from the LLVM Support
component into a new LLVM Component called "TargetParser". This
potentially enables using tablegen to maintain this information, as
is shown in https://reviews.llvm.org/D137517. This cannot currently
be done, as llvm-tblgen relies on LLVM's Support component.
- This also moves two files from Support which use and depend on
information in the TargetParser:
- `llvm/Support/Host.{h,cpp}` which contains functions for inspecting
the current Host machine for info about it, primarily to support
getting the host triple, but also for `-mcpu=native` support in e.g.
Clang. This is fairly tightly intertwined with the information in
`X86TargetParser.h`, so keeping them in the same component makes
sense.
- `llvm/ADT/Triple.h` and `llvm/Support/Triple.cpp`, which contains
the target triple parser and representation. This is very intertwined
with the Arm target parser, because the arm architecture version
appears in canonical triples on arm platforms.
- I moved the relevant unittests to their own directory.
And so, we end up with a single component that has all the information
about the following, which to me seems like a unified component:
- Triples that LLVM Knows about
- Architecture names and CPUs that LLVM knows about
- CPU detection logic for LLVM
Given this, I have also moved `RISCVISAInfo.h` into this component, as
it seems to me to be part of that same set of functionality.
If you get link errors in your components after this patch, you likely
need to add TargetParser into LLVM_LINK_COMPONENTS in CMake.
Differential Revision: https://reviews.llvm.org/D137838
This patch mechanically replaces None with std::nullopt where the
compiler would warn if None were deprecated. The intent is to reduce
the amount of manual work required in migrating from Optional to
std::optional.
This is part of an effort to migrate from llvm::Optional to
std::optional:
https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
This creates a library for fetching debug info by build ID, whether
locally or remotely via debuginfod. The functionality was refactored
out of existing code in the Symboliize library. Existing utilities
were refactored to use this library.
Reviewed By: phosek
Differential Revision: https://reviews.llvm.org/D132504
When populating the symbol table for an ELF object file, don't insert any symbols that come from ELF sections which don't have runtime allocated memory (typically debugging symbols).
Reviewed By: dblaikie
Differential Revision: https://reviews.llvm.org/D133795
This adds support for backtrace generation to the llvm-symbolizer markup
filter, which is likely the largest use case.
Reviewed By: peter.smith
Differential Revision: https://reviews.llvm.org/D132706
Implements the pc element for the symbolizing filter, including it's
"ra" and "pc" modes. Return addresses ("ra") are adjusted by
decrementing one. By default, {{{pc}}} elements are assumed to point to
precise code ("pc") locations. Backtrace elements will adopt the
opposite convention.
Along the way, some minor refactors of value printing and colorization.
Reviewed By: peter.smith
Differential Revision: https://reviews.llvm.org/D131115
This connects the Symbolizer to the markup filter and enables the first
working end-to-end flow using the filter.
Reviewed By: peter.smith
Differential Revision: https://reviews.llvm.org/D130187
This change implements the contextual symbolizer markup elements: reset,
module, and mmap. These provide information about the runtime context of
the binary necessary to resolve addresses to symbolic values.
Summary information is printed to the output about this context.
Multiple mmap elements for the same module line are coalesced together.
The standard requires that such elements occur on their own lines to
allow for this; accordingly, anything after a contextual element on a
line is silently discarded.
Implementing this cleanly requires that the filter drive the parser;
this allows skipped sections to avoid being parsed. This also makes the
filter quite a bit easier to use, at the cost of some unused
flexibility.
Reviewed By: peter.smith
Differential Revision: https://reviews.llvm.org/D129519
This library implements the class `DebuginfodCollection`, which scans a set of directories for binaries, classifying them according to whether they contain debuginfo. This also provides the `DebuginfodServer`, an `HTTPServer` which serves debuginfod's `/debuginfo` and `/executable` endpoints. This is intended as the final new supporting library required for `llvm-debuginfod`.
As implemented here, `DebuginfodCollection` only finds ELF binaries and DWARF debuginfo. All other files are ignored. However, the class interface is format-agnostic. Generalizing to support other platforms will require refactoring of LLVM's object parsing libraries to eliminate use of `report_fatal_error` ([[ https://github.com/llvm/llvm-project/blob/main/llvm/lib/Object/WasmObjectFile.cpp#L74 | e.g. when reading WASM files ]]), so that the debuginfod daemon does not crash when it encounters a malformed file on the disk.
The `DebuginfodCollection` is tested by end-to-end tests of the debuginfod server (D114846).
Reviewed By: mysterymath
Differential Revision: https://reviews.llvm.org/D114845
This adds a --filter option to llvm-symbolizer. This takes log-bearing
symbolizer markup from stdin and writes a human-readable version to
stdout.
For now, this only implements the "symbol" markup tag; all others are
passed through unaltered. This is a proof-of-concept bit of
functionalty; implement the various tags is more-or-less just a matter
of hooking up various parts of the Symbolize library to the architecture
established here.
Reviewed By: peter.smith
Differential Revision: https://reviews.llvm.org/D126980
This allows registering certain tags as possibly beginning multi-line
elements in the symbolizer markup parser. The parser is kept agnostic to
how lines are delimited; it reports the entire contents, including line
endings, once the end of element marker is reached.
Reviewed By: peter.smith
Differential Revision: https://reviews.llvm.org/D124798
This adds a parser for the log symbolizer markup format discussed in
https://discourse.llvm.org/t/rfc-log-symbolizer/61282. The parser
operates in a line-by-line fashion with minimal memory requirements.
This doesn't yet include support for multi-line tags or specific parsing
for ANSI X3.64 SGR control sequences, but it can be extended to do so.
The latter can also be relatively easily handled by examining the
resulting text elements.
Reviewed By: peter.smith
Differential Revision: https://reviews.llvm.org/D124686
Currently, llvm-symbolizer doesn't like to parse .debug_info in order to
show the line info for global variables. addr2line does this. In the
future, I'm looking to migrate AddressSanitizer off of internal metadata
over to using debuginfo, and this is predicated on being able to get the
line info for global variables.
This patch adds the requisite support for getting the line info from the
.debug_info section for symbolizing global variables. This only happens
when you ask for a global variable to be symbolized as data.
Reviewed By: dblaikie
Differential Revision: https://reviews.llvm.org/D123538
Follow up to 1e396affca6a0d21247d960c93a415e8f6fe0301. On some standard
library configurations these have a dependency on the complete type of
SymbolizableModule.
This change adds a simple LRU cache to the Symbolize class to put a cap
on llvm-symbolizer memory usage. Previously, the Symbolizer's virtual
memory footprint would grow without bound as additional binaries were
referenced.
I'm putting this out there early for an informal review, since there may be
a dramatically different/better way to go about this. I still need to
figure out a good default constant for the memory cap and benchmark the
implementation against a large symbolization workload. Right now I've
pegged max memory usage at zero for testing purposes, which evicts the whole
cache every time.
Unfortunately, it looks like StringRefs in the returned DI objects can
directly refer to the contents of binaries. Accordingly, the cache
pruning must be explicitly requested by the caller, as the caller must
guarantee that none of the returned objects will be used afterwards.
For llvm-symbolizer this a light burden; symbolization occurs
line-by-line, and the returned objects are discarded after each.
Implementation wise, there are a number of nested caches that depend
on one another. I've implemented a simple Evictor callback system to
allow derived caches to register eviction actions to occur when the
underlying binaries are evicted.
Reviewed By: dblaikie
Differential Revision: https://reviews.llvm.org/D119784
On some standard library configurations these have a dependency on the
complete type of SymbolizableModule. They also do a lot of
copying/freeing so no point in inlining them.
This adds a --build-id=<hex build ID> flag to llvm-symbolizer. If --obj
is unspecified, this will attempt to look up the provided build ID using
whatever mechanisms are available to the Symbolizer (typically,
debuginfod). The semantics are then as if the found binary were given
using the --obj flag.
Reviewed By: jhenderson, phosek
Differential Revision: https://reviews.llvm.org/D118633