BOLT
====

BOLT is a post-link optimizer developed to speed up large applications.
It achieves the improvements by optimizing an application’s code layout
based on an execution profile gathered by a sampling profiler, such as the
Linux ``perf`` tool. An overview of the ideas implemented in BOLT, along
with a discussion of its potential and current results, is available in the
`CGO’19
paper <https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/>`__.

Input Binary Requirements
-------------------------

BOLT operates on X86-64 and AArch64 ELF binaries. At a minimum, the
binaries should have an unstripped symbol table, and, to get maximum
performance gains, they should be linked with relocations
(``--emit-relocs`` or ``-q`` linker flag).

BOLT disassembles functions and reconstructs the control flow graph
(CFG) before it runs optimizations. Since this is a nontrivial task,
especially when indirect branches are present, we rely on certain
heuristics to accomplish it. These heuristics have been tested on code
generated with the Clang and GCC compilers. The main requirement for C/C++
code is not to rely on code layout properties, such as function pointer
deltas. Assembly code can be processed too. Requirements for it include
a clear separation of code and data, with data objects being placed into
data sections/segments. If indirect jumps are used for intra-function
control transfer (e.g., jump tables), the code patterns should match
those generated by Clang/GCC.

NOTE: BOLT is currently incompatible with the
``-freorder-blocks-and-partition`` compiler option. Since GCC 8 enables
this option by default, you have to explicitly disable it by adding the
``-fno-reorder-blocks-and-partition`` flag if you are compiling with
GCC 8 or above.

NOTE2: DWARF v5 is the new debugging format generated by the latest LLVM
and GCC compilers. It offers several benefits over the previous DWARF
v4. Currently, support for v5 is a work in progress for BOLT. While
you will be able to optimize binaries produced by the latest compilers,
until the support is complete, you will not be able to update the debug
info with ``-update-debug-sections``. To temporarily work around the
issue, we recommend compiling binaries with the ``-gdwarf-4`` option, which
forces DWARF v4 output.

PIE and .so support has been added recently. Please report bugs if you
encounter any issues.

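The requirements above can be collected into a set of build flags. The
following is a minimal sketch for a GCC build, not a prescribed setup: the
variable names, ``main.c`` and ``my_app`` are placeholders, and
``-gdwarf-4`` only matters if you plan to use ``-update-debug-sections``.

```shell
# Assumed variable names; adapt them to your build system.
# -fno-reorder-blocks-and-partition: needed with GCC 8 or above (see NOTE).
# -gdwarf-4: forces DWARF v4 output (see NOTE2).
CFLAGS="-fno-reorder-blocks-and-partition -gdwarf-4"

# -Wl, passes --emit-relocs through the compiler driver to the linker.
LDFLAGS="-Wl,--emit-relocs"

# Print the resulting command; main.c and my_app are placeholder names.
echo "gcc $CFLAGS main.c -o my_app $LDFLAGS"
```

The command is only printed here, so you can inspect it before adding the
flags to a real build.
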
Installation
------------

Docker Image
~~~~~~~~~~~~

You can build and use the docker image containing BOLT using our `docker
file <utils/docker/Dockerfile>`__. Alternatively, you can build BOLT
manually using the steps below.

Manual Build
~~~~~~~~~~~~

BOLT heavily uses LLVM libraries, and by design, it is built as one of the
LLVM tools. The build process is not much different from a regular LLVM
build. The following instructions assume that you are running
under Linux.

Start by cloning the LLVM repo:

::

    > git clone https://github.com/llvm/llvm-project.git
    > mkdir build
    > cd build
    > cmake -G Ninja ../llvm-project/llvm -DLLVM_TARGETS_TO_BUILD="X86;AArch64" -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_ENABLE_PROJECTS="bolt"
    > ninja bolt

``llvm-bolt`` will be available under ``bin/``. Add this directory to
your path to ensure the rest of the commands in this tutorial work.

Optimizing BOLT’s Performance
-----------------------------

BOLT runs many internal passes in parallel. If you foresee heavy usage
of BOLT, you can improve the processing time by linking against one of the
memory allocation libraries with good support for concurrency. E.g., to
use jemalloc:

::

    > sudo yum install jemalloc-devel
    > LD_PRELOAD=/usr/lib64/libjemalloc.so llvm-bolt ....

Or, if you would rather use tcmalloc:

::

    > sudo yum install gperftools-devel
    > LD_PRELOAD=/usr/lib64/libtcmalloc_minimal.so llvm-bolt ....

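If you script BOLT runs across machines, it can help to check which of
these allocators is installed before setting ``LD_PRELOAD``. The sketch
below assumes the two library paths used in the examples above. They vary
between distributions, and ``find_allocator`` is a hypothetical helper name.

```shell
# Hypothetical helper: print the first allocator library that exists.
# The candidate paths are the ones from the examples above and may
# differ on your distribution.
find_allocator() {
    for lib in /usr/lib64/libjemalloc.so /usr/lib64/libtcmalloc_minimal.so; do
        if [ -e "$lib" ]; then
            echo "$lib"
            return 0
        fi
    done
    return 1   # neither library was found
}

# Example use (arguments to llvm-bolt elided, as in the text above):
# LD_PRELOAD="$(find_allocator)" llvm-bolt ....
```

When neither library is present the function prints nothing, so
``LD_PRELOAD`` stays empty and ``llvm-bolt`` runs with the default allocator.
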
Usage
-----

For a complete practical guide to using BOLT, see `Optimizing Clang with
BOLT <docs/OptimizingClang.md>`__.

Step 0
~~~~~~

In order to allow BOLT to re-arrange functions (in addition to
re-arranging code within functions) in your program, it needs a little
help from the linker. Add ``--emit-relocs`` to the final link step of
your application. You can verify the presence of relocations by checking
for a ``.rela.text`` section in the binary. BOLT will also report if it
detects relocations while processing the binary.

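One way to run that check is to list the section headers with ``readelf``
from binutils. This is a minimal sketch that assumes ``readelf`` is
installed. ``has_relocs`` and ``./my_app`` are hypothetical names.

```shell
# Hypothetical helper: succeeds if the ELF file has a .rela.text section.
has_relocs() {
    # readelf -S lists section headers; -W keeps each header on one line.
    readelf -S -W "$1" 2>/dev/null | grep -q '\.rela\.text'
}

# Example use (./my_app is a placeholder path):
# has_relocs ./my_app && echo "relocations present"
```

The helper also fails when the file is missing or is not an ELF binary, so a
failure does not by itself prove that ``--emit-relocs`` was omitted.
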
Step 1: Collect Profile
~~~~~~~~~~~~~~~~~~~~~~~

This step is different for different kinds of executables. If you can
invoke your program to run on a representative input from a command
line, then see the **For Applications** section below. If your program
typically runs as a server/service, then skip to the **For Services**
section.

The version of the ``perf`` command used for the following steps has to
support the ``-F brstack`` option. We recommend using ``perf`` version 4.5
or later.

For Applications
^^^^^^^^^^^^^^^^

This assumes you can run your program from a command line with a typical
input. In this case, simply prepend the command line invocation with
``perf``:

::

    $ perf record -e cycles:u -j any,u -o perf.data -- <executable> <args> ...

For Services
^^^^^^^^^^^^

Once you get the service deployed and warmed up, it is time to collect
perf data with LBR (branch information). The exact perf command to use
will depend on the service. E.g., to collect the data for all processes
running on the server for the next 3 minutes, use:

::

    $ perf record -e cycles:u -j any,u -a -o perf.data -- sleep 180

Depending on the application, you may need more samples to be included
with your profile. It’s hard to tell upfront what would be a sweet spot
for your application. We recommend that the profile cover 1B instructions,
as reported by the BOLT ``-dyno-stats`` option. If you need to increase the
number of samples in the profile, you can either run the ``sleep``
command for longer or use the ``-F<N>`` option with ``perf`` to increase
the sampling frequency.

Note that for profile collection we recommend using cycle events and not
``BR_INST_RETIRED.*``. Empirically, we found it to produce better
results.

If the collection of a profile with branches is not available, e.g.,
when you run on a VM or on hardware that does not support it, then you
can use only sample events, such as cycles. In this case, the quality of
the profile information would not be as good, and performance gains with
BOLT are expected to be lower.

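The two ways of gathering more samples, a longer ``sleep`` and a higher
``-F`` value, can be combined in one command. The sketch below only prints
the command so that you can review it first. ``perf_record_cmd`` is a
hypothetical helper, and the numbers in the usage comment are illustrative
rather than recommended values.

```shell
# Hypothetical helper: print a system-wide perf record command.
#   $1 - collection time in seconds (passed to sleep)
#   $2 - sampling frequency in Hz (passed to perf's -F option)
perf_record_cmd() {
    duration="$1"
    freq="$2"
    echo "perf record -e cycles:u -j any,u -a -F $freq -o perf.data -- sleep $duration"
}

# Example use: sample for 5 minutes at 2000 Hz (illustrative values).
# perf_record_cmd 300 2000
```
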
With Instrumentation
^^^^^^^^^^^^^^^^^^^^

If ``perf record`` is not available to you, you may collect a profile by
first instrumenting the binary with BOLT and then running it.

::

    llvm-bolt <executable> -instrument -o <instrumented-executable>

After you run the instrumented executable with the desired workload, its
BOLT profile should be ready for you in ``/tmp/prof.fdata`` and you can
skip **Step 2**.

Run BOLT with the ``-help`` option and check the category “BOLT
instrumentation options” for a quick reference on instrumentation knobs.

Step 2: Convert Profile to BOLT Format
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NOTE: you can skip this step and feed ``perf.data`` directly to BOLT
using the experimental ``-p perf.data`` option.

For this step, you will need the ``perf.data`` file collected in the
previous step and a copy of the binary that was running. The binary has
to be either unstripped or have its symbol table intact (i.e.,
running ``strip -g`` is okay).

Make sure ``perf`` is in your ``PATH``, and execute ``perf2bolt``:

::

    $ perf2bolt -p perf.data -o perf.fdata <executable>

This command will aggregate branch data from ``perf.data`` and store it
in a format that is both more compact and more resilient to binary
modifications.

If the profile was collected without LBRs, you will need to add the ``-nl``
flag to the command line above.

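The choice between the two forms of the command can be captured in a small
wrapper. This is a sketch only: ``perf2bolt_cmd`` and its ``yes``/``no``
argument are hypothetical, and the command is printed rather than run.

```shell
# Hypothetical helper: print the perf2bolt command for a binary.
#   $1 - path to the binary that was profiled
#   $2 - "yes" if the profile was collected with LBRs, anything else if not
perf2bolt_cmd() {
    binary="$1"
    has_lbr="$2"
    if [ "$has_lbr" = "yes" ]; then
        echo "perf2bolt -p perf.data -o perf.fdata $binary"
    else
        # Profiles without branch data need the -nl flag.
        echo "perf2bolt -p perf.data -o perf.fdata -nl $binary"
    fi
}

# Example use (./my_app is a placeholder path):
# perf2bolt_cmd ./my_app no
```
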
Step 3: Optimize with BOLT
~~~~~~~~~~~~~~~~~~~~~~~~~~

Once you have ``perf.fdata`` ready, you can use it for optimizations
with BOLT. Assuming your environment is set up to include the right path,
execute ``llvm-bolt``:

::

    $ llvm-bolt <executable> -o <executable>.bolt -data=perf.fdata -reorder-blocks=ext-tsp -reorder-functions=hfsort -split-functions -split-all-cold -split-eh -dyno-stats

If you need updated debug info, then add the
``-update-debug-sections`` option to the command above. The processing
time will be slightly longer.

For a full list of options, see the ``-help``/``-help-hidden`` output.

The input binary for this step does not have to match the binary
used for profile collection in **Step 1** exactly. This could happen when
you are doing active development, and the source code constantly changes,
yet you want to benefit from profile-guided optimizations. However,
since the binary is not precisely the same, the profile information
could become invalid or stale, and BOLT will report the number of
functions with a stale profile. The higher the number, the less
performance improvement should be expected. Thus, it is crucial to
update ``.fdata`` for release branches.

Multiple Profiles
-----------------

Suppose your application can run in different modes, and you can
generate a profile for each one of them. To generate a single
binary that benefits all modes (assuming the profiles don’t
contradict each other), you can use the ``merge-fdata`` tool:

::

    $ merge-fdata *.fdata > combined.fdata

Use ``combined.fdata`` for **Step 3** above to generate a universally
optimized binary.

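In a script, it is worth guarding against an empty profile directory
before merging. The following sketch assumes ``merge-fdata`` is in your
``PATH``. ``merge_profiles`` and both of its arguments are hypothetical
names.

```shell
# Hypothetical helper: merge every .fdata file in a directory.
#   $1 - directory holding the per-mode profiles
#   $2 - output file for the merged profile
merge_profiles() {
    dir="$1"
    out="$2"
    # Replace the positional parameters with the matching files.
    set -- "$dir"/*.fdata
    # If nothing matched, $1 is still the literal pattern and does not exist.
    if [ ! -e "$1" ]; then
        echo "no .fdata files in $dir" >&2
        return 1
    fi
    merge-fdata "$@" > "$out"
}

# Example use (both paths are placeholders):
# merge_profiles ./profiles ./merged/combined.fdata
```

Writing the output outside the input directory keeps a later run from
merging an earlier ``combined.fdata`` back in with the per-mode profiles.
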
License
-------

BOLT is licensed under the `Apache License v2.0 with LLVM
Exceptions <./LICENSE.TXT>`__.