mirror of
https://github.com/llvm/llvm-project.git
synced 2025-04-25 17:56:06 +00:00

gfx940 and gfx941 are no longer supported. This is one of a series of PRs to remove them from the code base. For SWDEV-512631 and SWDEV-512633
222 lines
9.0 KiB
ReStructuredText
222 lines
9.0 KiB
ReStructuredText
.. _libc_gpu_usage:
|
|
|
|
===================
|
|
Using libc for GPUs
|
|
===================
|
|
|
|
.. contents:: Table of Contents
|
|
:depth: 4
|
|
:local:
|
|
|
|
Using the GPU C library
|
|
=======================
|
|
|
|
Once you have finished :ref:`building<libc_gpu_building>` the GPU C library it
|
|
can be used to run libc or libm functions directly on the GPU. Currently, not
|
|
all C standard functions are supported on the GPU. Consult the :ref:`list of
|
|
supported functions<libc_gpu_support>` for a comprehensive list.
|
|
|
|
The GPU C library supports two main usage modes. The first is as a supplementary
|
|
library for offloading languages such as OpenMP, CUDA, or HIP. These aim to
|
|
provide standard system utilities similarly to existing vendor libraries. The
|
|
second method treats the GPU as a hosted target by compiling C or C++ for it
|
|
directly. This is more similar to targeting OpenCL and is primarily used for
|
|
exported functions on the GPU and testing.
|
|
|
|
Offloading usage
|
|
----------------
|
|
|
|
Offloading languages like CUDA, HIP, or OpenMP work by compiling a single source
|
|
file for both the host target and a list of offloading devices. In order to
|
|
support standard compilation flows, the ``clang`` driver uses fat binaries,
|
|
described in the `clang documentation
|
|
<https://clang.llvm.org/docs/OffloadingDesign.html>`_. This linking mode is used
|
|
by the OpenMP toolchain, but is currently opt-in for the CUDA and HIP toolchains
|
|
through the ``--offload-new-driver``` and ``-fgpu-rdc`` flags.
|
|
|
|
In order or link the GPU runtime, we simply pass this library to the embedded
|
|
device linker job. This can be done using the ``-Xoffload-linker`` option, which
|
|
forwards an argument to a ``clang`` job used to create the final GPU executable.
|
|
The toolchain should pick up the C libraries automatically in most cases, so
|
|
this shouldn't be necessary.
|
|
|
|
.. code-block:: sh
|
|
|
|
$> clang openmp.c -fopenmp --offload-arch=gfx90a -Xoffload-linker -lc
|
|
$> clang cuda.cu --offload-arch=sm_80 --offload-new-driver -fgpu-rdc -Xoffload-linker -lc
|
|
$> clang hip.hip --offload-arch=gfx942 --offload-new-driver -fgpu-rdc -Xoffload-linker -lc
|
|
|
|
This will automatically link in the needed function definitions if they were
|
|
required by the user's application. Normally using the ``-fgpu-rdc`` option
|
|
results in sub-par performance due to ABA linking. However, the offloading
|
|
toolchain supports the ``--foffload-lto`` option to support LTO on the target
|
|
device.
|
|
|
|
Offloading languages require that functions present on the device be declared as
|
|
such. This is done with the ``__device__`` keyword in CUDA and HIP or the
|
|
``declare target`` pragma in OpenMP. This requires that the LLVM C library
|
|
exposes its implemented functions to the compiler when it is used to build. We
|
|
support this by providing wrapper headers in the compiler's resource directory.
|
|
These are located in ``<clang-resource-dir>/include/llvm-libc-wrappers`` in your
|
|
installation.
|
|
|
|
The support for HIP and CUDA is more experimental, requiring manual intervention
|
|
to link and use the facilities. An example of this is shown in the :ref:`CUDA
|
|
server example<libc_gpu_cuda_server>`. The OpenMP Offloading toolchain is
|
|
completely integrated with the LLVM C library however. It will automatically
|
|
handle including the necessary libraries, define device-side interfaces, and run
|
|
the RPC server.
|
|
|
|
OpenMP Offloading example
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
This section provides a simple example of compiling an OpenMP program with the
|
|
GPU C library.
|
|
|
|
.. code-block:: c++
|
|
|
|
#include <stdio.h>
|
|
|
|
int main() {
|
|
FILE *file = stderr;
|
|
#pragma omp target teams num_teams(2) thread_limit(2)
|
|
#pragma omp parallel num_threads(2)
|
|
{ fputs("Hello from OpenMP!\n", file); }
|
|
}
|
|
|
|
This can simply be compiled like any other OpenMP application to print from two
|
|
threads and two blocks.
|
|
|
|
.. code-block:: sh
|
|
|
|
$> clang openmp.c -fopenmp --offload-arch=gfx90a
|
|
$> ./a.out
|
|
Hello from OpenMP!
|
|
Hello from OpenMP!
|
|
Hello from OpenMP!
|
|
Hello from OpenMP!
|
|
|
|
Including the wrapper headers, linking the C library, and running the :ref:`RPC
|
|
server<libc_gpu_rpc>` are all handled automatically by the compiler and runtime.
|
|
|
|
Direct compilation
|
|
------------------
|
|
|
|
Instead of using standard offloading languages, we can also target the CPU
|
|
directly using C and C++ to create a GPU executable similarly to OpenCL. This is
|
|
done by targeting the GPU architecture using `clang's cross compilation
|
|
support <https://clang.llvm.org/docs/CrossCompilation.html>`_. This is the
|
|
method that the GPU C library uses both to build the library and to run tests.
|
|
|
|
This allows us to easily define GPU specific libraries and programs that fit
|
|
well into existing tools. In order to target the GPU effectively we rely heavily
|
|
on the compiler's intrinsic and built-in functions. For example, the following
|
|
function gets the thread identifier in the 'x' dimension on both GPUs supported
|
|
GPUs.
|
|
|
|
.. code-block:: c++
|
|
|
|
uint32_t get_thread_id_x() {
|
|
#if defined(__AMDGPU__)
|
|
return __builtin_amdgcn_workitem_id_x();
|
|
#elif defined(__NVPTX__)
|
|
return __nvvm_read_ptx_sreg_tid_x();
|
|
#else
|
|
#error "Unsupported platform"
|
|
#endif
|
|
}
|
|
|
|
We can then compile this for both NVPTX and AMDGPU into LLVM-IR using the
|
|
following commands. This will yield valid LLVM-IR for the given target just like
|
|
if we were using CUDA, OpenCL, or OpenMP.
|
|
|
|
.. code-block:: sh
|
|
|
|
$> clang id.c --target=amdgcn-amd-amdhsa -mcpu=native -nogpulib -flto -c
|
|
$> clang id.c --target=nvptx64-nvidia-cuda -march=native -nogpulib -flto -c
|
|
|
|
We can also use this support to treat the GPU as a hosted environment by
|
|
providing a C library and startup object just like a standard C library running
|
|
on the host machine. Then, in order to execute these programs, we provide a
|
|
loader utility to launch the executable on the GPU similar to a cross-compiling
|
|
emulator. This is how we run :ref:`unit tests <libc_gpu_testing>` targeting the
|
|
GPU. This is clearly not the most efficient way to use a GPU, but it provides a
|
|
simple method to test execution on a GPU for debugging or development.
|
|
|
|
Building for AMDGPU targets
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
The AMDGPU target supports several features natively by virtue of using ``lld``
|
|
as its linker. The installation will include the ``include/amdgcn-amd-amdhsa``
|
|
and ``lib/amdgcn-amd-amdha`` directories that contain the necessary code to use
|
|
the library. We can directly link against ``libc.a`` and use LTO to generate the
|
|
final executable.
|
|
|
|
.. code-block:: c++
|
|
|
|
#include <stdio.h>
|
|
|
|
int main() { printf("Hello from AMDGPU!\n"); }
|
|
|
|
This program can then be compiled using the ``clang`` compiler. Note that
|
|
``-flto`` and ``-mcpu=`` should be defined. This is because the GPU
|
|
sub-architectures do not have strict backwards compatibility. Use ``-mcpu=help``
|
|
for accepted arguments or ``-mcpu=native`` to target the system's installed GPUs
|
|
if present. Additionally, the AMDGPU target always uses ``-flto`` because we
|
|
currently do not fully support ELF linking in ``lld``. Once built, we use the
|
|
``amdhsa-loader`` utility to launch execution on the GPU. This will be built if
|
|
the ``hsa_runtime64`` library was found during build time.
|
|
|
|
.. code-block:: sh
|
|
|
|
$> clang hello.c --target=amdgcn-amd-amdhsa -mcpu=native -flto -lc <install>/lib/amdgcn-amd-amdhsa/crt1.o
|
|
$> amdhsa-loader --threads 2 --blocks 2 a.out
|
|
Hello from AMDGPU!
|
|
Hello from AMDGPU!
|
|
Hello from AMDGPU!
|
|
Hello from AMDGPU!
|
|
|
|
This will include the ``stdio.h`` header, which is found in the
|
|
``include/amdgcn-amd-amdhsa`` directory. We define out ``main`` function like a
|
|
standard application. The startup utility in ``lib/amdgcn-amd-amdhsa/crt1.o``
|
|
will handle the necessary steps to execute the ``main`` function along with
|
|
global initializers and command line arguments. Finally, we link in the
|
|
``libc.a`` library stored in ``lib/amdgcn-amd-amdhsa`` to define the standard C
|
|
functions.
|
|
|
|
The search paths for the include directories and libraries are automatically
|
|
handled by the compiler. We use this support internally to run unit tests on the
|
|
GPU directly. See :ref:`libc_gpu_testing` for more information. The installation
|
|
also provides ``libc.bc`` which is a single LLVM-IR bitcode blob that can be
|
|
used instead of the static library.
|
|
|
|
Building for NVPTX targets
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
The infrastructure is the same as the AMDGPU example. However, the NVPTX binary
|
|
utilities are very limited and must be targeted directly. A utility called
|
|
``clang-nvlink-wrapper`` instead wraps around the standard link job to give the
|
|
illusion that ``nvlink`` is a functional linker.
|
|
|
|
.. code-block:: c++
|
|
|
|
#include <stdio.h>
|
|
|
|
int main(int argc, char **argv, char **envp) {
|
|
printf("Hello from NVPTX!\n");
|
|
}
|
|
|
|
Additionally, the NVPTX ABI requires that every function signature matches. This
|
|
requires us to pass the full prototype from ``main``. The installation will
|
|
contain the ``nvptx-loader`` utility if the CUDA driver was found during
|
|
compilation. Using link time optimization will help hide this.
|
|
|
|
.. code-block:: sh
|
|
|
|
$> clang hello.c --target=nvptx64-nvidia-cuda -march=native -flto -lc <install>/lib/nvptx64-nvidia-cuda/crt1.o
|
|
$> nvptx-loader --threads 2 --blocks 2 a.out
|
|
Hello from NVPTX!
|
|
Hello from NVPTX!
|
|
Hello from NVPTX!
|
|
Hello from NVPTX!
|