Clang provides partial HIP support on Intel GPUs using the CHIP-Star project `<https://github.com/CHIP-SPV/chipStar>`_.
CHIP-Star implements the HIP runtime over oneAPI Level Zero or OpenCL runtime. The Clang driver uses the HIPSPV
toolchain to compile HIP device code into LLVM IR, which is subsequently translated to SPIR-V via the SPIR-V
backend or the out-of-tree LLVM-SPIRV translator. The SPIR-V is then bundled and embedded into the host executables.
..note::
While Clang does not directly provide HIP support for NVIDIA GPUs and CPUs, these platforms are supported via other means:
- NVIDIA GPUs: HIP support is offered through the HIP project `<https://github.com/ROCm-Developer-Tools/HIP>`_, which provides a header-only library for translating HIP runtime APIs into CUDA runtime APIs. The code is subsequently compiled using NVIDIA's `nvcc`.
- CPUs: HIP support is available through the HIP-CPU runtime library `<https://github.com/ROCm-Developer-Tools/HIP-CPU>`_. This header-only library enables CPUs to execute unmodified HIP code.
Example Usage
=============
To compile a HIP program, use the following command:
Compiling a HIP program depends on the HIP runtime and device library. The paths to the HIP runtime and device libraries
can be specified either using compiler options or environment variables. The paths can also be set through the ROCm path
if they follow the ROCm installation directory structure.
Order of Precedence for HIP Path
--------------------------------
1.``--hip-path`` compiler option
2.``HIP_PATH`` environment variable *(use with caution)*
3.``--rocm-path`` compiler option
4.``ROCM_PATH`` environment variable *(use with caution)*
5. Default automatic detection (relative to Clang or at the default ROCm installation location)
Order of Precedence for Device Library Path
-------------------------------------------
1.``--hip-device-lib-path`` compiler option
2.``HIP_DEVICE_LIB_PATH`` environment variable *(use with caution)*
3.``--rocm-path`` compiler option
4.``ROCM_PATH`` environment variable *(use with caution)*
5. Default automatic detection (relative to Clang or at the default ROCm installation location)
..list-table::
:header-rows:1
* - Compiler Option
- Environment Variable
- Description
- Default Value
* - ``--rocm-path=<path>``
-``ROCM_PATH``
- Specifies the ROCm installation path.
- Automatic detection
* - ``--hip-path=<path>``
-``HIP_PATH``
- Specifies the HIP runtime installation path.
- Determined by ROCm directory structure
* - ``--hip-device-lib-path=<path>``
-``HIP_DEVICE_LIB_PATH``
- Specifies the HIP device library installation path.
- Determined by ROCm directory structure
..note::
We recommend using the compiler options as the primary method for specifying these paths. While the environment variables ``ROCM_PATH``, ``HIP_PATH``, and ``HIP_DEVICE_LIB_PATH`` are supported, their use can lead to implicit dependencies that might cause issues in the long run. Use them with caution.
Predefined Macros
=================
..list-table::
:header-rows:1
* - Macro
- Description
* - ``__CLANG_RDC__``
- Defined when Clang is compiling code in Relocatable Device Code (RDC) mode. RDC, enabled with the ``-fgpu-rdc`` compiler option, is necessary for linking device codes across translation units.
* - ``__HIP__``
- Defined when compiling with HIP language support, indicating that the code targets the HIP environment.
* - ``__HIPCC__``
- Alias to ``__HIP__``.
* - ``__HIP_DEVICE_COMPILE__``
- Defined during device code compilation in Clang's separate compilation process for the host and each offloading GPU architecture.
* - ``__HIP_MEMORY_SCOPE_SINGLETHREAD``
- Represents single-thread memory scope in HIP (value is 1).
* - ``__HIP_MEMORY_SCOPE_WAVEFRONT``
- Represents wavefront memory scope in HIP (value is 2).
* - ``__HIP_MEMORY_SCOPE_WORKGROUP``
- Represents workgroup memory scope in HIP (value is 3).
* - ``__HIP_MEMORY_SCOPE_AGENT``
- Represents agent memory scope in HIP (value is 4).
* - ``__HIP_MEMORY_SCOPE_SYSTEM``
- Represents system-wide memory scope in HIP (value is 5).
* - ``__HIP_NO_IMAGE_SUPPORT__``
- Defined with a value of 1 when the target device lacks support for HIP image functions.
* - ``__HIP_NO_IMAGE_SUPPORT``
- Alias to ``__HIP_NO_IMAGE_SUPPORT__``. Deprecated.
* - ``__HIP_API_PER_THREAD_DEFAULT_STREAM__``
- Defined when the GPU default stream is set to per-thread mode.
* - ``HIP_API_PER_THREAD_DEFAULT_STREAM``
- Alias to ``__HIP_API_PER_THREAD_DEFAULT_STREAM__``. Deprecated.
Each HIP source file contains intertwined device and host code. Depending on the chosen compilation mode by the compiler options ``-fno-gpu-rdc`` and ``-fgpu-rdc``, these portions of code are compiled differently.
Device Code Compilation
-----------------------
**``-fno-gpu-rdc`` Mode (default)**:
- Compiles to a self-contained, fully linked offloading device binary for each offloading device architecture.
- Device code within a Translation Unit (TU) cannot call functions located in another TU.
**``-fgpu-rdc`` Mode**:
- Compiles to a bitcode for each GPU architecture.
- For each offloading device architecture, the bitcode from different TUs are linked together to create a single offloading device binary.
- Device code in one TU can call functions located in another TU.
Host Code Compilation
---------------------
**Both Modes**:
- Compiles to a relocatable object for each TU.
- These relocatable objects are then linked together.
- Host code within a TU can call host functions and launch kernels from another TU.
Clang's front end, used for both CUDA and HIP programming models, shares the same parsing and semantic analysis mechanisms. This includes the resolution of overloads concerning device and host functions. While there exists a comprehensive documentation on the syntax differences between Clang and NVCC for CUDA at `Dialect Differences Between Clang and NVCC <https://llvm.org/docs/CompileCudaWithLLVM.html#dialect-differences-between-clang-and-nvcc>`_, it is important to note that these differences also apply to HIP code compilation.
Predefined Macros for Differentiation
-------------------------------------
To facilitate differentiation between HIP and CUDA code, as well as between device and host compilations within HIP, Clang defines specific macros:
-``__HIP__`` : This macro is defined only when compiling HIP code. It can be used to conditionally compile code specific to HIP, enabling developers to write portable code that can be compiled for both CUDA and HIP.
-``__HIP_DEVICE_COMPILE__`` : Defined exclusively during HIP device compilation, this macro allows for conditional compilation of device-specific code. It provides a mechanism to segregate device and host code, ensuring that each can be optimized for their respective execution environments.
Function pointers' support varies with the usage mode in Clang with HIP. The following table provides an overview of the support status across different use-cases and modes.
..list-table:: Function Pointers Support Overview
:widths:25 25 25
:header-rows:1
* - Use Case
-``-fno-gpu-rdc`` Mode (default)
-``-fgpu-rdc`` Mode
* - Defined and used in the same TU
- Supported
- Supported
* - Defined in one TU and used in another TU
- Not Supported
- Supported
In the ``-fno-gpu-rdc`` mode, the compiler calculates the resource usage of kernels based only on functions present within the same TU. This mode does not support the use of function pointers defined in a different TU due to the possibility of incorrect resource usage calculations, leading to undefined behavior.
On the other hand, the ``-fgpu-rdc`` mode allows the definition and use of function pointers across different TUs, as resource usage calculations can accommodate functions from disparate TUs.
Virtual Function Support
========================
In Clang with HIP, support for calling virtual functions of an object in device or host code is contingent on where the object is constructed.
-**Constructed in Device Code**: Virtual functions of an object can be called in device code on a specific offloading device if the object is constructed in device code on an offloading device with the same architecture.
-**Constructed in Host Code**: Virtual functions of an object can be called in host code if the object is constructed in host code.
In other scenarios, calling virtual functions is not allowed.
Explanation
-----------
An object constructed on the device side contains a pointer to the virtual function table on the device side, which is not accessible in host code, and vice versa. Thus, trying to invoke virtual functions from a context different from where the object was constructed will be disallowed because the appropriate virtual table cannot be accessed. The virtual function tables for offloading devices with different architecures are different, therefore trying to invoke virtual functions from an offloading device with a different architecture than where the object is constructed is also disallowed.
Example Usage
-------------
..code-block:: c++
class Base {
public:
__device__ virtual void virtualFunction() {
// Base virtual function implementation
}
};
class Derived : public Base {
public:
__device__ void virtualFunction() override {
// Derived virtual function implementation
}
};
__global__ void kernel() {
Derived obj;
Base* basePtr = &obj;
basePtr->virtualFunction(); // Allowed since obj is constructed in device code
and release it via `hipFree <https://rocm.docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group___memory.html#ga740d08da65cae1441ba32f8fedb863d1>`_.
Predefined Macros
=================
..list-table::
:header-rows:1
* - Macro
- Description
* - ``__HIPSTDPAR__``
- Defined when Clang is compiling code in algorithm offload mode, enabled
with the ``--hipstdpar`` compiler option.
* - ``__HIPSTDPAR_INTERPOSE_ALLOC__``
- Defined only when compiling in algorithm offload mode, when the user
enables interposition mode with the ``--hipstdpar-interpose-alloc``
compiler option, indicating that all dynamic memory allocation /
deallocation functions should be replaced with accelerator aware
variants.
Restrictions
============
We define two modes in which runtime execution can occur:
1.**HMM Mode** - this assumes that the
`HMM <https://docs.kernel.org/mm/hmm.html>`_ subsystem of the Linux kernel
is used to provide transparent on-demand paging i.e. memory obtained from a
system / OS allocator such as via a call to ``malloc`` or ``operator new`` is
directly accessible to the accelerator and it follows the C++ memory model;
2.**Interposition Mode** - this is a fallback mode for cases where transparent
on-demand paging is unavailable (e.g. in the Windows OS), which means that
memory must be allocated via an accelerator aware mechanism, and system
allocated memory is inaccessible for the accelerator.
The following restrictions imposed on user code apply to both modes:
1. Pointers to function, and all associated features, such as e.g. dynamic
polymorphism, cannot be used (directly or transitively) by the user provided
`HIP kernel language <https://rocm.docs.amd.com/projects/HIP/en/latest/reference/kernel_language.html>`_;
whilst things like using `__device__` annotations might accidentally "work",
they are not guaranteed to, and thus cannot be relied upon by user code;
- A consequence of the above is that both bitcode linking and linking
relocatable object files will "work", but it is not guaranteed to remain
working or actively tested at the moment; this restriction might be relaxed
in the future.
2. Combining explicit HIP, CUDA or OpenMP Offload compilation with
``--hipstdpar`` based offloading is not allowed or supported in any way.
3. There is no way to target different accelerators via a standard algorithm
invocation (`this might be addressed in future C++ standards <https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2500r1.html>`_);
an unsafe (per the point above) way of achieving this is to spawn new threads
and invoke the `hipSetDevice <https://rocm.docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group___device.html#ga43c1e7f15925eeb762195ccb5e063eae>`_
interface e.g.:
..code-block:: c++
int accelerator_0 = ...;
int accelerator_1 = ...;
bool multiple_accelerators(const std::vector<int>& u, const std::vector<int>& v) {
std::atomic<unsigned int> r{0u};
thread t0{[&]() {
hipSetDevice(accelerator_0);
r += std::count(std::execution::par_unseq, std::cbegin(u), std::cend(u), 42);
}};
thread t1{[&]() {
hitSetDevice(accelerator_1);
r += std::count(std::execution::par_unseq, std::cbegin(v), std::cend(v), 314152)
}};
t0.join();
t1.join();
return r;
}
Note that this is a temporary, unsafe workaround for a deficiency in the C++
Standard.
Open Questions / Future Developments
====================================
1. The restriction on the use of global / namespace scope / ``static`` /
``thread`` storage duration variables in offloaded algorithms will be lifted
in the future, when running in **HMM Mode**;
2. The restriction on the use of dynamic memory allocation in offloaded
algorithms will be lifted in the future.
3. The restriction on the use of pointers to function, and associated features
such as dynamic polymorphism might be lifted in the future, when running in
**HMM Mode**;
4. Offload support might be extended to cases where the ``parallel_policy`` is