llvm-project/libc/src/__support/RPC/rpc_util.h

//===-- Shared memory RPC client / server utilities -------------*- C++ -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#ifndef LLVM_LIBC_SRC_SUPPORT_RPC_RPC_UTILS_H
#define LLVM_LIBC_SRC_SUPPORT_RPC_RPC_UTILS_H

#include "src/__support/CPP/type_traits.h"
#include "src/__support/GPU/utils.h"
#include "src/__support/macros/attributes.h"
#include "src/__support/macros/properties/architectures.h"

namespace __llvm_libc {
namespace rpc {

/// Maximum amount of data, in bytes, a single lane can use.
constexpr uint64_t MAX_LANE_SIZE = 64;

/// Suspend the thread briefly to assist the thread scheduler during busy loops.
LIBC_INLINE void sleep_briefly() {
#if defined(LIBC_TARGET_ARCH_IS_NVPTX) && __CUDA_ARCH__ >= 700
  LIBC_INLINE_ASM("nanosleep.u32 64;" ::: "memory");
#elif defined(LIBC_TARGET_ARCH_IS_AMDGPU)
  __builtin_amdgcn_s_sleep(2);
#else
  // Simply do nothing if sleeping isn't supported on this platform.
#endif
}

/// Get the ID of the first active lane in \p lane_mask.
LIBC_INLINE uint64_t get_first_lane_id(uint64_t lane_mask) {
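  // Use the `long long` variant so all 64 mask bits are scanned even on hosts
  // where `long` is 32 bits wide.
  return __builtin_ffsll(lane_mask) - 1;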
}

/// Conditional that is true only for the first active lane in \p lane_mask.
LIBC_INLINE bool is_first_lane(uint64_t lane_mask) {
  return gpu::get_lane_id() == get_first_lane_id(lane_mask);
}

/// Conditional to indicate if this process is running on the GPU.
LIBC_INLINE constexpr bool is_process_gpu() {
#if defined(LIBC_TARGET_ARCH_IS_GPU)
  return true;
#else
  return false;
#endif
}

/// Return \p val aligned "upwards" according to \p align.
template <typename V, typename A> LIBC_INLINE V align_up(V val, A align) {
  return ((val + V(align) - 1) / V(align)) * V(align);
}

/// Utility to provide a unified interface between the CPU and GPU's memory
/// model. On the GPU, stack variables are always private to a lane, so we can
/// simply use the variable passed in. On the CPU we need to allocate enough
/// space for every lane and index into it with the lane \p id.
template <typename V> LIBC_INLINE V &lane_value(V *val, uint32_t id) {
  if constexpr (is_process_gpu())
    return *val;
  return val[id];
}

/// Helper to get the maximum value.
template <typename T> LIBC_INLINE const T &max(const T &x, const T &y) {
  return x < y ? y : x;
}

/// Advance \p ptr by \p bytes.
template <typename T, typename U> LIBC_INLINE T *advance(T *ptr, U bytes) {
  if constexpr (cpp::is_const_v<T>)
    return reinterpret_cast<T *>(reinterpret_cast<const uint8_t *>(ptr) +
                                 bytes);
  else
    return reinterpret_cast<T *>(reinterpret_cast<uint8_t *>(ptr) + bytes);
}

} // namespace rpc
} // namespace __llvm_libc

#endif
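
The alignment and pointer helpers above can be tried on a host without the libc build. The following is a minimal sketch, not the library code: the `sketch` namespace, the use of `std::is_const_v` in place of `cpp::is_const_v`, and the `lane_offset` helper are all assumptions made for illustration.

```cpp
#include <stdint.h>
#include <type_traits>

namespace sketch {

// Stand-in for rpc::align_up: round `val` up to the next multiple of `align`.
template <typename V, typename A> V align_up(V val, A align) {
  return ((val + V(align) - 1) / V(align)) * V(align);
}

// Stand-in for rpc::advance: move a typed pointer forward by a byte count,
// preserving constness of the pointee.
template <typename T, typename U> T *advance(T *ptr, U bytes) {
  if constexpr (std::is_const_v<T>)
    return reinterpret_cast<T *>(reinterpret_cast<const uint8_t *>(ptr) +
                                 bytes);
  else
    return reinterpret_cast<T *>(reinterpret_cast<uint8_t *>(ptr) + bytes);
}

// Hypothetical helper: byte offset of lane `id` when every lane owns a slot
// of `payload` bytes rounded up to `alignment`.
inline uint64_t lane_offset(uint32_t id, uint64_t payload,
                            uint64_t alignment) {
  return uint64_t(id) * align_up(payload, alignment);
}

} // namespace sketch
```

With a 13-byte payload and 8-byte alignment each slot occupies 16 bytes, so lane 3 would start at byte 48 under this assumed layout.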

[libc] Support suspending threads during RPC spin loops

The RPC interface relies on waiting on atomic signals to coordinate which side
of the protocol is in control of the shared buffer. The GPU client supports
briefly suspending the executing thread group. This is used by the thread
scheduler to identify which thread groups can be switched out so that others
may execute. This allows us to ensure that other threads get a chance to make
forward progress while these threads wait on the atomic signal.

This is currently only relevant on the client-side. We could use an alternative
implementation on the server that uses the standard `nanosleep` on supported
hosts.

Reviewed By: JonChesterfield, tianshilei1992
Differential Revision: https://reviews.llvm.org/D147238
2023-03-30 09:50:56 -05:00
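
The spin-wait pattern this commit describes can be sketched on a host as follows. This is an illustration only: `wait_for` is a hypothetical name, and `std::this_thread::yield()` is an assumed host stand-in for the GPU `nanosleep` / `s_sleep` instructions that `sleep_briefly` issues.

```cpp
#include <atomic>
#include <stdint.h>
#include <thread>

namespace sketch {

// Host stand-in for rpc::sleep_briefly (assumption: yielding plays the role
// of the GPU sleep instruction here).
inline void sleep_briefly() { std::this_thread::yield(); }

// Spin until `signal` holds `expected`, giving up the core between polls.
// Returns how many times the caller had to back off before the value arrived.
inline uint64_t wait_for(const std::atomic<uint32_t> &signal,
                         uint32_t expected) {
  uint64_t backoffs = 0;
  // Acquire ordering so data written before the signal is visible afterwards.
  while (signal.load(std::memory_order_acquire) != expected) {
    sleep_briefly();
    ++backoffs;
  }
  return backoffs;
}

} // namespace sketch
```

The other side of the protocol would publish its data and then store the expected value with release ordering; if the signal is already set, the loop body never runs.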

[libc] More efficiently send bytes via `send_n` and `recv_n`

Currently we have the `send_n` and `recv_n` routines to stream data, such as a
string to print, to the other side. The first operation is to send the size so
the other side knows the number of bytes to receive. However, this wasted 56
bytes that could've been sent. This meant that small values, like the arguments
to a function to call on the host for example, needed to perform an extra send.
This patch sends the first 56 bytes in the first packet and continues if
necessary.

Depends on D150992

Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D151041
2023-05-19 11:17:42 -05:00
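
The arithmetic behind the 56-byte figure can be sketched as below. The layout is an assumption inferred from `MAX_LANE_SIZE = 64` minus an 8-byte size field; `Packet`, `pack_first`, and `extra_packets` are hypothetical names, not the `send_n` implementation.

```cpp
#include <stdint.h>
#include <string.h>

namespace sketch {

constexpr uint64_t MAX_LANE_SIZE = 64;               // value from rpc_util.h
constexpr uint64_t HEADER_SIZE = sizeof(uint64_t);   // assumed size field
constexpr uint64_t FIRST_PAYLOAD = MAX_LANE_SIZE - HEADER_SIZE; // 56 bytes

// Hypothetical per-lane packet: exactly one lane's worth of bytes.
struct Packet {
  uint8_t bytes[MAX_LANE_SIZE];
};

// Write the total size followed by as much data as fits in the first packet.
// Returns the number of payload bytes that were placed in it.
inline uint64_t pack_first(Packet &packet, const void *src, uint64_t size) {
  uint64_t sent = size < FIRST_PAYLOAD ? size : FIRST_PAYLOAD;
  memcpy(packet.bytes, &size, HEADER_SIZE);
  memcpy(packet.bytes + HEADER_SIZE, src, sent);
  return sent;
}

// Number of full-size follow-up packets needed after the first one.
inline uint64_t extra_packets(uint64_t size) {
  if (size <= FIRST_PAYLOAD)
    return 0;
  return (size - FIRST_PAYLOAD + MAX_LANE_SIZE - 1) / MAX_LANE_SIZE;
}

} // namespace sketch
```

Under these assumptions any value of 56 bytes or fewer, such as a small argument struct, fits in a single packet.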

[libc][rpc] Update locking to work on volta

Carefully work around not knowing the thread mask that nvptx intrinsic
functions require. If the warp is converged when calling try_lock, a single rpc
call will handle all lanes within it. Otherwise more than one rpc call with
thread masks that compose to the unknown one will occur.

Reviewed By: jhuber6
Differential Revision: https://reviews.llvm.org/D149897
2023-05-04 22:30:53 +01:00
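
To make the lane-mask handling concrete, here is a small host-side sketch of picking the first active lane and walking every active lane in a mask. It assumes a GCC or Clang compiler for `__builtin_ffsll`; `first_lane` mirrors `get_first_lane_id`, while `for_each_active_lane` is a hypothetical helper that is not part of the header.

```cpp
#include <stdint.h>

namespace sketch {

// Index of the lowest set bit in `mask`. `__builtin_ffsll` is 1-based and
// scans a full 64-bit value, so subtract one to get a lane ID.
inline uint64_t first_lane(uint64_t mask) {
  return __builtin_ffsll(mask) - 1;
}

// Hypothetical helper: invoke `fn(lane_id)` once per active lane, in
// ascending lane order.
template <typename F> void for_each_active_lane(uint64_t mask, F fn) {
  while (mask) {
    fn(uint32_t(first_lane(mask)));
    mask &= mask - 1; // clear the lowest set bit
  }
}

} // namespace sketch
```

A diverged warp would simply present several smaller masks, each processed the same way.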

[libc] Enable multiple threads to use RPC on the GPU

The execution model of the GPU expects that groups of threads will execute in
lock-step in SIMD fashion. It's both important for performance and correctness
that we treat this as the smallest possible granularity for an RPC operation.
Thus, we map multiple threads to a single larger buffer and ship that across
the wire.

This patch makes the necessary changes to support executing the RPC on the GPU
with multiple threads. This requires some workarounds to mimic the model when
handling the protocol from the CPU. I'm not completely happy with some of the
workarounds required, but I think it should work.

Uses some of the implementation details from D148191.

Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D148943
2023-05-04 14:53:28 -05:00

[libc] Replace use of `asm` in the GPU code with LIBC_INLINE_ASM

We should more consistently use inline assembly using the LIBC wrappers. It's
much safer to mark all of these volatile as well.

Reviewed By: lntue
Differential Revision: https://reviews.llvm.org/D152294
2023-06-06 13:22:15 -05:00

[libc] Support concurrent RPC port access on the GPU

Previously we used a single port to implement the RPC. This was sufficient for
single threaded tests but can potentially cause deadlocks when using multiple
threads. The reason for this is that GPUs make no forward progress guarantees.
Therefore one group of threads waiting on another group of threads can spin
forever because there is no guarantee that the other threads will continue
executing. The typical workaround for this is to allocate enough memory that a
sufficiently large number of work groups can make progress. As long as this
number is somewhat close to the amount of total concurrency we can obtain
reliable execution around a shared resource.

This patch enables using multiple ports by widening the arrays to a
predetermined size and indexing into them. Empty ports are currently obtained
via a trivial linear scan. This should be improved in the future for
performance reasons. Portions of D148191 were applied to achieve parallel
support.

Depends on D149581

Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D149598
2023-05-01 12:10:04 -05:00
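
A scan for a free port over a fixed-size array, as the commit message describes, might look like the sketch below. The names `PortTable`, `try_open`, `close`, and the count of 8 ports are hypothetical; the real port bookkeeping lives elsewhere in the RPC code and may differ.

```cpp
#include <atomic>
#include <stdint.h>

namespace sketch {

constexpr uint32_t PORT_COUNT = 8; // hypothetical predetermined size

struct PortTable {
  std::atomic<uint32_t> lock[PORT_COUNT];

  PortTable() {
    for (auto &l : lock)
      l.store(0, std::memory_order_relaxed); // every port starts free
  }

  // Walk the ports in order and claim the first free one.
  // Returns the port index, or -1 if every port is currently taken.
  int32_t try_open() {
    for (uint32_t i = 0; i < PORT_COUNT; ++i)
      if (lock[i].exchange(1, std::memory_order_acquire) == 0)
        return int32_t(i);
    return -1;
  }

  // Release a previously claimed port so another group can use it.
  void close(uint32_t i) { lock[i].store(0, std::memory_order_release); }
};

} // namespace sketch
```

The scan costs up to `PORT_COUNT` atomic operations per attempt, which is the performance concern the commit message flags for future work.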

[libc] Implement a generic streaming interface in the RPC

Currently we provide the `send_n` and `recv_n` functions. These were somewhat
divergent and not tested on the GPU. This patch changes the support to be more
common. We do this by making the CPU provide an array at least as large as the
lane size, while the GPU can rely on the private memory address of its stack
variables. This allows us to send data back and forth generically.

Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D150379
2023-05-11 11:11:24 -05:00
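
The difference between the two memory models can be shown on a host by turning the GPU check into a template parameter. This is a sketch only: the `IsGPU` parameter is an assumption that replaces the compile-time `is_process_gpu()` query used by `lane_value`.

```cpp
#include <stdint.h>

namespace sketch {

// Variant of rpc::lane_value with the target chosen by `IsGPU` so both
// behaviours can be exercised in one host program.
template <bool IsGPU, typename V> V &lane_value(V *val, uint32_t id) {
  if constexpr (IsGPU)
    return *val;    // GPU: each lane already has its own private copy
  else
    return val[id]; // CPU: one array slot per lane, selected by lane ID
}

} // namespace sketch
```

On the CPU side the caller would pass an array with one element per lane, whereas GPU code passes the address of an ordinary stack variable and the lane ID is ignored.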

[libc][NFC] Clean up the memory buffer handling for RPC

We do a lot of arithmetic on void pointers here, so include a helper and make
some more consistent names. Changes no functionality.

Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D150576
2023-05-15 09:46:56 -05:00
