Overview
Important
This example and the RPC backend are currently in a proof-of-concept development stage. As such, the functionality is fragile and insecure. Never run the RPC server on an open network or in a sensitive environment!
The rpc-server allows running a ggml backend on a remote host.
The RPC backend communicates with one or more instances of rpc-server and offloads computations to them.
This can be used for distributed LLM inference with llama.cpp in the following way:
flowchart TD
rpcb<-->|TCP|srva
rpcb<-->|TCP|srvb
rpcb<-.->|TCP|srvn
subgraph hostn[Host N]
srvn[rpc-server]<-.->backend3["Backend (CUDA,Metal,etc.)"]
end
subgraph hostb[Host B]
srvb[rpc-server]<-->backend2["Backend (CUDA,Metal,etc.)"]
end
subgraph hosta[Host A]
srva[rpc-server]<-->backend["Backend (CUDA,Metal,etc.)"]
end
subgraph host[Main Host]
local["Backend (CUDA,Metal,etc.)"]<-->ggml[llama-cli]
ggml[llama-cli]<-->rpcb[RPC backend]
end
style hostn stroke:#66,stroke-width:2px,stroke-dasharray: 5 5
Each host can run a different backend, e.g. one with CUDA and another with Metal.
You can also run multiple rpc-server instances on the same host, each with a different backend.
Usage
On each host, build the corresponding backend with cmake and add -DGGML_RPC=ON to the build options.
For example, to build the CUDA backend with RPC support:
mkdir build-rpc-cuda
cd build-rpc-cuda
cmake .. -DGGML_CUDA=ON -DGGML_RPC=ON
cmake --build . --config Release
Then, start the rpc-server with the backend:
$ bin/rpc-server -p 50052
create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA T1200 Laptop GPU, compute capability 7.5, VMM: yes
Starting RPC server on 0.0.0.0:50052
When using the CUDA backend, you can specify the device with the CUDA_VISIBLE_DEVICES environment variable, e.g.:
$ CUDA_VISIBLE_DEVICES=0 bin/rpc-server -p 50052
This way you can run multiple rpc-server instances on the same host, each with a different CUDA device.
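The one-server-per-device setup can be sketched as a small dry-run script. It only prints the launch commands and starts nothing. The device list and the starting port are assumptions chosen for illustration; the only rpc-server inputs used are the -p option and the CUDA_VISIBLE_DEVICES variable shown above.

```shell
# Dry run: print one rpc-server launch command per CUDA device.
# BASE_PORT and DEVICES are illustrative assumptions, not project defaults.
BASE_PORT=50052
DEVICES="0 1"

# Emit "CUDA_VISIBLE_DEVICES=<id> bin/rpc-server -p <port>" for each device,
# assigning consecutive ports starting at BASE_PORT.
rpc_launch_commands() {
    port=$BASE_PORT
    for dev in $DEVICES; do
        echo "CUDA_VISIBLE_DEVICES=$dev bin/rpc-server -p $port"
        port=$((port + 1))
    done
}

rpc_launch_commands
```

Each printed line can be run in its own terminal. Giving every instance a distinct port is what lets the main host address the devices separately.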
On the main host, build llama.cpp for the local backend and add -DGGML_RPC=ON to the build options.
Finally, when running llama-cli, use the --rpc option to specify the host and port of each rpc-server:
$ bin/llama-cli -m ../models/tinyllama-1b/ggml-model-f16.gguf -p "Hello, my name is" --repeat-penalty 1.0 -n 64 --rpc 192.168.88.10:50052,192.168.88.11:50052 -ngl 99
This way you can offload model layers to both local and remote devices.
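With more than a couple of servers, the comma-separated --rpc value gets tedious to type by hand. The following is a minimal sketch that assembles it from a host list. RPC_HOSTS, RPC_PORT, and rpc_endpoints are hypothetical names introduced here, and the sketch assumes every server listens on the same port.

```shell
# Build the comma-separated value for --rpc from a list of hosts.
# RPC_HOSTS and RPC_PORT are hypothetical variables for this sketch.
RPC_HOSTS="192.168.88.10 192.168.88.11"
RPC_PORT=50052

rpc_endpoints() {
    list=""
    for host in $RPC_HOSTS; do
        # Append "host:port", inserting a comma between entries.
        list="${list:+$list,}${host}:${RPC_PORT}"
    done
    echo "$list"
}

rpc_endpoints
```

The result can be passed straight to the command above by writing --rpc "$(rpc_endpoints)" in place of the literal address list.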
Local cache
The RPC server can use a local cache to store large tensors and avoid transferring them over the network.
This can speed up model loading significantly, especially when using large models.
To enable the cache, use the -c option:
$ bin/rpc-server -c
By default, the cache is stored in the $HOME/.cache/llama.cpp/rpc directory; its location can be controlled via the LLAMA_CACHE environment variable.