llama.cpp/docs/docker.md

# Docker

## Prerequisites
* Docker must be installed and running on your system.
* Create a folder to store big models & intermediate files (ex. /llama/models)

## Images
We have three Docker images available for this project:

1. `ghcr.io/ggml-org/llama.cpp:full`: This image includes both the main executable file and the tools to convert LLaMA models into ggml and convert into 4-bit quantization. (platforms: `linux/amd64`, `linux/arm64`)
2. `ghcr.io/ggml-org/llama.cpp:light`: This image only includes the main executable file. (platforms: `linux/amd64`, `linux/arm64`)
3. `ghcr.io/ggml-org/llama.cpp:server`: This image only includes the server executable file. (platforms: `linux/amd64`, `linux/arm64`)

Additionally, there the following images, similar to the above:

- `ghcr.io/ggml-org/llama.cpp:full-cuda`: Same as `full` but compiled with CUDA support. (platforms: `linux/amd64`)
- `ghcr.io/ggml-org/llama.cpp:light-cuda`: Same as `light` but compiled with CUDA support. (platforms: `linux/amd64`)
- `ghcr.io/ggml-org/llama.cpp:server-cuda`: Same as `server` but compiled with CUDA support. (platforms: `linux/amd64`)
- `ghcr.io/ggml-org/llama.cpp:full-rocm`: Same as `full` but compiled with ROCm support. (platforms: `linux/amd64`, `linux/arm64`)
- `ghcr.io/ggml-org/llama.cpp:light-rocm`: Same as `light` but compiled with ROCm support. (platforms: `linux/amd64`, `linux/arm64`)
- `ghcr.io/ggml-org/llama.cpp:server-rocm`: Same as `server` but compiled with ROCm support. (platforms: `linux/amd64`, `linux/arm64`)
- `ghcr.io/ggml-org/llama.cpp:full-musa`: Same as `full` but compiled with MUSA support. (platforms: `linux/amd64`)
- `ghcr.io/ggml-org/llama.cpp:light-musa`: Same as `light` but compiled with MUSA support. (platforms: `linux/amd64`)
- `ghcr.io/ggml-org/llama.cpp:server-musa`: Same as `server` but compiled with MUSA support. (platforms: `linux/amd64`)

The GPU enabled images are not currently tested by CI beyond being built. They are not built with any variation from the ones in the Dockerfiles defined in [.devops/](../.devops/) and the GitHub Action defined in [.github/workflows/docker.yml](../.github/workflows/docker.yml). If you need different settings (for example, a different CUDA, ROCm or MUSA library, you'll need to build the images locally for now).

## Usage

The easiest way to download the models, convert them to ggml and optimize them is with the --all-in-one command which includes the full docker image.

Replace `/path/to/models` below with the actual path where you downloaded the models.

```bash
docker run -v /path/to/models:/models ghcr.io/ggml-org/llama.cpp:full --all-in-one "/models/" 7B
```

On completion, you are ready to play!

```bash
docker run -v /path/to/models:/models ghcr.io/ggml-org/llama.cpp:full --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512
```

or with a light image:

```bash
docker run -v /path/to/models:/models ghcr.io/ggml-org/llama.cpp:light -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512
```

or with a server image:

```bash
docker run -v /path/to/models:/models -p 8000:8000 ghcr.io/ggml-org/llama.cpp:server -m /models/7B/ggml-model-q4_0.gguf --port 8000 --host 0.0.0.0 -n 512
```

## Docker With CUDA

Assuming one has the [nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit) properly installed on Linux, or is using a GPU enabled cloud, `cuBLAS` should be accessible inside the container.

## Building Docker locally

```bash
docker build -t local/llama.cpp:full-cuda --target full -f .devops/cuda.Dockerfile .
docker build -t local/llama.cpp:light-cuda --target light -f .devops/cuda.Dockerfile .
docker build -t local/llama.cpp:server-cuda --target server -f .devops/cuda.Dockerfile .
```

You may want to pass in some different `ARGS`, depending on the CUDA environment supported by your container host, as well as the GPU architecture.

The defaults are:

- `CUDA_VERSION` set to `12.4.0`
- `CUDA_DOCKER_ARCH` set to the cmake build default, which includes all the supported architectures

The resulting images, are essentially the same as the non-CUDA images:

1. `local/llama.cpp:full-cuda`: This image includes both the main executable file and the tools to convert LLaMA models into ggml and convert into 4-bit quantization.
2. `local/llama.cpp:light-cuda`: This image only includes the main executable file.
3. `local/llama.cpp:server-cuda`: This image only includes the server executable file.

## Usage

After building locally, Usage is similar to the non-CUDA examples, but you'll need to add the `--gpus` flag. You will also want to use the `--n-gpu-layers` flag.

```bash
docker run --gpus all -v /path/to/models:/models local/llama.cpp:full-cuda --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
docker run --gpus all -v /path/to/models:/models local/llama.cpp:light-cuda -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
docker run --gpus all -v /path/to/models:/models local/llama.cpp:server-cuda -m /models/7B/ggml-model-q4_0.gguf --port 8000 --host 0.0.0.0 -n 512 --n-gpu-layers 1
```

## Docker With MUSA

Assuming one has the [mt-container-toolkit](https://developer.mthreads.com/musa/native) properly installed on Linux, `muBLAS` should be accessible inside the container.

## Building Docker locally

```bash
docker build -t local/llama.cpp:full-musa --target full -f .devops/musa.Dockerfile .
docker build -t local/llama.cpp:light-musa --target light -f .devops/musa.Dockerfile .
docker build -t local/llama.cpp:server-musa --target server -f .devops/musa.Dockerfile .
```

You may want to pass in some different `ARGS`, depending on the MUSA environment supported by your container host, as well as the GPU architecture.

The defaults are:

- `MUSA_VERSION` set to `rc3.1.1`

The resulting images, are essentially the same as the non-MUSA images:

1. `local/llama.cpp:full-musa`: This image includes both the main executable file and the tools to convert LLaMA models into ggml and convert into 4-bit quantization.
2. `local/llama.cpp:light-musa`: This image only includes the main executable file.
3. `local/llama.cpp:server-musa`: This image only includes the server executable file.

## Usage

After building locally, Usage is similar to the non-MUSA examples, but you'll need to set `mthreads` as default Docker runtime. This can be done by executing `(cd /usr/bin/musa && sudo ./docker setup $PWD)` and verifying the changes by executing `docker info | grep mthreads` on the host machine. You will also want to use the `--n-gpu-layers` flag.

```bash
docker run -v /path/to/models:/models local/llama.cpp:full-musa --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
docker run -v /path/to/models:/models local/llama.cpp:light-musa -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
docker run -v /path/to/models:/models local/llama.cpp:server-musa -m /models/7B/ggml-model-q4_0.gguf --port 8000 --host 0.0.0.0 -n 512 --n-gpu-layers 1
```
Reorganize documentation pages (#8325) * re-organize docs * add link among docs * add link to build docs * fix style * de-duplicate sections 2024-07-05 18:08:32 +02:00			`# Docker`

			`## Prerequisites`
			`* Docker must be installed and running on your system.`
			`* Create a folder to store big models & intermediate files (ex. /llama/models)`

			`## Images`
			`We have three Docker images available for this project:`

repo : update links to new url (#11886) * repo : update links to new url ggml-ci * cont : more urls ggml-ci 2025-02-15 16:40:57 +02:00			1. `ghcr.io/ggml-org/llama.cpp:full`: This image includes both the main executable file and the tools to convert LLaMA models into ggml and convert into 4-bit quantization. (platforms: `linux/amd64`, `linux/arm64`)
			2. `ghcr.io/ggml-org/llama.cpp:light`: This image only includes the main executable file. (platforms: `linux/amd64`, `linux/arm64`)
			3. `ghcr.io/ggml-org/llama.cpp:server`: This image only includes the server executable file. (platforms: `linux/amd64`, `linux/arm64`)
Reorganize documentation pages (#8325) * re-organize docs * add link among docs * add link to build docs * fix style * de-duplicate sections 2024-07-05 18:08:32 +02:00
			`Additionally, there the following images, similar to the above:`

repo : update links to new url (#11886) * repo : update links to new url ggml-ci * cont : more urls ggml-ci 2025-02-15 16:40:57 +02:00			- `ghcr.io/ggml-org/llama.cpp:full-cuda`: Same as `full` but compiled with CUDA support. (platforms: `linux/amd64`)
			- `ghcr.io/ggml-org/llama.cpp:light-cuda`: Same as `light` but compiled with CUDA support. (platforms: `linux/amd64`)
			- `ghcr.io/ggml-org/llama.cpp:server-cuda`: Same as `server` but compiled with CUDA support. (platforms: `linux/amd64`)
			- `ghcr.io/ggml-org/llama.cpp:full-rocm`: Same as `full` but compiled with ROCm support. (platforms: `linux/amd64`, `linux/arm64`)
			- `ghcr.io/ggml-org/llama.cpp:light-rocm`: Same as `light` but compiled with ROCm support. (platforms: `linux/amd64`, `linux/arm64`)
			- `ghcr.io/ggml-org/llama.cpp:server-rocm`: Same as `server` but compiled with ROCm support. (platforms: `linux/amd64`, `linux/arm64`)
			- `ghcr.io/ggml-org/llama.cpp:full-musa`: Same as `full` but compiled with MUSA support. (platforms: `linux/amd64`)
			- `ghcr.io/ggml-org/llama.cpp:light-musa`: Same as `light` but compiled with MUSA support. (platforms: `linux/amd64`)
			- `ghcr.io/ggml-org/llama.cpp:server-musa`: Same as `server` but compiled with MUSA support. (platforms: `linux/amd64`)
Reorganize documentation pages (#8325) * re-organize docs * add link among docs * add link to build docs * fix style * de-duplicate sections 2024-07-05 18:08:32 +02:00
musa: add docker image support (#9685) * mtgpu: add docker image support Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * mtgpu: enable docker workflow Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> 2024-10-11 02:10:37 +08:00			`The GPU enabled images are not currently tested by CI beyond being built. They are not built with any variation from the ones in the Dockerfiles defined in [.devops/](../.devops/) and the GitHub Action defined in [.github/workflows/docker.yml](../.github/workflows/docker.yml). If you need different settings (for example, a different CUDA, ROCm or MUSA library, you'll need to build the images locally for now).`
Reorganize documentation pages (#8325) * re-organize docs * add link among docs * add link to build docs * fix style * de-duplicate sections 2024-07-05 18:08:32 +02:00
			`## Usage`

			`The easiest way to download the models, convert them to ggml and optimize them is with the --all-in-one command which includes the full docker image.`

			Replace `/path/to/models` below with the actual path where you downloaded the models.

			```bash
repo : update links to new url (#11886) * repo : update links to new url ggml-ci * cont : more urls ggml-ci 2025-02-15 16:40:57 +02:00			`docker run -v /path/to/models:/models ghcr.io/ggml-org/llama.cpp:full --all-in-one "/models/" 7B`
Reorganize documentation pages (#8325) * re-organize docs * add link among docs * add link to build docs * fix style * de-duplicate sections 2024-07-05 18:08:32 +02:00			```

			`On completion, you are ready to play!`

			```bash
repo : update links to new url (#11886) * repo : update links to new url ggml-ci * cont : more urls ggml-ci 2025-02-15 16:40:57 +02:00			`docker run -v /path/to/models:/models ghcr.io/ggml-org/llama.cpp:full --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512`
Reorganize documentation pages (#8325) * re-organize docs * add link among docs * add link to build docs * fix style * de-duplicate sections 2024-07-05 18:08:32 +02:00			```

			`or with a light image:`

			```bash
repo : update links to new url (#11886) * repo : update links to new url ggml-ci * cont : more urls ggml-ci 2025-02-15 16:40:57 +02:00			`docker run -v /path/to/models:/models ghcr.io/ggml-org/llama.cpp:light -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512`
Reorganize documentation pages (#8325) * re-organize docs * add link among docs * add link to build docs * fix style * de-duplicate sections 2024-07-05 18:08:32 +02:00			```

			`or with a server image:`

			```bash
repo : update links to new url (#11886) * repo : update links to new url ggml-ci * cont : more urls ggml-ci 2025-02-15 16:40:57 +02:00			`docker run -v /path/to/models:/models -p 8000:8000 ghcr.io/ggml-org/llama.cpp:server -m /models/7B/ggml-model-q4_0.gguf --port 8000 --host 0.0.0.0 -n 512`
Reorganize documentation pages (#8325) * re-organize docs * add link among docs * add link to build docs * fix style * de-duplicate sections 2024-07-05 18:08:32 +02:00			```

			`## Docker With CUDA`

			Assuming one has the [nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit) properly installed on Linux, or is using a GPU enabled cloud, `cuBLAS` should be accessible inside the container.

			`## Building Docker locally`

			```bash
docs : Update readme to build targets for local docker build (#11368) 2025-01-24 13:30:13 +00:00			`docker build -t local/llama.cpp:full-cuda --target full -f .devops/cuda.Dockerfile .`
			`docker build -t local/llama.cpp:light-cuda --target light -f .devops/cuda.Dockerfile .`
			`docker build -t local/llama.cpp:server-cuda --target server -f .devops/cuda.Dockerfile .`
Reorganize documentation pages (#8325) * re-organize docs * add link among docs * add link to build docs * fix style * de-duplicate sections 2024-07-05 18:08:32 +02:00			```

			You may want to pass in some different `ARGS`, depending on the CUDA environment supported by your container host, as well as the GPU architecture.

			`The defaults are:`

docker : drop to CUDA 12.4 (#11869) * docker : drop to CUDA 12.4 * docker : update readme [no ci] 2025-02-14 14:48:40 +02:00			- `CUDA_VERSION` set to `12.4.0`
docker : update CUDA images (#9213) 2024-08-28 13:20:36 +02:00			- `CUDA_DOCKER_ARCH` set to the cmake build default, which includes all the supported architectures
Reorganize documentation pages (#8325) * re-organize docs * add link among docs * add link to build docs * fix style * de-duplicate sections 2024-07-05 18:08:32 +02:00
			`The resulting images, are essentially the same as the non-CUDA images:`

			1. `local/llama.cpp:full-cuda`: This image includes both the main executable file and the tools to convert LLaMA models into ggml and convert into 4-bit quantization.
			2. `local/llama.cpp:light-cuda`: This image only includes the main executable file.
			3. `local/llama.cpp:server-cuda`: This image only includes the server executable file.

			`## Usage`

			After building locally, Usage is similar to the non-CUDA examples, but you'll need to add the `--gpus` flag. You will also want to use the `--n-gpu-layers` flag.

			```bash
			`docker run --gpus all -v /path/to/models:/models local/llama.cpp:full-cuda --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1`
			`docker run --gpus all -v /path/to/models:/models local/llama.cpp:light-cuda -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1`
			`docker run --gpus all -v /path/to/models:/models local/llama.cpp:server-cuda -m /models/7B/ggml-model-q4_0.gguf --port 8000 --host 0.0.0.0 -n 512 --n-gpu-layers 1`
			```
musa: add docker image support (#9685) * mtgpu: add docker image support Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * mtgpu: enable docker workflow Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> 2024-10-11 02:10:37 +08:00
			`## Docker With MUSA`

			Assuming one has the [mt-container-toolkit](https://developer.mthreads.com/musa/native) properly installed on Linux, `muBLAS` should be accessible inside the container.

			`## Building Docker locally`

			```bash
docs : Update readme to build targets for local docker build (#11368) 2025-01-24 13:30:13 +00:00			`docker build -t local/llama.cpp:full-musa --target full -f .devops/musa.Dockerfile .`
			`docker build -t local/llama.cpp:light-musa --target light -f .devops/musa.Dockerfile .`
			`docker build -t local/llama.cpp:server-musa --target server -f .devops/musa.Dockerfile .`
musa: add docker image support (#9685) * mtgpu: add docker image support Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * mtgpu: enable docker workflow Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> 2024-10-11 02:10:37 +08:00			```

			You may want to pass in some different `ARGS`, depending on the MUSA environment supported by your container host, as well as the GPU architecture.

			`The defaults are:`

musa: bump MUSA SDK version to rc3.1.1 (#11822) * musa: Update MUSA SDK version to rc3.1.1 Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: Remove workaround in PR #10042 Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> 2025-02-13 20:28:18 +08:00			- `MUSA_VERSION` set to `rc3.1.1`
musa: add docker image support (#9685) * mtgpu: add docker image support Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * mtgpu: enable docker workflow Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> 2024-10-11 02:10:37 +08:00
			`The resulting images, are essentially the same as the non-MUSA images:`

			1. `local/llama.cpp:full-musa`: This image includes both the main executable file and the tools to convert LLaMA models into ggml and convert into 4-bit quantization.
			2. `local/llama.cpp:light-musa`: This image only includes the main executable file.
			3. `local/llama.cpp:server-musa`: This image only includes the server executable file.

			`## Usage`

			After building locally, Usage is similar to the non-MUSA examples, but you'll need to set `mthreads` as default Docker runtime. This can be done by executing `(cd /usr/bin/musa && sudo ./docker setup $PWD)` and verifying the changes by executing `docker info \| grep mthreads` on the host machine. You will also want to use the `--n-gpu-layers` flag.

			```bash
			`docker run -v /path/to/models:/models local/llama.cpp:full-musa --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1`
			`docker run -v /path/to/models:/models local/llama.cpp:light-musa -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1`
			`docker run -v /path/to/models:/models local/llama.cpp:server-musa -m /models/7B/ggml-model-q4_0.gguf --port 8000 --host 0.0.0.0 -n 512 --n-gpu-layers 1`
			```