llama.cpp/examples/llava/README-granitevision.md

# Granite Vision

Download the model and point your `GRANITE_MODEL` environment variable to the path.

```bash
$ git clone https://huggingface.co/ibm-granite/granite-vision-3.2-2b
$ export GRANITE_MODEL=./granite-vision-3.2-2b
```


### 1. Running llava surgery v2
First, we need to run the llava surgery script as shown below:

`python llava_surgery_v2.py -C -m $GRANITE_MODEL`

You should see two new files (`llava.clip` and `llava.projector`) written into your model's directory, as shown below.

```bash
$ ls $GRANITE_MODEL | grep -i llava
llava.clip
llava.projector
```

The projector and visual encoder should now be split out into the llava files. As a quick check, make sure they aren't empty:
```python
import os
import torch

MODEL_PATH = os.getenv("GRANITE_MODEL")
if not MODEL_PATH:
    raise ValueError("env var GRANITE_MODEL is unset!")

encoder_tensors = torch.load(os.path.join(MODEL_PATH, "llava.clip"))
projector_tensors = torch.load(os.path.join(MODEL_PATH, "llava.projector"))

assert len(encoder_tensors) > 0
assert len(projector_tensors) > 0
```

If you actually inspect the `.keys()` of the loaded tensors, you should see a lot of `vision_model` tensors in the `encoder_tensors`, and 5 tensors (`'multi_modal_projector.linear_1.bias'`, `'multi_modal_projector.linear_1.weight'`, `'multi_modal_projector.linear_2.bias'`, `'multi_modal_projector.linear_2.weight'`, `'image_newline'`) in the multimodal `projector_tensors`.
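The key inspection described above can also be scripted. The following is a minimal sketch and not part of the conversion scripts: `summarize_keys` and `missing_projector_keys` are hypothetical helper names, and the expected projector names are simply the five listed above.

```python
from collections import Counter

# The five projector tensor names listed above.
EXPECTED_PROJECTOR_KEYS = {
    "multi_modal_projector.linear_1.bias",
    "multi_modal_projector.linear_1.weight",
    "multi_modal_projector.linear_2.bias",
    "multi_modal_projector.linear_2.weight",
    "image_newline",
}


def summarize_keys(keys):
    """Count tensor names by their top-level prefix (text before the first dot)."""
    return Counter(key.split(".", 1)[0] for key in keys)


def missing_projector_keys(keys):
    """Return the expected projector tensor names that are absent from `keys`, sorted."""
    return sorted(EXPECTED_PROJECTOR_KEYS - set(keys))
```

With the tensors loaded as in the check above, `summarize_keys(encoder_tensors.keys())` should show many `vision_model` entries, and `missing_projector_keys(projector_tensors.keys())` should return an empty list.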


### 2. Creating the Visual Component GGUF
Next, create a new directory to hold the visual components, and copy the llava.clip/projector files, as shown below.

```bash
$ ENCODER_PATH=$PWD/visual_encoder
$ mkdir $ENCODER_PATH

$ cp $GRANITE_MODEL/llava.clip $ENCODER_PATH/pytorch_model.bin
$ cp $GRANITE_MODEL/llava.projector $ENCODER_PATH/
```

Now, we need to write a config for the visual encoder. In order to convert the model, be sure to use the correct `image_grid_pinpoints`, as these may vary based on the model. You can find the `image_grid_pinpoints` in `$GRANITE_MODEL/config.json`.

```json
{
    "_name_or_path": "siglip-model",
    "architectures": [
      "SiglipVisionModel"
    ],
    "image_grid_pinpoints": [
        [384,384],
        [384,768],
        [384,1152],
        [384,1536],
        [384,1920],
        [384,2304],
        [384,2688],
        [384,3072],
        [384,3456],
        [384,3840],
        [768,384],
        [768,768],
        [768,1152],
        [768,1536],
        [768,1920],
        [1152,384],
        [1152,768],
        [1152,1152],
        [1536,384],
        [1536,768],
        [1920,384],
        [1920,768],
        [2304,384],
        [2688,384],
        [3072,384],
        [3456,384],
        [3840,384]
    ],
    "mm_patch_merge_type": "spatial_unpad",
    "hidden_size": 1152,
    "image_size": 384,
    "intermediate_size": 4304,
    "model_type": "siglip_vision_model",
    "num_attention_heads": 16,
    "num_hidden_layers": 27,
    "patch_size": 14,
    "layer_norm_eps": 1e-6,
    "hidden_act": "gelu_pytorch_tanh",
    "projection_dim": 0,
    "vision_feature_layer": [-24, -20, -12, -1]
}
```
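To avoid copying the pinpoints by hand, the config could be generated from the model's own `config.json`. This is a sketch under assumptions: `build_encoder_config` and `write_encoder_config` are hypothetical helpers, `image_grid_pinpoints` is assumed to be a top-level key of `$GRANITE_MODEL/config.json`, and every other value is copied from the example config above rather than read from the model.

```python
import copy
import json
import os

# Static fields copied from the example config above (not read from the model).
SIGLIP_TEMPLATE = {
    "_name_or_path": "siglip-model",
    "architectures": ["SiglipVisionModel"],
    "mm_patch_merge_type": "spatial_unpad",
    "hidden_size": 1152,
    "image_size": 384,
    "intermediate_size": 4304,
    "model_type": "siglip_vision_model",
    "num_attention_heads": 16,
    "num_hidden_layers": 27,
    "patch_size": 14,
    "layer_norm_eps": 1e-6,
    "hidden_act": "gelu_pytorch_tanh",
    "projection_dim": 0,
    "vision_feature_layer": [-24, -20, -12, -1],
}


def build_encoder_config(model_config):
    """Combine the static template with the model's own image_grid_pinpoints."""
    if "image_grid_pinpoints" not in model_config:
        # Assumption: the key is top-level; fail loudly if it is not.
        raise KeyError("image_grid_pinpoints not found at the top level of config.json")
    config = copy.deepcopy(SIGLIP_TEMPLATE)
    config["image_grid_pinpoints"] = [list(p) for p in model_config["image_grid_pinpoints"]]
    return config


def write_encoder_config(model_dir, encoder_dir):
    """Read <model_dir>/config.json and write <encoder_dir>/config.json."""
    with open(os.path.join(model_dir, "config.json")) as f:
        model_config = json.load(f)
    out_path = os.path.join(encoder_dir, "config.json")
    with open(out_path, "w") as f:
        json.dump(build_encoder_config(model_config), f, indent=4)
    return out_path
```

Calling `write_encoder_config(os.environ["GRANITE_MODEL"], os.environ["ENCODER_PATH"])` would then write the file next to the copied weights; note that `ENCODER_PATH` has to be exported for Python to see it.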

At this point you should have something like this:
```bash
$ ls $ENCODER_PATH
config.json             llava.projector         pytorch_model.bin
```
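The same check can be done programmatically before running the converter. This is a minimal sketch; `missing_encoder_files` is a hypothetical helper, and the file names are the three shown in the listing above.

```python
import os

# The three files the listing above shows in the encoder directory.
REQUIRED_FILES = ("config.json", "llava.projector", "pytorch_model.bin")


def missing_encoder_files(encoder_dir):
    """Return the required files that are absent or empty in `encoder_dir`."""
    missing = []
    for name in REQUIRED_FILES:
        path = os.path.join(encoder_dir, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing
```

An empty list means the directory is ready for conversion; anything else names the files that still need to be created or copied.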

Now convert the components to GGUF. Note that we also override the image mean/std dev to `[.5,.5,.5]` since we use the SigLIP visual encoder; in the transformers model, you can find these numbers in `preprocessor_config.json`.
```bash
$ python convert_image_encoder_to_gguf.py \
    -m $ENCODER_PATH \
    --llava-projector $ENCODER_PATH/llava.projector \
    --output-dir $ENCODER_PATH \
    --clip-model-is-vision \
    --clip-model-is-siglip \
    --image-mean 0.5 0.5 0.5 \
    --image-std 0.5 0.5 0.5
```

This will create the first GGUF file at `$ENCODER_PATH/mmproj-model-f16.gguf`; we will refer to the absolute path of this file as `$VISUAL_GGUF_PATH`.
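The mean/std override used in the conversion command does not have to be typed by hand. As a sketch, the two flags can be derived from the model's `preprocessor_config.json`; `normalization_args` and `load_normalization_args` are hypothetical helpers, and they assume the file uses the usual `image_mean` and `image_std` keys with one value per channel.

```python
import json
import os


def normalization_args(preprocessor_config):
    """Build the --image-mean / --image-std arguments from a preprocessor config dict."""
    mean = preprocessor_config["image_mean"]
    std = preprocessor_config["image_std"]
    if len(mean) != 3 or len(std) != 3:
        raise ValueError("expected three values (one per channel) for mean and std")
    return ["--image-mean", *map(str, mean), "--image-std", *map(str, std)]


def load_normalization_args(model_dir):
    """Read <model_dir>/preprocessor_config.json and return the converter arguments."""
    with open(os.path.join(model_dir, "preprocessor_config.json")) as f:
        return normalization_args(json.load(f))
```

For the values quoted above, `normalization_args` returns `['--image-mean', '0.5', '0.5', '0.5', '--image-std', '0.5', '0.5', '0.5']`, which matches the flags in the command.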


### 3. Creating the LLM GGUF
The granite vision model contains a granite LLM as its language model. For now, the easiest way to get the GGUF for the LLM is to load the composite model in `transformers` and export the LLM so that it can be converted directly with the normal conversion path.

First, set `LLM_EXPORT_PATH` to the directory that the `transformers` LLM should be exported to.
```bash
$ export LLM_EXPORT_PATH=$PWD/granite_vision_llm
```

```python
import os
import transformers

MODEL_PATH = os.getenv("GRANITE_MODEL")
if not MODEL_PATH:
    raise ValueError("env var GRANITE_MODEL is unset!")

LLM_EXPORT_PATH = os.getenv("LLM_EXPORT_PATH")
if not LLM_EXPORT_PATH:
    raise ValueError("env var LLM_EXPORT_PATH is unset!")

tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_PATH)

# NOTE: granite vision support was added to transformers very recently (4.49);
# if you get size mismatches, your version is too old.
# If you are running with an older version, set `ignore_mismatched_sizes=True`
# as shown below; the full model won't be loaded correctly, but the LLM part
# that we are exporting will be.
model = transformers.AutoModelForImageTextToText.from_pretrained(MODEL_PATH, ignore_mismatched_sizes=True)

tokenizer.save_pretrained(LLM_EXPORT_PATH)
model.language_model.save_pretrained(LLM_EXPORT_PATH)
```

Now you can convert the exported LLM to GGUF with the normal converter in the root of the llama.cpp project.
```bash
$ LLM_GGUF_PATH=$LLM_EXPORT_PATH/granite_llm.gguf
...
$ python convert_hf_to_gguf.py --outfile $LLM_GGUF_PATH $LLM_EXPORT_PATH
```


### 4. Quantization
If you want to quantize the LLM, you can do so with `llama-quantize` as you would any other LLM. For example:
```bash
$ ./build/bin/llama-quantize $LLM_EXPORT_PATH/granite_llm.gguf $LLM_EXPORT_PATH/granite_llm_q4_k_m.gguf Q4_K_M
$ LLM_GGUF_PATH=$LLM_EXPORT_PATH/granite_llm_q4_k_m.gguf
```

Note that currently you cannot quantize the visual encoder because granite vision models use SigLIP as the visual encoder, which has tensor dimensions that are not divisible by 32.
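To make the divisibility constraint concrete, the sizes from the encoder config in step 2 can be checked against a block size of 32. This is only an illustration: `non_divisible_dims` is a hypothetical helper, and treating `hidden_size` and `intermediate_size` as the relevant tensor dimensions is an assumption, not something the quantizer states.

```python
# Sizes taken from the visual encoder config in step 2.
SIGLIP_DIMS = {"hidden_size": 1152, "intermediate_size": 4304}


def non_divisible_dims(dims, block_size=32):
    """Map each dimension that is not a multiple of `block_size` to its remainder."""
    return {
        name: size % block_size
        for name, size in dims.items()
        if size % block_size != 0
    }
```

Here `non_divisible_dims(SIGLIP_DIMS)` returns `{'intermediate_size': 16}`: 1152 is 36 × 32, while 4304 is 134 × 32 + 16.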


### 5. Running the Model in llama.cpp
Build llama.cpp normally; you should have a target binary named `llama-llava-cli`, to which you pass the two GGUF files created above. As an example, we pass the llama.cpp banner as the image.

```bash
$ ./build/bin/llama-llava-cli -m $LLM_GGUF_PATH \
    --mmproj $VISUAL_GGUF_PATH \
    --image ./media/llama0-banner.png \
    -c 16384 \
    -p "<|system|>\nA chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n<|user|>\n\<image>\nWhat does the text in this image say?\n<|assistant|>\n" \
    --temp 0
```

Sample output: `The text in the image reads "LLAMA C++ Can it run DOOM Llama?"`
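
If you want to drive the CLI from a script, the command above can be assembled as an argument list. This is a sketch, not an official wrapper: `build_prompt` and `build_llava_command` are hypothetical helpers, the flags are exactly those used in the command above, and writing the image tag as `<image>` (without the leading backslash) is an assumption.

```python
SYSTEM_PROMPT = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_prompt(question, system=SYSTEM_PROMPT):
    """Format the prompt with the same role markers as the command above."""
    # "\\n" keeps the two-character sequence backslash + n, which is what the
    # double-quoted shell string in the command above passes to the binary.
    return f"<|system|>\\n{system}\\n<|user|>\\n<image>\\n{question}\\n<|assistant|>\\n"


def build_llava_command(llm_gguf, visual_gguf, image, question, ctx_size=16384):
    """Return the argument list for llama-llava-cli, mirroring the command above."""
    return [
        "./build/bin/llama-llava-cli",
        "-m", llm_gguf,
        "--mmproj", visual_gguf,
        "--image", image,
        "-c", str(ctx_size),
        "-p", build_prompt(question),
        "--temp", "0",
    ]
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from the root of the llama.cpp checkout, with `llm_gguf` and `visual_gguf` set to the values of `$LLM_GGUF_PATH` and `$VISUAL_GGUF_PATH`.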