Running a Local LLM with Image Input Support (AMD GPU / CPU Compatible)

The purpose
Build environment
1. llama.cpp
2. Model
Execute
1. Executing as server
Troubleshooting
Reference

The purpose

Running a local vision-capable LLM (Chat AI) using llama.cpp.

In this article, we will use Qwen2.5-VL, which is Google’s model optimized for local environments.

This can be run on AMD GPUs as well as environments without a GPU (CPU-only).

Please refer to the following page for launching Gamma.

Build environment

llama.cpp

Download the zip file that matches your environment from the page below.

If you want to run it on Windows with an AMD GPU (or without a GPU), it will work with the package for Vulkan.

If you are using an Nvidia GPU, it will work with the package for CUDA.

If it does not work with the versions above, use the package for CPU.

Releases · ggml-org/llama.cpp

LLM inference in C/C++. Contribute to ggml-org/llama.cpp development by creating an account on GitHub.

Once you extract the downloaded file into the folder of your choice, the preparation is complete.

Model

Download a total of two files from the page: one of the Qwen2.5-VL-3B-Instruct-XXXXXXX.gguf files and one of the mmproj-Qwen2.5-VL-3B-Instruct-XXXXXXX.gguf files.

ggml-org/Qwen2.5-VL-3B-Instruct-GGUF at main

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

Execute

Executing as server

Run the following command in the Command Prompt.

Note: Please replace model_path and mmproj_model_path with the actual paths to the models you downloaded.

llama-server -m model_path --mmproj mmproj_model_path --port 8080

Once the model finishes loading, you will see a message like this:

main: model loaded
main: server is listening on http://127.0.0.1:8080
main: starting the main loop...
srv  update_slots: all slots are idle

When you see the message above, open a browser (such as Google Chrome) and go to

404 Not Found

(http://127.0.0.1:8080/).

The interface will look like this, and you can start chatting.

You can drag and drop image files onto the shown page.

Troubleshooting

An AMD bug report appeared and the system crashed when I inputted an image.

I resolved the issue by doing the following two things (though I am not sure which one was the actual cause):

1. Updated the driver from the following page:

プロセッサ/グラフィックスのドライバーとサポート

AMD 製品のドライバーとソフトウェアをダウンロード — Windows および Linux のサポート、自動検出ツール、インストールの詳細ガイドもご利用いただけます。

2. Launch AMD Software (Adrenalin Edition) and change “Memory Optimizer” under the “Performance” → “Tuning” tab to “Gaming”. (This increased the dedicated GPU memory from 2GB to 4GB.)

Reference

【備忘録】llama.cppで、マルチモーダルがサポートされたので使ってみた。｜猫又

個人用の備忘録です。 llama.cppは以下を使用・llama-b5342-bin-win-cuda12.4-x64 モデルは以下からダウンロードして使用・Qwen2.5-VL-3B-Instruct-Q4_K_M.gguf ・mmproj-Qwen2.5-VL-3B-Instruct-f16...