Running LLMs Locally (with AMD GPU/CPU Support)

The purpose
Build environment
1. llama.cpp
2. Model
Execute
1. Executing via Command Prompt
  1. Exit
2. Executing as server
  1. Exit
  2. Accessing from Other Devices (PC or Smartphone)

The purpose

Running a local LLM (Chat AI) using llama.cpp.

In this article, we will use Gemma, Google’s model designed for local environments.

It can be run on AMD GPUs as well as in environments without a GPU (CPU-only).

Build environment

llama.cpp

Download the zip file that matches your environment from the page below.

If you want to run it on Windows with an AMD GPU (or without a GPU), it will work with the package for Vulkan.

If you are using an Nvidia GPU, it will work with the package for CUDA.

If it does not work with the versions above, use the package for CPU.

Releases · ggml-org/llama.cpp

LLM inference in C/C++. Contribute to ggml-org/llama.cpp development by creating an account on GitHub.

Once you extract the downloaded file into the folder of your choice, the preparation is complete.

Model

Please download a model from one of the following pages.

gemma-2-2b-jpn is a model specifically optimized for Japanese.

gemma-2-9b is a smarter model than gemma-2-2b-jpn.

On the linked page, you will see files labeled with terms like “Q4” or “Q8.” The higher the number after the “Q,” the smarter the model is.

Generally, smarter models have larger file sizes and take longer to generate responses.

(There is also an even smarter model called Gemma-2-27B.)

Please choose a model that fits your environment. (As a general rule of thumb: 2B is for smartphones and low-spec PCs, 9B is for high-spec PCs, and 27B is for ultra-powerful PCs.)

grapevine-AI/gemma-2-2b-jpn-it-gguf at main

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

bartowski/gemma-2-9b-it-GGUF · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

Execute

Executing via Command Prompt

Execute the following command in the Command Prompt.

Please replace path/to/model with the actual path of the model you downloaded.

llama-cli.exe -m path/to/model

Once the model finishes loading, a screen like the following will be displayed:

Once the screen above is displayed, you can start chatting. Japanese input is also supported.

When I used gemma-2-2B-jpn-it-Q4_K_M.gguf on my system (Ryzen 7 7735HS with Radeon Graphics + 32GB RAM), the responses came back almost instantly.

Exit

You can exit by typing /exit or pressing Ctrl+C.

Executing as server

Run the following command in the Command Prompt.

Note: Please replace path/to/model with the actual path of the model you downloaded.

llama-server -m path/to/model --port 8080

Once the model has finished loading, you will see a message like this:

main: model loaded
main: server is listening on http://127.0.0.1:8080
main: starting the main loop...
srv  update_slots: all slots are idle

Once you see the message above, open a browser like Chrome and go to http://127.0.0.1:8080/.

The following screen will appear, and you can start chatting.

Gemma, unlike Gemini, does not support image input.

Additionally, it cannot output images.

Exit

You can terminate it by pressing Ctrl+C in the prompt where you started the server.

Accessing from Other Devices (PC or Smartphone)

If you run it using the command above, you will not be able to access it from other devices.

If you want to allow access from other devices, add the following argument when starting it up. (Please note that when accessing it, you will need to find the server-side IP address and replace 127.0.0.1 in

404 Not Found

(http://127.0.0.1:8080)).

--host 0.0.0.0

Please be very careful with security.