Hands on  Training large language models (LLMs) may require millions or even billions of dollars of infrastructure, but the fruits of that labor are often more accessible than you might think. Many recent releases, including Alibaba's Qwen 3 and OpenAI's gpt-oss, can run on even modest PC hardware.

If you really want to learn how LLMs work, running one locally is essential. It also gives you unlimited access to a chatbot without paying extra for priority access or sending your data to the cloud. While there are simpler tools, driving Llama.cpp from the command line offers the best performance and the most options, including the ability to assign the workload to the CPU or GPU and the capability to quantize (aka compress) models for faster output.

Under the hood, many of the most popular frameworks for running models locally on your PC or Mac, including Ollama, Jan, and LM Studio, are really wrappers built atop Llama.cpp's open source foundation, with the goal of abstracting away complexity and improving the user experience.

While these niceties make running local models less daunting for newcomers, they often leave something to be desired with regard to performance and features.

As of this writing, Ollama still doesn't support Llama.cpp's Vulkan back end, which offers broader compatibility and often higher generation performance, particularly for AMD GPUs and APUs. And while LM Studio does support Vulkan, it lacks support for Intel's SYCL runtime and GGUF model creation.

In this hands-on guide, we'll explore Llama.cpp, including how to build and install the app, deploy and serve LLMs across GPUs and CPUs, generate quantized models, maximize performance, and enable tool calling.

Prerequisites:

Llama.cpp will run on just about anything, including a Raspberry Pi. However, for the best experience possible, we recommend a machine with at least 16GB of system memory.

While not required, a dedicated GPU from Intel, AMD, or Nvidia will greatly improve performance. If you do have access to one, you'll want to make sure you have the latest drivers installed on your system before proceeding.

For most users, installing Llama.cpp is about as easy as downloading a ZIP file.

While Llama.cpp may be available from package managers like apt, snap, or WinGet, it's updated very frequently, sometimes multiple times a day, so it's best to grab the latest precompiled binaries from the official GitHub page.

Binaries are available for a variety of accelerators and frameworks for macOS, Windows, and Ubuntu on both Arm64 and x86-64 based host CPUs.

Here's a quick cheat sheet if you're not sure which to grab:

  • Nvidia: CUDA
  • Intel Arc / Xe: SYCL
  • AMD: Vulkan or HIP
  • Qualcomm: OpenCL-Adreno
  • Apple M-series: macOS-Arm64

Or, if you don't have a supported GPU, grab the appropriate "CPU" build for your operating system and processor architecture. Note that integrated GPUs can be rather hit-and-miss with Llama.cpp and, due to memory bandwidth constraints, may not deliver higher performance than CPU-based inference even if you can get them working.

Once you've downloaded Llama.cpp, unzip the folder to your home directory for easy access.

If you can't find a prebuilt binary for your preferred flavor of Linux or accelerator, we'll cover how to build Llama.cpp from source a bit later. We promise it's easier than it sounds.

macOS users:

While we recommend that Windows and Linux users grab the precompiled binaries from GitHub, platform security measures in macOS make running unsigned code a bit of a headache. Because of this, we recommend that macOS users use the Homebrew package manager to install Llama.cpp. Just be aware that it may not be the latest version available.

A guide to setting up the Homebrew package manager can be found here. Once you have Homebrew installed, you can get Llama.cpp by running:

brew install llama.cpp

Deploying your first model

Unlike other apps such as LM Studio or Ollama, Llama.cpp is a command-line utility. To access it, you'll need to open a terminal and navigate to the folder we just downloaded. Note that, on Linux, the binaries will be located under the build/bin directory.

cd folder_name_here

We can then run the following command to download and run a 4-bit quantized version of Qwen3-8B inside a command-line chat interface on our device. For this model, we recommend at least 8GB of system memory or a GPU with at least 6GB of VRAM.

./llama-cli -hfr bartowski/Qwen_Qwen3-8B-GGUF:Q4_K_M

If you installed Llama.cpp using brew, you can leave off the ./ before llama-cli.

Once the model is downloaded, it should only take a few seconds to spin up, and you'll be presented with a rudimentary command-line chat interface.

The easiest way to interact with Llama.cpp is through the CLI

Unless you happen to be running M-series silicon, Llama.cpp is going to load the model into system memory and run it on the CPU by default. If you've got a GPU with enough memory, you probably don't want that, since DDR is typically quite a bit slower than GDDR.

To use the GPU, we need to specify how many layers we'd like to offload onto it by appending the -ngl flag. In this case, Qwen3-8B has 37 layers, but if you're not sure, setting -ngl to something like 999 will guarantee the model runs entirely on the GPU. And, yes, you can adjust this to split the model between system and GPU memory if you don't have enough. We'll dive deeper into that, along with some advanced approaches, a bit later in the story.

./llama-cli -hfr bartowski/Qwen_Qwen3-8B-GGUF:Q4_K_M -ngl 37

Dealing with multiple devices

Llama.cpp will attempt to use all available GPUs, which can cause problems if you've got both a dedicated graphics card and an iGPU on board. In our testing with an AMD W7900 on Windows using the HIP binaries, we ran into an error because the model tried to offload some layers to our CPU's integrated graphics.

To get around this, we can specify which GPUs to run Llama.cpp on using the --device flag. We can list all available devices by running the following:

./llama-cli --list-devices

You should see an output similar to this one:

Available devices:
  ROCm0: AMD Radeon RX 7900 XT (20464 MiB, 20314 MiB free)
  ROCm1: AMD Radeon(TM) Graphics (12498 MiB, 12347 MiB free)

Note that, depending on whether you're using the HIP, Vulkan, CUDA, or OpenCL backend, device names are going to be different. For example, if you're using CUDA, you'll see CUDA0 and CUDA1.

We can now launch Llama.cpp on our preferred GPU by running:

./llama-cli -hfr bartowski/Qwen_Qwen3-8B-GGUF:Q4_K_M -ngl 37 --device ROCm0

Serving up your model:

As great as the CLI-based chat interface is, it isn't necessarily the most convenient way to interact with Llama.cpp. Instead, you may want to hook it up to a graphical user interface (GUI).

Luckily, Llama.cpp includes an API server that can be connected to any app that supports OpenAI-compatible endpoints, like Jan or Open WebUI. If you just want a basic GUI, you don't need to do anything special; just launch the model with llama-server instead.

./llama-server -hfr bartowski/Qwen_Qwen3-8B-GGUF:Q4_K_M -ngl 37

After a few moments, you should be able to open the web GUI by navigating to http://localhost:8080 in your web browser.

By default, launching a model with llama-server will start a basic web interface for chatting with models

If you want to access the server from another device, you'll need to expose the server to the rest of your network by setting the --host address to 0.0.0.0, and, if you want to use a different port, you'd append the --port flag. If you're going to make your server available to strangers on the internet or a large network, we also recommend setting an --api-key flag.

./llama-server -hfr bartowski/Qwen_Qwen3-8B-GGUF:Q4_K_M -ngl 37 --host 0.0.0.0 --port 8000 --api-key top-secret

Now the API will be available at:

API address: http://ServerIP:8000/v1

The API key should be passed as a bearer token in the request's Authorization header if you're writing your own application. Other tools, such as Open WebUI, have a field where you can enter the key, and the client application takes care of the rest.
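If you're writing your own client, a minimal sketch in Python might look like the following. It assumes the server was launched with the command above, so the port and the top-secret key are just the example values from that command, and the model name is purely informational since llama-server only hosts the one model you loaded.

import requests

# Example values taken from the llama-server command above
BASE_URL = "http://localhost:8000/v1"
API_KEY = "top-secret"

response = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "qwen3-8b",  # informational; the server answers with whatever model it has loaded
        "messages": [{"role": "user", "content": "Summarize llama.cpp in one sentence."}],
    },
    timeout=120,
)

print(response.json()["choices"][0]["message"]["content"])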

Editor's note: For most home users, this should be relatively safe, as you'll be deploying the model from behind your router's firewall. However, if you're running Llama.cpp in the cloud, you'll definitely want to lock down your firewall first.

Where to find models

Llama.cpp works with most models quantized using the GGUF format. These models can be found on a variety of model repos, with Hugging Face being among the most popular.

If you're looking for a specific model, it's worth checking profiles like Bartowski, Unsloth, and GGML-Org, as they're usually among the first to have GGUF quants of new models.

If you're using Hugging Face, downloading models can be done directly from Llama.cpp. In fact, that's how we pulled down Qwen3-8B in the earlier step; it just requires specifying the model repo and the specific quantization level you'd prefer.

For example, -hfr bartowski/Qwen_Qwen3-8B-GGUF:Q8_0 would pull down an 8-bit quantized version of the model, while -hfr bartowski/Qwen_Qwen3-8B-GGUF:IQ3_XS would download a 3-bit i-quant.

Generally speaking, smaller quants require fewer resources to run, but also tend to be lower quality.

Quantizing your own models

If the model you're looking for isn't already available as a GGUF, you may be able to create your own. Llama.cpp provides tools for converting models to the GGUF format and then quantizing them from 16 bits down to a lower precision (usually 8 to 2 bits) so they can run on lesser hardware.

To do this, you'll need to clone the Llama.cpp repo and install a recent version of Python. If you're running Windows, we actually find it's easier to do this step in Windows Subsystem for Linux (WSL) rather than trying to wrangle Python packages natively. If you need help setting up WSL, Microsoft's setup guide can be found here.

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

Next, we'll need to create a Python virtual environment and install our dependencies. If you don't already have python3-pip and python3-venv installed, you'll want to grab those first.

sudo apt install python3-pip python3-venv

Then create a virtual environment and activate it by running:

python3 -m venv llama-cpp
source llama-cpp/bin/activate

With that out of the way, we can install the Python dependencies with:

pip install -r requirements.txt

From there, we can use the convert_hf_to_gguf.py script to convert a safetensors model, in this case Microsoft's Phi-4, to a 16-bit GGUF file.

python3.12 convert_hf_to_gguf.py --remote microsoft/phi-4 --outfile phi4-14b-FP16.gguf

Unless you've got a multi-gigabit internet connection, it'll take a few minutes to download. At native precision, Phi-4 is nearly 30GB. If you run into an error trying to download a model like Llama or Gemma, you likely need to request permission on Hugging Face first and sign in using the huggingface-cli:

huggingface-cli login

From here, we can quantize the model to the desired bit-width. We'll use Q4_K_M quantization, since it cuts the model size by nearly three quarters without sacrificing too much quality. You can find a full list of available quants here, or by running llama-quantize --help.

./llama-quantize phi4-14b-FP16.gguf phi4-14b-Q4_K_M.gguf q4_k_m

To test the model, we can launch llama-cli, but rather than using -hfr to select a Hugging Face repo, we'll use -m instead and point it at our newly quantized model.

llama-cli -m phi4-14b-Q4_K_M.gguf -ngl 99

If you'd like to learn more about quantization, we have an entire guide devoted to model compression, including how to measure and minimize quality losses, which you can find here.

Building from source

On the off chance that Llama.cpp doesn't offer a precompiled binary for your hardware or operating system, you'll need to build the app from source.

The Llama.cpp dev team maintains comprehensive documentation on how to build from source for every operating system and compute runtime, be it CUDA, HIP, SYCL, CANN, MUSA, or something else. Whichever you're building for, you'll want to make sure you have the latest drivers and runtimes installed and configured first.

For this demonstration, we'll be building Llama.cpp for both an 8GB Raspberry Pi 5 and an x86-based Linux box with an Nvidia GPU, since there aren't any precompiled binaries for either. For our host operating system, we'll be using Ubuntu Server (25.04 for the RPi and 24.04 for the PC).

To get started, we'll install a few dependencies using apt.

sudo apt install git cmake build-essential libcurl4-openssl-dev

Next, we can clone the repo from GitHub and open the directory.

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

From there, building Llama.cpp is fairly straightforward. The devs have an entire page devoted to building from source, with instructions for everything from SYCL and CUDA to HIP and even Huawei's CANN and Moore Threads' MUSA runtimes.

Building Llama.cpp on the RPi 5:

For the Raspberry Pi 5, we can use the standard build flags. Note we've added the -j 4 flag here to parallelize the process across the RPi's four cores.

cmake -B build
cmake --build build --config Release -j 4

Building Llama.cpp for x86 and CUDA:

For the x86 box, we'll need to make sure the Nvidia drivers and CUDA toolkit are installed first by running:

sudo apt install nvidia-driver-570-server nvidia-cuda-toolkit
sudo reboot

After the system has rebooted, open the llama.cpp folder and build it with CUDA support by running:

cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

Building for a different system or running into trouble? Check out Llama.cpp's docs.

Completing the install:

Regardless of which system you're building on, it may take a couple of minutes to complete. Once the build is finished, the binaries will be stored in the build/bin/ folder.

You can run these binaries directly from this directory, or complete the install by copying them to your /usr/bin directory.

sudo cp build/bin/llama-* /usr/bin/

If everything worked correctly, we should be able to spin up Google's itty bitty new language model, Gemma 3 270M, by running:

llama-cli -hfr bartowski/google_gemma-3-270m-it-qat-GGUF:bf16

Would we recommend running LLMs on a Raspberry Pi? Not really, but at least now you know you can.

Performance tuning

So far, we've covered how to download, install, run, serve, and quantize models in Llama.cpp, but we've only scratched the surface of what it's capable of.

Run llama-cli --help and you'll see just how many levers to pull and knobs to turn there really are. So, let's take a look at some of the more useful flags at our disposal.

In this example, we've configured Llama.cpp to run OpenAI's gpt-oss-20b model along with a few additional flags to maximize performance. Let's break them down one by one.

./llama-server -hf ggml-org/gpt-oss-20b-GGUF --jinja -fa -c 16384 -np 2 --cache-reuse 256 -ngl 99

-fa — enables Flash Attention on supported platforms, which can dramatically speed up prompt processing times while also reducing memory requirements. We find that this is beneficial for most setups, but it's worth trying with and without to be sure.

-c 16384 — sets the model's context window, or short-term memory, to 16,384 tokens. If left unset, Llama.cpp defaults to 4,096 tokens, which minimizes memory requirements but means the model will start forgetting details once you exceed that threshold. If you've got memory to spare, we recommend setting this as high as you can without running into out-of-memory errors, up to the model's limits. For gpt-oss, that's 131,072 tokens.

The larger the context window, the more RAM or VRAM is needed to run the model. LMCache offers a calculator to help you determine how many tokens you can fit into your memory.
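If you'd rather do the math yourself, a rough sketch of the KV-cache calculation is shown below. The figures are illustrative values for an 8B-class model with grouped-query attention; you'd pull the real layer count, KV head count, and head dimension from the model card or GGUF metadata, and note that llama.cpp can also quantize the KV cache, which shrinks these numbers further.

# Back-of-the-envelope estimate: 2 (keys and values) x layers x KV heads x head dim x context x bytes per element
def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):  # 2 bytes per element = FP16
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem / 1024**3

# Illustrative dimensions for an 8B-class model at a 16,384-token context
print(f"{kv_cache_gib(n_layers=36, n_kv_heads=8, head_dim=128, n_ctx=16384):.1f} GiB")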

-np 2 — allows Llama.cpp to process up to two requests in parallel, which can be useful in multi-user scenarios or when connecting Llama.cpp to code assistant tools like Cline or Continue, which may make multiple simultaneous requests for code completion or chat functionality. Note that the context window is divided by the number of parallel processes; in this example, each parallel process gets an 8,192-token context.

--cache-reuse 256 — setting this helps avoid recomputing key-value caches, speeding up prompt processing, particularly for lengthy multi-turn conversations. We recommend starting with 256-token chunks.

Hyperparameter tuning

For optimal performance and output quality, many model developers recommend setting sampling parameters, like temperature or min-p, to specific values.

For example, Alibaba's Qwen team recommends setting temp to 0.7, top-p to 0.8, top-k to 20, and min-p to 0 when running many of its instruct models, like Qwen3-30B-A3B-Instruct-2507. Recommended hyperparameters can usually be found on the model card on repos like Hugging Face.

In a nutshell, these parameters influence which tokens the model selects from the probability curve. Temperature is one of the easiest to understand: setting it lower usually results in less creative and more deterministic outputs, while setting it higher produces more adventurous results.
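To make that a little more concrete, here's a small Python sketch of how temperature reshapes the token distribution before sampling. The logits are made-up scores standing in for a model's raw output over three candidate tokens.

import math

def softmax_with_temperature(logits, temp):
    # Dividing the logits by the temperature before the softmax sharpens the
    # distribution at low values and flattens it at high values
    scaled = [value / temp for value in logits]
    total = sum(math.exp(s) for s in scaled)
    return [math.exp(s) / total for s in scaled]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for temp in (0.2, 0.7, 1.5):
    print(temp, [round(p, 2) for p in softmax_with_temperature(logits, temp)])

Samplers like top-k and min-p then trim the tail of that distribution before a token is actually drawn.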

Many applications, like Open WebUI, LibreChat, and Jan, allow overriding them via the API, which they (or you) can access at localhost:8080 when you're running llama-server. However, for applications that don't, it can be helpful to set these when spinning up the model in Llama.cpp.

For example, for Qwen 3 instruct models, we might run something like the following. (Note this particular model requires a bit over 20GB of memory, so if you want to test it, you may need to swap out Qwen for a smaller model.)

./llama-server -hfr bartowski/Qwen_Qwen3-30B-A3B-Instruct-2507-GGUF:Q4_K_M --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0

You can find a full list of available sampling parameters by running:

./llama-cli --help

For more information on how sampling parameters influence output generation, check out Amazon Web Services' explainer here.

Boosting performance with speculative decoding

One of the features in Llama.cpp that you won't see in other model runners, like Ollama, is support for speculative decoding. This technique can speed up token generation in highly repetitive workloads, like code generation, by using a small draft model to predict the outputs of a larger, more accurate one.

This approach does require a compatible draft model, usually one from the same family as your primary one. In this example, we're using speculative decoding to speed up Alibaba's Qwen3-14B model, with the 0.6B variant as our drafter.

./llama-server -hfr Qwen/Qwen3-14B-GGUF:Q4_K_M -hfrd Qwen/Qwen3-0.6B-GGUF:Q8_0 -c 4096 -cd 4096 -ngl 99 -ngld 99 --draft-max 32 --draft-min 2 --cache-reuse 256 -fa

To test, we can ask the model to generate a block of text or code. In our testing with speculative decoding enabled, generation rates were roughly on par with running Qwen3-14B on its own. However, when we requested a minor change to that text or code, performance roughly doubled, jumping from around 60 tok/s to 117 tok/s.

If you'd like to know more about how speculative decoding works in Llama.cpp, you can find our deep dive here.

Splitting large models between CPUs and GPUs

One of Llama.cpp's most useful features is its ability to split large models between the CPU and GPU. So long as you have enough memory between your DRAM and VRAM to fit the model weights (and the OS), there's a good chance you can run it.

As we alluded to earlier, the easiest way to do this is to slowly increase the number of layers offloaded to the GPU (-ngl) until you get an out-of-memory error, then back off a bit.

For example, if you had a GPU with 20GB of VRAM and 32GB of DDR5 and wanted to run Meta's Llama 3.1 70B model at 4-bit precision, which requires a bit over 42GB of memory, you might offload 40 of the layers to the GPU and run the rest on the CPU.

./llama-server -hfr bartowski/Meta-Llama-3.1-70B-Instruct-GGUF -ngl 40

While the model does run, performance won't be great; in our testing we got about 2 tok/s.

However, thanks to the relatively small number of active parameters in mixture of experts (MoE) models such as gpt-oss, it's actually possible to get decent performance even when running far larger models.

By taking advantage of Llama.cpp's MoE expert-offload features, we were able to get OpenAI's 120-billion-parameter gpt-oss model running at a rather respectable 20 tok/s on a system with a 20GB GPU and 64GB of DDR4 3200 MT/s memory.

./llama-server -hf ggml-org/gpt-oss-120b-GGUF -fa -c 32768 --jinja -ngl 999 --n-cpu-moe 26

In this case, we set -ngl to 999 and used the --n-cpu-moe parameter to offload progressively more expert layers to the CPU until Llama.cpp stopped throwing out-of-memory errors.

Tool calling

If your workload requires it, Llama.cpp can also parse tool calls from OpenAI-compatible API endpoints like Open WebUI or Cline. You'd want tools if you'd like to bring in an outside capability, such as a clock, a calculator, or checking the status of a Proxmox cluster.

We previously built a tool to generate status reports from Proxmox VE.

Enabling tool calling varies from model to model. For most popular models, including gpt-oss, nothing special is required. Simply append the --jinja flag and you're off to the races.

./llama-server -hf ggml-org/gpt-oss-20b-GGUF --jinja

Other models, like DeepSeek R1, may require setting a chat template manually when launching the model. For example:

./llama-server --jinja -fa -hf bartowski/DeepSeek-R1-Distill-Qwen-32B-GGUF:Q4_K_M \
    --chat-template-file models/templates/llama-cpp-deepseek-r1.jinja
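Once the server is up with --jinja, tool definitions simply ride along with an ordinary OpenAI-style chat completion request. Here's a minimal Python sketch; the get_time function is a made-up example, and in a real application you'd run the requested tool yourself and send its result back to the model in a follow-up message.

import json
import requests

# Describe a hypothetical tool the model is allowed to call
tools = [{
    "type": "function",
    "function": {
        "name": "get_time",
        "description": "Return the current time in a given timezone",
        "parameters": {
            "type": "object",
            "properties": {"timezone": {"type": "string"}},
            "required": ["timezone"],
        },
    },
}]

response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "gpt-oss-20b",
        "messages": [{"role": "user", "content": "What time is it in Berlin?"}],
        "tools": tools,
    },
    timeout=120,
)

# If the model opts to call the tool, the reply will contain a tool_calls entry
print(json.dumps(response.json()["choices"][0]["message"], indent=2))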

Tool calling is a whole can of worms unto itself, so, if you're interested in learning more, check out our function calling and Model Context Protocol deep dives here and here.

Summing up

While Llama.cpp may be one of the most comprehensive model runners out there (we've only discussed a fraction of what the app entails), we understand it can be quite daunting for those dipping their toes into local LLMs for the first time. This is one of the reasons why it's taken us so long to do a hands-on of the app, and why we think that simpler apps like Ollama and LM Studio are still valuable.

So now that you've gotten your head wrapped around Llama.cpp, you might be wondering how LLMs are actually deployed in production, or maybe how to get started with image generation. We've got guides for both. ®

