llama.cpp¶

What it is¶

llama.cpp is a lightweight C/C++ inference runtime for running GGUF/quantized LLMs locally on commodity hardware.

What problem it solves¶

It makes local LLM inference practical on CPUs and smaller devices by combining quantization support with optimized low-level inference paths.

Where it fits in the stack¶

Infrastructure / Inference Runtime. It is a core local-serving building block used directly or via wrappers.

Typical use cases¶

Running quantized LLMs offline on laptops, servers, or edge devices
Building local-first AI tools without cloud API dependency
Powering higher-level local model tools and wrappers

Strengths¶

Lightweight and portable local runtime
Strong support for quantized model execution
Large ecosystem and broad community adoption

Limitations¶

Requires manual model/runtime tuning for best performance
Feature parity can vary across hardware backends
Large models still require substantial memory/compute

When to use it¶

When privacy, offline operation, or cost control require local inference
When you need direct control of quantization/runtime tradeoffs

When not to use it¶

When managed cloud APIs are preferred for simplicity and elasticity
When you need frontier-model quality that local hardware cannot sustain

Getting started¶

Docker Compose Example¶

Run llama.cpp as an OpenAI-compatible server using Docker:

services:
  llama-cpp:
    image: ghcr.io/ggerganov/llama.cpp:server
    ports:
      - "8080:8080"
    volumes:
      - ./models:/models
    command: "-m /models/llama-3-8b-instruct.Q4_K_M.gguf -c 2048 --host 0.0.0.0 --port 8080"

Python API Example¶

Use the openai library to interact with your local llama.cpp server:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

response = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantization in 3 sentences."}
    ]
)

print(response.choices[0].message.content)

llama.cpp¶

What it is¶

What problem it solves¶

Where it fits in the stack¶

Typical use cases¶

Strengths¶

Limitations¶

When to use it¶

When not to use it¶

Getting started¶

Docker Compose Example¶

Python API Example¶

Licensing and cost¶

Sources / References¶

Contribution Metadata¶

llama.cpp¶

What it is¶

What problem it solves¶

Where it fits in the stack¶

Typical use cases¶

Strengths¶

Limitations¶

When to use it¶

When not to use it¶

Getting started¶

Docker Compose Example¶

Python API Example¶

Licensing and cost¶

Related tools / concepts¶

Sources / References¶

Contribution Metadata¶