
Fine-tuning Open Models

What it is

Fine-tuning is the process of continuing the training of a pre-trained language model on a curated dataset to adapt its behaviour, tone, knowledge, or task performance for a specific domain. Unlike Retrieval-Augmented Generation (RAG), fine-tuning modifies the model weights themselves, baking knowledge and behavioural patterns into the model rather than retrieving them at inference time.

What problem it solves

Pre-trained open models are generalists. They may:

  • Not follow a specific output format consistently
  • Lack domain terminology or institutional knowledge
  • Perform poorly on narrow task types (e.g., extracting structured fields from a specific document type)
  • Respond in unwanted styles or languages

Fine-tuning addresses these gaps without replacing the base model's general capabilities.

Where it fits in the stack

Model Adaptation Layer — between the raw pre-trained base model (Layer 0) and the inference/serving infrastructure (Layer 1). Fine-tuning is an offline process; the resulting model is then served via Ollama, vLLM, or similar.

┌─────────────────────────────────────────────────────────┐
│         Training (offline, GPU/Apple Silicon)           │
│  Dataset → [Base Model] + [Adapter (LoRA)] → Fine-tuned │
└───────────────────────────┬─────────────────────────────┘
                            │  export GGUF / safetensors
┌───────────────────────────▼─────────────────────────────┐
│        Inference (Ollama / vLLM / llama.cpp)            │
│                                                         │
│  Agents: OpenClaw │ OpenHands │ n8n AI nodes            │
└─────────────────────────────────────────────────────────┘

Fine-tuning vs RAG — decision guide

Criterion          | Fine-tune                                                         | RAG
-------------------|-------------------------------------------------------------------|----------------------------------------------------
Knowledge type     | Style, format, task behaviour, domain vocabulary                  | Factual content, documents, up-to-date information
Update frequency   | Rare (hours to retrain)                                           | Continuous (update vector store)
Cost               | High upfront (compute)                                            | Low incremental (embedding + storage)
Hallucination risk | Reduced for in-distribution tasks                                 | Reduced by grounding in retrieved context
Privacy            | Weights stay local                                                | Data stays in vector store
Latency            | Zero (knowledge is in weights)                                    | Adds retrieval time (~100–500 ms)
Best for           | Consistent format/tone, narrow task types, instruction following  | Q&A over documents, knowledge freshness, citations

Rule of thumb: Use RAG first for factual knowledge. Use fine-tuning when you need the model to behave differently — follow a specific format, adopt a persona, or excel at a narrow task type it currently performs poorly on.

Combine both: fine-tune for behaviour + RAG for factual grounding.

Approaches

Parameter-Efficient Fine-Tuning (PEFT)

Trains only a small subset of parameters, dramatically reducing compute and memory requirements. The standard for home-lab and small-team fine-tuning.

LoRA (Low-Rank Adaptation)

Adds small rank-decomposition matrices to existing weight matrices. Only the adapter weights are trained; the base model is frozen. Adapters can be merged back into the base model for zero-latency inference.

W_new = W_base + (A × B)    where A ∈ R^(d×r), B ∈ R^(r×k), r ≪ min(d, k)
  • Rank (r): Typical values 4–64; higher = more capacity but more VRAM
  • Alpha: Scaling factor; usually set to 2× rank
  • Target modules: Usually q_proj, v_proj (attention); sometimes all linear layers
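A quick back-of-the-envelope calculation shows why this is cheap (illustrative numbers, assuming a square 4096×4096 projection as found in a typical 7B model's attention layers):

# Adapter-size arithmetic for a single weight matrix (illustrative values)
d, k, r = 4096, 4096, 16           # projection dims of a typical 7B attention layer; rank 16
full_params = d * k                # parameters in the frozen base matrix: 16,777,216
adapter_params = d * r + r * k     # trainable A and B parameters: 131,072
print(f"adapter is {adapter_params / full_params:.1%} of the base matrix")  # ~0.8%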

QLoRA (Quantised LoRA)

Loads the base model in 4-bit NF4 quantisation, then trains LoRA adapters in bfloat16. Reduces VRAM by ~75% vs full fine-tuning. Enables fine-tuning 7B models on a 10 GB GPU or a MacBook M4.
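Rough arithmetic for why a 7B QLoRA run fits in under 10 GB (an illustrative estimate, not a measurement; the 40M adapter size is an assumption for rank 16 across all linear layers):

# Illustrative QLoRA memory estimate for a 7B model (ignores activations and framework overhead)
params = 7e9                               # base model parameters
weights_gb = params * 0.5 / 1e9            # 4-bit NF4 ≈ 0.5 bytes/param → ~3.5 GB
adapter_params = 40e6                      # assumed LoRA adapter size
adapter_gb = adapter_params * 2 / 1e9      # adapter weights in bf16
optimizer_gb = adapter_params * 8 / 1e9    # Adam moments in fp32, adapter only
print(f"~{weights_gb + adapter_gb + optimizer_gb:.1f} GB before activations")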

Full Fine-tuning

All weights updated. Requires substantial VRAM: beyond the 2 bytes per parameter for bf16 weights, gradients and Adam optimizer states push the total to several times the model size. Typically only worthwhile for models < 3B or when compute is unconstrained. Not recommended for home-lab use.

Tools

Unsloth

Best for: Fast LoRA/QLoRA fine-tuning on NVIDIA GPUs; 2–5× faster than vanilla transformers with 80% less VRAM.

# Install first
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct",
    max_seq_length=4096,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

Supports: Llama 3, Qwen 2.5, Mistral, Gemma, Phi-3, CodeLlama, and more.
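The snippet above only prepares the adapter; the actual training step usually goes through TRL's SFTTrainer, as in this minimal sketch (data/train.jsonl and the hyperparameters are placeholders; see the TRL section below for a full example):

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Continue from the `model` returned by get_peft_model above
dataset = load_dataset("json", data_files="data/train.jsonl", split="train")
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=SFTConfig(output_dir="./output", num_train_epochs=3,
                   per_device_train_batch_size=2, learning_rate=2e-4),
)
trainer.train()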

LLaMA-Factory

Best for: No-code / low-code UI for training; supports 100+ model architectures; CLI and WebUI.

git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"

# Launch WebUI
llamafactory-cli webui

# Or train from config YAML
llamafactory-cli train examples/train_lora/qwen2.5_7b_lora_sft.yaml

Includes dataset management, training monitoring, merge + export, and evaluation in one tool.

axolotl

Best for: YAML-driven, reproducible training pipelines; excellent for teams and automation.

# config.yaml
base_model: Qwen/Qwen2.5-7B-Instruct
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_4bit: true
adapter: lora
lora_r: 16
lora_alpha: 32
lora_target_modules:
  - q_proj
  - v_proj

datasets:
  - path: data/my_dataset.jsonl
    type: sharegpt
    conversation: chatml

output_dir: ./output/qwen-finetuned
sequence_len: 4096
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 2e-4

# Install, then launch training with the config above
pip install "axolotl[flash-attn,deepspeed]"
accelerate launch -m axolotl.cli.train config.yaml

TRL / SFT Trainer (Hugging Face)

Best for: Custom training loops; maximum control; integrates with the full Hugging Face ecosystem.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Load model in 4-bit
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                                 bnb_4bit_compute_dtype="bfloat16")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

dataset = load_dataset("json", data_files="data/train.jsonl", split="train")

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                          lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
    args=SFTConfig(output_dir="./output", num_train_epochs=3,
                   per_device_train_batch_size=2, learning_rate=2e-4),
)
trainer.train()
trainer.save_model("./output/final")

MLX Fine-tuning (Apple Silicon)

Best for: Fine-tuning on MacBook M4 or Mac Studio without NVIDIA GPU.

pip install mlx-lm

# Fine-tune Llama or Qwen using LoRA on Apple Silicon
python -m mlx_lm.lora \
  --model mlx-community/Qwen2.5-7B-Instruct-4bit \
  --train \
  --data data/ \
  --iters 1000 \
  --batch-size 4 \
  --lora-layers 16 \
  --adapter-path adapters/

# Fuse adapter back into model
python -m mlx_lm.fuse \
  --model mlx-community/Qwen2.5-7B-Instruct-4bit \
  --adapter-path adapters/ \
  --save-path ./fused-model

MacBook M4 (16 GB unified memory) can fine-tune 4-bit quantised 7B models at ~3–5 tokens/sec training throughput. Sufficient for small-to-medium datasets (< 50k examples).
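mlx_lm.lora expects the --data directory to contain train.jsonl and valid.jsonl. A small sketch for producing them from a single file (the 90/10 split is an arbitrary choice, and all_examples.jsonl is a placeholder name):

# Split one JSONL dataset into the train/valid files mlx_lm.lora looks for
import json, pathlib, random

examples = [json.loads(line) for line in open("all_examples.jsonl")]
random.seed(42)
random.shuffle(examples)

split = int(len(examples) * 0.9)
out = pathlib.Path("data")
out.mkdir(exist_ok=True)
for name, subset in [("train.jsonl", examples[:split]), ("valid.jsonl", examples[split:])]:
    with open(out / name, "w") as f:
        f.writelines(json.dumps(ex) + "\n" for ex in subset)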

Dataset preparation

Format (ShareGPT / ChatML)

The most common format for instruction fine-tuning:

{"conversations": [{"from": "human", "value": "Extract the invoice number from this document: ..."}, {"from": "gpt", "value": "{\"invoice_number\": \"INV-2024-0042\", ...}"}]}
{"conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]}

Minimum dataset sizes

Task type                                       | Min examples | Recommended
------------------------------------------------|--------------|------------
Format adaptation (consistent output structure) | 100–500      | 1,000
Domain vocabulary / tone                        | 500–1,000    | 5,000
Narrow task type (specific extraction)          | 200–500      | 2,000
General instruction-following improvement       | 5,000        | 50,000+

Data quality > quantity

100 high-quality, diverse examples outperform 1,000 noisy ones. Deduplicate, validate outputs, and include hard negative examples (cases where the model currently fails).
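Exact-duplicate removal is the cheapest first step; a minimal hash-based sketch (near-duplicates need fuzzier matching such as MinHash, and the file names are placeholders):

# Drop exact duplicates by hashing each example's canonical JSON form
import hashlib, json

def dedupe(path_in: str, path_out: str) -> None:
    seen = set()
    with open(path_in) as fin, open(path_out, "w") as fout:
        for line in fin:
            key = hashlib.sha256(
                json.dumps(json.loads(line), sort_keys=True).encode()
            ).hexdigest()
            if key not in seen:
                seen.add(key)
                fout.write(line)

dedupe("data/raw.jsonl", "data/train.jsonl")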

Creating datasets from existing data

# Convert Paperless-ngx documents to training examples
import json

def convert_paperless_doc(doc):
    """Turn a Paperless document into a fine-tuning example."""
    return {
        "conversations": [
            {
                "from": "human",
                "value": f"Extract key fields from this document:\n\n{doc['content'][:2000]}"
            },
            {
                "from": "gpt",
                "value": json.dumps({
                    "type": doc["document_type"],
                    "date": doc["created"],
                    "correspondent": doc["correspondent"],
                    "amount": doc.get("custom_fields", {}).get("amount")
                }, indent=2)
            }
        ]
    }
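To build the training file, feed documents from the Paperless-ngx REST API through the converter above (a sketch: /api/documents/ with token auth is the standard Paperless-ngx API, but the host and available fields depend on your instance):

# Sketch: export Paperless-ngx documents as ShareGPT JSONL
import json
import requests

API = "http://paperless.local:8000/api/documents/"         # hypothetical host
HEADERS = {"Authorization": "Token YOUR_PAPERLESS_TOKEN"}  # placeholder token

with open("data/raw.jsonl", "w") as f:
    url = API
    while url:  # follow paginated results
        page = requests.get(url, headers=HEADERS, timeout=30).json()
        for doc in page["results"]:
            f.write(json.dumps(convert_paperless_doc(doc)) + "\n")
        url = page.get("next")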

Compute requirements

Model size | Method            | Min VRAM | Recommended | Training time (1k steps)
-----------|-------------------|----------|-------------|-------------------------
1–3B       | QLoRA             | 4 GB     | 6 GB        | 5–10 min (T4)
7B         | QLoRA             | 8 GB     | 16 GB       | 15–30 min (T4)
7B         | LoRA (bf16)       | 16 GB    | 24 GB       | 20–40 min (A100)
14B        | QLoRA             | 16 GB    | 24 GB       | 40–80 min (A100)
32B        | QLoRA             | 24 GB    | 40 GB       | 2–4 hrs (A100)
70B        | QLoRA + DeepSpeed | 4× A100  | 80 GB       | 8–16 hrs

Apple Silicon (M4 16 GB): Practical up to 7B QLoRA; 14B possible with patience.

Free and low-cost training platforms

Platform               | Pricing / free tier              | Notes
-----------------------|----------------------------------|------------------------------------------
Google Colab           | T4 GPU (15 GB), limited hours    | Best for prototyping; Unsloth works well
Hugging Face AutoTrain | Pay-per-job (no free tier)       | No-code UI; very easy for standard tasks
Kaggle Notebooks       | 2× T4 (30 GB total), 30 hr/week  | Good for 7B QLoRA
Modal                  | $30/month free credit            | Serverless GPU; great for automation
RunPod                 | Pay-as-you-go from $0.20/hr      | Best cost/performance for A100
Vast.ai                | Pay-as-you-go from $0.10/hr      | Cheaper than RunPod; spot pricing
Lambda Cloud           | Pay-as-you-go                    | Reliable; A100 and H100 available

Exporting and serving fine-tuned models

Export to GGUF (for Ollama)

# After training with Unsloth — export directly to GGUF
model.save_pretrained_gguf(
    "my-model-q4",
    tokenizer,
    quantization_method="q4_k_m"    # q4_k_m is a good default
)

# Or convert from safetensors with llama.cpp: convert to f16 first, then quantise
# (convert_hf_to_gguf.py does not produce k-quants directly; use llama-quantize)
python llama.cpp/convert_hf_to_gguf.py ./output/final --outtype f16 --outfile my-model-f16.gguf
llama.cpp/llama-quantize my-model-f16.gguf my-model.gguf q4_k_m

Create an Ollama Modelfile

# Modelfile
FROM ./my-model.gguf

SYSTEM """You are a document extraction assistant for a home office.
Always respond with valid JSON. Never add commentary outside the JSON object."""

PARAMETER temperature 0.1
PARAMETER num_ctx 4096

# Build and run
ollama create my-extraction-model -f Modelfile
ollama run my-extraction-model "Extract fields from: ..."

After creating the model in Ollama, it becomes available to all tools in the stack (OpenClaw, OpenHands, n8n, LiteLLM) just like any other model.
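A quick smoke test through Ollama's OpenAI-compatible endpoint confirms the model answers as expected (the model name matches the ollama create command above; the invoice text is a made-up input):

# Smoke-test the fine-tuned model via Ollama's OpenAI-compatible API
import requests

resp = requests.post(
    "http://localhost:11434/v1/chat/completions",
    json={
        "model": "my-extraction-model",
        "messages": [{"role": "user",
                      "content": "Extract fields from: Invoice INV-2024-0042 ..."}],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])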

Evaluation

After fine-tuning, evaluate on a held-out test set before deploying. Compare the fine-tuned model against the base model on the same test set; useful metrics are exact match, ROUGE, format adherence rate (e.g., JSON validity), and task-specific accuracy.
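A minimal scoring sketch, assuming you have already generated predictions from both the base and the fine-tuned model on the held-out set:

# Score a list of model outputs against references (illustrative helpers)
import json

def json_valid(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def score(predictions: list[str], references: list[str]) -> dict:
    n = len(references)
    return {
        "exact_match": sum(p.strip() == r.strip() for p, r in zip(predictions, references)) / n,
        "format_adherence": sum(json_valid(p) for p in predictions) / n,
    }

# Run score() on both models' outputs and compare the two dicts.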

Post-training verification checklist

  • [ ] In-distribution test: Evaluate on 50–100 held-out examples from the training distribution.
  • [ ] Out-of-distribution test: Evaluate on 10–20 adversarial or edge-case inputs.
  • [ ] Format adherence: Measure JSON validity and presence of all required fields.
  • [ ] Catastrophic forgetting check: Compare against the base model on a general benchmark (e.g., an MMLU subset) to confirm general capabilities have not regressed.
  • [ ] Production monitoring: Monitor inference quality and hallucination rates for the first 2 weeks of deployment.

Common failure modes

Symptom                                                    | Likely cause                                                 | Fix
-----------------------------------------------------------|--------------------------------------------------------------|----------------------------------------------------
Model ignores system prompt after fine-tuning              | Training data lacked system prompt in correct position       | Include system prompt in every training example
Repetition / looping                                       | Learning rate too high, or dataset has many near-duplicates  | Lower LR; deduplicate dataset
Format regression (good format → inconsistent after tuning) | Dataset has inconsistent output formats                      | Standardise all outputs in dataset
Catastrophic forgetting (general capability drops)         | Too many epochs or large r                                   | Reduce epochs; lower LoRA rank; mix in general data
High perplexity on validation set                          | Dataset too small or overfitting                             | More data; apply early stopping; increase dropout

Strengths

  • Zero inference overhead: Knowledge is in weights; no retrieval latency
  • Consistent behaviour: Reliable format adherence and tone even without in-context examples
  • Privacy: Training data and model stay on-premises
  • Works with small models: A fine-tuned 3B model can outperform a general 70B on narrow tasks

Limitations

  • Compute cost: Training requires a GPU; free tiers have limited hours
  • Static knowledge: Fine-tuned model does not know about events after training cutoff
  • Expensive to update: Retraining needed to incorporate new knowledge
  • Risk of catastrophic forgetting: Heavy fine-tuning can degrade general capabilities
  • Data requirements: Needs curated, high-quality datasets; bad data = bad model

When to use it

  • When the model needs to reliably follow a specific output format (structured JSON, specific template)
  • When the model should adopt a consistent persona or tone without relying on long system prompts
  • When you have a narrow, repetitive task (e.g., extracting fields from a specific document type) where general models underperform
  • When you want the model's capabilities without retrieval latency for a known domain
  • When your dataset contains institutional knowledge not present in public training data

When not to use it

  • When knowledge needs to be updated frequently — use RAG instead
  • When you need to cite sources — RAG provides provenance, fine-tuning does not
  • When the base model already performs well on the task with a good system prompt
  • When compute budget is limited and RAG can solve the problem — RAG is cheaper to iterate
  • When the task requires the full knowledge of a large model that is too expensive to fine-tune

Related

  • RAG Pattern — complementary approach; fine-tune for behaviour + RAG for facts
  • Ollama — serve fine-tuned GGUF models locally
  • vLLM — high-throughput serving of fine-tuned models
  • MLX — Apple Silicon inference and fine-tuning framework
  • llama.cpp — GGUF format reference implementation
  • OpenPipe — managed fine-tuning pipeline service
  • LM Studio — local model management and testing
  • Document Preparation for LLM Training — playbook for dataset preparation from local documents

Contribution Metadata

  • Last reviewed: 2026-03-21
  • Confidence: high