Fine-tuning Open Models

What it is

Fine-tuning is the process of continuing the training of a pre-trained language model on a curated dataset to adapt its behaviour, tone, knowledge, or task performance for a specific domain. Unlike Retrieval-Augmented Generation (RAG), fine-tuning modifies the model weights themselves, baking knowledge and behavioural patterns into the model rather than retrieving them at inference time.

What problem it solves

Pre-trained open models are generalist. They may:

  • Not follow a specific output format consistently
  • Lack domain terminology or institutional knowledge
  • Perform poorly on narrow task types (e.g., extracting structured fields from a specific document type)
  • Respond in unwanted styles or languages

Fine-tuning addresses these gaps without replacing the base model's general capabilities.

Where it fits in the stack

Model Adaptation Layer — between the raw pre-trained base model (Layer 0) and the inference/serving infrastructure (Layer 1). Fine-tuning is an offline process; the resulting model is then served via Ollama, vLLM, or similar.

┌─────────────────────────────────────────────────────────┐
│         Training (offline, GPU/Apple Silicon)           │
│  Dataset → [Base Model] + [Adapter (LoRA)] → Fine-tuned │
└───────────────────────────┬─────────────────────────────┘
                            │  export GGUF / safetensors
┌───────────────────────────▼─────────────────────────────┐
│        Inference (Ollama / vLLM / llama.cpp)            │
│                                                         │
│  Agents: OpenClaw │ OpenHands │ n8n AI nodes            │
└─────────────────────────────────────────────────────────┘

Typical use cases

  • Structured Data Extraction: Fine-tuning a small model (e.g., 3B or 7B) to consistently output JSON from messy OCR text.
  • Brand Voice Alignment: Ensuring customer-facing agents always use a specific company tone and vocabulary.
  • SQL Generation: Adapting a model to a specific database schema and dialect for Text-to-SQL tasks.
  • Code Completion: Training on a private codebase to provide context-aware autocomplete that understands internal libraries.
  • System Log Analysis: Teaching a model to identify specific error patterns in proprietary server logs.

Strengths

  • Zero inference overhead: Knowledge is in weights; no retrieval latency.
  • Consistent behaviour: Reliable format adherence and tone even without in-context examples.
  • Privacy: Training data and model stay on-premises.
  • Works with small models: A fine-tuned 3B model can outperform a general 70B on narrow tasks.

Limitations

  • Compute cost: A training run requires a GPU or Apple Silicon; free tiers have limited hours.
  • Static knowledge: Fine-tuned model does not know about events after training cutoff.
  • Expensive to update: Retraining needed to incorporate new knowledge.
  • Risk of catastrophic forgetting: Heavy fine-tuning can degrade general capabilities.
  • Data requirements: Needs curated, high-quality datasets; bad data = bad model.

Fine-tuning vs RAG — decision guide

| Criterion | Fine-tune | RAG |
| --- | --- | --- |
| Knowledge type | Style, format, task behaviour, domain vocabulary | Factual content, documents, up-to-date information |
| Update frequency | Rare (hours to retrain) | Continuous (update vector store) |
| Cost | High upfront (compute) | Low incremental (embedding + storage) |
| Hallucination risk | Reduced for in-distribution tasks | Reduced by grounding in retrieved context |
| Privacy | Weights stay local | Data stays in vector store |
| Latency | Zero (knowledge is in weights) | Adds retrieval time (~100–500 ms) |
| Best for | Consistent format/tone, narrow task types, instruction following | Q&A over documents, knowledge freshness, citations |

Rule of thumb: Use RAG first for factual knowledge. Use fine-tuning when you need the model to behave differently — follow a specific format, adopt a persona, or excel at a narrow task type it currently performs poorly on.

Combine both: fine-tune for behaviour + RAG for factual grounding.

Approaches

Parameter-Efficient Fine-Tuning (PEFT)

Trains only a small subset of parameters, dramatically reducing compute and memory requirements. The standard for home-lab and small-team fine-tuning.

LoRA (Low-Rank Adaptation)

Adds small rank-decomposition matrices to existing weight matrices. Only the adapter weights are trained; the base model is frozen. Adapters can be merged back into the base model, so inference incurs no extra latency.

W_new = W_base + (alpha / r) × (A × B)    where A ∈ R^(d×r), B ∈ R^(r×k), r << min(d, k)
  • Rank (r): Typical values 4–64; higher = more capacity but more VRAM
  • Alpha: Scaling factor applied to the update as alpha / r; usually set to 2× rank
  • Target modules: Usually q_proj, v_proj (attention); sometimes all linear layers
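To make the savings concrete, the following minimal sketch counts the trainable parameters a single adapter adds to one d × k weight matrix, and the scale applied to the update, assuming the usual alpha / r scaling from the LoRA formulation. The 4096 × 4096 projection size is a hypothetical example, not a property of any specific model.

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter on a d x k weight."""
    return d * r + r * k  # A is d x r, B is r x k


def lora_scale(alpha: float, r: int) -> float:
    """Scale applied to the low-rank update A x B (assumes alpha / r scaling)."""
    return alpha / r


# Hypothetical square attention projection; real sizes depend on the model.
d = k = 4096
r = 16

full = d * k
adapter = lora_param_count(d, k, r)
print(f"full matrix:  {full:,} params")
print(f"LoRA adapter: {adapter:,} params ({adapter / full:.2%} of the full matrix)")
print(f"update scale with alpha=32: {lora_scale(32, r)}")
```

At rank 16 the adapter is under 1% of the size of the matrix it adapts, which is why optimizer state for LoRA is small compared with full fine-tuning.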

QLoRA (Quantised LoRA)

Loads the base model in 4-bit NF4 quantisation, then trains LoRA adapters in bfloat16. Storing the frozen weights in 4 bits instead of 16 cuts weight memory by ~75%, which enables fine-tuning 7B models on a 10 GB GPU or a MacBook M4.

DPO (Direct Preference Optimization)

Instead of using a separate reward model (as in RLHF), DPO directly optimises the policy based on paired preferences (chosen vs. rejected responses). It is more stable and computationally efficient than traditional reinforcement learning.
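As a rough illustration of the objective, the sketch below computes the DPO loss for one preference pair from summed log-probabilities. The function name, argument names, and the beta value of 0.1 are illustrative assumptions; in practice a library such as TRL computes this over batches of tensors. Note that DPO still compares the policy against a frozen reference model, even though it needs no reward model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single (chosen, rejected) pair of sequence log-probs."""
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # loss = -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Policy identical to the reference: no preference learned yet.
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy raises the chosen response and lowers the rejected one: loss falls.
print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
```

The loss starts at log(2) when the policy equals the reference and decreases as the policy shifts probability toward the chosen response relative to the rejected one.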

Full Fine-tuning

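All weights are updated. Requires substantial VRAM: bfloat16 weights alone take 2 bytes per parameter, and gradients plus Adam optimizer state typically add another 10 or more bytes per parameter, before activations. Typically only worthwhile for models < 3B or when compute is unconstrained. Not recommended for home-lab use.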
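The difference between the three approaches can be approximated with back-of-envelope arithmetic. The sketch below is a rule-of-thumb estimate only: it assumes 12 bytes per trainable parameter (bfloat16 weight and gradient plus two float32 Adam moments), ignores activations and framework overhead, and uses a hypothetical 40M-parameter adapter.

```python
GIB = 1024 ** 3


def estimate_gib(frozen_params, frozen_bytes_per_param, trainable_params,
                 trainable_bytes_per_param=12):
    """Rough memory for weights, gradients and optimizer state (no activations)."""
    frozen = frozen_params * frozen_bytes_per_param
    trainable = trainable_params * trainable_bytes_per_param
    return (frozen + trainable) / GIB


N = 7e9         # 7B-parameter base model
ADAPTER = 40e6  # hypothetical LoRA adapter size; depends on rank and target modules

full = estimate_gib(0, 0, N)           # every weight is trainable
lora = estimate_gib(N, 2, ADAPTER)     # frozen bfloat16 base + adapter
qlora = estimate_gib(N, 0.5, ADAPTER)  # frozen 4-bit base + adapter

print(f"full fine-tune: ~{full:.0f} GiB")
print(f"LoRA (bf16):    ~{lora:.1f} GiB")
print(f"QLoRA (4-bit):  ~{qlora:.1f} GiB")
```

Real runs need additional headroom for activations, which grow with batch size and sequence length, so treat these figures as lower bounds.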

Tools

Unsloth

Best for: Fast LoRA/QLoRA fine-tuning on NVIDIA GPUs; the project reports 2–5× faster training than vanilla transformers with 80% less VRAM.

from unsloth import FastLanguageModel

# Load the base model in 4-bit for QLoRA-style training.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct",
    max_seq_length=4096,
    dtype=None,           # auto-detect: bfloat16 on Ampere or newer, float16 on T4/V100
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

# Install (shell)
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

Supports: Llama 3, Qwen 2.5, Mistral, Gemma, Phi-3, CodeLlama, and more.

LLaMA-Factory

Best for: No-code / low-code UI for training; supports 100+ model architectures; CLI and WebUI.

git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"

# Launch WebUI
llamafactory-cli webui

# Or train from config YAML
llamafactory-cli train examples/train_lora/qwen2.5_7b_lora_sft.yaml

Includes dataset management, training monitoring, merge + export, and evaluation in one tool.

axolotl

Best for: YAML-driven, reproducible training pipelines; excellent for teams and automation.

# config.yaml
base_model: Qwen/Qwen2.5-7B-Instruct
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_4bit: true
adapter: lora
lora_r: 16
lora_alpha: 32
lora_target_modules:
  - q_proj
  - v_proj

datasets:
  - path: data/my_dataset.jsonl
    type: sharegpt
    conversation: chatml

output_dir: ./output/qwen-finetuned
sequence_len: 4096
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 2e-4

# Install and launch (shell); quote the extras so shells like zsh do not expand the brackets
pip install "axolotl[flash-attn,deepspeed]"
accelerate launch -m axolotl.cli.train config.yaml
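The batch settings in a config like the one above determine how many optimizer steps a run takes. The following sketch shows the arithmetic, assuming a single GPU and a hypothetical dataset of 2,000 examples; actual trainers may differ slightly in how they handle the last partial batch or sequence packing.

```python
import math


def training_steps(num_examples, micro_batch_size, grad_accum_steps,
                   num_epochs, num_gpus=1):
    """Return (effective batch size, total optimizer steps) for a run."""
    effective_batch = micro_batch_size * grad_accum_steps * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * num_epochs


# micro_batch_size=2, gradient_accumulation_steps=4, num_epochs=3 as in the config;
# the dataset size of 2,000 is a made-up example.
effective, total = training_steps(2000, 2, 4, 3)
print(f"effective batch size: {effective}")
print(f"total optimizer steps: {total}")
```

Raising gradient_accumulation_steps increases the effective batch size without increasing peak VRAM, at the cost of fewer optimizer steps per epoch.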

TRL / SFT Trainer (Hugging Face)

Best for: Custom training loops; maximum control; integrates with the full Hugging Face ecosystem.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Load model in 4-bit
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                                 bnb_4bit_compute_dtype="bfloat16")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

dataset = load_dataset("json", data_files="data/train.jsonl", split="train")

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                          lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
    args=SFTConfig(output_dir="./output", num_train_epochs=3,
                   per_device_train_batch_size=2, learning_rate=2e-4),
)
trainer.train()
trainer.save_model("./output/final")

MLX Fine-tuning (Apple Silicon)

Best for: Fine-tuning on MacBook M4 or Mac Studio without NVIDIA GPU.

pip install mlx-lm

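# Fine-tune Llama or Qwen using LoRA on Apple Silicon
# (newer mlx-lm releases rename --lora-layers to --num-layers; check `python -m mlx_lm.lora --help`)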
python -m mlx_lm.lora \
  --model mlx-community/Qwen2.5-7B-Instruct-4bit \
  --train \
  --data data/ \
  --iters 1000 \
  --batch-size 4 \
  --lora-layers 16 \
  --adapter-path adapters/

# Fuse adapter back into model
python -m mlx_lm.fuse \
  --model mlx-community/Qwen2.5-7B-Instruct-4bit \
  --adapter-path adapters/ \
  --save-path ./fused-model

MacBook M4 (16 GB unified memory) can fine-tune 4-bit quantised 7B models at ~3–5 tokens/sec training throughput. Sufficient for small-to-medium datasets (< 50k examples).

Dataset preparation

Format (ShareGPT / ChatML)

The most common format for instruction fine-tuning:

{"conversations": [{"from": "human", "value": "Extract the invoice number from this document: ..."}, {"from": "gpt", "value": "{\"invoice_number\": \"INV-2024-0042\", ...}"}]}
{"conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]}
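Some trainers expect role/content "messages" rather than ShareGPT's from/value turns. The sketch below converts one record and renders it as ChatML text; the helper names and the role mapping are illustrative assumptions, so check which format your trainer actually expects before converting.

```python
# Assumed mapping from ShareGPT speaker labels to chat roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(record):
    """Convert one ShareGPT record into a role/content messages record."""
    return {
        "messages": [
            {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
            for turn in record["conversations"]
        ]
    }


def to_chatml(messages):
    """Render messages as ChatML text, one <|im_start|>...<|im_end|> block per turn."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


record = {"conversations": [
    {"from": "human", "value": "Extract the invoice number from this document: ..."},
    {"from": "gpt", "value": "{\"invoice_number\": \"INV-2024-0042\"}"},
]}
converted = sharegpt_to_messages(record)
print(to_chatml(converted["messages"]))
```

An unknown speaker label raises a KeyError here, which is usually preferable to silently training on a mislabelled turn.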

Minimum dataset sizes

| Task type | Min examples | Recommended |
| --- | --- | --- |
| Format adaptation (consistent output structure) | 100–500 | 1,000 |
| Domain vocabulary / tone | 500–1,000 | 5,000 |
| Narrow task type (specific extraction) | 200–500 | 2,000 |
| General instruction following improvement | 5,000 | 50,000+ |

Data quality > quantity

100 high-quality, diverse examples often outperform 1,000 noisy ones. Deduplicate, validate outputs, and include hard examples (cases where the model currently fails).
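A minimal cleaning pass along these lines might look like the sketch below. It assumes ShareGPT-style records whose final turn should be valid JSON (as in an extraction task) and treats prompts that differ only in case or whitespace as duplicates; both are simplifying assumptions, and the function names are hypothetical.

```python
import hashlib
import json
import re


def normalise(text):
    """Collapse whitespace and case so trivially different prompts match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_dataset(records):
    """Drop records with invalid JSON answers or duplicate prompts."""
    seen = set()
    kept = []
    for rec in records:
        turns = rec["conversations"]
        prompt, answer = turns[0]["value"], turns[-1]["value"]
        try:
            json.loads(answer)  # validate the target output
        except json.JSONDecodeError:
            continue
        key = hashlib.sha256(normalise(prompt).encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        kept.append(rec)
    return kept


def make(prompt, answer):
    return {"conversations": [{"from": "human", "value": prompt},
                              {"from": "gpt", "value": answer}]}


raw = [
    make("Extract fields: invoice 42", '{"invoice_number": "42"}'),
    make("extract  fields:  Invoice 42", '{"invoice_number": "42"}'),  # near-duplicate
    make("Extract fields: invoice 43", "invoice_number: 43"),          # not JSON
]
print(len(clean_dataset(raw)), "of", len(raw), "records kept")
```

Exact-match hashing only catches trivial duplicates; fuzzy or embedding-based matching would be needed for paraphrased prompts.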

Synthetic Data Generation

When high-quality human data is scarce, use larger models (Claude 3.5 Sonnet, GPT-4o) to generate synthetic training examples.

  • Distilabel: A framework for synthesizing data and using LLM-as-a-judge for quality filtering.
  • Self-Correction Loops: Have the base model generate responses, then use a stronger model to correct them and format them into training pairs.
  • Glaive: Specialized in generating high-fidelity functional and tool-use datasets.

Creating datasets from existing data

# Convert Paperless-ngx documents to training examples
import json

def convert_paperless_doc(doc):
    """Turn a Paperless document into a fine-tuning example."""
    return {
        "conversations": [
            {
                "from": "human",
                "value": f"Extract key fields from this document:\n\n{doc['content'][:2000]}"
            },
            {
                "from": "gpt",
                "value": json.dumps({
                    "type": doc["document_type"],
                    "date": doc["created"],
                    "correspondent": doc["correspondent"],
                    "amount": doc.get("custom_fields", {}).get("amount")
                }, indent=2)
            }
        ]
    }
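Once examples are built, they need to be written as JSONL with a held-out split. The sketch below shuffles deterministically and writes two files; the file names train.jsonl and valid.jsonl, the 10% validation fraction, and the helper name are assumptions to adapt to whatever your trainer expects.

```python
import json
import random
from pathlib import Path


def write_splits(examples, out_dir, valid_fraction=0.1, seed=42):
    """Shuffle examples and write train.jsonl / valid.jsonl; return row counts."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)  # fixed seed for reproducible splits
    n_valid = max(1, int(len(examples) * valid_fraction))
    splits = {"valid.jsonl": examples[:n_valid], "train.jsonl": examples[n_valid:]}

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, rows in splits.items():
        with open(out / name, "w", encoding="utf-8") as fh:
            for row in rows:
                fh.write(json.dumps(row, ensure_ascii=False) + "\n")
    return {name: len(rows) for name, rows in splits.items()}
```

For example, `write_splits(examples, "data/")` on 20 examples would write 2 rows to valid.jsonl and 18 to train.jsonl. Keep the validation file out of every training run so that later evaluation numbers stay meaningful.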

Compute requirements

| Model size | Method | Min VRAM | Recommended | Training time (1k steps) |
| --- | --- | --- | --- | --- |
| 1–3B | QLoRA | 4 GB | 6 GB | 5–10 min (T4) |
| 7B | QLoRA | 8 GB | 16 GB | 15–30 min (T4) |
| 7B | LoRA (bf16) | 16 GB | 24 GB | 20–40 min (A100) |
| 14B | QLoRA | 16 GB | 24 GB | 40–80 min (A100) |
| 32B | QLoRA | 24 GB | 40 GB | 2–4 hrs (A100) |
| 70B | QLoRA + DeepSpeed | 4× A100 80 GB | | 8–16 hrs |

Apple Silicon (M4 16 GB): Practical up to 7B QLoRA; 14B possible with patience.

Free and low-cost training platforms

| Platform | Free tier / pricing | Notes |
| --- | --- | --- |
| Google Colab | T4 GPU (15 GB), limited hours | Best for prototyping; Unsloth works well |
| Hugging Face AutoTrain | Pay-per-job (no free tier) | No-code UI; very easy for standard tasks |
| Kaggle Notebooks | 2× T4 (30 GB total), 30 hr/week | Good for 7B QLoRA |
| Modal | $30/month free credit | Serverless GPU; great for automation |
| RunPod | Pay-as-you-go from $0.20/hr | Best cost/performance for A100 |
| Vast.ai | Pay-as-you-go from $0.10/hr | Cheaper than RunPod; spot pricing |
| Lambda Cloud | Pay-as-you-go | Reliable; A100 and H100 available |

Exporting and serving fine-tuned models

Export to GGUF (for Ollama)

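# After training with Unsloth: export directly to GGUF (Python)
model.save_pretrained_gguf(
    "my-model-q4",
    tokenizer,
    quantization_method="q4_k_m"    # q4_k_m is a good default
)

# Or convert merged safetensors with llama.cpp (shell).
# convert_hf_to_gguf.py does not emit q4_k_m directly, so convert to f16 first, then quantise.
# The directory must contain full merged weights, not only a LoRA adapter.
# The llama-quantize path depends on how llama.cpp was built.
python llama.cpp/convert_hf_to_gguf.py ./output/final --outtype f16 --outfile my-model-f16.gguf
./llama.cpp/build/bin/llama-quantize my-model-f16.gguf my-model.gguf Q4_K_M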

Create an Ollama Modelfile

# Modelfile
FROM ./my-model.gguf

SYSTEM """You are a document extraction assistant for a home office.
Always respond with valid JSON. Never add commentary outside the JSON object."""

PARAMETER temperature 0.1
PARAMETER num_ctx 4096

# Build and run (shell)
ollama create my-extraction-model -f Modelfile
ollama run my-extraction-model "Extract fields from: ..."

After creating the model in Ollama, it becomes available to all tools in the stack (OpenClaw, OpenHands, n8n, LiteLLM) just like any other model.
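Other tools reach the model through Ollama's local HTTP API. The sketch below shows one way a script might call it with only the standard library; it assumes Ollama is listening on its default port 11434 and that the model name matches the one created above. The helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_payload(model, prompt):
    """Request body for a single non-streaming completion constrained to JSON."""
    return {"model": model, "prompt": prompt, "stream": False, "format": "json"}


def extract(model, prompt, url=OLLAMA_URL, timeout=120):
    """Send the prompt to Ollama and parse the model's JSON answer."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # The generated text is returned in the "response" field.
    return json.loads(body["response"])
```

With the server running, `extract("my-extraction-model", "Extract fields from: ...")` would return the parsed JSON object, or raise if the model's output is not valid JSON.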

Technical Verification & Evaluation

After fine-tuning, evaluate the model on a held-out test set to confirm it has learned the target behaviour without degrading its general capabilities.

Automated Format Validation

Use sqlglot or standard JSON parsers to measure adherence to structured output requirements.

import json
import sqlglot

def validate_extraction(prediction, expected_schema):
    """Return True if the output is a JSON object containing every expected field."""
    try:
        data = json.loads(prediction)
    except json.JSONDecodeError:
        return False
    # A bare list or string is valid JSON but not a valid extraction result.
    return isinstance(data, dict) and all(field in data for field in expected_schema)

def validate_sql(prediction, dialect="postgres"):
    """Return True if sqlglot can parse the SQL in the given dialect (syntax only)."""
    try:
        sqlglot.transpile(prediction, read=dialect)
        return True
    except sqlglot.errors.SqlglotError:
        # Covers ParseError as well as tokenizer errors such as unterminated strings.
        return False
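Per-example checks become useful once aggregated into a single metric. The sketch below computes a format adherence rate over a batch of predictions; it inlines a small JSON check so that it runs on its own, and the field names and sample predictions are made up for illustration.

```python
import json


def is_valid_json_object(text, required_fields):
    """True if text parses to a JSON object containing every required field."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(f in data for f in required_fields)


def adherence_rate(predictions, required_fields):
    """Fraction of predictions that pass the format check."""
    if not predictions:
        raise ValueError("no predictions to evaluate")
    passed = sum(is_valid_json_object(p, required_fields) for p in predictions)
    return passed / len(predictions)


# Hypothetical model outputs for a two-field extraction task.
predictions = [
    '{"type": "invoice", "date": "2024-01-05"}',
    '{"type": "receipt", "date": "2024-02-11"}',
    '{"type": "invoice"}',                        # missing field
    'Here is the JSON: {"type": "invoice"}',      # commentary breaks parsing
]
rate = adherence_rate(predictions, ["type", "date"])
print(f"format adherence: {rate:.0%}")
```

Tracking this number per checkpoint makes it easy to see whether additional training is still improving format adherence.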

Evaluation Protocol

  1. In-distribution test: Evaluate on 50–100 held-out examples. Aim for >95% format adherence.
  2. Out-of-distribution test: Test on 20 edge cases (e.g., extremely long inputs, corrupted OCR).
  3. Catastrophic Forgetting Check: Run a subset of MMLU or GSM8K on both the base and the fine-tuned model. A drop of >10% vs the base model suggests over-training.
    # Example using lm-evaluation-harness
    lm_eval --model hf --model_args pretrained=./output/fused-model \
            --tasks mmlu_humanities,gsm8k --device cuda:0 --batch_size 8
    
  4. Inference Monitoring: Track the "Hallucination Rate" (invalid references or made-up data) during the first 500 production requests.
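The forgetting check in step 3 can be automated by comparing benchmark scores before and after fine-tuning. The sketch below interprets the 10% threshold as a relative drop, which is an assumption since the protocol does not say whether it is relative or absolute; the scores shown are invented placeholders, not measured results.

```python
def relative_drop(base_score, tuned_score):
    """Fractional decrease of the tuned score relative to the base score."""
    if base_score <= 0:
        raise ValueError("base score must be positive")
    return (base_score - tuned_score) / base_score


def flag_forgetting(base_scores, tuned_scores, threshold=0.10):
    """Return {task: drop} for every task whose relative drop exceeds the threshold."""
    flagged = {}
    for task, base in base_scores.items():
        drop = relative_drop(base, tuned_scores[task])
        if drop > threshold:
            flagged[task] = drop
    return flagged


# Placeholder accuracies; substitute the numbers reported by your evaluation runs.
base_scores = {"mmlu_humanities": 0.60, "gsm8k": 0.50}
tuned_scores = {"mmlu_humanities": 0.58, "gsm8k": 0.40}

for task, drop in flag_forgetting(base_scores, tuned_scores).items():
    print(f"{task}: {drop:.0%} relative drop, review training settings")
```

Running the same benchmark configuration for both models matters more than the exact threshold, since prompt format and few-shot settings can shift scores on their own.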

Common failure modes

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Model ignores system prompt after fine-tuning | Training data lacked system prompt in correct position | Include system prompt in every training example |
| Repetition / looping | Learning rate too high, or dataset has many near-duplicates | Lower LR; deduplicate dataset |
| Format regression (good format → inconsistent after fine-tuning) | Dataset has inconsistent output formats | Standardise all outputs in dataset |
| Catastrophic forgetting (general capability drops) | Too many epochs or large r | Reduce epochs; lower LoRA rank; mix general data |
| High perplexity on validation set | Dataset too small or overfitting | More data; apply early stopping; increase dropout |

When to use it

  • When the model needs to reliably follow a specific output format (structured JSON, specific template).
  • When the model should adopt a consistent persona or tone without relying on long system prompts.
  • When you have a narrow, repetitive task (e.g., extracting fields from a specific document type) where general models underperform.
  • When you want the model's capabilities without retrieval latency for a known domain.
  • When your dataset contains institutional knowledge not present in public training data.

When not to use it

  • When knowledge needs to be updated frequently — use RAG instead.
  • When you need to cite sources — RAG provides provenance, fine-tuning does not.
  • When the base model already performs well on the task with a good system prompt.
  • When compute budget is limited and RAG can solve the problem — RAG is cheaper to iterate.
  • When the task requires the full knowledge of a large model that is too expensive to fine-tune.

Contribution Metadata

  • Last reviewed: 2026-05-25
  • Confidence: high