Fine-tuning Open Models¶
What it is¶
Fine-tuning is the process of continuing the training of a pre-trained language model on a curated dataset to adapt its behaviour, tone, knowledge, or task performance for a specific domain. Unlike Retrieval-Augmented Generation (RAG), fine-tuning modifies the model weights themselves, baking knowledge and behavioural patterns into the model rather than retrieving them at inference time.
What problem it solves¶
Pre-trained open models are generalists. They may:

- Not follow a specific output format consistently
- Lack domain terminology or institutional knowledge
- Perform poorly on narrow task types (e.g., extracting structured fields from a specific document type)
- Respond in unwanted styles or languages
Fine-tuning addresses these gaps without replacing the base model's general capabilities.
Where it fits in the stack¶
Model Adaptation Layer — between the raw pre-trained base model (Layer 0) and the inference/serving infrastructure (Layer 1). Fine-tuning is an offline process; the resulting model is then served via Ollama, vLLM, or similar.
┌─────────────────────────────────────────────────────────┐
│ Training (offline, GPU/Apple Silicon) │
│ Dataset → [Base Model] + [Adapter (LoRA)] → Fine-tuned │
└───────────────────────────┬─────────────────────────────┘
│ export GGUF / safetensors
┌───────────────────────────▼─────────────────────────────┐
│ Inference (Ollama / vLLM / llama.cpp) │
│ │
│ Agents: OpenClaw │ OpenHands │ n8n AI nodes │
└─────────────────────────────────────────────────────────┘
Fine-tuning vs RAG — decision guide¶
| Criterion | Fine-tune | RAG |
|---|---|---|
| Knowledge type | Style, format, task behaviour, domain vocabulary | Factual content, documents, up-to-date information |
| Update frequency | Rare (hours to retrain) | Continuous (update vector store) |
| Cost | High upfront (compute) | Low incremental (embedding + storage) |
| Hallucination risk | Reduced for in-distribution tasks | Reduced by grounding in retrieved context |
| Privacy | Weights stay local | Data stays in vector store |
| Latency | Zero (knowledge is in weights) | Adds retrieval time (~100–500 ms) |
| Best for | Consistent format/tone, narrow task types, instruction following | Q&A over documents, knowledge freshness, citations |
Rule of thumb: Use RAG first for factual knowledge. Use fine-tuning when you need the model to behave differently — follow a specific format, adopt a persona, or excel at a narrow task type it currently performs poorly on.
Combine both: fine-tune for behaviour + RAG for factual grounding.
Approaches¶
Parameter-Efficient Fine-Tuning (PEFT)¶
Trains only a small subset of parameters, dramatically reducing compute and memory requirements. The standard for home-lab and small-team fine-tuning.
LoRA (Low-Rank Adaptation)
Adds small rank-decomposition matrices to existing weight matrices. Only the adapter weights are trained; the base model is frozen. Adapters can be merged back into the base model for zero-latency inference.
W_new = W_base + (A × B) where A ∈ R^(d×r), B ∈ R^(r×k), r << d
- Rank (r): Typical values 4–64; higher = more capacity but more VRAM
- Alpha: Scaling factor; usually set to 2× rank
- Target modules: Usually `q_proj`, `v_proj` (attention); sometimes all linear layers
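The merge step can be sketched numerically. The following is an illustrative NumPy mock-up of the update rule above, not a framework API; dimensions and initialisation scales are arbitrary:

```python
import numpy as np

d, k, r = 1024, 1024, 16   # weight dims and LoRA rank, r << d
alpha = 32                 # scaling factor, here 2x rank

W_base = np.random.randn(d, k) * 0.02  # frozen pre-trained weight
A = np.random.randn(d, r) * 0.01       # trainable down-projection
B = np.zeros((r, k))                   # trainable up-projection, zero-init

# Effective weight after merging the adapter (zero-latency inference):
W_new = W_base + (alpha / r) * (A @ B)

# With B zero-initialised the adapter starts as a no-op,
# so training begins from the base model's behaviour:
assert np.allclose(W_new, W_base)
```

Because only `A` and `B` (roughly `r × (d + k)` values) are trained, the trainable parameter count drops by orders of magnitude versus updating the full `d × k` matrix.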
QLoRA (Quantised LoRA)
Loads the base model in 4-bit NF4 quantisation, then trains LoRA adapters in bfloat16. Reduces VRAM by ~75% vs full fine-tuning. Enables fine-tuning 7B models on a 10 GB GPU or a MacBook M4.
Full Fine-tuning
All weights updated. Requires substantial VRAM: at least 2× model size in bfloat16 for weights and gradients, plus optimiser states. Typically only worthwhile for models < 3B or when compute is unconstrained. Not recommended for home-lab use.
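As a back-of-envelope check of the 2× rule, bfloat16 weights and gradients cost 2 bytes each per parameter (a sketch only; optimiser states and activations come on top):

```python
def full_ft_vram_gb(n_params_billion: float) -> float:
    """Minimum VRAM for full fine-tuning: bf16 weights (2 bytes)
    plus bf16 gradients (2 bytes) per parameter. Optimiser states
    (e.g. fp32 Adam moments) and activations add more on top."""
    return n_params_billion * 1e9 * 4 / 1024**3

print(f"7B: ~{full_ft_vram_gb(7):.0f} GB before optimiser states")  # ~26 GB
```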
Tools¶
Unsloth¶
Best for: Fast LoRA/QLoRA fine-tuning on NVIDIA GPUs; 2–5× faster than vanilla transformers with 80% less VRAM.
from unsloth import FastLanguageModel
import torch
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen2.5-7B-Instruct",
max_seq_length=4096,
dtype=torch.bfloat16,
load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_alpha=16,
lora_dropout=0,
bias="none",
use_gradient_checkpointing="unsloth",
random_state=42,
)
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
Supports: Llama 3, Qwen 2.5, Mistral, Gemma, Phi-3, CodeLlama, and more.
LLaMA-Factory¶
Best for: No-code / low-code UI for training; supports 100+ model architectures; CLI and WebUI.
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
# Launch WebUI
llamafactory-cli webui
# Or train from config YAML
llamafactory-cli train examples/train_lora/qwen2.5_7b_lora_sft.yaml
Includes dataset management, training monitoring, merge + export, and evaluation in one tool.
axolotl¶
Best for: YAML-driven, reproducible training pipelines; excellent for teams and automation.
# config.yaml
base_model: Qwen/Qwen2.5-7B-Instruct
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
load_in_4bit: true
adapter: lora
lora_r: 16
lora_alpha: 32
lora_target_modules:
- q_proj
- v_proj
datasets:
- path: data/my_dataset.jsonl
type: sharegpt
conversation: chatml
output_dir: ./output/qwen-finetuned
sequence_len: 4096
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 2e-4
pip install "axolotl[flash-attn,deepspeed]"
accelerate launch -m axolotl.cli.train config.yaml
TRL / SFT Trainer (Hugging Face)¶
Best for: Custom training loops; maximum control; integrates with the full Hugging Face ecosystem.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer
# Load model in 4-bit
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype="bfloat16")
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen2.5-7B-Instruct", quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
dataset = load_dataset("json", data_files="data/train.jsonl", split="train")
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
peft_config=lora_config,
args=SFTConfig(output_dir="./output", num_train_epochs=3,
per_device_train_batch_size=2, learning_rate=2e-4),
)
trainer.train()
trainer.save_model("./output/final")
MLX Fine-tuning (Apple Silicon)¶
Best for: Fine-tuning on MacBook M4 or Mac Studio without NVIDIA GPU.
pip install mlx-lm
# Fine-tune Llama or Qwen using LoRA on Apple Silicon
python -m mlx_lm.lora \
--model mlx-community/Qwen2.5-7B-Instruct-4bit \
--train \
--data data/ \
--iters 1000 \
--batch-size 4 \
--lora-layers 16 \
--adapter-path adapters/
# Fuse adapter back into model
python -m mlx_lm.fuse \
--model mlx-community/Qwen2.5-7B-Instruct-4bit \
--adapter-path adapters/ \
--save-path ./fused-model
MacBook M4 (16 GB unified memory) can fine-tune 4-bit quantised 7B models at ~3–5 tokens/sec training throughput. Sufficient for small-to-medium datasets (< 50k examples).
Dataset preparation¶
Format (ShareGPT / ChatML)¶
The most common format for instruction fine-tuning:
{"conversations": [{"from": "human", "value": "Extract the invoice number from this document: ..."}, {"from": "gpt", "value": "{\"invoice_number\": \"INV-2024-0042\", ...}"}]}
{"conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]}
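Malformed records silently degrade training, so it is worth validating the file before a run. A minimal checker for the schema above (function name and accepted speaker tags are assumptions, not part of any tool's API):

```python
import json

def validate_sharegpt_line(line: str) -> bool:
    """Return True if a JSONL line matches the ShareGPT schema.
    (Helper is illustrative; speaker tags assumed: system/human/gpt.)"""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    convs = record.get("conversations")
    if not isinstance(convs, list) or not convs:
        return False
    # Every turn needs a known speaker and a non-empty value.
    return all(
        isinstance(turn, dict)
        and turn.get("from") in ("system", "human", "gpt")
        and turn.get("value")
        for turn in convs
    )

ok = '{"conversations": [{"from": "human", "value": "Hi"}, {"from": "gpt", "value": "Hello"}]}'
assert validate_sharegpt_line(ok)
assert not validate_sharegpt_line('{"conversations": []}')
```

Run it over every line of the training file and reject (or fix) failures before starting a run.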
Minimum dataset sizes¶
| Task type | Min examples | Recommended |
|---|---|---|
| Format adaptation (consistent output structure) | 100–500 | 1,000 |
| Domain vocabulary / tone | 500–1,000 | 5,000 |
| Narrow task type (specific extraction) | 200–500 | 2,000 |
| General instruction following improvement | 5,000 | 50,000+ |
Data quality > quantity¶
100 high-quality, diverse examples outperform 1,000 noisy ones. Deduplicate, validate outputs, and include hard negative examples (cases where the model currently fails).
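Exact-duplicate removal can be done with a content hash over the conversation; a minimal sketch (helper name is hypothetical, and near-duplicates need more, e.g. MinHash or embedding similarity):

```python
import hashlib
import json

def dedupe_records(records):
    """Drop records whose conversation content is an exact duplicate."""
    seen, unique = set(), []
    for rec in records:
        # Canonical JSON serialisation so key order doesn't matter.
        key = hashlib.sha256(
            json.dumps(rec["conversations"], sort_keys=True).encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

data = [{"conversations": [{"from": "human", "value": "Hi"}]}] * 3
print(len(dedupe_records(data)))  # 1
```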
Creating datasets from existing data¶
# Convert Paperless-ngx documents to training examples
import json
def convert_paperless_doc(doc):
"""Turn a Paperless document into a fine-tuning example."""
return {
"conversations": [
{
"from": "human",
"value": f"Extract key fields from this document:\n\n{doc['content'][:2000]}"
},
{
"from": "gpt",
"value": json.dumps({
"type": doc["document_type"],
"date": doc["created"],
"correspondent": doc["correspondent"],
"amount": doc.get("custom_fields", {}).get("amount")
}, indent=2)
}
]
}
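Converted examples are then written one JSON object per line to produce the JSONL training file; a sketch with inline sample data (file name and sample content are illustrative):

```python
import json

# Examples already in ShareGPT form (e.g. produced by a converter
# like the one above); sample data is illustrative.
examples = [
    {"conversations": [
        {"from": "human", "value": "Extract key fields from this document:\n\nInvoice INV-2024-0042 ..."},
        {"from": "gpt", "value": "{\"type\": \"invoice\", \"date\": \"2024-03-01\"}"},
    ]}
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```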
Compute requirements¶
| Model size | Method | Min VRAM | Recommended | Training time (1k steps) |
|---|---|---|---|---|
| 1–3B | QLoRA | 4 GB | 6 GB | 5–10 min (T4) |
| 7B | QLoRA | 8 GB | 16 GB | 15–30 min (T4) |
| 7B | LoRA (bf16) | 16 GB | 24 GB | 20–40 min (A100) |
| 14B | QLoRA | 16 GB | 24 GB | 40–80 min (A100) |
| 32B | QLoRA | 24 GB | 40 GB | 2–4 hrs (A100) |
| 70B | QLoRA + DeepSpeed | 4× A100 80 GB | — | 8–16 hrs |
Apple Silicon (M4 16 GB): Practical up to 7B QLoRA; 14B possible with patience.
Free and low-cost training platforms¶
| Platform | Free tier | Notes |
|---|---|---|
| Google Colab | T4 GPU (15 GB), limited hours | Best for prototyping; Unsloth works well |
| Hugging Face AutoTrain | Pay-per-job (no free) | No-code UI; very easy for standard tasks |
| Kaggle Notebooks | 2× T4 (30 GB total), 30 hr/week | Good for 7B QLoRA |
| Modal | $30/month free credit | Serverless GPU; great for automation |
| RunPod | Pay-as-you-go from $0.20/hr | Best cost/performance for A100 |
| Vast.ai | Pay-as-you-go from $0.10/hr | Cheaper than RunPod; spot pricing |
| Lambda Cloud | Pay-as-you-go | Reliable; A100 and H100 available |
Exporting and serving fine-tuned models¶
Export to GGUF (for Ollama)¶
# After training with Unsloth — export directly to GGUF
model.save_pretrained_gguf(
"my-model-q4",
tokenizer,
quantization_method="q4_k_m" # q4_k_m is a good default
)
# Or convert from safetensors using llama.cpp
python llama.cpp/convert_hf_to_gguf.py ./output/final --outtype q4_k_m --outfile my-model.gguf
Create an Ollama Modelfile¶
# Modelfile
FROM ./my-model.gguf
SYSTEM """You are a document extraction assistant for a home office.
Always respond with valid JSON. Never add commentary outside the JSON object."""
PARAMETER temperature 0.1
PARAMETER num_ctx 4096
ollama create my-extraction-model -f Modelfile
ollama run my-extraction-model "Extract fields from: ..."
After creating the model in Ollama, it becomes available to all tools in the stack (OpenClaw, OpenHands, n8n, LiteLLM) just like any other model.
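From any client in the stack, the fine-tuned model can then be queried over Ollama's local REST API (`/api/generate`). A minimal stdlib helper; the model name matches the `ollama create` step above, and the default port is Ollama's standard 11434:

```python
import json
import urllib.request

def query_model(prompt: str, model: str = "my-extraction-model",
                host: str = "http://localhost:11434") -> str:
    """Send a prompt to Ollama's /api/generate endpoint and return
    the response text. Assumes a running local Ollama instance."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "format": "json",  # ask Ollama to constrain output to valid JSON
    }
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires Ollama running locally):
# print(query_model("Extract fields from: Invoice INV-2024-0042 ..."))
```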
Evaluation¶
After fine-tuning, evaluate on a held-out test set before deploying:
Compare the fine-tuned model against the base model on the same held-out set. Useful metrics: exact match, ROUGE, format adherence rate, and task-specific accuracy.
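Format adherence is usually the cheapest metric to automate; a sketch that scores the fraction of model outputs that are valid JSON containing all required fields (function and field names are illustrative):

```python
import json

def format_adherence(outputs, required_fields=("invoice_number",)):
    """Fraction of outputs that parse as JSON and contain every
    required field. Substitute your task's schema."""
    ok = 0
    for out in outputs:
        try:
            obj = json.loads(out)
        except json.JSONDecodeError:
            continue  # invalid JSON counts as a failure
        if all(field in obj for field in required_fields):
            ok += 1
    return ok / len(outputs)

preds = ['{"invoice_number": "INV-2024-0042"}', "not json", '{"other": 1}']
print(round(format_adherence(preds), 2))  # 0.33
```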
Post-training verification checklist¶
- [ ] In-distribution test: Evaluate on 50–100 held-out examples from the training distribution.
- [ ] Out-of-distribution test: Evaluate on 10–20 adversarial or edge-case inputs.
- [ ] Format adherence: Measure JSON validity and presence of all required fields.
- [ ] Catastrophic forgetting check: Compare against the base model on a general benchmark (e.g., an MMLU subset).
- [ ] Production monitoring: Monitor inference quality and hallucination rates for the first 2 weeks of deployment.
Common failure modes¶
| Symptom | Likely cause | Fix |
|---|---|---|
| Model ignores system prompt after fine-tuning | Training data lacked system prompt in correct position | Include system prompt in every training example |
| Repetition / looping | Learning rate too high, or dataset has many near-duplicates | Lower LR; deduplicate dataset |
| Format regression (good format → inconsistent after fine-tuning) | Dataset has inconsistent output formats | Standardise all outputs in dataset |
| Catastrophic forgetting (general capability drops) | Too many epochs or large r | Reduce epochs; lower LoRA rank; mix general data |
| High perplexity on validation set | Dataset too small or overfitting | More data; apply early stopping; increase dropout |
Strengths¶
- Zero inference overhead: Knowledge is in weights; no retrieval latency
- Consistent behaviour: Reliable format adherence and tone even without in-context examples
- Privacy: Training data and model stay on-premises
- Works with small models: A fine-tuned 3B model can outperform a general 70B on narrow tasks
Limitations¶
- Compute cost: Training run requires GPU; free tiers have limited hours
- Static knowledge: Fine-tuned model does not know about events after training cutoff
- Expensive to update: Retraining needed to incorporate new knowledge
- Risk of catastrophic forgetting: Heavy fine-tuning can degrade general capabilities
- Data requirements: Needs curated, high-quality datasets; bad data = bad model
When to use it¶
- When the model needs to reliably follow a specific output format (structured JSON, specific template)
- When the model should adopt a consistent persona or tone without relying on long system prompts
- When you have a narrow, repetitive task (e.g., extracting fields from a specific document type) where general models underperform
- When you want the model's capabilities without retrieval latency for a known domain
- When your dataset contains institutional knowledge not present in public training data
When not to use it¶
- When knowledge needs to be updated frequently — use RAG instead
- When you need to cite sources — RAG provides provenance, fine-tuning does not
- When the base model already performs well on the task with a good system prompt
- When compute budget is limited and RAG can solve the problem — RAG is cheaper to iterate
- When the task requires the full knowledge of a large model that is too expensive to fine-tune
Related tools / concepts¶
- RAG Pattern — complementary approach; fine-tune for behaviour + RAG for facts
- Ollama — serve fine-tuned GGUF models locally
- vLLM — high-throughput serving of fine-tuned models
- MLX — Apple Silicon inference and fine-tuning framework
- llama.cpp — GGUF format reference implementation
- OpenPipe — managed fine-tuning pipeline service
- LM Studio — local model management and testing
- Document Preparation for LLM Training — playbook for dataset preparation from local documents
Sources / References¶
- LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)
- QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023)
- Unsloth — Fast LLM Fine-tuning
- LLaMA-Factory
- axolotl
- Hugging Face TRL — SFT Trainer
- MLX Examples — LoRA fine-tuning
Contribution Metadata¶
- Last reviewed: 2026-03-21
- Confidence: high