Fine-tuning Open Models¶
What it is¶
Fine-tuning is the process of continuing the training of a pre-trained language model on a curated dataset to adapt its behaviour, tone, knowledge, or task performance for a specific domain. Unlike Retrieval-Augmented Generation (RAG), fine-tuning modifies the model weights themselves, baking knowledge and behavioural patterns into the model rather than retrieving them at inference time.
What problem it solves¶
Pre-trained open models are generalists. They may:
- Not follow a specific output format consistently
- Lack domain terminology or institutional knowledge
- Perform poorly on narrow task types (e.g., extracting structured fields from a specific document type)
- Respond in unwanted styles or languages
Fine-tuning addresses these gaps without replacing the base model's general capabilities.
Where it fits in the stack¶
Model Adaptation Layer — between the raw pre-trained base model (Layer 0) and the inference/serving infrastructure (Layer 1). Fine-tuning is an offline process; the resulting model is then served via Ollama, vLLM, or similar.
┌─────────────────────────────────────────────────────────┐
│  Training (offline, GPU/Apple Silicon)                  │
│  Dataset → [Base Model] + [Adapter (LoRA)] → Fine-tuned │
└───────────────────────────┬─────────────────────────────┘
                            │ export GGUF / safetensors
┌───────────────────────────▼─────────────────────────────┐
│  Inference (Ollama / vLLM / llama.cpp)                  │
│                                                         │
│  Agents: OpenClaw │ OpenHands │ n8n AI nodes            │
└─────────────────────────────────────────────────────────┘
Typical use cases¶
- Structured Data Extraction: Fine-tuning a small model (e.g., 3B or 7B) to consistently output JSON from messy OCR text.
- Brand Voice Alignment: Ensuring customer-facing agents always use a specific company tone and vocabulary.
- SQL Generation: Adapting a model to a specific database schema and dialect for Text-to-SQL tasks.
- Code Completion: Training on a private codebase to provide context-aware autocomplete that understands internal libraries.
- System Log Analysis: Teaching a model to identify specific error patterns in proprietary server logs.
Strengths¶
- No retrieval overhead: Knowledge is in the weights, so inference needs no retrieval step and adds no retrieval latency.
- Consistent behaviour: Reliable format adherence and tone even without in-context examples.
- Privacy: Training data and model stay on-premises.
- Works with small models: A fine-tuned 3B model can outperform a general 70B on narrow tasks.
Limitations¶
- Compute cost: Training requires a GPU or Apple Silicon; free tiers have limited hours.
- Static knowledge: Fine-tuned model does not know about events after training cutoff.
- Expensive to update: Retraining needed to incorporate new knowledge.
- Risk of catastrophic forgetting: Heavy fine-tuning can degrade general capabilities.
- Data requirements: Needs curated, high-quality datasets; bad data = bad model.
Fine-tuning vs RAG — decision guide¶
| Criterion | Fine-tune | RAG |
|---|---|---|
| Knowledge type | Style, format, task behaviour, domain vocabulary | Factual content, documents, up-to-date information |
| Update frequency | Rare (hours to retrain) | Continuous (update vector store) |
| Cost | High upfront (compute) | Low incremental (embedding + storage) |
| Hallucination risk | Reduced for in-distribution tasks | Reduced by grounding in retrieved context |
| Privacy | Weights stay local | Data stays in vector store |
| Latency | No added latency (knowledge is in weights) | Adds retrieval time (~100–500 ms) |
| Best for | Consistent format/tone, narrow task types, instruction following | Q&A over documents, knowledge freshness, citations |
Rule of thumb: Use RAG first for factual knowledge. Use fine-tuning when you need the model to behave differently — follow a specific format, adopt a persona, or excel at a narrow task type it currently performs poorly on.
Combine both: fine-tune for behaviour + RAG for factual grounding.
Approaches¶
Parameter-Efficient Fine-Tuning (PEFT)¶
Trains only a small subset of parameters, dramatically reducing compute and memory requirements. The standard for home-lab and small-team fine-tuning.
LoRA (Low-Rank Adaptation)
Adds small rank-decomposition matrices to existing weight matrices. Only the adapter weights are trained; the base model is frozen. Adapters can be merged back into the base weights, so inference adds no extra latency.
W_new = W_base + (alpha / r) × (A × B) where W_base ∈ R^(d×k), A ∈ R^(d×r), B ∈ R^(r×k), r << min(d, k)
- Rank (r): Typical values 4–64; higher = more capacity but more VRAM
- Alpha: Scaling factor, applied to the update as alpha / r; usually set to 2× rank
- Target modules: Usually q_proj, v_proj (attention); sometimes all linear layers
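The low-rank update and the merge step can be sketched with NumPy. This is a minimal illustration, not any library's implementation: the matrix sizes are arbitrary, the shapes follow the notation above (A is d×r, B is r×k), and the alpha / r scaling is the convention from the LoRA paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 48, 4, 8          # illustrative sizes only

W_base = rng.normal(size=(d, k))       # frozen pre-trained weight
A = rng.normal(size=(d, r)) * 0.01     # trainable factor, d x r
B = rng.normal(size=(r, k)) * 0.01     # trainable factor, r x k (stand-in for trained values)


def lora_forward(x, W_base, A, B, alpha, r):
    """Base projection plus the scaled low-rank update, without forming A @ B."""
    return x @ W_base + (alpha / r) * ((x @ A) @ B)


def merge(W_base, A, B, alpha, r):
    """Fold the adapter into a single dense weight for serving."""
    return W_base + (alpha / r) * (A @ B)


x = rng.normal(size=(3, d))            # a batch of three input vectors
y_adapter = lora_forward(x, W_base, A, B, alpha, r)
y_merged = x @ merge(W_base, A, B, alpha, r)

trainable = A.size + B.size            # parameters that receive gradients
full = W_base.size                     # parameters a full fine-tune would update
print(f"trainable: {trainable} of {full} ({trainable / full:.1%})")
print("merged weight matches adapter path:", np.allclose(y_adapter, y_merged))
```

Because the merged weight gives the same outputs as the adapter path, a merged model can be served like any other dense model.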
QLoRA (Quantised LoRA)
Loads the base model in 4-bit NF4 quantisation, then trains LoRA adapters in bfloat16. Reduces VRAM by ~75% vs full fine-tuning. Enables fine-tuning 7B models on a 10 GB GPU or a MacBook M4.
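A back-of-the-envelope calculation shows where the saving comes from. The sketch below counts weight storage only; it ignores quantisation constants, adapter weights, gradients, optimizer state, and activations, so real VRAM use is higher than these figures.

```python
def weight_memory_gb(n_params_billion: float, bits_per_param: float) -> float:
    """Memory for the model weights alone, in GB (10^9 bytes)."""
    return n_params_billion * 1e9 * bits_per_param / 8 / 1e9


for size in (3, 7, 14):
    bf16 = weight_memory_gb(size, 16)   # 16-bit base weights
    nf4 = weight_memory_gb(size, 4)     # 4-bit quantised base weights
    saving = 1 - nf4 / bf16
    print(f"{size}B: bf16 {bf16:.1f} GB, 4-bit {nf4:.1f} GB, saving {saving:.0%}")
```

For a 7B model this gives about 14 GB of weights in bfloat16 against about 3.5 GB in 4-bit, which is why the quantised base model fits on consumer hardware with room left for adapters and activations.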
DPO (Direct Preference Optimization)
Instead of using a separate reward model (as in RLHF), DPO directly optimises the policy based on paired preferences (chosen vs. rejected responses). It is more stable and computationally efficient than traditional reinforcement learning.
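The DPO objective for a single preference pair can be written in a few lines. This is a sketch of the loss from the DPO paper, assuming the inputs are summed log-probabilities of each full response under the trained policy and under a frozen reference model; the function name and the default beta of 0.1 are illustrative choices, not taken from a specific library.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # How much more (or less) likely each response is under the policy than the reference
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) computed as softplus(-z); fall back to -z for very negative z
    return math.log1p(math.exp(-logits)) if logits > -30 else -logits


# Policy identical to the reference: no preference learned yet, loss is log(2)
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted probability towards the chosen response: loss is lower
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss falls as the policy raises the chosen response and lowers the rejected one relative to the reference, which is what removes the need for a separate reward model.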
Full Fine-tuning
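All weights are updated. Requires substantial VRAM: with AdamW in mixed precision, weights, gradients, and optimizer states together take roughly 16 bytes per parameter (about 8× the bfloat16 weight size), before activations. Typically only worthwhile for models < 3B or when compute is unconstrained. Not recommended for home-lab use.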
Tools¶
Unsloth¶
Best for: Fast LoRA/QLoRA fine-tuning on NVIDIA GPUs; the project reports 2–5× faster training than vanilla transformers with up to 80% less VRAM.
from unsloth import FastLanguageModel
import torch
# Load the base model in 4-bit for QLoRA-style training.
# Pass dtype=None to auto-detect (e.g. float16 on a T4, which lacks bfloat16 support).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct",
    max_seq_length=4096,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)
# Attach LoRA adapters to the attention and MLP projection layers
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)
# Install (shell)
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
Supports: Llama 3, Qwen 2.5, Mistral, Gemma, Phi-3, CodeLlama, and more.
LLaMA-Factory¶
Best for: No-code / low-code UI for training; supports 100+ model architectures; CLI and WebUI.
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
# Launch WebUI
llamafactory-cli webui
# Or train from config YAML
llamafactory-cli train examples/train_lora/qwen2.5_7b_lora_sft.yaml
Includes dataset management, training monitoring, merge + export, and evaluation in one tool.
axolotl¶
Best for: YAML-driven, reproducible training pipelines; excellent for teams and automation.
# config.yaml
base_model: Qwen/Qwen2.5-7B-Instruct
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
load_in_4bit: true
adapter: lora
lora_r: 16
lora_alpha: 32
lora_target_modules:
  - q_proj
  - v_proj
datasets:
  - path: data/my_dataset.jsonl
    type: sharegpt
    conversation: chatml
output_dir: ./output/qwen-finetuned
sequence_len: 4096
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 2e-4
# Install and launch (shell); the quotes keep shells such as zsh from expanding the brackets
pip install "axolotl[flash-attn,deepspeed]"
accelerate launch -m axolotl.cli.train config.yaml
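The batch settings in the config above determine how many optimizer steps a run takes. The following sketch works that out; the dataset size of 2,000 examples is a hypothetical figure, and the calculation ignores sample packing and dropped partial batches, which change the real step count.

```python
import math


def training_steps(n_examples, micro_batch_size, grad_accum_steps, num_epochs, n_gpus=1):
    """Return (effective batch size, total optimizer steps) for a simple run."""
    effective_batch = micro_batch_size * grad_accum_steps * n_gpus
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return effective_batch, steps_per_epoch * num_epochs


# micro_batch_size=2, gradient_accumulation_steps=4, num_epochs=3 as in config.yaml
effective, total = training_steps(2000, 2, 4, 3)
print(f"effective batch size: {effective}, optimizer steps: {total}")
```

Knowing the step count up front helps when comparing a planned run against per-1k-step training-time estimates.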
TRL / SFT Trainer (Hugging Face)¶
Best for: Custom training loops; maximum control; integrates with the full Hugging Face ecosystem.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer
# Load model in 4-bit
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                                bnb_4bit_compute_dtype="bfloat16")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
dataset = load_dataset("json", data_files="data/train.jsonl", split="train")
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                         lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
    args=SFTConfig(output_dir="./output", num_train_epochs=3,
                   per_device_train_batch_size=2, learning_rate=2e-4),
)
trainer.train()
trainer.save_model("./output/final")
MLX Fine-tuning (Apple Silicon)¶
Best for: Fine-tuning on MacBook M4 or Mac Studio without NVIDIA GPU.
pip install mlx-lm
# Fine-tune Llama or Qwen using LoRA on Apple Silicon
# Note: recent mlx-lm releases name the --lora-layers flag --num-layers;
# run `python -m mlx_lm.lora --help` to check your installed version.
python -m mlx_lm.lora \
    --model mlx-community/Qwen2.5-7B-Instruct-4bit \
    --train \
    --data data/ \
    --iters 1000 \
    --batch-size 4 \
    --lora-layers 16 \
    --adapter-path adapters/
# Fuse adapter back into model
python -m mlx_lm.fuse \
    --model mlx-community/Qwen2.5-7B-Instruct-4bit \
    --adapter-path adapters/ \
    --save-path ./fused-model
A MacBook M4 (16 GB unified memory) can fine-tune 4-bit quantised 7B models at ~3–5 tokens/sec training throughput. This is sufficient for small-to-medium datasets (< 50k examples).
Dataset preparation¶
Format (ShareGPT / ChatML)¶
The most common format for instruction fine-tuning:
{"conversations": [{"from": "human", "value": "Extract the invoice number from this document: ..."}, {"from": "gpt", "value": "{\"invoice_number\": \"INV-2024-0042\", ...}"}]}
{"conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]}
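Some tools expect role/content messages rather than ShareGPT's from/value turns. The sketch below validates one ShareGPT record and converts it; the role mapping (human to user, gpt to assistant) is the common convention, and the specific checks are assumptions about what a clean record should look like rather than rules imposed by any trainer.

```python
import json

# Common ShareGPT speaker names mapped to chat-template roles
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(record):
    """Validate one ShareGPT record and return a list of role/content messages."""
    turns = record.get("conversations")
    if not isinstance(turns, list) or not turns:
        raise ValueError("record needs a non-empty 'conversations' list")
    messages = []
    for turn in turns:
        role = ROLE_MAP.get(turn.get("from"))
        if role is None:
            raise ValueError(f"unknown speaker: {turn.get('from')!r}")
        if not isinstance(turn.get("value"), str) or not turn["value"].strip():
            raise ValueError("every turn needs non-empty text in 'value'")
        messages.append({"role": role, "content": turn["value"]})
    if messages[-1]["role"] != "assistant":
        raise ValueError("the last turn should be the assistant's answer")
    return messages


line = ('{"conversations": [{"from": "human", "value": "Extract the invoice number."}, '
        '{"from": "gpt", "value": "{\\"invoice_number\\": \\"INV-2024-0042\\"}"}]}')
messages = sharegpt_to_messages(json.loads(line))
print(messages)
```

Running a check like this over every line of a JSONL file before training catches malformed records early, when they are cheap to fix.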
Minimum dataset sizes¶
| Task type | Min examples | Recommended |
|---|---|---|
| Format adaptation (consistent output structure) | 100–500 | 1,000 |
| Domain vocabulary / tone | 500–1,000 | 5,000 |
| Narrow task type (specific extraction) | 200–500 | 2,000 |
| General instruction following improvement | 5,000 | 50,000+ |
Data quality > quantity¶
100 high-quality, diverse examples often outperform 1,000 noisy ones. Deduplicate, validate outputs, and include hard examples (cases where the model currently fails).
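Exact and near-exact duplicates can be removed with a normalised hash. This is a minimal sketch that treats two examples as duplicates when their turns match after lowercasing and collapsing whitespace; fuzzier near-duplicate detection (for example MinHash) is out of scope here.

```python
import hashlib
import re


def normalise(text):
    """Lowercase and collapse whitespace so trivial variations hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(examples):
    """Keep the first occurrence of each normalised ShareGPT conversation."""
    seen = set()
    unique = []
    for ex in examples:
        key_text = "\n".join(normalise(t["value"]) for t in ex["conversations"])
        digest = hashlib.sha256(key_text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(ex)
    return unique


examples = [
    {"conversations": [{"from": "human", "value": "Extract the total."},
                       {"from": "gpt", "value": '{"total": 10}'}]},
    {"conversations": [{"from": "human", "value": "  extract   the TOTAL. "},
                       {"from": "gpt", "value": '{"total": 10}'}]},
    {"conversations": [{"from": "human", "value": "Extract the date."},
                       {"from": "gpt", "value": '{"date": "2024-01-05"}'}]},
]
unique = deduplicate(examples)
print(f"kept {len(unique)} of {len(examples)} examples")
```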
Synthetic Data Generation¶
When high-quality human data is scarce, use larger models (Claude 3.5 Sonnet, GPT-4o) to generate synthetic training examples.
- Distilabel: A framework for synthesizing data and using LLM-as-a-judge for quality filtering.
- Self-Correction Loops: Have the base model generate responses, then use a stronger model to correct them and format them into training pairs.
- Glaive: Specialized in generating high-fidelity functional and tool-use datasets.
Creating datasets from existing data¶
# Convert Paperless-ngx documents to training examples
import json
def convert_paperless_doc(doc):
    """Turn a Paperless document into a fine-tuning example."""
    return {
        "conversations": [
            {
                "from": "human",
                "value": f"Extract key fields from this document:\n\n{doc['content'][:2000]}"
            },
            {
                "from": "gpt",
                "value": json.dumps({
                    "type": doc["document_type"],
                    "date": doc["created"],
                    "correspondent": doc["correspondent"],
                    "amount": doc.get("custom_fields", {}).get("amount")
                }, indent=2)
            }
        ]
    }
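Once examples are converted, they need to be written as JSONL with a held-out split for evaluation. The sketch below is one way to do that; the 10% validation fraction, the fixed seed, and the file names are assumptions for illustration, and the demo data is synthetic.

```python
import json
import os
import random
import tempfile


def split_and_write(examples, train_path, val_path, val_fraction=0.1, seed=42):
    """Shuffle deterministically, hold out a validation slice, write both as JSONL."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    val, train = shuffled[:n_val], shuffled[n_val:]
    for path, rows in ((train_path, train), (val_path, val)):
        with open(path, "w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(train), len(val)


# Synthetic stand-ins for converted documents
examples = [
    {"conversations": [{"from": "human", "value": f"doc {i}"},
                       {"from": "gpt", "value": "{}"}]}
    for i in range(20)
]
out_dir = tempfile.mkdtemp()
train_file = os.path.join(out_dir, "train.jsonl")
val_file = os.path.join(out_dir, "val.jsonl")
n_train, n_val = split_and_write(examples, train_file, val_file)
print(f"wrote {n_train} training and {n_val} validation examples")
```

Splitting before training, rather than sampling test cases afterwards, keeps the evaluation set out of the training data.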
Compute requirements¶
| Model size | Method | Min VRAM | Recommended | Training time (1k steps) |
|---|---|---|---|---|
| 1–3B | QLoRA | 4 GB | 6 GB | 5–10 min (T4) |
| 7B | QLoRA | 8 GB | 16 GB | 15–30 min (T4) |
| 7B | LoRA (bf16) | 16 GB | 24 GB | 20–40 min (A100) |
| 14B | QLoRA | 16 GB | 24 GB | 40–80 min (A100) |
| 32B | QLoRA | 24 GB | 40 GB | 2–4 hrs (A100) |
| 70B | QLoRA + DeepSpeed | 4× A100 80 GB | — | 8–16 hrs |
Apple Silicon (M4 16 GB): Practical up to 7B QLoRA; 14B possible with patience.
Free and low-cost training platforms¶
| Platform | Free tier | Notes |
|---|---|---|
| Google Colab | T4 GPU (15 GB), limited hours | Best for prototyping; Unsloth works well |
| Hugging Face AutoTrain | None (pay per job) | No-code UI; very easy for standard tasks |
| Kaggle Notebooks | 2× T4 (30 GB total), 30 hr/week | Good for 7B QLoRA |
| Modal | $30/month free credit | Serverless GPU; great for automation |
| RunPod | Pay-as-you-go from $0.20/hr | Best cost/performance for A100 |
| Vast.ai | Pay-as-you-go from $0.10/hr | Cheaper than RunPod; spot pricing |
| Lambda Cloud | Pay-as-you-go | Reliable; A100 and H100 available |
Exporting and serving fine-tuned models¶
Export to GGUF (for Ollama)¶
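# After training with Unsloth: export directly to GGUF (Python)
model.save_pretrained_gguf(
    "my-model-q4",
    tokenizer,
    quantization_method="q4_k_m"  # q4_k_m is a good default
)
# Or convert from safetensors using llama.cpp (shell).
# convert_hf_to_gguf.py needs a full (merged) model directory, not a bare LoRA adapter,
# and its --outtype does not include K-quants such as q4_k_m, so convert to f16 first
# and quantise in a second step. The llama-quantize path assumes a default CMake build.
python llama.cpp/convert_hf_to_gguf.py ./output/final --outtype f16 --outfile my-model-f16.gguf
llama.cpp/build/bin/llama-quantize my-model-f16.gguf my-model.gguf Q4_K_M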
Create an Ollama Modelfile¶
# Modelfile
FROM ./my-model.gguf
SYSTEM """You are a document extraction assistant for a home office.
Always respond with valid JSON. Never add commentary outside the JSON object."""
PARAMETER temperature 0.1
PARAMETER num_ctx 4096
ollama create my-extraction-model -f Modelfile
ollama run my-extraction-model "Extract fields from: ..."
After creating the model in Ollama, it becomes available to all tools in the stack (OpenClaw, OpenHands, n8n, LiteLLM) just like any other model.
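Other tools reach the model through Ollama's local HTTP API. The sketch below calls the `/api/generate` endpoint using only the standard library; it assumes Ollama is running on its default port 11434 and that the model was created as `my-extraction-model`, and the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(document_text, model="my-extraction-model"):
    """Build the POST request for a single non-streaming generation."""
    payload = {
        "model": model,
        "prompt": f"Extract fields from: {document_text}",
        "stream": False,   # return one JSON object instead of a stream of chunks
        "format": "json",  # ask Ollama to constrain the output to valid JSON
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_fields(document_text, timeout=60):
    """Send the request and parse the model's JSON answer (needs a running Ollama)."""
    with urllib.request.urlopen(build_request(document_text), timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return json.loads(body["response"])  # generated text is in the "response" field
```

Calling `extract_fields("Invoice INV-2024-0042 ...")` returns the parsed dictionary when the server is up; keeping request construction in its own function makes it testable without a running model.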
Technical Verification & Evaluation¶
After fine-tuning, evaluate the model on a held-out test set to confirm it has learned the target behaviour without losing general capability.
Automated Format Validation¶
Use sqlglot or standard JSON parsers to measure adherence to structured output requirements.
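import json
import sqlglot
from sqlglot.errors import ParseError, TokenError
def validate_extraction(prediction, expected_schema):
    """Verify that the model output is a JSON object containing every expected field."""
    try:
        data = json.loads(prediction)
    except json.JSONDecodeError:
        return False
    # Reject valid JSON that is not an object (e.g. a bare list, string, or number)
    return isinstance(data, dict) and all(field in data for field in expected_schema)
def validate_sql(prediction, dialect="postgres"):
    """Verify that the generated SQL parses in the given dialect (syntax only)."""
    try:
        sqlglot.transpile(prediction, read=dialect)
        return True
    except (ParseError, TokenError):
        # TokenError covers input that fails before parsing, e.g. an unterminated string
        return False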
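Per-example checks become useful once aggregated into an adherence rate over the test set. The sketch below does that for JSON extraction; it redefines a small validator so the block runs on its own, and the sample predictions are made up for illustration.

```python
import json


def is_valid_extraction(prediction, required_fields):
    """True when the output is a JSON object containing every required field."""
    try:
        data = json.loads(prediction)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(f in data for f in required_fields)


def format_adherence(predictions, required_fields):
    """Fraction of predictions that pass the format check."""
    if not predictions:
        raise ValueError("need at least one prediction")
    valid = sum(is_valid_extraction(p, required_fields) for p in predictions)
    return valid / len(predictions)


preds = [
    '{"type": "invoice", "date": "2024-01-05"}',   # valid
    'Sure! Here is the JSON: {...}',                # commentary, not JSON
    '{"type": "letter"}',                           # missing "date"
    '["type", "date"]',                             # JSON, but not an object
]
rate = format_adherence(preds, ["type", "date"])
print(f"format adherence: {rate:.0%}")
```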
Evaluation Protocol¶
- In-distribution test: Evaluate on 50–100 held-out examples. Aim for >95% format adherence.
- Out-of-distribution test: Test on 20 edge cases (e.g., extremely long inputs, corrupted OCR).
- Catastrophic Forgetting Check: Run a subset of MMLU or GSM8K. A drop of >10% vs the base model indicates over-training.
# Example using lm-evaluation-harness
lm_eval --model hf --model_args pretrained=./output/fused-model \
    --tasks mmlu_humanities,gsm8k --device cuda:0 --batch_size 8
- Inference Monitoring: Track the "Hallucination Rate" (invalid references or made-up data) during the first 500 production requests.
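The forgetting check can be automated by comparing benchmark scores before and after fine-tuning. This sketch reads the 10% threshold as a relative drop against the base model, which is an assumption about how the rule is meant; the scores shown are invented placeholders, not measured results.

```python
def forgetting_report(base_scores, tuned_scores, max_relative_drop=0.10):
    """Flag tasks where the fine-tuned score fell by more than the allowed fraction."""
    report = {}
    for task, base in base_scores.items():
        tuned = tuned_scores[task]
        drop = (base - tuned) / base if base > 0 else 0.0
        report[task] = {"relative_drop": drop, "regressed": drop > max_relative_drop}
    return report


# Placeholder accuracies for illustration only
base = {"mmlu_humanities": 0.62, "gsm8k": 0.70}
tuned = {"mmlu_humanities": 0.60, "gsm8k": 0.56}

report = forgetting_report(base, tuned)
for task, result in report.items():
    status = "REGRESSED" if result["regressed"] else "ok"
    print(f"{task}: {result['relative_drop']:.1%} drop ({status})")
```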
Common failure modes¶
| Symptom | Likely cause | Fix |
|---|---|---|
| Model ignores system prompt after fine-tuning | Training data lacked system prompt in correct position | Include system prompt in every training example |
| Repetition / looping | Learning rate too high, or dataset has many near-duplicates | Lower LR; deduplicate dataset |
| Format regression (good format → inconsistent after fine-tuning) | Dataset has inconsistent output formats | Standardise all outputs in dataset |
| Catastrophic forgetting (general capability drops) | Too many epochs or too high a LoRA rank (r) | Reduce epochs; lower LoRA rank; mix general data |
| High perplexity on validation set | Dataset too small or overfitting | More data; apply early stopping; increase dropout |
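
Early stopping, listed above as a fix for overfitting, can be sketched as a patience rule over validation losses. The patience of three evaluations and the loss values below are illustrative assumptions; training frameworks ship their own callbacks for this.

```python
def early_stop_step(val_losses, patience=3, min_delta=0.0):
    """Return the index of the evaluation at which training would stop, or None."""
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, bad_evals = loss, 0      # new best: reset the counter
        else:
            bad_evals += 1                 # no improvement at this evaluation
            if bad_evals >= patience:
                return i
    return None


# Validation loss improves, then rises as the model starts to overfit
losses = [1.90, 1.52, 1.31, 1.28, 1.30, 1.33, 1.37, 1.45]
stop_at = early_stop_step(losses)
print(f"stop at evaluation {stop_at}; best loss was {min(losses[:stop_at + 1])}")
```

The checkpoint worth keeping is the one with the best validation loss, not the one where training stopped.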
When to use it¶
- When the model needs to reliably follow a specific output format (structured JSON, specific template).
- When the model should adopt a consistent persona or tone without relying on long system prompts.
- When you have a narrow, repetitive task (e.g., extracting fields from a specific document type) where general models underperform.
- When you want knowledge of a known domain available at inference time without retrieval latency.
- When your dataset contains institutional knowledge not present in public training data.
When not to use it¶
- When knowledge needs to be updated frequently — use RAG instead.
- When you need to cite sources — RAG provides provenance, fine-tuning does not.
- When the base model already performs well on the task with a good system prompt.
- When compute budget is limited and RAG can solve the problem — RAG is cheaper to iterate.
- When the task requires the full knowledge of a large model that is too expensive to fine-tune.
Related tools / concepts¶
- RAG Pattern
- Ollama
- vLLM
- MLX
- llama.cpp
- Document Preparation for LLM Training
- Model Classes
- Model Routing Guide
Sources / References¶
- LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)
- QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023)
- Unsloth — Fast LLM Fine-tuning
- LLaMA-Factory
- axolotl
- Hugging Face TRL — SFT Trainer
- MLX Examples — LoRA fine-tuning
Contribution Metadata¶
- Last reviewed: 2026-05-25
- Confidence: high