Quick answer. Fine-tuning an open-weight LLM in May 2026 is a one-command operation. Pick QLoRA (4-bit base + LoRA adapter) with rank 16 on top of Llama 3, Qwen 3, Gemma 4, or Mistral. Use Unsloth on a single GPU (free tier handles 7B; a rented H100 handles 70B in under three hours), Axolotl for multi-GPU production, MLX-LoRA on Mac, or TRL when you need raw control. Most jobs need 500–2,000 hand-curated examples in ChatML format and run in under an hour. But the most important step is upstream: don't fine-tune for knowledge (use RAG), don't fine-tune for format (use structured outputs), and don't fine-tune until prompt engineering has stopped paying off. When you do fine-tune, the right reasons are narrow: style, niche reasoning, or cheaper inference via distillation. This guide covers the 2026 framework landscape, the decision tree, dataset prep, hyperparameters, evaluation, and the failure modes that quietly ruin most first runs.
Should you fine-tune at all?
Fine-tuning has a reputation as the "advanced" answer to LLM problems. In 2026 it is rarely the right first move. Three patterns reliably beat fine-tuning on the most common requests:
- RAG (retrieval-augmented generation) is the right answer when you want the model to know more facts — internal docs, current events, private data, customer history. Fine-tuning baked-in knowledge ages badly (the data is stale the moment you ship) and can be wrong in ways that are hard to detect. RAG keeps the knowledge in a vector store you can update independently.
- Structured outputs / tool calling is the right answer when you want a specific format — JSON, function call, schema-validated response. Modern model APIs and open-weight chat templates support strict JSON / Pydantic schemas; you don't need a fine-tune to get reliable structure.
- Prompt engineering with in-context examples handles most narrow tasks if you can fit 5–10 examples in the prompt. Frontier models in 2026 have 128 K–1 M context windows; you can include a small "training set" inline and skip the fine-tune entirely.
The cases where fine-tuning genuinely wins:
- Brand voice / style. A 500–2,000 example QLoRA on Llama 3 8B reliably teaches a model to write in your house tone, faster and cheaper than a 10 K-token system prompt.
- Niche reasoning style. Legal citations, medical SOAP notes, scientific paper sections — domain conventions that prompt engineering can describe but not reliably enforce.
- Distillation for inference cost. Use Claude Opus 4.7 or GPT-5.5 to generate 50–100K high-quality examples, then fine-tune Llama 3 8B or Qwen 3.6 8B to mimic the behaviour on your narrow task. You'll serve at roughly 1/100th the cost.
- Tool-calling fine-tunes where the model needs to learn a custom function schema reliably.
- Multilingual or rare-language work where the base model is weak.
If your use case isn't on that list, finish prompt engineering and RAG first. Then come back.
LoRA, QLoRA, DoRA: the parameter-efficient core
Every fine-tune you'll do in 2026 uses one of these adapter techniques. They freeze the base model's billions of weights and train tiny matrices on top, capturing the "task delta" without touching the foundation.
LoRA (Low-Rank Adaptation). Updates two small matrices A and B whose product approximates a weight delta. Two hyperparameters: r (rank — the bottleneck width) and alpha (the scale factor). A LoRA adapter is typically a few hundred MB instead of the original model's tens of GB.
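The update rule described above can be sketched in a few lines of pure Python. This is a toy illustration, not any library's implementation: the helper names (`matmul`, `lora_merged_weight`) and the tiny matrices are made up for the example, and it assumes the usual convention of scaling the product by alpha / r and zero-initialising B.

```python
# Toy LoRA sketch in pure Python; all names and values are illustrative.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_merged_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A), the effective weight with the adapter applied."""
    scale = alpha / r
    delta = matmul(B, A)  # shape: out_features x in_features
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Frozen base weight: 2 outputs x 3 inputs.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
r, alpha = 1, 1
A = [[0.5, -0.5, 1.0]]   # r x in_features (small random values in practice)
B = [[0.0], [0.0]]       # out_features x r, zero-initialised

# With B at zero the adapter is a no-op, so training starts from the base model.
assert lora_merged_weight(W, A, B, alpha, r) == W

# Once B has been trained away from zero, the product B @ A shifts the weights.
B_trained = [[2.0], [0.0]]
merged = lora_merged_weight(W, A, B_trained, alpha, r)
print(merged)
```

Only A and B receive gradients; W is never modified, which is why the saved adapter stays small relative to the base model.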
QLoRA. Keeps the base model in 4-bit NF4 quantization and trains an fp16 LoRA on top. The memory savings are dramatic — a 70B model fits in ~48 GB of VRAM instead of ~140 GB. Quality is within 1–2% of full LoRA on standard benchmarks. This is the 2026 default.
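The memory figures above follow from simple arithmetic on the weights alone. The sketch below is a back-of-the-envelope estimate, assuming exactly 70 billion parameters and counting only weight storage; the gap between the 4-bit result and the ~48 GB quoted above would be taken up by the adapter, optimizer state, activations, and quantization constants, which this sketch does not model.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory needed for the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

params_70b = 70e9  # assumed parameter count for a "70B" model

fp16_gb = weight_memory_gb(params_70b, 16)  # 16-bit weights
nf4_gb = weight_memory_gb(params_70b, 4)    # 4-bit weights

print(f"fp16 weights: {fp16_gb:.0f} GB, 4-bit weights: {nf4_gb:.0f} GB")
```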
DoRA (Weight-Decomposed LoRA). Decomposes each pretrained weight matrix into a magnitude and a direction, trains the magnitude directly, and applies LoRA only to the direction. Converges faster and often matches full fine-tuning at the same rank. Frameworks enable DoRA by default in 2026; where it is off, use_dora=True is a free upgrade.
Rank and alpha defaults that work:
- Style / voice tasks: r=16, alpha=16
- General SFT: r=32, alpha=32
- Complex multi-turn or code: r=64, alpha=64
Target all linear layers (q,k,v,o,gate,up,down). The VRAM cost of targeting all-linear vs just q/v is small and the quality gain is consistent. Ignore the old alpha=2*r convention; Unsloth's 2026 ablations show alpha=r is the cleaner default.
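To see why targeting every linear layer stays cheap, it helps to count the trainable parameters. The sketch below uses the published Llama 3 8B dimensions (hidden size 4096, MLP size 14336, 32 layers, key/value projection width 1024); the `lora_params` helper is hypothetical and counts r × (in + out) parameters per adapted matrix.

```python
# Llama 3 8B layer dimensions.
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS = 4096, 14336, 1024, 32

# module name -> (in_features, out_features)
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(r, targets):
    """Trainable LoRA parameters: r * (in + out) per target matrix, per layer."""
    per_layer = sum(r * (SHAPES[m][0] + SHAPES[m][1]) for m in targets)
    return per_layer * LAYERS

all_linear = lora_params(16, list(SHAPES))
qv_only = lora_params(16, ["q_proj", "v_proj"])

print(f"all-linear: {all_linear:,} params, q/v only: {qv_only:,} params")
```

At rank 16 both configurations train well under 1% of the base model's roughly 8 billion parameters, so the extra adapter weights from all-linear targeting are a small part of the total memory budget.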
Which framework should I use?
Pick by the shape of your hardware, not by which library is trending on X.
| Framework | Right when | Wrong when |
|---|---|---|
| Unsloth | Single GPU (RTX 3090/4090/5090 or one rented H100). Indie hacker / solo dev / startup. You want the fastest training in the cleanest notebook. | Multi-GPU (open-source is single-GPU only; multi-GPU requires Unsloth Pro at $9.99/mo). |
| Axolotl | Multi-GPU, long context (sequence parallelism), CI-integrated training, you want YAML configs you can diff in PRs. | Single GPU — Unsloth is faster. |
| HF TRL + PEFT | You need a custom training loop, custom reward model, research, or you're building your own training framework. | Hello-world fine-tunes — too much boilerplate. |
| MLX-LM (Mac) | Apple Silicon. Up to 70B QLoRA on 96 GB Mac Studio Ultra. Quiet, no fan noise, no cloud bill. | Serious multi-day runs; NVIDIA still trains 2–4× faster on whatever fits in VRAM. |
| LLaMA-Factory | You want a web GUI for fine-tuning, multi-model + multi-method support, no code. | Production reproducibility (the YAML/CLI path in Axolotl is cleaner). |
Unsloth shipped MoE support in Feb 2026 with a claimed 7–12× speedup on MoE fine-tuning, and it supports 500+ base models including Llama 4, Qwen 3.6, Gemma 4, DeepSeek V4, and the gpt-oss family. A 70B QLoRA fits and trains in ~2.8 hours on a single H100. Open-source Unsloth is single-GPU only; multi-GPU is Unsloth Pro ($9.99/mo). For a solo dev this is the right tradeoff.
Axolotl v0.8.x is the production maturity point — config + accelerate launch -m axolotl.cli.train config.yml is a complete pipeline. Supports SFT, LoRA, QLoRA, DPO, KTO, ORPO, GRPO, full reward modelling, quantization-aware training, and recently shipped sequence parallelism for >128 K context training. Slower per-GPU than Unsloth on single-GPU runs, but multi-GPU scaling is real and the operational story is much cleaner.
HF TRL hit v1.0 in April 2026 and unified the post-training stack: SFTTrainer, DPOTrainer, KTOTrainer, ORPOTrainer, GRPOTrainer, RewardTrainer in one library. v1.0 even pulls in Unsloth kernels for a 2× SFT speedup. Reach for it when you want full control of the loop.
MLX-LM on Mac handles LoRA, QLoRA, and DoRA via mlx_lm.lora --train. Realistic ceiling: 32 GB Mac fine-tunes 7–8B comfortably; 96 GB Mac Studio Ultra fine-tunes 70B QLoRA. The unified-memory advantage means a 32 GB Mac fine-tunes models that OOM a 24 GB RTX 3090. See our Apple Silicon LLMs guide for the full picture.
What about hosted fine-tuning APIs?
The hosted landscape changed in early 2026:
OpenAI. Wound down fine-tuning for new models in May 2026. Fine-tuning still works on the GPT-4.1 family and o4-mini for existing customers, but no GPT-5 / 5.4 / 5.5 fine-tunes. OpenAI's bet is that prompts + tools + memory beat bespoke weights for most use cases. Use the legacy fine-tune only if you specifically need a closed-weight API model with API-grade reliability and your data volume is too small to justify any infra (<5K examples).
Anthropic. Claude fine-tuning is not generally available. Older Claude models (Haiku tier) can be fine-tuned through AWS Bedrock or Google Vertex AI under a managed-service workflow. Opus 4.7, Sonnet 4.6, and the current frontier tier are not fine-tunable — Anthropic positions prompt caching, extended thinking, and tool use as the right knobs. If you need Claude-quality behaviour in a fine-tune in 2026, you can't have it — distil to Llama 3 or Qwen 3.6 instead.
Together AI. Every major Llama / Mistral / Qwen size up to 405B. LoRA at ~$0.48 per million training tokens (≤16B), full fine-tune at ~$3.20 per million tokens (70–100B). Fine-tuned adapters serve at base-model inference price plus a small overhead. The cheapest hosted path for most teams.
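Per-token pricing makes a hosted job easy to estimate before you launch it. The sketch below is a rough calculator, assuming the provider bills every token seen in every epoch (check the provider's billing rules); the function name and the example job are hypothetical, and the $0.48 rate is the LoRA price quoted above.

```python
def hosted_training_cost(n_examples, avg_tokens_per_example, epochs, usd_per_million_tokens):
    """Estimated bill in USD, assuming every epoch's tokens are billed."""
    total_tokens = n_examples * avg_tokens_per_example * epochs
    return total_tokens / 1_000_000 * usd_per_million_tokens

# Hypothetical job: 2,000 examples averaging 500 tokens each, trained for 3 epochs.
estimate = hosted_training_cost(2_000, 500, 3, 0.48)
print(f"${estimate:.2f}")
```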
Fireworks AI. Similar pricing to Together. Strong DPO support at 2× SFT cost. Fine-tuned models serve at base price (no surcharge). The right pick when you want RLHF-style alignment without infrastructure.
Lamini. $0.50 per million inference tokens plus $0.50 per tuning step, with $300 free credit. Niche around their "Memory Tuning" (Mixture of Memory Experts) for factual-recall use cases where hallucination is the failure mode. Most teams don't need this; Together / Fireworks is the safer default.
Rule of thumb: API-hosted wins when you'll serve under 1 M tokens/day, when you don't want to manage GPUs, or when you need an SLA you can't personally provide. Local fine-tuning wins on cost above that threshold, on privacy, and on iteration speed.
How do I prepare a fine-tuning dataset?
Format. ChatML (<|im_start|>role\ncontent<|im_end|>) is the 2026 default for chat-style data: the role/content message structure it popularised is what OpenAI, Bedrock, Vertex, and the major open-weight chat models expect. The literal special tokens still differ between model families, so let the tokenizer's chat template render them rather than hard-coding the markers. ShareGPT (turns array with from/value) is fine for multi-turn community datasets. Alpaca (instruction/input/output) is the simplest format for single-turn classification or extraction. Pick one and be consistent. Template inconsistency is the single biggest cause of silent fine-tune failures.
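For reference, here is what a ChatML-rendered conversation looks like. This is a hand-rolled sketch for inspection only: `render_chatml` is a hypothetical helper, and a real run should rely on the tokenizer's own chat template, since special tokens vary by model family.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts as ChatML text."""
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here at inference.
        text += "<|im_start|>assistant\n"
    return text

sample = [
    {"role": "system", "content": "You write in our house tone."},
    {"role": "user", "content": "Draft a welcome email."},
    {"role": "assistant", "content": "Hi there, welcome aboard!"},
]
print(render_chatml(sample))
```

Training examples include the final assistant turn; at inference you render only the earlier turns and set `add_generation_prompt=True`.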
Size. 500–1,000 clean examples genuinely works for narrow tasks (classification, format conversion, brand voice). 5,000–10,000 for general SFT. Quality beats quantity by a wide margin. 500 hand-curated examples consistently outperform 5,000 LLM-scraped ones with format drift and quality variance.
Hygiene. Deduplicate (MinHash or simple exact-match), length-filter, run a quick contamination check against the eval set you'll use, validate the chat template renders correctly on 3–5 random samples before launching a multi-hour run. The datatrove library from Hugging Face handles dedup at scale; for a small dataset, a Python one-liner over hashes is enough.
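For a small dataset, the hash-based dedup and a verbatim contamination check fit in a short script. The sketch below uses only the standard library; the function names and the prompt/response field layout are assumptions for illustration, and it catches exact matches only (near-duplicates need MinHash or similar).

```python
import hashlib
import json

def fingerprint(example):
    """Stable SHA-256 hash of an example's JSON form (keys sorted)."""
    payload = json.dumps(example, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def dedup_exact(examples):
    """Drop exact duplicates while keeping first-seen order."""
    seen, unique = set(), []
    for ex in examples:
        key = fingerprint(ex)
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique

def drop_eval_overlap(train, eval_set):
    """Remove training examples that also appear verbatim in the eval set."""
    eval_keys = {fingerprint(ex) for ex in eval_set}
    return [ex for ex in train if fingerprint(ex) not in eval_keys]

train = [
    {"prompt": "Summarise: A", "response": "a"},
    {"prompt": "Summarise: A", "response": "a"},  # exact duplicate
    {"prompt": "Summarise: B", "response": "b"},  # also present in eval
]
eval_set = [{"prompt": "Summarise: B", "response": "b"}]

clean = drop_eval_overlap(dedup_exact(train), eval_set)
print(len(train), "->", len(clean))
```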
Contamination check. If your eval data was in the base model's pretraining corpus, your "improvement" is just recall. Always hold out a fresh eval set written or collected after the base model's training cutoff. A 2026 Berkeley study found 8 major agent benchmarks could be exploited to near-perfect scores without solving anything, via leaked references and broken scoring. Don't trust any benchmark you didn't read the source of.
What hyperparameters should I use?
The cheat sheet that works across most chat-style QLoRA fine-tunes:
- Learning rate: 1e-4 to 2e-4 for LoRA on 7B–14B; 5e-5 to 1e-4 for full fine-tune; 1e-5 to 5e-5 if you see catastrophic forgetting.
- Batch size: Effective batch of 8–32 sequences. Use gradient accumulation freely — physical batch can be as low as 1 with accumulation of 16.
- Epochs: 1–3. More than 3 on a small dataset overfits.
- Sequence length: Set to the 95th percentile of your dataset's tokenized length. Padding to the max wastes compute.
- Warmup: 3–10% of total steps. Cosine decay schedule.
- Optimizer: 8-bit AdamW (saves VRAM, minimal quality cost).
- LoRA dropout: 0.05 is a safe default; raise to 0.1 if overfitting.
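Several of the values above are derived rather than chosen. The sketch below shows one way to compute them, assuming a nearest-rank percentile and a single GPU by default; the helper names and the example numbers are hypothetical.

```python
import math

def percentile_length(token_lengths, pct=95):
    """Nearest-rank percentile of tokenized example lengths."""
    ordered = sorted(token_lengths)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(rank - 1, 0)]

def schedule(n_examples, micro_batch, grad_accum, epochs, warmup_frac=0.05, n_gpus=1):
    """Return (effective_batch, total_optimizer_steps, warmup_steps)."""
    effective_batch = micro_batch * grad_accum * n_gpus
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_frac)
    return effective_batch, total_steps, warmup_steps

# Hypothetical dataset: 100 examples with lengths 1..100 tokens.
max_seq_length = percentile_length(list(range(1, 101)))

# Physical batch of 1 with accumulation of 16, 1,000 examples, 2 epochs.
effective_batch, total_steps, warmup_steps = schedule(1000, 1, 16, 2)
print(max_seq_length, effective_batch, total_steps, warmup_steps)
```

Examples longer than the chosen sequence length get truncated or dropped, so inspect the outliers before discarding them.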
Track training loss and validation loss every N steps; stop when validation plateaus. The most common amateur mistake is "train for 3 epochs because the docs said so" — early-stop on the validation curve.
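Stopping on the validation curve can be as simple as a patience rule. The sketch below is one minimal formulation, assuming you record validation loss at each evaluation step; `should_stop` and its thresholds are illustrative, and most trainers ship an equivalent callback.

```python
def should_stop(val_losses, patience=3, min_delta=0.0):
    """True when none of the last `patience` evals improved on the earlier best."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent = val_losses[-patience:]
    return all(loss >= best_before - min_delta for loss in recent)

# Hypothetical validation curves.
still_improving = [2.0, 1.5, 1.2, 1.1]
plateaued = [2.0, 1.5, 1.2, 1.21, 1.22, 1.25]

print(should_stop(still_improving), should_stop(plateaued))
```

When the rule fires, keep the checkpoint with the lowest validation loss, not the last one.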
How do I evaluate a fine-tune?
Three layers, in this order:
1. General capability regression. Run lm-eval-harness (EleutherAI) on MMLU / ARC / HellaSwag / GSM8K before and after to confirm you didn't break the base. A 1–3% drop on general benchmarks is normal; 10%+ is catastrophic forgetting.
2. Task-specific eval. Write a custom eval — 100–500 held-out examples scored by exact-match (for structured outputs) or a frontier-model judge (for generative quality). The judge pattern is standard in 2026: ask Claude Opus 4.7 or GPT-5.5 to grade outputs against a rubric. Cheap, fast, and surprisingly reliable if the rubric is clear.
3. Production A/B test. The only eval that can't be gamed. Ship to 5–10% of traffic, log outputs, compare against the base on real workloads. Anything that passes layers 1 and 2 but fails layer 3 was suffering from train/test distribution shift you didn't catch.
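The first two layers reduce to a small amount of scoring code. The sketch below assumes benchmark scores are expressed in percent and reads the 10% threshold as percentage points (one possible interpretation); the function names and all scores are hypothetical.

```python
def exact_match(predictions, references):
    """Fraction of predictions equal to their reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

def flag_regressions(before, after, max_drop_points=10.0):
    """Benchmarks whose score fell by at least `max_drop_points` percentage points."""
    return sorted(name for name in before if before[name] - after[name] >= max_drop_points)

# Layer 2: task-specific exact match on held-out structured outputs.
preds = ['{"ok": true}', '{"ok": false}', '{"ok": true} ']
refs = ['{"ok": true}', '{"ok": true}', '{"ok": true}']
task_score = exact_match(preds, refs)

# Layer 1: general-capability scores before and after the fine-tune.
base_scores = {"mmlu": 65.0, "gsm8k": 50.0}
tuned_scores = {"mmlu": 63.5, "gsm8k": 38.0}
broken = flag_regressions(base_scores, tuned_scores)

print(f"exact match: {task_score:.2f}, regressions: {broken}")
```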
Common failure modes
Catastrophic forgetting. Fine-tune on narrow data → model loses general capability. Mitigations: lower learning rate, train fewer epochs, mix in 10–20% general-instruction data (the regularization approach), prefer LoRA over full fine-tune (the base stays frozen, so forgetting is bounded), or use DoRA which empirically retains better.
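Mixing in general-instruction data is a few lines of dataset plumbing. The sketch below is one way to do it, assuming both datasets are in-memory lists in the same format; `mix_in_general` and its default 15% fraction are illustrative choices within the 10–20% range mentioned above.

```python
import random

def mix_in_general(task_examples, general_examples, general_fraction=0.15, seed=0):
    """Shuffled mix in which roughly `general_fraction` of the items are general data."""
    # Solve n_general / (n_task + n_general) = general_fraction for n_general.
    n_general = round(len(task_examples) * general_fraction / (1 - general_fraction))
    n_general = min(n_general, len(general_examples))
    rng = random.Random(seed)
    mixed = list(task_examples) + rng.sample(list(general_examples), n_general)
    rng.shuffle(mixed)
    return mixed

# Hypothetical datasets, tagged so the mix can be inspected.
task = [("task", i) for i in range(850)]
general = [("general", i) for i in range(1000)]

mixed = mix_in_general(task, general)
share = sum(1 for source, _ in mixed if source == "general") / len(mixed)
print(len(mixed), round(share, 2))
```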
Overfitting tiny datasets. 3 epochs on 500 examples and the model memorizes them. Always hold out 5–10% as validation; early-stop on plateau.
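Holding out a validation slice takes one shuffle and one cut. The sketch below assumes an in-memory list and a fixed seed for reproducibility; `train_val_split` is a hypothetical helper, and the split should happen after deduplication so the same example cannot land on both sides.

```python
import random

def train_val_split(examples, val_fraction=0.1, seed=0):
    """Shuffle once with a fixed seed, then cut off a validation slice."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

# Hypothetical dataset of 500 example ids.
train_set, val_set = train_val_split(list(range(500)))
print(len(train_set), len(val_set))
```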
Base-model contamination. If your eval data was in the base's pretraining corpus, your "improvement" is just recall. Hold out fresh data written after the base's cutoff.
Template drift. Training on ChatML, inferencing on Alpaca → nonsense outputs. Always use the same chat template at inference that you trained with. tokenizer.apply_chat_template is your friend.
Silent format failures. The model trains successfully but the dataset's role tags were wrong, so it learned nothing useful. Always render 5 random training examples with the tokenizer's chat template and eyeball them before launching the run.
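Eyeballing samples catches most problems, and a cheap automated check catches the rest. The sketch below validates role tags before a run, assuming each example is a list of role/content dicts and that only system, user, and assistant roles are expected; the names and rules are illustrative and should be adapted to your template.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_conversation(messages):
    """Return a list of problems found; an empty list means the sample looks sane."""
    if not messages:
        return ["empty conversation"]
    problems = []
    for i, m in enumerate(messages):
        if m.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unknown role {m.get('role')!r}")
        if not str(m.get("content", "")).strip():
            problems.append(f"message {i}: empty content")
    if messages[-1].get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems

good = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello"},
]
bad = [
    {"role": "human", "content": "hi"},  # ShareGPT-style tag left in by mistake
    {"role": "assistant", "content": "hello"},
]

print(validate_conversation(good))
print(validate_conversation(bad))
```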
How much does fine-tuning actually cost?
Realistic 2026 cost ranges on cloud spot pricing:
| Model | Method | Dataset | Hardware | Wall clock | Cost |
|---|---|---|---|---|---|
| Llama 3 8B | QLoRA r=16 | 5K examples | 1× T4 (Colab free) | ~6h | $0 |
| Llama 3 8B | QLoRA r=16 | 5K examples | 1× A100 40GB | ~45 min | ~$1.50 |
| Qwen 3.5 14B | QLoRA r=32 | 10K examples | 1× A100 80GB | ~2h | ~$3.50 |
| Llama 3 70B | QLoRA r=64 | 10K examples | 1× H100 80GB | ~3h | ~$8 |
| Mistral 7B | QLoRA on Mac | 5K examples | M2 Max 32GB | ~90 min | $0 (own hardware) |
| Llama 4 70B | Full LoRA | 50K examples | 8× H100 | ~12h | ~$320 |
The CoreWorxLab "fine-tune Qwen 3.5 for $11 on a rented A100" YouTube clip floating around in April 2026 is real; the workflow is plausibly Unsloth's Qwen 3.5 Colab template on a single A100. Anyone telling you fine-tuning is expensive is talking about full pre-training, not LoRA.
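The rental costs in the table are wall-clock hours multiplied by an hourly rate. The sketch below makes that explicit, assuming a flat per-GPU-hour price with no storage or egress fees; the function is hypothetical, and the $2.50/hr H100 spot rate is an assumption that varies by provider and over time.

```python
def gpu_rental_cost(hours, usd_per_gpu_hour, n_gpus=1):
    """Compute cost in USD for a run billed per GPU-hour."""
    return hours * usd_per_gpu_hour * n_gpus

# A ~3 hour 70B QLoRA on one H100 at an assumed $2.50/hr spot rate.
cost_70b = gpu_rental_cost(3, 2.50)
print(f"${cost_70b:.2f}")
```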
Quick start: three concrete recipes
Recipe A: Llama 3 8B QLoRA on free Colab, Unsloth.
!pip install unsloth
from unsloth import FastLanguageModel
# Load the pre-quantized 4-bit base model and its tokenizer.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048, load_in_4bit=True)
# Attach the adapter to every linear projection.
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16, use_dora=True,
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"])
# ...load dataset, train with SFTTrainer, save adapter
Recipe B: Mistral 7B QLoRA on an M-series Mac, MLX-LoRA.
pip install mlx-lm
mlx_lm.lora \
--model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
--train \
--data ./my-jsonl-dir \
--iters 1000 \
--batch-size 4 \
--num-layers 16 \
--save-every 200 \
--adapter-path ./adapters
Recipe C: Multi-GPU production fine-tune, Axolotl.
# config.yml
base_model: meta-llama/Meta-Llama-3-70B
adapter: qlora
load_in_4bit: true
sequence_len: 4096
lora_r: 64
lora_alpha: 64
lora_dropout: 0.05
lora_target_linear: true
peft_use_dora: true
datasets:
  - path: ./train.jsonl
    type: chat_template
num_epochs: 2
learning_rate: 1e-4
gradient_accumulation_steps: 4
micro_batch_size: 1
warmup_steps: 100
lr_scheduler: cosine
optimizer: adamw_bnb_8bit
# then: accelerate launch -m axolotl.cli.train config.yml
FAQ
What's the difference between LoRA, QLoRA, and DoRA?
LoRA trains tiny adapter matrices on top of a frozen base model — much cheaper than full fine-tuning. QLoRA adds 4-bit quantization of the base, cutting memory roughly 4× with negligible quality loss. DoRA decomposes the weight update into magnitude and direction, often matching full fine-tune quality at the same rank. Default in 2026: QLoRA + DoRA together.
How much data do I need to fine-tune?
500–2,000 examples for narrow tasks like style, voice, classification, or format conversion. 5,000–10,000 for broader instruction tuning. Quality dominates quantity: 500 hand-curated examples beat 5,000 LLM-scraped ones almost every time.
Should I fine-tune Llama 3, Qwen 3.6, Gemma 4, or Mistral?
For most general tasks in 2026: Qwen 3.6 8B or Llama 3 8B. For tiny on-device deployment: Gemma 4 E2B. For coding specifically: Qwen3-Coder. For multilingual: Qwen 3.6. See our open-source LLMs landscape for the full comparison.
Can I fine-tune Claude or GPT?
Not on the current frontier models. OpenAI offers fine-tuning on legacy GPT-4.1 and o4-mini. Anthropic offers Claude Haiku fine-tuning through Bedrock and Vertex but not Sonnet/Opus. The 2026 reality is: if you need Claude or GPT-class quality on a fine-tune, you can't have it directly — distil their outputs to an open-weight model instead.
What's the cheapest way to fine-tune?
Colab free tier with Unsloth for 7B QLoRA — literally $0 if it fits in your patience window. Rented H100 spot on Lambda or RunPod at $2.50/hr for serious work — most 7B–14B fine-tunes finish in under an hour. Together AI's hosted fine-tuning at ~$0.48 per million training tokens if you don't want to touch GPUs at all.
How long does a typical fine-tune take?
A 7B QLoRA on 5K examples: ~45 min on an A100, ~6h on Colab free T4. A 70B QLoRA on 10K examples: ~3h on an H100. A full multi-day pre-training run is a different problem entirely — see our self-training guide.
When is RAG better than fine-tuning?
Always, when the goal is to give the model access to knowledge. RAG keeps data fresh, auditable, and updatable independently of the model. Fine-tuning baked-in facts is brittle: the data stales, hallucinations are harder to detect, and updating means a new training run. Use fine-tuning for behaviour — style, format, reasoning patterns — and RAG for facts.
What about catastrophic forgetting?
Real and common. The model "forgets" general capabilities while specialising. Mitigations: lower learning rate (1e-5 to 5e-5), train fewer epochs (often 1 is enough), mix 10–20% general instruction data into your training set, prefer LoRA over full fine-tune (the base stays intact under the adapter), and use DoRA which retains better empirically.
Can I run a fine-tuned model on a Mac?
Yes. Train on cloud H100 if needed, then download the adapter and merge it with the base, convert to MLX 4-bit, and run via mlx_lm.generate or Ollama. The whole serving side is covered in our Apple Silicon LLMs guide.
What about DPO, ORPO, GRPO, SimPO?
Post-SFT alignment techniques for preference tuning. DPO is the foundational pairwise-preference method. ORPO merges SFT and DPO into one stage — the simplest pipeline. GRPO is what DeepSeek R1 popularised for reasoning-with-verifiable-rewards (math, code). For most fine-tunes: SFT alone is enough. Add ORPO if you have preference pairs. Reach for GRPO only if you're going after a reasoning benchmark specifically.
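To make the pairwise-preference idea concrete, here is the standard DPO loss for a single preference pair. This is a scalar sketch, assuming you already have summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model; the function name and the example numbers are illustrative, and real trainers compute this over batches of tensors.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * margin))."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # log(1 + exp(-x)) equals -log(sigmoid(x)).
    return math.log1p(math.exp(-beta * margin))

# Policy identical to the reference: the margin is zero and the loss is log(2).
neutral = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy has shifted probability toward the chosen response: the loss drops.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)

print(round(neutral, 4), round(improved, 4))
```

The loss only rewards the policy for preferring the chosen response more strongly than the reference does, which is what keeps it anchored to the SFT model.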
How do I deploy a fine-tuned model?
Three options. (1) Self-host: convert adapter to MLX or GGUF, serve via Ollama / MLX-LM / vLLM. (2) Hosted: upload the adapter to Together / Fireworks, serve at base-model price. (3) Edge: quantize to 4-bit and run on-device. The right path depends on traffic volume and SLA needs — see our self-hosting guide.
Related guides
- Apple Silicon LLMs — running and fine-tuning models on Mac with MLX
- Self-training a small LLM from scratch — when pre-training a model yourself actually makes sense
- Self-hosting LLMs — serving infrastructure in the cloud
- Open-source LLMs landscape — which base model to fine-tune
- Llama 4 guide
- Qwen 3.5 guide
- Gemma 4 guide
- DeepSeek V4 guide