Self-Training a Small LLM From Scratch: The 2026 Complete Guide

Pre-training a small LLM from scratch in 2026 — the nanochat starter, the architecture template that won, what datasets to use, and when self-training beats fine-tuning.

Quick answer. Pre-training an LLM from scratch in 2026 is dramatically cheaper than it was in 2024 — you can train a GPT-2-quality model end-to-end for around $50 on an 8× H100 spot instance using Karpathy's nanochat — but for almost every product use case the right answer is still to fine-tune an existing open-weight model, not pre-train. Pre-training makes sense when (a) you have a genuinely novel domain (rare language, niche scientific corpus, proprietary code) with 10B+ clean tokens no open model has seen, (b) you have $20K+ in compute budget you've already exhausted on fine-tuning, or (c) you want to learn how the stack works. This guide covers the realistic recipes — nanochat for learning, modded-nanoGPT for speedrunning, lit-gpt for production — the modern 2026 architecture template (decoder-only + RoPE + RMSNorm + SwiGLU + GQA), the open datasets that actually work (FineWeb, FineWeb-Edu, SlimPajama, The Stack v2), honest cost estimates across model sizes, and a decision tree for when self-training beats fine-tuning. Reading it carefully will save most teams from a six-figure mistake.

Should you train a model from scratch?

For 99% of teams, no. Pre-training a model means building from zero what Meta, OpenAI, and Anthropic spend tens of millions of dollars on every few months. Even with 2026's cost reductions, a Llama-3-class 8B model trained at the now-standard "trillions of tokens" scale costs $1M–2M on spot pricing. A Chinchilla-optimal 7B at 140B tokens is ~$30K on spot. A tiny "learning project" 700M model trained to GPT-2 quality is genuinely $50.

The handful of cases where pre-training actually wins:

  • Genuinely novel domain. Medical, legal, code in a niche language, a rare natural language, a closed scientific corpus the open models haven't seen. If your tokens are meaningfully different from web text, an existing tokenizer will be inefficient and an existing model will hallucinate where it should defer.
  • Data sovereignty. Regulated industries where you can't use a model trained on unknown web data — defence, certain healthcare deployments, classified work.
  • Edge-deployment niche. A 150M–500M model purpose-built for your task and quantised to 4-bit can run on a phone CPU. Llama 3 1B is overkill for most edge use cases. A custom small model can be 3–10× more efficient on your domain.
  • Learning. Understanding the full stack — tokeniser, dataloader, optimiser, distributed training, eval — is the highest-leverage skill in AI engineering. Karpathy's nanochat exists for exactly this. It's the best 4 hours you can spend.

For everything else: fine-tune an open-weight model instead. Better quality, faster, cheaper, lower risk.

What does "small LLM" actually mean in 2026?

The 2026 small-LLM bands and what they cost to train, on cloud spot pricing (H100 at ~$2.50/hr, 8× node at ~$18/hr):

TargetTokensGPU-hours (H100 80GB)Spot costReal-world example
150M, 30B tokens30B~150~$200Tiny edge model
700M, 40B tokens40B~30 on 8× H100~$50nanochat speedrun
1B, 50B tokens50B~600~$1.5KChinchilla-optimal 1B
1B, 3T tokens3T~16,000~$10K–15KTinyLlama scale (inference-optimal)
3B, 60B tokens60B~5,500~$14KChinchilla-optimal 3B
7B, 140B tokens140B~30,000~$30K–50KChinchilla-optimal 7B
7B, 15T tokens15T~3,000,000~$1M–2MLlama-3-class quality

H200 is ~1.3× faster than H100, B200 is ~2–3× — those numbers shift downward 30–60% on newer hardware. The dramatic point is that "GPT-2 quality" is now a $50 weekend project; the dramatic point in the other direction is that Llama 3 quality is a $1M+ training run even at spot.

Has Chinchilla's "20 tokens per parameter" rule broken?

Yes, in the inference-optimal direction. Chinchilla (2022) defined compute-optimal training: train a 1B model on ~20B tokens, a 70B on ~1.4T. In 2024–2026, the field shifted to inference-optimal training — train far past Chinchilla because the per-token inference savings dwarf the extra training cost when you're going to serve the model for years.

Llama 3 8B trained at ~200:1 tokens-per-parameter (15T tokens for 8B). Liquid AI's LFM2.5-350M hit 80,000:1 (28T tokens on 350M params) in April 2026. The general 2026 rule for any model you'll actually serve: train at least 100:1, ideally 500–1,000:1 for small models. For a $50 learning project, Chinchilla is fine; for a production model you'll embed in an app, push way past.

What's the modern 2026 architecture?

The small-LLM architecture is essentially solved. Every open-weight model that matters in 2026 — Llama 3, Qwen 3.6, Gemma 4, Mistral, DeepSeek V4, Granite 4 — converged on the same template. Copy it; don't innovate:

  • Decoder-only transformer. Encoder-decoder is dead for language; sparse-attention / SSM hybrids are experimental.
  • Pre-norm with RMSNorm. LayerNorm is legacy. RMSNorm trains more stably and slightly faster.
  • Rotary positional embeddings (RoPE). With YaRN or similar scaling for context-length extension.
  • SwiGLU FFN. Gated linear unit with Swish activation. Replaced ReLU/GELU FFNs.
  • Grouped-query attention (GQA). Typically 4–8× fewer KV heads than Q heads. Cuts KV-cache memory dramatically with negligible quality loss.
  • No bias terms. In Q/K/V projections and FFN, bias-free is standard.
  • Tied embeddings for small models (output projection shares weights with input embedding). Save parameters that aren't pulling weight.

If you want to experiment beyond this: MoE for parameter efficiency at the cost of training complexity, sliding-window attention for cheap long context, or Mamba/SSM hybrids for very long sequences. Otherwise: copy a Llama-3 config, scale dimensions to your budget, move on.

Which reference recipe should I use?

Three project shapes, three right answers:

1. Learning / "I want to understand the stack" — karpathy/nanochat. The canonical solo-dev starting point. $48 on an 8× H100 node (~$15 on spot), ~1.65 hours wall clock as of March 2026. Single speedrun.sh script takes you tokenizer → pretrain → mid-train → SFT → optional RL on GSM8K → eval → web chat UI. The code is short enough to read end-to-end in an afternoon. If you're new to pre-training, start here.

2. Speed-record geeking — KellerJordan/modded-nanogpt. The community speedrun fork of nanoGPT. As of April 2026, the record is 1.35 minutes to GPT-2-quality on 8× H100, achieved through the Muon optimiser, Flash Attention 3, FP8 head, learnable cross-stream attention, and multi-token prediction. Reading the commit history is a graduate course in modern training tricks. Use this when you want to learn what's actually fast in 2026.

3. Production-grade work — Lightning-AI/litgpt. 20+ supported architectures, production-grade pretrain + fine-tune + deploy. Powered the original TinyLlama. Use this when nanochat's hackability stops being a feature and you need a stable CLI, FSDP2 / TP scaling, and someone-else-can-inherit-this configuration. Pair with torchtitan for multi-node training beyond a single 8× H100 box.

Other notable references worth reading:

  • EleutherAI Pythia — 160M to 12B series with intermediate checkpoints, the gold-standard "training dynamics" reference
  • Allen AI OLMo — fully open weights + data + intermediate states
  • TinyLlama — the canonical 1.1B model trained on 3T tokens, the inference-optimal pattern

What datasets should I use?

For general pre-training, three open corpora dominate 2026:

FineWeb (Hugging Face, 15T tokens). Common Crawl with careful filtering. The default open base. Outperforms RedPajama V1 and RefinedWeb in head-to-head ablations.

FineWeb-Edu (1.3T tokens). Educational-content-filtered subset of FineWeb. Dramatically better MMLU / ARC numbers per training-token than vanilla FineWeb. Use when knowledge density matters for the downstream tasks.

RedPajama V2 (100T+ tokens with quality metadata). Useful when you want to apply your own filtering — the metadata lets you slice by quality, language, source domain.

SlimPajama (627B tokens). Deduplicated subset of RedPajama V1. The right "small but clean" web corpus when your training budget is a single 8× H100 box and you don't have time to do your own dedup at scale.

Domain-specific corpora that pair well:

  • The Stack v2 — code, ~67T tokens, opt-out respected
  • PG-19 — long-form prose, classic literature
  • Wikipedia dumps — factual grounding, multilingual
  • arXiv — scientific writing
  • OpenWebMath — mathematical reasoning
  • SmolLM-Corpus — Hugging Face's curated small-model training set with deduplication done

Deduplication is mandatory. Duplicated examples poison training dynamics — the model overfits to whatever appeared multiple times. Use datatrove (Hugging Face) or text-dedup for scale; MinHash + LSH at large scale; exact-match for small corpora.

Should I train my own tokenizer?

Usually no. Adopt an existing tokenizer (Llama 3, Qwen, Mistral) and inherit a battle-tested vocabulary. You'll save days of iteration and avoid subtle bugs.

Train your own only when you have a heavy domain skew — code-heavy, non-English, scientific notation. The test: measure your dataset's compression rate (bytes-per-token) on your data vs the Llama tokenizer. If you save >15%, you'll feel that gain forever in training cost and inference cost. If you save 3%, don't bother.

Default tools: SentencePiece BPE (byte-level), vocab size 32K for small models, 128K+ for larger or more multilingual models. tokenizers from Hugging Face is the production-grade implementation.

What training infrastructure do I need?

Match to your scale:

  • Single H100: nanochat-scale only. Fine for <500M models, <50B tokens.
  • 8× H100 single node: the solo-dev sweet spot. nanochat, TinyLlama, MicroLlama all train here. PyTorch FSDP2 is the right default.
  • Multi-node: when you outgrow 8× H100. DeepSpeed ZeRO-3, FSDP2 + tensor parallelism, or torchtitan (the recommended scale-out path in 2026 — native PyTorch, async checkpointing, scales to thousands of GPUs).
  • Megatron-LM: NVIDIA's reference for huge scale; rigid. Only use at >100 GPUs.

Where to rent: Lambda Labs, RunPod, Vast.ai for spot 8× H100 at ~$15–20/hr. Modal and Beam are good for "I'd rather not manage infra at all" workflows but cost more. AWS / GCP / Azure are the most expensive on per-GPU-hour but easiest to fold into existing enterprise procurement.

How does the post-training step work?

Pre-training produces a base model — it completes text but doesn't follow instructions. To get a usable chat / assistant model, you post-train in three stages:

1. Supervised fine-tuning (SFT). 10K–50K instruction examples in ChatML format. This teaches the model to respond to instructions rather than continue text. A single epoch is often enough.

2. Preference tuning. 5K–20K preference pairs (good vs bad response to the same prompt). Use DPO, ORPO, KTO, or SimPO. ORPO merges SFT and DPO into one stage and is the simplest pipeline. The preference data is the bottleneck — generating it is harder than the training.

3. (Optional) Reinforcement learning with verifiable rewards. GRPO or DAPO. Only for reasoning tasks where you can mechanically check the answer (math, code that compiles and passes tests). This is what powered DeepSeek R1's reasoning step-change. For most small-model projects, skip — the dataset cost is much higher than the gain on non-reasoning tasks.

For a solo dev: SFT + ORPO is the right floor. Detailed framework coverage is in our fine-tuning guide.

Walk me through a $50 pre-training run

This is roughly what happens when you run speedrun.sh on nanochat:

  1. Provision an 8× H100 spot instance on Lambda or RunPod (~$15/hr).
  2. Train a tokenizer on a slice of FineWeb (~10 minutes).
  3. Pre-train a ~700M-parameter model on 40B FineWeb tokens (~1 hour).
  4. Mid-train on a curated higher-quality mix (~15 minutes).
  5. SFT on instruction data (~10 minutes).
  6. Optional RL on GSM8K math (~15 minutes).
  7. Eval against a standard benchmark suite (~5 minutes).
  8. Serve via a web chat UI nanochat ships (~immediate).

Total wall clock: ~3 hours. Total cost: ~$48 in compute (you'll also burn ~$5 on storage and ingress/egress). The output is a GPT-2-class model — not useful in production, but you understand the entire stack and you have a complete deployable artifact. This is the right introduction to pre-training in 2026.

When does the math actually work out?

The decision tree for "should I pre-train":

  1. Do you have >10B clean domain-specific tokens that no open model has seen? If no, fine-tune.
  2. Do you have $20K+ in compute budget for the training run? If no, fine-tune.
  3. Have you exhausted fine-tuning and RAG on your problem? If no, go back and try harder.
  4. Is the inference cost or sovereignty constraint dominant? If yes, proceed.
  5. If you're doing this to learn — run nanochat for $50. Don't conflate "learning project" with "production project."

The most common mistake is treating pre-training as a quality lever — "if I train my own, it'll be better than Llama." It will not be. The best open-weight models cost $50M+ to train and were built by teams of 30+ researchers over a year. Your $30K pre-training run will produce a model worse than the freely-available Llama 4 Scout, unless you have data Meta didn't.

FAQ

What's the cheapest way to actually pre-train a model end-to-end?

~$50 on an 8× H100 spot instance, using karpathy/nanochat's speedrun.sh. You get a deployable GPT-2-quality model and complete understanding of the stack in about three hours.

What's the difference between pre-training and fine-tuning?

Pre-training builds the model from random initialization on huge web-scale data. Fine-tuning starts from someone else's pre-trained weights and adjusts them for a narrow task. Pre-training is the $10K–$1M operation; fine-tuning is the $1–$50 operation. For 99% of teams, fine-tuning is the right answer.

How many tokens do I need to pre-train per billion parameters?

Chinchilla compute-optimal is ~20:1 tokens per parameter. The 2026 inference-optimal practice for production small models is 100:1 to 1000:1 — Llama 3 used ~200:1, Liquid LFM2.5-350M used 80,000:1. For learning projects, 20:1 is enough. For production small models, push way past.

What datasets should I use for pre-training?

FineWeb (general web), FineWeb-Edu (educational filter for knowledge density), SlimPajama (cleaned, smaller scale), The Stack v2 (code), PG-19 (long-form prose). Mix domain-specifically. Deduplicate with datatrove. For tiny models, SmolLM-Corpus is pre-cleaned and ready.

What's the modern small-LLM architecture I should copy?

Decoder-only transformer + RMSNorm pre-norm + RoPE positional embeddings + SwiGLU FFN + grouped-query attention + no biases + tied embeddings for small models. Every 2026 open-weight model converged here. Don't innovate on architecture — innovate on data and training dynamics.

Should I train my own tokenizer?

Only if your domain compression rate is >15% better than the Llama tokenizer on your data. Otherwise adopt Llama 3, Qwen, or Mistral's tokenizer and inherit the proven vocabulary.

What hardware do I need?

Single H100 for tiny experiments. 8× H100 single node for serious solo work — this is the sweet spot. Multi-node only above that. Mac Studio Ultra is great for inference but ~10× slower than H100 for training; don't pre-train on a Mac unless it's a learning exercise. See our Apple Silicon LLMs guide for the inference side.

What does it cost to train a Llama-3-class model?

For 7B parameters at 15T tokens (Llama-3-8B scale): ~3M H100-hours, ~$1M–2M at spot pricing. Out of reach for solo devs without funding. For Chinchilla-optimal 7B at 140B tokens: ~$30K–50K at spot. For 1B at 50B tokens: ~$1.5K. For "GPT-2 quality" at 700M / 40B tokens: ~$50.

Why bother pre-training at all if I can just fine-tune?

You shouldn't, unless you have genuinely novel domain data, a data-sovereignty constraint, an edge-deployment niche that justifies a custom tiny model, or you're learning. For everything else, fine-tune an existing open-weight model.

How long does pre-training take?

nanochat ~3 hours on 8× H100. TinyLlama 1.1B (3T tokens) ~90 days on 16× A100 historically; closer to 30–45 days on modern 8× H100. Llama-3-8B-class (15T tokens) ~6+ months on a multi-node cluster. The runtime is roughly linear in tokens-per-parameter times parameter-count.

What's the right post-training stack in 2026?

SFT (10K–50K instruction examples, ChatML) → ORPO or DPO (5K–20K preference pairs) → optional GRPO/DAPO for verifiable-reward reasoning tasks. ORPO merges SFT and DPO into one stage, simplest pipeline. The framework details (TRL, Axolotl, Unsloth, MLX-LM) are covered in our fine-tuning guide.