Self-Hosting LLMs in 2026: The Complete Guide

When to self-host LLMs vs use the API, what hardware and inference stack to choose, and which open-weight models actually work — with real numbers and crossover thresholds.

Self-hosting large language models stopped being a hobbyist exercise sometime in late 2025. By May 2026, open-weight models (DeepSeek V4, Llama 4, Qwen 3.5, Gemma 4) match or beat the closed frontier on most non-reasoning workloads, inference engines like vLLM and SGLang have become genuinely production-grade, and Blackwell-class GPUs have collapsed the cost-per-token math against API providers. The question is no longer can you self-host — it is whether you should, what to run on, and how to keep latency tail-flat under real traffic. This guide is the engineering brief we hand teams at Codersera before they wire a self-hosted LLM into a product roadmap.

Last updated: May 1, 2026.

TL;DR

  • Cost crossover vs frontier APIs lands around 2M–5M tokens/day on reserved GPU capacity over a 12-month window. Below that, the API still wins.
  • SGLang now leads vLLM by ~29% throughput on H100 (16,200 vs 12,500 tok/s on standard workloads) and up to 6x on prefix-heavy RAG pipelines thanks to RadixAttention.
  • B200 delivers ~3x the throughput of H200 on Llama 2 70B Interactive in MLPerf reporting, with HBM3e at 8 TB/s vs H200's 4.8 TB/s — and roughly 3x lower $/token in FP4 serving.
  • FP8 is the production-default precision in 2026. NVFP4 is rolling out on Blackwell but calibration tooling is still maturing; FP4 is not yet a safe default.
  • DeepSeek V4 Flash needs ~158 GB of VRAM in FP4+FP8 mixed precision — a tight squeeze on a single H200, comfortable on 2x H200 for production-grade KV cache headroom.
  • For dev workstations, the RTX 5090 (32 GB GDDR7, 1.79 TB/s) and Mac Studio M3 Ultra (up to 512 GB unified, 800 GB/s) cover almost every model below 70B at usable interactive speed.

1. Why self-host in 2026, and when not to

Self-hosting earns its keep on three axes: unit economics at volume, data residency, and latency control. It loses on engineering overhead and on the speed at which you can iterate on a model upgrade.

The honest break-even framing: if your application processes under ~50M tokens per month, almost every cost analysis we've reviewed concludes that hosted APIs are cheaper once you account for engineering time, on-call burden, and electricity. The crossover gets serious around 11 billion tokens per month — at that scale, self-hosted infrastructure with appropriate utilization comfortably undercuts GPT-4.1 / Claude 4 tier API pricing, sometimes by 4–10x.

Three non-cost reasons we still see teams self-host below the crossover:

  • Regulated data (healthcare, finance, defense). The rules don't care about your token count.
  • Latency-sensitive inline UX (autocomplete, voice, real-time agents) where the 60–250 ms of round-trip API latency is the product.
  • Custom adapters or fine-tunes that simply aren't available on hosted endpoints.

If none of those apply and you're under 2M tokens/day, keep using the API and revisit in six months. If one of them applies, the rest of this guide is for you.

2. Hardware tiers: what to actually buy or rent

The hardware decision splits cleanly into workstation (one developer, one model, interactive use), single-node production (one model serving real traffic), and multi-node fleet (multiple models, autoscaling, redundancy). Don't conflate them — the right answer for each is wildly different.

Workstation tier

The RTX 5090 has reset the local-inference bar. With 32 GB of GDDR7 and 1.79 TB/s of bandwidth (a 78% jump over the 4090), it hits roughly 234 tok/s on a 30B MoE at short context, over 10,000 tok/s on prefill for Qwen3 8B, and well past 17,000 generation tok/s on Qwen3-8B Q4_K_M with speculative decoding tricks. For most 7B–32B work, one 5090 is enough. Two 5090s comfortably serve a 70B model in Q4 quantization and beat an A100 on $/token.

The Mac Studio M3 Ultra is the surprise winner for memory-heavy work. Up to 512 GB of unified memory at 800 GB/s means a fully unquantized 70B fits trivially, and you can hold MoE models well past 100B in aggressive quantization. MLX (Apple's native framework) consistently runs 26–30% faster than Ollama on the same hardware. We cover the model-by-model tradeoffs in our deep-dive on the best small LLMs for local hardware and the DeepSeek V4 Flash local setup walkthrough, both of which lean heavily on MLX for Mac users.

Single-node production tier

This is where H100 / H200 / B200 live. The H200 (141 GB HBM3e, 4.8 TB/s) gives 1.83x–2.14x the long-context throughput of an H100 across DeepSeek, Llama, and Qwen flagships. Two H200 SXM in one pod (282 GB pooled) is the current sweet spot for V4 Flash with 256K-context KV cache headroom. The B200 (192 GB HBM3e, 8 TB/s) is the new performance king — MLPerf reporting shows roughly 3x H200 throughput on Llama 2 70B Interactive and sub-3 ms latency on 8x B200 systems. SemiAnalysis InferenceX numbers put B200 at roughly 3x lower cost-per-token than H200 in FP4 serving, despite higher hourly rental.

Hardware comparison

| Tier | GPU / system | VRAM | Bandwidth | Realistic workload ceiling | Indicative $/hr (rented) |
|---|---|---|---|---|---|
| Hobbyist | RTX 4090 | 24 GB GDDR6X | 1.0 TB/s | 13B FP16, 32B Q4 | ~$0.35 |
| Workstation | RTX 5090 | 32 GB GDDR7 | 1.79 TB/s | 32B FP16, 70B Q4 | ~$0.69 |
| Workstation (memory-rich) | Mac Studio M3 Ultra 512 GB | 512 GB unified | 800 GB/s | 70B FP16, 120B+ Q4 MoE | n/a (capex ~$10k) |
| Single-node prod | H100 SXM 80 GB | 80 GB HBM3 | 3.35 TB/s | 70B FP8, 100B+ MoE Q4 | ~$2.5–3.0 |
| Single-node prod | H200 SXM 141 GB | 141 GB HBM3e | 4.8 TB/s | V4 Flash FP4+FP8 (2x), Llama 4 405B Q4 (2x) | ~$3.5–4.5 |
| Frontier prod | B200 | 192 GB HBM3e | 8.0 TB/s | V4 full FP4, frontier MoE | ~$6–8 |

Two warnings on the rental numbers: spot pricing varies 30–50% week to week, and several providers (Lambda, RunPod, Modal, Together, Anyscale) trade leadership on different SKUs. Always price two providers before committing.

3. Inference stacks: pick the engine for the workload, not the brand

The inference engine matters more than the GPU once you're past the workstation tier. There are now seven serious options. They are not interchangeable.

vLLM

vLLM is the reference production engine. PagedAttention KV cache, continuous batching, OpenAI-compatible API, the broadest model support, and the largest community. v0.6.0 delivered a 2.7x throughput improvement and 5x latency reduction over earlier releases. On H100 with FlashInfer enabled, it peaks around 12,500 tok/s on standard workloads. The new production-stack sub-project ships a Helm chart, Prometheus metrics, KV-cache reuse via LMCache, and Grafana dashboards out of the box. If you don't have a strong reason to pick something else, pick vLLM.
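
Because vLLM speaks the OpenAI protocol, the client side is the same code you would write against a hosted API. A minimal sketch, assuming a server launched with something like vllm serve Qwen/Qwen3.5-27B-Instruct --quantization fp8 on the default port; the model name and port are whatever your deployment uses.

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-27B-Instruct",  # must match the model the server loaded
    messages=[{"role": "user", "content": "Summarize PagedAttention in two sentences."}],
    max_tokens=128,
    temperature=0.2,
)
print(resp.choices[0].message.content)
```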

SGLang

SGLang's RadixAttention beats vLLM by ~29% on standard H100 throughput (16,200 vs 12,500 tok/s) and up to 6x on prefix-heavy workloads — multi-turn chat, RAG over shared documents, few-shot prompting — where it sustains 75–95% prefix-cache hit rates. On DeepSeek V3-class MoE models, SGLang is ~3.1x faster than vLLM. The downside: smaller community, slightly less smooth Day 0 support for new model architectures, and a steeper learning curve for the structured-generation API.
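
To make the prefix-reuse point concrete, the sketch below shows the request shape where RadixAttention pays off: many requests sharing one long prefix. It talks to SGLang's OpenAI-compatible endpoint; the launch command, port, model, and file path are assumptions rather than a prescription.

```python
# Assumed launch (adjust model and port to your deployment):
#   python -m sglang.launch_server --model-path Qwen/Qwen3.5-27B-Instruct --port 30000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="unused")

shared_context = open("quarterly_report.txt").read()  # identical prefix across requests (hypothetical file)
questions = ["What was Q1 revenue?", "List the top three risks.", "Summarize the outlook."]

for q in questions:
    resp = client.chat.completions.create(
        model="Qwen/Qwen3.5-27B-Instruct",
        messages=[
            {"role": "system", "content": f"Answer using only this document:\n{shared_context}"},
            {"role": "user", "content": q},
        ],
        max_tokens=200,
    )
    # After the first request, the shared system-prompt KV is served from the radix
    # cache, so prefill cost shrinks to roughly the length of the question alone.
    print(resp.choices[0].message.content)
```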

llama.cpp

The C++ engine that almost every consumer-facing tool wraps. GGUF format, ggml kernels, runs on essentially anything (CUDA, Metal, Vulkan, ROCm, CPU). For single-stream inference at the workstation tier, llama.cpp is consistently the fastest option, especially with speculative decoding and the right Q4_K_M / Q5_K_M / Q6_K quantization. It does not do continuous batching well — you don't want it as your serving layer in production.

Ollama

Ollama is llama.cpp wrapped in a daemon, a CLI, an HTTP API, and a sane model registry. As of April 2026, it is the lowest-regret entry point for local inference: ollama pull qwen2.5:32b-instruct-q4_K_M and you have an OpenAI-compatible endpoint on localhost in 30 seconds. We use it as the runtime in our local agent setup guide and our personal AI assistant walkthrough. Don't use Ollama for serving real concurrent traffic — it's optimized for one-user-at-a-time, not for batched throughput.

LM Studio

The desktop GUI. Best in class on Apple Silicon thanks to first-class MLX support; the only one of these tools where double-clicking a model file works. We compare it against the alternatives in detail in our writeup on LM Studio vs Ollama vs OpenClaw. Production-irrelevant — no Docker support, desktop-only.

MLX

Apple's framework. If you're on Mac, this is what your stack should be calling under the hood. 26–30% faster than Ollama on M3 Ultra for the same model and quantization. Native FP16, BF16, INT4, and now FP8 support on M-series.
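
A minimal mlx-lm sketch (pip install mlx-lm); the checkpoint name is illustrative, and any MLX-converted repo works the same way.

```python
from mlx_lm import load, generate

# Illustrative MLX-converted checkpoint; substitute whatever model you actually run.
model, tokenizer = load("mlx-community/Qwen2.5-32B-Instruct-4bit")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain unified memory in one paragraph."}],
    add_generation_prompt=True,
    tokenize=False,
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```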

TGI (Text Generation Inference)

HuggingFace's serving stack. As of mid-2025 it moved into maintenance mode — minor bug fixes only. It still works fine for HuggingFace Inference Endpoints customers, but new deployments should default to vLLM or SGLang. TGI now multi-backends through TRT-LLM and vLLM under the hood anyway.

Inference engine comparison

| Engine | Best for | Throughput (H100, std) | Continuous batching | Prefix caching | Operational maturity |
|---|---|---|---|---|---|
| vLLM | General production serving | ~12,500 tok/s | Yes | Yes (LMCache) | High |
| SGLang | RAG, multi-turn, structured output | ~16,200 tok/s | Yes | Yes (RadixAttention) | Medium-high |
| TensorRT-LLM | NVIDIA-only, max FP8/FP4 perf | ~16,200 tok/s | Yes | Partial | High (NVIDIA-supported) |
| llama.cpp | Single-user, broad hardware | n/a (single-stream tool) | Limited | Manual | High (community) |
| Ollama | Dev environments, agents | Inherits llama.cpp | Limited | No | Medium |
| LM Studio | Mac desktop | Inherits llama.cpp / MLX | No | No | Low (desktop) |
| MLX | Apple Silicon "production-lite" | n/a (Apple-only) | Limited | Yes | Medium |
| TGI | HF Inference Endpoints | Comparable to vLLM | Yes | Yes | Maintenance mode |

4. Models: which open weights to actually run

The 2026 open-weight tier is a four-horse race: DeepSeek V4 / V4 Flash (March–April 2026), Qwen 3.5 (February 2026, Apache 2.0), Gemma 4 (April 2026, Apache 2.0), and Llama 4 (custom community license, MoE). All four families ship MoE flagships; the dense models are now the smaller variants.

For a typical product team, the decision tree is simpler than the leaderboards make it look:

  • Reasoning, code, complex agents: DeepSeek V4 or V4 Flash. V4 Flash gives you ~80% of full V4 quality at ~25% of the VRAM. Our DeepSeek V4 Flash deep dive and the broader V4 pillar walk through the architecture and the FP4+FP8 mixed-precision deployment story.
  • General chat, RAG, balanced workloads: Qwen 3.5 27B (dense) or 397B-A17B (MoE). Apache 2.0, no licensing pain. Edges Gemma 4 31B on MMLU Pro (86.1% vs 85.2%) and GPQA Diamond (85.5% vs 84.3%).
  • Math, competitive code, structured reasoning: Gemma 4 31B leads the pack — AIME 2026 at 89.2%, Codeforces ELO of 2150. The 26B-A4B MoE activates only 3.8B parameters per forward pass and lands 6th on the Arena AI text leaderboard at 1441 ELO. Best efficiency-per-parameter in the open ecosystem.
  • Long-context document workflows: Llama 4 Scout. 109B params, 17B active, 10M-token theoretical context (1–2M practical). License caveat: not usable in apps with over 700M MAU.
  • Edge / sub-16 GB hardware: Phi-4 (14B) for math and structured reasoning — 80.4% on MATH, beats many 30–70B models. Phi-4-mini (3.8B) at Q4_K_M runs at 15–20 tok/s on an M1 Air. Mistral Small 3 (24B dense) beats Llama 3.3 70B at a third of the params and runs on a single 4090 quantized. We have a survey of these in our small-LLM guide.
  • Massive model on tiny hardware: oLLM-style CPU offload tricks — covered in our writeup on running an 80 GB model on 8 GB of VRAM. Slow, but real.

5. Quantization: what's safe to ship in production

Four quantization regimes matter in 2026.

FP16 / BF16 — the reference. Use when VRAM allows. No accuracy concerns, slowest throughput.

FP8 — the production default on Hopper and Blackwell. Typical accuracy degradation is 0.5–2% on standard benchmarks; head-to-head, Llama-3.1-70B at FP8 scores 69.64% on MMLU-Pro vs 70.24% BF16 (a 0.6 point gap), and matches BF16 exactly on HumanEval at 39.02%. Use FP8 by default for serving on H100/H200/B200.

INT4 (AWQ, GPTQ, GGUF Q4_K_M) — the consumer-hardware default. AWQ retains roughly 95% of base quality, GPTQ around 90%. The Marlin kernel made INT4 actually fast on modern GPUs: Marlin-AWQ hits 741 tok/s vs vanilla AWQ's 68 tok/s (10.9x speedup); Marlin-GPTQ hits 712 tok/s vs 276 tok/s (2.6x). Sources disagree on which method wins on which task — AWQ tends to win on code, GPTQ tends to win on instruction-following (IFEval). Test on your eval set before committing.
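
For reference, loading a pre-quantized AWQ checkpoint through vLLM's offline Python API looks roughly like this; recent vLLM builds pick the Marlin kernel automatically when the GPU supports it. The checkpoint name is an example, not a recommendation.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",  # example pre-quantized AWQ repo
    quantization="awq",                     # vLLM upgrades this to the Marlin kernel where supported
    max_model_len=8192,
    gpu_memory_utilization=0.85,
)

outputs = llm.generate(
    ["Write a docstring for a binary search function."],
    SamplingParams(max_tokens=200, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```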

FP4 (NVFP4, MXFP4) — the Blackwell unlock. NVIDIA TensorRT Model Optimizer with NVFP4 gives near-FP8 quality at half the memory and double the FLOPS on B200. As of May 2026, FP4 is not the production default — calibration tooling is still maturing, and accuracy varies by model and task. Treat it as a high-value bet for cost-sensitive Blackwell deployments where you can validate end-to-end on real evals.

Quantization quick-pick

| Format | Bits | Accuracy retention | Where it shines | Where to avoid |
|---|---|---|---|---|
| BF16/FP16 | 16 | 100% | Reference, training | Cost-sensitive serving |
| FP8 | 8 | 98–99.5% | Hopper/Blackwell production | Pre-Hopper hardware |
| AWQ-INT4 (Marlin) | 4 | ~95% | Consumer GPUs, code workloads | Strict instruction-following |
| GPTQ-INT4 (Marlin) | 4 | ~90% | Cheap memory savings, IFEval | Code-heavy production |
| GGUF Q4_K_M | ~4.5 | ~94% | llama.cpp / Ollama / LM Studio | Multi-tenant serving |
| NVFP4 / MXFP4 | 4 | ~96–98% (model-dependent) | B200 cost optimization | Anything you can't re-evaluate |

6. Serving patterns: dev vs prod

Single-GPU developer pattern

One model, one GPU, OpenAI-compatible endpoint on localhost. The whole stack is vllm serve Qwen/Qwen3.5-27B-Instruct --quantization fp8 or the equivalent ollama run command. Use it for prototyping, evals, and integration tests. Don't ship it.

Multi-GPU production pattern

The minimum production stack as of May 2026:

  • 2x H200 (or 1x B200) running vLLM or SGLang in tensor-parallel mode
  • An L4 or L40S in front for the embedding model (cheap, fast, separate KV pressure)
  • Kubernetes with the vLLM production-stack Helm chart, or an equivalent on Modal / RunPod / Anyscale
  • LMCache for cross-request KV reuse — typically 30–60% effective throughput uplift on RAG-shaped traffic
  • A token-aware autoscaler keying off vllm:num_requests_waiting and KV-cache utilization, not CPU% (a minimal scraping sketch follows this list)
  • Speculative decoding for latency-sensitive paths: a 1B–3B draft model in front of the production model gives a 2–3x speedup on greedy decoding
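
A minimal scraping sketch for that scaling signal, assuming vLLM's Prometheus /metrics endpoint and the metric names quoted above; the thresholds are illustrative and the names should be confirmed against your deployed version.

```python
import requests

def parse_gauge(metrics_text: str, name: str) -> float:
    """Pull the first sample of a gauge out of Prometheus text exposition."""
    for line in metrics_text.splitlines():
        if line.startswith(name):
            return float(line.rsplit(" ", 1)[-1])
    raise KeyError(name)

def scale_decision(metrics_url: str = "http://localhost:8000/metrics") -> str:
    text = requests.get(metrics_url, timeout=2).text
    waiting = parse_gauge(text, "vllm:num_requests_waiting")
    kv_usage = parse_gauge(text, "vllm:gpu_cache_usage_perc")
    if waiting > 4 or kv_usage > 0.85:   # thresholds are illustrative, not tuned
        return "scale_up"
    if waiting == 0 and kv_usage < 0.40:
        return "scale_down"
    return "hold"

print(scale_decision())
```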

Hybrid pattern (the underrated default)

Run a self-hosted instance for high-volume, predictable workloads (embeddings, classification, summarization, retrieval rewrites), and route long-tail or peak-load requests to a hosted API. Most teams over-build the self-hosted side and underestimate how much spiky traffic the API absorbs cheaply. Don't be one of them.
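
A toy routing sketch of the hybrid pattern. The endpoint URLs, model names, and task taxonomy are assumptions to adapt, not a spec.

```python
from openai import OpenAI

local = OpenAI(base_url="http://llm.internal:8000/v1", api_key="unused")  # self-hosted vLLM/SGLang
hosted = OpenAI()  # hosted API; reads OPENAI_API_KEY from the environment

BULK_TASKS = {"embed", "classify", "summarize", "rewrite_query"}  # predictable, high-volume work

def complete(task: str, messages: list[dict], overflow: bool = False, max_tokens: int = 512):
    """Keep bulk work on the self-hosted node; send spiky or long-tail traffic to the API.

    `overflow` is your own saturation signal (queue depth, KV utilization, etc.).
    """
    if task in BULK_TASKS and not overflow:
        return local.chat.completions.create(
            model="Qwen/Qwen3.5-27B-Instruct", messages=messages, max_tokens=max_tokens)
    return hosted.chat.completions.create(
        model="gpt-4.1-mini", messages=messages, max_tokens=max_tokens)
```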

7. Monitoring and observability

LLM serving has its own metrics that traditional APM doesn't capture. The non-negotiables:

  • End-to-end latency (request received → final token) — your customer-facing SLO.
  • Time to first token (TTFT) — dominated by prefill. Spikes mean queue depth or KV pressure.
  • Inter-token latency (ITL) — dominated by decode. Spikes mean batch contention or memory bandwidth saturation.
  • KV cache utilization — vLLM exposes this directly. When it crosses ~85%, latency tail blows up. This is your north-star gauge.
  • Queue depth (num_requests_waiting) — your autoscaler signal.
  • Tokens generated / second — capacity-planning metric, not an SLO.

vLLM exposes all of this on /metrics in Prometheus format. Pair with Grafana dashboards (the production-stack ships defaults), wire in OpenTelemetry traces for the request lifecycle, and you'll be able to debug a tail-latency regression in minutes rather than hours.
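
TTFT and ITL are also easy to sanity-check from the client side by timestamping a streaming response. A rough sketch against any OpenAI-compatible endpoint (URL and model name assumed):

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

start = time.perf_counter()
stamps = []
stream = client.chat.completions.create(
    model="Qwen/Qwen3.5-27B-Instruct",
    messages=[{"role": "user", "content": "Explain KV cache utilization."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        stamps.append(time.perf_counter())  # one timestamp per content-bearing chunk

ttft = stamps[0] - start
itl = [b - a for a, b in zip(stamps, stamps[1:])]
print(f"TTFT: {ttft * 1000:.0f} ms, mean ITL: {1000 * sum(itl) / len(itl):.1f} ms")
```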

8. Cost crossover, in detail

The math, as cleanly as we can state it for May 2026:

  • Under 50M tokens/month: APIs win. Don't self-host.
  • 50M–500M tokens/month: A reserved 2x H200 node serving Qwen 3.5 or DeepSeek V4 Flash at ~70% utilization is roughly cost-neutral with frontier APIs. The win comes from latency control or data residency, not the bill.
  • 500M–11B tokens/month: Self-hosting wins on cost. Plan for 30–50% savings vs frontier APIs after engineering overhead.
  • 11B+ tokens/month: Self-hosting wins decisively. 4–10x savings are realistic, especially on B200 with FP4 / NVFP4.

One trap to avoid: comparing self-hosting against the frontier API price (GPT-4.1, Claude 4) when your workload would run fine on a cheaper hosted tier (Haiku, Gemini Flash, DeepSeek API). The API providers also have a Flash tier; the crossover against that is much further out — closer to 200M+ tokens/day.
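
If you want to redo this arithmetic with your own quotes, the shape of the calculation fits in a few lines. Every price below is a placeholder, including the engineering-overhead line; the point is the formula, not the printed numbers.

```python
def crossover_tokens_per_day(gpu_hourly: float, n_gpus: int,
                             api_usd_per_m_tokens: float,
                             eng_overhead_monthly: float = 4000.0) -> float:
    """Daily token volume at which reserved self-hosting matches the API bill."""
    monthly_fixed = gpu_hourly * n_gpus * 24 * 30 + eng_overhead_monthly
    return monthly_fixed / 30 / api_usd_per_m_tokens * 1e6

# Assumed placeholders: 2x H200 reserved at $2.50/hr each, blended frontier price
# ~$30 per 1M tokens, cheap "flash" tier ~$0.60 per 1M tokens. Use your real quotes.
print(f"vs frontier API:   {crossover_tokens_per_day(2.5, 2, 30.0) / 1e6:.1f}M tokens/day")
print(f"vs flash-tier API: {crossover_tokens_per_day(2.5, 2, 0.6) / 1e6:.0f}M tokens/day")
```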

9. Known issues and sharp edges

  • KV cache fragmentation under heterogeneous request lengths. vLLM 0.6+ helps, but you can still see 20–30% throughput loss when mixing 200-token and 32K-token requests in the same batch. Solution: route by length, or pin a separate replica for long-context.
  • Speculative decoding interacts badly with structured output. If you're using grammar-constrained generation or JSON-mode, draft models often fail validation, killing the speedup. Disable speculative for those paths.
  • FP8 KV cache is not free. Some models lose noticeable quality with FP8 KV (vs FP8 weights only). Check on your evals before flipping it on.
  • NCCL across driver or GPU-generation mismatches. Mixing GPU generations (or mismatched driver and NCCL versions) in one tensor-parallel group is a debugging nightmare. Don't.
  • Tokenizer drift. Quantized models occasionally ship with subtly different tokenizers. Compare token counts on a fixed prompt before promoting.
  • Cold start on consumer hardware. Loading a 70B Q4 model from disk to GPU can take 60–120 seconds. Pre-warm before traffic.
  • Mac Studio thermal throttling under sustained load. Real, but only matters past 30+ minute heavy generation sessions. Unlikely to bite interactive use.
  • Ollama defaults to a tiny context window (often 2k or 4k) regardless of what the model supports. Override with OLLAMA_CONTEXT_LENGTH or modelfile num_ctx.

10. FAQ

How much VRAM do I actually need for a 70B model?

FP16: ~140 GB plus KV cache. FP8: ~70 GB. Q4 (AWQ/GPTQ/GGUF): ~35–40 GB plus KV. With Q4 you can run a 70B comfortably on a single H100 80 GB, or on 2x RTX 5090s with tensor parallelism.
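
The arithmetic behind those numbers, plus a rough KV-cache term, as a sketch. The layer and head counts match the common 70B shape (80 layers, 8 KV heads, head dim 128); adjust them for the model you actually run.

```python
def weight_gb(params_b: float, bits: float) -> float:
    """Weight footprint in GB for a given parameter count and bits per weight."""
    return params_b * 1e9 * bits / 8 / 1e9

def kv_cache_gb(tokens: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_el: int = 2) -> float:
    # 2x for K and V, per token, per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_el * tokens / 1e9

print(f"70B weights  FP16: {weight_gb(70, 16):.0f} GB  FP8: {weight_gb(70, 8):.0f} GB  Q4: {weight_gb(70, 4.5):.0f} GB")
print(f"KV cache @ 32K context (FP16): {kv_cache_gb(32_768):.1f} GB per sequence")
```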

Is DeepSeek V4 Flash actually self-hostable?

Yes, on 1x H200 (141 GB) with FP4+FP8 mixed precision at ~158 GB on-disk weights — tight but viable. 2x H200 SXM (282 GB pooled) is the production-grade configuration with comfortable KV headroom for 256K context.

Should I use vLLM or SGLang?

Default to vLLM. Switch to SGLang if your workload is RAG, multi-turn chat, or anything with heavy prefix reuse — that's where the 6x throughput edge shows up.

Is Ollama production-ready?

For a single team's internal tools, yes. For multi-tenant customer-facing serving, no — it's optimized for one user at a time. Use vLLM or SGLang for that.

What's the cheapest way to try a 405B-class model?

Rent a single 8x H100 or 4x H200 node from RunPod or Lambda for an hour. Spin up vLLM with the model in FP8, hit it from your laptop, tear it down. Total cost: under $30.

Can I really run a 70B on a Mac?

Yes — comfortably on M2 Ultra / M3 Ultra with 96 GB+ unified memory. MLX or LM Studio give you 8–15 tok/s on a Q4 70B at usable interactive latency.

How does FP8 compare to FP4 in practice?

FP8 is the boring, safe production choice in 2026. FP4 (NVFP4) on Blackwell is the cost lever for high-volume serving, but plan to validate end-to-end on your eval set — calibration tooling isn't bulletproof yet.

Do I need Kubernetes?

Below ~3 nodes, no — Docker Compose or systemd plus a load balancer is fine. Above that, the autoscaling and rollout story gets painful without it. The vLLM production-stack Helm chart is the path of least resistance.

What about AMD MI300X / MI325X?

vLLM and SGLang both support ROCm, and MI300X has 192 GB of HBM3 — more than an H100. Real-world throughput trails NVIDIA on most kernels by 10–30%, but $/token is competitive on the secondary cloud market. Worth pricing if NVIDIA capacity is unavailable.

How do I handle model upgrades without downtime?

Blue-green at the deployment level: bring up the new model on a parallel replica set, shift traffic via the load balancer, drain the old set. The vLLM production-stack supports this natively.

What's the simplest stack for a startup hitting product-market fit?

Hosted API for the user-facing path, plus one self-hosted Ollama or vLLM instance on a rented 5090 or H100 for embeddings and classification. Move more inline as your token volume crosses 50M/month.

Is fine-tuning worth doing in 2026?

For most teams, no — well-prompted Qwen 3.5, Gemma 4, or DeepSeek V4 Flash will match a fine-tuned 7B on most tasks. Fine-tune when you have a domain-specific eval that pretrained models reliably fail and you have enough labeled data (10k+ high-quality examples) to move it.

How do I prevent KV cache OOM under load?

Set --max-num-batched-tokens and --gpu-memory-utilization conservatively (start at 0.85), monitor vllm:gpu_cache_usage_perc, and reject or queue requests when it crosses 0.85. Don't try to use 100% of VRAM — fragmentation will get you.
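
A sketch of that reject-or-queue rule as a pre-flight check, reusing the same /metrics gauge named above; the URL and threshold are placeholders.

```python
import requests

def kv_cache_usage(metrics_url: str = "http://localhost:8000/metrics") -> float:
    """Read vllm:gpu_cache_usage_perc from the Prometheus text exposition."""
    for line in requests.get(metrics_url, timeout=2).text.splitlines():
        if line.startswith("vllm:gpu_cache_usage_perc"):
            return float(line.rsplit(" ", 1)[-1])
    return 0.0

def admit(max_kv_usage: float = 0.85) -> bool:
    """Return False (shed or queue the request) when the KV cache is nearly full."""
    return kv_cache_usage() < max_kv_usage
```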

What's the right way to benchmark my own deployment?

Replay real traffic, not synthetic. Capture a week of production prompts (with PII scrubbed), replay against the candidate stack at 1x, 2x, and 5x rate, and measure P50/P95/P99 TTFT, ITL, and end-to-end latency. Synthetic benchmarks lie about prefix cache hit rates.

Next steps

If you're under 2M tokens/day, stay on the API and use this guide as a six-month checkpoint. If you're crossing the threshold, the highest-leverage next moves are: (1) pick one model family and one inference engine and stick with them long enough to learn their failure modes, (2) instrument /metrics end-to-end before you scale, and (3) run a real-traffic replay benchmark on your candidate stack before you commit to multi-node infrastructure.

Most teams that successfully ship self-hosted LLMs do it with two engineers: one who has shipped production GPU infrastructure before, and one who has shipped applied ML. If you don't have both already, that's the gap to close first. Hire a Codersera-vetted Python or ML engineer who has shipped self-hosted LLMs.