Nari Dia 1.6B vs Sesame CSM-1B: Best Open-Source TTS in 2026?
Last updated April 2026 — refreshed for current model/tool versions.
Two open-source speech models released in early 2025 changed expectations for what runs locally: Nari Labs Dia 1.6B, a dialogue-first TTS that generates expressive multi-speaker audio in a single pass, and Sesame CSM-1B, a conversational speech model built on a Llama backbone with native Hugging Face Transformers support. This guide compares both on architecture, hardware requirements, 2026 benchmark data, and practical deployment — so you can pick the right tool without running both.
What changed in 2026 — key updates for readers of the 2025 version
- Dia 1.6B has a successor: Dia2. Nari Labs released Dia2 with 1B and 2B checkpoints. Dia2 uses a streaming architecture: it starts generating audio from the first few tokens rather than requiring the full text input. Both variants are Apache 2.0 licensed and available on Hugging Face (`nari-labs/Dia2-1B`, `nari-labs/Dia2-2B`). The original Dia 1.6B remains available and fully usable.
- CSM-1B is now natively integrated into Hugging Face Transformers (v4.52.1+, released May 2025). You can load it directly with `CsmForConditionalGeneration`; no custom inference code is required.
- CSM-1B's license is confirmed as Apache 2.0. The original post listed "open source" without specifying the license; it is Apache 2.0, matching Dia 1.6B.
- New competing models entered the field. Chatterbox (Resemble AI, 350M params) beats ElevenLabs in blind tests with 63.75% preference. Kokoro v1.0 holds a ~44% win rate on TTS Arena V2 at 82M params. F5-TTS and CSM-1B remain among the "most well-rounded" open-source performers per Inferless's 12-model benchmark.
- The TTS Arena ELO landscape. Among open-source deployable models, Chatterbox sits at ~1502 ELO on TTS Arena and Kokoro v1.0 at ~1400. Dia 1.6B and CSM-1B are not currently ranked on TTS Arena V2 (they target a different deployment mode than the Arena's single-utterance evaluation).
- VRAM note correction. The original post stated Dia needs 10GB VRAM. In practice, the 1.6B model requires approximately 8–10GB VRAM in bfloat16. The new Dia2-1B runs in less VRAM than the original 1.6B; exact requirements are not officially published (verify on the vendor page).
TL;DR — Which Model Should You Use?
| Criterion | Nari Dia 1.6B | Sesame CSM-1B |
|---|---|---|
| Best for | Expressive multi-speaker dialogue, audiobooks, podcasts | Real-time conversational agents, low-latency voice apps |
| Nonverbal sounds | Yes — from text cues like `(laughs)` | Partial — via contextual conditioning only |
| GPU requirement | ~8–10GB VRAM (bfloat16) | 6–8GB VRAM recommended; CPU fallback available |
| Streaming output | No (Dia 1.6B); Yes (Dia2) | No native streaming; low-latency generation |
| HF Transformers native | No (custom inference code) | Yes (v4.52.1+) |
| License | Apache 2.0 | Apache 2.0 |
| Voice cloning | Audio conditioning | Contextual prompting |
| Successor model | Dia2 (1B/2B, streaming) | None announced as of April 2026 |
Model Overview
Nari Dia 1.6B
- Developer: Nari Labs (small team: 1 full-time, 1 part-time engineer as of 2025)
- Parameters: 1.6 billion
- Released: April 22, 2025
- License: Apache 2.0
- Architecture: Encoder → Transformer (1.6B) → audio frame generator → Descript Audio Codec decoder
- Language: English only
- Hardware: ~8–10GB VRAM in bfloat16; CUDA 12.6+; PyTorch 2.0+
- Successor: Dia2 (1B and 2B streaming variants, released November 2025)
Sesame CSM-1B
- Developer: Sesame AI Labs
- Parameters: ~1 billion (Llama-style backbone transformer plus a smaller audio-decoder transformer)
- Released: March 13, 2025 (Hugging Face); February 27, 2025 (initial release)
- License: Apache 2.0
- Architecture: Llama backbone + Mimi audio codec decoder, producing RVQ audio codes
- Language: English primary; limited non-English capacity
- Hardware: 6–8GB VRAM recommended; CPU inference supported
- HF Transformers: Native support since v4.52.1 (May 20, 2025) via `CsmForConditionalGeneration`
Technical Architecture
| Feature | Nari Dia 1.6B | Sesame CSM-1B |
|---|---|---|
| Model Size | 1.6B parameters | 1B parameters |
| Core Technology | TTS-optimized language model + DAC decoder | Llama backbone + Mimi audio codec (RVQ) |
| Input Modalities | Text + optional audio prefix | Text + optional audio context segments |
| Output Format | Direct waveform via Descript Audio Codec | RVQ codes → Mimi waveform reconstruction |
| Dialogue Support | Speaker tags `[S1]`, `[S2]` in text | Context segments tracking speaker turns |
| Nonverbal Sounds | Yes: `(laughs)`, `(coughs)`, `(sighs)` from text | Partial — possible with careful audio conditioning |
| Voice Cloning | Audio conditioning (supply reference audio) | Contextual prompting with reference audio |
| Streaming | No (Dia 1.6B); Yes in Dia2 | No native streaming; low-latency batch |
| torch.compile | Supported | Supported (CUDA graphs) |
| CPU Support | Not available in Dia 1.6B (planned) | Yes |
Installation and Hardware Requirements
Nari Dia 1.6B — Setup
The recommended path uses uv as the package manager:
```bash
git clone https://github.com/nari-labs/dia.git
cd dia
uv venv && source .venv/bin/activate
uv pip install -e .
python app.py  # launches the Gradio UI
```
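If you prefer to script generation rather than use the Gradio UI, the pattern below follows the example in the Dia README; treat it as a sketch and confirm the exact `Dia.from_pretrained` and `generate` signatures against the repository.

```python
import soundfile as sf
from dia.model import Dia

# Load the 1.6B checkpoint from Hugging Face (weights download on first run)
model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# One transcript, two speakers, a nonverbal cue in parentheses -- generated in a single pass
script = (
    "[S1] Welcome back to the show. We have a lot to cover today. "
    "[S2] Thanks for having me. (laughs) Glad to be here."
)

audio = model.generate(script)          # returns the waveform as a NumPy array
sf.write("dialogue.wav", audio, 44100)  # Dia outputs 44.1 kHz audio
```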
Hardware checklist:
- NVIDIA GPU with ≥8–10GB VRAM (RTX 3070 Ti, RTX 4070, A4000, or better)
- CUDA 12.6+ drivers
- PyTorch 2.0+, Python 3.8+
- On first run, the Descript Audio Codec is downloaded automatically (~300MB)
- Reference throughput: ~40 tokens/second on an A4000 (86 tokens ≈ 1 second of audio, so ~0.5× real-time on that card)
- CPU support is not yet available for Dia 1.6B; use the Hugging Face ZeroGPU Space for testing without a local GPU
Sesame CSM-1B — Setup
The fastest path uses the native Hugging Face Transformers integration (requires `transformers>=4.52.1`):

```bash
pip install "transformers>=4.52.1" torch torchaudio
```

```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor

model = CsmForConditionalGeneration.from_pretrained("sesame/csm-1b")
processor = AutoProcessor.from_pretrained("sesame/csm-1b")
model = model.to("cuda" if torch.cuda.is_available() else "cpu")
```
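From there, a single utterance can be generated and saved roughly as follows. This is a sketch based on the Transformers CSM documentation; verify the `output_audio` and `save_audio` helpers against your installed version.

```python
# "[0]" selects speaker id 0; the processor turns the tagged text into model inputs
text = "[0]Hello there, and thanks for joining the call today."
inputs = processor(text, add_special_tokens=True, return_tensors="pt").to(model.device)

# output_audio=True asks generate() to return decoded audio rather than raw codec tokens
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "hello.wav")
```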
Hardware checklist:
- NVIDIA GPU with ≥6GB VRAM recommended; CPU inference works but is slow
- Python 3.10+, PyTorch, torchaudio
- Hugging Face account required (model is gated — accept terms on the HF model page)
- Supports `torch.compile()` and CUDA graphs for optimization
Dia2 (Streaming Successor) — Setup
If streaming latency matters and you want the current Nari Labs offering:
```bash
pip install dia2
```

```python
from dia2 import Dia2, GenerationConfig, SamplingConfig

# Load the 2B streaming checkpoint on CUDA in bfloat16
model = Dia2.from_repo("nari-labs/Dia2-2B", device="cuda", dtype="bfloat16")

config = GenerationConfig(
    cfg_scale=2.0,
    audio=SamplingConfig(temperature=0.8, top_k=50),
    use_cuda_graph=True,
)

model.generate("[S1] Hello there, welcome to the show.", config=config, output_wav="out.wav")
```
Dia2 requires CUDA 12.8+ and defaults to bfloat16 precision. The 1B variant has lower VRAM requirements than the 2B; exact published figures are not available — verify on the Dia2-2B Hugging Face model card.
Performance and Benchmark Data (2026)
No single standardized benchmark covers both Dia 1.6B and CSM-1B together with comparable methodology, so the following draws from several independent evaluations. Treat MOS figures as indicative, not definitive — listener panels differ across studies.
TTS Arena V2 (Hugging Face Elo Rankings)
TTS Arena V2 pits models head-to-head in blind listening tests. As of early 2026, the open-source deployable models with published Elo scores include:
| Model | ELO (approx.) | Notes |
|---|---|---|
| Chatterbox (Resemble AI) | ~1502 | 350M params; beats ElevenLabs in blind tests |
| Kokoro v1.0 | ~1400 | 82M params; 44% win rate; CPU-friendly |
| StyleTTS 2 | 1369 | Style-transfer focused |
| Dia 1.6B | Not ranked on Arena V2 | Dialogue-first; better evaluated on multi-speaker tasks |
| CSM-1B | Not ranked on Arena V2 | Conversational model; Inferless ranked it top in "well-roundedness" |
The TTS Arena evaluates single-utterance naturalness, which disadvantages dialogue-specific models like Dia and CSM. Their real strength appears in multi-turn conversations and expressive narration, not single-sentence synthesis.
Inferless 12-Model Comparison
Inferless tested 12 open-source TTS models on an NVIDIA L4 (24GB VRAM) for synthesized speech quality and controllability. Key findings:
- CSM-1B and F5-TTS emerged as "the most well-rounded performers" in combined quality and controllability.
- Kokoro-82M was the fastest: sub-0.3 second processing for any text length tested.
- F5-TTS: sub-7 second processing for all tested inputs.
- Dia 1.6B was not in the Inferless 12-model cohort — it was released after their evaluation window for that batch.
Qualitative Observations from Practitioners
- Dia 1.6B: Practitioners consistently rank it highest for emotional depth in dialogue — laughter, sighs, and backchannel sounds from text cues are a capability no other model in this class matches natively.
- CSM-1B: Described by DigitalOcean as having "the most impressive TTS demonstration" for acoustic quality, though the open-source checkpoint underperforms relative to Sesame's proprietary Maya demo. Word error rate increases on longer generations compared to F5-TTS.
- F5-TTS (CC-BY-NC 4.0): DigitalOcean's reviewers called it their personal favorite for overall balance; note the non-commercial license restricts commercial use.
Latency Reference Points
| Model | Latency Profile | Source |
|---|---|---|
| Kokoro 82M | <0.3s per utterance; 96× real-time on cloud GPU | ocdevel benchmark 2025 |
| CSM-1B | Low-latency; CPU fallback available | Sesame HF model card |
| Dia 1.6B | ~0.5× real-time on A4000 (40 tokens/s; 86 tok = 1s audio) | Nari Labs HF model card |
| F5-TTS | <7s per generation (Inferless, L4 GPU) | Inferless 12-model study |
| Chatterbox-Turbo | <200ms | Resemble AI product page |
Key Features and Practical Differences
Nari Dia 1.6B
- Single-pass dialogue generation: Write a full transcript with `[S1]` and `[S2]` speaker tags; the model generates both voices in one inference call, with no stitching required. An illustrative transcript follows this list.
- Nonverbal sound synthesis: Type `(laughs)`, `(coughs)`, or `(clears throat)` anywhere in the transcript and the model produces the sound. This is unique among open-source models at this scale.
- Voice cloning via audio prefix: Provide a short reference audio clip to condition the speaker identity. Seed control enables reproducible outputs.
- Full local operation: No API calls required; data never leaves your machine.
- Community tooling: A Gradio UI ships with the repository, and a third-party Docker server (devnen/Dia-TTS-Server) is available.
- Successor path: Dia2 adds streaming and comes in 1B and 2B checkpoints, dropping the original 1.6B size point in favor of cleaner scaling.
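For reference, a Dia transcript that combines speaker tags with parenthesized nonverbal cues might look like the following (an illustrative script, not taken from the Dia documentation):

```text
[S1] Welcome back to the show. Today we are talking about open-source speech models.
[S2] Thanks for having me. (laughs) There is a lot to get through.
[S1] (clears throat) Let's start with hardware requirements.
```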
Sesame CSM-1B
- Conversational context modeling: Processes previous utterances as audio context segments, maintaining speaker identity and conversational rhythm across turns. This is its core differentiation from standard TTS models; see the sketch after this list.
- HF Transformers native: Since `transformers>=4.52.1`, CSM-1B works out of the box without custom inference code, which matters for teams already using the HF ecosystem.
- CPU inference: Runs on CPU (slowly), making it deployable on machines without a dedicated GPU. Useful for prototyping or very low-throughput use cases.
- Multi-speaker architecture: CSM's primary training objective is multi-speaker coherence across a full conversation, not just per-utterance quality.
- Fine-tuning path: Speechmatics published a comprehensive guide to fine-tuning CSM on new languages and voices; the model's Llama-based architecture makes it tractable for standard fine-tuning workflows.
- Ecosystem: Available via the DeepInfra hosted API if you want cloud inference without managing GPU infrastructure; a community OpenAI-compatible API wrapper (phildougherty/sesame_csm_openai) also exists.
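With the model and processor loaded as in the setup section, prior turns can be passed as context through the processor's chat template. The snippet below is a sketch following the pattern in the Transformers CSM documentation; the audio file paths are hypothetical, and keys such as `path` and the `output_audio` flag should be verified against your installed version.

```python
# Earlier turns supply both text and audio so the model can match speaker identity and rhythm
conversation = [
    {"role": "0", "content": [{"type": "text", "text": "How was the demo received?"},
                              {"type": "audio", "path": "turn_0.wav"}]},
    {"role": "1", "content": [{"type": "text", "text": "Honestly, better than we expected."},
                              {"type": "audio", "path": "turn_1.wav"}]},
    # The final turn is text-only -- this is the utterance CSM will synthesize
    {"role": "0", "content": [{"type": "text", "text": "Great, let's schedule the follow-up."}]},
]

inputs = processor.apply_chat_template(conversation, tokenize=True, return_dict=True).to(model.device)
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "turn_2.wav")
```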
Use Cases
Best Use Cases for Nari Dia 1.6B
- Audiobook narration with multiple characters and natural emotional delivery
- Podcast or YouTube dialogue synthesis from scripts
- Offline TTS pipelines where data privacy is mandatory
- Content that requires nonverbal vocalizations (laughter, sighs, hesitations)
- Custom fine-tuning for specific speaker voices or domain terminology
Best Use Cases for Sesame CSM-1B
- Real-time conversational AI assistants where turn history matters
- Customer service bots requiring contextually consistent speech
- Educational simulations with persistent speaker identity
- Prototyping on CPU-only machines or constrained VRAM environments
- Pipelines already using HF Transformers where integration friction matters
How to Choose — Decision Guide
Work through these questions in order:
- Do you need streaming output (audio starts before text is complete)? → Use Dia2 (not Dia 1.6B), or consider Chatterbox-Turbo (<200ms) or Kokoro for ultra-low latency.
- Do you need nonverbal sounds from text cues (laughter, coughing, sighs)? → Use Dia 1.6B. No other open-source model at this scale supports this natively.
- Do you need CPU inference or have <8GB VRAM? → Use CSM-1B (CPU supported) or Kokoro-82M (fastest, smallest).
- Is your pipeline already built on Hugging Face Transformers? → Use CSM-1B (`transformers>=4.52.1`, zero custom code).
- Do you need the highest MOS quality for single-utterance synthesis? → Consider Chatterbox (ELO ~1502, 63.75% preference over ElevenLabs in blind tests) or F5-TTS (note the CC-BY-NC license for commercial use).
- Do you need commercial-safe licensing? → Both Dia 1.6B and CSM-1B are Apache 2.0. F5-TTS is CC-BY-NC 4.0 (non-commercial only).
- Is data privacy and full offline operation mandatory? → Use Dia 1.6B or CSM-1B locally. Avoid DeepInfra or any cloud API endpoint.
If you plan to integrate TTS into a local AI agent pipeline, the OpenClaw + Ollama setup guide for running local AI agents covers how to wire TTS models into a local LLM pipeline — both CSM-1B and Dia work within that pattern.
The 2026 Open-Source TTS Ecosystem
Dia and CSM are no longer the only serious open-source TTS options. By April 2026, the field has expanded significantly:
| Model | Params | License | Standout Feature |
|---|---|---|---|
| Nari Dia 1.6B | 1.6B | Apache 2.0 | Nonverbal sounds, multi-speaker single-pass |
| Dia2-2B | 2B | Apache 2.0 | Streaming + Dia quality |
| Sesame CSM-1B | 1B | Apache 2.0 | Contextual speech, HF native, CPU support |
| Kokoro v1.0 | 82M | Apache 2.0 | Fastest: sub-0.3s, 96× real-time on GPU |
| Chatterbox-Turbo | 350M | Apache 2.0 | Best ELO for open source; sub-200ms |
| F5-TTS | ~300M | CC-BY-NC 4.0 | Best balance quality + controllability (non-commercial) |
| Orpheus TTS | 3B, 1B, 400M, 150M | Apache 2.0 | 100k hrs training data; guided emotion; streaming |
For teams building production voice features, Codersera's guide to the best free AI TTS models and the Orpheus 3B vs Kokoro comparison cover adjacent models in detail.
Common Pitfalls and Troubleshooting
Dia 1.6B
- First-run latency: The Descript Audio Codec downloads on first use (~300MB). Run once in advance if deploying in a time-sensitive environment.
- Out-of-memory errors: If you see CUDA OOM with less than 10GB VRAM, reduce batch size or use `torch.compile` carefully — it can sometimes increase peak memory. Try bfloat16 explicitly if not already set.
- Inconsistent voice identity: Without setting a seed or audio prefix, voice character varies between runs. Use `seed=` for reproducibility or supply a short audio prefix for consistent identity; a seed-pinning sketch follows this list.
- No CPU fallback: Do not attempt to run Dia 1.6B on CPU — it will fail or produce unusable output. Use the Hugging Face ZeroGPU Space or Dia2 (which has CPU fallback) instead.
- Nonverbal cues must be in parentheses: `(laughs)` works; `*laughs*` and `[laughs]` do not trigger the intended behavior.
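A low-risk way to pin voice identity between runs is to fix the random seeds before each generation. This is a generic PyTorch pattern, not a Dia-specific API; if your Dia version exposes its own seed argument, prefer that.

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Fix the Python, NumPy, and PyTorch RNGs so repeated generations start from the same state."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


set_seed(42)  # call before each model.generate() to make outputs reproducible
```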
Sesame CSM-1B
- Gated model: You must accept Sesame's usage terms on the Hugging Face model page before `from_pretrained` will succeed. Error messages from this are sometimes cryptic — check your HF login (`huggingface-cli login`) and model access first.
- Context segment preparation: CSM's quality advantage over generic TTS depends on correctly formatting the context segments (previous turns). Without them, it behaves like a basic TTS model and loses its main selling point.
- Word error rate on long generations: Multiple practitioner reports note CSM-1B has higher WER on longer text passages compared to F5-TTS. Split long content into segments of ≤100 words for best results; a chunking sketch follows this list.
- CPU inference is slow: CPU mode is viable for <30 word utterances with acceptable wait time. For anything longer, a GPU is strongly recommended.
- Transformers version: Ensure `transformers>=4.52.1`. Earlier versions will fail silently or use a different code path.
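Keeping generations under the ~100-word guideline only takes a small pre-processing step. The helper below is a hypothetical illustration (not part of the CSM API); in practice you would also want to split on sentence boundaries rather than raw word counts.

```python
def chunk_text(text: str, max_words: int = 100) -> list[str]:
    """Split text into chunks of at most max_words words, one chunk per synthesis call."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


# Example: a 250-word passage becomes three shorter synthesis requests (100 + 100 + 50 words)
passage = "word " * 250
for i, chunk in enumerate(chunk_text(passage)):
    print(i, len(chunk.split()))
```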
FAQ
Is Nari Dia 1.6B still the best open-source TTS in 2026?
For multi-speaker expressive dialogue with nonverbal sounds, yes — it remains uniquely capable. For general single-speaker TTS quality, Chatterbox and Kokoro now score higher on TTS Arena ELO benchmarks. Nari Labs has also released Dia2 (1B and 2B streaming variants) as the active development branch.
Can I run Dia 1.6B or CSM-1B without a GPU?
CSM-1B supports CPU inference — slow but functional. Dia 1.6B does not have CPU support as of April 2026 (planned on the roadmap). For CPU-first use, CSM-1B or Kokoro-82M are the practical options.
What replaced Dia 1.6B — should I use Dia2?
Dia2 (released November 2025) is the streaming successor from the same team. It comes in 1B and 2B variants and begins generating audio from the first few tokens. If latency is critical or you're building a real-time pipeline, Dia2 is the better choice. Dia 1.6B remains available and is still the correct choice if you need its specific nonverbal sound generation behavior and the Dia2 quality changes don't suit your use case.
Is Sesame CSM-1B commercial-use safe?
Yes — CSM-1B is Apache 2.0 licensed, which permits commercial use. Verify the current terms on the Sesame model card as Sesame may apply additional usage conditions for the model weights (the model is gated and requires accepting their terms).
How do Dia and CSM compare to ElevenLabs?
ElevenLabs Turbo v2.5 scores 4.8 MOS on CodeSOTA's TTS leaderboard, compared to Sesame CSM's 4.7 MOS (proprietary, non-open-source version). The open-source CSM-1B checkpoint falls short of that. Dia 1.6B is not MOS-ranked on the same leaderboard but exceeds CSM-1B and ElevenLabs in practitioner tests specifically for expressive dialogue. Open-source Chatterbox now matches or beats ElevenLabs Flash in blind tests (63.75% preference).
What is the minimum VRAM to run these models?
Dia 1.6B: approximately 8–10GB VRAM in bfloat16. CSM-1B: 6–8GB VRAM recommended (CPU fallback exists). Kokoro-82M runs on <2GB VRAM. If you're on a 6GB card, CSM-1B or Kokoro are the safer bets.
Can I fine-tune CSM-1B on custom voices?
Yes. Speechmatics published a detailed guide to fine-tuning CSM-1B on new datasets (new languages and voices). The Llama backbone makes standard supervised fine-tuning tractable with reasonable compute. Dia 1.6B also supports fine-tuning; its Apache 2.0 license allows modifying and distributing the weights.
Are there any ethical or legal concerns with voice cloning from these models?
Both models can clone voices from audio samples. Voice cloning without consent is illegal or heavily regulated in many jurisdictions (EU AI Act, various US state laws as of 2025). Nari Labs' and Sesame's usage terms explicitly prohibit non-consensual voice cloning. Apply your own legal review before deploying voice cloning in a product.
References and Further Reading
- Nari Labs Dia GitHub repository — source code, README, and issue tracker for Dia 1.6B
- Dia-1.6B Hugging Face model card — hardware specs, inference guide, ZeroGPU Space link
- Nari Labs Dia2 GitHub repository — streaming successor (1B and 2B variants)
- Sesame CSM-1B Hugging Face model card — architecture, license, usage instructions
- Hugging Face Transformers CSM documentation — native integration via `CsmForConditionalGeneration`
- Inferless 12-model TTS benchmark (2025) — controlled latency and quality comparison on NVIDIA L4
- Speechmatics: How to Fine-tune Sesame CSM on New Languages and Voices
- CodeSOTA Speech AI Benchmarks 2026 — MOS rankings for commercial and open-source TTS models