Nari Dia 1.6B vs Sesame CSM-1B: Best Open-Source TTS in 2026?
Last updated April 2026 — refreshed for current model/tool versions.
Two open-source speech models released in early 2025 changed expectations for what runs locally: Nari Labs Dia 1.6B, a dialogue-first TTS that generates expressive multi-speaker audio in a single pass, and Sesame CSM-1B, a conversational speech model built on a Llama backbone with native Hugging Face Transformers support. This guide compares both on architecture, hardware requirements, 2026 benchmark data, and practical deployment — so you can pick the right tool without running both.
What changed in 2026 — key updates for readers of the 2025 version
- Dia 1.6B has a successor: Dia2. Nari Labs released Dia2 with 1B and 2B checkpoints. Dia2 uses a streaming architecture: it starts generating audio from the first few tokens rather than requiring the full text input. Both variants are Apache 2.0 licensed and available on Hugging Face (`nari-labs/Dia2-1B`, `nari-labs/Dia2-2B`). The original Dia 1.6B remains available and fully usable.
- CSM-1B is now natively integrated into Hugging Face Transformers (v4.52.1+, released May 2025). You can load it directly with `CsmForConditionalGeneration`; no custom inference code is required.
- CSM-1B's license is confirmed as Apache 2.0. The original post listed "open source" without specifying the license; it is Apache 2.0, matching Dia 1.6B.
- New competing models entered the field. Chatterbox (Resemble AI, 350M params) beats ElevenLabs in blind tests with 63.75% preference. Kokoro v1.0 holds a ~44% win rate on TTS Arena V2 at 82M params. F5-TTS and CSM-1B remain among the "most well-rounded" open-source performers per Inferless's 12-model benchmark.
- The TTS Arena ELO landscape. Among open-source deployable models, Chatterbox sits at ~1502 ELO on TTS Arena and Kokoro v1.0 at ~1400. Dia 1.6B and CSM-1B are not currently ranked on TTS Arena V2 (they target a different deployment mode than the Arena's single-utterance evaluation).
- VRAM note correction. The original post stated Dia needs 10GB VRAM. In practice, the 1.6B model requires approximately 8–10GB VRAM in bfloat16. The new Dia2-1B runs in less VRAM than the original 1.6B; exact requirements are not officially published (verify on the vendor page).
TL;DR — Which Model Should You Use?
| Criterion | Nari Dia 1.6B | Sesame CSM-1B |
|---|---|---|
| Best for | Expressive multi-speaker dialogue, audiobooks, podcasts | Real-time conversational agents, low-latency voice apps |
| Nonverbal sounds | Yes — from text cues like `(laughs)` | Partial — via contextual conditioning only |
| GPU requirement | ~8–10GB VRAM (bfloat16) | 6–8GB VRAM recommended; CPU fallback available |
| Streaming output | No (Dia 1.6B); Yes (Dia2) | No native streaming; low-latency generation |
| HF Transformers native | No (custom inference code) | Yes (v4.52.1+) |
| License | Apache 2.0 | Apache 2.0 |
| Voice cloning | Audio conditioning | Contextual prompting |
| Successor model | Dia2 (1B/2B, streaming) | None announced as of April 2026 |
Model Overview
Nari Dia 1.6B
- Developer: Nari Labs (small team: 1 full-time, 1 part-time engineer as of 2025)
- Parameters: 1.6 billion
- Released: April 22, 2025
- License: Apache 2.0
- Architecture: Encoder → Transformer (1.6B) → audio frame generator → Descript Audio Codec decoder
- Language: English only
- Hardware: ~8–10GB VRAM in bfloat16; CUDA 12.6+; PyTorch 2.0+
- Successor: Dia2 (1B and 2B streaming variants, released November 2025)
Sesame CSM-1B
- Developer: Sesame AI Labs
- Parameters: ~1 billion (Llama-style backbone transformer plus a smaller audio-decoder transformer)
- Released: March 13, 2025 (Hugging Face); February 27, 2025 (initial release)
- License: Apache 2.0
- Architecture: Llama backbone + Mimi audio codec decoder, producing RVQ audio codes
- Language: English primary; limited non-English capacity
- Hardware: 6–8GB VRAM recommended; CPU inference supported
- HF Transformers: Native support since v4.52.1 (May 20, 2025) via `CsmForConditionalGeneration`
Technical Architecture
| Feature | Nari Dia 1.6B | Sesame CSM-1B |
|---|---|---|
| Model Size | 1.6B parameters | 1B parameters |
| Core Technology | TTS-optimized language model + DAC decoder | Llama backbone + Mimi audio codec (RVQ) |
| Input Modalities | Text + optional audio prefix | Text + optional audio context segments |
| Output Format | Direct waveform via Descript Audio Codec | RVQ codes → Mimi waveform reconstruction |
| Dialogue Support | Speaker tags `[S1]`, `[S2]` in text | Context segments tracking speaker turns |
| Nonverbal Sounds | Yes: `(laughs)`, `(coughs)`, `(sighs)` from text | Partial — possible with careful audio conditioning |
| Voice Cloning | Audio conditioning (supply reference audio) | Contextual prompting with reference audio |
| Streaming | No (Dia 1.6B); Yes in Dia2 | No native streaming; low-latency batch |
| torch.compile | Supported | Supported (CUDA graphs) |
| CPU Support | Not available in Dia 1.6B (planned) | Yes |
Installation and Hardware Requirements
Nari Dia 1.6B — Setup
The recommended path uses uv as the package manager:
```bash
git clone https://github.com/nari-labs/dia.git
cd dia
uv venv && source .venv/bin/activate
uv pip install -e .
python app.py  # launches the Gradio UI
```
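If you prefer to script generation rather than use the Gradio UI, the pattern below follows the example in the Dia README; treat it as a sketch and confirm the exact `Dia.from_pretrained` and `generate` signatures against the repository.

```python
import soundfile as sf
from dia.model import Dia

# Load the 1.6B checkpoint from Hugging Face (weights download on first run)
model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# One transcript, two speakers, a nonverbal cue in parentheses -- generated in a single pass
script = (
    "[S1] Welcome back to the show. We have a lot to cover today. "
    "[S2] Thanks for having me. (laughs) Glad to be here."
)

audio = model.generate(script)          # returns the waveform as a NumPy array
sf.write("dialogue.wav", audio, 44100)  # Dia outputs 44.1 kHz audio
```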
Hardware checklist:
- NVIDIA GPU with ≥8–10GB VRAM (RTX 3070 Ti, RTX 4070, A4000, or better)
- CUDA 12.6+ drivers
- PyTorch 2.0+, Python 3.8+
- On first run, the Descript Audio Codec is downloaded automatically (~300MB)
- Reference throughput: ~40 tokens/second on an A4000 (86 tokens ≈ 1 second of audio, so ~0.5× real-time on that card)
- CPU support is not yet available for Dia 1.6B; use the Hugging Face ZeroGPU Space for testing without a local GPU
Sesame CSM-1B — Setup
The fastest path uses the native Hugging Face Transformers integration (requires `transformers>=4.52.1`):

```bash
pip install "transformers>=4.52.1" torch torchaudio
```

```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor

model = CsmForConditionalGeneration.from_pretrained("sesame/csm-1b")
processor = AutoProcessor.from_pretrained("sesame/csm-1b")
model = model.to("cuda" if torch.cuda.is_available() else "cpu")
```
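From there, a single utterance can be generated and saved roughly as follows. This is a sketch based on the Transformers CSM documentation; verify the `output_audio` and `save_audio` helpers against your installed version.

```python
# "[0]" selects speaker id 0; the processor turns the tagged text into model inputs
text = "[0]Hello there, and thanks for joining the call today."
inputs = processor(text, add_special_tokens=True, return_tensors="pt").to(model.device)

# output_audio=True asks generate() to return decoded audio rather than raw codec tokens
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "hello.wav")
```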
Hardware checklist:
- NVIDIA GPU with ≥6GB VRAM recommended; CPU inference works but is slow
- Python 3.10+, PyTorch, torchaudio
- Hugging Face account required (model is gated — accept terms on the HF model page)
- Supports `torch.compile()` and CUDA graphs for optimization
Dia2 (Streaming Successor) — Setup
If streaming latency matters and you want the current Nari Labs offering:
```bash
pip install dia2
```

```python
from dia2 import Dia2, GenerationConfig, SamplingConfig

# Load the 2B streaming checkpoint on CUDA in bfloat16
model = Dia2.from_repo("nari-labs/Dia2-2B", device="cuda", dtype="bfloat16")

config = GenerationConfig(
    cfg_scale=2.0,
    audio=SamplingConfig(temperature=0.8, top_k=50),
    use_cuda_graph=True,
)

model.generate("[S1] Hello there, welcome to the show.", config=config, output_wav="out.wav")
```
Dia2 requires CUDA 12.8+ and defaults to bfloat16 precision. The 1B variant has lower VRAM requirements than the 2B; exact published figures are not available — verify on the Dia2-2B Hugging Face model card.
Performance and Benchmark Data (2026)
No single standardized benchmark covers both Dia 1.6B and CSM-1B together with comparable methodology, so the following draws from several independent evaluations. Treat MOS figures as indicative, not definitive — listener panels differ across studies.
TTS Arena V2 (Hugging Face Elo Rankings)
TTS Arena V2 pits models head-to-head in blind listening tests. As of early 2026, the open-source deployable models with published Elo scores include:
| Model | ELO (approx.) | Notes |
|---|---|---|
| Chatterbox (Resemble AI) | ~1502 | 350M params; beats ElevenLabs in blind tests |
| Kokoro v1.0 | ~1400 | 82M params; 44% win rate; CPU-friendly |
| StyleTTS 2 | 1369 | Style-transfer focused |
| Dia 1.6B | Not ranked on Arena V2 | Dialogue-first; better evaluated on multi-speaker tasks |
| CSM-1B | Not ranked on Arena V2 | Conversational model; Inferless ranked it top in "well-roundedness" |
The TTS Arena evaluates single-utterance naturalness, which disadvantages dialogue-specific models like Dia and CSM. Their real strength appears in multi-turn conversations and expressive narration, not single-sentence synthesis.
Inferless 12-Model Comparison
Inferless tested 12 open-source TTS models on an NVIDIA L4 (24GB VRAM) for synthesized speech quality and controllability. Key findings:
- CSM-1B and F5-TTS emerged as "the most well-rounded performers" in combined quality and controllability.
- Kokoro-82M was the fastest: sub-0.3 second processing for any text length tested.
- F5-TTS: sub-7 second processing for all tested inputs.
- Dia 1.6B was not in the Inferless 12-model cohort — it was released after their evaluation window for that batch.
Qualitative Observations from Practitioners
- Dia 1.6B: Practitioners consistently rank it highest for emotional depth in dialogue — laughter, sighs, and backchannel sounds from text cues are a capability no other model in this class matches natively.
- CSM-1B: Described by DigitalOcean as having "the most impressive TTS demonstration" for acoustic quality, though the open-source checkpoint underperforms relative to Sesame's proprietary Maya demo. Word error rate increases on longer generations compared to F5-TTS.
- F5-TTS (CC-BY-NC 4.0): DigitalOcean's reviewers called it their personal favorite for overall balance; note the non-commercial license restricts commercial use.
Latency Reference Points
| Model | Latency Profile | Source |
|---|---|---|
| Kokoro 82M | <0.3s per utterance; 96× real-time on cloud GPU | ocdevel benchmark 2025 |
| CSM-1B | Low-latency; CPU fallback available | Sesame HF model card |
| Dia 1.6B | ~0.5× real-time on A4000 (40 tokens/s; 86 tok = 1s audio) | Nari Labs HF model card |
| F5-TTS | <7s per generation (Inferless, L4 GPU) | Inferless 12-model study |
| Chatterbox-Turbo | <200ms | Resemble AI product page |
Key Features and Practical Differences
Nari Dia 1.6B
- Single-pass dialogue generation: Write a full transcript with `[S1]` and `[S2]` speaker tags; the model generates both voices in one inference call, with no stitching required. An illustrative transcript follows this list.
- Nonverbal sound synthesis: Type `(laughs)`, `(coughs)`, or `(clears throat)` anywhere in the transcript and the model produces the sound. This is unique among open-source models at this scale.
- Voice cloning via audio prefix: Provide a short reference audio clip to condition the speaker identity. Seed control enables reproducible outputs.
- Full local operation: No API calls required; data never leaves your machine.
- Community tooling: A Gradio UI ships with the repository, and a third-party Docker server (devnen/Dia-TTS-Server) is available.
- Successor path: Dia2 adds streaming and comes in 1B and 2B checkpoints, dropping the original 1.6B size point in favor of cleaner scaling.
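For reference, a Dia transcript that combines speaker tags with parenthesized nonverbal cues might look like the following (an illustrative script, not taken from the Dia documentation):

```text
[S1] Welcome back to the show. Today we are talking about open-source speech models.
[S2] Thanks for having me. (laughs) There is a lot to get through.
[S1] (clears throat) Let's start with hardware requirements.
```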
Sesame CSM-1B
- Conversational context modeling: Processes previous utterances as audio context segments, maintaining speaker identity and conversational rhythm across turns. This is its core differentiation from standard TTS models; see the sketch after this list.
- HF Transformers native: Since `transformers>=4.52.1`, CSM-1B works out of the box without custom inference code, which matters for teams already using the HF ecosystem.
- CPU inference: Runs on CPU (slowly), making it deployable on machines without a dedicated GPU. Useful for prototyping or very low-throughput use cases.
- Multi-speaker architecture: CSM's primary training objective is multi-speaker coherence across a full conversation, not just per-utterance quality.
- Fine-tuning path: Speechmatics published a comprehensive guide to fine-tuning CSM on new languages and voices; the model's Llama-based architecture makes it tractable for standard fine-tuning workflows.
- Ecosystem: Available via the DeepInfra hosted API if you want cloud inference without managing GPU infrastructure; a community OpenAI-compatible API wrapper (phildougherty/sesame_csm_openai) also exists.
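With the model and processor loaded as in the setup section, prior turns can be passed as context through the processor's chat template. The snippet below is a sketch following the pattern in the Transformers CSM documentation; the audio file paths are hypothetical, and keys such as `path` and the `output_audio` flag should be verified against your installed version.

```python
# Earlier turns supply both text and audio so the model can match speaker identity and rhythm
conversation = [
    {"role": "0", "content": [{"type": "text", "text": "How was the demo received?"},
                              {"type": "audio", "path": "turn_0.wav"}]},
    {"role": "1", "content": [{"type": "text", "text": "Honestly, better than we expected."},
                              {"type": "audio", "path": "turn_1.wav"}]},
    # The final turn is text-only -- this is the utterance CSM will synthesize
    {"role": "0", "content": [{"type": "text", "text": "Great, let's schedule the follow-up."}]},
]

inputs = processor.apply_chat_template(conversation, tokenize=True, return_dict=True).to(model.device)
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "turn_2.wav")
```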
Use Cases
Best Use Cases for Nari Dia 1.6B
- Audiobook narration with multiple characters and natural emotional delivery
- Podcast or YouTube dialogue synthesis from scripts
- Offline TTS pipelines where data privacy is mandatory
- Content that requires nonverbal vocalizations (laughter, sighs, hesitations)
- Custom fine-tuning for specific speaker voices or domain terminology
Best Use Cases for Sesame CSM-1B
- Real-time conversational AI assistants where turn history matters
- Customer service bots requiring contextually consistent speech
- Educational simulations with persistent speaker identity
- Prototyping on CPU-only machines or constrained VRAM environments
- Pipelines already using HF Transformers where integration friction matters
How to Choose — Decision Guide
Work through these questions in order:
- Do you need streaming output (audio starts before text is complete)? → Use Dia2 (not Dia 1.6B), or consider Chatterbox-Turbo (<200ms) or Kokoro for ultra-low latency.
- Do you need nonverbal sounds from text cues (laughter, coughing, sighs)? → Use Dia 1.6B. No other open-source model at this scale supports this natively.
- Do you need CPU inference or have <8GB VRAM? → Use CSM-1B (CPU supported) or Kokoro-82M (fastest, smallest).
- Is your pipeline already built on Hugging Face Transformers? → Use CSM-1B (`transformers>=4.52.1`, zero custom code).
- Do you need the highest MOS quality for single-utterance synthesis? → Consider Chatterbox (ELO ~1502, 63.75% preference over ElevenLabs in blind tests) or F5-TTS (note the CC-BY-NC license for commercial use).
- Do you need commercial-safe licensing? → Both Dia 1.6B and CSM-1B are Apache 2.0. F5-TTS is CC-BY-NC 4.0 (non-commercial only).
- Is data privacy and full offline operation mandatory? → Use Dia 1.6B or CSM-1B locally. Avoid DeepInfra or any cloud API endpoint.
If you plan to integrate TTS into a local AI agent pipeline, the OpenClaw + Ollama setup guide for running local AI agents covers how to wire TTS models into a local LLM pipeline — both CSM-1B and Dia work within that pattern.
The 2026 Open-Source TTS Ecosystem
Dia and CSM are no longer the only serious open-source TTS options. By April 2026, the field has expanded significantly:
| Model | Params | License | Standout Feature |
|---|---|---|---|
| Nari Dia 1.6B | 1.6B | Apache 2.0 | Nonverbal sounds, multi-speaker single-pass |
| Dia2-2B | 2B | Apache 2.0 | Streaming + Dia quality |
| Sesame CSM-1B | 1B | Apache 2.0 | Contextual speech, HF native, CPU support |
| Kokoro v1.0 | 82M | Apache 2.0 | Fastest: sub-0.3s, 96× real-time on GPU |
| Chatterbox-Turbo | 350M | Apache 2.0 | Best ELO for open source; sub-200ms |
| F5-TTS | ~300M | CC-BY-NC 4.0 | Best balance quality + controllability (non-commercial) |
| Orpheus TTS | 3B, 1B, 400M, 150M | Apache 2.0 | 100k hrs training data; guided emotion; streaming |
For teams building production voice features, Codersera's guide to the best free AI TTS models and the Orpheus 3B vs Kokoro comparison cover adjacent models in detail.
Common Pitfalls and Troubleshooting
Dia 1.6B
- First-run latency: The Descript Audio Codec downloads on first use (~300MB). Run once in advance if deploying in a time-sensitive environment.
- Out-of-memory errors: If you see CUDA OOM with less than 10GB VRAM, reduce batch size or use `torch.compile` carefully — it can sometimes increase peak memory. Try bfloat16 explicitly if not already set.
- Inconsistent voice identity: Without setting a seed or audio prefix, voice character varies between runs. Use `seed=` for reproducibility or supply a short audio prefix for consistent identity; a seed-pinning sketch follows this list.
- No CPU fallback: Do not attempt to run Dia 1.6B on CPU — it will fail or produce unusable output. Use the Hugging Face ZeroGPU Space or Dia2 (which has CPU fallback) instead.
- Nonverbal cues must be in parentheses: `(laughs)` works; `*laughs*` and `[laughs]` do not trigger the intended behavior.
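A low-risk way to pin voice identity between runs is to fix the random seeds before each generation. This is a generic PyTorch pattern, not a Dia-specific API; if your Dia version exposes its own seed argument, prefer that.

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Fix the Python, NumPy, and PyTorch RNGs so repeated generations start from the same state."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


set_seed(42)  # call before each model.generate() to make outputs reproducible
```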
Sesame CSM-1B
- Gated model: You must accept Sesame's usage terms on the Hugging Face model page before `from_pretrained` will succeed. Error messages from this are sometimes cryptic — check your HF login (`huggingface-cli login`) and model access first.
- Context segment preparation: CSM's quality advantage over generic TTS depends on correctly formatting the context segments (previous turns). Without them, it behaves like a basic TTS model and loses its main selling point.
- Word error rate on long generations: Multiple practitioner reports note CSM-1B has higher WER on longer text passages compared to F5-TTS. Split long content into segments of ≤100 words for best results; a chunking sketch follows this list.
- CPU inference is slow: CPU mode is viable for <30 word utterances with acceptable wait time. For anything longer, a GPU is strongly recommended.
- Transformers version: Ensure `transformers>=4.52.1`. Earlier versions will fail silently or use a different code path.
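Keeping generations under the ~100-word guideline only takes a small pre-processing step. The helper below is a hypothetical illustration (not part of the CSM API); in practice you would also want to split on sentence boundaries rather than raw word counts.

```python
def chunk_text(text: str, max_words: int = 100) -> list[str]:
    """Split text into chunks of at most max_words words, one chunk per synthesis call."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


# Example: a 250-word passage becomes three shorter synthesis requests (100 + 100 + 50 words)
passage = "word " * 250
for i, chunk in enumerate(chunk_text(passage)):
    print(i, len(chunk.split()))
```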
FAQ
Is Nari Dia 1.6B still the best open-source TTS in 2026?
For multi-speaker expressive dialogue with nonverbal sounds, yes — it remains uniquely capable. For general single-speaker TTS quality, Chatterbox and Kokoro now score higher on TTS Arena ELO benchmarks. Nari Labs has also released Dia2 (1B and 2B streaming variants) as the active development branch.
Can I run Dia 1.6B or CSM-1B without a GPU?
CSM-1B supports CPU inference — slow but functional. Dia 1.6B does not have CPU support as of April 2026 (planned on the roadmap). For CPU-first use, CSM-1B or Kokoro-82M are the practical options.
What replaced Dia 1.6B — should I use Dia2?
Dia2 (released November 2025) is the streaming successor from the same team. It comes in 1B and 2B variants and begins generating audio from the first few tokens. If latency is critical or you're building a real-time pipeline, Dia2 is the better choice. Dia 1.6B remains available and is still the correct choice if you need its specific nonverbal sound generation behavior and the Dia2 quality changes don't suit your use case.
Is Sesame CSM-1B commercial-use safe?
Yes — CSM-1B is Apache 2.0 licensed, which permits commercial use. Verify the current terms on the Sesame model card as Sesame may apply additional usage conditions for the model weights (the model is gated and requires accepting their terms).
How do Dia and CSM compare to ElevenLabs?
ElevenLabs Turbo v2.5 scores 4.8 MOS on CodeSOTA's TTS leaderboard, compared to Sesame CSM's 4.7 MOS (proprietary, non-open-source version). The open-source CSM-1B checkpoint falls short of that. Dia 1.6B is not MOS-ranked on the same leaderboard but exceeds CSM-1B and ElevenLabs in practitioner tests specifically for expressive dialogue. Open-source Chatterbox now matches or beats ElevenLabs Flash in blind tests (63.75% preference).
What is the minimum VRAM to run these models?
Dia 1.6B: approximately 8–10GB VRAM in bfloat16. CSM-1B: 6–8GB VRAM recommended (CPU fallback exists). Kokoro-82M runs on <2GB VRAM. If you're on a 6GB card, CSM-1B or Kokoro are the safer bets.
Can I fine-tune CSM-1B on custom voices?
Yes. Speechmatics published a detailed guide to fine-tuning CSM-1B on new datasets (new languages and voices). The Llama backbone makes standard supervised fine-tuning tractable with reasonable compute. Dia 1.6B also supports fine-tuning; its Apache 2.0 license allows modifying and distributing the weights.
Are there any ethical or legal concerns with voice cloning from these models?
Both models can clone voices from audio samples. Voice cloning without consent is illegal or heavily regulated in many jurisdictions (EU AI Act, various US state laws as of 2025). Nari Labs' and Sesame's usage terms explicitly prohibit non-consensual voice cloning. Apply your own legal review before deploying voice cloning in a product.
References and Further Reading
- Nari Labs Dia GitHub repository — source code, README, and issue tracker for Dia 1.6B
- Dia-1.6B Hugging Face model card — hardware specs, inference guide, ZeroGPU Space link
- Nari Labs Dia2 GitHub repository — streaming successor (1B and 2B variants)
- Sesame CSM-1B Hugging Face model card — architecture, license, usage instructions
- Hugging Face Transformers CSM documentation — native integration via `CsmForConditionalGeneration`
- Inferless 12-model TTS benchmark (2025) — controlled latency and quality comparison on NVIDIA L4
- Speechmatics: How to Fine-tune Sesame CSM on New Languages and Voices
- CodeSOTA Speech AI Benchmarks 2026 — MOS rankings for commercial and open-source TTS models