Codersera

Running Zonos TTS on Windows: Multilingual Local Installation

Zonos-TTS, a recent offering from ZyphraAI, is a fully open-source, multilingual text-to-speech (TTS) model that supports real-time voice cloning and is commercially usable under the Apache 2.0 License.

Trained on more than 200,000 hours of varied multilingual speech, Zonos-TTS delivers strong performance: in ZyphraAI's tests on an RTX 4090 graphics card, the model generates audio at roughly twice real-time speed.

What is Zonos-TTS?

Zonos-TTS is a text-to-speech model designed to generate natural-sounding speech from text prompts using a speaker embedding or audio prefix. It allows for high-fidelity voice cloning from a short reference clip of 10 to 30 seconds and supports conditioning on speaking rate, pitch variation, audio quality, and emotion. The model supports multiple languages, including English, Japanese, Chinese, French, and German, and outputs speech natively at 44 kHz.

Key Features of Zonos-TTS:

  • Zero-shot TTS with voice cloning: Generates high-quality TTS output using a 10-30 second speaker sample.
  • Audio prefix inputs: Enhances speaker matching by adding text plus an audio prefix, which can elicit behaviors like whispering.
  • Multilingual support: Supports English, Japanese, Chinese, French, and German.
  • Audio quality and emotion control: Offers fine-grained control over speaking rate, pitch, and emotions like happiness, anger, sadness, and fear.
  • Fast performance: Runs with a real-time factor of approximately 2x on an RTX 4090.
  • WebUI Gradio interface: Comes with an easy-to-use Gradio interface for generating speech.
  • Simple installation and deployment: Can be installed easily using the provided Dockerfile.
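
The "2x real-time" figure in the list above is simple arithmetic: seconds of audio produced divided by seconds of wall-clock time spent generating. The sketch below illustrates that calculation; the function name and the example numbers are hypothetical, not measurements from ZyphraAI.

```python
def realtime_speedup(audio_seconds: float, generation_seconds: float) -> float:
    """Return how many seconds of audio are produced per second of compute."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds


# Hypothetical numbers: a 10-second clip generated in 5 seconds of wall-clock time.
speedup = realtime_speedup(audio_seconds=10.0, generation_seconds=5.0)
print(f"{speedup:.1f}x real-time")  # prints "2.0x real-time"
```

A value above 1.0 means speech is generated faster than it plays back, which is what makes streaming-style use practical.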

Installation Methods

There are two primary methods to install Zonos-TTS on Windows:

  1. Using Docker – Recommended for users who prefer a straightforward, containerized approach.
  2. DIY Installation – A manual method that provides more control over the environment setup.

Why Choose Zonos-TTS? 💡

Feature        | Zonos-TTS                      | Other TTS Tools
Speed          | 2x real-time                   | Often slower
Voice Cloning  | Short samples (10-30 seconds)  | Typically 1 min+
Audio Quality  | 44 kHz output                  | Usually 16-24 kHz
Languages      | 5 supported                    | Often 1-2
Commercial Use | Allowed (Apache 2.0)           | Many restrict usage

System Requirements 🖥️

Minimum:

  • OS: Windows 10/11 64-bit
  • RAM: 8GB+
  • GPU: NVIDIA GTX 1660 (6GB VRAM)
  • Storage: 10GB free space

Recommended:

  • GPU: RTX 3060 (12GB VRAM) or better
  • RAM: 16GB+
  • Python 3.10+
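
Some of these requirements can be checked from Python before installing anything. The following is a minimal sketch using only the standard library; the function name and dictionary keys are illustrative, and it does not cover RAM or GPU memory, which the standard library cannot query portably.

```python
import platform
import shutil
import sys

MIN_PYTHON = (3, 10)  # Python version from the recommended list above
MIN_FREE_GB = 10      # free storage from the minimum list above


def check_system(path: str = ".") -> dict:
    """Collect basic facts about this machine and compare them to the guide's thresholds."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return {
        "os": platform.system(),  # "Windows" on Windows 10/11
        "is_64bit": sys.maxsize > 2**32,
        "python_ok": sys.version_info[:2] >= MIN_PYTHON,
        "free_gb": round(free_gb, 1),
        "storage_ok": free_gb >= MIN_FREE_GB,
    }


for name, value in check_system().items():
    print(f"{name}: {value}")
```

Pass the drive or folder where you plan to clone the repository as `path` so the free-space figure reflects the right disk.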

Method 1: Docker Installation ⚙️

Step 1: Install Docker Desktop

Step 2: Launch PowerShell as Admin and Clone the Repository

git clone https://github.com/Zyphra/Zonos
cd Zonos

Step 3: Start Container

docker compose up

Step 4: Access Web Interface
Open http://localhost:7860 in your browser.
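
If the page does not load, it can help to confirm that something is actually listening on the port. This is a small sketch using Python's standard socket module; the function name is illustrative, and the default port simply mirrors the URL above.

```python
import socket


def port_is_open(host: str = "localhost", port: int = 7860, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if port_is_open():
    print("The web interface is reachable at http://localhost:7860")
else:
    print("Nothing is listening on port 7860 yet; the container may still be starting.")
```

The first start can take a while because the model weights have to be downloaded, so a negative result right after launching the container is not necessarily an error.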

Alternatively, build and run the Docker image for development:

docker build -t zonos .
docker run -it --gpus=all --net=host -v /path/to/Zonos:/Zonos -t zonos
# The next two commands run inside the container
cd /Zonos
python3 sample.py  # Generates sample.wav

Replace /path/to/Zonos with your actual directory path.

Method 2: Manual Installation 🔧

Step 1: Install Dependencies

  1. Install Python 3.10+.
  2. Install Git:

winget install --id Git.Git

  3. Install eSpeak-NG via Chocolatey:

choco install espeak-ng
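
Before moving on, you may want to confirm that each tool is reachable from the command line. The sketch below uses `shutil.which` from the standard library; the command names are assumptions (in particular, that eSpeak NG installs an executable called `espeak-ng`), and the function name is illustrative.

```python
import shutil

# Assumed command names; "espeak-ng" is the executable eSpeak NG normally installs.
REQUIRED_TOOLS = ("python", "git", "espeak-ng")


def find_tools(names=REQUIRED_TOOLS) -> dict:
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


for tool, path in find_tools().items():
    print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

A tool reported as not found usually means the installer did not update PATH or the terminal was opened before installation; reopening PowerShell is often enough.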

Step 2: Set Up Python Environment

git clone https://github.com/Zyphra/Zonos
cd Zonos
python -m venv zonos-env
.\zonos-env\Scripts\activate
pip install -e .  # installs Zonos and its dependencies from the repository's pyproject.toml

Step 3: Verify Installation

python sample.py
# Output: sample.wav created

Usage Examples

Using the Gradio Interface

  1. Open http://localhost:7860.
  2. Input text in the provided box.
  3. Upload a 10-30 second audio sample for voice cloning.
  4. Adjust parameters like speaking rate, pitch, and emotion.
  5. Click "Generate" to produce speech.
  6. Download the generated audio.
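
The 10-30 second guideline in step 3 can be checked before uploading a clip. The following is a minimal sketch of that arithmetic; the constants mirror the range given in this guide, and the function names and example clip are hypothetical.

```python
MIN_CLIP_SECONDS = 10.0  # lower bound suggested in this guide
MAX_CLIP_SECONDS = 30.0  # upper bound suggested in this guide


def clip_duration_seconds(num_samples: int, sample_rate: int) -> float:
    """Duration of a clip given its per-channel sample count and sample rate in Hz."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return num_samples / sample_rate


def clip_is_suitable(num_samples: int, sample_rate: int) -> bool:
    """True if the clip length falls inside the 10-30 second window."""
    duration = clip_duration_seconds(num_samples, sample_rate)
    return MIN_CLIP_SECONDS <= duration <= MAX_CLIP_SECONDS


# Hypothetical clip: 661,500 samples at 44,100 Hz is exactly 15 seconds.
print(clip_duration_seconds(661_500, 44_100))  # prints 15.0
print(clip_is_suitable(661_500, 44_100))       # prints True
```

When a clip is loaded with `torchaudio.load`, the returned tensor has shape (channels, samples), so its last dimension can be passed as `num_samples` together with the returned sample rate.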

Using Python Code

import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

# Load the pretrained transformer checkpoint onto the GPU in bfloat16
model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")
model.bfloat16()

# Build a speaker embedding from a reference clip of the voice to clone
wav, sampling_rate = torchaudio.load("./exampleaudio.mp3")
spk_embedding = model.embed_spk_audio(wav, sampling_rate)

torch.manual_seed(421)  # fixed seed for reproducible sampling

cond_dict = make_cond_dict(
    text="Hello, world!",
    speaker=spk_embedding.to(torch.bfloat16),
    language="en-us",
)

conditioning = model.prepare_conditioning(cond_dict)
codes = model.generate(conditioning)

# decode() returns a batch; save the first item, since torchaudio.save expects (channels, samples)
wavs = model.autoencoder.decode(codes).cpu()
torchaudio.save("sample.wav", wavs[0], model.autoencoder.sampling_rate)

Tips and Troubleshooting

Issue                | Solution
CUDA Out of Memory   | Reduce batch size in config.yml
eSpeak Not Found     | Add C:\Program Files\eSpeak NG to PATH
Gradio Port Conflict | Free port 7860, or change the port the Gradio app listens on (docker compose up has no --port option)
Slow Generation      | Enable GPU in Docker Desktop Settings

  • Ensure GPU Support: Verify that PyTorch is using your GPU.
  • Check Dependencies: Resolve any version conflicts.
  • File Paths: Ensure file paths are correct.
  • Memory Issues: Reduce batch size or use a smaller model if needed.
  • Docker Issues: Verify Docker Desktop is running correctly.
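
For the GPU check in the first tip, a short script can report what PyTorch sees. This is a sketch rather than an official diagnostic: the function name and dictionary keys are illustrative, and it falls back gracefully when PyTorch is not installed.

```python
def gpu_report() -> dict:
    """Summarize whether PyTorch is installed and can see a CUDA GPU."""
    empty = {"cuda_available": False, "device": None, "vram_gb": None}
    try:
        import torch
    except ImportError:
        return {"torch_installed": False, **empty}

    if not torch.cuda.is_available():
        return {"torch_installed": True, **empty}

    props = torch.cuda.get_device_properties(0)  # first visible GPU
    return {
        "torch_installed": True,
        "cuda_available": True,
        "device": props.name,
        "vram_gb": round(props.total_memory / 1024**3, 1),
    }


print(gpu_report())
```

If `cuda_available` is False on a machine with an NVIDIA card, the usual suspects are a CPU-only PyTorch build or an outdated GPU driver.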

Alternatives to Zonos-TTS

If Zonos-TTS does not meet your needs, consider these alternatives: 🔄

  1. StyleTTS 2
    • Pros: Better for emotional speech
    • Cons: No commercial license
  2. Tortoise-TTS
    • Pros: More voice presets
    • Cons: Slower generation
  3. Microsoft Azure TTS
    • Pros: Enterprise support
    • Cons: Monthly costs

Conclusion

Zonos-TTS is a significant advancement in open-source TTS technology, providing high-quality voice cloning and multilingual support. Whether using Docker or manual installation, this guide equips you with the steps to get Zonos-TTS running on your Windows machine.
