Zonos-TTS is an open-source, multilingual, real-time text-to-speech (TTS) model that offers high expressiveness and voice cloning capabilities. Released by ZyphraAI under the Apache 2.0 license, Zonos-TTS supports features like real-time voice cloning, audio prefix input, and fine control over speech attributes such as rate, pitch, and emotion.
This guide provides a step-by-step method to install and run Zonos-TTS locally on an Ubuntu system.
Zonos-TTS uses deep learning to generate natural-sounding speech from text, and incorporates speaker embeddings and audio prefix conditioning to improve voice fidelity. Notable features include:
Real-time voice cloning from a short reference clip
Audio prefix input for conditioning generation
Fine-grained control over speech attributes such as rate, pitch, and emotion
Multilingual synthesis
Ensure your Ubuntu system meets the basic requirements: a recent Ubuntu release, Python 3, git, and an NVIDIA GPU with working CUDA drivers, since the examples below load the model with device="cuda".
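Before installing, a short script like the one below (an illustrative sketch, not part of Zonos) can confirm the basic tools are present:

```python
import shutil
import sys

def check_prerequisites():
    """Return a list of human-readable problems; an empty list means the basics look OK."""
    problems = []
    if sys.version_info < (3, 10):
        # Assumption: a modern Python 3; uv-based workflows target recent releases.
        problems.append(f"Python 3.10+ recommended, found {sys.version.split()[0]}")
    for tool in ("git", "espeak-ng", "nvidia-smi"):
        if shutil.which(tool) is None:
            problems.append(f"'{tool}' not found on PATH")
    return problems

if __name__ == "__main__":
    for p in check_prerequisites():
        print("WARNING:", p)
```

Warnings here don't block installation, but resolving them first avoids confusing failures later.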
You can install Zonos-TTS via Docker or a manual (DIY) installation.
Docker simplifies dependency management and deployment.
Steps:
Install Docker & Docker Compose:
sudo apt update
sudo apt install docker.io docker-compose
sudo systemctl start docker
sudo systemctl enable docker
Clone the Zonos Repository:
git clone https://github.com/Zyphra/Zonos.git
cd Zonos
Run Docker Compose:
docker compose up
Generate Sample Audio:
python3 sample.py
For manual installation, follow these steps:
Clone the Zonos Repository:
git clone https://github.com/Zyphra/Zonos.git
cd Zonos
Install eSpeak:
sudo apt install espeak-ng
Install Python Dependencies:
python3 -m pip install --upgrade uv
uv venv
source .venv/bin/activate
uv sync --no-group main
uv sync
Generate Sample Audio:
python3 sample.py
Once installed, use Python to generate speech:
import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict
# Load the model
model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-hybrid", device="cuda")
model.bfloat16()
# Load example audio for voice cloning
wav, sampling_rate = torchaudio.load("./exampleaudio.mp3")
spk_embedding = model.embed_spk_audio(wav, sampling_rate)
torch.manual_seed(421)
# Define conditioning parameters
cond_dict = make_cond_dict(
text="Hello, world!",
speaker=spk_embedding.to(torch.bfloat16),
language="en-us",
)
# Prepare conditioning and generate speech
conditioning = model.prepare_conditioning(cond_dict)
codes = model.generate(conditioning)
wavs = model.autoencoder.decode(codes).cpu()
# Save the generated audio (decode returns a batch, so take the first clip)
torchaudio.save("sample.wav", wavs[0], model.autoencoder.sampling_rate)
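For longer inputs, it is common to synthesize text sentence by sentence and concatenate the clips. A minimal, purely illustrative splitter (not part of the Zonos API) could look like this:

```python
import re

def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Split text on sentence boundaries, packing sentences into chunks
    of at most max_chars characters (a single long sentence is kept whole)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be passed through make_cond_dict and model.generate exactly as in the example above, and the resulting waveforms concatenated.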
Zonos provides two models: the hybrid model loaded above and a transformer model.
Transformer Model: Use this for higher fidelity, at the cost of increased computational demand:
model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")
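The choice of variant can be kept in one place with a small helper (hypothetical; the model IDs are the ones used in this guide):

```python
# Model IDs as used elsewhere in this guide.
MODELS = {
    "hybrid": "Zyphra/Zonos-v0.1-hybrid",
    "transformer": "Zyphra/Zonos-v0.1-transformer",
}

def model_id(variant: str = "hybrid") -> str:
    """Map a short variant name to its Hugging Face model ID."""
    try:
        return MODELS[variant]
    except KeyError:
        raise ValueError(f"unknown variant {variant!r}; choose from {sorted(MODELS)}")
```

Loading then becomes, for example, Zonos.from_pretrained(model_id("transformer"), device="cuda").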
Adjust the language parameter for speech synthesis in different languages:
cond_dict = make_cond_dict(
text="Bonjour le monde!",
speaker=spk_embedding.to(torch.bfloat16),
language="fr-fr",
)
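Language tags follow the style seen above (en-us, fr-fr). A tiny validator (illustrative only, using an assumed subset of codes) can catch typos before they reach the model:

```python
# Assumed subset of language tags; the two used in this guide are included.
KNOWN_LANGUAGES = {"en-us", "en-gb", "fr-fr", "de", "es", "it", "ja", "ko"}

def normalize_language(tag: str) -> str:
    """Trim and lowercase the tag, then verify it is in the known subset."""
    tag = tag.strip().lower()
    if tag not in KNOWN_LANGUAGES:
        raise ValueError(f"unrecognized language tag: {tag!r}")
    return tag
```

The real set of supported languages is determined by Zonos and espeak-ng, so the set above would need to be filled in from the installed version.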
Fine-tune output speech by modifying expressive parameters:
cond_dict = make_cond_dict(
text="I am very happy!",
speaker=spk_embedding.to(torch.bfloat16),
language="en-us",
emotion="happiness",
speaking_rate=1.2,
pitch_variation=0.1,
)
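Expressive settings are easiest to manage through a small builder that keeps values within bounds. The parameter names mirror the snippet above; the numeric ranges are assumptions chosen for illustration:

```python
def expressive_settings(emotion="happiness", speaking_rate=1.0, pitch_variation=0.0):
    """Return keyword arguments for make_cond_dict, clamping numeric values
    to assumed safe ranges (0.5-2.0x rate, 0.0-1.0 pitch variation)."""
    return {
        "emotion": emotion,
        "speaking_rate": min(max(speaking_rate, 0.5), 2.0),
        "pitch_variation": min(max(pitch_variation, 0.0), 1.0),
    }
```

These keyword arguments can then be passed to make_cond_dict alongside text, speaker, and language.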
If you hit missing-package errors, re-running uv sync inside the repository usually resolves them.
Zonos-TTS can be used for content creation such as voiceovers and narration, accessibility tools such as screen readers, and speech-synthesis research.
Engage with the Zonos-TTS ecosystem via the project's GitHub repository, where issues and discussions are tracked.
Zonos-TTS is a powerful open-source TTS model, offering multilingual support and expressive voice synthesis. Whether using Docker for quick deployment or DIY installation for greater control, this guide helps set up and run Zonos-TTS efficiently on Ubuntu. Its applications range from content creation to accessibility and research, making it a versatile tool for real-time voice synthesis.
Need expert guidance? Connect with a top Codersera professional today!