Sesame CSM 1B is a cutting-edge, open-source speech synthesis model optimized for local deployment. It enables lifelike voice generation and cloning with efficient VRAM usage, making it well suited to consumer GPUs like the RTX 4060 (8 GB VRAM). This guide covers installation, configuration, and advanced usage on Ubuntu. Start by updating the system and installing the prerequisites:
# Update system packages
sudo apt update && sudo apt upgrade -y
# Install Python and essential packages
sudo apt install python3 python3-pip python3-venv git -y
# Install NVIDIA drivers and CUDA for GPU acceleration
sudo apt install nvidia-driver-535 cuda-12-2 -y
Next, clone the repository and set up an isolated Python environment:
git clone https://github.com/sesame-ai/csm-1b.git
cd csm-1b
python3 -m venv venv
source venv/bin/activate
pip install torch torchaudio --extra-index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
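Before downloading the model weights, it is worth confirming that the CUDA build of PyTorch can actually see the GPU with a short Python check:
import torch

# Report whether the CUDA build of PyTorch detects the GPU and how much VRAM it has
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"VRAM: {vram_gb:.1f} GB")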
Next, download the pretrained model weights:
python scripts/download_models.py
Model weights are stored in ~/.cache/sesame by default. Create a test script:
# test_hello.py
from sesame import Synthesizer
synth = Synthesizer("sesame-1b")
audio = synth.generate("Hello from Sesame CSM 1B")
audio.save("output.wav")
Run the script:
python test_hello.py
Warnings about optional dependencies (such as librosa or numba) can be ignored initially.
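To verify the result from the terminal, torchaudio (installed earlier) can report the duration and sample rate of the generated file:
import torchaudio

# Load the generated file and print its length and sample rate
waveform, sample_rate = torchaudio.load("output.wav")
print(f"{waveform.shape[1] / sample_rate:.2f} s at {sample_rate} Hz")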
To clone a voice, place a .wav file of the target speaker in ./samples and run:
python scripts/clone_voice.py --text "Custom speech here" --reference samples/your_voice.wav
Pass --seed for reproducible output.
To keep inference within the RTX 4060's 8 GB of VRAM, the following optimizations help:

| Technique | Command/Setting | VRAM Reduction |
|---|---|---|
| FP16 precision | torch.set_float32_matmul_precision('medium') | 30% |
| Batch size reduction | --batch_size 1 | 20% |
| Gradient checkpointing | --use_checkpointing | 15% |
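Exactly where the precision setting belongs depends on your entry point; as a minimal sketch, assuming it is applied in a script before the synthesizer is created:
import torch
from sesame import Synthesizer

# Relax float32 matmul precision before loading the model, as suggested in the table above
torch.set_float32_matmul_precision('medium')

synth = Synthesizer("sesame-1b")
audio = synth.generate("Testing reduced-precision inference.")
audio.save("precision_test.wav")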
If dependency conflicts or blocked downloads come up, the following commands can help:
# Reinstall specific library versions
pip install Flask==2.0.3 PyMySQL==1.0.2 --force-reinstall
# Route traffic through a proxy if your network requires one
export http_proxy=http://proxy.example.com:80
export https_proxy=$http_proxy
Modify config.yaml to adjust settings:
voice:
  pitch_range: [60, 80]  # Adjust for tonal variation
  speed: 1.2             # 1.0 = default speed
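The same values can also be adjusted from Python if preferred; a sketch assuming the keys shown above and the PyYAML package:
import yaml

# Load config.yaml, tweak the voice settings, and write the file back
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["voice"]["speed"] = 1.1
cfg["voice"]["pitch_range"] = [55, 75]

with open("config.yaml", "w") as f:
    yaml.safe_dump(cfg, f)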
Expose endpoints using Flask:
from flask import Flask, request, Response
from sesame import Synthesizer

app = Flask(__name__)
synth = Synthesizer()  # Load the model once at startup rather than per request

@app.route('/synthesize', methods=['POST'])
def synthesize():
    text = request.json['text']
    audio = synth.generate(text)
    return Response(audio.to_bytes(), mimetype="audio/wav")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
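A quick way to exercise the endpoint, assuming the server above is running locally on Flask's default port 5000 and the requests package is available:
import requests

# POST text to the /synthesize endpoint and save the returned audio bytes
resp = requests.post(
    "http://localhost:5000/synthesize",
    json={"text": "Hello from the Sesame API."},
)
resp.raise_for_status()
with open("api_output.wav", "wb") as f:
    f.write(resp.content)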
The CSM repository also exposes the model directly through its generator module. Load it once on the best available device and generate speech with no conversational context:
from generator import load_csm_1b
import torchaudio
import torch

# Pick the best available device (Apple Silicon, NVIDIA GPU, or CPU fallback)
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

generator = load_csm_1b(device=device)

audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)
torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
CSM sounds more natural when it is given conversational context. Provide previous utterances as Segment objects (imported from the same generator module), pairing each transcript and speaker ID with reference audio resampled to the generator's sample rate:
from generator import Segment

speakers = [0, 1, 0, 0]
transcripts = [
    "Hey how are you doing.",
    "Pretty good, pretty good.",
    "I'm great.",
    "So happy to be speaking to you.",
]
audio_paths = [
    "utterance_0.wav",
    "utterance_1.wav",
    "utterance_2.wav",
    "utterance_3.wav",
]

def load_audio(audio_path):
    # Load a reference clip and resample it to the model's sample rate
    audio_tensor, sample_rate = torchaudio.load(audio_path)
    audio_tensor = torchaudio.functional.resample(
        audio_tensor.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
    )
    return audio_tensor

segments = [
    Segment(text=transcript, speaker=speaker, audio=load_audio(audio_path))
    for transcript, speaker, audio_path in zip(transcripts, speakers, audio_paths)
]

audio = generator.generate(
    text="Me too, this is some cool stuff huh?",
    speaker=1,
    context=segments,
    max_audio_length_ms=10_000,
)
torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
Sesame CSM 1B offers enterprise-grade voice synthesis on consumer hardware. By following this guide, users can deploy it on Ubuntu with GPU acceleration, troubleshoot common issues, and extend functionality through APIs or custom voice profiles.
Need expert guidance? Connect with a top Codersera professional today!