Codersera

How to Run Sesame CSM 1B on Ubuntu: Step-by-Step Installation

Sesame CSM 1B is a cutting-edge, open-source speech synthesis model optimized for local deployment. It enables lifelike voice generation and cloning with efficient VRAM usage, making it ideal for users with consumer GPUs like the RTX 4060 (8GB VRAM). This guide covers installation, configuration, and advanced usage on Ubuntu systems to ensure a seamless deployment.

System Requirements

Hardware:

  • NVIDIA GPU with ≥8GB VRAM (RTX 4060 recommended)
  • 16GB RAM, 50GB disk space

Software:

  • Ubuntu 22.04 LTS or newer
  • Python 3.8+ and pip
  • CUDA 12.x and cuDNN 8.x
  • PyTorch 2.0+ with GPU support

Installation Steps

1. Install Prerequisites

# Update system packages
sudo apt update && sudo apt upgrade -y

# Install Python and essential packages
sudo apt install python3 python3-pip python3-venv git -y

# Install NVIDIA drivers and CUDA for GPU acceleration
sudo apt install nvidia-driver-535 cuda-12-2 -y
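Before moving on, it is worth confirming that the driver and CUDA toolkit actually landed on the PATH. A minimal standard-library sketch (it only checks that the tools are findable, not that the GPU itself works — run nvidia-smi directly for that):

```python
import shutil

def gpu_tooling_status(tools=("nvidia-smi", "nvcc")):
    """Report which NVIDIA command-line tools are discoverable on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}

print(gpu_tooling_status())
```

If either entry prints False, log out and back in (or reboot) so the new driver packages take effect.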

2. Clone the Repository

git clone https://github.com/sesame-ai/csm-1b.git
cd csm-1b

3. Set Up a Virtual Environment

python3 -m venv venv
source venv/bin/activate

4. Install Dependencies

pip install torch torchaudio --extra-index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
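A quick way to verify the install succeeded is to check that the key packages are importable from inside the virtual environment. A small standard-library sketch:

```python
import importlib.util

def missing_packages(names=("torch", "torchaudio")):
    """Return the subset of `names` that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# After the installs above succeed, this should print an empty list
print(missing_packages())
```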

Model Download & Initial Testing

1. Download Pretrained Models

python scripts/download_models.py
  • Models are cached in ~/.cache/sesame by default.
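To confirm the download completed, you can list what landed in the cache directory. A small sketch using only the standard library (the cache path is the default mentioned above):

```python
from pathlib import Path

def list_cached_models(cache_dir="~/.cache/sesame"):
    """List entries in the model cache directory (empty list if it doesn't exist yet)."""
    root = Path(cache_dir).expanduser()
    return sorted(p.name for p in root.iterdir()) if root.is_dir() else []

print(list_cached_models())
```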

2. Generate Test Audio

Create a test script:

# test_hello.py
from sesame import Synthesizer
synth = Synthesizer("sesame-1b")
audio = synth.generate("Hello from Sesame CSM 1B")
audio.save("output.wav")

Run the script:

python test_hello.py
  • Warnings about missing dependencies (e.g., librosa or numba) can be ignored initially.

Voice Cloning

1. Prepare Reference Audio

  • Save a clean 10-second .wav file of the target voice in ./samples.
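Cloning quality depends heavily on the reference clip, so it helps to sanity-check it before running the script. A sketch using the standard-library wave module (the 10-second target and tolerance are the guide's suggestion, not a hard limit of the model):

```python
import wave

def check_reference_clip(path, target_seconds=10.0, tolerance=5.0):
    """Verify a reference .wav is close to the suggested length; return its basic stats."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        seconds = wf.getnframes() / rate
    if abs(seconds - target_seconds) > tolerance:
        raise ValueError(f"clip is {seconds:.1f}s; aim for roughly {target_seconds:.0f}s")
    return {"seconds": round(seconds, 2), "sample_rate": rate}
```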

2. Run Cloning Script

python scripts/clone_voice.py --text "Custom speech here" --reference samples/your_voice.wav
  • Use --seed for reproducibility.
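If you want to clone several lines in a batch, it is convenient to build the command programmatically and hand it to subprocess. A sketch (the flags are the ones shown in the command above; adjust if the script differs):

```python
def build_clone_command(text, reference, seed=None, script="scripts/clone_voice.py"):
    """Assemble a clone_voice.py invocation as an argument list for subprocess.run."""
    cmd = ["python", script, "--text", text, "--reference", reference]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd
```

Pass the result to `subprocess.run(cmd, check=True)` inside a loop over your texts.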

Performance Optimization

Technique                 Command/Setting                                  VRAM Reduction
FP16 Precision            torch.set_float32_matmul_precision('medium')    30%
Batch Size Reduction      --batch_size 1                                   20%
Gradient Checkpointing    --use_checkpointing                              15%
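As a rough back-of-the-envelope, the table's savings can be treated as multiplicative. This is only an estimating sketch — real savings depend on the workload and rarely stack perfectly:

```python
def estimated_vram_after(base_gb, reductions):
    """Apply percentage savings multiplicatively to a baseline VRAM figure."""
    remaining = base_gb
    for r in reductions:
        remaining *= (1.0 - r)
    return round(remaining, 2)

# All three techniques applied to an 8 GB card:
print(estimated_vram_after(8.0, [0.30, 0.20, 0.15]))
```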

Troubleshooting

1. Boot Issues

  • Disable Secure Boot in the BIOS/UEFI settings.
  • Ensure the boot mode is set to UEFI, not Legacy.

2. Dependency Conflicts

# Reinstall specific library versions
pip install Flask==2.0.3 PyMySQL==1.0.2 --force-reinstall
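When diagnosing conflicts, it helps to check exactly which version of a package is installed before force-reinstalling. A small sketch using the standard-library importlib.metadata (Python 3.8+):

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

print(installed_version("Flask"))
```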

3. Proxy Setup for Enterprise Networks

export http_proxy=http://proxy.example.com:80
export https_proxy=$http_proxy

Advanced Configuration

1. Custom Voice Styles

Modify config.yaml to adjust settings:

voice:
  pitch_range: [60, 80]  # Adjust for tonal variation
  speed: 1.2             # 1.0 = default speed
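A malformed config can fail in confusing ways at load time, so it is worth validating the voice section up front. A sketch against the fields shown above — the exact bounds here are assumptions, not documented limits:

```python
def validate_voice_config(voice):
    """Sanity-check the `voice` section of config.yaml before launching."""
    lo, hi = voice["pitch_range"]
    if not lo < hi:
        raise ValueError("pitch_range must be [low, high] with low < high")
    if not 0.25 <= voice["speed"] <= 4.0:
        raise ValueError("speed is outside a plausible range")
    return voice

print(validate_voice_config({"pitch_range": [60, 80], "speed": 1.2}))
```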

2. API Integration

Expose endpoints using Flask:

from flask import Flask, request, Response
from sesame import Synthesizer

app = Flask(__name__)
synth = Synthesizer("sesame-1b")  # load the model once at startup, not per request

@app.route('/synthesize', methods=['POST'])
def synthesize():
    text = request.json['text']
    audio = synth.generate(text)
    return Response(audio.to_bytes(), mimetype='audio/wav')

Usage

  1. Generate Speech with Context:

from generator import load_csm_1b, Segment
import torch
import torchaudio

# Load the model once before building the conversation context
generator = load_csm_1b(device="cuda" if torch.cuda.is_available() else "cpu")

speakers = [0, 1, 0, 0]
transcripts = [
    "Hey how are you doing.",
    "Pretty good, pretty good.",
    "I'm great.",
    "So happy to be speaking to you.",
]
audio_paths = [
    "utterance_0.wav",
    "utterance_1.wav",
    "utterance_2.wav",
    "utterance_3.wav",
]

def load_audio(audio_path):
    audio_tensor, sample_rate = torchaudio.load(audio_path)
    audio_tensor = torchaudio.functional.resample(
        audio_tensor.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
    )
    return audio_tensor

segments = [
    Segment(text=transcript, speaker=speaker, audio=load_audio(audio_path))
    for transcript, speaker, audio_path in zip(transcripts, speakers, audio_paths)
]
audio = generator.generate(
    text="Me too, this is some cool stuff huh?",
    speaker=1,
    context=segments,
    max_audio_length_ms=10_000,
)

torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
  2. Generate a Sentence:

from generator import load_csm_1b
import torchaudio
import torch

if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

generator = load_csm_1b(device=device)

audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)

torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)

Conclusion

Sesame CSM 1B offers enterprise-grade voice synthesis on consumer hardware. By following this guide, users can deploy it on Ubuntu with GPU acceleration, troubleshoot common issues, and extend functionality through APIs or custom voice profiles.

Need expert guidance? Connect with a top Codersera professional today!
