Sesame AI's CSM 1B is a state-of-the-art conversational speech model renowned for its ability to generate human-like voices. This guide provides a step-by-step walkthrough for running Sesame CSM 1B on a Mac, detailing prerequisites, installation procedures, and troubleshooting tips to ensure a seamless experience.
Sesame CSM 1B is part of a suite of AI models developed to enhance conversational experiences. It employs advanced deep learning techniques to synthesize speech that closely resembles human voices, making it ideal for applications such as audiobooks, voice assistants, and more.
Notably, CSM 1B generates residual vector quantization (RVQ) audio codes from text and audio inputs, using a Llama backbone and a specialized audio decoder to produce Mimi audio codes.
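To build intuition for what RVQ codes are, here is a toy sketch of residual vector quantization in plain Python: each stage picks the nearest codebook entry, and the next stage quantizes whatever error remains. The two tiny codebooks are made up for illustration and have nothing to do with Mimi's actual codec.

```python
import math

def nearest(codebook, v):
    # Index of the codebook entry closest to v (Euclidean distance)
    return min(range(len(codebook)), key=lambda i: math.dist(codebook[i], v))

def rvq_encode(x, codebooks):
    # Each stage quantizes the residual left over by the previous stage
    codes, residual = [], list(x)
    for cb in codebooks:
        i = nearest(cb, residual)
        codes.append(i)
        residual = [r - c for r, c in zip(residual, cb[i])]
    return codes

def rvq_decode(codes, codebooks):
    # Reconstruction is the sum of the selected entries across stages
    out = [0.0] * len(codebooks[0][0])
    for cb, i in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, cb[i])]
    return out

codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],      # coarse first stage
    [[0.0, 0.0], [0.25, -0.25]],   # finer stage for the residual
]
codes = rvq_encode([1.2, 0.8], codebooks)   # -> [1, 1]
approx = rvq_decode(codes, codebooks)       # -> [1.25, 0.75]
```

Stacking many such stages lets a codec represent audio frames as a short sequence of small integer codes, which is what CSM's decoder predicts.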
Before proceeding, ensure your Mac meets the basic requirements: a recent version of macOS (Apple Silicon is recommended so you can use the MPS backend), a recent Python 3 installation, and several gigabytes of free disk space for the model checkpoint.
If Python isn't already installed, download it from the official Python website.
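You can quickly confirm which interpreter is on your PATH before creating the environment (check the repository's README for the exact minimum version it requires):

```shell
# Print the Python version that `python3` resolves to
python3 --version
```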
Create a virtual environment to manage dependencies:
python3 -m venv myenv
Activate the virtual environment:
source myenv/bin/activate
Install the necessary libraries using pip:
pip install torch torchvision torchaudio transformers
Install the Hugging Face CLI:
pip install huggingface_hub
If you don't have a Hugging Face account, create one and generate an access token for authentication. Then log in:
huggingface-cli login
Navigate to your desired installation directory and clone the Sesame CSM repository using Git:
git clone https://github.com/SesameAILabs/csm.git
This repository contains the code and instructions needed to run the model.
Navigate into the cloned repository:
cd csm
Install the dependencies listed in the requirements.txt file:
pip install -r requirements.txt
Note: The triton package cannot be installed on Windows; Windows users should instead run:
pip install triton-windows
Download the Model:
Use the Hugging Face Hub to download the CSM 1B model checkpoint:
from huggingface_hub import hf_hub_download
model_path = hf_hub_download(repo_id="sesame/csm-1b", filename="ckpt.pt")
Ensure you have accepted the model's terms on Hugging Face and that your authentication token is set correctly.
Make sure the generator is loaded on the correct device. For Apple Silicon Macs, the Metal Performance Shaders (MPS) backend is recommended for the best performance. Create a Python script to load and run the model. Here's an example:
import torch
import torchaudio
from generator import load_csm_1b

# Load the model
model_path = "path_to_downloaded_ckpt.pt"
generator = load_csm_1b(model_path, "mps")  # Use "mps" on Apple Silicon Macs

# Generate audio
input_text = "Hello from Sesame."
audio = generator.generate(
    text=input_text,
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)

# Save the audio to a WAV file
torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
Generate Speech with Context:
from generator import Segment  # Segment bundles text, speaker, and audio context

speakers = [0, 1, 0, 0]
transcripts = [
    "Hey how are you doing.",
    "Pretty good, pretty good.",
    "I'm great.",
    "So happy to be speaking to you.",
]
audio_paths = [
    "utterance_0.wav",
    "utterance_1.wav",
    "utterance_2.wav",
    "utterance_3.wav",
]

def load_audio(audio_path):
    # Load a clip and resample it to the model's sample rate
    audio_tensor, sample_rate = torchaudio.load(audio_path)
    audio_tensor = torchaudio.functional.resample(
        audio_tensor.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
    )
    return audio_tensor

segments = [
    Segment(text=transcript, speaker=speaker, audio=load_audio(audio_path))
    for transcript, speaker, audio_path in zip(transcripts, speakers, audio_paths)
]

audio = generator.generate(
    text="Me too, this is some cool stuff huh?",
    speaker=1,
    context=segments,
    max_audio_length_ms=10_000,
)
torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
Generate a Sentence:
from generator import load_csm_1b
import torchaudio
import torch

# Pick the best available device: MPS on Apple Silicon, CUDA on NVIDIA GPUs, else CPU
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

generator = load_csm_1b(device=device)

audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)
torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
Sesame CSM 1B has received widespread acclaim for its natural-sounding speech and low-latency performance. Users have reported that the model's speech sounds so natural that it is "impossible to distinguish from a human voice".
In blind tests, participants could not distinguish between CSM and real humans during short conversation snippets. However, longer dialogues still revealed some limitations, such as occasional unnatural pauses and audio artifacts.
| Error | Solution |
|---|---|
| MPS Compatibility Issues | Modify code to avoid unsupported operations, or use a different backend. |
| Missing Dependencies | Install the missing libraries using pip. |
| Hugging Face Access Issues | Ensure you have the necessary permissions and a valid access token. |
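For MPS compatibility issues specifically, PyTorch provides an environment variable that falls back to the CPU for operators the MPS backend does not yet support. Set it before launching your script; it trades some speed for compatibility:

```shell
# Fall back to CPU for operators unsupported on MPS
export PYTORCH_ENABLE_MPS_FALLBACK=1
```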
Sesame plans to release key components of their research as open source under the Apache 2.0 license. In the coming months, they aim to scale up both model size and training scope, with plans to expand to over 20 languages.
The company is also focusing on integrating pre-trained language models and developing fully duplex-capable systems that can learn conversation dynamics like speaker transitions, pauses, and pacing directly from data.
Sesame CSM 1B represents a significant breakthrough in AI speech technology, offering high-quality speech generation with contextual understanding and real-time performance. By following the steps outlined in this guide, you can install and run Sesame CSM 1B locally.