Sesame AI's CSM 1B is a state-of-the-art conversational speech model renowned for its ability to generate human-like voices. This guide provides a step-by-step walkthrough for running Sesame CSM 1B on a Mac, detailing prerequisites, installation procedures, and troubleshooting tips to ensure a seamless experience.
Sesame CSM 1B is part of a suite of AI models developed to enhance conversational experiences. It employs advanced deep learning techniques to synthesize speech that closely resembles human voices, making it ideal for applications such as audiobooks, voice assistants, and more.
Notably, CSM 1B generates residual vector quantization (RVQ) audio codes from text and audio inputs, using a Llama backbone and a specialized audio decoder that produces Mimi audio codes.
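To make the RVQ idea concrete, here is a toy sketch on single numbers rather than audio frames. Each stage quantizes whatever the previous stages left unexplained, so the code for a value is a list of indices, one per codebook. The codebook values and the function names `rvq_encode` and `rvq_decode` are made up for illustration and are not taken from Mimi or the CSM repository.

```python
# Toy residual vector quantization (RVQ) on scalars.
# The codebooks are illustrative values only, not Mimi's codebooks.

def rvq_encode(value, codebooks):
    """Return one code index per codebook, quantizing the leftover residual each time."""
    codes = []
    residual = value
    for codebook in codebooks:
        # Pick the entry closest to what is still unexplained.
        index = min(range(len(codebook)), key=lambda i: abs(residual - codebook[i]))
        codes.append(index)
        residual -= codebook[index]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct the value by summing the chosen entry from each codebook."""
    return sum(codebook[index] for index, codebook in zip(codes, codebooks))


codebooks = [
    [-1.0, 0.0, 1.0],        # coarse stage
    [-0.25, 0.0, 0.25],      # finer stage
    [-0.0625, 0.0, 0.0625],  # finest stage
]

codes = rvq_encode(0.8, codebooks)
approx = rvq_decode(codes, codebooks)
print(codes, approx)
```

With these codebooks, 0.8 is encoded as the indices `[2, 0, 2]` and decoded back to 0.8125. Adding more stages shrinks the reconstruction error, which is the same trade-off a real audio codec makes with vectors instead of scalars.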
Before proceeding, ensure your Mac meets the following specifications:
If Python isn't already installed, download it from the official Python website.
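Once Python is installed, a short script can report which interpreter and hardware you are working with. This is only a sketch: it prints what it finds and does not enforce any particular requirement. The function name `describe_environment` is hypothetical.

```python
import platform
import sys


def describe_environment():
    """Collect basic facts about the interpreter and machine."""
    return {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        "system": platform.system(),     # "Darwin" on macOS
        "machine": platform.machine(),   # "arm64" on Apple Silicon, "x86_64" on Intel
        "macos": platform.mac_ver()[0],  # empty string when not running on macOS
    }


info = describe_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```

On an Apple Silicon Mac the `machine` entry should read `arm64`, which is the case where the MPS backend is relevant.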
Create a virtual environment to manage dependencies:
python3 -m venv myenv
Activate the virtual environment:
source myenv/bin/activate
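To confirm the activation worked, you can ask Python itself. Inside a `venv` environment, `sys.prefix` points at the environment while `sys.base_prefix` points at the base installation. The helper name `in_virtualenv` is just an illustrative choice.

```python
import sys


def in_virtualenv():
    """True when the interpreter runs inside a venv-style virtual environment."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("virtual environment active:", in_virtualenv())
print("interpreter:", sys.executable)
```

If this prints `False`, the packages installed in the next steps would go into the system-wide Python rather than `myenv`.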
Install the necessary libraries using pip:
pip install torch torchvision torchaudio transformers
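After the install finishes, a quick check can tell you whether each package is visible to the interpreter without actually importing it (importing `torch` can take a few seconds). This is a minimal sketch using the standard library; `missing_modules` is a hypothetical helper name.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


required = ["torch", "torchvision", "torchaudio", "transformers"]
missing = missing_modules(required)
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All listed packages are importable.")
```

Any name printed under "Missing" can be passed to `pip install` again from inside the activated environment.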
If you don't have a Hugging Face account, create one and generate an access token for authentication.
Install the Hugging Face CLI:
pip install huggingface_hub
Log in to your Hugging Face account:
huggingface-cli login
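If you want to check whether a token is already configured, the sketch below looks in two places: the `HF_TOKEN` environment variable and a token file under the Hugging Face cache directory. The file location (`~/.cache/huggingface/token`, or `$HF_HOME/token` when `HF_HOME` is set) is an assumption about the library's defaults and may differ between versions. The function name `find_hf_token_source` is hypothetical.

```python
import os
from pathlib import Path


def find_hf_token_source(env=None, home=None):
    """Return a short description of where a token appears to be configured, or None."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    if env.get("HF_TOKEN"):
        return "HF_TOKEN environment variable"
    # Assumed default: <HF_HOME>/token, with HF_HOME falling back to ~/.cache/huggingface.
    hf_home = Path(env.get("HF_HOME", home / ".cache" / "huggingface"))
    token_file = hf_home / "token"
    if token_file.is_file() and token_file.read_text().strip():
        return f"token file at {token_file}"
    return None


source = find_hf_token_source()
print(source or "No token found; run `huggingface-cli login` first.")
```

The script never prints the token itself, only where it was found, so it is safe to run in a shared terminal.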
Navigate to your desired installation directory and clone the Sesame CSM repository using Git:
git clone https://github.com/SesameAILabs/csm.git
This repository contains the necessary code and instructions to run the model.
Navigate into the cloned repository:
cd csm
Install dependencies listed in the requirements.txt file:
pip install -r requirements.txt
Note: The triton package cannot be installed on Windows. Instead, use:
pip install triton-windows
Download Models:
Use the Hugging Face Hub to download the CSM 1B model checkpoint:
from huggingface_hub import hf_hub_download
model_path = hf_hub_download(repo_id="sesame/csm-1b", filename="ckpt.pt")
Ensure you have accepted the model's terms on Hugging Face and that your authentication token is correctly set.
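A failed or denied download can leave you with no file, or with an empty one. The sketch below checks the path you got back before you try to load it. `describe_checkpoint` is a hypothetical helper, and the path in the last line is a placeholder, not a real location.

```python
from pathlib import Path


def describe_checkpoint(path):
    """Return a human-readable summary of a downloaded checkpoint file."""
    file = Path(path)
    if not file.is_file():
        return f"{file} not found; the download may have failed or access was denied."
    size_mib = file.stat().st_size / (1024 * 1024)
    if size_mib == 0:
        return f"{file} is empty; try downloading again."
    return f"{file} looks present ({size_mib:.1f} MiB)."


# Placeholder path: pass the value returned by hf_hub_download instead.
print(describe_checkpoint("path_to_downloaded_ckpt.pt"))
```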
Create a Python script to load and run the model, making sure the generator is loaded with the correct device. For Apple Silicon Macs, the Metal Performance Shaders (MPS) backend is recommended for optimal performance. The example below writes its output to audio.wav:
import torch
from generator import load_csm_1b
import torchaudio
# Load model
model_path = "path_to_downloaded_ckpt.pt"
generator = load_csm_1b(model_path, "mps") # Use 'mps' for Apple Silicon Macs
# Generate audio
input_text = "Hello from Sesame."
audio = generator.generate(
    text=input_text,
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)
# Save audio
torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
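The `max_audio_length_ms` argument is expressed in milliseconds, while the saved tensor is measured in samples. The arithmetic connecting the two is sketched below, assuming the argument acts as an upper bound on output length. The 24,000 Hz rate is an example value only; read `generator.sample_rate` for the real one. Both function names are hypothetical.

```python
def max_samples(max_audio_length_ms, sample_rate):
    """Upper bound on the number of audio samples for a given length cap."""
    return sample_rate * max_audio_length_ms // 1000


def duration_seconds(num_samples, sample_rate):
    """Length of a clip in seconds."""
    return num_samples / sample_rate


# Assumed example rate; use generator.sample_rate in a real script.
rate = 24_000
print(max_samples(10_000, rate))       # 240000 samples for a 10-second cap
print(duration_seconds(60_000, rate))  # 2.5 seconds
```

In a real script, `duration_seconds(audio.shape[-1], generator.sample_rate)` would give the length of the clip that was actually generated, which can be shorter than the cap.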
Generate Speech with Context:
# Segment comes from the repository's generator module, alongside load_csm_1b
from generator import Segment

speakers = [0, 1, 0, 0]
transcripts = [
    "Hey how are you doing.",
    "Pretty good, pretty good.",
    "I'm great.",
    "So happy to be speaking to you.",
]
audio_paths = [
    "utterance_0.wav",
    "utterance_1.wav",
    "utterance_2.wav",
    "utterance_3.wav",
]

def load_audio(audio_path):
    # Load one utterance and resample it to the generator's sample rate
    audio_tensor, sample_rate = torchaudio.load(audio_path)
    audio_tensor = torchaudio.functional.resample(
        audio_tensor.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
    )
    return audio_tensor

# Pair each transcript with its speaker ID and recorded audio
segments = [
    Segment(text=transcript, speaker=speaker, audio=load_audio(audio_path))
    for transcript, speaker, audio_path in zip(transcripts, speakers, audio_paths)
]
audio = generator.generate(
    text="Me too, this is some cool stuff huh?",
    speaker=1,
    context=segments,
    max_audio_length_ms=10_000,
)
torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
Generate a Sentence:
from generator import load_csm_1b
import torchaudio
import torch

# Prefer MPS on Apple Silicon, then CUDA, then fall back to the CPU
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

generator = load_csm_1b(device=device)
audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)
torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
Sesame CSM 1B has received widespread acclaim for its natural-sounding speech and low-latency performance. Some users have described the model's speech as "impossible to distinguish from a human voice".
In blind tests, participants could not distinguish between CSM and real humans during short conversation snippets. However, longer dialogues still revealed some limitations, such as occasional unnatural pauses and audio artifacts.
| Error | Solution |
|---|---|
| MPS Compatibility Issues | Modify code to avoid unsupported operations or use a different backend. |
| Missing Dependencies | Install missing libraries using pip. |
| Hugging Face Access Issues | Ensure you have the necessary permissions and tokens. |
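For the MPS row in particular, "use a different backend" can be automated by trying devices in order and moving on when one fails. The sketch below shows the pattern in plain Python. `run_with_fallback` is a hypothetical helper and `fake_task` is a stand-in for your own function that loads the model and generates audio on a given device.

```python
def run_with_fallback(task, devices=("mps", "cpu")):
    """Call task(device) for each device in order; return (device, result) on first success."""
    errors = []
    for device in devices:
        try:
            return device, task(device)
        except (RuntimeError, NotImplementedError) as exc:
            # PyTorch typically raises NotImplementedError for operators missing on MPS.
            errors.append(f"{device}: {exc}")
    raise RuntimeError("All devices failed: " + "; ".join(errors))


# Hypothetical stand-in for loading the model and generating audio on a device.
def fake_task(device):
    if device == "mps":
        raise NotImplementedError("operator not implemented for MPS")
    return f"ran on {device}"


print(run_with_fallback(fake_task))
```

PyTorch also offers a lighter option: setting the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` before starting Python lets unsupported operators run on the CPU while the rest stays on MPS, usually at some cost in speed.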
Sesame plans to release key components of their research as open source under the Apache 2.0 license. In the coming months, they aim to scale up both model size and training scope, with plans to expand to over 20 languages.
The company is also focusing on integrating pre-trained language models and developing fully duplex-capable systems that can learn conversation dynamics like speaker transitions, pauses, and pacing directly from data.
Sesame CSM 1B represents a significant breakthrough in AI speech technology, offering high-quality speech generation with contextual understanding and real-time performance. By following the steps outlined in this guide, you can install and run Sesame CSM 1B locally.