LLaSA (LLaMA-based Speech Synthesis) is a text-to-speech (TTS) system that extends the text-based LLaMA language model by incorporating speech tokens. LLaSA models come in different sizes, such as 1B, 3B, and 8B.
This article focuses on running the LLaSA TTS 3B model on Ubuntu, providing a comprehensive guide covering installation, setup, and usage.
LLaSA (LLaMA-based Speech Synthesis) is a cutting-edge text-to-speech system built on Meta's LLaMA architecture. The 3B parameter version offers:
Before installation, ensure your system meets the necessary requirements, as running LLaSA TTS 3B can be resource-intensive, particularly when loading additional models like Whisper for transcription.
Component | Minimum Spec | Recommended Spec |
---|---|---|
GPU (NVIDIA) | 6GB VRAM (4-bit) | 12GB+ VRAM (FP16) |
RAM | 16GB | 32GB |
Storage | 50GB HDD | 100GB NVMe SSD |
OS | Ubuntu 20.04 LTS | Ubuntu 22.04 LTS |
CUDA Version | 11.7 | 12.1 |
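As a rough pre-flight check, the thresholds in the table can be turned into a small script. This is a minimal sketch and not part of the repository: the function names are made up for illustration, the VRAM query assumes `nvidia-smi` is on the PATH, and the cut-offs are simply copied from the table above.

```python
import shutil
import subprocess
from typing import Optional


def pick_precision(vram_gb: float) -> Optional[str]:
    """Map available VRAM to a loading mode using the table's thresholds."""
    if vram_gb >= 12:
        return "fp16"   # recommended spec
    if vram_gb >= 6:
        return "4-bit"  # minimum spec
    return None         # below the minimum spec


def gpu_vram_gib() -> Optional[float]:
    """Total memory of the first NVIDIA GPU in GiB, or None if it cannot be read."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        return float(out.splitlines()[0]) / 1024  # nvidia-smi reports MiB
    except (subprocess.CalledProcessError, IndexError, ValueError, OSError):
        return None


def free_disk_gb(path: str = ".") -> float:
    """Free disk space at `path` in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


if __name__ == "__main__":
    vram = gpu_vram_gib()
    mode = pick_precision(vram) if vram is not None else "no NVIDIA GPU found"
    print("VRAM (GiB):", vram, "->", mode)
    print("Free disk (GB):", round(free_disk_gb(), 1), "(table minimum: 50)")
```

Treat the result as a hint only: actual memory use also depends on any extra models loaded alongside LLaSA, such as Whisper.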
Key Notes:
Start by cloning the local-llasa-tts repository from GitHub, which contains the necessary scripts and files.
git clone https://github.com/nivibilla/local-llasa-tts.git
cd local-llasa-tts
Install the required Python packages using pip:
pip install -r ./requirements_base.txt
pip install -r ./requirements_native_hf.txt
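After the install finishes, it can help to confirm that the key packages are importable from the Python environment you plan to use. The sketch below uses only the standard library. The package list is an assumption made for illustration; the authoritative list is whatever the repository's requirements files contain.

```python
import importlib.util

# Assumed package names, for illustration only. Check the requirements_*.txt
# files in the repository for the real list.
ASSUMED_PACKAGES = ["torch", "transformers", "gradio", "soundfile"]


def missing_packages(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(ASSUMED_PACKAGES)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

If anything is reported missing, the usual cause is running `pip` and `python` from different environments.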
If you wish to use llama.cpp for inference, follow these steps:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
LLAMA_CURL=1 make LLAMA_CUDA=1
Download a GGUF version of the LLaSA TTS 3B model, available on Hugging Face, and run inference:
./llama-cli --hf-repo srinivasbilla/llasa-3b-Q4_K_M-GGUF --hf-file llasa-3b-q4_k_m.gguf -p "The meaning of life and the universe is"
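If you want to drive the same command from Python, for example inside a batch script, a thin `subprocess` wrapper is enough. This sketch only reuses the flags shown in the command above (`--hf-repo`, `--hf-file`, `-p`); the function names and the default binary path are assumptions for illustration.

```python
import shlex
import subprocess


def build_llama_cli_cmd(prompt, binary="./llama-cli",
                        repo="srinivasbilla/llasa-3b-Q4_K_M-GGUF",
                        gguf_file="llasa-3b-q4_k_m.gguf"):
    """Build the argument list for the llama-cli call shown above."""
    return [binary, "--hf-repo", repo, "--hf-file", gguf_file, "-p", prompt]


def run_llama_cli(prompt, **kwargs):
    """Run llama-cli and return its stdout; raises CalledProcessError on failure."""
    cmd = build_llama_cli_cmd(prompt, **kwargs)
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


if __name__ == "__main__":
    # Print the command rather than running it, so this is safe without a build.
    print(shlex.join(build_llama_cli_cmd("The meaning of life and the universe is")))
```

Passing the arguments as a list avoids shell-quoting problems when the prompt contains quotes or other special characters.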
Launch the Gradio web interface for interacting with the model:
python ./hf_app.py
For long texts, use vLLM for efficient inference:
pip install vllm
jupyter notebook llasa_vllm_longtext_inference.ipynb
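Long inputs are commonly split into sentence-sized chunks that are synthesized one at a time and then concatenated. How the notebook chunks text is not described here, so the following is only a sketch of one common approach; the function name and the 200-character default are assumptions.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as its own chunk
    rather than being cut mid-sentence.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # current chunk is full, start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in split_into_chunks("One. Two. Three.", max_chars=9):
        print(repr(chunk))
```

Splitting at sentence boundaries keeps each chunk prosodically complete, which tends to matter more for speech than for plain text generation.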
Access the web interface (usually http://localhost:7860), enter text, select a voice, and generate speech.
Generate speech from text using the Transformers library:
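from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Load the tokenizer and the 3B model from Hugging Face.
model_name = "HKUSTAudio/Llasa-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
text = "Hello, this is a test of the LLaSA TTS model."
inputs = tokenizer(text, return_tensors="pt")
# generate() returns token IDs, not a waveform. The generated speech tokens
# still have to be decoded to audio by the model's codec (see the model card).
with torch.no_grad():
    output_ids = model.generate(**inputs)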
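The `generate` call above yields token IDs rather than audio. The Hugging Face model card describes speech tokens written in the form `<|s_N|>`, which a separate codec model (XCodec2) turns into a waveform. Assuming that token format, helpers for converting between token strings and integer codec IDs might look like the sketch below. The function names are illustrative, and the exact format should be verified against the model card before you rely on it.

```python
import re

# Assumed token format: "<|s_123|>" carries codec ID 123.
_SPEECH_TOKEN = re.compile(r"<\|s_(\d+)\|>")


def ids_to_speech_tokens(speech_ids):
    """Render integer codec IDs as speech-token strings, e.g. 12 -> '<|s_12|>'."""
    return [f"<|s_{int(i)}|>" for i in speech_ids]


def extract_speech_ids(token_strings):
    """Parse '<|s_N|>' strings back into integers, skipping anything else."""
    ids = []
    for token in token_strings:
        match = _SPEECH_TOKEN.fullmatch(token)
        if match:
            ids.append(int(match.group(1)))
    return ids


if __name__ == "__main__":
    tokens = ids_to_speech_tokens([12, 7, 0])
    print(tokens, "->", extract_speech_ids(tokens))
```

The integer IDs recovered this way are what you would hand to the codec's decoder to obtain audio samples.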
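LLaSA TTS supports voice cloning from a few seconds of reference audio. The outline below is illustrative rather than runnable as written: `voice_encoder_model` is a placeholder that is never defined, and `voice_embedding` is not a standard argument of the Transformers `generate()` method, so consult the model card for the supported prompting workflow. Example:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import soundfile as sf
model_name = "HKUSTAudio/Llasa-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Load the reference clip of the voice to clone.
audio_path = "path/to/audio_sample.wav"
audio, sr = sf.read(audio_path)
# Placeholder: voice_encoder_model is not defined in this snippet and must be
# supplied by you.
voice_embedding = voice_encoder_model.encode(audio, sr)
text = "Hello, I am a cloned voice."
inputs = tokenizer(text, return_tensors="pt")
inputs["voice_embedding"] = torch.tensor(voice_embedding).unsqueeze(0)
with torch.no_grad():
    speech = model.generate(**inputs)
# Assumes `speech` is a waveform tensor; generate() normally returns token IDs
# that must be decoded to audio first.
sf.write("cloned_voice_speech.wav", speech.cpu().numpy(), sr)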
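Whatever cloning workflow you use, a quick check of the reference clip before loading the model can save a failed run. The sketch below uses only the standard-library `wave` module, so it handles uncompressed PCM WAV files only. The 16 kHz target comes from the troubleshooting advice in this article; the mono requirement and the 3 to 15 second window are assumptions chosen for illustration.

```python
import wave


def wav_summary(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return rate, wf.getnchannels(), wf.getnframes() / rate


def check_reference_clip(path, expected_rate=16000,
                         min_seconds=3.0, max_seconds=15.0):
    """Return a list of problems found; an empty list means the clip looks fine."""
    rate, channels, duration = wav_summary(path)
    problems = []
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != 1:
        problems.append(f"{channels} channels found, expected mono")
    if not min_seconds <= duration <= max_seconds:
        problems.append(
            f"duration is {duration:.1f} s, expected {min_seconds}-{max_seconds} s"
        )
    return problems
```

A typical use is `for p in check_reference_clip("path/to/audio_sample.wav"): print("Warning:", p)` before starting synthesis.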
Key Features:
Technique | VRAM Reduction | Speed Boost |
---|---|---|
4-bit Quant | 40% | 1.2x |
FP16 Precision | 50% | 3x |
Flash Attention | - | 5x |
Enable optimizations in code:
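import torch
from transformers import AutoModelForCausalLM
# FP16 weights plus FlashAttention 2 (requires the flash-attn package and a supported NVIDIA GPU).
model = AutoModelForCausalLM.from_pretrained(
    "HKUSTAudio/Llasa-3B",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
)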
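To see why precision matters so much for VRAM, it helps to do the arithmetic for the weights alone: parameter count times bits per parameter, divided by eight to get bytes. The sketch below assumes a nominal 3 billion parameters and counts only the weights; activations, the KV cache, and any codec or helper models add to the total.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for model weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    # 3e9 is the nominal parameter count implied by the "3B" name.
    for label, bits in [("FP32", 32), ("FP16", 16), ("4-bit", 4)]:
        print(f"{label:>5}: {weight_memory_gb(3e9, bits):.1f} GB")
```

For 3 billion parameters this works out to 12.0 GB in FP32, 6.0 GB in FP16, and 1.5 GB at 4 bits, which is why real-world usage figures sit noticeably above the weight-only numbers.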
1: CUDA Out of Memory Error
Fixes: reduce the batch size; use 4-bit quantization; upgrade the GPU.
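Reducing the batch size can be automated with a retry loop that halves the batch whenever an out-of-memory error is raised. The sketch below is framework-agnostic: `run_batch` is a hypothetical callable you supply that takes a list of inputs and returns a list of outputs. It relies on the fact that PyTorch's CUDA out-of-memory error is a `RuntimeError` whose message contains "out of memory".

```python
def run_with_backoff(run_batch, items, batch_size=8, min_batch_size=1):
    """Process items in batches, halving batch_size after each OOM error."""
    results, i = [], 0
    while i < len(items):
        batch = items[i:i + batch_size]
        try:
            results.extend(run_batch(batch))
            i += len(batch)
        except RuntimeError as err:
            is_oom = "out of memory" in str(err).lower()
            # Re-raise unrelated errors, or OOM when we cannot shrink further.
            if not is_oom or batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)
    return results


if __name__ == "__main__":
    def fake_run(batch):
        # Stand-in for real inference: pretend batches larger than 2 do not fit.
        if len(batch) > 2:
            raise RuntimeError("CUDA out of memory (simulated)")
        return [x * 2 for x in batch]

    print(run_with_backoff(fake_run, [1, 2, 3, 4, 5]))
```

With real PyTorch code it is usually worth calling `torch.cuda.empty_cache()` before retrying, so that memory cached from the failed attempt is released.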
2: Audio Artifacts
Fixes: check the sample rate (16kHz recommended); clean the input text; increase num_mel_bins in the config.
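What counts as clean input text is not spelled out here, so the following is only one plausible interpretation: normalize Unicode, replace typographic quotes, drop invisible control and format characters, and collapse whitespace. The function name and the specific rules are assumptions; adjust them to your language and content.

```python
import re
import unicodedata

# Typographic quotes mapped to plain ASCII equivalents.
_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def clean_tts_text(text):
    """Tidy text before synthesis without changing its wording."""
    text = unicodedata.normalize("NFC", text).translate(_QUOTES)
    # Drop control/format characters (Unicode categories starting with "C"),
    # but keep whitespace so it can be collapsed below.
    text = "".join(
        ch for ch in text
        if ch.isspace() or not unicodedata.category(ch).startswith("C")
    )
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(clean_tts_text("  Hello,\u200b  \u201cworld\u201d!\n\n"))
```

Stray zero-width characters and irregular whitespace often come from copy-pasted web text, so this kind of pass is a cheap first step when chasing artifacts.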
3: Slow Inference
# Enable GPU acceleration
model.to("cuda")
# Use Torch Compile
model = torch.compile(model)
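To tell whether a change such as `torch.compile` actually helped, measure before and after. A common metric for TTS is the real-time factor (RTF): generation time divided by the duration of the audio produced, where values below 1 mean faster than real time. The helpers below are a standard-library sketch with illustrative names.

```python
import time


def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent generating / duration of generated audio (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) using a high-resolution clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in workload; replace with your own synthesis call.
    _, elapsed = timed(sum, range(1_000_000))
    print(f"elapsed: {elapsed:.4f} s")
    print("RTF for 2 s of compute and 4 s of audio:", real_time_factor(2.0, 4.0))
```

When timing GPU work, call `torch.cuda.synchronize()` before reading the clock, because CUDA kernels run asynchronously. Also discard the first run after `torch.compile`, since it includes compilation time.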
Model | VRAM | Languages | Voice Cloning |
---|---|---|---|
LLaSA 3B | 8GB | 50+ | Yes (5 sec) |
Coqui TTS | 4GB | 20+ | - |
Bark | 12GB | 100+ | Yes (10 sec) |
Tortoise TTS | 16GB | English | Yes (1 min) |
1: Can I run this on Google Colab?
A: Yes! Use a T4 GPU with this Colab template.
2: Commercial use allowed?
A: Check LLaMA's licensing. Non-commercial research only.
3: Chinese/Japanese support?
A: Yes, via custom tokenizers.
Other TTS solutions include:
LLaSA TTS 3B brings state-of-the-art speech synthesis to Ubuntu users. With proper GPU setup and our optimization tips, you can deploy realistic voice AI for: