Install and Run LLaSA TTS 3B on Windows: Step by Step Guide
LLaSA-3B is an open-source text-to-speech model that produces emotionally expressive speech in English and Chinese. Built on Meta's LLaMA architecture, it generates speech tokens that the XCodec2 codec decodes into audio; the reference script in this guide writes that audio as a 16 kHz WAV file. It suits developers building voice assistants, audiobook tools, or multilingual content platforms.
System Requirements Checklist
Before installation, verify your Windows setup meets these specs:
- Operating System: Windows 10 or later
- Python: Version 3.9 is recommended to avoid compatibility issues.
- RAM: At least 16GB is required, but 32GB is preferred for optimal performance.
- Storage: A minimum of 50GB of free space, preferably an NVMe SSD with 100GB, for models, libraries, and dependencies.
- GPU (NVIDIA): A dedicated NVIDIA GPU with CUDA support is recommended. Minimum 6GB VRAM for 4-bit quantization or 12GB+ VRAM for FP16 processing. CPU-only mode is possible but extremely slow.
| Component | Minimum | Recommended |
|---|---|---|
| RAM | 16GB | 32GB DDR4 |
| Storage | 50GB HDD | 100GB NVMe SSD |
| GPU | NVIDIA GTX 1660 (6GB) | RTX 3090 (24GB) |
| Python | 3.8 | 3.9 |
Critical Notes:
- CPU-only mode is possible but impractical (10x slower)
- CUDA 11.8+ is required for GPU acceleration
- Validate CUDA compatibility: run nvidia-smi in Command Prompt
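The nvidia-smi check can also be scripted. The following is a minimal sketch, assuming nvidia-smi is on PATH; the helper names (parse_gpu_line, query_gpus, supported_mode) are illustrative, and the 6 GB and 12 GB thresholds are the ones from the requirements list above.

```python
import shutil
import subprocess

MIN_VRAM_GB_4BIT = 6    # 4-bit quantization threshold from the requirements list
MIN_VRAM_GB_FP16 = 12   # FP16 threshold from the requirements list


def parse_gpu_line(line):
    """Parse one 'name, memory.total' CSV line (memory in MiB) into (name, vram_gb)."""
    name, mem_mib = [part.strip() for part in line.rsplit(",", 1)]
    return name, int(mem_mib) / 1024


def query_gpus():
    """Return [(name, vram_gb), ...]; an empty list if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (subprocess.CalledProcessError, OSError):
        return []
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]


def supported_mode(vram_gb):
    """Map VRAM to the mode the requirements list says it can handle."""
    if vram_gb >= MIN_VRAM_GB_FP16:
        return "fp16"
    if vram_gb >= MIN_VRAM_GB_4BIT:
        return "4-bit"
    return "cpu-only"


if __name__ == "__main__":
    for name, vram in query_gpus() or [("no NVIDIA GPU found", 0.0)]:
        print(f"{name}: {vram:.1f} GB -> {supported_mode(vram)}")
```

Real cards often report slightly less than their nominal capacity, so treat a result just under a threshold as a prompt to check manually.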
Step-by-Step Installation Walkthrough
1. Install XCodec2
XCodec2 is required for decoding speech tokens into audio.
pip install xcodec2==0.1.3
2. Environment Setup
# Create dedicated Conda environment
conda create -n llasa_tts python=3.9 -y
conda activate llasa_tts
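# Install core dependencies
pip install torch==2.0.1+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install xcodec2==0.1.3 transformers==4.31.0
# Note: the first script in the walkthrough below passes continue_final_message=True to
# apply_chat_template, an argument added in a later transformers release than 4.31.0.
# If that argument is rejected, upgrade transformers.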
Pro Tip: Use Windows Subsystem for Linux (WSL2) for smoother CLI operations.
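After the install commands finish, it can help to confirm which versions actually landed in the environment. This is a small sketch using only the standard library; the PINNED table simply mirrors the pip commands above, and the function names are illustrative.

```python
from importlib import metadata

# Versions requested by the pip commands in this step
PINNED = {"torch": "2.0.1+cu118", "xcodec2": "0.1.3", "transformers": "4.31.0"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins=PINNED):
    """Return {package: (expected, found)} for every pin that does not match."""
    mismatches = {}
    for package, expected in pins.items():
        found = installed_version(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    problems = check_pins()
    if not problems:
        print("All pinned packages match.")
    for package, (expected, found) in problems.items():
        print(f"{package}: expected {expected}, found {found}")
```

Run it inside the activated llasa_tts environment; a mismatch is not necessarily an error, since pip may have resolved a different version to satisfy another package's requirements.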
3. Model Deployment
git clone https://github.com/nivibilla/local-llasa-tts.git
cd local-llasa-tts
# Download 4-bit quantized model (3.8GB)
wget https://huggingface.co/srinivasbilla/llasa-3b-Q4_K_M-GGUF/resolve/main/llasa-3b-q4_k_m.gguf
OR
For inference using llama.cpp:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
LLAMA_CURL=1 make LLAMA_CUDA=1
Download a GGUF version of the LLaSA TTS 3B model and run inference:
./llama-cli --hf-repo srinivasbilla/llasa-3b-Q4_K_M-GGUF --hf-file llasa-3b-q4_k_m.gguf -p "The meaning of life and the universe is"
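wget is not a built-in Command Prompt command, so the download in this step may fail outside WSL2. As one possible workaround, the sketch below fetches the same file with Python's standard library. The repository and file names are copied from the wget command above; resolve_url and download are illustrative helper names.

```python
import urllib.request
from pathlib import Path

REPO_ID = "srinivasbilla/llasa-3b-Q4_K_M-GGUF"   # from the wget command above
FILENAME = "llasa-3b-q4_k_m.gguf"                 # from the wget command above


def resolve_url(repo_id, filename, revision="main"):
    """Build the Hugging Face 'resolve' URL for a file in a model repository."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def download(repo_id=REPO_ID, filename=FILENAME, dest_dir="."):
    """Download the file unless it already exists; return the local path."""
    dest = Path(dest_dir) / filename
    if not dest.exists():
        urllib.request.urlretrieve(resolve_url(repo_id, filename), dest)
    return dest
```

Calling download() from the local-llasa-tts directory places the GGUF file where the wget command would have put it. The file is several gigabytes, so the call blocks until the transfer completes.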
4. Hardware Acceleration Setup
- Install NVIDIA CUDA Toolkit 12.1
Verify installation:
nvcc --version # Should show CUDA 12.1+
nvidia-smi # Check GPU memory allocation
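The version check can be automated by parsing the "release X.Y" field that nvcc prints. A minimal sketch follows, with illustrative function names; it returns None rather than failing when nvcc is not on PATH.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Extract (major, minor) from the 'release X.Y' part of nvcc output."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def cuda_toolkit_version():
    """Return (major, minor) of the installed toolkit, or None if nvcc is unavailable."""
    if shutil.which("nvcc") is None:
        return None
    try:
        out = subprocess.run(["nvcc", "--version"],
                             capture_output=True, text=True, check=True).stdout
    except (subprocess.CalledProcessError, OSError):
        return None
    return parse_nvcc_release(out)


if __name__ == "__main__":
    version = cuda_toolkit_version()
    if version is None:
        print("nvcc not found on PATH")
    else:
        print(f"CUDA toolkit {version[0]}.{version[1]}")
        if version < (12, 1):
            print("Older than the 12.1 toolkit this step installs")
```

Tuples compare element by element, so (11, 8) < (12, 1) evaluates as expected without any string handling.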
5. Running the Gradio App
Launch the Gradio web interface to interact with the model:
python ./hf_app.py
6. Long Text Inference with VLLM
For efficient inference of longer texts:
pip install vllm
jupyter notebook llasa_vllm_longtext_inference.ipynb
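Long inputs are commonly split into shorter pieces before synthesis so that each generation stays within the model's token limit. The notebook's own strategy is not shown here, so the following is only a sketch of one possible approach: group whole sentences into chunks of bounded length. The split_into_chunks name and the 200-character default are assumptions.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as its own chunk.
    """
    # Split after ., !, ? followed by whitespace, or directly after the
    # full-width Chinese forms, which are not followed by spaces.
    parts = re.split(r"(?<=[.!?])\s+|(?<=[。！？])", text)
    sentences = [part.strip() for part in parts if part.strip()]

    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # current chunk is full; start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in split_into_chunks("Hello world. How are you? Fine.", max_chars=20):
        print(repr(chunk))
```

Each chunk can then be synthesized separately and the resulting waveforms concatenated; splitting at sentence boundaries keeps the joins at natural pauses.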
Complete Text-to-Speech Script Walkthrough
Code Implementation for Text-to-Speech
Here's a basic implementation to convert text into speech using LLaSA 3B:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import soundfile as sf
from xcodec2.modeling_xcodec2 import XCodec2Model

llasa_3b = 'HKUST-Audio/Llasa-3B'
tokenizer = AutoTokenizer.from_pretrained(llasa_3b)
model = AutoModelForCausalLM.from_pretrained(llasa_3b).eval().to('cuda')

# XCodec2 decodes the generated speech tokens into a waveform
codec_model = XCodec2Model.from_pretrained("HKUST-Audio/xcodec2").eval().cuda()

input_text = 'This is a test sentence for speech synthesis.'

def ids_to_speech_tokens(speech_ids):
    # Format codec IDs as <|s_N|> token strings
    return [f"<|s_{speech_id}|>" for speech_id in speech_ids]

def extract_speech_ids(speech_tokens_str):
    # Recover integer codec IDs from <|s_N|> token strings
    return [int(token[4:-2]) for token in speech_tokens_str if token.startswith('<|s_') and token.endswith('|>')]

with torch.no_grad():
    formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"
    chat = [
        {"role": "user", "content": "Convert the text to speech:" + formatted_text},
        {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"}
    ]
    tokenizer.padding_side = "left"
    input_ids = tokenizer.apply_chat_template(chat, tokenize=True, return_tensors='pt', continue_final_message=True).to('cuda')
    speech_end_id = tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>')
    # Sample speech tokens until the end-of-speech token or the length limit
    outputs = model.generate(input_ids, max_length=2048, eos_token_id=speech_end_id, do_sample=True, top_p=1, temperature=0.8)
    # Keep only the newly generated tokens of the first sequence, dropping the trailing end token
    generated_ids = outputs[0][input_ids.shape[1]:-1]
    speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    speech_tokens = extract_speech_ids(speech_tokens)
    speech_tokens = torch.tensor(speech_tokens).cuda().unsqueeze(0).unsqueeze(0)
    # Decode the codec IDs to audio and save it at 16 kHz
    gen_wav = codec_model.decode_code(speech_tokens)
    sf.write("gen.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)
print("Audio saved to gen.wav")
OR
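# text_to_speech.py
import torch
import torchaudio
from transformers import AutoTokenizer, AutoModelForCausalLM
from xcodec2.modeling_xcodec2 import XCodec2Model
model_path = "HKUST-Audio/Llasa-3B"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             device_map="auto",
                                             torch_dtype=torch.float16).eval()
codec_model = XCodec2Model.from_pretrained("HKUST-Audio/xcodec2").eval().cuda()  # speech tokens -> waveform
def decode_speech_tokens(token_ids):
    # Turn generated <|s_N|> tokens into a (1, num_samples) waveform with XCodec2
    tokens = tokenizer.batch_decode(token_ids, skip_special_tokens=True)
    ids = [int(t[4:-2]) for t in tokens if t.startswith("<|s_") and t.endswith("|>")]
    codes = torch.tensor(ids).cuda().unsqueeze(0).unsqueeze(0)
    return codec_model.decode_code(codes)[0].cpu()
def synthesize(text, output_file="output.wav"):
    prompt = f"Convert the text to speech:<|TEXT_UNDERSTANDING_START|>{text}<|TEXT_UNDERSTANDING_END|>"
    chat = [{"role": "user", "content": prompt},
            {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"}]  # same prompt format as the first script
    input_ids = tokenizer.apply_chat_template(chat, tokenize=True, return_tensors="pt",
                                              continue_final_message=True).to("cuda")
    end_id = tokenizer.convert_tokens_to_ids("<|SPEECH_GENERATION_END|>")
    with torch.inference_mode():
        outputs = model.generate(input_ids,
                                 max_new_tokens=500,
                                 eos_token_id=end_id,
                                 do_sample=True,  # temperature and top_p only apply when sampling
                                 temperature=0.7,
                                 top_p=0.95)
        audio = decode_speech_tokens(outputs[0][input_ids.shape[1]:-1])
    torchaudio.save(output_file, audio, 16000)  # 16 kHz, matching the first script
    print(f"Generated {output_file}")
if __name__ == "__main__":
    synthesize("This is a test sentence for speech synthesis.")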
Key Parameters to Tweak:
- temperature (0.1-1.0): Lower = more deterministic
- top_p (0.5-0.95): Controls vocabulary diversity
- max_new_tokens: Adjust based on text length (XCodec2 produces about 50 speech tokens per second of audio, so 500 tokens is roughly 10 s of speech)
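To see why a lower temperature is more deterministic and how top_p narrows the candidate set, the toy example below applies both to four made-up token scores. It illustrates the idea only and is not the transformers implementation; the logits and function names are invented for the demonstration.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits / temperature; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, renormalised."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]   # toy scores for four candidate tokens
for temperature in (0.1, 0.7, 1.0):
    probs = apply_temperature(logits, temperature)
    survivors = sorted(top_p_filter(probs, 0.9))
    print(temperature, [round(p, 3) for p in probs], "kept:", survivors)
```

At temperature 0.1 nearly all probability sits on the top-scoring token, so top_p=0.9 keeps a single candidate; at 1.0 the mass is spread out and several candidates survive the filter.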
Running the Script
- Save the code as a .py file (e.g., text_to_speech.py).
- Open the terminal and activate the Conda environment: conda activate llasa_tts.
- Navigate to the script directory.
- Run the script: python text_to_speech.py.
- The generated speech will be saved as gen.wav.
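The two helpers in the first script, ids_to_speech_tokens and extract_speech_ids, are inverses of each other, and the round trip can be tried without a GPU or model download. The sample IDs below are arbitrary values chosen for illustration.

```python
def ids_to_speech_tokens(speech_ids):
    """Format codec IDs as the <|s_N|> token strings the model emits."""
    return [f"<|s_{speech_id}|>" for speech_id in speech_ids]


def extract_speech_ids(speech_tokens):
    """Recover integer codec IDs, ignoring anything that is not <|s_N|>."""
    return [
        int(token[4:-2])   # strip the leading '<|s_' and the trailing '|>'
        for token in speech_tokens
        if token.startswith("<|s_") and token.endswith("|>")
    ]


tokens = ids_to_speech_tokens([17, 4096, 65535])
print(tokens)   # ['<|s_17|>', '<|s_4096|>', '<|s_65535|>']

# Non-speech tokens such as the end marker are filtered out
print(extract_speech_ids(tokens + ["<|SPEECH_GENERATION_END|>"]))   # [17, 4096, 65535]
```

The prefix check is case-sensitive, which is why the upper-case end marker does not pass the filter.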
Advanced Features Unleashed
Real-Time Voice Cloning
# Clone voices with 5-second reference audio
from llasa.voice_cloning import VoiceCloneEngine
cloner = VoiceCloneEngine()
cloner.load_reference("reference.wav")
cloner.generate("Target text here", output_file="clone_output.wav")
Batch Processing
# Process multiple texts via CLI
python -m llasa.batch \
--input-file texts.txt \
--output-dir ./audio_output \
--batch-size 8 \
--precision fp16
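How the batch command groups its work internally is not documented here. As a rough sketch of the general idea, the code below reads one utterance per line and pairs each with a numbered output path in batches of eight. The file layout, the naming scheme, and all function names are assumptions made for illustration.

```python
from pathlib import Path


def read_texts(path):
    """Return the non-empty lines of a UTF-8 text file, one utterance per line."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def plan_outputs(texts, output_dir, batch_size=8):
    """Pair every text with a numbered .wav path, grouped into batches."""
    out = Path(output_dir)
    jobs = [(text, out / f"{index:04d}.wav") for index, text in enumerate(texts)]
    return list(batched(jobs, batch_size))


if __name__ == "__main__":
    demo_texts = [f"Sentence number {i}." for i in range(19)]
    for number, batch in enumerate(plan_outputs(demo_texts, "audio_output")):
        print(f"batch {number}: {len(batch)} items, first -> {batch[0][1]}")
```

A larger batch size improves throughput but raises peak VRAM use, which is why reducing it is the first remedy for out-of-memory errors.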
Performance Optimization Table
| Technique | VRAM Usage | Speed (RTX 4090) | Quality |
|---|---|---|---|
| FP32 | 24GB | 1.0x | Lossless |
| FP16 | 12GB | 1.8x | Near-lossless |
| 4-bit Quant | 6GB | 2.5x | Good |
| 8-bit Quant | 8GB | 2.1x | Excellent |
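A quick back-of-the-envelope calculation gives a lower bound on the VRAM figures above: the memory needed for the weights alone is the parameter count times the bits per weight. The sketch assumes a nominal 3 billion parameters, which is an approximation rather than the model's exact count.

```python
PARAMS = 3_000_000_000  # nominal "3B" parameter count (assumption, not an exact figure)

BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "8-bit Quant": 8, "4-bit Quant": 4}


def weight_memory_gb(params, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:12s} {weight_memory_gb(PARAMS, bits):5.1f} GB")
```

The table's figures are higher than these weight-only estimates; the difference presumably covers activations, the key-value cache, and the XCodec2 decoder, so the table is the safer guide when sizing hardware.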
Troubleshooting Matrix
| Symptom | Solution |
|---|---|
| CUDA Out of Memory | Reduce batch size, enable 4-bit quantization |
| Audio Artifacts | Increase top_p to 0.9+, check sample rate consistency |
| Slow Inference | Enable Flash Attention 2, use llama.cpp optimizations |
| Chinese Text Failures | Ensure proper tokenization with tokenizer.apply_chat_template() |
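For the sample rate check mentioned in the matrix, the rate stored in a WAV header can be read with Python's standard wave module. This sketch works for PCM WAV files (the default that soundfile writes for .wav); the helper names are illustrative, and write_silence exists only to produce a demo file.

```python
import os
import tempfile
import wave


def wav_sample_rate(path):
    """Return the sample rate stored in a PCM WAV file's header."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getframerate()


def check_sample_rate(path, expected_hz):
    """Return True if the header matches the rate the audio was generated at."""
    return wav_sample_rate(path) == expected_hz


def write_silence(path, sample_rate, seconds=0.1):
    """Write a short mono 16-bit silent WAV file, used here only as demo input."""
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(sample_rate * seconds))


demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
write_silence(demo_path, 16000)
print(wav_sample_rate(demo_path))            # 16000
print(check_sample_rate(demo_path, 24000))   # False
```

If the header rate differs from the rate the codec produced, playback is sped up or slowed down, which is easy to mistake for a model artifact.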
Cloud Alternatives for Low-Spec Hardware
Docker Deployment
docker pull kjjk10/llasa-3b-long
docker run -p 7860:7860 --gpus all kjjk10/llasa-3b-long
# Access via http://localhost:7860
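The container can take a while to load the model before the web interface responds. A small standard-library probe, sketched below with an illustrative function name, reports whether anything is accepting connections on the published port.

```python
import socket


def is_port_open(host="localhost", port=7860, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    state = "reachable" if is_port_open() else "not reachable yet"
    print(f"http://localhost:7860 is {state}")
```

An open port only shows that the server process is listening; the first synthesis request may still be slow while the model is loaded.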
Google Colab Pro
!pip install -q llasa-tts
from llasa import RemoteEngine
engine = RemoteEngine(api_key="your_key")
engine.synthesize("Your text", voice="chinese-female")
Alternative Solutions
- Google Colab: Utilize free cloud GPUs if local hardware is insufficient.
- Cloud-Based Services: Services like Replicate allow model execution without local hardware.
- Docker: Use kjjk10/llasa-3b-long for streamlined deployment.
Troubleshooting
- Compatibility Issues: Use Python 3.9 to avoid dependency conflicts.
- CUDA Errors: Ensure your NVIDIA drivers are installed and compatible with your CUDA version.
- Out of Memory: Reduce batch size or use a GPU with higher VRAM.
- Slow Inference: A CUDA-enabled GPU is highly recommended.
Conclusion
By following this guide, you can set up and run LLaSA TTS 3B on Windows for text-to-speech conversion and voice cloning. Ensure system compatibility, follow the installation steps, and use troubleshooting tips for a smooth experience. LLaSA 3B opens up new possibilities for high-quality AI-driven speech synthesis.