Install LLaSA TTS 3B on Windows: Local TTS Setup Guide

LLaSA-3B is an open-source text-to-speech model built on Meta's LLaMA framework. It supports English and Chinese, can convey emotional nuance, and uses the XCodec2 codec to turn generated speech tokens into audio (the first script in this guide writes a 16kHz WAV file). It suits developers building voice assistants, audiobook tools, or multilingual content platforms.

System Requirements Checklist

Before installation, verify your Windows setup meets these specs:

Operating System: Windows 10 or later
Python: Version 3.9 is recommended to avoid compatibility issues.
RAM: At least 16GB is required, but 32GB is preferred for optimal performance.
Storage: A minimum of 50GB of free space, preferably an NVMe SSD with 100GB, for models, libraries, and dependencies.
GPU (NVIDIA): A dedicated NVIDIA GPU with CUDA support is recommended, with at least 6GB VRAM for 4-bit quantization or 12GB+ VRAM for FP16 processing. CPU-only mode is possible but extremely slow.

Component	Minimum	Recommended
RAM	16GB	32GB DDR4
Storage	50GB HDD	100GB NVMe SSD
GPU	NVIDIA GTX 1660 (6GB)	RTX 3090 (24GB)
Python	3.8	3.9

Critical Notes:

🔴 CPU-only mode is possible but impractical (roughly 10x slower)
🟢 CUDA 11.8+ is required for GPU acceleration
💡 Check your driver and the highest CUDA version it supports: run nvidia-smi in Command Prompt
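
The checklist above can also be checked from Python before installing anything. This is only a sketch: the function name check_system is made up for illustration, the thresholds (Python 3.8, 50GB free) are copied from the requirements table, and the GPU check merely looks for nvidia-smi on PATH instead of measuring VRAM.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)   # minimum Python version from the requirements table
MIN_FREE_GB = 50      # minimum free storage from the requirements table

def check_system(path=".", version=None):
    """Return pass/fail flags for the basic requirements (hypothetical helper)."""
    version = version or sys.version_info[:2]
    free_gb = shutil.disk_usage(path).free / 1024**3
    return {
        "python_ok": tuple(version) >= MIN_PYTHON,
        "disk_ok": free_gb >= MIN_FREE_GB,
        # nvidia-smi is normally installed together with the NVIDIA driver
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
    }

for name, ok in check_system().items():
    print(f"{name}: {'OK' if ok else 'FAIL'}")
```

A FAIL on nvidia_smi_found usually means the NVIDIA driver is missing or not on PATH, which is worth fixing before installing PyTorch.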

Step-by-Step Installation Walkthrough

1. Install XCodec2

XCodec2 is required for decoding speech tokens into audio. Install it inside the Conda environment created in step 2, not in your system Python.

pip install xcodec2==0.1.3

2. Environment Setup

# Create dedicated Conda environment
conda create -n llasa_tts python=3.9 -y
conda activate llasa_tts

# Install core dependencies
pip install torch==2.0.1+cu118 --index-url https://download.pytorch.org/whl/cu118
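# soundfile is imported by the scripts below. The continue_final_message argument
# they pass to apply_chat_template is not available in transformers 4.31.0, so a
# newer release (4.45 or later) is needed.
pip install xcodec2==0.1.3 "transformers>=4.45" soundfile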
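
After the installs finish, it can help to confirm which versions actually ended up in the environment, since pip may adjust pins to satisfy dependencies. The sketch below uses only the standard library; the helper name installed_version and the package list are illustrative.

```python
from importlib import metadata

# Packages this guide relies on; extend the list as needed
PACKAGES = ["torch", "transformers", "xcodec2"]

def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

for pkg in PACKAGES:
    print(f"{pkg}: {installed_version(pkg) or 'not installed'}")
```

Run it with the llasa_tts environment active so that it reports on the right interpreter.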

Pro Tip: Use Windows Subsystem for Linux (WSL2) for smoother CLI operations.

3. Model Deployment

git clone https://github.com/nivibilla/local-llasa-tts.git
cd local-llasa-tts

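# Download 4-bit quantized model (3.8GB)
# wget is not a standard Command Prompt command; curl ships with recent Windows 10 builds and later
curl -L -O https://huggingface.co/srinivasbilla/llasa-3b-Q4_K_M-GGUF/resolve/main/llasa-3b-q4_k_m.gguf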
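
The model file can also be fetched from Python, which avoids depending on a particular command-line downloader. This is a minimal sketch using the standard library; gguf_url and download are hypothetical helper names, and the repository and file names are the ones given in this guide.

```python
import urllib.request
from pathlib import Path

def gguf_url(repo, filename):
    """Build the Hugging Face 'resolve' URL for a file on the main branch."""
    return f"https://huggingface.co/{repo}/resolve/main/{filename}"

def download(url, dest_dir="."):
    """Download url into dest_dir, skipping the transfer if the file already exists."""
    dest = Path(dest_dir) / url.rsplit("/", 1)[-1]
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest

url = gguf_url("srinivasbilla/llasa-3b-Q4_K_M-GGUF", "llasa-3b-q4_k_m.gguf")
print(url)
# download(url)  # uncomment to fetch the multi-gigabyte file
```

The existence check makes the call safe to repeat, but it does not detect a partially downloaded file; delete the file and retry if a transfer was interrupted.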

For inference using llama.cpp:

# The make line uses Unix shell syntax; run it under WSL2 or another Unix-like shell
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
LLAMA_CURL=1 make LLAMA_CUDA=1

Download a GGUF version of the LLaSA TTS 3B model and run inference:

./llama-cli --hf-repo srinivasbilla/llasa-3b-Q4_K_M-GGUF --hf-file llasa-3b-q4_k_m.gguf -p "The meaning of life and the universe is"

4. Hardware Acceleration Setup

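Install NVIDIA CUDA Toolkit 12.1. The toolkit is needed to compile llama.cpp with CUDA support; the cu118 PyTorch wheel from step 2 bundles its own CUDA runtime and only needs a compatible NVIDIA driver.

Verify installation:

nvcc --version # Should report release 12.1 or later
nvidia-smi # Shows the driver version, the highest CUDA version it supports, and GPU memory usage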
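
The two commands above check the toolkit and the driver, but not whether PyTorch itself can reach the GPU. A short Python check, sketched here with the hypothetical helper name cuda_report, covers that part and degrades gracefully when torch is not installed.

```python
def cuda_report():
    """Summarize what PyTorch can see; also works when torch is not installed."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False, "cuda_available": False}
    report = {
        "torch_installed": True,
        "cuda_available": torch.cuda.is_available(),
        # CUDA version the installed wheel was built against (None for CPU-only builds)
        "torch_cuda_build": torch.version.cuda,
    }
    if report["cuda_available"]:
        props = torch.cuda.get_device_properties(0)
        report["gpu_name"] = props.name
        report["vram_gb"] = round(props.total_memory / 1024**3, 1)
    return report

print(cuda_report())
```

If cuda_available is False while nvidia-smi works, the usual cause is a CPU-only PyTorch build, which shows up as torch_cuda_build being None.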

5. Running the Gradio App

Launch the Gradio web interface to interact with the model. Run this from the local-llasa-tts directory cloned in step 3:

python ./hf_app.py

6. Long Text Inference with vLLM

For efficient inference of longer texts:

pip install vllm
jupyter notebook llasa_vllm_longtext_inference.ipynb
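
The notebook itself is not reproduced in this guide. One common way to handle long input, sketched here as an assumption and not as the notebook's actual method, is to split the text into sentence-sized chunks and synthesize each chunk separately. The function name split_into_chunks and the max_chars limit are illustrative.

```python
import re

def split_into_chunks(text, max_chars=200):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    # Split after ASCII sentence punctuation followed by whitespace,
    # or directly after Chinese sentence punctuation
    parts = re.split(r"(?<=[.!?])\s+|(?<=[。！？])", text)
    sentences = [s.strip() for s in parts if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # current chunk is full, start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

print(split_into_chunks("First sentence. Second one! 第三句。", max_chars=20))
```

Each chunk can then be passed to the synthesis code on its own and the resulting audio segments concatenated in order.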

Complete Text-to-Speech Script Walkthrough

Code Implementation for Text-to-Speech

Here’s a basic implementation to convert text into speech using LLaSA 3B:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import soundfile as sf

llasa_3b = 'HKUST-Audio/Llasa-3B'
tokenizer = AutoTokenizer.from_pretrained(llasa_3b)
model = AutoModelForCausalLM.from_pretrained(llasa_3b).eval().to('cuda')

from xcodec2.modeling_xcodec2 import XCodec2Model
codec_model = XCodec2Model.from_pretrained("HKUST-Audio/xcodec2").eval().cuda()

input_text = 'This is a test sentence for speech synthesis.'

def ids_to_speech_tokens(speech_ids):
    return [f"<|s_{speech_id}|>" for speech_id in speech_ids]

def extract_speech_ids(speech_tokens_str):
    return [int(token[4:-2]) for token in speech_tokens_str if token.startswith('<|s_') and token.endswith('|>')]

with torch.no_grad():
    formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"
    chat = [
        {"role": "user", "content": "Convert the text to speech:" + formatted_text},
        {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"}
    ]

    tokenizer.padding_side = "left"
    input_ids = tokenizer.apply_chat_template(chat, tokenize=True, return_tensors='pt', continue_final_message=True).to('cuda')
    speech_end_id = tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>')
    outputs = model.generate(input_ids, max_length=2048, eos_token_id=speech_end_id, do_sample=True, top_p=1, temperature=0.8)

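    # outputs has shape (batch, sequence): take the first sequence, skip the prompt,
    # and drop the trailing end-of-speech token
    generated_ids = outputs[0][input_ids.shape[1]:-1]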
    speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    speech_tokens = extract_speech_ids(speech_tokens)
    speech_tokens = torch.tensor(speech_tokens).cuda().unsqueeze(0).unsqueeze(0)
    gen_wav = codec_model.decode_code(speech_tokens)
    sf.write("gen.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)
    print("Audio saved to gen.wav")
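
The two helper functions in the script convert between integer codec IDs and the `<|s_N|>` token strings the model generates. Their behavior can be checked without loading any model, as in this small sketch that reuses the same definitions with arbitrary example IDs:

```python
def ids_to_speech_tokens(speech_ids):
    """Format integer codec IDs as speech token strings."""
    return [f"<|s_{speech_id}|>" for speech_id in speech_ids]

def extract_speech_ids(speech_tokens):
    """Parse speech token strings back to integers, ignoring any other token."""
    return [
        int(token[4:-2])  # strip the leading "<|s_" and the trailing "|>"
        for token in speech_tokens
        if token.startswith("<|s_") and token.endswith("|>")
    ]

tokens = ids_to_speech_tokens([0, 17, 4096])
print(tokens)
# Non-speech tokens such as the end marker are filtered out
print(extract_speech_ids(tokens + ["<|SPEECH_GENERATION_END|>"]))
```

The filter in extract_speech_ids is what lets the script pass decoded output straight to the codec even if a control token slips into the generated sequence.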

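Alternatively, wrap the same pipeline in a reusable function:

# text_to_speech.py
import torch
import torchaudio
from transformers import AutoTokenizer, AutoModelForCausalLM
from xcodec2.modeling_xcodec2 import XCodec2Model

model_path = "HKUST-Audio/Llasa-3B"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# device_map="auto" requires the accelerate package
model = AutoModelForCausalLM.from_pretrained(model_path,
                                           device_map="auto",
                                           torch_dtype=torch.float16).eval()
codec_model = XCodec2Model.from_pretrained("HKUST-Audio/xcodec2").eval().cuda()

def decode_speech_tokens(token_ids):
    # Turn generated "<|s_N|>" tokens back into codec IDs and decode them to audio
    tokens = tokenizer.batch_decode(token_ids, skip_special_tokens=True)
    ids = [int(t[4:-2]) for t in tokens if t.startswith("<|s_") and t.endswith("|>")]
    codes = torch.tensor(ids).cuda().unsqueeze(0).unsqueeze(0)
    return codec_model.decode_code(codes)[0].cpu()  # shape (channels, samples)

def synthesize(text, output_file="output.wav"):
    # Same prompt format as the first script
    prompt = f"<|TEXT_UNDERSTANDING_START|>{text}<|TEXT_UNDERSTANDING_END|>"
    chat = [{"role": "user", "content": "Convert the text to speech:" + prompt},
            {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"}]
    input_ids = tokenizer.apply_chat_template(chat, tokenize=True, return_tensors="pt",
                                              continue_final_message=True).to("cuda")
    end_id = tokenizer.convert_tokens_to_ids("<|SPEECH_GENERATION_END|>")

    with torch.inference_mode():
        outputs = model.generate(input_ids,
                               max_new_tokens=500,
                               eos_token_id=end_id,
                               do_sample=True,  # temperature and top_p only apply when sampling
                               temperature=0.7,
                               top_p=0.95)
        audio = decode_speech_tokens(outputs[0][input_ids.shape[1]:-1])

    torchaudio.save(output_file, audio, 16000)
    print(f"Generated {output_file}")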

Key Parameters to Tweak:

temperature (0.1-1.0): Lower = more deterministic
top_p (0.5-0.95): Controls vocabulary diversity
max_new_tokens: Adjust based on text length. XCodec2 produces about 50 speech tokens per second, so 1 token ≈ 0.02s of speech and 500 tokens ≈ 10s.
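
Choosing max_new_tokens comes down to simple arithmetic once the codec's token rate is known. The sketch below treats that rate as a parameter; the default of 50 tokens per second is an assumption about XCodec2 and should be checked against the codec's documentation. Both function names are illustrative.

```python
import math

def tokens_for_duration(seconds, tokens_per_second=50):
    """Speech tokens needed for the given audio length (rate is an assumption)."""
    return math.ceil(seconds * tokens_per_second)

def duration_for_tokens(tokens, tokens_per_second=50):
    """Audio length in seconds covered by a token budget (rate is an assumption)."""
    return tokens / tokens_per_second

print(tokens_for_duration(10))
print(duration_for_tokens(2048))
```

Setting the budget somewhat above the estimate leaves room for pauses, since generation stops on its own when the end-of-speech token is produced.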

Running the Script

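Save the code as a .py file (e.g., text_to_speech.py). If you use the second script, add a call such as synthesize("Hello world") at the end of the file.
Open the terminal and activate the Conda environment: conda activate llasa_tts.
Navigate to the script directory.
Run the script: python text_to_speech.py.
The generated speech will be saved as gen.wav (first script) or output.wav (the default in the second script).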

Advanced Features Unleashed

Real-Time Voice Cloning

# Clone voices with 5-second reference audio
from llasa.voice_cloning import VoiceCloneEngine

cloner = VoiceCloneEngine()
cloner.load_reference("reference.wav")
cloner.generate("Target text here", output_file="clone_output.wav")

Batch Processing

# Process multiple texts via CLI
python -m llasa.batch \
  --input-file texts.txt \
  --output-dir ./audio_output \
  --batch-size 8 \
  --precision fp16
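
The same batching idea can be expressed directly in Python around any synthesize(text, output_file) function, such as the one defined earlier in this guide. This is a sketch under that assumption; read_texts, batched, run_batches, and the utt_NNNN.wav naming scheme are all hypothetical.

```python
from pathlib import Path

def read_texts(path):
    """Return one utterance per non-blank line of a UTF-8 text file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]

def batched(items, batch_size=8):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def run_batches(texts, synthesize, output_dir="audio_output", batch_size=8):
    """Call synthesize(text, output_file) for every text, one batch at a time."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for batch_no, batch in enumerate(batched(texts, batch_size)):
        for offset, text in enumerate(batch):
            target = out / f"utt_{batch_no * batch_size + offset:04d}.wav"
            synthesize(text, str(target))
            written.append(target)
    return written

print(list(batched(["a", "b", "c"], batch_size=2)))
```

A typical call would be run_batches(read_texts("texts.txt"), synthesize); the batch loop is also the natural place to free GPU memory or log progress between groups.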

Performance Optimization Table

Technique	VRAM Usage	Speed (RTX 4090)	Quality
FP32	24GB	1.0x	Lossless
FP16	12GB	1.8x	Near-lossless
4-bit Quant	6GB	2.5x	Good
8-bit Quant	8GB	2.1x	Excellent
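
As a sanity check on the VRAM column, the memory taken by the model weights alone can be estimated as parameter count times bytes per parameter. The sketch assumes a nominal 3 billion parameters and ignores activations, the KV cache, the codec model, and framework overhead, so it gives a lower bound and real usage will be higher, as the figures in the table are.

```python
# Bytes used to store one parameter at each precision
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_params, precision):
    """Lower-bound memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_memory_gb(3e9, precision):.1f} GB")
```

The halving from one precision to the next matches the trend in the table, even though the absolute numbers differ.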

Troubleshooting Matrix

Symptom	Solution
CUDA Out of Memory	Reduce batch size, enable 4-bit quantization
Audio Artifacts	Increase `top_p` to 0.9+, check sample rate consistency
Slow Inference	Enable Flash Attention 2, use `llama.cpp` optimizations
Chinese Text Failures	Ensure proper tokenization with `tokenizer.apply_chat_template()`
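
For the sample-rate check mentioned under Audio Artifacts, the rate stored in a WAV header can be read with Python's standard library. If the header rate differs from the rate the codec actually produced, speech plays back too fast or too slow. The sketch assumes a PCM-encoded WAV file (the wave module does not read floating-point WAV files); wav_info is a hypothetical helper, and the demo file is written to the system temp directory.

```python
import os
import tempfile
import wave

def wav_info(path):
    """Read sample rate, channel count, and duration from a PCM WAV header."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "seconds": wav.getnframes() / rate,
        }

# Demo: write one second of 16-bit mono silence at 16 kHz, then inspect it
demo_path = os.path.join(tempfile.gettempdir(), "rate_check_demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)

print(wav_info(demo_path))
```

Running wav_info on a generated file and comparing sample_rate with the value passed to the save call is a quick way to rule out a rate mismatch.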

Cloud Alternatives for Low-Spec Hardware

Docker Deployment

docker pull kjjk10/llasa-3b-long
docker run -p 7860:7860 --gpus all kjjk10/llasa-3b-long
# Access via http://localhost:7860
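
Before opening the browser, it can be useful to confirm that the container is actually listening on the published port. A minimal sketch with the standard library follows; port_open is a hypothetical helper, and 7860 is the port from the docker run command above.

```python
import socket

def port_open(host="localhost", port=7860, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print("Gradio reachable:", port_open())
```

A False result shortly after docker run is normal while the model is still loading; retry after a minute before investigating further.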

Google Colab Pro

!pip install -q llasa-tts
from llasa import RemoteEngine

engine = RemoteEngine(api_key="your_key")
engine.synthesize("Your text", voice="chinese-female")

Alternative Solutions

Google Colab: Utilize free cloud GPUs if local hardware is insufficient.
Cloud-Based Services: Services like Replicate allow model execution without local hardware.
Docker: Use kjjk10/llasa-3b-long for streamlined deployment.

Conclusion

By following this guide, you can set up and run LLaSA TTS 3B on Windows for text-to-speech conversion and voice cloning. Check system compatibility first, follow the installation steps in order, and use the troubleshooting tips if something fails. LLaSA 3B opens up new possibilities for high-quality AI-driven speech synthesis.

Troubleshooting

Compatibility Issues: Use Python 3.9 to avoid dependency conflicts.
CUDA Errors: Ensure your NVIDIA drivers are installed and compatible with your CUDA version.
Out of Memory: Reduce batch size or use a GPU with higher VRAM.
Slow Inference: A CUDA-enabled GPU is highly recommended.
