LLaSA-3B is an open-source text-to-speech model with emotional nuance and bilingual (English/Chinese) support. Built on Meta's LLaMA framework, it pairs the language model with the XCodec2 speech codec, which decodes generated speech tokens into high-quality 16 kHz audio. It is a strong fit for developers building voice assistants, audiobook tools, or multilingual content platforms.
Before installation, verify your Windows setup meets these specs:
| Component | Minimum | Recommended |
|---|---|---|
| RAM | 16GB | 32GB DDR4 |
| Storage | 50GB HDD | 100GB NVMe SSD |
| GPU | NVIDIA GTX 1660 (6GB) | RTX 3090 (24GB) |
| Python | 3.8 | 3.9 |
Critical Notes:
- Confirm your GPU driver is working by running nvidia-smi in Command Prompt.
- XCodec2 is required for decoding speech tokens into audio:
pip install xcodec2==0.1.3
# Create dedicated Conda environment
conda create -n llasa_tts python=3.9 -y
conda activate llasa_tts
# Install core dependencies
pip install torch==2.0.1+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install xcodec2==0.1.3 transformers==4.31.0
Pro Tip: Use Windows Subsystem for Linux (WSL2) for smoother CLI operations.
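Before cloning the repository, it is worth a quick sanity check that PyTorch can see your GPU and that XCodec2 imports cleanly (run inside the llasa_tts environment):
# Sanity check: CUDA visibility and xcodec2 import
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import xcodec2; print('xcodec2 OK')"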
git clone https://github.com/nivibilla/local-llasa-tts.git
cd local-llasa-tts
# Download 4-bit quantized model (3.8GB)
wget https://huggingface.co/srinivasbilla/llasa-3b-Q4_K_M-GGUF/resolve/main/llasa-3b-q4_k_m.gguf
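wget is not bundled with Windows; if it is missing, the same file can be fetched with the huggingface_hub package (a minimal sketch using the repo and filename above):
# Alternative download via huggingface_hub (no wget needed)
python -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='srinivasbilla/llasa-3b-Q4_K_M-GGUF', filename='llasa-3b-q4_k_m.gguf', local_dir='.')"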
Alternatively, for inference using llama.cpp:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
LLAMA_CURL=1 make LLAMA_CUDA=1
Download a GGUF version of the LLaSA TTS 3B model and run inference:
./llama-cli --hf-repo srinivasbilla/llasa-3b-Q4_K_M-GGUF --hf-file llasa-3b-q4_k_m.gguf -p "The meaning of life and the universe is"
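Note that the -p prompt above is just the model card's generic example. LLaSA emits <|s_N|> speech tokens rather than audio, so the generated tokens still need to be decoded with XCodec2 (shown in the Python example below) before you get a waveform.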
Verify installation:
nvcc --version  # CUDA toolkit version (the cu118 PyTorch wheel above targets CUDA 11.8)
nvidia-smi      # Check GPU visibility and memory allocation
Launch the Gradio web interface to interact with the model:
python ./hf_app.py
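Gradio serves the interface at http://localhost:7860 by default; open that address in your browser once the model has finished loading.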
For efficient inference of longer texts:
pip install vllm
jupyter notebook llasa_vllm_longtext_inference.ipynb
Here’s a basic implementation to convert text into speech using LLaSA 3B:
from transformers import AutoTokenizer, AutoModelForCausalLM
from xcodec2.modeling_xcodec2 import XCodec2Model
import torch
import soundfile as sf

# Load the LLaSA language model and the XCodec2 speech codec
llasa_3b = 'HKUST-Audio/Llasa-3B'
tokenizer = AutoTokenizer.from_pretrained(llasa_3b)
model = AutoModelForCausalLM.from_pretrained(llasa_3b).eval().to('cuda')
codec_model = XCodec2Model.from_pretrained("HKUST-Audio/xcodec2").eval().cuda()

input_text = 'This is a test sentence for speech synthesis.'

def ids_to_speech_tokens(speech_ids):
    # Wrap raw codec ids as <|s_N|> token strings (used when prefixing reference audio)
    return [f"<|s_{speech_id}|>" for speech_id in speech_ids]

def extract_speech_ids(speech_tokens_str):
    # Recover integer codec ids from generated <|s_N|> token strings
    return [int(token[4:-2]) for token in speech_tokens_str
            if token.startswith('<|s_') and token.endswith('|>')]

with torch.no_grad():
    formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"
    chat = [
        {"role": "user", "content": "Convert the text to speech:" + formatted_text},
        {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"}
    ]
    tokenizer.padding_side = "left"
    input_ids = tokenizer.apply_chat_template(
        chat, tokenize=True, return_tensors='pt', continue_final_message=True
    ).to('cuda')
    speech_end_id = tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>')
    outputs = model.generate(
        input_ids, max_length=2048, eos_token_id=speech_end_id,
        do_sample=True, top_p=1, temperature=0.8
    )
    # Keep only the newly generated tokens: drop the prompt and the trailing end marker
    generated_ids = outputs[0][input_ids.shape[-1]:-1]
    speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    speech_ids = extract_speech_ids(speech_tokens)
    # XCodec2 expects codes shaped (batch, codebooks, time)
    codes = torch.tensor(speech_ids).cuda().unsqueeze(0).unsqueeze(0)
    gen_wav = codec_model.decode_code(codes)

sf.write("gen.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)  # XCodec2 outputs 16 kHz audio
print("Audio saved to gen.wav")
The same pipeline can be packaged as a reusable script. The version below wraps everything in a synthesize() function and loads the model in float16 to roughly halve VRAM usage:
# text_to_speech.py
import torch
import soundfile as sf
from transformers import AutoTokenizer, AutoModelForCausalLM
from xcodec2.modeling_xcodec2 import XCodec2Model

model_path = "HKUST-Audio/Llasa-3B"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             device_map="auto",
                                             torch_dtype=torch.float16)
codec_model = XCodec2Model.from_pretrained("HKUST-Audio/xcodec2").eval().cuda()

def extract_speech_ids(tokens):
    return [int(t[4:-2]) for t in tokens if t.startswith("<|s_") and t.endswith("|>")]

def synthesize(text, output_file="output.wav"):
    chat = [{"role": "user", "content": f"Convert the text to speech:<|TEXT_UNDERSTANDING_START|>{text}<|TEXT_UNDERSTANDING_END|>"},
            {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"}]
    input_ids = tokenizer.apply_chat_template(chat, tokenize=True, return_tensors="pt",
                                              continue_final_message=True).to(model.device)
    with torch.inference_mode():
        outputs = model.generate(input_ids, max_new_tokens=500,
                                 do_sample=True, temperature=0.7, top_p=0.95,
                                 eos_token_id=tokenizer.convert_tokens_to_ids("<|SPEECH_GENERATION_END|>"))
    # Decode generated <|s_N|> tokens to codec ids, then to a waveform
    tokens = tokenizer.batch_decode(outputs[0][input_ids.shape[-1]:-1], skip_special_tokens=True)
    codes = torch.tensor(extract_speech_ids(tokens)).cuda().unsqueeze(0).unsqueeze(0)
    wav = codec_model.decode_code(codes)
    sf.write(output_file, wav[0, 0, :].cpu().numpy(), 16000)  # XCodec2 is a 16 kHz codec
    print(f"Generated {output_file}")
Key Parameters to Tweak:
- temperature (0.1-1.0): lower values give more deterministic speech.
- top_p (0.5-0.95): controls vocabulary diversity during sampling.
- max_new_tokens: adjust based on text length; XCodec2 produces on the order of 50 tokens per second of audio, so 500 tokens is roughly 10 seconds of speech.

To run the script:
1. Save the code to a .py file (e.g., text_to_speech.py).
2. Activate the environment: conda activate llasa_tts.
3. Run python text_to_speech.py.
4. The generated audio is saved as gen.wav (or the output_file you pass to synthesize()).
Voice Cloning
# Clone voices with 5-second reference audio
from llasa.voice_cloning import VoiceCloneEngine
cloner = VoiceCloneEngine()
cloner.load_reference("reference.wav")
cloner.generate("Target text here", output_file="clone_output.wav")
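If the VoiceCloneEngine helper is not present in your checkout, the underlying mechanism (following the official Llasa-3B examples) is to encode the reference clip into speech tokens with XCodec2 and prepend them to the assistant turn. A sketch reusing the model, tokenizer, codec_model, and helper functions defined earlier:
# Sketch: cloning by prefixing reference speech tokens
import soundfile as sf
import torch

ref_wav, sr = sf.read("reference.wav")  # short 16 kHz mono reference clip
ref_wav = torch.from_numpy(ref_wav).float().unsqueeze(0)
with torch.no_grad():
    ref_codes = codec_model.encode_code(input_waveform=ref_wav)  # (1, 1, T) codec ids
prefix = "".join(ids_to_speech_tokens(ref_codes[0, 0, :].tolist()))
target = "Target text here"  # the official examples also prepend the reference clip's transcript
chat = [{"role": "user", "content": f"Convert the text to speech:<|TEXT_UNDERSTANDING_START|>{target}<|TEXT_UNDERSTANDING_END|>"},
        {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>" + prefix}]
# Generate and decode exactly as in the basic example; the model continues in the reference voice.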
Batch Processing
# Process multiple texts via CLI
python -m llasa.batch \
--input-file texts.txt \
--output-dir ./audio_output \
--batch-size 8 \
--precision fp16
Performance Optimization

| Technique | VRAM Usage | Speed (RTX 4090) | Quality |
|---|---|---|---|
| FP32 | 24GB | 1.0x | Lossless |
| FP16 | 12GB | 1.8x | Near-lossless |
| 4-bit Quant | 6GB | 2.5x | Good |
| 8-bit Quant | 8GB | 2.1x | Excellent |
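One way to apply 4-bit quantization in Transformers is via bitsandbytes (requires pip install bitsandbytes; a sketch, and quality may differ from the GGUF quantization used by llama.cpp):
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained("HKUST-Audio/Llasa-3B",
                                             device_map="auto",
                                             quantization_config=bnb_config)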
Troubleshooting

| Symptom | Solution |
|---|---|
| CUDA Out of Memory | Reduce batch size, enable 4-bit quantization |
| Audio Artifacts | Increase top_p to 0.9+, check sample rate consistency |
| Slow Inference | Enable Flash Attention 2, use llama.cpp optimizations |
| Chinese Text Failures | Ensure proper tokenization with tokenizer.apply_chat_template() |
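To enable Flash Attention 2 in Transformers (requires pip install flash-attn and an Ampere-or-newer GPU; a sketch):
model = AutoModelForCausalLM.from_pretrained("HKUST-Audio/Llasa-3B",
                                             torch_dtype=torch.float16,
                                             device_map="auto",
                                             attn_implementation="flash_attention_2")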
Docker Deployment
docker pull kjjk10/llasa-3b-long
docker run -p 7860:7860 --gpus all kjjk10/llasa-3b-long
# Access via http://localhost:7860
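To avoid re-downloading model weights on every container start, you can mount your Hugging Face cache into the container (a sketch; the cache path inside this particular image is an assumption):
docker run -p 7860:7860 --gpus all -v %USERPROFILE%\.cache\huggingface:/root/.cache/huggingface kjjk10/llasa-3b-long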
Google Colab Pro
!pip install -q llasa-tts
from llasa import RemoteEngine
engine = RemoteEngine(api_key="your_key")
engine.synthesize("Your text", voice="chinese-female")
For streamlined deployment, the kjjk10/llasa-3b-long Docker image is the quickest route.

By following this guide, you can set up and run LLaSA TTS 3B on Windows for text-to-speech conversion and voice cloning. Check system compatibility first, follow the installation steps in order, and lean on the troubleshooting table above if anything misbehaves. LLaSA 3B opens up new possibilities for high-quality, AI-driven speech synthesis.