LLaSA-3B brings emotional nuance and bilingual (English/Chinese) capability to open-source text-to-speech. Built on Meta's LLaMA framework, the model pairs with the XCodec2 codec to produce clean, natural-sounding 16kHz audio. It is well suited to developers building voice assistants, audiobook tools, or multilingual content platforms.
Before installation, verify your Windows setup meets these specs:
| Component | Minimum | Recommended |
|---|---|---|
| RAM | 16GB | 32GB DDR4 |
| Storage | 50GB HDD | 100GB NVMe SSD |
| GPU | NVIDIA GTX 1660 (6GB) | RTX 3090 (24GB) |
| Python | 3.8 | 3.9 |
Critical Notes:

- Confirm your GPU driver is active by running `nvidia-smi` in Command Prompt.
- XCodec2 is required for decoding speech tokens into audio; it is installed as part of the steps below (`pip install xcodec2==0.1.3`).
```bash
# Create a dedicated Conda environment
conda create -n llasa_tts python=3.9 -y
conda activate llasa_tts

# Install core dependencies
pip install torch==2.0.1+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install xcodec2==0.1.3 transformers==4.31.0
```
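Before moving on, it is worth confirming that this PyTorch build can actually see your GPU. A quick sanity check (nothing here is LLaSA-specific):

```python
import torch

print(torch.__version__)              # expect 2.0.1+cu118
print(torch.cuda.is_available())      # should print True
print(torch.cuda.get_device_name(0))  # your NVIDIA GPU
```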
Pro Tip: Use Windows Subsystem for Linux (WSL2) for smoother CLI operations.
```bash
git clone https://github.com/nivibilla/local-llasa-tts.git
cd local-llasa-tts

# Download the 4-bit quantized model (~3.8GB)
wget https://huggingface.co/srinivasbilla/llasa-3b-Q4_K_M-GGUF/resolve/main/llasa-3b-q4_k_m.gguf
```
Alternatively, for inference with `llama.cpp`:
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
LLAMA_CURL=1 make LLAMA_CUDA=1
```
Download a GGUF version of the LLaSA TTS 3B model and run inference:
```bash
./llama-cli --hf-repo srinivasbilla/llasa-3b-Q4_K_M-GGUF --hf-file llasa-3b-q4_k_m.gguf -p "The meaning of life and the universe is"
```
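If you would rather keep the model loaded and send requests over HTTP instead of one-shot CLI runs, recent `llama.cpp` builds also ship a server binary that accepts the same Hugging Face flags (check your build's `--help` if these options differ):

```bash
# Serve the GGUF model over an HTTP API (default port 8080)
./llama-server --hf-repo srinivasbilla/llasa-3b-Q4_K_M-GGUF --hf-file llasa-3b-q4_k_m.gguf --port 8080
```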
Verify the installation:

```bash
nvcc --version  # should report CUDA 11.8 or newer (matching the cu118 wheels above)
nvidia-smi      # check GPU memory allocation
```
Launch the Gradio web interface to interact with the model:
```bash
python ./hf_app.py
```
For efficient inference on longer texts, install vLLM and use the bundled notebook:

```bash
pip install vllm
jupyter notebook llasa_vllm_longtext_inference.ipynb
```
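The notebook handles chunking and batching for you. To illustrate the idea behind long-text inference, here is a minimal sketch that splits text into sentences and concatenates the per-chunk audio; `synthesize_chunk` is a hypothetical helper standing in for whatever single-chunk pipeline you use (such as the basic example below):

```python
import re
import numpy as np
import soundfile as sf

SAMPLE_RATE = 16000  # XCodec2's output rate

def synthesize_long(text, synthesize_chunk, output_file="long_output.wav"):
    # Naive sentence split; the vLLM notebook batches chunks far more efficiently
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # synthesize_chunk(sentence) is assumed to return a 1-D numpy waveform
    chunks = [synthesize_chunk(s) for s in sentences]
    sf.write(output_file, np.concatenate(chunks), SAMPLE_RATE)
```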
Here’s a basic implementation to convert text into speech using LLaSA 3B:
```python
import torch
import soundfile as sf
from transformers import AutoTokenizer, AutoModelForCausalLM
from xcodec2.modeling_xcodec2 import XCodec2Model

# Load the LLaSA language model and the XCodec2 speech decoder
llasa_3b = 'HKUST-Audio/Llasa-3B'
tokenizer = AutoTokenizer.from_pretrained(llasa_3b)
model = AutoModelForCausalLM.from_pretrained(llasa_3b).eval().to('cuda')
codec_model = XCodec2Model.from_pretrained("HKUST-Audio/xcodec2").eval().cuda()

input_text = 'This is a test sentence for speech synthesis.'

def ids_to_speech_tokens(speech_ids):
    # Map integer codec ids to LLaSA's <|s_N|> token strings
    return [f"<|s_{speech_id}|>" for speech_id in speech_ids]

def extract_speech_ids(speech_tokens_str):
    # Recover integer codec ids from generated <|s_N|> token strings
    return [int(token[4:-2]) for token in speech_tokens_str
            if token.startswith('<|s_') and token.endswith('|>')]

with torch.no_grad():
    formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"
    chat = [
        {"role": "user", "content": "Convert the text to speech:" + formatted_text},
        {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"}
    ]
    tokenizer.padding_side = "left"
    input_ids = tokenizer.apply_chat_template(
        chat, tokenize=True, return_tensors='pt', continue_final_message=True
    ).to('cuda')
    speech_end_id = tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>')
    outputs = model.generate(
        input_ids, max_length=2048, eos_token_id=speech_end_id,
        do_sample=True, top_p=1, temperature=0.8
    )
    # Keep only the newly generated speech tokens (drop the prompt)
    generated_ids = outputs[0][input_ids.shape[-1]:]
    speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
    speech_ids = extract_speech_ids(speech_tokens)
    codes = torch.tensor(speech_ids).cuda().unsqueeze(0).unsqueeze(0)
    gen_wav = codec_model.decode_code(codes)  # shape (1, 1, samples)

# XCodec2 decodes to 16 kHz audio
sf.write("gen.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)
print("Audio saved to gen.wav")
```
For reuse, the same pipeline can be wrapped in a standalone script with a `synthesize()` helper:

```python
# text_to_speech.py — standalone wrapper around the pipeline above
import torch
import torchaudio
from transformers import AutoTokenizer, AutoModelForCausalLM
from xcodec2.modeling_xcodec2 import XCodec2Model

model_path = "HKUST-Audio/Llasa-3B"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             device_map="auto",
                                             torch_dtype=torch.float16)
codec_model = XCodec2Model.from_pretrained("HKUST-Audio/xcodec2").eval().cuda()

def synthesize(text, output_file="output.wav"):
    # Same prompt format as the basic example above
    formatted = f"<|TEXT_UNDERSTANDING_START|>{text}<|TEXT_UNDERSTANDING_END|>"
    chat = [{"role": "user", "content": "Convert the text to speech:" + formatted},
            {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"}]
    input_ids = tokenizer.apply_chat_template(
        chat, tokenize=True, return_tensors="pt", continue_final_message=True
    ).to("cuda")
    with torch.inference_mode():
        outputs = model.generate(
            input_ids,
            max_new_tokens=500,
            do_sample=True,
            temperature=0.7,
            top_p=0.95,
            eos_token_id=tokenizer.convert_tokens_to_ids("<|SPEECH_GENERATION_END|>"))
    # Decode the generated <|s_N|> tokens back to a waveform with XCodec2
    tokens = tokenizer.batch_decode(outputs[0][input_ids.shape[-1]:],
                                    skip_special_tokens=True)
    ids = [int(t[4:-2]) for t in tokens if t.startswith("<|s_") and t.endswith("|>")]
    audio = codec_model.decode_code(torch.tensor(ids).cuda().unsqueeze(0).unsqueeze(0))
    torchaudio.save(output_file, audio[0].cpu(), 16000)  # XCodec2 outputs 16 kHz
    print(f"Generated {output_file}")
```
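Calling the helper is then a one-liner, e.g. appended to the bottom of the script (the sample text is arbitrary):

```python
if __name__ == "__main__":
    synthesize("Hello from LLaSA running locally on Windows.", "hello.wav")
```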
Key Parameters to Tweak:

- `temperature` (0.1-1.0): lower values produce more deterministic speech.
- `top_p` (0.5-0.95): controls vocabulary diversity.
- `max_new_tokens`: adjust to the text length; XCodec2 emits roughly 50 speech tokens per second of audio.

To run the script:

1. Save the code in a `.py` file (e.g., `text_to_speech.py`).
2. Activate the environment: `conda activate llasa_tts`.
3. Run `python text_to_speech.py`.
4. The synthesized audio is written to the output WAV file (`gen.wav` in the basic example).
```python
# Clone voices with a 5-second reference audio clip
from llasa.voice_cloning import VoiceCloneEngine

cloner = VoiceCloneEngine()
cloner.load_reference("reference.wav")
cloner.generate("Target text here", output_file="clone_output.wav")
```
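Under the hood, LLaSA-style cloning works by encoding a short reference clip into XCodec2 speech tokens and prepending them (plus the reference transcript) to the prompt, so generation continues in the same voice. A minimal sketch, assuming the `tokenizer`, `model`, and `codec_model` from the basic example are already loaded and that `reference.wav` is a ~5 s mono 16 kHz clip:

```python
import torch
import soundfile as sf

# Encode the reference clip into XCodec2 speech tokens
prompt_wav, sr = sf.read("reference.wav")
prompt_wav = torch.from_numpy(prompt_wav).float().unsqueeze(0)
with torch.no_grad():
    vq_codes = codec_model.encode_code(input_waveform=prompt_wav)

# Prepend the reference tokens so generation continues in the same voice
prefix = "".join(f"<|s_{int(i)}|>" for i in vq_codes[0, 0, :])
ref_transcript = "Transcript of the reference audio."  # you must supply this
target_text = "Target text here"
formatted = (f"<|TEXT_UNDERSTANDING_START|>{ref_transcript} "
             f"{target_text}<|TEXT_UNDERSTANDING_END|>")
chat = [
    {"role": "user", "content": "Convert the text to speech:" + formatted},
    {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>" + prefix},
]
# ...then generate and decode exactly as in the basic example above.
```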
```bash
# Process multiple texts via CLI
python -m llasa.batch \
  --input-file texts.txt \
  --output-dir ./audio_output \
  --batch-size 8 \
  --precision fp16
```
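If the `llasa.batch` entry point isn't available in your clone of the repo, the same effect is a few lines of Python around the `synthesize()` helper defined earlier (a sketch, assuming `texts.txt` holds one sentence per line):

```python
from pathlib import Path

out_dir = Path("audio_output")
out_dir.mkdir(exist_ok=True)
for i, line in enumerate(Path("texts.txt").read_text(encoding="utf-8").splitlines()):
    if line.strip():
        synthesize(line.strip(), output_file=str(out_dir / f"sample_{i:04d}.wav"))
```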
| Technique | VRAM Usage | Speed (RTX 4090) | Quality |
|---|---|---|---|
| FP32 | 24GB | 1.0x | Lossless |
| FP16 | 12GB | 1.8x | Near-lossless |
| 8-bit Quant | 8GB | 2.1x | Excellent |
| 4-bit Quant | 6GB | 2.5x | Good |
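To try the 4-bit row of this table with the full Transformers checkpoint (rather than the GGUF file), bitsandbytes quantization is one option. A minimal sketch, assuming `pip install bitsandbytes accelerate`:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with FP16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)
model = AutoModelForCausalLM.from_pretrained(
    "HKUST-Audio/Llasa-3B",
    quantization_config=bnb_config,
    device_map="auto",
)
```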
| Symptom | Solution |
|---|---|
| CUDA Out of Memory | Reduce batch size, enable 4-bit quantization |
| Audio Artifacts | Increase `top_p` to 0.9+, check sample rate consistency |
| Slow Inference | Enable Flash Attention 2, use llama.cpp optimizations |
| Chinese Text Failures | Ensure proper tokenization with `tokenizer.apply_chat_template()` |
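For the Slow Inference row: if your GPU is Ampere or newer and the `flash-attn` package is installed, Transformers can load the model with Flash Attention 2 enabled. A sketch; fall back to the default attention if the package is missing:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "HKUST-Audio/Llasa-3B",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    device_map="auto",
)
```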
Docker Deployment
```bash
docker pull kjjk10/llasa-3b-long
docker run -p 7860:7860 --gpus all kjjk10/llasa-3b-long
# Access via http://localhost:7860
```
Google Colab Pro
```python
!pip install -q llasa-tts
from llasa import RemoteEngine

engine = RemoteEngine(api_key="your_key")
engine.synthesize("Your text", voice="chinese-female")
```
Tip: the Docker image `kjjk10/llasa-3b-long` offers the most streamlined deployment.

By following this guide, you can set up and run LLaSA TTS 3B on Windows for text-to-speech conversion and voice cloning. Verify system compatibility first, follow the installation steps in order, and lean on the troubleshooting table if something misbehaves. LLaSA 3B opens up new possibilities for high-quality AI-driven speech synthesis.