Kimi-Audio is an open-source audio foundation model capable of speech recognition, audio generation, and conversational AI tasks. While primarily designed for Linux environments, this guide provides detailed instructions for Windows users to leverage its capabilities through multiple methods.
Method 1: Native Installation
Step 1: Set Up Development Environment
```powershell
# Install Chocolatey package manager
Set-ExecutionPolicy Bypass -Scope Process -Force
[System.Net.ServicePointManager]::SecurityProtocol = [System.Net.SecurityProtocolType]::Tls12
iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))

# Install essential packages
choco install git python310 cuda ffmpeg -y
```
Step 2: Configure Python Environment
```powershell
python -m venv kimi-env
.\kimi-env\Scripts\activate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```
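Before installing anything, it is worth confirming that the virtual environment is actually active; otherwise `pip` will install into the system interpreter. A quick stdlib check (a small helper for this guide, not part of Kimi-Audio):

```python
import sys

def in_venv() -> bool:
    """Return True when running inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment,
    # while sys.base_prefix points at the base interpreter.
    return sys.prefix != sys.base_prefix

if __name__ == "__main__":
    print("venv active:", in_venv())
```

If this prints `False`, re-run `.\kimi-env\Scripts\activate` before continuing.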
Step 3: Install Kimi-Audio
```powershell
git clone https://github.com/MoonshotAI/Kimi-Audio
cd Kimi-Audio
pip install -r requirements.txt
```
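After the requirements finish installing, a quick sanity check that the key dependencies resolved can save debugging time later. The helper below is a sketch; the module list is illustrative, not taken from the repository's requirements file:

```python
import importlib.util

def missing_packages(names):
    """Return the subset of top-level module names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Illustrative subset of modules the install steps above should provide.
print(missing_packages(["torch", "torchaudio", "soundfile"]))
```

An empty list means the core imports will succeed; any name printed here needs reinstalling before moving on.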
Method 2: Docker Installation
Step 1: Install Docker Desktop
Step 2: Build Kimi-Audio Image
```powershell
docker build -t kimi-audio:v0.1 .
```
Step 3: Run Container with GPU Passthrough
```powershell
docker run --gpus all -it kimi-audio:v0.1
```
Download the model weights from Hugging Face:
```powershell
huggingface-cli login
huggingface-cli download moonshotai/Kimi-Audio-7B-Instruct --local-dir ./models
```
Create a `.env` file:
```text
MODEL_PATH=./models/Kimi-Audio-7B-Instruct
DEVICE=cuda
PRECISION=bf16
```
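If you prefer not to add a dependency like `python-dotenv`, these settings can be read with a few lines of stdlib Python. This is a minimal sketch that handles only plain `KEY=VALUE` lines and `#` comments, not quoting or interpolation:

```python
from pathlib import Path

def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

# Load the config if the file exists, else fall back to an empty dict.
config = parse_env(Path(".env").read_text()) if Path(".env").exists() else {}
```

With the `.env` shown above, `config["MODEL_PATH"]` then yields the local model directory to pass to `KimiAudio`.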
```python
from kimia_infer.api.kimia import KimiAudio

model = KimiAudio(model_path="models/Kimi-Audio-7B-Instruct")
messages = [
    {"role": "user", "message_type": "text", "content": "Transcribe this audio:"},
    {"role": "user", "message_type": "audio", "content": "input.wav"}
]
_, transcript = model.generate(messages)
print(transcript)
```
```python
messages = [
    {"role": "user", "message_type": "text", "content": "Generate happy birthday music"}
]
audio, _ = model.generate(messages, output_type="both")
```
```python
messages = [
    {"role": "user", "message_type": "audio", "content": "user_query.wav"},
    {"role": "assistant", "message_type": "text", "content": "How can I help you?"}
]
response_audio, response_text = model.generate(messages)
```
Place a sample audio file named `asr_example.wav` in the working directory, then run the ASR example:
```python
import soundfile as sf
import torch
from kimia_infer.api.kimia import KimiAudio

# Load the model
model_id = "moonshotai/Kimi-Audio-7B-Instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = KimiAudio(model_path=model_id, load_detokenizer=True)
model.to(device)

# Define sampling parameters
sampling_params = {
    "audio_temperature": 0.8,
    "audio_top_k": 10,
    "text_temperature": 0.0,
    "text_top_k": 5,
    "audio_repetition_penalty": 1.0,
    "audio_repetition_window_size": 64,
    "text_repetition_penalty": 1.0,
    "text_repetition_window_size": 16,
}

# Example 1: Audio-to-Text (ASR)
asr_audio_path = "asr_example.wav"
messages_asr = [
    {"role": "user", "message_type": "text", "content": "Please transcribe the following audio:"},
    {"role": "user", "message_type": "audio", "content": asr_audio_path}
]

# Generate only text output
_, text_output = model.generate(messages_asr, **sampling_params, output_type="text")
print(">>> ASR Output Text: ", text_output)
```
Place a sample file named `qa_example.wav` in the working directory, then run the conversation example:
```python
# Example 2: Audio-to-Audio/Text Conversation
qa_audio_path = "qa_example.wav"
messages_conversation = [
    {"role": "user", "message_type": "audio", "content": qa_audio_path}
]

# Generate both audio and text output
wav_output, text_output = model.generate(messages_conversation, **sampling_params, output_type="both")

# Save the generated audio
output_audio_path = "output_audio.wav"
sf.write(output_audio_path, wav_output.detach().cpu().view(-1).numpy(), 24000)  # Assuming 24kHz output
print(f">>> Conversational Output Audio saved to: {output_audio_path}")
print(">>> Conversational Output Text: ", text_output)
```
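The `24000` passed to `sf.write` is an assumption about the detokenizer's output rate; you can confirm what actually landed on disk with Python's stdlib `wave` module:

```python
import wave

def wav_sample_rate(path: str) -> int:
    """Read the sample rate from a WAV file's header."""
    with wave.open(path, "rb") as f:
        return f.getframerate()

# e.g. wav_sample_rate("output_audio.wav") should match the rate passed to sf.write
```

If the reported rate disagrees with what your playback tool expects, the audio will sound pitch-shifted, so this one-line check is a cheap sanity test.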
```python
from torch.quantization import quantize_dynamic

# Dynamically quantize linear layers to int8 (note: dynamic quantization runs on CPU)
model = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
```
```python
# Enable gradient checkpointing to trade compute for memory
model.gradient_checkpointing_enable()

# Use mixed precision
from torch.cuda.amp import autocast

with autocast(dtype=torch.bfloat16):
    outputs = model.generate(...)
```
| Error | Solution |
|---|---|
| CUDA Out of Memory | Reduce batch size, enable gradient checkpointing |
| FFmpeg Not Found | Add FFmpeg to PATH: `choco install ffmpeg` |
| DLL Load Failed | Reinstall CUDA 12.1+ with the matching driver |
| Tokenizer Errors | Clear the Hugging Face cache: `rm -r ~/.cache/huggingface` |
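The cache path in the last row is the Linux/WSL default; on native Windows the Hugging Face cache lives under the user profile instead, and the `HF_HOME` environment variable overrides both. A small cross-platform helper to locate it (a sketch of the lookup order, not an official Hugging Face API):

```python
import os
from pathlib import Path

def hf_cache_dir() -> Path:
    """Resolve the Hugging Face cache directory; HF_HOME wins if set."""
    hf_home = os.environ.get("HF_HOME")
    if hf_home:
        return Path(hf_home)
    # Default location: ~/.cache/huggingface (under the user profile on Windows)
    return Path.home() / ".cache" / "huggingface"

print(hf_cache_dir())
```

Deleting this directory forces a fresh download of tokenizer and model files, which resolves most corrupted-cache errors.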
```python
# KimiTrainer is assumed to live alongside AudioDataset
from kimia_train.core import AudioDataset, KimiTrainer

dataset = AudioDataset(
    audio_dir="custom_audio",
    sampling_rate=24000,
    max_duration=30
)
trainer = KimiTrainer(
    model=model,
    train_dataset=dataset,
    batch_size=2,
    learning_rate=5e-5
)
trainer.train()
```
```python
from fastapi import FastAPI, UploadFile

app = FastAPI()

@app.post("/transcribe")
async def transcribe(file: UploadFile):
    audio = await file.read()
    _, text = model.generate([{"role": "user", "message_type": "audio", "content": audio}])
    return {"text": text}
```
| Feature | Kimi-Audio | Whisper | AudioCraft |
|---|---|---|---|
| ASR Accuracy | 98.2% | 97.8% | N/A |
| Audio Generation | Yes | No | Yes |
| Multimodal Support | Yes | No | Limited |
| VRAM Requirements | 24GB+ | 8GB | 16GB |
| Conversation Ability | Yes | No | No |
Kimi-Audio is a powerful open-source model for audio tasks, including automatic speech recognition and audio-to-text conversation. By following the steps outlined above, you can run Kimi-Audio on Windows using either a native Python environment or Docker.