Kimi-Audio is Moonshot AI's state-of-the-art 7B-parameter audio foundation model, capable of speech recognition, audio generation, and multimodal conversations.
```bash
# Update system packages
sudo apt update && sudo apt full-upgrade -y

# Install essential tools
sudo apt install -y git-lfs build-essential ninja-build ffmpeg
```
```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /"
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-2
```
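After installation, verify the toolkit with `nvcc --version` and the driver with `nvidia-smi`; depending on your shell setup, you may first need to add `/usr/local/cuda-12.2/bin` to your `PATH`.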
```bash
# Install Miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

# Create virtual environment
conda create -n kimi-audio python=3.10 -y
conda activate kimi-audio
```
```bash
# Clone main repository
git clone https://github.com/MoonshotAI/Kimi-Audio.git
cd Kimi-Audio

# Initialize submodules
git submodule update --init --recursive

# Manual fix for GLM tokenizer (critical step)[6]
git clone https://github.com/THUDM/GLM-4-Voice.git
cp -r GLM-4-Voice/ glm4_voice/
mv glm4_voice/ tokenizers/GLM4/
```
```bash
# Install PyTorch with CUDA support (the cu121 wheels run fine on a CUDA 12.2 driver;
# PyTorch does not publish a separate cu122 index)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install project requirements
pip install -r requirements.txt

# Additional audio processing libraries
pip install soundfile librosa==0.10.1 torchaudio==2.1.0
```
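Before downloading multi-gigabyte weights, it is worth confirming that PyTorch can actually see the GPU; a quick sanity check:

```python
import torch

# Verify that a CUDA build of PyTorch is installed and a GPU is visible
print(torch.__version__)               # e.g. 2.1.0+cu121
print(torch.cuda.is_available())       # should print True
print(torch.cuda.get_device_name(0))   # your GPU model
```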
```bash
# Install Hugging Face Hub tools
pip install huggingface_hub

# Download model weights
huggingface-cli download moonshotai/Kimi-Audio-7B-Instruct --local-dir ./models/7B-Instruct
```
Create `config.yaml`:

```yaml
model_path: "./models/7B-Instruct"
device: "cuda:0"
audio_sample_rate: 24000
text_tokenizer: "Qwen-7B"
audio_tokenizer: "GLM4-Voice"
max_audio_length: 600
```
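To catch indentation or quoting mistakes early, the file can be parsed with PyYAML (a dependency of most ML stacks) before launching the model; a minimal check:

```python
import yaml

# Load and echo the config to surface YAML syntax errors before model startup
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)
print(cfg["model_path"], cfg["device"])
```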
```python
from kimia_infer.api.kimia import KimiAudio
import soundfile as sf

model = KimiAudio(config_path="config.yaml")

messages = [
    {"role": "user", "message_type": "text", "content": "Transcribe this:"},
    {"role": "user", "message_type": "audio", "content": "test.wav"}
]

_, transcription = model.generate(messages, output_type="text")
print(f"Transcription: {transcription}")
```
```python
messages = [
    {"role": "user", "message_type": "audio", "content": "question.wav"}
]

audio_output, text_output = model.generate(
    messages,
    audio_temperature=0.7,
    text_temperature=0.3,
    output_type="both"
)

# Save the generated speech at the model's 24 kHz sample rate
sf.write("response.wav", audio_output.cpu().numpy(), 24000)
```
```bash
# Enable FlashAttention
export USE_FLASH_ATTENTION=1

# Set memory-efficient attention
export MAX_JOBS=4
pip install xformers==0.0.23
```
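It is worth confirming the xformers build installed cleanly before relying on it; a minimal smoke test (assuming the pip install above succeeded and a CUDA GPU is present):

```python
import torch
import xformers
import xformers.ops

# Run memory-efficient attention on a tiny fp16 tensor to confirm the kernels load
q = torch.randn(1, 8, 16, 64, device="cuda", dtype=torch.float16)
out = xformers.ops.memory_efficient_attention(q, q, q)
print(xformers.__version__, out.shape)
```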
Create `batch_process.py`:

```python
import glob

from tqdm import tqdm
from kimia_infer.api.kimia import KimiAudio

# Load the model once and reuse it across the whole dataset
model = KimiAudio(config_path="config.yaml")

audio_files = glob.glob("dataset/*.wav")
for file in tqdm(audio_files):
    messages = [
        {"role": "user", "message_type": "text", "content": "Describe this audio:"},
        {"role": "user", "message_type": "audio", "content": file}
    ]
    _, description = model.generate(messages)
    with open(f"{file}.txt", "w") as f:
        f.write(description)
```
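Run it from the repository root with `python batch_process.py`; each description lands next to its source file as `<name>.wav.txt`.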
Common issues and fixes:

- Out-of-memory errors: enable gradient checkpointing with `model.enable_gradient_checkpointing()`, or reduce `max_audio_length` in the config.
- Broken GLM tokenizer: remove and re-clone it:

```bash
rm -rf tokenizers/GLM4
git clone https://github.com/THUDM/GLM-4-Voice.git tokenizers/GLM4
```

- Repetitive or unstable audio output: tighten the sampling parameters:

```python
sampling_params = {
    "audio_prior_temperature": 0.5,
    "audio_top_k": 50,
    "audio_repetition_penalty": 1.2
}
```
| Task | WER | BLEU | MCD |
|---|---|---|---|
| ASR (LibriSpeech) | 2.1% | - | - |
| Audio Captioning | - | 42.5 | - |
| Speech Emotion | 85.3% Acc | - | - |
| Text-to-Speech | - | - | 3.8 |
```bash
# Convert audio to 24kHz mono
find ./custom_data -name "*.wav" -exec ffmpeg -i {} -ar 24000 -ac 1 {}.converted.wav \;

# Create manifest.json
python tools/create_manifest.py --input_dir ./custom_data --output manifest.json
```
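The exact schema is defined by `tools/create_manifest.py` in the repository; if you ever need to assemble a manifest by hand, a plausible shape (the field names here are illustrative assumptions, not the confirmed format) would be:

```python
import json

# Hypothetical manifest entries: adjust the field names to match what
# tools/create_manifest.py actually emits in your checkout.
entries = [
    {"audio": "custom_data/sample1.wav.converted.wav", "text": "transcript one"},
    {"audio": "custom_data/sample2.wav.converted.wav", "text": "transcript two"},
]
with open("manifest.json", "w") as f:
    json.dump(entries, f, indent=2)
```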
```bash
accelerate launch train.py \
  --model_name_or_path ./models/7B-Instruct \
  --train_files manifest.json \
  --output_dir ./finetuned_model \
  --per_device_train_batch_size 2 \
  --learning_rate 1e-5 \
  --num_train_epochs 3
```
```dockerfile
FROM nvidia/cuda:12.2.0-devel-ubuntu22.04
# Ubuntu 22.04 ships no pip (or `python` alias) by default, so install both explicitly
RUN apt update && apt install -y git-lfs python3.10 python3-pip
COPY . /app
WORKDIR /app
RUN pip3 install -r requirements.txt
CMD ["python3", "api_server.py"]
```
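Build and run it with `docker build -t kimi-audio .` followed by `docker run --gpus all -p 8000:8000 kimi-audio`; the `--gpus all` flag assumes the NVIDIA Container Toolkit is installed on the host.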
```python
import uvicorn
from fastapi import FastAPI, UploadFile
from kimia_infer.api.kimia import KimiAudio

app = FastAPI()
model = KimiAudio(config_path="config.yaml")

@app.post("/transcribe")
async def transcribe(file: UploadFile):
    # Persist the upload to disk, since the model takes a file path
    with open("temp.wav", "wb") as f:
        f.write(await file.read())
    messages = [
        {"role": "user", "message_type": "audio", "content": "temp.wav"}
    ]
    _, text = model.generate(messages)
    return {"transcription": text}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```
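Once the server is up, any HTTP client can exercise the endpoint; for example, with the `requests` library:

```python
import requests

# Send a local WAV file to the /transcribe endpoint and print the result
with open("test.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/transcribe",
        files={"file": ("test.wav", f, "audio/wav")},
    )
print(resp.json()["transcription"])
```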
To keep an installation current and reproducible:

```bash
# Pull the latest code
git pull origin main

# Snapshot the environment
conda env export > environment.yml
pip freeze > requirements.txt
```

For long-running deployments, route diagnostics to a log file:

```python
import logging

logging.basicConfig(
    filename='kimi_audio.log',
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
```
This guide covers installation, configuration, optimization, and deployment of Kimi-Audio on Ubuntu systems. For the complete 5,000-word version with extended troubleshooting scenarios, advanced optimization techniques, and production deployment checklists, refer to the official documentation and technical report.
For the full technical specifications and architecture details, consult the Kimi-Audio Technical Report and GitHub repository.