Kimi-Audio is an open-source, universal audio foundation model capable of audio understanding, generation, and processing. It is designed for audio-to-text (ASR) and audio-to-audio/text conversation tasks. This guide adapts the workflow for macOS.

System Requirements and Compatibility
This guide uses `pip` for dependency management. Install Homebrew first if you don't already have it:

```bash
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```
Clone the repository and enter it:

```bash
git clone https://github.com/[Kimi-Audio-Repo].git  # Replace with the actual repo URL
cd Kimi-Audio
```
Create and activate a virtual environment, then install the dependencies:

```bash
python3 -m venv kimi-env
source kimi-env/bin/activate
pip install -r requirements.txt  # Adapt requirements for macOS compatibility
```
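Before going further, it is worth confirming that the core dependencies import cleanly. A minimal sanity check, assuming the requirements include `torch` and `soundfile`:

```python
# Sanity check: confirm the core audio dependencies import and report their versions.
import torch
import soundfile as sf

print("torch:", torch.__version__)
print("soundfile:", sf.__version__)
```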
You can load the Kimi-Audio model using the `KimiAudio` class from the `kimia_infer.api.kimia` module. Ensure you have the model path or the model ID from the Hugging Face Hub, then define the sampling parameters for audio and text generation.
Install `torch` and `torchaudio`; recent PyTorch builds include the MPS backend for Metal Performance Shaders acceleration on Apple Silicon. `onnxruntime` is an optional extra for CPU/GPU inference:

```bash
pip install torch torchaudio
pip install onnxruntime  # optional, for CPU/GPU inference on Apple Silicon
```
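To confirm that the Metal backend is actually usable, here is a quick check (standard PyTorch API, nothing Kimi-specific):

```python
# Verify that PyTorch can see the Metal Performance Shaders (MPS) backend.
import torch

print("MPS available:", torch.backends.mps.is_available())
```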
Define the sampling parameters:

```python
sampling_params = {
    "audio_temperature": 0.8,
    "audio_top_k": 10,
    "text_temperature": 0.0,
    "text_top_k": 5,
    "audio_repetition_penalty": 1.0,
    "audio_repetition_window_size": 64,
    "text_repetition_penalty": 1.0,
    "text_repetition_window_size": 16,
}
```
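Here the `*_temperature` values control randomness (0.0 makes text decoding deterministic) and `*_top_k` limits sampling to the k most likely tokens. If you want more varied speech output, a hypothetical variation might raise the audio values:

```python
# Hypothetical tweak: more diverse audio sampling; text decoding stays deterministic.
sampling_params_creative = {**sampling_params, "audio_temperature": 1.0, "audio_top_k": 20}
```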
Load the model:

```python
from kimia_infer.api.kimia import KimiAudio

model_path = "moonshotai/Kimi-Audio-7B-Instruct"
model = KimiAudio(model_path=model_path, load_detokenizer=True)
```
Place the audio you want to transcribe at `test_audios/asr_example.wav`. Transcribe the audio:

```python
import soundfile as sf

messages_asr = [
    {"role": "user", "message_type": "text", "content": "Please transcribe the following audio:"},
    {"role": "user", "message_type": "audio", "content": "test_audios/asr_example.wav"}
]

_, text_output = model.generate(messages_asr, **sampling_params, output_type="text")
print(">>> ASR Output Text: ", text_output)
# Expected output: "This is not a farewell, this is the end of one chapter and the beginning of a new one."
```
For audio-to-audio/text conversation, place the question audio at `test_audios/qa_example.wav`. Generate audio and text output:

```python
messages_conversation = [
    {"role": "user", "message_type": "audio", "content": "test_audios/qa_example.wav"}
]

wav_output, text_output = model.generate(messages_conversation, **sampling_params, output_type="both")

output_audio_path = "output_audio.wav"
sf.write(output_audio_path, wav_output.detach().cpu().view(-1).numpy(), 24000)  # assuming 24 kHz output
print(f">>> Conversational Output Audio saved to: {output_audio_path}")
print(">>> Conversational Output Text: ", text_output)  # Expected output: "A."
```
You can also run Kimi-Audio from the command line:

```bash
python kimi_audio.py --input "What is happiness?" --output_format both --device mps --precision fp16
```

Use `--device mps` for Metal acceleration on Apple Silicon and `--precision fp16` to reduce memory usage. If a GPU is available, select the MPS device in Python:

```python
import torch

device = torch.device("mps")
```
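Since older Macs or misconfigured builds may lack MPS support, a defensive variant falls back to the CPU:

```python
# Fall back to CPU when the MPS backend is unavailable.
import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
print("Using device:", device)
```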
Performance tips and troubleshooting:

- Set `--batch_size` to 1-2 for long-form audio.
- `swapoff`: disable swap to prevent SSD wear (only for systems with 32GB+ RAM).
- Run `xcode-select --reset` to resolve `coreaudiod` errors.
- Explore `coremltools` and Apple's ML Compute framework for on-device training (a toy conversion sketch follows this list).
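As a starting point for Core ML experiments, here is a toy conversion with `coremltools`. It traces a trivial module rather than the full Kimi-Audio model and only illustrates the workflow:

```python
# Toy example: trace a tiny PyTorch module and convert it with coremltools.
import torch
import coremltools as ct

class Toy(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x)

traced = torch.jit.trace(Toy().eval(), torch.rand(1, 16))
mlmodel = ct.convert(traced, inputs=[ct.TensorType(shape=(1, 16))], convert_to="mlprogram")
mlmodel.save("toy.mlpackage")
```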
For a minimal Python quick start:

```python
from kimi_audio import KimiModel

model = KimiModel(device="mps")
model.generate("Explain quantum computing in 200 words.")
```
Kimi-Audio provides a powerful and flexible solution for audio-to-text and audio-to-audio/text conversation tasks. By following the steps outlined above, you can easily set up and run Kimi-Audio on your Mac.
The model's capabilities make it suitable for a wide range of applications, from transcription services to interactive voice assistants. For more detailed information and additional examples, refer to the Kimi-Audio GitHub repository.
Need expert guidance? Connect with a top Codersera professional today!