Last updated April 2026 — refreshed for current Kimi-Audio releases, real GitHub URLs, and verified Apple Silicon caveats.
Kimi-Audio is Moonshot AI's open-source audio foundation model: one 7B-parameter network that handles speech recognition, audio understanding, audio question answering, audio captioning, and end-to-end speech-to-speech conversation. This guide is the practical, no-fluff walkthrough for getting it running on a Mac in 2026 — including the rough edges around CUDA-only kernels, MPS fallback, and what to do when a 23 GB VRAM target meets a 16 GB MacBook.
What changed in 2026
- Moonshot AI shipped Kimi-Audio-7B-Instruct and the inference code on April 25, 2025; pretrained base weights followed on April 27, 2025, and a fine-tuning example on May 29, 2025. The repo URL is real: github.com/MoonshotAI/Kimi-Audio. The earlier version of this post had a placeholder; that's now fixed.
- The official model card requires ~23 GB of GPU VRAM at BF16. No Mac has that on a single GPU. On Apple Silicon you are running on Unified Memory, so a 32 GB or 64 GB M-series machine is the realistic baseline; 16 GB will only work for short clips at reduced precision and with heavy CPU fallback.
- The repo expects flash-attn, which has no Apple Silicon build. The right pattern in 2026 is `attn_implementation="sdpa"` plus `PYTORCH_ENABLE_MPS_FALLBACK=1`, not the speculative `torch-mps` package the old guide mentioned.
- The BigVGAN-based detokenizer uses custom CUDA kernels. On macOS, set `load_detokenizer=False` for ASR / audio-understanding workflows. Speech generation realistically needs a Linux box or a remote GPU.
- Moonshot's broader Kimi family moved fast in 2026 (Kimi K2 in July 2025, Kimi K2.5 in January 2026, Kimi K2.6 in April 2026), but Kimi-Audio has not yet had a v2. The 7B-Instruct checkpoint released in April 2025 is still current as of April 2026.
- The earlier post's "Hypothetical GUI" section has been removed: there is no official Moonshot GUI for Kimi-Audio. Community wrappers exist on Hugging Face Spaces and Replicate; we link to a real one below.
TL;DR
| Question | Short answer |
|---|---|
| Can I run Kimi-Audio on a Mac at all? | Yes for ASR and audio understanding on M-series with 32 GB+ unified memory. Speech generation is impractical without an NVIDIA GPU. |
| Minimum hardware? | M1 Pro / M2 / M3 / M4, macOS 13+, 32 GB unified memory recommended. 16 GB works only for short clips with aggressive offloading. |
| What about Intel Macs? | Technically possible (CPU-only PyTorch), but inference is too slow to be useful. Skip. |
| Is the model still current? | Yes — Kimi-Audio-7B-Instruct (April 25, 2025 release) remains the latest checkpoint as of April 2026. |
| License? | Code: MIT. Weights inherit Apache 2.0 from the Qwen 2.5-7B base. |
What Kimi-Audio actually is
Per the Kimi-Audio Technical Report (arXiv:2504.18425), the model is initialized from Qwen 2.5-7B, then continually pre-trained on more than 13 million hours of speech, sound, and music. It uses a 12.5 Hz audio tokenizer, treats continuous audio features as input and discrete audio + text tokens as output, and pairs with a chunk-wise streaming detokenizer based on flow matching for speech generation.
Reported benchmarks from the technical report and the GitHub README:
| Benchmark | Kimi-Audio-7B | Notable comparison |
|---|---|---|
| LibriSpeech test-clean (WER) | 1.28% | Qwen2-Audio-base 1.74%, Qwen2.5-Omni 2.37% |
| LibriSpeech test-other (WER) | 2.42% | — |
| AISHELL-1 (WER) | 0.60% | Qwen2.5-Omni 1.13% |
| AISHELL-2 (WER) | 2.56% | — |
| FLEURS Chinese (WER) | 2.69% | — |
| VocalSound (accuracy) | 94.85% | — |
| MELD (emotion, accuracy) | 59.13% | — |
| OpenAudioBench AlpacaEval | 75.73 | — |
| OpenAudioBench Llama Questions | 79.33 | — |
| OpenAudioBench TriviaQA | 62.10 | — |
Bottom line: on standard ASR benchmarks Kimi-Audio is competitive with or better than Whisper-class models, and beats Qwen2-Audio and Qwen2.5-Omni head-to-head. It also handles tasks Whisper can't (audio Q&A, captioning, end-to-end speech conversation) in one model.
Hardware reality check on Mac
The official Hugging Face Kimi-Audio-7B-Instruct model card calls out ~23 GB of GPU VRAM at BF16 for full inference. On Apple Silicon, "VRAM" is unified memory shared with the OS and apps. Map it like this:
| Mac | Practical for ASR / understanding? | Practical for speech generation? |
|---|---|---|
| M1 / M2 / M3 / M4 base, 8 GB | No | No |
| M-series, 16 GB | Short clips only, expect swap | No |
| M-series Pro/Max, 32 GB | Yes (recommended floor) | No; the detokenizer needs CUDA kernels |
| M2/M3/M4 Max, 64 GB+ | Comfortable | Still gated by detokenizer / flash-attn |
| M3/M4 Ultra, 128–512 GB (Mac Studio) | Comfortable | Best Apple option, but expect functional gaps vs Linux+CUDA |
| Intel Mac (any) | Too slow to be useful | No |
If your goal is production speech-to-speech with Kimi-Audio, run it on a Linux machine with a single 24 GB+ NVIDIA GPU (RTX 3090, 4090, A10, L4, A100). The Mac path is for experimentation, ASR pipelines, and audio-understanding evaluation.
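The mapping in the table above can be roughed out as arithmetic. The sketch below is only an illustration: the 23 GB figure is the model-card number quoted earlier, while the 8 GB reserved for macOS and other apps is an assumed placeholder, not a measured value, and the function names are hypothetical.

```python
# Rough fit check: does the BF16 target fit in unified memory?
MODEL_CARD_TARGET_GB = 23.0  # BF16 figure quoted from the model card


def usable_memory_gb(unified_gb: float, reserved_for_os_gb: float = 8.0) -> float:
    """Unified memory left for the model after an ASSUMED OS/app reserve."""
    return max(unified_gb - reserved_for_os_gb, 0.0)


def fits_full_precision(unified_gb: float, reserved_for_os_gb: float = 8.0) -> bool:
    """True if the assumed usable memory covers the model-card target."""
    return usable_memory_gb(unified_gb, reserved_for_os_gb) >= MODEL_CARD_TARGET_GB


for gb in (8, 16, 32, 64):
    verdict = "fits" if fits_full_precision(gb) else "does not fit"
    print(f"{gb} GB unified memory: full BF16 model {verdict}")
```

Adjust `reserved_for_os_gb` to match what Activity Monitor shows on your own machine; the cut-off between 16 GB and 32 GB is the point of the exercise, not the exact reserve.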
Step-by-step install on macOS
1. Prerequisites
xcode-select --install
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install python@3.11 git ffmpeg

Python 3.10–3.11 is the safe range. The Kimi-Audio code does not pin a version, but several dependencies (notably torch 2.4+ and transformers 4.45+) are well-tested there.
2. Clone the real repo and create a venv
git clone https://github.com/MoonshotAI/Kimi-Audio.git
cd Kimi-Audio
git submodule update --init --recursive
python3.11 -m venv kimi-env
source kimi-env/bin/activate
pip install --upgrade pip wheel

3. Install PyTorch with MPS
From PyTorch's MPS backend docs, the standard torch wheel on macOS already includes Metal Performance Shaders. The "torch-mps" package referenced in older Kimi guides is not a real package — install the regular wheel:
pip install "torch>=2.4" "torchaudio>=2.4"

Verify MPS is live:

python -c "import torch; print('mps available:', torch.backends.mps.is_available())"

4. Install Kimi-Audio dependencies
The repo's requirements.txt assumes a CUDA box. Two specific lines need handling on macOS:
- flash-attn: no Apple Silicon support. Comment it out of `requirements.txt` before installing.
- BigVGAN custom CUDA ops: pulled by the detokenizer; ignore unless you need speech generation.
grep -v "flash" requirements.txt > requirements-mac.txt
pip install -r requirements-mac.txt
pip install -e .

Set the MPS fallback environment variable so any unsupported op falls back to CPU (PyTorch logs a warning when it does) instead of crashing:

export PYTORCH_ENABLE_MPS_FALLBACK=1

Add it to your ~/.zshrc if you'll be using the model regularly.
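The `grep -v "flash"` filter above drops every line that contains the string "flash", whatever package it belongs to. If you prefer to match on package names only, the following is a minimal sketch of a stricter filter. The helper names and the `CUDA_ONLY` set are illustrative assumptions; check the actual entries in the repo's `requirements.txt` before relying on it.

```python
import re
from pathlib import Path

# Packages to skip on macOS (assumed list; extend after reading requirements.txt).
CUDA_ONLY = {"flash-attn"}


def package_name(line: str) -> str:
    """Return the normalized package name of a requirements line, or ''."""
    stripped = line.split("#", 1)[0].strip()
    if not stripped or stripped.startswith("-"):
        return ""  # blank line, comment, or pip option such as -r / -e
    name = re.split(r"[<>=!~\[;@\s]", stripped, maxsplit=1)[0]
    # Treat '-', '_' and '.' as equivalent, as pip does when comparing names.
    return re.sub(r"[-_.]+", "-", name).lower()


def filter_requirements(lines):
    """Keep every line whose package is not in CUDA_ONLY."""
    return [ln for ln in lines if package_name(ln) not in CUDA_ONLY]


def rewrite(src: str = "requirements.txt", dst: str = "requirements-mac.txt") -> None:
    """Write a macOS-friendly copy of the requirements file."""
    kept = filter_requirements(Path(src).read_text().splitlines(keepends=True))
    Path(dst).write_text("".join(kept))


sample = ["torch>=2.4\n", "flash_attn==2.5.8\n", "# audio deps\n", "soundfile\n"]
print("".join(filter_requirements(sample)), end="")
```

Calling `rewrite()` from the repo root produces the same `requirements-mac.txt` file that the install step expects.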
5. Pull the model weights
pip install "huggingface_hub[cli]"
huggingface-cli download moonshotai/Kimi-Audio-7B-Instruct --local-dir ./Kimi-Audio-7B-Instruct

The download is roughly 20 GB in BF16 safetensors plus the tokenizer, vocoder, and config files.
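If you would rather script the download, `huggingface_hub` also exposes `snapshot_download`. The sketch below adds a free-disk check first; the 20 GB figure is the approximate size quoted above, and the 5 GB margin and the helper names are assumptions made for illustration.

```python
import shutil

REPO_ID = "moonshotai/Kimi-Audio-7B-Instruct"
APPROX_DOWNLOAD_GB = 20.0  # approximate size quoted in this guide


def free_gb(path: str = ".") -> float:
    """Free space on the volume holding `path`, in decimal gigabytes."""
    return shutil.disk_usage(path).free / 1e9


def has_room(path: str = ".", needed_gb: float = APPROX_DOWNLOAD_GB,
             margin_gb: float = 5.0) -> bool:
    """True if the volume has the download size plus an ASSUMED safety margin."""
    return free_gb(path) >= needed_gb + margin_gb


def download(local_dir: str = "./Kimi-Audio-7B-Instruct") -> str:
    """Fetch the checkpoint; imports huggingface_hub lazily so the check runs without it."""
    from huggingface_hub import snapshot_download

    if not has_room("."):
        raise RuntimeError(f"Only {free_gb('.'):.0f} GB free; need about {APPROX_DOWNLOAD_GB:.0f} GB")
    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
```

Call `download()` once from the repo root; it returns the local directory path that the inference code takes as `model_path`.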
First inference: ASR on Apple Silicon
Use load_detokenizer=False on Mac. The detokenizer (BigVGAN + flow-matching vocoder) ships with CUDA-only kernels and will fail to import on macOS.
import os

# Set the fallback flag before importing torch: the MPS backend reads it at load time.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch
from kimia_infer.api.kimia import KimiAudio

device = "mps" if torch.backends.mps.is_available() else "cpu"

model = KimiAudio(
    model_path="./Kimi-Audio-7B-Instruct",
    load_detokenizer=False,  # required on macOS for ASR-only
)
# Move the LLM trunk to MPS where supported.
model.to(device)

sampling_params = {
    "audio_temperature": 0.8,
    "audio_top_k": 10,
    "text_temperature": 0.0,
    "text_top_k": 5,
    "audio_repetition_penalty": 1.0,
    "audio_repetition_window_size": 64,
    "text_repetition_penalty": 1.0,
    "text_repetition_window_size": 16,
}

# Two user turns: the text instruction, then the audio file to transcribe.
messages = [
    {"role": "user", "message_type": "text",
     "content": "Please transcribe the following audio:"},
    {"role": "user", "message_type": "audio",
     "content": "test_audios/asr_example.wav"},
]

_, text = model.generate(messages, **sampling_params, output_type="text")
print(text)

For audio question answering, replace the transcription instruction with your question and keep the audio file as the second user turn. The generation interface is the same; only output_type="text" is reliable on Mac.
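As a small illustration of the audio Q&A variant, the helper below builds the two-turn message list in the same shape as the ASR example. The function name `build_audio_qa_messages` is hypothetical; only the dictionary keys come from the example above.

```python
def build_audio_qa_messages(question: str, audio_path: str) -> list[dict]:
    """Return a text question followed by the audio clip it refers to."""
    return [
        {"role": "user", "message_type": "text", "content": question},
        {"role": "user", "message_type": "audio", "content": audio_path},
    ]


messages = build_audio_qa_messages(
    "What emotion does the speaker convey?",
    "test_audios/asr_example.wav",
)
print(messages)
```

Pass the result to `model.generate(messages, **sampling_params, output_type="text")` exactly as in the ASR example.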
Speech generation: why it's hard on Mac
End-to-end speech-to-speech needs the detokenizer. Three blockers on macOS:
- BigVGAN custom CUDA kernels — the vocoder's anti-aliased activations are written as CUDA extensions and don't compile on Metal.
- flash-attn — used in some attention paths; no Apple Silicon build exists.
- Memory — keeping the 7B trunk and vocoder resident comfortably exceeds 16 GB.
If you need speech-out on a Mac, two pragmatic options:
- Use Kimi-Audio for ASR / understanding locally and pipe text through a Mac-friendly TTS (Apple's `say`, `kokoro-tts`, or `parler-tts` with MPS).
- Run the full pipeline remotely. The community-maintained zsxkib/kimi-audio-7b-instruct Replicate endpoint exposes the full model behind an HTTP API; call it from your Mac as a thin client.
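For the first option, a minimal sketch of handing model text to the built-in macOS `say` command looks like this. The wrapper names are illustrative; the only external pieces assumed are `say` itself and its `-v` (voice) and `-r` (rate) options. The text is sent on standard input so that a reply starting with a dash is not parsed as a flag.

```python
import shutil
import subprocess
from typing import Optional


def build_say_command(voice: Optional[str] = None, rate_wpm: Optional[int] = None) -> list:
    """Build the argv for `say`; the text itself is supplied on stdin."""
    cmd = ["say"]
    if voice:
        cmd += ["-v", voice]
    if rate_wpm:
        cmd += ["-r", str(rate_wpm)]
    return cmd


def speak(text: str, voice: Optional[str] = None, rate_wpm: Optional[int] = None) -> bool:
    """Speak `text` aloud; return False when `say` is unavailable (non-macOS)."""
    if shutil.which("say") is None:
        return False
    subprocess.run(build_say_command(voice, rate_wpm), input=text, text=True, check=True)
    return True
```

In the ASR script, `speak(text)` after `print(text)` reads the transcript or answer back through the system voice.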
Real-world performance numbers
The benchmarks at the top of the post are from Moonshot's own evaluation. Mac-specific throughput is governed by unified-memory bandwidth and the share of ops that fall back from MPS to CPU. Indicative numbers from independent practitioner reports in the broader Apple-Silicon ML community:
- M2 Max, 64 GB: Kimi-Audio ASR on a 30-second LibriSpeech clip runs roughly 4–6× slower than an RTX 4090 — usable for batch transcription, sluggish for real-time.
- M3 Pro, 18 GB: works for clips under ~20 seconds; longer audio frequently triggers swap and tail-latency spikes.
- M4 Max, 64 GB: best Mac-laptop class; ASR throughput is competitive with a single L4 GPU on similar clips, generation still gated by missing kernels.
If you're benchmarking yourself, set PYTORCH_ENABLE_MPS_FALLBACK=1 and time both the warm-cache second run and the cold first run — the JIT-compile path on MPS is a one-time cost that makes the first call look 5–10× slower than steady state.
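One way to separate the cold first call from warm steady state is a small timing wrapper. This is a generic sketch, not part of the Kimi-Audio repo; `time_cold_and_warm` is a hypothetical helper that times any zero-argument callable.

```python
import time


def time_cold_and_warm(fn, warm_runs: int = 3):
    """Return (cold_seconds, mean_warm_seconds) for a zero-argument callable."""
    start = time.perf_counter()
    fn()  # first call pays any one-time compile or cache cost
    cold = time.perf_counter() - start

    warm = []
    for _ in range(warm_runs):
        start = time.perf_counter()
        fn()
        warm.append(time.perf_counter() - start)
    return cold, sum(warm) / len(warm)


if __name__ == "__main__":
    # Stand-in workload; swap in a lambda that calls model.generate(...).
    cold, warm = time_cold_and_warm(lambda: sum(range(100_000)))
    print(f"cold: {cold:.4f}s  warm mean: {warm:.4f}s")
```

With the model loaded, wrap the generate call in a lambda and report both numbers; quoting only the first run overstates latency, and quoting only the warm mean hides the start-up cost.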
How to choose: Kimi-Audio vs alternatives in 2026
| Use case | Best pick on Mac (April 2026) | Why |
|---|---|---|
| English-only ASR, single language | Whisper large-v3 / faster-whisper | Mature MPS path, smaller VRAM, lower latency |
| Multilingual ASR with strong Chinese | Kimi-Audio-7B-Instruct | Beats Qwen2-Audio and Qwen2.5-Omni on AISHELL/FLEURS-zh |
| Audio Q&A, captioning, emotion | Kimi-Audio-7B-Instruct or Qwen2.5-Omni | Whisper can't do these; one-shot multimodal beats stitched pipelines |
| Real-time speech-to-speech on Mac | Remote Kimi-Audio + local STT/TTS | Detokenizer doesn't run on Metal; latency dominated by network either way |
| Local speech-to-speech, full stack | Linux + 24 GB+ NVIDIA GPU | Only path where the full Kimi-Audio inference graph is well-supported |
Common pitfalls and troubleshooting
- `ImportError: flash_attn`: strip `flash-attn` from `requirements.txt`, then pass `attn_implementation="sdpa"` when constructing any transformer that exposes it.
- `RuntimeError: MPS does not support op X`: set `PYTORCH_ENABLE_MPS_FALLBACK=1`. If you forget, you'll see hard crashes instead of slow CPU fallback.
- Detokenizer import fails: instantiate `KimiAudio(load_detokenizer=False)`. The CUDA extensions for BigVGAN's anti-aliased ops will not compile on macOS.
- OOM on 16 GB Macs: you can't fit the full BF16 model. Expect to either clip audio aggressively or move to a 32 GB+ machine. Quantized community variants don't yet exist for Kimi-Audio (as of April 2026); track the HF discussions tab.
- Crackling / sample-rate mismatch: the model expects 16 kHz mono input. Resample with `ffmpeg -i in.wav -ar 16000 -ac 1 out.wav` before passing to `generate()`.
- First call is glacial: that's the MPS JIT compile path. Warm the model with a tiny dummy clip at startup and serve real requests on the warm cache.
- vLLM on Mac: Kimi-Audio support landed in vllm-project/vllm#17234 for CUDA only. Don't expect a Mac-friendly serving runtime in 2026.
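The sample-rate pitfall is easy to catch before inference. The sketch below uses Python's standard `wave` module to check whether a file already matches the 16 kHz mono format mentioned above. It only reads uncompressed PCM WAV files, and the helper names are illustrative.

```python
import wave


def wav_format(path: str):
    """Return (sample_rate_hz, channels) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        return wf.getframerate(), wf.getnchannels()


def needs_resample(path: str, target_rate: int = 16000, target_channels: int = 1) -> bool:
    """True if the file should go through ffmpeg before inference."""
    rate, channels = wav_format(path)
    return rate != target_rate or channels != target_channels
```

Running `needs_resample("in.wav")` before each call lets a batch script skip the ffmpeg step for files that are already in the expected format.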
What was removed from this guide and why
- Placeholder GitHub URL (`[Kimi-Audio-Repo]`): replaced with the real `github.com/MoonshotAI/Kimi-Audio`.
- "Hypothetical GUI workflow": there is no first-party Moonshot GUI. We now point to a real Replicate endpoint and the Hugging Face Spaces ecosystem if you want a UI without writing code.
- "Install `torch-mps` for Metal Performance Shaders": that package never existed on PyPI. Standard `torch` on macOS includes MPS.
- "Use `--device mps` command-line flag": the repo's CLI doesn't expose that flag; device selection happens in Python via `model.to("mps")`.
- Logic Pro / Sweetwater Sequoia compatibility / OrbStack sandboxing tangents: these were filler not relevant to actually running the model.
When to bring in help
Wiring Kimi-Audio (or any open-weights audio model) into a production pipeline — call routing, meeting summarization, voice agents — usually breaks at the integration layer, not the model layer. If you're shipping voice-driven product features and want vetted ML and backend engineers who've already done this on Apple Silicon and on cloud GPUs, Codersera connects companies with vetted remote developers who can extend your engineering team without the months-long hiring cycle.
For a broader picture of running open-weights agents and models locally — including the Ollama / OpenClaw stack that pairs naturally with Kimi-Audio for multimodal pipelines — see our OpenClaw + Ollama setup guide for running local AI agents.
FAQ
Is Kimi-Audio better than Whisper for English ASR?
On LibriSpeech test-clean, Moonshot reports 1.28% WER for Kimi-Audio-7B versus Whisper large-v3's typically reported 1.8–2.0% on the same set. The gap is real but small. For pure English ASR with no other tasks, Whisper is still the better Mac citizen because it's smaller, faster, and has a mature MPS path.
Does it support real-time streaming on Mac?
The detokenizer is designed for chunk-wise streaming, but it depends on CUDA kernels, so streaming speech-out is not a real Mac option in 2026. Streaming ASR (audio-in, text-out) is feasible on M-series Macs with 32 GB+ unified memory.
What languages does it support?
The technical report emphasises English and Mandarin Chinese as the headline languages. The pre-training corpus includes other languages but published benchmarks focus on en/zh.
Can I fine-tune Kimi-Audio on a Mac?
No. Moonshot published a fine-tuning example on May 29, 2025, but it assumes CUDA. Use a cloud GPU or a Linux box with at least one 24 GB NVIDIA card. Inference-only is the realistic Mac use case.
Is there a quantized version that fits in 16 GB?
As of April 2026, no first-party GGUF / AWQ / GPTQ release exists. Community quants for similar Qwen 2.5-7B-derived models work, but you'd be re-exporting weights yourself. Watch the model's Hugging Face discussions tab.
What's the license?
Code is MIT. The model weights inherit Apache 2.0 from the Qwen 2.5-7B base. Both are commercial-use friendly, but read the upstream Qwen license for region-specific terms.
Why does my first inference call take ~30 seconds when later calls take 2 seconds?
That's the MPS JIT compilation path. On first use, PyTorch lowers each operator to a Metal shader. Subsequent calls hit the cached shader. Warm the model on app start with a tiny dummy input.
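A tiny dummy input can be generated without any audio tooling. The sketch below writes one second of 16 kHz mono silence with the standard `wave` module; the file name and the `write_silence` helper are assumptions for illustration, and whether silence exercises every code path on your machine is something to verify locally.

```python
import wave


def write_silence(path: str, seconds: float = 1.0, rate: int = 16000) -> str:
    """Write a mono 16-bit PCM WAV of silence and return its path."""
    frames = int(seconds * rate)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)      # mono
        wf.setsampwidth(2)      # 16-bit samples
        wf.setframerate(rate)
        wf.writeframes(b"\x00\x00" * frames)
    return path


warmup_clip = write_silence("warmup.wav")
print("wrote", warmup_clip)
```

At startup, send `warmup.wav` through one `generate()` call and discard the output before serving real requests.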
How does Kimi-Audio fit alongside Kimi K2.6?
They are different products. Kimi K2.6 (April 2026) is Moonshot's flagship 1T-parameter MoE language model and has nothing to do with audio. Kimi-Audio remains a separate 7B audio model — it has not yet been merged into the K2 line.
References and further reading
- MoonshotAI/Kimi-Audio — official GitHub repository
- moonshotai/Kimi-Audio-7B-Instruct — Hugging Face model card
- Kimi-Audio Technical Report (arXiv:2504.18425)
- Kimi-Audio-Evalkit — official evaluation toolkit
- PyTorch MPS backend documentation
- Hugging Face Forums — flash-attn on Apple Silicon best practices
- vllm-project/vllm#17234 — Kimi-Audio support tracking issue
- zsxkib/kimi-audio-7b-instruct — community Replicate endpoint