Running Kimi-Audio on Mac: A Practical 2026 Guide

Published 29 Apr 2025 • Updated 31 May 2026 • 10 min read

Quick answer. Kimi-Audio-7B-Instruct runs on Apple Silicon Macs through PyTorch's MPS backend for ASR and audio understanding, but speech generation still depends on CUDA-only kernels, so pair it with Apple's say, kokoro-tts, or parler-tts for Mac TTS. Plan for 32 GB of unified memory (16 GB handles only short clips) and Python 3.10–3.11. As of May 2026 there is no first-party MLX or GGUF release.

Last updated April 2026 — refreshed for current Kimi-Audio releases, real GitHub URLs, and verified Apple Silicon caveats.

Kimi-Audio is Moonshot AI's open-source audio foundation model: one 7B-parameter network that handles speech recognition, audio understanding, audio question answering, audio captioning, and end-to-end speech-to-speech conversation. This guide is the practical, no-fluff walkthrough for getting it running on a Mac in 2026 — including the rough edges around CUDA-only kernels, MPS fallback, and what to do when a 23 GB VRAM target meets a 16 GB MacBook.

What changed in 2026

Moonshot AI shipped Kimi-Audio-7B-Instruct and the inference code on April 25, 2025; pretrained base weights followed on April 27, 2025, and a fine-tuning example on May 29, 2025. The repo URL is real: github.com/MoonshotAI/Kimi-Audio. The earlier version of this post had a placeholder; that's now fixed.
The official model card requires ~23 GB of GPU VRAM at BF16. No Mac has that on a single GPU. On Apple Silicon you are running on Unified Memory, so a 32 GB or 64 GB M-series machine is the realistic baseline; 16 GB will only work for short clips at reduced precision and with heavy CPU fallback.
The repo expects flash-attn, which has no Apple Silicon build. The right pattern in 2026 is attn_implementation="sdpa" plus PYTORCH_ENABLE_MPS_FALLBACK=1, not the speculative torch-mps package the old guide mentioned.
The BigVGAN-based detokenizer uses custom CUDA kernels. On macOS, set load_detokenizer=False for ASR / audio-understanding workflows. Speech generation realistically needs a Linux box or a remote GPU.
Moonshot's broader Kimi family moved fast in 2026 (Kimi K2 in July 2025, Kimi K2.5 in January 2026, Kimi K2.6 in April 2026), but Kimi-Audio has not yet had a v2. The 7B-Instruct checkpoint released in April 2025 is still current as of April 2026.
The earlier post's "Hypothetical GUI" section has been removed; there is no official Moonshot GUI for Kimi-Audio. Community wrappers exist on Hugging Face Spaces and Replicate; we link to a real one below.

TL;DR

Question	Short answer
Can I run Kimi-Audio on a Mac at all?	Yes for ASR and audio understanding on M-series with 32 GB+ unified memory. Speech generation is impractical without an NVIDIA GPU.
Minimum hardware?	M1 Pro / M2 / M3 / M4, macOS 13+, 32 GB unified memory recommended. 16 GB works only for short clips with aggressive offloading.
What about Intel Macs?	Technically possible (CPU-only PyTorch), but inference is too slow to be useful. Skip.
Is the model still current?	Yes — Kimi-Audio-7B-Instruct (April 25, 2025 release) remains the latest checkpoint as of April 2026.
License?	Code: MIT. Weights inherit Apache 2.0 from the Qwen 2.5-7B base.

What Kimi-Audio actually is

Per the Kimi-Audio Technical Report (arXiv:2504.18425), the model is initialized from Qwen 2.5-7B, then continually pre-trained on more than 13 million hours of speech, sound, and music. It uses a 12.5 Hz audio tokenizer, takes hybrid audio input (continuous acoustic features plus discrete semantic tokens), generates text and discrete audio tokens through parallel output heads, and pairs with a chunk-wise streaming detokenizer based on flow matching for speech generation.
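
To make the 12.5 Hz figure concrete, the arithmetic below converts clip length into an approximate audio-token count. Treat it as a back-of-the-envelope sketch: audio_token_count is a hypothetical helper, not part of the Kimi-Audio code, and it ignores any text or special tokens the model places around the audio.

```python
import math

TOKENIZER_RATE_HZ = 12.5  # audio tokens per second, per the technical report


def audio_token_count(duration_s: float, rate_hz: float = TOKENIZER_RATE_HZ) -> int:
    """Approximate number of audio tokens for a clip of the given length."""
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    return math.ceil(duration_s * rate_hz)


for seconds in (10, 30, 60):
    print(f"{seconds:>3} s of audio -> ~{audio_token_count(seconds)} tokens")
```

A 30-second clip comes out at roughly 375 audio tokens, which is why short clips are cheap in sequence length even though the weights themselves are large.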

Reported benchmarks from the technical report and the GitHub README:

Benchmark	Kimi-Audio-7B	Notable comparison
LibriSpeech test-clean (WER)	1.28%	Qwen2-Audio-base 1.74%, Qwen2.5-Omni 2.37%
LibriSpeech test-other (WER)	2.42%	—
AISHELL-1 (WER)	0.60%	Qwen2.5-Omni 1.13%
AISHELL-2 (WER)	2.56%	—
FLEURS Chinese (WER)	2.69%	—
VocalSound (accuracy)	94.85%	—
MELD (emotion, accuracy)	59.13%	—
OpenAudioBench AlpacaEval	75.73	—
OpenAudioBench Llama Questions	79.33	—
OpenAudioBench TriviaQA	62.10	—

Bottom line: on standard ASR benchmarks Kimi-Audio is competitive with or better than Whisper-class models, and beats Qwen2-Audio and Qwen2.5-Omni head-to-head. It also handles tasks Whisper can't (audio Q&A, captioning, end-to-end speech conversation) in one model.
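
For readers who have not worked with the metric, here is a minimal sketch of how word error rate is computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. It is an illustration only. Published scores also depend on text normalization that this sketch skips, and Chinese test sets are usually scored per character rather than per whitespace-separated word.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, kept to one rolling row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


ref = "the quick brown fox jumps over the lazy dog"
hyp = "the quick brown fox jumped over a lazy dog"
print(f"WER: {word_error_rate(ref, hyp):.2%}")
```

Two substituted words out of nine give a WER of about 22%, so a reported 1.28% corresponds to roughly one word error per 78 reference words.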

Hardware reality check on Mac

The official Hugging Face Kimi-Audio-7B-Instruct model card calls out ~23 GB of GPU VRAM at BF16 for full inference. On Apple Silicon, "VRAM" is unified memory shared with the OS and apps. Map it like this:

Mac	Practical for ASR / understanding?	Practical for speech generation?
M1 / M2 / M3 / M4 base, 8 GB	No	No
M-series, 16 GB	Short clips only, expect swap	No
M-series Pro/Max, 32 GB	Yes (recommended floor)	No: the detokenizer needs CUDA kernels
M2/M3/M4 Max, 64 GB+	Comfortable	Still gated by detokenizer / flash-attn
M3/M4 Ultra, 128–512 GB (Mac Studio)	Comfortable	Best Apple option, but expect functional gaps vs Linux+CUDA
Intel Mac (any)	Too slow to be useful	No
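
The table's cut-offs follow from simple arithmetic, sketched below. The 23 GB figure is the model-card number quoted above; the 8 GB reserved for macOS and other apps is an assumption chosen for illustration, and both helper functions are hypothetical.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weights_gb(params_billion: float, dtype: str = "bfloat16") -> float:
    """Size of the raw weights in GB (10^9 bytes), excluding activations."""
    return params_billion * BYTES_PER_PARAM[dtype]


def fits(unified_gb: float, required_gb: float, os_reserve_gb: float = 8.0) -> bool:
    """True if the requirement fits after reserving memory for macOS and apps."""
    return unified_gb - os_reserve_gb >= required_gb


REQUIRED_GB = 23.0  # full-inference figure quoted from the model card above

print(f"7B weights alone at BF16: ~{weights_gb(7):.0f} GB")
for ram in (16, 32, 64):
    print(f"{ram} GB Mac fits the 23 GB target: {fits(ram, REQUIRED_GB)}")
```

Under these assumptions a 16 GB machine falls well short, while 32 GB clears the target with little to spare, which matches the "recommended floor" label in the table.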

If your goal is production speech-to-speech with Kimi-Audio, run it on a Linux machine with a single 24 GB+ NVIDIA GPU (RTX 3090, 4090, A10, L4, A100). The Mac path is for experimentation, ASR pipelines, and audio-understanding evaluation.

Step-by-step install on macOS

1. Prerequisites

xcode-select --install
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install python@3.11 git ffmpeg

Python 3.10–3.11 is the safe range. The Kimi-Audio code does not pin a version, but several dependencies (notably torch 2.4+ and transformers 4.45+) are well-tested there.
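
A quick way to confirm that the interpreter you are about to use is in that range, and that it is running natively on Apple Silicon rather than under Rosetta, is a small check like the one below. The helper names are hypothetical; the platform and sys calls are standard library.

```python
import platform
import sys

SAFE_MINORS = {(3, 10), (3, 11)}  # the range recommended above


def python_ok(version_info=sys.version_info) -> bool:
    """True if the interpreter is inside the recommended 3.10-3.11 range."""
    return (version_info[0], version_info[1]) in SAFE_MINORS


def is_apple_silicon(system=None, machine=None) -> bool:
    """True for a native arm64 Python on macOS (Rosetta reports x86_64)."""
    system = system or platform.system()
    machine = machine or platform.machine()
    return system == "Darwin" and machine == "arm64"


print("Python", platform.python_version(), "in safe range:", python_ok())
print("Native Apple Silicon Python:", is_apple_silicon())
```

If the second line prints False on an M-series Mac, the interpreter is most likely an x86_64 build, and MPS will not be available to it.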

2. Clone the real repo and create a venv

git clone https://github.com/MoonshotAI/Kimi-Audio.git
cd Kimi-Audio
git submodule update --init --recursive

python3.11 -m venv kimi-env
source kimi-env/bin/activate
pip install --upgrade pip wheel

3. Install PyTorch with MPS

From PyTorch's MPS backend docs, the standard torch wheel on macOS already includes Metal Performance Shaders. The "torch-mps" package referenced in older Kimi guides is not a real package — install the regular wheel:

pip install "torch>=2.4" "torchaudio>=2.4"

Verify MPS is live:

python -c "import torch; print('mps available:', torch.backends.mps.is_available())"
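
If that prints False, it helps to know whether the wheel lacks MPS support or the machine cannot use it. The sketch below separates the two cases with torch.backends.mps.is_built() and is_available(), then runs one tiny op on the device. The status strings are made up for this example.

```python
def mps_status() -> str:
    """Report whether the installed PyTorch build can use the Metal backend."""
    try:
        import torch
    except ImportError:
        return "torch-missing"

    if not torch.backends.mps.is_built():
        return "not-built"      # this wheel was compiled without MPS support
    if not torch.backends.mps.is_available():
        return "unavailable"    # MPS is built in, but this machine cannot use it
    # Run one tiny op on the GPU to confirm the device actually works.
    x = torch.ones(2, 2, device="mps")
    return "ok" if float((x + x).sum()) == 8.0 else "unexpected-result"


print("MPS status:", mps_status())
```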

4. Install Kimi-Audio dependencies

The repo's requirements.txt assumes a CUDA box. Two specific lines need handling on macOS:

flash-attn — no Apple Silicon support. Comment it out of requirements.txt before installing.
BigVGAN custom CUDA ops — pulled by the detokenizer; ignore unless you need speech generation.

grep -v "flash" requirements.txt > requirements-mac.txt
pip install -r requirements-mac.txt
pip install -e .
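
The grep above drops every line that contains the string "flash", including comments. If you prefer to match on the package name only, a small filter along these lines is one option. It is a sketch: the sample lines are illustrative rather than copied from the repo's requirements.txt, and the helper names are hypothetical.

```python
import re

BLOCKED = {"flash-attn", "flash_attn"}  # CUDA-only packages to skip on macOS


def requirement_name(line: str) -> str:
    """Return the lowercase package name at the start of a requirement line."""
    match = re.match(r"\s*([A-Za-z0-9_.\-]+)", line)
    return match.group(1).lower() if match else ""


def filter_requirements(lines, blocked=BLOCKED):
    """Drop only the lines whose package name is in `blocked`."""
    return [line for line in lines if requirement_name(line) not in blocked]


sample = ["torch>=2.4", "flash-attn==2.5.0  # CUDA only", "# flash notes", "librosa"]
print(filter_requirements(sample))
```

To use it on the real file, read requirements.txt with splitlines(), pass the list through filter_requirements, and write the result to requirements-mac.txt.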

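Set the MPS fallback environment variable so that any op without an MPS implementation falls back to the CPU (PyTorch prints a warning when it does) instead of raising an error. It has to be set before torch is imported: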

export PYTORCH_ENABLE_MPS_FALLBACK=1

Add it to your ~/.zshrc if you'll be using the model regularly.

5. Pull the model weights

pip install "huggingface_hub[cli]"
huggingface-cli download moonshotai/Kimi-Audio-7B-Instruct --local-dir ./Kimi-Audio-7B-Instruct

The download is roughly 20 GB in BF16 safetensors plus the tokenizer, vocoder, and config files.
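
Because the download is large, it can be worth checking free disk space first. The sketch below does that with the standard library and then calls snapshot_download, the Python counterpart to the CLI command above. The 5 GB safety margin and the helper names are assumptions for illustration.

```python
import shutil


def free_gb(path: str = ".") -> float:
    """Free space on the volume holding `path`, in GB (10^9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_room(path: str = ".", needed_gb: float = 20.0, margin_gb: float = 5.0) -> bool:
    """True if the volume can hold the download plus a safety margin."""
    return free_gb(path) >= needed_gb + margin_gb


def download_weights(local_dir: str = "./Kimi-Audio-7B-Instruct") -> str:
    """Fetch the checkpoint with the Hub's Python API; returns the local path."""
    from huggingface_hub import snapshot_download  # installed in the step above

    if not has_room("."):
        raise RuntimeError(f"Only {free_gb('.'):.1f} GB free; need about 25 GB")
    return snapshot_download(
        repo_id="moonshotai/Kimi-Audio-7B-Instruct",
        local_dir=local_dir,
    )


print(f"Free space here: {free_gb('.'):.1f} GB, enough for the weights: {has_room('.')}")
```

Nothing is downloaded until download_weights() is called, so the check is safe to run on its own.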

First inference: ASR on Apple Silicon

Use load_detokenizer=False on Mac. The detokenizer (BigVGAN + flow-matching vocoder) ships with CUDA-only kernels and will fail to import on macOS.

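import os

# Must be set before torch is imported: PyTorch reads it when the library loads.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch
from kimia_infer.api.kimia import KimiAudio

device = "mps" if torch.backends.mps.is_available() else "cpu"

model = KimiAudio(
    model_path="./Kimi-Audio-7B-Instruct",
    load_detokenizer=False,        # required on macOS for ASR-only
)
# Move the LLM trunk to MPS where supported. The guard is a precaution:
# confirm that your checkout's KimiAudio wrapper exposes .to() before relying on it.
if hasattr(model, "to"):
    model.to(device)
else:
    print("KimiAudio has no .to(); check how the wrapper selects its device")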

sampling_params = {
    "audio_temperature": 0.8,
    "audio_top_k": 10,
    "text_temperature": 0.0,
    "text_top_k": 5,
    "audio_repetition_penalty": 1.0,
    "audio_repetition_window_size": 64,
    "text_repetition_penalty": 1.0,
    "text_repetition_window_size": 16,
}

messages = [
    {"role": "user", "message_type": "text",
     "content": "Please transcribe the following audio:"},
    {"role": "user", "message_type": "audio",
     "content": "test_audios/asr_example.wav"},
]

_, text = model.generate(messages, **sampling_params, output_type="text")
print(text)

For audio question answering, replace the transcription instruction in the first message with your question and keep the audio file as the second user message. The generation interface is the same; on a Mac, only output_type="text" is reliable.
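
As a sketch of that pattern, the helper below builds both the transcription request and a question-answering request from the same two-message layout used in the example above. build_audio_messages is a hypothetical convenience function and the question text is only illustrative; this does not check how well any particular phrasing works.

```python
def build_audio_messages(instruction: str, audio_path: str) -> list:
    """Pair a text instruction with an audio clip in the message format used above."""
    return [
        {"role": "user", "message_type": "text", "content": instruction},
        {"role": "user", "message_type": "audio", "content": audio_path},
    ]


asr = build_audio_messages(
    "Please transcribe the following audio:", "test_audios/asr_example.wav"
)
qa = build_audio_messages(
    "What emotion does the speaker convey?", "test_audios/asr_example.wav"
)
for name, messages in (("asr", asr), ("qa", qa)):
    print(name, [m["message_type"] for m in messages])
```

Either list can be passed to model.generate(messages, **sampling_params, output_type="text") exactly as in the transcription example.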

Speech generation: why it's hard on Mac

End-to-end speech-to-speech needs the detokenizer. Three blockers on macOS:

BigVGAN custom CUDA kernels — the vocoder's anti-aliased activations are written as CUDA extensions and don't compile on Metal.
flash-attn — used in some attention paths; no Apple Silicon build exists.
Memory — keeping the 7B trunk and vocoder resident comfortably exceeds 16 GB.

If you need speech-out on a Mac, two pragmatic options:

Use Kimi-Audio for ASR / understanding locally and pipe text through a Mac-friendly TTS (Apple's say, kokoro-tts, or parler-tts with MPS).
Run the full pipeline remotely. The community-maintained zsxkib/kimi-audio-7b-instruct Replicate endpoint exposes the full model behind an HTTP API; call it from your Mac as a thin client.
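
For the first option, the simplest text-to-speech hand-off uses the say command that ships with macOS. The sketch below builds the argument list and only runs it on macOS. The function names are hypothetical; -v (voice) and -o (output file) are existing say options.

```python
import platform
import subprocess


def say_command(text: str, out_path=None, voice=None) -> list:
    """Build the argument list for macOS `say`; a list avoids shell quoting."""
    cmd = ["say"]
    if voice:
        cmd += ["-v", voice]
    if out_path:
        cmd += ["-o", out_path]   # write an audio file instead of playing aloud
    cmd.append(text)
    return cmd


def speak(text: str, out_path=None, voice=None) -> bool:
    """Speak `text` on macOS; return False elsewhere so callers can fall back."""
    if platform.system() != "Darwin":
        return False
    subprocess.run(say_command(text, out_path, voice), check=True)
    return True


transcript = "Kimi-Audio produced this transcript on the Mac."
print(say_command(transcript, out_path="reply.aiff"))
```

In a pipeline, the text returned by model.generate() would take the place of the hard-coded transcript, and speak() would be called instead of printing the command.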

Real-world performance numbers

The benchmarks at the top of the post are from Moonshot's own evaluation. Mac-specific throughput is governed by unified-memory bandwidth and the share of ops that fall back from MPS to CPU. Indicative numbers from independent practitioner reports in the broader Apple-Silicon ML community:

M2 Max, 64 GB: Kimi-Audio ASR on a 30-second LibriSpeech clip runs roughly 4–6× slower than an RTX 4090 — usable for batch transcription, sluggish for real-time.
M3 Pro, 18 GB: works for clips under ~20 seconds; longer audio frequently triggers swap and tail-latency spikes.
M4 Max, 64 GB: best Mac-laptop class; ASR throughput is competitive with a single L4 GPU on similar clips, generation still gated by missing kernels.

If you're benchmarking yourself, set PYTORCH_ENABLE_MPS_FALLBACK=1 and time both the warm-cache second run and the cold first run — the JIT-compile path on MPS is a one-time cost that makes the first call look 5–10× slower than steady state.
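
A minimal harness for that comparison could look like the following. It times one cold call and then takes the median of a few warm calls. The fake workload is a stand-in so the example runs without the model; swap in a callable that wraps your own generate call.

```python
import statistics
import time


def time_cold_and_warm(fn, warm_runs: int = 3):
    """Return (cold_seconds, median_warm_seconds) for a zero-argument callable."""
    start = time.perf_counter()
    fn()                                   # first call pays one-time setup costs
    cold = time.perf_counter() - start

    warm = []
    for _ in range(warm_runs):
        start = time.perf_counter()
        fn()
        warm.append(time.perf_counter() - start)
    return cold, statistics.median(warm)


def make_fake_inference():
    """Stand-in workload: slow on the first call, fast afterwards."""
    first = True

    def run():
        nonlocal first
        time.sleep(0.05 if first else 0.005)
        first = False

    return run


cold, warm = time_cold_and_warm(make_fake_inference())
print(f"cold: {cold * 1000:.0f} ms, warm median: {warm * 1000:.0f} ms")
```

With the real model, pass something like lambda: model.generate(messages, **sampling_params, output_type="text") and report both numbers separately rather than averaging them.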

How to choose: Kimi-Audio vs alternatives in 2026

Use case	Best pick on Mac (April 2026)	Why
English-only ASR	Whisper large-v3 / faster-whisper	Mature Apple Silicon tooling, smaller memory footprint, lower latency
Multilingual ASR with strong Chinese	Kimi-Audio-7B-Instruct	Beats Qwen2-Audio and Qwen2.5-Omni on AISHELL/FLEURS-zh
Audio Q&A, captioning, emotion	Kimi-Audio-7B-Instruct or Qwen2.5-Omni	Whisper can't do these; one-shot multimodal beats stitched pipelines
Real-time speech-to-speech on Mac	Remote Kimi-Audio + local STT/TTS	Detokenizer doesn't run on Metal; latency dominated by network either way
Local speech-to-speech, full stack	Linux + 24 GB+ NVIDIA GPU	Only path where the full Kimi-Audio inference graph is well-supported

Common pitfalls and troubleshooting

ImportError: flash_attn — strip flash-attn from requirements.txt, then pass attn_implementation="sdpa" when constructing any transformer that exposes it.
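NotImplementedError: The operator 'aten::<op>' is not currently implemented for the MPS device: set PYTORCH_ENABLE_MPS_FALLBACK=1 before Python starts (or before torch is imported). Without it, the call raises this error instead of falling back to the slower CPU path.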
Detokenizer import fails — instantiate KimiAudio(load_detokenizer=False). The CUDA extensions for BigVGAN's anti-aliased ops will not compile on macOS.
OOM on 16 GB Macs — you can't fit the full BF16 model. Expect to either clip audio aggressively or move to a 32 GB+ machine. Quantized community variants don't yet exist for Kimi-Audio (as of April 2026); track the HF discussions tab.
Crackling / sample-rate mismatch — the model expects 16 kHz mono input. Resample with ffmpeg -i in.wav -ar 16000 -ac 1 out.wav before passing to generate().
First call is glacial — that's the MPS JIT compile path. Warm the model with a tiny dummy clip at startup and serve real requests on the warm cache.
vLLM on Mac — Kimi-Audio support landed in vllm-project/vllm#17234 for CUDA only. Don't expect a Mac-friendly serving runtime in 2026.
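
For the sample-rate pitfall in the list above, you can check a file's header before deciding whether to resample. The sketch below uses the standard-library wave module, which only reads PCM WAV files, and rebuilds the same ffmpeg command as an argument list. The helper names are hypothetical.

```python
import wave


def wav_format(path: str):
    """Return (sample_rate_hz, channels) read from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getnchannels()


def needs_resample(path: str, rate_hz: int = 16000, channels: int = 1) -> bool:
    """True if the file is not already in the expected rate and channel layout."""
    return wav_format(path) != (rate_hz, channels)


def ffmpeg_resample_command(src: str, dst: str) -> list:
    """The ffmpeg invocation from the list above, as an argument list."""
    return ["ffmpeg", "-i", src, "-ar", "16000", "-ac", "1", dst]


print(ffmpeg_resample_command("in.wav", "out.wav"))
```

Call needs_resample(path) on each input and run the ffmpeg command through subprocess.run only for the files that return True.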

What was removed from this guide and why

Placeholder GitHub URL ([Kimi-Audio-Repo]) — replaced with the real github.com/MoonshotAI/Kimi-Audio.
"Hypothetical GUI workflow" — there is no first-party Moonshot GUI. We now point to a real Replicate endpoint and the Hugging Face Spaces ecosystem if you want a UI without writing code.
"Install torch-mps for Metal Performance Shaders" — that package never existed on PyPI. Standard torch on macOS includes MPS.
"Use --device mps command-line flag" — the repo's CLI doesn't expose that flag; device selection happens in Python via model.to("mps").
Logic Pro / Sweetwater Sequoia compatibility / OrbStack sandboxing tangents — these were filler not relevant to actually running the model.

When to bring in help

Wiring Kimi-Audio (or any open-weights audio model) into a production pipeline — call routing, meeting summarization, voice agents — usually breaks at the integration layer, not the model layer. If you're shipping voice-driven product features and want vetted ML and backend engineers who've already done this on Apple Silicon and on cloud GPUs, Codersera connects companies with vetted remote developers who can extend your engineering team without the months-long hiring cycle.

For a broader picture of running open-weights agents and models locally — including the Ollama / OpenClaw stack that pairs naturally with Kimi-Audio for multimodal pipelines — see our OpenClaw + Ollama setup guide for running local AI agents.

FAQ

Is Kimi-Audio better than Whisper for English ASR?

On LibriSpeech test-clean, Moonshot reports 1.28% WER for Kimi-Audio-7B versus Whisper large-v3's typically reported 1.8–2.0% on the same set. The gap is real but small. For pure English ASR with no other tasks, Whisper is still the better Mac citizen because it's smaller, faster, and has a mature MPS path.

Does it support real-time streaming on Mac?

The detokenizer is designed for chunk-wise streaming, but it depends on CUDA kernels, so streaming speech-out is not a real Mac option in 2026. Streaming ASR (audio-in, text-out) is feasible on M-series Macs with 32 GB+ unified memory.

What languages does it support?

The technical report emphasises English and Mandarin Chinese as the headline languages. The pre-training corpus includes other languages but published benchmarks focus on en/zh.

Can I fine-tune Kimi-Audio on a Mac?

No. Moonshot published a fine-tuning example on May 29, 2025, but it assumes CUDA. Use a cloud GPU or a Linux box with at least one 24 GB NVIDIA card. Inference-only is the realistic Mac use case.

Is there a quantized version that fits in 16 GB?

As of April 2026, no first-party GGUF / AWQ / GPTQ release exists. Community quants for similar Qwen 2.5-7B-derived models work, but you'd be re-exporting weights yourself. Watch the model's Hugging Face discussions tab.

What's the license?

Code is MIT. The model weights inherit Apache 2.0 from the Qwen 2.5-7B base. Both are commercial-use friendly, but read the upstream Qwen license for region-specific terms.

Why does my first inference call take ~30 seconds when later calls take 2 seconds?

That's the MPS JIT compilation path. On first use, PyTorch lowers each operator to a Metal shader. Subsequent calls hit the cached shader. Warm the model on app start with a tiny dummy input.
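
One way to do that warm-up, sketched under the assumption that a second of silence is enough to exercise the slow path, is to write a tiny 16 kHz mono WAV at startup and send it through the same message format as a real request. write_silence and warm_up are hypothetical helpers, and the lambda is a stand-in for model.generate so the example runs without the model.

```python
import wave


def write_silence(path: str, seconds: float = 1.0, rate_hz: int = 16000) -> str:
    """Write a mono 16-bit PCM WAV of silence and return its path."""
    frames = int(seconds * rate_hz)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)              # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * frames)
    return path


def warm_up(generate, clip_path: str) -> None:
    """Call `generate` once on a dummy clip so later requests skip the slow path."""
    messages = [
        {"role": "user", "message_type": "text",
         "content": "Please transcribe the following audio:"},
        {"role": "user", "message_type": "audio", "content": clip_path},
    ]
    generate(messages, output_type="text")


clip = write_silence("warmup.wav")
warm_up(lambda messages, **kwargs: (None, ""), clip)  # stand-in for model.generate
print("warm-up clip written to", clip)
```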

How does Kimi-Audio fit alongside Kimi K2.6?

They are different products. Kimi K2.6 (April 2026) is Moonshot's flagship 1T-parameter MoE language model and has nothing to do with audio. Kimi-Audio remains a separate 7B audio model — it has not yet been merged into the K2 line.
