GLM-ASR-Nano: Run GLM-1.5B Speech-to-Text Locally

Last updated April 2026 — refreshed for current model/tool versions.

GLM-ASR-Nano-2512 is Z.AI's (formerly Zhipu) open-source 1.5B-parameter automatic speech recognition (ASR) model, released in December 2025 and now supported by vLLM 0.14.1+ and the Hugging Face Transformers 5.0 main branch. This guide covers the commands to install and run it locally, the published benchmark numbers as of 2026, and a comparison against Whisper Large v3 / v3-turbo, NVIDIA Canary-Qwen 2.5B, Qwen3-ASR, Mistral Voxtral, ElevenLabs Scribe v2, and Google Gemini 3 Pro audio.

What changed in 2026

GLM-ASR-Nano-2512 is now an official Hugging Face Transformers architecture (GlmAsr) and an officially supported vLLM recipe. You no longer have to patch in custom modeling code.
The model card publishes Open ASR Leaderboard numbers directly: 7.03 mean WER across the 8-dataset suite at RTFx 145.28 on a single H100. That puts it ahead of Whisper Large v3 (7.4) and Whisper Large v3 Turbo (7.75), behind NVIDIA Canary-Qwen 2.5B (5.63) and IBM Granite Speech 3.3 8B (5.85).
The hosted API glm-asr-2512 on api.z.ai now caps at 25 MB / 30 seconds per request: useful for short transcription, but for long files the local route is the practical option.
The license is MIT on Hugging Face (Apache-2.0 in the GitHub repo for the inference scripts); both permit commercial use, fine-tuning, and redistribution.
An MLX 4-bit conversion (mlx-community/GLM-ASR-Nano-2512-4bit) lets you run the model on Apple Silicon at roughly 1.2 GB of unified memory.
Real competition on the open-source side now includes Qwen3-ASR (52 languages, 0.6B/1.7B variants), NVIDIA Canary-Qwen 2.5B (current Open ASR Leaderboard #1), and Moonshine (27 MB, edge-class).

TL;DR

Question	Answer
What is GLM-ASR-Nano-2512?	1.5B-parameter open-source speech-to-text model from Z.AI (Zhipu), released Dec 2025.
How big is the download?	~3.0 GB BF16 SafeTensors; ~1.2 GB at MLX 4-bit.
Minimum hardware?	~6 GB VRAM at BF16 with batch=1 on a consumer GPU (RTX 3060 12 GB is comfortable). CPU works but is slow.
Languages?	17 languages with WER ≤ 20% — strongest on Mandarin, Cantonese, English, plus Sichuanese, Min Nan, Wu, French, German, Japanese, Korean, Spanish, Arabic.
License?	MIT (HF model) / Apache-2.0 (inference repo) — commercial use permitted.
How does it compare to Whisper v3?	Lower mean WER on Open ASR Leaderboard (7.03 vs 7.4) at the same parameter scale, plus much stronger Cantonese and dialect coverage.
Hosted API price?	Available on `api.z.ai` as `glm-asr-2512` with a 25 MB / 30 s cap per request; pricing is published on the Z.AI dashboard rather than the public docs page — check the dashboard before committing.

Why GLM-ASR-Nano matters in 2026

The 2026 ASR landscape splits cleanly into three tiers:

Hosted, closed, top-of-leaderboard: ElevenLabs Scribe v2 (2.3% AA-WER), Google Gemini 3 Pro audio (~2.9% AA-WER), Mistral Voxtral Small (~2.9% AA-WER) per Artificial Analysis' Speech-to-Text leaderboard. Best accuracy, but you pay per minute and ship audio off-box.
Open weights, big-model: NVIDIA Canary-Qwen 2.5B (5.63 mean WER), IBM Granite Speech 3.3 8B (5.85). Top of the Open ASR Leaderboard, but 2.5B–8B parameters and not as cheap to serve.
Open weights, small-model: GLM-ASR-Nano-2512 (1.5B), Whisper Large v3 (1.5B), Whisper Large v3 Turbo, Qwen3-ASR-0.6B/1.7B, Moonshine (27 MB). This is where local inference, batch processing, and edge deployment actually live.

GLM-ASR-Nano sits at the high end of the small-model tier: it ties Whisper Large v3 on parameter count but beats it on the public Open ASR Leaderboard average and pulls ahead on Chinese and Cantonese. If you have to run ASR on your own GPU — for compliance, latency, or cost reasons — and your audio is multilingual, this is the most defensible default open model in early 2026. For an end-to-end agent stack that includes local LLM inference alongside ASR, see our OpenClaw + Ollama setup guide for running local AI agents.

What changed vs the original release notes

The original launch description in late 2025 framed GLM-ASR-Nano as a Llama-based two-stage system with a separate vocoder. That description was lifted from the GLM-TTS companion paper and was misapplied to the ASR model. As of April 2026 the correct picture is:

GLM-ASR-Nano is a seq2seq audio-conditioned LM, exposed as AutoModelForSeq2SeqLM (or the new GlmAsr class in Transformers main).
There is no vocoder stage in the ASR model — that is GLM-TTS, a separate release.
vLLM and SGLang serving are now first-class.

Benchmark numbers you can cite

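From the official model card and the Hugging Face Open ASR Leaderboard (BF16, single H100, RTFx 145.28). The leaderboard mean is taken over eight datasets; the table lists seven of them, and the VoxPopuli figure is not reproduced here: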

Dataset	GLM-ASR-Nano WER
LibriSpeech clean	2.15
LibriSpeech other	4.42
SPGISpeech	2.08
TED-LIUM	3.10
GigaSpeech	9.73
Earnings22	11.08
AMI	16.15
Open ASR Leaderboard mean	7.03
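
RTFx is the inverse real-time factor: seconds of audio transcribed per second of compute. As a rough sketch of what the reported 145.28 means in wall-clock terms (the helper name is illustrative, and real throughput depends on batch size, clip length, and hardware):

```python
def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Wall-clock seconds needed to transcribe `audio_seconds` of audio at a given RTFx."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


# One hour of audio at the model card's single-H100 figure.
one_hour = processing_seconds(3600, 145.28)
print(f"{one_hour:.1f} s of compute per hour of audio")
```

At that rate an hour of audio costs roughly 25 seconds of H100 time; a slower card scales the figure proportionally to its own RTFx.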

Z.AI also reports an internal 4.10 average error rate and 0.0717 CER on its hosted-API evaluation suite, which leans heavier on Mandarin and Cantonese sets (Wenet Meeting, Aishell-1). Those numbers are not directly comparable to the Open ASR Leaderboard mean — different datasets, different metric mix — so cite each in its own context.
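
Both kinds of figure are edit-distance metrics: WER counts word-level substitutions, deletions, and insertions against the reference length, and CER does the same at character level (the usual choice for Mandarin and Cantonese, where word boundaries are ambiguous). A minimal sketch, assuming whitespace tokenization and no text normalization; leaderboards apply their own normalizers, so this will not reproduce published scores exactly:

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences (two-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference token missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra token in hypothesis
                prev[j - 1] + cost,  # substitution, or a free match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate as a fraction (multiply by 100 for leaderboard-style numbers)."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate as a fraction, ignoring spaces."""
    ref = list(reference.replace(" ", ""))
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, list(hypothesis.replace(" ", ""))) / len(ref)


print(f"WER: {wer('the cat sat on the mat', 'the cat sit on mat'):.2%}")
```

The example sentence has one substitution and one deletion against six reference words, so it scores 2/6. Because WER and CER divide by different units, averaging them into one number (as an internal suite might) gives a figure that only makes sense next to others computed the same way.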

Head-to-head against the 2026 field

Model	Params	Open ASR mean WER	License	Local-friendly
NVIDIA Canary-Qwen 2.5B	2.5B	5.63	CC-BY-4.0	Yes (NeMo)
IBM Granite Speech 3.3 8B	8B	5.85	Apache-2.0	Heavy
GLM-ASR-Nano-2512	1.5B	7.03	MIT	Yes
Whisper Large v3	1.5B	~7.4	MIT	Yes
Whisper Large v3 Turbo	809M	~7.75	MIT	Yes (very fast)
Qwen3-ASR-1.7B	1.7B	not in OAL avg yet	Apache-2.0	Yes
Moonshine	27–60M	edge-class	MIT	Yes (CPU/edge)
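
To read the gaps in relative rather than absolute terms, here is a small sketch using the mean WER values from the table. The Whisper figures are approximate, so treat the results as ballpark; the function name is illustrative.

```python
def relative_reduction(baseline_wer: float, candidate_wer: float) -> float:
    """Fraction of the baseline's errors removed by the candidate (negative = more errors)."""
    return (baseline_wer - candidate_wer) / baseline_wer


GLM_ASR_NANO = 7.03
baselines = {
    "Whisper Large v3": 7.4,
    "Whisper Large v3 Turbo": 7.75,
    "NVIDIA Canary-Qwen 2.5B": 5.63,
}

for name, baseline in baselines.items():
    change = relative_reduction(baseline, GLM_ASR_NANO)
    print(f"vs {name}: {change:+.1%}")
```

On these numbers GLM-ASR-Nano removes about 5% of Whisper Large v3's errors and about 9% of Turbo's, while making roughly 25% more errors than Canary-Qwen 2.5B.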

Hosted closed models (ElevenLabs Scribe v2 at 2.3% AA-WER, Gemini 3 Pro audio ~2.9%, Voxtral Small ~2.9% per Artificial Analysis) are on a different metric and pricing axis — they are not direct substitutes if your reason for going local is privacy, cost-at-scale, or air-gap compliance.

Install and run GLM-ASR-Nano locally

Prerequisites

Linux or macOS, Python 3.10+ (3.11 recommended).
NVIDIA GPU with ~6 GB free VRAM at BF16, or Apple Silicon with the MLX build, or a recent CPU (slow but works).
ffmpeg for audio decoding.
Hugging Face account if you want to push fine-tunes back, otherwise anonymous download is fine.

Option A — Transformers on a CUDA box

The official model card requires the Transformers main branch (the GlmAsr architecture is not yet in a tagged release as of April 2026):

python -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install git+https://github.com/huggingface/transformers
pip install torch --index-url https://download.pytorch.org/whl/cu124
pip install datasets soundfile accelerate
sudo apt-get install -y ffmpeg   # or: brew install ffmpeg

Minimal inference script — taken straight from the model card and adapted for a local file:

from transformers import AutoModelForSeq2SeqLM, AutoProcessor

model_id = "zai-org/GLM-ASR-Nano-2512"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id, dtype="auto", device_map="auto"
)

# audio_path can be a local .wav/.mp3/.flac or an https URL
audio_path = "examples/example_en.wav"
inputs = processor.apply_transcription_request(audio_path)
inputs = inputs.to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs, do_sample=False, max_new_tokens=500)
text = processor.batch_decode(
    outputs[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=True,
)
print(text[0])

Option B — vLLM server (recommended for throughput)

Z.AI publishes an official vLLM recipe at docs.vllm.ai/projects/recipes/en/latest/GLM/GLM-ASR.html. Requirements: vllm>=0.14.1 and Transformers 5.0 main.

uv pip install git+https://github.com/huggingface/transformers.git
uv pip install -U "vllm[audio]" --torch-backend auto

vllm serve zai-org/GLM-ASR-Nano-2512

Then transcribe via the standard OpenAI-compatible endpoint:

curl http://localhost:8000/v1/audio/transcriptions \
  -H "Content-Type: multipart/form-data" \
  -F model=zai-org/GLM-ASR-Nano-2512 \
  -F file=@meeting.wav \
  -F max_tokens=500
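
The same request can be made from Python. The sketch below uses only the standard library and assumes the server from `vllm serve` is listening on localhost:8000 and returns the OpenAI-style JSON body with a `text` field; `build_multipart` and `transcribe` are illustrative helper names, not part of vLLM.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_multipart(fields: dict[str, str], filename: str, audio: bytes) -> tuple[bytes, str]:
    """Encode text fields plus one 'file' part as multipart/form-data.

    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode()
        )
    file_header = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'
    )
    parts.append(file_header.encode() + audio + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(
    path: str,
    base_url: str = "http://localhost:8000/v1",  # assumption: default vLLM host and port
    model: str = "zai-org/GLM-ASR-Nano-2512",
) -> str:
    """POST one audio file to the transcription endpoint and return the text."""
    audio = Path(path).read_bytes()
    body, content_type = build_multipart({"model": model}, Path(path).name, audio)
    request = urllib.request.Request(
        f"{base_url}/audio/transcriptions",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["text"]
```

With the server running, `print(transcribe("meeting.wav"))` would mirror the curl call. If the `openai` package is already a dependency, pointing its client at the same base URL is likely the shorter route.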

Option C — Apple Silicon (MLX 4-bit)

For M-series Macs, the community MLX 4-bit conversion fits in roughly 1.2 GB of unified memory:

pip install mlx mlx-lm
huggingface-cli download mlx-community/GLM-ASR-Nano-2512-4bit
# then load the model as described in the repo README (the loader depends on how the conversion was packaged)

Option D — Hosted Z.AI API

If you only need short clips and don't want to manage GPUs:

curl --request POST \
  --url https://api.z.ai/api/paas/v4/audio/transcriptions \
  --header "Authorization: Bearer $ZAI_API_KEY" \
  --header "Content-Type: multipart/form-data" \
  --form model=glm-asr-2512 \
  --form stream=false \
  --form file=@clip.mp3

Hard limits: 25 MB per file, 30 seconds of audio per request. For meeting-length transcription, chunk locally or run the open-weights version on your own box.
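
A pre-flight check avoids sending clips the hosted endpoint will reject. This is a sketch under two assumptions: the input is a WAV file (the standard-library `wave` module cannot read MP3 or FLAC, so those would need a tool such as ffprobe), and "25 MB" is read conservatively as 25,000,000 bytes. The function names are illustrative.

```python
import os
import wave

MAX_BYTES = 25 * 1000 * 1000  # assumption: decimal megabytes, the stricter reading of "25 MB"
MAX_SECONDS = 30.0


def wav_duration_seconds(path: str) -> float:
    """Duration of a WAV file, computed from its frame count and sample rate."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def fits_hosted_limits(path: str) -> bool:
    """True if the file is within both the size cap and the duration cap."""
    return (
        os.path.getsize(path) <= MAX_BYTES
        and wav_duration_seconds(path) <= MAX_SECONDS
    )
```

Files that fail the check are candidates for local chunking or for the self-hosted route.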

How to choose: decision tree

You have proprietary or regulated audio that cannot leave your infra → run GLM-ASR-Nano-2512 locally via vLLM, or Whisper Large v3 if you specifically need translation.
You need the absolute lowest WER and English-heavy audio → NVIDIA Canary-Qwen 2.5B (open) or ElevenLabs Scribe v2 (hosted, 2.3% AA-WER).
You need Cantonese, Mandarin, Sichuanese, Min Nan, or Wu → GLM-ASR-Nano-2512 is the strongest open option in 2026.
You need real-time, sub-second latency on edge / CPU-only → Moonshine, or Whisper Large v3 Turbo with whisper.cpp.
You only have an M1/M2/M3 Mac → mlx-community/GLM-ASR-Nano-2512-4bit or whisper.cpp via MacWhisper for a GUI.
You want translation (speech in language A → text in English) → stick with Whisper Large v3 multilingual; GLM-ASR-Nano is transcription-only, and Whisper Turbo dropped translation as well.

Performance tuning on your own GPU

Batch size: GLM-ASR-Nano supports batched inference natively. On an H100 the model card reports RTFx 145.28; on a 24 GB RTX 4090 expect roughly 60–90× real-time at batch=4.
Precision: BF16 is the documented default. FP16 works on Ampere; INT8 / 4-bit quantization works via bitsandbytes or the MLX build with a small accuracy hit (~0.3–0.6 absolute WER on LibriSpeech other in our internal smoke tests; verify on your audio).
Long-form audio: chunk into 30 s windows with a 1–2 s overlap and merge by timestamp. The model itself is trained on segment-level inputs.
VAD: pair with Silero VAD or WebRTC VAD before sending audio in; it reduces hallucinated tokens on silent regions.
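
The long-form chunking rule above can be sketched as a function that only computes window boundaries. The 1.5 s default overlap is an arbitrary pick from the 1 to 2 s range, and the function name is illustrative. Slicing the audio and merging the overlapping transcripts are left out.

```python
def chunk_windows(
    total_seconds: float,
    window_seconds: float = 30.0,
    overlap_seconds: float = 1.5,
) -> list[tuple[float, float]]:
    """Return (start, end) times covering the audio with fixed-size overlapping windows."""
    if not 0 <= overlap_seconds < window_seconds:
        raise ValueError("overlap must be non-negative and smaller than the window")
    step = window_seconds - overlap_seconds
    windows = []
    start = 0.0
    while start < total_seconds:
        end = min(start + window_seconds, total_seconds)
        windows.append((start, end))
        if end >= total_seconds:
            break  # the last window already reaches the end of the audio
        start += step
    return windows


for start, end in chunk_windows(70, window_seconds=30, overlap_seconds=2):
    print(f"{start:5.1f} s -> {end:5.1f} s")
```

For a 70 s file with a 2 s overlap this yields three windows starting at 0, 28, and 56 seconds. Cutting at VAD-detected pauses instead of fixed offsets usually reduces words split across a boundary, at the cost of variable window lengths.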

Common pitfalls and troubleshooting

KeyError: 'glm_asr' when loading the model: you are on the released Transformers (≤4.x). Reinstall from the GitHub main branch — pip install git+https://github.com/huggingface/transformers.
vLLM rejects the model: confirm vllm>=0.14.1 and that you installed the audio extras: pip install -U "vllm[audio]".
Garbled output on long audio: you exceeded the model's segment length. Chunk to ≤30 s windows. Hosted API enforces this; the local model degrades silently above ~40 s.
Cantonese transcribed as Mandarin: pass an explicit language hint via the processor's transcription request — relying on auto-detect is unreliable for the Yue/Mandarin pair on short clips.
Out-of-memory on a 12 GB card at batch=4: drop to batch=1, or load in 8-bit through bitsandbytes by passing quantization_config=BitsAndBytesConfig(load_in_8bit=True) to from_pretrained; BF16 + KV cache for batched decoding is the dominant cost.
Slow on CPU: this is expected. CPU inference is ~0.3× real-time. Use whisper.cpp's GGML port for CPU; the GLM-ASR GGML port is not yet upstream as of April 2026.
License confusion: the Hugging Face card says MIT, the GitHub repo says Apache-2.0. Both are permissive and commercially compatible — use whichever covers the artifact you're consuming.
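
The first two pitfalls come down to installed versions, which can be checked before loading anything. This is a rough sketch: the numeric comparison ignores pre-release ordering, the helper names are illustrative, and treating "transformers 5.0 or newer" as a proxy for a main-branch install is an assumption, since the architecture may still be missing from a given build.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_version(text: str) -> tuple[int, ...]:
    """Leading numeric components only: '0.14.1+cu124' -> (0, 14, 1)."""
    numbers = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at suffixes such as 'dev0'
        numbers.append(int(match.group()))
    return tuple(numbers)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare numerically, so that 0.9.2 correctly sorts below 0.14.1."""
    return parse_version(installed) >= parse_version(minimum)


def check(package: str, minimum: str) -> str:
    """One-line status for an installed package against a minimum version."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return f"{package}: not installed"
    status = "ok" if meets_minimum(installed, minimum) else f"needs >= {minimum}"
    return f"{package} {installed}: {status}"


print(check("vllm", "0.14.1"))
print(check("transformers", "5.0"))
```

Comparing version strings directly would get this wrong, because "0.9.2" sorts after "0.14.1" as text.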

What was removed and why

The "two-stage Llama + vocoder" architecture description from the original post: that wording came from the GLM-TTS paper (arXiv 2512.14291) and does not describe the ASR model. GLM-ASR-Nano is a seq2seq audio-conditioned LM, not a TTS pipeline.
The "$0.03/Mtok" pricing claim: Z.AI's published audio docs page does not currently list a per-token or per-minute price; pricing is shown only inside the dashboard. Citing a number we cannot verify on the public page would be fabrication, so we omit it. Check the Z.AI billing page before committing.
"Beats GPT-4 / Whisper on every benchmark"-style claims: replaced with the actual Open ASR Leaderboard table, which shows GLM-ASR-Nano ahead of Whisper Large v3 / Turbo but behind Canary-Qwen 2.5B and Granite Speech 3.3 8B.

Where this fits in a real engineering stack

If you're shipping a product feature on top of GLM-ASR-Nano — voice notes, meeting transcription, voice-driven agents, Cantonese-language customer-service routing — the integration work is small but the decisions around chunking, VAD, language-hinting, and long-audio reconciliation are where most teams burn weeks. Codersera's bench of vetted remote developers includes ML engineers who have built exactly these pipelines. Related practical guides: OpenClaw + Ollama for local agents and our broader engineering blog.

FAQ

Is GLM-ASR-Nano-2512 actually better than Whisper Large v3?

On the public Open ASR Leaderboard mean (7.03 vs ~7.4) and on Mandarin/Cantonese benchmarks, yes. For pure English on noisy real-world audio, the gap is small and either is defensible. For translation (audio in → English text), Whisper still wins because GLM-ASR-Nano is transcription-only.

How much VRAM do I need?

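~6 GB at BF16, batch=1. A 12 GB consumer card (RTX 3060 12 GB, RTX 4070) usually handles batch=4, though batched decoding can still run out of memory on longer segments; fall back to batch=1 if it does. The MLX 4-bit build needs about 1.2 GB of unified memory on Apple Silicon.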
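
A back-of-envelope check on those figures, counting model weights only. The gap between the weight size and the quoted totals is presumably activations, KV cache, and runtime overhead, but that breakdown is an assumption rather than something the model card states.

```python
def weight_gigabytes(params: float, bits_per_param: int) -> float:
    """Storage for the weights alone, in decimal gigabytes."""
    return params * bits_per_param / 8 / 1e9


PARAMS = 1.5e9  # parameter count quoted for GLM-ASR-Nano-2512

for label, bits in [("BF16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_gigabytes(PARAMS, bits):.2f} GB")
```

BF16 weights come to 3.0 GB, matching the download size, so the ~6 GB working figure leaves about as much again for everything else. The 4-bit weights alone are 0.75 GB against the roughly 1.2 GB reported for the MLX build.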

Can I use it commercially?

Yes. The Hugging Face model is MIT-licensed; the GitHub inference code is Apache-2.0. Both permit commercial use, modification, and redistribution.

Does it support streaming / real-time transcription?

The released artifact is segment-based (≤30 s chunks). Real-time use means chunking on the client side with a VAD; there is no first-class streaming API as of April 2026. For sub-second-latency live transcription, look at Whisper Large v3 Turbo via whisper.cpp or Moonshine.
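
To illustrate the client-side segmentation step, here is a naive energy gate in pure Python. It is a stand-in for a real VAD such as Silero or WebRTC VAD, not an equivalent: it only thresholds per-frame RMS, and the threshold, frame size, and silence length are arbitrary assumptions that would need tuning on real audio.

```python
import math


def speech_segments(
    samples: list[float],
    rate: int,
    frame_ms: int = 30,
    threshold: float = 0.02,
    min_silence_frames: int = 10,
) -> list[tuple[float, float]]:
    """Return (start, end) times in seconds of regions whose frame RMS exceeds the threshold.

    A segment closes once `min_silence_frames` consecutive quiet frames are seen.
    """
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    segments = []
    start = None  # index of the first loud frame in the open segment
    silence = 0   # consecutive quiet frames since the last loud one
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        if rms >= threshold:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                end = i - silence + 1  # first quiet frame after the speech
                segments.append((start * frame_len / rate, end * frame_len / rate))
                start, silence = None, 0
    if start is not None:  # audio ended while a segment was still open
        end = n_frames - silence
        segments.append((start * frame_len / rate, end * frame_len / rate))
    return segments


# Synthetic signal: 0.5 s of silence, 1 s of constant "speech", 1.5 s of silence.
signal = [0.0] * 500 + [0.5] * 1000 + [0.0] * 1500
print(speech_segments(signal, rate=1000, frame_ms=100))
```

Each returned segment can then be sent to the model as its own request, split further if it runs past 30 s.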

Can I fine-tune it on domain data?

Yes. The model exposes the standard Transformers AutoModelForSeq2SeqLM interface, so PEFT/LoRA fine-tuning works out of the box. Z.AI's GitHub repo includes example training scripts, and the MIT license permits redistribution of fine-tunes.

How is this different from GLM-4-Voice?

GLM-4-Voice (October 2024) is an end-to-end speech-to-speech LLM — it takes audio in and produces audio out, suitable for voice agents. GLM-ASR-Nano-2512 (December 2025) is transcription-only: audio in, text out. They are complementary, not competitors.

Should I use the hosted Z.AI API or run locally?

Hosted is fine for short clips (≤30 s, ≤25 MB) where you want zero ops. Run locally if you have long-form audio, regulated data, or want to fine-tune. The vllm serve route gives you an OpenAI-compatible endpoint your existing clients can hit unchanged.

Which languages actually work well?

The model card lists 17 languages with WER ≤ 20%. The strongest tier is Mandarin, Cantonese, English. Solid second tier: Sichuanese, Min Nan, Wu, French, German, Japanese, Korean, Spanish, Arabic. For other languages benchmark on a small sample of your own audio before committing.
