Whisper

Run Whisper Large Locally: Setup Guide (2026)

Install and run OpenAI Whisper's largest model locally for private, offline transcription — VRAM requirements, pip and Apple Silicon setup, faster-whisper, and quantization.

Published 03 Jun 2026 • Updated 03 Jun 2026 • 7 min read

Quick answer. To run OpenAI Whisper's largest model locally, install openai-whisper with a CUDA-matched PyTorch on Linux/Windows, or mlx-whisper on Apple Silicon. The current top open model is large-v3 (1.55B params, ~10 GB VRAM); large-v3-turbo runs in ~6 GB and is roughly 5x faster.

If you searched for "Whisper Large V4," here's the honest answer up front: as of 2026, OpenAI has not released a model named v4. The newest and most capable open Whisper checkpoints are large-v3 (released November 2023) and the faster large-v3-turbo (released October 2024). This guide shows you how to install and run whichever of those fits your hardware, fully offline, on Linux, Windows, or a Mac with Apple Silicon.

Running Whisper locally means your audio never leaves your machine — no API keys, no per-minute billing, no upload limits. That matters for legal recordings, medical dictation, internal meetings, and anything you'd rather not ship to a third-party cloud.

What hardware do you need to run Whisper Large locally?

The largest Whisper model is a 1.55-billion-parameter encoder-decoder. It's far smaller than a modern LLM, so a mid-range GPU handles it comfortably. The decisive factor is VRAM (on an Nvidia GPU) or unified memory (on a Mac).

Model	Parameters	Approx. VRAM (FP16)	Relative speed	Best for
`tiny`	39M	~1 GB	Fastest	Quick drafts, low-resource machines
`base`	74M	~1 GB	Very fast	Short clips, real-time use
`small`	244M	~2 GB	Fast	Good accuracy/speed balance
`medium`	769M	~5 GB	Moderate	Solid quality without a big GPU
`large-v3-turbo`	809M	~6 GB	~5x faster than large	Near-large quality, much faster
`large-v3`	1.55B	~10 GB	Baseline	Maximum transcription accuracy

Notes that save real headaches:

VRAM scales with audio length and beam size. The ~10 GB figure for large-v3 is a practical production guideline; a short clip on default settings fits in less. Long files and higher beam sizes push it up.
Quantization cuts memory hard. With faster-whisper or whisper.cpp, INT8 or 4-bit weights drop large-v3 into the 1.5–4 GB range, which is what makes it run on 8 GB cards and laptops.
Apple Silicon uses unified memory. large-v3-turbo runs comfortably on an M1 with 16 GB of RAM via MLX — no discrete GPU needed.
CPU-only works but is slow. You can run any model on CPU; expect transcription to take longer than the audio's real-time length on large.

How do you install Whisper on Linux or Windows?

The reference implementation is the openai-whisper Python package. The single most common mistake is installing PyTorch after Whisper, which pulls a CPU-only wheel and silently ignores your GPU. Install a CUDA-matched PyTorch first.

1. Use a supported Python version

Whisper works on Python 3.10–3.12. Avoid 3.13 for now — several dependencies don't have wheels for it yet. Create an isolated environment:

python -m venv whisper-env
# Linux / macOS
source whisper-env/bin/activate
# Windows
whisper-env\Scripts\activate

2. Install a CUDA-matched PyTorch first

Pick the index URL that matches your installed CUDA toolkit (check the PyTorch site for the current matrix). For CUDA 12.1:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

3. Install Whisper and FFmpeg

pip install -U openai-whisper

Whisper shells out to FFmpeg to decode audio, so it has to be on your PATH:

# Ubuntu / Debian
sudo apt update && sudo apt install ffmpeg
# Windows (Chocolatey)
choco install ffmpeg

4. Confirm the GPU is visible

python -c "import torch; print(torch.cuda.is_available())"

If that prints False, your PyTorch wheel is CPU-only — reinstall it with the correct CUDA index URL before going further.

How do you run Whisper on a Mac with Apple Silicon?

On M-series Macs the fastest path is MLX Whisper, Apple's own array framework port that runs natively on the Neural Engine and GPU through Metal. It downloads models from Hugging Face and caches them automatically.

pip install mlx-whisper
# Transcribe with the turbo model (recommended on Apple Silicon)
mlx_whisper audio.mp3 --model mlx-community/whisper-large-v3-turbo

Real-world speed is excellent: on an M1 MacBook Pro a one-hour recording transcribes in a few minutes with a mid-size model, and an M4 Max runs roughly 10x faster than real time. The large-v3-turbo model fits on a 16 GB M1, which covers most developer laptops.

You can also run the standard openai-whisper package on a Mac — it uses Apple's Metal Performance Shaders (MPS) backend — but MLX is generally faster and more memory-efficient on Apple hardware, so reach for mlx-whisper first.

How do you transcribe your first audio file?

Once installed, the CLI is a one-liner. The first run downloads the model weights (cached to ~/.cache/whisper or ~/.cache/huggingface for MLX), so subsequent runs are fully offline.

Command line

# Maximum accuracy on GPU
whisper meeting.mp3 --model large-v3 --device cuda --language English

# Faster, near-large quality
whisper meeting.mp3 --model large-v3-turbo --device cuda

# Write every output format (txt, srt, vtt, json, tsv)
whisper interview.wav --model large-v3-turbo --output_format all

Python

import whisper

model = whisper.load_model("large-v3-turbo")  # downloads on first call
result = model.transcribe("podcast.mp3")

print(result["text"])
# Segment-level timestamps for subtitles or search
for seg in result["segments"]:
    print(f"[{seg['start']:.1f}s -> {seg['end']:.1f}s] {seg['text']}")

Whisper auto-detects the spoken language, but passing --language explicitly is faster and avoids mis-detection on short or accented clips. Note that large-v3-turbo is optimised for transcription, not translation — for --task translate into English, stick with medium or large-v3.

How can you make local Whisper inference faster?

The stock openai-whisper package is the reference, not the fastest runtime. Two re-implementations are dramatically quicker for the same weights and accuracy.

faster-whisper (CTranslate2)

faster-whisper swaps Whisper's PyTorch runtime for CTranslate2, a C++ inference engine. It's up to 4x faster than the reference implementation at the same accuracy, using less memory — and INT8 quantization on CPU or GPU pushes that further.

pip install faster-whisper

from faster_whisper import WhisperModel

# int8_float16 = fast on GPU with minimal accuracy loss
model = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")
segments, info = model.transcribe("audio.mp3", beam_size=5)

for seg in segments:
    print(f"[{seg.start:.1f}s -> {seg.end:.1f}s] {seg.text}")

whisper.cpp (GGML)

whisper.cpp is a dependency-free C/C++ port with GGML quantization and broad CPU/accelerator support — ideal when you want a single self-contained binary, want to run on a CPU-only server, or want to embed transcription in another app. It supports 4-bit and 8-bit quantized models; INT8 is usually the safe midpoint, while 4-bit can shave accuracy off long, noisy recordings.

Rule of thumb: faster-whisper for the best GPU throughput in Python, whisper.cpp for portable CPU deployment and edge devices, mlx-whisper for Apple Silicon.

Which model size and quantization should you pick for your GPU?

Match the model to your VRAM tier, then quantize if you need more headroom:

8 GB GPU (e.g. RTX 3060 / 4060): Run large-v3-turbo in FP16, or large-v3 in INT8 via faster-whisper.
10–12 GB GPU: large-v3 in FP16 fits with room for longer files.
16 GB+ GPU: large-v3 with larger beam sizes and batching for higher throughput.
CPU-only / low RAM: whisper.cpp with an INT8 or 4-bit model, or drop to medium.
Apple Silicon (16 GB): large-v3-turbo via MLX.

On quantization: INT8 typically costs you almost nothing in accuracy and roughly halves memory. 4-bit is more aggressive — fine for clean speech, riskier for noisy audio or rare languages. When in doubt, start at INT8 and only go lower if you're memory-constrained.

How do you fix common install and CUDA/MPS errors?

torch.cuda.is_available() is False. Your PyTorch wheel is CPU-only. Uninstall and reinstall with the CUDA index URL matching your toolkit.
FileNotFoundError / FFmpeg not found. FFmpeg isn't on your PATH. Install it system-wide and restart your shell.
CUDA out of memory. Drop to large-v3-turbo, switch to faster-whisper with compute_type="int8", lower beam_size, or split long audio into chunks.
Pip resolver conflicts on Python 3.13. Recreate the venv with Python 3.10–3.12; not all deps ship 3.13 wheels yet.
Slow or stuck first run. That's the one-time model download (the large models are ~1.5 GB). It's cached after the first call.
MPS errors on Mac. If the standard package misbehaves on the Metal backend, switch to mlx-whisper — it's purpose-built for Apple Silicon and sidesteps most MPS edge cases.

When does running Whisper locally beat a cloud STT API?

Local and cloud each win different scenarios. Run Whisper yourself when:

Privacy is non-negotiable — audio with PII, legal, or health data that can't leave your infrastructure.
Volume is high and steady — once you own the hardware, transcription is effectively free, unlike per-minute API pricing.
You need offline or air-gapped operation — field recording, secure environments, on-device apps.
You want full control — custom vocabulary, fine-tuning, and pipeline integration without rate limits.

Lean cloud when you need scale-to-zero with no GPU to maintain, the absolute lowest latency from a managed endpoint, or speaker diarization and other features bundled into a hosted product. Many teams do both: local for the bulk private workload, cloud for spiky overflow.

📘

Want the bigger picture on running models on your own hardware — GPU sizing, quantization, serving, and cost trade-offs? Read our complete guide to self-hosting LLMs (2026).

Standing up a private transcription pipeline — Whisper plus the storage, queuing, and serving around it — is real engineering work. If you'd rather move faster, Codersera helps companies hire vetted remote developers and extend their engineering teams with machine-learning and infrastructure experience, so you can ship a production STT stack without a long hiring cycle.

FAQ

Is there a Whisper Large V4?

No. As of 2026, OpenAI has not released a Whisper model called v4. The latest open checkpoints are large-v3 (November 2023) and large-v3-turbo (October 2024). If you want the newest and most accurate Whisper, use large-v3; for a much faster model with nearly the same quality, use large-v3-turbo.

How much VRAM do I need to run Whisper Large locally?

About 10 GB of VRAM for large-v3 in FP16 in production scenarios, or ~6 GB for large-v3-turbo. With INT8 or 4-bit quantization through faster-whisper or whisper.cpp, you can fit the large model into roughly 1.5–4 GB, which makes 8 GB GPUs and many laptops viable.

Can I run Whisper on a Mac without a GPU?

Yes. Apple Silicon Macs run Whisper through MLX, which uses the GPU and Neural Engine via Metal. Install mlx-whisper and run large-v3-turbo — it works comfortably on a 16 GB M1 and is far faster than CPU-only inference.

What's the difference between openai-whisper, faster-whisper, and whisper.cpp?

They run the same model weights with different engines. openai-whisper is the PyTorch reference. faster-whisper uses CTranslate2 and is up to 4x faster with less memory. whisper.cpp is a dependency-free C/C++ port with GGML quantization, best for portable CPU and edge deployments.

Does Whisper run fully offline?

Yes. After the one-time model download on first run, Whisper transcribes entirely on your machine with no network calls. That's the main reason to run it locally — your audio never leaves your hardware.

Which model should I use for the best accuracy?

Use large-v3 for maximum transcription accuracy, especially for multilingual or noisy audio. If you need speed and have a smaller GPU, large-v3-turbo keeps most of that accuracy while running about 5x faster. For English-only short clips on weak hardware, small or medium are often enough.