openai-whisper with a CUDA-matched PyTorch on Linux/Windows, or mlx-whisper on Apple Silicon. The current top open model is large-v3 (1.55B params, ~10 GB VRAM); large-v3-turbo runs in ~6 GB and is roughly 5x faster.If you searched for "Whisper Large V4," here's the honest answer up front: as of 2026, OpenAI has not released a model named v4. The newest and most capable open Whisper checkpoints are large-v3 (released November 2023) and the faster large-v3-turbo (released October 2024). This guide shows you how to install and run whichever of those fits your hardware, fully offline, on Linux, Windows, or a Mac with Apple Silicon.
Running Whisper locally means your audio never leaves your machine — no API keys, no per-minute billing, no upload limits. That matters for legal recordings, medical dictation, internal meetings, and anything you'd rather not ship to a third-party cloud.
What hardware do you need to run Whisper Large locally?
The largest Whisper model is a 1.55-billion-parameter encoder-decoder. It's far smaller than a modern LLM, so a mid-range GPU handles it comfortably. The decisive factor is VRAM (on an Nvidia GPU) or unified memory (on a Mac).
| Model | Parameters | Approx. VRAM (FP16) | Relative speed | Best for |
|---|---|---|---|---|
tiny | 39M | ~1 GB | Fastest | Quick drafts, low-resource machines |
base | 74M | ~1 GB | Very fast | Short clips, real-time use |
small | 244M | ~2 GB | Fast | Good accuracy/speed balance |
medium | 769M | ~5 GB | Moderate | Solid quality without a big GPU |
large-v3-turbo | 809M | ~6 GB | ~5x faster than large | Near-large quality, much faster |
large-v3 | 1.55B | ~10 GB | Baseline | Maximum transcription accuracy |
Notes that save real headaches:
- VRAM scales with audio length and beam size. The ~10 GB figure for
large-v3is a practical production guideline; a short clip on default settings fits in less. Long files and higher beam sizes push it up. - Quantization cuts memory hard. With
faster-whisperorwhisper.cpp, INT8 or 4-bit weights droplarge-v3into the 1.5–4 GB range, which is what makes it run on 8 GB cards and laptops. - Apple Silicon uses unified memory.
large-v3-turboruns comfortably on an M1 with 16 GB of RAM via MLX — no discrete GPU needed. - CPU-only works but is slow. You can run any model on CPU; expect transcription to take longer than the audio's real-time length on
large.
How do you install Whisper on Linux or Windows?
The reference implementation is the openai-whisper Python package. The single most common mistake is installing PyTorch after Whisper, which pulls a CPU-only wheel and silently ignores your GPU. Install a CUDA-matched PyTorch first.
1. Use a supported Python version
Whisper works on Python 3.10–3.12. Avoid 3.13 for now — several dependencies don't have wheels for it yet. Create an isolated environment:
python -m venv whisper-env
# Linux / macOS
source whisper-env/bin/activate
# Windows
whisper-env\Scripts\activate2. Install a CUDA-matched PyTorch first
Pick the index URL that matches your installed CUDA toolkit (check the PyTorch site for the current matrix). For CUDA 12.1:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu1213. Install Whisper and FFmpeg
pip install -U openai-whisperWhisper shells out to FFmpeg to decode audio, so it has to be on your PATH:
# Ubuntu / Debian
sudo apt update && sudo apt install ffmpeg
# Windows (Chocolatey)
choco install ffmpeg4. Confirm the GPU is visible
python -c "import torch; print(torch.cuda.is_available())"If that prints False, your PyTorch wheel is CPU-only — reinstall it with the correct CUDA index URL before going further.
How do you run Whisper on a Mac with Apple Silicon?
On M-series Macs the fastest path is MLX Whisper, Apple's own array framework port that runs natively on the Neural Engine and GPU through Metal. It downloads models from Hugging Face and caches them automatically.
pip install mlx-whisper
# Transcribe with the turbo model (recommended on Apple Silicon)
mlx_whisper audio.mp3 --model mlx-community/whisper-large-v3-turboReal-world speed is excellent: on an M1 MacBook Pro a one-hour recording transcribes in a few minutes with a mid-size model, and an M4 Max runs roughly 10x faster than real time. The large-v3-turbo model fits on a 16 GB M1, which covers most developer laptops.
You can also run the standard openai-whisper package on a Mac — it uses Apple's Metal Performance Shaders (MPS) backend — but MLX is generally faster and more memory-efficient on Apple hardware, so reach for mlx-whisper first.
How do you transcribe your first audio file?
Once installed, the CLI is a one-liner. The first run downloads the model weights (cached to ~/.cache/whisper or ~/.cache/huggingface for MLX), so subsequent runs are fully offline.
Command line
# Maximum accuracy on GPU
whisper meeting.mp3 --model large-v3 --device cuda --language English
# Faster, near-large quality
whisper meeting.mp3 --model large-v3-turbo --device cuda
# Write every output format (txt, srt, vtt, json, tsv)
whisper interview.wav --model large-v3-turbo --output_format allPython
import whisper
model = whisper.load_model("large-v3-turbo") # downloads on first call
result = model.transcribe("podcast.mp3")
print(result["text"])
# Segment-level timestamps for subtitles or search
for seg in result["segments"]:
print(f"[{seg['start']:.1f}s -> {seg['end']:.1f}s] {seg['text']}")Whisper auto-detects the spoken language, but passing --language explicitly is faster and avoids mis-detection on short or accented clips. Note that large-v3-turbo is optimised for transcription, not translation — for --task translate into English, stick with medium or large-v3.
How can you make local Whisper inference faster?
The stock openai-whisper package is the reference, not the fastest runtime. Two re-implementations are dramatically quicker for the same weights and accuracy.
faster-whisper (CTranslate2)
faster-whisper swaps Whisper's PyTorch runtime for CTranslate2, a C++ inference engine. It's up to 4x faster than the reference implementation at the same accuracy, using less memory — and INT8 quantization on CPU or GPU pushes that further.
pip install faster-whisperfrom faster_whisper import WhisperModel
# int8_float16 = fast on GPU with minimal accuracy loss
model = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")
segments, info = model.transcribe("audio.mp3", beam_size=5)
for seg in segments:
print(f"[{seg.start:.1f}s -> {seg.end:.1f}s] {seg.text}")whisper.cpp (GGML)
whisper.cpp is a dependency-free C/C++ port with GGML quantization and broad CPU/accelerator support — ideal when you want a single self-contained binary, want to run on a CPU-only server, or want to embed transcription in another app. It supports 4-bit and 8-bit quantized models; INT8 is usually the safe midpoint, while 4-bit can shave accuracy off long, noisy recordings.
Rule of thumb: faster-whisper for the best GPU throughput in Python, whisper.cpp for portable CPU deployment and edge devices, mlx-whisper for Apple Silicon.
Which model size and quantization should you pick for your GPU?
Match the model to your VRAM tier, then quantize if you need more headroom:
- 8 GB GPU (e.g. RTX 3060 / 4060): Run
large-v3-turboin FP16, orlarge-v3in INT8 via faster-whisper. - 10–12 GB GPU:
large-v3in FP16 fits with room for longer files. - 16 GB+ GPU:
large-v3with larger beam sizes and batching for higher throughput. - CPU-only / low RAM:
whisper.cppwith an INT8 or 4-bit model, or drop tomedium. - Apple Silicon (16 GB):
large-v3-turbovia MLX.
On quantization: INT8 typically costs you almost nothing in accuracy and roughly halves memory. 4-bit is more aggressive — fine for clean speech, riskier for noisy audio or rare languages. When in doubt, start at INT8 and only go lower if you're memory-constrained.
How do you fix common install and CUDA/MPS errors?
torch.cuda.is_available()is False. Your PyTorch wheel is CPU-only. Uninstall and reinstall with the CUDA index URL matching your toolkit.FileNotFoundError/ FFmpeg not found. FFmpeg isn't on your PATH. Install it system-wide and restart your shell.CUDA out of memory. Drop tolarge-v3-turbo, switch to faster-whisper withcompute_type="int8", lowerbeam_size, or split long audio into chunks.- Pip resolver conflicts on Python 3.13. Recreate the venv with Python 3.10–3.12; not all deps ship 3.13 wheels yet.
- Slow or stuck first run. That's the one-time model download (the large models are ~1.5 GB). It's cached after the first call.
- MPS errors on Mac. If the standard package misbehaves on the Metal backend, switch to
mlx-whisper— it's purpose-built for Apple Silicon and sidesteps most MPS edge cases.
When does running Whisper locally beat a cloud STT API?
Local and cloud each win different scenarios. Run Whisper yourself when:
- Privacy is non-negotiable — audio with PII, legal, or health data that can't leave your infrastructure.
- Volume is high and steady — once you own the hardware, transcription is effectively free, unlike per-minute API pricing.
- You need offline or air-gapped operation — field recording, secure environments, on-device apps.
- You want full control — custom vocabulary, fine-tuning, and pipeline integration without rate limits.
Lean cloud when you need scale-to-zero with no GPU to maintain, the absolute lowest latency from a managed endpoint, or speaker diarization and other features bundled into a hosted product. Many teams do both: local for the bulk private workload, cloud for spiky overflow.
Standing up a private transcription pipeline — Whisper plus the storage, queuing, and serving around it — is real engineering work. If you'd rather move faster, Codersera helps companies hire vetted remote developers and extend their engineering teams with machine-learning and infrastructure experience, so you can ship a production STT stack without a long hiring cycle.
FAQ
Is there a Whisper Large V4?
No. As of 2026, OpenAI has not released a Whisper model called v4. The latest open checkpoints are large-v3 (November 2023) and large-v3-turbo (October 2024). If you want the newest and most accurate Whisper, use large-v3; for a much faster model with nearly the same quality, use large-v3-turbo.
How much VRAM do I need to run Whisper Large locally?
About 10 GB of VRAM for large-v3 in FP16 in production scenarios, or ~6 GB for large-v3-turbo. With INT8 or 4-bit quantization through faster-whisper or whisper.cpp, you can fit the large model into roughly 1.5–4 GB, which makes 8 GB GPUs and many laptops viable.
Can I run Whisper on a Mac without a GPU?
Yes. Apple Silicon Macs run Whisper through MLX, which uses the GPU and Neural Engine via Metal. Install mlx-whisper and run large-v3-turbo — it works comfortably on a 16 GB M1 and is far faster than CPU-only inference.
What's the difference between openai-whisper, faster-whisper, and whisper.cpp?
They run the same model weights with different engines. openai-whisper is the PyTorch reference. faster-whisper uses CTranslate2 and is up to 4x faster with less memory. whisper.cpp is a dependency-free C/C++ port with GGML quantization, best for portable CPU and edge deployments.
Does Whisper run fully offline?
Yes. After the one-time model download on first run, Whisper transcribes entirely on your machine with no network calls. That's the main reason to run it locally — your audio never leaves your hardware.
Which model should I use for the best accuracy?
Use large-v3 for maximum transcription accuracy, especially for multilingual or noisy audio. If you need speed and have a smaller GPU, large-v3-turbo keeps most of that accuracy while running about 5x faster. For English-only short clips on weak hardware, small or medium are often enough.