DeepSeek V4 Flash: The Practical Deep Dive Into the Cheap, Fast Open-Weights Model Everyone Slept On

DeepSeek V4 Flash is the under-covered story of the V4 release. 1M context, 47 on the AA Intelligence Index, $0.14 input / $0.28 output per million tokens, and it fits on a Mac Studio. Here is the full practical guide.

The DeepSeek V4 launch on April 24, 2026 was framed almost everywhere as a Pro story. The headlines were about the 16T parameter MoE on top of the leaderboard, the Codeforces rating north of 3,200, the Terminal-Bench numbers nipping at the heels of Claude Opus 4.7. But the model that actually changes how most engineering teams should think about LLM spend this week is the smaller sibling: DeepSeek V4 Flash.

Flash is 284B total / 13B active, 1M context, MIT-licensed, open weights on Hugging Face, and priced at $0.14 input / $0.28 output per million tokens on the first-party API. Cache-hit input drops to $0.014/M, a 90 percent discount that turns chat and RAG workloads into rounding errors on your bill. Latent Space called it "Legit 1M context for pennies." The HN top comment from gertlabs put it more bluntly: "DeepSeek V4 Flash is the model to pay attention to here. It's cheap, effective, and REALLY fast."

This is the canonical engineer-facing guide to DeepSeek V4 Flash: what it actually is, what it costs, where it shines, where it falls short, how to call it from your stack, and how to run it on a desk-sized rig tonight. If you only read one piece on the V4 release this week, this should probably be it. For a head-to-head with the bigger model, see our DeepSeek V4 Pro vs Flash comparison.

What DeepSeek V4 Flash actually is

Released April 24, 2026 alongside the V4-Pro preview, V4-Flash is the second model in DeepSeek's V4 family. It is not a distillation of Pro. It shares the architecture family but is its own pretraining run on 32T+ tokens, with the same MIT license and full open weights published on Hugging Face.

The headline numbers:

  • 284B total parameters, 13B active per token (Mixture-of-Experts).
  • 1M token context window, identical to V4-Pro.
  • FP8 mixed precision deployment weights at roughly 158 GB on disk.
  • Reasoning modes: High and xHigh (community-tagged Flash-Max in evaluation tables).
  • Text-only. No image input.

Architecturally, Flash uses the same hybrid as Pro: Compressed Sparse Attention (CSA), DeepSeek Sparse Attention (DSA), and Heavily Compressed Attention (HCA) stacked together, plus Manifold-Constrained Hyper-Connections (mHC) and the Muon optimizer. What is different is the compression aggressiveness. At 1M context, Flash uses approximately 10 percent of V3.2's per-token FLOPs and 7 percent of its KV cache, while Pro sits at 27 percent FLOPs and 10 percent KV cache. That is the trade: more compression, more speed, fewer dollars, and a measurable but small intelligence delta.

The naming is misleading. "Flash" in OpenAI and Google's product lines historically meant a small distilled model with sharply lower capability. DeepSeek V4 Flash is something different. It is a smaller-active-params MoE in the same family, and on most benchmarks it sits within 1-3 percentage points of Pro. The big exceptions are agentic tool-use, web browsing, factual recall, and hallucination resistance, all covered later.

The DeepSeek V4 Flash pricing story

Pricing is where Flash earns its place in your stack. The first-party DeepSeek API publishes the following rates per million tokens.

Tier | Input | Output
V4-Flash sticker | $0.14 | $0.28
V4-Flash cache-hit input | $0.014 (90% off) | n/a
V4-Flash off-peak (community-reported nighttime) | ~$0.07 | ~$0.14

The cache-hit lane is the killer feature. Any chat, agent, or RAG workload with a stable system prompt and reusable context collapses input cost to $0.014 per million tokens. That is a 7-10x savings against the sticker, and it is automatic; the API handles cache management for you.

The HN cost worked example doing the rounds is concrete: 40M tokens per month on Flash costs roughly $30-$70 depending on your input/output mix. Apply an 80 percent cache hit rate, blended cost drops to ~$0.04/M, and that 40M tokens drops to about $1.60. There is no closed model in the same intelligence band that comes within an order of magnitude of that number.
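If you want to sanity-check that arithmetic yourself, it fits in a few lines of Python. A minimal sketch using the published rates quoted above; the 80 percent cache hit rate and input-dominated mix are assumptions matching the HN example:

# Published first-party rates, USD per million tokens.
INPUT_STICKER = 0.14
INPUT_CACHE_HIT = 0.014
OUTPUT = 0.28

def blended_input_rate(cache_hit_rate: float) -> float:
    """Effective $/M for input tokens at a given prompt-cache hit rate."""
    return cache_hit_rate * INPUT_CACHE_HIT + (1 - cache_hit_rate) * INPUT_STICKER

print(round(blended_input_rate(0.80), 4))       # 0.0392 -> the "~$0.04/M" blended figure
print(round(40 * blended_input_rate(0.80), 2))  # 1.57 -> the "~$1.60" for an input-dominated 40M/month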

For comparison, V4-Pro's API rates are roughly $0.50 input / $2.00 output per million during the May promo, and revert to $1.74 / $3.48 after May 31, 2026. Even before caching, Flash is between 7x and 12x cheaper depending on your token mix. With cache hits, the gap stretches further.

V4 Flash benchmarks: what it actually scores

Most coverage frames Flash as Pro-minus-something. That is unfair, because Flash on its own is a top-tier open-weights model. Here is the headline scoreboard from Artificial Analysis and the model card.

Benchmark | V4-Flash Max | Note
Artificial Analysis Intelligence Index | 47 | #9 of 78 globally tracked, open-weights champion in tier
MMLU-Pro | 86.2 | 1.3pp behind Pro
GPQA Diamond | 88.1 | ~2pp behind Pro
LiveCodeBench Pass@1 | 91.6 | 1.9pp behind Pro
SWE-bench Verified | 79.0 | 1.6pp behind Pro
Codeforces rating | 3,052 | 154 below Pro, well above human grandmaster
Terminal-Bench 2.0 | 56.9 | 11pp behind Pro, Flash's big weakness
HMMT 2026 Feb | 94.8 | 0.4pp behind Pro
IMOAnswerBench | 88.4 | 1.4pp behind Pro
HLE (Humanity's Last Exam) | ~34.6 | ~3pp behind Pro
MRCR @ 1M (MMR) | 78.7 | 4.8pp behind Pro
CorpusQA @ 1M | 60.5 | 1.5pp behind Pro
AA-Omniscience accuracy | -23 | Pro: -10
AA-Omniscience hallucination rate | 96% | Pro: 94%
BrowseComp | ~73 | 10.2pp behind Pro
SimpleQA-Verified | 34.1 | 23.8pp behind Pro
GDPval-AA | 1388 | 166 below Pro

What the gap to Pro actually means in practice

5 points on the AA Intelligence Index sounds large until you decompose where the gap lives. Look at the coding column: 91.6 LiveCodeBench Pass@1 and 79.0 SWE-bench Verified. Flash is 1.9pp and 1.6pp behind Pro on the two benchmarks that map most directly to what coding agents actually do. At IDE-task scale that delta is invisible. You will not feel it on a Cursor or Copilot replacement workload.

Where you will feel it: Terminal-Bench 2.0 (11pp gap) and SimpleQA-Verified (23.8pp gap). Translation: long-horizon agentic loops with chained tool calls, and pure factual recall, are the two tasks where Pro is meaningfully better. Everything in between is closer than the headline number suggests.

Latent Space summed it up: "Flash@max ≈ Pro@high on reasoning tasks." If you are not running tasks that need the maximum reasoning budget, the gap shrinks again.

Speed: 81 t/s on the first-party API

Speed is the other pillar of the Flash pitch. Artificial Analysis's provider page reports the following:

Provider | Output t/s | TTFT
Novita | 85.5 | 67.23 s
SiliconFlow (FP8) | 83.7 | 68.42 s
DeepSeek (1st party) | 81.3 | 70.07 s

The TTFT figures include the full reasoning budget. For non-reasoning Flash on the DeepSeek API the TTFT is roughly 1.05 seconds, per AA's non-reasoning page. For comparison, V4-Pro on the same first-party API runs around 35 t/s. Flash is roughly 2.3x faster end-to-end than Pro on identical infrastructure.

Production adoption is already real. OpenRouter is reporting daily volume of 29.2B prompt + 530M completion + 533M reasoning tokens on Flash. That is not a curiosity model; that is a pile of pipelines already moved over.

Cost-per-task math: $113 for the AA Index suite

The most quotable cost number from the V4 release comes from Artificial Analysis itself: completing the full AA Intelligence Index suite cost $113 on V4-Flash vs $1,071 on V4-Pro. That is a 9.5x cost reduction for a 5pp Index drop. If your workload does not need the last 5 points of intelligence, why pay 9.5x for them?

Mapping that to typical workloads:

  • Small SaaS using LLM for chat/RAG, ~10M tokens/month: ~$2-5/month on Flash with caching. The same workload on Claude Opus 4.7 would be $50-150.
  • Mid-size product, ~40M tokens/month: $30-70 sticker on Flash, drops to ~$1.60 with 80% cache hit rate. Pro on the same volume sits north of $400.
  • High-throughput backend pipeline, ~500M tokens/month: $500-1,200 on Flash. Negotiate enterprise rates and you are firmly under $0.10/M blended.

For most teams the right mental model is: use Flash by default, route the small slice of requests where Pro's edge actually matters (long-horizon agents, complex factual lookup) to Pro. We dig into that routing pattern in our DeepSeek V4 Pro vs Flash comparison.
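A minimal sketch of that routing pattern in Python; the task labels and escalation set are illustrative assumptions, not anything DeepSeek ships:

# Default to Flash; escalate only the tasks where Pro's benchmark edge is real
# (long-horizon agents, browsing, factual recall). Labels are illustrative.
FLASH = "deepseek-v4-flash"
PRO = "deepseek-v4-pro"

ESCALATE = {"long_horizon_agent", "web_browsing", "factual_lookup"}

def pick_model(task_type: str) -> str:
    return PRO if task_type in ESCALATE else FLASH

assert pick_model("code_review") == FLASH
assert pick_model("web_browsing") == PRO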

Run DeepSeek V4 Flash locally: the Mac Studio story

Flash is the first frontier-adjacent open-weights model that fits on a piece of hardware you can put on a desk. That is, by itself, a meaningful step. The full table:

Quant | Disk | Min pooled memory | Hardware | Reported t/s
FP16/BF16 | ~568 GB | ~600 GB | server only | n/a
Q8_0 | ~301 GB | ~315 GB | 4x H100 / dual H200 | n/a
Q5_K_M | ~200 GB | ~210 GB | dual H100 80GB | n/a
Q4_K_M (recommended) | ~158 GB | ~96 GB | RTX PRO 6000 96GB / Mac Studio M4 Max 192GB / dual H100 | 25-90
Q3_K_M | ~125 GB | ~80 GB | dual RTX 5090 64GB w/ RAM offload | 22-30
IQ2_XS | ~90 GB | ~96 GB | hobby tier | n/a (quality drop on reasoning)

Headline rigs in plain English:

  • Mac Studio M4 Max 192 GB ($5,999): Q4_K_M MLX gets you 25-35 t/s with 64K-128K context comfortably. This is the rig that makes Flash a desktop model.
  • 128 GB M5 MacBook Pro: lightly-quantized Flash is possible but tight. Simon Willison wrote that he is "hoping" his MacBook can run it; as of writing the answer is roughly yes-with-caveats.
  • RTX PRO 6000 96 GB (~$11K): 45-60 t/s. Single card, production-tier, no NVLink hassles.
  • Dual H100 PCIe 80 GB ($60K+): 60-90 t/s. The high end if you are running it as a shared internal service.

Hard floor: 90 GB pooled memory. Anything below that pages to disk and the model starts hallucinating in ways that are not reflected in benchmark scores. Do not try to run Flash on a 64 GB rig and judge the model on what comes out.
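If you want the table above as code, a toy helper that maps pooled memory to the largest quant that fits might look like this. The numbers are copied from the community-reported table, so treat them as estimates, not guarantees:

# Min pooled memory (GB) per quant, best quality first (values from the table above).
# Q3_K_M and IQ2_XS omitted: they sit at or below the quality/memory floor.
QUANT_MIN_MEMORY_GB = [
    ("FP16/BF16", 600),
    ("Q8_0", 315),
    ("Q5_K_M", 210),
    ("Q4_K_M", 96),   # the recommended tier
]

def largest_safe_quant(pooled_gb: int) -> str | None:
    """Return the best quant that fits in pooled memory, or None below the floor."""
    if pooled_gb < 90:  # hard floor: below this the model pages to disk and degrades
        return None
    for name, min_gb in QUANT_MIN_MEMORY_GB:
        if pooled_gb >= min_gb:
            return name
    return None

print(largest_safe_quant(192))  # Q4_K_M: the Mac Studio sweet spot
print(largest_safe_quant(64))   # None: do not judge the model on this rig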

Tooling on day 0

Day-0 tooling support is already broad: official FP8 weights on Hugging Face, community quants including the unsloth Q4_K_M MLX port, local serving via mlx-lm and vLLM, and hosted access through the first-party API, OpenRouter, Novita, and SiliconFlow. The setup checklists later in this piece walk through each path.

Eight use cases where DeepSeek V4 Flash is the right answer

This is the spine of the article. For each, the question is the same: who is this for, what does Flash do, and why does it beat the alternatives?

1. IDE coding agent / Copilot replacement

Who: teams running internal Cursor-style tooling, or shipping a Copilot-class feature in their own product. Why Flash wins: 91.6 LiveCodeBench Pass@1, 79.0 SWE-bench Verified, 12x cheaper than Pro on the same tokens. The 1.9pp and 1.6pp gaps to Pro are functionally invisible at IDE-task scale. Cost ratio is brutal in your favor. If you are hiring engineers to build this kind of tooling, our TypeScript and Python talent pools cover most of the day-to-day stack.

2. Bulk text classification, summarization, data labeling

Who: data engineering teams running backend pipelines, log analysis at scale, code review automation. Why Flash wins: 81 t/s output and $0.14/$0.28 sticker pricing makes Flash an order of magnitude cheaper than Pro and frontier closed models. For a pipeline running tens of millions of tokens per day, the math is unambiguous. If you need Go or Rust engineers to wire up the throughput, Codersera has both.

3. Customer service chatbots and RAG over enterprise corpus

Who: any team running a chat product with a long-lived system prompt and document context. Why Flash wins: cache-hit input at $0.014/M is the killer. Your system prompt, retrieved chunks, and chat history all benefit from prompt caching, and the blended input cost on a well-tuned RAG pipeline gets close to free. 86.2 MMLU-Pro is enterprise-acceptable for nearly every customer-service workflow.

4. Tool-calling pipelines

Who: teams using LLMs as JSON-schema-emitting glue between systems. Why Flash wins: direct replacement for Claude Haiku, Gemma 4, GPT-4o-mini in the function-calling tier. Verified working tool calling and JSON mode on all three Flash providers. You get a smarter base model than the closed mini tier at a fraction of the cost.
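Because the API is OpenAI-compatible, function calling uses the standard tools schema. A minimal sketch, assuming the model elects to call the tool; get_order_status is a made-up example, not a real API:

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"], base_url="https://api.deepseek.com/v1")

# A hypothetical tool definition in the standard OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the shipping status of an order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Where is order 8472?"}],
    tools=tools,
)

call = response.choices[0].message.tool_calls[0]  # assumes the model chose to call the tool
print(call.function.name, call.function.arguments)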

5. Long-context document analysis

Who: legal review, medical records, scientific literature, large codebase search. Why Flash wins: 78.7 on MRCR @ 1M is, per Latent Space, above Claude Opus 4.7's 1M-context reasoning score, at a fraction of the cost. Most of these workloads do not need the additional 4.8pp that Pro delivers; they need the context window to actually be honest, and Flash's MRCR number says it is.

6. Speed-sensitive interactive UX

Who: products where output latency is part of the experience: IDE inline suggestions, voice assistants, live writing tools. Why Flash wins: 81 t/s on the first-party API vs Pro's 35 t/s. 2.3x faster output. Combined with sub-2s TTFT in non-reasoning mode, Flash feels live. Pro feels like a server is thinking.
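The standard OpenAI streaming interface applies here. A minimal sketch that renders tokens as they arrive, which is what makes 81 t/s feel live in a UI:

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"], base_url="https://api.deepseek.com/v1")

# Stream the completion so the UI can paint tokens incrementally.
stream = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Rewrite this sentence more concisely: ..."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)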

7. Privacy / on-prem / air-gapped

Who: regulated industries, defense, anyone who cannot send tokens to a third-party API. Why Flash wins: runs on a Mac Studio. The full V4-Pro requires datacenter-class hardware (multi-H100 or H200). Flash on a Q4_K_M MLX build on a 192 GB Mac Studio is a real, deployable on-prem inference target.

8. Iterative agentic coding

Who: teams building or using coding agents that operate over multiple turns rather than one-shotting. Why Flash wins: the gertlabs HN observation captures it: "not a smart model on the first try, but it makes up for it over the course of a session." Multi-shot Flash approaches first-shot Pro on cost-per-correct-solution. If your agent loop has self-correction built in, Flash is plenty.

Where DeepSeek V4 Flash falls short

Flash is not a Pro substitute for every workload. Five places where the gap is real and you should route to Pro (or another model) instead:

  • Multi-step agentic loops with chained tool calls. Terminal-Bench 2.0 gap is 11pp. If your agent runs 30 tool calls and each one matters, the compounding error rate hurts.
  • Web-search agents. BrowseComp gap of 10.2pp. Pro is meaningfully better at synthesizing across browsing turns.
  • Pure factual recall and world knowledge. SimpleQA-Verified gap of 23.8pp is the largest in the table. This is Flash's weakest area.
  • Hallucination-sensitive enterprise workflows. AA-Omniscience accuracy at -23 vs Pro's -10. Flash is more likely to confidently make things up.
  • Research-grade math beyond olympiad set. HLE gap of ~3pp. For most teams this is irrelevant. If you are doing actual research, it is not.

The honest pattern: route by task. Default to Flash. Promote to Pro for long-horizon agents, browsing-heavy work, and anything where hallucination cost is high.

Code example: calling V4 Flash via the OpenAI-compatible API

The DeepSeek API is OpenAI-compatible. Endpoint https://api.deepseek.com/v1/chat/completions, model name deepseek-v4-flash, Bearer auth. Prompt caching is automatic when your prompt prefix repeats; the prompt_cache_hit_tokens field on the response usage object tells you how much of the prompt was billed at the discounted rate.

Python: a Flash call with prompt caching

import json
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com/v1",
)

SYSTEM_PROMPT = """You are a senior code reviewer. Review the diff for:
- correctness, security, and performance issues
- style problems against the team's TypeScript guide
Return JSON: { issues: [{severity, file, line, message}] }"""

def review(diff: str) -> dict:
    response = client.chat.completions.create(
        model="deepseek-v4-flash",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": diff},
        ],
        response_format={"type": "json_object"},
        temperature=0.2,
    )
    usage = response.usage
    cache_hits = getattr(usage, "prompt_cache_hit_tokens", 0)
    cache_misses = getattr(usage, "prompt_cache_miss_tokens", 0)
    print(f"cache hit: {cache_hits} / miss: {cache_misses} / out: {usage.completion_tokens}")
    # JSON mode returns a JSON string; parse it so the function really returns a dict
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    print(review(open("diff.patch").read()))

JavaScript / Node.js

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.DEEPSEEK_API_KEY,
  baseURL: "https://api.deepseek.com/v1",
});

const SYSTEM_PROMPT = `You are a senior code reviewer. Review the diff for
correctness, security, and performance issues. Return JSON of the form
{ issues: [{ severity, file, line, message }] }.`;

export async function review(diff) {
  const res = await client.chat.completions.create({
    model: "deepseek-v4-flash",
    messages: [
      { role: "system", content: SYSTEM_PROMPT },
      { role: "user", content: diff },
    ],
    response_format: { type: "json_object" },
    temperature: 0.2,
  });
  const u = res.usage;
  console.log(`cache hit: ${u.prompt_cache_hit_tokens} / miss: ${u.prompt_cache_miss_tokens} / out: ${u.completion_tokens}`);
  return JSON.parse(res.choices[0].message.content);
}

The trick to maximize cache hits: keep your SYSTEM_PROMPT byte-stable, put long stable context (style guide, docs, schema) before short variable context (the diff, the user message). DeepSeek's caching is prefix-based; any change to the front of the prompt invalidates the cache.
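In practice that means assembling the message list so the byte-stable material always comes first. A sketch of the ordering; the style guide and schema files are placeholders for whatever long, stable context your pipeline reuses:

SYSTEM_PROMPT = "You are a senior code reviewer."    # byte-stable across calls
STYLE_GUIDE = open("style_guide.md").read()          # stable: cache-friendly
SCHEMA = open("review_schema.json").read()           # stable: cache-friendly

def build_messages(diff: str) -> list[dict]:
    return [
        # Stable prefix first: cached after the first request, billed at $0.014/M.
        {"role": "system", "content": f"{SYSTEM_PROMPT}\n\n{STYLE_GUIDE}\n\n{SCHEMA}"},
        # Variable suffix last: changes here leave the prefix cache intact.
        {"role": "user", "content": diff},
    ]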

Real-world community reception

The reception across HN, X, and the technical newsletters lands in the same place: Flash is the model people are actually moving production traffic to.

"DeepSeek V4 — almost on the frontier, a fraction of the price." — Simon Willison
"Flash@max ≈ Pro@high on reasoning tasks." — Latent Space
"Legit 1M context for pennies." — Latent Space
"DeepSeek V4 Flash is the model to pay attention to here. It's cheap, effective, and REALLY fast." — gertlabs, HN
"Not a smart model on the first try, but it makes up for it over the course of a session." — gertlabs, on Flash in agentic loops

The cost-math thread on HN is the one that converts skeptics: 40M tokens per month at sticker is $30-$70. With 80 percent cache hits, blended drops to about $0.04 per million, and 40M tokens then costs roughly $1.60. Engineers reading that math do the comparison to their current closed-model bill in their head, and the rest writes itself.

Setup checklist

I want to run V4 Flash on my Mac Studio

  1. Confirm 192 GB pooled memory. M4 Max or better. 96 GB is the floor; 192 GB is the comfort tier.
  2. Install MLX 0.24 or later: pip install mlx mlx-lm.
  3. Pull the community Q4_K_M MLX port from unsloth/DeepSeek-V4-Flash or convert from the official FP8 weights.
  4. Serve via mlx-lm.server on localhost:8080 with an OpenAI-compatible endpoint.
  5. Point your existing OpenAI SDK at http://localhost:8080/v1 and set model="deepseek-v4-flash" (see the sketch after this list).
  6. Expect 25-35 t/s at 64K-128K context. Above that, you'll want a GPU rig.
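A minimal smoke test against the local server once it is up. The served model name depends on how you launched mlx-lm.server; adjust it to match your setup:

from openai import OpenAI

# The local mlx-lm server speaks the OpenAI protocol; no real key is needed.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="deepseek-v4-flash",  # must match the name the local server registered
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)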

I want to call V4 Flash from my Next.js app

  1. Sign up at api-docs.deepseek.com, generate an API key.
  2. Add DEEPSEEK_API_KEY to your environment.
  3. Use the OpenAI SDK with baseURL: "https://api.deepseek.com/v1".
  4. Make all LLM calls server-side (in Route Handlers / Server Actions). Never ship the API key to the client.
  5. Keep system prompts byte-stable to maximize cache hits. Log prompt_cache_hit_tokens per call for ongoing cost visibility.
  6. Need help wiring this in? Our Node.js engineering pool ships AI integrations weekly.

I want to deploy V4 Flash on Together / Novita / SiliconFlow for production

  1. Pick a provider with verified Flash availability. AA's provider page tracks live throughput and pricing.
  2. For latency-sensitive paths, route to Novita (85.5 t/s) or SiliconFlow's FP8 endpoint (83.7 t/s).
  3. For cost-sensitive batch paths, route to first-party DeepSeek with caching enabled (81.3 t/s, lowest sticker).
  4. Set up a fallback chain: primary provider on failure rolls to the next, and your own queue retries (see the sketch after this list).
  5. Instrument: log latency, cache hit rate, completion tokens, and per-request cost. Cost surprises usually hide in unbounded output.
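A toy version of that fallback chain. The Novita and SiliconFlow base URLs and env var names here are illustrative assumptions; check each provider's docs for the real endpoints and model slugs:

import os
from openai import OpenAI

# Ordered provider chain, primary first. Base URLs and env vars are illustrative.
PROVIDERS = [
    ("novita", os.environ["NOVITA_BASE_URL"], os.environ["NOVITA_API_KEY"]),
    ("siliconflow", os.environ["SILICONFLOW_BASE_URL"], os.environ["SILICONFLOW_API_KEY"]),
    ("deepseek", "https://api.deepseek.com/v1", os.environ["DEEPSEEK_API_KEY"]),
]

def call_with_fallback(messages: list[dict]) -> str:
    last_err = None
    for name, base_url, key in PROVIDERS:
        try:
            client = OpenAI(base_url=base_url, api_key=key)
            resp = client.chat.completions.create(
                model="deepseek-v4-flash", messages=messages, timeout=30
            )
            return resp.choices[0].message.content
        except Exception as err:  # provider failure: roll to the next in the chain
            print(f"{name} failed: {err}")
            last_err = err
    raise RuntimeError("all providers failed") from last_err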

What this means for engineering teams

For most teams shipping LLM-backed product in 2026, the right default just changed. DeepSeek V4 Flash is the model you build on, and Pro (or a closed frontier model) is the model you escalate to. Cost falls by an order of magnitude. Latency improves. The 1M context window is no longer a bragging point reserved for Anthropic and Google; it is a $0.14/M default on a model you can also run on a Mac Studio.

The harder question is execution. Most of the cost savings live in caching, routing, and batching, all of which depend on engineering work that does not show up in a benchmark. If your team needs senior engineers who have already shipped LLM pipelines in production, Codersera matches you with vetted remote talent in TypeScript, Python, Go, Rust, and Node. We have the services and the track record for AI-heavy product teams.

If you want a longer view across the V4 lineup, see the sister deep dive on DeepSeek V4 Pro vs Flash, and our cross-model comparisons against Claude Opus 4.7 and GPT-5.5 Pro. Or browse all our coverage on the AI tag on the Codersera blog.

FAQ

Is DeepSeek V4 Flash a distillation of V4 Pro?

No. Flash and Pro are separate pretraining runs in the same architecture family, both on 32T+ tokens. The model card and DeepSeek's release notes are explicit on this. Flash is not a smaller distilled student of Pro.

How much does DeepSeek V4 Flash cost per million tokens?

Sticker is $0.14 input / $0.28 output. Cache-hit input is $0.014 (90 percent off). Off-peak nighttime pricing community-reported around $0.07 / $0.14.

What hardware do I need to run DeepSeek V4 Flash locally?

Practical floor is 96 GB of pooled memory and the Q4_K_M quant (~158 GB on disk; the model is paged in over GPU/MLX). Realistic single-machine rigs: Mac Studio M4 Max 192 GB, RTX PRO 6000 96 GB, or dual H100 80 GB.

Can a 128 GB MacBook Pro run V4 Flash?

Tightly, with aggressive quantization (IQ2_XS or smaller dynamic quants), and at reduced quality. 96 GB is the realistic floor; under that you are paging to disk and the model degrades.

Is V4 Flash good for coding agents?

Yes for IDE-style agents and one-shot or short-loop coding tasks. 91.6 LiveCodeBench Pass@1 and 79.0 SWE-bench Verified put it within 2pp of Pro. Long-horizon multi-tool-call agents see an 11pp Terminal-Bench gap; route those to Pro.

Does V4 Flash support tool calling and JSON mode?

Yes. All three first-class providers (DeepSeek, Novita, SiliconFlow) support function calling and structured JSON output. The API is OpenAI-compatible.

How does prompt caching work on the DeepSeek API?

Caching is automatic and prefix-based. Any byte-stable prompt prefix is cached server-side; subsequent requests reusing that prefix are billed at $0.014/M for the cached portion. Keep system prompts and stable context at the front of the message; put variable inputs at the end.

How does V4 Flash compare to Claude Opus 4.7 on long context?

On MRCR @ 1M, Flash scores 78.7. Per Latent Space's analysis, that is above Claude Opus 4.7's 1M-reasoning number. Flash is meaningfully cheaper for long-context document workloads where retrieval fidelity matters more than the last few points of reasoning.

What is the easiest way to try V4 Flash today?

Sign up for a DeepSeek API key, point the OpenAI SDK at https://api.deepseek.com/v1, and use model deepseek-v4-flash. You can also call it via OpenRouter or run a local Q4_K_M build with MLX or vLLM.

When should I use Pro instead of Flash?

Use Pro for long-horizon agentic loops with many chained tool calls, web-browsing agents, hallucination-sensitive enterprise workflows, and tasks heavy on factual recall. Default everything else to Flash and let the cost savings compound.

Sources and further reading