Mellum2: JetBrains' 12B MoE Code Model, Explained for Developers

Quick answer. Mellum2 is a 12B-parameter Mixture-of-Experts model from JetBrains, trained from scratch on natural language and code and released under the Apache 2.0 license. It activates only 2.5B parameters per token, which JetBrains says delivers more than 2x faster inference than similarly sized open models. It is built as a fast, well-scoped "focal" model for high-frequency tasks — routing, RAG, summarization, sub-agents, and private deployment — not as a frontier replacement.

JetBrains shipped Mellum2 on June 1, 2026: an open Mixture-of-Experts (MoE) model aimed squarely at low-latency text-and-code workloads. If you have been watching the small-and-fast end of the open model landscape, this one is worth a closer look — not because it tries to win every benchmark, but because it is designed to do a specific job inside larger AI systems, cheaply and quickly.

Mellum originally started life as a code completion model. With Mellum2, JetBrains extends that foundation to a broader set of natural language and software engineering tasks while keeping the model deliberately compact and easy to serve. This guide walks through what Mellum2 actually is, where it fits in a real engineering stack, and how to start using it.

What is Mellum2?

Mellum2 is a 12B-parameter Mixture-of-Experts model trained from scratch on natural language and code. The headline number that matters for practitioners is the active-parameter count: although the model has 12B total parameters, it activates only 2.5B parameters per token. That is the core MoE trick — keep total capacity high so the model knows a lot, but route each token through a small subset of "experts" so inference stays fast and cheap.

A few defining characteristics:

  • Modality: text and code only. JetBrains intentionally skipped multimodal capabilities to keep the model focused and efficient for software engineering work.
  • License: Apache 2.0 — permissive, commercial-friendly, and self-host-ready.
  • Design goal: high-throughput, low-latency inference rather than maximal raw capability.

JetBrains frames Mellum2 as a "focal" model: a fast, well-scoped component optimized for the high-frequency tasks that show up everywhere in production AI systems. The pitch is not "replace your frontier model." It is "stop paying frontier prices and frontier latency for the dozens of small calls that don't need them."

The Mixture-of-Experts architecture, in plain terms

If you have only ever worked with dense models, MoE is worth understanding because it changes the cost math. In a dense 12B model, every token flows through all 12B parameters. In an MoE model like Mellum2, a routing layer picks which experts handle each token, so only a fraction of the weights — 2.5B here — fire on any given forward pass.

The practical consequence: you get the knowledge capacity of a larger model with the serving cost closer to a much smaller one. JetBrains states this keeps total model capacity high while making inference more efficient and reducing serving cost for real-time workloads. For anyone running models behind an API where p95 latency and GPU-hours are the budget, that trade-off is exactly the lever you want.

This is the same architectural direction the broader ecosystem has been moving toward. If you want the wider context on where sparse and dense open models sit relative to each other, our open-source LLM landscape guide maps the players and trade-offs.

Benchmarks: what JetBrains claims

In the technical report, JetBrains evaluates Mellum2 across code generation, reasoning, science, and math benchmarks. The summary claim is that Mellum2 is competitive with similarly sized open models while delivering more than 2x faster inference. That second half is the real story: the value proposition is not "beats everything," it is "matches comparable models and serves them roughly twice as fast."

That framing matters for how you should read the model. A "competitive, much faster" result is precisely what you want for the high-volume, latency-sensitive operations Mellum2 targets — and largely irrelevant if you were hoping for a single do-everything frontier model. For the full architecture details, training setup, and per-benchmark methodology, JetBrains points to the technical report on arXiv. We are deliberately not quoting specific benchmark scores here; check the report for the exact numbers before you make a procurement decision on them.

Where Mellum2 fits in a real AI stack

Modern AI systems rarely make a single model call. A typical request fans out into routing, retrieval, summarization, planning, validation, and tool use — many of which are latency-sensitive and do not require the largest available model. Mellum2 is built for exactly those operations. JetBrains highlights four primary use cases:

  • Routing and orchestration. Mellum2 works as a lightweight router in multi-model systems — prompt classification, tool selection, and intermediate control-flow steps where you want a fast decision, not a 30-second reasoning chain.
  • RAG pipelines. It is well suited to latency-sensitive retrieval work: context compression, summarization, and retrieval post-processing. These run on every query, so shaving latency and cost here compounds quickly.
  • Sub-agents. For agent subtasks like planning, validation, transformation, and context preparation, Mellum2 can handle the intermediate steps so you avoid invoking a larger, slower model for every internal hop.
  • Private deployment. Because it is open and efficient to serve, Mellum2 fits self-hosted environments that involve proprietary code or internal data — no third-party API call required.

If you are building agentic systems, this "small model for the inner loop, big model for the hard reasoning" pattern is increasingly the default. Our AI coding agents guide covers how those multi-model architectures get assembled in practice.

Why well-scoped models matter

JetBrains makes an argument worth repeating: as AI systems mature, the most effective architectures are becoming less monolithic. A single frontier model can be powerful, but production systems often need several specialized components working together — retrievers, routers, code-aware models, validators, tool callers, and larger reasoning models.

The goal of a focal model is not to replace every model in the stack. It is to make the stack faster, cheaper, and easier to control. In other words: route the high-frequency, low-difficulty work to something like Mellum2, and reserve your frontier-model budget for the calls that genuinely need it. For teams shipping production AI features, that decomposition is often the single biggest lever on both latency and cloud spend.

This is a meaningfully different mental model from "pick the best model and use it for everything," and it changes how you design pipelines, set timeouts, and reason about cost.

Self-hosting and private deployment

The Apache 2.0 license plus the efficient MoE design make Mellum2 a natural fit for self-hosted deployments. For organizations working with proprietary code or internal data, the ability to run the model on your own infrastructure — without sending source through an external API — is frequently the deciding factor, regardless of benchmark deltas.

Two things make Mellum2 comfortable to operate:

  • The 2.5B active-parameter footprint keeps per-token compute low, which matters when you are serving many concurrent requests on fixed hardware.
  • The text-and-code-only scope means you are not paying for multimodal capacity you will never use in a code pipeline.

If you are standing up your own inference infrastructure, our self-hosting LLMs guide covers the serving stack, hardware sizing, and throughput tuning, and the Apple Silicon LLMs guide is the reference if you want to prototype locally on a Mac before committing to GPU infrastructure. Because Mellum2 is code-aware and open, it also slots into developer tooling — the kind of model you might wire behind an editor integration alongside something like the setups we cover in our Cursor IDE guide.

How to get started with Mellum2

The weights are published on Hugging Face. JetBrains hosts them in a dedicated collection:

A sensible adoption path looks like this:

  1. Identify the high-frequency calls in your pipeline. Look for the operations that run on every request — routing, summarization, context compression, validation — and that currently hit a larger model.
  2. Swap one in. Point a single low-difficulty step at Mellum2 and measure latency, cost, and output quality against your current model.
  3. Validate on your own data. Benchmark claims are directional; your retrieval and code data are what matter. Run an A/B on real traffic before rolling out broadly.
  4. Decide on deployment shape. If your data is sensitive, plan for self-hosting from day one rather than retrofitting it later.

If you are evaluating fine-tuning the model to your domain, the permissive license makes that legally clean; our fine-tuning LLMs guide walks through when adaptation is worth the effort versus prompt engineering and retrieval.

Should you use Mellum2?

Use Mellum2 if you are building software-engineering AI systems and you have identified specific, high-frequency, latency-sensitive operations that don't need a frontier model. It is a strong candidate for the inner loop of agents, the post-processing stage of RAG, the router in a multi-model setup, and any pipeline where you want code awareness on self-hosted infrastructure.

Reach for something larger when the task is genuinely hard reasoning, long-horizon planning, or anything where a quality ceiling matters more than throughput. The whole point of a focal model is to coexist with bigger models, not to compete head-to-head with them. Used that way, Mellum2 is a practical, well-licensed building block for making an AI stack faster and cheaper to run.

FAQ

What is Mellum2?

Mellum2 is a 12B-parameter Mixture-of-Experts model from JetBrains, trained from scratch on natural language and code and released under the Apache 2.0 license. It activates only 2.5B parameters per token and is optimized for low-latency text-and-code workloads.

How many parameters does Mellum2 actually use per token?

Although Mellum2 has 12B total parameters, its Mixture-of-Experts architecture activates only 2.5B parameters per token. That is what keeps inference fast and serving cost low while preserving the capacity of a larger model.

What license is Mellum2 released under?

Mellum2 is released under the Apache 2.0 license, which is permissive and commercial-friendly, making it suitable for self-hosted and private deployments involving proprietary code or internal data.

How fast is Mellum2 compared to similar models?

JetBrains reports that Mellum2 is competitive with similarly sized open models on benchmarks while delivering more than 2x faster inference, which is what makes it suitable for high-throughput production workloads.

What is Mellum2 best used for?

JetBrains positions it as a "focal" model for high-frequency tasks: routing and orchestration, RAG pipelines (context compression, summarization, retrieval post-processing), agent sub-tasks like planning and validation, and private self-hosted deployment.

Is Mellum2 a multimodal model?

No. Mellum2 is intentionally focused on text and code rather than multimodal tasks. JetBrains chose this specialization to keep the model compact and efficient for software engineering workloads.

Where can I download Mellum2?

The weights are published in the JetBrains Mellum-2 collection on Hugging Face, and the full architecture, training, and evaluation details are in the technical report on arXiv.