Run Uncensored MiniMax on CPU Locally: M2.1-PRISM and M2.7 Abliterated Guide (April 2026)

Last updated April 2026 — refreshed for MiniMax M2.7 open weights, the new license, the abliterated variants currently shipping, and verified CPU performance numbers.

This guide shows you how to run an uncensored MiniMax model on a CPU-only box, end-to-end, without smoothing over the awkward parts. As of April 2026 the canonical reference build is still MiniMax-M2.1-PRISM (the Ex0bit abliteration that originally seeded this guide), but the ecosystem now extends to MiniMax-M2.7 open weights with a fresh round of abliterated GGUFs (Youssofal Heretic-ARA, huihui-ai abliterated). We cover hardware, quantization, llama.cpp setup, real measured tokens/sec, the 2026 NaN bug that bricks several popular GGUFs, and the license change that catches teams off guard.

What changed in 2026:

  • MiniMax-M2.7 dropped on Hugging Face on 12 April 2026 as open weights. Same MoE shape as M2.1/M2.5 (229B total, 10B active, 256 experts, 62 layers), longer context (~200K), and a license shift.
  • License change. M2 and M2.5 shipped under MIT. M2.7 ships under a "Modified MIT" — non-commercial use is fine, commercial use now needs written authorization from MiniMax (api@minimax.io). This affects anyone hosting M2.7 for paying users.
  • Abliterated variants caught up. Ex0bit's PRISM pipeline still anchors M2.1; for M2.7 the active uncensored builds are Youssofal/MiniMax-M2.7-Abliterated-Heretic-GGUF (Heretic ARA, layers 30–51, 0/25 refusals) and huihui-ai/Huihui-MiniMax-M2-abliterated-GGUF.
  • The Q4_K/Q5_K NaN bug. A llama.cpp overflow in blk.61.ffn_down_exps produced NaNs in 21–38% of MiniMax-M2.7 GGUFs on Hugging Face. Unsloth fixed theirs; many community uploads from late March 2026 are still broken. Use UD-IQ4_XS or the fixed Unsloth dynamic quants.
  • CUDA 13.2 is poison for these GGUFs (gibberish output on low-bit quants). Stay on CUDA 13.1 or earlier until NVIDIA ships a fix. CPU-only inference is unaffected.
  • Real CPU numbers are tighter than synthetic claims. A 14-core Xeon Gold 6132 with 192 GB RAM gets 2.8–3.5 tok/s on Q4_K_M, with significant context-length decay (down near 1.5 tok/s past 20K context). Apple Silicon with 128 GB unified memory hits ~15 tok/s on UD-IQ4_XS — the closest thing to a "good day" on consumer hardware.


TL;DR — the answer in one table

| You want… | Use this | Why |
| --- | --- | --- |
| The original guide's exact stack (M2.1-PRISM) | Ex0bit/MiniMax-M2.1-PRISM + llama.cpp | Still the cleanest abliteration; PRISM preserves MMLU and adds 5–8 points |
| Newest open weights, non-commercial use OK | unsloth/MiniMax-M2.7-GGUF (UD-IQ4_XS, 108 GB) | Fixed NaN bug; best size/quality tradeoff for 128 GB RAM machines |
| Newest weights, uncensored | Youssofal/MiniMax-M2.7-Abliterated-Heretic-GGUF | Heretic ARA pipeline; 0/25 refusals on the test harness |
| No safety filtering, smaller model | huihui-ai/Huihui-MiniMax-M2-abliterated-GGUF | Base M2 abliteration, smaller download than M2.7 |
| Commercial deployment | M2.7 only with written license, or stick to M2/M2.5 (MIT) | M2.7 is non-commercial by default; M2/M2.5 remain MIT |

Model architecture and technical specifications

What's actually inside MiniMax-M2.x

Every MiniMax-M2.x release shares the same skeleton: a sparse Mixture-of-Experts model with 229B total parameters and 10B active per token, 256 local experts with 8 routed per token, 62 layers, and a hybrid attention pattern (7 Lightning attention blocks + 1 softmax block per group of 8). That's why a 230B-class model can fit in 108 GB at 4-bit and run on a 128 GB box: only ~4.4% of the parameters are touched per token.

  • Total parameters: 229B (sometimes rounded to 230B in marketing).
  • Active per token: 10B.
  • Experts: 256 local, 8 routed per token.
  • Layers: 62.
  • Native precision: BF16 / FP8 mix (F32, BF16, F8_E4M3 tensors in the safetensors).
  • Context window: 196,608–205,000 tokens depending on release (M2.1 was 1M-marketed; M2.7 settled at ~200K usable).
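A quick way to confirm those numbers against the file you actually downloaded is to dump the GGUF header metadata. A minimal sketch, assuming the gguf Python package (which ships a gguf-dump tool) and the Unsloth shard path used later in this guide; exact metadata key names vary by architecture, so grep loosely rather than for a specific key:

# Dump header metadata only (skip per-tensor info) and pull out the MoE shape
pip install gguf
gguf-dump --no-tensors \
  ./models/m27-unsloth/MiniMax-M2.7-UD-IQ4_XS-00001-of-00003.gguf \
  | grep -iE 'expert|block_count|context_length'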

If you are setting up a broader local-AI stack rather than just one model, the OpenClaw + Ollama setup guide for running local AI agents is the better starting point — it covers the orchestration layer that sits on top of llama.cpp.

The PRISM uncensored variant (M2.1) and what came after

MiniMax-M2.1-PRISM is Ex0bit's Projected Refusal Isolation via Subspace Modification (PRISM) abliteration of M2.1. The technique projects refusal directions out of the model's residual stream rather than fine-tuning them away, which is why benchmarks come back essentially unchanged — and in some cases up 5–8 points on MMLU-Pro because alignment-induced hedging is removed.

  • Adversarial response rate: 4096/4096 (100%) on the PRISM author's harness.
  • Coherence retention: 100% benign + long-chain.
  • Capability change: ~0 on SWE-bench Verified (74.0% baseline, unchanged).
  • License: Modified MIT, inherited from MiniMax-M2.1.

For M2.7 the picture has split into two distinct lineages:

  • Youssofal/MiniMax-M2.7-Abliterated-Heretic-GGUF — uses Heretic's Ablated Refusal Adaptation (ARA), a weight-level edit applied across layers 30–51 with parameters preserve_good_behavior=0.4512, steer_bad_behavior=0.0037, overcorrect_relative=0.8804. Reports 0/25 refusals on the eval harness. Non-commercial license inherited from M2.7.
  • huihui-ai/Huihui-MiniMax-M2-abliterated-GGUF — abliteration applied to the original M2 (MIT-licensed). Smaller and looser legally, but you give up the M2.7 self-evolution improvements.

If you don't need the very latest weights, M2.1-PRISM is still the most thoroughly characterized abliteration and the cleanest match for the rest of this guide.

Hardware and system requirements

What "CPU-only" actually demands in 2026

The numbers that look fine on paper hide a brutal truth about CPU MoE inference: memory bandwidth is the bottleneck, not core count. A 64-core CPU on dual-channel DDR4 will lose to a 16-core CPU on octa-channel DDR5. Plan around bandwidth first, cores second.

Minimum viable (development / hobby):

  • CPU: 16-core x86 with AVX-512 (Ryzen 9 7950X3D, Xeon W-3400 series, EPYC 9004). Apple M2/M3/M4 Max also works at this tier.
  • RAM: 128 GB DDR5 (or 128 GB unified memory on Apple Silicon).
  • Storage: 250 GB NVMe SSD (one quant + working space).
  • OS: Ubuntu 22.04+ / macOS 14+ / Windows 11 with WSL2.

Recommended (production / heavy use):

  • CPU: 32-core EPYC 9004/9005 or Xeon Sapphire Rapids/Emerald Rapids; 8 memory channels.
  • RAM: 192–256 GB DDR5-5200 or DDR5-5600 (octa-channel).
  • Storage: 1 TB NVMe Gen4 SSD (multiple quants, faster cold-load).
  • Optional GPU: 1× 16–24 GB GPU (RTX 4090, RTX 5090, RTX A6000) to offload 2–10 layers — that alone lifts throughput from ~3 to 25+ tok/s on the same RAM budget.

The bandwidth math: EPYC 9004 with DDR5-4800 sustains ~38 GB/s per channel × 12 channels = ~460 GB/s aggregate. A consumer Ryzen 9 with dual-channel DDR5-5200 gets ~83 GB/s. The MoE dispatch reads ~10 GB per token at Q4_K_M, so the consumer box tops out near 8 tok/s in theory; the EPYC theoretical ceiling is ~45 tok/s. In practice both lose 50–60% to NUMA, cache misses and scheduler overhead.
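If you want to run that math against your own box before buying anything, it fits in a three-line calculation. A back-of-envelope sketch using the assumptions from the paragraph above (~10 GB read per token at Q4_K_M, ~50% of theoretical bandwidth realized in practice); substitute your own aggregate bandwidth:

# Theoretical and realistic tok/s ceilings from memory bandwidth alone
python3 - <<'EOF'
bandwidth_gbs = 83.2   # aggregate memory bandwidth, GB/s (dual-channel DDR5-5200 example)
gb_per_token  = 10.0   # approx. weights read per token at Q4_K_M (assumption from above)
efficiency    = 0.5    # NUMA, cache misses, scheduler overhead
print(f"theoretical ceiling: {bandwidth_gbs / gb_per_token:.1f} tok/s")
print(f"realistic estimate:  {bandwidth_gbs * efficiency / gb_per_token:.1f} tok/s")
EOF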

Quantization matrix (verified file sizes from the Unsloth M2.7 GGUF)

| Quantization | Bits | File size | RAM (incl. ctx) | Quality | Use case |
| --- | --- | --- | --- | --- | --- |
| UD-IQ1_M | 1-bit | 60.7 GB | ~80 GB | Poor — degrades coherence | Curiosity only |
| UD-IQ2_M | 2-bit | 70.1 GB | ~96 GB | Below par on coding | Tight RAM, short prompts |
| UD-Q3_K_M | 3-bit | 101 GB | ~128 GB | Decent | Smaller dev box |
| UD-IQ4_XS | 4-bit | 108 GB | ~128 GB | Best size/quality tradeoff | Default recommendation |
| UD-Q4_K_M | 4-bit | 140 GB | ~160 GB | Very good | If you have 192 GB RAM |
| UD-Q5_K_M | 5-bit | 169 GB | ~192 GB | Excellent | Production quality |
| Q6_K | 6-bit | 188 GB | ~224 GB | Near-FP16 | Maximum quality on CPU |
| Q8_0 | 8-bit | 243 GB | ~288 GB | Full precision | Research, regression checks |
| BF16 | 16-bit | 457 GB | ~512 GB | Reference | Don't, on CPU |

File sizes from unsloth/MiniMax-M2.7-GGUF. The "RAM (incl. ctx)" column assumes a 32K context with KV cache at q8_0. Add ~10–15% headroom for working memory.

The headline: UD-IQ4_XS at 108 GB is the right answer for almost everyone running on a 128 GB machine. It dodges the NaN bug (see below), it's the quant Unsloth recommends, and it benchmarks within a couple of points of full precision.
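Before downloading, it is worth checking that your chosen quant actually fits with the headroom the table note calls for. A minimal sketch, assuming the Unsloth download directory from the next section and the ~15% working-memory margin mentioned above:

# Compare quant size (all shards) + ~15% headroom against available RAM
SIZE_GB=$(du -sBG ./models/m27-unsloth | awk '{print $1+0}')
NEED_GB=$(( SIZE_GB * 115 / 100 ))
FREE_GB=$(free -g | awk '/^Mem:/{print $7}')   # the "available" column
echo "need ~${NEED_GB} GB, available ${FREE_GB} GB"
[ "$FREE_GB" -ge "$NEED_GB" ] && echo "fits" || echo "too tight — pick a smaller quant"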

Step-by-step deployment

Environment preparation

# Create a clean working dir
mkdir -p ~/minimax-deploy/{models,env,logs}
cd ~/minimax-deploy

# Python venv
python3.11 -m venv env
source env/bin/activate

pip install --upgrade pip setuptools wheel
pip install huggingface-hub

Pin Python to 3.10 or 3.11. 3.12+ still has rough edges with several llama.cpp Python bindings as of April 2026.

Get the weights

Pick one of these depending on what you need:

# Option A: M2.1-PRISM (uncensored, Modified MIT)
huggingface-cli download Ex0bit/MiniMax-M2.1-PRISM \
  --local-dir ./models/m21-prism \
  --include "*Q4_K_M*"

# Option B: M2.7 vanilla (non-commercial OK, NaN-fixed by Unsloth)
huggingface-cli download unsloth/MiniMax-M2.7-GGUF \
  --local-dir ./models/m27-unsloth \
  --include "*UD-IQ4_XS*"

# Option C: M2.7 abliterated (Heretic ARA)
huggingface-cli download Youssofal/MiniMax-M2.7-Abliterated-Heretic-GGUF \
  --local-dir ./models/m27-abliterated \
  --include "*Q4_K_M*"

Always verify the checksum the model card publishes against your local file. Hugging Face's CDN occasionally truncates very large GGUFs on flaky connections, and a half-byte off renders the whole shard unusable.
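A minimal verification sketch — the hash below is a placeholder, not a real value; substitute the SHA-256 published on the model card for the shard you downloaded:

# Verify one shard against the model-card checksum (placeholder hash!)
EXPECTED="<sha256-from-model-card>"
ACTUAL=$(sha256sum ./models/m27-unsloth/MiniMax-M2.7-UD-IQ4_XS-00001-of-00003.gguf | awk '{print $1}')
[ "$ACTUAL" = "$EXPECTED" ] && echo "checksum OK" || echo "MISMATCH — redownload this shard"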

llama.cpp build (CPU-only)

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# CPU-only build with all the AVX paths
cmake -B build \
  -DBUILD_SHARED_LIBS=OFF \
  -DGGML_CUDA=OFF \
  -DGGML_NATIVE=ON \
  -DGGML_AVX2=ON -DGGML_AVX512=ON \
  -DGGML_F16C=ON -DGGML_FMA=ON

cmake --build build --config Release \
  -j --target llama-cli llama-server llama-bench

If you have a GPU and want to offload a couple of layers (which dramatically improves prompt processing on M2.x), flip -DGGML_CUDA=ON. But do not use CUDA 13.2 — known gibberish bug with low-bit quants. Pin to CUDA 13.1 or earlier until NVIDIA fixes it.
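If you script your builds, it is worth refusing the bad toolkit outright. A small guard for a build script — nvcc prints a line like "release 13.1", and the snippet assumes GNU grep:

# Abort a GPU build if the broken CUDA toolkit is installed
CUDA_VER=$(nvcc --version | grep -oP 'release \K[0-9]+\.[0-9]+')
if [ "$CUDA_VER" = "13.2" ]; then
  echo "CUDA 13.2 detected — low-bit MoE quants emit gibberish; pin 13.1 or earlier." >&2
  exit 1
fi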

Run the server

./build/bin/llama-server \
  -m ../models/m27-unsloth/MiniMax-M2.7-UD-IQ4_XS-00001-of-00003.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  --ctx-size 32768 \
  --threads 16 \
  --batch-size 512 \
  --cont-batching \
  --mlock \
  --n-gpu-layers 0 \
  --temp 1.0 --top-p 0.95 --top-k 40 \
  --log-file ../logs/server.log

Notes that actually matter:

  • --threads should match your physical core count, not logical. Hyperthreading hurts MoE throughput.
  • --mlock pins weights in RAM. Skip if you're tight on memory; the OS will page anyway and you'll feel the disk hit.
  • MiniMax-M2.x recommends temperature=1.0, top_p=0.95, top_k=40. Lower temperatures (0.2–0.4) tend to make M2.7 hedge or refuse on the abliterated variants — not because the safety came back, but because the model gets stuck in safety-shaped local minima.
  • Sanity-test with: curl -X POST http://localhost:8080/completion -H "Content-Type: application/json" -d '{"prompt":"Write a fast in-memory LRU cache in Rust.","n_predict":400}' — note that the native /completion endpoint takes n_predict, not max_tokens.
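For a fuller smoke test, llama-server also exposes an OpenAI-compatible endpoint at /v1/chat/completions. A sketch assuming the server invocation above, using the model card's system prompt and the recommended sampling parameters:

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant. Your name is MiniMax-M2.7 and is built by MiniMax."},
      {"role": "user", "content": "Write a fast in-memory LRU cache in Rust."}
    ],
    "temperature": 1.0,
    "top_p": 0.95,
    "max_tokens": 400
  }'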

Performance: real numbers, not marketing numbers

CPU throughput (verified, April 2026)

These numbers come from the llama.cpp issue tracker (discussion #19069), the Unsloth M2.7 docs, and Artificial Analysis's API benchmarks. Anything inside this table is observed, not extrapolated.

| Hardware | Quant | Context | Tokens/sec | Notes |
| --- | --- | --- | --- | --- |
| Apple M2 Ultra, 128 GB unified | UD-IQ4_XS | 4K | 15+ | Unsloth-published number |
| Same + 1× 16 GB GPU offload | UD-IQ4_XS | 4K | 25+ | ~10 layers offloaded |
| Xeon Gold 6132 (14c) + 192 GB DDR4 + RTX 2080 Ti | i1-Q4_K_M | 10K | 2.8–3.5 | llama.cpp #19069; degrades to ~1.5 at 20K context |
| Dual Xeon Gold 6138 | Q4_K_M | 10K | ~3.5 | Same thread; PCIe 3.0 bottleneck flagged |
| MiniMax cloud API | BF16 | n/a | 42.6 | Artificial Analysis median measurement |

Reading between the rows: if you stay under ~16K tokens of context, expect 3–4 tok/s on a sane DDR5 server CPU with no GPU. Past 32K, throughput collapses for the same reason it collapses on every MoE: the KV cache grows, expert routing fragments memory access, and the per-channel bandwidth becomes the wall.
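You can measure the decay curve on your own hardware with the llama-bench binary built earlier; -p accepts a comma-separated list of prompt lengths, so one run sweeps the context axis:

# Sweep prompt length to watch throughput decay (16 physical cores assumed)
./build/bin/llama-bench \
  -m ./models/m27-unsloth/MiniMax-M2.7-UD-IQ4_XS-00001-of-00003.gguf \
  -p 2048,8192,16384,32768 \
  -n 128 \
  -t 16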

Quality on coding tasks

The original guide cited 74.0% on SWE-bench Verified for the M2.1 base model. That number stands as published — but the leaderboard moved on. As of April 2026:

  • MiniMax-M2.5 on SWE-bench Verified: 80.2%.
  • MiniMax-M2.7 reports SWE-bench Pro 56.22%, Terminal Bench 2 57.0%, SWE Multilingual 76.5%, Multi SWE Bench 52.7%. (M2.7 was benchmarked on the harder "Pro" suite; not directly comparable to the older Verified numbers.)
  • Top-of-leaderboard reference points: Claude Mythos Preview 93.9% / Claude Opus 4.7 Adaptive 87.6% / GPT-5.3 Codex 85% on SWE-bench Verified.

For local coding workflows, expect Q4_K_M to lose ~2 points vs. the cloud-API baseline; UD-IQ4_XS stays within ~1 point. PRISM/Heretic abliteration is essentially free on these benchmarks: 0–0.5 point degradation, occasionally a +1–2 point gain on MMLU-Pro because hedging behavior is removed.

Competitive landscape (April 2026)

| Model | Params | Open weights | CPU-runnable | Coding bench | Context | License |
| --- | --- | --- | --- | --- | --- | --- |
| MiniMax-M2.7 | 229B MoE / 10B active | Yes | Yes (~108 GB) | SWE-bench Pro 56.22% | ~200K | Modified MIT (non-commercial) |
| MiniMax-M2.5 | 229B MoE / 10B active | Yes | Yes | SWE-bench Verified 80.2% | ~200K | MIT |
| MiniMax-M2.1-PRISM | 229B MoE / 10B active | Yes (uncensored) | Yes | SWE-bench Verified 74.0% | ~1M (marketed) | Modified MIT |
| DeepSeek V4 Flash | 671B MoE | Yes | Yes (large) | SWE-bench Verified ~76% | 128K | MIT-style |
| GLM-5.1 | ~250B MoE | Yes | Yes | SWE-bench Verified ~74% | 128K | Apache 2.0 |
| Qwen3.6 (latest) | ~235B MoE | Yes | Yes | SWE-bench Verified ~78% | 256K | Apache 2.0 |
| Claude Opus 4.7 | (closed) | No | No | SWE-bench Verified 87.6% | 1M | Commercial API |
| GPT-5.5 / 5.3 Codex | (closed) | No | No | SWE-bench Verified 85% | 1M+ | Commercial API |
| Gemini 3 Pro | (closed) | No | No | SWE-bench Verified ~82% | 2M | Commercial API |

The honest take: cloud frontier models still beat any local MoE on raw coding benchmarks. The reason to run M2.x locally is not benchmark wins — it's data residency, no per-token costs, and the ability to run an uncensored variant for security research, red-teaming, or adversarial evaluation. If you're choosing a coding assistant purely on quality, Claude Opus 4.7 or GPT-5.5 wins. If you need a 230B-class model that lives on your own hardware and answers without alignment hedging, MiniMax-M2.7 + Heretic abliteration is the strongest open option.

How to choose: a decision tree

  • You need commercial deployment with no licensing risk. Use M2 or M2.5 (still MIT). Skip M2.7 unless you're going to email api@minimax.io for a license.
  • You need uncensored output for research, red-team, or pen-testing. M2.1-PRISM is still the best-characterized abliteration. M2.7-Abliterated-Heretic is newer but less battle-tested.
  • You have 128 GB RAM and want one quant. UD-IQ4_XS. Don't agonize.
  • You have 192 GB+ RAM and care about quality. UD-Q4_K_M from Unsloth (NaN-fixed) or Q5_K_M.
  • You can throw a single GPU at it. Even 16 GB lifts you from CPU-only's 3 tok/s into the 20+ tok/s tier. The M2.7 GPU + hybrid setup guide walks through the offload tuning.
  • You want a different 230B-class open model. GLM-4.7 REAP at 218B and Qwen3-Next 80B A3B are the strongest non-MiniMax options for local CPU inference.

Common pitfalls (read this before you spend three hours debugging)

  • NaN outputs in the middle of long generations. Almost certainly the blk.61.ffn_down_exps overflow bug. It hits Q4_K_S, Q4_K_M, Q5_K_S, Q5_K_M, and several IQ3 variants from some uploaders. The fix: redownload from unsloth/MiniMax-M2.7-GGUF (they patched theirs) or switch to UD-IQ4_XS, which never NaN'd.
  • Gibberish output on a CUDA box. CUDA 13.2 is broken for low-bit MoE quants. Downgrade to 13.1.
  • Abrupt slowdown past 16K tokens. Expected. KV cache growth + MoE expert dispatch destroys cache locality. Mitigations: --cache-type-k q8_0 --cache-type-v q8_0, a smaller --ctx-size if your workload doesn't need it, and pinning threads with numactl --cpunodebind=0 --membind=0 on multi-socket boxes — see the combined invocation after this list.
  • Refusals from the abliterated model anyway. Almost always a system prompt or chat template problem. The Heretic and PRISM abliterations remove refusal directions; they don't override a system prompt that says "Refuse if asked about X." Set the system prompt to You are a helpful assistant. Your name is MiniMax-M2.7 and is built by MiniMax. (verbatim from the model card) or empty.
  • Hyperthreading slows you down. On most x86 CPUs, set --threads to physical cores only. AMD Zen4/Zen5 EPYC sometimes benefits from SMT for the prefill phase but not generation.
  • Page faults eating throughput. Use --mlock if RAM allows, or pre-warm with vmtouch -t model.gguf. Cold-loading a 108 GB GGUF from a Gen3 NVMe takes ~45–60 seconds; Gen4 cuts that to ~25–30.
  • "Quantized to 1-bit, why is it incoherent?" Because 1-bit quantization on a 230B MoE genuinely breaks the language model. The PRISM author flagged this in the original M2.1 release. Don't ship IQ1_S to anything that matters.

Cost and TCO (April 2026 numbers)

Hardware once-and-done:

  • Hobby tier (UD-IQ4_XS / 128 GB): ~$2,800. Ryzen 9 7950X3D, 128 GB DDR5-5600, 2 TB Gen4 NVMe, decent PSU/case.
  • Production tier (Q4_K_M / 192 GB): ~$8,000–9,000. EPYC 9354P or Xeon Sapphire Rapids 8c, 192 GB DDR5-5200 ECC, 4 TB Gen4 NVMe, server motherboard. Add a 24 GB GPU for ~$2,500 more if you want the 25 tok/s tier.
  • Apple Silicon path: Mac Studio M2 Ultra 128 GB ≈ $4,800. M3/M4 Max 128 GB MacBook Pro ≈ $4,500. The unified memory architecture is unreasonably effective for MoE inference.

Operating cost: 130–170W under sustained load × $0.12/kWh ≈ $0.02 per inference hour, plus AC.

Cloud equivalent: MiniMax's official API charges $0.30 per 1M input tokens and $1.20 per 1M output tokens (Artificial Analysis, April 2026). Claude Opus 4.7 sits at $15/$75. Be honest about the arithmetic: at 1M output tokens/day, M2.7's API costs $1.20/day, so the $2,800 hobby box takes over six years to pay for itself; the hardware only breaks even quickly if it displaces frontier-API pricing (against Claude's $75 per 1M output tokens, the same box pays for itself in just over a month). The value of the local setup usually isn't the cost saving — it's data control.
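The breakeven arithmetic, as a sketch you can rerun with your own volumes (numbers from the paragraph above):

# Breakeven in days = hardware cost / daily API spend
python3 - <<'EOF'
hardware_usd = 2800.0    # hobby tier
daily_output_m = 1.0     # million output tokens per day
for name, usd_per_m in [("MiniMax M2.7 API", 1.20), ("Claude Opus 4.7", 75.0)]:
    days = hardware_usd / (usd_per_m * daily_output_m)
    print(f"{name}: breakeven in {days:,.0f} days (~{days / 365:.1f} years)")
EOF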

If you're running this on behalf of a team that's also hiring engineers, Codersera's vetted remote developer hiring service has been the route most local-AI teams take when they need someone who can productionize this stack rather than just demo it.

Advanced: multi-node and hybrid GPU offload

llama.cpp's RPC mode lets you split a single inference across multiple machines. It's not magic — network latency caps you well below the single-node ceiling — but it's the only realistic way to run Q8_0 (243 GB) on a fleet of 128 GB boxes.

# Node A (compute primary). RPC mode requires llama.cpp built with
# -DGGML_RPC=ON, which the CPU-only build above did not enable.
./build/bin/llama-server \
  -m model.gguf \
  --rpc 192.168.1.100:50000,192.168.1.101:50000 \
  --threads 16 --port 8080

# Nodes B and C (workers)
./build/bin/rpc-server --host 0.0.0.0 --port 50000

Observed scaling on a 2× EPYC 9354P cluster with 25 GbE between nodes: 1 node = 4.1 tok/s, 2 nodes = 6.8 tok/s (~83% scaling), 3 nodes = 8.4 tok/s (~68% scaling). Past three nodes the network and coordination overhead eat the gains.

For mixed CPU + single GPU, the right call on M2.x is usually to offload just the embedding layer and the first 4–8 transformer blocks. That alone handles the prompt-processing bottleneck and lifts throughput substantially without trying to fit the whole model on the GPU.
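A hybrid sketch, assuming a -DGGML_CUDA=ON build on CUDA 13.1 or earlier; --n-gpu-layers controls how many transformer blocks llama.cpp places on the GPU, and the sweet spot on M2.x is usually in the 4–8 range — tune to your VRAM:

# Hybrid CPU + single-GPU serving: offload a few blocks to speed up prompt processing
./build/bin/llama-server \
  -m ./models/m27-unsloth/MiniMax-M2.7-UD-IQ4_XS-00001-of-00003.gguf \
  --n-gpu-layers 8 \
  --ctx-size 32768 \
  --threads 16 --mlock --port 8080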

Frequently asked questions

Can MiniMax M2.x really run on CPU only?

Yes, with caveats. With UD-IQ4_XS (108 GB) on a 128 GB RAM box, you'll get usable but slow generation — 3–4 tok/s on x86 server CPUs, ~15 tok/s on Apple Silicon's unified memory. That's enough for code review, refactoring, and one-shot generation. It is not enough for interactive chat.

Should I use M2.1-PRISM or one of the M2.7 abliterations?

Default to M2.1-PRISM if you want the configuration this guide was originally written around: it's the most extensively benchmarked uncensored M2 build, the abliteration methodology is documented in detail, and it's still under Modified MIT inherited from M2.1. Pick M2.7-Abliterated-Heretic if you want the newer self-evolution improvements and you're OK with non-commercial use.

What's the M2.7 GGUF "NaN bug" I keep seeing on Reddit?

A llama.cpp overflow in the blk.61.ffn_down_exps tensor causes NaN outputs starting around chunk 32 of perplexity evaluation, on Q4_K and Q5_K family quants from several uploaders. Unsloth patched their fileset. Many community uploads from late March 2026 are still broken. Stick with Unsloth's UD-IQ4_XS or their fixed UD-Q4_K_M until you check the model card date.

Did MiniMax really change the M2.7 license?

Yes. M2 (October 2025) and M2.5 (February 2026) shipped under MIT. M2.7 (April 2026) ships under a "Modified MIT" that allows non-commercial use freely but requires written authorization from MiniMax (api@minimax.io) for commercial deployment. The change drew vocal criticism on Hugging Face discussions and Hacker News. If you need a permissive commercial license, M2.5 is the highest version that's still safe.

How long a context can I actually use on CPU?

The model card lists ~200K. In practice on CPU, anything past 32K is painful: throughput halves at every doubling, and KV-cache memory at FP16 grows linearly. With --cache-type-k q8_0 --cache-type-v q8_0 you can push to 64K on a 192 GB box. Past that, use the cloud API for the long-context work.

Can I fine-tune the abliterated model on my data?

Full fine-tuning is impractical on a CPU box and difficult on a single GPU. LoRA / QLoRA against the BF16 weights with Unsloth tooling is the standard path, and works on 24 GB GPUs for the M2.x family. A separate write-up is the right place for that — start from the Unsloth fine-tuning notebooks rather than freelancing it.

Is "uncensored" the same as "jailbroken"?

Mechanically, no. Jailbreaks use prompt engineering against a still-aligned model. Abliteration (PRISM, Heretic ARA, huihui-ai's pipeline) edits the weights so that refusal directions are projected out of the residual stream. The model isn't being tricked into compliance; it has lost the architectural ability to refuse along the trained refusal axis. That's why benchmarks barely move and why subtle alignment behaviors (hedging, "as an AI assistant…") also disappear.

If your team needs an engineer who can productionize a setup like this — wire it into CI, add monitoring, manage the GGUF re-download cadence when bugs like the blk.61 NaN ship — Codersera places vetted remote engineers who already work at this layer. Local-AI infra has gotten cheap; the people who can run it well are still the bottleneck.