Running OlympicCoder-7B on macOS: Complete 2026 Installation and Setup Guide

Last updated April 2026 — refreshed for current model/tool versions.

OlympicCoder-7B is a 7-billion-parameter competitive programming model released by Hugging Face's Open-R1 team in March 2025. It runs entirely locally on Apple Silicon Macs, requires no cloud API, and with current tooling an M2 Max produces roughly 70–90 tokens per second on short code-completion prompts (see the benchmark section below) — competitive with cloud API latency for most workflows. This guide covers every step from download to IDE integration, with verified 2026 tool versions throughout.

What changed since this post was first published (March 2025)

  • LM Studio reached v0.4.12 (April 2026). The CLI workflow changed: use lms get / lms load / lms server start instead of the older GUI-only flow. MLX support was added in v0.3.4 (October 2024) and has been refined through every subsequent release.
  • Ollama 0.19 (March 31, 2026) introduced an MLX backend preview that nearly doubles decode speed on Apple Silicon — from ~58 tok/s to ~112 tok/s on an M4 Max. The MLX backend requires 32 GB or more of unified memory to activate.
  • llama.cpp moved to the ggml-org GitHub organisation. The install path via Homebrew (brew install llama.cpp) is unchanged, but any old GitHub links pointing to ggerganov/llama.cpp now redirect to ggml-org/llama.cpp.
  • Continue.dev extension reached v1.2.18 (March 2026). The config file is now ~/.continue/config.yaml by default (YAML replaces JSON in v1.x), though JSON is still accepted.
  • OlympicCoder-7B itself is unchanged — the model weights were published once (March 11, 2025) and have not been updated. The IOI'24 score of 129.0, base model (Qwen2.5-Coder-7B-Instruct), and Apache 2.0 licence remain current.

Want the full picture? Read our continuously updated Self-Hosting LLMs Complete Guide (2026) — hardware, Ollama and vLLM, cost per token, and when to self-host.

What Is OlympicCoder-7B?

OlympicCoder-7B is part of Hugging Face's Open-R1 initiative, a fully open reproduction of DeepSeek-R1-style reasoning models. It was fine-tuned from Qwen/Qwen2.5-Coder-7B-Instruct on the CodeForces-CoTs dataset — approximately 100,000 high-quality chain-of-thought examples derived from CodeForces problems, with C++ solutions generated by DeepSeek-R1.

The model is released under the Apache 2.0 licence and is free for commercial use. The companion OlympicCoder-32B model offers higher accuracy on the hardest problems but requires substantially more RAM (at least 24 GB for Q4 quantization).

Key Characteristics

  • Architecture: 7B-parameter causal language model (Qwen2.5 Coder family)
  • Training data: ~100k verified CodeForces solutions with chain-of-thought traces
  • Primary languages: C++ (training domain), Python (partially out-of-domain)
  • Context window: 32,768 tokens (trained at this length)
  • Reasoning style: Prefill with <think> token to activate long chain-of-thought
  • Licence: Apache 2.0

Strengths and Limits

OlympicCoder-7B is optimised for algorithm-heavy problems: competitive programming, graph traversal, dynamic programming, and binary search implementations. It produces direct, terse solutions — it is not designed to be explanatory. For user-facing API code, system design discussions, or highly contextual codebase-wide refactoring, Qwen2.5-Coder or a larger general model will serve better. See the "How to Choose" section below.

TL;DR

Question                           | Answer
What hardware do I need?           | Apple M1/M2/M3/M4 with at least 16 GB unified memory (8 GB works, but is tight)
What is the minimum macOS version? | macOS Ventura 13.5 or later (Sequoia recommended)
How big is the download?           | ~4.8 GB for Q4_K_M (recommended), ~7.2 GB for Q8_0
Easiest setup path?                | LM Studio v0.4.12 GUI + Continue.dev extension in VS Code
Fastest inference path?            | Ollama 0.19 with MLX backend (32 GB+ RAM), or llama.cpp via Homebrew with Metal
Does it beat Claude on IOI?        | Yes — OlympicCoder-7B scores 129.0 vs Claude 3.7 Sonnet's 93.0 on IOI'24
Is the 7B or 32B better?           | 7B for most Macs (≤ 32 GB RAM); 32B for M2/M3/M4 Ultra or Max with 64 GB+

Hardware Requirements

Apple Silicon Macs use unified memory, which means the GPU and CPU share the same RAM pool. A GGUF-quantized 7B model at Q4_K_M occupies approximately 4.8 GB of that pool at load time, leaving the remainder for the OS and active context. The practical floor is 16 GB unified memory; 8 GB is technically possible but leaves no headroom for a longer context window.
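
To sanity-check whether a given context size will fit before you download anything, you can estimate the footprint as GGUF file size plus KV cache plus runtime overhead. The sketch below is a back-of-the-envelope calculation in Python; the layer and head counts are assumed Qwen2.5-7B-style values, and the 1 GB overhead figure is a rough allowance rather than a measurement.

def kv_cache_gb(ctx_tokens, n_layers=28, n_kv_heads=4, head_dim=128, bytes_per_elem=2):
    # K and V caches: 2 * layers * kv_heads * head_dim * bytes per token (fp16 KV cache)
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return ctx_tokens * per_token / 1024**3

def total_footprint_gb(model_file_gb, ctx_tokens, overhead_gb=1.0):
    # Weights (GGUF file size) + KV cache + compute buffers / runtime overhead
    return model_file_gb + kv_cache_gb(ctx_tokens) + overhead_gb

for ctx in (8_192, 16_384, 32_768):
    print(f"{ctx:>6} tokens -> ~{total_footprint_gb(4.8, ctx):.1f} GB")  # Q4_K_M file is ~4.8 GB

This lines up with the table below: roughly 6–8 GB total for a Q4_K_M model, which is why 16 GB of unified memory is the comfortable floor.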

RAM    | Recommended quantization    | Max safe context | Notes
8 GB   | Q4_K_M only                 | 8k tokens        | Tight — close other apps
16 GB  | Q4_K_M or Q5_K_M            | 16k tokens       | Good daily-driver config
32 GB  | Q5_K_M, Q8_0, or FP16       | 32k tokens       | Enables Ollama MLX backend
64 GB+ | FP16 or OlympicCoder-32B-Q4 | Full 32k         | Consider the 32B variant

  • Processor: Apple M1 or later (M-series GPU with Metal support)
  • Storage: 10 GB free minimum (model + tools); 25 GB comfortable
  • macOS: Ventura 13.5 or later; Sequoia 15.x recommended
  • Xcode Command Line Tools: Required for llama.cpp compilation (xcode-select --install)

Software Setup

There are three main paths to running OlympicCoder-7B locally on macOS. Choose based on your priorities:

Path A: LM Studio (Easiest — GUI + CLI)

LM Studio v0.4.12 (released April 17, 2026) is the most approachable entry point. It handles model downloads from Hugging Face, exposes an OpenAI-compatible local API at http://localhost:1234/v1, and includes a native Metal-accelerated inference engine for Apple Silicon. It also gained MLX-model support in v0.3.4.

Install:

# Via Homebrew (recommended — keeps LM Studio updatable)
brew install --cask lm-studio

# Verify
lms --version

Download OlympicCoder-7B via CLI:

# Pull the LMStudio Community GGUF (Q4_K_M is default)
lms get lmstudio-community/OlympicCoder-7B-GGUF

# Load the model into the inference engine
lms load olympiccoder-7b

# Start the local server
lms server start

The server now listens at http://localhost:1234/v1 with an OpenAI-compatible API. Test it:

curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "olympiccoder-7b",
    "messages": [
      {"role": "user", "content": "Write a C++ function for binary search on a sorted array."}
    ],
    "temperature": 0.2,
    "max_tokens": 512
  }'
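
The same endpoint also works with the official openai Python package, which is convenient for scripting. A minimal sketch (the API key is a placeholder, since LM Studio's local server does not validate it, and the model identifier should match whatever name LM Studio reports for the loaded model):

from openai import OpenAI

# LM Studio ignores the key, but the client library requires a non-empty string
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="olympiccoder-7b",  # match the identifier reported by LM Studio
    messages=[
        {"role": "user", "content": "Write a C++ function for binary search on a sorted array."}
    ],
    temperature=0.2,
    max_tokens=512,
)
print(resp.choices[0].message.content)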

Quantization selection in the GUI: When browsing models in LM Studio's model hub, choose Q4_K_M for the best balance of speed and accuracy on 16 GB machines. Q5_K_M is the next step up (5.7 GB, slightly better for subtle logic), and Q8_0 (7.2 GB) preserves ~99% of the original floating-point accuracy at the cost of speed.

Alternative direct download (curl):

mkdir -p ~/Models
curl -L \
  "https://huggingface.co/lmstudio-community/OlympicCoder-7B-GGUF/resolve/main/OlympicCoder-7B-Q4_K_M.gguf" \
  -o ~/Models/OlympicCoder-7B-Q4_K_M.gguf

Path B: Ollama 0.19 (Fastest on Apple Silicon)

Ollama 0.19, released March 31, 2026, introduced a preview MLX backend that nearly doubles decode speed on Apple Silicon when you have 32 GB or more of unified memory. On an M4 Max with 64 GB, Ollama 0.19 with MLX reaches approximately 112 tok/s on a 7B model — compared to ~58 tok/s with the standard llama.cpp backend.

# Install Ollama
brew install ollama

# Start the Ollama service
ollama serve &

# Pull and run OlympicCoder-7B
ollama pull sikamikanikobg/OlympicCoder-7B
ollama run sikamikanikobg/OlympicCoder-7B

Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1 by default. To enable the MLX backend preview (requires 32 GB+ unified memory):

# Set the backend preference before starting the server
OLLAMA_BACKEND=mlx ollama serve &

Note: The MLX backend is a preview feature in Ollama 0.19. On machines with less than 32 GB, it silently falls back to the standard Metal/llama.cpp backend — performance is still good, just not the doubled speed.
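
If you drive Ollama from Python scripts, the ollama package wraps the same local API. A minimal sketch, assuming the model was pulled under the name shown above:

import ollama  # pip install ollama

response = ollama.chat(
    model="sikamikanikobg/OlympicCoder-7B",
    messages=[{
        "role": "user",
        # End the message with <think> to encourage the long chain-of-thought style
        "content": "Longest increasing subsequence in O(n log n), C++.\n<think>\n",
    }],
    options={"temperature": 0.2, "num_predict": 2048},
)
print(response["message"]["content"])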

Path C: llama.cpp (Most Control)

llama.cpp (now maintained at github.com/ggml-org/llama.cpp) is the underlying engine that both LM Studio and Ollama use. Running it directly gives you the most control over context, batching, and Metal shader settings.

# Install via Homebrew (handles Metal support automatically)
brew install llama.cpp

# Run OlympicCoder-7B directly
llama-cli \
  -m ~/Models/OlympicCoder-7B-Q4_K_M.gguf \
  --gpu-layers 99 \
  -p "<think>\nSolve: given an array of N integers, find the longest increasing subsequence in O(n log n)" \
  -n 1024

The --gpu-layers 99 flag offloads all layers to the Metal GPU. For an M2 Max with 64 GB, every layer fits in unified memory — no splitting needed. On a 16 GB machine, you may need to reduce this value if you see out-of-memory errors (try --gpu-layers 32 first).

Building from source (for latest performance patches, optional):

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j $(sysctl -n hw.logicalcpu)

IDE Integration

VS Code + Continue.dev

The Continue extension (v1.2.18, March 2026) is the most capable open-source VS Code plugin for local LLM code assistance. It supports chat, inline completion, and slash commands.

  1. Install from the VS Code marketplace: search "Continue" by Continue Dev and click Install.
  2. Open the config file: Cmd+Shift+P → "Continue: Open Config File"
  3. Add OlympicCoder-7B as a model (YAML format in v1.x):
models:
  - title: OlympicCoder-7B (Local)
    provider: openai
    model: olympiccoder-7b
    apiBase: "http://localhost:1234/v1"   # LM Studio
    # apiBase: "http://localhost:11434/v1" # Ollama
    completionOptions:
      temperature: 0.2
      maxTokens: 2048
      stop:
        - "###"

If you are still using Continue v0.9.x with JSON config:

{
  "models": [
    {
      "title": "OlympicCoder-7B (Local)",
      "provider": "openai",
      "model": "olympiccoder-7b",
      "apiBase": "http://localhost:1234/v1",
      "completionOptions": {
        "temperature": 0.2,
        "maxTokens": 2048
      }
    }
  ]
}

Cline (formerly Claude Dev) is an alternative extension for agentic multi-step tasks, and GitHub Copilot Chat suits teams already on GitHub. For pure competitive-programming practice, the Continue chat panel is sufficient.

Python / Transformers API

If you want to run OlympicCoder-7B programmatically in Python — for example, inside a benchmark harness or a custom tool — use Hugging Face Transformers:

# Install dependencies
pip install transformers accelerate torch

import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="open-r1/OlympicCoder-7B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Write a C++ function to compute modular exponentiation in O(log n)."},
]

# Prefill with <think> to activate chain-of-thought reasoning
prompt = pipe.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
prompt += "<think>\n"  # Trigger CoT

outputs = pipe(
    prompt,
    max_new_tokens=8000,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
)
print(outputs[0]["generated_text"])

Important: Prefill the assistant turn with <think>\n — in the pipeline example above, this is the prompt += "<think>\n" line added after apply_chat_template. Without this prefill token, OlympicCoder-7B defaults to short, non-reasoning answers — especially on out-of-domain queries. This is a documented training characteristic, not a bug.

Performance and Benchmarks

Competitive Programming Benchmarks

OlympicCoder-7B's primary evaluation benchmarks are the IOI 2024 (International Olympiad in Informatics) and LiveCodeBench. The IOI evaluation used 6 very challenging problems with a 50-submission limit per problem to simulate real contest conditions.

Model             | IOI'24 score (50-sub limit) | LCB (Python)          | Parameters
OlympicCoder-7B   | 129.0                       | ~52 (vs base +10 pts) | 7B
OlympicCoder-32B  | Exceeds o1-mini             | Higher                | 32B
Claude 3.7 Sonnet | 93.0                        | ~61                   | Closed
DeepSeek-R1       | 137.0                       | High                  | 671B MoE
QwQ-32B           | 144.0                       | High                  | 32B

Caveats on LiveCodeBench scores: OlympicCoder was trained predominantly on C++ solutions. LiveCodeBench expects Python outputs. This out-of-domain mismatch means the model's effective LiveCodeBench score understates its true algorithmic capability. In C++ competitive programming, performance is substantially higher.

macOS Inference Speed

The following figures are representative of community-reported benchmarks for Q4_K_M 7B models on Apple Silicon in 2025–2026. OlympicCoder-7B is based on Qwen2.5-Coder architecture; inference speed is equivalent to other 7B GGUF models of the same architecture.

Chip      | RAM       | Backend         | Approx. tok/s (7B Q4_K_M)
M1 (base) | 16 GB     | llama.cpp Metal | ~40–55
M2 Pro    | 16 GB     | llama.cpp Metal | ~55–70
M2 Max    | 32–64 GB  | llama.cpp Metal | ~70–90
M3 Pro    | 18–36 GB  | llama.cpp Metal | ~60–80
M4 Pro    | 24–48 GB  | llama.cpp Metal | ~70–90
M4 Max    | 64–128 GB | Ollama 0.19 MLX | ~112

Inference speed depends heavily on context length, concurrent applications, and thermal state. The figures above are for short-to-medium prompts at a cool operating temperature.

The original post cited an M2 Max at 18.7 tok/s for code completion and 12.4 tok/s for full solution generation. These numbers reflect an older llama.cpp version and Metal configuration. With current llama.cpp (April 2026) and --gpu-layers 99, M2 Max figures are substantially higher — in the 70–90 tok/s range for short completions, dropping to 20–40 tok/s as context grows beyond 8k tokens due to KV-cache pressure.
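
Rather than relying on published tables, you can measure decode speed on your own machine by timing one generation against whichever local server you run and dividing the completion-token count by wall-clock time. Note that this folds prompt processing into the measurement, so it slightly understates pure decode speed; swap base_url for Ollama or llama-server as needed:

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="local")  # LM Studio default port

start = time.perf_counter()
resp = client.chat.completions.create(
    model="olympiccoder-7b",
    messages=[{"role": "user", "content": "Write a C++ segment tree with lazy propagation.\n<think>\n"}],
    temperature=0.2,
    max_tokens=1024,
)
elapsed = time.perf_counter() - start

generated = resp.usage.completion_tokens  # token count reported by the server
print(f"{generated} tokens in {elapsed:.1f} s -> {generated / elapsed:.1f} tok/s")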

Metal Performance Configuration

For llama.cpp users who want to squeeze more performance from their Mac, these flags make a measurable difference:

llama-cli \
  -m ~/Models/OlympicCoder-7B-Q4_K_M.gguf \
  --gpu-layers 99 \
  --ctx-size 16384 \
  --batch-size 512 \
  --threads $(sysctl -n hw.physicalcpu) \
  -p "<think>\n[YOUR PROBLEM HERE]"

Key flags:

  • --gpu-layers 99 — offload all model layers to the Metal GPU (unified memory makes this safe on Apple Silicon)
  • --ctx-size — context window in tokens; the KV cache grows linearly with this value, so larger contexts cost memory and slow long-prompt processing
  • --batch-size 512 — prompt evaluation batch size; larger = faster prompt processing, more memory
  • --threads — CPU threads for non-GPU ops; use physical (not logical) core count

LM Studio users can achieve equivalent configuration via the Developer tab's inference settings, without touching the command line.

Practical Usage Examples

Competitive Programming Problem

OlympicCoder-7B excels at translating algorithmic problem statements into correct, optimised C++. Example:

User: Implement fast modular exponentiation for Codeforces problem 678D.
      Constraints: base, exp, mod fit in long long. O(log exp) required.
<think>

The model typically produces the following, without hallucinated variable names or off-by-one errors:

long long mod_pow(long long base, long long exp, long long mod) {
    long long result = 1;
    base %= mod;
    while (exp > 0) {
        if (exp & 1)
            result = (__int128)result * base % mod;
        base = (__int128)base * base % mod;
        exp >>= 1;
    }
    return result;
}

Note the use of __int128 to avoid overflow during intermediate multiplication — a subtlety that smaller or general-purpose models frequently miss.
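
One cheap way to verify arithmetic routines like this before submitting is to cross-check them against Python's built-in three-argument pow, which performs exact modular exponentiation on arbitrary-size integers:

import random

def mod_pow_reference(base, exp, mod):
    # Exact big-integer reference for the generated C++ implementation
    return pow(base, exp, mod)

random.seed(0)
for _ in range(5):
    b, e, m = (random.randrange(1, 10**18) for _ in range(3))
    print(b, e, m, mod_pow_reference(b, e, m))  # feed the same triples to the C++ code and compare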

Custom Prompt Template

For LM Studio users, create a system prompt that enforces competitive programming best practices:

{
  "system": "You are a competition-level C++ programmer. Always analyse time and space complexity before writing code. Use STL containers optimally. Assume the judge uses -O2 optimisation. Output only the function or complete solution — no prose unless asked."
}

Setting temperature: 0.1–0.3 reduces hallucinations on constraint-heavy problems. Use temperature: 0.7 when exploring multiple approaches to a problem.

How to Choose: OlympicCoder-7B vs Alternatives

OlympicCoder-7B is not the right model for every coding task. Here is a practical decision guide:

Task                                            | Best local model                      | Why
Competitive programming (IOI / CF / ICPC style) | OlympicCoder-7B                       | Fine-tuned specifically on this domain; outperforms Claude 3.7 on IOI
General code completion / autocomplete          | Qwen2.5-Coder-7B or DeepSeek-Coder    | More balanced; OlympicCoder is terse to a fault for general work
Codebase-wide refactoring with context >16k     | OlympicCoder-32B or a 32B+ model      | 7B loses coherence in very long contexts
Python-first development                        | Qwen2.5-Coder or Llama 4 Scout        | OlympicCoder was trained on C++; Python output quality is lower
Explaining code / tutorials                     | Any larger general model              | OlympicCoder does not explain — it solves
Algorithm design + explanation                  | OlympicCoder-7B with temperature: 0.6 | CoT traces are genuinely useful for learning

If you need to vet developers who work with competitive programming or algorithm-heavy systems, Codersera's vetted remote developer network includes engineers who regularly compete at IOI and ICPC level.

Common Pitfalls and Troubleshooting

Model gives short, unhelpful answers (no chain-of-thought)

Cause: The <think> prefill token was not included in the prompt.

Fix: Append <think>\n to the end of your user message before generation. In LM Studio, add it to the system prompt or the user message field. In the Transformers pipeline, append it to the formatted prompt string after apply_chat_template.

Slow inference (<10 tok/s)

  • Open Activity Monitor → GPU History. If GPU utilisation is 0%, Metal offloading is not active.
  • In LM Studio: Developer tab → GPU Offload slider → set to maximum.
  • In llama.cpp: confirm --gpu-layers 99 is in your command. If it fails, reduce to --gpu-layers 32 and increase until OOM.
  • Reduce context size (--ctx-size 8192) — larger contexts increase KV-cache pressure and reduce tok/s.
  • Close memory-hungry applications (Chrome tabs, Electron apps) before loading the model.

Out-of-memory crash or model fails to load

  • Switch from Q5_K_M or Q8_0 to Q4_K_M. Each quantization step down saves roughly 1–1.5 GB.
  • Reduce --ctx-size: a 32k context at Q4_K_M adds roughly 2 GB to the memory footprint.
  • Restart the Mac to clear unified memory fragmentation before loading large models.

llama-cpp-python installation fails

# Create a clean virtual environment
python3 -m venv ~/olympic-venv
source ~/olympic-venv/bin/activate

# Install with Metal support explicitly enabled
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
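
Once the install succeeds, a minimal usage sketch looks like this; n_gpu_layers=-1 offloads every layer to Metal (the Python equivalent of --gpu-layers 99), and the path assumes the download location used earlier in this guide:

import os
from llama_cpp import Llama

llm = Llama(
    model_path=os.path.expanduser("~/Models/OlympicCoder-7B-Q4_K_M.gguf"),
    n_gpu_layers=-1,  # offload all layers to the Metal GPU
    n_ctx=16384,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Implement Dijkstra with a binary heap in C++.\n<think>\n"}],
    temperature=0.2,
    max_tokens=2048,
)
print(out["choices"][0]["message"]["content"])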

Weak Python output quality

Cause: OlympicCoder-7B was trained almost exclusively on C++ solutions. Python performance is structurally lower — this is documented in the model card and the Open-R1 blog post.

Fix: For Python-first workflows, use Qwen2.5-Coder-7B-Instruct or DeepSeek-Coder-V2-Lite alongside OlympicCoder. Route algorithm-design prompts to OlympicCoder, then ask a Python-native model to translate.
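
One way to wire up that routing is to keep both models behind the same local OpenAI-compatible server and chain two calls. The model identifiers below are placeholders; use whatever names your server reports for the loaded models:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="local")

def ask(model, prompt):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
        max_tokens=4096,
    )
    return resp.choices[0].message.content

problem = "Count inversions in an array of up to 2*10^5 integers."

# Step 1: algorithm and C++ from the competition specialist
cpp_solution = ask("olympiccoder-7b", problem + "\n<think>\n")

# Step 2: translate to idiomatic Python with a Python-native model (placeholder identifier)
print(ask("qwen2.5-coder-7b-instruct",
          "Translate this C++ to idiomatic Python, preserving complexity:\n\n" + cpp_solution))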

Response cuts off mid-solution

Competitive programming solutions can be lengthy. Increase max_tokens to at least 4096, preferably 8000. In LM Studio, set "Max Tokens" in the Model Parameters panel. In the Transformers pipeline, set max_new_tokens=8000.

Alternative Workflows

llama.cpp HTTP Server (OpenAI-compatible)

If you want llama.cpp's performance without LM Studio's overhead:

llama-server \
  -m ~/Models/OlympicCoder-7B-Q4_K_M.gguf \
  --gpu-layers 99 \
  --host 127.0.0.1 \
  --port 8080 \
  --ctx-size 16384

This exposes the same OpenAI-compatible API at http://localhost:8080/v1, which Continue.dev, Cline, and any OpenAI SDK client can connect to.

Apple MLX (Swift/Python, Apple-native)

Apple's MLX framework provides native Apple Silicon inference without the GGUF format. OlympicCoder-7B is not yet published as an official MLX-format model, but the mlx-lm package can download the original Hugging Face weights and convert them on the fly:

pip install mlx-lm
mlx_lm.generate \
  --model open-r1/OlympicCoder-7B \
  --prompt "<think>\nSolve: maximum subarray sum in O(n)" \
  --max-tokens 2048
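
mlx-lm also exposes a Python API if you prefer scripting over the CLI. This sketch assumes the package fetches and converts the Hugging Face weights on first load (they are cached afterwards):

from mlx_lm import load, generate

model, tokenizer = load("open-r1/OlympicCoder-7B")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Solve: maximum subarray sum in O(n)."}],
    tokenize=False,
    add_generation_prompt=True,
) + "<think>\n"  # prefill to trigger chain-of-thought

print(generate(model, tokenizer, prompt=prompt, max_tokens=2048))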

FAQ

Can OlympicCoder-7B run on an Intel Mac?

Yes, but inference will be CPU-only (no Metal GPU acceleration). Expect 2–6 tok/s on a typical Intel MacBook Pro — usable for short prompts, but impractical for full solution generation. Apple Silicon is strongly recommended for any serious use.

Does OlympicCoder-7B work with Cursor, Zed, or JetBrains?

Any IDE that supports an OpenAI-compatible API endpoint can connect to the LM Studio or Ollama server. Cursor supports custom models via Settings → Models → Add Model (point to http://localhost:1234/v1). Zed has built-in Ollama support since v0.139. JetBrains AI Assistant can connect to local endpoints via its custom provider option.

What is the difference between OlympicCoder-7B and OlympicCoder-32B?

The 32B model is trained on the same dataset but has significantly more capacity. It outperforms o1-mini on IOI problems and beats DeepSeek-R1 in some contest-condition benchmarks. The 7B model is the practical choice for Macs with 16–32 GB of RAM. The 32B at Q4_K_M requires approximately 20 GB of unified memory — viable on M2/M3/M4 Max configurations with 32 GB or more.

Is OlympicCoder-7B good for learning competitive programming?

Yes, with caveats. The chain-of-thought traces are detailed and show genuine problem-solving steps. Use it interactively: give it a problem, read the <think> trace before the final answer, and try to understand the reasoning. However, the model is trained on CodeForces problems up to early 2025 — if you are practising for a specific recent contest, verify that the problem is not in the training set.

How do I update OlympicCoder-7B when a new version is released?

As of April 2026, no updated version of OlympicCoder-7B has been released. The weights are static since March 11, 2025. Monitor the Hugging Face model page and the Open-R1 GitHub repository for any new releases.

Can I fine-tune OlympicCoder-7B further on my own data?

Yes. The Apache 2.0 licence permits fine-tuning and redistribution. The Open-R1 team published their full training configuration, including hyperparameters and dataset preparation scripts. The key lessons: use a learning rate of 4e-5, do not use sample packing, and prefill with <think> during training.
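
As a rough illustration of what that looks like with Hugging Face TRL: treat this as a sketch, not a recipe. The dataset path and batch sizes are placeholders, the script assumes a CUDA training box (full fine-tuning a 7B is not practical on a Mac), and only the learning rate and no-packing setting come from the Open-R1 lessons above.

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder: your own data, in a format SFTTrainer understands (e.g. a "messages" column)
dataset = load_dataset("json", data_files="my_cot_solutions.jsonl", split="train")

config = SFTConfig(
    output_dir="olympiccoder-7b-sft",
    learning_rate=4e-5,                 # Open-R1's reported learning rate
    packing=False,                      # Open-R1 lesson: do not use sample packing
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
)

trainer = SFTTrainer(
    model="open-r1/OlympicCoder-7B",
    args=config,
    train_dataset=dataset,
)
trainer.train()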

Why does the model sometimes output very long reasoning before a short answer?

This is by design. OlympicCoder uses chain-of-thought reasoning, which can produce hundreds of tokens of internal deliberation before the final solution. If you want shorter output, lower max_tokens (this risks truncating the solution) or add the string Think briefly to your system prompt (reduces reasoning verbosity at some accuracy cost). For production use, strip the text between <think> and </think> before displaying to users.
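
A simple way to do that stripping, assuming the reasoning is wrapped in literal <think>...</think> tags in the generated text:

import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_reasoning(text):
    # Remove the chain-of-thought block and keep only the final answer;
    # if the model never emitted a closing </think>, the text is returned unchanged.
    return THINK_BLOCK.sub("", text).strip()

raw = "<think>\nTry Kadane's algorithm...\n</think>\nUse a running maximum; O(n) time, O(1) space."
print(strip_reasoning(raw))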

What is the best quantization for accuracy-sensitive tasks?

Q8_0 preserves approximately 99% of the floating-point accuracy and is the highest-quality GGUF quantization widely available. For a 7B model, Q8_0 requires ~7.2 GB of unified memory. On a 16 GB machine, Q8_0 leaves only 8.8 GB for the OS and context — tight. Q5_K_M (~5.7 GB) is the recommended middle ground for accuracy-sensitive competitive programming on 16 GB devices. The accuracy difference between Q4_K_M and Q8_0 on typical Codeforces problems is small but measurable on edge cases and harder problems (rating 2200+).


References and Further Reading

  1. OlympicCoder-7B Model Card — Hugging Face — official model card with benchmark scores, training hyperparameters, and usage examples
  2. Open-R1 Update #3: OlympicCoder Release — Hugging Face Blog — the original release announcement with detailed training methodology and lessons learned
  3. Open R1: How to Use OlympicCoder Locally for Coding — Hugging Face Blog — official LM Studio + Continue.dev setup guide published by the Open-R1 team
  4. LM Studio Changelog — official release notes; v0.4.12 (April 17, 2026) is current as of this writing
  5. Ollama MLX Backend Announcement — Ollama 0.19 MLX preview details and performance data
  6. llama.cpp GitHub Repository (ggml-org) — source code, releases, and Metal performance discussions
  7. lmstudio-community/OlympicCoder-7B-GGUF — Hugging Face — community GGUF quantizations (Q4_K_M, Q5_K_M, Q8_0)
  8. LiveCodeBench Leaderboard — live competitive programming benchmark results for code models