Run SmolVLM2-2.2B on macOS: 2026 Installation Guide (MLX, Transformers, llama.cpp)

Published 25 Feb 2025 • Updated 31 May 2026 • 9 min read

Last updated April 2026 — refreshed for current model/tool versions.

This guide walks through running SmolVLM2-2.2B-Instruct on macOS (Apple Silicon) using three production-grade runtimes: mlx-vlm (Python), Hugging Face transformers (PyTorch with MPS), and llama.cpp/Ollama (GGUF). A fourth path covers native Swift MLX for app developers. Every command, model ID, and version number was verified against vendor sources in April 2026: no placeholder paths, no broken fork branches.

What changed in 2026 (vs. the original Feb 2025 guide):

The unstable pcuenca/mlx-vlm@smolvlm fork branch is no longer needed. Pull request #208 merged into upstream on 20 February 2025, so plain pip install -U mlx-vlm now works. The current upstream release is mlx-vlm 0.4.4 (April 2026).
The placeholder your_username/SmolVLM2-2.2B path is replaced with the canonical HuggingFaceTB/SmolVLM2-2.2B-Instruct ID and its MLX twin mlx-community/SmolVLM2-2.2B-Instruct-mlx (4.49 GB on disk).
GGUF builds from ggml-org/SmolVLM2-2.2B-Instruct-GGUF are now supported by upstream llama.cpp (multimodal docs) and Ollama, giving you a fourth runtime option.
The official HuggingSnap iOS app (built on SmolVLM2-500M, all on-device) shipped to the App Store in March 2025 and is the reference Swift-MLX consumer of these weights.
For the transformers path on macOS, you need either the dedicated tag v4.49.0-SmolVLM-2 or any release ≥ 4.50.0 (where SmolVLM landed in stable). flash_attention_2 is CUDA-only; on Apple Silicon use "eager" attention and device="mps".

TL;DR

Goal	Best path on Apple Silicon	Command
Fastest local inference, MLX-native	`mlx-vlm` with the `mlx-community` weights	`pip install -U mlx-vlm`
Reference behaviour, full PyTorch ecosystem	Hugging Face `transformers` + MPS	`pip install "transformers>=4.50" torch num2words`
Lowest RAM, broadest tooling	GGUF via `llama.cpp` / Ollama	`ollama run hf.co/ggml-org/SmolVLM2-2.2B-Instruct-GGUF`
Native Swift / iOS / Mac app	`mlx-swift-examples` (now upstream)	See HuggingSnap for a production reference

What is SmolVLM2-2.2B?

SmolVLM2 is a family of small video-and-image VLMs from Hugging Face (2.2B, 500M, and 256M parameters), released 20 February 2025 under Apache 2.0. The 2.2B variant is the flagship: it accepts arbitrarily interleaved sequences of images, video frames, and text, and produces text. Per the model card, video inference fits in roughly 5.2 GB of accelerator memory, which is why the 2.2B variant can run on an 8 GB M-series Mac for image tasks and comfortably on 16 GB+.

The 2.2B model was trained on 3.3M samples (LLaVA OneVision, M4-Instruct, Mammoth, LLaVA-Video, FineVideo, VideoStar, Vista-400K, MovieChat, ShareGPT4Video, and others) and reported the following on the official card:

Benchmark	SmolVLM2-2.2B-Instruct	What it measures
Video-MME	52.1	Video understanding (the canonical video benchmark)
MMMU	42.0	College-level multimodal reasoning
MathVista	51.5	Visual math problems
MMStar	46.0	Multimodal vision-centric eval
ScienceQA	90.0	Multimodal science QA
OCRBench	72.9	Text-in-image reading

For its size class, those numbers are competitive with — and on Video-MME often ahead of — other ~2B VLMs. It is not a substitute for a frontier model on hard reasoning, but for on-device video tagging, screen reading, document QA, and accessibility tooling, it is one of the strongest small VLMs available in April 2026.

Hardware and software prerequisites

Mac: any Apple Silicon Mac (M1 / M2 / M3 / M4 / M5). 8 GB unified memory works for image inference; 16 GB is realistic for video. The mlx-community/SmolVLM2-2.2B-Instruct-mlx build is ~4.5 GB on disk (about what 2.2B parameters occupy at 16 bits each).
macOS: 14 (Sonoma) or newer for MLX; macOS 15 (Sequoia) recommended.
Python: 3.10–3.12 (mlx-vlm 0.4.4 requires Python ≥ 3.10).
Xcode Command Line Tools for any path that compiles wheels: xcode-select --install.
For video paths: ffmpeg (Homebrew: brew install ffmpeg) and either decord (transformers path) or the built-in MLX video reader.
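
The checklist above can be turned into a quick pre-flight script. This is a minimal sketch using only the standard library; the function name check_prerequisites is ours, and the thresholds simply restate the requirements listed above.

```python
import platform
import sys


def check_prerequisites(machine, mac_version, python_version):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if machine != "arm64":
        problems.append(f"expected Apple Silicon (arm64), found {machine!r}")
    # mac_version looks like "14.5"; an empty string means this is not macOS.
    major = int(mac_version.split(".")[0]) if mac_version else 0
    if major < 14:
        problems.append(f"macOS 14 or newer is required, found {mac_version or 'non-macOS'}")
    if not ((3, 10) <= tuple(python_version[:2]) <= (3, 12)):
        found = ".".join(str(part) for part in python_version[:2])
        problems.append(f"Python 3.10 to 3.12 is required, found {found}")
    return problems


issues = check_prerequisites(platform.machine(), platform.mac_ver()[0], sys.version_info)
print("All prerequisites look fine." if not issues else "\n".join(issues))
```

Run it with the interpreter you plan to use for the virtual environment, since the Python check applies to whichever interpreter executes the script.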

Create an isolated environment before installing anything:

python3 -m venv ~/.venvs/smolvlm2
source ~/.venvs/smolvlm2/bin/activate
pip install --upgrade pip

Path 1 — MLX (recommended for Apple Silicon)

MLX is Apple's array framework with native Metal kernels and unified-memory tensors. It is consistently the fastest path on M-series chips. As of April 2026, mlx-vlm ships SmolVLM2 support out of the box; the old fork instructions are obsolete.

Install

pip install -U mlx-vlm

Verify:

python -c "import mlx_vlm; print(mlx_vlm.__version__)"
# 0.4.4 or newer

Image inference (CLI)

python -m mlx_vlm.generate \
  --model mlx-community/SmolVLM2-2.2B-Instruct-mlx \
  --image https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg \
  --prompt "Describe this image in two sentences." \
  --max-tokens 128 \
  --temp 0.0

The first invocation downloads ~4.5 GB into ~/.cache/huggingface/hub/; subsequent runs are cached.
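
To confirm the download landed where you expect, you can inspect the Hub cache directly. The sketch below assumes the usual Hugging Face cache layout (a models--ORG--NAME folder under the hub directory, with the HF_HUB_CACHE and HF_HOME environment variables as overrides); the helper names are ours.

```python
import os
from pathlib import Path


def hub_cache_root():
    """Resolve the Hub cache directory, honouring the common environment overrides."""
    if os.environ.get("HF_HUB_CACHE"):
        return Path(os.environ["HF_HUB_CACHE"])
    if os.environ.get("HF_HOME"):
        return Path(os.environ["HF_HOME"]) / "hub"
    return Path.home() / ".cache" / "huggingface" / "hub"


def model_cache_dir(repo_id, root=None):
    """Map a repo ID such as 'org/name' to its cache folder 'models--org--name'."""
    root = Path(root) if root is not None else hub_cache_root()
    return root / ("models--" + repo_id.replace("/", "--"))


def dir_size_gb(path):
    """Sum regular file sizes in decimal GB; symlinks are skipped so blobs are counted once."""
    path = Path(path)
    if not path.exists():
        return 0.0
    total = sum(p.stat().st_size for p in path.rglob("*")
                if p.is_file() and not p.is_symlink())
    return total / 1e9


cache_dir = model_cache_dir("mlx-community/SmolVLM2-2.2B-Instruct-mlx")
print(f"{cache_dir}: {dir_size_gb(cache_dir):.2f} GB")
```

A result of 0.00 GB means nothing has been cached at that location yet.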

Video inference (CLI)

python -m mlx_vlm.video_generate \
  --model mlx-community/SmolVLM2-2.2B-Instruct-mlx \
  --system "Focus only on describing the key dramatic action or notable event in this video segment." \
  --prompt "What is happening in this video?" \
  --video ~/Downloads/clip.mov \
  --max-frames 32 \
  --max-tokens 256
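
Capping the frame count is what keeps memory bounded on long clips. As an illustration of the idea only (this is not necessarily how mlx-vlm selects frames internally), a uniform sampler can be sketched as follows; sample_frame_indices is a hypothetical helper.

```python
def sample_frame_indices(total_frames, max_frames):
    """Pick at most max_frames indices spread evenly across a clip."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    if total_frames <= max_frames:
        # Short clip: keep every frame.
        return list(range(total_frames))
    step = total_frames / max_frames
    # Take the frame nearest the centre of each of max_frames equal-width bins.
    return [int(step * i + step / 2) for i in range(max_frames)]


print(sample_frame_indices(10, 4))
print(len(sample_frame_indices(900, 32)))
```

Sampling from bin centres avoids always picking the very first frame, which is often a black or title frame.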

Python API

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/SmolVLM2-2.2B-Instruct-mlx"
model, processor = load(model_path)
config = load_config(model_path)

image = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
messages = [{"role": "user", "content": "Read any text in this image."}]

prompt = apply_chat_template(processor, config, messages, num_images=1)
output = generate(model, processor, prompt, image=[image], max_tokens=200, temp=0.0)
print(output)

Path 2 — Hugging Face transformers (PyTorch + MPS)

Use this path when you need full parity with the reference checkpoint, want LoRA fine-tuning, or are integrating into an existing PyTorch codebase. It is slower than MLX on Apple Silicon and uses more memory (no native 4-bit), but it is the canonical implementation.

Install

pip install "transformers>=4.50.0" torch torchvision pillow num2words
# For video:
pip install decord av

If you must pin to the exact tag the SmolVLM2 release used:

pip install "git+https://github.com/huggingface/transformers@v4.49.0-SmolVLM-2"

Image inference on macOS

Two corrections vs. the original snippet you may have seen in 2025 tutorials: do not pass flash_attention_2 (it is CUDA-only) and set device to mps on Apple Silicon.

import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

DEVICE = "mps" if torch.backends.mps.is_available() else "cpu"

model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    _attn_implementation="eager",
).to(DEVICE)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(DEVICE, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])

Video inference

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "/Users/me/Downloads/clip.mp4"},
            {"type": "text", "text": "Summarise this video in three bullet points."},
        ],
    },
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(DEVICE, dtype=torch.bfloat16)
out = model.generate(**inputs, do_sample=False, max_new_tokens=256)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
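
Both snippets above hand the processor the same message structure, so it can help to build it in one place. The helper below is a small sketch; build_messages is our own name, and the dictionary keys simply mirror the examples above.

```python
def build_messages(prompt, image_urls=(), video_paths=()):
    """Build a single-turn chat message list with media entries before the text."""
    content = [{"type": "image", "url": url} for url in image_urls]
    content += [{"type": "video", "path": path} for path in video_paths]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


messages = build_messages(
    "Summarise this video in three bullet points.",
    video_paths=["/Users/me/Downloads/clip.mp4"],
)
print(messages)
```

The returned list can be passed to processor.apply_chat_template exactly as in the snippets above.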

Path 3 — GGUF via llama.cpp or Ollama

If you want the smallest possible disk footprint and a single binary deploy, use the official GGUF weights at ggml-org/SmolVLM2-2.2B-Instruct-GGUF. Upstream llama.cpp supports the model through its multimodal (mtmd) tooling, and Ollama can pull GGUF models directly from Hugging Face.

Ollama

brew install ollama
ollama serve &
ollama run hf.co/ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M

At the interactive prompt, include the image path in your message, for example: ./image.jpg Describe what you see.
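
For scripted use, the same request can go through Ollama's local REST API instead of the interactive prompt. This is a minimal sketch assuming the default server address (http://localhost:11434) and the model tag pulled above; build_payload and describe_image are our own helper names, and whether this particular GGUF accepts images through the API is something to confirm on your machine.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local address (assumption)
MODEL = "hf.co/ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M"


def build_payload(prompt, image_bytes, model=MODEL):
    """Build the JSON body; Ollama expects images as base64-encoded strings."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # ask for one JSON object instead of a token stream
    }


def describe_image(image_path, prompt):
    """Send one image and one prompt to the local server and return the reply text."""
    with open(image_path, "rb") as fh:
        payload = build_payload(prompt, fh.read())
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


# Example (requires a running Ollama server):
# print(describe_image("./image.jpg", "Describe what you see."))
```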

llama.cpp (mtmd / multimodal)

brew install llama.cpp   # or build from source: https://github.com/ggml-org/llama.cpp

# Run the multimodal CLI with the model + projector
llama-mtmd-cli \
  -hf ggml-org/SmolVLM2-2.2B-Instruct-GGUF \
  --image ./photo.jpg \
  -p "What is happening here?"

See llama.cpp/docs/multimodal.md for the current flags. Q4_K_M is the sweet spot for the 2.2B at roughly 1.6 GB on disk.

Path 4 — Native macOS / iOS apps with Swift MLX

If you are shipping a real Mac or iOS app, do not wrap a Python process — use the Swift MLX bindings. The reference implementation is Hugging Face's own HuggingSnap, an open-source iOS 18 app that runs SmolVLM2-500M entirely on-device for live camera Q&A. It uses a fork of mlx-swift-examples with VLM support; the upstream ml-explore/mlx-swift-examples repo has since absorbed most of those changes.

Skeleton steps:

Add https://github.com/ml-explore/mlx-swift and https://github.com/ml-explore/mlx-swift-examples as Swift Package Manager dependencies in Xcode.
Use VLMModelFactory from mlx-swift-examples to load mlx-community/SmolVLM2-2.2B-Instruct-mlx.
Feed images via UIImage / NSImage → CIImage → MLX array.

The HuggingSnap source code is the cleanest end-to-end example we have found.

How to choose

You want fastest tokens/sec on a Mac: Path 1 (MLX). On an M3 Pro the 2.2B-MLX-4bit model generates at roughly 25–45 tok/s for images depending on prompt length — measure with --verbose.
You want PyTorch parity for fine-tuning or research: Path 2 (transformers + MPS).
You want the smallest deploy and a one-line install: Path 3 (Ollama + GGUF).
You are shipping a Mac/iOS app: Path 4 (Swift MLX), and read the HuggingSnap source.
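
If you would rather compare paths on your own hardware than rely on quoted ranges, a runtime-agnostic timer is enough. The sketch below is an assumption-laden helper rather than part of any of these libraries: generate_fn is whatever callable runs one generation, and count_tokens is whatever you use to count tokens in its output.

```python
import time


def tokens_per_second(generate_fn, count_tokens, clock=time.perf_counter):
    """Time one call to generate_fn and divide the generated token count by wall time."""
    start = clock()
    output = generate_fn()
    elapsed = clock() - start
    n_tokens = count_tokens(output)
    return n_tokens / elapsed if elapsed > 0 else float("inf")


# Stand-in generator so the example runs anywhere; swap in a real model call.
def fake_generate():
    time.sleep(0.05)
    return "a short stand-in reply of eight words total"


rate = tokens_per_second(fake_generate, lambda text: len(text.split()))
print(f"{rate:.1f} tok/s (whitespace-split words used as a rough token proxy)")
```

The measured time includes prompt processing, so for a fair comparison use the same prompt, image, and max-token setting on every path, and discard the first run, which also pays for model loading.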

Common pitfalls and troubleshooting

flash_attention_2 errors on macOS. Flash-Attention is CUDA-only. Pass _attn_implementation="eager" on Apple Silicon.
RuntimeError: The MPS backend is... / OOM on 8 GB Macs. Lower max_new_tokens, drop video frame count, or switch to the MLX 4-bit build.
Old tutorials still reference pip install git+https://github.com/pcuenca/mlx-vlm.git@smolvlm. Don't. The branch is stale; PR #208 merged into upstream Blaizzy/mlx-vlm in February 2025. Use pip install -U mlx-vlm.
Old tutorials use the placeholder your_username/SmolVLM2-2.2B. Replace with the canonical HuggingFaceTB/SmolVLM2-2.2B-Instruct (transformers) or mlx-community/SmolVLM2-2.2B-Instruct-mlx (MLX).
ValueError: ... model type 'smolvlm' not recognized. Your transformers is too old. Upgrade to ≥ 4.50.0 or pin to v4.49.0-SmolVLM-2.
Decord install fails on Apple Silicon. Use av (PyAV) as a fallback, or rely on the MLX path which has its own video decoder.
Garbled or hallucinated OCR. SmolVLM2 is competitive but not perfect at dense text. For document OCR at scale, pair it with a real OCR engine first (Tesseract, Apple Vision VNRecognizeTextRequest, or a dedicated OCR model) and use SmolVLM2 only for understanding; r/LocalLLaMA practitioners flag this consistently.
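
The "model type not recognized" pitfall above can be caught before loading anything by checking the installed version string. This is a minimal sketch with hand-rolled parsing so it has no dependencies; the helper names are ours, and it deliberately ignores pre-release suffixes.

```python
import re


def parse_release(version):
    """Extract the leading numeric release, e.g. '4.50.0.dev0' becomes (4, 50, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if not match:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part or 0) for part in match.groups())


def supports_smolvlm(version, minimum=(4, 50, 0)):
    """True when the release tuple is at or above the minimum stated in this guide."""
    return parse_release(version) >= minimum


for candidate in ("4.48.3", "4.50.0", "4.51.3"):
    print(candidate, supports_smolvlm(candidate))
```

In practice you would pass transformers.__version__. Note that an install from the pinned v4.49.0-SmolVLM-2 tag may report a version below 4.50.0 even though it supports the model, so treat a False result as a prompt to check which build you have rather than as proof of a problem.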

Where SmolVLM2 fits in a real product

For production work, choose SmolVLM2 when you need local, private, low-latency vision over cheap-and-cheerful classification: real-time camera assistants (HuggingSnap-style), accessibility tools, on-device screenshot triage, video chapter generation, and CCTV-event summarisation. For frontier reasoning over images (chart-heavy financial analysis, complex multi-step math, code-from-screenshot), step up to a larger VLM — Qwen3-VL, InternVL3, or a hosted Claude/GPT vision endpoint.

If you are scoping such a build and want help shipping it, Codersera matches you with vetted remote ML and Swift engineers who have deployed on-device VLM features in production iOS and macOS apps. We also have related practical guides on DeepSeek Janus-Pro 7B on Mac and OmniParser V2 on Ubuntu for adjacent multimodal stacks.

FAQ

How much RAM do I need to run SmolVLM2-2.2B on a Mac?

The model card states ~5.2 GB of accelerator memory for video inference. On Apple Silicon with unified memory, an 8 GB Mac can run the MLX 4-bit build for image tasks; 16 GB is the realistic floor for video and longer prompts.
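
A back-of-envelope estimate helps put these figures in context: weight storage is roughly parameter count times bits per weight. The sketch below uses the nominal 2.2B figure and is a lower bound only; real files also carry quantization scales and sometimes keep parts of the model at higher precision, and inference adds activations and the KV cache on top.

```python
def weight_footprint_gb(n_params, bits_per_weight):
    """Approximate weight storage in decimal GB, ignoring metadata and runtime buffers."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 2.2e9  # nominal parameter count taken from the model name

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_footprint_gb(N_PARAMS, bits):.1f} GB")
```

The 16-bit estimate lands close to the ~4.5 GB download size quoted earlier in this guide, while a pure 4-bit estimate is about a quarter of that.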

Is the placeholder model path `your_username/SmolVLM2-2.2B` real?

No. That was a copy-paste leftover from early 2025 tutorials. The canonical IDs are HuggingFaceTB/SmolVLM2-2.2B-Instruct (PyTorch / transformers) and mlx-community/SmolVLM2-2.2B-Instruct-mlx (MLX).

Do I still need the `pcuenca/mlx-vlm@smolvlm` fork?

No. PR #208 merged on 20 February 2025. Plain pip install -U mlx-vlm (currently 0.4.4) supports SmolVLM2 directly.

Why does the original code use `flash_attention_2`?

Because the model card example targets CUDA. On Apple Silicon use _attn_implementation="eager" and device="mps". Flash-Attention does not have a Metal backend.

Can I fine-tune SmolVLM2-2.2B on a Mac?

For LoRA / PEFT fine-tuning on small datasets, yes — use the transformers path with PEFT and MPS. For full fine-tuning, you'll want a CUDA box; the model is small enough that a single 24 GB consumer GPU suffices.
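
To see why LoRA is feasible on a Mac, it helps to count what actually gets trained: each adapted weight matrix of shape d_out x d_in gains two small matrices with rank x d_in and d_out x rank entries. The layer shapes below are hypothetical placeholders for illustration, not values read from the SmolVLM2 config.

```python
def lora_trainable_params(rank, d_in, d_out, n_matrices):
    """LoRA adds A (rank x d_in) and B (d_out x rank) per adapted matrix."""
    return n_matrices * rank * (d_in + d_out)


# Hypothetical shapes for illustration only; check the model config for real values.
HIDDEN = 2048          # assumed width of each adapted projection
LAYERS = 24            # assumed number of transformer layers
TARGETS_PER_LAYER = 2  # e.g. adapting two attention projections per layer

for rank in (4, 8, 16):
    n = lora_trainable_params(rank, HIDDEN, HIDDEN, LAYERS * TARGETS_PER_LAYER)
    print(f"rank={rank}: {n:,} trainable parameters")
```

Under these assumptions the adapter is a few million parameters at most, a tiny fraction of the base model, which is why optimizer state stays small even though the frozen base weights still have to fit in memory.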

Is there an iOS app I can try without writing code?

Yes — HuggingSnap on the App Store (free, iOS 18+). It uses SmolVLM2-500M entirely on-device.

What's the most stable runtime in April 2026?

mlx-vlm 0.4.4 (April 2026) is the current upstream release; it has been stable for SmolVLM2 since the merge. transformers ≥ 4.50.0 is the equivalent stable line for the PyTorch path.

Will there be a SmolVLM3?

As of April 2026, the Hugging Face Smol family has shipped SmolLM3 (text-only, July 2025) but no SmolVLM3 announcement. SmolVLM2 remains the current vision/video line.
