Fara-7B Install Guide: Run Microsoft's Local AI Agent

Last updated: May 1, 2026 — refreshed for current model/tool versions.

Microsoft's Fara-7B is a 7-billion-parameter, open-weight (MIT-licensed) computer-use agent released on November 24, 2025. It runs locally, drives a real browser via screenshots and predicted mouse/keyboard coordinates, and posts a 73.5% success rate on WebVoyager — beating OpenAI's computer-use baseline (70.9%) and SoM GPT-4o (65.1%) at a fraction of the inference cost. This guide is the practical end-to-end install and operations reference: hardware, four supported runtimes (vLLM, Transformers, Ollama/llama.cpp via GGUF, Azure AI Foundry), benchmark numbers from Microsoft's own paper, and the 2026-current way to wire Fara-7B into Magentic-UI for an agent loop.

What changed since the original 2025 launch coverageApril 2026 repo update: the microsoft/fara GitHub repo dropped its Autogen submodule dependency and now vendors chat clients directly — older clone-and-pip flows from December 2025 will fail until you re-clone.GGUF quants are now first-class. Community quants from bartowski, mradermacher, and Mungert on Hugging Face cover Q2_K through F16 (3.0–15.2 GB), making Ollama / LM Studio / Jan / llama.cpp deployment trivial on consumer hardware.Pinned dep versions matter. The official webeval environment now pins torch==2.7.1 and vllm==0.10.0 to avoid CUDA-graph capture crashes — unpinning produces silent inference failures, not loud errors.Cost gap widened. Microsoft's paper reports ~$0.025 per task for Fara-7B vs. ~$0.30 per task for proprietary baselines — roughly 12× cheaper, on top of the privacy benefit of running on-device.Status: still experimental. Fara-7B is in Azure AI Foundry Labs as an "early-stage experiment." Treat it as a research preview, not a stability-guaranteed production model.

Want the full picture? Read our continuously-updated Self-Hosting LLMs Complete Guide (2026) — hardware, ollama and vllm, cost-per-token, and when to self-host.

TL;DR

Question	Answer
What is it?	Microsoft's first agentic SLM for computer use. 7B params, multimodal, screenshot-in / action-out.
Base model	Qwen2.5-VL-7B (vision-language).
License	MIT.
Context	128K tokens.
Headline benchmark	73.5% on WebVoyager (Microsoft, Nov 2025).
Cost per task	~$0.025 (vs. ~$0.30 for proprietary CUA baselines).
Easiest local install	Ollama / LM Studio with a community GGUF quant (Q4_K_M ~4.7 GB).
Most accurate local install	`vllm serve microsoft/Fara-7B` at bf16 on a 24 GB+ GPU.
Where it lives	huggingface.co/microsoft/Fara-7B · github.com/microsoft/fara · Azure AI Foundry Labs.

What is Fara-7B?

Fara-7B is a multimodal, decoder-only model fine-tuned from Qwen2.5-VL-7B for one specific job: looking at a browser screenshot and predicting the next mouse click, key press, scroll, or text-entry to make progress on a user's stated goal. It does not rely on accessibility trees, DOM scraping, or HTML parsing. It sees pixels — the same modality a human user has — and emits grounded actions with predicted x/y coordinates.

This matters for three reasons:

Generality. Sites that lock down DOM access (anti-bot canvas rendering, shadow-DOM PWAs, drawn-in-canvas dashboards) work the same as plain HTML to Fara-7B.
Privacy. A 7B model fits on a single consumer GPU at quantization, so your bank login, calendar, and email don't have to leave the box.
Cost. Microsoft's paper reports an average of $0.025 per task for Fara-7B end-to-end, against ~$0.30 for proprietary CUA baselines.

If you're standing up a broader local-agent stack — picking a runtime, an orchestrator, and a model triad — the companion OpenClaw + Ollama setup guide for running local AI agents covers the full pillar. This post zooms into Fara-7B specifically.

Benchmark performance (numbers from Microsoft's Nov 2025 paper)

Microsoft's release paper (Fara-7B: An Efficient Agentic Model for Computer Use) reports the following success rates, averaged over 3 runs:

Benchmark	Fara-7B	SoM GPT-4o	OpenAI computer-use baseline	UI-TARS-1.5-7B	GLM-4.1V-9B-Thinking
WebVoyager	73.5%	65.1%	70.9%	66.4%	66.8%
Online-Mind2Web	34.1%	—	—	—	—
DeepShop	26.2%	—	—	—	—
WebTailBench (new)	38.4%	—	—	—	—

WebTailBench is a new long-tail evaluation Microsoft published alongside the model. It targets 11 task categories that legacy web-agent benchmarks under-cover — booking movie tickets, restaurant reservations, comparing prices across retailers, applying for jobs, finding real estate, and similar everyday flows. The 38.4% number is sobering: state-of-the-art for an open 7B is still well under half on real-world long-tail web flows. Plan accordingly.

Hardware requirements

Microsoft trained Fara-7B on 64 H100s for ~2.5 days. You don't need that to run it. For inference:

Tier	Hardware	Runtime	Notes
Reference (highest accuracy)	NVIDIA A6000 / A100 / H100, 24 GB+ VRAM	vLLM bf16	Microsoft's tested baseline (Ubuntu 24.04.3 LTS).
Workstation	RTX 4090 / 5090 (24–32 GB)	vLLM bf16 or Q8 GGUF	Best price/perf for full-precision local runs.
Consumer	RTX 4070 Ti Super / 4080 (12–16 GB)	Q4_K_M GGUF via Ollama	~4.7 GB weights; fits with screenshot context headroom.
Laptop / NPU	Copilot+ PC, Apple Silicon M3/M4 (16 GB unified)	LM Studio / Jan / Foundry Local	Use Q4 or Q5 quants. Slower TTFT; fine for sequential agent steps.

Two non-obvious requirements: you need at least 15K tokens of context headroom because each step ingests a fresh screenshot tokenized as image patches plus the running task transcript, and you need enough display memory to render a real browser at 1280×720 or larger — Fara-7B's coordinate predictions degrade on tiny viewports.

Four installation paths

1. vLLM (reference, highest accuracy)

This is the path Microsoft tests against. Use it if you have a 24 GB+ GPU and care about benchmark-grade behavior:

git clone https://github.com/microsoft/fara.git
cd fara
python3 -m venv .venv
source .venv/bin/activate
pip install -e .[vllm]
playwright install

# In one terminal:
vllm serve microsoft/Fara-7B --port 5000 --dtype auto

# In another:
fara-cli --task "Find the cheapest one-way SFO to JFK on April 30 2026 and screenshot the result"

Pin torch==2.7.1 and vllm==0.10.0. Newer combinations have hit CUDA-graph capture failures on H100 in the wild; the fara repo's webeval environment locks these versions for that reason.

2. Transformers (lowest-friction Python)

If you don't want to run a separate inference server:

pip install "transformers>=4.53.3" "torch>=2.7.1" pillow accelerate

Then load microsoft/Fara-7B with AutoModelForVision2Seq / AutoProcessor and feed it a PIL screenshot plus the task prompt. Throughput is much lower than vLLM (no continuous batching) but the install is one pip command.

3. Ollama / LM Studio / Jan (easiest, GGUF)

For consumer GPUs, pull a community GGUF quant. bartowski/microsoft_Fara-7B-GGUF is the most-used, and mradermacher/Fara-7B-GGUF publishes the full Q2_K through F16 ladder (3.02 GB to 15.2 GB).

# Ollama
ollama pull hf.co/bartowski/microsoft_Fara-7B-GGUF:Q4_K_M
ollama run hf.co/bartowski/microsoft_Fara-7B-GGUF:Q4_K_M

# LM Studio: search "Fara-7B" in Discover, pick Q4_K_M (~4.7 GB) or Q5_K_M (~5.4 GB)

Because Fara-7B is a vision model, your runtime needs an mmproj/vision adapter file alongside the language weights — verify the GGUF repo you pick includes the projector. Pure text-only quants will load but will not see screenshots.

4. Azure AI Foundry (zero-install)

If you don't want a GPU at all, Fara-7B is hosted in Azure AI Foundry Labs. The repo's fara-cli supports it natively:

fara-cli --task "your task" --endpoint_config azure_foundry_config.json

This is the right choice for evaluation, demos, and CI. You lose the on-device privacy guarantee but gain elasticity.

Wiring Fara-7B into Magentic-UI

The fara-cli is fine for scripted single-task runs. For an interactive agent loop with a real browser pane, Microsoft recommends Magentic-UI. Spin up the same vLLM (or Ollama) endpoint above, then point Magentic-UI's model config at http://localhost:5000/v1 using an OpenAI-compatible client. The Magentic-UI section of the microsoft/fara README has the current routing snippet — note that as of the April 2026 repo update, the integration assumes the vendored chat clients (no more Autogen submodule).

How to choose a path

You have a 24 GB+ GPU and want benchmark-faithful behavior. → vLLM bf16, pinned versions.
You have a 12–16 GB consumer GPU. → Ollama with Q4_K_M GGUF. Expect a small accuracy hit; fine for personal automations.
You're on Apple Silicon. → LM Studio or Jan with the Q4/Q5 GGUF; both have working Metal vision pipelines.
You're on a Copilot+ PC. → Foundry Local — Microsoft has tuned NPU paths there.
You're evaluating before committing hardware. → Azure AI Foundry Labs.
You want a turnkey browser UI. → Any of the local options above + Magentic-UI in front.

Cost analysis

Microsoft's paper reports ~$0.025 average cost per task for Fara-7B vs. ~$0.30 for proprietary CUA baselines (the OpenAI computer-use family at the time of writing). On-prem with a sunk-cost GPU, the marginal cost is just electricity — call it fractions of a cent per task on a 4090 at ~$0.12/kWh.

Deployment	Effective cost per 1,000 tasks	Notes
Local (RTX 4090, Q4 GGUF)	~$0.50–$1 (electricity)	Assumes the GPU is already paid for.
Local (rented A100 hourly)	~$5–$10	At ~$1.50–$2/hr spot pricing, ~150 tasks/hr.
Azure AI Foundry (Fara-7B)	~$25	Microsoft's reported figure.
Proprietary CUA baseline	~$300	Microsoft's comparison number.

What Fara-7B is genuinely good at (and what it is not)

Plays well:

Multi-step shopping flows (search → filter → add to cart → checkout up to payment).
Form-driven workflows: account creation, applying to jobs, simple CRM data entry.
Navigational research: "find me the spec sheet PDF for the X1 Carbon 13th gen and download it."
Reservations and bookings on mainstream consumer sites.

Plays poorly:

Long-tail or niche sites with unusual layouts — reflected in the 38.4% WebTailBench number.
CAPTCHA-gated flows (Fara-7B intentionally halts at "critical points" rather than try to solve them).
Anything requiring sustained reasoning over >30 steps without checkpointing.
Tasks involving payment or sensitive credentials — the model is trained to pause for human approval, which is a feature, not a bug.

Common pitfalls and troubleshooting

Old git clone from 2025 fails. The Autogen submodule was removed in April 2026 — re-clone fresh; don't try to patch.
CUDA-graph capture errors on H100. Pin torch==2.7.1 and vllm==0.10.0 exactly. Unpinned environments produce subtle silent failures, not crashes.
Coordinate clicks land off-target. Your viewport is too small. Run the controlled browser at 1280×720 minimum; 1440×900 is safer.
GGUF loads but model "can't see" the screen. You pulled a text-only quant. Verify the mmproj-*.gguf projector file is in the same repo and your runtime is loading it.
Out-of-context errors after a few steps. Each step appends a fresh screenshot. Drop history aggressively or use the model's own summarization step; do not just raise --max-model-len past 32K without watching VRAM.
Playwright can't reach the page (WebTailBench). Some target sites are blocked at the Playwright fingerprint level. The reference setup uses BrowserBase — expect to need a paid browser-as-a-service for the full bench.

Safety and production deployment

Fara-7B can drive a real browser logged into your real accounts. Treat that the way you'd treat a junior contractor with full RDP access:

Run it in an ephemeral browser profile or a sandboxed VM — never in your main browser.
Keep "critical point" stops on. The model is trained to halt at password fields, payment screens, and irreversible actions; do not silence those.
Log every action (URL, screenshot, predicted action JSON) for audit.
Rate-limit per session. A bug in your harness can otherwise rack up hundreds of clicks per minute on a target site.
Never expose Fara's vLLM endpoint to the public internet without an auth proxy.

If you're building a production agent and need senior engineers who've actually shipped CUA systems, Codersera's vetted remote developers have placement-ready specialists in agentic tooling. That's an aside; the rest of this post is the technical guide.

FAQ

Is Fara-7B free to use commercially?

Yes. The model is released under the MIT License (see the Hugging Face model card), which permits commercial use, modification, and redistribution with attribution. The "experimental" status in Azure AI Foundry Labs refers to product stability, not licensing.

What's the minimum VRAM I need?

About 6 GB if you're running a Q4 GGUF with a small viewport, ~10 GB for Q8, and ~16 GB for full bf16. Microsoft's reference setup used 24 GB+ cards.

How does Fara-7B compare to UI-TARS or OpenAI's computer-use models?

On WebVoyager: 73.5% (Fara-7B) vs. 70.9% (OpenAI computer-use baseline) vs. 66.4% (UI-TARS-1.5-7B) vs. 65.1% (SoM GPT-4o). Fara-7B is the current open-weight leader at the 7B size class. UI-TARS family models trade blows on other benches; OpenAI's hosted model is closer in raw quality but ~12× more expensive.

Can I fine-tune Fara-7B on my own browser data?

The MIT license permits it. Practically, you'd fine-tune the underlying Qwen2.5-VL-7B on your screenshot/action pairs. Microsoft has not (as of April 2026) released the exact FaraGen synthetic-trajectory pipeline, so you'd be building your own data flywheel.

Does Ollama officially support Fara-7B?

Not in Ollama's curated registry, but Ollama can pull any GGUF from Hugging Face directly via ollama pull hf.co/<repo>:<quant>. Use a community quant from bartowski, mradermacher, or Mungert.

Is Fara-7B's 73.5% WebVoyager score reproducible?

Microsoft reports the average over 3 runs in the official paper. Reproducing it locally requires the pinned vllm==0.10.0 environment, the WebVoyager harness, and a stable browser-as-a-service (BrowserBase or equivalent) so the target sites don't fingerprint-block your Playwright session.

Can Fara-7B handle desktop apps, or only browsers?

The released model is trained specifically for web/browser tasks. Generalizing to native desktop control is plausible (the architecture is generic) but not what Microsoft shipped.

Fara-7B Installation Guide (April 2026): Run Microsoft's Local Computer-Use Agent