Run Microsoft OmniParser V2 on Windows: Step-by-Step Guide (April 2026, v2.0.1)
Last updated April 2026 — refreshed for OmniParser v2.0.1, the CVE-2025-55322 patch, and the current OmniTool stack.
Microsoft OmniParser is the screen-parsing layer that turns ordinary multimodal LLMs into GUI agents: feed it a screenshot and it returns a JSON list of every interactable element with bounding boxes, function captions, and IDs. The current production release on Windows is v2.0.1 (September 12, 2025), which patches the unauthenticated RCE tracked as CVE-2025-55322. This guide walks through a clean install on Windows 10 or 11 with CUDA, covers the full OmniTool Windows-VM stack, and flags the configuration mistakes that bite most first-time users.
What changed in 2026 (read this first)
- Security: CVE-2025-55322 (CVSS 7.3, "binding to an unrestricted IP" leading to unauthenticated RCE) was patched in OmniParser v2.0.1 on 12 September 2025. Anything earlier than 2.0.1 should be treated as compromised if it was ever network-reachable.
- Checkpoint download command: the original recipe (looped huggingface-cli download for six files, then a mv rename) is still correct, but most users now prefer huggingface-cli download microsoft/OmniParser-v2.0 --local-dir weights for the whole snapshot.
- OmniTool stack: the omnitool/ folder ships three components: omniparserserver (FastAPI), omnibox (a Windows 11 Docker VM), and gradio (the operator UI). They are designed to run on separate hosts.
- AMD on Windows still has no first-class story. ROCm does not run natively on Windows, and OmniParser's PyTorch path expects CUDA. AMD users should run inside WSL2 with ROCm, or fall back to CPU.
- LLM choices have moved on. The README still lists GPT-4o, o1, o3-mini, DeepSeek R1, Qwen 2.5VL, and Anthropic Sonnet (3.5/3.7) for OmniTool. In practice in 2026, most teams pair OmniParser with Claude Sonnet 4.6, Claude Opus 4.7, GPT-5-class endpoints, or Qwen 3-VL; anything that emits structured tool calls works.
Want the full picture? Read our continuously-updated Claude Opus 4.7 complete guide — capabilities, pricing, prompting tips, and head-to-head benchmarks vs GPT-5.5 and DeepSeek V4.
TL;DR — what to install on Windows in April 2026
| Component | Version / source | Note |
|---|---|---|
| OmniParser | v2.0.1 (Sep 2025) | Pin this — earlier is vulnerable to CVE-2025-55322 |
| Python | 3.12 in Conda | 3.10 / 3.11 also work but 3.12 is the documented baseline |
| PyTorch (NVIDIA) | 2.x with CUDA 11.8 or 12.x wheels | Match to your driver |
| Model weights | microsoft/OmniParser-v2.0 on Hugging Face | YOLOv8 (icon detect, AGPL) + Florence-2 (icon caption, MIT) |
| OmniTool VM | Windows 11 Enterprise eval ISO + Docker Desktop | ~30 GB free; first build runs 20–90 min |
| GPU target | RTX 3090 / 4090 / 5090, A100, H100 | ~0.6 s/frame on A100, ~0.8 s/frame on 4090 |
What OmniParser v2 actually is
OmniParser is two fine-tuned vision models stitched into a screen parser:
- icon_detect — a YOLOv8 model fine-tuned on a Microsoft-curated dataset of interactable elements (buttons, icons, form fields). Returns bounding boxes plus a "clickable / not clickable" flag.
- icon_caption — a Florence-2 base model fine-tuned to describe what each detected element does ("close window", "open settings panel"), not just what it looks like.
The output is a structured set-of-marks: each visible element is given a numeric ID and a description, then overlaid back onto the screenshot. A general-purpose VLM (GPT-5, Claude Opus 4.7, Qwen 3-VL, etc.) can then say "click element 14" instead of trying to predict pixel coordinates — which is the failure mode that pushes most raw-vision agents below 1% on tough benchmarks.
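To make the set-of-marks idea concrete, here is a minimal sketch of how a planner's "click element 14" gets turned back into a pixel coordinate. The element dicts and field names below are illustrative, not OmniParser's exact output schema:

```python
# Illustrative set-of-marks output: each element has an ID, a functional
# caption, and a bounding box (assumed normalized to [0, 1] here).
elements = [
    {"id": 13, "caption": "open settings panel", "bbox": [0.02, 0.05, 0.08, 0.10]},
    {"id": 14, "caption": "close window", "bbox": [0.95, 0.01, 0.99, 0.05]},
]

def click_point(elements, element_id, screen_w, screen_h):
    """Return the pixel center of the chosen element."""
    el = next(e for e in elements if e["id"] == element_id)
    x0, y0, x1, y1 = el["bbox"]
    return (round((x0 + x1) / 2 * screen_w), round((y0 + y1) / 2 * screen_h))

print(click_point(elements, 14, 1920, 1080))  # → (1862, 32)
```

The VLM only ever emits an integer ID; the grounding layer owns the geometry, which is exactly the coordinate-prediction burden OmniParser lifts off the model.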
OmniParser v2 + GPT-4o set 39.6% average accuracy on the ScreenSpot Pro grounding benchmark (high-resolution professional applications across 23 apps and three OSes). For comparison, the same GPT-4o without OmniParser scores 0.8% on that benchmark. Latency dropped about 60% versus v1: ~0.6 s/frame on A100, ~0.8 s/frame on a single RTX 4090. Those numbers are from Microsoft Research's own write-up and the v2 model card.
Prerequisites
- Windows 10 22H2 or Windows 11 23H2/24H2.
- An NVIDIA GPU with at least 8 GB of VRAM is comfortable; 4 GB is the practical minimum for inference. CPU works but is roughly 10–30× slower.
- The matching NVIDIA driver and a CUDA Toolkit (11.8 or 12.x): check with nvidia-smi.
- Anaconda or Miniconda for environment isolation. uv + venv works too, but the Microsoft README assumes Conda.
- Git for Windows.
- For OmniTool: Docker Desktop (WSL2 backend) and ~30 GB free disk space (5 GB ISO, 400 MB container image, 20 GB VM storage, plus headroom).
Step 1: Clone the repo and pin v2.0.1
git clone https://github.com/microsoft/OmniParser.git
cd OmniParser
git checkout v.2.0.1
Pinning the tag is not optional. The default branch may include unreleased changes, and v2.0.1 is the first release that has the CVE-2025-55322 fix.
Step 2: Create the Conda environment
conda create -n omni python=3.12 -y
conda activate omni
Python 3.12 is the version Microsoft documents. 3.10 and 3.11 also work; 3.13 has had transformers / Florence-2 compatibility issues.
Step 3: Install PyTorch (this is where most installs go wrong)
The OmniParser requirements.txt does not pin a CUDA-specific PyTorch wheel. If you let pip resolve it, you typically end up with a CPU-only build, silently run YOLOv8 on the CPU, and wonder why latency is 30 s/frame.
Install PyTorch first, then the rest of requirements.txt:
# NVIDIA GPU, CUDA 12.1 driver
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# NVIDIA GPU, older CUDA 11.8 driver
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# CPU only (slow, but works)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
Verify before continuing:
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'cpu')"
You want to see something like 2.x.y True NVIDIA GeForce RTX 4090. If it prints False, stop and fix the driver/wheel mismatch — running the rest of the install with a CPU torch will work but defeats the point.
AMD GPUs on Windows
The original guide's "AMD step" pointed users at the CUDA 11.8 wheel, which is a CPU-shaped trap on AMD hardware. The honest 2026 picture:
- AMD ROCm has no native Windows runtime. The rocm.docs.amd.com Windows pages cover only the Adrenalin-driver client side; PyTorch + ROCm wheels are Linux-only.
- The supported workaround is WSL2: install Ubuntu 24.04 in WSL2, install AMD's ROCm 6.x for Linux, then run pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.x inside WSL. Run OmniParser from there.
- If that's too much, run the CPU wheel and accept multi-second latency per frame.
Step 4: Install the rest of the dependencies
pip install -r requirements.txt
This pulls Ultralytics (YOLO), transformers, PaddleOCR (added in v1.5 as an alternative OCR backend), Gradio, and the rest.
Step 5: Download the v2 checkpoints
The original guide's command is correct but easy to mistype because of the brace-expansion. The simpler form pulls the whole repo:
pip install -U "huggingface_hub[cli]"
huggingface-cli download microsoft/OmniParser-v2.0 --local-dir weights
mv weights/icon_caption weights/icon_caption_florence
If you want the precise file list documented in Microsoft's README:
for f in icon_detect/{train_args.yaml,model.pt,model.yaml} icon_caption/{config.json,generation_config.json,model.safetensors}; do
huggingface-cli download microsoft/OmniParser-v2.0 "$f" --local-dir weights
done
mv weights/icon_caption weights/icon_caption_florence
The rename is mandatory — gradio_demo.py and the OmniTool server look for weights/icon_caption_florence, not weights/icon_caption.
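A quick way to catch both a partial download and a missed rename before launching anything is a small layout check. This sketch uses the six-file list from Microsoft's README as documented above; the fix_and_check helper name is ours, and it applies the mandatory rename if it hasn't happened yet:

```python
from pathlib import Path

# The six checkpoint files from Microsoft's README, in the post-rename layout.
REQUIRED = [
    "icon_detect/train_args.yaml",
    "icon_detect/model.pt",
    "icon_detect/model.yaml",
    "icon_caption_florence/config.json",
    "icon_caption_florence/generation_config.json",
    "icon_caption_florence/model.safetensors",
]

def fix_and_check(weights_dir: str = "weights") -> list[str]:
    """Apply the icon_caption -> icon_caption_florence rename if needed,
    then return any required files that are still missing."""
    root = Path(weights_dir)
    old, new = root / "icon_caption", root / "icon_caption_florence"
    if old.is_dir() and not new.exists():
        old.rename(new)
    return [f for f in REQUIRED if not (root / f).is_file()]
```

An empty return list means the weights directory is healthy; anything else tells you exactly which file to re-download.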
License note: icon_detect inherits YOLO's AGPL-3.0 license, while icon_caption is MIT. If you ship OmniParser inside a closed-source product, the AGPL on the YOLO weights is the constraint to plan around.
Step 6: Run the local Gradio demo
python gradio_demo.py
Open the printed URL (default http://127.0.0.1:7860), upload a screenshot, and you should see boxes, IDs, and captions overlaid on the image plus the structured JSON below it. If it works here, the parser is healthy and you can move on to OmniTool.
Setting up OmniTool (the Windows 11 VM stack)
OmniTool is the reference stack for actually using OmniParser as an agent. It lives in omnitool/ in the repo and has three pieces:
- omniparserserver: a FastAPI server that wraps OmniParser. Lives on whichever box has the GPU.
- omnibox: a Dockerized Windows 11 VM that the agent actually clicks around in. Runs on QEMU/KVM via Docker.
- gradio: the operator UI. Type a goal, watch the agent loop.
Start the parser server
conda activate omni
cd OmniParser/omnitool/omniparserserver
python -m omniparserserver
By default it binds to 0.0.0.0:8000. Do not expose port 8000 to the public internet — that is precisely the setup CVE-2025-55322 weaponizes. Bind it to 127.0.0.1 or a private network only.
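Once the server is up on loopback you can call it without the Gradio UI. Here is a minimal client sketch; the /parse/ route and the base64_image field name are assumptions to verify against the omniparserserver code in your checkout:

```python
import base64
import json
import urllib.request

def build_parse_request(image_path: str) -> dict:
    """Package a screenshot as the JSON body for the parser server.
    NOTE: the 'base64_image' field name is an assumption; check
    omnitool/omniparserserver for the exact request schema."""
    with open(image_path, "rb") as f:
        return {"base64_image": base64.b64encode(f.read()).decode("ascii")}

def parse_screenshot(image_path: str, server: str = "http://127.0.0.1:8000") -> dict:
    """POST the screenshot to the (loopback-only!) parser server."""
    body = json.dumps(build_parse_request(image_path)).encode("utf-8")
    req = urllib.request.Request(
        f"{server}/parse/",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Keeping the default server URL on 127.0.0.1 in your own tooling is a cheap way to avoid accidentally normalizing a remote, unauthenticated endpoint.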
Provision the Windows 11 VM
- Install Docker Desktop and confirm docker run hello-world succeeds.
- Download the Windows 11 Enterprise evaluation ISO from the Microsoft Evaluation Center, rename it to custom.iso, and drop it in OmniParser/omnitool/omnibox/vm/win11iso/.
- Build the VM:
cd OmniParser/omnitool/omnibox/scripts
./manage_vm.sh create
First-time creation runs unattended Windows setup and takes 20–90 minutes depending on disk speed. Microsoft's own docs say to budget ~30 GB of free space (≈5 GB ISO, 400 MB container image, 20 GB VM storage). Subsequent boots are fast.
Launch the Gradio operator UI
cd OmniParser/omnitool/gradio
conda activate omni
python app.py --windows_host_url localhost:8006 --omniparser_server_url localhost:8000
The UI lets you pick a vision model (configure API keys in the sidebar), type a goal in natural language, and watch the parse → plan → click → re-parse loop. The README ships with built-in adapters for OpenAI (GPT-4o / o1 / o3-mini), DeepSeek R1, Qwen 2.5VL, and Anthropic Sonnet.
Security: CVE-2025-55322 in plain English
CVE-2025-55322 is the only critical vulnerability disclosed in OmniParser to date. The short version:
- What: the OmniTool VM controller bound its HTTP execution interface to 0.0.0.0 with no authentication. Anyone who could reach the port could direct the GUI agent to do anything the logged-in Windows user could do.
- CVSS: 7.3 High (CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:L/I:L/A:L).
- Timeline: reported to MSRC May 12, 2025; patched in v2.0.1 on Sep 12, 2025; CVE assigned Sep 24, 2025.
- Fix: upgrade to v2.0.1, keep the parser server bound to 127.0.0.1 or a non-routable network, add a token / mTLS in front of any control endpoint, and never expose OmniTool to the open internet.
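If you wrap the server in your own launcher, a cheap guard is to refuse unsafe bind addresses outright before starting anything. This is a standard-library sketch of our own, not part of OmniParser; the policy (loopback plus RFC 1918 private ranges) mirrors the remediation advice above:

```python
import ipaddress

def is_safe_bind(host: str) -> bool:
    """Heuristic guard against the CVE-2025-55322 setup: allow loopback and
    private addresses, refuse 0.0.0.0/:: and anything publicly routable."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:              # not an IP literal, e.g. a hostname
        return host == "localhost"
    if ip.is_unspecified:           # 0.0.0.0 -- the vulnerable default
        return False
    return ip.is_loopback or ip.is_private
```

Call it on whatever host value your launcher is about to pass to the server, and fail loudly instead of binding wide open.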
Benchmarks: how OmniParser actually performs in 2026
| Benchmark | Setup | Score | Note |
|---|---|---|---|
| ScreenSpot Pro (grounding, hi-res pro apps) | OmniParser v2 + GPT-4o | 39.6% avg | vs 0.8% for GPT-4o alone |
| ScreenSpot Pro (best non-OmniParser baseline at v2 launch) | strongest grounding model | ~18.9% | v2 roughly doubles it |
| WindowsAgentArena | OmniParser v1 + GPT-4V | SOTA at v1 release | Microsoft Research |
| Inference latency, A100 | v2 | ~0.6 s/frame | ~60% faster than v1 |
| Inference latency, RTX 4090 | v2 | ~0.8 s/frame | single-GPU, fp16 |
Caveats: the 39.6% headline is "OmniParser as the grounder, GPT-4o as the planner." Swap GPT-4o for Claude Opus 4.7 or Qwen 3-VL and the absolute number shifts; the relative lift over a vision-only baseline does not. There is no public OmniParser-on-Claude-4.7 benchmark in the model card yet — treat that pairing as production-viable but not externally graded.
How to choose: do you actually need OmniParser?
| Your situation | Recommendation |
|---|---|
| You're building a Windows desktop agent and want it to work with any VLM | OmniParser v2.0.1 + OmniTool |
| You're on Anthropic's stack and only need to drive a desktop | Claude's built-in Computer Use tool — no parser needed |
| You only need to automate a browser, not the OS | Playwright + a VLM, or browser-specific agents like Browser Use / OpenAI's Operator |
| Your target is mobile, not desktop | OmniParser is desktop-trained; mobile UIs need a different grounder |
| You can't run a Linux/CUDA box anywhere | Use the hosted endpoint at replicate.com/microsoft/omniparser-v2 or Azure AI Foundry's microsoft-omniparser-v2-0 |
| You have to ship a closed-source binary that bundles OmniParser | The YOLO icon_detect weights are AGPL-3.0 — talk to legal before redistribution |
Common pitfalls and how to fix them
1. torch.cuda.is_available() is False
Almost always a CPU-only torch wheel got installed because requirements.txt didn't pin one. Reinstall torch from the matching CUDA index URL (see Step 3) before running the requirements file again.
2. FileNotFoundError: weights/icon_caption_florence
You forgot the rename. Run mv weights/icon_caption weights/icon_caption_florence (PowerShell: Rename-Item weights/icon_caption icon_caption_florence).
3. AttributeError: 'Florence2ForConditionalGeneration' object has no attribute '_supports_sdpa'
Tracked as issue #323 on the repo. Caused by a mismatch between the cached Florence-2 model code and the installed transformers version. Fix: pip install --upgrade transformers, then clear the HF cache for that model (~/.cache/huggingface/hub/models--microsoft--Florence-2-base/) and re-run.
4. OmniTool VM stuck at "Installing Windows"
Check that your ISO is the Enterprise Evaluation build, named exactly custom.iso, and that Hyper-V / WSL2 virtualization is enabled in BIOS. The unattended setup will silently fail on Home or N editions of the ISO.
5. Latency is 5–10× higher than the published numbers
Confirm you are running the v2 weights (not v1), confirm CUDA is active, and confirm you are not running OmniParser inside the VM. The parser server should run on the host GPU; the VM should only host the agent's "fingers."
6. Gradio demo loads but parses return empty boxes
Most often the Florence-2 captioner failed silently. Check the terminal output for transformers warnings; the parser will return an empty list rather than crashing if the captioner fails to load.
What was removed (and why)
- The original "AMD machines" command (
pip install torch==2.1.0 ... cu118) — that is a CUDA wheel and does nothing useful on AMD hardware. Replaced with the realistic WSL2 + ROCm path above. - The pinned
torch==2.1.0recommendation — it predates several Florence-2 / transformers updates. Use a current 2.x torch matching your CUDA. - The "configure
train_args.yaml" advice — the file ships from Hugging Face and should not be edited unless you are fine-tuning your own detection model.
Building this in production
If you are wiring OmniParser into a real product — an internal RPA platform, a customer-support agent, an accessibility tool — the model is the easy part. The hard parts are the safety harness around clicks, the retry/verify loop, the eval suite that catches regressions when you swap LLMs, and the auth around the parser server you just learned about in the CVE section. If your team is short an experienced agent engineer, Codersera places vetted remote developers who have shipped exactly this kind of stack.
FAQ
Is OmniParser v2 free?
Yes — the code is MIT-licensed on GitHub and the model weights are published on Hugging Face. The icon_detect (YOLOv8) weights are AGPL-3.0; the icon_caption (Florence-2) weights are MIT. Hosted endpoints on Replicate and Azure AI Foundry charge per call.
Does OmniParser work without a GPU?
It runs on CPU but slowly — multiple seconds per frame instead of sub-second. For interactive agent loops you want at least an RTX 3060 / 8 GB VRAM.
What is the difference between OmniParser and OmniTool?
OmniParser is the screen-parsing model. OmniTool is the full agent stack (parser server + Windows 11 VM + Gradio UI) that uses OmniParser to actually drive a desktop. You can use OmniParser without OmniTool; OmniTool depends on OmniParser.
Can I use OmniParser with Claude Opus 4.7 or GPT-5?
Yes. Any LLM that can take an image plus a structured tool schema and emit valid tool calls works. The OmniTool repo ships adapters for the older catalog (GPT-4o, o1, o3-mini, DeepSeek R1, Qwen 2.5VL, Sonnet 3.5/3.7), but adding a Claude 4.7 or GPT-5 adapter is a one-file change.
Should I run OmniParser inside the OmniTool VM?
No. Run the parser server on the host with the GPU; the OmniBox VM can live on the same machine or a separate one. Running the parser inside a Windows VM removes GPU access and tanks latency.
Is there an OmniParser v3?
Not as of April 2026. The latest tagged release is v2.0.1 (September 12, 2025). Microsoft Research has signaled multi-agent orchestration improvements in OmniTool but no v3 model has been announced.
How does OmniParser compare to Anthropic's Computer Use?
Anthropic's Computer Use bakes screen understanding into the model itself — you point Claude at a screenshot and it emits coordinates. OmniParser is model-agnostic: it gives you the structured element list, then any VLM can plan against it. Computer Use is simpler if you're committing to Anthropic; OmniParser is the right choice if you need to swap LLMs or want stronger small-element grounding (the 39.6% vs ~18.9% gap on ScreenSpot Pro).
What hardware do I need to hit the published latency numbers?
Microsoft cites ~0.6 s/frame on an A100 and ~0.8 s/frame on a single RTX 4090. RTX 5090 / H100 are faster; consumer cards (3060, 3080) typically land between 1.0 and 1.5 s/frame at fp16.
Related Codersera guides
- Run Microsoft OmniParser V2 on macOS — step-by-step installation guide
- Run DeepSeek Janus-Pro 7B on Windows — complete installation guide
- Run DeepSeek Janus-Pro 7B on Mac with ComfyUI
- Run DeepSeek Janus-Pro 7B on Mac — step-by-step guide
References and further reading
- microsoft/OmniParser on GitHub — source, releases, OmniTool subdirectory.
- Release v.2.0.1 (Sep 12, 2025) — the security-patched build.
- microsoft/OmniParser-v2.0 model card on Hugging Face — weights, license terms, benchmark numbers.
- "OmniParser V2: Turning Any LLM into a Computer Use Agent" — Microsoft Research — the official launch write-up with the 39.6% ScreenSpot Pro number.
- NVD: CVE-2025-55322 — the OmniParser RCE entry, CVSS 7.3.
- Microsoft MSRC advisory for CVE-2025-55322 — vendor record and remediation guidance.
- ScreenSpot-Pro benchmark repository — the benchmark OmniParser v2 is evaluated on.
- "OmniParser for Pure Vision Based GUI Agent" (arXiv 2408.00203) — the research paper.
- OmniParser V2 Hugging Face Space — try it in the browser before installing.