Run Microsoft OmniParser V2 on Windows: Step-by-Step Guide (April 2026, v2.0.1)
Last updated April 2026 — refreshed for OmniParser v2.0.1, the CVE-2025-55322 patch, and the current OmniTool stack.
Microsoft OmniParser is the screen-parsing layer that turns ordinary multimodal LLMs into GUI agents: feed it a screenshot and it returns a JSON list of every interactable element with bounding boxes, function captions, and IDs. The current production release on Windows is v2.0.1 (September 12, 2025), which patches the unauthenticated RCE tracked as CVE-2025-55322. This guide walks through a clean install on Windows 10 or 11 with CUDA, covers the full OmniTool Windows-VM stack, and flags the configuration mistakes that bite most first-time users.
What changed in 2026 (read this first)
- Security: CVE-2025-55322 (CVSS 7.3, "binding to an unrestricted IP" leading to unauthenticated RCE) was patched in OmniParser v2.0.1 on 12 September 2025. Anything earlier than 2.0.1 should be treated as compromised if it was ever network-reachable.
- Checkpoint download command: the original recipe (looped huggingface-cli download for six files, then a mv rename) is still correct, but most users now prefer huggingface-cli download microsoft/OmniParser-v2.0 --local-dir weights for the whole snapshot.
- OmniTool stack: the omnitool/ folder ships three components: omniparserserver (FastAPI), omnibox (a Windows 11 Docker VM), and gradio (the operator UI). They are designed to run on separate hosts.
- AMD on Windows still has no first-class story. ROCm does not run natively on Windows, and OmniParser's PyTorch path expects CUDA. AMD users should run inside WSL2 with ROCm, or fall back to CPU.
- LLM choices have moved on. The README still lists GPT-4o, o1, o3-mini, DeepSeek R1, Qwen 2.5VL, and Anthropic Sonnet (3.5/3.7) for OmniTool. In practice in 2026, most teams pair OmniParser with Claude Sonnet 4.6, Claude Opus 4.7, GPT-5-class endpoints, or Qwen 3-VL; anything that emits structured tool calls works.
Want the full picture? Read our continuously-updated Claude Opus 4.7 complete guide — capabilities, pricing, prompting tips, and head-to-head benchmarks vs GPT-5.5 and DeepSeek V4.
TL;DR — what to install on Windows in April 2026
| Component | Version / source | Note |
|---|---|---|
| OmniParser | v2.0.1 (Sep 2025) | Pin this — earlier is vulnerable to CVE-2025-55322 |
| Python | 3.12 in Conda | 3.10 / 3.11 also work but 3.12 is the documented baseline |
| PyTorch (NVIDIA) | 2.x with CUDA 11.8 or 12.x wheels | Match to your driver |
| Model weights | microsoft/OmniParser-v2.0 on Hugging Face | YOLOv8 (icon detect, AGPL) + Florence-2 (icon caption, MIT) |
| OmniTool VM | Windows 11 Enterprise eval ISO + Docker Desktop | ~30 GB free; first build runs 20–90 min |
| GPU target | RTX 3090 / 4090 / 5090, A100, H100 | ~0.6 s/frame on A100, ~0.8 s/frame on 4090 |
What OmniParser v2 actually is
OmniParser is two fine-tuned vision models stitched into a screen parser:
- icon_detect — a YOLOv8 model fine-tuned on a Microsoft-curated dataset of interactable elements (buttons, icons, form fields). Returns bounding boxes plus a "clickable / not clickable" flag.
- icon_caption — a Florence-2 base model fine-tuned to describe what each detected element does ("close window", "open settings panel"), not just what it looks like.
The output is a structured set-of-marks: each visible element is given a numeric ID and a description, then overlaid back onto the screenshot. A general-purpose VLM (GPT-5, Claude Opus 4.7, Qwen 3-VL, etc.) can then say "click element 14" instead of trying to predict pixel coordinates — which is the failure mode that pushes most raw-vision agents below 1% on tough benchmarks.
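To make the set-of-marks idea concrete, here is a minimal sketch of how a planner's "click element 14" gets turned back into a pixel coordinate. The element dicts and field names below are illustrative, not OmniParser's exact output schema:

```python
# Illustrative set-of-marks output: each element has an ID, a functional
# caption, and a bounding box (assumed normalized to [0, 1] here).
elements = [
    {"id": 13, "caption": "open settings panel", "bbox": [0.02, 0.05, 0.08, 0.10]},
    {"id": 14, "caption": "close window", "bbox": [0.95, 0.01, 0.99, 0.05]},
]

def click_point(elements, element_id, screen_w, screen_h):
    """Return the pixel center of the chosen element."""
    el = next(e for e in elements if e["id"] == element_id)
    x0, y0, x1, y1 = el["bbox"]
    return (round((x0 + x1) / 2 * screen_w), round((y0 + y1) / 2 * screen_h))

print(click_point(elements, 14, 1920, 1080))  # → (1862, 32)
```

The VLM only ever emits an integer ID; the grounding layer owns the geometry, which is exactly the coordinate-prediction burden OmniParser lifts off the model.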
OmniParser v2 + GPT-4o set 39.6% average accuracy on the ScreenSpot Pro grounding benchmark (high-resolution professional applications across 23 apps and three OSes). For comparison, the same GPT-4o without OmniParser scores 0.8% on that benchmark. Latency dropped about 60% versus v1: ~0.6 s/frame on A100, ~0.8 s/frame on a single RTX 4090. Those numbers are from Microsoft Research's own write-up and the v2 model card.
Prerequisites
- Windows 10 22H2 or Windows 11 23H2/24H2.
- An NVIDIA GPU with at least 8 GB of VRAM is comfortable; 4 GB is the practical minimum for inference. CPU works but is roughly 10–30× slower.
- The matching NVIDIA driver and a CUDA Toolkit (11.8 or 12.x): check with nvidia-smi.
- Anaconda or Miniconda for environment isolation. uv + venv works too, but the Microsoft README assumes Conda.
- Git for Windows.
- For OmniTool: Docker Desktop (WSL2 backend) and ~30 GB free disk space (5 GB ISO, 400 MB container image, 20 GB VM storage, plus headroom).
Step 1: Clone the repo and pin v2.0.1
git clone https://github.com/microsoft/OmniParser.git
cd OmniParser
git checkout v.2.0.1
Pinning the tag is not optional. The default branch may include unreleased changes, and v2.0.1 is the first release that has the CVE-2025-55322 fix.
Step 2: Create the Conda environment
conda create -n omni python=3.12 -y
conda activate omni
Python 3.12 is the version Microsoft documents. 3.10 and 3.11 also work; 3.13 has had transformers / Florence-2 compatibility issues.
Step 3: Install PyTorch (this is where most installs go wrong)
The OmniParser requirements.txt does not pin a CUDA-specific PyTorch wheel. If you let pip resolve it, you typically end up with a CPU-only build, silently run YOLOv8 on the CPU, and wonder why latency is 30 s/frame.
Install PyTorch first, then the rest of requirements.txt:
# NVIDIA GPU, CUDA 12.1 driver
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# NVIDIA GPU, older CUDA 11.8 driver
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# CPU only (slow, but works)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
Verify before continuing:
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'cpu')"
You want to see something like 2.x.y True NVIDIA GeForce RTX 4090. If it prints False, stop and fix the driver/wheel mismatch — running the rest of the install with a CPU torch will work but defeats the point.
AMD GPUs on Windows
The original guide's "AMD step" pointed users at the CUDA 11.8 wheel, which is a CPU-shaped trap on AMD hardware. The honest 2026 picture:
- AMD ROCm has no native Windows runtime. The rocm.docs.amd.com Windows pages cover only the Adrenalin-driver client side; PyTorch + ROCm wheels are Linux-only.
- The supported workaround is WSL2: install Ubuntu 24.04 in WSL2, install AMD's ROCm 6.x for Linux, then run pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.x inside WSL. Run OmniParser from there.
- If that's too much, run the CPU wheel and accept multi-second latency per frame.
Step 4: Install the rest of the dependencies
pip install -r requirements.txt
This pulls Ultralytics (YOLO), transformers, PaddleOCR (added in v1.5 as an alternative OCR backend), Gradio, and the rest.
Step 5: Download the v2 checkpoints
The original guide's command is correct but easy to mistype because of the brace-expansion. The simpler form pulls the whole repo:
pip install -U "huggingface_hub[cli]"
huggingface-cli download microsoft/OmniParser-v2.0 --local-dir weights
mv weights/icon_caption weights/icon_caption_florence
If you want the precise file list documented in Microsoft's README:
for f in icon_detect/{train_args.yaml,model.pt,model.yaml} icon_caption/{config.json,generation_config.json,model.safetensors}; do
huggingface-cli download microsoft/OmniParser-v2.0 "$f" --local-dir weights
done
mv weights/icon_caption weights/icon_caption_florence
The rename is mandatory — gradio_demo.py and the OmniTool server look for weights/icon_caption_florence, not weights/icon_caption.
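A quick way to catch both a partial download and a missed rename before launching anything is a small layout check. This sketch uses the six-file list from Microsoft's README as documented above; the fix_and_check helper name is ours, and it applies the mandatory rename if it hasn't happened yet:

```python
from pathlib import Path

# The six checkpoint files from Microsoft's README, in the post-rename layout.
REQUIRED = [
    "icon_detect/train_args.yaml",
    "icon_detect/model.pt",
    "icon_detect/model.yaml",
    "icon_caption_florence/config.json",
    "icon_caption_florence/generation_config.json",
    "icon_caption_florence/model.safetensors",
]

def fix_and_check(weights_dir: str = "weights") -> list[str]:
    """Apply the icon_caption -> icon_caption_florence rename if needed,
    then return any required files that are still missing."""
    root = Path(weights_dir)
    old, new = root / "icon_caption", root / "icon_caption_florence"
    if old.is_dir() and not new.exists():
        old.rename(new)
    return [f for f in REQUIRED if not (root / f).is_file()]
```

An empty return list means the weights directory is healthy; anything else tells you exactly which file to re-download.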
License note: icon_detect inherits YOLO's AGPL-3.0 license, while icon_caption is MIT. If you ship OmniParser inside a closed-source product, the AGPL on the YOLO weights is the constraint to plan around.
Step 6: Run the local Gradio demo
python gradio_demo.py
Open the printed URL (default http://127.0.0.1:7860), upload a screenshot, and you should see boxes, IDs, and captions overlaid on the image plus the structured JSON below it. If it works here, the parser is healthy and you can move on to OmniTool.
Setting up OmniTool (the Windows 11 VM stack)
OmniTool is the reference stack for actually using OmniParser as an agent. It lives in omnitool/ in the repo and has three pieces:
- omniparserserver: a FastAPI server that wraps OmniParser. Lives on whichever box has the GPU.
- omnibox: a Dockerized Windows 11 VM that the agent actually clicks around in. Runs on QEMU/KVM via Docker.
- gradio: the operator UI. Type a goal, watch the agent loop.
Start the parser server
conda activate omni
cd OmniParser/omnitool/omniparserserver
python -m omniparserserver
By default it binds to 0.0.0.0:8000. Do not expose port 8000 to the public internet — that is precisely the setup CVE-2025-55322 weaponizes. Bind it to 127.0.0.1 or a private network only.
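Once the server is up on loopback you can call it without the Gradio UI. Here is a minimal client sketch; the /parse/ route and the base64_image field name are assumptions to verify against the omniparserserver code in your checkout:

```python
import base64
import json
import urllib.request

def build_parse_request(image_path: str) -> dict:
    """Package a screenshot as the JSON body for the parser server.
    NOTE: the 'base64_image' field name is an assumption; check
    omnitool/omniparserserver for the exact request schema."""
    with open(image_path, "rb") as f:
        return {"base64_image": base64.b64encode(f.read()).decode("ascii")}

def parse_screenshot(image_path: str, server: str = "http://127.0.0.1:8000") -> dict:
    """POST the screenshot to the (loopback-only!) parser server."""
    body = json.dumps(build_parse_request(image_path)).encode("utf-8")
    req = urllib.request.Request(
        f"{server}/parse/",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Keeping the default server URL on 127.0.0.1 in your own tooling is a cheap way to avoid accidentally normalizing a remote, unauthenticated endpoint.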
Provision the Windows 11 VM
- Install Docker Desktop and confirm docker run hello-world succeeds.
- Download the Windows 11 Enterprise evaluation ISO from the Microsoft Evaluation Center, rename it to custom.iso, and drop it in OmniParser/omnitool/omnibox/vm/win11iso/.
- Build the VM:
cd OmniParser/omnitool/omnibox/scripts
./manage_vm.sh create
First-time creation runs unattended Windows setup and takes 20–90 minutes depending on disk speed. Microsoft's own docs say to budget ~30 GB of free space (≈5 GB ISO, 400 MB container image, 20 GB VM storage). Subsequent boots are fast.
Launch the Gradio operator UI
cd OmniParser/omnitool/gradio
conda activate omni
python app.py --windows_host_url localhost:8006 --omniparser_server_url localhost:8000
The UI lets you pick a vision model (configure API keys in the sidebar), type a goal in natural language, and watch the parse → plan → click → re-parse loop. The README ships with built-in adapters for OpenAI (GPT-4o / o1 / o3-mini), DeepSeek R1, Qwen 2.5VL, and Anthropic Sonnet.
Security: CVE-2025-55322 in plain English
CVE-2025-55322 is the only critical vulnerability disclosed in OmniParser to date. The short version:
- What: the OmniTool VM controller bound its HTTP execution interface to 0.0.0.0 with no authentication. Anyone who could reach the port could direct the GUI agent to do anything the logged-in Windows user could do.
- CVSS: 7.3 High (CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:L/I:L/A:L).
- Timeline: reported to MSRC May 12, 2025; patched in v2.0.1 on Sep 12, 2025; CVE assigned Sep 24, 2025.
- Fix: upgrade to v2.0.1, keep the parser server bound to 127.0.0.1 or a non-routable network, add a token / mTLS in front of any control endpoint, and never expose OmniTool to the open internet.
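If you wrap the server in your own launcher, a cheap guard is to refuse unsafe bind addresses outright before starting anything. This is a standard-library sketch of our own, not part of OmniParser; the policy (loopback plus RFC 1918 private ranges) mirrors the remediation advice above:

```python
import ipaddress

def is_safe_bind(host: str) -> bool:
    """Heuristic guard against the CVE-2025-55322 setup: allow loopback and
    private addresses, refuse 0.0.0.0/:: and anything publicly routable."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:              # not an IP literal, e.g. a hostname
        return host == "localhost"
    if ip.is_unspecified:           # 0.0.0.0 -- the vulnerable default
        return False
    return ip.is_loopback or ip.is_private
```

Call it on whatever host value your launcher is about to pass to the server, and fail loudly instead of binding wide open.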
Benchmarks: how OmniParser actually performs in 2026
| Benchmark | Setup | Score | Note |
|---|---|---|---|
| ScreenSpot Pro (grounding, hi-res pro apps) | OmniParser v2 + GPT-4o | 39.6% avg | vs 0.8% for GPT-4o alone |
| ScreenSpot Pro (best non-OmniParser baseline at v2 launch) | strongest grounding model | ~18.9% | v2 roughly doubles it |
| WindowsAgentArena | OmniParser v1 + GPT-4V | SOTA at v1 release | Microsoft Research |
| Inference latency, A100 | v2 | ~0.6 s/frame | ~60% faster than v1 |
| Inference latency, RTX 4090 | v2 | ~0.8 s/frame | single-GPU, fp16 |
Caveats: the 39.6% headline is "OmniParser as the grounder, GPT-4o as the planner." Swap GPT-4o for Claude Opus 4.7 or Qwen 3-VL and the absolute number shifts; the relative lift over a vision-only baseline does not. There is no public OmniParser-on-Claude-4.7 benchmark in the model card yet — treat that pairing as production-viable but not externally graded.
How to choose: do you actually need OmniParser?
| Your situation | Recommendation |
|---|---|
| You're building a Windows desktop agent and want it to work with any VLM | OmniParser v2.0.1 + OmniTool |
| You're on Anthropic's stack and only need to drive a desktop | Claude's built-in Computer Use tool — no parser needed |
| You only need to automate a browser, not the OS | Playwright + a VLM, or browser-specific agents like Browser Use / OpenAI's Operator |
| Your target is mobile, not desktop | OmniParser is desktop-trained; mobile UIs need a different grounder |
| You can't run a Linux/CUDA box anywhere | Use the hosted endpoint at replicate.com/microsoft/omniparser-v2 or Azure AI Foundry's microsoft-omniparser-v2-0 |
| You have to ship a closed-source binary that bundles OmniParser | The YOLO icon_detect weights are AGPL-3.0 — talk to legal before redistribution |
Common pitfalls and how to fix them
1. torch.cuda.is_available() is False
Almost always a CPU-only torch wheel got installed because requirements.txt didn't pin one. Reinstall torch from the matching CUDA index URL (see Step 3) before running the requirements file again.
2. FileNotFoundError: weights/icon_caption_florence
You forgot the rename. Run mv weights/icon_caption weights/icon_caption_florence (PowerShell: Rename-Item weights/icon_caption icon_caption_florence).
3. AttributeError: 'Florence2ForConditionalGeneration' object has no attribute '_supports_sdpa'
Tracked as issue #323 on the repo. Caused by a mismatch between the cached Florence-2 model code and the installed transformers version. Fix: pip install --upgrade transformers, then clear the HF cache for that model (~/.cache/huggingface/hub/models--microsoft--Florence-2-base/) and re-run.
4. OmniTool VM stuck at "Installing Windows"
Check that your ISO is the Enterprise Evaluation build, named exactly custom.iso, and that Hyper-V / WSL2 virtualization is enabled in BIOS. The unattended setup will silently fail on Home or N editions of the ISO.
5. Latency is 5–10× higher than the published numbers
Confirm you are running the v2 weights (not v1), confirm CUDA is active, and confirm you are not running OmniParser inside the VM. The parser server should run on the host GPU; the VM should only host the agent's "fingers."
6. Gradio demo loads but parses return empty boxes
Most often the Florence-2 captioner failed silently. Check the terminal output for transformers warnings; the parser will return an empty list rather than crashing if the captioner fails to load.
What was removed (and why)
- The original "AMD machines" command (
pip install torch==2.1.0 ... cu118) — that is a CUDA wheel and does nothing useful on AMD hardware. Replaced with the realistic WSL2 + ROCm path above. - The pinned
torch==2.1.0recommendation — it predates several Florence-2 / transformers updates. Use a current 2.x torch matching your CUDA. - The "configure
train_args.yaml" advice — the file ships from Hugging Face and should not be edited unless you are fine-tuning your own detection model.
Building this in production
If you are wiring OmniParser into a real product — an internal RPA platform, a customer-support agent, an accessibility tool — the model is the easy part. The hard parts are the safety harness around clicks, the retry/verify loop, the eval suite that catches regressions when you swap LLMs, and the auth around the parser server you just learned about in the CVE section. If your team is short an experienced agent engineer, Codersera places vetted remote developers who have shipped exactly this kind of stack.
FAQ
Is OmniParser v2 free?
Yes — the code is MIT-licensed on GitHub and the model weights are published on Hugging Face. The icon_detect (YOLOv8) weights are AGPL-3.0; the icon_caption (Florence-2) weights are MIT. Hosted endpoints on Replicate and Azure AI Foundry charge per call.
Does OmniParser work without a GPU?
It runs on CPU but slowly — multiple seconds per frame instead of sub-second. For interactive agent loops you want at least an RTX 3060 / 8 GB VRAM.
What is the difference between OmniParser and OmniTool?
OmniParser is the screen-parsing model. OmniTool is the full agent stack (parser server + Windows 11 VM + Gradio UI) that uses OmniParser to actually drive a desktop. You can use OmniParser without OmniTool; OmniTool depends on OmniParser.
Can I use OmniParser with Claude Opus 4.7 or GPT-5?
Yes. Any LLM that can take an image plus a structured tool schema and emit valid tool calls works. The OmniTool repo ships adapters for the older catalog (GPT-4o, o1, o3-mini, DeepSeek R1, Qwen 2.5VL, Sonnet 3.5/3.7), but adding a Claude 4.7 or GPT-5 adapter is a one-file change.
Should I run OmniParser inside the OmniTool VM?
No. Run the parser server on the host with the GPU; the OmniBox VM can live on the same machine or a separate one. Running the parser inside a Windows VM removes GPU access and tanks latency.
Is there an OmniParser v3?
Not as of April 2026. The latest tagged release is v2.0.1 (September 12, 2025). Microsoft Research has signaled multi-agent orchestration improvements in OmniTool but no v3 model has been announced.
How does OmniParser compare to Anthropic's Computer Use?
Anthropic's Computer Use bakes screen understanding into the model itself — you point Claude at a screenshot and it emits coordinates. OmniParser is model-agnostic: it gives you the structured element list, then any VLM can plan against it. Computer Use is simpler if you're committing to Anthropic; OmniParser is the right choice if you need to swap LLMs or want stronger small-element grounding (the 39.6% vs ~18.9% gap on ScreenSpot Pro).
What hardware do I need to hit the published latency numbers?
Microsoft cites ~0.6 s/frame on an A100 and ~0.8 s/frame on a single RTX 4090. RTX 5090 / H100 are faster; consumer cards (3060, 3080) typically land between 1.0 and 1.5 s/frame at fp16.
Related Codersera guides
- Run Microsoft OmniParser V2 on macOS — step-by-step installation guide
- Run DeepSeek Janus-Pro 7B on Windows — complete installation guide
- Run DeepSeek Janus-Pro 7B on Mac with ComfyUI
- Run DeepSeek Janus-Pro 7B on Mac — step-by-step guide
References and further reading
- microsoft/OmniParser on GitHub — source, releases, OmniTool subdirectory.
- Release v.2.0.1 (Sep 12, 2025) — the security-patched build.
- microsoft/OmniParser-v2.0 model card on Hugging Face — weights, license terms, benchmark numbers.
- "OmniParser V2: Turning Any LLM into a Computer Use Agent" — Microsoft Research — the official launch write-up with the 39.6% ScreenSpot Pro number.
- NVD: CVE-2025-55322 — the OmniParser RCE entry, CVSS 7.3.
- Microsoft MSRC advisory for CVE-2025-55322 — vendor record and remediation guidance.
- ScreenSpot-Pro benchmark repository — the benchmark OmniParser v2 is evaluated on.
- "OmniParser for Pure Vision Based GUI Agent" (arXiv 2408.00203) — the research paper.
- OmniParser V2 Hugging Face Space — try it in the browser before installing.