Last updated: May 1, 2026.
Microsoft OmniParser V2 is a vision-based screen parser that turns a UI screenshot into structured, LLM-readable elements (bounding boxes plus icon captions). Pair it with a vision LLM and you have the perception layer for a "computer-use" agent. This guide is a clean, current install path for Ubuntu 22.04 LTS and 24.04 LTS — including the CUDA stack, the v2.0.1 security patch, and the operational hardening every team should now apply after CVE-2025-55322.
What changed in this 2026 refresh: Updated install guidance for current CUDA 12.x driver branches and refreshed planner-LLM pairings for 2026, while preserving the CVE-2025-55322 hardening steps.
What changed since the original 2025 guide
- Security: Microsoft shipped OmniParser v2.0.1 in September 2025 to address CVE-2025-55322, an OmniTool VM-controller RCE. Always pull v2.0.1 or later, and never expose the controller on a public IP.
- Repo state: The model is still `microsoft/OmniParser-v2.0` on Hugging Face; the icon-caption directory is still expected to be renamed to `icon_caption_florence` after download. That step has not changed.
- Drivers: Ubuntu 24.04 is fully supported by NVIDIA's CUDA APT keyring; the older 22.04 path still works, but you must match a 535+ or 550+ driver branch for Ada / Hopper GPUs.
- Compatible LLMs (2026): OmniParser is model-agnostic at the planner layer. Practical pairings now include GPT-5-class and o-series reasoning models, Claude 4.x with native computer use, DeepSeek V3/R1 successors, and Qwen-VL 2.5+, but OmniParser itself does not bundle any of them.
- Licensing nuance: `icon_detect` (YOLOv8 finetune) is AGPL; `icon_caption` (Florence-2 finetune) is MIT. AGPL on the detector matters if you embed OmniParser in a hosted product.
TL;DR
| Question | Answer |
|---|---|
| Latest version | OmniParser v2.0.1 (Sept 2025 security release on top of v2.0.0, Feb 2025) |
| Where to clone | github.com/microsoft/OmniParser |
| Weights | microsoft/OmniParser-v2.0 on Hugging Face |
| Python | 3.12 (matches the upstream conda recipe) |
| Ubuntu | 22.04 LTS or 24.04 LTS, with CUDA 12.x and a 535+ / 550+ NVIDIA driver |
| Typical latency | ~0.6 s/frame on A100, ~0.8 s/frame on a single RTX 4090 |
| Benchmark | OmniParser V2 + GPT-4o: 39.6 avg accuracy on ScreenSpot Pro (vs. 0.8 baseline) |
| Critical CVE | CVE-2025-55322 (OmniTool controller RCE) — fixed in v2.0.1 |
What OmniParser actually does
OmniParser is two finetuned models stitched into one pipeline:
- icon_detect — a YOLOv8 finetune trained on a curated, web-scraped dataset of clickable/interactable regions. Output: bounding boxes for actionable UI elements.
- icon_caption — a Florence-2 base finetune that captions each detected element with a short functional description ("Submit button", "open file dialog", "tab close").
The combined output is a structured list of (bounding_box, semantic_label) pairs that any vision LLM can reason over. That converts an open-ended "click the right pixel" problem into a discrete-choice one — which is why GPT-4o went from 0.8 to 39.6 average on ScreenSpot Pro once OmniParser sat in front of it.
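To make the discrete-choice framing concrete, here is a minimal sketch. The element structure below is illustrative, not OmniParser's exact output schema: instead of asking the planner to pick a pixel, you hand it a numbered menu of captioned boxes and ask for an index.

```python
# Hypothetical shape of OmniParser's output: each element is a bounding box
# (normalised coordinates) plus a short functional caption.
elements = [
    {"bbox": [0.71, 0.05, 0.78, 0.09], "caption": "tab close"},
    {"bbox": [0.40, 0.82, 0.60, 0.88], "caption": "Submit button"},
]

def to_discrete_choices(elements):
    """Render parsed elements as a numbered menu for the planner LLM."""
    lines = []
    for i, el in enumerate(elements):
        x1, y1, x2, y2 = el["bbox"]
        lines.append(f"[{i}] {el['caption']} @ ({x1:.2f},{y1:.2f})-({x2:.2f},{y2:.2f})")
    return "\n".join(lines)

print(to_discrete_choices(elements))
```

The planner then answers with an index ("click element 1"), and your executor maps that index back to a bounding-box centre, which is a far easier target than a raw pixel coordinate.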
Ubuntu prerequisites
Hardware
- GPU: any modern NVIDIA card with ≥8 GB VRAM works; a single RTX 4090 hits ~0.8 s/frame, an A100 hits ~0.6 s/frame. CPU-only is technically possible but impractical for interactive agent loops.
- Disk: ~5 GB for the weights + dependencies; more if you add OmniTool's Windows VM.
- RAM: 16 GB is the practical floor.
Where things land on Ubuntu
- Conda envs: `~/miniconda3/envs/omni/` (or `~/anaconda3/envs/omni/`).
- NVIDIA driver: kernel modules under `/lib/modules/$(uname -r)/kernel/drivers/video/`; `nvidia-smi` on PATH.
- CUDA toolkit: `/usr/local/cuda-12.x/` with a `/usr/local/cuda` symlink.
- OmniParser source: wherever you cloned it (this guide uses `~/OmniParser/`).
- Weights: `~/OmniParser/weights/`.
Installing the NVIDIA + CUDA stack
On Ubuntu 22.04/24.04, prefer NVIDIA's official APT keyring over the distro's nvidia-driver-* metapackages — it gives you the matching CUDA runtime alongside a known-good driver branch.
# Ubuntu 24.04 example — swap "ubuntu2404" for "ubuntu2204" on 22.04
sudo apt-get update
sudo apt-get install -y build-essential dkms linux-headers-$(uname -r)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get install -y cuda-toolkit-12-6 cuda-drivers
sudo reboot
After reboot, confirm:
nvidia-smi # driver + GPU visible
nvcc --version # CUDA 12.x toolkit on PATH (add /usr/local/cuda/bin if not)
If nvcc isn't found, append to ~/.bashrc:
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
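If you want to confirm the same thing from Python (useful inside the conda env you'll create below), a small cross-check of what is actually resolvable on PATH, purely illustrative:

```python
import shutil

def check_cuda_tools(tools=("nvidia-smi", "nvcc")):
    """Map each expected CLI tool to its resolved path (None if not on PATH)."""
    return {tool: shutil.which(tool) for tool in tools}

for tool, path in check_cuda_tools().items():
    print(f"{tool}: {path or 'NOT FOUND; check PATH and the driver install'}")
```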
Miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3
source $HOME/miniconda3/bin/activate
conda init bash
Step-by-step OmniParser V2 install
1. Clone the repo (pin to v2.0.1 or newer)
git clone https://github.com/microsoft/OmniParser.git
cd OmniParser
git fetch --tags
git checkout v.2.0.1 # security release; do NOT stay on v.2.0.0
2. Create the conda environment
conda create -n omni python=3.12 -y
conda activate omni
3. Install Python dependencies
pip install --upgrade pip
pip install -r requirements.txt
If requirements.txt pulls a generic torch, you may want to replace it with a CUDA-matched build, e.g.:
pip install --index-url https://download.pytorch.org/whl/cu124 \
torch torchvision torchaudio
4. Download the v2.0 weights from Hugging Face
The repo expects two subdirectories under weights/: icon_detect/ (YOLOv8 finetune) and icon_caption_florence/ (renamed from icon_caption/ after download). This rename step still applies.
pip install -U "huggingface_hub[cli]"
for f in icon_detect/{train_args.yaml,model.pt,model.yaml} \
icon_caption/{config.json,generation_config.json,model.safetensors}; do
huggingface-cli download microsoft/OmniParser-v2.0 "$f" --local-dir weights
done
mv weights/icon_caption weights/icon_caption_florence
Verify the layout:
find weights -maxdepth 2 -type f
# weights/icon_detect/model.pt
# weights/icon_detect/model.yaml
# weights/icon_detect/train_args.yaml
# weights/icon_caption_florence/config.json
# weights/icon_caption_florence/generation_config.json
# weights/icon_caption_florence/model.safetensors
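If you'd rather script that check (for CI or a setup script), a minimal Python equivalent of the `find` above, assuming exactly the six-file layout this guide expects:

```python
from pathlib import Path

# Expected layout per the download step above (after the icon_caption rename).
EXPECTED = [
    "icon_detect/model.pt",
    "icon_detect/model.yaml",
    "icon_detect/train_args.yaml",
    "icon_caption_florence/config.json",
    "icon_caption_florence/generation_config.json",
    "icon_caption_florence/model.safetensors",
]

def missing_weight_files(root="weights"):
    """Return the expected weight files that are absent under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED if not (base / rel).is_file()]

if missing_weight_files():
    print("Missing:", missing_weight_files())
```

An empty list means the loader should find everything; a non-empty list almost always means the `icon_caption_florence` rename was skipped.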
5. Sanity-check with the Gradio demo
python gradio_demo.py
The demo binds to 127.0.0.1 by default. Keep it that way on a shared box — see the security section below.
6. Programmatic use via demo.ipynb
pip install jupyterlab
jupyter lab --no-browser --port 8888
Open demo.ipynb, point it at any PNG screenshot, and inspect the structured output: a list of bounding boxes with captions. That's the input you'd feed to a planner LLM.
Security: CVE-2025-55322 and how to deploy safely
In September 2025 Microsoft published CVE-2025-55322, an RCE in the OmniTool VM controller. The controller exposed an unauthenticated HTTP execute interface; any attacker with network reach to that port could send arbitrary commands to the GUI agent. Microsoft scored it CVSS 7.3 and shipped OmniParser v2.0.1 with security updates and tightened defaults.
Operational rules of thumb:
- Never bind the controller or Gradio demo to `0.0.0.0` on a host with a public IP. Keep it on `127.0.0.1` and reach it via SSH port-forward.
- Put it behind authn: if you must expose it, front it with an authenticating reverse proxy (Caddy plus basic auth at minimum, mTLS if you're serious).
- Constrain the action surface: rather than letting the agent shell out, restrict it to a high-level DSL (click / type / open / wait) and validate inputs.
- Log every action: v2.0.x added local trajectory logging. Turn it on; it is your forensic trail if a planner goes off the rails.
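The "constrain the action surface" rule can be sketched as an allow-list validator. The verb names and argument rules below are assumptions for illustration, not an OmniParser or OmniTool API; the point is that anything outside the allow-list is rejected before it reaches the VM:

```python
# Illustrative allow-list for a constrained agent-action DSL.
ALLOWED = {
    "click": {"x", "y"},
    "type": {"text"},
    "open": {"app"},
    "wait": {"seconds"},
}

def validate_action(action: dict) -> bool:
    """Accept an action only if its verb and argument names match the allow-list."""
    name = action.get("name")
    if name not in ALLOWED:
        return False
    args = action.get("args", {})
    if set(args) != ALLOWED[name]:
        return False
    if name == "click":
        # Normalised coordinates only; rejects off-screen or malformed clicks.
        return all(isinstance(v, (int, float)) and 0 <= v <= 1 for v in args.values())
    return True

print(validate_action({"name": "click", "args": {"x": 0.5, "y": 0.2}}))  # True
print(validate_action({"name": "exec", "args": {"cmd": "rm -rf /"}}))    # False
```

This is exactly the shape of check that would have blunted CVE-2025-55322: an `exec` verb simply does not exist in the vocabulary, so there is nothing for an attacker to reach.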
Performance and benchmarks
| Setup | Latency | ScreenSpot Pro avg accuracy |
|---|---|---|
| OmniParser V2 on A100 | ~0.6 s/frame | — |
| OmniParser V2 on RTX 4090 | ~0.8 s/frame | — |
| GPT-4o (no parser) | — | 0.8 |
| OmniParser V2 + GPT-4o | — | 39.6 |
Numbers are from the OmniParser V2 model card and Microsoft Research's V2 announcement. Independent reproductions on consumer GPUs land in the same neighbourhood; the 60% latency win over V1 is the most consistently reported result.
How to choose: do you actually need OmniParser?
- Use OmniParser when your agent must work across arbitrary apps, including legacy desktop software with no DOM and no accessibility tree, and you want a single perception layer that's model-agnostic on the planner side.
- Skip OmniParser when the target is a web app you control: a DOM walker plus Playwright is faster, cheaper and more reliable than vision parsing.
- Pair with native computer-use APIs (Anthropic Claude, OpenAI o-series tools) when you want the planner to handle clicks directly; OmniParser then becomes the "what's on screen?" oracle the planner consults.
OmniTool on Ubuntu (when you need it)
OmniTool is the dockerised Windows-11 VM that ships with OmniParser preinstalled, intended for repeatable agent eval. On an Ubuntu host:
- Use KVM/QEMU + libvirt for the VM rather than VirtualBox — KVM gets you near-native CPU and reasonable GPU passthrough on supported hardware.
- Or use VMware Workstation Pro (free for personal use since 2024) if libvirt is overkill.
- The OmniTool controller is the component covered by CVE-2025-55322 — pin it to the host-only network and never to a bridged interface unless you've put authn in front.
Common pitfalls and troubleshooting
- `RuntimeError: CUDA error: no kernel image is available for execution`: your installed PyTorch wheel doesn't match your CUDA driver. Reinstall torch from the `cu121`/`cu124` index that matches `nvidia-smi`.
- `icon_caption` not found: you forgot the `mv weights/icon_caption weights/icon_caption_florence` rename. The loader hard-codes the `_florence` suffix.
- HF download stalls: set `HF_HUB_ENABLE_HF_TRANSFER=1` after `pip install hf_transfer` for an order-of-magnitude speedup on the safetensors file.
- Driver version mismatch after an Ubuntu kernel update: re-run `sudo apt-get install --reinstall cuda-drivers` and reboot. DKMS will rebuild the modules against the new headers.
- Gradio port already in use: pass `--server-port 7861` or kill the previous process; don't move it to `0.0.0.0` as a workaround.
- ScreenSpot scores far below 39.6: you're probably comparing different planner LLMs. The 39.6 figure is specifically OmniParser V2 + GPT-4o; pairing with weaker models drops it sharply.
FAQ
Is OmniParser V2 free for commercial use?
Partly. The icon_caption Florence-2 finetune is MIT-licensed. The icon_detect YOLOv8 finetune is AGPL, which has copyleft implications for hosted services. Have your legal team review before you bake it into a SaaS product.
Does it work on Ubuntu 22.04, or do I need 24.04?
Both work. 24.04 is the default in NVIDIA's current installer guidance and avoids a few kernel-header gotchas with newer Ada/Hopper GPUs. 22.04 is fine if you're already on it.
Can I run OmniParser without a GPU?
You can — it's still YOLOv8 + Florence-2 — but expect multi-second per-frame latency, which is too slow for interactive agent loops. CPU-only is fine for offline screenshot analysis.
Which LLMs pair well with OmniParser in 2026?
OmniParser is model-agnostic. Frontier closed models (GPT-5-class, Claude 4.x, Gemini 2.x) and strong open models (DeepSeek V3-line, Qwen-VL 2.5+) all work; the planner LLM only sees OmniParser's structured output, not raw pixels. Your choice should turn on cost, latency, and whether you need on-prem inference.
Is the icon_caption_florence rename still required?
Yes — as of v2.0.1 the loader still expects weights/icon_caption_florence/. The Hugging Face repo serves the directory as icon_caption, so the post-download rename is mandatory.
How does OmniParser V2 compare to V1?
V2 is roughly 60% lower latency, has a cleaner icon-caption dataset, and is the first version with a published ScreenSpot Pro number worth quoting (39.6 with GPT-4o). There is no reason to start a new project on V1.
Where do I report a vulnerability?
Microsoft Security Response Center (MSRC). CVE-2025-55322 was disclosed and fixed via that channel.
Need engineers who can ship agent stacks like this?
Stitching a vision parser, a planner LLM, a sandboxed VM, and the security plumbing around CVE-2025-55322 is real work — most teams underestimate the integration and the operational hardening. Codersera places vetted remote engineers with hands-on experience in computer-use agents, CUDA infrastructure, and ML platform security. If you'd rather extend your team than fight the install, we can usually have a developer trialled inside a week.
Related Codersera walkthroughs:
- Run Microsoft OmniParser V2 on macOS
- Run Microsoft OmniParser V2 on Windows
- Run DeepSeek Janus-Pro 7B on Mac via ComfyUI
References and further reading
- microsoft/OmniParser on GitHub — source, issues, releases (including v.2.0.1).
- OmniParser v.2.0.0 release notes — the V2 baseline.
- microsoft/OmniParser-v2.0 model card on Hugging Face — weights, license details, latency numbers.
- "OmniParser V2: Turning Any LLM into a Computer Use Agent" — Microsoft Research.
- CVE-2025-55322 OmniParser RCE — vendor advisory recap.
- "When a GUI Agent's Control Plane Becomes a Remote Control Surface" — independent CVE-2025-55322 analysis.
- "OmniParser for Pure Vision Based GUI Agent" (arXiv 2408.00203) — the original paper.
- NVIDIA CUDA Installation Guide for Linux — current Ubuntu 22.04 / 24.04 driver and toolkit instructions.