Holo3.1: Fast, Local Computer-Use Agents — A Developer's Guide

H Company's Holo3.1 family brings computer-use agents to local and on-device inference with quantized checkpoints and four model sizes. Here's what shipped and how to deploy it.

Quick answer. Holo3.1 is H Company's June 2026 family of computer-use agent models, built on Qwen, that operate across web, desktop, and mobile. It ships in four sizes (0.8B, 4B, 9B, and a 35B-A3B flagship) and — for the first time in the Holo line — quantized FP8, NVFP4, and Q4 GGUF checkpoints so the agent can run fully locally on a Windows or Mac machine, or on a DGX Spark on the same network.

Computer-use agents — models that look at a screen and drive a real GUI by clicking, typing, and scrolling — moved from demos to production fast this year. H Company's Holo3.1, released June 2, 2026, is a direct response to what broke when teams shipped the previous Holo3 generation: performance in one environment didn't transfer to another, third-party agent frameworks behaved differently, and almost everyone wanted to run the model closer to the workflow instead of in someone else's cloud.

This guide walks through what actually changed in Holo3.1, the four model sizes, the new quantized checkpoints, the benchmark numbers H Company published, and what it takes to run one of these agents locally. If you build automation over browsers, business software, or mobile devices, this is the release worth understanding.

What is Holo3.1?

Holo3.1 is a family of computer-use agent models from H Company, built on top of the Qwen family. A computer-use agent takes a screenshot (and a task), reasons about what's on screen, and emits the next action — a click at specific coordinates, a keystroke, a scroll. Chained together, those actions let the model operate software the way a person would, without a bespoke API for every app.
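
The screenshot, reason, act cycle described above can be sketched as a small loop. This is a minimal illustration, not H Company's harness: the Action fields and the capture, predict, and execute callables are hypothetical names standing in for your screen grabber, your model call, and your input driver.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    """One GUI action. Field names are illustrative, not Holo3.1's schema."""
    kind: str                              # "click", "type", "scroll", or "done"
    xy: Optional[Tuple[int, int]] = None   # screen coordinates for a click
    text: Optional[str] = None             # text to type


def run_agent(
    task: str,
    capture: Callable[[], bytes],
    predict: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 20,
) -> List[Action]:
    """Chain single actions until the model reports "done" or the step budget runs out."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture()                       # look at the screen
        action = predict(task, screenshot, history)  # model picks the next action
        history.append(action)
        if action.kind == "done":
            break
        execute(action)                              # drive the real GUI
    return history
```

The step budget is a deliberate design choice: an agent that never emits a terminating action should fail loudly rather than click forever.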

The previous generation, Holo3, launched earlier in 2026 and saw immediate adoption across browser automation, business software, internal tools, and desktop applications. Holo3.1 keeps Holo3's state-of-the-art performance and hardens it across the three dimensions H Company says matter most in production: environments (web, desktop, mobile), agent frameworks (the harness that wraps the model), and deployment targets (from cloud inference through to fully local execution on end-user devices).

Because it's based on Qwen, Holo3.1 inherits a well-understood open-weight lineage. If you've worked with that ecosystem before, our Qwen 3.5 guide is useful background on the base model family these checkpoints derive from.

The four model sizes

Holo3.1 ships in four sizes so you can trade cost against capability instead of being forced onto a single flagship:

  • Holo3.1-0.8B — ultra-lightweight local agents.
  • Holo3.1-4B — cost-efficient deployment.
  • Holo3.1-9B — balanced performance and latency.
  • Holo3.1-35B-A3B — state-of-the-art performance.

The 35B-A3B naming reflects a mixture-of-experts-style design: a 35-billion-parameter model with roughly 3B active parameters per token, which is why it can be both the top performer and a realistic target for quantized local inference. The smaller 0.8B, 4B, and 9B sizes exist specifically to enable cost-effective and private deployment where a 35B model would be overkill or too slow.

H Company benchmarked the family's performance-versus-cost curve against the Qwen 3.5 family, averaging across OSWorld, AndroidWorld, their internal H Corporate suite, ScreenSpot-Pro, and OSWorld-G. The headline is that you can pick a size for your latency and budget envelope without falling off a cliff in capability.

Mobile, desktop, and web in one model

The clearest capability jump is on mobile. On AndroidWorld, the 35B-A3B model improves from 67% to 79.3%, and the smaller 4B and 9B variants jump from 58% to 72%. That's a meaningful gain for anyone automating Android workflows, where Holo3 previously lagged its browser and desktop performance.

This matters because mobile environments introduce their own distribution shift — different layouts, touch targets, and interaction patterns than a desktop browser. A model that's strong on the web doesn't automatically transfer to a phone screen, and Holo3.1 was explicitly trained to close that gap. If your testing or automation work touches Android, the broader landscape in our Android emulators guide and mobile app testing guide pairs well with an agent that can actually drive those environments.

Cross-harness support: function calling arrives

One of the most practical changes in Holo3.1 is native support for function-calling protocols, in addition to the structured JSON outputs already available in Holo3. The agent harness — the framework that captures the screen, sends it to the model, parses the response, and executes the action — varies a lot between teams. A model that only emits one output format forces everyone onto one integration style.

Across OSWorld and H Company's internal benchmark suite (covering e-commerce, business software, and collaboration workflows), function-calling and native execution now reach near-parity performance. In other words, you can integrate Holo3.1 into a function-calling-based agent stack without sacrificing accuracy. H Company also reports more than a 25% improvement over Holo3 when evaluated inside its own Holotab product harness.
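
To make the two output styles concrete, here is a hedged sketch of a harness that accepts either one and normalizes both into the same internal action. The exact schemas Holo3.1 emits are not given here, so both shapes are assumptions: a flat JSON object with an "action" key, and a tool call in the style many chat APIs use, where the arguments arrive as a JSON-encoded string.

```python
import json
from typing import Any, Dict


def from_structured_json(raw: str) -> Dict[str, Any]:
    """Parse a structured-JSON reply such as {"action": "click", "x": 10, "y": 20}.

    The "action" key is an assumed convention, not a documented Holo3.1 field.
    """
    payload = dict(json.loads(raw))
    name = payload.pop("action")
    return {"name": name, "args": payload}


def from_function_call(call: Dict[str, Any]) -> Dict[str, Any]:
    """Parse a tool call whose arguments are a JSON-encoded string (assumed shape)."""
    fn = call["function"]
    return {"name": fn["name"], "args": json.loads(fn["arguments"])}
```

With both parsers returning the same dictionary shape, the action executor downstream does not need to know which protocol the model used, which is what makes swapping harnesses cheap.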

If you're assembling the surrounding stack, our AI coding agents guide covers the broader landscape of agent frameworks and how the harness shapes real-world reliability.

Quantized checkpoints: FP8, NVFP4, and Q4 GGUF

Holo3.1 is the first Holo release to ship quantized weights, and this is the part most relevant to anyone who wants to self-host. H Company is starting with the 35B-A3B checkpoint, available in three formats:

  • FP8 — 8-bit floating point, a common server-side quantization.
  • NVFP4 — produced with NVIDIA's Model Optimizer in a W4A16 configuration (4-bit weights, 16-bit activations).
  • Q4 GGUF — 4-bit GGUF format aimed at local deployment on consumer hardware.

The accuracy cost is small. FP8 and NVFP4 score the same on OSWorld, about two points below the unquantized BF16 checkpoint. You're trading roughly two points of benchmark accuracy for a model that fits and runs in dramatically less memory.
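
For a rough sense of the memory side of that trade, here is a back-of-the-envelope estimate of weight storage alone. It assumes a nominal 35 billion parameters and exactly 16, 8, or 4 bits per weight. Real checkpoints differ: 4-bit formats also store scaling metadata, and the KV cache, activations, and any vision components add more on top. Note too that in a mixture-of-experts model all parameters typically need to be resident, even though only about 3B are active per token.

```python
def weight_gigabytes(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Nominal bit widths; actual files carry extra overhead.
    for label, bits in [("BF16", 16), ("FP8", 8), ("4-bit", 4)]:
        print(f"{label}: {weight_gigabytes(35, bits):.1f} GB")
```

Under those assumptions the weights come to about 70 GB at BF16, 35 GB at FP8, and 17.5 GB at 4 bits, which is the difference between needing server hardware and fitting on a well-equipped workstation.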

The speedups are where it gets interesting. On a DGX Spark, NVFP4 W4A16 delivers 1.41× the total token throughput of FP8 and 1.74× that of BF16. Combined with agent-harness optimizations H Company developed with NVIDIA, the NVFP4 path delivers a compound ~2× end-to-end speedup over the FP8 baseline, cutting average step time from 6.8 seconds to 3.3 seconds. For an interactive agent that takes many steps to finish a task, halving the per-step time changes the whole feel of the tool.
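
The figures above can be cross-checked with a little arithmetic. The sketch below uses only the published numbers. The one assumption is in the last ratio: attributing the gap between the end-to-end speedup and the raw throughput gain to harness optimizations presumes step time scales inversely with token throughput, which is a simplification.

```python
# Figures quoted above (H Company, measured on a DGX Spark).
FP8_STEP_SECONDS = 6.8
NVFP4_STEP_SECONDS = 3.3
NVFP4_OVER_FP8_THROUGHPUT = 1.41
NVFP4_OVER_BF16_THROUGHPUT = 1.74


def end_to_end_speedup() -> float:
    """Per-step speedup of the NVFP4 path over the FP8 baseline."""
    return FP8_STEP_SECONDS / NVFP4_STEP_SECONDS


def implied_fp8_over_bf16() -> float:
    """FP8 throughput relative to BF16, implied by the two NVFP4 ratios."""
    return NVFP4_OVER_BF16_THROUGHPUT / NVFP4_OVER_FP8_THROUGHPUT


def speedup_beyond_throughput() -> float:
    """Share of the end-to-end gain not explained by raw token throughput."""
    return end_to_end_speedup() / NVFP4_OVER_FP8_THROUGHPUT


def task_seconds(steps: int, step_seconds: float) -> float:
    """Wall-clock time for a task, assuming every step costs the average."""
    return steps * step_seconds
```

The step times give a ratio of roughly 2.06, consistent with the "~2×" claim. The two throughput ratios imply FP8 is about 1.23× BF16, and the remaining factor of roughly 1.46 is what the harness work would have to contribute. For a hypothetical 30-step task, that is about 204 seconds on the FP8 baseline versus about 99 seconds on the NVFP4 path.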

For the general theory of choosing a quantization format and the memory math behind it, see our self-hosting LLMs guide.

Running agents locally on consumer hardware

The Q4 GGUF checkpoints exist to push computer-use agents onto consumer machines. H Company's deployment model is worth understanding precisely: the agent itself runs locally on a Windows or Mac machine, while the model can run either on that same machine — H Company includes reference numbers for Apple Silicon — or on a DGX Spark on the same network.

In both configurations, execution stays fully private and local, with nothing leaving the user's network. That's a significant property for regulated industries, internal tooling that touches sensitive data, or anyone who simply doesn't want screenshots of their desktop flowing to a third-party API. H Company says the agent-harness optimizations described above, plus the quantization work, will land in an upcoming desktop agent harness.
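
If you wire this up yourself, it can be worth enforcing the "nothing leaves the network" property in code rather than trusting configuration. The guard below is one possible sketch using only the Python standard library. The function name and the example URLs are hypothetical, and it deliberately refuses plain hostnames other than localhost, since deciding those would require DNS resolution.

```python
import ipaddress
from urllib.parse import urlparse


def stays_on_local_network(endpoint_url: str) -> bool:
    """Return True only if the model endpoint is loopback or a private address."""
    host = urlparse(endpoint_url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name: cannot be judged without resolving it, so reject.
        return False
    return addr.is_loopback or addr.is_private


if __name__ == "__main__":
    # Hypothetical endpoints: same machine, a box on the LAN, a public API.
    for url in ["http://127.0.0.1:8000", "http://192.168.1.50:8000", "https://api.example.com"]:
        print(url, stays_on_local_network(url))
```

Calling a check like this before the harness sends its first screenshot turns a deployment promise into something a test can verify.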

If you're targeting a Mac, our Apple Silicon LLMs guide covers the practical side of running quantized models on M-series hardware, and the open-source LLMs landscape places Holo3.1 among the other open-weight options you might evaluate.

How to get started with Holo3.1

There are two supported entry points:

  • Holo Models API — managed inference at hcompany.ai/holo-models-api, the fastest way to evaluate without provisioning hardware.
  • Hugging Face — the open checkpoints, including the quantized FP8, NVFP4, and Q4 GGUF variants, are published in the Holo3.1 collection.

A reasonable evaluation path: start on the API with the 35B-A3B model to establish a quality baseline for your task, then test the smaller 4B or 9B sizes to see how far down you can go before accuracy drops below your threshold. Once you've picked a size, move to the quantized local checkpoints if privacy or per-step latency matters more than the last couple of benchmark points.
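
That evaluation path amounts to a simple search: measure the flagship, then walk up from the smallest size until one is close enough. A minimal sketch follows, assuming you supply an evaluate function (hypothetical here) that runs your own task suite against a named model and returns a success rate between 0 and 1.

```python
from typing import Callable, Sequence

# The four published sizes, smallest first.
SIZES_SMALLEST_FIRST = (
    "Holo3.1-0.8B",
    "Holo3.1-4B",
    "Holo3.1-9B",
    "Holo3.1-35B-A3B",
)


def smallest_passing_model(
    evaluate: Callable[[str], float],
    max_drop: float,
    sizes: Sequence[str] = SIZES_SMALLEST_FIRST,
) -> str:
    """Pick the smallest size whose score is within max_drop of the largest.

    evaluate is your own benchmark runner; max_drop is the accuracy loss
    you are willing to accept relative to the flagship baseline.
    """
    flagship = sizes[-1]
    baseline = evaluate(flagship)
    for name in sizes[:-1]:
        if baseline - evaluate(name) <= max_drop:
            return name
    return flagship
```

Expressing the threshold as a drop from your own baseline, rather than as an absolute score, keeps the decision tied to your workflow instead of to a public benchmark.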

Should you build on Holo3.1?

If you're shipping automation that has to work across browsers, desktop apps, and mobile — and you want the option to keep everything on-device — Holo3.1 is one of the more complete computer-use offerings available right now. The combination of four sizes, near-parity function-calling support, and genuinely usable quantized checkpoints means you're not locked into a single deployment story.

The honest caveats: the published quantized checkpoints are currently the 35B-A3B variant, the largest local-inference improvements are demonstrated on NVIDIA's DGX Spark hardware, and the desktop harness that ties it all together is described as upcoming rather than shipped. Benchmark the sizes against your workflows before committing — computer-use accuracy is highly task-dependent, and AndroidWorld or OSWorld scores are a starting signal, not a guarantee.

Frequently asked questions

What is Holo3.1?

Holo3.1 is H Company's June 2026 family of computer-use agent models, built on the Qwen family. The models operate GUIs across web, desktop, and mobile environments by reading the screen and emitting actions like clicks and keystrokes.

What model sizes does Holo3.1 come in?

Four sizes: Holo3.1-0.8B (ultra-lightweight local agents), 4B (cost-efficient), 9B (balanced performance and latency), and 35B-A3B (state-of-the-art performance).

Can Holo3.1 run locally?

Yes. It's the first Holo release with quantized checkpoints — FP8, NVFP4, and Q4 GGUF for the 35B-A3B model. The agent runs on a Windows or Mac machine, and the model can run on that same machine (with Apple Silicon reference numbers provided) or on a DGX Spark on the same network, keeping execution fully private.

How much faster is the quantized version?

On a DGX Spark, NVFP4 W4A16 delivers 1.41× the total token throughput of FP8 and 1.74× that of BF16. With harness optimizations, the NVFP4 path reaches a compound ~2× end-to-end speedup over FP8, cutting average step time from 6.8s to 3.3s.

Does quantization hurt accuracy?

Only slightly, per H Company's numbers. FP8 and NVFP4 score the same on OSWorld, about two points below the unquantized BF16 checkpoint.

How much better is Holo3.1 at mobile automation than Holo3?

On AndroidWorld, the 35B-A3B model improves from 67% to 79.3%, while the 4B and 9B variants improve from 58% to 72%.

Where can I get Holo3.1?

Through the Holo Models API at hcompany.ai/holo-models-api, or as open checkpoints (including the quantized variants) in the Holo3.1 collection on Hugging Face.