Qwen3.5 Omni Plus is Alibaba’s newest multimodal AI model. It works with text, images, audio, and video in one unified system.
The Plus tier targets high‑quality reasoning and audio tasks, where it reaches state‑of‑the‑art results on many benchmarks.
It competes directly with models like GPT‑4o and Gemini 3.1 Pro in 2026.
This guide explains what it is, how to start using it, and how it compares to other large models.
Qwen3.5 Omni Plus is a large “omni‑modal” model from Alibaba’s Qwen team.
Omni‑modal means the same model handles text, images, audio, and video instead of using separate models for each type.
It builds on earlier Qwen2.5‑Omni and Qwen3‑Omni models, which already showed strong results on multimodal benchmarks.
The Plus variant in the Qwen3.5‑Omni family is tuned for high accuracy and strong audio and audio‑visual understanding.
The Qwen3.5‑Omni family ships in three size tiers: Plus, Flash, and Lite.
All three support a long context window; reviewers report 256K tokens for Qwen3.5‑Omni models, which can hold over 10 hours of audio or hundreds of seconds of 720p video with audio.
The Plus tier is the main “flagship” for quality, while Flash focuses on lower latency and cost, and Lite targets edge and on‑device scenarios.
Qwen3.5 Omni Plus is available through cloud APIs, such as Alibaba Cloud’s Model Studio and third‑party gateways that expose Qwen “Plus” class models.
There is also a hosted demo on Hugging Face Spaces that lets you try Qwen3.5 Omni through a browser interface with optional multimodal input.
This guide covers two practical paths: the Alibaba Cloud API and the browser demo. The API walkthrough below uses a typical “Qwen Plus” style endpoint; exact endpoint names vary across providers, but the ideas stay the same.
Many providers expose Qwen Plus‑class models with a chat API similar to OpenAI‑style or OpenRouter‑style JSON requests.
Example JSON structure (conceptual, not tied to one provider):
```json
POST /v1/chat/completions
{
  "model": "qwen3.5-omni-plus",
  "messages": [
    {"role": "user", "content": "Describe this product image in plain English."}
  ],
  "images": [
    {"url": "https://example.com/product.jpg"}
  ]
}
```
The server returns a JSON response with a text field that contains the model’s answer, for example a short description of the product image.
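Assuming an OpenAI-compatible gateway like the one sketched above, a minimal Python client could build and send that request as follows. The endpoint URL and API key are placeholders, not real provider values; substitute your own.

```python
import json
import urllib.request

def build_payload(prompt, image_url):
    """Build the chat request shown above as a Python dict."""
    return {
        "model": "qwen3.5-omni-plus",
        "messages": [{"role": "user", "content": prompt}],
        "images": [{"url": image_url}],
    }

payload = build_payload(
    "Describe this product image in plain English.",
    "https://example.com/product.jpg",
)

# Uncomment and fill in a real endpoint and API key to send the request:
# req = urllib.request.Request(
#     "https://your-gateway.example.com/v1/chat/completions",
#     data=json.dumps(payload).encode("utf-8"),
#     headers={"Content-Type": "application/json",
#              "Authorization": "Bearer YOUR_API_KEY"},
# )
# with urllib.request.urlopen(req) as resp:
#     answer = json.loads(resp.read())
```

The text answer then sits in the response JSON, typically under a field like the message content of the first choice, though the exact shape depends on the gateway.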
The official Qwen3‑Omni GitHub repository shows how to run Omni models with the Transformers library.
The same pattern should work for Qwen3.5 Omni models once weights are available, with only the model name changed.
Key steps from the Qwen3‑Omni example:
transformers.generate on the model to get both text and optional audio output.A simplified outline based on the official demo looks like this:
```python
from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info  # helper from the Qwen Omni repository

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained("Qwen/Qwen3-Omni")
processor = Qwen3OmniMoeProcessor.from_pretrained("Qwen/Qwen3-Omni")

conversation = [
    {"role": "user", "content": [
        {"type": "text", "text": "What can you see and hear?"},
        {"type": "image", "image": "path/to/frame.png"},
        {"type": "audio", "audio": "path/to/audio.wav"},
    ]}
]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)

inputs = processor(
    text=text,
    audio=audios,
    images=images,
    videos=videos,
    return_tensors="pt",
    padding=True,
    use_audio_in_video=True,
)
inputs = inputs.to(model.device).to(model.dtype)

text_ids, audio = model.generate(
    **inputs,
    speaker="Ethan",
    thinker_return_dict_in_generate=True,
    use_audio_in_video=True,
)
```
The processor then decodes text_ids back into natural language, and you can save the audio output as a .wav file.
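The official demos use the soundfile library to write the waveform. If you prefer to avoid the extra dependency, a standard-library sketch works too, assuming the model returns a mono float waveform with values in [-1, 1] at 24 kHz (the sample rate the Qwen demos use; check your model's actual output format).

```python
import wave
import struct

def save_wav(samples, path, rate=24000):
    """Write a mono float waveform (values in [-1, 1]) as 16-bit PCM."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 16-bit samples
        f.setframerate(rate)
        # Clamp and scale each float sample to the int16 range.
        pcm = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        f.writeframes(pcm)

# Example: one second of silence.
save_wav([0.0] * 24000, "output.wav")
```

In the real pipeline you would pass the generated `audio` tensor (flattened to a Python list or NumPy array) instead of the silent placeholder.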
Reviewers report that Qwen3.5‑Omni‑Plus supports real‑time streaming voice, with control over style and turn‑taking.
In practice, this usually means the model begins returning audio chunks while it is still generating, so playback can start before the full response is finished, and options such as voice style or turn-taking behavior are passed as request parameters. Exact streaming APIs depend on your provider, but they follow this general pattern.
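Many OpenAI-compatible gateways deliver streamed responses as server-sent events, one `data:` line per chunk. The `data:` framing and `[DONE]` sentinel below are assumptions about that common shape, not a documented Qwen contract; a minimal parser sketch looks like this:

```python
import json

def parse_sse(lines):
    """Yield decoded JSON payloads from an SSE-style stream of text lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip comments and blank keep-alive lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break     # end-of-stream sentinel used by many gateways
        yield json.loads(payload)

# Example with a canned stream instead of a live HTTP response:
chunks = [
    'data: {"delta": "Hel"}',
    'data: {"delta": "lo"}',
    "data: [DONE]",
]
text = "".join(c["delta"] for c in parse_sse(chunks))
# text == "Hello"
```

With a live connection you would iterate over the HTTP response body line by line and feed those lines into the same parser.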
Public reports and technical write-ups give several concrete benchmark numbers for the Qwen3 and Qwen3.5 Omni families, with the Plus tier at the top. Taken together, these results show that the Qwen3 / Qwen3.5 Omni models, and especially the Plus tier, are very strong in audio, audio-visual understanding, coding, and document processing relative to current proprietary models.
The Qwen team and independent reviewers use a mix of public benchmarks and custom suites.
This table compares Qwen3.5 Omni Plus with three other widely discussed multimodal models: GPT‑4o, Gemini 3.1 Pro, and Qwen2.5‑Omni.
| Model | Provider | Modalities (native) | Context Window (tokens) | Open Weights | Strength Areas (based on public data) |
|---|---|---|---|---|---|
| Qwen3.5 Omni Plus | Alibaba / Qwen | Text, image, audio, video | ~256K reported for Qwen3.5‑Omni | Partially (family) | Audio and audio‑visual tasks, document understanding, coding, multilingual voice interaction |
| GPT‑4o | OpenAI | Text, image, audio, video | 128K via API | No | General chat, coding, reasoning, broad ecosystem and integrations |
| Gemini 3.1 Pro | Google DeepMind | Text, image, audio, video, PDFs, code repos | 1M tokens context on Vertex AI | No | Long‑context reasoning, large document and code repository analysis |
| Qwen2.5‑Omni | Alibaba / Qwen | Text, image, audio, video | Varies by deployment; designed for smaller models | Yes | Strong open multimodal baseline, good speech instruction following and multimodal understanding vs other open models |
Pricing for Qwen3.5 Omni Plus will depend on the provider, but public listings for Qwen “Plus” class models give a clear band: roughly 0.32–0.40 USD per 1M input tokens and 0.96–1.20 USD per 1M output tokens in late 2025 and early 2026. Qwen3.5-Omni-Plus should fall near this range once it is fully listed across providers.
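As a quick sanity check on those numbers, here is a small cost estimator using the upper end of that band. The per-token prices are the reported range above, not an official price list.

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price=0.40, output_price=1.20):
    """Estimate USD cost given token counts and per-1M-token prices."""
    return (input_tokens / 1_000_000) * input_price + \
           (output_tokens / 1_000_000) * output_price

# A session with 2M input tokens and 500K output tokens:
cost = estimate_cost(2_000_000, 500_000)
# cost == 0.40 * 2 + 1.20 * 0.5 == 1.40
```

For a realistic budget, swap in the exact prices from your provider's listing, since discounts and free quotas shift the effective rate.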
Qwen3.5 Omni Plus focuses on native multimodal audio and audio‑visual performance rather than only text benchmarks.
It offers an omni‑modal architecture that handles speech, sound, images, and video in a single model and reaches state‑of‑the‑art results across many audio benchmarks, while also leading document understanding benchmarks like OmniDocBench v1.5.
In addition, it comes from an ecosystem that releases many open weights, including earlier Qwen2.5‑Omni and Qwen3‑Omni variants, which helps teams build customized or on‑premise solutions around the same family of models.
In short, GPT-4o and Gemini 3.1 Pro offer broader ecosystems and, in Gemini's case, a longer context window, while Qwen3.5 Omni Plus stands out for audio-first multimodal depth and the open-weight family built around it.
Use case: Multilingual meeting assistant with audio and slides
Goal: Use Qwen3.5 Omni Plus to transcribe, translate, and summarize a recorded meeting that includes speech in more than one language and a slide deck shared on screen.
The basic flow is to send the recording (for example, meeting.mp4) together with a prompt asking for a transcript, translations into one common language, extraction of the on-screen slide text, and a final summary. This workflow shows how one Omni model can replace separate systems for transcription, translation, slide OCR, and summarization, which reduces integration work for teams.
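A request payload for this workflow might look like the following sketch. The `videos` field name and overall gateway schema are assumptions for illustration; some providers embed media references inside the message content instead, so check your provider's docs.

```python
import json

def build_meeting_request(video_url, target_language="English"):
    """Build a hypothetical chat request asking one Omni model to do the full job."""
    prompt = (
        f"Transcribe all speech, translate everything into {target_language}, "
        "extract the text from any slides shown on screen, and finish with a "
        "five-bullet summary of the meeting."
    )
    return {
        "model": "qwen3.5-omni-plus",
        "messages": [{"role": "user", "content": prompt}],
        # Field name "videos" is an assumption, mirroring the "images" field
        # in the earlier example.
        "videos": [{"url": video_url}],
    }

request = build_meeting_request("https://example.com/meeting.mp4")
body = json.dumps(request)
```

A single request like this replaces what would otherwise be four separate service calls.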
Qwen3.5 Omni Plus extends the Qwen family into a strong, audio‑first multimodal model that competes directly with GPT‑4o and Gemini 3.1 Pro.
Its strengths are clear in audio understanding, audio‑visual interaction, and document processing benchmarks, where it often leads current frontier models.
1. Is Qwen3.5 Omni Plus open source?
The full Qwen3.5‑Omni‑Plus model is served through cloud APIs, but related Omni models and earlier Qwen versions have open weights on GitHub and Hugging Face.
2. How is Qwen3.5 Omni Plus different from Qwen2.5‑Omni?
Qwen3.5 Omni builds on Qwen2.5‑Omni with better audio, audio‑visual, and document benchmarks and a focus on long‑context multimodal agents.
3. Does Qwen3.5 Omni Plus support real‑time speech?
Yes, public demos and reviews describe low‑latency streaming speech output with fine control over emotion and voice style, plus planned voice cloning.
4. How does it compare to GPT‑4o on benchmarks?
On family‑level benchmarks like MMMU, HumanEval, LibriSpeech, and OmniDocBench, Qwen Omni models often match or beat GPT‑4o on reasoning, coding, audio, and document tasks.
5. How much does Qwen3.5 Omni Plus cost to use?
Pricing depends on the provider, but Qwen Plus‑class models typically cost around 0.32–0.40 USD per 1M input tokens and 0.96–1.20 USD per 1M output tokens, with some free quotas and discounts available.