Qwen2.5-Omni 3B is Alibaba Cloud’s compact, multimodal AI model optimized for local deployment on consumer-grade hardware. Unlike the 7B variant, the 3B model significantly reduces VRAM usage—by more than 50%—while maintaining robust performance across text, image, audio, and video tasks.
With real-time output and simultaneous multimodal input support, Qwen2.5-Omni 3B is ideal for building local virtual assistants, media analytics tools, and interactive content engines.
This guide walks you through installing Qwen2.5-Omni 3B on Windows, including dependency management, GPU compatibility, and handling multimodal inputs.
Download Miniconda from the official site, install it, then run the following commands to create an environment and install the dependencies:
conda create -n qwen python=3.10 -y
conda activate qwen
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install sentencepiece bitsandbytes protobuf numpy einops timm pillow soundfile
pip uninstall -y transformers
pip install git+https://github.com/huggingface/transformers@v4.51.3-Qwen2.5-Omni-preview
pip install accelerate
pip install qwen-omni-utils[decord]
Note: If installation with the decord extra fails, fall back to the base package:
pip install qwen-omni-utils
Next, install FFmpeg for Windows and add its bin directory (for example C:\path\to\ffmpeg\bin) to your system PATH. Verify the installation with:
ffmpeg -version
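Before loading the model, it is worth running a quick sanity check that the GPU, FFmpeg, and the Qwen utilities are all visible from the new environment. This is a minimal sketch using only the packages installed above:
import shutil
import torch

# Confirm PyTorch sees the CUDA GPU installed via the cu121 wheels
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

# Confirm FFmpeg is on PATH and the multimodal helper imports cleanly
print("ffmpeg on PATH:", shutil.which("ffmpeg") is not None)
from qwen_omni_utils import process_mm_info  # raises ImportError if installation failed
With the environment verified, load the model in Python: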
from transformers import Qwen2_5OmniForConditionalGeneration
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-3B",
    device_map="auto"
)
Use BF16 and flash attention for lower VRAM usage:
import torch

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-3B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto"
)
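Note that attn_implementation="flash_attention_2" relies on the separate flash-attn package, which is not part of the dependency list above. On Windows you will usually want a prebuilt wheel, since building from source is slow and error-prone:
pip install flash-attn --no-build-isolation
If no wheel is available for your Python and CUDA combination, omit the attn_implementation argument and load the model as in the previous snippet.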
Set the FORCE_QWENVL_VIDEO_READER environment variable so the decord backend is used for video decoding:
set FORCE_QWENVL_VIDEO_READER=decord
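If you prefer to set this from inside Python (for example in a notebook), a minimal equivalent; it must run before the video utilities read the variable:
import os

# Equivalent of the shell command above; set before running inference
os.environ["FORCE_QWENVL_VIDEO_READER"] = "decord"
The complete example below ties everything together: it loads the model, feeds it a video URL, and generates both a text reply and spoken audio.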
import torch
import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info
# Load model and processor
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-3B",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-3B")

# Define conversation with video input
conversation = [
    {
        "role": "user",
        "content": [{"type": "video", "video": "https://example.com/sample.mp4"}]
    }
]

# Process inputs
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(
    text=text,
    audio=audios,
    images=images,
    videos=videos,
    return_tensors="pt",
    padding=True,
    use_audio_in_video=True
).to(model.device)

# Generate text and speech output
text_ids, audio = model.generate(**inputs, use_audio_in_video=True)
print(processor.decode(text_ids[0], skip_special_tokens=True))
sf.write("output.wav", audio.reshape(-1).detach().cpu().numpy(), 24000)
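The same conversation structure works for the other modalities. For example, an image plus a text prompt (the URL is a placeholder); the processing and generation steps are identical to the video example above:
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/sample.jpg"},
            {"type": "text", "text": "Describe this image."}
        ]
    }
]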
Tips and troubleshooting:
- Use bitsandbytes for 4-bit quantization (if supported) to cut VRAM usage further; a sketch follows the table below.
- Pass use_audio_in_video=False to save memory when you do not need the video's audio track.
- If you see KeyError: 'qwen2_5_omni', reinstall the Qwen2.5-Omni preview branch of transformers from the installation step above.
- For HTTP video URLs, use the decord backend, or make sure torchvision>=0.19.0 is installed as the fallback video reader.

Approximate VRAM requirements:

| Task | Qwen2.5-Omni 3B | Qwen2.5-Omni 7B |
| --- | --- | --- |
| 15s Video (BF16) | 18.38 GB* | 31.11 GB* |
| Text-Only Inference | 6–8 GB | 10–12 GB |
*Values represent minimum theoretical usage with flash attention.
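As mentioned in the tips above, 4-bit quantization via bitsandbytes can reduce VRAM usage further. The following is a minimal sketch using the standard transformers BitsAndBytesConfig; 4-bit support for Qwen2.5-Omni is not guaranteed on every setup, so treat it as an experiment rather than a drop-in replacement:
import torch
from transformers import Qwen2_5OmniForConditionalGeneration, BitsAndBytesConfig

# 4-bit NF4 quantization config (assumes bitsandbytes is installed, as in the setup step)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-3B",
    quantization_config=quant_config,
    device_map="auto"
)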
Qwen2.5-Omni 3B brings advanced multimodal AI capabilities to local setups without requiring massive infrastructure. While setup requires careful attention to dependencies and GPU specs, the model's real-time performance and flexibility make it a powerful tool for researchers and developers alike.
Need expert guidance? Connect with a top Codersera professional today!