5 min to read
Running the Qwen3 Next 80B A3B model—a cutting-edge large language AI—on Windows is now achievable thanks to WSL2, NVIDIA GPU acceleration, and Docker containerization.
This guide provides a step-by-step walkthrough to install, configure, and run Qwen3 Next 80B A3B on Windows 11, including practical examples, API usage, optimizations, and troubleshooting tips.
Qwen3 Next 80B A3B is an instruction-optimized, sparse Mixture-of-Experts (MoE) LLM developed by Alibaba AI Research. Despite having 80 billion parameters, it activates only ~3B per inference, allowing high throughput while using fewer resources. Key features include:
Running this model locally requires NVIDIA GPUs with CUDA support, WSL2, and proper environment setup.
Metric | Qwen3 Next 80B A3B | Dense 70B Model | GPT-4-32K |
---|---|---|---|
Inference Tokens/sec (TP=1) | 1,200 | 450 | 300 |
VRAM Usage (FP8) | 48 GB | 115 GB (FP16) | 80 GB |
Avg. Latency per 1K tokens | 0.8 s | 2.5 s | 3.2 s |
Zero-Shot Accuracy (MMLU) | 78.5% | 75.0% | 76.2% |
Tested on RTX 4090 Ti, CUDA 12.1, vLLM 0.10.2.
# Enable WSL and Virtual Machine Platform
wsl --install
dism.exe /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart
dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart
# Set default WSL version
wsl --set-default-version 2
sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential curl git python3 python3-pip
nvidia-smi
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121
# Verify GPU
nvidia-smi
# Add NVIDIA Docker repository
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt update
sudo apt install -y nvidia-docker2
sudo systemctl restart docker
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt update
sudo apt install -y nvidia-docker2
sudo systemctl restart docker
docker run --rm --gpus all nvidia/cuda:12.1-base nvidia-smi
pip3 install --upgrade pip
pip3 install transformers vllm>=0.10.2 flashinfer>=0.3.1
pip3 install --upgrade pip
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121
pip3 install vllm flashinfer transformers
docker run -d --gpus all --name qwen3next80b -p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface:rw \
-e HF_HOME=/root/.cache/huggingface \
vllm/vllm-openai:v0.10.2 \
--model Qwen/Qwen3-Next-80B-A3B-Instruct \
--host 0.0.0.0 --port 8000 \
--async-scheduling --tensor-parallel-size=4 \
--trust-remote-code
mkdir -p ~/models/qwen3 && cd ~/models/qwen3
git clone https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct
Notes:
tensor-parallel-size
for multiple GPUs.http://localhost:8000
.docker run -d --gpus all --name qwen3next80b -p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e HF_HOME=/root/.cache/huggingface \
vllm/vllm-openai:v0.10.2 \
--model Qwen/Qwen3-Next-80B-A3B-Instruct \
--host 0.0.0.0 --port 8000 \
--async-scheduling --tensor-parallel-size=4 \
--trust-remote-code
pip install 'sglang[all]>=0.5.2'
sglang launch \
--model-path ~/models/qwen3/Qwen3-Next-80B-A3B-Instruct \
--port 8080 \
--max-context-length 262144 \
--speculative-algo NEXTN \
--speculative-num-steps 3 \
--speculative-num-draft-tokens 4
--tensor-parallel-size
to match GPU count.--speculative-config
.FLASHINFER_USE_CUDA_GRAPH=1
for CUDA graph optimization.# Extract entities
entities = generate(model, tokenizer, prompt="Extract key entities from this text: ...")
# Generate summary
summary = generate(model, tokenizer, prompt=f"Summarize these entities: {entities}")
pip3 install transformers[vision]
from transformers import AutoProcessor, AutoModelForVision2Seq
processor = AutoProcessor.from_pretrained("Your/VisionAdapter-Qwen3Next")
model = AutoModelForVision2Seq.from_pretrained("Your/VisionAdapter-Qwen3Next")
inputs = processor(images=pil_image, text="Describe this scene:", return_tensors="pt")
outputs = model.generate(**inputs)
print(processor.decode(outputs[0], skip_special_tokens=True))
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{ "model":"Qwen3-Next-80B-A3B-Instruct", "messages":[{"role":"user","content":"Identify risks in this contract: [contract text]"}] }'
from vllm import Client
client = Client("http://localhost:8000")
response = client.chat(
model="Qwen3-Next-80B-A3B-Instruct",
messages=[{"role":"user","content":"Review this Python function and suggest improvements:\n``````"}]
)
print(response.choices[0].message.content)
curl -X POST "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-Next-80B-A3B-Instruct",
"messages": [
{"role": "user", "content": "Explain the benefits of using WSL2 for AI model deployment."}
]
}'
import requests
url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
"model": "Qwen/Qwen3-Next-80B-A3B-Instruct",
"messages": [{"role": "user", "content": "Summarize advantages of sparse MoE models."}]
}
response = requests.post(url, headers=headers, json=data)
print(response.json()["choices"][0]["message"]["content"])
Run:
python3 query_qwen.py
nvidia-smi
and CUDA compatibility.Running Qwen3 Next 80B A3B on Windows 11 is now practical with WSL2, Docker, and NVIDIA GPU acceleration. With FP8 quantization, sparse MoE architecture, and extended context support, you can deploy large-scale, instruction-optimized AI models locally for research, NLP, multi-modal projects, and advanced chained-task pipelines.
Connect with top remote developers instantly. No commitment, no risk.
Tags
Discover our most popular articles and guides
Running Android emulators on low-end PCs—especially those without Virtualization Technology (VT) or a dedicated graphics card—can be a challenge. Many popular emulators rely on hardware acceleration and virtualization to deliver smooth performance.
The demand for Android emulation has soared as users and developers seek flexible ways to run Android apps and games without a physical device. Online Android emulators, accessible directly through a web browser.
Discover the best free iPhone emulators that work online without downloads. Test iOS apps and games directly in your browser.
Top Android emulators optimized for gaming performance. Run mobile games smoothly on PC with these powerful emulators.
The rapid evolution of large language models (LLMs) has brought forth a new generation of open-source AI models that are more powerful, efficient, and versatile than ever.
ApkOnline is a cloud-based Android emulator that allows users to run Android apps and APK files directly from their web browsers, eliminating the need for physical devices or complex software installations.
Choosing the right Android emulator can transform your experience—whether you're a gamer, developer, or just want to run your favorite mobile apps on a bigger screen.
The rapid evolution of large language models (LLMs) has brought forth a new generation of open-source AI models that are more powerful, efficient, and versatile than ever.