Deploying the Qwen2.5-1M model locally on a Windows machine may seem complex due to its advanced features and hardware requirements. This guide provides a detailed, step-by-step approach to setting up Qwen2.5-1M, enabling users to leverage its cutting-edge capabilities in natural language processing and machine learning.
The Qwen2.5-1M model is a powerful language model developed by Alibaba's Qwen team, supporting a context window of up to 1 million tokens. With advanced features like Dual Chunk Attention, Qwen2.5-1M excels at a wide range of NLP and ML tasks. The model comes in two primary configurations:
- Qwen2.5-7B-Instruct-1M
- Qwen2.5-14B-Instruct-1M
Each configuration has significant VRAM requirements (the Qwen team recommends roughly 120 GB of total GPU memory for the 7B model and 320 GB for the 14B model at the full 1M-token context), so it's essential to ensure your system can handle the load for optimal performance. This guide uses the 7B variant throughout.
Before you begin, make sure your system meets the following hardware and software requirements:
- A 64-bit Windows 10 or 11 installation.
- One or more NVIDIA GPUs with CUDA support; the launch command later in this guide assumes four GPUs for tensor parallelism, and the full 1M-token context needs a large amount of total VRAM (see above).
- The NVIDIA CUDA Toolkit (12.x) with a matching driver.
- Python 3 with pip.
- Git.
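To check which GPUs and driver version Windows currently sees, you can run:
nvidia-smi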
CUDA is necessary for utilizing the GPU capabilities of your system. Follow these steps:
- Download the CUDA Toolkit installer for Windows from NVIDIA's developer site (developer.nvidia.com/cuda-downloads).
- Run the installer, keeping the default options unless you have a reason not to.
- Open a fresh terminal so the updated PATH is picked up, then verify the install as shown below.
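To confirm the toolkit is installed and on your PATH:
nvcc --version
The reported release should match the version you downloaded.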
Ensure you have a compatible version of Python:
- Install a recent Python 3 release (3.9 or later) from python.org.
- During installation, tick "Add Python to PATH".
- Verify the interpreter as shown below.
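You can check the version and, optionally, create an isolated virtual environment for everything that follows (the qwen-env name here is just an example):
python --version
python -m venv qwen-env
qwen-env\Scripts\activate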
Git is required to clone repositories. If it's not already installed, follow these steps:
- Download the Windows installer from git-scm.com and run it, keeping the default options.
- Verify the install with the command below.
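A quick check that Git is available on your PATH:
git --version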
Clone the necessary repository and install it in editable mode by running:
git clone -b dev/dual-chunk-attn git@github.com:QwenLM/vllm.git
cd vllm
pip install -e . -v
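The command above clones over SSH, which requires an SSH key registered with GitHub. If you don't have one set up, the HTTPS URL works just as well:
git clone -b dev/dual-chunk-attn https://github.com/QwenLM/vllm.git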
To run Qwen2.5-1M efficiently, install the following dependencies:
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121
pip install transformers
If you have a different CUDA version installed, replace cu121 with the matching wheel tag (for example, cu118 for CUDA 11.8); pytorch.org lists the published builds.
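Once the installs finish, you can quickly confirm that the PyTorch build can see your GPU:
python -c "import torch; print(torch.cuda.is_available())"
This should print True; if it prints False, recheck the CUDA install and the wheel tag above.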
To configure your system to recognize CUDA, follow these steps:
- Set the CUDA_HOME environment variable to C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.X (replace v12.X with your installed CUDA version).
- Add C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.X\bin to your Path variable.
Both can be edited under System Properties > Environment Variables, or from the command line as shown below.
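A minimal command-line sketch, assuming CUDA 12.1 installed at the default location (run from an elevated Command Prompt; note that setx only affects new terminal sessions, not the current one):
setx CUDA_HOME "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1"
setx Path "%Path%;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin"
Because setx can truncate long Path values, editing Path through the System Properties dialog is the safer route on machines with many entries.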
Once the environment is set up, launch the API service with the following command:
vllm serve Qwen/Qwen2.5-7B-Instruct-1M \
--tensor-parallel-size 4 \
--max-model-len 1010000 \
--enable-chunked-prefill --max-num-batched-tokens 131072 \
--enforce-eager --max-num-seqs 1
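Note that the trailing backslashes are Unix shell line continuations; in Windows Command Prompt or PowerShell, run everything on a single line instead:
vllm serve Qwen/Qwen2.5-7B-Instruct-1M --tensor-parallel-size 4 --max-model-len 1010000 --enable-chunked-prefill --max-num-batched-tokens 131072 --enforce-eager --max-num-seqs 1
The --tensor-parallel-size 4 setting assumes four GPUs; set it to however many GPUs your machine actually has.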
To confirm that everything is working, you can test with a simple chat completion request using Python:
from openai import OpenAI

# Point the client at the local vLLM server. vLLM only validates the key if
# the server was started with --api-key, so a placeholder works here.
client = OpenAI(base_url='http://localhost:8000/v1/', api_key='your_api_key')

response = client.chat.completions.create(
    messages=[
        {'role': 'user', 'content': 'Hello! How can I use Qwen?'}
    ],
    model='Qwen/Qwen2.5-7B-Instruct-1M',  # must match the name passed to vllm serve
)
print("Response:", response.choices[0].message.content)
Replace 'your_api_key' with a real key only if you launched the server with the --api-key option; otherwise any placeholder string is accepted.
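The test script depends on the OpenAI Python client, which the earlier steps didn't install; add it with pip if you don't already have it:
pip install openai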
If you encounter VRAM-related errors, try reducing --max-model-len or adjusting --tensor-parallel-size, as in the example below.
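For instance, a lower-memory launch might cap the context well below the full 1M tokens; a sketch, assuming two GPUs and a 256K-token limit (adjust both numbers to fit your hardware):
vllm serve Qwen/Qwen2.5-7B-Instruct-1M --tensor-parallel-size 2 --max-model-len 262144 --enable-chunked-prefill --max-num-batched-tokens 131072 --enforce-eager --max-num-seqs 1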
Ensure the API is running at http://localhost:8000. If you face connection issues, check your firewall settings and confirm the service is active.
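A quick reachability check from another terminal is to ask the server for its model list:
curl http://localhost:8000/v1/models
This should return a JSON payload that includes Qwen/Qwen2.5-7B-Instruct-1M.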
This guide covered the essential steps to deploy Qwen2.5-1M on Windows. With the server running, you can put this powerful model to work on advanced long-context language processing tasks. Keep up to date with future improvements from Alibaba's Qwen team to get the most out of its performance and capabilities.
For those who prefer a macOS setup, refer to our dedicated guide on setting up Qwen2.5-1M on Mac for detailed instructions tailored to Apple devices.