To set up the Qwen2.5-1M model locally on Ubuntu/Linux, follow this comprehensive step-by-step guide. This guide will cover system requirements, installation of dependencies, launching the model, and troubleshooting common issues.
Before you begin the installation process, ensure your system meets the requirements for optimal performance: the official Qwen2.5-1M release recommends roughly 120GB of total GPU VRAM for the 7B model and 320GB for the 14B model when processing sequences up to 1 million tokens.
If your GPUs do not meet the VRAM requirements, you can still use the Qwen2.5-1M models for shorter tasks.
To run the Qwen2.5-1M model, you need to clone the vLLM repository from the custom branch and install it manually. Follow these steps:
Clone the vLLM repository:
git clone -b dev/dual-chunk-attn git@github.com:QwenLM/vllm.git
Navigate to the cloned directory:
cd vllm
Install the necessary Python packages:
pip install -e . -v
This will set up the required environment for running the model.
Once you have installed all dependencies, you can launch the Qwen2.5-1M model using an OpenAI-compatible API service. Use the following command to start the service, adjusting it based on your hardware setup:
vllm serve Qwen/Qwen2.5-7B-Instruct-1M \
--tensor-parallel-size 4 \
--max-model-len 1010000 \
--enable-chunked-prefill --max-num-batched-tokens 131072 \
--enforce-eager \
--max-num-seqs 1
--tensor-parallel-size: set this to the number of GPUs you are using (maximum of 4 for the 7B model and 8 for the 14B model).
--max-model-len: defines the maximum input sequence length; reduce this value if you encounter Out of Memory issues.
--max-num-batched-tokens: sets the chunk size in Chunked Prefill; a smaller value reduces activation memory usage but may slow down inference.
--max-num-seqs: limits the number of sequences processed concurrently.
You may also enable FP8 quantization for the model weights to reduce memory usage by adding --quantization fp8 to your command.
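To see how the flags above map to different hardware setups, here is a small illustrative Python helper that assembles the serve command. The flag names match the command in this guide; the helper function itself (build_serve_args) is hypothetical and not part of vLLM.

```python
# Hypothetical helper: builds the `vllm serve` argument list shown in this
# guide for a given hardware setup. Only the flag names come from vLLM;
# the function is an illustration.

def build_serve_args(model, num_gpus, max_model_len,
                     batched_tokens=131072, fp8=False):
    """Return the argument list for an OpenAI-compatible vLLM server."""
    args = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(num_gpus),
        "--max-model-len", str(max_model_len),
        "--enable-chunked-prefill",
        "--max-num-batched-tokens", str(batched_tokens),
        "--enforce-eager",
        "--max-num-seqs", "1",
    ]
    if fp8:
        # Optional FP8 weight quantization to reduce memory usage.
        args += ["--quantization", "fp8"]
    return args

cmd = build_serve_args("Qwen/Qwen2.5-7B-Instruct-1M", num_gpus=4,
                       max_model_len=1010000, fp8=True)
print(" ".join(cmd))
```

Swapping num_gpus or lowering max_model_len here mirrors the tuning advice above: fewer GPUs means a smaller --tensor-parallel-size, and a shorter context reduces memory pressure.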
After launching the model, it's crucial to verify that everything is functioning correctly. You can do this by sending a sample request to your local server with a tool like curl or Postman. Using curl, send a request like this:
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct-1M",
"messages": [{"role": "user", "content": "Hello, how can I set up Qwen locally?"}]
}'
If everything is set up correctly, you should receive a response from the model.
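If you prefer Python over curl, the same request can be assembled programmatically. This is a minimal sketch that builds the URL, headers, and JSON body for the OpenAI-compatible endpoint; the build_chat_request helper is an illustration, and the URL and model name match the serve command in this guide.

```python
# Minimal sketch: assemble the same chat-completion request as the curl
# example, for the OpenAI-compatible endpoint vLLM exposes on port 8000.
import json

def build_chat_request(prompt,
                       model="Qwen/Qwen2.5-7B-Instruct-1M",
                       base_url="http://localhost:8000"):
    """Return the URL, headers, and JSON payload for a chat completion call."""
    url = f"{base_url}/v1/chat/completions"
    headers = {"Content-Type": "application/json"}
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, headers, payload

url, headers, payload = build_chat_request("Hello, how can I set up Qwen locally?")
print(json.dumps(payload))

# To actually send the request once the server is running:
#   import requests
#   resp = requests.post(url, headers=headers, json=payload)
#   print(resp.json()["choices"][0]["message"]["content"])
```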
While setting up and running Qwen2.5-1M, you may encounter some common issues:
If you experience Out of Memory (OOM) errors: reduce --max-model-len, or lower --max-num-batched-tokens to shrink the chunk size used during prefill.
If there are problems during installation: confirm that you cloned the dev/dual-chunk-attn branch of the QwenLM/vllm repository and ran pip install -e . -v from inside the cloned directory.
For better performance: consider enabling FP8 quantization with --quantization fp8 to reduce memory usage.
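The OOM advice above can be sketched as a simple retry strategy: if the server fails with out-of-memory, relaunch it with a smaller --max-model-len. The halving step and the floor value below are assumptions for illustration, not vLLM behavior.

```python
# Illustrative OOM fallback: shrink --max-model-len until it drops below a
# chosen floor. The halving strategy and 131072-token floor are assumptions.

def next_max_model_len(current, floor=131072):
    """Halve the context length; return None once it falls below the floor."""
    reduced = current // 2
    return reduced if reduced >= floor else None

length = 1010000
while length is not None:
    # launch_server(length) would go here; break out on a successful start.
    length = next_max_model_len(length)
```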
Setting up the Qwen2.5-1M model locally on Ubuntu/Linux involves careful preparation and attention to system requirements and dependencies. By following this guide, you should be able to successfully deploy and test your own instance of this powerful language model, capable of processing long context lengths of up to one million tokens.
For setting up the Qwen2.5-1M model on macOS, refer to our detailed guide.
This concludes our detailed guide on setting up Qwen2.5-1M locally on Ubuntu/Linux. For further assistance or advanced configurations, refer to community forums or documentation related to Qwen models and vLLM usage.