To set up the Qwen2.5-1M model locally on Ubuntu/Linux, follow this step-by-step guide. It covers system requirements, installation of dependencies, launching the model, and troubleshooting common issues.
Before you begin the installation process, ensure your system meets the following requirements for optimal performance:

An Ubuntu/Linux system with NVIDIA GPUs, a recent driver, and a CUDA toolkit installed.
A working Python environment with pip (a dedicated virtual environment is recommended).
For processing the full 1 million-token context: at least 120GB of total VRAM (across all GPUs) for Qwen2.5-7B-Instruct-1M, or at least 320GB for Qwen2.5-14B-Instruct-1M.

If your GPUs do not meet the VRAM requirements, you can still use the Qwen2.5-1M models for shorter tasks.
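Before launching anything, it can help to verify how much VRAM your machine actually exposes. The following is a minimal sketch, assuming a CUDA-enabled PyTorch install (vLLM pulls in PyTorch as a dependency anyway):

import torch

if torch.cuda.is_available():
    total_gb = 0.0
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        gb = props.total_memory / 1024**3  # bytes to GiB
        total_gb += gb
        print(f"GPU {i}: {props.name}, {gb:.1f} GB VRAM")
    print(f"Total VRAM across GPUs: {total_gb:.1f} GB")
else:
    print("No CUDA-capable GPU detected.")

Compare the total against the figures above to decide which model variant and context length your hardware can handle.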
To run the Qwen2.5-1M model, you need to clone the vLLM repository from the custom branch and install it manually. Follow these steps:

Clone the vLLM repository:
git clone -b dev/dual-chunk-attn git@github.com:QwenLM/vllm.git

Navigate to the cloned directory:
cd vllm

Install the necessary Python packages:
pip install -e . -v

This will set up the required environment for running the model.
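As a quick sanity check (an optional step, not part of the official instructions), you can confirm that the editable install is importable from the active environment:

import vllm  # should import without errors after the editable install

print(vllm.__version__)  # prints the installed vLLM version string

If the import fails, make sure you ran pip install in the same environment you are testing from.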
Once you have installed all dependencies, you can launch the Qwen2.5-1M model using an OpenAI-compatible API service. Use the following command to start the service, adjusting it based on your hardware setup:
vllm serve Qwen/Qwen2.5-7B-Instruct-1M \
--tensor-parallel-size 4 \
--max-model-len 1010000 \
--enable-chunked-prefill --max-num-batched-tokens 131072 \
--enforce-eager \
--max-num-seqs 1
Here is what each flag controls:

--tensor-parallel-size: Set this to the number of GPUs you are using (at most 4 for the 7B model and 8 for the 14B model).
--max-model-len: Defines the maximum input sequence length; reduce this value if you encounter Out of Memory issues.
--max-num-batched-tokens: Sets the chunk size for Chunked Prefill; a smaller value reduces activation memory usage but may slow down inference.
--max-num-seqs: Limits the number of sequences processed concurrently.

You may also enable FP8 quantization for the model weights to reduce memory usage by adding --quantization fp8 to your command.
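If you launch the server from scripts rather than an interactive shell, the same command can be assembled programmatically. The sketch below is a hypothetical convenience launcher (not part of the official tooling) that mirrors the command above and picks --tensor-parallel-size from the visible GPU count; it assumes PyTorch is available for GPU detection:

import subprocess
import torch

# Hypothetical launcher mirroring the `vllm serve` command above.
gpus = max(torch.cuda.device_count(), 1)
cmd = [
    "vllm", "serve", "Qwen/Qwen2.5-7B-Instruct-1M",
    "--tensor-parallel-size", str(min(gpus, 4)),  # 7B model: at most 4 GPUs
    "--max-model-len", "1010000",                 # reduce this on OOM
    "--enable-chunked-prefill",
    "--max-num-batched-tokens", "131072",
    "--enforce-eager",
    "--max-num-seqs", "1",
]
subprocess.run(cmd, check=True)  # blocks while the server runs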
After launching the model, it's crucial to test whether everything is functioning correctly. You can do this by sending a sample request to your local server using a tool like curl or Postman.

Using curl, you can send a request like this:
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct-1M",
"messages": [{"role": "user", "content": "Hello, how can I set up Qwen locally?"}]
}'
If everything is set up correctly, you should receive a response from the model.
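If you prefer Python over curl, the openai client library can talk to the same endpoint, since vLLM exposes an OpenAI-compatible API. A minimal sketch, assuming the openai package (v1 or later) is installed; vLLM does not check the API key by default, so any placeholder value works:

from openai import OpenAI  # pip install openai

# Point the client at the local vLLM server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct-1M",
    messages=[{"role": "user", "content": "Hello, how can I set up Qwen locally?"}],
)
print(response.choices[0].message.content)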
While setting up and running Qwen2.5-1M, you may encounter some common issues:

If you experience Out of Memory (OOM) errors:
Reduce --max-model-len.
Reduce --max-num-batched-tokens.

If there are problems during installation:
Verify that you cloned the dev/dual-chunk-attn branch and that the editable install (pip install -e . -v) finished without errors.

For better performance:
Consider enabling FP8 quantization (--quantization fp8) to lower memory pressure, and confirm the server is actually responding before digging deeper, as shown in the sketch below.
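When diagnosing any of the issues above, it helps to first confirm the server is reachable at all. Here is a small sketch using the requests package (an assumption on my part, not a dependency the guide requires) that lists the models the server reports via the OpenAI-compatible /v1/models endpoint:

import requests  # pip install requests

BASE_URL = "http://localhost:8000"

try:
    # The OpenAI-compatible API lists served models at /v1/models.
    resp = requests.get(f"{BASE_URL}/v1/models", timeout=5)
    resp.raise_for_status()
    names = [m["id"] for m in resp.json().get("data", [])]
    print("Server is up; serving:", names)
except requests.RequestException as exc:
    print("Server not reachable:", exc)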
Setting up the Qwen2.5-1M model locally on Ubuntu/Linux involves careful preparation and attention to system requirements and dependencies. By following this guide, you should be able to successfully deploy and test your own instance of this powerful language model, capable of processing long context lengths of up to one million tokens.
For setting up the Qwen2.5-1M model on macOS, refer to our detailed guide.
For further assistance or advanced configurations, refer to the official Qwen and vLLM documentation or community forums.