Run Kimi Moonlight 16B-A3B on Linux/Ubuntu: Installation Guide

Moonshot AI's Moonlight-16B-A3B is a Mixture-of-Experts model with 16B total parameters and ~3B active per token, trained with the Muon optimizer. Released under the MIT license on Hugging Face as moonshotai/Moonlight-16B-A3B-Instruct, it's positioned as Moonshot's compact open-weights model — distinct from the company's flagship Kimi K2.6 closed/API line.

Moonlight reflects a broader trend toward sparse models that keep per-token compute low while total parameter count grows. Linux is a practical host for such models: the inference servers used later in this guide, vLLM and SGLang, target Linux first, and the operating system is easy to customize for GPU workloads.

In this article, we’ll explore how to run Moonlight-16B-A3B on Linux, including prerequisites, installation steps, optimization techniques, and troubleshooting tips.

Prerequisites

Before setting up the Moonlight model on Linux, ensure you meet the following requirements:

Hardware Requirements

  • CPU: Multi-core processor recommended for better performance.
  • RAM: Minimum 16 GB; 32 GB or more is preferable, since the BF16 weights alone are roughly 32 GB and any layers offloaded from the GPU land in system memory.
  • GPU: Highly recommended. Because Moonlight is a 16B-total MoE, full BF16 weights occupy roughly 32 GB — plan for a single A100 40GB / H100 80GB for comfortable headroom, or quantize (GGUF via llama.cpp, AWQ, GPTQ) to fit on smaller cards. With only ~3B parameters active per forward pass, throughput is closer to a 3B model than a 16B dense model.
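The 32 GB figure above is simple arithmetic: about 16 billion parameters at 2 bytes each in BF16. The sketch below repeats that estimate for lower precisions. It assumes a round 16e9 parameter count and counts weights only, so KV cache, activations, and quantization metadata will push real usage higher.

```python
# Rough weight-only memory estimate; KV cache and activations are NOT included.
TOTAL_PARAMS = 16e9  # assumption: round figure taken from the "16B" in the model name

BYTES_PER_PARAM = {
    "bf16": 2.0,  # 16-bit weights
    "int8": 1.0,  # 8-bit quantization
    "int4": 0.5,  # 4-bit quantization
}

def weight_memory_gb(total_params: float, dtype: str) -> float:
    """Return the approximate weight size in GB (1 GB = 1e9 bytes)."""
    return total_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(TOTAL_PARAMS, dtype):.0f} GB")
```

Note that the estimate uses the total parameter count, not the ~3B active count: every expert has to be resident in memory even though only a few are used for any given token.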

Software Requirements

  • Linux Distribution: Ubuntu or similar distributions for their extensive support.
  • Python: 3.10+ recommended (the HF model card targets python=3.10 with torch>=2.1.0).
  • pip: Python package installer.
  • Git: For cloning repositories.
  • Docker: Optional, for running models in a containerized environment.
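A quick way to confirm these requirements before installing anything is a small standard-library check. This is a minimal sketch: the helper names (python_ok, missing_tools) are hypothetical, and it only looks for executables on PATH rather than validating their versions.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)          # matches the Python requirement listed above
REQUIRED_TOOLS = ["git"]      # must be present
OPTIONAL_TOOLS = ["docker"]   # only needed for containerized runs

def python_ok(version=None, minimum=MIN_PYTHON) -> bool:
    """True if the (major, minor) version meets the minimum."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= minimum

def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]

print("Python version OK:", python_ok())
print("Missing required tools:", missing_tools(REQUIRED_TOOLS))
print("Missing optional tools:", missing_tools(OPTIONAL_TOOLS))
```

The `which` parameter exists only so the lookup can be swapped out in tests; in normal use the default `shutil.which` is what you want.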

Step by Step Installation Guide

Installing Necessary Packages

Update your system and install required packages:

sudo apt update && sudo apt upgrade
sudo apt install python3 python3-pip python3-venv git

For Docker (optional), follow the official Docker installation guide.

Setting Up the Environment

Create a virtual environment and install the inference stack. You do not need to clone a repository for inference: the weights and model code are distributed through Hugging Face, and the transformers library handles the download on first use.

python3 -m venv moonlight-env
source moonlight-env/bin/activate
pip install --upgrade pip
pip install "torch>=2.1.0" "transformers==4.48.2" accelerate huggingface_hub

Optionally cache the weights ahead of time (helpful on multi-GPU boxes or when you want to inspect the files):

huggingface-cli download moonshotai/Moonlight-16B-A3B-Instruct

Running the Moonlight Model

With transformers (Python) — save as run_moonlight.py and run with python3 run_moonlight.py:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "moonshotai/Moonlight-16B-A3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are a helpful assistant provided by Moonshot AI."},
    {"role": "user", "content": "Is 123 a prime number?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
generated_ids = model.generate(inputs=input_ids, max_new_tokens=500)
print(tokenizer.batch_decode(generated_ids)[0])

With vLLM — best for serving and batched inference:

pip install vllm
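# --trust-remote-code is likely needed because the repository ships custom tokenizer code
vllm serve moonshotai/Moonlight-16B-A3B-Instruct --trust-remote-code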

# In another terminal:
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  --data '{
    "model": "moonshotai/Moonlight-16B-A3B-Instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }'
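The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes the server is listening on the default localhost:8000 address used by the curl call and returns the usual OpenAI-style response shape; the function names (build_chat_request, chat) and the max_tokens default are illustrative choices, not part of vLLM.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: default address from the curl example
MODEL = "moonshotai/Moonlight-16B-A3B-Instruct"

def build_chat_request(prompt: str, model: str = MODEL, max_tokens: int = 256):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example (requires the server from the previous step to be running):
# print(chat("What is the capital of France?"))
```

Keeping request construction separate from the network call makes it easy to inspect or log the payload without a running server.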

With SGLang — also officially supported on the model card:

pip install sglang
python3 -m sglang.launch_server \
  --model-path moonshotai/Moonlight-16B-A3B-Instruct \
  --trust-remote-code \
  --host 0.0.0.0 --port 30000

Docker (community). Moonshot does not publish an official Docker image for Moonlight. Community images exist but are unverified; for reproducible deployments, build your own from nvidia/cuda or vllm/vllm-openai base images, or use Docker's HF runner: docker model run hf.co/moonshotai/Moonlight-16B-A3B-Instruct.

Troubleshooting

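  • Memory Issues: If loading fails with an out-of-memory error, switch to a quantized build (GGUF, AWQ, or GPTQ), lower max_new_tokens, or use a GPU with more VRAM. Adding system RAM helps only when layers are offloaded to the CPU.
  • GPU Support: Confirm the driver sees the card by running nvidia-smi, and keep GPU drivers up to date.
  • Package Conflicts: Install into a virtual environment (venv) and keep the pinned transformers version from the install step.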
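For the memory and GPU items above, it helps to see how much VRAM is actually free before loading the model. The sketch below shells out to nvidia-smi and parses its CSV output. It assumes an NVIDIA GPU; the helper names are hypothetical, and rows that report non-numeric values are skipped rather than handled.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,memory.used",
    "--format=csv,noheader,nounits",
]

def parse_gpu_csv(text: str):
    """Parse nvidia-smi CSV rows into dicts; memory values are in MiB."""
    gpus = []
    for line in text.strip().splitlines():
        try:
            # Split from the right so a comma inside the GPU name is tolerated.
            name, total, used = [field.strip() for field in line.rsplit(",", 2)]
            gpus.append({"name": name, "total_mib": int(total), "used_mib": int(used)})
        except ValueError:
            continue  # skip blank or non-numeric rows
    return gpus

def query_gpus():
    """Return GPU info, or an empty list if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    return parse_gpu_csv(result.stdout) if result.returncode == 0 else []

gpus = query_gpus()
if not gpus:
    print("No NVIDIA GPU detected (or nvidia-smi is not on PATH).")
for gpu in gpus:
    free_mib = gpu["total_mib"] - gpu["used_mib"]
    print(f"{gpu['name']}: {free_mib} MiB free of {gpu['total_mib']} MiB")
```

An empty result usually points to a missing driver or a container started without GPU access, which narrows down the "GPU Support" item quickly.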

Optimizing Performance

  • Use a GPU: Inference on a GPU is far faster than on the CPU alone, especially with vLLM or SGLang handling batching.
  • Optimize Memory Usage: Monitor system RAM with top or htop and GPU memory with nvidia-smi.
  • Update Drivers: Keep your system and GPU drivers up to date.

Advanced Setup

Using Docker for Deployment

Create a Dockerfile:

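FROM python:3.10-slim
WORKDIR /app
# requirements.txt should list the packages from the environment setup step
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# run_moonlight.py is the script from "Running the Moonlight Model"
CMD ["python", "run_moonlight.py"]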

Build and run the image (tag it under your own namespace; kimiai/moonlight is not an official image). The --gpus all flag requires the NVIDIA Container Toolkit on the host:

docker build -t local/moonlight .
docker run --gpus all -it local/moonlight

Using Virtual Environments

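Create and activate a virtual environment, then install from a requirements.txt that lists the packages from the setup step (torch, transformers, accelerate, huggingface_hub):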

python3 -m venv myenv
source myenv/bin/activate
pip install -r requirements.txt

Deactivate with:

deactivate

Future Developments and Scaling

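  • Distributed Computing: Shard the model across several GPUs, for example with device_map="auto" in transformers or tensor parallelism in vLLM (--tensor-parallel-size).
  • Model Pruning: Remove low-importance weights to shrink the model, usually at some cost in accuracy.
  • Quantization: Store weights in lower-precision types (8-bit or 4-bit) to cut memory use and often speed up inference.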
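To make the quantization item concrete, here is a minimal sketch of symmetric per-tensor int8 quantization on a plain Python list. It is illustrative only: the weights are made-up values, and production formats such as GGUF, AWQ, and GPTQ use per-group scales and calibration that this example leaves out.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.003, 1.0]  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 codes:", q)
print("scale:", scale)
print("max round-trip error:", max_error)
```

Each weight now needs one byte plus a shared scale instead of two bytes in BF16, and the round-trip error is bounded by half the scale. Small values such as 0.003 collapse to zero, which is one reason real schemes quantize in small groups.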

Community Engagement

Contribute and stay engaged:

  • Report Issues: Use GitHub for bug reports and suggestions.
  • Contribute Code: Submit pull requests with improvements.
  • Join Forums: Participate in AI discussions on platforms like Reddit.

Conclusion

Running Moonlight-16B-A3B on Linux is a flexible and powerful way to leverage AI models. By following this guide, you can set up and optimize your environment for efficient performance. Stay connected with the community and keep exploring advancements to maximize the potential of this model.

References

  1. Run DeepSeek Janus-Pro 7B on Mac: A Comprehensive Guide Using ComfyUI
  2. Run DeepSeek Janus-Pro 7B on Mac: Step-by-Step Guide
  3. Run Microsoft OmniParser V2 on Ubuntu : Step by Step Installation Guide
  4. Moonlight-16B-A3B-Instruct on Hugging Face