Install and Run Hunyuan 7B on Linux/Ubuntu: An Installation Guide

Installing and running a 7-billion-parameter (7B) large language model, such as Tencent's Hunyuan-7B, Mistral-7B, or Llama-2-7B, on Linux/Ubuntu involves a sequence of well-defined steps covering system requirements, environment setup, Python dependencies, model download, and inference execution.

This comprehensive guide walks you through the entire process for a typical “7B” open-source model using HuggingFace’s Transformers library, including optional variations and troubleshooting for best results on a Linux or Ubuntu system.


1. Understanding the 7B Model Landscape

What is a 7B Model?

  • “7B” stands for 7 billion parameters, indicating the model’s scale and performance class.
  • Popular models include Meta's Llama-2-7B, Mistral-7B, DeepSeek's Janus-Pro-7B, Google's Gemma-7B, and Tencent's Hunyuan-7B.

Model Architecture

  • Most are transformer-based and support tasks like text generation, summarization, and chat.

Choosing the Right Model

  • Match the model to your use case (e.g., code generation, instruction-following, etc.).
  • This guide applies to any HuggingFace-hosted 7B model.

1.1 What is Hunyuan 7B?

Hunyuan 7B is the 7-billion-parameter language model in Tencent's Hunyuan family, released in both pre-trained and instruction-tuned versions for natural language tasks. The wider Hunyuan family also includes models for image and video generation, and together they serve as the backbone for AI applications in creative, analytical, and productivity-focused domains.

Key Features

  • State-of-the-art text generation and comprehension
  • Part of the broader Hunyuan family, which also spans image and video generation models
  • Instruction-following and prompt adaptation

2. Hardware and System Requirements

Recommended Minimum:

  • RAM: 32GB (16GB may work with quantization)
  • GPU: NVIDIA GPU with at least 16GB VRAM (RTX 3090/4090 ideal)
  • Disk Space: 30GB+
  • OS: Ubuntu 20.04+ (or Debian-based equivalent)
  • Python: 3.8+
  • CUDA: 11.x/12.x for GPU acceleration

You can run on CPU, but it will be significantly slower. Consider quantized models for CPU-based environments or use hosted inference.
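
Before installing anything, it is worth sanity-checking the machine against these requirements. The following is a minimal Python sketch using only the standard library; the nvidia-smi call assumes the NVIDIA driver is already installed, and the thresholds are only the rough guidelines listed above.

import shutil
import subprocess
from pathlib import Path

# Free disk space where the model weights will live (a 7B model needs roughly 15-30 GB)
free_gb = shutil.disk_usage(Path.home()).free / 1e9
print(f"Free disk space: {free_gb:.1f} GB")

# Total system RAM from /proc/meminfo (Linux only; first line is MemTotal in kB)
with open("/proc/meminfo") as f:
    mem_gb = int(f.readline().split()[1]) / 1e6
print(f"Total RAM: {mem_gb:.1f} GB")

# GPU name and VRAM, if an NVIDIA GPU and driver are present
try:
    print(subprocess.check_output(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        text=True).strip())
except (FileNotFoundError, subprocess.CalledProcessError):
    print("No NVIDIA GPU detected; expect CPU-only (slow) inference.")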


3. Preparing the Linux Environment

3.1 Update System Packages

sudo apt update && sudo apt upgrade -y
sudo apt install python3 python3-pip python3-venv git wget -y

3.2 Create and Activate a Virtual Environment

python3 -m venv hunyuan_env
source hunyuan_env/bin/activate

4. Installing Python Dependencies

4.1 Install NVIDIA Drivers, CUDA, and cuDNN

Ensure GPU drivers and CUDA toolkit are installed:

nvidia-smi
nvcc --version

4.2 Install Required Python Packages

pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers huggingface_hub

Replace cu121 with the CUDA version matching your setup.
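
Once PyTorch is installed, a quick check that it can actually see the GPU saves a failed multi-gigabyte model load later (see also the tips in section 10):

import torch

# Should print True plus the GPU name and VRAM if the driver, CUDA wheel, and card all match
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")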


5. Model Selection and Download

Common HuggingFace model IDs:

  • meta-llama/Llama-2-7b-hf
  • mistralai/Mistral-7B-Instruct-v0.3
  • deepseek-ai/Janus-Pro-7B
  • google/gemma-7b

5.1 Download Model via HuggingFace Hub

from huggingface_hub import snapshot_download
from pathlib import Path

model_path = Path.home() / 'hunyuan_models' / 'Mistral-7B-Instruct-v0.3'
model_path.mkdir(parents=True, exist_ok=True)

snapshot_download(repo_id="mistralai/Mistral-7B-Instruct-v0.3", local_dir=model_path)

Use your desired model ID in repo_id.
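
Gated repositories such as meta-llama/Llama-2-7b-hf additionally require accepting the license on the model page and authenticating with a HuggingFace access token. A short sketch of one way to do this from Python (the token string is a placeholder; you can also run huggingface-cli login in a shell instead):

from huggingface_hub import login, snapshot_download

# Paste a read-scoped token from your HuggingFace account settings,
# after accepting the model's license on its model page.
login(token="hf_...")  # placeholder

snapshot_download(repo_id="meta-llama/Llama-2-7b-hf", local_dir="Llama-2-7b-hf")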


6. Running Inference with Transformers

6.1 Basic Text Generation

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name_or_path = "path_to_downloaded_model"  # e.g. the directory downloaded in section 5.1
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, torch_dtype="auto", device_map="auto")

prompt = "Explain quantum computing in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
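
For instruction-tuned checkpoints, results are usually better when the prompt is wrapped in the model's chat template instead of being passed as raw text. A small sketch reusing the tokenizer and model loaded above (it assumes the checkpoint ships a chat template, as Mistral-7B-Instruct-v0.3 does):

# Format the prompt with the model's built-in chat template
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))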

7. Instruction or Chat Interface (Optional)

7.1 Mistral 7B Python API Example

# Requires: pip install mistral-inference
# model_path must contain the raw-format weights (params.json, consolidated.safetensors,
# tokenizer.model.v3), which the Mistral-7B-Instruct-v0.3 repo ships alongside the HF format.
from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

tokenizer = MistralTokenizer.from_file(f"{model_path}/tokenizer.model.v3")
model = Transformer.from_folder(model_path)

completion_request = ChatCompletionRequest(messages=[UserMessage(content="Tell me a joke!")])
tokens = tokenizer.encode_chat_completion(completion_request).tokens

out_tokens, _ = generate(
    [tokens], model, max_tokens=64, temperature=0.7,
    eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id,
)
print(tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0]))

8. Quantization and Multi-GPU Setup

8.1 Quantized Models (Low RAM/CPU)

pip install bitsandbytes

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization via bitsandbytes (BitsAndBytesConfig replaces the deprecated load_in_4bit argument)
quant_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    quantization_config=quant_config,
    device_map="auto",
)

8.2 Multi-GPU Support

Set device_map='auto' or map layers to GPUs manually. Use Accelerate or DeepSpeed for advanced parallelization.
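
As an example of manual placement, device_map also accepts a per-device memory budget through the max_memory argument (handled by Accelerate under the hood). A sketch for a hypothetical two-GPU machine:

from transformers import AutoModelForCausalLM

# Cap how much of each GPU (and of CPU RAM for overflow) the sharded model may use
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    torch_dtype="auto",
    device_map="auto",
    max_memory={0: "15GiB", 1: "15GiB", "cpu": "30GiB"},
)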


9. Alternative Interfaces and Frameworks

9.1 Server or UI Wrappers

Many 7B models support:

  • Web UIs such as Gradio or FastChat (a minimal Gradio sketch follows this list)
  • REST APIs
  • CLI-based chat
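
As an illustration of the web-UI option, here is a minimal Gradio wrapper around the model and tokenizer loaded in section 6.1. This is only a sketch and assumes pip install gradio:

import gradio as gr

def answer(prompt):
    # Reuses `model` and `tokenizer` from section 6.1
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Serves a simple web form on http://127.0.0.1:7860
gr.Interface(fn=answer, inputs="text", outputs="text").launch()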

9.2 Using LangChain

# In newer LangChain releases this class lives in langchain_community:
# from langchain_community.llms import HuggingFacePipeline
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline.from_model_id(
    model_id="mistralai/Mistral-7B-Instruct-v0.3",
    task="text-generation",
    model_kwargs={"temperature": 0.5, "max_length": 200}
)
response = llm.invoke("Summarize Linux memory management.")
print(response)

10. Tips for Efficient 7B Model Execution

  • Prefer checkpoints in HuggingFace format (repos often suffixed -hf) so they load directly with Transformers.
  • Test the PyTorch + CUDA setup before loading a model (see the check in section 4.2).
  • Set local_files_only=True for offline environments.
  • Use htop and nvidia-smi for performance monitoring, or the in-process snippet after this list.
  • Reduce max_new_tokens to avoid out-of-memory errors.
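
If you prefer to monitor memory from inside the Python process rather than a separate terminal, PyTorch exposes its CUDA allocator counters; a small sketch:

import torch

def print_gpu_memory(tag=""):
    # Current and peak memory held by PyTorch's CUDA allocator, in GB
    if torch.cuda.is_available():
        used = torch.cuda.memory_allocated() / 1e9
        peak = torch.cuda.max_memory_allocated() / 1e9
        print(f"{tag} GPU memory: {used:.2f} GB (peak {peak:.2f} GB)")

print_gpu_memory("after model load")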

11. Troubleshooting

Issue: Model fails to load
Fix: Ensure correct paths and compatible CUDA drivers.

Issue: Out of Memory
Fix: Try quantization, shorter sequences, or a CPU fallback.

Issue: License required for model
Fix: Accept HuggingFace TOS for restricted models like Llama-2.

Issue: Want CPU-only?
Fix: Install the CPU-only PyTorch build (--index-url https://download.pytorch.org/whl/cpu) and load the model on the CPU, as sketched below.
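
A minimal CPU-only loading sketch with plain Transformers (quantized builds served by tools such as llama.cpp are usually a better fit for CPU, but this works):

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name_or_path = "path_to_downloaded_model"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

# Without device_map the model loads on the CPU by default;
# expect generation to be far slower than on a GPU for a 7B model
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, torch_dtype=torch.float32)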


12. Automation with Docker (Optional)

Example Dockerfile:

# CUDA 12.1 runtime base image (Ubuntu 20.04)
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu20.04
RUN apt-get update && apt-get install -y python3 python3-pip git
RUN pip3 install torch torchvision torchaudio transformers huggingface_hub

COPY ./run_model.py /app/run_model.py
WORKDIR /app
ENTRYPOINT ["python3", "run_model.py"]
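
The Dockerfile copies a run_model.py that is not shown above; below is a hypothetical sketch of what it might contain, reusing the generation code from section 6.1 (the model ID and prompt are placeholders):

# run_model.py -- hypothetical entrypoint for the image above
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.3"  # replace with your model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")

prompt = "Explain quantum computing in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Build the image with docker build -t hunyuan-7b . and run it with docker run --gpus all hunyuan-7b; the --gpus flag requires the NVIDIA Container Toolkit on the host.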

13. Security Best Practices

  • Never expose sensitive inputs/outputs to public endpoints.
  • Keep your packages up to date.
  • Use token limits and monitor abuse on API-based access.

14. Extensions and Custom Use

  • Use LoRA or adapters for fine-tuning with less memory.
  • Deploy with FastAPI or vLLM for production APIs (a minimal vLLM sketch follows this list).
  • Explore vector databases and semantic search via LangChain or Haystack.
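
A minimal offline-inference sketch with vLLM, assuming pip install vllm and a GPU with enough VRAM for the chosen model:

from vllm import LLM, SamplingParams

# vLLM pulls the model from HuggingFace if it is not already cached locally
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.3")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain quantum computing in simple terms."], params)
print(outputs[0].outputs[0].text)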

Conclusion

This guide serves as a practical blueprint for installing, configuring, and running any HuggingFace-hosted 7B LLM, such as Hunyuan 7B or Mistral 7B, on Linux/Ubuntu systems using open-source tools and current best practices.
