Run SmolVLM2 2.2B on macOS: Installation Guide

SmolVLM2, particularly the 2.2B model, represents a significant advancement in video understanding, offering powerful capabilities while being remarkably efficient. This article will guide you through the process of running SmolVLM2 2.2B on macOS, covering installation, setup, and practical applications.

What is SmolVLM2?

SmolVLM2 is part of a family of models designed to democratize video understanding by making it accessible across various devices, from smartphones to powerful servers. The 2.2B model is the most comprehensive, capable of handling a wide range of vision and video tasks, including solving math problems with images, reading text in photos, understanding complex diagrams, and tackling scientific visual questions.

Key Features of SmolVLM2 2.2B

  • Efficiency: Even at 2.2B parameters, the model is memory-efficient, allowing it to run on devices with limited resources, including free Google Colab environments.
  • Versatility: It supports multiple frameworks, including Hugging Face Transformers and MLX, making it versatile for different use cases.
  • Performance: SmolVLM2 2.2B performs strongly in benchmarks like Video-MME, showcasing its ability to handle diverse video types and data modalities.

Setting Up SmolVLM2 2.2B on macOS

To run SmolVLM2 2.2B on macOS, you can use either the Hugging Face Transformers library or MLX. Here’s a step-by-step guide for both methods:

Using Hugging Face Transformers

  1. Install Python and Required Libraries:
    • Ensure Python is installed on your macOS system. You can download it from the official Python website if needed.
    • Install the Hugging Face Transformers library by running:

pip install git+https://github.com/huggingface/transformers.git

  2. Load the SmolVLM2 2.2B Model: Use the following Python code to load the model:

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

# CUDA is not available on macOS; use the Apple Silicon GPU (MPS) when present, otherwise the CPU.
DEVICE = "mps" if torch.backends.mps.is_available() else "cpu"

model_path = "your_username/SmolVLM2-2.2B"  # replace with the SmolVLM2 2.2B checkpoint you want to load
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # switch to torch.float32 if your PyTorch/MPS build lacks bfloat16 support
).to(DEVICE)

  3. Perform Video/Image Inference:
    • Refer to the official documentation for examples on how to perform inference using the loaded model; a minimal video example is sketched below.
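
As a quick illustration of step 3, here is a minimal video-inference sketch that reuses the processor and model loaded above. It follows the chat-template pattern documented for SmolVLM2; the video path is a placeholder, and depending on your transformers version you may need extra packages for video decoding (for example pyav).

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "/Users/yourname/Downloads/video.mov"},  # placeholder path
            {"type": "text", "text": "Describe this video in detail."},
        ],
    },
]

# The processor turns the chat messages (including sampled video frames) into model inputs.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=model.dtype)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])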

Using MLX

MLX is optimized for Apple Silicon (M1, M2, M3 chips), offering better performance on these devices. The commands below use the 500M MLX checkpoint from the official examples; substitute an MLX conversion of the 2.2B model if you need the larger variant.

  1. Install MLX-VLM:

pip install git+https://github.com/pcuenca/mlx-vlm.git@smolvlm

  2. Run Inference:

For video analysis:

python -m mlx_vlm.smolvlm_video_generate \
  --model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
  --system "Focus only on describing the key dramatic action or notable event occurring in this video segment." \
  --prompt "What is happening in this video?" \
  --video /Users/yourname/Downloads/video.mov

For a single image:

python -m mlx_vlm.generate \
  --model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
  --image https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg \
  --prompt "Can you describe this image?"

Swift MLX for Native macOS Apps

If you’re developing native macOS apps and want to integrate SmolVLM2 directly, you can use Swift MLX:

  1. Clone the Forked Repository:

git clone https://github.com/YOUR_USERNAME/mlx-swift-examples
cd mlx-swift-examples

  2. Compile the Project: Follow the instructions in the repository to compile the project using Xcode.
  3. Run Inference:

For image analysis:

./mlx-run --debug llm-tool \
  --model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
  --prompt "Can you describe this image?" \
  --image https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg \
  --temperature 0.7 --top-p 0.9 --max-tokens 100

Practical Applications of SmolVLM2 2.2B

SmolVLM2 2.2B can be applied in various scenarios:

  • Video Analysis: Extract key events or actions from videos, useful for content creators or researchers; a batch-processing sketch follows this list.
  • Image Understanding: Read text in images, solve math problems, and interpret diagrams for educational tools.
  • Scientific Visual Questions: Tackle complex visual questions, aiding in scientific research.
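
To make the video-analysis use case concrete, the sketch below batch-processes a folder of clips by shelling out to the same mlx_vlm.smolvlm_video_generate command shown earlier. It assumes mlx-vlm is installed; the folder path is an illustrative placeholder.

import subprocess
import sys
from pathlib import Path

MODEL = "mlx-community/SmolVLM2-500M-Video-Instruct-mlx"
VIDEO_DIR = Path("/Users/yourname/Downloads/videos")  # placeholder folder of .mov clips

for video in sorted(VIDEO_DIR.glob("*.mov")):
    # Invoke the CLI entry point from the MLX section once per video and capture its output.
    result = subprocess.run(
        [
            sys.executable, "-m", "mlx_vlm.smolvlm_video_generate",
            "--model", MODEL,
            "--system", "Focus only on describing the key dramatic action or notable event occurring in this video segment.",
            "--prompt", "What is happening in this video?",
            "--video", str(video),
        ],
        capture_output=True,
        text=True,
    )
    print(f"=== {video.name} ===")
    print(result.stdout.strip())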

Real-World Coding Examples

Below are two real-world coding examples demonstrating how to run SmolVLM2 2.2B on macOS using Python and Swift.

Example 1: Using Python with Hugging Face Transformers

To run SmolVLM2 2.2B on macOS using Python, you can leverage the Hugging Face Transformers library. This example demonstrates how to generate text from images; it follows the official SmolVLM example and loads the HuggingFaceTB/SmolVLM-Instruct checkpoint, and the same approach can be pointed at a SmolVLM2 2.2B checkpoint.

Install Dependencies: First, install the Transformers library (the script also requires torch and Pillow):

pip install git+https://github.com/huggingface/transformers.git

Run Inference: Use the following Python script to run inference on an image:

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load images
image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image2 = load_image("https://huggingface.co/spaces/merve/chameleon-7b/resolve/main/bee.jpg")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    torch_dtype=torch.bfloat16,  # switch to torch.float32 if your PyTorch/MPS build lacks bfloat16 support
    _attn_implementation="eager",  # FlashAttention 2 requires CUDA, which is not available on macOS
).to(DEVICE)

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Can you describe the two images?"}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])

This script loads two images, initializes the SmolVLM2 model, prepares the input messages, and generates a text description of the images.

Example 2: Using Swift with MLX

For macOS users who prefer Swift, you can use the mlx-swift-examples repository to run SmolVLM2 2.2B. This example demonstrates how to generate text based on an image using Swift.

Install Dependencies: First, clone the mlx-swift-examples repository and build the project:

git clone https://github.com/pcuenca/mlx-swift-examples.git
cd mlx-swift-examples
./build.sh

Run Inference: Use the following command to run inference on an image:

./mlx-run --debug llm-tool \
    --model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
    --prompt "Can you describe this image?" \
    --image https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg \
    --temperature 0.7 --top-p 0.9 --max-tokens 100

This command uses the llm-tool CLI to run inference on the specified image and generate a text description.

These examples demonstrate how to run SmolVLM2 2.2B on macOS using both Python and Swift. The Python example uses the Hugging Face Transformers library for inference, while the Swift example uses the mlx-swift-examples repository.

Challenges and Considerations

  • Hardware Requirements: Running the 2.2B model may require significant computational resources, especially for large-scale video analysis.
  • Model Size and Complexity: Smaller models like the 500M version offer efficiency but may not match the 2.2B model’s performance.

Conclusion

Running SmolVLM2 2.2B on macOS offers a powerful tool for video understanding. By following the steps outlined above, developers and researchers can harness the capabilities of this model for a wide range of applications.

Future Developments

Future updates may bring optimizations for smaller models, enhanced performance across diverse hardware, and broader support for additional frameworks and languages.

Need expert guidance? Connect with a top Codersera professional today!
