SmolVLM2, particularly the 2.2B model, represents a significant advancement in video understanding, offering powerful capabilities while being remarkably efficient. This article will guide you through the process of running SmolVLM2 2.2B on macOS, covering installation, setup, and practical applications.
SmolVLM2 is part of a family of models designed to democratize video understanding by making it accessible across various devices, from smartphones to powerful servers. The 2.2B model is the most comprehensive, capable of handling a wide range of vision and video tasks, including solving math problems with images, reading text in photos, understanding complex diagrams, and tackling scientific visual questions.
To run SmolVLM2 2.2B on macOS, you can use either the Hugging Face Transformers library or MLX. Here’s a step-by-step guide for both methods:
Install the Hugging Face Transformers library by running:
pip install git+https://github.com/huggingface/transformers.git
Then use the following Python code to load the model:
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

# On macOS there is no CUDA; use the Apple-Silicon GPU (MPS) when available, otherwise the CPU
device = "mps" if torch.backends.mps.is_available() else "cpu"

model_path = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"  # or a local path / your own fine-tuned copy
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # switch to torch.float32 if bfloat16 is not supported on your setup
    _attn_implementation="eager",  # flash_attention_2 is CUDA-only and unavailable on macOS
).to(device)
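Once the model is loaded, you can run a quick sanity check in the same session. The snippet below is a minimal sketch that reuses the processor, model, and device defined above and mirrors the chat-template pattern used in the full example later in this article; the image URL is just a sample and max_new_tokens is an arbitrary choice.
from transformers.image_utils import load_image

# Sample image (a bee photo from the Hugging Face documentation assets)
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg")

# One user turn containing an image placeholder and a question
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Can you describe this image?"},
        ],
    },
]

# Render the chat template, then preprocess the text and image together
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device)

# Generate and decode the answer
generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])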
MLX is optimized for Apple Silicon (M1, M2, M3 chips), offering better performance on these devices. Install mlx-vlm (from the branch with SmolVLM2 support) by running:
pip install git+https://github.com/pcuenca/mlx-vlm.git@smolvlm
For a single image:
python -m mlx_vlm.generate \
--model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
--image https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg \
--prompt "Can you describe this image?"
For video analysis:
python -m mlx_vlm.smolvlm_video_generate \
--model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
--system "Focus only on describing the key dramatic action or notable event occurring in this video segment." \
--prompt "What is happening in this video?" \
--video /Users/yourname/Downloads/video.mov
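The CLI also makes it easy to script batch jobs. The following Python sketch is not part of mlx-vlm itself; it is a hypothetical wrapper that shells out to the mlx_vlm.smolvlm_video_generate command shown above to describe every .mov file in a folder, reusing the model id, system prompt, and folder path from this article's examples.
import subprocess
import sys
from pathlib import Path

MODEL = "mlx-community/SmolVLM2-500M-Video-Instruct-mlx"
SYSTEM = "Focus only on describing the key dramatic action or notable event occurring in this video segment."

def describe_video(video_path: Path) -> str:
    """Run the mlx_vlm video CLI on a single clip and return its output."""
    result = subprocess.run(
        [
            sys.executable, "-m", "mlx_vlm.smolvlm_video_generate",
            "--model", MODEL,
            "--system", SYSTEM,
            "--prompt", "What is happening in this video?",
            "--video", str(video_path),
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()

# Describe every .mov clip in ~/Downloads, one at a time
for clip in sorted(Path("~/Downloads").expanduser().glob("*.mov")):
    print(f"== {clip.name} ==")
    print(describe_video(clip))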
If you’re developing native macOS apps and want to integrate SmolVLM2 directly, you can use Swift MLX. First, clone the mlx-swift-examples repository:
git clone https://github.com/pcuenca/mlx-swift-examples.git
cd mlx-swift-examples
For image analysis:
./mlx-run --debug llm-tool \
--model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
--prompt "Can you describe this image?" \
--image https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg \
--temperature 0.7 --top-p 0.9 --max-tokens 100
SmolVLM2 2.2B can be applied in a variety of scenarios, from describing images and reading text in photos to answering questions about diagrams and summarizing short video clips.
Below are two real-world coding examples demonstrating how to run SmolVLM2 2.2B on macOS using Python and Swift.
To run SmolVLM2 2.2B on macOS using Python, you can use the Hugging Face Transformers library, as in the setup section above. This example demonstrates how to generate a text description of two images.
Install Dependencies: First, install the necessary dependencies using pip:
pip install torch pillow git+https://github.com/huggingface/transformers.git
Run Inference: Use the following Python script to run inference on the images:
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image

# On macOS, prefer the Apple-Silicon GPU (MPS) and fall back to the CPU
DEVICE = "mps" if torch.backends.mps.is_available() else "cpu"

# Load images
image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image2 = load_image("https://huggingface.co/spaces/merve/chameleon-7b/resolve/main/bee.jpg")

# Initialize processor and model
model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # use torch.float32 if bfloat16 is unsupported on your setup
    _attn_implementation="eager",  # flash_attention_2 is CUDA-only and unavailable on macOS
).to(DEVICE)

# Create input messages: two image placeholders followed by the question
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Can you describe the two images?"}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])
This script loads two images, initializes the SmolVLM2 processor and model, prepares the chat-formatted input messages, and generates a text description of both images.
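The same Transformers stack can also handle video. The snippet below is a hedged sketch rather than code from this article: it reuses the processor, model, and DEVICE objects from the script above and assumes a recent Transformers release whose chat template accepts video entries (with a video-decoding backend such as PyAV installed); the video path is a placeholder.
# Reuses `processor`, `model`, and `DEVICE` from the script above.
# Assumes the chat template accepts {"type": "video", "path": ...} entries
# (recent transformers) and that a video backend such as pyav is installed.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "/Users/yourname/Downloads/video.mov"},  # placeholder path
            {"type": "text", "text": "Describe what happens in this video."},
        ],
    },
]

# The processor samples frames from the video and builds the multimodal prompt
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(DEVICE)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])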
For macOS users who prefer Swift, you can use the mlx-swift-examples repository to run SmolVLM2 2.2B. This example demonstrates how to generate text based on an image using Swift.
Install Dependencies: First, clone the mlx-swift-examples repository and build the project:
git clone https://github.com/pcuenca/mlx-swift-examples.git
cd mlx-swift-examples
./build.sh
Run Inference: Use the following command to run inference on an image:
./mlx-run --debug llm-tool \
--model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
--prompt "Can you describe this image?" \
--image https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg \
--temperature 0.7 --top-p 0.9 --max-tokens 100
This command uses the llm-tool CLI to run inference on the specified image and generate a text description.
These examples demonstrate how to run SmolVLM2 2.2B on macOS using both Python and Swift. The Python example uses the Hugging Face Transformers library for inference, while the Swift example uses the mlx-swift-examples repository.
Running SmolVLM2 2.2B on macOS offers a powerful tool for video understanding. By following the steps outlined above, developers and researchers can harness the capabilities of this model for a wide range of applications.
Future updates may bring optimizations for smaller models, enhanced performance across diverse hardware, and broader support for additional frameworks and languages.