SmolVLM2, particularly the 2.2B model, represents a significant advancement in video understanding, offering powerful capabilities while being remarkably efficient. This article will guide you through the process of running SmolVLM2 2.2B on macOS, covering installation, setup, and practical applications.
SmolVLM2 is part of a family of models designed to democratize video understanding by making it accessible across various devices, from smartphones to powerful servers. The 2.2B model is the most comprehensive, capable of handling a wide range of vision and video tasks, including solving math problems with images, reading text in photos, understanding complex diagrams, and tackling scientific visual questions.
To run SmolVLM2 2.2B on macOS, you can use either the Hugging Face Transformers library or MLX. Here’s a step-by-step guide for both methods:
For the Transformers route, first install the latest library from source:

pip install git+https://github.com/huggingface/transformers.git

Then use the following Python code to load the model. Because Macs have no CUDA GPU, the snippet selects Apple's Metal (MPS) backend when it is available and falls back to the CPU; flash attention is only enabled when running on CUDA hardware:

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

# SmolVLM2 2.2B instruction-tuned checkpoint on the Hugging Face Hub
model_path = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"

# Prefer Apple's Metal (MPS) backend, then CUDA, then the CPU
# Note: bfloat16 on MPS requires a recent PyTorch release
device = "mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu")

processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if device == "cuda" else "eager",
).to(device)
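With the model and processor loaded, describing a video clip takes one more generation call. The sketch below continues the snippet above and follows the video message format described in the SmolVLM2 documentation; the video path is a placeholder, and depending on your Transformers version you may also need a video-decoding backend such as PyAV (pip install av).

# Continues the loading snippet above (model, processor, torch, device).
# The {"type": "video", "path": ...} message format follows the SmolVLM2 docs;
# replace the placeholder path with a clip on your own machine.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "/Users/yourname/Downloads/video.mov"},
            {"type": "text", "text": "Describe this video in detail."},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])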
The second option is MLX, which is optimized for Apple Silicon (M1, M2, and M3 chips) and typically offers better performance on these devices. Install the SmolVLM2-enabled branch of mlx-vlm:

pip install git+https://github.com/pcuenca/mlx-vlm.git@smolvlm

The commands below use the 500M video-instruct checkpoint published for MLX; if a larger SmolVLM2 conversion is available in the mlx-community organization, it can be substituted in the same way.

For a single image:

python -m mlx_vlm.generate \
  --model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
  --image https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg \
  --prompt "Can you describe this image?"

For video analysis:

python -m mlx_vlm.smolvlm_video_generate \
  --model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
  --system "Focus only on describing the key dramatic action or notable event occurring in this video segment." \
  --prompt "What is happening in this video?" \
  --video /Users/yourname/Downloads/video.mov
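If you need to caption many images at once, the same CLI can be driven from a short Python script. The sketch below simply shells out to the mlx_vlm command shown above for every JPEG in a folder; the folder path is an illustrative placeholder, and output handling is kept minimal.

import subprocess
import sys
from pathlib import Path

# Sketch: run the mlx_vlm CLI shown above over every image in a folder.
# IMAGE_DIR is a placeholder; point it at your own files.
IMAGE_DIR = Path("/Users/yourname/Pictures/samples")
MODEL = "mlx-community/SmolVLM2-500M-Video-Instruct-mlx"

for image_path in sorted(IMAGE_DIR.glob("*.jpg")):
    result = subprocess.run(
        [
            sys.executable, "-m", "mlx_vlm.generate",
            "--model", MODEL,
            "--image", str(image_path),
            "--prompt", "Can you describe this image?",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    print(f"--- {image_path.name} ---")
    print(result.stdout.strip())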
If you’re developing native macOS apps and want to integrate SmolVLM2 directly, you can use Swift MLX. Clone the example repository first:

git clone https://github.com/pcuenca/mlx-swift-examples.git
cd mlx-swift-examples

For image analysis:

./mlx-run --debug llm-tool \
  --model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
  --prompt "Can you describe this image?" \
  --image https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg \
  --temperature 0.7 --top-p 0.9 --max-tokens 100
SmolVLM2 2.2B can be applied in a wide range of scenarios, from describing images and video clips to answering questions about documents, diagrams, and scientific figures.
Below are two real-world coding examples demonstrating how to run SmolVLM2 2.2B on macOS using Python and Swift.
To run SmolVLM2 2.2B on macOS using Python, you can use the Hugging Face Transformers library installed earlier (the script below does not require mlx-vlm). This example demonstrates how to generate a text description of two images with the model.

Run inference with the following Python script:
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

# Prefer Apple's Metal (MPS) backend on macOS, then CUDA, then the CPU
# Note: bfloat16 on MPS requires a recent PyTorch; use torch.float32 if you hit dtype errors
if torch.backends.mps.is_available():
    DEVICE = "mps"
elif torch.cuda.is_available():
    DEVICE = "cuda"
else:
    DEVICE = "cpu"

# Load images
image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image2 = load_image("https://huggingface.co/spaces/merve/chameleon-7b/resolve/main/bee.jpg")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    torch_dtype=torch.bfloat16,
    # flash_attention_2 requires a CUDA GPU; use the default eager attention otherwise
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Can you describe the two images?"}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])
This script loads two images, initializes the processor and model, prepares chat-style input messages, and generates a text description of both images. It uses the original SmolVLM-Instruct checkpoint from the Hugging Face documentation; the SmolVLM2 2.2B checkpoint can be loaded in the same way with AutoModelForImageTextToText, as shown earlier.
Note that the mlx-vlm dependency (pip install git+https://github.com/pcuenca/mlx-vlm.git@smolvlm) is only required for the MLX command-line workflow described earlier; the Transformers-based script above does not use it.
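If you want to point the same workflow at the SmolVLM2 2.2B checkpoint itself, the sketch below shows one way to do it, assuming the url-style chat messages and AutoModelForImageTextToText loading described in the SmolVLM2 documentation; treat the exact message format as something to verify against the model card for your Transformers version.

import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

# Sketch: the image-description workflow pointed at the SmolVLM2 2.2B checkpoint.
# The url-style chat message follows the SmolVLM2 documentation; verify the exact
# format against the model card for your Transformers version.
model_path = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
device = "mps" if torch.backends.mps.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path, torch_dtype=torch.bfloat16
).to(device)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image",
             "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Can you describe this image?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])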
For macOS users who prefer Swift, you can use the mlx-swift-examples repository to run SmolVLM2 models. This example demonstrates how to generate text based on an image using Swift.

Install Dependencies: First, clone the mlx-swift-examples repository and build the project:

git clone https://github.com/pcuenca/mlx-swift-examples.git
cd mlx-swift-examples
./build.sh

Run Inference: Then use the following command to run inference on an image:

./mlx-run --debug llm-tool \
  --model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
  --prompt "Can you describe this image?" \
  --image https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg \
  --temperature 0.7 --top-p 0.9 --max-tokens 100

This command uses the llm-tool CLI to run inference on the specified image and generate a text description.
These examples demonstrate how to run SmolVLM2 2.2B on macOS using both Python and Swift. The Python example uses the Hugging Face Transformers library for inference, while the Swift example builds on the mlx-swift-examples repository and the MLX runtime.
Running SmolVLM2 2.2B on macOS offers a powerful tool for video understanding. By following the steps outlined above, developers and researchers can harness the capabilities of this model for a wide range of applications.
Future updates may bring optimizations for smaller models, enhanced performance across diverse hardware, and broader support for additional frameworks and languages.