SmolVLM2 represents a fundamental shift in how we think about video understanding: moving from massive models that require substantial computing resources to efficient models that can run anywhere. Our goal is simple: make video understanding accessible across all devices and use cases, from phones to servers.
We are releasing the models in three sizes (2.2B, 500M and 256M), MLX ready (Python and Swift APIs), from day zero. All models and demos are available in this collection.
Want to try SmolVLM2 right away? Check out our interactive chat interface, which lets you test the visual and video understanding capabilities of SmolVLM2 2.2B through a simple, intuitive interface.
Table of Contents
Technical details
We are introducing three new models with 256M, 500M and 2.2B parameters. The 2.2B model is the go-to choice for vision and video tasks, while the 500M and 256M models represent the smallest video language models ever released.
While smaller in size, they outperform existing models per unit of memory consumption. On Video-MME (the go-to scientific benchmark for video), SmolVLM2 joins frontier model families in the 2B range and leads the pack in the even smaller space.
Video-MME stands out as a comprehensive benchmark thanks to its broad coverage of diverse video types, varying durations (11 seconds to 1 hour), multiple data modalities (including subtitles and audio), and high-quality manual annotations spanning 254 hours across 900 videos. Click here for details.
SmolVLM2 2.2B: Our New Star Player for Vision and Video
Compared to the previous SmolVLM family, our new 2.2B model is better at solving math problems with images, reading text in photos, understanding complex diagrams, and tackling scientific visual questions. This shows in the model's performance across various benchmarks.
When it comes to video tasks, the 2.2B model is a good bang for the buck. Across the various scientific benchmarks we evaluated it on, we want to highlight its performance on Video-MME, where it outperforms all existing 2B models.
We were able to achieve a good balance between video and image performance thanks to the data mixture learnings published in Apollo: An Exploration of Video Understanding in Large Multimodal Models.
It is so memory efficient that you can even run it in a free Google Colab.
```python
# Install transformers from the main branch
!pip install git+https://github.com/huggingface/transformers.git

import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_path = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2"
).to("cuda")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "path_to_video.mp4"},
            {"type": "text", "text": "Describe this video in detail"}
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])
```
Going Even Smaller: Meet the 500M and 256M Video Models
Until today, nobody dared to release video models this small.
Our new SmolVLM2-500M-Video-Instruct model has video capabilities very close to those of SmolVLM2 2.2B, at a fraction of the size.
And then there's our little experiment, SmolVLM2-256M-Video-Instruct. Think of it as our "what if" project: what if we could push the boundaries of small models even further? Taking inspiration from what IBM achieved with our base SmolVLM-256M-Instruct a few weeks ago, we wanted to see how far we could go with video understanding. It's more of an experimental release, but we hope it inspires some creative applications and specialized fine-tuning projects.
A Suite of SmolVLM2 Demo Applications
To demonstrate our vision for small video models, we built three practical applications that showcase the versatility of these models.
iPhone Video Understanding
https://www.youtube.com/watch?v=g1yqlhtk_ig
We've created an iPhone app that runs SmolVLM2 completely locally. Using the 500M model, users can analyze and understand video content directly on their device, no cloud required. Interested in building an iPhone video processing app with AI models running locally? We're releasing it very soon: fill out this form to test it and build with us!
VLC Media Player Integration
https://www.youtube.com/watch?v=nghcfew7dcg
Working in collaboration with VLC media player, we are integrating SmolVLM2 to provide intelligent video segment descriptions and navigation. This integration lets users semantically search video content and jump directly to relevant sections based on natural language descriptions. While this is work in progress, you can try the current playlist builder prototype in this space.
Video Highlight Generator
https://www.youtube.com/watch?v=zt2os8eqnki
Available as a Hugging Face Space, this application takes long-form videos (over an hour) and automatically extracts the most significant moments. We've tested it extensively on soccer matches and other lengthy events, making it a powerful tool for content summarization. Try it yourself in our demo space.
Using SmolVLM2 with Transformers and MLX
We are making SmolVLM2 available to use with transformers and MLX from day zero. In this section, you can find different inference alternatives and tutorials for video and multiple images.
Transformers
The easiest way to run inference with the SmolVLM2 models is through the conversational API: applying the chat template takes care of preparing all inputs automatically.
You can load the model as follows:
```python
!pip install git+https://github.com/huggingface/transformers.git

import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

# model_path can be any SmolVLM2 checkpoint, e.g. "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2"
).to(device)  # e.g. "cuda"
```
Video Inference
You can pass videos to the chat template by passing {"type": "video", "path": {video_path}}. See below for a complete example.
```python
import torch

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "path_to_video.mp4"},
            {"type": "text", "text": "Describe this video in detail"}
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])
```
Multiple Image Inference
In addition to video, SmolVLM2 supports multi-image conversations. You can use the same API through the chat template.
```python
import torch

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the difference between these two images?"},
            {"type": "image", "path": "image_1.png"},
            {"type": "image", "path": "image_2.png"}
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])
```
Inference with MLX
To run SmolVLM2 with MLX on an Apple Silicon device using Python, you can use the excellent mlx-vlm library. First, install mlx-vlm from this branch with the following command:
```bash
pip install git+https://github.com/pcuenca/mlx-vlm.git@smolvlm
```
You can then run inference on a single image with the following one-liner:
```bash
python -m mlx_vlm.generate \
  --model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
  --image https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg \
  --prompt "Can you describe this image?"
```
We also created a simple script for video understanding. You can use it as follows:
```bash
python -m mlx_vlm.smolvlm_video_generate \
  --model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
  --system "Focus only on describing the key dramatic actions or notable events occurring in this video segment. Skip general context or scene-setting details unless they are crucial to understanding the main action." \
  --prompt "What is happening in this video?" \
  --video /users/pedro/downloads/img_2855.mov
```
Note that the system prompt is important for bending the model to the desired behavior. You can use it, for example, to ask the model to describe every scene and transition, or to provide a one-sentence summary of what is happening.
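For instance, here is a minimal variation of the command above that swaps in a summary-style system prompt; the prompt wording is our own illustration, not a recommended default:

```bash
python -m mlx_vlm.smolvlm_video_generate \
  --model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
  --system "Provide a one-sentence summary of what happens in this video segment." \
  --prompt "What is happening in this video?" \
  --video /users/pedro/downloads/img_2855.mov
```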
Swift MLX
The Swift language is also supported through the mlx-swift-examples repo, which is what we used to build our iPhone app.
Until our in-progress PR is finalized and merged, you will need to compile the project from this fork. You can then use the llm-tool CLI on your Mac as follows.
For image inference:
```bash
./mlx-run --debug llm-tool \
  --model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
  --prompt "Can you describe this image?" \
  --image https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg \
  --temperature 0.7 --top-p 0.9 --max-tokens 100
```
Video analysis is also supported, as is providing a system prompt. We found system prompts to be particularly helpful for video understanding, driving the model toward the level of detail we are interested in. Here is a video inference example:
```bash
./mlx-run --debug llm-tool \
  --model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
  --system "Focus only on describing the key dramatic actions or notable events occurring in this video segment. Skip general context or scene-setting details unless they are crucial to understanding the main action." \
  --prompt "What is happening in this video?" \
  --video /users/pedro/downloads/img_2855.mov \
  --temperature 0.7 --top-p 0.9 --max-tokens 100
```
If you integrate SmolVLM2 into your apps using MLX and Swift, we'd love to know about it! Please drop us a note in the comments section below!
Fine-tuning SmolVLM2
You can fine-tune SmolVLM2 on videos using transformers. We fine-tuned the 500M variant on video-caption pairs from the VideoFeedback dataset. Since the 500M variant is small, it is better to apply full fine-tuning instead of QLoRA or LoRA, while you can try applying QLoRA to the 2.2B variant. Check out the fine-tuning notebook here.
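To make that distinction concrete, here is a minimal sketch, not the exact code from our notebook, of the two setups: full fine-tuning for the 500M model and QLoRA for the 2.2B model. The LoRA hyperparameters and target modules are illustrative assumptions; adjust them to your hardware and data, and plug the resulting model into your usual training loop or Trainer.

```python
import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Full fine-tuning: the 500M model is small enough to update all of its weights.
model_500m = AutoModelForImageTextToText.from_pretrained(
    "HuggingFaceTB/SmolVLM2-500M-Video-Instruct",
    torch_dtype=torch.bfloat16,
)
model_500m.train()  # all parameters remain trainable

# QLoRA: load the 2.2B model in 4-bit and train only low-rank adapters.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_2b = AutoModelForImageTextToText.from_pretrained(
    "HuggingFaceTB/SmolVLM2-2.2B-Instruct",
    quantization_config=bnb_config,
)
lora_config = LoraConfig(
    r=8,                          # illustrative rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules="all-linear",  # assumption: adapt every linear projection
    task_type="CAUSAL_LM",
)
model_2b = get_peft_model(model_2b, lora_config)
model_2b.print_trainable_parameters()
```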
Read More
We would like to thank Raushan Turganbay, Arthur Zucker, and Pablo Montalvo Leroux for their contributions of the model to transformers.
We are looking forward to seeing everything you build with SmolVLM2! If you would like to learn more about the SmolVLM family of models, feel free to read the following:
SmolVLM2 - Collection with Models and Demos