SmolVLM2 represents a fundamental shift in how we think about video understanding: moving from massive models that require substantial computing resources to efficient models that can run anywhere. Our goal is simple: make video understanding accessible across all devices and use cases, from phones to servers.
We are releasing the models in three sizes (2.2B, 500M and 256M), MLX ready (Python and Swift APIs), from day zero. All models and demos are available in this collection.
Want to try SmolVLM2 right away? Check out our interactive chat interface, which lets you test the visual and video understanding capabilities of SmolVLM2 2.2B through a simple, intuitive interface.
Table of Contents
Technical details
We are introducing three new models with 256M, 500M and 2.2B parameters. The 2.2B model is the go-to choice for vision and video tasks, while the 500M and 256M models represent the smallest video language models ever released.
While smaller in size, they outperform existing models per unit of memory consumption. On Video-MME (the go-to scientific benchmark for video), SmolVLM2 joins frontier model families in the 2B range and leads the pack in the even smaller space.
Video-MME stands out as a comprehensive benchmark thanks to its broad coverage of diverse video types, varying durations (11 seconds to 1 hour), multiple data modalities (including subtitles and audio), and high-quality manual annotations spanning 254 hours across 900 videos. Click here for details.
SmolVLM2 2.2B: Our New Star Player for Vision and Video
Compared to the previous SmolVLM family, our new 2.2B model is better at solving math problems with images, reading text in photos, understanding complex diagrams, and tackling scientific visual questions. This shows in the model's performance across various benchmarks.
When it comes to video tasks, the 2.2B model is a good bang for the buck. Across the various scientific benchmarks we evaluated it on, we want to highlight its performance on Video-MME, where it outperforms all existing 2B models.
We were able to achieve a good balance between video and image performance thanks to the data mixture learnings published in Apollo: An Exploration of Video Understanding in Large Multimodal Models.
It is so memory efficient that you can even run it in a free Google Colab.
```python
# Install transformers from the main branch
!pip install git+https://github.com/huggingface/transformers.git

import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_path = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2"
).to("cuda")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "path_to_video.mp4"},
            {"type": "text", "text": "Describe this video in detail"}
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])
```
Going Even Smaller: Meet the 500M and 256M Video Models
Until today, nobody dared to release video models this small.
Our new SmolVLM2-500M-Video-Instruct model has video capabilities very close to those of SmolVLM2 2.2B, at a fraction of the size.
And then there's our little experiment, SmolVLM2-256M-Video-Instruct. Think of it as our "what if" project: what if we could push the boundaries of small models even further? Taking inspiration from what IBM achieved with our base SmolVLM-256M-Instruct a few weeks ago, we wanted to see how far we could go with video understanding. It's more of an experimental release, but we hope it inspires some creative applications and specialized fine-tuning projects.
A Suite of SmolVLM2 Demo Applications
To demonstrate our vision for small video models, we built three practical applications that showcase the versatility of these models.
iPhone Video Understanding
https://www.youtube.com/watch?v=g1yqlhtk_ig
We've created an iPhone app that runs SmolVLM2 completely locally. Using the 500M model, users can analyze and understand video content directly on their device, no cloud required. Interested in building an iPhone video processing app with AI models running locally? We're releasing it very soon: fill out this form to test it and build with us!
VLC Media Player Integration
https://www.youtube.com/watch?v=nghcfew7dcg
Working in collaboration with VLC media player, we are integrating SmolVLM2 to provide intelligent video segment descriptions and navigation. This integration lets users semantically search video content and jump directly to relevant sections based on natural language descriptions. While this is work in progress, you can try the current playlist builder prototype in this space.
Video Highlight Generator
https://www.youtube.com/watch?v=zt2os8eqnki
Available as a Hugging Face Space, this application takes long-form videos (over an hour) and automatically extracts the most significant moments. We've tested it extensively on soccer matches and other lengthy events, making it a powerful tool for content summarization. Try it yourself in our demo space.
Using SmolVLM2 with Transformers and MLX
We are making SmolVLM2 available to use with transformers and MLX from day zero. In this section, you can find different inference alternatives and tutorials for video and multiple images.
Transformers
The easiest way to run inference with the SmolVLM2 models is through the conversational API: applying the chat template takes care of preparing all inputs automatically.
You can load the model as follows:
```python
!pip install git+https://github.com/huggingface/transformers.git

import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

# model_path can be any SmolVLM2 checkpoint, e.g. "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2"
).to(device)  # e.g. "cuda"
```
Video Inference
You can pass videos to the chat template by passing {"type": "video", "path": {video_path}}. See below for a complete example.
```python
import torch

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "path_to_video.mp4"},
            {"type": "text", "text": "Describe this video in detail"}
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])
```
Multiple Image Inference
In addition to video, SmolVLM2 supports multi-image conversations. You can use the same API through the chat template.
```python
import torch

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the difference between these two images?"},
            {"type": "image", "path": "image_1.png"},
            {"type": "image", "path": "image_2.png"}
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])
```
Inference with MLX
To run SmolVLM2 with MLX on an Apple Silicon device using Python, you can use the excellent mlx-vlm library. First, install mlx-vlm from this branch with the following command:
```bash
pip install git+https://github.com/pcuenca/mlx-vlm.git@smolvlm
```
You can then run inference on a single image with the following one-liner:
```bash
python -m mlx_vlm.generate \
  --model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
  --image https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg \
  --prompt "Can you describe this image?"
```
We also created a simple script for video understanding. You can use it as follows:
```bash
python -m mlx_vlm.smolvlm_video_generate \
  --model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
  --system "Focus only on describing the key dramatic actions or notable events occurring in this video segment. Skip general context or scene-setting details unless they are crucial to understanding the main action." \
  --prompt "What is happening in this video?" \
  --video /users/pedro/downloads/img_2855.mov
```
Note that the system prompt is important for bending the model to the desired behavior. You can use it, for example, to ask the model to describe every scene and transition, or to provide a one-sentence summary of what is happening.
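For instance, here is a minimal variation of the command above that swaps in a summary-style system prompt; the prompt wording is our own illustration, not a recommended default:

```bash
python -m mlx_vlm.smolvlm_video_generate \
  --model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
  --system "Provide a one-sentence summary of what happens in this video segment." \
  --prompt "What is happening in this video?" \
  --video /users/pedro/downloads/img_2855.mov
```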
Swift MLX
The Swift language is also supported through the mlx-swift-examples repo, which is what we used to build our iPhone app.
Until our in-progress PR is finalized and merged, you will need to compile the project from this fork. You can then use the llm-tool CLI on your Mac as follows.
For image inference:
```bash
./mlx-run --debug llm-tool \
  --model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
  --prompt "Can you describe this image?" \
  --image https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg \
  --temperature 0.7 --top-p 0.9 --max-tokens 100
```
Video analysis is also supported, as is providing a system prompt. We found system prompts to be particularly helpful for video understanding, driving the model toward the level of detail we are interested in. Here is a video inference example:
```bash
./mlx-run --debug llm-tool \
  --model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
  --system "Focus only on describing the key dramatic actions or notable events occurring in this video segment. Skip general context or scene-setting details unless they are crucial to understanding the main action." \
  --prompt "What is happening in this video?" \
  --video /users/pedro/downloads/img_2855.mov \
  --temperature 0.7 --top-p 0.9 --max-tokens 100
```
If you integrate SmolVLM2 into your apps using MLX and Swift, we'd love to know about it! Please drop us a note in the comments section below!
Fine-tuning SmolVLM2
You can fine-tune SmolVLM2 on videos using transformers. We fine-tuned the 500M variant on video-caption pairs from the VideoFeedback dataset. Since the 500M variant is small, it is better to apply full fine-tuning instead of QLoRA or LoRA, while you can try applying QLoRA to the 2.2B variant. Check out the fine-tuning notebook here.
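To make that distinction concrete, here is a minimal sketch, not the exact code from our notebook, of the two setups: full fine-tuning for the 500M model and QLoRA for the 2.2B model. The LoRA hyperparameters and target modules are illustrative assumptions; adjust them to your hardware and data, and plug the resulting model into your usual training loop or Trainer.

```python
import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Full fine-tuning: the 500M model is small enough to update all of its weights.
model_500m = AutoModelForImageTextToText.from_pretrained(
    "HuggingFaceTB/SmolVLM2-500M-Video-Instruct",
    torch_dtype=torch.bfloat16,
)
model_500m.train()  # all parameters remain trainable

# QLoRA: load the 2.2B model in 4-bit and train only low-rank adapters.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_2b = AutoModelForImageTextToText.from_pretrained(
    "HuggingFaceTB/SmolVLM2-2.2B-Instruct",
    quantization_config=bnb_config,
)
lora_config = LoraConfig(
    r=8,                          # illustrative rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules="all-linear",  # assumption: adapt every linear projection
    task_type="CAUSAL_LM",
)
model_2b = get_peft_model(model_2b, lora_config)
model_2b.print_trainable_parameters()
```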
Read More
We would like to thank Raushan Turganbay, Arthur Zucker, and Pablo Montalvo Leroux for their contributions of the model to transformers.
We are looking forward to seeing everything you build with SmolVLM2! If you would like to learn more about the SmolVLM family of models, feel free to read the following:
SmolVLM2 - Collection with Models and Demos