Bringing video understanding to all devices

February 22, 2025

SmolVLM2 represents a fundamental shift in how we think about video understanding: moving from massive models that demand substantial computing resources to efficient models that can run anywhere. Our goal is simple: to make video understanding accessible across all devices and use cases, from phones to servers.

We are releasing models in three sizes (2.2B, 500M, and 256M), MLX-ready (with Python and Swift APIs) from day zero. All models and demos are available in this collection.

Want to try SmolVLM2 right away? Check out the interactive chat interface, where you can test the visual and video understanding capabilities of SmolVLM2 2.2B through a simple, intuitive interface.


Technical details

SmolVLM2 introduces three new models with 256M, 500M, and 2.2B parameters. The 2.2B model is the go-to choice for vision and video tasks, while the 500M and 256M models are the smallest video language models ever released.

Despite their small size, they outperform existing models relative to their memory consumption. On Video-MME (the go-to scientific benchmark for video), SmolVLM2 joins frontier model families in the 2B range and leads the pack in the even smaller space.

Video-MME stands out as a comprehensive benchmark thanks to its broad coverage of diverse video types, varying durations (11 seconds to 1 hour), multiple data modalities (including subtitles and audio), and high-quality expert annotations across 900 videos totaling 254 hours. Learn more here.

SmolVLM2 2.2B: Our New Star Player for Vision and Video

Compared to the previous SmolVLM family, our new 2.2B model is better at solving math problems with images, reading text in photos, understanding complex diagrams, and tackling scientific visual questions. This is reflected in the model's performance across a range of benchmarks.

[Figure: SmolVLM2 vision benchmark score gains]

When it comes to video tasks, 2.2B offers great bang for the buck. Across the various scientific benchmarks we evaluated it on, we want to highlight its performance on Video-MME, where it outperforms all existing 2B models.

We were able to strike a good balance between video and image performance thanks to the data mixture learnings published in Apollo: An Exploration of Video Understanding in Large Multimodal Models.

It's so memory efficient that you can even run it in a free Google Colab.

!pip install git+https://github.com/huggingface/transformers.git

import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_path = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2",
).to("cuda")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "path_to_video.mp4"},
            {"type": "text", "text": "Describe this video in detail."},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

print(generated_texts[0])

Going Even Smaller: Meet the 500M and 256M Video Models

Until today, nobody had dared to release video models this small.

Our new SmolVLM2-500M-Video-Instruct model has video capabilities very close to those of SmolVLM2 2.2B, at a fraction of the size.

And then there's our little experiment, SmolVLM2-256M-Video-Instruct. Think of it as a "what if" project: what if we could push the boundaries of small models even further? We were inspired by what IBM achieved with our base SmolVLM-256M-Instruct a few weeks ago and wanted to see how far we could go with video understanding. It's a more experimental release, but we hope it sparks some creative applications and specialized fine-tuning projects.
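As a rough sketch (not taken from the original post), the smaller checkpoints drop into the same Transformers loading pattern shown above; only the model ID changes. The Hub IDs below are assembled from the names used in this post (the HuggingFaceTB organization plus SmolVLM2-500M-Video-Instruct / SmolVLM2-256M-Video-Instruct), so verify them on the Hub before relying on them.

# Minimal sketch: loading the 500M (or 256M) checkpoint with the same API as the 2.2B example.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_path = "HuggingFaceTB/SmolVLM2-500M-Video-Instruct"  # assumed ID; or ...-256M-Video-Instruct

processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # bfloat16 keeps memory usage low
).to("cuda" if torch.cuda.is_available() else "cpu")

From there, the same apply_chat_template call used in the 2.2B example works unchanged.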

A Suite of SmolVLM2 Demo Applications

To put our vision for small video models into practice, we built three practical applications that demonstrate their versatility.

iPhone Video Understanding

https://www.youtube.com/watch?v=g1yqlhtk_ig

We built an iPhone app that runs SmolVLM2 completely locally. Using the 500M model, users can analyze and understand video content directly on their device, with no cloud required. Interested in building iPhone video-processing apps with AI models running locally? We're releasing it very soon: fill out this form to test it and build with us!

VLC Media Player Integration

https://www.youtube.com/watch?v=nghcfew7dcg

In collaboration with VLC media player, we are integrating SmolVLM2 to provide intelligent video segment descriptions and navigation. This integration lets users search video content semantically and jump directly to relevant sections based on natural-language descriptions. This is a work in progress, but you can try the current playlist-builder prototype in this Space.

Video Highlight Generator

https://www.youtube.com/watch?v=zt2os8eqnki

Available as a Hugging Face Space, this application takes long-form videos (an hour or more) and automatically extracts the most significant moments. We have tested it extensively on soccer matches and other lengthy events, and it has become a powerful tool for content summarization. Try it yourself in the demo Space.

Using SmolVLM2 with Transformers and MLX

SmolVLM2 is available in Transformers and MLX from day zero. In this section, you will find different inference alternatives and tutorials for video and multiple images.

Transformers

The easiest way to run inference with the SmolVLM2 models is through the conversational API: applying the chat template takes care of preparing all inputs automatically.

You can load the model as follows:

!pip install git+https://github.com/huggingface/transformers.git

import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_path = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
device = "cuda"

processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2",
).to(device)

Video Inference

You can pass videos to the chat template with an entry of the form {"type": "video", "path": video_path} in the message content. See below for a complete example.

import torch

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "path_to_video.mp4"},
            {"type": "text", "text": "Describe this video in detail."},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

print(generated_texts[0])

Multiple Image Inference

In addition to video, SmolVLM2 supports multi-image conversations. The same API works through the chat template.

import torch

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What are the differences between these two images?"},
            {"type": "image", "path": "image_1.png"},
            {"type": "image", "path": "image_2.png"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

print(generated_texts[0])

Inference with MLX

To run SmolVLM2 with MLX on Apple Silicon devices using Python, you can use the excellent mlx-vlm library. First, install mlx-vlm from this branch with the following command:

pip install git+https://github.com/pcuenca/mlx-vlm.git@smolvlm

You can then run inference on a single image with the following one-liner:

python -m mlx_vlm.generate \
  --model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
  --image https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg \
  --prompt "Can you describe this image?"

We also created a simple script for video understanding. You can use it as follows:

python -m mlx_vlm.smolvlm_video_generate \
  --model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
  --system "Focus only on describing the key dramatic actions or notable events occurring in this video segment. Skip general context or scene-setting details unless they are crucial to understanding the main action." \
  --prompt "What is happening in this video?" \
  --video /Users/pedro/Downloads/IMG_2855.mov

Note that the system prompt is important for bending the model toward the desired behavior. You can use it, for example, to ask for a description of every scene and transition, or for a one-sentence summary of what is happening.
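For instance, here is a sketch of the same command with a summary-oriented system prompt; the prompt wording is just an illustration, not from the original post:

python -m mlx_vlm.smolvlm_video_generate \
  --model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
  --system "Summarize the entire video segment in a single sentence." \
  --prompt "What is happening in this video?" \
  --video /Users/pedro/Downloads/IMG_2855.mov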

Swift MLX

Swift is supported through the mlx-swift-examples repo, which is what we used to build our iPhone app.

Until an in-progress PR is finalized and merged, you will need to compile the project from this fork. Then you can use the llm-tool CLI on your Mac as follows:

For image inference:

./mlx-run --debug llm-tool \
  --model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
  --prompt "Can you describe this image?" \
  --image https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg \
  --temperature 0.7 --top-p 0.9 --max-tokens 100

Video analysis is also supported, as is providing a system prompt. We found system prompts to be particularly useful for video understanding, since they let us drive the model toward the level of detail we are interested in. Here is a video inference example:

./mlx-run --debug llm-tool \
  --model mlx-community/SmolVLM2-500M-Video-Instruct-mlx \
  --system "Focus only on describing the key dramatic actions or notable events occurring in this video segment. Skip general context or scene-setting details unless they are crucial to understanding the main action." \
  --prompt "What is happening in this video?" \
  --video /Users/pedro/Downloads/IMG_2855.mov \
  --temperature 0.7 --top-p 0.9 --max-tokens 100

If you integrate SmolVLM2 into your apps using MLX and Swift, we'd love to hear about it! Let us know in the comments section below.

Fine-Tuning SmolVLM2

You can fine-tune SmolVLM2 on video using Transformers. We fine-tuned the 500M variant on video-caption pairs from the VideoFeedback dataset. Since the 500M variant is small, it is better to apply full fine-tuning rather than QLoRA or LoRA, although you can try applying QLoRA to the 2.2B variant. You can find the fine-tuning notebook here.
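For readers curious what that QLoRA suggestion looks like in practice, here is a minimal sketch (not the notebook from the post) using peft and bitsandbytes. The target module names are an assumption for illustration and should be matched against the actual layer names in the checkpoint.

# Minimal QLoRA sketch for the 2.2B variant; hyperparameters and target modules are illustrative.
import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit base weights (the "Q" in QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForImageTextToText.from_pretrained(
    "HuggingFaceTB/SmolVLM2-2.2B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption; adjust to the real module names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable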

Read More

We would like to thank Raushan Turganbay, Arthur Zucker, and Pablo Montalvo Leroux for their contributions to porting the model to Transformers.

We are looking forward to seeing everything you build with SmolVLM2! If you'd like to learn more about the SmolVLM family of models, feel free to read the following:

SmolVLM2 – the collection with models and demos
