SmolVLM miniaturization – now available in 256M and 500M models!

January 23, 2025 (Updated: February 13, 2025)

We are pleased to announce two new additions to the SmolVLM family: SmolVLM-256M and SmolVLM-500M. That's right: at 256 million parameters, SmolVLM-256M is the world's smallest vision language model.

We built on everything we learned from SmolVLM 2B, focusing on efficiency, data mixing, and new design tradeoffs. The result is two models that maintain strong multimodal performance in a much smaller footprint.

This release comes with four checkpoints: two base models and two instruction fine-tuned models, at 256M and 500M parameters. These models can be loaded directly into Transformers, MLX, and ONNX, and there are demos for Transformers and WebGPU (using ONNX). All models and demos for this release can be found here.


Overview

  • SmolVLM-256M – the world's smallest VLM!
  • SmolVLM-500M – its 500 million parameter sibling, delivering significantly improved performance while remaining ultra-lightweight.
  • New vision encoder selection – we compared the SigLIP 400M SO encoder (used in SmolVLM 2B and many other large VLMs) against a smaller SigLIP base patch 16/512. Surprisingly, the larger encoder gave only slightly better results, so we opted for the 93M-parameter SigLIP base patch 16/512 for these new releases.
  • Larger image resolution – our smaller vision encoder processes images at a higher resolution (inspired by Apple's VLM research and Google's PaliGemma), which gives a clearer understanding of images with minimal overhead.
  • Training optimization – a new tokenization trick made training losses look worse on paper, but it significantly improved real-world benchmarks.

With these releases we reach size parity with the SmolLM2 family (135M, 360M, 1.7B), giving us a complete set of small LLM + VLM combos.

Why downsize?

When we released SmolVLM 2B, the community response was amazing: the model is extremely lightweight, open source with a permissive license, and easy to integrate into existing workflows. But we wanted to push this approach further for constrained devices, consumer laptops, and even browser-based inference. That's where our new 256M and 500M models come in. For those looking to process huge amounts of data, these models can also run at a fraction of the cost of the 2B model.

Last year we trained two 80B VLMs and scaled them down to 8B. Then SmolVLM took on the challenge of shrinking that to 2B. What we learned is that we could push the frontier even further, and we are happy to show that the 256M and 500M models also deliver good performance. Our new 256M model is the smallest VLM ever released, yet it outperforms the Idefics 80B model released only 17 months ago.

[Benchmark chart]

Introducing the 256 million parameter giant

With just 256 million parameters, this model is the smallest VLM to date. Despite its small size, it packs a surprising punch and is proficient in many multimodal tasks, including:

  • Captioning: describe an image or short video.
  • Document Q&A: answer questions about PDFs or scanned text.
  • Basic visual reasoning: answer questions about charts and diagrams.

Step up: 500M

If you want more performance headroom while keeping memory usage low, SmolVLM-500M is the 500 million parameter compromise. Although significantly smaller than the previous 2B release, it scores closer to the larger models on tasks like DocVQA and MMMU. We also found this model to be more robust to prompting, making it suitable for production out of the box. Both models also perform very well when fine-tuned.

In the graph below, we visualize the throughput improvement across different batch sizes; these numbers are from throughput benchmarks run on an A100.

[Chart: throughput across batch sizes on an A100]
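If you want to reproduce a rough version of this measurement on your own hardware, the sketch below times batched greedy generation at a few batch sizes. It is only a minimal sketch under our own assumptions (the checkpoint name, test image URL, prompt, and max_new_tokens are illustrative), not the script behind the chart above.

# A minimal throughput sketch; results depend heavily on hardware and settings.
import time

import requests
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
ckpt = "HuggingFaceTB/SmolVLM-500M-Instruct"

processor = AutoProcessor.from_pretrained(ckpt)
model = AutoModelForVision2Seq.from_pretrained(ckpt, torch_dtype=torch.bfloat16).to(device)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vlm_example.jpg"
image = Image.open(requests.get(url, stream=True).raw)

messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Describe this image."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

max_new = 128
for batch_size in (1, 4, 16):
    # Repeat the same prompt/image to fill the batch.
    inputs = processor(text=[prompt] * batch_size,
                       images=[[image]] * batch_size,
                       return_tensors="pt").to(device)
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=max_new, do_sample=False)
    elapsed = time.perf_counter() - start
    print(f"batch {batch_size:>2}: ~{batch_size * max_new / elapsed:.1f} generated tokens/s")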

What has changed since SmolVLM 2B?

1. Vision encoder selection

Previously, we used the standard SigLIP 400M SO vision backbone, the same one found in many VLM architectures. For these small models, we experimented with two setups:

  • SigLIP 400M SO: higher capacity, better performance.
  • SigLIP base patch 16/512 (93M): much smaller, surprisingly close performance.

At the 256M and 500M scales, we found that the performance gap between the two encoders was not large enough to justify the heavier one, so we went with the smaller vision encoder. As a bonus, the smaller encoder processes images at a higher resolution, which (according to research from Apple and Google) often yields better visual understanding without increasing the parameter count.
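To get a concrete feel for the size difference between the two encoders, a small comparison like the one below can be run with transformers. The checkpoint IDs are our assumption about the public SigLIP releases and are not taken from the post itself.

from transformers import SiglipVisionModel

# Assumed public checkpoints for the two vision encoders discussed above.
candidates = {
    "SigLIP 400M SO (patch 14/384)": "google/siglip-so400m-patch14-384",
    "SigLIP base (patch 16/512)": "google/siglip-base-patch16-512",
}

for name, ckpt in candidates.items():
    vision = SiglipVisionModel.from_pretrained(ckpt)
    n_params = sum(p.numel() for p in vision.parameters()) / 1e6
    cfg = vision.config
    print(f"{name}: {n_params:.0f}M params, "
          f"{cfg.image_size}x{cfg.image_size} input, patch size {cfg.patch_size}")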

2. Updated data mix

As with previous releases, we rely on The Cauldron and Docmatix, and we add MathWriting to the mix.

[Chart: training data mix]

The dataset proportions were adjusted to emphasize document understanding (41%) and image captioning (14%), while keeping a balanced focus on other important areas such as visual reasoning, chart and diagram understanding, and general instruction following. With this update, the models build on strong document understanding and leave the door open to fine-tuning for specific tasks.
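One common way to realize proportions like these is probability-weighted interleaving of the source datasets. The sketch below shows the general pattern with the Hugging Face datasets library; the toy datasets and the exact weights are placeholders for illustration, not the actual training mix.

from datasets import Dataset, interleave_datasets

# Toy stand-ins for the real sources (The Cauldron, Docmatix, MathWriting).
doc_understanding = Dataset.from_dict({"task": ["document"] * 1000})
captioning = Dataset.from_dict({"task": ["caption"] * 1000})
other = Dataset.from_dict({"task": ["other"] * 1000})

# Sample examples according to the target proportions.
mix = interleave_datasets(
    [doc_understanding, captioning, other],
    probabilities=[0.41, 0.14, 0.45],
    seed=0,
    stopping_strategy="all_exhausted",
)
print(mix[:10]["task"])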

3. Tokenization optimization

We've added more pixel shuffle! The new models encode images at 4096 pixels per token, compared to 1820 pixels per token for the 2B model.
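For readers unfamiliar with the trick: pixel shuffle (space-to-depth) trades spatial positions for channels, so the same image is covered by fewer, wider tokens. The function below is a generic illustration of the operation, not the exact SmolVLM implementation; assuming 16-pixel patches at 512 px resolution and a ratio-4 shuffle, each remaining token covers 64x64 = 4096 pixels, consistent with the figure above.

import torch

def pixel_shuffle(x: torch.Tensor, ratio: int) -> torch.Tensor:
    """Space-to-depth: merge each ratio x ratio block of visual tokens into one.

    x has shape (batch, height, width, channels); the result has shape
    (batch, height // ratio, width // ratio, channels * ratio**2).
    """
    b, h, w, c = x.shape
    x = x.reshape(b, h // ratio, ratio, w // ratio, ratio, c)
    x = x.permute(0, 1, 3, 2, 4, 5)
    return x.reshape(b, h // ratio, w // ratio, c * ratio * ratio)

# A 512x512 image with 16x16 patches gives a 32x32 grid of visual features.
features = torch.randn(1, 32, 32, 768)
shuffled = pixel_shuffle(features, ratio=4)
print(features.shape[1] * features.shape[2], "tokens before")  # 1024
print(shuffled.shape[1] * shuffled.shape[2], "tokens after")   # 64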

To further optimize tokenization, we added special tokens to represent the sub-image separators more efficiently. Each separator string is now mapped to a single token instead of seven. This significantly improved the stability of the model during training and the quality of the results. More information can be found in this LinkedIn post.
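Mechanically, this amounts to registering the separator strings as special tokens so the tokenizer emits a single id for each. The snippet below illustrates the idea; the separator string and the base tokenizer are placeholders we chose for the example, and a real training setup would also resize the model's embeddings.

from transformers import AutoTokenizer

# Any tokenizer works for illustration; SmolVLM ships with its own.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")

separator = "<row_1_col_1>"  # hypothetical sub-image separator string
before = tokenizer(separator, add_special_tokens=False)["input_ids"]
print("tokens before:", len(before))

tokenizer.add_special_tokens({"additional_special_tokens": [separator]})
after = tokenizer(separator, add_special_tokens=False)["input_ids"]
print("tokens after:", len(after))  # 1

# When training, grow the embedding table to match the new vocabulary:
# model.resize_token_embeddings(len(tokenizer))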

4. Completing the SmolLM2-SmolVLM family

SmolLM2 is available in three sizes: 135M, 360M, and 1.7B. With the two models we’re releasing today, we have a complete set of small LLM + VLM combos.

Smaller multimodal retrieval models: ColSmolVLM 256M and 500M

We also found SmolVLM surprisingly easy to tweak and experiment with. The team behind ColBERT-style retrieval models trained ColSmolVLM, achieving state-of-the-art multimodal retrieval speeds with performance comparable to models 10 times its size. SmolVLM makes it faster and cheaper to build searchable databases, and we think the 256M model has the potential to become a great model specialized for many tasks. Find links on how to use ColSmolVLM with the new SmolVLM models in the Next steps section.

[Benchmark chart]
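ColBERT-style retrievers use late interaction: the query and each document page keep per-token embeddings, and a page's score is the sum over query tokens of their best-matching page token. The sketch below shows that scoring rule in plain PyTorch with random embeddings; actual ColSmolVLM usage goes through the retrieval tooling linked in the Next steps section.

import torch
import torch.nn.functional as F

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) score between one query and one page.

    query_emb: (num_query_tokens, dim), page_emb: (num_page_tokens, dim),
    both L2-normalized.
    """
    sims = query_emb @ page_emb.T          # cosine similarity per token pair
    return sims.max(dim=-1).values.sum()   # best page token for each query token

# Toy example: rank two "pages" against one query.
torch.manual_seed(0)
query = F.normalize(torch.randn(16, 128), dim=-1)
pages = [F.normalize(torch.randn(300, 128), dim=-1) for _ in range(2)]
print("page scores:", [round(float(maxsim_score(query, p)), 3) for p in pages])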

SmolDocling

We partnered with IBM to build models for Docling. Early results with the 256M model are impressive. Below are some of the early examples they shared with us. Please stay tuned for further information.

[Early example outputs]

Using a smaller SmolVLM

The new SmolVLM models work out of the box with the older SmolVLM code, so you can use transformers and MLX for inference and fine-tuning, and TRL for tuning. 🚀 This release also comes with ONNX checkpoints.

You can get started with SmolVLM in transformers using the snippet below.

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-500M-Instruct",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if device == "cuda" else "eager",
).to(device)

# Load the image you want to ask about (any PIL image works).
image = Image.open("example.jpg")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Can you describe this image?"},
        ],
    },
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device)

generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])

Run the following CLI command to use SmolVLM with MLX.

python3 -m mlx_vlm.generate --model HuggingFaceTB/SmolVLM-500M-Instruct --max-tokens 400 --temp 0.0 --image https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vlm_example.jpg --prompt "What is in this image?"


You can try the WebGPU demos of SmolVLM-256M-Instruct and SmolVLM-500M-Instruct.

Find links to fine-tuning and multimodal RAG with ColSmolVLM in the Next steps section.

Next steps

Thanks to the ViDoRe team (Tony Wu, Manuel Faysse) for training ColSmolVLM, to Joshua Lochner for the ONNX conversion and WebGPU demos, and to Vaibhav Srivastav for his help with this release.
