We are excited to announce two new additions to the SmolVLM family: SmolVLM-256M and SmolVLM-500M. That's right: the smaller of the two has just 256 million parameters, making it the smallest vision language model in the world.
We built on everything we learned from SmolVLM 2B, focusing on efficiency, data mixtures, and new design tradeoffs. The result is two models that maintain strong multimodal performance in a much smaller footprint.
This release comes with four checkpoints: two base models and two instruction fine-tuned models, at 256M and 500M parameters. These models can be loaded directly into Transformers, MLX, and ONNX, and there are demos for Transformers and WebGPU (using ONNX). All models and demos for this release can be found here.
Table of Contents
Overview
- SmolVLM-256M – The world's smallest VLM!
- SmolVLM-500M – Its 500-million-parameter sibling delivers significantly improved performance while remaining ultra-lightweight.
- New vision encoder selection – We compared the SigLIP 400M SO (used in SmolVLM 2B and many other large VLMs) to the smaller SigLIP base patch-16/512. Surprisingly, the larger encoder gave only slightly better results, so we opted for the 93M-parameter SigLIP base patch-16/512 for these new releases.
- Larger image resolution – Our miniature vision encoder processes images at a larger resolution (inspired by Apple's VLM research and Google's PaliGemma). This allows for a clearer understanding of the image with minimal overhead.
- Training optimization – A new tokenization trick made training losses look worse on paper, but it significantly improved real-world benchmarks.
We are now reaching model parity with the SmolLM2 family (135M, 360M, 1.7B), so we have a complete set of smaller LLM + VLM combos.
Why downsize?
When we released SmolVLM 2B, the community response was amazing: the model is extremely lightweight, open source, permissively licensed, and easy to integrate into existing workflows. But we wanted to take this approach even further for constrained devices, consumer laptops, and browser-based inference. That's where our new 256M and 500M models come in. On the flip side, for those looking to process huge amounts of data, these models run at a fraction of the cost of the 2B model.
Last year we trained two 80B VLMs and shrank them down to 8B. Then, with SmolVLM, we took on the challenge of reducing that to 2B. What we learned is that we can push the frontier even further: we are happy to show that 256M and 500M parameters also deliver good performance. Our new 256M model is the smallest VLM ever released, yet it outperforms our Idefics 80B model from only 17 months ago.
Introducing the 256 million parameter giant
With just 256 million parameters, this model is the smallest VLM to date. Despite its small size, it packs a surprising punch and is proficient in many multimodal tasks, including:
- Captioning: Describing an image or short video.
- Document Q&A: Answering questions about PDFs or scanned text.
- Basic visual reasoning: Answering questions about charts and diagrams.
Step up: 500M
If you want more performance headroom while keeping memory usage low, SmolVLM-500M is our 500-million-parameter compromise. Although significantly smaller than the previous 2B release, it scores closer to the larger models on tasks like DocVQA and MMMU. We also found this model to be more robust to prompt variations, making it suitable for production right away. That said, both models perform great when fine-tuned.
The graph below visualizes the throughput improvement across different batch sizes; these benchmarks were run on an A100.
What has changed since SmolVLM 2B?
1. Vision Encoder Selection
Previously, we used the standard SigLIP 400M SO vision backbone, the same one found in many VLM architectures. For these smaller models, we experimented with two setups:
- SigLIP 400M SO: higher capacity, better performance.
- SigLIP base patch-16/512 (93M): much smaller, surprisingly close performance.
We found that the performance gap between the two encoders was not large enough to justify the heavier one for the 256M and 500M models, so we decided to go small on the vision encoder as well. As a bonus, the smaller encoder processes images at a higher resolution, which (according to research from Apple and Google) often allows for better visual understanding without increasing the parameter count.
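For a concrete sense of the size difference, here is a quick sketch that compares the two candidate encoders by parameter count. It uses the public SigLIP checkpoints on the Hub (the hub ids below are the generic SigLIP releases, chosen for illustration, not SmolVLM-specific weights):

```python
from transformers import SiglipVisionModel

# Load only the vision towers of the two candidate encoders.
small = SiglipVisionModel.from_pretrained("google/siglip-base-patch16-512")
large = SiglipVisionModel.from_pretrained("google/siglip-so400m-patch14-384")

# Compare their parameter counts (computed here, not quoted from the post).
for name, model in [("base-patch16-512", small), ("so400m-patch14-384", large)]:
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```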
2. Data Mixture Update
As with previous releases, we rely on The Cauldron and Docmatix, and we have added MathWriting to the mix.
The dataset proportions have also been adjusted to place a stronger emphasis on document understanding (41%) and image captioning (14%), while maintaining a balanced focus on other essential areas such as visual reasoning, chart and diagram understanding, and general instruction following. With this update, the models build on strong document understanding, opening the door to fine-tuning that tailors them to specific tasks.
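If you want to inspect these training components yourself, they are available on the Hugging Face Hub. A minimal sketch, assuming the public hub id HuggingFaceM4/the_cauldron and its docvqa subset (the subset name is chosen here as an example, not quoted from the post):

```python
from datasets import load_dataset

# Stream one document-understanding component of the mixture for a quick look.
docvqa = load_dataset(
    "HuggingFaceM4/the_cauldron", "docvqa", split="train", streaming=True
)

sample = next(iter(docvqa))
print(sample.keys())  # images plus user/assistant-style text turns
```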
3. Tokenization Optimization
We've added more pixel shuffle! The new models encode images at 4096 pixels per token, compared to 1820 pixels per token for the 2B model.
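For intuition, pixel shuffle trades the spatial resolution of the visual feature grid for channel depth, so fewer image tokens reach the language model. Below is a minimal PyTorch sketch of the idea; the shapes and shuffle ratio are illustrative and not taken verbatim from the released models' code.

```python
import torch


def pixel_shuffle(x: torch.Tensor, ratio: int) -> torch.Tensor:
    """Merge each ratio x ratio block of visual tokens into one wider token.

    x: (batch, seq, dim) patch features from the vision encoder, where seq is a
    perfect square. Returns (batch, seq / ratio**2, dim * ratio**2).
    """
    bsz, seq, dim = x.shape
    side = int(seq ** 0.5)  # patches per image side
    x = x.view(bsz, side, side, dim)
    x = x.view(bsz, side, side // ratio, dim * ratio)
    x = x.permute(0, 2, 1, 3)
    x = x.reshape(bsz, side // ratio, side // ratio, dim * ratio * ratio)
    x = x.permute(0, 2, 1, 3)
    return x.reshape(bsz, seq // ratio**2, dim * ratio**2)


# A 512x512 image with 16x16 patches yields 32*32 = 1024 patch tokens;
# a shuffle ratio of 4 leaves 64 tokens, i.e. 512*512 / 64 = 4096 pixels per token.
features = torch.randn(1, 1024, 768)
print(pixel_shuffle(features, ratio=4).shape)  # torch.Size([1, 64, 12288])
```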
To further optimize tokenization, we added special tokens to represent the sub-image separators more efficiently. This means that a separator string that was previously mapped to 7 tokens is now mapped to a single token. This significantly improved the stability of the model during training and the quality of the results. More information can be found in this LinkedIn post.
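Here is a minimal sketch of the mechanism, using a made-up separator string and the SmolLM2 tokenizer purely for illustration (the real separator strings differ):

```python
from transformers import AutoTokenizer

# Hypothetical separator string; the actual sub-image separators are not shown here.
separator = "<row_1_col_1>"

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M")
print(len(tokenizer.tokenize(separator)))  # the raw string splits into several tokens

# Registering it as an additional special token makes it a single vocabulary entry.
tokenizer.add_special_tokens({"additional_special_tokens": [separator]})
print(len(tokenizer.tokenize(separator)))  # 1

# When training, the model's embedding matrix must grow to match:
# model.resize_token_embeddings(len(tokenizer))
```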
4. Completing the SmolLM2-SmolVLM Family
SmolLM2 is available in three sizes: 135M, 360M, and 1.7B. With the two models we’re releasing today, we have a complete set of small LLM + VLM combos.
Smaller multimodal retrieval: ColSmolVLM 256M and 500M
We also found these models surprisingly easy to fine-tune and experiment with. The team behind ColBERT-like retrieval models trained ColSmolVLM, achieving SOTA multimodal retrieval speeds with performance comparable to models 10x its size. ColSmolVLM makes building searchable databases faster and cheaper. We think the 256M model has great potential as a specialized base for many tasks. Find links on how to use ColSmolVLM with the new SmolVLM models in the Next Steps section.
SmolDocling
We partnered with IBM to build models for Docling. Early results with the 256M model are impressive. Below are some of the early examples they shared with us. Please stay tuned for further information.
Using the smaller SmolVLMs
The new models work out of the box with the older SmolVLM code, so you can use transformers and MLX for inference and fine-tuning, and TRL for alignment. 🚀 Additionally, this release comes with ONNX checkpoints.
You can get started with SmolVLM in transformers as shown below.
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load an example image (any local path or URL works).
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vlm_example.jpg")

processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-500M-Instruct",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Could you please explain this image?"},
        ],
    },
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(DEVICE)

generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])
Run the following CLI command to use SmolVLM with MLX.
python3 -m mlx_vlm.generate --model HuggingFaceTB/SmolVLM-500M-Instruct --max-tokens 400 --temp 0.0 --image https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vlm_example.jpg --prompt "What is in this image?"
You can try the WebGPU demos of SmolVLM-256M-Instruct and SmolVLM-500M-Instruct.
Find links for fine-tuning and multimodal RAG with ColSmolVLM in the Next Steps section.
Next Steps
Thanks to the ViDoRe team (Tony Wu and Manuel Faysse) for training ColSmolVLM, to Joshua Lochner for the ONNX conversion and WebGPU demos, and to Vaibhav Srivastav for his help with this release.