
A powerful 8B vision language model for the community

By versatileai · June 15, 2025

I’m excited to release IDEFICS2, a general multimodal model that takes sequences of text and images as input and generates text responses. It can answer questions about images, describe visual content, create stories grounded in multiple images, extract information from documents, and perform basic arithmetic.
With 8B parameters, an open license (Apache 2.0), and enhanced OCR (optical character recognition) capabilities, IDEFICS2 improves upon IDEFICS1 and is a strong foundation for the community working on multimodality. Its performance on visual question-answering benchmarks is at the top of its class size, and it competes with much larger models such as LLaVA-NeXT-34B and MM1-30B-chat.
IDEFICS2 is also integrated into 🤗 transformers from the get-go, making it easy to fine-tune for many multimodal applications. You can try out the model on the Hub right now!


| Model | Open weights | Size | # tokens per image | MMMU (val/test) | MathVista (testmini) | TextVQA (val) | MMBench (test) | VQAv2 (test-dev) | DocVQA (test) |
|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-VL | ✅ | – | – | – | – | – | – | – | – |
| LLaVA-NeXT-34B | ✅ | 34B | 2880 | 51.1/44.7 | 46.5 | 69.5 | 79.3 | 83.7 | – |
| MM1-Chat-7B | ❌ | 7B | 720 | 37.0/35.6 | 35.9 | – | – | – | – |
| Gemini 1.0 Pro | ❌ | 🤷‍♂️ | 🤷‍♂️ | 47.9/– | 45.2 | 74.6 | – | 71.2 | 88.1 |
| Gemini 1.5 Pro | ❌ | 🤷‍♂️ | 🤷‍♂️ | – | – | – | – | – | – |
| IDEFICS2 (w/o im. split)* | ✅ | 8B | 64 | 43.5/37.9 | 51.6 | 70.4 | 76.8 | 80.8 | 67.3 |
| IDEFICS2 (w/ im. split)* | ✅ | 8B | – | – | – | – | – | – | – |

* w/ im. split: following the strategy of SPHINX and LLaVA-NeXT, we allow the optional splitting of an image into 4 sub-images.
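To make the sub-image splitting idea concrete, here is a minimal sketch that computes the four quadrant crop boxes for an image. The function name and the exact cropping scheme are illustrative assumptions based on the SPHINX / LLaVA-NeXT description, not IDEFICS2's actual implementation (which also processes the original image alongside the crops):

```python
def quadrant_boxes(width, height):
    """Crop boxes (left, top, right, bottom) for splitting an image into
    4 sub-images, in the spirit of the SPHINX / LLaVA-NeXT strategy.
    Hypothetical helper for illustration only."""
    w2, h2 = width // 2, height // 2
    return [
        (0, 0, w2, h2),           # top-left
        (w2, 0, width, h2),       # top-right
        (0, h2, w2, height),      # bottom-left
        (w2, h2, width, height),  # bottom-right
    ]
```

Each box could then be passed to something like `PIL.Image.crop`, and every sub-image is encoded in addition to the full image, trading more visual tokens for finer detail at high resolutions.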

Training data

IDEFICS2 was trained on a mixture of openly available datasets for pretraining: interleaved web documents (Wikipedia, OBELICS), image-caption pairs (Public Multimodal Dataset, LAION-COCO), OCR data (PDFA (en), IDL, and Rendered-text), and image-to-code data (WebSight).
An interactive visualization allows you to explore the OBELICS dataset.
Following common practice in the foundation-model community, we further train the base model on task-oriented data. However, these data are often in disparate formats and scattered across different places, and gathering them is a barrier for the community. To address that problem, we are releasing the multimodal instruction fine-tuning dataset we have been cooking: The Cauldron, an open compilation of 50 manually curated datasets formatted for multi-turn conversations. We fine-tuned IDEFICS2 on the concatenation of The Cauldron and various text-only instruction fine-tuning datasets.
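To illustrate what "formatted for multi-turn conversations" means in practice, here is a hedged sketch that converts question-answer pairs about a single image into the chat-message structure consumed by transformers chat templates. The converter `to_chat_messages` is a hypothetical helper for illustration; it is not The Cauldron's actual on-disk schema:

```python
def to_chat_messages(turns):
    """Convert (question, answer) pairs about one image into multi-turn
    chat messages in the transformers chat-template convention.
    Hypothetical helper; not The Cauldron's real schema."""
    messages = []
    for i, (question, answer) in enumerate(turns):
        user_content = [{"type": "text", "text": question}]
        if i == 0:
            # Attach the image to the first user turn only.
            user_content.insert(0, {"type": "image"})
        messages.append({"role": "user", "content": user_content})
        messages.append(
            {"role": "assistant", "content": [{"type": "text", "text": answer}]}
        )
    return messages
```

A list like `[("What is shown?", "A chart."), ("What is its maximum value?", "42.")]` thus becomes four alternating user/assistant messages that can be fed to `processor.apply_chat_template`.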

The Cauldron

Improvements over IDEFICS1

We follow the NaViT strategy to work with images at their native resolution (up to 980 × 980) and native aspect ratio. This avoids the need to resize images to fixed-size squares, as has historically been done in the computer vision community. Additionally, following SPHINX's strategy, we (optionally) allow sub-image splitting and passing images of very high resolution. We significantly enhanced OCR abilities by integrating data that requires the model to transcribe text in an image or document, and we improved abilities in answering questions about charts, figures, and documents with appropriate training data. We departed from the IDEFICS1 architecture (gated cross-attentions) and simplified the integration of visual features into the language backbone: the images are fed to a vision encoder, followed by a learned Perceiver pooling and an MLP modality projection. That pooled sequence is then concatenated with the text embeddings to obtain an (interleaved) sequence of images and text.
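One consequence of the pooling design above is that each image contributes only a small, fixed number of visual tokens to the language model's input. The sketch below does that bookkeeping: the 64-tokens-per-image figure comes from the benchmark table, while the 5x factor under splitting (4 crops plus the original image) is an assumption based on the SPHINX / LLaVA-NeXT strategy, not a confirmed detail:

```python
def idefics2_seq_len(num_images, num_text_tokens,
                     tokens_per_image=64, image_split=False):
    """Back-of-the-envelope input-sequence length for the pooling design:
    each image is compressed by the learned Perceiver pooling to a fixed
    number of visual tokens, projected by an MLP, and concatenated with
    the text embeddings. Illustrative sketch only."""
    # Assumption: splitting feeds the original image plus 4 crops (5x tokens).
    per_image = tokens_per_image * (5 if image_split else 1)
    return num_images * per_image + num_text_tokens
```

For example, two images plus 30 text tokens would occupy 158 positions without splitting, which is what keeps the model cheap to run compared with approaches that feed thousands of patch tokens per image.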

All of these improvements, along with better pre-trained backbones, yield a significant jump in performance over IDEFICS1 for a model that is 10x smaller.

IDEFICS2 Architecture

Get started with IDEFICS2

IDEFICS2 is available on the Hugging Face Hub and is supported in the latest transformers version. Here is a code sample to try it out:

import requests
import torch
from PIL import Image

from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

device = "cuda:0"

image1 = load_image("https://cdn.britannica.com/61/93061-050-99147dce/statue-of-liberty-island-new-york-bay.jpg")
image2 = load_image("https://cdn.britannica.com/59/94459-050-dba42467/skyline-chicago.jpg")
# image3 is loaded here but not used in this two-image example
image3 = load_image("https://cdn.britannica.com/68/170868-050-8DDE8263/GOLDEN-GATE-BRIDGE-SAN-FRANCISCO.JPG")

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b").to(device)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What do we see in this image?"},
        ],
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "In this image we can see New York City, and more specifically the Statue of Liberty."},
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "And how about this picture?"},
        ],
    },
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

print(generated_texts)

We also provide a fine-tuning colab for anyone looking to improve IDEFICS2 on specific use cases.


Resources

If you want to dive deeper, here is a compilation of all the resources for IDEFICS2:

License

The model is built on top of two pre-trained models: Mistral-7B-v0.1 and SigLIP-SO400M-patch14-384, both released under the Apache 2.0 license. We release the IDEFICS2 weights under the Apache 2.0 license as well.

Acknowledgments

Thank you to the Google team and Mistral AI for making their models available to the open-source AI community!

Thank you to Chun Te Lee for the barplot, and to Noyan for the review and suggestions on the blog post.
