With the release of our new 8B and 32B parameter vision-language models (VLMs), the Aya Vision family, we are working to bring multilingual performance to one of AI's biggest challenges: multimodal models.
Aya Vision is Cohere For AI's latest open-weights multilingual and multimodal model family, designed as a strong foundation for language and vision understanding across 23 languages. It builds on the success of Aya Expanse, our state-of-the-art multilingual language model, and extends it with a combination of advanced techniques, including synthetic annotations, scaling up multilingual data through translation and rephrasing, and multimodal model merging, all of which are important methods for deepening language and vision understanding in multilingual settings.
As a result, our models perform well on a variety of tasks, including image captioning, visual question answering, text generation, and translating both text and images into clear, natural-language text. We evaluated the Aya Vision models on a suite of datasets, including our new open-ended vision-language benchmark AyaVisionBench and a multilingual version of WildVision Bench (mWildVision) translated into 23 languages.
In pairwise comparisons, Aya Vision 32B outperforms models more than twice its size, including Llama-3.2 90B Vision, Molmo 72B, and Qwen2.5-VL 72B.
Our compact and more efficient model, Aya Vision 8B, achieves the best multilingual multimodal performance in its parameter class, outperforming leading models such as Qwen2.5-VL 7B, Pixtral 12B, Gemini Flash 1.5 8B, Llama-3.2 11B Vision, Molmo-D 7B, and Pangea 7B, reaching win rates of up to 79% on AyaVisionBench and mWildVision.
We release both the 8B and 32B models as open weights for the research community to further accelerate progress in multilingual multimodal research. In this blog post, we share the key technical details behind the Aya Vision models.
Aya Vision Architecture and Training
For a high-performing vision-language model, it is important to process images at their native resolution, especially high-resolution images. To enable this in Aya Vision, high-resolution images are dynamically resized and split into multiple tiles to generate rich image features from the image encoder. The Aya Vision models use the recently released SigLIP2-patch14-384 model to initialize the vision encoder.
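To make the idea concrete, here is a minimal sketch of dynamic tiling in Python; the 384-pixel tile size matches the SigLIP2-patch14-384 input resolution, while the grid-selection heuristic and the cap on the number of tiles are illustrative assumptions rather than Aya Vision's exact preprocessing.

```python
# Minimal sketch of dynamic tiling for a high-resolution image (illustrative,
# not the exact Aya Vision preprocessing).
from PIL import Image

TILE_SIZE = 384   # vision encoder input resolution (SigLIP2-patch14-384)
MAX_TILES = 12    # illustrative cap to bound the number of image tokens

def dynamic_tile(image: Image.Image, tile_size: int = TILE_SIZE, max_tiles: int = MAX_TILES):
    """Resize the image to a tile-aligned grid and split it into fixed-size tiles."""
    w, h = image.size
    # Choose a grid that roughly preserves aspect ratio without exceeding max_tiles.
    cols = max(1, round(w / tile_size))
    rows = max(1, round(h / tile_size))
    while cols * rows > max_tiles:
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    resized = image.resize((cols * tile_size, rows * tile_size))
    tiles = [
        resized.crop((c * tile_size, r * tile_size, (c + 1) * tile_size, (r + 1) * tile_size))
        for r in range(rows)
        for c in range(cols)
    ]
    return tiles

# Each tile is then encoded independently by the vision encoder.
```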
Dynamic resizing allows high-resolution images to be processed, but it also increases the number of image tokens passing through the vision-language connector and the LLM decoder. To improve latency and throughput, we use a downsampling method known as pixel shuffle to compress the number of image tokens by a factor of 4. After downsampling, the image tokens are mapped to the language model's input embedding space through the vision-language connector and passed to the LLM decoder.
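The pixel-shuffle operation itself can be sketched as follows; the token-grid size and hidden dimension are assumed values for illustration, and the exact reshaping order may differ in the actual implementation.

```python
import torch

def pixel_shuffle_downsample(tokens: torch.Tensor, grid: int, factor: int = 2) -> torch.Tensor:
    """Merge each factor x factor block of image tokens into a single token.

    tokens: (batch, grid*grid, hidden) image tokens from the vision encoder.
    Returns (batch, (grid//factor)**2, hidden * factor**2), i.e. 4x fewer
    tokens for factor=2, each with a 4x wider feature vector.
    """
    b, n, d = tokens.shape
    assert n == grid * grid and grid % factor == 0
    x = tokens.view(b, grid, grid, d)
    # Fold spatial neighbours into the channel dimension.
    x = x.view(b, grid // factor, factor, grid // factor, factor, d)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(b, (grid // factor) ** 2, d * factor ** 2)

# Example with an assumed 28x28 token grid and an assumed hidden size of 1152.
tokens = torch.randn(1, 28 * 28, 1152)
compressed = pixel_shuffle_downsample(tokens, grid=28)
print(compressed.shape)  # torch.Size([1, 196, 4608]) -> 4x fewer tokens
```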
For the text decoder, we use our multilingual language models. For Aya Vision 8B, we initialize the LLM from Cohere Command R7B for improved instruction following and world knowledge, further post-trained with the Aya Expanse recipe, which consists of diverse multilingual data, model merging, and preference training. For Aya Vision 32B, we initialize the language model from Aya Expanse 32B, given its state-of-the-art multilingual performance.
Training process
We trained the Aya Vision models in two stages: vision-language alignment and supervised fine-tuning (SFT). During the vision-language alignment stage, only the vision-language connector is trained while the weights of the vision encoder and the language model are kept frozen. This enables a basic level of vision-language understanding by mapping image encoder features into the language model's embedding space. In the SFT stage, both the connector and the language model are trained on a diverse set of multimodal tasks in 23 languages.
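A minimal sketch of how the two stages differ in which parameters are trained is shown below, using hypothetical submodule names (vision_encoder, connector, llm) that stand in for the real model structure.

```python
import torch

def set_trainable(module: torch.nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(model, stage: str):
    """Freeze or unfreeze submodules for the two training stages.

    The model is assumed to expose .vision_encoder, .connector and .llm
    (hypothetical attribute names used only for illustration).
    """
    if stage == "alignment":       # stage 1: vision-language alignment
        set_trainable(model.vision_encoder, False)
        set_trainable(model.llm, False)
        set_trainable(model.connector, True)
    elif stage == "sft":           # stage 2: supervised fine-tuning
        set_trainable(model.vision_encoder, False)
        set_trainable(model.connector, True)
        set_trainable(model.llm, True)
    else:
        raise ValueError(f"unknown stage: {stage}")
    # Hand only the trainable parameters to the optimizer.
    return [p for p in model.parameters() if p.requires_grad]
```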
Enhanced multimodal data and expanded language coverage
One of the biggest challenges in developing a multilingual vision-language model is ensuring strong performance across underrepresented languages. To address this, we first gathered synthetic annotations using a diverse pool of high-quality English datasets, which form the basis of our multilingual multimodal annotations. Following synthetic annotation of the English datasets, we translated a large volume of the data into 23 languages. To avoid translation artifacts and maintain fluent text characteristics with high precision in the answers, we then rephrased the translated prompt/generation pairs, matching them against the original high-quality synthetic samples, which expands language coverage where real-world datasets are missing. This improves both linguistic fluency and the alignment between vision and text, allowing Aya Vision to demonstrate strong image understanding across many languages.
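As a rough sketch of the translate-then-rephrase step for a single English sample, the code below uses placeholder translate and rephrase_with_llm functions; these stand in for whichever machine-translation system and rephrasing model are actually used and are not real APIs.

```python
# Hypothetical sketch of expanding one English synthetic sample to a target language.
from dataclasses import dataclass

@dataclass
class Sample:
    prompt: str
    response: str
    language: str

def expand_to_language(english_sample: Sample, target_lang: str,
                       translate, rephrase_with_llm) -> Sample:
    # 1. Translate both sides of the pair into the target language.
    prompt_t = translate(english_sample.prompt, target_lang)
    response_t = translate(english_sample.response, target_lang)
    # 2. Rephrase the translated pair, conditioning on the original high-quality
    #    English response to preserve precision while restoring fluency.
    response_final = rephrase_with_llm(
        prompt=prompt_t,
        draft_response=response_t,
        reference=english_sample.response,
        language=target_lang,
    )
    return Sample(prompt=prompt_t, response=response_final, language=target_lang)
```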
Our 8B model, when supervised fine-tuned only on the original academic datasets, reaches a 40.9% win rate across 23 languages on AyaVisionBench against Pangea 7B; the substantial improvement over this baseline demonstrates the impact of our significant investment in multilingual data coverage.
Multimodal model merging
A state-of-the-art vision-language model should excel not only at image understanding, but also in conversational contexts where the model is expected to produce high-quality responses to both image and text inputs. To address this, we merge the base language model with the fine-tuned vision-language model, inspired by previous research on model merging, a technique that combines multiple trained models.
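One simple way to realize such a merge is a weighted average in parameter space; the sketch below interpolates two compatible state dicts, with the merge weight alpha as an illustrative hyperparameter rather than the value used for Aya Vision.

```python
import torch

def merge_state_dicts(base_sd: dict, finetuned_sd: dict, alpha: float = 0.5) -> dict:
    """Linearly interpolate two compatible state dicts: alpha * finetuned + (1 - alpha) * base.

    Illustrative sketch of weight-space merging; the actual Aya Vision recipe
    may weight parameters differently or merge only a subset of them.
    """
    merged = {}
    for name, base_param in base_sd.items():
        merged[name] = alpha * finetuned_sd[name] + (1.0 - alpha) * base_param
    return merged

# Usage sketch (hypothetical checkpoint files):
# base_sd = torch.load("base_llm.pt")
# vlm_sd = torch.load("vlm_finetuned_llm.pt")
# llm.load_state_dict(merge_state_dicts(base_sd, vlm_sd, alpha=0.5))
```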
Model merging enhances the generative capabilities of the final model, leading to a 70% win rate across 23 languages on AyaVisionBench against Pangea 7B, an 11.9% improvement in multimodal win rate compared with the model before merging.
Thanks to multimodal model merging, the Aya Vision models also excel on text-only tasks, as measured on the m-ArenaHard dataset, compared with other leading vision-language models.
Overview of Aya Vision's training pipeline
Scaling up to 32B
Finally, we scale our recipe from 8B to 32B, resulting in Aya Vision 32B, a state-of-the-art open-weights multilingual vision model. Thanks to the stronger initialization of the text backbone, it shows significant improvements in win rates against models more than twice its size, such as Llama-3.2 90B Vision and Molmo 72B, ranging from 49% to 63% on AyaVisionBench and from 52% to 72% on mWildVision averaged across 23 languages.
AyaVisionBench – Multilingual evaluation data
Along with the Aya Vision models, we also release AyaVisionBench, a high-quality multilingual vision-language benchmark built around real-world applications, covering 23 languages and nine distinct task categories, with 135 image-question pairs per language.
We are making this evaluation set available to the research community to push forward multilingual multimodal evaluation. The dataset is designed to assess a model's ability to perform a diverse range of vision-language tasks, including captioning, chart and figure understanding, identifying differences between two images, general visual question answering, OCR, document understanding, text transcription, reasoning involving logic and math, and converting screenshots to code. By incorporating multiple languages and task types, the dataset provides a broad and challenging framework for assessing cross-lingual and multimodal understanding.
To create this dataset, we first selected images from the Cauldron held-out test set, a large collection derived from 50 high-quality datasets, ensuring they had not been seen during training. For each image, we then generated a corresponding question that explicitly requires visual context to answer. These questions were synthetically generated and subsequently refined through a two-stage verification process. First, human annotators reviewed and validated each question to ensure it was clear, relevant, and truly dependent on the image. This rigorous selection and validation process ensures that the dataset serves as a robust benchmark for evaluating vision-language models in multilingual, real-world settings.
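For reference, the pairwise win rates reported throughout this post can be computed along these lines; the judge function is a placeholder for whatever LLM-as-a-judge or human preference protocol is actually applied.

```python
from typing import Callable, Sequence

def win_rate(model_a_outputs: Sequence[str],
             model_b_outputs: Sequence[str],
             judge: Callable[[str, str], str]) -> float:
    """Fraction of pairwise comparisons won by model A, counting ties as half.

    `judge(a, b)` is a placeholder returning "a", "b", or "tie"; in practice
    this would be an LLM judge or a human annotator.
    """
    assert len(model_a_outputs) == len(model_b_outputs)
    score = 0.0
    for a, b in zip(model_a_outputs, model_b_outputs):
        verdict = judge(a, b)
        if verdict == "a":
            score += 1.0
        elif verdict == "tie":
            score += 0.5
    return score / len(model_a_outputs)
```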
Designed for real applications
Communication happens in many forms and in many languages. With our leading research and development, we release models that promote connection across 23 different languages, whether through text or vision.
Aya Vision has a wide range of practical applications. One notable example is its availability on WhatsApp, one of the most widely used communication platforms in the world. This enables a huge audience of global citizens who speak many languages to use the capabilities of Aya Vision on a platform they rely on for daily communication.
Get started with Aya
To get started:
Download the weights and datasets from the Aya Vision collection on Hugging Face (a rough inference sketch follows this list).
Try Aya Vision in Hugging Face Spaces or by texting it on WhatsApp.
Build on Aya using our Colab example.
Learn more about our ongoing multilingual efforts.
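For readers who want to experiment directly, below is a rough inference sketch following the standard Transformers image-text-to-text interface; the model ID, class names, and message format should be verified against the official model card on Hugging Face.

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

# Model ID as published in the Aya Vision collection (verify on the model card).
model_id = "CohereForAI/aya-vision-8b"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

# A single user turn containing an image URL and a question about it.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/photo.jpg"},
            {"type": "text", "text": "Describe this image in Spanish."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=200)
print(processor.tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
))
```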
Acknowledgments
This work would not have been possible without the core Aya Vision technical team:
Saurabh Dash, Oliver Nan, John Dang, Arash Ahmadian Dehkordi, Shivalika Singh, Alejandro Salamanca, Bharat Venkitesh, Vlad Shmyhlo, Walter Beller-Morales, Jeremy Pekmez, Jason Ozuzu, Madeline Smith, Marzieh Fadaee, Matthias Gallé, Beyza Ermis, Ahmet Üstün, Sara Hooker.
And it would not have been possible without the broader Cohere For AI and Cohere teams, who supported this work in many different ways: Sungjin Hong, Michael Kozakov, Pierre Richemond, Brittawnya Prince, Jim Payne, Kyle Lastovica, Jeff Colen, Jenna Cook, Viraat Aryabumi, Trent Fowler, Linus Chui, Meor Amer, Lucas Fayoux, Acyrlage Trend, Acyrlag Campos, Nick Frosst, Phil Blunsom, Aidan Gomez, Ivan Zhang.
Thank you also to the Hugging Face team for helping make this come together: Yoni Gozlan, Arthur Zucker, Pedro Cuenca, Aritra Roy Gosthipaty, Merve Noyan, and Vaibhav Srivastav.
References
(1) Aya Expanse: Combining Research Breakthroughs for a New Multilingual Frontier
(2) Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages
(3) WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences
(4) SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
(5) What Matters When Building Vision-Language Models?
(6) Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
(7) How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites