Last December, Google released PaliGemma 2: a family of pre-trained (pt) PaliGemma vision language models (VLMs) based on SigLIP and Gemma 2. The models come in three sizes (3B, 10B, 28B) and three resolutions (224×224, 448×448, 896×896).
Today, Google is releasing PaliGemma 2 mix: checkpoints fine-tuned on a mix of vision language tasks, including OCR, long captioning, and short captioning.
The PaliGemma 2 pre-trained (pt) variants are great vision language models to transfer to a specific task at hand. All pt checkpoints are meant to be fine-tuned on a downstream task and were released for that purpose.
The mix models give a quick idea of the performance you can expect when fine-tuning the pre-trained checkpoints on a downstream task. The main purpose of the PaliGemma model family is to provide pre-trained models that learn better on a downstream task, rather than a versatile chat model. The mix models give a good signal of how the pt models perform when fine-tuned on a mix of academic datasets.
You can read more about PaliGemma 2 in this blog post.
You can find all the mix models and the demo in this collection.
PaliGemma 2 Mix Models
PaliGemma 2 mix models can accomplish a variety of tasks. We can categorize them by subtask as follows:
Common vision-language tasks: visual question answering, referring to images
Localization-related tasks: object detection, image segmentation
Note that this list of subtasks is not exhaustive; the PaliGemma 2 paper contains the complete list of tasks.
When prompting PaliGemma 2 mix models, you can use open-ended prompts. In the previous iteration of the PaliGemma pre-trained models, we had to add a task prefix to the prompt depending on the task we wanted to accomplish in a given language. This still works, but open-ended prompts yield better performance. Prompts with a task prefix look like this:
"caption {lang}": Nice, COCO-like short captions
"describe {lang}": Longer, more descriptive captions
"ocr": Optical character recognition
"answer {lang} {question}": Question answering about the image contents
"question {lang} {answer}": Question generation for a given answer
The only two tasks that work solely with task prefixes are object detection and image segmentation. The prompts look like this:
"detect {object description}": Detect the listed objects in the image and return the bounding boxes for those objects
"segment {object description}; {object description}": Segment the area occupied by the objects in the image to create an image segmentation for those objects
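To make the prefix formats above concrete, here is a minimal sketch of a helper that assembles task-prefix prompt strings. The function name `build_prompt` and its parameters are hypothetical (they are not part of transformers), the "describe" prefix is taken from the PaliGemma model card, and joining several object descriptions with "; " simply follows the format shown in this post.

```python
from typing import Sequence


def build_prompt(
    task: str,
    *,
    lang: str = "en",
    text: str = "",
    objects: Sequence[str] = (),
) -> str:
    """Build a task-prefix prompt string (hypothetical helper, not a library API)."""
    if task in ("caption", "describe"):
        # "caption {lang}" / "describe {lang}"
        return f"{task} {lang}"
    if task == "ocr":
        return "ocr"
    if task in ("answer", "question"):
        # "answer {lang} {question}" / "question {lang} {answer}"
        if not text:
            raise ValueError(f"task '{task}' needs a non-empty text argument")
        return f"{task} {lang} {text}"
    if task in ("detect", "segment"):
        # "detect {object description}; {another object description}"
        if not objects:
            raise ValueError(f"task '{task}' needs at least one object description")
        return f"{task} " + "; ".join(objects)
    raise ValueError(f"unknown task: {task}")


print(build_prompt("caption"))
print(build_prompt("answer", text="How many candies are there?"))
print(build_prompt("detect", objects=["bird on a stick", "cat"]))
```

The resulting string is what you would pass as the text prompt to the processor. For open-ended prompting, no helper is needed: you pass the question or instruction as-is.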
If you want to get started right away, feel free to jump to this section of the blog or try the demo.
Comparing PaliGemma 2 Mix Variants
In this section, we review the capabilities mentioned above, look at how PaliGemma 2 mix performs on them, and compare variants of different sizes and resolutions on several tasks. Here we test the models on a few internal examples.
Common Vision Language Tasks
Input prompt: How many candies are there?
3B/448 response: 5 ✅
10B/448 response: 5 ✅

Input prompt: Describe the image in detail.
3B/448 response: Groups of people walk down the street and turn their backs to the camera. The streets are lined with tall green leafy trees, palm trees and white fences along the sidewalk. You can see a large pink building with a large round roof and a large pink building with a dome roof in the distance. A man in a black shirt and grey bag is walking in front of a white motorcycle parked near a fence. A man in a red shirt and a black backpack stands next to a food cart on the sidewalk. The food cart is parked on the side of the street. A man in a blue shirt is walking down the sidewalk, holding a white bag on his back. The streetlights are black. The sky is blue.
10B/448 response: A group of people walks down the street and turns their backs to the camera. Tall buildings with pointy roofs and large pink and white buildings dominate the skyline. A woman in a black dress and a man in black pants walk side by side. The motorcycle is parked next to the fence and the scooter is parked on the sidewalk. The person is walking, and the motorcycle is being driven by a man wearing a white shirt. The man is wearing a red shirt and black pants, while the man in a blue shirt is riding a scooter. The pole sign appears and the person is wearing a backpack. ✅
Document Understanding
Input prompt: For resolution-sensitive tasks, which variant is best?
Response: It’s 448px, but has been resized to the first 448px of 224px

Input prompt: What is the target emission rate in France in 2023?
3B/448 response: 20 ✅
10B/448 response: 20 ✅
Localization Tasks
The PaliGemma 2 mix variants were also evaluated on localization-related capabilities. Given a prompt such as "detect {object description}; {another object description}" with different objects of interest, PaliGemma can detect each of them. The prompt is not limited to short class names like "bird"; it can also be a description such as "bird on a stick".
Below you can find detection and segmentation outputs of different variants at a fixed resolution of 448×448. We zoom in on the object of interest for visualization purposes.
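If you want to draw the detections yourself, the sketch below shows one way to turn a decoded detection string into pixel-space bounding boxes. It rests on an assumption taken from the PaliGemma documentation rather than from this post: each detection is emitted as four `<locNNNN>` tokens in y_min, x_min, y_max, x_max order on a 1024-bin grid, followed by the label, with detections separated by ";". The names `Box` and `parse_detections` are hypothetical helpers, not part of transformers, and the sample string is made up for illustration.

```python
import re
from typing import List, NamedTuple


class Box(NamedTuple):
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


# Assumed output format: "<locYYYY><locXXXX><locYYYY><locXXXX> label ; ..."
_DETECTION = re.compile(
    r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^<;]+)"
)


def parse_detections(decoded: str, width: int, height: int) -> List[Box]:
    """Convert <locNNNN> tokens into pixel coordinates for a width x height image."""
    boxes = []
    for match in _DETECTION.finditer(decoded):
        # Each token is a bin index in [0, 1023]; divide by 1024 to normalize.
        y0, x0, y1, x1 = (int(match.group(i)) / 1024 for i in range(1, 5))
        label = match.group(5).strip()
        boxes.append(Box(label, x0 * width, y0 * height, x1 * width, y1 * height))
    return boxes


# Made-up sample output, not a real model response.
sample = "<loc0256><loc0128><loc0768><loc0896> bird on a stick ; <loc0000><loc0000><loc0512><loc0512> cat"
for box in parse_detections(sample, width=1024, height=512):
    print(box)
```

Scale the boxes with the size of the original image you want to draw on, not the 448×448 model input, since the coordinates are relative. Segmentation outputs carry additional mask tokens that need a separate decoder, so this sketch covers detection only.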
Text Recognition in Images
Input prompt: When was this ticket dated and how much did it cost?
3B/448 response: 26-05-2023 21:52 17.00 ✅
10B/448 response: 26-05-2023 17.00 ✅

Input prompt: Read text
3B/448 response: Friday, December 20th\nnew Office Party\ncocktail Menu -\noffice martini\nvodka fraise des bois -jus de fleur de sureau -fleur wide strawberry vodka -raspberry puere-elderflower liquor -ndiffurers sonrise \ntequila -Mandarine Impériale -jus D’Orange Sanguine -Cointreau -Cherry Bitter Tequila -Tangerine Liquor -Blood Orange Juice -Cointreau -Cherry Bitter \ngin infused ala manga rotie -citronnelle, kiwi Vert & Jaune -Citron -Poivre Blanc Roasted Mango Infused Gin -Lemongrass -Green & Green & Yellow Kiwi, Lemon – White pepper \ntransformers Twist -Cramel Jamplémousse -Bananas \nperuian Peft \npêches -Cherry Liquor -Grapefruit Cordial -Pineapple ✅
10B/448 response: Friday Bureau – Fleur Wild Strawberry Vodka – Raspberry Puree – Elder Flower Liqueur – Flower Diffuser Sunhiz Tequila – Mandarin Impriare – Jusdrenzin Sungin – Coin Troll – Cherry Bitter Tequila – Blood Orange Juice – Coin Troll – Coin Troll – Cherry Bitter Trans Twist Gin Flavor Filled – Citillon – Citillon – Citillon Vert & Jaune – Citron – Poivre Blanc Roasted Mango Infused Gin – Lemongrass – Green & Yellow Kiwi Lemon – White pepper Peruvian Peft Pieches – Aude de Sadere – Eau de Pumple Mouse – Anana Peace – Cherry Liquor – Grapefruit Vodka – Pineapple ✅
Inference and Fine-Tuning Using Transformers
You can use the PaliGemma 2 mix models with transformers.
from transformers import (
    PaliGemmaProcessor,
    PaliGemmaForConditionalGeneration,
)
from transformers.image_utils import load_image
import torch

model_id = "google/paligemma2-10b-mix-224"
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = load_image(url)

# Load the model in bfloat16 and let transformers place it on the available device(s)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto").eval()
processor = PaliGemmaProcessor.from_pretrained(model_id)

prompt = "describe en"
# Cast floating-point inputs to bfloat16 and move all tensors to the model's device
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(torch.bfloat16).to(model.device)
input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    # generate() returns the prompt tokens followed by the new tokens; keep only the new ones
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
We have a detailed tutorial on fine-tuning PaliGemma 2. You can use the same notebook to fine-tune the mix checkpoints as well.
Demo
We are releasing a demo of the 10B model at 448×448 resolution. You can play with it below or head to the app at this link.
Read More
Read more about the PaliGemma models below.
Acknowledgments
Thank you to Sayak Paul and Vaibhav Srivastav for reviewing this blog post. Thank you to the Google team for releasing this amazing, open model family.
We would also like to thank Pablo Montalvo for integrating the model into transformers, and Lysandre, Raushan, Arthur, Yih-Dar, and the rest of the team for reviewing, testing, and merging it in no time.