We are excited to welcome PaliGemma 2, the next iteration of PaliGemma, Google's vision language model. Like its predecessor, PaliGemma 2 uses the same powerful SigLIP image encoder, but the text decoder is upgraded to the latest Gemma 2.
PaliGemma 2 comes with new pre-trained (pt) models in 3B, 10B, and 28B parameter sizes, and each of them supports three input resolutions: 224×224, 448×448, and 896×896. These combinations provide great flexibility for different use cases, allowing practitioners to choose the quality/efficiency trade-off they need. In contrast, the previous PaliGemma was only available in a 3B variant.
The pre-trained models are designed to be easily fine-tuned on downstream tasks. The first PaliGemma was widely adopted by the community for a variety of purposes. With the additional flexibility from more variants and the improved pre-training quality, we can't wait to see what the community can build this time.
As an example, Google has also released models fine-tuned on the DOCCI dataset, demonstrating versatile and robust captioning capabilities: long, nuanced, and detailed. The DOCCI fine-tuned models are available in 3B and 10B variants and support 448×448 input resolution.
This release includes all the open model repositories, the transformers integration, fine-tuning scripts, and a demo of a model fine-tuned for visual question answering on the VQAv2 dataset.
Introducing PaliGemma 2
PaliGemma 2 is a new version of the PaliGemma vision language model that Google released in May. PaliGemma 2 connects the powerful SigLIP image encoder to the Gemma 2 language model.
The new models are based on the Gemma 2 2B, 9B, and 27B language models, yielding the corresponding 3B, 10B, and 28B PaliGemma 2 variants; the names account for the additional parameters of the (compact) image encoder. As mentioned above, they support three different resolutions, providing great flexibility for fine-tuning on downstream tasks.
PaliGemma 2 is distributed under the Gemma license, which permits redistribution, commercial use, fine-tuning, and creation of model derivatives.
This release comes with the following checkpoints in bfloat16 precision:
Nine pre-trained models: 3B, 10B, and 28B with resolutions 224×224, 448×448, and 896×896.
Two DOCCI fine-tuned models: fine-tuned on the DOCCI dataset (image and caption pairs), available for the 3B and 10B PaliGemma 2 variants at 448×448 input resolution.
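To make the checkpoint naming concrete, here is a minimal loading sketch. The pre-trained repo ids are assumed to follow the same size-and-resolution pattern as the fine-tuned checkpoint used later in this post, so double-check the exact names on the Hub for the variant you need.

```python
# Sketch of how checkpoint names map to parameter size and input resolution.
# Repo id pattern assumed from google/paligemma2-10b-ft-docci-448 (used below).
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

size = "3b"        # "3b", "10b", or "28b"
resolution = 224   # 224, 448, or 896

repo_id = f"google/paligemma2-{size}-pt-{resolution}"  # e.g. google/paligemma2-3b-pt-224
processor = AutoProcessor.from_pretrained(repo_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(repo_id)
```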
Model features
As with the previous PaliGemma release, the pre-trained (pt) models are ideal for further fine-tuning on downstream tasks.
The pt models are pre-trained on the following data mixture. The diversity of the pre-training datasets allows fine-tuning on downstream tasks in similar domains with relatively few examples.
WebLI: A web-scale multilingual image-text dataset built from the public web. A wide range of WebLI splits are used to acquire versatile model capabilities, such as visual semantic understanding, object localization, visually-situated text understanding, and multilinguality.
CC3M-35L: Curated English image and alt_text pairs from web pages (Sharma et al., 2018). The dataset was translated into an additional 34 languages using the Google Cloud Translation API.
VQ²A (Visual Question Generation with Question Answering Validation): A dataset for question answering, translated into the same additional 34 languages using the Google Cloud Translation API.
OpenImages: Detection and object recognition questions and answers generated by handcrafted rules on the OpenImages dataset (Piergiovanni et al. 2022).
WIT: Images and text collected from Wikipedia (Srinivasan et al., 2021).
The PaliGemma 2 team fine-tuned the pt models internally on a variety of visual language understanding tasks and provides benchmarks of these fine-tuned models in the model cards and the technical report.
When fine-tuned on the DOCCI dataset, PaliGemma 2 can perform a wide range of captioning tasks, including text rendering, capturing spatial relations, and incorporating world knowledge into captions.
Below is the performance of the DOCCI fine-tuned PaliGemma 2 checkpoints compared to other models (taken from Table 6 of the technical report).
| Model | #par | #char | #sent | NES↓ |
|---|---|---|---|---|
| MiniGPT-4 | 7B | 484 | 5.6 | 52.3 |
| mPLUG-Owl2 | 8B | 459 | 4.4 | 48.4 |
| InstructBLIP | 7B | 510 | 4.0 | 42.6 |
| LLaVA-1.5 | 7B | 395 | 4.2 | 40.6 |
| VILA | 7B | 871 | 8.6 | 28.6 |
| PaliGemma | 3B | 535 | 8.9 | 34.3 |
| PaLI-5B | 5B | 1065 | 11.3 | 32.9 |
| PaliGemma 2 | 3B | 529 | 7.7 | 28.4 |
| PaliGemma 2 | 10B | 521 | 7.5 | 20.3 |
#par: Number of parameters. #char: Average number of characters in the generated caption. #sent: Average number of sentences. NES: Non-entailment sentences, a measure of factual inaccuracy (lower is better).
Below are some model outputs from DOCCI checkpoints that demonstrate the versatility of the model.
Demo
For demonstration purposes, the Hugging Face team fine-tuned PaliGemma 2 3B at 448×448 resolution on a small portion of the VQAv2 dataset, using LoRA fine-tuning with PEFT, as described in the fine-tuning section below. The demo below shows the final result. Feel free to explore the code in the Space to see how it works, or duplicate it and adapt it to your own fine-tuned models.
How to use with Transformers
You can run inference on the PaliGemma 2 models with 🤗 transformers, using the PaliGemmaForConditionalGeneration and AutoProcessor APIs. Until the next transformers version is released on PyPI, you need to install transformers from the main branch as follows:
```bash
pip install git+https://github.com/huggingface/transformers
```
Then you can perform inference like this:
```python
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests

# Load the DOCCI fine-tuned checkpoint and its processor
model_id = "google/paligemma2-10b-ft-docci-448"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
model = model.to("cuda")
processor = AutoProcessor.from_pretrained(model_id)

# Download an example image and generate a caption
prompt = "caption en"
image_file = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png"
raw_image = Image.open(requests.get(image_file, stream=True).raw).convert("RGB")

inputs = processor(prompt, raw_image, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=20)

print(processor.decode(output[0], skip_special_tokens=True)[len(prompt):])
```
You can also load quantized models using the bitsandbytes integration in transformers. The following example uses 4-bit nf4:

```python
import torch
from transformers import BitsAndBytesConfig, PaliGemmaForConditionalGeneration

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, quantization_config=bnb_config, device_map={"": 0}
)
```
We quickly tested the performance degradation due to quantization by evaluating a 3B fine-tuned checkpoint on the TextVQA dataset, using 224×224 input images. These are the results on 5,000 entries of the validation set:
bfloat16, no quantization: 60.04% accuracy.
8-bit: 59.78%.
4-bit, using the configuration from the snippet above: 58.72%.
These are very encouraging results! Of course, quantization is most interesting for the larger checkpoints, so it's always a good idea to measure the results for your own domain and tasks.
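If you want to run a similar check for your own model, the sketch below compares a bfloat16 and a 4-bit nf4 version of the same checkpoint with a simplified accuracy metric. The TextVQA repo id (lmms-lab/textvqa), its column names, and the relaxed string-match metric are assumptions for illustration, not the exact evaluation we ran.

```python
import torch
from datasets import load_dataset
from transformers import (
    AutoProcessor,
    BitsAndBytesConfig,
    PaliGemmaForConditionalGeneration,
)

# Replace with the checkpoint you want to evaluate (e.g. your own fine-tune).
model_id = "google/paligemma2-3b-pt-224"
processor = AutoProcessor.from_pretrained(model_id)

def load_model(quantization_config=None):
    return PaliGemmaForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        quantization_config=quantization_config,
        device_map={"": 0},
    )

def evaluate(model, dataset):
    # Relaxed accuracy: a prediction counts as correct if it exactly matches
    # any reference answer (the official VQA metric is more involved).
    correct = 0
    for example in dataset:
        prompt = "answer en " + example["question"]
        inputs = processor(
            text=prompt, images=example["image"].convert("RGB"), return_tensors="pt"
        ).to(model.device)
        output = model.generate(**inputs, max_new_tokens=10)
        decoded = processor.decode(output[0], skip_special_tokens=True)
        prediction = decoded[len(prompt):].strip().lower()
        correct += int(prediction in [answer.lower() for answer in example["answers"]])
    return correct / len(dataset)

# Dataset repo id and column names are assumptions; adapt them to your copy of TextVQA.
dataset = load_dataset("lmms-lab/textvqa", split="validation").select(range(100))

model = load_model()  # bfloat16, no quantization
bf16_accuracy = evaluate(model, dataset)
del model
torch.cuda.empty_cache()

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
nf4_accuracy = evaluate(load_model(nf4_config), dataset)

print(f"bfloat16: {bf16_accuracy:.2%} | 4-bit nf4: {nf4_accuracy:.2%}")
```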
Fine-tuning
If you have previously fine-tuned PaliGemma, the API to fine-tune PaliGemma 2 is the same, so you can use your code out of the box. We provide a fine-tuning script and a notebook for you to fine-tune the model, freeze parts of it, or apply memory-efficient fine-tuning techniques such as LoRA or QLoRA.
For demonstration purposes, we fine-tuned the PaliGemma 2 model with LoRA on half of the VQAv2 validation split. This took 30 minutes on three A100s with 80GB VRAM. The model can be found here, and this is a Gradio demo that showcases it.
Conclusion
The new PaliGemma 2 release is even more exciting than previous releases, with different sizes and powerful pre-trained models to suit everyone’s needs. We look forward to seeing what the community builds.
Thank you to the Google team for releasing this amazing open model family. Many thanks to Pablo Montalvo for integrating the model into transformers, and to Lysandre, Raushan, Arthur, Yih-Dar, and the rest of the team for quickly reviewing, testing, and merging.
Resources