Today, Google is releasing SigLIP 2, a new and better family of multilingual vision-language encoders. The authors have extended the training objective of SigLIP (sigmoid loss) with additional objectives for improved semantic understanding, localization, and dense features.
SigLIP 2 models outperform the older SigLIP models at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs).
A cherry on top is the dynamic resolution (NaFlex) variant, which is useful for downstream tasks that are sensitive to aspect ratio and resolution.
Here is a list of all models that have been released.
Introduction
A vision encoder is simple: it takes an image and encodes it into a representation that is used for downstream tasks such as classification, object detection, image segmentation, and other vision tasks. Researchers are always in pursuit of visual representations that are dense, locally aware, and semantically rich.
CLIP and ALIGN were the first examples of aligning an image encoder and a text encoder through joint training. This approach opened up a new way of training vision models. SigLIP took it further, replacing CLIP's contrastive (softmax) loss with a sigmoid loss for an even better encoder.
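For intuition, here is a minimal sketch of a pairwise sigmoid loss of the kind SigLIP uses, assuming L2-normalized embeddings and learnable temperature and bias scalars; this is an illustration, not the reference implementation.

```python
import torch
import torch.nn.functional as F

def sigmoid_pairwise_loss(image_embeds, text_embeds, temperature, bias):
    # image_embeds, text_embeds: (batch, dim), assumed L2-normalized
    logits = image_embeds @ text_embeds.T * temperature + bias  # (batch, batch) similarities
    # +1 on the diagonal (matching pairs), -1 everywhere else
    labels = 2 * torch.eye(logits.shape[0], device=logits.device) - 1
    # every image-text pair is treated as an independent binary classification
    return -F.logsigmoid(labels * logits).sum() / logits.shape[0]
```

Unlike a softmax contrastive loss, each pair is scored independently, with no normalization over the whole batch, which is part of what makes sigmoid training attractive at scale.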
The takeaway? Smarter training objectives keep leading to more structured, fine-grained, and powerful vision encoders. SigLIP 2 is exactly that: a bundle of very interesting and smart training objectives applied on top of SigLIP to provide better and stronger vision-language encoders.
I'll try something new with this blog post. Rather than stating what's new and where to find it, we will do a little exercise together. We start with SigLIP, and I'll brainstorm a series of questions (prefixed with 🤔) and answers (new headings) to gradually cover all the updates in SigLIP 2.
We begin our journey with a vision encoder that has a patch size of 16 and an image resolution of 256. We have four variants to start our training with:
- google/siglip2-base-patch16-256
- google/siglip2-large-patch16-256
- google/siglip2-so400m-patch16-256
- google/siglip2-giant-opt-patch16-256
🤔 Question 1: What is a (low-effort) auxiliary training objective that we can use to learn better visual representations (in terms of location awareness and sense of locality)?
Add a decoder (it’s easy)
We add a decoder to the mix. Now we have an image encoder, a text encoder, and a text decoder. The text decoder has three objectives:
- Predict a holistic image caption
- Predict bounding box coordinates given captions describing specific image regions
- Predict region-specific captions given bounding box coordinates
The decoder gives the vision encoder additional signals and makes it location-aware. This marks the first improvement in the SigLIP 2 training recipe.
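To make the three objectives a bit more concrete, here is a purely illustrative sketch of how such decoder tasks could be framed as (prompt, target) sequence pairs; the prompt wording and the location-token format below are assumptions for illustration, not the exact format used in SigLIP 2.

```python
# Illustrative (prompt, target) pairs for the three text-decoder tasks.
# The location tokens encode normalized box coordinates purely for illustration.
decoder_examples = [
    # 1) Holistic captioning: predict a caption for the whole image
    ("caption:", "two teddy bears sitting next to each other"),
    # 2) Referring expression: predict box coordinates for a described region
    ("locate: the teddy bear on the left", "<loc0102><loc0210><loc0455><loc0600>"),
    # 3) Grounded captioning: predict a caption for a given box
    ("describe: <loc0102><loc0210><loc0455><loc0600>", "a brown teddy bear"),
]

for prompt, target in decoder_examples:
    # the decoder is trained autoregressively to emit `target`,
    # conditioned on the vision encoder's image features and the `prompt`
    print(f"{prompt!r} -> {target!r}")
```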
🤔 Question 2: How do we improve the fine-grained local semantics of the image representation?
Global-local loss and masked prediction with self-distillation
To improve the fine-grained local semantics of the image representation, SigLIP 2 introduces two additional training objectives, the global-local loss and the masked prediction loss. Taking inspiration from the self-supervised learning literature, it uses self-distillation: the model acts both as the teacher and as the student, and at each iteration the teacher's parameters are a moving average of the student's parameters.
- Global-local loss: the student network gets a partial (local) view of the training image and is trained to match the teacher's representation, which is derived from the full image.
- Masked prediction loss: 50% of the embedded image patches in the student network are replaced with mask tokens. The student then has to match the teacher's features at the masked locations.
These objectives teach the vision encoder to be spatially aware and improve its local semantics. The authors add these losses only after 80% of training is done with the sigmoid and decoder losses. This is done both to save compute (the additional losses are fairly expensive) and to avoid negatively affecting the encoder.
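As a rough sketch of the self-distillation mechanics described above (the momentum value, shapes, and mask handling are assumptions, not the authors' exact implementation):

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    # the teacher's parameters track a moving average of the student's parameters
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

def mask_patches(patch_embeddings, mask_token, mask_ratio=0.5):
    # patch_embeddings: (batch, num_patches, dim); mask_token: learned parameter of shape (dim,)
    # replace ~50% of the student's embedded patches with the mask token
    batch, num_patches, dim = patch_embeddings.shape
    mask = torch.rand(batch, num_patches, device=patch_embeddings.device) < mask_ratio
    masked = torch.where(mask.unsqueeze(-1), mask_token.expand(batch, num_patches, dim), patch_embeddings)
    return masked, mask

# the masked prediction loss then compares student features at masked locations
# against the (EMA-updated) teacher features at the same locations
```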
🤔 Question 3: How do we adapt models to different resolutions?
Adapting to different resolutions
It is a known fact that image models can be very sensitive to varying resolutions and aspect ratios. Here we can leverage two distinct methodologies to adapt these models to different resolutions and patch sizes.
- Fixed resolution variant: taking a checkpoint from 95% of training, we can resize the positional embeddings and the patch embeddings and then continue training for the requested (potentially larger) resolution (see the sketch right after this list).
- Dynamic resolution variant: taking inspiration from FlexiViT, which uses inputs with different sequence lengths, and NaViT, which adheres to native aspect ratios, we can create NaFlex variants. This is interesting because a single model can be used for OCR (little aspect ratio distortion) and for document understanding (appropriate resolution).
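As a rough illustration of the fixed-resolution approach mentioned above, here is a minimal sketch of resizing learned position embeddings with bilinear interpolation (the shapes and interpolation mode are assumptions, not the authors' exact procedure):

```python
import torch
import torch.nn.functional as F

def resize_position_embeddings(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    # pos_embed: (1, old_grid * old_grid, dim) learned position embeddings
    _, num_positions, dim = pos_embed.shape
    old_grid = int(num_positions ** 0.5)
    grid = pos_embed.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)  # (1, dim, old_grid, old_grid)
    grid = F.interpolate(grid, size=(new_grid, new_grid), mode="bilinear", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)

# e.g. going from a 16x16 grid (resolution 256, patch 16) to a 24x24 grid (resolution 384)
new_pos_embed = resize_position_embeddings(torch.randn(1, 256, 768), new_grid=24)
```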
Models with the -naflex suffix are the dynamic resolution variants. While the fixed-resolution models work out of the box with the existing SiglipModel class, you need to use Siglip2Model for the NaFlex variants. The pipeline API handles this automatically!
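For example, loading a NaFlex variant explicitly might look like the following sketch; the checkpoint name is illustrative (any model with the -naflex suffix applies).

```python
from transformers import Siglip2Model, AutoProcessor

# illustrative NaFlex checkpoint name
ckpt = "google/siglip2-base-patch16-naflex"
model = Siglip2Model.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)
```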
This brings us to the end of the evolution from SigLIP to SigLIP 2. In the next sections we will look at applications of SigLIP 2.
Run inference with transformers
Running inference with these models is very easy. You can copy and paste the code below and run inference in a free-tier Colab notebook 🚀
Zero-shot classification
Here we use the handy pipeline API to showcase the zero-shot classification capabilities of SigLIP 2.
```python
from transformers import pipeline

ckpt = "google/siglip2-so400m-patch14-384"
pipe = pipeline(model=ckpt, task="zero-shot-image-classification")

inputs = {
    "images": [
        "https://huggingface.co/datasets/merve/coco/resolve/main/val2017/0000000285.jpg",  # bear
        "https://huggingface.co/datasets/merve/coco/resolve/main/val2017/00000000776.jpg",  # teddy bears
    ],
    "texts": [
        "a bear looking into the camera",
        "a bear looking away from the camera",
        "a bunch of teddy bears",
        "two teddy bears",
        "three teddy bears",
    ],
}

outputs = pipe(inputs["images"], candidate_labels=inputs["texts"])
```
Let's visualize the outputs.
(Figure: zero-shot classification scores visualized.)
Encoding images for downstream tasks
You can also encode images the following way:
```python
import torch
from transformers import AutoModel, AutoProcessor
from transformers.image_utils import load_image

ckpt = "google/siglip2-so400m-patch14-384"
model = AutoModel.from_pretrained(ckpt, device_map="auto").eval()
processor = AutoProcessor.from_pretrained(ckpt)

image = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/0000000285.jpg")
inputs = processor(images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    image_embeddings = model.get_image_features(**inputs)

print(image_embeddings.shape)
```
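Continuing from the snippet above, you can also embed text with the same model and compare it against the image embedding. This is a minimal sketch; the padding strategy and the plain cosine similarity used here are assumptions for illustration.

```python
texts = ["a bear looking into the camera", "two teddy bears"]
text_inputs = processor(text=texts, padding="max_length", return_tensors="pt").to(model.device)

with torch.no_grad():
    text_embeddings = model.get_text_features(**text_inputs)

# cosine similarity between the image and each candidate text
image_norm = image_embeddings / image_embeddings.norm(dim=-1, keepdim=True)
text_norm = text_embeddings / text_embeddings.norm(dim=-1, keepdim=True)
print(image_norm @ text_norm.T)
```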
Comparing SigLIP 1 with SigLIP 2
Looking at the table of all the released SigLIP 2 models, we see two distinct changes from SigLIP:
- SigLIP 2 adds new variants for dynamic resolution (NaFlex).
- SigLIP 2 adds the giant (1B) series.
The evaluation charts demonstrate SigLIP 2's edge over SigLIP.
Below is a demo where you can compare the zero-shot classification results of SigLIP 1 and SigLIP 2.
Using the encoder for VLMs
Vision encoders aligned to textual information have become increasingly important in the development of Vision Language Models (VLMs). A common approach to building VLMs is to combine a pretrained vision encoder with a pretrained LLM and train them together on multimodal data across a diverse set of vision-language tasks.
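As a schematic illustration of this recipe (the class, dimensions, and wiring below are made up for illustration and do not correspond to any particular model's architecture):

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Schematic VLM: a pretrained vision encoder, a projector, and a pretrained LLM."""

    def __init__(self, vision_encoder, llm, vision_dim=1152, llm_dim=2048):
        super().__init__()
        self.vision_encoder = vision_encoder             # e.g. a SigLIP 2 encoder
        self.projector = nn.Linear(vision_dim, llm_dim)  # maps image features into the LLM embedding space
        self.llm = llm

    def forward(self, pixel_values, text_embeds):
        image_features = self.vision_encoder(pixel_values)  # (batch, num_patches, vision_dim)
        image_tokens = self.projector(image_features)        # (batch, num_patches, llm_dim)
        # prepend the projected image tokens to the text embeddings and run the LLM
        return self.llm(inputs_embeds=torch.cat([image_tokens, text_embeds], dim=1))
```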
One outstanding example of a VLM leveraging the SigLIP family of vision encoders is PaliGemma. You can dive deeper into PaliGemma's capabilities in the PaliGemma blog post. Building on this foundation, the recently introduced PaliGemma 2 takes it a step further by integrating SigLIP with the advanced Gemma 2 LLM. It would be really exciting to swap out SigLIP with SigLIP 2 in a PaliGemma-like setting and see how that model fares.
Acknowledgments
We would like to thank Michael Tschannen (first author of SigLIP 2), Vaibhav Srivastav, and Sayak Paul for their feedback on this blog post. A huge shout-out to the Google team for releasing this amazing and open family of models.
In no particular order, we would also like to thank Pavel, Ross, Pablo, Pedro, Lysandre, and the rest of the Hugging Face team for their immense support and contributions to this project.