
Boosting Image Search Capabilities Using SigLIP 2


Boosting image search capabilities has become a critical focus in the realm of digital asset management, e-commerce, and social media platforms. With the ever-increasing volume of visual content generated daily, the need for efficient and accurate image retrieval systems is more pressing than ever. Enter SigLIP 2 (Sigmoid Loss for Language-Image Pre-Training), a state-of-the-art multilingual vision-language encoder developed by Google DeepMind, which promises to revolutionize how we approach image similarity and search tasks. Its innovative architecture not only improves semantic understanding but also excels in zero-shot classification and image-text retrieval. By utilizing a unified training approach that incorporates self-supervised learning and diverse data curation, SigLIP 2 outperforms previous models in extracting meaningful visual representations.

Learning Objectives

  • Understand the fundamentals of CLIP models and their role in image retrieval systems.
  • Identify the limitations of softmax-based loss functions in distinguishing nuanced image differences.
  • Explore how the SigLIP model overcomes these limitations by utilizing sigmoid loss functions.
  • Analyze the key advancements and differentiating features of SigLIP 2 over SigLIP.
  • Implement an image retrieval system based on a user’s image query.
  • Compare and evaluate the performance of SigLIP 2 against SigLIP in image retrieval tasks.

This article was published as a part of the Data Science Blogathon.

Contrastive Language-Image Pre-training (CLIP)

CLIP, which stands for Contrastive Language-Image Pre-training, is a groundbreaking multimodal model developed by OpenAI in 2021. It bridges the gap between computer vision and natural language processing by learning a shared representation space for images and text. This innovative approach allows CLIP to understand and correlate both modalities simultaneously, enabling it to perform tasks like zero-shot image classification, image-text retrieval, and captioning.

Learn More: CLIP VIT-L14: OpenAI’s Multimodal Marvel for Zero-Shot Image Classification

Key Components of CLIP

The key components of CLIP are a Text Encoder, an Image Encoder, and a Contrastive Learning Mechanism. This mechanism aligns the representations of text and images by maximizing the similarity between matching pairs and minimizing it for non-matching pairs.

CLIP architecture
Source: https://openai.com/index/clip/

CLIP is trained on a large dataset of image-text pairs, typically involving hundreds of millions of examples. The model learns to predict the most relevant text snippet given an image and vice versa.
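To make the zero-shot idea concrete, below is a minimal sketch using a public CLIP checkpoint through the Hugging Face transformers library. The checkpoint name, example image URL, and candidate labels are illustrative choices, not part of the original article.

import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative public checkpoint; any CLIP checkpoint works the same way
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Example image and candidate labels (assumed for illustration)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a handbag"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarity scores;
# softmax turns them into a probability distribution over the candidate labels
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))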

Also Read: Google’s SigLIP: A Significant Momentum in CLIP’s Framework

Softmax Function with Cross Entropy Loss

In CLIP, one encoder maps the input images and another maps the input texts to a latent representation. Once we have the embeddings (the latent representations) from the encoders, a similarity score (a dot product) is calculated between each image and text pair. The similarity score measures how similar the image and text embeddings are. To train the model to tag the correct text for an image (and vice versa), a loss function is used whose objective is to maximize the similarity score between matching image-text pairs.

How CLIP works

In CLIP, the softmax function is applied to the model's output similarity scores to obtain a probability distribution, shown below, for every image-text pair in a batch.

\mathrm{softmax}(s_i) = \frac{e^{s_i}}{\sum_{j} e^{s_j}}

In CLIP, this normalization (visible in the denominators) is performed independently twice: once across images and once across texts, as shown in the loss function below:

\mathcal{L}_{\text{CLIP}} = -\frac{1}{2B} \sum_{i=1}^{B} \left( \log \frac{e^{t\, x_i \cdot y_i}}{\sum_{j=1}^{B} e^{t\, x_i \cdot y_j}} + \log \frac{e^{t\, x_i \cdot y_i}}{\sum_{j=1}^{B} e^{t\, x_j \cdot y_i}} \right)

The first term in the above equation finds the best text match for a given query image, while the second term finds the best image match for a given query text. Here, x_i and y_j denote the image and text embeddings, t is a temperature parameter, and B is the batch size.
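As a rough PyTorch sketch (the variable names and shapes are my own, not taken from the article), the symmetric softmax loss over a batch of B normalized image embeddings x and text embeddings y can be implemented as:

import torch
import torch.nn.functional as F

def clip_softmax_loss(x, y, t=100.0):
    # x: (B, D) L2-normalized image embeddings
    # y: (B, D) L2-normalized text embeddings
    # t: temperature / logit scale (illustrative value)
    logits = t * x @ y.t()                  # (B, B) pairwise similarity scores
    targets = torch.arange(x.size(0))       # matching pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)       # normalize across texts
    loss_t2i = F.cross_entropy(logits.t(), targets)   # normalize across images
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage with random embeddings
x = F.normalize(torch.randn(8, 512), dim=-1)
y = F.normalize(torch.randn(8, 512), dim=-1)
print(clip_softmax_loss(x, y))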

Limitations of CLIP

  • Difficulty with very similar pairs: CLIP applies the softmax function to the cosine similarities to obtain probabilities for image-text pairings. The softmax may not effectively capture the relative distance between image and text embeddings when pairs are very similar: it tends to push the probabilities of "incorrect" pairings very close to zero, so the model can miss subtle differences between similar images and text descriptions, hurting performance where fine-grained distinctions matter.
  • Quadratic memory complexity: Because the similarity of every positive pair is normalized by all negative pairs in the batch, every GPU has to maintain an N×N matrix of pairwise similarities, which introduces quadratic memory complexity (see the quick calculation after this list).
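To see why the quadratic term matters, here is a quick back-of-the-envelope calculation (the batch size is purely illustrative):

# Memory needed just for the N x N pairwise-similarity matrix
N = 32_768                       # illustrative contrastive batch size
bytes_per_float = 4              # float32
matrix_bytes = N * N * bytes_per_float
print(f"{matrix_bytes / 1024**3:.1f} GiB")   # 4.0 GiB for the logits alone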

SigLIP with Sigmoid Loss Function

SigLIP, developed by Google, follows a similar framework to CLIP but overcomes the above issues by using a sigmoid-based loss (in place of the softmax-based loss) that operates independently on each image-text pair. The sigmoid loss function used in SigLIP is shown below; a minimal PyTorch sketch of it follows the term definitions.

\mathcal{L}_{\text{SigLIP}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \log \frac{1}{1 + e^{\, z_{ij} \, (-t\, x_i \cdot y_j)}}
Source: https://ahmdtaha.medium.com/sigmoid-loss-for-language-image-pre-training-2dd5e7d1af84
  • Here, “N” is the batch size which is present in the denominator so that the loss remains normalized for all batch sizes.
  • “Σ(i=1 to N) Σ(j=1 to N)” is used to sum over the loss for all combinations of image (i) and text (j) pairs.
  • “z_ij” indicates whether the image-text pair is positive (1) or negative (-1).
  • “t” controls the steepness of the sigmoid curve.
  • “xi · yj” measures how similar the image embeddings and text embeddings are.
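Below is a minimal PyTorch sketch of this pairwise sigmoid loss. The variable names are my own, and the learnable bias term that SigLIP adds inside the sigmoid is omitted to keep the sketch aligned with the terms defined above.

import torch
import torch.nn.functional as F

def siglip_sigmoid_loss(x, y, t=10.0):
    # x: (N, D) L2-normalized image embeddings
    # y: (N, D) L2-normalized text embeddings
    # t: controls the steepness of the sigmoid (illustrative value)
    logits = t * x @ y.t()                    # (N, N) similarities x_i . y_j
    z = 2 * torch.eye(x.size(0)) - 1          # z_ij: +1 on the diagonal (positives), -1 elsewhere
    # -log sigmoid(z_ij * t * x_i . y_j), summed over all pairs and normalized by N
    return -F.logsigmoid(z * logits).sum() / x.size(0)

x = F.normalize(torch.randn(8, 512), dim=-1)
y = F.normalize(torch.randn(8, 512), dim=-1)
print(siglip_sigmoid_loss(x, y))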

Differences with Respect to CLIP

  • Loss function: CLIP uses a softmax-based loss, whereas SigLIP uses a sigmoid-based loss. The SigLIP loss is neither asymmetric nor dependent on a global normalization factor; as a result, the loss for each pair, whether positive or negative, is independent of the other pairs in the mini-batch.
  • Memory: In CLIP, each GPU stores an N×N matrix to compute all pairwise similarities. In SigLIP, no N×N matrix is needed, since each positive/negative pair is scored independently, which reduces computational overhead through a memory-efficient loss calculation.

SigLIP 2 Over SigLIP

SigLIP 2 models outperform the previous SigLIP versions at all model scales in key areas such as zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). One standout feature is the dynamic resolution (naflex) version, which is especially useful for tasks sensitive to aspect ratio and resolution.

Key Features of SigLIP 2


Training with Sigmoid & Location Aware Captioners (LocCa) Decoder

SigLIP 2 introduces a text decoder alongside the existing image and text encoders during training. For LocCa, a transformer decoder with cross-attention is attached to the vision encoder to achieve two key goals:

  1. Referring Expression (REF): Predicting bounding box coordinates for specific locations mentioned in textual descriptions.
  2. Grounded Captioning (GCAP): Creating captions based on specific object locations within an image.

Improved Fine-Grained Local Semantics

To improve fine-grained local semantics in image representation, SigLIP 2 adds two additional objectives: Global-Local Loss and Masked Prediction Loss.

  • Self-Distillation: Unlike traditional knowledge distillation, which uses a large “teacher” model to train a smaller “student” model, self-distillation uses the same model for both roles. It helps transfer knowledge from deeper network layers to shallower ones or from earlier training stages to later ones.
  • Global-Local Loss: This loss encourages local-to-global consistency. The vision encoder (acting as the student) processes small image patches and learns to match the full-image representation created by a teacher network.
  • Masked Prediction Loss: This loss replaces 50% of the embedded image patches with mask tokens and prompts the student model to match the teacher's features at the masked locations. This focuses learning on individual per-patch features rather than the full image (see the toy sketch after this list).
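Here is a toy sketch of the masked-prediction idea. The tensor shapes, the linear stand-ins for the student and teacher encoders, and the simple MSE objective are all illustrative simplifications, not SigLIP 2's actual training code.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative shapes: a batch of 2 images, 196 patch embeddings of width 768
B, P, D = 2, 196, 768
patch_embeddings = torch.randn(B, P, D)       # embedded image patches
mask_token = nn.Parameter(torch.zeros(D))     # learnable mask token

# Replace 50% of the patch embeddings with the mask token on the student side
mask = torch.rand(B, P) < 0.5
masked_input = torch.where(mask.unsqueeze(-1), mask_token, patch_embeddings)

# Stand-ins for the student and teacher encoders (the real ones are ViT towers)
student_encoder = nn.Linear(D, D)
teacher_encoder = nn.Linear(D, D)

student_features = student_encoder(masked_input)
with torch.no_grad():
    teacher_features = teacher_encoder(patch_embeddings)   # the teacher sees unmasked patches

# Match the student's features to the teacher's features at the masked locations only
loss = F.mse_loss(student_features[mask], teacher_features[mask])
print(loss)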

Better Adaptability to Different Resolutions

Since image models can be highly sensitive to changes in resolution and aspect ratio, SigLIP 2 introduces two approaches for handling this:

  • Fixed Resolution Variant: In this version, training resumes from a checkpoint at which the model has already learned most patterns (about 95% of training completed). The positional embeddings are resized to match the target sequence length, and training continues at the new resolution (a small sketch of this resizing step follows this list).
  • Dynamic Resolution (NaFlex) Variant: The NaFlex variant builds on concepts from FlexiViT and NaViT to enable a single model to handle multiple sequence lengths and maintain the native aspect ratio of images. This reduces aspect ratio distortion and is particularly useful for tasks like OCR and document image processing.
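For the fixed-resolution case, the key operation is resizing the learned positional embeddings to the new patch grid. A small sketch of that step follows; the grid sizes, function name, and use of bilinear interpolation are assumptions for illustration rather than SigLIP 2's exact implementation.

import torch
import torch.nn.functional as F

def resize_positional_embeddings(pos_embed, new_grid):
    # pos_embed: (1, old_grid * old_grid, D) learned positional embeddings
    # new_grid:  target grid side length (e.g. 24 patches for 384px images with 16px patches)
    _, n, d = pos_embed.shape
    old_grid = int(n ** 0.5)
    grid = pos_embed.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)   # (1, D, g, g)
    grid = F.interpolate(grid, size=(new_grid, new_grid), mode="bilinear", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d)

pos = torch.randn(1, 196, 768)                        # 14x14 grid (224px, 16px patches)
print(resize_positional_embeddings(pos, 24).shape)    # torch.Size([1, 576, 768])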

Now that we have covered some of the key differentiating features of SigLIP 2, let us build an image retrieval system using it in Python.

Building an Image Retrieval System Using SigLIP 2 and Comparison with SigLIP

In the following hands-on tutorial, we will build an image retrieval system in which the user searches with an image query, and we will compare the responses from SigLIP 2 against SigLIP. We will use the T4 GPU (free tier) on Google Colab for the implementation.

Step 1. Installation of Necessary Libraries

!pip install datasets sentencepiece
!pip install faiss-cpu
# install the latest version of transformers from source (needed for SigLIP 2 support)
!pip install git+https://github.com/huggingface/transformers

Step 2. Loading the SigLIP 2 Model

import torch
import faiss
from torchvision import transforms

from PIL import Image
from transformers import AutoProcessor, SiglipModel, AutoImageProcessor, AutoModel, AutoTokenizer

import numpy as np
import requests

device = torch.device('cuda' if torch.cuda.is_available() else "cpu")

# SigLIP 2 checkpoint (the original SigLIP checkpoint is loaded later for comparison)
model = AutoModel.from_pretrained("google/siglip2-base-patch16-384").to(device)
processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-384")
tokenizer = AutoTokenizer.from_pretrained("google/siglip2-base-patch16-384")

Step 3. Functions for Processing Input Images, Generating Embeddings, and Storing Them in FAISS

def add_vector(embedding, index):
    # Move the embedding to CPU, cast to float32, L2-normalize, and add it to the FAISS index
    vector = embedding.detach().cpu().numpy()
    vector = np.float32(vector)
    faiss.normalize_L2(vector)
    index.add(vector)

def embed_siglip(image):
    # Preprocess the image and run it through the model to get its embedding
    with torch.no_grad():
        inputs = processor(images=image, return_tensors="pt").to(device)
        image_features = model.get_image_features(**inputs)
        return image_features

add_vector: This function takes a tensor embedding, normalizes it, and adds it to a FAISS index for efficient similarity searching.

embed_siglip: This function takes an image, processes it, passes it through a model to obtain its embedding (feature representation), and returns these features.

Step 4. Loading Image Dataset

API_TOKEN=""
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://datasets-server.huggingface.co/rows?dataset=ceyda/fashion-products-small&config=default&split=train"

def query():
    response = requests.get(API_URL, headers=headers)
    return response.json()
data = query()

Here we fetch an image dataset of fashion products through the Hugging Face datasets server using the requests library, after first defining the Hugging Face API token.

Step 5. Storing the Embeddings in FAISS Vector Database

index = faiss.IndexFlatL2(768)  # 768 = embedding dimension of the base-sized SigLIP/SigLIP 2 models

# read the image and add vector
for elem in data["rows"]:
  url = elem["row"]["image"]["src"]
  image = Image.open(requests.get(url, stream=True).raw)
  #Generate Embedding of Image
  clip_features = embed_siglip(image)
  #Add vector to FAISS
  add_vector(clip_features,index)

#Save the index 
faiss.write_index(index,"./siglip_70k.index")

Step 6. Querying the Model

url = "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRsZ4PhHTilpQ5zsG51SPZVrgEhdSfQ7_cg1g&s"
image = Image.open(requests.get(url, stream=True).raw)

with torch.no_grad():
  inputs = processor(images=image, return_tensors="pt").to(device)
  input_features = model.get_image_features(**inputs)

# Normalize the query embedding and retrieve the 3 nearest neighbours from the index
input_features = input_features.detach().cpu().numpy()
input_features = np.float32(input_features)
faiss.normalize_L2(input_features)
distances, indices = index.search(input_features, 3)

Now that we have built the retrieval system, let us test it with a few image queries and see how it performs.

Hands-on Retrieval Testing

Since this is a fashion dataset, we want to query on some fashion products and check if the model is able to fetch similar looking products from the database.

We will first query the model with this tan-colored women's bag.

Query image: a tan-colored women's bag

Let us now check the 3 most similar products fetched by the model for this query.

Testing on SigLIP 2 Model

# DISPLAYING SIMILAR IMAGES
from IPython.display import display  # available by default in notebooks

for elem in indices[0]:
  url = data["rows"][elem]["row"]["image"]["src"]
  image = Image.open(requests.get(url, stream=True).raw)
  # Resize to a fixed width while preserving the aspect ratio, then display
  width = 300
  ratio = width / float(image.size[0])
  height = int(float(image.size[1]) * ratio)
  img = image.resize((width, height), Image.Resampling.LANCZOS)
  display(img)

Output from SigLIP 2 Model

Retrieved images: top 3 matches from SigLIP 2

As seen from the output of the SigLIP 2 model, all the retrieved images of bags are close to our queried bag.

Testing on SigLIP Model

Let us now check the same query with the SigLIP model. We can load it in Step 2 by simply swapping in the following code:

import torch
import faiss
from torchvision import transforms

from PIL import Image
from transformers import AutoProcessor, SiglipModel, AutoImageProcessor, AutoModel, AutoTokenizer

import numpy as np
import requests

device = torch.device('cuda' if torch.cuda.is_available() else "cpu")

# SigLIP (v1) checkpoint for comparison
model = SiglipModel.from_pretrained("google/siglip-base-patch16-384").to(device)
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-384")
tokenizer = AutoTokenizer.from_pretrained("google/siglip-base-patch16-384")

The subsequent steps can then be re-run as before to rebuild the index and query it with SigLIP embeddings; a small helper that wraps this up is sketched below.
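To keep the comparison tidy, one convenient pattern is a helper that rebuilds the FAISS index for whichever checkpoint is being tested. This wrapper is my own addition (reusing the imports and device from Step 2), not part of the original walkthrough.

def build_index(model_name, rows, dim=768):
    # Load the checkpoint and build a FAISS index of its image embeddings
    mdl = AutoModel.from_pretrained(model_name).to(device)
    proc = AutoProcessor.from_pretrained(model_name)
    idx = faiss.IndexFlatL2(dim)
    for elem in rows:
        image = Image.open(requests.get(elem["row"]["image"]["src"], stream=True).raw)
        with torch.no_grad():
            feats = mdl.get_image_features(**proc(images=image, return_tensors="pt").to(device))
        vec = np.float32(feats.detach().cpu().numpy())
        faiss.normalize_L2(vec)
        idx.add(vec)
    return mdl, proc, idx

# Example: build one index per checkpoint and query both with the same image
# siglip2_model, siglip2_proc, siglip2_index = build_index("google/siglip2-base-patch16-384", data["rows"])
# siglip_model,  siglip_proc,  siglip_index  = build_index("google/siglip-base-patch16-384",  data["rows"])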

Output from SigLIP Model

Retrieved images: top 3 matches from SigLIP

As seen from the output of the SigLIP model, two of the retrieved bags are similar to those retrieved by the SigLIP 2 model. However, the third image retrieved by SigLIP is not a close match to our query, as its color is far from tan.

Let us try another query with the following input image.

Query image: a women's canvas shoe

Output from SigLIP 2 Model

Retrieved images: top 3 matches from SigLIP 2

As seen from the output of the SigLIP 2 model, all the retrieved women's shoes are canvas shoes and are close to our queried shoe.

Output from SigLIP Model

Retrieved images: top 3 matches from SigLIP

As seen from the output of the SigLIP model, two of the retrieved shoes are similar to those retrieved by the SigLIP 2 model. However, the third image retrieved by SigLIP is not a close match to our query image, as it is not a canvas shoe.

Conclusion

SigLIP 2 represents a significant step forward in the evolution of image-text retrieval and vision-language models. Its advanced features, such as dynamic resolution and improved fine-grained semantic understanding, make it a powerful tool for enhancing image search capabilities across various applications. By addressing key limitations of previous models, SigLIP 2 offers more accurate and efficient image retrieval, positioning it as a valuable asset in fields like e-commerce, digital asset management, and social media.

Key Takeaways

  • SigLIP 2, developed by Google DeepMind, improves upon its predecessor by utilizing a unified training approach and sigmoid-based loss, offering more accurate and efficient image-text retrieval and zero-shot classification.
  • Unlike CLIP, which uses a Softmax function that can struggle with nuanced image-text comparisons, SigLIP 2 employs a more effective sigmoid loss function that works independently on each image-text pair, enhancing performance.
  • SigLIP 2 introduces the NaFlex variant, allowing the model to handle varying image resolutions and aspect ratios effectively, making it ideal for tasks such as OCR and document processing.
  • Through the use of self-distillation and enhanced training techniques like Global-Local Loss and Masked Prediction Loss, SigLIP 2 offers better semantic understanding, making it more adept at capturing detailed visual features.
  • SigLIP 2 features a Location Aware Captioners (LocCa) Decoder, enabling tasks like grounded captioning and predicting bounding box coordinates, further enhancing its capabilities for accurate image search and retrieval.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Frequently Asked Questions

Q1. What is SigLIP 2, and how does it improve image search capabilities?

A. SigLIP 2 is a state-of-the-art multilingual vision-language encoder developed by Google DeepMind. It improves image search by enhancing semantic understanding, enabling better image-text retrieval and zero-shot classification. Its unified training approach and sigmoid-based loss function offer superior performance compared to previous models.

Q2. What are the main features of SigLIP 2 that make it stand out?

A. SigLIP 2 introduces features like Location Aware Captioners (LocCa) Decoder for predicting bounding box coordinates and grounded captioning. It also improves fine-grained local semantics through self-distillation, Global-Local Loss, and Masked Prediction Loss, which make it more adept at handling detailed visual information.

Q3. What variants does SigLIP 2 come in?

A. SigLIP 2 models come in two main variants: FixRes and NaFlex. FixRes works with fixed resolution images, while NaFlex supports variable image aspect ratios and resolutions.

Q4. What are the key improvements in SigLIP 2 over SigLIP?

A. SigLIP 2 models outperform their predecessors in tasks like zero-shot classification, image-text retrieval, and localization tasks. They also offer better multilingual understanding and fairness due to a more diverse training dataset.

Nibedita completed her master’s in Chemical Engineering from IIT Kharagpur in 2014 and is currently working as a Senior Data Scientist. In her current capacity, she works on building intelligent ML-based solutions to improve business processes.
