The model was trained on a large-scale, noisy corpus of image–alt-text pairs rather than on a curated, labeled dataset such as ImageNet; the kakaobrain/align-base checkpoint used below was trained on the COYO-700M dataset of roughly 700 million image–text pairs.
AlignVisionModel is the vision tower of ALIGN: an EfficientNet-based encoder that takes a batch of pixel values (the paired text descriptions are handled by a separate text tower during training). For each image it produces the hidden states of its last stage together with a pooled output obtained by pooling over the spatial dimensions, and these features can be reused for downstream tasks such as image classification or object detection.
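Conceptually, the two towers are trained with a contrastive objective: matching image–text pairs are pulled together in a shared embedding space while mismatched pairs are pushed apart. The sketch below only illustrates that objective; the batch size, embedding dimension, and temperature are made-up values, not ALIGN's actual training configuration.
import torch
import torch.nn.functional as F
# Hypothetical batch of already-projected, L2-normalized embeddings
image_embeds = F.normalize(torch.randn(8, 640), dim=-1)
text_embeds = F.normalize(torch.randn(8, 640), dim=-1)
temperature = 0.07  # illustrative value; ALIGN learns its temperature during training
# Pairwise image-text similarities; matching pairs sit on the diagonal
logits = image_embeds @ text_embeds.t() / temperature
targets = torch.arange(logits.size(0))
# Symmetric cross-entropy over rows (image-to-text) and columns (text-to-image)
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2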
To use AlignVisionModel in your own code, you can first load it using its pre-trained weights from Hugging Face’s model hub:
# Import the model and processor classes
from transformers import AutoProcessor, AlignVisionModel
# Load the pre-trained ALIGN vision encoder from the Hugging Face Hub
model = AlignVisionModel.from_pretrained("kakaobrain/align-base")
# Load the matching processor for the same checkpoint (it handles image preprocessing)
processor = AutoProcessor.from_pretrained("kakaobrain/align-base")
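Note that AlignVisionModel covers only the vision tower. If you also want the text side, for example to score how well candidate captions match an image, the full AlignModel bundles both encoders behind the same checkpoint. A rough sketch (the URL and the captions are placeholders; image loading for AlignVisionModel itself is covered in the next step):
from transformers import AlignProcessor, AlignModel
from PIL import Image
import requests
import torch
full_model = AlignModel.from_pretrained("kakaobrain/align-base")         # text and vision towers together
full_processor = AlignProcessor.from_pretrained("kakaobrain/align-base")
image = Image.open(requests.get("https://example.com/image.jpg", stream=True).raw)  # placeholder URL
candidate_texts = ["a photo of a cat", "a photo of a dog"]                           # hypothetical captions
batch = full_processor(text=candidate_texts, images=image, return_tensors="pt")
with torch.no_grad():
    out = full_model(**batch)
probs = out.logits_per_image.softmax(dim=1)  # how strongly the image matches each caption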
Once you have loaded the model and processor, you can use them to preprocess your input images and generate image embeddings:
from PIL import Image  # image loading and decoding
import requests        # HTTP requests for remote images
import io              # in-memory byte buffers
import os              # file-system checks
import torch           # tensors and device handling
url = "https://example.com/image.jpg"   # URL of a remote image
img_file = "/path/to/local/image.jpg"   # path to a local image
# Run on GPU when one is available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
if url is not None:  # prefer the remote image when a URL is given
    response = requests.get(url)                    # download the image bytes
    img = Image.open(io.BytesIO(response.content))  # decode them into a PIL image
elif os.path.isfile(img_file):  # otherwise fall back to the local file
    img = Image.open(img_file)
else:
    raise ValueError("Invalid image source")  # neither source is usable
# Preprocess the image into the tensor format the model expects
inputs = processor(images=img, return_tensors="pt").to(device)
After preprocessing your inputs, you can pass them through AlignVisionModel to extract image features:
# Run a forward pass; no gradients are needed for feature extraction
with torch.no_grad():
    outputs = model(**inputs)
# Spatial feature map produced by the final stage of the encoder
last_hidden_state = outputs.last_hidden_state
# The same features pooled over the spatial dimensions
pooler_output = outputs.pooler_output
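As a quick sanity check you can print the shapes of both tensors. The exact channel count and spatial resolution depend on the EfficientNet variant behind the checkpoint and on the input size, so treat the comments below as the expected layout rather than fixed numbers:
print(last_hidden_state.shape)  # (batch_size, channels, height, width): spatial feature map
print(pooler_output.shape)      # (batch_size, channels): pooled feature vector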
Finally, you can feed the pooled output to a downstream classifier and the spatial hidden states to a downstream detector:
# Use the pooled feature vector as input to a downstream image classifier
classifier = ...  # load your own pre-trained classification head here
outputs_cls = classifier(pooler_output)  # class logits
predictions_cls = torch.argmax(outputs_cls, dim=1).detach().cpu().numpy()  # predicted class indices as a NumPy array
# Use the spatial feature map as input to a downstream object detector
detector = ...  # load your own pre-trained detection head here
outputs_obj = detector(last_hidden_state)  # detector outputs, e.g. per-location class logits
predictions_obj = torch.argmax(outputs_obj, dim=1).detach().cpu().numpy()  # predicted classes as a NumPy array
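If you do not already have a classification head, a minimal linear-probe sketch over the pooled features could look like this. The number of classes is a hypothetical placeholder and the linear layer is randomly initialized, so it would still need to be trained on your own labels before its predictions mean anything:
import torch.nn as nn
num_classes = 10                       # hypothetical number of target classes
feature_dim = pooler_output.shape[-1]  # channel dimension of the pooled features
linear_probe = nn.Linear(feature_dim, num_classes).to(device)  # untrained linear head
logits = linear_probe(pooler_output)           # (batch_size, num_classes)
predicted_class = torch.argmax(logits, dim=1)  # index of the highest-scoring class per image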
More broadly, vision-language models (VLMs) have advanced rapidly in recent years and now handle tasks such as image classification, object detection, and visual question answering, with model sizes growing into the billions of parameters. Recent work on open-vocabulary object detection shows that a standard Vision Transformer with minimal modifications, combined with contrastive image-text pre-training and end-to-end detection fine-tuning, transfers remarkably well to this setting: scaling both model size and training data yields consistent gains on zero-shot text-conditioned and one-shot image-conditioned detection, and the adaptation strategies and regularizations needed to reach these results have been released on GitHub.