So let’s say you want to identify all the cats in an image. You could feed that image through CLIPSeg, along with your prompt (“cat”), and it would output a mask (like a black-and-white version of the original image) where everything that looks like a cat is highlighted in white.
Here’s how you might do this using Hugging Face Transformers:
1. First, make sure the required libraries are installed (`pip install transformers torch pillow matplotlib`). The CLIPSeg checkpoint (`CIDAS/clipseg-rd64-refined`) lives on the Hugging Face Model Hub and is downloaded and cached automatically the first time you load it, so there’s nothing else to add to your project.
2. Prepare your image data in a format the model can read. A standard JPEG or PNG opened with PIL is all you need; the CLIPSeg processor takes care of resizing and normalization.
3. Load the model and its processor using the `transformers` library:
# Import the CLIPSeg model and processor classes
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation
# Load the pre-trained CLIPSeg model from the Hugging Face Model Hub
# (`from_pretrained` downloads and caches the weights on the first call)
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")
# Load the matching processor, which handles both the text tokenization
# and the image preprocessing for this model
processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
4. Preprocess your input data into the tensors the model expects (tokenized text and normalized pixel values). Here’s an example:
# Import the PIL library for loading the image
from PIL import Image
# Load your input data (in this case, a single JPEG file)
# `.convert("RGB")` guards against grayscale or RGBA inputs
input_image = Image.open("cat-image.jpg").convert("RGB")
# The natural language prompt describing what you want to segment
prompt = "cat"
# The processor tokenizes the text and resizes/normalizes the image in one call,
# returning PyTorch tensors the model can consume directly
inputs = processor(text=[prompt], images=[input_image], padding=True, return_tensors="pt")
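If you’re curious what the processor actually produced, you can print the tensor names and shapes it returns. The exact image size comes from the checkpoint’s preprocessing config, so treat the 352x352 below as an expectation rather than a guarantee:
# Quick sanity check: list the tensors the processor returned
for name, tensor in inputs.items():
    print(name, tuple(tensor.shape))
# Expect input_ids and attention_mask for the text prompt,
# and pixel_values of roughly (1, 3, 352, 352) for the image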
5. Run the model on your preprocessed data and get back a segmentation mask:
# Import torch so inference can run without tracking gradients
import torch
# Run the model on the preprocessed inputs
with torch.no_grad():
    outputs = model(**inputs)
# The raw logits are a low-resolution heatmap (352x352 for this checkpoint);
# a sigmoid turns them into per-pixel "cat-ness" probabilities between 0 and 1
mask = torch.sigmoid(outputs.logits)
# Convert the mask to a numpy array for further processing and visualization
output_array = mask.squeeze().numpy()
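The mask is a grid of probabilities rather than the black-and-white image described at the top, but getting there is one line of thresholding (the 0.5 cutoff is an arbitrary choice you may want to tune for your images):
# Threshold the probabilities into a binary black-and-white mask
# (pixels above 0.5 become white, everything else black)
binary_mask = (output_array > 0.5).astype("uint8") * 255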
6. Visualize the output segmentation mask using a library like Matplotlib or OpenCV:
# Import matplotlib for visualization
import matplotlib.pyplot as plt
# Display the mask as a grayscale heatmap: bright regions are where the model sees a cat
plt.imshow(output_array, cmap="gray")
plt.axis("off")
plt.show()
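To see the mask in context, one option (a sketch, assuming the `binary_mask` from the thresholding step above) is to upscale it to the photo’s size and blend it over the original with Matplotlib:
# Upscale the 352x352 mask back to the original photo's dimensions
mask_image = Image.fromarray(binary_mask).resize(input_image.size)
# Draw the photo first, then the mask on top with partial transparency
plt.imshow(input_image)
plt.imshow(mask_image, cmap="Reds", alpha=0.5)
plt.axis("off")
plt.show()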
And that’s it! You now have a segmentation mask of your input image based on the natural language prompt you provided. Pretty cool, huh?
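One optional tweak: if you have a GPU, the same steps run there for faster inference. A minimal sketch (the walkthrough above assumes CPU, so both the model and the inputs need to move):
import torch
# Use the GPU when one is available, otherwise stay on the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
# Move the model and the processed inputs to the same device before running inference
model = model.to(device)
inputs = inputs.to(device)
with torch.no_grad():
    outputs = model(**inputs)
# Bring the result back to the CPU before converting to numpy
output_array = torch.sigmoid(outputs.logits).squeeze().cpu().numpy()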