Using Llama-cpp for Multi-Modal Models in Python


Well, a “multi-modal model” is just fancy tech speak for when you want your AI to be able to handle multiple types of input data. For example, maybe you have some text and images that you want the model to analyze together in order to make predictions or generate responses based on both sources of information.

Now, there are a lot of different ways to go about building these kinds of models, but one popular approach is to build on Llama-cpp, i.e. llama.cpp, a C/C++ engine for running large language models locally, together with its Python bindings, llama-cpp-python. The bindings give you a high-level Python API for loading and querying these models, including support for multi-modal (vision plus text) models in the LLaVA family!
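
Installing the bindings is a single pip install llama-cpp-python (add the appropriate build flags if you want GPU or Metal acceleration). As a quick sanity check that the install worked, assuming a reasonably recent release that exposes a version string:

# Confirm that the llama-cpp-python bindings can be imported
import llama_cpp
print(llama_cpp.__version__)  # prints the installed version of the bindings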

So how do you actually use it? Well, first things first: make sure llama-cpp-python is installed on your machine (the pip command above is all you need). Then, create a new Python script and import the necessary libraries:

# Import necessary libraries
import os        # walk the image directory
import io        # in-memory byte buffers for encoding images
import base64    # base64-encode images into data URIs for the model
import pandas as pd              # read the text data from a CSV file
from PIL import Image            # load and convert the image files
from llama_cpp import Llama                                  # main model class
from llama_cpp.llama_chat_format import Llava15ChatHandler   # LLaVA-style vision support

Next, you’ll need to load your text and image data into memory (assuming they’re stored in separate files or directories). For example:

# Load the text data from a CSV file
# Import the necessary libraries
import pandas as pd

# Read the CSV file and store it in a dataframe
df = pd.read_csv('data/text.csv')

# Convert the 'text' column of the dataframe into a list
texts = df['text'].tolist()

# Load the image data using PIL (Python Imaging Library)
# Import the necessary libraries
import os
from PIL import Image

# Create an empty list to store the images
images = []

# Loop through the files in the 'data/images' directory, sorted so the order
# is deterministic when pairing the images with the CSV rows later on
for filename in sorted(os.listdir('data/images')):
    # Only load files that end with '.jpg'
    if filename.endswith('.jpg'):
        # Open the image file and convert it to RGB format
        path = os.path.join('data', 'images', filename)
        images.append(Image.open(path).convert('RGB'))

This loads the text data from the CSV file into a list and reads every JPEG in the data/images directory into a PIL image in RGB format. The rest of the walkthrough assumes the CSV rows and the (sorted) image files line up one-to-one.

Now, you’re ready to create your multi-modal input data! This involves combining the text and image data into a format the model can work with. One popular approach is to attach “image captions” (i.e., short descriptions of what’s happening in each image) to your text, which you can generate by running the images through an image-captioning model or by writing them yourself (one concrete captioning option is sketched right after the next snippet):

# Generate a caption for each image using a separate image-captioning model
# (caption_model is a placeholder for whatever captioning model you choose)
captions = []  # empty list to store the (text, caption) pairs
for image, text in zip(images, texts):  # walk through the images and texts together
    # Run the image through your chosen captioning model to generate a caption
    caption = caption_model(image)  # placeholder call, not part of Llama-cpp
    # Add the caption and its corresponding text to our list of multi-modal input data
    captions.append((text, caption))
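
One concrete way to fill in the caption_model placeholder (purely a sketch, and entirely separate from Llama-cpp) is a small captioning model from the Hugging Face transformers library; the example below assumes the BLIP base checkpoint and that transformers and torch are installed:

# A minimal caption_model built on Hugging Face transformers (BLIP)
from transformers import BlipProcessor, BlipForConditionalGeneration

blip_processor = BlipProcessor.from_pretrained('Salesforce/blip-image-captioning-base')
blip_model = BlipForConditionalGeneration.from_pretrained('Salesforce/blip-image-captioning-base')

def caption_model(image):
    # Preprocess the PIL image into tensors the captioning model expects
    inputs = blip_processor(images=image, return_tensors='pt')
    # Generate a short caption and decode it back into plain text
    output_ids = blip_model.generate(**inputs, max_new_tokens=30)
    return blip_processor.decode(output_ids[0], skip_special_tokens=True)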


Finally, you can pass your text and images to Llama-cpp for processing! In llama-cpp-python, multi-modal input goes through a LLaVA-style chat handler: you load a GGUF language model together with its matching CLIP/mmproj projector file, and then send an image and some text together in a single chat message. The model and projector paths below are placeholders, so point them at whatever LLaVA-style checkpoint you have downloaded:

# Load a LLaVA-style multi-modal model
# (both paths below are placeholders for your own downloaded files)
chat_handler = Llava15ChatHandler(clip_model_path='data/mmproj.gguf')
model = Llama(
    model_path='data/llava-model.gguf',  # the GGUF language model weights
    chat_handler=chat_handler,           # wires in the vision projector
    n_ctx=2048,                          # leave room in the context window for the image embedding
)

# Helper: turn a PIL image into a base64 data URI that the chat handler can consume
def image_to_data_uri(img):
    buffer = io.BytesIO()
    img.save(buffer, format='JPEG')
    encoded = base64.b64encode(buffer.getvalue()).decode('utf-8')
    return 'data:image/jpeg;base64,' + encoded

# Define a function for processing one multi-modal input (a text prompt plus an image)
def process_input(text, img):
    # Send the image and the text together as a single chat message
    response = model.create_chat_completion(
        messages=[
            {
                'role': 'user',
                'content': [
                    {'type': 'image_url', 'image_url': {'url': image_to_data_uri(img)}},
                    {'type': 'text', 'text': text},
                ],
            }
        ],
        max_tokens=256,  # cap the length of the generated reply
    )
    # Return the generated text for downstream processing or analysis
    return response['choices'][0]['message']['content']

And that’s it! You can now call this function on each text/image pair you loaded earlier (as in the short loop below), and Llama-cpp will handle the rest. Of course, there are many other ways to approach building these kinds of models (and plenty of other libraries you could use instead), but hopefully this gives you a good starting point for exploring the world of multi-modal AI in Python!
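
For example, a minimal driver that runs every pair through the model (assuming the texts and images really do line up one-to-one) might look like this:

# Run every text/image pair through the model and print the responses
for text, img in zip(texts, images):
    print(process_input(text, img))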
