Llama 2 Quick Start Guide

Do you want something that’s easy and fast? Look no further than Llama 2! In this quick start guide, we’ll show you how to get started with fine-tuning your very own customized language model in just a few steps.

First: make sure you have the necessary tools installed on your computer. You’ll need Python (version 3.8 or higher), PyTorch, and the Hugging Face Transformers and Datasets libraries, plus wget and pdftotext for fetching and converting the data. If you don’t already have these, install them now!
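If you’re missing the Python packages, a quick pip install will get you there (this exact package set is an assumption based on what the training script below uses). The Llama 2 weights are also gated on Hugging Face, so you’ll want to log in with an account that has been granted access:

# Install PyTorch, Transformers, and the Datasets library
pip install torch transformers datasets

# Log in so the gated Llama 2 weights can be downloaded
huggingface-cli login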

Once you have everything set up, let’s get started with training our model. First, we’re going to need some data. For this example, we’ll be using the text from a popular book called “The Catcher in the Rye”. You can find it online or download it as a PDF if you prefer.

Next, let’s create a new directory for our project and navigate into it:

# Create a new directory called "llama2-project" and navigate into it
# (the && operator lets us run both commands on one line: the first
# creates the directory, the second navigates into it)
mkdir llama2-project && cd llama2-project/
Now that we have our directory set up, let’s download the text data using wget. This will save us time in the long run since we won’t need to manually copy and paste it into our project folder:

# This script uses the wget command to download a PDF file from a specified URL.

# The wget command is used to retrieve content from web servers.
# The -q flag suppresses the output of the command.
# The -N flag only downloads the file if it is newer than the existing one.
# The -P flag specifies the directory where the file will be saved.
# (Note: wget's -O flag cannot be combined with -N, so we let wget keep
# the remote file name instead of renaming the download.)

# The URL of the file to be downloaded.
url="https://example.com/the-catcher-in-the-rye.pdf"

# The directory where the file will be saved.
directory="data"

# Create the directory if it doesn't exist, then download into it.
mkdir -p "$directory"
wget -q -N -P "$directory" "$url"

# The -N flag ensures the file is only re-downloaded if the remote copy is
# newer than the local one, saving time in the long run.
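Once the download finishes, it’s worth a quick check that the file actually landed in the data directory:

# List the data directory to confirm the download
ls -lh data/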

Once the download is complete, let’s convert the PDF file into plain text using pdftotext (part of the poppler-utils package). The training script can only read plain text, so this step is essential:

# This script converts a PDF file into a text format using pdftotext

# The -layout flag preserves the original layout of the PDF file
# The first argument is the input PDF (saved in the data/ directory above)
# and the second is the output text file

pdftotext -layout data/the-catcher-in-the-rye.pdf catcher_text.txt
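Before moving on, take a quick peek at the extracted text to confirm the conversion worked:

# Show the first few lines of the extracted text
head catcher_text.txt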

Now that we have our data, let’s create a new Python script called `train.py`. This is where all of our training code and configuration will live:

# Create a new file called train.py using the touch command
touch train.py

Open up the file in your favorite text editor or IDE (we recommend VS Code) and add the following code:

# Import the necessary libraries
from transformers import (
    LlamaForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
)
from datasets import Dataset

# Path to the text file we extracted earlier
data_file = "catcher_text.txt"

# Load the pre-trained model we'll be using as a starting point. We want a
# causal language model (next-token prediction), not a classification head.
# The Llama 2 weights are gated, so your Hugging Face account needs access.
model_name = "meta-llama/Llama-2-13b-hf"
model = LlamaForCausalLM.from_pretrained(model_name)

# Load the tokenizer we'll be using to convert text into numerical values
# (i.e., tokens)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 has no pad token by default

# Set up our training arguments (e.g., number of epochs, batch size)
args = TrainingArguments(
    output_dir="./outputs",
    num_train_epochs=1,
    per_device_train_batch_size=8,
)

# Define our training function, which will take care of loading in our data,
# tokenizing it, and running the model through it:
def train():
    # Load in our text data, one training example per non-empty line
    with open(data_file) as f:
        lines = [line.strip() for line in f if line.strip()]

    # Convert each line into tokens (i.e., numerical values that represent
    # words or word pieces); truncation keeps examples within the context window
    dataset = Dataset.from_dict({"text": lines}).map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True,
        remove_columns=["text"],
    )

    # The collator pads each batch and builds the labels for causal language
    # modeling (mlm=False means we're not doing masked language modeling)
    collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

    # Trainer is a class from the transformers library that handles the
    # training loop for us
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=dataset,
        data_collator=collator,
    )
    trainer.train()

    # Save the final weights and tokenizer so we can load them later
    trainer.save_model()
    tokenizer.save_pretrained(args.output_dir)

# Call our train function to start the actual training process!
if __name__ == "__main__":
    train()

Save and close your file. Now let’s run it:

# Run the training script with the Python interpreter
python train.py

And that’s it! Your model will now be fine-tuned on the text data you provided. Keep in mind that a model of this size is compute-intensive to train, so expect it to take a while on anything short of serious GPU hardware.
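Once training finishes, here’s a minimal sketch of how you might load the fine-tuned weights and generate text with them (this assumes train.py saved them to ./outputs as shown above; the prompt is just an example):

# generate.py - load the fine-tuned model and generate a continuation
from transformers import LlamaForCausalLM, AutoTokenizer

model = LlamaForCausalLM.from_pretrained("./outputs")
tokenizer = AutoTokenizer.from_pretrained("./outputs")

# Encode a prompt and generate up to 50 new tokens
inputs = tokenizer("If you really want to hear about it,", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))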
