EleutherAI’s Pythia-2.8b and 6.9b Models: Training Details and Checkpoints

Here’s an example of what fine-tuning one of these checkpoints on a custom labeled dataset might look like, using the Hugging Face transformers library:

# Import necessary libraries
from typing import List

import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset
from tqdm import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the pre-trained Pythia-2.8b checkpoint from EleutherAI's repository on the Hugging Face Hub
checkpoint = "EleutherAI/pythia-2.8b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)  # num_labels=2 assumes binary labels; adjust to match your data

# Pythia's tokenizer has no padding token by default, so reuse the end-of-text token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

# Set seed for reproducibility
seed = 42
torch.manual_seed(seed)
np.random.seed(seed)

# Define a custom Dataset class that tokenizes the raw text and returns padded tensors
class CustomDataset(Dataset):
    def __init__(self, data: List[str], labels: List[int]):
        # Tokenize all texts up front, padding/truncating to a fixed maximum length (adjust as needed)
        self.encodings = tokenizer(data, truncation=True, padding="max_length", max_length=128, return_tensors="pt")
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Return input IDs, attention mask, and label for a single example
        return self.encodings["input_ids"][idx], self.encodings["attention_mask"][idx], self.labels[idx]
    
# Load custom dataset from CSV file
df = pd.read_csv("custom_dataset.csv")
data = df["text"].tolist()
labels = [int(label) for label in df["label"]]

# Split into train/val sets and wrap them in DataLoaders for batching
train_size = int(len(data) * 0.8) # Use 80% of the data for training
train_dataset = CustomDataset(data[:train_size], labels[:train_size]) # Create train dataset
val_dataset = CustomDataset(data[train_size:], labels[train_size:]) # Create validation dataset
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True) # Example batch size; reduce if you run out of GPU memory
val_loader = DataLoader(val_dataset, batch_size=8)

# Define loss function, optimizer, and learning rate scheduler
criterion = torch.nn.CrossEntropyLoss() # Define cross entropy loss function
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4) # Define AdamW optimizer with learning rate of 0.0001
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50, eta_min=1e-6) # Define learning rate scheduler

# Move the model to GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Run the training loop on the training set
for epoch in range(3):
    # Set model to train mode
    model.train()

    # Loop over batches of training data
    for input_ids, attention_mask, batch_labels in tqdm(train_loader):
        # Move inputs and labels to the same device as the model
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        batch_labels = batch_labels.to(device)

        # Forward pass through the model
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)

        # Calculate loss and backpropagate gradients
        loss = criterion(outputs.logits, batch_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Step the learning rate scheduler once per epoch
    scheduler.step()
        
    # Evaluate the model on the validation set after each epoch
    model.eval()
    val_loss = 0.0
    with torch.no_grad(): # Disable gradient tracking during evaluation
        for input_ids, attention_mask, batch_labels in val_loader:
            # Move inputs and labels to the same device as the model
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            batch_labels = batch_labels.to(device)

            # Forward pass through the model
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)

            # Accumulate validation loss
            val_loss += criterion(outputs.logits, batch_labels).item()

    # Print training progress after each epoch
    print(f"Epoch {epoch+1}/3: Loss={loss.item():.4f}, Val Loss={val_loss/len(val_loader):.4f}")
 

# Save the fine-tuned weights to disk for future use
torch.save({
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scheduler": scheduler.state_dict(),
}, "trained_model.pth") # Save state dicts rather than whole objects for portability

In this example, we first load the pre-trained Pythia checkpoint from EleutherAI’s repository on the Hugging Face Hub and set a seed for reproducibility using `torch.manual_seed()`. We then define a custom Dataset class called `CustomDataset`, which subclasses `torch.utils.data.Dataset` and tokenizes the raw text so that each example is returned as input IDs, an attention mask, and a label.

Next, we load the data from the CSV file, split it into train and validation sets with an 80/20 slice, and wrap each set in a `DataLoader` for batching. We then define our loss function, optimizer, and learning rate scheduler using standard PyTorch classes. Finally, we run a simple training loop over the training set: each batch is moved to the GPU if one is available, forward passed through the model, and used to calculate the loss and backpropagate gradients before the weights are updated; after each epoch we step the scheduler, evaluate the model on the validation set, and print progress.

At the end of the script, we save the fine-tuned model’s state dict (along with the optimizer and scheduler states) to disk for future use using `torch.save()`.
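EleutherAI also publishes intermediate training checkpoints for each Pythia model as branches of the same Hugging Face repository, named by training step (e.g. `step1000`, `step143000`). To study a partially trained snapshot rather than the final weights, you can pass a `revision` argument when loading. Here is a minimal sketch, assuming the standard `transformers` API and that the branch name exists for the model you pick:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load an intermediate Pythia-2.8b checkpoint by pointing `revision` at a
# training-step branch on the Hugging Face Hub; "main" holds the final weights.
step_revision = "step3000"  # example branch name; see the model card for the full list
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-2.8b", revision=step_revision)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-2.8b", revision=step_revision)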
