Today we’re going to talk about training BERT for masked language modeling using Hugging Face Transformers. But first, let’s take a step back and explain what the ***** that even means.
BERT is short for “Bidirectional Encoder Representations from Transformers,” which sounds fancy but basically just means it’s a really cool machine learning model that looks at the words on both sides of every position at once (left AND right context) to understand language. It was developed by Google and has been used to achieve state-of-the-art results on various natural language processing tasks.
Masked Language Modeling is another fancy term for predicting missing words in a sentence. By learning to fill in these blanks, the model picks up a sense of context that carries over to things like summarization and question answering.
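To make that more concrete, here’s a minimal sketch of masked language modeling in action. It uses the pretrained `bert-base-uncased` checkpoint and the `fill-mask` pipeline from Hugging Face Transformers; the sentence is just a made-up example:
# Ask a pretrained BERT to fill in a masked word
from transformers import pipeline  # pipeline() wraps a model for quick inference

fill_mask = pipeline("fill-mask", model="bert-base-uncased")  # BERT with its masked language modeling head
predictions = fill_mask("The capital of France is [MASK].")  # [MASK] marks the word we want BERT to predict
for p in predictions:
    print(p["token_str"], p["score"])  # Each candidate word comes with a probability score
BERT fills in the blank with its best guesses (for this made-up sentence, something like “paris” with a high score), which is exactly the skill the training code below reinforces.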
So how do we train BERT to do this? Well, first we need some data. Let’s say we have a dataset called “raw_datasets” that contains our text. We can use Hugging Face Transformers to preprocess the data by converting it into tokenized format (i.e., breaking it down into the word and subword pieces that BERT actually understands).
Here’s some code to do this:
# Import necessary libraries
from datasets import Dataset  # Hugging Face Datasets, used here to hold the raw text
from tqdm import tqdm  # tqdm provides progress bars for loops
from transformers import BertTokenizerFast  # Fast WordPiece tokenizer for BERT

# Define the tokenizer to be used
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")  # from_pretrained() downloads the pretrained BERT tokenizer from the Hugging Face Hub

# Example usage:
# Let's say we have a dataset called "raw_datasets" that contains our text data
# (in a real project you'd load it with datasets.load_dataset() instead)
raw_datasets = Dataset.from_dict({
    "text": ["This is a sample text.", "Another sample text.", "Yet another sample text."]
})

# Define a function to iterate through the dataset in batches
def batch_iterator(batch_size=10000):
    for i in tqdm(range(0, len(raw_datasets), batch_size)):  # tqdm tracks the progress of the loop
        yield raw_datasets[i : i + batch_size]["text"]  # yield one batch of raw text strings at a time

# Preprocess the data by converting it into tokenized format using the defined tokenizer
for batch in batch_iterator():
    # padding and truncation give every sequence the same length; return_tensors="pt" returns PyTorch tensors
    tokenized_batch = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
    # Do further processing on the tokenized batch, such as passing it to a BERT model for training or inference.
This code uses the `BertTokenizerFast` class from Hugging Face Transformers to tokenize our data. We’re also using a progress bar (`tqdm`) to make it look fancy and show us how much of the dataset we’ve processed so far.
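If you want to see what the tokenizer actually hands back, here’s a quick peek (the sentence is made up, and the exact ids depend on the checkpoint you load):
# Peek at what the tokenizer produces for a single sentence
sample = tokenizer("This is a sample text.", return_tensors="pt")
print(sample["input_ids"])  # Tensor of token ids, with the special [CLS] and [SEP] tokens added
print(sample["attention_mask"])  # 1 for real tokens, 0 for any padding
print(tokenizer.convert_ids_to_tokens(sample["input_ids"][0].tolist()))  # The actual WordPiece tokens
Notice that BERT works with WordPiece tokens, so a rare word may get split into several pieces rather than staying one “word.”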
Now that we have our preprocessed data, let’s train BERT! Here’s some code for that:
# Import necessary libraries
import torch  # PyTorch, used for tensors and training
from torch.optim import AdamW  # AdamW optimizer (the copy that used to live in transformers is deprecated)
from transformers import BertForMaskedLM, BertTokenizerFast, DataCollatorForLanguageModeling

# Define tokenizer and model
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")  # Same checkpoint as before
model = BertForMaskedLM.from_pretrained("bert-base-uncased")  # BERT with a masked language modeling head on top
model.train()  # Put the model in training mode

# The data collator randomly masks 15% of the tokens and builds the labels:
# masked positions keep their original token id, every other position becomes -100 and is ignored by the loss
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# Define optimizer and number of epochs
optimizer = AdamW(model.parameters(), lr=5e-5)  # AdamW optimizer with a learning rate of 5e-5
num_epochs = 3  # Setting the number of epochs for training

# Training loop
for epoch in range(num_epochs):  # Looping through the specified number of epochs
    for text in batch_iterator():  # Looping through the batch iterator to get batches of raw text
        encodings = tokenizer(text, truncation=True, max_length=128)  # Tokenize the text (the collator handles padding)
        features = [{"input_ids": ids} for ids in encodings["input_ids"]]
        batch = data_collator(features)  # Pad the batch, mask ~15% of the tokens, and create the labels tensor
        outputs = model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            labels=batch["labels"],
        )  # Forward pass: the model returns the masked LM loss when labels are provided
        loss = outputs.loss  # Loss for the current batch
        optimizer.zero_grad()  # Resetting the gradients to zero
        loss.backward()  # Backpropagating the loss
        optimizer.step()  # Updating the parameters of the model using the optimizer
This code uses the `BertForMaskedLM` class from Hugging Face Transformers to train BERT for masked language modeling. We’re also using AdamW as our optimization algorithm and setting the learning rate to 5e-5. The rest of this code should be pretty self-explanatory, but let me break it down:
1. Load the pretrained model (`BertForMaskedLM`) we want to use for training.
2. Create an optimizer object using AdamW and set its learning rate to 5e-5.
3. Set the number of epochs we want to train for (in this case, 3).
4. Loop through each batch of raw text coming out of `batch_iterator()`.
5. Convert the text into a format that BERT can use with the `tokenizer`.
6. Create the masked language modeling task: the data collator swaps roughly 15% of the tokens for `[MASK]` and builds a labels tensor where every position we don’t want predicted is set to -100 so the loss ignores it (see the sketch after this list for what that looks like by hand).
7. Calculate the loss for this batch using the model’s forward pass (when `labels` are passed in, the loss comes back as `outputs.loss`) and backpropagate the error through the network using `loss.backward()`.
8. Update the weights of our model using `optimizer.step()`.
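If you’re wondering what the data collator actually does to each batch, here’s a rough, simplified sketch of the masking step. It leaves out the 10% random-token and 10% keep-original tricks from the full BERT recipe, and the helper name `mask_tokens_simplified` is just something I made up for illustration:
# Simplified version of what DataCollatorForLanguageModeling does to a batch of input_ids
import torch

def mask_tokens_simplified(input_ids, tokenizer, mlm_probability=0.15):
    labels = input_ids.clone()  # Start from a copy of the original token ids
    # Never mask special tokens like [CLS], [SEP], or [PAD]
    special = torch.tensor(
        [tokenizer.get_special_tokens_mask(ids, already_has_special_tokens=True) for ids in labels.tolist()],
        dtype=torch.bool,
    )
    probs = torch.full(labels.shape, mlm_probability)  # 15% chance of masking each position
    probs.masked_fill_(special, 0.0)  # Special tokens get a 0% chance
    masked = torch.bernoulli(probs).bool()  # Randomly pick the positions to mask
    labels[~masked] = -100  # Only the masked positions count toward the loss
    masked_inputs = input_ids.clone()
    masked_inputs[masked] = tokenizer.mask_token_id  # Replace the chosen tokens with [MASK] in the inputs
    return masked_inputs, labels
The model then only gets graded on how well it recovers the tokens hiding behind [MASK], which is the whole point of masked language modeling.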
That’s how to train BERT for masked language modeling using Hugging Face Transformers in Python. It might seem a bit complicated at first, but once you get the hang of it, it’s actually pretty straightforward. And with all these fancy terms like “BERT” and “masked language modeling,” you can impress your friends by sounding like an AI expert!