So what does “masked” mean in this context? Well, imagine you have a sentence like “The quick brown fox jumps over the lazy dog.” Now let’s say we remove one of those words and see whether our model can guess which word it was based on the surrounding text. That’s where masked language modeling comes in: instead of trying to predict the entire sentence, we just focus on filling in the blank left by the removed word.
Here’s an example script using Python and the Hugging Face Transformers library:
# Import necessary libraries
from transformers import DistilBertTokenizer, DistilBertForMaskedLM # Tokenizer and masked language model
import torch # Tensor operations

# Load the pre-trained tokenizer and model from the Hugging Face Hub
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForMaskedLM.from_pretrained('distilbert-base-uncased')

# Define the input text with the word "jumps" replaced by the special [MASK] token
input_text = 'The quick brown fox [MASK] over the lazy dog.'

# Tokenize the text; the tokenizer adds the [CLS] and [SEP] special tokens for us
inputs = tokenizer(input_text, return_tensors='pt')

# Find the position of the [MASK] token in the tokenized input
masked_index = torch.where(inputs['input_ids'][0] == tokenizer.mask_token_id)[0].item()

# Run the model and get logits over the vocabulary for every position
with torch.no_grad():
    logits = model(**inputs).logits

# Take the highest-scoring vocabulary id at the masked position and decode it back to a word
predicted_id = logits[0, masked_index].argmax(-1).item()
print("Predicted word:", tokenizer.decode([predicted_id]))
So in this example, we’re using the DistilBertForMaskedLM model to predict which word was hidden behind the [MASK] token. The model outputs a score for every word in its vocabulary at every position in the sentence; we take the highest-scoring vocabulary id at the masked position and use the tokenizer to decode it back into an actual word.
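If you’d rather not handle the tensors yourself, the Transformers pipeline helper wraps the same tokenize-predict-decode steps. Here’s a minimal sketch that prints the five most likely fillers for the masked slot along with their scores:
from transformers import pipeline

# The fill-mask pipeline bundles the tokenizer and model steps from the script above
fill_mask = pipeline('fill-mask', model='distilbert-base-uncased')

for candidate in fill_mask('The quick brown fox [MASK] over the lazy dog.', top_k=5):
    print(candidate['token_str'], round(candidate['score'], 3))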
As for how it works under the hood, DistilBertForMaskedLM is the DistilBERT encoder with a masked-language-modeling head on top. DistilBERT is a smaller, faster version of BERT produced through knowledge distillation (hence “distilled”): it keeps the same general Transformer architecture but uses half as many layers, while retaining most of BERT’s performance on downstream tasks like masked language modeling and classification. The exact details of the distillation process are beyond the scope of this explanation, but if you’re interested in how BERT works or how DistilBERT differs from it, there are plenty of resources available online, starting with the DistilBERT paper and the Hugging Face documentation!
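If you want to see the size difference for yourself, one quick check (a minimal sketch; it downloads both sets of weights, and the exact numbers depend on the checkpoints) is to load the two encoders and compare their parameter counts:
from transformers import AutoModel

# Load both encoders (without a task head) and compare their sizes
bert = AutoModel.from_pretrained('bert-base-uncased')
distilbert = AutoModel.from_pretrained('distilbert-base-uncased')

print(f"BERT parameters:       {bert.num_parameters():,}")
print(f"DistilBERT parameters: {distilbert.num_parameters():,}")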