Understanding Attention Masks in Transformer Models

In transformer models like BERT, an attention mask is an extra input that tells the model which tokens in a sequence it should pay attention to and which ones it should ignore.

It’s kind of like covering up certain words in a sentence so the model skips over them entirely when computing its output, while the uncovered words get processed as usual.

Here’s an example: say we have the text “The quick brown fox jumps over the lazy dog.” When the sentence is tokenized, the attention mask is a vector with one entry per token: a 1 means the model attends to that token, and a 0 means the token is ignored. The most common use is to zero out the padding tokens added when batching sentences of different lengths, but you can also zero out real tokens, say everything from “lazy” onward, if you want the model to ignore that part of the sentence.

Here’s what that might look like in code:

# Import necessary libraries
from transformers import BertTokenizerFast, TFBertForSequenceClassification

# Load the tokenizer and the BERT model for sequence classification with 2 labels
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Prepare input data (in this case, just a single sentence)
sentence = "The quick brown fox jumps over the lazy dog."
encoding = tokenizer(sentence, return_tensors='tf')    # Tokenize the sentence
input_ids = encoding['input_ids']                      # Token IDs, including [CLS] and [SEP]
attention_mask = encoding['attention_mask'].numpy()    # The default mask: all 1s (attend to every token)

# Find the sequence position of the token "lazy" so we can mask it and everything after it
tokens = encoding.tokens()           # ['[CLS]', 'the', 'quick', ..., 'lazy', 'dog', '.', '[SEP]']
lazy_pos = tokens.index('lazy')

# Set the mask to 0 from "lazy" onward: the mask is binary, so 1 means
# "attend to this token" and 0 means "ignore it" (there is no in-between value like 0.5)
attention_mask[:, lazy_pos:] = 0

# Run the model with the modified mask
outputs = model(input_ids, attention_mask=attention_mask)
print(outputs.logits)                # Classification logits computed from the unmasked tokens only

With this mask in place, the model attends normally to everything up through “over the”, while “lazy”, “dog”, the final period, and the trailing [SEP] token are ignored. Under the hood, positions whose mask value is 0 get a large negative bias added to their attention scores before the softmax, so the other tokens effectively pay them no attention.
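To connect this to training: if you also pass labels into the model call, it returns the loss you would minimize, computed with the masked tokens ignored. Here is a minimal sketch continuing from the snippet above (the label value 1 is just a made-up placeholder, not something from the original example):

import tensorflow as tf

# Hypothetical label for this single example (placeholder value)
labels = tf.constant([1])

# Passing labels makes the model return a loss alongside the logits
outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
print(outputs.loss)   # Cross-entropy loss; the masked positions do not contribute to the prediction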

So basically, by building an attention mask like this, we can tell the model exactly which parts of the input sequence it is allowed to use when making predictions. In everyday practice the tokenizer creates the mask automatically to hide padding tokens, but, as the example shows, you can also edit it by hand when you want the model to ignore specific tokens, which keeps irrelevant parts of the input from influencing the output.
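For completeness, here is the situation where attention masks show up most often and where you normally never build one by hand: when the tokenizer pads a batch of sentences to the same length, it generates the mask for you, with 0s marking the [PAD] tokens. A small self-contained sketch using the same bert-base-uncased tokenizer:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# Two sentences of different lengths; padding=True pads the shorter one to match the longer one
batch = tokenizer(
    ["The quick brown fox jumps over the lazy dog.", "Hello world"],
    padding=True,
    return_tensors='tf',
)

# 1 marks real tokens, 0 marks the [PAD] tokens appended to the shorter sentence
print(batch['attention_mask'])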
