Preparing a Dataset for BERT Pretraining

Before anything else, we need to download some data from the Hugging Face Hub. This is like going to the library, but instead of books, we get datasets! We use this command:

# Download the dataset archive from the Hugging Face Hub.
# wget retrieves content from web servers; the URL points to the archive we want.
wget https://huggingface.co/datasets/bookcorpus/resolve/main/zipped.tar.gz

This downloads a big ol’ file called `zipped.tar.gz`. It’s a compressed archive, so the next step is to unpack it.

Next, let’s extract the data from this file:

# Create a new directory called "extracted"
mkdir extracted
# Move into the newly created directory
cd extracted
# Extract the contents of the zipped.tar.gz file into the current directory
tar -xvzf ../zipped.tar.gz
# Move back to the previous directory
cd ..
# Remove the zipped.tar.gz file
rm zipped.tar.gz

This creates a new folder called `extracted`, moves into it, extracts the data from the file we downloaded earlier, and then deletes that file. Pretty cool!
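If you’d rather stay in Python for this step, the standard library’s `tarfile` module can do the same thing. This is just a sketch, assuming `zipped.tar.gz` sits in the current working directory:

import os
import tarfile

# Create the target directory (no error if it already exists)
os.makedirs("extracted", exist_ok=True)

# Open the gzipped tarball and extract everything into "extracted"
with tarfile.open("zipped.tar.gz", "r:gz") as archive:
    archive.extractall(path="extracted")

# Remove the archive once its contents are extracted
os.remove("zipped.tar.gz")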

Now let’s load this dataset using Python:

# Import the load_dataset function from the datasets library
from datasets import load_dataset

# Load the bookcorpus dataset; split="train" selects the training split,
# a large collection of sentences from books
bookcorpus = load_dataset("bookcorpus", split="train")

This loads the `bookcorpus` dataset from Hugging Face Hub and selects only the training data.
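If you want to sanity-check what came back, a quick peek at the object is enough. This is only a sketch; the exact row count depends on the dataset version you download:

# Print the dataset summary (features and number of rows)
print(bookcorpus)

# Look at the first example; bookcorpus has a single "text" column
print(bookcorpus[0]["text"])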

We can do the same thing for another dataset called `wikipedia`. This time, we’ll use a specific version of it:

# Load the `bookcorpus` dataset from Hugging Face Hub and select only the training data
bookcorpus = load_dataset("bookcorpus", split="train")

# Load the `wikipedia` dataset from Hugging Face Hub and select a specific version
wikipedia = load_dataset("wikipedia", "20220301.en", split="train")

This loads the English `wikipedia` dataset from the Hugging Face Hub, using the dump dated March 1st, 2022 (the `20220301.en` configuration), and again selects only the training split.
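One thing to note before we merge anything: unlike `bookcorpus`, the `wikipedia` dataset carries extra metadata columns alongside the article text, which matters when we combine the two below. A quick check (a sketch; column names are for the `20220301.en` configuration):

# Wikipedia examples include metadata columns in addition to the text
print(wikipedia.column_names)  # ['id', 'url', 'title', 'text']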

Now that we have both datasets loaded, let’s merge them together:

# Import the necessary functions
from datasets import load_dataset, concatenate_datasets

# Load the English 'wikipedia' dump from March 1st, 2022 (training split only)
wiki = load_dataset("wikipedia", "20220301.en", split="train")

# Load the 'bookcorpus' training split
bookcorpus = load_dataset("bookcorpus", split="train")

# Wikipedia articles also carry 'id', 'url', and 'title' columns; drop everything
# except 'text' so both datasets have matching features before merging
wiki = wiki.remove_columns([col for col in wiki.column_names if col != "text"])

# Merge the two datasets together using the concatenate_datasets function
raw_datasets = concatenate_datasets([bookcorpus, wiki])

# Print the merged dataset
print(raw_datasets)

# Output (the exact row count depends on the dataset versions you download):
# Dataset({
#     features: ['text'],
#     num_rows: 80462898
# })

This merges the `bookcorpus` and `wikipedia` datasets into a single dataset called `raw_datasets`.

We’re not going to do any fancy data preparation here (like removing duplicates or filtering out certain words), but you can definitely add that in if you want. For now, let’s just move on!
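If you do want a taste of that kind of cleanup, here is a minimal sketch using `Dataset.filter`; the 20-character threshold is an arbitrary illustration, not a tuned value:

# Hypothetical cleanup step: drop examples whose text is shorter than 20 characters.
# The threshold is arbitrary and only meant to illustrate the filter API.
cleaned_datasets = raw_datasets.filter(lambda example: len(example["text"].strip()) >= 20)

# Check how many rows survived the filter
print(cleaned_datasets)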

Next up: training a tokenizer. This is like teaching our computer how to read and understand English (or whatever language we choose). We use the `BertTokenizerFast` class from the Transformers library for this task:

# Import the BertTokenizerFast class from the Transformers library
from transformers import BertTokenizerFast

# Create an instance of the BertTokenizerFast class and assign it to the variable "tokenizer"
# The "from_pretrained" method loads a pre-trained tokenizer from the specified model
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

This loads a pre-trained tokenizer called `bert-base-uncased`. We can use it to convert our text into something that BERT (our fancy computer model) can understand:
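For example (the output here is illustrative; the exact token IDs come from the `bert-base-uncased` vocabulary):

# Tokenize a short example sentence
encoded = tokenizer("BERT reads text as token IDs.")

# The token IDs, wrapped in the special [CLS] ... [SEP] tokens
print(encoded["input_ids"])

# The corresponding WordPiece tokens
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))

To train our own tokenizer on the merged corpus, though, we first need a way to feed the text through in manageable chunks: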

# tqdm provides progress bars for loops
from tqdm import tqdm

# Yield the dataset's text column in chunks of batch_size (10,000 by default)
def batch_iterator(batch_size=10000):
    # Step through the dataset one batch at a time, with a progress bar
    for i in tqdm(range(0, len(raw_datasets), batch_size)):
        # Slice out the next batch and keep only the "text" column
        yield raw_datasets[i : i + batch_size]["text"]

This function takes a `batch_size` argument (which is set to 10,000 by default) and returns an iterator that yields batches of text from our dataset. We use the `tqdm` library for progress tracking.
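As a quick sanity check, you can pull the first batch out before handing the iterator to the tokenizer trainer (a sketch):

# Grab the first batch and confirm it is a list of raw text strings
first_batch = next(batch_iterator())
print(len(first_batch))       # up to 10,000 texts
print(first_batch[0][:80])    # the first 80 characters of the first text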

Finally, let’s train our tokenizer on this data:

# Train a new tokenizer from our batch iterator, with a vocabulary of 32,000 tokens
bert_tokenizer = tokenizer.train_new_from_iterator(text_iterator=batch_iterator(), vocab_size=32000)

This trains a new tokenizer called `bert_tokenizer` using the batch iterator we created earlier (which yields batches of text from our dataset). We cap the vocabulary size at 32,000 tokens.
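Before moving on, it’s worth saving the trained tokenizer to disk so the pretraining script can pick it up later. A minimal sketch; the directory name here is just an example:

# Save the newly trained tokenizer; the directory name is arbitrary
bert_tokenizer.save_pretrained("tokenizer-bert-base-uncased-ours")

# It can later be reloaded with:
# reloaded_tokenizer = BertTokenizerFast.from_pretrained("tokenizer-bert-base-uncased-ours")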

That’s it! Our tokenizer is now ready to use for training BERT models on this data.
