It was created by compressing the original BERT model with a technique called knowledge distillation. This means that instead of training a new model from scratch, we use an existing model as a teacher to guide the learning of a smaller student model.
Here’s how it works in more detail: 1. First, we start with two models: a full-size BERT (the Teacher) that has already been trained with its standard cross-entropy objective, and a smaller network with fewer layers (the Student). The Student is trained on the same data, but with an additional distillation loss that encourages it to mimic the Teacher’s outputs as closely as possible.
2. Next, we feed both models the same input data and compare their outputs. The distillation loss measures how far the Student’s predicted probability distribution is from the Teacher’s: the closer the two distributions match, the smaller the loss. Because the Student learns to reproduce the Teacher’s full output distribution rather than just the hard labels, it also picks up how confident the Teacher is about the alternatives it did not predict, which tends to lead to better performance on downstream tasks (a minimal version of this loss is sketched after this list).
3. Finally, we repeat this process over many training steps until the Student performs nearly as well as the Teacher on a given task while needing only a fraction of the compute and memory that BERT requires. This makes it much more practical for real-world applications where computational resources are limited or expensive.
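To make the distillation loss in step 2 concrete, here is a minimal PyTorch sketch. It is illustrative rather than DistilBERT’s exact training code: the names (`distillation_loss`, `temperature`, `alpha`) and the particular weighting are assumptions, and DistilBERT’s real objective also adds a masked-language-modeling term and a cosine-embedding term on the hidden states.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with a soft-target distillation term (sketch)."""
    # Standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    # KL divergence between the softened Student and Teacher distributions.
    # The temperature flattens the probabilities so the Student also learns
    # how confident the Teacher is about the classes it did NOT predict.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_loss = F.kl_div(soft_student, soft_teacher,
                         reduction="batchmean") * temperature ** 2

    # alpha trades off mimicking the Teacher against fitting the true labels.
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

The closer the Student’s distribution is to the Teacher’s, the smaller the KL term becomes, which is exactly the “mimic the Teacher” pressure described above; temperature and alpha are hyperparameters you would tune for your task.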
For example, let’s say we want to train a model to classify whether a sentence expresses positive or negative sentiment. We could use DistilBERT instead of BERT because it has roughly 40% fewer parameters, runs about 60% faster, and retains most of BERT’s accuracy. That makes it a good fit when time and resources are limited, such as on mobile devices or embedded systems.
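As a quick illustration, here is how that sentiment classifier might look with the Hugging Face transformers library. This is a sketch that assumes the publicly released distilbert-base-uncased-finetuned-sst-2-english checkpoint (a DistilBERT fine-tuned on the SST-2 sentiment dataset); the example sentence and printed output are illustrative.

```python
from transformers import pipeline

# Load a DistilBERT checkpoint already fine-tuned for sentiment analysis.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier("The battery life on this phone is fantastic."))
# Expected output shape: [{'label': 'POSITIVE', 'score': ...}]
```

Because the model is distilled, a classifier like this runs comfortably on a CPU, which is part of what makes it attractive for mobile or embedded deployments.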
That’s a simplified explanation of how DistilBERT uses knowledge distillation to become a smaller, faster, cheaper, lighter version of BERT that can still perform just as well (or almost as well) on downstream tasks.