Training RLHF Models for Stack Exchange Question Answering


So how does it work? Well, first you need some human annotators who can provide feedback on the generated responses. They’ll read through a bunch of text and rate each response based on its usefulness and accuracy. Then, using this data as input, we train a reward model that will help guide the language model towards generating better responses in the future.
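For example, the feedback is often collected as simple preference pairs: one answer the annotator liked more and one they liked less. Here's a minimal sketch of what that data could look like (the dataclass and field names are just illustrative, not a real dataset format):

```python
# A minimal sketch of what the human-feedback data might look like.
# Each record pairs a question with two candidate answers plus a judgment
# about which one the annotator preferred. Field names are illustrative.

from dataclasses import dataclass

@dataclass
class PreferencePair:
    question: str         # the prompt shown to the model
    answer_chosen: str    # the response the annotator rated higher
    answer_rejected: str  # the response the annotator rated lower

feedback = [
    PreferencePair(
        question="How do I make pizza dough from scratch?",
        answer_chosen="Mix flour, water, yeast and salt; knead for 10 minutes; let it rise 1-2 hours.",
        answer_rejected="Just buy frozen dough.",
    ),
]
```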

Here’s an example: let’s say you have a question about how to make a delicious pizza from scratch. You ask your favorite search engine for some guidance and it returns a list of results, including links to various recipes and cooking tips. But which one is best? That’s where RLHF comes in!

First, we train our language model (let’s call her “GPT-2”) using regular old supervised learning techniques. She learns how to generate responses based on the input text, but she doesn’t have any way of knowing whether those responses are actually helpful or not. That’s where the human annotators come in!
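Here's a rough sketch of what that supervised stage could look like with Hugging Face's transformers library: a plain next-token-prediction loop over question/answer text. The tiny example dataset and hyperparameters are placeholders, not the real training setup.

```python
# A minimal supervised fine-tuning sketch: GPT-2 learns to continue
# question/answer text with a standard causal language-modeling loss.
# The example data and learning rate are placeholders.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

examples = [
    "Question: How do I make pizza dough from scratch?\n"
    "Answer: Mix flour, water, yeast and salt, knead well, then let it rise.",
]

model.train()
for text in examples:
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    # For causal LM training, the labels are the input ids; the model
    # shifts them internally to predict the next token.
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```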

They read through a bunch of pizza recipes and rate them based on their usefulness (e.g., “This recipe is great because it includes step-by-step instructions for making dough from scratch”). Then, we use this data to train our reward model, which will help guide GPT-2 towards generating more helpful responses in the future.
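One common way to train that reward model, assuming the ratings have been turned into "preferred vs. rejected" pairs like the ones above, is a pairwise ranking loss: score both answers and push the preferred one's score above the other. Here's a minimal sketch (the model choice and prompt format are illustrative):

```python
# A sketch of reward-model training under a pairwise-ranking assumption:
# the model maps (question + answer) text to a single scalar score, and the
# loss encourages the preferred answer to score higher than the rejected one.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
reward_model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
reward_model.config.pad_token_id = tokenizer.pad_token_id
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)

def score(text: str) -> torch.Tensor:
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    return reward_model(**batch).logits.squeeze()  # scalar reward

question = "How do I make pizza dough from scratch?"
chosen = "Mix flour, water, yeast and salt; knead well, then let it rise."
rejected = "Just buy frozen dough."

r_chosen = score(question + "\n" + chosen)
r_rejected = score(question + "\n" + rejected)

# Ranking loss: -log sigmoid(r_chosen - r_rejected) pushes the preferred
# answer's score above the rejected answer's score.
loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected)
loss.backward()
optimizer.step()
```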

So how does the RL step actually work? Well, first we define the performance criteria we want our language model to meet (e.g., "generate responses that are both accurate and useful"). Then, using an RL algorithm like PPO or ILQL, we train the unfrozen layers of the student model to satisfy those criteria. The total reward combines the score returned by the reward model with a KL-divergence penalty between the student model's output distribution and that of the original, frozen base model; the penalty keeps the student from drifting too far from the text it learned to produce during supervised training.
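To make that reward concrete, here's a hedged sketch of how the reward-model score and a per-token KL penalty might be combined for one sampled response. The function, its arguments, and the beta coefficient are all illustrative, not any specific library's API.

```python
# A sketch of assembling the total PPO reward: a per-token KL penalty between
# the student policy and the frozen base model, plus the reward model's scalar
# score added at the final token of the response. Rewards are treated as
# constants in PPO, so no gradients flow through this computation.

import torch
import torch.nn.functional as F

@torch.no_grad()
def total_reward(policy_logits, base_logits, response_ids, rm_score, beta=0.1):
    # Log-probabilities of the sampled response tokens under each model.
    policy_logprobs = F.log_softmax(policy_logits, dim=-1)
    base_logprobs = F.log_softmax(base_logits, dim=-1)
    idx = response_ids.unsqueeze(-1)
    policy_token_lp = policy_logprobs.gather(-1, idx).squeeze(-1)
    base_token_lp = base_logprobs.gather(-1, idx).squeeze(-1)

    # Per-token KL estimate: log pi_student(token) - log pi_base(token).
    kl_per_token = policy_token_lp - base_token_lp

    # Penalize every token for drifting from the base model, then add the
    # reward model's score at the end of the response.
    rewards = -beta * kl_per_token
    rewards[..., -1] += rm_score
    return rewards

# Toy usage with random logits for a 5-token response.
vocab, seq = 50257, 5
policy_logits = torch.randn(1, seq, vocab)
base_logits = torch.randn(1, seq, vocab)
response_ids = torch.randint(0, vocab, (1, seq))
print(total_reward(policy_logits, base_logits, response_ids, rm_score=1.3))
```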

The result (hopefully!) is a language model that satisfies our performance criteria, which means it can generate responses that are both accurate and useful for answering questions about pizza-making! And best of all, because we used RLHF instead of traditional supervised learning techniques, GPT-2 will continue to improve over time as she receives more feedback from human annotators.

It’s not rocket science, but it can be pretty useful if you know how to use it!
