RLHF Systems and Reward Modeling

Here’s how RLHF (reinforcement learning from human feedback) works: first, you start with a big ol’ language model that can already generate all sorts of text based on its training data. Think of it as a dog who already knows some basic tricks (like sit and stay). Sometimes we want the dog to do more advanced things, so we need to teach it new skills.

In an RLHF system, human feedback serves as the reward signal for training the language model. Outputs get scores based on how humans judge them (say, whether they find the text funny or informative). In practice those scores usually come from a reward model, a separate model trained on human ratings or comparisons so it can score new outputs automatically. The higher the score, the more training nudges the language model toward generating that kind of text in the future.
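To make the "higher score, more likely" idea concrete, here is a tiny sketch. It is not a real RLHF pipeline: the "model" is just a softmax over three canned responses, the scores in human_scores are made-up numbers standing in for human feedback, and policy_gradient_step is a hypothetical helper that applies the exact gradient of the expected score. Real systems typically use sampled updates and algorithms such as PPO on a full language model.

```python
import math

def softmax(logits):
    # Turn raw preferences (logits) into probabilities that sum to 1.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_step(logits, rewards, lr=0.5):
    # Gradient of the expected reward with respect to logit k is
    # p_k * (r_k - expected_reward), so above-average responses go up.
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [l + lr * p * (r - expected)
            for l, p, r in zip(logits, probs, rewards)]

# Hypothetical responses and made-up human scores (assumptions for this demo).
responses = ["dull reply", "okay reply", "funny, informative reply"]
human_scores = [0.1, 0.5, 0.9]

logits = [0.0, 0.0, 0.0]          # the model starts with no preference
before = softmax(logits)
for _ in range(200):
    logits = policy_gradient_step(logits, human_scores)
after = softmax(logits)

for text, p0, p1 in zip(responses, before, after):
    print(f"{text!r}: {p0:.2f} -> {p1:.2f}")
```

Run it and the probability of the best-scored response climbs while the worst-scored one fades, which is the treat-for-rolling-over effect in miniature.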

For example, let’s say we want our dog to learn a new trick: rolling over. We start by showing it how to do it and giving it a treat every time it gets it right. Over time, the dog learns that rolling over is a good thing because it gets rewarded for it.

RLHF uses human feedback in a similar way. We show people samples of the model’s text and ask them to rate how much they like each one (for instance, points for being funny or informative). Over time, the language model learns that certain kinds of text earn higher scores from humans, so it becomes more likely to generate them.
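Since humans can't rate every output, the ratings are usually distilled into a reward model that scores new text on their behalf. Below is a minimal sketch of that step under heavy assumptions: featurize uses two invented keyword features, the ratings in rated_texts are made up, and the model is a plain linear regressor fit with gradient descent. Real reward models are neural networks, and they are often trained on pairwise comparisons rather than raw scores.

```python
def featurize(text):
    # Hypothetical hand-made features: bias term, mentions a joke, mentions a fact.
    lowered = text.lower()
    return [1.0, float("joke" in lowered), float("fact" in lowered)]

def predict(weights, features):
    # Linear reward: weighted sum of the features.
    return sum(w * x for w, x in zip(weights, features))

def fit_reward_model(rated_texts, lr=0.1, epochs=500):
    # Stochastic gradient descent on squared error between
    # the predicted reward and the human rating.
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for text, rating in rated_texts:
            feats = featurize(text)
            error = predict(weights, feats) - rating
            weights = [w - lr * error * x for w, x in zip(weights, feats)]
    return weights

# Made-up human ratings on a 0-to-1 scale (assumptions for this demo).
rated_texts = [
    ("plain filler text", 0.1),
    ("a joke about cats", 0.6),
    ("a fact about cats", 0.7),
    ("a joke and a fact about cats", 0.9),
]

weights = fit_reward_model(rated_texts)
score_good = predict(weights, featurize("a new joke with a fact"))
score_bad = predict(weights, featurize("more filler"))
print(f"joke + fact: {score_good:.2f}, filler: {score_bad:.2f}")
```

The learned model gives the unseen joke-plus-fact text a higher score than the filler, so it can stand in for the human raters when scoring fresh outputs during training.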

The cool thing about RLHF is that we can use it to tune language models for specific tasks or purposes (like writing comedy sketches or news articles). Using human feedback as the reward signal steers the model toward text that meets our needs and expectations. And by training multiple language models with different reward signals, we can build a whole family of RLHF-tuned models that handle all sorts of tasks!

If you want to learn more, check out some of the resources I mentioned earlier or do your own research. And if you ever need help training your dog (or language model) to do tricks, just let me know!
