So what does it mean to use reinforcement learning from human feedback in the context of language models? Well, let’s say we have a pre-trained language model that can generate text from an input prompt. How do we know whether that generated text is actually any good? That’s where RLHF comes in!
Instead of relying on traditional automatic metrics like BLEU to judge the quality of the generated text, we use human feedback as the measure of performance. Concretely, we ask humans to rate the generated text on a scale from 1 to 5 (or whatever rating scheme you prefer) and then use that feedback as the reward signal when fine-tuning the language model with reinforcement learning.
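Here’s a minimal sketch of that first step: turning raw annotator scores into a single scalar reward. The function name and the example ratings are made up for illustration; in practice these scores usually go into training a separate reward model rather than being used directly.

```python
from statistics import mean

def ratings_to_reward(ratings: list[int]) -> float:
    """Average raw 1-5 human ratings into a single scalar reward."""
    if not all(1 <= r <= 5 for r in ratings):
        raise ValueError("expected ratings on a 1-5 scale")
    return mean(ratings)

# Hypothetical scores from five annotators for one generated text.
print(ratings_to_reward([4, 5, 4, 4, 4]))  # -> 4.2
```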
Here’s an example: let’s say we have a prompt “Write a short story about a person who discovers they can fly.” Our pre-trained language model generates the following text:
“Sarah had always dreamed of flying, but she never thought it was possible until one day she woke up with wings. She couldn’t believe her eyes as she looked down at her new appendages. Sarah jumped out of bed and ran to the window, eager to test them out. As soon as she spread her wings and lifted off the ground, a feeling of pure joy washed over her. For the first time in her life, Sarah felt truly alive.”
Now we ask humans to rate this generated text on a scale from 1 to 5 based on how well it captures the prompt. Let’s say that across 100 human ratings, the story received an average score of 4.2. That average becomes the reward signal: the reinforcement learning algorithm updates the model so that outputs earning higher ratings become more likely. (In practice, these ratings are usually used to train a separate reward model that can score new outputs automatically, so humans don’t have to rate every single generation.)
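To make the training step concrete, here’s a toy REINFORCE-style update written with PyTorch. The “policy” is a deliberately tiny stand-in (a single logit table) rather than a real pre-trained language model, and the vocabulary size, sequence length, and learning rate are arbitrary choices for illustration only; the 4.2 reward is the averaged human rating from above.

```python
import torch

# Toy stand-in for a language model: a single learned logit table over a
# small vocabulary. A real RLHF setup would use a pre-trained transformer.
vocab_size, seq_len = 50, 12
logits = torch.zeros(vocab_size, requires_grad=True)
optimizer = torch.optim.Adam([logits], lr=1e-2)

def rating_to_reward(avg_rating: float) -> float:
    """Map an average 1-5 human rating onto a reward centred at 0."""
    return (avg_rating - 3.0) / 2.0

reward = rating_to_reward(4.2)  # the averaged human score from above

# One REINFORCE-style update: sample a "completion", then weight the
# log-probability of the sampled tokens by the reward.
dist = torch.distributions.Categorical(logits=logits)
tokens = dist.sample((seq_len,))        # sampled token ids
log_prob = dist.log_prob(tokens).sum()  # total log-probability of the sample
loss = -reward * log_prob               # positive reward -> reinforce sample

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The key idea is that the sign and size of the reward decide whether the sampled text gets reinforced or discouraged; full-scale RLHF systems typically use PPO with a KL penalty against the original model to keep updates stable, but the underlying principle is the same.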
In short, RLHF uses human judgments as the measure of performance when training language models with reinforcement learning, so the model learns to produce text that people actually prefer rather than text that merely scores well on automatic metrics. It’s a pretty cool concept that has been gaining popularity lately, and we can expect to see many more applications of it in the future!