Illustrating Reinforcement Learning from Human Feedback (RLHF)

Language models can generate perfectly fluent text, but on their own they have no idea what humans actually find interesting or helpful. That’s where RLHF comes in!

So, how does it work? Basically, we collect feedback from humans on the model’s outputs and use it to improve the model’s writing. For example, let’s say our language model wrote this: “The quick brown fox jumps over the lazy dog.” But then a human says, “Hey, that’s not very interesting. Can you make it more exciting?”

So we give the language model points when it tries to be exciting and deduct points when it writes something boring or nonsensical. Then we use those points as a reward signal to train the model, so it learns what humans like and dislike. Pretty cool, right?

Here’s a deliberately simplified sketch of how you might implement this in code (the helper functions are toy stand-ins, not a real training setup):

import random

# Reward function standing in for a human rater: it scores the model's text
# on whether or not it's interesting.
def get_feedback(text):
    if "exciting" in text.lower():  # does the text contain the word "exciting"?
        return 10  # give points for using exciting words
    return -5      # deduct points for being boring

# Toy stand-ins so the sketch actually runs; a real setup would use a trained
# language model and update its weights (e.g. with a policy-gradient step).
language_model = lambda state: random.choice(["exciting", "boring"])
num_episodes = 3

def initialize_state(): return ""                     # each episode starts with empty text
def get_action(state, model): return model(state)     # ask the model for the next word
def step(state, action): return (state + " " + action).strip()  # append the word
def is_done(state): return len(state.split()) >= 10   # stop after ten words
def update_model(model, feedback): pass               # a real update would adjust weights here

# Train the language model with RLHF by generating text and scoring it
# with the feedback function.
for episode in range(num_episodes):
    state = initialize_state()  # start a new episode (a fresh piece of text)
    while True:
        action = get_action(state, language_model)  # pick the next word
        state = step(state, action)                 # take the action: extend the text
        feedback = get_feedback(state)              # get (simulated) human feedback
        update_model(language_model, feedback)      # update the model with the feedback
        if is_done(state):  # we've generated enough text for this episode
            break
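
In real RLHF, of course, the reward doesn’t come from a hard-coded keyword check. Humans typically compare pairs of model outputs (“which of these two is better?”), and those comparisons are used to train a separate reward model, which then stands in for the human and scores text during RL fine-tuning. Here’s a minimal sketch of that idea. To be clear, everything in it is made up for illustration: the preference pairs are invented, and the bag-of-words “reward model” trained with the standard pairwise (Bradley-Terry) loss is a toy stand-in for the neural network a real system would use.

import math

# Hypothetical preference data: pairs where a human preferred the first text.
preferences = [
    ("a thrilling chase through the city", "the fox sat there"),
    ("an explosive, daring escape", "the dog was lazy"),
    ("a heart-pounding race against time", "it was a day"),
]

# Toy bag-of-words features; a real reward model would use a neural network.
VOCAB = sorted({w for pair in preferences for text in pair for w in text.split()})

def features(text):
    words = text.split()
    return [words.count(w) for w in VOCAB]

weights = [0.0] * len(VOCAB)

def reward(text):
    return sum(w * x for w, x in zip(weights, features(text)))

# Train with the pairwise Bradley-Terry loss:
#   loss = -log(sigmoid(reward(preferred) - reward(rejected)))
lr = 0.1
for epoch in range(200):
    for preferred, rejected in preferences:
        gap = reward(preferred) - reward(rejected)
        p = 1.0 / (1.0 + math.exp(-gap))  # probability assigned to the human's choice
        fp, fr = features(preferred), features(rejected)
        for i in range(len(weights)):
            # gradient descent on the loss pushes the reward gap up by (1 - p)
            weights[i] += lr * (1.0 - p) * (fp[i] - fr[i])

# The trained reward model now scores unseen text; this score is what would
# replace get_feedback() as the reward signal when fine-tuning the model.
print(reward("a thrilling escape"))  # relatively high
print(reward("the dog sat there"))   # relatively low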

And that’s it! With RLHF, our language models can learn to write more interesting and engaging text based on human feedback. It’s like having your own personal writing coach!
