Well, imagine you’re playing a game of chess and you have to choose between two moves: one that leads to an immediate win, or another that sets up a more complex strategy for a future victory. Which would you pick?
In reinforcement learning, the discount factor helps us weigh choices like this by valuing rewards according to how far in the future they occur. The closer a reward is to the current time step, the higher its value. This makes sense because we usually want to prioritize immediate gains over future ones (unless those future gains are really juicy).
So how do we set a discount factor? It’s pretty simple: you just choose a number between 0 and 1. A discount factor of 0 means that future rewards have no value at all, while a discount factor of 1 means that every reward is equally important, regardless of when it occurs. Most people use values somewhere in between; for example, a common choice is around 0.9 or 0.95.
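To make the effect of the discount factor concrete, here is a minimal sketch (using a made-up sequence of five equal rewards, purely for illustration) showing how the same rewards are valued under different choices of gamma:

import numpy as np

rewards = [1.0, 1.0, 1.0, 1.0, 1.0]  # Hypothetical reward received at each future time step

def discounted_return(rewards, gamma):
    # A reward t steps in the future is worth gamma**t times as much as an immediate one
    return sum(gamma ** t * r for t, r in enumerate(rewards))

for gamma in (0.0, 0.5, 0.9, 1.0):
    print(f"gamma={gamma}: discounted return = {discounted_return(rewards, gamma):.3f}")

With gamma = 0 only the first reward counts, with gamma = 1 all five count equally, and values in between trade off smoothly between those two extremes.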
Now let’s look at how to actually implement this concept in code. Here’s a simplified sketch using Python and OpenAI Gym; the epsilon value, learning rate, and the small Q-network used below are placeholder choices, so treat it as an illustration rather than a finished training script:
# Import necessary libraries
import gym
from collections import deque
import numpy as np
from tensorflow import keras  # Assumed: a small Keras network stands in for the undefined Q-value model below

# Hyperparameters
gamma = 0.95         # Discount factor used to weight future rewards
EPSILON = 0.1        # Exploration rate for the epsilon-greedy policy (assumed value, not from the original)
LEARNING_RATE = 0.1  # Step size for nudging Q-value estimates toward their targets (assumed value)
# Initialize a replay memory to store past experiences
memory = deque(maxlen=10000) # Deque data structure used to store past experiences, with a maximum length of 10000
# Set up an environment and initialize variables for storing rewards and actions
env = gym.make('CartPole-v1')  # Create an environment using the CartPole-v1 task from OpenAI Gym

# The original snippet calls model.predict() without defining the model; as a stand-in,
# build a small Keras network that maps a state to one Q-value per action.
n_states = env.observation_space.shape[0]  # 4 observation values for CartPole
n_actions = env.action_space.n             # 2 possible actions for CartPole
model = keras.Sequential([
    keras.layers.Dense(24, activation='relu', input_shape=(n_states,)),
    keras.layers.Dense(24, activation='relu'),
    keras.layers.Dense(n_actions, activation='linear'),
])
model.compile(optimizer='adam', loss='mse')

state = env.reset()  # Reset the environment and get the initial state (older Gym API; newer versions also return an info dict)
reward_sum = 0 # Variable to keep track of the total reward for each episode
done = False # Flag to indicate if the episode is complete
episode_count = 0 # Variable to keep track of the number of episodes completed
while True:
    # Choose an action based on the current state (simple epsilon-greedy strategy)
    if np.random.uniform(0, 1) < EPSILON:
        action = env.action_space.sample()  # Explore: sample a random action from the action space
    else:
        q_values = model.predict(state[np.newaxis], verbose=0)[0]  # Predict Q-values for the current state
        action = np.argmax(q_values)  # Exploit: choose the action with the highest Q-value

    # Take the chosen action and observe the resulting state, reward, and done flag
    next_state, reward, done, _ = env.step(action)  # Older Gym API; newer versions return five values

    # Store this experience in the replay memory (the deque drops its oldest entry once it holds 10000)
    memory.append((state, action, reward, next_state, done))

    # Update the state and reward variables for the current episode
    state = next_state    # Move to the next state
    reward_sum += reward  # Add this step's reward to the episode total

    # Check if we've completed an entire episode (i.e., reached a terminal state)
    if done:
        print(f"Episode {episode_count} complete! Reward sum: {reward_sum}")
        episode_count += 1  # Increment the episode count
        # Calculate the discounted return for this episode using the discount factor (gamma)
        total_reward = 0
        for t, (s, a, r, s_next, d) in enumerate(memory):  # Walk through the episode in time order
            total_reward += gamma ** t * r  # A reward t steps away is worth gamma**t as much as an immediate one
        print(f"Discounted return: {total_reward:.2f}")

        # Update the model based on this episode's experience (one-step Q-learning targets)
        for s, a, r, s_next, d in memory:
            q_values = model.predict(s[np.newaxis], verbose=0)[0]  # Current Q-value estimates for this state
            if d:
                target = r  # Terminal step: no future reward to bootstrap from
            else:
                q_next = model.predict(s_next[np.newaxis], verbose=0)[0]  # Q-values for the next state
                target = r + gamma * np.max(q_next)  # Immediate reward plus discounted value of the best next action
            # Nudge the Q-value of the action actually taken toward its target, then train on the result
            q_values[a] += LEARNING_RATE * (target - q_values[a])
            model.fit(s[np.newaxis], q_values[np.newaxis], verbose=0)

        # Reset for the next episode
        memory.clear()       # Clear the replay memory (equivalent to creating a fresh deque)
        reward_sum = 0       # Reset the total reward for the episode
        state = env.reset()  # Reset the environment so the next episode starts from a fresh state
And that’s it! By setting a discount factor and using it to calculate future rewards, we can control how strongly the agent prioritizes immediate gains over distant ones, which is exactly what we want in reinforcement learning.