Well, imagine you’re playing a game of chess and you have to choose between two moves: one that leads to an immediate win, or another that sets up a more complex strategy for a future victory. Which would you pick?
In reinforcement learning, the discount factor helps us weigh choices like this by valuing rewards according to how far in the future they occur. The closer a reward is to the current time step, the higher its value. This makes sense because we usually want to prioritize immediate gains over future ones (unless those future gains are really juicy).
So how do we set a discount factor? It’s pretty simple: you just choose a number between 0 and 1. A discount factor of 0 means that future rewards have no value at all, while a discount factor of 1 means that every reward is equally important, regardless of when it occurs. Most people use values somewhere in between; for example, a common choice is around 0.9 or 0.95.
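To make the effect of the discount factor concrete, here is a minimal sketch (using a made-up sequence of five equal rewards, purely for illustration) showing how the same rewards are valued under different choices of gamma:

import numpy as np

rewards = [1.0, 1.0, 1.0, 1.0, 1.0]  # Hypothetical reward received at each future time step

def discounted_return(rewards, gamma):
    # A reward t steps in the future is worth gamma**t times as much as an immediate one
    return sum(gamma ** t * r for t, r in enumerate(rewards))

for gamma in (0.0, 0.5, 0.9, 1.0):
    print(f"gamma={gamma}: discounted return = {discounted_return(rewards, gamma):.3f}")

With gamma = 0 only the first reward counts, with gamma = 1 all five count equally, and values in between trade off smoothly between those two extremes.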
Now let’s look at how to actually implement this concept in code. Here’s a simplified sketch using Python and OpenAI Gym; the epsilon value, learning rate, and the small Q-network used below are placeholder choices, so treat it as an illustration rather than a finished training script:
# Import necessary libraries
import gym
from collections import deque
import numpy as np
from tensorflow import keras  # Assumed: a small Keras network stands in for the undefined Q-value model below

# Hyperparameters
gamma = 0.95         # Discount factor used to weight future rewards
EPSILON = 0.1        # Exploration rate for the epsilon-greedy policy (assumed value, not from the original)
LEARNING_RATE = 0.1  # Step size for nudging Q-value estimates toward their targets (assumed value)
# Initialize a replay memory to store past experiences
memory = deque(maxlen=10000) # Deque data structure used to store past experiences, with a maximum length of 10000
# Set up an environment and initialize variables for storing rewards and actions
env = gym.make('CartPole-v1')  # Create an environment using the CartPole-v1 task from OpenAI Gym

# The original snippet calls model.predict() without defining the model; as a stand-in,
# build a small Keras network that maps a state to one Q-value per action.
n_states = env.observation_space.shape[0]  # 4 observation values for CartPole
n_actions = env.action_space.n             # 2 possible actions for CartPole
model = keras.Sequential([
    keras.layers.Dense(24, activation='relu', input_shape=(n_states,)),
    keras.layers.Dense(24, activation='relu'),
    keras.layers.Dense(n_actions, activation='linear'),
])
model.compile(optimizer='adam', loss='mse')

state = env.reset()  # Reset the environment and get the initial state (older Gym API; newer versions also return an info dict)
reward_sum = 0 # Variable to keep track of the total reward for each episode
done = False # Flag to indicate if the episode is complete
episode_count = 0 # Variable to keep track of the number of episodes completed
while True:
    # Choose an action based on the current state (simple epsilon-greedy strategy)
    if np.random.uniform(0, 1) < EPSILON:
        action = env.action_space.sample()  # Explore: sample a random action from the action space
    else:
        q_values = model.predict(state[np.newaxis], verbose=0)[0]  # Predict Q-values for the current state
        action = np.argmax(q_values)  # Exploit: choose the action with the highest Q-value

    # Take the chosen action and observe the resulting state, reward, and done flag
    next_state, reward, done, _ = env.step(action)  # Older Gym API; newer versions return five values

    # Store this experience in the replay memory (the deque drops its oldest entry once it holds 10000)
    memory.append((state, action, reward, next_state, done))

    # Update the state and reward variables for the current episode
    state = next_state    # Move to the next state
    reward_sum += reward  # Add this step's reward to the episode total

    # Check if we've completed an entire episode (i.e., reached a terminal state)
    if done:
        print(f"Episode {episode_count} complete! Reward sum: {reward_sum}")
        episode_count += 1  # Increment the episode count
        # Calculate the discounted return for this episode using the discount factor (gamma)
        total_reward = 0
        for t, (s, a, r, s_next, d) in enumerate(memory):  # Walk through the episode in time order
            total_reward += gamma ** t * r  # A reward t steps away is worth gamma**t as much as an immediate one
        print(f"Discounted return: {total_reward:.2f}")

        # Update the model based on this episode's experience (one-step Q-learning targets)
        for s, a, r, s_next, d in memory:
            q_values = model.predict(s[np.newaxis], verbose=0)[0]  # Current Q-value estimates for this state
            if d:
                target = r  # Terminal step: no future reward to bootstrap from
            else:
                q_next = model.predict(s_next[np.newaxis], verbose=0)[0]  # Q-values for the next state
                target = r + gamma * np.max(q_next)  # Immediate reward plus discounted value of the best next action
            # Nudge the Q-value of the action actually taken toward its target, then train on the result
            q_values[a] += LEARNING_RATE * (target - q_values[a])
            model.fit(s[np.newaxis], q_values[np.newaxis], verbose=0)

        # Reset for the next episode
        memory.clear()       # Clear the replay memory (equivalent to creating a fresh deque)
        reward_sum = 0       # Reset the total reward for the episode
        state = env.reset()  # Reset the environment so the next episode starts from a fresh state
And that’s it! By setting a discount factor and using it to calculate future rewards, we can control how strongly the agent prioritizes immediate gains over distant ones, which is exactly what we want in reinforcement learning.