Horizon 1 and Horizon 2 Value Functions in POMDPs


Do you feel like Horizon 1 and Horizon 2 value functions are just a bunch of buzzwords that don’t really mean anything?

To start, let’s define what POMDPs (Partially Observable Markov Decision Processes) even are. Basically, they’re like MDPs (Markov Decision Processes), but with a twist: you don’t have access to the full state information at each time step. Instead, you get partial observations that help you infer what might be going on in the environment.
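To make that concrete, here’s a minimal sketch of what a POMDP looks like as a data structure, plus the belief update an agent runs because it can’t see the true state. The class and function names (POMDP, belief_update) and the dictionary layout are my own illustrative choices, not a standard library API:

```python
from dataclasses import dataclass

@dataclass
class POMDP:
    states: list        # hidden states S
    actions: list       # actions A
    observations: list  # observations O
    T: dict             # transition probs: T[(s, a)][s_next] = P(s_next | s, a)
    Z: dict             # observation probs: Z[(a, s_next)][o] = P(o | a, s_next)
    R: dict             # rewards: R[(s, a)] = expected immediate reward
    gamma: float = 0.95 # discount factor

def belief_update(pomdp, belief, action, obs):
    """Update a belief (a probability distribution over states) after
    taking `action` and receiving observation `obs`."""
    new_belief = {}
    for s_next in pomdp.states:
        # P(o | a, s_next) * sum_s P(s_next | s, a) * b(s)
        p = pomdp.Z[(action, s_next)].get(obs, 0.0) * sum(
            pomdp.T[(s, action)].get(s_next, 0.0) * belief.get(s, 0.0)
            for s in pomdp.states
        )
        new_belief[s_next] = p
    total = sum(new_belief.values())
    # Normalize; if the observation was impossible under this belief, keep the old one.
    return {s: p / total for s, p in new_belief.items()} if total > 0 else belief
```

The belief is what stands in for the unknown state: everything the agent decides is based on this distribution rather than on the state itself.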

Now, value functions. In RL (reinforcement learning), we use them to evaluate how good a particular action is based on the rewards we expect it to bring in. In this post we’ll look at two of the simplest: the Horizon 1 and Horizon 2 value functions.

The Horizon 1 value function (essentially the expected immediate reward for a state–action pair) tells us what’s going to happen right now if we take a given action in a given state. It looks like this:

V_H1(s, a) = E[R_{t+1} | S_t = s, A_t = a]

In other words, it calculates the expected immediate reward for taking action ‘a’ from state ‘s’. This is useful when we want to make decisions based on short-term rewards.
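Since the true state in a POMDP is hidden, in practice we evaluate this expectation under the agent’s belief rather than a known state. Here’s a hypothetical sketch building on the POMDP class and belief_update above (v_h1 is my own name for it):

```python
def v_h1(pomdp, belief, action):
    """Horizon 1 value of an action under a belief:
    V_H1(b, a) = sum_s b(s) * R(s, a), i.e. the belief-weighted
    expected immediate reward."""
    return sum(belief.get(s, 0.0) * pomdp.R[(s, action)]
               for s in pomdp.states)
```

If the state were fully observable, the belief would put probability 1 on a single state and this would reduce to plain R(s, a).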

But what if we care about more than the very next step? That’s where the Horizon 2 value function comes in. It accounts not just for the immediate reward, but also for the reward we can expect one step further on:

V_H2(s, a) = E[R_{t+1} + gamma * max_{a'} V_H1(S_{t+1}, a') | S_t = s, A_t = a]

Here, the inner term V_H1(S_{t+1}, a') is the Horizon 1 value of the next state, and the max picks the best follow-up action. ‘gamma’ is a discount factor that determines how much we care about future rewards: if it’s close to 0, we only care about the immediate reward; if it’s close to 1, we weight the next step almost as heavily. This value function helps us make decisions based on what happens over two steps rather than just immediate gains.
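In the POMDP setting, the expectation over the next state becomes an expectation over the observations we might receive, each of which leads to an updated belief. Here’s a hypothetical sketch (v_h2 is my own name, reusing the belief_update and v_h1 sketches above):

```python
def v_h2(pomdp, belief, action):
    """Horizon 2 value of an action under a belief: immediate reward plus
    the discounted value of acting greedily for one more step, averaged
    over possible observations."""
    immediate = v_h1(pomdp, belief, action)

    future = 0.0
    for obs in pomdp.observations:
        # P(o | b, a) = sum_{s'} Z(o | a, s') * sum_s T(s' | s, a) * b(s)
        p_obs = sum(
            pomdp.Z[(action, s_next)].get(obs, 0.0) * sum(
                pomdp.T[(s, action)].get(s_next, 0.0) * belief.get(s, 0.0)
                for s in pomdp.states
            )
            for s_next in pomdp.states
        )
        if p_obs == 0.0:
            continue
        next_belief = belief_update(pomdp, belief, action, obs)
        # Best second-step action under the updated belief.
        best_next = max(v_h1(pomdp, next_belief, a2) for a2 in pomdp.actions)
        future += p_obs * best_next

    return immediate + pomdp.gamma * future
```

The same pattern extends to longer horizons: each extra step adds another layer of “observe, update the belief, pick the best action”, which is exactly why exact POMDP planning gets expensive quickly.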

They might seem like fancy math at first glance, but they’re actually pretty simple once you break them down. And who knows? Maybe someday we’ll be able to use these concepts to solve real-world problems like playing poker or navigating through a maze.
