Essentially, they’re like regular old Markov decision processes (MDPs), but with a twist: instead of having full knowledge of the current state, you only get partial information. This means your decisions might not always be optimal, because you never know exactly where you are in the game.
Don’t worry! Value Iteration is here to save the day (or at least make things a little less confusing). It’s an algorithm used for solving these problems, and it works by iteratively calculating the expected value of each action in each state. The idea is that by repeatedly updating these values, you eventually converge on an optimal policy, even when several possible states have similar rewards.
So how does Value Iteration work? Well, let’s say we have a POMDP with three actions (A1, A2, and A3) and four possible hidden states (S1, S2, S3, and S4). Each state has an associated reward value, as well as a probability of transitioning to another state based on the chosen action.
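To make that concrete, here’s a tiny Python sketch of what such a setup could look like. The reward values and transition probabilities below are completely made up for illustration; swap in whatever numbers your own problem calls for.

```python
# A minimal sketch of the toy model described above, with made-up numbers.
# S1..S4 are the hidden states, A1..A3 the actions.

states = ["S1", "S2", "S3", "S4"]
actions = ["A1", "A2", "A3"]

# Reward for being in each state (hypothetical values).
rewards = {"S1": 0.0, "S2": 1.0, "S3": -1.0, "S4": 10.0}

# transitions[(state, action)][next_state] = probability of landing in
# next_state when taking `action` from `state` (hypothetical; rows sum to 1).
transitions = {
    ("S1", "A1"): {"S1": 0.1, "S2": 0.9, "S3": 0.0, "S4": 0.0},
    ("S1", "A2"): {"S1": 0.5, "S2": 0.0, "S3": 0.5, "S4": 0.0},
    ("S1", "A3"): {"S1": 0.0, "S2": 0.0, "S3": 0.3, "S4": 0.7},
    ("S2", "A1"): {"S1": 0.0, "S2": 0.2, "S3": 0.8, "S4": 0.0},
    ("S2", "A2"): {"S1": 0.3, "S2": 0.3, "S3": 0.4, "S4": 0.0},
    ("S2", "A3"): {"S1": 0.0, "S2": 0.0, "S3": 0.0, "S4": 1.0},
    ("S3", "A1"): {"S1": 1.0, "S2": 0.0, "S3": 0.0, "S4": 0.0},
    ("S3", "A2"): {"S1": 0.0, "S2": 0.6, "S3": 0.4, "S4": 0.0},
    ("S3", "A3"): {"S1": 0.0, "S2": 0.0, "S3": 0.9, "S4": 0.1},
    ("S4", "A1"): {"S1": 0.0, "S2": 0.0, "S3": 0.0, "S4": 1.0},
    ("S4", "A2"): {"S1": 0.5, "S2": 0.5, "S3": 0.0, "S4": 0.0},
    ("S4", "A3"): {"S1": 0.0, "S2": 1.0, "S3": 0.0, "S4": 0.0},
}
```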
To start, we initialize the expected value of each state to some arbitrary number; let’s say zero. We then sweep through all states and actions, updating the values according to:
Value(S_i) = max over actions A_k of [ Reward(S_i) + sum over S_j of P(S_j | S_i, A_k) * Value(S_j) ]
In other words: for each action available in state i, we add the immediate reward to the probability-weighted values of the states we could land in, and then keep whichever action gives the highest total.
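In code, one sweep of that update could look something like the sketch below (the function and variable names are my own, continuing the toy model above). One caveat: I’ve added a discount factor gamma, which the formula above leaves out, purely so that the repeated updates are guaranteed to settle down instead of growing without bound.

```python
def bellman_update(values, states, actions, rewards, transitions, gamma=0.9):
    """One sweep of the value-iteration update over every state.

    For each state we try every action, add the immediate reward to the
    probability-weighted values of the possible next states, and keep the
    best result. `gamma` is a discount factor (my addition; the formula in
    the text omits it) that keeps the values from blowing up.
    """
    new_values = {}
    for s in states:
        best = float("-inf")
        for a in actions:
            expected_future = sum(
                p * values[s_next] for s_next, p in transitions[(s, a)].items()
            )
            best = max(best, rewards[s] + gamma * expected_future)
        new_values[s] = best
    return new_values
```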
We repeat this process for every state and action until convergence is reached, meaning there are no more significant changes in the expected values. At that point, we can read off an optimal policy by selecting, for each hidden state, the action with the highest expected value.
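Putting it all together with the toy model and the update function sketched above, the whole loop (plus reading off a policy at the end) might look roughly like this:

```python
# Continuing the sketch: run sweeps until "no more significant changes",
# then read off the greedy policy. The tolerance and sweep cap are arbitrary.
values = {s: 0.0 for s in states}   # start every state at zero
for _ in range(1000):               # safety cap on the number of sweeps
    new_values = bellman_update(values, states, actions, rewards, transitions)
    biggest_change = max(abs(new_values[s] - values[s]) for s in states)
    values = new_values
    if biggest_change < 1e-6:       # converged: updates are now negligible
        break

# Optimal policy: for each hidden state, pick the action with the highest
# expected value (immediate reward plus discounted, weighted future value).
gamma = 0.9                         # must match the gamma used in bellman_update
policy = {}
for s in states:
    best_action, best_value = None, float("-inf")
    for a in actions:
        q = rewards[s] + gamma * sum(
            p * values[s2] for s2, p in transitions[(s, a)].items()
        )
        if q > best_value:
            best_action, best_value = a, q
    policy[s] = best_action

print(values)   # converged expected values
print(policy)   # best action in each state
```

The 1e-6 tolerance and the cap of 1,000 sweeps are just reasonable defaults; “no more significant changes” simply means the biggest update in a sweep dropped below whatever threshold you’re comfortable with.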
Value Iteration in POMDPs: a concept so mind-bending that it’ll make your head spin faster than a spinning top on juice (but hopefully not as fast).