DeepMind’s Gopher and Synchronous Advantage Actor-Critic (A2C)


First, Gopher. It’s basically a game where you navigate mazes full of obstacles and collect as many points as possible while avoiding getting squished by walls or eaten by monsters (or whatever else might be lurking in those dark corners). Sounds like fun, right?

Now, A2C. This is an algorithm that helps your computer figure out the best way to play Gopher and other games like it. It has two parts working together: an “actor” that decides which action to take at each step, and a “critic” that scores how good the current situation looks for achieving your goal (in this case, collecting as many points as possible).
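To make that concrete, here is a minimal sketch of what an actor-critic model can look like in PyTorch. The class name, layer sizes, and the assumption that the game screen arrives as a flat vector of numbers are all made up for illustration; this is not DeepMind’s actual network.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Hypothetical actor-critic: one shared body, two small heads."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)  # actor: a score for each action
        self.value_head = nn.Linear(hidden, 1)           # critic: expected points from here

    def forward(self, obs: torch.Tensor):
        h = self.body(obs)
        return self.policy_head(h), self.value_head(h).squeeze(-1)
```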

Here’s an example: let’s say you have two options, go left or go right. If going left leads to a dead end with no points, it gets a negative value (-10, maybe?). But if going right takes you through a secret passage full of bonus points, it gets a positive value (+50).
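In code, that toy example is just “pick whichever option has the higher estimated value.” The numbers below are the invented ones from above, not from any real game:

```python
# Made-up values for the two options in the example above.
action_values = {"left": -10.0, "right": +50.0}

# Greedy choice: take the action with the highest estimated value.
best_action = max(action_values, key=action_values.get)
print(best_action)  # -> "right"
```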

The A2C algorithm then uses this information to help your computer make the best decision possible at each step of the game. It does this by learning a “policy”: basically, a set of rules (really, a set of probabilities) that tells your computer which actions are most likely to lead to success (in this case, getting as many points as possible).
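A policy usually turns raw action scores into probabilities with a softmax, so better-looking actions get picked more often but nothing is ruled out entirely. A tiny sketch, with invented scores:

```python
import math

# Invented action scores; higher score means the action looks more promising.
scores = {"left": -10.0, "right": 50.0, "forward": 5.0}

# Softmax: exponentiate each score and normalize so the values sum to 1.
total = sum(math.exp(s) for s in scores.values())
policy = {a: math.exp(s) / total for a, s in scores.items()}
# Here policy["right"] is essentially 1.0, so "right" gets chosen almost every time.
```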

So how do Gopher and A2C work together? Well, let’s say you start playing Gopher and the game presents you with a new maze. First, your computer uses its policy to pick the action that looks most likely to lead to success (say, moving forward). At the same time, the critic estimates how many points it expects to collect from this position onward.
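Continuing the hypothetical ActorCritic sketch from above, one decision step could look like this. The 84-number observation and the 4 actions are stand-ins for whatever the real game provides:

```python
import torch
from torch.distributions import Categorical

obs = torch.randn(1, 84)                      # stand-in for the current maze observation
model = ActorCritic(obs_dim=84, n_actions=4)  # the sketch defined earlier

logits, value = model(obs)        # actor's action scores + critic's expected points
dist = Categorical(logits=logits) # the policy: a probability distribution over the 4 moves
action = dist.sample()            # pick a move (usually a good one, with some exploration)
```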

Next comes the part that gives A2C its name: the “advantage”. This is basically the difference between what you actually got and what the critic expected you to get. If you end up with more points than the critic predicted, that’s a good thing; it means the move worked out better than expected!
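A common one-step version of the advantage is “reward plus the critic’s estimate of the next state, minus the critic’s estimate of the current state.” A sketch with invented numbers:

```python
gamma = 0.99        # how much future points count compared to immediate ones
reward = 50.0       # points actually collected on this step (invented)
value_here = 12.0   # critic's estimate before the move (invented)
value_next = 8.0    # critic's estimate after the move (invented)

# Positive advantage: the move turned out better than the critic expected.
advantage = reward + gamma * value_next - value_here   # ~ +45.9
```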

Finally, your computer uses the advantage to update itself for future games. Actions that led to more points than expected are made more likely, actions that did worse are made less likely, and the critic adjusts its estimates so it predicts better next time. This helps your computer learn from its mistakes and make better decisions over time!
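Here is a sketch of that update, continuing the hypothetical snippets above (model, dist, action, and value come from the decision-step example; reward, gamma, and value_next from the advantage example). In real training, value_next would come from the critic looking at the next screen; the 0.5 weighting is a common choice, not a fixed rule:

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(model.parameters(), lr=7e-4)

# What the move actually turned out to be worth, vs. what the critic expected.
target = torch.tensor([reward + gamma * value_next])
advantage = target - value.detach()

policy_loss = -(dist.log_prob(action) * advantage).mean()  # actor: favor good surprises
value_loss = F.mse_loss(value, target)                     # critic: predict better next time
loss = policy_loss + 0.5 * value_loss

optimizer.zero_grad()
loss.backward()
optimizer.step()
```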

It’s not exactly rocket science, but it can be pretty helpful for playing games like this one!
