AlphaGo was the first computer program to beat Lee Sedol, who was considered the best Go player of the decade. Before AlphaGo, the strongest Go programs could only play at an amateur level.
AlphaGo was further developed into AlphaZero, an AI that can master a variety of board games, including ones as complex as Shogi and Chess.
Why could AlphaGo and AlphaZero master these complex games, unlike previous computer programs?
Because they used Deep Reinforcement Learning Techniques.
In this blog post, I will explain the Deep Reinforcement Learning Techniques used by AlphaZero to play and win games.
I will dive into the most important concepts used by AlphaZero with Connect Four as an example.
DeepMind’s AlphaGo and the Actor-Critic Method
Standard AI methods were not able to play Go well because the game is too complex: there are far too many possible moves and board positions to evaluate exhaustively.
When DeepMind developed AlphaGo in 2015, they used a new approach — the “Actor-Critic Method”.
The Actor-Critic Method is a Deep Reinforcement Learning Technique. In this method, two deep neural networks decide what action to take.
- The first neural network examines the game board and outputs what it predicts to be the best available move.
- The second neural network evaluates the game board and outputs who it predicts will win given the current board state.
DeepMind trained these networks by showing them examples of various human amateur games. Then they had AlphaGo play against itself.
While AlphaGo performed well, it was actually held back by the human examples it had learned from.
DeepMind later iterated on AlphaGo to create AlphaZero in 2017. AlphaZero combined the two deep neural networks into one network.
The bigger change was that AlphaZero never saw any human-played games. Instead, it learned through random play against itself.
AlphaZero surpassed AlphaGo within a matter of days.
The Three Necessary Components for an Implementation of AlphaZero
There are three main components needed to build an algorithm like AlphaZero: the tree search, the deep neural network, and the game itself.
For this blog post, I will use Connect Four as the game, and I will explain why below. Then I will cover the tree search and deep neural network that AlphaZero needs to work.
The Game: Connect Four
Connect Four is a two-player zero-sum game of perfect information.
- Mechanics: It is played on a 7×6 vertical board. Players take turns dropping one of their pieces down one of the seven different columns.
- Goal: The goal of the game is to be the first player to have four pieces form a horizontal, vertical, or diagonal line.
A zero-sum game means that one player’s gain is equal to the other player’s loss. ‘Perfect information’ refers to the fact that both players are aware of the state of the game at all points.
There are two reasons why Connect Four is a good game for building a Deep Reinforcement Learning algorithm like AlphaZero:
1. It’s a Zero-Sum Game
A zero-sum game of perfect information can be encoded in a 2D matrix the size of the game board.
We can encode the state of each spot on the Connect Four board with a “1” for player one, a “-1” for player two, or a “0” for an empty spot.
This representation also allows us to swap the perspective from which any given board state is viewed. All we have to do is multiply the matrix by -1.
We can do this because there are no unknown pieces on the board from either player’s perspective and because an advantageous piece for one player is disadvantageous to the other player.
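As a minimal sketch, the encoding and the perspective flip described above might look like this in NumPy (the column and moves below are just an example):

```python
import numpy as np

# An empty Connect Four board: 6 rows x 7 columns.
# 1 = player one's piece, -1 = player two's piece, 0 = empty spot.
board = np.zeros((6, 7), dtype=np.int8)

# Player one drops a piece into column 3 (it lands on the bottom row),
# then player two drops a piece into column 3 on top of it.
board[5, 3] = 1
board[4, 3] = -1

# Swapping perspectives is a single multiplication: every piece that was
# "mine" becomes "the opponent's" and vice versa.
flipped = board * -1

print(board[5, 3], flipped[5, 3])  # 1 -1
```

Because the flip is one vectorized operation, the same network can evaluate every position from the active player's point of view.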
2. A Manageable Number of Board States
There are 4,531,985,219,092 different board states in a game of Connect Four.
That is complex enough to merit an advanced algorithm like AlphaZero, yet far simpler than Chess or Go, which have between 10^120 and 10^360 possible board states.
Now that we have chosen Connect Four for our game, let’s go over the Tree Search and Neural Network used in AlphaZero.
The Tree Search
The first step for our algorithm to work is to create a tree representation of the game.
🌲 A tree is a data structure in computer science that connects multiple nodes through a parent-child relationship.
In our case, each node represents a different board state in the game. A node will have one parent and N children, where N is the number of legal moves available. The parent node refers to the game state that led to our current state.
On the other hand, the children nodes are all the different game states that we can reach from our current state.
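A node like the one described above can be sketched as a small Python class. This is a hypothetical minimal structure for illustration, not AlphaZero's exact implementation:

```python
class Node:
    """One board state in the game tree (a minimal illustrative sketch)."""

    def __init__(self, state, parent=None):
        self.state = state    # the board matrix at this node
        self.parent = parent  # the state that led here (None for the root)
        self.children = []    # one entry per legal move from this state

    def is_leaf(self):
        # A leaf is a node whose children have not been added yet.
        return len(self.children) == 0


root = Node(state="empty board")
child = Node(state="after one move", parent=root)
root.children.append(child)
print(root.is_leaf(), child.is_leaf())  # False True
```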
The AI begins a turn by exploring our game tree.
First, it checks if the current node has any children nodes to explore.
- If there are any children, it chooses the child node that the tree search believes has the best chance for the active player to win the game.
- If there are multiple children that it believes give the active player the same chance to win, it chooses one at random.
- After this, it checks if the new node it is in has any children nodes to explore. If it does, it repeats the same process until reaching a node without children.
As it is exploring, if the AI ever encounters a board state where a winner was decided, it stops the current search.
It does this because it knows that the series of moves that it explored led to it winning or losing the game. This tells the AI whether this game path is one that we want to follow or not.
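The selection loop described above might be sketched like this, using a toy `Node` class whose `score` and `terminal` attributes stand in for the search's real evaluation and game-over check:

```python
import random


class Node:
    """Minimal node for the sketch: a name, a score, and children."""

    def __init__(self, name, score=0.0, terminal=False):
        self.name = name
        self.score = score        # the search's current estimate for this node
        self.terminal = terminal  # True if a winner was decided at this state
        self.children = []


def select_leaf(root):
    """Descend from the root, always picking the child with the best score
    and breaking ties at random, until reaching a leaf or a decided game."""
    node = root
    while node.children:
        best = max(child.score for child in node.children)
        node = random.choice([c for c in node.children if c.score == best])
        if node.terminal:  # a winner was decided, so stop the search here
            break
    return node


# Tiny example: the right child looks better, so the search follows it.
root = Node("root")
root.children = [Node("left", score=0.1), Node("right", score=0.7)]
print(select_leaf(root).name)  # right
```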
In the cases where we reach a node that has no children and we did not yet find a node where a winner was decided, we need to expand the game tree. This is done by first getting the deep neural network to evaluate the current node.
The Deep Neural Network
The network architecture that we use for an AI can be customized to match the complexity of the game we are playing as well as the computing resources available to us.
But there are three parts that the architecture needs in order to work. These are the groups of layers within our neural network.
- The Actor-Head Block
- The Critic-Head Block
- The Body Block
Starting at the end of the network, we need to have two head blocks. One of the blocks will function as the Actor while the other will take the role of the Critic. Each of these head blocks will be responsible for outputting a specific value.
The Actor-Head block will output a probability distribution over all the possible moves.
In the case of Connect Four, this will be a total of seven possible moves. Using a softmax activation on the last layer of the actor head block will give us the probability distribution we need.
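To illustrate that last point, here is a plain-NumPy softmax applied to seven hypothetical raw scores from the actor head:

```python
import numpy as np


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    exps = np.exp(logits - logits.max())  # subtract the max for stability
    return exps / exps.sum()


# Hypothetical raw scores from the actor head, one per Connect Four column.
logits = np.array([1.2, 0.3, -0.5, 2.0, 0.0, -1.1, 0.4])
probs = softmax(logits)

print(round(float(probs.sum()), 6))  # 1.0, a valid probability distribution
print(int(probs.argmax()))           # 3, column 3 looks best to the actor
```

No matter what raw scores the network produces, softmax guarantees seven positive probabilities that sum to one.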
The Critic-Head Block will output a single value ranging from ‘-1’ to ‘1’.
- A positive value represents the predicted probability that the active player will win the game from the current board state.
- A negative value represents the predicted probability that the opponent will win.
- Finally, a value of ‘0’ represents a predicted draw, where neither player has an advantage.
To ensure that the output is in the range we want, we must use a Tanh activation function on the last layer of the Critic Head Block.
Traditionally, multiple fully-connected layers are used for the head blocks. The number of units in each layer should start high in the first layer of each head and decrease in each subsequent layer.
Dropout layers and other regularization techniques can be used to get better results.
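One possible way to sketch the three blocks in PyTorch follows. The layer types and sizes here are my own illustrative choices for Connect Four, not an architecture DeepMind published:

```python
import torch
import torch.nn as nn


class AlphaZeroNet(nn.Module):
    """A deliberately small body/actor/critic sketch for Connect Four."""

    def __init__(self, board_cells=6 * 7, n_moves=7):
        super().__init__()
        # Body block: extracts features from the flattened board matrix.
        self.body = nn.Sequential(
            nn.Linear(board_cells, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
        )
        # Actor head: raw scores over the seven columns, widest layer first.
        self.actor = nn.Sequential(
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, n_moves),
        )
        # Critic head: a single value squashed into [-1, 1] by tanh.
        self.critic = nn.Sequential(
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Tanh(),
        )

    def forward(self, board):
        features = self.body(board)
        policy = torch.softmax(self.actor(features), dim=-1)
        value = self.critic(features)
        return policy, value


net = AlphaZeroNet()
board = torch.zeros(1, 42)        # an empty board, flattened to 42 cells
policy, value = net(board)
print(policy.shape, value.shape)  # torch.Size([1, 7]) torch.Size([1, 1])
```

Convolutional layers in the body would likely work better for board games; fully-connected layers keep the sketch short.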
We then take the output from both head blocks and return it to the tree search. The input for the head blocks will come from the body block.
The Body Block handles a NumPy matrix representation of the current board state. The Body Block will then extract the features that it deems important for the head blocks to be able to do their jobs.
The size and the type of layers that are used in the Body Block are highly dependent on the complexity of the game we want our AI to play. A more complex game would require more layers.
After the network is done evaluating the current node, we ask the game to provide us with a list of available moves from the current node.
For each available move, we create a deep copy of the game and we take a different move in each copy. Next, we add all the copies to the tree as children of the current node we are on.
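The expansion step might be sketched like this. `ToyGame` and the lambda evaluator below are stand-ins for a real Connect Four implementation and the neural network:

```python
import copy


class Node:
    """Minimal node, repeated here so the sketch runs on its own."""

    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.prior = 0.0  # move probability assigned by the actor head


def expand(node, game, evaluate):
    """Add one child per legal move, each holding a deep copy of the game.
    `game` must offer legal_moves(), play(move), and state(); `evaluate`
    stands in for the neural network and returns (move priors, value)."""
    priors, value = evaluate(game)
    for move in game.legal_moves():
        game_copy = copy.deepcopy(game)  # never mutate the original state
        game_copy.play(move)
        child = Node(game_copy.state(), parent=node)
        child.prior = priors[move]
        node.children.append(child)
    return value


class ToyGame:
    """Stand-in game: the state is simply the moves played so far."""

    def __init__(self):
        self.moves = []

    def legal_moves(self):
        return [0, 1, 2]

    def play(self, move):
        self.moves.append(move)

    def state(self):
        return tuple(self.moves)


root = Node(state=())
value = expand(root, ToyGame(), lambda g: ({0: 0.5, 1: 0.3, 2: 0.2}, 0.0))
print(len(root.children))  # 3
```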
To finish expanding the tree, we update the values for each node we have traversed to get to the current node. The important part is that we do not explore any of the new child nodes we just added to the tree at this point.
To update the values, for each node we combine the output of the Critic-Head Block with the probability of reaching this node, given by the Actor-Head Block of the node's parent.
This combined value is then scaled down based on how often we have ended up in the current node instead of one of its siblings. The scaling discourages the tree search from always choosing the same paths in future runs.
The other update we make is to the value output by the Critic-Head Block of the node's parent.
The parent's value is updated by adding the difference between the negated Critic-Head Block value of the current node and the parent's current Critic-Head Block value.
We use the negative value in order to swap the active player. This works with any Zero-Sum game, such as Connect Four, since an increased chance of winning for one player means an equal decreased chance of winning for the other player.
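A common way to implement this sign-flipping update is a simple walk back up the tree. The sketch below uses a minimal `Node` class so it runs on its own:

```python
class Node:
    """Minimal node: just a parent link and the running statistics."""

    def __init__(self, parent=None):
        self.parent = parent
        self.visit_count = 0
        self.value_sum = 0.0


def backpropagate(node, value):
    """Push the critic's evaluation up the tree, negating it at every step
    because the active player alternates between levels (zero-sum)."""
    while node is not None:
        node.visit_count += 1
        node.value_sum += value
        value = -value  # a good position for one player is bad for the other
        node = node.parent


# A three-node chain: root -> child -> leaf.
root = Node()
child = Node(parent=root)
leaf = Node(parent=child)

backpropagate(leaf, 1.0)  # the leaf looks winning for its active player
print(leaf.value_sum, child.value_sum, root.value_sum)  # 1.0 -1.0 1.0
```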
After we finish updating the values, we repeat the exploration and update steps. With each repetition, the tree grows and develops a clearer idea of what an ideal move would be.
The repetition continues for either a predetermined number of iterations or a predetermined amount of time.
Once the limit is reached, the AI uses the results of the tree search to select the move that optimizes its chance to win during the exploration.
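A common way to make that final selection (one option among several; choosing by average value is another) is to pick the child that was visited most often during the search. The stub classes below stand in for a real search tree:

```python
class Child:
    def __init__(self, visit_count):
        self.visit_count = visit_count  # visits accumulated during the search


class Root:
    def __init__(self, counts):
        self.children = [Child(c) for c in counts]


def choose_move(root):
    """Pick the move whose child was visited most often: the line the
    search ended up trusting the most."""
    visits = [child.visit_count for child in root.children]
    return visits.index(max(visits))


# Hypothetical visit counts for three columns after a finished search.
print(choose_move(Root([10, 42, 7])))  # 1
```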
I hope that you have a better understanding of how AlphaZero works after reading this blog post. I also encourage you to explore your own Deep Reinforcement Learning projects at home!
Please look out for future blog posts where I will go in more depth into the actual code to create your own AlphaZero.