In a previous post, I explained how AlphaZero can play Connect Four. If you haven’t already, you can read the first part here.
In this part 2, I will walk through how to write the code for the Monte Carlo Tree Search that plays Connect Four.
First, we will import our libraries and define our class object and its variables.
Second, we will code a loop where the first step is to explore the tree to learn the best moves by simulating playing out the game, and the next step is to actually make a move. When it’s our turn again, we will explore the tree again and make another move.
This loop will continue until the winner for Connect Four is decided.
I will write a final blog post to explain the deep learning part of the code – where we can take advantage of learnings from previous games to make better decisions in future games – at a later date.
Monte Carlo Tree Search
Let’s do a quick recap of tree diagrams.
🌲 Definition: A tree is a data structure that connects multiple nodes through a parent-child relationship where each node represents a different state of the board. A node will have one parent and N children, where N is the number of legal moves available.
The tree search helps the AI decide the utility of the moves available in a given state. The AI then picks the most optimal move to help it win.
Step One: Importing Libraries
First, we import the following libraries: TensorFlow, NumPy, `math`, and `random`.
TensorFlow and NumPy will be used to create the variables that will later be passed to a deep neural network. The `math` and `random` modules are used for some numerical operations. `C` is a global variable we'll go over later.
```python
import tensorflow as tf
import numpy as np
import math
import random

C = 1.0
```
Step Two: Defining the Node Object
Next, we define our `Node` class. These are the nodes on our tree graph that are connected via parent-child relationships.
The constructor takes a `game` object (Connect Four) as a parameter, and it can also take another `Node` object to be the mother.
We will set the following attributes:
- `self.game` – The game object holding the game state that this node represents
- `self.mother` – Points to the mother of the current node, if one exists
- `self.children` – A dictionary that keeps track of all the children nodes
- `self.U` – The utility of the state, calculated by the tree search
- `self.prob` – The probability distribution over the actions available from the current state, used to decide the next action
- `self.V` – The value of the state (how likely you are to win), calculated by the neural network model
- `self.N` – The visit count, which tracks how many times we've been in the same state
- `self.outcome` – Keeps track of whether or not the game has ended
Refer to the definitions above as you read through the code to remember what the attributes do.
```python
class Node:
    def __init__(self, game, mother=None):
        super(Node, self).__init__()
        self.game = game
        self.mother = mother
        self.children = {}
        self.U = 0
        self.prob = None
        self.V = 0
        self.N = 0
        self.outcome = self.game.score

        # if the game has already ended, set V and U from the final score
        if self.game.score is not None:
            self.V = self.game.score * self.game.player
            self.U = 0 if self.game.score == 0 else self.V * float('inf')
```
When creating each `Node` object, we also check if the game has ended. If it has, we update `V` to either 1 or -1 and `U` to either infinity or negative infinity.
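Before moving on, here is a rough summary of the interface the tree-search code expects from the `game` object. This is only a sketch pieced together from the calls made later in this post, not the actual Connect Four implementation, so treat the names and types as assumptions.

```python
# Rough interface the tree search expects from the game object, inferred from
# the calls made in this post, not the real Connect Four implementation.
class ConnectFour:
    state: 'np.ndarray'       # the board, presumably 1 / -1 for pieces and 0 for empty
    player: int               # the active player, 1 or -1
    score: 'int | None'       # None while the game is running, otherwise 1, -1 or 0
    w: int                    # number of columns (7 in standard Connect Four)
    available_moves: list     # columns that still have room for a piece

    def sim_move(self, i):    # return a copy of the game with a piece dropped in column i
        ...

    def get_mask(self):       # mask of legal columns, later passed to the neural network
        ...
```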
The `Node` class contains two main functions. The first function is `explore` and the other is `next`.
In `explore`, we navigate our tree to learn the best move, and in `next` we actually make the move.
Step Three: Exploring The Tree
Now let’s write the code to begin our tree search.
Explore Function
`explore` is in charge of navigating the tree and learning how to play the given game. In `explore`, we never actually make a move – we are just simulating playing the game out so we can determine the best move to make.
When calling `explore`, we need to pass a model that the tree will use to evaluate the different game states.
To start `explore`, we first make sure the game hasn't already ended (its score is still `None`).
Next, we set `node` to the current node and start traversing down the tree. At each step we check the `U` value (utility) of all the children nodes and take the action that corresponds to the child with the highest `U` value. If there is a tie for the highest `U` value, we break it by choosing randomly among the tied actions.
```python
def explore(self, policy):
    if self.game.score is not None:
        raise ValueError('Game has ended with score {0:d}'.format(self.game.score))

    node = self

    # walk down the tree, always following a child with the highest U value
    while len(node.children) > 0 and node.outcome is None:
        max_U = max(c.U for c in node.children.values())
        actions = [a for a, c in node.children.items() if c.U == max_U]
        if len(actions) == 0:
            raise ValueError("error zero length ", max_U)
        action = random.choice(actions)
```
Check if the Game Ended
Before we move to the next node, we check to see if `max_U` (defined in the code above) is either infinity or negative infinity.
- Winning a game has a `U` value of infinity because there can't be a better move than winning the game.
- Similarly, losing a game has a `U` value of negative infinity because there can't be a worse move than losing.
If `max_U` is negative infinity, `node.U` gets set to infinity and `node.V` to 1. On the other hand, if `max_U` is infinity, `node.U` gets set to negative infinity and `node.V` to -1.
The signs flip because `max_U` is calculated from the node's children, and a turn passes as we go from parent to child, so what is good for one player is bad for the other.
```python
        if max_U == -float('inf'):
            node.U = float('inf')
            node.V = 1.
            break
        elif max_U == float('inf'):
            node.U = -float('inf')
            node.V = -1.
            break
```
After checking the value of `max_U`, we set `node` to the child node that corresponds to the action with the highest `U` value. Then we repeat the loop until we reach a node with no children, or a node where the game has ended.
```python
        node = node.children[action]
```
Expanding the Tree
Once we arrive at a node that doesn’t possess any children, it’s time to expand the tree. We do this by creating more children nodes.
This starts by calling `process_policy` and passing the following three parameters to the function:
- `policy` – The model to be used to evaluate the game state
- `node.game` – The game object for the current node (Connect Four)
- `node.game.player` – The active player for the current node
The `process_policy` function returns the probability distribution over the child nodes, which helps us select the next move. It also evaluates the value of the current game state (how likely the active player is to win).
We'll go over how the `process_policy` function works in more detail later.
After storing both values in `node.prob` and `node.V` respectively, the next step is to create the child nodes for the current node.
```python
    # expand the tree once we reach a node with no children
    if len(node.children) < 1:
        probs, v = process_policy(policy, node.game, node.game.player)
        node.prob = probs
        node.V = float(v)
```
To create the child nodes, we get the list of available moves by querying `node.game.available_moves`. This gives us a list of integers corresponding to the columns that still have room for Connect Four pieces.
For example, if all the columns have room, the following list gets returned: `[0,1,2,3,4,5,6]`. However, if the second column from the left is full, the following list would be returned: `[0,2,3,...,6]`.
We then iterate through the list we got back from our query. For each available move, we call `mk_child`, which returns a `Node` object. We add the newly created node to the `children` dictionary of the current node, using the integer corresponding to the action as the key.
```python
        for i in node.game.available_moves:
            ch = node.mk_child(i)
            node.children[i] = ch

    # count this visit
    node.N += 1
```
After increasing the current node's visit count (`node.N`), we are ready to update the tree via backpropagation.
Assigning Values of Previous Moves via Backpropagation
For this article, backpropagation refers to the process of tuning the values of the nodes to improve the prediction accuracy of the tree search. We will be tuning the values of `U` and `V`.
Because we explored the tree, we have more information. We know the result of the moves we took and whether a move made us win or lose the game. Therefore, we can update the `V` value of each previous node based on our prediction of how good it is.
We do backpropagation using a `while` loop. First, we update all the values. Next, we check whether the current node has a mother node. If it does, we change the current node to its mother.
We keep doing this until we reach a node with no mother (the root of the tree), at which point the loop exits.
```python
    # backpropagate: climb from the node we just expanded back up to the root
    while node.mother:
```
The first value we update in the loop is the visit count (`N`). We increase the visit count of the mother node by one, since the only visit count we've updated so far is that of the last node we visited. After that, we update the mother node's `V` value.
It is important to remember that a turn passes each time we move up or down the search tree, which means that the active player changes. Therefore, we need to take the negative of `node.V` so that both values are from the perspective of the same player.
To make the update, we add the difference between the negative of `node.V` and the mother's `V` value, divided by the mother's visit count. This is an incremental average, so the updates get smaller as the visit count grows.
```python
        node.mother.N += 1
        # running average of the (negated) values backed up through this node
        node.mother.V += (-node.V - node.mother.V) / node.mother.N
```
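To see what this update does, here is a tiny worked example with made-up numbers: the mother's `V` moves toward the negated child value by a step of 1/N, so over time it becomes a running average of everything backed up through it.

```python
# made-up numbers, purely to illustrate the update above
mother_N, mother_V = 3, 0.2    # mother visited 3 times so far, current value 0.2
child_V = 0.5                  # value of the child we just evaluated

mother_N += 1                                  # node.mother.N += 1
mother_V += (-child_V - mother_V) / mother_N   # node.mother.V += (-node.V - node.mother.V) / node.mother.N

print(mother_V)  # prints 0.025: the old value nudged toward -0.5 by a 1/4 step
```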
After updating the mother's `V` value, we need to update the `U` value of the current node and all of its siblings.
Keep in mind that the `U` value is used during the exploration phase to decide which action to take. To get both the current node and all its siblings, we iterate through the children dictionary of the current node's mother, which gives us both the actions and their corresponding nodes.
Next, for each node, we update the `U` value by taking its `V` value and adding the node's probability scaled by the constant `C` and by the square root of the mother's visit count divided by one plus the node's own visit count.
This makes the algorithm more likely to pick nodes that the neural network thinks are good, as well as nodes that have been seldom visited during exploration.
```python
        # U balances exploitation (V) against exploration (prior * sqrt(mother visits) / (1 + visits))
        for i, n in node.mother.children.items():
            if n.U != float('inf') and n.U != -float('inf'):
                n.U = n.V + C * n.mother.prob[i] * math.sqrt(n.mother.N) / (1 + n.N)
```
Lastly, we set the current node to its mother and go through the loop again, until we reach the top of the tree, where the node has no mother and the loop exits.
```python
        node = node.mother
```
That covers how the explore function works.
Helper Functions for Explore
Now let's go over the two functions that are called by `explore()`: `mk_child()` and `process_policy()`.
First, `mk_child()` works by creating a new node from a new copy of the `game` object with the selected move applied. We then return the newly created node, which gets added to the `children` dictionary.
```python
def mk_child(self, i):
    # create a child Node from a copy of the game with move i played
    child = Node(self.game.sim_move(i), self)
    return child
```
Second, `process_policy()` is the function used by `explore()` so that the tree search can use the neural network to evaluate different game states.
It takes three arguments:
- `policy`, which is the model to be used,
- `game`, which is the game object for the current game state, and
- the active player.
```python
def process_policy(policy, game, player):
```
The function starts by multiplying the current board state by the active player. This makes it so that the model will always be analyzing the game from the perspective of the same player since the player will either be 1 or -1.
Next, we expand the dimensions twice so that they can be fed to the network.
```python
    # analyze the board from the active player's perspective,
    # then add batch and channel dimensions for the network
    state = game.state * player
    state = np.expand_dims(state, 0)
    state = np.expand_dims(state, -1)
```
We also get a mask from the game object that will be used to set the probability of any unavailable moves to 0.
```python
    mask = game.get_mask()
```
Lastly, we return the `probs` and `val` variables that are returned by the model.
```python
    probs, val = policy(state, mask)
    probs = tf.squeeze(probs)
    val = tf.squeeze(val)
    return probs, val
```
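The actual `policy` model is the subject of part 3. If you want to exercise the tree search before then, a minimal stand-in with the same `policy(state, mask) -> (probs, val)` interface could look like the sketch below. The name and the shape handling are my own placeholder assumptions, not the model from this series.

```python
# A placeholder policy for testing the tree search on its own.
# It spreads probability evenly over the legal columns and claims the position is neutral.
def uniform_policy(state, mask):
    mask = tf.cast(tf.reshape(mask, (1, -1)), tf.float32)
    probs = mask / tf.reduce_sum(mask)   # uniform over legal moves
    val = tf.zeros((1, 1))               # value 0: no opinion on who is winning
    return probs, val
```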
Step Four: Choosing the Next Move
Now that we've used the `explore` function to learn the best move to make, it's time for us to actually make the move.
Next Function
The second main function in the `Node` class is `next()`. This function is responsible for selecting the actual move that will be played. It has one optional argument, `tau`, that controls the exploration vs exploitation dynamic of the tree search. We will go over this later on in the article.
```python
def next(self, tau=1.):
```
The function starts by checking that we are in a valid game state, making sure that the game hasn't ended and that the current node has children.
```python
    if self.game.score is not None:
        raise ValueError('game has ended with score {0:d}'.format(self.game.score))
    if not self.children:
        print(self.game.state)
        raise ValueError('no children found and game hasn\'t ended')
```
Calculating the Probability Distribution
After ensuring we have a valid game state, the `prob` list is created. The `prob` list will hold the probability distribution that will later be used to select the move that gives us the highest probability of winning.
To make sure `prob` has the correct dimensions, its length is set to the number of columns and all the values are initialized to 0.
```python
    # one probability slot per column, initialized to zero
    prob = [0. for _ in range(self.game.w)]
```
Next, we check to see if there is a winning move that can be played. To do this, we take the maximum `U` value out of all the children.
If the returned value is `float('inf')`, then we know there is a winning move, and we set the corresponding index of `prob` to 1.
```python
    max_U = max(c.U for c in self.children.values())

    # if a child is a guaranteed win, always pick it
    if max_U == float('inf'):
        for k in self.children.keys():
            if self.children[k].U == float('inf'):
                prob[k] = 1.
```
If we don't find a winning move, we have to calculate the `prob` values using the visit counts (`N`) of each child.
The reason for using `N` instead of `U` is that `U` is calculated in a way that favors both good moves and seldom-used moves. The move with the highest `N`, on the other hand, can only be the best move: the only way for a move to end up with the highest visit count is for its value to be large enough to compensate for the penalty frequently visited nodes receive during exploration.
To calculate `prob`, the first step is to find the maximum value of `N` among the children and add 1 to it. We add 1 because we will divide each child's `N` value by this maximum, and we do not want the result to be exactly 1.
```python
    else:
        # no forced win: weight moves by how often they were visited
        maxN = max(node.N for node in self.children.values()) + 1
```
Exploration VS Exploitation Dynamic
Next, for each child, we divide its `N` value by the maximum `N` value among the children and raise the result to the power of 1 divided by `tau`.
💡 Tau is the variable that is used to control the exploration vs exploitation dynamic of the tree search. Exploring the tree is when we search the possible moves; exploiting is when we exploit our knowledge to make good moves.
As tau increases, the exponent gets smaller. On the other hand, as tau decreases, the exponent gets larger.
The exponent is then used to either contract or expand the distance between the probability values calculated for each child. The reason why this matters will become clear shortly.
Lastly, we assign the calculated value to the corresponding index in the `prob` list.
```python
        for k in self.children.keys():
            p = (self.children[k].N / maxN) ** (1 / tau)
            prob[k] = p
```
Next, we use `np.sum()` on `prob` to find the total of the values in the `prob` array.
```python
    tot = np.sum(prob)
```
Lastly, to get a proper probability distribution where all the values add up to 1, we divide all the values of `prob` by the `tot` we previously calculated. It is thanks to this step that `tau` is able to balance the exploration vs exploitation dynamic of the tree search.
Effects of Tau
As tau increases, the distances between the probability values shrink.
For example, 1% and 1.1% are close in value. This makes it harder for our model to single out the most visited action (`N`). As tau approaches infinity, we end up picking a random action, since all the probabilities become the same.
As tau decreases, the distances between the probability values grow.
For example, 1% and 10% are far apart in value. This makes it easier for our model to choose the most visited action (`N`). As tau approaches 0, the highest valued action becomes almost certain to be picked.
```python
    if tot > 0:
        prob = prob / tot
    else:
        # if the sum is 0, give every column the same probability
        prob = np.full((self.game.w,), 1 / self.game.w)
```
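To make the effect of `tau` concrete, here is a small standalone snippet with made-up visit counts. The numbers are only illustrative, but they show how the same counts turn into very different distributions depending on `tau`.

```python
import numpy as np

N = np.array([20., 10., 2.])   # hypothetical visit counts for three children
maxN = N.max() + 1

for tau in (5., 1., 0.2):
    prob = (N / maxN) ** (1 / tau)
    print(tau, np.round(prob / prob.sum(), 2))

# tau = 5   -> roughly [0.40, 0.35, 0.25]  close to uniform, lots of exploration
# tau = 1   -> roughly [0.62, 0.31, 0.06]  proportional to the visit counts
# tau = 0.2 -> roughly [0.97, 0.03, 0.00]  almost always the most visited move
```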
To finally pick the action to be carried out, we create a new list and populate it with the values from `prob`. This list is then used as the weights for `random.choices()`, which returns the child node associated with the selected move.
```python
    # sample a child node, weighted by the probability distribution we just built
    weights = [prob[k] for k in self.children.keys()]
    state = random.choices(list(self.children.values()), weights=weights)[0]
```
For the last part of the `next` function, we return:
- the `state` variable that holds the node for the selected move
- the `V` value of the current node
- the `prob` that we calculated, as a TensorFlow `float32` variable that will be used for training the neural network later
- the current game object.
For `V`, we return the negative value because `V` was originally calculated with respect to the previous player. Most of the values returned are for training the neural network, which we will go over in the next post.
```python
    return state, -self.V, tf.Variable(prob, dtype=tf.float32), self.game
```
Concluding Thoughts
Now that we have completed our Monte Carlo Tree Search, our AI can play out a game of Connect Four!
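To tie it together, here is a rough sketch of the loop described at the start of this post: explore for a while, make a move, and repeat until the game is decided. The `ConnectFour()` constructor, the number of explorations, and `policy` are placeholders here; `policy` could be the real model from part 3 or the `uniform_policy` stand-in sketched earlier.

```python
# a sketch of the play loop, using placeholder names for the game and the policy
game = ConnectFour()
root = Node(game)

while root.outcome is None:
    # simulate games from the current position to learn which move is best
    for _ in range(50):
        root.explore(policy)

    # actually make the move and continue from the resulting position
    root, value, probs, current_game = root.next()
    root.mother = None   # detach the new root so backpropagation stops here

print('Game over with score:', root.outcome)
```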
In part 3, I will go over the deep learning aspect of the code.
The deep learning element of the code will allow our AI to play multiple games, learn from those games, and apply that knowledge to future games. Essentially, we'll be making our AI smarter. We will code both the model and its training loop.