AlphaZero Plays Connect Four: Coding a Tree Search in Python

In a previous post, I explained how AlphaZero can play Connect Four. If you haven’t already, you can read the first part here.

In this part 2, I will walk through the code for the Monte Carlo Tree Search that plays Connect Four.

First, we will import our libraries and define our class object and its variables.

Second, we will code a loop where the first step is to explore the tree to learn the best moves by simulating playing out the game, and the next step is to actually make a move. When it’s our turn again, we will explore the tree again and make another move.

This loop will continue until the winner for Connect Four is decided.

I will write a final blog post to explain the deep learning part of the code – where we can take advantage of learnings from previous games to make better decisions in future games – at a later date.

Monte Carlo Tree Search

Let’s do a quick recap of tree diagrams.

🌲 Definition: A tree is a data structure that connects multiple nodes through a parent-child relationship where each node represents a different state of the board. A node will have one parent and N children, where N is the number of legal moves available.

The tree search helps the AI decide the utility of the moves available in a given state. The AI then picks the most optimal move to help it win.

Step One: Importing Libraries

First, we import the following libraries: TensorFlow, NumPy, math, and random.

TensorFlow and NumPy will be used to create the variables that will later be passed to a Deep Neural Network. The math and random modules are used for some numerical operations. C is a global variable we’ll go over later.

import tensorflow as tf
import numpy as np
import math
import random

C = 1.0

Step Two: Defining the Node Object

Next, we define our Node class. These are the nodes on our tree graph that are connected via parent-child relationships.

We set a game object (Connect Four) as a parameter, and we can optionally pass another Node object to be the mother (the parent node).

We will set the following attributes:

  • self.game – This holds the game object that represents the current game state that the node is supposed to represent
  • self.mother – This points to the mother of the current node if one exists
  • self.children – We’ll keep track of all the children nodes in this dictionary 
  • self.U – The utility of the state. This is calculated by the tree search
  • self.prob – The probability distribution among all the available actions from the current state used to decide the next action
  • self.V – The value of the state (how likely you are to win). This is calculated by the neural network model
  • self.N – The visit count. It keeps track of how many times we’ve been in the same state
  • self.outcome – This keeps track of whether or not the game ended

Refer to the definitions above as you read through the code to remember what the attributes do.

class Node:
    def __init__(self, game, mother=None):
        super(Node, self).__init__()
        self.game = game
        self.mother = mother
        self.children = {}
        self.U = 0
        self.prob = None
        self.V = 0
        self.N = 0
        self.outcome = self.game.score

        if self.game.score is not None:
            self.V = self.game.score * self.game.player
            self.U = 0 if self.game.score == 0 else self.V * float('inf')

When creating each Node, we also check whether the game has already ended. If it ended in a win or loss, we set V to 1 or -1 and U to infinity or negative infinity; if it ended in a draw (score of 0), both stay at 0.

The Node class contains two main functions. The first function is “explore” and the other is “next”.

In “explore”, we will navigate our tree to learn the best move and in “next” we actually make the move.

Step Three: Exploring The Tree

Now let’s write the code to begin our tree search.

Explore Function

“explore” is in charge of navigating the tree and learning how to play the given game. In explore, we never actually make a move – we are just simulating playing the game out so we can determine the best move to make.

When calling explore, we need to pass a model that the tree will use to evaluate the different game states.

To start “explore”, we first check that the game hasn’t already ended (the score is still None); if it has, we raise an error.

Next, we set “node” to be the current node and start traversing down the tree. We start traversing by first checking the U value (utility) of all the children nodes. We then take the action that corresponds to the node with the highest U value. If there is a tie for the highest U value, we break it by choosing an action randomly from among the tied actions.

    def explore(self, policy):
        if self.game.score is not None:
            raise ValueError('Game has ended with score {0:d}'.format(self.game.score))

        node = self

        while len(node.children) > 0 and node.outcome is None:
            max_U = max(c.U for c in node.children.values())
            actions = [a for a, c in node.children.items() if c.U==max_U]

            if len(actions) == 0:
                raise ValueError("error zero length ", max_U)

            action = random.choice(actions)

Check if the Game Ended

Before we move to the next node, we check to see if max_U (defined in the code above) is either infinity or -infinity.

  • Winning a game has a U value of infinity because there can’t be a better move than winning the game. 
  • Similarly, losing a game has a U value of -infinity because there can’t be a worse move than losing.

If max_U was -infinity, node.U gets set to infinity and node.V to 1. On the other hand, if max_U was infinity, node.U gets set to -infinity and node.V to -1. 

The value of max_U is calculated from the node’s children, and a turn passes as we go from parent to child. So if the best any child can offer is a guaranteed loss (max_U of -infinity), every move leads the opponent into a lost position and the current node is a guaranteed win for the active player; the reverse holds when max_U is infinity.

            if max_U == -float('inf'):
                node.U = float('inf')
                node.V = 1.
                break

            elif max_U == float('inf'):
                node.U = -float('inf')
                node.V = -1.
                break

After checking the value of max_U, we set “node” to the child node that corresponds to the action that had the highest U value. Then we repeat the loop until we reach a node with no children or a node where the game ended.

            node = node.children[action]

Expanding the Tree

Once we arrive at a node that doesn’t possess any children, it’s time to expand the tree. We do this by creating more children nodes.

This starts by calling process_policy and passing the following three parameters to the function:

  • policy – The model to be used to evaluate the game state
  • node.game – The game object for the current node (Connect Four)
  • node.game.player – The active player for the current node

The process_policy function returns the probability distribution for the child nodes, which helps us select the next move. It also evaluates the value of the current game state (how likely the active player is to win). 

We’ll go over how the process_policy function works in more detail later.

After storing both values in node.prob and node.V respectively, the next step is to create the child nodes for the current node. 

        if len(node.children) < 1:
            probs, v = process_policy(policy, node.game, node.game.player)
            node.prob = probs
            node.V = float(v)

To create the child nodes, we get the list of available moves by querying node.game.available_moves.

This gives us a list of integers corresponding to the columns that still have room for Connect Four pieces.

For example, if all the columns have room then the following list gets returned: [0,1,2,3,4,5,6]. However, if the second column from the left is full then the following list would be returned: [0,2,3,...,6]
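
The Connect Four game class itself isn’t covered in this post, but as a rough idea, here is a minimal standalone sketch of how a list like this could be derived, assuming the board is stored as a NumPy array where 0 marks an empty cell (the real game object may compute it differently):

import numpy as np

# Minimal sketch, not the actual game class: a column is playable
# if its top cell (row 0) is still empty.
def available_columns(board):
    return [col for col in range(board.shape[1]) if board[0, col] == 0]

board = np.zeros((6, 7))         # empty 6x7 Connect Four board
print(available_columns(board))  # [0, 1, 2, 3, 4, 5, 6]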

We then iterate through the list we got back from our query. For each move, we call mk_child, which returns a new Node object, and add it to the current node’s children dictionary, using the integer corresponding to the action as the key.

            for i in node.game.available_moves:
                ch = node.mk_child(i)
                node.children[i] = ch

        node.N += 1

After increasing the current node’s visit count (node.N), we are ready to update the tree via backpropagation.

Assigning Values of Previous Moves via Backpropagation

For this article, backpropagation refers to the process of tuning the values of the nodes to improve the prediction accuracy of the tree search. We will be tuning the values of U and V.

Because we explored the tree, we have more information. We know the result of the moves we took and if a move made us win or lose the game. Therefore, we can update the V value of each previous node based on our prediction of how good it is.

We do backpropagation with a while loop. On each pass, we first update the values, then check whether the current node has a mother node. If it does, we set the current node to its mother and repeat.

We keep doing this until we reach a node with no mother (the root of the tree), at which point the loop condition fails and we exit.

        while node.mother:

The first value we update in the loop is the visit count (N).

We increase the visit count of the mother node by one, since so far the only visit count we’ve updated is that of the last node we visited. After that, we update the mother node’s V value.

It is important to remember that a turn passes each time we move up or down the search tree which means that the active player changes. Therefore, we need to take the negative value of node.V in order to make it so that both values are from the perspective of the same player.

To make the update, we add the difference between the negated node.V and the mother’s current V value, divided by the mother’s visit count. This is a running-average update, so the updates get smaller as the visit count grows.

            node.mother.N += 1
            node.mother.V += (-node.V - node.mother.V) / node.mother.N
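
To see why this behaves like a running average, here is a small standalone illustration with made-up values; the incremental update keeps the mother’s V equal to the mean of its starting value and the negated child values seen so far:

# Standalone illustration with made-up numbers (not part of the Node class).
child_values = [1.0, -1.0, 1.0, 1.0]   # hypothetical node.V values seen on each visit

mother_V = 0.0
mother_N = 1                           # the mother has been visited once already

for v in child_values:
    mother_N += 1
    mother_V += (-v - mother_V) / mother_N

print(round(mother_V, 2))  # -0.4, the mean of [0, -1, 1, -1, -1]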

After updating the mother’s V value, we need to update the U value of the current node and all its siblings. 

Keep in mind that the U value is used during the exploration phase to decide which action to take. To cover both the current node and all its siblings, we iterate through the mother’s children dictionary, whose items give us both the actions and the corresponding nodes.

Next, for each node, we update the U value by taking its V value and combining it with the probability of getting to said node scaled by the square root of the number of visits to the mother node divided by one plus the number of visits to said node.

This makes the algorithm more likely to pick nodes that the neural network thinks are good, as well as nodes that have seldom been visited during exploration.

            for i, n in node.mother.children.items():
                if n.U != float('inf') and n.U != -float('inf'):
                    n.U = n.V + C * n.mother.prob[i] * math.sqrt(n.mother.N) / (1+n.N)
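
As a quick illustration with made-up numbers, here is the same formula applied to two sibling nodes; the seldom-visited sibling ends up with the higher U, so it will be explored next even though its V value is lower:

import math

C = 1.0
mother_N = 10
mother_prob = [0.5, 0.5]   # made-up prior probabilities for two siblings

V = [0.4, 0.1]             # sibling 0 looks better...
N = [8, 1]                 # ...but has been visited much more often

for i in range(2):
    U = V[i] + C * mother_prob[i] * math.sqrt(mother_N) / (1 + N[i])
    print(i, round(U, 2))  # prints 0 0.58, then 1 0.89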

Lastly, we set the current node to its mother and go through the loop again; the loop ends when we reach the top of the tree (a node with no mother).

            node = node.mother

That covers how the explore function works.

Helper Functions for Explore

Now let’s go over the two functions that are called by explore(): mk_child() and process_policy().

First, mk_child() creates a new node from a new copy of the game object with the selected move applied. We then return the newly created node so it can be added to the children dictionary.

    def mk_child(self, i):
        child = Node(self.game.sim_move(i), self)

        return child

Second, process_policy() is the helper that explore() uses to evaluate different game states with the neural network.

It takes three arguments:

  1. policy, which is the model to be used,
  2. game, which is the game object for the current game state, and
  3. player, which is the active player.

def process_policy(policy, game, player):

The function starts by multiplying the current board state by the active player. This makes it so that the model will always be analyzing the game from the perspective of the same player since the player will either be 1 or -1.

Next, we expand the dimensions twice so that they can be fed to the network.

    state = game.state * player
    state = np.expand_dims(state, 0)
    state = np.expand_dims(state, -1)

We also get a mask from the game object that will be used to set the probability of any unavailable moves to 0.

    mask = game.get_mask()
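
How the mask gets applied is part of the model, which the next post covers. As a rough illustration of the idea (not the actual model code), masking typically multiplies the raw probabilities by the mask and renormalizes, so full columns end up with probability 0:

import numpy as np

# Rough illustration of masking; the actual model may do this differently.
raw_probs = np.array([0.10, 0.30, 0.15, 0.25, 0.05, 0.10, 0.05])
mask = np.array([1., 1., 1., 0., 1., 1., 1.])   # column 3 is full

masked = raw_probs * mask
masked = masked / masked.sum()   # renormalize so the values sum to 1 again
print(masked.round(3))           # column 3 now has probability 0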

Lastly, we return the probs and val variables that are returned by the model.

    probs, val = policy(state, mask)
    probs = tf.squeeze(probs)
    val = tf.squeeze(val)

    return probs, val
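
The neural network itself is the topic of the next post. If you want to run the tree search before the model exists, any callable with the same interface will do; here is a purely hypothetical stand-in that returns a uniform distribution over the available columns and a neutral value:

import tensorflow as tf

# Hypothetical stand-in with the interface process_policy expects:
# it takes the prepared state and the mask, and returns (probs, val).
def dummy_policy(state, mask):
    mask = tf.cast(mask, tf.float32)
    probs = mask / tf.reduce_sum(mask)   # uniform over the available columns
    val = tf.zeros((1,))                 # neutral value: no idea who wins
    return probs, val

This lets you exercise the tree search end to end before writing the real model.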

Step Four: Choosing the Next Move

Now that we’ve used the explore function to learn the best move to make, it’s time for us to make the move.

Next Function

The second main function in the Node class is next(). This function is responsible for selecting the actual move that will be played. It has one optional argument that controls the exploration vs exploitation dynamic of the tree search. We will go over this more later on in the article.

    def next(self, tau=1.):

The function starts by checking that we are in a valid game state: making sure that the game hasn’t ended and that the current node has children.

        if self.game.score is not None:
            raise ValueError('game has ended with score {0:d}'.format(self.game.score))

        if not self.children:
            print(self.game.state)
            raise ValueError('no children found and game hasn\'t ended')

Calculating the Probability Distribution

After ensuring we have a valid game state, the prob list is created. The prob list will hold the probability distribution that will later be used to select the move that gives us the highest probability of winning.

To ensure that prob has the correct length, we create one entry per board column (self.game.w) and set all the values to 0.

        prob = [0. for _ in range(self.game.w)]

Next, we check to see if there is a winning move that can be played. To do this we take the max of the U value out of all the children.

If the returned value is float('inf'), then we know that there is a winning move and we set the corresponding index of prob to 1.

        max_U = max(c.U for c in self.children.values())

        if max_U == float('inf'):
            for k in self.children.keys():
                if self.children[k].U == float('inf'):
                    prob[k] = 1.

If we don’t find a winning move, we have to calculate the prob values using the visit counts (N) for each child.

The reason for using N instead of U is that U is calculated in a way that favors both good moves and seldom-visited moves. The move with the highest visit count is therefore the one the search judged best.

This is because the way for a move to have the highest visit count is for it to have a value large enough to compensate for the penalty given for having a high visit count during exploration. 

To calculate prob, the first step is to find the maximum value of N among the children and add 1 to it. We add 1 because we will divide each child’s N value by this maximum, and we do not want the most-visited child’s ratio to be exactly 1.

        else:
            maxN = max(node.N for node in self.children.values()) + 1

Exploration vs. Exploitation Dynamic

Next, for each child, we divide its N value by the maximum N value among the children and raise the result to the power of 1 divided by tau.

💡 Tau is the variable that is used to control the exploration vs exploitation dynamic of the tree search. Exploring the tree is when we search the possible moves; exploiting is when we exploit our knowledge to make good moves.

As tau increases, the exponent 1/tau gets smaller. On the other hand, as tau decreases, the exponent gets larger.

The exponent is then used to either contract or expand the distance between the probability values calculated for each child. The reason why this matters will become clear shortly.

Lastly, we assign the calculated value to the corresponding index in the prob list.

            for k in self.children.keys():
                p = (self.children[k].N / maxN) ** (1 / tau)
                prob[k] = p

Next, we use np.sum() on prob to find the total of the values in the prob array. 

        tot = np.sum(prob)

Lastly, to get a proper probability distribution where all the values add up to 1, we divide all the values of prob by tot that we previously calculated. It is due to this step that tau works to balance the exploration vs exploitation dynamic of the tree search. 

Effects of Tau

As tau increases, the distance between the probability values gets smaller.

For example, 1% and 1.1% are close in value. This makes the most-visited action less likely to dominate the choice. As tau approaches infinity, all the probabilities become equal and we effectively pick a random action.

As tau decreases, the distance between the probability values gets larger.

For example, 1% and 10% are far apart in value. This makes the most-visited action dominate the choice. As tau approaches 0, the most-visited action is almost always picked.

        if tot > 0:
            prob = prob / tot

        else:
            # if sum is 0 set all probs to the same value
            prob = np.full((self.game.w,), 1 / self.game.w)
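
Here is a small standalone demonstration, using made-up visit counts, of how tau reshapes the final distribution:

import numpy as np

# Made-up visit counts for three children; maxN is the max plus one, as above.
N = np.array([30., 10., 5.])
maxN = N.max() + 1

for tau in (1.0, 0.1, 10.0):
    prob = (N / maxN) ** (1 / tau)
    prob = prob / prob.sum()
    print(tau, prob.round(3))

# tau = 1.0  -> roughly proportional to the visit counts
# tau = 0.1  -> almost all of the mass on the most-visited move
# tau = 10.0 -> close to uniform, so the move is nearly random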

To finally pick the action to be carried out, we create a new list and populate it with the values from prob. This new list is then used as the weights for random.choices(), which returns the child node associated with the selected move.

        weights = [prob[k] for k in self.children.keys()]
        state = random.choices(list(self.children.values()), weights=weights)[0]

For the last part of the next function, we return: 

  • the state variable that holds the child node for the selected move
  • the V value of the current node
  • the prob that we calculated as a TensorFlow float32 variable that will be used for training the neural network later
  • the current game object. 

For V, we return the negative value because V was originally calculated with respect to the previous player. Most of the values returned are for training the neural network which we will go over in the next post.

        return state, -self.V, tf.Variable(prob, dtype=tf.float32), self.game

Concluding Thoughts

Now that we have completed our Monte Carlo Tree Search, our AI can play out a game of Connect Four!
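
To tie the two main functions together, here is a rough sketch of the self-play loop described at the start. It is hedged: ConnectFour stands in for the game class (which isn’t covered in this post), policy is the model from part 3, and the number of explorations per move is arbitrary.

# Hedged sketch of the overall loop; ConnectFour and policy are assumed to exist.
game = ConnectFour()
node = Node(game)

while node.outcome is None:
    # Simulate many games from the current position to build up the tree.
    for _ in range(50):
        node.explore(policy)

    # Actually make a move, then continue the game from the chosen child node.
    node, v, p, old_game = node.next(tau=1.0)
    node.mother = None   # optional: detach the new root so backpropagation stops here

The values v, p, and old_game returned on each move are the pieces we will collect as training data for the network.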

In part 3, I will go over the deep learning aspect of the code.

The deep learning element of the code will allow our AI to play multiple games, learn from those games, and apply that knowledge to future games. Essentially, we’ll be making our AI smarter. We will code both the model and its training loop.