Random Forest Classifier Made Simple

This tutorial introduces you into an exciting machine learning technique: ensemble learning.

Here’s my quick and dirty tip if your prediction accuracy sucks but you need to meet the deadline at all costs: try this “meta-learning” approach that combines the predictions (or classifications) of multiple machine learning algorithms. In many cases, it will give you better last-minute results.

The Basics

You may already have studied multiple machine learning algorithms. However, different algorithms have different strengths. For example, neural network classifiers can generate excellent results for complex problems. However, they are also prone to “overfitting” the data because of their powerful capacity of memorizing fine-grained patterns of the data.

The simple idea of ensemble learning for classification problems tries to leverage the fact that you often don’t know in advance which machine learning technique works best.

How does this work? You create a meta-classifier consisting of multiple types or instances of basic machine learning algorithms. In other words, you train multiple models. To classify a single observation, you ask all models to classify the input independently. Now, you return the class that was returned most often, given your input, as a “meta-prediction”. This is the final output of your ensemble learning algorithm.

Random forests are a special type of ensemble learning algorithms. They focus on decision tree learning. A forest consists of many trees. Similarly, a random forest consists of many decision trees.

Each decision tree is built by injecting randomness in the tree generation procedure during the training phase (e.g. which tree node to select first). This leads to various decision trees – exactly what we want.

Here is how the prediction works for a trained random forest:

In the example, Alice has high maths and language skills.  The “ensemble” consists of three decision trees (building a random forest). To classify Alice, each decision tree is queried about Alice’s classification. Two of the decision trees classify Alice as a computer scientist. As this is the class with most votes, it is returned as final output for the classification.

The Code

Let’s stick to this example of classifying the study field based on a student’s skill level in three different areas (math, language, creativity). You may think that implementing an ensemble learning method is complicated in Python. But it’s not – thanks to the comprehensive scikit-learn library:

## Dependencies
import numpy as np
from sklearn.ensemble import RandomForestClassifier

## Data: student scores in (math, language, creativity) --> study field
X = np.array([[9, 5, 6, "computer science"],
              [5, 1, 5, "computer science"],
              [8, 8, 8, "computer science"],
              [1, 10, 7, "literature"],
              [1, 8, 1, "literature"],
              [5, 7, 9, "art"],
              [1, 1, 6, "art"]])

## One-liner
Forest = RandomForestClassifier(n_estimators=10).fit(X[:,:-1], X[:,-1])

## Result & puzzle
students = Forest.predict([[8, 6, 5],
                         [3, 7, 9],
                         [2, 2, 1]])

Take a guess: what’s the output of this code snippet?

The Results

After initializing the labeled training data, the code creates a random forest using the constructor on the class RandomForestClassifier with one parameter n_estimators that defines the number of trees in the forest.

Next, we populate the model that results from the previous initialization (an empty forest) by calling the function fit(). To this end, the input training data consists of all but the last column of array X, while the labels of the training data are defined in the last column. As in the previous examples, we use slicing to extract the respective columns from the data array X.

The classification part is slightly different in this code snippet. I wanted to show you how to classify multiple observations instead of only one. You can simply achieve this here by creating a multi-dimensional array with one row per observation.

Here is the result of the code puzzle:

## Result & puzzle
students = Forest.predict([[8, 6, 5],
                         [3, 7, 9],
                         [2, 2, 1]])
# ['computer science' 'art' 'art']

Note that the result is still non-deterministic (which means the result may be different for different executions of the code) because the random forest algorithm relies on the random number generator that returns different numbers at different points in time. You can make this call deterministic by using the argument random_state.

Where to go from here?

Random Forests built upon a thorough understanding of Decision Tree Learning. Read my article about decision trees to improve your understanding of this area.

If you feel that you need to refresh your Python skills, download your Python Cheat Sheets (and get regularly new cheat sheets) by subscribing to my email list.

Finally, level up your skills with my new Python learning system based on solving rated Python code puzzles. You do nothing but solving Python puzzles and observe how your Python rating improves.

Test your coding skills by solving Python puzzles now!

Leave a Comment