Python Scikit-Learn Decision Tree [Video + Blog] - Be on the Right Side of Change

Decision Trees are powerful and intuitive tools in your machine learning toolbelt. Decision trees are human-readable – in contrast to most other machine learning techniques. You can easily train a decision tree and show it to your supervisors who do not need to know anything about machine learning in order to understand how your model works. This is especially great for data scientists who often must defend and present their results to management.

In this article, I’ll show you how to use decision trees in a single line of Python code using sklearn.tree.DecisionTreeClassifier().

Concepts and Theory – How Do Decision Trees Work?

You already know decision trees very well from your own experience as a human being. Decision trees represent a structured way of making decisions: each decision opens up new branches. By answering a series of questions, you ultimately land on the recommended outcome.

Basic Example: Choose Your Study Subject using a Decision Tree

Here is an example:

**Figure**: A simple decision tree to answer the question of what to study.

Decision trees are used for classification problems such as “which subject should I study, given my interests?”. You start at the top and repeatedly answer questions by selecting the choices that best describe your features. Finally, you reach a leaf node of the tree. This is the recommended class for your feature values.
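The walk from the root to a leaf can be sketched in a few lines of Python. The questions and subjects below are hypothetical stand-ins chosen for illustration, not a transcription of the figure, and the nested-dictionary layout is just one simple way to represent such a tree.

```python
# Hypothetical hand-written tree: inner nodes are dicts, leaves are strings.
TREE = {
    "question": "likes_math",
    "yes": {
        "question": "likes_computers",
        "yes": "computer science",
        "no": "math",
    },
    "no": {
        "question": "likes_language",
        "yes": "linguistics",
        "no": "art",
    },
}


def classify(node, features):
    """Walk from the root to a leaf by answering one question per node."""
    while isinstance(node, dict):          # inner node: ask its question
        answer = features[node["question"]]
        node = node["yes"] if answer else node["no"]
    return node                            # leaf: the recommended class


student = {"likes_math": False, "likes_computers": True, "likes_language": True}
print(classify(TREE, student))
```

Because the first question is about math, a student who answers "yes" there can only ever reach the subjects in the left subtree.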

Pruning for Performance Tuning

There are many nuances to decision tree learning. For example, in the figure above, the first question carries more weight than the last question. If you like math, the decision tree will never recommend art or linguistics to you. This is useful because some features may be much more important for the classification decision than others. For example, a classification system that predicts your current health may use your sex (feature) to practically rule out many diseases (classes).

Hence, the order of the decision nodes lends itself to performance optimizations: place the features with a high impact on the final classification at the top. Decision tree learning will then aggregate the questions that have little impact on the final classification, as shown in the next graphic:
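One common way to measure the "impact" of a question is the drop in Gini impurity it produces, which is also the default criterion in scikit-learn's DecisionTreeClassifier. The following is a minimal sketch under that assumption. The toy data and the helper names `gini` and `impurity_decrease` are hypothetical and only serve to illustrate the idea.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure group, larger for mixed groups."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(rows, labels, feature):
    """How much a yes/no question on `feature` reduces Gini impurity."""
    yes = [label for row, label in zip(rows, labels) if row[feature]]
    no = [label for row, label in zip(rows, labels) if not row[feature]]
    if not yes or not no:                  # the question splits nothing
        return 0.0
    n = len(labels)
    weighted = len(yes) / n * gini(yes) + len(no) / n * gini(no)
    return gini(labels) - weighted


# Hypothetical toy data: two yes/no interests per student and a subject.
rows = [
    {"likes_math": True, "likes_language": True},
    {"likes_math": True, "likes_language": False},
    {"likes_math": False, "likes_language": True},
    {"likes_math": False, "likes_language": False},
]
labels = ["computer science", "computer science", "art", "art"]

for feature in ("likes_math", "likes_language"):
    print(feature, impurity_decrease(rows, labels, feature))
```

In this toy data the math question separates the two subjects perfectly, while the language question tells you nothing, so a tree learner would place the math question at the top.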

**Figure**: Remove the unnecessary branch that doesn’t have an impact on the ultimate outcome.

Suppose the full decision tree looks like the tree on the left. For any combination of features, there is a separate classification outcome (the tree leaves). However, some features may not give you any additional information with respect to the classification problem (e.g. the first “Language” decision node in the example). Decision tree learning would effectively get rid of these nodes for efficiency reasons. This is called “pruning”.
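The idea of removing a question that does not change the outcome can be sketched on a hand-written tree. This is a simplified illustration with hypothetical questions and subjects, not the pruning procedure that any particular library implements.

```python
def prune(node):
    """Collapse every question whose answers all lead to the same outcome."""
    if not isinstance(node, dict):         # leaf: nothing to prune
        return node
    yes, no = prune(node["yes"]), prune(node["no"])
    if yes == no:                          # the question changes nothing
        return yes
    return {"question": node["question"], "yes": yes, "no": no}


full_tree = {
    "question": "likes_math",
    "yes": {
        "question": "likes_language",      # redundant: both answers agree
        "yes": "computer science",
        "no": "computer science",
    },
    "no": {
        "question": "likes_language",
        "yes": "linguistics",
        "no": "art",
    },
}

print(prune(full_tree))
```

In scikit-learn itself, pruning is controlled through parameters of DecisionTreeClassifier such as `max_depth` or `ccp_alpha` (cost-complexity pruning) rather than through a function like the one above.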

Creating Your Decision Tree with Scikit-Learn

Let’s have a look at a minimal code snippet from my book Python One-Liners that showcases decision tree learning using the machine learning library scikit-learn:

## Dependencies
import numpy as np
from sklearn import tree


## Data: student scores in (math, language, creativity) --> study field
## Note: mixing numbers and strings makes NumPy store every entry as a string.
X = np.array([[9, 5, 6, "computer science"],
              [1, 8, 1, "literature"],
              [5, 7, 9, "art"]])


## One-liner: convert the score columns back to numbers, then fit the tree
Tree = tree.DecisionTreeClassifier().fit(X[:,:-1].astype(float), X[:,-1])

## Result & puzzle
student_0 = Tree.predict([[8, 6, 5]])
print(student_0)

student_1 = Tree.predict([[3, 7, 9]])
print(student_1)

Feel free to guess the output of this code snippet first!

Check out my new Python book Python One-Liners (Amazon Link).

If you like one-liners, you’ll LOVE the book. It’ll teach you everything there is to know about a single line of Python code. But it’s also an introduction to computer science, data science, machine learning, and algorithms. The universe in a single line of Python!

The book was released in 2020 with the world-class programming book publisher No Starch Press (San Francisco).

Publisher Link: https://nostarch.com/pythononeliners

Explanation

The data in the code snippet describes three students with their estimated skill level (a score between 1 and 10) in the three areas math, language, and creativity. We also know the study subjects of these students.

The first student is highly skilled in math and studies computer science.
The second student is much more skilled in language than in the other two areas and studies literature.
The third student scores high in creativity and studies art.

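The one-liner creates a new decision tree object and trains the model using the fit function on the labeled training data (the last column is the label). Internally, it builds a tree of decision nodes, each of which compares one feature (math, language, or creativity) against a threshold. With three training rows in three different classes, the fully grown tree has two decision nodes and three leaves.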
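To see what the fitted tree actually looks like, you can print it as text. The sketch below uses the same toy data but keeps features and labels in separate lists. The variable names and the choice of `random_state=0` are assumptions made for this example so that the run is repeatable.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Same toy data as in the article, with features and labels kept separate.
features = [[9, 5, 6], [1, 8, 1], [5, 7, 9]]
labels = ["computer science", "literature", "art"]

# random_state=0 is an arbitrary choice that makes the run repeatable.
clf = DecisionTreeClassifier(random_state=0).fit(features, labels)

print(clf.tree_.node_count)   # total nodes: decision nodes plus leaves
print(clf.get_n_leaves())     # number of leaves
print(export_text(clf, feature_names=["math", "language", "creativity"]))
```

The text dump shows each decision node as a threshold test on one feature and each leaf as a class. Which feature ends up at the root depends on the seed, because several splits are equally good on this tiny dataset.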

When predicting the class of student_0 (math=8, language=6, creativity=5), the decision tree returns “computer science”. It has learned that this feature pattern (high, medium, medium) is an indicator of the first class. On the other hand, when asked for (3, 7, 9), the decision tree predicts “art” if it has learned that the pattern (low, medium, high) hints at the third class. With only three training rows, a tree that happens to split on the math score first may return “literature” for this student instead.

Note that the algorithm is non-deterministic by default. In other words, executing the same code twice may produce different results. This is common for machine learning algorithms that rely on random number generators. In this case, the order in which the features are considered at each split is randomly permuted, so the final decision tree may test the features in a different order. Passing a fixed random_state to DecisionTreeClassifier() makes the result reproducible.
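The effect of the random feature order can be checked with the `random_state` parameter of DecisionTreeClassifier. The following sketch refits the tree with different seeds. The helper `predict_with_seed` and the seed values are hypothetical choices for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

features = [[9, 5, 6], [1, 8, 1], [5, 7, 9]]
labels = ["computer science", "literature", "art"]
query = [[3, 7, 9]]


def predict_with_seed(seed):
    """Fit a fresh tree with the given random_state and classify `query`."""
    clf = DecisionTreeClassifier(random_state=seed).fit(features, labels)
    return str(clf.predict(query)[0])


# The same seed always rebuilds the same tree, so the answer repeats.
print(predict_with_seed(42) == predict_with_seed(42))

# Different seeds may break ties between equally good splits differently.
seen = {predict_with_seed(seed) for seed in range(20)}
print(seen)
```

If the printed set contains more than one subject, the seeds produced trees that disagree on this student, which is the behavior described in the note above.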

Where to Go from Here?

In this article, you have learned about the main ideas in decision tree learning. Decision trees are powerful and intuitive data structures that are easy to use and to train.

You can train your own decision tree in a single line of code. Even if you are a complete beginner in Python, you can start now and figure out the details later.

If you really want to get proficient in your basic Python code understanding, join my free Python cheat sheet course where I will send you a weekly cheat sheet about various topics in computer science and Python.

Also, I’d appreciate it if you checked out my book Python One-Liners, for which I originally wrote this article. The book is published with the world-class Python book publisher No Starch Press from San Francisco! 🙂