Decision trees are powerful and intuitive tools in your machine learning toolbelt. A big advantage of decision trees is that they are human-readable, in contrast to most other machine learning techniques. You can train a decision tree and show it to supervisors who do not need to know anything about machine learning to understand what your model does. This is especially useful for data scientists, who often must present and defend their results to management. In this article, I’ll show you how to use decision trees in a single line of Python code.
You already know decision trees very well from your own experience. They represent a structured way of making decisions – each decision opening new branches. By answering a bunch of questions, you will finally land on the recommended outcome. Here is an example:
Decision trees are used for classification problems such as “which subject should I study, given my interests?”. You start at the top. Now, you repeatedly answer questions (select the choices that describe your features best). Finally, you reach a leaf node of the tree. This is the recommended class based on your feature selection.
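The walk from the root to a leaf can be written as nested conditionals. The following is a minimal sketch: the function name `recommend_subject`, the three yes/no questions, and the subjects at the leaves are hypothetical and only illustrate the idea.

```python
def recommend_subject(likes_math: bool, likes_language: bool, likes_painting: bool) -> str:
    """Walk a tiny hand-written decision tree from the root to a leaf.

    The questions and subjects are made up for illustration.
    """
    if likes_math:              # root node: the first question asked
        return "computer science"
    if likes_language:          # only reached if the first answer was "no"
        return "linguistics"
    if likes_painting:          # only reached after two "no" answers
        return "art"
    return "history"            # leaf reached when every answer is "no"


print(recommend_subject(likes_math=False, likes_language=True, likes_painting=True))
# prints: linguistics
```

Every call ends in exactly one `return`, which corresponds to reaching exactly one leaf of the tree.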
There are many nuances to decision tree learning. For example, in the above figure, the first question carries more weight than the last question. If you like math, the decision tree will never recommend art or linguistics to you. This is useful because some features may be much more important for the classification decision than others. For example, a classification system that predicts your current health may use your sex (a feature) to practically rule out many diseases (classes).
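One common way to quantify how much a question matters is the reduction in Gini impurity that a split on it achieves (Gini impurity is also scikit-learn's default split criterion). The sketch below assumes a made-up data set of four students with yes/no features; the helper names `gini` and `gini_gain` are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means all labels agree)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_gain(rows, labels, feature):
    """Impurity reduction obtained by splitting on one yes/no feature."""
    yes = [label for row, label in zip(rows, labels) if row[feature]]
    no = [label for row, label in zip(rows, labels) if not row[feature]]
    n = len(labels)
    weighted = len(yes) / n * gini(yes) + len(no) / n * gini(no)
    return gini(labels) - weighted


# Hypothetical toy data: the study field depends only on liking math
rows = [
    {"likes_math": True, "likes_language": True},
    {"likes_math": True, "likes_language": False},
    {"likes_math": False, "likes_language": True},
    {"likes_math": False, "likes_language": False},
]
labels = ["computer science", "computer science", "art", "art"]

gains = {feature: gini_gain(rows, labels, feature) for feature in rows[0]}
best = max(gains, key=gains.get)
print(gains)  # {'likes_math': 0.5, 'likes_language': 0.0}
print(best)   # likes_math
```

In this toy data, asking about math separates the classes completely, while asking about language leaves the mix unchanged, so the math question belongs at the top of the tree.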
Hence, the order of the decision nodes lends itself to performance optimizations: place the features that have a high impact on the final classification at the top. Decision tree learning will then aggregate the questions that do not have a high impact on the final classification, as shown in the next graphic:
Suppose the full decision tree looks like the tree on the left. For any combination of features, there is a separate classification outcome (the tree leaves). However, some features may not give you any additional information with respect to the classification problem (e.g. the first “Language” decision node in the example). Decision tree learning would effectively get rid of these nodes for efficiency reasons. This is called “pruning”.
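As a rough illustration of this idea, the sketch below collapses every question whose two branches lead to the same class. The tuple representation `(question, yes_branch, no_branch)` and the function name `prune` are assumptions made for this example. Real libraries prune with statistical criteria; scikit-learn, for instance, offers cost-complexity pruning through the `ccp_alpha` parameter.

```python
def prune(node):
    """Collapse any question whose two branches end in the same leaf."""
    if isinstance(node, str):           # a leaf is just a class label
        return node
    question, yes_branch, no_branch = node
    yes_branch, no_branch = prune(yes_branch), prune(no_branch)
    if isinstance(yes_branch, str) and yes_branch == no_branch:
        return yes_branch               # the question adds no information
    return (question, yes_branch, no_branch)


# Hypothetical full tree: the first "likes_language" node is redundant
full_tree = (
    "likes_math",
    ("likes_language", "computer science", "computer science"),
    ("likes_language", "linguistics", "art"),
)

pruned_tree = prune(full_tree)
print(pruned_tree)
# ('likes_math', 'computer science', ('likes_language', 'linguistics', 'art'))
```

The pruned tree gives the same answer as the full tree for every input, but it asks one question fewer on the math branch.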
    ## Dependencies
    import numpy as np
    from sklearn import tree

    ## Data: student scores in (math, language, creativity) --> study field
    ## (mixing numbers and strings makes NumPy store every entry as a string)
    X = np.array([[9, 5, 6, "computer science"],
                  [1, 8, 1, "literature"],
                  [5, 7, 9, "art"]])

    ## One-liner: cast the score columns back to numbers, then fit on the labels
    Tree = tree.DecisionTreeClassifier().fit(X[:,:-1].astype(float), X[:,-1])

    ## Result & puzzle
    student_0 = Tree.predict([[8, 6, 5]])
    print(student_0)

    student_1 = Tree.predict([[3, 7, 9]])
    print(student_1)
Guess the output of this code snippet!
The data in the code snippet describes three students with their estimated skill levels (scores from 1 to 10) in three areas: math, language, and creativity. We also know the study subjects of these students. For example, the first student is highly skilled in math and studies computer science. The second student is much more skilled in language than in the other two areas and studies literature. The third student scores high in creativity and studies art.
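The one-liner creates a new decision tree object and trains the model with the fit function on the labeled training data (the last column is the label). Internally, it builds a small tree of decision nodes, each of which compares one feature (math, language, or creativity) against a threshold. With only three training samples in three different classes, two such decision nodes are enough to separate them into three leaves.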
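To check what the fitted tree looks like, scikit-learn can print its structure with `tree.export_text`. The sketch below refits the same three students, with scores and labels kept in separate lists; the variable names and the fixed `random_state=0` are choices made for this example.

```python
from sklearn import tree

# The three students from the article: (math, language, creativity) and study field
scores = [[9, 5, 6], [1, 8, 1], [5, 7, 9]]
fields = ["computer science", "literature", "art"]

clf = tree.DecisionTreeClassifier(random_state=0).fit(scores, fields)

# Text rendering of the learned decision rules
print(tree.export_text(clf, feature_names=["math", "language", "creativity"]))

# Size of the fitted tree
print("nodes:", clf.tree_.node_count, "leaves:", clf.get_n_leaves())
```

Each leaf holds exactly one training student here, so the tree has three leaves and needs only two decision nodes to reach them. Which features those two nodes test depends on the random seed.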
When predicting the class of student_0 (math=8, language=6, creativity=5), the decision tree returns “computer science”. It has learned that this feature pattern (high, medium, medium) is an indicator for the first class. On the other hand, when asked for (3, 7, 9), the decision tree predicts “art” because it has learned that the score pattern (low, medium, high) hints at the third class.
Note that the algorithm is non-deterministic. In other words, executing the same code twice may produce different results. This is common for machine learning algorithms that rely on random number generators. In this case, the order of the features is randomly permuted when the tree searches for a split, so ties between equally good splits can be resolved differently from run to run.
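If you need repeatable results, scikit-learn lets you fix the random generator with the `random_state` parameter. A minimal sketch, reusing the three students from the article; the helper name `train_and_predict` is hypothetical and the seed value 42 is arbitrary.

```python
from sklearn import tree

# The three students from the article and the two students to classify
scores = [[9, 5, 6], [1, 8, 1], [5, 7, 9]]
fields = ["computer science", "literature", "art"]
new_students = [[8, 6, 5], [3, 7, 9]]


def train_and_predict(seed):
    """Fit a tree with a fixed seed and return its predictions as a list."""
    clf = tree.DecisionTreeClassifier(random_state=seed).fit(scores, fields)
    return list(clf.predict(new_students))


first_run = train_and_predict(seed=42)
second_run = train_and_predict(seed=42)
print(first_run == second_run)  # same seed, same tree, same predictions
```

Without `random_state`, two runs may build different trees from the same three students, which is why the puzzle output can vary.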
Where to go from here?
In this article, you have learned about the main ideas in decision tree learning. Decision trees are powerful and intuitive data structures that are easy to use and to train.
You can train your own decision tree in a single line of code. Even if you are a complete beginner in Python, you can start now and figure out the details later.
If you want to improve your basic Python code understanding, join my free Python cheat sheet course, where I will send you a weekly cheat sheet about various topics in computer science and Python.