Support Vector Machines (SVMs) have gained huge popularity in recent years. The reason is their robust classification performance, even in high-dimensional spaces: SVMs work even when there are more dimensions (features) than data points. This is unusual for classification algorithms because of the curse of dimensionality: with increasing dimensionality, the data becomes extremely sparse, which makes it hard for algorithms to find patterns in the data set.
Understanding the basic ideas of SVMs is a fundamental step to becoming a sophisticated machine learning engineer.
SVM Video
Feel free to watch the following video, which briefly summarizes how SVMs work in Python:
SVM Cheat Sheet
Here is a cheat sheet that summarizes the content of this article:
You can get this cheat sheet, along with additional Python cheat sheets, as high-resolution PDFs here:
Let’s get a conceptual understanding of support vector machines first before learning how to use them with sklearn.
Machine Learning Classification Overview
How do classification algorithms work? They use the training data to find a decision boundary that separates the data in one class from the data in the other class.
Here is an example:
Suppose you want to build a recommendation system for aspiring university students. The figure visualizes the training data, consisting of users that are classified according to their skills in two areas: logic and creativity. Some people have high logic skills and relatively low creativity; others have high creativity and relatively low logic skills. The first group is labeled “computer scientists” and the second group is labeled “artists”. (I know that there are also creative computer scientists, but let’s stick with this example for a moment.)
In order to classify new users, the machine learning model must find a decision boundary that separates the computer scientists from the artists. Roughly speaking, you check into which area a new user falls with respect to the decision boundary: left or right? Users that fall into the left area are classified as computer scientists, while users that fall into the right area are classified as artists.
In the two-dimensional space, the decision boundary is either a line or a (higher-order) curve. The former is called a “linear classifier”, the latter is called a “non-linear classifier”. In this section, we will only explore linear classifiers.
The figure shows three decision boundaries that are all valid separators of the data. For a standard classifier, it is impossible to quantify which of the given decision boundaries is better – they all lead to perfect accuracy when classifying the training data.
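To make this concrete, here is a minimal sketch with made-up 2D skill scores (the data and seeds are illustrative, not taken from the figure): a standard linear classifier such as sklearn’s Perceptron stops as soon as it finds any boundary that classifies the training data correctly, so different shuffling seeds can settle on different, equally accurate boundaries:

from sklearn.linear_model import Perceptron
import numpy as np

# Made-up 2D training data: (logic, creativity) skill scores
X = np.array([[8, 2], [9, 3], [7, 1], [2, 8], [3, 9], [1, 7]])
y = np.array(["computer science"] * 3 + ["art"] * 3)

# Different shuffling seeds can lead to different boundary coefficients...
for seed in (0, 1):
    clf = Perceptron(random_state=seed).fit(X, y)
    print(clf.coef_, clf.intercept_, clf.score(X, y))
    # ...yet each boundary classifies the training data perfectly (score 1.0)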
Support Vector Machine Classification Overview
But what is the best decision boundary?
Support vector machines provide a unique and beautiful answer to this question. Arguably, the best decision boundary provides a maximal margin of safety. In other words, SVMs maximize the distance between the closest data points and the decision boundary. The idea is to minimize the error of new points that are close to the decision boundary.
Here is an example:
The SVM classifier finds the support vectors so that the zone between them is as wide as possible. The decision boundary is the line in the middle, with maximal distance to the support vectors. Because the zone between the support vectors and the decision boundary is maximized, the margin of safety is expected to be maximal when classifying new data points. This approach yields high classification accuracy for many practical problems.
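To see the margin in numbers, here is a small sketch, again with made-up 2D data: for a linear SVM with weight vector w, the width of the safety zone between the two margins is 2/||w||, and the training points that touch the margins are exposed via the fitted model’s support_vectors_ attribute:

from sklearn import svm
import numpy as np

# Made-up, linearly separable 2D data
X = np.array([[8, 2], [9, 3], [7, 1], [2, 8], [3, 9], [1, 7]])
y = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard margin (no training errors tolerated)
clf = svm.SVC(kernel="linear", C=1000).fit(X, y)

print(clf.support_vectors_)     # the data points closest to the boundary
w = clf.coef_[0]                # normal vector of the separating line
print(2 / np.linalg.norm(w))    # the margin width that the SVM maximizes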
Scikit-Learn SVM Code
Let’s have a look at how the sklearn library provides a simple means for you to use SVM classification on your own labeled data. I highlighted the relevant sklearn lines in the following code snippet:
## Dependencies
from sklearn import svm
import numpy as np

## Data: student scores in (math, language, creativity) --> study field
X = np.array([[9, 5, 6, "computer science"],
              [10, 1, 2, "computer science"],
              [1, 8, 1, "literature"],
              [4, 9, 3, "literature"],
              [0, 1, 10, "art"],
              [5, 7, 9, "art"]])

## One-liner
svm = svm.SVC().fit(X[:,:-1], X[:,-1])

## Result & puzzle
student_0 = svm.predict([[3, 3, 6]])
print(student_0)

student_1 = svm.predict([[8, 1, 1]])
print(student_1)
Guess: what is the output of this code?
The code shows how to use support vector machines in Python in their most basic form. The NumPy array holds the labeled training data, with one row per user and one column per feature (skill level in math, language, and creativity). The last column is the label (the class).
Because we have three-dimensional data, a linear support vector machine would separate the data using two-dimensional planes rather than one-dimensional lines. (Note that svm.SVC() actually uses the non-linear RBF kernel by default; pass kernel='linear' to get a linear separator.) As you can see, it is also possible to separate three different classes rather than only the two shown in the examples above.
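Two small caveats about the snippet, addressed in the sketch below (variable names are my own choice): mixing numbers and strings in one NumPy array silently turns every entry into a string, and depending on your sklearn version, fit may convert the numeric strings back to floats or reject them outright. Keeping features and labels in separate, properly typed arrays avoids the issue entirely:

from sklearn import svm
import numpy as np

# Features as numbers, labels as strings -- no mixed-type array needed
X = np.array([[9, 5, 6], [10, 1, 2], [1, 8, 1],
              [4, 9, 3], [0, 1, 10], [5, 7, 9]])
y = np.array(["computer science", "computer science", "literature",
              "literature", "art", "art"])

model = svm.SVC().fit(X, y)       # default kernel is the non-linear 'rbf'
print(model.predict([[3, 3, 6]]))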
The one-liner itself is straightforward: you first create the model using the constructor of the svm.SVC class (SVC stands for support vector classification). Then, you call the fit function to perform the training based on your labeled training data.
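One detail worth noting: the chaining in the one-liner works because fit() returns the fitted estimator itself. The one-liner is therefore equivalent to this two-step version (model is a hypothetical name, chosen to avoid shadowing the svm module):

model = svm.SVC()                      # step 1: construct an untrained SVM classifier
model = model.fit(X[:,:-1], X[:,-1])   # step 2: fit() trains the model and returns it
# ...which is why svm.SVC().fit(X[:,:-1], X[:,-1]) works as a single expression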
In the results part of the code snippet, we simply call the predict function on new observations:
- Because student_0 has skills math=3, language=3, and creativity=6, the support vector machine predicts that the label “art” fits this student’s skills.
- Similarly, student_1 has skills math=8, language=1, and creativity=1. Thus, the support vector machine predicts that the label “computer science” fits this student’s skills.
Here is the final output of the one-liner:
## Result & puzzle
student_0 = svm.predict([[3, 3, 6]])
print(student_0)
# ['art']

student_1 = svm.predict([[8, 1, 1]])
print(student_1)
# ['computer science']
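As a small follow-up, predict() accepts any number of rows at once, so both students can be classified in a single call (continuing with the fitted svm object from the snippet above):

students = [[3, 3, 6], [8, 1, 1]]
print(svm.predict(students))
# ['art' 'computer science'] -- one label per row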
Where to Go From Here?
This tutorial provides you with the quickest and most concise way of getting started with support vector machines (SVMs). You won’t find an easier way anywhere on the Internet.
In fact, I wrote this as a chapter draft for my book Python One-Liners, which introduces 10 machine learning algorithms and shows how to use each of them in a single line of Python code.
Here’s more about the book:
Python One-Liners Book: Master the Single Line First!
Python programmers will improve their computer science skills with these useful one-liners.
Python One-Liners will teach you how to read and write “one-liners”: concise statements of useful functionality packed into a single line of code. You’ll learn how to systematically unpack and understand any line of Python code, and write eloquent, powerfully compressed Python like an expert.
The book’s five chapters cover (1) tips and tricks, (2) regular expressions, (3) machine learning, (4) core data science topics, and (5) useful algorithms.
Detailed explanations of one-liners introduce key computer science concepts and boost your coding and analytical skills. You’ll learn about advanced Python features such as list comprehension, slicing, lambda functions, regular expressions, map and reduce functions, and slice assignments.
You’ll also learn how to:
- Leverage data structures to solve real-world problems, like using Boolean indexing to find cities with above-average pollution
- Use NumPy basics such as array, shape, axis, type, broadcasting, advanced indexing, slicing, sorting, searching, aggregating, and statistics
- Calculate basic statistics of multidimensional data arrays and apply the K-Means algorithm for unsupervised learning
- Create more advanced regular expressions using grouping and named groups, negative lookaheads, escaped characters, whitespaces, character sets (and negative characters sets), and greedy/nongreedy operators
- Understand a wide range of computer science topics, including anagrams, palindromes, supersets, permutations, factorials, prime numbers, Fibonacci numbers, obfuscation, searching, and algorithmic sorting
By the end of the book, you’ll know how to write Python at its most refined, and create concise, beautiful pieces of “Python art” in merely a single line.