Scikit-learn Library Archives - Be on the Right Side of Change

How to Develop LARS Regression Models in Python?

Gábor Madarász — Wed, 06 Oct 2021 16:27:58 +0000

What is LARS regression?

Regression is the analysis of how one variable (the outcome variable) depends on the values of other variables (the explanatory variables).

In regression, we look for a function that predicts the value of one variable, Y, from the known value of another variable, X.

In general, regression calculations are based on the assumption that a causal and statistical relationship can be assumed or deduced between certain variables. To describe the causal relationship, we look for a functional relationship between the variables, i.e., we consider the effect as the dependent variable and the influencing variables as independent variables.

Linear regression is a parametric regression model that assumes a linear relationship between the explanatory (X) and the explained (Y) variables (in terms of parameters). This means that in estimating linear regression, we try to fit a line to the point cloud of the sample data.

Fig 1: Linear regression of random numbers with noise
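To make the idea of fitting a line to a point cloud concrete, here is a minimal sketch. The data is synthetic (an assumption for illustration, not the data behind the figure): points scattered around the line y = 2x + 1.

```python
import numpy as np

# Synthetic data (assumed for illustration): y = 2x + 1 plus Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

# Least-squares fit of a degree-1 polynomial, i.e. a straight line
slope, intercept = np.polyfit(x, y, deg=1)
print("slope: %.2f, intercept: %.2f" % (slope, intercept))
```

The estimated slope and intercept should land close to the values used to generate the data, but not exactly on them, because of the noise.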

Least angle regression (LARS), introduced in 2004 by Efron, Hastie, Johnstone, and Tibshirani, is a relatively new technique that is a variant of forward regression.

It starts with all coefficients at zero and finds the predictor (x1) that correlates best with the response.

We move towards this predictor until another predictor (x2) shows the same degree of correlation with the current residual. LARS then moves in a direction at an equal angle between the two predictors until a third variable (x3) again shows the same degree of correlation with the current residual. LARS then moves at equal angles between x1, x2, and x3 (i.e., in the direction of the least angle) until the next variable enters, and so on.

Fig 2: The original figure from Stanford University illustrating how LARS works.
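The order in which predictors enter the model can be inspected with scikit-learn's lars_path function. The following is a rough sketch on synthetic data; the data and the coefficients 1.0 and 3.0 are assumptions chosen so that the second column correlates best with the response.

```python
import numpy as np
from sklearn.linear_model import lars_path

# Synthetic data (assumed for illustration): column 1 has the strongest
# effect on y, column 0 a weaker one, and column 2 has no effect at all
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# method="lar" runs plain least angle regression (no lasso modification)
alphas, active, coefs = lars_path(X, y, method="lar")

print("order in which predictors enter:", list(active))
print("coefficient path shape (features x steps):", coefs.shape)
```

Each column of coefs holds the coefficients at one step of the path. The first column is all zeros, which matches the starting point described above.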

What Is LARS For?

This technique is used for forecasting, modeling time series, and establishing cause and effect relationships between variables. There are several advantages to using regression analysis.

It indicates significant relationships between the dependent variable and the independent variable.

It shows the strength of the effect of several independent variables on the dependent variable.

Regression analysis also allows you to compare the effects of variables measured at different scales. These advantages help data scientists evaluate candidate variables and select the best set to use in building predictive models.

The advantages of LARS are:

As fast as stepwise regression.
Generates a complete piecewise linear solution path, useful for cross-validation or similar model fitting experiments.
If two variables are related to the dependent variable to nearly the same extent, their coefficients should increase at about the same rate. So the algorithm is more stable.

LARS in Python

To fit a LARS model in Python, we use the “sklearn.linear_model.Lars” class of the scikit-learn library.

from pandas import read_excel
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
import matplotlib.pyplot as plt

# import data
dataframe = read_excel('AirQualityUCI.xls')
# clean dataframe
dataframe = dataframe[(dataframe > 0).all(axis=1)]
data = dataframe.values

# select relevant data
x, y = data[:, 11:12], data[:, 10:11]
# print(x.shape, y.shape, type(x), type(y))
# split the arrays into random train and test subsets
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.1)

# model fitting
lars = linear_model.Lars().fit(xtrain, ytrain)

# predict
ypred = lars.predict(xtest)

# measure errors
print(lars.coef_)
mse = mean_squared_error(ytest, ypred)
print("MSE: %.2f" % mse)
mae = mean_absolute_error(ytest, ypred)
print("MAE: %.2f" % mae)


# plot original vs predicted data
x_ax = range(len(ytest))
plt.scatter(x_ax, ytest, s=5, color="green", label="original")
plt.plot(x_ax, ypred, lw=0.8, color="red", label="predicted")
plt.legend()
plt.show()

In the above code, we first import the required libraries:

From pandas, read_excel, because the (example) data is in an Excel file.

From scikit-learn, train_test_split (for randomly splitting the data), linear_model for LARS regression, and mean_squared_error and mean_absolute_error for model evaluation. Finally, matplotlib.pyplot for visualizing the results.

Then import the data into a pandas DataFrame. This can be a local file or a valid URL.

Our DataFrame (which is real-world air quality data) has some negative values because when the sensors are not working, the value is -200. We remove this data and create a numpy array from the rest. In our example, we are looking at the relationship between relative humidity and temperature.

The corresponding columns (10, 11) are sliced from the data file and separated into training and test data. This is necessary to judge how good our model is. Split the data set into two data sets:

A “training” data set, which we will use to train our model, and a “test” data set, which we will use to judge the accuracy of the model.

Fit the model to the training data with “lars = linear_model.Lars().fit(xtrain, ytrain)”, using the default settings, and then perform the estimation on the test data with the “predict” method of the class.

We then examine how far the model’s prediction differs from the real data.

MSE stands for mean squared error. It is the average of the squared differences between the actual values and the values predicted by the model. Squaring gives larger errors more weight and eliminates negative signs.

MAE stands for mean absolute error, which is the average of the absolute differences between the measured values and the predicted ones.
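Both metrics are easy to verify by hand. The sketch below uses four made-up values (an assumption for illustration, not the air quality data) and compares the manual calculation with the scikit-learn functions used in the code above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values for illustration
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_pred                 # [1, 0, -2, -1]
mse_manual = np.mean(errors ** 2)        # (1 + 0 + 4 + 1) / 4 = 1.5
mae_manual = np.mean(np.abs(errors))     # (1 + 0 + 2 + 1) / 4 = 1.0

print(mse_manual, mean_squared_error(y_true, y_pred))
print(mae_manual, mean_absolute_error(y_true, y_pred))
```

The single error of size 2 contributes 4 to the MSE sum but only 2 to the MAE sum, which is why MSE reacts more strongly to large deviations.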

In the last part, we create a graph using matplotlib.

Please note that your graph (and your results) may differ because train_test_split assigns the rows to the training and test sets at random.

Fig 3: Original vs LARS predicted values
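If you want the same split on every run, train_test_split accepts a random_state argument. A small sketch with placeholder arrays (assumed for illustration; the value 42 is arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data for illustration
x = np.arange(20).reshape(-1, 1)
y = np.arange(20)

# The same random_state yields the same split on every call
split_a = train_test_split(x, y, test_size=0.1, random_state=42)
split_b = train_test_split(x, y, test_size=0.1, random_state=42)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("identical splits:", same)
```

With a fixed seed, the error values and the plot stay the same between runs, which makes results easier to compare while you experiment.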

Summary

As we have seen, regression is a very important tool in data science, and a relatively new version of it, LARS regression, can be successfully applied in many situations. For this purpose, the Python scikit-learn library provides a convenient solution.

The post How to Develop LARS Regression Models in Python? appeared first on Be on the Right Side of Change.

How to Install Scikit-Learn on PyCharm?

Chris — Tue, 14 Sep 2021 15:03:10 +0000

Scikit-Learn, often abbreviated as sklearn, is a popular machine learning library for Python.

Problem Formulation: Given a PyCharm project, how do you install the Scikit-Learn library in your project, either within a virtual environment or globally?

Here’s a solution that works in a typical PyCharm setup:

Open File > Settings > Project from the PyCharm menu.
Select your current project.
Click the Python Interpreter tab within your project tab.
Click the small + symbol to add a new library to the project.
Now type in the library to be installed, in this case "scikit-learn" without quotes, and click Install Package.
Wait for the installation to terminate and close all popup windows.

When searching, select only the package named “scikit-learn”. Many other packages contain the same terms in their names but are not required (false positives).

Alternatively, you can run pip in your PyCharm “Terminal” view:

$ pip install scikit-learn

The package name on PyPI is scikit-learn, while the name you import in Python code is sklearn. The separate sklearn package on PyPI is only a deprecated placeholder for scikit-learn, so install scikit-learn directly.

You can check your installation using the following two lines of Python code that print out the version of the package:

import sklearn
print(sklearn.__version__)

Logistic Regression in Python Scikit-Learn

Chris — Sat, 17 Jul 2021 12:22:00 +0000

Logistic regression is a popular algorithm for classification problems (despite its name indicating that it is a “regression” algorithm). It is one of the most important algorithms in the machine learning space.

Linear Regression Background

Let’s review linear regression. Given the training data, we compute a line that fits this training data so that the summed squared distance between the line and the training data is minimal.

This line can be used for many things, e.g. to predict the outcome for unseen input data x. In general, linear regression is great for predicting a continuous output value y, given a continuous input value x. A continuous value can take an infinite number of values. For example, we could predict the stock price (output y), given the number of social media posts mentioning the company behind the stock (input x). The stock price is continuous as it can take on any value, such as $123.45, $121.897, or $10,198.87.

Logistic Regression and Sigmoid Function

But what if the output is not continuous but categorical? For example, let’s say you want to predict the likelihood of lung cancer, given the number of cigarettes a patient smokes. Each patient can either have lung cancer or not. In contrast to the previous example, there are only these two possible outcomes.

Predicting the likelihood of categorical outcomes is the main motivation for logistic regression.

While linear regression fits a line into the training data, logistic regression fits an S-shaped curve, called “the sigmoid function”. Why? Because the line helps you generate a new output value for each input. On the other hand, the S-shaped curve helps you make binary decisions (e.g. yes/no). For most input values, the sigmoid function will either return a value that is very close to 0 or very close to 1. It is relatively unlikely that your given input value generates a value that is somewhere in-between.

Here is a graphical example of such a scenario:

Sigmoid Function Example

The sigmoid function approximates the probability that a patient has lung cancer, given the number of cigarettes they smoke. This probability helps you to make a robust decision on the subject: Does the patient have lung cancer?
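The S-shape is easy to see numerically. The sketch below uses the standard logistic function 1 / (1 + e^(-z)); the input values are arbitrary and chosen only for illustration.

```python
import math

def sigmoid(z):
    """Standard logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Far from zero the output saturates near 0 or 1; at zero it is exactly 0.5
for z in (-6, -2, 0, 2, 6):
    print(z, round(sigmoid(z), 4))
```

In logistic regression, z is a linear function of the input (for example w * x + b), so fitting the model means choosing w and b to shift and stretch this curve.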

Have a look at the following example:

There are two new patients (in yellow). Let’s pretend we know nothing about them but the number of cigarettes they smoke. We have already trained our logistic regression model (the sigmoid function) that returns a probability value for any new input value x. Now, we can use the respective probabilities of our two inputs to make a prediction about whether the new patients have lung cancer or not.

If the probability given by the sigmoid function is higher than 50%, the model predicts “lung cancer positive”, otherwise, it predicts “lung cancer negative”.

So how to select the correct sigmoid function that best fits the training data?

This is the main question for logistic regression. The answer is maximum likelihood. In other words, which sigmoid function would generate the observed training data with the highest probability?

To calculate the likelihood for a given set of training data, you simply calculate the likelihood for a single training data point and repeat this procedure for all training data points. Finally, you multiply those values to get the likelihood for the whole set of training data.

Now, you repeat this likelihood computation for different sigmoid functions (shifting the sigmoid function a little bit each time). Out of all computations, you take the sigmoid function with “maximum likelihood”, that is, the one that would produce the training data with maximal probability.
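The procedure can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how scikit-learn fits the model (it uses a numerical optimizer with regularization). The four records are the ones from the scikit-learn example in this article, with labels encoded as 0 and 1; the grid of candidate parameters w and b is a hypothetical choice.

```python
import math

# (cigarettes, has lung cancer as 0/1): the four records used in this article
data = [(0, 0), (10, 0), (60, 1), (90, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def likelihood(w, b):
    """Probability of observing all labels under the curve sigmoid(w*x + b)."""
    total = 1.0
    for x, label in data:
        p = sigmoid(w * x + b)                     # P(cancer | x)
        total *= p if label == 1 else (1.0 - p)    # multiply per-point likelihoods
    return total

# Hypothetical grid of candidate sigmoid functions
candidates = [(w, b) for w in (0.01, 0.05, 0.1, 0.2)
              for b in (-1.0, -3.0, -5.0, -7.0)]

best = max(candidates, key=lambda wb: likelihood(*wb))
print("best (w, b):", best, "likelihood:", round(likelihood(*best), 4))
```

Among these candidates, the winner is the curve whose midpoint falls between the “No” and “Yes” patients and which rises steeply enough to assign high probability to every observed label.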

Logistic Regression with sklearn.linear_model

Let’s program your first virtual doc app using logistic regression – in a single line of Python code!

from sklearn.linear_model import LogisticRegression
import numpy as np


## Data (#cigarettes, cancer)
X = np.array([[0, "No"],
              [10, "No"],
              [60, "Yes"],
              [90, "Yes"]])


## One-liner
model = LogisticRegression().fit(X[:,0].reshape(-1,1), X[:,1])


## Result & puzzle
print(model.predict([[2],[12],[13],[40],[90]]))

Exercise: What is the output of this code snippet? Take a guess!

The labeled training data set X consists of four patient records (lines) with two features (columns). The first column holds the number of cigarettes the patients smoke, and the second column holds whether they ultimately suffered from lung cancer. Hence, there is a continuous input variable and a categorical output variable. It’s a classification problem!

We build the model calling the LogisticRegression() constructor with no parameters. On this model, we call the fit function which takes two arguments: the input values and the output classifications (labels). The input values are expected to come as a two-dimensional array where each row holds the feature values.

In our case, we only have a single feature value so we transform our input into a column vector using the reshape() operation that generates a two-dimensional NumPy array. The first argument specifies the number of rows, the second specifies the number of columns. We only care about the number of columns which is one. NumPy determines the number of rows automatically when using the “dummy” parameter -1.

Here is what the input training data (without labels) looks like after converting it using the reshape operation:

[[0],
 [10],
 [60],
 [90]]

Next, we predict whether a patient has lung cancer, given the number of cigarettes they smoke: 2, 12, 13, 40, 90 cigarettes.

Here is the output:

## Result & puzzle
print(model.predict([[2],[12],[13],[40],[90]]))
# ['No' 'No' 'Yes' 'Yes' 'Yes']

The model predicts that the first two patients are lung cancer negative, while the latter three are lung cancer positive.

Let’s explore in detail the probabilities of the sigmoid function that lead to this prediction! Simply run the following code snippet after the above definition:

for i in range(20):
    print("x=" + str(i) + " --> " + str(model.predict_proba([[i]])))

    
'''
x=0 --> [[0.67240789 0.32759211]]
x=1 --> [[0.65961501 0.34038499]]
x=2 --> [[0.64658514 0.35341486]]
x=3 --> [[0.63333374 0.36666626]]
x=4 --> [[0.61987758 0.38012242]]
x=5 --> [[0.60623463 0.39376537]]
x=6 --> [[0.59242397 0.40757603]]
x=7 --> [[0.57846573 0.42153427]]
x=8 --> [[0.56438097 0.43561903]]
x=9 --> [[0.55019154 0.44980846]]
x=10 --> [[0.53591997 0.46408003]]
x=11 --> [[0.52158933 0.47841067]]
x=12 --> [[0.50722306 0.49277694]]
x=13 --> [[0.49284485 0.50715515]]
x=14 --> [[0.47847846 0.52152154]]
x=15 --> [[0.46414759 0.53585241]]
x=16 --> [[0.44987569 0.55012431]]
x=17 --> [[0.43568582 0.56431418]]
x=18 --> [[0.42160051 0.57839949]]
x=19 --> [[0.40764163 0.59235837]]
'''

For each value of x (the number of cigarettes), the code prints two probabilities: first the probability of “lung cancer negative” (label “No”), then the probability of “lung cancer positive” (label “Yes”). If the first probability is higher than the second, the predicted outcome is “lung cancer negative”. This happens for the last time at x=12. For more than 12 cigarettes, the algorithm classifies a patient as “lung cancer positive”.
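The column order of predict_proba follows the model's classes_ attribute. The sketch below rebuilds the model from the same four records, with the features stored as numbers and the labels in a separate array (a restructuring assumed here for clarity), and reproduces predict by picking the class with the higher probability.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same four records as above, split into numeric features and string labels
X = np.array([[0], [10], [60], [90]], dtype=float)
y = np.array(["No", "No", "Yes", "Yes"])
model = LogisticRegression().fit(X, y)

# classes_ gives the column order of predict_proba
print(model.classes_)

new_x = np.array([[2], [12], [13], [40], [90]], dtype=float)
proba = model.predict_proba(new_x)

# Pick the class with the higher probability for each row
manual = model.classes_[np.argmax(proba, axis=1)]
print(manual)
print(model.predict(new_x))
```

For a two-class problem, picking the larger of the two probabilities is the same as applying the 50% threshold described earlier.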

LogisticRegression Methods

In the previous example, you’ve created a LogisticRegression object using the following constructor:

sklearn.linear_model.LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)

In most cases, you don’t need to define all arguments or know them by heart. Just start from the most basic example usage and customize as needed. The LogisticRegression class also has many helper methods. You can check them out here (source):

Name	Description
`decision_function(X)`	Predict confidence scores for samples.
`densify()`	Convert coefficient matrix to dense array format.
`fit(X, y[, sample_weight])`	Fit the model according to the given training data.
`get_params([deep])`	Get parameters for this estimator.
`predict(X)`	Predict class labels for samples in `X`.
`predict_log_proba(X)`	Predict logarithm of probability estimates.
`predict_proba(X)`	Probability estimates.
`score(X, y[, sample_weight])`	Return the mean accuracy on the given test data and labels.
`set_params(**params)`	Set the parameters of this estimator.
`sparsify()`	Convert coefficient matrix to sparse format.

Conclusion

Logistic regression is a classification algorithm (despite its name). This article shows you everything you need to know to start with logistic regression now. It provides you an easy way to implement logistic regression in a single line of Python code using the scikit-learn library.

If you feel stuck in Python and you need to enter the next level in Python coding, feel free to enter my 100% free Python email course with lots of cheat sheets, Python lessons, code contests, and fun!

This tutorial is loosely based on my Python One-Liners book chapter. Check it out!

Python One-Liners Book: Master the Single Line First!

Python programmers will improve their computer science skills with these useful one-liners.

Python One-Liners will teach you how to read and write “one-liners”: concise statements of useful functionality packed into a single line of code. You’ll learn how to systematically unpack and understand any line of Python code, and write eloquent, powerfully compressed Python like an expert.

The book’s five chapters cover (1) tips and tricks, (2) regular expressions, (3) machine learning, (4) core data science topics, and (5) useful algorithms.

Detailed explanations of one-liners introduce key computer science concepts and boost your coding and analytical skills. You’ll learn about advanced Python features such as list comprehension, slicing, lambda functions, regular expressions, map and reduce functions, and slice assignments.

You’ll also learn how to:

Leverage data structures to solve real-world problems, like using Boolean indexing to find cities with above-average pollution
Use NumPy basics such as array, shape, axis, type, broadcasting, advanced indexing, slicing, sorting, searching, aggregating, and statistics
Calculate basic statistics of multidimensional data arrays and the K-Means algorithms for unsupervised learning
Create more advanced regular expressions using grouping and named groups, negative lookaheads, escaped characters, whitespaces, character sets (and negative characters sets), and greedy/nongreedy operators
Understand a wide range of computer science topics, including anagrams, palindromes, supersets, permutations, factorials, prime numbers, Fibonacci numbers, obfuscation, searching, and algorithmic sorting

By the end of the book, you’ll know how to write Python at its most refined, and create concise, beautiful pieces of “Python art” in merely a single line.

Get your Python One-Liners on Amazon!!

Random Forest Classifier with sklearn

Chris — Tue, 13 Jul 2021 14:30:00 +0000

Does your model’s prediction accuracy suck but you need to meet the deadline at all costs?

Try the quick and dirty “meta-learning” approach called ensemble learning. In this article, you’ll learn about a specific ensemble learning technique called random forests that combines the predictions (or classifications) of multiple machine learning algorithms. In many cases, it will give you better last-minute results.

Ensemble Learning

You may already have studied multiple machine learning algorithms—and realized that different algorithms have different strengths.

For example, neural network classifiers can generate excellent results for complex problems. However, they are also prone to “overfitting” the data because of their powerful capacity of memorizing fine-grained patterns of the data.

The simple idea of ensemble learning for classification problems leverages the fact that you often don’t know in advance which machine learning technique works best.

How does ensemble learning work? You create a meta-classifier consisting of multiple types or instances of basic machine learning algorithms. In other words, you train multiple models. To classify a single observation, you ask all models to classify the input independently. Now, you return the class that was returned most often, given your input, as a “meta-prediction”. This is the final output of your ensemble learning algorithm.
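The voting step itself is tiny. Here is a minimal sketch in which the individual model outputs are hard-coded, hypothetical values; in practice they would come from calling predict on each trained model.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the class label that occurs most often among the predictions."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs of three independently trained models for one input
votes = ["computer science", "art", "computer science"]
print(majority_vote(votes))
```

Two of the three models agree, so their label becomes the meta-prediction.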

Random Forest Learning

Random forests are a special type of ensemble learning algorithm. They focus on decision tree learning. A forest consists of many trees. Similarly, a random forest consists of many decision trees.

Each decision tree is built by injecting randomness in the tree generation procedure during the training phase (e.g. which tree node to select first). This leads to various decision trees – exactly what we want.

Here is how the prediction works for a trained random forest:

In the example, Alice has high maths and language skills. The “ensemble” consists of three decision trees (building a random forest). To classify Alice, each decision tree is queried about Alice’s classification. Two of the decision trees classify Alice as a computer scientist. As this is the class with most votes, it is returned as final output for the classification.

sklearn.ensemble.RandomForestClassifier

Let’s stick to this example of classifying the study field based on a student’s skill level in three different areas (math, language, creativity). You may think that implementing an ensemble learning method is complicated in Python. But it’s not – thanks to the comprehensive scikit-learn library:

## Dependencies
import numpy as np
from sklearn.ensemble import RandomForestClassifier


## Data: student scores in (math, language, creativity) --> study field
X = np.array([[9, 5, 6, "computer science"],
              [5, 1, 5, "computer science"],
              [8, 8, 8, "computer science"],
              [1, 10, 7, "literature"],
              [1, 8, 1, "literature"],
              [5, 7, 9, "art"],
              [1, 1, 6, "art"]])


## One-liner
Forest = RandomForestClassifier(n_estimators=10).fit(X[:,:-1], X[:,-1])

## Result & puzzle
students = Forest.predict([[8, 6, 5],
                         [3, 7, 9],
                         [2, 2, 1]])
print(students)

Take a guess: what’s the output of this code snippet?

After initializing the labeled training data, the code creates a random forest using the constructor on the class RandomForestClassifier with one parameter n_estimators that defines the number of trees in the forest.

Next, we populate the model that results from the previous initialization (an empty forest) by calling the function fit(). To this end, the input training data consists of all but the last column of array X, while the labels of the training data are defined in the last column. As in the previous examples, we use slicing to extract the respective columns from the data array X.

Related Tutorial: Introduction to Python Slicing

The classification part is slightly different in this code snippet. I wanted to show you how to classify multiple observations instead of only one. You can simply achieve this here by creating a multi-dimensional array with one row per observation.

Here is the output of the code:

## Result & puzzle
students = Forest.predict([[8, 6, 5],
                         [3, 7, 9],
                         [2, 2, 1]])
print(students)
# ['computer science' 'art' 'art']

Note that the result is still non-deterministic (which means the result may be different for different executions of the code) because the random forest algorithm relies on the random number generator that returns different numbers at different points in time. You can make this call deterministic by using the argument random_state.
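A short sketch of the random_state argument in action. The training data is the same as above, but stored as a numeric feature array plus a separate label array (a restructuring assumed here for clarity); the seed value 0 is arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Same student data as above, split into numeric features and labels
features = np.array([[9, 5, 6], [5, 1, 5], [8, 8, 8],
                     [1, 10, 7], [1, 8, 1],
                     [5, 7, 9], [1, 1, 6]])
labels = np.array(["computer science"] * 3 + ["literature"] * 2 + ["art"] * 2)
new_students = np.array([[8, 6, 5], [3, 7, 9], [2, 2, 1]])

def train_and_predict(seed):
    # A fixed random_state makes the tree construction repeatable
    forest = RandomForestClassifier(n_estimators=10, random_state=seed)
    return forest.fit(features, labels).predict(new_students)

first = train_and_predict(0)
second = train_and_predict(0)
print(np.array_equal(first, second))
```

Two forests built with the same seed on the same data give the same predictions. A different seed may or may not change the outcome on such a small data set.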

RandomForestClassifier Methods

The RandomForestClassifier object has the following methods (source):

`apply(X)`	Apply trees in the forest to `X` and return leaf indices.
`decision_path(X)`	Return the decision path in the forest.
`fit(X, y[, sample_weight])`	Build a forest of trees from the training set `(X, y)`.
`get_params([deep])`	Get parameters for this estimator.
`predict(X)`	Predict class for `X`.
`predict_log_proba(X)`	Predict class log-probabilities for `X`.
`predict_proba(X)`	Predict class probabilities for `X`.
`score(X, y[, sample_weight])`	Return the mean accuracy on the given test data and labels.
`set_params(**params)`	Set the parameters of this estimator.

To learn about the different arguments of the RandomForestClassifier() constructor, feel free to visit the official documentation. However, the default arguments are often enough to create powerful classification meta-models.

Where to Go From Here?

Random forests build upon decision tree learning. Read my article about decision trees to improve your understanding of this area.

If you feel that you need to refresh your Python skills, download your Python Cheat Sheets (and regularly get new cheat sheets) by subscribing to my email list.

You can level up your skills with our new Python learning system based on solving rated Python code puzzles. You do nothing but solve Python puzzles and watch how your Python rating improves.

Test your coding skills by solving Python puzzles now!

This article is based on my book “Python One-Liners”. Feel free to check out the additional material to help you master the single line like nobody else!

SVM sklearn: Python Support Vector Machines Made Simple

Chris — Sun, 11 Jul 2021 10:33:00 +0000

Support Vector Machines (SVM) have gained huge popularity in recent years. The reason is their robust classification performance – even in high-dimensional spaces: SVMs even work if there are more dimensions (features) than data items. This is unusual for classification algorithms because of the curse of dimensionality – with increasing dimensionality, data becomes extremely sparse which makes it hard for algorithms to find patterns in the data set.

Understanding the basic ideas of SVMs is a fundamental step to becoming a sophisticated machine learning engineer.

Let’s get a conceptual understanding of support vector machines first before learning how to use them with sklearn.

Machine Learning Classification Overview

How do classification algorithms work? They use the training data to find a decision boundary that divides data in one class from data in the other class.

Here is an example:

Suppose you want to build a recommendation system for aspiring university students. The figure visualizes the training data consisting of users that are classified according to their skills in two areas: logic and creativity. Some people have high logic skills and relatively low creativity, others have high creativity and relatively low logic skills. The first group is labeled as “computer scientists” and the second group is labeled as “artists”. (I know that there are also creative computer scientists, but let’s stick with this example for a moment.)

In order to classify new users, the machine learning model must find a decision boundary that separates the computer scientists from the artists. Roughly speaking, you will check for a new user in which area they fall with respect to the decision boundary: left or right? Users that fall into the left area are classified as computer scientists, while users that fall into the right area are classified as artists.

In the two-dimensional space, the decision boundary is either a line or a (higher-order) curve. The former is called a “linear classifier”, the latter is called a “non-linear classifier”. In this section, we will only explore linear classifiers.

The figure shows three decision boundaries that are all valid separators of the data. For a standard classifier, it is impossible to quantify which of the given decision boundaries is better – they all lead to perfect accuracy when classifying the training data.

Support Vector Machine Classification Overview

But what is the best decision boundary?

Support vector machines provide a unique and beautiful answer to this question. Arguably, the best decision boundary provides a maximal margin of safety. In other words, SVMs maximize the distance between the closest data points and the decision boundary. The idea is to minimize the error of new points that are close to the decision boundary.

Here is an example:

The SVM classifier finds the respective support vectors so that the zone between the different support vectors is as thick as possible. The decision boundary is the line in the middle with maximal distance to the support vectors. Because the zone between the support vectors and the decision boundary is maximized, the margin of safety is expected to be maximal when classifying new data points. This idea shows high classification accuracy for many practical problems.
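The margin can be computed directly for a linear SVM. The sketch below uses four made-up 2D points (an assumption for illustration) that are separated by the vertical line x = 1, and a very large C so that the fit approximates the hard-margin case described above. For a linear SVM with weight vector w, the width of the zone between the two classes is 2 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable points: class 0 at x=0, class 1 at x=2
X = np.array([[0, 0], [0, 1], [2, 0], [2, 1]], dtype=float)
y = np.array([0, 0, 1, 1])

# A large C leaves almost no room for margin violations (near hard margin)
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                          # normal vector of the decision boundary
margin_width = 2.0 / np.linalg.norm(w)    # distance between the two margin lines

print("support vectors:\n", clf.support_vectors_)
print("margin width:", round(margin_width, 3))
```

The classes are two units apart along the x-axis, so the widest possible zone between them has width 2, with the decision boundary in the middle at x = 1.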

Scikit-Learn SVM Code

Let’s have a look at how the sklearn library provides a simple means for you to use SVM classification on your own labeled data. The sklearn-specific lines in the following code snippet are the import of svm and the one-liner that creates and fits the model:

## Dependencies
from sklearn import svm
import numpy as np


## Data: student scores in (math, language, creativity) --> study field
X = np.array([[9, 5, 6, "computer science"],
              [10, 1, 2, "computer science"],
              [1, 8, 1, "literature"],
              [4, 9, 3, "literature"],
              [0, 1, 10, "art"],
              [5, 7, 9, "art"]])


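## One-liner
## (the mixed array stores everything as strings, so cast the feature columns to float)
svm = svm.SVC().fit(X[:,:-1].astype(float), X[:,-1])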


## Result & puzzle
student_0 = svm.predict([[3, 3, 6]])
print(student_0)

student_1 = svm.predict([[8, 1, 1]])
print(student_1)

Guess: what is the output of this code?

The code breaks down how you can use support vector machines in Python in its most basic form. The NumPy array holds the labeled training data with one row per user and one column per feature (skill level in maths, language, and creativity). The last column is the label (the class).

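Because we have three-dimensional data, a linear support vector machine separates the data using two-dimensional planes (the linear separator) rather than one-dimensional lines. Note that svm.SVC uses a non-linear RBF kernel by default, so the boundaries in this snippet can be curved; pass kernel="linear" if you want a strictly linear separator. As you can see, it is also possible to separate three different classes rather than only two as shown in the examples above.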

The one-liner itself is straightforward: you first create the model using the constructor of the svm.SVC class (SVC stands for support vector classification). Then, you call the fit function to perform the training based on your labeled training data.

In the results part of the code snippet, we simply call the predict function on new observations:

Because student_0 has skills maths=3, language=3, and creativity=6, the support vector machine predicts that the label “art” fits this student’s skills.
Similarly, student_1 has skills maths=8, language=1, and creativity=1. Thus, the support vector machine predicts that the label “computer science” fits this student’s skills.

Here is the final output of the one-liner:

## Result & puzzle
student_0 = svm.predict([[3, 3, 6]])
print(student_0)
# ['art']

student_1 = svm.predict([[8, 1, 1]])
print(student_1)
# ['computer science']

Where to Go From Here?

This tutorial provides you the quickest and most concise way of starting out with support vector machines (SVMs). You won’t find any easier way on the whole Internet.

In fact, I wrote this as a chapter draft for my book Python One-Liners that also introduces 10 machine learning algorithms, and how to use them in a single line of Python code.

Here’s more about the book:

Python One-Liners Book: Master the Single Line First!

Python programmers will improve their computer science skills with these useful one-liners.

The book’s five chapters cover (1) tips and tricks, (2) regular expressions, (3) machine learning, (4) core data science topics, and (5) useful algorithms.

You’ll also learn how to:

Leverage data structures to solve real-world problems, like using Boolean indexing to find cities with above-average pollution
Use NumPy basics such as array, shape, axis, type, broadcasting, advanced indexing, slicing, sorting, searching, aggregating, and statistics
Calculate basic statistics of multidimensional data arrays and the K-Means algorithms for unsupervised learning
Create more advanced regular expressions using grouping and named groups, negative lookaheads, escaped characters, whitespaces, character sets (and negative characters sets), and greedy/nongreedy operators
Understand a wide range of computer science topics, including anagrams, palindromes, supersets, permutations, factorials, prime numbers, Fibonacci numbers, obfuscation, searching, and algorithmic sorting

By the end of the book, you’ll know how to write Python at its most refined, and create concise, beautiful pieces of “Python art” in merely a single line.

Get your Python One-Liners on Amazon!!

The post SVM sklearn: Python Support Vector Machines Made Simple appeared first on Be on the Right Side of Change.

K-Nearest Neighbors (KNN) with sklearn in Python

Chris — Thu, 10 Jun 2021 16:09:00 +0000

The popular K-Nearest Neighbors (KNN) algorithm is used for regression and classification in many applications such as recommender systems, image classification, and financial data forecasting. It is the basis of many advanced machine learning techniques (e.g., in information retrieval). Understanding KNN is an important building block of a solid computer science education.

K-Nearest Neighbors (KNN) is a robust, simple, and popular machine learning algorithm. It’s relatively easy to implement from scratch while being competitive and performant.

Recap Machine Learning

Machine learning is all about learning a so-called model from a given training data set.

This model can then be used for inference, i.e., predicting output values for potentially new and unseen input data.

A model usually is a high-level abstraction such as a mathematical function inferred from the training data. Most machine learning techniques attempt to find patterns in the data that can be captured and used for generalization and prediction on new input data.

KNN Training

However, KNN follows a quite different path. The simple idea is the following: the whole data set is your model.

Yes, you read that right.

The KNN machine learning model is nothing more than a set of observations. Every single instance of your training data is a part of your model. Training becomes as simple as throwing the training data into a container data structure for later retrieval. There’s no complicated training phase with hours of distributed GPU processing to extract patterns from the data.

KNN Inference

A great advantage is that you can use the KNN Algorithm for prediction or classification – as you like. You execute the following strategy, given your input vector x.

Find the K nearest neighbors of x according to a predefined similarity metric.
Aggregate the K nearest neighbors into a single “prediction” or “classification” value. You can use any aggregator function such as mean, median, max, or min.

That’s it. Simple, isn’t it?

Check out the following graphic:

Suppose your company sells homes for clients. It has acquired a large database of customers and observed house prices.

One day, your client asks how much he can expect to pay for a house with 52 square meters. You query your KNN “model” and it immediately gives you the response $33,167. And indeed, your client finds a home for $33,489 the same week. How did the KNN system come to this surprisingly accurate prediction?

It simply calculated the K=3 nearest neighbors to the query “D=52 square meters” from the model with respect to Euclidean distance. The three nearest neighbors are A, B, and C with prices $34,000, $33,500, and $32,000, respectively. In the final step, the KNN aggregates the three nearest neighbors by calculating the simple average: ($34,000 + $33,500 + $32,000) / 3 ≈ $33,167. As K=3 in this example, we denote the model as “3NN”.
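
The same computation can be sketched in a few lines of NumPy. The prices of A, B, and C come from the example; the house sizes and the two extra houses are made up for illustration, and knn_regress is a hypothetical helper name, not an sklearn function.

```python
import numpy as np

# Hypothetical house sizes (square meters). The first three prices are
# those of A, B, and C from the text; the other values are made up.
sizes = np.array([50.0, 53.0, 55.0, 80.0, 120.0])
prices = np.array([34000.0, 33500.0, 32000.0, 60000.0, 95000.0])

def knn_regress(query, sizes, prices, k=3):
    """Average the prices of the k houses closest in size to the query."""
    distances = np.abs(sizes - query)      # Euclidean distance in 1D
    nearest = np.argsort(distances)[:k]    # indices of the k closest houses
    return prices[nearest].mean()

prediction = knn_regress(52.0, sizes, prices, k=3)
print(round(prediction))  # 33167
```

Note that there is no training step at all: the arrays themselves are the model, and all the work happens at query time.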

Of course, you can vary the similarity functions, the parameter K, and the aggregation method to come up with more sophisticated prediction models.

Another advantage of KNN is that it can be easily adapted as new observations are made. This is not generally true for every machine learning model. An obvious weakness in this regard is that finding the nearest neighbors becomes computationally more expensive the more points you add. To compensate, you can continuously remove “stale” values from the system.

As I mentioned above, you can also use KNN for classification problems. Instead of averaging over the K nearest neighbors, you can simply use a voting mechanism where each nearest neighbor votes for its class. The class with the most votes wins.
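
Here is a minimal sketch of that voting mechanism, using only NumPy and the standard library. The data, the labels, and the helper name knn_classify are assumptions made for illustration.

```python
from collections import Counter
import numpy as np

# Hypothetical training data: one feature per observation plus a class label.
features = np.array([1.0, 1.5, 2.0, 6.0, 6.5, 7.0])
labels = ["small", "small", "small", "large", "large", "large"]

def knn_classify(query, features, labels, k=3):
    """Return the most common label among the k nearest neighbors."""
    nearest = np.argsort(np.abs(features - query))[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

predicted = knn_classify(4.2, features, labels, k=3)
print(predicted)  # large
```

For the query 4.2, the three nearest neighbors are 6.0, 2.0, and 6.5, so “large” wins the vote two to one.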

Implementing KNN with SKLearn

## Dependencies
from sklearn.neighbors import KNeighborsRegressor
import numpy as np


## Data (House Size (square meters) / House Price ($))
X = np.array([[35, 30000], [45, 45000], [40, 50000],
              [35, 35000], [25, 32500], [40, 40000]])


## One-liner
KNN = KNeighborsRegressor(n_neighbors=3).fit(X[:,0].reshape(-1,1), X[:,1].reshape(-1,1))


## Result & puzzle
res = KNN.predict([[30]])
print(res)

The snippet above shows how to use KNN in Python in a single line of code.

Take a guess: what’s the output of this code snippet?

Understanding the Code

To help you see the result, let’s plot the housing data from the code:

Can you see the general trend? As the size of a house grows, you can expect its market price to grow as well, roughly linearly.

In the code, the client requests your price prediction for a house with 30 square meters. What does KNN with K=3 (in short: 3NN) predict?

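Beautifully simple, isn’t it? The KNN algorithm finds the three houses closest in size to the query and predicts the price as the average of their prices. Here, the three nearest houses are (35, 30000), (35, 35000), and (25, 32500), each 5 square meters away from the query.

Thus, the result is ($30,000 + $35,000 + $32,500) / 3 = $32,500.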

Maybe you were confused by the data conversion part within the one-liner. Let me quickly explain what happened here:

## One-liner
KNN = KNeighborsRegressor(n_neighbors=3).fit(X[:,0].reshape(-1,1), X[:,1].reshape(-1,1))

First, we create a new machine learning model called “KNeighborsRegressor”. If you would like to use KNN for classification, you would take the model “KNeighborsClassifier”.

Second, we “train” the model using the fit function with two parameters. The first parameter defines the input (the house size) and the second parameter defines the output (the house price). The input must be shaped so that each observation is an array-like data structure. For example, you wouldn’t use “30” as an input but “[30]”. The reason is that, in general, the input can be multi-dimensional rather than one-dimensional. Therefore, we reshape the input (the output is reshaped the same way here, although a 1D array of prices would also work):

print(X[:,0])
# [35 45 40 35 25 40]

If we used this 1D NumPy array as the input to the fit() function, scikit-learn would raise an error because it expects a 2D array of (array-like) observations, not a 1D array of integers.

Therefore, we convert the array accordingly using the reshape() function:

print(X[:,0].reshape(-1,1))
"""
[[35]
 [45]
 [40]
 [35]
 [25]
 [40]]
"""

Now, we have six array-like observations. The negative index -1 in the reshape() function call is our “laziness” expression: we want NumPy to determine the number of rows automatically – and only specify how many columns we need (i.e., 1 column).
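
If you want to check the effect of reshape() yourself, a quick sketch like the following compares the shapes before and after. The variable names sizes and column are just illustrative.

```python
import numpy as np

sizes = np.array([35, 45, 40, 35, 25, 40])

# -1 lets NumPy infer the number of rows; 1 fixes the number of columns.
column = sizes.reshape(-1, 1)

print(sizes.shape)   # (6,)
print(column.shape)  # (6, 1)
```

The values are unchanged; only the layout differs, with each observation now sitting in its own one-element row.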

This article is based on a chapter of my book Python One-Liners.

Where to Go From Here?

Understanding algorithms is hard enough.

Why do so many people struggle with algorithms?

Yes, complexity may be an issue from time to time. But in so many cases, the real problem is the lack of a quick and confident understanding of the very basics of code.

Proof: have you ever observed that you can easily understand algorithms visually but not in code?

There is only one solution: master the basics until you don’t have to think about them. Only then can your brain handle the higher-level complexity of algorithms.

To help you achieve this, I invest most of my time and effort in creating the best free Python email course on the web. Join my community of more than 66,000 ambitious Python coders!

The post K-Nearest Neighbors (KNN) with sklearn in Python appeared first on Be on the Right Side of Change.

[Tutorial] K-Means Clustering with SKLearn in One Line

Chris — Fri, 04 Jun 2021 12:38:00 +0000

If there is one clustering algorithm you need to know – whether you are a computer scientist, data scientist, or machine learning expert – it’s the K-Means algorithm. In this tutorial drawn from my book Python One-Liners, you’ll learn the general idea and when and how to use it in a single line of Python code using the sklearn library.

Labeled vs Unlabeled Training

You may know about supervised learning where the training data is “labeled”, i.e., we know the output value of every input value in the training data. But in practice, this is not always the case. What if you have “unlabeled” data? Especially in many data analytics applications, there is no such thing as “the optimal output”. Prediction is not the goal here – but you can still distill useful knowledge from these unlabeled data sets.

For example, suppose you are working in a startup that serves different target markets with various income levels and ages. Your boss tells you to find a certain number of target “personas” that best fit your different target markets.

It’s time to learn about “unsupervised learning” with unlabeled training data. In particular, you can use clustering methods to identify the “average customer personas” which your company serves.

Here is an example:

Visually, you can easily see three types of Personas with different types of incomes and ages. But how to find those algorithmically? This is the domain of clustering algorithms such as the widely popular K-Means algorithm.

Finding the Cluster Centers

Given a data set and an integer k, the K-Means algorithm finds k clusters of data such that the total (squared) distance between each data point and the center of its cluster (= the centroid of the data in that cluster) is minimal.

In other words, we can find the different personas by running the K-Means algorithm on our data sets:

The cluster centers (black dots) fit very nicely to the overall data. Every cluster center can be viewed as one customer persona. Thus, we have three idealized personas:

A 20-year-old earning $2000,
A 25-year-old earning $3000, and
A 40-year-old earning $4000.

And the great thing is that the K-Means algorithm finds those cluster centers automatically, even in a high-dimensional space (where it would be hard for humans to find the personas visually).

As a small side note: The K-Means algorithm requires “the number of cluster centers k” as an input. In this case, we use domain knowledge and “magically” defined k=3. There are more advanced algorithms that find the number of cluster centers automatically.

K-Means Algorithm Overview

So how does the K-Means algorithm work? In a nutshell, it performs the following procedure:

Initialize random cluster centers (centroids).
Repeat until convergence
- Assign every data point to its closest cluster center.
- Recompute each cluster center to the centroid of all data points assigned to it.
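
The procedure above can be sketched in plain NumPy as follows. This is a simplified illustration under a few assumptions (initial centers are drawn from the data points with a fixed seed, and a center without points keeps its position); it is not how sklearn implements K-Means internally, and the data is made up.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-Means: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Initialize the centers with k distinct, randomly chosen data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the closest center for every point.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        # Update step: move each center to the mean of its points
        # (a center without points keeps its old position).
        new_centers = np.array([
            X[assignment == j].mean(axis=0) if np.any(assignment == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # converged: the centers no longer move
        centers = new_centers
    return centers, assignment

# Hypothetical data with two clearly separated groups.
X = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
     [8.0, 8.0], [9.0, 8.5], [8.5, 9.5]]
centers, assignment = kmeans(X, k=2)
print(centers)
```

The order of the returned centers depends on the random initialization, so treat the cluster indices as arbitrary labels.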

KMeans Code Using Sklearn

How can we do all of this in a single line of code? Fortunately, the Scikit-learn library in Python has already implemented the K-Means algorithm in a very efficient manner.

So here is the one-liner code snippet that does K-Means clustering for you:

## Dependencies
from sklearn.cluster import KMeans
import numpy as np


## Data (Work (h) / Salary ($))
X = np.array([[35, 7000], [45, 6900], [70, 7100],
              [20, 2000], [25, 2200], [15, 1800]])


## One-liner
kmeans = KMeans(n_clusters=2).fit(X)


## Result & puzzle
cc = kmeans.cluster_centers_
print(cc)

Python Puzzle: What’s the output of this code snippet?

Try to guess a solution without understanding every syntactical element!

(In the next paragraphs, I will give you the result of this code puzzle. In my opinion, puzzle-based learning is one of the best ways to acquire the basics of programming. That’s why I have written the book “Coffee Break Python” to learn Python faster — and to fit learning in any daily schedule.)

Code Explanation

In the first lines, we import the KMeans class from the sklearn.cluster module. This class takes over the clustering itself. We also import the NumPy library because KMeans works on NumPy arrays.

The data is two-dimensional. It relates the number of working hours to the salary of some workers. There are six data points in this employee data set:

The goal is to find the two cluster centers that fit best to this data.

## One-liner
kmeans = KMeans(n_clusters=2).fit(X)

In the one-liner, we first create a new KMeans object that handles the algorithm for us, explicitly defining the number of cluster centers using the argument n_clusters. We then call the instance method fit(X) to run the K-Means algorithm on our input data X. The KMeans object now holds all the results. All that is left is to retrieve the results from its attributes.

cc = kmeans.cluster_centers_
print(cc)

So, what are the cluster centers and what is the output of this code snippet?

In the graphic, you can see that the two cluster centers are (20, 2000) and (50, 7000). This is also the result of the Python one-liner.

Where to go from here?

In this article, you have learned how to run the popular K-Means algorithm in Python — using only a single line of code.

I know that it can be hard to understand Python code snippets. Every coder is constantly challenged by the difficulty of code. Don’t let anybody tell you otherwise.

To make learning Python less of a pain, I have created a Python cheat sheet course where I’ll send you a concise, fresh cheat sheet every week. Join my Python course for free!

The post [Tutorial] K-Means Clustering with SKLearn in One Line appeared first on Be on the Right Side of Change.

Python Linear Regression with sklearn – A Helpful Illustrated Guide

Chris — Mon, 26 Apr 2021 11:49:00 +0000

This tutorial will show you the simplest and most straightforward way to implement linear regression in Python: by using scikit-learn’s linear regression functionality. I have written this tutorial as part of my book Python One-Liners where I present how expert coders accomplish a lot in a little bit of code.

Feel free to bookmark and download the Python One-Liner freebies here.

It is really simple to implement linear regression with the sklearn (short for scikit-learn) library. Have a quick look at this code snippet—we’ll explain everything afterward!

from sklearn.linear_model import LinearRegression
import numpy as np

## Data (Apple stock prices)
apple = np.array([155, 156, 157])
n = len(apple)


## One-liner
model = LinearRegression().fit(np.arange(n).reshape((n,1)), apple)


## Result & puzzle
print(model.predict([[3],[4]]))
# What is the output of this code?

This one-liner uses two Python libraries: NumPy and scikit-learn. The former is the de-facto standard library for numerical computations (e.g. matrix operations). The latter is the most comprehensive library for machine learning which implements hundreds of machine learning algorithms and techniques.

So let’s explore the code snippet step by step.

We create a simple dataset of three values: three stock prices of the Apple stock on three consecutive days. The variable apple holds this dataset as a one-dimensional NumPy array. We also store the length of the NumPy array in the variable n.

Let’s say the goal is to predict the stock value of the next two days. Such an algorithm could be useful as a benchmark for algorithmic trading applications (using larger datasets of course).

To achieve this goal, the one-liner uses linear regression and creates a model via the function fit(). But what exactly is a model?

Background: What is a Model?

Think of a machine learning model as a black box. You put stuff into the box. We call the input “features” and denote them using the variable x which can be a single value or a multi-dimensional vector of values. Then the box does its magic and processes your input. After a bit of time, you get back the result y.

Now, there are two separate phases: the training phase and the inference phase. During the training phase, you tell your model your “dream” output y’. You change the model as long as it does not generate your dream output y’.

As you keep telling the model your “dream” outputs for many different inputs, you “train” the model using your “training data”. Over time, the model will learn which output you would like to get for certain inputs.

That’s why data is so important in the 21st century: your model will only be as good as its training data. Without good training data, it is guaranteed to fail.

So why is machine learning such a big deal nowadays? The main reason is that models “generalize”, i.e., they can use their experience from the training data to predict outcomes for completely new inputs which they have never seen before. If the model generalizes well, these outputs can be surprisingly accurate compared to the “real” but unknown outputs.

Code Explanation

Now, let’s deconstruct the one-liner which creates the model:

model = LinearRegression().fit(np.arange(n).reshape((n,1)), apple)

First, we create a new “empty” model by calling LinearRegression(). What does this model look like?

Every linear regression model consists of certain parameters. For linear regression, the parameters are called “coefficients” because each parameter is the coefficient in a linear equation combining the different input features.

With this information, we can shed some light into our black box.

Given the input features x_1, x_2, …, x_k, the linear regression model combines them with the coefficients a_0, a_1, …, a_k to calculate the predicted output y using the formula:

y = a_0 + a_1 * x_1 + a_2 * x_2 + … + a_k * x_k

In our example, we have only a single input feature x so the formula becomes easier:

y = a_0 + a_1 * x

In other words, our linear regression model describes a line in the two-dimensional space. The first axis describes the input x. The second axis describes the output y. The line describes the (linear) relationship between input and output.

What is the training data in this space? In our case, the input of the model simply takes the indices of the days: [0, 1, 2] – one day for each stock price [155, 156, 157]. To put it differently:

Input x=0 should cause output y=155
Input x=1 should cause output y=156
Input x=2 should cause output y=157

Now, which line fits best to our training data [155, 156, 157]?

Here is what the linear regression model computes:

## Data (Apple stock prices)
apple = np.array([155, 156, 157])
n = len(apple)


## One-liner
model = LinearRegression().fit(np.arange(n).reshape((n,1)), apple)


## Result
print(model.coef_)
# [1.]
print(model.intercept_)
# 155.0

You can see that we have two coefficients: 1.0 and 155.0. Let’s put them in our formula for linear regression:

y = 155.0 + 1.0 * x

Let’s plot both the line and the training data in the same space:

A perfect fit! Using this model, we can predict the stock price for any value of x. Of course, whether this prediction accurately reflects the real world is another story.

After having trained the model, we use it to predict the two next days. The Apple dataset consists of three values 155, 156, and 157. We want to know the fourth and fifth value in this series. Thus, we predict the values for indices 3 and 4.

Note that both the function fit() and the function predict() require an array with the following format:

[<training_data_1>,
 <training_data_2>,
 …,
 <training_data_n>]

Each training data value is a sequence of feature values:

<training_data> = [feature_1, feature_2, …, feature_k]

Again, here is our one-liner:

model = LinearRegression().fit(np.arange(n).reshape((n,1)), apple)

In our case, we only have a single feature x. Therefore, we reshape the NumPy array of day indices to the strange-looking matrix form:

[[0],
 [1],
 [2]]

The fit() function takes two arguments: the input features of the training data (see the last paragraph) and the “dream outputs” of these inputs. Of course, our dream outputs are the real stock prices of the Apple stock. The function then finds the model parameters (i.e., the line) for which the difference between the predicted model values and the “dream outputs” is minimal. This is called “error minimization”. (To be more precise, the function minimizes the squared difference between the predicted model values and the “dream outputs” so that outliers have a larger impact on the error. For linear regression, this least-squares problem can be solved directly, without repeated trial and error.)
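
To see what “minimizing the squared difference” means in practice, here is a small sketch that solves the same least-squares problem with NumPy’s np.linalg.lstsq instead of scikit-learn. The variable names (A, params, squared_error) are my own choices for illustration.

```python
import numpy as np

apple = np.array([155.0, 156.0, 157.0])
x = np.arange(len(apple))

# Design matrix: a column of ones for the intercept and a column for x.
A = np.column_stack([np.ones(len(x)), x])

# Solve min ||A @ params - apple||^2 directly.
params, *_ = np.linalg.lstsq(A, apple, rcond=None)
intercept, slope = params

squared_error = np.sum((A @ params - apple) ** 2)
print(intercept, slope, squared_error)
```

Up to floating-point rounding, this recovers the intercept 155.0 and the slope 1.0 from above, with a squared error of zero.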



In our case, the model perfectly fits the training data, so the error is zero. But often it is not possible to find such a linear model. Here is an example of training data that cannot be fit by a single straight line:



from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt


## Data (Apple stock prices)
apple = np.array([157, 156, 159])
n = len(apple)


## One-liner
model = LinearRegression().fit(np.arange(n).reshape((n,1)), apple)


## Result
print(model.predict([[3],[4]]))
# approximately [159.33 160.33]

x = np.arange(5)
plt.plot(x[:len(apple)], apple, "o", label="apple stock price")
plt.plot(x, model.intercept_ + model.coef_[0]*x, ":",
         label="prediction")
plt.ylabel("y")
plt.xlabel("x")
plt.ylim((154,164))
plt.legend()
plt.show()








In this case, the fit() function finds the line that minimizes the squared error between the training data and the predictions as described above.
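
As a rough check, the following sketch fits the same three points with np.polyfit and compares the squared error of the fitted line with that of another plausible line. The comparison line (through the first and last point) is an arbitrary choice for illustration.

```python
import numpy as np

apple = np.array([157.0, 156.0, 159.0])
x = np.arange(len(apple))

# np.polyfit with degree 1 returns (slope, intercept) of the least-squares line.
slope, intercept = np.polyfit(x, apple, 1)

# Sum of squared errors of the best-fitting line.
sse = np.sum((apple - (intercept + slope * x)) ** 2)

# A different line, here the one through the first and last point,
# has a larger squared error.
other_sse = np.sum((apple - (157.0 + 1.0 * x)) ** 2)

print(slope, intercept, sse, other_sse)
```

The fitted line has slope 1 and intercept of about 156.33, and its squared error (about 2.67) is smaller than the 4.0 of the alternative line.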



Where to Go from Here?



Do you feel like you need to brush up your coding skills? Then join my free “Coffee Break Python Email Course”. I’ll send you cheat sheets, daily Python lessons, and code contests. It’s fun!
The post Python Linear Regression with sklearn – A Helpful Illustrated Guide appeared first on Be on the Right Side of Change.

Logistic Regression Scikit-learn vs Statsmodels

Lukas Halim — Fri, 05 Feb 2021 15:44:50 +0000

What’s the difference between Statsmodels and Scikit-learn? Both have ordinary least squares and logistic regression, so it seems like Python is giving us two ways to do the same thing. Statsmodels offers modeling from the perspective of statistics. Scikit-learn offers some of the same models from the perspective of machine learning.

So we need to understand the difference between statistics and machine learning! Statistics makes mathematically valid inferences about a population based on sample data. Statistics answers the question, “What is the evidence that X is related to Y?” Machine learning has the goal of optimizing predictive accuracy rather than inference. Machine learning answers the question, “Given X, what prediction should we make for Y?”

In the example below, we’ll create a fake dataset with predictor variables and a binary Y variable. Then we’ll perform logistic regression with scikit-learn and statsmodels. We’ll see that scikit-learn allows us to easily tune the model to optimize predictive power. Statsmodels will provide a summary of statistical measures which will be very familiar to those who’ve used SAS or R.

If you need an intro to Logistic Regression, see this Finxter post.

Create Fake Data for the Logistic Regression Model

I tried using some publicly available data for this exercise but didn’t find one with the characteristics I wanted. So I decided to create some fake data by using NumPy! There’s a post here that explains the math and how to do this in R.

import numpy as np
import pandas as pd

#The next line is setting the seed for the random number generator so that we get consistent results
rg = np.random.default_rng(seed=0)
#Create an array with 500 rows and 3 columns
X_for_creating_probabilities = rg.normal(size=(500,3))

Create an array with the first column removed. The deleted column can be thought of as random noise, or as a variable that we don’t have access to when creating the model.

X1 = np.delete(X_for_creating_probabilities,0,axis=1)
X1[:5]
"""
array([[-0.13210486,  0.64042265],
       [-0.53566937,  0.36159505],
       [ 0.94708096, -0.70373524],
       [-0.62327446,  0.04132598],
       [-0.21879166, -1.24591095]])
"""

Now we’ll create two more columns correlated with X1. Datasets often have highly correlated variables. Correlation increases the likelihood of overfitting. Concatenate to get a single array.

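#Note: np.random is not seeded here, so X2 differs between runs; use rg.normal(...) for reproducible noise
X2 = X1 + .1 * np.random.normal(size=(500,2))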
X_predictors = np.concatenate((X1,X2),axis=1)

We want to create our outcome variable and have it be related to X_predictors. To do that, we use our data as inputs to the logistic regression model to get probabilities. Then we set the outcome variable, Y, to True when the probability is above .5.

P = 1 / (1 + np.e**(-np.matmul(X_for_creating_probabilities,[1,1,1])))
Y = P > .5
#About half of cases are True
np.mean(Y)
#0.498

Now divide the data into training and test data. We’ll run a logistic regression on the training data, then see how well the model performs on the test data.

#Set the first 50 rows to train the model
X_train = X_predictors[:50]
Y_train = Y[:50]

#Set the remaining rows to test the model
X_test = X_predictors[50:]
Y_test = Y[50:]

print(f"X_train: {len(X_train)} X_test: {len(X_test)}")
#X_train: 50 X_test: 450

Logistic regression with Scikit-learn

We’re ready to train and test models.

As we train the models, we need to take steps to avoid overfitting. A machine learning model may have very accurate results with the data used to train the model. But this does not mean it will be equally accurate when making predictions with data it hasn’t seen before. When the model fails to generalize to new data, we say it has “overfit” the training data. Overfitting is more likely when there are few observations to train on, and when the model uses many correlated predictors.

How to avoid overfitting? By default, scikit-learn’s logistic regression applies regularization. Regularization balances the need for predictive accuracy on the training data with a penalty on the magnitude of the model coefficients. Increasing the penalty reduces the coefficients and hence reduces the likelihood of overfitting. If the penalty is too large, though, it will reduce predictive power on both the training and test data.
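
To illustrate the trade-off, here is a sketch of an L2-penalized logistic loss in NumPy. It assumes the common form 0.5 * ||w||² + C * (sum of logistic losses) with labels coded as -1 and +1; the function name penalized_loss and the tiny data set are made up, and this is an illustration of the idea rather than scikit-learn’s actual implementation.

```python
import numpy as np

def penalized_loss(w, b, X, y, C=1.0):
    """L2-penalized logistic loss; y must contain -1 and +1 labels."""
    z = X @ w + b
    data_term = np.sum(np.logaddexp(0.0, -y * z))  # log(1 + exp(-y * z))
    penalty = 0.5 * np.dot(w, w)                   # grows with coefficient size
    return penalty + C * data_term

# Tiny made-up data set, only to exercise the function.
X = np.array([[1.0, 2.0], [-1.0, -1.5], [2.0, 0.5], [-2.0, -0.5]])
y = np.array([1, -1, 1, -1])

# With all coefficients at zero, only the data term contributes.
loss_default = penalized_loss(np.zeros(2), 0.0, X, y, C=1.0)
loss_small_c = penalized_loss(np.zeros(2), 0.0, X, y, C=0.1)
print(loss_default, loss_small_c)
```

A smaller C shrinks the weight of the data term relative to the penalty, which is why it pushes the fitted coefficients toward zero.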

from sklearn.linear_model import LogisticRegression
scikit_default = LogisticRegression(random_state=0).fit(X_train, Y_train)
print(f"intercept: {scikit_default.intercept_} coefficients: {scikit_default.coef_}")
print(f"train accuracy: {scikit_default.score(X_train, Y_train)}")
print(f"test accuracy: {scikit_default.score(X_test, Y_test)}")
"""
Results will vary slightly, even when you set random_state.
intercept: [-0.44526823] coefficients: [[0.50031563 0.79636504 0.82047214 0.83635656]]
train accuracy: 0.8
test accuracy: 0.8088888888888889
"""

We can turn off regularization by setting penalty to 'none' (recent scikit-learn releases expect penalty=None instead of the string). Regularization shrinks the coefficients, so turning it off lets them grow. Notice that the accuracy on the test data decreases. This suggests our model has overfit the training data.

from sklearn.linear_model import LogisticRegression
# penalty='none' disables regularization; recent scikit-learn releases expect penalty=None
scikit_no_penalty = LogisticRegression(random_state=0,penalty='none').fit(X_train, Y_train)
print(f"intercept: {scikit_no_penalty.intercept_} coefficients: {scikit_no_penalty.coef_}")
print(f"train accuracy: {scikit_no_penalty.score(X_train, Y_train)}")
print(f"test accuracy: {scikit_no_penalty.score(X_test, Y_test)}")
"""
intercept: [-0.63388911] coefficients: [[-3.59878438  0.70813119  5.10660019  1.29684873]]
train accuracy: 0.82
test accuracy: 0.7888888888888889
"""

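The strength of the penalty is controlled by the parameter C, the inverse of the regularization strength. C is 1.0 by default. Smaller values of C increase the regularization, so if we set the value to .1 we reduce the magnitude of the coefficients.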
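
To see the effect of C in isolation, here is a minimal sketch on stand-in data. The random data and the names `X_demo`, `Y_demo`, and `norms` are assumptions, not the article’s dataset. It fits the model for several values of C and prints the length of the coefficient vector.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data with noisy labels, so the classes are not perfectly separable
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
prob = 1 / (1 + np.exp(-(X_demo[:, 0] + X_demo[:, 1])))
Y_demo = rng.random(200) < prob

# Smaller C means a stronger penalty, which pulls the coefficients toward zero
norms = {}
for C in [0.01, 1, 100]:
    model = LogisticRegression(C=C).fit(X_demo, Y_demo)
    norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C}: coefficient norm {norms[C]:.3f}")
```

On the article’s data, setting C to .1 gives: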

from sklearn.linear_model import LogisticRegression
scikit_bigger_penalty = LogisticRegression(random_state=0,C=.1).fit(X_train, Y_train)
print(f"intercept: {scikit_bigger_penalty.intercept_} \
    coefficients: {scikit_bigger_penalty.coef_}")
print(f"train accuracy: {scikit_bigger_penalty.score(X_train, Y_train)}")
print(f"test accuracy: {scikit_bigger_penalty.score(X_test, Y_test)}")
"""
intercept: [-0.13102803]     coefficients: [[0.3021235  0.3919277  0.34359251 0.40332636]]
train accuracy: 0.8
test accuracy: 0.8066666666666666
"""

It’s nice to be able to adjust the regularization strength, but how do we decide on the best value? Scikit-learn’s GridSearchCV provides an effective and easy-to-use method for choosing one. The “Grid Search” in GridSearchCV means that we supply a dictionary with the parameter values we wish to test. The model is fit with all combinations of those values. If we have 4 possible values for C and 2 possible values for solver, we will search through all 4 × 2 = 8 combinations.

GridSearchCV Searches Through This Grid

C	solver
.01	newton-cg
.1	newton-cg
1	newton-cg
10	newton-cg
.01	lbfgs
.1	lbfgs
1	lbfgs
10	lbfgs

The “CV” in GridSearchCV stands for cross-validation. Cross-validation splits the training data into segments. The model is trained on all but one of the segments, and the remaining segment is used to validate the model. This repeats until every segment has served as the validation segment once.

Iteration	Segment 1	Segment 2	Segment 3	Segment 4	Segment 5
1st Iteration	Validation	Train	Train	Train	Train
2nd Iteration	Train	Validation	Train	Train	Train
3rd Iteration	Train	Train	Validation	Train	Train
4th Iteration	Train	Train	Train	Validation	Train
5th Iteration	Train	Train	Train	Train	Validation
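
The rotation in the table can be reproduced with scikit-learn’s `KFold`. This is a small sketch on ten made-up row indices; the names `rows` and `folds` are illustrative only. GridSearchCV uses 5 folds by default and, for classifiers, a stratified variant of this splitter.

```python
import numpy as np
from sklearn.model_selection import KFold

rows = np.arange(10)  # stand-in for the row indices of a training set

# Each iteration holds out a different fifth of the rows for validation
folds = list(KFold(n_splits=5).split(rows))
for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"iteration {i}: validate on {val_idx}, train on {train_idx}")
```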

GridSearch and cross-validation work in combination. GridSearchCV iterates through values of C and solver for different test and training segments. The algorithm selects the best estimator based on performance on the validation segments.

Doing this allows us to determine which values of C and solver work best for our training data. This is how scikit-learn helps us to optimize predictive accuracy.

Let’s see it in action.

from sklearn.model_selection import GridSearchCV
parameters = {'C':[.01, .1, 1, 10],'solver':['newton-cg','lbfgs']}
Logistic = LogisticRegression(random_state=0)
scikit_GridSearchCV = GridSearchCV(Logistic, parameters)
scikit_GridSearchCV.fit(X_train, Y_train)
print(f"best estimator: {scikit_GridSearchCV.best_estimator_}")
#best estimator: LogisticRegression(C=0.1, random_state=0, solver='newton-cg')
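
Beyond the best estimator, the fitted search object keeps the score of every combination it tried. The sketch below runs the same grid on made-up data (`X_demo`, `Y_demo`, and `search` are assumed names) and reads the results from `cv_results_`, `best_params_`, and `best_score_`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Made-up data with noisy labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
prob = 1 / (1 + np.exp(-(X_demo[:, 0] + X_demo[:, 1])))
Y_demo = rng.random(200) < prob

grid = {'C': [.01, .1, 1, 10], 'solver': ['newton-cg', 'lbfgs']}
search = GridSearchCV(LogisticRegression(), grid, cv=5).fit(X_demo, Y_demo)

# One entry per combination: 4 values of C times 2 solvers
for params, score in zip(search.cv_results_['params'],
                         search.cv_results_['mean_test_score']):
    print(params, round(float(score), 3))

print("best:", search.best_params_, round(float(search.best_score_), 3))
```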

The score method returns the mean accuracy on the given test data and labels. Accuracy is the percentage of observations correctly predicted.
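
As a quick check of that definition, the sketch below compares a hand-computed accuracy with scikit-learn’s `accuracy_score` on a few made-up labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels: 3 of the 5 predictions match the truth
y_true = np.array([True, False, True, True, False])
y_pred = np.array([True, False, False, True, True])

manual = (y_true == y_pred).mean()        # share of matching entries
library = accuracy_score(y_true, y_pred)  # same quantity from scikit-learn
print(manual, library)
```

Applied to the grid search result, score gives: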

print(f"train accuracy: {scikit_GridSearchCV.score(X_train, Y_train)}")
print(f"test accuracy: {scikit_GridSearchCV.score(X_test, Y_test)}")
"""
train accuracy: 0.82
test accuracy: 0.8133333333333334
"""

Logistic regression with Statsmodels

Now let’s try the same, but with statsmodels. With scikit-learn, to turn off regularization we set penalty='none', but with statsmodels regularization is turned off by default. A quirk to watch out for is that Statsmodels does not include an intercept by default. To include an intercept, we use the sm.add_constant method.

import statsmodels.api as sm

#adding constant to X
X_train_with_constant = sm.add_constant(X_train)
X_test_with_constant = sm.add_constant(X_test)

# building the model and fitting the data
sm_model_all_predictors = sm.Logit(Y_train, X_train_with_constant).fit()

# printing the fitted coefficients (the constant comes first)
print(sm_model_all_predictors.params)
"""
Optimization terminated successfully.
         Current function value: 0.446973
         Iterations 7
[-0.57361523 -2.00207425  1.28872367  3.53734636  0.77494424]
"""

If you’re used to doing logistic regression in R or SAS, what comes next will be familiar. Once we have trained the logistic regression model with statsmodels, the summary method will easily produce a table with statistical measures including p-values and confidence intervals.

sm_model_all_predictors.summary()

Dep. Variable:	y	No. Observations:	50
Model:	Logit	Df Residuals:	45
Method:	MLE	Df Model:	4
Date:	Thu, 04 Feb 2021	Pseudo R-squ.:	0.3846
Time:	14:33:19	Log-Likelihood:	-21.228
converged:	True	LL-Null:	-34.497
Covariance Type:	nonrobust	LLR p-value:	2.464e-05

	coef	std err	z	P>|z|	[0.025	0.975]
const	-0.7084	0.478	-1.482	0.138	-1.645	0.228
x1	5.5486	4.483	1.238	0.216	-3.237	14.335
x2	10.2566	5.686	1.804	0.071	-0.887	21.400
x3	-3.9137	4.295	-0.911	0.362	-12.333	4.505
x4	-7.8510	5.364	-1.464	0.143	-18.364	2.662

There’s a lot here, but we’ll focus on the second table with the coefficients.

The first column shows the value for the coefficient. The fourth column, with the heading P>|z|, shows the p-values. A p-value is a probability measure, and p-values above .05 are frequently considered “not statistically significant.” None of the predictors are considered statistically significant! This is because we have a relatively small number of observations in our training data and because the predictors are highly correlated. Some statistical packages like R and SAS have built-in methods to select the features to include in the model based on which predictors have low (significant) p-values, but unfortunately, this isn’t available in statsmodels.

If we try again with just x1 and x2, we’ll get a completely different result, with very low p-values for x1 and x2, meaning that the evidence for a relationship with the dependent variable is statistically significant. We’re cheating, though – because we created the data, we know that we only need x1 and x2.

sm_model_x1_x2 = sm.Logit(Y_train, X_train_with_constant[:,:3]).fit()
sm_model_x1_x2.summary()

Now we see x1 and x2 are both statistically significant.

Statsmodels doesn’t have the same accuracy method that we have in scikit-learn. We’ll use the predict method to predict the probabilities. Then we’ll use the decision rule that probabilities above .5 are true and all others are false. This is the same rule used when scikit-learn calculates accuracy.
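
The claim about scikit-learn’s rule can be checked directly. The sketch below uses made-up data with assumed names, and compares the labels from `predict` with the labels obtained by thresholding `predict_proba` at .5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data with noisy labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
prob = 1 / (1 + np.exp(-X_demo.sum(axis=1)))
Y_demo = rng.random(200) < prob

model = LogisticRegression().fit(X_demo, Y_demo)

# Column 1 of predict_proba holds the probability of the True class
by_threshold = model.predict_proba(X_demo)[:, 1] > .5
by_predict = model.predict(X_demo)

agree = np.array_equal(by_threshold, by_predict)
print(agree)
```

The same rule applied to the statsmodels predictions: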

all_predicted_train = sm_model_all_predictors.predict(X_train_with_constant)>.5
all_predicted_test = sm_model_all_predictors.predict(X_test_with_constant)>.5

x1_x2_predicted_train = sm_model_x1_x2.predict(X_train_with_constant[:,:3])>.5
x1_x2_predicted_test = sm_model_x1_x2.predict(X_test_with_constant[:,:3])>.5

#calculate the accuracy
print(f"train: {(Y_train==all_predicted_train).mean()} and test: {(Y_test==all_predicted_test).mean()}")
print(f"train: {(Y_train==x1_x2_predicted_train).mean()} and test: {(Y_test==x1_x2_predicted_test).mean()}")
"""
train: 0.8 and test: 0.8066666666666666
train: 0.8 and test: 0.8111111111111111
"""

Summarizing The Results

Let’s create a DataFrame with the results. The models have similar accuracy on the training data, with somewhat different results on the test data. The models with all the predictors and without regularization have the worst test accuracy, suggesting that they have overfit the training data and so do not generalize well to new data.

Even if we use the best methods in creating our model, there is still chance involved in how well it generalizes to the test data.

import pandas as pd  # needed for the DataFrame below, in case it was not imported earlier

lst = [['scikit-learn','default', scikit_default.score(X_train, Y_train),scikit_default.score(X_test, Y_test)],
       ['scikit-learn','no penalty', scikit_no_penalty.score(X_train, Y_train),scikit_no_penalty.score(X_test, Y_test)],
       ['scikit-learn','bigger penalty', scikit_bigger_penalty.score(X_train, Y_train),scikit_bigger_penalty.score(X_test, Y_test)],
       ['scikit-learn','GridSearchCV', scikit_GridSearchCV.score(X_train, Y_train),scikit_GridSearchCV.score(X_test, Y_test)],
       ['statsmodels','include intercept and all predictors', (Y_train==all_predicted_train).mean(),(Y_test==all_predicted_test).mean()],
       ['statsmodels','include intercept and x1 and x2', (Y_train==x1_x2_predicted_train).mean(),(Y_test==x1_x2_predicted_test).mean()]
      ]
df = pd.DataFrame(lst, columns =['package', 'setting','train accuracy','test accuracy'])
df

	package	setting	train accuracy	test accuracy
0	scikit-learn	default	0.80	0.808889
1	scikit-learn	no penalty	0.78	0.764444
2	scikit-learn	bigger penalty	0.82	0.813333
3	scikit-learn	GridSearchCV	0.80	0.808889
4	statsmodels	include intercept and all predictors	0.78	0.764444
5	statsmodels	include intercept and x1 and x2	0.80	0.811111

Scikit-learn vs Statsmodels

The upshot is that you should use scikit-learn for logistic regression unless you need the statistical results provided by statsmodels.

Here’s a table of the most relevant similarities and differences:

	Scikit-learn	Statsmodels
Regularization	Uses L2 regularization by default, but regularization can be turned off using penalty='none'	Does not use regularization by default
Hyperparameter tuning	GridSearchCV allows for easy tuning of regularization parameter	User will need to write lines of code to tune regularization parameter
Intercept	Includes intercept by default	Use the add_constant method to include an intercept
Model Evaluation	The score method reports prediction accuracy	The summary method shows p-values, confidence intervals, and other statistical measures
When should you use it?	For accurate predictions	For statistical inference.
Comparison with R and SAS	Different	Similar

That’s it for now! Please check out my other work at learningtableau.com and my new site datasciencedrills.com.


Decision Tree Learning — A Helpful Illustrated Guide in Python

Chris — Thu, 07 Jan 2021 20:07:00 +0000

This tutorial will show you everything you need to get started training your first models using decision tree learning in Python. To help you grasp this topic thoroughly, I attacked it from different perspectives: textual, visual, and audio-visual. So, let’s get started!

Why Decision Trees?

Deep learning has become the megatrend within artificial intelligence and machine learning. Yet, training large neural networks is not always the best choice. It’s the bazooka in machine learning, effective but not efficient.

A human will not understand in practice why the neural network classifies one way or the other. It is just a black box. Should you blindly invest your money into a stock recommended by a neural network? As you do not know the basis of the decision of a neural network, it can be hard to blindly trust its recommendations.

Many ML divisions in large companies must be able to explain the reasoning of their ML algorithms. Deep learning models fail to do this, but this is where decision trees excel!

This is one reason for the popularity of decision trees. Decision trees are more human-friendly and intuitive. You know exactly how the decisions emerged. And you can even hand-tune the ML model if you want to.

The decision tree consists of branching nodes and leaf nodes. A branching node is a variable (also called feature) that is given as input to your decision problem. For each possible value of this feature, there is a child node.

A leaf node represents the predicted class given the feature values along the path from the root. Each leaf node has an associated probability, i.e., how often we have seen this particular instance (choice of feature values) in the training data. Moreover, each leaf node has an associated class or output value, which is the predicted class of the input given by the branching nodes.

Video Decision Trees

I explain decision trees in this video:

In case you need to refresh your Python skills, feel free to deepen your Python code understanding with the Finxter web app.

Explanation Simple Example

You already know decision trees very well from your own experience. They represent a structured way of making decisions – each decision opening new branches. By answering a bunch of questions, you will finally land on the recommended outcome.

Here is an example:

Decision trees are used for classification problems such as “which subject should I study, given my interests?”. You start at the top. Now, you repeatedly answer questions (select the choices that describe your features best). Finally, you reach a leaf node of the tree. This is the recommended class based on your feature selection.

There are many nuances to decision tree learning. For example, in the above figure, the first question carries more weight than the last question. If you like maths, the decision tree will never recommend you art or linguistics. This is useful because some features may be much more important for the classification decision than others. For example, a classification system that predicts your current health may use your sex (feature) to practically rule out many diseases (classes).

Hence, the order of the decision nodes lends itself to performance optimizations: place the features that have a high impact on the final classification at the top. Decision tree learning will then aggregate the questions that do not have a high impact on the final classification, as shown in the next graphic:

Suppose the full decision tree looks like the tree on the left. For any combination of features, there is a separate classification outcome (the tree leaves). However, some features may not give you any additional information with respect to the classification problem (e.g. the first “Language” decision node in the example). Decision tree learning would effectively get rid of these nodes for efficiency reasons. This is called “pruning”.
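
In scikit-learn, trees are grown fully by default, and pruning is available through the `ccp_alpha` parameter (cost-complexity pruning). That is a related mechanism rather than the exact procedure in the figure. The sketch below uses the bundled iris dataset and an arbitrarily chosen `ccp_alpha=0.05`; both are assumptions picked for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# Fully grown tree versus a tree pruned with cost-complexity pruning
full = DecisionTreeClassifier(random_state=0).fit(X_iris, y_iris)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.05).fit(X_iris, y_iris)

# The pruned tree keeps only the splits that are worth their added complexity
print(f"full tree: {full.tree_.node_count} nodes")
print(f"pruned tree: {pruned.tree_.node_count} nodes")
```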

Decision Tree Code in Python

Here’s some code on how you can run a decision tree in Python using the sklearn library for machine learning:

## Dependencies
import numpy as np
from sklearn import tree


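## Data: student scores in (math, language, creativity) --> study field
## Mixing numbers and strings makes NumPy store every entry as a string
X = np.array([[9, 5, 6, "computer science"],
              [1, 8, 1, "literature"],
              [5, 7, 9, "art"]])


## One-liner (the scores are cast back to floats before fitting)
Tree = tree.DecisionTreeClassifier().fit(X[:,:-1].astype(float), X[:,-1])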

## Result & puzzle
student_0 = Tree.predict([[8, 6, 5]])
print(student_0)

student_1 = Tree.predict([[3, 7, 9]])
print(student_1)

The data in the code snippet describes three students with their estimated skill level (a score between 1-10) in the three areas math, language, and creativity. We also know the study subjects of these students. For example, the first student is highly skilled in maths and studies computer science. The second student is skilled in language much more than in the other two skills and studies literature. The third student is good in creativity and studies art.

The one-liner creates a new decision tree object and trains the model using the fit function on the labeled training data (the last column is the label). Internally, it builds a tree whose decision nodes test the features (math, language, creativity) and whose leaves hold the three study fields.

When predicting the class of student_0 (math=8, language=6, creativity=5), the decision tree returns “computer science”. It has learned that this feature pattern (high, medium, medium) is an indicator for the first class. On the other hand, when asked for (3, 7, 9), the decision tree predicts “art” because it has learned that the score (low, medium, high) hints at the third class.
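
To look at what the tree actually learned, scikit-learn’s `export_text` prints the decision rules. This is a sketch using the same three students; the variable names and the fixed `random_state=0` are my own choices. The exact splits depend on the seed.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# The three students from the example, with features and labels kept separate
features = np.array([[9, 5, 6], [1, 8, 1], [5, 7, 9]])
labels = np.array(["computer science", "literature", "art"])

clf = DecisionTreeClassifier(random_state=0).fit(features, labels)

# Each indented line is a threshold test; "class:" lines are the leaves
rules = export_text(clf, feature_names=["math", "language", "creativity"])
print(rules)
print("leaves:", clf.get_n_leaves())
```

With three students in three different classes, the fully grown tree ends in three leaves, one per study field.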

Note that the algorithm is non-deterministic. In other words, when executing the same code twice, different results may arise. This is common for machine learning algorithms that work with random generators. In this case, the order of the features is randomly permuted, so the final decision tree may have a different order of the features.
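
If you need the same tree on every run, you can fix the random seed with the `random_state` parameter. The sketch below fits the three-student data five times with the same seed and checks that the printed rules are identical; the seed value 42 and the variable names are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

features = np.array([[9, 5, 6], [1, 8, 1], [5, 7, 9]])
labels = np.array(["computer science", "literature", "art"])

# Fit five times with the same seed and collect the distinct rule sets
rule_sets = {
    export_text(DecisionTreeClassifier(random_state=42).fit(features, labels))
    for _ in range(5)
}

# A single distinct rule set means every run produced the same tree
reproducible = len(rule_sets) == 1
print(reproducible)
```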

