Logistic Regression in Python with Scikit-Learn

Logistic regression is a popular algorithm for classification problems (despite its name indicating that it is a “regression” algorithm). It is one of the most important algorithms in the machine learning space.

Linear Regression Background

Let’s review linear regression. Given the training data, we compute a line that fits this training data so that the summed squared distance between the line and the training data is minimal.

This line can be used for many things – e.g. to predict the outcome for unseen input data x. In general, linear regression is great for predicting a continuous output value y, given a continuous input value x. A continuous value can take an infinite number of values. For example, we could predict the stock price (output y), given the number of social media posts mentioning the company behind the stock (input x). The stock price is continuous as it can take on any value such as $123.45, $121.897, or $10,198.87.
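To make this concrete, here is a minimal sketch using scikit-learn’s LinearRegression; both the post counts and the prices are invented purely for illustration:

from sklearn.linear_model import LinearRegression
import numpy as np


## Made-up training data: social media posts (input x) and stock price (output y)
posts = np.array([[10], [20], [30], [40]])        # 2D array: one feature column
prices = np.array([121.0, 125.5, 131.0, 140.2])   # continuous output values

lin_model = LinearRegression().fit(posts, prices)

## Predict a continuous stock price for an unseen number of posts
print(lin_model.predict([[25]]))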

Logistic Regression and Sigmoid Function

But what if the output is not continuous but categorical? For example, let’s say you want to predict the likelihood of lung cancer, given the number of cigarettes a patient smokes. Each patient can either have lung cancer or not. In contrast to the previous example, there are only these two possible outcomes.

Predicting the likelihood of categorical outcomes is the main motivation for logistic regression.

While linear regression fits a line into the training data, logistic regression fits an S-shaped curve, called “the sigmoid function”. Why? Because the line helps you generate a new output value for each input. On the other hand, the S-shaped curve helps you make binary decisions (e.g. yes/no). For most input values, the sigmoid function will either return a value that is very close to 0 or very close to 1. It is relatively unlikely that your given input value generates a value that is somewhere in-between.
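If you want to see this behavior in code, here is a minimal sketch of the sigmoid function 1 / (1 + e^(-z)); the sample z values are arbitrary:

import numpy as np

def sigmoid(z):
    ## Maps any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

for z in [-6, -2, 0, 2, 6]:
    print(z, round(sigmoid(z), 3))

## Large negative z gives values close to 0, large positive z gives values close to 1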

Here is a graphical example of such a scenario:

Sigmoid Function Example

The sigmoid function approximates the probability that a patient has lung cancer, given the number of cigarettes they smoke. This probability helps you to make a robust decision on the subject: does the patient have lung cancer?

Have a look at the following example:

There are two new patients (in yellow). Let’s pretend we know nothing about them but the number of cigarettes they smoke. We have already trained our logistic regression model (the sigmoid function) that returns a probability value for any new input value x. Now, we can use the respective probabilities of our two inputs to make a prediction about whether the new patients have lung cancer or not.

If the probability given by the sigmoid function is higher than 50%, the model predicts “lung cancer positive”, otherwise, it predicts “lung cancer negative”.
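In code, this decision rule is just a comparison against 0.5; the probability values below are assumed examples, not outputs of a real model:

## Assumed example probabilities returned by the sigmoid for two new patients
probabilities = [0.15, 0.82]

for p in probabilities:
    print("lung cancer positive" if p > 0.5 else "lung cancer negative")
# lung cancer negative
# lung cancer positive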

So how do you select the sigmoid function that best fits the training data?

This is the main question for logistic regression. The answer is maximum likelihood. In other words, which sigmoid function would generate the observed training data with the highest probability?

To calculate the likelihood for a given set of training data, you simply calculate the likelihood for a single training data point and repeat this procedure for all training data points. Finally, you multiply those likelihoods together to get the likelihood for the whole set of training data.

Now, you repeat this same likelihood computation for different sigmoid functions (shifting the sigmoid function a little bit each time). From all these computations, you take the sigmoid function with “maximum likelihood”, that is, the one that would produce the observed training data with maximal probability.
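Here is a toy sketch of this procedure. The candidate sigmoid parameters and the data are assumptions chosen only to illustrate how the likelihood of the whole data set is the product of the per-point likelihoods:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

## Toy training data: (number of cigarettes, has lung cancer)
data = [(0, False), (10, False), (60, True), (90, True)]

def likelihood(w, b):
    ## Probability that the sigmoid defined by (w, b) generates the observed data
    result = 1.0
    for x, has_cancer in data:
        p_cancer = sigmoid(w * x + b)
        result *= p_cancer if has_cancer else (1 - p_cancer)
    return result

## Compare a few shifted/scaled sigmoid candidates and keep the most likely one
candidates = [(0.05, -1.0), (0.1, -3.0), (0.2, -6.0)]
best = max(candidates, key=lambda wb: likelihood(*wb))
print(best, likelihood(*best))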

Logistic Regression with sklearn.linear_model

Let’s program your first virtual doctor app using logistic regression – in a single line of Python code!

from sklearn.linear_model import LogisticRegression
import numpy as np


## Data (#cigarettes, cancer)
X = np.array([[0, "No"],
              [10, "No"],
              [60, "Yes"],
              [90, "Yes"]])


## One-liner
model = LogisticRegression().fit(X[:,0].reshape(-1,1), X[:,1])


## Result & puzzle
print(model.predict([[2],[12],[13],[40],[90]]))

Exercise: What is the output of this code snippet? Take a guess!

The labeled training data set X consists of four patient records (rows) with two features (columns). The first column holds the number of cigarettes the patients smoke, and the second column holds whether they ultimately suffered from lung cancer. Hence, there is a continuous input variable and a categorical output variable. It’s a classification problem!

We build the model by calling the LogisticRegression() constructor with no parameters. On this model, we call the fit function, which takes two arguments: the input values and the output classifications (labels). The input values are expected to come as a two-dimensional array where each row holds the feature values.

In our case, we only have a single feature value, so we transform our input into a column vector using the reshape() operation, which generates a two-dimensional NumPy array. The first argument specifies the number of rows, the second specifies the number of columns. We only care about the number of columns, which is one. NumPy determines the number of rows automatically when using the “dummy” parameter -1.

Here is what the input training data (without labels) looks like after converting it using the reshape operation:

[[0],
 [10],
 [60],
 [90]]
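You can verify this step in isolation; the numbers below are just the first column of the training data:

import numpy as np

cigarettes = np.array([0, 10, 60, 90])   # one-dimensional array, shape (4,)
column = cigarettes.reshape(-1, 1)       # two-dimensional column vector, shape (4, 1)
print(column.shape)
# (4, 1)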

Next, we predict whether a patient has lung cancer, given the number of cigarettes they smoke: 2, 12, 13, 40, 90 cigarettes.

Here is the output:

## Result & puzzle
print(model.predict([[2],[12],[13],[40],[90]]))
# ['No' 'No' 'Yes' 'Yes' 'Yes']

The model predicts that the first two patients are lung cancer negative, while the latter three are lung cancer positive.

Let’s explore in detail the probabilities of the sigmoid function that lead to this prediction! Simply run the following code snippet after the above definition:

for i in range(20):
    print("x=" + str(i) + " --> " + str(model.predict_proba([[i]])))

    
'''
x=0 --> [[0.67240789 0.32759211]]
x=1 --> [[0.65961501 0.34038499]]
x=2 --> [[0.64658514 0.35341486]]
x=3 --> [[0.63333374 0.36666626]]
x=4 --> [[0.61987758 0.38012242]]
x=5 --> [[0.60623463 0.39376537]]
x=6 --> [[0.59242397 0.40757603]]
x=7 --> [[0.57846573 0.42153427]]
x=8 --> [[0.56438097 0.43561903]]
x=9 --> [[0.55019154 0.44980846]]
x=10 --> [[0.53591997 0.46408003]]
x=11 --> [[0.52158933 0.47841067]]
x=12 --> [[0.50722306 0.49277694]]
x=13 --> [[0.49284485 0.50715515]]
x=14 --> [[0.47847846 0.52152154]]
x=15 --> [[0.46414759 0.53585241]]
x=16 --> [[0.44987569 0.55012431]]
x=17 --> [[0.43568582 0.56431418]]
x=18 --> [[0.42160051 0.57839949]]
x=19 --> [[0.40764163 0.59235837]]
'''

The code prints, for each value of x (the number of cigarettes), the probabilities of “lung cancer negative” (first column) and “lung cancer positive” (second column). If the probability of the former is higher than the probability of the latter, the predicted outcome is “lung cancer negative”. This happens for the last time at x=12. When a patient smokes more than 12 cigarettes, the algorithm classifies them as “lung cancer positive”.
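You can also read this threshold directly off the trained model: coef_ and intercept_ hold the learned sigmoid parameters, and the predicted probability is exactly 0.5 where their weighted sum is zero. The exact boundary value depends on the fit, but it lies between 12 and 13 for this data:

## Decision boundary: the x where coef_ * x + intercept_ == 0, i.e. probability 0.5
boundary = -model.intercept_[0] / model.coef_[0][0]
print(boundary)
# somewhere between 12 and 13 for this training data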

LogisticRegression Methods

In the previous example, you’ve created a LogisticRegression object using the following constructor:

sklearn.linear_model.LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)

In most cases, you don’t need to define all arguments, or even know them by heart. Just start from the most basic example usage and customize as you go. The LogisticRegression class has many more helper methods. You can check them out here (source):

Name                            Description
decision_function(X)            Predict confidence scores for samples.
densify()                       Convert coefficient matrix to dense array format.
fit(X, y[, sample_weight])      Fit the model according to the given training data.
get_params([deep])              Get parameters for this estimator.
predict(X)                      Predict class labels for samples in X.
predict_log_proba(X)            Predict logarithm of probability estimates.
predict_proba(X)                Probability estimates.
score(X, y[, sample_weight])    Return the mean accuracy on the given test data and labels.
set_params(**params)            Set the parameters of this estimator.
sparsify()                      Convert coefficient matrix to sparse format.
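As a quick sketch of two of these helpers on the model trained above (the exact numbers may vary slightly across scikit-learn versions):

## Mean accuracy on the training data itself
print(model.score(X[:,0].reshape(-1,1), X[:,1]))

## Signed confidence scores: negative leans towards 'No', positive towards 'Yes'
print(model.decision_function([[2], [13], [90]]))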

Conclusion

Logistic regression is a classification algorithm (despite its name). This article shows you everything you need to know to get started with logistic regression. It also provides an easy way to implement logistic regression in a single line of Python code using the scikit-learn library.

If you feel stuck in Python and want to reach the next level in your coding, feel free to join my 100% free Python email course with lots of cheat sheets, Python lessons, code contests, and fun!

This tutorial is loosely based on my Python One-Liners book chapter. Check it out!

Python One-Liners Book: Master the Single Line First!

Python programmers will improve their computer science skills with these useful one-liners.

Python One-Liners

Python One-Liners will teach you how to read and write “one-liners”: concise statements of useful functionality packed into a single line of code. You’ll learn how to systematically unpack and understand any line of Python code, and write eloquent, powerfully compressed Python like an expert.

The book’s five chapters cover (1) tips and tricks, (2) regular expressions, (3) machine learning, (4) core data science topics, and (5) useful algorithms.

Detailed explanations of one-liners introduce key computer science concepts and boost your coding and analytical skills. You’ll learn about advanced Python features such as list comprehension, slicing, lambda functions, regular expressions, map and reduce functions, and slice assignments.

You’ll also learn how to:

  • Leverage data structures to solve real-world problems, like using Boolean indexing to find cities with above-average pollution
  • Use NumPy basics such as array, shape, axis, type, broadcasting, advanced indexing, slicing, sorting, searching, aggregating, and statistics
  • Calculate basic statistics of multidimensional data arrays and the K-Means algorithm for unsupervised learning
  • Create more advanced regular expressions using grouping and named groups, negative lookaheads, escaped characters, whitespaces, character sets (and negative character sets), and greedy/nongreedy operators
  • Understand a wide range of computer science topics, including anagrams, palindromes, supersets, permutations, factorials, prime numbers, Fibonacci numbers, obfuscation, searching, and algorithmic sorting

By the end of the book, you’ll know how to write Python at its most refined, and create concise, beautiful pieces of “Python art” in merely a single line.

Get your Python One-Liners on Amazon!!