Logistic regression is a popular algorithm for classification problems (despite its name indicating that it is a “regression” algorithm). It is one of the most important algorithms in the machine learning space.
Before turning to logistic regression, recall linear regression, which fits a line to the training data. This line can be used for many things, e.g., to predict the outcome for unseen input data x. In general, linear regression is great for predicting a continuous output value y, given a continuous input value x. A continuous value can take an infinite number of values. For example, we could predict the stock price (output y), given the number of social media posts mentioning the company behind the stock (input x). The stock price is continuous as it can take on any value, such as $123.45, $121.897, or $10,198.87.
But what if the output is not continuous but categorical? For example, let’s say you want to predict the likelihood of lung cancer, given the number of cigarettes a patient smokes. Each patient can either have lung cancer or not. In contrast to the previous example, there are only these two possible outcomes.
Predicting the likelihood of categorical outcomes is the main motivation for logistic regression.
While linear regression fits a line to the training data, logistic regression fits an S-shaped curve, called the “sigmoid function”. Why? Because the line helps you generate a new output value for each input. The S-shaped curve, on the other hand, helps you make binary decisions (e.g. yes/no). For most input values, the sigmoid function returns a value that is either very close to 0 or very close to 1. It is relatively unlikely that a given input value generates a value somewhere in between.
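To make the S-shape concrete, here is a minimal sketch of the standard sigmoid function 1 / (1 + e^(-z)). The function name `sigmoid` and the sample inputs are chosen for illustration only; they are not taken from any particular library.

```python
import math

def sigmoid(z):
    """Standard logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Inputs far from zero land close to 0 or 1; only inputs near zero fall in between.
for z in [-10, -5, -1, 0, 1, 5, 10]:
    print(f"sigmoid({z:>3}) = {sigmoid(z):.4f}")
```

Running it shows how quickly the curve saturates once the input moves away from zero.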
Here is a graphical example of such a scenario:
The sigmoid function approximates the probability that a patient has lung cancer, given the number of cigarettes they smoke. This probability helps you make a robust decision on the subject: Does the patient have lung cancer?
Have a look at the following example:
There are two new patients (in yellow). Let’s pretend we know nothing about them but the number of cigarettes they smoke. We have already trained our logistic regression model (the sigmoid function) that returns a probability value for any new input value x. Now, we can use the respective probabilities of our two inputs to make a prediction about whether the new patients have lung cancer or not.
If the probability given by the sigmoid function is higher than 50%, the model predicts “lung cancer positive”, otherwise, it predicts “lung cancer negative”.
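The 50% rule can be sketched in a few lines. The parameters `WEIGHT` and `BIAS` below are hypothetical, hand-picked values used only to illustrate the rule; a trained model would learn them from data.

```python
import math

# Hypothetical, hand-picked parameters for illustration; a real model learns them from data.
WEIGHT = 0.1   # assumed steepness of the sigmoid per cigarette
BIAS = -3.0    # assumed offset; places the 50% point at 30 cigarettes

def cancer_probability(cigarettes):
    """Probability of 'lung cancer positive' under the assumed sigmoid."""
    return 1.0 / (1.0 + math.exp(-(WEIGHT * cigarettes + BIAS)))

def predict(cigarettes, threshold=0.5):
    """Apply the 50% rule: positive above the threshold, negative otherwise."""
    if cancer_probability(cigarettes) > threshold:
        return "lung cancer positive"
    return "lung cancer negative"

for n in [5, 20, 40, 60]:
    print(n, round(cancer_probability(n), 3), predict(n))
```

Keeping the threshold as a parameter makes it easy to experiment with a stricter or looser cut-off than 50%.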
So how do you select the sigmoid function that best fits the training data?
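This is the main question for logistic regression. The answer is likelihood: you choose the sigmoid function that would generate the observed training data with the highest probability.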
To calculate the likelihood for a given set of training data, you calculate the likelihood for a single training data point and repeat this procedure for all training data points. Finally, you multiply those values to get the likelihood for the whole set of training data.
Now, you repeat this likelihood computation for different sigmoid functions (shifting the sigmoid function a little bit each time). From all computations, you take the sigmoid function that has “maximum likelihood”, that is, the one that would produce the training data with maximal probability.
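The likelihood computation described above can be sketched as follows. This is a simplified illustration, not how a library fits the model: the toy data, the three candidate sigmoid functions, and the names `likelihood` and `candidates` are all assumptions made for the example. Real implementations search the parameter space with a numerical optimizer instead of trying a handful of candidates.

```python
import math

# Toy training data: (cigarettes, has_cancer) with 1 = yes, 0 = no.
data = [(0, 0), (10, 0), (60, 1), (90, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def likelihood(weight, bias):
    """Probability of observing exactly these labels under one candidate sigmoid."""
    result = 1.0
    for x, y in data:
        p = sigmoid(weight * x + bias)      # model's probability of "yes"
        result *= p if y == 1 else 1.0 - p  # probability of the observed label
    return result

# Hypothetical candidates: the same curve shape, shifted so that the
# 50% point sits at 5, 35, or 75 cigarettes.
candidates = [(0.1, -0.5), (0.1, -3.5), (0.1, -7.5)]

for weight, bias in candidates:
    print(f"weight={weight}, bias={bias}: likelihood={likelihood(weight, bias):.4f}")

best = max(candidates, key=lambda wb: likelihood(*wb))
print("maximum likelihood candidate:", best)
```

The candidate whose 50% point lies between the negative and the positive patients explains the data best, so it wins.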
Let’s program your first virtual doc using logistic regression – in a single line of Python code!
    from sklearn.linear_model import LogisticRegression
    import numpy as np

    # Data (#cigarettes, cancer)
    X = np.array([[0, "No"],
                  [10, "No"],
                  [60, "Yes"],
                  [90, "Yes"]])

    # One-liner (mixed data makes X a string array, so the feature column is cast to float)
    model = LogisticRegression().fit(X[:,0].reshape(-1,1).astype(float), X[:,1])

    # Result & puzzle
    print(model.predict([[2],[12],[13],[40],[90]]))
What is the output of this code snippet? Take a guess!
The labeled training data set X consists of four patient records (rows) with two columns. The first column holds the number of cigarettes the patients smoke (the feature), and the second column holds whether they ultimately suffered from lung cancer (the label). Hence, there is a continuous input variable and a categorical output variable. It’s a classification problem!
We build the model by calling the LogisticRegression() constructor with no parameters. On this model, we call the fit function, which takes two arguments: the input values and the output classifications (labels). The input values are expected to come as a two-dimensional array where each row holds the feature values of one record. In our case, we only have a single feature value, so we transform our input into a column vector using the reshape operation. The reshape operation generates a two-dimensional NumPy array. The first reshape argument specifies the number of rows, the second specifies the number of columns. We only care about the number of columns, which is one. NumPy determines the number of rows automatically when using the “dummy” parameter -1.
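To see how the -1 placeholder behaves, here is a small standalone sketch. The values in `cigarettes` are hypothetical and unrelated to the training data; only the resulting shapes matter.

```python
import numpy as np

# Hypothetical measurements: one value per patient, stored as a flat array.
cigarettes = np.array([3, 25, 40, 7, 55, 18])

as_column = cigarettes.reshape(-1, 1)   # one column, row count inferred
as_row = cigarettes.reshape(1, -1)      # one row, column count inferred
as_pairs = cigarettes.reshape(-1, 2)    # two columns, row count inferred

print(cigarettes.shape)   # (6,)
print(as_column.shape)    # (6, 1)
print(as_row.shape)       # (1, 6)
print(as_pairs.shape)     # (3, 2)
```

In each call, NumPy fills in the dimension marked with -1 so that the total number of elements stays the same.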
Here is what the input training data (without labels) looks like after converting it using the reshape operation:
    [[0], [10], [60], [90]]
Next, we predict whether a patient has lung cancer, given the number of cigarettes they smoke: 2, 12, 13, 40, 90 cigarettes.
Here is the output:
    # Result & puzzle
    print(model.predict([[2],[12],[13],[40],[90]]))
    # ['No' 'No' 'Yes' 'Yes' 'Yes']
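The model predicts that the first two patients are lung cancer negative, while the latter three are lung cancer positive. Keep in mind that the exact predictions and probabilities can vary with your scikit-learn version, because the default solver and its settings have changed over time.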
Let’s explore in detail the probabilities of the sigmoid function that lead to this prediction! Simply run the following code snippet after the above definition:
    for i in range(20):
        print("x=" + str(i) + " --> " + str(model.predict_proba([[i]])))

    '''
    x=0 --> [[0.67240789 0.32759211]]
    x=1 --> [[0.65961501 0.34038499]]
    x=2 --> [[0.64658514 0.35341486]]
    x=3 --> [[0.63333374 0.36666626]]
    x=4 --> [[0.61987758 0.38012242]]
    x=5 --> [[0.60623463 0.39376537]]
    x=6 --> [[0.59242397 0.40757603]]
    x=7 --> [[0.57846573 0.42153427]]
    x=8 --> [[0.56438097 0.43561903]]
    x=9 --> [[0.55019154 0.44980846]]
    x=10 --> [[0.53591997 0.46408003]]
    x=11 --> [[0.52158933 0.47841067]]
    x=12 --> [[0.50722306 0.49277694]]
    x=13 --> [[0.49284485 0.50715515]]
    x=14 --> [[0.47847846 0.52152154]]
    x=15 --> [[0.46414759 0.53585241]]
    x=16 --> [[0.44987569 0.55012431]]
    x=17 --> [[0.43568582 0.56431418]]
    x=18 --> [[0.42160051 0.57839949]]
    x=19 --> [[0.40764163 0.59235837]]
    '''
For each value of x (the number of cigarettes), the code prints two probabilities: first the probability of lung cancer negative (“No”), then the probability of lung cancer positive (“Yes”). If the first probability is higher than the second, the predicted outcome is “lung cancer negative”. This happens for the last time at x=12. For more than 12 cigarettes, the algorithm classifies a patient as “lung cancer positive”.
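If you are unsure which column of predict_proba belongs to which label, the fitted model’s classes_ attribute lists the labels in column order. The sketch below retrains on the same four patients but, as an assumption made for clarity, keeps features and labels in separate arrays (the names `cigarettes` and `labels` are illustrative).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same four patients as above, with features and labels kept in separate arrays.
cigarettes = np.array([[0.0], [10.0], [60.0], [90.0]])
labels = np.array(["No", "No", "Yes", "Yes"])

model = LogisticRegression().fit(cigarettes, labels)

# classes_ lists the labels in the column order used by predict_proba.
print(model.classes_)

probabilities = model.predict_proba([[0], [90]])
for x, row in zip([0, 90], probabilities):
    by_label = dict(zip(model.classes_, row))          # label -> probability
    print(x, by_label, "->", model.classes_[row.argmax()])
```

Pairing each probability with its label this way avoids relying on the column order by memory.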
Logistic regression is a classification algorithm (despite its name). This article shows you everything you need to know to get started with logistic regression. It provides you with an easy way to implement logistic regression in a single line of Python code using the scikit-learn library.