[Fixed] Unknown label type: 'continuous' in sklearn LogisticRegression

Summary: Use SKLearn’s LogisticRegression Model for classification problems only. The Y variable must be a category (e.g., binary [0,1]), not continuous (e.g., float numbers 3.4, 7.9). If the Y variable is non-categorical (i.e., continuous), the potential fixes are as follows.

Re-examine the data. Try to encode the continuous Y variable into categories (e.g., use SKLearn’s LabelEncoder preprocessor).
Re-examine the model. Try another model, such as a regressor (e.g., Linear Regression), if that makes more sense for the data.

Note: All the solutions provided below have been verified using Python 3.9.0b5.

Problem Formulation

When using scikit-learn’s LogisticRegression classifier, how does one fix the following error?

$ python lr1.py
Traceback (most recent call last):
  File ".../SKLearnLogicReg/lr1.py", line 14, in <module>
    clf.fit(trainingData, trainingScores)
  File ".../lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1347, in fit
    check_classification_targets(y)
  File ".../lib/python3.9/site-packages/sklearn/utils/multiclass.py", line 183, in check_classification_targets
    raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'continuous'

Background

Machine Learning is one of the hottest topics of our age. Various organizations use Machine Learning models to perform complex operations on data, such as…

Data Analysis
Data Classification
Data Prediction
Data Extrapolation

Python’s scikit-learn library is an open-source Machine Learning library. It supports supervised and unsupervised learning. The scikit-learn library provides excellent tools for model fitting, selection, and evaluation. It also provides many helpful utilities for data preprocessing and analysis.

One has to be careful about choosing the Machine Learning model. One also has to be careful when examining the data, and ask what one is attempting to learn from it. This blog discusses Logistic Regression, but the nature of the error is more general. It urges the reader to go back to basics and answer the following…

What do we want to learn from the data? What are we looking for in it?
Is this the right machine learning model we should use?
Are we feeding the data to the model in a proper manner?
Is the data in the correct format to use with the model?
Are you taking enough mental breaks?
Are you pumping the blood in your body? That is—stretch, walk, run, exercise?
Are you nourishing your body? Eating vegetables, fruits, good quality coffee?

Wow!! You Talk Too Much!! Can You Just Tell Me The Darned Answer?

The straightforward way to fix the error is to take a break and go for a walk and eat a fruit.

While this error is frustrating, it is also common among people new to machine learning. It stems from the single fact that sklearn’s LogisticRegression class is a “classifier”. That is, use scikit-learn’s LogisticRegression for classification problems only. This means that while the X variables can be floats etc., the Y variable has to be a “category”. Category, meaning [0,1], or [yes, no], [true, false], [Apples, Oranges, Pears], and so on. The Y variable cannot be a continuous value such as a float with a fractional part (3.5, 7.9, 89.6, etc.).
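
One way to check a target vector before calling fit() is scikit-learn’s type_of_target() helper, which lives in the same sklearn.utils.multiclass module that appears in the traceback above. The following is a minimal sketch; the variable names are made up for illustration only.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Hypothetical target vectors, used only to illustrate the check.
float_targets = np.array([4.2, 6.8, 3.4, 1.9])    # floats with fractional parts
int_targets = np.array([4, 7, 3, 2])              # whole numbers
label_targets = np.array(["No", "No", "Yes", "Yes", "Yes"])  # two string labels

for name, y in [("float_targets", float_targets),
                ("int_targets", int_targets),
                ("label_targets", label_targets)]:
    # 'continuous' means a classifier such as LogisticRegression rejects y.
    print(name, "-->", type_of_target(y))
```

With these inputs, the helper reports continuous, multiclass, and binary respectively. Only the first one triggers the error discussed here.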

Let’s see how this works with some simple naive data. The data we use in the example below has no meaning other than to illustrate the problem.

For this first example we use floats as the target vector (i.e., y_variables). This will cause an error in the fit() method of LogisticRegression.

$ python
Python 3.9.0b5 (default, Oct 19 2020, 11:11:59) 
>>>
>>> ## Import the needed libraries and Modules.
>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> 
>>> ## Define some training data. We will call this the X-Variable.
>>> x_variables = np.array([[5.7, 2.5, 7.7],
...                         [8.4, 0.6, 3.6],
...                         [5.3, 4.5, 2.7],
...                         [5.1, 2.4, 6.3]])
>>> 
>>> ## Define the target vector. We will call this the Y-Variable.
>>> ## Note that the values are floats. This will cause the error!!
>>> y_variables = np.array([4.2, 6.8, 3.4, 1.9])
>>> 
>>> ## Define another set of target vectors. Note how these are ints.
>>> ## They are simply rounded versions of the above float numbers.
>>> ## y_variables = np.array([4, 7, 3, 2])
>>> 
>>> ## Define some new, yet unknown data. We will call this the U-Variable.
>>> u_variables  = np.array([[4.8, 6.4, 3.2],
...                          [5.3, 2.3, 7.4]])
>>> 
>>> ## Instantiate the Logistic Regression Machine Learning Model.
>>> lr = LogisticRegression()
>>> 
>>> ## Fit the Model to the Data.  i.e. Make the Model Learn.
>>> lr.fit(x_variables, y_variables)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/gsrao/.virtualenvs/Upwork25383745/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1347, in fit
    check_classification_targets(y)
  File "/Users/gsrao/.virtualenvs/Upwork25383745/lib/python3.9/site-packages/sklearn/utils/multiclass.py", line 183, in check_classification_targets
    raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'continuous'

For this next example, we use integers as the target vector (i.e., y_variables). Just a simple change!! Everything else is the same. The code goes to completion!!

>>> ## Import the needed libraries and Modules.
>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> 
>>> ## Define some training data. We will call this the X-Variable.
>>> x_variables = np.array([[5.7, 2.5, 7.7],
...                         [8.4, 0.6, 3.6],
...                         [5.3, 4.5, 2.7],
...                         [5.1, 2.4, 6.3]])
>>> 
>>> ## Define the target vector. We will call this the Y-Variable.
>>> ## Note that the values are floats. They would cause the error,
>>> ## but they are overwritten by the integer targets defined below.
>>> y_variables = np.array([4.2, 6.8, 3.4, 1.9])
>>> 
>>> ## Define another set of target vectors. Note how these are ints.
>>> ## They are simply rounded versions of the above float numbers.
>>> y_variables = np.array([4, 7, 3, 2])
>>> 
>>> ## Define some new, yet unknown data. We will call this the U-Variable.
>>> u_variables  = np.array([[4.8, 6.4, 3.2],
...                          [5.3, 2.3, 7.4]])
>>> 
>>> ## Instantiate the Logistic Regression Machine Learning Model.
>>> lr = LogisticRegression()
>>> 
>>> ## Fit the Model to the Data.  i.e. Make the Model Learn.
>>> lr.fit(x_variables, y_variables)
LogisticRegression()
>>> 
>>> ## Finally Predict the outcome for the Unknown Data!!
>>> print("This is the Prediction for the Unknown Data in u_variables!!")
This is the Prediction for the Unknown Data in u_variables!!
>>> print(lr.predict(u_variables))
[3 4]
>>>

This illustrates the point that was made earlier, “Use LogisticRegression for classification problems *only*”!! The target vector has to be categorical, *not* continuous!!
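
If classification really is the goal, the float targets have to be turned into categories first. Below is a minimal sketch, assuming a made-up cut-off of 4.0 that splits the scores into a “low” class and a “high” class. Note that LabelEncoder on its own would treat every distinct float as a separate class, so this sketch bins the values with np.digitize() instead. The threshold and the meaning of the two classes are assumptions for illustration; they are not part of the original example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same toy data as in the examples above.
x_variables = np.array([[5.7, 2.5, 7.7],
                        [8.4, 0.6, 3.6],
                        [5.3, 4.5, 2.7],
                        [5.1, 2.4, 6.3]])
y_continuous = np.array([4.2, 6.8, 3.4, 1.9])
u_variables = np.array([[4.8, 6.4, 3.2],
                        [5.3, 2.3, 7.4]])

# Assumption: scores below 4.0 become class 0 ("low"), the rest class 1 ("high").
threshold = 4.0
y_classes = np.digitize(y_continuous, bins=[threshold])
print(y_classes)  # [1 1 0 0]

# The binned targets are categorical, so fit() no longer raises the error.
lr = LogisticRegression()
lr.fit(x_variables, y_classes)
print(lr.predict(u_variables))
```

Where to put the cut-off (or how many bins to use) depends entirely on what the categories are supposed to mean for the problem at hand.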

Ah!! I Get It Now!! Anything Else?

The reader needs to re-examine the data to see if it makes sense to use classification models. It is possible that the data is better served with regression or clustering models. One needs to always ask…

What is the question we are asking about the data?
What are we looking for in the data?
What are we attempting to learn from the data?
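
If the answer is “we want to predict the number itself, not a category”, then a regressor accepts the float targets as they are. Below is a minimal sketch using scikit-learn’s LinearRegression on the same toy data as in the failing example (the data still has no meaning beyond illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same toy data as in the failing LogisticRegression example.
x_variables = np.array([[5.7, 2.5, 7.7],
                        [8.4, 0.6, 3.6],
                        [5.3, 4.5, 2.7],
                        [5.1, 2.4, 6.3]])
y_variables = np.array([4.2, 6.8, 3.4, 1.9])   # continuous targets
u_variables = np.array([[4.8, 6.4, 3.2],
                        [5.3, 2.3, 7.4]])

# A regressor expects continuous targets, so no label-type error is raised.
reg = LinearRegression()
reg.fit(x_variables, y_variables)

# The predictions are floats, not class labels.
predictions = reg.predict(u_variables)
print(predictions)
```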

Here is a simple example taken from the “Python One-Liners” book by Dr. Chris Mayer. The example correlates cigarette consumption with lung cancer probability. It illustrates how Logistic Regression works well with categorical data.

>>> ## Import the needed libraries and Modules.
>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> 
>>> ## Define some training data. We will call this the X-Variable.
>>> ## This array contains the number of cigarettes smoked in a day.
>>> x_variables = np.array([[0], [10], [15], [60], [90]])
>>> 
>>> ## Define the target vector. We will call this the Y-Variable.
>>> ## This array contains the outcome i.e. if patient has lung-Cancer.
>>> y_variables = np.array(["No", "No", "Yes", "Yes", "Yes"])
>>> 
>>> ## Define some new, yet unknown data. We will call this the U-Variable.
>>> ## This correlates to the number of cigarettes smoked in a day. Given
>>> ## this new data, the model will try to predict the outcome.
>>> u_variables  = np.array([[2], [12], [13], [40], [90]])
>>> 
>>> ## Instantiate the Logistic Regression Machine Learning Model.
>>> lr = LogisticRegression()
>>> ## Fit the Model to the Data.  i.e. Make the Model Learn.
>>> lr.fit(x_variables, y_variables)
LogisticRegression()
>>> 
>>> ## Finally Predict the outcome for the Unknown Data!!
>>> print("This is the Prediction for the Unknown Data in u_variables!!")
This is the Prediction for the Unknown Data in u_variables!!
>>> print(lr.predict(u_variables))
['No' 'No' 'Yes' 'Yes' 'Yes']
>>> 
>>> ## Based on the Training Data (i.e. x_variables and y_variables),
>>> ## SKLearn decided the change-over from "No" lung-cancer to "Yes"
>>> ## lung-cancer is somewhere around 12 to 13 cigarettes smoked per
>>> ## day. The predict_proba() method shows the probability values 
>>> ## for "No" v/s "Yes" (i.e. target vector Y) for various values of
>>> ## X (i.e. Number of Cigarettes smoked per day).
>>> for i in range(20):
...   print("x=" + str(i) + " --> " + str(lr.predict_proba([[i]])))
... 
x=0 --> [[9.99870972e-01 1.29027714e-04]]
x=1 --> [[9.99735913e-01 2.64086966e-04]]
x=2 --> [[9.99459557e-01 5.40442542e-04]]
x=3 --> [[0.99889433 0.00110567]]
x=4 --> [[0.99773928 0.00226072]]
x=5 --> [[0.99538318 0.00461682]]
x=6 --> [[0.99059474 0.00940526]]
x=7 --> [[0.98093496 0.01906504]]
x=8 --> [[0.96173722 0.03826278]]
x=9 --> [[0.92469221 0.07530779]]
x=10 --> [[0.85710998 0.14289002]]
x=11 --> [[0.74556647 0.25443353]]
x=12 --> [[0.58873015 0.41126985]]
x=13 --> [[0.4115242 0.5884758]]
x=14 --> [[0.25463283 0.74536717]]
x=15 --> [[0.14301871 0.85698129]]
x=16 --> [[0.07538097 0.92461903]]
x=17 --> [[0.03830145 0.96169855]]
x=18 --> [[0.01908469 0.98091531]]
x=19 --> [[0.00941505 0.99058495]]
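
Which column of predict_proba() belongs to which label can be read from the fitted model’s classes_ attribute, since the columns follow the same order. The sketch below refits the same cigarette data and pairs each probability with its label. The names probabilities and labelled are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same cigarette data as in the example above.
x_variables = np.array([[0], [10], [15], [60], [90]])
y_variables = np.array(["No", "No", "Yes", "Yes", "Yes"])

lr = LogisticRegression()
lr.fit(x_variables, y_variables)

# classes_ holds the labels in sorted order; predict_proba() columns match it.
print(lr.classes_)

# Pair each probability with its label for two values near the change-over.
probabilities = lr.predict_proba([[12], [13]])
for x, row in zip([12, 13], probabilities):
    labelled = dict(zip(lr.classes_, row))
    print("x=" + str(x) + " --> " + str(labelled))
```

Here the first column is the probability of "No" and the second the probability of "Yes", and each row sums to 1.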

Conclusion

So there you have it!! To Recap…

Use SKLearn’s LogisticRegression Model for Classification problems *only*, i.e., the Y variable is a category (e.g. binary [0,1]), *not continuous* (e.g. float numbers 3.4, 7.9).

If the Y variable is non-categorical (i.e. continuous), the potential fixes are as follows.

Re-examine the Data. Maybe encode the continuous Y variable into categories (e.g. use SKLearn’s LabelEncoder preprocessor).
Re-examine the Model. Maybe another model such as a regressor makes sense (e.g. Linear Regression).

Finxter Academy

This blog was brought to you by Girish Rao, a student of Finxter Academy.

Reference

All research for this blog article was done using the Python documentation, the Google Search Engine, and the shared knowledge-base of the Finxter Academy, scikit-learn, and Stack Overflow communities.

The Lung-Cancer Example was adapted from “Python One-Liners” by Dr. Chris Mayer.