Regression is a supervised learning technique that models the relationship between a dependent variable and one or more independent variables. Regression models describe this relationship with a fitted line: in linear regression models this is a straight line, while logistic and nonlinear regression models use a curved line.
Simple Linear Regression is a predictive analysis technique to estimate the relationship between quantitative variables. You can use simple linear regression in the following scenarios:
- To determine the strength of the relationship between two variables.
- To estimate the value of a dependent variable corresponding to a certain value of one or more independent variables.
A very popular illustration from econometrics that uses simple linear regression is the relationship between consumption and income: when income increases, consumption grows, and vice versa. The independent variable (income) and the dependent variable (consumption) are both quantitative, so you can perform a regression analysis to find out whether there is a linear relationship between them.
Before we dive in, let us understand some of the major concepts necessary to deal with regression analysis.
❂ Quantitative Variables: Data that represent amounts or numerical values are known as quantitative data. A variable that contains quantitative data is known as a quantitative variable. There are two kinds of quantitative variables: (i) discrete and (ii) continuous.
❂ Categorical Variable: These are variables that represent a classification or grouping of some kind. Categorical data can be of three types: (i) binary, (ii) nominal, (iii) ordinal.
❂ Dependent Variable: Variable containing data that is dependent on another variable. You cannot control the data in a dependent variable directly.
❂ Independent Variable: Variable containing data that is not dependent on other variables for its existence. You can control the data in an independent variable directly.
❂ Model: A data model is a transformation engine used to express dependent variables as a function of independent variables.
Mathematical Representation Of Linear Regression
👨🎓 Can you recall the high school lesson on geometry? Do you remember the equation of a line?

y = mx + c

Now, linear regression is just an exemplification of this equation. Here,
- y denotes the variable that needs to be predicted. Hence, it is the dependent variable.
- The value of y is dependent on the value of x. Thus, x is the input and the independent variable.
- m denotes the slope and gives the angle of the line. Hence, it is the parameter.
- c denotes the intercept. Thus, it is the constant that determines what shall be the value of y when x is 0.
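To make the roles of m and c concrete, here is a minimal sketch that evaluates the line y = mx + c for a few inputs; the slope and intercept values are arbitrary choices for illustration:

```python
# Evaluate the straight line y = m*x + c for a few sample inputs
m = 2.0  # slope (an arbitrary value chosen for illustration)
c = 3.0  # intercept (also arbitrary)

for x in [0, 1, 2]:
    y = m * x + c
    print(f"x = {x} -> y = {y}")  # y grows by m for each unit step in x
```

Notice that at x = 0 the output equals the intercept c, and each unit increase in x raises y by the slope m.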
Now let us have a look at the mathematical equation that represents simple linear regression:

y = β0 + β1x + ε

β0 ➝ Intercept of the regression line.
β1 ➝ Slope of the regression line.
ε ➝ The error term.
Note: A linear regression model is not always perfect. It approximates the relationship between the dependent and independent variables, and approximation often leads to errors. Some errors can be reduced, while others are inherent to the problem and cannot be eliminated. The error which cannot be eliminated is known as irreducible error.
Implementing Simple Linear Regression In Python
Let us have a look at an example to visualize how to implement simple linear regression in Python. The data-set that will be used in our example is mentioned below.
❂ The Problem Statement: The dataset used in our example has been shown above, such that:
- Salary represents the Dependent Variable.
- Years of experience represents the Independent Variable.
- Find a correlation between Salary and Years of Experience; that is, observe how the dependent variable changes as the independent variable changes.
- Find the best fit line.
Note: The line of best fit is the line through a scatter plot of data points that best expresses the relationship between those points. (refer: Line Of Best Fit)
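The idea of a best-fit line can be sketched with NumPy's polyfit, which computes the least-squares line through a set of points. The experience/salary numbers below are made up for illustration and chosen to lie exactly on a line, so the fitted slope and intercept are easy to verify:

```python
import numpy as np

# Small made-up experience/salary data for illustration
years = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
salary = np.array([40000, 45000, 50000, 55000, 60000])

# A degree-1 polynomial fit returns the slope and intercept of the best-fit line
slope, intercept = np.polyfit(years, salary, 1)
print(slope, intercept)  # the data follows salary = 5000*years + 35000
```

With noisy real-world data the points will scatter around the fitted line instead of lying exactly on it, which is where the error term ε comes in.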
Let us dive into the steps involved in implementing the simple linear regression.
📢 Step 1: Preprocessing the Data
The first and foremost step is data pre-processing. We have already discussed and learned about data preprocessing; if you wish to master the concepts of data pre-processing please refer to the article at this link. Let us quickly go through the steps required to pre-process our data:
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

dataset = pd.read_csv('Data.csv')
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
```
Note: Please refer to the data pre-processing tutorial to understand the concept behind each snippet mentioned above.
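If you do not have Data.csv at hand, you can still follow along with a hypothetical stand-in DataFrame; the column names and values below are assumptions made up to match the problem statement (years of experience vs. salary), not the tutorial's actual file:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for Data.csv (column names and values are assumptions)
dataset = pd.DataFrame({
    'YearsExperience': [1.1, 2.0, 3.2, 4.5, 5.1, 6.8, 7.9, 9.0, 10.3, 11.2],
    'Salary': [39343, 43525, 54445, 61111, 66029, 91738, 101302, 105582, 122391, 127345],
})

x = dataset.iloc[:, :-1].values   # 2-D array holding the independent variable
y = dataset.iloc[:, -1].values    # 1-D array holding the dependent variable

# Same 80/20 split and random seed as the tutorial's snippet
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
print(x_train.shape, x_test.shape)  # (8, 1) (2, 1)
```

Note that x stays two-dimensional (one column) because scikit-learn estimators expect a matrix of features, while y is a plain vector.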
📢 Step 2: Training The Simple Linear Regression Model Using Training Set
After completing the data pre-processing, you have to train the model using the training set as shown below.
```python
from sklearn.linear_model import LinearRegression

regression_obj = LinearRegression()
regression_obj.fit(x_train, y_train)
```

- Import the LinearRegression class from the linear_model module of the scikit-learn library.
- Create a LinearRegression object.
- Use the fit() method to fit the simple linear regression model to the training set so that the model is able to learn and identify the correlations between the variables. To do that, pass x_train and y_train (which represent the independent and dependent variables of the training set) to the fit() method.
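After fitting, the learned slope (β1) and intercept (β0) are available as the coef_ and intercept_ attributes of the fitted estimator. The training data below is a tiny synthetic stand-in, generated exactly as salary = 5000 × years + 30000, so we know what parameters to expect:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny synthetic training set: salaries follow 5000 * years + 30000 exactly
x_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([35000.0, 40000.0, 45000.0, 50000.0])

regression_obj = LinearRegression()
regression_obj.fit(x_train, y_train)

# coef_ holds the slope(s) β1, intercept_ holds β0
print(regression_obj.coef_[0])    # learned slope, approximately 5000
print(regression_obj.intercept_)  # learned intercept, approximately 30000
```

Inspecting these attributes is a quick sanity check that training worked before moving on to predictions.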
📢 Step 3: Predicting Test Results
After undergoing the training phase, our model is now ready to predict outputs based on new observations. Therefore, you now have to feed the test dataset to the model and check whether it is capable of predicting correct outputs. Let us have a look at the code given below to understand how we can check how well our model predicts outputs.
```python
y_predicted = regression_obj.predict(x_test)
```

y_predicted contains the predicted outputs for x_test (the test dataset). The predict() method returns the predicted outputs (labels) for the inputs it is given.
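Here is a self-contained sketch of the prediction step. The model is trained on perfectly linear toy data (salary = 5000 × years + 30000, an assumption made up for this example) so the predictions for unseen inputs are easy to verify by hand:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Train on perfectly linear toy data so the predictions are easy to check
x_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([35000.0, 40000.0, 45000.0, 50000.0])
regression_obj = LinearRegression().fit(x_train, y_train)

# Predict salaries for two unseen experience values
x_test = np.array([[5.0], [6.0]])
y_predicted = regression_obj.predict(x_test)

for years, salary in zip(x_test.ravel(), y_predicted):
    print(f"{years} years -> predicted salary {salary:.0f}")
```

Because the training data lies exactly on a line, the predictions land on the same line (55,000 and 60,000); with real data there would be some deviation.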
📢 Step 4: Plotting and Visualizing The Training Set Results
It is time for you to visualize the results produced by the model based on inputs from the training set. This can be done with the help of the pyplot module. But before we dive into the code, let us discuss the concepts required to execute it.
✨ What is a Scatter Plot?
In simple and plain terms, you can visualize a scatter plot as a diagram wherein the values of the dataset are represented by dots. The method used to draw a scatter plot is known as scatter(). We can also set the color of the dots with the help of the color attribute within the scatter() function. In our case, we will pass the values of the training set, i.e., x_train (years of experience) and y_train (the set of salaries), to the scatter() function.
The following diagram represents a scatter plot:
You can dive deeper into scatter plots in our blog tutorial here 📈.
The plot() function allows us to draw points/markers in a diagram and, by default, it draws a line from one point to another. We will use this function to draw our regression line by passing x_train (years of experience), the predicted salaries for the training set, and the color of the line.

The xlabel() and ylabel() functions are used to label the x-axis (Years of Experience) and the y-axis (Salary) of the scatter plot, while the title() method allows us to set the title of the plot. Finally, show() displays the figure and helps you visualize the output.
Now let us have a look at the code that demonstrates the above explanation:
```python
plt.scatter(x_train, y_train, color='red')
plt.plot(x_train, regression_obj.predict(x_train), color='green')
plt.title('Salary vs Experience for Training set')
plt.xlabel('Experience (in Years)')
plt.ylabel('Salary')
plt.show()
```
📢 Step 5: Plotting and Visualizing The Test Set Results
Previously, we checked and visualized the performance of our model on the training set. Now it is time to visualize the output for the test set. Everything explained in Step 4 also applies to this step, except that instead of x_train and y_train we pass x_test and y_test to scatter().

(Note: The colors used in this case are different, but this is optional.)
```python
# Visualizing the Test Set Results
plt.scatter(x_test, y_test, color='red')
plt.plot(x_train, regression_obj.predict(x_train), color='blue')
plt.title('Salary vs Experience for Test set')
plt.xlabel('Experience (in Years)')
plt.ylabel('Salary')
plt.show()
```
As seen in the above graph, the observations are mostly close to the regression line. Therefore, we can conclude that our simple linear regression model performs well and is able to make good predictions.
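Beyond eyeballing the graph, you can quantify how well the line fits the test set with the R² score from sklearn.metrics (a value close to 1 indicates a good fit). This metric is not part of the tutorial's original code, and the slightly noisy toy data below is made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Made-up, slightly noisy experience/salary data
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([40000, 44500, 50500, 54800, 60200, 64900])

# Simple manual split: first four points for training, last two for testing
x_train, y_train = x[:4], y[:4]
x_test, y_test = x[4:], y[4:]

model = LinearRegression().fit(x_train, y_train)
y_pred = model.predict(x_test)

print(r2_score(y_test, y_pred))  # close to 1.0 for a good fit
```

An R² near 1 on held-out data confirms numerically what the scatter plot shows visually: the predictions track the true values closely.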
💡 That brings us to the end of this tutorial on Simple Linear Regression. Please subscribe and stay tuned for the next lesson on the Machine Learning series.
Where to Go From Here?
Enough theory, let’s get some practice!
To become successful in coding, you need to get out there and solve real problems for real people. That’s how you can become a six-figure earner easily. And that’s how you polish the skills you really need in practice. After all, what’s the use of learning theory that nobody ever needs?
Practice projects are how you sharpen your saw in coding!
Do you want to become a code master by focusing on practical code projects that actually earn you money and solve problems for people?
Then become a Python freelance developer! It’s the best way of approaching the task of improving your Python skills—even if you are a complete beginner.
Join my free webinar “How to Build Your High-Income Skill Python” and watch how I grew my coding business online and how you can, too—from the comfort of your own home.