What is LARS regression?
Regression analysis studies how one variable (the outcome, or dependent, variable) depends on the behaviour of other variables (the explanatory variables).
In regression, we are looking for the function that can be used to predict the value of a variable Y from the known value of another variable X.
In general, regression calculations rest on the assumption that a causal and statistical relationship can be assumed or deduced between certain variables. To describe the causal relationship, we look for a functional relationship between the variables, i.e., we treat the effect as the dependent variable and the influencing variables as independent variables.
Linear regression is a parametric regression model that assumes a linear relationship between the explanatory (X) and the explained (Y) variables (in terms of parameters). This means that in estimating linear regression, we try to fit a line to the point cloud of the sample data.
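As a quick illustration of fitting a line to a point cloud, the following sketch generates hypothetical data with a known slope and intercept and recovers them by ordinary least squares (the data and parameter values are made up for the example):

```python
import numpy as np

# Hypothetical data: y depends linearly on x, plus a little noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)

# Fit a degree-1 polynomial (a line) by least squares
slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

The fitted slope and intercept come out close to the true values (2.0 and 1.0) used to generate the data.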
Least angle regression (LARS) is a relatively new technique that is a variant of forward regression.
It starts with all coefficients at zero and finds the predictor (x1) that correlates best with the response.
The algorithm moves towards this predictor until another predictor, (x2), shows the same degree of correlation with the current residual. LARS then moves in the direction equiangular between the two predictors, until a third variable, (x3), again shows the same degree of correlation with the current residual. LARS then proceeds at equal angles between x1, x2, and x3 (i.e., in the direction of least angle) until the next variable enters, and so on.
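The step-by-step entry of predictors described above can be observed with scikit-learn's lars_path function. Here is a minimal sketch on synthetic data (the coefficients 3.0 and 1.5 and the noise level are assumptions chosen so that predictor 0 correlates most strongly with the response, followed by predictor 1):

```python
import numpy as np
from sklearn.linear_model import lars_path

# Synthetic data: three predictors, only the first two matter
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# lars_path returns the regularization values, the order in which
# predictors enter the active set, and the coefficients at each step
alphas, active, coefs = lars_path(X, y, method="lar")
print("entry order:", list(active))   # the best-correlated predictor enters first
print("path shape:", coefs.shape)     # (n_features, n_steps)
```

With this setup, predictor 0 enters the active set first, then predictor 1, matching the correlation-driven order described above.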
What Is LARS For?
This technique is used for forecasting, modeling time series, and establishing cause and effect relationships between variables. There are several advantages to using regression analysis.
It indicates significant relationships between the dependent variable and the independent variable.
It shows the strength of the effect of several independent variables on the dependent variable.
Regression analysis also allows you to compare the effects of variables measured on different scales. These advantages help data scientists evaluate and select the best set of variables for building predictive models.
The advantages of LARS are:
- It is as fast as forward stepwise regression.
- It generates a complete piecewise linear solution path, which is useful for cross-validation or similar model-tuning experiments.
- If two variables are correlated with the dependent variable to nearly the same extent, their coefficients increase at about the same rate, which makes the algorithm more stable.
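The second advantage, the complete solution path, is what makes cross-validation cheap: scikit-learn's LarsCV computes the full path once per fold and picks the best point on it. A minimal sketch on synthetic data (the five predictors and their coefficients are assumptions for the example):

```python
import numpy as np
from sklearn.linear_model import LarsCV

# Hypothetical synthetic data: 5 predictors, only 2 informative
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(scale=0.2, size=150)

# LarsCV exploits the piecewise linear path: it traces the whole path
# per fold and selects the alpha with the best cross-validated error
model = LarsCV(cv=5).fit(X, y)
print("chosen alpha:", model.alpha_)
print("coefficients:", model.coef_.round(2))
```

The recovered coefficients are close to the generating values (2.0 and -1.0), with the uninformative predictors near zero.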
LARS in Python
```python
from pandas import read_excel
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
import matplotlib.pyplot as plt

# import data
dataframe = read_excel('AirQualityUCI.xls')
# clean dataframe
dataframe = dataframe[(dataframe > 0).all(axis=1)]
data = dataframe.values
# select relevant data
x, y = data[:, 11:12], data[:, 10:11]
# print(x.shape, y.shape, type(x), type(y))
# split the arrays into random train and test subsets
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.1)
# model fitting
lars = linear_model.Lars().fit(xtrain, ytrain)
# predict
ypred = lars.predict(xtest)
# measure errors
print(lars.coef_)
mse = mean_squared_error(ytest, ypred)
print("MSE: %.2f" % mse)
mae = mean_absolute_error(ytest, ypred)
print("MAE: %.2f" % mae)
# plot original vs predicted data
x_ax = range(len(ytest))
plt.scatter(x_ax, ytest, s=5, color="green", label="original")
plt.plot(x_ax, ypred, lw=0.8, color="red", label="predicted")
plt.legend()
plt.show()
```
In the above code, we first import the required libraries: read_excel from pandas for loading the data; from scikit-learn, train_test_split (for randomly splitting the data), linear_model for LARS regression, and mean_squared_error and mean_absolute_error for model evaluation; and finally matplotlib.pyplot for visualizing the model.
Then we import the data into a pandas DataFrame. The source can be a local file or a valid URL.
Our DataFrame (which is real-world air quality data) has some negative values because when the sensors are not working, the value is -200. We remove this data and create a numpy array from the rest. In our example, we are looking at the relationship between relative humidity and temperature.
The corresponding columns (10 and 11) are sliced from the data and then split into two subsets, which is necessary to judge how good our model is:
a “training” data set, which we use to train the model, and a “test” data set, which we use to judge the model’s accuracy.
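A small sketch of this split on dummy data (the arrays are placeholders; the fixed random_state is an optional addition that makes the split reproducible):

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(100).reshape(-1, 1)   # hypothetical feature column
y = np.arange(100)                  # hypothetical target

# test_size=0.1 holds out 10% of the rows; random_state fixes the
# shuffle so repeated runs produce the same split
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=0.1, random_state=0)
print(len(xtrain), len(xtest))
```

Out of 100 rows, 90 go to training and 10 to testing.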
We fit the model to the training data with “lars = linear_model.Lars().fit(xtrain, ytrain)” using the default settings, and then make predictions on the test data with the class’s “predict” method.
We then examine how far the model’s prediction differs from the real data.
MSE stands for mean squared error: the average of the squared distances between the actual points and the regression line. Squaring weights larger errors more heavily and eliminates negative signs.
MAE is an abbreviation for mean absolute error: the average of the absolute deviations of the measured values from the predicted ones.
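To make the two error measures concrete, here is a sketch that computes both by hand on made-up actual and predicted values and checks them against the scikit-learn functions used in the script:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

ytest = np.array([3.0, 5.0, 2.5, 7.0])   # hypothetical actual values
ypred = np.array([2.5, 5.0, 4.0, 8.0])   # hypothetical predictions

# MSE: mean of squared differences; MAE: mean of absolute differences
mse_manual = np.mean((ytest - ypred) ** 2)
mae_manual = np.mean(np.abs(ytest - ypred))

# The manual formulas agree with scikit-learn's metrics
assert np.isclose(mse_manual, mean_squared_error(ytest, ypred))
assert np.isclose(mae_manual, mean_absolute_error(ytest, ypred))
print("MSE:", mse_manual, "MAE:", mae_manual)
```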
In the last part, we create a graph of the original and predicted values using matplotlib.
Please note that your graph (and your results) may differ, because the test and training data are selected by a random split.
As we have seen, regression is a very important tool in data science, and a relatively new variant of it, LARS regression, can be successfully applied in many situations. For this purpose, the Python scikit-learn library provides a convenient solution.