5 Best Ways to Analyze Correlation and Regression in Python

💡 Problem Formulation: Understanding how variables relate to each other is critical in data analysis. This article walks through several ways to perform correlation and regression analysis in Python, whether the goal is to measure the strength of an association or to predict future values. For example, given a dataset of housing prices and square footage, the desired output is the correlation coefficient and a regression model that predicts price from square footage.

Method 1: Pearson Correlation Coefficient with SciPy

The Pearson Correlation Coefficient is a measure of the linear correlation between two variables, with a value between -1 and 1. The scipy.stats library in Python provides a function pearsonr, which can be used to calculate this coefficient along with the p-value for testing non-correlation.

Here's an example:

import scipy.stats as stats

x = [10, 20, 30, 40, 50]
y = [20, 24, 34, 44, 54]

correlation, p_value = stats.pearsonr(x, y)
print("Correlation Coefficient:", correlation)
print("P-value:", p_value)

Output (rounded):

Correlation Coefficient: 0.9908
P-value: 0.00105

This code snippet calculates the Pearson Correlation Coefficient for two lists of numbers, x and y, representing two variables. The output shows a very high positive correlation between the variables. The small p-value indicates that a correlation this strong would be very unlikely if the two variables were actually uncorrelated, so the result is statistically significant at conventional levels. With only five data points, however, the p-value should be interpreted with caution.
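To make the definition concrete, the coefficient can also be computed by hand: it is the sum of the products of the deviations from the mean, divided by the square root of the product of the summed squared deviations. The following is a minimal sketch using NumPy; the helper name pearson_r is hypothetical and not part of SciPy.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r from its definition (hypothetical helper, for illustration)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    # Sum of cross-deviations divided by the product of the deviation norms
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

x = [10, 20, 30, 40, 50]
y = [20, 24, 34, 44, 54]
r = pearson_r(x, y)
print("Manual Pearson r:", round(r, 4))
```

The value should agree with the first element returned by stats.pearsonr(x, y) up to floating-point rounding.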

Method 2: Spearman's Rank Correlation with Pandas

Spearman's Rank Correlation measures the strength of the monotonic relationship between two variables. Unlike Pearson's correlation, which measures linear association, Spearman's correlation is computed on the ranks of the values, so it only captures whether one variable consistently increases (or decreases) as the other increases. It is available through the pandas DataFrame method corr with the argument method='spearman'.

Here's an example:

import pandas as pd

data = {
    'x': [10, 20, 30, 40, 50],
    'y': [20, 24, 34, 46, 65]
}

df = pd.DataFrame(data)
correlation = df.corr(method='spearman')
print(correlation)

Output:

     x    y
x  1.0  1.0
y  1.0  1.0

Here we use a pandas DataFrame to store our two variables x and y and calculate Spearman's rank correlation between them. The resulting DataFrame holds the pairwise Spearman coefficients. The coefficient between x and y is exactly 1.0 because y increases every time x increases, so both columns have identical ranks. This is a perfect monotonic relationship, even though the points do not lie on a straight line.
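Because Spearman's coefficient is Pearson's coefficient applied to ranks, the two measures can disagree on curved but monotonic data. The sketch below uses a hypothetical dataset (y equal to the cube of x, not taken from the example above) to illustrate this, and checks that ranking the columns first and then computing Pearson's correlation gives the same value as method='spearman'.

```python
import pandas as pd

# Hypothetical data: y grows as the cube of x (monotonic but not linear)
df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
df["y"] = df["x"] ** 3

pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]
# Spearman is Pearson computed on the ranks of the values
pearson_on_ranks = df.rank().corr(method="pearson").loc["x", "y"]

print("Pearson:", round(pearson, 4))
print("Spearman:", round(spearman, 4))
print("Pearson on ranks:", round(pearson_on_ranks, 4))
```

On data like this, Pearson's coefficient falls below 1 because the points bend away from a straight line, while Spearman's coefficient stays at 1 because the ordering is preserved.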

Method 3: Linear Regression with statsmodels

Linear regression is a method to model the relationship between a dependent variable and one or more independent variables. The statsmodels library in Python allows users to perform regression using the OLS class, which stands for Ordinary Least Squares, a method for estimating the unknown parameters in a linear regression model.

Here's an example:

import statsmodels.api as sm

X = [10, 20, 30, 40, 50]
Y = [25, 20, 35, 50, 45]

X = sm.add_constant(X) # adding a constant for the intercept
model = sm.OLS(Y, X).fit()

print(model.summary())

Output:

Omnibus:                ...
Prob(Omnibus):          ...
Log-Likelihood:         ...
Method:                 OLS
Dep. Variable:           y
Model:                  OLS
Date:                   ...
No. Observations:       5
Df Residuals:           3
Df Model:               1
Covariance Type:        ...
etc.

This code snippet demonstrates how to perform a linear regression analysis using the statsmodels library. We first add a constant to our independent variable array X to include an intercept in our model, then we use the OLS class to fit a model to our dependent variable Y. The summary output provides extensive details about the regression model, including statistical measures like the R-squared value.

Method 4: Linear Regression with scikit-learn

Scikit-learn is a powerful machine learning library in Python that includes support for various regression models, including linear regression. Using the LinearRegression class from the sklearn.linear_model module, we can easily fit a model to our data and make predictions.

Here's an example:

from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([10, 20, 30, 40, 50]).reshape(-1, 1)
y = np.array([25, 20, 35, 50, 45])

model = LinearRegression().fit(X, y)
print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_)

Output (rounded):

Intercept: 14.0
Coefficient: [0.7]

scikit-learn expects the feature matrix X to be two-dimensional (one row per sample, one column per feature), so we reshape X into a single column before fitting a LinearRegression object to the data. We then retrieve and display the intercept and coefficient, which are the parameters of the fitted line.
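Once the model is fitted, it can be used for the prediction step mentioned in the problem formulation. The following sketch predicts y for two hypothetical new inputs (60 and 70, chosen only for illustration) and reports R-squared on the training data via score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([10, 20, 30, 40, 50]).reshape(-1, 1)
y = np.array([25, 20, 35, 50, 45])
model = LinearRegression().fit(X, y)

# Hypothetical new inputs; predict() expects a 2-D array shaped like the training X
X_new = np.array([[60], [70]])
predictions = model.predict(X_new)

# score() returns the coefficient of determination (R-squared) on the data passed in
r_squared = model.score(X, y)

print("Predictions:", predictions.round(2))
print("R-squared:", round(r_squared, 4))
```

Both new inputs lie outside the range of the training data, so these predictions are extrapolations and should be treated with care.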

Bonus One-Liner Method 5: Quick Correlation using NumPy

For a rapid and straightforward Pearson correlation calculation, NumPy's corrcoef function comes in handy. It returns the correlation matrix, which contains the correlation coefficients between every pair of arrays it receives.

Here's an example:

import numpy as np

x = np.array([10, 20, 30, 40, 50])
y = np.array([20, 24, 34, 44, 54])

print(np.corrcoef(x, y))

Output:

[[1.         0.99083017]
 [0.99083017 1.        ]]

The one-liner np.corrcoef(x, y) provides us with a matrix displaying the Pearson correlation coefficients between x and y. The diagonal elements are always 1, since the correlation of an array with itself is perfect.
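When only the single coefficient between the two arrays is needed, the off-diagonal entry of the matrix can be indexed directly. A small sketch with the same data:

```python
import numpy as np

x = np.array([10, 20, 30, 40, 50])
y = np.array([20, 24, 34, 44, 54])

# The matrix is symmetric, so [0, 1] and [1, 0] hold the same value
r = np.corrcoef(x, y)[0, 1]
print("r:", round(r, 4))
print("r squared:", round(r ** 2, 4))
```

Squaring r gives the share of variance in one variable that a straight-line fit on the other would explain.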

Summary/Discussion

  • Method 1: Pearson Correlation with SciPy. Strengths: Provides both correlation coefficient and p-value. Weaknesses: Only suitable for linear relationships.
  • Method 2: Spearman's Rank Correlation with Pandas. Strengths: Captures monotonic relationships even when they are non-linear. Weaknesses: Uses only ranks, so it ignores the size of the differences between values, and DataFrame.corr does not return a p-value.
  • Method 3: Linear Regression with statsmodels. Strengths: Detailed statistical output. Weaknesses: May be more complex for beginners.
  • Method 4: Linear Regression with scikit-learn. Strengths: Integration with machine learning workflows. Weaknesses: Limited statistical analysis compared to statsmodels.
  • Method 5: Quick Correlation with NumPy. Strengths: Fast and easy. Weaknesses: No additional statistical data like p-value.