The Ultimate Guide to Bivariate Analysis with Python


This article will review some of the critical techniques used in Exploratory Data Analysis, specifically for Bivariate Analysis.

We will review some of the essential concepts, understand some of the math behind correlation coefficients and provide sufficient examples in Python for a well-rounded, comprehensive understanding.

What is Bivariate Analysis?

Exploratory Data Analysis, or EDA, is the first step for any data science project. The objective of EDA on any dataset is to understand relationships between variables (which may be independent or dependent). This analysis helps determine the appropriate model for predictive and descriptive analytics and machine learning techniques.

👉 Examples: How do sales vary with month or season? How do credit scores determine loan eligibility?

It is thus essential to understand, for a given set of data, how one variable interacts with and varies in response to changes in another.

Variable Types

The variables can be

  • numerical,
  • ordinal,
  • categorical, or even
  • textual.

When a variable's values fall into distinct groups, for example, Male or Female, it is called a Categorical variable.

If the categories have a natural ordering, for example, High, Medium and Low, the variable is considered Ordinal.

Numeric (continuous) variables take values within a given range. Which supervised learning technique applies, regression or classification, depends on whether the variables involved are numeric or categorical/ordinal.

Univariate Analysis

If only one variable is analyzed, it is called Univariate Analysis.

In Univariate analysis, data is summarized and analyzed for patterns by deriving the mean, mode, median, variance, and standard deviation of the individual attribute(s) without analyzing any interactions or relationships between them.

Typical graphical techniques are pie charts, histograms, and frequency distribution tables.
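
As a minimal sketch of these summary statistics, the snippet below uses a small made-up series of scores. The values and the names `scores` and `summary` are illustrative assumptions, not taken from the datasets used later in this article.

```python
import pandas as pd

# Hypothetical sample of exam scores, made up for illustration only.
scores = pd.Series([72, 85, 85, 90, 64, 78, 85, 70], name="math score")

summary = {
    "mean": scores.mean(),
    "median": scores.median(),
    "mode": scores.mode().iloc[0],   # most frequent value
    "variance": scores.var(ddof=0),  # population variance
    "std": scores.std(ddof=0),       # population standard deviation
}
print(summary)

# Frequency distribution table: how often each value occurs.
print(scores.value_counts().sort_index())
```

Passing `ddof=0` gives the population variance and standard deviation; pandas defaults to the sample versions (`ddof=1`).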

Multivariate Analysis

Multiple relations between multiple variables are examined simultaneously in Multivariate Analysis.

Bivariate analysis is the simplest case of multivariate analysis, where exactly two variables are analyzed to determine the empirical relationship between them.

Univariate and Bivariate Analysis can be descriptive (describe dataset characteristics) or inferential (make predictions or generalizations on a dataset).

Typically, one variable is independent, and the other is dependent.

Bivariate analysis relies on correlation coefficients and regression analysis. As mentioned earlier, the type of analysis depends on the attribute types, which can be numerical, categorical or ordinal.

The figure below summarizes the possible bivariate analysis approaches, depending on the second variable type and dependency on the first independent variable:

[Figure: Bivariate Analysis – Different Approaches]

Hence, the three primary bivariate analysis techniques are:

  1. Scatter Plots – A visual representation of how the two variables are interrelated.
  2. Regression Analysis – This gives a line or curve equation to depict the relationship and predict one variable for future values of the other.
  3. Correlation Coefficients – This tests whether the two attributes are related. The value ranges from -1 to 1. A value of 0 means there is no linear relationship between them. As the value approaches 1, the positive correlation strengthens, reaching its maximum at 1. Similarly, as the value approaches -1, the attributes are increasingly negatively correlated.
    • Pearson Correlation Coefficient is the principal measure of linear correlation between two data sets. Mathematically, it is the ratio of the covariance of the two variables to the product of their respective standard deviations. We will explain this in detail in the next section.
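
To make the range of the coefficient concrete, here is a small sketch using NumPy's `np.corrcoef` on made-up arrays. All values and variable names are illustrative assumptions.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# A perfectly increasing straight line gives a coefficient of +1.
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]

# A perfectly decreasing straight line gives a coefficient of -1.
r_neg = np.corrcoef(x, -3 * x + 10)[0, 1]

# A symmetric V-shaped pattern has no linear trend, so the coefficient is 0
# even though the two variables are clearly related.
r_zero = np.corrcoef(x, np.array([2.0, 1.0, 0.0, 1.0, 2.0]))[0, 1]

print(r_pos, r_neg, r_zero)
```

The last case is a reminder that a coefficient of 0 rules out a linear relationship only, not every kind of dependence.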

As shown in the above figure, depending on the types of variables, i.e. Categorical or Continuous, we have different forms of analysis.

| Variable 1 | Variable 2 | Descriptive Statistics | Graph |
|---|---|---|---|
| Continuous | Continuous | The measure of increase or decrease of one variable with respect to the other | Scatterplot, Line plots |
| Categorical | Continuous | Range of the continuous variable within each category | Box Plots, Violin Plots, Swarm Plots, Count Plots, Bar Plots |
| Categorical | Categorical | Frequency of the two categories | Heatmaps, Stacked Bar Charts, Mosaic Plots |

Pearson’s Correlation Coefficient

Understanding Pearson’s Correlation Coefficient is fundamental to a better understanding of bivariate analysis techniques.

We will consider a small set of data, with ‘GRE Score’ and ‘CGPA’ as the two variables for which we compute the correlation coefficient.

The essential formulae are those for the covariance and the correlation coefficient:
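
Written out in LaTeX, and assuming the population (divide-by-n) convention that the worked computation in this section uses, with w_i the GRE scores, y_i the CGPA values, and \bar{w}, \bar{y} their means, the definitions would read:

```latex
\operatorname{cov}(W, Y) = \frac{1}{n} \sum_{i=1}^{n} (w_i - \bar{w})(y_i - \bar{y})

\sigma_W = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (w_i - \bar{w})^2}, \qquad
\sigma_Y = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2}

r_{WY} = \frac{\operatorname{cov}(W, Y)}{\sigma_W \, \sigma_Y}
```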

Below is the example computation based on the above formula:

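| # | GRE Score | CGPA | Deviation GRE vs mean | Deviation CGPA vs mean | Product GRE & CGPA Dev. | Square of GRE Dev. | Square of CGPA Dev. |
|---|---|---|---|---|---|---|---|
| | W | Y | (C) = w_i – (A) | (D) = y_i – (B) | (C)*(D) | (C)*(C) | (D)*(D) |
| 1 | 337 | 9.65 | 17.3 | 1.11 | 19.13 | 299.29 | 1.22 |
| 2 | 324 | 8.87 | 4.3 | 0.33 | 1.40 | 18.49 | 0.11 |
| 3 | 316 | 8 | -3.7 | -0.54 | 2.01 | 13.69 | 0.30 |
| 4 | 322 | 8.67 | 2.3 | 0.13 | 0.29 | 5.29 | 0.02 |
| 5 | 314 | 8.21 | -5.7 | -0.33 | 1.90 | 32.49 | 0.11 |
| 6 | 330 | 9.34 | 10.3 | 0.80 | 8.20 | 106.09 | 0.63 |
| 7 | 321 | 8.2 | 1.3 | -0.34 | -0.45 | 1.69 | 0.12 |
| 8 | 308 | 7.9 | -11.7 | -0.64 | 7.53 | 136.89 | 0.41 |
| 9 | 302 | 8 | -17.7 | -0.54 | 9.63 | 313.29 | 0.30 |
| 10 | 323 | 8.6 | 3.3 | 0.06 | 0.18 | 10.89 | 0.00 |
| Avg | 319.7 (A) | 8.54 (B) | | Sum | 49.84 | | |

| Statistic | Computation | Value |
|---|---|---|
| Population Covariance | (Sum of the products of mean deviations) / count | 4.98 |
| Variance | (Sum of squares of deviations) / count | GRE: 93.81, CGPA: 0.32 |
| Standard Deviation | Square root of the variance | GRE: 9.69, CGPA: 0.57 |
| Correlation | Covariance / product of the standard deviations | 0.91 |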

A correlation coefficient of 0.91, close to 1, indicates a high degree of positive correlation.
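
The same computation can be sketched in a few lines of NumPy. The GRE Score and CGPA values are the ten rows of the worked example above; the variable names are illustrative, and the population (divide-by-count) convention is assumed, as in the table.

```python
import numpy as np

# GRE Score and CGPA values from the ten rows of the worked example.
gre = np.array([337, 324, 316, 322, 314, 330, 321, 308, 302, 323], dtype=float)
cgpa = np.array([9.65, 8.87, 8.0, 8.67, 8.21, 9.34, 8.2, 7.9, 8.0, 8.6])

gre_dev = gre - gre.mean()      # column (C): deviation of GRE from its mean
cgpa_dev = cgpa - cgpa.mean()   # column (D): deviation of CGPA from its mean

covariance = (gre_dev * cgpa_dev).mean()     # population covariance
gre_std = np.sqrt((gre_dev ** 2).mean())     # population standard deviation
cgpa_std = np.sqrt((cgpa_dev ** 2).mean())
r = covariance / (gre_std * cgpa_std)        # Pearson correlation coefficient

print(round(covariance, 2), round(gre_std, 2), round(cgpa_std, 2), round(r, 2))
# prints: 4.98 9.69 0.57 0.91

# Cross-check against NumPy's built-in implementation.
assert np.isclose(r, np.corrcoef(gre, cgpa)[0, 1])
```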

An important principle to remember is that “Correlation does not imply Causality”. That is, two variables may show a strong correlation, but it does not imply that one is the cause of the other.

A famous example is the data since the early 19th century showing a steady increase in global average temperature alongside a steady decline in the number of pirates, which statistically implies a strong negative correlation between the two!

However, this does not mean that the decrease in the number of pirates is causing global warming, or vice versa.

Implementation of Bivariate Analysis in Python

Let us review examples for each of the above cases. We will use the Kaggle datasets on student performance and university admissions for all models and code.

We first import the required libraries and read the data.

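# Jupyter/IPython magic: render plots inline (omit when running as a plain script)
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_profiling

# Student performance dataset
df = pd.read_csv("exams.csv")
df.head()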

Categorical vs Categorical Variables

Some of the key techniques for bivariate analysis between a set of categorical variables are (illustrated below):

  1. Bar Plots
  2. Line Plots
  3. Heatmaps
  4. Pivot Tables
  5. GroupBy

# Stacked bar plot: number of students per ethnicity group and parental education level
dfct = pd.crosstab(df["race/ethnicity"],
                   df["parental level of education"],
                   margins=False,
                   values=df["math score"],
                   aggfunc=pd.Series.count)
dfct.plot.bar(stacked=True)

# Area plot of the same counts with the two variables swapped
dfct2 = pd.crosstab(df["parental level of education"],
                    df["race/ethnicity"],
                    margins=False,
                    values=df["math score"],
                    aggfunc=pd.Series.count)
dfct2.plot.area(rot=45)

# Line plot of the counts (same cross-tabulation as the bar plot)
dfct.plot.line()

# Heatmap of the mean math score per (test preparation, parental education) pair
sns.heatmap(pd.crosstab(df["test preparation course"],
                        df["parental level of education"],
                        margins=False,
                        values=df["math score"],
                        aggfunc=pd.Series.mean),
            cmap="YlGnBu", annot=True, cbar=False)

# Pivot table: mean scores by gender and test preparation course.
# The score columns are listed explicitly so text columns are not averaged.
pd.pivot_table(df,
               index=["gender", "test preparation course"],
               values=["math score", "reading score", "writing score"],
               aggfunc="mean")

# GroupBy: mean of the numeric columns per test preparation group
df.groupby(by="test preparation course").mean(numeric_only=True)
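
For readers without `exams.csv` at hand, the following sketch shows what `pd.crosstab` produces on a tiny made-up frame. The column names mirror the dataset, but the rows and the names `toy`, `counts` and `means` are assumptions made up for illustration.

```python
import pandas as pd

# Tiny made-up sample; values are invented for illustration only.
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female", "male"],
    "test preparation course": ["completed", "none", "none",
                                "completed", "completed", "none"],
    "math score": [70, 60, 65, 80, 75, 55],
})

# Frequency of each (gender, course) combination.
counts = pd.crosstab(toy["gender"], toy["test preparation course"])
print(counts)

# Mean math score per combination: the kind of table a heatmap visualises.
means = pd.crosstab(toy["gender"], toy["test preparation course"],
                    values=toy["math score"], aggfunc="mean")
print(means)
```

Without `values` and `aggfunc`, `pd.crosstab` simply counts co-occurrences; with them, it aggregates a third column within each cell.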

Categorical vs Continuous Variables

Some of the key techniques for bivariate analysis between categorical & continuous variables are (illustrated below):

  1. Barplots
  2. Countplots
  3. Boxplots
  4. Violin Plots
  5. Swarm Plots
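
# Bar plot: mean math score per test preparation group
sns.barplot(x='test preparation course', y='math score', data=df)
# Count plot: number of students per gender
sns.countplot(x='gender', data=df)
# Box plot: quartiles of the writing score per group
sns.boxplot(x='test preparation course', y='writing score', data=df, palette='rainbow')
# Violin plot: distribution shape of the reading score per group
sns.violinplot(x='test preparation course', y='reading score', data=df, palette='rainbow')
# Swarm plot: every individual math score, grouped by test preparation
sns.swarmplot(x='test preparation course', y='math score', data=df)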
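
The numbers behind a box plot can also be read off directly with a `groupby`. This is a minimal sketch on made-up data; the values and the names `toy` and `stats` are assumptions for illustration.

```python
import pandas as pd

# Made-up writing scores for two test preparation groups.
toy = pd.DataFrame({
    "test preparation course": ["completed", "none", "none",
                                "completed", "completed", "none"],
    "writing score": [78, 62, 70, 85, 74, 58],
})

# describe() returns count, mean, std, min, quartiles and max per category;
# the quartiles are what a box plot draws.
stats = toy.groupby("test preparation course")["writing score"].describe()
print(stats[["count", "min", "25%", "50%", "75%", "max"]])
```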

Continuous vs Continuous

Some of the key techniques for bivariate analysis between continuous variables are (illustrated below):

  1. Scatter plots
  2. Histograms
  3. Heatmaps
  4. Regression models
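
# Scatter plot of two continuous scores
sns.scatterplot(x="math score", y="writing score", data=df)

# University admissions dataset; note the trailing space in "Chance of Admit "
df2 = pd.read_csv("adm_data.csv")
# Histogram with a kernel density estimate overlaid
sns.histplot(df2["Chance of Admit "], kde=True)
# Hexbin plot, restricted to rows with a chance of admission above 0.5
df2[df2["Chance of Admit "] > 0.5].plot.hexbin(x="TOEFL Score", y="Chance of Admit ", gridsize=15)
# Heatmap of the pairwise correlation matrix
sns.heatmap(df2.corr())
sns.jointplot(data=df2, x="GRE Score", y="TOEFL Score", kind="hex")

# Scatter plots faceted by gender (rows) and race/ethnicity (columns)
g = sns.FacetGrid(df, row="gender", col="race/ethnicity", height=4)
g.map(sns.scatterplot, "math score", "writing score").add_legend()
plt.show()
# Histograms of the math score on the same kind of grid
g = sns.FacetGrid(df, row="gender", col="race/ethnicity", height=5, aspect=0.65)
g.map_dataframe(sns.histplot, x="math score")
plt.show()

# Simple linear regression of Chance of Admit on GRE Score
import statsmodels.api as sm
y = df2["Chance of Admit "]
x = df2[["GRE Score"]]
x = sm.add_constant(x)  # adds the intercept term
model = sm.OLS(y, x).fit()
print(model.summary())
# The summary indicates [y = -2.4361 + 0.01 * x], which can also be used to predict the Chance of Admit given a value of GRE Score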
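
How a fitted line relates to the covariance can be sketched without the CSV file, using the ten GRE Score and CGPA pairs from the Pearson correlation example earlier in this article. `np.polyfit` is used here in place of `statsmodels` purely to keep the sketch short. The variable names and the GRE score of 320 are illustrative assumptions.

```python
import numpy as np

# GRE Score and CGPA pairs from the Pearson correlation example.
gre = np.array([337, 324, 316, 322, 314, 330, 321, 308, 302, 323], dtype=float)
cgpa = np.array([9.65, 8.87, 8.0, 8.67, 8.21, 9.34, 8.2, 7.9, 8.0, 8.6])

# Least-squares line: cgpa = intercept + slope * gre
slope, intercept = np.polyfit(gre, cgpa, deg=1)

# For a single predictor, the slope equals covariance / variance of the predictor.
cov = np.mean((gre - gre.mean()) * (cgpa - cgpa.mean()))
assert np.isclose(slope, cov / gre.var())

# Predict the CGPA for a hypothetical GRE score of 320.
predicted = intercept + slope * 320
print(slope, intercept, predicted)
```

With only ten points this is an illustration of the mechanics, not a usable model.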

We can extend the above bivariate analysis across multiple attributes in a dataset to arrive at multivariate data analysis.

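# Pairwise scatter plots with KDE curves on the diagonal
# (`fill=True` replaces the deprecated `shade=True`)
sns.pairplot(df2, diag_kind="kde", markers="+",
             plot_kws=dict(s=50, edgecolor="b", linewidth=1),
             diag_kws=dict(fill=True))

# PairGrid: scatter plots above the diagonal, KDE contours below, KDE curves on it
g = sns.PairGrid(df2)
g = g.map_upper(sns.scatterplot)
g = g.map_lower(sns.kdeplot, colors="C0")
g = g.map_diag(sns.kdeplot, lw=2)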

Lastly, several automated exploratory data analysis packages simplify the analysis with just a few lines of code and output detailed and extensive analysis outcomes.

Pandas-profiling (since renamed ydata-profiling) is one such popular package:

pandas_profiling.ProfileReport(df2)
