This article will review some of the critical techniques used in Exploratory Data Analysis, specifically for Bivariate Analysis.

We will review some of the essential concepts, understand some of the math behind correlation coefficients and provide sufficient examples in Python for a well-rounded, comprehensive understanding.

## What is Bivariate Analysis?

**Exploratory Data Analysis**, or **EDA**, is the first step for any data science project. The objective of EDA on any dataset is to understand relationships between variables (which may be independent or dependent). This analysis helps determine the appropriate model for predictive and descriptive analytics and machine learning techniques.

**Examples**: *How do sales vary with month or season? How do credit scores determine loan eligibility?*

It is thus essential to understand, for a given set of data, how a specific variable interacts and varies with changes in the other.

### Variable Types

The variables can be **numerical**, **ordinal**, **categorical**, or even **textual**.

When a variable takes values from a fixed set of groups, for example, **Male** or **Female**, it is called a **Categorical** variable.

If there is an ordering between the values, for example, **High**, **Medium**, and **Low**, the variable is considered **Ordinal**.

**Numeric** or *continuous* variables take values within a given range. Various supervised learning regression or classification techniques depend on categorical, ordinal, or numeric variables.
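As a quick illustration, pandas can represent each of these variable types explicitly. The data below is made up purely for demonstration:

```python
import pandas as pd

# A tiny hypothetical dataset mixing the variable types described above
sample = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],              # categorical
    "grade": pd.Categorical(["Low", "High", "Medium"],
                            categories=["Low", "Medium", "High"],
                            ordered=True),               # ordinal
    "score": [61.5, 88.0, 72.25],                        # numeric / continuous
})

print(sample.dtypes)
print(sample["grade"].cat.ordered)  # True: the categories carry an ordering
```

Declaring the ordinal column as an ordered `Categorical` lets pandas sort and compare the grades in their logical order rather than alphabetically.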

### Univariate Analysis

If only one variable is analyzed, it is called **Univariate Analysis**.

In Univariate analysis, data is summarized and analyzed for patterns by deriving the mean, mode, median, variance, and standard deviation of the individual attribute(s) without analyzing any interactions or relationships between them.

Typical graphical techniques are pie charts, histograms, and frequency distribution tables.
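For instance, the univariate summary statistics above can be computed with pandas on a single attribute. The sample values here are made up for illustration:

```python
import pandas as pd

# A hypothetical sample of a single attribute (e.g. math scores)
scores = pd.Series([72, 76, 90, 47, 76, 71, 88, 40, 64, 78])

# Univariate summary: one variable, no interactions considered
print(scores.mean())            # central tendency: arithmetic mean
print(scores.median())          # middle value of the sorted data
print(scores.mode().tolist())   # most frequent value(s)
print(scores.var())             # spread: sample variance
print(scores.std())             # spread: standard deviation
```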

### Multivariate Analysis

Multiple relations between multiple variables are examined simultaneously in **Multivariate Analysis**.

**Bivariate analysis** is a more straightforward case of multivariate analysis, where **two** variables are analyzed to determine the empirical relationship between them.

Univariate and Bivariate Analysis can be **descriptive** (describing dataset characteristics) or **inferential** (making predictions or generalizations from a dataset).

Typically, one variable is **independent**, and the other is **dependent**.

Bivariate analysis involves the principles of **correlation coefficients** and **regression analysis**. As mentioned earlier, the type of analysis depends on the attribute types, which can be **nominal, categorical, or ordinal**.

The appropriate bivariate analysis approach depends on the type of the second variable and on its dependency on the first, independent variable.

Hence, the three primary bivariate analysis techniques are:

- **Scatter Plots** are a visual representation of how the two variables are interrelated.
- **Regression Analysis** gives a line or curve equation that depicts the relationship and can predict one variable for future values of the other.
- **Correlation Coefficients** test whether the two attributes are related. The value ranges from -1 to 1. A value of 0 means the attributes are not related at all; as the value tends to 1, the correlation becomes more strongly positive (maximum correlation at 1), and as it tends to -1, the attributes are more strongly negatively correlated.

The **Pearson Correlation Coefficient** is the principal measure of linear correlation between two data sets. Mathematically, it is the ratio of the covariance of the two variables to the product of their respective standard deviations. We will explain this in detail in the next section.
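To see the correlation coefficient in code, here is a minimal sketch, using made-up data, that computes Pearson's r with NumPy:

```python
import numpy as np

# Illustrative (made-up) data: hours studied vs. exam score
hours = np.array([1, 2, 3, 4, 5, 6])
score = np.array([52, 55, 61, 64, 70, 74])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is Pearson's r between the two variables
r = np.corrcoef(hours, score)[0, 1]
print(round(r, 3))  # close to +1: a strong positive linear correlation
```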

Depending on the types of the variables, i.e. categorical or continuous, we have different forms of analysis, as summarized in the table below:

| Variable 1 | Variable 2 | Descriptive statistics / graphs |
| --- | --- | --- |
| Continuous | Continuous | Measure of how one variable increases or decreases with the other: scatterplots, line plots |
| Categorical | Continuous | Range of the continuous variable for each category: box plots, violin plots, swarm plots, count plots, bar plots |
| Categorical | Categorical | Frequency of the two categories: heatmaps, stacked bar charts, mosaic plots |

## Pearson’s Correlation Coefficient

Understanding Pearson’s Correlation Coefficient is fundamental to a better understanding of bivariate analysis techniques.

We will consider the following set of data, with 'GRE Score' and 'CGPA' as the two variables for which we compute the correlation coefficient:

The essential formulae are the (population) covariance and the correlation coefficient:

- Covariance: `cov(X, Y) = Σ (x_i - x̄)(y_i - ȳ) / n`
- Pearson's correlation: `r = cov(X, Y) / (σ_X * σ_Y)`

where x̄ and ȳ are the means of the two variables and σ_X, σ_Y are their standard deviations.

Below is the example computation based on the above formula:

| # | GRE Score (W) | CGPA (Y) | GRE deviation (C) = w_i - (A) | CGPA deviation (D) = y_i - (B) | Product (C)·(D) | Square (C)·(C) | Square (D)·(D) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 337 | 9.65 | 17.3 | 1.11 | 19.13 | 299.29 | 1.22 |
| 2 | 324 | 8.87 | 4.3 | 0.33 | 1.40 | 18.49 | 0.11 |
| 3 | 316 | 8.00 | -3.7 | -0.54 | 2.01 | 13.69 | 0.30 |
| 4 | 322 | 8.67 | 2.3 | 0.13 | 0.29 | 5.29 | 0.02 |
| 5 | 314 | 8.21 | -5.7 | -0.33 | 1.90 | 32.49 | 0.11 |
| 6 | 330 | 9.34 | 10.3 | 0.80 | 8.20 | 106.09 | 0.63 |
| 7 | 321 | 8.20 | 1.3 | -0.34 | -0.45 | 1.69 | 0.12 |
| 8 | 308 | 7.90 | -11.7 | -0.64 | 7.53 | 136.89 | 0.41 |
| 9 | 302 | 8.00 | -17.7 | -0.54 | 9.63 | 313.29 | 0.30 |
| 10 | 323 | 8.60 | 3.3 | 0.06 | 0.18 | 10.89 | 0.00 |
| **Avg** | 319.7 (A) | 8.54 (B) | | **Sum** | 49.84 | | |
| | | | **Population covariance** = sum of products / count | | 4.98 | | |
| | | | **Variance** = sum of squared deviations / count | | | 93.81 | 0.32 |
| | | | **Standard deviation** = √variance | | | 9.69 | 0.57 |
| | | | **Correlation** = covariance / product of std. devs | | 0.91 | | |

A Correlation Coefficient (0.91) value close to 1 indicates a high degree of Positive Correlation.
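We can sanity-check the worked example with a few lines of NumPy, following the same population-covariance steps as the manual computation:

```python
import numpy as np

# The ten (GRE Score, CGPA) pairs from the worked example
gre  = np.array([337, 324, 316, 322, 314, 330, 321, 308, 302, 323])
cgpa = np.array([9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8.00, 8.60])

# Population covariance: mean of the products of the deviations
cov = np.mean((gre - gre.mean()) * (cgpa - cgpa.mean()))

# np.std defaults to the population standard deviation (ddof=0)
r = cov / (gre.std() * cgpa.std())
print(round(r, 2))  # 0.91, matching the manual computation
```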

An important principle to remember is that "**Correlation does not imply Causality**". That is, two variables may show a strong correlation, but it does not imply that one is the cause of the other.

A famous example to illustrate this uses data since the early 19th century, which shows a steady increase in global average temperature accompanied by a steady decline in the number of pirates, statistically implying a robust negative correlation between the two!

However, this does not mean that the decrease in the number of pirates is causing global warming, or vice versa.

## Implementation of Bivariate Analysis in Python

Let us review examples for each of the above cases. We will use the Kaggle datasets on student performance and university admissions for all models and code.

We first import the required libraries and read the data.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_profiling

%matplotlib inline

df = pd.read_csv("exams.csv")
df.head()
```

### Categorical vs Categorical Variables

Some of the key techniques for bivariate analysis between a set of categorical variables are (illustrated below):

- Bar Plots
- Line Plots
- Heatmaps
- Pivot Tables
- GroupBy

```python
dfct = pd.crosstab(df["race/ethnicity"], df["parental level of education"],
                   margins=False, values=df["math score"], aggfunc=pd.Series.count)
dfct.plot.bar(stacked=True)
```

```python
dfct2 = pd.crosstab(df["parental level of education"], df["race/ethnicity"],
                    margins=False, values=df["math score"], aggfunc=pd.Series.count)
# plt.xticks(rotation=45)
dfct2.plot.area(rot=45)
```

```python
dfct3 = pd.crosstab(df["race/ethnicity"], df["parental level of education"],
                    margins=False, values=df["math score"], aggfunc=pd.Series.count)
dfct3.plot.line()
```

```python
sns.heatmap(pd.crosstab(df["test preparation course"], df["parental level of education"],
                        margins=False, values=df["math score"], aggfunc=pd.Series.mean),
            cmap="YlGnBu", annot=True, cbar=False)
```

```python
pd.pivot_table(df, index=["gender", "test preparation course"], aggfunc=[np.mean])
```

```python
# numeric_only=True restricts the aggregation to numeric columns
# (required by recent pandas versions when some columns are non-numeric)
df.groupby(by='test preparation course').mean(numeric_only=True)
```

### Categorical vs Continuous Variables

Some of the key techniques for bivariate analysis between categorical & continuous variables are (illustrated below):

- Barplots
- Countplots
- Boxplots
- Violin Plots
- Swarm Plots

```python
sns.barplot(x='test preparation course', y='math score', data=df)
```

```python
sns.countplot(x='gender', data=df)
```

```python
sns.boxplot(x='test preparation course', y='writing score', data=df, palette='rainbow')
```

```python
sns.violinplot(x='test preparation course', y='reading score', data=df, palette='rainbow')
```

```python
sns.swarmplot(x='test preparation course', y='math score', data=df)
```

### Continuous vs Continuous

Some of the key techniques for bivariate analysis between continuous variables are (illustrated below):

- Scatter plots
- Histograms
- Heatmaps
- Regression models

```python
sns.scatterplot(x="math score", y="writing score", data=df)
```

```python
df2 = pd.read_csv("adm_data.csv")
sns.histplot(df2["Chance of Admit "], kde=True)
```

```python
df2[df2['Chance of Admit '] > 0.5].plot.hexbin(x="TOEFL Score", y="Chance of Admit ", gridsize=15)
```

```python
sns.heatmap(df2.corr())
```

```python
sns.jointplot(data=df2, x='GRE Score', y='TOEFL Score', kind='hex')
```

```python
g = sns.FacetGrid(df, row='gender', col='race/ethnicity', height=4)
g.map(sns.scatterplot, 'math score', 'writing score').add_legend()
plt.show()
```

```python
g = sns.FacetGrid(df, row='gender', col='race/ethnicity', height=5, aspect=0.65)
g.map_dataframe(sns.histplot, x="math score")
plt.show()
```

```python
import statsmodels.api as sm

y = df2['Chance of Admit ']
x = df2[['GRE Score']]
x = sm.add_constant(x)

model = sm.OLS(y, x).fit()
print(model.summary())
# The summary indicates [y = -2.4361 + 0.01 * x], which can also be used
# to predict the Chance of Admit given a value of GRE Score
```
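As a quick sketch, the coefficients reported in the OLS summary (intercept ≈ -2.4361, slope ≈ 0.01) can be plugged into the fitted line to predict the Chance of Admit for a hypothetical GRE score:

```python
# Coefficients taken from the OLS summary above (approximate values)
intercept, slope = -2.4361, 0.01

def predict_chance(gre_score):
    """Linear prediction from the fitted bivariate regression line."""
    return intercept + slope * gre_score

print(round(predict_chance(320), 4))  # about 0.76 for a GRE score of 320
```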

We can extend the above bivariate analysis across multiple attributes in a dataset to arrive at multivariate data analysis.

```python
sns.pairplot(df2, diag_kind="kde", markers="+",
             plot_kws=dict(s=50, edgecolor="b", linewidth=1),
             diag_kws=dict(fill=True))  # 'shade' was replaced by 'fill' in recent seaborn
```

```python
g = sns.PairGrid(df2)
g = g.map_upper(sns.scatterplot)
g = g.map_lower(sns.kdeplot, color="C0")
g = g.map_diag(sns.kdeplot, lw=2)
```

Lastly, several automated exploratory data analysis packages reduce the analysis to just a few lines of code and produce detailed, extensive reports.

**Pandas-profiling** (since renamed to `ydata-profiling`) is one such popular package:

```python
pandas_profiling.ProfileReport(df2)
```

## References

Datasets used:

- Kaggle student performance dataset (`exams.csv`)
- Kaggle university admissions dataset (`adm_data.csv`)

Other references used for research:

- Bivariate analysis – Wikipedia
- A Quick Guide to Bivariate Analysis in Python – Analytics Vidhya
- Pearson correlation coefficient – Wikipedia
- https://www.statology.org/bivariate-analysis-in-python/
- https://www.kaggle.com/code/residentmario/bivariate-plotting-with-pandas/notebook
- https://medium.com/mlearning-ai/univariate-bivariate-and-multivariate-data-analysis-in-python-341493c3d173
- Python pandas, seaborn, matplotlib libraries