5 Best Ways to Create a Boxplot Stratified by Column in Python Pandas

💡 Problem Formulation: When analyzing data with Python Pandas, you often need to stratify a plot by a categorical variable to compare distributions across groups. For instance, given a DataFrame of employees with their departments and salaries, the desired output is a set of boxplots, one per department, each showing that department's salary distribution. This article walks through several ways to achieve this.

Method 1: Using Standard Pandas Plotting with Subplots

This method involves using the built-in Pandas plotting interface that, in turn, uses Matplotlib to create boxplots for each category in a separate subplot. A groupby operation is combined with the boxplot method, allowing stratification of the data based on a selected column.

Here's an example:

import pandas as pd
import matplotlib.pyplot as plt

# Sample DataFrame
df = pd.DataFrame({
    'Department': ['HR', 'IT', 'Sales', 'HR', 'Sales'],
    'Salary': [50000, 65000, 55000, 52000, 60000]
})

# Create boxplots stratified by 'Department'
grouped = df.groupby('Department')
fig, axes = plt.subplots(nrows=len(grouped), ncols=1, figsize=(8, 6))
for (key, ax) in zip(grouped.groups.keys(), axes):
    grouped.get_group(key).boxplot(ax=ax, column=['Salary'])
    ax.set_title('Department: ' + str(key))
plt.tight_layout()
plt.show()

The output will be a series of boxplots, one per department, each drawn in its own subplot.

In this code, the DataFrame is grouped by 'Department', and one subplot is created per group. The boxplot of 'Salary' for each group is drawn in its own subplot, titled with the department name. This shows the salary distribution within each department; because plt.subplots does not share y-axes by default, compare the tick values rather than the box positions when reading across subplots.
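If the goal is direct comparison between departments, a shared y-axis can help. The following is a minimal sketch of that variation, not part of the original method: the single-row layout and the figure size are assumptions chosen for illustration, while sharey and squeeze are standard plt.subplots parameters.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Same sample data as in the example above
df = pd.DataFrame({
    'Department': ['HR', 'IT', 'Sales', 'HR', 'Sales'],
    'Salary': [50000, 65000, 55000, 52000, 60000]
})

grouped = df.groupby('Department')

# sharey=True puts every subplot on the same salary scale;
# squeeze=False keeps `axes` two-dimensional even if there is only one group.
fig, axes = plt.subplots(nrows=1, ncols=grouped.ngroups,
                         figsize=(9, 4), sharey=True, squeeze=False)

# Iterating over a GroupBy yields (key, sub-DataFrame) pairs in sorted key order
for (key, group), ax in zip(grouped, axes.ravel()):
    group.boxplot(column='Salary', ax=ax)
    ax.set_title(f'Department: {key}')

fig.tight_layout()
# plt.show()  # uncomment to display the figure interactively
```

With a shared axis, a box that sits higher in its subplot really does represent higher salaries, which is not guaranteed when each subplot scales independently.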

Method 2: Using Seaborn's FacetGrid for Advanced Visualizations

The Seaborn library, an abstraction layer over Matplotlib, offers the FacetGrid that enables one to create a grid of plots based on the values of one or more categorical variables. This method enhances control over plot aesthetics and layout.

Here's an example:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Sample DataFrame
df = pd.DataFrame({
    'Department': ['HR', 'IT', 'Sales', 'HR', 'Sales'],
    'Salary': [50000, 65000, 55000, 52000, 60000]
})

# Using Seaborn's FacetGrid: one facet (subplot) per department
g = sns.FacetGrid(df, col="Department", col_wrap=2)
# Draw a horizontal boxplot of 'Salary' in each facet
g.map_dataframe(sns.boxplot, x="Salary")
plt.show()

This code produces a grid of boxplots, one facet per department, wrapping to a new row after two columns.

In this example, Seaborn's FacetGrid is used to generate a separate boxplot for each department. The col parameter determines the categorical variable that will stratify the plots, and col_wrap controls the maximum number of columns of subplots in the grid. The Seaborn boxplot function is then mapped onto each facet. This approach gives an aesthetically pleasing and highly customizable set of plots.
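As an illustration of the customization a FacetGrid allows, the sketch below shortens the facet titles and sets the axis labels. It assumes a reasonably recent Seaborn release; the vertical orientation (y="Salary") and the label text are choices made for this example rather than part of the method above.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Same sample data as in the example above
df = pd.DataFrame({
    'Department': ['HR', 'IT', 'Sales', 'HR', 'Sales'],
    'Salary': [50000, 65000, 55000, 52000, 60000]
})

# One facet per department, at most two facets per row, common y-axis
g = sns.FacetGrid(df, col='Department', col_wrap=2, sharey=True)

# map_dataframe hands each facet its own subset of df via the data= argument
g.map_dataframe(sns.boxplot, y='Salary')

# Replace the default "Department = HR" style titles with the bare category name
g.set_titles(col_template='{col_name}')
g.set_axis_labels('', 'Salary')
# plt.show()  # uncomment to display the figure interactively
```

The grid object g keeps references to every facet in g.axes, so further per-facet tweaks can be made with ordinary Matplotlib calls.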

Method 3: Using Seaborn's catplot for a Quick and Easy Boxplot

Seaborn's catplot is a high-level interface that provides access to various categorical plots, including boxplots. It allows quick and easy plotting without the need for fine-grained control over the plot's grid structure.

Here's an example:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Sample DataFrame
df = pd.DataFrame({
    'Department': ['HR', 'IT', 'Sales', 'HR', 'Sales'],
    'Salary': [50000, 65000, 55000, 52000, 60000]
})

# Boxplot using Seaborn's catplot
sns.catplot(x="Department", y="Salary", kind="box", data=df)
plt.show()

The output will display a single figure with separate boxplots for each department along the x-axis.

The catplot function requires minimal code: it groups the salary data by department and places one box per department along the x-axis, giving a direct visual comparison of the distributions on a shared y-axis.
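catplot also accepts a few arguments that are handy for quick adjustments. The sketch below fixes the left-to-right order of the departments and resizes the figure; the particular order and the height and aspect values are assumptions picked for illustration.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Same sample data as in the example above
df = pd.DataFrame({
    'Department': ['HR', 'IT', 'Sales', 'HR', 'Sales'],
    'Salary': [50000, 65000, 55000, 52000, 60000]
})

# order fixes the category order on the x-axis;
# height (inches) and aspect (width / height) control the figure size.
g = sns.catplot(data=df, x='Department', y='Salary', kind='box',
                order=['Sales', 'IT', 'HR'], height=4, aspect=1.2)

# catplot returns a FacetGrid, so its helper methods are available
g.set_axis_labels('Department', 'Salary')
# plt.show()  # uncomment to display the figure interactively
```

Because the return value is a FacetGrid, the same call can be extended with a col argument to split the plot into facets by a second categorical column, if the data has one.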

Method 4: Pandas Groupby with Boxplot and Layout Control

This method is a variation of the first, using the flexibility of Pandas combined with Matplotlib. Using groupby and iterating over subplots within a specified layout provides more control over how the plots are arranged.

Here's an example:

import pandas as pd
import matplotlib.pyplot as plt

# Sample DataFrame
df = pd.DataFrame({
    'Department': ['HR', 'IT', 'Sales', 'HR', 'Sales'],
    'Salary': [50000, 65000, 55000, 52000, 60000]
})

# Groupby and plot with a specified layout
fig, axes = plt.subplots(1, 3, figsize=(10, 5))  # Adjust the layout as needed
for ax, (key, group) in zip(axes.flatten(), df.groupby('Department')):
    group.boxplot(column='Salary', ax=ax)
    ax.set_title('Department: ' + str(key))  # str() also handles non-string keys
plt.tight_layout()
plt.show()

The output will be a single row of boxplots, one for each department.

Similar to Method 1, this code uses groupby to stratify the dataset and then iterates over a flattened array of subplot axes, drawing each group's boxplot onto its own axis. The 1×3 grid here matches the three departments in the sample data; change the arguments to plt.subplots to fit a different number of categories.
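When the number of categories is not known in advance, the grid shape can be derived from the data instead of being hard-coded. The following is a sketch under the assumption that a fixed column count (ncols = 2 here) is acceptable; any leftover axes in the last row are hidden.

```python
import math

import pandas as pd
import matplotlib.pyplot as plt

# Same sample data as in the example above
df = pd.DataFrame({
    'Department': ['HR', 'IT', 'Sales', 'HR', 'Sales'],
    'Salary': [50000, 65000, 55000, 52000, 60000]
})

groups = list(df.groupby('Department'))   # list of (key, sub-DataFrame) pairs
ncols = 2                                 # assumed column count for this sketch
nrows = math.ceil(len(groups) / ncols)    # enough rows to hold every group

# squeeze=False guarantees a 2-D array of axes for any grid shape
fig, axes = plt.subplots(nrows, ncols,
                         figsize=(4 * ncols, 3.5 * nrows), squeeze=False)
flat_axes = axes.flatten()

for ax, (key, group) in zip(flat_axes, groups):
    group.boxplot(column='Salary', ax=ax)
    ax.set_title(f'Department: {key}')

# Hide axes that have no group assigned (here: the fourth cell of the 2x2 grid)
for ax in flat_axes[len(groups):]:
    ax.set_visible(False)

fig.tight_layout()
# plt.show()  # uncomment to display the figure interactively
```

With the three sample departments this yields a 2×2 grid with one hidden cell; adding a fourth department would fill the grid without any code changes.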

Bonus One-Liner Method 5: Quick Boxplot with Pandas

For a one-liner solution in Pandas, the boxplot() function can be directly called on the DataFrame, specifying the column to stratify by with the by parameter.

Here's an example:

import pandas as pd
import matplotlib.pyplot as plt

# Sample DataFrame
df = pd.DataFrame({
    'Department': ['HR', 'IT', 'Sales', 'HR', 'Sales'],
    'Salary': [50000, 65000, 55000, 52000, 60000]
})

# One-liner boxplot stratified by 'Department'
df.boxplot(by='Department', column=['Salary'])
plt.show()  # needed to display the figure when running as a script

A single figure with separate boxplots for each department category will be presented.

This one-liner is the epitome of simplicity. By calling the boxplot method on the DataFrame and specifying the by parameter, Pandas quickly generates a boxplot for each category of the specified column. This method is excellent for a quick look at your data.
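For a quick look it can also help to see the numbers behind the boxes. The sketch below prints the per-department quartiles with groupby and describe, then draws the same boxplot and clears the figure-level title that Pandas adds automatically when by is used. The replacement title text is an assumption made for this example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Same sample data as in the example above
df = pd.DataFrame({
    'Department': ['HR', 'IT', 'Sales', 'HR', 'Sales'],
    'Salary': [50000, 65000, 55000, 52000, 60000]
})

# describe() returns count, mean, std, min, quartiles and max per department
summary = df.groupby('Department')['Salary'].describe()
print(summary[['25%', '50%', '75%']])

# Same one-liner as above, followed by two cosmetic tweaks
df.boxplot(column='Salary', by='Department')
plt.suptitle('')                    # clear the automatic figure-level title
plt.title('Salary by Department')   # title on the current axes
# plt.show()  # uncomment to display the figure interactively
```

The 50% column is the median, which corresponds to the line drawn inside each box.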

Summary/Discussion

  • Method 1: Standard Pandas Plotting with Subplots. Strengths: uses only Pandas and Matplotlib, works well for a small number of categories. Weaknesses: layout can become cumbersome with many categories.
  • Method 2: Seaborn's FacetGrid. Strengths: aesthetically pleasing, highly customizable. Weaknesses: requires an additional library, may be overkill for simple tasks.
  • Method 3: Seaborn's catplot. Strengths: quick setup, produces complex plots with minimal code. Weaknesses: less control over individual plot elements, an additional library needed.
  • Method 4: Pandas Groupby with Boxplot and Layout Control. Strengths: flexible layout customization, good for a medium number of categories. Weaknesses: more code for layout management.
  • Method 5: Quick Boxplot with Pandas. Strengths: absolute minimal code, fast. Weaknesses: limited customization beyond basic options such as figsize and layout.