5 Best Ways to Plot a Histogram for Pandas DataFrame with Matplotlib

💡 Problem Formulation: Data analysts often need to visualize the distribution of numerical data to identify patterns, outliers, and the overall shape of the data set. In this article, we’ll tackle how to plot a histogram for a Pandas DataFrame using the Matplotlib library in Python. For instance, given a DataFrame with a column ‘Age’, we aim to display its distribution through various histogram plotting techniques. The desired output is a visual representation of the frequency of ‘Age’ within specified ranges or bins.

Method 1: Using DataFrame.plot.hist()

Pandas integrates closely with Matplotlib, so histograms can be plotted directly from a DataFrame or one of its columns using the plot.hist() method. This method is a high-level wrapper around Matplotlib’s plt.hist() function, and its straightforward syntax makes it the natural first choice when the data already lives in a DataFrame.

Here’s an example:

import pandas as pd
import matplotlib.pyplot as plt

# Sample DataFrame
data = pd.DataFrame({'Age': [23, 45, 56, 78, 33, 44, 56, 76, 23, 42]})
# Plotting the histogram
ax = data['Age'].plot.hist(bins=5, alpha=0.5)
plt.show()

This code snippet produces a histogram with 5 bins for the ‘Age’ column in the DataFrame and draws the bars at 50% opacity (alpha=0.5).

In the example above, we start by importing the required libraries. Then, we create a simple DataFrame with a single column ‘Age’ containing sample data. The plot.hist() function is used to plot the histogram with specified bins and transparency. Finally, plt.show() is called to display the histogram.
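To see which ranges bins=5 actually produces, the same binning can be computed without drawing anything. The following is a minimal sketch, assuming the default equal-width binning between the smallest and largest value; it uses numpy.histogram, which is not part of the original example.

```python
import numpy as np
import pandas as pd

# Same sample data as in the example above
data = pd.DataFrame({'Age': [23, 45, 56, 78, 33, 44, 56, 76, 23, 42]})

# Split the range min..max into 5 equal-width bins and count values per bin
counts, edges = np.histogram(data['Age'], bins=5)

# Print each bin's range next to the number of values that fall in it
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count}")
```

For this data the edges fall at 23, 34, 45, 56, 67, and 78, with counts of 3, 2, 1, 2, and 2. Every bin except the last includes its left edge and excludes its right edge; the last bin includes both.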

Method 2: Using matplotlib.pyplot.hist()

Another approach to plot histograms is using the matplotlib.pyplot.hist() function directly. This method involves passing the desired DataFrame column to the plt.hist() function. Although it requires slightly more coding than the first method, it offers greater control over the histogram plot, making it ideal for customizing the plot according to specific needs.

Here’s an example:

import pandas as pd
import matplotlib.pyplot as plt

# Sample DataFrame
data = pd.DataFrame({'Age': [23, 45, 56, 78, 33, 44, 56, 76, 23, 42]})
# Plotting the histogram
plt.hist(data['Age'], bins=5, edgecolor='black')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution')
plt.show()

This code snippet produces a detailed histogram of the ‘Age’ column with 5 bins and labels for the x-axis, y-axis, and a title for the plot.

The plt.hist() function is used this time with additional parameters for enhancing the plot’s readability. The edgecolor parameter outlines each bin, while plt.xlabel(), plt.ylabel(), and plt.title() are used to label the axes and title the plot. The result is a more detailed and customized histogram.
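The extra control mentioned above extends to the bins themselves. As a hedged sketch, the variant below uses Matplotlib's object-oriented interface (fig, ax) with explicit bin edges; the decade-wide edges are an assumption chosen for illustration, not part of the original example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Same sample data as in the example above
data = pd.DataFrame({'Age': [23, 45, 56, 78, 33, 44, 56, 76, 23, 42]})

# Assumed decade-wide bin edges (illustrative choice)
bin_edges = [20, 30, 40, 50, 60, 70, 80]

fig, ax = plt.subplots(figsize=(6, 4))
# ax.hist returns the per-bin counts, the bin edges, and the drawn bars
counts, edges, patches = ax.hist(data['Age'], bins=bin_edges, edgecolor='black')
ax.set_xlabel('Age')
ax.set_ylabel('Frequency')
ax.set_title('Age Distribution')
ax.set_xticks(bin_edges)  # put a tick on every bin boundary

# Report the counts that were plotted
for left, count in zip(edges[:-1], counts):
    print(f"{left:.0f}s: {int(count)}")
```

Calling plt.show() afterwards displays the figure. Because the counts are returned, they can be reused for annotations or checks instead of being read off the chart.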

Method 3: Using seaborn.histplot()

Seaborn is a statistical plotting library that builds on Matplotlib and integrates closely with Pandas. The seaborn.histplot() function can create histograms and is particularly useful for its additional features like KDE (Kernel Density Estimate) plots and styling. When aesthetics and additional statistical representation are imperative, seaborn becomes the go-to choice.

Here’s an example:

import pandas as pd
import matplotlib.pyplot as plt  # needed for plt.show()
import seaborn as sns

# Sample DataFrame
data = pd.DataFrame({'Age': [23, 45, 56, 78, 33, 44, 56, 76, 23, 42]})
# Plotting the histogram with KDE
sns.histplot(data=data, x='Age', bins=5, kde=True)
sns.despine()
plt.show()

This code snippet produces a histogram with a KDE plot overlaid for smooth density estimation.

In the provided example, we first import the required libraries. After creating a DataFrame, we use sns.histplot() with the kde parameter set to True to overlay a density estimate curve. The call to sns.despine() is an optional aesthetic touch that removes the top and right spines.

Method 4: Using pandas.cut() and DataFrame.plot.bar()

For a more manual approach to histogram plotting, one can use the pandas.cut() function to create binned categories and subsequently plot a bar chart using the DataFrame.plot.bar(). This provides maximum control, as you preprocess the data into bins before plotting, which can be particularly useful for non-uniform bin sizes or custom binning logic.

Here’s an example:

import pandas as pd
import matplotlib.pyplot as plt

# Sample DataFrame
data = pd.DataFrame({'Age': [23, 45, 56, 78, 33, 44, 56, 76, 23, 42]})
# Creating bins
data['AgeBin'] = pd.cut(data['Age'], bins=[0, 30, 60, 90])
# Counting the occurrences in each bin
age_distribution = data['AgeBin'].value_counts().sort_index()
# Plotting the bar chart (histogram)
age_distribution.plot.bar()
plt.show()

This code will create a histogram-like bar chart with custom age ranges as bins.

The pandas.cut() function assigns each ‘Age’ value to one of the specified intervals, which are right-inclusive by default: (0, 30], (30, 60], and (60, 90]. value_counts() tallies the frequencies, and sort_index() orders the result by interval so the bars follow a logical order. A bar chart is then plotted with plot.bar(), visually functioning as a histogram.
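The custom binning logic mentioned above can go further than a list of edges. As a minimal sketch, the variant below uses unequal bin widths, readable labels, and left-inclusive intervals via right=False; the edges and label strings are assumptions chosen for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Same sample data as in the example above
data = pd.DataFrame({'Age': [23, 45, 56, 78, 33, 44, 56, 76, 23, 42]})

# Assumed non-uniform edges and labels (illustrative choice)
edges = [20, 30, 50, 80]
labels = ['20-29', '30-49', '50-79']

# right=False makes each interval include its left edge: [20, 30), [30, 50), [50, 80)
age_groups = pd.cut(data['Age'], bins=edges, labels=labels, right=False)

# Count values per group and keep the groups in bin order
counts = age_groups.value_counts().sort_index()
print(counts.tolist())

# width=1.0 removes the gaps between bars for a histogram-like look
ax = counts.plot.bar(width=1.0, edgecolor='black', rot=0)
ax.set_xlabel('Age group')
ax.set_ylabel('Frequency')
```

Calling plt.show() afterwards displays the chart. Note that plot.bar() draws every bar with the same width, so the chart does not convey that the bins cover different ranges; the labels carry that information instead.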

Bonus One-Liner Method 5: Quick Plot with pandas.DataFrame.hist()

For the fastest and least code-intensive method, the Pandas library provides the DataFrame.hist() function that generates histograms for all DataFrame numerical columns in just one line of code. While it lacks the fine-tuning available in other methods, it is unrivaled in convenience for a quick look at data distributions.

Here’s an example:

import pandas as pd
import matplotlib.pyplot as plt  # needed for plt.show()

# Sample DataFrame
data = pd.DataFrame({'Age': [23, 45, 56, 78, 33, 44, 56, 76, 23, 42]})
# Plotting the histogram in one line
data.hist(column='Age')
plt.show()

A histogram for the ‘Age’ column is promptly displayed.

By invoking data.hist() with the column parameter, Pandas creates the histogram itself and calls Matplotlib behind the scenes; Matplotlib appears in the code only for plt.show(). This method is particularly useful for quick exploratory data analysis.
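The convenience is most visible when a DataFrame has several numeric columns. The following sketch adds hypothetical ‘Name’ and ‘Income’ columns (invented for illustration, not part of the original data) to show that hist() draws one subplot per numeric column and skips the rest.

```python
import pandas as pd
import matplotlib.pyplot as plt

# 'Name' and 'Income' are hypothetical columns added for illustration
df = pd.DataFrame({
    'Name': [f'Person {i}' for i in range(10)],
    'Age': [23, 45, 56, 78, 33, 44, 56, 76, 23, 42],
    'Income': [28, 52, 61, 35, 40, 58, 63, 30, 25, 49],  # made-up values
})

# One call draws a histogram for each numeric column; 'Name' is ignored
axes = df.hist(bins=5, figsize=(8, 3))

# hist() returns an array of Axes; each subplot is titled with its column name
titles = [ax.get_title() for ax in axes.ravel()]
print(titles)
```

Calling plt.show() afterwards displays the figure. The returned Axes objects can still be customized individually if the quick look turns into a chart worth keeping.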

Summary/Discussion

  • Method 1: Using DataFrame.plot.hist(). Strengths: Simple and integrated with Pandas. Weaknesses: Less customizable than some other methods.
  • Method 2: Using matplotlib.pyplot.hist(). Strengths: Offers more customization over the plot. Weaknesses: Requires slightly more code than the DataFrame.plot.hist() method.
  • Method 3: Using seaborn.histplot(). Strengths: Excellent for additional statistical information and advanced plot styling. Weaknesses: Requires an additional library and might be overkill for simple histograms.
  • Method 4: Using pandas.cut() and DataFrame.plot.bar(). Strengths: Allows for maximum control and custom binning. Weaknesses: More convoluted than direct histogram methods.
  • Method 5: Quick Plot with pandas.DataFrame.hist(). Strengths: Fast and highly convenient for a quick overview. Weaknesses: Least customizable and can be limiting for detailed analysis.