5 Best Ways to Create a Frequency Plot in Python Pandas DataFrame Using Matplotlib

💡 Problem Formulation: When working with categorical data in a Pandas DataFrame, plotting how often each category occurs is a fast way to understand the data. For instance, suppose you have a DataFrame containing the favorite fruits of a group of people. The desired output is a frequency plot showing how many times each fruit was chosen as a favorite.

Method 1: Using value_counts() and plot()

One of the most straightforward ways to create a frequency plot is to call the value_counts() method in Pandas, which returns a Series containing the count of each unique value, and then plot that Series. You can plot it either through the Series’ own plot() method, which draws with Matplotlib under the hood, or by passing the counts to Matplotlib’s plt.bar() directly, as the example here does.

Here’s an example:

import pandas as pd
import matplotlib.pyplot as plt

# Create a sample DataFrame
df = pd.DataFrame({'favorite_fruit': ['apple', 'banana', 'orange', 'apple', 'banana', 'apple']})

# Count the occurrences of each fruit and create a bar plot
fruit_counts = df['favorite_fruit'].value_counts()
plt.bar(fruit_counts.index, fruit_counts.values)
plt.xlabel('Fruits')
plt.ylabel('Frequency')
plt.title('Frequency of Favorite Fruits')
plt.show()

In this code snippet, the DataFrame df contains a column of favorite fruits. The value_counts() method computes the number of occurrences of each unique value, and plt.bar() plots a bar chart using this data. The x-axis shows the fruits, and the y-axis shows their respective frequencies.
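The same counts can also be drawn through the Series’ own plot() method, which matches the method name in the heading. The following is a minimal sketch reusing the sample data above; the Agg backend line is an assumption made so the snippet runs without a display, and it can be removed for interactive use.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'favorite_fruit': ['apple', 'banana', 'orange', 'apple', 'banana', 'apple']})

# value_counts() returns a Series sorted by count, highest first
fruit_counts = df['favorite_fruit'].value_counts()

# Series.plot(kind='bar') draws with Matplotlib and returns the Axes object
ax = fruit_counts.plot(kind='bar', rot=0)
ax.set_xlabel('Fruits')
ax.set_ylabel('Frequency')
ax.set_title('Frequency of Favorite Fruits')
# plt.show()  # uncomment to display the figure interactively
```

Because plot() returns a Matplotlib Axes, labels and titles can be set on that object instead of through the global plt functions.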

Method 2: Using groupby() and size()

The combination of groupby() and size() can be used for customized grouping operations before plotting. This is especially useful when handling multiple categories or when preprocessing is required before plotting the frequency.

Here’s an example:

import pandas as pd
import matplotlib.pyplot as plt

# Create a sample DataFrame
df = pd.DataFrame({'favorite_fruit': ['apple', 'banana', 'orange', 'apple', 'banana', 'apple'],
                   'person_id': [1, 2, 3, 4, 5, 6]})

# Group by 'favorite_fruit' and calculate size, then plot
grouped_fruit_counts = df.groupby('favorite_fruit').size()
plt.bar(grouped_fruit_counts.index, grouped_fruit_counts.values)
plt.xlabel('Fruits')
plt.ylabel('Frequency')
plt.title('Frequency of Favorite Fruits by Group')
plt.show()

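The groupby() method groups the DataFrame by the ‘favorite_fruit’ column, and size() returns the number of rows in each group, with the groups sorted by label. The ‘person_id’ column is not used for counting here; it only shows that the DataFrame can hold other columns. The bar chart is then plotted as in Method 1, but the same pattern still works after filtering, cleaning, or grouping by more than one column.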
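As an illustration of preprocessing before counting, the sketch below cleans inconsistent labels and then groups. The messy sample values and the variable names (raw, cleaned, counts) are made up for this example, and the Agg backend line is only there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical raw data with inconsistent case and stray whitespace
raw = pd.DataFrame({'favorite_fruit': ['Apple', 'banana ', 'orange', 'apple', 'Banana', ' APPLE']})

# Preprocess: strip whitespace and lowercase so variants fall into one group
cleaned = raw['favorite_fruit'].str.strip().str.lower()

# Group on the cleaned labels and count the rows in each group
counts = raw.assign(favorite_fruit=cleaned).groupby('favorite_fruit').size()

fig, ax = plt.subplots()
ax.bar(counts.index, counts.values)
ax.set_xlabel('Fruits')
ax.set_ylabel('Frequency')
ax.set_title('Frequency of Favorite Fruits (cleaned labels)')
# plt.show()  # uncomment to display the figure interactively
```

Without the cleaning step, ‘Apple’, ‘apple’ and ‘ APPLE’ would be counted as three separate categories.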

Method 3: Using crosstab()

The crosstab() function in pandas can be used to create a frequency plot when you want to compare the frequency distribution across multiple categories or factors. It provides a cross-tabulation of two (or more) factors.

Here’s an example:

import pandas as pd
import matplotlib.pyplot as plt

# Create a sample DataFrame
df = pd.DataFrame({'favorite_fruit': ['apple', 'banana', 'orange', 'apple', 'banana', 'apple'],
                   'person_gender': ['F', 'M', 'F', 'M', 'F', 'M']})

# Use crosstab to compare fruit preference by gender
fruit_gender_ct = pd.crosstab(df['favorite_fruit'], df['person_gender'])
fruit_gender_ct.plot(kind='bar')
plt.xlabel('Fruits')
plt.ylabel('Frequency')
plt.title('Favorite Fruit Frequency by Gender')
plt.show()

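The output is a grouped bar chart that displays the frequency of each favorite fruit broken down by gender. crosstab() returns a DataFrame with one row per fruit and one column per gender, so calling plot(kind='bar') on it draws one bar per cell. This makes it a convenient way to examine how two categorical variables relate.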
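To see what crosstab() actually produces, it can help to print the table before plotting it. This sketch reuses the sample data above and, as one possible variation, draws the bars stacked instead of side by side; the Agg backend line is an assumption so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'favorite_fruit': ['apple', 'banana', 'orange', 'apple', 'banana', 'apple'],
                   'person_gender': ['F', 'M', 'F', 'M', 'F', 'M']})

# Rows are fruits, columns are genders, cells are counts
ct = pd.crosstab(df['favorite_fruit'], df['person_gender'])
print(ct)

# stacked=True puts the gender counts for each fruit on top of each other
ax = ct.plot(kind='bar', stacked=True, rot=0)
ax.set_xlabel('Fruits')
ax.set_ylabel('Frequency')
ax.set_title('Favorite Fruit Frequency by Gender (stacked)')
# plt.show()  # uncomment to display the figure interactively
```

In the stacked version the total bar height is the overall frequency of each fruit, while the colored segments show the split by gender.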

Method 4: Using hist() Method of DataFrame

For numerical data, the hist() method, available on both DataFrame and Series objects, plots a histogram directly. It divides the values into bins and shows how many values fall into each one, which makes it useful for understanding the distribution of a numerical column.

Here’s an example:

import pandas as pd
import matplotlib.pyplot as plt

# Create a sample DataFrame with numerical data
df = pd.DataFrame({'data_values': [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]})

# Plot histogram of the 'data_values' directly
df['data_values'].hist(bins=[1,2,3,4,5], edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Frequency Distribution of Data Values')
plt.show()

The histogram shows the frequency distribution of the ‘data_values’ column using custom bin edges. With edges 1, 2, 3, 4, 5, each integer falls into its own bin, so the bars have heights 1, 2, 3 and 4. The hist() method is a quick way to get a visual summary of numerical data.
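With integer data, one optional variation is to center the bins on the integers so each bar sits directly over its tick. The sketch below assumes half-integer bin edges and uses numpy.histogram to check the counts numerically; the variable names and the Agg backend line are assumptions for this illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 4, 4], name='data_values')

# Half-integer edges (0.5, 1.5, ..., 4.5) center each bin on an integer
edges = np.arange(0.5, 5.0, 1.0)

# Compute the per-bin counts without plotting, as a sanity check
counts, _ = np.histogram(values, bins=edges)

ax = values.hist(bins=edges, edgecolor='black')
ax.set_xticks([1, 2, 3, 4])
ax.set_xlabel('Value')
ax.set_ylabel('Frequency')
ax.set_title('Frequency Distribution of Data Values')
# plt.show()  # uncomment to display the figure interactively
```

The counts are the same as with integer edges; only the horizontal position of the bars changes.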

Bonus One-Liner Method 5: Using seaborn.countplot()

For a quick and stylish frequency plot, the countplot() function from the Seaborn library, which operates on categorical data, is very handy. It automatically counts the frequency of categories and displays them in a bar plot.

Here’s an example:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Create a sample DataFrame
df = pd.DataFrame({'favorite_fruit': ['apple', 'banana', 'orange', 'apple', 'banana', 'apple']})

# Plot using Seaborn countplot
sns.countplot(x='favorite_fruit', data=df)
plt.xlabel('Fruits')
plt.ylabel('Frequency')
plt.title('Fruit Counts with Seaborn')
plt.show()

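Seaborn is a separate library built on top of Matplotlib, so countplot() is not a Matplotlib function, but the usual Matplotlib calls such as plt.xlabel() and plt.title() still work on its output. It counts the categories for you and produces a polished frequency plot that is often desired for presentations or reports.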
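By default the bars follow the order in which categories appear in the data. If you want them sorted by frequency, one option is to pass an explicit order, as in this sketch. It assumes Seaborn is installed, reuses the sample data above, and sets the Agg backend only so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({'favorite_fruit': ['apple', 'banana', 'orange', 'apple', 'banana', 'apple']})

# Category labels sorted from most to least frequent
order = df['favorite_fruit'].value_counts().index.tolist()

# countplot() returns a Matplotlib Axes, so it can be customized afterwards
ax = sns.countplot(data=df, x='favorite_fruit', order=order)
ax.set_xlabel('Fruits')
ax.set_ylabel('Frequency')
ax.set_title('Fruit Counts with Seaborn')
# plt.show()  # uncomment to display the figure interactively
```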

Summary/Discussion

  • Method 1: Using value_counts() and plot(). Strengths: Straightforward, built into pandas. Weaknesses: Basic, doesn’t support complex groupings.
  • Method 2: Using groupby() and size(). Strengths: Flexible, customizable. Weaknesses: May require more steps for simple frequency counts.
  • Method 3: Using crosstab(). Strengths: Comparing multiple categories. Weaknesses: Requires understanding of cross-tabulation concept.
  • Method 4: Using hist() Method of DataFrame. Strengths: Ideal for numerical data; straightforward for histograms. Weaknesses: Not suitable for categorical data without preprocessing.
  • Method 5: Using seaborn.countplot(). Strengths: Easy to use, stylish output. Weaknesses: Not part of Matplotlib; requires a separate Seaborn installation. Its defaults hide some details, although the Matplotlib Axes it returns can still be customized.