**💡 Problem Formulation:** When working with data in Python, Pandas DataFrames are a common structure to store tabular data. Often, a quick summary of the statistics for each column in a DataFrame helps provide insights. As a Python data analyst, you might have a DataFrame containing multiple rows and columns and wish to find a collective summary, such as count, mean, standard deviation, min, and max for each numerical column. This article explores five different ways to achieve this.

## Method 1: Using the `describe()` Function

The `describe()` function in Pandas is a convenient tool to get a quick overview of the statistical summaries for each numeric column in a DataFrame. It generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values. The function returns a DataFrame with summaries, including count, mean, std, min, max, and percentile values.

Here’s an example:

```
import pandas as pd

# Sample DataFrame
data = {'scores': [88, 92, 100, 85, 90, 87],
        'time_spent': [43, 45, 50, 40, 42, 39]}
df = pd.DataFrame(data)

# Using describe() to summarize statistics
summary = df.describe()
print(summary)
```

Output:

```
           scores  time_spent
count    6.000000    6.000000
mean    90.333333   43.166667
std      5.206640    3.904949
min     85.000000   39.000000
25%     87.250000   40.500000
50%     89.000000   42.500000
75%     91.500000   44.750000
max    100.000000   50.000000
```

The `describe()` function quickly provided us with a statistical summary table, where we can easily compare metrics like mean scores and time spent on a sample activity. It's particularly useful for getting a broad picture of the data range and distribution without manually computing each statistic.
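If the default quartiles are not the percentiles you need, `describe()` also accepts a `percentiles` parameter. The following sketch reuses the same sample data to request the 10th and 90th percentiles instead:

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({'scores': [88, 92, 100, 85, 90, 87],
                   'time_spent': [43, 45, 50, 40, 42, 39]})

# Request the 10th and 90th percentiles instead of the default quartiles
summary = df.describe(percentiles=[0.1, 0.9])
print(summary)
```

Note that the median (50%) is always included in the output alongside whatever percentiles you request.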

## Method 2: Using the `info()` Function

The `info()` function in Pandas is typically used to get a concise summary of a DataFrame. While it is not a statistical summary in the strict sense, it provides essential information like the number of non-null entries, data type of each column, and memory usage, which can be invaluable for preliminary data analysis.

Here’s an example:

```
# Using info() to print a concise summary of the DataFrame
# (note: info() prints directly and returns None, so there is
# no need to assign its result to a variable)
df.info()
```

Output:

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   scores      6 non-null      int64
 1   time_spent  6 non-null      int64
dtypes: int64(2)
memory usage: 224.0 bytes
```

The `info()` function provides an immediate check for missing values and the data type of each column. This can be very important before you start any sort of data manipulation or statistical analysis.
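When `info()` hints at missing values, a common follow-up is to count them explicitly per column with `isna().sum()`. A minimal sketch, using a hypothetical variant of the sample data with one missing score:

```python
import pandas as pd
import numpy as np

# Hypothetical frame with one missing value in 'scores'
df = pd.DataFrame({'scores': [88, 92, np.nan, 85],
                   'time_spent': [43, 45, 50, 40]})

# Count missing values per column
missing = df.isna().sum()
print(missing)
```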

## Method 3: Using Aggregate Functions

Aggregate functions in Pandas, such as `mean()`, `std()`, and `sum()`, allow you to compute specific statistics for each column. You can apply multiple aggregate functions at once using the `agg()` method to get a summary with selected statistics.

Here's an example:

```
# Applying multiple aggregate functions
aggregated_stats = df.agg(['mean', 'std', 'min', 'max'])
# Displaying the aggregated statistics
print(aggregated_stats)
```

Output:

```
          scores  time_spent
mean   90.333333   43.166667
std     5.206640    3.904949
min    85.000000   39.000000
max   100.000000   50.000000
```

The `agg()` method provides flexibility in selecting and calculating only the statistics that are relevant to your analysis, which helps to keep the summary focused and concise.
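`agg()` also accepts a dict mapping column names to lists of functions, which lets you request different statistics per column. A short sketch on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({'scores': [88, 92, 100, 85, 90, 87],
                   'time_spent': [43, 45, 50, 40, 42, 39]})

# Different statistics per column via a dict;
# cells for statistics not requested for a column are filled with NaN
per_column = df.agg({'scores': ['mean', 'max'], 'time_spent': ['min', 'std']})
print(per_column)
```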

## Method 4: Using the `value_counts()` Function

The `value_counts()` function in Pandas is used for categorical data to count the frequency of each unique value in a column. This function can offer a different perspective on data by highlighting the distribution of categorical variables.

Here's an example:

```
categories = ['High', 'Medium', 'Low', 'Medium', 'High', 'Low']
df['performance'] = categories
# Using value_counts() to summarize the distribution of a categorical column
category_distribution = df['performance'].value_counts()
print(category_distribution)
```

Output:

```
Medium    2
High      2
Low       2
Name: performance, dtype: int64
```

The `value_counts()` function is straightforward and very useful for understanding the frequency and distribution of categorical data within a DataFrame, which is often required before applying more complex statistical techniques.
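If relative frequencies are more useful than raw counts, `value_counts()` takes a `normalize=True` argument. A small sketch on the same categorical data:

```python
import pandas as pd

performance = pd.Series(['High', 'Medium', 'Low', 'Medium', 'High', 'Low'])

# normalize=True turns counts into proportions that sum to 1
freqs = performance.value_counts(normalize=True)
print(freqs)
```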

## Bonus One-Liner Method 5: Using the `apply()` Function

The `apply()` function in Pandas can be used to apply a function along an axis of the DataFrame. A quick one-liner to summarize statistics for each column could involve using `apply()` with a lambda function.

Here's an example:

```
# Using apply() with a lambda to summarize the numeric columns
# (select_dtypes skips the string 'performance' column added in Method 4,
# which would otherwise break the numeric computations)
one_liner_summary = df.select_dtypes('number').apply(lambda x: {'mean': x.mean(), 'std': x.std()})
# Displaying the summary
print(one_liner_summary)
```

Output:

```
scores {'mean': 90.33333333333333, 'std': 5.206640...
time_spent {'mean': 43.166666666666664, 'std': 3.90494...
dtype: object
```

This one-liner with `apply()` offers a quick and customizable way to calculate and view a selection of statistics of interest across all columns.
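If the dict-valued output above is hard to read, returning a `pd.Series` from the lambda instead makes `apply()` build a proper table with labeled rows. A small sketch, rebuilding the sample data so the snippet is self-contained:

```python
import pandas as pd

df = pd.DataFrame({'scores': [88, 92, 100, 85, 90, 87],
                   'time_spent': [43, 45, 50, 40, 42, 39]})

# Returning a Series per column yields a DataFrame with 'mean' and 'std' rows
summary = df.apply(lambda x: pd.Series({'mean': x.mean(), 'std': x.std()}))
print(summary)
```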

## Summary/Discussion

- **Method 1:** `describe()`. Provides a comprehensive statistical summary. May include more information than necessary for some purposes.
- **Method 2:** `info()`. Useful for a data-type and non-null-count overview. Does not provide statistical measures like mean and std.
- **Method 3:** Aggregate functions. Highly customizable. You must specify each statistic of interest.
- **Method 4:** `value_counts()`. Ideal for summarizing the distribution of categorical data. Not for numerical statistics.
- **Method 5:** `apply()`. Versatile and concise, perfect for customized statistics. More complex to read and understand than the other methods.
