Pandas DataFrame Computations & Descriptive Stats – Part 3

The Pandas DataFrame has several methods concerning Computations and Descriptive Stats. When applied to a DataFrame, these methods evaluate the elements and return the results.

  • Part 1 focuses on the DataFrame methods abs(), all(), any(), clip(), corr(), and corrwith().
  • Part 2 focuses on the DataFrame methods count(), cov(), cummax(), cummin(), cumprod(), and cumsum().
  • Part 3 focuses on the DataFrame methods describe(), diff(), eval(), and kurtosis().
  • Part 4 focuses on the DataFrame methods mad(), min(), max(), mean(), median(), and mode().
  • Part 5 focuses on the DataFrame methods pct_change(), quantile(), rank(), round(), prod(), and product().
  • Part 6 focuses on the DataFrame methods add_prefix(), add_suffix(), and align().
  • Part 7 focuses on the DataFrame methods at_time(), between_time(), drop(), drop_duplicates() and duplicated().
  • Part 8 focuses on the DataFrame methods equals(), filter(), first(), last(), head(), and tail().

Getting Started

Remember to add the Required Starter Code to the top of each code snippet. This snippet will allow the code in this article to run error-free.

Required Starter Code

import pandas as pd
import numpy as np 

Before any data manipulation can occur, two new libraries will require installation.

  • The pandas library enables access to/from a DataFrame.
  • The numpy library supports multi-dimensional arrays and matrices in addition to a collection of mathematical functions.

To install these libraries, navigate to an IDE terminal and execute the commands below at the command prompt. The terminal used in this example shows a dollar sign ($) as its prompt. Your terminal prompt may be different.

$ pip install pandas

Hit the <Enter> key on the keyboard to start the installation process.

$ pip install numpy

Hit the <Enter> key on the keyboard to start the installation process.

If the installations were successful, a message displays in the terminal indicating the same.
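
As a quick additional check, the installed versions can be printed from Python. This is a minimal sketch that assumes both libraries were installed into the interpreter that runs the script.

```python
import pandas as pd
import numpy as np

# Each library exposes its version string through __version__.
print('pandas:', pd.__version__)
print('numpy:', np.__version__)
```

If either import fails, the library is not installed for the interpreter in use.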

DataFrame describe()

The describe() method generates descriptive statistics for numeric and object Series, as well as for DataFrame column sets of mixed data types.

The syntax for this method is as follows:

DataFrame.describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)
Parameter | Description
percentiles | The percentiles to include in the output. All values should be between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles. This parameter accepts a list-like of numbers and is optional.
include | A white list of data types to include in the result. Ignored for Series. Below are the available options.
– 'all': All columns of the input will be included in the output.
– A list-like of dtypes: Limits the results to the provided data types.
– To limit the result to numeric types, submit numpy.number.
– To limit it instead to object columns, submit the object data type. The pandas docs write this as numpy.object, but that alias was removed in NumPy 1.24, so plain object or 'O' is the safer spelling.
– Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'.
exclude | A black list of data types to omit from the result. Ignored for Series.
– To exclude numeric data types, submit numpy.number.
– To exclude object columns, submit the object data type (see the note on numpy.object above).
– Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])).
– To exclude pandas categorical columns, use 'category'.
datetime_is_numeric | Whether to treat datetime columns as numeric. By default, this parameter is False. This parameter belongs to pandas 1.x and was removed in pandas 2.0, where datetime data is always summarized as numeric.

Also, consider this table from the docs:

Numeric Data | For numeric data, the result’s index will include count, mean, std, min, and max, as well as the lower, 50th, and upper percentiles. By default, the lower percentile is 25 and the upper percentile is 75. The 50th percentile is the same as the median.
Object Data | For object data (strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The frequency (freq) is the most common value’s frequency. Timestamps also include the first and last items.
Multiple Object Values | If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.
Mixed Data Types | For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the DataFrame consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.
Include & Exclude | These parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.

For this example, the same Teams DataFrame referred to in Part 2 of this series is used. The DataFrame below holds the stats for four (4) hockey teams. Each row is one stat: wins, losses, and ties.

df_teams = pd.DataFrame({'Bruins':  [4, 5, 9],
                         'Oilers':   [3, 6, 10],
                         'Leafs':    [2, 7, 11],
                         'Flames': [1, 8, 12]})

result = df_teams.describe().apply(lambda x: round(x, 2))
print(result)
  • Line [1-4] creates a DataFrame from a Dictionary of Lists and saves it to df_teams.
  • Line [6] uses the describe() method to retrieve additional analytical information. Using a lambda, it then rounds the output to two (2) decimal places and saves it to the result variable.
  • Line [7] outputs the result to the terminal.

Output:

       Bruins  Oilers  Leafs  Flames
count    3.00    3.00   3.00    3.00
mean     6.00    6.33   6.67    7.00
std      2.65    3.51   4.51    5.57
min      4.00    3.00   2.00    1.00
25%      4.50    4.50   4.50    4.50
50%      5.00    6.00   7.00    8.00
75%      7.00    8.00   9.00   10.00
max      9.00   10.00  11.00   12.00
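
The percentiles, include, and exclude parameters described earlier can be sketched as follows. This is an illustrative example only: the df_mixed DataFrame and its 'Stat' label column are assumptions added here so that the data contains a non-numeric column.

```python
import numpy as np
import pandas as pd

# Hypothetical variant of the Teams data: the 'Stat' column is an
# assumption added for illustration, not part of the original example.
df_mixed = pd.DataFrame({'Stat':   ['wins', 'losses', 'ties'],
                         'Bruins': [4, 5, 9],
                         'Oilers': [3, 6, 10]})

# Request the 10th and 90th percentiles instead of the default quartiles.
# With no include/exclude, only the numeric columns are analyzed.
custom = df_mixed.describe(percentiles=[0.1, 0.9])

# Exclude numeric columns so that only the text column is summarized.
text_only = df_mixed.describe(exclude=[np.number])

# include='all' summarizes every column; statistics that do not
# apply to a column appear as NaN.
everything = df_mixed.describe(include='all')

print(custom)
print(text_only)
print(everything)
```

Excluding numpy.number rather than including the object type keeps the sketch independent of how a given pandas version stores string columns.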

DataFrame diff()

The diff() method calculates the difference between each DataFrame element and another element in the same DataFrame. By default, that other element is the one in the previous row.

The syntax for this method is as follows:

DataFrame.diff(periods=1, axis=0)
Parameter | Description
periods | The number of periods to shift when calculating differences. This parameter accepts negative values. The default is 1.
axis | If zero (0) or index is selected, take the difference down each column (row to row). The default is 0. If one (1) or columns is selected, take the difference across each row (column to column).

Code – Example 1

This example reflects the difference in regard to the previous row.

df_teams = pd.DataFrame({'Bruins':  [4, 5, 9],
                         'Oilers':   [3, 6, 10],
                         'Leafs':    [2, 7, 11],
                         'Flames': [1, 8, 12]})

result = df_teams.diff()
print(result)
  • Line [1-4] creates a DataFrame from a Dictionary of Lists and saves it to df_teams.
  • Line [6] uses the diff() method to determine the difference from the previous row and saves to the result variable.
  • Line [7] outputs the result to the terminal.

Output:

   Bruins  Oilers  Leafs  Flames
0     NaN     NaN    NaN     NaN
1     1.0     3.0    5.0     7.0
2     4.0     4.0    4.0     4.0

Code – Example 2

This example reflects the difference in regard to the previous column.

df_teams = pd.DataFrame({'Bruins':  [4, 5, 9],
                         'Oilers':   [3, 6, 10],
                         'Leafs':    [2, 7, 11],
                         'Flames': [1, 8, 12]})

result = df_teams.diff(axis=1)
print(result)
  • Line [1-4] creates a DataFrame from a Dictionary of Lists and saves it to df_teams.
  • Line [6] uses the diff() method to determine the difference from the previous column and saves to the result variable.
  • Line [7] outputs the result to the terminal.

Output:

   Bruins  Oilers  Leafs  Flames
0     NaN      -1     -1      -1
1     NaN       1      1       1
2     NaN       1      1       1

Code – Example 3

This example sets periods=1 explicitly. Because 1 is the default, each row is again compared with the previous row, and the result matches Example 1.

df_teams = pd.DataFrame({'Bruins':  [4, 5, 9],
                         'Oilers':   [3, 6, 10],
                         'Leafs':    [2, 7, 11],
                         'Flames': [1, 8, 12]})

result = df_teams.diff(periods=1)
print(result)
  • Line [1-4] creates a DataFrame from a Dictionary of Lists and saves it to df_teams.
  • Line [6] uses the diff() method with periods set to 1 to determine the difference from the previous row and saves it to the result variable.
  • Line [7] outputs the result to the terminal.

Output:

   Bruins  Oilers  Leafs  Flames
0     NaN     NaN    NaN     NaN
1     1.0     3.0    5.0     7.0
2     4.0     4.0    4.0     4.0
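
Other values of periods change which row is used for the comparison. The sketch below is illustrative only and reuses the Teams data. It tries a negative value and a value of 2, then checks that diff() gives the same result as subtracting a shifted copy of the DataFrame.

```python
import pandas as pd

df_teams = pd.DataFrame({'Bruins': [4, 5, 9],
                         'Oilers': [3, 6, 10],
                         'Leafs':  [2, 7, 11],
                         'Flames': [1, 8, 12]})

# periods=-1 compares each row with the row that follows it,
# so the last row has nothing to compare against and becomes NaN.
next_row = df_teams.diff(periods=-1)

# periods=2 compares each row with the row two positions earlier,
# so the first two rows become NaN.
two_back = df_teams.diff(periods=2)

# The default diff() matches subtracting a copy shifted down by one row.
manual = df_teams - df_teams.shift(1)
same = df_teams.diff().equals(manual)

print(next_row)
print(two_back)
print(same)
```

Thinking of diff() as "original minus shifted copy" also explains where the NaN values come from: they mark the positions where the shifted copy has no data.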

DataFrame eval()

The eval() method evaluates a string that describes an operation on DataFrame columns. It operates on whole columns only, not on specific rows or elements. Because eval() executes the string it receives, it can run arbitrary code.

🛑 Note: This can make the code vulnerable to code injection if you pass user input to this method.

The syntax for this method is as follows:

DataFrame.eval(expr, inplace=False, **kwargs)
Parameter | Description
expr | The string expression to evaluate.
inplace | If the expression contains an assignment, this determines whether to perform the operation inplace and mutate the existing DataFrame. Otherwise, a new DataFrame is returned. By default, this parameter is False.
**kwargs | Additional keyword arguments. See the pandas documentation for eval() for details.

For this example, the Hockey Teams Bruins and Oilers stats will be added together.

df_teams = pd.DataFrame({'Bruins':  [4, 5, 9],
                         'Oilers':   [3, 6, 10],
                         'Leafs':    [2, 7, 11],
                         'Flames': [1, 8, 12]})

result = df_teams.eval('Bruins + Oilers')
print(result)
  • Line [1-4] creates a DataFrame from a Dictionary of Lists and saves it to df_teams.
  • Line [6] uses the eval() method to evaluate the calculation and saves to the result variable.
  • Line [7] outputs the result to the terminal.

Output:

0     7
1    11
2    19
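
The inplace parameter only matters when the expression contains an assignment. The following sketch is illustrative only: the 'Total' column name is an assumption chosen for this example.

```python
import pandas as pd

df_teams = pd.DataFrame({'Bruins': [4, 5, 9],
                         'Oilers': [3, 6, 10],
                         'Leafs':  [2, 7, 11],
                         'Flames': [1, 8, 12]})

# With the default inplace=False, an assignment expression returns a
# new DataFrame that includes the extra column. df_teams is unchanged.
with_total = df_teams.eval('Total = Bruins + Oilers')

# With inplace=True, the DataFrame itself is modified and eval() returns None.
df_copy = df_teams.copy()
returned = df_copy.eval('Total = Bruins + Oilers', inplace=True)

print(with_total)
print('Total' in df_teams.columns)
print(returned)
```

Working on a copy, as above, is one way to try inplace=True without losing the original data.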

DataFrame kurt() and kurtosis()

The DataFrame kurt() and kurtosis() methods are identical and return the unbiased kurtosis over a requested axis. Kurtosis describes how heavy the tails of a distribution are. For additional information, see the pandas documentation.

Parameter | Description
axis | If zero (0) or index is selected, apply the function to each column. Default is None. If one (1) is selected, apply the function to each row.
skipna | Exclude NA/null values when computing the result. By default, True.
level | If the axis is a MultiIndex, count along a particular level, collapsing into a Series. By default, the value is None.
numeric_only | Include only float, integer, and boolean columns. If None, this parameter will attempt to use everything.
**kwargs | Additional keyword arguments to be passed to the method.

For this example, the Hockey Teams data is used.

df_teams = pd.DataFrame({'Bruins':  [4, 5, 9],
                         'Oilers':   [3, 6, 10],
                         'Leafs':    [2, 7, 11],
                         'Flames': [1, 8, 12]})

result = df_teams.kurtosis()
print(result)
  • Line [1-4] creates a DataFrame from a Dictionary of Lists and saves it to df_teams.
  • Line [6] uses the kurtosis() method to calculate the kurtosis of each column and saves it to the result variable.
  • Line [7] outputs the result to the terminal.

Output:

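Bruins   NaN
Oilers   NaN
Leafs    NaN
Flames   NaN
dtype: float64

Every value is NaN because the unbiased kurtosis estimate needs at least four observations per column, and each team here has only three.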
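
With four or more rows per column, kurtosis() returns numeric values. The sketch below is illustrative only. The fourth row of stats is made up for this example, and manual_kurtosis() is a hypothetical helper that applies the standard sample-size-corrected excess kurtosis formula so its result can be compared with the pandas output.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the fourth value in each list is an assumption
# added so that every column has at least four observations.
df_teams = pd.DataFrame({'Bruins': [4, 5, 9, 2],
                         'Oilers': [3, 6, 10, 1],
                         'Leafs':  [2, 7, 11, 0],
                         'Flames': [1, 8, 12, 3]})

def manual_kurtosis(values):
    """Hypothetical helper: unbiased excess kurtosis for n >= 4 values."""
    x = np.asarray(values, dtype=float)
    n = x.size
    dev = x - x.mean()
    m2 = (dev ** 2).sum()   # sum of squared deviations
    m4 = (dev ** 4).sum()   # sum of fourth-power deviations
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return n * (n + 1) * (n - 1) * m4 / ((n - 2) * (n - 3) * m2 ** 2) - correction

result = df_teams.kurtosis()
print(result)

# kurt() and kurtosis() return the same values.
print(df_teams.kurt().equals(result))

# The pandas value for one column agrees with the manual formula.
print(np.isclose(result['Bruins'], manual_kurtosis(df_teams['Bruins'])))
```

The (n - 2) and (n - 3) terms in the denominator show why three observations are not enough: with n = 3, the formula divides by zero.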