Pandas DataFrame describe(), diff(), eval(), kurtosis()

The Pandas DataFrame has several methods concerning Computations and Descriptive Stats. When applied to a DataFrame, these methods evaluate the elements and return the results.


Preparation

Before any data manipulation can occur, two (2) new libraries will require installation.

  • The Pandas library enables access to/from a DataFrame.
  • The NumPy library supports multi-dimensional arrays and matrices in addition to a collection of mathematical functions.

To install these libraries, navigate to an IDE terminal and execute the commands below at the command prompt. In the terminal used for this example, the prompt is a dollar sign ($). Your terminal prompt may be different.

$ pip install pandas

Hit the <Enter> key on the keyboard to start the installation process.

$ pip install numpy

Hit the <Enter> key on the keyboard to start the installation process.

If the installations were successful, a message displays in the terminal indicating the same.


Feel free to view the PyCharm installation guide for the required libraries.


Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.

import pandas as pd
import numpy as np 

DataFrame describe()

The describe() method generates descriptive statistics for numeric and object Series, as well as for DataFrame column sets of mixed data types.

The syntax for this method is as follows (source):

DataFrame.describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)
Parameter | Description
percentiles | The percentiles to include in the output. All values should be between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles. This parameter accepts a list-like of numbers and is optional.
include | This parameter is a list of data types to include. Ignored for Series. Below are the available options.
– 'all': All input columns will be included in the output.
– A list-like of dtypes: Limits the results to the provided data types.
– To limit the result to numeric types, submit numpy.number.
– To limit it instead to object columns, submit the object data type (the numpy.object alias was removed in NumPy 1.24).
– Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'.
exclude | This parameter is a list of data types to omit from the result. Ignored for Series.
– To exclude numeric data types, submit numpy.number.
– To exclude object columns, submit the object data type.
– Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])).
– To exclude pandas categorical columns, use 'category'.
datetime_is_numeric | This parameter determines whether datetime columns are treated as numeric. By default, this parameter is False. It was removed in pandas 2.0, where datetime data is always summarized as numeric.

Also, consider this table from the docs:

Numeric Data | For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default, the lower percentile is 25, and the upper percentile is 75. The 50 percentile is the same as the median.
Object Data | For object data (strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The frequency (freq) is the most common value’s frequency. Timestamps also include the first and last items.
Multiple Object Values | If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.
Mixed Data Types | For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the DataFrame consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.
Include & Exclude | These parameters can limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.
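The include, exclude, and percentiles options described above can be tried on a small mixed-type DataFrame. The following is a minimal sketch rather than part of the original example: the df_mixed data and its column names are hypothetical, and the city column is created with an explicit object dtype so the selection behaves the same across pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type data: numeric, object, and categorical columns.
df_mixed = pd.DataFrame({
    'wins': [4, 5, 9, 7],
    'city': pd.Series(['Boston', 'Edmonton', 'Toronto', 'Calgary'], dtype=object),
    'division': pd.Categorical(['East', 'West', 'East', 'West']),
})

# Default: only the numeric column is summarized.
numeric_summary = df_mixed.describe()

# Custom percentiles appear as extra rows labelled '10%' and '90%'.
custom = df_mixed.describe(percentiles=[0.1, 0.9])

# Restrict the summary to object columns or to categorical columns.
object_summary = df_mixed.describe(include=[object])
category_summary = df_mixed.describe(include=['category'])

# Drop numeric columns instead of naming the ones to keep.
non_numeric = df_mixed.describe(exclude=[np.number])

# include='all' returns the union of all statistics;
# statistics that do not apply to a column are NaN.
everything = df_mixed.describe(include='all')
print(everything)
```

Note that include and exclude only take effect on a DataFrame. On a Series they are ignored.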

For this example, the same Teams DataFrame referred to in Part 2 of this series is used. The DataFrame below displays four (4) Hockey Teams’ stats: wins, losses, and ties.

df_teams = pd.DataFrame({'Bruins':   [4, 5, 9],
                         'Oilers':   [3, 6, 10],
                         'Leafs':    [2, 7, 11],
                         'Flames':   [1, 8, 12]})

result = df_teams.describe().apply(lambda x:round(x,2))
print(result)
  • Line [1] creates a DataFrame from a Dictionary of Lists and saves it to df_teams.
  • Line [2] uses the describe() method to retrieve additional analytical information. Using a lambda, it then formats the output to two (2) decimal places and saves it to the result variable.
  • Line [3] outputs the result to the terminal.

Output

       Bruins  Oilers  Leafs  Flames
count    3.00    3.00   3.00    3.00
mean     6.00    6.33   6.67    7.00
std      2.65    3.51   4.51    5.57
min      4.00    3.00   2.00    1.00
25%      4.50    4.50   4.50    4.50
50%      5.00    6.00   7.00    8.00
75%      7.00    8.00   9.00   10.00
max      9.00   10.00  11.00   12.00

Click here to see additional examples.
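As an aside, the lambda used for rounding in Line [2] is not the only option. The sketch below assumes the same Teams data and compares the lambda approach with the built-in DataFrame round() method; the variable names with_lambda and with_round are illustrative only.

```python
import pandas as pd

df_teams = pd.DataFrame({'Bruins': [4, 5, 9],
                         'Oilers': [3, 6, 10],
                         'Leafs':  [2, 7, 11],
                         'Flames': [1, 8, 12]})

# Rounding with a lambda applied to each column, as in the example above.
with_lambda = df_teams.describe().apply(lambda x: round(x, 2))

# The same rounding through DataFrame.round().
with_round = df_teams.describe().round(2)

# Compare the two results element by element.
print(with_round.equals(with_lambda))
```

Either form should give the same table, so the choice is mostly a matter of readability.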


DataFrame diff()

The diff() method calculates the difference between a DataFrame element compared with another element in the same DataFrame. The default is the element in the previous row.

The syntax for this method is as follows:

DataFrame.diff(periods=1, axis=0)
Parameter | Description
axis | If zero (0) or index is selected, apply to each column. Default 0. If one (1), apply to each row.
periods | The periods to shift for calculating differences. This parameter accepts negative values.
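One way to read the periods parameter is as a shift followed by a subtraction. The sketch below, using the Teams data from this article, compares diff(periods=2) with subtracting a shifted copy of the DataFrame. The names by_diff and by_shift are illustrative.

```python
import pandas as pd

df_teams = pd.DataFrame({'Bruins': [4, 5, 9],
                         'Oilers': [3, 6, 10],
                         'Leafs':  [2, 7, 11],
                         'Flames': [1, 8, 12]})

# Each row minus the row two positions earlier.
by_diff = df_teams.diff(periods=2)

# The same calculation spelled out with shift().
by_shift = df_teams - df_teams.shift(2)

print(by_diff)
print(by_diff.equals(by_shift))
```

The first two rows have no row two positions earlier, so they contain NaN in both results.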

Code – Example 1

This example reflects the difference in regard to the previous row.

df_teams = pd.DataFrame({'Bruins':   [4, 5, 9],
                         'Oilers':   [3, 6, 10],
                         'Leafs':    [2, 7, 11],
                         'Flames':   [1, 8, 12]})

result = df_teams.diff()
print(result)
  • Line [1] creates a DataFrame from a Dictionary of Lists and saves it to df_teams.
  • Line [2] uses the diff() method to determine the difference from the previous row and saves it to the result variable.
  • Line [3] outputs the result to the terminal.

Output

   Bruins  Oilers  Leafs  Flames
0     NaN     NaN    NaN     NaN
1     1.0     3.0    5.0     7.0
2     4.0     4.0    4.0     4.0

Code – Example 2

This example reflects the difference in regard to the previous column.

df_teams = pd.DataFrame({'Bruins':   [4, 5, 9],
                         'Oilers':   [3, 6, 10],
                         'Leafs':    [2, 7, 11],
                         'Flames':   [1, 8, 12]})

result = df_teams.diff(axis=1)
print(result)
  • Line [1] creates a DataFrame from a Dictionary of Lists and saves it to df_teams.
  • Line [2] uses the diff() method to determine the difference from the previous column and saves it to the result variable.
  • Line [3] outputs the result to the terminal.

Output

   Bruins  Oilers  Leafs  Flames
0     NaN      -1     -1      -1
1     NaN       1      1       1
2     NaN       1      1       1

Code – Example 3

This example sets the periods parameter explicitly. Because periods=1 is the default, the output matches Example 1.

df_teams = pd.DataFrame({'Bruins':   [4, 5, 9],
                         'Oilers':   [3, 6, 10],
                         'Leafs':    [2, 7, 11],
                         'Flames':   [1, 8, 12]})

result = df_teams.diff(periods=1)
print(result)
  • Line [1] creates a DataFrame from a Dictionary of Lists and saves it to df_teams.
  • Line [2] uses the diff() method with periods set to 1 to determine the difference from the previous row and saves it to the result variable.
  • Line [3] outputs the result to the terminal.

Output

   Bruins  Oilers  Leafs  Flames
0     NaN     NaN    NaN     NaN
1     1.0     3.0    5.0     7.0
2     4.0     4.0    4.0     4.0
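Since periods also accepts negative values, the comparison can run against the following row instead of the previous one. This is a small sketch on the same Teams data; the variable name forward is illustrative.

```python
import pandas as pd

df_teams = pd.DataFrame({'Bruins': [4, 5, 9],
                         'Oilers': [3, 6, 10],
                         'Leafs':  [2, 7, 11],
                         'Flames': [1, 8, 12]})

# periods=-1 subtracts the NEXT row from each row,
# so the last row has nothing to compare against and becomes NaN.
forward = df_teams.diff(periods=-1)
print(forward)
```

Compared with the default, the NaN row moves from the top of the result to the bottom, and the signs of the differences flip.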

DataFrame eval()

The eval() method evaluates a string describing operations on DataFrame columns. It operates on whole columns only, not on specific rows or elements. Because the string is evaluated as an expression, eval() can run arbitrary code.

🛑 Note: This can make the code vulnerable to code injection if you pass user input to this method.

The syntax for this method is as follows:

DataFrame.eval(expr, inplace=False, **kwargs)
Parameter | Description
expr | This parameter is the string to evaluate.
inplace | If the expression contains an assignment, this determines whether to perform the operation in place and mutate the existing DataFrame. Otherwise, a new DataFrame is returned. By default, this parameter is False.
**kwargs | See the documentation here for details.

For this example, the Hockey Teams Bruins and Oilers stats will be added together.

df_teams = pd.DataFrame({'Bruins':   [4, 5, 9],
                         'Oilers':   [3, 6, 10],
                         'Leafs':    [2, 7, 11],
                         'Flames':   [1, 8, 12]})

result = df_teams.eval('Bruins + Oilers')
print(result)
  • Line [1] creates a DataFrame from a Dictionary of Lists and saves it to df_teams.
  • Line [2] uses the eval() method to evaluate the calculation and saves to the result variable.
  • Line [3] outputs the result to the terminal.

Output

0     7
1    11
2    19
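The inplace parameter only matters when the expression contains an assignment. The sketch below, which is not part of the original example, adds a hypothetical Total column to the Teams data with and without inplace, and also references a local variable (bonus, an illustrative name) using the @ prefix that eval() supports.

```python
import pandas as pd

df_teams = pd.DataFrame({'Bruins': [4, 5, 9],
                         'Oilers': [3, 6, 10],
                         'Leafs':  [2, 7, 11],
                         'Flames': [1, 8, 12]})

# An assignment returns a NEW DataFrame with the extra column;
# df_teams is unchanged because inplace defaults to False.
with_total = df_teams.eval('Total = Bruins + Oilers')

# Local Python variables can be referenced with the @ prefix.
bonus = 2
with_bonus = df_teams.eval('Bruins + @bonus')

# inplace=True mutates the DataFrame it is called on.
df_copy = df_teams.copy()
df_copy.eval('Total = Bruins + Oilers', inplace=True)

print(with_total)
print(with_bonus)
```

Working on a copy, as done here with df_copy, keeps the original DataFrame available for the other examples.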

DataFrame kurt() and kurtosis()

The DataFrame kurt() and kurtosis() methods are identical and return the unbiased kurtosis over the requested axis, using Fisher's definition (the kurtosis of a normal distribution is 0). For additional information on Kurtosis, click here.

Parameter | Description
axis | If zero (0) or index is selected, apply to each column. Default 0. If one (1), apply to each row.
skipna | Exclude NA/null values when computing the result. By default, True.
level | If the axis is a MultiIndex, count along a particular level, collapsing into a Series. By default, the value is None. This parameter was removed in pandas 2.0.
numeric_only | Include only float, integer, and boolean columns. If None, this parameter will attempt to use everything.
**kwargs | Additional keyword arguments to be passed to the method.

For this example, the Hockey Teams data is used.

df_teams = pd.DataFrame({'Bruins':   [4, 5, 9],
                         'Oilers':   [3, 6, 10],
                         'Leafs':    [2, 7, 11],
                         'Flames':   [1, 8, 12]})

result = df_teams.kurtosis()
print(result)
  • Line [1] creates a DataFrame from a Dictionary of Lists and saves it to df_teams.
  • Line [2] uses the kurtosis() method to calculate the kurtosis of each column and saves it to the result variable.
  • Line [3] outputs the result to the terminal.

Output

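Bruins   NaN
Oilers   NaN
Leafs    NaN
Flames   NaN
dtype: float64

Each column returns NaN because the unbiased kurtosis calculation requires at least four values, and each team has only three.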
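For comparison, here is a minimal sketch with five hypothetical values per team, which is enough for kurtosis() to return numbers. It also checks the result against the bias-corrected sample excess kurtosis formula written out by hand. The df_more data and the sample_excess_kurtosis helper are illustrative and not part of pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical data: five values per team, so each column has at least four.
df_more = pd.DataFrame({'Bruins': [4, 5, 9, 2, 7],
                        'Oilers': [3, 6, 10, 1, 8]})

result = df_more.kurtosis()


def sample_excess_kurtosis(values):
    """Bias-corrected sample excess kurtosis (Fisher definition, normal == 0)."""
    x = np.asarray(values, dtype=float)
    n = x.size
    dev = x - x.mean()
    m2 = (dev ** 2).sum()   # sum of squared deviations
    m4 = (dev ** 4).sum()   # sum of fourth-power deviations
    numerator = n * (n + 1) * (n - 1) * m4
    denominator = (n - 2) * (n - 3) * m2 ** 2
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return numerator / denominator - correction


# Apply the hand-written formula to each column and compare.
manual = df_more.apply(sample_excess_kurtosis)

print(result)
print(np.allclose(result.to_numpy(), manual.to_numpy()))
```

The (n - 3) term in the denominator is why fewer than four values cannot produce a result.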

Further Learning Resources

This is Part 3 of the DataFrame method series.

  • Part 1 focuses on the DataFrame methods abs(), all(), any(), clip(), corr(), and corrwith().
  • Part 2 focuses on the DataFrame methods count(), cov(), cummax(), cummin(), cumprod(), cumsum().
  • Part 3 focuses on the DataFrame methods describe(), diff(), eval(), kurtosis().
  • Part 4 focuses on the DataFrame methods mad(), min(), max(), mean(), median(), and mode().
  • Part 5 focuses on the DataFrame methods pct_change(), quantile(), rank(), round(), prod(), and product().
  • Part 6 focuses on the DataFrame methods add_prefix(), add_suffix(), and align().
  • Part 7 focuses on the DataFrame methods at_time(), between_time(), drop(), drop_duplicates() and duplicated().
  • Part 8 focuses on the DataFrame methods equals(), filter(), first(), last(), head(), and tail()
  • Part 9 focuses on the DataFrame methods equals(), filter(), first(), last(), head(), and tail()
  • Part 10 focuses on the DataFrame methods reset_index(), sample(), set_axis(), set_index(), take(), and truncate()
  • Part 11 focuses on the DataFrame methods backfill(), bfill(), fillna(), dropna(), and interpolate()
  • Part 12 focuses on the DataFrame methods isna(), isnull(), notna(), notnull(), pad() and replace()
  • Part 13 focuses on the DataFrame methods droplevel(), pivot(), pivot_table(), reorder_levels(), sort_values() and sort_index()
  • Part 14 focuses on the DataFrame methods nlargest(), nsmallest(), swaplevel(), stack(), unstack() and swapaxes()
  • Part 15 focuses on the DataFrame methods melt(), explode(), squeeze(), to_xarray(), T and transpose()
  • Part 16 focuses on the DataFrame methods append(), assign(), compare(), join(), merge() and update()
  • Part 17 focuses on the DataFrame methods asfreq(), asof(), shift(), slice_shift(), tshift(), first_valid_index(), and last_valid_index()
  • Part 18 focuses on the DataFrame methods resample(), to_period(), to_timestamp(), tz_localize(), and tz_convert()
  • Part 19 focuses on the visualization aspect of DataFrames and Series via plotting, such as plot(), and plot.area().
  • Part 20 focuses on continuing the visualization aspect of DataFrames and Series via plotting such as hexbin, hist, pie, and scatter plots.
  • Part 21 focuses on the serialization and conversion methods from_dict(), to_dict(), from_records(), to_records(), to_json(), and to_pickle().
  • Part 22 focuses on the serialization and conversion methods to_clipboard(), to_html(), to_sql(), to_csv(), and to_excel().
  • Part 23 focuses on the serialization and conversion methods to_markdown(), to_stata(), to_hdf(), to_latex(), to_xml().
  • Part 24 focuses on the serialization and conversion methods to_parquet(), to_feather(), to_string(), Styler.
  • Part 25 focuses on the serialization and conversion methods to_gbq() and to_coo().

Also, have a look at the Pandas DataFrame methods cheat sheet!