The Pandas DataFrame has several methods concerning Computations and Descriptive Stats. When applied to a DataFrame, these methods evaluate the elements and return the results.
Preparation
Before any data manipulation can occur, two (2) new libraries will require installation.
- The Pandas library enables access to/from a DataFrame.
- The NumPy library supports multi-dimensional arrays and matrices in addition to a collection of mathematical functions.
To install these libraries, navigate to an IDE terminal. At the command prompt ($), execute the code below. For the terminal used in this example, the command prompt is a dollar sign ($). Your terminal prompt may be different.
$ pip install pandas
Hit the <Enter> key on the keyboard to start the installation process.
$ pip install numpy
Hit the <Enter> key on the keyboard to start the installation process.
If the installations were successful, a message displays in the terminal indicating the same.
Feel free to view the PyCharm installation guide for the required libraries.
Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.
import pandas as pd
import numpy as np
DataFrame describe()
The describe() method generates descriptive statistics for a Series or for the columns of a DataFrame. It handles numeric data, object data, and DataFrames with mixed data types.
The syntax for this method is as follows (source):
DataFrame.describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)
Parameters | Description |
---|---|
percentiles | The percentiles to include in the output. All values should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles. This parameter accepts a list-like of numbers and is optional. |
include | This parameter is a white list of data types to include. Ignored for Series. Below are the available options. – 'all': All input columns will be included in the output. – A list-like of dtypes: Limits the results to the provided data types. – To limit the result to numeric types, submit numpy.number. – To limit it instead to object columns, submit the numpy.object data type (or the built-in object, since the numpy.object alias was removed in NumPy 1.24). – Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). – To select pandas categorical columns, use 'category'. |
exclude | This parameter is a black list of dtypes to omit from the result. Ignored for Series. – To exclude numeric data types, submit numpy.number. – To exclude object columns, submit the numpy.object data type (or the built-in object). – Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])). – To exclude pandas categorical columns, use 'category'. |
datetime_is_numeric | This parameter determines whether datetime dtypes are treated as numeric. By default, this parameter is False. It was removed in pandas 2.0, where datetime data is always summarized as numeric. |
Also, consider this table from the docs:
Numeric Data | For numeric data, the result’s index will include count , mean , std , min , max as well as lower, 50 and upper percentiles. By default, the lower percentile is 25, and the upper percentile is 75. The 50 percentile is the same as the median . |
Object Data | For object data (strings or timestamps), the result’s index will include count , unique , top , and freq . The top is the most common value. The frequency (freq ) is the most common value’s frequency. Timestamps also include the first and last items. |
Multiple Object Values | If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count. |
Mixed Data Types | For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the DataFrame consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type. |
Include & Exclude | These parameters can limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series. |
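The rules in the table above can be tried out on a small DataFrame that mixes a numeric column with an object column. The sketch below is only an illustration: the df_mixed DataFrame and its team and wins columns are hypothetical and do not appear elsewhere in this article. The object dtype is set explicitly so the example does not depend on how the installed pandas version infers string columns.

```python
import pandas as pd

# Hypothetical data: one object (string) column and one numeric column.
df_mixed = pd.DataFrame({
    'team': pd.Series(['Bruins', 'Oilers', 'Bruins', 'Leafs'], dtype=object),
    'wins': [4, 3, 5, 2],
})

# Default behaviour: only the numeric column is summarized.
numeric_summary = df_mixed.describe()

# Limit the summary to object columns: count, unique, top, freq.
object_summary = df_mixed.describe(include=['O'])

# Summarize every column; statistics that do not apply to a column are NaN.
full_summary = df_mixed.describe(include='all')

print(numeric_summary)
print(object_summary)
print(full_summary)
```

With include='all', the result carries the union of the numeric and object statistics, so the team column shows NaN for mean and std, while the wins column shows NaN for unique, top, and freq.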
For this example, the same Teams DataFrame referred to in Part 2 of this series is used. The DataFrame below displays four (4) Hockey Teams’ stats: wins, losses, and ties.
df_teams = pd.DataFrame({'Bruins': [4, 5, 9], 'Oilers': [3, 6, 10], 'Leafs': [2, 7, 11], 'Flames': [1, 8, 12]})
result = df_teams.describe().apply(lambda x: round(x, 2))
print(result)
- Line [1] creates a DataFrame from a Dictionary of Lists and saves it to df_teams.
- Line [2] uses the describe() method to retrieve additional analytical information. Using a lambda, it then formats the output to two (2) decimal places and saves it to the result variable.
- Line [3] outputs the result to the terminal.
Output
 | Bruins | Oilers | Leafs | Flames |
count | 3.00 | 3.00 | 3.00 | 3.00 |
mean | 6.00 | 6.33 | 6.67 | 7.00 |
std | 2.65 | 3.51 | 4.51 | 5.57 |
min | 4.00 | 3.00 | 2.00 | 1.00 |
25% | 4.50 | 4.50 | 4.50 | 4.50 |
50% | 5.00 | 6.00 | 7.00 | 8.00 |
75% | 7.00 | 8.00 | 9.00 | 10.00 |
max | 9.00 | 10.00 | 11.00 | 12.00 |
Click here to see additional examples.
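The percentiles parameter can replace the default quartiles. The following is a minimal sketch that reuses the Teams data from the example above; the choice of the 10th, 50th, and 90th percentiles is arbitrary and only for illustration.

```python
import pandas as pd

df_teams = pd.DataFrame({'Bruins': [4, 5, 9], 'Oilers': [3, 6, 10],
                         'Leafs': [2, 7, 11], 'Flames': [1, 8, 12]})

# Ask for the 10th, 50th, and 90th percentiles instead of the defaults.
# round(2) is a shorter alternative to applying a lambda per column.
result = df_teams.describe(percentiles=[.1, .5, .9]).round(2)
print(result)
```

The 25% and 75% rows no longer appear in the output; the 10%, 50%, and 90% rows take their place.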
DataFrame diff()
The diff() method calculates the difference between each DataFrame element and another element in the same DataFrame. By default, the other element is the one in the previous row.
The syntax for this method is as follows:
DataFrame.diff(periods=1, axis=0)
Parameter | Description |
---|---|
axis | If zero (0) or index is selected, apply to each column. Default 0. If one (1) apply to each row. |
periods | The periods to shift for calculating differences. This parameter accepts negative values. |
Code – Example 1
This example calculates the difference from the previous row.
df_teams = pd.DataFrame({'Bruins': [4, 5, 9], 'Oilers': [3, 6, 10], 'Leafs': [2, 7, 11], 'Flames': [1, 8, 12]})
result = df_teams.diff()
print(result)
- Line [1] creates a DataFrame from a Dictionary of Lists and saves it to df_teams.
- Line [2] uses the diff() method to determine the difference from the previous row and saves it to the result variable.
- Line [3] outputs the result to the terminal.
Output
 | Bruins | Oilers | Leafs | Flames |
0 | NaN | NaN | NaN | NaN |
1 | 1.0 | 3.0 | 5.0 | 7.0 |
2 | 4.0 | 4.0 | 4.0 | 4.0 |
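One way to read the output above is that each row has the previous row subtracted from it, and the first row has no predecessor, hence the NaN values. The sketch below reproduces the same result by hand with shift(). It is meant as an illustration of the idea, not as a description of how pandas implements diff() internally.

```python
import pandas as pd

df_teams = pd.DataFrame({'Bruins': [4, 5, 9], 'Oilers': [3, 6, 10],
                         'Leafs': [2, 7, 11], 'Flames': [1, 8, 12]})

# Move every row down by one position, then subtract it from the original.
manual = df_teams - df_teams.shift(1)

# The built-in method with its default settings.
builtin = df_teams.diff()

# Compare the two results; NaN values in matching positions count as equal.
print(manual.equals(builtin))
```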
Code – Example 2
This example calculates the difference from the previous column.
df_teams = pd.DataFrame({'Bruins': [4, 5, 9], 'Oilers': [3, 6, 10], 'Leafs': [2, 7, 11], 'Flames': [1, 8, 12]})
result = df_teams.diff(axis=1)
print(result)
- Line [1] creates a DataFrame from a Dictionary of Lists and saves it to df_teams.
- Line [2] uses the diff() method to determine the difference from the previous column and saves it to the result variable.
- Line [3] outputs the result to the terminal.
Output
 | Bruins | Oilers | Leafs | Flames |
0 | NaN | -1 | -1 | -1 |
1 | NaN | 1 | 1 | 1 |
2 | NaN | 1 | 1 | 1 |
Code – Example 3
This example sets periods to 1 explicitly. Because 1 is the default, the output matches Example 1.
df_teams = pd.DataFrame({'Bruins': [4, 5, 9], 'Oilers': [3, 6, 10], 'Leafs': [2, 7, 11], 'Flames': [1, 8, 12]})
result = df_teams.diff(periods=1)
print(result)
- Line [1] creates a DataFrame from a Dictionary of Lists and saves it to df_teams.
- Line [2] uses the diff() method with periods set to 1 to determine the difference from the previous row and saves it to the result variable.
- Line [3] outputs the result to the terminal.
Output
 | Bruins | Oilers | Leafs | Flames |
0 | NaN | NaN | NaN | NaN |
1 | 1.0 | 3.0 | 5.0 | 7.0 |
2 | 4.0 | 4.0 | 4.0 | 4.0 |
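The periods parameter also accepts other values, including negative ones. As a brief sketch using the same Teams data, periods=-1 compares each row with the next row, and periods=2 compares each row with the row two positions earlier. The variable names next_row and two_back are chosen here only for illustration.

```python
import pandas as pd

df_teams = pd.DataFrame({'Bruins': [4, 5, 9], 'Oilers': [3, 6, 10],
                         'Leafs': [2, 7, 11], 'Flames': [1, 8, 12]})

# Negative periods: each row minus the row that follows it.
# The last row has no successor, so it is filled with NaN.
next_row = df_teams.diff(periods=-1)

# periods=2: each row minus the row two positions earlier.
# The first two rows have no such row, so they are filled with NaN.
two_back = df_teams.diff(periods=2)

print(next_row)
print(two_back)
```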
DataFrame eval()
The eval() method evaluates a string describing operations on DataFrame columns. It operates on columns only, not on specific rows or elements. Because the string is evaluated as an expression, eval() can run arbitrary code.
Note: This can make the code vulnerable to code injection if you pass user input to this method.
The syntax for this method is as follows:
DataFrame.eval(expr, inplace=False, **kwargs)
Parameter | Description |
---|---|
expr | This parameter is the string to evaluate. |
inplace | If the expression contains an assignment, this determines whether to perform the operation inplace and mutate the existing DataFrame. Otherwise, a new DataFrame is returned. By default, this parameter is False . |
**kwargs | See the documentation here for details. |
For this example, the Hockey Teams Bruins and Oilers stats will be added together.
df_teams = pd.DataFrame({'Bruins': [4, 5, 9], 'Oilers': [3, 6, 10], 'Leafs': [2, 7, 11], 'Flames': [1, 8, 12]})
result = df_teams.eval('Bruins + Oilers')
print(result)
- Line [1] creates a DataFrame from a Dictionary of Lists and saves it to df_teams.
- Line [2] uses the eval() method to evaluate the calculation and saves it to the result variable.
- Line [3] outputs the result to the terminal.
Output
0 | 7 |
1 | 11 |
2 | 19 |
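The inplace parameter only matters when the expression contains an assignment. The sketch below shows both behaviours with the same Teams data; the column name Total and the variable names with_total, df_copy, and returned are made up for this illustration.

```python
import pandas as pd

df_teams = pd.DataFrame({'Bruins': [4, 5, 9], 'Oilers': [3, 6, 10],
                         'Leafs': [2, 7, 11], 'Flames': [1, 8, 12]})

# With the default inplace=False, an assignment returns a new DataFrame
# that contains the extra column; df_teams itself is left unchanged.
with_total = df_teams.eval('Total = Bruins + Oilers')

# With inplace=True, the DataFrame is modified directly and None is returned.
df_copy = df_teams.copy()
returned = df_copy.eval('Total = Bruins + Oilers', inplace=True)

print(with_total)
print(df_copy)
print(returned)
```

Working on a copy, as done here, keeps the original df_teams available for the remaining examples.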
DataFrame kurt() and kurtosis()
The DataFrame kurt() and kurtosis() methods are identical and return the unbiased kurtosis over a requested axis. The value follows Fisher's definition, in which a normal distribution has a kurtosis of 0. For additional information on Kurtosis, click here.
Parameter | Description |
---|---|
axis | If zero (0) or index is selected, apply to each column. Default 0. If one (1) apply to each row. |
skipna | Exclude NA/null values when computing the result. By default, True . |
level | If the axis is a MultiIndex, count along a particular level, collapsing into a Series. By default, the value is None. |
numeric_only | If True, include only float, integer, and boolean columns. If None, this parameter will attempt to use everything. |
**kwargs | This parameter is additional keyword arguments to be passed to the method. |
For this example, the Hockey Teams data is used.
df_teams = pd.DataFrame({'Bruins': [4, 5, 9], 'Oilers': [3, 6, 10], 'Leafs': [2, 7, 11], 'Flames': [1, 8, 12]})
result = df_teams.kurtosis()
print(result)
- Line [1] creates a DataFrame from a Dictionary of Lists and saves it to df_teams.
- Line [2] uses the kurtosis() method to determine the output and saves it to the result variable.
- Line [3] outputs the result to the terminal.
Output
Bruins | NaN |
Oilers | NaN |
Leafs | NaN |
Flames | NaN |
dtype: float64
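Every value above is NaN because the unbiased kurtosis needs at least four observations per column, and the Teams data has only three rows. The sketch below uses a hypothetical five-row DataFrame, df_games, whose values are invented for illustration, to show a case where a number is returned. It also checks that kurt() and kurtosis() give the same result.

```python
import pandas as pd

# Hypothetical five-row sample; these values are not from the Teams data.
df_games = pd.DataFrame({'Bruins': [1, 2, 3, 4, 5],
                         'Oilers': [2, 2, 3, 9, 4]})

# With five observations per column, a numeric kurtosis is returned.
result = df_games.kurt()

# kurt() and kurtosis() are two names for the same calculation.
same = df_games.kurt().equals(df_games.kurtosis())

# With only three observations, the result is NaN.
three_rows = pd.DataFrame({'Bruins': [4, 5, 9]}).kurt()

print(result)
print(same)
print(three_rows)
```

For the evenly spaced Bruins column, the result is negative, which indicates a distribution flatter than a normal distribution.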
Further Learning Resources
This is Part 3 of the DataFrame method series.
Also, have a look at the Pandas DataFrame methods cheat sheet!