Preparation
Before any data manipulation can occur, two (2) new libraries will require installation.
- The Pandas library enables access to/from a DataFrame.
- The NumPy library supports multi-dimensional arrays and matrices in addition to a collection of mathematical functions.
To install these libraries, navigate to an IDE terminal. At the command prompt ($), execute the code below. For the terminal used in this example, the command prompt is a dollar sign ($). Your terminal prompt may be different.
$ pip install pandas
Hit the <Enter> key on the keyboard to start the installation process.
$ pip install numpy
Hit the <Enter> key on the keyboard to start the installation process.
If the installations were successful, a message displays in the terminal indicating the same.
Feel free to view the PyCharm installation guide for the required libraries.
Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.
import pandas as pd
import numpy as np
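To confirm that both libraries installed correctly, a quick check along these lines can help. This is a minimal sketch; the version numbers printed depend on your environment.

```python
import pandas as pd
import numpy as np

# Print the installed versions; the exact numbers depend on your environment.
print('pandas:', pd.__version__)
print('numpy:', np.__version__)
```

If either import fails with a ModuleNotFoundError, the corresponding pip install step likely did not complete in the interpreter your IDE is using.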
DataFrame describe()
The describe() method generates descriptive statistics for numeric and object Series, as well as for DataFrame column sets of mixed data types.
The syntax for this method is as follows (source):
DataFrame.describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)
Parameters | Description |
---|---|
percentiles | The percentiles to include in the output. All values should be between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles. This parameter accepts a list-like of numbers and is optional. |
include | This parameter is a white list of data types to include. Ignored for Series. Below are the available options. – 'all': All input columns will be included in the output. – A list-like of dtypes: Limits the results to the provided data types. – To limit the result to numeric types, submit numpy.number. – To limit it instead to object columns, submit the object data type (the numpy.object alias named in older docs was removed in NumPy 1.24). – Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). – To select pandas categorical columns, use 'category'. |
exclude | This parameter is a list of dtypes to omit from the result. Ignored for Series. – To exclude numeric data types, submit numpy.number. – To exclude object columns, submit the object data type (formerly numpy.object). – Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])). – To exclude pandas categorical columns, use 'category'. |
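datetime_is_numeric | This parameter determines whether datetime columns are treated as numeric, which changes the statistics reported for them. By default, this parameter is False. Note that this parameter was removed in pandas 2.0, where datetime data is always summarized as numeric. |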
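To see how these parameters behave in practice, here is a minimal sketch. The DataFrame and its column names (team, wins, losses) are made up for illustration and are not the article's dataset.

```python
import pandas as pd
import numpy as np

# Hypothetical sample data: one text column and two numeric columns.
df = pd.DataFrame({
    'team': ['Bruins', 'Oilers', 'Leafs', 'Flames'],
    'wins': [4, 3, 2, 1],
    'losses': [5, 6, 7, 8],
})

# percentiles: request the 10th and 90th percentiles instead of the defaults.
custom = df.describe(percentiles=[0.1, 0.9])
print(custom)

# include: limit the analysis to numeric columns.
numeric_only = df.describe(include=[np.number])
print(numeric_only)

# exclude: drop numeric columns, leaving only the text column.
non_numeric = df.describe(exclude=[np.number])
print(non_numeric)
```

The first call adds rows labeled 10% and 90% to the result. The last call reports count, unique, top, and freq for the team column rather than numeric statistics.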
Also, consider this table from the docs:
Numeric Data | For numeric data, the result’s index will include count , mean , std , min , max as well as lower, 50 and upper percentiles. By default, the lower percentile is 25, and the upper percentile is 75. The 50 percentile is the same as the median . |
Object Data | For object data (strings or timestamps), the result’s index will include count , unique , top , and freq . The top is the most common value. The frequency (freq ) is the most common value’s frequency. Timestamps also include the first and last items. |
Multiple Object Values | If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count. |
Mixed Data Types | For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the DataFrame consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type. |
Include & Exclude | These parameters can limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series. |
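The rules in the table above can be checked with a small mixed-type DataFrame. The sketch below uses hypothetical data (position and goals are illustrative column names, not from this series).

```python
import pandas as pd

# Hypothetical data: 'position' holds strings, 'goals' holds numbers.
df = pd.DataFrame({
    'position': ['C', 'D', 'C', 'G'],
    'goals': [12, 3, 9, 0],
})

# Default for mixed data: only the numeric column is analyzed.
default_result = df.describe()
print(default_result)

# Object data: the result holds count, unique, top, and freq.
position_stats = df['position'].describe()
print(position_stats)

# include='all': the union of both sets of statistics.
# Cells that do not apply to a column are filled with NaN.
all_result = df.describe(include='all')
print(all_result)
```

Here 'C' appears twice, so it is reported as top with a freq of 2. In the include='all' result, the position column has NaN for mean and the goals column has NaN for unique.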
For this example, the same Teams DataFrame referred to in Part 2 of this series is used. The DataFrame below displays four (4) Hockey Teams’ stats: wins, losses, and ties.
df_teams = pd.DataFrame({'Bruins': [4, 5, 9], 'Oilers': [3, 6, 10], 'Leafs': [2, 7, 11], 'Flames': [1, 8, 12]})
result = df_teams.describe().apply(lambda x: round(x, 2))
print(result)
- Line [1] creates a DataFrame from a Dictionary of Lists and saves it to df_teams.
- Line [2] uses the describe() method to retrieve additional analytical information. Using a lambda, it then rounds the output to two (2) decimal places and saves it to the result variable.
- Line [3] outputs the result to the terminal.
Output
 | Bruins | Oilers | Leafs | Flames |
count | 3.00 | 3.00 | 3.00 | 3.00 |
mean | 6.00 | 6.33 | 6.67 | 7.00 |
std | 2.65 | 3.51 | 4.51 | 5.57 |
min | 4.00 | 3.00 | 2.00 | 1.00 |
25% | 4.50 | 4.50 | 4.50 | 4.50 |
50% | 5.00 | 6.00 | 7.00 | 8.00 |
75% | 7.00 | 8.00 | 9.00 | 10.00 |
max | 9.00 | 10.00 | 11.00 | 12.00 |
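The percentile rows in this output come from linear interpolation between the sorted values, and std is the sample standard deviation. Both are pandas defaults, noted here as background rather than taken from the article. A minimal sketch reproducing the Flames column by hand:

```python
import pandas as pd
import numpy as np

flames = pd.Series([1, 8, 12], name='Flames')

# Series.quantile interpolates linearly between sorted values by default.
quartiles = flames.quantile([0.25, 0.5, 0.75])
print(quartiles)  # 0.25 -> 4.5, 0.50 -> 8.0, 0.75 -> 10.0

# np.percentile uses the same default method and returns the same values.
np_quartiles = np.percentile(flames, [25, 50, 75])
print(np_quartiles)

# Series.std divides by n - 1 (sample standard deviation).
flames_std = round(flames.std(), 2)
print(flames_std)  # 5.57
```

For the 75th percentile, the position is 0.75 * (3 - 1) = 1.5, halfway between 8 and 12, which gives 10.0.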