The Pandas DataFrame has several Function Applications, GroupBy & Window methods. When applied to a DataFrame, these methods modify the output of a DataFrame.
Part 2 of this series focuses on GroupBy & Window methods and delves into each item listed above.
Preparation
Before any data manipulation can occur, two (2) new libraries will require installation.
- The Pandas library enables access to/from a DataFrame.
- The NumPy library supports multi-dimensional arrays and matrices in addition to a collection of mathematical functions.
To install these libraries, navigate to an IDE terminal. At the command prompt ($
), execute the code below. For the terminal used in this example, the command prompt is a dollar sign ($
). Your terminal prompt may be different.
$ pip install pandas
Hit the <Enter>
key on the keyboard to start the installation process.
$ pip install numpy
Hit the <Enter>
key on the keyboard to start the installation process.
If the installations were successful, a message displays in the terminal indicating the same.
Feel free to view the PyCharm installation guide for the required libraries.
Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.
import pandas as pd import numpy as np
DataFrame transform()
The transform()
method calls a function on itself. This method produces a DataFrame with transformed values.
The syntax for this method is as follows:
DataFrame.transform(func, axis=0, *args, **kwargs)
Parameter | Description |
---|---|
func | This is a function used for transforming the data. This function must work with a DataFrame or DataFrame.apply() . If list-like or dict-like, the latter takes precedence. |
axis | If zero (0) or index is selected, apply to each column. Default is 0 (column). If zero (1) or columns, apply to each row. |
args | The positional argument to pass to the function. |
**kwargs | This parameter is keyword arguments passed to func . |
Rivers Clothing has decided to increase the price of all its items by $1.03. Their Accounting Department has informed them that the market will bear the increase. Use the transform()
method and lambda
to accomplish this task.
df = pd.DataFrame({'Tops': [10.12, 12.23, 13.95], 'Tanks': [11.35, 13.45, 14.98], 'Pants': [21.37, 56.99, 94.87], 'Sweats': [27.15, 21.85, 35.75]}) pd.options.display.float_format = '${:.2f}'.format index_ = ['Small', 'Medium', 'Large'] df.index = index_ result = df.transform(lambda x: x+1.03) print(result)
- Line [1] creates a DataFrame from a Dictionary of Lists and saves it to
df
. - Line [2] formats the output with a dollar sign (
$
) and two (2) decimal places. - Line [3-4] creates and sets the index for the DataFrame (Small/Medium/Large).
- Line [5] uses the
transform()
method with a lambda. This line adds 1.03 to each price. The output saves to theresult
variable. - Line [6] outputs the result to the terminal.
Output
Formula Example: 10.12 + 1.03 = 11.15
Tops | Tanks | Pants | Sweats | |
Small | $11.15 | $12.38 | $22.40 | $28.18 |
Medium | $13.26 | $14.48 | $58.02 | $22.88 |
Large | $14.98 | $16.01 | $95.90 | $36.78 |
DataFrame groupby()
The groupby()
method involves splitting a DataFrame (object) in some capacity, applying a function, and combining the results.
This method groups data and applies various operations to the groups. It returns a groupby()
object with the appropriate data.
The syntax for this method is as follows:
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=NoDefault.no_default, observed=False, dropna=True)
Parameters | Description |
---|---|
by | This parameter is used to determine the groups for the groupby . If a function, it is called on each index value. If a dict or Series, the Series or dict values will be used to determine the groups. If ndarray , the values are used as-is on the groups. A label or list of labels groups the columns. |
axis | If zero (0) or index is selected, apply to each column. Default is 0 (column). If zero (1) or columns, apply to each row. |
level | If the axis is a MultiIndex , the data groups by a certain level(s). By default, None . |
as_index | Set the return object with group labels as the index. This parameter is only relevant for DataFrame. Set index=False for SQL grouped data. |
sort | This parameter sorts group keys. Turn this off for better performance. This parameter is True by default. |
group_keys | When calling the apply() method, you need to add group keys to the index to identify the pieces. |
squeeze | Depreciated |
observed | This only applies if any of the groups are categories. If True , only show observed values for categorical groupers. If False , show all values for categorical groupers. By default, False . |
dropna | If True , and group keys contain NaN values. The row and columns drop. This parameter is True by default. If False , NaN values will also be the key in groups. |
For this example, Rivers Clothing wants to determine the mean values of the product line Tanks
. Use the groupby()
method to accomplish this task.
df = pd.DataFrame({'Tops': [10.12, 12.23, 13.95], 'Tanks': [11.35, 13.45, 14.98], 'Pants': [21.37, 56.99, 94.87], 'Sweats': [27.15, 21.85, 35.75]}) pd.options.display.float_format = '${:.2f}'.format result = df.groupby(['Tanks']) for tanks, group in result: print('$' + str(tanks)) print(group)
- Line [1] creates a DataFrame from a dictionary of lists and saves it to
df
. - Line [2] formats the output with a dollar sign (
$
) and two (2) decimal places. - Line [3] uses the
groupby()
method for the item Tanks. The output saves to theresult
variable. - Line [4] instantiates a for loop to display the items. This line also pre-pends a dollar sign (
$
) to tanks. - Line [5] outputs the formatted result to the terminal.
Output
$11.35 | ||||
Tops | Tanks | Pants | Sweats | |
0 | $10.12 | $11.35 | $21.37 | $27.15 |
$13.35 | ||||
Tops | Tanks | Pants | Sweats | |
1 | $12.23 | $13.45 | $56.99 | $21.85 |
$14.98 | ||||
Tops | Tanks | Pants | Sweats | |
2 | $13.95 | $14.98 | $94.87 | $35.75 |
DataFrame rolling()
The rolling()
method provides rolling window calculations.
The syntax for this method is as follows:
DataFrame.rolling(window, min_periods=None, center=False, win_type=None, on=None, axis=0, closed=None, method='single')
Parameter | Description |
---|---|
window | This parameter is the size of the moving window. Each window has a fixed size. If this parameter is an offset, then this will be the time period of each window. Each window will be a variable size based on the time period observations. Valid for datetime indexes. If a Base Indexer sub-class: calculates the boundaries on get_window_bounds() . |
min_periods | This is the minimum number of observations in the window required to have a value (otherwise result is NA). For a window that has an offset, min_periods will default to 1. Otherwise, min_periods will default to the size of the window. |
center | Sets the labels to the center of the windows. By default, True . |
win_type | This parameter provides a window type. If None , all points are even. By default, None . |
on | For a DataFrame, a datetime -like column or Index level calculating the rolling window rather than the DataFrame index. Provided integer column is ignored and excluded from result since an integer index does not calculate the rolling window |
axis | If zero (0) or index is selected, apply to each column. Default is 0 (column). If zero (1) or columns, apply to each row. |
closed | Make the interval close on the right, left, both, or neither end-points. By default, right. |
method | Execute the rolling operation per single column or row ('single' ) or over the entire object ('table' ). This argument implements when specifying engine='numba' in the method call. |
For this example, Rivers Clothing wants prices of all sizes for the Tanks and Pants items.
cols = ['Tanks', 'Pants'] df = pd.DataFrame({'Tops': [10.12, 12.23, 13.95], 'Tanks': [11.35, 13.45, 14.98], 'Pants': [21.37, 56.99, 94.87], 'Sweats': [27.15, 21.85, 35.75]}, columns=cols) pd.options.display.float_format = '${:.2f}'.format index_ = ['Small', 'Medium', 'Large'] df.index = index_ result = df.rolling(1, win_type='gaussian').sum(std=0) print(result)
- Line [1] creates a list of products we want to view. In this case, Tanks and Pants.
- Line [2] creates a DataFrame from a dictionary of lists and saves it to
df
. Thecols
list on line [1] is a DataFrame parameter. - Line [3] formats the output with a dollar sign (
$
) and two (2) decimal places. - Line [4-5] creates and sets the index for the DataFrame (Small/Medium/Large).
- Line [6] uses the
rolling()
method withsum()
. This line retrieves the Tanks and Pants values. In this case, assigning a value to std does not change anything but is required. The output saves to theresult
variable. - Line [7] outputs the result to the terminal.
Output
Tanks | Pants | |
Small | $11.35 | $21.37 |
Medium | $13.45 | $56.99 |
Large | $14.98 | $94.87 |
π‘ Note: Gaussian distribution (also known as the normal distribution) is a bell-shaped curve. Any measurement values will follow a normal distribution. This distribution will have an equal number of measurements above and below the mean value.
DataFrame expanding()
The expanding()
method provides expanding transformations.
The syntax for this method is as follows:
DataFrame.expanding(min_periods=1, center=None, axis=0, method='single')
Parameter | Description |
---|---|
min_periods | This parameter is the minimum number of observations in the window required to have a value (else result is NA). |
center | Sets the values to the center of the window. |
axis | If zero (0) or index is selected, apply to each column. Default is 0 (column). If zero (1) or columns, apply to each row. |
method | Execute the rolling operation per single column or row ('single' ) or over the entire object ('table' ). This argument implements when specifying engine='numba' in the method call. |
For this example, Rivers Clothing wants the Totals for the Tanks and Pants items.
- Line [1] creates a list of products we want to view. In this case, Tanks and Pants.
- Line [2] creates a DataFrame from a dictionary of lists and saves it to
df
. Thecols
list on line [1] is a DataFrame parameter. - Line [3] formats the output with a dollar sign (
$
) and two (2) decimal places. - Line [4] uses the
expanding()
method withsum()
. This line totals the Tanks and Pants values. The output saves to theresult
variable. - Line [5] outputs the result to the terminal.
Output
Formula Example: Tanks: 11.35+13.45+14.98 = 39.78
Tanks | Pants | |
0 | NaN | NaN |
1 | NaN | NaN |
2 | $39.78 | $173.23 |
If we set Line [6] to: result = df.expanding(2).sum()
, below is the output.
Tanks | Pants | |
0 | NaN | NaN |
1 | $24.80 | $78.36 |
2 | $39.78 | $173.23 |
DataFrame Exponential Weighted – ewm()
The ewm()
method provides exponentially weighted functions.
The syntax for this method is as follows:
DataFrame.ewm(com=None, span=None, halflife=None, alpha=None, min_periods=0, adjust=True, ignore_na=False, axis=0, times=None)
Parameter | Description |
---|---|
com | Specify decay in terms of center of mass (a=1/(1+com)) for com β₯ 0 . |
span | Specify decay in terms of center of mass (a=2/(span+1)) for span β₯ 1 . This parameter is a float and is optional. |
halflife | Specify decay in terms of halflife (a=1-exp(--ln(2)/halflife)) for halflife > 0 . If times are specified, the time unit (str or time-delta) over which an observation decays to half its value. mean() and will not apply to the other functions. |
alpha | Specify a smoothing factor a directly, 0 < a < 1. This parameter is a float and is optional. |
min_periods | The minimum number of observations in the window must have a value (else result is 0). The default is the integer 0. |
adjust | Divide by decaying adjustment factor into beginning periods to account for imbalance in relative weightings (viewing EWMA as a moving average). By default, True . |
ignore_na | Ignore missing values when calculating weights; specify True to reproduce pre-0.15.0 behavior. |
axis | If zero (0) or index is selected, apply to each column. Default is 0 (column). If zero (1) or columns, apply to each row. |
times | Times corresponding to the observations. Must be monotonically increasing and datetime64[ns] dtype.If a string, the column name in the DataFrame represents the times. If a 1-D array, a sequence with the same shape as the observations. Applicable to mean() . |
For this example, Rivers Clothing would like an exponential weighted moving average for their sales for the past four (4) quarters on their four (4) clothing lines.
df = pd.DataFrame({'Qtrs': [1, 2, 3, 4], 'Sales': [2500, 2000, 1400, 2700]}) df['Qtrs'] = df['Sales'].ewm(span=4, adjust=False).mean() print(df['Qtrs'])
- Line [1] creates a DataFrame from a dictionary of lists and saves it to
df
. - Line [2] calculates the mean emw and saves it to the
df['Qtrsβ]
variable. - Line [3] outputs the result to the terminal.
Output
Tanks | |
0 | 2500.0 |
1 | 2300.0 |
2 | 1940.0 |
3 | 2244.0 |
Name: | Qtrs, dtype=float64 |