Pandas DataFrame GroupBy and Window - Part 2 - Be on the Right Side of Change

The Pandas DataFrame has several Function Applications, GroupBy & Window methods. When applied to a DataFrame, these methods modify the output of a DataFrame.

Part 2 of this series focuses on GroupBy & Window methods and delves into each item listed above.

Preparation

Before any data manipulation can occur, two (2) new libraries will require installation.

The Pandas library enables access to/from a DataFrame.
The NumPy library supports multi-dimensional arrays and matrices in addition to a collection of mathematical functions.

To install these libraries, navigate to an IDE terminal. At the command prompt ($), execute the code below. For the terminal used in this example, the command prompt is a dollar sign ($). Your terminal prompt may be different.

$ pip install pandas

Hit the <Enter> key on the keyboard to start the installation process.

$ pip install numpy

Hit the <Enter> key on the keyboard to start the installation process.

If the installations were successful, a message displays in the terminal indicating the same.

Feel free to view the PyCharm installation guide for the required libraries.

Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.

import pandas as pd
import numpy as np

DataFrame transform()

The transform() method calls a function on itself. This method produces a DataFrame with transformed values.

The syntax for this method is as follows:

DataFrame.transform(func, axis=0, *args, **kwargs)

Parameter	Description
`func`	This is a function used for transforming the data. This function must work with a DataFrame or `DataFrame.apply()`. If list-like or dict-like, the latter takes precedence.
`axis`	If zero (0) or index is selected, apply to each column. Default is 0 (column). If zero (1) or columns, apply to each row.
`args`	The positional argument to pass to the function.
`**kwargs`	This parameter is keyword arguments passed to `func`.

Rivers Clothing has decided to increase the price of all its items by $1.03. Their Accounting Department has informed them that the market will bear the increase. Use the transform() method and lambda to accomplish this task.

df = pd.DataFrame({'Tops':   [10.12, 12.23, 13.95],
                   'Tanks':   [11.35, 13.45, 14.98],
                   'Pants':   [21.37, 56.99, 94.87],
                   'Sweats':  [27.15, 21.85, 35.75]})

pd.options.display.float_format = '${:.2f}'.format
index_   = ['Small', 'Medium', 'Large']
df.index = index_

result = df.transform(lambda x: x+1.03)
print(result)

Line [1] creates a DataFrame from a Dictionary of Lists and saves it to df.
Line [2] formats the output with a dollar sign ($) and two (2) decimal places.
Line [3-4] creates and sets the index for the DataFrame (Small/Medium/Large).
Line [5] uses the transform() method with a lambda. This line adds 1.03 to each price. The output saves to the result variable.
Line [6] outputs the result to the terminal.

Output

Formula Example: 10.12 + 1.03 = 11.15

	Tops	Tanks	Pants	Sweats
Small	$11.15	$12.38	$22.40	$28.18
Medium	$13.26	$14.48	$58.02	$22.88
Large	$14.98	$16.01	$95.90	$36.78

DataFrame groupby()

The groupby() method involves splitting a DataFrame (object) in some capacity, applying a function, and combining the results.

This method groups data and applies various operations to the groups. It returns a groupby() object with the appropriate data.

The syntax for this method is as follows:

DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, 
                  group_keys=True, squeeze=NoDefault.no_default, 
                  observed=False, dropna=True)

Parameters	Description
`by`	This parameter is used to determine the groups for the `groupby`. If a function, it is called on each index value. If a dict or Series, the Series or dict values will be used to determine the groups. If `ndarray`, the values are used as-is on the groups. A label or list of labels groups the columns.
`axis`	If zero (0) or index is selected, apply to each column. Default is 0 (column). If zero (1) or columns, apply to each row.
`level`	If the `axis` is a `MultiIndex`, the data groups by a certain level(s). By default, `None`.
`as_index`	Set the return object with group labels as the index. This parameter is only relevant for DataFrame. Set `index=False` for SQL grouped data.
`sort`	This parameter sorts group keys. Turn this off for better performance. This parameter is `True` by default.
`group_keys`	When calling the `apply()` method, you need to add group keys to the index to identify the pieces.
`squeeze`	Depreciated
`observed`	This only applies if any of the groups are categories. If `True`, only show observed values for categorical groupers. If `False`, show all values for categorical groupers. By default, `False`.
`dropna`	If `True`, and group keys contain NaN values. The row and columns drop. This parameter is `True` by default. If `False`, NaN values will also be the key in groups.

For this example, Rivers Clothing wants to determine the mean values of the product line Tanks. Use the groupby() method to accomplish this task.

df = pd.DataFrame({'Tops':     [10.12, 12.23, 13.95],
                   'Tanks':    [11.35, 13.45, 14.98],
                   'Pants':    [21.37, 56.99, 94.87],
                   'Sweats':  [27.15, 21.85, 35.75]})

pd.options.display.float_format = '${:.2f}'.format

result = df.groupby(['Tanks'])
for tanks, group in result:
    print('$' + str(tanks))
    print(group)

Line [1] creates a DataFrame from a dictionary of lists and saves it to df.
Line [2] formats the output with a dollar sign ($) and two (2) decimal places.
Line [3] uses the groupby() method for the item Tanks. The output saves to the result variable.
Line [4] instantiates a for loop to display the items. This line also pre-pends a dollar sign ($) to tanks.
Line [5] outputs the formatted result to the terminal.

Output

$11.35
	Tops	Tanks	Pants	Sweats
0	$10.12	$11.35	$21.37	$27.15

$13.35
	Tops	Tanks	Pants	Sweats
1	$12.23	$13.45	$56.99	$21.85

$14.98
	Tops	Tanks	Pants	Sweats
2	$13.95	$14.98	$94.87	$35.75

DataFrame rolling()

The rolling() method provides rolling window calculations.

The syntax for this method is as follows:

DataFrame.rolling(window, min_periods=None, center=False, 
                  win_type=None, on=None, axis=0, closed=None, method='single')

Parameter	Description
`window`	This parameter is the size of the moving window. Each window has a fixed size. If this parameter is an offset, then this will be the time period of each window. Each window will be a variable size based on the time period observations. Valid for `datetime` indexes. If a Base Indexer sub-class: calculates the boundaries on `get_window_bounds()`.
`min_periods`	This is the minimum number of observations in the window required to have a value (otherwise result is NA). For a window that has an offset, `min_periods` will default to 1. Otherwise, `min_periods` will default to the size of the window.
`center`	Sets the labels to the center of the windows. By default, `True`.
`win_type`	This parameter provides a window type. If `None`, all points are even. By default, `None`.
`on`	For a DataFrame, a `datetime`-like column or Index level calculating the rolling window rather than the DataFrame index. Provided integer column is ignored and excluded from result since an integer index does not calculate the rolling window
`axis`	If zero (0) or index is selected, apply to each column. Default is 0 (column). If zero (1) or columns, apply to each row.
`closed`	Make the interval close on the right, left, both, or neither end-points. By default, right.
`method`	Execute the rolling operation per single column or row (`'single'`) or over the entire object (`'table'`). This argument implements when specifying `engine='numba'` in the method call.

For this example, Rivers Clothing wants prices of all sizes for the Tanks and Pants items.

cols = ['Tanks', 'Pants']
df = pd.DataFrame({'Tops':    [10.12, 12.23, 13.95],
                   'Tanks':   [11.35, 13.45, 14.98],
                   'Pants':   [21.37, 56.99, 94.87],
                   'Sweats': [27.15, 21.85, 35.75]}, columns=cols)

pd.options.display.float_format = '${:.2f}'.format
index_   = ['Small', 'Medium', 'Large']
df.index = index_

result = df.rolling(1, win_type='gaussian').sum(std=0)
print(result)

Line [1] creates a list of products we want to view. In this case, Tanks and Pants.
Line [2] creates a DataFrame from a dictionary of lists and saves it to df. The cols list on line [1] is a DataFrame parameter.
Line [3] formats the output with a dollar sign ($) and two (2) decimal places.
Line [4-5] creates and sets the index for the DataFrame (Small/Medium/Large).
Line [6] uses the rolling() method with sum(). This line retrieves the Tanks and Pants values. In this case, assigning a value to std does not change anything but is required. The output saves to the result variable.
Line [7] outputs the result to the terminal.

Output

	Tanks	Pants
Small	$11.35	$21.37
Medium	$13.45	$56.99
Large	$14.98	$94.87

💡 Note: Gaussian distribution (also known as the normal distribution) is a bell-shaped curve. Any measurement values will follow a normal distribution. This distribution will have an equal number of measurements above and below the mean value.

DataFrame expanding()

The expanding() method provides expanding transformations.

The syntax for this method is as follows:

DataFrame.expanding(min_periods=1, center=None, axis=0, method='single')

Parameter	Description
`min_periods`	This parameter is the minimum number of observations in the window required to have a value (else result is NA).
`center`	Sets the values to the center of the window.
`axis`	If zero (0) or index is selected, apply to each column. Default is 0 (column). If zero (1) or columns, apply to each row.
`method`	Execute the rolling operation per single column or row (`'single'`) or over the entire object (`'table'`). This argument implements when specifying `engine='numba'` in the method call.

For this example, Rivers Clothing wants the Totals for the Tanks and Pants items.

Line [1] creates a list of products we want to view. In this case, Tanks and Pants.
Line [2] creates a DataFrame from a dictionary of lists and saves it to df. The cols list on line [1] is a DataFrame parameter.
Line [3] formats the output with a dollar sign ($) and two (2) decimal places.
Line [4] uses the expanding() method with sum(). This line totals the Tanks and Pants values. The output saves to the result variable.
Line [5] outputs the result to the terminal.

Output

Formula Example: Tanks: 11.35+13.45+14.98 = 39.78

	Tanks	Pants
0	NaN	NaN
1	NaN	NaN
2	$39.78	$173.23

If we set Line [6] to: result = df.expanding(2).sum(), below is the output.

	Tanks	Pants
0	NaN	NaN
1	$24.80	$78.36
2	$39.78	$173.23

DataFrame Exponential Weighted – ewm()

The ewm() method provides exponentially weighted functions.

The syntax for this method is as follows:

DataFrame.ewm(com=None, span=None, halflife=None, alpha=None, min_periods=0, 
              adjust=True, ignore_na=False, axis=0, times=None)

Parameter	Description
`com`	Specify decay in terms of center of mass `(a=1/(1+com))` for `com ≥ 0`.
`span`	Specify decay in terms of center of mass `(a=2/(span+1))` for `span ≥ 1`. This parameter is a float and is optional.
`halflife`	Specify decay in terms of halflife `(a=1-exp(--ln(2)/halflife))` for `halflife > 0`. If `times` are specified, the time unit (`str` or time-delta) over which an observation decays to half its value. `mean()` and will not apply to the other functions.
`alpha`	Specify a smoothing factor a directly, 0 < a < 1. This parameter is a float and is optional.
`min_periods`	The minimum number of observations in the window must have a value (else result is 0). The default is the integer 0.
`adjust`	Divide by decaying adjustment factor into beginning periods to account for imbalance in relative weightings (viewing EWMA as a moving average). By default, `True`.
`ignore_na`	Ignore missing values when calculating weights; specify `True` to reproduce pre-0.15.0 behavior.
`axis`	If zero (0) or index is selected, apply to each column. Default is 0 (column). If zero (1) or columns, apply to each row.
`times`	Times corresponding to the observations. Must be monotonically increasing and `datetime64[ns]` dtype. If a string, the column name in the DataFrame represents the times. If a 1-D array, a sequence with the same shape as the observations. Applicable to `mean()`.

For this example, Rivers Clothing would like an exponential weighted moving average for their sales for the past four (4) quarters on their four (4) clothing lines.

df = pd.DataFrame({'Qtrs':   [1, 2, 3, 4],
                   'Sales':  [2500, 2000, 1400, 2700]})

df['Qtrs'] = df['Sales'].ewm(span=4, adjust=False).mean()
print(df['Qtrs'])

Line [1] creates a DataFrame from a dictionary of lists and saves it to df.
Line [2] calculates the mean emw and saves it to the df['Qtrs’] variable.
Line [3] outputs the result to the terminal.

Output

	Tanks
0	2500.0
1	2300.0
2	1940.0
3	2244.0
Name:	Qtrs, dtype=float64