Pandas DataFrame GroupBy and Window – Part 2

The Pandas DataFrame has several Function Applications, GroupBy & Window methods. When applied to a DataFrame, these methods modify the output of a DataFrame.

Part 2 of this series focuses on GroupBy & Window methods and delves into each item listed above.


Preparation

Before any data manipulation can occur, two (2) new libraries will require installation.

  • The Pandas library enables access to/from a DataFrame.
  • The NumPy library supports multi-dimensional arrays and matrices in addition to a collection of mathematical functions.

To install these libraries, navigate to an IDE terminal. At the command prompt ($), execute the code below. For the terminal used in this example, the command prompt is a dollar sign ($). Your terminal prompt may be different.

$ pip install pandas

Hit the <Enter> key on the keyboard to start the installation process.

$ pip install numpy

Hit the <Enter> key on the keyboard to start the installation process.

If the installations were successful, a message displays in the terminal indicating the same.


Feel free to view the PyCharm installation guide for the required libraries.


Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.

import pandas as pd
import numpy as np 

DataFrame transform()

The transform() method calls a function on itself. This method produces a DataFrame with transformed values.

The syntax for this method is as follows:

DataFrame.transform(func, axis=0, *args, **kwargs)
ParameterDescription
funcThis is a function used for transforming the data. This function must work with a DataFrame or DataFrame.apply(). If list-like or dict-like, the latter takes precedence.
axisIf zero (0) or index is selected, apply to each column. Default is 0 (column). If zero (1) or columns, apply to each row.
argsThe positional argument to pass to the function.
**kwargsThis parameter is keyword arguments passed to func.

Rivers Clothing has decided to increase the price of all its items by $1.03. Their Accounting Department has informed them that the market will bear the increase. Use the transform() method and lambda to accomplish this task.

df = pd.DataFrame({'Tops':   [10.12, 12.23, 13.95],
                   'Tanks':   [11.35, 13.45, 14.98],
                   'Pants':   [21.37, 56.99, 94.87],
                   'Sweats':  [27.15, 21.85, 35.75]})

pd.options.display.float_format = '${:.2f}'.format
index_   = ['Small', 'Medium', 'Large']
df.index = index_

result = df.transform(lambda x: x+1.03)
print(result)
  • Line [1] creates a DataFrame from a Dictionary of Lists and saves it to df.
  • Line [2] formats the output with a dollar sign ($) and two (2) decimal places.
  • Line [3-4] creates and sets the index for the DataFrame (Small/Medium/Large).
  • Line [5] uses the transform() method with a lambda. This line adds 1.03 to each price. The output saves to the result variable.
  • Line [6] outputs the result to the terminal.

Output

Formula Example:  10.12 + 1.03 = 11.15

 TopsTanksPantsSweats
Small$11.15$12.38$22.40 $28.18
Medium$13.26$14.48$58.02 $22.88
Large$14.98$16.01$95.90$36.78

DataFrame groupby()

The groupby() method involves splitting a DataFrame (object) in some capacity, applying a function, and combining the results.

This method groups data and applies various operations to the groups. It returns a groupby() object with the appropriate data.

The syntax for this method is as follows:

DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, 
                  group_keys=True, squeeze=NoDefault.no_default, 
                  observed=False, dropna=True)
ParametersDescription
byThis parameter is used to determine the groups for the groupby. If a function, it is called on each index value. If a dict or Series, the Series or dict values will be used to determine the groups. If ndarray, the values are used as-is on the groups. A label or list of labels groups the columns.
axisIf zero (0) or index is selected, apply to each column. Default is 0 (column). If zero (1) or columns, apply to each row.
levelIf the axis is a MultiIndex, the data groups by a certain level(s). By default, None.
as_indexSet the return object with group labels as the index. This parameter is only relevant for DataFrame. Set index=False for SQL grouped data.
sortThis parameter sorts group keys. Turn this off for better performance. This parameter is True by default.
group_keysWhen calling the apply() method, you need to add group keys to the index to identify the pieces.
squeezeDepreciated
observedThis only applies if any of the groups are categories. If True, only show observed values for categorical groupers. If False, show all values for categorical groupers. By default, False.
dropnaIf True, and group keys contain NaN values. The row and columns drop. This parameter is True by default. If False, NaN values will also be the key in groups.

For this example, Rivers Clothing wants to determine the mean values of the product line Tanks. Use the groupby() method to accomplish this task.

df = pd.DataFrame({'Tops':     [10.12, 12.23, 13.95],
                   'Tanks':    [11.35, 13.45, 14.98],
                   'Pants':    [21.37, 56.99, 94.87],
                   'Sweats':  [27.15, 21.85, 35.75]})

pd.options.display.float_format = '${:.2f}'.format

result = df.groupby(['Tanks'])
for tanks, group in result:
    print('$' + str(tanks))
    print(group)
  • Line [1] creates a DataFrame from a dictionary of lists and saves it to df.
  • Line [2] formats the output with a dollar sign ($) and two (2) decimal places.
  • Line [3] uses the groupby() method for the item Tanks. The output saves to the result variable.
  • Line [4] instantiates a for loop to display the items. This line also pre-pends a dollar sign ($) to tanks.
  • Line [5] outputs the formatted result to the terminal.

Output

$11.35
 TopsTanksPantsSweats
0$10.12$11.35$21.37 $27.15
$13.35
 TopsTanksPantsSweats
1$12.23$13.45$56.99$21.85
$14.98
 TopsTanksPantsSweats
2$13.95$14.98$94.87$35.75

DataFrame rolling()

The rolling() method provides rolling window calculations.

The syntax for this method is as follows:

DataFrame.rolling(window, min_periods=None, center=False, 
                  win_type=None, on=None, axis=0, closed=None, method='single')
ParameterDescription
windowThis parameter is the size of the moving window. Each window has a fixed size. If this parameter is an offset, then this will be the time period of each window. Each window will be a variable size based on the time period observations. Valid for datetime indexes. If a Base Indexer sub-class: calculates the boundaries on get_window_bounds().
min_periodsThis is the minimum number of observations in the window required to have a value (otherwise result is NA). For a window that has an offset, min_periods will default to 1. Otherwise, min_periods will default to the size of the window.
centerSets the labels to the center of the windows. By default, True.
win_typeThis parameter provides a window type. If None, all points are even. By default, None.
onFor a DataFrame, a datetime-like column or Index level calculating the rolling window rather than the DataFrame index. Provided integer column is ignored and excluded from result since an integer index does not calculate the rolling window
axisIf zero (0) or index is selected, apply to each column. Default is 0 (column). If zero (1) or columns, apply to each row.
closedMake the interval close on the right, left, both, or neither end-points. By default, right.
methodExecute the rolling operation per single column or row ('single') or over the entire object ('table'). This argument implements when specifying engine='numba' in the method call.

For this example, Rivers Clothing wants prices of all sizes for the Tanks and Pants items.

cols = ['Tanks', 'Pants']
df = pd.DataFrame({'Tops':    [10.12, 12.23, 13.95],
                   'Tanks':   [11.35, 13.45, 14.98],
                   'Pants':   [21.37, 56.99, 94.87],
                   'Sweats': [27.15, 21.85, 35.75]}, columns=cols)

pd.options.display.float_format = '${:.2f}'.format
index_   = ['Small', 'Medium', 'Large']
df.index = index_

result = df.rolling(1, win_type='gaussian').sum(std=0)
print(result)
  • Line [1] creates a list of products we want to view. In this case, Tanks and Pants.
  • Line [2] creates a DataFrame from a dictionary of lists and saves it to df. The cols list on line [1] is a DataFrame parameter.
  • Line [3] formats the output with a dollar sign ($) and two (2) decimal places.
  • Line [4-5] creates and sets the index for the DataFrame (Small/Medium/Large).
  • Line [6] uses the rolling() method with sum(). This line retrieves the Tanks and Pants values. In this case, assigning a value to std does not change anything but is required. The output saves to the result variable.
  • Line [7] outputs the result to the terminal.

Output

 TanksPants
Small$11.35$21.37
Medium$13.45$56.99
Large$14.98$94.87

πŸ’‘ Note: Gaussian distribution (also known as the normal distribution) is a bell-shaped curve. Any measurement values will follow a normal distribution. This distribution will have an equal number of measurements above and below the mean value.


DataFrame expanding()

The expanding() method provides expanding transformations.

The syntax for this method is as follows:

DataFrame.expanding(min_periods=1, center=None, axis=0, method='single')
ParameterDescription
min_periodsThis parameter is the minimum number of observations in the window required to have a value (else result is NA).
centerSets the values to the center of the window.
axisIf zero (0) or index is selected, apply to each column. Default is 0 (column). If zero (1) or columns, apply to each row.
methodExecute the rolling operation per single column or row ('single') or over the entire object ('table'). This argument implements when specifying engine='numba' in the method call.

For this example, Rivers Clothing wants the Totals for the Tanks and Pants items.

  • Line [1] creates a list of products we want to view. In this case, Tanks and Pants.
  • Line [2] creates a DataFrame from a dictionary of lists and saves it to df. The cols list on line [1] is a DataFrame parameter.
  • Line [3] formats the output with a dollar sign ($) and two (2) decimal places.
  • Line [4] uses the expanding() method with sum(). This line totals the Tanks and Pants values. The output saves to the result variable.
  • Line [5] outputs the result to the terminal.

Output

Formula Example: Tanks: 11.35+13.45+14.98 = 39.78

 TanksPants
0NaNNaN
1NaNNaN
2$39.78$173.23

If we set Line [6] to: result = df.expanding(2).sum(), below is the output.

 TanksPants
0NaNNaN
1$24.80$78.36
2$39.78$173.23

DataFrame Exponential Weighted – ewm()

The ewm() method provides exponentially weighted functions.

The syntax for this method is as follows:

DataFrame.ewm(com=None, span=None, halflife=None, alpha=None, min_periods=0, 
              adjust=True, ignore_na=False, axis=0, times=None)
ParameterDescription
comSpecify decay in terms of center of mass (a=1/(1+com)) for com β‰₯ 0.
spanSpecify decay in terms of center of mass (a=2/(span+1)) for span β‰₯ 1. This parameter is a float and is optional.
halflifeSpecify decay in terms of halflife (a=1-exp(--ln(2)/halflife)) for halflife > 0. If times are specified, the time unit (str or time-delta) over which an observation decays to half its value. mean() and will not apply to the other functions.
alphaSpecify a smoothing factor a directly, 0 < a < 1. This parameter is a float and is optional.
min_periodsThe minimum number of observations in the window must have a value (else result is 0). The default is the integer 0.
adjustDivide by decaying adjustment factor into beginning periods to account for imbalance in relative weightings (viewing EWMA as a moving average). By default, True.
ignore_naIgnore missing values when calculating weights; specify True to reproduce pre-0.15.0 behavior.
axisIf zero (0) or index is selected, apply to each column. Default is 0 (column). If zero (1) or columns, apply to each row.
timesTimes corresponding to the observations. Must be monotonically increasing and datetime64[ns] dtype.
If a string, the column name in the DataFrame represents the times.
If a 1-D array, a sequence with the same shape as the observations. Applicable to mean().

For this example, Rivers Clothing would like an exponential weighted moving average for their sales for the past four (4) quarters on their four (4) clothing lines.

df = pd.DataFrame({'Qtrs':   [1, 2, 3, 4],
                   'Sales':  [2500, 2000, 1400, 2700]})

df['Qtrs'] = df['Sales'].ewm(span=4, adjust=False).mean()
print(df['Qtrs'])
  • Line [1] creates a DataFrame from a dictionary of lists and saves it to df.
  • Line [2] calculates the mean emw and saves it to the df['Qtrs’] variable.
  • Line [3] outputs the result to the terminal.

Output

 Tanks
02500.0
12300.0
21940.0
32244.0
Name:Qtrs, dtype=float64