Pandas DataFrame Methods - Be on the Right Side of Change

The Pandas DataFrame is a data structure that organizes data into a two-dimensional format. If you are familiar with Excel or Databases, the setup is similar. Each DataFrame contains a schema that defines a Column (Field) Name and a Data Type.

This article delves into the methods available for DataFrame Iteration. This article also assumes Excel exists on the computer.

Preparation

Before any data manipulation can occur, one (1) new library will require installation.

The Pandas library enables access to/from a DataFrame.

To install this library, navigate to an IDE terminal. At the command prompt ($), execute the code below. For the terminal used in this example, the command prompt is a dollar sign ($). Your terminal prompt may be different.

$ pip install pandas

Hit the <Enter> key on the keyboard to start the installation process.

If the installation was successful, a message displays in the terminal indicating the same.

Feel free to view the PyCharm installation guide for the required library.

How to install Pandas on PyCharm.

Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.

import pandas as pd

Create a CSV File

For this article, we will create a CSV file. This file contains a small representation of 2020 Fire Incidents in the San Diego area. To complete this task, perform the following steps:

Using the mouse, highlight the data below.
Press CTRL+C to copy the contents to the system Clipboard.
Open Notepad.
Press CTRL+V to paste the contents of the system Clipboard to Notepad.
From the File menu, click Save As and browse to select the current working directory.
From the Save As pop-up, enter the filename and type as indicated below.
Click the Save button to complete this task.

Data

Call;Incident;Date;Time;Zip
Life-Threatening Emergency Response;FS21198050;11/30/2021;23:56;92131
Urgent Response;FS21198048;11/30/2021;23:53;92106
Non-Life-Threatening Response;FS21198047;11/30/2021;23:43;92101
Urgent Response;FS21198045;11/30/2021;23:35;92102
Life-Threatening Emergency Response;FS21198046;11/30/2021;23:34;92108
Life-Threatening Emergency Response;FS21198040;11/30/2021;23:26;92110
Non-Life-Threatening Response;FS21198039;11/30/2021;23:24;92037
Life-Threatening Emergency Response;FS21198038;11/30/2021;23:14;92113
Urgent Response;FS21198037;11/30/2021;23:14;92105
Fire;FS21198036;11/30/2021;23:12;92110

DataFrame pop()

The pop() method drops the referenced item from the DataFrame. If not located, a KeyError occurs.

The syntax of this method is as follows:

DataFrame.pop(item)

Parameter	Description
`item`	This parameter is the Column Label (Name) to be popped (removed).

This example pops (removes) the Date column from the DataFrame.

df = pd.read_csv('fire.csv', sep=';')
df.pop('Date')
print(df)

Line [1] reads in the CSV file created earlier. Notice the field separator character is the semi-colon (;).
Line [2] pops (removes) the Date column from the DataFrame.
Line [3] outputs the DataFrame to the terminal.

Output

	Call	Incident	Time	Zip
0	Life-Threatening Emergency Response	FS21198050	23:56	92131
1	Urgent Response	FS21198048	23:53	92106
2	Life-Threatening Emergency Response	FS21198047	23:43	92101
3	Urgent Response	FS21198045	23:35	92102
4	Life-Threatening Emergency Response	FS21198046	23:34	92108
5	Life-Threatening Emergency Response	FS21198040	23:26	92110
6	Non-Life-Threatening Emergency Response	FS21198039	23:24	92037
7	Life-Threatening Emergency Response	FS21198038	23:14	92113
8	Urgent Response	FS21198037	23:14	92105
9	Fire	FS21198036	23:12	92110

DataFrame tail()

If a DataFrame contains many rows, this method comes in handy. This method offers the ability to view a selection of the bottom n rows of a DataFrame.

The syntax of this method is as follows:

DataFrame.tail(n=5)

Parameter	Description
`n`	If empty, the bottom five (`n=5`) rows will display. If this parameter contains a value, the bottom `n` number of rows will display. The number of rows does not include the header row.

For this example, we will display the bottom three (3) rows of the DataFrame.

df = pd.read_csv('fire.csv', sep=';')
print(df.tail(3))

Line [1] reads in the CSV file created earlier. Notice the field separator character is the semi-colon (;).
Line [2] outputs the bottom three (3) DataFrame rows, plus the header row to the terminal.

Output

	Call	Incident	Date	Time	Zip
7	Life-Threatening Emergency Response	FS21198038	11/30/2021	23:14	92113
8	Urgent Response	FS21198037	11/30/2021	23:14	92105
9	Fire	FS21198036	11/30/2021	23:12	92110

DataFrame get()

The get() method gets (retrieves) an item from an object for a specified key. This key (Column) might be a single key (Column) or multiple keys (Columns). A default value of None returns for no match.

The syntax of this method is as follows:

DataFrame.get(key, default=None)

Parameter	Description
`key`	This is an object.
`value`	This is the same type as the items in the object.

For this example, the values in the Zip column are returned.

df = pd.read_csv('fire.csv', sep=';')
results = df.get("Zip")
print(results)

Line [1] reads in the CSV file created earlier. Notice the field separator character is the semi-colon (;).
Line [2] saves the Zip values to results.
Line [3] outputs the results to the terminal.

Output

0	92131
1	92106
2	92101
3	92102
4	92108
5	92110
6	92037
7	92113
8	92105
9	92110
Name:	Zip, dtype: int64

DataFrame isin()

The isin() method determines if each element in the DataFrame exists in the values parameter. This method returns a DataFrame of Boolean values (True/False) indicating the same.

The syntax of this method is as follows:

DataFrame.isin(values)

Parameter	Description
`values`	This is an iterable and can be a Series, DataFrame, or Dictionary. If the iterable is a Dictionary, the keys must contain and match the Column Labels (Names). If a DataFrame, the index and Column Labels (Names) must match.

This example uses a DataFrame.

df = pd.read_csv('fire.csv', sep=';')
results = df.isin([92108, 92102])
print(results)

Line [1] reads in the CSV file created earlier. Notice the field separator character is the semi-colon (;).
Line [2] checks for two Zip values and saves the Boolean values to results.
Line [3] outputs the results to the terminal.

Output

	Call	Incident	Time	Zip
0	False	False	False	False
1	False	False	False	False
2	False	False	False	False
3	False	False	False	True
4	False	False	False	True
5	False	False	False	False
6	False	False	False	False
7	False	False	False	False
8	False	False	False	False
9	False	False	False	False

Click here for examples using a Dictionary and a Series.

DataFrame where()

The where() method replaces values when cond=False. This method is similar to the if-then statement. For each element in the DataFrame, if cond=True, the element value is used. Otherwise, the match from the other value is assigned.

The syntax of this method is as follows:

DataFrame.where(cond, other=nan, inplace=False, axis=None, 
                level=None, errors='raise', try_cast=NoDefault.no_default)

Parameter	Description
`cond`	The allowable items are Boolean, Series, DataFrame, Array, or Callable. Based on the following conditions: – Where `cond=True`, the original value is kept. – Where `cond=False`, replace with the corresponding value from other (see below). – If callable, a Boolean Series/DataFrame or Array returns.
`other`	Entries where `cond=False` this is changed to the corresponding value from other. If other is callable, this returns a scalar or Series/DataFrame.
`inplace`	– If `inplace=True`, the DataFrame is updated. – If `inplace=False`, the DataFrame is updated, and a copy of the DataFrame returns. Default is `False`.
`axis`	Determines the alignment `axis`. The default value is `None`.
`level`	Determines the alignment level. The default value is `None`.
`errors`	This parameter does not affect the results.

For this example, this code converts the Zip values that are not 92108 to 90210.

df = pd.read_csv('fire.csv', sep=';')
results = df.where(df == 92108, '90210')
print(results)

Line [1] reads in the CSV file created earlier. Notice the field separator character is the semi-colon (;).
Line [2] changes all Zip values that do not match the filter criteria and assigns this to results.
Line [3] outputs the results to the terminal.

Output

	Call	Incident	Time	Zip
0	90210	90210	90210	90210
1	90210	90210	90210	90210
2	90210	90210	90210	90210
3	90210	90210	90210	90210
4	90210	90210	90210	92108
5	90210	90210	90210	90210
6	90210	90210	90210	90210
7	90210	90210	90210	90210
8	90210	90210	90210	90210
9	90210	90210	90210	90210

DataFrame mask()

The mask() method replaces values where cond=True. This method is similar to the if-then statement. For each element in the DataFrame, if cond=False, the element value is assigned. Otherwise, the match from the other value is accessed.

The syntax of this method is as follows:

DataFrame.mask(cond, other=nan, inplace=False, axis=None, level=None, 
               errors='raise', try_cast=NoDefault.no_default)

Parameter	Description
`cond`	The allowable items are Boolean, Series, DataFrame, Array, or Callable. Based on the following conditions: – Where `cond=True`, the original value is kept. – Where `cond=False`, replace with the corresponding value from other (see below). – If callable, a Boolean Series/DataFrame or Array returns.
`other`	Entries where `cond=False` this is changed to the corresponding value from other. If other is callable, this returns a scalar or Series/DataFrame.
`inplace`	– If `inplace=True`, the DataFrame is updated. – If `inplace=False`, the DataFrame is updated, and a copy of the DataFrame returns. Default is `False`.
`axis`	Determines the alignment `axis`. The default value is `None`.
`level`	Determines the alignment level. The default value is `None`.
`errors`	This parameter does not affect the results.

For this example, the Zip value 92131 changes to 9000.

df = pd.read_csv('fire.csv', sep=';')
results = df.mask(df == 92131, '90000')
print(results)

Line [1] reads in the CSV file created earlier. Notice the field separator character is the semi-colon (;).
Line [2] searches for and changes the matching field. No other data is modified.
Line [3] outputs the results to the terminal.

Output

	Call	Incident	Date	Time	Zip
0	Life-Threatening Emergency Response	FS21198050	11/30/2021	23:56	90000
1	Urgent Response	FS21198048	11/30/2021	23:53	92106
2	Life-Threatening Emergency Response	FS21198047	11/30/2021	23:43	92101
3	Urgent Response	FS21198045	11/30/2021	23:35	92102
4	Life-Threatening Emergency Response	FS21198046	11/30/2021	23:34	92108
5	Life-Threatening Emergency Response	FS21198040	11/30/2021	23:26	92110
6	Non-Life-Threatening Emergency Response	FS21198039	11/30/2021	23:24	92037
7	Life-Threatening Emergency Response	FS21198038	11/30/2021	23:14	92113
8	Urgent Response	FS21198037	11/30/2021	23:14	92105
9	Fire	FS21198036	11/30/2021	23:12	92110

DataFrame query()

The query() method queries the columns of a DataFrame using a boolean expression. The return value is a DataFrame resulting from the provided query expression or None if inplace=True.

The syntax of this method is as follows:

DataFrame.query(expr, inplace=False, **kwargs)

Parameter	Description
`expr`	This is the query string to be evaluated. You can refer to variables inside the environment by pre-fixing with an `@` character. Refer to invalid Python Column Labels (Names) by surrounding them in back-ticks.
`inplace`	This value determines if this executes in place and returns a copy
`**kwargs`	Click to see the documentation on `eval()`.

For this example, all Zip values greater than 92108 return.

df = pd.read_csv('fire.csv', sep=';')
results = df.query('Zip > 92108')
print(results)

Line [1] reads in the CSV file created earlier. Notice the field separator character is the semi-colon (;).
Line [2] searches for and returns the results that match the query entered.
Line [3] outputs the results to the terminal.

Output

	Call	Incident	Date	Time	Zip
0	Life-Threatening Emergency Response	FS21198050	11/30/2021	23:56	92131
5	Life-Threatening Emergency Response	FS21198040	11/30/2021	23:26	92110
7	Life-Threatening Emergency Response	FS21198038	11/30/2021	23:14	92113
9	Fire	FS21198036	11/30/2021	23:12	92110

DataFrame xs()

The xs() method takes a Key argument to select data at a particular level of a MultiIndex and returns a cross-section from the Series/DataFrame.

The syntax of this method is as follows:

DataFrame.xs(key, axis=0, level=None, drop_level=True)

Parameter	Description
`key`	The label contained in the index or partially in a `MultiIndex`.
`axis`	Axis to retrieve the cross-section of data from (0/1). Default is 0.
`level`	Indicates which levels of a `MultiIndex` to use.
`drop_level`	If `False`, an object returns with the same levels as the original. Default is `True`.

For this example, we have created a DataFrame containing information about a High School.

school = {'stds_class': 	[14, 13, 18, 21],
	   'avg_class':  	[83, 72, 65, 91],
	   'classes':         	['math', 'science', 'history', 'english'],
	   'bldg':          	['C3', 'B1', 'F6', 'M2'],
	   'teacher':   	['hill', 'watts', 'sufra', 'brown']}

df = pd.DataFrame(school)
df = df.set_index(['classes', 'bldg', 'teacher'])

Line [1] creates a new DataFrame, school.
Line [2] assigns this DataFrame to df.
Line [3] sets the indexes to classes, bldg, and teacher.
Line [4] outputs the DataFrame to the terminal.

Output

Below is the DataFrame with no modifications.

                      stds_class  avg_class
classes bldg teacher                       
math    C3   hill             14         83
science B1   watts            13         72
history F6   sufra            18         65
english M2   brown            21         91

The following code applies the xs() method.

df = pd.DataFrame(school)
df = df.set_index(['classes', 'bldg', 'teacher'])
print(df.xs('science'))

Line [3] outputs the details for science to the terminal.

Output

              stds_class  avg_class
bldg teacher                       
B1   watts            13         72