The Pandas DataFrame is a data structure that organizes data into a two-dimensional format. If you are familiar with Excel or Databases, the setup is similar. Each DataFrame contains a schema that defines a Column (Field) Name and a Data Type.
This article delves into the methods available for DataFrame Iteration. This article also assumes Excel exists on the computer.
Preparation
Before any data manipulation can occur, one (1) new library will require installation.
- The Pandas library enables access to/from a DataFrame.
To install this library, navigate to an IDE terminal. At the command prompt ($
), execute the code below. For the terminal used in this example, the command prompt is a dollar sign ($
). Your terminal prompt may be different.
$ pip install pandas
Hit the <Enter>
key on the keyboard to start the installation process.
If the installation was successful, a message displays in the terminal indicating the same.
Feel free to view the PyCharm installation guide for the required library.
Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.
import pandas as pd
Create a CSV File
For this article, we will create a CSV file. This file contains a small representation of 2020 Fire Incidents in the San Diego area. To complete this task, perform the following steps:
- Using the mouse, highlight the data below.
- Press CTRL+C to copy the contents to the system Clipboard.
- Open Notepad.
- Press CTRL+V to paste the contents of the system Clipboard to Notepad.
- From the File menu, click Save As and browse to select the current working directory.
- From the Save As pop-up, enter the filename and type as indicated below.
- Click the Save button to complete this task.
Data
Call;Incident;Date;Time;Zip Life-Threatening Emergency Response;FS21198050;11/30/2021;23:56;92131 Urgent Response;FS21198048;11/30/2021;23:53;92106 Non-Life-Threatening Response;FS21198047;11/30/2021;23:43;92101 Urgent Response;FS21198045;11/30/2021;23:35;92102 Life-Threatening Emergency Response;FS21198046;11/30/2021;23:34;92108 Life-Threatening Emergency Response;FS21198040;11/30/2021;23:26;92110 Non-Life-Threatening Response;FS21198039;11/30/2021;23:24;92037 Life-Threatening Emergency Response;FS21198038;11/30/2021;23:14;92113 Urgent Response;FS21198037;11/30/2021;23:14;92105 Fire;FS21198036;11/30/2021;23:12;92110 β
DataFrame pop()
The pop()
method drops the referenced item from the DataFrame. If not located, a KeyError
occurs.
The syntax of this method is as follows:
DataFrame.pop(item)
Parameter | Description |
---|---|
item | This parameter is the Column Label (Name) to be popped (removed). |
This example pops (removes) the Date column from the DataFrame.
df = pd.read_csv('fire.csv', sep=';') df.pop('Date') print(df)
- Line [1] reads in the CSV file created earlier. Notice the field separator character is the semi-colon (
;
). - Line [2] pops (removes) the Date column from the DataFrame.
- Line [3] outputs the DataFrame to the terminal.
Output
Call | Incident | Time | Zip | |
0 | Life-Threatening Emergency Response | FS21198050 | 23:56 | 92131 |
1 | Urgent Response | FS21198048 | 23:53 | 92106 |
2 | Life-Threatening Emergency Response | FS21198047 | 23:43 | 92101 |
3 | Urgent Response | FS21198045 | 23:35 | 92102 |
4 | Life-Threatening Emergency Response | FS21198046 | 23:34 | 92108 |
5 | Life-Threatening Emergency Response | FS21198040 | 23:26 | 92110 |
6 | Non-Life-Threatening Emergency Response | FS21198039 | 23:24 | 92037 |
7 | Life-Threatening Emergency Response | FS21198038 | 23:14 | 92113 |
8 | Urgent Response | FS21198037 | 23:14 | 92105 |
9 | Fire | FS21198036 | 23:12 | 92110 |
DataFrame tail()
If a DataFrame contains many rows, this method comes in handy. This method offers the ability to view a selection of the bottom n rows of a DataFrame.
The syntax of this method is as follows:
DataFrame.tail(n=5)
Parameter | Description |
---|---|
n | If empty, the bottom five (n=5 ) rows will display. If this parameter contains a value, the bottom n number of rows will display. The number of rows does not include the header row. |
For this example, we will display the bottom three (3) rows of the DataFrame.
df = pd.read_csv('fire.csv', sep=';') print(df.tail(3))
- Line [1] reads in the CSV file created earlier. Notice the field separator character is the semi-colon (
;
). - Line [2] outputs the bottom three (3) DataFrame rows, plus the header row to the terminal.
Output
Call | Incident | Date | Time | Zip | |
7 | Life-Threatening Emergency Response | FS21198038 | 11/30/2021 | 23:14 | 92113 |
8 | Urgent Response | FS21198037 | 11/30/2021 | 23:14 | 92105 |
9 | Fire | FS21198036 | 11/30/2021 | 23:12 | 92110 |
DataFrame get()
The get()
method gets (retrieves) an item from an object for a specified key. This key (Column) might be a single key (Column) or multiple keys (Columns). A default value of None
returns for no match.
The syntax of this method is as follows:
DataFrame.get(key, default=None)
Parameter | Description |
---|---|
key | This is an object. |
value | This is the same type as the items in the object. |
For this example, the values in the Zip column are returned.
df = pd.read_csv('fire.csv', sep=';') results = df.get("Zip") print(results)
- Line [1] reads in the CSV file created earlier. Notice the field separator character is the semi-colon (
;
). - Line [2] saves the Zip values to results.
- Line [3] outputs the results to the terminal.
Output
0 | 92131 |
1 | 92106 |
2 | 92101 |
3 | 92102 |
4 | 92108 |
5 | 92110 |
6 | 92037 |
7 | 92113 |
8 | 92105 |
9 | 92110 |
Name: | Zip, dtype: int64 |
DataFrame isin()
The isin()
method determines if each element in the DataFrame exists in the values parameter. This method returns a DataFrame of Boolean values (True
/False
) indicating the same.
The syntax of this method is as follows:
DataFrame.isin(values)
Parameter | Description |
---|---|
values | This is an iterable and can be a Series, DataFrame, or Dictionary. If the iterable is a Dictionary, the keys must contain and match the Column Labels (Names). If a DataFrame, the index and Column Labels (Names) must match. |
This example uses a DataFrame.
df = pd.read_csv('fire.csv', sep=';') results = df.isin([92108, 92102]) print(results)
- Line [1] reads in the CSV file created earlier. Notice the field separator character is the semi-colon (
;
). - Line [2] checks for two Zip values and saves the Boolean values to
results
. - Line [3] outputs the results to the terminal.
Output
Call | Incident | Time | Zip | |
0 | False | False | False | False |
1 | False | False | False | False |
2 | False | False | False | False |
3 | False | False | False | True |
4 | False | False | False | True |
5 | False | False | False | False |
6 | False | False | False | False |
7 | False | False | False | False |
8 | False | False | False | False |
9 | False | False | False | False |
Click here for examples using a Dictionary and a Series.
DataFrame where()
The where()
method replaces values when cond=False
. This method is similar to the if-then statement. For each element in the DataFrame, if cond=True
, the element value is used. Otherwise, the match from the other
value is assigned.
The syntax of this method is as follows:
DataFrame.where(cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=NoDefault.no_default)
Parameter | Description |
---|---|
cond | The allowable items are Boolean, Series, DataFrame, Array, or Callable. Based on the following conditions: – Where cond=True , the original value is kept.– Where cond=False , replace with the corresponding value from other (see below).– If callable, a Boolean Series/DataFrame or Array returns. |
other | Entries where cond=False this is changed to the corresponding value from other. If other is callable, this returns a scalar or Series/DataFrame. |
inplace | – If inplace=True , the DataFrame is updated. – If inplace=False , the DataFrame is updated, and a copy of the DataFrame returns. Default is False . |
axis | Determines the alignment axis . The default value is None . |
level | Determines the alignment level. The default value is None . |
errors | This parameter does not affect the results. |
For this example, this code converts the Zip values that are not 92108 to 90210.
df = pd.read_csv('fire.csv', sep=';') results = df.where(df == 92108, '90210') print(results)
- Line [1] reads in the CSV file created earlier. Notice the field separator character is the semi-colon (
;
). - Line [2] changes all Zip values that do not match the filter criteria and assigns this to results.
- Line [3] outputs the results to the terminal.
Output
Call | Incident | Time | Zip | |
0 | 90210 | 90210 | 90210 | 90210 |
1 | 90210 | 90210 | 90210 | 90210 |
2 | 90210 | 90210 | 90210 | 90210 |
3 | 90210 | 90210 | 90210 | 90210 |
4 | 90210 | 90210 | 90210 | 92108 |
5 | 90210 | 90210 | 90210 | 90210 |
6 | 90210 | 90210 | 90210 | 90210 |
7 | 90210 | 90210 | 90210 | 90210 |
8 | 90210 | 90210 | 90210 | 90210 |
9 | 90210 | 90210 | 90210 | 90210 |
DataFrame mask()
The mask()
method replaces values where cond=True
. This method is similar to the if-then statement. For each element in the DataFrame, if cond=False
, the element value is assigned. Otherwise, the match from the other
value is accessed.
The syntax of this method is as follows:
DataFrame.mask(cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=NoDefault.no_default)
Parameter | Description |
---|---|
cond | The allowable items are Boolean, Series, DataFrame, Array, or Callable. Based on the following conditions: – Where cond=True , the original value is kept.– Where cond=False , replace with the corresponding value from other (see below).– If callable, a Boolean Series/DataFrame or Array returns. |
other | Entries where cond=False this is changed to the corresponding value from other. If other is callable, this returns a scalar or Series/DataFrame. |
inplace | – If inplace=True , the DataFrame is updated. – If inplace=False , the DataFrame is updated, and a copy of the DataFrame returns. Default is False . |
axis | Determines the alignment axis . The default value is None . |
level | Determines the alignment level. The default value is None . |
errors | This parameter does not affect the results. |
For this example, the Zip value 92131 changes to 9000.
df = pd.read_csv('fire.csv', sep=';') results = df.mask(df == 92131, '90000') print(results)
- Line [1] reads in the CSV file created earlier. Notice the field separator character is the semi-colon (
;
). - Line [2] searches for and changes the matching field. No other data is modified.
- Line [3] outputs the results to the terminal.
Output
Call | Incident | Date | Time | Zip | |
0 | Life-Threatening Emergency Response | FS21198050 | 11/30/2021 | 23:56 | 90000 |
1 | Urgent Response | FS21198048 | 11/30/2021 | 23:53 | 92106 |
2 | Life-Threatening Emergency Response | FS21198047 | 11/30/2021 | 23:43 | 92101 |
3 | Urgent Response | FS21198045 | 11/30/2021 | 23:35 | 92102 |
4 | Life-Threatening Emergency Response | FS21198046 | 11/30/2021 | 23:34 | 92108 |
5 | Life-Threatening Emergency Response | FS21198040 | 11/30/2021 | 23:26 | 92110 |
6 | Non-Life-Threatening Emergency Response | FS21198039 | 11/30/2021 | 23:24 | 92037 |
7 | Life-Threatening Emergency Response | FS21198038 | 11/30/2021 | 23:14 | 92113 |
8 | Urgent Response | FS21198037 | 11/30/2021 | 23:14 | 92105 |
9 | Fire | FS21198036 | 11/30/2021 | 23:12 | 92110 |
DataFrame query()
The query()
method queries the columns of a DataFrame using a boolean expression. The return value is a DataFrame resulting from the provided query expression or None
if inplace=True
.
The syntax of this method is as follows:
DataFrame.query(expr, inplace=False, **kwargs)
Parameter | Description |
---|---|
expr | This is the query string to be evaluated. You can refer to variables inside the environment by pre-fixing with an @ character. Refer to invalid Python Column Labels (Names) by surrounding them in back-ticks. |
inplace | This value determines if this executes in place and returns a copy |
**kwargs | Click to see the documentation on eval() . |
For this example, all Zip values greater than 92108 return.
df = pd.read_csv('fire.csv', sep=';') results = df.query('Zip > 92108') print(results)
- Line [1] reads in the CSV file created earlier. Notice the field separator character is the semi-colon (
;
). - Line [2] searches for and returns the results that match the query entered.
- Line [3] outputs the results to the terminal.
Output
Call | Incident | Date | Time | Zip | |
0 | Life-Threatening Emergency Response | FS21198050 | 11/30/2021 | 23:56 | 92131 |
5 | Life-Threatening Emergency Response | FS21198040 | 11/30/2021 | 23:26 | 92110 |
7 | Life-Threatening Emergency Response | FS21198038 | 11/30/2021 | 23:14 | 92113 |
9 | Fire | FS21198036 | 11/30/2021 | 23:12 | 92110 |
DataFrame xs()
The xs()
method takes a Key
argument to select data at a particular level of a MultiIndex
and returns a cross-section from the Series/DataFrame.
The syntax of this method is as follows:
DataFrame.xs(key, axis=0, level=None, drop_level=True)
Parameter | Description |
---|---|
key | The label contained in the index or partially in a MultiIndex . |
axis | Axis to retrieve the cross-section of data from (0/1). Default is 0. |
level | Indicates which levels of a MultiIndex to use. |
drop_level | If False , an object returns with the same levels as the original. Default is True . |
For this example, we have created a DataFrame containing information about a High School.
school = {'stds_class': [14, 13, 18, 21], 'avg_class': [83, 72, 65, 91], 'classes': ['math', 'science', 'history', 'english'], 'bldg': ['C3', 'B1', 'F6', 'M2'], 'teacher': ['hill', 'watts', 'sufra', 'brown']} df = pd.DataFrame(school) df = df.set_index(['classes', 'bldg', 'teacher'])
- Line [1] creates a new DataFrame,
school
. - Line [2] assigns this DataFrame to
df
. - Line [3] sets the indexes to classes,
bldg
, andteacher
. - Line [4] outputs the DataFrame to the terminal.
Output
Below is the DataFrame with no modifications.
stds_class avg_class
classes bldg teacher
math C3 hill 14 83
science B1 watts 13 72
history F6 sufra 18 65
english M2 brown 21 91
The following code applies the xs()
method.
df = pd.DataFrame(school) df = df.set_index(['classes', 'bldg', 'teacher']) print(df.xs('science'))
Line [3] outputs the details for science to the terminal.
Output
stds_class avg_class
bldg teacher
B1 watts 13 72