Problem Formulation and Solution Overview
To make it more interesting, we have the following scenario:
To follow along, download the kp_data.xlsx
file and place it into the current working directory.
Method 1: Use read_excel() and the & operator
This method uses the read_excel()
function to read an XLSX file into a DataFrame and an expression to filter the results.
This example imports the above-noted Excel file into a DataFrame. The Employees
worksheet is accessed, and the following filter is applied:
π Give me the DataFrame rows for all employees who work in the Sales Department, and earn more than $55,000/annum.
Let’s convert this to Python code.
import pandas as pd cols = ['First', 'Last', 'Dept', 'Salary'] df_emps = pd.read_excel('kp_data.xlsx', sheet_name='Employees', usecols=cols) df_salary = df_emps[(df_emps['Dept'] == 'Sales') & (df_emps['Salary'] > 55000)] df_salary.to_excel('sales_55.xlsx', sheet_name='Sales Salaries Greater Than 55K')
The first line in the above code snippet imports the Pandas library. This allows access to and manipulation of the XLSX file. Just so you know, the openpyxl
library must be installed before continuing.
The following line defines the four (4) columns to retrieve from the XLSX file and saves them to the variable cols
as a List.
π‘Note: Open the Excel file and review the data to follow along.
Import the Excel File to Python
On the next line in the code snippet, read_excel()
is called and passed three (3) arguments:
- The name of the Excel file to import (
kp_data.xlsx
). - The worksheet name. The first worksheet in the Excel file is always read unless stated otherwise. For this example, our Excel file contains two (2) worksheets:
Employees
andSales
. The Employees worksheet can be referenced usingsheet_name=0
orsheet_name='Employees'
. Both produce the same result. - The columns to retrieve from the Excel workheet (
usecols=cols
).
The results save to df_emps
.
Filter the DataFrame
The highlighted line applies a filter that references the DataFrame columns to base the filter on and the &
operator to allow for more than one (1) filter criteria.
In Python, this filter:
df_salary = df_emps[(df_emps['Dept'] == 'Sales') & (df_emps['Salary'] > 55000)]
Equates to this:
π Give me the DataFrame rows for all employees who work in the Sales Department, and earn more than $55,000/annum.
These results save to sales_55.xlsx
with a worksheet ‘Sales Salaries Greater Than 55K
‘ and placed into the current working directory.
Contents of Filtered Excel File
Method 2: Use read_excel() and loc[]
This method uses the read_excel()
function to read an XLSX file into a DataFrame and loc[]
to filter the results. The loc[]
function can access either a group of rows or columns based on their label names.
This example imports the above-noted Excel file into a DataFrame. The Employees
worksheet is accessed, and the following filter is applied:
π Give me the DataFrame rows for all employees who work in the IT Department, and live in the United States.
Let’s convert this to Python code.
import pandas as pd from openpyxl import load_workbook cols = ['First', 'Last', 'Dept', 'Country'] df_emps = pd.read_excel('kp_data.xlsx', sheet_name='Employees', usecols=cols) df_it = df_emps.loc[(df_emps.Dept == 'IT') & (df_emps.Country == 'United States')] book = load_workbook('kp_data.xlsx') writer = pd.ExcelWriter('kp_data.xlsx', engine='openpyxl') writer.book = book df_it.to_excel(writer, sheet_name = 'IT - US') writer.save() writer.close()
The first line in the above code snippet imports the Pandas library. This allows access to and manipulation of the XLSX file. Just so you know, the openpyxl
library must be installed before continuing.
The following line imports openpyxl
. This is required, in this case, to save the filtered results to a new worksheet in the same Excel file.
The following line defines the four (4) columns to retrieve from the XLSX file and saves them to the variable cols
as a List
.
π‘Note: Open the Excel file and review the data to follow along.
Import the Excel File to Python
On the next line in the code snippet, read_excel()
is called and passed three (3) arguments:
- The name of the Excel file to import (
kp_data.xlsx
). - The worksheet name. The first worksheet in the Excel file is always read unless stated otherwise. For this example, our Excel file contains two (2) worksheets:
Employees
andSales
. The Employees worksheet can be referenced usingsheet_name=0
orsheet_name='Employees'
. Both produce the same result. - The columns to retrieve from the Excel worksheet (
usecols=cols
).
The results save to df_it
.
Filter the DataFrame
The highlighted line applies a filter using loc[]
and passes the filter to return specific rows from the DataFrame.
In Python, this filter:
df_it = df_emps.loc[(df_emps.Dept == 'IT') & (df_emps.Country == 'United States')]
Equates to this:
π Give me the DataFrame rows for all employees who work in the IT Department, and live in the United States.
Saves Results to Worksheet in Same Excel File
In the bottom highlighted section of the above code, the Excel file is re-opened using load_workbook()
. Then, a writer object is declared, the results filtered, and written to a new worksheet, called IT - US
and the file is saved and closed.
Method 3: Use read_excel() and iloc[]
This method uses the read_excel()
function to read an XLSX file into a DataFrame and
to filter the results. The iloc[]
function accesses either a group of rows or columns based on their location (integer value).iloc[]
This example imports required Pandas
library and the above-noted Excel file into a DataFrame. The Sales
worksheet is then accessed.
This worksheet contains the yearly sale totals for K-Paddles paddles. These results are filtered to the first six (6) rows in the DataFrame and columns shown below.
import pandas as pd cols = ['Month', 'Aspire', 'Adventurer', 'Maximizer'] df_sales = pd.read_excel('kp_data.xlsx', sheet_name='Sales', usecols=cols) df_aspire = df_sales.iloc[0:6] print(df_aspire)
The results are output to the terminal.
Month | Aspire | Adventurer | Maximizer | |
0 | 1 | 2500 | 5200 | 21100 |
1 | 2 | 2630 | 5100 | 18330 |
2 | 3 | 2140 | 4550 | 22470 |
3 | 4 | 3400 | 5870 | 22270 |
4 | 5 | 3600 | 4560 | 20960 |
5 | 6 | 2760 | 4890 | 20140 |
Method 4: Use read_excel(), index[] and loc[]
This method uses the read_excel()
function to read an XLSX file into a DataFrame in conjunction with index[]
and loc[]
to filter the results. The loc[]
function can access either a group of rows or columns based on their label names.
This example imports the required Pandas
library and the above-noted Excel file into a DataFrame. The Sales
worksheet is then accessed. This worksheet contains the yearly sale totals for K-Paddles paddles.
These results are filtered to view the results for the Pinnacle paddle using index[]
and passing it a start and stop position (stop-1).
import pandas as pd cols = ['Month', 'Pinnacle'] df_pinnacle = pd.read_excel('kp_data.xlsx', sheet_name='Sales', usecols=cols) print(df_pinnacle.loc[df_pinnacle.index[0:5], ['Month', 'Pinnacle']])
The results are output to the terminal.
Month | Pinnacle | |
0 | 1 | 1500 |
1 | 2 | 1200 |
2 | 3 | 1340 |
3 | 4 | 1130 |
4 | 5 | 1740 |
Method 5: Use read_excel() and isin()
This method uses the read_excel()
function to read an XLSX file into a DataFrame using isin()
to filter the results. The isin()
function filters the results down to the records that match the criteria passed as an argument.
This example imports required Pandas
library and the above-noted Excel file into a DataFrame. The Employees
worksheet is then accessed.
These results are filtered to view the results for all employees who reside in Chicago.
import pandas as pd cols = ['First', 'Last', 'City'] df_emps = pd.read_excel('kp_data.xlsx', sheet_name='Employees', usecols=cols) print(df_emps[df_emps.City.isin(['Chicago'])])
The results are output to the terminal.
First | Last | City | |
2 | Luna | Sanders | Chicago |
3 | Penelope | Jordan | Chicago |
9 | Madeline | Walker | Chicago |
34 | Caroline | Jenkins | Chicago |
Summary
This article has provided five (5) ways to filter data from an Excel file using Python to select the best fit for your coding requirements.
Good Luck & Happy Coding!