How to Filter Data from an Excel File in Python with Pandas

4.9/5 - (10 votes)

Problem Formulation and Solution Overview

This article will show different ways to read and filter an Excel file in Python.

To make it more interesting, we have the following scenario:

Sven is a Senior Coder at K-Paddles. K-Paddles manufactures Kayak Paddles made of Kevlar for the White Water Rafting Community. Sven has been asked to read an Excel file and run reports. This Excel file contains two (2) worksheets, Employees and Sales.

To follow along, download the kp_data.xlsx file and place it into the current working directory.


πŸ’¬ Question: How would we write code to filter an Excel file in Python?

We can accomplish this task by one of the following options:


Method 1: Use read_excel() and the & operator

This method uses the read_excel() function to read an XLSX file into a DataFrame and an expression to filter the results.

This example imports the above-noted Excel file into a DataFrame. The Employees worksheet is accessed, and the following filter is applied:

πŸ‘€ Give me the DataFrame rows for all employees who work in the Sales Department, and earn more than $55,000/annum.

Let’s convert this to Python code.

import pandas as pd

cols = ['First', 'Last', 'Dept', 'Salary']
df_emps = pd.read_excel('kp_data.xlsx', sheet_name='Employees', usecols=cols)
df_salary = df_emps[(df_emps['Dept'] == 'Sales') & (df_emps['Salary'] > 55000)] 
df_salary.to_excel('sales_55.xlsx', sheet_name='Sales Salaries Greater Than 55K') 

The first line in the above code snippet imports the Pandas library. This allows access to and manipulation of the XLSX file. Just so you know, the openpyxl library must be installed before continuing.

The following line defines the four (4) columns to retrieve from the XLSX file and saves them to the variable cols as a List.

πŸ’‘Note: Open the Excel file and review the data to follow along.

Import the Excel File to Python

On the next line in the code snippet, read_excel() is called and passed three (3) arguments:

  • The name of the Excel file to import (kp_data.xlsx).
  • The worksheet name. The first worksheet in the Excel file is always read unless stated otherwise. For this example, our Excel file contains two (2) worksheets: Employees and Sales. The Employees worksheet can be referenced using sheet_name=0 or sheet_name='Employees'. Both produce the same result.
  • The columns to retrieve from the Excel workheet (usecols=cols).

The results save to df_emps.

Filter the DataFrame

The highlighted line applies a filter that references the DataFrame columns to base the filter on and the & operator to allow for more than one (1) filter criteria.

In Python, this filter:

df_salary = df_emps[(df_emps['Dept'] == 'Sales') & (df_emps['Salary'] > 55000)]

Equates to this:

πŸ‘€ Give me the DataFrame rows for all employees who work in the Sales Department, and earn more than $55,000/annum.

These results save to sales_55.xlsx with a worksheet ‘Sales Salaries Greater Than 55K‘ and placed into the current working directory.

Contents of Filtered Excel File

pd.read_excel - An Inofficial Guide to Reading Data from Excel

Method 2: Use read_excel() and loc[]

This method uses the read_excel() function to read an XLSX file into a DataFrame and loc[] to filter the results. The loc[] function can access either a group of rows or columns based on their label names.

This example imports the above-noted Excel file into a DataFrame. The Employees worksheet is accessed, and the following filter is applied:

πŸ‘€ Give me the DataFrame rows for all employees who work in the IT Department, and live in the United States.

Let’s convert this to Python code.

import pandas as pd
from openpyxl import load_workbook

cols = ['First', 'Last', 'Dept', 'Country']
df_emps = pd.read_excel('kp_data.xlsx', sheet_name='Employees', usecols=cols)
df_it = df_emps.loc[(df_emps.Dept  == 'IT') & (df_emps.Country == 'United States')]

book = load_workbook('kp_data.xlsx')
writer = pd.ExcelWriter('kp_data.xlsx', engine='openpyxl')
writer.book = book
df_it.to_excel(writer, sheet_name = 'IT - US')
writer.save()
writer.close()

The first line in the above code snippet imports the Pandas library. This allows access to and manipulation of the XLSX file. Just so you know, the openpyxl library must be installed before continuing.

The following line imports openpyxl. This is required, in this case, to save the filtered results to a new worksheet in the same Excel file.

The following line defines the four (4) columns to retrieve from the XLSX file and saves them to the variable cols as a List.

πŸ’‘Note: Open the Excel file and review the data to follow along.

Import the Excel File to Python

On the next line in the code snippet, read_excel() is called and passed three (3) arguments:

  • The name of the Excel file to import (kp_data.xlsx).
  • The worksheet name. The first worksheet in the Excel file is always read unless stated otherwise. For this example, our Excel file contains two (2) worksheets: Employees and Sales. The Employees worksheet can be referenced using sheet_name=0 or sheet_name='Employees'. Both produce the same result.
  • The columns to retrieve from the Excel worksheet (usecols=cols).

The results save to df_it.

Filter the DataFrame

The highlighted line applies a filter using loc[] and passes the filter to return specific rows from the DataFrame.

In Python, this filter:

df_it = df_emps.loc[(df_emps.Dept  == 'IT') & (df_emps.Country == 'United States')]

Equates to this:

πŸ‘€ Give me the DataFrame rows for all employees who work in the IT Department, and live in the United States.

Saves Results to Worksheet in Same Excel File

In the bottom highlighted section of the above code, the Excel file is re-opened using load_workbook(). Then, a writer object is declared, the results filtered, and written to a new worksheet, called IT - US and the file is saved and closed.

Pandas loc() and iloc() - A Simple Guide

Method 3: Use read_excel() and iloc[]

This method uses the read_excel() function to read an XLSX file into a DataFrame and iloc[] to filter the results. The iloc[] function accesses either a group of rows or columns based on their location (integer value).

This example imports required Pandas library and the above-noted Excel file into a DataFrame. The Sales worksheet is then accessed.

This worksheet contains the yearly sale totals for K-Paddles paddles. These results are filtered to the first six (6) rows in the DataFrame and columns shown below.

import pandas as pd 

cols = ['Month', 'Aspire', 'Adventurer', 'Maximizer']
df_sales = pd.read_excel('kp_data.xlsx', sheet_name='Sales', usecols=cols)
df_aspire = df_sales.iloc[0:6]
print(df_aspire)

The results are output to the terminal.

MonthAspireAdventurerMaximizer
012500520021100
122630510018330
232140455022470
343400587022270
453600456020960
562760489020140

Method 4: Use read_excel(), index[] and loc[]

This method uses the read_excel() function to read an XLSX file into a DataFrame in conjunction with index[] and loc[] to filter the results. The loc[] function can access either a group of rows or columns based on their label names.

This example imports the required Pandas library and the above-noted Excel file into a DataFrame. The Sales worksheet is then accessed. This worksheet contains the yearly sale totals for K-Paddles paddles.

These results are filtered to view the results for the Pinnacle paddle using index[] and passing it a start and stop position (stop-1).

import pandas as pd 

cols = ['Month', 'Pinnacle']
df_pinnacle = pd.read_excel('kp_data.xlsx', sheet_name='Sales', usecols=cols)
print(df_pinnacle.loc[df_pinnacle.index[0:5], ['Month', 'Pinnacle']])

The results are output to the terminal.

MonthPinnacle
011500
121200
231340
341130
451740
10 Minutes to Pandas in 5 Minutes (Okay 8)

Method 5: Use read_excel() and isin()

This method uses the read_excel() function to read an XLSX file into a DataFrame using isin() to filter the results. The isin() function filters the results down to the records that match the criteria passed as an argument.

This example imports required Pandas library and the above-noted Excel file into a DataFrame. The Employees worksheet is then accessed.

These results are filtered to view the results for all employees who reside in Chicago.

import pandas as pd

cols = ['First', 'Last', 'City']
df_emps = pd.read_excel('kp_data.xlsx', sheet_name='Employees', usecols=cols)
print(df_emps[df_emps.City.isin(['Chicago'])])

The results are output to the terminal.

FirstLastCity
2LunaSandersChicago
3PenelopeJordanChicago
9MadelineWalkerChicago
34CarolineJenkinsChicago

Summary

This article has provided five (5) ways to filter data from an Excel file using Python to select the best fit for your coding requirements.

Good Luck & Happy Coding!


Programmer Humor – Blockchain

“Blockchains are like grappling hooks, in that it’s extremely cool when you encounter a problem for which they’re the right solution, but it happens way too rarely in real life.” source xkcd