pd.read_excel – An Inofficial Guide to Reading Data from Excel

Rate this post

Microsoft Excel is a cross-platform and battle-tested spreadsheet software widely used for data analysis and visualization. It is a powerful and user-friendly tool indeed! But how can we bring it to the next level? 

We can combine Excel with Python to read data from Excel, save data to Excel with formats, formulas, and even charts, and automate Excel tasks in Python! Please continue reading and stay tuned for my Excel in Python series if it sounds great to you!

  • This tutorial will first introduce Excel’s data structure and lingos.
  • Then, you will learn the difference between four popular Python ways to read data from excel, including pandas and openpyxl.
  • Finally, I will guide you through three actual use cases in pandas.read_excel, ranging from single Excel sheet, multiple Excel sheets, and multiple excel workbooks cases. 

You can find all datasets and codes in this tutorial here. The data in our example datasets are not actual data and are only used for educational purposes.

Quick Introduction to Excel

Before jumping into reading data from Excel, let’s look at how data are stored in Excel and get ourselves familiar with some Excel lingos.

Simply put, data are stored in cells in Excel, and each cell can be identified with its unique row and column number pair. 

Columns in Excel are labeled in alphabets, starting from ‘A’, and rows in Excel are labeled in roman numbers, starting from ‘1’. For example, in the following Excel picture, A1 is the cell in the intersection of the first column and first row, which is ‘ID’.

Yes, you got the idea and it is just like the DataFrame data structure in our friends, pandas.

On top of that, other commonly seen Excel lingos include worksheets and workbooks. 

  • A worksheet means a single spreadsheet in an Excel file. 
  • A workbook means a single Excel file with extensions like .xlsx and .xls

Tip: More file extensions supported by Excel can be found in the Microsoft official doc here.

Now, you are ready to know how to read data from Excel!

Brief Comparison on popular Python ways to Read Data from Excel

Here is the summary of popular Python ways to read data from excel:

One of pandas.read_excel()’s advantages is engine compatibility. It uses multiple external engines for different Excel extensions, including openpyxl, xlrd, odf, and pyxlsb.

  • xlrd” supports old-style Excel files (.xls).
  • openpyxl” supports newer Excel 2010 file formats (.xlsx, .xlsm, .xltx, .xltm).
  • odf” supports OpenDocument file formats (.odf, .ods, .odt).
  • pyxlsb” supports Binary Excel 2007-2010 files (.xlsb).

In summary, you can use openpyxl and other libraries for specific Excel file formats and basic data processing. And please remember openpyxl supports newer Excel file formats out of all. 

However, I recommend using pandas.read_excel() for data science and analytics application because it supports most Excel file formats and returns a DataFrame object with powerful methods, which is efficient for subsequent data processing.

Therefore, let’s see how to use pandas.read_excel() to read data from Excel in three use cases.

Reading Data from Excel – pandas.read_excel()

Since pandas.read_excel() is the most powerful and inclusive way to read data from Excel files with different extensions, I will first walk you through three use cases with real Python codes below using pandas.read_excel().

Getting started

To use pandas, you need to first install this package in your command line like:

$ pip install pandas 

Tip: you might need to use pip3 instead of pip, depending on your environment.

Given the engine compatibility mentioned above, you also need to install respective engine libraries. For example, to use openpyxl, you need to install this package in your command line like:

$ pip install openpyxl

Tip: you might need to use pip3 instead of pip, depending on your environment.

Dataset

In our examples, we will use two Excel workbooks (.xlsx), learn_excel_12 and learn_excel_34.

These two workbooks have same data structures and column names in each worksheet.

For example, following is the data in the Class_1 worksheet in the learn_excel_12 workbook.

One row in each sheet represents a student in each class and the columns represents the student’s ID, Major, and Score respectively. 

Concretely, learn_excel_12 has two worksheets, Class 1 and Class 2. And learn_excel_34 has two worksheets, Class 3 and Class 4.

You can find all datasets and codes in this tutorial here. The data in our example datasets are not actual data and are only used for educational purposes.

Single Excel Worksheet Case

So, how can we read data from single excel sheet? We can use the sheet_name parameter in pandas.read_excel() to specify a single sheet in two ways, shown in the following code:

import pandas as pd

excel_wb12_filepath = 'learn_excel_12.xlsx'

# To get a single worksheet in a workbook, we can
# pass the excel workbook filepath and the sheet_name respectively.
class1 = pd.read_excel(excel_wb12_filepath, sheet_name='Class 1')

# The sheet_name parameter can be a number, starting from 0.
# so, to get class 2, we need to pass 1 to the sheet_name parameter.
class2 = pd.read_excel(excel_wb12_filepath, sheet_name=1)

The code shows how to access data from the Class 1 worksheet by passing its sheet name to sheet_name parameter and access data from the Class 2 worksheet by passing a roman number to the sheet_name parameter in the learn_excel_12 workbook.

After running the code,  we can get the output:

Multiple Excel Worksheets Case

Likewise, we can pass a list containing various sheet names or roman numbers to the sheet_name parameter of pandas.read_excel() to read data from multiple worksheets. Besides, if we set the sheet_name parameter to be None, it will load all sheets. 

import pandas as pd

excel_wb12_filepath = 'learn_excel_12.xlsx'

# To get multiple worksheets in a workbook, we can pass a list
# to the sheet_name parameter.
# And we can pass either roman number or sheet name.
two_sheets = pd.read_excel(excel_wb12_filepath, sheet_name=['Class 1', 1])

# If we want all sheets in a workbook, we can set the sheet_name to be None.
all_sheets = pd.read_excel(excel_wb12_filepath, sheet_name=None)

print(type(all_sheets))
print(type(all_sheets['Class 1']))
print('-'*85)
print(all_sheets['Class 1'].head())
print('-'*85)
print(all_sheets['Class 2'].head())

The code shows two ways to access data from Class 1 and Class 2 worksheets. 

First, we can pass a list containing their sheet names in either roman number way or sheet name way.

🛑 But be careful! Since the sheet_name is like accepting the keys in a dictionary, you will need to use the corresponding key to index from two_sheet to get each worksheet.

Second, we can use the “sheet_name = None” trick to get all sheets in the learn_excel_12 workbook. And when you want to index from all_sheet, you have to provide exact sheet name, not roman number, to index the corresponding DataFrame Object.

After running the code,  we can get the output:

Multiple Excel Workbooks Case

Finally, to get multiple workbooks, we can just call the pandas.read_excel() multiple times. 🙂

import pandas as pd

excel_wb12_filepath = 'learn_excel_12.xlsx'
excel_wb34_filepath = 'learn_excel_34.xlsx'

# To get multiple workbooks, we can just call the pandas.read_excel() multiple times:)
wb12 = pd.read_excel(excel_wb12_filepath, sheet_name=None)
wb34 = pd.read_excel(excel_wb34_filepath, sheet_name=None)

# Calculate the average score of each class
average_score_1 = wb12['Class 1']['Score'].mean()
average_score_2 = wb12['Class 2']['Score'].mean()
average_score_3 = wb34['Class 3']['Score'].mean()
average_score_4 = wb34['Class 4']['Score'].mean()

print(average_score_1, average_score_2,
      average_score_3, average_score_4, sep="\n")

The code shows how to access data from learn_excel_12 workbook and learn_excel_34_workbook. On top of that, we use efficient pandas.DataFrame characteristics to further calculate the average score of each class.

After running the code,  we can get the output:

Summary

In this part, I show you how to read data from Excel worksheet(s) and workbooks using pandas.read_excel.

I would appreciate it if you follow along. And you probably notice that I almost deploy one crucial parameter, sheet_name, here.

Yes! And the official docs of pandas.read_excel() offers more advanced manipulation freedom like header, parse_dates, and dtype.

You will find that resource to be useful when needed!

Conclusion

That’s it for how to read data from Excel. This is the first article in our Excel in Python series. We learned about Excel’s data structure and commonly used lingos and four popular ways to read data from Excel in Python, including pandas and openpyxl. Finally, we looked at three actual use cases in pandas.read_excel, ranging from single Excel sheet, multiple excel sheet, and multiple excel workbooks cases.

I hope you enjoy all this, and stay tuned for our following Excel in Python article on saving data to Excel!

Happy coding!