Microsoft Excel is a cross-platform and battle-tested spreadsheet software widely used for data analysis and visualization. It is a powerful and user-friendly tool indeed! But how can we bring it to the next level?Β
We can combine Excel with Python to read data from Excel, save data to Excel with formats, formulas, and even charts, and automate Excel tasks in Python! Please continue reading and stay tuned for my Excel in Python series if it sounds great to you!
- This tutorial will first introduce Excelβs data structure and lingos.
- Then, you will learn the difference between four popular Python ways to read data from excel, including pandas and
openpyxl
. - Finally, I will guide you through three actual use cases in
pandas.read_excel
, ranging from single Excel sheet, multiple Excel sheets, and multiple excel workbooks cases.
You can find all datasets and codes in this tutorial here. The data in our example datasets are not actual data and are only used for educational purposes.
Quick Introduction to Excel
Before jumping into reading data from Excel, letβs look at how data are stored in Excel and get ourselves familiar with some Excel lingos.
Simply put, data are stored in cells in Excel, and each cell can be identified with its unique row and column number pair.Β
Columns in Excel are labeled in alphabets, starting from βAβ
, and rows in Excel are labeled in roman numbers, starting from β1β
. For example, in the following Excel picture, A1
is the cell in the intersection of the first column and first row, which is βIDβ
.
Yes, you got the idea and it is just like the DataFrame data structure in our friends, pandas.
On top of that, other commonly seen Excel lingos include worksheets and workbooks.
- A worksheet means a single spreadsheet in an Excel file.
- A workbook means a single Excel file with extensions like
.xlsx
and.xls
.
Tip: More file extensions supported by Excel can be found in the Microsoft official doc here.
Now, you are ready to know how to read data from Excel!
Brief Comparison on popular Python ways to Read Data from Excel
Here is the summary of popular Python ways to read data from excel:
One of pandas.read_excel()
βs advantages is engine compatibility. It uses multiple external engines for different Excel extensions, including openpyxl
, xlrd
, odf
, and pyxlsb
.
- β
xlrd
β supports old-style Excel files (.xls
). - β
openpyxl
β supports newer Excel 2010 file formats (.xlsx, .xlsm, .xltx, .xltm
). - β
odf
β supports OpenDocument file formats (.odf, .ods, .odt
). - β
pyxlsb
β supports Binary Excel 2007-2010 files (.xlsb
).
In summary, you can use openpyxl
and other libraries for specific Excel file formats and basic data processing. And please remember openpyxl
supports newer Excel file formats out of all.
However, I recommend using pandas.read_excel()
for data science and analytics application because it supports most Excel file formats and returns a DataFrame object with powerful methods, which is efficient for subsequent data processing.
Therefore, letβs see how to use pandas.read_excel()
to read data from Excel in three use cases.
Reading Data from Excel – pandas.read_excel()
Since pandas.read_excel()
is the most powerful and inclusive way to read data from Excel files with different extensions, I will first walk you through three use cases with real Python codes below using pandas.read_excel()
.
Getting started
To use pandas, you need to first install this package in your command line like:
$ pip install pandas
Tip: you might need to use pip3
instead of pip
, depending on your environment.
Given the engine compatibility mentioned above, you also need to install respective engine libraries. For example, to use openpyxl
, you need to install this package in your command line like:
$ pip install openpyxl
Tip: you might need to use pip3
instead of pip
, depending on your environment.
Dataset
In our examples, we will use two Excel workbooks (.xlsx
), learn_excel_12
and learn_excel_34
.
These two workbooks have same data structures and column names in each worksheet.
For example, following is the data in the Class_1
worksheet in the learn_excel_12
workbook.
One row in each sheet represents a student in each class and the columns represents the studentβs ID, Major, and Score respectively.
Concretely, learn_excel_12
has two worksheets, Class 1 and Class 2. And learn_excel_34
has two worksheets, Class 3 and Class 4.
You can find all datasets and codes in this tutorial here. The data in our example datasets are not actual data and are only used for educational purposes.
Single Excel Worksheet Case
So, how can we read data from single excel sheet? We can use the sheet_name
parameter in pandas.read_excel()
to specify a single sheet in two ways, shown in the following code:
import pandas as pd excel_wb12_filepath = 'learn_excel_12.xlsx' # To get a single worksheet in a workbook, we can # pass the excel workbook filepath and the sheet_name respectively. class1 = pd.read_excel(excel_wb12_filepath, sheet_name='Class 1') # The sheet_name parameter can be a number, starting from 0. # so, to get class 2, we need to pass 1 to the sheet_name parameter. class2 = pd.read_excel(excel_wb12_filepath, sheet_name=1)
The code shows how to access data from the Class 1 worksheet by passing its sheet name to sheet_name
parameter and access data from the Class 2 worksheet by passing a roman number to the sheet_name
parameter in the learn_excel_12
workbook.
After running the code, we can get the output:
Multiple Excel Worksheets Case
Likewise, we can pass a list containing various sheet names or roman numbers to the sheet_name
parameter of pandas.read_excel()
to read data from multiple worksheets. Besides, if we set the sheet_name
parameter to be None
, it will load all sheets.
import pandas as pd excel_wb12_filepath = 'learn_excel_12.xlsx' # To get multiple worksheets in a workbook, we can pass a list # to the sheet_name parameter. # And we can pass either roman number or sheet name. two_sheets = pd.read_excel(excel_wb12_filepath, sheet_name=['Class 1', 1]) # If we want all sheets in a workbook, we can set the sheet_name to be None. all_sheets = pd.read_excel(excel_wb12_filepath, sheet_name=None) print(type(all_sheets)) print(type(all_sheets['Class 1'])) print('-'*85) print(all_sheets['Class 1'].head()) print('-'*85) print(all_sheets['Class 2'].head())
The code shows two ways to access data from Class 1 and Class 2 worksheets.
First, we can pass a list containing their sheet names in either roman number way or sheet name way.
π But be careful! Since the sheet_name
is like accepting the keys in a dictionary, you will need to use the corresponding key to index from two_sheet
to get each worksheet.
Second, we can use the βsheet_name = None
β trick to get all sheets in the learn_excel_12
workbook. And when you want to index from all_sheet
, you have to provide exact sheet name, not roman number, to index the corresponding DataFrame Object.
After running the code, we can get the output:
Multiple Excel Workbooks Case
Finally, to get multiple workbooks, we can just call the pandas.read_excel()
multiple times. π
import pandas as pd excel_wb12_filepath = 'learn_excel_12.xlsx' excel_wb34_filepath = 'learn_excel_34.xlsx' # To get multiple workbooks, we can just call the pandas.read_excel() multiple times:) wb12 = pd.read_excel(excel_wb12_filepath, sheet_name=None) wb34 = pd.read_excel(excel_wb34_filepath, sheet_name=None) # Calculate the average score of each class average_score_1 = wb12['Class 1']['Score'].mean() average_score_2 = wb12['Class 2']['Score'].mean() average_score_3 = wb34['Class 3']['Score'].mean() average_score_4 = wb34['Class 4']['Score'].mean() print(average_score_1, average_score_2, average_score_3, average_score_4, sep="\n")
The code shows how to access data from learn_excel_12
workbook and learn_excel_34_workbook
. On top of that, we use efficient pandas.DataFrame
characteristics to further calculate the average score of each class.
After running the code, we can get the output:
Summary
In this part, I show you how to read data from Excel worksheet(s) and workbooks using pandas.read_excel
.
I would appreciate it if you follow along. And you probably notice that I almost deploy one crucial parameter, sheet_name
, here.
Yes! And the official docs of pandas.read_excel()
offers more advanced manipulation freedom like header, parse_dates
, and dtype
.
You will find that resource to be useful when needed!
Conclusion
Thatβs it for how to read data from Excel. This is the first article in our Excel in Python series. We learned about Excelβs data structure and commonly used lingos and four popular ways to read data from Excel in Python, including pandas and openpyxl. Finally, we looked at three actual use cases in pandas.read_excel, ranging from single Excel sheet, multiple excel sheet, and multiple excel workbooks cases.
I hope you enjoy all this, and stay tuned for our following Excel in Python article on saving data to Excel!
Happy coding!