In this tutorial, we will learn how to read HTML tables using the read_html() function and how to turn these tables into Pandas data frames to analyze them. Furthermore, we will see how to render Pandas data frames as HTML tables using the to_html() function.
Reading in HTML tables using the read_html() function
For this tutorial, we will use this Wikipedia page about Europe. It contains a lot of information about the history and current situation of the continent. To get an overview of all the parameters of read_html(), check out the official documentation. So, let's get started with the actual coding:
import pandas as pd

url = "https://en.wikipedia.org/wiki/Europe"
tables = pd.read_html(url)
print(type(tables))
# <class 'list'>
First, we import the Pandas library. Then, we create the variable "url" and assign it the URL of the Wikipedia page as a string. After that, we use the read_html() function for the first time: we pass the "url" variable to read_html() and assign the result to a new variable called "tables". Finally, we output the type of "tables". As we can see, the type is a list. So, as we use it here, read_html() reads in all the tables it can find on the web page and returns them as a list of data frames.
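To try this without a network connection, here is a minimal sketch that feeds read_html() a small hand-written HTML snippet instead of the Wikipedia URL. The snippet and its values are made up for illustration, and the sketch assumes that an HTML parser such as lxml is installed, since read_html() depends on one.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for a web page that contains two <table> elements.
html = """
<table>
  <tr><th>city</th><th>country</th></tr>
  <tr><td>Berlin</td><td>Germany</td></tr>
  <tr><td>Paris</td><td>France</td></tr>
</table>
<p>Some text between the tables.</p>
<table>
  <tr><th>river</th><th>length_km</th></tr>
  <tr><td>Danube</td><td>2850</td></tr>
</table>
"""

# Wrapping the string in StringIO makes it a file-like object for read_html().
tables = pd.read_html(StringIO(html))

print(type(tables))  # <class 'list'>
print(len(tables))   # 2

# Every entry of the list is a data frame built from one <table>.
for table in tables:
    print(table)
```

The same call works with a URL or a local file path in place of the StringIO object.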
Let's see how many tables there are:
print(len(tables)) # 44
We determine the length of the list of tables with the len() function. There are 44 tables in total.
Now, if we wanted to get a specific table, we could run:
print(tables[4])
This is the resulting output:
 | Flag | Symbol | Name | Sovereignstate | Area(km2) | Population | Populationdensity(per km2) | Capital |
0 | NaN | NaN | Sovereign Base Areas of Akrotiri and Dhekelia | UK | 254.0 | 15700 | 59.100 | Episkopi Cantonment |
1 | NaN | NaN | Åland | Finland | 1580.0 | 29489 | 18.360 | Mariehamn |
2 | NaN | NaN | Bailiwick of Guernsey [c] | UK | 78.0 | 65849 | 844.000 | St. Peter Port |
3 | NaN | NaN | Bailiwick of Jersey [c] | UK | 118.2 | 100080 | 819.000 | Saint Helier |
4 | NaN | NaN | Faroe Islands | Denmark | 1399.0 | 50778 | 35.200 | Tórshavn |
5 | NaN | NaN | Gibraltar | UK | 6.7 | 32194 | 4328.000 | Gibraltar |
6 | NaN | NaN | Greenland | Denmark [r] | 2166086.0 | 55877 | 0.028 | Nuuk |
7 | NaN | NaN | Isle of Man [c] | UK | 572.0 | 83314 | 148.000 | Douglas |
8 | NaN | NaN | Svalbard | Norway | 61022.0 | 2667 | 0.044 | Longyearbyen |
This way, we get the fifth table from the list.
Great, so we have learned a way to access a specific table from the list. However, this method is not really practical, since a list position tells us nothing about what the table contains, and the position can change whenever the page is edited. Luckily, the read_html() function provides us with useful parameters to specify which table we want to access.
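If we do have to browse the list by position, a quick overview of every table helps. The following is only a sketch: summarize_tables() is a hypothetical helper, not part of Pandas, and the two small data frames are hand-made stand-ins for the list that read_html() returns.

```python
import pandas as pd


def summarize_tables(tables):
    """Return one (position, shape, column names) tuple per data frame."""
    return [(i, df.shape, list(df.columns)) for i, df in enumerate(tables)]


# Hand-made stand-ins for the list returned by read_html().
tables = [
    pd.DataFrame(
        {
            "Name": ["Gibraltar", "Svalbard"],
            "Capital": ["Gibraltar", "Longyearbyen"],
        }
    ),
    pd.DataFrame({"Country": ["Germany"], "Peak Year": [2021]}),
]

for position, shape, columns in summarize_tables(tables):
    print(position, shape, columns)
# 0 (2, 2) ['Name', 'Capital']
# 1 (1, 2) ['Country', 'Peak Year']
```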
Let's say we want to get one particular table from the website, the one that lists the GDP of European countries and has a "Peak Year" column:
Since it is a table, it is contained somewhere in our "tables" list. To get this specific table, we use the "match" parameter. This parameter expects a string or a regular expression, and read_html() returns only the tables that contain matching text. Let's pass the string "Peak Year" to state that we want to access this table:
economy_table = pd.read_html(url, match="Peak Year")
print(economy_table)
The output shows all the tables that contain the string "Peak Year". As we can see, there are two tables inside this list. We can confirm this by running:
print(len(economy_table)) # 2
So, we need to be more specific in our "match" parameter:
economy_table = pd.read_html(url, match="nominal, Peak Year")
print(economy_table)
Here, we only get one table as output, which we can confirm again:
print(len(economy_table)) # 1
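The narrowing effect of "match" can be reproduced offline. The sketch below uses a hand-written HTML snippet with three tables; the column names imitate the ones above, and the numbers in the second table are made up. It assumes an HTML parser such as lxml is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with three tables; two of them mention "Peak Year".
html = """
<table>
  <tr><th>Country</th><th>GDP (nominal, Peak Year)</th><th>Peak Year</th></tr>
  <tr><td>Germany</td><td>4230172</td><td>2021</td></tr>
</table>
<table>
  <tr><th>Country</th><th>GDP (PPP, Peak Year)</th><th>Peak Year</th></tr>
  <tr><td>Germany</td><td>1000000</td><td>2021</td></tr>
</table>
<table>
  <tr><th>Name</th><th>Capital</th></tr>
  <tr><td>Gibraltar</td><td>Gibraltar</td></tr>
</table>
"""

# A broad pattern keeps every table whose text contains "Peak Year".
broad = pd.read_html(StringIO(html), match="Peak Year")
print(len(broad))   # 2

# A more specific pattern keeps only the nominal GDP table.
narrow = pd.read_html(StringIO(html), match="nominal, Peak Year")
print(len(narrow))  # 1
print(narrow[0])
```

Note that the result is still a list, even when only one table matches.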
There are several more parameters to apply. We will have a look at the most important ones. Let's say we want to convert the integer values in the column "GDP (nominal, Peak Year)millions of USD" to float values. Additionally, we may also want to set the "Rank" column as the index column:
economy_table = pd.read_html(
    url,
    match="nominal, Peak Year",
    converters={"GDP (nominal, Peak Year)millions of USD": float},
    index_col=0,
)
Again, we used the "match" parameter like before. In addition to that, we applied the "converters" parameter and passed a dictionary with the column name as the key and the conversion function (here, float) as the value. The function is applied to every cell of that column. We also applied the "index_col" parameter and set it to 0 to state that we want to use the first column (the "Rank" column) as the index. The output shows the transformed table.
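Here is a small offline sketch of the same two parameters. The HTML snippet is hand-written, and the short column name "GDP" is a simplification of the real header; the values are taken from the table in this article. As before, an HTML parser such as lxml is assumed to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical, simplified version of the GDP table.
html = """
<table>
  <tr><th>Rank</th><th>Country</th><th>GDP</th><th>Peak Year</th></tr>
  <tr><td>1</td><td>Germany</td><td>4230172</td><td>2021</td></tr>
  <tr><td>2</td><td>United Kingdom</td><td>3108416</td><td>2021</td></tr>
</table>
"""

tables = pd.read_html(
    StringIO(html),
    converters={"GDP": float},  # float() is called on every cell of "GDP"
    index_col=0,                # use the first column ("Rank") as the index
)
df = tables[0]

print(df)
print(df.dtypes)
```

Any callable that takes one cell value can serve as a converter, for example a small function that strips a unit before converting to a number.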
Converting the Tables into Pandas DataFrames
After we have read the HTML tables, the next step is to get them out of the list as Pandas data frames so that we can analyze the data. The "economy_table" that we created above is of type "list" and contains only one entry:
type(economy_table)
# <class 'list'>
len(economy_table)
# 1
That list entry is already a Pandas data frame, so we only need to pull it out of the list. This is how we do it:
economy_df = economy_table[0]
Rank | Country | GDP (nominal, Peak Year)millions of USD | Peak Year |
– | European Union | 19226235.0 | 2008 |
1 | Germany | 4230172.0 | 2021 |
2 | United Kingdom | 3108416.0 | 2021 |
3 | France | 2940428.0 | 2021 |
4 | Italy | 2408392.0 | 2008 |
5 | Russia | 2288428.0 | 2013 |
6 | Spain | 1631685.0 | 2008 |
7 | Netherlands | 1007562.0 | 2021 |
8 | Turkey | 957504.0 | 2013 |
9 | Switzerland | 810830.0 | 2021 |
10 | Poland | 655332.0 | 2021 |
We create a new variable "economy_df" and assign it the first entry of the "economy_table" list. The result is indeed a Pandas data frame, which we can verify like this:
isinstance(economy_df, pd.DataFrame) # True
So, this is how we get data frames out of the tables. We can also check the data type of each column to see whether the conversion of the "GDP" column to float worked:
economy_df.dtypes
Country | object |
GDP (nominal, Peak Year)millions of USD | float64 |
Peak Year | int64 |
dtype: object
As we can see, the data type of the "GDP" column is indeed "float64".
Now that we have the table as a Pandas data frame, we can apply all the usual Pandas tools to it, such as filtering, sorting, and aggregating.
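As a brief illustration of what such an analysis might look like, the sketch below rebuilds the first four country rows of the table by hand (so that it runs offline) and applies a filter, a derived column, and a sort. The new column name "GDP (trillions of USD)" is introduced here only for the example.

```python
import pandas as pd

# Four rows copied from the table above, rebuilt by hand for an offline example.
gdp = "GDP (nominal, Peak Year)millions of USD"
economy_df = pd.DataFrame(
    {
        "Country": ["Germany", "United Kingdom", "France", "Italy"],
        gdp: [4230172.0, 3108416.0, 2940428.0, 2408392.0],
        "Peak Year": [2021, 2021, 2021, 2008],
    },
    index=pd.Index([1, 2, 3, 4], name="Rank"),
)

# Filter: countries whose peak year was 2021.
peaked_2021 = economy_df[economy_df["Peak Year"] == 2021]
print(peaked_2021["Country"].tolist())
# ['Germany', 'United Kingdom', 'France']

# Derived column: GDP in trillions instead of millions of USD.
economy_df["GDP (trillions of USD)"] = economy_df[gdp] / 1_000_000

# Sort: smallest GDP first.
print(economy_df.sort_values(gdp).head(2))
```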
Writing DataFrames to HTML tables
Now that we have seen how to read in HTML tables and how to get them as data frames, the next step is to write data frames to HTML tables using the to_html() function. We will use a new data frame for this:
data = {
    "speed": [7, 5, 8],
    "height": [1.0, 0.3, 0.1],
    "length": [1.2, 0.4, 0.2],
}
df = pd.DataFrame(data, index=["dog", "cat", "fish"])
This is the newly-created DataFrame:
 | speed | height | length |
dog | 7 | 1.0 | 1.2 |
cat | 5 | 0.3 | 0.4 |
fish | 8 | 0.1 | 0.2 |
Here, we have an example dataset with a "speed", a "height", and a "length" column. We create a Pandas data frame called "df" from this data and assign it the index labels "dog", "cat", and "fish". The output shows an ordinary Pandas data frame.
Next, we apply the to_html() function:
html_table = df.to_html()
print(html_table)
Here's the output HTML table:
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>speed</th>
      <th>height</th>
      <th>length</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>dog</th>
      <td>7</td>
      <td>1.0</td>
      <td>1.2</td>
    </tr>
    <tr>
      <th>cat</th>
      <td>5</td>
      <td>0.3</td>
      <td>0.4</td>
    </tr>
    <tr>
      <th>fish</th>
      <td>8</td>
      <td>0.1</td>
      <td>0.2</td>
    </tr>
  </tbody>
</table>
We render "df" as an HTML table using to_html() and assign the resulting string to the new variable "html_table". We use print() for the output because otherwise, in an interactive session, the string would be displayed with its line breaks shown as literal \n escape sequences, which is hard to read. The output shows a classic HTML table.
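The following sketch makes the point about print() visible: to_html() returns an ordinary Python string, and its repr() shows the newline characters that print() turns into real line breaks.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "speed": [7, 5, 8],
        "height": [1.0, 0.3, 0.1],
        "length": [1.2, 0.4, 0.2],
    },
    index=["dog", "cat", "fish"],
)

html_table = df.to_html()

print(type(html_table))  # <class 'str'>

# repr() shows the raw string, including the "\n" escape sequences.
print(repr(html_table[:60]))

# Splitting on line breaks gives one entry per line of markup.
lines = html_table.splitlines()
print(lines[0])  # <table border="1" class="dataframe">
```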
In addition to that, we can write this HTML table to a file:
html_file = open("index.html", "w")
html_file.write(html_table)
html_file.close()
This way, we create an HTML file called "index.html". It is stored in the current working directory, which is usually the folder of the Python file we are working with. When we go into that folder and open the HTML file with a browser, we see the data frame rendered as a table.
However, the approach with the explicit open(), write() and close() calls is a bit wordy, and the file stays open if write() raises an exception. Luckily, Python provides us with a nice alternative that makes our code much cleaner:
with open("index.html", "w") as file:
    file.write(html_table)
Here, we use the "with" statement, which opens the file through a context manager and closes it automatically when the block ends, even if an exception is raised inside the block. It does the same as the example above, but it needs less code and is easier to read.
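The sketch below checks that the file really is closed after the "with" block, and it also shows an alternative: to_html() accepts a file path as its first argument ("buf") and then writes the file itself. The temporary folder and the file name "index_direct.html" are choices made for this example so that nothing in the working directory gets overwritten.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame(
    {
        "speed": [7, 5, 8],
        "height": [1.0, 0.3, 0.1],
        "length": [1.2, 0.4, 0.2],
    },
    index=["dog", "cat", "fish"],
)
html_table = df.to_html()

# Write into a temporary folder instead of the current working directory.
out_dir = Path(tempfile.mkdtemp())
path = out_dir / "index.html"

with open(path, "w", encoding="utf-8") as file:
    file.write(html_table)

# The context manager has closed the file for us.
print(file.closed)  # True

# Alternative: let to_html() write the file directly.
direct_path = out_dir / "index_direct.html"
df.to_html(direct_path)

print(path.read_text(encoding="utf-8") == html_table)
print(direct_path.read_text(encoding="utf-8") == html_table)
```

Passing the path to to_html() saves a few lines, while the explicit "with" block is handy when the table is only one part of a larger page that we assemble ourselves.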
Styling the HTML Tables
The to_html() function provides us with some optional parameters that we can apply to add some styling to our HTML tables. For example, we can use the "justify" parameter to set the alignment of the column labels:
html_table = df.to_html(justify="center")
print(html_table)
The output HTML:
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: center;">
      <th></th>
      <th>speed</th>
      <th>height</th>
      <th>length</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>dog</th>
      <td>7</td>
      <td>1.0</td>
      <td>1.2</td>
    </tr>
    <tr>
      <th>cat</th>
      <td>5</td>
      <td>0.3</td>
      <td>0.4</td>
    </tr>
    <tr>
      <th>fish</th>
      <td>8</td>
      <td>0.1</td>
      <td>0.2</td>
    </tr>
  </tbody>
</table>
If we compare this HTML table to the one above, we see that "text-align" in the style attribute of the <tr> tag now says "center" instead of "right", which is the default value.
We can also change the default border size of 1 to another value by applying the "border" parameter:
html_table = df.to_html(justify="center", border=4)
print(html_table)
This is the output:
<table border="4" class="dataframe">
  <thead>
    <tr style="text-align: center;">
      <th></th>
      <th>speed</th>
      <th>height</th>
      <th>length</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>dog</th>
      <td>7</td>
      <td>1.0</td>
      <td>1.2</td>
    </tr>
    <tr>
      <th>cat</th>
      <td>5</td>
      <td>0.3</td>
      <td>0.4</td>
    </tr>
    <tr>
      <th>fish</th>
      <td>8</td>
      <td>0.1</td>
      <td>0.2</td>
    </tr>
  </tbody>
</table>
Now, the border attribute of the <table> tag says 4 instead of 1.
If we want to target the table with a CSS id selector, we can set its id attribute directly inside the to_html() function using the "table_id" parameter:
html_table = df.to_html(justify="center", border=4, table_id="animal-table")
print(html_table)
This is the resulting table (note that the id uses a hyphen, because an HTML id must not contain spaces):
<table border="4" class="dataframe" id="animal-table">
  <thead>
    <tr style="text-align: center;">
      <th></th>
      <th>speed</th>
      <th>height</th>
      <th>length</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>dog</th>
      <td>7</td>
      <td>1.0</td>
      <td>1.2</td>
    </tr>
    <tr>
      <th>cat</th>
      <td>5</td>
      <td>0.3</td>
      <td>0.4</td>
    </tr>
    <tr>
      <th>fish</th>
      <td>8</td>
      <td>0.1</td>
      <td>0.2</td>
    </tr>
  </tbody>
</table>
The opening <table> tag now carries an id attribute, which it did not have before.
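A few more to_html() parameters are useful for styling. The sketch below combines "classes" (extra CSS classes on the <table> tag), "index" (whether to render the row labels), and "float_format" (how to format float values). The class name "animals" and the CSS rule are made up for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "speed": [7, 5, 8],
        "height": [1.0, 0.3, 0.1],
        "length": [1.2, 0.4, 0.2],
    },
    index=["dog", "cat", "fish"],
)

html_table = df.to_html(
    classes="animals",             # added next to the default "dataframe" class
    index=False,                   # leave out the dog/cat/fish row labels
    float_format="{:.2f}".format,  # render floats with two decimal places
)

# Hypothetical CSS rule that targets the table through its class.
page = f"""<html>
<head>
<style>
table.animals td {{ padding: 4px 12px; }}
</style>
</head>
<body>
{html_table}
</body>
</html>"""

print(page)
```

Keeping the styling in a CSS rule rather than in attributes such as "border" makes it easier to change the look later without touching the Python code.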
Summary
All in all, Pandas provides us with some useful tools for working with HTML tables. We can easily read in HTML tables directly from websites with the read_html() function and get data frames from these tables. We can also render our data frames as HTML tables, apply several styling options to these tables, and save them as HTML files. These skills are essential when working with web data.
Happy Coding!