Reading and Writing HTML with Pandas

In this tutorial, we will learn how to read in HTML tables using the read_html() function and how to turn these tables into Pandas data frames to analyze them. Furthermore, we will see how to render Pandas data frames as HTML tables using the to_html() function.

Reading in HTML Tables Using the read_html() Function

For this tutorial, we will use the Wikipedia page about Europe, which contains a lot of information about the history and current situation of the continent. For an overview of all the parameters of read_html(), check out the official Pandas documentation. Note that read_html() relies on an HTML parser library such as lxml being installed. So, let’s get started with the actual coding:

import pandas as pd

url = "https://en.wikipedia.org/wiki/Europe"
tables = pd.read_html(url)

print(type(tables))
# <class 'list'>

First, we import the Pandas library. Then, we store the URL of the Wikipedia page as a string in the variable “url” and pass it to the read_html() function, assigning the result to a new variable called “tables”. Finally, we output the type of “tables”, which is a list. Used this way, read_html() reads in all the tables it can find on the page and returns them as a list.
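The same behavior can be tried without a network connection. The following is a minimal sketch, assuming an HTML parser such as lxml is installed; the HTML snippet and the name “demo_tables” are made up for illustration and are not taken from the Wikipedia page.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML document with two small tables (illustration only)
html = """
<table>
  <tr><th>Animal</th><th>Legs</th></tr>
  <tr><td>dog</td><td>4</td></tr>
  <tr><td>bird</td><td>2</td></tr>
</table>
<table>
  <tr><th>City</th><th>Country</th></tr>
  <tr><td>Oslo</td><td>Norway</td></tr>
</table>
"""

# read_html() also accepts file-like objects, so we wrap the string in StringIO
demo_tables = pd.read_html(StringIO(html))

print(type(demo_tables))     # the return value is a plain Python list
print(len(demo_tables))      # one entry per <table> element that was found
print(type(demo_tables[0]))  # each entry is a Pandas data frame
```

Wrapping the markup in StringIO keeps the sketch independent of the live website, whose content can change at any time.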

Let’s see how many tables there are:

print(len(tables))
# 44

We determine the length of the table list by using the function len(). There are 44 tables in total at the time of writing; the number may change as the Wikipedia page is edited.

Now, if we wanted to get a specific table, we could run:

print(tables[4])

This is the resulting output:

	Flag	Symbol	Name	Sovereignstate	Area(km2)	Population	Populationdensity(per km2)	Capital
0	NaN	NaN	Sovereign Base Areas of Akrotiri and Dhekelia	UK	254.0	15700	59.100	Episkopi Cantonment
1	NaN	NaN	Åland	Finland	1580.0	29489	18.360	Mariehamn
2	NaN	NaN	Bailiwick of Guernsey [c]	UK	78.0	65849	844.000	St. Peter Port
3	NaN	NaN	Bailiwick of Jersey [c]	UK	118.2	100080	819.000	Saint Helier
4	NaN	NaN	Faroe Islands	Denmark	1399.0	50778	35.200	Tórshavn
5	NaN	NaN	Gibraltar	UK	6.7	32194	4328.000	Gibraltar
6	NaN	NaN	Greenland	Denmark [r]	2166086.0	55877	0.028	Nuuk
7	NaN	NaN	Isle of Man [c]	UK	572.0	83314	148.000	Douglas
8	NaN	NaN	Svalbard	Norway	61022.0	2667	0.044	Longyearbyen

This way, we get the fifth table from the list, since list indexes start at 0.

Great, so we have learned a way to access a specific table from the list. However, this method is not very practical, since the list position tells us nothing about what a table contains. Luckily, the read_html() function provides us with useful parameters to specify which table we want to access.

Let’s say we want to get the table from the website that ranks the European countries by their GDP (nominal, peak year).

Since it is a table, it is contained somewhere in our “tables” list. To get this specific table, we use the “match” parameter. This parameter expects a string or regular expression as input, and only tables containing text that matches it are returned. Let’s put in the string "Peak Year" to state that we want to access this table:

economy_table = pd.read_html(url, match="Peak Year")

The resulting list contains every table whose text matches "Peak Year". In this case, that is two tables, which we can confirm by running:

print(len(economy_table))
# 2

So, we need to be more specific inside our “match” parameter:

economy_table = pd.read_html(url, match="nominal, Peak Year")

Here, we only get one table, which we can confirm again:

print(len(economy_table))
# 1
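To see the effect of “match” in isolation, here is a small offline sketch. It assumes an HTML parser such as lxml is installed; the three tables, their values, and the names “broad” and “narrow” are invented for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with three tables; two of them mention "Peak Year"
html = """
<table>
  <tr><th>Country</th><th>GDP (nominal, Peak Year)</th></tr>
  <tr><td>A</td><td>100</td></tr>
</table>
<table>
  <tr><th>Country</th><th>GDP (PPP, Peak Year)</th></tr>
  <tr><td>A</td><td>150</td></tr>
</table>
<table>
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td>A</td><td>10</td></tr>
</table>
"""

# A broad pattern matches every table that contains the text "Peak Year"
broad = pd.read_html(StringIO(html), match="Peak Year")

# A more specific pattern narrows the result down to a single table
narrow = pd.read_html(StringIO(html), match="nominal, Peak Year")

print(len(broad), len(narrow))
print(narrow[0])
```

Because “match” is interpreted as a regular expression, characters such as parentheses have a special meaning in it, which is one reason why the patterns here stick to plain text.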

There are several more parameters to apply. We will have a look at the most important ones. Let’s say we want to convert the integer values in the column "GDP (nominal, Peak Year)millions of USD" to float values. Additionally, we may also want to set the “Rank” column as the index column:

economy_table = pd.read_html(url, match="nominal, Peak Year", 
                             converters={"GDP (nominal, Peak Year)millions of USD": float}, 
                             index_col=0)

Again, we used the “match” parameter like before. In addition, we applied the “converters” parameter and passed in a dictionary with the column name as the key and the function that converts each value (here, float) as the value. We also set the “index_col” parameter to “0” to state that we want to use the first column (the “Rank” column) as the index. We will look at the transformed table in the next section.
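The two parameters can also be tried on a tiny, self-made table. This is only a sketch under the assumption that an HTML parser such as lxml is installed; the table contents and the name “demo” are hypothetical.

```python
from io import StringIO

import pandas as pd

# Hypothetical table with a "Rank" column that should become the index
html = """
<table>
  <tr><th>Rank</th><th>Country</th><th>GDP</th></tr>
  <tr><td>1</td><td>A</td><td>300</td></tr>
  <tr><td>2</td><td>B</td><td>200</td></tr>
</table>
"""

demo = pd.read_html(
    StringIO(html),
    converters={"GDP": float},  # apply float() to every value of the "GDP" column
    index_col=0,                # use the first column ("Rank") as the index
)[0]

print(demo)
print(demo.dtypes)
print(demo.index.name)
```

Any callable that takes a single cell value works as a converter, so float could be swapped for a custom function if the cells needed cleaning first.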

Converting the Tables into Pandas DataFrames

After we have read the HTML tables, the next step is to turn these tables into Pandas data frames to be able to analyze the data. The “economy_table” that we created above is of type “list” and contains only one entry:

type(economy_table)
# <class 'list'>

len(economy_table)
# 1

This list entry is already a Pandas data frame, so we only need to pull it out of the list. And this is how we do it:

economy_df = economy_table[0]

	Country	GDP (nominal, Peak Year)millions of USD	Peak Year
Rank
–	European Union	19226235.0	2008
1	Germany	4230172.0	2021
2	United Kingdom	3108416.0	2021
3	France	2940428.0	2021
4	Italy	2408392.0	2008
5	Russia	2288428.0	2013
6	Spain	1631685.0	2008
7	Netherlands	1007562.0	2021
8	Turkey	957504.0	2013
9	Switzerland	810830.0	2021
10	Poland	655332.0	2021

We create a new variable “economy_df” and assign it the first entry of the “economy_table” list. The result is indeed a Pandas data frame, which we can verify like this:

isinstance(economy_df, pd.DataFrame)
# True

So, this is how we get data frames from the tables. We can also check the data type of each column to see if the conversion of the “GDP” column to float worked:

economy_df.dtypes

Country	object
GDP (nominal, Peak Year)millions of USD	float64
Peak Year	int64
dtype: object

As we can see, the data type of the “GDP” column is indeed “float64”.

Now that we have the table as a Pandas data frame, we can apply all the data analysis tools that Pandas provides, such as filtering, sorting, and aggregating.
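As a small illustration of such an analysis, the following sketch rebuilds the first four country rows of the table shown above by hand, so that it runs without downloading anything. The variable names “gdp_col”, “peaked_2021”, and “mean_gdp” as well as the added column are our own choices for this example.

```python
import pandas as pd

# The first four country rows, copied from the table shown above
economy_df = pd.DataFrame(
    {
        "Country": ["Germany", "United Kingdom", "France", "Italy"],
        "GDP (nominal, Peak Year)millions of USD": [
            4230172.0, 3108416.0, 2940428.0, 2408392.0,
        ],
        "Peak Year": [2021, 2021, 2021, 2008],
    },
    index=pd.Index([1, 2, 3, 4], name="Rank"),
)

gdp_col = "GDP (nominal, Peak Year)millions of USD"

# Filtering: countries whose GDP peaked in 2021
peaked_2021 = economy_df[economy_df["Peak Year"] == 2021]

# Aggregating: mean peak GDP of the four countries, in millions of USD
mean_gdp = economy_df[gdp_col].mean()

# Deriving a new column: millions of USD converted to trillions of USD
economy_df["GDP (trillions of USD)"] = economy_df[gdp_col] / 1_000_000

print(peaked_2021["Country"].tolist())
print(mean_gdp)
print(economy_df)
```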

Writing DataFrames to HTML Tables

Now that we have seen how to read in HTML tables and how to get data frames from them, we will see how to write data frames to HTML tables using the to_html() function. We will use a new data frame for this:

data = {
    "speed": [7,5,8],
    "height": [1.0, 0.3, 0.1],
    "length": [1.2, 0.4, 0.2]
}

df = pd.DataFrame(data, index=["dog", "cat", "fish"])

This is the newly-created DataFrame:

	speed	height	length
dog	7	1.0	1.2
cat	5	0.3	0.4
fish	8	0.1	0.2

Here, we have the example dataset with a “speed”, a “height”, and a “length” column. We create a Pandas data frame called “df” with this data and assign the index labels “dog”, “cat”, and “fish” to it. The output shows an ordinary Pandas data frame.

Next, we apply the to_html() function:

html_table = df.to_html()
print(html_table)

Here’s the output HTML table:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>speed</th>
      <th>height</th>
      <th>length</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>dog</th>
      <td>7</td>
      <td>1.0</td>
      <td>1.2</td>
    </tr>
    <tr>
      <th>cat</th>
      <td>5</td>
      <td>0.3</td>
      <td>0.4</td>
    </tr>
    <tr>
      <th>fish</th>
      <td>8</td>
      <td>0.1</td>
      <td>0.2</td>
    </tr>
  </tbody>
</table>

We render “df” as an HTML table using to_html() and assign the resulting string to the new variable “html_table”. We use the print() function for the output because otherwise the line breaks would be displayed as “\n” escape sequences and the output would be hard to read. The output shows a classic HTML table.
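The following short sketch shows what to_html() hands back when it is called without a target, and why print() is the friendlier way to look at it. It only rebuilds the example data frame from above.

```python
import pandas as pd

df = pd.DataFrame(
    {"speed": [7, 5, 8], "height": [1.0, 0.3, 0.1], "length": [1.2, 0.4, 0.2]},
    index=["dog", "cat", "fish"],
)

html_table = df.to_html()

print(type(html_table))         # without a target, to_html() returns a string
print(repr(html_table[:60]))    # the raw representation shows the "\n" escapes
print(html_table.count("<tr"))  # counts the header row plus one row per animal
```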

In addition to that, we can write this HTML table to a file:

html_file = open("index.html", "w")
html_file.write(html_table)
html_file.close()

This way, we create an HTML file called “index.html” in the current working directory, which is usually the folder of the Python file we are working with. When we go into the folder and open the HTML file with a browser, it shows the rendered table.

However, the approach we used with the “open”, “write” and “close” statements is a bit wordy and not clean. Luckily, Python provides us with a nice alternative that makes our code much cleaner:

with open("index.html", "w") as file:
    file.write(html_table)

Here, we use the “with” statement, which makes sure that the file is closed automatically when the block ends, even if an exception occurs inside it. It does the same as the example above, but it needs less code and is easier to read.
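As a side note, to_html() can also write the file itself when it receives a path through its “buf” parameter. The sketch below compares both variants; the temporary directory and the file name “index_direct.html” are assumptions made so that the example does not clutter the working folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame(
    {"speed": [7, 5, 8], "height": [1.0, 0.3, 0.1], "length": [1.2, 0.4, 0.2]},
    index=["dog", "cat", "fish"],
)

# A temporary directory keeps the sketch away from the working folder
out_dir = Path(tempfile.mkdtemp())

# Variant 1: the "with" statement closes the file automatically
path_a = out_dir / "index.html"
with open(path_a, "w", encoding="utf-8") as file:
    file.write(df.to_html())
print(file.closed)  # the file object reports whether it has been closed

# Variant 2: to_html() writes the file on its own when given a path
path_b = out_dir / "index_direct.html"
df.to_html(buf=path_b)

# Both files hold the same HTML
same = path_a.read_text(encoding="utf-8") == path_b.read_text(encoding="utf-8")
print(same)
```

The second variant saves a few lines, while the first one is useful when the table is only one part of a larger HTML document.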

Styling the HTML Tables

The to_html() function provides us with some optional parameters that we can apply to add some styling to our HTML tables. For example, we can use the “justify” parameter to justify the column labels:

html_table = df.to_html(justify="center")
print(html_table)

The output HTML:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: center;">
      <th></th>
      <th>speed</th>
      <th>height</th>
      <th>length</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>dog</th>
      <td>7</td>
      <td>1.0</td>
      <td>1.2</td>
    </tr>
    <tr>
      <th>cat</th>
      <td>5</td>
      <td>0.3</td>
      <td>0.4</td>
    </tr>
    <tr>
      <th>fish</th>
      <td>8</td>
      <td>0.1</td>
      <td>0.2</td>
    </tr>
  </tbody>
</table>

If we compare this HTML table to the one above, we see that “text-align” in the “style” attribute of the “tr” tag now says “center” instead of “right”, which is the default value.
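To compare a few alignments side by side, we can loop over several values of “justify” and print only the header row of each result. This is a small sketch on the example data frame; the helper variable “header_rows” is our own.

```python
import pandas as pd

df = pd.DataFrame(
    {"speed": [7, 5, 8], "height": [1.0, 0.3, 0.1], "length": [1.2, 0.4, 0.2]},
    index=["dog", "cat", "fish"],
)

header_rows = {}
for value in ("left", "right", "center"):
    html = df.to_html(justify=value)
    # Keep only the line that carries the text-align style
    header_rows[value] = next(
        line.strip() for line in html.splitlines() if "text-align" in line
    )

for value, row in header_rows.items():
    print(value, "->", row)
```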

We can also change the default border size of “1” to another value by applying the “border” parameter:

html_table = df.to_html(justify="center", border=4)
print(html_table)

This is the output:

<table border="4" class="dataframe">
  <thead>
    <tr style="text-align: center;">
      <th></th>
      <th>speed</th>
      <th>height</th>
      <th>length</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>dog</th>
      <td>7</td>
      <td>1.0</td>
      <td>1.2</td>
    </tr>
    <tr>
      <th>cat</th>
      <td>5</td>
      <td>0.3</td>
      <td>0.4</td>
    </tr>
    <tr>
      <th>fish</th>
      <td>8</td>
      <td>0.1</td>
      <td>0.2</td>
    </tr>
  </tbody>
</table>

Now, the “border” attribute of the “table” tag says “4” instead of “1”.

We can also give the table an id, which CSS id selectors can then target, by using the “table_id” parameter. Since an HTML id must not contain spaces, we use a hyphen in the name:

html_table = df.to_html(justify="center", border=4, table_id="animal-table")
print(html_table)

This is the resulting table:

<table border="4" class="dataframe" id="animal-table">
  <thead>
    <tr style="text-align: center;">
      <th></th>
      <th>speed</th>
      <th>height</th>
      <th>length</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>dog</th>
      <td>7</td>
      <td>1.0</td>
      <td>1.2</td>
    </tr>
    <tr>
      <th>cat</th>
      <td>5</td>
      <td>0.3</td>
      <td>0.4</td>
    </tr>
    <tr>
      <th>fish</th>
      <td>8</td>
      <td>0.1</td>
      <td>0.2</td>
    </tr>
  </tbody>
</table>

In the first tag, we now have an id attribute which we did not have before.
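To show what such an id can be used for, here is a minimal sketch that wraps the table in a complete HTML page with a CSS rule targeting it. The id “animal-table” (written with a hyphen, because an HTML id must not contain spaces), the CSS rules, and the variable names are assumptions chosen for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {"speed": [7, 5, 8], "height": [1.0, 0.3, 0.1], "length": [1.2, 0.4, 0.2]},
    index=["dog", "cat", "fish"],
)

# The id is what the CSS selector "#animal-table" refers to
table_html = df.to_html(justify="center", table_id="animal-table")

# Doubled curly braces are literal braces inside an f-string
page = f"""<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<style>
  #animal-table {{ border-collapse: collapse; }}
  #animal-table th, #animal-table td {{ padding: 4px 8px; }}
</style>
</head>
<body>
{table_html}
</body>
</html>"""

print(page)
```

Writing “page” to a file instead of the bare table gives the browser a full document, so the style rules apply as soon as the file is opened.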

Summary

All in all, Pandas provides us with some useful tools for working with HTML tables. We can easily read in HTML tables directly from websites with the read_html() function and create data frames from these tables. Also, we can render our data frames as HTML tables, apply several stylings to these tables, and save them as HTML files. These skills are essential, especially when working with web data.

Happy Coding!