Reading and Writing HTML with Pandas

In this tutorial, we will learn how to read in HTML tables using the read_html() function and how to turn these tables into Pandas data frames to analyze them. Furthermore, we will see how to render Pandas data frames as HTML tables applying the to_html() function.

Reading in HTML tables using the read_html() function

For this tutorial, we will use this Wikipedia page about Europe. It contains a lot of information about the history and current situation of the continent Europe. To get an overview of all the parameters, check out the official documentation. So, let’s get started with the actual coding:

import pandas as pd

url = "https://en.wikipedia.org/wiki/Europe"
tables = pd.read_html(url)

print(type(tables))
# <class 'list'> 

First, we import the Pandas library. Then, we create the variable “url” and assign it the URL of the Wikipedia page as a string. Next, we pass “url” to the read_html() function and assign the result to a new variable called “tables”. Finally, we print the type of “tables”, which turns out to be a list. So, used this way, read_html() reads in all the tables it can find on the page and returns them as a list of data frames.
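
The list-returning behavior can also be checked offline. The following is a minimal sketch, assuming a small made-up HTML snippet (the "html" string and its contents are illustrative and not taken from the Wikipedia page). The snippet is wrapped in io.StringIO because recent Pandas versions prefer a file-like object over a raw HTML string. Like the URL example, it needs an HTML parser such as lxml to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page fragment with two separate tables (illustrative data only).
html = """
<table>
  <tr><th>animal</th><th>legs</th></tr>
  <tr><td>dog</td><td>4</td></tr>
  <tr><td>bird</td><td>2</td></tr>
</table>
<table>
  <tr><th>fruit</th><th>color</th></tr>
  <tr><td>apple</td><td>red</td></tr>
</table>
"""

# read_html() returns one DataFrame per <table> element it finds.
tables = pd.read_html(StringIO(html))

print(type(tables))  # a plain Python list
print(len(tables))   # one entry per table in the snippet
print(tables[0])
```

Each entry of the list can then be picked by position, exactly as with the tables from the Wikipedia page.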

Let’s see how many tables there are:

print(len(tables))
# 44

We determine the length of the table list by using the function len(). There are 44 tables in total.

Now, if we wanted to get a specific table, we could run:

print(tables[4])

This is the resulting output:

   Flag  Symbol                                           Name  Sovereign state  Area (km2)  Population  Population density (per km2)              Capital
0   NaN     NaN  Sovereign Base Areas of Akrotiri and Dhekelia               UK       254.0       15700                        59.100  Episkopi Cantonment
1   NaN     NaN                                          Åland          Finland      1580.0       29489                        18.360            Mariehamn
2   NaN     NaN                      Bailiwick of Guernsey [c]               UK        78.0       65849                       844.000       St. Peter Port
3   NaN     NaN                        Bailiwick of Jersey [c]               UK       118.2      100080                       819.000         Saint Helier
4   NaN     NaN                                  Faroe Islands          Denmark      1399.0       50778                        35.200             Tórshavn
5   NaN     NaN                                      Gibraltar               UK         6.7       32194                      4328.000            Gibraltar
6   NaN     NaN                                      Greenland      Denmark [r]   2166086.0       55877                         0.028                 Nuuk
7   NaN     NaN                                Isle of Man [c]               UK       572.0       83314                       148.000              Douglas
8   NaN     NaN                                       Svalbard           Norway     61022.0        2667                         0.044         Longyearbyen

This way, we get the fifth table from the list.

Great, so we have learned a way to access a specific table from the list. However, this method is not really efficient since we do not know what the table contains if we access it by list number. Luckily, the read_html() function provides us with useful parameters to specify which table we want to access.
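
One way to see what each entry holds before picking an index is to loop over the list and print every table's shape and column labels. The sketch below stands in for the real “tables” list with two hand-built data frames that borrow a few values from the outputs in this article. The helper name find_tables_with_column is made up for illustration and is not part of Pandas.

```python
import pandas as pd

# Stand-in for the list that read_html() returns (small illustrative frames).
tables = [
    pd.DataFrame({"Name": ["Gibraltar", "Svalbard"], "Capital": ["Gibraltar", "Longyearbyen"]}),
    pd.DataFrame({"Country": ["Germany", "France"], "Peak Year": [2021, 2021]}),
]

# Print a one-line summary per table to see what each position holds.
for position, table in enumerate(tables):
    print(position, table.shape, list(table.columns))


def find_tables_with_column(frames, column):
    """Return the positions of all data frames that have the given column label."""
    return [i for i, frame in enumerate(frames) if column in frame.columns]


print(find_tables_with_column(tables, "Peak Year"))
```

This only inspects tables that were already downloaded, so it complements the read_html() parameters rather than replacing them.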

Let’s say we want to get the table from the website that ranks the European economies by their nominal GDP.

Since it is a table on the page, it is contained somewhere in our “tables” list. To get this specific table, we use the “match” parameter. This parameter expects a string or a regular expression and keeps only the tables that contain matching text. Let’s pass the string "Peak Year" to state that we want to access this table:

economy_table = pd.read_html(url, match="Peak Year")
# economy_table:

The resulting list holds all the tables that contain the string "Peak Year". As it turns out, there are two such tables, which we can confirm by running:

print(len(economy_table))
# 2

So, we need to be more specific inside our “match” parameter:

economy_table = pd.read_html(url, match="nominal, Peak Year")
# economy_table:

Here, we only get one table as output, which we can confirm again:

print(len(economy_table))
# 1
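
The narrowing effect of “match” can be reproduced without a network connection. This is a minimal sketch, assuming a made-up snippet in which both tables mention "Peak Year" but only one mentions "nominal". The column labels and the placeholder numbers are hypothetical, and an HTML parser such as lxml must be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical snippet with placeholder values; only the header text matters here.
html = """
<table>
  <tr><th>Country</th><th>GDP (nominal, Peak Year)</th></tr>
  <tr><td>A</td><td>10</td></tr>
</table>
<table>
  <tr><th>Country</th><th>GDP (other, Peak Year)</th></tr>
  <tr><td>A</td><td>12</td></tr>
</table>
"""

# A broad pattern keeps every table whose text contains "Peak Year".
both = pd.read_html(StringIO(html), match="Peak Year")
print(len(both))

# A more specific pattern keeps only the table that mentions "nominal".
only_nominal = pd.read_html(StringIO(html), match="nominal, Peak Year")
print(len(only_nominal))
print(only_nominal[0])
```

Even when only one table matches, the result is still a list, so the table itself is at position 0.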

There are several more parameters to apply. We will have a look at the most important ones. Let’s say we want to convert the integer values in the column "GDP (nominal, Peak Year)millions of USD" to float values. Additionally, we may also want to set the “Rank” column as the index column:

economy_table = pd.read_html(url, match="nominal, Peak Year", 
                             converters={"GDP (nominal, Peak Year)millions of USD": float}, 
                             index_col=0)

Again, we used the “match” parameter like before. In addition to that, we applied the “converters” parameter and passed a dictionary with the column name as the key and the function that converts each cell (here, float) as the value. We also set the “index_col” parameter to “0” to state that we want to use the first column (the “Rank” column) as the index. The result is the transformed table.
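
The same two parameters can be tried on a small inline table. The sketch below is an illustration under assumptions: the HTML snippet is made up, the column label is shortened to "GDP", and the two data rows reuse values that appear in this article's output. It needs an HTML parser such as lxml.

```python
from io import StringIO

import pandas as pd

# Hypothetical table; the column label "GDP" is shortened for readability.
html = """
<table>
  <tr><th>Rank</th><th>Country</th><th>GDP</th></tr>
  <tr><td>1</td><td>Germany</td><td>4230172</td></tr>
  <tr><td>2</td><td>United Kingdom</td><td>3108416</td></tr>
</table>
"""

tables = pd.read_html(
    StringIO(html),
    converters={"GDP": float},  # the function is called once per cell of "GDP"
    index_col=0,                # use the first column ("Rank") as the index
)
df = tables[0]

print(df)
print(df.index.name)
print(df["GDP"].tolist())
```

Because a converter receives the raw cell content, any function that accepts a single value can be used in place of float.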

Converting the Tables into Pandas DataFrames

After we have read the HTML tables, the next step is to get them out of the list as Pandas data frames to be able to analyze the data. The “economy_table” that we created above is of type “list” and contains only one entry:

type(economy_table)
# <class 'list'>

len(economy_table)
# 1

Each entry of that list is already a Pandas data frame, so all we need to do is pull it out of the list. This is how we do it:

economy_df = economy_table[0]

             Country  GDP (nominal, Peak Year)millions of USD  Peak Year
Rank
–     European Union                               19226235.0       2008
1            Germany                                4230172.0       2021
2     United Kingdom                                3108416.0       2021
3             France                                2940428.0       2021
4              Italy                                2408392.0       2008
5             Russia                                2288428.0       2013
6              Spain                                1631685.0       2008
7        Netherlands                                1007562.0       2021
8             Turkey                                 957504.0       2013
9        Switzerland                                 810830.0       2021
10            Poland                                 655332.0       2021

We create a new variable “economy_df” and assign it the first entry of the “economy_table” list. The result is indeed a Pandas data frame, which we can verify like this:

isinstance(economy_df, pd.DataFrame)
# True

So, this is how we get data frames from the tables. We can also check the data type of each column to see if the conversion of the “GDP” column to float worked:

economy_df.dtypes
Country                                     object
GDP (nominal, Peak Year)millions of USD    float64
Peak Year                                    int64
dtype: object

As we can see, the data type of the “GDP” column is indeed “float64”.

Now that the table is a Pandas data frame, we can apply everything Pandas offers for data analysis, such as filtering, sorting, and aggregating.
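
As a small illustration of such an analysis, the sketch below rebuilds the first few country rows of the output above by hand so that it runs offline. The shortened column label "GDP" and the derived column "GDP (trillions)" are assumptions made for readability.

```python
import pandas as pd

# A few rows copied from the output above; "GDP" is a shortened label.
economy_df = pd.DataFrame(
    {
        "Country": ["Germany", "United Kingdom", "France", "Italy"],
        "GDP": [4230172.0, 3108416.0, 2940428.0, 2408392.0],
        "Peak Year": [2021, 2021, 2021, 2008],
    }
)

# Keep only the economies whose peak year was 2021.
peaked_2021 = economy_df[economy_df["Peak Year"] == 2021]
print(peaked_2021)

# Express GDP in trillions of USD (the source column is in millions).
economy_df["GDP (trillions)"] = economy_df["GDP"] / 1_000_000
print(economy_df.sort_values("GDP", ascending=False).head(3))
```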

Writing DataFrames to HTML tables

Now that we have seen how to read in HTML tables and how to transform them into data frames, in the next step, we will see how to write data frames to HTML tables using the to_html() function. We will use a new data frame for this approach:

data = {
    "speed": [7,5,8],
    "height": [1.0, 0.3, 0.1],
    "length": [1.2, 0.4, 0.2]
}

df = pd.DataFrame(data, index=["dog", "cat", "fish"])

This is the newly-created DataFrame:

      speed  height  length
dog       7     1.0     1.2
cat       5     0.3     0.4
fish      8     0.1     0.2

Here, we have the example dataset with a “speed”, a “height”, and a “length” column. We create a Pandas data frame called “df” with this data and assign the index labels “dog”, “cat”, and “fish” to it. The output shows a usual Pandas data frame.

Next, we apply the to_html() function:

html_table = df.to_html()
print(html_table)

Here’s the output HTML table:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>speed</th>
      <th>height</th>
      <th>length</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>dog</th>
      <td>7</td>
      <td>1.0</td>
      <td>1.2</td>
    </tr>
    <tr>
      <th>cat</th>
      <td>5</td>
      <td>0.3</td>
      <td>0.4</td>
    </tr>
    <tr>
      <th>fish</th>
      <td>8</td>
      <td>0.1</td>
      <td>0.2</td>
    </tr>
  </tbody>
</table>

We render “df” as an HTML table using to_html() and assign the resulting string to the new variable “html_table”. We use print() for the output because echoing the variable in an interactive session would show the raw string with its line breaks written as “\n”, which is hard to read. The output shows a classic HTML table.
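
The sketch below illustrates that to_html() returns an ordinary string and shows two of its optional parameters, “index” and “columns”, which control what gets rendered. It reuses the example data frame from above. Which columns to keep is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"speed": [7, 5, 8], "height": [1.0, 0.3, 0.1], "length": [1.2, 0.4, 0.2]},
    index=["dog", "cat", "fish"],
)

html_table = df.to_html()
print(type(html_table))       # to_html() returns a plain string
print(repr(html_table[:40]))  # repr() shows the embedded newline characters

# index=False drops the row labels; columns=... limits the rendered columns.
compact = df.to_html(index=False, columns=["speed", "height"])
print(compact)
```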

In addition to that, we can write this HTML table to a file:

html_file = open("index.html", "w")
html_file.write(html_table)
html_file.close()

This way, we create an HTML file called “index.html”. Because we passed a relative file name, it is stored in the current working directory, which is typically the folder of the Python file we are working with. When we go into the folder and open the HTML file with a browser, we see the rendered table.

However, the approach we used with the “open”, “write” and “close” statements is a bit wordy and not clean. Luckily, Python provides us with a nice alternative that makes our code much cleaner:

with open("index.html", "w") as file:
    file.write(html_table)

Here, we use the “with” statement, which opens the file as a context manager and closes it automatically when the block ends, even if an error occurs inside the block. It does the same as the example above, but it needs less code and is easier to read.
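
The following sketch checks that the file is really closed after the “with” block and shows a shortcut: to_html() can write directly to a file when a path is passed as its first argument (“buf”). The temporary folder and the file name "direct.html" are assumptions chosen so the example does not touch real files.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame(
    {"speed": [7, 5, 8], "height": [1.0, 0.3, 0.1], "length": [1.2, 0.4, 0.2]},
    index=["dog", "cat", "fish"],
)
html_table = df.to_html()

# A temporary folder keeps the sketch from touching real files.
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "index.html"

    # The with statement closes the file on exit, even if write() raises.
    with open(path, "w", encoding="utf-8") as file:
        file.write(html_table)
    print(file.closed)

    # to_html() can also write straight to a path via its first argument (buf).
    direct_path = Path(folder) / "direct.html"
    df.to_html(direct_path)

    # Both files should hold the same HTML.
    same_content = path.read_text(encoding="utf-8") == direct_path.read_text(encoding="utf-8")
    print(same_content)
```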

Styling the HTML Tables

The to_html() function provides us with some optional parameters that we can apply to add some styling to our HTML tables. For example, we can use the “justify” parameter to justify the column labels:

html_table = df.to_html(justify="center")
print(html_table)

The output HTML:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: center;">
      <th></th>
      <th>speed</th>
      <th>height</th>
      <th>length</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>dog</th>
      <td>7</td>
      <td>1.0</td>
      <td>1.2</td>
    </tr>
    <tr>
      <th>cat</th>
      <td>5</td>
      <td>0.3</td>
      <td>0.4</td>
    </tr>
    <tr>
      <th>fish</th>
      <td>8</td>
      <td>0.1</td>
      <td>0.2</td>
    </tr>
  </tbody>
</table>

If we compare this HTML table to the one above, we see that “text-align” in the style attribute of the header row’s “tr” tag now says “center” instead of “right”, which is the default.

We can also change the default border size of “1” to another value by applying the “border” parameter:

html_table = df.to_html(justify="center", border=4)
print(html_table)

This is the output:

<table border="4" class="dataframe">
  <thead>
    <tr style="text-align: center;">
      <th></th>
      <th>speed</th>
      <th>height</th>
      <th>length</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>dog</th>
      <td>7</td>
      <td>1.0</td>
      <td>1.2</td>
    </tr>
    <tr>
      <th>cat</th>
      <td>5</td>
      <td>0.3</td>
      <td>0.4</td>
    </tr>
    <tr>
      <th>fish</th>
      <td>8</td>
      <td>0.1</td>
      <td>0.2</td>
    </tr>
  </tbody>
</table>

Now, the “border” attribute of the “table” tag says “4” instead of “1”.

If we want to target the table with a CSS id selector, we can set its id directly inside the to_html() function using the “table_id” parameter. An HTML id must not contain spaces, so we separate the words with a hyphen:

html_table = df.to_html(justify="center", border=4, table_id="animal-table")
print(html_table)

This is the resulting table:

<table border="4" class="dataframe" id="animal-table">
  <thead>
    <tr style="text-align: center;">
      <th></th>
      <th>speed</th>
      <th>height</th>
      <th>length</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>dog</th>
      <td>7</td>
      <td>1.0</td>
      <td>1.2</td>
    </tr>
    <tr>
      <th>cat</th>
      <td>5</td>
      <td>0.3</td>
      <td>0.4</td>
    </tr>
    <tr>
      <th>fish</th>
      <td>8</td>
      <td>0.1</td>
      <td>0.2</td>
    </tr>
  </tbody>
</table>

In the first tag, we now have an id attribute which we did not have before.
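
To show how such an id can be used, here is a minimal sketch that embeds the table in a small page with a few CSS rules. The names "animal-table" and "striped" and the CSS rules themselves are made up for illustration. The id is hyphenated because a CSS id selector cannot contain a space. The “classes” parameter of to_html() adds extra CSS classes next to the default “dataframe” class.

```python
import pandas as pd

df = pd.DataFrame(
    {"speed": [7, 5, 8], "height": [1.0, 0.3, 0.1], "length": [1.2, 0.4, 0.2]},
    index=["dog", "cat", "fish"],
)

# "animal-table" and "striped" are made-up names for this sketch.
html_table = df.to_html(table_id="animal-table", classes="striped")

# Minimal page that targets the table through its id and its class.
page = f"""<!DOCTYPE html>
<html>
<head>
<style>
  #animal-table {{ border-collapse: collapse; }}
  #animal-table th, #animal-table td {{ padding: 4px 8px; }}
  .striped tbody tr:nth-child(even) {{ background: #eee; }}
</style>
</head>
<body>
{html_table}
</body>
</html>"""

print(page)
```

Saving “page” to a file instead of “html_table” yields a complete HTML document that a browser renders with the styles applied.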

Summary

All in all, Pandas provides us with some useful tools for working with HTML tables. We can easily read in HTML tables directly from websites with the read_html() function and get data frames from these tables. Also, we can render our data frames as HTML tables, apply several stylings to these tables, and save them as HTML files. These skills are essential, especially when working with web data.

Happy Coding!