Pandas Read and Write HTML Files

Over your career as a Data Scientist or a Web Scraper, there may be instances where you need to move data between a DataFrame and HTML format. This article shows you how to do this with the pandas read_html() and to_html() functions.

This article covers the commonly used parameters for each function. For a complete list of all parameters and their use, see the pandas documentation for read_html() and to_html().


Preparation

Before any data manipulation can occur, three (3) libraries require installation.

  • The pandas library enables access to/from a DataFrame.
  • The ipython library enables HTML rendering and styling.
  • The jupyter library is a server-client application that allows editing and running the Notebook in your favorite browser. It can run on your computer (a local environment) or on a remote server.

To install these libraries, navigate to an IDE terminal. At the command prompt, enter each command below and hit the <Enter> key on the keyboard to start the installation process. For the terminal used in this example, the command prompt is a dollar sign ($). Your terminal prompt may be different.

$ pip install pandas

$ pip install ipython

$ pip install jupyter

If the installations were successful, a message displays in the terminal indicating the same.
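
If the terminal messages have scrolled away, the installations can also be checked from Python. The sketch below is one possible check, not part of the original setup: the helper name installed_versions is hypothetical, and it assumes the libraries were installed under the pip names used above.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


# The three pip package names used in this article.
report = installed_versions(["pandas", "ipython", "jupyter"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

A value of "not installed" suggests re-running the matching pip command in the same environment that runs this script.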


Feel free to view the PyCharm installation guide for the required libraries.


Add the following imports to the top of each code snippet. The code in this article relies on them to run.

import pandas as pd
from IPython.display import HTML

Start Jupyter

To start Jupyter Notebook, perform the following steps:

  • Locate the Jupyter executable file (for this example, on your computer). The easiest way to do this is to search for the file jupyter-lab.exe. Please note the path.
  • Navigate to the Windows search box (bottom left of the Desktop).
  • In the search textbox, enter cmd. Select Command Prompt -> Open.
  • A pop-up window appears. Paste the entire path to the file (yours may differ), including 'jupyter-lab.exe', as follows: C://python/scripts/jupyter-lab.exe.
  • Hit the <Enter> key to load Jupyter Notebook.

💡 Note: Keep this pop-up window open. Failure to do so will close the Jupyter Notebook.

  • If successful, the Jupyter Notebook Launcher opens up in your default browser window.
  • Click the button located directly below Notebook.
  • The final step is to rename the file to something more descriptive. With your mouse, right-click over the filename tab.
  • Select Rename Notebook.
  • In the Rename pop-up window, type styles.ipynb.
  • Click the Rename button to confirm the selection.

Read HTML File

Function Outline

pandas.read_html(io, match='.+', flavor=None, header=None, 
                 index_col=None, skiprows=None, attrs=None, 
                 parse_dates=False, thousands=',', encoding=None, 
                 decimal='.', converters=None, na_values=None, 
                 keep_default_na=True, displayed_only=True)

This function reads HTML tables into a list of DataFrame objects.

For this example, we will create an HTML file. You could also read the tables from any webpage by passing its URL in place of the filename used here.

To create the HTML file, perform the following steps:

  • Highlight the HTML text shown below these steps. Press CTRL+C to copy the contents to the system Clipboard.
  • Open a text editor (Notepad). Paste the contents (CTRL+V) of the system Clipboard to the file.
  • Save the file as sample.html to the desktop.
<!doctype html>
<html lang="en">
    <head>
        <title>Sample</title>
    </head>
<body>
    <table>
        <thead>
            <tr>
            <th>FID</th>
            <th>Score</th>
            <th>Level</th>
            <th>Joined</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>1042</td>
                <td>1710</td>
                <td>Expert</td>
                <td>10/15/2021</td>
            </tr>
            <tr>
                <td>1043</td>
                <td>1960</td>
                <td>Authority</td>
                <td>10/8/2021</td>
            </tr>
            <tr>
                <td>1044</td>
                <td>1350</td>
                <td>Learner</td>
                <td>10/18/2021</td>
            </tr>
        </tbody>
    </table>
</body>
</html>
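
As an alternative to copying and pasting, the same file could be generated from Python. The sketch below is an assumption-laden shortcut rather than the article's method: the helper build_page is hypothetical, the markup is compacted compared with the listing above, and the file is written to the current working directory instead of the desktop.

```python
from pathlib import Path

HEADERS = ["FID", "Score", "Level", "Joined"]
ROWS = [
    ["1042", "1710", "Expert", "10/15/2021"],
    ["1043", "1960", "Authority", "10/8/2021"],
    ["1044", "1350", "Learner", "10/18/2021"],
]


def build_page(headers, rows):
    """Build a minimal HTML page holding one table with a header row."""
    head = "".join(f"<th>{h}</th>" for h in headers)
    body = "\n".join(
        "<tr>" + "".join(f"<td>{cell}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return (
        '<!doctype html>\n<html lang="en">\n'
        "<head><title>Sample</title></head>\n"
        "<body>\n<table>\n"
        f"<thead><tr>{head}</tr></thead>\n"
        f"<tbody>\n{body}\n</tbody>\n"
        "</table>\n</body>\n</html>\n"
    )


page = build_page(HEADERS, ROWS)
# Written next to the running script or notebook.
Path("sample.html").write_text(page, encoding="utf-8")
```

Run inside the notebook, this places sample.html in the notebook's working directory, which may make the upload step unnecessary.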

The next step is to upload the sample.html file located on the desktop. To upload this file to the Jupyter Notebook, perform the following steps:

  • On the left-hand side of the Jupyter Notebook, click the Upload button.
  • From the File Upload pop-up box, browse, and select the sample.html file.
  • Click the Open button to complete the process.

If successful, this file now resides inside the Jupyter Notebook area.

df = pd.read_html('sample.html')
print(df)
  • Line [1] reads in the HTML file and saves every table it finds as a list of DataFrames in df.
  • Line [2] outputs the list to the terminal.

To run this code, press the run icon (right-pointing arrow) located directly below the styles.ipynb filename tab.

Output

The output is a list containing a single DataFrame, as the surrounding square brackets show.

[    FID  Score      Level      Joined
0  1042   1710     Expert  10/15/2021
1  1043   1960  Authority   10/8/2021
2  1044   1350    Learner  10/18/2021]

To remove the square brackets, select the first (and only) DataFrame in the list by running the code below.

print(df[0])

Output

    FID  Score      Level      Joined
0  1042   1710     Expert  10/15/2021
1  1043   1960  Authority   10/8/2021
2  1044   1350    Learner  10/18/2021
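
Two of the parameters in the function outline, match and index_col, can be sketched with an in-memory page. This is a minimal illustration, not the article's own example: the second table (City/Visits) is made up for the demonstration, the HTML is wrapped in StringIO rather than read from a file, and read_html is assumed to have a parser available (lxml, or BeautifulSoup with html5lib).

```python
from io import StringIO

import pandas as pd

# Two tables on one page; the second one is hypothetical filler.
html = """
<table>
  <thead><tr><th>FID</th><th>Score</th><th>Level</th></tr></thead>
  <tbody>
    <tr><td>1042</td><td>1710</td><td>Expert</td></tr>
    <tr><td>1043</td><td>1960</td><td>Authority</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th>City</th><th>Visits</th></tr></thead>
  <tbody><tr><td>Oslo</td><td>12</td></tr></tbody>
</table>
"""

# Without filters, every table on the page comes back as its own DataFrame.
tables = pd.read_html(StringIO(html))
print(len(tables))

# match keeps only tables whose text matches the pattern;
# index_col=0 uses the first column (FID) as the row index.
matched = pd.read_html(StringIO(html), match="Score", index_col=0)
scores = matched[0]
print(scores)
```

Because read_html always returns a list, the [0] selection is still needed even when match narrows the result to one table.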

DataFrame to HTML

Using the list of DataFrames (df) above, we can save the first table to an HTML file with the following code.

df = pd.read_html('sample.html')
df[0].to_html('newfile.html')
  • Line [1] reads in the HTML file and saves the tables as a list of DataFrames.
  • Line [2] exports the first DataFrame to newfile.html.
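
By default, to_html also writes the row index (0, 1, 2) as an extra column, which typically comes back as an unnamed column if the file is read in again. The sketch below shows the index parameter as one way to leave it out. The DataFrame is rebuilt by hand from the sample data so the snippet runs on its own, and the output filename newfile_noindex.html is a made-up name for illustration.

```python
import pandas as pd

# The sample table from this article, rebuilt by hand.
df = pd.DataFrame({
    "FID": [1042, 1043, 1044],
    "Score": [1710, 1960, 1350],
    "Level": ["Expert", "Authority", "Learner"],
    "Joined": ["10/15/2021", "10/8/2021", "10/18/2021"],
})

# With no filename, to_html returns the markup as a string.
with_index = df.to_html()
without_index = df.to_html(index=False)

# With a filename, the markup is written to disk instead.
df.to_html("newfile_noindex.html", index=False)

print(without_index)
```

Returning a string is convenient when the table has to be embedded in a larger page rather than saved on its own.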

💡 Note: If you look at the source code of newfile.html, you will see that all HTML tags except the table-related ones have been removed.

To see the front-end view of the HTML file, locate and double-click newfile.html on the left-hand side.

HTML Styler

This section focuses on styling the HTML file to give it some pizzazz.

For this example, remove all lines of code from the styles.ipynb file except for the two import lines noted above.

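from IPython.display import display
df = pd.read_html('newfile.html')
# A notebook cell renders only its last expression, so call display() on each table.
display(HTML(df[0].to_html(classes='table table-bordered')))
display(HTML(df[0].to_html(classes='table table-hover')))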

That wraps up this article. The take-away is that styles applied this way are temporary: they affect only the table rendered in the notebook, so once the HTML file is saved, the styling is lost.

A workaround is to either create a style sheet and call that in or add the styles directly inside the HTML file.
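
The second workaround, adding the styles directly inside the HTML file, might look like the sketch below. It is an illustration under assumptions, not a definitive recipe: the CSS rules, the page wrapper, and the filename styled.html are all made up for the example, and the DataFrame is rebuilt by hand from the sample data.

```python
import pandas as pd

# The sample table from this article, rebuilt by hand.
df = pd.DataFrame({
    "FID": [1042, 1043, 1044],
    "Score": [1710, 1960, 1350],
    "Level": ["Expert", "Authority", "Learner"],
})

# classes adds CSS class names to the <table> tag; it does not add any CSS.
table_html = df.to_html(index=False, classes="table table-bordered")

# Hypothetical rules that give the class names a visible effect.
CSS = """
table.table-bordered { border-collapse: collapse; }
table.table-bordered th,
table.table-bordered td { border: 1px solid #999; padding: 4px 8px; }
"""

# Wrap the table in a full page so the styles travel with the file.
page = f"""<!doctype html>
<html lang="en">
<head>
<title>Styled table</title>
<style>{CSS}</style>
</head>
<body>
{table_html}
</body>
</html>
"""

with open("styled.html", "w", encoding="utf-8") as f:
    f.write(page)
```

For the style-sheet variant, the style element could be swapped for a link element pointing at a separate CSS file kept alongside the page.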