Over your career as a Data Scientist or a Web Scraper, there may be instances where you need to move data between a DataFrame and HTML format. This article shows you how to do so using the functions listed above.
This article covers the commonly used parameters for each of those functions. For a complete list of all parameters and their use, see the official pandas documentation.
Preparation
Before any data manipulation can occur, three (3) new libraries will require installation.
- The pandas library enables access to/from a DataFrame.
- The ipython library enables HTML rendering and styling.
- The jupyter library is a server-client application that allows editing and running the Notebook in your favorite browser. This library can reside on your computer to run in a local environment or on a remote server.
To install these libraries, navigate to an IDE terminal. At the command prompt ($), execute the code below. For the terminal used in this example, the command prompt is a dollar sign ($). Your terminal prompt may be different.
$ pip install pandas
$ pip install ipython
$ pip install jupyter
After typing each command, hit the <Enter> key on the keyboard to start the installation process.
If the installations were successful, a message displays in the terminal indicating the same.
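If the terminal messages scroll by too quickly, the installations can also be checked from Python. The sketch below uses only the standard library; the helper name installed_versions is made up for this illustration, and the package names are assumed to match the ones passed to pip above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # pip has not installed this distribution in the current environment.
            versions[name] = None
    return versions


# The three libraries this article installs with pip.
for name, version in installed_versions(["pandas", "ipython", "jupyter"]).items():
    print(f"{name}: {version or 'not installed'}")
```

Any library reported as not installed can be installed again with the matching pip command.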
Feel free to view the PyCharm installation guide for the required libraries.
- How to install Pandas on PyCharm
- How to install iPython on PyCharm
- How to install Jupyter on PyCharm
Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.
import pandas as pd
from IPython.display import HTML
Start Jupyter
To start Jupyter Notebook, perform the following steps:
- Locate the executable file where Jupyter Notebook resides (for this example, on your computer). The easiest way to do this is to search for the file jupyter-lab.exe. Please note the path.
- Navigate to the Windows search box (Desktop bottom left).
- In the search textbox enter cmd. Select Command Prompt -> Open.
- A pop-up window appears. Paste the entire path to the file (which may differ), including 'jupyter-lab.exe' as follows: C://python/scripts/jupyter-lab.exe.
- Hit the <Enter> key to load Jupyter Notebook.
💡 Note: Keep this pop-up window open. Failure to do so will close the Jupyter Notebook.
- If successful, the Jupyter Notebook Launcher opens up in your default browser window.
- Click the button located directly below Notebook.
- Shown below is the environment used in this article.
- The final step is to rename the file to something more descriptive. With your mouse, right-click over the filename tab.
- Select Rename Notebook.
- From the Rename pop-up window, type styles.ipynb.
- Click the Rename button to confirm the selection.
Read HTML File
Function Outline
pandas.read_html(io, match='.+', flavor=None, header=None, index_col=None, skiprows=None, attrs=None, parse_dates=False, thousands=',', encoding=None, decimal='.', converters=None, na_values=None, keep_default_na=True, displayed_only=True)
This function reads HTML tables into a list of DataFrame objects.
For this example, we will create an HTML file. You could read in any webpage by replacing the filename parameter used here with any URL.
To create the HTML file, perform the following steps:
- Highlight the text below. Press CTRL+C to copy the contents to the system Clipboard.
- Open a text editor (Notepad). Paste the contents (CTRL+V) of the system Clipboard to the file.
- Save the file as sample.html to the desktop.
<!doctype html>
<html lang="en">
<head>
  <title>Sample</title>
</head>
<body>
  <table>
    <thead>
      <tr> <th>FID</th> <th>Score</th> <th>Level</th> <th>Joined</th> </tr>
    </thead>
    <tbody>
      <tr> <td>1042</td> <td>1710</td> <td>Expert</td> <td>10/15/2021</td> </tr>
      <tr> <td>1043</td> <td>1960</td> <td>Authority</td> <td>10/8/2021</td> </tr>
      <tr> <td>1044</td> <td>1350</td> <td>Learner</td> <td>10/18/2021</td> </tr>
    </tbody>
  </table>
</body>
</html>
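As an alternative to the copy-and-paste steps above, the same file could be written from Python. This is a minimal sketch, not part of the original workflow: the helper name write_sample_html is hypothetical, and the file is written to the current working directory rather than the desktop. If the code runs inside the Notebook, the file should already sit next to the Notebook, so the upload steps would not be needed.

```python
from pathlib import Path

# The same markup as the sample file shown above.
SAMPLE_HTML = """<!doctype html>
<html lang="en">
<head>
  <title>Sample</title>
</head>
<body>
  <table>
    <thead>
      <tr> <th>FID</th> <th>Score</th> <th>Level</th> <th>Joined</th> </tr>
    </thead>
    <tbody>
      <tr> <td>1042</td> <td>1710</td> <td>Expert</td> <td>10/15/2021</td> </tr>
      <tr> <td>1043</td> <td>1960</td> <td>Authority</td> <td>10/8/2021</td> </tr>
      <tr> <td>1044</td> <td>1350</td> <td>Learner</td> <td>10/18/2021</td> </tr>
    </tbody>
  </table>
</body>
</html>
"""


def write_sample_html(path="sample.html"):
    """Write the sample page to disk and return the path that was written."""
    target = Path(path)
    target.write_text(SAMPLE_HTML, encoding="utf-8")
    return target


saved = write_sample_html()
print(f"Wrote {saved.resolve()}")
```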
The next step is to upload the sample.html file located on the desktop. To upload this file to the Jupyter Notebook, perform the following steps:
- On the left-hand side of the Jupyter Notebook, click the Upload button.
- From the File Upload pop-up box, browse, and select the sample.html file.
- Click the Open button to complete the process.
If successful, this file now resides inside the Jupyter Notebook area.
df = pd.read_html('sample.html')
print(df)
- Line [1] reads in the HTML file and saves the contents.
- Line [2] outputs the contents to the terminal.
To run this code, press the run icon (right-pointing arrow) located directly below the styles.ipynb filename tab.
Output
The output is a list containing a single DataFrame, as shown below.
[    FID  Score      Level      Joined
0  1042   1710     Expert  10/15/2021
1  1043   1960  Authority   10/8/2021
2  1044   1350    Learner  10/18/2021]
To display the DataFrame without the enclosing square brackets, select the first element of the list by running the code below.
print(df[0])
Output
  | FID | Score | Level | Joined |
0 | 1042 | 1710 | Expert | 10/15/2021 |
1 | 1043 | 1960 | Authority | 10/8/2021 |
2 | 1044 | 1350 | Learner | 10/18/2021 |
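The call above passes only the io argument. As a hedged sketch of two other parameters from the function outline, the example below uses match to keep only the tables whose text matches a pattern and index_col to use the first column as the row index. The two-table page is invented for this illustration. The markup is wrapped in StringIO so it is treated as file-like input, and read_html needs an HTML parser such as lxml (or bs4 with html5lib) to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; only the first one mentions "Score".
PAGE = """
<table>
  <thead><tr><th>FID</th><th>Score</th><th>Level</th></tr></thead>
  <tbody>
    <tr><td>1042</td><td>1710</td><td>Expert</td></tr>
    <tr><td>1043</td><td>1960</td><td>Authority</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th>City</th><th>Country</th></tr></thead>
  <tbody><tr><td>Oslo</td><td>Norway</td></tr></tbody>
</table>
"""

# match keeps only tables whose text matches the pattern;
# index_col=0 uses the first column (FID) as the row index.
tables = pd.read_html(StringIO(PAGE), match="Score", index_col=0)

# read_html returns a list even when a single table is found.
print(type(tables), len(tables))

members = tables[0]
print(members)
```

Because the result is always a list, the single matching table is still selected with tables[0], just as df[0] is used above.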
DataFrame to HTML
Using the list of DataFrames (df) above, we could save the first table to an HTML file with the following code.
df = pd.read_html('sample.html')
df[0].to_html('newfile.html')
- Line [1] reads in the HTML file and saves the contents.
- Line [2] exports this content to newfile.html.
💡 Note: If you look at the source code of newfile.html, you will see that all HTML tags except the table-related ones have been removed.
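The note above can be checked without opening the file. When to_html() is called without a file name it returns the markup as a string, which makes the output easy to inspect. The sketch below rebuilds the sample rows directly so that it is self-contained; index=False is an optional extra that leaves out the 0, 1, 2 row labels.

```python
import pandas as pd

# Same rows as sample.html, built directly so the sketch is self-contained.
df = pd.DataFrame(
    {
        "FID": [1042, 1043, 1044],
        "Score": [1710, 1960, 1350],
        "Level": ["Expert", "Authority", "Learner"],
        "Joined": ["10/15/2021", "10/8/2021", "10/18/2021"],
    }
)

# With no file name, to_html() returns the markup as a string.
# index=False leaves out the row labels.
markup = df.to_html(index=False)

print(markup)
# The string holds only the table, with no <html>, <head> or <body> wrapper.
print("<html" in markup)
```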
Below is the front-end view of the HTML file. To view this file, locate and double-click newfile.html on the left-hand side.
Output
HTML Styler
This section focuses on styling the HTML file to give it some pizzazz.
For this example, remove all lines of code from the styles.ipynb file except for the two import statements noted above.
df = pd.read_html('newfile.html')
# A notebook cell renders only its last expression, so run each HTML(...) line in its own cell.
HTML(df[0].to_html(classes='table table-bordered'))
HTML(df[0].to_html(classes='table table-hover'))
Output
That wraps up this article. The take-away is that any styles applied to the HTML output are temporary: once the HTML file is saved, the styling is lost.
A workaround is to either create a style sheet and link to it from the HTML file or add the styles directly inside the HTML file.
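The second workaround can be sketched as follows. This is an illustration under stated assumptions, not the article's own method: the class name report, the CSS rules, and the output file name styled.html are all invented for the example. The table markup comes from to_html() and is wrapped in a complete page whose style block travels with the saved file.

```python
import pandas as pd

# Same rows as sample.html, built directly so the sketch is self-contained.
df = pd.DataFrame(
    {
        "FID": [1042, 1043, 1044],
        "Score": [1710, 1960, 1350],
        "Level": ["Expert", "Authority", "Learner"],
        "Joined": ["10/15/2021", "10/8/2021", "10/18/2021"],
    }
)

# Hypothetical CSS rules; "report" is an arbitrary class name chosen for this sketch.
CSS = """
table.report { border-collapse: collapse; }
table.report th, table.report td { border: 1px solid #999; padding: 4px 8px; }
table.report tr:hover { background-color: #eef; }
"""

# classes adds "report" to the table's class attribute so the rules above apply.
table_markup = df.to_html(classes="report", index=False)

# Wrap the table in a full page so the styles are saved inside the file.
page = f"""<!doctype html>
<html lang="en">
<head>
<title>Styled Sample</title>
<style>{CSS}</style>
</head>
<body>
{table_markup}
</body>
</html>
"""

with open("styled.html", "w", encoding="utf-8") as handle:
    handle.write(page)
```

Opening styled.html in a browser should show the bordered table with the hover effect, because the rules are stored in the file itself.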