All the code in this article can be found in our GitHub repository:
Is it tedious to copy and paste a table from a webpage into your spreadsheet or word processor? So, you want to use Python to scrape HTML tables?
Are you figuring out how to parse an HTML table using the Python programming language?
Are you confused about which Python module to use for parsing HTML tables?
You have come to the right place. In this article, we will show you three different methods to parse HTML tables using Python efficiently. We will explain the methods using tables from Wikipedia. In the last part of the article, we will show how to extract a long table from the BBC News website. Finally, you will get our recommendation for the best of the three methods.
Before we dive in, let us understand the HTML table and its elements.
What Is HTML Table?
Web developers use tables to arrange data into rows and columns. A table consists of rows and columns, and the intersections of these form its cells.
The purpose of an HTML table is to organize data in tabular form so that users can read it with less effort. Users can correlate specific data with the row and column descriptions.
Tables are used for:
- Financial Data
- Calendar
- Price Comparison
- Features Comparison
- Vaccination facts Information panel and
- Much more….
Elements of HTML Table
We will be using the List of Country Capitals in The Middle East table to learn about the elements.
- <thead>: The table head. It occupies the first row of the table and contains the table headings, but no data. Refer to 2 in Image 1.
- <tr>: Stands for table row. It appears as a direct child of the <table> element (or its sections), and the headings and data are written inside it. Refer to 3, 6, 8 & 10 in Image 1.
- <th>: The header cell that titles each column. This element can also appear in the table body (<tbody>) and is not restricted to <thead>. Refer to 4 in Image 1, where “Country” and “Capital” are in <th> elements.
- <tbody>: Stands for table body, the area where the data is displayed. It is a direct child of the <table> tag and should always come after <thead>. Refer to 5 in Image 1.
- <td>: Stands for table data, the cell where the data itself is displayed. It should always come under <tr>, and these cells are laid out row-wise. Refer to 7, 9 & 11 in Image 1, where the names of the countries and capitals are in <td> elements.
- <tfoot>: Stands for table footer. It is used in the last row to summarize the table, for example with totals of numeric values. In HTML5, <tfoot> can be placed either before or after the <tbody> and <tr> elements. You will typically find it in long tables of quantitative data.
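Putting these elements together, a table script looks roughly like the following (a minimal reconstruction for illustration; the original script accompanying Image 1 may differ slightly):

```html
<table>
  <thead>
    <tr>
      <th>Country</th>
      <th>Capital</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>United Arab Emirates</td>
      <td>Abu Dhabi</td>
    </tr>
    <tr>
      <td>Jordan</td>
      <td>Amman</td>
    </tr>
    <tr>
      <td>Turkey</td>
      <td>Ankara</td>
    </tr>
  </tbody>
</table>
```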
The output for the HTML script is shown below:
| Country | Capital |
|---|---|
| United Arab Emirates | Abu Dhabi |
| Jordan | Amman |
| Turkey | Ankara |
Now you have understood the elements of the HTML table. Let us proceed to parse some HTML tables. There are three methods to extract a table.
Method 1: Using lxml to Parse HTML Table
Before diving into lxml, you have to be clear on what XML and HTML mean.
What Is XML?
XML stands for Extensible Markup Language. It is a markup language created by the World Wide Web Consortium (W3C) to encode documents in a format that is readable by both humans and machines. XML is textual, which makes it simple for everyone to understand. It is used to build web applications and web pages and to transport data, for example from databases; the main focus of this language is to store and transport data. In an XML document, you can define your own tags as your requirements demand, but closing tags are mandatory.
What Is HTML?
HTML stands for HyperText Markup Language, a markup language that enables the creation of structured web pages. HyperText refers to the hyperlinks that let you move between pages. As you learned above, HTML has its own predefined elements (tags) for constructing a solid webpage. The markup is easy to understand and simple to edit or update as plain text. The presentation of the web page is the main focus of HTML, and its data is easier to parse than XML.
Okay, let us start scraping the HTML table using lxml.
lxml: This library is a Python binding for the C libraries libxml2 and libxslt. It combines the fast processing of C with the simplicity of Python, and it can create, parse, and query XML.
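As a quick, self-contained taste of this workflow before the full walkthrough, lxml can parse HTML text and query it with XPath. The snippet below uses a small inline table (hypothetical data, standing in for a downloaded page):

```python
from lxml import html

# Hypothetical inline snippet standing in for a downloaded webpage
snippet = """
<table>
  <tr><th>Country</th><th>Capital</th></tr>
  <tr><td>Jordan</td><td>Amman</td></tr>
  <tr><td>Turkey</td><td>Ankara</td></tr>
</table>
"""

tree = html.fromstring(snippet)
# Select the second <td> of every row, i.e. the Capital column
capitals = [td.text for td in tree.xpath('//tr/td[2]')]
print(capitals)  # ['Amman', 'Ankara']
```

The same pattern, parse first, then query with XPath, is what we apply to the Wikipedia page below.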
We will be extracting a table from a list of best-selling books on the Wikipedia website.
First, you have to install the lxml and tabulate libraries if you have not done so before:

pip install lxml
pip install tabulate
The next step is to import the libraries:
from lxml import html, etree
from tabulate import tabulate
You have to import the html and etree modules for the HTML and XML file types.
Since lxml's etree supports only XML, or HTML that is XML-compliant, you have to convert the HTML to XML with the following code:
html_file = "/Users/mohamedthoufeeq/Desktop/UPWORK PROJECTS/Fixnter/HTML TABLE /List of best-selling books - Wikipedia.html"
You have to save the HTML page on your system and create a variable html_file to store the file path of the HTML page.
In the next step, you have to open the HTML file, parse its contents, and store them in the variable html_doc.
with open(html_file, 'r', encoding='utf-8') as inpt:
    html_doc = html.fromstring(inpt.read())
In the above command, use the html.fromstring() method to store the parsed contents of the HTML.
with open("BestSellingBooksLists.xml", 'wb') as outpt:
    outpt.write(etree.tostring(html_doc))
Here you are creating a new file, “BestSellingBooksLists.xml”, and transferring the contents of the HTML into this XML file, using the etree.tostring() method to write the XML contents.
You can see the new file BestSellingBooksLists.xml saved in your system. Locate it and copy the path.
Now we have to parse the XML file using the etree.parse() method:
table_tree = etree.parse("/Users/mohamedthoufeeq/Desktop/UPWORK PROJECTS/Fixnter/HTML TABLE /BestSellingBooksLists.xml")
In the following commands, we will extract the table using the XPath method. Open the webpage and inspect the table element. You can learn how to identify elements using XPath in this article: https://blog.finxter.com/how-to-automate-google-search-using-python/
# Extracting data from the table
Main_Heading = table_tree.xpath('//*[@class = "wikitable sortable"][1]//th')
Main_Heading_list = []
for mh in Main_Heading:
    Main_Heading_list.append((mh.text).replace("\n", " "))

item = []

Book = table_tree.xpath('//*[@class = "wikitable sortable"][1]//i[1]/a[1]')
for bl in Book:
    item.append((bl.text).replace("\n", " "))

Author = table_tree.xpath('//*[@class = "wikitable sortable"][1]//td[2]/a[1]')
for auth in Author:
    item.append((auth.text).replace("\n", " "))

Language = table_tree.xpath('//*[@class = "wikitable sortable"][1]//td[3]')
for lan in Language:
    item.append((lan.text).replace("\n", " "))

Published = table_tree.xpath('//*[@class = "wikitable sortable"][1]//td[4]')
for pub in Published:
    item.append((pub.text).replace("\n", " "))

Sales = table_tree.xpath('//*[@class = "wikitable sortable"][1]//td[5]')
for sal in Sales:
    item.append((sal.text).replace("\n", " "))

genre = table_tree.xpath('//*[@class = "wikitable sortable"][1]//td[6]/a[1]')
for gen in genre:
    item.append((gen.text).replace("\n", " "))
The class “wikitable sortable” is used for the list of best-selling books table.
n = len(item) // 6  # number of data rows in the table
# item was filled column by column, so take every n-th element to rebuild each row
rows = [item[v::n] for v in range(n)]
rows.insert(0, Main_Heading_list)

Because the item list holds the table contents column by column, we regroup it into one list per table row, stored in the new list rows, using a list comprehension, and then insert the table headings as the first row.
Finally, we will be drawing a table using the tabulate library:
print(tabulate(rows,headers = 'firstrow',tablefmt='fancy_grid'))
Output of your Program:
Method 2: Using Pandas and Beautiful Soup to Parse HTML Table
In Method 2 you will use a well-known web scraping module, Beautiful Soup, to extract the HTML table, and then organize the extracted data in tabular form using a Pandas DataFrame.
As always install the libraries using the below command:
pip install beautifulsoup4
pip install pandas
pip install requests
Now you have to import beautifulsoup, pandas, and requests.
from bs4 import BeautifulSoup
import pandas as pd
import requests
Let us now get the URL for extracting the List of Best-Selling Books more than 100 million copies table.
url = "https://en.wikipedia.org/wiki/List_of_best-selling_books#More_than_100_million_copies"
website = requests.get(url)
Store the URL of the webpage in the variable url. You can get the webpage content using the requests.get() method and store it in the variable website.
soup = BeautifulSoup(website.content,'html5lib')
The above code parses the content of the webpage and stores it in the variable soup. Here html5lib is used to parse the webpage; it is an extremely lenient parser.
Use the soup.find() method to identify the table tag with the class “wikitable sortable”, and store the contents of the table in the variable table. The class “wikitable sortable” belongs to the table element. Refer to Image 4.
table = soup.find('table', class_="wikitable sortable")
Next, create the following lists:
book = []       # stores the book names
author = []     # stores the author names of the books
language = []   # stores the languages of the books
published = []  # stores the published year of each book
sales = []      # stores the approx. sales made for each book
genre = []      # stores the genre of each book
Identify the HTML element for the table data (td), which is under the table row (tr).
Refer to Image 5.
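Based on the explanation below, the extraction loop can be sketched as follows. This is a reconstruction, shown with a small inline table (hypothetical data) instead of the live Wikipedia page so that it is self-contained; the real loop appends to all six lists using the column indices 0 through 5:

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, so the sketch runs standalone
html_doc = """
<table class="wikitable sortable">
  <tbody>
    <tr><th>Book</th><th>Author(s)</th></tr>
    <tr><td> The Hobbit </td><td>J. R. R. Tolkien</td></tr>
    <tr><td>The Little Prince</td><td>Antoine de Saint-Exupery</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html_doc, "html.parser")
table = soup.find('table', class_="wikitable sortable")

book = []
author = []
for row in table.tbody.find_all('tr'):    # iterate over the table rows
    columns = row.find_all('td')          # the data cells of this row
    if not columns:
        continue                          # the heading row has no <td> cells
    book.append(columns[0].text.strip())  # .text gets the cell text, strip() trims it
    author.append(columns[1].text.strip())

print(book)    # ['The Hobbit', 'The Little Prince']
print(author)  # ['J. R. R. Tolkien', 'Antoine de Saint-Exupery']
```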
You can find the explanation of the above code below:
- Use table.tbody.find_all('tr') to get the table row elements.
- Use columns = row.find_all('td') to get the table data elements of each row.
- Use a for loop to iterate over the rows and append all the book details to the lists.
- Note that you need to extract only the first table, so use the column indices 0, 1, 2, 3, 4 and 5 for the table data, as shown in the above code.
- Use the .text attribute to get only the text, such as “The Hobbit”.
- Use the strip() method to remove the surrounding whitespace.
Finally, you can present the data in tabular form using the following commands:
You have to create a dictionary table_dict, where the keys are the table headings and the values are the table data.
table_dict = {'Book':book,'Author(s)':author,'Original Language':language,'Published':published, 'Sales':sales,'Genre':genre}
In the code below, create a DataFrame for the table_dict dictionary and store it in the variable Data_Frame.
Data_Frame=pd.DataFrame(table_dict)
In the code below, you can set options for the table to show only 6 columns and to expand the frame without hiding any of it.
pd.set_option('display.max_columns', 6)
pd.set_option("expand_frame_repr", False)
Finally, print the table using the command:
print(Data_Frame)
Output:
Method 3: Using HTMLTableParser to Parse HTML Table
In this method, we will use the HTMLTableParser module, which is built exclusively to scrape HTML tables. It does not need any other external module for the parsing, and it works only in Python 3.
Install the HTMLTableParser module using the command below (urllib.request is part of the Python standard library, so it needs no separate installation):

pip install html-table-parser-python3
Store the address of the webpage in the variable url.
url = "https://en.wikipedia.org/wiki/List_of_best-selling_books#More_than_100_million_copies"
With the commands below, the program makes a request, opens the website, and reads its HTML contents. The variable xhtml stores the HTML contents.
req = urllib.request.Request(url=url)
f = urllib.request.urlopen(req)
xhtml = f.read().decode('utf-8')
Next, create an HTMLTableParser() object and store it in the variable p.
p = HTMLTableParser()
Feed the HTML contents to the HTMLTableParser object using the feed() method.
p.feed(xhtml)
In the below command, use p.tables[1] to get the contents of the required table only.
Finally, using the tabulate module to get the list of best-selling books details in tabular form.
s = p.tables[1]
print(tabulate(s, headers='firstrow', tablefmt='fancy_grid'))
The output is the same as in Image 3.
Extracting Global Vaccination Table
In this section, we will apply Method 3 to scrape the Global Vaccination Table from the website https://www.bbc.com/news/world-56237778 .
Code:
# Import the libraries
import urllib.request
from html_table_parser.parser import HTMLTableParser
from tabulate import tabulate

# Getting the HTML contents from the webpage
url = "https://www.bbc.com/news/world-56237778"
req = urllib.request.Request(url=url)
f = urllib.request.urlopen(req)
xhtml = f.read().decode('utf-8')

# Creating the HTMLTableParser() object
p = HTMLTableParser()

# Feed the HTML contents to the object
p.feed(xhtml)

# Printing the table
s = p.tables[0]
print(tabulate(s, headers='firstrow', tablefmt='fancy_grid'))
Output:
The table above shows the total number of doses administered for countries around the world.
Summary
Congratulations! Now you are able to parse HTML tables using Python modules only. You have an excellent idea of which modules to use for this purpose. The main modules you learned for scraping HTML tables are lxml.etree, BeautifulSoup, and HTMLTableParser. But note that lxml's etree requires the input to be XML, or HTML that is XML-compliant.
We have shown you examples from well-known websites such as Wikipedia and BBC News.