π‘ Problem Formulation: Many developers deal with the need to extract data from HTML tables and save it into a more portable and user-friendly format. The input is an HTML file or string with table data, and the desired output is a CSV file that can be easily read and processed by spreadsheet applications or imported into a database. This article provides various Python methods to accomplish this task effectively.
Method 1: Using BeautifulSoup and csv module
BeautifulSoup is a Python library for parsing HTML and XML documents. It creates a parse tree that can be used to extract data from HTML, which is useful for scraping web pages. Combined with Python’s built-in csv
module, we can parse an HTML table and write the content to a CSV file.
Here’s an example:
from bs4 import BeautifulSoup import csv html_code = """<table> <tr><td>Name</td><td>Age</td></tr> <tr><td>Alice</td><td>30</td></tr> <tr><td>Bob</td><td>25</td></tr> </table>""" soup = BeautifulSoup(html_code, features="lxml") table = soup.find("table") with open("output.csv", "w", newline="") as f: csv_writer = csv.writer(f) for row in table.find_all("tr"): csv_row = [cell.get_text() for cell in row.find_all(["td", "th"])] csv_writer.writerow(csv_row)
The output CSV file will contain:
Name,Age Alice,30 Bob,25
This code snippet utilizes BeautifulSoup to parse the HTML table data. Rows are identified by the <tr>
tag, and individual cell data are found using the <td>
or <th>
tags. The data is then written row by row into a CSV file using Pythonβs csv.writer
.
Method 2: Using pandas
pandas is an open-source data manipulation and analysis library for Python. It provides various functions to easily deal with structured data like CSV files. We can use pandas to read the HTML content directly and save it as a CSV file without much fuss.
Here’s an example:
import pandas as pd html_code = """<table> <tr><td>Name</td><td>Country</td></tr> <tr><td>Carlos</td><td>Spain</td></tr> <tr><td>Liu</td><td>China</td></tr> </table>""" dfs = pd.read_html(html_code) dfs[0].to_csv('output.csv', index=False)
The output CSV file will contain:
Name,Country Carlos,Spain Liu,China
The pandas read_html
function returns a list of DataFrame objects, each representing an HTML table found in the given string or file. We then use the DataFrame to_csv
method to save the first table to a CSV file, setting index=False
to avoid writing row indices.
Method 3: Using lxml and csv module
The lxml
library is a high-performance, easy-to-use library for processing XML and HTML in Python. It can be combined with Python’s csv
module to extract tables and save them as CSV. It is a particularly potent tool if you’re dealing with very large or complicated HTML documents.
Here’s an example:
from lxml import etree import csv html_code = """<table> <tr><td>Product</td><td>Price</td></tr> <tr><td>Apple</td><td>0.99</td></tr> <tr><td>Banana</td><td>0.59</td></tr> </table>""" root = etree.HTML(html_code) table = root.find(".//table") with open("output.csv", "w", newline="") as f: writer = csv.writer(f) for row in table.findall(".//tr"): writer.writerow([cell.text for cell in row.findall(".//td|.//th")])
The output CSV file will contain:
Product,Price Apple,0.99 Banana,0.59
The etree.HTML
function from the lxml library parses the HTML code and constructs an Element tree. We select the table using XPath expression and iterate through the rows. Each cell’s text content is then extracted and written to the CSV file using the csv.writer
class.
Method 4: Using HTMLParser and csv module
HTMLParser
is a Python standard library module used to parse HTML. It offers a convenient way to handle HTML and allows us to extract table data by overriding its handler methods. This approach is good for simple, well-formed HTML data.
Here’s an example:
from html.parser import HTMLParser import csv class MyHTMLParser(HTMLParser): def __init__(self): super().__init__() self.is_td = False self.rows = [] self.current_row = [] def handle_starttag(self, tag, attrs): if tag == 'td': self.is_td = True def handle_endtag(self, tag): if tag == 'td': self.is_td = False elif tag == 'tr': self.rows.append(self.current_row) self.current_row = [] def handle_data(self, data): if self.is_td: self.current_row.append(data.strip()) html_code = """<table> <tr><td>Product</td><td>Price</td></tr> <tr><td>Apple</td><td>0.99</td></tr> <tr><td>Banana</td><td>0.59</td></tr> </table>""" parser = MyHTMLParser() parser.feed(html_code) with open('output.csv', 'w', newline='') as file: writer = csv.writer(file) for row in parser.rows: writer.writerow(row)
The output CSV file will contain:
Product,Price Apple,0.99 Banana,0.59
In this example, a custom HTML parser class is defined by extending the HTMLParser
class. It overrides methods to set flags when <td>
tags are encountered and to collect cell data. When a <tr>
end tag is found, the current row data is added to a list of rows, which are then written to a CSV file.
Bonus One-Liner Method 5: Using pandas (concise)
For pandas aficionados seeking a concise one-liner, the library offers a way to directly convert an HTML table to a CSV file with very minimal code. This method uses the same approach as Method 2 but optimizes it into a single line of code.
Here’s an example:
pd.read_html("<table><tr><th>ID</th><td>101</td></tr><tr><th>Name</th><td>John</td></tr></table>")[0].to_csv('output.csv', index=False)
The output CSV file will contain:
ID,101 Name,John
This one-liner script reads the HTML, converts the first (and only) table to a pandas DataFrame, and then writes the DataFrame to a CSV file, all in a single line of code. It’s efficient for simple scripts where importing pandas is not an overhead.
Summary/Discussion
- Method 1: Using BeautifulSoup and csv module. Strengths: Versatile, great for complex HTML. Weaknesses: Requires an external library.
- Method 2: Using pandas. Strengths: Very simple syntax, handles complex HTML. Weaknesses: Pandas is a large library to import for a single task.
- Method 3: Using lxml and csv module. Strengths: Fast and highly efficient for large documents. Weaknesses: Slightly more complex syntax than BeautifulSoup.
- Method 4: Using HTMLParser and csv module. Strengths: No external dependencies, part of standard library. Weaknesses: Limited to simple, well-formed HTML.
- Bonus Method 5: Using pandas (concise). Strengths: Extremely concise one-liner. Weaknesses: Requires importing pandas, which may be inefficient for small tasks.