5 Best Ways to Save HTML Tables Data to CSV in Python


💡 Problem Formulation: Developers often need to extract data from HTML tables and save it in a more portable, spreadsheet-friendly format. The input is an HTML file or string containing table data; the desired output is a CSV file that can be opened by spreadsheet applications or imported into a database. This article presents several Python methods to accomplish this task effectively.

Method 1: Using BeautifulSoup and csv module

BeautifulSoup is a Python library for parsing HTML and XML documents. It creates a parse tree that can be used to extract data from HTML, which is useful for scraping web pages. Combined with Python's built-in csv module, we can parse an HTML table and write the content to a CSV file.

Here's an example:

from bs4 import BeautifulSoup
import csv

html_code = """<table>
                   <tr><td>Name</td><td>Age</td></tr>
                   <tr><td>Alice</td><td>30</td></tr>
                   <tr><td>Bob</td><td>25</td></tr>
               </table>"""
soup = BeautifulSoup(html_code, features="lxml")
table = soup.find("table")

with open("output.csv", "w", newline="") as f:
    csv_writer = csv.writer(f)
    for row in table.find_all("tr"):
        csv_row = [cell.get_text() for cell in row.find_all(["td", "th"])]
        csv_writer.writerow(csv_row)

The output CSV file will contain:

Name,Age
Alice,30
Bob,25

This code snippet uses BeautifulSoup to parse the HTML table data. Rows are identified by the <tr> tag, and individual cells by the <td> or <th> tags. The data is then written row by row into a CSV file using Python's csv.writer.
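In practice the HTML usually comes from a page saved to disk rather than a literal string. A minimal sketch of the same approach reading from a file (the file name page.html and its sample content are illustrative; the stdlib "html.parser" backend avoids the extra lxml dependency):

```python
import csv
from bs4 import BeautifulSoup

# A small sample file stands in for a page saved from the browser.
with open("page.html", "w", encoding="utf-8") as f:
    f.write("<table><tr><th>Name</th><th>Age</th></tr>"
            "<tr><td>Alice</td><td>30</td></tr></table>")

with open("page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")  # stdlib parser, no lxml needed

with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for row in soup.find("table").find_all("tr"):
        # strip=True trims the whitespace pretty-printed HTML adds around text
        writer.writerow([cell.get_text(strip=True)
                         for cell in row.find_all(["td", "th"])])
```

Passing the open file object straight to BeautifulSoup also avoids reading the whole page into memory first.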

Method 2: Using pandas

pandas is an open-source data manipulation and analysis library for Python. It provides various functions to easily deal with structured data like CSV files. We can use pandas to read the HTML content directly and save it as a CSV file without much fuss.

Here's an example:

from io import StringIO

import pandas as pd

html_code = """<table>
                   <tr><th>Name</th><th>Country</th></tr>
                   <tr><td>Carlos</td><td>Spain</td></tr>
                   <tr><td>Liu</td><td>China</td></tr>
               </table>"""
# Wrap the literal string in StringIO: passing raw HTML directly to
# read_html is deprecated since pandas 2.1. The <th> cells mark the
# header row, which pandas uses as the column names.
dfs = pd.read_html(StringIO(html_code))
dfs[0].to_csv('output.csv', index=False)

The output CSV file will contain:

Name,Country
Carlos,Spain
Liu,China

The pandas read_html function returns a list of DataFrame objects, each representing an HTML table found in the given string or file. We then use the DataFrame to_csv method to save the first table to a CSV file, setting index=False to avoid writing row indices.
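A page often contains more than one table; read_html returns them all, and its match parameter filters tables by their cell text. A sketch under made-up data (the tables and the cities.csv file name are illustrative; io.StringIO is used because newer pandas versions deprecate passing a literal HTML string):

```python
import io
import pandas as pd

html = """
<table><tr><th>Name</th><th>Country</th></tr>
       <tr><td>Carlos</td><td>Spain</td></tr></table>
<table><tr><th>City</th><th>Population</th></tr>
       <tr><td>Madrid</td><td>3223000</td></tr></table>
"""

# match keeps only tables whose text matches the given regex
dfs = pd.read_html(io.StringIO(html), match="Population")
dfs[0].to_csv("cities.csv", index=False)
```

Here only the second table matches, so dfs contains a single DataFrame.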

Method 3: Using lxml and csv module

The lxml library is a high-performance, easy-to-use library for processing XML and HTML in Python. It can be combined with Python's csv module to extract tables and save them as CSV. It is a particularly potent tool if you're dealing with very large or complicated HTML documents.

Here's an example:

from lxml import etree
import csv

html_code = """<table>
                   <tr><td>Product</td><td>Price</td></tr>
                   <tr><td>Apple</td><td>0.99</td></tr>
                   <tr><td>Banana</td><td>0.59</td></tr>
               </table>"""
root = etree.HTML(html_code)
table = root.find(".//table")

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for row in table.findall(".//tr"):
        # findall uses the limited ElementPath syntax, which has no "|"
        # union operator; row.xpath supports full XPath
        writer.writerow([cell.text for cell in row.xpath("./td|./th")])

The output CSV file will contain:

Product,Price
Apple,0.99
Banana,0.59

The etree.HTML function from the lxml library parses the HTML code and constructs an element tree. We select the table with an XPath expression and iterate over its rows; each cell's text content is then extracted and written to the CSV file using the csv.writer class.
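One caveat with lxml: an element's .text attribute only holds the text before its first child element, so a cell like <td><b>Apple</b> (red)</td> would yield None. The itertext() method walks all nested text instead. A sketch (the sample HTML is illustrative):

```python
import csv
from lxml import etree

# A cell with nested markup: .text would be None for the first <td>,
# since all of its text sits inside or after the <b> child element.
html = """<table>
            <tr><th>Product</th><th>Price</th></tr>
            <tr><td><b>Apple</b> (red)</td><td>0.99</td></tr>
          </table>"""
root = etree.HTML(html)

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for row in root.xpath("//table//tr"):
        # itertext() gathers text from the cell and all its descendants
        writer.writerow(["".join(cell.itertext()).strip()
                         for cell in row.xpath("./td|./th")])
```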

Method 4: Using HTMLParser and csv module

The html.parser module in the Python standard library provides the HTMLParser class. By overriding its handler methods we can extract table data without any third-party dependencies. This approach is good for simple, well-formed HTML.

Here's an example:

from html.parser import HTMLParser
import csv

class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.is_td = False
        self.rows = []
        self.current_row = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.is_td = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self.is_td = False
        elif tag == 'tr':
            self.rows.append(self.current_row)
            self.current_row = []

    def handle_data(self, data):
        if self.is_td:
            self.current_row.append(data.strip())

html_code = """<table>
                   <tr><td>Product</td><td>Price</td></tr>
                   <tr><td>Apple</td><td>0.99</td></tr>
                   <tr><td>Banana</td><td>0.59</td></tr>
               </table>"""
parser = MyHTMLParser()
parser.feed(html_code)

with open('output.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    for row in parser.rows:
        writer.writerow(row)

The output CSV file will contain:

Product,Price
Apple,0.99
Banana,0.59

In this example, a custom HTML parser class is defined by extending the HTMLParser class. It overrides methods to set flags when <td> tags are encountered and to collect cell data. When a <tr> end tag is found, the current row data is added to a list of rows, which are then written to a CSV file.
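Note that the parser above only watches <td> tags, so a header row written with <th> would be silently dropped. One possible extension (class and attribute names are illustrative) treats both tags as cells:

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collects text from both <td> and <th> cells, row by row."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.rows = []
        self.current_row = []

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False
        elif tag == "tr":
            self.rows.append(self.current_row)
            self.current_row = []

    def handle_data(self, data):
        if self.in_cell:
            self.current_row.append(data.strip())

parser = TableParser()
parser.feed("<table><tr><th>Product</th><th>Price</th></tr>"
            "<tr><td>Apple</td><td>0.99</td></tr></table>")
```

After feeding, parser.rows holds the header row followed by the data rows, ready for csv.writer.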

Bonus One-Liner Method 5: Using pandas (concise)

For pandas aficionados seeking maximum brevity, the library lets you convert an HTML table to a CSV file in a single expression. This method uses the same approach as Method 2, condensed into one line of code.

Here's an example:

import io, pandas as pd; pd.read_html(io.StringIO("<table><tr><th>ID</th><th>Name</th></tr><tr><td>101</td><td>John</td></tr></table>"))[0].to_csv('output.csv', index=False)

The output CSV file will contain:

ID,Name
101,John

This one-liner reads the HTML, converts the first (and only) table to a pandas DataFrame, and writes the DataFrame to a CSV file. It's handy for quick scripts where importing pandas is not a significant overhead.

Summary/Discussion

  • Method 1: Using BeautifulSoup and csv module. Strengths: Versatile, great for complex HTML. Weaknesses: Requires an external library.
  • Method 2: Using pandas. Strengths: Very simple syntax, handles complex HTML. Weaknesses: Pandas is a large library to import for a single task.
  • Method 3: Using lxml and csv module. Strengths: Fast and highly efficient for large documents. Weaknesses: Slightly more complex syntax than BeautifulSoup.
  • Method 4: Using HTMLParser and csv module. Strengths: No external dependencies, part of standard library. Weaknesses: Limited to simple, well-formed HTML.
  • Bonus Method 5: Using pandas (concise). Strengths: Extremely concise one-liner. Weaknesses: Requires importing pandas, which may be inefficient for small tasks.