How to Extract HTML H1, H2, H3 Headlines from a Python String Using Regex


You can extract headlines from an HTML string with the regular expressions in Python’s built-in re module. Here’s an example that finds <h1>...</h1> in a string:

import re

html = """<html><body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body></html>"""  # your HTML string

# Raw string (r'...') so that backslashes in a pattern are never treated as string escapes
matches = re.findall(r'<h1>(.*?)</h1>', html)

for match in matches:
    print(match)

This prints My First Heading.

In this example, the re.findall() function returns all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found.

The regular expression <h1>(.*?)</h1> means:

  • <h1> matches the characters <h1> literally.
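  • (.*?) is a capturing group that matches any character (except for a newline) between zero and unlimited times. The ? after * makes it non-greedy, meaning it will stop at the first </h1> it encounters. Because . does not match a newline by default, a heading that spans several lines is only matched if you pass the re.DOTALL flag.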
  • </h1> matches the characters </h1> literally.

If there are any <h1> tags in the HTML string, their contents will be printed out by the print(match) statement.
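The newline limitation of . can be seen in a small sketch. The two-line heading below is a made-up sample, not taken from a real page:

```python
import re

# Hypothetical sample: the heading text is split across two lines.
html = """<h1>My First
Heading</h1>"""

# Without re.DOTALL, '.' stops at the newline, so nothing matches.
single_line = re.findall(r'<h1>(.*?)</h1>', html)
print(single_line)   # []

# With re.DOTALL, '.' also matches newline characters.
multi_line = re.findall(r'<h1>(.*?)</h1>', html, flags=re.DOTALL)
print(multi_line)    # ['My First\nHeading']
```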

This should work for simple cases, but if your HTML content gets more complex (for example, if there are attributes in the <h1> tag like <h1 class="headline"> or nested tags), you may need to adjust your regular expression or use an HTML parser like BeautifulSoup for more reliability. Regular expressions can get quite complicated when trying to parse HTML, and HTML parsers are specifically designed for this task.
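As a rough sketch of such an adjustment, the pattern below accepts attributes in the opening tag and covers <h1>, <h2>, and <h3> in one pass. The sample HTML and the variable names pattern and headlines are made up for illustration, and the approach assumes reasonably well-formed markup:

```python
import re

# Hypothetical sample with attributes and several heading levels.
html = """<html><body>
<h1 class="headline">My First Heading</h1>
<h2 id="intro">Introduction</h2>
<p>My first paragraph.</p>
<H3>Details</H3>
</body></html>"""

# <h([1-3])  opening tag of level 1, 2, or 3 (the level is group 1)
# \b[^>]*>   optional attributes up to the closing '>'
# (.*?)      heading content, non-greedy (group 2)
# </h\1\s*>  closing tag of the same level as the opening tag
pattern = re.compile(r'<h([1-3])\b[^>]*>(.*?)</h\1\s*>',
                     re.IGNORECASE | re.DOTALL)

# With two groups, findall() returns (level, text) tuples.
headlines = [(int(level), text.strip())
             for level, text in pattern.findall(html)]

for level, text in headlines:
    print(f"h{level}: {text}")
```

Any tags nested inside a heading (for example <em>) stay in the captured text, so this is still only a partial solution compared with a real HTML parser.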


How to Remove the <h1> Element Afterward?

If you’ve already extracted the headline and now want to remove the whole <h1> element (the tags together with their content) from the HTML string, you can use the re.sub() function to replace it with an empty string. Here’s an example:

import re

html = """<html><body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>"""  # your HTML string

# Remove <h1>...</h1>
cleaned = re.sub('<h1>.*?</h1>', '', html)

print(cleaned)

In this example, re.sub('<h1>.*?</h1>', '', html) replaces any occurrences of <h1>...</h1> (where ... is any sequence of characters) with an empty string.
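That call deletes the heading text together with its tags. If you would rather keep the text and drop only the opening and closing <h1> tags, one possible variation is to match the tags themselves. The class attribute in the sample and the name unwrapped are illustrative assumptions:

```python
import re

html = """<html><body><h1 class="headline">My First Heading</h1><p>My first paragraph.</p></body></html>"""

# </?    optional slash, so opening and closing tags both match
# h1\b   the tag name, not followed by further word characters
# [^>]*  optional attributes up to the closing '>'
unwrapped = re.sub(r'</?h1\b[^>]*>', '', html)

print(unwrapped)
# <html><body>My First Heading<p>My first paragraph.</p></body></html>
```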

If you want to remove all HTML tags, not just <h1>, you can modify the regular expression to match any tag:

import re

html = """<html><body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>"""  # your HTML string

# Remove all HTML tags
cleaned = re.sub('<.*?>', '', html)

print(cleaned)

In this example, re.sub('<.*?>', '', html) replaces any HTML tag with an empty string. The regular expression <.*?> matches any sequence of characters enclosed in < and >. The ? makes it non-greedy, so it will stop at the first > it encounters, which allows it to correctly handle multiple tags on one line.
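The <.*?> pattern assumes that no > appears inside a tag, for example within a quoted attribute value. As a sketch of a parser-based alternative, the standard library's html.parser module can collect the text between tags instead. The TextExtractor class and strip_tags function are hypothetical names for this illustration, and the title attribute in the sample is contrived to show the difference:

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called by the parser for each run of text outside of tags.
        self.parts.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


# Contrived sample: the attribute value contains a '>' character.
html = '<html><body><h1 title="a > b">My First Heading</h1><p>My first paragraph.</p></body></html>'

regex_text = re.sub(r'<.*?>', '', html)   # cuts the <h1 ...> tag at the first '>'
parser_text = strip_tags(html)            # treats the quoted value as part of the tag

print(repr(regex_text))
print(repr(parser_text))
```

Here the regex version leaves a stray fragment of the attribute in the result, while the parser version returns only the heading and paragraph text.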
