Python Webscraper Regex [Free Book Chapter Tutorial]

This tutorial is a chapter excerpt drafted for my new book “Python One-Liners” (to appear in 2020, No Starch Press, San Francisco).

Are you an office worker, student, software developer, manager, blogger, researcher, author, copywriter, teacher, or self-employed freelancer? Most likely, you are spending many hours in front of your computer, day after day. In any case, improving your daily productivity, even by a small fraction of a percent, can add up to thousands, if not tens of thousands, of dollars of productivity gain. And more importantly, if you are not merely clocking your time at work, improving your computer productivity will give you more free time to be used in better ways.

This chapter shows you an extremely undervalued technology that helps master coders make more efficient use of their time when working with textual data. The technology is called “regular expressions”. This chapter shows you ten ways of using regular expressions to solve everyday problems with less effort, time, and energy. Study this chapter about regular expressions carefully—it’ll be worth your time investment!

Related article: Python Regex Superpower – The Ultimate Guide

Writing Your First Web Scraper With Regular Expressions

Why should you care about regular expressions? Because you will encounter them regularly if you are pursuing a programming career.

Suppose you are working as a freelance software developer. Your client is a fintech startup that needs to stay updated about the latest developments in the cryptocurrency space. They hire you to write a web scraper that regularly pulls the HTML source code of news websites and searches it for occurrences of words starting with 'crypto' (e.g. 'cryptocurrency', 'crypto-bot', 'crypto-crash', …).

Your first attempt is the following code snippet:

import urllib.request

search_phrase = 'crypto'

with urllib.request.urlopen('https://www.wired.com/') as response:
   html = response.read().decode("utf8") # convert to string
   first_pos = html.find(search_phrase)
   print(html[first_pos-10:first_pos+10])

Exercise: Search the wired website for other words using this web scraper!

The function urlopen() (from the module urllib.request) pulls the HTML source code from the specified URL. As the result is a bytes object, you first convert it to a string using the method decode(). Then, you use the string method find(), which returns the position of the first occurrence of the searched string. With slicing, you carve out a substring that shows the immediate environment of that position. The result is the following string:
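One caveat with this first attempt: find() returns -1 when the phrase does not occur, and a negative start index makes the slice count from the end of the string, so the snippet can silently show the wrong text. A minimal sketch of a more defensive version follows. The helper name context_snippet and its radius parameter are hypothetical, introduced only for illustration, and the short sample string stands in for downloaded HTML so the example runs offline.

```python
def context_snippet(text, phrase, radius=10):
    """Return the text around the first occurrence of phrase, or None."""
    pos = text.find(phrase)
    if pos == -1:                 # str.find() signals "not found" with -1
        return None
    start = max(0, pos - radius)  # avoid a negative slice start near the beginning
    return text[start:pos + radius]


# Stand-in for real page source (assumption: made-up sample, not fetched data)
html = 'var a,r=window.crypto||window.msCrypto;'

print(context_snippet(html, 'crypto'))
# ,r=window.crypto||wi

print(context_snippet(html, 'blockchain'))
# None
```

The behavior for a found phrase is the same as in the original snippet; only the two edge cases are handled explicitly.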

# ,r=window.crypto||wi

Aww. That looks bad. As it turns out, the search phrase is ambiguous: most words containing 'crypto' are semantically unrelated to cryptocurrencies. Your web scraper generates false positives (it finds string results which you originally didn’t mean to find). So how can you fix it?

Luckily, you’ve just read this Python book, so the answer is obvious: regular expressions! Your idea to remove false positives is to search for occurrences where the word "crypto" is followed by up to 30 arbitrary characters, followed by the word "coin". Roughly speaking, the search query is: "crypto" + <up to 30 arbitrary characters> + "coin". Consider the following two examples:

"crypto-bot that is trading Bitcoin" — YES
"cryptographic encryption methods that can be cracked easily with quantum computers" — NO

A regular expression is like a mini programming language inside Python that allows you to search a string for occurrences of a query pattern. Regular expressions are much more powerful than the plain textual search shown above. For example, a single pattern can describe an infinitely large set of matching strings!

Our goal is to solve the following problem: Given a string, find occurrences where the string “crypto” is followed by up to 30 arbitrary characters, followed by the string "coin".

Let’s have a first look at the result before we discuss—in a step-by-step manner—how the code solves the problem.

## Dependencies
import re


## Data
text_1 = "crypto-bot that is trading Bitcoin and other currencies"
text_2 = "cryptographic encryption methods that can be cracked easily with quantum computers"


## One-Liner
pattern = re.compile("crypto(.{1,30})coin")


## Result
print(pattern.match(text_1))
print(pattern.match(text_2))

One-liner solution to find text snippets in the form crypto … coin.

The code searches two different strings text_1 and text_2. Does the search query (pattern) match them?

First, we import the standard package for regular expressions in Python, called re. The important stuff happens in the one-liner, where you compile the search query "crypto(.{1,30})coin" (called the pattern in regex terminology). This is the query which we can then search for in various strings. We use the following special regex characters. Read them from top to bottom and you will understand the meaning of the pattern in the above code snippet.

() matches whatever regex is inside,
. matches an arbitrary character,
{1,30} matches between 1 and 30 occurrences of the previous regex,
(.{1,30}) matches between 1 and 30 arbitrary characters, and
crypto(.{1,30})coin matches the regex consisting of three parts: the word "crypto", an arbitrary sequence with 1 to 30 chars, followed by the word “coin”.

We say that the pattern is compiled because Python creates a pattern object that can be reused in multiple locations—much like a compiled program can be executed multiple times. Now, we call the function match() on our compiled pattern and the text to be searched. This leads to the following result:

## Result
print(pattern.match(text_1))
# <re.Match object; span=(0, 34), match='crypto-bot that is trading Bitcoin'>

print(pattern.match(text_2))
# None

String text_1 matches the pattern (indicated by the resulting match object), string text_2 doesn’t (indicated by the result None). Although the textual representation of the first matching object does not look pretty, it gives a clear hint that the given string 'crypto-bot that is trading Bitcoin' matches the regular expression.
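Note that match() only succeeds if the pattern matches at the very beginning of the string, which works here because text_1 starts with 'crypto'. In scraped HTML, the phrase usually sits somewhere in the middle of the text. As a small sketch (the sample sentence is made up for illustration), search() finds the pattern anywhere in the string, and group(1) returns the part captured by the parentheses:

```python
import re

pattern = re.compile("crypto(.{1,30})coin")

# Hypothetical sample text in which 'crypto' is not at the start
text = "News: a crypto-bot that is trading Bitcoin"

print(pattern.match(text))
# None (the string does not start with 'crypto')

m = pattern.search(text)
print(m.group(0))   # the whole match
# crypto-bot that is trading Bitcoin

print(m.group(1))   # only the part matched by (.{1,30})
# -bot that is trading Bit
```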

Finding Basic Textual Patterns in Strings

At this point, you have learned a powerful way to find arbitrary textual patterns in strings: regular expressions. Let’s build upon that by introducing the important re.findall() function. Additionally, this section explains several basic regular expressions in more detail.

A regular expression (in short: regex) formally describes the search pattern using a combination of some basic commands. Learn these basic commands and you will understand complex regular expressions easily. In this one-liner section, we will focus on the three most important regex commands.

The Dot Regex (.)

First, you need to know how to match an arbitrary character using the dot (.) regex. The dot regex matches any character. You can use it to indicate that you really don’t care which character matches—as long as exactly one matches.

import re

text = '''A blockchain, originally block chain,
is a growing list of records, called blocks,
which are linked using cryptography.
'''

print(re.findall('b...k', text))
# ['block', 'block', 'block']

The example uses the findall() function of the re module. The first parameter is the regex itself: we search for any string pattern starting with the character 'b', followed by three arbitrary characters (the dots ...), followed by the character 'k'. Note that not only is the string 'block' a match but also 'boook', 'b erk', and 'bloek'. The second parameter is the text to be searched. The string text contains three such patterns. These are the result of the print statement.
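To check the claim about the other strings, here is a short sketch using re.fullmatch(), which succeeds only if the entire string matches the pattern. The list of candidate words is made up for illustration.

```python
import re

# Hypothetical candidates: the first four fit 'b...k', the last two do not
candidates = ['block', 'boook', 'b erk', 'bloek', 'bk', 'blocks']

# Keep only the words that match the pattern from start to end
matches = [word for word in candidates if re.fullmatch('b...k', word)]

print(matches)
# ['block', 'boook', 'b erk', 'bloek']
```

The strings 'bk' and 'blocks' fail because the pattern requires exactly three characters between 'b' and 'k' and nothing after the 'k'.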

The Asterisk Regex (*)

Second, you need to know how to match an arbitrary number of specific characters using the asterisk (*) regex.

print(re.findall('y.*y', text))
# ['yptography']

The asterisk operator applies to the regex immediately in front of it. In the example, the regex pattern starts with the character 'y', followed by an arbitrary number of characters (.*), followed by the character 'y'. The word 'cryptography' contains one such instance.

If you are reading this thoroughly, you may wonder why it doesn’t find the long substring between 'originally' and 'cryptography', which should match the regex pattern 'y.*y' as well. The reason is that the dot matches any character except the newline character by default, so the expression '.*' can never extend past the end of a line. The string stored in the variable text is a multi-line string with three newline characters, so each match has to fit within a single line.
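If you do want matches to span several lines, the re module offers the flag re.DOTALL, which makes the dot match newline characters too. The following sketch compares both behaviors on the same text:

```python
import re

text = '''A blockchain, originally block chain,
is a growing list of records, called blocks,
which are linked using cryptography.
'''

# Default: the dot stops at the end of each line
default = re.findall('y.*y', text)
print(default)
# ['yptography']

# With re.DOTALL, the dot also matches '\n', so one match spans all lines
dotall = re.findall('y.*y', text, flags=re.DOTALL)
print(len(dotall))
# 1
print(dotall[0].count('\n'))
# 2
```

With the flag set, the single match starts at the last character of 'originally' and ends at the last character of 'cryptography'.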

The Question Mark Regex (?)

Third, you need to know how to match zero or one characters using the question mark regex (?).

print(re.findall('blocks?', text))
# ['block', 'block', 'blocks']

The zero-or-one regex (?) applies to the regex immediately in front of it. In our case, this is the character 's'. The meaning of the zero-or-one regex is that this character is optional.

An important detail is that the question mark can be combined with the asterisk operator '*?' to allow for non-greedy pattern matching. In contrast, if you use the asterisk operator '*' without the question mark, it greedily matches as many characters as possible. For example, when searching the HTML string '<div>hello world</div>' using the regex '<.*>', it matches the whole string '<div>hello world</div>' rather than only the prefix '<div>'. If you want to achieve the latter, you can, therefore, use the non-greedy regex '<.*?>':

txt = '<div>hello world</div>'

print(re.findall('<.*>', txt))
# ['<div>hello world</div>']

print(re.findall('<.*?>', txt))
# ['<div>', '</div>']

Equipped with these three tools, you are now able to comprehend the next one-liner solution.

Our goal is to solve the following problem: “Given a string. Use a non-greedy approach to find all patterns that start with the character 'p', end with the character 'r', and have one occurrence of the character 'e' (and an arbitrary number of other characters) in between!” These types of text queries occur quite frequently—especially in companies that focus on text processing, speech recognition, or machine translation (such as search engines, social networks, or video platforms).

## Dependencies
import re


## Data
text = 'peter piper picked a peck of pickled peppers'


## One-Liner
result = re.findall('p.*?e.*?r', text)


## Result
print(result)

One-liner solution to search for specific phrases (non-greedy).

The regex search query is 'p.*?e.*?r'. So we look for a phrase that starts with the character 'p' and ends with the character 'r'. In between those two characters, we require at least one occurrence of the character 'e'. Apart from that, we allow an arbitrary number of characters (whitespace or not). However, we match in a non-greedy manner using the regex '.*?' so that Python searches for a minimal number of arbitrary characters (rather than a maximal number of arbitrary characters for the greedy regex '.*').

## Result
print(result)
# ['peter', 'piper', 'picked a peck of pickled pepper']

To fully grasp the meaning of non-greedy matching, compare this solution to the one that would be obtained when you’d use the greedy regex ‘p.*e.*r’.

result = re.findall('p.*e.*r', text)
print(result)
# ['peter piper picked a peck of pickled pepper']

The first greedy asterisk operator .* matches almost the whole string before it terminates.

Analyzing Hyperlinks of HTML Documents

In the last section, you have learned the three most important regular expressions: the dot regex, the asterisk regex, and the zero-or-one regex. This section goes much further introducing many more regular expressions.

By adding more regular expressions to your stock of knowledge, you increase your capability of solving real-world problems in a fast, concise, and easy manner. So what are some of the most important regular expressions? Study the following list carefully because we will use all of them in this chapter.

The dot regex . matches an arbitrary character.
The asterisk regex A* matches an arbitrary number of instances of the regex A.
The zero-or-one regex A? matches either zero or one instances of the regex A.
The non-greedy regex .*? matches as few arbitrary characters as possible such that the overall regex still matches.
The regex A{m} matches exactly m copies of the regex A.
The regex A{m,n} matches between m and n copies of the regex A.
The regex A|B matches either regex A or regex B (but not both).
The regex AB matches first regex A and then regex B.
The regex (A) matches regex A. Parentheses group regular expressions so that you can control the order of execution (for example, the regex (AB)|C is different from A(B|C)).
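The following sketch illustrates a few entries from this list. The sample strings are made up for illustration.

```python
import re

# A{m}: exactly m copies. Here: exactly two 'a' characters in a row.
print(re.findall('a{2}', 'a aa aaa'))
# ['aa', 'aa']

# A{m,n}: between m and n copies (greedy, so as many as possible).
print(re.findall('a{2,3}', 'a aa aaa aaaa'))
# ['aa', 'aaa', 'aaa']

# A|B: either alternative.
print(re.findall('cat|dog', 'cat dog bird'))
# ['cat', 'dog']

# Grouping changes the meaning: (ab)|c versus a(b|c).
print(bool(re.fullmatch('(ab)|c', 'ac')))   # False: only 'ab' or 'c' match
print(bool(re.fullmatch('a(b|c)', 'ac')))   # True: 'ab' or 'ac' match
```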

Let’s consider a short example. Say you create the regex 'b?(.a)*'. Which patterns will the regex match? The regex matches all patterns starting with zero or one character 'b', followed by an arbitrary number of two-character sequences ending in the character 'a'. Hence, the strings 'bcacaca', '' (the empty string), and 'aaaaaa' would all match the regex.
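This can be checked with re.fullmatch(), which succeeds only if the whole string matches the regex. In this sketch, the string 'abc' is added as a made-up counterexample:

```python
import re

regex = 'b?(.a)*'

# Map each test string to True/False depending on whether it matches entirely
results = {s: re.fullmatch(regex, s) is not None
           for s in ['bcacaca', '', 'aaaaaa', 'abc']}

print(results)
# {'bcacaca': True, '': True, 'aaaaaa': True, 'abc': False}
```

The string 'abc' fails because it cannot be split into an optional 'b' followed by two-character sequences that each end in 'a'.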

Before we dive into the next one-liner, let’s quickly discuss another topic of interest for any practitioner: when to use which regex function? The three most important regex functions are re.match(), re.search(), and re.findall(). You’ve already seen two of them but let’s study them more thoroughly (by example).

import re

text = '''
"One can never have enough socks", said Dumbledore.
"Another Christmas has come and gone and I didn’t
get a single pair. People will insist on giving me books."
Christmas Quote
'''

regex = 'Christ.*'

print(re.match(regex, text))
# None

print(re.search(regex, text))
# <re.Match object; span=(62, 102), match='Christmas has come and gone and I didn’t'>

print(re.findall(regex, text))
# ['Christmas has come and gone and I didn’t', 'Christmas Quote']

All three functions take the regex and the string to be searched as an input. Functions match() and search() return a match object (or None if the regex did not match anything). The match object stores the position of the match and more advanced meta information. The function match() does not find the regex in the string (it returns None). Why? Because the function looks for the pattern only at the beginning of the string. The function search() searches for the first occurrence of the regex anywhere in the string. Therefore, it finds the match ‘Christmas has come and gone and I didn’t’.

I guess you like the function findall() most? The output is intuitive, but also less useful for further processing: a match object, for instance, carries the precise location of the match, while the result of findall() does not. The result is not a match object but a list of strings. In contrast to the functions match() and search(), the function findall() retrieves all matched patterns.
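If you need both all matches and their positions, one option is re.finditer(), which yields one match object per occurrence. A minimal sketch on a made-up sample string:

```python
import re

text = 'one two one'   # hypothetical sample text

# A match object exposes the matched text and its position
m = re.search('one', text)
print(m.group(), m.start(), m.end())
# one 0 3

# finditer() yields a match object for every occurrence
spans = [hit.span() for hit in re.finditer('one', text)]
print(spans)
# [(0, 3), (8, 11)]
```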

Say your company asks you to create a small web bot that crawls web pages and checks whether they contain links to the domain ‘finxter.com’. An additional requirement is that the hyperlink descriptions should also contain the strings ‘test’ or ‘puzzle’. More precisely, the goal is to solve the following problem: “Given a string, find all hyperlinks that point to the domain finxter.com and contain the strings ‘test’ or ‘puzzle’ in the link description”.

## Dependencies
import re


## Data
page = '''
<!DOCTYPE html>
<html>
<body>

<h1>My Programming Links</h1>
<a href="https://app.finxter.com/learn/computer/science/">test your Python skill level</a>
<a href="https://blog.finxter.com/recursion/">Learn recursion</a>
<a href="https://nostarch.com/">Great books from NoStarchPress</a>
<a href="http://finxter.com/">Solve more Python puzzles</a>

</body>
</html>
'''

## One-Liner
practice_tests = re.findall("(<a.*?finxter.*(test|puzzle).*>)", page)


## Result
print(practice_tests)

One-liner solution to analyze web page links.

The code finds two occurrences of the regular expression. Which ones?

The data consists of a simple HTML web page (stored as a multi-line string) containing a list of hyperlinks (the tag environment <a href="">link text</a>). The one-liner solution uses the function re.findall() to search for the regular expression "(<a.*?finxter.*(test|puzzle).*>)". This way, the regular expression returns all occurrences in the tag environment <a…> with the following restrictions:

After the start of the opening tag '<a', an arbitrary number of characters is matched (non-greedy), followed by the string ‘finxter’. Next, we match an arbitrary number of characters (greedy), followed by one occurrence of either the string ‘test’ or the string ‘puzzle’. Again, we match an arbitrary number of characters (greedily), followed by the closing angle bracket '>'. Because this last part is greedy, the match extends to the last '>' on the line, which here is the end of the closing tag </a>. This way, we find all hyperlink tags which contain the respective strings. Note that this regex also matches tags where the strings ‘test’ or ‘puzzle’ occur within the link itself.

The result of the one-liner is the following:

## Result
print(practice_tests)
# [('<a href="https://app.finxter.com/learn/computer/science/">test your Python skill level</a>', 'test'),
#  ('<a href="http://finxter.com/">Solve more Python puzzles</a>', 'puzzle')]

Two hyperlinks match our regular expression: the result of the one-liner is a list with two elements. However, each element is a tuple of strings rather than a simple string. This is different from the results of the function findall() which we’ve discussed in previous code snippets. What’s the reason for this behavior? If the regex contains more than one matching group enclosed in parentheses (), the function re.findall() returns a list of tuples with one tuple value per group. Each tuple value is the substring matched by that particular group, not the string matched by the whole regex. Our regex contains two groups: the outer parentheses enclose the whole pattern, so the first tuple value is the complete hyperlink tag, and the inner group ‘(test|puzzle)’ provides the second tuple value. That’s why the second tuple value of the first list element is the string ‘test’ and the second tuple value of the second list element is the string ‘puzzle’.
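If you prefer a plain list of strings, one possible variation is to drop the outer parentheses and turn the inner group into a non-capturing group with the syntax (?:...). This is a sketch, not the book's one-liner, and the shortened page string below is made up for illustration:

```python
import re

# Shortened, hypothetical page with one matching and one non-matching link
page = '''
<a href="http://finxter.com/">Solve more Python puzzles</a>
<a href="https://nostarch.com/">Great books from NoStarchPress</a>
'''

# (?:...) groups the alternatives without creating a matching group,
# so findall() returns the whole match as a simple string
links = re.findall('<a.*?finxter.*(?:test|puzzle).*>', page)

print(links)
# ['<a href="http://finxter.com/">Solve more Python puzzles</a>']
```

For real-world HTML, a dedicated HTML parser is usually more robust than a regex, but for a quick check like this one the pattern is often sufficient.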

Where to Go From Here?

Regular expressions will boost your productivity—this is a fact.

In this article, you learned basic regular expression syntax. You learned about the zero-or-one regex. You learned about the asterisk operator. You learned about the power of combining multiple regular expressions using concatenation (AB) and alternation (A|B). You also learned practical applications such as analyzing HTML documents.

If you liked what you learned, check out my new book “Python One-Liners”, published with the famous NoStarchPress in San Francisco. It’s a carefully crafted Python tutorial, twelve months in the making. The NoStarch team and I poured hundreds of hours of time into the book. It’ll help you become a better coder, computer scientist, and regular expression fanatic.
