This morning, I came across a Quora thread asking this precise question. There are no dumb questions, but this one suggests a gap in understanding the basics of Python and its regular expression library.
So if you’re an impatient person, here’s the short answer:
How to match an exact word/string using a regular expression in Python?
You don’t! Well, you can do it by using the straightforward regex 'hello' to match it in 'hello world'. But there’s no need to use an expensive and less readable regex to match an exact substring in a given string. Instead, simply use the pure Python expression 'hello' in 'hello world'.
So far so good. But let’s dive into some more specific questions, because this simple answer may not be exactly what you were looking for. In fact, there are multiple ways of understanding your question, and I have tried to find all interpretations and answer them one by one:
How to Check Membership of a Word in a String (Python Built-In)?
This is the simple answer you’ve already learned. Instead of matching an exact string with a regex, it’s often enough to use Python’s in keyword to check membership. As this is an efficient built-in feature of Python, it’s typically faster, more readable, and doesn’t require importing the re module.
Thus, you should rely on this method if possible:
>>> 'hello' in 'hello world'
True
The first example shows the most straightforward way of doing it: simply ask Python whether a string is “in” another string. This is called the membership operator and it’s very efficient.
You can also check whether a string does not occur in another string. Here’s how:
>>> 'hi' not in 'hello world'
True
The negative membership operator s1 not in s2 returns True if string s1 does not occur in string s2.
But there’s a problem with the membership operator. The return value is only a Boolean value. However, the advantage of Python’s regular expression library re is that it returns a match object which contains more interesting information such as the exact location of the matching substring.
So let’s explore the problem of exact string matching using the regex library next:
How to Match an Exact String (Regex)?
Here’s how you can match an exact substring in a given string:
>>> import re
>>> re.search('hello', 'hello world')
<re.Match object; span=(0, 5), match='hello'>
After importing Python’s library for regular expression processing re, you use the re.search(pattern, string) method to find the first occurrence of the pattern in the string. If you’re unsure about this method, check out my detailed tutorial on this blog.
This returns a match object that wraps a lot of useful information such as the start and stop matching positions and the matching substring. As you’re looking for exact string matches, the matching substring will always be the same as your searched word.
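To make this concrete, here’s a minimal sketch of how you might read those details off the match object. The variable names are my own choice for illustration; span() and group() are standard methods of the match object.

```python
import re

match = re.search('hello', 'hello world')

# re.search returns None when nothing matches, so check before using the result.
if match is not None:
    start, end = match.span()   # start (inclusive) and end (exclusive) positions
    found = match.group()       # the matching substring itself
    print(start, end, found)    # 0 5 hello
```

The None check matters in practice: calling span() on a failed search would raise an AttributeError.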
But wait, there’s another problem: you wanted an exact match, right? Yet this approach also matches when your searched word is only the prefix of a longer word:
>>> re.search('good', 'goodbye')
<re.Match object; span=(0, 4), match='good'>
When searching for the exact word 'good' in the string 'goodbye', it actually matches the prefix of the word. Is this what you wanted?
If not, read on:
How to Match a Word in a String (Word Boundary \b)?
So how can we fix the problem that an exact match of a word will also retrieve matching substrings that occur anywhere in the string?
Here’s an example:
>>> 'no' in 'nobody knows'
True
And another example:
>>> re.search('see', 'dfjkyldsssseels')
<re.Match object; span=(10, 13), match='see'>
What if you want to match only whole words, not arbitrary substrings? The answer is simple: use the word boundary metacharacter '\b'. This metacharacter matches at the beginning and end of each word, but it doesn’t consume anything. In other words, it simply checks whether a word starts or ends at this position (a word character on one side, and a non-word character or the start or end of the string on the other).
Here’s how you use the word boundary character to ensure that only whole words match:
>>> import re
>>> re.search(r'\bno\b', 'nobody knows')
>>> re.search(r'\bno\b', 'nobody knows nothing - no?')
<re.Match object; span=(23, 25), match='no'>
In both examples, you use the same regex '\bno\b' that searches for the exact word 'no' but only if the word boundary character '\b' matches before and after. In other words, the word 'no' must appear on its own as a separate word. It is not allowed to appear within another sequence of word characters.
As a result, the regex doesn’t match in the string 'nobody knows' but it matches in the string 'nobody knows nothing - no?'.
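If you need this check in several places, you could wrap it in a small helper. The following is only a sketch: contains_word is a hypothetical name of my own (nothing like it exists in the re module), and it assumes your word starts and ends with a word character (a letter, digit, or underscore), because '\b' only marks the edge between word and non-word characters.

```python
import re

def contains_word(word, text):
    """Return True if `word` occurs in `text` as a whole word (hypothetical helper)."""
    # re.escape keeps characters such as '.' or '+' in `word` from acting as regex syntax.
    pattern = r'\b' + re.escape(word) + r'\b'
    return re.search(pattern, text) is not None

print(contains_word('no', 'nobody knows'))                # False
print(contains_word('no', 'nobody knows nothing - no?'))  # True
```

Escaping the word is a design choice: it lets you pass arbitrary user input without it being interpreted as a regex.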
Note that we use a raw string r'...' to write the regex so that the sequence '\b' reaches the regex engine unchanged. Without the raw string, Python would interpret '\b' as the escape sequence for the backspace character, and the regex would search for a literal backspace instead of a word boundary. With the raw string, all backslashes will just be that: backslashes. The regex engine then interprets the two characters '\' and 'b' as one special metacharacter: the word boundary '\b'.
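If you want to convince yourself of the difference, here’s a small experiment you can run. The variable names plain and raw and the test string 'say no' are just my own choices for illustration.

```python
import re

plain = '\b'   # Python turns this into ONE character: backspace (ASCII 8)
raw = r'\b'    # TWO characters: a backslash followed by the letter 'b'

print(len(plain), len(raw))   # 1 2
print(plain == '\x08')        # True

# Only the raw version reaches the regex engine as the word boundary.
print(re.search('\bno\b', 'say no'))    # None (searches for literal backspaces)
print(re.search(r'\bno\b', 'say no'))   # <re.Match object; span=(4, 6), match='no'>
```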
But what if you don’t care whether the word is upper or lowercase or capitalized? In other words:
How to Match a Word in a String (Case Insensitive)?
You can search for an exact word in a string—but ignore capitalization. This way, it’ll be irrelevant whether the word’s characters are lowercase or uppercase. Here’s how:
>>> import re
>>> re.search('no', 'NONONON', flags=re.IGNORECASE)
<re.Match object; span=(0, 2), match='NO'>
>>> re.search('no', 'NONONON', flags=re.I)
<re.Match object; span=(0, 2), match='NO'>
>>> re.search('(?i)no', 'NONONON')
<re.Match object; span=(0, 2), match='NO'>
All three ways are equivalent: they all ignore the capitalization of the word’s letters. If you need to learn more about the flags argument in Python, check out my detailed tutorial on this blog. The third example uses the in-regex flag (?i) that also means: “ignore the capitalization”.
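You can combine the case-insensitive flag with the word boundary from the previous section. Here’s a minimal sketch with an example sentence I made up for illustration:

```python
import re

text = 'No means NO, and nobody said otherwise.'

# Whole-word matches only, in any capitalization.
matches = re.findall(r'\bno\b', text, flags=re.IGNORECASE)
print(matches)   # ['No', 'NO']
```

Note that 'nobody' is skipped because there is no word boundary after its first two letters.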
How to Find All Occurrences of a Word in a String?
Okay, you’re never satisfied, are you? So let’s explore how you can find all occurrences of a word in a string.
In the previous examples, you used the re.search(pattern, string) method to find the first match of the pattern in the string.
Next, you’ll learn how to find all occurrences (not only the first match) by using the re.findall(pattern, string) method. You can also read my blog tutorial about the findall() method that explains all the details.
>>> import re
>>> re.findall('no', 'nononono')
['no', 'no', 'no', 'no']
Your code retrieves all matching substrings. If you need to find all match objects rather than matching substrings, you can use the re.finditer(pattern, string) method:
>>> for match in re.finditer('no', 'nonononono'):
...     print(match)
...
<re.Match object; span=(0, 2), match='no'>
<re.Match object; span=(2, 4), match='no'>
<re.Match object; span=(4, 6), match='no'>
<re.Match object; span=(6, 8), match='no'>
<re.Match object; span=(8, 10), match='no'>
The re.finditer(pattern, string) method creates an iterator that iterates over all matches and returns the match objects. This way, you can find all matches and get the match objects as well.
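As a small sketch of what you can do with those match objects, the following collects the position of every whole-word occurrence and compares it with a plain substring count. The example text and variable names are my own.

```python
import re

text = 'no, nobody knows - no!'

# Spans of 'no' as a separate word only.
word_spans = [m.span() for m in re.finditer(r'\bno\b', text)]
print(word_spans)            # [(0, 2), (19, 21)]

# Number of times 'no' occurs as a substring (also inside 'nobody' and 'knows').
substring_count = len(re.findall('no', text))
print(substring_count)       # 4
```

Both functions return non-overlapping matches, scanning the string from left to right.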
How to Find All Lines Containing an Exact Word?
Say you want to find all lines that contain the word '42' from a multi-line string in Python. How would you do it?
The answer makes use of a fine Python regex specialty: the dot regex matches all characters, except the newline character. Thus, the regex .* will match all characters in a given line (but then stop).
Here’s how you can use this fact to get all lines that contain a certain word:
>>> import re
>>> s = '''the answer is 42
the answer: 42
42 is the answer
43 is not'''
>>> re.findall('.*42.*', s)
['the answer is 42', 'the answer: 42', '42 is the answer']
Three out of four lines contain the word '42'. The findall() method returns these as strings.
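One caveat: the pattern '.*42.*' also matches lines where '42' is only part of a longer number. If that’s not what you want, you could add word boundaries. Here’s a sketch; the last line '142 is not' is test data I changed on purpose to show the difference, and the names loose and strict are my own.

```python
import re

s = '''the answer is 42
the answer: 42
42 is the answer
142 is not'''

# Substring version: '142' contains '42', so all four lines match.
loose = re.findall('.*42.*', s)
print(len(loose))    # 4

# Whole-word version: only lines where '42' stands on its own.
strict = re.findall(r'.*\b42\b.*', s)
print(strict)        # ['the answer is 42', 'the answer: 42', '42 is the answer']
```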
How to Find All Lines Not Containing an Exact Word?
In the previous section, you’ve learned how to find all lines that contain an exact word. In this section, you’ll learn how to do the opposite: find all lines that do not contain an exact word.
This is a bit more complicated. I’ll show you the code first and explain it afterwards:
import re

s = '''the answer is 42
the answer: 42
42 is the answer
43 is not
the answer
42'''

for match in re.finditer('^((?!42).)*$', s, flags=re.M):
    print(match)

'''
<re.Match object; span=(49, 58), match='43 is not'>
<re.Match object; span=(59, 69), match='the answer'>
'''
You can see that the code successfully matches only the lines that do not contain the string '42'.
How does it work?
The general idea is to match a line that doesn’t contain the string '42', print it to the shell, and move on to the next line. The re.finditer(pattern, string) method accomplishes this easily by returning an iterator over all match objects.
The regex pattern '^((?!42).)*$' matches the whole line from the first position '^' to the last position '$'. If you need a refresher on the start-of-the-line and end-of-the-line metacharacters, read this 5-min tutorial.
In between, you match an arbitrary number of characters: the asterisk quantifier does that for you. If you need help understanding the asterisk quantifier, check out this blog tutorial.
Which characters do you match? Only those positions where the unwanted word '42' does not start: the negative lookahead (?!42) checks this before each character. If you need a refresher on lookaheads, check out this tutorial.
As the lookahead itself doesn’t consume a character, we need to consume it manually by adding the dot metacharacter . which matches all characters except the newline character '\n'. As it turns out, there’s also a blog tutorial on the dot metacharacter.
Finally, you need to set the re.MULTILINE flag, in short: re.M, because it allows the start ^ and end $ metacharacters to match also at the start and end of each line (not only at the start and end of the whole string).
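To see the effect of the flag in isolation, here’s a tiny sketch with a made-up three-line string. The pattern r'^\d+' (digits at the start) and the variable names are my own choices for illustration.

```python
import re

s = '42 first\n43 second\n44 third'

# Without the flag, ^ matches only at the very start of the string.
without_flag = re.findall(r'^\d+', s)
print(without_flag)   # ['42']

# With re.M, ^ also matches right after each newline.
with_flag = re.findall(r'^\d+', s, flags=re.M)
print(with_flag)      # ['42', '43', '44']
```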
Together, this regular expression matches all lines that do not contain the specific word '42'.
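If you only need the lines themselves and not the match objects, a plain Python alternative may be easier to read. This is a sketch under that assumption, not a replacement for the regex in every case; it uses the standard str.splitlines() method and the not in operator, and the variable names are my own.

```python
import re

s = '''the answer is 42
the answer: 42
42 is the answer
43 is not
the answer
42'''

# Regex version from above, reduced to the matching strings.
regex_lines = [m.group() for m in re.finditer('^((?!42).)*$', s, flags=re.M)]

# Plain Python version: split into lines and filter with the membership operator.
plain_lines = [line for line in s.splitlines() if '42' not in line]

print(regex_lines)   # ['43 is not', 'the answer']
print(plain_lines)   # ['43 is not', 'the answer']
```

The regex version keeps the positions of each line in the original string, which the list comprehension does not.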
Where to Go From Here?
Summary: You’ve learned multiple ways of matching an exact word in a string. You can use the simple Python membership operator. You can use a plain regex with no special metacharacters. You can use the word boundary metacharacter '\b' to match only whole words. You can match case-insensitively by using the flags argument re.IGNORECASE. You can match not only one but all occurrences of a word in a string by using the re.findall() and re.finditer() methods. And you can match all lines containing and not containing a certain word.
Phew. This was some theory-heavy stuff. Do you feel like you need some more practical stuff next?
Then check out my practice-heavy Python freelancer course that helps you prepare for the worst and create a second income stream by creating your thriving coding side-business online.