How to Find All Lines Not Containing a Regex in Python?


Today, I stumbled upon this beautiful regex problem:

Given a multi-line string and a regex pattern, how do you find all lines that do NOT contain the regex pattern?

I’ll give you a short answer and a long answer.

The short answer:

Use the pattern '((?!regex).)*' to match all lines that do not contain regex pattern regex. The expression '(?! ...)' is a negative lookahead that ensures that the enclosed pattern ... does not follow from the current position.
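As a minimal sketch of the short answer, the helper below wraps an arbitrary regex in this pattern and anchors it per line. The function name lines_without and the sample text are hypothetical and chosen only for illustration. The sketch assumes the regex you pass in is a valid pattern.

```python
import re

def lines_without(regex, text):
    """Return all lines of `text` that contain no match for `regex`.

    Hypothetical helper for illustration only.
    """
    # ^ and $ anchor each line (thanks to re.M). The inner group consumes one
    # character at a time, but only where `regex` does not match at that position.
    pattern = re.compile(r'^(?:(?!' + regex + r').)*$', flags=re.M)
    return [m.group(0) for m in pattern.finditer(text)]

text = 'apple pie\nbanana split\ncherry tart'
print(lines_without('an', text))  # ['apple pie', 'cherry tart']
```

The non-capturing group (?: ...) is a small design choice: it keeps the match object free of a capture group that nobody reads.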

So let’s discuss this solution in greater detail. (You can also watch my explainer video if you prefer video format.)


Do you want to master the regex superpower? Check out my new book The Smartest Way to Learn Regular Expressions in Python with the innovative 3-step approach for active learning: (1) study a book chapter, (2) solve a code puzzle, and (3) watch an educational chapter video.

Detailed Example

Let’s consider a practical code snippet. I’ll show you the code first and explain it afterward:

import re
s = '''the answer is 42
the answer: 42
42 is the answer
43 is not
the answer'''

# Print a match object for every line that does not contain '42'
for match in re.finditer('^((?!42).)*$', s, flags=re.M):
    print(match)

Output:

<re.Match object; span=(49, 58), match='43 is not'>
<re.Match object; span=(59, 69), match='the answer'>

You can see that the code successfully matches only the lines that do not contain the string '42'.

How to Match a Line That Doesn’t Contain a String?

The general idea is to match a line that doesn’t contain the string '42', print it to the shell, and move on to the next line.

The re.finditer(pattern, string) function accomplishes this easily by returning an iterator over all match objects.
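To get a feel for re.finditer on its own, here is a small sketch with a simpler pattern. The sample text and the digit pattern are assumptions made for this illustration only.

```python
import re

text = 'red 1\ngreen 22\nblue 333'

# finditer yields one match object per match, from left to right.
matches = list(re.finditer(r'\d+', text))

for m in matches:
    # group(0) is the matched text, span() the (start, end) indices in text
    print(m.group(0), m.span())
```

Each match object carries both the matched text and its position, which is why the main example can print the span of every matching line.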

The regex pattern '^((?!42).)*$' matches the whole line from the first position '^' to the last position '$'.

📄 Related Tutorial: If you need a refresher on the start-of-the-line and end-of-the-line metacharacters, read this 5-min tutorial.

Python Regex - How to Match the Start of Line (^) and End of Line ($)

You match an arbitrary number of characters in between: the asterisk quantifier does that for you.

📄 Related Tutorial: If you need help understanding the asterisk quantifier, check out this blog tutorial.

Which characters do you match? Only those where you don’t have the negative word '42' in your lookahead.

📄 Related Tutorial: If you need a refresher on lookaheads, check out this tutorial.

The lookahead itself doesn’t consume a character. Thus, you need to consume it manually by adding the dot metacharacter . which matches all characters except the newline character '\n'.
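The following sketch shows the combination of lookahead and dot in isolation, without the line anchors. It assumes the same forbidden string '42' as above. The variable names are illustrative.

```python
import re

# One step of the group: check that '42' does not start here, then consume one character.
pattern = re.compile(r'((?!42).)*')

stops_early = pattern.match('the answer is 42').group(0)
runs_through = pattern.match('43 is not').group(0)

print(repr(stops_early))   # 'the answer is '
print(repr(runs_through))  # '43 is not'
```

Without the anchors, the pattern simply stops right before the first '42'. The trailing $ in the full pattern is what turns "stops early" into "does not match this line at all".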

📄 Related Tutorial: As it turns out, there’s also a blog tutorial on the dot metacharacter.

Finally, you need to set the re.MULTILINE flag, in short: re.M, because it allows the start ^ and end $ metacharacters to match at the start and end of each line as well (not only at the start and end of the whole string).
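A quick way to see the effect of the flag is to run the same pattern with and without it. This is a small sketch on a shortened, made-up version of the example string.

```python
import re

s = 'the answer is 42\n43 is not\nthe answer'
pattern = '^((?!42).)*$'

# Without re.M, ^ only matches at position 0 and $ only at the very end,
# and the dot cannot cross the newline, so nothing matches.
without_flag = [m.group(0) for m in re.finditer(pattern, s)]

# With re.M, ^ and $ also match at every line boundary.
with_flag = [m.group(0) for m in re.finditer(pattern, s, flags=re.M)]

print(without_flag)  # []
print(with_flag)     # ['43 is not', 'the answer']
```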

📄 Related Tutorial: You can read more about the flags argument at this blog tutorial.

Together, this regular expression matches all lines that do not contain the specific word '42'.
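If you want to double-check the result, you can compare it against a plain Python filter. Note that this is a different technique (splitting the string and testing each line with re.search), shown here only as a cross-check and not as the approach of this article.

```python
import re

s = '''the answer is 42
the answer: 42
42 is the answer
43 is not
the answer'''

# Approach from this article: one regex with a negative lookahead
via_lookahead = [m.group(0) for m in re.finditer('^((?!42).)*$', s, flags=re.M)]

# Cross-check: split on newlines and keep lines where the pattern is not found
via_filter = [line for line in s.split('\n') if not re.search('42', line)]

print(via_lookahead == via_filter)  # True
```

The filter version is often easier to read. The single-regex version is handy when you can only pass one pattern, for example to a tool that accepts a regex but no code.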

In case you had some problems understanding the concept of lookahead (and why it doesn’t consume anything), have a look at this explanation from the matching group tutorial on this blog:

Positive Lookahead (?=…)


The concept of lookahead is very powerful. Any advanced coder should know it.

A friend recently told me that he had written a complicated regex that ignores the order of occurrences of two words in a given text.

It’s a challenging problem, and without the concept of lookahead, the resulting code will be complicated and hard to understand. However, the concept of lookahead makes this problem simple to write and read.

But first things first: how does the lookahead assertion work?

In normal regular expression processing, the regex is matched from left to right. The regex engine “consumes” partially matching substrings. The consumed substring cannot be matched by any other part of the regex.

Figure: A simple example of lookahead. The regular expression engine matches (“consumes”) the string partially. Then it checks whether the remaining pattern could be matched without actually matching it.

Think of the lookahead assertion as a non-consuming pattern match.

The regex engine searches for the pattern from left to right. At each step, it maintains one “current” position and checks whether this position is the first position of the remaining match.

In other words, the regex engine tries to “consume” the next character as a (partial) match of the pattern.

The advantage of the lookahead expression is that it doesn’t consume anything. It just “looks ahead” starting from the current position whether what follows would theoretically match the lookahead pattern.

If it doesn’t, the regex engine cannot move on.

Next, it “backtracks”, which is just a fancy way of saying that it goes back to a previous decision and tries to match something else.
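To make "doesn’t consume anything" concrete, here is a small sketch that compares the span of a normal match with the span of a lookahead. The sample string 'hello world' is an assumption made for this illustration.

```python
import re

text = 'hello world'

consuming = re.match(r'hello', text)      # consumes five characters
lookahead = re.match(r'(?=hello)', text)  # only checks, consumes nothing

print(consuming.span())  # (0, 5)
print(lookahead.span())  # (0, 0)

# The lookahead left the current position at 0, so a pattern that follows it
# still starts matching at the beginning of the string.
combined = re.match(r'(?=hello)\w+', text)
print(combined.group(0))  # hello
```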

Positive Lookahead Example: How to Match Two Words in Arbitrary Order?

Problem Formulation: What if you want to search a given text for pattern A AND pattern B—but in no particular order? If both patterns appear anywhere in the string, the whole string should be returned as a match.

Now, this is a bit more complicated because any regular expression pattern is ordered from left to right.

A simple solution is to use the lookahead assertion (?=.*A) to check whether regex A appears anywhere in the string.

Note we assume a single line string as the .* pattern doesn’t match the newline character by default.

First, look at the minimal solution to check for two patterns anywhere in the string (say, patterns 'hi' AND 'you').

>>> import re
>>> pattern = '(?=.*hi)(?=.*you)'
>>> re.findall(pattern, 'hi how are yo?')
[]
>>> re.findall(pattern, 'hi how are you?')
['']

In the first example, only 'hi' appears ('you' is misspelled as 'yo'), so the pattern does not match. In the second example, both words appear, so it does.

Let’s go back to the expression (?=.*hi)(?=.*you) to match strings that contain both 'hi' and 'you'. Why does it work?

The reason is that the lookahead expressions don’t consume anything. You first search for an arbitrary number of characters .*, followed by the word hi.

But because the regex engine hasn’t consumed anything, it’s still in the same position at the beginning of the string. So, you can repeat the same for the word you.

Note that this method doesn’t care about the order of the two words:

>>> import re
>>> pattern = '(?=.*hi)(?=.*you)'
>>> re.findall(pattern, 'hi how are you?')
['']
>>> re.findall(pattern, 'you are how? hi!')
['']

No matter which word "hi" or "you" appears first in the text, the regex engine finds both.

You may ask: why’s the output the empty string?

The reason is that the regex engine hasn’t consumed any character. It just checked the lookaheads.

So the easy fix is to consume all characters as follows:

>>> import re
>>> pattern = '(?=.*hi)(?=.*you).*'
>>> re.findall(pattern, 'you fly high')
['you fly high']

Now, the whole string is a match because after checking the lookahead with '(?=.*hi)(?=.*you)', you also consume the whole string '.*'.
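The same idea extends to any number of words. The sketch below builds one lookahead per word. The helper name contains_all is hypothetical, and the sketch assumes a single-line string because .* does not cross newlines by default.

```python
import re

def contains_all(words, text):
    """Hypothetical helper: True if every word occurs somewhere in `text`."""
    # One positive lookahead per word. re.escape treats each word literally.
    lookaheads = ''.join('(?=.*' + re.escape(word) + ')' for word in words)
    # re.match starts at position 0, and no lookahead moves that position.
    return re.match(lookaheads, text) is not None

print(contains_all(['hi', 'you'], 'you fly high'))         # True
print(contains_all(['hi', 'you'], 'hi how are yo?'))       # False
print(contains_all(['hi', 'you', 'fly'], 'you fly high'))  # True
```

As in the example above, 'hi' counts as found inside 'high' because the lookaheads search for substrings, not whole words.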

Negative Lookahead (?!…)

The negative lookahead works just like the positive lookahead—only it checks that the given regex pattern does not occur going forward from a certain position.

Here’s an example:

>>> import re
>>> re.search('(?!.*hi.*)', 'hi say hi?')
<re.Match object; span=(8, 8), match=''>

The negative lookahead pattern (?!.*hi.*) ensures that, going forward in the string, there’s no occurrence of the substring 'hi'.

The first position where this holds is position 8 (right after the second 'h').

Like the positive lookahead, the negative lookahead does not consume any character so the result is the empty string (which is a valid match of the pattern).

You can even combine multiple negative lookaheads like this:

>>> re.search(r'(?!.*hi.*)(?!\?).', 'hi say hi?')
<re.Match object; span=(8, 9), match='i'>

You search for a position where neither 'hi' is in the lookahead, nor does the question mark character follow immediately. This time, we consume an arbitrary character, so the resulting match is the character 'i'.
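To see why the match starts at position 8, you can test the negative lookahead at every position of the string. This is a small sketch that uses the pos argument of a compiled pattern's match method. The variable names are illustrative.

```python
import re

text = 'hi say hi?'
pattern = re.compile(r'(?!.*hi.*)')

# pattern.match(text, pos) tries to match exactly at index pos.
# The lookahead succeeds only where no 'hi' occurs from pos onward.
ok_positions = [pos for pos in range(len(text) + 1) if pattern.match(text, pos)]

print(ok_positions)  # [8, 9, 10]
```

re.search returns the leftmost of these positions, which is 8. The second pattern additionally needs a character to consume that is not '?', and position 8 with the character 'i' is the first one that qualifies.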

Where to Go From Here?

Summary: You’ve learned that you can match all lines that do not match a certain regex by using the lookahead pattern ((?!regex).)*.

Regex Humor

Wait, forgot to escape a space. Wheeeeee[taptaptap]eeeeee. (source)

Python Regex Course

Google engineers are regular expression masters. The Google search engine is a massive text-processing engine that extracts value from trillions of webpages.

Facebook engineers are regular expression masters. Social networks like Facebook, WhatsApp, and Instagram connect humans via text messages.

Amazon engineers are regular expression masters. Ecommerce giants ship products based on textual product descriptions. Regular expressions rule the game when text processing meets computer science.

If you want to become a regular expression master too, check out the most comprehensive Python regex course on the planet:
