Python Re Groups

This tutorial explains everything you need to know about matching groups in Python’s re package for regular expressions. You may have also read the term “capture groups” which points to the same concept.


Related article: Python Regex Superpower – The Ultimate Guide


So let’s start with the basics:

Matching Group ()

What’s a matching group?

Just as you use parentheses to structure mathematical expressions, (2 + 2) * 2 versus 2 + (2 * 2), you use parentheses to structure regular expressions. An example regex that does this is 'a(b|c)'. The whole content enclosed in the opening and closing parentheses is called a matching group (or capture group). You can have multiple matching groups in a single regex. You can even nest matching groups, for example 'a(b|(cd))'.

One big advantage of a matching group is that it captures the matched substring. You can retrieve it in other parts of the regular expression, or from the match object in your Python code once the whole match is done.
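As a minimal sketch of capturing with nested groups (the test strings 'ab' and 'acd' are made up for illustration), the following snippet suggests that groups are numbered by the position of their opening parenthesis, and that a group that took no part in the match yields None:

```python
import re

pattern = 'a(b|(cd))'

# Group 1 is the outer group (b|(cd)), group 2 is the inner group (cd).
groups_ab = re.fullmatch(pattern, 'ab').groups()
groups_acd = re.fullmatch(pattern, 'acd').groups()

print(groups_ab)   # ('b', None)
print(groups_acd)  # ('cd', 'cd')
```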

Let’s have a short example for the most basic use of a matching group—to structure the regex.

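Say you create the regex b?(a.)* with the matching group (a.). It matches zero or one occurrence of the character 'b', followed by an arbitrary number of two-character sequences that each start with the character 'a'. Because every part of the pattern is optional, it can also match the empty string. Hence, re.search() finds a match in each of the strings 'bacacaca', 'aaaa', '' (the empty string), and 'Xababababab'. Note that the match does not have to cover the whole string: in 'bacacaca' it is 'bacacac', and in 'Xababababab' it is only the empty string at position 0.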
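To see what the regex b?(a.)* actually matches in these four strings, here is a small sketch that compares re.search() with re.fullmatch(). The variable names are only illustrative:

```python
import re

pattern = 'b?(a.)*'
strings = ['bacacaca', 'aaaa', '', 'Xababababab']

# re.search() reports the leftmost match, which may be shorter than the string.
found = [re.search(pattern, s).group(0) for s in strings]
print(found)  # ['bacacac', 'aaaa', '', '']

# re.fullmatch() succeeds only if the pattern covers the entire string.
full = [re.fullmatch(pattern, s) is not None for s in strings]
print(full)   # [False, True, True, False]
```

So every string contains a match, but only 'aaaa' and the empty string are matched in full.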

The use of the parentheses for structuring the regular expression is intuitive and should come naturally to you because the same rules apply as for arithmetic operations. However, there’s a more advanced use of regex groups: retrieval.

You can retrieve the matched content of each matching group. So the next question naturally arises:

How to Get the First Matching Group?

There are two scenarios when you want to access the content of your matching groups:

  1. Access the matching group in the regex pattern to reuse partially matched text from one group somewhere else.
  2. Access the matching group after the whole match operation to analyze the matched text in your Python code.

In the first case, you simply get the first matching group with the \number special sequence. For example, to get the first matching group, you’d use the \1 special sequence. Here’s an example:

>>> import re
>>> re.search(r'(j.n) is \1','jon is jon')
<re.Match object; span=(0, 10), match='jon is jon'>

You’ll use this feature a lot because it gives you much more expressive power: for example, you can search for a name in a text based on a given pattern and then process specifically this name in the rest of the text (and not all other names that would also fit the pattern).

Note that the numbering of the groups starts with \1 and not with \0, a rare exception to the convention that numbering in programming starts with 0.
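A common use of the \1 backreference is finding accidentally doubled words. The following is a minimal sketch with made-up example sentences. It assumes words are separated by exactly one space:

```python
import re

# \1 must repeat exactly the text that the first group captured.
pattern = r'\b(\w+) \1\b'

found = re.search(pattern, 'this is is a test')
not_found = re.search(pattern, 'this is a test')

print(found)      # <re.Match object; span=(5, 10), match='is is'>
print(not_found)  # None
```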

In the second case, you want to know the contents of the first group after the whole match. How do you do that?

The answer is also simple: call the m.group(1) method on the match object m. Here’s an example:

>>> import re
>>> m = re.search(r'(j.n)','jon is jon')
>>> m.group(1)
'jon'

The numbering works consistently with the previously introduced regex group numbering: start with identifier 1 to access the contents of the first group.

How to Get All Other Matching Groups?

Again, there are two different intentions when asking this question:

  1. Access the matching group in the regex pattern to reuse partially matched text from one group somewhere else.
  2. Access the matching group after the whole match operation to analyze the matched text in your Python code.

In the first case, you use the special sequence \2 to access the second matching group, \3 to access the third matching group, and \99 to access the ninety-ninth matching group.

Here’s an example:

>>> import re
>>> re.search(r'(j..) (j..)\s+\2', 'jon jim jim')
<re.Match object; span=(0, 11), match='jon jim jim'>
>>> re.search(r'(j..) (j..)\s+\2', 'jon jim jon')
>>> 

As you can see, the special sequence \2 refers to the matching contents of the second group 'jim'.

In the second case, you can simply increase the identifier to access the other matching groups in your Python code:

>>> import re
>>> m =  re.search(r'(j..) (j..)\s+\2', 'jon jim jim')
>>> m.group(0)
'jon jim jim'
>>> m.group(1)
'jon'
>>> m.group(2)
'jim'

This code also shows an interesting feature: if you pass the identifier 0 to the m.group() method, the regex module gives you the contents of the whole match. You can think of the whole match as group number 0.
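If you need several groups at once, the match object offers a few shortcuts. The sketch below reuses the regex from the example above and shows m.groups(), m.group() with several arguments, and m.span(). The variable names are only illustrative:

```python
import re

m = re.search(r'(j..) (j..)\s+\2', 'jon jim jim')

all_groups = m.groups()      # every captured group as a tuple
both = m.group(1, 2)         # several groups in one call
span_of_second = m.span(2)   # start and end index of group 2

print(all_groups)      # ('jon', 'jim')
print(both)            # ('jon', 'jim')
print(span_of_second)  # (4, 7)
```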

Named Groups: (?P<name>…) and (?P=name)

Accessing the captured group using the notation \number is not always convenient and sometimes not even possible (for example if you have more than 99 groups in your regex). A major disadvantage of regular expressions is that they tend to be hard to read. It’s therefore important to know about the different tweaks to improve readability.

One such optimization is a named group. It’s really just that: a matching group that captures part of the match but with one twist: it has a name. You define it with (?P<name>…) and can then use this name to refer back to the captured group at a later point in your regular expression pattern with (?P=name). This can improve the readability of the regular expression.

import re
pattern = '(?P<quote>["\']).*(?P=quote)'
text = 'She said "hi"'
print(re.search(pattern, text))
# <re.Match object; span=(9, 13), match='"hi"'>

The code searches for substrings that are enclosed in either single or double quotes. You first match the opening quote with the character class ["\'] inside the named group quote. The backslash in \' escapes the single quote so that Python does not assume (wrongly) that it ends the string literal. You then use the backreference (?P=quote) to match the closing quote, which must be the same character as the opening one (either a single or a double quote).
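Named groups can also be read from the match object in your Python code. The following sketch reuses the pattern from above and shows access by name, by number, and via m.groupdict(). The variable names are only illustrative:

```python
import re

pattern = '(?P<quote>["\']).*(?P=quote)'
m = re.search(pattern, 'She said "hi"')

by_name = m.group('quote')   # access the group by its name
by_index = m.group(1)        # a named group still gets a number
as_dict = m.groupdict()      # all named groups as a dictionary

print(by_name)   # "
print(by_index)  # "
print(as_dict)   # {'quote': '"'}
```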

Non-Capturing Groups (?:…)

In the previous examples, you’ve seen how to match and capture groups with the parentheses (...). You’ve learned that each match of this basic group operator is captured so that you can retrieve it later in the regex with the special sequences \1, \2, …, \99 or after the match on the match object m with the methods m.group(1), m.group(2), and so on.

But what if you don’t need that? What if you just need to keep your regex pattern in order—but you don’t want to capture the contents of a matching group?

The simple solution is the non-capturing group operation (?: ... ). You can use it just like the capturing group operation ( ... ). Here’s an example:

>>> import re
>>> re.search('(?:python|java) is great', 'python is great. java is great.')
<re.Match object; span=(0, 15), match='python is great'>

The non-capturing group exists for the sole purpose of structuring the regex. You cannot use its content later:

>>> m = re.search('(?:python|java) is great', 'python is great. java is great.')
>>> m.group(1)
Traceback (most recent call last):
  File "<pyshell#28>", line 1, in <module>
    m.group(1)
IndexError: no such group
>>> 

If you try to access the contents of the non-capturing group, the m.group() call raises an IndexError: no such group.

Of course, there’s a straightforward alternative to non-capturing groups. You can simply use the normal (capturing) group but don’t access its contents. Only rarely will the performance penalty of capturing a group that’s not needed have any meaningful impact on your overall application.
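One case where the choice between the two group types is visible is re.findall(): if the pattern contains a capturing group, the function returns the captured text instead of the whole match. A minimal sketch based on the example string from above:

```python
import re

text = 'python is great. java is great.'

# With a capturing group, findall() returns only the captured text.
captured = re.findall('(python|java) is great', text)
# With a non-capturing group, findall() returns the whole match.
whole = re.findall('(?:python|java) is great', text)

print(captured)  # ['python', 'java']
print(whole)     # ['python is great', 'java is great']
```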

Positive Lookahead (?=…)

The concept of lookahead is a very powerful one and any advanced coder should know it. A friend recently told me that he had written a complicated regex that ignores the order of occurrences of two words in a given text. It’s a challenging problem and without the concept of lookahead, the resulting code will be complicated and hard to understand. However, the concept of lookahead makes this problem simple to write and read.

But first things first: how does the lookahead assertion work?

In normal regular expression processing, the regex is matched from left to right. The regex engine “consumes” partially matching substrings. The consumed substring cannot be matched by any other part of the regex.

Figure: A simple example of lookahead. The regular expression engine matches (“consumes”) the string partially. Then it checks whether the remaining pattern could be matched without actually matching it.

Think of the lookahead assertion as a non-consuming pattern match. The regex engine goes from left to right, searching for the pattern. At each step, it has one “current” position and tries to “consume” the next character as a (partial) match of the pattern.

The advantage of the lookahead expression is that it doesn’t consume anything. It just “looks ahead” starting from the current position whether what follows would theoretically match the lookahead pattern. If it doesn’t, the regex engine cannot move on. Next, it “backtracks”—which is just a fancy way of saying: it goes back to a previous decision and tries to match something else.
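Here is a minimal sketch of the non-consuming behavior. The example string 'foobar foobaz' and the variable names are made up for illustration:

```python
import re

text = 'foobar foobaz'

# 'foo' is matched only where 'bar' follows, but 'bar' itself is not consumed.
m = re.search('foo(?=bar)', text)
print(m.group(0), m.span())  # foo (0, 3)

# Because 'bar' was not consumed, the rest of the pattern can still match it.
m2 = re.search('foo(?=bar)bar', text)
print(m2.group(0))  # foobar
```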

Positive Lookahead Example: How to Match Two Words in Arbitrary Order?

What if you want to search a given text for pattern A AND pattern B—but in no particular order? If both patterns appear anywhere in the string, the whole string should be returned as a match.

Now, this is a bit more complicated because any regular expression pattern is ordered from left to right. A simple solution is to use the lookahead assertion (?=.*A) to check whether regex A appears anywhere in the string. (Note we assume a single-line string as the .* pattern doesn’t match the newline character by default.)

Let’s first have a look at the minimal solution to check for two patterns anywhere in the string (say, patterns ‘hi’ AND ‘you’).

>>> import re
>>> pattern = '(?=.*hi)(?=.*you)'
>>> re.findall(pattern, 'hi how are yo?')
[]
>>> re.findall(pattern, 'hi how are you?')
['']

In the first example, only ‘hi’ appears but not ‘you’, so there is no match. In the second example, both words appear.

Let’s go back to the expression (?=.*hi)(?=.*you) to match strings that contain both ‘hi’ and ‘you’. Why does it work?

The reason is that the lookahead expressions don’t consume anything. You first search for an arbitrary number of characters .*, followed by the word hi. But because the regex engine hasn’t consumed anything, it’s still in the same position at the beginning of the string. So, you can repeat the same for the word you.

Note that this method doesn’t care about the order of the two words:

>>> import re
>>> pattern = '(?=.*hi)(?=.*you)'
>>> re.findall(pattern, 'hi how are you?')
['']
>>> re.findall(pattern, 'you are how? hi!')
['']

No matter which word “hi” or “you” appears first in the text, the regex engine finds both.

You may ask: why’s the output the empty string? The reason is that the regex engine hasn’t consumed any character. It just checked the lookaheads. So the easy fix is to consume all characters as follows:

>>> import re
>>> pattern = '(?=.*hi)(?=.*you).*'
>>> re.findall(pattern, 'you fly high')
['you fly high']

Now, the whole string is a match because after checking the lookahead with ‘(?=.*hi)(?=.*you)’, you also consume the whole string ‘.*’.
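The same idea can be generalized to any number of words. The helper contains_all() below is a hypothetical function, not part of the re module, and only a sketch: it builds one lookahead per word, escapes each word with re.escape(), and uses the re.DOTALL flag so that multi-line strings work too.

```python
import re

def contains_all(text, words):
    """Return True if every word occurs somewhere in text, in any order."""
    # One lookahead per word; re.escape() treats each word as literal text.
    pattern = ''.join(f'(?=.*{re.escape(word)})' for word in words)
    # re.match() checks the lookaheads once, at the start of the string.
    return re.match(pattern, text, flags=re.DOTALL) is not None

print(contains_all('you fly high', ['hi', 'you']))    # True
print(contains_all('hi how are yo?', ['hi', 'you']))  # False
```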

Negative Lookahead (?!…)

The negative lookahead works just like the positive lookahead—only it checks that the given regex pattern does not occur going forward from a certain position.

Here’s an example:

>>> import re
>>> re.search('(?!.*hi.*)', 'hi say hi?')
<re.Match object; span=(8, 8), match=''>

The negative lookahead pattern (?!.*hi.*) ensures that, going forward in the string, there’s no occurrence of the substring 'hi'. The first position where this holds is position 8 (right after the second 'h'). Like the positive lookahead, the negative lookahead does not consume any character so the result is the empty string (which is a valid match of the pattern).

You can even combine multiple negative lookaheads like this:

>>> re.search(r'(?!.*hi.*)(?!\?).', 'hi say hi?')
<re.Match object; span=(8, 9), match='i'>

You search for a position where neither ‘hi’ is in the lookahead, nor does the question mark character follow immediately. This time, we consume an arbitrary character so the resulting match is the character 'i'.
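A more practical use of the negative lookahead is to exclude matches with a certain prefix. The sketch below, with a made-up word list, finds all words that do not start with 'un':

```python
import re

words = 'happy unhappy kind unkind'

# \b anchors at a word boundary; (?!un) rejects words that begin with 'un'.
not_un = re.findall(r'\b(?!un)\w+', words)
print(not_un)  # ['happy', 'kind']
```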

Group Flags (?aiLmsux:…) and (?aiLmsux)

You can control the regex engine with the flags argument of the re.findall(), re.search(), or re.match() functions. For example, if you don’t care about capitalization of your matched substring, you can pass the re.IGNORECASE flag to the regex functions:

>>> re.findall('PYTHON', 'python is great', flags=re.IGNORECASE)
['python']

But using a global flag for the whole regex is not always optimal. What if you want to ignore the capitalization only for a certain subregex?

You can do this with the group flags: a, i, L, m, s, u, and x. Each group flag has its own meaning:

Syntax: Meaning
a: If you don’t use this flag, the special Python regex symbols \w, \W, \b, \B, \d, \D, \s and \S will match Unicode characters. If you use this flag, those special symbols will match only ASCII characters, as the name suggests.
i: If you use this flag, the regex engine will perform case-insensitive matching. So if you’re searching for [A-Z], it will also match [a-z].
L: Don’t use this flag. Its use is discouraged. The idea was to make case-insensitive matching depend on your current locale, but it isn’t reliable.
m: This flag switches on the following feature: the start-of-the-string regex ‘^’ matches at the beginning of each line (rather than only at the beginning of the string). The same holds for the end-of-the-string regex ‘$’ that now also matches at the end of each line in a multi-line string.
s: Without this flag, the dot regex ‘.’ matches all characters except the newline character ‘\n’. Switch on this flag to really match all characters including the newline character.
u: Unicode matching. In Python 3, this is already the default for string patterns, so the flag has no effect there.
x: To improve the readability of complicated regular expressions, you may want to allow comments and (multi-line) formatting of the regex itself. This is possible with this flag: whitespace in the pattern is ignored (except inside a character class or when escaped), and everything from the character ‘#’ to the end of the line is treated as a comment.
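To make the effect of some of these flags concrete, here is a short sketch that passes them via the flags argument (re.M, re.S, re.X, and re.A are the short names of the corresponding flags). The example strings and variable names are made up for illustration:

```python
import re

text = 'one\ntwo'

# m: '^' also matches at the start of each line.
first_words = re.findall(r'^\w+', text)
line_words = re.findall(r'^\w+', text, flags=re.M)
print(first_words, line_words)  # ['one'] ['one', 'two']

# s: '.' also matches the newline character.
no_dotall = re.search('one.two', text)
dotall = re.search('one.two', text, flags=re.S)
print(no_dotall is None, dotall is not None)  # True True

# x: whitespace and '#' comments inside the pattern are ignored.
date = re.compile(r'''
    (\d{4})   # year
    -
    (\d{2})   # month
''', flags=re.X)
year_month = date.search('released 2024-05').groups()
print(year_month)  # ('2024', '05')

# a: \w is restricted to ASCII characters ('\u00e4' is the letter a-umlaut).
unicode_words = re.findall(r'\w+', 'b\u00e4r')
ascii_words = re.findall(r'\w+', 'b\u00e4r', flags=re.A)
print(len(unicode_words), ascii_words)  # 1 ['b', 'r']
```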

For example, if you want to switch off the differentiation of capitalization, you’ll use the i flag as follows:

>>> re.findall('(?i:PYTHON)', 'python is great')
['python']

You can also switch off case sensitivity for the whole regex with the “global group flag” (?i) at the start of the pattern as follows:

>>> re.findall('(?i)PYTHON', 'python is great')
['python']
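The difference between the two variants shows up when only a part of the pattern should ignore capitalization. The following sketch uses a made-up example string and assumes Python 3.6 or later, where the scoped form (?i:…) is available:

```python
import re

text = 'PYTHON is GREAT, python is great'

# Only 'python' is case-insensitive; ' is GREAT' must match exactly.
scoped = re.findall('(?i:python) is GREAT', text)
# The global flag applies to the whole pattern.
global_flag = re.findall('(?i)python is GREAT', text)

print(scoped)       # ['PYTHON is GREAT']
print(global_flag)  # ['PYTHON is GREAT', 'python is great']
```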

Where to Go From Here?

Summary: You’ve learned about matching groups to structure the regex and capture parts of the matching result. You can then retrieve the captured groups with the \number syntax within the regex pattern itself and with the m.group(i) syntax in the Python code at a later stage.
