Python Re Question Mark (?): Optional Match

Congratulations, you’re about to learn one of the most frequently used regex operators: the question mark quantifier A?.

In particular, this article is all about the ? quantifier in Python’s re library. You can also watch the explainer video while you skim through the tutorial:

Related article: Python Regex Superpower – The Ultimate Guide

What’s the Python Re ? Quantifier

When applied to regular expression A, Python’s A? quantifier matches either zero or one occurrences of A. The ? quantifier always applies only to the preceding regular expression. For example, the regular expression ‘hey?’ matches both strings ‘he’ and ‘hey’. But it does not match the empty string because the ? quantifier does not apply to the whole regex ‘hey’ but only to the preceding regex ‘y’.

Let’s study two basic examples to help you gain a deeper understanding. Do you get all of them?

>>> import re
>>>
>>> re.findall('aa[cde]?', 'aacde aa aadcde')
['aac', 'aa', 'aad']
>>>
>>> re.findall('aa?', 'accccacccac')
['a', 'a', 'a']
>>>
>>> re.findall('[cd]?[cde]?', 'ccc dd ee')
['cc', 'c', '', 'dd', '', 'e', 'e', '']

Don’t worry if you had problems understanding those examples. You’ll learn about them next. Here’s the first example:

>>> re.findall('aa[cde]?', 'aacde aa aadcde')
['aac', 'aa', 'aad']

You use the re.findall() method. In case you don’t know it, here’s the definition from the Finxter blog article:

The re.findall(pattern, string) method finds all occurrences of the pattern in the string and returns a list of all matching substrings.

Please consult the blog article to learn everything you need to know about this fundamental Python method.

The first argument is the regular expression pattern ‘aa[cde]?’. The second argument is the string to be searched for the pattern. In plain English, you want to find all patterns that start with two ‘a’ characters, followed by one optional character—which can be either ‘c’, ‘d’, or ‘e’.

The findall() method returns three matching substrings:

  • First, string ‘aac’ matches the pattern. After Python consumes the matched substring, the remaining substring is ‘de aa aadcde’.
  • Second, string ‘aa’ matches the pattern. Python consumes it which leads to the remaining substring ‘ aadcde’.
  • Third, string ‘aad’ matches the pattern in the remaining substring. What remains is ‘cde’ which doesn’t contain a matching substring anymore.

The second example is the following:

>>> re.findall('aa?', 'accccacccac')
['a', 'a', 'a']

In this example, you’re looking at the simple pattern ‘aa?’. You want to find all occurrences of character ‘a’ followed by an optional second ‘a’. But be aware that the optional second ‘a’ is not needed for the pattern to match.

Therefore, the regex engine finds three matches: the characters ‘a’.

The third example is the following:

>>> re.findall('[cd]?[cde]?', 'ccc dd ee')
['cc', 'c', '', 'dd', '', 'e', 'e', '']

This regex pattern looks complicated: ‘[cd]?[cde]?’. But is it really?

Let’s break it down step-by-step:

The first part of the regex [cd]? defines a character class [cd] which reads as “match either c or d”. The question mark quantifier indicates that you want to match either one or zero occurrences of this pattern.

The second part of the regex [cde]? defines a character class [cde] which reads as “match either c, d, or e”. Again, the question mark indicates the zero-or-one matching requirement.

As both parts are optional, the empty string matches the regex pattern. However, the Python regex engine attempts as much as possible.

Thus, the regex engine performs the following steps:

  • The first match in the string ‘ccc dd ee’ is ‘cc’. The regex engine consumes the matched substring, so the string ‘c dd ee’ remains.
  • The second match in the remaining string is the character ‘c’. The empty space ‘ ‘ does not match the regex so the second part of the regex [cde] does not match. Because of the question mark quantifier, this is okay for the regex engine. The remaining string is ‘ dd ee’.
  • The third match is the empty string ”. Of course, Python does not attempt to match the same position twice. Thus, it moves on to process the remaining string ‘dd ee’.
  • The fourth match is the string ‘dd’. The remaining string is ‘ ee’.
  • The fifth match is the string ”. The remaining string is ‘ee’.
  • The sixth match is the string ‘e’. The remaining string is ‘e’.
  • The seventh match is the string ‘e’. The remaining string is ”.
  • The eighth match is the string ”. Nothing remains.

This was the most complicated of our examples. Congratulations if you understood it completely!

[Collection] What Are The Different Python Re Quantifiers?

The question mark quantifier—Python re ?—is only one of many regex operators. If you want to use (and understand) regular expressions in practice, you’ll need to know all of them by heart!

So let’s dive into the other operators:

A regular expression is a decades-old concept in computer science. Invented in the 1950s by famous mathematician Stephen Cole Kleene, the decades of evolution brought a huge variety of operations. Collecting all operations and writing up a comprehensive list would result in a very thick and unreadable book by itself.

Fortunately, you don’t have to learn all regular expressions before you can start using them in your practical code projects. Next, you’ll get a quick and dirty overview of the most important regex operations and how to use them in Python. In follow-up chapters, you’ll then study them in detail — with many practical applications and code puzzles.

Here are the most important regex quantifiers:

QuantifierDescriptionExample
.The wild-card (‘dot’) matches any character in a string except the newline character ‘\n’.Regex ‘…’ matches all words with three characters such as ‘abc’, ‘cat’, and ‘dog’.
*The zero-or-more asterisk matches an arbitrary number of occurrences (including zero occurrences) of the immediately preceding regex.Regex ‘cat*’ matches the strings ‘ca’, ‘cat’, ‘catt’, ‘cattt’, and ‘catttttttt’.
?The zero-or-one matches (as the name suggests) either zero or one occurrences of the immediately preceding regex. Regex ‘cat?’ matches both strings ‘ca’ and ‘cat’ — but not ‘catt’, ‘cattt’, and ‘catttttttt’.
+The at-least-one matches one or more occurrences of the immediately preceding regex. Regex ‘cat+’ does not match the string ‘ca’ but matches all strings with at least one trailing character ‘t’ such as ‘cat’, ‘catt’, and ‘cattt’.
^The start-of-string matches the beginning of a string. Regex ‘^p’ matches the strings ‘python’ and ‘programming’ but not ‘lisp’ and ‘spying’ where the character ‘p’ does not occur at the start of the string.
$The end-of-string matches the end of a string. Regex ‘py$’ would match the strings ‘main.py’ and ‘pypy’ but not the strings ‘python’ and ‘pypi’.
A|BThe OR matches either the regex A or the regex B. Note that the intuition is quite different from the standard interpretation of the or operator that can also satisfy both conditions. Regex ‘(hello)|(hi)’ matches strings ‘hello world’ and ‘hi python’. It wouldn’t make sense to try to match both of them at the same time.
AB The AND matches first the regex A and second the regex B, in this sequence. We’ve already seen it trivially in the regex ‘ca’ that matches first regex ‘c’ and second regex ‘a’.

Note that I gave the above operators some more meaningful names (in bold) so that you can immediately grasp the purpose of each regex. For example, the ‘^’ operator is usually denoted as the ‘caret’ operator. Those names are not descriptive so I came up with more kindergarten-like words such as the “start-of-string” operator.

We’ve already seen many examples but let’s dive into even more!

import re

text = '''
    Ha! let me see her: out, alas! he's cold:
    Her blood is settled, and her joints are stiff;
    Life and these lips have long been separated:
    Death lies on her like an untimely frost
    Upon the sweetest flower of all the field.
'''

print(re.findall('.a!', text))
'''
Finds all occurrences of an arbitrary character that is
followed by the character sequence 'a!'.
['Ha!']
'''

print(re.findall('is.*and', text))
'''
Finds all occurrences of the word 'is',
followed by an arbitrary number of characters
and the word 'and'.
['is settled, and']
'''

print(re.findall('her:?', text))
'''
Finds all occurrences of the word 'her',
followed by zero or one occurrences of the colon ':'.
['her:', 'her', 'her']
'''

print(re.findall('her:+', text))
'''
Finds all occurrences of the word 'her',
followed by one or more occurrences of the colon ':'.
['her:']
'''


print(re.findall('^Ha.*', text))
'''
Finds all occurrences where the string starts with
the character sequence 'Ha', followed by an arbitrary
number of characters except for the new-line character. 
Can you figure out why Python doesn't find any?
[]
'''

print(re.findall('\n$', text))
'''
Finds all occurrences where the new-line character '\n'
occurs at the end of the string.
['\n']
'''

print(re.findall('(Life|Death)', text))
'''
Finds all occurrences of either the word 'Life' or the
word 'Death'.
['Life', 'Death']
'''

In these examples, you’ve already seen the special symbol ‘\n’ which denotes the new-line character in Python (and most other languages). There are many special characters, specifically designed for regular expressions. Next, we’ll discover the most important special symbols.

What’s the Difference Between Python Re ? and * Quantifiers?

You can read the Python Re A? quantifier as zero-or-one regex: the preceding regex A is matched either zero times or exactly once. But it’s not matched more often.

Analogously, you can read the Python Re A* operator as the zero-or-multiple-times regex (I know it sounds a bit clunky): the preceding regex A is matched an arbitrary number of times.

Here’s an example that shows the difference:

>>> import re
>>> re.findall('ab?', 'abbbbbbb')
['ab']
>>> re.findall('ab*', 'abbbbbbb')
['abbbbbbb']

The regex ‘ab?’ matches the character ‘a’ in the string, followed by character ‘b’ if it exists (which it does in the code).

The regex ‘ab*’ matches the character ‘a’ in the string, followed by as many characters ‘b’ as possible.

What’s the Difference Between Python Re ? and + Quantifiers?

You can read the Python Re A? quantifier as zero-or-one regex: the preceding regex A is matched either zero times or exactly once. But it’s not matched more often.

Analogously, you can read the Python Re A+ operator as the at-least-once regex: the preceding regex A is matched an arbitrary number of times but at least once.

Here’s an example that shows the difference:

>>> import re
>>> re.findall('ab?', 'aaaaaaaa')
['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a']
>>> re.findall('ab+', 'aaaaaaaa')
[]

The regex ‘ab?’ matches the character ‘a’ in the string, followed by character ‘b’ if it exists—but it doesn’t in the code.

The regex ‘ab+’ matches the character ‘a’ in the string, followed by as many characters ‘b’ as possible—but at least one. However, the character ‘b’ does not exist so there’s no match.

What are Python Re *?, +?, ?? Quantifiers?

You’ve learned about the three quantifiers:

  • The quantifier A* matches an arbitrary number of patterns A.
  • The quantifier A+ matches at least one pattern A.
  • The quantifier A? matches zero-or-one pattern A.

Those three are all greedy: they match as many occurrences of the pattern as possible. Here’s an example that shows their greediness:

>>> import re
>>> re.findall('a*', 'aaaaaaa')
['aaaaaaa', '']
>>> re.findall('a+', 'aaaaaaa')
['aaaaaaa']
>>> re.findall('a?', 'aaaaaaa')
['a', 'a', 'a', 'a', 'a', 'a', 'a', '']

The code shows that all three quantifiers *, +, and ? match as many ‘a’ characters as possible.

So, the logical question is: how to match as few as possible? We call this non-greedy matching. You can append the question mark after the respective quantifiers to tell the regex engine that you intend to match as few patterns as possible: *?, +?, and ??.

Here’s the same example but with the non-greedy quantifiers:

>>> import re
>>> re.findall('a*?', 'aaaaaaa')
['', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '']
>>> re.findall('a+?', 'aaaaaaa')
['a', 'a', 'a', 'a', 'a', 'a', 'a']
>>> re.findall('a??', 'aaaaaaa')
['', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '', 'a', '']

In this case, the code shows that all three quantifiers *?, +?, and ?? match as few ‘a’ characters as possible.

Related Re Methods

There are five important regular expression methods which you should master:

  • The re.findall(pattern, string) method returns a list of string matches. Read more in our blog tutorial.
  • The re.search(pattern, string) method returns a match object of the first match. Read more in our blog tutorial.
  • The re.match(pattern, string) method returns a match object if the regex matches at the beginning of the string. Read more in our blog tutorial.
  • The re.fullmatch(pattern, string) method returns a match object if the regex matches the whole string. Read more in our blog tutorial.
  • The re.compile(pattern) method prepares the regular expression pattern—and returns a regex object which you can use multiple times in your code. Read more in our blog tutorial.
  • The re.split(pattern, string) method returns a list of strings by matching all occurrences of the pattern in the string and dividing the string along those. Read more in our blog tutorial.
  • The re.sub(The re.sub(pattern, repl, string, count=0, flags=0) method returns a new string where all occurrences of the pattern in the old string are replaced by repl. Read more in our blog tutorial.

These seven methods are 80% of what you need to know to get started with Python’s regular expression functionality.

Where to Go From Here?

You’ve learned everything you need to know about the question mark quantifier ? in this regex tutorial.

Summary: When applied to regular expression A, Python’s A? quantifier matches either zero or one occurrences of A. The ? quantifier always applies only to the preceding regular expression. For example, the regular expression ‘hey?’ matches both strings ‘he’ and ‘hey’. But it does not match the empty string because the ? quantifier does not apply to the whole regex ‘hey’ but only to the preceding regex ‘y’.

Want to earn money while you learn Python? Average Python programmers earn more than $50 per hour. You can certainly become average, can’t you?

Join the free webinar that shows you how to become a thriving coding business owner online!

[Webinar] Become a Six-Figure Freelance Developer with Python

Join us. It’s fun! 🙂