Python Regex Compile

Why have regular expressions survived seven decades of technological disruption? Because coders who understand regular expressions have a massive advantage when working with textual data. They can write in a single line of code what takes others dozens!

This article is all about the re.compile(pattern) method of Python’s re library. Before we dive into re.compile(), let’s get an overview of the four related methods you must understand:

  • The findall(pattern, string) method returns a list of string matches. Read more in our blog tutorial.
  • The search(pattern, string) method returns a match object of the first match. Read more in our blog tutorial.
  • The match(pattern, string) method returns a match object if the regex matches at the beginning of the string. Read more in our blog tutorial.
  • The fullmatch(pattern, string) method returns a match object if the regex matches the whole string. Read more in our blog tutorial.

Related article: Python Regex Superpower – The Ultimate Guide

Equipped with this quick overview of the most critical regex methods, let’s answer the following question:

How Does re.compile() Work in Python?

The re.compile(pattern) method returns a regular expression object (see next section)

You then use the object to call important regex methods such as search(string), match(string), fullmatch(string), and findall(string).

In short: You compile the pattern first. You search the pattern in a string second.

This two-step approach is more efficient than calling, say, search(pattern, string) at once. That is, IF you call the search() method multiple times on the same pattern. Why? Because you can reuse the compiled pattern multiple times.

Here’s an example:

import re

# These two lines ...
regex = re.compile('Py...n')
match = regex.search('Python is great')

# ... are equivalent to ...
match = re.search('Py...n', 'Python is great')

In both instances, the match variable contains the following match object:

<re.Match object; span=(0, 6), match='Python'>

But in the first case, we can find the pattern not only in the string ‘Python is great’ but also in other strings—without any redundant work of compiling the pattern again and again.

Specification:

re.compile(pattern, flags=0)

The method has up to two arguments.

We’ll explore those arguments in more detail later.

Return Value:

The re.compile(patterns, flags) method returns a regular expression object. You may ask (and rightly so):

What’s a Regular Expression Object?

Python internally creates a regular expression object (from the Pattern class) to prepare the pattern matching process. You can call the following methods on the regex object:

Method Description
Pattern.search(string[, pos[, endpos]])Searches the regex anywhere in the string and returns a match object or None. You can define start and end positions of the search.
Pattern.match(string[, pos[, endpos]])Searches the regex at the beginning of the string and returns a match object or None. You can define start and end positions of the search.
Pattern.fullmatch(string[, pos[, endpos]])Matches the regex with the whole string and returns a match object or None. You can define start and end positions of the search.
Pattern.split(string, maxsplit=0) Divides the string into a list of substrings. The regex is the delimiter. You can define a maximum number of splits.
Pattern.findall(string[, pos[, endpos]]) Searches the regex anywhere in the string and returns a list of matching substrings. You can define start and end positions of the search.
Pattern.finditer(string[, pos[, endpos]]) Returns an iterator that goes over all matches of the regex in the string (returns one match object after another). You can define the start and end positions of the search.
Pattern.sub(repl, string, count=0) Returns a new string by replacing the first count occurrences of the regex in the string (from left to right) with the replacement string repl.
Pattern.subn(repl, string, count=0) Returns a new string by replacing the first count occurrences of the regex in the string (from left to right) with the replacement string repl. However, it returns a tuple with the replaced string as the first and the number of successful replacements as the second tuple value.

If you’re familiar with the most basic regex methods, you’ll realize that all of them appear in this table. But there’s one distinction: you don’t have to define the pattern as an argument. For example, the regex method re.search(pattern, string) will internally compile a regex object p and then call p.search(string).

You can see this fact in the official implementation of the re.search(pattern, string) method:

def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a Match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)

(Source: GitHub repository of the re package)

The re.search(pattern, string) method is a mere wrapper for compiling the pattern first and calling the p.search(string) function on the compiled regex object p.

Is It Worth Using Python’s re.compile()?

No, in the vast majority of cases, it’s not worth the extra line.

Consider the following example:

import re

# These two lines ...
regex = re.compile('Py...n')
match = regex.search('Python is great')

# ... are equivalent to ...
match = re.search('Py...n', 'Python is great')

Don’t get me wrong. Compiling a pattern once and using it many times throughout your code (e.g., in a loop) comes with a big performance benefit. In some anecdotal cases, compiling the pattern first lead to 10x to 50x speedup compared to compiling it again and again.

But the reason it is not worth the extra line is that Python’s re library ships with an internal cache. At the time of this writing, the cache has a limit of up to 512 compiled regex objects. So for the first 512 times, you can be sure when calling re.search(pattern, string) that the cache contains the compiled pattern already.

Here’s the relevant code snippet from re’s GitHub repository:

# --------------------------------------------------------------------
# internals

_cache = {}  # ordered!

_MAXCACHE = 512
def _compile(pattern, flags):
    # internal: compile pattern
    if isinstance(flags, RegexFlag):
        flags = flags.value
    try:
        return _cache[type(pattern), pattern, flags]
    except KeyError:
        pass
    if isinstance(pattern, Pattern):
        if flags:
            raise ValueError(
                "cannot process flags argument with a compiled pattern")
        return pattern
    if not sre_compile.isstring(pattern):
        raise TypeError("first argument must be string or compiled pattern")
    p = sre_compile.compile(pattern, flags)
    if not (flags & DEBUG):
        if len(_cache) >= _MAXCACHE:
            # Drop the oldest item
            try:
                del _cache[next(iter(_cache))]
            except (StopIteration, RuntimeError, KeyError):
                pass
        _cache[type(pattern), pattern, flags] = p
    return p

Can you find the spots where the cache is initialized and used?

While in most cases, you don’t need to compile a pattern, in some cases, you should. These follow directly from the previous implementation:

  • You’ve got more than MAXCACHE patterns in your code.
  • You’ve got more than MAXCACHE different patterns between two same pattern instances. Only in this case, you will see “cache misses” where the cache has already flushed the seemingly stale pattern instances to make room for newer ones.
  • You reuse the pattern multiple times. Because if you don’t, it won’t make sense to use sparse memory to save them in your memory.
  • (Even then, it may only be useful if the patterns are relatively complicated. Otherwise, you won’t see a lot of performance benefits in practice.)

To summarize, compiling the pattern first and storing the compiled pattern in a variable for later use is often nothing but “premature optimization”—one of the deadly sins of beginner and intermediate programmers.

What Does re.compile() Really Do?

It doesn’t seem like a lot, does it? My intuition was that the real work is in finding the pattern in the text—which happens after compilation. And, of course, matching the pattern is the hard part. But a sensible compilation helps a lot in preparing the pattern to be matched efficiently by the regex engine—work that would otherwise have be done by the regex engine.

Regex’s compile() method does a lot of things such as:

  • Combine two subsequent characters in the regex if they together indicate a special symbol such as certain Greek symbols.
  • Prepare the regex to ignore uppercase and lowercase.
  • Check for certain (smaller) patterns in the regex.
  • Analyze matching groups in the regex enclosed in parentheses.

Here’s the implemenation of the compile() method—it looks more complicated than expected, no?

def _compile(code, pattern, flags):
    # internal: compile a (sub)pattern
    emit = code.append
    _len = len
    LITERAL_CODES = _LITERAL_CODES
    REPEATING_CODES = _REPEATING_CODES
    SUCCESS_CODES = _SUCCESS_CODES
    ASSERT_CODES = _ASSERT_CODES
    iscased = None
    tolower = None
    fixes = None
    if flags & SRE_FLAG_IGNORECASE and not flags & SRE_FLAG_LOCALE:
        if flags & SRE_FLAG_UNICODE:
            iscased = _sre.unicode_iscased
            tolower = _sre.unicode_tolower
            fixes = _ignorecase_fixes
        else:
            iscased = _sre.ascii_iscased
            tolower = _sre.ascii_tolower
    for op, av in pattern:
        if op in LITERAL_CODES:
            if not flags & SRE_FLAG_IGNORECASE:
                emit(op)
                emit(av)
            elif flags & SRE_FLAG_LOCALE:
                emit(OP_LOCALE_IGNORE[op])
                emit(av)
            elif not iscased(av):
                emit(op)
                emit(av)
            else:
                lo = tolower(av)
                if not fixes:  # ascii
                    emit(OP_IGNORE[op])
                    emit(lo)
                elif lo not in fixes:
                    emit(OP_UNICODE_IGNORE[op])
                    emit(lo)
                else:
                    emit(IN_UNI_IGNORE)
                    skip = _len(code); emit(0)
                    if op is NOT_LITERAL:
                        emit(NEGATE)
                    for k in (lo,) + fixes[lo]:
                        emit(LITERAL)
                        emit(k)
                    emit(FAILURE)
                    code[skip] = _len(code) - skip
        elif op is IN:
            charset, hascased = _optimize_charset(av, iscased, tolower, fixes)
            if flags & SRE_FLAG_IGNORECASE and flags & SRE_FLAG_LOCALE:
                emit(IN_LOC_IGNORE)
            elif not hascased:
                emit(IN)
            elif not fixes:  # ascii
                emit(IN_IGNORE)
            else:
                emit(IN_UNI_IGNORE)
            skip = _len(code); emit(0)
            _compile_charset(charset, flags, code)
            code[skip] = _len(code) - skip
        elif op is ANY:
            if flags & SRE_FLAG_DOTALL:
                emit(ANY_ALL)
            else:
                emit(ANY)
        elif op in REPEATING_CODES:
            if flags & SRE_FLAG_TEMPLATE:
                raise error("internal: unsupported template operator %r" % (op,))
            if _simple(av[2]):
                if op is MAX_REPEAT:
                    emit(REPEAT_ONE)
                else:
                    emit(MIN_REPEAT_ONE)
                skip = _len(code); emit(0)
                emit(av[0])
                emit(av[1])
                _compile(code, av[2], flags)
                emit(SUCCESS)
                code[skip] = _len(code) - skip
            else:
                emit(REPEAT)
                skip = _len(code); emit(0)
                emit(av[0])
                emit(av[1])
                _compile(code, av[2], flags)
                code[skip] = _len(code) - skip
                if op is MAX_REPEAT:
                    emit(MAX_UNTIL)
                else:
                    emit(MIN_UNTIL)
        elif op is SUBPATTERN:
            group, add_flags, del_flags, p = av
            if group:
                emit(MARK)
                emit((group-1)*2)
            # _compile_info(code, p, _combine_flags(flags, add_flags, del_flags))
            _compile(code, p, _combine_flags(flags, add_flags, del_flags))
            if group:
                emit(MARK)
                emit((group-1)*2+1)
        elif op in SUCCESS_CODES:
            emit(op)
        elif op in ASSERT_CODES:
            emit(op)
            skip = _len(code); emit(0)
            if av[0] >= 0:
                emit(0) # look ahead
            else:
                lo, hi = av[1].getwidth()
                if lo != hi:
                    raise error("look-behind requires fixed-width pattern")
                emit(lo) # look behind
            _compile(code, av[1], flags)
            emit(SUCCESS)
            code[skip] = _len(code) - skip
        elif op is CALL:
            emit(op)
            skip = _len(code); emit(0)
            _compile(code, av, flags)
            emit(SUCCESS)
            code[skip] = _len(code) - skip
        elif op is AT:
            emit(op)
            if flags & SRE_FLAG_MULTILINE:
                av = AT_MULTILINE.get(av, av)
            if flags & SRE_FLAG_LOCALE:
                av = AT_LOCALE.get(av, av)
            elif flags & SRE_FLAG_UNICODE:
                av = AT_UNICODE.get(av, av)
            emit(av)
        elif op is BRANCH:
            emit(op)
            tail = []
            tailappend = tail.append
            for av in av[1]:
                skip = _len(code); emit(0)
                # _compile_info(code, av, flags)
                _compile(code, av, flags)
                emit(JUMP)
                tailappend(_len(code)); emit(0)
                code[skip] = _len(code) - skip
            emit(FAILURE) # end of branch
            for tail in tail:
                code[tail] = _len(code) - tail
        elif op is CATEGORY:
            emit(op)
            if flags & SRE_FLAG_LOCALE:
                av = CH_LOCALE[av]
            elif flags & SRE_FLAG_UNICODE:
                av = CH_UNICODE[av]
            emit(av)
        elif op is GROUPREF:
            if not flags & SRE_FLAG_IGNORECASE:
                emit(op)
            elif flags & SRE_FLAG_LOCALE:
                emit(GROUPREF_LOC_IGNORE)
            elif not fixes:  # ascii
                emit(GROUPREF_IGNORE)
            else:
                emit(GROUPREF_UNI_IGNORE)
            emit(av-1)
        elif op is GROUPREF_EXISTS:
            emit(op)
            emit(av[0]-1)
            skipyes = _len(code); emit(0)
            _compile(code, av[1], flags)
            if av[2]:
                emit(JUMP)
                skipno = _len(code); emit(0)
                code[skipyes] = _len(code) - skipyes + 1
                _compile(code, av[2], flags)
                code[skipno] = _len(code) - skipno
            else:
                code[skipyes] = _len(code) - skipyes + 1
        else:
            raise error("internal: unsupported operand type %r" % (op,))

Don’t worry, you don’t need to understand the code. Just note that all this work would have to be done by the regex engine at “matching runtime” if you wouldn’t compile the pattern first. If we can do it only once, it’s certainly a low-hanging fruit for performance optimizations—especially for long regular expression patterns.

How to Use the Optional Flag Argument?

As you’ve seen in the specification, the compile() method comes with an optional third ‘flag’ argument:

re.compile(pattern, flags=0)

What’s the purpose of the flags argument?

Flags allow you to control the regular expression engine. Because regular expressions are so powerful, they are a useful way of switching on and off certain features (for example, whether to ignore capitalization when matching your regex).

SyntaxMeaning
re.ASCIIIf you don’t use this flag, the special Python regex symbols w, W, b, B, d, D, s and S will match Unicode characters. If you use this flag, those special symbols will match only ASCII characters — as the name suggests.
re.A Same as re.ASCII
re.DEBUG If you use this flag, Python will print some useful information to the shell that helps you debugging your regex.
re.IGNORECASE If you use this flag, the regex engine will perform case-insensitive matching. So if you’re searching for [A-Z], it will also match [a-z].
re.I Same as re.IGNORECASE
re.LOCALE Don’t use this flag — ever. It’s depreciated—the idea was to perform case-insensitive matching depending on your current locale. But it isn’t reliable.
re.L Same as re.LOCALE
re.MULTILINE This flag switches on the following feature: the start-of-the-string regex ‘^’ matches at the beginning of each line (rather than only at the beginning of the string). The same holds for the end-of-the-string regex ‘$’ that now matches also at the end of each line in a multi-line string.
re.M Same as re.MULTILINE
re.DOTALL Without using this flag, the dot regex ‘.’ matches all characters except the newline character ‘n’. Switch on this flag to really match all characters including the newline character.
re.S Same as re.DOTALL
re.VERBOSE To improve the readability of complicated regular expressions, you may want to allow comments and (multi-line) formatting of the regex itself. This is possible with this flag: all whitespace characters and lines that start with the character ‘#’ are ignored in the regex.
re.X Same as re.VERBOSE

Here’s how you’d use it in a practical example:

import re

text = 'Python is great (python really is)'

regex = re.compile('Py...n', flags=re.IGNORECASE)

matches = regex.findall(text)
print(matches)
# ['Python', 'python']

Although your regex ‘Python’ is uppercase, we ignore the capitalization by using the flag re.IGNORECASE.

Where to Go From Here?

You’ve learned about the re.compile(pattern) method that prepares the regular expression pattern—and returns a regex object which you can use multiple times in your code.

Learning Python is hard. But if you cheat, it isn’t as hard as it has to be:

Download 8 Free Python Cheat Sheets now!