Python Regex Split

Why have regular expressions survived seven decades of technological disruption? Because coders who understand regular expressions have a massive advantage when working with textual data. They can write in a single line of code what takes others dozens!

This article is all about the re.split(pattern, string) method of Python’s re library.

Related article: Python Regex Superpower – The Ultimate Guide

Do you want to master the regex superpower? Check out my new book The Smartest Way to Learn Regular Expressions in Python with the innovative 3-step approach for active learning: (1) study a book chapter, (2) solve a code puzzle, and (3) watch an educational chapter video.

Let’s answer the following question:

How Does re.split() Work in Python?

The re.split(pattern, string, maxsplit=0, flags=0) method returns a list of strings by matching all occurrences of the pattern in the string and dividing the string at those matches.

Here’s a minimal example:

>>> import re
>>> string = 'Learn Python with\t     Finxter!'
>>> re.split(r'\s+', string)
['Learn', 'Python', 'with', 'Finxter!']

The string contains four words that are separated by whitespace characters (in particular: the space character ‘ ‘ and the tab character ‘\t’). You use the regular expression ‘\s+’ to match every run of one or more consecutive whitespace characters. The matched substrings serve as delimiters. The result is the string divided along those delimiters.
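To see why the ‘+’ matters, here is a small sketch that compares ‘\s+’ with the plain ‘\s’ on the same string. The variable names words and pieces are only for illustration:

```python
import re

string = 'Learn Python with\t     Finxter!'

# One or more whitespace characters form a single delimiter.
words = re.split(r'\s+', string)
print(words)  # ['Learn', 'Python', 'with', 'Finxter!']

# With a single whitespace character per delimiter, every pair of
# adjacent whitespace characters leaves an empty string in the result.
pieces = re.split(r'\s', string)
print(pieces)
```

The second list contains the same four words plus a number of empty strings, one for each gap between two neighboring whitespace characters.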

But that’s not all! Let’s have a look at the formal definition of the split method.

Specification

re.split(pattern, string, maxsplit=0, flags=0)

The method has four arguments—two of which are optional.

  • pattern: the regular expression pattern you want to use as a delimiter.
  • string: the text you want to break up into a list of strings.
  • maxsplit (optional argument): the maximum number of split operations. With maxsplit=n, the returned list has at most n+1 elements, and the remainder of the string becomes the last element. Per default, the maxsplit argument is 0, which means that there is no limit.
  • flags (optional argument): a more advanced modifier that allows you to customize the behavior of the function. Per default the regex module does not consider any flags. Want to know how to use those flags? Check out this detailed article on the Finxter blog.

The first and second arguments are required. The third and fourth arguments are optional.

You’ll learn about those arguments in more detail later.

Return Value:

The regex split method returns a list of substrings obtained by using the regex as a delimiter.
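One detail about the return value is worth a quick sketch: according to Python’s documentation for re.split(), if the pattern contains capturing parentheses, the text matched by the group is included in the returned list as well. The sample string below is only an illustration:

```python
import re

text = 'a-bird-in-the-hand'

# Without a capturing group, the delimiters are dropped.
without_group = re.split(r'-', text)
print(without_group)  # ['a', 'bird', 'in', 'the', 'hand']

# With a capturing group, each matched delimiter is kept in the list.
with_group = re.split(r'(-)', text)
print(with_group)  # ['a', '-', 'bird', '-', 'in', '-', 'the', '-', 'hand']
```

This can be handy if you need to reassemble the original string later, because ''.join(with_group) gives back the input.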

Regex Split Minimal Example

Let’s study some more examples—from simple to more complex.

The easiest use is with only two arguments: the delimiter regex and the string to be split.

>>> import re
>>> string = 'fgffffgfgPythonfgisfffawesomefgffg'
>>> re.split('[fg]+', string)
['', 'Python', 'is', 'awesome', '']

You use an arbitrary number of ‘f’ or ‘g’ characters as regular expression delimiters. How do you accomplish this? By combining the character class regex [A] and the one-or-more regex A+ into the following regex: [fg]+. The strings in between are added to the return list. Because the string starts and ends with a delimiter, the list also starts and ends with an empty string.
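If the empty strings at both ends of the result get in your way, one possible approach (a minimal sketch, not the only way) is to filter them out with a list comprehension:

```python
import re

string = 'fgffffgfgPythonfgisfffawesomefgffg'

parts = re.split(r'[fg]+', string)
# The string starts and ends with a delimiter, so the first and
# last list elements are empty strings.
print(parts)  # ['', 'Python', 'is', 'awesome', '']

# Keep only the non-empty substrings.
words = [part for part in parts if part]
print(words)  # ['Python', 'is', 'awesome']
```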

How to Use the maxsplit Argument?

What if you don’t want to split the whole string but only a limited number of times? Here’s an example:

>>> string = 'a-bird-in-the-hand-is-worth-two-in-the-bush'
>>> re.split('-', string, maxsplit=5)
['a', 'bird', 'in', 'the', 'hand', 'is-worth-two-in-the-bush']
>>> re.split('-', string, maxsplit=2)
['a', 'bird', 'in-the-hand-is-worth-two-in-the-bush']

We use the simple delimiter regex ‘-‘ to divide the string into substrings. In the first method call, we set maxsplit=5 to obtain six list elements. In the second method call, we set maxsplit=2 to obtain three list elements. Can you see the pattern? With maxsplit=n, you get at most n+1 list elements.

You can also use positional arguments to save some characters:

>>> re.split('-', string, 2)
['a', 'bird', 'in-the-hand-is-worth-two-in-the-bush']

But as many coders don’t know about the maxsplit argument, you should prefer the keyword argument for readability. Since Python 3.13, passing maxsplit as a positional argument is also deprecated.
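The relationship between maxsplit and the length of the result can be checked with a short loop. This is only a sketch; the dictionary name lengths is made up for the example:

```python
import re

string = 'a-bird-in-the-hand-is-worth-two-in-the-bush'

lengths = {}
for maxsplit in (1, 2, 5):
    parts = re.split('-', string, maxsplit=maxsplit)
    # At most `maxsplit` splits happen, so the list has maxsplit + 1 items
    # as long as the string contains enough delimiters.
    lengths[maxsplit] = len(parts)
    print(maxsplit, len(parts), parts[-1])

# maxsplit=0 (the default) means "no limit": split at every delimiter.
all_parts = re.split('-', string, maxsplit=0)
print(len(all_parts))  # 11
```

The last list element always holds the unsplit remainder of the string.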

How to Use the Optional Flag Argument?

As you’ve seen in the specification, the re.split() method comes with an optional fourth ‘flags’ argument:

re.split(pattern, string, maxsplit=0, flags=0)

What’s the purpose of the flags argument?

Flags allow you to control the regular expression engine. Because regular expressions are so powerful, flags are a useful way of switching certain features on and off (for example, whether to ignore capitalization when matching your regex).

Syntax | Meaning
re.ASCII | If you don’t use this flag, the special Python regex symbols \w, \W, \b, \B, \d, \D, \s, and \S match Unicode characters. If you use this flag, those special symbols match only ASCII characters, as the name suggests.
re.A | Same as re.ASCII
re.DEBUG | If you use this flag, Python prints some useful information to the shell that helps you debug your regex.
re.IGNORECASE | If you use this flag, the regex engine performs case-insensitive matching. So if you’re searching for [A-Z], it will also match lowercase letters.
re.I | Same as re.IGNORECASE
re.LOCALE | Avoid this flag. The idea was to make case-insensitive matching (and the symbols \w, \W, \b, \B) depend on your current locale, but it isn’t reliable. Python’s documentation discourages it, and it only works with bytes patterns.
re.L | Same as re.LOCALE
re.MULTILINE | This flag switches on the following feature: the start-of-the-string regex ‘^’ matches at the beginning of each line (rather than only at the beginning of the string). The same holds for the end-of-the-string regex ‘$’ that now also matches at the end of each line in a multi-line string.
re.M | Same as re.MULTILINE
re.DOTALL | Without this flag, the dot regex ‘.’ matches all characters except the newline character ‘\n’. Switch on this flag to really match all characters including the newline character.
re.S | Same as re.DOTALL
re.VERBOSE | To improve the readability of complicated regular expressions, you may want to allow comments and (multi-line) formatting of the regex itself. This is possible with this flag: whitespace in the pattern is ignored (except inside a character class or when escaped), and everything from an unescaped ‘#’ to the end of the line is treated as a comment.
re.X | Same as re.VERBOSE

Here’s how you’d use it in a practical example:

>>> import re
>>> text = 'theXXXYYYrussiansXXxyareyYxcoming'  # sample input, chosen to match the outputs shown
>>> re.split('[xy]+', text, flags=re.I)
['the', 'russians', 'are', 'coming']

Although your regex is lowercase, we ignore the capitalization by using the flag re.I, which is short for re.IGNORECASE. If we didn’t do that, the result would be quite different:

>>> re.split('[xy]+', text)
['theXXXYYYrussiansXX', 'are', 'Y', 'coming']

As the character class [xy] only contains the lowercase characters ‘x’ and ‘y’, their uppercase variants appear in the returned list rather than being used as delimiters.
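There is more than one way to switch on case-insensitive splitting. The sketch below shows three options that should give the same result: the flags argument, the inline flag (?i) inside the pattern, and a compiled pattern object. The sample string and the variable names are made up for this illustration:

```python
import re

# Hypothetical sample input mixing lowercase and uppercase delimiters.
text = 'theXXXYYYrussiansXXxyareyYxcoming'

# Option 1: pass the flag to re.split().
via_flag = re.split(r'[xy]+', text, flags=re.IGNORECASE)

# Option 2: switch on case-insensitive matching inside the pattern.
via_inline = re.split(r'(?i)[xy]+', text)

# Option 3: compile the pattern once with the flag, then call its split() method.
pattern = re.compile(r'[xy]+', re.IGNORECASE)
via_compiled = pattern.split(text)

print(via_flag)  # ['the', 'russians', 'are', 'coming']
print(via_flag == via_inline == via_compiled)  # True
```

The compiled variant is mostly useful if you split many strings with the same pattern.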

What’s the Difference Between re.split() and string.split() Methods in Python?

The method re.split() is much more powerful. The re.split(pattern, string) method splits a string along all occurrences of a matched pattern, and the pattern can be arbitrarily complicated. In contrast, the string.split(delimiter) method splits a string into substrings along a delimiter that must be a plain, fixed string.

An example where the more powerful re.split() method is superior is in splitting a text along any whitespace characters:

import re


text = '''
    Ha! let me see her: out, alas! he's cold:
    Her blood is settled, and her joints are stiff;
    Life and these lips have long been separated:
    Death lies on her like an untimely Frost
    Upon the sweetest flower of all the field.
'''

print(re.split(r'\s+', text))
'''
['', 'Ha!', 'let', 'me', 'see', 'her:', 'out,', 'alas!',
"he's", 'cold:', 'Her', 'blood', 'is', 'settled,', 'and',
'her', 'joints', 'are', 'stiff;', 'Life', 'and', 'these',
'lips', 'have', 'long', 'been', 'separated:', 'Death',
'lies', 'on', 'her', 'like', 'an', 'untimely', 'Frost',
'Upon', 'the', 'sweetest', 'flower', 'of', 'all', 'the',
'field.', '']
'''

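The re.split() method divides the string along any positive number of whitespace characters. You couldn’t achieve such a result with string.split(delimiter) and an explicit delimiter, because the delimiter must be a fixed string. (Whitespace is a special case: calling string.split() without any argument also splits on runs of whitespace and drops the empty strings at both ends.)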
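The difference becomes clearer with delimiters that vary from place to place. In this sketch, the sample string line and the pattern are assumptions chosen for illustration: the items are separated by either a comma or a semicolon, with an arbitrary amount of surrounding whitespace.

```python
import re

line = 'apple, banana;cherry ,  date'

# str.split() accepts exactly one fixed delimiter string.
by_comma = line.split(',')
print(by_comma)  # ['apple', ' banana;cherry ', '  date']

# re.split() treats "comma or semicolon plus optional whitespace"
# as one delimiter.
by_regex = re.split(r'\s*[,;]\s*', line)
print(by_regex)  # ['apple', 'banana', 'cherry', 'date']
```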

Related Re Methods

There are five important regular expression methods which you should master:

  • The re.findall(pattern, string) method returns a list of string matches. Read more in our blog tutorial.
  • The re.search(pattern, string) method returns a match object of the first match. Read more in our blog tutorial.
  • The re.match(pattern, string) method returns a match object if the regex matches at the beginning of the string. Read more in our blog tutorial.
  • The re.fullmatch(pattern, string) method returns a match object if the regex matches the whole string. Read more in our blog tutorial.
  • The re.compile(pattern) method prepares the regular expression pattern—and returns a regex object which you can use multiple times in your code. Read more in our blog tutorial.

These five methods are 80% of what you need to know to get started with Python’s regular expression functionality.
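To get a feeling for how these five methods differ, here is a compact sketch that runs each of them on the same string. The sample text and the pattern ‘\d+’ (one or more digits) are chosen only for illustration:

```python
import re

text = 'Python 3 was released in 2008'

# findall: all matches as a list of strings.
all_numbers = re.findall(r'\d+', text)
print(all_numbers)  # ['3', '2008']

# search: a match object for the first match anywhere in the string.
first_number = re.search(r'\d+', text).group()
print(first_number)  # 3

# match: only succeeds if the pattern matches at the beginning.
at_start = re.match(r'Python', text).group()
print(at_start)  # Python

# fullmatch: only succeeds if the pattern matches the whole string.
whole = re.fullmatch(r'\d+', text)
print(whole)  # None

# compile: build a reusable regex object, here reused for split().
number = re.compile(r'\d+')
split_parts = number.split(text)
print(split_parts)  # ['Python ', ' was released in ', '']
```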

Where to Go From Here?

You’ve learned about the re.split(pattern, string) method that divides the string along the matched pattern occurrences and returns a list of substrings.
