Python | Split String with Regex


Summary: The different methods to split a string using regex are:

  • re.split()
  • re.sub()
  • re.findall()
  • re.compile()

Minimal Example

import re

text = "Earth:Moon::Mars:Phobos"

# Method 1
res = re.split("[:]+", text)
print(res)

# Method 2
res = re.sub(r':', " ", text).split()
print(res)

# Method 3
res = re.findall(r"[^:\s]+", text)
print(res)

# Method 4
pattern = re.compile(r"[^:\s]+").findall
print(pattern(text))

# Output (printed once by each method)
['Earth', 'Moon', 'Mars', 'Phobos']

Problem Formulation

📜 Problem: Given a string and a delimiter, how do you split the string on that delimiter using different functions from Python's regular expression library?

Example: In the following example, the given string has to be split using an underscore as the delimiter.

# Input
text = "abc_lmn_xyz"

# Expected Output
['abc', 'lmn', 'xyz']

Method 1: re.split

The re.split(pattern, string) function finds all matches of the pattern in the string and splits the string at those matches, returning a list of the substrings between them. For example, re.split('a', 'bbabbbab') results in the list of strings ['bb', 'bbb', 'b'].

Approach: Use the re.split function and pass [_]+ as the pattern, which splits the given string at each run of one or more underscores.

Code:

import re

text = "abc_lmn_xyz"
res = re.split("[_]+", text)
print(res)

# ['abc', 'lmn', 'xyz']
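
A few behaviors of re.split are worth knowing before relying on it. The following is a minimal sketch with made-up input strings: it shows that delimiters at the edges of the string produce empty strings, that the maxsplit argument limits the number of splits, and that a capturing group keeps the delimiters in the result.

```python
import re

# Delimiters at the start or end of the string yield empty strings.
edges = re.split(r"_+", "_abc__lmn_xyz_")
print(edges)
# ['', 'abc', 'lmn', 'xyz', '']

# maxsplit caps the number of splits; the rest of the string stays intact.
first_only = re.split(r"_+", "abc_lmn_xyz", maxsplit=1)
print(first_only)
# ['abc', 'lmn_xyz']

# A capturing group around the delimiter keeps it in the result list.
with_delims = re.split(r"(_)", "abc_lmn")
print(with_delims)
# ['abc', '_', 'lmn']
```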

🚀 Related Read: Python Regex Split

Method 2: re.sub

The regex function re.sub(P, R, S) replaces all occurrences of the pattern P with the replacement R in string S. It returns a new string. For example, if you call re.sub('a', 'b', 'aabb'), the result will be the new string 'bbbb' with all characters 'a' replaced by 'b'.

Approach: Use the re.sub function to replace every underscore with a space, then call the string's split method to split the result at whitespace. Note that this only gives the expected result when the text itself contains no whitespace.

Code:

import re

text = "abc_lmn_xyz"
res = re.sub(r'_', " ", text).split()
print(res)

# ['abc', 'lmn', 'xyz']
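
The replace-then-split trick can go wrong when the fields themselves contain spaces. The sketch below uses a made-up input to show the problem and one possible workaround: replacing the delimiter with a sentinel character instead of a space. The choice of "\x00" as the sentinel is an assumption that this character never appears in the data.

```python
import re

text = "New York_Los Angeles"

# Replacing "_" with a space mixes the delimiter with the existing spaces.
naive = re.sub(r"_", " ", text).split()
print(naive)
# ['New', 'York', 'Los', 'Angeles']

# Assumption: "\x00" never occurs in the input, so it is a safe sentinel.
SENTINEL = "\x00"
safer = re.sub(r"_", SENTINEL, text).split(SENTINEL)
print(safer)
# ['New York', 'Los Angeles']
```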

🚀 Related Read: Python Regex Sub

Method 3: re.findall

The re.findall(pattern, string) method scans string from left to right, searching for all non-overlapping matches of the pattern. It returns a list of strings in the matching order when scanning the string from left to right.

Approach: Use re.findall() with the negated character class [^_\s]+ to match every run of characters that are neither underscores nor whitespace.

Code:

import re

text = "abc_lmn_xyz"
res = re.findall(r"[^_\s]+", text)
print(res)

# ['abc', 'lmn', 'xyz']
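
Matching the fields with re.findall is not always interchangeable with splitting on the delimiter. As a small illustration with a made-up input, the pattern [^_\s]+ silently drops empty fields and also treats whitespace as a separator, while re.split on the underscore preserves both.

```python
import re

text = "abc__lmn_ xyz"

# findall keeps only non-empty runs without underscores or whitespace.
found = re.findall(r"[^_\s]+", text)
print(found)
# ['abc', 'lmn', 'xyz']

# split keeps the empty field between "__" and the leading space of " xyz".
pieces = re.split(r"_", text)
print(pieces)
# ['abc', '', 'lmn', ' xyz']
```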

🚀 Related Read: Python re.findall()

Method 4: re.compile

The re.compile(pattern) function returns a regular expression object built from the pattern. This object provides the basic regex methods such as pattern.search(string), pattern.match(string), and pattern.findall(string). The explicit two-step approach of (1) compiling and (2) searching can be more efficient than calling, say, re.search(pattern, string) every time if you match the same pattern many times, because it avoids redundant compilations of the same pattern. In practice the gain is often small, since the re module caches recently compiled patterns.

Code:

import re

text = "abc_lmn_xyz"
pattern = re.compile(r"[^_\s]+").findall
print(pattern(text))

# ['abc', 'lmn', 'xyz']

Why use re.compile?

  • Efficiency: Compiling with re.compile() pays off when the same expression is used more than once. The compiled object can be applied to many different strings without rewriting the expression each time, which saves both effort and time.
  • Readability: re.compile also improves readability, because it lets you separate the definition of the regex from the places where it is used.
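
Both points can be illustrated with a short sketch. The name DELIMITERS and the sample rows are hypothetical, chosen only for illustration: the pattern is defined once under a descriptive name and then reused on several strings through the compiled object's split method.

```python
import re

# Hypothetical delimiter set for illustration: underscores or colons.
DELIMITERS = re.compile(r"[_:]+")

rows = ["abc_lmn_xyz", "Earth:Moon::Mars:Phobos"]

# The compiled object offers split() just like the module-level function.
parsed = [DELIMITERS.split(row) for row in rows]
print(parsed)
# [['abc', 'lmn', 'xyz'], ['Earth', 'Moon', 'Mars', 'Phobos']]
```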

🚀 Read: Is It Worth Using Python's re.compile()?

Exercise

Problem: Use a regex to split a string on spaces, commas, and periods, but leave commas and periods inside numbers such as 1,000 or 1.50 untouched.

Given:
my_string = "one two 3.4 25.6 seven.eight nine,ten"
Expected Output:
["one", "two", "3.4", "25.6", "seven", "eight", "nine", "ten"]

Solution

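import re

my_string = "one two 3.4 25.6 seven.eight nine,ten"
# Split on whitespace, or on a comma/period that has no digit directly before or after it
res = re.split(r'\s|(?<!\d)[,.](?!\d)', my_string)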
print(res)

# ['one', 'two', '3.4', '25.6', 'seven', 'eight', 'nine', 'ten']
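
To see how the lookarounds behave on the cases named in the problem, here is a minimal sketch that applies the same pattern to a made-up sentence containing 1,000 and 1.50. The names TOKEN_BREAK and split_tokens are hypothetical. Note that a period followed by a space produces an empty string, because the two delimiters are adjacent, so the helper filters empty pieces out.

```python
import re

# Split on whitespace, or on a comma/period with no digit on either side.
TOKEN_BREAK = re.compile(r"\s|(?<!\d)[,.](?!\d)")

def split_tokens(text):
    """Hypothetical helper: split the text and drop empty pieces."""
    return [piece for piece in TOKEN_BREAK.split(text) if piece]

sample = "apples,pears. cost 1,000 or 1.50 each"

# The adjacent "." and " " after "pears" leave an empty string behind.
raw = TOKEN_BREAK.split(sample)
print(raw)
# ['apples', 'pears', '', 'cost', '1,000', 'or', '1.50', 'each']

tokens = split_tokens(sample)
print(tokens)
# ['apples', 'pears', 'cost', '1,000', 'or', '1.50', 'each']
```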

Conclusion

We have now seen four different ways of splitting a string with Python's regular expression package. Pick the technique that best fits your needs. The goal of this tutorial was to get you acquainted with the various ways of using regex to split a string, and I hope it helped you.

Please stay tuned and subscribe for more interesting discussions and tutorials in the future. Happy coding! 🙂

