Problem Formulation and Solution Overview
The Regular Expression, also referred to as regex, is a complex pattern to search for and locate matching character(s) within a string. At first, this concept may seem daunting, but with practice, regex will improve your coding skills dramatically.
| Born Reginald Kenneth Dwight on 25 March 1947, John is a British singer, pianist and composer. John is commonly nicknamed Rocket Man after his hit of the same name. John has led a successful career as a solo artist since the 1970s. |
Preparation
To run these code examples error-free, the regex library must be installed and imported. Click here for installation instructions.
import re # or import regex
Method 1: Use regex findall()
The re.findall() function can be found in the regex library. This function searches for matching patterns in a string and has the following syntax: re.findall(pattern, string, flags=0)
import re
elton_bio = """
Born Reginald Kenneth Dwight on 25 March 1947,
John is a British singer, pianist and composer.
John is commonly nicknamed Rocket Man after his
hit of the same name. JoHn has led a successful
career as a solo artist since the 1970s.
"""
matches = re.findall(r'J\w+', elton_bio, re.IGNORECASE | re.MULTILINE)
print(matches)Above imports the regex library.
Then a multi-line string is declared containing a snippet of Elton John’s Biography. This saves to elton_bio.
Next, re.findall() is called and passed the following arguments:
- The search pattern (
r'J\w+'). Therindicates to treat the string as a raw string (ignore all escape codes). - The string to search on
elton_bio. - Two (2) regex flags. The first flag ignores the case (such as upper, lower, title). The second flag accommodates the multi-line string,
The results return as a list and save to matches.
π‘Note: When calling more than one (1) flag, separate with the pipe (|) character.
When the output is sent to the terminal, three (3) matches are found. If re.IGNORECASE, or re.I was not passed as an argument; the last element would not be considered a match.
['John', 'John', 'JoHn'] |
π‘Note: Regex flags have short-forms, such as: re.I is the same as re.IGNORECASE, re.M is the same as re.MULTIlINE.
Method 2: Use regex finditer()
This method uses re.finditer() from the regex library. This option may be best if a large number of matches is expected as it returns an iterator object instead of a list.
import re
elton_bio = """
Born Reginald Kenneth Dwight on 25 March 1947,
John is a British singer, pianist and composer.
John is commonly nicknamed Rocket Man after his
hit of the same name. JoHn has led a successful
career as a solo artist since the 1970s.
"""
result = re.finditer(r'J\w+', elton_bio)
for match in result:
print(match.group())Above imports the regex library.
Then a multi-line string is declared containing a snippet of Elton John’s Biography. This saves to elton_bio.
Then re.finditer() is called and passed two (2) arguments:
- The search pattern (
r'J\w+'). Therindicates to treat the string as a raw string (ignore all escape codes). - The multi-line string to search on
elton_bio.
An object returns and saves to result. If result was output to the terminal, an object similar to below would display.
<callable_iterator object at 0x0000021F3CB2B430> |
To view the matches, a for loop is called to output each match.group() found to the terminal.
John |
π‘Note: The output displays all three (3) matches, even though the last match is in mixed cased.
Method 3: Use regex.search()
This method uses re.search() to search for matches and return a list.
import re
elton_bio = """
Born Reginald Kenneth Dwight on 25 March 1947,
John is a British singer, pianist and composer.
John is commonly nicknamed Rocket Man after his
hit of the same name. JoHn has led a successful
career as a solo artist since the 1970s.
"""
def find_all(regex, text):
match_list = []
while True:
match = re.search(regex, text)
if match:
match_list.append(match.group(0))
text = text[match.end():]
else:
return match_list
print(find_all(r'J\w+', elton_bio))Above imports the regex library.
Then a multi-line string is declared containing a snippet of Elton John’s Biography. This saves to elton_bio.
Next, the function find_all is defined with two (2) arguments: the regex pattern (regex) and the string to search (text).
The following lines loop through the string, searching for pattern matches. These matches are extracted and appended to match_list.
Finally, the above function is called and passed the appropriate arguments. The results return and are output to the terminal.
['John', 'John', 'JoHn'] |
π‘Note: The output displays all three (3) matches, even though the last match is in mixed cased.
Method 4: Use regex sub()
What happens if you want to extract each occurrence of ‘John’ and replace it with ‘Elton John’? You could use regex.sub() with the following syntax: re.sub(pattern, replacement, string[, count, flags])
import re
elton_bio = """
Born Reginald Kenneth Dwight on 25 March 1947,
John is a British singer, pianist and composer.
John is commonly nicknamed Rocket Man after his
hit of the same name. JoHn has led a successful
career as a solo artist since the 1970s.
"""
new_ebio = re.sub(r'J\w+', 'Elton John', elton_bio)
print(new_ebio)Above imports the regex library.
Then a multi-line string is declared containing a snippet of Elton John’s Biography. This saves to elton_bio.
The following line calls re.sub() with three (3) arguments:
- The search pattern (
r'J\w+'). Therindicates to treat the string as a raw string (ignore all escape codes). - The replacement string ‘
Elton John‘. - The multi-line string to apply this on
elton_bio.
The results save to new_ebio and are output to the terminal.
Born Reginald Kenneth Dwight on 25 March 1947, Elton John is a British singer, pianist and composer. Elton John is commonly nicknamed Rocket Man after his hit of the same name. Elton John has led a successful career as a solo artist since the 1970s. |
Summary
Regex Humor
