Problem Formulation and Solution Overview
A regular expression, also referred to as regex, is a pattern used to search for and locate matching characters within a string. At first, this concept may seem daunting, but with practice, regex will improve your coding skills dramatically.
The examples in this article search the following snippet of Elton John’s biography:
Born Reginald Kenneth Dwight on 25 March 1947, John is a British singer, pianist and composer. John is commonly nicknamed Rocket Man after his hit of the same name. John has led a successful career as a solo artist since the 1970s.
Preparation
These code examples use Python's built-in re module. It ships with the standard library, so nothing needs to be installed; it only needs to be imported. (The third-party regex package is a separate, optional library and is not required here.)
import re
Method 1: Use regex findall()
The re.findall() function belongs to Python's re module. It returns every non-overlapping match of a pattern in a string as a list and has the following syntax:
re.findall(pattern, string, flags=0)
import re

elton_bio = """
Born Reginald Kenneth Dwight on 25 March 1947, John is a British singer, pianist and composer.
John is commonly nicknamed Rocket Man after his hit of the same name.
JoHn has led a successful career as a solo artist since the 1970s.
"""

matches = re.findall(r'J\w+', elton_bio, re.IGNORECASE | re.MULTILINE)
print(matches)
The first line imports the re module.
Then a multi-line string is declared containing a snippet of Elton John’s biography. This saves to elton_bio.
Next, re.findall() is called and passed the following arguments:
- The search pattern (r'J\w+'). The r prefix marks a raw string, so backslashes reach the regex engine unchanged instead of being interpreted as Python escape sequences.
- The string to search on (elton_bio).
- Two (2) regex flags. The first flag, re.IGNORECASE, makes the match case-insensitive. The second flag, re.MULTILINE, makes ^ and $ match at the start and end of every line; this pattern has no anchors, so the flag has no effect here.
The results return as a list and save to matches.
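To illustrate why the pattern is written as a raw string, here is a minimal sketch. The sample text is made up for illustration. In a normal Python string, '\b' is the backspace character, whereas in a raw string it reaches the regex engine as the word-boundary token.

```python
import re

sample = "John met Johnny"  # hypothetical sample text

# Raw string: \b reaches the regex engine as a word boundary.
raw_matches = re.findall(r'\bJohn\b', sample)

# Normal string: '\b' is a backspace character, which never occurs in sample.
plain_matches = re.findall('\bJohn\b', sample)

print(raw_matches)    # ['John']
print(plain_matches)  # []
```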
💡 Note: To pass more than one (1) flag, separate the flags with the pipe (|) character.
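As a rough sketch of what each flag contributes when combined, the example below uses an anchored pattern (^J\w+) on a made-up three-line string. Both the text and the variable names are assumptions for illustration only.

```python
import re

lines = "john sings\nJohn plays piano\nPaul and John"  # hypothetical text

# No flags: ^ matches only at the very start, and J is case-sensitive.
no_flags = re.findall(r'^J\w+', lines)

# re.MULTILINE lets ^ match at the start of every line.
multiline_only = re.findall(r'^J\w+', lines, re.MULTILINE)

# Both flags combined with the pipe character.
both = re.findall(r'^J\w+', lines, re.IGNORECASE | re.MULTILINE)

print(no_flags)        # []
print(multiline_only)  # ['John']
print(both)            # ['john', 'John']
```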
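When the output is sent to the terminal, three (3) matches are found. All three would also match without re.IGNORECASE (or re.I), because each begins with an uppercase J and \w matches letters of either case. The flag only makes a difference for a name that starts with a lowercase j, such as john.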
['John', 'John', 'JoHn']
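To check how re.IGNORECASE affects this particular pattern, the sketch below runs it with and without the flag on a short made-up string that also contains a lowercase variant.

```python
import re

text = "John, JoHn and john"  # hypothetical text with a lowercase variant

case_sensitive = re.findall(r'J\w+', text)
case_insensitive = re.findall(r'J\w+', text, re.IGNORECASE)

print(case_sensitive)    # ['John', 'JoHn']
print(case_insensitive)  # ['John', 'JoHn', 'john']
```

The mixed-case JoHn is found in both runs, since only the leading J is a literal character in the pattern.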
💡 Note: Regex flags have short forms: re.I is the same as re.IGNORECASE, and re.M is the same as re.MULTILINE.
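A quick way to confirm that the short and long flag names are interchangeable is sketched below. The sample text is hypothetical.

```python
import re

# The short and long names refer to the same flag values.
print(re.I == re.IGNORECASE)  # True
print(re.M == re.MULTILINE)   # True

text = "john\nJohn"  # hypothetical text
long_form = re.findall(r'^j\w+', text, re.IGNORECASE | re.MULTILINE)
short_form = re.findall(r'^j\w+', text, re.I | re.M)

print(long_form == short_form)  # True
```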
Method 2: Use regex finditer()
This method uses re.finditer() from the re module. This option may be best if a large number of matches is expected, as it returns an iterator of match objects instead of building a full list.
import re

elton_bio = """
Born Reginald Kenneth Dwight on 25 March 1947, John is a British singer, pianist and composer.
John is commonly nicknamed Rocket Man after his hit of the same name.
JoHn has led a successful career as a solo artist since the 1970s.
"""

result = re.finditer(r'J\w+', elton_bio)
for match in result:
    print(match.group())
The first line imports the re module.
Then a multi-line string is declared containing a snippet of Elton John’s biography. This saves to elton_bio.
Then re.finditer() is called and passed two (2) arguments:
- The search pattern (r'J\w+'). The r prefix marks a raw string, so backslashes reach the regex engine unchanged instead of being interpreted as Python escape sequences.
- The multi-line string to search on (elton_bio).
An iterator object returns and saves to result. If result was output to the terminal, an object similar to the one below would display.
<callable_iterator object at 0x0000021F3CB2B430>
To view the matches, a for loop outputs each match.group() to the terminal.
John
John
JoHn
💡 Note: The output displays all three (3) matches, including the mixed-case last one, because \w matches letters of either case.
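Because re.finditer() yields match objects rather than plain strings, each match also carries its position. The following sketch, using a made-up string, collects the matched text together with its (start, end) span.

```python
import re

text = "John wrote songs. JoHn toured."  # hypothetical text

positions = []
for match in re.finditer(r'J\w+', text):
    # group() is the matched text; span() is its (start, end) index pair.
    positions.append((match.group(), match.span()))

print(positions)  # [('John', (0, 4)), ('JoHn', (18, 22))]
```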
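Method 3: Use regex search()
On its own, re.search() returns only the first match (or None if there is no match). This method calls it repeatedly inside a helper function to collect all matches in a list.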
import re

elton_bio = """
Born Reginald Kenneth Dwight on 25 March 1947, John is a British singer, pianist and composer.
John is commonly nicknamed Rocket Man after his hit of the same name.
JoHn has led a successful career as a solo artist since the 1970s.
"""

def find_all(regex, text):
    # Assumes the pattern never matches an empty string.
    match_list = []
    while True:
        match = re.search(regex, text)
        if match:
            match_list.append(match.group(0))
            # Continue searching in the text after this match.
            text = text[match.end():]
        else:
            return match_list

print(find_all(r'J\w+', elton_bio))
The first line imports the re module.
Then a multi-line string is declared containing a snippet of Elton John’s biography. This saves to elton_bio.
Next, the function find_all is defined with two (2) parameters: the regex pattern (regex) and the string to search (text).
Inside the function, a while loop calls re.search() to find the next match, appends it to match_list, and then trims text so the next search starts after that match. When no match remains, the function returns match_list.
Finally, the function is called with the pattern and elton_bio. The results return and are output to the terminal.
['John', 'John', 'JoHn']
💡 Note: The output displays all three (3) matches, including the mixed-case last one, because \w matches letters of either case.
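As an alternative sketch, the search can be restarted without slicing the string: a compiled pattern's search() method accepts a start position. The helper name find_all_from and the sample text are hypothetical and not part of the original example.

```python
import re

def find_all_from(pattern, text):
    """Collect every match by restarting the search after the previous one."""
    compiled = re.compile(pattern)
    matches = []
    pos = 0
    while pos <= len(text):
        match = compiled.search(text, pos)
        if match is None:
            break
        matches.append(match.group(0))
        # Move past this match; step one extra character after an empty
        # match so the loop cannot get stuck.
        pos = match.end() if match.end() > match.start() else match.end() + 1
    return matches

text = "John and JoHn"  # hypothetical text
found = find_all_from(r'J\w+', text)
print(found)  # ['John', 'JoHn']
```

This avoids copying the remaining text on every iteration, which may matter for long strings.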
Method 4: Use regex sub()
What happens if you want to find each occurrence of ‘John’ and replace it with ‘Elton John’? You could use re.sub() with the following syntax:
re.sub(pattern, repl, string, count=0, flags=0)
import re

elton_bio = """
Born Reginald Kenneth Dwight on 25 March 1947, John is a British singer, pianist and composer.
John is commonly nicknamed Rocket Man after his hit of the same name.
JoHn has led a successful career as a solo artist since the 1970s.
"""

new_ebio = re.sub(r'J\w+', 'Elton John', elton_bio)
print(new_ebio)
The first line imports the re module.
Then a multi-line string is declared containing a snippet of Elton John’s biography. This saves to elton_bio.
The following line calls re.sub() with three (3) arguments:
- The search pattern (r'J\w+'). The r prefix marks a raw string, so backslashes reach the regex engine unchanged instead of being interpreted as Python escape sequences.
- The replacement string ('Elton John').
- The multi-line string to apply this on (elton_bio).
The results save to new_ebio and are output to the terminal.
Born Reginald Kenneth Dwight on 25 March 1947, Elton John is a British singer, pianist and composer.
Elton John is commonly nicknamed Rocket Man after his hit of the same name.
Elton John has led a successful career as a solo artist since the 1970s.
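The optional count argument in the syntax above limits how many matches are replaced. The sketch below, on a made-up string, shows count=1 alongside re.subn(), which also reports how many replacements were made.

```python
import re

text = "John sang. John played. JoHn toured."  # hypothetical text

# count=1 replaces only the first match.
first_only = re.sub(r'J\w+', 'Elton John', text, count=1)

# re.subn() returns the new string together with the number of replacements.
replaced, how_many = re.subn(r'J\w+', 'Elton John', text)

print(first_only)  # Elton John sang. John played. JoHn toured.
print(replaced)    # Elton John sang. Elton John played. Elton John toured.
print(how_many)    # 3
```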