A Quick Introduction To Python’s ‘re’ Module
“How to extract digits or numbers from a string” is a common Google search among Python users, and a frequent question on forums such as Stack Overflow. The answers invariably point to two main solutions, and in this article I intend to focus on one of them: regular expressions.
Regular expressions don’t get the kudos they should given both their power and widespread use across many of today’s popular programming languages. Serious programmers working for some of the biggest names in computer science today frequently rely on regular expressions to clean and extract data for use. If you’re looking for an edge to turbocharge your coding ability, I’d be giving regular expressions a second look.
You’ll see regular expressions referred to by several nicknames, such as REs, regexes or regex patterns. This can be mildly confusing to newcomers as Regex is also the name of a third-party module which we’ll touch briefly on later in this article. For the moment, when I speak of regular expressions I’m referring to the small, powerful and very specialised language that ships as standard with Python in a module simply called ‘re’.
So Where Would You Use Regular Expressions?
When you have a dump of raw data, you’ll usually find yourself needing to clean that data before it becomes usable, or you may need to extract or ‘mine’ a usable component from the mass of data before discarding the rest. Perhaps you need to validate or extract an email address or phone number from a text string? Maybe you’ve just scraped a web page and need to separate very specific references or patterns of text and numbers?
Regular expressions are routinely used in biology when searching for patterns in DNA or protein sequences. They are similarly used to search for geographic coordinates or taxonomic names in scientific documents. There is no doubt that very early on in any programmer’s development a problem arises that regular expressions are best placed to solve, so I suggest you add them to your list of tools.
Before we begin using the re module, I want to touch on compiling. Standard tutorials will teach the need to ‘compile’ a pattern before using it to search a string. However, many of the functions in the re module will compile the pattern for you ‘on the fly’ when the code is executed. It’s your choice. Python keeps a cache of recently compiled patterns, so the speed difference is usually small, but (in much the same way as we define functions to streamline our code) if you intend to use a pattern repeatedly through your programme it is tidier to compile it once, give it a name, and reuse the compiled object rather than repeating the pattern string each time it is needed. Therefore I will use the compile step throughout my code examples.
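To make the two approaches concrete, here is a minimal sketch of both. The sample text and the variable names are invented for illustration and do not come from the email used later in the article.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "Order 66 shipped on day 12"

# Option 1: compile the pattern once, then reuse the pattern object.
digits = re.compile(r"\d+")
first_compiled = digits.search(text)

# Option 2: pass the pattern string directly and let re compile it on the fly.
first_inline = re.search(r"\d+", text)

print(first_compiled.group())  # 66
print(first_inline.group())    # 66
```

Both calls find the same match; the compiled version simply keeps the pattern in one named place.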
Regular Expression Characters
There are two main kinds of characters used in regular expressions: ordinary and special. Ordinary characters are those that represent themselves, so if you wish to search for a word such as ‘Finxter’ then that becomes the search pattern you’d use. However, often you don’t know the precise letters or numbers you are looking for, only the pattern that they make, and that is when we use special characters.
The re module uses a type of shorthand to allow you to search for specific characters and patterns in your data. There are a few to explore but the following will get us started with our goal of finding and extracting numbers from a string.
- \d matches a decimal digit, so any single digit from 0 through to 9 inclusive.
- \D matches any character that is not a decimal digit, thereby excluding 0 through 9.
- \w matches any alphanumeric character, so digits or letters, and also the underscore character.
- \W matches any non-alphanumeric character, so it excludes digits, letters and underscores.
- \s matches ‘white-space’ characters such as a space, a tab or a newline character.
- \S matches any character that is not white-space.
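A quick way to get a feel for these shorthand characters is to run each one over the same string. The sketch below uses a made-up sample string and re.findall(), which returns every match as a list.

```python
import re

# Hypothetical sample string containing letters, digits, an underscore,
# spaces and a tab.
sample = "Room_7 opens at 9\tsharp"

print(re.findall(r"\d", sample))   # ['7', '9']
print(re.findall(r"\D+", sample))  # ['Room_', ' opens at ', '\tsharp']
print(re.findall(r"\w+", sample))  # ['Room_7', 'opens', 'at', '9', 'sharp']
print(re.findall(r"\s", sample))   # [' ', ' ', ' ', '\t']
```

Note how the underscore in Room_7 counts as a \w character, so it stays inside the first word.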
Use The Raw String Prefix When Creating A Pattern
Before we try some of these special characters, I want to touch briefly on the use of backslashes in regular expressions. As you’ll know, Python uses backslashes in special character sequences such as ‘\n’ to indicate a new line, or ‘\t’ to indicate a tab. Backslashes are also used to ‘escape’ other special characters. For instance, if I want to escape a backslash because I mean it to show as an actual backslash in a string and not a command in the code, I’d use another backslash as in '\\'. Therefore the use of backslashes in the re module has the potential to confuse. Rather than tie yourself up in knots trying to decide what to escape, I suggest adding the ‘r’ prefix to the regular expression you create. This indicates a ‘raw string’, one in which Python leaves backslashes untouched instead of treating them as the start of an escape sequence. You’ll see this shortly when we code up a search.
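The difference is easiest to see side by side. This is a small sketch with invented sample strings; \b is the regular expression marker for a word boundary, but in an ordinary Python string it means a backspace character.

```python
import re

# An ordinary string turns \t into one tab character,
# while a raw string keeps the backslash and the letter as two characters.
print(len("\t"))   # 1
print(len(r"\t"))  # 2

# Without the r prefix, a backslash meant for the regex engine must be doubled.
escaped = re.compile("\\d+")
raw = re.compile(r"\d+")
print(escaped.pattern == raw.pattern)  # True

# Here "\b" becomes a backspace, so the first search finds nothing.
print(re.search("\bcat\b", "a cat sat"))           # None
print(re.search(r"\bcat\b", "a cat sat").group())  # cat
```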
Importing And Using The Regular Expression Module
So let’s use the regular expression special characters to search a string and see how they work. But first, we need to import the regular expression module into our code. Simply add the following to your script.
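The import is a single line. The re module is part of the Python standard library, so there is nothing to install.

```python
import re
```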
For this demonstration I’m going to use an email I received from Chris Mayer when I joined his Finxter Academy back in the day. We’ll create some patterns and see if we can extract some numerical and other data from the string. At the time of my joining, the Finxter Academy had almost 32,000 members. Let’s see if we can extract the actual number of people in the Finxter community by using the \d and \D shorthand characters discussed previously.
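The text of the email is not reproduced here, so the sketch below uses a short stand-in sentence. Only the figure 31,197 comes from this article; the rest of the wording is invented, which means the span reported will differ from the one discussed for the real email.

```python
import re

# Hypothetical stand-in for the welcome email. Only the figure 31,197 is
# taken from the article; the surrounding sentence is invented.
email = "Welcome! You have just joined 31,197 fellow coders in the community."

# Two digits, one non-digit (the comma), then three digits.
pattern = re.compile(r"\d\d\D\d\d\d")
result = pattern.search(email)

print(result)  # <re.Match object; span=(30, 36), match='31,197'>
```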
There are two things to note from this example. The first is the use of the ‘r’ in front of the pattern we compiled, r'\d\d\D\d\d\d', which denotes a raw string as we discussed earlier. The second is that search() returned a Match object containing information about the search. Note the ‘span’ value shown for the Match object, which gives us the start and end index of the match in the string (190, 196), and the ‘match’ value, which shows the matched text (match='31,197'). To extract just the data we want from the search, we need to call the group() method as follows:
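As a sketch of that step, again using an invented stand-in sentence rather than the real email, group() returns the matched text and span() returns the start and end positions. It is worth checking for None first, because search() returns None when nothing matches.

```python
import re

# Hypothetical stand-in for the welcome email; only 31,197 is from the article.
email = "Welcome! You have just joined 31,197 fellow coders in the community."

result = re.compile(r"\d\d\D\d\d\d").search(email)

if result is not None:
    members = result.group()  # the matched text
    position = result.span()  # (start, end) indexes within the string
    print(members)            # 31,197
    print(position)           # (30, 36)
```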
This returns the data we were seeking. Bear in mind that this data is still a string and will require cleaning and converting if you wish to use it in an equation.
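One simple way to do that cleaning, offered as a sketch rather than the only approach, is to strip the comma and pass the result to int().

```python
# The matched text is a string, so arithmetic on it would fail as it stands.
raw_value = "31,197"

# Remove the thousands separator, then convert to an integer.
member_count = int(raw_value.replace(",", ""))

print(member_count)      # 31197
print(member_count + 1)  # 31198
```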
Special Characters
We managed a result with that pattern, but if you had a larger number, such as those used on credit cards, that level of repetition would rapidly get tedious. There is a shorter way of expressing a pattern, using special characters that signify a repetition of the characters around them, so let’s take a look at those.
- + signals that the search should match 1 or more repetitions of the preceding character; so a pattern of 34+ would match 34, 344, 3444 etc. It will not match just 3, there must be at least one 4.
- * indicates that the search should match 0 or more repetitions of the preceding character; so the pattern 34* would match 3, 34, 344, 3444 etc.
- ? asks the search to match 0 or 1 repetition of the preceding character; so 34? will match only 3 or 34.
- . (the dot or period) stands in for any character other than a newline.
- | is used as an ‘or’ indicator. If you use a pattern X|Y it will search for X or Y.
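The behaviour of these characters can be checked with a short experiment. In this sketch the candidate strings are invented, and fullmatch() is used so that a candidate only counts if the whole string fits the pattern.

```python
import re

# Hypothetical candidate strings to test each quantifier against.
candidates = ["3", "34", "344", "35"]

results = {}
for pattern in (r"34+", r"34*", r"34?"):
    compiled = re.compile(pattern)
    # fullmatch() succeeds only when the entire string fits the pattern.
    results[pattern] = [c for c in candidates if compiled.fullmatch(c)]
    print(pattern, results[pattern])
# 34+ ['34', '344']
# 34* ['3', '34', '344']
# 34? ['3', '34']

# The dot matches any character except a newline.
print(re.findall(r"3.", "34 3x 3\n"))         # ['34', '3x']

# The pipe matches either alternative.
print(re.findall(r"cat|dog", "cat dog cow"))  # ['cat', 'dog']
```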
So using some of those extra characters our previous pattern might be shortened as follows.
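A sketch of one such shortened pattern is shown here. It assumes the dot is used for the separator, and it runs against an invented stand-in sentence rather than the real email.

```python
import re

# Hypothetical stand-in for the welcome email; only 31,197 is from the article.
email = "Welcome! You have just joined 31,197 fellow coders in the community."

# One or more digits, any single character, then one or more digits.
short_pattern = re.compile(r"\d+.\d+")
short_result = short_pattern.search(email)

print(short_result.group())  # 31,197
```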
Just beware the dot as we used it in this example; because it can stand in for any character, it might return a number rather than the comma that we are seeking and so the pattern may be too broad. To be specific you might wish to use \W or \D in the place of the dot.
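The following sketch shows the problem on an invented sentence that contains a plain number before the comma-separated one. The dot happily matches a digit, so the looser pattern stops at the wrong number.

```python
import re

# Hypothetical text with a plain number ahead of the one we want.
text = "Ticket 12345 was sold to 31,197 members"

# The dot can stand in for a digit, so this matches the first number.
loose = re.search(r"\d+.\d+", text).group()

# \D insists on a non-digit separator, so only 31,197 qualifies.
strict = re.search(r"\d+\D\d+", text).group()

print(loose)   # 12345
print(strict)  # 31,197
```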
Define Your Own Character Class
Sometimes you may find the ordinary and special characters too broad for the pattern you wish to locate and in those cases, the re module allows us to define a special character class of our own. This is done by using the square bracket notation.
[ ] are used to stipulate the specific character grouping you seek.
Perhaps we wish to extract an email address from the email string above?
The first square bracket pattern calls for any alphanumeric characters, including the underscore character, followed by the @ symbol and then the second square bracket pattern again calls for any alphanumeric characters, including the underscore character.
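As a sketch of the pattern described, using a made-up address rather than one from the email, the first version below follows the description literally. Because \w does not include the dot, it stops at the first dot in the domain; the second version is an assumed variant that adds the dot to each character class.

```python
import re

# Hypothetical text and address, invented for illustration.
text = "Questions? Write to admin@example.com any time."

# Alphanumerics or underscore, an @ symbol, then alphanumerics or underscore.
basic = re.compile(r"[\w]+@[\w]+")
print(basic.search(text).group())  # admin@example

# Assumed variant: allow dots inside the character classes as well.
wider = re.compile(r"[\w.]+@[\w.]+")
print(wider.search(text).group())  # admin@example.com
```

Neither version is a full email validator; they are only meant to show how a custom character class widens or narrows what is matched.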
So how do we use regular expressions to extract an address from the above string? Well, we know the German address convention is [Street] [Number], [Postcode] [City] so let’s see how we might code this up.
We want to stipulate that the first word of the street must be capitalised otherwise we may pull other matching patterns from within the string, so let’s use [A-Z][a-z]+ to start our pattern which indicates there must be only one capital letter selected from A to Z to start the pattern, followed by one or more lower case letters from a to z.
We follow that pattern with the white-space character ‘\s’.
For the street number, we call for decimal digits between 0 and 9, and given street numbers may be large or small we bracket the total by stipulating anywhere from 2 to 4 digits with \d{2,4}. Note that the {2,4} quantifier must not be wrapped in square brackets: [\d{2,4}] is a character class that matches only one character (a digit, a brace or a comma).
Then we search for the postcode, remembering the comma and white-space that precede its digits, with [,\s\d]+
Finally, we call for white-space and one or more alphanumeric characters, which would represent the city, with [\s\w]+.
So the final pattern will look like this: [A-Z][a-z]+\s\d{2,4}[,\s\d]+[\s\w]+
Let’s try it.
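The sketch below runs the pattern against an invented message and address, since the original email is not reproduced here. It writes the street number part as \d{2,4}, with the quantifier outside any square brackets, so that it really does ask for two to four digits.

```python
import re

# Hypothetical message and address, invented for illustration.
message = "You can visit us at Hauptstrasse 123, 70173 Stuttgart. Best wishes"

# Capitalised street name, space, 2 to 4 digit house number,
# then comma/space/postcode digits, then the city.
address_pattern = re.compile(r"[A-Z][a-z]+\s\d{2,4}[,\s\d]+[\s\w]+")
found = address_pattern.search(message)

print(found.group())  # Hauptstrasse 123, 70173 Stuttgart
```

The street name is spelled with ‘ss’ here because [a-z] does not match ‘ß’; a pattern for real German addresses would need a wider character class.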
Success! At the beginning of this article we set out to extract digits from a string and not only did we manage that, but we also took an email address and a street address. However, don’t stop there as we’ve only lightly scratched the surface of what regular expressions can do. We’ve used compile(), search(), match(), and group() but there are many more functions within re that you can use. Here are some of the most frequently used.
- re.compile(pattern) creates a regular expression object that you can store in a variable and reuse for multiple searches.
- re.search(pattern, string) checks if the pattern is in the string and returns the first match as a match object which as we saw, contains meta-data about the matched position and sub-string.
- re.findall(pattern, string) checks if the pattern is in the string and returns a list of all matches.
- re.match(pattern, string) checks for the pattern at the beginning of a string and returns a match object, or None if the string does not start with the pattern.
- re.split(pattern, string) splits a string where the pattern matches and returns a list of strings. For instance, you might split a text string at every full-stop(period) followed by a white space and have a list of individual strings returned.
- re.sub(pattern, replacement, string) replaces every non-overlapping match of the pattern with the replacement string and returns a new string. Pass the optional count argument to limit the number of replacements.
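Here is a brief sketch of several of these functions applied to one invented sentence. Note that re.sub() replaces every match unless a count is supplied.

```python
import re

# Hypothetical sample text, invented for illustration.
text = "Call 555 now. Or call 777 later. Thanks"

print(re.findall(r"\d+", text))           # ['555', '777']

# match() only looks at the start of the string.
print(re.match(r"\d+", text))             # None
print(re.match(r"Call", text).group())    # Call

# Split at every full stop that is followed by white-space.
print(re.split(r"\.\s", text))            # ['Call 555 now', 'Or call 777 later', 'Thanks']

# sub() replaces all matches by default; count limits how many.
print(re.sub(r"\d+", "#", text))          # Call # now. Or call # later. Thanks
print(re.sub(r"\d+", "#", text, count=1)) # Call # now. Or call 777 later. Thanks
```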
A comprehensive tutorial on the intricacies of regular expressions may be found here https://blog.finxter.com/python-regex/
Finally, I previously mentioned Regex, which while used as a shorthand for regular expressions is also a third-party module that uses an API compatible with the standard Python re module but adds increased functionality. If you wish to explore Regex, you can find it here
In Summary
To summarise, today’s task was to extract digits from a string. We learned about the Python re module, which allows us to use powerful regular expressions to create a pattern of characters we wish to extract from a string. We learned some of the ordinary and special characters which enable us to create customised patterns, and we learned a few common functions that will accept our pattern and return the location, match and string we are seeking.
There is a considerable amount to learn about regular expressions and I trust this article has fired your desire for a deeper understanding. Thank you for reading.