Summary: This blog explores the steps to remove all non-alphabet characters from a given string. The ‘re’ module in Python provides regular expression operations for processing text. One uses these operations to manipulate text in strings. The compile() method, in conjunction with the sub() method, can remove all non-alphabet characters from a given string.
Note: All the solutions provided below have been verified using Python 3.9.0b5.
Problem Formulation
Imagine the following garbled string in Python…
my_string = 'A !He#a"lt#hy D$os@e Of% O*m$e+ga_3 F#a$t#t@y-A%ci^d*s P&er D{a]y K\'ee(p)s T*he D^.oc+to&r A#w*ay.\nFl-a)x Se/:ed A;nd W]al<n=uts A>re A? G@oo[d S\\our]ce Of O!m&eg^a_3 F#a$t#t@y-A%ci^d*s.'
How does one get rid of the non-alphabet characters to clean up the string?
'AHealthyDoseOfOmegaFattyAcidsPerDayKeepsTheDoctorAwayFlaxSeedAndWalnutsAreAGoodSourceOfOmegaFattyAcids'
Background
The above problem formulation is an exaggerated example of a garbled sentence. But, in one’s Python coding career, one does find the need to clean up sentences every now and then. This could be something as simple as cleaning up punctuation to get word counts. Or it could be something more complex, like recovering corrupted code. In any case, it is good to have an arsenal of tools that a Pythonista could use in such situations. This blog will show you a simple way to remove non-alphabet characters from strings.
Ok! Enough Talk, I Get It!! Now Show Me!!
The ‘re’ module is part of Python’s standard library. One should remember to ‘import’ the ‘re’ module before using it. The solution shown below first compiles the search pattern. Next, the compiled pattern object operates on the string to get the desired result.
$ python
Python 3.9.0b5 (default, Oct 19 2020, 11:11:59)
>>>
>>> ## Remember to import the ‘re’ module. It is part of Python’s Standard Library.
>>> import re
>>>
>>> ## Compile the search pattern. The r prefix makes it a raw string,
>>> ## so the backslash in \W reaches ‘re’ unchanged.
>>> pattern = re.compile(r'[\W_0-9]+')
>>>
>>> ## ‘my_string’ is the original garbled string.
>>> my_string = 'A !He#a"lt#hy D$os@e Of% O*m$e+ga_3 F#a$t#t@y-A%ci^d*s P&er D{a]y K\'ee(p)s T*he D^.oc+to&r A#w*ay.\nFl-a)x Se/:ed A;nd W]al<n=uts A>re A? G@oo[d S\\our]ce Of O!m&eg^a_3 F#a$t#t@y-A%ci^d*s.'
>>>
>>> ## The ‘pattern’ object is used to apply the substitute function, to remove the
>>> ## non-alphabet characters from ‘my_string’
>>> clean_string = pattern.sub('', my_string)
>>>
>>> ## ‘clean_string’ is the ‘cleaned’ string, containing only alphabet characters.
>>> clean_string
'AHealthyDoseOfOmegaFattyAcidsPerDayKeepsTheDoctorAwayFlaxSeedAndWalnutsAreAGoodSourceOfOmegaFattyAcids'
>>>
The original garbled string has words that are hand-picked to make a meaningful sentence. Each word starts with a capital letter, camel-case style, for illustrative purposes, so the individual words still stand out in the cleaned string after the substitution. Yes, the spaces are also removed, because spaces are non-alphabet characters. But is there something else that got removed?
Wait A Minute!! You Meant Omega3 Fatty Acids, Right?
Correct! The astute reader may have noticed the removal of numeric characters too. This blog is about removing non-alphabet characters, alphabet characters being ‘a’ to ‘z’ and ‘A’ to ‘Z’. Hence, the code removes everything that is non-alphabetic, including numeric characters. But fear not! This blog is all about giving the reader relevant tools and showing how to use them.
The key is to change the search pattern a tiny bit. Here is what the '[\W_0-9]+' search pattern means.
- The square brackets '[]' enclose a set of character classes or individual characters. They tell the ‘re’ module to match one character from the enclosed set.
- The pattern '\W' means any character that is not alphanumeric or an underscore '_'. Which is why one needs to include '_' and '0-9' within the '[]', to tell ‘re’ to search for all non-alphabet characters.
- Finally, the regex plus operator ‘+’ tells ‘re’ to match one or more repetitions of the preceding element, which here is the whole '[]' set.
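To make the three pieces of the pattern concrete, here is a small sketch that applies them one at a time to a single garbled word taken from the example string. The variable names are only for illustration.

```python
import re

sample = 'O*m$e+ga_3'  # one garbled word from the example string

# \W alone removes the symbols but keeps the underscore and the digit.
only_w = re.sub(r'\W', '', sample)
print(only_w)            # Omega_3

# Adding '_' to the set removes the underscore as well.
w_and_underscore = re.sub(r'[\W_]', '', sample)
print(w_and_underscore)  # Omega3

# Adding '0-9' removes the digits too, leaving only letters.
letters_only = re.sub(r'[\W_0-9]', '', sample)
print(letters_only)      # Omega
```

The '+' is left out here on purpose, since it does not change which characters are matched.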
So to remove only the non-alphanumeric characters, and keep the digits, one would use '[\W_]+' instead of '[\W_0-9]+' while compiling the pattern. Let's see how this works.
$ python
Python 3.9.0b5 (default, Oct 19 2020, 11:11:59)
>>>
>>> ## Again, remember to import the ‘re’ module. It is part of Python’s Standard Library.
>>> import re
>>>
>>> ## Compile the search pattern. Note the difference from before,
>>> ## i.e. '[\W_]+' instead of '[\W_0-9]+'
>>> pattern = re.compile(r'[\W_]+')
>>>
>>> ## ‘my_string’ is the original garbled string.
>>> my_string = 'A !He#a"lt#hy D$os@e Of% O*m$e+ga_3 F#a$t#t@y-A%ci^d*s P&er D{a]y K\'ee(p)s T*he D^.oc+to&r A#w*ay.\nFl-a)x Se/:ed A;nd W]al<n=uts A>re A? G@oo[d S\\our]ce Of O!m&eg^a_3 F#a$t#t@y-A%ci^d*s.'
>>>
>>> ## The ‘pattern’ object is used to apply the substitute function, to remove the
>>> ## non-alphanumeric characters from ‘my_string’
>>> clean_string = pattern.sub('', my_string)
>>>
>>> ## ‘clean_string’ is the ‘cleaned’ string, containing only alphanumeric characters now.
>>> ## Note the ‘3’ in ‘Omega3’
>>> clean_string
'AHealthyDoseOfOmega3FattyAcidsPerDayKeepsTheDoctorAwayFlaxSeedAndWalnutsAreAGoodSourceOfOmega3FattyAcids'
>>>
There!! That looks much better now. The numeric characters are now included.
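One caveat worth knowing: for ordinary (str) patterns, '\W' is Unicode-aware, so an accented letter such as 'é' counts as a word character and survives the cleanup. If ‘alphabet’ strictly means ‘a’ to ‘z’ and ‘A’ to ‘Z’, a negated character set is one possible alternative. The sketch below is not part of the solution above, and the sample string is made up for illustration.

```python
import re

sample = 'Caf\u00e9_3 Om#ega-3!'  # hypothetical sample; \u00e9 is the letter 'é'

# \W is Unicode-aware for str patterns, so 'é' is treated as a letter and kept.
unicode_aware = re.sub(r'[\W_0-9]+', '', sample)      # 'CaféOmega'

# A negated set keeps only the ASCII letters a-z and A-Z.
ascii_letters_only = re.sub(r'[^a-zA-Z]+', '', sample)   # 'CafOmega'

# The same idea, keeping the ASCII digits as well.
ascii_alnum_only = re.sub(r'[^a-zA-Z0-9]+', '', sample)  # 'Caf3Omega3'

print(ascii_letters_only)
print(ascii_alnum_only)
```

For the garbled string used in this blog, both approaches give the same result, because it contains only ASCII characters.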
The Sentence Is Still A Loooooong Word!!
Right!! Removing spaces from sentences makes them unreadable. Hence the careful choice of the original garbled sentence, ToUseCamelCharacters. This section explores a way to preserve the spaces in the original garbled sentence. It is not the be-all end-all method, but it is simple and easy to understand.
- The split() string method splits the original sentence at whitespace. This turns the one long string into a list of individual words.
- The words are still garbled. The sub() method of the compiled pattern operates on each word to clean it up. This results in a list containing the cleaned words.
- Next, the join() string method joins the words in this list, using the space character ' ' as the separator.
Let's see how this works.
$ python
Python 3.9.0b5 (default, Oct 19 2020, 11:11:59)
>>>
>>> ## Remember to import the ‘re’ module. It is part of Python’s Standard Library.
>>> import re
>>>
>>> ## Compile the search pattern.
>>> pattern = re.compile(r'[\W_0-9]+')
>>>
>>> ## ‘my_string’ is the original garbled string.
>>> my_string = 'A !He#a"lt#hy D$os@e Of% O*m$e+ga_3 F#a$t#t@y-A%ci^d*s P&er D{a]y K\'ee(p)s T*he D^.oc+to&r A#w*ay.\nFl-a)x Se/:ed A;nd W]al<n=uts A>re A? G@oo[d S\\our]ce Of O!m&eg^a_3 F#a$t#t@y-A%ci^d*s.'
>>>
>>> ## Split my_string at the spaces to create a list of garbled words.
>>> dirty_list = my_string.split()
>>>
>>> ## Use list comprehension to clean the words while creating a new list.
>>> clean_list = [pattern.sub('', word) for word in dirty_list]
>>>
>>> ## Join the Cleaned words in the new list, using spaces.
>>> clean_string = ' '.join(clean_list)
>>>
>>> clean_string
'A Healthy Dose Of Omega FattyAcids Per Day Keeps The Doctor Away Flax Seed And Walnuts Are A Good Source Of Omega FattyAcids'
>>>
Well, it is a tiny bit easier to read compared to the camel-case string. But the sentence still lost its punctuation, and the line break between the two sentences is gone too. You win some, you lose some!!
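If the original whitespace matters, including that line break, one possible alternative is a single substitution that leaves whitespace alone. This is only a sketch and not the method used above: it assumes ‘alphabet’ means the ASCII letters, and the sample is a short excerpt of the garbled string.

```python
import re

# A short excerpt of the garbled string, including the newline.
sample = 'O*m$e+ga_3 F#a$t#t@y-A%ci^d*s.\nFl-a)x Se/:ed'

# Remove anything that is neither an ASCII letter nor whitespace.
# The \s inside the negated set exempts spaces, tabs and newlines from removal.
keep_letters_and_space = re.compile(r'[^a-zA-Z\s]+')
cleaned = keep_letters_and_space.sub('', sample)

print(cleaned)
# Omega FattyAcids
# Flax Seed
```

Unlike split() followed by join(), this keeps the newline where it was. The trade-off is that a ‘word’ made up entirely of symbols would leave two adjacent spaces behind.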
Well, That Was Interesting! Anything Else?
Of course there is always something else when one is learning Python. Remember the search pattern '[\W_0-9]+' used in the examples above? Did you ever wonder why the code uses the '+' character after the '[]' set? By itself, the '[]' will match one character at a time before moving on. One could use only the '[]' and the code will still work. Adding the '+' makes it much faster, because each run of consecutive unwanted characters is removed in a single substitution. Speed is also the reason why one should compile a pattern for searching, rather than using it as is, although that gain is smaller because ‘re’ caches recently used patterns.
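One way to see what the '+' changes is re.subn(), which works like re.sub() but also reports how many substitutions were made. The sketch below uses one garbled word from the example string.

```python
import re

sample = 'D^.oc+to&r'  # one garbled word from the example string

# re.subn() returns a tuple: (new_string, number_of_substitutions_made).
without_plus = re.subn(r'[\W_0-9]', '', sample)
with_plus = re.subn(r'[\W_0-9]+', '', sample)

print(without_plus)  # ('Doctor', 4)  one substitution per unwanted character
print(with_plus)     # ('Doctor', 3)  one substitution per run: '^.', '+', '&'
```

The cleaned string is the same either way. The version with '+' simply does the job in fewer steps, and the saving grows with the length of the runs.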
Note in the code below that string.printable is a constant string of printable ASCII characters. It is available from the ‘string’ module in the Python standard library.
$ python
Python 3.9.0b5 (default, Oct 19 2020, 11:11:59)
>>>
>>> import string
>>> print(string.printable)
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
>>>
Now consider the following comparisons.
## No '+' in Pattern. Pattern used as is, i.e. not compiled.
$ python -m timeit -s \
> "import re, string" \
> "re.sub('[\W_0-9]', '', string.printable)"
20000 loops, best of 5: 10.2 usec per loop

## No '+' in Pattern. Pattern is compiled.
$ python -m timeit -s \
> "import re, string; \
> pattern = re.compile('[\W_0-9]')" \
> "pattern.sub('', string.printable)"
50000 loops, best of 5: 9.52 usec per loop

## Pattern used as is, i.e. not compiled.
$ python -m timeit -s \
> "import re, string" \
> "re.sub('[\W_0-9]+', '', string.printable)"
100000 loops, best of 5: 3.56 usec per loop

## Pattern is compiled.
$ python -m timeit -s \
> "import re, string; \
> pattern = re.compile('[\W_0-9]+')" \
> "pattern.sub('', string.printable)"
100000 loops, best of 5: 2.92 usec per loop
Wow!! In the timings above, adding the ‘+’ makes the substitution roughly three times faster, and compiling the search pattern shaves off a little more!!
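For readers who prefer to stay inside Python, the same comparison can be sketched with the timeit module from the standard library. The dictionary name, the labels and the loop count of 10000 are arbitrary choices for this illustration, and the absolute numbers will differ from machine to machine, so no expected output is shown.

```python
import re
import string
import timeit

# The two patterns under comparison; the labels are only for printing.
PATTERNS = {
    "no '+'": r'[\W_0-9]',
    "with '+'": r'[\W_0-9]+',
}

results = {}
for label, regex in PATTERNS.items():
    compiled = re.compile(regex)
    # Module-level call: re.sub() looks the pattern up in its cache on every call.
    t_module = timeit.timeit(
        lambda: re.sub(regex, '', string.printable), number=10000)
    # Precompiled pattern object: no lookup on each call.
    t_compiled = timeit.timeit(
        lambda: compiled.sub('', string.printable), number=10000)
    results[label] = (t_module, t_compiled)
    print(f'{label:9} re.sub: {t_module:.4f}s   compiled: {t_compiled:.4f}s')
```

Each figure is the total time for 10000 calls, so only the ratios between the four numbers are meaningful.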
Conclusion
This blog explored the subtleties of using regular expressions to manipulate strings. Learning Python is all about experimenting and trying out different tactics to achieve the end result.
Here is a one-liner for the reader to figure out. Examine it and dissect it piece by piece. Each element in the one-liner is part of the code shown earlier. Stare at it for a while! Try it out! Breathe!! Stay Calm!! Like your fellow Pythonistas, you will eventually get the hang of it too…
$ python
Python 3.9.0b5 (default, Oct 19 2020, 11:11:59)
>>>
>>> import re
>>>
>>> ## This is the original
>>> my_string = 'A !He#a"lt#hy D$os@e Of% O*m$e+ga_3 F#a$t#t@y-A%ci^d*s P&er D{a]y K\'ee(p)s T*he D^.oc+to&r A#w*ay.\nFl-a)x Se/:ed A;nd W]al<n=uts A>re A? G@oo[d S\\our]ce Of O!m&eg^a_3 F#a$t#t@y-A%ci^d*s.'
>>>
>>> ## The one-liner!!!
>>> clean_string = ' '.join([re.compile(r'[\W_]+').sub('', word) for word in my_string.split()])
>>>
>>> ## The output has the spaces preserved, as above, and this time the ‘3’ in ‘Omega3’ too.
>>> clean_string
'A Healthy Dose Of Omega3 FattyAcids Per Day Keeps The Doctor Away Flax Seed And Walnuts Are A Good Source Of Omega3 FattyAcids'
>>>
Finxter Academy
This blog was brought to you by Girish Rao, a student of Finxter Academy.
Reference
All research for this blog article was done using Python Documents, the Google Search Engine, and the shared knowledge-base of the Finxter Academy and the Stack Overflow Communities.