The web is full of tutorials about regular expressions. But I realized that most of those tutorials lack a thorough motivation.
- Why do regular expressions exist?
- What are they used for?
- What are some practical applications?
Somehow, the writers of those tutorials assume that readers are motivated by default to learn a technology that's complicated and hard to master.
Well, readers are not. If you're like me, you tend to avoid complexity, and you first want to know WHY before you invest dozens of hours in learning a new skill. Is this you? Then keep reading. (Otherwise, leave now, and don't say I didn't warn you.)
So what are some applications of regular expressions?
Related article: Python Regex Superpower – The Ultimate Guide
Do you want to master the regex superpower? Check out my new book The Smartest Way to Learn Regular Expressions in Python with the innovative 3-step approach for active learning: (1) study a book chapter, (2) solve a code puzzle, and (3) watch an educational chapter video.
Here's a quick overview of the regex applications we'll cover:
Search and Replace in a Text Editor
The most straightforward application is searching a given text in your text editor. Say your boss asks you to replace all occurrences of the customer name 'Max Power' with 'Max Power, Ph.D.'. Here's how it looks:
I used the popular text editor Notepad++ (recommended for coders). At the bottom of the "Replace" window, you can see the selection box "Regular expression". But in this example, we used the most straightforward regular expression there is: a plain string.
So you search and replace all occurrences of the string 'Max Power' and hand the document back to your boss. But your boss glances over it and tells you that you've missed all occurrences of just 'Max' (without the surname 'Power'). What do you do?
Simple: you use a more powerful regex, 'Max( Power)?', rather than the plain string 'Max Power'.
Don't worry, this isn't about the specific regex 'Max( Power)?' and why it works. I just wanted to show you that it's possible to match all strings that look like 'Max Power' as well as those that look like just 'Max'.
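Here's a minimal Python sketch of how this works (the sentence is made up for illustration). The optional group '( Power)?' lets the regex match both variants:

```python
import re

text = "Dear Max Power, please remind Max about the meeting."

# '( Power)?' makes the surname optional, so both 'Max Power'
# and a bare 'Max' are matched and replaced.
result = re.sub(r"Max( Power)?", "Max Power, Ph.D.", text)
print(result)
```

Both occurrences, with and without the surname, are replaced in one pass.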
Searching Your Operating System for Files
This is another common application: use regular expressions to search (and find) certain files on your operating system.
For example, this guy tried to find all files whose names end in '.r' followed by a number.
He managed to do it on Windows using the command:
dir * /s/b | findstr \.r[0-9][0-9]*$
Large parts of this command form a regular expression. In the final part, you can see that it requires the filename to end with '.r' followed by an arbitrary number of digits.
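The same search sketched in cross-platform Python (the example filenames and the directory in the comment are assumptions for illustration):

```python
import re
from pathlib import Path

# Filename must end with '.r' followed by one or more digits
pattern = re.compile(r"\.r[0-9]+$")

# Demonstrate the pattern on some example filenames:
names = ["results.r1", "results.r42", "results.r", "notes.txt"]
matches = [name for name in names if pattern.search(name)]
print(matches)  # ['results.r1', 'results.r42']

# To search a real directory tree, you could use something like:
# hits = [p for p in Path(".").rglob("*") if pattern.search(p.name)]
```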
As soon as you’ve mastered regular expressions, this will cost you no time at all and your productivity with your computer will skyrocket.
Searching Your Files for Text
But what if you don’t want to find files with a certain filename but with a certain file content? Isn’t this much harder?
As it turns out, it isn't! At least not if you use regular expressions and grep.
Here's what a grep guru would do to find all lines in the file 'haiku.txt' that contain a given word.
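The equivalent idea in Python, with made-up file contents and the word 'not' chosen purely for illustration:

```python
import re

# Hypothetical contents of 'haiku.txt'
haiku = """The Tao that is seen
Is not the true Tao, until
You bring fresh toner."""

# \b word boundaries ensure we match the whole word 'not',
# not substrings like the 'not' inside another word.
word = re.compile(r"\bnot\b")
matching_lines = [line for line in haiku.splitlines() if word.search(line)]
print(matching_lines)  # ['Is not the true Tao, until']
```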
A Windows counterpart of grep is the findstr utility.
Searching the Web
Using regular expressions to find content on the web is considered the holy grail of search. But the web is a huge beast, and supporting a full-fledged regex engine would be too demanding for Google's servers: it would cost enormous computational resources. That's why no major search engine actually supports arbitrary regex queries.
However, web search engines such as Google support a limited number of regex commands. For example, you can search queries that do NOT contain a specific word:
The search "Jeff -Bezos" gives you all the Jeffs that are not accompanied by "Bezos". When a first name is dominated by one famous person like this, such advanced search operators are a quite useful extension.
Here’s an in-depth Google search guide that shows you how to use advanced commands to search the huge web even faster.
With the explosion of data and knowledge, mastering search is a critical skill in the 21st century.
Validate User Input in Web Applications
If you’re running a web application, you need to deal with user input. Often, users can put anything in the input fields (even cross-site scripts to hack your webserver). Your application must validate that the user input is okay—otherwise you’re guaranteed to crash your backend application or database.
How can you validate user input? Regex to the rescue!
Here's how you'd check, for example, whether:
- the user input consists only of lowercase letters;
- the username consists only of lowercase letters, underscores, or numbers;
- the input does not contain any parentheses.
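A minimal sketch of these three checks in Python (the exact patterns are my assumption, chosen to match the descriptions above):

```python
import re

def only_lowercase(s):
    # True if s consists only of lowercase letters a-z
    return re.fullmatch(r"[a-z]+", s) is not None

def valid_username(s):
    # True if s consists only of lowercase letters, digits, or underscores
    return re.fullmatch(r"[a-z0-9_]+", s) is not None

def no_parentheses(s):
    # True if s contains neither '(' nor ')'
    return re.search(r"[()]", s) is None

print(only_lowercase("maxpower"))   # True
print(valid_username("max_77"))     # True
print(no_parentheses("alert(1)"))   # False
```

`re.fullmatch` requires the whole input to match the pattern, which is usually what you want for validation; `re.search` only looks for a forbidden substring anywhere in the input.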
With regular expressions, you can validate any user input—no matter how complicated it may seem.
Think about this: any web application that processes user input needs regular expressions. Google, Facebook, Baidu, WeChat—all of those companies work with regular expressions to validate their user input. This skill is wildly important for your success as a developer working for those companies (or any other web-based company for that matter).
Guess what Google’s ex tech lead argues is the top skill of a programmer? You got it: regular expressions!
Extract Useful Information With Web Crawlers
Okay, you can validate user input with regular expressions. But is there more? You bet there is.
Regular expressions are not only great for validating textual data but also for extracting information from it.
For example, say you want to gain some advantage over your competition. You decide to write a web crawler that works 24/7 exploring a subset of webpages. A webpage links to other webpages. By going from webpage to webpage, your crawler can explore huge parts of the web, fully automated.
Imagine the potential! Data is the asset class of the 21st century and you can collect this valuable asset with your own web crawler.
A web crawler can be a Python program that downloads the HTML content of a website. Your crawler can then use regular expressions to extract all outgoing links to other websites (those starting with 'http://' or 'https://'). A simple regular expression automatically captures what follows: the outgoing URL. You can store this URL in a list and visit it at a later point in time.
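Here's a sketch of that extraction step (the HTML snippet is made up; note that production crawlers typically use an HTML parser rather than a bare regex):

```python
import re

html = '''<a href="https://example.com/page1">One</a>
<a href="http://example.org/page2">Two</a>
<a href="/relative/link">Three</a>'''

# Capture only absolute outgoing links that start with http:// or https://
links = re.findall(r'href="(https?://[^"]+)"', html)
print(links)  # ['https://example.com/page1', 'http://example.org/page2']
```

The relative link is skipped because it doesn't start with the 'http' scheme the pattern requires.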
As you’re extracting links, you can build a web graph, extract other information (e.g., embedded opinions of people) and run complicated subroutines on parts of the textual data (e.g., sentiment analysis).
Don’t underestimate the power of web crawlers when used in combination with regular expressions!
Data Scraping and Web Scraping
In the previous example, you’ve already seen how to extract useful information from websites with a web crawler.
But often the first step is to simply download a certain type of data from a large number of websites with the goal of storing it in a database (or a spreadsheet). But the data needs to have a certain structure.
The process of extracting a certain type of data from a set of websites and converting it to the desired data format is called web scraping.
Web scrapers are needed in finance startups, analytics companies, law enforcement, eCommerce companies, and social networks.
Regular expressions help greatly in processing the messy textual data. There are many different applications such as finding titles of a bunch of blog articles (e.g., for SEO).
A minimal example of using Python's regex library re for web scraping is the following:

```python
from urllib.request import urlopen
import re

html = urlopen("https://blog.finxter.com/start-learning-python/").read()
titles = re.findall(r"<title>(.*)</title>", str(html))
print(titles)
# ["What's The Best Way to Start Learning Python? A Tutorial in 10 Easy Steps! | Finxter"]
```
You extract all data that's enclosed in opening and closing title tags.
Data Wrangling
Data wrangling is the process of transforming raw data into a more useful format to simplify processing in downstream applications. Every data scientist and machine learning engineer knows that data cleaning is at the core of creating effective machine learning models and extracting insights.
As you may have guessed already, data wrangling is highly dependent on tools such as regular expression engines. Each time you want to transform textual data from one format to another, look no further than regular expressions.
In Python, the regex method re.sub(pattern, repl, string) transforms a string into a new one in which each occurrence of pattern is replaced by the replacement string repl. You can learn everything about the substitution method in my detailed blog tutorial (+video).
This way, you can transform currencies, dates, or stock prices into a common format with regular expressions.
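For instance, here's a sketch that normalizes dates from 'DD.MM.YYYY' to the ISO format 'YYYY-MM-DD' (the input and output formats are my own illustrative choices):

```python
import re

text = "Invoices dated 03.02.2021 and 15.11.2020 are overdue."

# Capture day, month, and year, then reorder them via backreferences
normalized = re.sub(r"(\d{2})\.(\d{2})\.(\d{4})", r"\3-\2-\1", text)
print(normalized)  # Invoices dated 2021-02-03 and 2020-11-15 are overdue.
```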
Parsers
Show me any parser, and I'll show you a tool that leverages hundreds of regular expressions to process its input quickly and effectively.
You may ask: what’s a parser anyway? And you’re right to ask (there are no dumb questions). A parser translates a string of symbols into a higher-level abstraction such as a formalized language (often using an underlying grammar to “understand” the symbols). You’ll need a parser to write your own programming language, syntax system, or text editor.
For example, if you write a program in the Python programming language, it’s just a bunch of characters. Python’s parser brings order into the chaos and translates your meaningless characters into more meaningful abstractions (e.g. keywords, variable names, or function definitions). This is then used as an input for further processing stages such as the execution of your program.
If you’re looking at how parsers are implemented, you’ll see that they heavily rely on regular expressions. This makes sense because a regular expression can easily analyze and catch parts of your text. For example, to extract function names, you can use the following regex in your parser:
```python
import re

code = '''
def f1():
    return 1

def f2():
    return 2
'''

print(re.findall('def ([a-zA-Z0-9_]+)', code))
# ['f1', 'f2']
```
You can see that our mini parser extracts all function names in the code. Of course, this is only a minimal example, and it wouldn't work in all cases. For instance, Python allows more characters in a function name than the ones in this character class.
If you’re interested in writing parsers or learning about compilers, regular expressions are among the most useful tools in existence!
Programming Languages
Yes, you've already learned about parsers in the previous section. And parsers are needed for every programming language. To put it bluntly: hardly any programming language in the world gets by without regular expressions in its own implementation.
But there’s more: regular expressions are also very popular when writing code in any programming language. Some programming languages such as Perl provide built-in regex functionality: you don’t even need to import an external library.
I assure you: if you become a professional coder, you will use regular expressions in countless coding projects. And the more you use them, the more you'll learn to love and appreciate their power.
Syntax Highlighting Systems
Here's what my standard coding environment looks like:
Any code editor provides syntax highlighting capabilities:
- Function names may be blue.
- Strings may be yellow.
- Comments may be red.
- And normal code may be white.
This way, reading and writing code becomes far more convenient. More advanced IDEs such as PyCharm provide dynamic tooltips as an additional feature.
All of those functionalities are implemented with regular expressions to find the keywords, function names, and normal code snippets—and, ultimately, to parse the code to be highlighted and enriched with additional information.
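A toy version of this idea: wrap each keyword in a markup tag so a renderer could color it. The tag name and the tiny keyword list are my own assumptions for the sketch:

```python
import re

# A tiny, illustrative subset of Python keywords
KEYWORD = re.compile(r"\b(def|return|if|else)\b")

def highlight(code):
    # Wrap each keyword in <kw>...</kw> so a renderer can color it
    return KEYWORD.sub(r"<kw>\1</kw>", code)

line = "def f(x): return x if x else 0"
print(highlight(line))
# <kw>def</kw> f(x): <kw>return</kw> x <kw>if</kw> x <kw>else</kw> 0
```

Real editors use far richer tokenizers, but the core mechanism, pattern matching over the source text, is the same.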
Lexical Analysis in a Compiler
In compiler design, you’ll need a lexical analyzer:
"The lexical analyzer needs to scan and identify only a finite set of valid strings/tokens/lexemes that belong to the language at hand. It searches for the pattern defined by the language rules.
Regular expressions have the capability to express finite languages by defining a pattern for finite strings of symbols. The grammar defined by regular expressions is known as a regular grammar. The language defined by a regular grammar is known as a regular language." (Source)
As it turns out, regular expressions are the gold standard for creating a lexical analyzer for compilers.
I know this may sound like a very specific application but it’s an important one nonetheless.
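To make it concrete, here's a minimal regex-based lexer sketch in Python. The token names and patterns define a tiny made-up language and are purely illustrative:

```python
import re

# Each token type gets its own named group; order matters
# (NUMBER is tried before IDENT, so '42' is never an identifier).
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    # Scan left to right, emitting (token_type, lexeme) pairs
    return [(m.lastgroup, m.group())
            for m in MASTER.finditer(source)
            if m.lastgroup != "SKIP"]

print(tokenize("x = 3 + 42"))
# [('IDENT', 'x'), ('OP', '='), ('NUMBER', '3'), ('OP', '+'), ('NUMBER', '42')]
```

Combining per-token regexes into one master pattern with named groups is a classic lexer-building trick; the Python `re` documentation describes a similar approach.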
Formal Language Theory
Theoretical computer science is the foundation of all computer science. The great names of the field, from Alan Turing and Alonzo Church to Stephen Kleene (who invented regular expressions), spent significant time and effort studying formal languages and models of computation.
If you want to become a great computer scientist, you need to know your fair share of theoretical computer science. You need to know about formal language theory. You need to know about regular expressions that are at the heart of these theoretical foundations.
How do regular expressions relate to formal language theory? Each regular expression defines a "language" of acceptable words: all words that match the regular expression are in the language, and all words that don't match are not. This way, you can create a precise set of rules to describe any regular language, just by using the power of regular expressions.
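For instance, here's a sketch of the regular language defined by the regex 'ab*', the set of all words consisting of one 'a' followed by any number of 'b's (the sample words are my own):

```python
import re

language = re.compile(r"ab*")  # defines the language {a, ab, abb, abbb, ...}

words = ["a", "ab", "abbb", "ba", "b"]
# fullmatch tests whole-word membership, not just a substring match
in_language = [w for w in words if language.fullmatch(w)]
print(in_language)  # ['a', 'ab', 'abbb']
```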
Where to Go From Here?
Regular expressions are widely used for many practical applications. The ones described here are only a small subset of the ones used in practice. However, I hope to have given you a glimpse of how important and relevant regular expressions have been, are, and will remain in the future.
Want to learn more about how to convert your computer science skills into money? Check out my free webinar that shows you a step-by-step approach to build your thriving online coding business (working from home). You don’t need to have any computer science background though. The only thing you need is the ambition to learn.
While working as a researcher in distributed systems, Dr. Christian Mayer found his love for teaching computer science students.
To help students reach higher levels of Python success, he founded the programming education website Finxter.com that has taught exponential skills to millions of coders worldwide. He’s the author of the best-selling programming books Python One-Liners (NoStarch 2020), The Art of Clean Code (NoStarch 2022), and The Book of Dash (NoStarch 2022). Chris also coauthored the Coffee Break Python series of self-published books. He’s a computer science enthusiast, freelancer, and owner of one of the top 10 largest Python blogs worldwide.
His passions are writing, reading, and coding. But his greatest passion is to serve aspiring coders through Finxter and help them to boost their skills. You can join his free email academy here.