Python Character Set [Regex Tutorial]

This tutorial makes you a master of character sets in Python. (I know, I know, it feels awesome to see your deepest desires finally come true.)

As I wrote this article, I saw a lot of different terms describing this same powerful concept, such as “character class”, “character range”, or “character group”. However, the most precise term is “character set”, following the official Python regex docs, which describe the brackets as indicating a set of characters. So in this tutorial, I’ll use this term throughout.

Related article: Python Regex Superpower – The Ultimate Guide

Python Regex – Character Set

So, what is a character set in regular expressions?

The character set is (surprise) a set of characters: if you use a character set in a regular expression pattern, you tell the regex engine to match exactly one character, and that character may be any member of the set. As you may know, a set is an unordered collection of unique elements. So listing a character twice changes nothing, and the order doesn’t really matter (with a few minor exceptions: a caret ‘^’ is special only as the first character, and a hyphen ‘-’ is special only between two characters).

Here’s an example of a character set as used in a regular expression:

>>> import re
>>> re.findall('[abcde]', 'hello world!')
['e', 'd']

You use the re.findall(pattern, string) method to match the pattern '[abcde]' in the string 'hello world!'. You can think of the characters a, b, c, d, and e as being in an OR relation: any one of them is a valid match.

The regex engine goes from the left to the right, scanning over the string ‘hello world!’ and simultaneously trying to match the (character set) pattern. Two characters from the text ‘hello world!’ are in the character set—they are valid matches and returned by the re.findall() method.
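To check the claim that order and duplicates inside the brackets make no difference, here is a minimal sketch. The variable names are only for illustration.

```python
import re

text = 'hello world!'

# Three spellings of the same set: original order, reversed, and with duplicates.
in_order = re.findall('[abcde]', text)
reversed_order = re.findall('[edcba]', text)
with_duplicates = re.findall('[aabbccddee]', text)

print(in_order)         # ['e', 'd']
print(reversed_order)   # ['e', 'd']
print(with_duplicates)  # ['e', 'd']
```

All three patterns return the matches in the order they appear in the text, not in the order the characters appear in the set.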

You can simplify many character sets by using the range symbol ‘-’ that has a special meaning within square brackets: [a-z] reads “match any character from a to z”, while [0-9] reads “match any character from 0 to 9”.

Here’s the previous example, simplified:

>>> re.findall('[a-e]', 'hello world!')
['e', 'd']

You can even combine multiple character ranges in a single character set:

>>> re.findall('[a-eA-E0-4]', 'hello WORLD 42!')
['e', 'D', '4', '2']

Here, you match three ranges: lowercase characters from a to e, uppercase characters from A to E, and digits from 0 to 4. Note that the ranges are inclusive, so both the start and the end character of each range are included.
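One of the few cases where position inside the brackets matters is the hyphen itself. The following sketch (with a made-up test string) compares a range with the equivalent explicit set, and then shows that a hyphen placed first in the set is treated as a literal hyphen rather than a range symbol.

```python
import re

text = 'a-e'

# The range [a-e] is shorthand for the explicit set [abcde].
range_matches = re.findall('[a-e]', text)
explicit_matches = re.findall('[abcde]', text)
print(range_matches)     # ['a', 'e']
print(explicit_matches)  # ['a', 'e']

# A hyphen at the very start of the set is a literal hyphen, not a range.
hyphen_matches = re.findall('[-ae]', text)
print(hyphen_matches)    # ['a', '-', 'e']
```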

Python Regex Negative Character Set

But what if you want to match all characters—except some? You can achieve this with a negative character set!

The negative character set works just like a character set, but with one difference: it matches all characters that are not in the character set.

Here’s an example where you match all sequences of characters that do not contain characters a, b, c, d, or e:

>>> import re
>>> re.findall('[^a-e]+', 'hello world')
['h', 'llo worl']

We use the “at-least-once quantifier +” in the example that matches at least one occurrence of the preceding regex (if you’re unsure about how it works, check out my detailed Finxter tutorial about the plus operator).

There are only two such sequences: the one-character sequence ‘h’ and the eight-character sequence ‘llo worl’. You can see that even the empty space matches the negative character set.
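To see what the plus quantifier contributes, this small sketch runs the same negative character set with and without it.

```python
import re

text = 'hello world'

# Without '+', every match is exactly one character.
single_chars = re.findall('[^a-e]', text)
print(single_chars)  # ['h', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l']

# With '+', neighboring matches are merged into the longest possible runs.
runs = re.findall('[^a-e]+', text)
print(runs)          # ['h', 'llo worl']
```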

Summary: the negative character set matches all characters that are not enclosed in the brackets.
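Two details are easy to miss here: the caret negates the set only when it is the first character after the opening bracket, and a negative character set also matches newline characters. A short sketch with made-up test strings:

```python
import re

# Caret in first position: negation, so everything except 'a' matches.
negated = re.findall('[^a]', 'a^b')
print(negated)        # ['^', 'b']

# Caret in any other position: a literal '^' that is simply part of the set.
literal_caret = re.findall('[a^]', 'a^b')
print(literal_caret)  # ['a', '^']

# A negative character set matches the newline character as well.
with_newline = re.findall('[^a-e]', 'a\nb')
print(with_newline)   # ['\n']
```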

How to Fix “re.error: unterminated character set at position”?

Now that you know character sets, you can probably fix this error easily: it occurs if your pattern contains an opening bracket ‘[’ that is never closed. (A lone closing bracket ‘]’ does not raise this error; it is matched as a literal character.) Maybe you want to match the character ‘[’ in your string?

But Python assumes that you’ve just opened a character set and forgot to close it.

Here’s an example:

>>> re.findall('[', 'hello [world]')
Traceback (most recent call last):
  File "<pyshell#5>", line 1, in <module>
    re.findall('[', 'hello [world]')
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\re.py", line 223, in findall
    return _compile(pattern, flags).findall(string)
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\re.py", line 286, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\sre_parse.py", line 930, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\sre_parse.py", line 426, in _parse_sub
    not nested and not items))
  File "C:\Users\xcent\AppData\Local\Programs\Python\Python37\lib\sre_parse.py", line 532, in _parse
    source.tell() - here)
re.error: unterminated character set at position 0

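The error happens because you used the bracket character ‘[’ as if it were a normal symbol.

So, how do you fix it? Escape the special bracket character with a single backslash, ‘\[’, and write the pattern as a raw string so that the backslash reaches the regex engine unchanged (newer Python versions warn about '\[' in a normal string):

>>> re.findall(r'\[', 'hello [world]')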
['[']

This removes the “special” meaning of the bracket symbol.
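If the text you want to match comes from somewhere else (user input, for example), escaping by hand gets tedious. The following sketch shows one possible approach: catch re.error to see the message, and use the standard library function re.escape() to escape a literal string automatically. The variable names are made up for this example.

```python
import re

text = 'hello [world]'

# Compiling a lone '[' raises re.error; catching it makes the cause visible.
try:
    re.compile('[')
    error_message = None
except re.error as exc:
    error_message = str(exc)
print(error_message)

# A lone closing bracket is treated as a literal character.
closing = re.findall(']', text)
print(closing)  # [']']

# re.escape() escapes the special characters in a literal string for you.
literal = re.findall(re.escape('[world]'), text)
print(literal)  # ['[world]']
```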

Related Re Methods

There are seven important regular expression methods which you must master:

  • The re.findall(pattern, string) method returns a list of string matches. Read more in our blog tutorial.
  • The re.search(pattern, string) method returns a match object of the first match. Read more in our blog tutorial.
  • The re.match(pattern, string) method returns a match object if the regex matches at the beginning of the string. Read more in our blog tutorial.
  • The re.fullmatch(pattern, string) method returns a match object if the regex matches the whole string. Read more in our blog tutorial.
  • The re.compile(pattern) method prepares the regular expression pattern—and returns a regex object which you can use multiple times in your code. Read more in our blog tutorial.
  • The re.split(pattern, string) method returns a list of strings by matching all occurrences of the pattern in the string and dividing the string along those. Read more in our blog tutorial.
  • The re.sub(pattern, repl, string, count=0, flags=0) method returns a new string where all occurrences of the pattern in the old string are replaced by repl. Read more in our blog tutorial.
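As a rough sketch of how the seven methods above behave with a character set pattern, here is one small script that calls each of them. The test strings and variable names are chosen only for illustration.

```python
import re

text = 'hello world'
vowels = '[aeiou]'

all_vowels = re.findall(vowels, text)          # every match as a list
first_vowel = re.search(vowels, text).group()  # first match anywhere
starts_with_vowel = re.match(vowels, text)     # None: text starts with 'h'
whole = re.fullmatch('[a-z ]+', text).group()  # the entire string matches
compiled = re.compile(vowels)                  # reusable regex object
compiled_vowels = compiled.findall(text)
parts = re.split('[ ,]', 'a,b c')              # split on space or comma
masked = re.sub(vowels, '*', text)             # replace every vowel

print(all_vowels)         # ['e', 'o', 'o']
print(first_vowel)        # e
print(starts_with_vowel)  # None
print(whole)              # hello world
print(compiled_vowels)    # ['e', 'o', 'o']
print(parts)              # ['a', 'b', 'c']
print(masked)             # h*ll* w*rld
```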

These seven methods are 80% of what you need to know to get started with Python’s regular expression functionality. If you want to learn more, check out the most comprehensive Python regex tutorial in the world!

Where to Go From Here?

You’ve learned everything you need to know about the Python Regex Character Set Operator.

Summary:

If you use a character set [XYZ] in a regular expression pattern, you tell the regex engine to choose one arbitrary character from the set: X, Y, or Z.

