5 Ways to Filter a List of Strings Based on a Regex Pattern in Python

💡 Problem Formulation: When working with a list of strings in Python, a common task is to filter the list based on specific patterns. Regular expressions (regex) are a powerful way of defining these patterns, enabling complex matching criteria that go well beyond simple substring checks.

Let's see how we can use regex to filter a list of strings in Python.

Method 1: Using re.match and List Comprehension

The re.match function is used to check if the string starts with the specified pattern. Pairing re.match with list comprehension is a common and readable way to filter lists.

Here's how you can use it:

import re

strings = ["123foo", "bar", "456baz", "qux"]
pattern = re.compile(r'^\d+')  # matches strings starting with digits (^ is redundant with match() but states the intent)
filtered_list = [s for s in strings if not pattern.match(s)]
print(filtered_list)  # ['bar', 'qux']

This code compiles a regex that matches one or more digits at the start of a string. The list comprehension keeps only the strings that do not match, so the strings starting with digits are filtered out and filtered_list is ['bar', 'qux'].
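Because re.match only tries the pattern at the beginning of a string, it can be useful to see it side by side with re.search. The following is a minimal sketch of that difference; the samples list and the variable names are illustrative assumptions, not part of the example above.

```python
import re

# Hypothetical sample data with digits in different positions.
samples = ["123foo", "foo123", "bar"]
digits = re.compile(r"\d+")

# match() only tries the pattern at position 0 of each string.
starts_with_digits = [s for s in samples if digits.match(s)]

# search() scans the whole string, so digits anywhere count.
contains_digits = [s for s in samples if digits.search(s)]

print(starts_with_digits)  # ['123foo']
print(contains_digits)     # ['123foo', 'foo123']
```

If the goal is "contains the pattern anywhere", search is usually the better fit; match suits prefix checks.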

Method 2: Using re.search() and filter()

Another way is to use re.search(), which scans the entire string for the first location where the pattern matches. Combined with the built-in filter() function, it can be applied to each element of the list.

Here's an example:

import re

strings = ["foo", "123bar", "baz", "qux123"]
pattern = re.compile(r'123')  # regex to match strings containing '123'
filtered_list = list(filter(lambda s: not pattern.search(s), strings))

In this snippet, pattern.search looks for the literal substring '123' anywhere in each string, and filter keeps only the strings for which the lambda returns True, that is, those without a match. The result is ['foo', 'baz'], a list where none of the strings contain '123'.
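The lambda is only needed to negate the match. As a possible alternative, the standard library's itertools.filterfalse keeps the items for which a callable returns a falsy value, so the bound method pattern.search can be passed directly. This is a sketch of that variation; the names with_123 and without_123 are illustrative.

```python
import re
from itertools import filterfalse

strings = ["foo", "123bar", "baz", "qux123"]
pattern = re.compile(r"123")

# filter() keeps items where the callable returns something truthy
# (a match object); filterfalse() keeps items where it returns None.
with_123 = list(filter(pattern.search, strings))
without_123 = list(filterfalse(pattern.search, strings))

print(with_123)     # ['123bar', 'qux123']
print(without_123)  # ['foo', 'baz']
```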

Method 3: Using re.fullmatch and a Function

If we need to check if the entire string strictly conforms to a pattern, re.fullmatch() is our tool.

We can define a function to encapsulate our filtering logic:

import re

def filter_strings(strings, regex):
    pattern = re.compile(regex)
    return [s for s in strings if not pattern.fullmatch(s)]

strings = ["abc", "a1b2c3", "123", "xyz"]
regex = r'\d+'  # regex to match strings that are fully numeric
filtered_list = filter_strings(strings, regex)

This function compiles the provided regex and filters the list with a list comprehension. Only strings that aren't fully numeric remain: ['abc', 'a1b2c3', 'xyz'].
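The distinction between a prefix match and a whole-string match is easy to overlook, so here is a small sketch comparing match and fullmatch with the same pattern. The extra sample "123abc" is a hypothetical addition chosen to make the difference visible.

```python
import re

# "123abc" is added for illustration: it starts with digits but is not fully numeric.
strings = ["abc", "a1b2c3", "123", "123abc"]
pattern = re.compile(r"\d+")

# match() succeeds as soon as a prefix of the string fits the pattern.
prefix_numeric = [s for s in strings if pattern.match(s)]

# fullmatch() succeeds only if the entire string fits the pattern.
fully_numeric = [s for s in strings if pattern.fullmatch(s)]

print(prefix_numeric)  # ['123', '123abc']
print(fully_numeric)   # ['123']
```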

Method 4: Precompiled Regex and Generators

For large data sets, using generators can save memory. Let's pair a precompiled regex with a generator expression to filter our list:

import re

strings = ["foo", "baz1", "2bar", "123foo"]
pattern = re.compile(r'[^0-9]+')  # regex to match strings without any digits
filtered_list = (s for s in strings if pattern.fullmatch(s))

# Use the generator
for valid_string in filtered_list:
    print(valid_string)

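This method is memory-efficient because filtered_list doesn't hold the filtered data all at once; it produces matching items on the fly as the loop requests them. Despite its name, filtered_list is a generator rather than a list, so it can be iterated only once. With the sample data, only 'foo' is printed.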
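One consequence of lazy evaluation is that a generator is exhausted after a single pass. The sketch below wraps the generator expression in a small helper so that a fresh generator can be created on demand; the helper name no_digit_strings is a hypothetical choice for this illustration.

```python
import re

strings = ["foo", "baz1", "2bar", "123foo"]
pattern = re.compile(r"[^0-9]+")

def no_digit_strings(items):
    """Return a generator over the items that contain no digits (hypothetical helper)."""
    return (s for s in items if pattern.fullmatch(s))

# Each call builds a new generator, so the data can be filtered again later.
first_pass = list(no_digit_strings(strings))

gen = no_digit_strings(strings)
consumed = list(gen)     # this pass exhausts the generator
second_pass = list(gen)  # nothing is left to yield

print(first_pass)   # ['foo']
print(second_pass)  # []
```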

Method 5: Using re.findall and Custom Filtering Logic

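At times, you might want to use re.findall(), which returns a list of all non-overlapping matches in a string, to customize your filtering criteria further. An empty list is falsy, so the result can be used directly as a condition. Below is a way to use this approach: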

import re

strings = ["hello123", "test", "world12345", "regex"]
pattern = re.compile(r'123')

def custom_filter(strings, pattern):
    return [s for s in strings if not pattern.findall(s)]

filtered_list = custom_filter(strings, pattern)

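In this case, the custom_filter function looks for occurrences of '123' and returns a new list without the strings that contain it, here ['test', 'regex']. When only a yes/no answer is needed, pattern.search does the same job without building a list of matches.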
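Where findall becomes more useful than search is when the filter depends on how many matches there are, not just whether one exists. The following sketch keeps strings with at least a given number of separate digit runs; the function with_at_least, its minimum parameter, and the sample "a1b22c333" are assumptions made for this illustration.

```python
import re

# "a1b22c333" is a hypothetical sample with several separate digit runs.
strings = ["hello123", "test", "world12345", "a1b22c333"]
digit_runs = re.compile(r"\d+")

def with_at_least(strings, pattern, minimum):
    """Keep strings in which the pattern matches at least `minimum` times."""
    return [s for s in strings if len(pattern.findall(s)) >= minimum]

runs = digit_runs.findall("a1b22c333")
multi_run = with_at_least(strings, digit_runs, 2)

print(runs)       # ['1', '22', '333']
print(multi_run)  # ['a1b22c333']
```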

Bonus One-Liner Method 6: Inline Regex with filter

Finally, for a quick, one-liner method, Python's filter function can be used with an inline regex and re.match, like so:

import re

strings = ["data1", "info100", "data", "statistics"]
filtered_list = list(filter(lambda s: not re.match(r'.*\d', s), strings))

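Here, we're filtering out strings that contain a digit by negating the regex match directly within filter. Because re.match anchors only at the start and .* can consume any prefix, the pattern r'.*\d' succeeds for a digit anywhere in the string, not just at the end. The result is ['data', 'statistics'].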
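Matching a digit anywhere and matching a digit at the end are different filters, and the difference only shows up with the right test data. This sketch compares the two; the sample "v2beta" and the use of re.search with r'\d$' are illustrative choices, not part of the one-liner above.

```python
import re

# "v2beta" is a hypothetical sample: it contains a digit but does not end with one.
strings = ["data1", "info100", "v2beta", "statistics"]

# r".*\d" with match() succeeds if a digit occurs anywhere in the string.
no_digit_anywhere = list(filter(lambda s: not re.match(r".*\d", s), strings))

# r"\d$" with search() succeeds only if the string ends with a digit.
not_ending_in_digit = list(filter(lambda s: not re.search(r"\d$", s), strings))

print(no_digit_anywhere)    # ['statistics']
print(not_ending_in_digit)  # ['v2beta', 'statistics']
```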