5 Best Ways to Group Similar Substrings in a Python List

πŸ’‘ Problem Formulation: When working with lists of strings in Python, a common task is to group items based on shared substrings. For instance, given an input list ["apple", "applet", "bat", "battery", "batman"], the goal is to identify and group the items that share common substrings, potentially resulting in output like [["apple", "applet"], ["bat", "battery", "batman"]]. This article explores various methods to achieve such grouping efficiently.

Method 1: Using Python Default Libraries

This method involves the use of traditional Python libraries such as itertools for grouping and collections for maintaining a dictionary of the groups. It’s a straightforward approach leveraging the standard library.

Here’s an example:

import itertools
from collections import defaultdict

def group_similar_substrings(str_list):
    grouped_substrings = defaultdict(list)
    for item1, item2 in itertools.combinations(str_list, 2):
        if item1 in item2 or item2 in item1:
            grouped_substrings[item1].append(item2)
    return grouped_substrings

example_list = ["apple", "applet", "bat", "battery", "batman"]
print(group_similar_substrings(example_list))

Output:

{'apple': ['applet'], 'bat': ['battery', 'batman']}

This code snippet uses itertools.combinations to check every possible pair of strings to see if they contain similar substrings. Whenever we find a match, we add the matching item to a list under the original string’s key inside a defaultdict dictionary. This helps us avoid key errors that might occur if we were using a regular dictionary.

Method 2: Using List Comprehensions

This method takes a functional approach, using list comprehensions to create groups. It’s concise and leverages Python’s list comprehension feature for a neat and readable solution.

Here’s an example:

def group_similar_substrings(str_list):
    return {base: [s for s in str_list if base in s] 
            for base in set(str_list)}

example_list = ["apple", "applet", "bat", "battery", "batman"]
print(group_similar_substrings(example_list))

Output:

{'batman': ['batman'], 'battery': ['battery'], 'bat': ['bat', 'battery', 'batman'], 'applet': ['applet'], 'apple': ['apple', 'applet']}

In this code snippet, a dictionary comprehension iterates over each item in the set of the input list (which removes duplicates) and then generates list comprehensions for all strings containing the base string. It’s a compact and efficient way to group similar substrings without explicitly writing loops.

Method 3: Using Regular Expressions

Regular expressions are a powerful tool for string matching. By using Python’s re module, we can create groups based on matching patterns. This method is especially useful when dealing with complex substring patterns.

Here’s an example:

import re

def group_similar_substrings(str_list):
    grouped_substrings = defaultdict(list)
    for base in str_list:
        pattern = re.compile(re.escape(base))
        grouped_substrings[base] = [s for s in str_list if pattern.search(s)]
    return grouped_substrings

example_list = ["apple", "applet", "bat", "battery", "batman"]
print(group_similar_substrings(example_list))

Output:

{'apple': ['apple', 'applet'], 'applet': ['applet'], 'bat': ['bat', 'battery', 'batman'], 'battery': ['battery'], 'batman': ['batman']}

This code uses Python’s re module to compile a regex pattern from each string in the input list. Then, it searches for this pattern in all strings, collecting matches in the grouped_substrings dictionary. This approach is robust against any kind of substring and can be customized with different regex patterns.

Method 4: Using Trie Data Structure

A Trie is an efficient information retrieval data structure that can be used for grouping substrings. This approach builds up the Trie with the input strings and then uses it to group the substrings. It is especially efficient for a large number of strings with shared prefixes.

Here’s an example:

class TrieNode:
    def __init__(self):
        self.children = defaultdict(TrieNode)
        self.end_of_word = False

def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for char in word:
            node = node.children[char]
        node.end_of_word = True
    return root

def find_prefixes(root, prefix=''):
    prefixes = []
    if root.end_of_word:
        prefixes.append(prefix)
    for char, node in root.children.items():
        prefixes.extend(find_prefixes(node, prefix + char))
    return prefixes

words = ["apple", "applet", "bat", "battery", "batman"]
trie = build_trie(words)
grouped = {word: find_prefixes(trie, word) for word in words}
print(grouped)

Output:

{'apple': ['apple', 'applet'], 'applet': ['applet'], 'bat': ['bat', 'battery', 'batman'], 'battery': ['battery'], 'batman': ['batman']}

In this example, a Trie data structure is created, and the words are entered into the Trie. The find_prefixes function is used to recursively find all words that start with the given prefix. This method typically runs faster than other methods when dealing with a large number of strings, due to its efficient retrieval characteristics.

Bonus One-Liner Method 5: Using Dictionary Comprehension and filter

This one-liner method uses dictionary comprehension along with the filter function for a quick and easy solution. It is very concise and Pythonic.

Here’s an example:

example_list = ["apple", "applet", "bat", "battery", "batman"]
grouped = {base: list(filter(lambda x: base in x, example_list)) for base in set(example_list)}
print(grouped)

Output:

{'batman': ['batman'], 'battery': ['battery'], 'bat': ['bat', 'battery', 'batman'], 'applet': ['applet'], 'apple': ['apple', 'applet']}

The example makes use of a dictionary comprehension together with a lambda function to filter strings that contain the base string from the set of the original list. It’s a highly readable and efficient one-liner that takes advantage of Python’s functional programming capabilities.

Summary/Discussion

  • Method 1: Using Python Default Libraries. Easy to understand. Potentially less efficient due to explicit pairing.
  • Method 2: Using List Comprehensions. Clean and Pythonic. May not be the most efficient for large data sets.
  • Method 3: Using Regular Expressions. Highly customizable and powerful. Can become complex very quickly.
  • Method 4: Using Trie Data Structure. Very efficient for large lists with shared prefixes. Implementation is more complex.
  • Method 5: Bonus One-Liner Using Dictionary Comprehension and filter. Extremely concise. Could be less efficient if not used wisely.