5 Best Ways to Remove Duplicate Substrings from a List in Python

πŸ’‘ Problem Formulation: In Python, managing lists that contain substrings can be a common scenario. The challenge lies in identifying and removing any duplicate substrings to ensure a list with unique elements. Say we have an input list ['code', 'coder', 'coding', 'code'], the desired output after duplicate substring removal would be ['coder', 'coding'].

Method 1: Using Set Comprehension and the ‘any()’ Function

This method utilizes a set comprehension alongside the any() function to filter out duplicate substrings. For each string it checks that no other, longer string in the list contains it; only strings that are not contained in another element survive, which removes both exact duplicates and substrings while preserving the unique, maximal strings.

Here’s an example:

strings = ['code', 'coder', 'coding', 'code']
unique = {s for s in strings if not any(s in other for other in strings if s != other)}
print(sorted(unique))

Output:

['coder', 'coding']

The set comprehension iterates through each string in the list and uses a nested any() generator expression to determine whether it occurs inside any other string. If it does not, it is included in the resulting set. Since sets are unordered, sorting the set before printing yields a stable output.
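If the original input order matters, the same membership test can accumulate into a list instead of a set. A minimal sketch of that variant:

```python
strings = ['code', 'coder', 'coding', 'code']

unique = []
for s in strings:
    # Keep s only if no other (different) string contains it, skipping repeats.
    if not any(s in other for other in strings if s != other) and s not in unique:
        unique.append(s)

print(unique)  # ['coder', 'coding']
```

This preserves the order in which the surviving strings first appear in the input.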

Method 2: Using a Custom Function and ‘startswith()’

A tailored function can be written that loops over each string and checks whether any other string in the list starts with it, dropping strings that are prefixes of longer strings. This method provides clear semantics and can be more readable, though it only detects prefix duplicates rather than arbitrary substrings.

Here’s an example:

def remove_substrings(strings):
    result = []
    for s in strings:
        if not any(other.startswith(s) for other in strings if other != s):
            result.append(s)
    return result

strings = ['code', 'coder', 'coding', 'code']
print(remove_substrings(strings))

Output:

['coder', 'coding']

The custom function remove_substrings() loops through the input list of strings. For each string, a generator expression with startswith() checks whether it is a prefix of any other string; only strings that are not a prefix of another element are appended to the result list.
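One caveat worth noting: startswith() only catches prefix duplicates. A string that occurs in the middle of a longer string survives, as this sketch (with a hypothetical input) demonstrates:

```python
def remove_substrings(strings):
    result = []
    for s in strings:
        if not any(other.startswith(s) for other in strings if other != s):
            result.append(s)
    return result

# 'ode' occurs inside 'coder' but is not a prefix of it, so it is kept.
print(remove_substrings(['ode', 'coder']))   # ['ode', 'coder']
# A true prefix such as 'code' is removed as expected.
print(remove_substrings(['code', 'coder']))  # ['coder']
```

If full substring removal is required, the `in` operator from Method 1 is the safer test.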

Method 3: Using Sorting and Grouping

By sorting the list so that identical strings become adjacent and then collapsing those runs with itertools.groupby(), we can remove exact duplicates first and strip out remaining substrings in a second pass. The sort is particularly cheap when the list is already sorted or nearly sorted.

Here’s an example:

from itertools import groupby

strings = ['code', 'coder', 'coding', 'code']

# Sorting places identical strings next to each other so groupby() can collapse them.
deduped = [key for key, _ in groupby(sorted(strings))]

# A second pass drops any string contained in a longer remaining string.
unique = [s for s in deduped if not any(s in other for other in deduped if s != other)]
print(unique)

Output:

['coder', 'coding']

This code first sorts the list so that identical strings become adjacent, then uses groupby() from the itertools module to collapse each run of duplicates into a single representative. A final comprehension removes any string that occurs inside a longer surviving string, leaving only the unique, maximal elements.
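The sort matters because itertools.groupby() only merges runs of adjacent equal items; on unsorted input, non-adjacent duplicates are left in place:

```python
from itertools import groupby

# Without sorting, the two 'code' entries are not adjacent and stay separate.
print([key for key, _ in groupby(['code', 'coder', 'code'])])          # ['code', 'coder', 'code']
# Sorting first brings them together so groupby() collapses them.
print([key for key, _ in groupby(sorted(['code', 'coder', 'code']))])  # ['code', 'coder']
```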

Method 4: Using a Trie Data Structure

A trie is an efficient data structure for handling dynamic sets of strings, especially for prefix checks. Inserting all the strings into a trie and then keeping only the strings that end at a leaf node identifies strings that are not prefixes of any longer string. Like Method 2, this detects prefix duplicates rather than arbitrary substrings, but it is a more advanced and scalable solution.

Here’s an example:

class TrieNode:
    def __init__(self):
        self.children = {}
        self.endOfString = False

def insert(root, word):
    node = root
    for char in word:
        if char not in node.children:
            node.children[char] = TrieNode()
        node = node.children[char]
    node.endOfString = True

def unique_substrings(root, strings):
    unique = []
    seen = set()
    for string in strings:
        # Walk the string down the trie; every inserted string has a full path.
        node = root
        for char in string:
            node = node.children[char]
        # A string whose end node has no children is not a prefix of a longer string.
        if not node.children and string not in seen:
            seen.add(string)
            unique.append(string)
    return unique

# Usage
root = TrieNode()
strings = ['code', 'coder', 'coding', 'code']
for string in strings:
    insert(root, string)
print(unique_substrings(root, strings))

Output:

['coder', 'coding']

The code defines a TrieNode class and an insert() function to build the trie. The unique_substrings() function then walks each string through the trie: if the string’s final node has no children, the string is not a prefix of any longer inserted string and is kept, while a seen set filters out exact duplicates. The result is the list of unique, maximal strings.

Bonus One-Liner Method 5: Using ‘filter()’ and a Lambda

As a one-liner approach, we can combine the filter() function with a lambda expression and a nested generator to concisely identify and remove duplicate substrings from the list.

Here’s an example:

strings = ['code', 'coder', 'coding', 'code']
unique = list(filter(lambda x: not any(x in y for y in strings if x != y), strings))
print(unique)

Output:

['coder', 'coding']

This one-liner passes a lambda to filter(): for each string, a nested generator expression checks whether it occurs inside any other element, excluding itself, and such strings are filtered out. It is concise and preserves the input order, though like Methods 1 and 2 it performs a quadratic number of comparisons.
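Note that in Python 3, filter() returns a lazy iterator, which is why the one-liner wraps it in list(); printing the filter object itself shows the object, not the results:

```python
strings = ['code', 'coder', 'coding', 'code']

it = filter(lambda x: not any(x in y for y in strings if x != y), strings)
print(it)        # a <filter object ...>, not the filtered strings
print(list(it))  # ['coder', 'coding'] -- consuming the iterator materializes them
```

Once consumed, the iterator is exhausted; call filter() again if the result is needed twice.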

Summary/Discussion

  • Method 1: Set Comprehension with ‘any()’. Strengths: concise, no import necessary. Weaknesses: quadratic number of comparisons; sets do not preserve input order.
  • Method 2: Custom Function with ‘startswith()’. Strengths: easy to understand, explicit logic. Weaknesses: only catches prefix duplicates; potentially slower with large lists.
  • Method 3: Sorting and Grouping. Strengths: collapses exact duplicates cleanly, elegant use of itertools. Weaknesses: overhead of sorting the list; still needs a second pass to remove substrings.
  • Method 4: Trie Data Structure. Strengths: very efficient for large sets of prefix queries, versatile for strings. Weaknesses: complex to implement; handles prefixes rather than arbitrary substrings; overkill for small lists.
  • Method 5: One-Liner with ‘filter()’. Strengths: elegant, functional programming style. Weaknesses: may lack clarity for those unfamiliar with lambda expressions.