5 Best Ways to Count the Number of Distinct Substrings in Python

💡 Problem Formulation: When working with text in Python, it’s sometimes necessary to count the number of distinct substrings within a given string s. For example, the input ‘ababa’ contains nine distinct non-empty substrings (a, b, ab, ba, aba, bab, abab, baba and ababa), so the desired output is 9. Counting these substrings comes up in text analysis, data processing, and algorithm development.

Method 1: Brute Force Approach

The brute force method uses two nested loops to generate every substring s[i:j] and stores each one in a set. Since sets cannot contain duplicates, the number of distinct substrings is simply the size of the set.

Here’s an example:

def count_distinct_substrings(s):
    distinct_substrings = set()
    for i in range(len(s)):
        for j in range(i+1, len(s)+1):
            distinct_substrings.add(s[i:j])
    return len(distinct_substrings)

print(count_distinct_substrings("ababa"))

Output: 9

This snippet defines a function count_distinct_substrings that loops through all possible start and end positions for substrings of the given string s. It adds each substring to a set, automatically removing duplicates. Finally, it returns the count of unique substrings.
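
As a quick sanity check, the distinct substrings can be grouped by length. The helper below is an illustrative sketch (the name distinct_substrings_by_length is an assumption, not part of the function above) that reuses the same two loops:

```python
from collections import defaultdict

def distinct_substrings_by_length(s):
    # Map each substring length to the set of distinct substrings of that length
    groups = defaultdict(set)
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            groups[j - i].add(s[i:j])
    # Sort lengths and substrings so the result is easy to read and compare
    return {length: sorted(subs) for length, subs in sorted(groups.items())}

for length, subs in distinct_substrings_by_length("ababa").items():
    print(length, subs)
```

For ‘ababa’ this prints two substrings each for lengths 1 to 4 and one for length 5, which is 2 + 2 + 2 + 2 + 1 = 9 in total.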

Method 2: Using itertools Combinations

The itertools module provides a more compact way to generate all possible substrings. combinations(range(len(s) + 1), 2) yields every index pair (i, j) with i < j, which are exactly the start and end positions of the non-empty substrings, and each slice s[i:j] is added to a set.

Here’s an example:

from itertools import combinations

def count_distinct_substrings(s):
    # Each pair (i, j) with i < j marks one non-empty slice s[i:j]
    distinct_substrings = {s[i:j] for i, j in combinations(range(len(s) + 1), 2)}
    return len(distinct_substrings)

print(count_distinct_substrings("ababa"))

Output: 9

This code uses a set comprehension with combinations from the itertools module for generating all possible pairs of indices, representing the start and end of substrings in a single line of code. Each substring is added to the set, ensuring all entries are unique.
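
To make the index pairs concrete, here is a small sketch on the two-character string "ab" (chosen only for illustration) showing what combinations produces and which slices result:

```python
from itertools import combinations

s = "ab"
# All (start, end) pairs with start < end over the positions 0..len(s)
pairs = list(combinations(range(len(s) + 1), 2))
print(pairs)                       # [(0, 1), (0, 2), (1, 2)]
print([s[i:j] for i, j in pairs])  # ['a', 'ab', 'b']
```

Because combinations never yields a pair with i >= j, no empty slices are produced.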

Method 3: Dynamic Programming

A dynamic programming approach builds the answer for a longer prefix from the answer for a shorter one. Let dp[j] be the number of distinct substrings of the prefix s[:j]. Any substring that is new in s[:j] must end at the last character of that prefix, so dp[j] equals dp[j-1] plus the number of substrings s[i:j] that have not been seen before.

Here’s an example:

def count_distinct_substrings(s):
    n = len(s)
    # dp[j] = number of distinct substrings of the prefix s[:j]
    dp = [0] * (n + 1)
    substrings = set()
    for j in range(1, n + 1):
        dp[j] = dp[j - 1]
        # Only substrings ending at index j - 1 can be new for this prefix
        for i in range(j):
            if s[i:j] not in substrings:
                substrings.add(s[i:j])
                dp[j] += 1
    return dp[n]

print(count_distinct_substrings("ababa"))

Output: 9

The one-dimensional array dp stores the number of distinct substrings of each prefix of s, and the set records which substrings have already been counted. The recurrence makes the running totals explicit (dp is [0, 1, 3, 5, 7, 9] for ‘ababa’), but the function still slices and hashes every substring, so its cost matches that of the brute force method.
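
A variant that avoids the set altogether is sketched below. It is a different technique from the function above: it fills a table of longest common prefixes between suffixes and counts, for each start position, only the substrings that did not already start earlier. The function name count_distinct_substrings_lcp and the table name lcp are assumptions made for this illustration.

```python
def count_distinct_substrings_lcp(s):
    n = len(s)
    # lcp[i][j] = length of the longest common prefix of s[i:] and s[j:]
    lcp = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if s[i] == s[j]:
                lcp[i][j] = lcp[i + 1][j + 1] + 1

    total = 0
    for i in range(n):
        # Longest prefix of s[i:] that also starts at some earlier position j < i
        longest_seen = max((lcp[j][i] for j in range(i)), default=0)
        # Prefixes of s[i:] longer than that have not occurred before
        total += (n - i) - longest_seen
    return total

print(count_distinct_substrings_lcp("ababa"))  # 9
```

This sketch uses O(n^2) time and O(n^2) memory for the table, and it never builds a substring object.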

Method 4: Trie Data Structure

A more advanced method utilizes a trie (prefix tree) data structure to efficiently store and count unique substrings. Each node of the trie represents a character of some substring, and each path from the root to a node represents a unique substring.

Here’s an example:

class TrieNode:
    def __init__(self):
        self.children = {}

def add_to_trie(s, root):
    count = 0
    for character in s:
        if character not in root.children:
            count += 1
            root.children[character] = TrieNode()
        root = root.children[character]
    return count

def count_distinct_substrings(s):
    root = TrieNode()
    total_count = 0
    for i in range(len(s)):
        total_count += add_to_trie(s[i:], root)
    return total_count

print(count_distinct_substrings("ababa"))

Output: 9

The function count_distinct_substrings inserts every suffix of s into a single trie. Because every substring is a prefix of some suffix, each trie node other than the root corresponds to exactly one distinct substring. The add_to_trie function returns how many new nodes a suffix created, and the sum of those values is the number of distinct substrings.
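
The correspondence between trie nodes and substrings can be checked by walking the trie and spelling out the path to each node. The following is a self-contained sketch; build_suffix_trie and iter_substrings are hypothetical helper names introduced here, not part of the code above.

```python
class TrieNode:
    def __init__(self):
        self.children = {}

def build_suffix_trie(s):
    """Insert every suffix of s into one trie and return its root."""
    root = TrieNode()
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
    return root

def iter_substrings(node, prefix=""):
    """Yield the string spelled by the path to every node below `node`."""
    for ch, child in node.children.items():
        yield prefix + ch
        yield from iter_substrings(child, prefix + ch)

substrings = sorted(iter_substrings(build_suffix_trie("ababa")))
print(len(substrings), substrings)
```

The traversal is recursive, so for very long strings it may hit Python's recursion limit; an explicit stack would avoid that.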

Bonus One-Liner Method 5: Using Python’s Set Comprehension

A modern, minimalistic one-liner solution employs set comprehension to generate all unique substrings in a Pythonic way.

Here’s an example:

s = "ababa"
print(len({s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}))

Output: 9

This one-liner is a condensed version of the brute force method, using a set comprehension to create a set of all unique substrings by varying the start and end indices within the ranges that define possible substrings in s.

Summary/Discussion

Method 1: Brute Force. Simple and easy to understand. Inefficient for long strings: it builds O(n^2) substrings, and each one costs up to O(n) to slice and hash, giving O(n^3) time.
Method 2: itertools Combinations. Compact and arguably more Pythonic. It generates the same substrings as the brute force method, so it has the same time and memory cost.
Method 3: Dynamic Programming. Makes the per-prefix counts explicit. As written it still stores every substring in a set, so it needs extra memory for dp and is no faster than brute force.
Method 4: Trie Data Structure. Shares common prefixes and avoids hashing whole substrings, which brings the work down to O(n^2) node visits. It can still create up to O(n^2) nodes and is more complex to implement.
Method 5: Set Comprehension One-Liner. Elegant and Pythonic. Has the same performance drawbacks as the brute force method.
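
Whichever method is used, a few inputs with known answers make convenient regression checks. The sketch below uses the set-comprehension version as a reference. The expected values follow from counting: a string of n identical characters has n distinct substrings, and a string of n different characters has n(n+1)/2.

```python
def count_distinct_substrings(s):
    # Reference implementation: collect every slice s[i:j] in a set
    return len({s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)})

assert count_distinct_substrings("") == 0            # no non-empty substrings
assert count_distinct_substrings("aaaa") == 4        # a, aa, aaa, aaaa
assert count_distinct_substrings("abcd") == 4 * 5 // 2
assert count_distinct_substrings("ababa") == 9
print("all checks passed")
```

Any of the other implementations can be swapped in for count_distinct_substrings and should pass the same assertions.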