5 Best Ways to Find the Number of Distinct Substrings in a String Using Python

💡 Problem Formulation: In this article, we address the problem of counting the distinct non-empty substrings of a given string using Python. For instance, given the input string “ababa”, the desired output is 9, as there are 9 distinct substrings: “a”, “b”, “ab”, “ba”, “aba”, “bab”, “abab”, “baba”, and “ababa”. Let’s explore different methods to solve this problem.

Method 1: Brute Force

The brute force method entails generating all possible substrings and storing them in a set to eliminate duplicates. Finally, you can output the size of this set as the number of distinct substrings.

Here’s an example:

def distinct_substrings(s):
    substr_set = set()
    for i in range(len(s)):
        for j in range(i+1, len(s)+1):
            substr_set.add(s[i:j])
    return len(substr_set)

print(distinct_substrings("ababa"))

Output: 9

This code snippet creates all possible substrings by using two nested loops. It starts with an empty set substr_set, where it stores each substring to ensure that all substrings are unique. The function returns the count of distinct substrings, which is simply the size of the set.
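To see which substrings actually end up in the set, the same loops can return the sorted collection instead of its size. This is only an illustrative sketch; the helper name list_distinct_substrings is an assumption introduced here, not part of the method above.

```python
def list_distinct_substrings(s):
    # Collect every slice s[i:j] with i < j; the set drops duplicates.
    found = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
    # Sort by length, then alphabetically, for a readable listing.
    return sorted(found, key=lambda sub: (len(sub), sub))

print(list_distinct_substrings("ababa"))
# ['a', 'b', 'ab', 'ba', 'aba', 'bab', 'abab', 'baba', 'ababa']
```

Of the 15 slices the loops produce for “ababa”, only these 9 are different from one another.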

Method 2: Using itertools combinations

The itertools module provides efficient looping constructs. Using itertools.combinations, you can generate all possible starting and ending indices for the substrings in a single line and then loop over these pairs to create the substrings themselves.

Here’s an example:

import itertools

def distinct_substrings_itertools(s):
    return len(set(s[i:j] for i, j in itertools.combinations(range(len(s) + 1), 2)))

print(distinct_substrings_itertools("ababa"))

Output: 9

This one-liner leverages a generator expression fed into a set for uniqueness. Here, itertools.combinations is used to generate all combinations of substring indices, which are then sliced from the main string to form the substrings.
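It can help to look at the index pairs on a tiny input. The sketch below uses the two-character string "ab" as an assumed example and prints the pairs next to the slices they produce.

```python
import itertools

s = "ab"
# Every pair (i, j) with 0 <= i < j <= len(s) marks one slice s[i:j].
pairs = list(itertools.combinations(range(len(s) + 1), 2))
print(pairs)                       # [(0, 1), (0, 2), (1, 2)]
print([s[i:j] for i, j in pairs])  # ['a', 'ab', 'b']
```

Because combinations only yields pairs in increasing order, every slice is non-empty and no start/end pair is visited twice.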

Method 3: Dynamic Programming

Dynamic programming can be applied by storing, for each prefix of the string, the number of distinct substrings it contains. The count for a prefix equals the count for the previous prefix plus the number of new substrings that end at the current index.

Here’s an example:

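def distinct_substrings_dp(s):
    n = len(s)
    # dp[k] holds the number of distinct substrings of the prefix s[:k]
    dp = [0] * (n + 1)
    seen = set()
    for j in range(n):
        new_count = 0
        # Try every substring that ends at index j
        for i in range(j + 1):
            sub = s[i:j + 1]
            if sub not in seen:
                seen.add(sub)
                new_count += 1
        dp[j + 1] = dp[j] + new_count
    return dp[n]

print(distinct_substrings_dp("ababa"))

Output: 9

The code keeps an array dp in which dp[k] is the number of distinct substrings of the prefix s[:k]. For each end position j, it counts the substrings ending at j that have not been seen before and adds that number to the previous entry, so dp[n] is the answer for the whole string. The set seen is never cleared: clearing it between iterations would count repeated substrings again and return 15 for “ababa”, which is the total number of substrings rather than the number of distinct ones.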
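The recurrence described in this section (the count for a prefix is the previous count plus the new substrings ending at the current index) can also be sketched without storing any substrings. The variant below is an assumption-laden illustration, not the method above: it relies on the observation that a suffix of s[:j+1] is new exactly when it does not occur anywhere in s[:j], and the function name is hypothetical.

```python
def distinct_substrings_incremental(s):
    total = 0
    for j in range(len(s)):
        earlier = s[:j]  # the text available before position j
        # Find the longest suffix of s[:j+1] that already occurs in `earlier`.
        longest_repeat = 0
        for length in range(1, j + 1):
            if s[j + 1 - length:j + 1] in earlier:
                longest_repeat = length
            else:
                break  # if a suffix is new, every longer suffix is new too
        # The j + 1 suffixes ending at j, minus the repeated ones, are new.
        total += (j + 1) - longest_repeat
    return total

print(distinct_substrings_incremental("ababa"))  # 9
```

For “ababa” the per-position contributions are 1, 2, 2, 2 and 2, which sum to 9. The repeated `in` checks keep this far from optimal, but memory use stays linear in the length of the string.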

Method 4: Using Trie Data Structure

A Trie or Prefix Tree is an efficient information retrieval data structure. Every substring is a prefix of some suffix, so inserting each suffix of the string into one Trie gives every distinct substring exactly one node. The count of distinct substrings is then the number of nodes in the Trie minus one (the root).

Here’s an example:

class TrieNode:
    def __init__(self):
        self.children = {}

def count_nodes(root):
    return 1 + sum(count_nodes(child) for child in root.children.values())

def distinct_substrings_trie(s):
    root = TrieNode()
    for i in range(len(s)):
        node = root
        for char in s[i:]:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
    return count_nodes(root) - 1

print(distinct_substrings_trie("ababa"))

Output: 9

This example defines a TrieNode class and inserts every suffix of the string into a single Trie. It then counts the nodes in the Trie; excluding the root, that number equals the number of distinct substrings.
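Because count_nodes is recursive, a long string with a deep Trie can run into Python's recursion limit. One possible way around this, sketched below under the assumption that plain dictionaries are acceptable as nodes, is to count each node at the moment it is created. The function name is hypothetical.

```python
def distinct_substrings_trie_iterative(s):
    root = {}     # each node is a plain dict: character -> child node
    created = 0   # number of non-root nodes created so far
    for i in range(len(s)):
        node = root
        for char in s[i:]:
            if char not in node:
                node[char] = {}
                created += 1  # a new node means a substring not seen before
            node = node[char]
    return created

print(distinct_substrings_trie_iterative("ababa"))  # 9
```

No second pass over the Trie is needed, so the depth of the structure no longer matters for counting.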

Bonus One-Liner Method 5: Using hash function

For advanced users, Python’s built-in hash function can be used to store one integer per substring instead of the substring itself. Be careful: hash values are not guaranteed to be unique, so two different substrings can collide and make the count too low, and the risk grows with the number of substrings.

Here’s an example:

s = "ababa"
print(len({hash(s[i:j]) for i in range(len(s)) for j in range(i+1, len(s)+1)}))

Output: 9

This one-liner makes use of a set comprehension along with the hash function to quickly identify unique substrings by their hash values. This method assumes the hash function does not cause any collisions.
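Since a collision would go unnoticed, one cautious option is to cross-check the hash-based count against an exact count on inputs small enough to afford it. The two helper names and the sample strings below are assumptions made for this sketch.

```python
def count_by_hash(s):
    # Stores one integer per substring; a collision would undercount.
    return len({hash(s[i:j]) for i in range(len(s)) for j in range(i + 1, len(s) + 1)})

def count_by_substring(s):
    # Stores the substrings themselves, so the count is exact.
    return len({s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)})

for sample in ["ababa", "banana", "aaaa"]:
    # The hash-based count can never exceed the exact count.
    print(sample, count_by_hash(sample), count_by_substring(sample))
```

If the two numbers ever differ, the hash-based result is the one that is wrong.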

Summary/Discussion

  • Method 1: Brute Force. Simple and straightforward. It generates O(n^2) substrings, each costing up to O(n) to slice and hash, so time and memory grow roughly as O(n^3) in the worst case.
  • Method 2: Itertools Combinations. More concise, but it does the same work as Method 1 and has the same performance limits on large strings.
  • Method 3: Dynamic Programming. Offers a structured approach, but it still examines every substring, so it is no faster than Method 1.
  • Method 4: Using Trie. Shares common prefixes between substrings, so it needs O(n^2) time and nodes in the worst case. The recursive node count can hit Python's recursion limit on long strings, and understanding and implementing a Trie can be complex for beginners.
  • Bonus Method 5: Using hash function. Extremely concise but has the risk of hash collisions, which can lead to an incorrect substring count for large datasets.