Finding Substring Anagrams in a String Using Python

💡 Problem Formulation: Imagine the task of finding all substrings within a given string where an anagram of that substring also appears in the string. For example, in the string “aabab”, the substrings “ab” and “ba” are anagrams and both are present in the string. The goal is to identify such substrings programmatically using Python.

Method 1: Brute Force Search

The brute force method iterates through all possible substrings of the input string and checks whether an anagram of each substring exists elsewhere in the string. There are O(n^2) substrings, each is compared against up to n windows, and each comparison builds a Counter in O(n) time, so the implementation below is roughly O(n^4) in the worst case. That makes it impractical for large strings, but it is a straightforward way to understand the problem.

Here’s an example:

from collections import Counter

def find_anagram_substrings(s):
    anagram_substrings = []
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            substr = s[i:j]
            counter_substr = Counter(substr)
            # Compare against every other window of the same length.
            for k in range(len(s) - len(substr) + 1):
                if k != i and Counter(s[k:k+len(substr)]) == counter_substr:
                    anagram_substrings.append(substr)
                    break
    # Deduplicate and sort so the result is deterministic.
    return sorted(set(anagram_substrings))

print(find_anagram_substrings("aabab"))

Output:

['a', 'aab', 'ab', 'aba', 'b', 'ba']

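The function find_anagram_substrings iterates through each substring and, for each one, scans every other window of the same length for a matching character count. When it finds one, it adds the substring to the list of anagram substrings, which is then deduplicated and returned. Note that a window at a different position with identical text also counts as a match, so repeated substrings such as “a” and “b” are reported alongside true rearrangements such as “ab” and “ba”.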
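The core test inside the loop is whether two equal-length windows contain the same characters with the same multiplicities. A minimal sketch of that check in isolation follows; the helper name is_anagram is an assumption introduced for illustration and is not part of the code above.

```python
from collections import Counter

def is_anagram(a, b):
    # Hypothetical helper: two strings are anagrams when their
    # character counts match exactly.
    return len(a) == len(b) and Counter(a) == Counter(b)

print(is_anagram("ab", "ba"))    # True
print(is_anagram("aab", "bab"))  # False
```

The length check is only a cheap early exit; comparing the two Counter objects alone would give the same answer.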

Method 2: Sliding Window with Hashing

The sliding window technique paired with hashing avoids comparing every substring against every other window. For each window length, every window is reduced to a hashable signature of its character counts, and windows with equal signatures are grouped in a dictionary. The version below rebuilds the counts for every window, so it runs in roughly O(n^3) time. Updating the counts incrementally as the window slides would bring this down to about O(n^2 · k) for an alphabet of k distinct characters.

Here’s an example:

from collections import Counter, defaultdict

def find_anagram_substrings_with_hashing(s):
    def signature(sub):
        # Hashable, order-independent summary of the character counts.
        return frozenset(Counter(sub).items())

    length = len(s)
    anagrams = set()
    for size in range(1, length):
        hash_map = defaultdict(list)
        for start in range(length - size + 1):
            substr = s[start:start + size]
            hash_map[signature(substr)].append(substr)
        # A signature shared by two or more windows marks an anagram group.
        for substr_list in hash_map.values():
            if len(substr_list) > 1:
                anagrams.update(substr_list)
    return sorted(anagrams)

print(find_anagram_substrings_with_hashing("aabab"))

Output:

['a', 'aab', 'ab', 'aba', 'b', 'ba']

In this code, the function find_anagram_substrings_with_hashing uses the sliding window technique. It computes a hash for each substring of a certain size and then slides the window by one character. It uses a defaultdict to store lists of substrings with matching hashes, and if it finds multiple substrings with the same hash, they are identified as anagrams.
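The sliding idea can be taken one step further by updating the character counts as the window moves instead of recounting each window from scratch. The following is a sketch under that assumption, not the article's implementation; the function name anagram_groups_sliding and the tuple-of-counts key are choices made here for illustration.

```python
from collections import defaultdict

def anagram_groups_sliding(s):
    # Hypothetical name; illustrative sketch of an incremental window.
    n = len(s)
    alphabet = sorted(set(s))
    index = {ch: i for i, ch in enumerate(alphabet)}
    found = set()
    for size in range(1, n):
        counts = [0] * len(alphabet)
        # Counts for the first window of this size.
        for ch in s[:size]:
            counts[index[ch]] += 1
        groups = defaultdict(list)
        groups[tuple(counts)].append(0)
        for start in range(1, n - size + 1):
            # Slide by one: drop the character leaving the window,
            # add the character entering it.
            counts[index[s[start - 1]]] -= 1
            counts[index[s[start + size - 1]]] += 1
            groups[tuple(counts)].append(start)
        # Start positions sharing a count vector form an anagram group.
        for starts in groups.values():
            if len(starts) > 1:
                found.update(s[i:i + size] for i in starts)
    return sorted(found)

print(anagram_groups_sliding("aabab"))
# ['a', 'aab', 'ab', 'aba', 'b', 'ba']
```

Each slide costs two counter updates plus building a tuple whose length equals the number of distinct characters, so the work per window no longer depends on the window size.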

Method 3: Sorting and Comparison

This method sorts the characters of each substring and compares the results to find anagram pairs. Anagrams have identical sorted representations, which reduces the comparison to a plain string equality check. The sort is not free, though: the implementation below sorts a window for each of the O(n^2) substrings and each of up to n shifts, so it runs in roughly O(n^4 log n) time in the worst case.

Here’s an example:

def find_anagram_substrings_by_sorting(s):
    anagrams = set()
    for i in range(len(s)):
        for j in range(i, len(s)):
            substr = ''.join(sorted(s[i:j+1]))
            # Shift the window right by k >= 1 positions and compare sorted forms.
            for k in range(1, len(s) - j):
                if ''.join(sorted(s[i+k:j+k+1])) == substr:
                    anagrams.add(substr)
                    break
    return sorted(anagrams)

print(find_anagram_substrings_by_sorting("aabab"))

Output:

['a', 'aab', 'ab', 'b']

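The function find_anagram_substrings_by_sorting compares the sorted form of every substring with the sorted forms of the same-length windows to its right. Whenever it finds two windows that have the same sorted form, it treats them as anagrams and adds their sorted form to a set (to prevent duplicates). Because only the sorted form is stored, the result lists one signature per anagram group (for example, “ab” stands for both “ab” and “ba”) rather than the substrings themselves.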
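Since the sorted form acts as a key, the inner comparison loop can be replaced by a dictionary lookup. The sketch below groups start positions by sorted signature; the name group_by_sorted_signature is an assumption made for this example and is not taken from the code above.

```python
from collections import defaultdict

def group_by_sorted_signature(s):
    # Hypothetical helper: sorted form -> set of start positions.
    # The signature's length equals the window length, so windows of
    # different sizes can never share a key.
    groups = defaultdict(set)
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            groups[''.join(sorted(s[i:j]))].add(i)
    # Keep signatures that occur at two or more start positions.
    return sorted(sig for sig, starts in groups.items() if len(starts) > 1)

print(group_by_sorted_signature("aabab"))
# ['a', 'aab', 'ab', 'b']
```

Each substring is sorted once, so the cost is roughly O(n^3 log n) rather than one sort per pair of windows.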

Method 4: Character Counting and Prime Multiplication

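This approach assigns a distinct prime number to each character and represents a substring by the product of its characters' primes. By the fundamental theorem of arithmetic, two substrings have the same product exactly when they contain the same characters with the same multiplicities, that is, when they are anagrams. Each substring is thus reduced to a single integer that can serve as a dictionary key. The products grow quickly with substring length, so the method relies on Python's arbitrary-precision integers, and the prime table below covers only lowercase a to z.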

Here’s an example:

from math import prod  # available in Python 3.8+

def prime_product(s):
    # One distinct prime per lowercase letter; other characters raise KeyError.
    primes = {'a': 2, 'b': 3, 'c': 5, 'd': 7, 'e': 11, 'f': 13, 'g': 17, 'h': 19, 'i': 23, 'j': 29, 'k': 31, 'l': 37, 'm': 41, 'n': 43, 'o': 47, 'p': 53, 'q': 59, 'r': 61, 's': 67, 't': 71, 'u': 73, 'v': 79, 'w': 83, 'x': 89, 'y': 97, 'z': 101}
    return prod(primes[ch] for ch in s)

def find_anagram_substrings_by_prime(s):
    anagrams = {}
    for i in range(len(s)):
        for j in range(i, len(s)):
            substr = s[i:j+1]
            product = prime_product(substr)
            # Group distinct substrings by their prime product.
            anagrams.setdefault(product, set()).add(substr)
    # Keep only groups that contain at least two different substrings.
    return sorted(val for sublist in anagrams.values() if len(sublist) > 1 for val in sublist)

print(find_anagram_substrings_by_prime("aabab"))

Output:

['aab', 'ab', 'aba', 'ba']

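The function find_anagram_substrings_by_prime uses a map of characters to prime numbers to convert substrings into a product of primes that represents their character composition. It then groups substrings by product to identify anagrams. Because each group is a set of distinct strings, only genuine rearrangements are reported; a substring that merely repeats, such as “a”, is not.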
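The product does not have to be recomputed for every substring: extending a substring by one character multiplies the previous product by one more prime. The sketch below illustrates that running product. The names PRIMES and anagram_groups_by_running_product are assumptions introduced here, and the input is assumed to contain only lowercase a to z.

```python
from collections import defaultdict

# Assumption: lowercase a to z only; the first 26 primes, one per letter.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

def anagram_groups_by_running_product(s):
    # Hypothetical name; illustrative sketch only.
    groups = defaultdict(set)
    for i in range(len(s)):
        product = 1
        for j in range(i, len(s)):
            # Extend the substring s[i:j] by s[j]: one multiplication.
            product *= PRIMES[ord(s[j]) - ord('a')]
            groups[product].add(s[i:j + 1])
    # Report groups holding at least two different substrings.
    return sorted(sub for subs in groups.values() if len(subs) > 1 for sub in subs)

print(anagram_groups_by_running_product("aabab"))
# ['aab', 'ab', 'aba', 'ba']
```

This saves the repeated multiplications, although the integers themselves still grow with the substring length.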

Bonus One-Liner Method 5: Using itertools and sorted()

A concise version that leverages Python’s itertools module: it generates candidate strings with permutations, keeps those that occur in the original string, and then checks whether a different rearrangement of the same characters also occurs there. The result is collected in a set to eliminate duplicates and passed through sorted() for a stable order.

Here’s an example:

from itertools import permutations

def find_anagram_substrings_one_liner(s):
    # Keep p if it occurs in s and a *different* rearrangement of it does too.
    return sorted({''.join(p) for n in range(1, len(s) + 1)
                   for p in set(permutations(s, n)) if ''.join(p) in s
                   and any(''.join(q) in s for q in set(permutations(p)) if q != p)})

print(find_anagram_substrings_one_liner("aabab"))

Output:

['aab', 'ab', 'aba', 'ba']

The find_anagram_substrings_one_liner function builds candidates as permutations of the input's characters and keeps a candidate only if it is a substring and some other permutation of it is a substring as well. The number of permutations grows factorially with the length of the input, so this is only practical for very short strings.

Summary/Discussion

  • Method 1: Brute Force Search. Easy to understand and implement. Not efficient for large strings: roughly O(n^4) as implemented.
  • Method 2: Sliding Window with Hashing. More efficient than brute force because windows are grouped by signature instead of compared pairwise. Needs extra space to store the signatures for each window length.
  • Method 3: Sorting and Comparison. Reduces anagram detection to a string comparison. Sorting adds to the time complexity, especially for longer substrings.
  • Method 4: Character Counting and Prime Multiplication. Uses unique prime factorization to reduce each substring to a single integer key. The products grow large for long substrings, which makes the multiplications expensive, and the prime table shown covers only lowercase a to z.
  • Method 5: Using itertools and sorted(). Very concise. However, permutation generation grows factorially and is inappropriate for anything but very short strings.