Discovering the Largest Anagram Subset in Python Collections

February 26, 2024 by Emily Rosemary Collins

💡 Problem Formulation: Anagrams are words that are formed by rearranging the letters of another, such as ‘listen’ and ‘silent’. The task at hand is to find the size of the largest subset of anagram words within a given list of words. For instance, given the input [‘cat’, ‘dog’, ‘tac’, ‘god’, ‘act’], the desired output is 3, corresponding to one of the subsets with the most anagrams: [‘cat’, ‘tac’, ‘act’].

Method 1: Sorting Character Arrays

This method involves sorting the characters of each word to determine if two words are anagrams. Words that are anagrams will match when sorted. A dictionary is used to track lists of anagrams, with the sorted word as a key. The largest anagram subset size is then the length of the longest value in the dictionary.

Here’s an example:

from collections import defaultdict

def largest_anagram_subset(words):
    anagrams = defaultdict(list)
    for word in words:
        sorted_word = ''.join(sorted(word))
        anagrams[sorted_word].append(word)
    return max(len(words) for words in anagrams.values())
    
words = ['cat', 'dog', 'tac', 'god', 'act']
print(largest_anagram_subset(words))

The output of this code snippet is:

This code snippet defines a function largest_anagram_subset() that returns the size of the largest subset of anagram words. It sorts each word, uses the result as a key in a dictionary, and appends the original word to the corresponding list. The function then returns the size of the longest list in the dictionary.

Method 2: Utilizing Counters

Instead of sorting, this method uses the Counter class from the collections module to count letter occurrences. Two words are anagrams if their letter counts are equal. A dictionary with the counts as keys tracks the anagrams. The size of the largest group is then the length of the largest list in the dictionary.

Here’s an example:

from collections import defaultdict, Counter

def largest_anagram_subset(words):
    anagrams = defaultdict(list)
    for word in words:
        word_counter = tuple(sorted(Counter(word).items()))
        anagrams[word_counter].append(word)
    return max(len(words) for words in anagrams.values())

words = ['cat', 'dog', 'tac', 'god', 'act']
print(largest_anagram_subset(words))

The output of this code snippet is:

Here, the function largest_anagram_subset() creates a Counter object for each word, which is then converted into a sorted tuple to be used as a dictionary key. The resulting list for each key contains anagrams, and the function calculates the size of the longest list.

Method 3: Prime Number Multiplication

This innovative method exploits the fundamental theorem of arithmetic: every integer has a unique prime factorization. It assigns a unique prime number to each letter and multiplies these for every word. Words that are anagrams will have the same multiplication result. We track these results in a dictionary to find the largest anagram subset.

Here’s an example:

from collections import defaultdict

def prime_product(word):
    primes = {'a': 2, 'b': 3, 'c': 5, 'd': 7, 'e': 11, 'f': 13, 'g': 17, 'h': 19, 'i': 23, 'j': 29, 'k': 31, 'l': 37, 'm': 41, 'n': 43, 'o': 47, 'p': 53, 'q': 59, 'r': 61, 's': 67, 't': 71, 'u': 73, 'v': 79, 'w': 83, 'x': 89, 'y': 97, 'z': 101}
    product = 1
    for letter in word:
        product *= primes[letter]
    return product

def largest_anagram_subset(words):
    anagrams = defaultdict(list)
    for word in words:
        anagrams[prime_product(word)].append(word)
    return max(len(words) for words in anagrams.values())
    
words = ['cat', 'dog', 'tac', 'god', 'act']
print(largest_anagram_subset(words))

The output of this code snippet is:

The function prime_product() calculates the product of prime numbers associated with each letter in a word. In largest_anagram_subset(), the dictionary uses these products as keys to group anagrams. The size of the largest list represents the largest anagram subset.

Method 4: Frequency Vectors

By representing each word as a 26-dimensional vector, where each dimension corresponds to the frequency of a letter from ‘a’ to ‘z’, we can compare vectors to find anagrams. Words with identical vectors are anagrams. The dictionary tracks these vectors, using a tuple representation, to find the largest group of anagrams.

Here’s an example:

from collections import defaultdict

def frequency_vector(word):
    return tuple(word.count(char) for char in 'abcdefghijklmnopqrstuvwxyz')

def largest_anagram_subset(words):
    anagrams = defaultdict(list)
    for word in words:
        anagrams[frequency_vector(word)].append(word)
    return max(len(words) for words in anagrams.values())

words = ['cat', 'dog', 'tac', 'god', 'act']
print(largest_anagram_subset(words))

The output of this code snippet is:

In this code, frequency_vector() generates a count vector for every word. The function largest_anagram_subset() uses these vectors as keys for a dictionary to group anagrams. It determines the size of the largest subset by finding the longest list of anagrams.

Bonus One-Liner Method 5: Using itertools and max()

A one-liner approach leveraging Python’s itertools to group words by their sorted characters and the max() function to find the largest group, this method is concise yet powerful. It’s a combination of Method 1’s principles applied in a functional programming style for minimalists.

Here’s an example:

from itertools import groupby

words = ['cat', 'dog', 'tac', 'god', 'act']
largest_anagram_subset_size = len(max((list(group) for key, group in groupby(sorted(words, key=sorted), key=sorted)), key=len))
print(largest_anagram_subset_size)

The output of this code snippet is:

This one-liner first sorts the words based on their sorted character strings, then uses groupby() to group anagrams together. The size of the largest group is obtained with the max() function by comparing the lengths of each group.

Summary/Discussion

Method 1: Sorting Character Arrays. Strengths are its simplicity and readability. Weaknesses include potentially higher time complexity due to sorting.
Method 2: Utilizing Counters. More efficient than sorting for large words, it maintains readability. Weaknesses include added complexity of converting counters to a comparable format.
Method 3: Prime Number Multiplication. Extremely fast and efficient, this method is great for long words. However, it requires defining a map of primes which can be cumbersome.
Method 4: Frequency Vectors. Efficient for words of varying lengths, and sidesteps sorting. The weakness is that it could be less intuitive than Method 1.
Method 5: Bonus One-Liner. The strength lies in its brevity. The weakness is the potential decrease in readability and the performance hit from sorting and grouping all at once.