5 Best Ways to Count Word Concatenations in a List with Python

Rate this post

πŸ’‘ Problem Formulation: Imagine you have been provided with a list of words and you want to determine the number of concatenated words present within that list. A concatenated word is defined as a word that can be constructed by concatenating two or more smaller words also present in the list. For example, if the input is ["cat", "cats", "catsdog", "dog", "dogcatsdog"], the desired output would be 3 (‘catsdog’, ‘dogcatsdog’, and ‘catsdog’ again inside ‘dogcatsdog’).

Method 1: Brute Force

This method uses a simple brute force approach that looks for all possible concatenations by comparing each word against every other word in the list. The function specification would iterate over each word and for each of those, iterate over all other words to check if a valid concatenation can be formed.

Here’s an example:

def count_concatenations(word_list):
    count = 0
    for word in word_list:
        for concat in word_list:
            if word != concat and word in concat:
                count += 1
                break
    return count

example_list = ["cat", "cats", "catsdog", "dog", "dogcatsdog"]
print(count_concatenations(example_list))

Output:

3

This Python function count_concatenations() iterates over each word in the list and checks if it is a substring of any other word (indicating a concatenation). The break statement ensures that each word only counts once. It’s a straightforward method but inefficient for large lists due to its O(n^2) complexity.

Method 2: Using a Set for Lookups

This method improves on the brute force approach by using a set data structure for faster lookups. It still involves checking for each word if a concatenation exists but does this by creating all possible word splits and checks if those splits exist in the set.

Here’s an example:

def count_concatenations_with_set(word_list):
    word_set = set(word_list)
    count = 0
    for word in word_list:
        for i in range(1, len(word)):
            prefix, suffix = word[:i], word[i:]
            if prefix in word_set and suffix in word_set:
                count += 1
                break
    return count

example_list = ["cat", "cats", "catsdog", "dog", "dogcatsdog"]
print(count_concatenations_with_set(example_list))

Output:

3

In count_concatenations_with_set(), the word list is first converted into a set for O(1) average lookup time. The function then splits each word into all possible prefixes and suffixes, checking if both are present in the set simultaneously. This method is more efficient than the first, especially with large lists, because it reduces the lookup time significantly.

Method 3: Trie-based Approach

A more sophisticated solution involves using a trie (prefix tree) to store all words from the list. This data structure allows for a quick search of prefixes and can efficiently find concatenated words by navigating the trie according to the current substring of the word being tested.

Here’s an example:

class TrieNode:
    def __init__(self):
        self.children = {}
        self.end_of_word = False
    
def add_word(root, word):
    node = root
    for char in word:
        if char not in node.children:
            node.children[char] = TrieNode()
        node = node.children[char]
    node.end_of_word = True

def can_form(word, root, start):
    node = root
    for i in range(start, len(word)):
        if word[i] not in node.children:
            return False
        node = node.children[word[i]]
        if node.end_of_word:
            if i == len(word) - 1 or can_form(word, root, i + 1):
                return True
    return False

def count_concatenations_trie(word_list):
    count = 0
    root = TrieNode()
    for word in word_list:
        add_word(root, word)
    for word in word_list:
        if can_form(word, root, 0):
            count += 1
    return count

example_list = ["cat", "cats", "catsdog", "dog", "dogcatsdog"]
print(count_concatenations_trie(example_list))

Output:

3

The trie-based method is efficient in terms of lookup complexities. The add_word() function creates the trie from the word list, while can_form() recursively checks if the remainder of the word, starting from each character, is a valid word in the trie. This method is particularly potent for large datasets with many potential concatenations.

Method 4: Dynamic Programming

Dynamic programming can be used to solve this problem by breaking it down into subproblems. This approach checks whether each word can be formed by concatenating smaller words found earlier. It maintains an array to keep track of whether a word can be segmented into two or more words from the list.

Here’s an example:

def count_concatenations_dp(word_list):
    word_set = set(word_list)
    count = 0
    for word in word_list:
        dp = [False] * (len(word) + 1)
        dp[0] = True
        for i in range(1, len(word) + 1):
            for j in range(i):
                if dp[j] and word[j:i] in word_set:
                    dp[i] = True
                    break
        if dp[len(word)]:
            count += 1
    return count

example_list = ["cat", "cats", "catsdog", "dog", "dogcatsdog"]
print(count_concatenations_dp(example_list))

Output:

3

This count_concatenations_dp() function starts by converting the list into a set for quick lookups. For each word, it sets up a dynamic programming array to check each substring. The DP array marks a position as True if the substring ending at that position can be broken into smaller words that exist in the list, allowing for the early exit of the loop.

Bonus One-Liner Method 5: List Comprehension with Concatenation Checks

For a Pythonic one-liner approach, list comprehension can be used to pair each word with all other words and count how many times it features as a substring within those combinations.

Here’s an example:

example_list = ["cat", "cats", "catsdog", "dog", "dogcatsdog"]
count = sum(1 for word in example_list for concat in example_list if word in concat and word != concat)
print(count)

Output:

3

This concise one-liner uses a generator expression to iterate through all words and count concatenations. The condition inside the sum ensures we only count when a word is a non-trivial substring of another. While elegant and Pythonic, this method suffers the same inefficiency as the brute force approach when dealing with large lists.

Summary/Discussion

  • Method 1: Brute Force. Simple to understand. Inefficient for large data sets.
  • Method 2: Set for Lookups. Faster lookups than brute force. Still not the best for scalability concerns.
  • Method 3: Trie-based Approach. Optimized lookup times. Requires additional space and implementation effort.
  • Method 4: Dynamic Programming. Efficient for words that can be segmented. Can be memory intensive with very long words.
  • Method 5: List Comprehension. Pythonic and concise. May perform poorly on large lists due to its O(n^2) nature.