Counting Subsequences in Word Lists with Python: Top 5 Methods

πŸ’‘ Problem Formulation: In Python programming, you might encounter a challenge where you need to find the count of certain subsequences within a list of words. For instance, given a word list such as ['hack', 'a', 'ck', 'hak'] and a subsequence 'hack', the task is to determine how many times the subsequence can be formed from the words in the list. The desired output for this example would be 2, since ‘hack’ can be formed from ‘hack’ itself and also by combining ‘hak’ and ‘ck’.

Method 1: Iterative Comparison

This method involves iterating through each word in the list and checking if the characters of the target subsequence can be found in order within the word. The function keeps a count of how many times the subsequence can be formed. This approach utilizes nested loops and is straightforward to understand and implement.

Here’s an example:

def subsequence_count(word_list, subsequence):
    count = 0
    for word in word_list:
        it = iter(word)
        if all(char in it for char in subsequence):
            count += 1
    return count

print(subsequence_count(['hack', 'a', 'ck', 'hak'], 'hack'))

Output:

2

The subsequence_count function iterates through the list of words and uses an iterator to check if all characters of the sequence are present in the word in the correct order. The built-in all() function is convenient for this check. If the subsequence is found, the count is incremented.

Method 2: Using Regular Expressions

Regular expressions are a powerful tool for pattern matching. This method employs a regular expression to match the subsequence within each word. A regular expression is built to account for any characters between the subsequence characters. This method is efficient and concise for matching patterns.

Here’s an example:

import re

def subsequence_count_regex(word_list, subsequence):
    subseq_pattern = '.*'.join(subsequence)
    regex = re.compile(subseq_pattern)
    return sum(bool(regex.search(word)) for word in word_list)

print(subsequence_count_regex(['hack', 'a', 'ck', 'hak'], 'hack'))

Output:

2

In this code snippet, the subsequence_count_regex function builds a regular expression pattern that matches any characters between the subsequence characters. The sum function calculates the total count where the search method of the compiled regex object returns a match.

Method 3: Dynamic Programming

Dynamic Programming can optimize the subsequence counting by storing intermediate results. This method involves creating a matrix to keep track of subsequence formation across the word list. It is more complex but efficient for larger datasets and multiple searches.

Here’s an example:

def subsequence_count_dp(word_list, subsequence):
    sub_len = len(subsequence)
    dp = [0] * (sub_len + 1)
    dp[0] = 1
    
    for word in word_list:
        for i in range(sub_len-1, -1, -1):
            if subsequence[i] in word:
                dp[i+1] += dp[i]
    
    return dp[-1]
  
print(subsequence_count_dp(['hack', 'a', 'ck', 'hak'], 'hack'))

Output:

2

The subsequence_count_dp function utilizes a dynamic programming array dp to store the counts of subsequences ending at each position. It iterates backwards through the subsequence to avoid overwriting data it will need later. This method updates the counts efficiently for each word in the list.

Method 4: Recursive Decomposition

This method applies recursion to decompose the problem into smaller subproblems. It involves checking if the subsequence can be formed with the current word, and if not, recursively calling the function with the rest of the list and subsequence. Although intuitive and elegant, it may be slower due to the overhead of recursive calls.

Here’s an example:

def subsequence_count_recursive(word_list, subsequence, start=0):
    if not subsequence:
        return 1
    count = 0
    for i in range(start, len(word_list)):
        count += subsequence_count_recursive(
            word_list, subsequence[len(word_list[i]):], i+1
        )
    return count

print(subsequence_count_recursive(['hack', 'a', 'ck', 'hak'], 'hack'))

Output:

2

The recursive function subsequence_count_recursive checks for the subsequence from the start of the word list and uses a start index to avoid re-evaluating words, thus preventing redundant calculations. The base case for recursion occurs when the subsequence is empty, returning 1 as the sequence is found.

Bonus One-Liner Method 5: Functional Programming with filter and reduce

Leveraging Python’s functional programming capabilities, this one-liner approach combines filter and reduce to find subsequences. The filter function is used to discard non-matching words and reduce aggregates the results. This method is concise and makes use of Python’s higher-order functions for an expressive solution.

Here’s an example:

from functools import reduce

subsequence_count_fp = lambda wl, ss: reduce(lambda x, _: x+1, filter(lambda w: all(s in iter(w) for s in ss), wl), 0)

print(subsequence_count_fp(['hack', 'a', 'ck', 'hak'],'hack'))

Output:

2

The lambda function subsequence_count_fp takes a word list and subsequence as input and applies filter to keep only those words which contain the subsequence. The reduce function then counts the number of such valid words. This method, although efficient in writing, may not be as readable as other methods.

Summary/Discussion

  • Method 1: Iterative Comparison. Easy to understand and implement. Not the most efficient for large word lists or complex subsequences.
  • Method 2: Using Regular Expressions. Efficient for pattern matching. Might be difficult to grasp for those unfamiliar with regex syntax.
  • Method 3: Dynamic Programming. Efficient, optimizes for larger datasets. Can be complex to implement and understand.
  • Method 4: Recursive Decomposition. Intuitive and simple. However, it may be less efficient due to recursive overhead and deeper call stacks.
  • Bonus Method 5: Functional Programming. Concise and utilizes Python’s higher-order functions. Readability could be an issue, and it doesn’t necessarily offer performance benefits over other methods.