5 Best Ways to Find the Length of the Shortest Supersequence in Python

💡 Problem Formulation: In computational theory, a supersequence of two strings is a string that contains both as subsequences. The challenge is to find the shortest supersequence that encompasses both strings. For example, given ‘AGGTAB’ and ‘GXTXAYB’, one shortest supersequence is ‘AGXGTXAYB’ with a length of 9. This article aims to explore various Python methods to calculate this length efficiently.

Method 1: Recursive Solution

This method involves a recursive function that compares the characters of both strings and counts the characters that are part of the supersequence. The recursion explores all possibilities by either keeping the matching character once or by moving one character forward in both strings.

Here’s an example:

def superseq_length(X, Y, m, n):
    if not m: return n
    if not n: return m
    if X[m-1] == Y[n-1]:
        return 1 + superseq_length(X, Y, m-1, n-1)
    else:
        return 1 + min(superseq_length(X, Y, m-1, n), superseq_length(X, Y, m, n-1))

# Example usage
X = 'AGGTAB'
Y = 'GXTXAYB'
print(superseq_length(X, Y, len(X), len(Y)))

Output: 9

The provided function superseq_length() takes two strings and their lengths as arguments. It uses a recursive approach to explore different subproblems, choosing the minimum length from the two possibilities. Despite its clarity, the method is inefficient for large strings due to a high number of overlapping subproblems solved multiple times.

Method 2: Dynamic Programming (Bottom-Up)

A more efficient solution uses dynamic programming to store and reuse results of subproblems. This bottom-up approach fills a two-dimensional table where each cell represents the length of the shortest supersequence of substrings ending at corresponding positions in each string.

Here’s an example:

def superseq_length_dp(X, Y):
    m, n = len(X), len(Y)
    dp = [[0] * (n+1) for _ in range(m+1)]
    for i in range(m+1):
        for j in range(n+1):
            if not i: dp[i][j] = j
            elif not j: dp[i][j] = i
            elif X[i-1] == Y[j-1]: dp[i][j] = 1 + dp[i-1][j-1]
            else: dp[i][j] = 1 + min(dp[i-1][j], dp[i][j-1])
    return dp[m][n]

# Example usage
X = 'AGGTAB'
Y = 'GXTXAYB'
print(superseq_length_dp(X, Y))

Output: 9

In superseq_length_dp(), a matrix dp is constructed where dp[i][j] holds the length of the shortest supersequence for X[:i] and Y[:j]. This method is significantly faster than the recursive approach as it avoids recalculating results.

Method 3: Dynamic Programming (Top-Down)

Top-down dynamic programming, or memoization, is the recursive counterpart that includes caching to avoid redundant calculations. It stores results of subproblems in a memoization table to return previously computed values when the same subproblem occurs again.

Here’s an example:

def superseq_length_memo(X, Y):
    m, n = len(X), len(Y)
    memo = [[-1] * (n+1) for _ in range(m+1)]
    return aux_superseq(m, n, X, Y, memo)

def aux_superseq(i, j, X, Y, memo):
    if memo[i][j] < 0:
        if not i: memo[i][j] = j
        elif not j: memo[i][j] = i
        elif X[i-1] == Y[j-1]: memo[i][j] = 1 + aux_superseq(i-1, j-1, X, Y, memo)
        else: memo[i][j] = 1 + min(aux_superseq(i-1, j, X, Y, memo), aux_superseq(i, j-1, X, Y, memo))
    return memo[i][j]

# Example usage
X = 'AGGTAB'
Y = 'GXTXAYB'
print(superseq_length_memo(X, Y))

Output: 9

The function superseq_length_memo() uses an auxiliary function aux_superseq() that checks for an existing calculation in the memo table before computing the supersequence length. This method avoids redundant computations found in the purely recursive method while preserving its logical structure.

Method 4: Using LCS (Longest Common Subsequence)

The length of the shortest supersequence can also be derived using the Longest Common Subsequence (LCS) of the two given strings. The formula for the shortest supersequence length is m + n - LCS(X, Y), where m and n are the lengths of the strings.

Here’s an example:

def lcs(X, Y, m, n):
    if m == 0 or n == 0:
        return 0
    elif X[m-1] == Y[n-1]:
        return 1 + lcs(X, Y, m-1, n-1)
    else:
        return max(lcs(X, Y, m, n-1), lcs(X, Y, m-1, n))

def shortest_superseq_using_lcs(X, Y):
    m, n = len(X), len(Y)
    return m + n - lcs(X, Y, m, n)

# Example usage
X = 'AGGTAB'
Y = 'GXTXAYB'
print(shortest_superseq_using_lcs(X, Y))

Output: 9

By calculating the LCS using the function lcs(), we then derive the shortest supersequence length through a simple arithmetic operation in shortest_superseq_using_lcs(). This is conceptually simple, but like the recursive solution, may suffer performance issues for large strings without memoization or dynamic programming applied to the LCS computation.

Bonus One-Liner Method 5: Using functools and itertools

This one-liner method relies on the Python Standard Library’s functools and itertools to generate all supersequences and then selects the shortest one. It is more of a brute-force approach and not recommended for long strings, but it highlights the power and simplicity of Python’s functional programming tools.

Here’s an example:

from itertools import product
from functools import reduce

X = 'AGGTAB'
Y = 'GXTXAYB'
shortest_supersequence = reduce(lambda x, y: x if len(x) < len(y) else y, (''.join(s) for s in product(*zip(X, Y))))
print(len(shortest_supersequence))

Output: 9

Using product() from itertools, we generate cartesian products of paired characters from strings X and Y. The reduce() function then iteratively compares the supersequences to select the shortest one. While elegant, this method is computationally expensive and impractical for anything beyond small strings due to the combinatorial explosion of possibilities.

Summary/Discussion

Method 1: Recursive Solution. Easy to understand but has exponential time complexity for large strings.
Method 2: Dynamic Programming (Bottom-Up). Much more efficient than recursion and can handle larger strings.
Method 3: Dynamic Programming (Top-Down). Retains recursive clarity and adds efficiency through memoization.
Method 4: Using LCS. Conceptually simple but requires efficient LCS computation to be practical for large strings.
Bonus Method 5: Using functools and itertools. Elegant one-liner but can handle only small strings due to its brute-force nature.