5 Best Ways to Find Minimum Deletions to Make Strings Identical in Python

πŸ’‘ Problem Formulation: We often face scenarios where we need to align data or strings by deleting characters to make two strings identical. Finding the minimum number of deletions required for this task is a common problem in text processing and bioinformatics. For example, given two strings, S1 = ‘abcde’ and S2 = ‘abfge’, the minimum deletions to make these strings identical would be 3 (‘c’, ‘d’, ‘f’, and ‘g’ should be deleted).

Method 1: Dynamic Programming

This method employs a classic dynamic programming approach, which computes the length of the longest common subsequence (LCS) of two strings and uses this to determine the minimum number of deletions. The function find_min_deletions_dp() takes two strings as input and returns an integer representing the minimum deletions required.

Here’s an example:

def find_min_deletions_dp(str1, str2):
    m, n = len(str1), len(str2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if str1[i-1] == str2[j-1]:
                dp[i][j] = 1 + dp[i-1][j-1]
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    lcs = dp[m][n]
    return m + n - 2 * lcs

# Example usage
min_deletions = find_min_deletions_dp('abcde', 'abfge')
print(min_deletions)

Output:

3

This code sets up a two-dimensional array dp, iteratively fills it with values representing the lengths of common subsequences at different truncations of the two strings, and eventually calculates the result based on the length of the strings minus twice the length of their LCS.

Method 2: Recursive Approach

The recursive method calculates the LCS by comparing characters and recursively calling the function on the remaining substring. The function find_min_deletions_recursive() applies recursion to compute the LCS and the minimum deletions.

Here’s an example:

def find_min_deletions_recursive(str1, str2, m, n):
    if m == 0 or n == 0:
        return m + n
    if str1[m-1] == str2[n-1]:
        return find_min_deletions_recursive(str1, str2, m-1, n-1)
    return 1 + min(
        find_min_deletions_recursive(str1, str2, m-1, n),
        find_min_deletions_recursive(str1, str2, m, n-1)
    )

# Example usage
min_deletions = find_min_deletions_recursive('abcde', 'abfge', len('abcde'), len('abfge'))
print(min_deletions)

Output:

3

The recursive approach starts with the lengths of the strings and makes decisions at every step, either matching characters or recursively deleting one from either of the strings, aiming to eventually reach the base cases which return the number of remaining characters to be deleted.

Method 3: Memoization

Memoization is an optimization technique used in conjunction with recursion, storing results of expensive function calls and reusing them when the same inputs occur again. The find_min_deletions_memo() function incorporates a cache to store intermediate results.

Here’s an example:

def find_min_deletions_memo(str1, str2):
    memo = {}
    
    def recurse(m, n):
        if (m, n) in memo:
            return memo[(m, n)]
        if m == 0 or n == 0:
            result = m + n
        elif str1[m-1] == str2[n-1]:
            result = recurse(m-1, n-1)
        else:
            result = 1 + min(recurse(m-1, n), recurse(m, n-1))
        memo[(m, n)] = result
        return result
    
    return recurse(len(str1), len(str2))

# Example usage
min_deletions = find_min_deletions_memo('abcde', 'abfge')
print(min_deletions)

Output:

3

The memoization approach introduces a cache which significantly improves the time complexity by avoiding redundant calculations. The recursive logic is similar to Method 2, but lookups are performed in the memo dictionary to bypass unnecessary function calls.

Method 4: Bottom-Up Tabulation

Tabulation is a bottom-up dynamic programming approach that builds up a table (iteratively) and uses it to arrive at the final solution. The find_min_deletions_tabulation() function constructs a table that leads to the solution without recursion.

Here’s an example:

def find_min_deletions_tabulation(str1, str2):
    m, n = len(str1), len(str2)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if str1[i] == str2[j]:
                table[i+1][j+1] = table[i][j] + 1
            else:
                table[i+1][j+1] = max(table[i+1][j], table[i][j+1])
    return m + n - 2 * table[m][n]

# Example usage
min_deletions = find_min_deletions_tabulation('abcde', 'abfge')
print(min_deletions)

Output:

3

The function constructs a 2D table and iteratively populates it from sub-problem solutions, similar to Method 1 but iterates through the strings in a different manner. Rather than using recursion, it utilizes iterative constructs (loops) to build up solutions.

Bonus One-Liner Method 5: Using a Higher Order Function

For fun and brevity, Python’s higher-order functions can be used to construct a one-liner that leverages functools.reduce() and a lambda function to return the minimum deletions.

Here’s an example:

from functools import reduce

find_min_deletions_oneliner = lambda s1, s2: len(s1) + len(s2) - 2 * reduce(lambda x, y: x + (s1[x] == s2[y]), range(len(s1)), 0)

# Example usage
min_deletions = find_min_deletions_oneliner('abcde', 'abfge')
print(min_deletions)

Output:

3

This one-liner packs the logic of determining the LCS into the reduce() function and subtracts its double from the sum of the lengths of the two strings to get the result.

Summary/Discussion

  • Method 1: Dynamic Programming. Most reliable for larger strings. Can be less intuitive and requires understanding of dynamic programming.
  • Method 2: Recursive Approach. Simple to understand and implement but not efficient for large strings due to its exponential time complexity.
  • Method 3: Memoization. Adds a performance improvement to recursion by caching results, making it more viable for larger inputs but still using more memory than iterative solutions.
  • Method 4: Bottom-Up Tabulation. Provides the performance benefits of dynamic programming without the overhead of recursion. Efficient for larger inputs.
  • Method 5: One-Liner Higher Order Function. Concise and interesting, but not practical for understanding the solution or for large input strings due to the way Python handles lambda closure scopes.