5 Best Ways to Find the Longest Substring with K Unique Characters in a Given String in Python

💡 Problem Formulation: The specific challenge discussed in this article is to identify the longest substring within a given string that contains exactly k unique characters. For instance, in the string "aabbcc", the longest substring with 2 unique characters is either "aabb" or "bbcc", both of which have a length of 4. Our goal is to find an efficient solution to this problem using Python.

Method 1: The Sliding Window Technique

The sliding window technique involves maintaining a dynamic window of characters that adjusts its size and position to encompass substrings of valid length. We use two pointers that represent the start and end of the window. As we iterate through the string, we adjust the pointers to ensure the window always contains at most k distinct characters, and we record the length whenever we find a longer valid substring.

Here’s an example:

def longest_substring_k_unique(s, k):
    start = 0
    max_length = 0
    char_index_map = {}

    for end in range(len(s)):
        char_index_map[s[end]] = char_index_map.get(s[end], 0) + 1

        while len(char_index_map) > k:
            char_index_map[s[start]] -= 1
            if char_index_map[s[start]] == 0:
                del char_index_map[s[start]]
            start += 1
        
        max_length = max(max_length, end - start + 1)
    
    return max_length

print(longest_substring_k_unique('aabbcc', 2))

Output: 4

In the provided code snippet, a dictionary char_index_map keeps track of the count of each character within the window. We increment the end pointer to expand the window and add characters, decrement the start pointer to shrink the window if we exceed k unique characters. We then update max_length when a longer substring is found.

Method 2: Optimized Sliding Window with Ordinal Mapping

An optimization of the sliding window technique involves using an ordinal map (usually an array) instead of a dictionary to store counts of characters. This improvement generally offers better performance when dealing with a known character set, such as the ASCII character set, by utilizing constant-time access of array elements.

Here’s an example:

def longest_substring_k_unique_optimized(s, k):
    n = len(s)
    if n * k == 0:
        return 0

    start = end = max_length = 0
    char_count = [0] * 128  # Assuming ASCII character set

    unique = 0

    while end < n:
        if char_count[ord(s[end])] == 0:
            unique += 1
        char_count[ord(s[end])] += 1
        end += 1
        
        while unique > k:
            if char_count[ord(s[start])] == 1:
                unique -= 1
            char_count[ord(s[start])] -= 1
            start += 1
        
        max_length = max(max_length, end - start)
    
    return max_length

print(longest_substring_k_unique_optimized('aabbcc', 2))

Output: 4

In this code snippet, an array char_count is used to track the count of ASCII characters within the window, which offers faster access than a dictionary. The rest of the algorithm is similar to the previous method, adjusting start and end pointers while ensuring the count of unique characters does not exceed k.

Method 3: Using itertools to Generate Substrings

As a brute force approach, itertools can be used in Python to generate all possible substrings, after which we can filter those with k unique characters and determine the longest one. This method is not optimized for large strings but is easy to understand and implement.

Here’s an example:

import itertools

def longest_substring_bruteforce(s, k):
    max_length = 0
    for start in range(len(s)):
        for end in range(start + 1, len(s) + 1):
            substring = s[start:end]
            if len(set(substring)) == k:
                max_length = max(max_length, len(substring))
    return max_length

print(longest_substring_bruteforce('aabbcc', 2))

Output: 4

The brute force method uses nested loops to generate all possible substrings and a set to calculate the number of unique characters. It updates max_length whenever a valid substring (with count of unique characters equal to k) that is longer than the current maximum is found.

Method 4: Using Collections to Count Uniques

The collections module in Python contains a Counter class that can be used for counting hashable objects in an iterable. This method efficiently counts the unique elements in substrings as we iterate through them with the sliding window technique.

Here’s an example:

from collections import Counter

def longest_substring_collections(s, k):
    start = max_length = 0
    char_counter = Counter()
    
    for end in range(len(s)):
        char_counter[s[end]] += 1

        while len(char_counter) > k:
            char_counter[s[start]] -= 1
            if char_counter[s[start]] == 0:
                del char_counter[s[start]]
            start += 1
        
        max_length = max(max_length, end - start + 1)
    
    return max_length

print(longest_substring_collections('aabbcc', 2))

Output: 4

This method uses the Counter class to keep track of the character counts within the sliding window. Similar to Method 1, it adjusts the window size while ensuring the number of unique characters does not surpass k, and it updates the max_length accordingly.

Bonus One-Liner Method 5: Pythonic Approach with List Comprehensions

A one-liner solution can be written using nested list comprehensions and the set data structure. It demonstrates Python’s expressiveness but lacks readability and is not suitable for large strings due to performance issues.

Here’s an example:

def longest_substring_one_liner(s, k):
    return max([len(sub) for sub in 
                [''.join(j) for i in range(len(s)) 
                for j in itertools.combinations(s, i + 1)] 
                if len(set(sub)) == k], default=0)

print(longest_substring_one_liner('aabbcc', 2))

Output: 4

The one-liner utilizes the itertools module’s combinations function to create all possible substrings, filters them based on the number of unique characters, and then uses the built-in max function to find the length of the longest valid substring.

Summary/Discussion

Method 1: Sliding Window Technique. Most efficient for most use cases. Performs well with large strings. High readability and good performance.
Method 2: Optimized Sliding Window with Ordinal Mapping. Provides performance gains specifically for known character sets. Slightly less readable but faster due to array use instead of a hash map.
Method 3: Using itertools to Generate Substrings. Easy to understand and implement. Not suitable for large strings due to its brute force nature and resulting performance degradation.
Method 4: Using Collections to Count Uniques. Pythonic and readable, using built-in data structures. Good for cases where readability is a priority over raw performance.
Bonus One-Liner Method 5: Pythonic Approach with List Comprehensions. Expressive and compact. Tradeoff with readability and performance. Best used for educational purposes or small datasets.