Python – How to Find the Longest Substring in Alphabetical Order?

5/5 - (4 votes)

Programming Challenge

💬 Challenge: Given a Python string. Find the longest substring for which holds that all of its characters are in alphabetical order so that the first character comes earlier in the alphabet than the second character, and so on.

Here are three examples:

  • string: 'helloworld', longest substring: 'ellow'
  • string: 'abcde', longest substring: 'abcde'
  • string: 'zyxw', longest substring: 'z' or 'w'

You get the point. 🙂

Next, I’ll show you three methods, one is fast, one is faster, and one is blazingly fast:

Feel free to read on to learn how they work and how they compare against each other in our final runtime evaluation (Who will win?). It’s super insightful! 💡

Method 1: Using a Simple For Loop on Strings

The easiest way to get the longest substring in alphabetical order is to iterate over all characters in the original string and keep track of two variables:

  • longest – the longest substring found so far, and
  • current – the current substring that could become the longest.

In each iteration, you check if you can add the latest character to the current substring by comparing it against the last character of the current substring.

As soon as the loop terminates, you’ve found the longest substring in alphabetical order.

Here’s one implementation of this idea:

def longest_substring(s):
    longest = ''
    current = ''

    for c in s:
        if current == '' or c >= current[-1]:
            current += c
        else:
            current = c
        longest = max(current, longest, key=len)

    return longest

Here’s how this works for our initial examples:

print(longest_substring('helloworld'))
# ellow

print(longest_substring('abcde'))
# abcde

print(longest_substring('zyxw'))
# w

Explanation: Let’s examine some important details of this solution next.

if current == '' or c >= current[-1]:

This line first checks if the current variable is empty which only holds once, in the beginning. In that specific case, the expression already evaluates to True before checking the second condition c >= current[-1], so Python doesn’t raise an indexing error due to the empty current variable.

This specific Python feature is called “Short-Circuiting”. And you should definitely understand it as an expert coder!

👉 Recommended Tutorial: What is Short Circuit Evaluation in Python?

Using this approach also allows us to pass an empty string into the function without obtaining any error:

print(longest_substring(''))
# ''

Here’s another interesting line from the code solution:

longest = max(current, longest, key=len)

It uses the max() function to compare two strings by length using the len() function as a key. The function returns the longest string that is then assigned to the longest variable.

Instead, you could also do something like this using the ternary operator:

longest = current if len(current)>len(longest) else longest

Both solutions are equally efficient.

💡 Runtime Complexity: The runtime complexity of this approach is linear in the number of characters of the input string, i.e, O(n). You iterate once over all characters. Each loop body has constant runtime complexity because you only compare individual characters that are independent of the solution size. Therefore, we have O(n) * O(1) = O(n) runtime complexity.

Method 2: Using Lists to Speed Things Up

The previous method has linear runtime complexity which is the best we can do, assymptotically. However, we can still improve the “constant factors”, i.e., make the algorithm more efficient using a small optimization.

You can observe in the previous code snippet that we use a lot of string concatenation operations, i.e., current += c, which creates a new string in each loop iteration. Just to append a single character to the string!

In fact, depending on the implementation, this could turn out to kill our “linear runtime complexity” property in the first place!

Anyway, there’s an easy workaround: use Python lists instead of strings. Lists are mutable whereas strings are immutable.

💡 Recommended Tutorial: Mutable vs Immutable Objects in Python

This means you can append new characters at the end of a list object in constant runtime complexity no matter how large the list is. But you can never append a new character at the end of a string!

def longest_substring(s):
    longest = []
    current = []

    for c in s:
        if not current or c >= current[-1]:
            current.append(c)
        else:
            current = [c]
        longest = current if len(current)>len(longest) else longest

    return ''.join(longest)



print(longest_substring('helloworld'))
# ellow

print(longest_substring('abcde'))
# abcde

print(longest_substring('zyxw'))
# z

print(longest_substring(''))
# ''

Method 3: Using the Regex Engine

You can use a simple regular expression that adds an asterisk operator after each character in the alphabet to find all substrings that are alphabetically ordered. Store it in the substrings variable. Then use the max(substrings, key=len) function to obtain the longest of those substrings.

Here’s the regex:

re.findall('a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*', s)

I must say – I hate this solution. 😅 It’s hard-coded and uses an external dependency without a real need.

However, it works beautifully. It’s concise, readable, and bold. So, why not?

import re


def longest_substring(s):
    substrings = re.findall('a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*', s)
    return max(substrings, key=len)



print(longest_substring('helloworld'))
# ellow

print(longest_substring('abcde'))
# abcde

print(longest_substring('zyxw'))
# z

print(longest_substring(''))
# ''

💡 Info: You can read the asterisk A* quantifier as zero-or-more regex: the preceding regex A is matched an arbitrary number of times. So, you’re matching an arbitrary number of 'a' characters, then an arbitrary number of 'b' characters, and so on.

If you want to learn more about the re.findall() function, check out our guide on the Finxter blog.

For example, the input string 'helloworld' results in the following output of the re.findall() function: ['h', 'ellow', 'or', 'l', 'd', '']. These are all alphabetically ordered substrings of the string 'helloworld'. We take the longest one which is 'ellow'.

Runtime Comparison

The fastest method to find the longest substring in alphabetical order is, surprisingly, the regex method (Method 3) which reduces runtime by more than an order of magnitude: Method 1 needed 0.093 seconds, Method 2 needed 0.038 seconds, and Method 3 needed 0.0064.

Regex to the rescue! 💪⚡🔥

re.findall('a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*', s)

I have evaluated this on strings with lengths 26, 260, 2600, 26000, and 260000 using the following code on my notebook (Acer, Intel i7-8565U, 8GB DDR4 RAM):

import re
import time
import matplotlib.pyplot as plt


def longest_substring_1(s):
    longest = ''
    current = ''

    for c in s:
        if current == '' or c >= current[-1]:
            current += c
        else:
            current = c
        longest = max(current, longest, key=len)

    return longest


def longest_substring_2(s):
    longest = []
    current = []

    for c in s:
        if not current or c >= current[-1]:
            current.append(c)
        else:
            current = [c]
        longest = current if len(current)>len(longest) else longest

    return ''.join(longest)


def longest_substring_3(s):
    substrings = re.findall('a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*', s)
    return max(substrings, key=len)    


xs = [1, 10, 100, 1000, 10000]
runtimes_1, runtimes_2, runtimes_3 = [], [], []

for i in xs:

    s = 'abcdefghijklmnopqrstuvwxyz'*i
    
    # Elapsed Runtime Method 1
    t1 = time.time()
    longest_substring_1(s)
    t2 = time.time()
    runtimes_1.append(t2-t1)


    # Elapsed Runtime Method 2
    t1 = time.time()
    longest_substring_2(s)
    t2 = time.time()
    runtimes_2.append(t2-t1)


    # Elapsed Runtime Method 3
    t1 = time.time()
    longest_substring_3(s)
    t2 = time.time()
    runtimes_3.append(t2-t1)


plt.plot(xs, runtimes_1, 'o-', label='Method 1')
plt.plot(xs, runtimes_2, 'x--', label='Method 2')
plt.plot(xs, runtimes_3, 'v-', label='Method 3')

plt.xscale('log')
plt.ylabel('Elapsed Runtime (s)')
plt.xlabel('String size in multiples of 26 chars')
plt.legend()
plt.show()

Conclusion

I was very surprised when writing this article because my intuitive analysis was that all methods should have linear runtime complexity (but none of them have in practice).

Also, I thought the regex solution will turn out to be the most inefficient one.

None of my assumptions turned out to hold in practice! 😅


Thanks for reading through the whole article. ❤️ Feel free to join our email academy and download our free Python cheat sheets:


Image Credit: https://www.motorbiketireshop.com/