Levensthein Distance

How to Calculate the Levenshtein Distance in Python?

After reading this article, you will know exactly how to calculate the edit distance in Python.

Learning requires opening your knowledge gap first. So let’s do this. What is the output of the following Python puzzle showing you a concise code snippet to calculate the edit distance in Python? (source)

Python Source Code

def levenshtein(a, b):
    if not a: return len(b)
    if not b: return len(a)
    return min(levenshtein(a[1:], b[1:])+(a[0] != b[0]),
               levenshtein(a[1:], b)+1,
               levenshtein(a, b[1:])+1)

    
print(levenshtein("cat","chello"))

Now, this is a hard nut to crack.

I once posted this Python puzzle to my community of puzzle solvers (called Finxters). Although many Finxters submitted the correct solution, most admitted that they did not really understand what is going on here. Let’s fix this.

General idea Levenshtein Distance

Before we dive in the code, let’s first understand the idea of the Levenshtein distance:

“In information theory, Linguistics and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.”

Wikipedia

Applications of Levensthein Distance

The levenshtein distance has important applications. Think about the auto-correction functionality on your smartphone. Say, you type in “helo” in your WhatsApp messenger. Your smartphone then selects several high probability words and sorts them by Levenshtein distance. The one with minimal Levenshtein distance (and, hence, maximal similarity) is “hello”. It automatically corrects “helo” to “hello”.

Explanation of Source Code

Let’s dive in the code. We already know that the Levenshtein distance computes the minimal number of edits (insert, delete, or replace) to reach the second string starting from the first string.

Here is one minimal sequence:

“cat”
“cht” (replace “a” by “h”)
“che” (replace “t” by “e”)
“chel” (insert “l” at position 3)
“chell” (insert “l” at position 4)
“chello” (insert “o” at position 5)

So we can transform the string “cat” in the string “chello” in five steps.

But how does the algorithm accomplish that?


Intermezzo: The Python Truth Value of Objects

In Python, EVERY object has a truth value. In Harry Potter, you are either good or bad. In Python, you are either True or False.

Most objects are in fact “True” (normal people are usually good). Intuitively, you know the few objects that are “False”, don’t you? For example:

  • 0 is False
  • ” is False
  • [] is False
  • {} is False

With this information, you can now easily understand the first two lines of the Levenshtein function:

if not a: return len(b)
if not b: return len(a)

Suppose string a is empty. In this case, the minimal edit distance is len(b) insertions of the characters in string b. We cannot do better. Therefore, we can directly return the correct edit distance if either of the strings is empty.

Let’s say both strings are non-empty (otherwise the solution is trivial as shown previously).

Now, we can simplify the problem in three ways.

First, we ignore the leading characters of both strings a and b and calculate the edit distance from a[1:] to b[1:] in a recursive manner.

If the leading characters a[0] and b[0] are different, we have to fix it by replacing a[0] by b[0]. Hence, we have to increment the edit distance by one if they are different.

In code, this looks as follows:

levenshtein(a[1:], b[1:])+(a[0] != b[0])

Second, we remove the first character a[0]. Now, we check the minimal edit distance recursively for this smaller problem. As we have removed a character, we increment the result by one.

In code, this looks as follows:

levenshtein(a[1:], b)+1

Third, we (conceptually) insert the character b[0] to the beginning of the word a. Now, we can reduce this problem to the smaller problem that arises if we remove the first character of b. As we have performed one edit operation (inserting), we increment the result by one.

Here is the code:

levenshtein(a, b[1:])+1

Finally, we simply take the minimum edit distance of all three results (replace the first character, remove the first character, insert the first character).

Final Remarks

Do you have difficulties understanding recursion and the Python basics (there are so many of them)? Why not resolving them, once and for all, and joining the top 10% of Pythonistas?

Leave a Comment