Levensthein Distance

How to Calculate the Levenshtein Distance in Python?

After studying this article, you will know exactly how to calculate the edit distance in Python.

Learning requires opening your knowledge gap first. So let’s do this. What is the output of the following Python puzzle showing you a concise code snippet to calculate the edit distance in Python? (source)

Python Source Code

def levenshtein(a, b):
    if not a: return len(b)
    if not b: return len(a)
    return min(levenshtein(a[1:], b[1:])+(a[0] != b[0]),
               levenshtein(a[1:], b)+1,
               levenshtein(a, b[1:])+1)

    
print(levenshtein("cat","chello"))

Now, this is a hard nut to crack. Let’s have a look at how this code works!

General Idea Levenshtein Distance

Before we dive in the code, let’s first understand the idea of the Levenshtein distance:

“In information theory, Linguistics and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.” Wikipedia

Here are the two most important points from the definition:

  • The Levenshtein distance is a metric measuring the difference between two strings. If two strings are similar, the distance should be small. If they are very different, the distance should be large.
  • But what does it mean for two strings to be similar or different? The metric is defined as the number of “edits” to transform one string to another. An edit can be an insertion of a character at a given position, a removal of a character, or a replacement of a character with another character.

Applications of Levenshtein Distance

The Levenshtein distance has important applications in practice. Think about the auto-correction functionality on your smartphone.

Say, you type in "helo" in your WhatsApp messenger. Your smartphone recognizes that this is not a word in its dictionary. It then selects several high probability words and may sort them by Levenshtein distance. One with minimal Levenshtein distance (and, hence, maximal similarity) is "hello" because you simply have to insert one character "l" to go from the incorrect "helo" to the correct word "hello" that exists in the dictionary.

Explanation of Source Code

Let’s dive into the code. We already know that the Levenshtein distance computes the minimal number of edits (insert, delete, or replace) to reach the second string starting from the first string.

Here is one minimal sequence:

"cat"
"cht" (replace "a" by "h")
"che" (replace "t" by "e")
"chel" (insert "l" at position 3)
"chell" (insert "l" at position 4)
"chello" (insert "o" at position 5)

So we can transform the string "cat" in the string "chello" with five edits. There’s no quicker way—go ahead and try it!

But how does the algorithm accomplish that?


Intermezzo: The Python Truth Value of Objects

In Python, EVERY object has a truth value. In Harry Potter, you are either good or bad. In Python, you are either True or False.

Most objects are in fact “True” (normal people are usually good). Intuitively, you know the few objects that are “False”, don’t you? For example:

  • 0 is False
  • ” is False
  • [] is False
  • {} is False

Understanding the Levenshtein Algorithm

With this information, you can now easily understand the first two lines of the Levenshtein function:

if not a: return len(b)
if not b: return len(a)

Suppose string a is empty. In this case, the minimal edit distance is len(b) insertions of the characters in string b. We cannot do better. Therefore, we can directly return the correct edit distance if either of the strings is empty.

Let’s say both strings are non-empty (otherwise the solution is trivial as shown previously).

Now, we can simplify the problem in three ways.

First, we ignore the leading characters of both strings a and b and calculate the edit distance from a[1:] to b[1:] in a recursive manner. Note that we use slicing to get the substrings starting from the second character with index 1.

๐Ÿ’กย Slicing is a concept to carve out a substring from a given string. Use slicing notation s[start:stop:step] to access every step-th element starting from index start (included) and ending in index stop (excluded). All three arguments are optional, so you can skip them to use the default values (start=0, stop=len(lst), step=1). For example, the expression s[2:4] from string 'hello' carves out the slice 'll' and the expression s[:3:2] carves out the slice 'hl'.

Related Article + Video Tutorial: Introduction to Slicing

If the leading characters a[0] and b[0] are different, we have to fix it by replacing a[0] by b[0]. Hence, we have to increment the edit distance by one if they are different.

In code, this looks as follows:

levenshtein(a[1:], b[1:])+(a[0] != b[0])

Second, we remove the first character a[0]. Now, we check the minimal edit distance recursively for this smaller problem. As we have removed a character, we increment the result by one.

In code, this looks as follows:

levenshtein(a[1:], b) + 1

Third, we (conceptually) insert the character b[0] to the beginning of the word a. Now, we can reduce this problem to the smaller problem that arises if we remove the first character of b. As we have performed one edit operation (inserting), we increment the result by one.

Here is the code:

levenshtein(a, b[1:])+1

Finally, we simply take the minimum edit distance of all three results (replace the first character, remove the first character, insert the first character).

Final Remarks

Thanks for reading this tutorial on the Finxter blog! ๐Ÿ™‚

Did you have difficulties understanding recursion and the Python basics (there are so many of them)? Why not resolving them, once and for all, and joining the top 10% of Pythonistas?

If you want to boost your career and improve your Python skills at the same time, why not start earning money while you learn as a Python freelancer?

Leave a Comment

Your email address will not be published. Required fields are marked *