fbpx

How to Calculate the Edit Distance in Python

Type “Helo World” into your Google search bar.

Google will ask you: “Did you mean: hello world”

A simple method to detect these typos is the Levensthein distance (also called edit distance). In fact, Google’s algorithm seems to use some variant of it. (source)

By studying this article, you’ll learn about the important practical algorithm to calculate the “Levenshtein distance” or “edit distance”.

The Basics

The Levenshtein distance is a metric to calculate the distance between two strings. It helps you to quantify how “similar” two strings are. The Levenshtein distance is also called “edit distance” which describes precisely what it measures: the number of character edits (insertions, removals, or substitutions) that are needed to transform one string into another. The intuition is the following: the smaller the Levenshtein distance, the more similar the strings.

The Levenshtein distance has important applications. Think about the auto-correction functionality on your smartphone. Say, you type in “helo” in your WhatsApp messenger. Your smartphone then selects several high probability words and sorts them (e.g. by Levenshtein distance). For example, the one with minimal Levenshtein distance (and, hence, maximal similarity) is the string “hello”. Thus, it may automatically correct “helo” to “hello”.

Let’s consider an example with two strings “cat” and “chello”. How to calculate the Levenshtein distance in this scenario?

We already know that the Levenshtein distance computes the minimal number of edits (insert, delete, or replace) to reach the second string starting from the first string.

Here is one minimal sequence:

  • “cat”
  • cht” (replace “a” by “h”)
  • che” (replace “t” by “e”)
  • chel” (insert “l” at position 3)
  • chell” (insert “l” at position 4)
  • chello” (insert “o” at position 5)

In this way, we can transform the string “cat” in the string “chello” in five editing steps – the Levenshtein distance is 5.

The Code

Problem: Write a Python one-liner that calculates the Levenshtein distance of two strings a and b.

## The Data
a = "cat"
b = "chello"
c = "chess"

## The One-Liner
ls = lambda a, b: len(b) if not a else len(a) if not b \
         else min(ls(a[1:], b[1:])+(a[0] != b[0]),
                  ls(a[1:], b)+1,
                  ls(a, b[1:])+1)

## The Result
print(ls(a,b))
print(ls(a,c))
print(ls(b,c))

Listing: Calculating the Levenshtein distance of two strings in one line.

What’s the output of this code snippet?

How It Works

Before we dive into the code, let’s quickly explore an important Python trick we heavily exploit in the one-liner. In Python, every object has a truth value – while you are either good or bad in the world of Harry Potter, you are either True or False in the world of Python! Most objects are in fact True (normal people are usually good). Intuitively, you know the few objects that are False, don’t you?

  • 0 is False
  • ” is False
  • [] is False
  • {} is False

As a rule of thumb, Python objects are considered False if they are empty or zero. Equipped with this information, you can now easily understand the first part of the Levenshtein function:

We create a lambda function that returns the number of edits required to transform a string a into a string b. There are two trivial cases: Suppose string a is empty. In this case, the minimal edit distance is len(b) insertions of the characters in string b. We cannot do better. Similarly, if string b is empty, the minimal edit distance is len(a). Thus, we can directly return the correct edit distance if either of the strings is empty.

Let’s say both strings are non-empty (otherwise the solution is trivial as shown previously). Now, we can simplify the problem in three ways.

First, we ignore the leading characters of both strings a and b and calculate the edit distance from a[1:] to b[1:] in a recursive manner. If the leading characters a[0] and b[0] are different, we have to fix it by replacing a[0] by b[0]. Hence, we increment the edit distance by one if they are different.

Second, we remove the first character a[0]. Now, we check the minimal edit distance recursively for this smaller problem. As we have removed a character, we increment the result by one.

Third, we (conceptually) insert the character b[0] to the beginning of the word a. Now, we can reduce this problem to the smaller problem that arises if we remove the first character of b. As we have performed one edit operation (inserting), we increment the result by one.

Finally, we simply take the minimum edit distance of all three results (replace the first character, remove the first character, insert the first character).

This one-liner solution demonstrates once again the importance of training your recursion skills – recursion may not come naturally to you but rest assured that it will after studying many recursive problems like this one.

Where to go from here?

This is an advanced algorithm that requires basic computer science and Python skills. If you feel like you need to train the basics, read the book “Coffee Break Python“. After studying the book, you will not only know your exact Python skill level, you will also understand Python code much faster!

Leave a Comment

Your email address will not be published. Required fields are marked *