5 Best Ways to Detect Nearly Identical Word Pairs in Python

πŸ’‘ Problem Formulation: When dealing with text data, detecting pairs of words that are nearly identical can be crucial for tasks like spell checking, plagiarism detection, or data deduplication. For instance, given a list of words ['example', 'samples', 'exampel', 'apple', 'aple'], the program should identify pairs like ('example', 'exampel') and ('apple', 'aple') as almost the same.

Method 1: Levenshtein Distance

The Levenshtein distance measures how many single-character edits (insertions, deletions, or substitutions) are required to change one word into another. It’s a classic metric in text analysis for finding similar words. In Python, the python-Levenshtein library provides a fast C implementation of this distance.

Here’s an example:

import Levenshtein

words = ['example', 'samples', 'exampel', 'apple', 'aple']
similar_words = []

for i in range(len(words)):
    for j in range(i+1, len(words)):
        if Levenshtein.distance(words[i], words[j]) <= 2:
            similar_words.append((words[i], words[j]))

print(similar_words)

Output: [('example', 'exampel'), ('apple', 'aple')]

This code snippet sets a threshold of 2 for the Levenshtein distance. A threshold of 1 would catch the single deletion in ('apple', 'aple') but miss ('example', 'exampel'): plain Levenshtein has no transposition operation, so swapping the final 'le' to 'el' counts as two substitutions. The snippet iterates over all unordered pairs and uses the Levenshtein.distance() function to find similar words.
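python-Levenshtein is a third-party dependency. If installing it is not an option, the same distance can be computed in pure Python with the classic two-row dynamic-programming recurrence — a minimal sketch, noticeably slower than the C implementation but dependency-free:

```python
def levenshtein(s1, s2):
    # Classic DP over two rows: `previous[j]` holds the edit distance
    # between the prefix of s1 processed so far and s2[:j].
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # keep the inner row as short as possible
    previous = list(range(len(s2) + 1))
    for i, ch1 in enumerate(s1, start=1):
        current = [i]
        for j, ch2 in enumerate(s2, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (ch1 != ch2)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]

print(levenshtein('apple', 'aple'))       # 1 (one deletion)
print(levenshtein('example', 'exampel'))  # 2 (transposition = two substitutions)
```

Swapping this function in for Levenshtein.distance() in the loop above produces the same pairs.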

Method 2: SequenceMatcher from difflib

The SequenceMatcher class from Python’s built-in difflib module compares pairs of sequences of any type. Applied to two words, its ratio() method returns a similarity score between 0 and 1; a high ratio indicates that few edits separate the strings.

Here’s an example:

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio() > 0.8

words = ['example', 'samples', 'exampel', 'apple', 'aple']
similar_pairs = [(a, b) for a in words for b in words if a < b and similar(a, b)]

print(similar_pairs)

Output: [('exampel', 'example'), ('aple', 'apple')]

The code defines a similar() helper that uses SequenceMatcher to return True when the similarity ratio of two strings exceeds 0.8. The a < b condition considers each unordered pair exactly once, which is why every reported pair appears in alphabetical order.
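difflib also ships get_close_matches(), which wraps the same ratio-based comparison and is convenient when you want the closest candidates for one word rather than all pairs:

```python
from difflib import get_close_matches

words = ['example', 'samples', 'exampel', 'apple', 'aple']

# For each word, look for close matches among the *other* words.
for word in words:
    others = [w for w in words if w != word]
    matches = get_close_matches(word, others, n=3, cutoff=0.8)
    if matches:
        print(word, '->', matches)
```

With the sample list this reports 'exampel' for 'example' (and vice versa) and 'aple' for 'apple' (and vice versa), mirroring the pair-based result above.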

Method 3: Hamming Distance

The Hamming distance is another string metric for measuring the difference between two strings. It’s more restrictive than Levenshtein, as it only considers substitution errors between strings of equal length. This can be beneficial for finding words that are similar in length and composition.

Here’s an example:

def hamming_distance(s1, s2):
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

words = ['example', 'samples', 'exampel', 'apple', 'aple']
similar_words_hamming = []

for i in range(len(words)):
    for j in range(i+1, len(words)):
        if len(words[i]) == len(words[j]) and hamming_distance(words[i], words[j]) <= 2:
            similar_words_hamming.append((words[i], words[j]))

print(similar_words_hamming)

Output: [('example', 'exampel')]

This snippet implements a simple function to calculate the Hamming distance and then collects equal-length word pairs that differ in at most two positions. The threshold of 2 is needed because Hamming distance counts the transposed 'le'/'el' in ('example', 'exampel') as two substitutions; a threshold of 1 would return an empty list for this word list.
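The nested index loops can be written more idiomatically with itertools.combinations, which yields each unordered pair exactly once. Note the threshold of 2: in plain Hamming terms the transposition in ('example', 'exampel') counts as two substitutions.

```python
from itertools import combinations

def hamming_distance(s1, s2):
    # Count positions where the two equal-length strings disagree.
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

words = ['example', 'samples', 'exampel', 'apple', 'aple']

similar_words_hamming = [
    (a, b)
    for a, b in combinations(words, 2)
    if len(a) == len(b) and hamming_distance(a, b) <= 2
]
print(similar_words_hamming)  # [('example', 'exampel')]
```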

Method 4: Custom Similarity Function

Creating a custom function allows for fine-tuning the similarity conditions. For instance, one could define that words are almost the same if they start and end with the same letter and have a similar length, regardless of the characters in between.

Here’s an example:

def custom_similarity(w1, w2):
    return (w1[0] == w2[0] and w1[-1] == w2[-1] and abs(len(w1) - len(w2)) <= 1)

words = ['example', 'samples', 'exampel', 'apple', 'aple']
similar_words_custom = []

for a in words:
    for b in words:
        if a < b and custom_similarity(a, b):
            similar_words_custom.append((a, b))

print(similar_words_custom)

Output: [('aple', 'apple')]

This code defines a custom function that considers two words similar only when they share the same first and last letters and their lengths differ by at most one. Of the sample words, only ('aple', 'apple') satisfies all three conditions: ('exampel', 'example') fails because the last letters differ ('l' vs. 'e'), and ('example', 'samples') fails on the first letter.
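A custom rule can also encode the kinds of typos you expect directly. As an illustrative sketch (the anagram_or_one_off name is invented here), the rule below treats two words as almost the same when they contain exactly the same letters — catching transpositions such as 'example'/'exampel' — or when deleting one character from the longer word yields the shorter one:

```python
def anagram_or_one_off(w1, w2):
    # Same multiset of letters: a pure reordering (e.g. a transposition).
    if sorted(w1) == sorted(w2):
        return True
    # Lengths differ by one: check whether removing a single character
    # from the longer word produces the shorter one.
    if abs(len(w1) - len(w2)) == 1:
        longer, shorter = (w1, w2) if len(w1) > len(w2) else (w2, w1)
        return any(longer[:i] + longer[i + 1:] == shorter
                   for i in range(len(longer)))
    return False

words = ['example', 'samples', 'exampel', 'apple', 'aple']
similar_pairs_anagram = [(a, b) for a in words for b in words
                         if a < b and anagram_or_one_off(a, b)]
print(similar_pairs_anagram)  # [('exampel', 'example'), ('aple', 'apple')]
```

Be aware that the sorted-letters check also accepts full anagrams such as 'listen'/'silent', which may be looser than you want for spell checking.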

Bonus One-Liner Method 5: Set Intersection

If similarity is simply defined by sharing nearly the same letters, ignoring order and repeated characters, a comparison of letter sets can be a quick measure.

Here’s an example:

words = ['example', 'samples', 'exampel', 'apple', 'aple']
similar_words_set = [(a, b) for a in words for b in words if a < b and len(set(a) & set(b)) >= len(set(a) | set(b)) - 1]

print(similar_words_set)

Output: [('exampel', 'example'), ('aple', 'apple')]

This concise code snippet keeps pairs whose shared unique letters cover all but at most one letter of the combined alphabet of both words. Comparing the intersection against the union, rather than against len(a), prevents a short word such as 'aple' from matching every longer word that happens to contain all of its letters.
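Plain sets ignore how often a letter occurs, so 'apple' and 'aple' look identical to them. A slightly stronger variant along the same lines compares letter multisets with collections.Counter: the total number of letters (with multiplicity) by which the words differ must be at most one. A sketch:

```python
from collections import Counter

def multiset_diff(a, b):
    # Size of the symmetric difference of the two letter multisets:
    # how many letters, counted with multiplicity, the words don't share.
    ca, cb = Counter(a), Counter(b)
    return sum(((ca - cb) + (cb - ca)).values())

words = ['example', 'samples', 'exampel', 'apple', 'aple']
similar_multiset = [(a, b) for a in words for b in words
                    if a < b and multiset_diff(a, b) <= 1]
print(similar_multiset)  # [('exampel', 'example'), ('aple', 'apple')]
```

This still ignores character order (anagrams count as identical), but it no longer conflates repeated letters.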

Summary/Discussion

  • Method 1: Levenshtein Distance. Highly versatile and accurate in finding similarity. The all-pairs comparison can be slow on larger datasets.
  • Method 2: SequenceMatcher. Native to Python, doesn’t require external libraries. Less granular control over character edits compared to Levenshtein.
  • Method 3: Hamming Distance. Best suited for words of the same length. It’s often not suitable for natural language processing where word lengths vary.
  • Method 4: Custom Similarity Function. Highly customizable to specific needs. May require additional logic for more complex comparisons.
  • Method 5: Set Intersection. Extremely simple and fast. Does not consider character order, which may not be suitable for all applications.