5 Best Ways to Find All Close Matches of Input String from a List in Python

πŸ’‘ Problem Formulation: When processing text data in Python, a common task is to find strings from a list that closely match a given input string. For instance, given an input string “htlm”, the desired output could be a list of close matches such as [“html”, “helm”, “halma”] from a predefined list of words. This article explores methods to efficiently solve this problem.

Method 1: Using difflib.get_close_matches

The difflib.get_close_matches function is a high-level helper from the Python Standard Library that finds close matches to a specified word within a list of words. Under the hood it uses difflib.SequenceMatcher and returns only those candidates whose similarity ratio meets a cutoff (0.6 by default).

Here’s an example:

import difflib

words = ['apple', 'peach', 'pear', 'avocado']
input_str = 'appel'
matches = difflib.get_close_matches(input_str, words)

print(matches)

Output:

['apple']

This snippet retrieves close matches to the string “appel” from a list of words. The function difflib.get_close_matches returns the best-matching strings in order of closeness, but only candidates that reach the default similarity cutoff of 0.6 are included, so “apple” is the sole match here.
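
Both optional parameters of the function can be tuned: n caps how many matches are returned (3 by default) and cutoff sets the minimum similarity ratio (0.6 by default). As a rough sketch of how this changes the result for the same data, lowering the cutoff can let weaker candidates such as “pear” and “peach” through:

import difflib

words = ['apple', 'peach', 'pear', 'avocado']
input_str = 'appel'

# Return up to 4 matches and accept anything with a similarity
# ratio of at least 0.4 instead of the default 0.6.
matches = difflib.get_close_matches(input_str, words, n=4, cutoff=0.4)

print(matches)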

Method 2: Levenshtein Distance (python-Levenshtein package)

The Levenshtein distance is a measure of the difference between two sequences. The python-Levenshtein package provides a function distance() that computes the Levenshtein distance between two strings. By iterating over a list and calculating the distance for each string, we can select close matches based on a chosen threshold.

Here’s an example:

import Levenshtein

words = ['hello', 'help', 'hell', 'hero']
input_str = 'heal'
matches = [word for word in words if Levenshtein.distance(word, input_str) < 2]

print(matches)

Output:

['hell']

This code snippet filters the list of words to those with a Levenshtein distance of less than 2 from the input string “heal”, i.e. at most one edit away. Only “hell” (a single substitution) qualifies, while “hello”, “help”, and “hero” are each two edits away.
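
A fixed threshold has to be guessed in advance. An alternative sketch, using the same Levenshtein.distance function, is to rank the whole list by distance and keep the closest few candidates:

import Levenshtein

words = ['hello', 'help', 'hell', 'hero']
input_str = 'heal'

# Sort every word by its edit distance to the input (smallest first)
# and keep the two nearest candidates instead of using a hard cutoff.
ranked = sorted(words, key=lambda word: Levenshtein.distance(word, input_str))

print(ranked[:2])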

Method 3: FuzzyWuzzy Package

The FuzzyWuzzy package is another Python library that uses Levenshtein distance to determine the similarity between strings. Instead of manually iterating over a list, FuzzyWuzzy provides built-in functions such as process.extract() and process.extractBests() that give control over the scoring function, score cutoffs, and the number of results.

Here’s an example:

from fuzzywuzzy import fuzz, process

words = ['feature', 'future', 'fortune', 'furniture']
input_str = 'futre'
matches = process.extractBests(input_str, words, limit=3, scorer=fuzz.partial_ratio)

print(matches)

Output:

[('future', 80), ('feature', 67), ('furniture', 60)]

The snippet scores every candidate against “futre” with the fuzz.partial_ratio scorer, sorts them by similarity, and limits the output to the top 3 close matches, returned as (match, score) tuples.
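
If you would rather filter by quality than by count, process.extractBests also accepts a score_cutoff argument. The following sketch, assuming the same fuzzywuzzy installation, keeps only candidates scoring at least 70 with the plain fuzz.ratio scorer:

from fuzzywuzzy import fuzz, process

words = ['feature', 'future', 'fortune', 'furniture']
input_str = 'futre'

# Drop any candidate scoring below 70; limit still caps the result size.
strong_matches = process.extractBests(input_str, words,
                                      scorer=fuzz.ratio,
                                      score_cutoff=70,
                                      limit=5)

print(strong_matches)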

Method 4: Regex Approximation with the regex Package

The regex package, a third-party, enhanced drop-in replacement for Python’s built-in re module, allows approximate string matching via regular expressions. This means you can match strings that are within a certain number of edits (insertions, deletions, or substitutions) of a given pattern.

Here’s an example:

import regex

words = ['react', 'retract', 'read', 'reast']
input_str = 'reat'
matches = [word for word in words if regex.fullmatch('(?:' + input_str + '){e<=1}', word)]

print(matches)

Output:

['react', 'read', 'reast']

This example demonstrates the regex library’s fuzzy-matching syntax {e<=1} to find words that fully match “reat” within one edit; “react”, “read”, and “reast” are each a single insertion or substitution away, while “retract” needs more edits.
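
The regex module also reports how many edits each fuzzy match needed via the match object’s fuzzy_counts attribute, a tuple of (substitutions, insertions, deletions). The sketch below, which assumes the same regex package, combines that with the BESTMATCH flag to rank candidates by how many edits they required:

import regex

words = ['react', 'retract', 'read', 'reast']
input_str = 'reat'
pattern = '(?:' + regex.escape(input_str) + '){e<=2}'

scored = []
for word in words:
    # BESTMATCH asks the engine for the match with the fewest edits.
    m = regex.fullmatch(pattern, word, flags=regex.BESTMATCH)
    if m:
        # fuzzy_counts is (substitutions, insertions, deletions).
        scored.append((sum(m.fuzzy_counts), word))

scored.sort()
print([word for _, word in scored])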

Bonus One-Liner Method 5: List Comprehension with SequenceMatcher

For a quick, inline solution, Python’s difflib.SequenceMatcher can be used in a list comprehension to filter close matches based on a similarity ratio.

Here’s an example:

from difflib import SequenceMatcher

words = ['connection', 'collection', 'correction', 'reflection']
input_str = 'conection'
matches = [word for word in words if SequenceMatcher(None, word, input_str).ratio() > 0.8]

print(matches)

Output:

['connection', 'collection', 'correction']

In a compact form, the list comprehension computes the similarity ratio between “conection” and each word in the list and keeps those scoring above 0.8; only “reflection” falls below that bar here.
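
The same one-liner idea can return a ranking instead of a filtered list, with sorted ordering the candidates by their ratio, best match first:

from difflib import SequenceMatcher

words = ['connection', 'collection', 'correction', 'reflection']
input_str = 'conection'

# Rank all words by similarity ratio, best match first.
ranked = sorted(words,
                key=lambda word: SequenceMatcher(None, word, input_str).ratio(),
                reverse=True)

print(ranked)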

Summary/Discussion

  • Method 1: difflib.get_close_matches. Built-in and simple to use, with an adjustable similarity cutoff and result count. However, its fixed SequenceMatcher-based similarity measure may not fit every use case.
  • Method 2: Levenshtein Distance. Provides precise control over similarity by counting edits. Requires an external package and manual threshold tuning, and the explicit Python-level iteration can be slower on large lists.
  • Method 3: FuzzyWuzzy Package. User-friendly, with several scoring methods available and easy-to-tweak thresholds and result limits. Relies on an external library (now maintained as thefuzz), which may be a barrier in certain environments.
  • Method 4: Regex Approximation with the regex Package. Offers regex pattern matching with approximations. Very powerful and flexible, but getting the fuzzy patterns and error thresholds right can be tricky.
  • Bonus Method 5: SequenceMatcher in a List Comprehension. A quick one-liner for inline use that relies only on the standard library. However, it may not be as efficient or feature-rich as the dedicated functions above.