5 Best Ways to Do Fuzzy Matching on a Pandas DataFrame Column Using Python

πŸ’‘ Problem Formulation: When dealing with datasets in data science, we often encounter a scenario where we need to match strings in a column of a Pandas DataFrame that are similar but not exactly the sameβ€”a process known as fuzzy matching. This might occur, for example, when comparing company names that could be typed slightly differently in different sources. The goal is to identify and relate these similar string values effectively.

Method 1: Using the FuzzyWuzzy Library

FuzzyWuzzy is a Python library that uses Levenshtein Distance to calculate the differences between sequences. It is a powerful tool for fuzzy string matching and is especially useful when you want to find the best match for a string from a list of options. The FuzzyWuzzy library provides several methods to accomplish this, such as ratio(), partial_ratio(), token_sort_ratio(), and token_set_ratio().
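
To get a feel for how these scorers differ, here is a minimal sketch that scores a single pair of strings with each of them; the two strings are an arbitrary example, and the exact numbers may vary slightly between library versions:

from fuzzywuzzy import fuzz

a, b = 'Apple Inc.', 'Apple Incorporated International'

print(fuzz.ratio(a, b))             # plain edit-distance based similarity
print(fuzz.partial_ratio(a, b))     # best-matching substring only
print(fuzz.token_sort_ratio(a, b))  # word order is ignored
print(fuzz.token_set_ratio(a, b))   # set-based, extra words matter less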

Here’s an example on a DataFrame column:

from fuzzywuzzy import process
import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'Company': ['Apple Inc.', 'Banana Inc', 'Orange Inc']})

# Define the string to match
string_to_match = 'Apple Incorporated'

# Perform fuzzy matching
best_match = process.extractOne(string_to_match, df['Company'])

print(best_match)

Output:

('Apple Inc.', 90, 0)

This code snippet builds a DataFrame with a column of company names and looks for the best fuzzy match for the string ‘Apple Incorporated’. Using the extractOne function from FuzzyWuzzy, it compares the string against each entry in the column; because a Pandas Series is treated as a dictionary-like object of index/value pairs, the result is a tuple containing the closest match, its similarity score out of 100, and the index of the matching row.
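
In practice you usually want the best match for every row of the column rather than a single lookup. Here is a minimal sketch of that pattern; the reference list master_names and the misspelled company names are assumptions made up for illustration:

from fuzzywuzzy import process
import pandas as pd

# Hypothetical reference list of canonical names (illustrative assumption)
master_names = ['Apple Inc.', 'Banana Inc', 'Orange Inc']

# Messy names we want to map onto the reference list
df = pd.DataFrame({'Company': ['Aple Inc', 'Banana Incorporated', 'Orange Co']})

# Find the best candidate and its score for every row
matches = df['Company'].apply(lambda name: process.extractOne(name, master_names))
df['Best Match'] = matches.apply(lambda m: m[0])
df['Score'] = matches.apply(lambda m: m[1])
print(df)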

Method 2: Using the RapidFuzz Library

RapidFuzz is a fuzzy string matching library that is considerably faster than FuzzyWuzzy and does not require the separate python-Levenshtein C extension. It includes functions such as fuzz.ratio() and process.extract(), which can be used for fuzzy string matching and have APIs very similar to those of FuzzyWuzzy.

Here’s an example:

from rapidfuzz import process, fuzz
import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'Company': ['Alpha Corp', 'Beta Corporation', 'Gamma LLC']})

# Define the string to match
string_to_match = 'Beta Corp'

# Perform fuzzy matching
best_match = process.extractOne(string_to_match, df['Company'], scorer=fuzz.WRatio)

print(best_match)

Output:

('Beta Corporation', 90.0, 1)

In this code snippet, we use RapidFuzz’s extractOne method with the fuzz.WRatio scorer to perform fuzzy matching. It returns a tuple containing the best match found in the ‘Company’ column for the string ‘Beta Corp’, the similarity score, and the index of the matching row.
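
When the single best candidate is not enough, RapidFuzz’s process.extract returns the top matches in one call. Below is a minimal sketch building on the same DataFrame; the limit of two results is an arbitrary choice for illustration:

from rapidfuzz import process, fuzz
import pandas as pd

df = pd.DataFrame({'Company': ['Alpha Corp', 'Beta Corporation', 'Gamma LLC']})

# Top two candidates, returned as (match, score, index) tuples
candidates = process.extract('Beta Corp', df['Company'], scorer=fuzz.WRatio, limit=2)
print(candidates)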

Method 3: Using String Grouper

String Grouper is a Python library designed to cluster strings that have a high similarity score. It is particularly useful for big datasets because it vectorizes the strings with TF-IDF and compares them using cosine similarity, allowing it to process large amounts of data quickly. String Grouper is best suited when you want to group similar strings together.

Here’s an example:

from string_grouper import match_most_similar
import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'Company': ['Delta Ltd', 'Theta Ltd', 'Lambda Ltd']})

# Perform fuzzy matching
matches = match_most_similar(df['Company'], df['Company'])

# Show results
df['Best Match'] = matches
print(df)

Output:

      Company  Best Match
0   Delta Ltd   Delta Ltd
1   Theta Ltd   Theta Ltd
2  Lambda Ltd  Lambda Ltd

This code snippet uses the match_most_similar function from the String Grouper library to match each string in the second argument against the ‘master’ strings in the first argument. Because the same column is passed for both, every company here simply matches itself; the returned matches are added as a new column in the DataFrame, showing the best match for each company name.
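
In a real workflow you would pass a clean master list as the first argument and the messy column as the second. Here is a minimal sketch of that usage; the master and messy Series and the min_similarity value are assumptions made up for illustration (strings that fall below the threshold are returned unchanged):

from string_grouper import match_most_similar
import pandas as pd

# Hypothetical clean master list and messy input names (illustrative assumptions)
master = pd.Series(['Delta Ltd', 'Theta Ltd', 'Lambda Ltd'])
messy = pd.Series(['Delta Limited', 'Theta Ltd.', 'Lambda Ltd'])

# For each messy name, return the most similar master name
best = match_most_similar(master, messy, min_similarity=0.5)
print(pd.DataFrame({'Messy': messy, 'Best Match': best}))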

Method 4: Using the recordlinkage Library

The recordlinkage library provides a toolkit for record linkage and deduplication. It offers various comparison metrics for fuzzy matching, such as Levenshtein, Jaro, and Jaro-Winkler, and is optimized for matching records between two datasets.

Here’s an example:

import recordlinkage
from recordlinkage.index import Full
import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'Company': ['Epsilon Inc', 'Zeta Inc', 'Eta Inc']})

# Define the indexer and comparison object
indexer = recordlinkage.Index()
indexer.add(Full())
comparer = recordlinkage.Compare()

# Define a second DataFrame holding the query string to link against
dummy_df = pd.DataFrame({'Company': ['ePsilon Inc']})

# Index and compare records
pairs = indexer.index(df, dummy_df)
comparer.string('Company', 'Company', method='jarowinkler', threshold=0.85)
features = comparer.compute(pairs, df, dummy_df)

# Pick the candidate pair with the highest comparison score
best_pair = features.sum(axis=1).idxmax()
best_match = df.loc[best_pair[0], 'Company']

print(best_match)

Output:

Epsilon Inc

This code snippet creates an indexer that performs a full index on the two DataFrames, which means every record in the first is paired with every record in the second. The Compare object then measures string similarity using the Jaro-Winkler method; because a threshold of 0.85 is supplied, each candidate pair receives a binary score of 1 (similar enough) or 0. The pair with the highest total score is then selected as the best match.
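
If you would rather rank candidates by their raw Jaro-Winkler similarities instead of a 0/1 indicator, you can omit the threshold argument and sort the computed scores yourself. Here is a minimal sketch under the same setup; the label keyword simply names the output column:

import recordlinkage
from recordlinkage.index import Full
import pandas as pd

df = pd.DataFrame({'Company': ['Epsilon Inc', 'Zeta Inc', 'Eta Inc']})
queries = pd.DataFrame({'Company': ['ePsilon Inc']})

indexer = recordlinkage.Index()
indexer.add(Full())  # pair every record with every query

comparer = recordlinkage.Compare()
comparer.string('Company', 'Company', method='jarowinkler', label='company')

pairs = indexer.index(df, queries)
scores = comparer.compute(pairs, df, queries)

# Rank candidate pairs by raw similarity, highest first
print(scores.sort_values('company', ascending=False))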

Bonus One-Liner Method 5: Using pandasql for SQL-like Fuzzy Matching

pandasql allows you to query Pandas DataFrames using SQL syntax. It’s a great way to combine the power of SQL with the flexibility of Python, especially for those already familiar with SQL queries. For simple use cases, you can filter or join tables on approximately matching conditions using SQL pattern matching with the LIKE operator.

Here’s an example:

from pandasql import sqldf
import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'Company': ['Iota Services', 'Kappa Services', 'Lambda Services']})

# Fuzzy match SQL query
sql_query = """
SELECT Company
FROM df
WHERE Company LIKE '%Services'
"""

# Execute the query
fuzzy_matches = sqldf(sql_query, locals())
print(fuzzy_matches)

Output:

           Company
0    Iota Services
1   Kappa Services
2  Lambda Services

This code snippet demonstrates how to use pandasql to perform a SQL-like fuzzy match by selecting company names that end with the word ‘Services’. This method is not as sophisticated as some others for fuzzy matching but can be very handy for straightforward pattern matching.
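
Because pandasql runs the query through SQLite, any SQLite string expression can be used to build rougher join conditions. The sketch below joins a messy table against a hypothetical master table on a shared four-character prefix; both tables and the prefix length are assumptions made up for illustration:

from pandasql import sqldf
import pandas as pd

# Hypothetical messy and master tables (illustrative assumptions)
messy = pd.DataFrame({'Company': ['Iota Servces', 'Kappa Services Ltd']})
master = pd.DataFrame({'Company': ['Iota Services', 'Kappa Services', 'Lambda Services']})

# Approximate join: a master name matches when it starts with the
# first four characters of the messy name (SQLite substr and ||)
sql_query = """
SELECT m.Company AS messy_name, c.Company AS candidate
FROM messy m
JOIN master c
  ON c.Company LIKE substr(m.Company, 1, 4) || '%'
"""

print(sqldf(sql_query, locals()))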

Summary/Discussion

  • Method 1: FuzzyWuzzy. Uses the Levenshtein Distance for fuzzy string matching. Strengths include multiple match scoring options and ease of use. Weaknesses include speed limitations with large datasets and the need to install Python-Levenshtein for faster performance.
  • Method 2: RapidFuzz. Offers similar functionality to FuzzyWuzzy but is optimized for performance. Strengths include much faster matching and no extra C extension dependency. Its API closely mirrors FuzzyWuzzy’s, so the learning curve is low for anyone already familiar with that library.
  • Method 3: String Grouper. Optimized for big datasets and uses cosine similarity. Strengths include efficiency with large datasets and the ability to cluster similar strings. The primary weakness is less granularity in match scoring compared to direct comparison methods.
  • Method 4: recordlinkage Library. Provides a comprehensive toolkit for record linkage. Strengths include a variety of comparison metrics and optimizations for record matching. This method can be more complex to set up and may require more code compared to others.
  • Bonus Method 5: pandasql. Allows for SQL-like querying on DataFrames. This method’s strengths lie in the convenience for SQL users, while weaknesses include limitations in fuzzy matching capabilities when compared to dedicated string matching algorithms.