💡 Problem Formulation: Determining sentence similarity is crucial in applications such as chatbots, search engines, and text analysis. For example, given the two input sentences “Python is great for data analysis” and “Data analysis thrives with Python,” the desired output is a verdict on whether the two sentences convey the same meaning or not.
Method 1: String Matching
String matching compares two sentences directly for similarity. It is the simplest approach, using Python’s built-in in operator to check whether one sentence is a substring of the other or whether the two are identical. This method is quick and works well for verbatim matches, but it lacks semantic understanding.
Here’s an example:
def are_sentences_similar(sentence1, sentence2):
    # True if either sentence appears verbatim inside the other (case-sensitive)
    return sentence1 in sentence2 or sentence2 in sentence1

print(are_sentences_similar("Python is great for data analysis", "Data analysis thrives with Python"))
Output: False
In the provided code snippet, the function are_sentences_similar() checks whether one sentence is a substring of the other. Although the sentences are thematically related, the function returns False since neither sentence is a direct substring of the other, which illustrates a key limitation of this method.
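For contrast, the same function does succeed when one sentence literally contains the other. Here is a quick sanity check with an illustrative pair of sentences (not from the running example):

def are_sentences_similar(sentence1, sentence2):
    return sentence1 in sentence2 or sentence2 in sentence1

# The first sentence is a verbatim substring of the second, so the check passes.
print(are_sentences_similar("Python is great", "Python is great for data analysis"))
# Expected output: True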
Method 2: Token-Based Similarity
Token-based similarity measures compare the sets of words (tokens) in each sentence. A popular approach is the Jaccard similarity index, which divides the number of tokens the two sentences share (the intersection) by the number of distinct tokens across both (the union). This method improves upon string matching by recognizing sentences with similar words, regardless of word order.
Here’s an example:
from nltk.tokenize import word_tokenize

def jaccard_similarity(sentence1, sentence2):
    # Requires the NLTK tokenizer data: nltk.download('punkt')
    tokens1 = set(word_tokenize(sentence1.lower()))
    tokens2 = set(word_tokenize(sentence2.lower()))
    intersection = tokens1.intersection(tokens2)
    union = tokens1.union(tokens2)
    return len(intersection) / len(union)

print(jaccard_similarity("Python is great for data analysis", "Data analysis thrives with Python"))
Output: 0.375
This example uses the Natural Language Toolkit (NLTK) to tokenize the sentences and then computes the Jaccard similarity index. The two sentences share three tokens (python, data, analysis) out of eight distinct tokens in total, so the score reflects a moderate degree of similarity driven by the shared words.
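For intuition, the same number can be reproduced without NLTK by splitting on whitespace, since these particular sentences contain no punctuation (a minimal sketch, not a replacement for proper tokenization):

s1 = set("Python is great for data analysis".lower().split())
s2 = set("Data analysis thrives with Python".lower().split())
shared = s1 & s2   # {'python', 'data', 'analysis'}
total = s1 | s2    # 8 distinct tokens across both sentences
print(len(shared) / len(total))   # 3 / 8 = 0.375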
Method 3: Cosine Similarity with TF-IDF
Cosine similarity with Term Frequency-Inverse Document Frequency (TF-IDF) quantifies the similarity between texts by weighting each word by how often it occurs and how distinctive it is across the texts being compared. The sentences are transformed into vectors in a shared vector space, and the cosine of the angle between these vectors provides the similarity score.
Here’s an example:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cosine_similarity_tfidf(sentence1, sentence2):
    vectorizer = TfidfVectorizer()
    # Fit on both sentences so they share one vocabulary, then compare the two rows
    tfidf_matrix = vectorizer.fit_transform([sentence1, sentence2])
    return cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]

print(cosine_similarity_tfidf("Python is great for data analysis", "Data analysis thrives with Python"))
Output: 0.38087...
Here, the two sentences are vectorized using TF-IDF, and their cosine similarity is calculated with scikit-learn’s built-in functions. With only two short sentences, the score comes out moderate rather than high: the words unique to each sentence receive larger IDF weights than the shared ones, yet contribute nothing to the overlap, which pulls the score down despite the many shared words.
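To see what the vectorizer is doing, one can inspect the learned vocabulary and the dense TF-IDF matrix (a minimal sketch; get_feature_names_out assumes scikit-learn 1.0 or newer):

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform([
    "Python is great for data analysis",
    "Data analysis thrives with Python",
])
# One row per sentence, one column per vocabulary term; terms appearing in both
# sentences receive lower IDF weights than terms unique to one sentence.
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(3))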
Method 4: Semantic Similarity Using Word Embeddings
Semantic similarity with word embeddings such as Word2Vec or GloVe captures how words are typically used in context. Embeddings map each word to a dense vector learned from its surrounding words, so sentences can be compared by meaning rather than by exact wording.
Here’s an example:
import spacy

# Requires the model to be installed first: python -m spacy download en_core_web_md
nlp = spacy.load('en_core_web_md')

def semantic_similarity(sentence1, sentence2):
    doc1 = nlp(sentence1)
    doc2 = nlp(sentence2)
    return doc1.similarity(doc2)

print(semantic_similarity("Python is great for data analysis", "Data analysis thrives with Python"))
Output: 0.9497608
The code uses the spaCy library with the medium-sized English model en_core_web_md to compute semantic similarity. Under the hood, spaCy averages the word vectors of each sentence and compares the averages with cosine similarity, so a high score is returned because the words in both sentences are used in similar contexts, despite the different structure.
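As a quick sanity check, comparing against an unrelated sentence (chosen here purely for illustration) should produce a noticeably lower score; the exact value depends on the model version:

import spacy

# Assumes en_core_web_md is installed (python -m spacy download en_core_web_md).
nlp = spacy.load('en_core_web_md')

doc1 = nlp("Python is great for data analysis")
doc2 = nlp("The weather was cold and rainy yesterday")
# Expected: a markedly lower similarity than the ~0.95 above, since the sentences share little meaning.
print(doc1.similarity(doc2))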
Bonus One-Liner Method 5: Fuzzy String Matching
Fuzzy string matching, for example via the Levenshtein edit distance, quantifies how dissimilar two strings are to one another: the lower the distance, the more similar the strings. Libraries such as fuzzywuzzy convert this into a similarity ratio from 0 to 100, where higher means more similar.
Here’s an example:
from fuzzywuzzy import fuzz

print(fuzz.token_sort_ratio("Python is great for data analysis", "Data analysis thrives with Python"))
Output: 78
Using the fuzzywuzzy module, this one-liner calculates a similarity score between the two sentences based on their sorted tokens, reporting a fairly high degree of similarity. Because token_sort_ratio sorts the tokens alphabetically before comparing, it considers only the content of the tokens and ignores their order.
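To see why the sorting matters, compare against plain fuzz.ratio, which works on the raw strings and is therefore penalized by the different word order (exact scores may vary slightly between fuzzywuzzy versions):

from fuzzywuzzy import fuzz

s1 = "Python is great for data analysis"
s2 = "Data analysis thrives with Python"

# Plain ratio compares the raw strings, so word order (and case) matters.
print(fuzz.ratio(s1, s2))
# token_sort_ratio sorts the words first, so only their content matters.
print(fuzz.token_sort_ratio(s1, s2))
# Expected: the first score is noticeably lower than the second for this pair.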
Summary/Discussion
- Method 1: String Matching. Simple and effective for exact matches. Poor at capturing contextual similarity.
- Method 2: Token-Based Similarity. Recognizes sentence similarity based on shared words. Insensitive to word order and context.
- Method 3: Cosine Similarity with TF-IDF. Weighs shared words by how informative they are, but remains a purely lexical comparison. Requires more computational resources and may miss nuance in shorter texts.
- Method 4: Semantic Similarity Using Word Embeddings. Provides deep understanding of context. Requires pre-trained models and is computationally intensive.
- Bonus Method 5: Fuzzy String Matching. Quick and practical for various comparisons. May not always capture contextual meaning accurately.