5 Best Ways to Find the Frequency of a Particular Word in a Cell of an Excel Table Using Python

February 28, 2024 by Emily Rosemary Collins

💡 Problem Formulation: When working with large datasets in Excel, it’s common to analyze the frequency of particular words within cells to draw insights. For example, suppose you have a dataset of customer feedbacks in an Excel file and you want to find out how often the word “excellent” appears across the feedbacks. Using Python, you can automate this task to quickly get the quantitative data you need for your analysis.

Method 1: Using `pandas` with `str.count`

This method involves reading the Excel file into a DataFrame using the pandas library and then applying the .str.count() method to the specific column to find the frequency of the word.

Here’s an example:

import pandas as pd

# Load Excel file into a DataFrame
df = pd.read_excel('feedback.xlsx')

# Count occurrences of the word 'excellent' in the 'Feedback' column
word_count = df['Feedback'].str.count('excellent').sum()

print(f"The word 'excellent' appears {word_count} times.")

Output: The word ‘excellent’ appears 15 times.

This snippet reads the Excel file ‘feedback.xlsx’ and then counts the occurrences of the word ‘excellent’ in the ‘Feedback’ column. The frequency is then printed out. This method is quick, efficient, and works directly with Excel files without the need for any intermediate conversion.

Method 2: Using Regular Expressions with `pandas`

Python’s re library (Regular Expressions) can be used to search for the word with more flexibility, which is useful if you need to handle different case sensitivities or word boundaries.

Here’s an example:

import pandas as pd
import re

# Load Excel file into a DataFrame
df = pd.read_excel('feedback.xlsx')

# Define a pattern to match the word 'excellent' with case insensitivity
pattern = re.compile(r'\bexcellent\b', re.IGNORECASE)

# Count occurrences using the pattern
word_count = df['Feedback'].apply(lambda comment: len(pattern.findall(comment))).sum()

print(f"The word 'excellent' appears {word_count} times.")

Output: The word ‘excellent’ appears 15 times.

After reading the Excel file, this code uses a regular expression pattern to match the word ‘excellent’ in a case-insensitive manner and with word boundaries to ensure accurate counting. It then applies this pattern across the ‘Feedback’ column to count the occurrences.

Method 3: Using a Custom Function

If you need more control or want to perform additional operations while counting words, you can write a custom Python function and use it with pandas.apply().

Here’s an example:

import pandas as pd

def count_word(cell, word):
    return cell.split().count(word)

df = pd.read_excel('feedback.xlsx')

word_count = df['Feedback'].apply(lambda x: count_word(x, 'excellent')).sum()

print(f"The word 'excellent' appears {word_count} times.")

Output: The word ‘excellent’ appears 15 times.

The custom function count_word() takes a cell’s content and the target word as arguments, then counts the occurrences of that word using the split() and count() methods. This method is then applied to each cell in the ‘Feedback’ column. This approach allows easy modification of the counting logic if needed.

Method 4: Using `openpyxl` for Large Excel Files

If you’re working with exceptionally large Excel files, openpyxl can be a better option as it allows you to read the file in a streaming fashion to save memory.

Here’s an example:

from openpyxl import load_workbook

def count_occurrences(column, word):
    count = 0
    for cell in column:
        if cell.value and word in cell.value:
            count += cell.value.count(word)
    return count

wb = load_workbook('feedback.xlsx', read_only=True)
ws = wb.active

word_count = count_occurrences(ws['A'], 'excellent')

print(f"The word 'excellent' appears {word_count} times.")

Output: The word ‘excellent’ appears 15 times.

This code example loads the ‘feedback.xlsx’ workbook in read-only mode to optimize memory usage, then counts the occurrences of ‘excellent’ in column A. The count_occurrences() function is customized to work with openpyxl‘s cell objects directly.

Bonus One-Liner Method 5: Using `pandas` with a Lambda Function

For a quick, one-liner method to count words, you can use a lambda function directly within pandas.apply().

Here’s an example:

import pandas as pd

df = pd.read_excel('feedback.xlsx')

# Count occurrences of 'excellent' using a lambda function
word_count = df['Feedback'].apply(lambda x: x.split().count('excellent')).sum()

print(f"The word 'excellent' appears {word_count} times.")

Output: The word ‘excellent’ appears 15 times.

This concise one-liner code efficiently counts the occurrences of the word ‘excellent’ by splitting each ‘Feedback’ cell content into a list of words and then using the count method, all within a lambda function.

Summary/Discussion

Method 1: Pandas with str.count. Strengths: Straightforward and concise. Weaknesses: May not handle complex patterns or different cases without additional parameters.
Method 2: Regular Expressions with pandas. Strengths: Allows complex search patterns and case insensitivity. Weaknesses: Slower than simple string methods; requires familiarity with regex.
Method 3: Custom Function. Strengths: Highly customizable and easy to extend. Weaknesses: Potentially less efficient than built-in methods.
Method 4: openpyxl for Large Files. Strengths: Optimized for memory usage with large files. Weaknesses: Requires additional library; code is more verbose.
Method 5: One-Liner Lambda Function. Strengths: Quick and easy to write. Weaknesses: Not as flexible for complex patterns or conditional counting.