5 Best Ways to Count Distinct Words and Their Frequency in Python

💡 Problem Formulation: Given a text input, the goal is to develop a Python program that can count the number of distinct words and determine the frequency of each word. Imagine processing the string “apple orange banana apple apple banana”. The desired output would list ‘apple’ with a frequency of 3, ‘orange’ with a frequency of 1, and ‘banana’ with a frequency of 2.

Method 1: Using a Dictionary

This method iterates over the words in the input and uses a dictionary to track word frequencies. The dictionary’s keys are the words, and the corresponding values are the counts. Dictionary lookups and updates are hash-based and take constant time on average, so the whole count runs in time proportional to the number of words. The example uses defaultdict from the collections module so that a word seen for the first time starts at 0.

Here’s an example:

from collections import defaultdict

def count_words(text):
    word_freq = defaultdict(int)
    for word in text.split():
        word_freq[word] += 1
    return word_freq

freq = count_words("apple orange banana apple apple banana")
for word, count in freq.items():
    print(f"{word}: {count}")

Output:

apple: 3
orange: 1
banana: 2

The count_words() function takes a string as input and returns a defaultdict (a dict subclass) that maps each word to its frequency in the text. Because defaultdict(int) supplies a default value of 0 for any missing key, the loop can increment a count without first checking whether the word is already present.
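The title also promises the number of distinct words, which the example above does not print. A minimal sketch, reusing the same function, of how that number could be read from the result (the names distinct_count and total_words are introduced here for illustration):

```python
from collections import defaultdict

def count_words(text):
    # Same approach as above: defaultdict(int) starts every new key at 0.
    word_freq = defaultdict(int)
    for word in text.split():
        word_freq[word] += 1
    return word_freq

freq = count_words("apple orange banana apple apple banana")

# Each distinct word is stored once as a key, so len() gives the distinct count.
distinct_count = len(freq)
# Summing the values gives the total number of words in the input.
total_words = sum(freq.values())

print(distinct_count)  # 3
print(total_words)     # 6
```

One caveat with defaultdict: reading a missing key with freq["kiwi"] inserts it with a count of 0 and changes len(freq). Using freq.get("kiwi", 0) avoids that side effect.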

Method 2: Using Counter from collections

The Counter class from the collections module is specifically designed for counting hashable objects. It simplifies the process of counting distinct words by automatically creating a dictionary with words as keys and their counts as values.

Here’s an example:

from collections import Counter

def count_words(text):
    return Counter(text.split())

freq = count_words("apple orange banana apple apple banana")
for word, count in freq.items():
    print(f"{word}: {count}")

Output:

apple: 3
orange: 1
banana: 2

The count_words() function splits the string on whitespace and passes the resulting list to Counter, which tallies the occurrences in a single pass. The function body is a concise, readable one-liner that returns the word frequencies.
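Counter also offers a few conveniences beyond a plain dictionary. The sketch below shows two of them, most_common() and the zero default for missing keys; the variable name top_two is introduced here for illustration:

```python
from collections import Counter

freq = Counter("apple orange banana apple apple banana".split())

# most_common(n) returns the n highest counts as (word, count) pairs.
top_two = freq.most_common(2)
print(top_two)  # [('apple', 3), ('banana', 2)]

# A word that never appeared reports 0 instead of raising KeyError,
# and the lookup does not add the word to the Counter.
print(freq["kiwi"])    # 0
print("kiwi" in freq)  # False
```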

Method 3: Using a for loop and a dictionary

Sometimes a plain for loop is the easiest solution to follow. This method checks whether each word is already in the dictionary: if so, it increments the count; otherwise, it initializes the count to 1. It requires no imports and makes a single pass over the words like the previous methods, but in CPython it is typically somewhat slower than Counter, which is optimized for counting.

Here’s an example:

def count_words(text):
    word_freq = {}
    for word in text.split():
        if word in word_freq:
            word_freq[word] += 1
        else:
            word_freq[word] = 1
    return word_freq

freq = count_words("apple orange banana apple apple banana")
for word, count in freq.items():
    print(f"{word}: {count}")

Output:

apple: 3
orange: 1
banana: 2

This snippet shows a direct approach to increment the count of each word in the dictionary as it’s encountered in the text.
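The if/else branch can be folded into a single line with dict.get(), still without any imports. This is a possible variant rather than a replacement for the version above; the function name count_words_get is hypothetical:

```python
def count_words_get(text):
    word_freq = {}
    for word in text.split():
        # get() returns 0 for a word not seen yet, so no if/else is needed.
        word_freq[word] = word_freq.get(word, 0) + 1
    return word_freq

freq = count_words_get("apple orange banana apple apple banana")
# Dictionaries keep insertion order (Python 3.7+), so words appear as first seen.
print(freq)  # {'apple': 3, 'orange': 1, 'banana': 2}
```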

Method 4: Using Regular Expressions

For texts that require cleaning or have complex structures, regular expressions can be used to extract words before counting. This method requires more work to set up but can be more flexible when dealing with non-standard text inputs.

Here’s an example:

import re
from collections import Counter

def count_words(text):
    words = re.findall(r'\w+', text.lower())
    return Counter(words)

freq = count_words("Apple! Orange; Banana... Apple? APPLE, Banana")
for word, count in freq.items():
    print(f"{word}: {count}")

Output:

apple: 3
orange: 1
banana: 2

The code lowercases the text and then uses the regular expression \w+ to find every run of word characters (letters, digits, and underscores). Punctuation is discarded and differences in capitalization disappear, so “Apple!”, “Apple?” and “APPLE,” are all counted as ‘apple’.
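One consequence of \w+ is worth seeing in practice: an apostrophe is not a word character, so contractions are split in two. The sketch below contrasts it with an alternative pattern that keeps an inner apostrophe. The alternative pattern and the sample sentence are assumptions made for illustration, and the pattern only handles unaccented ASCII letters and the straight apostrophe:

```python
import re
from collections import Counter

text = "Don't stop. Don't look back, don't!"

# \w+ treats the apostrophe as a separator, so "don't" becomes "don" and "t".
plain = Counter(re.findall(r"\w+", text.lower()))

# Assumed alternative: letters, optionally followed by an apostrophe and more letters.
with_apostrophes = Counter(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))

print(plain["don"], plain["t"])   # 3 3
print(with_apostrophes["don't"])  # 3
```

Which behavior is right depends on the text being processed, so the pattern is best treated as a parameter to tune.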

Bonus One-Liner Method 5: Using List Comprehension and set

If you need a quick, compact solution for a small input, this method first builds a set to get the distinct words and then uses a dictionary comprehension with the list.count() method to count the occurrences of each one.

Here’s an example:

text = "apple orange banana apple apple banana"
words = text.split()  # split once and reuse the list
unique_words = set(words)
freq = {word: words.count(word) for word in unique_words}
print(freq)

Output (the key order may differ between runs, because sets are unordered):

{'apple': 3, 'banana': 2, 'orange': 1}

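This snippet creates a set of unique words and then uses a dictionary comprehension to build a dictionary of word counts. While compact, every count() call rescans the whole word list, so the running time grows with the number of words multiplied by the number of distinct words. That makes it a poor fit for large datasets.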
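Because a set has no defined order, the dictionary built from it can print its keys differently from run to run. If a stable order matters, one option is to sort the items afterwards. A minimal sketch, where the sort key (descending count, then alphabetical) is just one reasonable choice:

```python
text = "apple orange banana apple apple banana"
words = text.split()  # split once and reuse the list
freq = {word: words.count(word) for word in set(words)}

# Sort by descending count, then alphabetically, so the printed order is stable.
ordered = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
print(ordered)  # [('apple', 3), ('banana', 2), ('orange', 1)]
```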

Summary/Discussion

  • Method 1: Using a Dictionary. Simple and efficient: one pass over the words with constant-time updates on average. The example relies on defaultdict, so it needs an import from collections, which is part of the standard library.
  • Method 2: Using Counter from collections. Very efficient and readable. Purpose-built for counting, and collections ships with Python, so nothing extra needs to be installed.
  • Method 3: Using a for loop and a dictionary. Understandable logic flow with no imports at all. Good for educational purposes. It makes the same single pass as Method 1, though it is typically somewhat slower than Counter.
  • Method 4: Using Regular Expressions. Excellent for text that needs cleaning, such as punctuation or mixed capitalization. It can be overkill for simple whitespace-separated input, and the regex scan adds some overhead.
  • Bonus One-Liner Method 5: Using List Comprehension and set. Very concise. Useful for quick tasks with small inputs, but each list.count() call rescans the word list, so it is not recommended for large volumes of text.
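The efficiency remarks above can be checked on your own machine with timeit. The sketch below compares Method 2 with Method 5 on synthetic input; the data size, the repeat count, and the helper names with_counter and with_list_count are all assumptions chosen for illustration, and the timings will vary by machine and Python version:

```python
import timeit
from collections import Counter

# Synthetic input (an assumption for illustration): 200 distinct words, 50 copies each.
words = [f"word{i}" for i in range(200)] * 50

def with_counter():
    # Method 2: a single pass over the list.
    return Counter(words)

def with_list_count():
    # Method 5: one full scan of the list per distinct word.
    return {word: words.count(word) for word in set(words)}

# Both approaches should agree before their timings are compared.
same_result = dict(with_counter()) == with_list_count()
print(same_result)

counter_time = timeit.timeit(with_counter, number=5)
list_count_time = timeit.timeit(with_list_count, number=5)
print(f"Counter:      {counter_time:.4f} s")
print(f"list.count(): {list_count_time:.4f} s")
```

Increasing the number of distinct words should widen the gap, since the list.count() approach does one extra scan for each of them.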