5 Best Ways to Sort Phrases by Their Frequencies in Python

💡 Problem Formulation: When working with text data, you often need to sort phrases or words by how often they occur. This is a common step in data analysis, where insights are drawn from the most frequently mentioned terms. Suppose we have the list ['apple', 'banana', 'apple', 'orange', 'banana', 'banana'] and want its distinct phrases sorted by frequency in descending order. The desired output is ['banana', 'apple', 'orange'].

Method 1: Using Collections Counter

This method uses the collections module from Python's standard library, which includes a Counter class designed for counting hashable objects. Counter is a dict subclass that stores elements as keys and their counts as values. It tallies the input in a single pass, which makes it suitable for large datasets.

Here’s an example:

from collections import Counter

def sort_phrases(phrases):
    phrase_counts = Counter(phrases)
    return sorted(phrase_counts, key=phrase_counts.get, reverse=True)

example_phrases = ['apple', 'banana', 'apple', 'orange', 'banana', 'banana']
print(sort_phrases(example_phrases))

Output:

['banana', 'apple', 'orange']

This snippet defines a function sort_phrases that takes a list of phrases and initializes a Counter object with it. The phrases are then sorted by their count in descending order, which is achieved by the reverse=True parameter in the sorted function.
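If the counts themselves are also needed, Counter provides a most_common() method that returns (phrase, count) pairs ordered from most to least frequent. The sketch below is an alternative to the sorted() call, not a replacement for it; the helper name phrases_with_counts is made up for illustration.

```python
from collections import Counter

def phrases_with_counts(phrases):
    # most_common() returns (phrase, count) pairs, highest count first.
    return Counter(phrases).most_common()

example_phrases = ['apple', 'banana', 'apple', 'orange', 'banana', 'banana']
pairs = phrases_with_counts(example_phrases)
print(pairs)                            # [('banana', 3), ('apple', 2), ('orange', 1)]
print([phrase for phrase, _ in pairs])  # ['banana', 'apple', 'orange']
```

Keeping the pairs around is handy when the frequencies need to be reported next to the phrases rather than discarded after sorting.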

Method 2: Using Defaultdict

The defaultdict class from the collections module is a dict subclass that takes a default_factory as its first argument. When a missing key is accessed, the factory is called to create that key's initial value. With defaultdict(int), missing keys start at 0, so counts can be incremented without first checking whether the key exists.

Here’s an example:

from collections import defaultdict

def sort_phrases(phrases):
    phrase_counts = defaultdict(int)
    for phrase in phrases:
        phrase_counts[phrase] += 1
    return sorted(phrase_counts, key=phrase_counts.get, reverse=True)

example_phrases = ['apple', 'banana', 'apple', 'orange', 'banana', 'banana']
print(sort_phrases(example_phrases))

Output:

['banana', 'apple', 'orange']

Here, we utilize defaultdict to avoid key errors and simplify the counting process. We then sort the dictionary’s keys based on their corresponding values in descending order. It’s slightly more code intensive than using Counter, but offers more control and insight into the counting process.
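One place the extra control can pay off is tie-breaking. Here is a minimal sketch, assuming phrases with equal counts should appear in alphabetical order (a requirement that is not part of the original problem). The function name sort_phrases_with_ties and the extra 'kiwi' entries are illustrative.

```python
from collections import defaultdict

def sort_phrases_with_ties(phrases):
    phrase_counts = defaultdict(int)
    for phrase in phrases:
        phrase_counts[phrase] += 1
    # Negating the count puts high counts first; the phrase itself
    # then breaks ties alphabetically.
    return sorted(phrase_counts, key=lambda p: (-phrase_counts[p], p))

tied_phrases = ['kiwi', 'apple', 'banana', 'apple', 'kiwi', 'banana', 'banana']
print(sort_phrases_with_ties(tied_phrases))  # ['banana', 'apple', 'kiwi']
```

Sorting on a tuple key avoids reverse=True, which would also reverse the alphabetical order of the tied phrases.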

Method 3: Using a Lambda Function and Sorted

Without importing any additional modules, you can still sort phrases by their frequencies in Python. This method uses the sorted function along with a lambda function to sort the phrases. The lambda function accesses the count of each phrase as the key for sorting.

Here’s an example:

def sort_phrases(phrases):
    return sorted(set(phrases), key=lambda x: phrases.count(x), reverse=True)

example_phrases = ['apple', 'banana', 'apple', 'orange', 'banana', 'banana']
print(sort_phrases(example_phrases))

Output:

['banana', 'apple', 'orange']

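This code uses the built-in sorted function to sort the unique phrases by their frequency in the original list, obtained with the list's count() method. The lambda supplies the sort key, and reverse=True puts the most frequent phrases first. Two caveats apply: count() rescans the whole list for every unique phrase, so the work grows with the number of unique phrases times the length of the list, and because a set is unordered, phrases with equal counts can come out in any order.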
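One way to make the order of tied phrases predictable, still without imports, is to swap set for dict.fromkeys, which removes duplicates while keeping first-seen order (dicts preserve insertion order in Python 3.7+). This is a sketch of a variation, not part of the original method; the function name sort_phrases_stable and the 'kiwi' entries are illustrative.

```python
def sort_phrases_stable(phrases):
    # dict.fromkeys drops duplicates but keeps first-seen order.
    unique_phrases = list(dict.fromkeys(phrases))
    # sorted() is stable, even with reverse=True, so tied phrases
    # stay in first-seen order.
    return sorted(unique_phrases, key=lambda x: phrases.count(x), reverse=True)

tied_phrases = ['kiwi', 'apple', 'banana', 'apple', 'kiwi', 'banana', 'banana']
print(sort_phrases_stable(tied_phrases))  # ['banana', 'kiwi', 'apple']
```

This changes only the ordering of ties; count() is still called once per unique phrase, so the cost on large inputs is the same.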

Method 4: Using Pandas

For individuals working within a data science context, Pandas provides an efficient and easy-to-code method for sorting phrases. The library is built on top of NumPy and is optimized for performance, particularly with large data sets.

Here’s an example:

import pandas as pd

def sort_phrases(phrases):
    return pd.Series(phrases).value_counts().index.tolist()

example_phrases = ['apple', 'banana', 'apple', 'orange', 'banana', 'banana']
print(sort_phrases(example_phrases))

Output:

['banana', 'apple', 'orange']

This snippet first converts the list of phrases into a Pandas Series object. Then, by calling value_counts(), it gets the frequency of each phrase and sorts them in descending order. Finally, the indices which represent the sorted phrases are returned as a list.

Bonus One-Liner Method 5: Lambda and Sorted Short Form

This one-liner is a condensed form of Method 3. Instead of wrapping the call in a lambda, it passes the list's bound count method directly as the sort key, which makes the code shorter but somewhat more cryptic. It suits those who prefer concise code.

Here’s an example:

example_phrases = ['apple', 'banana', 'apple', 'orange', 'banana', 'banana']
print(sorted(set(example_phrases), key=example_phrases.count, reverse=True))

Output:

['banana', 'apple', 'orange']

This approach uses the same logic as Method 3 but in an abbreviated form. We sort the unique phrases by the count directly within the sorted function without defining a separate function.

Summary/Discussion

  • Method 1: Collections Counter. Fast and readable, and it counts in a single pass, so it suits large datasets. Requires an import, though only from the standard library.
  • Method 2: Defaultdict. Part of the standard library and gives explicit control over the counting loop. Slightly more verbose than Counter.
  • Method 3: Lambda Function and Sorted. Uses built-in functions only, no imports required. Readability may suffer for those unfamiliar with lambda functions, and calling count() once per unique phrase makes it slow on large datasets.
  • Method 4: Using Pandas. Clean syntax and good performance on large datasets. Requires an external library, which may not be available in all environments.
  • Bonus Method 5: Lambda and Sorted Short Form. The most concise of the five. It shares Method 3's performance cost and may be the least readable for those not used to passing methods as arguments.
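
To check the performance remarks on your own machine, a rough timing sketch such as the following can be used. It compares the Counter approach with the list.count() approach on synthetic data. The data size, the phrase_N labels, and the function names are all assumptions made for illustration, and timings will vary by machine and Python version.

```python
import random
import timeit
from collections import Counter

def sort_with_counter(phrases):
    # One pass to count, then sort the unique phrases.
    counts = Counter(phrases)
    return sorted(counts, key=counts.get, reverse=True)

def sort_with_list_count(phrases):
    # Rescans the full list once per unique phrase.
    return sorted(set(phrases), key=phrases.count, reverse=True)

# Synthetic data: 5,000 phrases drawn from 200 made-up labels.
random.seed(0)
vocabulary = [f'phrase_{i}' for i in range(200)]
phrases = [random.choice(vocabulary) for _ in range(5000)]

counter_time = timeit.timeit(lambda: sort_with_counter(phrases), number=5)
count_time = timeit.timeit(lambda: sort_with_list_count(phrases), number=5)
print(f'Counter:      {counter_time:.4f} s')
print(f'list.count(): {count_time:.4f} s')
```

No figures are quoted here because they depend on the environment. Since list.count() does work proportional to the list length for each unique phrase, the gap between the two should widen as the number of unique phrases grows.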