5 Best Ways to Convert a String to a Set of Words in Python

💡 Problem Formulation: When working with textual data in Python, it’s common to encounter a scenario where you have a string that you wish to convert into a set of unique words. This transformation is crucial for tasks such as word frequency analysis, text processing, and search operations. For instance, given the input string “hello hello world”, the desired output is a set of words {‘hello’, ‘world’}.

Method 1: Using the split() Method and set() Constructor

This approach calls the string’s split() method to divide the string into a list of words on whitespace, then passes that list to the set() constructor to keep only the unique words.

Here’s an example:

input_string = "hello hello world"
words_set = set(input_string.split())
print(words_set)

Output:

{'hello', 'world'}

This snippet first splits the string “hello hello world” into a list of words, [‘hello’, ‘hello’, ‘world’], using the default whitespace delimiter. It then converts this list into a set, which inherently removes any duplicate elements. Because sets are unordered, the printed order of the elements may vary.
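As a quick check of what whitespace splitting does not do, the sketch below runs the same call on a string with punctuation and mixed case. The sample string and variable names are illustrative only.

```python
# Whitespace splitting leaves punctuation attached and preserves case.
sample = "Hello, hello world"
tokens = sample.split()        # ['Hello,', 'hello', 'world']
words_set = set(tokens)

# 'Hello,' and 'hello' count as two different words.
print(len(words_set))          # 3
```

If punctuation or capitalization matters for your data, one of the later methods is likely a better fit.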

Method 2: Using Regular Expressions

Regular expressions can be used to accommodate more complex word definitions that include handling punctuation and whitespace correctly. The re module’s findall() method can be particularly useful here.

Here’s an example:

import re
input_string = "hello, world! Hello?"
words_set = set(re.findall(r'\b\w+\b', input_string))
print(words_set)

Output:

{'Hello', 'hello', 'world'}

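The regular expression pattern \b\w+\b matches runs of word characters (letters, digits, and underscores) delimited by word boundaries, so surrounding punctuation is left out. The findall() function returns all the matches as a list, which is then turned into a set, removing duplicates. Note that this approach is case-sensitive, which is why both ‘Hello’ and ‘hello’ appear in the output.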
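One possible way to make the regex approach case-insensitive is to lowercase the string before matching. The sketch below also shows two side effects of \w that are easy to overlook; the second sample string is made up for illustration.

```python
import re

input_string = "hello, world! Hello?"

# Lowercase first so 'Hello' and 'hello' collapse into one entry.
words_set = set(re.findall(r"\b\w+\b", input_string.lower()))
print(sorted(words_set))       # ['hello', 'world']

# \w also matches digits and underscores, and an apostrophe splits a word in two.
tokens = set(re.findall(r"\b\w+\b", "don't use snake_case 42"))
print(sorted(tokens))          # ['42', 'don', 'snake_case', 't', 'use']
```

Whether splitting "don't" into two tokens is acceptable depends on the task; a custom character class would be needed to keep contractions together.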

Method 3: Using String Methods with Comprehension

This method uses the string methods strip() and lower() inside a set comprehension to remove leading and trailing punctuation and make all words lowercase as the set is built.

Here’s an example:

import string
input_string = "Hello, world! Hello?"
words_set = {word.strip(string.punctuation).lower() for word in input_string.split()}
print(words_set)

Output:

{'hello', 'world'}

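This code uses a set comprehension to iterate over each word produced by split(), strips leading and trailing punctuation, converts the result to lowercase, and gathers the unique words in a set. The outcome is a case-insensitive collection without surrounding punctuation. Punctuation inside a word, such as the apostrophe in “don’t”, is left in place.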
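One edge case worth knowing: a token made up entirely of punctuation, such as a standalone dash, strips down to an empty string and would end up in the set. A minimal sketch of one way to guard against that follows; the helper name normalized_words is hypothetical.

```python
import string

def normalized_words(text):
    """Return lowercase words with surrounding punctuation removed."""
    stripped = (token.strip(string.punctuation).lower() for token in text.split())
    # Discard tokens that were nothing but punctuation (they strip down to '').
    return {word for word in stripped if word}

result = normalized_words("Hello - world! Don't stop...")
print(sorted(result))
# ["don't", 'hello', 'stop', 'world']
```

The generator expression keeps the stripping and the filtering as two separate, readable steps without building an intermediate list.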

Method 4: Using String Methods and filter()

Combining Python’s string methods with the filter() function lets us exclude the empty strings that appear when a string is split on an explicit delimiter, such as a single space, and the words are separated by several spaces. Note that split() called with no arguments already collapses runs of whitespace, so the filter only matters when a delimiter is passed.

Here’s an example:

input_string = "hello   hello world"
words_set = set(filter(None, input_string.split(" ")))
print(words_set)

Output:

{'hello', 'world'}

Splitting on " " produces the list ['hello', '', '', 'hello', 'world']. filter(None, ...) discards the empty strings, and set() then removes the duplicate words.
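The same filtering pattern applies to any explicit delimiter, not just spaces. A short sketch, assuming a made-up comma-separated input with a doubled comma and a trailing comma:

```python
# Hypothetical comma-separated input with a doubled and a trailing comma.
raw = "red,,green,red,"

pieces = raw.split(",")
print(pieces)                      # ['red', '', 'green', 'red', '']

# filter(None, ...) keeps only truthy items, so the empty strings are dropped.
words_set = set(filter(None, pieces))
print(sorted(words_set))           # ['green', 'red']
```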

Bonus One-Liner Method 5: Using a Function

For convenience and reusability, one can encapsulate the conversion in a one-line function built from split() and the set() constructor.

Here’s an example:

convert_to_word_set = lambda s: set(s.split())
input_string = "hello   hello world"
print(convert_to_word_set(input_string))

Output:

{'hello', 'world'}

This lambda function named convert_to_word_set takes a string s and returns a set of words generated by splitting the input string. It’s a clean, concise way to perform this conversion repeatedly.
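PEP 8 recommends a def statement over binding a lambda to a name, mainly because the function then carries its own name in tracebacks. A possible def-based equivalent is sketched below; the lowercase parameter is a hypothetical extension, not part of the original one-liner.

```python
def convert_to_word_set(text, lowercase=False):
    """Split text on whitespace and return the unique words as a set."""
    if lowercase:
        text = text.lower()
    return set(text.split())

print(sorted(convert_to_word_set("hello   hello world")))           # ['hello', 'world']
print(sorted(convert_to_word_set("Hello hello", lowercase=True)))   # ['hello']
```

With lowercase left at its default, the function behaves the same as the lambda.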

Summary/Discussion

Method 1: Using split() and set(). Strengths: Simple, straightforward. Weaknesses: Does not handle punctuation or differences in case.
Method 2: Using Regular Expressions. Strengths: More sophisticated word matching, handles complex scenarios. Weaknesses: May require knowledge of regex patterns, case-sensitive unless explicitly handled.
Method 3: Using String Methods with Comprehension. Strengths: Handles surrounding punctuation and case conversion. Weaknesses: More complex one-liner, slight performance hit due to strip() on every word, and punctuation inside a word is kept.
Method 4: Using String Methods and filter(). Strengths: Removes the empty strings produced when splitting on an explicit delimiter. Weaknesses: Redundant when split() is called with no arguments; does not address punctuation or case-sensitivity.
Method 5: Using a One-Liner Function. Strengths: Reusable and concise. Weaknesses: This default implementation is basic, covering whitespace splitting but not punctuation or case.
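The trade-offs listed above can be checked side by side. The sketch below runs three of the approaches on the same sample string, which is chosen purely for illustration.

```python
import re
import string

text = "Hello, world! hello"

by_split = set(text.split())
by_regex = set(re.findall(r"\b\w+\b", text))
by_strip = {w.strip(string.punctuation).lower() for w in text.split()}

print(sorted(by_split))   # ['Hello,', 'hello', 'world!']
print(sorted(by_regex))   # ['Hello', 'hello', 'world']
print(sorted(by_strip))   # ['hello', 'world']
```

Only the third variant treats “Hello,” and “hello” as the same word, because it both strips punctuation and lowercases.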