5 Best Ways to Design a HashSet in Python

πŸ’‘ Problem Formulation: How can one implement a custom HashSet in Python, a data structure that stores unique elements, similar to Java’s HashSet? For instance, if the input is a list of items ['apple', 'banana', 'apple', 'cherry'], the desired output would be a collection containing only ‘apple’, ‘banana’, and ‘cherry’ with no duplicates.

Method 1: Using the built-in set data structure

The simplest method to implement a HashSet in Python is by using the built-in set data type. Python’s set is a mutable collection that is unordered, iterable, and does not allow duplicate elements. This can serve as a direct analog to a HashSet.

Here’s an example:

my_hashset = set(['apple', 'banana', 'apple', 'cherry'])
print(my_hashset)

Output of the code snippet:

{'banana', 'cherry', 'apple'}

This code snippet shows the construction of a set in Python, which inherently functions as a HashSet by automatically removing the duplicate entry of ‘apple’.

Method 2: Utilizing a dictionary for constant-time operations

Another way to implement a HashSet is by using a dictionary. This approach leverages the O(1) average case complexity for lookups, insertions, and deletions in a Python dictionary by using the keys as the set elements, ignoring the values.

Here’s an example:

class MyHashSet:
    def __init__(self):
        self.storage = dict()
        
    def add(self, key):
        self.storage[key] = True

    def contains(self, key):
        return key in self.storage

my_hashset = MyHashSet()
my_hashset.add('apple')
print(my_hashset.contains('apple'))
print(my_hashset.contains('banana'))

Output of the code snippet:

True
False

In this snippet, we designed a simple HashSet class that uses a Python dictionary for storage. The add() method introduces a new element, and the contains() method checks for its existence.

Method 3: Using a list with manual checks for uniqueness

You can create a HashSet by manually checking for uniqueness when adding elements to a list. This is less efficient since it requires iterating over the list to ensure an element is not already included before adding it.

Here’s an example:

class MyHashSet:
    def __init__(self):
        self.storage = []
        
    def add(self, key):
        if key not in self.storage:
            self.storage.append(key)

my_hashset = MyHashSet()
my_hashset.add('apple')
my_hashset.add('cherry')
print(my_hashset.storage)

Output of the code snippet:

['apple', 'cherry']

This example defines a class with a list to store elements. The add() method includes a new item only if it does not already exist in the storage list.

Method 4: Leveraging hashing with a fixed-size bucket list

For a more traditional HashSet implementation, one can create an array of fixed size, each element of which is a list – a bucket to store values hashed to the same index. This implementation resembles the actual underlying mechanism of a HashSet.

Here’s an example:

class MyHashSet:
    def __init__(self, size=10):
        self.size = size
        self.buckets = [[] for _ in range(size)]
    
    def compute_hash(self, key):
        return hash(key) % self.size
    
    def add(self, key):
        hash_index = self.compute_hash(key)
        if key not in self.buckets[hash_index]:
            self.buckets[hash_index].append(key)

my_hashset = MyHashSet()
my_hashset.add('apple')
my_hashset.add('banana')
print(my_hashset.buckets)

Output of the code snippet:

[[], [], [], ['apple'], [], ['banana'], [], [], [], []]

This code presents a hashset using a list of buckets, where each bucket is a standard Python list. It uses a hashing function to allocate each value to a bucket based on its hash code.

Bonus One-Liner Method 5: Using collections.Counter

The collections.Counter class from Python’s standard library can be utilized to create a HashSet. It provides a quick one-liner for cases where you’d like to see if an element has been added to the set at least once, effectively deduplicating the input.

Here’s an example:

from collections import Counter
my_hashset = list(Counter(['apple', 'banana', 'apple', 'cherry']).keys())
print(my_hashset)

Output of the code snippet:

['apple', 'banana', 'cherry']

This line of Python code constructs a Counter object that counts the occurrences of each element and then takes the keys, which represent the unique elements, thus simulating a HashSet.

Summary/Discussion

  • Method 1: Built-in set. Highly efficient and Pythonic. Limited customization.
  • Method 2: Dictionary-based HashSet. Fast operations. Slightly more memory usage due to key-value storage.
  • Method 3: List with manual checks. Simple but inefficient for large datasets due to O(n) search time.
  • Method 4: Hashing with buckets. Mimics real HashSet, good for learning internals. Complex and sized buckets could lead to inefficiencies.
  • Method 5: Counter keys as a HashSet. Quick one-liner with an emphasis on readability. Not a true set and may be less efficient than the built-in set.