Implementing Set Data Structures in Python Without the Built-in Set Class

💡 Problem Formulation: In Python, the standard library offers a convenient set class for representing a collection of distinct elements. However, understanding how sets works under the hood can be a valuable exercise for enhancing one’s programming skills. Suppose we want to implement our own version of a set. We need our program to allow us to store unique values and provide basic set functionalities such as addition and removal of elements, without relying on the Python’s built-in set class. Consider an input list [1, 2, 3, 2, 1] – our custom set implementation should be able to create a collection {1, 2, 3}, maintaining uniqueness of values.

Method 1: Using a Python Dictionary

Dictionaries in Python are implemented as hash tables, which means that they store unique keys much like sets. By using only keys (and ignoring values), we can mimic the behaviour of a set. This method is efficient because Python dictionaries provide constant-time complexity for addition and membership tests.

Here’s an example:

class MySet:
    def __init__(self, iterable=None):
        self.dict = {}
        if iterable:
            for item in iterable:
                self.dict[item] = None

    def add(self, item):
        self.dict[item] = None

    def remove(self, item):
        del self.dict[item]

    def __contains__(self, item):
        return item in self.dict

    def __iter__(self):
        return iter(self.dict)

my_set = MySet([1, 2, 3, 2, 1])
for val in my_set:
    print(val)

Output:

1
2
3

The code snippet above defines a class MySet which simulates a set using a dictionary. Elements of the set are the keys of the dictionary. Main set operations, such as adding an element and checking for membership, are facilitated by the dictionary’s methods. This implementation efficiently ensures the uniqueness of elements in the set.

Method 2: Using a List with Manual Checking

This method involves creating a set using a list and enforcing the uniqueness of elements by manually checking for their presence before adding them. This is less efficient than the dictionary method, particularly when the set is large, as it requires linear time to check each item’s presence.

Here’s an example:

class MySet:
    def __init__(self, iterable=None):
        self.list = []
        if iterable:
            for item in iterable:
                self.add(item)

    def add(self, item):
        if item not in self.list:
            self.list.append(item)

    def remove(self, item):
        self.list.remove(item)

    def __contains__(self, item):
        return item in self.list

    def __iter__(self):
        return iter(self.list)

my_set = MySet([1, 2, 3, 2, 1])
print(my_set.list)

Output:

[1, 2, 3]

The class MySet is defined with a list as its underlying structure. The add method checks whether an element is already present in the list before appending it, ensuring that all elements are unique. The drawback of this approach is that checking for the existence of an element has a linear time complexity, making it inefficient for large datasets.

Method 3: Using Sorted List and Binary Search

A refinement of the list method is to maintain the list in sorted order and use binary search to check for the presence of elements. This takes advantage of the fact that binary search operates in logarithmic time, speeding up the membership test significantly compared to linear search on an unsorted list.

Here’s an example:

import bisect

class MySet:
    def __init__(self, iterable=None):
        self.sorted_list = sorted(set(iterable)) if iterable else []

    def add(self, item):
        index = bisect.bisect_left(self.sorted_list, item)
        if index == len(self.sorted_list) or self.sorted_list[index] != item:
            self.sorted_list.insert(index, item)

    def remove(self, item):
        index = bisect.bisect_left(self.sorted_list, item)
        if index != len(self.sorted_list) and self.sorted_list[index] == item:
            del self.sorted_list[index]

    def __contains__(self, item):
        index = bisect.bisect_left(self.sorted_list, item)
        return index != len(self.sorted_list) and self.sorted_list[index] == item

    def __iter__(self):
        return iter(self.sorted_list)

my_set = MySet([1, 2, 3, 2, 1])
print(my_set.sorted_list)

Output:

[1, 2, 3]

In this example, the MySet class keeps its elements in a sorted list. Addition and checking for membership use the bisect module to perform binary searches, which is faster than linear search for membership testing. However, insertion and deletion require shifting elements, which can be costly for large data sets.

Method 4: Hash Table Implementation from Scratch

The most complex but instructive method involves building a hash table from scratch. This requires a deeper understanding of how hash tables work, including handling hash collisions using chaining or open addressing. Performance-wise, this is similar to the dictionary method but requires more code and careful consideration of potential edge cases and hash functions.

Here’s an example:

# Hash table implementation is beyond the scope
# of this small code snippet example. Please see below
# for a discussion about the method instead.

This method isn’t showcased with a full code example due to its complexity and length, which is beyond the scope of this article. However, it involves manually managing an array and handling hash collisions, which can occur when different elements produce the same hash code. This approach mimics the internal workings of a Python dictionary.

Bonus One-Liner Method 5: Using Frozenset as Immutable Set

As a bonus, if immutability is not a concern, Python’s frozenset type, which is an immutable version of set, can be used. It comes with all the performance benefits of a regular set but cannot be modified after creation.

Here’s an example:

immutable_set = frozenset([1, 2, 3, 2, 1])
print(immutable_set)

Output:

frozenset({1, 2, 3})

The frozenset type is used here to create a set that is frozen, or immutable, after its creation. This means it can provide a performance similar to a regular set but without the feature of being able to add or remove elements after its creation.

Summary/Discussion

Method 1: Dictionary-Based Set. Highly efficient for lookups and insertions. However, it may consume more memory because of the underlying hash table structure.
Method 2: List with Manual Checking. Simple to implement but not efficient for large data sets because it requires scanning the entire list to find duplicates.
Method 3: Sorted List with Binary Search. More efficient than Method 2 due to faster searching, but insertion and deletion can be slow because the list must remain sorted.
Method 4: Custom Hash Table. Offers the most insight into data structure implementation, resulting in performance similar to Python’s dictionary. However, it is complex and requires careful handling of hashing and collisions.
Method 5: Frozenset. Provides immutability with performance benefits similar to set but allows for no subsequent modification to the collection.