Sets in Python – with Harry Potter Examples

When I worked on my recent project using Python’s built-in set data structure, I wondered “how exactly does the set data structure work in Python?”. I searched the web for set tutorials but not a single article could answer all of my questions!

So I wrote this 4000-word article that teaches you everything you need to know about sets in Python. I have answered any question I could think of
– and I even searched the web for more questions other people have about sets. In the unlikely case that you still have unanswered questions about sets in Python, feel free to write me and I will answer them for you!

To make it a bit more fun, I have used Harry Potter examples throughout the article.

What is a set in Python?

The set data structure is one of the basic collection data types in Python and many other programming languages.

In fact, there are even popular languages for distributed computing that focus almost exclusively on set operations (like MapReduce or Apache Spark) as primitives of the programming language.

Definition: A set is an unordered collection of unique elements.

Let’s break this down.

(1) Collection: A set is a collection of elements like a list or a tuple. The collection consists of either primitive elements (e.g. integers, floats, strings), or complex elements (e.g. objects, tuples). However, all data types must be hashable.

What is a hashable data type?

Here is the relevant excerpt of the documentation:

“An object is hashable if it has a hash value which never changes during its lifetime (it needs a __hash__() method), and can be compared to other objects (it needs an __eq__() or __cmp__() method).”

The set data structure heavily relies on the hash function to implement the specification (this would be a nice topic for tomorrow).

Let’s have a look at an example (we stay with the Harry Potter examples because this is at the top of my mind — reading it every day with my daughter):

hero = "Harry"
guide = "Dumbledore"
enemy = "Lord V."

print(hash(hero))
# 6175908009919104006

print(hash(guide))
# -5197671124693729851

## Puzzle: can we create a set of strings?

characters = {hero, guide, enemy}
print(characters)
# {'Lord V.', 'Dumbledore', 'Harry'}


## Puzzle: can we create a set of lists?
team_1 = [hero, guide]
team_2 = [enemy]

teams = {team_1, team_2}
# TypeError: unhashable type: 'list'

As you can see, we can create a set of strings because strings are hashable. But we cannot create a set of lists because lists are unhashable.

Why are lists unhashable?

Because they are mutable: you can change a list by appending or removing elements. If you change the list data type, the hash value changes (it is calculated based on the content of the list). This directly violates the above definition (“hash value […] never changes during its lifetime”).

Key takeaway: mutable data types are not hashable. Therefore, you cannot use them in sets.

(2) Unordered: Unlike lists, sets are unordered because there is no fixed order of the elements. In other words, regardless of the order you put stuff in the set, you can never be sure in which order the set stores these elements.

Here is an example from the above code:

characters = {hero, guide, enemy}
print(characters)
# {'Lord V.', 'Dumbledore', 'Harry'}

You put in the hero first, but the interpreter prints the enemy first (the Python interpreter is on the dark side, obviously).

(3) Unique: All elements in the set are unique. Each pair of values (x,y) in the set produces a different pair of hash values (hash(x)!=hash(y)). Hence, each pair of elements x and y in the set are different.

This means that we cannot create an army of Harry Potter clones to fight Lord V:

clone_army = {hero, hero, hero, hero, hero, enemy}
print(clone_army)
# {'Lord V.', 'Harry'}

No matter how often you put the same value into the same set, the set stores only one instance of this value. An extension of the normal set data structure is the “multiset” data structure where a multiset can store multiple instances of the same value.

The Python standard library also comes with a multiset package.

How to create a set?

There are three basic alternatives to create a set:

  1. use the constructor set([1,2,3]) and pass an iterable of the elements;
  2. use the bracket notation ‘{1,2,3}’ with the elements inside, separated by a comma; or
  3. create an empty set and add the elements manually.

Here is an example of these three options:

s1 = {"Harry", "Ron", "Hermine"}
print(s1)
# {'Harry', 'Ron', 'Hermine'}

s2 = set(["Harry", "Ron", "Hermine"])
print(s2)
# {'Harry', 'Ron', 'Hermine'}

s3 = set()
s3.add("Harry")
s3.add("Ron")
s3.add("Hermine")
print(s3)
# {'Harry', 'Ron', 'Hermine'}

However, you cannot mix these ways to create a set! For example, you cannot pass the individual elements in the constructor set().

# Wrong!
s4 = set("Harry", "Ron", "Hermine")
# TypeError: set expected at most 1 arguments, got 3

One question that is often asked is the following:

Can a set have multiple data types?

Yes, absolutely! Here is what happens if you create a set with integers and strings:

s = {1, 2, 3, "Harry", "Ron"}
print(s)
# {1, 2, 3, 'Ron', 'Harry'}

As you can see, the Python interpreter does not complain when you throw different data types in the same set. You have to be eviler than that!

What are real-world examples of sets?

Sets are everywhere in coding. Every single major programming language comes with built-in set functionality. The set data structure is one of the most important data structures. You will use it all the time!

For example, you write a web crawler that explores web pages and stores their URL in a variable ‘visited’. Now, there are two ways of implementing this: First, use a list data structure and append the URL if it is not in the list. Second, use a set data structure and add the URL if it is not in the set. The second one is way faster! Why? Because sets are designed to be very fast to check membership of individual elements (e.g. is the URL already in the set?). Read the last part of this article to learn why!

Another example is in email marketing. Suppose you have a huge database of email subscribers, stored as a list. You want to find the duplicate email addresses. Easy: convert the list to a set and back to the list – and voilà – the duplicates are gone! Why? Because sets are duplicate-free. By the way, this is also one of the fastest ways to remove duplicates from the list.

[Overview] What are the most important set operations in Python?

Let’s start with a few examples first. Take your time to study these examples carefully.

Gryffindors = {"Harry", "Ron", "Hermine", "Neville",
               "Seamus", "Ginny", "Fred", "George"}

## Set Conversion
Weasleys = set(["Ron", "Ginny", "Fred"])
# {'Ron', 'Fred', 'Ginny'}

## Add Element
Weasleys.add("George")
# {'Ron', 'Fred', 'Ginny', 'George'}

## Remove Element
Gryffindors.remove("Neville")
# {'Ron', 'Hermine', 'George', 'Harry', 'Ginny', 'Seamus', 'Fred'}

## Membership
'Ginny' in Gryffindors
# True

## Size
len(Weasleys)
# 4

## Intersection
Weasleys & Gryffindors
# {'Fred', 'George', 'Ron', 'Ginny'}

## Union
Weasleys | Gryffindors
# {'Ron', 'Hermine', 'George', 'Harry', 'Ginny', 'Seamus', 'Fred'}

## Difference
Gryffindors - Weasleys
# {'Harry', 'Hermine', 'Seamus'}

## Symmetric Difference
Gryffindors ^ {'Harry', 'Ginny', 'Malfoy'}
# {'Ron', 'Fred', 'George', 'Malfoy', 'Hermine', 'Seamus'}

## Set Disjoint
Gryffindors.isdisjoint({'Malfoy', 'Grabbe', 'Goyle'})
# True

## Subset
Weasleys.issubset(Gryffindors)
# True

## Superset
Gryffindors.issuperset(Weasleys)

## Pop
print(Gryffindors.pop())
# 'Seamus'
print(Gryffindors)
# {'Ron', 'Fred', 'Hermine', 'Harry', 'Seamus', 'George'}

In the next few paragraphs, I give you detailed examples of the most important set operations (see docs).

How does set conversion work in Python

Sets are collections like tuples or lists. That’s why you can easily convert sets in lists or tuples. Here is how:

# convert list to set:
s = set([1,2,3])
print(s)
# {1, 2, 3}

# convert tuple to set:
s = set((1,2,3))
print(s)
# {1, 2, 3}

Note that the Python interpreter uses the bracket notation to represent a set on your console.

How to add an element to a set in Python?

Use the set function s.add(x) to add element x to the set s. Here is an example:

# Add Operator
s = set()
s.add("Harry")
s.add("Ron")
s.add("Hermine")
print(s)
# {'Harry', 'Ron', 'Hermine'}

How to remove an element from a set in Python?

Use the set function s.remove(x) to remove element x from set s. Note that because the set is duplicate-free, it is impossible that element x still exists in the set after calling remove(). In this way, the semantics are different than for Python lists where remove() only removes the first occurrence of the element in the list.

Here is an example:

# Remove Operator
s = set()
s.add("Harry")
s.add("Ron")
s.add("Hermine")
print(s)
# {'Harry', 'Ron', 'Hermine'}

s.remove("Ron")
s.remove("Harry")
print(s)
# {'Hermine'}

How to check whether an element x is in the set s in Python (Membership)?

The membership operator “x in s” checks whether set s contains element x. It returns True if this is the case. Here is an example:

# Membership Operator
s = {"Harry", "Ron", "Hermine"}
x = "Ginny"
print(x in s)
# False

How to determine the number of elements in a set x in Python?

Simply use the built-in len(s) function to get the number of elements in the set s.

Here is an example:

# Size Operator
s = {"Harry", "Ron", "Hermine"}
print(len(s))
# 3

How to intersect two sets in Python?

The set intersection operator creates a new set which contains all elements that are in both sets s1 and s2 — but not the ones that are only in one set. This means that the new set will never be larger than any of the sets s1 or s2.

There are two operators in Python to intersect two sets s1 and s2: the function s1.intersect(s2) or the operator s1 & s2.

Maybe you remember Venn diagrams from school? Here is an example of set intersection:

As you can see, the new set contains all elements that are in both sets s1 and s2.

Here is an example in code:

# Intersection
good = {"Harry", "Ron", "Snape"}
bad = {"Lord V", "Quirrell", "Snape"}

print(good & bad)
# {'Snape'}

print(good.intersection(bad))
# {'Snape'}

What is the union of two sets?

The set union operator creates a new set which contains all elements that are in either set s1 or s2. This means that the new set will never be smaller than any of the sets s1 or s2.

There are two operators in Python to calculate the union of two sets s1 and s2: the function s1.union(s2) or the operator s1 | s2.

# Union
good = {"Harry", "Ron", "Snape"}
bad = {"Lord V", "Quirrell", "Snape"}

print(good | bad)
# {'Lord V', 'Quirrell', 'Snape', 'Harry', 'Ron'}

print(good.union(bad))
# {'Lord V', 'Quirrell', 'Snape', 'Harry', 'Ron'}

What is the difference between two sets?

The set difference operator creates a new set which contains all elements that are in set s1 but not in s2. This means that the new set will never be larger than set s1.

There are two operators in Python to calculate the difference of two sets s1 and s2: the function s1.difference(s2) or the operator s1 – s2.

# Difference
good = {"Harry", "Ron", "Snape"}
bad = {"Lord V", "Quirrell", "Snape"}

print(good - bad)
# {'Harry', 'Ron'}

print(good.difference(bad))
# {'Harry', 'Ron'}

What is the symmetric difference of two sets?

The symmetric set difference operator creates a new set which contains all elements that are in either set s1 or in s2 but not in the intersection of s1 and s2.

There are two operators in Python to calculate the difference of two sets s1 and s2: the function s1.difference(s2) or the operator s1 – s2.

# Symmetric Difference
good = {"Harry", "Ron", "Snape"}
bad = {"Lord V", "Quirrell", "Snape"}

print(good ^ bad)
# {'Quirrell', 'Ron', 'Harry', 'Lord V'}

print(good.symmetric_difference(bad))
# {'Quirrell', 'Ron', 'Harry', 'Lord V'}

print(bad.symmetric_difference(good))
# {'Quirrell', 'Ron', 'Lord V', 'Harry'}

What is the set disjoint operator in Python?

The set disjoint operation checks for two given sets s1 and s2 whether they have no elements in common.

# Set Disjoint Operation
good = {"Harry", "Ron", "Snape"}
bad = {"Lord V", "Quirrell", "Snape"}

print(good.isdisjoint(bad))
# False

print(bad.isdisjoint(good))
# False

bad.remove("Snape")
print(good.isdisjoint("Snape"))
# True

As you can see, the good and the bad in Harry Potter are not disjoint because “Snape” is both – good AND bad. However, after removing “Snape” from the set of bad wizards (SPOILER ALERT), they become disjoint again.

How does the subset operator work in Python?

The operation s1.issubset(s2) in Python checks whether all elements in set s1 are also elements in set s2. Of course, set s2 can have much more elements that are not in set s1.

# Set Subset Operation
Gryffindors = {"Seamus", "Fred", "George", "Harry", "Ginny", "Hermine"}
Weasleys = {"Fred", "George", "Ginny"}

print(Weasleys.issubset(Gryffindors))
# True

print(Gryffindors.issubset(Weasleys))
# False

While the set of all Weasleys is a subset of the set of all Gryffindors, the other way does not hold — there are (still) Gryffindors that are not Weasleys (e.g. “Harry” and “Hermine”).

How does the superset operator work in Python?

The operation s1.issuperset(s2) in Python is analogous to the previous operation issubset(). But in contrast to that, it checks whether all elements in set s2 are also elements in set s1. Of course, set s1 can have much more elements that are not in set s2.

# Set Superset Operation
Gryffindors = {"Seamus", "Fred", "George", "Harry", "Ginny", "Hermine"}
Weasleys = {"Fred", "George", "Ginny"}

print(Weasleys.issuperset(Gryffindors))
# False

print(Gryffindors.issuperset(Weasleys))
# True

Clearly, the set of all Weasleys is NOT a superset of the set of all Gryffindors (e.g. “Harry” is not a Weasley). However, the set of all Gryffindors is a superset of the set of all Weasleys.

How to pop a set element in Python?

The s.pop() operation removes an arbitrary element x from the set s. It returns this element x. The pop() operation is often useful because you cannot easily access an arbitrary element of a set — you cannot use indices on Python sets because sets are unordered.

Here is an example:

# Set Pop Operation
teachers = {"Trelawney", "Snape", "Hagrid"}
leaves_hogwarts = teachers.pop()
print(teachers)
# e.g. {'Snape', 'Hagrid'}

Remember when Prof. Umbridge controlled each and every teacher in Hogwarts? She quickly found out that Prof. Trelawney is not a suitable teacher, so she kicked her out from the set of all teachers. Essentially, she performed the pop() operation (although selecting an element from the set was less random).

How does set comprehension work?

Set comprehension is a concise way of creating sets. Say you want to filter out all customers from your database who earn more than $1,000,000. This is what a newbie not knowing set comprehension would do:

# (name, $-income)
customers = [("John", 240000),
            ("Alice", 120000),
            ("Ann", 1100000),
            ("Zach", 44000)]
# your high-value customers earning >$1M
whales = set()
for customer, income in customers:
   if income>1000000:
       whales.add(customer)
print(whales)
# {'Ann'}

This snippet needs four lines just to create a set of high-value customers (whales)!

If you do that in your public Python code base, be prepared to get busted for “not writing Pythonic code”. 😉

Instead, a much better way of doing the same thing is to use set comprehension:

whales = {x for x,y in customers if y>1000000}
print(whales)
# {'Ann'}

Beautiful, isn’t it?

Set comprehension is dead simple when you know the formula I will show you in a moment. So why are people confused about how to use set comprehension? Because they never looked up the most important statement on list comprehension (which is similar to set comprehension) in the Python documentation. It’s this:

“A list comprehension consists of brackets containing an expression followed by a for clause, then zero or more for or if clauses. The result will be a new list resulting from evaluating the expression in the context of the for and if clauses which follow it.” (source)

In other words, here is the formula for set comprehension.

Formula: Set comprehension consists of two parts.

‘{‘ + expression + context + ‘}’

The first part is the expression. In the example above it was the variable x. But you can also use a more complex expression such as x.upper(). Use any variable in your expression that you have defined in the context within a loop statement. See this example:

whales = {x.upper() for x,y in customers if y>1000000}
print(whales)
# {'ANN'}

The second part is the context. The context consists of an arbitrary number of for and if clauses. The single goal of the context is to define (or restrict) the sequence of elements on which we want to apply the expression. That’s why you sometimes see complex restrictions such as this:

small_fishes = {x + str(y) for x,y in customers if y<1000000 if x!='John'}
# (John is not a small fish...)
print(small_fishes)
# {'Zach44000', 'Alice120000'}

For more details about set comprehension, read this article.

Python Sets vs Lists – when to use sets and when lists in Python?

As a master coder, you always select the best data structure for your problem at hand.

If you choose the right data structure, your solution will be elegant and will run smoothly even for large input sizes. At the same time, your source code will be concise and readable.

That’s the gold standard.

But if you choose the wrong data structure for your problem at hand, you will waste lots of time writing the code. As soon as you believe that you have solved the problem, you will realize that your code base is full of bugs. And it will be very inefficient and not capable of running on large input sizes.

Let’s have a look at a practical example: The problem of removing duplicates from a collection.

dupes = [1,4,3,2,3,3,2,1]

# Bad solution: wrong data structure (list)
lst_tmp = [ ]
for element in dupes:
    if element not in lst_tmp:
        lst_tmp.append(element)
print(lst_tmp)
# [1, 4, 3, 2]


# Good solution: right data structure (set)
print(list(set(dupes)))
# [1, 2, 3, 4]

You use the set data structure here because of its specific characteristics: a set is an unordered collection of unique elements. Bingo! That’s what we need.

On the other hand, the list data structure does not fit that good to the problem: it allows duplicates and cares about the order of the elements (which we don’t).

Why is the list inefficient in this example? Because checking membership is very slow for lists — you need to traverse the whole list to see whether an element is in the list or not.

So how do you know when to use lists and when to use sets in Python?

Just remember the following simplified table.

Instead of using the more complex Big-O notation, I just tell you whether the operation is FAST or SLOW (for the pros: FAST is constant runtime complexity, SLOW is linear runtime complexity). If you want to dive deeper in the runtime complexities of different set operations, please see the second more comprehensive table below.

You have to know this table by heart if you have any ambition in coding. Spend time now and master it thoroughly.

# Operator          
List    Set
Add element FAST FAST
Remove element SLOW FAST
Membership (“in”) SLOWFAST
Access i-th element  FAST
Union SLOW
Intersection SLOW

In plain English: use sets if you only need to test for membership, use lists if the order of elements is important.

The reason why sets are superior in performance is that they do not provide such a strong “service” – they ignore the concrete order of the elements.

How is the set data structure implemented in Python? And why is set membership faster than list membership?

We have already established:

“List membership is slower than set membership because the former checks every element while the latter uses only one lookup.”

Do you really understand why?

If I address this topic in my email Python course (it’s free, come join me 😉, the following question regularly comes up:

“I still don’t understand why set membership checks should be faster. Why is it just one lookup for a set?”

I believe that many advanced coders would have difficulties explaining WHY set membership is faster. Pause reading for a moment and try to explain it to yourself!

So, how do sets work in Python?

Sets are implemented by using a hash table as an underlying data structure. A hash table is a data structure that maps keys to values (like a dic in Python). Here is an example of a hash table storing the age of random “Harry Potter” characters:

Key --> Value
(Name) --> (Age)
----------------
"Harry" --> 13
"Hermine" --> 13
"Dumbledore" --> 398
"Lord V" --> 72

Before moving on, how does Python use a hash table to implement a set? Simply by using “dummy values”. Here is how Python, conceptually, implements the set {“Harry”, “Hermine”, “Dumbledore”, “Lord V”}:

"Harry" --> None
"Hermine" --> None
"Dumbledore" --> None
"Lord V" --> None

Imagine you would have to implement the set data structure based on the hash table (or Python dic). Every hash table already provides the membership operator (e.g., “key” in dic.keys()). And if you know how to calculate membership, you can easily create the most important set functions like union or intersection.

Now let’s go back to the above hash table to learn about why the membership operator is fast for hash tables.

Remember, our goal is the following. Given a key, we want to get the associated value (e.g. “Harry” should give us the value “13”).

At the heart of any hash table is an array. Suppose, we store the data in an array like this:

Index --> Value
0 --> ("Harry", 13)
1 --> ("Hermine", 13)
2 --> ("Dumbledore", 398)
3 --> ("Lord V", 72)

This is in fact how many hash tables are implemented (e.g. in the programming language C). The good thing with arrays is that if you know the index, you can quickly get the (key, value) pair stored at that index. For example, you can get the (key, value)-pair (“Lord V”, 72) in one quick shot by calling array[3].

However, testing whether a certain key exists in the array is a pain: you have to check EVERY single array element until you either found the key or you run out of array elements. If the array has size n, you need to search n elements if the key is not in the array.

The hash table uses a nice trick: it uses a function that maps a key to an index (called the hash function). The index is then used to get the associated value in the array. If you look at it from above, you assign keys to values.

Read the last paragraph again until you get it.

Here is an example:

Key --> Hashing --> Index --> Value
"Harry" --> h("Harry") --> 0 --> 13
"Hermine" --> h("Hermine") --> 1 --> 13
"Dumbledore" --> h("Dumbledore") --> 2 --> 398
"Lord V" --> h("Lord V") --> 3 --> 72

This way, you can implement a hash table using nothing but a simple array (which is built into nearly every programming language).

Now here’s the thing: no matter how many (key, value) pairs you have, you calculate the index using the hash function on the key and use the index to access the array element (value). Both calculating the hash value and accessing the array is fast and independent from the size of the data structure.

I think this answers the question already (“why is set membership faster than list membership?”). I just want to note that it is a bit more difficult than that because the hash table has to account for “collisions” that happen if two different keys are hashed to the same index. Technically, this is solved by storing MULTIPLE values per index and decreasing the likelihood of such collisions by selecting better hash functions.

Where to go from here?

If you liked to learn about the Python basics, you will love my free “Coffee Break Python” email series.

Click the link, put in your email, and enjoy 5 Python cheat sheets and hundreds of small daily Python lectures in your inbox.

Leave a Comment

Your email address will not be published. Required fields are marked *