Python List: Remove Duplicates and Keep the Order

Rate this post

Removing duplicates from a list is pretty simple. You can do it with a Python one-liner

>>> initial = [1, 1, 9, 1, 9, 6, 9, 7]
>>> result = list(set(initial))
>>> result
[1, 7, 9, 6]

Python set elements have to be unique so converting a list into a set and back again achieves the desired result.

What if the original order of the list is important though? That makes things a bit more complicated because sets are unordered, so once you’ve finished the conversion the order of the list will be lost.

Fortunately, there are several ways to overcome this issue. In this article we’ll look at a range of different solutions to the problem and consider their relative merits. 

Method 1 – For Loop

A basic way to achieve the required result is with a for loop:

 >>> initial = [1, 1, 9, 1, 9, 6, 9, 7]
 >>> result = []
 >>> for item in initial:
         if item not in result:
             result.append(item)
 >>> result
 
 [1, 9, 6, 7]


This approach does at least have the advantage of being easy to read and understand. It’s quite inefficient though as the not in check is being completed for every element of the initial list

That might not be a problem with this simple example, but the time overhead will become increasingly evident if the list gets very large.

Method 2 – List Comprehension

One alternative is to use a list comprehension:

 >>> initial = [1, 1, 9, 1, 9, 6, 9, 7]
 >>> result = []
 >>> [result.append(item) for item in initial if item not in result]
 [None, None, None, None]
 >>> result
 
 [1, 9, 6, 7]

List comprehensions are handy and very powerful Python tools that enable you to combine variables, for loops and if statements. They make it possible to create a list with a single line of code (but you can split them into multiple lines to improve readability too!).

Although shorter and still fairly clear, using a list comprehension in this instance is not a very good idea.

That’s because it takes the same inefficient approach to membership testing that we saw in Method 1. It also relies on the side effects of the comprehension to build the result list, which many consider to be bad practice.

To explain further, even if it’s not assigned to a variable for later use, a list comprehension still creates a list object. So, in the process of appending items from the initial list to the result list, our code is also creating a third list containing the return value of each result.append(item) call.

Python functions return the value None if no other return value is specified, meaning that (as you can see above) the output from the third list is: 

[None, None, None, None]

A for loop is clearer and does not rely on side effects so is the better method of the two on this occasion.

Method 3 – Sorted Set

We can’t simply convert our list to a set to remove duplicates if we want to preserve order. However, using this approach in conjunction with the sorted function is another potential way forward:

 >>> initial = [1, 1, 9, 1, 9, 6, 9, 7]
 >>> result = sorted(set(initial), key=initial.index)
 >>> result
 
 [1, 9, 6, 7]


As you can see, this method uses the index of the initial list to sort the set of unique values in the correct order.

The problem is that although it’s pretty easy to understand it’s not much faster than the basic for loop shown in Method 1.

Method 4 – Dictionary fromkeys()

A seriously quick approach is to use a dictionary:

 >>> initial = [1, 1, 9, 1, 9, 6, 9, 7]
 >>> result = list(dict.fromkeys(initial))
 >>> result
 
 [1, 9, 6, 7]


Like sets, dictionaries use hash tables, which means they are extremely fast.

Python dictionary keys are unique by default so converting our list into a dictionary will remove duplicates automatically.

The dict.fromkeys() method creates a new dictionary using the elements from an iterable as the keys. 

Once this has been done with our initial list, converting the dictionary back to a list gives the result we’re looking for.

Dictionaries only became ordered in all python implementations when Python 3.7 was released (this was also an implementation detail of CPython 3.6). 

So, if you’re using an older version of Python, you will need to import the OrderedDict class from the collections package in the standard library instead:

 >>> from collections import OrderedDict
 >>> initial = [1, 1, 9, 1, 9, 6, 9, 7]
 >>> result = list(OrderedDict.fromkeys(initial))
 >>> result
 
 [1, 9, 6, 7]

This approach might not be as fast as using a standard dictionary, but it’s still very speedy!

Exercise: Run the code. Does it work?

Method 5 – more-itertools

Up to this point, we’ve only looked at lists containing immutable items. But what if your list contains mutable data types such as lists, sets or dictionaries?

It’s still possible to use the basic for loop shown in Method 1, but that won’t cut the mustard if speed is of the essence.

Also, if we try to use dict.fromkeys() we’ll receive a TypeError because dictionary keys must be hashable.

A great answer to this conundrum comes in the form of a library called more-itertools. It’s not part of the Python standard library so you’ll need to pip install it.

With that done, you can import and use its unique_everseen() function like so:

 >>> from more_itertools import unique_everseen
 >>> mutables = [[1, 2, 3], [2, 3, 4], [1, 2, 3]]
 >>> result = list(unique_everseen(mutables))
 >>> result
 
 [[1, 2, 3], [2, 3, 4]]

The library more-itertools is designed specifically for working with Python’s iterable data types in efficient ways (it complements itertools which IS part of the standard library).

The function unique_everseen() yields unique elements while preserving order and crucially it can handle mutable data types, so it’s exactly what we’re looking for.

The function also provides a way to remove duplicates even more quickly from a list of lists:

 ...
 >>> result = list(unique_everseen(mutables, key=tuple))
 >>> result
 
 [[1, 2, 3], [2, 3, 4]]

This works well because it converts the unhashable lists into hashable tuples to speed things up further.

If you want to apply this trick to a list of sets, you can use frozenset as the key:

 ...
 >>> mutables = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3}]
 >>> result = list(unique_everseen(mutables, key=frozenset))
 >>> result
 
 [{1, 2, 3}, {2, 3, 4}]

Specifying a key with a list of dictionaries is a little more complicated, but can still be achieved with the help of a lambda function:

 ...
 >>> mutables = [{'one': 1}, {'two': 2}, {'one': 1}]
 >>> result = list(
     unique_everseen(mutables, key=lambda x: frozenset(x.items()))
     )
 >>> result
 
 [{'one': 1}, {'two': 2}]

The function unique_everseen() can also be used with lists containing a mix of iterable and non-iterable items (think integers and floats), which is a real bonus. Attempting to provide a key in this instance will result in a TypeError though.

Method 6 – NumPy unique()

If you’re working with numerical data, the third-party library numpy is an option too:

 >>> import numpy as np
 >>> initial = np.array([1, 1, 9, 1, 9, 6, 9, 7])
 >>> _, idx = np.unique(initial, return_index=True)
 >>> result = initial[np.sort(idx)]
 >>> result
 
 [1 9 6 7]

The index values of the unique items can be stored by using the np.unique() function with the return_index parameter set to True.

These can then be passed to np.sort() to produce a correctly ordered slice with duplicates removed.

Technically this method could be applied to a standard list by first converting it into a numpy array and then converting it back to list format at the end. However, this would be an overcomplicated and inefficient way of achieving the result.

Using these kinds of techniques only really makes sense if you are also utilizing some of numpy’s powerful features for other reasons.

Method 7 – pandas unique()

Another third-party library we could use is pandas:

 >>> import pandas as pd
 >>> initial = pd.Series([1, 1, 9, 1, 9, 6, 9, 7])
 >>> result = pd.unique(initial)
 >>> result
 
 [1 9 6 7]

pandas is better suited to the task because it preserves order by default and pd.unique() is significantly faster than np.unique().

As with the numpy method, it would be perfectly possible to convert the result to a standard list at the end.

Again though, unless you’re employing the amazing data analysis tools provided by pandas for another purpose, there is no obvious reason to choose this approach over the even faster option utilizing Python’s built-in dictionary data type (Method 4).

Summary

As we’ve seen, there are a wide range of ways to solve this problem and the decision about which one to select should be driven by your particular circumstances. 

If you’re writing a quick script and your list isn’t huge, you may opt to use a simple for loop for the sake of clarity.

However, if efficiency is a factor and your lists don’t contain mutable items then going with dict.fromkeys() is an excellent option. It’s great that this method uses one of Python’s built-in data types and retains a good level of readability while massively improving on the for loop’s speed.

Alternatively, if you’re using an older version of Python, OrderedDict.fromkeys() is a really good choice as it’s still very fast.

If you need to work with lists that contain mutable items, importing more-itertools so you can take advantage of the brilliant unique_everseen() function makes a lot of sense.

Lastly, if you’re doing some serious number crunching with numpy or manipulating data with pandas, it would probably be wise to go with the methods built into those tools for this purpose. 

The choice is of course yours, and I hope this article has provided some useful insights that will help you pick the right approach for the job at hand.