How to Determine Unique Values from Multiple Columns in a DataFrame?

Problem Formulation

While working on large datasets, we might have to find the unique values from various columns.

For example, let’s say, we have a dataframe object as shown below:

import pandas as pd

df = pd.DataFrame({'Col1': ['Alice', 'Bob', 'Alice', 'Mary', 'Julie'],
                   'Col2': ['Julie', 'Steve', 'Mary', 'Bob', 'Steve'],
                   'Col3': [1, 2, 1, 2, 3]})

And we want to find the unique elements from Col1 and Col2. That is, we want to return the following values:

'Julie', 'Mary', 'Steve', 'Alice', 'Bob'

Solution Overview

To solve this, let’s use the divide and conquer method. Let’s try to address the problem step-by-step.

There are two tasks to be performed.

  1. Joining the columns (Col1, Col2)
  2. Finding unique elements

Now, let’s understand the technicalities involved. Depending on how we merge the columns of the dataframe, either a 1D array-like object or a multi-dimensional array object is returned.

To find the unique elements within an object, we can use numpy.unique(), pandas.unique(), or convert the object to a set.

However, it is important to note that:

  • set() accepts only 1D array-like objects.
  • pandas.unique() expects a 1D array-like object as its input parameter.
  • numpy.unique() also accepts multi-dimensional array-like objects, because it flattens them to a 1D array internally.
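These constraints can be verified with a small standalone array (the names here just mirror the dataframe above):

```python
import numpy as np
import pandas as pd

arr2d = np.array([['Alice', 'Julie'], ['Bob', 'Steve']])

# np.unique() handles the 2D array directly by flattening it internally
print(np.unique(arr2d))            # ['Alice' 'Bob' 'Julie' 'Steve']

# pd.unique() and set() need a 1D sequence, so flatten first
print(pd.unique(arr2d.ravel()))    # ['Alice' 'Julie' 'Bob' 'Steve']
print(set(arr2d.ravel()))
```

Passing the 2D array to pd.unique() or set() without flattening raises a TypeError.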

Different methods for joining the columns

We can use one of the following methods to join the columns:

Method 1: Using pandas.DataFrame.values

We can convert the dataframe into a NumPy array representation using the pandas.DataFrame.values property.

import pandas as pd
print(df[['Col1','Col2']].values)

Output:

[['Alice' 'Julie']
 ['Bob' 'Steve']
 ['Alice' 'Mary']
 ['Mary' 'Bob']
 ['Julie' 'Steve']]

Note that this returns a 2D array object, so it can be used with np.unique() but not with pd.unique() or set().

If you wish to use pd.unique() or set(), flatten the array first with the numpy.ndarray.flatten() method.

💡 Note: You can also use numpy.ravel() instead of numpy.flatten().

import numpy as np
import pandas as pd

# Flatten the 2D array with ndarray.flatten()
print(df[['Col1','Col2']].values.flatten())

Output:

['Alice' 'Julie' 'Bob' 'Steve' 'Alice' 'Mary' 'Mary' 'Bob' 'Julie' 'Steve']

💡 Tip: Passing the order parameter 'K' to flatten() or ravel() reads the elements in memory order, which can reduce the execution time considerably. To pass this parameter, use arr.flatten('K') or arr.ravel('K').
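Part of the speed difference comes from copying: flatten() always copies the data, while ravel() can return a view of the original array when the memory layout allows it. A small sketch using np.shares_memory() to check this:

```python
import numpy as np

arr = np.array([['Alice', 'Julie'], ['Bob', 'Steve']])

# flatten() always returns a new copy of the data
flat = arr.flatten('K')
print(np.shares_memory(arr, flat))   # False

# ravel() returns a view (no copy) when the memory layout allows it
rav = arr.ravel('K')
print(np.shares_memory(arr, rav))    # True
```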

Method 2: Using pandas.concat()

We can also use pd.concat() to join the sequences. When given Series inputs, it returns a single concatenated Series object.

Refer to the below example.

import pandas as pd
print(pd.concat([df['Col1'],df['Col2']]))

Output:

0    Alice
1      Bob
2    Alice
3     Mary
4    Julie
0    Julie
1    Steve
2     Mary
3      Bob
4    Steve
dtype: object
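Notice that the index labels 0–4 repeat in the output above. If that is undesirable, pd.concat() accepts an ignore_index parameter that renumbers the result:

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['Alice', 'Bob', 'Alice', 'Mary', 'Julie'],
                   'Col2': ['Julie', 'Steve', 'Mary', 'Bob', 'Steve']})

# ignore_index=True renumbers the result 0..9 instead of repeating 0..4
combined = pd.concat([df['Col1'], df['Col2']], ignore_index=True)
print(combined)
```

This does not affect the unique values themselves, only the index of the combined Series.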

Different methods to find unique elements

Method 1: Using set()

You can use the set() function to convert an array or Series to a set of its unique elements. Note that this expects a 1D sequence as input.

Example 1: In this example, we will use pd.concat() to join the columns and then use set() to convert the resulting Series to a set.

import pandas as pd
df = pd.DataFrame({'Col1': ['Alice', 'Bob', 'Alice', 'Mary', 'Julie'],
                   'Col2': ['Julie', 'Steve', 'Mary', 'Bob', 'Steve'],
                   'Col3': [1, 2, 1, 2, 3]})

# Concatenate the columns using pd.concat() and convert the result to a set
print(set(pd.concat([df['Col1'], df['Col2']])))

Output:

{'Steve', 'Mary', 'Julie', 'Bob', 'Alice'}

Example 2: We can also flatten the array and then convert it to a set, as shown below.

import pandas as pd
import numpy as np

df = pd.DataFrame({'Col1': ['Alice', 'Bob', 'Alice', 'Mary', 'Julie'],
                   'Col2': ['Julie', 'Steve', 'Mary', 'Bob', 'Steve'],
                   'Col3': [1, 2, 1, 2, 3]})

# Flatten the 2D array of values and convert it to a set
print(set(df[['Col1', 'Col2']].values.flatten('K')))

Output:

{'Julie', 'Mary', 'Alice', 'Bob', 'Steve'}

Method 2: Using numpy.unique()

Use the numpy.unique() method when you have to find unique elements from a multi-dimensional array-like object. 

Consider the example below: joining the columns with DataFrame.values returns a 2D array, and np.unique() can find the unique elements directly from it.

import pandas as pd
import numpy as np

df = pd.DataFrame({'Col1': ['Alice', 'Bob', 'Alice', 'Mary', 'Julie'],
                   'Col2': ['Julie', 'Steve', 'Mary', 'Bob', 'Steve'],
                   'Col3': [1, 2, 1, 2, 3]})

# np.unique() flattens the 2D array internally and returns the unique values
print(np.unique(df[['Col1', 'Col2']].values))

Output:

['Alice' 'Bob' 'Julie' 'Mary' 'Steve']
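A bonus of np.unique() is its return_counts parameter, which also reports how often each unique value occurs across the columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['Alice', 'Bob', 'Alice', 'Mary', 'Julie'],
                   'Col2': ['Julie', 'Steve', 'Mary', 'Bob', 'Steve'],
                   'Col3': [1, 2, 1, 2, 3]})

# return_counts=True adds the frequency of each unique value
values, counts = np.unique(df[['Col1', 'Col2']].values, return_counts=True)
for value, count in zip(values, counts):
    print(value, count)
```

In this dataframe, each name appears exactly twice across Col1 and Col2.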

Method 3: Using pandas.unique()

The pandas.unique() method returns the unique elements of a 1D array-like object. Note that it is significantly faster than numpy.unique() for large inputs, because it uses a hash table and does not sort the result.

Example 1: In this example, let us use pd.concat() to join the columns and then use pd.unique() to find unique values.

import pandas as pd

df = pd.DataFrame({'Col1': ['Alice', 'Bob', 'Alice', 'Mary', 'Julie'],
                   'Col2': ['Julie', 'Steve', 'Mary', 'Bob', 'Steve'],
                   'Col3': [1, 2, 1, 2, 3]})

print(pd.unique(pd.concat([df['Col1'], df['Col2']])))

Output:

['Alice' 'Bob' 'Mary' 'Julie' 'Steve']
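Notice that the output above is not sorted: pd.unique() preserves the order of first appearance, while np.unique() sorts its result. The difference is easy to see side by side:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['Alice', 'Bob', 'Alice', 'Mary', 'Julie'],
                   'Col2': ['Julie', 'Steve', 'Mary', 'Bob', 'Steve'],
                   'Col3': [1, 2, 1, 2, 3]})

combined = pd.concat([df['Col1'], df['Col2']])

# pd.unique() preserves the order of first appearance...
print(pd.unique(combined))   # ['Alice' 'Bob' 'Mary' 'Julie' 'Steve']

# ...while np.unique() returns a sorted array
print(np.unique(combined))   # ['Alice' 'Bob' 'Julie' 'Mary' 'Steve']
```

Choose pd.unique() when the original ordering matters, and np.unique() when you want the result sorted anyway.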

Execution Times of these methods

Now, let's see how much time each method takes to perform the task.

import pandas as pd
import numpy as np
from time import perf_counter_ns

df = pd.DataFrame({'Col1': ['Alice', 'Bob', 'Alice', 'Mary', 'Julie'],
                   'Col2': ['Julie', 'Steve', 'Mary', 'Bob', 'Steve'],
                   'Col3': [1, 2, 1, 2, 3]})

# Flatten the values and convert them to a set
start = perf_counter_ns()
set(df[['Col1', 'Col2']].values.flatten('K'))
end = perf_counter_ns()
print("Execution Time using df.values & set(): %9dns" % (end - start))

# Concatenate the columns using pd.concat() and convert the result to a set
start = perf_counter_ns()
set(pd.concat([df['Col1'], df['Col2']]))
end = perf_counter_ns()
print("Execution Time using pd.concat & set(): %9dns" % (end - start))

# Using np.unique()
start = perf_counter_ns()
np.unique(df[['Col1', 'Col2']].values)
end = perf_counter_ns()
print("Execution Time using np.unique()      : %9dns" % (end - start))

# Using pd.unique()
start = perf_counter_ns()
pd.unique(pd.concat([df['Col1'], df['Col2']]))
end = perf_counter_ns()
print("Execution Time using pd.unique()      : %9dns" % (end - start))

Output:

Execution Time using df.values & set():    815500ns
Execution Time using pd.concat & set():    328500ns
Execution Time using np.unique()      :    394200ns
Execution Time using pd.unique()      :    212400ns
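Single-run timings like the ones above are noisy, since any one call can be delayed by caching or the OS scheduler. The timeit module averages over many repetitions; here is a sketch of the same comparison with it:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['Alice', 'Bob', 'Alice', 'Mary', 'Julie'],
                   'Col2': ['Julie', 'Steve', 'Mary', 'Bob', 'Steve'],
                   'Col3': [1, 2, 1, 2, 3]})

methods = {
    'df.values & set()': lambda: set(df[['Col1', 'Col2']].values.ravel('K')),
    'np.unique()':       lambda: np.unique(df[['Col1', 'Col2']].values),
    'pd.unique()':       lambda: pd.unique(pd.concat([df['Col1'], df['Col2']])),
}

for label, func in methods.items():
    # Average each method over 1000 runs to smooth out one-off timing noise
    seconds = timeit.timeit(func, number=1000) / 1000
    print('%-20s: %9.0f ns per run' % (label, seconds * 1e9))
```

Absolute numbers will vary between machines, but the relative ordering of the methods should be more stable than in a single-shot measurement.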

Conclusion

That brings us to the end of this tutorial.

We have learned how to determine the unique values from multiple columns in a dataframe, and we have compared the execution times of each approach. We hope this has been informative.

Thanks for reading, and feel free to join our community of passionate coders.