Problem Formulation
While working on large datasets, we might have to find the unique values from various columns.
For example, letβs say, we have a dataframe object as shown below:
df = pd.DataFrame({'Col1': ['Alice', 'Bob', 'Alice', 'Mary', 'Julie'], 'Col2': ['Julie', 'Steve', 'Mary', 'Bob', 'Steve'], 'Col3': [1,2,1,2,3]})
And we want to find the unique elements from Col1
and Col2
. That is, we want to return the following values:
'Julie', 'Mary', 'Steve', 'Alice', 'Bob'
Solution Overview
To solve this, let’s use the divide and conquer method. Let’s try to address the problem step-by-step.
There are two tasks to be performed.
- Joining the columns (Col1, Col2)
- Finding unique elements
Now, let’s understand the technicalities involved. When we merge the columns of the dataframe, either a 1D array-like object or a multi-dimensional array object can be returned.
To find the unique elements within an object, we can use numpy.unique()
, pandas.unique()
, or convert the object to a set.
However, it is important to note that :
set()
accepts only 1D array-like objects..pandas.unique()
expects a 1D array-like object as an input parameter.numpy.unique()
accepts a multi-dimensional array-like object as it converts this array to a 1D array-like object internally.
Different methods for joining the columns
We can use one of the following methods to join the columns:
Method 1: Using pandas.DataFrame.values
We can convert the dataframes into a numpy array representation using theΒ pandas.Dataframe.values
property.
import pandas as pd print(df[['Col1','Col2']].values)
Output:
[['Alice' 'Julie'] ['Bob' 'Steve'] ['Alice' 'Mary'] ['Mary' 'Bob'] ['Julie' 'Steve']]
Note that this returns a 2D array object. Thus, it can be used with np.unique()
. But not with pd.unique()
and set()
.
If you wish to use pd.unique()
or set()
, then flatten the array using the function numpy.flatten()
Β
π‘ Note: You can also use numpy.ravel()
instead of numpy.flatten()
.
import numpy as np import pandas as pd #flattening the array with np.flatten() print(df[['Col1','Col2']].values.flatten())
Output:
['Alice' 'Julie' 'Bob' 'Steve' 'Alice' 'Mary' 'Mary' 'Bob' 'Julie' 'Steve']
π‘ Tip: Passing the parameter 'K'
in np.flatten()
and np.ravel()
decreases the execution considerably. To pass this parameter, just use np.flatten('K')
or np.ravel('K')
Method 2: Using pandas.concat()
We can also use pd.concat()
to join the sequences. It returns a merged DataFrame object.
Refer to the below example.
import pandas as pd print(pd.concat([df['Col1'],df['Col2']]))
Output:
0 Alice 1 Bob 2 Alice 3 Mary 4 Julie 0 Julie 1 Steve 2 Mary 3 Bob 4 Steve dtype: object
Different methods to find unique elements
Method 1: Using set()
You can use the set()
function to convert the array or dataframe object to a set. Note that, this expects a 1D array as the input.
Example 1: In this example, we will be using pd.concat()
to join the columns and then use set()
to convert the Datframe object to a set.
import pandas as pd df= pd.DataFrame({'Col1':['Alice', 'Bob', 'Alice', 'Mary', 'Julie'], 'Col2':['Julie', 'Steve', 'Mary', 'Bob', 'Steve'], 'Col3':[1,2,1,2,3]}) #Concatenate the columns using pd.concatenate() and convert it to a set. print(set(pd.concat([df['Col1'],df['Col2']])))
Output:
{'Steve', 'Mary', 'Julie', 'Bob', 'Alice'}
Example 2: We can also flatten the array and then convert the array to a set()
as shown below.
import pandas as pd import numpy as np df= pd.DataFrame({'Col1':['Alice', 'Bob', 'Alice', 'Mary', 'Julie'], 'Col2':['Julie', 'Steve', 'Mary', 'Bob', 'Steve'], 'Col3':[1,2,1,2,3]}) print(set(df[['Col1','Col2']].values.flatten('K')))
Output:
{'Julie', 'Mary', 'Alice', 'Bob', 'Steve'}
Method 2: Using the numpy.unique()
Use the numpy.unique()
method when you have to find unique elements from a multi-dimensional array-like object.Β
Consider the below example, joining both the columns with DataFrame.values returns a 2D array. We can find the unique elements from a 2D array using np.unique()
.
import pandas as pd import numpy as np df= pd.DataFrame({'Col1':['Alice', 'Bob', 'Alice', 'Mary', 'Julie'], 'Col2':['Julie', 'Steve', 'Mary', 'Bob', 'Steve'], 'Col3':[1,2,1,2,3]}) print(np.unique(df[['Col1','Col2']].values))
Output:
['Alice' 'Bob' 'Julie' 'Mary' 'Steve']
Method 3: Using the pandas.unique()
The pandas.unique()
method returns the unique elements from the 1D array-like object. Note that this is significantly faster than numpy.unique()
.
Example 1: In this example, let us use pd.concat()
to join the columns and then use pd.unique()
to find unique values.
import pandas as pd df= pd.DataFrame({'Col1':['Alice', 'Bob', 'Alice', 'Mary', 'Julie'], 'Col2':['Julie', 'Steve', 'Mary', 'Bob', 'Steve'], 'Col3':[1,2,1,2,3]}) print(pd.unique(pd.concat([df['Col1'],df['Col2']])))
Output:
['Alice' 'Bob' 'Mary' 'Julie' 'Steve']
Execution Times of these methods
Now, lets see how much time each method takes to perform the task.Β
import pandas as pd import numpy as np from time import perf_counter_ns df= pd.DataFrame({'Col1':['Alice', 'Bob', 'Alice', 'Mary', 'Julie'], 'Col2':['Julie', 'Steve', 'Mary', 'Bob', 'Steve'], 'Col3':[1,2,1,2,3]}) # Concatenate the columns,flatten it and convert it to a set. start=perf_counter_ns() set(df[['Col1','Col2']].values.flatten('K')) end=perf_counter_ns() print("Execution Time using df.values & set() : %9dns"%(end-start)) #Concatenate the columns using pd.concatenate()and convert it to a set. start=perf_counter_ns() set(pd.concat([df['Col1'],df['Col2']])) end=perf_counter_ns() print("Execution Time using pd.concat & set(): %9dns"%(end-start)) #Using np.unique() start=perf_counter_ns() np.unique(df[['Col1','Col2']].values) end=perf_counter_ns() print("Execution Time using np.unique() : %9dns"%(end-start)) #Using pd.unique() start=perf_counter_ns() pd.unique(pd.unique(pd.concat([df['Col1'],df['Col2']]))) end=perf_counter_ns() print("Execution Time using pd.unique() : %9dns"%(end-start))
Output:
Execution Time using df.values & set() : 815500ns Execution Time using pd.concat & set(): 328500ns Execution Time using np.unique() : 394200ns Execution Time using pd.unique() : 212400ns
Conclusion
That brings us to the end of this tutorial.
We have learned how to determine the unique values from multiple columns in a data frame. We have also seen the time complexities in each of these scenarios. We hope this has been informative.
Thanks for reading and feel free to join our community of passioned coders: