# How to Determine Unique Values from Multiple Columns in a DataFrame?


## Problem Formulation

While working on large datasets, we might have to find the unique values from various columns.

For example, letβs say, we have a dataframe object as shown below:

```
import pandas as pd

df = pd.DataFrame({'Col1': ['Alice', 'Bob', 'Alice', 'Mary', 'Julie'],
                   'Col2': ['Julie', 'Steve', 'Mary', 'Bob', 'Steve'],
                   'Col3': [1, 2, 1, 2, 3]})
```

And we want to find the unique elements from `Col1` and `Col2`. That is, we want to return the following values:

`'Julie', 'Mary', 'Steve', 'Alice', 'Bob'`

## Solution Overview

To solve this, let's break the problem down and address it step by step.

There are two tasks to be performed.

1. Joining the columns (Col1, Col2)
2. Finding unique elements

Now, let's understand the technicalities involved. Depending on how we join the columns of the DataFrame, we get back either a 1D array-like object (such as a Series) or a multi-dimensional array.

To find the unique elements within an object, we can use `numpy.unique()`, `pandas.unique()`, or convert the object to a set.

However, it is important to note that:

• `set()` works on 1D array-like objects but fails on a 2D array, because iterating over it yields rows, which are unhashable arrays.
• `pandas.unique()` expects a 1D array-like object as its input parameter.
• `numpy.unique()` accepts a multi-dimensional array-like object because it flattens the array to 1D internally.
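To see these constraints in action, here is a small sketch (the `arr2d` array is an illustrative example, not the DataFrame above):

```python
import numpy as np

# A small 2D array of names
arr2d = np.array([['Alice', 'Julie'], ['Bob', 'Steve']])

# numpy.unique() flattens multi-dimensional input internally, so this works
print(np.unique(arr2d))

# set() iterates over the rows, which are unhashable arrays, so this fails
try:
    set(arr2d)
except TypeError as err:
    print("set() on a 2D array raises:", err)
```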

## Different methods for joining the columns

We can use one of the following methods to join the columns:

### Method 1: Using pandas.DataFrame.values

We can convert the DataFrame columns into a NumPy array representation using the `pandas.DataFrame.values` property (in recent pandas versions, `DataFrame.to_numpy()` is the recommended equivalent).

```
import pandas as pd

print(df[['Col1', 'Col2']].values)
```

Output:

```
[['Alice' 'Julie']
 ['Bob' 'Steve']
 ['Alice' 'Mary']
 ['Mary' 'Bob']
 ['Julie' 'Steve']]
```

Note that this returns a 2D array object, so it can be used with `np.unique()` but not with `pd.unique()` or `set()`.

If you wish to use `pd.unique()` or `set()`, first flatten the array using the `ndarray.flatten()` method.

π‘ Note: You can also use `numpy.ravel()` instead of `numpy.flatten()`.

```
import numpy as np
import pandas as pd

# Flattening the 2D array with ndarray.flatten()
print(df[['Col1', 'Col2']].values.flatten())
```

Output:

`['Alice' 'Julie' 'Bob' 'Steve' 'Alice' 'Mary' 'Mary' 'Bob' 'Julie' 'Steve']`

π‘ Tip: Passing the parameter `'K'` in `np.flatten()` and `np.ravel()` decreases the execution considerably. To pass this parameter, just use `np.flatten('K')` or `np.ravel('K')`

### Method 2: Using pandas.concat()

We can also use `pd.concat()` to join the sequences. When given two Series as input, it returns a single concatenated Series object.

Refer to the below example.

```
import pandas as pd

print(pd.concat([df['Col1'], df['Col2']]))
```

Output:

```
0    Alice
1      Bob
2    Alice
3     Mary
4    Julie
0    Julie
1    Steve
2     Mary
3      Bob
4    Steve
dtype: object
```
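Notice that the index values 0 to 4 repeat in the output, because each Series keeps its own index. If a clean index matters for later processing, `pd.concat()` accepts an `ignore_index=True` parameter; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['Alice', 'Bob', 'Alice', 'Mary', 'Julie'],
                   'Col2': ['Julie', 'Steve', 'Mary', 'Bob', 'Steve']})

# ignore_index=True renumbers the combined Series from 0 instead of
# keeping each column's original 0-4 index
combined = pd.concat([df['Col1'], df['Col2']], ignore_index=True)
print(combined.tolist())
```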

## Different methods to find unique elements

### Method 1: Using set()

You can use the `set()` function to convert the joined object to a set. Note that this expects a 1D iterable of hashable elements as input.

Example 1: In this example, we will use `pd.concat()` to join the columns and then use `set()` to convert the resulting Series to a set.

```
import pandas as pd

df = pd.DataFrame({'Col1': ['Alice', 'Bob', 'Alice', 'Mary', 'Julie'],
                   'Col2': ['Julie', 'Steve', 'Mary', 'Bob', 'Steve'],
                   'Col3': [1, 2, 1, 2, 3]})

# Concatenate the columns using pd.concat() and convert the result to a set.
print(set(pd.concat([df['Col1'], df['Col2']])))
```

Output:

`{'Steve', 'Mary', 'Julie', 'Bob', 'Alice'}`

Example 2: We can also flatten the array and then convert it to a set, as shown below.

```
import pandas as pd
import numpy as np

df = pd.DataFrame({'Col1': ['Alice', 'Bob', 'Alice', 'Mary', 'Julie'],
                   'Col2': ['Julie', 'Steve', 'Mary', 'Bob', 'Steve'],
                   'Col3': [1, 2, 1, 2, 3]})

print(set(df[['Col1', 'Col2']].values.flatten('K')))
```

Output:

`{'Julie', 'Mary', 'Alice', 'Bob', 'Steve'}`

### Method 2: Using numpy.unique()

Use the `numpy.unique()` method when you have to find unique elements from a multi-dimensional array-like object.

Consider the example below: joining both columns with `DataFrame.values` returns a 2D array, and we can find the unique elements from it directly using `np.unique()`. Note that `np.unique()` returns the unique values in sorted order.

```
import pandas as pd
import numpy as np

df = pd.DataFrame({'Col1': ['Alice', 'Bob', 'Alice', 'Mary', 'Julie'],
                   'Col2': ['Julie', 'Steve', 'Mary', 'Bob', 'Steve'],
                   'Col3': [1, 2, 1, 2, 3]})

print(np.unique(df[['Col1', 'Col2']].values))
```

Output:

`['Alice' 'Bob' 'Julie' 'Mary' 'Steve']`

### Method 3: Using pandas.unique()

The `pandas.unique()` method returns the unique elements from a 1D array-like object, in order of appearance. Note that it is significantly faster than `numpy.unique()` for long sequences, because it uses a hash table instead of sorting.

Example 1: In this example, let us use `pd.concat()` to join the columns and then use `pd.unique()` to find unique values.

```
import pandas as pd

df = pd.DataFrame({'Col1': ['Alice', 'Bob', 'Alice', 'Mary', 'Julie'],
                   'Col2': ['Julie', 'Steve', 'Mary', 'Bob', 'Steve'],
                   'Col3': [1, 2, 1, 2, 3]})

print(pd.unique(pd.concat([df['Col1'], df['Col2']])))
```

Output:

`['Alice' 'Bob' 'Mary' 'Julie' 'Steve']`
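Since `pd.concat()` on two Series returns a Series, the same result can also be obtained with the `Series.unique()` method; a minimal equivalent sketch:

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['Alice', 'Bob', 'Alice', 'Mary', 'Julie'],
                   'Col2': ['Julie', 'Steve', 'Mary', 'Bob', 'Steve']})

# Series.unique() is the method form of pd.unique()
print(pd.concat([df['Col1'], df['Col2']]).unique())
```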

## Execution Times of these methods

Now, let's see how much time each method takes to perform the task.

```
import pandas as pd
import numpy as np
from time import perf_counter_ns

df = pd.DataFrame({'Col1': ['Alice', 'Bob', 'Alice', 'Mary', 'Julie'],
                   'Col2': ['Julie', 'Steve', 'Mary', 'Bob', 'Steve'],
                   'Col3': [1, 2, 1, 2, 3]})

# Flatten the 2D array of values and convert it to a set.
start = perf_counter_ns()
set(df[['Col1', 'Col2']].values.flatten('K'))
end = perf_counter_ns()
print("Execution Time using df.values & set(): %9dns" % (end - start))

# Concatenate the columns using pd.concat() and convert the result to a set.
start = perf_counter_ns()
set(pd.concat([df['Col1'], df['Col2']]))
end = perf_counter_ns()
print("Execution Time using pd.concat & set(): %9dns" % (end - start))

# Using np.unique()
start = perf_counter_ns()
np.unique(df[['Col1', 'Col2']].values)
end = perf_counter_ns()
print("Execution Time using np.unique()      : %9dns" % (end - start))

# Using pd.unique()
start = perf_counter_ns()
pd.unique(pd.concat([df['Col1'], df['Col2']]))
end = perf_counter_ns()
print("Execution Time using pd.unique()      : %9dns" % (end - start))
```

Output:

```
Execution Time using df.values & set():    815500ns
Execution Time using pd.concat & set():    328500ns
Execution Time using np.unique()      :    394200ns
Execution Time using pd.unique()      :    212400ns
```
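These single-shot numbers are noisy and vary from run to run. For a more stable comparison, Python's `timeit` module averages over many repetitions; a minimal sketch for one of the methods:

```python
import timeit
import pandas as pd

df = pd.DataFrame({'Col1': ['Alice', 'Bob', 'Alice', 'Mary', 'Julie'],
                   'Col2': ['Julie', 'Steve', 'Mary', 'Bob', 'Steve']})

# Time 1,000 repetitions and report the average per call in nanoseconds
total = timeit.timeit(lambda: pd.unique(pd.concat([df['Col1'], df['Col2']])),
                      number=1000)
print("pd.unique via pd.concat: %.0f ns per call" % (total / 1000 * 1e9))
```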

## Conclusion

That brings us to the end of this tutorial.

We have learned how to determine the unique values from multiple columns in a DataFrame. We have also compared the execution times of each of these approaches. We hope this has been informative.