5 Best Ways to Extract Unique Values from a Pandas DataFrame Column

💡 Problem Formulation:

In data analysis with pandas, you often need the unique values of a DataFrame column for data exploration, summary statistics, or further processing. Given a DataFrame with a column containing duplicate values, the objective is to retrieve the distinct values from that column. For example, given a ‘colors’ column with the values ['red', 'blue', 'red', 'green', 'blue', 'green'], we aim to obtain the unique values ['red', 'blue', 'green'].

Method 1: Using unique() Method

The unique() method of a pandas Series returns the distinct values in the order they first appear, which makes it a natural first choice for summarizing categorical data.

Here’s an example:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({'colors': ['red', 'blue', 'red', 'green', 'blue', 'green']})

# Getting unique values
unique_colors = df['colors'].unique()

print(unique_colors)

Output:

['red' 'blue' 'green']

This method is straightforward: unique() is called on the ‘colors’ column of the DataFrame and returns a NumPy array of the distinct values found in that column.
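Since the problem statement asks for a list of distinct values, it can help to convert the result of unique() into a plain Python list. The following is a minimal sketch that reuses the sample DataFrame; the variable name unique_list is chosen here for illustration.

```python
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({'colors': ['red', 'blue', 'red', 'green', 'blue', 'green']})

# unique() returns an array; tolist() turns it into a plain Python list
unique_list = df['colors'].unique().tolist()

print(unique_list)  # ['red', 'blue', 'green']
```

A list is often easier to pass to code that does not expect NumPy types, such as JSON serialization.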

Method 2: Using drop_duplicates() Method

The drop_duplicates() method is typically used to remove duplicate rows, but it can also be applied to a Series to obtain unique values from a DataFrame column, which is advantageous when you need a Series instead of an array.

Here’s an example:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({'colors': ['red', 'blue', 'red', 'green', 'blue', 'green']})

# Getting unique values
unique_series = df['colors'].drop_duplicates()

print(unique_series)

Output:

0      red
1     blue
3    green
Name: colors, dtype: object

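When drop_duplicates() is called on the ‘colors’ column, it returns a new Series with the duplicate values removed. The first occurrence of each value is kept along with its original index label (0, 1, and 3 here), so the original order is preserved.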
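Because the result keeps the original index labels, you may want to renumber it, or keep the last occurrence of each value instead of the first. A short sketch of both options, using reset_index(drop=True) and the keep parameter of drop_duplicates(); the variable names are illustrative only.

```python
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({'colors': ['red', 'blue', 'red', 'green', 'blue', 'green']})

# Renumber the result 0..n-1 instead of keeping the original labels 0, 1, 3
unique_reset = df['colors'].drop_duplicates().reset_index(drop=True)

# keep='last' retains the final occurrence of each value instead of the first
unique_last = df['colors'].drop_duplicates(keep='last')

print(unique_reset.tolist())       # ['red', 'blue', 'green']
print(unique_last.index.tolist())  # [2, 4, 5]
```

Which occurrence is kept only matters when you rely on the index labels, for example to look up the matching rows in the original DataFrame.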

Method 3: Using value_counts() Method

The value_counts() method is not designed for extracting unique values, but it returns a Series of counts indexed by the unique values, so the unique entries come along as a by-product. By default, the result is sorted by count in descending order and missing values are excluded. This method is particularly useful when the number of occurrences is also required.

Here’s an example:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({'colors': ['red', 'blue', 'red', 'green', 'blue', 'green']})

# Getting unique values along with their count
unique_count = df['colors'].value_counts()

print(unique_count)

Output:

red      2
blue     2
green    2
Name: colors, dtype: int64

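Calling value_counts() on the column yields each unique value together with its count; the index of the resulting Series holds the unique values. The exact header and footer lines depend on the pandas version: from pandas 2.0 onward, the result is named count and the index is named after the column (colors).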
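To pull the unique values out of the value_counts() result, you can read them from its index. This is a minimal sketch; counts, unique_from_counts, and counts_dict are names introduced here for illustration. The order among values with equal counts should not be relied upon, so the example sorts before printing.

```python
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({'colors': ['red', 'blue', 'red', 'green', 'blue', 'green']})

counts = df['colors'].value_counts()

# The index holds the distinct values; the Series values hold the counts
unique_from_counts = counts.index.tolist()

# A dict gives value -> count lookups
counts_dict = counts.to_dict()

print(sorted(unique_from_counts))  # ['blue', 'green', 'red']
print(counts_dict['red'])          # 2
```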

Method 4: Using Set Conversion

Converting a pandas Series to a set with Python’s built-in set() function is another way to get unique values. It is not a pandas-specific feature, but a set holds each value only once by definition, which makes it useful when the order of the unique elements does not matter.

Here’s an example:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({'colors': ['red', 'blue', 'red', 'green', 'blue', 'green']})

# Getting unique values by converting to a set
unique_set = set(df['colors'])

print(unique_set)

Output:

{'green', 'red', 'blue'}

A set is created from the ‘colors’ column, which automatically removes duplicate entries because a set cannot contain duplicates. The printed order is arbitrary and may differ between runs.
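If you like the plain-Python approach but need a reproducible order, two common workarounds are sorting the set or using dict.fromkeys(), which keeps the first occurrence of each key in insertion order (guaranteed from Python 3.7). A brief sketch, with illustrative variable names:

```python
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({'colors': ['red', 'blue', 'red', 'green', 'blue', 'green']})

# Sorting the set yields a reproducible, alphabetical list
sorted_unique = sorted(set(df['colors']))

# dict keys are unique and keep insertion order, so first-seen order survives
ordered_unique = list(dict.fromkeys(df['colors']))

print(sorted_unique)   # ['blue', 'green', 'red']
print(ordered_unique)  # ['red', 'blue', 'green']
```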

Bonus One-Liner Method 5: Using nunique() for Counting Unique Values

While the nunique() method doesn’t return the unique values themselves, it quickly returns how many there are, which is useful when only the count is needed.

Here’s an example:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({'colors': ['red', 'blue', 'red', 'green', 'blue', 'green']})

# Counting unique values
unique_count = df['colors'].nunique()

print("Number of unique colors:", unique_count)

Output:

Number of unique colors: 3

The nunique() method counts the distinct entries in the ‘colors’ column and returns an integer.
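One detail worth knowing is that nunique() ignores missing values by default, while unique() reports them as a distinct entry. The sketch below assumes a column with one missing value (the None entry is a hypothetical addition, not part of the sample data above) and uses the dropna parameter to include it in the count.

```python
import pandas as pd

# The None entry is a hypothetical addition to illustrate missing data
colors = pd.Series(['red', 'blue', None, 'red', 'green'], name='colors')

# Missing values are ignored by default
n_without_missing = colors.nunique()

# dropna=False counts the missing value as its own entry
n_with_missing = colors.nunique(dropna=False)

print(n_without_missing)  # 3
print(n_with_missing)     # 4
```

With dropna=False, the count matches the length of the array returned by unique() on the same column.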

Summary/Discussion

  • Method 1: unique() Method. Returns unique values in the order they appear. Straightforward and pandas-specific. However, it returns an array, not a Series.
  • Method 2: drop_duplicates() Method. Returns unique values as a pandas Series, retaining the original order and index labels. Slightly less efficient than unique() if you don’t need a Series.
  • Method 3: value_counts() Method. Useful when counts of unique values are needed. Not the best choice for simple unique value extraction.
  • Method 4: Set Conversion. A plain-Python way to obtain unique values that does not depend on pandas; order isn’t preserved. It introduces the overhead of a type conversion.
  • Method 5: nunique() Method. Great when only the number of unique values is required. Does not provide the unique values themselves.