5 Best Ways to Drop Duplicates in Python Pandas Series

πŸ’‘ Problem Formulation: When working with data in a pandas Series in Python, it’s common to encounter duplicate entries that can skew the analysis. Removing these duplicates is important for ensuring the integrity of the dataset. This article demonstrates how to remove duplicate values from a pandas Series object. Suppose we have a Series with values [1, 2, 2, 3, 4, 4, 4, 5]; the desired output after dropping duplicates would be [1, 2, 3, 4, 5].

Method 1: Using Series.drop_duplicates()

One of the most straightforward methods to drop duplicates from a pandas Series is to use the Series.drop_duplicates() method. By default, this method keeps the first occurrence of each value and removes subsequent duplicates, although this behavior can be changed by specifying the ‘keep’ parameter.

Here’s an example:

import pandas as pd

# Create a pandas Series with duplicates
series = pd.Series([1, 2, 2, 3, 4, 4, 4, 5])

# Drop duplicates
unique_series = series.drop_duplicates()

print(unique_series)

Output:

0    1
1    2
3    3
4    4
7    5
dtype: int64

This code snippet creates a pandas Series with some duplicate entries. The drop_duplicates() method is called on this Series, which returns a new Series without any duplicates. By default, the first occurrence is kept, and all other duplicates are removed.
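Notice that the resulting Series keeps the original index labels, so gaps appear where duplicates were removed. If a clean 0-based index is preferred, a sketch chaining reset_index(drop=True) after drop_duplicates() looks like this:

```python
import pandas as pd

series = pd.Series([1, 2, 2, 3, 4, 4, 4, 5])

# Drop duplicates, then renumber the index from 0
unique_series = series.drop_duplicates().reset_index(drop=True)

print(unique_series.tolist())  # [1, 2, 3, 4, 5]
```

The drop=True argument discards the old index instead of turning it into a column.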

Method 2: Customizing Keep Parameter

The drop_duplicates() method has a ‘keep’ parameter that allows users to specify which duplicates to keep. The options are ‘first’ (the default), ‘last’, or False (to drop every value that is duplicated at all).

Here’s an example:

import pandas as pd

# Create a pandas Series with duplicates
series = pd.Series([1, 2, 2, 3, 4, 4, 4, 5])

# Drop duplicates, keep last occurrence
unique_series = series.drop_duplicates(keep='last')

print(unique_series)

Output:

0    1
2    2
3    3
6    4
7    5
dtype: int64

The code snippet is similar to the first method but includes the parameter keep='last'. This means that if there are duplicates, the last occurrence is the one that will be preserved in the resulting Series.
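The third option, keep=False, is worth seeing in action as well: it discards every value that occurs more than once, leaving only the values that were unique to begin with.

```python
import pandas as pd

series = pd.Series([1, 2, 2, 3, 4, 4, 4, 5])

# keep=False drops ALL occurrences of duplicated values
singletons = series.drop_duplicates(keep=False)

print(singletons.tolist())  # [1, 3, 5]
```

Here 2 and 4 vanish entirely because they each appear more than once.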

Method 3: Sorting Before Dropping Duplicates

Another effective technique involves sorting the Series before dropping duplicates. This method ensures that the values are ordered and can be particularly useful when the ‘keep’ parameter is set to ‘first’ or ‘last’.

Here’s an example:

import pandas as pd

# Create a pandas Series with duplicates
series = pd.Series([4, 2, 2, 3, 1, 4, 4, 5])

# Sort and drop duplicates
sorted_unique_series = series.sort_values().drop_duplicates()

print(sorted_unique_series)

Output:

4    1
1    2
3    3
0    4
7    5
dtype: int64

Here, the code snippet first uses sort_values() to sort the Series, which is then followed by drop_duplicates(). The sorted unique values are returned, maintaining the sorted order.
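If only the sorted unique values are needed and the pandas index does not matter, an alternative sketch uses Series.unique(), which returns the distinct values (in order of first appearance) as a NumPy array that sorted() can then order:

```python
import pandas as pd

series = pd.Series([4, 2, 2, 3, 1, 4, 4, 5])

# unique() returns distinct values in order of first appearance;
# sorted() converts that array into an ordered Python list
ordered_unique = sorted(series.unique())

print(ordered_unique)  # [1, 2, 3, 4, 5]
```

This trades the Series result for a plain list but avoids sorting the duplicates before removing them.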

Method 4: Using Boolean Indexing

Boolean indexing can also be employed to filter out duplicates by creating a mask that only selects items that have not been seen before.

Here’s an example:

import pandas as pd

# Create a pandas Series with duplicates
series = pd.Series([1, 2, 2, 3, 4, 4, 4, 5])

# Create a mask for non-duplicate values
mask = ~series.duplicated()

# Filter series based on mask
unique_series = series[mask]

print(unique_series)

Output:

0    1
1    2
3    3
4    4
7    5
dtype: int64

This code snippet illustrates how to create a boolean mask with the duplicated() method that marks duplicates. The tilde (~) is used to invert the mask, choosing only non-duplicate values when filtering the original Series.
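A strength of the boolean-mask approach is that the deduplication condition can be combined with other filters in a single expression. As a sketch, assuming we also want to keep only values greater than 2:

```python
import pandas as pd

series = pd.Series([1, 2, 2, 3, 4, 4, 4, 5])

# Keep first occurrences that are also greater than 2
mask = ~series.duplicated() & (series > 2)
filtered = series[mask]

print(filtered.tolist())  # [3, 4, 5]
```

Chaining drop_duplicates() with a separate filter would work too, but composing boolean masks keeps everything in one vectorized step.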

Bonus One-Liner Method 5: Using Lambda and GroupBy

For more complex scenarios, one might resort to a one-liner involving a group-by operation and a lambda function for custom duplicate handling.

Here’s an example:

import pandas as pd

# Create a pandas Series with duplicates
series = pd.Series([2, 1, 2, 3, 4, 4, 4, 5])

# One-liner using groupby and lambda
unique_series = series.groupby(series).apply(lambda x: x.iloc[0])

print(unique_series)

Output:

1    1
2    2
3    3
4    4
5    5
dtype: int64

This code uses groupby() to group the Series by its own values and applies a lambda function to pick the first element of each group. Note that the resulting Series is indexed by the unique values themselves, sorted by group key, rather than by the original positions.
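For this particular case the lambda is not strictly necessary: GroupBy exposes a built-in first() aggregation that produces the same result more directly and efficiently.

```python
import pandas as pd

series = pd.Series([2, 1, 2, 3, 4, 4, 4, 5])

# first() takes the first element of each group, no lambda needed
unique_series = series.groupby(series).first()

print(unique_series.tolist())  # [1, 2, 3, 4, 5]
```

The lambda form remains useful when the per-group selection logic is more involved than "take the first".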

Summary/Discussion

  • Method 1: Series.drop_duplicates(). Straightforward and simple method. Works well for most cases but lacks customization for complex scenarios.
  • Method 2: Customizing Keep Parameter. Offers control over which duplicates to keep. Useful, but requires understanding of keep options.
  • Method 3: Sorting Before Dropping Duplicates. Ensures ordered results and is most effective for sorted data but introduces additional computational overhead.
  • Method 4: Using Boolean Indexing. Offers flexibility and can be tailored to custom conditions. More verbose and slightly less intuitive than drop_duplicates().
  • Bonus Method 5: Using Lambda and GroupBy. Powerful one-liner suitable for complex and highly customizable scenarios. However, it can be overkill and not as performant for simple situations.