💡 Problem Formulation: In the realm of data manipulation using Python’s Pandas library, a common challenge is the removal of duplicate rows to maintain data integrity and accuracy. For instance, if you have a DataFrame containing user information, you might find some users listed more than once. The desired output is to have each user appear only once in your dataset, ensuring unique entries for subsequent analysis.
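Before removing anything, it often helps to check how many duplicate rows you actually have. Here’s a quick sketch using Pandas’ duplicated() method (the sample data is illustrative):

import pandas as pd

# Sample DataFrame with one fully duplicated row ('Alice', 25)
df = pd.DataFrame({
    'User': ['Alice', 'Bob', 'Charlie', 'Alice'],
    'Age': [25, 30, 35, 25]
})

# duplicated() flags each row that repeats an earlier one
print(df.duplicated().sum())  # 1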
Method 1: Drop Duplicates Using drop_duplicates()
This method utilizes the built-in Pandas function drop_duplicates(), which offers a straightforward way to remove duplicate rows based on all columns or a subset of them. The original DataFrame remains unchanged unless the inplace=True argument is specified. By default, it keeps the first occurrence of each duplicate entry.
Here’s an example:
import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({
    'User': ['Alice', 'Bob', 'Charlie', 'Alice'],
    'Age': [25, 30, 35, 25]
})

# Remove duplicate rows
df_cleaned = df.drop_duplicates()
print(df_cleaned)
Output:
      User  Age
0    Alice   25
1      Bob   30
2  Charlie   35
The example shows the removal of duplicate rows in a DataFrame consisting of user names and ages. The drop_duplicates() method effectively keeps the first occurrence of user ‘Alice’ and discards the second.
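If you would rather modify the DataFrame in place than assign the result to a new variable, drop_duplicates() also accepts inplace=True, as mentioned above. A brief sketch reusing the same sample data:

import pandas as pd

df = pd.DataFrame({
    'User': ['Alice', 'Bob', 'Charlie', 'Alice'],
    'Age': [25, 30, 35, 25]
})

# Drops duplicates directly on df and returns None
df.drop_duplicates(inplace=True)
print(df)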
Method 2: Drop Duplicates with a Subset of Columns
Using the subset parameter of the drop_duplicates() method allows you to define a list of columns to consider when identifying duplicates. This is useful when you only want to remove duplicates based on specific columns rather than all of them.
Here’s an example:
import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({
    'User': ['Alice', 'Bob', 'Charlie', 'Alice'],
    'Age': [25, 30, 35, 26]
})

# Remove duplicate rows based on the 'User' column
df_cleaned = df.drop_duplicates(subset=['User'])
print(df_cleaned)
Output:
      User  Age
0    Alice   25
1      Bob   30
2  Charlie   35
The code example demonstrates removing duplicates from a DataFrame based on the ‘User’ column. The second occurrence of ‘Alice’, albeit with a different age, is discarded as a duplicate in terms of user name.
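The subset parameter also accepts multiple columns, in which case a row only counts as a duplicate if every listed column matches. A short sketch (the sample data is adapted for illustration):

import pandas as pd

df = pd.DataFrame({
    'User': ['Alice', 'Bob', 'Alice', 'Alice'],
    'Age': [25, 30, 25, 26]
})

# A duplicate must repeat both 'User' and 'Age'
df_cleaned = df.drop_duplicates(subset=['User', 'Age'])
print(df_cleaned)  # Keeps rows 0, 1, and 3; row 2 repeats ('Alice', 25)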
Method 3: Dropping Duplicates and Keeping the Last Occurrence
The keep parameter of drop_duplicates() can be set to ‘last’ to retain the last occurrence of each duplicate. This is particularly useful when duplicates have variations and the most recent entries (often the last ones listed) should be kept. Note that when the duplicated rows differ in some columns, as in the example below, keep='last' must be paired with subset to have any effect.
Here’s an example:
import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({
    'User': ['Alice', 'Bob', 'Charlie', 'Alice'],
    'Age': [25, 30, 35, 26]
})

# Keep the last occurrence of each user; without subset=['User'],
# no rows would be dropped, since the two 'Alice' rows differ in Age
df_cleaned = df.drop_duplicates(subset=['User'], keep='last')
print(df_cleaned)
Output:
      User  Age
1      Bob   30
2  Charlie   35
3    Alice   26
In this example, combining subset=['User'] with keep='last' retains the last occurrence of the user ‘Alice’ (age 26), while the earlier entry is considered a duplicate and removed.
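Besides ‘first’ and ‘last’, the keep parameter also accepts False, which discards every occurrence of a duplicated row rather than keeping one representative. A quick sketch:

import pandas as pd

df = pd.DataFrame({
    'User': ['Alice', 'Bob', 'Charlie', 'Alice'],
    'Age': [25, 30, 35, 25]
})

# keep=False drops all copies of any duplicated row
df_cleaned = df.drop_duplicates(keep=False)
print(df_cleaned)  # Only Bob and Charlie remain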
Method 4: Using Boolean Masking for Custom Duplicate Criteria
For more complex scenarios where custom logic determines duplicate entries, Boolean masking can be applied. This involves creating a Boolean Series that flags duplicates according to your criteria and then filtering the DataFrame based on this mask.
Here’s an example:
import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({
    'User': ['Alice', 'Bob', 'Alice', 'Charlie'],
    'Age': [25, 30, 25, 35]
})

# Create a boolean series to flag duplicates in the 'User' column
is_duplicate = df.duplicated(subset='User')

# Filter out duplicates
df_cleaned = df[~is_duplicate]
print(df_cleaned)
Output:
      User  Age
0    Alice   25
1      Bob   30
3  Charlie   35
The code above makes use of a Boolean Series to identify duplicates, inverting that series with the ~ operator to select the non-duplicate rows. This method gives you the flexibility to define custom conditions for identifying duplicates.
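For instance, you might want to treat user names as duplicates regardless of letter case, something drop_duplicates() cannot do directly. A sketch under that assumption, using lowercase normalization as the custom criterion:

import pandas as pd

df = pd.DataFrame({
    'User': ['Alice', 'ALICE', 'Bob'],
    'Age': [25, 25, 30]
})

# Flag duplicates on a case-normalized view of 'User'
is_duplicate = df['User'].str.lower().duplicated()
df_cleaned = df[~is_duplicate]
print(df_cleaned)  # 'ALICE' is dropped as a duplicate of 'Alice'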
Bonus One-Liner Method 5: Removing Duplicates with a Lambda Function
A concise one-liner approach can be achieved using a lambda function combined with drop_duplicates(). This method condenses the process into a single, albeit dense, line of code.
Here’s an example:
import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({
    'User': ['Alice', 'Bob', 'Alice', 'Charlie'],
    'Age': [25, 30, 25, 35]
})

# One-liner removal of duplicates within each 'User' group
df_cleaned = df.groupby('User').apply(lambda x: x.drop_duplicates()).reset_index(drop=True)
print(df_cleaned)
Output:
      User  Age
0    Alice   25
1      Bob   30
2  Charlie   35
Here we use a lambda function within a groupby-apply mechanism, grouping by ‘User’ and dropping duplicates within each group. The index is then reset for a clean result. This is a powerful one-liner but might be less readable for those new to these methods.
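Note that for plain exact-match duplicates like those in this sample, drop_duplicates() alone produces the same rows without the groupby machinery; the lambda variant earns its keep only when you need additional per-group logic. A sketch of the simpler equivalent:

import pandas as pd

df = pd.DataFrame({
    'User': ['Alice', 'Bob', 'Alice', 'Charlie'],
    'Age': [25, 30, 25, 35]
})

# Same unique rows, original order preserved
print(df.drop_duplicates().reset_index(drop=True))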
Summary/Discussion
- Method 1: Drop Duplicates Using drop_duplicates(). Straightforward and easy to use for simple cases. Limited to exact-match duplicates.
- Method 2: Subset of Columns. Allows for selective deduplication based on specific columns. May not be suitable when complex duplicate criteria are needed.
- Method 3: Keep Last Duplicate. Useful when you need to preserve the most recently entered duplicates, but not effective if row order is not meaningful.
- Method 4: Boolean Masking. Offers custom control over what constitutes a duplicate. Can be more complex to implement.
- Method 5: Lambda Function. A compact one-liner, but potentially less readable. Best suited for experienced users who need to write concise code.