4 Best Ways to Count Duplicates in a DataFrame

Problem Formulation and Solution Overview

This article will show you how to count duplicates in a Pandas DataFrame in Python.

To make it more fun, we have the following running scenario:

Rivers Clothing has a CSV containing all its employees. However, their CSV file has more rows than employees. This is a definite problem! They have asked you to identify the duplicate records.

πŸ’¬ Question: How would we write Python code to count duplicates in a Pandas DataFrame?

We can accomplish this task by one of the following options:

  • Method 1: Use groupby()
  • Method 2: Use a Pivot Table
  • Method 3: Use a Lambda
  • Method 4: Use duplicated()

πŸ’‘ Note: To follow along, download the Rivers Employee CSV file (rivers_emp.csv), then move this file to the current working directory. πŸ“„


Preparation

Before any data manipulation can occur, one (1) new library will require installation.

  • The Pandas library enables access to/from a DataFrame.

To install this library, navigate to an IDE terminal and execute the command below. The command prompt shown here is a dollar sign ($); your terminal prompt may differ.

$ pip install pandas

Hit the <Enter> key on the keyboard to start the installation process.

If the installation was successful, a message displays in the terminal indicating the same.


Feel free to view the PyCharm installation guide for the required library.


Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.

import pandas as pd

πŸ’‘ Note: Although this article works with the last 10 rows of the CSV, the entire file is searched for duplicates of these rows.
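
πŸ’‘ Note: If you would rather follow along without the CSV file, the hypothetical stand-in below builds a DataFrame holding the same ten (10) id values (and row numbers) that the outputs in this article display. Substitute it for the read_csv(...).tail(10) line in each method. The exact row order is an assumption reconstructed from those outputs.

import pandas as pd

# Hypothetical stand-in for rivers_emp.csv: ten id values whose
# counts and row numbers match the outputs shown in this article.
df = pd.DataFrame(
    {'id': [128, 129, 130, 131, 129, 132, 133, 140, 130, 130]},
    index=range(60, 70),
)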


Method 1: Use groupby()

This method uses groupby() to group the DataFrame on the id column and size() to count the rows in each group. It then outputs the unique ids and associated counts for the last 10 rows.

df = pd.read_csv('rivers_emp.csv', usecols=['id']).tail(10)
dups = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0:'count'})
print(dups)

This code reads in the Rivers CSV file. For brevity, only the bottom 10 rows (tail(10)) are examined.

Next, the following occurs:

  • The id column is read in (usecols=['id']), and df.columns.tolist() converts the column name(s) to a List for groupby() to group on.
  • size() counts the number of rows in each group.
  • reset_index() converts the grouped result back into a DataFrame with a default integer index.
  • rename(columns={0:'count'}) renames the generated counts column from its default label, zero (0), to count.
  • The unique ids and associated occurrences (count) are saved to dups.

The results are then output to the terminal.

Output

If you add up the counts in the right-hand column, they total 10: the same number of rows we retrieved above (tail(10)).

    id  count
0  128      1
1  129      2
2  130      3
3  131      1
4  132      1
5  133      1
6  140      1

πŸ’‘ Note: If you reviewed the CSV file in its entirety, you would see that, for example, id 129 appears two (2) times at various row numbers in the file.
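
As an aside (not part of the original solution), value_counts() can build the same id/count table in a single chained statement. Below is a minimal sketch; value_counts() orders its result by count, so sort_index() restores the id order shown above.

# Sketch: value_counts() tallies each id; the remaining calls rebuild
# the two-column id/count layout produced by groupby() above.
dups = (df['id'].value_counts()
          .sort_index()                  # order by id, as above
          .rename_axis('id')
          .reset_index(name='count'))
print(dups)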


Method 2: Use a Pivot Table

This method uses pivot_table() to group on the id column and aggregate each group by its size. It then outputs the unique ids and associated counts for the last 10 rows.

df = pd.read_csv('rivers_emp.csv', usecols=['id']).tail(10)
dups = df.pivot_table(columns=['id'], aggfunc='size')
print(dups)

This code reads in the Rivers CSV file. For brevity, only the bottom 10 rows (tail(10)) are examined.

Next, all the id column values are grouped and counted (aggfunc='size'), and the unique ids with their associated counts are output to the terminal for the last 10 rows.

Output

If you add up the counts in the right-hand column, they total 10: the same number of rows we retrieved above (tail(10)).

This output has an additional line indicating the data type (dtype: int64).

id
128    1
129    2
130    3
131    1
132    1
133    1
140    1
dtype: int64
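
Since Rivers Clothing asked you to identify the duplicate records, the pivot result (a Series of counts) can be filtered down to the ids appearing more than once. A minimal sketch:

# Sketch: keep only the ids whose count exceeds one (1).
only_dups = dups[dups > 1]
print(only_dups)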

Method 3: Use a Lambda

This method uses a Lambda with join() to build and return a string listing each duplicated id and its associated count for the last 10 rows.

df = pd.read_csv('rivers_emp.csv', usecols=['id']).tail(10)
dups = df.apply(lambda x: ' '.join([f'[val = {i}, cnt = {v}]' for i, v in x.value_counts().items() if v > 1]))
print(dups)

πŸ’‘ A Finxter Favorite!

This code reads in the Rivers CSV file. For brevity, only the bottom 10 rows (tail(10)) are examined.

Next, the Lambda tallies the id values (value_counts()) and returns a string of all unique ids having more than one occurrence (if v > 1). Finally, the results are saved to dups and output to the terminal.

Output

id    [val = 130, cnt = 3] [val = 129, cnt = 2]
dtype: object
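
Because the DataFrame holds a single column, the same report can be produced without apply() or a Lambda. The sketch below uses value_counts() directly and may read more easily:

# Sketch: tally the ids, keep counts above one (1), and format the report.
counts = df['id'].value_counts()
report = ' '.join(f'[val = {i}, cnt = {v}]'
                  for i, v in counts[counts > 1].items())
print(report)    # [val = 130, cnt = 3] [val = 129, cnt = 2]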

Method 4: Use duplicated()

This method checks for duplicate id values and returns a series of Boolean values indicating the duplicates for the last 10 rows.

df = pd.read_csv('rivers_emp.csv', usecols=['id']).tail(10)
print(df.duplicated(subset='id'))

This code reads in the Rivers CSV file. For brevity, only the bottom 10 rows (tail(10)) are examined.

Output

The column on the left-hand side displays the row numbers of the examined rows. The right-hand column shows True if the row's id duplicates an earlier one; otherwise, False.

This output has an additional line indicating the data type (dtype: bool).

60    False
61    False
62    False
63    False
64     True
65    False
66    False
67    False
68     True
69     True
dtype: bool
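
Since Booleans behave as integers (True is 1), the Series above can be summed to get a total duplicate count. Passing keep=False to duplicated() instead flags every occurrence of a duplicated id, not just the repeats. A minimal sketch:

# Sketch: count the duplicate rows (each True sums as 1).
print(df.duplicated(subset='id').sum())              # 3

# Sketch: keep=False marks all occurrences of a duplicated id.
print(df[df.duplicated(subset='id', keep=False)])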

Summary

These four (4) methods to count duplicates in a Pandas DataFrame should give you enough information to select the best one for your coding requirements.

Good Luck & Happy Coding!