Problem Formulation and Solution Overview
💡 Note: To follow along, download the Rivers Employee CSV file, then move this file to the current working directory. 📄
Preparation
Before any data manipulation can occur, one (1) new library will require installation.
- The Pandas library provides the DataFrame structure used throughout this article.
To install this library, navigate to an IDE terminal and execute the command below. The command prompt used in this example is a dollar sign ($); your terminal prompt may differ.
$ pip install pandas
Hit the <Enter> key on the keyboard to start the installation process.
If the installation was successful, a message displays in the terminal indicating the same.
Feel free to view the PyCharm installation guide for the required library.
Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.
import pandas as pd
💡 Note: Although this article works with the last 10 rows of the CSV, the entire file is searched for duplicates of these rows.
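💡 Note: If you don't have the Rivers CSV file handy, the stand-in below creates a hypothetical `rivers_emp.csv` (invented data, not the real file) whose last 10 rows contain the same id values shown in the outputs of this article:

```python
import pandas as pd

# Hypothetical stand-in for rivers_emp.csv: 60 filler rows followed by the
# 10 id values that appear in this article's example outputs.
ids = list(range(200, 260)) + [128, 129, 130, 131, 129, 132, 133, 140, 130, 130]
pd.DataFrame({'id': ids}).to_csv('rivers_emp.csv', index=False)
```

With this file in place, every snippet below runs as written.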
Method 1: Use groupby()
This method uses groupby() to group the rows on the id column and count the size of each group. It then outputs the unique ids and their associated counts for the last 10 rows.
df = pd.read_csv('rivers_emp.csv', usecols=['id']).tail(10)
dups = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0:'count'})
print(dups)
This code reads in the Rivers CSV file. For brevity, only the bottom 10 rows (tail(10)) are examined.
Next, the following occurs:
- The id column is passed in (usecols=['id']), and the column names are converted to a List (df.columns.tolist()) for grouping.
- The size of each group is determined (size()).
- The DataFrame index is reset (reset_index()) to the default value, which places the counts in a column named 0.
- That column is renamed from 0 to count (rename(columns={0:'count'})).
- The unique ids and associated occurrences (count) save to dups.
The results are then output to the terminal.
Output
If you add up the values in the count column, they total 10: the same number of rows we retrieved above (tail(10)).
|   | id | count |
| 0 | 128 | 1 |
| 1 | 129 | 2 |
| 2 | 130 | 3 |
| 3 | 131 | 1 |
| 4 | 132 | 1 |
| 5 | 133 | 1 |
| 6 | 140 | 1 |
💡 Note: If you reviewed the CSV file in its entirety, you would see that, for example, id 129 appears two (2) times at various row numbers in the file.
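As an aside, the same per-id counts can be obtained more directly with value_counts(). The sketch below runs on an in-memory DataFrame whose id values are assumed to match the last 10 rows of the CSV:

```python
import pandas as pd

# Hypothetical id values standing in for the last 10 rows of the CSV.
df = pd.DataFrame({'id': [128, 129, 130, 131, 129, 132, 133, 140, 130, 130]})

counts = (df['id'].value_counts()            # count each unique id
                  .sort_index()              # order by id rather than by count
                  .rename_axis('id')         # name the index for reset_index()
                  .reset_index(name='count'))  # back to a two-column DataFrame
print(counts)
```

This yields the same id/count pairs as the groupby() approach, in one chained expression.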

Method 2: Use a Pivot Table
This method uses pivot_table() to group on the id column and compute the size of each group. It then outputs the unique ids and associated counts for the last 10 rows.
df = pd.read_csv('rivers_emp.csv', usecols=['id']).tail(10)
dups = df.pivot_table(columns=['id'], aggfunc='size')
print(dups)
This code reads in the Rivers CSV file. For brevity, only the bottom 10 rows (tail(10)) are examined.
Next, all the id column values are evaluated (aggfunc='size'), and the unique ids and associated count are output to the terminal for the last 10 rows.
Output
If you add up the values in the count column, they total 10: the same number of rows we retrieved above (tail(10)).
This example has an additional row indicating Data Type.
| id | count |
| 128 | 1 |
| 129 | 2 |
| 130 | 3 |
| 131 | 1 |
| 132 | 1 |
| 133 | 1 |
| 140 | 1 |
| dtype: int64 |
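To confirm that pivot_table() with aggfunc='size' agrees with the groupby() approach from Method 1, here is a small sanity check on assumed in-memory data:

```python
import pandas as pd

# Hypothetical id values standing in for the last 10 rows of the CSV.
df = pd.DataFrame({'id': [128, 129, 130, 131, 129, 132, 133, 140, 130, 130]})

by_pivot = df.pivot_table(columns=['id'], aggfunc='size')  # Series indexed by id
by_group = df.groupby('id').size()                         # same counts via groupby

print(by_pivot.tolist() == by_group.tolist())
```

Both calls sort by id and produce identical counts; which one you pick is a matter of taste.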
Method 3: Use a Lambda
This method uses a Lambda with join() to extract and return duplicate unique id values and their associated counts for the last 10 rows.
df = pd.read_csv('rivers_emp.csv', usecols=['id']).tail(10)
dups = df.apply(lambda x: ' '.join([f'[val = {i}, cnt = {v}]' for i, v in x.value_counts().items() if v > 1]))
print(dups)
💡 A Finxter Favorite!
This code reads in the Rivers CSV file. For brevity, only the bottom 10 rows (tail(10)) are examined.
Next, the Lambda checks for and returns an object containing all unique ids having more than one entry. Finally, the results save to dups and are output to the terminal.
Output
id    [val = 130, cnt = 3] [val = 129, cnt = 2]
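The duplicates-only view the Lambda produces can also be sketched by filtering value_counts() directly, again on assumed in-memory data:

```python
import pandas as pd

# Hypothetical id values standing in for the last 10 rows of the CSV.
df = pd.DataFrame({'id': [128, 129, 130, 131, 129, 132, 133, 140, 130, 130]})

counts = df['id'].value_counts()
dups = counts[counts > 1]     # keep only ids occurring more than once
print(dups.to_dict())         # {130: 3, 129: 2}
```

This avoids string formatting entirely and leaves the result as a Series you can keep working with.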
Method 4: Use duplicated()
This method checks for duplicate id values and returns a series of Boolean values indicating the duplicates for the last 10 rows.
df = pd.read_csv('rivers_emp.csv', usecols=['id']).tail(10)
print(df.duplicated(subset='id'))
This code reads in the Rivers CSV file. For brevity, only the bottom 10 rows (tail(10)) are examined.
Output:
The left-hand column displays the row number. The right-hand column indicates True if that row's id duplicates an earlier row; otherwise False.
This example has an additional row indicating Data Type.
| 60 | False |
| 61 | False |
| 62 | False |
| 63 | False |
| 64 | True |
| 65 | False |
| 66 | False |
| 67 | False |
| 68 | True |
| 69 | True |
| dtype: bool |
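The Boolean Series returned by duplicated() is also handy for counting or dropping the duplicate rows. A short sketch on assumed in-memory data:

```python
import pandas as pd

# Hypothetical id values standing in for the last 10 rows of the CSV.
df = pd.DataFrame({'id': [128, 129, 130, 131, 129, 132, 133, 140, 130, 130]})

mask = df.duplicated(subset='id')   # True for each repeat of an earlier id
print(mask.sum())                   # total duplicate rows: 3
print(df[~mask]['id'].tolist())     # [128, 129, 130, 131, 132, 133, 140]
```

Summing a Boolean Series counts the True values, and inverting the mask (~mask) keeps only the first occurrence of each id.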
Summary
These four (4) methods to count duplicates in a Pandas DataFrame should give you enough information to select the best one for your coding requirements.
Good Luck & Happy Coding!