Problem Formulation and Solution Overview
π‘ Note: To follow along, click here to download the Rivers Employee CSV file, then move this file to the current working directory. π
Preparation
Before any data manipulation can occur, one (1) new library will require installation.
- The Pandas library enables access to/from a DataFrame.
To install this library, navigate to an IDE terminal. At the command prompt ($
), execute the code below. For the terminal used in this example, the command prompt is a dollar sign ($
). Your terminal prompt may be different.
$ pip install pandas
Hit the <Enter>
key on the keyboard to start the installation process.
If the installation was successful, a message displays in the terminal indicating the same.
Feel free to view the PyCharm installation guide for the required library.
Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.
import pandas as pd
π‘ Note: Although this article works with the last 10 rows of the CSV, the entire file is searched for duplicates of these rows.
Method 1: Use groupby()
This method uses groupby()
to create a subgroup containing all id
values in a List format. It then checks for and outputs unique ids
and associated counts for the last 10 rows.
df = pd.read_csv('rivers_emp.csv', usecols=['id']).tail(10) dups = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0:'count'}) print(dups)
This code reads in the Rivers CSV file. For brevity, only the bottom 10 rows (tail(10)
) are examined.
Next, the following occurs:
- The
id
column is passed in (usecols=['id']
) and converted to a List (tolist()
). - The total number of rows is determined (
size()
). - The DataFrame index is reset (
reset_index()
) to the default value. - A new column is created, renamed (
rename(columns={0:'count'}
), and the index count is set to zero (0
), the default value. - The unique
ids
and associated occurrences (count
) save todups
.
The results are then output to the terminal.
Output
If you add up the count
on the right-hand side, they add up to 10: the same number of rows we retrieved above (tail(10)
).
id | count | |
0 | 128 | 1 |
1 | 129 | 2 |
2 | 130 | 3 |
3 | 131 | 1 |
4 | 132 | 1 |
5 | 133 | 1 |
6 | 140 | 1 |
π‘ Note: If you reviewed the CSV file in its entirety, you would see that, for example, id 129
appears two (2) times at various row numbers in the file.
Method 2: Use a Pivot Table
This method uses a pivot_table()
to extract values from the id
column and retrieve the size of the same. It then checks for and outputs unique ids
and associated counts for the last 10 rows.
df = pd.read_csv('rivers_emp.csv', usecols=['id']).tail(10) dups = df.pivot_table(columns=['id'], aggfunc='size') print(dups)
This code reads in the Rivers CSV file. For brevity, only the bottom 10 rows (tail(10)
) are examined.
Next, all the id
column values are evaluated (aggfunc='size'
), and the unique ids
and associated count
are output to the terminal for the last 10 rows.
Output
If you add up the count
on the right-hand side, they add up to 10: the same number of rows we retrieved above (tail(10)
).
This example has an additional row indicating Data Type.
id | count |
128 | 1 |
129 | 2 |
130 | 3 |
131 | 1 |
132 | 1 |
133 | 1 |
140 | 1 |
dtype: int64 |
Method 3: Use a Lambda
This method uses a Lambda
with join()
to extract and return duplicate unique id
values and their associated counts for the last 10 rows.
df = pd.read_csv('rivers_emp.csv', usecols=['id']).tail(10) dups = df.apply(lambda x: ' '.join([f'[val = {i}, cnt = {v}]' for i, v in x.value_counts().iteritems() if v > 1])) print(dups)
π‘A Finxter Favorite!
This code reads in the Rivers CSV file. For brevity, only the bottom 10 rows (tail(10)
) are examined.
Next, the Lambda
checks for and return an object of all unique ids
having more than one entry. Finally, the results save to dups
and are output to the terminal.
Output
id [val = 130, cnt = 3] [val = 129, cnt = 2] |
Method 4: Use duplicated()
This method checks for duplicate id
values and returns a series of Boolean values indicating the duplicates for the last 10 rows.
df = pd.read_csv('rivers_emp.csv', usecols=['id']).tail(10) print(df.duplicated(subset='id'))
This code reads in the Rivers CSV file. For brevity, only the bottom 10 rows (tail(10)
) are examined.
Output:
The column on the left-hand side displays the row number containing the duplicate id
values. The right column indicates True if a duplicate row was found. Otherwise False.
This example has an additional row indicating Data Type.
60 | False |
61 | False |
62 | False |
63 | False |
64 | True |
65 | False |
66 | False |
67 | False |
68 | True |
68 | True |
dtype: bool |
Summary
These four (4) methods to count duplicates in a Pandas DataFrame should give you enough information to select the best one for your coding requirements.
Good Luck & Happy Coding!