💡 Problem Formulation: We are often faced with the task of cleaning up our digital address books, which includes finding and removing duplicate contacts. Suppose we have a list of contacts, where each contact is represented by a dictionary containing a name, an email, and possibly a phone number. Our output should identify contacts with identical information, indicating where the duplicates are located in the original list.
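For instance, given three contacts where the first and third hold identical data, a suitable output could be the matching index pair [(0, 2)] or the duplicated records themselves; the methods below produce one or the other.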
Method 1: Using a Loop and Comparisons
This method iterates over the list of contacts and compares each contact with every other contact to identify duplicates. While this brute-force approach is simple and requires no additional data structures, its O(n^2) complexity makes it slow on large datasets.
Here’s an example:
contacts = [
    {'name': 'John Doe', 'email': 'john@example.com'},
    {'name': 'Jane Smith', 'email': 'jane@example.com'},
    {'name': 'John Doe', 'email': 'john@example.com'}
]

duplicates = []
# Compare every pair (i, j) with i < j and record the indices of matches.
for i in range(len(contacts)):
    for j in range(i + 1, len(contacts)):
        if contacts[i] == contacts[j]:
            duplicates.append((i, j))
print('Duplicate Contacts:', duplicates)

Output:
Duplicate Contacts: [(0, 2)]
In this snippet, we compare each contact in the list with every other contact using nested loops. When a duplicate is found, the indices of the duplicate contacts are added to a list.
Method 2: Using a Set for Uniqueness
This approach uses a set to remember which contacts have already been seen; any contact whose hashable form is already in the set is a duplicate. It offers better performance than Method 1, typically running in close to O(n) time.
Here’s an example:
seen_contacts = set()
duplicates = []
for contact in contacts:
    # Dicts are unhashable, so convert each contact to a tuple of its items.
    contact_tuple = tuple(contact.items())
    if contact_tuple in seen_contacts:
        duplicates.append(contact)
    else:
        seen_contacts.add(contact_tuple)
print('Duplicate Contacts:', duplicates)

Output:

Duplicate Contacts: [{'name': 'John Doe', 'email': 'john@example.com'}]

The code snippet creates a set to track unique contacts. Each contact is converted to a tuple of its items so that it is hashable before being added to the set. If a contact's tuple is already in the set, the contact is flagged as a duplicate.
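As a quick aside, the tuple conversion is what makes the set membership test possible at all; a minimal demonstration:

contact = {'name': 'John Doe'}
try:
    {contact}                    # dicts are mutable and therefore unhashable
except TypeError as e:
    print(e)                     # unhashable type: 'dict'
print({tuple(contact.items())})  # works: {(('name', 'John Doe'),)}

Note that tuple(contact.items()) preserves insertion order, so two dicts holding the same pairs in a different key order produce different tuples; the frozenset key used in Method 4 avoids this.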
Method 3: Sorting and Grouping
This method sorts the contact list so that identical entries become adjacent, then groups those adjacent duplicates together. Although the sort is O(n log n), the approach can still be faster overall than Method 1 on larger datasets.
Here’s an example:
from itertools import groupby

# `contacts` is defined as in Method 1. groupby() only groups *adjacent*
# equal items, so the list must be sorted by the same key first; sorting
# each dict's items also makes the key independent of insertion order.
key_func = lambda c: tuple(sorted(c.items()))
duplicates = []
for key, group in groupby(sorted(contacts, key=key_func), key_func):
    group_list = list(group)
    if len(group_list) > 1:
        duplicates.extend(group_list)
print('Duplicate Contacts:', duplicates)

Output:
Duplicate Contacts: [{'name': 'John Doe', 'email': 'john@example.com'}, {'name': 'John Doe', 'email': 'john@example.com'}]

Using the groupby() function from the itertools module, we gather the sorted contacts into groups of identical entries. Any group with more than one member indicates duplicates, and its contents are added to the duplicates list.
Method 4: Using a Hash Map (Dictionary)
This method builds a hash map that maps each contact's information to the list of all contacts sharing it, allowing duplicates to be identified and grouped in close to O(n) time. This is especially beneficial when dealing with very large datasets.
Here’s an example:
from collections import defaultdict

contact_map = defaultdict(list)
for contact in contacts:
    # frozenset keys are hashable and ignore the order of the dict's items.
    contact_key = frozenset(contact.items())
    contact_map[contact_key].append(contact)

duplicates = [group for group in contact_map.values() if len(group) > 1]
print('Duplicate Contacts:', duplicates)

Output:

Duplicate Contacts: [[{'name': 'John Doe', 'email': 'john@example.com'}, {'name': 'John Doe', 'email': 'john@example.com'}]]

In this code snippet, defaultdict from the collections module maps each contact's information to a list. If more than one contact ends up with the same key, that list holds more than one entry, indicating duplicates.
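A design note on the key choice: unlike the tuple key in Method 2, a frozenset key is insensitive to the order in which a dict's keys were inserted, so two records holding the same fields in a different order still map to the same group. A minimal sketch:

a = {'name': 'John Doe', 'email': 'john@example.com'}
b = {'email': 'john@example.com', 'name': 'John Doe'}  # same data, different key order
print(tuple(a.items()) == tuple(b.items()))            # False: tuples preserve order
print(frozenset(a.items()) == frozenset(b.items()))    # True: frozensets ignore order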
Bonus One-Liner Method 5: Using List Comprehension with a Filter
This one-liner method uses a list comprehension with a filtering condition to identify duplicate contacts. Because list.count() rescans the whole list for every element, it is effectively O(n^2) and not suited to large lists, but it offers a concise, if less readable, way to find duplicates.
Here’s an example:
duplicates = [c for c in contacts if contacts.count(c) > 1]
print('Duplicate Contacts:', duplicates)

Output:

Duplicate Contacts: [{'name': 'John Doe', 'email': 'john@example.com'}, {'name': 'John Doe', 'email': 'john@example.com'}]

The comprehension iterates over the contacts and uses .count() to keep only those that appear more than once, so every copy of a duplicated contact, including its first occurrence, ends up in the result.
Summary/Discussion
- Method 1: Loop and Comparisons. Simplicity is its strength, but it performs poorly on large datasets due to its O(n^2) complexity.
- Method 2: Set for Uniqueness. Efficient for large contact lists with average-case performance close to O(n), but requires hashing of contacts.
- Method 3: Sorting and Grouping. Handles large datasets better than Method 1 thanks to the O(n log n) sort, but the sort adds overhead and is needed only to make duplicates adjacent for grouping.
- Method 4: Hash Map (Dictionary). Fast and efficient, working close to O(n) for grouping duplicates, ideal for large numbers of contacts.
- Bonus Method 5: List Comprehension with Filter. Offers brevity and appeals to functional programming enthusiasts, but its repeated .count() calls make it O(n^2) and a poor fit for large data.
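The methods above only find duplicates; since the problem formulation also calls for removing them, here is a minimal cleanup sketch, assuming the order-independent key used above, that keeps the first occurrence of each contact and drops the rest:

seen = set()
cleaned = []
for contact in contacts:
    key = tuple(sorted(contact.items()))  # hashable and independent of key order
    if key not in seen:
        seen.add(key)
        cleaned.append(contact)
print('Cleaned Contacts:', cleaned)  # each unique contact appears exactly once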
