Python - Finding the Most Common Element in a Column

Problem Formulation and Solution Overview

This article will show you how to find the most common element in a Pandas Column.

To make it more interesting, we have the following running scenario:

You have been provided with a downloadable CSV file containing crime statistics for the San Diego area, including their respective NCIC Crime Codes.

💬 Question: How would you determine the most common NCIC Crime Code that occurs in San Diego’s jurisdiction?

We can accomplish this task by one of the following options:

Method 1: Use Pandas mode()
Method 2: Use Pandas value_counts()
Method 3: Use value_counts() and idxmax()
Method 4: Use value_counts() and keys()
Method 5: Use Pandas groupby()

Preparation

Before moving forward, please ensure the Pandas library is installed. Click here if you require instructions.

Then, add the following code to the top of each script. This snippet will allow the code in this article to run error-free.

import pandas as pd

After importing the Pandas library, this library is referenced by calling the shortcode (pd).

Method 1: Use Pandas mode()

This example uses the mode() method to determine the single most common crime committed in San Diego on a given day.

df = pd.read_csv('crimes.csv', usecols=['crimedescr'])
max_crime = df['crimedescr'].mode()
print(max_crime)

The above code reads in the crimedescr column from the crimes.csv file downloaded earlier. This saves to the DataFrame df.

Next, the crimedescr column is then accessed, and the mode() method is appended. This method returns a value or set of values that appear most often along a selected axis. The results save to max_crime.

These results are output to the terminal.

0 10851(A)VC TAKE VEH W/O OWNER
Name: crimedescr, dtype: object

So, out of 7,854 rows of crimes committed on a given day for San Diego, the above offense was committed the highest number of times.

The above code only provides us with the name of the most common crime; what if we need the crime name and the respective count?

df = pd.read_csv('crimes.csv', usecols=['crimedescr', 'ucr_ncic_code'])
max_crime = df['crimedescr'].mode()
max_count = df['ucr_ncic_code'].mode()

print(max_crime)
print(max_count)

The above code is output to the terminal and displays the following.

0 10851(A)VC TAKE VEH W/O OWNER
Name: crimedescr, dtype: object
0 7000
Name: ucr_ncic_code, dtype: int64

Now, you are equipped to return to your boss and tell them that 7,000 offenses of 10851 (A) VC TAKE VEH W/O OWNER occurred on a given day in San Diego.

Method 2: Use value_counts()

This example uses the value_counts() function to determine the top 5 most common crimes committed in San Diego on a given day.

df = pd.read_csv('crimes.csv', usecols=['crimedescr', 'ucr_ncic_code'])
top5_names = df['crimedescr'].value_counts()[:5].index.tolist()
print(top5_names)

The above code reads in the crimedescr and ucr_ncic_code columns from the crimes.csv file downloaded earlier. This saves to the DataFrame df.

Then, the crimedescr column is accessed, and the value_counts() function is appended. This function returns a series containing the counts of unique values.

However, since slicing is also appended ([:5]), only the top five (5) common crimes are retrieved and then converted to a List. The results save to top5_names.

['10851(A)VC TAKE VEH W/O OWNER', 'TOWED/STORED VEH-14602.6', '459 PC BURGLARY VEHICLE', 'TOWED/STORED VEHICLE', '459 PC BURGLARY RESIDENCE']

The above code only provides us with the names of the top 5 most common crimes; what if we need the names and their respective counts?

df = pd.read_csv('crimes.csv', usecols=['crimedescr', 'ucr_ncic_code'])
top5 = df['crimedescr'].value_counts()[:5].sort_values(ascending=False)
print(top5)

The above output is sent to the terminal.

10851(A)VC TAKE VEH W/O OWNER	653
TOWED/STORED VEH-14602.6	463
459 PC BURGLARY VEHICLE	462
TOWED/STORED VEHICLE	434
459 PC BURGLARY RESIDENCE	356
Name: crimedescr, dtype: int64

A cleaner way to achieve the same results is to use the following code.

df = pd.read_csv('crimes.csv', usecols=['crimedescr', 'ucr_ncic_code'])
top5 = df['crimedescr'].value_counts().nlargest(5)
print(top5)

The above code calls the nlargest() method to determine and retrieve the top five (5) common crimes. The output is identical to the above.

10851(A)VC TAKE VEH W/O OWNER	653
TOWED/STORED VEH-14602.6	463
459 PC BURGLARY VEHICLE	462
TOWED/STORED VEHICLE	434
459 PC BURGLARY RESIDENCE	356
Name: crimedescr, dtype: int64

A much cleaner and more precise output to send to the boss!

Method 3: Use value_counts() and idxmax()

This example uses value_counts() and idxmax() to determine the single most common crime committed in San Diego on a given day.

df = pd.read_csv('crimes.csv', usecols=['crimedescr', 'ucr_ncic_code'])
max_crime = df['crimedescr'].value_counts().idxmax()
print(max_crime)

The above code reads in the crimedescr and ucr_ncic_code columns from the crimes.csv file downloaded earlier. This saves to the DataFrame df.

Then, the crimedescr column is accessed, and the value_counts() function is appended. This function returns a series containing the count of unique values.

Next, idxmax() is appended. This method returns the index of the first occurrence of the maximum index(es) over a selected axis.

The results save to max_crime and are output to the terminal.

10851(A)VC TAKE VEH W/O OWNER

Method 4: Use value_counts() and keys()

This example uses value_counts() and keys() to determine the top 5 most common crimes committed in unique grid areas of San Diego on a given day.

df = pd.read_csv('crimes.csv', usecols=['crimedescr', 'grid', 'ucr_ncic_code'])
top5_grids = df['grid'].value_counts().keys()[:5]
print(top5_grids)

The above code reads in the crimedescr, grid, and the ucr_ncic_code columns from the crimes.csv file downloaded earlier. This saves to the DataFrame df.

Let’s break the highlighted line down.

If df['grid'].value_counts() was output to the terminal, the following would display (snippet). However, we have added a heading row to make it more understandable, and only five (5) rows are displayed.

Grid #	Grid Total
742	115
969	105
958	100
564	80
1084	71

Next, the code keys()[:5] is appended. The final output displays as follows.

Int64Index([742, 969, 958, 564, 1084], dtype='int64')

Method 5: Use groupby()

This examples uses groupby() to group our data on the Crime Code and displays the totals in descending order.

df = pd.read_csv('crimes.csv', usecols=['crimedescr', 'ucr_ncic_code'])

res = (df.groupby(['ucr_ncic_code','crimedescr']).size() 
       .sort_values(ascending=False) 
       .reset_index(name='count'))
print(res)

The above code reads in the crimedescr and the ucr_ncic_code columns from the crimes.csv file downloaded earlier. This saves to the DataFrame df.

Next, the groupby() function is called and passed the first argument: df.groupby(['ucr_ncic_code','crimedescr']).size(). If this was output to the terminal at this point, the following would display (snippet).

print(df.groupby(['ucr_ncic_code','crimedescr']).size())

ucr_ncic_code	crimedescr
909		2
999		1
197		1
664		1
1099		1

As you can see, the other arguments need to be added to turn this into something usable. Sorting the data in descending order and adding a count column will provide the results we are looking for.

If the original Method 5 code example was output to the terminal, the following would display.

	ucr_ncic_code	crimedescr	count
0	2404	10851(A)VC TAKE VEH W/O OWNER	653
1	7000	TOWED/STORED VEH-14602.6	463
2	2299	459 PC BURGLARY VEHICLE	462
3	7000	TOWED/STORED VEHICLE	434
4	2204	459 PC BURGLARY RESIDENCE	356

Summary

This article has provided five (5) ways to find the most common element in a Panda Column. These examples should provide you with enough information to select the one that best meets your coding requirements.

Good Luck & Happy Coding!

Python – Finding the Most Common Element in a Column