Python – Finding the Most Common Element in a Column

5/5 - (6 votes)

Problem Formulation and Solution Overview

This article will show you how to find the most common element in a Pandas Column.

To make it more interesting, we have the following running scenario:

You have been provided with a downloadable CSV file containing crime statistics for the San Diego area, including their respective NCIC Crime Codes.


πŸ’¬ Question: How would you determine the most common NCIC Crime Code that occurs in San Diego’s jurisdiction?

We can accomplish this task by one of the following options:


Preparation

Before moving forward, please ensure the Pandas library is installed. Click here if you require instructions.

Then, add the following code to the top of each script. This snippet will allow the code in this article to run error-free.

import pandas as pd

After importing the Pandas library, this library is referenced by calling the shortcode (pd).


Method 1: Use Pandas mode()

This example uses the mode() method to determine the single most common crime committed in San Diego on a given day.

df = pd.read_csv('crimes.csv', usecols=['crimedescr'])
max_crime = df['crimedescr'].mode()
print(max_crime)

The above code reads in the crimedescr column from the crimes.csv file downloaded earlier. This saves to the DataFrame df.

Next, the crimedescr column is then accessed, and the mode() method is appended. This method returns a value or set of values that appear most often along a selected axis. The results save to max_crime.

These results are output to the terminal.

0 10851(A)VC TAKE VEH W/O OWNER
Name: crimedescr, dtype: object

So, out of 7,854 rows of crimes committed on a given day for San Diego, the above offense was committed the highest number of times.

The above code only provides us with the name of the most common crime; what if we need the crime name and the respective count?

df = pd.read_csv('crimes.csv', usecols=['crimedescr', 'ucr_ncic_code'])
max_crime = df['crimedescr'].mode()
max_count = df['ucr_ncic_code'].mode()

print(max_crime)
print(max_count)

The above code is output to the terminal and displays the following.

0 10851(A)VC TAKE VEH W/O OWNER
Name: crimedescr, dtype: object
0 7000
Name: ucr_ncic_code, dtype: int64

Now, you are equipped to return to your boss and tell them that 7,000 offenses of 10851 (A) VC TAKE VEH W/O OWNER occurred on a given day in San Diego.

10 Minutes to Pandas in 5 Minutes (Okay 8)

Method 2: Use value_counts()

This example uses the value_counts() function to determine the top 5 most common crimes committed in San Diego on a given day.

df = pd.read_csv('crimes.csv', usecols=['crimedescr', 'ucr_ncic_code'])
top5_names = df['crimedescr'].value_counts()[:5].index.tolist()
print(top5_names)

The above code reads in the crimedescr and ucr_ncic_code columns from the crimes.csv file downloaded earlier. This saves to the DataFrame df.

Then, the crimedescr column is accessed, and the value_counts() function is appended. This function returns a series containing the counts of unique values.

However, since slicing is also appended ([:5]), only the top five (5) common crimes are retrieved and then converted to a List. The results save to top5_names.

['10851(A)VC TAKE VEH W/O OWNER', 'TOWED/STORED VEH-14602.6', '459 PC BURGLARY VEHICLE', 'TOWED/STORED VEHICLE', '459 PC BURGLARY RESIDENCE']

The above code only provides us with the names of the top 5 most common crimes; what if we need the names and their respective counts?

df = pd.read_csv('crimes.csv', usecols=['crimedescr', 'ucr_ncic_code'])
top5 = df['crimedescr'].value_counts()[:5].sort_values(ascending=False)
print(top5)

The above output is sent to the terminal.

10851(A)VC TAKE VEH W/O OWNER653
TOWED/STORED VEH-14602.6463
459 PC BURGLARY VEHICLE462
TOWED/STORED VEHICLE434
459 PC BURGLARY RESIDENCE356
Name: crimedescr, dtype: int64
The Ultimate Guide to Slicing in Python

A cleaner way to achieve the same results is to use the following code.

df = pd.read_csv('crimes.csv', usecols=['crimedescr', 'ucr_ncic_code'])
top5 = df['crimedescr'].value_counts().nlargest(5)
print(top5)

The above code calls the nlargest() method to determine and retrieve the top five (5) common crimes. The output is identical to the above.

10851(A)VC TAKE VEH W/O OWNER653
TOWED/STORED VEH-14602.6463
459 PC BURGLARY VEHICLE462
TOWED/STORED VEHICLE434
459 PC BURGLARY RESIDENCE356
Name: crimedescr, dtype: int64

A much cleaner and more precise output to send to the boss!


Method 3: Use value_counts() and idxmax()

This example uses value_counts() and idxmax() to determine the single most common crime committed in San Diego on a given day.

df = pd.read_csv('crimes.csv', usecols=['crimedescr', 'ucr_ncic_code'])
max_crime = df['crimedescr'].value_counts().idxmax()
print(max_crime)

The above code reads in the crimedescr and ucr_ncic_code columns from the crimes.csv file downloaded earlier. This saves to the DataFrame df.

Then, the crimedescr column is accessed, and the value_counts() function is appended. This function returns a series containing the count of unique values.

Next, idxmax() is appended. This method returns the index of the first occurrence of the maximum index(es) over a selected axis.

The results save to max_crime and are output to the terminal.

10851(A)VC TAKE VEH W/O OWNER

Method 4: Use value_counts() and keys()

This example uses value_counts() and keys() to determine the top 5 most common crimes committed in unique grid areas of San Diego on a given day.

df = pd.read_csv('crimes.csv', usecols=['crimedescr', 'grid', 'ucr_ncic_code'])
top5_grids = df['grid'].value_counts().keys()[:5]
print(top5_grids)

The above code reads in the crimedescr, grid, and the ucr_ncic_code columns from the crimes.csv file downloaded earlier. This saves to the DataFrame df.

Let’s break the highlighted line down.

If df['grid'].value_counts() was output to the terminal, the following would display (snippet). However, we have added a heading row to make it more understandable, and only five (5) rows are displayed.

Grid #Grid Total
742115
969105
958100
56480
108471

Next, the code keys()[:5] is appended. The final output displays as follows.

Int64Index([742, 969, 958, 564, 1084], dtype='int64')

Method 5: Use groupby()

This examples uses groupby() to group our data on the Crime Code and displays the totals in descending order.

df = pd.read_csv('crimes.csv', usecols=['crimedescr', 'ucr_ncic_code'])

res = (df.groupby(['ucr_ncic_code','crimedescr']).size() 
       .sort_values(ascending=False) 
       .reset_index(name='count'))
print(res)

The above code reads in the crimedescr and the ucr_ncic_code columns from the crimes.csv file downloaded earlier. This saves to the DataFrame df.

Next, the groupby() function is called and passed the first argument: df.groupby(['ucr_ncic_code','crimedescr']).size(). If this was output to the terminal at this point, the following would display (snippet).

print(df.groupby(['ucr_ncic_code','crimedescr']).size())
ucr_ncic_codecrimedescr
9092
9991
1971
6641
10991

As you can see, the other arguments need to be added to turn this into something usable. Sorting the data in descending order and adding a count column will provide the results we are looking for.

If the original Method 5 code example was output to the terminal, the following would display.

ucr_ncic_codecrimedescrcount
0240410851(A)VC TAKE VEH W/O OWNER653
17000TOWED/STORED VEH-14602.6463
22299459 PC BURGLARY VEHICLE462
37000TOWED/STORED VEHICLE434
42204459 PC BURGLARY RESIDENCE356
The Pandas groupby() Method

Summary

This article has provided five (5) ways to find the most common element in a Panda Column. These examples should provide you with enough information to select the one that best meets your coding requirements.

Good Luck & Happy Coding!


Programming Humor – Python

“I wrote 20 short programs in Python yesterday. It was wonderful. Perl, I’m leaving you.”xkcd