Problem Formulation and Solution Overview
To make it more interesting, we have the following running scenario:
You have been provided with a downloadable CSV file containing crime statistics for the San Diego area, including their respective NCIC Crime Codes.
Preparation
import pandas as pd
After importing the Pandas library, this library is referenced by calling the shortcode (pd
).
Method 1: Use Pandas mode()
This example uses the mode()
method to determine the single most common crime committed in San Diego on a given day.
df = pd.read_csv('crimes.csv', usecols=['crimedescr']) max_crime = df['crimedescr'].mode() print(max_crime)
The above code reads in the crimedescr
column from the crimes.csv file downloaded earlier. This saves to the DataFrame df
.
Next, the crimedescr
column is then accessed, and the mode()
method is appended. This method returns a value or set of values that appear most often along a selected axis. The results save to max_crime
.
These results are output to the terminal.
0 10851(A)VC TAKE VEH W/O OWNER |
So, out of 7,854 rows of crimes committed on a given day for San Diego, the above offense was committed the highest number of times.
The above code only provides us with the name of the most common crime; what if we need the crime name and the respective count?
df = pd.read_csv('crimes.csv', usecols=['crimedescr', 'ucr_ncic_code']) max_crime = df['crimedescr'].mode() max_count = df['ucr_ncic_code'].mode() print(max_crime) print(max_count)
The above code is output to the terminal and displays the following.
0 10851(A)VC TAKE VEH W/O OWNER |
Now, you are equipped to return to your boss and tell them that 7,000 offenses of 10851 (A) VC TAKE VEH W/O OWNER occurred on a given day in San Diego.
Method 2: Use value_counts()
This example uses the value_counts()
function to determine the top 5 most common crimes committed in San Diego on a given day.
df = pd.read_csv('crimes.csv', usecols=['crimedescr', 'ucr_ncic_code']) top5_names = df['crimedescr'].value_counts()[:5].index.tolist() print(top5_names)
The above code reads in the crimedescr
and ucr_ncic_code
columns from the crimes.csv file downloaded earlier. This saves to the DataFrame df
.
Then, the crimedescr
column is accessed, and the value_counts()
function is appended. This function returns a series containing the counts of unique values.
However, since slicing is also appended ([:5]
), only the top five (5) common crimes are retrieved and then converted to a List. The results save to top5_names
.
['10851(A)VC TAKE VEH W/O OWNER', 'TOWED/STORED VEH-14602.6', '459 PC BURGLARY VEHICLE', 'TOWED/STORED VEHICLE', '459 PC BURGLARY RESIDENCE'] |
The above code only provides us with the names of the top 5 most common crimes; what if we need the names and their respective counts?
df = pd.read_csv('crimes.csv', usecols=['crimedescr', 'ucr_ncic_code']) top5 = df['crimedescr'].value_counts()[:5].sort_values(ascending=False) print(top5)
The above output is sent to the terminal.
10851(A)VC TAKE VEH W/O OWNER | 653 |
TOWED/STORED VEH-14602.6 | 463 |
459 PC BURGLARY VEHICLE | 462 |
TOWED/STORED VEHICLE | 434 |
459 PC BURGLARY RESIDENCE | 356 |
Name: crimedescr, dtype: int64 |
A cleaner way to achieve the same results is to use the following code.
df = pd.read_csv('crimes.csv', usecols=['crimedescr', 'ucr_ncic_code']) top5 = df['crimedescr'].value_counts().nlargest(5) print(top5)
The above code calls the nlargest()
method to determine and retrieve the top five (5) common crimes. The output is identical to the above.
10851(A)VC TAKE VEH W/O OWNER | 653 |
TOWED/STORED VEH-14602.6 | 463 |
459 PC BURGLARY VEHICLE | 462 |
TOWED/STORED VEHICLE | 434 |
459 PC BURGLARY RESIDENCE | 356 |
Name: crimedescr, dtype: int64 |
A much cleaner and more precise output to send to the boss!
Method 3: Use value_counts() and idxmax()
This example uses value_counts()
and idxmax()
to determine the single most common crime committed in San Diego on a given day.
df = pd.read_csv('crimes.csv', usecols=['crimedescr', 'ucr_ncic_code']) max_crime = df['crimedescr'].value_counts().idxmax() print(max_crime)
The above code reads in the crimedescr
and ucr_ncic_code
columns from the crimes.csv file downloaded earlier. This saves to the DataFrame df
.
Then, the crimedescr
column is accessed, and the value_counts()
function is appended. This function returns a series containing the count of unique values.
Next, idxmax()
is appended. This method returns the index of the first occurrence of the maximum index(es) over a selected axis.
The results save to max_crime
and are output to the terminal.
10851(A)VC TAKE VEH W/O OWNER |
Method 4: Use value_counts() and keys()
This example uses value_counts()
and keys()
to determine the top 5 most common crimes committed in unique grid areas of San Diego on a given day.
df = pd.read_csv('crimes.csv', usecols=['crimedescr', 'grid', 'ucr_ncic_code']) top5_grids = df['grid'].value_counts().keys()[:5] print(top5_grids)
The above code reads in the crimedescr
, grid
, and the ucr_ncic_code
columns from the crimes.csv file downloaded earlier. This saves to the DataFrame df
.
Let’s break the highlighted line down.
If df['grid'].value_counts()
was output to the terminal, the following would display (snippet). However, we have added a heading row to make it more understandable, and only five (5) rows are displayed.
Grid # | Grid Total |
742 | 115 |
969 | 105 |
958 | 100 |
564 | 80 |
1084 | 71 |
Next, the code keys()[:5]
is appended. The final output displays as follows.
Int64Index([742, 969, 958, 564, 1084], dtype='int64') |
Method 5: Use groupby()
This examples uses groupby()
to group our data on the Crime Code and displays the totals in descending order.
df = pd.read_csv('crimes.csv', usecols=['crimedescr', 'ucr_ncic_code']) res = (df.groupby(['ucr_ncic_code','crimedescr']).size() .sort_values(ascending=False) .reset_index(name='count')) print(res)
The above code reads in the crimedescr
and the ucr_ncic_code
columns from the crimes.csv file downloaded earlier. This saves to the DataFrame df
.
Next, the groupby()
function is called and passed the first argument: df.groupby(['ucr_ncic_code','crimedescr']).size()
. If this was output to the terminal at this point, the following would display (snippet).
print(df.groupby(['ucr_ncic_code','crimedescr']).size())
ucr_ncic_code | crimedescr | |
909 | 2 | |
999 | 1 | |
197 | 1 | |
664 | 1 | |
1099 | 1 |
As you can see, the other arguments need to be added to turn this into something usable. Sorting the data in descending order and adding a count column will provide the results we are looking for.
If the original Method 5 code example was output to the terminal, the following would display.
ucr_ncic_code | crimedescr | count | |
0 | 2404 | 10851(A)VC TAKE VEH W/O OWNER | 653 |
1 | 7000 | TOWED/STORED VEH-14602.6 | 463 |
2 | 2299 | 459 PC BURGLARY VEHICLE | 462 |
3 | 7000 | TOWED/STORED VEHICLE | 434 |
4 | 2204 | 459 PC BURGLARY RESIDENCE | 356 |
Summary
This article has provided five (5) ways to find the most common element in a Panda Column. These examples should provide you with enough information to select the one that best meets your coding requirements.
Good Luck & Happy Coding!
Programming Humor – Python
