When weโre working on real-world data, the data files that we receive will be huge. In such cases, cleansing the data becomes important. Often, we might have to count the occurrences of elements in a column. Doing so, gives us insights like:
- The number of times the value has been duplicated.
- Count corresponding to a particular value.
- Checking the number of missing values from the input.
Consider a data frame as shown below.
data = {'Name' : ['Alice','Bob','Steve','George','Bob','Steve','George','Scottie'], 'Gender' : ['F','M','M','M',np.nan,'M','M','F'], 'Rank': [3,4,np.nan,4,5,6,np.nan,5]}
We can count the occurrence of the elements in a column to get some insights into the data.
For example, to find the count of duplicates of the candidates, or to find the number of male and female candidates or to find the number of missing rank values, etc.
In this article, let’s learn different methods to find the count of occurrences of the elements in a column. Without any further delay, let’s get started.
Method 1: Using dataframe.value_counts()
We can use the dataframe.value_counts()
method to find the count of occurrences of the elements in a column.
Just say,
dataframe['ColumnName'].value_counts()
If you want to find the count of occurrences of a combination of columns use the following,
dataframe[['ColumnName1','ColumnName2',....,'ColumnNameN'].value_counts()
Example: Let’s say we have to find the count of male and female candidates from the above dataframe. We can use dataframe.count_values()
as shown below.
import numpy as np import pandas as pd data = {'Name' : ['Alice','Bob','Steve','George','Bounty','Sitara','George','Scottie'], 'Gender' : ['F','M','M','M',np.nan,'M','M','F'], 'Rank': [3,4,np.nan,4,5,6,np.nan,5]} df=pd.DataFrame(data) print(df['Gender'].value_counts())
Output:
M 5 F 2 Name: Gender, dtype: int64
As we can see, it gives us the count of male and female candidates. On careful examination, we can notice that the count of nan values is not shown.
By default, dataframe.value_counts()
does not specify the missing values. To specify the missing values, we have to pass a parameter dropna=False
as shown below.
dataframe.value_counts(dropna=False)
Letโs use this in our example and see what happens.
import numpy as np import pandas as pd data = {'Name' : ['Alice','Bob','Steve','George','Bounty','Sitara','George','Scottie'], 'Gender' : ['F','M','M','M',np.nan,'M','M','F'], 'Rank': [3,4,np.nan,4,5,6,np.nan,5]} df=pd.DataFrame(data) print(df['Gender'].value_counts(dropna=False))
Output:
M 5 F 2 NaN 1 Name: Gender, dtype: int64
As seen in the output, the count of missing values is displayed.
We can also find the count of occurrences from the combination of columns as shown below.
import numpy as np import pandas as pd data = {'Name' : ['Alice','Bob','Steve','George','Bounty','Sitara','George','Scottie'], 'Gender' : ['F','M','M','M',np.nan,'M','M','F'], 'Rank': [3,4,np.nan,4,5,6,np.nan,5], 'Value': [1,2,3,1,2,3,1,2]} df=pd.DataFrame(data) print(df[['Name','Rank']].value_counts())
Output:
Name Rank Alice 3.0 1 Bob 4.0 1 Bounty 5.0 1 George 4.0 1 Scottie 5.0 1 Sitara 6.0 1
Method 2: Using dataframe.groupby()
We can also use the methods dataframe.groupby().size()
or dataframe.groupby().count()
to find the count of occurrences of the Column Values using the below syntax.
dataframe.groupby('ColumnName').size()
or
dataframe.groupby('ColumnName').count()
If you count the ocurrences of the group of multiple columns use,
dataframe.groupby(['ColumnName1','ColumnName2',......'ColumnNameN']).size()
or
dataframe.groupby(['ColumnName1','ColumnName2',......'ColumnNameN']).count()
Example: Letโs say we have to find the count of occurrences of the name. We can use dataframe.groupby().size()
as shown below
import numpy as np import pandas as pd data = {'Name' : ['Alice','Bob','Steve','George','Bounty','Sitara','George','Scottie'], 'Gender' : ['F','M','M','M',np.nan,'M','M','F'], 'Rank': [3,4,np.nan,4,5,6,np.nan,5]} df=pd.DataFrame(data) print(df.groupby('Name').size())
Output:
Name Alice 1 Bob 1 Bounty 1 George 2 Scottie 1 Sitara 1 Steve 1 dtype: int64
Alternatively, we can also use dataframe.groupby().count()
as shown below.
import numpy as np import pandas as pd data = {'Name' : ['Alice','Bob','Steve','George','Bounty','Sitara','George','Scottie'], 'Gender' : ['F','M','M','M',np.nan,'M','M','F'], 'Rank': [3,4,np.nan,4,5,6,np.nan,5]} df=pd.DataFrame(data) print(df.groupby('Name').count())
Output:
Gender Rank Name Alice 1 1 Bob 1 1 Bounty 0 1 George 2 1 Scottie 1 1 Sitara 1 1 Steve 1 0
We can also find the count of occurrences from the combination of columns using groupby()
as shown below.
import numpy as np import pandas as pd data = {'Name' : ['Alice','Bob','Steve','George','Bounty','Sitara','George','Scottie'], 'Gender' : ['F','M','M','M',np.nan,'M','M','F'], 'Rank': [3,4,np.nan,4,5,6,np.nan,5], } df=pd.DataFrame(data) print(df.groupby(['Name','Rank']).count())
Output:
Gender Name Rank Alice 3.0 1 Bob 4.0 1 Bounty 5.0 0 George 4.0 1 Scottie 5.0 1 Sitara 6.0 1
Measuring the time complexity of these methods
Now that weโve seen different ways of finding the count of occurrences of the Col elements, let’s have a look at the execution times using different methods.
import numpy as np import pandas as pd from time import perf_counter_ns data = {'Name' : ['Alice','Bob','Steve','George','Bounty','Sitara','George','Scottie'], 'Gender' : ['F','M','M','M',np.nan,'M','M','F'], 'Rank': [3,4,np.nan,4,5,6,np.nan,5]} df=pd.DataFrame(data) # Using groupby.size() start=perf_counter_ns() df.groupby('Name').size() end=perf_counter_ns() print("Time taken uisng groupby().size() : %7dns"%(end-start)) # Using groupby().count() start=perf_counter_ns() df.groupby('Name').count() end=perf_counter_ns() print("Time taken uisng groupby().count() : %7dns"%(end-start)) # Using value_counts() start=perf_counter_ns() df['Name'].value_counts() end=perf_counter_ns() print("Time taken uisng value_counts() : %7dns"%(end-start))
Output:
Time taken using groupby().size() : 820000ns Time taken using groupby().count() : 666000ns Time taken using value_counts() : 235800ns
When weโre finding the count of occurrences of one column, value_counts()
is faster.
However, when weโre finding the count of occurrences from a combination of several columns, groupby()
methods are faster. Refer to the below code for more details.
import numpy as np import pandas as pd from time import perf_counter_ns data = {'Name' : ['Alice','Bob','Steve','George','Bounty','Sitara','George','Scottie'], 'Gender' : ['F','M','M','M',np.nan,'M','M','F'], 'Rank': [3,4,np.nan,4,5,6,np.nan,5], } df=pd.DataFrame(data) # Using groupby.size() start=perf_counter_ns() df.groupby(['Name','Rank']).size() end=perf_counter_ns() print("Time taken uisng groupby().size() : %7dns"%(end-start)) # Using groupby().count() start=perf_counter_ns() df.groupby(['Name','Rank']).count() end=perf_counter_ns() print("Time taken uisng groupby().count() : %7dns"%(end-start)) # Using value_counts() start=perf_counter_ns() df[['Name','Rank','Gender']].value_counts() end=perf_counter_ns() print("Time taken uisng value_counts() : %7dns"%(end-start))
Output:
Time taken uisng groupby().size() : 544800ns Time taken uisng groupby().count() : 653100ns Time taken uisng value_counts() : 925300ns
Conclusion
That brings us to the end of this short tutorial. We have seen different methods to count the occurrences of the values in a column. We have also checked the time complexities of these methods.
We hope this has been informative. Thanks for reading. Do you have any questions? Feel free to comment and let us know. For more interesting content, subscribe to our email newsletter: