Analyzing Mobile Data Speeds from TRAI with Pandas in Python

💡 Problem Formulation: In this article, we address how to manipulate and analyze mobile data speeds provided by the Telecom Regulatory Authority of India (TRAI) using Pandas in Python. Analysts require efficient methods to parse, clean, aggregate, and visualize this data for reporting and decision-making. Often, the input is a large dataset of mobile speeds in CSV or Excel format, and the desired output is insightful information such as average speeds, compliance with benchmarks, or trends over time.

Method 1: Reading and Cleaning Data

An essential first step in data analysis with Pandas is reading data into a DataFrame and then cleaning it to ensure quality and consistency. This involves removing missing values, correcting data types, and potentially renaming columns for better readability.

Here’s an example:

import pandas as pd

data = pd.read_csv('trai_data_speeds.csv')
cleaned_data = data.dropna().rename(columns={'speed_Mbps': 'Speed (Mbps)'})
print(cleaned_data.head())

Output of this code snippet:

   Provider  Speed (Mbps)        Date
0    Airtel          8.5  2021-01-01
1      Jio          15.6  2021-01-01
2      Vi           10.4  2021-01-01
3   BSNL            4.2  2021-01-01

This code snippet imports the Pandas library, reads data from a CSV file into a DataFrame, drops all missing values, and renames a particular column for better clarity. The output displays the cleaned DataFrame’s first few records.

Method 2: Aggregating Data

Aggregating data is crucial for summarizing large datasets. With Pandas, you can compute summary statistics, such as mean or median, for various groupings within your data, providing insights into mobile speed performance across different dimensions like times or providers.

Here’s an example:

grouped_data = cleaned_data.groupby('Provider')['Speed (Mbps)'].mean()
print(grouped_data)

Output of this code snippet:

Provider
Airtel   9.5
Jio     20.3
Vi      15.1
BSNL     5.2
Name: Speed (Mbps), dtype: float64

The code snippet groups the cleaned data by the ‘Provider’ and calculates the mean ‘Speed (Mbps)’ for each provider. The result is a Series displaying the average speed offered by each mobile provider.

Method 3: Time Series Analysis

Time series analysis with Pandas allows for the exploration of data trends over time. This can help identify whether mobile data speeds are improving, deteriorating, or exhibiting seasonality.

Here’s an example:

cleaned_data['Date'] = pd.to_datetime(cleaned_data['Date'])
time_series = cleaned_data.set_index('Date').resample('M')['Speed (Mbps)'].mean()
print(time_series.plot())

Output of this code snippet:

<AxesSubplot:xlabel='Date'>

This code snippet turns the ‘Date’ column into datetime objects, sets it as the index, and then resamples the data by month to get the average ‘Speed (Mbps)’. A line plot of these averages over time is produced, encapsulating the trend of data speeds.

Method 4: Visualizing the Data

Visualization is key to understanding and communicating complex data. With Pandas, you can easily plot various aspects of your dataset such as distributions of speeds and comparisons between providers.

Here’s an example:

import matplotlib.pyplot as plt

cleaned_data.boxplot(by='Provider', column=['Speed (Mbps)'])
plt.title('Speed Distribution by Provider')
plt.suptitle('')
plt.ylabel('Speed (Mbps)')
plt.show()

Output of this code snippet:

A plot visualizing a boxplot showing the distribution of mobile data speeds for each provider.

The code snippet produces a boxplot using Matplotlib to visualize the distribution of data speeds for each mobile provider. Boxplots are useful for identifying outliers and comparing distributions.

Bonus One-Liner Method 5: Pivot Tables

Pandas’ pivot_table is a versatile function to quickly reshape your dataset and compute complex aggregations, handy for comparing mobile data speeds across multiple dimensions such as providers and regions.

Here’s an example:

pivot = cleaned_data.pivot_table(index='Provider', columns='Date', values='Speed (Mbps)', aggfunc='mean')
print(pivot)

Output of this code snippet:

Date        2021-01-01  2021-02-01  ...
Provider                              
Airtel              8.5          9.0
Jio                15.6         19.1
Vi                 10.4         10.7
BSNL                4.2          4.5

A one-liner pivot table creation using Pandas allows us to see the average speeds for each provider on each date, providing a clear and concise comparison over time.

Summary/Discussion

Method 1: Reading and Cleaning Data. Essential for ensuring data quality. Can be time-consuming with large datasets; automation of cleaning rules is advised.
Method 2: Aggregating Data. Powerful for gathering insights from data. Could become complex with many groupings and aggregations required.
Method 3: Time Series Analysis. Reveals trends and patterns over time. Assumes data is properly time-stamped, can require additional time-formatting efforts.
Method 4: Visualizing the Data. Invaluable for storytelling with data. Visualizations need to be carefully designed for clarity and accurate representation of the data.
Method 5: Pivot Tables. Quick comparative analysis tool, but can become unwieldy with very large and complex datasets.