This article focuses on analyzing the coronavirus dataset using Python language.
We are not using any of the Python data analysis libraries. Instead, we’ll use our raw Python skills to write a function, slicing, and indexing.
Also, we’ll use Python arithmetic operators such as
sum() and division.
Finally, we’ll use a lambda expression to perform the traditional looping method.
The Jupyter notebook is the preferred IDE (Integrated Development Environment) to write and execute code samples. The dataset we are using is from the data world website. You can download it from the link below.
Our dataset consists of some empty strings. Firstly, we must clean the dataset before performing arithmetic operations or data analysis.
Python open() and reader() Function
We will use CSV (Comma Separated Values) module to open and read the dataset. The
csv module defines the Python
reader method and other methods.
More on that here 👉 https://docs.python.org/3/library/csv.html.
Let’s import the
reader() function from the python
from csv import reader
Now, let’s open and read the coronavirus dataset by running the following code.
open_file = open('daily_coronavirus_full_data.csv') read_file = reader(open_file) list_covid_file = list(read_file)
- A Python
open()function opens a file and returns our datasets into a variable
- We are using the primary usage of the
reader()function. A reader reads datasets in the
list_covid_filedisplayed the contents of the dataset as a Python list.
Execute the following code:
Here’s the output:
The above screenshot consists of a list of lists. The first item in the list is the header, followed by the rows of the datasets.
Indexing and Slicing
Now, retrieve any row or rows from the dataset using a
slice() function. Fetch the dataset header with the
list_covid_file[0:1]– retrieved a dataset contents from index 0 and end at index 1.
- Index 0 is the first row, and index 1 is the second row in the dataset.
- However, the
slice()function would ignore the index 1.
- We used Python
print()function to visualise the dataset header as it should in the csv file.
The above screenshot consists of ten different variables in the dataset head.
Python Negative Indexing: Get the last row or last element in the list using a negative index.
Using Python len() Function.
len() function returns the row number in the datasets. Let’s retrieve the length of our dataset using the
len() function accepted dataset as a parameter, which returns the following output:
Using List Comprehension
List comprehension returns a new iterable such as lists, tuples or strings, and it’s a short version of the traditional looping technique.
get_row = [x for x in covid_dataset]
- We created a variable
- List comprehension has two angle brackets that consist of expressions that run each element in the list.
- Then, assign the outcome to the
You should notice from the above screenshot that we have empty strings (
'.') in the dataset. The next task is to replace all the empty strings(
Replacing Empty Strings – Add the result to the list with an append() function
The above screenshot is a reusable function.
- We created a custom function that accepts two parameters: dataset and row.
- And declared an empty list
- Then iterate over the coronavirus data and assign row into a variable
- We check if the row has empty strings (
- And if it’s true, assign a value
"0.0"to all empty strings.
- Then, we convert the row from the string into a
- And add the result
dataset_rowinto a list
Outside the loop, return a new list result
Let’s, create an object of the
get_dataset = generic_function(covid_dataset, 5)
generic_function function accepts two arguments: dataset and row 5, which it’s assigned to a variable
We replaced all empty strings with 0.0. We can do this repeatedly by checking any row with empty strings and replacing them with 0.0.
Python Arithmetic Operations
Using the sum() function
We will reuse a “
generic_function” function we created in an earlier example. Add the total number of deaths using a built-in Python
sum() function and return the total of all data points.
get_all_deaths = sum(get_dataset)
- We created a variable called
get_dataset(object) created from
- Then, add all the data points in row 5 and assign it to a variable
Average Number of Deaths Using len() Function
We will compute an average death by dividing the total number of deaths by the total length of row 5 (total death).
avg_deaths = get_all_deaths/len(get_dataset)
- We created an
get_all_deathsis the aggregate of all death in row 5 and divided by the length of
get_datasetusing the Pyhton
- Then, return the result in a variable
Let’s execute average deaths:
The above screenshot shows the average number of deaths in all countries.
Python round() function
round(avg_death, 2) # 47705.73
Rounded average death into 2 decimal points.
round() function returns a floating-point and accepts two parameters; the number to rounded and the decimal place number.
Explore New Cases By Country
Python split() Function
Let’s continue exploring our dataset by analyzing the number of new cases in each country.
Below screenshot is a function that gets new cases of coronavirus by country.
The above function consists of the following:
- The function
new_cases_by_countryhas three parameters: the dataset, locations, and new cases rows.
- We create a variable empty
number_of_cases_by_contrythat holds the new results list.
- We iterated over the coronavirus dataset and assigned location and new case rows to variables
- And check if the location is equal to the country in the location row.
- Also, check if the new cases row has empty strings, and if it does, replace them with the value “0.0”.
- Concatenate location and
new_casesand separate them with a pipe (
"|")sign, which would help us manipulate the result later.
- And add location and new cases into empty list “
number_of_cases_by_country” using Python built-in
- Then, we use a
split()function to split the string into a list by splitting the output from the pipe (
- Then, use a
returnkeyword to return a new list of
Let’s create an object of
new_cases_by_location = new_cases_by_country(covid_dataset, 1, 2, 'United Kingdom')
- We add three arguments into
new_cases_by_countryfunction: dataset,new cases, location rows and country.
- And assigned it to a variable,
Execute the following:
The above screenshot displayed the number of new cases in the United Kingdom.
We can explore the number of cases in each country using the
Python Lambda Expression
Python map() Function
The lambda is an anonymous function that indicates a function without a given name. The following code sample combines python
map() and lambda expression to add all the new coronavirus cases in the United Kingdom.
- We introduced an anonymous python function known as lambdas expression. Check python documentation for more details 6: expressions — Python 3.10.2 documentation.
- We created a variable
sum_new_casesthat holds all the United Kingdom new cases.
map()function makes it simpler and more efficient to iterate over items.
- We pass the lambda function and a list as an argument in a
- Then, the variable
xgets the first index in
new_cases_by_locationand converts it from string to Python
float(), and return a new list result.
Python max() Function
max() function returns the highest number in the list.
Let’s use the
max() function to retrieve the highest number of new cases reported in the United Kingdom.
The variable “
sum_new_cases“ is the lambda expression object we created in the code sample above.
We passed the “
sum_new_cases” into the
max() function, which returns the highest number in the list.
The maximum number of cases reported in the United Kingdom is 221222.0
There is a lot of data exploration to cover but, this should furnish you with some primary usage of Python built-in functions, function declaration, and function reusability.
All this should come in handy when analyzing a vast dataset.
You can go beyond the code samples shown in this article and play around with the dataset to showcase your python skills.