Top 13 Python Tricks for Data Analysis

This article focuses on analyzing the coronavirus dataset using Python language.

We are not using any of the Python data analysis libraries. Instead, we’ll use our raw Python skills to write a function, slicing, and indexing.

Also, we’ll use Python arithmetic operators such as sum() and division.

Finally, we’ll use a lambda expression to perform the traditional looping method.

The Jupyter notebook is the preferred IDE (Integrated Development Environment) to write and execute code samples. The dataset we are using is from the data world website. You can download it from the link below. 

Our dataset consists of some empty strings. Firstly, we must clean the dataset before performing arithmetic operations or data analysis.

Python open() and reader() Function

We will use CSV (Comma Separated Values) module to open and read the dataset. The csv module defines the Python reader method and other methods.

More on that here πŸ‘‰ https://docs.python.org/3/library/csv.html.

Let’s import the reader() function from the python csv module.

from csv import reader

Now, let’s open and read the coronavirus dataset by running the following code. 

open_file = open('daily_coronavirus_full_data.csv')
read_file = reader(open_file)
list_covid_file = list(read_file)
  • A Python open() function opens a file and returns our datasets into a variable open_file.
  • We are using the primary usage of the reader() function. A reader reads datasets in the open_file variable.
  • And list_covid_file displayed the contents of the dataset as a Python list.

Execute the following code:

list_covid_file

Here’s the output:

The above screenshot consists of a list of lists. The first item in the list is the header, followed by the rows of the datasets.

Indexing and Slicing

Now, retrieve any row or rows from the dataset using a slice() function. Fetch the dataset header with the slice() function. 

Code Sample:           

  • list_covid_file[0:1] – retrieved a dataset contents from index 0 and end at index 1.
  • Index 0 is the first row, and index 1 is the second row in the dataset.
  • However, the slice() function would ignore the index 1.
  • We used Python print() function to visualise the dataset header as it should in the csv file. 

Output:

The above screenshot consists of ten different variables in the dataset head.

Python Negative Indexing: Get the last row or last element in the list using a negative index.

Code Sample:

print(list_covid_file[-1])

Output:

Using Python len() Function.

The len() function returns the row number in the datasets. Let’s retrieve the length of our dataset using the len() function.

Run:

len(covid_dataset)

The Python len() function accepted dataset as a parameter, which returns the following output:

153482

Using List Comprehension

List comprehension returns a new iterable such as lists, tuples or strings, and it’s a short version of the traditional looping technique.

Code Sample:

get_row = [x for x in covid_dataset]
  • We created a variable get_row
  • List comprehension has two angle brackets that consist of expressions that run each element in the list.
  • Then, assign the outcome to the get_row variable.

Now, execute get_row variable.

get_row

Output

You should notice from the above screenshot that we have empty strings ('' or '.') in the dataset. The next task is to replace all the empty strings(' ', '.') with '0.0'.

Replacing Empty Strings – Add the result to the list with an append() function

Code Sample:

The above screenshot is a reusable function.

  • We created a custom function that accepts two parameters: dataset and row.
  • And declared an empty list fetch_new_data.
  • Then iterate over the coronavirus data and assign row into a variable dataset_row.
  • We check if the row has empty strings ('', '.'
  • And if it’s true, assign a value "0.0" to all empty strings.
  • Then, we convert the row from the string into a float().
  • And add the result dataset_row into a list fetch_new_data using Python append() function.

Outside the loop, return a new list result fetch_new_data.

Let’s, create an object of the generic_function function.

Example Code:

get_dataset = generic_function(covid_dataset, 5)

The generic_function function accepts two arguments: dataset and row 5, which it’s assigned to a variable get_dataset.

Execute: 

get_dataset

Output

We replaced all empty strings with 0.0. We can do this repeatedly by checking any row with empty strings and replacing them with 0.0.

Python Arithmetic Operations

Using the sum() function

We will reuse a “generic_function” function we created in an earlier example. Add the total number of deaths using a built-in Python sum() function and return the total of all data points.

Code Sample

get_all_deaths = sum(get_dataset)
  • We created a variable called get_all_deaths.
  • The sum() function accepted get_dataset (object) created from generic_function.
  • Then, add all the data points in row 5 and assign it to a variable get_all_deaths.

Now run:

get_all_deaths

Output:

Average Number of Deaths Using len() Function

We will compute an average death by dividing the total number of deaths by the total length of row 5 (total death). 

Code Sample:

avg_deaths = get_all_deaths/len(get_dataset)
  • We created an avg_deaths variable.
  • get_all_deaths is the aggregate of all death in row 5 and divided by the length of get_dataset using the Pyhton len() function.
  • Then, return the result in a variable avg_death.

Let’s execute average deaths:

avg_death

Output:

The above screenshot shows the average number of deaths in all countries.

Python round() function 

Let’s execute:

round(avg_death, 2)
# 47705.73

Rounded average death into 2 decimal points.

Python round() function returns a floating-point and accepts two parameters; the number to rounded and the decimal place number. 

Explore New Cases By Country

Python split() Function

Let’s continue exploring our dataset by analyzing the number of new cases in each country.

Below screenshot is a function that gets new cases of coronavirus by country.

The above function consists of the following:

  1. The function new_cases_by_country has three parameters: the dataset, locations, and new cases rows.
  2. We create a variable empty number_of_cases_by_contry that holds the new results list.
  3. We iterated over the coronavirus dataset and assigned location and new case rows to variables location & new_cases.
  4. And check if the location is equal to the country in the location row.
  5. Also, check if the new cases row has empty strings, and if it does, replace them with the value “0.0”.
  6. Concatenate location and new_cases and separate them with a pipe ("|")sign, which would help us manipulate the result later.
  7. And add location and new cases into empty list β€œnumber_of_cases_by_country” using Python built-in append() function.
  8. Then, we use a split() function to split the string into a list by splitting the output from the pipe ('|') sign.
  9. Then, use a return keyword to return a new list of number_of_cases_by_country.

Let’s create an object of new_cases_by_country.

Sample Code: 

new_cases_by_location = new_cases_by_country(covid_dataset, 1, 2, 'United Kingdom')
  1. We add three arguments into new_cases_by_country function: dataset,new cases, location rows and country. 
  2. And assigned it to a variable, new_cases_by_location.

Execute the following:

new_cases_by_location

Output:

The above screenshot displayed the number of new cases in the United Kingdom. 

We can explore the number of cases in each country using the new_cases_by_country function.

Python Lambda Expression

Python map() Function

The lambda is an anonymous function that indicates a function without a given name.  The following code sample combines python map() and lambda expression to add all the new coronavirus cases in the United Kingdom. 

Code Sample:

  • We introduced an anonymous python function known as lambdas expression. Check python documentation for more details 6: expressions β€” Python 3.10.2 documentation.
  • We created a variable sum_new_cases that holds all the United Kingdom new cases.
  • The map() function makes it simpler and more efficient to iterate over items.
  • We pass the lambda function and a list as an argument in a map() function.
  • Then, the variable x[1] gets the first index in new_cases_by_location and converts it from string to Python float(), and return a new list result.

Output

Python max() Function

Python max() function returns the highest number in the list.

Let’s use the max() function to retrieve the highest number of new cases reported in the United Kingdom.

Code Sample: 

max(sum_new_cases)

The variable sum_new_cases is the lambda expression object we created in the code sample above.

We passed the “sum_new_casesinto the max() function, which returns the highest number in the list.

Output:

The maximum number of cases reported in the United Kingdom is 221222.0

Conclusion

There is a lot of data exploration to cover but, this should furnish you with some primary usage of Python built-in functions, function declaration, and function reusability.

All this should come in handy when analyzing a vast dataset.

You can go beyond the code samples shown in this article and play around with the dataset to showcase your python skills.