Python Working with the Pandas DataFrame & MySQL – Part 1

Rate this post

Background

To provide a solid understanding of data manipulation using the Pandas DataFrame and MySQL, the Finxter Academy has created a 3-part series to take the user from beginner to advanced. This article assumes you have access to a localhost/server with MySQL.

Part 2 and Part 3 also assumes you possess the login credentials with appropriate permissions to access the same.

In this article, we will be reading in a large CSV file, querying, sorting, and extracting the first 150 records.


Preparation

Before any data manipulation can occur, two (2) new libraries will need installation.

  • The Pandas library. This library provides access to Pandas and their various functions. The name pd is the industry-standard naming convention to reference the same.
  • The MySQL library. This library has the functionality to connect to a localhost/server and run queries.

To install these libraries, navigate to an IDE terminal. At the command prompt ($), execute the code below. For the terminal used in this example, the command prompt is a dollar sign ($). Your terminal prompt may be different.

$ pip install pandas

Hit the <Enter> key on the keyboard to start the installation process.

$ pip install mysql-connector-python

Hit the <Enter> key on the keyboard to start the installation process.

If the installations were successful, a message displays in the terminal indicating the same.


Feel free to view the PyCharm installation guide for the required libraries.


Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.

import pandas as pd
import mysql.connector
from mysql.connector import Error

file_name = 'Sacramento_Crime_Data_From_Two_Years_Ago.csv'

Download the CSV File

For this example, the Crime Statistics CSV file for the City of Sacramento is used. To download this file, perform the following steps:

A) Navigate to the City of Sacramento Crime Statistics dataset.

B) Click the Download icon to display the Download Options area.

C) From the Download Options area, click the Download button.

D) From the pop-up dialog box, click to select Save File. Next, click the OK button to start the download process.

E) Save/move the CSV file to your current working directory.

πŸ’‘ Note: Ensure the CSV filename matches the filename in the Required Starter Code.


Create a Pandas DataFrame

A Pandas DataFrame is a two-dimensional data structure. This structure contains rows and columns: similar to Excel or a Database. Just like Excel and Databases, the DataFrame works on the same principles. All structures contain:

  • Rows
  • Columns
  • Data

Only a few columns from the original CSV file are accessed and used.

cols = ['Record_ID', 'Offense_Code', 'Police_District', 'Beat', 'Grid', 'Occurence_Date']

df = pd.read_csv(file_name,
        sep=',',
        index_col=['Record_ID'], 
        usecols=cols,
        parse_dates=['Occurence_Date']
        )

print(df.head(4))
  • Line [1] contains a list of columns used in the DataFrame.
  • Line [2] reads in the CSV file downloaded earlier. The read_csv() method has several parameters. This example uses a few. To delve further into this topic, click here.
  • Line [3] outputs the newly created DataFrame (df) to the terminal. Using head() and passing a numeric value will display the first x number of rows in the DataFrame. If left blank, the default five rows will display.
file_nameThe filename is the name of the CSV file downloaded earlier.
sepThe field separator character, by default, is the comma (,).
index_colThese column(s) are the index field(s). These field(s) must exist in the CSV file and contain unique values.
use_colsThis value is a list of column(s). The column(s) listed here are available in the DataFrame. If empty, all are available.
parse_datesThis value is a list of date fields. Using parse_dates will ensure dates display correctly.

Output

  Record_IDOffense_CodePolice_DistrictBeat Grid           Occurence_Date
1359411           1315              33B07362019-01-01 08:58:00+00:00
1359412           2202              66D08242019-01-01 08:35:00+00:00
1359413           2203              44A09052019-01-01 09:40:00+00:00
1359453           5404              22B05642019-01-01 09:56:00+00:00

πŸ’‘ Note: Using tail() and passing a numeric value will display the last x number of rows in the DataFrame. If left blank, the default five rows will display.


Retrieve Total Number of Records in DataFrame

To retrieve the total number of records (excluding the header row) in the DataFrame, reference the index_col set above.

total_recs = len(df.index)
print(total_recs)
  • Line [1] This line returns the record count and assigns that value to total_recs.
  • Line [2] outputs the total number of rows in the DataFrame to the terminal.

Output

39504

Whew! That’s quite a few records to contend with. The next step is to narrow down the DataFrame to a more manageable level.


Query the DataFrame

The example below uses the Pandas query() function to query the DataFrame. For additional details on this topic, click here.

df.query('Record_ID >= 1359412' and 'Record_ID <= 1360143', inplace=True)
print(df.head(4))
  • Line [1] creates a query to reduce the total records to a specified range. When inplace=True the data is modified in place. This DataFrame is updated to reflect the query results. If False (default), a copy of the DataFrame is returned.
  • Line [2] outputs the subset to the terminal.

Output

  Record_IDOffense_CodePolice_DistrictBeat Grid           Occurence_Date
1359411           1315              33B07362019-01-01 08:58:00+00:00
1359412           2202              66D08242019-01-01 08:35:00+00:00
1359413           2203              44A09052019-01-01 09:40:00+00:00
1359453           5404              22B05642019-01-01 09:56:00+00:00

Sort the DataFrame

For this example, two fields sort in ascending order. If the order is empty, ascending is assumed.

df.sort_values(by=['Police_District', 'Beat'], inplace=True)
df.to_csv('crimes_01.csv')
  • Line [1] sorts and updates the current DataFrame.
  • Line [2] saves all changes made to the original DataFrame to a new CSV file called crimes_01.csv.

A new CSV file resides in the current working directory if this code runs successfully.

Output (Partial)

If we open the new CSV file, all the changes made above should appear.

πŸ’‘ Note: Take some time to compare the original CSV to the new CSV.


Summary

In this article, we learned how to:

  • Install the Pandas Library
  • Install the mysql-connector-python Library
  • Download and save a CSV file from the web.
  • Create a Pandas DataFrame
  • Retrieve Total Number of Records from a DataFrame
  • Query the DataFrame
  • Sort the DataFrame
  • Save the DataFrame to a new CSV file.

Next Up

Part 2 focuses on the following:

  • Reading in a CSV File
  • Inserting a DataFrame Column
  • Connecting to MySQL
  • Creating a MySQL Database