Background
To provide a solid understanding of data manipulation using the Pandas DataFrame and MySQL, the Finxter Academy has created a 3-part series to take the user from beginner to advanced. This article assumes you have access to a localhost/server with MySQL.
Part 2 and Part 3 also assumes you possess the login credentials with appropriate permissions to access the same.
In this article, we will be reading in a large CSV file, querying, sorting, and extracting the first 150 records.
Preparation
Before any data manipulation can occur, two (2) new libraries will need installation.
- The Pandas library. This library provides access to Pandas and their various functions. The name
pd
is the industry-standard naming convention to reference the same. - The MySQL library. This library has the functionality to connect to a localhost/server and run queries.
To install these libraries, navigate to an IDE terminal. At the command prompt ($
), execute the code below. For the terminal used in this example, the command prompt is a dollar sign ($
). Your terminal prompt may be different.
$ pip install pandas
Hit the <Enter>
key on the keyboard to start the installation process.
$ pip install mysql-connector-python
Hit the <Enter>
key on the keyboard to start the installation process.
If the installations were successful, a message displays in the terminal indicating the same.
Feel free to view the PyCharm installation guide for the required libraries.
Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.
import pandas as pd import mysql.connector from mysql.connector import Error file_name = 'Sacramento_Crime_Data_From_Two_Years_Ago.csv'
Download the CSV File
For this example, the Crime Statistics CSV file for the City of Sacramento is used. To download this file, perform the following steps:
A) Navigate to the City of Sacramento Crime Statistics dataset.
B) Click the Download icon to display the Download Options area.
C) From the Download Options area, click the Download button.
D) From the pop-up dialog box, click to select Save File. Next, click the OK button to start the download process.
E) Save/move the CSV file to your current working directory.
π‘ Note: Ensure the CSV filename matches the filename in the Required Starter Code.
Create a Pandas DataFrame
A Pandas DataFrame is a two-dimensional data structure. This structure contains rows and columns: similar to Excel or a Database. Just like Excel and Databases, the DataFrame works on the same principles. All structures contain:
- Rows
- Columns
- Data
Only a few columns from the original CSV file are accessed and used.
cols = ['Record_ID', 'Offense_Code', 'Police_District', 'Beat', 'Grid', 'Occurence_Date'] df = pd.read_csv(file_name, sep=',', index_col=['Record_ID'], usecols=cols, parse_dates=['Occurence_Date'] ) print(df.head(4))
- Line [1] contains a list of columns used in the DataFrame.
- Line [2] reads in the CSV file downloaded earlier. The
read_csv()
method has several parameters. This example uses a few. To delve further into this topic, click here. - Line [3] outputs the newly created DataFrame (
df
) to the terminal. Usinghead()
and passing a numeric value will display the first x number of rows in the DataFrame. If left blank, the default five rows will display.
file_name | The filename is the name of the CSV file downloaded earlier. |
sep | The field separator character, by default, is the comma (,). |
index_col | These column(s) are the index field(s). These field(s) must exist in the CSV file and contain unique values. |
use_cols | This value is a list of column(s). The column(s) listed here are available in the DataFrame. If empty, all are available. |
parse_dates | This value is a list of date fields. Using parse_dates will ensure dates display correctly. |
Output
Record_ID | Offense_Code | Police_District | Beat | Grid | Occurence_Date |
1359411 | 1315 | 3 | 3B | 0736 | 2019-01-01 08:58:00+00:00 |
1359412 | 2202 | 6 | 6D | 0824 | 2019-01-01 08:35:00+00:00 |
1359413 | 2203 | 4 | 4A | 0905 | 2019-01-01 09:40:00+00:00 |
1359453 | 5404 | 2 | 2B | 0564 | 2019-01-01 09:56:00+00:00 |
π‘ Note: Using tail()
and passing a numeric value will display the last x number of rows in the DataFrame. If left blank, the default five rows will display.
Retrieve Total Number of Records in DataFrame
To retrieve the total number of records (excluding the header row) in the DataFrame, reference the index_col
set above.
total_recs = len(df.index) print(total_recs)
- Line [1] This line returns the record count and assigns that value to
total_recs
. - Line [2] outputs the total number of rows in the DataFrame to the terminal.
Output
39504
Whew! Thatβs quite a few records to contend with. The next step is to narrow down the DataFrame to a more manageable level.
Query the DataFrame
The example below uses the Pandas query()
function to query the DataFrame. For additional details on this topic, click here.
df.query('Record_ID >= 1359412' and 'Record_ID <= 1360143', inplace=True) print(df.head(4))
- Line [1] creates a query to reduce the total records to a specified range. When
inplace=True
the data is modified in place. This DataFrame is updated to reflect the query results. IfFalse
(default), a copy of the DataFrame is returned. - Line [2] outputs the subset to the terminal.
Output
Record_ID | Offense_Code | Police_District | Beat | Grid | Occurence_Date |
1359411 | 1315 | 3 | 3B | 0736 | 2019-01-01 08:58:00+00:00 |
1359412 | 2202 | 6 | 6D | 0824 | 2019-01-01 08:35:00+00:00 |
1359413 | 2203 | 4 | 4A | 0905 | 2019-01-01 09:40:00+00:00 |
1359453 | 5404 | 2 | 2B | 0564 | 2019-01-01 09:56:00+00:00 |
Sort the DataFrame
For this example, two fields sort in ascending order. If the order is empty, ascending is assumed.
df.sort_values(by=['Police_District', 'Beat'], inplace=True) df.to_csv('crimes_01.csv')
- Line [1] sorts and updates the current DataFrame.
- Line [2] saves all changes made to the original DataFrame to a new CSV file called
crimes_01.csv
.
A new CSV file resides in the current working directory if this code runs successfully.
Output (Partial)
If we open the new CSV file, all the changes made above should appear.
π‘ Note: Take some time to compare the original CSV to the new CSV.
Summary
In this article, we learned how to:
- Install the Pandas Library
- Install the
mysql-connector-python
Library - Download and save a CSV file from the web.
- Create a Pandas DataFrame
- Retrieve Total Number of Records from a DataFrame
- Query the DataFrame
- Sort the DataFrame
- Save the DataFrame to a new CSV file.
Next Up
Part 2 focuses on the following:
- Reading in a CSV File
- Inserting a DataFrame Column
- Connecting to MySQL
- Creating a MySQL Database