Part 1: How to Read, Query, and Write a Pandas DataFrame [CSV]

Story: This series of articles assume you are an employee of the City of Sacramento’s IT Department.

At the end of each month, a CSV file is sent to the Chief of Police outlining the crime statistics for the current month.

However, the Chief prefers to view the output in a styled PDF format based on a District/Beat of his choosing.

Part 1 of this series focuses on:

  • importing the crimes.csv file to a DataFrame
  • sorting the DataFrame
  • prompting the user for information
  • querying the DataFrame
  • saving the results to a new CSV file

Preparation

Before any data manipulation can occur, three (3) new libraries will require installation.

  • The Pandas library enables access to/from a DataFrame.
  • The FPDF library enables the conversion of a CSV file to a professionally formatted PDF file.
  • The Yagmail library is a GMAIL/SMTP client that makes it easy to send emails.

To install these libraries, navigate to an IDE terminal. At the command prompt ($), execute the code below. For the terminal used in this example, the command prompt is a dollar sign ($). Your terminal prompt may be different.

$ pip install pandas

Hit the <Enter> key on the keyboard to start the installation process.

$ pip install fpdf

Hit the <Enter> key on the keyboard to start the installation process.

$ pip install yagmail

Hit the <Enter> key on the keyboard to start the installation process.

If the installations were successful, a message displays in the terminal indicating the same.


Feel free to view the PyCharm installation guide for the required libraries.


Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.

import pandas as pd
from fpdf import FPDF
import csv
import datetime
import yagmail

Download the Sample Data

The first step is to download some sample data. Click here to download the Crime Statistics CSV file used in this article.

Download this file and move it to the same directory as the Python code (current working directory) resides.


View the CSV File

Manually open the CSV file to review the data.

πŸ’‘ Note: The actual CSV file contains 7,585 rows.


Read the CSV File

With the appropriate CSV in the current working directory, run the following code:

cols = ['cdatetime', 'address', 'district', 'beat', 'grid', 'crimedescr']
df   = pd.read_csv('crimes.csv', usecols=cols)
df.sort_values('cdatetime', inplace=True, ascending=True)
print(df.head(3))
  • Line [1] creates a list containing the CSV columns to read in.
  • Line [2] reads in the CSV file with the columns outlined above. The output saves to the DataFrame df.
  • LIne [3] sorts the original DataFrame based on the cdatetime column.
  • Line [4] outputs the first three (3) rows of the updated DataFrame df to the terminal.

Output:

cdatetimeaddressdistrictbeatgridcrimedesc
01/1/06 0:003108 OCCIDENTAL DR33C111510851(A)VC TAKE VEH W/O OWNER
191/1/06 0:005641 DORSET WAY44C1225484J PC PUBLISH CARD INFO
181/1/06 0:003525 42ND ST66A1034530.5 PC USE PERSONAL ID INFO

πŸ’‘ Note: You could combine Lines [1] and [2] to compress the code snippet. Before continuing you may want to remove Line [4].


Clean the Data

Upon reviewing the CSV file, we noticed two (2) issues. They are the following:

  1. Remove the trailing spaces from the values in the beat field.
  2. Change the data from upper to title case for better readability.
df['beat'] = df['beat'].str.rstrip()
df = df.apply(lambda x: x.astype(str).str.title())
print(df.head(3))
  • Line [1] strips the trailing spaces (rstrip()) from each value in the beat column.
  • Line [2] uses a lambda to change the DataFrame data from upper to title case (title()).
  • Line [3] outputs the first three (3) rows of the updated DataFrame df to the terminal.

Output:

cdatetimeaddressdistrictbeatgridcrimedesc
01/1/06 0:003108 Occidental Dr33C111510851(A)Vc Take Veh W/O Owner
191/1/06 0:005641 Dorset Way44C1225484J Pc Publish Card Info
181/1/06 0:003525 42Nd St66A1034530.5 Pc Use Personal Id Info

πŸ’‘ Note: Before continuing you may want to remove Line [3].


Get Report Selection

This section creates a re-usable while loop that prompts the user for a value. If you look at the CSV file, you will notice that the District column contains numbers from 1-6, whereas the Beat column combines the District number with a character from A-C (example: 3C).

For this article, Report Number 3C is selected.

lst = '123456ABCQ'
rpt_num = None

while True:
    rpt_num = input('Select a District/Beat (1A-6C or Q to quit): ').upper()
    if rpt_num == 'Q':
        exit()
    elif rpt_num[0] not in lst[:6] or rpt_num[1] not in lst[6:9]:
        print('You entered an invalid selection!')
    else:
        break
 
print(f'Report {rpt_num} generating!')
        
  • Line [1] creates a string containing all available Districts and Beats as well as an option to quit (Q). The output saves to lst.
  • Line [2] creates the variable rpt_num containing a value of None.
  • Line [3] instantiates a while loop that continues until a valid rpt_num or ‘Q’ returns.
  • Line [4] prompts the user to enter a valid District/Beat combination or ‘Q’ and converts the output to uppercase.
  • Line [5-6] executes if a ‘Q’ was entered. The script ends.
  • Line [7] validates that the first character of rpt_num is a number from 1-6 and validates that the second character of rpt_num is a character from A-C.
  • If Line [7] resolves to True, the code falls to the else statement on Line [8].
  • If Line [7] resolves to False, Line [9] displays an error message and restarts the while loop.
  • Line [9-10] executes if the rpt_num passes the validation. The code breaks out of the while loop.
  • Line [11] outputs a message containing the valid selection to the terminal.

Query the DataFrame

With the rpt_num variable generated earlier, this section queries the DataFrame df based on the beat column and outputs the result to a new CSV file. This file saves to the current working directory.

πŸ’‘ Note: If you review the original crimes.csv file, you will notice that the beat column combines district with the beat (example 3C). Therefore, we can query the DataFrame based on the beat column to return the correct results.

the_filter = (df.query(f"beat == '{rpt_num}'"))
filt_cols=['cdatetime','address','grid','crimedescr']
the_filter.to_csv(f'{rpt_num}.csv', columns=filt_cols)
print(f'Report {rpt_num}.csv resides in the current working directory!')
  • Line [1] builds a query based on the rpt_num variable saved earlier. The output saves to the_filter.
  • Line [2] creates a list containing the DataFrame columns to save to the new CSV file. This output saves to filt_cols.
  • Line [3] saves the query results with the specified columns to a new CSV file. This filename resolves to the rpt_name variable saved earlier with '.csv' appended to the end (example: 3C.csv).
  • Line [4] outputs a message to the terminal.

πŸ’‘ Note: The new CSV file should now reside in the current working directory. Open this file to view the contents.

The output below is the top three (3) rows from the new CSV file, 3C.csv.

,cdatetime,address,grid,crimedescr
0,1/1/06 0:00,3108 Occidental Dr,1115,10851(A)Vc Take Veh W/O Owner
13,1/1/06 0:00,7721 College Town Dr,888,530.5 Pc Use Personal Id Info
7,1/1/06 0:00,3547 P St,853,484 Pc Petty Theft/Inside

Summary

In this article, you learned how to:

  • Read in an existing CSV file to a DataFrame.
  • Sort the DataFrame.
  • Set up a while statement with the input() function to retrieve valid data from a user.
  • Query a DataFrame.
  • Save the query results to a new CSV file.

What’s Next

In Part 2 of this series, you will learn how to read in the existing filtered CSV file, format, paginate and output to a PDF file.

In Part 3 of this series, you will learn how to attach the PDF file to a Gmail account and send the email to a recipient.