Story: This series of articles assume you are an employee of the City of Sacramento’s IT Department.
At the end of each month, a CSV file is sent to the Chief of Police outlining the crime statistics for the current month.
However, the Chief prefers to view the output in a styled PDF format based on a District/Beat of his choosing.
Part 1 of this series focuses on:
- importing the
crimes.csv
file to a DataFrame - sorting the DataFrame
- prompting the user for information
- querying the DataFrame
- saving the results to a new CSV file
Preparation
- The Pandas library enables access to/from a DataFrame.
- The FPDF library enables the conversion of a CSV file to a professionally formatted PDF file.
- The Yagmail library is a GMAIL/SMTP client that makes it easy to send emails.
To install these libraries, navigate to an IDE terminal. At the command prompt ($
), execute the code below. For the terminal used in this example, the command prompt is a dollar sign ($
). Your terminal prompt may be different.
$ pip install pandas
Hit the <Enter>
key on the keyboard to start the installation process.
$ pip install fpdf
Hit the <Enter>
key on the keyboard to start the installation process.
$ pip install yagmail
Hit the <Enter>
key on the keyboard to start the installation process.
If the installations were successful, a message displays in the terminal indicating the same.
Feel free to view the PyCharm installation guide for the required libraries.
- How to install Pandas on PyCharm
- How to install FPDF on PyCharm
- How to install Yagmail on PyCharm
import pandas as pd from fpdf import FPDF import csv import datetime import yagmail
Download the Sample Data
The first step is to download some sample data. Click here to download the Crime Statistics CSV file used in this article.
Download this file and move it to the same directory as the Python code (current working directory) resides.
View the CSV File
Manually open the CSV file to review the data.
π‘ Note: The actual CSV file contains 7,585 rows.
Read the CSV File
With the appropriate CSV in the current working directory, run the following code:
cols = ['cdatetime', 'address', 'district', 'beat', 'grid', 'crimedescr'] df = pd.read_csv('crimes.csv', usecols=cols) df.sort_values('cdatetime', inplace=True, ascending=True) print(df.head(3))
- Line [1] creates a list containing the CSV columns to read in.
- Line [2] reads in the CSV file with the columns outlined above. The output saves to the DataFrame
df
. - LIne [3] sorts the original DataFrame based on the
cdatetime
column. - Line [4] outputs the first three (3) rows of the updated DataFrame
df
to the terminal.
Output:
cdatetime | address | district | beat | grid | crimedesc | |
0 | 1/1/06 0:00 | 3108 OCCIDENTAL DR | 3 | 3C | 1115 | 10851(A)VC TAKE VEH W/O OWNER |
19 | 1/1/06 0:00 | 5641 DORSET WAY | 4 | 4C | 1225 | 484J PC PUBLISH CARD INFO |
18 | 1/1/06 0:00 | 3525 42ND ST | 6 | 6A | 1034 | 530.5 PC USE PERSONAL ID INFO |
π‘ Note: You could combine Lines [1] and [2] to compress the code snippet. Before continuing you may want to remove Line [4].
Clean the Data
Upon reviewing the CSV file, we noticed two (2) issues. They are the following:
- Remove the trailing spaces from the values in the
beat
field. - Change the data from upper to title case for better readability.
df['beat'] = df['beat'].str.rstrip() df = df.apply(lambda x: x.astype(str).str.title()) print(df.head(3))
- Line [1] strips the trailing spaces (
rstrip()
) from each value in thebeat
column. - Line [2] uses a lambda to change the DataFrame data from upper to title case (
title()
). - Line [3] outputs the first three (3) rows of the updated DataFrame
df
to the terminal.
Output:
cdatetime | address | district | beat | grid | crimedesc | |
0 | 1/1/06 0:00 | 3108 Occidental Dr | 3 | 3C | 1115 | 10851(A)Vc Take Veh W/O Owner |
19 | 1/1/06 0:00 | 5641 Dorset Way | 4 | 4C | 1225 | 484J Pc Publish Card Info |
18 | 1/1/06 0:00 | 3525 42Nd St | 6 | 6A | 1034 | 530.5 Pc Use Personal Id Info |
π‘ Note: Before continuing you may want to remove Line [3].
Get Report Selection
This section creates a re-usable while
loop that prompts the user for a value. If you look at the CSV file, you will notice that the District
column contains numbers from 1-6, whereas the Beat
column combines the District
number with a character from A-C (example: 3C).
For this article, Report Number 3C
is selected.
lst = '123456ABCQ' rpt_num = None while True: rpt_num = input('Select a District/Beat (1A-6C or Q to quit): ').upper() if rpt_num == 'Q': exit() elif rpt_num[0] not in lst[:6] or rpt_num[1] not in lst[6:9]: print('You entered an invalid selection!') else: break print(f'Report {rpt_num} generating!')
- Line [1] creates a string containing all available
Districts
andBeats
as well as an option to quit (Q). The output saves tolst
. - Line [2] creates the variable
rpt_num
containing a value ofNone
. - Line [3] instantiates a
while
loop that continues until a validrpt_num
or ‘Q’ returns. - Line [4] prompts the user to enter a valid District/Beat combination or ‘Q’ and converts the output to uppercase.
- Line [5-6] executes if a ‘Q’ was entered. The script ends.
- Line [7] validates that the first character of
rpt_num
is a number from 1-6 and validates that the second character ofrpt_num
is a character from A-C. - If Line [7] resolves to
True
, the code falls to theelse
statement on Line [8]. - If Line [7] resolves to
False
, Line [9] displays an error message and restarts thewhile
loop. - Line [9-10] executes if the rpt_num passes the validation. The code breaks out of the
while
loop. - Line [11] outputs a message containing the valid selection to the terminal.
Query the DataFrame
With the rpt_num
variable generated earlier, this section queries the DataFrame df
based on the beat
column and outputs the result to a new CSV file. This file saves to the current working directory.
π‘ Note: If you review the original crimes.csv
file, you will notice that the beat
column combines district
with the beat
(example 3C). Therefore, we can query the DataFrame based on the beat
column to return the correct results.
the_filter = (df.query(f"beat == '{rpt_num}'")) filt_cols=['cdatetime','address','grid','crimedescr'] the_filter.to_csv(f'{rpt_num}.csv', columns=filt_cols) print(f'Report {rpt_num}.csv resides in the current working directory!')
- Line [1] builds a query based on the
rpt_num
variable saved earlier. The output saves tothe_filter
. - Line [2] creates a list containing the DataFrame columns to save to the new CSV file. This output saves to
filt_cols
. - Line [3] saves the query results with the specified columns to a new CSV file. This filename resolves to the
rpt_name
variable saved earlier with'.csv'
appended to the end (example:3C.csv
). - Line [4] outputs a message to the terminal.
π‘ Note: The new CSV file should now reside in the current working directory. Open this file to view the contents.
The output below is the top three (3) rows from the new CSV file, 3C.csv.
,cdatetime,address,grid,crimedescr 0,1/1/06 0:00,3108 Occidental Dr,1115,10851(A)Vc Take Veh W/O Owner 13,1/1/06 0:00,7721 College Town Dr,888,530.5 Pc Use Personal Id Info 7,1/1/06 0:00,3547 P St,853,484 Pc Petty Theft/Inside |
Summary
In this article, you learned how to:
- Read in an existing CSV file to a DataFrame.
- Sort the DataFrame.
- Set up a
while
statement with theinput()
function to retrieve valid data from a user. - Query a DataFrame.
- Save the query results to a new CSV file.
What’s Next
In Part 2 of this series, you will learn how to read in the existing filtered CSV file, format, paginate and output to a PDF file.
In Part 3 of this series, you will learn how to attach the PDF file to a Gmail account and send the email to a recipient.