This article focuses on the serialization and conversion methods of a Python DataFrame:
to_gbq()
,to_coo()
.
Let’s get started!
Preparation
Before any data manipulation can occur, four (4) new libraries will require installation.
- The Pandas library enables access to/from a DataFrame.
- The NumPy library supports multi-dimensional arrays and matrices in addition to a collection of mathematical functions.
- The pandas_gbq allows access to Google Big Query (GBQ)
- The google.auth authentication.
To install these libraries, navigate to an IDE terminal. At the command prompt ($
), execute the code below. For the terminal used in this example, the command prompt is a dollar sign ($
). Your terminal prompt may be different.
$ pip install pandas
Hit the <Enter> key on the keyboard to start the installation process.
$ pip install pandas_gbq
Hit the <Enter> key on the keyboard to start the installation process.
$ pip install numpy
Hit the <Enter> key on the keyboard to start the installation process.
$ pip install google.auth
Hit the <Enter> key on the keyboard to start the installation process.
If the installations were successful, a message displays in the terminal indicating the same.
Feel free to view the PyCharm installation guide for the required libraries.
Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.
import pandas as pd import numpy as np from google.cloud import bigquery import google.auth
DataFrame.to_gbq()
The to_gbq()
method writes a DataFrame object to a Google BigQuery table.
BigQuery is Google Cloud Data Warehouse that helps you store, analyze and visualize data.
The syntax for this method is as follows:
DataFrame.to_gbq(destination_table, project_id=None, chunksize=None, reauth=False, if_exists='fail', auth_local_webserver=False, table_schema=None, location=None, progress_bar=True, credentials=None)
Parameter | Description |
---|---|
destination_table | This string parameter denotes the table name presented in the form dataset.tablename . |
project_id | This parameter is the GBQ Account Project ID. This ID is only available from the GBG environment. |
chunksize | This parameter depicts the total rows to insert/chunk. The default value is all rows at once. |
reauth | If True , request user re-authentication from GBQ. |
if_exists | The path taken is the table exists. Options are: – 'fail' : if the table exists, raise an error.– 'replace' : if the table exists, drop and re-create.– 'append' : if the table exists, append data. |
auth_local_webserver | If True , use the local webserver flow instead of the console flow to retrieve user credentials. |
table_schema | This parameter is a dictionary containing table fields data types (Schema). If not provided, BGQ will attempt to create a Schema. |
location | This parameter is the location where the load job runs. |
progress_bar | If True , use the tqdm library to display the progress bar for the upload. |
credentials | This parameter is the credentials for accessing Google APIs. |
The to_gbq()
determines the BigQuery Table Schema based on the data types of the uploaded DataFrame. If this Schema does not meet your requirements, you can create a new Table Schema and pass it via the table_schema
parameter.
π‘ Note: To view a list of available GBQ Data Types, click here.
Before running any code, you will need to do the following:
- Step 1: Create a Google Email Account.
- Step 2: Log in to the Google Email Account.
- Step 3: Navigate to Google Cloud BigQuery and set up your Service Account.
View the video below for instructions.
π‘ Note: Remember to take note of the project details and authentication.
For this example, countries.csv reads into a DataFrame. This DataFrame adds to a Google Big Query Table. Click here to save this CSV file and move it to the current working directory.
prj_id = 'central-hangar-341214' dat_set = 'my_dataset' table_id = 'my_dataset.countries' df = pd.read_csv('countries.csv') df.to_gbq(table_id, if_exists='append', project_id=prj_id)
- Line [1] contains the Project ID assigned earlier.
- Line [2] is a name we assigned to the dataset.
- Line [3] combines the dataset name added to the table name. For this example,
'countries'
is entered. - Line [4] reads in the
countries.csv
to a DataFrame. - Line [5] does the following:
- Assigns the
table_id
('countries.csv'
) - Checks if this table already exists. If so, the data appends to it.
- Assigns the
project_id
('central-hangar-341214'
) - Attempts to connect to and upload the data
- Assigns the
π‘ Note: Remember to generate and save the authentication.
DataFrame Sparse to_coo()
The sparse to_coo()
method creates a scipy.sparse.coo_matrix
from a Series containing a MultiIndex
. The row_levels
and column_levels
determine the row/column coordinates.
The syntax for this method is as follows:
Series.sparse.to_coo(row_levels=(0,), column_levels=(1,), sort_labels=False)
Parameter | Description |
---|---|
row_levels | This parameter is a tuple or a list. |
column_levels | This parameter is a tuple or a list. |
sort_labels | If this parameter is True , the sort is performed before creating the sparse matrix. |
This example has random and missing data. This data is re-sampled and converted into a tuple format using to_coo()
.
stats = pd.Series([1.0080, 4.00260, 7.0, 9.012183, np.nan, np.nan]) stats.index = pd.MultiIndex.from_tuples( [(np.nan, 2, "a", 0), (1, 2, "a", 1), (np.nan, 1, "b", 0), (1, 1, "b", 1), (2, 1, "b", 0), (np.nan, 1, "b", 1)], names=["HYD", "HEL", "LIT", "BER"]) new_stats = stats.astype("Sparse") A, rows, columns = new_stats.sparse.to_coo( row_levels=["HYD", "HEL"], column_levels=["LIT", "BER"], sort_labels=True) print(A)
Output
(0, 0) 1.008 |
(1, 1) 4.0026 |
(2, 2) 7.0 |
(3, 3) 9.012183 |
If we applied the todense()
method to the above data, the output would be as follows:
stats = pd.Series([1.0080, 4.00260, 7.0, 9.012183, np.nan, np.nan]) stats.index = pd.MultiIndex.from_tuples( [(np.nan, 2, "a", 0), (1, 2, "a", 1), (np.nan, 1, "b", 0), (1, 1, "b", 1), (2, 1, "b", 0), (np.nan, 1, "b", 1)], names=["HYD", "HEL", "LIT", "BER"]) new_stats = stats.astype("Sparse") A, rows, columns = new_stats.sparse.to_coo( row_levels=["HYD", "HEL"], column_levels=["LIT", "BER"], sort_labels=True) print(A.todense())
Output
[[1.008 0. 0. 0. ] |
Further Learning Resources
This is Part 21 of the DataFrame method series.
Also, have a look at the Pandas DataFrame methods cheat sheet!