Serialize a DataFrame with to_gbq() and to_coo()

This article focuses on the serialization and conversion methods of a Pandas DataFrame:

  • to_gbq(),
  • to_coo().

Let’s get started!


Preparation

Before any data manipulation can occur, four (4) new libraries will require installation.

  • The Pandas library enables access to/from a DataFrame.
  • The NumPy library supports multi-dimensional arrays and matrices, in addition to a collection of mathematical functions.
  • The pandas_gbq library allows access to Google BigQuery (GBQ).
  • The google.auth library handles authentication with Google APIs.

To install these libraries, navigate to an IDE terminal and execute the commands below at the command prompt. The terminal used in this example shows a dollar sign ($) as the prompt; your terminal prompt may differ.

$ pip install pandas
$ pip install pandas_gbq
$ pip install numpy
$ pip install google-auth

Hit the <Enter> key on the keyboard after each command to start the installation process.
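
If preferred, all four libraries can also be installed with a single command:

$ pip install pandas pandas-gbq numpy google-auth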

If the installations were successful, a confirmation message displays in the terminal.


Feel free to view the PyCharm installation guide for the required libraries.


Add the following code to the top of each code snippet. These imports allow the code in this article to run error-free.

import pandas as pd
import numpy as np
import pandas_gbq
import google.auth

DataFrame.to_gbq()

The to_gbq() method writes a DataFrame object to a Google BigQuery table.

BigQuery is Google Cloud's data warehouse, which helps you store, analyze, and visualize data.

The syntax for this method is as follows:

DataFrame.to_gbq(destination_table, project_id=None, chunksize=None, reauth=False, if_exists='fail', auth_local_webserver=False, table_schema=None, location=None, progress_bar=True, credentials=None)
The parameters are as follows:

  • destination_table: A string denoting the table name, presented in the form dataset.tablename.
  • project_id: The GBQ account Project ID. This ID is only available from the GBQ environment.
  • chunksize: The number of rows to insert per chunk. The default is to insert all rows at once.
  • reauth: If True, request user re-authentication from GBQ.
  • if_exists: The action taken if the table already exists. Options are:
    • 'fail': if the table exists, raise an error.
    • 'replace': if the table exists, drop and re-create it.
    • 'append': if the table exists, append the data.
  • auth_local_webserver: If True, use the local webserver flow instead of the console flow to retrieve user credentials.
  • table_schema: A list of dictionaries containing the table fields' data types (the Schema). If not provided, BigQuery attempts to create a Schema.
  • location: The location where the load job runs.
  • progress_bar: If True, use the tqdm library to display a progress bar for the upload.
  • credentials: The credentials for accessing Google APIs.

The to_gbq() method determines the BigQuery Table Schema based on the data types of the uploaded DataFrame. If this Schema does not meet your requirements, you can create a new Table Schema and pass it via the table_schema parameter.
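
As a minimal sketch, the snippet below passes a custom Schema for a small, hypothetical DataFrame (the column names, types, and project ID are assumptions for illustration):

# Hypothetical DataFrame; adjust the columns to match your data.
df = pd.DataFrame({'Country': ['Germany'], 'Population': [83240000]})

# Each Schema entry maps a column name to a BigQuery data type.
schema = [{'name': 'Country',    'type': 'STRING'},
          {'name': 'Population', 'type': 'INTEGER'}]

df.to_gbq('my_dataset.countries', project_id='your-project-id',
          if_exists='replace', table_schema=schema)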

πŸ’‘ Note: To view a list of available GBQ Data Types, click here.

Before running any code, you will need to do the following:

  • Create a Google Cloud project and take note of the Project ID.
  • Create a BigQuery dataset.
  • Generate and save the authentication credentials.

View the video below for instructions.

πŸ’‘ Note: Remember to take note of the project details and authentication.

For this example, countries.csv is read into a DataFrame, and this DataFrame is then appended to a Google BigQuery table. Click here to save this CSV file and move it to the current working directory.

prj_id   = 'central-hangar-341214'
dat_set  = 'my_dataset'
table_id = 'my_dataset.countries'

df = pd.read_csv('countries.csv')
df.to_gbq(table_id, if_exists='append', project_id=prj_id)
  • Line [1] contains the Project ID assigned earlier.
  • Line [2] is a name we assigned to the dataset.
  • Line [3] combines the dataset name with the table name. For this example, the table name is 'countries'.
  • Line [4] reads in the countries.csv to a DataFrame.
  • Line [5] does the following:
    • Assigns the table_id ('my_dataset.countries')
    • Checks if this table already exists. If so, the data appends to it.
    • Assigns the project_id ('central-hangar-341214')
    • Attempts to connect to and upload the data

πŸ’‘ Note: Remember to generate and save the authentication.


DataFrame Sparse to_coo()

The sparse to_coo() method creates a scipy.sparse.coo_matrix from a Series containing a MultiIndex. The row_levels and column_levels parameters determine the row/column coordinates.

The syntax for this method is as follows:

Series.sparse.to_coo(row_levels=(0,), column_levels=(1,), sort_labels=False)
The parameters are as follows:

  • row_levels: A tuple or a list of the MultiIndex levels (names or numbers) that form the row coordinates.
  • column_levels: A tuple or a list of the MultiIndex levels (names or numbers) that form the column coordinates.
  • sort_labels: If True, sort the row and column labels before forming the sparse matrix.

This example uses a Series containing missing data. The Series is converted to a sparse data type and then into a scipy.sparse.coo_matrix using to_coo().

stats = pd.Series([1.0080, 4.00260, 7.0, 9.012183, np.nan, np.nan])
stats.index = pd.MultiIndex.from_tuples(
    [(np.nan, 2, "a", 0),
     (1, 2, "a", 1),
     (np.nan, 1, "b", 0),
     (1, 1, "b", 1),
     (2, 1, "b", 0),
     (np.nan, 1, "b", 1)],
    names=["HYD", "HEL", "LIT", "BER"])

new_stats = stats.astype("Sparse")

A, rows, columns = new_stats.sparse.to_coo(
    row_levels=["HYD", "HEL"], 
    column_levels=["LIT", "BER"], sort_labels=True)

print(A)

Output

(0, 0) 1.008
(1, 1) 4.0026
(2, 2) 7.0
(3, 3) 9.012183
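
Besides the matrix A, to_coo() also returns the row and column labels that map the matrix coordinates back to the MultiIndex levels; these can be inspected directly:

# rows holds the (HYD, HEL) label pairs; columns holds the (LIT, BER) pairs.
print(rows)
print(columns)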

If we apply the todense() method to the sparse matrix A created above, the output is as follows:

print(A.todense())

Output

[[1.008    0.       0.       0.      ]
 [0.       4.0026   0.       0.      ]
 [0.       0.       7.       0.      ]
 [0.       0.       0.       9.012183]
 [0.       0.       0.       0.      ]]
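
The conversion also works in reverse: pd.Series.sparse.from_coo() rebuilds a sparse Series from a scipy.sparse.coo_matrix. A minimal sketch (note that the resulting MultiIndex contains the integer row/column coordinates, not the original labels):

# Rebuild a sparse Series from the coo_matrix created above.
back = pd.Series.sparse.from_coo(A)
print(back)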

Further Learning Resources

This is Part 25 of the DataFrame method series.

  • Part 1 focuses on the DataFrame methods abs(), all(), any(), clip(), corr(), and corrwith().
  • Part 2 focuses on the DataFrame methods count(), cov(), cummax(), cummin(), cumprod(), cumsum().
  • Part 3 focuses on the DataFrame methods describe(), diff(), eval(), kurtosis().
  • Part 4 focuses on the DataFrame methods mad(), min(), max(), mean(), median(), and mode().
  • Part 5 focuses on the DataFrame methods pct_change(), quantile(), rank(), round(), prod(), and product().
  • Part 6 focuses on the DataFrame methods add_prefix(), add_suffix(), and align().
  • Part 7 focuses on the DataFrame methods at_time(), between_time(), drop(), drop_duplicates() and duplicated().
  • Part 8 focuses on the DataFrame methods equals(), filter(), first(), last(), head(), and tail()
  • Part 9 focuses on the DataFrame methods equals(), filter(), first(), last(), head(), and tail()
  • Part 10 focuses on the DataFrame methods reset_index(), sample(), set_axis(), set_index(), take(), and truncate()
  • Part 11 focuses on the DataFrame methods backfill(), bfill(), fillna(), dropna(), and interpolate()
  • Part 12 focuses on the DataFrame methods isna(), isnull(), notna(), notnull(), pad() and replace()
  • Part 13 focuses on the DataFrame methods droplevel(), pivot(), pivot_table(), reorder_levels(), sort_values() and sort_index()
  • Part 14 focuses on the DataFrame methods nlargest(), nsmallest(), swaplevel(), stack(), unstack() and swapaxes()
  • Part 15 focuses on the DataFrame methods melt(), explode(), squeeze(), to_xarray(), T and transpose()
  • Part 16 focuses on the DataFrame methods append(), assign(), compare(), join(), merge() and update()
  • Part 17 focuses on the DataFrame methods asfreq(), asof(), shift(), slice_shift(), tshift(), first_valid_index(), and last_valid_index()
  • Part 18 focuses on the DataFrame methods resample(), to_period(), to_timestamp(), tz_localize(), and tz_convert()
  • Part 19 focuses on the visualization aspect of DataFrames and Series via plotting, such as plot(), and plot.area().
  • Part 20 focuses on continuing the visualization aspect of DataFrames and Series via plotting such as hexbin, hist, pie, and scatter plots.
  • Part 21 focuses on the serialization and conversion methods from_dict(), to_dict(), from_records(), to_records(), to_json(), and to_pickle().
  • Part 22 focuses on the serialization and conversion methods to_clipboard(), to_html(), to_sql(), to_csv(), and to_excel().
  • Part 23 focuses on the serialization and conversion methods to_markdown(), to_stata(), to_hdf(), to_latex(), to_xml().
  • Part 24 focuses on the serialization and conversion methods to_parquet(), to_feather(), to_string(), Styler.
  • Part 25 focuses on the serialization and conversion methods to_gbq() and to_coo().

Also, have a look at the Pandas DataFrame methods cheat sheet!