Serialize DataFrame – Dict, Records, JSON, and Pickles


This article focuses on the serialization and conversion methods of a Python DataFrame:

  • from_dict(),
  • to_dict(),
  • from_records(),
  • to_records(),
  • to_json(), and
  • to_pickle().

Let’s get started!


Preparation

Before any data manipulation can occur, two (2) new libraries will require installation.

  • The Pandas library enables access to/from a DataFrame.
  • The NumPy library supports multi-dimensional arrays and matrices in addition to a collection of mathematical functions.

To install these libraries, navigate to an IDE terminal and execute the commands below at the command prompt. The terminal used in this example uses a dollar sign ($) as its prompt; your terminal prompt may differ.

$ pip install pandas
$ pip install numpy

Press the <Enter> key after each command to start the installation process. If the installations were successful, a confirmation message displays in the terminal.


Feel free to view the PyCharm installation guide for the required libraries.


Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.

import pandas as pd
import numpy as np

DataFrame.from_dict()

The from_dict() classmethod converts a valid dictionary structure into a DataFrame format. Upon conversion, the keys of the original dictionary translate to DataFrame columns.

The syntax for this method is as follows:

classmethod DataFrame.from_dict(data, orient='columns', dtype=None, columns=None)
  • data: This parameter is a valid dictionary to be converted.
  • orient: The orientation of the data. The available options are:
    • 'columns': pass this option if the keys are the columns. Selected by default.
    • 'index': pass this option if the keys are the rows.
    • 'tight': assume a dictionary with the keys 'index', 'columns', 'data', 'index_names', and 'column_names'.
  • dtype: This parameter is the data type to force. Otherwise, the data type is inferred.
  • columns: This parameter is the column label(s) to use when orient is 'index'.
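The effect of the orient parameter is easiest to see side by side. The following minimal sketch, using a small hypothetical dictionary, contrasts the default 'columns' orientation with 'index':

```python
import pandas as pd

# Hypothetical dictionary: three values per key.
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}

df_cols = pd.DataFrame.from_dict(data)                  # keys -> columns
df_rows = pd.DataFrame.from_dict(data, orient='index')  # keys -> rows

print(df_cols.shape)  # (3, 2): three rows, columns A and B
print(df_rows.shape)  # (2, 3): rows A and B, integer columns 0, 1, 2
```

With 'columns', the dictionary keys 'A' and 'B' become column labels; with 'index', the same keys become row labels and the list positions become the columns.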

For this example, a dictionary containing the first five (5) elements of the Periodic Table converts to a DataFrame.

elements = {'Hydrogen': [1, 1766], 'Helium':    [2, 1868], 
            'Lithium':  [3, 1817], 'Beryllium': [4, 1798], 
            'Boron':    [5, 1808]}

periodic_df = pd.DataFrame.from_dict(elements, orient='index',
              columns=['Atomic #', 'Discovered'])
print(periodic_df)
  • Line [1] creates a dictionary of lists and saves it to the variable elements.
  • Line [2] does the following:
    • creates a DataFrame from the elements Dictionary
    • sets the orient parameter to index
    • sets the column names to clearly identify the data
    • saves the output to the periodic_df DataFrame
  • Line [3] outputs the DataFrame to the terminal.

Output

           Atomic #  Discovered
Hydrogen          1        1766
Helium            2        1868
Lithium           3        1817
Beryllium         4        1798
Boron             5        1808

DataFrame.to_dict()

The to_dict() method converts a valid DataFrame structure to a dictionary format.

The syntax for this method is as follows:

DataFrame.to_dict(orient='dict', into=<class 'dict'>)
  • orient: This parameter sets the structure of the returned dictionary. The available options are:
    • 'dict' (default): dictionary: {column -> {index -> value}}
    • 'list': dictionary: {column -> [values]}
    • 'series': dictionary: {column -> Series(values)}
    • 'split': dictionary: {'index' -> [index], 'columns' -> [columns], 'data' -> [values]}
    • 'tight': like 'split', with 'index_names' and 'column_names' added
    • 'records': list: [{column -> value}, …, {column -> value}]
    • 'index': dictionary: {index -> {column -> value}}
  • into: This parameter sets the data structure to convert the data into. The default value is a dictionary (dict).

This example reads in the first five (5) rows and three (3) columns of the file to a DataFrame. This DataFrame then converts to a dictionary format.

Click here to save this CSV file and move it to the current working directory.

df = pd.read_csv('finxters.csv', usecols=['FID', 'First_Name', 'Last_Name']).head()
print(df)
result = df.to_dict()
print(result)
  • Line [1] reads in the first five (5) rows (head) and three (3) columns (usecols) of the finxters.csv file. The output saves to a DataFrame (df).
  • Line [2] outputs the DataFrame to the terminal.
  • Line [3] converts the DataFrame (df) to a dictionary. The output saves to result.
  • Line [4] outputs the result to the terminal.

Output – df

        FID First_Name  Last_Name
0  30022145      Steve   Hamilton
1  30022192        Amy  Pullister
2  30022331      Peter       Dunn
3  30022345     Marcus   Williams
4  30022359      Alice     Miller

Output – result

{'FID': {0: 30022145, 1: 30022192, 2: 30022331, 3: 30022345, 4: 30022359},
'First_Name': {0: 'Steve', 1: 'Amy', 2: 'Peter', 3: 'Marcus', 4: 'Alice'},
'Last_Name':  {0: 'Hamilton', 1: 'Pullister', 2: 'Dunn', 3: 'Williams', 4: 'Miller'}}

If the 'split' option were passed to the to_dict() method, the output would be as follows:

df = pd.read_csv('finxters.csv', usecols=['FID', 'First_Name', 'Last_Name']).head() 
print(df)
result = df.to_dict('split')
print(result)

Output – result

{'index': [0, 1, 2, 3, 4],
'columns': ['FID', 'First_Name', 'Last_Name'],
'data': [[30022145, 'Steve', 'Hamilton'], [30022192, 'Amy', 'Pullister'],
          [30022331, 'Peter', 'Dunn'], [30022345, 'Marcus', 'Williams'],
          [30022359, 'Alice', 'Miller']]}
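The remaining orient options follow the same pattern. This minimal sketch uses a small inline DataFrame (so it does not depend on the CSV file) to contrast 'records', 'list', and 'index':

```python
import pandas as pd

# Small inline DataFrame standing in for the CSV data.
df = pd.DataFrame({'FID': [1, 2], 'Name': ['Amy', 'Peter']})

print(df.to_dict('records'))
# [{'FID': 1, 'Name': 'Amy'}, {'FID': 2, 'Name': 'Peter'}]

print(df.to_dict('list'))
# {'FID': [1, 2], 'Name': ['Amy', 'Peter']}

print(df.to_dict('index'))
# {0: {'FID': 1, 'Name': 'Amy'}, 1: {'FID': 2, 'Name': 'Peter'}}
```

The 'records' option is handy when each row should become a standalone dictionary, e.g., for feeding rows to an API one at a time.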

DataFrame.from_records()

The from_records() classmethod converts a valid structured ndarray, sequence of tuples or dictionaries, or DataFrame into a DataFrame format.

The syntax for this method is as follows:

classmethod DataFrame.from_records(data, index=None, exclude=None, columns=None, coerce_float=False, nrows=None)
  • data: This parameter is a valid structured ndarray, sequence of tuples or dictionaries, or DataFrame.
  • index: A field of arrays to use as the index, or a list containing a specific set of index labels.
  • exclude: The columns/fields to exclude from the conversion.
  • columns: The column names to use in the conversion.
  • coerce_float: This parameter tries to convert decimal values to floats.
  • nrows: If data is an iterator, the number of rows to read in.


This example converts a NumPy array built from a list of tuples, containing four (4) fictitious Finxter users, to a DataFrame.

data = np.array([(30022145, 'wildone92'), (30022192, 'AmyP'), 
                 (30022331, '1998_pete'), (30022345, 'RexTex')])
users_df = pd.DataFrame.from_records(data, columns=['ID', 'Username'])
print(users_df)
  • Line [1] creates a NumPy array from a list of tuples and saves it to the data variable.
  • Line [2] does the following:
    • creates a DataFrame from the data variable
    • sets the column names to clearly identify the data
    • saves the output to the users_df DataFrame
  • Line [3] outputs the DataFrame to the terminal.

Output

         ID   Username
0  30022145  wildone92
1  30022192       AmyP
2  30022331  1998_pete
3  30022345     RexTex
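One caveat with the example above: a plain np.array of mixed tuples upcasts every element to a string, so the ID column arrives as text. A structured array lets each field keep its own dtype. A minimal sketch:

```python
import numpy as np
import pandas as pd

# Structured array: each field declares its own dtype, so the
# integer IDs survive the round trip into the DataFrame.
data = np.array([(30022145, 'wildone92'), (30022192, 'AmyP')],
                dtype=[('ID', 'i8'), ('Username', 'U10')])

df = pd.DataFrame.from_records(data)
print(df.dtypes)  # ID is int64; Username is object
```

The field names in the dtype double as the column names, so the columns parameter is not needed here.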

DataFrame.to_records()

The to_records() method converts a valid DataFrame structure to a NumPy record array. The index is included as the first field if requested.

The syntax for this method is as follows:

DataFrame.to_records(index=True, column_dtypes=None, index_dtypes=None)
  • index: This parameter, if True, includes the index in the record array. The value saves to the 'index' field or uses the index label, if set.
  • column_dtypes: The data type to store the columns. If a dictionary, each column maps accordingly.
  • index_dtypes: The data type to store the index levels. If a dictionary, each index level and index map accordingly.

This example reads in the first five (5) rows and three (3) columns of the file to a DataFrame. This DataFrame then converts to records.

Click here to save this CSV file and move it to the current working directory.

df = pd.read_csv('finxters.csv', usecols=['FID', 'First_Name', 'Last_Name']).head()
print(df)
result = df.to_records()
print(result)
  • Line [1] reads in the first five (5) rows (head) and three (3) columns (usecols) of the finxters.csv file. The output saves to a DataFrame (df).
  • Line [2] outputs the DataFrame to the terminal.
  • Line [3] converts the DataFrame (df) to records. The output saves to result.
  • Line [4] outputs the result to the terminal.

Output – df

        FID First_Name  Last_Name
0  30022145      Steve   Hamilton
1  30022192        Amy  Pullister
2  30022331      Peter       Dunn
3  30022345     Marcus   Williams
4  30022359      Alice     Miller

Output – result


[(0, 30022145, 'Steve', 'Hamilton') (1, 30022192, 'Amy', 'Pullister')
(2, 30022331, 'Peter', 'Dunn') (3, 30022345, 'Marcus', 'Williams')
(4, 30022359, 'Alice', 'Miller')]
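The index and column_dtypes parameters can be combined to shape the record array. A minimal sketch using a small inline DataFrame (not the CSV file):

```python
import pandas as pd

df = pd.DataFrame({'FID': [1, 2], 'Name': ['Amy', 'Peter']})

# Drop the index from the records and force a 32-bit dtype for FID.
rec = df.to_records(index=False, column_dtypes={'FID': 'int32'})

print(rec.dtype.names)       # ('FID', 'Name')
print(rec['FID'].dtype)      # int32
print(rec[0])                # first record: FID 1, Name 'Amy'
```

Because index=False is passed, no 'index' field appears in the result; each record holds only the two column values.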

DataFrame.to_json()

The to_json() method converts a DataFrame object to a JSON string.

💡 Note: Any NaN/None values will convert to null values.
Any DateTime objects will convert to UNIX timestamps.

The syntax for this method is as follows:

DataFrame.to_json(path_or_buf=None, orient=None, date_format=None, double_precision=10, force_ascii=True, date_unit='ms', default_handler=None, lines=False, compression='infer', index=True, indent=None, storage_options=None)
  • path_or_buf: This parameter is a string, path, or file object with a write function.
  • orient: This parameter is the expected JSON string format. The defaults and allowed values depend on the object:
    • Series: the default is 'index'; allowed values are 'split', 'records', 'index', and 'table'.
    • DataFrame: the default is 'columns'; allowed values are 'split', 'records', 'index', 'columns', 'values', and 'table'.
    The resulting JSON formats are:
    • 'split': dictionary: {'index' -> [index], 'columns' -> [columns], 'data' -> [values]}
    • 'records': list: [{column -> value}, …, {column -> value}]
    • 'index': dictionary: {index -> {column -> value}}
    • 'columns': dictionary: {column -> {index -> value}}
    • 'values': just the values array
    • 'table': dictionary: {'schema': {schema}, 'data': {data}}
  • date_format: This is the format of the date conversion. The options are 'epoch' or 'iso'.
  • double_precision: The decimal places to use when encoding float values.
  • force_ascii: Whether to force the encoded string to be valid ASCII.
  • date_unit: The unit of time for encoding.
  • default_handler: The handler to call if the string cannot be converted to JSON.
  • lines: If orient is 'records', write a line-delimited JSON string.
  • compression: If 'infer' and path_or_buf is path-like, detect compression from the extension: '.gz', '.bz2', '.zip', '.xz', or '.zst'.
  • index: If True, this parameter includes index values in the JSON string.
  • indent: This parameter determines the length of the indent for a record.
  • storage_options: This parameter contains extra options (dictionary format), such as host, port, username, etc.

This example reads in the countries.csv file to a DataFrame. This DataFrame then converts to JSON. Click here to save this CSV file and move it to the current working directory.

df = pd.read_csv('countries.csv').head()
result = df.to_json(indent=4, orient='records', lines=True)
print(result)
  • Line [1] reads in the first five (5) rows (head) of the countries.csv file. The output saves to a DataFrame (df).
  • Line [2] does the following:
    • converts the DataFrame to a JSON format
    • formats the output by indenting each record four (4) spaces from the left
    • sets the orient parameter to records and lines to True (see above definition).
    • saves the output to result.
  • Line [3] outputs the result to the terminal.

Output – result

{
"Country":"Germany",
"Capital":"Berlin",
"Population":83783942,
"Area":357021
}
{
"Country":"France",
"Capital":"Paris",
"Population":67081000,
"Area":551695
}
{
"Country":"Spain",
"Capital":"Madrid",
"Population":47431256,
"Area":498511
}
{
"Country":"Italy",
"Capital":"Rome",
"Population":60317116,
"Area":301338
}
{
"Country":"Poland",
"Capital":"Warsaw",
"Population":38383000,
"Area":312685
}
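A JSON string produced by to_json() can be read straight back with pd.read_json() using the same orient. A minimal round-trip sketch with a small inline DataFrame (not the CSV file):

```python
import io
import pandas as pd

df = pd.DataFrame({'Country': ['Germany', 'France'],
                   'Capital': ['Berlin', 'Paris']})

# Serialize with orient='records': one JSON object per row.
json_str = df.to_json(orient='records')
print(json_str)

# Read the string back; wrapping it in StringIO treats it as a
# file-like buffer rather than a path.
df_back = pd.read_json(io.StringIO(json_str), orient='records')
print(df_back.equals(df))  # True: the round trip preserves the data
```

Matching the orient on both sides is what makes the round trip lossless here; mixing orients would reshape the data.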

DataFrame.to_pickle()

The to_pickle() method converts an object in memory to a byte stream. This object can be stored as a binary file and read back in later.

The syntax for this method is as follows:

DataFrame.to_pickle(path, compression='infer', protocol=5, storage_options=None)
  • path: This parameter is the file path where the pickle file saves.
  • compression: If 'infer' and path ends in '.gz', '.bz2', '.zip', '.xz', or '.zst', use the corresponding compression.
  • protocol: This parameter is an integer that stipulates the pickle protocol to use. Options are 0-5. Click here for additional details.
  • storage_options: This parameter is a dictionary containing additional details, such as a host or port.

This example reads in the finxters.csv file to a DataFrame. The contents of this DataFrame save to a pickle file.

Click here to save this CSV file and move it to the current working directory.

df_users = pd.read_csv('finxters.csv', usecols=['FID', 'Username', 'Password'])
df_users.to_pickle('pickle_file')
  • Line [1] reads in three (3) columns from the finxters.csv file. The output saves to a DataFrame (df_users).
  • Line [2] saves the contents of the DataFrame to a pickle file.

💡 Note: Navigate to the current working directory to see this file located in the file list.

To learn how to read in a pickle file, click here for details.
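As a preview, reading a pickle back is a one-liner with pd.read_pickle(). A minimal round-trip sketch using a small inline DataFrame and a temporary file (not the finxters.csv data):

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'FID': [30022145, 30022192],
                   'Username': ['wildone92', 'AmyP']})

# Write to a temporary pickle file, then read it back unchanged.
path = os.path.join(tempfile.gettempdir(), 'pickle_demo.pkl')
df.to_pickle(path)
df_back = pd.read_pickle(path)

print(df_back.equals(df))  # True: dtypes and values survive intact
os.remove(path)            # clean up the temporary file
```

Unlike CSV, the pickle round trip preserves dtypes and the index exactly, which is why pickling is convenient for caching intermediate DataFrames.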

Further Learning Resources

This is Part 21 of the DataFrame method series.

  • Part 1 focuses on the DataFrame methods abs(), all(), any(), clip(), corr(), and corrwith().
  • Part 2 focuses on the DataFrame methods count(), cov(), cummax(), cummin(), cumprod(), cumsum().
  • Part 3 focuses on the DataFrame methods describe(), diff(), eval(), kurtosis().
  • Part 4 focuses on the DataFrame methods mad(), min(), max(), mean(), median(), and mode().
  • Part 5 focuses on the DataFrame methods pct_change(), quantile(), rank(), round(), prod(), and product().
  • Part 6 focuses on the DataFrame methods add_prefix(), add_suffix(), and align().
  • Part 7 focuses on the DataFrame methods at_time(), between_time(), drop(), drop_duplicates() and duplicated().
  • Part 8 focuses on the DataFrame methods equals(), filter(), first(), last(), head(), and tail()
  • Part 9 focuses on the DataFrame methods equals(), filter(), first(), last(), head(), and tail()
  • Part 10 focuses on the DataFrame methods reset_index(), sample(), set_axis(), set_index(), take(), and truncate()
  • Part 11 focuses on the DataFrame methods backfill(), bfill(), fillna(), dropna(), and interpolate()
  • Part 12 focuses on the DataFrame methods isna(), isnull(), notna(), notnull(), pad() and replace()
  • Part 13 focuses on the DataFrame methods drop_level(), pivot(), pivot_table(), reorder_levels(), sort_values() and sort_index()
  • Part 14 focuses on the DataFrame methods nlargest(), nsmallest(), swap_level(), stack(), unstack() and swap_axes()
  • Part 15 focuses on the DataFrame methods melt(), explode(), squeeze(), to_xarray(), t() and transpose()
  • Part 16 focuses on the DataFrame methods append(), assign(), compare(), join(), merge() and update()
  • Part 17 focuses on the DataFrame methods asfreq(), asof(), shift(), slice_shift(), tshift(), first_valid_index(), and last_valid_index()
  • Part 18 focuses on the DataFrame methods resample(), to_period(), to_timestamp(), tz_localize(), and tz_convert()
  • Part 19 focuses on the visualization aspect of DataFrames and Series via plotting, such as plot(), and plot.area().
  • Part 20 focuses on continuing the visualization aspect of DataFrames and Series via plotting such as hexbin, hist, pie, and scatter plots.
  • Part 21 focuses on the serialization and conversion methods from_dict(), to_dict(), from_records(), to_records(), to_json(), and to_pickle().
  • Part 22 focuses on the serialization and conversion methods to_clipboard(), to_html(), to_sql(), to_csv(), and to_excel().
  • Part 23 focuses on the serialization and conversion methods to_markdown(), to_stata(), to_hdf(), to_latex(), to_xml().
  • Part 24 focuses on the serialization and conversion methods to_parquet(), to_feather(), to_string(), Styler.
  • Part 25 focuses on the serialization and conversion methods to_bgq() and to_coo().