Serialize DataFrame – Dict, Records, JSON, and Pickles


This article focuses on the serialization and conversion methods of a Python DataFrame:

  • from_dict(),
  • to_dict(),
  • from_records(),
  • to_records(),
  • to_json(), and
  • to_pickle().

Let’s get started!


Preparation

Before any data manipulation can occur, two (2) new libraries will require installation.

  • The Pandas library enables access to/from a DataFrame.
  • The NumPy library supports multi-dimensional arrays and matrices in addition to a collection of mathematical functions.

To install these libraries, navigate to an IDE terminal and execute the commands below at the command prompt. The terminal used in this example uses a dollar sign ($) as its prompt; your terminal prompt may differ.

$ pip install pandas
$ pip install numpy

Press the <Enter> key after each command to start the installation process. If the installations were successful, a confirmation message displays in the terminal.


Feel free to view the PyCharm installation guide for the required libraries.


Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.

import pandas as pd
import numpy as np

DataFrame.from_dict()

The from_dict() classmethod converts a valid dictionary structure into a DataFrame format. Upon conversion, the keys of the original dictionary translate to DataFrame columns.

The syntax for this method is as follows:

classmethod DataFrame.from_dict(data, orient='columns', dtype=None, columns=None)
  • data: This parameter is a valid dictionary to be converted.
  • orient: The orientation of the data. The available options are:
    • 'columns': pass this option if the keys are the columns. Selected by default.
    • 'index': pass this option if the keys are the rows.
    • 'tight': assume a dictionary with the keys 'index', 'columns', 'data', 'index_names', and 'column_names'.
  • dtype: This parameter is the data type to force. Otherwise, the data type is inferred.
  • columns: This parameter is the column label(s) to use when orient is 'index'.
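The effect of the orient parameter is easiest to see side by side. The following minimal sketch, using a small hypothetical dictionary, contrasts the default 'columns' orientation with 'index':

```python
import pandas as pd

# Hypothetical dictionary: three values per key.
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}

df_cols = pd.DataFrame.from_dict(data)                  # keys -> columns
df_rows = pd.DataFrame.from_dict(data, orient='index')  # keys -> rows

print(df_cols.shape)  # (3, 2): three rows, columns A and B
print(df_rows.shape)  # (2, 3): rows A and B, integer columns 0, 1, 2
```

With 'columns', the dictionary keys 'A' and 'B' become column labels; with 'index', the same keys become row labels and the list positions become the columns.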

For this example, a dictionary containing the first five (5) elements of the Periodic Table converts to a DataFrame.

elements = {'Hydrogen': [1, 1766], 'Helium':    [2, 1868], 
            'Lithium':  [3, 1817], 'Beryllium': [4, 1798], 
            'Boron':    [5, 1808]}

periodic_df = pd.DataFrame.from_dict(elements, orient='index',
              columns=['Atomic #', 'Discovered'])
print(periodic_df)
  • Line [1] creates a dictionary of lists and saves it to the variable elements.
  • Line [2] does the following:
    • creates a DataFrame from the elements Dictionary
    • sets the orient parameter to index
    • sets the column names to clearly identify the data
    • saves the output to the periodic_df DataFrame
  • Line [3] outputs the DataFrame to the terminal.

Output

           Atomic #  Discovered
Hydrogen          1        1766
Helium            2        1868
Lithium           3        1817
Beryllium         4        1798
Boron             5        1808

DataFrame.to_dict()

The to_dict() method converts a valid DataFrame structure to a dictionary format.

The syntax for this method is as follows:

DataFrame.to_dict(orient='dict', into=<class 'dict'>)
  • orient: This parameter sets the structure of the returned dictionary. The available options are:
    • 'dict' (default): dictionary: {column -> {index -> value}}
    • 'list': dictionary: {column -> [values]}
    • 'series': dictionary: {column -> Series(values)}
    • 'split': dictionary: {'index' -> [index], 'columns' -> [columns], 'data' -> [values]}
    • 'tight': like 'split', with 'index_names' and 'column_names' added
    • 'records': list: [{column -> value}, …, {column -> value}]
    • 'index': dictionary: {index -> {column -> value}}
  • into: This parameter sets the data structure to convert the data into. The default value is a dictionary (dict).

This example reads in the first five (5) rows and three (3) columns of the file to a DataFrame. This DataFrame then converts to a dictionary format.

Click here to save this CSV file and move it to the current working directory.

df = pd.read_csv('finxters.csv', usecols=['FID', 'First_Name', 'Last_Name']).head()
print(df)
result = df.to_dict()
print(result)
  • Line [1] reads in the first five (5) rows (head) and three (3) columns (usecols) of the finxters.csv file. The output saves to a DataFrame (df).
  • Line [2] outputs the DataFrame to the terminal.
  • Line [3] converts the DataFrame (df) to a dictionary. The output saves to result.
  • Line [4] outputs the result to the terminal.

Output – df

        FID First_Name  Last_Name
0  30022145      Steve   Hamilton
1  30022192        Amy  Pullister
2  30022331      Peter       Dunn
3  30022345     Marcus   Williams
4  30022359      Alice     Miller

Output – result

{'FID': {0: 30022145, 1: 30022192, 2: 30022331, 3: 30022345, 4: 30022359},
'First_Name': {0: 'Steve', 1: 'Amy', 2: 'Peter', 3: 'Marcus', 4: 'Alice'},
'Last_Name':  {0: 'Hamilton', 1: 'Pullister', 2: 'Dunn', 3: 'Williams', 4: 'Miller'}}

If the 'split' option were passed to the to_dict() method, the output would be as follows:

df = pd.read_csv('finxters.csv', usecols=['FID', 'First_Name', 'Last_Name']).head() 
print(df)
result = df.to_dict('split')
print(result)

Output – result

{'index': [0, 1, 2, 3, 4],
'columns': ['FID', 'First_Name', 'Last_Name'],
'data': [[30022145, 'Steve', 'Hamilton'], [30022192, 'Amy', 'Pullister'],
          [30022331, 'Peter', 'Dunn'], [30022345, 'Marcus', 'Williams'],
          [30022359, 'Alice', 'Miller']]}
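The remaining orient options follow the same pattern. This minimal sketch uses a small inline DataFrame (so it does not depend on the CSV file) to contrast 'records', 'list', and 'index':

```python
import pandas as pd

# Small inline DataFrame standing in for the CSV data.
df = pd.DataFrame({'FID': [1, 2], 'Name': ['Amy', 'Peter']})

print(df.to_dict('records'))
# [{'FID': 1, 'Name': 'Amy'}, {'FID': 2, 'Name': 'Peter'}]

print(df.to_dict('list'))
# {'FID': [1, 2], 'Name': ['Amy', 'Peter']}

print(df.to_dict('index'))
# {0: {'FID': 1, 'Name': 'Amy'}, 1: {'FID': 2, 'Name': 'Peter'}}
```

The 'records' option is handy when each row should become a standalone dictionary, e.g., for feeding rows to an API one at a time.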

DataFrame.from_records()

The from_records() classmethod converts a valid structured ndarray, sequence of tuples or dictionaries, or DataFrame into a DataFrame format.

The syntax for this method is as follows:

classmethod DataFrame.from_records(data, index=None, exclude=None, columns=None, coerce_float=False, nrows=None)
  • data: This parameter is a valid structured ndarray, sequence of tuples or dictionaries, or DataFrame.
  • index: A field of arrays to use as the index, or a list containing a specific set of index labels.
  • exclude: The columns/fields to exclude from the conversion.
  • columns: The column names to use in the conversion.
  • coerce_float: This parameter tries to convert decimal values to floats.
  • nrows: If data is an iterator, the number of rows to read in.


This example converts a NumPy array built from a list of tuples, containing four (4) fictitious Finxter users, to a DataFrame.

data = np.array([(30022145, 'wildone92'), (30022192, 'AmyP'), 
                 (30022331, '1998_pete'), (30022345, 'RexTex')])
users_df = pd.DataFrame.from_records(data, columns=['ID', 'Username'])
print(users_df)
  • Line [1] creates a NumPy array from a list of tuples and saves it to the data variable.
  • Line [2] does the following:
    • creates a DataFrame from the data variable
    • sets the column names to clearly identify the data
    • saves the output to the users_df DataFrame
  • Line [3] outputs the DataFrame to the terminal.

Output

         ID   Username
0  30022145  wildone92
1  30022192       AmyP
2  30022331  1998_pete
3  30022345     RexTex
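One caveat with the example above: a plain np.array of mixed tuples upcasts every element to a string, so the ID column arrives as text. A structured array lets each field keep its own dtype. A minimal sketch:

```python
import numpy as np
import pandas as pd

# Structured array: each field declares its own dtype, so the
# integer IDs survive the round trip into the DataFrame.
data = np.array([(30022145, 'wildone92'), (30022192, 'AmyP')],
                dtype=[('ID', 'i8'), ('Username', 'U10')])

df = pd.DataFrame.from_records(data)
print(df.dtypes)  # ID is int64; Username is object
```

The field names in the dtype double as the column names, so the columns parameter is not needed here.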

DataFrame.to_records()

The to_records() method converts a valid DataFrame structure to a NumPy record array. The index is included as the first field if requested.

The syntax for this method is as follows:

DataFrame.to_records(index=True, column_dtypes=None, index_dtypes=None)
  • index: This parameter, if True, includes the index in the record array. The value saves to the 'index' field or uses the index label, if set.
  • column_dtypes: The data type to store the columns. If a dictionary, each column maps accordingly.
  • index_dtypes: The data type to store the index levels. If a dictionary, each index level and index map accordingly.

This example reads in the first five (5) rows and three (3) columns of the file to a DataFrame. This DataFrame then converts to records.

Click here to save this CSV file and move it to the current working directory.

df = pd.read_csv('finxters.csv', usecols=['FID', 'First_Name', 'Last_Name']).head()
print(df)
result = df.to_records()
print(result)
  • Line [1] reads in the first five (5) rows (head) and three (3) columns (usecols) of the finxters.csv file. The output saves to a DataFrame (df).
  • Line [2] outputs the DataFrame to the terminal.
  • Line [3] converts the DataFrame (df) to records. The output saves to result.
  • Line [4] outputs the result to the terminal.

Output – df

        FID First_Name  Last_Name
0  30022145      Steve   Hamilton
1  30022192        Amy  Pullister
2  30022331      Peter       Dunn
3  30022345     Marcus   Williams
4  30022359      Alice     Miller

Output – result


[(0, 30022145, 'Steve', 'Hamilton') (1, 30022192, 'Amy', 'Pullister')
(2, 30022331, 'Peter', 'Dunn') (3, 30022345, 'Marcus', 'Williams')
(4, 30022359, 'Alice', 'Miller')]
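The index and column_dtypes parameters can be combined to shape the record array. A minimal sketch using a small inline DataFrame (not the CSV file):

```python
import pandas as pd

df = pd.DataFrame({'FID': [1, 2], 'Name': ['Amy', 'Peter']})

# Drop the index from the records and force a 32-bit dtype for FID.
rec = df.to_records(index=False, column_dtypes={'FID': 'int32'})

print(rec.dtype.names)       # ('FID', 'Name')
print(rec['FID'].dtype)      # int32
print(rec[0])                # first record: FID 1, Name 'Amy'
```

Because index=False is passed, no 'index' field appears in the result; each record holds only the two column values.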

DataFrame.to_json()

The to_json() method converts a DataFrame object to a JSON string.

💡 Note: Any NaN/None values will convert to null values.
Any DateTime objects will convert to UNIX timestamps.

The syntax for this method is as follows:

DataFrame.to_json(path_or_buf=None, orient=None, date_format=None, double_precision=10, force_ascii=True, date_unit='ms', default_handler=None, lines=False, compression='infer', index=True, indent=None, storage_options=None)
  • path_or_buf: This parameter is a string, path, or file object with a write function.
  • orient: This parameter is the expected JSON string format. The defaults and allowed values depend on the object:
    • Series: the default is 'index'; allowed values are 'split', 'records', 'index', and 'table'.
    • DataFrame: the default is 'columns'; allowed values are 'split', 'records', 'index', 'columns', 'values', and 'table'.
    The resulting JSON formats are:
    • 'split': dictionary: {'index' -> [index], 'columns' -> [columns], 'data' -> [values]}
    • 'records': list: [{column -> value}, …, {column -> value}]
    • 'index': dictionary: {index -> {column -> value}}
    • 'columns': dictionary: {column -> {index -> value}}
    • 'values': just the values array
    • 'table': dictionary: {'schema': {schema}, 'data': {data}}
  • date_format: This is the format of the date conversion. The options are 'epoch' or 'iso'.
  • double_precision: The decimal places to use when encoding float values.
  • force_ascii: Whether to force the encoded string to be valid ASCII.
  • date_unit: The unit of time for encoding.
  • default_handler: The handler to call if the string cannot be converted to JSON.
  • lines: If orient is 'records', write a line-delimited JSON string.
  • compression: If 'infer' and path_or_buf is path-like, detect compression from the extension: '.gz', '.bz2', '.zip', '.xz', or '.zst'.
  • index: If True, this parameter includes index values in the JSON string.
  • indent: This parameter determines the length of the indent for a record.
  • storage_options: This parameter contains extra options (dictionary format), such as host, port, username, etc.

This example reads in the countries.csv file to a DataFrame. This DataFrame then converts to JSON. Click here to save this CSV file and move it to the current working directory.

df = pd.read_csv('countries.csv').head()
result = df.to_json(indent=4, orient='records', lines=True)
print(result)
  • Line [1] reads in the first five (5) rows (head) of the countries.csv file. The output saves to a DataFrame (df).
  • Line [2] does the following:
    • converts the DataFrame to a JSON format
    • formats the output by indenting each record four (4) spaces from the left
    • sets the orient parameter to records and lines to True (see above definition).
    • saves the output to result.
  • Line [3] outputs the result to the terminal.

Output – result

{
"Country":"Germany",
"Capital":"Berlin",
"Population":83783942,
"Area":357021
}
{
"Country":"France",
"Capital":"Paris",
"Population":67081000,
"Area":551695
}
{
"Country":"Spain",
"Capital":"Madrid",
"Population":47431256,
"Area":498511
}
{
"Country":"Italy",
"Capital":"Rome",
"Population":60317116,
"Area":301338
}
{
"Country":"Poland",
"Capital":"Warsaw",
"Population":38383000,
"Area":312685
}
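A JSON string produced by to_json() can be read straight back with pd.read_json() using the same orient. A minimal round-trip sketch with a small inline DataFrame (not the CSV file):

```python
import io
import pandas as pd

df = pd.DataFrame({'Country': ['Germany', 'France'],
                   'Capital': ['Berlin', 'Paris']})

# Serialize with orient='records': one JSON object per row.
json_str = df.to_json(orient='records')
print(json_str)

# Read the string back; wrapping it in StringIO treats it as a
# file-like buffer rather than a path.
df_back = pd.read_json(io.StringIO(json_str), orient='records')
print(df_back.equals(df))  # True: the round trip preserves the data
```

Matching the orient on both sides is what makes the round trip lossless here; mixing orients would reshape the data.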

DataFrame.to_pickle()

The to_pickle() method converts an object in memory to a byte stream. This object can be stored as a binary file and read back in later.

The syntax for this method is as follows:

DataFrame.to_pickle(path, compression='infer', protocol=5, storage_options=None)
  • path: This parameter is the file path where the pickle file saves.
  • compression: If 'infer' and path ends in '.gz', '.bz2', '.zip', '.xz', or '.zst', use the corresponding compression.
  • protocol: This parameter is an integer that stipulates the pickle protocol to use. Options are 0-5. Click here for additional details.
  • storage_options: This parameter is a dictionary containing additional details, such as a host or port.

This example reads in the finxters.csv file to a DataFrame. The contents of this DataFrame save to a pickle file.

Click here to save this CSV file and move it to the current working directory.

df_users = pd.read_csv('finxters.csv', usecols=['FID', 'Username', 'Password'])
df_users.to_pickle('pickle_file')
  • Line [1] reads in three (3) columns from the finxters.csv file. The output saves to a DataFrame (df_users).
  • Line [2] saves the contents of the DataFrame to a pickle file.

💡 Note: Navigate to the current working directory to see this file located in the file list.

To learn how to read in a pickle file, click here for details.
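As a preview, reading a pickle back is a one-liner with pd.read_pickle(). A minimal round-trip sketch using a small inline DataFrame and a temporary file (not the finxters.csv data):

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'FID': [30022145, 30022192],
                   'Username': ['wildone92', 'AmyP']})

# Write to a temporary pickle file, then read it back unchanged.
path = os.path.join(tempfile.gettempdir(), 'pickle_demo.pkl')
df.to_pickle(path)
df_back = pd.read_pickle(path)

print(df_back.equals(df))  # True: dtypes and values survive intact
os.remove(path)            # clean up the temporary file
```

Unlike CSV, the pickle round trip preserves dtypes and the index exactly, which is why pickling is convenient for caching intermediate DataFrames.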

Further Learning Resources

This is Part 21 of the DataFrame method series.

  • Part 1 focuses on the DataFrame methods abs(), all(), any(), clip(), corr(), and corrwith().
  • Part 2 focuses on the DataFrame methods count(), cov(), cummax(), cummin(), cumprod(), cumsum().
  • Part 3 focuses on the DataFrame methods describe(), diff(), eval(), kurtosis().
  • Part 4 focuses on the DataFrame methods mad(), min(), max(), mean(), median(), and mode().
  • Part 5 focuses on the DataFrame methods pct_change(), quantile(), rank(), round(), prod(), and product().
  • Part 6 focuses on the DataFrame methods add_prefix(), add_suffix(), and align().
  • Part 7 focuses on the DataFrame methods at_time(), between_time(), drop(), drop_duplicates() and duplicated().
  • Part 8 focuses on the DataFrame methods equals(), filter(), first(), last(), head(), and tail()
  • Part 9 focuses on the DataFrame methods equals(), filter(), first(), last(), head(), and tail()
  • Part 10 focuses on the DataFrame methods reset_index(), sample(), set_axis(), set_index(), take(), and truncate()
  • Part 11 focuses on the DataFrame methods backfill(), bfill(), fillna(), dropna(), and interpolate()
  • Part 12 focuses on the DataFrame methods isna(), isnull(), notna(), notnull(), pad() and replace()
  • Part 13 focuses on the DataFrame methods drop_level(), pivot(), pivot_table(), reorder_levels(), sort_values() and sort_index()
  • Part 14 focuses on the DataFrame methods nlargest(), nsmallest(), swap_level(), stack(), unstack() and swap_axes()
  • Part 15 focuses on the DataFrame methods melt(), explode(), squeeze(), to_xarray(), t() and transpose()
  • Part 16 focuses on the DataFrame methods append(), assign(), compare(), join(), merge() and update()
  • Part 17 focuses on the DataFrame methods asfreq(), asof(), shift(), slice_shift(), tshift(), first_valid_index(), and last_valid_index()
  • Part 18 focuses on the DataFrame methods resample(), to_period(), to_timestamp(), tz_localize(), and tz_convert()
  • Part 19 focuses on the visualization aspect of DataFrames and Series via plotting, such as plot(), and plot.area().
  • Part 20 focuses on continuing the visualization aspect of DataFrames and Series via plotting such as hexbin, hist, pie, and scatter plots.
  • Part 21 focuses on the serialization and conversion methods from_dict(), to_dict(), from_records(), to_records(), to_json(), and to_pickle().
  • Part 22 focuses on the serialization and conversion methods to_clipboard(), to_html(), to_sql(), to_csv(), and to_excel().
  • Part 23 focuses on the serialization and conversion methods to_markdown(), to_stata(), to_hdf(), to_latex(), to_xml().
  • Part 24 focuses on the serialization and conversion methods to_parquet(), to_feather(), to_string(), Styler.
  • Part 25 focuses on the serialization and conversion methods to_bgq() and to_coo().