5 Best Ways to Convert Pandas DataFrame to Nested Dict

💡 Problem Formulation: Data scientists and developers often need to convert a Pandas DataFrame into a nested dictionary for API consumption, data serialization, or simply to interface with other Python structures. For example, a DataFrame with columns ‘Category’, ‘Item’, and ‘Value’ can be converted into a nested dictionary where each ‘Category’ becomes the key to a dictionary of ‘Item’: ‘Value’ pairs. This article illustrates how to achieve this conversion using different methods.

Method 1: Using groupby() and to_dict()

Grouping by a column and converting each group to a dictionary creates a nested dict whose first-level keys are the group keys. Here, groupby() is followed by apply(), which turns each group into a dictionary using to_dict(). This method works best when one column supplies the outer keys and another column supplies the inner keys.

Here’s an example:

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'Category': ['Fruit', 'Fruit', 'Vegetable', 'Vegetable'],
    'Item': ['Apple', 'Banana', 'Carrot', 'Bean'],
    'Value': [1, 2, 3, 4]
})

# Convert DataFrame to nested dict
nested_dict = df.groupby('Category')[['Item', 'Value']].apply(lambda x: x.set_index('Item').to_dict(orient='index')).to_dict()

print(nested_dict)

The output will be:

{
    'Fruit': {'Apple': {'Value': 1}, 'Banana': {'Value': 2}},
    'Vegetable': {'Carrot': {'Value': 3}, 'Bean': {'Value': 4}}
}

This code snippet creates a DataFrame with category, item, and value columns. Calling groupby() on ‘Category’ yields one group per category. The apply() method then indexes each group by ‘Item’ and calls to_dict(orient='index'), so every item maps to a dictionary of its remaining columns (here just ‘Value’). The final to_dict() turns the resulting Series into the outer dictionary keyed by category.
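The inner {'Value': ...} dictionaries become more useful when a DataFrame carries several value columns. The following is a minimal sketch of the same groupby() pattern under that assumption; the 'Stock' column and the name nested_multi are hypothetical additions for illustration and are not part of the original example.

```python
import pandas as pd

# 'Stock' is a hypothetical second value column added for illustration
df = pd.DataFrame({
    'Category': ['Fruit', 'Fruit', 'Vegetable', 'Vegetable'],
    'Item': ['Apple', 'Banana', 'Carrot', 'Bean'],
    'Value': [1, 2, 3, 4],
    'Stock': [10, 20, 30, 40],
})

# Same pattern as above: group by category, index each group by item,
# and let to_dict(orient='index') collect all remaining columns per item
nested_multi = (
    df.groupby('Category')[['Item', 'Value', 'Stock']]
      .apply(lambda g: g.set_index('Item').to_dict(orient='index'))
      .to_dict()
)

print(nested_multi['Fruit']['Apple'])
```

Each item now maps to a dictionary holding both ‘Value’ and ‘Stock’, so no extra code is needed as more columns are added.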

Method 2: Using Pivot Table

A pivot table can reorganize a DataFrame into a matrix-like structure that converts directly into a nested dictionary. With the default to_dict() orientation, the pivot table's columns become the first-level keys and its index values become the second-level keys. The pandas.pivot_table() function is very versatile for these scenarios.

Here’s an example:

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'Category': ['Fruit', 'Fruit', 'Vegetable', 'Vegetable'],
    'Item': ['Apple', 'Banana', 'Carrot', 'Bean'],
    'Value': [1, 2, 3, 4]
})

# Create pivot table and convert to nested dict
nested_dict = df.pivot_table(index='Category', columns='Item', values='Value', aggfunc='sum').to_dict()

print(nested_dict)

The output will be:

{
    'Apple': {'Fruit': 1.0, 'Vegetable': nan},
    'Banana': {'Fruit': 2.0, 'Vegetable': nan},
    'Bean': {'Fruit': nan, 'Vegetable': 4.0},
    'Carrot': {'Fruit': nan, 'Vegetable': 3.0}
}

The snippet generates a pivot table with ‘Category’ as the index and ‘Item’ as columns, aggregating values by their sum. It’s then converted into a dictionary with item names as keys, each mapping to a dictionary of categories and their values. Note that missing combinations yield NaN values, and because NaN is a float, the integer values in those columns are upcast to floats.
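If categories should be the outer keys and the NaN placeholders are unwanted, one possible follow-up is sketched below. It assumes the same sample data; the names pivot, by_category, and cleaned are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['Fruit', 'Fruit', 'Vegetable', 'Vegetable'],
    'Item': ['Apple', 'Banana', 'Carrot', 'Bean'],
    'Value': [1, 2, 3, 4]
})

pivot = df.pivot_table(index='Category', columns='Item',
                       values='Value', aggfunc='sum')

# orient='index' puts the pivot's rows (categories) on the outer level
by_category = pivot.to_dict(orient='index')

# Drop the NaN placeholders left by missing Category/Item combinations
cleaned = {
    category: {item: value for item, value in items.items() if pd.notna(value)}
    for category, items in by_category.items()
}

print(cleaned)
```

The remaining values are still floats, because the pivot table stored them alongside NaN before the filtering step.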

Method 3: Using a Multi-Index and to_dict()

Creating a multi-index in Pandas allows for representing multi-layered data by setting multiple columns as the index. Calling to_dict(orient='index') on such a DataFrame produces a dictionary keyed by tuples, where each tuple holds one value per index level and maps to a dictionary of the remaining columns. The set_index() method creates the multi-index, which then facilitates the to_dict() conversion.

Here’s an example:

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'Category': ['Fruit', 'Fruit', 'Vegetable', 'Vegetable'],
    'Item': ['Apple', 'Banana', 'Carrot', 'Bean'],
    'Value': [1, 2, 3, 4]
})

# Set multi-index and convert to nested dict
nested_dict = df.set_index(['Category', 'Item']).to_dict(orient='index')

print(nested_dict)

The output will be:

{
    ('Fruit', 'Apple'): {'Value': 1},
    ('Fruit', 'Banana'): {'Value': 2},
    ('Vegetable', 'Carrot'): {'Value': 3},
    ('Vegetable', 'Bean'): {'Value': 4}
}

This snippet sets up a multi-index based on the ‘Category’ and ‘Item’ columns. The to_dict() method with the `orient='index'` argument turns the multi-index DataFrame into a dictionary of dictionaries. Here, the (Category, Item) tuples become the keys of the outer dictionary, and each inner dictionary maps ‘Value’ to its number.
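When tuple keys are inconvenient, for instance because JSON does not support them, the flat result can be regrouped into a truly nested dictionary with plain Python. This is a minimal sketch; the variable names flat and nested are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['Fruit', 'Fruit', 'Vegetable', 'Vegetable'],
    'Item': ['Apple', 'Banana', 'Carrot', 'Bean'],
    'Value': [1, 2, 3, 4]
})

# Tuple-keyed dictionary, as produced by the multi-index approach
flat = df.set_index(['Category', 'Item']).to_dict(orient='index')

# Split each (category, item) tuple into two dictionary levels
nested = {}
for (category, item), row in flat.items():
    nested.setdefault(category, {})[item] = row

print(nested)
```

The result has the same shape as the output of Method 1: categories on the outer level, items on the second level, and a dictionary of columns for each item.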

Method 4: Using json_normalize()

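The json_normalize() function in Pandas normalizes semi-structured JSON data (nested dictionaries or lists of dictionaries) into a flat table, which can then be exported with to_dict(). This approach is particularly useful when the source data arrives as JSON-like records and needs flattening before it is reshaped. Note that on an already flat DataFrame, as in the example below, the result is a list of record dictionaries rather than a dictionary nested by category.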

Here’s an example:

import pandas as pd
from pandas import json_normalize

# Create a flat DataFrame
df = pd.DataFrame({
    'Category': ['Fruit', 'Fruit', 'Vegetable', 'Vegetable'],
    'Item': ['Apple', 'Banana', 'Carrot', 'Bean'],
    'Value': [1, 2, 3, 4]
})

# Export rows as records, then normalize them back into a flat DataFrame
flat_df = json_normalize(df.to_dict(orient='records'))

# Convert the normalized DataFrame to a list of record dictionaries
print(flat_df.to_dict(orient='records'))

The output will be:

[
    {'Category': 'Fruit', 'Item': 'Apple', 'Value': 1},
    {'Category': 'Fruit', 'Item': 'Banana', 'Value': 2},
    {'Category': 'Vegetable', 'Item': 'Carrot', 'Value': 3},
    {'Category': 'Vegetable', 'Item': 'Bean', 'Value': 4}
]

In the given snippet, the to_dict(orient='records') method transforms the DataFrame into a list of dictionaries, where each dictionary represents one row. json_normalize() then normalizes that list into a flat DataFrame, which is converted back into a list of records. Because the input records contain no nested fields, the output mirrors the original rows; the function only changes the shape when the records themselves contain nested dictionaries.
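To show what json_normalize() contributes when the input really is nested, here is a small sketch. The records list, its 'Details' field, and the names flat_df and nested are hypothetical and chosen only for illustration. Nested keys are flattened into dot-separated column names, and the flat rows are then regrouped by category with plain Python.

```python
import pandas as pd

# Hypothetical nested records, e.g. parsed from a JSON API response
records = [
    {'Category': 'Fruit', 'Details': {'Item': 'Apple', 'Value': 1}},
    {'Category': 'Fruit', 'Details': {'Item': 'Banana', 'Value': 2}},
    {'Category': 'Vegetable', 'Details': {'Item': 'Carrot', 'Value': 3}},
]

# Nested keys are joined with '.', giving the columns
# 'Category', 'Details.Item' and 'Details.Value'
flat_df = pd.json_normalize(records)
print(list(flat_df.columns))

# Regroup the flat rows into Category -> Item -> Value
nested = {}
for row in flat_df.to_dict(orient='records'):
    nested.setdefault(row['Category'], {})[row['Details.Item']] = row['Details.Value']

print(nested)
```

In this sketch json_normalize() handles the flattening, while the nesting by category is done by the loop, not by Pandas.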

Bonus One-Liner Method 5: Using a Simple Dictionary Comprehension

For a straightforward DataFrame structure, a nested dictionary can be created using a simple dictionary comprehension. This approach is quick and Pythonic, suitable for smaller DataFrames and when you need a fast solution with no dependency on complex functionality.

Here’s an example:

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'Category': ['Fruit', 'Fruit', 'Vegetable', 'Vegetable'],
    'Item': ['Apple', 'Banana', 'Carrot', 'Bean'],
    'Value': [1, 2, 3, 4]
})

# Use dictionary comprehension to create nested dict
nested_dict = {k: f.set_index('Item')['Value'].to_dict()
               for k, f in df.groupby('Category')}

print(nested_dict)

The output will be:

{
    'Fruit': {'Apple': 1, 'Banana': 2},
    'Vegetable': {'Carrot': 3, 'Bean': 4}
}

This code creates a nested dictionary by grouping the DataFrame by ‘Category’ and iterating through the groups. For each category, it indexes the group by ‘Item’ and converts the ‘Value’ column to a dictionary, resulting in a two-level nested structure.
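The comprehension above handles two levels. For deeper hierarchies, the same idea can be applied recursively. The sketch below is one possible generalization, not a Pandas feature: the helper nest() and the 'Region' column are hypothetical, and it assumes each combination of key values appears only once (later duplicates would overwrite earlier ones).

```python
import pandas as pd

def nest(frame, keys, value):
    """Hypothetical helper: nest `frame` by each column in `keys`, ending in `value`."""
    first, rest = keys[0], keys[1:]
    if not rest:
        # Innermost level: map the last key column to the value column
        return frame.set_index(first)[value].to_dict()
    # Outer levels: one sub-dictionary per group, built recursively
    return {key: nest(group, rest, value) for key, group in frame.groupby(first)}

# 'Region' is a hypothetical extra column added for illustration
df = pd.DataFrame({
    'Region': ['North', 'North', 'South', 'South'],
    'Category': ['Fruit', 'Vegetable', 'Fruit', 'Fruit'],
    'Item': ['Apple', 'Carrot', 'Banana', 'Apple'],
    'Value': [1, 3, 2, 5],
})

deep = nest(df, ['Region', 'Category', 'Item'], 'Value')
print(deep)
```

Passing a shorter key list, such as ['Category', 'Item'], gives the same two-level shape as the one-liner.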

Summary/Discussion

  • Method 1: Groupby and to_dict. Strengths: Simple and intuitive for one-level nesting. Weaknesses: Not as versatile for deeper levels of nesting.
  • Method 2: Pivot Table. Strengths: Useful for matrix-like structures and when working with data that is neatly tabular. Weaknesses: Introduces NaN values where data is missing, which also upcasts integer values to floats.
  • Method 3: Multi-Index and to_dict. Strengths: Suitable for complex hierarchical data structures. Weaknesses: Can produce a less intuitive dictionary format with tuple-based keys.
  • Method 4: json_normalize. Strengths: Powerful when dealing with semi-structured JSON data. Weaknesses: Produces flat records rather than a nested dictionary, so it is overkill for simple DataFrames and adds an extra conversion step.
  • Bonus Method 5: Dictionary Comprehension. Strengths: Quick and easy one-liner. Weaknesses: Not suitable for more complex data manipulations and larger DataFrames.