5 Best Ways to Convert Complex JSON to DataFrame in Python

💡 Problem Formulation: In the era of big data, developers often find themselves needing to convert JSON structures with nested arrays and objects into tidy pandas DataFrames for analysis. Imagine receiving a JSON file with multiple levels of hierarchy, and you need to flatten this structure for use within a pandas DataFrame. This article guides you through five effective methods to transform a complex JSON into an analyzable, flat data structure, suitable for data science or machine learning applications.

Method 1: Using pandas json_normalize

The pandas library provides json_normalize, a powerful function specifically designed to flatten nested JSON objects into a flat table. It’s an ideal choice when dealing with JSON data with multiple nested levels.

Here’s an example:

import pandas as pd
json_data = {
    'name': 'John Doe',
    'projects': [{'name': 'Project 1', 'year': 2020}, {'name': 'Project 2', 'year': 2021}]
}
df = pd.json_normalize(json_data, record_path='projects', meta=['name'], record_prefix='project_')
print(df)

Output:

   project_name  project_year      name
0     Project 1          2020  John Doe
1     Project 2          2021  John Doe

This code snippet uses json_normalize to flatten the ‘projects’ array from the input JSON while carrying the top-level ‘name’ field along as a constant column. Because each record in ‘projects’ also contains a ‘name’ key, record_prefix is needed to avoid a column-name conflict (without it, pandas raises a ValueError). The result is a DataFrame with one row per entry in the ‘projects’ array, effectively converting the nested JSON to a flat table.
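json_normalize can also reach into deeper levels. The sketch below, using hypothetical data and column names of our own, pulls a nested meta field by giving its path as a list and joins the generated names with a custom sep:

```python
import pandas as pd

# Hypothetical record: a nested 'contact' object alongside the list of records
json_data = {
    'name': 'John Doe',
    'contact': {'city': 'Boston'},
    'projects': [{'title': 'Project 1', 'year': 2020},
                 {'title': 'Project 2', 'year': 2021}]
}

# Nested meta fields are addressed by a path list; sep controls how
# the path segments are joined into a column name
df = pd.json_normalize(
    json_data,
    record_path='projects',
    meta=['name', ['contact', 'city']],
    sep='_',
)
print(df.columns.tolist())  # ['title', 'year', 'name', 'contact_city']
```

Record columns come first, followed by the meta columns in the order they were listed, and the meta values are repeated on every row generated from the records.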

Method 2: Manual Unpacking in a Loop

Manually unpacking the JSON and inserting it into a DataFrame through looping allows full control over the flattening process. You can preserve or discard information as needed, giving you customizability over the structure of your DataFrame.

Here’s an example:

import pandas as pd
json_data = [{
    'name': 'Alice',
    'details': {'age': 30, 'city': 'New York'},
    'hobbies': ['reading', 'swimming']
}]
rows = []
for item in json_data:
    row = item.copy()
    row.update(row.pop('details'))
    row['hobbies'] = ', '.join(row['hobbies'])
    rows.append(row)
df = pd.DataFrame(rows)
print(df)

Output:

    name            hobbies  age      city
0  Alice  reading, swimming   30  New York

Here we loop over the list of JSON objects, manually unpack and combine nested structures under ‘details’, convert lists to comma-separated strings, and append each flattened dictionary to a list. We then create a DataFrame from the list for analysis.
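Real-world JSON rarely has every key present in every record. A variation of the same loop, sketched here with hypothetical records, uses dict.get() with defaults so the unpacking tolerates missing keys:

```python
import pandas as pd

# Hypothetical records where some keys may be absent
json_data = [
    {'name': 'Alice', 'details': {'age': 30, 'city': 'New York'}, 'hobbies': ['reading']},
    {'name': 'Bob'},  # no 'details' or 'hobbies'
]

rows = []
for item in json_data:
    row = {'name': item.get('name')}
    row.update(item.get('details', {}))                   # tolerate a missing nested object
    row['hobbies'] = ', '.join(item.get('hobbies', []))   # empty string if absent
    rows.append(row)

df = pd.DataFrame(rows)
print(df)
```

Records missing the nested object simply get NaN in the unpacked columns, so the DataFrame stays rectangular without any preprocessing of the input.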

Method 3: Using Recursion for Deeply Nested Structures

Recursion provides a methodical way to handle deeply nested JSON objects, flattening each level as the function descends. This method is most effective when the depth of the nesting is irregular or potentially very deep.

Here’s an example:

import pandas as pd

def flatten_json(y):
    out = {}

    def flatten(x, name=''):
        if isinstance(x, dict):
            for key, value in x.items():
                flatten(value, f"{name}{key}_")
        elif isinstance(x, list):
            for i, value in enumerate(x):
                flatten(value, f"{name}{i}_")
        else:
            out[name[:-1]] = x  # drop the trailing underscore

    flatten(y)
    return out

json_data = {
    'name': 'Bob',
    'info': {'address': {'street': '123 Main St', 'zip': '10001'}}
}
df = pd.DataFrame([flatten_json(json_data)])
print(df)

Output:

   name  info_address_street  info_address_zip
0   Bob          123 Main St             10001

The recursive flatten_json function traverses the nested JSON, constructing a flat dictionary where keys are the concatenated path of each nested element. The resulting DataFrame is created from a list containing the flattened dictionary.
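The same idea generalizes to a whole list of records. The sketch below is a lightly parameterized variant of the flattener (the sep argument and the sample records are our own additions), applied row by row before building the DataFrame:

```python
import pandas as pd

def flatten_json(y, sep='_'):
    """Recursively flatten nested dicts and lists into a single-level dict."""
    out = {}

    def flatten(x, name=''):
        if isinstance(x, dict):
            for key, value in x.items():
                flatten(value, f"{name}{key}{sep}")
        elif isinstance(x, list):
            for i, value in enumerate(x):
                flatten(value, f"{name}{i}{sep}")
        else:
            out[name[:-len(sep)]] = x  # strip the trailing separator

    flatten(y)
    return out

# Hypothetical list of irregularly nested records
records = [
    {'name': 'Bob', 'info': {'address': {'street': '123 Main St'}}},
    {'name': 'Carol', 'tags': ['admin', 'dev']},
]
df = pd.DataFrame([flatten_json(r) for r in records])
print(df.columns.tolist())
```

Each record contributes whatever flattened keys it has; pd.DataFrame takes the union of keys as columns and fills the gaps with NaN, so irregular records pose no problem.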

Method 4: Using ast.literal_eval

For JSON data stored as strings within a DataFrame, Python’s ast.literal_eval() can be used to parse strings as Python literals, combined with pandas’ apply() to expand each column as necessary.

Here’s an example:

import pandas as pd
import ast

df_str = pd.DataFrame({
    'json_data': ["{'name': 'Eve', 'age': 22, 'skills': ['Python', 'SQL']}"]
})

df = df_str['json_data'].apply(ast.literal_eval).apply(pd.Series)
print(df)

Output:

   name  age         skills
0   Eve   22  [Python, SQL]

This code snippet takes a DataFrame with a column of JSON-like strings, uses ast.literal_eval() to safely evaluate the string as a Python dictionary, and then applies pandas’ apply() function with pd.Series to unpack the dictionary into separate DataFrame columns.
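Note that ast.literal_eval() understands Python literals, not strict JSON: it cannot parse true, false, or null. If the strings are genuine JSON (double quotes, JSON literals), json.loads is the appropriate parser instead; this hypothetical example swaps it in:

```python
import pandas as pd
import json

# Strict JSON uses double quotes and literals like true/null, which
# ast.literal_eval cannot parse; json.loads handles them natively
df_str = pd.DataFrame({
    'json_data': ['{"name": "Eve", "age": 22, "active": true}']
})

df = df_str['json_data'].apply(json.loads).apply(pd.Series)
print(df)
```

The rest of the pipeline is unchanged: json.loads turns each string into a dictionary, and apply(pd.Series) expands the dictionaries into columns.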

Bonus One-Liner Method 5: Using pandas read_json

For simpler cases where the JSON structure maps neatly to a DataFrame, you can use the one-liner pandas.read_json(). It’s perfect for quickly converting JSON strings or files into pandas DataFrames when no nested or complex structures are involved.

Here’s an example:

import pandas as pd
from io import StringIO

json_str = '[{"name":"Frank", "age":29, "city":"London"}]'
# Recent pandas versions deprecate passing a literal JSON string,
# so wrap it in StringIO to present it as a file-like object
df = pd.read_json(StringIO(json_str))
print(df)

Output:

    name  age    city
0  Frank   29  London

This snippet utilizes pd.read_json(), which directly converts a JSON string into a pandas DataFrame, provided the JSON is already in a shape the DataFrame constructor understands (essentially a list of records).
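read_json also accepts an orient argument for JSON that is keyed differently. This sketch, with made-up row labels, reads an index-keyed object, again wrapping the literal string in StringIO as recent pandas versions expect:

```python
import pandas as pd
from io import StringIO

# A JSON object keyed by row label rather than a list of records;
# orient='index' tells read_json to treat the outer keys as the index
json_str = '{"row1": {"name": "Frank", "age": 29}, "row2": {"name": "Grace", "age": 31}}'
df = pd.read_json(StringIO(json_str), orient='index')
print(df)
```

Other orientations such as 'columns' (the default for JSON objects) and 'records' cover the remaining common layouts, so a quick look at the data's outer structure usually dictates the right value.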

Summary/Discussion

  • Method 1: pandas json_normalize. Ideal for semi-structured JSON. Does not handle very deep nesting or mixed types well.
  • Method 2: Manual Unpacking. Highly customizable. Can be labour-intensive and error-prone for very complex JSON data.
  • Method 3: Recursive Flattening. Suited for very deeply nested structures. Can be complex to write and understand.
  • Method 4: ast.literal_eval. Good for DataFrame columns holding Python-literal strings, and safer than eval. Cannot parse strict JSON literals such as true or null.
  • Method 5: pandas read_json. The most straightforward but only suitable for JSON that is already formatted as a record list.