Creating Sankey Diagrams with Python Plotly from a Pandas DataFrame

💡 Problem Formulation: Flow data is often stored in a Pandas DataFrame with columns for the source, the target, and the flow amount, but Plotly’s Sankey diagram expects a different structure: a list of node labels plus links that refer to those nodes by integer index. Translating between the two is where users tend to struggle. The desired output is a clear, interactive Sankey diagram that shows how quantities flow from one set of items to another.
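To make the target format concrete, the following sketch writes the node labels and link indices for a small sample by hand and then decodes them back into rows. It assumes the three column names used throughout this article; the names labels, link, and decoded are illustrative only.

```python
import pandas as pd

# Input: one row per flow, with human-readable labels
df = pd.DataFrame({
    "source": ["A", "A", "B", "C"],
    "target": ["B", "C", "C", "D"],
    "value": [10, 15, 25, 20],
})

# Target structure: a flat list of labels plus integer positions into that list
labels = ["A", "B", "C", "D"]
link = {
    "source": [0, 0, 1, 2],   # positions of the source labels
    "target": [1, 2, 2, 3],   # positions of the target labels
    "value": [10, 15, 25, 20],
}

# Decoding the positions back into labels reproduces the original rows
decoded = pd.DataFrame({
    "source": [labels[i] for i in link["source"]],
    "target": [labels[i] for i in link["target"]],
    "value": link["value"],
})
print(decoded)
```

Every method below is a different way of producing these two structures from the DataFrame.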

Method 1: Manual Data Preparation

This method involves creating dictionaries for nodes and links directly from a Pandas DataFrame by manually preparing lists of sources, targets, and values. It is suitable for small datasets that are not subject to frequent updates.

Here’s an example:

import pandas as pd
import plotly.graph_objects as go

# Sample dataframe.
df = pd.DataFrame({
    'source': ["A", "A", "B", "C"],
    'target': ["B", "C", "C", "D"],
    'value': [10, 15, 25, 20]
})

# Creating nodes and links for Plotly Sankey diagram.
nodes = list(set(df['source']).union(df['target']))
node_indices = {node: i for i, node in enumerate(nodes)}
links = {
    'source': df['source'].map(node_indices),
    'target': df['target'].map(node_indices),
    'value': df['value']
}

# Build the Sankey diagram
fig = go.Figure(data=[go.Sankey(
    node = {'label': nodes},
    link = links
)])
fig.show()

This code snippet creates a Sankey diagram by first generating a unique list of nodes and then mapping each node to its index, which is used for the sources and targets. Every step is explicit, which makes the transformation easy to follow. Note that a Python set has no guaranteed order, so the position of each label in nodes can differ between runs; the diagram stays correct because the links are mapped against the same list.
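If a stable node order matters, one option is to collect the labels in order of first appearance instead of going through a set. This is a minimal sketch of that idea using dict.fromkeys, not the only way to do it; the variable all_labels is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "source": ["A", "A", "B", "C"],
    "target": ["B", "C", "C", "D"],
    "value": [10, 15, 25, 20],
})

# Stack sources on top of targets, then drop repeats while keeping
# the first occurrence of each label (dicts preserve insertion order)
all_labels = pd.concat([df["source"], df["target"]], ignore_index=True)
nodes = list(dict.fromkeys(all_labels))

# Same index mapping as before, now reproducible across runs
node_indices = {node: i for i, node in enumerate(nodes)}
print(nodes)
print(node_indices)
```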

Method 2: Automated Node Identification

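This method automates the creation of node labels and indices with Python’s built-in data structures: a set collects the unique labels, sorting fixes their order, and a dictionary maps each label to its index. It handles larger or frequently changing datasets without manual bookkeeping, which reduces the room for error.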

Here’s an example:

import pandas as pd
import plotly.graph_objects as go

# Sample dataframe
df = pd.DataFrame({
    'source': ["A", "A", "B", "C"],
    'target': ["B", "C", "C", "D"],
    'value': [10, 15, 25, 20]
})

# Automatically build nodes list and create links dictionary
nodes = list(set(df['source']).union(df['target']))
nodes.sort()  # Optional: sort nodes if order is important
node_dict = {node: idx for idx, node in enumerate(nodes)}
links = {
    'source': df['source'].map(node_dict).tolist(),
    'target': df['target'].map(node_dict).tolist(),
    'value': df['value'].tolist(),
}

fig = go.Figure(data=[go.Sankey(
    node = {'label': nodes},
    link = links
)])
fig.show()

This code example derives the node list and the index dictionary from the data itself, so nothing needs to be edited by hand when rows are added or labels change. Sorting the nodes makes the indices reproducible, and .tolist() hands Plotly plain Python lists.
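Datasets that grow or change often contain several rows for the same source and target pair. The sketch below assumes that such repeats should be merged into a single link by summing their values before the node indices are built; the DataFrame raw and its numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical raw data with the pair (A, B) recorded twice
raw = pd.DataFrame({
    "source": ["A", "A", "A", "B", "C"],
    "target": ["B", "B", "C", "C", "D"],
    "value": [4, 6, 15, 25, 20],
})

# Sum the values of all rows that share the same (source, target) pair
df = raw.groupby(["source", "target"], as_index=False)["value"].sum()
print(df)
```

The aggregated df has one row per pair and can be passed through the same node and link preparation as above.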

Method 3: Utilizing Helper Functions

By implementing a helper function, we can abstract the logic of transforming a DataFrame into a format required by Plotly, making our code more modular and easier to understand. This is advantageous when the same transformation needs to be applied to multiple datasets.

Here’s an example:

import pandas as pd
import plotly.graph_objects as go

# Helper function to prepare Sankey diagram data
def prepare_sankey_data(df):
    nodes = list(set(df['source']).union(df['target']))
    nodes.sort()  # Optional: sort nodes if order is important
    node_dict = {node: idx for idx, node in enumerate(nodes)}
    links = {
        'source': df['source'].map(node_dict).tolist(),
        'target': df['target'].map(node_dict).tolist(),
        'value': df['value'].tolist(),
    }
    return nodes, links

# Sample dataframe
df = pd.DataFrame({
    'source': ["A", "A", "B", "C"],
    'target': ["B", "C", "C", "D"],
    'value': [10, 15, 25, 20]
})

nodes, links = prepare_sankey_data(df)

fig = go.Figure(data=[go.Sankey(node={'label': nodes}, link=links)])
fig.show()

The provided code demonstrates the use of a helper function prepare_sankey_data which takes a DataFrame and returns the nodes and links required to create a Sankey diagram with Plotly. This approach is great for clean and reusable code.
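When the helper is reused across datasets, the column names will not always be source, target, and value. The following variant is a sketch under that assumption: the parameters source_col, target_col, and value_col are hypothetical additions, as is the check for missing columns.

```python
import pandas as pd

def prepare_sankey_data(df, source_col="source", target_col="target", value_col="value"):
    """Return (nodes, links) for go.Sankey from arbitrarily named columns."""
    # Fail early with a readable message if a column is missing
    missing = {source_col, target_col, value_col} - set(df.columns)
    if missing:
        raise KeyError(f"Missing columns: {sorted(missing)}")

    # Sorted unique labels give a reproducible node order
    nodes = sorted(set(df[source_col]) | set(df[target_col]))
    node_dict = {node: idx for idx, node in enumerate(nodes)}
    links = {
        "source": df[source_col].map(node_dict).tolist(),
        "target": df[target_col].map(node_dict).tolist(),
        "value": df[value_col].tolist(),
    }
    return nodes, links

# Usage with differently named columns (sample data invented for this sketch)
flows = pd.DataFrame({"from": ["X", "Y"], "to": ["Y", "Z"], "amount": [3, 7]})
nodes, links = prepare_sankey_data(flows, "from", "to", "amount")
print(nodes, links)
```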

Method 4: Utilizing the Plotly Express API

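Plotly Express is a high-level interface for Plotly that simplifies the creation of many chart types. At the time of writing, however, the plotly.express module does not include a sankey function, so a call such as px.sankey(...) raises an AttributeError. The most compact alternative stays with go.Sankey and lets pandas.factorize handle the node labels and indices in a single call.

Here’s an example:

import pandas as pd
import plotly.graph_objects as go

# Sample dataframe
df = pd.DataFrame({
    'source': ["A", "A", "B", "C"],
    'target': ["B", "C", "C", "D"],
    'value': [10, 15, 25, 20]
})

# Encode all labels at once: codes are integer indices, labels the unique names
codes, labels = pd.factorize(pd.concat([df['source'], df['target']]))
n = len(df)  # the first n codes are sources, the remaining n are targets

fig = go.Figure(go.Sankey(
    node={'label': labels.tolist()},
    link={'source': codes[:n], 'target': codes[n:], 'value': df['value']}
))
fig.show()

This compact code snippet concatenates the source and target columns and passes them to pd.factorize, which returns an integer code for every entry together with the unique labels in order of first appearance. The first half of the codes belongs to the sources and the second half to the targets, so no explicit mapping dictionary is needed.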

Bonus One-Liner Method 5: Using a One-Liner with go.Sankey

For those who prefer a very succinct approach, the node encoding and the plotting call can be squeezed into a single expression. Plotly Express has no Sankey function at the time of writing, so the one-liner relies on go.Sankey and an assignment expression (Python 3.8 or later).

Here’s an example:

import pandas as pd
import plotly.graph_objects as go

# Sample dataframe
df = pd.DataFrame({'source': ["A", "A", "B", "C"], 'target': ["B", "C", "C", "D"], 'value': [10, 15, 25, 20]})

# One-liner: collect the sorted labels, map both columns to label positions, plot
go.Figure(go.Sankey(node={'label': (labels := sorted(set(df['source']) | set(df['target'])))}, link={'source': df['source'].map(labels.index), 'target': df['target'].map(labels.index), 'value': df['value']})).show()

This example defines the DataFrame and then builds and displays the Sankey diagram in a single expression. The assignment expression stores the sorted label list so that both columns can be mapped to list positions with labels.index.

Summary/Discussion

Method 1: Manual Data Preparation. Makes every step explicit and works well for small datasets. It can be cumbersome for larger or changing datasets.
Method 2: Automated Node Identification. Streamlines node indexing and minimizes manual errors. It may be less transparent for users unfamiliar with Python mapping mechanisms.
Method 3: Utilizing Helper Functions. Encourages code reusability and modularity. It pays off mainly when the same transformation is applied to several datasets.
Method 4: Utilizing the Plotly Express API. Plotly Express would be the shortest route, but it does not currently provide a Sankey function, so the lower-level go.Sankey API remains necessary. In exchange, that API offers full customization.
Bonus Method 5: Using a One-Liner. Ideal for quick plotting without repeated dataset transformations. Less readable and potentially more difficult to debug.