5 Best Ways to Reshape Data in a Pandas DataFrame - Be on the Right Side of Change

💡 Problem Formulation: When working with Pandas in Python, data analysts often need to alter the structure of DataFrame objects to perform better data analysis, enhance readability, or prepare data for machine learning models. For instance, consider a DataFrame with continuous time series data that must be reshaped into a wide format with distinct columns for each time period. The desired output is a DataFrame with rows representing unique entities and columns encapsulating time-specific measurements.

Method 1: Pivot

Pivoting is a common data manipulation technique that transforms data from long to wide format. It spreads the unique values of one column into separate columns, which makes side-by-side comparison and visualization easier. The pivot() function in Pandas is designed for this operation: columns names the column whose unique values become the new headers, index names the column that supplies the row labels, and values names the column that fills the cells of the new DataFrame.

Here’s an example:

import pandas as pd

# Sample data in long format
data = {
    'date': ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02'],
    'variable': ['A', 'B', 'A', 'B'],
    'value': [1, 4, 2, 5]
}
df = pd.DataFrame(data)

# Pivot the DataFrame
df_pivoted = df.pivot(index='date', columns='variable', values='value')

Output:

variable        A    B
date                   
2021-01-01       1    4
2021-01-02       2    5

This code snippet transforms the DataFrame from long to wide format using the pivot() function. The ‘date’ column becomes the index, while unique values in the ‘variable’ column become the new column headers with corresponding ‘value’ data filled in.
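pivot() only works when every index/column pair occurs once. As a minimal sketch of what happens otherwise, the snippet below builds a small made-up DataFrame (df_dup is a hypothetical name, not part of the example above) with a duplicated pair and falls back to pivot_table(). Summing the duplicates is an assumption here; the right aggregation depends on the data.

```python
import pandas as pd

# Hypothetical data: ('2021-01-01', 'A') appears twice
df_dup = pd.DataFrame({
    'date': ['2021-01-01', '2021-01-01', '2021-01-01'],
    'variable': ['A', 'A', 'B'],
    'value': [1, 10, 4],
})

# pivot() cannot decide which of the duplicate values to keep
try:
    df_dup.pivot(index='date', columns='variable', values='value')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates the duplicates instead (sum is an assumed choice)
df_agg = df_dup.pivot_table(index='date', columns='variable',
                            values='value', aggfunc='sum')
print(pivot_failed)
print(df_agg)
```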

Method 2: Melt

The melt() function is used to transform data from a wide format to a long format, which can be more suitable for certain types of analysis or visualization. This method involves melting the DataFrame, which takes multiple columns and condenses them into two columns, one for variable names and one for values, effectively making the data ‘longer’.

Here’s an example:

import pandas as pd

# Sample data in wide format
data = {
    'date': ['2021-01-01', '2021-01-02'],
    'A': [1, 2],
    'B': [4, 5]
}
df = pd.DataFrame(data)

# Melt the DataFrame
df_melted = df.melt(id_vars='date', var_name='variable', value_name='value')

Output:

         date variable  value
0  2021-01-01        A      1
1  2021-01-02        A      2
2  2021-01-01        B      4
3  2021-01-02        B      5

By using the melt() function, the example above converts the wide-form DataFrame into a long format, where ‘date’ is preserved as an identifier variable, and the columns ‘A’ and ‘B’ are transformed into two columns with variable names and values.
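Since melt() and pivot() reshape in opposite directions, a round trip is one way to sanity-check a reshape. The sketch below rebuilds the wide DataFrame from the example and assumes each date/variable pair is unique, which pivot() requires; df_roundtrip is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2021-01-01', '2021-01-02'],
    'A': [1, 2],
    'B': [4, 5],
})

# Wide to long
df_melted = df.melt(id_vars='date', var_name='variable', value_name='value')

# Long back to wide: pivot, restore 'date' as a column,
# and drop the leftover 'variable' label on the column axis
df_roundtrip = (
    df_melted
    .pivot(index='date', columns='variable', values='value')
    .reset_index()
    .rename_axis(columns=None)
)
print(df_roundtrip)
```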

Method 3: Stack

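Stacking a DataFrame moves a level of the column index into the row index, turning the data into a longer, stacked format. By default the innermost column level becomes the innermost row level, and the level argument selects a different one. This is particularly useful for multi-level column indices. When the columns have several levels, stack() returns a DataFrame that keeps the remaining column levels; when the columns have a single level, it returns a Series with a multi-level index.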

Here’s an example:

import pandas as pd

# Sample data with multi-level columns
header = pd.MultiIndex.from_product([['Semester 1', 'Semester 2'], ['Math', 'Science']])
data = [
    [70, 80, 85, 90],
    [92, 88, 94, 96]
]
df = pd.DataFrame(data, columns=header)

# Stack the DataFrame
df_stacked = df.stack(level=0)

Output:

           Math  Science
0 Semester 1   70       80
  Semester 2   85       90
1 Semester 1   92       88
  Semester 2   94       96

Because level=0 is passed, the stack() method in the example moves the outer column level (the semester) into the row index and leaves the subjects as columns. The wide-format DataFrame with one column per semester and subject becomes a longer DataFrame with a two-level row index.
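To make the effect of the level argument concrete, here is a short sketch that stacks the same data twice, once with the default and once with level=0. The names by_subject and by_semester are illustrative. Recent Pandas versions may emit a FutureWarning about the stack implementation when stack() is called this way; the result for this data is the same.

```python
import pandas as pd

header = pd.MultiIndex.from_product(
    [['Semester 1', 'Semester 2'], ['Math', 'Science']]
)
df = pd.DataFrame([[70, 80, 85, 90], [92, 88, 94, 96]], columns=header)

# Default: the innermost column level (the subject) moves into the row index
by_subject = df.stack()

# level=0: the outer column level (the semester) moves instead
by_semester = df.stack(level=0)

print(by_subject.columns.tolist())   # semesters remain as columns
print(by_semester.columns.tolist())  # subjects remain as columns
```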

Method 4: Unstack

Unstack is the inverse operation of stack, which ‘pivots’ a level of the (possibly hierarchical) row index to the column axis, producing a reshaped DataFrame with a new level of column labels. The unstack() method is particularly useful when dealing with multi-level indices and can help in reverting to a wide format from a long or stacked format.

Here’s an example:

import pandas as pd

# Sample data with a multi-level index
data = {
    ('Semester 1', 'Math'): [70, 92],
    ('Semester 1', 'Science'): [80, 88],
    ('Semester 2', 'Math'): [85, 94],
    ('Semester 2', 'Science'): [90, 96]
}
index = pd.MultiIndex.from_tuples([('Alice', 'Midterm'), ('Bob', 'Final')])
df = pd.DataFrame(data, index=index)

# Unstack the DataFrame
df_unstacked = df.unstack(level=1)

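Output (only the ‘Semester 1’ columns are shown; the ‘Semester 2’ columns follow the same pattern):

        Math           Science         
       Final  Midterm    Final  Midterm
Alice    NaN     70.0      NaN     80.0
Bob     92.0      NaN     88.0      NaN

In the provided example, the unstack() method is applied to a DataFrame with a two-level row index. Unstacking the second level (‘Midterm’ and ‘Final’) turns its labels into a new innermost level of column headers, converting the DataFrame to a wide format. Because Alice only has a ‘Midterm’ row and Bob only has a ‘Final’ row, the missing combinations are filled with NaN, and the integer scores are converted to floats.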
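Unstacking leaves gaps whenever some combination of index labels is absent, as with the student/exam pairs above. The sketch below counts those gaps on a two-column subset of the example data and then uses the fill_value argument of unstack() to substitute a placeholder. Using 0 as the placeholder is an assumption; whether it is sensible depends on what the numbers mean.

```python
import pandas as pd

data = {
    ('Semester 1', 'Math'): [70, 92],
    ('Semester 1', 'Science'): [80, 88],
}
index = pd.MultiIndex.from_tuples([('Alice', 'Midterm'), ('Bob', 'Final')])
df = pd.DataFrame(data, index=index)

# Without fill_value, absent student/exam combinations become NaN
with_gaps = df.unstack(level=1)
n_missing = int(with_gaps.isna().sum().sum())

# With fill_value, the gaps receive the placeholder instead
filled = df.unstack(level=1, fill_value=0)

print(n_missing)
print(filled)
```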

Bonus One-Liner Method 5: Wide to Long Using `wide_to_long()`

This function converts a wide-format DataFrame to a long-format one in a single call. It is ideal for DataFrames whose columns share a common prefix (the stub) followed by a numbered or similarly patterned suffix, because it selects the groups of columns to reshape by matching that naming pattern. It acts as a convenience layer over melt(): less flexible, but shorter to write when the column names follow such a pattern.

Here’s an example:

import pandas as pd

# Sample wide-form data
data = {
    'subject': ['Math', 'Science'],
    'test_1': [75, 78],
    'test_2': [82, 95],
    'test_3': [89, 87]
}
df = pd.DataFrame(data)

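# Use wide_to_long to reshape the DataFrame; sep='_' is needed because
# the column names separate the stub and the number with an underscore
df_long = pd.wide_to_long(df, stubnames='test', i='subject', j='test_number', sep='_')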

Output:

                 test
subject  test_number     
Math     1             75
Science  1             78
Math     2             82
Science  2             95
Math     3             89
Science  3             87

The wide_to_long() function is leveraged to convert the DataFrame into a long format, where ‘test_’ columns are transformed into a ‘test’ column with a ‘test_number’ identifying variable. It is an elegant and compact way to reshape data when dealing with columns with a common prefix.
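Where wide_to_long() tends to pay off is with several column groups at once. The following sketch uses made-up data (df_scores, the ‘quiz’ columns, and the ‘attempt’ name are all hypothetical) and passes a list of stub names, so that each stub becomes its own value column. It assumes the suffixes are numeric and separated from the stub by an underscore.

```python
import pandas as pd

# Hypothetical wide data with two groups of numbered columns
df_scores = pd.DataFrame({
    'subject': ['Math', 'Science'],
    'test_1': [75, 78],
    'test_2': [82, 95],
    'quiz_1': [8, 9],
    'quiz_2': [7, 10],
})

# Each stub becomes one column; the numeric suffix becomes the 'attempt' level
df_long = pd.wide_to_long(
    df_scores, stubnames=['test', 'quiz'], i='subject', j='attempt', sep='_'
)
print(df_long)
```

The result is indexed by subject and attempt, with one ‘test’ and one ‘quiz’ column, which would otherwise take two melt() calls and a merge.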

Summary/Discussion

Method 1: Pivot. Useful for converting long to wide format. Requires every index/column pair to be unique: pivot() raises a ValueError on duplicate entries, in which case pivot_table() with an aggregation function is the usual alternative.
Method 2: Melt. Converts wide to long format. Versatile, especially when the dataset has many columns to be melted into two. May require additional manipulation for multi-level cases.
Method 3: Stack. Moves a column level (the innermost by default) into the row index. Ideal for multi-level column indices. The result is a DataFrame when column levels remain and a Series when the columns have only one level; a Series may need conversion back to a DataFrame.
Method 4: Unstack. Pivots a level of row indices to columns, converting long to wide format. Best suited for data with hierarchical indices. Can introduce NaN values if the index-column pairings do not cover all combinations.
Method 5: Wide to Long. A flexible one-liner for reshaping wide to long format. Handles columns with a common pattern effectively. The requirement for a ‘stub’ pattern can make it less intuitive for non-sequential column names.
