5 Best Ways to Sort CSV by Multiple Columns in Python

πŸ’‘ Problem Formulation: Imagine you have a CSV file containing sales data that you want to sort by both date and sales amount, in that priority. Your input is an unsorted CSV file and the desired output is a CSV sorted first by the ‘Date’ column and then by the ‘Amount’ column.

Method 1: Using pandas for Complex Sorting

This method involves using the powerful pandas library, which provides a dataframe structure and a sort_values() method that allows sorting based on multiple columns. This method is highly efficient and perfect for large datasets.

Here’s an example:

import pandas as pd

df = pd.read_csv('sales.csv')
sorted_df = df.sort_values(by=['Date', 'Amount'])
sorted_df.to_csv('sorted_sales.csv', index=False)

Output:

The CSV file ‘sorted_sales.csv’ is now sorted by the ‘Date’ column, and subsequently by the ‘Amount’ column.

This code snippet reads the ‘sales.csv’ file into a pandas dataframe, sorts it using the sort_values() method with the specified columns, and then writes the sorted dataframe back to ‘sorted_sales.csv’ without the index column.

Method 2: Using csv and operator Modules

By combining Python’s built-in csv module with the operator module’s itemgetter function, you can sort a CSV file by multiple columns. This is a more manual approach and is suitable when external libraries are not an option.

Here’s an example:

import csv
from operator import itemgetter

with open('sales.csv', 'r') as file, open('sorted_sales.csv', 'w', newline='') as output_file:
    csv_reader = csv.reader(file)
    header = next(csv_reader)
    sorted_rows = sorted(csv_reader, key=itemgetter(0, 1))
    
    csv_writer = csv.writer(output_file)
    csv_writer.writerow(header)
    csv_writer.writerows(sorted_rows)

Output:

The ‘sorted_sales.csv’ file is now sorted first by the first column (assuming it’s ‘Date’), and then by the second column (assuming it’s ‘Amount’).

The provided code reads ‘sales.csv’, sorts the data rows using the sorted() method and itemgetter to specify the columns, and writes the sorted data to ‘sorted_sales.csv’, keeping the headers intact.

Method 3: Sorting with lambda Functions

If you prefer a more customized sorting logic that doesn’t rely on external sorting attributes, you can use a lambda function in combination with the sorted() method to sort your CSV.

Here’s an example:

import csv

with open('sales.csv', 'r') as file, open('sorted_sales.csv', 'w', newline='') as output_file:
    csv_reader = csv.reader(file)
    header = next(csv_reader)
    sorted_rows = sorted(csv_reader, key=lambda row: (row[0], float(row[1])))
    
    csv_writer = csv.writer(output_file)
    csv_writer.writerow(header)
    csv_writer.writerows(sorted_rows)

Output:

‘sorted_sales.csv’ is sorted based on the lambda function criteria; by ‘Date’ and then by ‘Amount’ as a float.

This snippet uses a lambda function to define custom sorting criteria, such as converting a column value to a different type (e.g., string to float) for proper comparison, and writes the sorted rows to a new CSV file.

Method 4: Using numpy for Array Sorting

When performance is key and your data can be represented as arrays, the numpy library’s sorting capabilities come in handy. It’s particularly useful for numerical data sorting.

Here’s an example:

import numpy as np

data = np.genfromtxt('sales.csv', delimiter=',', dtype=None, names=True, encoding=None)
sorted_data = np.sort(data, order=['Date', 'Amount'])
np.savetxt('sorted_sales.csv', sorted_data, delimiter=',', fmt='%s')

Output:

The new ‘sorted_sales.csv’ will be sorted by ‘Date’ and ‘Amount’ column values accordingly.

This code uses np.genfromtxt() to load the CSV data into a structured NumPy array, sorts it with the specified column order, and writes the sorted array to a file. NumPy takes care of type conversions internally.

Bonus One-Liner Method 5: Sorting with pandas One-Liner

If you’re looking for the shortest possible code to achieve multi-column sorting in pandas, this one-liner is for you.

Here’s an example:

pd.read_csv('sales.csv').sort_values(['Date', 'Amount']).to_csv('sorted_sales.csv', index=False)

Output:

A succinctly sorted ‘sorted_sales.csv’ based on the ‘Date’ and ‘Amount’ columns.

This powerful one-liner reads the CSV into a pandas dataframe, sorts it by the specified columns, and writes the sorted dataframe to a new CSV file, all in a single chain of method calls.

Summary/Discussion

  • Method 1: pandas sort_values(). Strengths: Very flexible and powerful, good for large datasets. Weaknesses: Requires pandas installation, which may be heavy for small scripts.
  • Method 2: csv and operator. Strengths: Utilizes only built-in libraries, good for avoiding external dependencies. Weaknesses: Less convenient and can be verbose for complex data manipulation.
  • Method 3: lambda function. Strengths: Allows custom sort logic, great for fine-grained control. Weaknesses: Can be less readable and harder to maintain for complex conditions.
  • Method 4: numpy sort. Strengths: Very fast and efficient, especially with numerical data. Weaknesses: Not as straightforward for string-based sorting; requires understanding of numpy data types.
  • Method 5: pandas one-liner. Strengths: Extremely concise. Weaknesses: Might obscure readability and make debugging harder for someone not familiar with pandas method chaining.