5 Best Ways to Sort CSV by a Single Column in Python

πŸ’‘ Problem Formulation: Frequently in data analysis, you encounter the need to reorder CSV data based on the contents of a particular column. For instance, imagine you have a CSV file of sales data and you want your output organized ascendingly by the ‘Revenue’ column. Sorting CSV files programmatically avoids the tedium of manual data manipulation and enhances data comprehension and further data processing accuracy.

Method 1: Using Python’s csv and operator modules

This method combines Python’s built-in csv module for reading and writing CSV files with the operator module to sort the data. The operator.itemgetter() function is particularly useful for retrieving the elements from rows that you are interested in sorting by.

Here’s an example:

import csv
from operator import itemgetter

with open('sales.csv', 'r') as csvfile, open('sorted_sales.csv', 'w', newline='') as sortedfile:
    csvreader = csv.reader(csvfile)
    header = next(csvreader)
    sortedcsv = sorted(csvreader, key=itemgetter(2))  # Assuming the 'Revenue' column is at index 2
    csvwriter = csv.writer(sortedfile)
    csvwriter.writerow(header)
    csvwriter.writerows(sortedcsv)

The output will be a new CSV file named sorted_sales.csv sorted by the ‘Revenue’ column.

This snippet reads a CSV file, uses operator.itemgetter() to sort the rows by the third column (index 2), and then writes the sorted rows to a new CSV file.

Method 2: Using pandas DataFrame

pandas is a powerful data manipulation library that allows one to read, manipulate, and write data with ease. Here, we use pandas to load a CSV into a DataFrame, sort the DataFrame by a specified column, and then save the sorted DataFrame back to a CSV.

Here’s an example:

import pandas as pd

df = pd.read_csv('sales.csv')
sorted_df = df.sort_values('Revenue')  # Column name is 'Revenue'
sorted_df.to_csv('sorted_sales.csv', index=False)

This results in sorted_sales.csv being populated with the sorted data, excluding the index.

The code reads the CSV into a pandas DataFrame, sorts it by the ‘Revenue’ column using DataFrame.sort_values(), and then writes the sorted DataFrame to a new CSV file without the index column.

Method 3: Using Python’s built-in sorted() function with csv.DictReader

By exploiting the csv.DictReader class, we can handle CSV data as a list of dictionaries which makes the sorting process quite intuitive. Python’s sorted() function sorts any iterable. Dictionary keys enable us to specify the sort column in a readable manner.

Here’s an example:

import csv

with open('sales.csv', 'r') as csvfile, open('sorted_sales.csv', 'w', newline='') as sortedfile:
    csvreader = csv.DictReader(csvfile)
    sortedlist = sorted(csvreader, key=lambda x: x['Revenue'])
    
    csvwriter = csv.DictWriter(sortedfile, fieldnames=csvreader.fieldnames)
    csvwriter.writeheader()
    for row in sortedlist:
        csvwriter.writerow(row)

After running this code, sorted_sales.csv will be the output CSV sorted by the ‘Revenue’ column.

The snippet uses csv.DictReader and csv.DictWriter to read and write CSV data as dictionaries. Sorting is performed using a lambda function that takes the key ‘Revenue’ for each row in the CSV.

Method 4: Using numpy to sort a record array

numpy is a fundamental package for scientific computing in Python. If performance is a concern, especially with large CSV files, using numpy can be a good approach. By reading the CSV into a numpy record array, we can sort the array based on one of its fields.

Here’s an example:

import numpy as np

data = np.genfromtxt('sales.csv', delimiter=',', names=True, dtype=None, encoding='utf-8')
sorted_data = np.sort(data, order='Revenue')
np.savetxt('sorted_sales.csv', sorted_data, delimiter=',', fmt='%s', header=','.join(data.dtype.names), comments='')

Your sorted CSV will now be stored as sorted_sales.csv.

This code reads the CSV data into a numpy record array using np.genfromtxt(), sorts it by the ‘Revenue’ field, and then saves the sorted array to a new CSV file with np.savetxt().

Bonus One-Liner Method 5: Using csvsort from csvkit

For convenience, you can use the command-line tool csvsort from csvkit, a suite of utilities for converting to and working with CSV. This one-liner method allows for a quick sort of a CSV file from the terminal.

Here’s an example:

!csvsort -c 'Revenue' sales.csv > sorted_sales.csv

This will generate a new file sorted_sales.csv sorted by the ‘Revenue’ column.

By executing this command in a terminal or a Jupyter notebook cell, it sorts the ‘sales.csv’ by the specified column using csvsort and writes it to sorted_sales.csv.

Summary/Discussion

  • Method 1: csv and operator Modules. Best for small to medium datasets. Can be memory-intensive with large files.
  • Method 2: pandas DataFrame. Great for complex data manipulation. Overhead for large files and unnecessary for simple sorts.
  • Method 3: Built-in sorted() Function and csv.DictReader. Intuitive and Pythonic. Not as memory efficient as it could be for very large datasets.
  • Method 4: numpy Record Array. Performs well with large datasets. Can be overkill for simple tasks and learning curve associated with numpy.
  • Bonus One-Liner Method 5: csvsort from csvkit. Very fast and convenient. Requires csvkit installation and is less flexible for programmatic usage.