5 Best Ways to Concatenate CSV Files in Python

5/5 - (1 vote)

πŸ’‘ Problem Formulation: Concatenation of CSV files is a common task where you have multiple files with the same columns that you want to merge into a single file without losing any data. For instance, you’ve collected weekly reports in the CSV format and now need to combine them into a monthly report.

Method 1: Using Python’s Standard Library

This approach uses Python’s built-in csv module, handling CSV files seamlessly. The method is straightforward: read each file with a CSV reader and write its contents into a CSV writer, excluding the header after the first file.

Here’s an example:

import csv

def concatenate_csv(file_list, output_file):
    with open(output_file, 'w', newline='') as f_output:
        csv_output = csv.writer(f_output)
        for i, file in enumerate(file_list):
            with open(file, 'r') as f_input:
                csv_input = csv.reader(f_input)
                if i == 0:
                    csv_output.writerow(next(csv_input))  # Write headers from the first file
                for row in csv_input:
                    csv_output.writerow(row)

# Usage
concatenate_csv(['week1.csv', 'week2.csv'], 'monthly_report.csv')

The output would be a single file called monthly_report.csv containing all the data from week1.csv and week2.csv.

This script functions by creating a CSV writer for the output file and looping over a list of input files. Headers are retained from the first file, and the rows from each file are written consecutively. It’s a clean solution that requires no additional libraries.

Method 2: Using Pandas Library

Pandas is a powerful data manipulation library in Python that makes concatenating CSV files a breeze. The method reads files into Pandas DataFrames, concatenates them, and writes back to CSV.

Here’s an example:

import pandas as pd

def concatenate_csv_pandas(file_list, output_file):
    df_list = [pd.read_csv(file) for file in file_list]
    df_concatenated = pd.concat(df_list, ignore_index=True)
    df_concatenated.to_csv(output_file, index=False)

# Usage
concatenate_csv_pandas(['week1.csv', 'week2.csv'], 'monthly_report.csv')

The output is the same as before: a unified monthly_report.csv with the combined contents of the weekly files.

The code reads each file into a DataFrame, combines them with the concat() function, and exports the result as a new CSV. This method handles different data types and indices effectively but requires Pandas, an external library.

Method 3: Using the Command Line

For those comfortable with the command-line interface (CLI), this method doesn’t even involve writing a Python script. The Unix cat command can concatenate files, and with a bit of tweaking, it can handle CSV files without repeating headers.

Here’s an example:

!tail -n +2 week2.csv >> week1.csv
!mv week1.csv monthly_report.csv

The output is a file named monthly_report.csv, originated from appending week2.csv (excluding its header) to week1.csv.

The tail command is used to skip the header of subsequent files, and mv renames the final file. It is a quick and simple method but requires Unix-like environment and is less flexible compared to Python scripts.

Method 4: Using CSVKIT

CSVKIT is a suite of command-line tools for converting to and working with CSV. This tool allows for a more elegant and feature-rich CLI solution to concatenate CSV files.

Here’s an example:

!csvstack week1.csv week2.csv > monthly_report.csv

The tool will output monthly_report.csv, with both input files merged properly.

csvstack is specifically designed to stack CSV files, handling headers and column orders automatically. This method is quick and avoids memory issues with large files, but it requires the installation of the CSVKIT package.

Bonus One-Liner Method 5: Using Unix awk

The awk utility in Unix is a powerful text-processing tool. With a one-liner, you can concatenate files while taking care of headers.

Here’s an example:

!awk '(NR == 1) || (FNR > 1)' week1.csv week2.csv > monthly_report.csv

The command creates monthly_report.csv, combining the data from the weekly CSV files.

It uses awk to print the header from the first file (NR == 1) and skip headers from all other files (FNR > 1). This compact solution is extremely fast and works well on Unix systems but can be a bit cryptic for those unfamiliar with awk syntax.

Summary/Discussion

  • Method 1: Python’s Standard Library. Simple and does not require additional libraries. Limited to Python’s file and memory handling capabilities.
  • Method 2: Pandas Library. Handles various data types and large datasets efficiently. Requires the installation of Pandas, hence not suitable for minimal dependency environments.
  • Method 3: Command Line with cat and tail. Quick and does not need Python, but is platform-dependent and less flexible.
  • Method 4: CSVKIT. Feature-rich CLI tool, great for large datasets. Needs external installation and learning of new syntax.
  • Method 5: Unix awk. Fast and powerful for those familiar with Unix command-line tools. Not user-friendly for beginners and platform-dependent.