5 Best Ways to Merge All CSV Files Into a Single DataFrame Using Python Pandas

💡 Problem Formulation: In data analysis, it is common to come across multiple CSV files with the same structure that need to be combined into a single DataFrame for analysis. For example, you might have daily sales data in separate CSV files for each day of the month and want to combine them to assess monthly performance. The goal is to create a consolidated DataFrame that aggregates the contents of all the CSV files.
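If you want to follow along without real data, you can generate a few small CSV files first. Below is a minimal sketch, assuming a hypothetical csv_files folder and a made-up daily sales layout with date, product, and revenue columns:

import pandas as pd
from pathlib import Path

# Hypothetical sample data: three daily sales files with identical columns
out_dir = Path('csv_files')  # assumed folder name for the examples
out_dir.mkdir(exist_ok=True)

for day in ['2024-01-01', '2024-01-02', '2024-01-03']:
    daily = pd.DataFrame({
        'date': [day, day],
        'product': ['A', 'B'],
        'revenue': [100, 150],
    })
    daily.to_csv(out_dir / f'sales_{day}.csv', index=False)

Each of the methods below can then be pointed at this folder instead of the placeholder /path/to/csv/files.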

Method 1: Using glob and concat

The glob module finds all the CSV files matching a specific pattern, which we can then read in a loop and concatenate into a single DataFrame using pandas’ concat function. This method is simple and very effective for combining files in bulk.

Here’s an example:

import pandas as pd
import glob

path = r'/path/to/csv/files' # use your path
all_files = glob.glob(path + "/*.csv")

li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)

The output is a single DataFrame containing data from all the CSV files.

This code snippet sets up a path where all CSV files are stored, uses glob.glob() to find all CSV files, reads them one by one into temporary DataFrames, appends them to a list, and finally concatenates them all with pd.concat().
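A common follow-up need is knowing which file each row came from. That is not part of the method itself, but a small variation of the same loop can record the source filename while reading; a sketch under the same path assumption:

import pandas as pd
import glob
import os

path = r'/path/to/csv/files' # use your path
all_files = glob.glob(path + "/*.csv")

frames = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    df['source_file'] = os.path.basename(filename)  # hypothetical column tracking provenance
    frames.append(df)

frame = pd.concat(frames, axis=0, ignore_index=True)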

Method 2: Using os and concat

By using os.listdir we can get a list of file names in a directory, then filter for CSV files and concatenate them. It’s a more hands-on approach compared to glob but equally effective.

Here’s an example:

import pandas as pd
import os

path = '/path/to/csv/files'
files = os.listdir(path)

dfs = [pd.read_csv(os.path.join(path, f)) for f in files if f.endswith('.csv')]
final_df = pd.concat(dfs, ignore_index=True)

The output is a single DataFrame with the merged data from all CSV files.

After defining the directory path, os.listdir() fetches all file names in it. A list comprehension then reads each file whose name ends with ‘.csv’ into a DataFrame, joining the name back to the directory with os.path.join(), and pd.concat() merges them into a single DataFrame.
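Note that os.listdir() returns names in arbitrary order. If a stable row order matters, a small optional refinement is to sort the filtered names before reading; a sketch of that variant:

import pandas as pd
import os

path = '/path/to/csv/files'

# Sort for a deterministic concatenation order; os.listdir() gives no guarantee
csv_files = sorted(f for f in os.listdir(path) if f.endswith('.csv'))

dfs = [pd.read_csv(os.path.join(path, f)) for f in csv_files]
final_df = pd.concat(dfs, ignore_index=True)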

Method 3: Using Pathlib and concat

The pathlib module provides an object-oriented approach to handling filesystem paths. Coupled with pandas’ concat, this method is modern and highly readable.

Here’s an example:

from pathlib import Path
import pandas as pd

p = Path('/path/to/csv/files')
files = p.rglob('*.csv')

dfs = [pd.read_csv(f) for f in files]
final_df = pd.concat(dfs, ignore_index=True)

The output is the combined DataFrame from all matching CSV files.

Path’s rglob() method searches recursively for CSV files in the directory and all of its subdirectories. These files are then read into DataFrames and concatenated into one final DataFrame.
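If you only want CSV files from the top-level directory rather than from every subdirectory, Path.glob() is the non-recursive counterpart of rglob(); a minimal sketch:

from pathlib import Path
import pandas as pd

p = Path('/path/to/csv/files')

# glob('*.csv') stays in the top-level folder; rglob('*.csv') also descends into subfolders
files = p.glob('*.csv')

dfs = [pd.read_csv(f) for f in files]
final_df = pd.concat(dfs, ignore_index=True)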

Method 4: Using dask

Dask is a parallel computing library that scales pandas workflows. It is useful for larger datasets because it reads and processes the data in partitions rather than loading everything into memory at once.

Here’s an example:

import dask.dataframe as dd

path = '/path/to/csv/files/*.csv' # Glob pattern
df = dd.read_csv(path)
final_df = df.compute()

The result is a Pandas DataFrame with all the CSV files merged.

This snippet uses Dask to handle potentially large CSV files efficiently. Dask’s read_csv behaves much like pandas’ version but loads data lazily, and compute() converts the Dask DataFrame into an in-memory pandas DataFrame.
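Because Dask evaluates lazily, you can also chain filters or aggregations before calling compute(), so only the reduced result is pulled into memory. A sketch, assuming a hypothetical revenue column in the CSV files:

import dask.dataframe as dd

path = '/path/to/csv/files/*.csv' # Glob pattern

ddf = dd.read_csv(path)

# Nothing is read yet; the sum below is planned lazily and only the
# small scalar result is materialised by compute().
total_revenue = ddf['revenue'].sum().compute()  # 'revenue' is an assumed column name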

Bonus One-Liner Method 5: Using a compact generator expression with concat

For the ultimate one-liner solution, you can combine a generator expression with pandas’ concat function for a quick-and-dirty merge.

Here’s an example:

import pandas as pd
from glob import iglob

final_df = pd.concat((pd.read_csv(f) for f in iglob('/path/to/csv/files/*.csv')), ignore_index=True)

This generates a DataFrame that amalgamates all the data from the CSV files found.

This compact one-liner uses a generator expression to iterate over all matching CSV file paths, reads each one with pandas, and concatenates the results into a final DataFrame in a single step.
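As with the other approaches, file order is OS-dependent; if reproducibility matters, the same one-liner can wrap the paths in sorted(), for example:

import pandas as pd
from glob import glob

# sorted() fixes the concatenation order, which glob otherwise leaves arbitrary
final_df = pd.concat((pd.read_csv(f) for f in sorted(glob('/path/to/csv/files/*.csv'))), ignore_index=True)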

Summary/Discussion

  • Method 1: Glob with Concat. Straightforward and clear. May be slow with a large number of files.
  • Method 2: OS with Concat. More control over file selection. Requires manual filtering of CSV files.
  • Method 3: Pathlib with Concat. Modern and readable. Similar performance to glob.
  • Method 4: Using Dask. Efficient with large datasets. Extra dependency and slight learning curve.
  • Method 5: Compact One-Liner. Quick but less readable. Same efficiency as Methods 1 and 2.