5 Best Ways to Read All CSV Files in a Folder Using Pandas

πŸ’‘ Problem Formulation: When working with data analysis in Python, it’s common to have a collection of CSV files in a directory that you want to load and concatenate into a single DataFrame for processing. Python’s pandas library is a powerful tool for this task. For instance, you might have a folder named “sales_data” containing weekly sales reports in CSV format and want to combine them into one comprehensive dataset. This article demonstrates how to efficiently read all these CSV files into a single DataFrame.
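
To make the snippets below easy to try, here is a minimal setup sketch that fabricates such a folder. The "sales_data" name, the weekly file names, and the columns are illustrative assumptions, not requirements of any method:

import pandas as pd
from pathlib import Path

# Build a tiny, hypothetical "sales_data" folder with two weekly reports
folder = Path('sales_data')
folder.mkdir(exist_ok=True)
pd.DataFrame({'sales_week': [1], 'revenue': [12000]}).to_csv(folder / 'week_01.csv', index=False)
pd.DataFrame({'sales_week': [2], 'revenue': [15000]}).to_csv(folder / 'week_02.csv', index=False)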

Method 1: Using glob.glob()

Python's glob module provides the glob.glob() function, which returns every file path that matches a given pattern, in this case CSV files. It is a simple, straightforward way to collect the CSV file paths and load them into pandas.

Here’s an example:

import pandas as pd
import glob

path = 'your_data_folder/'  # use your path
all_files = glob.glob(path + "*.csv")  # the trailing slash is already in path

df = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)

Sample Output:

   sales_week  revenue
0         1     12000
1         2     15000
...

This code snippet begins by importing the necessary libraries. It uses glob.glob() to find all files ending in '.csv' in the specified directory, reads each one into a pandas DataFrame, and concatenates them into a single DataFrame df. The ignore_index=True parameter gives the result a fresh 0-based index instead of repeating each file's own index.
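
Note that glob.glob() returns paths in arbitrary, filesystem-dependent order. If row order matters, sort the list first; the sketch below does that and also tags each row with its originating file via a hypothetical source column:

import os
import glob
import pandas as pd

all_files = sorted(glob.glob('your_data_folder/*.csv'))  # deterministic file order

# assign() adds an illustrative 'source' column recording each row's file
df = pd.concat(
    (pd.read_csv(f).assign(source=os.path.basename(f)) for f in all_files),
    ignore_index=True,
)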

Method 2: Using os.listdir()

This method uses the os module to list every entry in the specified directory and then filters for CSV files by extension. Apart from pandas itself, it relies only on Python's standard library.

Here’s an example:

import pandas as pd
import os

path = 'your_data_folder/'
csv_files = [os.path.join(path, f) for f in os.listdir(path) if f.endswith('.csv')]

df = pd.concat((pd.read_csv(f) for f in csv_files), ignore_index=True)

Sample Output:

   product_id  quantity_sold
0         101             50
1         102             75
...

This snippet walks the directory listing from os.listdir() and keeps only the CSV files. Each CSV file is read into a DataFrame and concatenated into the single DataFrame df; as before, ignore_index=True gives the result a clean, continuous index.
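
Two small caveats: os.listdir() also returns subdirectory names, and the endswith('.csv') test is case-sensitive, so a file named REPORT.CSV would be skipped. A slightly more defensive sketch, assuming mixed-case extensions should match, looks like this:

import os
import pandas as pd

path = 'your_data_folder/'

# Match '.csv' case-insensitively and skip anything that is not a regular file
csv_files = [
    os.path.join(path, f)
    for f in os.listdir(path)
    if f.lower().endswith('.csv') and os.path.isfile(os.path.join(path, f))
]

df = pd.concat((pd.read_csv(f) for f in csv_files), ignore_index=True)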

Method 3: Using pathlib.Path()

The pathlib module, available in Python 3.4 and above, offers an object-oriented approach to handling filesystem paths. Its Path class makes path manipulation easy and works well with pandas for reading all CSV files in a directory.

Here’s an example:

from pathlib import Path
import pandas as pd

path = Path('your_data_folder/')
csv_files = path.glob('*.csv')

df = pd.concat((pd.read_csv(f) for f in csv_files), ignore_index=True)

Sample Output:

   customer_id  total_spent
0         4562         120.0
1         4563         110.5
...

This code uses the Path.glob() method, which keeps the process intuitive and readable. After creating a Path object, calling glob('*.csv') lazily yields every CSV file, and each one is read and concatenated into a single DataFrame df with a continuous index.
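
A handy bonus of pathlib is recursive search. If the CSV files live in subfolders (say, one per month, which is an assumption about the layout rather than a requirement), Path.rglob() walks the whole tree:

from pathlib import Path
import pandas as pd

path = Path('your_data_folder/')
csv_files = path.rglob('*.csv')  # like glob(), but recurses into subdirectories

df = pd.concat((pd.read_csv(f) for f in csv_files), ignore_index=True)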

Method 4: Using dask.dataframe

Dask is a parallel computing library that integrates with pandas and lets you work with larger-than-memory datasets by splitting them into smaller, manageable partitions. Its dask.dataframe module provides a read_csv() function that reads multiple CSV files in one call, which can be very efficient for large datasets.

Here’s an example:

import dask.dataframe as dd

path = 'your_data_folder/*.csv' # glob string for all csv files
df = dd.read_csv(path).compute()

Sample Output:

   region  units_sold
0    East         124
1    West          78
...

Dask's read_csv() accepts the glob-style string and builds a lazy, partitioned DataFrame covering all matching CSV files; calling compute() triggers the actual (parallel) read and combines the results into a single pandas DataFrame df. This method shines with large datasets and distributed computing environments.
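
Keep in mind that compute() materializes the full result as an in-memory pandas DataFrame, which gives up Dask's larger-than-memory advantage. For truly large data you can stay lazy and compute only an aggregate; the sketch below assumes the region and units_sold columns from the sample output:

import dask.dataframe as dd

ddf = dd.read_csv('your_data_folder/*.csv')  # lazy: no data is read yet

# Aggregate across all files without holding them in memory at once
# ('region' and 'units_sold' are hypothetical columns from the sample output)
totals = ddf.groupby('region')['units_sold'].sum().compute()
print(totals)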

Bonus One-Liner Method 5: Using a list comprehension with pandas.read_csv()

This compact method employs a list comprehension and pandas.read_csv() in a single line to achieve the task. It’s a Pythonic way that leverages the readability and simplicity of list comprehensions.

Here’s an example:

import pandas as pd
from glob import glob

df = pd.concat([pd.read_csv(file) for file in glob('your_data_folder/*.csv')], ignore_index=True)

Sample Output:

   order_id  amount
0    10001    250.0
1    10002    135.0
...

This succinct snippet uses a list comprehension to iterate through all the CSV file paths returned by glob('your_data_folder/*.csv'), reading each file into a DataFrame. pd.concat() then concatenates the resulting DataFrames into a single DataFrame df with a clean index.
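
A small variation keeps track of which file each row came from by passing a dict to pd.concat(), which turns the filenames into an outer index level. The stem-based labels and the 'source'/'row' level names are illustrative choices:

import pandas as pd
from pathlib import Path

files = sorted(Path('your_data_folder').glob('*.csv'))

# The dict keys become the outer level of a MultiIndex on the result
df = pd.concat(
    {f.stem: pd.read_csv(f) for f in files},
    names=['source', 'row'],
)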

Summary/Discussion

  • Method 1: Using glob.glob(). Simple and effective, and ideal for general use cases, though not as intuitive as pathlib for filesystem path operations.
  • Method 2: Using os.listdir(). Relies only on the standard library plus pandas, but the extension filter is case-sensitive and directory entries may need extra handling.
  • Method 3: Using pathlib.Path(). Modern, object-oriented approach that enhances code readability, though it may be unfamiliar to those who prefer traditional file handling.
  • Method 4: Using dask.dataframe. Excellent for large datasets and distributed computing, but overkill for smaller, memory-sized datasets and adds an external dependency.
  • Method 5: Bonus One-Liner. Compact and Pythonic; quick and legible for those comfortable with list comprehensions, though the lack of explicit looping may obscure what is happening for beginners.