π‘ Problem Formulation: In data analysis, it is common to come across multiple CSV files with the same structure that need to be combined into a single DataFrame for analysis. For example, you might have daily sales data in separate CSV files for each day of the month and want to combine them to assess monthly performance. The goal is to create a consolidated DataFrame that aggregates the contents from all the CSV files.
Method 1: Using glob and concat
The glob module finds all the CSV files matching a specific pattern, which we can then read in a loop and concatenate into a single DataFrame using pandas’ concat function. This method is simple and very effective for combining files in bulk.
Here’s an example:
import pandas as pd import glob path = r'/path/to/csv/files' # use your path all_files = glob.glob(path + "/*.csv") li = [] for filename in all_files: df = pd.read_csv(filename, index_col=None, header=0) li.append(df) frame = pd.concat(li, axis=0, ignore_index=True)
The output is a single DataFrame containing data from all the CSV files.
This code snippet sets up a path where all CSV files are stored, uses glob.glob()
to find all CSV files, reads them one by one into temporary DataFrames, appends them to a list, and finally concatenates them all with pd.concat()
.
Method 2: Using os and concat
By using os.listdir we can get a list of file names in a directory, then filter for CSV files and concatenate them. It’s a more hands-on approach compared to glob but equally effective.
Here’s an example:
import pandas as pd import os path = '/path/to/csv/files' files = os.listdir(path) dfs = [pd.read_csv(path + "/" + f) for f in files if f.endswith('.csv')] final_df = pd.concat(dfs, ignore_index=True)
The output is a single DataFrame with the merged data from all CSV files.
After defining the directory path, os.listdir()
fetches all file names. List comprehension is used to read each CSV file into a DataFrame if it ends with ‘.csv’, and pd.concat()
merges them into a single DataFrame.
Method 3: Using Pathlib and concat
Using Pathlib provides an object-oriented approach to handling filesystem paths. Coupled with pandas’ concat, this method is modern and highly readable.
Here’s an example:
from pathlib import Path import pandas as pd p = Path('/path/to/csv/files') files = p.rglob('*.csv') dfs = [pd.read_csv(f) for f in files] final_df = pd.concat(dfs, ignore_index=True)
The output is the combined DataFrame from all matching CSV files.
The Pathlib’s rglob
method is used to recursively search for CSV files in the directory. These files are then read into DataFrames and concatenated into one final DataFrame.
Method 4: Using dask
Dask is a parallel computing library that seamlessly scales Pandas workflows. It is useful for handling larger datasets because it processes data in chunks.
Here’s an example:
import dask.dataframe as dd path = '/path/to/csv/files/*.csv' # Glob pattern df = dd.read_csv(path) final_df = df.compute()
The result is a Pandas DataFrame with all the CSV files merged.
This snippet uses Dask to handle potentially large CSV files in an efficient manner. The read_csv
function behaves much like pandas’ version but is designed for lazy loading, and compute()
converts the dask DataFrame into a pandas DataFrame.
Bonus One-Liner Method 5: Using a compact list comprehension with concat
For the ultimate one-liner solution, you can blend list comprehension with pandas’ concat function for a quick-and-dirty merge.
Here’s an example:
import pandas as pd from glob import iglob final_df = pd.concat((pd.read_csv(f) for f in iglob('/path/to/csv/files/*.csv')), ignore_index=True)
This generates a DataFrame that amalgamates all the data from the CSV files found.
This compact one-liner creates a generator expression to iterate over all the CSV file paths and read each one with Pandas, immediately concatenating them into a final DataFrame.
Summary/Discussion
- Method 1: Glob with Concat. Straightforward and clear. May be slow with a large number of files.
- Method 2: OS with Concat. More control over file selection. Required manual filtering of CSV files.
- Method 3: Pathlib with Concat. Modern and readable. Similar performance to glob.
- Method 4: Using Dask. Efficient with large datasets. Extra dependency and slight learning curve.
- Method 5: Compact One-Liner. Quick but less readable. Same efficiency as Method 1 and 2.