5 Best Ways to Process Excel Files Data in Chunks with Python

πŸ’‘ Problem Formulation: Processing large Excel files can be memory-intensive and may lead to performance issues. The objective is to read Excel files in chunks, allowing memory-efficient data processing. For instance, an input may be a 100,000-row Excel file, and the desired output would be processed data from chunks of 10,000 rows each.

Method 1: Pandas with skiprows and nrows

The Pandas library’s read_excel() function does not accept a chunksize parameter (that option exists only for text readers such as read_csv()), but you can emulate chunked reading by combining its skiprows and nrows parameters in a loop. This approach allows iterative processing, reducing memory usage by never holding the entire dataset in a single DataFrame.

Here’s an example:

import pandas as pd

chunk_size = 10000
offset = 0
while True:
    chunk = pd.read_excel('large_dataset.xlsx', skiprows=range(1, offset + 1), nrows=chunk_size)
    if chunk.empty:
        break
    process(chunk)
    offset += chunk_size

Assuming process() is a function that processes your data chunk.

In the provided code snippet, Pandas reads ‘large_dataset.xlsx’ 10,000 rows at a time: skiprows=range(1, offset + 1) keeps the header row while skipping the rows already read, and nrows caps each read at the chunk size. Each chunk is then passed to a hypothetical process() function, which could perform operations like data cleaning, filtering, or aggregation. Be aware that every iteration re-parses the file from the beginning, so this approach saves memory rather than time.
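For illustration only, process() is not defined in this article; a minimal hypothetical version might filter each chunk and report how many rows survived (the ‘amount’ column name here is purely an assumption):

def process(chunk):
    # Hypothetical example logic: keep rows with a positive 'amount' value
    filtered = chunk[chunk['amount'] > 0]
    print(f'Kept {len(filtered)} of {len(chunk)} rows in this chunk')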

Method 2: Dask Library

Dask is a parallel computing library that integrates with Pandas and allows you to work on large datasets efficiently. Its dataframe module has no native Excel reader, but you can wrap pandas’ read_excel() calls in dask.delayed tasks (for example, one per sheet) and combine them into a single Dask DataFrame whose operations are evaluated lazily and, where possible, in parallel.

Here’s an example:

from dask import delayed
import dask.dataframe as dd
import pandas as pd

sheets = pd.ExcelFile('large_dataset.xlsx').sheet_names
parts = [delayed(pd.read_excel)('large_dataset.xlsx', sheet_name=s) for s in sheets]
ddf = dd.from_delayed(parts)
result = ddf.groupby('column').sum().compute()

Here we group by a hypothetical ‘column’ and calculate the sum, with the work spread across partitions.

The example wraps one pandas read per sheet in dask.delayed and combines them into a Dask DataFrame, so each sheet becomes a partition. It then demonstrates a groupby operation summed across a specific column; the compute() call is what actually triggers the (potentially parallel) computation.
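If a one-off conversion is acceptable, a common alternative (not part of the original example; it assumes the workbook has already been exported to large_dataset.csv, for instance with in2csv from Method 5) is to let Dask partition a CSV natively:

import dask.dataframe as dd

# Assumes large_dataset.csv was produced from the workbook beforehand;
# dd.read_csv splits the file into partitions automatically and blocksize controls their size
ddf = dd.read_csv('large_dataset.csv', blocksize='16MB')
result = ddf.groupby('column').sum().compute()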

Method 3: OpenPyXL with Manual Chunking

OpenPyXL is another library for reading and writing Excel files. It has no built-in chunking either, but its read-only mode streams rows lazily, so you can batch rows yourself and cap how many are held in memory at a time.

Here’s an example:

from openpyxl import load_workbook

def process_rows(rows):
    # Your processing logic here
    pass

# read_only=True streams rows lazily instead of loading the whole workbook into memory
wb = load_workbook(filename='large_dataset.xlsx', read_only=True)
ws = wb.active

chunk_size = 10000
rows = []
for row in ws.iter_rows(values_only=True):
    rows.append(row)
    if len(rows) == chunk_size:
        process_rows(rows)
        rows = []
if rows:
    process_rows(rows)  # process the final, smaller chunk
wb.close()  # read-only workbooks hold the file open until explicitly closed

The function process_rows() would process the chunk of rows collected.

This snippet manages chunks manually by appending rows to a list and processing the list whenever it reaches the desired chunk size. The final if block ensures the last, smaller chunk is also processed, and closing the workbook releases the file handle that read-only mode keeps open.
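A slightly tidier variant of the same idea (just a sketch, using the same openpyxl calls) wraps the batching logic in a generator so the processing loop stays flat:

from itertools import islice
from openpyxl import load_workbook

def iter_chunks(path, chunk_size=10000):
    # Yield lists of row tuples, at most chunk_size rows at a time
    wb = load_workbook(filename=path, read_only=True)
    try:
        rows = wb.active.iter_rows(values_only=True)
        while True:
            chunk = list(islice(rows, chunk_size))
            if not chunk:
                break
            yield chunk
    finally:
        wb.close()

for chunk in iter_chunks('large_dataset.xlsx'):
    process_rows(chunk)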

Method 4: SQLAlchemy with Pandas

By first loading the Excel file into a database with SQLAlchemy, chunk by chunk, and then querying it back in chunks with Pandas, you can process large datasets without excessive memory usage.

Here’s an example:

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('sqlite:///:memory:')
chunk_size = 10000

with pd.ExcelFile('large_dataset.xlsx') as xls:
    for sheet_name in xls.sheet_names:
        offset = 0
        while True:
            # Skip the data rows already loaded (row 0 is the header), then read the next chunk
            chunk = pd.read_excel(xls, sheet_name=sheet_name, nrows=chunk_size,
                                  skiprows=range(1, offset + 1))
            if chunk.empty:
                break
            chunk.to_sql(sheet_name, con=engine, if_exists='append', index=False)
            offset += chunk_size

The Excel file is loaded into an in-memory SQLite database, chunk by chunk.

The code first connects to an in-memory SQL database. It then reads the Excel file sheet by sheet, 10,000 rows at a time, and uploads each chunk to a table named after the sheet. The if_exists='append' parameter ensures chunks are added rather than overwritten. Note that an in-memory SQLite database still keeps everything in RAM; point the engine at a file such as sqlite:///large_dataset.db if low memory usage is the goal.
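To complete the round trip, pandas’ read_sql() does support a chunksize parameter, so the loaded table can be queried back in pieces. A short sketch, assuming one of the sheets (and therefore tables) is named 'Sheet1' and reusing the hypothetical process() helper from Method 1:

import pandas as pd

# 'Sheet1' is an assumed table name; read_sql streams the result as 10,000-row DataFrames
for chunk in pd.read_sql('SELECT * FROM Sheet1', con=engine, chunksize=10000):
    process(chunk)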

Bonus One-Liner Method 5: Command-Line Python with CSV Conversion

For simple data processing tasks, converting the Excel file to CSV with csvkit’s in2csv and streaming the result through command-line tools such as awk, or back into pandas, can do the trick. This method is highly efficient but doesn’t cover every processing need.

Here’s an example:

in2csv large_dataset.xlsx | awk -F',' '{ if (NR % 10000 == 0) { print $0 } }'

This command converts an Excel file to CSV and prints every 10000th line.

The one-liner above converts the Excel file into CSV using in2csv (from the csvkit utilities) and then processes the output stream with awk, printing only the lines whose line number is a multiple of 10,000. It’s a simple streaming approach to chunked processing; the converted CSV can also be fed back into pandas, as sketched below.
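Once the data is in CSV form, pandas’ read_csv() does accept a chunksize parameter, so you get true chunked reading with a single call. This assumes the one-off conversion (in2csv large_dataset.xlsx > large_dataset.csv) has been run and reuses the hypothetical process() helper from Method 1:

import pandas as pd

# read_csv with chunksize yields an iterator of 10,000-row DataFrames
for chunk in pd.read_csv('large_dataset.csv', chunksize=10000):
    process(chunk)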

Summary/Discussion

  • Method 1: Pandas with skiprows and nrows. Strengths: Integrated with Pandas, easy to use. Weaknesses: Each chunk re-reads the file from the top, so it is memory-efficient but slow on very large files and no match for parallel processing.
  • Method 2: Dask Library. Strengths: Parallel processing, efficient memory usage. Weaknesses: No native Excel reader (requires delayed pandas reads or a prior CSV/Parquet conversion), more complex setup, may require additional learning.
  • Method 3: OpenPyXL with Manual Chunking. Strengths: Good for fine-grained control, direct manipulation of Excel files. Weaknesses: Manual overhead, potentially slower.
  • Method 4: SQLAlchemy with Pandas. Strengths: Robust for complex operations and queries. Weaknesses: Setup requires knowledge of databases, perhaps overkill for simple tasks.
  • Method 5: Command-Line Python with CSV Conversion. Strengths: Fast and efficient for simple tasks. Weaknesses: Limited processing capabilities, not suitable for all data processing needs.