5 Best Ways to Convert PDF to CSV Using Python

💡 Problem Formulation: In many data processing tasks, it’s necessary to convert data from PDF files into CSV format. This makes the data easier to manipulate and analyze with tools like Python. Let’s take an example where we have financial reports in PDF format and want to convert them into CSV files containing all tabular data in a structured form.

Method 1: Using Tabula-py

Tabula-py is a Python library that extracts tables from PDFs into pandas DataFrames, which can then be exported as CSV. It’s a wrapper around tabula-java, which does the actual extraction behind the scenes, so a Java runtime must be installed.

Here’s an example:

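import tabula

file_path = 'financial_report.pdf'

# read_pdf returns a list of DataFrames, one per detected table
tables = tabula.read_pdf(file_path, pages='all', multiple_tables=True)

# Write each table to its own file: table_0.csv, table_1.csv, ...
for i, table in enumerate(tables):
    table.to_csv(f'table_{i}.csv', index=False)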

Output: Multiple CSV files (table_0.csv, table_1.csv, and so on), each containing one table extracted from the PDF.

This code uses tabula.read_pdf to read the PDF file and then iterates over the extracted tables, saving each one to a separate CSV file. It’s a simple approach that works well for text-based PDFs with clearly structured tables; scanned PDFs need OCR first.

Method 2: Using PyPDF2 and csv

PyPDF2 is a pure-Python PDF toolkit. It can extract text from PDFs, and with additional parsing you can save that data in CSV format using Python’s csv module. Note that PyPDF2 is no longer actively developed; its successor is published as pypdf and offers a very similar API.

Here’s an example:

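import csv

import PyPDF2

# Extract the raw text of every page
with open('financial_report.pdf', 'rb') as file:
    reader = PyPDF2.PdfReader(file)
    text = ''
    for page in reader.pages:
        # extract_text() can yield nothing for image-only pages
        text += (page.extract_text() or '') + '\n'

# Placeholder parsing: one CSV row per non-empty line, split on whitespace.
# Replace this with logic that matches the layout of your PDF.
parsed_data = [line.split() for line in text.splitlines() if line.strip()]

with open('output.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(parsed_data)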

Output: A single CSV file with extracted text data.

The code first extracts all text from the PDF using PyPDF2. That text must then be parsed into rows and columns, and the right parsing logic depends on the PDF’s layout. Finally, the rows are written out with Python’s csv module.
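The parsing step depends entirely on how the PDF lays out its text. As a minimal sketch, assuming each table row arrives on its own line with columns separated by two or more spaces (an assumption about the extracted text, not something PyPDF2 guarantees), a hypothetical helper named parse_text_to_rows could look like this:

```python
import re


def parse_text_to_rows(text):
    """Split extracted text into rows of cells (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between rows
        # Assumption: two or more whitespace characters mark a column break,
        # so single spaces inside a label such as "Net revenue" are preserved.
        rows.append(re.split(r"\s{2,}", line))
    return rows


# Made-up sample text standing in for real extraction output
sample = "Item   Q1   Q2\nNet revenue   100   120\n\nOperating costs   80   90\n"
parsed_data = parse_text_to_rows(sample)
print(parsed_data)
```

The resulting list of lists can be passed directly to csv.writer(...).writerows. Real PDFs often need a different separator rule, so treat the regular expression as a starting point.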

Method 3: Using PDFMiner.six

PDFMiner.six is a community-maintained fork of the original PDFMiner. It is designed to extract text from PDF files for analysis, and it exposes layout-analysis settings that give you more control over how text is grouped than PyPDF2 does, which makes it a better fit for non-table data.

Here’s an example:

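import csv

from pdfminer.high_level import extract_text

# Extract the text of the whole document as a single string
text = extract_text('financial_report.pdf')

# Placeholder parsing: one CSV row per non-empty line, split on whitespace.
# Replace this with logic that matches the layout of your PDF.
parsed_data = [line.split() for line in text.splitlines() if line.strip()]

with open('output.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(parsed_data)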

Output: A CSV file with the text content of the PDF.

After extracting the text from the PDF with PDFMiner.six, the data must be parsed to fit CSV format, which can be challenging depending on the complexity of the PDF’s layout.
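Because the extracted text contains headings, notes, and body text alongside table rows, a common first step is to keep only the lines that look like data. Here is a sketch of that idea, assuming (hypothetically) that every row of interest is a text label followed by one or more numeric values. The pattern and the function name extract_numeric_rows are illustrative and not part of PDFMiner.six:

```python
import re

# Assumed row shape: a label, then whitespace-separated numbers
# such as 1,234.50 or -12 running to the end of the line.
NUMBER = r"-?\d[\d,]*(?:\.\d+)?"
ROW_PATTERN = re.compile(
    rf"^(?P<label>.*?\S)\s+(?P<values>{NUMBER}(?:\s+{NUMBER})*)$"
)


def extract_numeric_rows(text):
    """Return [label, value, value, ...] for each line matching ROW_PATTERN."""
    rows = []
    for line in text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match:
            rows.append([match.group("label"), *match.group("values").split()])
    return rows


# Made-up sample text standing in for real extract_text output
sample = (
    "Financial Summary\n"
    "Revenue 1,200 1,350\n"
    "see note below\n"
    "Total costs -80 90.5\n"
)
parsed_data = extract_numeric_rows(sample)
print(parsed_data)
```

A filter like this will also match headings that happen to end in a number (for example a title ending in a year), so it is worth checking the output against the source PDF.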

Method 4: Using Poppler and Pandas

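Poppler is a PDF rendering library which, together with an OCR engine and Pandas, can be used to convert scanned PDF files into CSV. First, you convert the PDF pages into images with Poppler (here via the pdf2image wrapper), then run OCR (Optical Character Recognition) on those images with a tool like Tesseract (here via pytesseract), and finally store the data with Pandas. Both Poppler and Tesseract must be installed on the system in addition to the Python packages.

Here’s an example: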

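from pdf2image import convert_from_path
import pytesseract
import pandas as pd

# Render each PDF page to an image (requires Poppler to be installed)
images = convert_from_path('financial_report.pdf')
text_data = []

for image in images:
    # Run Tesseract OCR on the page image
    text = pytesseract.image_to_string(image)
    text_data.append(text)

# One row per page; parse text_data further if you need real columns
df = pd.DataFrame({'page': range(1, len(text_data) + 1), 'text': text_data})
df.to_csv('output.csv', index=False)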

Output: A CSV file with text recognized from images of the PDF pages.

This example converts each PDF page to an image, then uses OCR to extract text, and finally writes it into a CSV file. This is a powerful but complex approach and may require fine-tuning of the OCR.
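Writing the list of page strings directly to a DataFrame gives one cell per page. A sketch of one way to get one row per recognized line instead, tagged with its page number, is shown here. The function name ocr_pages_to_rows and the column layout are illustrative assumptions, and the sample strings stand in for real OCR output:

```python
import csv
import io


def ocr_pages_to_rows(pages):
    """Turn per-page OCR text into (page, line, text) rows with a header."""
    rows = [("page", "line", "text")]
    for page_no, page_text in enumerate(pages, start=1):
        line_no = 0
        for line in page_text.splitlines():
            line = line.strip()
            if not line:
                continue  # OCR output often contains blank lines
            line_no += 1
            rows.append((page_no, line_no, line))
    return rows


# Made-up strings standing in for the OCR result of two pages
sample_pages = ["Revenue 100\n\nCosts 80\n", "Profit 20\n"]
rows = ocr_pages_to_rows(sample_pages)

# Write to an in-memory buffer here; use open('output.csv', 'w', newline='')
# to write a real file instead.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Keeping the page and line numbers makes it easier to trace an OCR error back to its position in the source document.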

Bonus One-Liner Method 5: Using Camelot

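Camelot is another Python library built for extracting tables from PDFs to CSV. It offers two parsing modes, "lattice" for tables with ruling lines and "stream" for tables separated by whitespace, and it works only with text-based PDFs, not scanned ones.

Here’s an example: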

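import camelot

# pages defaults to '1', so request every page explicitly
tables = camelot.read_pdf('financial_report.pdf', pages='all')

# With compress=True the per-table CSV files are bundled into output.zip
tables.export('output.csv', f='csv', compress=True)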

Output: A ZIP archive (output.zip) containing one CSV file per extracted table.

This snippet uses Camelot to read tables from the PDF and export them as CSV files bundled into a ZIP archive. It’s a quick, two-line approach suited to text-based PDFs with clearly delineated tables.
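To my knowledge, compress=True makes Camelot bundle one CSV file per table into a ZIP archive next to the given path (output.zip in this example). Assuming that layout, the tables can be read back with the standard library alone. The helper name read_csvs_from_zip is hypothetical:

```python
import csv
import io
import zipfile


def read_csvs_from_zip(zip_path):
    """Return {member_name: list_of_rows} for every .csv file in the archive."""
    tables = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # ignore anything that is not a CSV member
            with archive.open(name) as raw:
                # Decode the binary member as UTF-8 text for csv.reader
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                tables[name] = list(csv.reader(text))
    return tables
```

Calling read_csvs_from_zip('output.zip') would then return a dictionary keyed by the CSV file names inside the archive, each mapped to its rows.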

Summary/Discussion

  • Method 1: Tabula-py. Best for extracting tables directly to DataFrames. Not ideal for PDFs with complex layouts or non-table content.
  • Method 2: PyPDF2 and csv. Suitable for simple text extraction from PDFs when you’re ready to write custom parsing. Not the best for data trapped in images or irregular formats.
  • Method 3: PDFMiner.six. Good for detailed text extraction permitting heavy customization in parsing. May require more coding for converting to structured CSV.
  • Method 4: Poppler and Pandas. Powerful if you’re dealing with scanned documents, although it requires chaining multiple libraries and dealing with OCR accuracy.
  • Method 5: Camelot. Very efficient at extracting tables if they are well-defined. It may struggle with non-standard table formats or non-text content.