5 Best Methods to Import CSV Data into MongoDB using Python

πŸ’‘ Problem Formulation: Suppose you have data stored in a CSV file and you want to import it into a MongoDB collection for further data processing or storage. The process should read data from CSV and store it in a structured format inside MongoDB, preserving the datatype where possible. Let’s explore different methods to transfer data from a CSV file into MongoDB using Python.

Method 1: Using pymongo and csv.DictReader

This method uses Python’s standard csv module to read the CSV file and pymongo, the official MongoDB driver for Python, to transfer the data into MongoDB. The csv.DictReader class reads the CSV file row by row into dictionaries, using the header row as the dictionary keys, which is a convenient format for insertion into a MongoDB collection.

Here’s an example:

from pymongo import MongoClient
import csv

# Connect to the MongoDB server
client = MongoClient("mongodb://localhost:27017")
db = client["database_name"]
collection = db["collection_name"]

# Read CSV file and insert data into MongoDB
with open('data.csv', 'r') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        collection.insert_one(row)

Each row from the CSV file has been inserted as a distinct document in the MongoDB collection.

This code snippet uses pymongo to establish a connection to a MongoDB instance, then reads a CSV file line by line to insert each row as a document into the collection specified. The csv.DictReader class is handy because it automatically uses the headers from the CSV file as the keys in the inserted dictionaries.
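One caveat: csv.DictReader returns every value as a string, so numeric columns lose their type on insertion. The sketch below shows one way to restore int and float types before inserting; the helper names (coerce, typed_rows) and the sample fields are illustrative, not part of any library.

```python
import csv
import io

def coerce(value):
    """Try to interpret a CSV string as int, then float; fall back to str."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value

def typed_rows(csvfile):
    """Yield CSV rows as dicts with numeric strings converted."""
    for row in csv.DictReader(csvfile):
        yield {key: coerce(val) for key, val in row.items()}

# In-memory sample standing in for open('data.csv')
sample = io.StringIO("name,age,score\nAda,36,9.5\n")
docs = list(typed_rows(sample))
```

Each dictionary produced this way can be passed to collection.insert_one() exactly as in the example above.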

Method 2: Using pandas and pymongo Bulk Write

By utilizing pandas for data manipulation and pymongo for batch operations, you can efficiently insert large volumes of data into MongoDB. Pandas offers a convenient DataFrame structure which can be obtained directly from a CSV file using read_csv(), while pymongo’s bulk_write() operation can insert many documents in a single command, potentially reducing network overhead.

Here’s an example:

import pandas as pd
from pymongo import MongoClient, InsertOne

# Load CSV file into pandas DataFrame
df = pd.read_csv('data.csv')

# Create list of InsertOne operations
operations = [InsertOne(row.to_dict()) for _, row in df.iterrows()]

# Connect to MongoDB and execute bulk operation
client = MongoClient("mongodb://localhost:27017")
db = client["database_name"]
collection = db["collection_name"]
collection.bulk_write(operations)

Multiple documents have been added to your MongoDB collection in a single operation.

This code example demonstrates how to use pandas to read a CSV file into a DataFrame, convert each row into a dictionary with to_dict(), and create a list of InsertOne operations that are executed with pymongo’s bulk_write() method to transfer data into MongoDB efficiently.
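One practical wrinkle with this approach: pandas represents empty CSV cells as NaN, which MongoDB would store as a floating-point NaN rather than a null. A minimal cleanup sketch, assuming you want null instead (the dataframe_to_documents helper is illustrative, not part of pandas or pymongo):

```python
import pandas as pd

def dataframe_to_documents(df):
    """Return MongoDB-ready dicts, replacing pandas NaN with None
    so MongoDB stores a proper null instead of a float NaN."""
    cleaned = df.astype(object).where(pd.notna(df), None)
    return cleaned.to_dict("records")

# Hypothetical data standing in for pd.read_csv('data.csv')
df = pd.DataFrame({"name": ["Ada", "Bob"], "age": [36, None]})
docs = dataframe_to_documents(df)
```

The resulting list of dictionaries can be wrapped in InsertOne operations for bulk_write(), or passed directly to insert_many().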

Method 3: Using mongoimport Command-Line Tool

If you prefer using a command-line tool over writing Python code, you can use MongoDB’s built-in mongoimport utility. This tool allows you to directly import data from a CSV file into a MongoDB collection. This method is especially useful for one-off imports or scripts that need to run without a Python environment.

Here’s an example:

mongoimport --db database_name --collection collection_name --type csv --file data.csv --headerline

This will result in the CSV data being imported into the specified collection.

The mongoimport command outlined here specifies the database and collection to which the CSV data should be imported. The --type csv flag indicates the format of the source file, --file specifies the path to the file, and --headerline informs mongoimport that the first line of the CSV file contains the field names.

Method 4: Using PyMongo and a Custom CSV Parser

For cases where your CSV data needs preprocessing or is non-standard, using a custom CSV parser written in Python in combination with pymongo to insert data can be a suitable approach. This enables you to handle complexities such as mixed data types, multiline fields, or special parsing rules before inserting the data into MongoDB.

Here’s an example:

from pymongo import MongoClient

# Custom CSV parser function (simplified example)
def parse_csv(file_path):
    # Implement custom parsing logic here
    return [{"_id": 1, "data": "example"}]

# Connect to the MongoDB server
client = MongoClient("mongodb://localhost:27017")
db = client["database_name"]
collection = db["collection_name"]

# Insert parsed data into MongoDB
parsed_data = parse_csv('data.csv')
collection.insert_many(parsed_data)

Custom-parsed data has been inserted into MongoDB.

The custom parse_csv() function represents a placeholder for your specific CSV parsing logic. After parsing the CSV file, the resulting data is inserted into MongoDB using the insert_many() method provided by pymongo. This flexibility allows for handling nuanced CSV data efficiently.
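As a concrete illustration, here is one possible parse_csv() implementation for a hypothetical semicolon-delimited file with a date column; the delimiter, the field name, and the date format are assumptions for the example, not requirements of pymongo.

```python
import csv
import io
from datetime import datetime

def parse_csv(csvfile, delimiter=";"):
    """Parse a semicolon-delimited file, converting the 'date' column
    to datetime so pymongo stores it as a BSON date."""
    documents = []
    for row in csv.DictReader(csvfile, delimiter=delimiter):
        if row.get("date"):
            row["date"] = datetime.strptime(row["date"], "%Y-%m-%d")
        documents.append(row)
    return documents

# In-memory sample standing in for open('data.csv')
sample = io.StringIO("name;date\nAda;2024-01-15\n")
docs = parse_csv(sample)
```

The returned list can then be passed straight to collection.insert_many() as in the snippet above.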

Bonus One-Liner Method 5: Using a pandas One-Liner with PyMongo

For quick and dirty imports when the data does not require preprocessing, you can use a pandas one-liner that combines reading a CSV file and inserting the data into MongoDB.

Here’s an example:

collection.insert_many(pd.read_csv('data.csv').to_dict('records'))

Assuming collection is a pymongo collection object set up as in the earlier methods, this single line reads the CSV file, converts it to a list of dictionaries, and inserts them all into MongoDB.

The one-liner chains pandas’ read_csv() with the to_dict('records') method, which produces a list of dictionaries in exactly the shape that pymongo’s insert_many() expects. This is efficient and convenient for simpler use cases.

Summary/Discussion

  • Method 1: Using pymongo and csv.DictReader. Strengths: Utilizes built-in Python libraries, provides a simple and direct way to import CSV data. Weaknesses: Might not be as efficient for large CSV files due to the line-by-line insertion.
  • Method 2: Using pandas and pymongo Bulk Write. Strengths: Efficient for large datasets due to bulk insert. Weaknesses: Adds a dependency on the pandas library which might be overkill for simpler needs.
  • Method 3: Using mongoimport Command-Line Tool. Strengths: No need to write Python code, good for scripting and one-time imports. Weaknesses: Less flexible, any preprocessing must be done in advance.
  • Method 4: Using PyMongo and a Custom CSV Parser. Strengths: Highly customizable, enables handling complex CSV data. Weaknesses: Requires writing custom parsing logic, which can be error-prone and time-consuming.
  • Bonus Method 5: Using a pandas One-Liner with PyMongo. Strengths: Quick and easy for straightforward data. Weaknesses: Offers no preprocessing and assumes CSV structure matches MongoDB document structure.