5 Best Ways to Convert CSV to RDF Using Python

💡 Problem Formulation: Converting data from CSV (Comma-Separated Values) to RDF (Resource Description Framework) is a common requirement in data integration and semantic web projects. For example, you might need to transform a CSV file containing information about books (such as title, author, and ISBN) into RDF so that it can become part of a linked data system. This article explores several ways to perform this conversion in Python.
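The examples in this article read a file named books.csv with the columns Title, Author, and ISBN. To follow along, sample input can be generated with the sketch below. The row values and the helper name sample_csv_text are placeholders chosen for illustration, and the column order is an assumption.

```python
import csv
import io

# Placeholder row matching the sample output used throughout the article.
SAMPLE_ROWS = [
    {"Title": "Book Title", "Author": "Author Name", "ISBN": "1234567890"},
]


def sample_csv_text(rows=SAMPLE_ROWS):
    """Return CSV text with Title, Author, and ISBN columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["Title", "Author", "ISBN"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Writing the returned text to books.csv gives an input file that the methods below can read.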

Method 1: Using rdflib

rdflib is a Python library for working with RDF data. It can parse and serialize common RDF formats such as RDF/XML, N-Triples, Turtle, and TriX; support for other formats, such as RDFa and Microdata, depends on the installed version. This method involves reading the CSV file, creating an RDF graph, and adding triples to the graph for each CSV row before serializing it to an RDF format.

Here's an example:

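import csv
from rdflib import Graph, Literal, Namespace

g = Graph()
n = Namespace("http://example.org/")

# newline='' is the csv module's recommended way to open CSV files
with open('books.csv', newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        # Indexing a Namespace already returns a URIRef, so no extra wrapping is needed
        book = n[row['ISBN']]
        g.add((book, n.title, Literal(row['Title'])))
        g.add((book, n.author, Literal(row['Author'])))

# In rdflib 6 and later, serialize() returns a str
print(g.serialize(format='turtle'))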

Output of this code snippet:

@prefix ns1: <http://example.org/> .

ns1:1234567890 ns1:author "Author Name" ;
               ns1:title "Book Title" .

This code snippet creates a graph g using RDFLib's Graph class, defines a namespace n, and reads the CSV with Python's built-in csv module. For each row, it creates a new RDF resource with the book's ISBN as the final part of its URI, then adds the book's title and author to the graph as RDF triples. Finally, it serializes the graph to Turtle, a compact and human-readable RDF format.
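The example uses the raw ISBN cell as the local part of the subject URI. In real files, ISBNs often contain hyphens or spaces, and a character that is not valid in a URI would produce a malformed identifier. One way to guard against this, sketched here with only the standard library, is to normalise the value before building the URI. The helper name book_uri, the base http://example.org/book/, and the decision to drop hyphens are assumptions for illustration, not part of rdflib.

```python
from urllib.parse import quote

BASE = "http://example.org/book/"  # assumed base URI for book resources


def book_uri(isbn, base=BASE):
    """Build a subject URI string from an ISBN cell value."""
    # Drop the separators commonly found in printed ISBNs.
    cleaned = isbn.strip().replace("-", "").replace(" ", "")
    if not cleaned:
        raise ValueError("empty ISBN; cannot build a subject URI")
    # Percent-encode anything that is still unsafe in a URI path segment.
    return base + quote(cleaned, safe="")
```

The returned string can be wrapped in rdflib's URIRef and used as the subject in place of the namespace lookup.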

Method 2: Using pandas and rdfpandas

Pandas is a powerful Python library for data manipulation, and rdfpandas integrates pandas with RDF. This method leverages pandas for reading the CSV and converting the DataFrame to an RDF graph that can be serialized.

Here's an example:

import pandas as pd
from rdfpandas.graph import to_graph
from rdflib import Graph

df = pd.read_csv('books.csv')
g = to_graph(df, "http://example.org/")

serialized_rdf = g.serialize(format='turtle')
print(serialized_rdf)

Output of this code snippet:

@prefix : <http://example.org/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.org/0> rdf:type rdfs:Resource ;
    :Author "Author Name" ;
    :ISBN "1234567890" ;
    :Title "Book Title" .

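This code uses pandas to read the CSV file into a DataFrame and then uses rdfpandas to convert that DataFrame into an rdflib Graph object. The to_graph function creates triples automatically from the DataFrame's index and columns, and the resulting graph can be serialized and printed in Turtle format. Note that rdfpandas documents the second argument of to_graph as an rdflib NamespaceManager and builds subjects from the DataFrame index, so the call and the output shown here may need adjusting for your installed version.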
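To make the row-to-triple mapping concrete, the following sketch reproduces the general idea in plain Python, without pandas or rdfpandas. Each row becomes a subject built from its position, and each column name becomes a predicate under a base namespace. It illustrates the concept only: the function name rows_to_triples and the base URI are assumptions, and rdfpandas itself may name subjects and predicates differently.

```python
import csv
import io


def rows_to_triples(csv_text, base="http://example.org/"):
    """Return (subject, predicate, object) string tuples, one per non-empty cell."""
    triples = []
    for index, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        # The row position stands in for the DataFrame index.
        subject = f"{base}{index}"
        for column, value in row.items():
            # Skip empty cells (and surplus unnamed fields) rather than emit empty literals.
            if column and value:
                triples.append((subject, f"{base}{column}", value))
    return triples
```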

Method 3: Manually Creating RDF Triples

Sometimes you may want finer control over how CSV fields are translated into RDF. In these cases, you can build the triples yourself with basic Python string formatting and write them out directly, which provides the necessary flexibility.

Here's an example:

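import csv

rdf_output = ""
base_uri = "<http://example.org/book/"
# newline='' is the csv module's recommended way to open CSV files
with open('books.csv', newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        # Close the angle bracket opened in base_uri to form the subject IRI
        subject = f"{base_uri}{row['ISBN']}>"
        # Cell values are inserted verbatim, so they must not contain quotes, backslashes, or line breaks
        title_triple = f"{subject} <http://purl.org/dc/elements/1.1/title> \"{row['Title']}\" ."
        author_triple = f"{subject} <http://purl.org/dc/elements/1.1/creator> \"{row['Author']}\" ."
        rdf_output += f"{title_triple}\n{author_triple}\n"

print(rdf_output)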

Output of this code snippet:

<http://example.org/book/1234567890> <http://purl.org/dc/elements/1.1/title> "Book Title" .
<http://example.org/book/1234567890> <http://purl.org/dc/elements/1.1/creator> "Author Name" .

This snippet reads the CSV file using Python's built-in csv module. It builds each triple as a string in N-Triples syntax, combining the values from the row with fixed predicate URIs. Each row from the CSV file is converted into two RDF triples: one linking the book to its title and another linking the book to its author.
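Because the triples are assembled by hand, a title containing a double quote, a backslash, or a line break would produce invalid N-Triples. A small escaping helper avoids this. The sketch below covers only the characters that N-Triples forbids unescaped inside a quoted literal; the function names escape_literal and triple are made up for this example.

```python
def escape_literal(value):
    """Escape a string for use inside a double-quoted N-Triples literal."""
    # Backslash must be replaced first so later escapes are not doubled.
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def triple(subject_uri, predicate_uri, text):
    """Format one N-Triples line with a plain string literal as the object."""
    return f'<{subject_uri}> <{predicate_uri}> "{escape_literal(text)}" .'
```

Calling triple(...) for each cell in place of the inline f-strings keeps the output parseable even when the CSV contains awkward characters.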

Method 4: Using csvwlib

csvwlib is a library that follows the W3C standard for CSV on the Web, which is designed to provide a way to describe CSV data for interoperability. This library can be used to parse CSV files and create RDF based on a given metadata description that defines the mapping between CSV columns and RDF terms.

Here's an example:

from csvwlib import CSVWConverter

csv_path = 'books.csv'
metadata_path = 'books-metadata.json'
rdf_output = CSVWConverter.to_rdf(csv_path, mode='minimal', metadata_file=metadata_path)

print(rdf_output)

Output of this code snippet (snapshot):

@prefix ns1: <http://example.org/> .

[] ns1:author "Author Name" ;
   ns1:title "Book Title" .

By using csvwlib, the conversion follows the CSV on the Web framework. This example assumes you have a metadata file in JSON format that describes the CSV structure and the desired RDF output. The CSVWConverter.to_rdf function then creates RDF triples based on that description. The argument names passed to to_rdf are worth checking against the csvwlib release you have installed, since they can differ from the ones shown here.
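The metadata file itself is not shown above. As a rough illustration, a CSVW description that would yield title and author predicates under http://example.org/ might be generated as follows. The property names (url, tableSchema, columns, propertyUrl, suppressOutput) come from the W3C CSVW metadata vocabulary; the column order, the choice to suppress the ISBN column, and the helper name build_metadata are assumptions made for this sketch.

```python
import json


def build_metadata(csv_url="books.csv", base="http://example.org/"):
    """Return a CSVW metadata description for the books table as a dict."""
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_url,
        "tableSchema": {
            "columns": [
                {"name": "Title", "titles": "Title", "propertyUrl": base + "title"},
                {"name": "Author", "titles": "Author", "propertyUrl": base + "author"},
                # Keep the ISBN column in the schema but leave it out of the RDF output.
                {"name": "ISBN", "titles": "ISBN", "suppressOutput": True},
            ]
        },
    }
```

Dumping the result with json.dump(build_metadata(), f, indent=2) into books-metadata.json produces a file of the kind the example expects.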

Bonus One-Liner Method 5: Convert CSV to RDF with a one-liner using pandas and rdflib

For quick conversions where the CSV column headings are already suitable as RDF predicate names, you can combine pandas with rdflib's JSON-LD parser to perform the conversion in a single expression. This is less flexible but suitable for simple datasets.

Here's an example:

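import json
import pandas as pd
from rdflib import Graph

print(Graph().parse(data=json.dumps({"@context": {"@vocab": "http://example.org/"}, "@graph": pd.read_csv('books.csv', dtype=str).to_dict(orient='records')}), format='json-ld').serialize(format='turtle'))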

Output of this code snippet (snapshot):

@prefix : <http://example.org/> .

[] :Author "Author Name";
   :Title "Book Title".

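This one-liner reads a CSV with pandas, converts the rows to JSON records, parses the result as JSON-LD using RDFLib's Graph().parse() method, and then serializes the graph to Turtle. For the column names to become predicates, the JSON-LD document needs an @context, for example an @vocab entry pointing at http://example.org/; keys with no mapping are dropped by a JSON-LD parser, which would leave the graph empty.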
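The role of the context can be sketched with only the standard library. The function below wraps CSV rows in a JSON-LD document whose @vocab maps every column name to an IRI. The name csv_to_jsonld and the default vocabulary are assumptions for illustration.

```python
import csv
import io
import json


def csv_to_jsonld(csv_text, vocab="http://example.org/"):
    """Wrap CSV rows in a JSON-LD document whose @vocab maps column names to IRIs."""
    rows = [
        # Drop empty cells so they do not become empty-string literals.
        {key: value for key, value in row.items() if key and value}
        for row in csv.DictReader(io.StringIO(csv_text))
    ]
    return json.dumps({"@context": {"@vocab": vocab}, "@graph": rows})
```

The returned string can be passed to Graph().parse(data=..., format='json-ld'); rdflib 6 and later ship the JSON-LD parser without extra plugins.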

Summary/Discussion

  • Method 1: Using rdflib. Well-suited to readers who are familiar with the RDFLib library and need precise control over how the RDF is created. Can be verbose when the CSV has many columns.
  • Method 2: Using pandas and rdfpandas. Takes advantage of the powerful data handling of pandas along with direct RDF conversion, but it requires an additional package and is best for when data already aligns with RDF concepts.
  • Method 3: Manually Creating RDF Triples. Offers maximum flexibility without relying on external libraries, although it can be error-prone and require careful string formatting.
  • Method 4: Using csvwlib. Adheres to W3C standards for interoperable CSV data handling, but necessitates additional metadata description files. Ensures a standardized RDF output.
  • Bonus Method 5: One-Liner Method. Quick and easy for simple conversions but lacks flexibility for more complex or specific RDF constructs.