Python CSV to UTF-8

This article shows how to convert CSV files to the UTF-8 encoding standard in Python, and how to handle them once converted.

💡 The Unicode Transformation Format 8-Bit (UTF-8) is a variable-width character encoding used for electronic communication. UTF-8 can encode all 1,112,064 valid Unicode code points (including plenty of more or less weird characters) using one to four bytes per character. Example UTF-8 characters: ☈,☇,★,☃,☄,☍
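To make the "variable-width" part concrete, here is a minimal sketch that encodes a few sample characters and counts the resulting bytes. The sample characters are chosen only for illustration.

```python
# A few sample characters, written as escapes so the script itself stays ASCII:
# "A", e with acute accent, euro sign, snowman, light bulb emoji.
samples = ["A", "\u00e9", "\u20ac", "\u2603", "\U0001F4A1"]

# Map each character to the number of bytes UTF-8 needs for it.
widths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in widths.items():
    # Print the code point and the raw bytes in hex rather than the character,
    # so the output is safe on any console.
    print(f"U+{ord(ch):04X} -> {n} byte(s): {ch.encode('utf-8').hex()}")
```

Plain ASCII letters take one byte, while the emoji needs all four.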

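UTF-8 is the default text encoding on Linux and macOS. On Windows, Python has traditionally fallen back to the system’s ANSI code page (for example, Windows-1252) unless UTF-8 mode is enabled.

If you write a CSV file using Python’s standard file handling operations such as open() and file.write(), Python uses that platform default unless you pass an encoding argument. To be sure that you create a UTF-8 file, pass encoding='utf-8' explicitly.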
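Because the encoding that open() picks when you pass none can differ between platforms and Python versions, it is worth checking what your machine uses and setting the encoding explicitly when you write. The following is a small sketch under that assumption. The file name demo.csv and the sample text are made up for illustration.

```python
import locale
import os
import tempfile

# The encoding open() falls back to when none is given (platform dependent).
default_encoding = locale.getpreferredencoding(False)
print("Default text encoding on this machine:", default_encoding)

# Hypothetical demo file inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
sample = "name,city\nZoë,Köln\n"

# Passing encoding='utf-8' makes the result independent of the platform default.
# newline='' keeps the line endings exactly as written.
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(sample)

# Read the raw bytes back to inspect what actually landed on disk.
with open(path, "rb") as f:
    raw = f.read()

print(raw)
```

The printed bytes show two-byte sequences for ë and ö, which is what UTF-8 produces for these letters.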

So if you came to this website searching for “CSV to UTF-8”, my guess is that you have a CSV file in a different encoding, such as ANSI or UTF-16, that contains some “weird” characters. (A pure ASCII file needs no conversion because ASCII is a subset of UTF-8.)

Say, you want to read this ANSI file:

Now, you can simply convert this to a UTF-8 CSV file via the following approach:

CSV to UTF-8 Conversion in Python

The no-library approach to convert a CSV file to a UTF-8 CSV file is to open the source file with its original, non-UTF-8 encoding and write its contents straight back to a new UTF-8 file. Use the open() function’s encoding argument to set the encoding of both the file to be read and the file to be written.

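# 'cp1252' is the usual Windows "ANSI" code page; the 'ansi' alias itself only exists on Windows.
with open('my_file.csv', 'r', encoding='cp1252', errors='ignore') as infile:
    # Set the output encoding explicitly instead of relying on the platform default.
    with open('my_file_utf8.csv', 'w', encoding='utf-8') as outfile:
        outfile.write(infile.read())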

After conversion from ANSI to UTF-8 using the given approach, the new CSV file is UTF-8 encoded.
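One way to confirm the result without opening an editor is to read the converted file back as raw bytes and decode them as UTF-8. The sketch below assumes the source file is Windows-1252 ("ANSI" on most Western European Windows systems). The helper name convert_to_utf8 and the sample data are hypothetical.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "my_file.csv")
dst = os.path.join(workdir, "my_file_utf8.csv")

# Create a small Windows-1252 sample file to convert.
sample_text = "name,price\r\nCafé,5€\r\n"
with open(src, "wb") as f:
    f.write(sample_text.encode("cp1252"))


def convert_to_utf8(src_path, dst_path, src_encoding="cp1252"):
    """Copy a text file line by line, re-encoding it as UTF-8."""
    # newline='' on both sides leaves the original line endings untouched.
    with open(src_path, "r", encoding=src_encoding, newline="") as infile:
        with open(dst_path, "w", encoding="utf-8", newline="") as outfile:
            for line in infile:
                outfile.write(line)


convert_to_utf8(src, dst)

with open(dst, "rb") as f:
    converted = f.read()

# decode() raises UnicodeDecodeError if the bytes are not valid UTF-8.
text = converted.decode("utf-8")
print(converted)
```

Copying line by line instead of calling infile.read() also keeps memory use low for large files.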

CSV Reader/Writer – CSV to UTF-8 Conversion

You don’t need a CSV reader to convert a CSV to UTF-8 as shown in the previous example. However, if you wish to use one, make sure to pass the encoding argument when opening the file objects that you hand to csv.reader() and csv.writer().

import csv


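# 'cp1252' is the usual Windows "ANSI" code page; the 'ansi' alias itself only exists on Windows.
with open('my_file.csv', 'r', encoding='cp1252', errors='ignore', newline='') as infile:
    with open('my_file_utf8.csv', 'w', encoding='utf-8', newline='') as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)
        for row in reader:
            print(row)  # optional: show each parsed row
            writer.writerow(row)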

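The newline='' argument hands control over line endings to the csv module. Without it, Windows translates the \r\n that the writer emits into \r\r\n, which shows up as an extra blank line after each row.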
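The effect can be reproduced on any operating system. In this sketch, newline='\r\n' is used as a stand-in for the translation Windows applies by default, so the doubled carriage return becomes visible in the raw bytes. The helper write_rows and the file names are made up for illustration.

```python
import csv
import os
import tempfile

rows = [["id", "name"], ["1", "Zoë"]]
workdir = tempfile.mkdtemp()


def write_rows(path, newline):
    """Write the rows with csv.writer and return the raw bytes on disk."""
    with open(path, "w", encoding="utf-8", newline=newline) as f:
        csv.writer(f).writerows(rows)
    with open(path, "rb") as f:
        return f.read()


# Recommended: no newline translation, so the writer's \r\n is kept as-is.
good = write_rows(os.path.join(workdir, "good.csv"), "")

# Mimics the Windows default: every \n written becomes \r\n,
# so the writer's \r\n turns into \r\r\n.
bad = write_rows(os.path.join(workdir, "bad.csv"), "\r\n")

print(good)
print(bad)
```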

The output is the same UTF-8 encoded CSV:

Pandas – CSV to UTF-8 Conversion

You can use the pandas.read_csv() function and the DataFrame.to_csv() method to read and write a CSV file using various encodings (e.g., UTF-8, ASCII, Windows-1252, ISO-8859-1) as set by the encoding argument of each.

Here’s an example:

import pandas as pd


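# 'cp1252' is the usual Windows "ANSI" code page; the 'ansi' alias itself only exists on Windows.
df = pd.read_csv('my_file.csv', encoding='cp1252')
df.to_csv('my_file_utf8.csv', encoding='utf-8', index=False)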

ANSI to UTF-8

The no-library approach to convert an ANSI-encoded CSV file to a UTF-8-encoded CSV file is to open the source file with its ANSI encoding and write its contents back to a new UTF-8 file. Use the open() function’s encoding argument to set the encoding of both the file to be read and the file to be written.

Here’s an example:

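# 'cp1252' is the usual Windows "ANSI" code page; the 'ansi' alias itself only exists on Windows.
with open('my_file.csv', 'r', encoding='cp1252', errors='ignore') as infile:
    # Set the output encoding explicitly instead of relying on the platform default.
    with open('my_file_utf8.csv', 'w', encoding='utf-8') as outfile:
        outfile.write(infile.read())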

This converts the following ANSI file to a UTF-8 file:
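To see what such a file looks like at the byte level, and why the source encoding has to be named correctly, here is a minimal sketch. It assumes "ANSI" means Windows-1252, and the sample line is made up for illustration.

```python
# Bytes as a Windows editor would save them in ANSI (Windows-1252).
ansi_bytes = "Müller;Straße\n".encode("cp1252")

# Reading these bytes as UTF-8 fails, because 0xFC and 0xDF are not
# valid UTF-8 sequences on their own.
try:
    ansi_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# errors='ignore' hides the problem by silently dropping the offending bytes.
dropped = ansi_bytes.decode("utf-8", errors="ignore")

# Decoding with the right source encoding keeps every character,
# and re-encoding then yields proper UTF-8.
correct = ansi_bytes.decode("cp1252")
utf8_bytes = correct.encode("utf-8")

print(decodes_as_utf8)
print(dropped)
print(utf8_bytes)
```

Note that errors='ignore' only papers over bytes that the chosen codec cannot decode, so it is worth checking the output for missing characters.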

Related Tu