Introduction
Problem Statement: How to fix “UnicodeDecodeError: ‘utf8’ codec can’t decode byte 0xa5 in position 0: invalid start byte” in Python?
Encoding is the process of converting characters (letters, symbols, and digits) into bytes according to a specific standard. A Unicode character can be encoded using a variety of encoding schemes; the most common ones are UTF-8, UTF-16, and Latin-1. The character $, for example, has the Unicode code point U+0024. It is stored as the single byte 0x24 in UTF-8 and as a two-byte unit (0x0024) in UTF-16. Not every character can be represented in every encoding: é, for instance, has no representation in ASCII.
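To see how a single code point turns into different bytes depending on the encoding, here is a minimal sketch. The variable names are only illustrative, and "utf-16-le" is chosen here so that the output does not depend on the machine's byte order.

```python
dollar = "$"

# The code point is the same no matter which encoding is used later.
print(hex(ord(dollar)))

utf8_bytes = dollar.encode("utf-8")       # one byte
utf16_bytes = dollar.encode("utf-16-le")  # two bytes, little-endian, no byte order mark
latin1_bytes = dollar.encode("latin-1")   # one byte

print(utf8_bytes, utf16_bytes, latin1_bytes)

# Some characters simply do not exist in a given encoding.
try:
    "é".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print("é fits in ASCII:", ascii_ok)
```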
- Often, while reading input files, you might encounter a UnicodeDecodeError. When the input file contains bytes that are not valid in the encoding in use, the decode() function fails and raises this error.
- Thus, the error means that the byte 0xa5 at position 0 in the input file cannot be decoded as UTF-8. A byte with this value can only appear as a continuation byte inside a multi-byte sequence, so it can never start a character in UTF-8.
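If you are curious why a particular byte is rejected, the following sketch classifies a byte value according to the UTF-8 layout. The helper utf8_byte_role() is hypothetical and written only for this illustration; it is not part of Python's standard library.

```python
def utf8_byte_role(byte: int) -> str:
    """Describe what a single byte value can mean in UTF-8 (illustrative helper)."""
    if byte <= 0x7F:
        return "ASCII character"
    if byte <= 0xBF:
        return "continuation byte"
    if byte in (0xC0, 0xC1) or byte >= 0xF5:
        return "never valid"
    if byte <= 0xDF:
        return "start of a 2-byte sequence"
    if byte <= 0xEF:
        return "start of a 3-byte sequence"
    return "start of a 4-byte sequence"


for value in (0x24, 0xA5, 0xE7, 0xF8):
    print(hex(value), "->", utf8_byte_role(value))
```

A continuation byte or a never-valid byte at the very beginning of the data is what produces the "invalid start byte" message.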
Example:
s = b'\xf8\xe7'
print(s.decode('UTF-8'))
Output:
Traceback (most recent call last):
  File "C:\Users\SHUBHAM SAYON\PycharmProjects\Finxer\UnicodeEncode.py", line 2, in <module>
    print(s.decode('UTF-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte
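Instead of letting the program crash, you can also catch the exception and inspect it. The sketch below uses the attributes that Python attaches to a UnicodeDecodeError (encoding, start, end, reason, object); the variable names are just for illustration.

```python
data = b'\xf8\xe7'

try:
    data.decode('utf-8')
except UnicodeDecodeError as err:
    # Collect the details that the error message is built from.
    bad_bytes = err.object[err.start:err.end]
    info = (err.encoding, err.start, err.reason, bad_bytes)
    print("codec:", err.encoding)
    print("position:", err.start)
    print("reason:", err.reason)
    print("offending bytes:", bad_bytes)
```

Knowing the exact position and byte value is often enough to tell which encoding the file was really written in.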
In this tutorial, we will look at various ways to fix this error. So, without further delay, let the games (fixes) begin!
#Fix 1: Use the Appropriate Encoding Standard
The most reliable way to eliminate this error is to pass the actual encoding of the file as a parameter while reading it.
Example:
s = b'\xf8\xe7'
print(s.decode('latin1'))  # øç
Let's have a look at a couple of different scenarios and how we can use the correct encoding scheme to avoid the occurrence of the error:
Scenario 1: Fixing Normal File Operations
file_data = open(path_to_the_file, mode="r", encoding="latin1")
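Here is a self-contained sketch of the same idea that you can run as is. It assumes a throwaway file created with tempfile (the file and its contents are made up for the demonstration) and uses a with block so the file is closed automatically.

```python
import os
import tempfile

# Create a small demo file that is encoded as Latin-1, not UTF-8.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write("øç\n".encode("latin-1"))
    demo_path = tmp.name

try:
    # Passing the matching encoding lets Python decode the bytes correctly.
    with open(demo_path, mode="r", encoding="latin-1") as f:
        text = f.read()
finally:
    os.remove(demo_path)

print(text)
```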
Scenario 2: The Pandas Fix
import pandas as pd
file_data = pd.read_csv(path_to_file, encoding="latin1")
But what if you do not know the encoding scheme of the file? You can make an educated guess using the chardet package.
- First, install chardet using the following command:
pip install chardet
- Then, use the code snippet below to identify the encoding and pass this value to the encoding parameter. Open the file in binary mode so that reading it does not itself raise the error.
import chardet
import pandas as pd

with open(path_to_the_file, 'rb') as f:
    raw_data = f.read()  # raw bytes, nothing is decoded yet

# detect() returns a dict with the guessed 'encoding' and a 'confidence' score
result = chardet.detect(raw_data)
encoding_format = result['encoding']
data = pd.read_csv(path_to_the_file, delimiter=",", encoding=encoding_format)
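If you would rather not install an extra package, a different and much cruder technique is to try a short list of candidate encodings in order. This is not what chardet does; it is only a minimal sketch, and the helper decode_with_fallback() and its default candidate list are assumptions made for this example.

```python
def decode_with_fallback(raw: bytes, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding_name) for the first candidate that decodes cleanly."""
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
    raise ValueError("none of the candidate encodings could decode the data")


text, used = decode_with_fallback(b'\xf8\xe7')
print(text, used)
```

Because latin-1 accepts every byte, keeping it last in the list means the function always returns something, but a successful decode does not guarantee that the guessed encoding is the right one.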
➤ unicode_escape
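Note: In most cases, people have found that setting the encoding parameter to "unicode_escape" or "latin-1" (also known as "ISO-8859-1") has helped. Keep in mind that latin-1 maps every possible byte to a character, so it never raises an error, but the resulting text is only correct if the file really was written in that encoding.
To use unicode_escape as the encoding parameter, use the code snippet below.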
Example:
file_data=pd.read_csv(path_to_file, encoding="unicode_escape")
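Before relying on these encodings, it helps to see what they can do to your data. The following sketch, with made-up sample values, shows that decoding UTF-8 bytes as latin-1 succeeds but garbles the text, and that unicode_escape additionally interprets backslash sequences.

```python
utf8_bytes = "é".encode("utf-8")

right = utf8_bytes.decode("utf-8")    # the original character
wrong = utf8_bytes.decode("latin-1")  # no error, but two wrong characters

print(right, wrong)

# unicode_escape treats backslash sequences in the data as escapes,
# so the "\n" inside this Windows-style path becomes a real newline.
escaped = rb"C:\new".decode("unicode_escape")
print(repr(escaped))
```

So these encodings make the error go away, but it is worth checking a few rows of the result by eye.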
#Fix 2: Read the File in Binary Format
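Try this fix if you see the error while working with log files or text files.
When you open a file for reading, it opens in text mode ("r") by default. In this mode, Python decodes the bytes of the file into strings, and that decoding step is where the error occurs. If you open the file in read binary ("rb") mode instead, nothing is decoded: you get the raw bytes back and can decide later how to handle them.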
Example:
file_data = open(path_to_the_file, mode="rb")
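Here is a runnable sketch of the binary approach. The demo file and its contents are invented for the example; the point is that reading in "rb" mode returns bytes, and decoding becomes a separate step that you control.

```python
import os
import tempfile

# Create a demo log file containing one byte that is invalid in UTF-8.
with tempfile.NamedTemporaryFile(delete=False, suffix=".log") as tmp:
    tmp.write(b"status=\xf8k\n")
    demo_path = tmp.name

try:
    with open(demo_path, mode="rb") as f:
        raw = f.read()  # bytes, no decoding and therefore no UnicodeDecodeError
finally:
    os.remove(demo_path)

print(type(raw), raw)

# Decode later, on your own terms (here: mark undecodable bytes with U+FFFD).
text = raw.decode("utf-8", errors="replace")
print(text)
```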
#Fix 3: Ignore the Undecodable Characters
You can opt to ignore the offending characters if they are not necessary for further processing and you are only concerned with getting rid of the error.
For example, you might encounter this error while cleaning a file to extract some information, and your program does not expect any non-ASCII characters to be present. In that case, you can simply drop them.
Use either of the following snippets to ignore these characters while reading the file using file operations.
bytes_with_issue.decode(encoding='utf-8', errors='ignore')
file_data = open(path_to_the_file, mode="r", encoding="utf-8", errors="ignore")
When you are using pandas (version 1.3 or later), you can achieve the same result using the following code snippet.
import pandas as pd
file_data = pd.read_csv(path_to_file, encoding="utf-8", encoding_errors="ignore")
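Ignoring is not the only option. As a small sketch with a made-up byte string, here is how three of Python's built-in error handlers treat the same undecodable input, so you can pick the one that loses the least information for your use case.

```python
data = b'\xf8\xe7abc'

ignored = data.decode('utf-8', errors='ignore')            # bad bytes are dropped
replaced = data.decode('utf-8', errors='replace')          # bad bytes become U+FFFD
escaped = data.decode('utf-8', errors='backslashreplace')  # bad bytes are kept as \xNN text

print(repr(ignored))
print(repr(replaced))
print(repr(escaped))
```

The "ignore" handler silently removes data, so "replace" or "backslashreplace" may be a safer choice when you still want to notice that something was wrong.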
#Fix 4: Use engine="python"
Passing engine='python' has fixed the issue in some cases. Hence, this fix deserves a mention in our list of solutions. Note that this works with pandas and not with file operations using the open() function.
Example: When using the Pandas library's read_csv() function, you can specify the engine parameter as shown below:
import pandas as pd
file_data = pd.read_csv(path_to_file, engine="python")
BONUS Read
Encoding and Decoding
In this context, encoding is the process of converting human-readable text (a Python str) into a sequence of bytes in a specified format so that it can be stored or transmitted. Decoding is the opposite: it converts the encoded bytes back into normal text (human-readable form).
In Python, encode() is a built-in string method used for encoding. In case no encoding is specified, UTF-8 is used as the default. decode() is a built-in bytes method used for decoding; it also defaults to UTF-8.
In short, encode() turns a str into bytes, and decode() turns bytes back into a str.
Example:
u = 'Πύθωνος'
print("UTF-8 encoded representation of Πύθωνος:", u.encode('utf-8'))
Output:
UTF-8 encoded representation of Πύθωνος: b'\xce\xa0\xcf\x8d\xce\xb8\xcf\x89\xce\xbd\xce\xbf\xcf\x82'
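To complete the picture, here is a short round-trip sketch: the bytes produced by encode() are turned back into the original string with decode(), as long as the same encoding is used in both directions. The variable names are only illustrative.

```python
original = 'Πύθωνος'

encoded = original.encode('utf-8')  # str -> bytes
decoded = encoded.decode('utf-8')   # bytes -> str

# Each Greek letter takes two bytes in UTF-8, so the byte string is longer.
print(len(original), "characters ->", len(encoded), "bytes")
print(decoded == original)
```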
Codepoint
Unicode maps each code point to its respective character. So, what do we mean by a code point?
- A code point is a numerical value (an integer) used to represent a character.
- The Unicode code point for é is U+00E9, which is the integer 233. When you encode a character and print the result, Python shows the bytes using hexadecimal escapes (as seen in the examples above) rather than as binary digits.
- The byte sequence of a code point differs between encoding schemes. For example, the byte sequence for é in UTF-8 is \xc3\xa9, while in UTF-16 it is \xff\xfe\xe9\x00 (the leading \xff\xfe is the byte order mark).
Please have a look at the following program to get a better grip on this concept:
u = 'é'
print("INTEGER value for é:", ord(u))
print("ENCODED Representation of é in UTF-8:", u.encode('utf-8'))
print("ENCODED Representation of é in UTF-16:", u.encode('utf-16'))
Output:
INTEGER value for é: 233
ENCODED Representation of é in UTF-8: b'\xc3\xa9'
ENCODED Representation of é in UTF-16: b'\xff\xfe\xe9\x00'
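The first two bytes of the UTF-16 output are a byte order mark rather than part of the character itself. The sketch below, using illustrative variable names, goes from the code point back to the character with chr() and encodes it with the explicit little-endian and big-endian variants, which do not add that mark.

```python
import codecs

ch = chr(0xE9)  # code point 233 -> 'é'
print(ch, ord(ch))

little = ch.encode('utf-16-le')  # explicit byte order, no byte order mark
big = ch.encode('utf-16-be')

print(little, big)

# The byte order mark that signals little-endian UTF-16.
print(codecs.BOM_UTF16_LE)
```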
Conclusion
In this tutorial, we have covered some fixes to solve the UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte. Some fixes apply to CSV files, while others work for .txt files. Apply them appropriately based on the requirement.
Hopefully this article has been informative and helped you. Stay tuned and subscribe to our site to get more stuff like this. Till then, Happy Pythoning!
Post Credits: Shubham Sayon and Anusha Pai