Python Convert Unicode to Bytes, ASCII, UTF-8, Raw String

Python Convert Unicode to Bytes

Converting Unicode strings to bytes is a common task, because reading and writing files or preparing data for machine learning often requires bytes rather than text. Let’s take a look at how this can be accomplished.

Method 1 Built-in function bytes()

A string can be converted to bytes with the built-in bytes() constructor, which encodes the string with the specified encoding. Let’s see how it works and immediately check the data type:

>>> A = 'Hello'
>>> print(bytes(A, 'utf-8'), type(bytes(A, 'utf-8')))
# b'Hello' <class 'bytes'>

The b prefix in the output is a sign that this is a string of bytes. Unlike the following method, bytes() does not apply any encoding by default: the encoding must be specified explicitly, otherwise it raises TypeError: string argument without an encoding.
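As a minimal sketch of the behaviour described above (the variable names are illustrative, not part of any API), the following checks that bytes() with an explicit encoding gives the same result as encode(), and that omitting the encoding raises TypeError:

```python
text = 'Hello'

# With an explicit encoding, bytes() and str.encode() give the same result.
as_bytes = bytes(text, 'utf-8')
same_as_encode = as_bytes == text.encode('utf-8')

# Without an encoding, bytes() refuses to guess and raises TypeError.
try:
    bytes(text)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(as_bytes, same_as_encode)
print(error_message)
```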

Method 2 Built-in function encode()

Perhaps the most common way to accomplish this task is the string's own encode() method, which performs the conversion directly, without going through bytes().

The encode() method is called on a Unicode string and returns a string of bytes. It takes two optional arguments: the encoding scheme and an error handler. Any supported encoding can be used as the encoding scheme: ASCII, UTF-8 (used by default), UTF-16, latin-1, etc. Error handling can work in several ways:

strict – used by default, raises a UnicodeEncodeError (a subclass of UnicodeError) when it meets a character that is not supported by this encoding;

ignore – unsupported characters are skipped;

replace – unsupported characters are replaced with “?”;

xmlcharrefreplace – unsupported characters are replaced with the corresponding XML character reference, such as &#21270;;

backslashreplace – unsupported characters are replaced with sequences starting with a backslash;

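namereplace – unsupported characters are replaced with sequences like \N{…};

surrogateescape – when decoding, each undecodable byte is replaced with a surrogate code point from U+DC80 to U+DCFF; when encoding, these surrogates are turned back into the original bytes;

surrogatepass – allows surrogate code points to be encoded and decoded instead of raising an error; it is only available with the following encodings: utf-8, utf-16, utf-32, utf-16-be, utf-16-le, utf-32-be, utf-32-le.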

Let’s consider an example:

>>> A = '\u0048\u0065\u006C\u006C\u006F'
>>> print(A.encode())
# b'Hello'

In this example we specified neither the encoding nor the error handling method, so the default values were used: UTF-8 encoding and the strict method, which did not cause any errors. Relying on the defaults is discouraged, though: other developers may use encodings other than UTF-8 without declaring them in the header, and even a declared encoding may not match the actual content.
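The error handlers listed above are easier to compare side by side. The following is a small sketch that encodes one sample string to ASCII with several of them; the sample text and the variable names are chosen here purely for illustration:

```python
text = 'caf\u00e9 \u5316'  # 'café 化': two characters outside ASCII

handlers = ['ignore', 'replace', 'xmlcharrefreplace',
            'backslashreplace', 'namereplace']

# Encode the same text once per handler and keep the results by name.
results = {name: text.encode('ascii', errors=name) for name in handlers}

for name, encoded in results.items():
    print(f'{name:18} {encoded}')

# 'strict' (the default) raises instead of substituting anything.
try:
    text.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print('strict raised:', strict_failed)
```

Each handler produces valid ASCII bytes, but only xmlcharrefreplace, backslashreplace and namereplace keep enough information to tell which characters were there originally.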

Python Convert Unicode to ASCII

Now let’s look at methods for converting byte strings back. This time we need a string that contains only ASCII characters.

Method 1 Built-in function decode()

The decode() method of a byte string, like encode(), takes two arguments: the encoding and the error handler. Let’s see how it works:

>>> print(A.encode('ascii').decode('ascii'))
# Hello

This method is good if the input Unicode string contains only ASCII characters or other developers are responsible and explicitly declared the encoding in the header, but as soon as a code point outside the range from 0 to 127 appears, the method does not work:

>>> A = '\u0048\u0065\u006C\u006C\u006F\t\u5316\u4EB1'
>>> print(A.encode('ascii').decode('ascii'))
# UnicodeEncodeError: 'ascii' codec can't encode characters in position 6-7: ordinal not in range(128)
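When this error needs to be handled rather than just read, the exception object itself says where the problem is. A minimal sketch, using the same string as above and the standard attributes of UnicodeEncodeError (the names info and bad_slice are illustrative):

```python
text = 'Hello\t\u5316\u4eb1'

try:
    text.encode('ascii')
except UnicodeEncodeError as exc:
    # start/end delimit the slice of the input that could not be encoded;
    # end is exclusive, so "position 6-7" in the message is the slice [6:8].
    bad_slice = exc.object[exc.start:exc.end]
    info = (exc.encoding, exc.start, exc.end, bad_slice)
else:
    info = None

print(info)
```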

You can use various error handlers, for example, backslashreplace (to replace unsupported characters with sequences starting with backslashes) or namereplace (to insert sequences like \N{…}):

>>> A = '\u0048\u0065\u006C\u006C\u006F\t\u5316\u4EB1'
>>> print(A.encode('ascii', 'backslashreplace').decode('ascii', 'backslashreplace'))
# Hello	\u5316\u4eb1
>>> print(A.encode('ascii', 'namereplace').decode('ascii', 'namereplace'))
# Hello	\N{CJK UNIFIED IDEOGRAPH-5316}\N{CJK UNIFIED IDEOGRAPH-4EB1}

As a result, we may get an unexpected or uninformative string, which can lead to further errors or to time wasted on additional processing.

Method 2 Module unidecode()

PyPI has the Unidecode package. It exports a unidecode() function that takes a Unicode string and returns a string that can be encoded into ASCII bytes in Python 3.x:

>>> from unidecode import unidecode
>>> print(unidecode(A))
# Hello	Hua Ye

You can also pass an errors argument to unidecode(), which determines what to do with characters not present in its transliteration tables. The default is ignore, which means that Unidecode drops these characters (replaces them with an empty string). strict raises a UnidecodeError; the exception object has an index attribute that can be used to find the invalid character. replace substitutes “?” (or another string specified in the replace_str argument). preserve keeps the original non-ASCII character in the string. Note that if preserve is used, the string returned by unidecode() is no longer guaranteed to be ASCII!

Python Convert Unicode to UTF-8

UTF-8 is the default encoding in Python 3 and the most popular encoding in general, to the point of becoming a kind of standard. Assuming that other developers treat it the same way and do not forget to declare the encoding in the script header, almost all string handling tasks boil down to encoding to or decoding from UTF-8.

For this task, both of the above methods are applicable.

Method 1 Built-in function encode() and decode()

With encode(), we first get a byte string by applying UTF-8 encoding to the input Unicode string, and then use decode(), which gives us back a Unicode string that is readable and can be printed or displayed to the user in the console.

>>> B = '\u0048\u0065\u006C\u006C\u006F\t\u5316\u4EB1\t\u041f\u0440\u0438\u0432\u0435\u0442'
>>> print(B.encode('utf-8').decode('utf-8'))
# Hello	化亱	Привет

UTF-8 can encode every Unicode code point except lone surrogates, so it is difficult to imagine a character used in popular applications or operating environments that it cannot represent. For ordinary text, specifying the error handling method can therefore be omitted.
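To illustrate how much UTF-8 covers, here is a small sketch that encodes characters from different ranges and then shows the one narrow case where the strict handler still fails, a lone surrogate. The sample characters and variable names are illustrative:

```python
# Ordinary characters are encoded as UTF-8 using 1 to 4 bytes each.
samples = ['H', '\u041f', '\u5316', '\U0001F600']
lengths = [len(ch.encode('utf-8')) for ch in samples]
print(lengths)

# A lone surrogate is the exception: strict UTF-8 encoding rejects it.
lone_surrogate = '\udc80'
try:
    lone_surrogate.encode('utf-8')
    rejected = False
except UnicodeEncodeError:
    rejected = True
print('strict rejected the surrogate:', rejected)

# 'surrogatepass' lets it through, and the same handler decodes it back.
raw = lone_surrogate.encode('utf-8', 'surrogatepass')
round_trip = raw.decode('utf-8', 'surrogatepass')
print(raw, round_trip == lone_surrogate)
```

Lone surrogates rarely appear in normal text; they typically come from data decoded with surrogateescape, which is why the default handler is enough in most programs.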

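Method 2 Module unidecode

The code points of the characters can also be obtained with ord() and converted to float, shown here for the first five characters of B:

>>> print(list(map(float, [ord(i) for i in B[:5]])))
# [72.0, 101.0, 108.0, 108.0, 111.0]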

Or we can use a for loop; each value will be a float, since we explicitly convert it to this type:

>>> for i in B[:5]:
...     print(float(ord(i)), end=' ')
# 72.0 101.0 108.0 108.0 111.0