This tutorial will show you how to convert a Unicode string to a string in Python. If you already know about Unicode, you can skip the following background section and dive into the problem right away.
Background Unicode
A bit about Unicode from Wikipedia.
Unicode is a character encoding standard that includes characters from almost all written languages in the world. The standard is now prevalent on the Internet.
The standard was proposed in 1991 by the non-profit organization “Unicode Consortium” (Unicode Inc). The use of this standard makes it possible to encode a very large number of characters from different writing systems: in documents encoded according to the Unicode standard, Chinese characters, mathematical symbols, letters of the Greek, Latin, and Cyrillic alphabets, and symbols of musical notation can coexist, and switching code pages becomes unnecessary.
In Unicode, there are several forms of representation (Unicode transformation format, UTF): UTF-8, UTF-16 (UTF-16BE, UTF-16LE) and UTF-32 (UTF-32BE, UTF-32LE). In a UTF-16 data stream, the low-order byte can be written either before the high-order byte (UTF-16 little-endian, UTF-16LE) or after it (UTF-16 big-endian, UTF-16BE). Likewise, there are two variants of the four-byte form of representation: UTF-32LE and UTF-32BE. All of them are also called encodings.
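The byte-order difference can be seen directly with Python's built-in codecs. This is a small sketch; the variable names are only illustrative.

```python
# Encode a single character (U+0048, "H") with explicit byte orders.
text = "H"

utf16_le = text.encode("utf-16-le")  # low-order byte first
utf16_be = text.encode("utf-16-be")  # high-order byte first
utf32_le = text.encode("utf-32-le")
utf32_be = text.encode("utf-32-be")

print(utf16_le)  # b'H\x00'
print(utf16_be)  # b'\x00H'
print(utf32_le)  # b'H\x00\x00\x00'
print(utf32_be)  # b'\x00\x00\x00H'
```

The explicit `-le` and `-be` codec names are used here so that no byte order mark is prepended to the output.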
Microsoft Windows NT and systems based on it mainly use the UTF-16LE form. UNIX-like operating systems such as GNU/Linux, BSD, and Mac OS X adopt UTF-8 for files and UTF-32 or UTF-8 for in-memory character handling.
Often we receive as input a string written as Unicode escape sequences, which is not readable by a regular user but can be stored and transferred as plain ASCII. Depending on the further requirements for the Unicode string or on the environment (an operating system or a piece of software), it is necessary to determine the encoding that can and should be used.
UTF-8 is now the dominant encoding on the web. UTF-8, in comparison with UTF-16, gives the greatest gain in compactness for texts in Latin, since Latin letters, numbers, and the most common punctuation marks are encoded in UTF-8 by only one byte, and the codes of these characters correspond to their codes in ASCII.
UTF-16 is an encoding that allows writing Unicode characters in the ranges U+0000 … U+D7FF and U+E000 … U+10FFFF (1,112,064 code points in total). Each character is written as one or two 16-bit words; a two-word sequence is called a surrogate pair.
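To make the surrogate pair concrete, here is a short sketch that splits one character above U+FFFF into its two 16-bit words and then rebuilds the code point. The character chosen (U+1F600) is just an example.

```python
# A code point outside the Basic Multilingual Plane needs two 16-bit words.
ch = "\U0001F600"

units = ch.encode("utf-16-be")           # 4 bytes = two 16-bit words
high = int.from_bytes(units[:2], "big")  # high (leading) surrogate
low = int.from_bytes(units[2:], "big")   # low (trailing) surrogate
print(hex(high), hex(low))  # 0xd83d 0xde00

# Rebuild the original code point from the pair.
code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
print(hex(code_point))  # 0x1f600
```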
UTF-32 is a way of representing Unicode in which each character is exactly 4 bytes. The main advantage of UTF-32 over variable-length encodings is that Unicode characters in it are directly indexable, so finding a character by its position number in the file can be extremely fast, and getting any character in the n-th position is an operation that always takes the same time. It also makes it very easy to replace characters in UTF-32 strings. In contrast, variable-length encodings require sequential access to the n-th character, which can be very time-consuming. The main disadvantage of UTF-32 is its inefficient use of space since four bytes are used to store any character.
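The space trade-off between the three encodings can be checked with a quick measurement. This is a sketch; the sample strings and the helper name `encoded_sizes` are made up for illustration.

```python
# Compare how many bytes the same text needs in each encoding form.
samples = {
    "latin": "Hello",
    "cyrillic": "Привет",
    "emoji": "\U0001F600",  # one code point above U+FFFF
}

def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16LE and UTF-32LE."""
    # The -le variants are used so that no byte order mark is added.
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}

for name, text in samples.items():
    print(name, encoded_sizes(text))
# latin {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
# cyrillic {'utf-8': 12, 'utf-16-le': 12, 'utf-32-le': 24}
# emoji {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```

UTF-32 always spends four bytes per character, while UTF-8 spends one byte on each Latin letter.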
Problem Formulation
Suppose we have a Unicode string and we need to convert it to a Python string.
A = '\u0048\u0065\u006C\u006C\u006F'
Let’s make sure of the input data type:
>>> type(A)
<class 'str'>
Method 1. String
In Python 3, all text is Unicode strings by default, which also means that the u'<text>' syntax is no longer needed (it is still accepted, but the prefix has no effect).
Python resolves the \uXXXX escape sequences when it reads the string literal, so A already holds the characters Hello. Calling str() on it leaves the value unchanged, and print() writes the text to the output:
print(str(A)) # Hello
There is no need to check the data type afterwards: A is already a str, and calling str() on a string returns the same string.
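A quick check shows that the escaped literal and the plain literal are the same string. This is a minimal sketch; the name `code_points` is only illustrative.

```python
A = '\u0048\u0065\u006C\u006C\u006F'

# The escapes are resolved when the literal is read, so both spellings
# produce identical strings.
print(A == 'Hello')  # True
print(len(A))        # 5

# Each character still corresponds to the code point written in the escape.
code_points = [hex(ord(c)) for c in A]
print(code_points)  # ['0x48', '0x65', '0x6c', '0x6c', '0x6f']

# str() applied to a string yields an equal string of the same type.
print(str(A) == A, type(str(A)).__name__)  # True str
```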
Method 2. Repr()
The built-in repr() function returns a string containing the printable formal representation of an object. For a string, that representation includes the surrounding quotes.
print(repr(A)) # 'Hello'
Check the data type:
print(type(repr(A))) # <class 'str'>
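Because the quotes are part of the result, repr(A) is not equal to A itself. The sketch below shows the difference and one way to get the original value back with ast.literal_eval from the standard library.

```python
import ast

A = '\u0048\u0065\u006C\u006C\u006F'
r = repr(A)

print(r)               # 'Hello'
print(len(A), len(r))  # 5 7
print(r == A)          # False, the quotes are part of the result

# literal_eval parses the representation back into the original string.
restored = ast.literal_eval(r)
print(restored == A)   # True
```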
Method 3. Module unicodedata, function normalize
The normalize() function of the unicodedata module returns the normal form for a Unicode string. Valid values for the form are NFC, NFKC, NFD, and NFKD.
The Unicode standard defines various forms of Unicode string normalization based on the definition of canonical equivalence and compatibility equivalence. In Unicode, several characters can be expressed in different ways. For example, the character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).
There are two normal forms for each character: normal form C and normal form D. Normal form D (NFD) is also known as canonical decomposition and translates each character into decomposed form. Normal Form C (NFC) first applies canonical decomposition, then re-creates the pre-combined characters.
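The two canonical forms can be tried on the cedilla example. This is a small sketch; the variable names are illustrative.

```python
import unicodedata

composed = '\u00C7'          # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = '\u0043\u0327'  # LATIN CAPITAL LETTER C + COMBINING CEDILLA

# The two spellings render alike but are different strings.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 1 2

# NFD decomposes, NFC recomposes.
nfd = unicodedata.normalize('NFD', composed)
nfc = unicodedata.normalize('NFC', decomposed)
print(nfd == decomposed)  # True
print(nfc == composed)    # True
```

Normalizing both sides to the same form before comparing is what makes the two spellings compare equal.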
In addition to these two forms, there are two additional normal forms based on compatibility equivalence. Some characters are supported in Unicode even though they would normally be unified with other characters. For example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I). However, it is supported in Unicode for compatibility with existing character sets such as gb2312.
The normal form KD (NFKD) will apply compatibility decomposition, that is, replace all compatibility symbols with their equivalents. The normal form KC (NFKC) applies compatibility decomposition first and then canonical composition.
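The compatibility forms can be tried on the Roman numeral example. This is a short sketch showing that the canonical forms leave the character alone while the compatibility forms replace it.

```python
import unicodedata

roman_one = '\u2160'  # ROMAN NUMERAL ONE

# Canonical normalization keeps the compatibility character as it is.
nfc = unicodedata.normalize('NFC', roman_one)
print(nfc == roman_one)  # True

# Compatibility normalization replaces it with its equivalent.
nfkd = unicodedata.normalize('NFKD', roman_one)
nfkc = unicodedata.normalize('NFKC', roman_one)
print(nfkd, nfkc)  # I I
print(nfkc == '\u0049')  # True
```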
Even if two Unicode strings are normalized and look the same to a human reader, they may not compare equal if one has combining characters and the other does not.
import unicodedata
print(unicodedata.normalize('NFC', A)) # Hello
Let’s check the data type after normalization:
print(type(unicodedata.normalize('NFC', A))) # <class 'str'>
Method 4. List Comprehension and str.join
The str.join() method returns a string that is the concatenation of all the strings in the iterable.
In the resulting string, the elements are separated by the string on which the method is called.
If there are any non-string values in the iterable, including bytes, a TypeError exception is raised.
Let’s check how it works:
print(''.join([str(i) for i in A])) # Hello
Here '' is an empty separator string: the join method uses it to join the elements of the list that we have built from the characters of string A.
Since we wrap each element of the list with the str function, we can safely assume that the result will be the desired data type:
print(type(''.join([str(i) for i in A]))) # <class 'str'>
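The separator and the TypeError behavior of str.join() described in this method can be seen in a short sketch; the sample lists are made up for illustration.

```python
parts = ['H', 'e', 'l', 'l', 'o']

# The string the method is called on is placed between the elements.
print(''.join(parts))   # Hello
print('-'.join(parts))  # H-e-l-l-o

# A non-string element makes join() raise TypeError.
mixed = ['a', 1, 2.5]
try:
    ''.join(mixed)
    raised = False
except TypeError:
    raised = True
print(raised)  # True

# Converting each element with str() first avoids the error.
joined = ''.join(str(item) for item in mixed)
print(joined)  # a12.5
```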
Method 5. Library ftfy
The full name of this library is Fixes text for you. It is designed to turn bad Unicode strings (â€œquotesâ€\x9d or Ã¼) into good Unicode strings (“quotes” or ü respectively).
Let’s see how it works in our example:
import ftfy
print(ftfy.fix_text(A)) # Hello
Let’s check the output data type:
print(type(ftfy.fix_text(A))) # <class 'str'>
Great, that’s what we need. Keep in mind that ftfy is a third-party library, so it has to be installed (for example, with pip install ftfy) before it can be imported ;)
Method 6. Module io
The io module is applicable when you need to perform an I/O operation (for example, reading or writing files). You can use the built-in read() and write() methods of a file object to read or write a file, but this module gives us many more options for these operations, such as writing to or reading from an in-memory buffer.
In our simple example, it would look like this:
import io
print(io.StringIO(A).read()) # Hello
io.StringIO works with data of the string type, both in input and output. It is an in-memory text stream, so unlike a text file opened with open(), no encoding to bytes or decoding from bytes takes place; its optional newline parameter controls newline translation.
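Since io.StringIO behaves like a text file held in memory, it keeps a read position and can also be written to. The following sketch uses only standard io.StringIO methods; the variable names are illustrative.

```python
import io

A = '\u0048\u0065\u006C\u006C\u006F'

# Reading consumes the buffer from the current position.
buffer = io.StringIO(A)
first_read = buffer.read()
second_read = buffer.read()  # the position is now at the end
print(first_read)            # Hello
print(repr(second_read))     # ''

# seek(0) rewinds; getvalue() returns the full contents at any time.
buffer.seek(0)
print(buffer.read(2))     # He
print(buffer.getvalue())  # Hello

# Writing builds up a string piece by piece.
out = io.StringIO()
out.write(A)
out.write(', world')
result = out.getvalue()
print(result)  # Hello, world
```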
Method 7. Format
This method is the most flexible, since format() works with many data types: strings, integers, and floats in different representations (octal, decimal, hexadecimal in lower or upper case). The format specification mini-language lets you specify not only the data type but also alignment, rounding, and padding with a fill character to the required width, and str.format() can additionally look up dictionary keys and sequence indices.
Let’s check with our example:
print(format(A, 's')) # Hello
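Here 's' is the presentation type of the formatted object (string), which is used by default for strings. More details about the specification and syntax are in the “Format Specification Mini-Language” section of the Python documentation.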
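A few more format specifications, as a sketch of the options mentioned in this method (alignment, fill, number bases, rounding, dictionary lookup). The values are arbitrary examples.

```python
A = '\u0048\u0065\u006C\u006C\u006F'

# Alignment and fill: right-align in 10 columns, center with '*' in 11.
right = format(A, '>10')
centered = format(A, '*^11')
print(repr(right))  # '     Hello'
print(centered)     # ***Hello***

# Integers in hexadecimal (lower and upper case) and octal.
print(format(255, 'x'), format(255, 'X'), format(255, 'o'))  # ff FF 377

# Rounding a float to two decimal places.
rounded = format(3.14159, '.2f')
print(rounded)  # 3.14

# str.format() can look up a dictionary key inside the replacement field.
looked_up = '{d[key]}'.format(d={'key': A})
print(looked_up)  # Hello
```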