Programmer's Python Data - Unicode Strings
Written by Mike James   
Tuesday, 08 August 2023

UTF-8 And Python Code

Python works in UTF-8 in its source files and this has some important consequences. Notice that UTF-8 is also the default encoding for HTML 5 web pages, which makes it a very common encoding - so much so that most editors default to UTF-8, i.e. they create UTF-8 encoded files which you can then use with whatever compiler or interpreter you want. Of course, if the compiler or interpreter doesn't accept UTF-8 files you have a problem. The usual solution is to set the editor you are using to create files encoded to satisfy the compiler or interpreter. Python as implemented by CPython (the usual interpreter) reads UTF-8 files, so in most cases everything just works.

However, if you want to type a Unicode character into your source code then things can be difficult. If you simply want to use a local language then everything is easy, as long as you can find a suitable keyboard supported by the operating system. In this case you simply type. If you can't find such a keyboard you are at the mercy of the editor you are using as to what Unicode support it provides. There are various key combinations and conventions, such as the Alt key plus a character code typed on the numeric keypad, but these generally only work for a limited range of code points and depend on the editor in use.

If you can type a Unicode character then you can use it within string literals and variable names. Unicode in variable names is limited to code points that are part of a language, so you cannot use emojis or graphical symbols.
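A quick way to check what counts as a legal identifier is the isidentifier string method. The sketch below, assuming CPython 3, shows that Unicode letters such as Greek pi are accepted while mathematical symbols and emojis are not:

```python
# Unicode letters are legal in identifiers (PEP 3131)...
π = 3.14159                          # fine: GREEK SMALL LETTER PI is a letter
print("π".isidentifier())            # True

# ...but symbols and emojis are not letters
print("\u2211".isidentifier())       # False: N-ARY SUMMATION is a math symbol
print("\U0001F600".isidentifier())   # False: an emoji face is not a letter
```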

If you want to use an encoding other than UTF-8 you simply add a comment:

# coding=latin-1

to the start of the file where latin-1 is replaced by the encoding name of the file. Python comes with around 100 encoding definitions and these are listed in the documentation under “Standard Encodings”.
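You can check that an encoding name is one Python recognizes, and find its canonical name, using the standard codecs module. A minimal sketch:

```python
import codecs

# Look up an encoding by any of its aliases
info = codecs.lookup("latin-1")
print(info.name)          # the canonical name, iso8859-1

# An unknown encoding name raises LookupError
try:
    codecs.lookup("no-such-encoding")
except LookupError:
    print("unknown encoding")
```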

If you cannot use a keyboard or some direct method of Unicode entry, you can always use the \u or \U escape sequences. The lower case version expects four hex digits while the upper case version expects eight. The lower case form limits you to the Basic Multilingual Plane (BMP), i.e. code points that need only two bytes to specify, which is enough for most purposes. For example:

print("ABCD\u2211")

displays ABCD∑. You can also use the \N escape sequence to specify a code point using the Unicode character’s name. For example:

print("ABCD\N{N-Ary Summation}")

displays ABCD∑ as before. Notice that code point 2211 hex is N-Ary Summation, a mathematical symbol and not the Greek Capital Letter Sigma which looks very similar, if not identical, in most fonts.
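The names used by \N come from the Unicode character database, which you can query directly with the standard unicodedata module. This also makes the summation/sigma distinction easy to demonstrate:

```python
import unicodedata

print(unicodedata.name("\u2211"))   # N-ARY SUMMATION
print(unicodedata.name("\u03a3"))   # GREEK CAPITAL LETTER SIGMA

# They look alike in many fonts but are different characters
print("\u2211" == "\u03a3")         # False

# lookup is the inverse of name
print(unicodedata.lookup("N-ARY SUMMATION"))   # ∑
```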

Unicode and Strings

Internally Python represents strings as full Unicode characters using one, two or four bytes per character, depending on the range of characters to be represented. Notice that there is no encoding used. For example, a two-byte representation gives the first 16 bits of the full 21 bits of a Unicode code point. The number of bytes used per character depends on the largest Unicode character code stored in the string. If the string only uses code points between 0 and 255, a single byte per character is enough. However, if the string uses even one code point that needs more than one byte to represent, then all of the characters in the string use more than one byte.

This breaks the long-held belief that the number of characters in a string is the same as the number of bytes in the string. Indeed, the number of bytes used to represent the characters depends on the highest Unicode code point used. Unless you are going to use the string to call an external function, or pass it as a sequence of byte values, this probably doesn't matter. If, however, you are doing anything even slightly out of the ordinary you have to understand that a string is no longer a sequence of bytes where each byte is a character.

For example:

print(len("ABCD\u2211"))

displays 5, even though the string needs at least two bytes to represent the final character. You can find out how many bytes a string takes using sys.getsizeof(string) although this returns the total number of bytes the object uses not just the number of bytes used to represent the characters. To find the number of bytes you have to subtract the length of a single character string:

import sys
print(sys.getsizeof("ABCD\u2211") -
      sys.getsizeof("\u2211"))

which displays 8, which means in this case it takes 8 bytes to represent four characters, i.e. two bytes per character. The reason is that \u2211 is the largest Unicode character in the string and it needs two bytes, so every character is represented by two bytes.

If you change the \u to \U, i.e. specify eight hex digits, and include a Unicode character that needs more than 16-bits the result is different:

print(sys.getsizeof("ABCD\U00072211")-
sys.getsizeof("\U00072211"))

displays 16, which means we are now using four bytes per character. This is in line with the fact that the final character needs four bytes to represent, so all of the characters are represented using four bytes.
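You can package the getsizeof trick into a small helper that reports the bytes per character of any non-empty string. This sketch relies on CPython's compact string representation (PEP 393), so it is an implementation detail rather than a language guarantee:

```python
import sys

def bytes_per_char(s):
    # Appending one more copy of the widest character grows the object
    # by exactly the per-character width under CPython's compact
    # string representation
    return sys.getsizeof(s + s[-1]) - sys.getsizeof(s)

print(bytes_per_char("ABCD"))             # 1 - all code points below 256
print(bytes_per_char("ABCD\u2211"))       # 2 - a BMP character is present
print(bytes_per_char("ABCD\U00072211"))   # 4 - a character beyond the BMP
```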

The same is true of encodings, except that in an encoding each character can need a different number of bytes. For example, converting the string to a byte sequence using the encode method, which is discussed in more detail later:

print(len("ABCD\u2211".encode("utf-8")))

displays 7, which reflects the fact that the final character needs three bytes to store. Of course, all of the string methods understand Unicode and will give you the correct result, i.e. everything works as it would if strings were ASCII encoded.
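UTF-8 is a variable-length encoding, so you can see the full one-to-four-byte range by encoding one character at a time. A short sketch:

```python
# Each character encodes to a different number of bytes in UTF-8
for ch in "A\u00e9\u2211\U0001F600":
    print(hex(ord(ch)), len(ch.encode("utf-8")))
# A (U+0041)  -> 1 byte
# é (U+00E9)  -> 2 bytes
# ∑ (U+2211)  -> 3 bytes
# 😀 (U+1F600) -> 4 bytes
```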

We have already seen that the len function returns the number of characters not the number of bytes, but all other string functions and operators work with characters rather than bytes. As long as you don’t delve into the inner workings everything should be simple and you can use string functions as if they were working with one byte per character. There is more to say about encodings in general, but this is better discussed in Chapter 12 where we look at binary data.

The point is that the link between the number of characters in a string and the number of bytes used is well and truly broken in the era of Unicode.

Chr and Ord

There are two built-in functions that work with strings and are part of Python mainly because they are present in other languages. In Python, however, they are fully Unicode functions, whereas in other languages they generally work only with ASCII codes.

The chr function takes an integer and returns a single Unicode character:

c = chr(8721)

or using hex:

chr(0x2211)

is the n-ary summation sign, ∑.

The ord function is the reverse of the chr function and returns an integer corresponding to the code point of the specified Unicode character. For example:

ord("\u2211")

returns 8721. Notice that the integer used in chr has to be a valid Unicode code point and ord will only work with a single Unicode character.
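The two functions are exact inverses over the valid code point range, and both raise errors for bad input. A short sketch:

```python
# chr and ord round-trip exactly
assert chr(ord("\u2211")) == "\u2211"
assert ord(chr(8721)) == 8721

# chr rejects integers outside range(0x110000)
try:
    chr(0x110000)
except ValueError:
    print("not a valid code point")

# ord rejects anything but a single character
try:
    ord("AB")
except TypeError:
    print("ord needs exactly one character")
```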



Last Updated ( Wednesday, 09 August 2023 )