Programmer's Python Data - Unicode Strings

Written by Mike James

Tuesday, 08 August 2023


Strings in the era of Unicode are no longer simple. Python strings are sequences of Unicode code points, independent of any particular encoding, while its source files are UTF-8 encoded by default. Find out how it all works in this extract from Programmer's Python: Everything is Data.

Programmer's Python
Everything is Data

Is now available as a print book: Amazon

Python – A Lightning Tour
The Basic Data Type – Numbers
Extract: Bignum
Truthy & Falsey
Dates & Times
Extract Naive Dates
Sequences, Lists & Tuples
Extract Sequences
Strings
Extract Unicode Strings
Regular Expressions
Extract Simple Regular Expressions
The Dictionary
Extract The Dictionary
Iterables, Sets & Generators
Extract Iterables
Comprehensions
Extract Comprehensions
Data Structures & Collections
Extract Stacks, Queues and Deques
Extract Named Tuples and Counters
Bits & Bit Manipulation
Extract Bits and BigNum
Bytes
Extract Bytes And Strings
Extract Byte Manipulation
Binary Files
Extract Files and Paths
Text Files
Creating Custom Data Classes
Extract A Custom Data Class
Python and Native Code
Extract Native Code
Appendix I Python in Visual Studio Code
Appendix II C Programming Using Visual Studio Code


The Python string object isn’t that different from those encountered in other languages, but there are some differences that might puzzle you if you are familiar with other languages. It starts out being similar, but then as you dig just a little deeper things aren’t quite what they seem.

Python strings are immutable sequences and they support all of the standard sequence methods and conventions, including slices. They also have additional methods suited to the data they represent. However, the biggest difference is that the elements of a string are also strings. That is, an element of a string is a string of length one, representing a single character coded using Unicode. This makes things a little more interesting.
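A short sketch of what this means in practice. The variable names are only for illustration; the point is that indexing a string gives back another string of length one, and that assignment to an element is refused because strings are immutable.

```python
s = "Python"

first = s[0]              # indexing returns a str, there is no separate char type
print(type(first))        # <class 'str'>
print(len(first))         # 1
print(first[0] == first)  # True: a one-character string is its own first element

print(s[1:4])             # yth  (slices work as for any sequence)

try:
    s[0] = "J"            # strings are immutable, so this raises TypeError
except TypeError as err:
    print("TypeError:", err)
```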

In chapter but not in this extract

String Literals
String Operators
Formatted String Literals

Note: This extract is a slightly updated, and more accurate, version of the first printing of Programmer's Python: Everything is Data.

String Coding With Unicode

In many cases you can ignore the exact way that strings represent characters, but there are lots of times when you can’t. To really master Python strings you have to master Unicode, which is the default internal representation for Python strings, and encodings in general.

Unicode is just a list of characters indexed by a 21-bit value called the character's code point. There are enough characters in Unicode to represent every language in use and some that aren't, such as Linear A, the as-yet undeciphered writing system from the Minoan civilization of ancient Crete.
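You can check the 21-bit claim from within Python. This is a minimal sketch using the standard sys module, which reports the largest code point a string element can hold.

```python
import sys

largest = sys.maxunicode
print(hex(largest))          # 0x10ffff
print(largest.bit_length())  # 21

# A code point beyond the Basic Multilingual Plane, in the Linear A block
linear_a = chr(0x10600)
print(len(linear_a))         # 1: still a single string element
```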

It is usual to specify the character with code x as a code point and write it as U+x. For example, the summation sign, capital sigma, is at code point 2211 in hex and this is usually written U+2211.
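The built-in functions ord and chr convert between a character and its code point. The sketch below also includes a small helper, u_plus, which is not a standard function but a hypothetical name used here to produce the U+x notation.

```python
sigma = "\u2211"             # escape sequence using the hex code point
print(ord(sigma))            # 8721
print(hex(ord(sigma)))       # 0x2211
print(chr(0x2211) == sigma)  # True


def u_plus(ch):
    """Format one character in U+x notation (hypothetical helper)."""
    return f"U+{ord(ch):04X}"


print(u_plus(sigma))  # U+2211
print(u_plus("A"))    # U+0041
```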

Unicode defines the characters, but it doesn't say how the code point should be represented in programs. You are free to choose any encoding of the code point. The simplest encoding is to use a 32-bit index for every character. This is UTF-32 and it is simple, but very inefficient. It requires four bytes per character, which makes it four times bigger than ASCII, which uses a single byte per character.

As many files only need to use a restricted set of characters, using 32 bits per character is wasteful compared to the single byte needed for an ASCII character. There are a number of different Unicode encodings, but the one that is most used is UTF-8. Python source files are encoded in UTF-8 by default.
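The size difference is easy to see by encoding the same text both ways. This sketch uses the "utf-32-be" codec rather than plain "utf-32" only so that no byte order mark is prepended and the byte counts are exactly four per character.

```python
text = "Hello"

utf32 = text.encode("utf-32-be")  # fixed four bytes per character
utf8 = text.encode("utf-8")       # one byte each for these ASCII characters

print(len(utf32))  # 20
print(len(utf8))   # 5
print(utf8 == text.encode("ascii"))  # True: pure ASCII text is unchanged in UTF-8
```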

UTF-8

Unicode has room for 1,112,064 valid code points, so they cannot all be represented by a value in a single byte in the way that the 128 characters of ASCII can. Instead, UTF-8 is a variable-length code that uses up to four bytes to represent a character. How many bytes are used to code a character is indicated by the most significant bits of the first byte.

Byte 1
0xxxxxxx   one byte 
110xxxxx   two bytes
1110xxxx   three bytes
11110xxx   four bytes

All subsequent bytes have their most significant two bits set to 10. This means that you can always tell a follow-on byte from a first byte. The bits in the table shown as x carry the information about which character is represented. To get the character code you simply extract the bits and concatenate them to get a 7, 11, 16 or 21-bit character code.
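The rules above can be sketched as a small decoder. This is a teaching sketch only: decode_utf8_char is a hypothetical name, and unlike Python's own decoder it makes no attempt to reject overlong forms or surrogate values. In real code you would simply call bytes.decode("utf-8").

```python
def decode_utf8_char(data):
    """Decode the first UTF-8 character in data.

    Returns (code_point, number_of_bytes_used). Hypothetical helper.
    """
    first = data[0]
    if first & 0b10000000 == 0b00000000:    # 0xxxxxxx
        length, code = 1, first
    elif first & 0b11100000 == 0b11000000:  # 110xxxxx
        length, code = 2, first & 0b00011111
    elif first & 0b11110000 == 0b11100000:  # 1110xxxx
        length, code = 3, first & 0b00001111
    elif first & 0b11111000 == 0b11110000:  # 11110xxx
        length, code = 4, first & 0b00000111
    else:
        raise ValueError("not a valid first byte")

    if len(data) < length:
        raise ValueError("truncated sequence")

    for byte in data[1:length]:
        if byte & 0b11000000 != 0b10000000:  # follow-on bytes start with 10
            raise ValueError("expected a follow-on byte")
        code = (code << 6) | (byte & 0b00111111)  # append the six payload bits
    return code, length


print(decode_utf8_char("A".encode("utf-8")))           # (65, 1)
print(decode_utf8_char("\u00e9".encode("utf-8")))      # (233, 2)
print(decode_utf8_char("\U0001F600".encode("utf-8")))  # (128512, 4)
```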

For example, if you have the bytes:

11100010 10001000 10010001 or E2 88 91 in hex

The first byte starts with 1110, so this is a three-byte encoding. The second and third bytes start with 10, so these are indeed follow-on bytes. Extracting the remaining bits gives the 16-bit value:

0010 001000 010001

which is 2211 in hex, 8721 in decimal, and this corresponds to the mathematical summation sign, i.e. a sigma, ∑.

The first 128 characters of Unicode are the same as ASCII, so if you use a value less than 128 stored in a single byte then you have backward-compatible ASCII text. That is, Unicode characters U+0000 to U+007F can be represented in a single byte. Going beyond this needs two, three or four bytes. Almost all the Latin alphabets plus Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N'Ko can be represented with just two bytes.
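The worked example can be checked directly in Python, both by masking and shifting the bits by hand and by letting the built-in decoder do the work. The sample characters in the loop are arbitrary choices made for this sketch, picked to cover one, two, three and four-byte encodings.

```python
raw = bytes([0b11100010, 0b10001000, 0b10010001])
print(raw.hex())  # e28891

# Keep 4 bits from the first byte and 6 bits from each follow-on byte
payload = ((raw[0] & 0x0F) << 12) | ((raw[1] & 0x3F) << 6) | (raw[2] & 0x3F)
print(hex(payload))         # 0x2211
print(raw.decode("utf-8"))  # the summation sign, U+2211

# Bytes needed for: A, e-acute, Greek omega, Cyrillic sha, Hebrew alef,
# the summation sign and an emoji
samples = ["A", "\u00e9", "\u03a9", "\u0448", "\u05d0", "\u2211", "\U0001F600"]
lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)  # [1, 2, 2, 2, 2, 3, 4]
```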


Last Updated ( Wednesday, 09 August 2023 )
