Programmer's Python Data - Unicode Strings
Written by Mike James   
Tuesday, 08 August 2023

Strings in the era of Unicode are no longer simple. Python strings are sequences of Unicode code points, with no encoding visible to the programmer, and Python source files use UTF-8 encoding by default. Find out how it all works in this extract from Programmer's Python: Everything is Data.

Programmer's Python
Everything is Data

Is now available as a print book: Amazon

Contents

  1. Python – A Lightning Tour
  2. The Basic Data Type – Numbers
       Extract: Bignum
  3. Truthy & Falsey
  4. Dates & Times
       Extract Naive Dates
  5. Sequences, Lists & Tuples
       Extract Sequences 
  6. Strings
       Extract Unicode Strings
  7. Regular Expressions
       Extract Simple Regular Expressions
  8. The Dictionary
       Extract The Dictionary 
  9. Iterables, Sets & Generators
       Extract  Iterables 
  10. Comprehensions
       Extract  Comprehensions 
  11. Data Structures & Collections
       Extract Stacks, Queues and Deques
       Extract Named Tuples and Counters
  12. Bits & Bit Manipulation
       Extract Bits and BigNum 
  13. Bytes
       Extract Bytes And Strings
       Extract Byte Manipulation 
  14. Binary Files
  15. Text Files
  16. Creating Custom Data Classes
        Extract A Custom Data Class 
  17. Python and Native Code
        Extract   Native Code
    Appendix I Python in Visual Studio Code
    Appendix II C Programming Using Visual Studio Code

The Python string object isn’t that different from those encountered in other languages, but there are some differences that might puzzle you if you are familiar with other languages. It starts out being similar, but then as you dig just a little deeper things aren’t quite what they seem.

Python strings are immutable sequences and they support all of the standard sequence methods and conventions, including slices. They also have additional methods suited to the data they represent. However, the biggest difference is that the elements of a string are also strings. That is, an element of a string is a string of length one, representing a single character coded using Unicode. This makes things a little more interesting.
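As a quick illustration of these points, the following minimal sketch shows that indexing a string yields another string of length one, that slicing works as for any sequence, and that assignment to an element fails. The variable names are chosen purely for illustration.

```python
s = "Python"

# Indexing a string gives back a string of length one, not a separate char type.
first = s[0]
print(type(first).__name__, len(first))

# A one-character string is itself a sequence whose only element is itself.
print(first[0] == first)

# Slices work as for any other sequence.
middle = s[1:4]
print(middle)

# Strings are immutable, so item assignment raises TypeError.
try:
    s[0] = "J"
except TypeError as exc:
    print("Cannot modify a string:", exc)
```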

In chapter but not in this extract

  • String Literals
  •  String Operators
  •  Formatted String Literals

Note: This extract is a slightly updated, and more accurate, version of the first printing of Programmer's Python: Everything is Data.

String Coding With Unicode

In many cases you can ignore the exact way that strings represent characters, but there are lots of times when you can’t. To really master Python strings you have to master Unicode, which is the default internal representation for Python strings, and encodings in general.

Unicode is just a list of characters indexed by a 21-bit value called the character's code point. There are enough characters in Unicode to represent every language in use and some that aren't, such as Linear A, the as-yet undeciphered writing system from the Minoan civilization of ancient Crete.

It is usual to specify the character with code x as a code point and write it as U+x. For example, the summation sign, capital sigma, is at code point 2211 in hex and this is usually written U+2211.
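The relationship between a character and its code point can be explored with the built-in ord and chr functions and the standard unicodedata module. This is only a small sketch using the summation sign as the example character:

```python
import unicodedata

# "\u2211" is a string escape that names the character by its code point.
sigma = "\u2211"

# ord gives the code point as an integer, chr goes the other way.
code_point = ord(sigma)
print(f"U+{code_point:04X}")
print(chr(0x2211) == sigma)

# The Unicode database records an official name for each character.
name = unicodedata.name(sigma)
print(name)
```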

Unicode defines the characters, but it doesn't say how the code point should be represented in programs. You are free to choose any encoding of the code point. The simplest encoding is to use a 32-bit index for every character. This is UTF-32 and it is simple, but very inefficient. It requires four bytes per character, four times as much as ASCII, which uses a single byte per character.

As many files only need to use a restricted set of characters, using 32 bits per character is wasteful compared to the single byte needed for an ASCII character. There are a number of different Unicode encodings, but the one that is most used is UTF-8. Python source files are encoded in UTF-8 by default.
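The size difference is easy to check by encoding the same text in different ways. In this sketch the sample text is arbitrary, and the "utf-32-le" codec is used rather than "utf-32" so that no byte order mark is added to the count:

```python
text = "hello"

# Each encoding turns the same five characters into a bytes object.
as_ascii = text.encode("ascii")
as_utf8 = text.encode("utf-8")
as_utf32 = text.encode("utf-32-le")  # explicit byte order, so no BOM is added

print(len(as_ascii), len(as_utf8), len(as_utf32))

# For pure ASCII text the UTF-8 bytes are identical to the ASCII bytes.
print(as_ascii == as_utf8)
```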

UTF-8

Given that there are 1,112,064 valid code points in Unicode, these cannot all be represented by a value in a single byte as the 128 characters of ASCII can. Instead UTF-8 is a variable length code that uses up to four bytes to represent a character. How many bytes are used to code a character is indicated by the most significant bits of the first byte.

Byte 1
0xxxxxxx   one byte
110xxxxx   two bytes
1110xxxx   three bytes
11110xxx   four bytes

All subsequent bytes have their most significant two bits set to 10. This means that you can always tell a follow-on byte from a first byte. The bits in the table shown as x carry the information about which character is represented. To get the character code you simply extract the bits and concatenate them to get a 7, 11, 16 or 21-bit character code.
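The first-byte rules can be sketched as a pair of small functions. The names utf8_sequence_length and is_continuation are hypothetical helpers written for this illustration and are not part of the Python library:

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence has, judged by its first byte."""
    if first_byte >> 7 == 0b0:        # 0xxxxxxx
        return 1
    if first_byte >> 5 == 0b110:      # 110xxxxx
        return 2
    if first_byte >> 4 == 0b1110:     # 1110xxxx
        return 3
    if first_byte >> 3 == 0b11110:    # 11110xxx
        return 4
    raise ValueError(f"{first_byte:#04x} is not a valid first byte")


def is_continuation(byte: int) -> bool:
    """Return True for a follow-on byte, i.e. one of the form 10xxxxxx."""
    return byte >> 6 == 0b10


print(utf8_sequence_length(0x41))   # "A", a plain ASCII byte
print(utf8_sequence_length(0xE2))   # a byte starting 1110
print(is_continuation(0x88))        # a byte starting 10
```

Note that this only classifies bytes by their leading bits. A full decoder would also have to reject other invalid cases, such as overlong encodings.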

For example, if you have the bytes:

11100010 10001000 10010001 or E2 88 91 in hex

The first byte starts with 1110 and so this is a three-byte encoding. The second and third bytes start with 10, so these are indeed the follow-on bytes. Extracting the remaining bits gives the 16-bit value:

0010 001000 010001

which is 2211 in hex, 8721 in decimal, and this corresponds to the mathematical summation sign, i.e. a sigma, ∑.

The first 128 characters of UTF-8 are the same as ASCII, so if you use a value less than 128 stored in a single byte then you have backward compatible ASCII text. That is, Unicode characters U+0000 to U+007F can be represented in a single byte. Going beyond this needs two, three or four bytes. Almost all the Latin-script alphabets plus Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N'Ko can be represented with just two bytes.
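The worked example can be checked in Python by masking off the marker bits by hand and comparing the result with the built-in decoder. This is a sketch for the three-byte case only, and the sample characters in the final loop are arbitrary choices:

```python
data = bytes([0xE2, 0x88, 0x91])

# Keep the low 4 bits of the first byte and the low 6 bits of each follow-on
# byte, then concatenate them into a single 16-bit value.
code_point = (
    ((data[0] & 0b00001111) << 12)
    | ((data[1] & 0b00111111) << 6)
    | (data[2] & 0b00111111)
)
print(hex(code_point))

# The built-in decoder should agree with the manual calculation.
decoded = data.decode("utf-8")
print(decoded == chr(code_point))

# Number of UTF-8 bytes needed for characters from different ranges.
samples = ["A", "\u00e9", "\u2211", "\U0001F600"]
lengths = [len(ch.encode("utf-8")) for ch in samples]
for ch, n in zip(samples, lengths):
    print(f"U+{ord(ch):04X} needs {n} byte(s)")
```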



Last Updated ( Wednesday, 09 August 2023 )