Programmer's Python Data - Unicode Strings
Written by Mike James   
Tuesday, 08 August 2023

Comparing Strings

There is one exception to the rule that you can mostly ignore the fact that Python strings are Unicode strings – comparing strings. In practice, comparing strings has always been harder than it might first appear. For example, are two strings that differ only in case the same or different? Are strings that look the same always the same? As we are using Unicode the answer is no. The reason is that Unicode allows you to create characters by combining multiple characters together and this means that there is often more than one way to create a character. For example,

ê

that is e with a circumflex accent, is code point \u00EA, but you can also create it using \u0065 \u0302, which is a standard e followed by the Combining Circumflex Accent code point, which places a circumflex over the preceding character.

Characters that look the same but correspond to different code sequences are loosely called homoglyphs (Unicode itself describes sequences like the two above as canonically equivalent) and their existence makes string comparison difficult. For example, consider:

a = "He\u0302llo World"
b = "H\u00EAllo World"
print(a)
print(b)
print(a == b)

displays:

Hêllo World
Hêllo World
False
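One way to see why the comparison fails is to list the code points that make up each string. The following is a minimal sketch; the helper name describe is purely illustrative and not part of the standard library:

```python
import unicodedata

a = "He\u0302llo World"
b = "H\u00EAllo World"

def describe(s):
    # Hypothetical helper: pair each code point with its Unicode name
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in s]

# a contains one more code point than b
print(len(a), len(b))

# The first few code points show where the two strings diverge
print(describe(a)[:3])
print(describe(b)[:2])
```

The string a holds a plain e followed by a separate combining accent, whereas b holds a single precomposed character, so an equality test that compares code point by code point cannot succeed.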

The two strings look the same, but they aren’t equal. You might want to accept this as reasonable and simply insist that there is only one way to encode a particular string in a particular character set. If you want to allow homoglyphs to be treated as equal then you need to use the normalize function from the unicodedata module, which puts a string into a standard form. Unicode defines four normal forms:

  • NFC: Canonical Composition

  • NFD: Canonical Decomposition

  • NFKC: Compatibility Composition

  • NFKD: Compatibility Decomposition

The most commonly used is NFC. To understand what these alternative standard forms are all about we need to look at some terms. The first is that “composed” means that a base character and its combining characters are merged into a single precomposed character wherever one exists. Similarly, “decomposed” means that characters are split into their component parts if possible. So, for example, \u0065 \u0302 is decomposed and \u00EA is the composed equivalent.
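As a small illustration of composed and decomposed forms, the sketch below converts the ê example in both directions and checks the number of code points. The variable names are only for this example:

```python
import unicodedata

composed = "\u00EA"          # single precomposed code point
decomposed = "\u0065\u0302"  # e followed by combining circumflex

nfd = unicodedata.normalize("NFD", composed)    # split into parts
nfc = unicodedata.normalize("NFC", decomposed)  # merged into one

print(len(composed), len(nfd))    # 1 2
print(len(decomposed), len(nfc))  # 2 1
print(nfd == decomposed, nfc == composed)  # True True
```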

A “canonical” representation is one where characters that look the same are represented in the same way, either composed or decomposed. A “compatibility” representation is more sophisticated and converts characters that mean the same thing but don’t necessarily look the same into the same character. For example, ¼ and 1⁄4 are compatible but not canonical. Notice that the compatibility form of ¼ uses the Fraction Slash, \u2044, rather than the ASCII / character. You can see that a compatible representation is starting to introduce meaning into what is treated as equivalent.
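The difference between the canonical and compatibility forms can be seen by normalizing a character that has only a compatibility mapping. This is a sketch using ¼ and, as an extra example not taken from the text above, the ﬁ ligature:

```python
import unicodedata

quarter = "\u00BC"   # VULGAR FRACTION ONE QUARTER
ligature = "\uFB01"  # LATIN SMALL LIGATURE FI

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    q = unicodedata.normalize(form, quarter)
    lig = unicodedata.normalize(form, ligature)
    # Show the code points so that the fraction slash is visible
    print(form, [f"U+{ord(c):04X}" for c in q], lig)

# The canonical forms leave the fraction alone, the compatibility forms expand it
canonical = unicodedata.normalize("NFC", quarter)
compatible = unicodedata.normalize("NFKC", quarter)
print(canonical == quarter)                  # True
print(compatible == "1" + "\u2044" + "4")    # True
print(compatible == "1/4")                   # False, the slash is U+2044
```

If you need the result to match text typed with an ordinary slash you would have to replace \u2044 yourself after normalizing.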

If you want to use a more sophisticated approach to string matching you have to convert all strings into a standard form. For example:

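import unicodedata
a = "He\u0302llo World"   # decomposed: e followed by combining circumflex
b = "H\u00EAllo World"    # composed: single code point for ê
print(unicodedata.normalize("NFC",a) == b)   # a converted to composed form
print(a == unicodedata.normalize("NFD",b))   # b converted to decomposed form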

The string a is in decomposed format and b is in composed format. The first print converts a to composed form and so it is equal to b and the result is True. The second print statement converts b to decomposed form and so it is equal to a and the result is True.

In most cases you would simply convert all strings to one of the forms and do a comparison after this. For example:

print(unicodedata.normalize("NFC",a) == 
                        unicodedata.normalize("NFC",b))

results in True because both a and b are converted to composed form.
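If you do this often it can be convenient to wrap the comparison in a function. The following is a sketch only; nfc_equal is a hypothetical helper name, and NFC is chosen simply because it is the most commonly used form:

```python
import unicodedata

def nfc_equal(s1, s2):
    # Hypothetical helper: compare two strings after converting both to NFC
    return unicodedata.normalize("NFC", s1) == unicodedata.normalize("NFC", s2)

a = "He\u0302llo World"
b = "H\u00EAllo World"

print(a == b)           # False
print(nfc_equal(a, b))  # True

# The same idea applies to sets and dictionary keys: normalize before storing
seen = {unicodedata.normalize("NFC", s) for s in (a, b)}
print(len(seen))        # 1
```

Normalizing at the point where strings enter the program, rather than at every comparison, is usually simpler, but whether that is appropriate depends on whether you need to preserve the original code points.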

There really isn’t any “best practice” when it comes to string comparison because what you want to consider as equal strings depends on the application and the locales it is intended to work in. If you understand the way normalize works you should be able to select the correct approach.

There is also a slight complication due to the way upper and lower case work in Unicode. Some characters become two characters when their case is changed; for example, the German ß becomes ss. The casefold string method will convert a string into a case insensitive form of lower case. The documentation gives the following way of preparing a string, s1, for a case insensitive comparison:

unicodedata.normalize('NFD', 
      unicodedata.normalize('NFD', s1).casefold()) 

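You have to first normalize the string, convert it to case insensitive form and then normalize again because casefold can denormalize a string. Two strings are equal in a case insensitive sense if this expression gives the same result for both.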
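Put together as a complete comparison, the approach looks something like the following, which is modelled on the compare_caseless example in the Python Unicode HOWTO. The test strings are illustrative choices:

```python
import unicodedata

def NFD(s):
    # Convert s to canonical decomposed form
    return unicodedata.normalize("NFD", s)

def compare_caseless(s1, s2):
    # Normalize, casefold, then normalize again, because
    # casefold can leave a string in a non-normalized state
    return NFD(NFD(s1).casefold()) == NFD(NFD(s2).casefold())

print(compare_caseless("Straße", "STRASSE"))          # True
print(compare_caseless("He\u0302llo", "H\u00CAllo"))  # True
print(compare_caseless("Hello", "Help"))              # False
```

The first test succeeds because casefold converts ß to ss, and the second because both strings decompose to the same sequence once case has been removed.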

In chapter but not in this extract

  • String Manipulation
  • String slicing
  • Joining and splitting
  • Finding and replacing

 Summary

  • Python strings are sequences and support indexing, slicing and all of the standard sequence methods plus a lot of additional methods.

  • In Python string literals come in many forms and you can use single, double or triple quotes.

  • You can include special characters using \, the escape character.

  • Formatted strings are the preferred way to create string templates.

  • Python uses Unicode to represent character data in strings and this breaks the long held belief that the number of characters in a string is the same as the number of bytes.

  • There is a range of ways of entering Unicode into a Python program, including using an appropriate keyboard and the \u escape.

  • Python source files are encoded using UTF-8 by default.

  • The chr and ord functions convert code points into characters and characters into code points respectively.

  • The use of UTF-8 and Unicode makes comparing strings more complicated because there is more than one way to create some symbols. To make comparison consistent there are a number of normal forms that can be applied.

  • String manipulation is a matter of using the methods available to break a string into the parts that you want and then reassemble it into the final result.

  • Python has many powerful string methods that make string manipulation easy.

 

Programmer's Python
Everything is Data

Is now available as a print book: Amazon

Last Updated ( Wednesday, 09 August 2023 )