Programmer's Python Data - Unicode Strings

Written by Mike James

Tuesday, 08 August 2023

Comparing Strings

There is one exception to the rule that you can mostly ignore the fact that Python strings are Unicode strings: comparing strings. In practice, comparing strings has always been harder than it might first appear. For example, are two strings that differ only in case the same or different? Are strings that look the same always the same? With Unicode the answer is no. The reason is that Unicode allows you to build a character by combining multiple code points, and this means that there is often more than one way to create the same character. For example,

ê

that is, e with a circumflex accent, is code point \u00EA, but you can also create it using \u0065 \u0302, which is a standard e followed by the Combining Circumflex Accent code point, which places a circumflex over the preceding character.
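As a quick check of the two spellings just described, the following sketch lists the length and the official Unicode name of each code point. The variable names composed and decomposed are only illustrative.

```python
import unicodedata

composed = "\u00EA"          # single precomposed code point
decomposed = "\u0065\u0302"  # plain 'e' followed by a combining accent

for label, text in (("composed", composed), ("decomposed", decomposed)):
    # len() counts code points, not the characters you see on screen
    names = [unicodedata.name(ch) for ch in text]
    print(label, len(text), names)
```

Both strings display as a single ê, but the first has length 1 and the second has length 2.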

Code sequences that look the same when displayed but differ in their code points are loosely called homoglyphs (Unicode's own term for this particular case is canonical equivalence), and their existence makes string comparison difficult. For example, consider:

a = "He\u0302llo World"
b = "H\u00EAllo World"
print(a)
print(b)
print(a == b)

displays:

Hêllo World
Hêllo World
False

The two strings look the same, but they aren’t equal. You might want to accept this as reasonable and simply insist that there is only one way to encode a particular string in a particular character set. If you want look-alike sequences of this kind to be treated as equal, you need to use the normalize function in the unicodedata module, which puts a string into a standard form. Unicode defines four normal forms:

NFC: Canonical Composition
NFD: Canonical Decomposition
NFKC: Compatibility Composition
NFKD: Compatibility Decomposition

The most commonly used is NFC. To understand what these alternative standard forms are all about we need to look at some terms. The first is that “composed” means that a base character and its combining marks are merged into a single precomposed code point wherever one exists. Similarly, “decomposed” means that characters are split into their component parts wherever possible. So, for example, \u0065 \u0302 is decomposed and \u00EA is the composed equivalent.

A “canonical” representation is one where characters that look the same are represented in the same way, either composed or decomposed. A “compatibility” representation is more sophisticated and converts characters that mean the same thing but don’t necessarily look the same into the same character. For example, ¼ and 1⁄4 (the digits 1 and 4 separated by the fraction slash, \u2044) are compatible but not canonical. You can see that a compatibility representation starts to introduce meaning into what is treated as equivalent.
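The difference between canonical and compatibility forms can be seen in a small sketch. It assumes the vulgar fraction ¼ (\u00BC) and the ﬁ ligature (\u FB01 written without the space) as test characters; the variable names are illustrative only.

```python
import unicodedata

quarter = "\u00BC"   # the single character one-quarter
ligature = "\uFB01"  # the single character "fi" ligature

# Canonical forms leave both characters untouched
print(unicodedata.normalize("NFC", quarter) == quarter)
print(unicodedata.normalize("NFC", ligature) == ligature)

# Compatibility forms replace them with their plain equivalents
print([hex(ord(c)) for c in unicodedata.normalize("NFKC", quarter)])
print(unicodedata.normalize("NFKC", ligature))
```

Notice that the compatibility form of ¼ uses the fraction slash, \u2044, rather than the ordinary / character, so it still does not compare equal to a 1/4 typed on a keyboard.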

If you want to use a more sophisticated approach to string matching you have to convert all strings into a standard form. For example:

import unicodedata
a = "He\u0302llo World"
b = "H\u00EAllo World"
print(unicodedata.normalize("NFC",a) == b)
print(a == unicodedata.normalize("NFD",b))

The string a is in decomposed format and b is in composed format. The first print converts a to composed form and so it is equal to b and the result is True. The second print statement converts b to decomposed form and so it is equal to a and the result is True.

In most cases you would simply convert all strings to one of the forms and do a comparison after this. For example:

print(unicodedata.normalize("NFC",a) == 
                        unicodedata.normalize("NFC",b))

results in True because both a and b are converted to composed form.
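If you need this comparison in more than one place it can be wrapped in a small helper. This is only a sketch: the function name same_text and its form parameter are invented for illustration and are not part of the standard library.

```python
import unicodedata

def same_text(s1, s2, form="NFC"):
    """Return True if s1 and s2 are equal after normalizing both to form."""
    return unicodedata.normalize(form, s1) == unicodedata.normalize(form, s2)

a = "He\u0302llo World"  # decomposed
b = "H\u00EAllo World"   # composed
print(a == b)            # plain comparison of the raw code points
print(same_text(a, b))   # comparison after normalization
```

The plain comparison gives False and the normalized comparison gives True. Passing "NFKC" as the form would also treat compatibility characters as equal, which may or may not be what your application wants.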

There really isn’t any “best practice” when it comes to string comparison because what you want to consider as equal strings depends on the application and the locales it is intended to work in. If you understand the way normalize works you should be able to select the correct approach.

There is also a slight complication due to the way upper and lower case work in Unicode. Some characters become two characters when their case is converted; for example, the German ß becomes SS in upper case. The casefold method converts a string into a form intended for case-insensitive comparison, which is similar to lower case but more aggressive. The documentation's way of comparing two strings in a case-insensitive way applies the following to each string:

unicodedata.normalize('NFD', 
      unicodedata.normalize('NFD', s1).casefold())

You have to normalize the string first, apply casefold and then normalize again because casefold can denormalize a string.
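Putting the fragment to work on two strings gives a complete comparison, along the lines of the example in the Python Unicode HOWTO. Treat this as a sketch rather than the definitive implementation; the test strings are chosen purely for illustration.

```python
import unicodedata

def compare_caseless(s1, s2):
    def nfd(s):
        return unicodedata.normalize("NFD", s)
    # normalize, casefold, then normalize again on both sides
    return nfd(nfd(s1).casefold()) == nfd(nfd(s2).casefold())

# decomposed lower case against composed upper case
print(compare_caseless("He\u0302llo World", "H\u00CALLO WORLD"))
# casefold turns the German sharp s into "ss"
print(compare_caseless("Straße", "STRASSE"))
```

Both calls display True, even though the strings differ in case, in length and in the way the accented character is built.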

In chapter but not in this extract

String Manipulation
String slicing
Joining and splitting
Finding and replacing

Summary

Python strings are sequences and support indexing, slicing and all of the standard sequence methods plus a lot of additional methods.
In Python string literals come in many forms and you can use single, double or triple quotes.
You can include special characters using \, the escape character.
Formatted strings are the preferred way to create string templates.
Python uses Unicode to represent character data in strings and this breaks the long-held belief that the number of characters in a string is the same as the number of bytes.
There are a range of ways of entering Unicode into a Python program, including using an appropriate keyboard and the \u escape.
Python source files are encoded using UTF-8 by default.
The ord and chr functions convert characters into code points and code points into characters respectively.
The use of UTF-8 and Unicode makes comparing strings more complicated because there is more than one way to create some symbols. To make comparison consistent there are a number of normal forms that can be applied.
String manipulation is a matter of using the methods available to break a string into the parts that you want and then reassemble it into the final result.
Python has many powerful string methods that make string manipulation easy.

Programmer's Python
Everything is Data

Is now available as a print book: Amazon

Python – A Lightning Tour
The Basic Data Type – Numbers
Extract: Bignum
Truthy & Falsey
Dates & Times
Extract Naive Dates
Sequences, Lists & Tuples
Extract Sequences
Strings
Extract Unicode Strings
Regular Expressions
Extract Simple Regular Expressions
The Dictionary
Extract The Dictionary
Iterables, Sets & Generators
Extract Iterables
Comprehensions
Extract Comprehensions
Data Structures & Collections
Extract Stacks, Queues and Deques
Extract Named Tuples and Counters
Bits & Bit Manipulation
Extract Bits and BigNum
Bytes
Extract Bytes And Strings
Extract Byte Manipulation
Binary Files
Extract Files and Paths ***NEW!!!
Text Files
Creating Custom Data Classes
Extract A Custom Data Class
Python and Native Code
Extract Native Code
Appendix I Python in Visual Studio Code
Appendix II C Programming Using Visual Studio Code

Last Updated ( Wednesday, 09 August 2023 )