Programmer's Python Data - Unicode Strings
Written by Mike James
Tuesday, 08 August 2023
Page 3 of 3
Comparing Strings

There is one exception to the rule that you can mostly ignore the fact that Python strings are Unicode strings: comparing strings. In practice, comparing strings has always been harder than it might first appear. For example, are two strings that differ only in case the same or different? Are strings that look the same always the same? As we are using Unicode, the answer is no. The reason is that Unicode allows you to create characters by combining multiple code points, and this means that there is often more than one way to create a character. For example, ê, that is e with a circumflex accent, is code point \u00EA, but you can also create it using \u0065 \u0302, which is a standard e followed by the Combining Circumflex Accent code point, which places a circumflex over the preceding character.

Strings like these, which look the same but correspond to different code sequences, are often loosely called homoglyphs (Unicode's own term for this particular case is canonical equivalence) and their existence makes string comparison difficult. For example, consider:

    a = "He\u0302llo World"
    b = "H\u00EAllo World"
    print(a)
    print(b)
    print(a == b)

displays:

    Hêllo World
    Hêllo World
    False

The two strings look the same, but they aren’t equal. You might want to accept this as reasonable and simply insist that there is only one way to encode a particular string in a particular character set. If you want to allow homoglyphs to be treated as equal then you need to use the normalize function in the unicodedata module, which puts a string into a standard form. Unicode defines four normal forms:

    NFD   Normalization Form Canonical Decomposition
    NFC   Normalization Form Canonical Composition
    NFKD  Normalization Form Compatibility Decomposition
    NFKC  Normalization Form Compatibility Composition
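As a rough sketch of how the four normal forms differ, the following applies each of them to a two-character string and lists the resulting code points. The helper name normal_forms and the sample string are illustrative assumptions, not part of the original text; only unicodedata.normalize comes from the standard library.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def normal_forms(text):
    """Return a dict mapping each normal form name to the normalized text.

    Hypothetical helper, used here only for illustration.
    """
    return {form: unicodedata.normalize(form, text) for form in FORMS}

# Sample: precomposed e-circumflex followed by the vulgar fraction one quarter.
sample = "\u00EA\u00BC"

for form, result in normal_forms(sample).items():
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in result)
    print(form, len(result), code_points)

# NFC 2 U+00EA U+00BC
# NFD 3 U+0065 U+0302 U+00BC
# NFKC 4 U+00EA U+0031 U+2044 U+0034
# NFKD 5 U+0065 U+0302 U+0031 U+2044 U+0034
```

Only the two compatibility forms touch the fraction, and only the two decomposed forms split the accented letter.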
The most commonly used is NFC. To understand what these alternative standard forms are all about we need to look at some terms. The first thing is that “composed” means that base characters and combining marks are merged into a single character wherever possible. Similarly, “decomposed” means that characters are split into component parts if possible. So, for example, \u0065 \u0302 is decomposed and \u00EA is the composed equivalent. A “canonical” representation is one where characters that look the same are represented in the same way, either composed or decomposed. A “compatibility” representation goes further and converts characters that mean the same thing but don’t necessarily look the same into the same characters. For example, ¼ and 1⁄4 (the digits 1 and 4 separated by the fraction slash \u2044) are compatible but not canonical. You can see that a compatible representation is starting to introduce meaning into what is treated as equivalent.

If you want to use a more sophisticated approach to string matching you have to convert all strings into a standard form. For example:

    import unicodedata

    a = "He\u0302llo World"
    b = "H\u00EAllo World"
    print(unicodedata.normalize("NFC", a) == b)
    print(a == unicodedata.normalize("NFD", b))

The string a is in decomposed format and b is in composed format. The first print converts a to composed form and so it is equal to b and the result is True. The second print converts b to decomposed form and so it is equal to a and the result is True. In most cases you would simply convert all strings to one of the forms and do a comparison after this. For example:

    print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))

results in True because both a and b are converted to composed form.

There really isn’t any “best practice” when it comes to string comparison because what you want to consider as equal strings depends on the application and the locales it is intended to work in. If you understand the way normalize works you should be able to select the correct approach.
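One way to package the "normalize both sides, then compare" idea is a small helper. This is a minimal sketch; the function name same_text and its form parameter are assumptions made for illustration, and the choice of NFC as the default simply follows the remark that it is the most commonly used form.

```python
import unicodedata

def same_text(s1, s2, form="NFC"):
    """Return True if s1 and s2 are equal after both are normalized to form.

    Hypothetical helper; form must be "NFC", "NFD", "NFKC" or "NFKD".
    """
    return unicodedata.normalize(form, s1) == unicodedata.normalize(form, s2)

a = "He\u0302llo World"  # decomposed: e followed by combining circumflex
b = "H\u00EAllo World"   # composed: single e-circumflex code point

print(a == b)            # False
print(same_text(a, b))   # True

# A canonical form keeps the fraction distinct from its spelled-out version;
# a compatibility form treats them as the same.
quarter = "\u00BC"
spelled = "1\u20444"     # 1, fraction slash, 4
print(same_text(quarter, spelled))               # False
print(same_text(quarter, spelled, form="NFKC"))  # True
```

Whether the compatibility forms are appropriate depends on the application, since they deliberately discard distinctions that a canonical form preserves.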
There is also a slight complication due to the way upper and lower case work in Unicode. Some characters become two characters when their case is converted; for example, the German ß becomes SS in upper case. The casefold string method will convert a string into a case insensitive form of lower case. The documentation gives the following way of preparing a string for a case insensitive comparison:

    unicodedata.normalize('NFD', unicodedata.normalize('NFD', s1).casefold())

You apply the same expression to the second string and compare the two results. You have to first normalize the string, convert to case insensitive form and then normalize again because casefold can denormalize a string.
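The normalize, casefold, normalize sequence can be wrapped up as a comparison function. The following is a sketch along the lines of the expression quoted from the documentation; the names caseless_equal and nfd are chosen here for illustration and are not taken from the text.

```python
import unicodedata

def caseless_equal(s1, s2):
    """Compare two strings ignoring case and canonical differences.

    Illustrative helper: normalize to NFD, casefold, then normalize
    again because casefold can leave a string denormalized.
    """
    def nfd(s):
        return unicodedata.normalize("NFD", s)

    return nfd(nfd(s1).casefold()) == nfd(nfd(s2).casefold())

# casefold maps the German sharp s to "ss", which lower() does not do.
print("Straße".lower() == "STRASSE".lower())   # False
print(caseless_equal("Straße", "STRASSE"))     # True

# Composed lower case against decomposed upper case.
print(caseless_equal("H\u00EAllo", "HE\u0302LLO"))  # True
```

Using NFD on both sides means that composed and decomposed spellings are reduced to the same sequence before and after the case folding step.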
Last Updated (Wednesday, 09 August 2023)