Programmer's Python Data - Bytes And Strings
Written by Mike James
Monday, 09 January 2023
Page 2 of 2
Bytes As Strings

A bytes or bytearray object behaves much like an ASCII string, but there are some small but important differences between these objects and Unicode strings. The first is that, while you can use indexing and slicing, a single element of a bytes or bytearray object is an integer, not another bytes or bytearray object. A slice, however, is an object of the same type as the original. For example:

myBytes = b"\xFF\xFF\x55\xAA"
print(myBytes[2],myBytes[2:3])

displays:

85 b'U'

as the first is an integer and the second a bytes object.

You can also use most of the familiar string methods on a bytes or bytearray object, but of course you cannot mutate a bytes object, and even the string-like bytearray methods return new objects rather than work in place. The only other restriction is that you cannot pass a string to a bytes or bytearray method. For example, the string method find has its parallel in the bytes and bytearray method find:

find(subbytes,start,end)

searches the bytes for the specified subbytes between start and end and returns its position, or -1 if it isn’t found. The subbytes argument can be a bytes or bytearray object. For example:

myBytes = b"\xFF\xFF\x55\xAA"
print(myBytes.find(b"\x55"))

displays 2. If you only need to know if subbytes is present and not its location, you can use in, since bytes and bytearray objects are sequences.

As well as the standard string functions, there are some functions that are particularly useful for bytes. For example:

isascii()

returns True if all of the bytes are valid ASCII characters in the range 0x0 to 0x7F. Many other string functions also only make sense if the bytes in the sequence are all valid ASCII characters.

You can use the % format operator to create byte sequences in a similar way as for strings:

format % (value1, value2, ...)

where format is a bytes object and the tuple of values is used to substitute for format items. For example:

myBytes = b"Hello %b" % (b"World",)
print(myBytes)

displays:

b'Hello World'

as %b means copy the byte values byte-by-byte in the same way as %s for strings.
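The string-like behavior described above can be checked with a short sketch. The variable names are illustrative only and do not come from the text; the calls used are the standard bytes and bytearray methods.

```python
# Illustrative names; the data matches the example bytes used in the text.
data = bytearray(b"\xFF\xFF\x55\xAA")

element = data[2]        # indexing gives an int
piece = data[2:3]        # slicing gives a bytearray

position = data.find(b"\x55")    # index of the first match, or -1
present = b"\x55" in data        # membership test, no position needed
all_ascii = data.isascii()       # False here because 0xFF is above 0x7F

# String-like methods return a new object, even on a mutable bytearray.
text = bytearray(b"hello")
shouted = text.upper()

# Passing a str where bytes are expected raises TypeError.
try:
    data.find("U")
    str_rejected = False
except TypeError:
    str_rejected = True

# %b copies the bytes of the argument into the format.
greeting = b"Hello %b" % (b"World",)

print(element, piece, position, present, all_ascii)
print(shouted, text, str_rejected, greeting)
```

Note that the same statements work if data is created as a bytes object, except that the slice is then a bytes object rather than a bytearray.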
If there is only one value then you don’t have to put it into a tuple. You can also use named format items, which are replaced by the values specified in a dictionary in place of the tuple, for example:

myBytes = b"Hello %(word)b"% {b"word":b"World"}

The name in the format is used to look up the value in the dictionary. Notice that the key is a bytes object, not a string. Apart from this, it is no different from the way formatting works in Unicode strings. However, it is generally used to create custom responses to low-level protocols where a byte header needs to have additional information inserted. For byte sequences that are not ASCII characters this isn’t very useful.

One of the more useful string-like abilities is concatenation. If you have two or more bytes or bytearray objects then you can create a single bytes or bytearray object using the + operator. That is:

myBytes3 = myBytes1 + myBytes2

combines myBytes1 and myBytes2 to produce myBytes3, which contains all of the data. This is a good way to build up big complex byte sequences from smaller, simpler and perhaps repetitive byte sequences.

Decode Encode

Byte sequences are usually involved in low-level operations where the meaning of the values is user-defined. An exception to this is in string encoding. This idea was introduced in Chapter 6, but only in connection with the use of UTF-8 to represent any Unicode character. In the real world there are many different encodings that can be used to represent text, and byte sequences provide a good way of working with them. Unicode strings have an encode method and bytes and bytearray objects have a decode method, which allow you to convert between encodings:

string.encode(encoding="utf-8", errors="strict")
bytes.decode(encoding="utf-8", errors="strict")

The first parameter determines the encoding in use and the second controls how characters that fall outside of the encoding are treated. Both are optional and the defaults are utf-8 and strict.
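As a small check of the defaults, the following sketch compares a call with no arguments against one that spells out utf-8 and strict. The sample text and variable names are assumptions made for illustration.

```python
# Illustrative sample text: "café", which contains one non-ASCII character.
text = "caf\u00e9"

default_bytes = text.encode()                                    # defaults
explicit_bytes = text.encode(encoding="utf-8", errors="strict")  # spelled out

# decode has the same defaults, so a bare call reverses the encoding.
round_trip = default_bytes.decode()

print(default_bytes, explicit_bytes, round_trip)
```

Both calls produce the same bytes, so the arguments only need to be supplied when a different encoding or error policy is wanted.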
The string encode method takes a Unicode string and converts it to a bytes object using the encoding specified, for example:

myString="ABCD\u2211"
myBytes=myString.encode(encoding="utf-8")
print(myString)
print(myBytes)

When the string prints it displays:

ABCD∑

as the final character is Unicode 2211, which is a summation sign (see Chapter 6 for more detail). The call to encode converts the Unicode string into its UTF-8 representation as bytes. In this case nothing is lost, as UTF-8 can represent every Unicode character. The final print shows the bytes object in the same way as an ASCII string and displays:

b'ABCD\xe2\x88\x91'

The first four bytes are representable as ASCII characters, but the final character isn’t, so it displays as the three-byte sequence that represents it using UTF-8.

The decode method of bytes and bytearray does a similar job, but in the reverse direction. It takes a byte sequence in the specified encoding and returns a Unicode string with the same characters. For example, if we add to the previous example:

myString=myBytes.decode(encoding="utf-8")
print(myString)

then the print displays the original Unicode string:

ABCD∑

That is, the bytes in myBytes are interpreted as characters in a UTF-8 encoding and converted to the Unicode string.

To be clear: encode converts a Unicode string into bytes in the specified encoding, and decode converts bytes in the specified encoding into a Unicode string.
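A minimal sketch of the round trip between a Unicode string and its UTF-8 bytes follows. It reuses the example string from the text, but the variable names are illustrative, and the use of a bytearray for the decode step is simply to show that both byte types provide decode.

```python
original = "ABCD\u2211"              # five characters, the last is U+2211

encoded = original.encode("utf-8")   # str -> bytes
char_count = len(original)           # counts characters
byte_count = len(encoded)            # counts bytes; the sum sign needs three

# bytearray has the same decode method as bytes.
buffer = bytearray(encoded)
decoded = buffer.decode("utf-8")     # bytes -> str

print(char_count, byte_count, encoded, decoded == original)
```

The difference between the two lengths is a reminder that indexing a bytes object works in bytes, not characters.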
Of course, there is no guarantee that a character in one encoding can be represented in another. The errors parameter controls what happens when conversion isn’t possible. By default it is set to strict, which results in a UnicodeError exception (UnicodeEncodeError or UnicodeDecodeError) being raised. The most common alternatives are "ignore", which simply skips the unconvertible character, and "replace", which puts the Unicode character U+FFFD Replacement Character into a Unicode string and "?" into a bytes object.

For example, cp1252 is Code Page 1252, i.e. the Latin code page for Windows, and we can convert our Unicode string into it using:

myString="ABCD\u2211"
myBytes=myString.encode(encoding="cp1252",errors="replace")
print(myBytes)

which displays:

b'ABCD?'

The question mark at the end is because there is no equivalent to the summation sign in this particular code page and we have selected "replace" for error handling.

Code pages were how Windows managed an expanded range of characters before Unicode. Each code page defined a set of characters that character codes between 0 and 255 corresponded to. Essentially every string was an ASCII string, but how it was displayed depended on the code page the user or the program had selected. Python supports the Windows code pages and their corresponding standardized encodings.

As well as supporting legacy character encodings, you can also use encode/decode to convert between different Unicode encodings. For example:

myString="ABCD\u2211"
myBytes=myString.encode(encoding="utf-32",errors="replace")
print(myBytes)

displays, on a little-endian machine:

b'\xff\xfe\x00\x00A\x00\x00\x00B\x00\x00\x00C\x00\x00\x00D\x00\x00\x00\x11"\x00\x00'

which is the UTF-32 encoding, starting with a four-byte byte order mark. You can convert to UTF-16 just as easily.
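The three error settings can be compared side by side in a short sketch. The sample byte sequence for decoding and all of the variable names are assumptions made for illustration. The utf-16-le codec is used for the last step so that the byte order is fixed and no byte order mark is added.

```python
text = "ABCD\u2211"

# strict (the default) raises a UnicodeError subclass.
try:
    text.encode("cp1252")
    strict_error = None
except UnicodeError as exc:
    strict_error = type(exc).__name__

ignored = text.encode("cp1252", errors="ignore")     # character dropped
replaced = text.encode("cp1252", errors="replace")   # "?" substituted

# Decoding: 0xFF can never appear in valid UTF-8.
bad = b"AB\xffCD"
decoded_replace = bad.decode("utf-8", errors="replace")  # U+FFFD inserted
decoded_ignore = bad.decode("utf-8", errors="ignore")    # byte dropped

# UTF-16 with an explicit little-endian byte order and no BOM.
utf16 = text.encode("utf-16-le")

print(strict_error, ignored, replaced)
print(repr(decoded_replace), repr(decoded_ignore), utf16)
```

Catching UnicodeError covers both the encode and decode cases, as UnicodeEncodeError and UnicodeDecodeError are its subclasses.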
Last Updated ( Monday, 09 January 2023 ) |