Programmer's Python Data - Bytes And Strings

Written by Mike James

Monday, 09 January 2023

Bytes As Strings

A bytes or bytearray object behaves much like an ASCII string, but there are some small but important differences between it and a Unicode string. The first is that, while you can use indexing and slicing, a single element is an integer, not another bytes or bytearray object. A slice, however, is an object of the same type as the original. For example:

myBytes = b"\xFF\xFF\x55\xAA"
print(myBytes[2],myBytes[2:3])

displays 85 b'U' as the first is an integer and the second a bytes object. You can also use most of the familiar string methods on a bytes or bytearray object, but of course you cannot mutate a bytes object, and even the string-like bytearray methods return new objects rather than working in place. The only other restriction is that you cannot pass a string to a bytes or bytearray method; the arguments have to be bytes-like objects. For example, the string method find has its parallel in the bytes and bytearray method find:

find(subbytes,start,end)

searches the bytes for the specified subbytes between start and end and returns the position of the first match, or -1 if it isn’t found. The subbytes can be a bytes or bytearray object, or a single integer in the range 0 to 255.

For example:

myBytes = b"\xFF\xFF\x55\xAA"
print(myBytes.find(b"\x55"))

displays 2. If you only need to know whether subbytes is present and not its location, you can use the in operator, since bytes and bytearray objects are sequences.
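The points made so far can be checked with a short sketch. The variable names are illustrative only and are not part of the original example:

```python
data = bytearray(b"\xFF\xFF\x55\xAA")

# Indexing gives an int, slicing gives a new object of the same type.
element = data[2]
piece = data[2:3]

# The in operator accepts a bytes-like object or an int in the range 0..255.
has_bytes = b"\x55" in data
has_int = 0x55 in data

# find returns -1 when the subsequence is absent.
missing = data.find(b"\x00")

# String-like methods return a new object, even for a mutable bytearray.
upper = data.upper()
same_object = upper is data

# Passing a str where a bytes-like object is expected raises TypeError.
try:
    data.find("U")
    str_rejected = False
except TypeError:
    str_rejected = True

print(element, piece, has_bytes, has_int, missing, same_object, str_rejected)
```

The last test is the one most likely to catch you out when porting string-handling code to bytes: every literal passed to a method needs the leading b.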

As well as the standard string methods there are some that are particularly useful for bytes. For example:

isascii()

returns True if all of the bytes are valid ASCII characters in the range 0x0 to 0x7F. Many other string methods also only make sense if the bytes in the sequence are all valid ASCII characters.
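As a small illustration, assuming nothing beyond the standard bytes methods, isascii can be used as a guard before applying a method such as upper, which only changes bytes that are ASCII letters:

```python
ascii_bytes = b"Hello"
binary_bytes = b"\xFFabc"

ascii_ok = ascii_bytes.isascii()
binary_ok = binary_bytes.isascii()

# upper only affects ASCII letters; the 0xFF byte passes through unchanged.
uppered = binary_bytes.upper()

print(ascii_ok, binary_ok, uppered)
```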

You can use the % format operator to create byte sequences in much the same way as for strings:

format % (value1, value2)...

where format is a bytes object and the tuple of values is used to substitute for the format items. For example:

myBytes = b"Hello %b" % (b"World",)
print(myBytes)

displays b'Hello World' as %b means copy the byte values byte-by-byte in the same way as %s for strings. If there is only one value then you don’t have to put it into a tuple.
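A brief sketch of a few variations follows. The numeric format codes %d and %X are standard for bytes formatting, but the particular values used here are just for illustration:

```python
# A single value does not need to be wrapped in a tuple.
single = b"Hello %b" % b"World"

# Numeric codes such as %d and %X produce ASCII digits.
pair = b"%b=%d" % (b"count", 42)
hexed = b"0x%02X" % 255

# %b needs a bytes-like object; a str is rejected.
try:
    b"Hello %b" % "World"
    str_rejected = False
except TypeError:
    str_rejected = True

print(single, pair, hexed, str_rejected)
```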

You can also use named format items which are replaced by the values specified in a dictionary in place of the tuple, for example:

myBytes = b"Hello %(word)b"% {b"word":b"World"}

The name in the format is used to look up the value in the dictionary, and notice that the key has to be a bytes object rather than a string. Apart from this, it is no different from the way formatting works in Unicode strings. However, it is generally used to create custom responses to low-level protocols where a byte header needs to have additional information inserted. For byte sequences that are not ASCII characters this isn’t very useful.
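To make the protocol idea concrete, here is a minimal sketch. The header layout, the field names length and kind, and the payload are all invented for the example and don't correspond to any real protocol:

```python
# Hypothetical text header followed by a binary payload.
template = b"LEN:%(length)d TYPE:%(kind)b\r\n"
payload = b"\x01\x02\x03"

# Dictionary keys must be bytes to match the names in the bytes format.
header = template % {b"length": len(payload), b"kind": b"DATA"}
message = header + payload

print(header)
print(message)
```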

One of the more useful string-like abilities is concatenation. If you have two or more bytes or bytearray objects then you can create a single bytes or bytearray object using the + operator. That is:

myBytes3 = myBytes1 + myBytes2

combines myBytes1 and myBytes2 to produce myBytes3, which contains all of the data. This is a good way to build up big, complex byte sequences from smaller, simpler and perhaps repetitive byte sequences.
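The following sketch shows concatenation in use, together with the repetition operator and join, which are also inherited from sequences and strings. The frame contents are arbitrary example values:

```python
header = b"\xAA\x55"
padding = b"\x00" * 4            # repetition builds a run of identical bytes
body = bytearray(b"\x01\x02")

# When bytes and bytearray are mixed, the left operand decides the result type.
frame = header + padding + body
mutable_frame = body + header

# join is an alternative when there are many pieces to combine.
joined = b"".join([header, padding, body])

print(type(frame).__name__, frame)
print(type(mutable_frame).__name__, mutable_frame)
```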

Decode Encode

Byte sequences are usually involved in low-level operations where the meaning of the values is user-defined. An exception to this is in string encoding. This idea was introduced in Chapter 6, but only in connection with the use of UTF-8 to represent any Unicode character. In the real world there are many different encodings that can be used to represent text and byte sequences provide a good way of working with them.

Unicode strings have an encode method, and bytes and bytearray objects have a decode method, which together allow you to convert between text and its encoded forms:

string.encode(encoding=, errors=)
bytes.decode(encoding=, errors=)

The first parameter determines the encoding in use and the second controls how characters that cannot be converted are treated. Both are optional and the defaults are utf-8 and strict.

The string encode method takes a Unicode string and converts it to a bytes object using the encoding specified, for example:

myString="ABCD\u2211"
myBytes=myString.encode(encoding="utf-8")
print(myString)
print(myBytes)

The literal creates a Unicode string and when it prints it displays:

ABCD∑

as the final character is Unicode 2211, which is a summation sign (see Chapter 6 for more detail). The call to encode converts the Unicode string into its UTF-8 representation as bytes. How Python stores a string internally is an implementation detail and is not necessarily UTF-8, so encode always builds a new bytes object in the requested encoding. This is treated as an ASCII string by the final print, which displays:

b'ABCD\xe2\x88\x91'

The first four bytes are representable as ASCII characters, but the final character isn’t, so it displays as escape codes for the three-byte sequence that represents it in UTF-8.

The decode method of bytes and bytearray does a similar job, but in the reverse direction. It takes a byte sequence, interprets it using the specified encoding and returns the corresponding Unicode string. For example, if we add to the previous example:

myString=myBytes.decode(encoding="utf-8")
print(myString)

then the print displays the original Unicode string:

ABCD∑

That is, the bytes in myBytes are interpreted as characters in a UTF-8 encoding and converted to the Unicode string. The encoding you specify has to be the one that was used to create the bytes, otherwise you either get an exception or the wrong characters.

To be clear:

encode takes a Unicode string and converts it into a byte sequence using the specified encoding.
decode takes a byte sequence and converts it into a Unicode string using the specified encoding.
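The two rules above can be illustrated with a round trip through several encodings. This is only a sketch; the particular list of encodings is an arbitrary choice:

```python
text = "ABCD\u2211"

# Encode and then decode with the same encoding to get the original back.
sizes = {}
for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    encoded = text.encode(encoding)
    decoded = encoded.decode(encoding)
    sizes[encoding] = len(encoded)
    print(encoding, len(encoded), decoded == text)

# Decoding with a different encoding gives different characters.
mismatch = text.encode("utf-8").decode("cp1252")
print(mismatch == text)
```

The same five characters occupy a different number of bytes in each encoding, which is why a byte sequence on its own can't tell you what text it represents.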

Of course, there is no guarantee that a character in one encoding can be represented in another. The errors parameter controls what happens when conversion isn’t possible. By default it is set to strict, which results in a UnicodeError exception being raised, specifically a UnicodeEncodeError from encode or a UnicodeDecodeError from decode. The most common alternatives are “ignore”, which simply skips the unconvertible character, and “replace”, which puts the Unicode character U+FFFD Replacement Character into a decoded Unicode string and “?” into an encoded bytes object. For example, cp1252 is Code Page 1252, i.e. the Latin code page for Windows, and we can convert our Unicode string into it using:

myString="ABCD\u2211"
myBytes=myString.encode(encoding="cp1252",errors="replace")
print(myBytes)

which displays:

b'ABCD?'

The question mark at the end is because there is no equivalent to the summation sign in this particular code page and we have selected “replace” for error handling.
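The three error handlers can be compared side by side. In this sketch the byte 0xFF is used as the undecodable input because it can never occur in valid UTF-8; the variable names are illustrative:

```python
text = "ABCD\u2211"

# strict is the default and raises a subclass of UnicodeError.
try:
    text.encode("cp1252")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

ignored = text.encode("cp1252", errors="ignore")
replaced = text.encode("cp1252", errors="replace")

# When decoding, replace inserts U+FFFD instead of a question mark.
raw = b"ABCD\xff"
decoded = raw.decode("utf-8", errors="replace")

print(strict_failed, ignored, replaced, repr(decoded))
```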

Code pages were how Windows managed an expanded range of characters before Unicode. Each code page defined the set of characters that the character codes between 0 and 255 corresponded to. Essentially every string was an ASCII string, but how it was displayed depended on the code page the user or the program had selected. Python supports the Windows code pages and their corresponding ANSI standardization.

As well as supporting legacy character encodings, you can also use encode/decode to convert between different Unicode encodings. For example:

myString="ABCD\u2211"
myBytes=myString.encode(encoding="utf-32",errors="replace")
print(myBytes)

displays:

b'\xff\xfe\x00\x00A\x00\x00\x00B\x00\x00\x00C\x00\x00\x00D\x00\x00\x00\x11"\x00\x00'

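which is the UTF-32 encoding. The first four bytes are a byte order mark and the byte order shown is what you get on a little-endian machine. You can convert to UTF-16 just as easily.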
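As a sketch of the UTF-16 case, the plain utf-16 codec adds a byte order mark and uses the machine's native byte order, whereas the utf-16-le and utf-16-be variants fix the order and omit the mark:

```python
text = "ABCD\u2211"

# Two-byte byte order mark followed by two bytes per character.
utf16 = text.encode("utf-16")

# Explicit byte order, no byte order mark.
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")

print(utf16)
print(utf16_le)
print(utf16_be)
print(utf16.decode("utf-16") == text)
```

Using an explicit byte order is the safer choice when the bytes are to be sent to another machine or written into a fixed file format.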

In chapter but not in this extract

Byte Manipulation
Multibyte Shifts
One-Time Pad
The Array
Memoryview

Summary

Working with bit patterns is fundamental, but you generally have to work with bytes or some other larger unit of storage.
Working with a byte sequence is possible using the bytes object which is immutable or a bytearray which is mutable.
Both the bytes and bytearray objects can be thought of as ASCII strings and have many of the same methods as strings.
A bytes literal is distinguished from a string by a leading b and contains ASCII characters and escape codes for values above 127.
You can also create bytes objects and bytearrays using an iterable that provides integers in the correct range.
The encode method takes a Unicode string and converts it into a byte sequence using the specified encoding.
The decode method takes a byte sequence and converts it into a Unicode string using the specified encoding.
When trying to manipulate a byte sequence you can opt to convert it to a bignum and then use bitwise operators or you can work byte-by-byte in a for loop.
When working with bytes in groups it matters which order you take them in – big endian takes the most significant byte first and little endian takes the least significant byte first.
Multibyte shifts are difficult to implement because of the way the sign bit has to be treated.
Python has a basic array type in the array module. This supports arrays of basic C types.
The memoryview class provides a view into the buffer of any object that supports the buffer protocol.
A memoryview doesn’t make a copy of the original buffer – it simply provides access.
The object that the buffer belongs to can set the type and shape of the buffer in an attempt to make it easier for you to use.
If the object doesn’t set the type and shape of the buffer you can use the cast method to change or set it.
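Several of the summary points can be seen together in a short sketch. The byte values are arbitrary, and the code is an illustration of the standard library calls rather than the fuller treatment given in the rest of the chapter:

```python
# A bytes object created from an iterable of integers in the range 0..255.
data = bytes([0x12, 0x34, 0x56, 0x78])

# The same four bytes give different integers depending on byte order.
big = int.from_bytes(data, byteorder="big")
little = int.from_bytes(data, byteorder="little")

# Convert to a bignum, shift, mask back to 32 bits and convert back to bytes.
shifted = (big << 4) & 0xFFFFFFFF
back = shifted.to_bytes(4, byteorder="big")

# A memoryview gives access to the buffer without copying it.
buffer = bytearray(data)
view = memoryview(buffer)
view[0] = 0xFF                   # writes through to buffer
words = view.cast("H")           # unsigned 16-bit items, native byte order

print(hex(big), hex(little), back, buffer, len(words))
```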

Programmer's Python
Everything is Data

Is now available as a print book: Amazon

Python – A Lightning Tour
The Basic Data Type – Numbers
Extract: Bignum
Truthy & Falsey
Dates & Times
Extract Naive Dates
Sequences, Lists & Tuples
Extract Sequences
Strings
Extract Unicode Strings
Regular Expressions
Extract Simple Regular Expressions
The Dictionary
Extract The Dictionary
Iterables, Sets & Generators
Extract Iterables
Comprehensions
Extract Comprehensions
Data Structures & Collections
Extract Stacks, Queues and Deques
Extract Named Tuples and Counters
Bits & Bit Manipulation
Extract Bits and BigNum
Bytes
Extract Bytes And Strings
Extract Byte Manipulation
Binary Files
Extract Files and Paths
Text Files
Extract Text Files & CSV
Creating Custom Data Classes
Extract A Custom Data Class
Python and Native Code
Extract Native Code
Appendix I Python in Visual Studio Code
Appendix II C Programming Using Visual Studio Code


Last Updated ( Monday, 09 January 2023 )