Programmer's Python Data

Programmer's Python Data - Native Code

Written by Mike James

Monday, 20 March 2023


Unicode Strings

Everything works reasonably well until we come to the problem of Unicode strings. Paradoxically, the fact that C doesn’t really support any encoding other than ASCII both complicates and simplifies matters. The reason is that C treats all strings as byte sequences and it is up to the program to apply an interpretation via an encoding. That is, there is nothing stopping you from writing a function which passes a UTF-8 byte sequence as if it were an ASCII string, i.e. as an array of char. The only downside is that the C string handling functions do not understand the UTF-8 encoding, and the number of bytes can be greater than the number of characters in the string. It is important to keep in mind that all the data sent to and from a C program is in the form of a sequence of bytes, and what it means is up to you.
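To make the difference between bytes and characters concrete, here is a minimal sketch that encodes a Python string as UTF-8 and places it in a char buffer. The sample text and the variable names are illustrative assumptions, not part of the original program:

```python
import ctypes

# "naïve café" written with escapes so the code points are unambiguous
text = "na\u00efve caf\u00e9"
utf8 = text.encode("utf-8")    # the byte sequence a C char array would hold

# A char buffer holding the bytes plus the terminating null
buf = ctypes.create_string_buffer(utf8)

print(len(text))    # 10 characters
print(len(utf8))    # 12 bytes: the two accented letters take two bytes each
print(len(buf))     # 13: the bytes plus the null terminator
print(buf.value.decode("utf-8") == text)
```

A C function such as strlen applied to this buffer would report the byte count, not the character count.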

The only concession to Unicode that C makes is the wchar_t data type. This is just a “bigger” data type than char, but how big depends on the system it is running on. Under Windows wchar_t is 16 bits and under Linux/Unix it is usually 32 bits, but it can also be 16 bits. There are also corresponding new functions to process the wider characters, but these generally don’t assume any encoding. In general the encoding used with wchar_t isn’t fixed, although when it is 16 bits UTF-16 makes sense and when it is 32 bits UTF-32 makes sense. Windows in particular uses UTF-16 as its standard encoding.
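You can check how wide wchar_t is on the machine you are using without writing any C. This small sketch simply asks ctypes for the sizes; the variable names are illustrative:

```python
import ctypes

char_bytes = ctypes.sizeof(ctypes.c_char)     # a char is always one byte
wchar_bytes = ctypes.sizeof(ctypes.c_wchar)   # typically 2 on Windows, 4 on Linux

print(f"char:    {char_bytes * 8} bits")
print(f"wchar_t: {wchar_bytes * 8} bits")
```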

The ctypes module has the c_wchar and c_wchar_p types to work with wchar_t and wchar_t * (pointer) types. It also has create_unicode_buffer, which works in the same way as create_string_buffer but allocates more space. How much more space depends on the system in use. You can also use a Python str to initialize the wider types. For example, the C function:

__declspec(dllexport) int countStringW(wchar_t[]);
int countStringW(wchar_t myString[])
{
    int temp = 0;
    for (int i = 0; i < 100; i++)
    {
        if (myString[i] == 0)
            break;
        temp = temp + 1;
    }
    return temp;
}

is the wide character equivalent of the previous character-counting function. We can call this from Windows or Linux using:

count=lib.countStringW(ctypes.c_wchar_p("Hello World"))
print(count)

It works in both cases, even though Windows uses a 16-bit wchar_t and Linux uses a 32-bit one. It also works if you go beyond the usual ASCII character set. The ctypes c_wchar_p type marshals the Python Unicode string into a UTF-16 string for the Windows version and into a UTF-32 string for the Linux version.
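One way to see the marshaling for yourself is to look at the raw memory of a wide character buffer and compare it with what Python's own codecs produce. This is a sketch under the assumption that the test string stays within the Basic Multilingual Plane; the sample text and variable names are illustrative:

```python
import ctypes
import sys

text = "H\u00e9llo \u20ac"    # "Héllo €": non-ASCII, but all within the BMP

buf = ctypes.create_unicode_buffer(text)
wchar_bytes = ctypes.sizeof(ctypes.c_wchar)

# Raw memory of the buffer, with the terminating null wchar_t sliced off
raw = ctypes.string_at(ctypes.addressof(buf), ctypes.sizeof(buf))[:-wchar_bytes]

# Choose the codec that matches this platform's wchar_t width and byte order
endian = "le" if sys.byteorder == "little" else "be"
codec = f"utf-{wchar_bytes * 8}-{endian}"

print(codec)
print(raw == text.encode(codec))
```

On a system with a 16-bit wchar_t the codec selected is a UTF-16 variant, and with a 32-bit wchar_t it is a UTF-32 variant.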

Notice also that the name of the function countStringW follows the Windows naming convention. Most Windows string functions come in two sorts, one ending in A for ASCII/ANSI and one ending in W for wide. You generally have to select which one you are going to use and ensure that you send the correct type of string to the function.

If you are using C functions that have been written by someone else, such as system functions, then you have no choice but to work with whatever string types the functions use: char, wchar_t or even something else. In this case all that matters is that you send a byte sequence that encodes the data as the function needs it. For example, if the function demands a byte array encoded using UTF-8 then you have to take the Python string and convert it to a UTF-8 bytes object and pass this as a byte array rather than a string.
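The conversion to bytes has to be explicit, because ctypes will not guess an encoding for a char pointer. A minimal sketch, with an illustrative sample string and variable names:

```python
import ctypes

text = "Gr\u00fc\u00dfe"    # "Grüße"

# A str is rejected where a char * is expected...
try:
    ctypes.c_char_p(text)
    rejected = False
except TypeError:
    rejected = True

# ...so encode it to the byte sequence the function needs
utf8_arg = ctypes.c_char_p(text.encode("utf-8"))

print(rejected)
print(utf8_arg.value)                           # the encoded bytes
print(utf8_arg.value.decode("utf-8") == text)   # decoding recovers the string
```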

If you are creating the C functions yourself then you can choose to use either UTF-8 as a byte array or UTF-16 or UTF-32 as a wchar array. Clearly, having to support both UTF-16 and UTF-32 is extra work, but so is supporting UTF-8. There is no easy solution if you are implementing a large set of functions.

Structures

We encountered the idea of a structure, or struct, in Chapter 13, where it was compared to a record. There we used the struct module to implement a Python equivalent of a struct suitable for data exchange, and you can use the struct module to create byte sequences that can be used as parameters to functions. However, ctypes provides a Structure type that can be used to implement structs in another way that is much more suitable for passing to a C function. This works much like the array type covered in the previous section.

You create a struct by implementing a class that inherits from ctypes.Structure. Its fields are defined by assigning a list of (name, type) tuples to the class attribute _fields_. The types have to be valid ctypes or be derived from ctypes. For example, a C struct to store some Person data might be something like:

struct Person{
    char name[25];
    int id;
    float score;
};

Notice that the name field is an array of 25 characters including the null byte that marks the end. A function that processes the struct might be something like:

__declspec(dllexport) float updateScore(struct Person);
float updateScore(struct Person p)
{
    p.score=p.score+1.0;
    return p.score;
}

This simply adds one to the score field and returns it as a result. Notice that C structs are passed by value – that is, the entire Python ctypes buffer is copied to the C Person struct and the function operates on this copy, which means that any changes it makes have no effect on the Python structure.

A Python program to call the C function is:

class Person(ctypes.Structure):
    _fields_= [("name",ctypes.c_char*25),
               ("id",ctypes.c_int),
               ("score",ctypes.c_float)]
me=Person(b"Mike",42,3.4)
lib.updateScore.restype=ctypes.c_float
s=lib.updateScore(me)
print(me.name)
print(s,me.score)

The definition of the Person class follows the C definition closely. Notice that the name field is not a pointer but an array of 25 c_char types. Also notice that the initializer is quite happy to accept b"Mike" and will convert it to a suitable array of 25 c_char types. The results are approximately 4.4 and 3.4 (c_float is single precision, so the printed values show a small rounding error) because, as already mentioned, the structure is passed by value and so the Python data is unchanged by anything that the C function does.
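You can explore the layout and the copy semantics without the C library. The following sketch repeats the Person class and uses from_buffer_copy to imitate what a by-value call does. Using from_buffer_copy this way is an assumption made for illustration; it is not how ctypes performs the call internally:

```python
import ctypes

class Person(ctypes.Structure):
    _fields_ = [("name", ctypes.c_char * 25),
                ("id", ctypes.c_int),
                ("score", ctypes.c_float)]

me = Person(b"Mike", 42, 3.4)

print(me.name)                  # b'Mike': reading stops at the first null byte
print(ctypes.sizeof(Person))    # total size, including any alignment padding
print(Person.id.offset)         # where id starts after the 25-byte array

# Imitate pass-by-value: the copy has its own memory
copy = Person.from_buffer_copy(me)
copy.score += 1.0
print(round(copy.score, 1), round(me.score, 1))   # the original is unchanged
```

The size reported is usually larger than the sum of the field sizes because the compiler, and ctypes, pad the struct so that the int field is correctly aligned.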

If you want to pass a structure with a field that can be modified, you need to pass a pointer as the field. For example, the C function needs to be changed to work with a pointer to name:

struct Person{
    char *name;
    int  id;
    float score;
};
__declspec(dllexport) float updateScore(struct Person);
float updateScore(struct Person p)
{
    p.name[0]='m';
    p.score=p.score+1.0;
    return p.score;
}

The function now changes the first character of the name to a lower case m.

The Python program also has to be modified to work with a pointer:

class Person(ctypes.Structure):
    _fields_= [("name",ctypes.c_char_p),
               ("id",ctypes.c_int),
               ("score",ctypes.c_float)]
me=Person(b"Mike",42,3.4)
lib.updateScore.restype=ctypes.c_float
s=lib.updateScore(me)
print(me.name)
print(s,me.score)

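Notice that now the Python string isn’t copied to the C function. Instead a pointer to it is passed as part of the structure. The result is that the upper case M is changed to a lower case m. Be aware that this changes the memory of a Python bytes object, which Python treats as immutable. For data that a C function is going to modify, a mutable buffer created with create_string_buffer is the safer choice.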
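The difference between copying the struct and sharing what its pointer refers to can be sketched in pure Python. This is a variation on the program above, not the same code: it assumes a POINTER(c_char) field in place of c_char_p so that Python can index through the pointer, it stores the name in a mutable buffer, and it uses from_buffer_copy to stand in for the by-value call:

```python
import ctypes

class Person(ctypes.Structure):
    # Variation: POINTER(c_char) rather than c_char_p, so the sketch
    # can write through the pointer from Python
    _fields_ = [("name", ctypes.POINTER(ctypes.c_char)),
                ("id", ctypes.c_int),
                ("score", ctypes.c_float)]

# Mutable storage for the name, owned by Python
name_buf = ctypes.create_string_buffer(b"Mike")
me = Person(ctypes.cast(name_buf, ctypes.POINTER(ctypes.c_char)), 42, 3.4)

# Imitate pass-by-value: the struct's bytes, pointer included, are copied
copy = Person.from_buffer_copy(me)
copy.name[0] = b"m"    # writes through the shared pointer
copy.score += 1.0      # changes only the copy

print(name_buf.value)        # b'mike'
print(round(me.score, 1))    # 3.4
```

The score in the original is untouched, but the name has changed because both structs hold the same address.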

Last Updated ( Wednesday, 22 March 2023 )