Fundamental C - Simple Strings
Written by Harry Fairhead
Sunday, 08 December 2019
This extract, from my new book on programming C in an IoT context, explains the basics of the string. As was the case for arrays, this is not as simple as in other languages, precisely because it is so simple.

Fundamental C: Getting Closer To The Machine
Now available as a paperback and ebook from Amazon.
Also see the companion volume: Applying C
This is going to be a shock if you program in almost any other language: C doesn’t actually have strings in the sense that most languages do. All it has are char arrays and a few conventions on how these are used. It is still important to know how these work, and there are some extra features over and above a simple array.

Strings

In most other languages, strings are a special additional data structure managed by the runtime software. In C a string is just an array of chars. For example:

    char myString[10];

is an array of 10 chars or a string, depending on how you look at it. As already mentioned, a char is usually an eight-bit byte, but it doesn’t have to be. As in the case of a general array, myString is converted in most expressions to a pointer to the first element of the array. That’s all there is to it, and it is important to keep this in mind as you find out about the extra features that are generally associated with string use in C.

The problem with using strings in C is the same problem we have with finding the size of a more general array. Generally strings are processed character by character. So how do you know you have reached the end of a string? You could record the length of every string you use and pass it to any functions you use or create. However, this is not what C encourages you to do.

In C strings are “null-terminated”. That is, the last character of every legal string is a null, i.e. a zero byte. You can say that a char array isn’t a string unless it’s null-terminated, and this is the difference between the two. For example, in:

    char myArray[3]={'a','b','c'};
    char myString[4]={'a','b','c','\0'};

myArray is a standard char array and myString is a null-terminated char array, i.e. a string. Notice that single quotes are used for character literals and \0 is the escape sequence for a null char. Notice also that myString is one element longer than myArray, and this is the small overhead of using null-terminated strings.
The usual way of initializing a string is to use a string literal, which is signified using double quotes and supplies a null terminator by default:

    char myString[4]="abc";

or, more usually, we leave out the explicit size of the array:

    char myString[]="abc";

Notice that it is sometimes important to allocate more storage to a string than it actually uses so that you can lengthen it using string operations. For example:

    char myString[10]="abc";

is a null-terminated string with three chars followed by a null. The remaining six elements are available to extend the string should it later be required.

Notice that you can only use a string literal in this way to initialize a char array. Unlike many other languages, you cannot write:

    myString="def";

myString is an array of chars, and assigning to it in this way doesn’t make any sense. To assign to a string you need to use the built-in strcpy function or similar (see later). Assigning a string to a char pointer only copies the pointer to the start of the string, not the characters, and this applies to assigning a literal as well.

You can use escape characters within the string literal:
This also brings us to the question of what character encoding is in use. As already explained in Chapter 4, the data type char is the smallest of the integer types. It is called char because traditionally it was used to store single-byte ASCII codes representing characters. Today we have additional problems in that text is represented by Unicode, of which ASCII is only a tiny subset.

C99 supports Unicode through the wide character type, wchar_t, but the standard does not specify its size or encoding. C11 introduced a completely new way of working with Unicode (the char16_t and char32_t types and u8 string literals), and again this has not proved to be popular. Some operating system functions require the use of wide character types using UTF-16 encoding, and in this case you have little choice but to look up the documentation and convert strings to UTF-16.

To work with Unicode strings the simplest thing to do is adopt UTF-8 encoding and put up with the fact that sometimes a character needs more than one byte to represent it. Most of the string functions will work with UTF-8 without modification, with the proviso that a character might correspond to more than one element of the string. For example, if you use a function to find the length of the string you will get the number of bytes used, but this may be more than the number of characters, as some characters need two, three, or four bytes to store. You can consider ASCII to be the characters that can be represented as a single byte in UTF-8.

Working with Unicode in general is a tricky subject and not a core concern for most IoT and systems programs, which have limited user interfaces. For the remainder of this book strings will be treated as ASCII or a UTF-8 subset.

Last Updated ( Monday, 09 December 2019 )