Getting Started With jQuery - Advanced Ajax Characters & Encoding
Written by Ian Elliot
Tuesday, 20 June 2017
Page 4 of 4
Round Trips
The fact that the data sent to the server is always UTF-8 affects the data that the server sends back to the browser. We have already examined this in terms of using get to retrieve a file, but in the case of a post there is an additional consideration. The data transfer is two-way, and there are two sources of data that can be sent back to the client: the data that the client sent and the data that the server retrieves or generates.
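As a rough sketch of what "always UTF-8" means on the wire, the following JavaScript serialises a single form field and shows the bytes involved. The field name test and the value Ł are illustrative assumptions, and the standard URLSearchParams class stands in for whatever serialisation the Ajax call actually performs.

```javascript
// Hypothetical field name and value, chosen for illustration only.
const fieldName = "test";
const fieldValue = "\u0141"; // Ł (LATIN CAPITAL LETTER L WITH STROKE)

// Form-style serialisation: non-ASCII characters are percent-encoded
// as their UTF-8 bytes.
const postBody = new URLSearchParams({ [fieldName]: fieldValue }).toString();

// The same two bytes, seen directly.
const utf8Bytes = new TextEncoder().encode(fieldValue);
const hex = Array.from(utf8Bytes, (b) => "0x" + b.toString(16).toUpperCase());

console.log(postBody);      // test=%C5%81
console.log(hex.join(" ")); // 0xC5 0x81
```

Whatever encoding the page or the server prefers, the character reaches the server as the two bytes 0xC5 0x81.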
Data generated by the server is very varied: it can come from a file, from a database or from a program written in a language such as PHP. In the case of PHP the system is complicated, but very flexible when used with Apache. In PHP a string is a sequence of bytes and PHP makes no attempt to change its encoding. If you use a multi-byte encoding then each byte is treated as a character. There is a set of functions that work with multibyte characters and encodings, and these can be used to programmatically generate output to the browser in any of the supported encodings. However, if you don't make any effort to generate a particular encoding, PHP returns any data it receives in the encoding it was received in. It also sends any string literal in the program using the encoding of the file the program is saved in. For example, if the file containing the program is saved in ISO-8859-2 encoding, then that is the encoding used for string literals. You can save a file this way in Notepad++ by selecting encoding ANSI and character set Eastern European. So suppose the program has an instruction that echoes a string literal consisting of the single character Ł, which ISO-8859-2 stores as the byte 0xA3.
Then 0xA3 is sent to the browser and no attempt is made to change its encoding. When the browser receives this byte it interprets it as UTF-8 and shows a replacement character because, as we have seen many times, a lone 0xA3 is not a legal UTF-8 sequence. If you add a header to the response that declares the charset as ISO-8859-2 (remember, meta tags are ignored), you get a slightly different result.
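The two interpretations of that byte can be sketched in JavaScript with the standard TextDecoder class. This is only an illustration of the decoding step, not the browser's actual Ajax code path, and it assumes a runtime whose TextDecoder supports the iso-8859-2 label (current browsers and Node.js builds with full ICU do).

```javascript
// The single byte the server sends for the ISO-8859-2 literal.
const received = Uint8Array.of(0xA3);

// No charset declared: the byte is decoded as UTF-8. A lone 0xA3 is a
// continuation byte with no lead byte, so it becomes U+FFFD.
const asUtf8 = new TextDecoder("utf-8").decode(received);

// charset=ISO-8859-2 declared: the same byte maps to U+0141.
const asLatin2 = new TextDecoder("iso-8859-2").decode(received);

console.log(asUtf8 === "\uFFFD");   // true: replacement character
console.log(asLatin2 === "\u0141"); // true: the correct character
```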
With the header in place, the browser interprets 0xA3 as an ISO-8859-2 character and replaces it with U+0141, which is the correct character. Now it looks as if everything works as long as we include an appropriate header in the response, but there is yet another twist. Consider the data that the server received as part of the post or get. If the client sends a field called test containing Ł, as in the earlier example, and the PHP program simply echoes that field back, then the correct character is displayed in the web page without the header.
The reason is simply that the data sent to the server is UTF-8 encoded, which means test contains 0xC5 0x81, i.e. as far as PHP is concerned it is a two-character string. When you send this string back to the browser it is interpreted as UTF-8 and hence the browser displays the correct character. However, if you add the header defining the charset as ISO-8859-2, then things go wrong. The data sent to the browser is 0xC5 0x81, as this is what was received. The browser treats it as ISO-8859-2, in which 0xC5 is an L with an acute accent (Ĺ) and 0x81 is an undefined character that displays as an open square.
Thus, if you don't include a header, data that was sent to the server is correctly sent back to the client, but data from the server might not be. If you do include the header, the data from the server is correctly sent to the client, but any data originating from the client isn't. There is nothing you can do to stop the browser sending UTF-8 and treating anything it receives as UTF-8 unless told otherwise.
Trying to work with Ajax with anything other than UTF-8 seems like fighting nature. It can be done, but you will have to do it in code. You can treat UTF-8 as a "transport" encoding and write code on the client or the server to convert to the encoding that you want to work with. For example, if you want to send the data from the server in ISO-8859-2 encoding, assuming you are sending a correct Content-Type header, and you want to echo data back to the client, you need to convert it from UTF-8 to ISO-8859-2 first, for instance with one of PHP's multibyte functions such as mb_convert_encoding.
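The idea of treating UTF-8 as a transport encoding can be sketched in JavaScript as follows. This is not the article's PHP code: utf8ToLatin2 is a hypothetical helper, the reverse lookup table is just one simple way to get a legacy encoder out of the standard TextDecoder, and the sketch assumes the runtime supports the iso-8859-2 label.

```javascript
const utf8 = new TextDecoder("utf-8", { fatal: true });
const latin2 = new TextDecoder("iso-8859-2");

// Reverse lookup (character -> byte), built by decoding every possible byte.
const latin2ByteFor = new Map();
for (let b = 0; b < 256; b++) {
  latin2ByteFor.set(latin2.decode(Uint8Array.of(b)), b);
}

// Hypothetical helper: re-encode UTF-8 bytes as ISO-8859-2 bytes.
function utf8ToLatin2(bytes) {
  const text = utf8.decode(bytes);
  return Uint8Array.from(Array.from(text), (ch) => {
    const b = latin2ByteFor.get(ch);
    if (b === undefined) {
      throw new RangeError("Not representable in ISO-8859-2: " + ch);
    }
    return b;
  });
}

// The two bytes the browser posts for the single character U+0141.
const fromClient = Uint8Array.of(0xC5, 0x81);

// Echoed unchanged but labelled ISO-8859-2: two wrong characters.
const mislabelled = latin2.decode(fromClient);

// Converted first, then labelled ISO-8859-2: one byte, 0xA3, shown correctly.
const converted = utf8ToLatin2(fromClient);
const shown = latin2.decode(converted);

console.log(mislabelled.length, converted.length, shown === "\u0141");
```

The helper throws for characters that ISO-8859-2 cannot represent, which is one of the ways converting on the fly can go wrong.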
This converts the UTF-8 string into ISO-8859-2, which is now echoed back to the client correctly as long as there is an appropriate Content-Type header. You have to be aware at all times of what encoding is being used and make sure you use just one encoding in a single page.
Conclusion
The whole subject of encodings and web pages is huge and well beyond the limits of a single chapter. Even a book would fail to cover every possibility. If you can, opt to work with nothing but UTF-8. This is the only easy route. Make sure all files are stored in UTF-8 and that all servers, web and database, default to UTF-8. If you do this then things are as simple as they can be. If you have to use legacy encodings, then consider converting the data to UTF-8 before spending a lot of time trying to work with it as it is encoded. There are too many ways that things can go wrong when you change encodings on the fly.