Getting Started With jQuery - Advanced Ajax Characters & Encoding
Written by Ian Elliot   
Tuesday, 20 June 2017


Even if you are using a different encoding in the web page, the encoding used by the browser is still Unicode. The only thing that setting the charset property does is to govern how the codes stored in the file and received by the browser are converted into Unicode. 

If you follow this idea then what do you think the alert box will show when the content of the div is changed to:

<div id="test">&#xA3;</div>

The character code 0xA3 in ISO 8859-2 is Ł and this is what you might expect to be displayed, but it isn't. The HTML entity &#xA3; represents Unicode character 0xA3, no matter what the charset is. So the web page shows £, which is Unicode character 0xA3, and the alert box shows A3. No conversion to Unicode is performed because it is already considered to be Unicode.
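The difference between a numeric entity and a raw byte can be sketched outside the browser. The following is only an illustration: resolveHexEntities is a hypothetical helper, not part of any library, and it assumes a runtime whose TextDecoder supports the iso-8859-2 label (current browsers, or Node built with full ICU).

```javascript
// Hypothetical helper: resolve hexadecimal numeric character references
// such as &#xA3;. The number is always a Unicode code point, so no
// charset is consulted.
function resolveHexEntities(markup) {
  return markup.replace(/&#x([0-9a-f]+);/gi, (match, hex) =>
    String.fromCodePoint(parseInt(hex, 16)));
}

// The entity gives Unicode character 0xA3, the pound sign.
const fromEntity = resolveHexEntities("&#xA3;");

// A raw byte 0xA3 stored in an ISO-8859-2 file does go through the charset
// and becomes Unicode character 0x141.
const fromByte = new TextDecoder("iso-8859-2").decode(new Uint8Array([0xA3]));

console.log(fromEntity.charCodeAt(0).toString(16)); // a3
console.log(fromByte.charCodeAt(0).toString(16));   // 141
```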

The charset simply controls how the file sent to the browser is treated as it is converted to Unicode in UTF-8 encoding. Once it is converted everything from that point on works as if it had been Unicode from the beginning. 

This might sound complicated, but in fact it is a great simplification. It allows you to write your JavaScript programs using Unicode without having to worry about which charset the web page was using when it was loaded.

That is, irrespective of the charset encoding, JavaScript always works with a web page that is encoded using Unicode UTF-8 and when it interacts with the page it uses Unicode, but UTF-16 encoded.  

If you don't specify an encoding, most browsers will treat the input data as UTF-8 and what you see depends on how non-UTF-8 data can be interpreted as UTF-8.

What you actually see when an ISO or other encoding is treated as UTF-8 depends on whether the code received forms a legal single, double, triple or even quadruple byte UTF-8 code. All ISO characters up to 0x7F are represented accurately as they are shared between encodings. Every byte above 0x7F is illegal unless it is part of a well-formed multi-byte sequence, for example a lead byte of 0xC2 or 0xC3 followed by a byte in the range 0x80 to 0xBF. It is also worth knowing that the replacement character for illegal UTF-8 codes, a question mark in a diamond, is U+FFFD or 0xEF 0xBF 0xBD as a 3-byte UTF-8 encoding.
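If you want to check whether a particular run of bytes is legal UTF-8, one possible approach is to ask the standard TextDecoder to fail rather than substitute the replacement character. This is a minimal sketch; isValidUtf8 is a hypothetical name chosen for illustration.

```javascript
// Hypothetical helper: returns true when the bytes are well-formed UTF-8.
// With { fatal: true } the decoder throws a TypeError instead of
// substituting U+FFFD for malformed input.
function isValidUtf8(byteValues) {
  try {
    new TextDecoder("utf-8", { fatal: true })
      .decode(new Uint8Array(byteValues));
    return true;
  } catch (err) {
    return false;
  }
}

const asciiOk = isValidUtf8([0x41]);            // plain ASCII byte
const loneA3 = isValidUtf8([0xA3]);             // continuation byte with no lead byte
const leadPlusTail = isValidUtf8([0xC5, 0x81]); // lead byte plus continuation byte

// The replacement character U+FFFD encoded as UTF-8.
const replacementBytes = Array.from(new TextEncoder().encode("\uFFFD"));

console.log(asciiOk, loneA3, leadPlusTail); // true false true
console.log(replacementBytes.map((b) => b.toString(16))); // [ 'ef', 'bf', 'bd' ]
```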

Ajax and Encoding From The Server

Now we come to the matter of what happens when you perform an Ajax operation with respect to encodings. We need to consider the situation with respect to a get and a post. 

The main thing to remember is that as far as the server is concerned an Ajax get request is the same as a browser get request. The data that the server sends back is therefore the same as for a "normal" browser request, but there are a few differences in how the browser treats the data.

The character encoding used will therefore be whatever is used in the file that is served, but most browsers will not process the HTML in the body of the response to an Ajax request. This means that you cannot rely on specifying the charset in a meta tag. Most browsers will read and honour the charset specified in a header, which they do still read and process as part of the HTTP protocol.

Apart from this, the process of retrieving a file using Ajax works the same way as for a file retrieved by a normal browser get. The charset specified in the header is used to convert the data into the correct UTF-8 Unicode. 

For example, create a file test.html containing character code 0xA3 twice, which in ISO 8859-2 is ŁŁ. It is also important that the HTML file is saved in ISO-8859-2 encoding. You can do this if you use Notepad++ and select encoding Ansi and character set Eastern European.

Now if you read the file using:

var options = {};
options.url = "test.html";
options.method = "get";
$.ajax(options).then(
    function (data) { alert(data); });

what you will find is that the characters show as diamonds with question marks.



The reason for this is that the browser is interpreting the code 0xA3 as UTF-8, where it is an illegal character. Even the two bytes taken together are illegal as no 2-byte encoding starts with 0xA3.
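You can reproduce the effect without a server by decoding the same two bytes as UTF-8 yourself. This is only a sketch of what the browser's decoder is assumed to be doing with the file contents; the variable names are made up for the example.

```javascript
// The two bytes stored in test.html when it is saved as ISO-8859-2.
const fileBytes = new Uint8Array([0xA3, 0xA3]);

// Decode them as UTF-8, the default when no charset is supplied.
const shown = new TextDecoder("utf-8").decode(fileBytes);

// List the resulting code points in hex.
const codePoints = Array.from(shown, (ch) => ch.codePointAt(0).toString(16));

console.log(codePoints); // [ 'fffd', 'fffd' ]
```

Each illegal byte is replaced separately, which is why two diamonds appear rather than one.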

Also notice that this behavior doesn't change if you change the encoding of the page that the Ajax script is running in. It also doesn't change if you set jQuery's options.dataType to "html" or "text". The default charset is UTF-8 and what matters is what charset is specified for the file being received.

The problem is how to specify it. For a normal web page get you can use a meta tag or an HTTP header and things work, although if both are specified the HTTP header takes precedence.

Let's add a meta tag to the file:

<meta http-equiv="Content-Type"
         content="text/html; charset=ISO-8859-2">

Of course you could have a complete HTML page stored in the file, but from the point of view of encoding this is all that matters. 

If you try this out and load the file using an Ajax get, what you will see depends on the browser. Chrome takes notice of the charset and converts the ISO codes to Unicode. That is, the 0xA3 which it now knows is in ISO 8859-2 is the same character as Unicode 0x141 and this is what it is converted to. With the browser using Unicode 0x141 the correct character is displayed even though it is UTF-8 encoded.

If you try Firefox, Edge or IE 11 you will discover that the character encoding isn't changed and what you see is the replacement character. That is, the page is treated as if it was UTF-8 and the meta tag is ignored. 

As mentioned earlier, some browsers do not process the HTML meta tags included in files retrieved using AJAX.

However, if you place the same information in an HTTP header, for example by serving the following PHP file:

<?php
header('Content-Type: text/html; charset=ISO-8859-2');

then Chrome, Firefox, Edge and IE take notice and convert the ISO codes to Unicode correctly and you see the correct characters.

The same behaviour is true of other Ajax methods that return data from the server. The only safe encoding is UTF-8 unless you place Content-Type HTTP headers into the response. 
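The role of the header can be sketched as a two-step process: find the charset parameter, then decode the bytes with it. The helpers charsetFrom and decodeBody below are hypothetical and only approximate what a browser does; the sketch also assumes a runtime whose TextDecoder supports ISO-8859-2.

```javascript
// Hypothetical helper: pick the charset parameter out of a Content-Type
// value, falling back to UTF-8 when none is given.
function charsetFrom(contentType) {
  const match = /charset\s*=\s*"?([^";\s]+)/i.exec(contentType || "");
  return match ? match[1] : "utf-8";
}

// Hypothetical helper: decode a response body using the header's charset.
function decodeBody(bytes, contentType) {
  return new TextDecoder(charsetFrom(contentType)).decode(bytes);
}

// The two 0xA3 bytes from the ISO-8859-2 file.
const body = new Uint8Array([0xA3, 0xA3]);

// With the charset in the header the bytes become U+0141 twice.
const withHeader = decodeBody(body, "text/html; charset=ISO-8859-2");

// Without it they are treated as UTF-8 and become replacement characters.
const withoutCharset = decodeBody(body, "text/html");
```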

Ajax and Encoding to the Server

What about data going to the server using a post or a get?

Even though they use different methods to send the data to the server, in principle both can be set to a particular encoding using the contentType option.

For example:

contentType: "application/x-www-form-urlencoded; charset=ISO-8859-2"

Things are not simple, however. You can make them simple by using nothing but UTF-8. If you can't you will have to do battle with each of the systems involved in the interaction.

The problem is that there are too many applications involved in a typical Ajax transaction and they all have an opinion on how to deal with the encoding. For example, there is the browser sending the request, the web server receiving the request (Apache, say), the language used to receive the request and generate the response (PHP, say), the web server sending the response and the browser receiving the response. They can all decide what the encoding is and what it should be, and each can therefore make a mess of it. When debugging a charset/encoding problem you have to verify what is received at each of the stages, and this can be difficult. 

The biggest problem with AJAX and non-UTF-8 encoding is the following statement in the jQuery documentation:

Note: The W3C XMLHttpRequest specification dictates that the charset is always UTF-8; specifying another charset will not force the browser to change the encoding.

If you check the most up-to-date documentation you will find no mention that UTF-8 is always to be used but Edge, IE 11, Firefox and Chrome all do use UTF-8.

The best way to see how things can go wrong is by way of an example. If you have a web page with the meta tag:

<meta http-equiv="Content-Type"
           content="text/html; charset=ISO-8859-2">

and the program:

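var sendData={test:"Ł"};

var options = {};
options.url = "phpinfo.php";
options.method = "post";
options.contentType= "application/x-www-form-urlencoded; charset=ISO-8859-2";
options.data = sendData;
$.ajax(options).then(
  function (data) {
    // show what the server reports it received
    console.log(data);
  });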

It is also important that the HTML file is saved in ISO-8859-2 encoding. You can do this if you use Notepad++, say, and select encoding Ansi and character set Eastern European.

Notice that in the ajax call we set contentType and then send the data which is encoded as 0xA3 which is Ł in ISO-8859-2 in the web page. What happens is that the web browser converts this string literal as the page is loaded to the correct Unicode character, i.e. 0x141, which is then sent as UTF-8, i.e. the two bytes 0xC5 and 0x81.  

This works in the same way even if you change the post to a get. In this case the data is encoded into the query string as %C5%81.
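You can check both forms of the data without sending anything. The following sketch uses only standard JavaScript functions; the variable names are chosen for illustration and the field name test is taken from the example above.

```javascript
// The character as it exists inside the page once loaded: U+0141.
const value = "\u0141";

// Its UTF-8 encoding, as sent in the body of a post.
const utf8Bytes = Array.from(new TextEncoder().encode(value),
  (b) => b.toString(16).toUpperCase());

// Its percent-encoded form, as it appears in a query string.
const queryValue = encodeURIComponent(value);
const queryString = new URLSearchParams({ test: value }).toString();

console.log(utf8Bytes);   // [ 'C5', '81' ]
console.log(queryValue);  // %C5%81
console.log(queryString); // test=%C5%81
```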

The conversion from ISO-8859-2 to UTF-8 happens because of the meta tag and not the contentType option. You can easily prove this by removing the meta tag and the contentType option in turn: the conversion only works when the meta tag is present. It would also work with an HTTP header because all that matters is that the page is converted from ISO-8859-2 as it is loaded by the browser. The Ajax call has nothing to do with it. 

The ajax call always sends the data using UTF-8 because this is what is always used within the web page for text.

This means that the data that the server receives and processes is always in UTF-8, and if you want it to work with another encoding you have to write a program to do the conversion on the fly.
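As a rough idea of what such a conversion involves, here is a sketch in JavaScript, as you might write it for a Node server rather than the PHP used earlier. JavaScript has no built-in ISO-8859-2 encoder, so the sketch builds a reverse table from the decoder. The helper utf8ToLatin2 is hypothetical, and the code assumes a runtime whose TextDecoder supports the iso-8859-2 label.

```javascript
// Build a reverse lookup (character -> byte) from the decoder's own table.
const latin2Decoder = new TextDecoder("iso-8859-2");
const charToByte = new Map();
for (let byte = 0; byte < 256; byte++) {
  charToByte.set(latin2Decoder.decode(new Uint8Array([byte])), byte);
}

// Hypothetical helper: re-encode UTF-8 bytes as ISO-8859-2 bytes.
// Throws if the input is not valid UTF-8 or a character has no
// ISO-8859-2 equivalent.
function utf8ToLatin2(utf8Bytes) {
  const text = new TextDecoder("utf-8", { fatal: true }).decode(utf8Bytes);
  return Uint8Array.from(Array.from(text), (ch) => {
    if (!charToByte.has(ch)) {
      throw new RangeError(
        "No ISO-8859-2 byte for U+" + ch.codePointAt(0).toString(16));
    }
    return charToByte.get(ch);
  });
}

// The two bytes the browser sends for U+0141.
const received = new Uint8Array([0xC5, 0x81]);
const converted = Array.from(utf8ToLatin2(received));

console.log(converted.map((b) => b.toString(16))); // [ 'a3' ]
```

Characters that ISO-8859-2 cannot represent, such as the pound sign, cause an error here; a real conversion would have to decide whether to reject or substitute them.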







Last Updated ( Thursday, 05 May 2022 )