Getting Started With jQuery - Advanced Ajax Characters & Encoding
Written by Ian Elliot   
Tuesday, 20 June 2017

UTF-16 In JavaScript

JavaScript has Unicode support and all JavaScript strings are UTF-16 coded. This has some unexpected results for any programmer under the impression that one character is one byte. While you can mostly ignore the encoding used, the fact that web pages use UTF-8 and JavaScript uses UTF-16 can cause problems. 

The key idea is that when JavaScript interacts with a web page, characters are converted from UTF-8 to UTF-16 and vice versa.

As you can guess, UTF-16 is another variable-length way of coding Unicode, but as the basic unit is 16 bits we only need to allow for the possibility of one additional 16-bit word. 

For any Unicode character in the range U+0000 to U+FFFF, i.e. 16 bits, you simply set the single 16-bit word to the code. So how do we detect that two 16-bit words, called a surrogate pair, are needed? The answer is that the range U+D800 to U+DFFF is reserved and doesn't represent any valid character, i.e. they never occur in a valid string. These reserved codes are used to signal a two-word coding. If you have a Unicode character that has a code larger than U+FFFF then you have to convert it into a surrogate pair:

  1. Subtract 0x010000 from it to give a 20-bit number in the range 0x000000 to 0x0FFFFF.

  2. The top 10 bits are added to 0xD800 to give the first 16-bit surrogate in the range 0xD800 to 0xDBFF.

  3. The low 10 bits are added to 0xDC00 to give the second 16-bit surrogate in the range 0xDC00 to 0xDFFF.

Reconstructing the character code is just the same process in reverse. 

If you find a 16-bit value in the range 0xD800 to 0xDBFF then it and the next 16-bit value are a surrogate pair. Take 0xD800 from the first and 0xDC00 from the second. Put the two together to make a 20-bit value and add 0x010000. The only problem is that different machines use different byte orderings - little endian and big endian. To tell which sort of machine you are working with, a Byte Order Mark, or BOM, U+FEFF, can be included in a string. If this is read as 0xFFFE then the machine doing the decoding has a different byte order to the machine that did the coding. 
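The decoding steps can be sketched as a small function. The name utf16ToCode is chosen here for illustration; it isn't part of any standard API:

```javascript
// Sketch: reverse the surrogate-pair encoding steps.
function utf16ToCode(lead, trail) {
    var high = lead - 0xD800;   // recover the top 10 bits
    var low = trail - 0xDC00;   // recover the low 10 bits
    return (high << 10) + low + 0x10000;
}

console.log(utf16ToCode(0xD83D, 0xDE38).toString(16)); // "1f638"
```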

The BMP - Basic Multilingual Plane

A JavaScript string usually uses nothing but characters that can be represented in a single 16-bit word in UTF-16. As long as you can restrict yourself to the Basic Multilingual Plane (BMP), as this set is referred to, everything works simply. If you can't, then things become much harder. 

You can enter a Unicode character using an escape sequence:

\xHH

for characters that have codes up to 0xFF, i.e. 0 to 255, and:

\uHHHH

for characters that have codes up to 0xFFFF, where H is a hex digit. 

For example:

var a = "Hello World\u00A9";

adds a copyright symbol to the end of Hello World. This is simple enough, but if you now try:

alert(a.length);

you will find that it displays 12, because the length property counts the number of 16-bit values in a string. 

What about the Unicode characters that need two bytes? How can you enter them? 

The answer is that in ECMAScript 6 you can enter a full code point using the \u{HHHHH} escape:

var s = "\u{1F638}";

If you cannot assume ECMAScript 6 then you have to enter the surrogate pair as two escape sequences.  
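For example, assuming an ES6 environment, the code point escape and the hand-written surrogate pair produce identical strings:

```javascript
// ES6 code point escape vs an explicit surrogate pair
var viaCodePoint = "\u{1F638}";
var viaSurrogates = "\uD83D\uDE38";
console.log(viaCodePoint === viaSurrogates); // true
console.log(viaCodePoint.length);            // 2 - still two 16-bit units
```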

You can easily write a function that will return a UTF-16 encoding of a Unicode character code:

function codeToUTF16(code) { 
 // pad to four hex digits so the escape is always valid, e.g. 0x29 -> "0029"
 function hex4(v) {
  return ("0000" + v.toString(16).toUpperCase()).slice(-4);
 }
 if (code <= 0xFFFF) return "\\u" + hex4(code);
 code = code - 0x10000;
 var sLead = 0xD800 | (code >> 10);
 var sTrail = 0xDC00 | (code & 0x3FF);
 return "\\u" + hex4(sLead) + "\\u" + hex4(sTrail);
}

For example:

var s = codeToUTF16(0x1F638);

returns the escape sequence text \uD83D\uDE38, which is the surrogate pair for the "grinning cat face with smiling eyes".


If you use the escapes to create the string and try to display it on the console, the chances are you won't see the emoji - it depends on what is hosting the console. If you show it in an alert then you should see it, as the browser will convert it to UTF-8 and then display it:

alert("\uD83D\uDE38");

Notice that JavaScript sends the UTF-16 to the browser unmodified - it is the browser that converts it to the equivalent UTF-8.

JavaScript Problems

As mentioned, as soon as you use characters outside of the BMP things get complicated. For example:

var s = "\uD83D\uDE38";

reports the length of the string as two even though only one character is coded. 

At the moment most of the JavaScript functions only work when you use characters from the BMP and there is a one-to-one correspondence between 16-bit values and characters. JavaScript may display surrogate pairs correctly, but in general it doesn't process them correctly.  For example, consider the string that represents two cat emoji:

var s = "\uD83D\uDE38\uD83D\uDE38";

The charAt function doesn't give you the second cat emoji either. For example, s.charAt(1) returns the single 16-bit value \uDE38, the trailing surrogate of the first emoji, which is not a valid character on its own, i.e. it returns the 16-bit code corresponding to the second 16-bit word rather than the second character. 
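If you need to count characters rather than 16-bit units, ES2015 string iteration is code-point aware, so Array.from can be used as a sketch of a character count:

```javascript
var s = "\uD83D\uDE38\uD83D\uDE38";    // two cat emoji
console.log(s.length);                 // 4 - counts 16-bit units
console.log(Array.from(s).length);     // 2 - ES2015 iteration is code-point aware
console.log(s.charAt(1) === "\uDE38"); // true - a lone trailing surrogate
```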

You also need to know about the string functions that work with Unicode code values: 

  • fromCharCode and fromCodePoint do the same job - convert a character code to a string - however fromCharCode only works with 16-bit values and not surrogate pairs. fromCodePoint will return a surrogate pair if the code is greater than 0xFFFF. The only problem is that fromCodePoint was introduced with ECMAScript 2015 and isn't supported in IE or older browsers. A polyfill is available. 

  • charCodeAt and codePointAt will return the character code at a specified position in a string. The charCodeAt function works with 16-bit values and is blind to surrogate pairs. The codePointAt function will return a value greater than 0xFFFF if the position is the start of a surrogate pair. Notice, however, that the position is still in terms of 16-bit values and not characters. The codePointAt function was introduced in ECMAScript 2015 and isn't supported in older browsers. A polyfill is available. 
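The difference between the two pairs of functions can be seen with the cat emoji (an ES2015 environment is assumed for the codePoint variants):

```javascript
var cat = "\uD83D\uDE38";
console.log(String.fromCodePoint(0x1F638) === cat); // true - builds the surrogate pair
console.log(String.fromCharCode(0x1F638) === cat);  // false - value truncated to 16 bits
console.log(cat.charCodeAt(0).toString(16));        // "d83d" - just the lead surrogate
console.log(cat.codePointAt(0).toString(16));       // "1f638" - the full code point
```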

There is one final problem that you need to be aware of. In Unicode a single code always produces the same character, but there may be different code sequences that produce characters which look identical. In technical terms, you can create the same glyph in different ways. Unicode supports combining characters, which merge with the character before them to form a single displayed character. This means that you can often obtain an accented character in two ways: by selecting a precomposed character that includes the accent, or by selecting the plain character followed by the appropriate combining accent. The end result is that two Unicode strings can look identical and yet be represented by different codes. The two strings won't be treated as equal and they might not even have the same length. The solution to this problem is to normalize the strings so that each character is always represented in the same way. This is not an easy topic to deal with in general as there are several standard normalization forms to choose between. 
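For example, ES2015 provides String.prototype.normalize, which converts a string to a chosen normalization form such as NFC:

```javascript
var precomposed = "\u00E9";  // é as a single precomposed code
var combining = "e\u0301";   // e followed by a combining acute accent
console.log(precomposed === combining);                  // false - they only look the same
console.log(precomposed.length + " " + combining.length); // "1 2"
console.log(precomposed === combining.normalize("NFC")); // true after normalizing
```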

The whole subject of safely working with Unicode in JavaScript is too large for this chapter. It is important that you know what the problem is and, if you are going to work with characters that need two 16-bit words, or rather if your users are, then you need to look into ways of processing the strings correctly. 

Working With Non-UTF-8 Encodings

Before moving on to the specific topic of Ajax and encoding, let's just look at the way normal web pages are retrieved using an HTTP GET.

This is a difficult subject because of the many different ways available to deal with the situation. All modern programs, not just servers and browsers, use Unicode. If they have an option to save data in another encoding then there are two ways they can do the job:

  1. Stay with Unicode and save a file that simply restricts itself to the Unicode equivalents of the characters in the target encoding's character set.

  2. Convert the Unicode to the encoding in question and create a file that really is in that encoding, not just one that looks like it. 

This confusion between the use of the real encoding and the Unicode equivalent character set occurs in servers and browsers. Suppose the second approach is taken: if you specify an encoding of ISO 8859-2 and enter character 0xA3, which is Ł, the character is stored in the file on disk as the single byte 0xA3.

When the web server is asked for the file it reads it in, doesn't change the coding, and sends it to the browser complete with HTTP headers and/or meta tags that specify that the data is in ISO 8859-2. The browser reads 0xA3 and, knowing the data is ISO 8859-2, converts it to the equivalent Unicode character U+0141, and it is this Unicode character that is displayed in the web page. Also notice that internally the character is UTF-8 encoded, so in the web page it is represented by two bytes - 0xC5 and 0x81.
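You can check the UTF-8 representation of U+0141 in Node.js, for example (in a browser, TextEncoder does the same job):

```javascript
// Node.js: the UTF-8 bytes of the Unicode character U+0141 (Ł)
var bytes = Buffer.from("\u0141", "utf8");
console.log(bytes.length);          // 2
console.log(bytes[0].toString(16)); // "c5"
console.log(bytes[1].toString(16)); // "81"
```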

So while characters stored in the file are encoded using ISO 8859-2, when they are loaded into the web page they only look as if they are in the ISO 8859-2 character set; in fact they are stored and displayed as the equivalent Unicode characters in UTF-8.

The browser always works with UTF-8 internally.

You can prove that this is true by writing a JavaScript program that retrieves the character code at the character's location, something like:

alert($("#test").text().charCodeAt(0).toString(16));
where test is:

<div id="test">Ł</div>

with the meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-2">

It is also important that the HTML file is saved in ISO 8859-2 encoding. In Notepad++, say, you can do this by selecting Encoding > Character sets > Eastern European > ISO 8859-2. If the file isn't saved in ISO 8859-2 encoding then things just won't work, because the meta tag or the header will state an encoding that doesn't match the file's actual contents.

If correctly encoded, the file contains character code 0xA3 between the div tags, but when the alert is displayed the character code shown is 0x141, which is of course the correct UTF-16 encoding.



Last Updated ( Thursday, 05 May 2022 )