JavaScript Canvas - Unicode
Written by Ian Elliot
Monday, 04 January 2021
UTF-16 in JavaScript

Now we come to a confusing twist in the story. JavaScript has Unicode support and all JavaScript strings are UTF-16 coded – this has some unexpected results for any programmer under the impression that they can assume that one character is one byte. While you can mostly ignore the encoding used, the fact that web pages and script files use UTF-8 while JavaScript uses UTF-16 can cause problems. The important point to note is that when JavaScript interacts with a web page, characters are converted from UTF-8 to UTF-16 and vice versa.

As you can guess, UTF-16 is another variable-length way of coding Unicode, but as the basic unit is 16 bits we only need to allow for the possibility of a single additional 16-bit word. For any Unicode character in the range U+0000 to U+FFFF, i.e. 16 bits, you simply set the single 16-bit word to the code. So how do we detect that two 16-bit words, called a surrogate pair, are needed? The answer is that the range U+D800 to U+DFFF is reserved and doesn't represent any valid character, i.e. these codes never occur in a valid string. These reserved codes are used to signal a two-word coding.

If you have a Unicode character that has a code larger than U+FFFF then you have to convert it into a surrogate pair using the following steps:

1. Subtract 0x10000 from the character code to leave a 20-bit value.
2. Add the top ten bits of this value to 0xD800 to give the first 16-bit word.
3. Add the bottom ten bits to 0xDC00 to give the second 16-bit word.
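As a quick sketch of these steps in code (the function name toSurrogatePair is just for this illustration, and it assumes it is only called with codes above 0xFFFF):

function toSurrogatePair(code) {
    var temp = code - 0x10000;           // step 1: leaves a 20-bit value
    var high = 0xD800 + (temp >> 10);    // step 2: top ten bits
    var low  = 0xDC00 + (temp & 0x3FF);  // step 3: bottom ten bits
    return [high, low];
}

console.log(toSurrogatePair(0x1F638)[0].toString(16));  // d83d
console.log(toSurrogatePair(0x1F638)[1].toString(16));  // de38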
Reconstructing the character code is just the same process in reverse. If you find a 16-bit value in the range 0xD800 to 0xDBFF then it and the next 16-bit value are a surrogate pair. Take 0xD800 from the first and 0xDC00 from the second, put the two together to make a 20-bit value and add 0x10000.

The only problem is that different machines use different byte orderings – little endian and big endian. To tell which sort of machine you are working with, a Byte Order Mark, or BOM, can be included in a string: U+FEFF. If this is read as 0xFFFE then the machine doing the decoding has a different byte order to the machine that did the coding.

The most important thing to know is that JavaScript only uses a single 16-bit value to represent a character. This means it doesn't naturally work with the full range of Unicode characters as it has no built-in notion of a surrogate pair.

The BMP - Basic Multilingual Plane

A JavaScript string usually uses nothing but characters that can be represented in a single 16-bit word in UTF-16. As long as you can restrict yourself to the Basic Multilingual Plane (BMP), as this set is referred to, everything works simply. If you can't, then things become much harder.

You can enter a Unicode character using an escape sequence: \xHH for characters that have codes up to 0xFF, i.e. 0 to 255, and \uHHHH for characters that have codes up to 0xFFFF, where H is a hex digit. For example:

var a = "Hello World\u00A9";

adds a copyright symbol to the end of Hello World. This is simple enough and, if you now try:

console.log(a.length);

you will find that it correctly displays 12, because the length property counts the number of 16-bit characters in the string.

What about the Unicode characters that need two 16-bit words? How can you enter them? The answer is that in ECMAScript 2015 and later you can enter the full code point directly:

\u{HHHHH}

where the braces contain the character code in hex, up to 0x10FFFF.

Alternatively you could use the string functions fromCharCode and fromCodePoint, which do the same job of converting a character code to a string. However, fromCharCode only works with 16-bit values and not surrogate pairs, while fromCodePoint will return a surrogate pair if the code is greater than 0xFFFF. The only problem is that fromCodePoint was introduced with ECMAScript 2015 and isn't supported in older browsers, although a polyfill is available.

The functions charCodeAt and codePointAt will return the character code at a specified position in a string. The charCodeAt function works in 16-bit values and is blind to surrogate pairs, whereas codePointAt, which like fromCodePoint isn't supported by older browsers, will return a value greater than 0xFFFF if the position is the start of a surrogate pair. Notice, however, that the position is still in terms of 16-bit values and not characters. For example:

var s1 = "\u{1F638}";

or:

var s1 = String.fromCodePoint(0x1F638);

stores the surrogate pair \uD83D\uDE38 in s1, which is the "grinning cat face with smiling eyes" emoji.
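A quick sketch shows the difference between the two sets of functions when applied to this string (the comments give the values you should see):

var s1 = "\u{1F638}";                         // the emoji, stored as a surrogate pair
console.log(s1.length);                       // 2 - two 16-bit values, not one character
console.log(s1.charCodeAt(0).toString(16));   // d83d - only the first half of the pair
console.log(s1.codePointAt(0).toString(16));  // 1f638 - the full code point
console.log(s1.codePointAt(1).toString(16));  // de38 - position 1 is the middle of the pair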
If you cannot assume ECMAScript 2015 then you have to enter the surrogate pair as two characters. You can easily write a function that will return a UTF-16 encoding of a Unicode character code:

function codeToUTF16(code) {
    // pad a value to four upper-case hex digits, e.g. 0x3D becomes "003D"
    var hex = function (n) {
        return ("0000" + n.toString(16).toUpperCase()).slice(-4);
    };
    if (code <= 0xFFFF) return "\\u" + hex(code);
    code -= 0x10000;
    return "\\u" + hex(0xD800 + (code >> 10)) +
           "\\u" + hex(0xDC00 + (code & 0x3FF));
}

For example:

console.log(codeToUTF16(0x1F638));

produces:

\uD83D\uDE38

which is the "grinning cat face with smiling eyes" again. Notice that JavaScript sends the UTF-16 to the browser unmodified – it is the browser that converts it to the equivalent UTF-8 and then displays it.

JavaScript Problems

As long as you restrict yourself to the BMP everything works in a fairly simple way. If you move outside of the BMP, to make use of emojis say, then things are more complicated. Most of the JavaScript functions only work properly when you use characters from the BMP and there is a one-to-one correspondence between 16-bit values and characters. JavaScript may display surrogate pairs correctly, but in general it doesn't process them correctly. For example, length gives you the wrong number of characters if there are surrogate pairs. Functions like charAt(n) will return the wrong character if n is beyond a surrogate pair, and might not even return a valid character if n selects the second value of the pair. In general, you just have to assume that all JavaScript functions work with 16-bit characters and that a surrogate pair is treated as two characters, one of which might not even be valid.

There is one final problem that you need to be aware of. In Unicode a single code always produces the same glyph or character, but there may be many characters that look identical. In technical terms, you can create the same glyph in different ways. Unicode supports combining codes which put together multiple characters into a single character. This means that you can often obtain a character with an accent either by selecting a character with the accent already included or by selecting a character without an accent and combining it with the appropriate accent character. The end result is that two Unicode strings can look identical and yet be represented by different codes. This means the two strings won't be treated as being equal and they might not even have the same length.

The solution to this problem is to normalize the strings so that characters are always produced in the same way. This is not an easy topic to deal with in general as there are so many possible ways of approaching it. For example, there are two ways of specifying an accented e:

var s1 = '\u00E9';
var s2 = 'e\u0301';
ctx.font = "normal normal 40px arial";
ctx.fillText(s1 + " " + s2, 10, 60);

both of which produce apparently the same character on the canvas.
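You can see the problem in the console with a minimal sketch using the same two strings (the comments give the values you should see):

var s1 = '\u00E9';       // precomposed e-acute
var s2 = 'e\u0301';      // e followed by a combining acute accent
console.log(s1 === s2);  // false - different code sequences
console.log(s1.length);  // 1
console.log(s2.length);  // 2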
ECMAScript 2015 introduced the normalize method and:

s1.normalize() === s2.normalize();

is true, with both normalized strings equal to \u00E9. You can also specify the particular type of normalization you require, but this is beyond the scope of this introduction to working with Unicode text.
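If you need to compare strings that might mix the two forms, a small helper along these lines will do the job (sameText is just a name chosen for this sketch; normalize defaults to the NFC form):

function sameText(a, b) {
    // compare after putting both strings into the same normalization form
    return a.normalize() === b.normalize();
}

console.log(sameText('\u00E9', 'e\u0301'));  // true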