JavaScript Canvas - Unicode

Written by Ian Elliot

Monday, 04 January 2021

Canvas can handle text, but can JavaScript handle Unicode? In this extract from my book on JavaScript Graphics, we look at the basics of working with Unicode characters on Canvas.

We have moved beyond the 128 or 256 characters available in ASCII - how does this work in JavaScript?

In chapter but not in this extract

fillText and strokeText
Typographic Positioning - textBaseline and textAlign
SVG Text on Canvas

Character Sets

At its most basic, data on the Internet consists of groups of 8 bits known as an "octet", but usually just called a "byte". Obviously to represent character data we need a mapping between numeric values and characters. One of the first standards for this was, and is, ASCII. This is a 7-bit code that defines 128 characters: A-Z, a-z, 0-9, control characters such as carriage return and backspace, and assorted special characters. Of course, using 8 bits you can represent 256 characters, but this isn't enough to represent all of the characters used by even a small selection of the written languages of the world.

The first solution to this problem was to simply reuse the same 256 numeric codes and associate them with different sets of characters. The most commonly used on the Internet is ISO 8859-n where n is between 1 and 16. Each value of n maps a different set of characters onto the 0 to 255 values that a byte can represent. For example, ISO 8859-1 is Latin-1 Western European and, if selected, provides characters for most Western European languages. ISO-8859-2 is Latin-2 Central European and provides characters for Bosnian, Polish, Croatian and so on.
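The effect of reusing the same codes can be seen with a minimal sketch using the standard TextDecoder object, which is available in browsers and Node.js. The byte value 0xA3 is chosen only for illustration, and the sketch assumes the runtime supports the legacy ISO 8859 labels, which browsers and standard Node.js builds normally do:

```javascript
// One byte value, interpreted under two different ISO 8859 character sets.
const sameByte = new Uint8Array([0xa3]);

// TextDecoder takes an encoding label. Note that the label "iso-8859-1"
// is treated as windows-1252 by the Encoding Standard, which agrees
// with Latin-1 for this particular byte.
const asLatin1 = new TextDecoder("iso-8859-1").decode(sameByte);
const asLatin2 = new TextDecoder("iso-8859-2").decode(sameByte);

console.log(asLatin1); // £ (pound sign, U+00A3)
console.log(asLatin2); // Ł (capital L with stroke, U+0141)
```

The data is identical in both cases; only the character set chosen to interpret it differs.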

Notice that we now have a situation where a single character code can correspond to different characters depending on which ISO-8859 character set is selected. This is a potential problem if a server sends data using one ISO-8859 character set and the browser displays it using another. The data hasn't changed, but what is displayed on each system is different. To stop this from happening, servers send a header stating the character set in use. For example:

Content-Type: text/html; charset=ISO-8859-1

sets the character set to Latin-1. The problem with this is that it is often inconvenient to make the server send a different header for an individual page. Setting the HTTP header for an entire site is reasonable, but you still might want to send a page in another character set.

To allow this you can use the <meta> tag:

<meta http-equiv="Content-Type" content="text/html; 
                                 charset=ISO-8859-1">

This has to be the first tag in the <head> section because the page cannot be rendered until the browser knows which charset is in use.

You can also set the character set of a script using:

<script src="/./myProgram.js" charset="ISO-8859-1"></script>

Notice that adding any of these character set specifications only tells the browser what encoding is in use; it doesn't actually enforce the encoding or convert anything from one encoding to another.

What matters is the encoding the file is actually stored in. For example, to save a file as ISO-8859-2 in an editor such as Notepad++, select encoding ANSI and character set Eastern European. The encoding used for the file determines how all of the characters it contains are represented, and this includes string literals used in JavaScript or PHP programs.

The advice is that if you are creating a library to be used by others, limit your code to the ASCII character set, which is represented the same way in all of the encodings discussed here. If you can't do this, the best thing to do is to use UTF-8, i.e. charset=UTF-8, which is what all modern browsers use and what all encodings are converted into on load.

Unicode

Most of what we have just looked at is legacy because the proper way to do character representation today is to use Unicode. You will still encounter websites using ISO character sets and need to understand how they work, but by comparison Unicode is more logical and complete. Unicode is just a list of characters, each indexed by a number called the character's code point. Code points run from U+0000 to U+10FFFF, so they need at most 21 bits, although they are often stored in a 32-bit value. There are enough characters in Unicode to represent every language in use and some that aren't.

Unicode defines the characters, but it doesn't say how the code point should be represented. The simplest approach is to use a 32-bit value for every character. This is UTF-32 and it is simple, but very inefficient: text that is mostly ASCII takes four times the space. In practice we use more efficient encodings.
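As a rough illustration of the cost, the sketch below compares the size of a short string stored as UTF-32 with its size in UTF-8. The function name utf32ByteLength and the sample string are made up for this example; codePointAt and TextEncoder are standard JavaScript:

```javascript
// Hypothetical helper: size in bytes of a string if stored as UTF-32.
function utf32ByteLength(text) {
  // Spreading a string iterates by code point, not by UTF-16 code unit.
  return [...text].length * 4;
}

const sample = "Canvas ∑";

// The code point of the last character, shown in hex.
console.log(sample.codePointAt(7).toString(16)); // 2211

console.log(utf32ByteLength(sample)); // 32
console.log(new TextEncoder().encode(sample).length); // 10
```

Eight characters need 32 bytes in UTF-32, but only 10 bytes in UTF-8, because the seven ASCII characters take one byte each.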

UTF-8

There are a number of encodings of Unicode, but the most important for the web is UTF-8. UTF-8 can represent all 1,112,064 valid Unicode code points, and clearly these cannot all be represented by a value in a single byte as the 256 characters of an ISO 8859 set could. Instead UTF-8 is a variable-length code that uses up to four bytes to represent a character. The number of bytes used to code a character is indicated by the most significant bits of the first byte:

0xxxxxxx   one byte 
110xxxxx   two bytes
1110xxxx   three bytes
11110xxx   four bytes

All subsequent bytes have their most significant two bits set to 10. This means that you can always tell a follow-on byte from a first byte. The bits in the table shown as x carry the information about which character is represented. To get the character code you simply extract the bits and concatenate them to get a 7, 11, 16 or 21-bit character code. Notice that, unlike the ISO schemes, there is only one character assigned to a character code. This means that if the server sends UTF-8 and the browser interprets the data as UTF-8 then there is no ambiguity.
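The extract-and-concatenate step can be sketched in JavaScript as follows. The function name decodeFirstCodePoint is hypothetical, and this is a teaching sketch rather than a conforming decoder: it does not reject overlong forms or surrogate values as a real decoder must. In practice you would use TextDecoder.

```javascript
// Hypothetical helper: decode the first character of a UTF-8 byte sequence.
// Returns the code point and the number of bytes it occupied.
function decodeFirstCodePoint(bytes) {
  const first = bytes[0];
  let length;
  let codePoint;

  // The leading bits of the first byte give the sequence length;
  // the remaining bits are the top bits of the code point.
  if ((first & 0x80) === 0x00) {        // 0xxxxxxx
    length = 1; codePoint = first;
  } else if ((first & 0xe0) === 0xc0) { // 110xxxxx
    length = 2; codePoint = first & 0x1f;
  } else if ((first & 0xf0) === 0xe0) { // 1110xxxx
    length = 3; codePoint = first & 0x0f;
  } else if ((first & 0xf8) === 0xf0) { // 11110xxx
    length = 4; codePoint = first & 0x07;
  } else {
    throw new RangeError("Not a valid first byte");
  }

  // Each follow-on byte must look like 10xxxxxx and adds six more bits.
  for (let i = 1; i < length; i++) {
    const next = bytes[i];
    if ((next & 0xc0) !== 0x80) {
      throw new RangeError("Expected a 10xxxxxx follow-on byte");
    }
    codePoint = (codePoint << 6) | (next & 0x3f);
  }
  return { codePoint, length };
}

// The three bytes E2 88 91 encode U+2211.
const decoded = decodeFirstCodePoint([0xe2, 0x88, 0x91]);
console.log(decoded.codePoint.toString(16), decoded.length); // 2211 3
```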

The first 128 characters of UTF-8 are the same as ASCII, so if you use a value less than 128 stored in a single byte then you have backward compatible ASCII text. That is, Unicode characters U+0000 to U+007F can be represented in a single byte. Going beyond this needs two, three or four bytes. Almost all the Latin alphabets plus Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N'Ko can be represented with just two bytes.
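You can check how many bytes a particular character needs using the standard TextEncoder object, which always produces UTF-8. The helper name utf8Hex and the sample characters are chosen just for this sketch:

```javascript
// Hypothetical helper: the UTF-8 bytes of a string as a hex listing.
function utf8Hex(text) {
  const bytes = new TextEncoder().encode(text); // always UTF-8
  return [...bytes].map((b) => b.toString(16).padStart(2, "0")).join(" ");
}

console.log(utf8Hex("A"));  // 41           one byte, same as ASCII
console.log(utf8Hex("é"));  // c3 a9        two bytes
console.log(utf8Hex("Ω"));  // ce a9        two bytes (Greek)
console.log(utf8Hex("∑"));  // e2 88 91     three bytes
console.log(utf8Hex("😀")); // f0 9f 98 80  four bytes
```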

Modern browsers work in UTF-8 internally and any other encoding is converted to UTF-8 when the page or file is read in. This means that the simplest approach is to create web pages and scripts as UTF-8 encoded files, i.e. the editor should be set to create and work with UTF-8 files.

If you want to include a Unicode character in HTML that is outside the usual range, i.e. one you cannot type using the default keyboard, then you can enter it using:

&#decimal;

or:

&#xhex;

where decimal and hex are the character codes in decimal and hex. For example:

&#x2211;

will display a mathematical summation sign, which looks like a Greek capital sigma but is a separate character:

∑

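A character reference like this always names a Unicode code point, so it works whatever encoding the page uses. If you type the ∑ character directly into the file and see something else when the page is loaded into a browser, the file is most likely being read as something other than UTF-8.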
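If you need to generate such references from a program, a minimal sketch is shown below. The function name toCharacterReferences is hypothetical, and it only rewrites non-ASCII characters; it does not escape HTML-special characters such as & or <.

```javascript
// Hypothetical helper: write every non-ASCII character as a hex reference.
function toCharacterReferences(text) {
  let out = "";
  // for...of iterates by code point, so characters beyond U+FFFF stay whole.
  for (const ch of text) {
    const codePoint = ch.codePointAt(0);
    out += codePoint < 0x80
      ? ch
      : `&#x${codePoint.toString(16).toUpperCase()};`;
  }
  return out;
}

console.log(toCharacterReferences("Sum: ∑")); // Sum: &#x2211;
```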

You also have to be careful about text that is processed by the server. For example, text stored in a database needs to be in the same representation that the server is going to use. Similarly, you have to pay attention to text processed by server-side languages like PHP.

The most important single idea is:

The browser always works with UTF-8 encoded data and, if it can, it will convert any other encoding as the web page is read in.

To do this it has to know what the encoding is and it has to “know” how to convert it.
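The same two requirements show up if you do the conversion yourself in JavaScript. In this sketch, with the hypothetical function name toUtf8, TextDecoder has to be told the source encoding and has to support it. Decoding produces an ordinary JavaScript string, which TextEncoder then writes out as UTF-8:

```javascript
// Hypothetical helper: convert bytes in a named encoding to UTF-8 bytes.
function toUtf8(bytes, sourceEncoding) {
  // Throws a RangeError if the encoding label is not one it knows.
  const text = new TextDecoder(sourceEncoding).decode(bytes);
  return new TextEncoder().encode(text);
}

// "Łódź" stored as four bytes in ISO 8859-2.
const latin2Bytes = new Uint8Array([0xa3, 0xf3, 0x64, 0xbc]);
const utf8Bytes = toUtf8(latin2Bytes, "iso-8859-2");

console.log(utf8Bytes.length); // 7, the three accented letters need two bytes each
console.log(new TextDecoder("utf-8").decode(utf8Bytes)); // Łódź
```

If the wrong source encoding is named, the conversion still succeeds but produces the wrong characters, which is why the declared character set has to match the one the file was actually saved in.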
