A Page on Unicode

The only languages that can comfortably be written with the repertoire of US-ASCII happen to be Latin, Swahili, Hawaiian and American English without most typographic frills. It is rumoured that there are more languages in the world. Roman Czyborra (roman@czyborra.com) on his page about ASCII.

Table of contents

Frequently Asked Questions about Unicode

Definitions and presentation

What is Unicode?

Unicode is a character set developed jointly by the Unicode Consortium and the International Organization for Standardization (under the name ISO-10646-1). It aims at providing support for the scripts of every living language on Earth, and of certain dead languages as well, plus a great number of technical, graphical, or decorative symbols. Unicode is supposedly the universal character set in that every other character set should be convertible to Unicode without loss of information.

What is ISO-10646-1? What is UCS?

ISO-10646 is the number of the ISO standard describing the Universal Character Set (UCS), also known as ISO-10646-1 (Information technology — Universal Multiple-Octet Coded Character Set (UCS) — Part 1: Architecture and Basic Multilingual Plane). The UCS and Unicode character sets are identical in the first 17 “planes”, i.e. in the first 1114112 characters. For nearly all purposes, they can be considered to be the same thing.

How many characters are there in Unicode?

This is not easy to answer. The ISO-10646-1 standard defines a possible total of 2147483648 characters, divided into 128 “groups” of 256 “planes” of 256 “rows” of 256 “cells”. The Unicode standard proper, however, addresses only the first 17 planes (of group 0), totaling 1114112 characters. And even within the Unicode range, only plane 0, the so-called “Basic Multilingual Plane” (BMP), in other words the first 65536 cells, has characters allocated in it. Thus, while UCS is strictly speaking a 31-bit character set, so far it can be considered as only 16-bit (and many applications have chosen to restrict themselves to this 16-bit BMP anyway).

Now if we wish to count the number of characters actually allocated, then version 3.0 of the standard defines 10236 (named) characters, not counting the 27786 CJK (“Han”: Chinese, Japanese, Korean) Unified Ideographs and the 11172 Hangul (Korean) syllables. With 65 control codes, this means that a total of 49259 cells are allocated in the Basic Multilingual Plane. Another 2048 cells, the “surrogates” region used in the UTF-16 encoding, cannot be used to code characters, and some 6400 more characters are reserved for private use. Finally, the last two codes are guaranteed not to be Unicode characters. Thus, the Basic Multilingual Plane is rather full at the moment.

Which languages and scripts are and are not supported by Unicode?

Beyond languages such as English which require no accented characters (but may require special punctuation symbols), the Latin-1 block of Unicode is already sufficient to support languages such as Danish, Dutch, Faroese, Finnish, Flemish, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish and Swedish. The Latin-Extended-A block adds the characters needed to typeset (among other languages) Afrikaans, Breton, Basque, Catalan, Croatian, Czech, Esperanto, Estonian, French, Frisian, Greenlandic, Hungarian, Latin, Latvian, Lithuanian, Maltese, Polish, Provençal, Rhaeto-Romanic, Romanian, Romany, Sami, Slovak, Slovenian, Sorbian, Turkish and Welsh. The Latin-Extended-B block adds support for a variety of African, Uralian, Central-Asian and similar languages, in Latin transcription. With the further Latin Extended Additional block, Unicode provides support for just about every language or transcription that uses the Latin alphabet, including the International Phonetic Alphabet which has its own block of extensions.

Beyond Latin, Unicode includes the Greek (and Coptic), Cyrillic, Armenian, Hebrew, Arabic, (Deva)nagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Laotian, Tibetan, and Georgian alphabets. In version 3.0 of the standard, Syriac, Thaana, Sinhala, Myanmar, Ethiopic, Cherokee, Canadian Aboriginal (syllabic), Ogham, Runic, Khmer and Mongolian alphabets were added.

Furthermore, Unicode supports over 25000 unified Chinese-Japanese-Korean ideograms, plus the syllabic writings of Japanese (Hiragana and Katakana) and Korean (Hangul).

On the other hand, there are also some scripts which are not included in Unicode. Notable examples are: Phoenician, Hieroglyphic, Demotic, Glagolitic, Linear A and B, Assyrian Cuneiform, ancient Hebrew, Maya pictograms. Some of these scripts could be added in later versions of the standard. Neither is there support for invented or fictional languages such as the Elvish or Dwarven scripts of Tolkien or the Klingon alphabet from the Star Trek series: the Unicode consortium has refused to consider such scripts for standardization.

Unicode and other character sets

What is ASCII?

ASCII (American Standard Code for Information Interchange) is a 7-bit character set that has now become completely standard, in the sense that it is recognized by very nearly every computer in the world (the only sizeable rival, EBCDIC, being practically nonexistent) and that every other standard character set extends ASCII. In particular, Unicode extends ASCII, so the first 128 Unicode cells coincide with the ASCII characters. ASCII defines 128 characters, of which 33 (namely, the first 32 and the last one) are control characters.

What is ISO-8859-1? What is Latin-1?

ISO-8859-1, also known as Latin-1, is a standard that defines an 8-bit character set which extends ASCII and which is probably the most widely-used character set in the world today (disclaimer: this completely ad hoc claim should not be taken seriously). Because of the popularity of Latin-1, the Unicode character set has chosen to maintain upward compatibility by extending it; in other words: the first 256 characters of Unicode (the “Latin-1 block”) are precisely those of ISO-8859-1.

Latin-1 defines 256 characters. The first 128 are precisely ASCII (in particular, the first 32 are control characters); the next 32 characters (from 0x80=128 through 0x9f=159) are again 32 control characters, and only the next 96 characters are actually used as printable characters (mostly accented letters).

Although it was inspired by the DEC MCS, ISO-8859-1 lacks some characters that the latter had, and which are needed in French (the major language based on the Latin alphabet not correctly supported by ISO-8859-1): namely the oe and OE ligatures and the upper case y with diaeresis (the last is admittedly a very minor point). This has made some people rather unhappy. The characters were reintroduced in Microsoft's CP1252 character set, together with some punctuation signs which ISO-8859-1 also lacks (em-dash, en-dash, quotes, ellipsis). They were also added as part of the ISO-8859-15 character set, in a different place. And, of course, they are in a yet different place in Unicode (beyond the 0x100 limit).

You can look at a table of ISO-8859-1 characters.

What is ISO-8859-n? (For various values of n.)

ISO-8859-n is a set of character sets which all extend ASCII by defining characters from 0xa0 through 0xff and leaving characters 0x80 through 0x9f as control characters.

ISO-8859-1 (Latin-1) has already been discussed. It is the character set used by “Western European” languages (in a very broad sense). ISO-8859-2 (Latin-2: table here) is used for “Central European” languages (Polish, Czech, Hungarian, etc.). ISO-8859-3 (Latin-3: table here) is used for “South European” languages (Turkish, Maltese, Galician, etc.) and Esperanto. ISO-8859-4 (Latin-4: table here) is used for “Baltic Rim” languages (Estonian, Latvian, Lithuanian); it has been more or less obsoleted by ISO-8859-10 (except for Latvian). ISO-8859-5 (table here) contains the Cyrillic alphabet (Russian, Serbian, Bulgarian, Byelorussian, Ukrainian, etc.); its layout was used in creating the Cyrillic block of Unicode (range 0x400–0x4ff). ISO-8859-6 (table here) is used for Arabic; its layout was used in creating the Arabic block of Unicode (range 0x600–0x6ff). ISO-8859-7 (table here) is used for (modern) Greek; its layout was used in creating the Greek block of Unicode (range 0x370–0x3ff); note that it is not sufficient to typeset ancient (polytonic) Greek. ISO-8859-8 (table here) is used for Hebrew; its layout was used in creating the Hebrew block of Unicode (range 0x590–0x5ff). ISO-8859-9 (Latin-5: table here) is used for Turkish: it is identical to ISO-8859-1 except that it replaces six characters used in Icelandic by characters used in Turkish. ISO-8859-10 (Latin-6: table here) is used for “Nordic” languages: it is an extension of ISO-8859-4 (Latin-4) to add support for Greenlandic (Inuit) and Sami (Lappish), while losing Latvian support at the same time (arguably a mistake).

ISO-8859-11 (table here) covers Thai, but it does not seem to have gone beyond the draft stage; and to the best of my knowledge ISO-8859-12 does not exist. ISO-8859-13 (Latin-7: table here) is Baltic again: it re-establishes the Latvian support lost in ISO-8859-10 (Latin-6). ISO-8859-14 (Latin-8: table here) is like ISO-8859-1 except that it replaces certain (non-letter) characters by characters used in Celtic languages (Welsh and Manx Gaelic notably).

Lastly, ISO-8859-15 (Latin-9, sometimes nicknamed Latin-0: table here) is like ISO-8859-1 except that it replaces certain (non-letter) characters by characters used in Estonian, Finnish and French, plus the Euro sign. There has been some debate about whether ISO-8859-15 should be made to replace ISO-8859-1 as the “default primary character set” (the French are all in favor of this, notably the fr.* Usenet hierarchy, which accepts ISO-8859-1 and ISO-8859-15 encoding but not UTF-8). However, the incompatibility with Windows' CP1252 character set, and the growing popularity and support of Unicode make this a bad idea. Hopefully the issue will be moot.

Naturally, all the characters found in any of the ISO-8859-n character sets are also found in Unicode. Certain of these standards have provided the layout structure for blocks of Unicode. Even when taken all together, they do not cover all of Unicode, of course, nor even the Latin characters of Unicode. All the Latin characters (with various diacritics, alterations or ligatures) can be found within the Latin-1, Latin-Extended-A, Latin-Extended-B or Latin Extended Additional blocks of Unicode.

What is CP1252?

CP1252 is a character set, introduced by Microsoft for the Windows operating system, that very much resembles ISO-8859-1: it extends it, except that it replaces the (largely useless) control characters 0x80–0x9f by very useful printable characters. These characters are found elsewhere in Unicode, of course (since Unicode is universal); however, note that because it does not match ISO-8859-1, CP1252 also does not match Unicode in the 0x80–0x9f range.

The good side of CP1252 is that it provides many useful characters not found in Latin-1 for typing Latin-alphabet languages like English or French. The bad side is that, since it very much looks like ISO-8859-1 without being strictly compatible with it (nor, a fortiori, with Unicode), one sometimes finds documents in the CP1252 character set that “claim” to be Latin-1, which can cause problems with truly standards-compliant software.
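To see the incompatibility concretely, here is a small Python illustration (the sample byte string is my own, hypothetical example): the same octets yield curly quotes under CP1252 but C1 control characters under ISO-8859-1.

```python
# The same octets read as CP1252 and as ISO-8859-1 (Latin-1):
data = b"\x93quoted\x94 some text"

# CP1252 maps 0x93/0x94 to the curly double quotation marks:
assert data.decode("cp1252") == "\u201cquoted\u201d some text"

# Latin-1 (and hence Unicode) maps them to C1 control characters:
assert data.decode("latin-1").startswith("\x93")
```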

You can look at a table of CP1252 characters.

What is CP437?

CP437 is a character set that (more or less) extends ASCII and that was used on the IBM PC under MS-DOS before Microsoft replaced it under Windows by the CP1252 character set. This character set is extremely weird, and now completely obsolete, but it sometimes resurfaces in strange ways; for example, under Microsoft Windows, typing an ALT + digits combination will enter a character from the CP437 character set if the number entered (in decimal) does not have a leading zero, and from the CP1252 character set if it does; most PC graphics cards still have the CP437 character set as their default text font. It contains some accented characters, but also some box drawing elements and some Greek letters (a very strange selection of Greek letters).

You can look at a table of CP437 characters.

What is DEC MCS?

MCS is the “Multilingual Character Set” used by Digital Equipment Corporation (DEC) (now Compaq) on their once ubiquitous VT220 terminal. The MCS inspired the ISO-8859-1 character set; in fact, it is arguably superior to it. It is quite similar to ISO-8859-1, but lacks support for Icelandic (the eth and thorn letters and the y with acute accent) and a few (mostly useless) symbols, while it has the capital Y with diaeresis and the French oe and OE ligatures. One disadvantage, though, is that it does not have the useful no-break space character (0xa0=160 in ISO-8859-1; the VT220 prints a reversed question mark instead).

You can look at a table of DEC MCS characters.

Show me some tables of these various character sets in relation to Unicode.

This page lists a certain number of common character sets in terms of Unicode characters. In other words, the characters making up the tables are Unicode characters (as they must be in HTML).

If you are interested specially in Latin characters (with various alterations or diacritics), this page shows the Latin characters (from Unicode, and in Unicode order) found in various common Latin character sets.

Note that these tables are of varying value. Despite Unicode's universality, character sets do not map to Unicode in a manner that is entirely satisfactory (generally this is “the character set's fault”, in that it does not use the same basic principles as Unicode for determining what does or does not constitute a character). Latin characters pose no real problem, and the mapping is very well defined. Similarly for basic punctuation and symbols. However, some areas of doubt concern: control characters (it is not always entirely clear what constitutes a control character and what is outside the character set, or indeed whether control characters are in fact part of Unicode), certain semantic vs. graphic values (it is not always easy to know whether a character in a character set that looks like a “mu” is a MICRO SIGN or a GREEK SMALL LETTER MU, for example), multivalued mappings (a character in the set may map to several Unicode characters), ligatures (several consecutive characters may map to a single Unicode character), corporate characters (the Apple logo from the Macintosh character sets is not part of Unicode, because that character is a registered trademark), and so on. These tables, therefore, are furnished merely as useful and amusing information, not as reference data destined to supplant the ISO standards.

If you want the data of the tables in computer-readable format, this tarball contains all the data that was used to produce them, as well as the Perl scripts I used. You can also browse the data file by file. Of course, in any case, the only reasonably authoritative data is the Unicode Consortium's distributed mapping tables from which most of my own data was derived.

Another source of useful information is the remarkable page by Roman Czyborra on the ISO 8859 Alphabet Soup.

Encodings of Unicode

What is an encoding?

A logical document is a sequence of characters, whereas a physical document is a sequence of octets (an octet is a byte: the term merely serves to insist upon the fact that it is an eight-bit byte). A character set is a list of potential characters, with numbers assigned to them (character codes), from which the characters making up the logical document are taken. An encoding is a way of representing the characters that make up the logical document as octets or sequences of octets.

If a character set contains at most 256 characters (e.g. ASCII, ISO-8859-1 or CP1252), so that all the possible character codes fit within an octet value, then it is feasible to encode the characters as octets simply by using their numbers in the character set: this is the transparent encoding for that character set.
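For instance, here is a small Python illustration: encoding a string with the iso-8859-1 codec is exactly this transparent encoding, since each character becomes the single octet equal to its character code.

```python
# The identity ("transparent") encoding of ISO-8859-1: each character
# is represented by the octet whose value is its character code.
s = "caf\u00e9"                    # 'café'; é has code 0xe9 in Latin-1
octets = s.encode("iso-8859-1")

assert octets == b"caf\xe9"
assert list(octets) == [0x63, 0x61, 0x66, 0xE9]   # the character codes
```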

Now if a character set is included in another (in the sense that all the characters of the one can be found somewhere in the other, but not necessarily with the same number), then encoding the one will provide an encoding for a subset of the other. Since the whole point of Unicode being a Universal character set is that every character set is a subset of Unicode, this also means that every character set's identity (or otherwise “natural”) encoding will provide an encoding for Unicode. These are only partial encodings in that they do not address all of Unicode, naturally. For example, the identity encoding for ISO-8859-1 (i.e. coding every ISO-8859-1 character through its 8-bit number) will provide an encoding for the first 256 characters of Unicode, which exactly match ISO-8859-1.

So it is now customary to view character sets, not really as character sets, but merely as subsets of Unicode together with an encoding for the subset in question. This is the now official position of the Internet and W3 Standards in the matter of character sets.

Beyond the encodings provided by the other character sets, Unicode has various total encodings (in the sense that they are capable of coding the entire Unicode range, not merely a subset of it) of its own, which we will describe.

What is UTF-8 and how does it work?

UTF-8 stands for “UCS Transformation Format” (8 bit). It is defined by the document UTF-8, a transformation format of ISO 10646 by F. Yergeau (RFC2279). It is the most official, and the most common, total encoding of Unicode.

UTF-8 has the property that it extends the identity encoding of the ASCII character set. In other words, an ASCII character will be encoded in the same manner (through the same octet) by the UTF-8 encoding of Unicode as by the identity encoding of ASCII. Furthermore any octet in the byte stream whose numerical value is less than 128=0x80 encodes an ASCII character (every other character in Unicode is encoded as a sequence of octets with numerical value at least 0x80). This ASCII-preservation property is extremely useful in getting legacy (non-Unicode-aware) applications to function with UTF-8-encoded Unicode data.

Here, now, is the definition of the UTF-8 encoding. Further details and comments can be found in the aforementioned RFC. If the character's code is less than 128=0x80 (i.e. it is an ASCII character) then it is encoded by the octet with the same value. If it is greater, then write the character's code in binary, and divide it into groups of six bits (ending, of course, with the six least significant bits); or, if you want, write its value in base 64. Arrange (by possibly adding leading zero bits) so that the leading (most significant) bit group contains actually not six bits but six minus the number of subsequent groups. In other words, you may divide the character code value's bits as 5+6 or 4+6+6 or 3+6+6+6 or 2+6+6+6+6 or 1+6+6+6+6+6. Use as few groups as possible. Each bit group will now be encoded as one octet in the stream, with the most significant group encoded first and the least significant (the last 6 bits) encoded last. Every group except the first is encoded in an octet by prepending 10 before the six-bit value (giving an octet value between 0x80=128 and 0xbf=191). The first group, on the other hand, is encoded by placing as many 1's as necessary and then a single 0 before the value bits to fill up the octet's eight bits; thus, the number of leading 1's in the octet is one more than the number of groups (and hence, octets) to follow. The leading octet always has a value between 0xc0=192 and 0xfd=253; the octet values 0xfe=254 and 0xff=255 are never used in the UTF-8 encoding.
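The algorithm just described can be sketched in a few lines of Python (a non-authoritative sketch; the function name utf8_encode is my own), checked against Python's built-in codec:

```python
def utf8_encode(code: int) -> bytes:
    """Encode one UCS character code (0 .. 0x7FFFFFFF) following the
    six-bit-group rule described above (RFC 2279)."""
    if code < 0x80:                        # ASCII: a single, identical octet
        return bytes([code])
    # n = number of continuation octets; chosen so that the leading
    # bit group fits in 6 - n bits (i.e. use as few groups as possible):
    n = 1
    while code >> (6 * n) >= 1 << (6 - n):
        n += 1
    # Leading octet: n+1 one-bits, a zero bit, then the leading group:
    lead = ((0xFF << (7 - n)) & 0xFF) | (code >> (6 * n))
    # Continuation octets: 10 followed by each six-bit group in order:
    tail = [0x80 | ((code >> (6 * i)) & 0x3F) for i in range(n - 1, -1, -1)]
    return bytes([lead] + tail)

# Checked against Python's built-in codec:
for cp in (0x41, 0xE9, 0x2665, 0x4E2D):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```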

The encoding is evidently unambiguous. It maps the full ISO-10646-1 range and not merely the part corresponding to Unicode proper. Furthermore, it is “context-free” in the sense that the interpretation of a single octet in the stream depends on the lookbehind of at most a bounded number (five) of previous octets. Each Unicode character is represented by a unique octet sequence, and the occurrence of that octet sequence in the UTF-8 octet stream means that this character is present.

An ASCII character is represented by UTF-8 as a single octet. Other Unicode characters up to character number 0x7ff are represented by two octets (this includes the Latin-1 characters, the Latin-Extended-A and Latin-Extended-B blocks, IPA extensions, spacing modifier letters, combining diacritics, (modern) Greek and Coptic, Cyrillic, Armenian, Hebrew, Arabic, Syriac and Thaana). Every other character in the Basic Multilingual Plane (and thus, every other Unicode character defined so far) is encoded as three octets.

What is UTF-16 and how does it work?

UTF-16 stands for “UCS Transformation Format” (16 bit). It is defined by the document UTF-16, an encoding of ISO 10646 by P. Hoffman and F. Yergeau (RFC2781).

Unlike UTF-8, UTF-16 is a sixteen-bit encoding in that it works with pairs of octets rather than single octets. For this reason, it does not preserve the ASCII character set (an ASCII character is encoded as two octets under UTF-16).

The UTF-16 encoding is simple to define. A character with value less than 0x10000=65536 (in other words, a character in the Basic Multilingual Plane) is encoded by its 16-bit value. This 16-bit value is broken into two octets. Whether the higher-order 8 bits are placed first or last depends on the chosen “endianness”: UTF-16 may be encoded either as “big-endian” (aka “network order”: higher-order 8 bits placed first) or “little-endian” (higher-order 8 bits placed last). To make endianness detection possible when it is not specified by an external indication (such as the name UTF-16BE or UTF-16LE of the encoding), some formats recommend (or even demand) that the Unicode character 0xfeff (ZERO WIDTH NON BREAKING SPACE, aka “byte-order mark”) be prepended to the character stream; indeed, since the Unicode character 0xfffe is guaranteed not to exist, the octet stream can be interpreted as big- or little-endian depending on whether it starts with 0xfe 0xff or 0xff 0xfe.
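As an illustration (using Python's standard codecs; the helper name is mine, not from any standard), the endianness can be read off the first two octets exactly as described:

```python
def utf16_endianness(octets: bytes) -> str:
    """Guess UTF-16 endianness from a leading byte-order mark."""
    if octets[:2] == b"\xfe\xff":
        return "big-endian"
    if octets[:2] == b"\xff\xfe":
        return "little-endian"
    return "unknown (no byte-order mark)"

# Prepending U+FEFF and encoding in each byte order:
assert utf16_endianness("\ufeffHi".encode("utf-16-be")) == "big-endian"
assert utf16_endianness("\ufeffHi".encode("utf-16-le")) == "little-endian"
```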

To encode a character whose code is greater than or equal to 0x10000 but at most 0x10ffff in UTF-16, proceed as follows: first subtract 0x10000 from its value. Then split the resulting 20-bit value into two ten-bit values. Add the bits 110110 at the start of the top ten-bit value to yield a first sixteen-bit value. Add the bits 110111 at the start of the lower-order ten-bit group to yield a second sixteen-bit value. Encode each sixteen-bit value as a two-octet sequence as previously, and place them in this order in the octet stream. The sixteen-bit value coding the top ten bits of the reduced character's value, which lies between 0xd800 and 0xdbff, is known as a high surrogate; the sixteen-bit value coding the bottom ten bits of the reduced character's value, which lies between 0xdc00 and 0xdfff, is known as a low surrogate. Regardless of the encoding's endianness for octet values, the high surrogate must always be placed first. The Unicode standard guarantees that no Unicode character will ever have a value between 0xd800 and 0xdfff, so that 16-bit values in this range are always surrogates.
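The computation just described can be sketched as follows (a minimal Python sketch, with a function name of my own choosing), checked against the standard utf-16-be codec:

```python
def utf16_surrogates(code: int) -> tuple:
    """Split a character code in 0x10000..0x10FFFF into a
    (high surrogate, low surrogate) pair, as described above."""
    assert 0x10000 <= code <= 0x10FFFF
    v = code - 0x10000            # the reduced 20-bit value
    high = 0xD800 | (v >> 10)     # 110110 + top ten bits
    low = 0xDC00 | (v & 0x3FF)    # 110111 + bottom ten bits
    return high, low

# U+1D11E MUSICAL SYMBOL G CLEF -> d8 34 dd 1e in UTF-16 big-endian:
assert utf16_surrogates(0x1D11E) == (0xD834, 0xDD1E)
assert chr(0x1D11E).encode("utf-16-be") == b"\xd8\x34\xdd\x1e"
```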

The encoding is unambiguous because surrogates and Unicode character codes do not overlap. It does not map the full ISO-10646-1 range but merely the part corresponding to Unicode proper, i.e. the first 17 planes. It is “context-free”, but only as a sequence of 16-bit values; as a sequence of octets, it depends on one context parameter, namely the octet index's parity. Removing a single octet from a UTF-16 stream will cause the stream to be interpreted incorrectly.

Every character in the Basic Multilingual Plane (and thus, every Unicode character defined so far) is encoded as two octets by the UTF-16 encoding; other Unicode characters are encoded as four octets by using a surrogate pair. This makes UTF-16 more efficient than UTF-8 for characters in the 0x800–0xffff range (notably all Asian scripts), whereas UTF-8 is more efficient than UTF-16 for ASCII characters.
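A quick Python check of these octet counts, using the standard codecs:

```python
# Octet counts for the same strings under UTF-8 and UTF-16:
assert len("Hello".encode("utf-8")) == 5        # 1 octet per ASCII character
assert len("Hello".encode("utf-16-be")) == 10   # always 2 octets per BMP character

assert len("\u4e2d\u56fd".encode("utf-8")) == 6      # 中国: 3 octets each in UTF-8
assert len("\u4e2d\u56fd".encode("utf-16-be")) == 4  # but only 2 each in UTF-16
```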

What is UTF-7 and how does it work?

UTF-7 stands for “UCS Transformation Format” (7 bit). It is defined by the document UTF-7 A Mail-Safe Transformation Format of Unicode by D. Goldsmith and M. Davis (RFC2152). UTF-7 is used instead of UTF-8 to encode Unicode characters as 7-bit values (identified with ASCII characters, in fact with printable ASCII characters) when the handling software is not “8-bit clean”. Strictly speaking, UTF-7 is not really an encoding of Unicode but rather a transformation applied to mail and related formats that happen to contain Unicode characters.

UTF-7 is rather complicated to define. First we must divide the printable ASCII characters as follows: Set D (the “directly encoded characters”) consists of the upper- and lower-case letters A through Z and a through z, the digits 0 through 9, and the nine special characters ' ( ) , - . / : and ?; Set O (the “optional direct characters”) consists of all the other printable ASCII characters except the plus sign, the reverse solidus (backslash) and the tilde.

UTF-7 has two modes, or contexts: direct encoding, and Unicode shifted encoding. In the direct encoding mode, characters are encoded as themselves. This applies only to the characters of Set D, and optionally to those of Set O (since some software might treat them specially), as well as whitespace characters. Unicode shifted encoding starts when a plus sign is encountered. Any Unicode character may be encoded in this mode.

To encode characters in the Unicode shifted encoding mode of UTF-7, start by transforming them into a sequence of 16-bit values. Characters with values between 0x10000 and 0x10ffff are encoded to 16-bit values using the surrogates mechanism defined in UTF-16. The sequence of 16-bit quantities is then read as a stream of bits by writing the 16-bit quantities with the high-order bit first. The bit sequence is then padded with zeroes to a length that is a multiple of six bits, and is read as a sequence of six-bit blocks, representing values from 0 to 63. Each of these values is encoded as a character in the so-called “base-64 alphabet”: the upper-case letters A through Z represent values from 0 through 25, the lower-case letters a through z represent those from 26 through 51, the digits 0 through 9 represent 52 through 61, the plus sign represents 62 and the solidus (slash) represents 63. These characters are written, in order, after the plus sign that started the Unicode shifted encoding mode.

Unicode shifted encoding mode terminates (from the decoder's point of view) whenever any character outside the base-64 alphabet is encountered in the UTF-7 stream. If said character is any other than the hyphen-minus, then it is also part of the encoded Unicode character stream (directly encoded). If it is a hyphen-minus, then it is absorbed. In other words, looking from the encoder's point of view, to terminate the Unicode shifted encoding mode, consider the next character to be (directly) encoded. If this character is outside the base-64 alphabet and is not a hyphen-minus, just add it to the UTF-7 stream. If it is in the alphabet or if it is a hyphen-minus, add it with a hyphen-minus before it.

As a special exception to the rules, a sequence formed by a plus sign immediately followed by a hyphen-minus (which ought to code an empty character sequence) actually codes a plus sign.
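These rules can be checked with Python's standard utf-7 codec; the first two byte strings below are the examples given in RFC 2152 itself:

```python
# RFC 2152's own examples, via Python's standard "utf-7" codec:
assert b"A+ImIDkQ.".decode("utf-7") == "A\u2262\u0391."      # A, NOT IDENTICAL TO, ALPHA, .
assert b"Hi Mom -+Jjo--!".decode("utf-7") == "Hi Mom -\u263a-!"

# The special "+-" exception codes a literal plus sign:
assert "+".encode("utf-7") == b"+-"
assert b"+-".decode("utf-7") == "+"
```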

What is UTF-5 and how does it work?

UTF-5 stands for “UCS Transformation Format” (5 bit). It is defined by the document UTF-5, a transformation format of Unicode and ISO 10646 by James Seng, Martin Dürst and Tin Wee Tan (draft-jseng-utf5: Internet Draft Work in Progress). UTF-5 is a specialized encoding: it is not really specific to Unicode, and it could be used in much more general contexts, but its primary target is to support the internationalized DNS system.

UTF-5 is very simple to define. Take the Unicode character stream. Write each character's code in binary, and group its bits by blocks of four (or, equivalently, write the value in hexadecimal). Use as few groups as possible (i.e. strip away all leading groups of zeros, except if the actual value to be coded is 0). Encode the first group of four bits by one of the letters (ASCII characters) G through V, representing the values 0 through 15. Encode the subsequent groups of four bits by the digits 0 through 9 (representing the values 0 through 9) or A through F (representing the values 10 through 15).
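Since, to my knowledge, no standard library implements UTF-5, here is a minimal sketch of the rule just stated (the function name is my own); the expected outputs agree with the example table further down:

```python
def utf5_encode(code: int) -> str:
    """UTF-5 encoding of one character code: four-bit groups, with the
    leading group mapped to 'G'..'V' and the others to '0'..'9','A'..'F'."""
    groups = []
    while True:
        groups.insert(0, code & 0xF)   # take four bits at a time
        code >>= 4
        if code == 0:
            break
    lead = "GHIJKLMNOPQRSTUV"[groups[0]]
    rest = "".join("0123456789ABCDEF"[g] for g in groups[1:])
    return lead + rest

assert utf5_encode(0x48) == "K8"      # 'H'
assert utf5_encode(0x417) == "K17"    # CYRILLIC CAPITAL LETTER ZE
assert utf5_encode(0x2665) == "I665"  # BLACK HEART SUIT
```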

UTF-8 and UTF-5 are quite similar. Both are “context-free” (so the loss or corruption of octets in the stream does not corrupt the data beyond a bounded distance). Both can encode the entire ISO-10646-1 Universal Character Set and not merely the Unicode range proper. Naturally, UTF-5 is less efficient than UTF-8: it takes two octets to encode one ASCII character, three to encode a Unicode character with value less than 0x1000 (this includes the Latin-1 characters, the Latin-Extended-A and Latin-Extended-B blocks, IPA extensions, spacing modifier letters, combining diacritics, (modern) Greek and Coptic, Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Laotian and Tibetan), and every other character in the Basic Multilingual Plane (and thus, every other Unicode character defined so far) is encoded as four octets.

Show me some examples of Unicode strings encoded in various ways

This is easily done:

Unicode string Hello, World! Здравствуй Мир! Γνῶθι σεαυτόν “foo” I♥NY! あおい 中国 한국어 A≢Α
Decimal code values 72 101 108 108 111 44 32 87 111 114 108 100 33 1047 1076 1088 1072 1074 1089 1090 1074 1091 1081 32 1052 1080 1088 33 915 957 8182 952 953 32 963 949 945 965 964 8057 957 8220 102 111 111 8221 73 9829 78 89 33 12354 12362 12356 20013 22269 54620 44397 50612 65 8802 913
Hexadecimal code values 48 65 6c 6c 6f 2c 20 57 6f 72 6c 64 21 417 434 440 430 432 441 442 432 443 439 20 41c 438 440 21 393 3bd 1ff6 3b8 3b9 20 3c3 3b5 3b1 3c5 3c4 1f79 3bd 201c 66 6f 6f 201d 49 2665 4e 59 21 3042 304a 3044 4e2d 56fd d55c ad6d c5b4 41 2262 391
UTF-8 encoding (hex octets) 48 65 6c 6c 6f 2c 20 57 6f 72 6c 64 21 d0 97 d0 b4 d1 80 d0 b0 d0 b2 d1 81 d1 82 d0 b2 d1 83 d0 b9 20 d0 9c d0 b8 d1 80 21 ce 93 ce bd e1 bf b6 ce b8 ce b9 20 cf 83 ce b5 ce b1 cf 85 cf 84 e1 bd b9 ce bd e2 80 9c 66 6f 6f e2 80 9d 49 e2 99 a5 4e 59 21 e3 81 82 e3 81 8a e3 81 84 e4 b8 ad e5 9b bd ed 95 9c ea b5 ad ec 96 b4 41 e2 89 a2 ce 91
UTF-16 big-endian encoding (hex octets) 00 48 00 65 00 6c 00 6c 00 6f 00 2c 00 20 00 57 00 6f 00 72 00 6c 00 64 00 21 04 17 04 34 04 40 04 30 04 32 04 41 04 42 04 32 04 43 04 39 00 20 04 1c 04 38 04 40 00 21 03 93 03 bd 1f f6 03 b8 03 b9 00 20 03 c3 03 b5 03 b1 03 c5 03 c4 1f 79 03 bd 20 1c 00 66 00 6f 00 6f 20 1d 00 49 26 65 00 4e 00 59 00 21 30 42 30 4a 30 44 4e 2d 56 fd d5 5c ad 6d c5 b4 00 41 22 62 03 91
UTF-5 encoding (ASCII characters) K 8 M 5 M C M C M F I C I 0 L 7 M F N 2 M C M 4 I 1 K 1 7 K 3 4 K 4 0 K 3 0 K 3 2 K 4 1 K 4 2 K 3 2 K 4 3 K 3 9 I 0 K 1 C K 3 8 K 4 0 I 1 J 9 3 J B D H F F 6 J B 8 J B 9 I 0 J C 3 J B 5 J B 1 J C 5 J C 4 H F 7 9 J B D I 0 1 C M 6 M F M F I 0 1 D K 9 I 6 6 5 K E L 9 I 1 J 0 4 2 J 0 4 A J 0 4 4 K E 2 D L 6 F D T 5 5 C Q D 6 D S 5 B 4 K 1 I 2 6 2 J 9 1
UTF-7 encoding (further encoded as a C string) "Hello,\040" "World+ACE-" "+BBcENARA" "BDAEMgRB" "BEIEMgRD" "BDkAIAQc" "BDgEQAAh-" "+A5MDvR/2" "A7gDuQAg" "A8MDtQOx" "A8UDxB95" "A70-" "+IBw-" "foo" "+IB0-" "I+In0-" "NY+ACE-" "+MEIw" "SjBE-" "+Ti1W" "/Q-" "+1Vyt" "bcW0-" "A+Im" "IDkQ-"
UTF-8 encoding (further encoded as a C string) "Hello,\040" "World!" "\320\227\320\264" "\321\200\320\260" "\320\262\321\201" "\321\202\320\262" "\321\203\320\271" "\040\320\234" "\320\270\321\200!" "\316\223\316\275" "\341\277\266" "\316\270\316\271" "\040\317\203" "\316\265\316\261" "\317\205\317\204" "\341\275\271" "\316\275" "\342\200\234" "foo" "\342\200\235" "I\342\231\245" "NY!" "\343\201\202" "\343\201\212" "\343\201\204" "\344\270\255" "\345\233\275" "\355\225\234" "\352\265\255" "\354\226\264" "A\342\211" "\242\316\221"

Unicode and other standards

How do I write Unicode characters in HTML?

The first thing to realize, and it tends to be forgotten, is that the character data of any HTML page consists of Unicode characters. The HTML recommendation (specification) by the W3 Consortium specifies, in section 5.1, that HTML uses “the much more complete character set called the Universal Character Set (UCS), defined in [ISO10646]”. For example, the text you are currently reading is, between the tags, a collection of Unicode characters, and so is every other Web page written in HTML.

In fact, this is not at all surprising, given Unicode's universality properties. But it tends to be forgotten, for two main reasons. First, HTML source tends to use an encoding other than UTF-8, generally one derived from a character set (frequently an ISO-8859-n character set) that does not cover all of Unicode; people often incorrectly assume that they therefore cannot use Unicode characters outside their encoding's character set; as we shall see, this is wrong. Second, the popular browser Communicator by Netscape has a completely absurd handling of character data, which has led people either to abstain from using entities and characters outside of the encoding character set, or to use them incorrectly.

So let us straighten this out once and for all: every HTML page can access every Unicode character, no matter what character set is used to encode it. For example, this page is encoded as US-ASCII, meaning that the HTML source only uses standard ASCII characters; but it still contains many non-ASCII characters in the actual data content; note that because it is encoded as ASCII, this page should appear exactly the same no matter what “document character set” (i.e. encoding) it is read as, provided the character set in question preserves ASCII (as nearly all of them do).

The most obvious way to use Unicode characters in an HTML page is to encode it as UTF-8 or some other encoding scheme that maps all of Unicode. But, as we have just pointed out, this is not necessary. Note that if one chooses to do so, one should explicitly indicate the encoding used, since the default encoding is ISO-8859-1 (see Hypertext Transfer Protocol — HTTP/1.1 by R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach and T. Berners-Lee (RFC2616), section 3.7.1, page 26; but compare with section 5.2.2 of the HTML specification). There are several ways to do that: one is to use the HTTP headers as specified by sections 3.4 (page 21) and 14.17 (page 124) of the above-mentioned RFC2616; another is to explicitly add a META HTTP-EQUIV tag as suggested by the aforementioned RFC section; yet another is to use XHTML and to insert an XML encoding declaration in the prolog.
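Concretely, the first method amounts to the server sending a header such as Content-Type: text/html; charset=utf-8 with the document. The other two methods put the declaration inside the document itself; as a sketch (the fragments below are illustrative, not quoted from any particular page):

```html
<!-- Method 2: a META HTTP-EQUIV tag inside the <head> of an HTML document -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

<!-- Method 3: for XHTML, an XML declaration at the very start of the file -->
<?xml version="1.0" encoding="utf-8"?>
```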

But, once more, no matter what the chosen encoding is, it is always possible to add any Unicode character in the HTML document. The generic method is to use numeric character references: just write &#N; where N is the Unicode number of the character to insert, in decimal (one can also write it as &#xN; and write N in hexadecimal — using upper-case or lower-case letters indifferently — but using decimal is preferred over hexadecimal). For example, to write the Chinese ideograms (“中国”) meaning “China” in an HTML file one would write “&#20013;&#22269;” in the source.
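As a sketch of this mechanism (not part of the original page), the following Python fragment produces the decimal numeric character references for any non-ASCII characters in a string:

```python
def to_ncr(text: str) -> str:
    # Replace every non-ASCII character by its decimal numeric
    # character reference &#N;, leaving ASCII characters untouched.
    return "".join(
        ch if ord(ch) < 0x80 else f"&#{ord(ch)};"
        for ch in text
    )

print(to_ncr("中国"))   # &#20013;&#22269;
print(to_ncr("China"))  # China (pure ASCII passes through unchanged)
```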

For certain characters, there is a more pleasant way of writing the character, namely to use an HTML predefined character entity: these have the form &name; where name is a certain alphanumeric string. The full list of HTML predefined character entities is given on this page (according to chapter 24 of the HTML specification). For example, to write “cliché” in an HTML document (without using the ISO-8859-1 é character directly), one writes “clich&eacute;”.
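To check what a given entity or reference stands for, one can (for instance) use Python's standard html module, which knows the HTML named character references; this is a small verification sketch, not part of the original page:

```python
import html

# Named entities and numeric character references decode to the
# same Unicode characters.
print(html.unescape("clich&eacute;"))     # cliché
print(html.unescape("&#20013;&#22269;"))  # 中国

# &eacute; and &#233; are two spellings of the same character, U+00E9.
assert html.unescape("clich&eacute;") == html.unescape("clich&#233;")
```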

How did the internationalization of HTML evolve?

Version 2.0 of HTML is the first that was really popular, and it is the first formal definition of HTML. It is described by the document Hypertext Markup Language — 2.0 by T. Berners-Lee and D. Connolly (RFC1866), dating from 1995/11. Already at the time, the authors were aware of the necessity of internationalization (aka “i18n” because there are 18 characters between the “i” and the “n”): the RFC defines (section 1.2.1, third asterisk) the minimal document character set as ISO-8859-1, and requires that any other characters be mapped at the positions assigned to them by ISO-10646-1, thus acknowledging the value and importance of Unicode.

Version 3.2 of HTML was released as a recommendation of the W3 Consortium on 1997/01/14. It is, in a way, a step back with respect to HTML 2.0 as far as i18n is concerned. Indeed, it does not mention ISO-10646-1 and uses exclusively the ISO-8859-1 (Latin-1) character set. This restriction, of course (also present in earlier versions of HTML), was unacceptable to those people living outside the Western European character set's applicability zone: they did not hesitate to encode HTML with other character sets. To end the mess, an RFC was written: Internationalization of the Hypertext Markup Language by F. Yergeau, G. Nicol, G. Adams and M. Dürst (RFC2070). This paved the way toward HTML 4.0: the RFC reaffirms the importance of Unicode and states (section 2.2) that “The document character set, in the SGML sense, is the Universal Character Set (UCS) of ISO 10646:1993 [ISO-10646], as amended. Currently, this is code-by-code identical with the Unicode standard, version 1.1 [UNICODE].” It also introduces some new HTML entities for the production of ligatures and bidirectional text in Unicode (zwnj, zwj, lrm and rlm).

HTML 4, another W3C recommendation, made it perfectly clear that Unicode was the character set of every HTML document (see above).

In what way is Netscape Navigator 4.x's handling of HTML characters erroneous?

Netscape Communicator was designed to handle not version 4 of HTML but the older version 3.2 (see above). Consequently, of the HTML predefined character entities (see above for an explanation, or this page for a list), it ignores all those which are not part of ISO-8859-1 (except, for some strange reason, the “Euro sign” character in some versions), because they are new in HTML 4.0.

In particular, this means that character entities like &mdash; or &hellip; (which are necessary to punctuate English properly) will appear literally as such in the text. Not good.

Worse than that, Netscape chooses its display fonts based on the character encoding of the document; thus, it cannot display a character that lies outside the document's particular transfer encoding. This is absurd because, as we have explained, any HTML document can refer to any Unicode character using numeric character references; so the whole value of numeric character references is lost. Furthermore, it might be necessary to change the document encoding even for documents encoded in plain ASCII (such as this one) in order to see some characters properly.

Fortunately, this terrible behavior has been fixed in the new Mozilla project, which is destined to form the core of the forthcoming version 6 of Netscape Communicator. I encourage all Netscape users to switch to Mozilla.

Can I include Unicode characters in Uniform Resource Identifiers (URI)?

Well, yes and no. The correct way to include Unicode characters in Uniform Resource Identifiers, as defined in Uniform Resource Identifiers (URI): Generic Syntax by T. Berners-Lee, R. Fielding and L. Masinter (RFC2396), is to first encode them as octets using the UTF-8 transformation format, and then to represent those octets which are not safe ASCII characters by %-encoding, i.e. a percent sign followed by two hexadecimal digits.

This statement is actually pretty meaningless. What I am saying is that if you wish to use the URI http://www.somewhere.tld/ユアライ/ you should write it as http://www.somewhere.tld/%e3%83%a6%e3%82%a2%e3%83%a9%e3%82%a4/. But of course, if you encode it in such a way, nothing says that it represents the katakana characters I have written: it is a mere and uninteresting sequence of octets.

This is why a new term, IURIs, or Internationalized Uniform Resource Identifiers, has been introduced. The document defining them is Internationalized Uniform Resource Identifiers (IURI) by Larry Masinter and Martin Dürst (draft-masinter-url-i18n: Internet Draft Work in Progress). IURIs are simply URIs in which arbitrary Unicode characters are permitted. One maps an IURI to the corresponding URI by doing what we have just explained: encode as UTF-8 and then %-encode the characters. The introduction of IURIs merely means that it is left to the software (and not the user) to convert Unicode characters as appropriate; so the user should be able to type Unicode characters in a browser's location bar, and the browser should take care of transforming this IURI into a pure URI to submit it to Internet protocols (such as HTTP).
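The IURI-to-URI mapping just described can be sketched as follows (a hypothetical helper written for this page, using Python's standard urllib; it is not taken from the Internet Draft):

```python
from urllib.parse import quote

def iuri_to_uri(iuri: str) -> str:
    # Leave ASCII characters (including reserved URI delimiters and any
    # existing %-escapes) untouched; encode each non-ASCII character as
    # UTF-8 and %-escape the resulting octets.
    return "".join(
        ch if ord(ch) < 0x80 else quote(ch, safe="")
        for ch in iuri
    )

print(iuri_to_uri("http://www.somewhere.tld/ユアライ/"))
# http://www.somewhere.tld/%E3%83%A6%E3%82%A2%E3%83%A9%E3%82%A4/
```

This is exactly the transformation a Unicode-aware browser would apply before submitting the request over HTTP.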

When writing the reference, it is possible to write it as an IURI, leaving it to the software to transform it correctly, or as a URI.

Note that in an HTML document one has even more choices, because character entities can be used to form an IURI. So the following are equivalent:

http://www.somewhere.tld/ユアライ/
http://www.somewhere.tld/&#12518;&#12450;&#12521;&#12452;/
http://www.somewhere.tld/%e3%83%a6%e3%82%a2%e3%83%a9%e3%82%a4/

The third form, however, is preferred, because it requires no special ability on the client's part, and will thus guarantee compatibility with non-Unicode aware browsers. In all cases, the server www.somewhere.tld will be contacted, and the HTTP request will look like “GET /%e3%83%a6%e3%82%a2%e3%83%a9%e3%82%a4 HTTP/1.1”.

On the design and principles of Unicode

What is a (Unicode) character? What is unification?

This question may seem trivial, but in fact it is highly non-obvious. One of the basic problems is this: when are two characters identical? In other words, when to unify and when not to.

To give an idea of the complexity of the problems involved, consider the following. The Greek capital letter Alpha is graphically identical to the Latin (Roman) capital letter A. In fact, a lot of Greek capital letters are graphically identical to their Roman counterparts. For example, the TeX document typesetting system makes no distinction between them and its font encoding contains only those Greek capitals (Γ, Δ, Θ, Λ, Ξ, Π, Σ, ϒ, Φ, Ψ and Ω) which are not already Latin capitals in some way. So are the Greek capital letter Alpha and the Latin capital letter A the same character? Well, it may be tempting to identify them, especially as both correspond roughly to the same vowel; but it is much less satisfactory to identify the Greek capital rho (a kind of R) with the Latin capital P, even though they are graphically identical: not unless you want to start confusing “P” and “R” at any rate. Besides, the Greek letters are quite different from their Latin cousins in lower-case (except Omicron, which is identical to O both in upper-case and in lower-case). So the Unicode Consortium decided to keep Greek characters entirely separate from Latin ones, because they are “different scripts”. But then we have a problem with a couple of scripts that use mostly Latin characters plus a couple of Greek ones as well. A couple of Greek letters have therefore been “Latinized”, meaning that a (rather artificial) Latin version of them has been introduced; for example, Unicode has a LATIN CAPITAL LETTER GAMMA (Ɣ) and a LATIN SMALL LETTER GAMMA (ɣ): what in the world is a Latin Gamma?
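The Latin/Greek separation described above is easy to observe with Python's standard unicodedata module: graphically identical letters carry distinct code points and names (a small illustration, not part of the original page):

```python
import unicodedata

# Latin A, Greek Alpha and Cyrillic A look alike in most fonts,
# but Unicode keeps them as three distinct characters.
for ch in ("A", "\u0391", "\u0410"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+0041 LATIN CAPITAL LETTER A
# U+0391 GREEK CAPITAL LETTER ALPHA
# U+0410 CYRILLIC CAPITAL LETTER A
```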

On the other hand, consider italics: certainly they are not a separate script but a mere typographical variant of the roman script. However, the italic-type (lower-case) g differs from its roman counterpart. In fact, in some cases, the italic-type g is used in roman, with a special meaning (such as in the phonetic alphabet). For this reason, Unicode has a LATIN SMALL LETTER SCRIPT G (ɡ). But how is this character supposed to be represented in italics?

What about ligatures? Is the ligature frequently found between the letters f and i a character in its own right? Unicode considers that it isn't (though it does allocate a code point for the ligature, use of this character is deprecated). However, it considers that the ligatures æ (a+e, found in Latin) and œ (o+e, found in French) are characters in their own right: indeed there are examples of words containing the letters ae in Latin, or oe in French, that do not form ligatures. But shouldn't it be rather that the ligature is the “standard state” (in Latin, as far as I know, every word containing “ae” forms the ligature except derivations from “aer” — in fact, the latter is sometimes written “aër”) and that when the letters don't form a ligature they should have a special “non joiner” character between them (such a character is found in Unicode)? Well, this is the case for some scripts but apparently not for the two ligatures just mentioned.
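The deprecated fi ligature mentioned above illustrates how such presentation forms can be folded back onto their constituent letters, via Unicode's compatibility normalization (a sketch using Python's unicodedata module):

```python
import unicodedata

# U+FB01 LATIN SMALL LIGATURE FI is a compatibility character:
# NFKC normalization decomposes it into the two letters f and i.
print(unicodedata.normalize("NFKC", "\ufb01"))  # fi

# By contrast, æ (U+00E6) is a character in its own right and has no
# compatibility decomposition: it survives normalization unchanged.
print(unicodedata.normalize("NFKC", "\u00e6"))  # æ
```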

What about the V/u character of Latin? Whereas some texts distinguish the consonant form (V/v) from the vowel form (U/u), many prefer to use a single character, which is V in upper-case and u in lower-case. Should that be a separate character? Unicode considers that it shouldn't.

Or again, what about the various forms that a single letter can take, in Arabic for example? We might have four different glyphs, for the isolated, initial, final and middle forms of the letter. Are these separate characters? Again, Unicode considers that they aren't.

This gives some idea of the choices that must be made in the matter. It is important to note that both excessive unification and insufficient unification have their problems. Excessive unification leads to loss of information, especially when performing a back-and-forth conversion from a charset that does not have the unification in question (and this is bad because Unicode is supposed to be universal). Also, unification of different glyphs (i.e. declaring that two different glyphs are actually mere typographical variants, or presentation forms, of a single character) implies that producing a Unicode font is going to be more difficult. Insufficient unification, on the other hand, leads to spurious “information artefacts” which can be more than simply bothersome. For example, if two identical glyphs are kept as separate characters (such as A and Alpha), a terminal or font might think it is incapable of displaying one when in fact it can do so, simply because it is a different character. Or searching for a string might not match a string which is “essentially identical” because it has been represented in a different way.

There is no easy answer to the question of what a character is. Unfortunately, it all too often happens that “not enough unification” is already too much unification. No decision can be made in a systematic way. However, we can outline a few of the general principles on the unification of Unicode characters (and, more generally, deciding what constitutes a character):

Miscellaneous questions

I'm only interested in English. So I don't care about Unicode, right?

Wrong. Even if we ignore the few loanwords in English which have kept foreign diacritics (e.g. “cliché”), plain ASCII (or even ISO-8859-1) is not sufficient to typeset English properly. For Unicode does not merely address the question of internationalization (i18n): it also provides some important punctuation signs not found in ASCII but used in English, namely the em-dash (used as a kind of parenthesis or to indicate a pause in a sentence), the en-dash (used in digit ranges), the English left and right double and single quotes, and the ellipsis.
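To make the claim concrete, here are the code points in question, listed with Python's unicodedata module (an illustrative sketch, not part of the original page):

```python
import unicodedata

# English punctuation that plain ASCII (and even ISO-8859-1) lacks.
for cp in (0x2014, 0x2013, 0x2018, 0x2019, 0x201C, 0x201D, 0x2026):
    ch = chr(cp)
    print(f"U+{cp:04X}  {ch}  {unicodedata.name(ch)}")
# U+2014 EM DASH, U+2013 EN DASH, U+2018/U+2019 the single quotes,
# U+201C/U+201D the double quotes, U+2026 HORIZONTAL ELLIPSIS.
```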

David Madore

Last modified: $Date: 2002/06/17 22:41:33 $