David Madore's WebLog: The Unicode standard, version 4.0

I have just received from Amazon.com my copy of the Unicode standard, version 4.0. For those who do not know what this is, the Unicode standard is, in a nutshell, a computer standard that seeks to provide a uniform encoding (computer representation) for all human scripts, past and present—an infinite job, of course, that will never be complete, but which is nevertheless proceeding at its own pace. Unicode is what permits any well-conceived computer file format, for example any HTML page, to contain characters, even mixed, from an incredible variety of alphabets and scripts; you can test your system's Unicode conformance (browser and scripts) by viewing this Unicode test page, which gives a small sample of Unicode from a few different scripts, together with images of what they should look like. Before Unicode, it was certainly possible to write an HTML page, say, in Japanese, or in Hindi, but it was impossible to write one that contained both Japanese and Hindi (in the same file).
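To make the Japanese-and-Hindi point concrete, here is a minimal Python sketch. The sample string and the helper name `can_encode` are illustrative assumptions, not anything from the Standard, and Shift_JIS stands in here for "a pre-Unicode, Japanese-only encoding". It only shows that a Unicode encoding such as UTF-8 can carry both scripts in one byte stream, while the legacy codec rejects the Devanagari part.

```python
# Hypothetical sample text: "Japanese" written in Japanese, then "Hindi" written in Hindi.
mixed = "日本語 हिन्दी"


def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of `text` is representable in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# UTF-8 is a Unicode encoding, so the mixed text survives a round trip unchanged.
utf8_bytes = mixed.encode("utf-8")
round_trip_ok = utf8_bytes.decode("utf-8") == mixed

# Shift_JIS covers Japanese but has no Devanagari, so the mixed text cannot be encoded.
japanese_only_ok = can_encode("日本語", "shift_jis")
mixed_in_legacy_ok = can_encode(mixed, "shift_jis")

print(round_trip_ok, japanese_only_ok, mixed_in_legacy_ok)
```

The same limitation applied in the other direction: an encoding designed for Hindi had no room for Japanese characters, which is why a single file could not hold both before Unicode.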

But this Standard, and, beyond the standard itself, the 1500-page printed form of the standard (the book I just bought), is truly amazing. This is a book about Writing (with a capital ‘W’), a beautiful one, and, turning its pages, one discovers many an elegant and artistic script, whose very existence had sometimes gone unsuspected (I had certainly never heard of Shavian until I learned about it from Unicode; actually, I hadn't even heard of Yi either, which is less forgivable). Have you ever beheld the strange serpent-like signs of Syriac? The graceful curves of Gujarati? The strange loops of Georgian? The treelike glyphs of Ethiopic? The deceptively simple Cherokee? The mysterious pictures of Linear B? If not, you should have a look at the Standard (all of whose pages can be found in PDF format on the Unicode Web site).

The Unicode standard is one of Man's dreams: one standard to rule all scripts. It is also an endless pursuit: version 3.0 of the Standard (which I had also bought in printed form) already contained 27496 Chinese ideograms (simplified and unsimplified alike), and version 3.1 added another 42711 to these, making a total of 70207—probably the single largest collection of Chinese ideograms ever compiled, more than any dictionary ever published, or any collection of printer's glyphs; and rest assured that more ideograms will yet be found and added to the Standard.

But there are also important omissions in Unicode. The largest and most remarkable one is probably that of Egyptian hieroglyphic: it will certainly take years of work before a decent repertoire of glyphs for Egyptian can be added to the standard, even as a start. (I look forward to the day when I can quote the Book of the Dead in the original in my Web pages—and have it display correctly everywhere!) Unicode guru Michael Everson has written a very interesting note, Leaks in the Unicode Pipeline: Script, Script, Script…, on some of the scripts that remain to be encoded and how difficult it will be to include them someday. Well, good luck with this heroic task!