This discussion of Unicode® is not intended to be general or comprehensive. It is merely a set of notes to myself (these are my Notebooks, after all) of various things of use to me in my study of ancient Greek and the origins of Western writing systems.
Unicode is a standard promulgated by a private consortium for the encoding of all characters in all languages as multi-byte numbers. The nuances and details of this are of course subtle and complex; for further discussion refer to the Unicode Consortium's website. The character mappings of Unicode and the ISO® standard 10646, "Universal Character Set," are now equivalent (the Unicode 4.0 Standard says "Version 4.0 of the Unicode Standard is code-for-code identical to ISO/IEC 10646:2003"). The Unicode standard, however, specifies more than just the character mapping.
Unicode (as of version 4.0) has a "code space" in the range of integers from zero to 0x10FFFF. It is "convenient" (says the Standard) to think of these as divided into seventeen "planes" of 65,536 (0x10000) code points each, numbered 0 through 16 (0x10). Plane 0, code points 0 through 0x00FFFF, is the "Basic Multilingual Plane," which contains most of the world's modern alphabetic and syllabic scripts, as well as much else. Plane 1, code points 0x010000 through 0x01FFFF, is the "Supplementary Multilingual Plane." Of the contents of this plane presently of interest to me, Linear B and the Cypriot Syllabary stand out. Plane 2, code points 0x020000 through 0x02FFFF, is the Supplementary Ideographic Plane, which contains many of the ideographic characters used in languages such as Chinese and Japanese. Plane 14 (0x0Ennnn) holds a few special-purpose characters, and Planes 15 (0x0Fnnnn) and 16 (0x10nnnn) are reserved as supplementary private use areas (in addition to a smaller private use area within Plane 0). This leaves eleven other planes entirely reserved for future use.
Unicode may be represented in any number of ways. Perhaps the simplest would be to use four bytes per character. This encoding is called "UTF-32" (Unicode Transformation Format - 32 bit). However, this representation is not space-efficient (Unicode could be represented in three integral bytes per character) and it is not directly compatible with traditional single-byte encoded ASCII. By way of contrast, UTF-8 is a particularly ingenious way of re-encoding the multi-byte characters of Unicode/ISO10646 into a variable number of bytes. It takes advantage of the fact that ASCII is a 7-bit code while (all modern) computers use 8-bit bytes. Basically, it uses this last bit to "chain" on another byte when necessary. Since the first 128 character positions of Unicode are, effectively, ASCII, this means that the UTF-8 encoding of the first 128 codes of Unicode is simply ASCII (with the high-order bit forced to 0). UTF-8 is thus fully backward-compatible with ASCII for these single-byte codes.
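As a concrete illustration, here is a small Python sketch (the particular characters are just examples relevant to these notes) showing how UTF-8 encodes an ASCII letter, a Greek letter from the Basic Multilingual Plane, and a Cypriot Syllabary character from the Supplementary Multilingual Plane into one, two, and four bytes respectively:

```python
# How UTF-8 maps code points to one, two, or four bytes (Python sketch).
for ch in ("a", "\u03B1", "\U00010800"):  # ASCII 'a', Greek alpha, Cypriot syllable A
    print(f"U+{ord(ch):04X} -> {ch.encode('utf-8').hex(' ')}")
# U+0061 -> 61
# U+03B1 -> ce b1
# U+10800 -> f0 90 a0 80
```

Note that the single-byte encoding of "a" is just its ASCII value, exactly as described above.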
For further information, see Markus Kuhn's UTF-8 and Unicode FAQ for Unix/Linux, as well as the Unicode standard itself.
Note: When it is necessary to refer to a Unicode character by its code number (so that, for example, the display program won't try to show it as the glyph of the character itself), the convention is "U+XXXX", where "XXXX" is the character's hexadecimal code (or U+XXXXX or U+XXXXXX, if five or six hex digits are required, of course). Thus, a lowercase Greek alpha is U+03B1. The first character in the Cypriot Syllabary, in the Supplementary Multilingual Plane, is U+10800. Sometimes I'll drop leading 0s, sometimes not.
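The "U+XXXX" convention is easy to round-trip programmatically; here is a Python sketch (the helper name `u_notation` is my own, not anything standard):

```python
# The "U+XXXX" convention, round-tripped in Python (sketch).
def u_notation(ch):
    return f"U+{ord(ch):04X}"  # at least four hex digits; more when needed

print(u_notation("\u03B1"))      # U+03B1 (lowercase Greek alpha)
print(u_notation("\U00010800"))  # U+10800 (first Cypriot Syllabary character)
```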
The Unicode Consortium's website, http://www.unicode.org/, has the Standard online, together with illustrative charts of the various code point ranges. It all makes for very interesting reading. Really.
The major issue in representing ancient (that is, "polytonic" or multi-accented) Greek in Unicode is that Unicode allows the handling of diacritical marks in two ways. For example, a lowercase alpha with an acute accent may be represented as two characters (U+03B1 ("GREEK SMALL LETTER ALPHA") followed by U+0301 ("COMBINING ACUTE ACCENT")), in which case it is up to the displaying program to "combine" or "compose" these two characters into a single visual presentation. Alternatively, it may be represented as a single "precomposed" character (U+1F71 ("GREEK SMALL LETTER ALPHA WITH OXIA")).
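The two representations can be inspected with Python's standard unicodedata module. A sketch follows; one nuance to note is that canonical composition ("NFC") actually yields the "tonos" character U+03AC from the basic Greek range, and the "oxia" duplicate U+1F71 normalizes to that same character:

```python
import unicodedata

combining = "\u03B1\u0301"  # alpha followed by combining acute (two code points)
oxia = "\u1F71"             # GREEK SMALL LETTER ALPHA WITH OXIA (precomposed)

composed = unicodedata.normalize("NFC", combining)
print([f"U+{ord(c):04X}" for c in composed])           # ['U+03AC']
print(unicodedata.normalize("NFC", oxia) == composed)  # True
```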
The advantage of using precomposed characters is that, given that they display at all, they should display correctly. The following image shows on the left a two-character "combining" representation of alpha with an acute accent, as displayed on the Mozilla® browser, version 1.7.11, under SuSE® Linux 10.0. On the right it shows the single-character "precomposed" version of the same.
[Image: Combining vs. precomposed alpha with acute accent, Mozilla 1.7.11, SuSE 10.0]
Mozilla simply overprints the accent on top of the character. As the Unicode combining diacritical marks are not specific to Greek but may be used with many other characters, and as Unicode doesn't encode visual forms (glyphs) in any case, it is not surprising that the result is disappointing.
It also happens that one of the tools I use, the "vim" version of the vi text editor, only supports two combining diacritical marks. At times in Greek it is necessary to have three (e.g., rough breathing, circumflex, and iota subscript).
From the presentation and data entry points of view, in the absence of more sophisticated typographic software (a nontrivial issue) or other vim keyboard mappings (perhaps easier than I think), precomposed characters seem to have the advantage.
The disadvantage of using precomposed characters is that searches (with relatively simple software, at least) for the underlying character (e.g., alpha, U+03B1) won't find characters precomposed with accents (e.g., alpha acutely accented, precomposed as U+1F71). Representations using combining characters show both of these literally (U+03B1 U+0301), and so searches should work more easily.
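A quick Python sketch of the search problem: a naive substring search misses the bare alpha inside a precomposed character, but finds it again once the text is canonically decomposed ("NFD"):

```python
import unicodedata

text = "\u1F71"  # precomposed alpha with oxia
print("\u03B1" in text)                                # False: bare alpha not present
print("\u03B1" in unicodedata.normalize("NFD", text))  # True once decomposed
```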
The following program takes a UTF-8 encoded file on the standard input, detects all Unicode Greek combining character sequences, and writes the file to the standard output with these transformed into precomposed characters.
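The program itself is not reproduced here, but the core transformation can be sketched in a few lines of Python using the standard library's canonical composition (NFC). Note the caveat that NFC composes to the canonical "tonos" code points of the basic Greek range rather than to the "oxia" duplicates in the Greek Extended range:

```python
import unicodedata

def precompose(text):
    """Compose combining character sequences into precomposed characters."""
    return unicodedata.normalize("NFC", text)

# Used as a stdin-to-stdout filter:
#   import sys; sys.stdout.write(precompose(sys.stdin.read()))

combining = "\u03B1\u0301"                         # alpha + combining acute
print(len(combining), len(precompose(combining)))  # 2 1
```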
Unicode employs a principle of "unification" whereby characters from different domains which are "equivalent" (in some semantic sense, not necessarily visually) are given the same number ("code point"). This can become a political issue when people of one group find that they have to share characters with people of another group. I certainly don't want to get into that. However, the principle of unification does mean that finding the Unicode characters to represent a particular domain can be more complex than one might at first think. I'll illustrate this here with what is, I hope, the relatively neutral domain of the International Phonetic Alphabet (IPA).
The IPA quite deliberately uses symbols drawn from other alphabets, such as Latin and Greek. For example, it uses the Latin letter "a" and the Greek letter beta. It also invents symbols of its own (or adapts them so completely that they become purely IPA symbols). For example, it distinguishes regular and "script" versions of "a", it rotates the Latin letter "m", and it contains a symbol for "ram's horns" (from astrological notations?). It also uses some symbols which are very much like "regular" ones but which are in some way distinguishable when used as IPA symbols. For example, an IPA "colon" (its "long mark") might be typed on an ordinary typewriter as a colon (how many people nowadays have actually seen an "ordinary" typewriter? a manual ordinary typewriter?), but is generally typeset with more triangular dots and isn't quite the same thing as a "real" colon.
One solution is to simply lay out a separate range which has all of the IPA symbols. This would be straightforward, but would involve duplicating symbols which "really" are also in other domains (the lower case "a" in IPA is, really, just a lower-case letter "a"). Unicode does not do this.
The other solution, suggested by the principle of "unification" in Unicode, is to use symbols from other (preexisting, presumably) domains when possible and to add special symbols only when necessary. This makes perfect sense, but it can become complex in a situation such as that of the IPA, which draws its symbols from "basic" Latin letters (the ASCII ones), the many accented languages of Europe, the diacritical marks of many languages (considered as separate characters), and specially positioned characters (e.g., superscripts) which appear in other domains as well. The end result of this is that it takes, so far as I can identify them, characters from eleven (yes, 11) Unicode ranges to represent the IPA.
This is no problem from a computer's point of view (characters are just numbers, after all). It can, however, make things difficult for a writer trying to locate a character. (It can also make things difficult when fonts which represent the less commonly used ranges of characters are not installed. I very much hope that this is a transitional issue which will soon go away.)
In the next section, therefore, I'll identify those groups of Unicode ranges which are relevant to my own language studies. General scholarly writing in the Western tradition, for example, takes at least five to seven ranges. Typing ordinary ancient Greek takes three ranges, and there are five more ranges relevant to ancient Greek scholarship (without even getting into Linear B and the Cypriot Syllabary).
It's surprising how various ranges of characters necessary in a single field are scattered throughout the standard. Discussing the scholarship on the phonetics of Greek, for example, might involve characters from well over a dozen Unicode ranges. Here, I'll organize some (not all!) of the ranges by the topics in which I use them.
These are additional Western European characters with precomposed diacritical marks and punctuation. The Copyright and Registered Trademark symbols are here. These are all "spacing" (vs. "combining") characters, so even when they're logically and visually superscripts, they occupy their own space (semantically, at least; the visual always depends on the display system).
As noted above, these are spacing characters. Thus the accents as they appear in this range differ from those which appear in the U+0300 "Combining Diacritical Marks" range. Here they are independent characters. There they are intended to combine with the characters around them. So for example if I type an a and then a U+00B4, my editor (vi) shows two characters, an a and then an acute accent: a´ . If instead I type an a and then a U+0301, vi moves the acute accent leftwards so that it sits atop the a: á . (Your browser may or may not show this behavior here.)
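The distinction can also be checked programmatically. In this Python sketch, the combining acute composes with a preceding "a" under canonical composition, while the spacing acute never does:

```python
import unicodedata

spacing = "a\u00B4"    # 'a' then ACUTE ACCENT (spacing): two glyphs side by side
combining = "a\u0301"  # 'a' then COMBINING ACUTE ACCENT: one composed glyph

print(len(unicodedata.normalize("NFC", spacing)))    # 2 -- U+00B4 never combines
print(len(unicodedata.normalize("NFC", combining)))  # 1 -- composes to U+00E1 (á)
```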
There's also an invisible sign which indicates multiplication where you don't generally (as a mathematician) write a multiplication sign but might otherwise (in the text) wish to indicate it explicitly. This is thoughtful.
This is where the non-registered TM symbol (™, U+2122) hides. (The Registered Trademark symbol (®) is U+00AE in the "C1 Controls and Latin-1 Supplement" range.) The Service Mark symbol (℠) U+2120 is also here.
There are other fun things here, too, including the drafting centerline symbol (U+2104, ℄), degrees Celsius (U+2103, ℃), degrees Fahrenheit (U+2109, ℉), and the (not degrees) Kelvin sign (U+212A, K), for when you wish to distinguish units of measure from Kafka's protagonists, but no degrees Réaumur sign, alas. It also has "Care Of" as a symbol (U+2105, ℅), the prescription symbol (U+211E, ℞), Planck's constant (U+210F, ℏ), the Angstrom sign (U+212B, Å), the i for information sign (U+2139, ℹ), "No." as a sign ("numero sign," U+2116, №), and the ounce sign (U+2125, ℥). I never knew that there was an ounce sign.
The markup of prosodic features (accents, length markings, and related diacritical marks) requires characters from several ranges. In addition to the "ordinary" situations, I also have need of the diacritical marks used by W. Sydney Allen in his Accent and Rhythm, and at times use completely ad hoc conventions of my own.
For the scansion of text in Latinate characters, precombined versions of the acute and grave accents over the vowels are present in the U+0080 to U+00FF "C1 Controls and Latin-1 Supplement" range, and versions of the breve accent precombined over the vowels are present in the U+0100 to U+017F "Latin Extended-A" Range.
Though it isn't a part of conventional scansion, I find the musical "fermata" symbol to be of use - I use it to indicate a syllable held indefinitely which thus stands apart from the regular scansion. The combining fermata is in the relatively ordinary range U+0300 to U+036F "Combining Diacritical Marks," but the spacing version is up in the Supplementary Multilingual Plane in Range U+1D100 to U+1D1FF "Musical Symbols." The chances that either of these will display on most computers at the present time (2006) are, alas, slight. (No, they don't display on my system at present; I use them infrequently, and so don't mind simply reading the numeric code point value displayed instead of a real glyph.)
For the scansion of Greek meter by "weight," Allen marks syllables as "light" (inverted breve below) or "heavy" (macron below). His notation also uses the (regular, not inverted) breve above and macron above to indicate short and long vowel length or syllable-length-with-short-or-long-vowel. Unicode allows this using combining diacritical marks, but its Greek wasn't really designed to accommodate these as precombined characters. Some (not all) of the vowels have precombined versions with breve above and macron above.
Allen's notation for prosodic analysis also requires superscript and subscript numbers (0, 1, 2) and the superscript plus sign. The superscripts and subscripts are generally in range U+2070 to U+209F, "Superscripts and Subscripts." However, in this range the superscript one would expect for "1" is instead "SUPERSCRIPT LATIN SMALL LETTER I"; no alternative is suggested, but there is a superscript numeral 1 in the Latin 1 Supplement (U+00B9). Also, superscript "2" and "3" are "reserved" and instead code points from the Latin 1 Supplement are suggested.
Note that the superscript "0" U+2070 (⁰) is not the same as the "masculine ordinal indicator" U+00BA (º) (and in some fonts they look quite different; e.g., the ordinal indicator may have both a round part and a line under it).
These are modifier symbols which are like diacritical marks, but are separately "spaced" characters (they don't combine with other characters), and so aren't called "diacritical marks" in Unicode. Mostly they're phonetic modifiers - both IPA and non-IPA.
The standard prefers U+0342 COMBINING GREEK PERISPOMENI over U+0303 COMBINING TILDE (which I would not use for a Greek circumflex, though many do) and does not mention U+0302 COMBINING CIRCUMFLEX ACCENT (which is the character I would tend to use for a circumflex). I suppose that this means it is best to use U+0342 for the combining circumflex.
The standard discourages the use of U+0344 COMBINING GREEK DIALYTIKA TONOS, which is a diaeresis (¨) with an acute accent (΄) piled on top of it (α̈́), in favor of U+0308 (α̈) plus U+0301 (ά): α̈́ or ά̈ . Note also that this character is duplicated by U+1FEE (΅) in the "Greek Extended" range, where it is paired with a diaeresis-with-grave as well (U+1FED (῭)) as the "regular" range U+0344 is not. The "Greek Extended" range also includes iota and upsilon precomposed with diaeresis and acute/grave.
Ugaritic is an alphabetic script written with cuneiform characters. This is thought to be the only "other" instance of the development of an alphabet. It's a cautionary tale about the value of compatible technology.
These are by no means the only other modern scripts in Unicode, of course, or even the only other modern scripts of interest to me. They're simply the other modern scripts that it is likely I might have to write something in while researching the linguistics of ancient Greek. Devanagari is obvious here, as it is the script in which Sanskrit is written (the seminal language in Indo-European linguistics, if no longer the oldest attested IE language). Hebrew and Arabic are obvious as well, as Semitic languages with distinctive scripts which occupy important places in the history of writing systems.
See http://www.unicode.org/roadmaps/smp/ for a "roadmap" of the Supplementary Multilingual Plane, with links to proposals.
The ConScript Unicode Consortium, http://www.evertype.com/standards/csur/, organizes, unofficially, various scripts which for one reason or another aren't in, or aren't likely to be in, Unicode. It employs the Unicode Private Use Areas.
All portions of this document not noted otherwise are Copyright © 2006 by David M. MacMillan and Rollande Krandall.
Circuitous Root is a Registered Trademark of David M. MacMillan and Rollande Krandall.
This work is licensed under the Creative Commons "Attribution - ShareAlike" license. See http://creativecommons.org/licenses/by-sa/3.0/ for its terms.
ISO is a registered trademark of the International Organization for Standardization.
Linux is a registered trademark of Linus Torvalds.
Mozilla is a registered trademark of The Mozilla Organization.
SuSE is a registered trademark of Novell Computer Corporation.
Unicode is a registered trademark of The Unicode Consortium.
Presented originally by Circuitous Root®