Thursday, 28 July 2011

Unicode 6.0

Phonetic-symbol anoraks/nerds/geeks can have hours of fun browsing the Unicode Standard, the repository of all the characters that can be displayed on a modern computer screen (blog, 22 Jan 2007). If you haven’t got the book (which is hefty), browse online.

Now there’s a new version of the Standard, 6.0 (well, it came out last October, actually). Unlike previous versions, it has not been published as a printed book, but is available only online.

So what’s new in version 6.0? In brief: there are 2,088 new characters, including (I quote)
• over 1,000 additional symbols—chief among them the additional emoji symbols, which are especially important for mobile phones
• the new official Indian currency symbol: the Indian Rupee Sign
• 222 additional CJK Unified Ideographs in common use in China, Taiwan, and Japan
• 603 additional characters for African language support, including extensions to the Tifinagh, Ethiopic, and Bamum scripts
• three additional scripts: Mandaic, Batak, and Brahmi

There are also extensive technical changes to do with character properties and format specifications.

Two new Cyrillic characters cater for Azerbaijani. Two new Arabic characters and ten new Devanagari characters cater for Kashmiri. Thirty-two new Ethiopic characters cater for Gamo-Gofa-Dawro, Basketo, and Gumuz. Complete new blocks of letters cater for Mandaic, for Batak, and for Brāhmī.

Is there anything of particular interest to phoneticians and IPA users?

How about a symbol for a voiceless retroflex lateral fricative? A sort of combination of ɬ and ɭ? It’s not (yet) an official IPA symbol, but it’s a logical combination of two. Here it is, U+A78E: ꞎ. (Unicode numbers are given in hexadecimal and prefixed with the identifier U+.)
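
For anyone who wants to see what the U+ notation means in practice, here is a minimal Python sketch. The helper names from_u_notation and to_u_notation are my own inventions for illustration, not part of any library; only chr, ord, int and the standard unicodedata module are real. Whether a name is found depends on the version of the Unicode data bundled with your Python, so the lookup is given a fallback.

```python
import unicodedata


def from_u_notation(label: str) -> str:
    """Turn a label such as 'U+A78E' into the character it refers to.

    Hypothetical helper for illustration only.
    """
    if not label.upper().startswith("U+"):
        raise ValueError(f"expected a U+ label, got {label!r}")
    # The digits after 'U+' are hexadecimal, hence base 16.
    return chr(int(label[2:], 16))


def to_u_notation(char: str) -> str:
    """Give the U+ label for a single character (at least four hex digits)."""
    return f"U+{ord(char):04X}"


ch = from_u_notation("U+A78E")
# The second argument to unicodedata.name is returned if the bundled
# Unicode data does not know the character.
print(to_u_notation(ch), unicodedata.name(ch, "<name not in this Python's data>"))

# The two existing IPA letters that the new symbol combines:
for ipa in "ɬɭ":
    print(to_u_notation(ipa), unicodedata.name(ipa, "<name not in this Python's data>"))
```

Whether the character then shows up as anything other than an empty box is a separate matter, and depends on your fonts rather than on Python.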
If you’ve always wanted a COMBINING DOUBLE INVERTED BREVE BELOW, it’s now available. But unless you’re a Uralic Phonetic Alphabet aficionado, you’ll have managed without. Do you have a use for subscript h k l m n p s t? I doubt it. Even if you do, you’d probably simply use the subscripting tag <sub> </sub>, as I have just done. In Unicode 6.0 they’re ready-made at U+2095 to U+209C.
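
If you would like to inspect the ready-made subscripts for yourself, the following Python sketch walks through the range quoted above and prints each code point with its official name. It uses only the standard unicodedata module; the variable names are mine, and the result assumes your Python ships Unicode data from version 6.0 or later. The last step applies NFKC compatibility normalization, which folds each subscript back to the ordinary letter it is a variant of.

```python
import unicodedata

# Range quoted in the post for the subscript letters added in Unicode 6.0.
FIRST, LAST = 0x2095, 0x209C

for cp in range(FIRST, LAST + 1):
    ch = chr(cp)
    name = unicodedata.name(ch, "<name not in this Python's data>")
    print(f"U+{cp:04X}  {ch}  {name}")

# All eight subscripts as one string.
subscripts = "".join(chr(cp) for cp in range(FIRST, LAST + 1))

# Compatibility normalization maps each subscript to its plain base letter.
plain = unicodedata.normalize("NFKC", subscripts)
print(subscripts, "->", plain)
```

The design point is that these are plain-text characters: they survive in contexts where markup such as a subscripting tag is stripped or unavailable.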

Students of the minority languages of China may welcome three new Bopomofo characters to cater for Hmu and Ge. (Bopomofo is a phonetic notation system based on Chinese characters.)

It’s one thing to have a symbol recognized in Unicode and assigned a U+ number. It’s something else for the new symbol to become available in an available font. We’ll just have to wait and see if and when these new characters make an appearance in documents on our display screens.

Don’t hold your breath.

6 comments:

  1. Unfortunately, <sub> </sub> is not accepted in comments on this blog!

  2. something else for the new symbol to become available in an available font

    Exactly. Just the other day, I was reading a few India-related articles on Wikipedia and was wondering what that ugly low-resolution image for the rupee symbol was doing there... Even in the actual Rupee article the character is only used once; the rest of the mentions use the image.

  3. For a character to be accepted by the Unicode Consortium, the submission is required to demonstrate its use, so it is clear that someone has a need for these subscripts. To write them off as substitutable by <sub> is missing the point of the Unicode Standard: it is not simply the character set for HTML; it is designed to be universal across ALL platforms and applications.

    As for inclusion in fonts: the labiodental flap was introduced into Unicode three years ago, and is already available in nine fonts (http://www.fileformat.info/info/unicode/char/2c71/fontsupport.htm) which isn't bad going really. No breath-holding is required for the subscripts themselves: they are already in Symbola. (It should be noted there will always be at least one font implementing new characters as the Unicode consortium do not approve submissions unless there is at least one font which implements the character: for them to print it with!)

    This page is a very useful resource for anyone who wants to track down the implementation of any Unicode character:

    http://www.fileformat.info/info/unicode/char/search.htm

  4. @Stuart Brown: It should be noted there will always be at least one font implementing new characters as the Unicode consortium do not approve submissions unless there is at least one font which implements the character: for them to print it with!

    I can see that there must be a glyph requiring a codepoint, but is it kosher for a font implementation to jump the gun by assigning the glyph to the proposed codepoint before it's been approved by the consortium? I thought the Private Use Area was for implementing unapproved characters; so any pre-existing font implementation would need to be amended post approval to move the character to its permanent codepoint.

  5. The people who contribute fonts for use in the Unicode books are of course on the inside track for the location of the character. Many such fonts come from Evertype, Michael Everson's company.

  6. Although not quite on a par with the nasal ingressive voiceless velar trill (see blog entry dated Monday, 6 April 2009), I feel that an ingressive version of the above mentioned voiceless retroflex lateral fricative also deserves a symbol. I think it was probably one of the sounds used in an advert for instant coffee, where the hostess goes into the kitchen and pretends to be using a percolator.
