Consider the following pairs of symbols: a а ä ӓ æ ӕ c с e е è ѐ ë ё i і j ј o о p р s ѕ x х y у. Can you see any difference between the members of each pair? Nor can I. Nor can anyone.
However, in each pair the first symbol is a letter of the Latin alphabet, while the second is Cyrillic.
Correspondingly, the two members of each pair have different Unicode encodings. While Latin a is U+0061, Cyrillic а is U+0430. While Latin j is U+006A, Cyrillic ј (used in writing Serbian) is U+0458. And so on.
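You can check this for yourself. Here is a minimal sketch, assuming Python 3 and its standard unicodedata module; the helper name describe is my own invention for illustration. The letters are written as escapes so that the lookalikes cannot be confused in the code itself.

```python
import unicodedata

# (Latin letter, Cyrillic lookalike), written as escapes on purpose.
PAIRS = [
    ("\u0061", "\u0430"),  # a
    ("\u006A", "\u0458"),  # j
]

def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

for latin, cyrillic in PAIRS:
    print(describe(latin), "|", describe(cyrillic))
# U+0061 LATIN SMALL LETTER A | U+0430 CYRILLIC SMALL LETTER A
# U+006A LATIN SMALL LETTER J | U+0458 CYRILLIC SMALL LETTER JE
```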
This situation is convenient in that it keeps all the basic Latin letters together in the range 0021–00FF (I give Unicode numbers in the usual hexadecimal form) and all the Cyrillic letters together in the block 0400–04FF. But it is also highly inconvenient, because it opens up potential breaches in security. Now that non-ASCII letters are allowed in URLs, the fact that two differently coded letters look identical could be exploited for malicious purposes, for phishing or scamming. While www.facebook.com is a website you know and love (or not), “www.fасеbооk.com” would be somewhere quite different. (In the latter case, the Latin a, c, e, o have been replaced by the identical-looking Cyrillic equivalents.)
That is why they tell you not to click on links in emails, but to type them into the browser yourself.
It’s not quite as bad as that, because the domain name authorities will (we hope) refuse to register such deceptive domain names. On the other hand there is nothing to stop someone using this sort of thing as their Facebook name.
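To see how such a spoofed name might be caught, here is a rough sketch in Python. It is not what any registry or browser actually runs: the function scripts_in is hypothetical, and it guesses a letter's script from the first word of its Unicode name, which is only an approximation of the real Unicode Script property. The spoofed label is built from escapes following the substitution described above (Cyrillic а, с, е, о in place of the Latin letters).

```python
import unicodedata

def scripts_in(label):
    """Rough script guess: the first word of each letter's Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()
    }

genuine = "facebook"
# f + Cyrillic a, es, ie + b + Cyrillic o, o + k
spoof = "f\u0430\u0441\u0435b\u043e\u043ek"

print(genuine == spoof)            # False
print(sorted(scripts_in(genuine))) # ['LATIN']
print(sorted(scripts_in(spoof)))   # ['CYRILLIC', 'LATIN']
```

A label that mixes scripts in this way is suspicious, though a name written entirely in Cyrillic lookalikes would pass this particular test, so mixed-script detection is only part of the answer.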
By the time it came to encoding IPA symbols, the Unicode Consortium had become aware of this danger and resolved to take a much more conservative line. The new policy was that if two characters (“glyphs”) look the same, then normally they should have the same encoding. That’s why, although most phonetic symbols are located in the IPA Extensions block (0250–02AF), some aren’t. We use the basic Latin a b c… rather than having special IPA ones. We also use the “Latin-1 Supplement” coding for the characters æ ç ð ø (U+00E6, U+00E7, U+00F0, U+00F8), since they occur in the ordinary spelling of Danish, French, Icelandic, and Norwegian. We also use the “Latin Extended-A” coding for the ħ (U+0127) used in Maltese orthography, for the œ (U+0153) used in French, and even for the ŋ (U+014B) used in spelling Sami and Mende. None of these is repeated in the IPA Extensions block, though ћ is separately coded for Cyrillic (Serbian, U+045B).
Worse, the phonetic symbols β θ χ (U+03B2, U+03B8, U+03C7) are to be found only in the “Greek and Coptic” block, since they are treated as identical with the Greek letters beta, theta and chi.
Fortunately, our IPA ɫ is not lumped in with Polish ł, nor ɪ (lax front unrounded vowel, small cap i) with Turkish dotless ı or Greek iota ι.
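The scattering of IPA symbols across blocks can be made concrete with a small lookup. This is only a sketch: Python's standard library has no block lookup that I know of, so the table BLOCKS and the function block_of are my own, covering just the blocks mentioned in this post (the IPA Extensions and Cyrillic ranges are as given above; the others are the standard ranges for the blocks named).

```python
# (first code point, last code point, block name)
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0250, 0x02AF, "IPA Extensions"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
]

def block_of(ch):
    """Return the block name for ch, or 'other' if it is not in the table."""
    cp = ord(ch)
    for first, last, name in BLOCKS:
        if first <= cp <= last:
            return name
    return "other"

# IPA a, ash, eng, theta, dark l, small capital i
for ch in "\u0061\u00e6\u014b\u03b8\u026b\u026a":
    print(f"U+{ord(ch):04X}", block_of(ch))
# U+0061 Basic Latin
# U+00E6 Latin-1 Supplement
# U+014B Latin Extended-A
# U+03B8 Greek and Coptic
# U+026B IPA Extensions
# U+026A IPA Extensions
```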
Meanwhile — rather incredibly, and going to the other extreme — our phonetic schwa ə is among the IPA symbols at U+0259, while the identical-appearing ǝ and ә are respectively LATIN SMALL LETTER TURNED E (U+01DD) of the Pan-Nigerian alphabet and CYRILLIC SMALL LETTER SCHWA (U+04D9) as used in Azerbaijani orthography.
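The three schwa lookalikes are easy to tell apart by machine, even if not by eye. The sketch below, again assuming Python 3 and the standard unicodedata module, prints their names and also checks that compatibility normalization (NFKC) does not merge them; the list name SCHWAS is just a label I have chosen.

```python
import unicodedata

# IPA schwa, Pan-Nigerian turned e, Cyrillic schwa (as escapes)
SCHWAS = ["\u0259", "\u01dd", "\u04d9"]

for ch in SCHWAS:
    # NFKC folds many compatibility variants together, but not these.
    unchanged = unicodedata.normalize("NFKC", ch) == ch
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unchanged)
# U+0259 LATIN SMALL LETTER SCHWA True
# U+01DD LATIN SMALL LETTER TURNED E True
# U+04D9 CYRILLIC SMALL LETTER SCHWA True
```

So a search for one of these characters will not find the other two unless the software goes out of its way to treat lookalikes as equivalent.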
The problem we face in all such cases is that of the “unification” versus “disunification” of identical-looking symbols.
More on this next week. Meanwhile, you might like to read Michael Everson’s discussion here.