Tuesday 24 February 2004

Character sets

While on the subject of character encoding, I had a big follow-up which I was going to send to John Robbins about his Bugslayer column in this month's MSDN Magazine. However, I drifted off the point a bit, and it would make a better blog entry.

My basic point relevant to the article is that there's no such thing as 'Greek ASCII' (apart from "it's all Greek to me," of course). ASCII was standardised as ISO 646-US. It's a 7-bit-only character set: only code points 0-127 (decimal) are defined. The meaning of any other bits (ASCII was defined before standardisation - here de facto, not de jure - on an 8-bit byte) is up to the implementation. There are a number of other national variants, the simplest being 646-UK, which only swaps £ in for # at code point 35.

The Danish/Norwegian, German and Swedish forms cause havoc for C programmers, because characters essential for writing C programs (e.g. {}, [], \, |) are replaced (relative to the -US set) with accented vowel characters. C partly gets around this using <iso646.h>, which defines a number of alternate names (macros) for some of the operators that are really messed up by this. C also has trigraph support (officially, although compilers often don't enable it by default), where these characters can be produced by using ?? and another character (e.g. ??/ => \). C++ (and C, as of the 1995 amendment) also has digraphs, which are easier to type and remember than the trigraphs, but are more limited. Officially, in C++ the iso646.h names are now keywords in their own right.
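
To make that concrete, here's a sketch (my own illustration, not anything from the column) of the same test written with the ordinary operators, with the iso646 names, and with the digraphs - the trigraphs are relegated to a comment so it still compiles without trigraph support:

    #include <iso646.h>   // macros in C; the names are built-in keywords in C++

    int main()
    {
        int flags[2] = { 1, 0 };

        // The usual spelling.
        if (flags[0] && !flags[1]) { flags[1] |= 2; }

        // iso646 / alternative-token spelling.
        if (flags[0] and not flags[1]) { flags[1] or_eq 2; }

        // Digraph spelling for the brackets and braces.
        if (flags<:0:> and not flags<:1:>) <% flags<:1:> or_eq 2; %>

        // Trigraphs, for reference: ??( is [, ??) is ], ??< is {,
        // ??> is }, ??! is |, ??= is #, and ??/ is the backslash.
        return 0;
    }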

The irony of this is that, for the most part, very few people now need the trigraphs, digraphs or alternate keywords, because almost everyone is now at least 8-bit capable. The de jure standards for 8-bit character sets - at least, European ones - are the ISO 8859 series, including the well-known ISO 8859-1, suitable for Western European languages. The de facto standard is of course Windows-1252, which defines characters in the region between code points 128 and 159 that 8859 marks as unused (and that IANA's iso-8859-1 reserves for control characters). 8859 uses 646-US for the first 128 code points. This often causes havoc on the Web, where many documents are marked as iso-8859-1 but contain windows-1252 characters (although this is usually the least of the problems).
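
To see the difference, here's a quick Windows-only sketch (the byte 0x93 is just my example - a left double quotation mark in Windows-1252, a C1 control under IANA's iso-8859-1, for which Windows uses code page 28591):

    #include <windows.h>
    #include <stdio.h>

    int main()
    {
        const char smartQuote[] = "\x93";   // one Windows-1252 'smart quote' byte
        wchar_t wide[2] = { 0 };

        // Interpreted as Windows-1252, the byte maps to U+201C.
        MultiByteToWideChar(1252, 0, smartQuote, 1, wide, 2);
        printf("1252  -> U+%04X\n", (unsigned)wide[0]);

        // Interpreted as iso-8859-1 (code page 28591), it passes straight
        // through to U+0093, a C1 control.
        MultiByteToWideChar(28591, 0, smartQuote, 1, wide, 2);
        printf("28591 -> U+%04X\n", (unsigned)wide[0]);

        return 0;
    }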

8859 is a single-byte character set: a single 8-bit byte defines each character. This doesn't give nearly enough range for Far East character sets, which use double- or multi-byte character sets. An MBCS character set (such as Shift-JIS) reserves some byte values as lead bytes, which don't directly encode a character; instead they act as a shift for the following byte or bytes, the trail bytes. Unfortunately, the trail byte values aren't a distinct set from the lead and single-byte values. This gives rise to the classic Microsoft interview question: if you're at an arbitrary position in a string, how do you move backwards one character?
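
The usual answer (sketched below for Shift-JIS, Windows code page 932; the helper name is mine) is that you can't do it by inspecting the previous byte on its own - you have to rescan forward from a known character boundary, such as the start of the string. The Win32 CharPrevExA function exists for exactly this reason.

    #include <windows.h>

    // Return a pointer to the character preceding 'current' in the
    // Shift-JIS string that starts at 'start'.
    const char* PrevChar(const char* start, const char* current)
    {
        const char* prev = start;
        const char* p = start;
        while (p < current)
        {
            prev = p;
            // Lead bytes *are* a distinct set, so scanning forwards is safe.
            p += IsDBCSLeadByteEx(932, (BYTE)*p) ? 2 : 1;
        }
        return prev;
    }

    // Or let the API do the work: CharPrevExA(932, start, current, 0);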

For some reason best known to themselves, all byte-oriented character encodings are known in Windows as 'ANSI', except those designed by IBM for the PC, which are known as 'OEM'. If you see 'ANSI code-page', think byte-oriented.

Frankly this is a bit of a nightmare, and a rationalisation was called for. Enter Unicode (or ISO 10646). Now, a lot of programmers seem to believe that Unicode is an answer to all of this, and only ever maps to 16-bit quantities. Unfortunately, Unicode defines code points above U+FFFF, outside the Basic Multilingual Plane. It's better to think of Unicode code points as being a bit abstract; you use an encoding to actually represent Unicode character data in memory. The encoding that Windows calls 'Unicode' is UTF-16, little-endian. This serves to confuse programmers. Actually, versions of Windows before XP used UCS-2, i.e. they didn't understand UTF-16 surrogates, which are used to encode code points above U+FFFF. Again, for backwards compatibility (at least at a programmer level), the first 256 code points of Unicode are identical to ISO 8859-1 (including the C1 controls defined by IANA).
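
The surrogate arithmetic itself is straightforward - here's a sketch of the standard algorithm (nothing Windows-specific about it):

    #include <stdio.h>

    void ToUtf16(unsigned long codePoint)
    {
        if (codePoint <= 0xFFFF)
        {
            // BMP code points fit in a single code unit.
            printf("U+%04lX -> %04lX\n", codePoint, codePoint);
        }
        else
        {
            unsigned long v = codePoint - 0x10000;           // 20 bits remain
            unsigned high = 0xD800 + (unsigned)(v >> 10);    // top 10 bits
            unsigned low  = 0xDC00 + (unsigned)(v & 0x3FF);  // bottom 10 bits
            printf("U+%04lX -> %04X %04X\n", codePoint, high, low);
        }
    }

    int main()
    {
        ToUtf16(0x6C34);    // a BMP character: one code unit, 6C34
        ToUtf16(0x10000);   // first supplementary code point: D800 DC00
        return 0;
    }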

You may have heard of UTF-8. This is a byte-oriented encoding of Unicode. Characters below U+0080 are specified with a single byte; otherwise, a combination of lead and trail bytes is used. This means that good old ASCII can be used directly.

Hang on, that sounds familiar... The difference with UTF-8 is that the byte values form distinct subsets; you can tell whether a given byte represents a single code point, a lead byte or a trail byte, and, if it's a lead byte, how many trail bytes follow. UTF-16 has the same property; the term surrogate pair is used because there can only ever be two code units for a code point. UTF-16 can't encode anything above U+10FFFF because of this limitation. This makes it possible to walk backwards, although everyone who has --pwc in their loops has a potential problem: that steps back one code unit, not necessarily one character.
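
A sketch of what that property buys you (the helper names are mine, and malformed input isn't checked):

    // UTF-8: trail bytes always match the bit pattern 10xxxxxx.
    const unsigned char* Utf8Prev(const unsigned char* start,
                                  const unsigned char* p)
    {
        do { --p; } while (p > start && (*p & 0xC0) == 0x80);
        return p;
    }

    // UTF-16: the second (low) surrogate of a pair is always in DC00-DFFF.
    const unsigned short* Utf16Prev(const unsigned short* start,
                                    const unsigned short* p)
    {
        --p;
        if (p > start && (*p & 0xFC00) == 0xDC00)   // trail surrogate
            --p;
        return p;
    }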

UTF-8 is more practical than UTF-16 for Western scripts, but any advantage it has is quickly wiped out for non-Western scripts. The Chinese symbol for water (水) at code point U+6C34 becomes the sequence e6 b0 b4 in UTF-8 - 3 bytes compared to UTF-16's 2. Its main advantage is that byte-oriented character manipulation code can often be used with no changes. Recent *nix APIs largely use UTF-8; Windows NT-based systems use UTF-16LE.
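
If you want to check those bytes, the Win32 conversion API will produce them (a sketch; error handling omitted):

    #include <windows.h>
    #include <stdio.h>

    int main()
    {
        const wchar_t water[] = L"\x6C34";   // U+6C34, one UTF-16 code unit
        char utf8[8] = { 0 };

        int n = WideCharToMultiByte(CP_UTF8, 0, water, 1,
                                    utf8, sizeof(utf8), NULL, NULL);
        printf("UTF-16: 2 bytes; UTF-8: %d bytes:", n);   // 3
        for (int i = 0; i < n; ++i)
            printf(" %02x", (unsigned char)utf8[i]);      // e6 b0 b4
        printf("\n");
        return 0;
    }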

The .NET Framework also uses UTF-16 as the internal character encoding of the System.String type, which is exposed by the Chars property. System.Text.ASCIIEncoding does exactly what it says on the tin: converts to ISO 646-US. Anything outside the 7-bit ASCII range is converted to the default character, ?. The unmanaged WideCharToMultiByte API (thank god it's not called UnicodeToAnsi) allows you to specify the default character, but as far as I can see Encoding does not. GetEncoding( 1253 ) will get you a wrapper for Windows-1253, not 'Greek ASCII'.
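
For comparison, here's the unmanaged call with the default character spelled out (a sketch - U+0391, Greek capital alpha, and code page 20127, US-ASCII, are just my example choices):

    #include <windows.h>
    #include <stdio.h>

    int main()
    {
        const wchar_t alpha[] = L"\x0391";   // Greek capital alpha
        char out[4] = { 0 };
        BOOL usedDefault = FALSE;

        // Convert to US-ASCII (ISO 646-US); alpha can't be represented,
        // so the default character we pass is substituted instead.
        WideCharToMultiByte(20127, 0, alpha, 1, out, sizeof(out),
                            "?", &usedDefault);
        printf("'%s' (default character used: %d)\n", out, usedDefault);

        return 0;
    }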
