Wednesday 25 February 2004

More character set stuff

I see that MSDN is still having character set trouble: it looks like pages are being encoded as UTF-8, and then the already-encoded bytes are being run through a UTF-8 encoder a second time.

The latest example is in the XP SP2 Windows Firewall information (see 'allow local' near the end of the document). The UTF-8 sequence e2 80 9c, which shows in the document as â€œ, is the encoding of U+201C, the opening double-quote character (“).
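To make the failure concrete, here is a minimal sketch of how a double encoding like this could arise. It assumes the intermediate step misreads the UTF-8 bytes as Windows-1252 (cp1252); that is my guess rather than anything MSDN has confirmed, and the function names are made up for illustration.

```python
# Assumption: the bytes were misread as Windows-1252 between the two
# UTF-8 passes. The real pipeline is not known.
ASSUMED_LEGACY_ENCODING = "cp1252"


def double_encode(text: str) -> bytes:
    """Encode to UTF-8, misread the bytes as the legacy encoding, encode again."""
    once = text.encode("utf-8")
    misread = once.decode(ASSUMED_LEGACY_ENCODING)
    return misread.encode("utf-8")


def repair(data: bytes) -> str:
    """Undo one round of double encoding by reversing the steps above."""
    misread = data.decode("utf-8")
    return misread.encode(ASSUMED_LEGACY_ENCODING).decode("utf-8")


quote = "\u201c"                          # opening double-quote character
print(quote.encode("utf-8").hex(" "))     # e2 80 9c
garbled = double_encode(quote)
print(garbled.hex(" "))                   # c3 a2 e2 82 ac c5 93
print(repair(garbled) == quote)           # True
```

The repair only works when every byte of the first encoding maps to a character in the assumed legacy encoding. Python's cp1252 codec leaves a few bytes (0x9d among them) undefined, so the closing quote U+201D, which is e2 80 9d in UTF-8, would raise an error in this sketch instead of round-tripping.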

The safest way in XML and HTML is to use numeric character references, the &# notation (e.g. &#x20AC; is the Euro symbol, €). HTML 3.2 indicated that these were to be interpreted as ISO Latin-1, whereas HTML 4.0, XML 1.0 and later interpret them as Unicode code points. The named character references (e.g. &ldquo; for “) only work properly in HTML; in XML documents they are undefined unless the document's DTD declares them. XML processors are only required to recognise &lt;, &gt;, &amp;, &apos; and &quot;, for <, >, &, ' and " respectively (see section 4.6 of the XML 1.0 specification [link to Tim Bray's annotated version; definitive version at www.w3.org]).
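As a rough check of the points above, the following sketch uses only Python's standard library: html.unescape for the HTML reading of a reference, and xml.etree.ElementTree as a stand-in for "an XML processor". The helper names to_numeric_refs and xml_accepts are invented for the example.

```python
import html
import xml.etree.ElementTree as ET


def to_numeric_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal &#N; reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def xml_accepts(document: str) -> bool:
    """Return True if the string parses as a well-formed XML document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Numeric references mean the same thing to the HTML and XML readers.
euro_ref = to_numeric_refs("\u20ac")                 # decimal form of &#x20AC;
euro_html = html.unescape("&#x20AC;")
euro_xml = ET.fromstring("<p>&#x20AC;</p>").text

# The five predefined XML entities need no DTD.
predefined = ET.fromstring("<p>&lt;&gt;&amp;&apos;&quot;</p>").text

# A named HTML reference is fine in HTML but undefined in bare XML.
ldquo_html = html.unescape("&ldquo;")
ldquo_in_xml = xml_accepts("<p>&ldquo;</p>")
numeric_in_xml = xml_accepts("<p>&#x201C;</p>")
```

Note that to_numeric_refs only touches non-ASCII characters. It does not escape &, < or >, so markup-significant characters in the input still need the usual escaping before the text goes into a document.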
