Mike Dimmick's Bleurgh: More character set stuff

Wednesday 25 February 2004

More character set stuff

I see that MSDN is still having character set trouble: it looks like pages are being encoded as UTF-8, and then the already-encoded bytes are being treated as individual characters and passed through UTF-8 encoding a second time.

The latest example is in the XP SP2 Windows Firewall information (see 'allow local' near the end of the document). The UTF-8 byte sequence e2 80 9c, which shows in the document as â€œ, encodes U+201C, the opening double-quote character (“).
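To see how e2 80 9c turns into â€œ, here is a minimal Python sketch of the double encoding. It assumes the intermediate step misreads the bytes as Windows-1252, which matches the characters shown on the page but is a guess about what MSDN is actually doing. The function names are made up for illustration.

```python
# Assumption: the bytes are misread as Windows-1252 (cp1252) between the
# two encoding passes. This is a guess based on the characters displayed.
ASSUMED_CODE_PAGE = "cp1252"


def double_encode(text: str) -> bytes:
    """Encode as UTF-8, misread each byte as a character, encode again."""
    first_pass = text.encode("utf-8")               # “ becomes e2 80 9c
    misread = first_pass.decode(ASSUMED_CODE_PAGE)  # three separate characters
    return misread.encode("utf-8")                  # second UTF-8 pass


def repair(mojibake: str) -> str:
    """Reverse one round of misreading by undoing the two steps in order."""
    return mojibake.encode(ASSUMED_CODE_PAGE).decode("utf-8")


opening_quote = "\u201c"
print(opening_quote.encode("utf-8").hex(" "))        # e2 80 9c
print(double_encode(opening_quote).decode("utf-8"))  # â€œ
print(repair("\u00e2\u20ac\u0153") == opening_quote) # True
```

Under the same assumption, not every character survives the round trip: Python's cp1252 codec has no mapping for a few bytes such as 0x9d, so the closing quote U+201D (e2 80 9d) would raise an error in this sketch rather than produce garbage.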

The safest way to include such characters in XML and HTML is to use numeric character references, i.e. &# notation (e.g. &#8364; is the Euro symbol, €). HTML 3.2 indicated that these were to be interpreted as ISO Latin-1, whereas HTML 4.0, XML 1.0 and later interpret them as Unicode. The named character references (e.g. &ldquo; for “) only work properly in HTML, not in XML documents. XML processors are required to recognise just five predefined entities: &lt;, &gt;, &amp;, &apos; and &quot;, which stand for <, >, &, ' and " respectively (see section 4.6 of the XML 1.0 specification [link to Tim Bray's annotated version; definitive version at www.w3.org]).
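The difference between numeric and named references can be checked with Python's standard library. This is only a sketch: to_numeric_refs and xml_accepts are hypothetical helper names, and the XML parser here is ElementTree with no DTD loaded, so only the five predefined entities are known to it.

```python
import html
import xml.etree.ElementTree as ET


def to_numeric_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal &#...; reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def xml_accepts(fragment: str) -> bool:
    """Return True if the fragment parses as XML without a DTD."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# The Euro symbol, U+20AC, becomes a numeric reference.
euro_ref = to_numeric_refs("\u20ac")

# An HTML decoder understands both named and numeric references.
html_text = html.unescape("&ldquo;quoted&rdquo; &#8364;")

# An XML parser understands numeric references and the five predefined entities.
xml_text = ET.fromstring("<p>&#8364; &lt; &gt; &amp; &apos; &quot;</p>").text

print(euro_ref)
print(html_text)
print(xml_text)
print(xml_accepts("<p>&ldquo;</p>"))   # named HTML entity, undefined in XML
print(xml_accepts("<p>&#8220;</p>"))   # numeric reference for the same character
```

The numeric form is accepted by both parsers, which is why it is the safer choice when the same content may end up in either HTML or XML.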
