Unicode and HTML: Difference between revisions

Komputist (talk | contribs)
Character encoding determination: Added subheadings. Added info on UTF-8 BOM. Added more info on XHTML. More data on encoding overriding. Deleted info on 'text/*' (read: text/xml) was incorrect: Even application/xhtm+xml can override enc. info
m Typo patrol, typos fixed: agianst → against, documetns → documents, the the → the using AWB (7794)
Line 40:
 
===Encoding information===
When a document is transmitted via a [[MIME]] message or a transport that uses MIME content types such as an [[HTTP]] response, the message may signal the encoding via a Content-Type header, such as <code>Content-Type: text/html; charset=UTF-8</code>. Other external means of declaring encoding are permitted but rarely used. If the document uses a [[Comparison_of_Unicode_encodings|Unicode encoding]], the encoding information might also be present in the form of a [[Byte order mark]]. Finally, the encoding can be declared via the HTML syntax. For the <code>text/html</code> serialisation, as long as the page is encoded in an extension of [[ASCII]] (such as [[UTF-8]], and thus not if the page is using [[UTF-16]]), a <code>meta</code> element such as <code>&lt;meta http-equiv="content-type" content="text/html; charset=UTF-8"&gt;</code> or (starting with [[HTML5]]) <code>&lt;meta charset="UTF-8"/&gt;</code> can be used. For HTML pages serialized as XML, the declaration options are either to rely on the encoding default (which for XML documents is UTF-8) or to use an XML encoding declaration (see below). The <code>meta</code> element plays no role in HTML served as XML.
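The three declaration mechanisms described above can be illustrated with a minimal Python sketch. It only reports what each source declares and does not pick a winner, because the precedence between them varies by parser. The function names, the 1024-byte scan limit, and the simplified regular expression are assumptions made for illustration and are not taken from any specification.

```python
import codecs
import re
from email.message import Message

# Byte order marks checked by this sketch (UTF-8 and UTF-16 only;
# UTF-32 is deliberately left out to keep the example short).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Simplified pattern covering both meta forms quoted in the text;
# a real HTML parser handles many more edge cases.
_META_RE = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased by the standard library


def charset_from_bom(data):
    """Return the encoding implied by a leading byte order mark, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def charset_from_meta(data, limit=1024):
    """Return the charset named in a meta element near the start, or None."""
    match = _META_RE.search(data[:limit])
    return match.group(1).decode("ascii").lower() if match else None


def declared_encodings(data, content_type=None):
    """Collect every declaration found, keyed by where it came from."""
    return {
        "http": charset_from_header(content_type) if content_type else None,
        "bom": charset_from_bom(data),
        "meta": charset_from_meta(data),
    }
```

Calling <code>declared_encodings(page_bytes, "text/html; charset=ISO-8859-1")</code> on a page that starts with a UTF-8 BOM would report conflicting declarations, which is the situation in which parsers must decide which source takes priority.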
 
===Encoding defaults===
Line 56:
For HTML documents which are <code>text/html</code> serialized, manual override may apply to all documents, or only those for which the encoding cannot be ascertained by looking at declarations and/or byte patterns. The fact that the manual override is present and widely used hinders the adoption of accurate encoding declarations on the Web; therefore the problem is likely to persist. But note that Internet Explorer, Chrome and Safari — for both XML and <code>text/html</code> serializations — do not permit the encoding to be overridden whenever the page includes the BOM.<ref>[http://www.w3.org/Bugs/Public/show_bug.cgi?id=12897 Bug 12897 - In some parsers, UTF-8 BOM trumps the HTTP charset attribute (Encoding sniffing algorithm)]</ref>
 
For HTML documents serialized with the preferred XML label, <code>application/xhtml+xml</code>, manual encoding override is not permitted. Overriding the encoding of such an XML document would mean that the document stopped being XML, as it is a fatal error for XML documents to have an encoding declaration with detectable errors. Currently, Gecko browsers such as Firefox abide by this rule, whereas most of the other common browsers that support HTML as XML, such as WebKit browsers (Chrome/Safari),<ref>[https://bugs.webkit.org/show_bug.cgi?id=66189 Bug 66189 - XML parser doesn't emit FATAL ERROR for all, detectable encoding errors]</ref> allow the encoding of XHTML documents to be manually overridden, contrary to the XML specification.
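The fatal-error behaviour that XML requires can be observed with any conforming XML parser. The sketch below uses Python's standard-library parser as a stand-in; it is not a browser, and the helper name <code>parse_strict</code> and the sample documents are hypothetical. It feeds the same ISO-8859-1 bytes to the parser twice, once with a mislabelled encoding declaration and once with the correct one.

```python
import xml.etree.ElementTree as ET


def parse_strict(data):
    """Return the root element, or None if the parser reports a fatal error."""
    try:
        return ET.fromstring(data)
    except ET.ParseError:
        return None


# The byte 0xE9 is "é" in ISO-8859-1 but is not a valid UTF-8 sequence here.
latin1_body = "<p>caf\u00e9</p>".encode("iso-8859-1")

# Declares UTF-8 while the bytes are ISO-8859-1: a detectable encoding error.
mislabelled = b'<?xml version="1.0" encoding="UTF-8"?>' + latin1_body

# Declares the encoding that the bytes actually use.
correct = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + latin1_body

rejected = parse_strict(mislabelled)  # parser stops instead of guessing
accepted = parse_strict(correct)      # parses normally
```

A parser that allowed a manual override for the mislabelled document would have to reinterpret the bytes rather than stop, which is the departure from the XML rule described above.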
 
==Web browser support==