Revision as of 02:34, 17 February 2012, Plugwash: →Character encoding: rm contradiction
Revision as of 03:02, 2 April 2012, Don4of4: m General Cleanup using AWB (0)
Line 6:
In RFC 1866 (the initial HTML 2.0 standard) the document character set was defined as ISO-8859-1. It was extended to [[ISO 10646]] (which is basically equivalent to Unicode) by RFC 2070. It does not vary between documents of different languages or created on different platforms. The external character encoding is chosen by the author of the document (or the software the author uses to create the document) and determines how the bytes used to store and/or transmit the document map to characters from the document character set. Characters not present in the chosen external character encoding may be represented by character entity references.

The relationship between [[Unicode]] and HTML tends to be a difficult topic for many computer professionals, document authors, and [[World Wide Web|web]] users alike. The accurate representation of text in [[web page]]s from different [[natural language]]s and [[writing system]]s is complicated by the details of [[character encoding]], [[markup language]] syntax, [[~~Computer_font~~Computer font|font]], and varying levels of support by [[web browser]]s.

== HTML document characters ==

Line 25:
In order to work around the limitations of legacy encodings, HTML is designed such that it is possible to represent characters from the whole of Unicode inside an HTML document by using a [[numeric character reference]]: a sequence of characters that explicitly spell out the Unicode code point of the character being represented. A character reference takes the form '''<code>&#</code>'''<var>N</var>'''<code>;</code>''', where <var>N</var> is either a [[decimal]] number for the Unicode code point, or a [[hexadecimal]] number, in which case it must be prefixed by <code>x</code>. The characters that compose the numeric character reference are universally representable in every encoding approved for use on the Internet.
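As an illustrative sketch only (it is not taken from any HTML specification), the following Python code shows how a numeric character reference of this form can be built from a code point, in both decimal and hexadecimal notation, and resolved again. The helper name <code>to_ncr</code> is hypothetical; <code>html.unescape</code> is part of the Python standard library. The sketch assumes that only non-ASCII characters need replacing and does not escape markup characters such as <code>&</code> or <code><</code>.

```python
import html


def to_ncr(text, hexadecimal=False):
    """Replace each non-ASCII character with a numeric character reference.

    Hypothetical helper for illustration; ASCII characters pass through
    unchanged, and markup characters such as & and < are NOT escaped.
    """
    pieces = []
    for ch in text:
        code_point = ord(ch)
        if code_point < 128:
            pieces.append(ch)
        elif hexadecimal:
            # Hexadecimal form: the number is prefixed by "x".
            pieces.append(f"&#x{code_point:X};")
        else:
            # Decimal form.
            pieces.append(f"&#{code_point};")
    return "".join(pieces)


decimal_ref = to_ncr("\u5408")
hex_ref = to_ncr("\u5408", hexadecimal=True)
print(decimal_ref)  # &#21512;
print(hex_ref)      # &#x5408;

# Both notations resolve to the same character.
print(html.unescape(decimal_ref) == html.unescape(hex_ref))  # True
```

For the decimal form alone, Python's built-in <code>"\u5408".encode("ascii", "xmlcharrefreplace")</code> produces the same reference as a byte string.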
For example, a Unicode code point like U+5408, which corresponds to a particular Chinese character, has to be converted to a decimal number, preceded by <code>&#</code> and followed by <code>;</code>, like this: <code>&#21512;</code>, which produces this: ~~合~~合 (if it doesn't look like a Chinese character, see [[Template:Special characters]]). The support for hexadecimal in this context is more recent, so older browsers might have problems displaying characters referenced with hexadecimal numbers, but they will probably have a problem displaying Unicode characters above code point 255 anyway. To ensure better compatibility with older browsers, it is still a common practice to convert the hexadecimal code point into a decimal value (for example <code>&#21512;</code> instead of <code>&#x5408;</code>).

Line 42:
===Encoding information===

When a document is transmitted via a [[MIME]] message or a transport that uses MIME content types such as an [[HTTP]] response, the message may signal the encoding via a Content-Type header, such as <code>Content-Type: text/html; charset=UTF-8</code>. Other external means of declaring encoding are permitted but rarely used. If the document uses a [[~~Comparison_of_Unicode_encodings~~Comparison of Unicode encodings|Unicode encoding]], the encoding information might also be present in the form of a [[Byte order mark]]. Finally, the encoding can be declared via the HTML syntax. For the <code>text/html</code> serialisation, as long as the page is encoded in an extension of [[ASCII]] (such as [[UTF-8]], but not [[UTF-16]]), a <code>meta</code> element such as <code><meta http-equiv="content-type" content="text/html; charset=UTF-8"></code> or (starting with [[HTML5]]) <code><meta charset="UTF-8"></code> can be used. For HTML pages serialized as XML, the options are either to rely on the encoding default (which for XML documents is UTF-8) or to use an XML encoding declaration.
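The three declaration mechanisms described above for <code>text/html</code> (Content-Type header, byte order mark, <code>meta</code> element) can be sketched roughly as follows in Python. This is a simplified illustration, not the algorithm any browser or specification uses: the function name <code>declared_encoding</code>, the 1024-byte scan window, and the regular expression are assumptions, and the order in which the checks run simply mirrors the order in which the text lists them rather than any mandated precedence.

```python
import codecs
import re
from email.message import Message

# Byte order marks checked by this sketch (UTF-32 is deliberately omitted).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Assumed, simplified pattern: matches both <meta charset="..."> and the
# http-equiv form, where charset= appears inside the content attribute.
META_RE = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE
)


def declared_encoding(content_type, body):
    """Return the declared encoding name in lower case, or None.

    content_type: value of the Content-Type header, or None if absent.
    body: the raw document bytes.
    """
    # 1. External declaration: the charset parameter of Content-Type.
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()
        if charset:
            return charset

    # 2. Byte order mark at the very start of the document.
    for bom, name in BOMS:
        if body.startswith(bom):
            return name

    # 3. Internal declaration: a meta element near the start of the
    #    document (1024 bytes is an assumed window for this sketch).
    match = META_RE.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()

    # No declaration found; a browser would fall back to a default.
    return None


print(declared_encoding("text/html; charset=UTF-8", b"<p>hi</p>"))  # utf-8
print(declared_encoding("text/html", b'<meta charset="UTF-8">'))    # utf-8
print(declared_encoding(None, b"<p>hi</p>"))                        # None
```

A <code>None</code> result corresponds to the case where no declaration of any kind is present.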
The <code>meta</code> element plays no role in HTML served as XML.

===Encoding defaults===

An encoding default applies when there is no external or internal encoding declaration and also no Byte order mark. While the encoding default for HTML pages served as XML is required to be UTF-8, the encoding default for a regular Web page (that is, for HTML pages serialized as <code>text/html</code>) varies depending on the localization of the browser. For a system set up mainly for Western European languages, it will generally be [[ISO 8859-1#Windows-1252|Windows-1252]]. For the Russian locale, the default is typically [[Windows-1251]]. For a browser from a location where ''legacy'' multi-byte character encodings are prevalent, some form of auto-detection is likely to be applied.

===Encoding trends===

Unicode and HTML: Difference between revisions