→Character encoding: rm contradiction |
m General Cleanup using AWB (0) |
Line 6:
In RFC 1866 (the initial HTML 2.0 standard) the document character set was defined as ISO-8859-1. It was extended to [[ISO 10646]] (which is basically equivalent to Unicode) by RFC 2070. The document character set does not vary between documents of different languages or created on different platforms. The external character encoding is chosen by the author of the document (or the software the author uses to create the document) and determines how the bytes used to store and/or transmit the document map to characters from the document character set. Characters not present in the chosen external character encoding may be represented by character entity references.
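The split between the document character set and the external encoding can be illustrated with a minimal Python sketch. The function name <code>encode_with_references</code> is a hypothetical helper, not part of any HTML tooling; it relies on Python's standard <code>xmlcharrefreplace</code> error handler, which substitutes a numeric character reference for any character the chosen encoding cannot represent.

```python
def encode_with_references(text: str, encoding: str) -> bytes:
    """Encode text in the given external encoding.

    Characters missing from that encoding are written as decimal
    numeric character references (&#N;) instead of raising an error.
    """
    return text.encode(encoding, errors="xmlcharrefreplace")


# "caf\u00e9 \u5408" is "café" followed by a space and the character U+5408.
sample = "caf\u00e9 \u5408"

# ISO-8859-1 can store U+00E9 directly but has no byte for U+5408,
# so U+5408 is emitted as a character reference.
latin1_bytes = encode_with_references(sample, "iso-8859-1")

# UTF-8 covers every Unicode code point, so no reference is needed.
utf8_bytes = encode_with_references(sample, "utf-8")

print(latin1_bytes)
print(utf8_bytes)
```

Both byte strings describe the same sequence of document characters; only the external representation differs.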
The relationship between [[Unicode]] and HTML tends to be a difficult topic for many computer professionals, document authors, and [[World Wide Web|web]] users alike. The accurate representation of text in [[web page]]s from different [[natural language]]s and [[writing system]]s is complicated by the details of [[character encoding]], [[markup language]] syntax, [[
== HTML document characters ==
Line 25:
In order to work around the limitations of legacy encodings, HTML is designed such that it is possible to represent characters from the whole of Unicode inside an HTML document by using a [[numeric character reference]]: a sequence of characters that explicitly spell out the Unicode code point of the character being represented. A character reference takes the form '''<code>&#</code>'''<var>N</var>'''<code>;</code>''', where <var>N</var> is either a [[decimal]] number for the Unicode code point, or a [[hexadecimal]] number, in which case it must be prefixed by <code>x</code>. The characters that compose the numeric character reference are universally representable in every encoding approved for use on the Internet.
For example, a Unicode code point like U+5408, which corresponds to a particular Chinese character, has to be converted to a decimal number, preceded by <code>&#</code> and followed by <code>;</code>, like this: <code>&#21512;</code>, which produces this: 合
The support for hexadecimal in this context is more recent, so older browsers might have problems displaying characters referenced with hexadecimal numbers—but they will probably have a problem displaying Unicode characters above code point 255 anyway. To ensure better compatibility with older browsers, it is still a common practice to convert the hexadecimal code point into a decimal value (for example <code>&#21512;</code> instead of <code>&#x5408;</code>).
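The conversion between a code point and its two reference forms can be sketched in a few lines of Python. The helper <code>numeric_reference</code> is a hypothetical name chosen for this illustration; <code>html.unescape</code> is the standard-library function that resolves character references back to characters.

```python
import html


def numeric_reference(char: str, hexadecimal: bool = False) -> str:
    """Return the numeric character reference for a single character."""
    code_point = ord(char)
    if hexadecimal:
        # Hexadecimal form: the number is prefixed by "x".
        return f"&#x{code_point:X};"
    # Decimal form.
    return f"&#{code_point};"


decimal_form = numeric_reference("\u5408")
hex_form = numeric_reference("\u5408", hexadecimal=True)

print(decimal_form)  # &#21512;
print(hex_form)      # &#x5408;

# Both forms resolve to the same character, U+5408.
assert html.unescape(decimal_form) == html.unescape(hex_form) == "\u5408"
```

Because the references consist only of ASCII characters, the resulting strings can be stored in any of the legacy encodings discussed here.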
Line 42:
===Encoding information===
When a document is transmitted via a [[MIME]] message or a transport that uses MIME content types such as an [[HTTP]] response, the message may signal the encoding via a Content-Type header, such as <code>Content-Type: text/html; charset=UTF-8</code>. Other external means of declaring encoding are permitted but rarely used. If the document uses an [[
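As a rough illustration of the external declaration, the <code>charset</code> parameter can be read from a Content-Type value with Python's standard <code>email.message.Message</code> class, which implements MIME header parsing. The wrapper name <code>charset_from_content_type</code> is an assumption made for this sketch, and real browsers apply further rules that are not modelled here.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the lower-cased charset parameter, or None if absent."""
    message = Message()
    message["Content-Type"] = header_value
    return message.get_content_charset()


declared = charset_from_content_type("text/html; charset=UTF-8")
undeclared = charset_from_content_type("text/html")

print(declared)    # utf-8
print(undeclared)  # None
```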
===Encoding defaults===
An encoding default applies when there is no external or internal encoding declaration and also no byte order mark. While the encoding default for HTML pages served as XML is required to be UTF-8, the encoding default for a regular Web page (that is: for HTML pages serialized as <code>text/html</code>) varies depending on the localization of the browser. For a system set up mainly for Western European languages, it will generally be [[ISO 8859-1#Windows-1252|Windows-1252]]. For the Russian locale, the default is typically [[Windows-1251]]. For a browser from a location where ''legacy'' multi-byte character encodings are prevalent, some form of auto-detection is likely to be applied.
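The fallback described above might be sketched as follows. Everything named here is hypothetical: the <code>LOCALE_DEFAULTS</code> table contains only the two locales mentioned in the text, the function <code>pick_encoding</code> is invented for the example, and the choice to check the byte order mark before the declaration, as well as the Windows-1252 fallback for unlisted locales, are assumptions of this sketch rather than a description of any particular browser. Auto-detection for multi-byte legacy encodings is not modelled.

```python
import codecs
from typing import Optional

# Hypothetical table of per-locale defaults, limited to the cases in the text.
LOCALE_DEFAULTS = {
    "western-european": "windows-1252",
    "ru": "windows-1251",
}

# Byte order marks from the standard codecs module and the encodings they signal.
BYTE_ORDER_MARKS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def pick_encoding(data: bytes, declared: Optional[str], browser_locale: str) -> str:
    """Choose an encoding for a text/html document.

    The locale default is used only when there is neither a byte order
    mark nor an (external or internal) declaration.
    """
    for mark, name in BYTE_ORDER_MARKS:
        if data.startswith(mark):
            return name
    if declared:
        return declared
    # Assumption: unlisted locales fall back to Windows-1252.
    return LOCALE_DEFAULTS.get(browser_locale, "windows-1252")


with_mark = pick_encoding(codecs.BOM_UTF8 + b"<p>hi</p>", None, "ru")
with_declaration = pick_encoding(b"<p>hi</p>", "utf-8", "ru")
russian_default = pick_encoding(b"<p>hi</p>", None, "ru")
western_default = pick_encoding(b"<p>hi</p>", None, "western-european")

print(with_mark, with_declaration, russian_default, western_default)
```

The same undeclared bytes would therefore be decoded differently in a Russian and a Western European browser, which is why an explicit declaration is the safer choice for authors.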
===Encoding trends===