Revision as of 23:08, 25 June 2012 by 86.30.202.122 (edit summary: Letting the transport override the encoding accords with the XML spec (section F.2))
Revision as of 15:12, 6 August 2012 by Funandtrvl (edit summary: mos:layout)
Line 4:
Web pages authored using '''hypertext markup language''' ([[HTML]]) may contain multilingual text represented with the '''Unicode universal character set'''. Key to the relationship between Unicode and HTML is the distinction between the "document character set", which defines the set of characters that may be present in an HTML document and assigns numbers to them, and the "external character encoding" or "charset", which is used to encode a given document as a sequence of bytes. In RFC 1866 (the initial HTML 2.0 standard), the document character set was defined as ISO-8859-1. It was extended to [[ISO 10646]] (which is basically equivalent to Unicode) by RFC 2070. The document character set does not vary between documents in different languages or documents created on different platforms. The external character encoding is chosen by the author of the document (or the software the author uses to create the document) and determines how the bytes used to store and/or transmit the document map to characters from the document character set. Characters not present in the chosen external character encoding may be represented by character entity references.

The relationship between [[Unicode]] and HTML tends to be a difficult topic for many computer professionals, document authors, and [[World Wide Web|web]] users alike. The accurate representation of text in [[web page]]s from different [[natural language]]s and [[writing system]]s is complicated by the details of [[character encoding]], [[markup language]] syntax, [[Computer font|font]], and varying levels of support by [[web browser]]s.

Line 32:
{{main|character entity reference}}
In HTML, there is a standard set of 252 named ''character entities'' for characters, some common and some obscure, that are either not found in certain character encodings or are markup sensitive in some contexts (for example angle brackets and quotation marks).
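The fallback described above, where characters missing from the external character encoding are written as character references, can be sketched as follows. This is a minimal illustration, assuming Python 3 and its standard `html` module (neither is mentioned in the article); the helper name `encode_for_charset` is hypothetical.

```python
import html
from html.entities import name2codepoint


def encode_for_charset(text: str, charset: str) -> bytes:
    """Encode text, writing characters the charset lacks as &#N; references."""
    # "xmlcharrefreplace" is a standard codec error handler in Python.
    return text.encode(charset, errors="xmlcharrefreplace")


# e with acute accent, em dash, and a CJK ideograph
sample = "caf\u00e9 \u2014 \u4e2d"

# ISO-8859-1 contains the accented e but not the dash or the ideograph.
latin1_bytes = encode_for_charset(sample, "iso-8859-1")

# ASCII contains none of the three, so all become numeric references.
ascii_bytes = encode_for_charset(sample, "ascii")

# A reader that treats reference numbers as Unicode code points
# recovers the original text regardless of the external encoding.
decoded = html.unescape(latin1_bytes.decode("iso-8859-1"))

# Named entities resolve to the same code points as numeric references.
named_em_dash = chr(name2codepoint["mdash"])
```

The point of the sketch is that the byte-level encoding changes between the two calls, while the numbers inside the references stay the same, because they refer to the document character set rather than to the charset.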
Although any Unicode character can be referenced by its numeric code point, some HTML document authors prefer to use these named entities instead, where possible, as they are less cryptic and were better supported by early browsers. Character entities can be included in an HTML document via the use of ''entity references'', which take the form '''<code>&</code>'''<var>EntityName</var>'''<code>;</code>''', where <var>EntityName</var> is the name of the entity. For example, <code>&mdash;</code>, much like <code>&#8212;</code> or <code>&#x2014;</code>, represents {{U+|2014}}: the [[em dash]] character "—", even if the character encoding used doesn't contain that character.

Line 155:
Some web browsers, such as [[Mozilla Firefox]], [[Opera (web browser)|Opera]], [[Safari (web browser)|Safari]] and [[Internet Explorer]] (from version 7 on), are able to display multilingual web pages by intelligently choosing a font to display each individual character on the page. They will correctly display any mix of [[Mapping of Unicode characters|Unicode blocks]], as long as appropriate [[List of typefaces#Unicode fonts|fonts]] are present in the [[operating system]].

Older browsers, such as [[Netscape Navigator]] 4.77 and [[Internet Explorer 6]], can only display text supported by the current font associated with the character encoding of the page, and may misinterpret numeric character references as being references to code values within the current character encoding, rather than references to Unicode code points. When you are using such a browser, it is unlikely that your computer has all of the required fonts, or that the browser can use all available fonts on the same page. As a result, the browser will not display all of the examples above correctly, though it may display some of them. Because the examples are encoded according to the standard, though, they ''will'' display correctly on any system that is compliant and does have the characters available.
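The misreading of numeric character references by older browsers, mentioned above, can be illustrated with a small sketch. It assumes Python 3 and uses windows-1252 as the page encoding; both are choices made for illustration, since the article does not name a specific encoding, and the function names are hypothetical.

```python
def resolve_as_unicode(number: int) -> str:
    """Standard reading: the number in &#N; is a Unicode code point."""
    return chr(number)


def resolve_as_page_encoding(number: int, charset: str) -> str:
    """Legacy misreading: the number is a byte value in the page's charset."""
    return bytes([number]).decode(charset)


# &#8212; is the em dash under the standard reading.
correct = resolve_as_unicode(8212)

# &#151; is a C1 control character under the standard reading...
control_char = resolve_as_unicode(151)

# ...but byte 0x97 in windows-1252 is the em dash, so a browser that
# resolves references against the page encoding shows a dash instead.
misread = resolve_as_page_encoding(151, "windows-1252")
```

Under these assumptions the two readings agree only by accident, which is why a reference that looks right in one such browser can display differently in a compliant one.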
Further, those characters given names for use in named entity references are likely to be more commonly available than others.

For displaying characters outside the [[Basic Multilingual Plane]], like the Gothic letter faihu, which is a variant of the runic letter Fehu in the table above, some systems (like Windows 2000) need manual adjustments of their settings.

Line 165:
* [[meta:Help:Special characters|Help file for using special characters on Wikipedia]]
* [[Character encodings in HTML]]
* [[Charset detection]]
* [[wikibooks:Unicode/Character reference|Unicode character reference (wikibooks)]]

== References ==▼
{{refimprove|date=January 2011}}▼
{{reflist}}

== External links ==

Line 180 ⟶ 184:
[http://www.pinyin.info/tools/converter/chars2uninumbers.html Web tool that converts "special" characters (such as Chinese characters) to Unicode numeric character references]
[http://www.hotpeachpages.net/a/characters.html Multi-lingual web pages and Unicode] - how to fix display problems

▲== References ==
~~<references />~~
▲{{refimprove|date=January 2011}}

{{Unicode navigation}}

[[Category:Unicode]]▼
[[Category:HTML]]
▲[[Category:Unicode]]

[[fr:Unicode et HTML]]

Unicode and HTML: Difference between revisions