Unicode and HTML: Difference between revisions

Revision as of 18:52, 6 April 2022 by 150.199.172.253: all (Tags: Reverted, Visual edit)
Latest revision as of 21:13, 10 October 2024 by 93.150.208.161: →Frequency of usage (Tag: Visual edit)
(14 intermediate revisions by 12 users not shown)
Line 4:
{{essay-like|date=December 2011}}
{{refimprove|date=January 2011}}
~~{{Rewrite|date=July 2018}}~~
}}
{{SpecialChars}}
{{Html series}}

Web pages authored using ~~'''~~HyperText Markup Language~~'''~~ ([[HTML email|HTML]]) may contain multilingual text represented with the ~~'''~~Unicode universal character set~~'''~~. Key to the relationship between Unicode and HTML is the relationship between the "document character set", which defines the set of characters that may be present in an HTML document and assigns numbers to them, and the "external character encoding", or "charset", used to encode a given document as a sequence of bytes.

In RFC 1866, the initial HTML 2.0 standard, the document character set was defined as ISO-8859-1 (later HTML standards default to the [[Windows-1252]] encoding). It was extended to [[ISO 10646]] (which is basically equivalent to Unicode) by {{IETF RFC|2070}}. It does not vary between documents of different languages or created on different platforms. The external character encoding is chosen by the author of the document (or the software the author uses to create the document) and determines how the bytes used to store and/or transmit the document map to characters from the document character set. Characters not present in the chosen external character encoding may be represented by character entity references.

Line 15 ⟶ 14:
==HTML document characters==
Web pages are typically [[HTML]] or [[XHTML]] documents. Both types of documents consist, at a fundamental level, of [[character (computing)|character]]s, which are [[grapheme]]s and grapheme-like units, independent of how they manifest in [[computer storage]] systems and [[computer network|network]]s.

An HTML document is a sequence of Unicode characters. More specifically, HTML 4.0 documents are required to consist of characters in the HTML ''document character set'': a character repertoire wherein each character is assigned a unique, non-negative integer ''code point''.
This set is defined in the HTML 4.0 [[Document Type Definition|DTD]], which also establishes the syntax (allowable sequences of characters) that can produce a valid HTML document. The HTML document character set for HTML 4.0 consists of most, but not all, of the characters jointly defined by [[Unicode]] and ISO/IEC 10646: the [[Universal Character Set]] (UCS).

Line 29 ⟶ 28:
{{Main|Numeric character reference}}
In order to work around the limitations of legacy encodings, HTML is designed such that it is possible to represent characters from the whole of Unicode inside an HTML document by using a [[numeric character reference]]: a sequence of characters that explicitly spells out the Unicode code point of the character being represented. A character reference takes the form '''<code>&#</code>'''<var>N</var>'''<code>;</code>''', where <var>N</var> is either a [[decimal]] number for the Unicode code point, or a [[hexadecimal]] number, in which case it must be prefixed by <code>x</code>. The characters that compose the numeric character reference are universally representable in every encoding approved for use on the Internet.{{citation needed|date=June 2022}}

The support for hexadecimal in this context is more recent, so older browsers might have problems displaying characters referenced with hexadecimal numbers{{snd}} but they will probably have a problem displaying Unicode characters above code point 255 anyway. To ensure better compatibility with older browsers, it is still a common practice to convert the hexadecimal code point into a decimal value (for example <code>&#21512;</code> instead of <code>&#x5408;</code>).{{citation needed|date=June 2022}}

===Named character entities===

Line 60 ⟶ 59:
Many HTML documents are served with inaccurate encoding information, or no encoding information at all. In order to determine the encoding in such cases, many browsers allow the user to manually select an encoding name from a list.
They may also employ an encoding auto-detection algorithm that works in concert '''with''' or{{snd}} ''in the case of the BOM and in case of HTML served as XML''{{snd}} '''against''' the manual override. For HTML documents which are <code>text/html</code> serialized, manual override may apply to all documents, or only those for which the encoding cannot be ascertained by looking at declarations and/or byte patterns. The fact that the manual override is present and widely used hinders the adoption of accurate encoding declarations on the Web; therefore the problem is likely to persist. But note that Internet Explorer, Chrome and Safari{{snd}} for both XML and <code>text/html</code> serializations{{snd}} do not permit the encoding to be overridden whenever the page includes the BOM.<ref>~~[http://www.w3.org/Bugs/Public/show_bug.cgi?id=12897 Bug 12897 - In some parsers, UTF-8 BOM trumps the HTTP charset attribute (Encoding sniffing algorithm)]~~ {{Cite web |title=12897 – In some parsers, UTF-8 BOM trumps the HTTP charset attribute (Encoding sniffing algorithm) |url=https://www.w3.org/Bugs/Public/show_bug.cgi?id=12897 |access-date=2023-03-09 |website=www.w3.org}}</ref>

For HTML documents serialized with the preferred XML label{{snd}} <code>application/xhtml+xml</code>, manual encoding override is not permitted. To override the encoding of such an XML document would mean that the document stopped being XML, as it is a fatal error for XML documents to have an encoding declaration with detectable errors. Currently, Gecko browsers such as Firefox abide by this rule, whereas most other common browsers that support HTML as XML, such as WebKit browsers (Chrome/Safari),<ref>~~[https://bugs.webkit.org/show_bug.cgi?id=66189 Bug 66189 - XML parser doesn't emit FATAL ERROR for all, detectable encoding errors]~~ {{Cite web |title=66189 – XML parser doesn't emit FATAL ERROR for all, detectable encoding errors |url=https://bugs.webkit.org/show_bug.cgi?id=66189 |access-date=2023-03-09 |website=bugs.webkit.org}}</ref> do allow the encoding of XHTML documents to be manually overridden.
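
The precedence described above, where a byte order mark takes priority over a manual override, can be illustrated with a minimal Python sketch. It is not any browser's actual sniffing algorithm: the function names (<code>sniff_bom</code>, <code>choose_encoding</code>, <code>decode_page</code>), the parameter names, and the <code>windows-1252</code> fallback are assumptions made for illustration, and only the UTF-8 and UTF-16 BOMs are checked.

```python
import codecs

# BOM signatures checked in this sketch (UTF-8 and the two UTF-16 byte orders).
# Which BOMs a real browser honours is outside the scope of this example.
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM, else (None, 0)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def choose_encoding(data: bytes, user_override=None, declared=None,
                    fallback="windows-1252"):
    """Pick an encoding with the hypothetical precedence:
    BOM, then manual override, then declared charset, then fallback."""
    bom_encoding, _ = sniff_bom(data)
    if bom_encoding is not None:
        # The BOM wins, so the manual override is ignored.
        return bom_encoding
    return user_override or declared or fallback


def decode_page(data: bytes, user_override=None, declared=None) -> str:
    """Decode the page bytes, dropping the BOM if one is present."""
    encoding = choose_encoding(data, user_override, declared)
    _, skip = sniff_bom(data)
    return data[skip:].decode(encoding, errors="replace")
```

With this sketch, a page that begins with the UTF-8 BOM is decoded as UTF-8 even when the caller passes <code>user_override="iso-8859-1"</code>, while a page without a BOM follows the override.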
==Web browser support==

Line 132 ⟶ 131:
!scope="row" | U+53F6
| <code>&#21494;</code> or <code>&#x53F6;</code>
| [[CJK Unified Ideographs|CJK Unified Ideograph]]-53F6 ([[simplified Chinese characters|Simplified Chinese]] "Leaf")
| style="text-align:center;font-size:large;" | 叶
|-
!scope="row" | U+8449
| <code>&#33865;</code> or <code>&#x8449;</code>
| [[CJK Unified Ideographs|CJK Unified Ideograph]]-8449 ([[traditional Chinese characters|Traditional Chinese]] "Leaf")
| style="text-align:center;font-size:large;" | 葉
|-

Line 170 ⟶ 169:
==Frequency of usage==
According to internal data from [[Google]]'s web index, in December 2007 the [[UTF-8]] Unicode encoding became the most frequently used encoding on web pages, overtaking both [[ASCII]] (US) and [[ISO/IEC 8859-1|8859-1]]/[[Windows-1252|1252]] (Western European).<ref>~~[[Mark Davis (Unicode)|Mark Davis]]: [http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html Moving to Unicode 5.1] Official Google blog, 5 May 2008~~ {{Cite web |title=Moving to Unicode 5.1 |url=https://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html |access-date=2024-10-10 |website=Official Google Blog |language=en}}</ref>

==See also==

Line 191 ⟶ 190:
* http://www.alanwood.net/unicode/cjk_compatibility_ideographs.html CJK Compatibility Ideographs
* http://www.unicode.org/charts/ Unicode character charts; hexadecimal numbers only; PDF files showing all characters independent of browser capabilities
* [http://unicode.coeurlumiere.com/ Table of Unicode characters from 1 to 65535] {{Webarchive|url=https://web.archive.org/web/20071103125951/http://unicode.coeurlumiere.com/ |date=2007-11-03 }} - shows how they look in one's browser
* [http://www.pinyin.info/tools/converter/chars2uninumbers.html Web tool that converts "special" characters (such as Chinese characters) to Unicode numeric character references]
* [http://www.hotpeachpages.net/a/characters.html Multi-lingual web pages and Unicode] - how to fix display problems

Line 198 ⟶ 197:
[[Category:HTML]]
[[Category:Unicode|HTML]]