Unicode and HTML: Difference between revisions

Revision as of 12:46, 15 October 2022 by InternetArchiveBot: (Rescuing 1 sources and tagging 0 as dead.) #IABot (v2.0.9.2) (Whoop whoop pull up - 10895)
Latest revision as of 21:13, 10 October 2024 by 93.150.208.161: →Frequency of usage (Tag: Visual edit)
(9 intermediate revisions by 8 users not shown)
Line 4:
{{essay-like|date=December 2011}}
{{refimprove|date=January 2011}}
~~{{Rewrite|date=July 2018}}~~
}}
{{SpecialChars}}
{{Html series}}

Web pages authored using ~~'''~~HyperText Markup Language~~'''~~ ([[HTML email|HTML]]) may contain multilingual text represented with the ~~'''~~Unicode universal character set~~'''~~. Key to the relationship between Unicode and HTML is the relationship between the "document character set", which defines the set of characters that may be present in ~~a~~an HTML document and assigns numbers to them, and the "external character encoding", or "charset", used to encode a given document as a sequence of bytes.

In RFC 1866, the initial HTML 2.0 standard, the document character set was defined as ISO-8859-1 (later HTML standards default to the [[Windows-1252]] encoding). It was extended to [[ISO 10646]] (which is basically equivalent to Unicode) by {{IETF RFC|2070}}. It does not vary between documents of different languages or created on different platforms. The external character encoding is chosen by the author of the document (or the software the author uses to create the document) and determines how the bytes used to store and/or transmit the document map to characters from the document character set. Characters not present in the chosen external character encoding may be represented by character entity references.

Line 60 ⟶ 59:
Many HTML documents are served with inaccurate encoding information, or no encoding information at all. In order to determine the encoding in such cases, many browsers allow the user to manually select an encoding name from a list. They may also employ an encoding auto-detection algorithm that works in concert '''with''' or{{snd}} ''in the case of the BOM and in case of HTML served as XML''{{snd}} '''against''' the manual override.
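Both hunks above turn on the difference between characters and the bytes that encode them. The following is a minimal Python sketch of that difference, not part of either revision: the sample text and the choice of encodings are assumptions made purely for illustration. It shows one string yielding different bytes under different external encodings, a fallback to numeric character references when the encoding lacks a character, and the garbling that results when bytes are decoded under the wrong charset label.

```python
# Illustrative only: the sample text and the encodings are assumptions.
text = "café €"

# One sequence of characters, two external encodings, two byte sequences.
utf8_bytes = text.encode("utf-8")
cp1252_bytes = text.encode("cp1252")

# ASCII contains neither "é" nor "€"; numeric character references
# let an ASCII-encoded document carry them anyway.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")

# Decoding UTF-8 bytes under a wrong charset label does not fail here;
# it silently yields garbled text.
mislabelled = utf8_bytes.decode("cp1252")

print(utf8_bytes)
print(cp1252_bytes)
print(ascii_bytes)
print(mislabelled)
```

The third result is the case the lead paragraph describes: the document stays pure ASCII on the wire while still denoting characters from the wider document character set.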
For HTML documents which are <code>text/html</code> serialized, manual override may apply to all documents, or only those for which the encoding cannot be ascertained by looking at declarations and/or byte patterns. The fact that the manual override is present and widely used hinders the adoption of accurate encoding declarations on the Web; therefore the problem is likely to persist. But note that Internet Explorer, Chrome and Safari{{snd}} for both XML and <code>text/html</code> serializations{{snd}} do not permit the encoding to be overridden whenever the page includes the BOM.<ref>~~[http://www.w3.org/Bugs/Public/show_bug.cgi?id=12897 Bug 12897 - In some parsers, UTF-8 BOM trumps the HTTP charset attribute (Encoding sniffing algorithm)]~~{{Cite web |title=12897 – In some parsers, UTF-8 BOM trumps the HTTP charset attribute (Encoding sniffing algorithm) |url=https://www.w3.org/Bugs/Public/show_bug.cgi?id=12897 |access-date=2023-03-09 |website=www.w3.org}}</ref>

For HTML documents serialized with the preferred XML label{{snd}} <code>application/xhtml+xml</code>, manual encoding override is not permitted. To override the encoding of such an XML document would mean that the document stopped being XML, as it is a fatal error for XML documents to have an encoding declaration with detectable errors. Currently, Gecko browsers such as Firefox abide by this rule, whereas the bulk of the other common browsers that support HTML as XML, such as WebKit browsers (Chrome/Safari),<ref>~~[https://bugs.webkit.org/show_bug.cgi?id=66189 Bug 66189 - XML parser doesn't emit FATAL ERROR for all, detectable encoding errors]~~{{Cite web |title=66189 – XML parser doesn't emit FATAL ERROR for all, detectable encoding errors |url=https://bugs.webkit.org/show_bug.cgi?id=66189 |access-date=2023-03-09 |website=bugs.webkit.org}}</ref> do allow the encoding of XHTML documents to be manually overridden.
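The precedence described above, where a byte order mark wins over both the declared charset and a manual override, can be sketched roughly as follows. This is a hypothetical helper, not any browser's actual algorithm: the function name `pick_encoding`, its parameters, the ordering of override before HTTP charset, and the `windows-1252` fallback are all assumptions chosen to mirror the text.

```python
import codecs

# BOM byte patterns mapped to encoding labels (labels are illustrative).
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def pick_encoding(data, http_charset=None, user_override=None,
                  fallback="windows-1252"):
    """Hypothetical precedence: BOM, then manual override, then the
    HTTP charset, then an assumed default."""
    for bom, label in _BOMS:
        if data.startswith(bom):
            # A BOM is present, so neither the override nor the
            # declared charset is consulted.
            return label
    return user_override or http_charset or fallback


page_with_bom = codecs.BOM_UTF8 + b"<p>hi</p>"
print(pick_encoding(page_with_bom, http_charset="iso-8859-1",
                    user_override="shift_jis"))
print(pick_encoding(b"<p>hi</p>", http_charset="iso-8859-1",
                    user_override="shift_jis"))
```

This sketch models the variant in which the manual override applies to every document without a BOM; a browser that only honours the override when the encoding cannot otherwise be ascertained would check `http_charset` first.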
==Web browser support==

Line 170 ⟶ 169:
==Frequency of usage==

According to internal data from [[Google]]'s web index, in December 2007 the [[UTF-8]] Unicode encoding became the most frequently used encoding on web pages, overtaking both [[ASCII]] (US) and [[ISO/IEC 8859-1|8859-1]]/[[Windows-1252|1252]] (Western European).<ref>~~[[Mark Davis (Unicode)|Mark Davis]]: [http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html Moving to Unicode 5.1] Official Google blog, 5 May 2008~~{{Cite web |title=Moving to Unicode 5.1 |url=https://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html |access-date=2024-10-10 |website=Official Google Blog |language=en}}</ref>

==See also==

Line 198 ⟶ 197:
[[Category:HTML]]
[[Category:Unicode|HTML]]