Unicode and HTML: Difference between revisions

{{Short description|Relationship between Unicode characters and HTML}}
{{Rewrite|date=July 2018}}
{{Multiple issues|
{{primary sources|date=December 2011}}
{{SpecialChars}}
{{Html series}}
Web pages authored using '''HyperText Markup Language''' ([[HTML email|HTML]]) may contain multilingual text represented with the '''Unicode universal character set'''. Key to the relationship between Unicode and HTML is the relationship between the "document character set", which defines the set of characters that may be present in an HTML document and assigns numbers to them, and the "external character encoding", or "charset", used to encode a given document as a sequence of bytes.
 
In RFC 1866, the initial HTML 2.0 standard, the document character set was defined as ISO-8859-1 (later HTML standards default to the [[Windows-1252]] encoding). It was extended to [[ISO 10646]] (which is basically equivalent to Unicode) by {{IETF RFC|2070}}. The document character set does not vary between documents of different languages or created on different platforms. The external character encoding is chosen by the author of the document (or the software the author uses to create the document) and determines how the bytes used to store or transmit the document map to characters from the document character set. Characters not present in the chosen external character encoding may be represented by character entity references.
 
The relationship between [[Unicode]] and HTML tends to be a difficult topic for many computer professionals, document authors, and [[World Wide Web|web]] users alike. The accurate representation of text in [[web page]]s from different [[natural language]]s and [[writing system]]s is complicated by the details of [[character encoding]], [[markup language]] syntax, [[Computer font|font]], and varying levels of support by [[web browser]]s.
 
== HTML document characters ==
Web pages are typically [[HTML]] or [[XHTML]] documents. Both types of documents consist, at a fundamental level, of [[character (computing)|character]]s, which are [[grapheme]]s and grapheme-like units, independent of how they manifest in [[computer storage]] systems and [[computer network|network]]s.
 
Regardless of whether the document is HTML or XHTML, when stored on a [[file system]] or transmitted over a network, the document's characters are ''encoded'' as a sequence of [[bit]] [[octet (computing)|octet]]s (''[[byte]]s'') according to a particular character encoding. This encoding may either be a [[Unicode Transformation Format]], like [[UTF-8]], that can directly encode any Unicode character, or a legacy encoding, like [[Windows-1252]], that cannot. However, even when using encodings that do not support all Unicode characters, the encoded document may make use of [[numeric character references]]. For example, <code>&amp;#x263A;</code> (☺) is used to indicate a smiling face character in the Unicode character set.
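
The fallback described above can be illustrated with a minimal Python sketch. It assumes Python's built-in <code>xmlcharrefreplace</code> error handler, which substitutes a decimal numeric character reference for any character the target encoding cannot represent; the sample string is invented for illustration.

```python
# Sketch: encode text for a legacy charset, falling back to numeric
# character references for characters the charset cannot represent.
text = "Caf\u00e9 \u263a"  # "Café" followed by U+263A (smiling face)

# Windows-1252 has a byte for é (0xE9) but none for U+263A, so the
# error handler writes the decimal reference &#9786; in its place.
legacy_bytes = text.encode("cp1252", errors="xmlcharrefreplace")
print(legacy_bytes)

# UTF-8 can encode every Unicode character directly, so no reference is needed.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)
```

The decimal value 9786 is the same code point as the hexadecimal <code>263A</code> used in the reference above.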
 
=== Character encoding===
In order to support all Unicode characters without resorting to numeric character references, a web page must have an encoding covering all of Unicode. The most popular is [[UTF-8]], in which the [[ASCII]] characters, such as English letters, digits, and some other common characters, are encoded with the same single bytes as in ASCII. HTML markup (such as &lt;br> and &lt;/div>) is therefore byte-for-byte identical to its ASCII form. Characters outside the ASCII range are stored in 2–4 bytes. It is also possible to use [[UTF-16]], in which most characters are stored as two bytes in either [[endianness|byte order]]; it is supported by modern browsers but less commonly used.
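
As a rough illustration of these byte lengths, the following Python sketch encodes four sample characters (chosen here for illustration, one for each UTF-8 length) and checks that ASCII markup is unchanged under UTF-8.

```python
# Sketch: how many bytes UTF-8 uses for characters in different ranges.
samples = ["A", "\u00df", "\u5408", "\U0001F602"]  # A, ß, 合, 😂

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
for ch, length in zip(samples, utf8_lengths):
    print(f"U+{ord(ch):04X} -> {length} byte(s) in UTF-8")

# ASCII-only markup yields the same bytes in UTF-8 as in ASCII.
markup = "<br></div>"
same_as_ascii = markup.encode("utf-8") == markup.encode("ascii")

# UTF-16 stores the same character in two bytes whose order depends
# on endianness.
little_endian = "A".encode("utf-16-le")  # 0x41 0x00
big_endian = "A".encode("utf-16-be")     # 0x00 0x41
```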
 
=== Numeric character references ===
{{Main|Numeric character reference}}
 
In order to work around the limitations of legacy encodings, HTML is designed such that it is possible to represent characters from the whole of Unicode inside an HTML document by using a [[numeric character reference]]: a sequence of characters that explicitly spell out the Unicode code point of the character being represented. A character reference takes the form '''<code>&amp;#</code>'''<var>N</var>'''<code>;</code>''', where <var>N</var> is either a [[decimal]] number for the Unicode code point, or a [[hexadecimal]] number, in which case it must be prefixed by <code>x</code>. The characters that compose the numeric character reference are universally representable in every encoding approved for use on the Internet.{{citation needed|date=June 2022}}
 
The support for hexadecimal in this context is more recent, so older browsers might have problems displaying characters referenced with hexadecimal numbers{{snd}} but they will probably have a problem displaying Unicode characters above code point 255 anyway. To ensure better compatibility with older browsers, it is still a common practice to convert the hexadecimal code point into a decimal value (for example <code>&amp;#21512;</code> instead of <code>&amp;#x5408;</code>).{{citation needed|date=June 2022}}
For example, a Unicode code point like U+5408, which corresponds to a particular Chinese character, has to be converted to a decimal number, preceded by <code>&amp;#</code> and followed by <code>;</code>, like this: <code>&amp;#21512;</code>, which produces this: 合 (if it doesn't look like a Chinese character, see [[Template:Special characters]]).
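
The conversion between the two forms can be sketched in Python as follows. The helper name <code>numeric_references</code> is hypothetical; <code>html.unescape</code> is the standard-library function that resolves character references.

```python
import html


def numeric_references(char: str) -> tuple[str, str]:
    """Return the decimal and hexadecimal references for one character."""
    code_point = ord(char)
    return f"&#{code_point};", f"&#x{code_point:X};"


# U+5408 is the character used in the example above.
decimal_ref, hex_ref = numeric_references("\u5408")
print(decimal_ref, hex_ref)

# Both forms resolve to the same character.
resolved = (html.unescape(decimal_ref), html.unescape(hex_ref))
```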
 
=== Named character entities ===
{{Main|character entity reference}}
 
In HTML 4, there is a standard set of 252 named ''character entities'' for characters{{snd}} some common, some obscure{{snd}} that are either not found in certain character encodings or are markup sensitive in some contexts (for example angle brackets and quotation marks). Although any Unicode character can be referenced by its numeric code point, some HTML document authors prefer to use these named entities instead, where possible, as they are less cryptic and were better supported by early browsers.
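
A minimal Python sketch of this preference is shown below. It assumes the standard-library table <code>html.entities.codepoint2name</code>, which maps code points to HTML 4 entity names; the helper name <code>character_reference</code> is hypothetical.

```python
import html
from html.entities import codepoint2name


def character_reference(char: str) -> str:
    """Prefer an HTML 4 named entity; otherwise use a decimal reference."""
    code_point = ord(char)
    name = codepoint2name.get(code_point)
    return f"&{name};" if name else f"&#{code_point};"


named = character_reference("\u00e9")    # é has a named entity
markup = character_reference("<")        # markup-sensitive character
numeric = character_reference("\u5408")  # no HTML 4 name exists
print(named, markup, numeric)

# Named entities resolve to the same characters as numeric references.
round_trip = html.unescape(named)
```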
 
===Encoding information===
When a document is transmitted via a [[MIME]] message or a transport that uses MIME content types such as an [[HTTP]] response, the message may signal the encoding via a Content-Type header, such as <code>Content-Type: text/html; charset=UTF-8</code>. Other external means of declaring encoding are permitted but rarely used. If the document uses a [[Comparison of Unicode encodings|Unicode encoding]], the encoding information might also be present in the form of a [[byte order mark]] (BOM). Finally, the encoding can be declared via the HTML syntax. For the <code>text/html</code> serialization, as long as the page is encoded in an extension of [[ASCII]] (such as [[UTF-8]], but not [[UTF-16]]), a <code>meta</code> element, like <code>&lt;meta http-equiv="content-type" content="text/html; charset=UTF-8"&gt;</code> or (starting with [[HTML5]]) <code>&lt;meta charset="UTF-8"></code>, can be used. For HTML pages serialized as XML, the options are either to rely on the encoding default (which for XML documents is UTF-8) or to use an XML encoding declaration. The <code>meta</code> element plays no role in HTML served as XML.
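
Reading the <code>charset</code> parameter from such a header can be sketched in Python with the standard-library MIME parser. The function name <code>charset_from_content_type</code> is hypothetical, and this is only one way to parse the header, not the algorithm browsers use.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    message = Message()
    message["Content-Type"] = header_value
    # get_content_charset() returns the value in lower case, or None
    # when the header carries no charset parameter.
    return message.get_content_charset()


declared = charset_from_content_type("text/html; charset=UTF-8")
undeclared = charset_from_content_type("text/html")
print(declared, undeclared)
```

When the result is <code>None</code>, a consumer has to fall back on a byte order mark, an in-document declaration, or an encoding default.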
 
===Encoding defaults===
An encoding default applies when there is no external or internal encoding declaration and also no byte order mark. While the encoding default for HTML pages served as XML is required to be UTF-8, the encoding default for a regular Web page (that is, for HTML pages serialized as <code>text/html</code>) varies depending on the localization of the browser. For a system set up mainly for Western European languages, it will generally be [[ISO 8859-1#Windows-1252|Windows-1252]]. For Cyrillic alphabet locales, the default is typically [[Windows-1251]]. For a browser from a location where ''legacy'' multi-byte character encodings are prevalent, some form of auto-detection is likely to be applied.
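
The effect of a locale-dependent default can be seen in a short Python sketch. The byte values are chosen for illustration; <code>cp1252</code> and <code>cp1251</code> are Python's codec names for Windows-1252 and Windows-1251.

```python
# Sketch: the same undeclared byte read under two different defaults.
raw = b"\xe9"

as_western = raw.decode("cp1252")   # Western European default
as_cyrillic = raw.decode("cp1251")  # Cyrillic default
print(as_western, as_cyrillic)

# An undeclared UTF-8 page read under the Western European default
# turns each non-ASCII character into several wrong characters.
utf8_page = "\u00e9".encode("utf-8")    # the two bytes 0xC3 0xA9
misread = utf8_page.decode("cp1252")
print(misread)
```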
 
===Encoding trends===
Because of the legacy of 8-bit text representations in [[programming language]]s and [[operating system]]s and the desire to avoid burdening users with the need to understand the nuances of encoding, many text editors used by HTML authors are unable or unwilling to offer a choice of encodings when saving files to disk and often do not even allow input of characters beyond a very limited range. Consequently, many HTML authors are unaware of encoding issues and may not have any idea what encoding their documents actually use. Misunderstandings, such as the belief that the encoding declaration effects a change in the actual encoding (whereas it is actually just a label that could be inaccurate), are also a reason for this editor attitude. Another factor contributing in the same direction is the arrival of UTF-8, which greatly diminishes the need for other encodings; modern editors thus tend to default to UTF-8, as recommended by the HTML5 specification.<ref>{{Cite web|url=http://www.w3.org/TR/html5/semantics.html#charset|title=HTML5|author=Ian Hickson|access-date=17 September 2011|year=2011|quote=Authors are encouraged to use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629] Authoring tools should default to using UTF-8 for newly created documents. [RFC3629] }}</ref>
 
===Byte order mark/Unicode sniffing===
For both serializations of HTML (content-type "text/html" and content-type "application/xhtml+xml"), the byte order mark (BOM) is an effective way to transmit encoding information within an HTML document. For UTF-8, the BOM is optional, while it is a must for the UTF-16 and the UTF-32 encodings. (Note: UTF-16 and UTF-32 without the BOM are formally known under different names; they are different encodings, and thus need some form of encoding declaration – see [[UTF-16BE]], [[UTF-16LE]], [[UTF-32LE]] and [[UTF-32BE]].) The use of the BOM character (U+FEFF) means that the encoding automatically declares itself to any processing application. Processing applications need only look for an initial 0x0000FEFF, 0xFEFF or 0xEFBBBF in the byte stream to identify the document as UTF-32, UTF-16 or UTF-8 encoded respectively. No additional metadata mechanisms are required for these encodings since the byte-order mark includes all of the information necessary for processing applications. In most circumstances, the byte-order mark character is handled by editing applications separately from the other characters, so there is little risk of an author removing or otherwise changing the byte order mark to indicate the wrong encoding (as can happen when the encoding is declared in English/Latin script). If the document lacks a byte-order mark, the fact that the first non-blank printable character in an HTML document is supposed to be "<" (U+003C) can be used to determine a UTF-8/UTF-16/UTF-32 encoding.
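
The byte-pattern check described above can be sketched in Python using the BOM constants from the standard <code>codecs</code> module. The function name <code>sniff_bom</code> and its return format are hypothetical, and the sketch also recognises the little-endian forms of the mark, which the byte values quoted above (big-endian) do not cover.

```python
import codecs
from typing import Optional, Tuple

# Longer marks are tested first: the UTF-32 little-endian mark
# (FF FE 00 00) begins with the UTF-16 little-endian mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, ("UTF-32", "big-endian")),
    (codecs.BOM_UTF32_LE, ("UTF-32", "little-endian")),
    (codecs.BOM_UTF8, ("UTF-8", None)),
    (codecs.BOM_UTF16_BE, ("UTF-16", "big-endian")),
    (codecs.BOM_UTF16_LE, ("UTF-16", "little-endian")),
]


def sniff_bom(data: bytes) -> Optional[Tuple[str, Optional[str]]]:
    """Return (encoding family, byte order) if data starts with a BOM."""
    for bom, result in _BOMS:
        if data.startswith(bom):
            return result
    return None  # no BOM: fall back to declarations or defaults


utf8_result = sniff_bom(codecs.BOM_UTF8 + b"<html>")
utf16_result = sniff_bom(codecs.BOM_UTF16_LE + "<html>".encode("utf-16-le"))
utf32_result = sniff_bom(codecs.BOM_UTF32_LE + "<html>".encode("utf-32-le"))
no_bom_result = sniff_bom(b"<html>")
```

Testing the longer marks first assumes that a UTF-16 document never begins with U+0000 directly after the mark, which holds for HTML because the first printable character is expected to be "<".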
 
===Encoding overriding===
Many HTML documents are served with inaccurate encoding information, or no encoding information at all. In order to determine the encoding in such cases, many browsers allow the user to manually select an encoding name from a list. They may also employ an encoding auto-detection algorithm that works in concert '''with''' or{{snd}} ''in the case of the BOM and in case of HTML served as XML''{{snd}} '''against''' the manual override.
 
For HTML documents which are <code>text/html</code> serialized, manual override may apply to all documents, or only those for which the encoding cannot be ascertained by looking at declarations and/or byte patterns. The fact that the manual override is present and widely used hinders the adoption of accurate encoding declarations on the Web; therefore the problem is likely to persist. But note that Internet Explorer, Chrome and Safari{{snd}} for both XML and <code>text/html</code> serializations{{snd}} do not permit the encoding to be overridden whenever the page includes the BOM.<ref>{{Cite web |title=12897 - In some parsers, UTF-8 BOM trumps the HTTP charset attribute (Encoding sniffing algorithm) |url=https://www.w3.org/Bugs/Public/show_bug.cgi?id=12897 |access-date=2023-03-09 |website=www.w3.org}}</ref>
 
For HTML documents serialized with the preferred XML label{{snd}} <code>application/xhtml+xml</code>, manual encoding override is not permitted. To override the encoding of such an XML document would mean that the document stopped being XML, as it is a fatal error for XML documents to have an encoding declaration with detectable errors. Currently, Gecko browsers such as Firefox abide by this rule, whereas the bulk of the other common browsers that support HTML as XML, such as WebKit browsers (Chrome/Safari),<ref>{{Cite web |title=66189 - XML parser doesn't emit FATAL ERROR for all, detectable encoding errors |url=https://bugs.webkit.org/show_bug.cgi?id=66189 |access-date=2023-03-09 |website=bugs.webkit.org}}</ref> do allow the encoding of XHTML documents to be manually overridden.
 
==Web browser support==
 
{| class="wikitable"
|+Example web browser support for Unicode characters
|-
! scope="col" |Character
! scope="col" |HTML char ref
! scope="col" |Unicode name
! scope="col" |What your browser displays
|-
!scope="row" | U+0041
| <code>&amp;#65;</code> or <code>&amp;#x41;</code>
| [[Latin alphabet|Latin]] capital letter [[A]]
| style="text-align:center;font-size:large;" | A
|-
!scope="row" | U+00DF
| <code>&amp;#223;</code> or <code>&amp;#xDF;</code>
| Latin small letter [[ß|Sharp S]]
| style="text-align:center;font-size:large;" | ß
|-
!scope="row" | U+00FE
| <code>&amp;#254;</code> or <code>&amp;#xFE;</code>
| Latin small letter [[Thorn (letter)|Thorn]]
| style="text-align:center;font-size:large;" | þ
|-
!scope="row" | U+0394
| <code>&amp;#916;</code> or <code>&amp;#x394;</code>
| [[Greek alphabet|Greek]] capital letter [[Delta (letter)|Delta]]
| style="text-align:center;font-size:large;" | Δ
|-
!scope="row" | U+017D
| <code>&amp;#381;</code> or <code>&amp;#x17D;</code>
| Latin capital letter [[háček|Z with háček]]
| style="text-align:center;font-size:large;" | Ž
|-
!scope="row" | U+0419
| <code>&amp;#1049;</code> or <code>&amp;#x419;</code>
| [[Cyrillic script|Cyrillic]] capital letter [[Short I]]
| style="text-align:center;font-size:large;" | Й
|-
!scope="row" | U+05E7
| <code>&amp;#1511;</code> or <code>&amp;#x5E7;</code>
| [[Hebrew alphabet|Hebrew]] letter [[Qoph#Hebrew Qof|Qof]]
| style="text-align:center;font-size:large;" | ק
|-
!scope="row" | U+0645
| <code>&amp;#1605;</code> or <code>&amp;#x645;</code>
| [[Arabic alphabet|Arabic]] letter [[Mem#Arabic mīm|Meem]]
| style="text-align:center;font-size:large;" | م
|-
!scope="row" | U+0E57
| <code>&amp;#3671;</code> or <code>&amp;#xE57;</code>
| [[Thai alphabet|Thai]] [[numerical digit|digit]] [[7 (number)|7]]
| style="text-align:center;font-size:large;" | ๗
|-
!scope="row" | U+1250
| <code>&amp;#4688;</code> or <code>&amp;#x1250;</code>
| [[Ge'ez alphabet|Ge'ez]] syllable [[Qha]]
| style="text-align:center;font-size:large;" | ቐ
|-
!scope="row" | U+3042
| <code>&amp;#12354;</code> or <code>&amp;#x3042;</code>
| [[Hiragana]] letter A (Japanese)
| style="text-align:center;font-size:large;" | あ
|-
!scope="row" | U+53F6
| <code>&amp;#21494;</code> or <code>&amp;#x53F6;</code>
| [[CJK Unified Ideographs|CJK Unified Ideograph]]-53F6 ([[simplified Chinese characters|Simplified Chinese]] "Leaf")
| style="text-align:center;font-size:large;" | 叶
|-
!scope="row" | U+8449
| <code>&amp;#33865;</code> or <code>&amp;#x8449;</code>
| [[CJK Unified Ideographs|CJK Unified Ideograph]]-8449 ([[traditional Chinese characters|Traditional Chinese]] "Leaf")
| style="text-align:center;font-size:large;" | 葉
|-
!scope="row" | U+B5AB
| <code>&amp;#46507;</code> or <code>&amp;#xB5AB;</code>
| [[Hangul]] [[syllable]] Tteolp (Korean "Ssangtikeut Eo Rieulbieup")
| style="text-align:center;font-size:large;" | 떫
|-
!scope="row" | U+16A0
| <code>&amp;#5792;</code> or <code>&amp;#x16A0;</code>
| [[Runic alphabet|Runic]] letter [[Fe (rune)|Fehu]]
| style="text-align:center;font-size:large;" | ᚠ
|-
!scope="row" | U+0D37
| <code>&amp;#3383;</code> or <code>&amp;#x0D37;</code>
| [[Malayalam alphabet|Malayalam]] letter ഷ (ṣha)
| style="text-align:center;font-size:large;" | ഷ
|-
!scope="row" | U+1F602
| <code>&amp;#128514;</code> or <code>&amp;#x1F602;</code>
| [[Face with Tears of Joy emoji]]
| style="text-align:center;font-size:large;" | 😂
|-
| colspan=4 style="font-size:smaller;" | To display all of the characters above, you may need to install one or more large multilingual fonts, like [[Code2000]].
|}
 
 
==Frequency of usage==
According to internal data from [[Google]]'s web index, in December 2007 the [[UTF-8]] Unicode encoding became the most frequently used encoding on web pages, overtaking both [[ASCII]] (US) and [[ISO/IEC 8859-1|8859-1]]/[[Windows-1252|1252]] (Western European).<ref>{{Cite web |author=[[Mark Davis (Unicode)|Mark Davis]] |title=Moving to Unicode 5.1 |url=https://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html |date=5 May 2008 |access-date=2024-10-10 |website=Official Google Blog |language=en}}</ref>
 
== See also ==
* [[meta:Help:Special characters|Help file for using special characters on Wikipedia]]
* [[Character encodings in HTML]]
* [[wikibooks:Unicode/Character reference|Unicode character reference (wikibooks)]]
 
== References ==
{{reflist}}
 
== External links ==
{{toomanylinks|date=April 2020}}
*[http://www.w3.org/TR/unicode-xml/ Unicode in XML and other Markup Languages] - a W3C & Unicode Consortium joint publication that describes issues and provides guidelines relating to Unicode in markup languages
*[http://www.w3.org/TR/REC-html40/HTMLlat1.ent Latin-1], [http://www.w3.org/TR/REC-html40/HTMLspecial.ent "Special"], and [http://www.w3.org/TR/REC-html40/HTMLsymbol.ent Mathematical, Greek and Symbolic] named character entity definitions for HTML 4.01
*[http://www.alanwood.net/unicode/cjk_compatibility_ideographs.html CJK Compatibility Ideographs]
*[http://www.unicode.org/charts/ Unicode character charts] - hexadecimal numbers only; PDF files showing all characters independent of browser capabilities
*[http://unicode.coeurlumiere.com/ Table of Unicode characters from 1 to 65535] {{Webarchive|url=https://web.archive.org/web/20071103125951/http://unicode.coeurlumiere.com/ |date=2007-11-03 }} - shows how they look in one's browser
*[http://www.pinyin.info/tools/converter/chars2uninumbers.html Web tool that converts "special" characters (such as Chinese characters) to Unicode numeric character references]
*[http://www.hotpeachpages.net/a/characters.html Multi-lingual web pages and Unicode] - how to fix display problems
*[https://web.archive.org/web/20110924073701/http://www.w3.org/TR/html5/semantics.html#charset w3.org via web.archive.org] - Original HTML5 Citation Reference saved via Wayback Machine
{{Unicode navigation}}
 
[[Category:HTML]]
[[Category:Unicode|HTML]]