{{Short description|Relationship between Unicode characters and HTML}}
{{Multiple issues|
{{primary sources|date=December 2011}}
{{refimprove|date=January 2011}}
}}
{{SpecialChars}}
{{Html series}}
Web pages authored using
In RFC 1866, the initial HTML 2.0 standard, the document character set was defined as ISO-8859-1 (later HTML standards default to the [[Windows-1252]] encoding). It was extended to [[ISO 10646]] (which is basically equivalent to Unicode) by {{IETF RFC|2070}}. The document character set does not vary between documents in different languages or created on different platforms. The external character encoding is chosen by the author of the document (or the software the author uses to create the document) and determines how the bytes used to store and/or transmit the document map to characters from the document character set. Characters not present in the chosen external character encoding may be represented by character entity references.
The relationship between [[Unicode]] and HTML tends to be a difficult topic for many computer professionals, document authors, and [[World Wide Web|web]] users alike. The accurate representation of text in [[web page]]s from different [[natural language]]s and [[writing system]]s is complicated by the details of [[character encoding]], [[markup language]] syntax, [[Typeface|font]], and varying levels of support by [[web browser]]s.
== HTML document characters ==
Web pages are typically [[HTML]] or [[XHTML]] documents. Both types of documents consist, at a fundamental level, of [[character (computing)|character]]s, which are [[grapheme]]s and grapheme-like units, independent of how they manifest in [[computer storage]] systems and [[computer network|network]]s.
An HTML document is a sequence of Unicode characters. More specifically, HTML 4.0 documents are required to consist of characters in the HTML ''document character set'': a character repertoire wherein each character is assigned a unique, non-negative integer ''code point''. This set is defined in the HTML 4.0 [[Document Type Definition|DTD]], which also establishes the syntax (allowable sequences of characters) that can produce a valid HTML document. The HTML document character set for HTML 4.0 consists of most, but not all, of the characters jointly defined by [[Unicode]] and ISO/IEC 10646: the [[Universal Character Set]] (UCS).
Like an HTML document, an XHTML document is a sequence of Unicode characters. However, an XHTML document is an [[XML]] document, which, while not having an explicit "document character" layer of [[abstraction]], nevertheless relies upon a similar definition of permissible characters that covers most, but not all, of the Unicode/UCS character definitions. The sets used by HTML and XHTML/XML are slightly different, but these differences have little effect on the average document author.
Regardless of whether the document is HTML or XHTML, when stored on a [[file system]] or transmitted over a network, the document's characters are ''encoded'' as a sequence of [[bit]] [[octet (computing)|octet]]s (''[[byte]]s'') according to a particular character encoding. This encoding may either be a [[Unicode Transformation Format]], like [[UTF-8]], that can directly encode any Unicode character, or a legacy encoding, like [[Windows-1252]], that cannot. However, even when using encodings that do not support all Unicode characters, the encoded document may make use of [[numeric character references]]. For example, <code>&#x263A;</code> (☺) is used to indicate a smiling face character in the Unicode character set.
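The fallback described above can be sketched in Python. This is only an illustration, not part of any HTML specification: the helper name <code>encode_with_char_refs</code> is hypothetical, and it relies on Python's standard <code>xmlcharrefreplace</code> error handler, which substitutes a decimal numeric character reference for each character the target encoding cannot represent.

```python
def encode_with_char_refs(text: str, encoding: str) -> bytes:
    """Encode text; characters the encoding lacks become &#N; references."""
    # "xmlcharrefreplace" is a standard Python codec error handler.
    return text.encode(encoding, errors="xmlcharrefreplace")


# U+263A (white smiling face) is not part of Windows-1252,
# so it is written out as a decimal numeric character reference.
legacy = encode_with_char_refs("smile: \u263A", "windows-1252")

# UTF-8 can encode the character directly, so no reference is needed.
utf8 = encode_with_char_refs("smile: \u263A", "utf-8")

print(legacy)
print(utf8)
```

In the legacy case the document stays pure Windows-1252 on disk, while a browser still resolves the reference to the Unicode character.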
===
In order to support all Unicode characters without resorting to numeric character references, a web page must have an encoding
===
{{
In order to work around the limitations of legacy encodings, HTML is designed such that it is possible to represent characters from the whole of Unicode inside an HTML document by using a [[numeric character reference]]: a sequence of characters that explicitly spell out the Unicode code point of the character being represented. A character reference takes the form '''<code>&#</code>'''<var>N</var>'''<code>;</code>''', where <var>N</var> is either a [[decimal]] number for the Unicode code point, or a [[hexadecimal]] number, in which case it must be prefixed by <code>x</code>. The characters that compose the numeric character reference are universally representable in every encoding approved for use on the Internet.{{citation needed|date=June 2022}}
The support for hexadecimal in this context is more recent, so older browsers might have problems displaying characters referenced with hexadecimal numbers, but they will probably have a problem displaying Unicode characters above code point 255 anyway. To ensure better compatibility with older browsers, it is still a common practice to convert the hexadecimal code point into a decimal value (for example <code>&#21494;</code> instead of <code>&#x53F6;</code>).
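As a rough illustration of the two forms, the following Python sketch builds both the decimal and the hexadecimal reference for a character and checks that they decode to the same code point. The function name <code>numeric_char_refs</code> is an assumption made for this example; <code>html.unescape</code> is the standard-library decoder.

```python
import html


def numeric_char_refs(ch: str) -> tuple[str, str]:
    """Return the (decimal, hexadecimal) numeric character references for ch."""
    cp = ord(ch)  # the Unicode code point as an integer
    return f"&#{cp};", f"&#x{cp:X};"


dec_ref, hex_ref = numeric_char_refs("\u53F6")  # CJK ideograph U+53F6
print(dec_ref, hex_ref)

# Both spellings name the same code point, so they decode identically.
print(html.unescape(dec_ref) == html.unescape(hex_ref))
```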
=== Named character entities ===
{{main|character entity reference}}
In HTML 4, there is a standard set of 252 named ''character entities'' for characters (some common, some obscure) that are either not found in certain character encodings or are markup sensitive in some contexts (for example angle brackets and quotation marks). Although any Unicode character can be referenced by its numeric code point, some HTML document authors prefer to use these named entities instead, where possible, as they are less cryptic and were better supported by early browsers.
Character entities can be included in an HTML document via the use of ''entity references'', which take the form '''<code>&</code>'''<var>EntityName</var>'''<code>;</code>''', where <var>EntityName</var> is the name of the entity. For example, <code>&mdash;</code>, much like <code>&#8212;</code> or <code>&#x2014;</code>, represents {{U+|2014}}: the [[em dash]] character — like this — even if the character encoding used doesn't contain that character.
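The equivalence of the three spellings can be checked with Python's standard <code>html</code> module. This is a small sketch rather than a description of how any browser resolves entities; the variable names are chosen for this example only.

```python
import html
from html.entities import name2codepoint

# Look up the code point that the HTML 4 entity name "mdash" stands for.
code_point = name2codepoint["mdash"]

# The named, decimal, and hexadecimal spellings of the same character.
forms = ["&mdash;", "&#8212;", "&#x2014;"]
decoded = [html.unescape(form) for form in forms]

print(hex(code_point))
print(all(ch == "\u2014" for ch in decoded))
```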
For the full list, see: [[List of XML and HTML character entity references]].
===Encoding information===
When a document is transmitted via a [[MIME]] message or a transport that uses MIME content types such as an [[HTTP]] response, the message may signal the encoding via a Content-Type header, such as <code>Content-Type: text/html; charset=UTF-8</code>. Other external means of declaring encoding are permitted but rarely used. If the document uses
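As a minimal sketch of reading the externally declared encoding, the <code>charset</code> parameter can be pulled out of a Content-Type value with Python's standard <code>email.message.Message</code> class. The helper name <code>charset_from_content_type</code> is hypothetical, and real user agents apply further rules beyond this header.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value, if any."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```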
===Encoding defaults===
An encoding default applies when there is no external or internal encoding declaration and also no
===Encoding trends===
Because of the legacy of 8-bit text representations in [[programming language]]s and [[operating system]]s and the desire to avoid burdening users with the need to understand the nuances of encoding, many text editors used by HTML authors are unable or unwilling to offer a choice of encodings when saving files to disk and often do not even allow input of characters beyond a very limited range. Consequently, many HTML authors are unaware of encoding issues and may not have any idea what encoding their documents actually use. Misunderstandings, such as the belief that the encoding declaration effects a change in the actual encoding (whereas it is actually just a label that could be inaccurate), are also a reason for this editor attitude. Another factor contributing in the same direction is the arrival of UTF-8
===Byte order mark/Unicode sniffing===
For both serializations of HTML (content-type "text/html" and content-type "application/xhtml+xml"), the
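Byte order mark sniffing in general can be sketched as a prefix check on the raw bytes. The following Python example is an assumption-laden illustration, not the algorithm of any particular browser or specification: the function name <code>sniff_bom</code> and the choice to check only the UTF-8 and UTF-16 marks are made for this example.

```python
import codecs

# Byte order marks checked by this sketch, paired with Python codec names.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding, payload without BOM), or (None, data) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data


encoding, payload = sniff_bom(codecs.BOM_UTF8 + "<p>\u00DF</p>".encode("utf-8"))
print(encoding, payload.decode(encoding))
```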
===Encoding overriding===
Many HTML documents are served with inaccurate encoding information, or no encoding information at all. In order to determine the encoding in such cases, many browsers allow the user to manually select an encoding name from a list. They may also employ an encoding
For HTML documents which are <code>text/html</code> serialized, manual override may apply to all documents, or only those for which the encoding cannot be ascertained by looking at declarations and/or byte patterns. The fact that the manual override is present and widely used hinders the adoption of accurate encoding declarations on the Web; therefore the problem is likely to persist. But note that Internet Explorer, Chrome and Safari{{snd}}
For HTML documents serialized with the preferred XML label
==Web browser support==
{| class="wikitable"
|+Example web browser support for Unicode characters
|-
! scope="col" |Character
! scope="col" |HTML char ref
! scope="col" |Unicode name
! scope="col" |What your browser displays
|-
!scope="row" | U+0041
| <code>&#65;</code> or <code>&#x41;</code>
| [[Latin alphabet|Latin]] capital letter [[A]]
| style="text-align:center;font-size:large;" | A
|-
!scope="row" | U+00DF
| <code>&#223;</code> or <code>&#xDF;</code>
| Latin small letter [[ß|Sharp S]]
| style="text-align:center;font-size:large;" | ß
|-
!scope="row" | U+00FE
| <code>&#254;</code> or <code>&#xFE;</code>
| Latin small letter [[Thorn (letter)|Thorn]]
| style="text-align:center;font-size:large;" | þ
|-
!scope="row" | U+0394
| <code>&#916;</code> or <code>&#x394;</code>
| [[Greek alphabet|Greek]] capital letter [[Delta (letter)|Delta]]
| style="text-align:center;font-size:large;" | Δ
|-
!scope="row" | U+017D
| <code>&#381;</code> or <code>&#x17D;</code>
| Latin capital letter [[háček|Z with caron]]
| style="text-align:center;font-size:large;" | Ž
|-
!scope="row" | U+0419
| <code>&#1049;</code> or <code>&#x419;</code>
| Cyrillic capital letter Short I
| style="text-align:center;font-size:large;" | Й
|-
!scope="row" | U+05E7
| <code>&#1511;</code> or <code>&#x5E7;</code>
| [[Hebrew alphabet|Hebrew]] letter [[Qoph#Hebrew Qof|Qof]]
| style="text-align:center;font-size:large;" | ק
|-
!scope="row" | U+0645
| <code>&#1605;</code> or <code>&#x645;</code>
| [[Arabic alphabet|Arabic]] letter [[Mem#Arabic mīm|Meem]]
| style="text-align:center;font-size:large;" | م
|-
!scope="row" | U+0E57
| <code>&#3671;</code> or <code>&#xE57;</code>
| [[Thai alphabet|Thai]] [[numerical digit|digit]] [[7 (number)|7]]
| style="text-align:center;font-size:large;" | ๗
|-
!scope="row" | U+1250
| <code>&#4688;</code> or <code>&#x1250;</code>
| [[Ge'ez alphabet|Ge'ez]] syllable [[Qha]]
| style="text-align:center;font-size:large;" | ቐ
|-
!scope="row" | U+3042
| <code>&#12354;</code> or <code>&#x3042;</code>
| [[Hiragana]] letter A (Japanese)
| style="text-align:center;font-size:large;" | あ
|-
!scope="row" | U+53F6
| <code>&#21494;</code> or <code>&#x53F6;</code>
| [[CJK Unified Ideographs|CJK Unified Ideograph]]-53F6 ([[simplified Chinese characters|Simplified Chinese]] "Leaf")
| style="text-align:center;font-size:large;" | 叶
|-
!scope="row" | U+8449
| <code>&#33865;</code> or <code>&#x8449;</code>
| [[CJK Unified Ideographs|CJK Unified Ideograph]]-8449 ([[traditional Chinese characters|Traditional Chinese]] "Leaf")
| style="text-align:center;font-size:large;" | 葉
|-
!scope="row" | U+B5AB
| <code>&#46507;</code> or <code>&#xB5AB;</code>
| [[Hangul]] [[syllable]] Tteolp (Korean "Ssangtikeut Eo Rieulbieup")
| style="text-align:center;font-size:large;" | 떫
|-
!scope="row" | U+16A0
| <code>&#5792;</code> or <code>&#x16A0;</code>
| [[Runic alphabet|Runic]] letter [[Fe (rune)|Fehu]]
| style="text-align:center;font-size:large;" | ᚠ
|-
!scope="row" | U+0D37
| <code>&#3383;</code> or <code>&#x0D37;</code>
| [[Malayalam alphabet|Malayalam]] letter ഷ (ṣha)
| style="text-align:center;font-size:large;" | ഷ
|-
!scope="row" | U+1F602
| <code>&#128514;</code> or <code>&#x1F602;</code>
| [[Face with Tears of Joy emoji]]
| style="text-align:center;font-size:large;" | 😂
|-
| colspan=4 style="font-size:smaller;" | To display all of the characters above, you may need to install one or more large multilingual fonts, like [[Code2000]].
|}
Some web browsers, such as [[Mozilla Firefox]], [[Opera (web browser)|Opera]], [[Safari (web browser)|Safari]] and [[Internet Explorer]] (from version 7 on), are able to display multilingual web pages by intelligently choosing a font to display each individual character on the page. They will correctly display any mix of [[
Older browsers, such as [[Netscape Navigator]] 4.77 and [[Internet Explorer 6]], can only display text supported by the current font associated with the character encoding of the page, and may misinterpret numeric character references as being references to code values within the current character encoding, rather than references to Unicode code points. When you are using such a browser, it is unlikely that your computer has all of those fonts, or that the browser can use all available fonts on the same page. As a result, the browser will not display the text in the examples above correctly, though it may display a subset of them. Because they are encoded according to the standard, though, they ''will'' display correctly on any system that is compliant and does have the characters available. Further, those characters given names for use in named entity references are likely to be more commonly available than others.
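The misinterpretation described above can be made concrete with a short Python sketch. It takes the number from a reference such as <code>&#146;</code> and compares the standard reading (a Unicode code point) with the faulty reading (a byte value in the page encoding, assumed here to be Windows-1252). The variable names are chosen for this example only.

```python
ref_value = 146  # the number inside the reference &#146;

# Standard reading: the number is a Unicode code point (U+0092, a control character).
as_unicode = chr(ref_value)

# Faulty reading: the number is treated as a byte in the page's own encoding.
# In Windows-1252, byte 0x92 is the right single quotation mark, U+2019.
as_cp1252 = bytes([ref_value]).decode("windows-1252")

print(f"U+{ord(as_unicode):04X}")
print(f"U+{ord(as_cp1252):04X}")
```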
For displaying characters outside the [[Basic Multilingual Plane]],
==Frequency of usage==
According to internal data from [[Google]]'s web index, in December 2007 the [[UTF-8]] Unicode encoding became the most frequently used encoding on web pages, overtaking both [[ASCII]] (US) and [[ISO/IEC 8859-1|8859-1]]/[[Windows-1252|1252]] (Western European).<ref>
== See also ==
* [[meta:Help:Special characters|Help file for using special characters on Wikipedia]]
* [[Character encodings in HTML]]
* [[Charset detection]]
* [[wikibooks:Unicode/Character reference|Unicode character reference (wikibooks)]]
== References ==
{{reflist}}
== External links ==
{{toomanylinks|date=April 2020}}
*[http://www.w3.org/TR/unicode-xml/ Unicode in XML and other Markup Languages] - a W3C & Unicode Consortium joint publication that describes issues and provides guidelines relating to Unicode in markup languages
*[http://www.w3.org/TR/REC-html40/HTMLlat1.ent Latin-1], [http://www.w3.org/TR/REC-html40/HTMLspecial.ent "Special"], and [http://www.w3.org/TR/REC-html40/HTMLsymbol.ent Mathematical, Greek and Symbolic] named character entity definitions for HTML 4.01
*[http://www.unicodemap.org/ UnicodeMap.org] - Browse Unicode characters, ranges, and other information
*[http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi
*[http://www.alanwood.net/unicode/ Alan Wood’s Unicode Resources] - Unicode fonts and information
*[http://www.phon.ucl.ac.uk/home/wells/ipa-unicode.htm The International Phonetic Alphabet in Unicode]
*[http://www.alanwood.net/unicode/cjk_compatibility_ideographs.html CJK Compatibility Ideographs]
*[http://www.unicode.org/charts/ Unicode character charts] - hexadecimal numbers only; PDF files showing all characters independent of browser capabilities
*[http://unicode.coeurlumiere.com/ Table of Unicode characters from 1 to 65535] {{Webarchive|url=https://web.archive.org/web/20071103125951/http://unicode.coeurlumiere.com/ |date=2007-11-03 }} - shows how they look in one's browser
*[http://www.pinyin.info/tools/converter/chars2uninumbers.html Web tool that converts "special" characters (such as Chinese characters) to Unicode numeric character references]
*[http://www.hotpeachpages.net/a/characters.html Multi-lingual web pages and Unicode] - how to fix display problems
*[https://web.archive.org/web/20110924073701/http://www.w3.org/TR/html5/semantics.html#charset w3.org via web.archive.org] - Original HTML5 Citation Reference saved via Wayback Machine
{{Unicode navigation}}
[[Category:HTML]]
[[Category:Unicode|HTML]]