Character encodings in HTML: Difference between revisions

Revision as of 13:54, 2 November 2020 by HarJIT (→Permitted encodings)
Latest revision as of 05:06, 16 November 2024 by Ejazz128 (→External links: i have removed the broken link)
(44 intermediate revisions by 28 users not shown)
Line 1:
{{Short description|Use of encoding systems for international characters in HTML}}
{{For|a list of character entity references|List of XML and HTML character entity references}}
{{Hatnote|For fixing links within Wikipedia, see [[Help:Percent-encoding#Fixing links with unsupported characters|Help:Percent-encoding § Fixing Links with Unsupported Characters]].}}
{{Use dmy dates|date=December 2021}}
{{Html series}}

While Hypertext Markup Language ([[HTML]]) has been in use since 1991, HTML 4.0 (from December 1997) was the first standardized version where international [[character (computing)|character]]s were given reasonably complete treatment. When an HTML document includes special characters outside the range of seven-bit [[ASCII]], two goals are worth considering: the information's [[integrity]], and universal [[Web browser|browser]] display.

==Specifying the document's character encoding==
There are two general ways to specify which character encoding is used in the document.

First, the [[web server]] can include the character encoding or "<code>charset</code>" in the [[Hypertext Transfer Protocol]] (HTTP) <code>Content-Type</code> header, which would typically look like this:<ref>{{citation |chapter-url=http://tools.ietf.org/html/rfc7231#section-3.1.1.5|chapter=Content-Type |title=Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content|publisher=[[IETF]] |date=June 2014 |doi=10.17487/RFC7231 |access-date=2014-07-30|editor-last1=Fielding |editor-last2=Reschke |editor-first1=R |editor-first2=J |last1=Fielding |first1=R. |last2=Reschke |first2=J. |s2cid=14399078 }}</ref>
<syntaxhighlight lang="http">
Content-Type: text/html; charset=utf-8
</syntaxhighlight>

This method gives the HTTP server a convenient way to alter a document's encoding according to [[content negotiation]]; certain HTTP server software can do it, for example Apache with the [[List of Apache modules|module]] <code>mod_charset_lite</code>.<ref>{{cite web| url = http://httpd.apache.org/docs/2.0/en/mod/mod_charset_lite.html| title = Apache Module mod_charset_lite}}</ref>

Second, a declaration can be included within the document itself.

For HTML it is possible to include this information inside the <code>head</code> element near the top of the document:<ref name=html5charset/>
<!-- Please don't add a closing "/": that is incorrect here. -->
<syntaxhighlight lang="html">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</syntaxhighlight>

[[HTML5]] also allows the following syntax to mean exactly the same:<ref name=html5charset>{{citation |chapter-url=http://www.w3.org/TR/html5/document-metadata.html#specifying-the-documents-character-encoding |chapter=Specifying the document's character encoding |title=HTML5 |publisher=[[World Wide Web Consortium]] |date=14 December 2017 |access-date=2018-05-28}}</ref>
<!-- Please don't add a closing "/": that is unnecessary here. -->
<syntaxhighlight lang="html">
<meta charset="utf-8">
</syntaxhighlight>

[[XHTML]] documents have a third option: to express the character encoding via [[XML]] declaration, as follows:<ref>{{citation |chapter-url=http://www.w3.org/TR/REC-xml/#sec-prolog-dtd |chapter=Prolog and Document Type Declaration |title=XML |first1=T. |last1=Bray |author-link1=Tim Bray |first2=J. |last2=Paoli |first3=C. |last3=Sperberg-McQueen |author-link3=Michael Sperberg-McQueen |first4=E. |last4=Maler |first5=F. |last5=Yergeau |publisher=[[W3C]] |date=26 November 2008 |access-date=8 March 2010}}</ref>
<syntaxhighlight lang="xml">
<?xml version="1.0" encoding="utf-8"?>
</syntaxhighlight>

With this second approach, because the character encoding cannot be known until the declaration is parsed, there is a problem knowing which character encoding is used in the document up to and including the declaration itself. If the character encoding is an [[ASCII extension]] then the content up to and including the declaration itself should be pure ASCII and this will work correctly. For character encodings that are not ASCII extensions (i.e. not a superset of ASCII), such as [[UTF-16BE]] and [[UTF-16LE]], a processor of HTML, such as a web browser, should be able to parse the declaration in some cases through the use of heuristics.
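The three kinds of declaration described above can be read back programmatically. The following is a minimal Python sketch, not the parsing procedure that browsers implement: the function names and the two regular expressions are illustrative assumptions, and the patterns only work when the bytes up to the declaration are ASCII-compatible.

```python
import re
from email.message import Message
from typing import Optional

# Simplified pattern for a charset declaration inside a <meta> element.
# Illustrative assumption only; real HTML parsers are far more careful.
META_CHARSET_RE = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)

# Simplified pattern for the encoding pseudo-attribute of an XML declaration.
XML_ENCODING_RE = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the charset parameter of an HTTP Content-Type value, if present."""
    msg = Message()
    msg["Content-Type"] = header_value
    charset = msg.get_param("charset")
    return charset.lower() if isinstance(charset, str) else None


def charset_from_meta(document: bytes) -> Optional[str]:
    """Return the charset named in the first matching <meta> element, if any."""
    match = META_CHARSET_RE.search(document)
    return match.group(1).decode("ascii").lower() if match else None


def charset_from_xml_declaration(document: bytes) -> Optional[str]:
    """Return the encoding named in a leading XML declaration, if any."""
    match = XML_ENCODING_RE.search(document)
    return match.group(1).decode("ascii").lower() if match else None


if __name__ == "__main__":
    print(charset_from_content_type("text/html; charset=utf-8"))
    print(charset_from_meta(b'<head><meta charset="utf-8"></head>'))
    print(charset_from_xml_declaration(b'<?xml version="1.0" encoding="utf-8"?>'))
```

Each call in the usage section returns <code>utf-8</code> for the example declarations shown in this section; a document without a declaration yields <code>None</code>.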
===Encoding detection algorithm===
As of HTML5 the recommended charset is [[UTF-8]].<ref name=html5charset/> An "encoding sniffing algorithm" is defined in the specification to determine the character encoding of the document based on multiple sources of input, including:
# Explicit user instruction
# An explicit meta tag within the first 1024 bytes of the document
# A [[byte order mark]] (BOM) within the first three bytes of the document
# The HTTP Content-Type or other transport layer information
# Analysis of the document bytes looking for specific sequences or ranges of byte values,<ref>{{cite web| url = http://www.w3.org/TR/html5/syntax.html#prescan-a-byte-stream-to-determine-its-encoding| title = HTML5 prescan a byte stream to determine its encoding}}</ref> and other tentative detection mechanisms.

If the wrong encoding is chosen for a document in an ASCII-compatible encoding, characters outside the printable ASCII range (32 to 126) usually appear incorrectly. This presents few problems for [[English language|English]]-speaking users, but other languages regularly (in some cases, always) require characters outside that range. In Chinese, Japanese, and Korean ([[CJK characters|CJK]]) language environments where there are several different multi-byte encodings in use, auto-detection is also often employed. Finally, browsers usually permit the user to override an ''incorrect'' charset label manually as well.

It is increasingly common for multilingual websites and websites in non-Western languages to use [[UTF-8]], which allows use of the same encoding for all languages. [[UTF-16]] or [[UTF-32]], which can be used for all languages as well, are less widely used because they can be harder to handle in programming languages that assume a [[byte-oriented]] ASCII superset encoding, and they are less efficient for text with a high frequency of ASCII characters, which is usually the case for HTML documents.
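Some of the inputs listed in this section can be combined in a small Python sketch. It is not the specification's algorithm: the order of precedence used here (BOM, then transport-layer charset, then a meta declaration in the first 1024 bytes, then a fallback), the function name <code>sniff_encoding</code>, the regular expression and the <code>windows-1252</code> fallback are all assumptions made for illustration. User override and byte-frequency analysis are left out.

```python
import codecs
import re
from typing import Optional

# Byte order marks checked at the very start of the document.
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16le"),  # FF FE
)

# Simplified stand-in for the specification's meta prescan (an assumption).
META_CHARSET_RE = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def sniff_encoding(
    data: bytes,
    http_charset: Optional[str] = None,
    default: str = "windows-1252",  # hypothetical fallback for this sketch
) -> str:
    """Guess a document's encoding from a BOM, HTTP charset, or meta tag."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    if http_charset:
        return http_charset.lower()
    match = META_CHARSET_RE.search(data[:1024])  # only the first 1024 bytes
    if match:
        return match.group(1).decode("ascii").lower()
    return default


if __name__ == "__main__":
    page = '<meta charset="utf-8"><p>café</p>'.encode("utf-8")
    print(sniff_encoding(page))
    # Decoding with the wrong ASCII-compatible encoding garbles only the
    # non-ASCII character; the ASCII markup survives.
    print(page.decode("windows-1252"))
```

The last line of the usage section illustrates the failure mode described above: the markup and the letters "caf" are unaffected, while "é" is rendered as two unrelated characters.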
Line 59 ⟶ 67:
* [[Windows-1257]]
* [[Windows-1258]]
* [[GB 18030]]{{efn|Specified with 0xA3A0 as a duplicate encoding of the [[ideographic space]] (U+3000) for compatibility reasons, and as such excluding U+E5E5 (a private use character).<ref name="gbenc"/><ref name="gbindex"/> Also, specified with 0x80 accepted as an alternative encoding of the [[euro sign]] (U+20AC; see [[Windows-936]]).<ref>{{cite web |url=https://encoding.spec.whatwg.org/#gb18030-decoder |title=10.2.1. gb18030 decoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> Otherwise, follows the mappings from the 2005 standard.<ref name="gbindex">{{cite web |url=https://encoding.spec.whatwg.org/#index-gb18030 |title=5. Indexes (§ index gb18030) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[Big5]]{{efn|[[Hong Kong Supplementary Character Set]] variant,<ref name="encoding_rs"/> although most of the HKSCS extensions (those with lead bytes less than 0xA1) are not included by the encoder, only by the decoder.<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-big5-pointer |title=5. Indexes (§ index Big5 pointer) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[Shift JIS]]{{efn|The specification includes [[IBM]] and [[NEC]] extensions,<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-jis0208 |title=5. Indexes (§ Index jis0208) |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> and is more precisely [[Windows-31J]].<ref name="encoding_rs">{{cite web |url=https://docs.rs/encoding_rs/latest/encoding_rs/#notable-differences-from-iana-naming |title=Notable Differences from IANA Naming |work=Crate encoding_rs |publisher=docs.rs |author=Mozilla Foundation |author-link=Mozilla Foundation}}</ref>}}
* [[ISO-2022-JP]]{{efn|The specification uses the same index as used for Shift JIS (insofar as is within reach), i.e. includes NEC extensions. [[Half-width kana]] is converted to fullwidth by the encoder,<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-iso-2022-jp-katakana |title=5. Indexes (§ Index ISO-2022-JP katakana) |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> but accepted using an escape sequence (ESC 0x28 0x49) by the decoder.<ref name="whatwgjisdecoder">{{cite web |url=https://encoding.spec.whatwg.org/#iso-2022-jp-decoder |title=12.2.1. ISO-2022-JP decoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> [[Shift Out]] and [[Shift In]] (0x0E and 0x0F) are excluded entirely to prevent attacks.<ref name="whatwgjisdecoder" /><ref>{{cite web |url=https://encoding.spec.whatwg.org/#iso-2022-jp-encoder |title=12.2.2. ISO-2022-JP encoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[EUC-KR]]{{efn|Actually [[Unified Hangul Code]] (Windows-949), which is a superset which covers the entire [[Hangul Syllables (Unicode block)|Hangul Syllables]] block.<ref name="encoding_rs"/><ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-euc-kr |title=5. Indexes (§ index EUC-KR) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[UTF-16BE]]{{efn|Specified for decoding only; form submissions from UTF-16-coded documents are to be encoded in [[UTF-8]].<ref name="outputenc">{{cite web |url=https://encoding.spec.whatwg.org/#output-encodings |title=4.3. Output encodings |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[UTF-16LE]]{{efn|For compatibility with deployed content, also specified for the plain <code>[[UTF-16]]</code> label,<ref>{{cite web |url=https://encoding.spec.whatwg.org/#utf-16le |title=14.4. UTF-16LE |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> although a [[byte order mark]] (BOM), if present, takes priority over any label.<ref>{{cite web |url=https://encoding.spec.whatwg.org/#decode |title=6. Hooks for standards (§ decode) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> Specified for decoding only; form submissions from UTF-16-coded documents are to be encoded in [[UTF-8]].<ref name="outputenc" />}}
* x-user-defined{{efn|Maps 0x00 through 0x7F to U+0000 through U+007F, and 0x80 through 0xFF to U+F780 through U+F7FF (a [[Private Use Area]] range), such that the low 8 bits of the code point always match the original byte.<ref>{{cite web |url=https://encoding.spec.whatwg.org/#x-user-defined |title=14.5. x-user-defined |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
}}{{notelist}}

Line 83 ⟶ 91:
* [[ISO-8859-16]]
* [[KOI8-R]]
* [[KOI8-U]] / [[KOI8-RU]]{{efn|Titled KOI8-U and specified for both <code>KOI8-U</code> and <code>KOI8-RU</code> labels;<ref name="namesandlabels"/> follows [[KOI8-RU]] in positions 0xAE and 0xBE (i.e. includes [[Ў|Ў/ў]])<ref name="whatwg-koi8u">{{cite web |url=https://encoding.spec.whatwg.org/koi8-u.html |title=index KOI8-U visualization |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref><ref>{{cite web |url=https://www.w3.org/Bugs/Public/show_bug.cgi?id=17053 |title=Bug 17053: Support KOI8-RU mapping for KOI8-U |date=2015-08-19 |work=[[W3C]] Bugzilla}}</ref> but KOI8-U in positions 0x93–9F.<ref name="whatwg-koi8u"/>}}
* [[Mac OS Roman]]
* [[Windows-1253]]
* [[Mac OS Cyrillic encoding|Mac OS Cyrillic]]
* [[GBK (character encoding)|GBK]]{{efn|Also specified for <code>[[GB 2312|GB2312]]</code> and related labels. Handled the same as {{nowrap|GB 18030}} for decoding purposes.<ref>{{cite web |url=https://encoding.spec.whatwg.org/#gbk |title=10.1. GBK |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> For encoding purposes, labelling as GBK (or {{nowrap|GB 2312}}) excludes four-byte codes, and favours the one-byte 0x80 representation for U+20AC.<ref name="gbenc">{{cite web |url=https://encoding.spec.whatwg.org/#gb18030-encoder |title=10.2.2. gb18030 encoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[EUC-JP]]{{efn|The specification uses the same index as used for Shift JIS (insofar as is within reach of the EUC code set 1), i.e. includes NEC extensions. [[JIS X 0212]] is included for decoding only.<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-jis0212 |title=5. Indexes (§ Index jis0212) |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
}}{{notelist}}

Line 95 ⟶ 103:
* [[CESU-8]]
* [[UTF-7]]
* [[Binary Ordered Compression for Unicode|BOCU-1]]
* [[Standard Compression Scheme for Unicode|SCSU]]
* [[EBCDIC]]
* [[UTF-32]]

Line 110 ⟶ 118:
==Character references==
{{Main|List of XML and HTML character entity references|Numeric character reference}}
In addition to native character encodings, characters can also be encoded as ''character references'', which can be ''numeric character references'' ([[decimal]] or [[hexadecimal]]) or ''character entity references''. Character entity references are also sometimes referred to as ''named entities'', or ''HTML entities'' for HTML. HTML's usage of character references derives from [[SGML]].

Line 116 ⟶ 124:
===HTML character references===
<!--Linked from [[Template:Auxiliary template common notice]]-->
A ''[[numeric character reference]]'' in HTML refers to a character by its [[Universal Character Set]]/[[Unicode]] ''[[code point]]'', and uses the format
:<code>&#''nnnn'';</code>

Line 128 ⟶ 136:
For codes from 0 to 127, the original 7-bit [[ASCII]] standard set, most of these characters can be used without a character reference. Codes from 160 to 255 can all be created using [[List of XML and HTML character entity references|character entity names]]. Only a few higher-numbered codes can be created using entity names, but all can be created by decimal number character reference.

[[List of XML and HTML character entity references|Character entity references]] can also have the format <code>&''name'';</code> where ''name'' is a case-sensitive alphanumeric string. For example, "λ" can also be encoded as <code>&lambda;</code> in an HTML document.
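The relationship between a character, its numeric character references and its entity name can be checked with Python's standard <code>html</code> module. This is a minimal sketch: the helper <code>numeric_reference</code> is a hypothetical name introduced for illustration, and <code>html.unescape</code> follows the HTML5 entity list rather than any earlier HTML version.

```python
import html


def numeric_reference(char: str, hexadecimal: bool = False) -> str:
    """Build the decimal or hexadecimal character reference for one character."""
    code_point = ord(char)
    return f"&#x{code_point:X};" if hexadecimal else f"&#{code_point};"


if __name__ == "__main__":
    # "λ" is U+03BB, which is 955 in decimal.
    print(numeric_reference("λ"))                     # &#955;
    print(numeric_reference("λ", hexadecimal=True))   # &#x3BB;

    # All three forms decode to the same character.
    print(html.unescape("&#955; &#x3BB; &lambda;"))   # λ λ λ

    # Entity names are case-sensitive: &Lambda; is the capital letter.
    print(html.unescape("&Lambda;"))                  # Λ

    # Characters outside the target encoding can be replaced by references.
    print("λ".encode("ascii", "xmlcharrefreplace"))   # b'&#955;'
```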
The character entity references <code>&lt;</code>, <code>&gt;</code>, <code>&quot;</code> and <code>&amp;</code> are predefined in HTML and SGML, because <code><</code>, <code>></code>, <code>"</code> and <code>&</code> are already used to delimit markup. This notably did not include XML's <code>&apos;</code> (') entity prior to [[HTML5]]. For a list of all named HTML character entity references along with the versions in which they were introduced, see [[List of XML and HTML character entity references]].

Unnecessary use of HTML character references may significantly reduce HTML readability. If the character encoding for a web page is chosen appropriately, then HTML character references are usually only required for markup delimiting characters as mentioned above, and for a few special characters (or none at all if a native [[Unicode]] encoding like [[UTF-8]] is used). Incorrect HTML entity escaping may also open up security vulnerabilities for injection attacks such as [[cross-site scripting]]. If HTML attributes are left unquoted, certain characters, most importantly [[whitespace character|whitespace]], such as space and tab, must be escaped using entities. Other languages related to HTML have their own methods of escaping characters.

===XML character references===
Unlike traditional HTML with its large range of character entity references, in [[XML]] there are only five predefined character entity references. These are used to escape characters that are markup sensitive in certain contexts:<ref>{{citation |chapter-url=http://www.w3.org/TR/REC-xml/#sec-references |chapter=Character and Entity References |title=XML |first1=T. |last1=Bray |author-link1=Tim Bray |first2=J. |last2=Paoli |first3=C. |last3=Sperberg-McQueen |author-link3=Michael Sperberg-McQueen |first4=E. |last4=Maler |first5=F. |last5=Yergeau |publisher=[[W3C]] |date=26 November 2008 |access-date=8 March 2010}}</ref>

{| class="wikitable"
| <code>&amp;</code> ||align="center"| & || [[ampersand]] || U+0026
|-
| <code>&lt;</code> ||align="center"| < || less-than sign || U+003C
|-
| <code>&gt;</code> ||align="center"| > || greater-than sign || U+003E
|-
| <code>&quot;</code> ||align="center"| " || quotation mark || U+0022
|-
| <code>&apos;</code> ||align="center"| ' || apostrophe || U+0027
|}

All other character entity references have to be defined before they can be used. For example, use of <code>&eacute;</code> (which gives é, Latin lower-case E with acute accent, U+00E9 in Unicode) in an XML document will generate an error unless the entity has already been defined. XML also requires that the <code>x</code> in hexadecimal numeric references be in lowercase: for example <code>&#xA1b;</code> rather than <code>&#XA1b;</code>. [[XHTML]], which is an XML application, supports the HTML entity set, along with XML's predefined entities.
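The XML rules described in this section can be observed with the parser in Python's standard library. This is a minimal sketch under stated assumptions: <code>xml_escape</code> and <code>is_well_formed</code> are hypothetical helper names, and the behaviour shown is that of Python's bundled non-validating parser with no DTD supplied.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# escape() handles &, < and > by default; the two quote characters are
# added here so that all five predefined entities are covered.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def xml_escape(text: str) -> str:
    """Escape the five markup-sensitive characters with XML's predefined entities."""
    return escape(text, QUOTE_ENTITIES)


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print(xml_escape("a < b & 'c'"))            # a &lt; b &amp; &apos;c&apos;
    print(is_well_formed("<p>&eacute;</p>"))    # False: entity not defined
    print(is_well_formed("<p>&#xE9;</p>"))      # True: lowercase x
    print(is_well_formed("<p>&#XE9;</p>"))      # False: uppercase X
```

The named entity fails here only because nothing defines it; an XHTML document processed with its DTD would have the HTML entity set available.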
Line 155 ⟶ 169:
* [http://www.w3.org/TR/REC-html40/sgml/entities.html Character entity references in HTML4]
* [http://www.sitepoint.com/article/guide-web-character-encoding/ The Definitive Guide to Web Character Encoding]
* [http://code.google.com/p/browsersec/wiki/Part1#HTML_entity_encoding HTML Entity Encoding chapter of Browser Security Handbook – more information about current browsers and their entity handling]
* [http://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet The Open Web Application Security Project's wiki article on cross-site scripting (XSS)]

{{DEFAULTSORT:Character Encodings in Html}}
[[Category:HTML]]
[[Category:World Wide Web Consortium standards]]