Character encodings in HTML: Difference between revisions

Revision as of 07:14, 25 October 2022 by Comp.arch (no edit summary), compared with the latest revision as of 05:06, 16 November 2024 by Ejazz128 (edit summary: "External links: i have removed the broken link").
(22 intermediate revisions by 13 users not shown)
Line 9:

There are two general ways to specify which character encoding is used in the document. First, the [[web server]] can include the character encoding or "<code>charset</code>" in the [[Hypertext Transfer Protocol]] (HTTP) <code>Content-Type</code> header, which would typically look like this:<ref>{{citation |chapter-url=http://tools.ietf.org/html/rfc7231#section-3.1.1.5|chapter=Content-Type |title=Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content|publisher=[[IETF]] |date=June 2014 |doi=10.17487/RFC7231 |access-date=2014-07-30|editor-last1=Fielding |editor-last2=Reschke |editor-first1=R |editor-first2=J |last1=Fielding |first1=R. |last2=Reschke |first2=J. |s2cid=14399078 }}</ref>

<syntaxhighlight lang="http">
Content-Type: text/html; charset=utf-8
</syntaxhighlight>

This method gives the HTTP server a convenient way to alter a document's encoding according to [[content negotiation]]; certain HTTP server software can do it, for example Apache with the [[List of Apache modules|module]] <code>mod_charset_lite</code>.<ref>{{cite web| url = http://httpd.apache.org/docs/2.0/en/mod/mod_charset_lite.html| title = Apache Module mod_charset_lite}}</ref>

Line 17 ⟶ 19:

For HTML it is possible to include this information inside the <code>head</code> element near the top of the document:<ref name=html5charset/>
<!-- Please don't add a closing "/": that is incorrect here. -->
<syntaxhighlight lang="html">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</syntaxhighlight>

[[HTML5]] also allows the following syntax to mean exactly the same:<ref name=html5charset>{{citation |chapter-url=http://www.w3.org/TR/html5/document-metadata.html#specifying-the-documents-character-encoding |chapter=Specifying the document's character encoding |title=HTML5 |publisher=[[World Wide Web Consortium]] |date=14 December 2017 |access-date=2018-05-28}}</ref>
<!-- Please don't add a closing "/": that is unnecessary here. -->
<syntaxhighlight lang="html">
<meta charset="utf-8">
</syntaxhighlight>

[[XHTML]] documents have a third option: to express the character encoding via [[XML]] declaration, as follows:<ref>{{citation |chapter-url=http://www.w3.org/TR/REC-xml/#sec-prolog-dtd |chapter=Prolog and Document Type Declaration |title=XML |first1=T. |last1=Bray |author-link1=Tim Bray |first2=J. |last2=Paoli |first3=C. |last3=Sperberg-McQueen |author-link3=Michael Sperberg-McQueen |first4=E. |last4=Maler |first5=F. |last5=Yergeau |publisher=[[W3C]] |date=26 November 2008 |access-date=8 March 2010}}</ref>

<syntaxhighlight lang="xml">
<?xml version="1.0" encoding="utf-8"?>
</syntaxhighlight>

With this second approach, because the character encoding cannot be known until the declaration is parsed, there is a problem knowing which character encoding is used in the document up to and including the declaration itself. If the character encoding is an [[ASCII extension]] then the content up to and including the declaration itself should be pure ASCII and this will work correctly. For character encodings that are not ASCII extensions (i.e. not a superset of ASCII), such as [[UTF-16BE]] and [[UTF-16LE]], a processor of HTML, such as a web browser, should be able to parse the declaration in some cases through the use of heuristics.
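As a rough illustration of the declarations described above, the following Python sketch reads a charset label from a <code>Content-Type</code> header value and, separately, from the first bytes of a document. It is a simplification and not the prescan algorithm of the HTML specification: the function names <code>charset_from_header</code> and <code>charset_from_markup</code>, the regular expressions, and the 1024-byte search window are assumptions made for this example. Because it matches ASCII bytes directly, it only works for encodings that are ASCII extensions, which is the limitation the preceding paragraph describes.

```python
import re
from email.message import Message

# Hypothetical patterns for this sketch; a real parser handles many more cases.
META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)
XML_DECL = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z0-9_.:-]+)["']"""
)


def charset_from_header(content_type):
    """Return the lower-cased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def charset_from_markup(data, limit=1024):
    """Look for an XML declaration or <meta> charset label in the first bytes.

    Only ASCII-compatible encodings can be found this way, because the
    declaration itself is matched as ASCII bytes.
    """
    head = data[:limit]
    for pattern in (XML_DECL, META_CHARSET):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii").lower()
    return None
```

A caller would typically prefer the header value when present and fall back to the in-document label, for example <code>charset_from_header(value) or charset_from_markup(body)</code>. A document encoded in UTF-16 yields no match with this sketch, since its declaration is not stored as plain ASCII bytes.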
=== Encoding detection algorithm ===
As of HTML5 the recommended charset is [[UTF-8]].<ref name=html5charset/> An "encoding sniffing algorithm" is defined in the specification to determine the character encoding of the document based on multiple sources of input, including:
# Explicit user instruction

Line 42 ⟶ 44:

# Analysis of the document bytes looking for specific sequences or ranges of byte values,<ref>{{cite web| url = http://www.w3.org/TR/html5/syntax.html#prescan-a-byte-stream-to-determine-its-encoding| title = HTML5 prescan a byte stream to determine its encoding}}</ref> and other tentative detection mechanisms.

If an ASCII-compatible encoding is misidentified, characters outside of the printable ASCII range (32 to 126) usually appear incorrectly. This presents few problems for [[English language|English]]-speaking users, but other languages regularly (in some cases, always) require characters outside that range. In Chinese, Japanese, and Korean ([[CJK characters|CJK]]) language environments, where several different multi-byte encodings are in use, auto-detection is also often employed. Finally, browsers usually permit the user to override an ''incorrect'' charset label manually as well.

It is increasingly common for multilingual websites and websites in non-Western languages to use [[UTF-8]], which allows use of the same encoding for all languages. [[UTF-16]] or [[UTF-32]], which can be used for all languages as well, are less widely used because they can be harder to handle in programming languages that assume a [[byte-oriented]] ASCII superset encoding, and they are less efficient for text with a high frequency of ASCII characters, which is usually the case for HTML documents.
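The efficiency point above can be checked with a short Python sketch that encodes the same markup in the three Unicode encoding forms and compares the byte counts. The helper name <code>encoded_sizes</code> and the sample strings are illustrative assumptions; the little-endian codec names are used only so that no byte order mark is added to the counts.

```python
def encoded_sizes(text):
    """Return the encoded length in bytes of text for each Unicode encoding form."""
    return {
        name: len(text.encode(name))
        for name in ("utf-8", "utf-16-le", "utf-32-le")
    }


# Markup is ASCII even when the content is not, so UTF-8 keeps the tags compact.
ascii_markup = '<p class="note">Hello</p>'
greek_markup = '<p lang="el">Καλημέρα κόσμε</p>'

for sample in (ascii_markup, greek_markup):
    print(sample, encoded_sizes(sample))
```

For pure ASCII input, UTF-16 takes twice and UTF-32 four times as many bytes as UTF-8. The balance can shift for text made up mostly of CJK characters with little markup, where UTF-8 generally needs three bytes per character and UTF-16 two.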
Line 66 ⟶ 68:

* [[Windows-1258]]
* [[GB 18030]]{{efn|Specified with 0xA3A0 as a duplicate encoding of the [[ideographic space]] (U+3000) for compatibility reasons, and as such excluding U+E5E5 (a private use character).<ref name="gbenc"/><ref name="gbindex"/> Also, specified with 0x80 accepted as an alternative encoding of the [[euro sign]] (U+20AC; see [[Windows-936]]).<ref>{{cite web |url=https://encoding.spec.whatwg.org/#gb18030-decoder |title=10.2.1. gb18030 decoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> Otherwise, follows the mappings from the 2005 standard.<ref name="gbindex">{{cite web |url=https://encoding.spec.whatwg.org/#index-gb18030 |title=5. Indexes (§ index gb18030) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[Big5]]{{efn|[[Hong Kong Supplementary Character Set]] variant,<ref name="encoding_rs"/> although most of the HKSCS extensions (those with lead bytes less than 0xA1) are not included by the encoder, only by the decoder.<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-big5-pointer |title=5. Indexes (§ index Big5 pointer) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[Shift JIS]]{{efn|The specification includes [[IBM]] and [[NEC]] extensions,<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-jis0208 |title=5. Indexes (§ Index jis0208) |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> and is more precisely [[Windows-31J]].<ref name="encoding_rs">{{cite web |url=https://docs.rs/encoding_rs/latest/encoding_rs/#notable-differences-from-iana-naming |title=Notable Differences from IANA Naming |work=Crate encoding_rs |publisher=docs.rs |author=Mozilla Foundation |author-link=Mozilla Foundation}}</ref>}}
* [[ISO-2022-JP]]{{efn|The specification uses the same index as used for Shift JIS (insofar as is within reach), i.e. includes NEC extensions. [[Half-width kana]] is converted to fullwidth by the encoder,<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-iso-2022-jp-katakana |title=5. Indexes (§ Index ISO-2022-JP katakana) |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> but accepted using an escape sequence (ESC 0x28 0x49) by the decoder.<ref name="whatwgjisdecoder">{{cite web |url=https://encoding.spec.whatwg.org/#iso-2022-jp-decoder |title=12.2.1. ISO-2022-JP decoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> [[Shift Out]] and [[Shift In]] (0x0E and 0x0F) are excluded entirely to prevent attacks.<ref name="whatwgjisdecoder" /><ref>{{cite web |url=https://encoding.spec.whatwg.org/#iso-2022-jp-encoder |title=12.2.2. ISO-2022-JP encoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[EUC-KR]]{{efn|Actually [[Unified Hangul Code]] (Windows-949), which is a superset which covers the entire [[Hangul Syllables (Unicode block)|Hangul Syllables]] block.<ref name="encoding_rs"/><ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-euc-kr |title=5. Indexes (§ index EUC-KR) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[UTF-16BE]]{{efn|Specified for decoding only; form submissions from UTF-16-coded documents are to be encoded in [[UTF-8]].<ref name="outputenc">{{cite web |url=https://encoding.spec.whatwg.org/#output-encodings |title=4.3. Output encodings |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[UTF-16LE]]{{efn|For compatibility with deployed content, also specified for the plain <code>[[UTF-16]]</code> label,<ref>{{cite web |url=https://encoding.spec.whatwg.org/#utf-16le |title=14.4. UTF-16LE |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> although a [[byte order mark]] (BOM), if present, takes priority over any label.<ref>{{cite web |url=https://encoding.spec.whatwg.org/#decode |title=6. Hooks for standards (§ decode) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> Specified for decoding only; form submissions from UTF-16-coded documents are to be encoded in [[UTF-8]].<ref name="outputenc" />}}

Line 89 ⟶ 91:

* [[ISO-8859-16]]
* [[KOI8-R]]
* [[KOI8-U]] / [[KOI8-RU]]{{efn|Titled KOI8-U and specified for both <code>KOI8-U</code> and <code>KOI8-RU</code> labels;<ref name="namesandlabels"/> follows [[KOI8-RU]] in positions 0xAE and 0xBE (i.e. includes [[Ў|Ў/ў]]),<ref name="whatwg-koi8u">{{cite web |url=https://encoding.spec.whatwg.org/koi8-u.html |title=index KOI8-U visualization |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref><ref>{{cite web |url=https://www.w3.org/Bugs/Public/show_bug.cgi?id=17053 |title=Bug 17053: Support KOI8-RU mapping for KOI8-U |date=2015-08-19 |work=[[W3C]] Bugzilla}}</ref> but KOI8-U in positions 0x93–9F.<ref name="whatwg-koi8u"/>}}
* [[Mac OS Roman]]
* [[Windows-1253]]
* [[Mac OS Cyrillic encoding|Mac OS Cyrillic]]
* [[GBK (character encoding)|GBK]]{{efn|Also specified for <code>[[GB 2312|GB2312]]</code> and related labels. Handled the same as {{nowrap|GB 18030}} for decoding purposes.<ref>{{cite web |url=https://encoding.spec.whatwg.org/#gbk |title=10.1. GBK |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> For encoding purposes, labelling as GBK (or {{nowrap|GB 2312}}) excludes four-byte codes, and favours the one-byte 0x80 representation for U+20AC.<ref name="gbenc">{{cite web |url=https://encoding.spec.whatwg.org/#gb18030-encoder |title=10.2.2. gb18030 encoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[EUC-JP]]{{efn|The specification uses the same index as used for Shift JIS (insofar as is within reach of the EUC code set 1), i.e. includes NEC extensions. [[JIS X 0212]] is included for decoding only.<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-jis0212 |title=5. Indexes (§ Index jis0212) |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
}}{{notelist}}

Line 116 ⟶ 118:

==Character references==
{{Main|List of XML and HTML character entity references|Numeric character reference}}
In addition to native character encodings, characters can also be encoded as ''character references'', which can be ''numeric character references'' ([[decimal]] or [[hexadecimal]]) or ''character entity references''. Character entity references are also sometimes referred to as ''named entities'', or ''HTML entities'' for HTML. HTML's usage of character references derives from [[SGML]].

Line 122 ⟶ 124:

===HTML character references===
<!--Linked from [[Template:Auxiliary template common notice]]-->
A ''[[numeric character reference]]'' in HTML refers to a character by its [[Universal Character Set]]/[[Unicode]] ''[[code point]]'', and uses the format
:<code>&#''nnnn'';</code>

Line 134 ⟶ 136:

For codes from 0 to 127, the original 7-bit [[ASCII]] standard set, most of these characters can be used without a character reference. Codes from 160 to 255 can all be created using [[List of XML and HTML character entity references|character entity names]]. Only a few higher-numbered codes can be created using entity names, but all can be created by decimal number character reference.

[[List of XML and HTML character entity references|Character entity references]] can also have the format <code>&''name'';</code> where ''name'' is a case-sensitive alphanumeric string. For example, "λ" can also be encoded as <code>&lambda;</code> in an HTML document. The character entity references <code>&lt;</code>, <code>&gt;</code>, <code>&quot;</code> and <code>&amp;</code> are predefined in HTML and SGML, because <code><</code>, <code>></code>, <code>"</code> and <code>&</code> are already used to delimit markup.
This notably did not include XML's <code>&apos;</code> (') entity prior to [[HTML5]]. For a list of all named HTML character entity references along with the versions in which they were introduced, see [[List of XML and HTML character entity references]].

Unnecessary use of HTML character references may significantly reduce HTML readability. If the character encoding for a web page is chosen appropriately, then HTML character references are usually only required for markup delimiting characters as mentioned above, and for a few special characters (or none at all if a native [[Unicode]] encoding like [[UTF-8]] is used). Incorrect HTML entity escaping may also open up security vulnerabilities for injection attacks such as [[cross-site scripting]]. If HTML attributes are left unquoted, certain characters, most importantly [[whitespace character|whitespace]], such as space and tab, must be escaped using entities. Other languages related to HTML have their own methods of escaping characters.

===XML character references===
Unlike traditional HTML with its large range of character entity references, in [[XML]] there are only five predefined character entity references. These are used to escape characters that are markup sensitive in certain contexts:<ref>{{citation |chapter-url=http://www.w3.org/TR/REC-xml/#sec-references |chapter=Character and Entity References |title=XML |first1=T. |last1=Bray |author-link1=Tim Bray |first2=J. |last2=Paoli |first3=C. |last3=Sperberg-McQueen |author-link3=Michael Sperberg-McQueen |first4=E. |last4=Maler |first5=F. |last5=Yergeau |publisher=[[W3C]] |date=26 November 2008 |access-date=8 March 2010}}</ref>
{| class="wikitable"
| <code>&amp;</code> ||align="center"| & || [[ampersand]] || U+0026
|-
| <code>&lt;</code> ||align="center"| < || less-than sign || U+003C
|-
| <code>&gt;</code> ||align="center"| > || greater-than sign || U+003E
|-
| <code>&quot;</code> ||align="center"| " || quotation mark || U+0022
|-
| <code>&apos;</code> ||align="center"| ' || apostrophe || U+0027
|}

All other character entity references have to be defined before they can be used. For example, use of <code>&eacute;</code> (which gives é, Latin lower-case E with acute accent, U+00E9 in Unicode) in an XML document will generate an error unless the entity has already been defined. XML also requires that the <code>x</code> in hexadecimal numeric references be in lowercase: for example <code>&#xA1b;</code> rather than <code>&#XA1b;</code>. [[XHTML]], which is an XML application, supports the HTML entity set, along with XML's predefined entities.

Line 159 ⟶ 167:

== External links ==
[https://devpal.co/html-entity-encode/ Online HTML entity encoder & decoder tool]
* [http://www.w3.org/TR/REC-html40/sgml/entities.html Character entity references in HTML4]
* [http://www.sitepoint.com/article/guide-web-character-encoding/ The Definitive Guide to Web Character Encoding]