Character encodings in HTML: Difference between revisions

Revision as of 14:20, 26 January 2023 by 2400:c600:3453:6834:1:0:ae5c:9665, compared with the latest revision as of 05:06, 16 November 2024 by Ejazz128 (edit summary: "External links: i have removed the broken link"). 16 intermediate revisions by 11 users are not shown.
First, the [[web server]] can include the character encoding or "<code>charset</code>" in the [[Hypertext Transfer Protocol]] (HTTP) <code>Content-Type</code> header, which would typically look like this:<ref>{{citation |chapter-url=http://tools.ietf.org/html/rfc7231#section-3.1.1.5 |chapter=Content-Type |title=Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content |publisher=[[IETF]] |date=June 2014 |doi=10.17487/RFC7231 |access-date=2014-07-30 |editor-last1=Fielding |editor-last2=Reschke |editor-first1=R |editor-first2=J |last1=Fielding |first1=R. |last2=Reschke |first2=J. |s2cid=14399078 }}</ref>

<syntaxhighlight lang="http">
Content-Type: text/html; charset=utf-8
</syntaxhighlight>

This method gives the HTTP server a convenient way to alter a document's encoding according to [[content negotiation]]; certain HTTP server software can do it, for example Apache with the [[List of Apache modules|module]] <code>mod_charset_lite</code>.<ref>{{cite web| url = http://httpd.apache.org/docs/2.0/en/mod/mod_charset_lite.html| title = Apache Module mod_charset_lite}}</ref>

For HTML it is possible to include this information inside the <code>head</code> element near the top of the document:<ref name=html5charset/>
<!-- Please don't add a closing "/": that is incorrect here. -->
<syntaxhighlight lang="html">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</syntaxhighlight>

[[HTML5]] also allows the following syntax to mean exactly the same:<ref name=html5charset>{{citation |chapter-url=http://www.w3.org/TR/html5/document-metadata.html#specifying-the-documents-character-encoding |chapter=Specifying the document's character encoding |title=HTML5 |publisher=[[World Wide Web Consortium]] |date=14 December 2017 |access-date=2018-05-28}}</ref>
<!-- Please don't add a closing "/": that is unnecessary here. -->
<syntaxhighlight lang="html">
<meta charset="utf-8">
</syntaxhighlight>

With this second approach, because the character encoding cannot be known until the declaration is parsed, there is a problem knowing which character encoding is used in the document up to and including the declaration itself. If the character encoding is an [[ASCII extension]] then the content up to and including the declaration itself should be pure ASCII and this will work correctly. For character encodings that are not ASCII extensions (i.e. not a superset of ASCII), such as [[UTF-16BE]] and [[UTF-16LE]], a processor of HTML, such as a web browser, should be able to parse the declaration in some cases through the use of heuristics.

=== Encoding detection algorithm ===
As of HTML5 the recommended charset is [[UTF-8]].<ref name=html5charset/> An "encoding sniffing algorithm" is defined in the specification to determine the character encoding of the document based on multiple sources of input, including:
# Explicit user instruction
# Analysis of the document bytes looking for specific sequences or ranges of byte values,<ref>{{cite web| url = http://www.w3.org/TR/html5/syntax.html#prescan-a-byte-stream-to-determine-its-encoding| title = HTML5 prescan a byte stream to determine its encoding}}</ref> and other tentative detection mechanisms.

Characters outside of the printable ASCII range (32 to 126) usually appear incorrectly. This presents few problems for [[English language|English]]-speaking users, but other languages regularly (in some cases, always) require characters outside that range. In Chinese, Japanese, and Korean ([[CJK characters|CJK]]) language environments, where several different multi-byte encodings are in use, auto-detection is also often employed. Finally, browsers usually permit the user to override an ''incorrect'' charset label manually as well.
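The interplay between these sources can be illustrated with a deliberately simplified Python sketch. It is not the specification's encoding sniffing algorithm: the priority order (byte order mark, then HTTP header, then an in-document <code>meta</code> declaration), the 1024-byte scan window, the regular expressions, the function name <code>sniff_encoding</code> and the <code>windows-1252</code> fallback are all assumptions made for illustration.

```python
import re
from typing import Optional

# Byte order marks are checked first in this sketch.
_BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16be"),
    (b"\xff\xfe", "utf-16le"),
]

# Finds "charset=..." in an HTTP Content-Type value or inside a <meta> tag.
_CHARSET_RE = re.compile(rb"""charset\s*=\s*["']?([A-Za-z0-9._:-]+)""", re.IGNORECASE)
_META_RE = re.compile(rb"<meta[^>]*>", re.IGNORECASE)


def sniff_encoding(body: bytes,
                   content_type: Optional[str] = None,
                   fallback: str = "windows-1252") -> str:
    """Guess a document's encoding label (hypothetical helper, simplified)."""
    # 1. A byte order mark, if present, wins over any label.
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    # 2. The charset parameter of the HTTP Content-Type header.
    if content_type:
        match = _CHARSET_RE.search(content_type.encode("ascii", "ignore"))
        if match:
            return match.group(1).decode("ascii").lower()
    # 3. A <meta> declaration near the top of the document (assumed window: 1024 bytes).
    for tag in _META_RE.findall(body[:1024]):
        match = _CHARSET_RE.search(tag)
        if match:
            return match.group(1).decode("ascii").lower()
    # 4. Nothing found: fall back to an assumed default.
    return fallback


print(sniff_encoding(b'<meta charset="utf-8">'))
print(sniff_encoding(b"<p>hi</p>", "text/html; charset=ISO-8859-2"))
```

Step 3 scans raw bytes with an ASCII pattern, which only works because the declaration itself is pure ASCII in any ASCII-extension encoding, as described above.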
It is increasingly common for multilingual websites and websites in non-Western languages to use [[UTF-8]], which allows use of the same encoding for all languages. [[UTF-16]] or [[UTF-32]], which can be used for all languages as well, are less widely used because they can be harder to handle in programming languages that assume a [[byte-oriented]] ASCII superset encoding, and they are less efficient for text with a high frequency of ASCII characters, which is usually the case for HTML documents.

* [[Windows-1258]]
* [[GB 18030]]{{efn|Specified with 0xA3A0 as a duplicate encoding of the [[ideographic space]] (U+3000) for compatibility reasons, and as such excluding U+E5E5 (a private use character).<ref name="gbenc"/><ref name="gbindex"/> Also, specified with 0x80 accepted as an alternative encoding of the [[euro sign]] (U+20AC; see [[Windows-936]]).<ref>{{cite web |url=https://encoding.spec.whatwg.org/#gb18030-decoder |title=10.2.1. gb18030 decoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> Otherwise, follows the mappings from the 2005 standard.<ref name="gbindex">{{cite web |url=https://encoding.spec.whatwg.org/#index-gb18030 |title=5. Indexes (§ index gb18030) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[Big5]]{{efn|[[Hong Kong Supplementary Character Set]] variant,<ref name="encoding_rs"/> although most of the HKSCS extensions (those with lead bytes less than 0xA1) are not included by the encoder, only by the decoder.<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-big5-pointer |title=5. Indexes (§ index Big5 pointer) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[Shift JIS]]{{efn|The specification includes [[IBM]] and [[NEC]] extensions,<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-jis0208 |title=5. Indexes (§ Index jis0208) |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> and is more precisely [[Windows-31J]].<ref name="encoding_rs">{{cite web |url=https://docs.rs/encoding_rs/latest/encoding_rs/#notable-differences-from-iana-naming |title=Notable Differences from IANA Naming |work=Crate encoding_rs |publisher=docs.rs |author=Mozilla Foundation |author-link=Mozilla Foundation}}</ref>}}
* [[ISO-2022-JP]]{{efn|The specification uses the same index as used for Shift JIS (insofar as is within reach), i.e. includes NEC extensions. [[Half-width kana]] is converted to fullwidth by the encoder,<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-iso-2022-jp-katakana |title=5. Indexes (§ Index ISO-2022-JP katakana) |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> but accepted using an escape sequence (ESC 0x28 0x49) by the decoder.<ref name="whatwgjisdecoder">{{cite web |url=https://encoding.spec.whatwg.org/#iso-2022-jp-decoder |title=12.2.1. ISO-2022-JP decoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> [[Shift Out]] and [[Shift In]] (0x0E and 0x0F) are excluded entirely to prevent attacks.<ref name="whatwgjisdecoder" /><ref>{{cite web |url=https://encoding.spec.whatwg.org/#iso-2022-jp-encoder |title=12.2.2. ISO-2022-JP encoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[EUC-KR]]{{efn|Actually [[Unified Hangul Code]] (Windows-949), which is a superset which covers the entire [[Hangul Syllables (Unicode block)|Hangul Syllables]] block.<ref name="encoding_rs"/><ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-euc-kr |title=5. Indexes (§ index EUC-KR) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[UTF-16BE]]{{efn|Specified for decoding only; form submissions from UTF-16-coded documents are to be encoded in [[UTF-8]].<ref name="outputenc">{{cite web |url=https://encoding.spec.whatwg.org/#output-encodings |title=4.3. Output encodings |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[UTF-16LE]]{{efn|For compatibility with deployed content, also specified for the plain <code>[[UTF-16]]</code> label,<ref>{{cite web |url=https://encoding.spec.whatwg.org/#utf-16le |title=14.4. UTF-16LE |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> although a [[byte order mark]] (BOM), if present, takes priority over any label.<ref>{{cite web |url=https://encoding.spec.whatwg.org/#decode |title=6. Hooks for standards (§ decode) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> Specified for decoding only; form submissions from UTF-16-coded documents are to be encoded in [[UTF-8]].<ref name="outputenc" />}}

* [[ISO-8859-16]]
* [[KOI8-R]]
* [[KOI8-U]] / [[KOI8-RU]]{{efn|Titled KOI8-U and specified for both <code>KOI8-U</code> and <code>KOI8-RU</code> labels;<ref name="namesandlabels"/> follows [[KOI8-RU]] in positions 0xAE and 0xBE (i.e. includes [[Ў|Ў/ў]])<ref name="whatwg-koi8u">{{cite web |url=https://encoding.spec.whatwg.org/koi8-u.html |title=index KOI8-U visualization |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref><ref>{{cite web |url=https://www.w3.org/Bugs/Public/show_bug.cgi?id=17053 |title=Bug 17053: Support KOI8-RU mapping for KOI8-U |date=2015-08-19 |work=[[W3C]] Bugzilla}}</ref> but KOI8-U in positions 0x93–9F.<ref name="whatwg-koi8u"/>}}
* [[Mac OS Roman]]
* [[Windows-1253]]
* [[Mac OS Cyrillic encoding|Mac OS Cyrillic]]
* [[GBK (character encoding)|GBK]]{{efn|Also specified for <code>[[GB 2312|GB2312]]</code> and related labels. Handled the same as {{nowrap|GB 18030}} for decoding purposes.<ref>{{cite web |url=https://encoding.spec.whatwg.org/#gbk |title=10.1. GBK |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> For encoding purposes, labelling as GBK (or {{nowrap|GB 2312}}) excludes four-byte codes, and favours the one-byte 0x80 representation for U+20AC.<ref name="gbenc">{{cite web |url=https://encoding.spec.whatwg.org/#gb18030-encoder |title=10.2.2. gb18030 encoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[EUC-JP]]{{efn|The specification uses the same index as used for Shift JIS (insofar as is within reach of the EUC code set 1), i.e. includes NEC extensions. [[JIS X 0212]] is included for decoding only.<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-jis0212 |title=5. Indexes (§ Index jis0212) |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
}}{{notelist}}

}}

==Character references==
{{Main|List of XML and HTML character entity references|Numeric character reference}}
In addition to native character encodings, characters can also be encoded as ''character references'', which can be ''numeric character references'' ([[decimal]] or [[hexadecimal]]) or ''character entity references''. Character entity references are also sometimes referred to as ''named entities'', or ''HTML entities'' for HTML. HTML's usage of character references derives from [[SGML]].

===HTML character references===
<!--Linked from [[Template:Auxiliary template common notice]]-->
A ''[[numeric character reference]]'' in HTML refers to a character by its [[Universal Character Set]]/[[Unicode]] ''[[code point]]'', and uses the format
:<code>&#''nnnn'';</code>

For codes from 0 to 127, the original 7-bit [[ASCII]] standard set, most of these characters can be used without a character reference. Codes from 160 to 255 can all be created using [[List of XML and HTML character entity references|character entity names]]. Only a few higher-numbered codes can be created using entity names, but all can be created by decimal number character reference.
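As a small illustration of numeric character references, the following Python sketch converts a character to its decimal or hexadecimal reference and decodes it again with the standard library's <code>html.unescape</code>. The helper name <code>to_numeric_reference</code> is hypothetical and exists only for this example.

```python
import html
from html.entities import codepoint2name


def to_numeric_reference(ch: str, hexadecimal: bool = False) -> str:
    """Return the numeric character reference for a single character."""
    code_point = ord(ch)  # Unicode code point, e.g. 955 for "λ"
    if hexadecimal:
        return f"&#x{code_point:x};"
    return f"&#{code_point};"


print(to_numeric_reference("λ"))                    # &#955;
print(to_numeric_reference("λ", hexadecimal=True))  # &#x3bb;
print(html.unescape("&#955;"))                      # λ

# Python's table of HTML 4 entity names has an entry for every code from 160 to 255.
print(all(cp in codepoint2name for cp in range(160, 256)))
```

Any code point can be written this way, whereas only characters with a defined entity name can use the named form.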
[[List of XML and HTML character entity references|Character entity references]] can also have the format <code>&''name'';</code> where ''name'' is a case-sensitive alphanumeric string. For example, "λ" can also be encoded as <code>&lambda;</code> in an HTML document. The character entity references <code>&lt;</code>, <code>&gt;</code>, <code>&quot;</code> and <code>&amp;</code> are predefined in HTML and SGML, because <code><</code>, <code>></code>, <code>"</code> and <code>&</code> are already used to delimit markup. This notably did not include XML's <code>&apos;</code> (') entity prior to [[HTML5]]. For a list of all named HTML character entity references along with the versions in which they were introduced, see [[List of XML and HTML character entity references]].

Unnecessary use of HTML character references may significantly reduce HTML readability. If the character encoding for a web page is chosen appropriately, then HTML character references are usually only required for markup delimiting characters as mentioned above, and for a few special characters (or none at all if a native [[Unicode]] encoding like [[UTF-8]] is used). Incorrect HTML entity escaping may also open up security vulnerabilities for injection attacks such as [[cross-site scripting]]. If HTML attributes are left unquoted, certain characters, most importantly [[whitespace character|whitespace]], such as space and tab, must be escaped using entities. Other languages related to HTML have their own methods of escaping characters.

Unlike traditional HTML with its large range of character entity references, in [[XML]] there are only five predefined character entity references. These are used to escape characters that are markup sensitive in certain contexts:<ref>{{citation |chapter-url=http://www.w3.org/TR/REC-xml/#sec-references |chapter=Character and Entity References |title=XML |first1=T. |last1=Bray |author-link1=Tim Bray |first2=J. |last2=Paoli |first3=C. |last3=Sperberg-McQueen |author-link3=Michael Sperberg-McQueen |first4=E. |last4=Maler |first5=F. |last5=Yergeau |publisher=[[W3C]] |date=26 November 2008 |access-date=8 March 2010}}</ref>

{| class="wikitable"
| <code>&amp;</code> ||align="center"| & || [[ampersand]] || U+0026
|-
| <code>&lt;</code> ||align="center"| < || less-than sign || U+003C
|-
| <code>&gt;</code> ||align="center"| > || greater-than sign || U+003E
|-
| <code>&quot;</code> ||align="center"| " || quotation mark || U+0022
|-
| <code>&apos;</code> ||align="center"| ' || apostrophe || U+0027
|}

All other character entity references have to be defined before they can be used. For example, use of <code>&eacute;</code> (which gives é, Latin lower-case E with acute accent, U+00E9 in Unicode) in an XML document will generate an error unless the entity has already been defined. XML also requires that the <code>x</code> in hexadecimal numeric references be in lowercase: for example <code>&#xA1b;</code> rather than <code>&#XA1b;</code>. [[XHTML]], which is an XML application, supports the HTML entity set, along with XML's predefined entities.

{{Reflist}}

== External links ==
* [https://devpal.co/html-entity-encode/ Online HTML entity encoder & decoder tool]
* [http://www.w3.org/TR/REC-html40/sgml/entities.html Character entity references in HTML4]
* [http://www.sitepoint.com/article/guide-web-character-encoding/ The Definitive Guide to Web Character Encoding]