Character encodings in HTML: Difference between revisions

There are two general ways to specify which character encoding is used in the document.
 
First, the [[web server]] can include the character encoding or "<code>charset</code>" in the [[Hypertext Transfer Protocol]] (HTTP) <code>Content-Type</code> header, which would typically look like this:<ref>{{citation |chapter-url=http://tools.ietf.org/html/rfc7231#section-3.1.1.5|chapter=Content-Type |title=Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content|publisher=[[IETF]] |date=June 2014 |doi=10.17487/RFC7231 |access-date=2014-07-30|editor-last1=Fielding |editor-last2=Reschke |editor-first1=R |editor-first2=J |last1=Fielding |first1=R. |last2=Reschke |first2=J. |s2cid=14399078 }}</ref>
<syntaxhighlight lang="http">
Content-Type: text/html; charset=utf-8
</syntaxhighlight>
This method gives the HTTP server a convenient way to alter a document's encoding according to [[content negotiation]]; certain HTTP server software can do it, for example Apache with the [[List of Apache modules|module]] <code>mod_charset_lite</code>.<ref>{{cite web| url = http://httpd.apache.org/docs/2.0/en/mod/mod_charset_lite.html| title = Apache Module mod_charset_lite}}</ref>
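
As an illustration of how a client might read the <code>charset</code> parameter from such a header, the following minimal sketch uses Python's standard <code>email.message</code> module. The function name <code>charset_from_content_type</code> is hypothetical and not part of any specification.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() unquotes and lowercases the charset parameter.
    return msg.get_content_charset()
```

A header without a <code>charset</code> parameter yields <code>None</code>, in which case a client has to fall back on other sources of encoding information.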
 
For HTML it is possible to include this information inside the <code>head</code> element near the top of the document:<ref name=html5charset/>
<!-- Please don't add a closing "/": that is incorrect here. -->
<syntaxhighlight lang="html">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</syntaxhighlight>
 
[[HTML5]] also allows the following syntax to mean exactly the same:<ref name=html5charset>{{citation |chapter-url=http://www.w3.org/TR/html5/document-metadata.html#specifying-the-documents-character-encoding |chapter=Specifying the document's character encoding |title=HTML5 |publisher=[[World Wide Web Consortium]] |date=14 December 2017 |access-date=2018-05-28}}</ref>
<!-- Please don't add a closing "/": that is unnecessary here. -->
<syntaxhighlight lang="html">
<meta charset="utf-8">
</syntaxhighlight>
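
Both <code>meta</code> forms can be read with an ordinary HTML parser. The sketch below is a simplified illustration built on Python's standard <code>html.parser</code>; the names <code>MetaCharsetParser</code> and <code>declared_charset</code> are hypothetical. It works on already-decoded text, whereas a real browser has to scan the raw bytes before the encoding is known.

```python
from email.message import Message
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Record the first charset declared by a <meta> element."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)  # names are already lowercased by HTMLParser
        if attributes.get("charset"):
            # HTML5 short form: <meta charset="utf-8">
            self.charset = attributes["charset"].strip().lower()
        elif (attributes.get("http-equiv") or "").lower() == "content-type":
            # Long form: the content attribute holds a Content-Type value.
            content = attributes.get("content")
            if content:
                msg = Message()
                msg["Content-Type"] = content
                self.charset = msg.get_content_charset()


def declared_charset(html_text):
    parser = MetaCharsetParser()
    parser.feed(html_text)
    return parser.charset
```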
 
[[XHTML]] documents have a third option: to express the character encoding via [[XML]] declaration, as follows:<ref>{{citation |chapter-url=http://www.w3.org/TR/REC-xml/#sec-prolog-dtd |chapter=Prolog and Document Type Declaration |title=XML |first1=T. |last1=Bray |author-link1=Tim Bray |first2=J. |last2=Paoli |first3=C. |last3=Sperberg-McQueen |author-link3=Michael Sperberg-McQueen |first4=E. |last4=Maler |first5=F. |last5=Yergeau |publisher=[[W3C]] |date=26 November 2008 |access-date=8 March 2010}}</ref>
<syntaxhighlight lang="xml">
<?xml version="1.0" encoding="utf-8"?>
</syntaxhighlight>
 
With this second approach, because the character encoding cannot be known until the declaration is parsed, there is a problem knowing which character encoding is used in the document up to and including the declaration itself. If the character encoding is an [[ASCII extension]] then the content up to and including the declaration itself should be pure ASCII and this will work correctly. For character encodings that are not ASCII extensions (i.e. not a superset of ASCII), such as [[UTF-16BE]] and [[UTF-16LE]], a processor of HTML, such as a web browser, should be able to parse the declaration in some cases through the use of heuristics.
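
The bootstrapping problem can be seen directly at the byte level. In the following sketch (the pattern and the function name <code>xml_declared_encoding</code> are illustrative assumptions, not a conforming parser), the declaration is found when the bytes are ASCII-compatible, but not when the same text is stored as UTF-16:

```python
import re

# Matches an XML declaration written in ASCII-compatible bytes and captures
# the value of its encoding pseudo-attribute.
XML_DECL_ENCODING = re.compile(
    rb'^<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def xml_declared_encoding(data):
    """Return the declared encoding name from raw bytes, or None if unreadable."""
    match = XML_DECL_ENCODING.match(data)
    return match.group(1).decode("ascii").lower() if match else None
```

For UTF-16 input every ASCII character is padded with a zero byte, so a byte-level search for the declaration fails unless the processor first guesses the byte layout.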
 
=== Encoding detection algorithm ===
As of HTML5 the recommended charset is [[UTF-8]].<ref name=html5charset/> An "encoding sniffing algorithm" is defined in the specification to determine the character encoding of the document based on multiple sources of input, including:
# Explicit user instruction
# Analysis of the document bytes looking for specific sequences or ranges of byte values,<ref>{{cite web| url = http://www.w3.org/TR/html5/syntax.html#prescan-a-byte-stream-to-determine-its-encoding| title = HTML5 prescan a byte stream to determine its encoding}}</ref> and other tentative detection mechanisms.
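
One of the simplest byte-level checks is looking for a [[byte order mark]] at the start of the stream. The sketch below covers only that single step and is not the full sniffing algorithm; the table <code>BOMS</code> and the function <code>sniff_bom</code> are hypothetical names.

```python
import codecs

# Byte order marks and the encodings they signal.
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
)


def sniff_bom(data):
    """Return the encoding signalled by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```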
 
When the encoding is misidentified, characters outside of the printable ASCII range (32 to 126) usually appear incorrectly. This presents few problems for [[English language|English]]-speaking users, but other languages regularly (in some cases, always) require characters outside that range. In Chinese, Japanese, and Korean ([[CJK characters|CJK]]) language environments, where several different multi-byte encodings are in use, auto-detection is also often employed. Finally, browsers usually permit the user to override an ''incorrect'' charset label manually as well.
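
The effect of a wrong guess is easy to reproduce. In this small sketch (the function name <code>misdecode</code> and the choice of Windows-1252 as the wrongly assumed encoding are illustrative assumptions), ASCII text survives while accented letters are garbled:

```python
def misdecode(text, actual="utf-8", assumed="cp1252"):
    """Encode text with one encoding and decode it with another."""
    return text.encode(actual).decode(assumed)


print(misdecode("plain ASCII"))  # unchanged
print(misdecode("naïve café"))   # accented letters become two characters each
```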
 
It is increasingly common for multilingual websites and websites in non-Western languages to use [[UTF-8]], which allows use of the same encoding for all languages. [[UTF-16]] or [[UTF-32]], which can be used for all languages as well, are less widely used because they can be harder to handle in programming languages that assume a [[byte-oriented]] ASCII superset encoding, and they are less efficient for text with a high frequency of ASCII characters, which is usually the case for HTML documents.
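
The size difference for markup-heavy text can be checked with a short sketch; the helper name <code>encoded_sizes</code> and the sample string are assumptions made for illustration.

```python
def encoded_sizes(text):
    """Return the encoded length in bytes of text in three Unicode encodings."""
    return {name: len(text.encode(name)) for name in ("utf-8", "utf-16-le", "utf-32-le")}


# 17 ASCII characters of markup plus one Greek letter.
sample = '<p lang="el">λ</p>'
print(encoded_sizes(sample))
```

Because the ASCII markup takes one byte per character in UTF-8 but two or four bytes in UTF-16 and UTF-32, the UTF-8 form of this sample is roughly half the size of the UTF-16 form.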
* [[Windows-1258]]
* [[GB 18030]]{{efn|Specified with 0xA3A0 as a duplicate encoding of the [[ideographic space]] (U+3000) for compatibility reasons, and as such excluding U+E5E5 (a private use character).<ref name="gbenc"/><ref name="gbindex"/> Also, specified with 0x80 accepted as an alternative encoding of the [[euro sign]] (U+20AC; see [[Windows-936]]).<ref>{{cite web |url=https://encoding.spec.whatwg.org/#gb18030-decoder |title=10.2.1. gb18030 decoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> Otherwise, follows the mappings from the 2005 standard.<ref name="gbindex">{{cite web |url=https://encoding.spec.whatwg.org/#index-gb18030 |title=5. Indexes (§ index gb18030) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[Big5]]{{efn|[[Hong Kong Supplementary Character Set]] variant,<ref name="encoding_rs"/> although most of the HKSCS extensions (those with lead bytes less than 0xA1) are not included by the encoder, only by the decoder.<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-big5-pointer |title=5. Indexes (§ index Big5 pointer) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[Shift JIS]]{{efn|The specification includes [[IBM]] and [[NEC]] extensions (see [[Windows-31J]]),<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-jis0208 |title=5. Indexes (§ Index jis0208) |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> and is more precisely [[Windows-31J]].<ref name="encoding_rs">{{cite web |url=https://docs.rs/encoding_rs/latest/encoding_rs/#notable-differences-from-iana-naming |title=Notable Differences from IANA Naming |work=Crate encoding_rs |publisher=docs.rs |author=Mozilla Foundation |author-link=Mozilla Foundation}}</ref>}}
* [[ISO-2022-JP]]{{efn|The specification uses the same index as used for Shift JIS (insofar as is within reach), i.e. includes NEC extensions. [[Half-width kana]] is converted to fullwidth by the encoder,<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-iso-2022-jp-katakana |title=5. Indexes (§ Index ISO-2022-JP katakana) |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> but accepted using an escape sequence (ESC 0x28 0x49) by the decoder.<ref name="whatwgjisdecoder">{{cite web |url=https://encoding.spec.whatwg.org/#iso-2022-jp-decoder |title=12.2.1. ISO-2022-JP decoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> [[Shift Out]] and [[Shift In]] (0x0E and 0x0F) are excluded entirely to prevent attacks.<ref name="whatwgjisdecoder" /><ref>{{cite web |url=https://encoding.spec.whatwg.org/#iso-2022-jp-encoder |title=12.2.2. ISO-2022-JP encoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[EUC-KR]]{{efn|Actually [[Unified Hangul Code]] (Windows-949), which is a superset which covers the entire [[Hangul Syllables (Unicode block)|Hangul Syllables]] block.<ref name="encoding_rs"/><ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-euc-kr |title=5. Indexes (§ index EUC-KR) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[UTF-16BE]]{{efn|Specified for decoding only; form submissions from UTF-16-coded documents are to be encoded in [[UTF-8]].<ref name="outputenc">{{cite web |url=https://encoding.spec.whatwg.org/#output-encodings |title=4.3. Output encodings |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[UTF-16LE]]{{efn|For compatibility with deployed content, also specified for the plain <code>[[UTF-16]]</code> label,<ref>{{cite web |url=https://encoding.spec.whatwg.org/#utf-16le |title=14.4. UTF-16LE |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> although a [[byte order mark]] (BOM), if present, takes priority over any label.<ref>{{cite web |url=https://encoding.spec.whatwg.org/#decode |title=6. Hooks for standards (§ decode) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> Specified for decoding only; form submissions from UTF-16-coded documents are to be encoded in [[UTF-8]].<ref name="outputenc" />}}
* [[ISO-8859-16]]
* [[KOI8-R]]
* [[KOI8-U]] / [[KOI8-RU]]{{efn|Titled KOI8-U and specified for both <code>KOI8-U</code> and <code>KOI8-RU</code> labels;<ref name="namesandlabels"/> it follows [[KOI8-RU]] in positions 0xAE and 0xBE (i.e. includes [[Ў|Ў/ў]]),<ref name="whatwg-koi8u">{{cite web |url=https://encoding.spec.whatwg.org/koi8-u.html |title=index KOI8-U visualization |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref><ref>{{cite web |url=https://www.w3.org/Bugs/Public/show_bug.cgi?id=17053 |title=Bug 17053: Support KOI8-RU mapping for KOI8-U |date=2015-08-19 |work=[[W3C]] Bugzilla}}</ref> but KOI8-U in positions 0x93–9F.<ref name="whatwg-koi8u"/>}}
* [[Mac OS Roman]]
* [[Windows-1253]]
* [[Mac OS Cyrillic encoding|Mac OS Cyrillic]]
* [[GBK (character encoding)|GBK]]{{efn|Also specified for <code>[[GB 2312|GB2312]]</code> and related labels. Handled the same as {{nowrap|GB 18030}} for decoding purposes.<ref>{{cite web |url=https://encoding.spec.whatwg.org/#gbk |title=10.1. GBK |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> For encoding purposes, labelling as GBK (or {{nowrap|GB 2312}}) excludes four-byte codes, and favours the one-byte 0x80 representation for U+20AC.<ref name="gbenc">{{cite web |url=https://encoding.spec.whatwg.org/#gb18030-encoder |title=10.2.2. gb18030 encoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[EUC-JP]]{{efn|The specification uses the same index as used for Shift JIS (insofar as is within reach of the EUC code set 1), i.e. includes NEC extensions. [[JIS X 0212]] is included for decoding only.<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-jis0212 |title=5. Indexes (§ Index jis0212) |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
}}{{notelist}}
 
==Character references==
{{Main|List of XML and HTML character entity references|Numeric character reference}}
 
In addition to native character encodings, characters can also be encoded as ''character references'', which can be ''numeric character references'' ([[decimal]] or [[hexadecimal]]) or ''character entity references''. Character entity references are also sometimes referred to as ''named entities'', or ''HTML entities'' for HTML. HTML's usage of character references derives from [[SGML]].
===HTML character references===
<!--Linked from [[Template:Auxiliary template common notice]]-->
A ''[[numeric character reference]]'' in HTML refers to a character by its [[Universal Character Set]]/[[Unicode]] ''[[code point]]'', and uses the format
 
:<code>&#''nnnn'';</code>
Most characters with codes from 0 to 127, the original 7-bit [[ASCII]] standard set, can be used without a character reference. Codes from 160 to 255 can all be created using [[List of XML and HTML character entity references|character entity names]]. Only a few higher-numbered codes can be created using entity names, but all can be created by a decimal numeric character reference.
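
Since a numeric character reference is just the code point written in decimal or hexadecimal, it can be generated mechanically. The following sketch uses a hypothetical helper, <code>to_numeric_reference</code>, and Python's standard <code>html.unescape</code> to decode the result again.

```python
import html


def to_numeric_reference(char, hexadecimal=False):
    """Return the decimal (or hexadecimal) numeric character reference for char."""
    code_point = ord(char)
    return f"&#x{code_point:X};" if hexadecimal else f"&#{code_point};"


# Round trip for the Greek small letter lambda (U+03BB, decimal 955).
reference = to_numeric_reference("λ")
print(reference, html.unescape(reference))
```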
 
[[List of XML and HTML character entity references|Character entity references]] can also have the format <code>&amp;''name'';</code> where ''name'' is a case-sensitive alphanumeric string. For example, "λ" can also be encoded as <code>&amp;lambda;</code> in an HTML document. The character entity references <code>&amp;lt;</code>, <code>&amp;gt;</code>, <code>&amp;quot;</code> and <code>&amp;amp;</code> are predefined in HTML and SGML, because <code>&lt;</code>, <code>&gt;</code>, <code>"</code> and <code>&amp;</code> are already used to delimit markup. This notably did not include XML's <code>&amp;apos;</code> (') entity prior to [[HTML5]]. For a list of all named HTML character entity references along with the versions in which they were introduced, see [[List of XML and HTML character entity references]].
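
Case sensitivity of entity names can be demonstrated with Python's standard <code>html.entities</code> table, which lists the HTML 4 names. The helper <code>named_reference</code> below is a hypothetical name used only for this sketch.

```python
import html
from html.entities import name2codepoint


def named_reference(char):
    """Return a named reference for char if an HTML 4 entity name maps to it, else None."""
    for name, code_point in name2codepoint.items():
        if code_point == ord(char):
            return f"&{name};"
    return None


# Names differing only in case refer to different characters.
print(named_reference("λ"), named_reference("Λ"))
```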
 
Unnecessary use of HTML character references may significantly reduce HTML readability. If the character encoding for a web page is chosen appropriately, then HTML character references are usually only required for markup delimiting characters as mentioned above, and for a few special characters (or none at all if a native [[Unicode]] encoding like [[UTF-8]] is used). Incorrect HTML entity escaping may also open up security vulnerabilities for injection attacks such as [[cross-site scripting]]. If HTML attributes are left unquoted, certain characters, most importantly [[whitespace character|whitespace]], such as space and tab, must be escaped using entities. Other languages related to HTML have their own methods of escaping characters.
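
A minimal sketch of escaping untrusted text before placing it in element content is shown below, using Python's standard <code>html.escape</code>. The function name <code>render_paragraph</code> is hypothetical, and the sketch covers element content only; other contexts, such as unquoted attributes, scripts, or URLs, need their own escaping rules.

```python
import html


def render_paragraph(user_text):
    """Wrap untrusted text in a <p> element, escaping markup delimiters and quotes."""
    return "<p>" + html.escape(user_text) + "</p>"


print(render_paragraph('<script>alert("x")</script>'))
```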
 
===XML character references===
Unlike traditional HTML with its large range of character entity references, in [[XML]] there are only five predefined character entity references. These are used to escape characters that are markup sensitive in certain contexts:<ref>{{citation |chapter-url=http://www.w3.org/TR/REC-xml/#sec-references |chapter=Character and Entity References |title=XML |first1=T. |last1=Bray |author-link1=Tim Bray |first2=J. |last2=Paoli |first3=C. |last3=Sperberg-McQueen |author-link3=Michael Sperberg-McQueen |first4=E. |last4=Maler |first5=F. |last5=Yergeau |publisher=[[W3C]] |date=26 November 2008 |access-date=8 March 2010}}</ref>
 
{| class="wikitable"
| <code>&amp;amp;</code> ||align="center"| & || [[ampersand]] || U+0026
|-
| <code>&amp;lt;</code> ||align="center"| < || less-than sign || U+003C
|-
| <code>&amp;gt;</code> ||align="center"| > || greater-than sign || U+003E
|-
| <code>&amp;quot;</code> ||align="center"| " || quotation mark || U+0022
|-
| <code>&amp;apos;</code> ||align="center"| ' || apostrophe || U+0027
|}
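
The five predefined references can be applied with Python's standard <code>xml.sax.saxutils</code> module, as in the sketch below. The names <code>QUOTES</code>, <code>escape_xml</code> and <code>unescape_xml</code> are hypothetical; <code>escape</code> itself handles only the ampersand and the two angle brackets, so the quotation characters are passed in as an extra mapping.

```python
from xml.sax.saxutils import escape, unescape

# The two quote characters are not escaped by default.
QUOTES = {'"': "&quot;", "'": "&apos;"}


def escape_xml(text):
    """Replace all five markup-sensitive characters with predefined references."""
    return escape(text, QUOTES)


def unescape_xml(text):
    """Reverse escape_xml."""
    return unescape(text, {ref: char for char, ref in QUOTES.items()})
```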
 
All other character entity references have to be defined before they can be used. For example, use of <code>&amp;eacute;</code> (which gives é, Latin lower-case E with acute accent, U+00E9 in Unicode) in an XML document will generate an error unless the entity has already been defined. XML also requires that the <code>x</code> in hexadecimal numeric references be in lowercase: for example <code>&amp;#xA1b;</code> rather than <code>&amp;#XA1b;</code>. [[XHTML]], which is an XML application, supports the HTML entity set, along with XML's predefined entities.
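
These rules can be observed with any conforming XML parser. The sketch below uses Python's standard <code>xml.etree.ElementTree</code>; the helper name <code>parses_as_xml</code> and the sample fragments are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(fragment):
    """Return True if fragment is accepted as well-formed XML."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


# Entity declared in an internal DTD subset before use.
WITH_DTD = '<!DOCTYPE p [<!ENTITY eacute "&#xE9;">]><p>&eacute;</p>'

print(parses_as_xml("<p>&eacute;</p>"))  # undefined entity
print(parses_as_xml(WITH_DTD))           # entity defined first
print(parses_as_xml("<p>&#XA1b;</p>"))   # uppercase X in the reference
```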
 
== External links ==
* [https://devpal.co/html-entity-encode/ Online HTML entity encoder & decoder tool]
* [http://www.w3.org/TR/REC-html40/sgml/entities.html Character entity references in HTML4]
* [http://www.sitepoint.com/article/guide-web-character-encoding/ The Definitive Guide to Web Character Encoding]