Character encodings in HTML: Difference between revisions

Ejazz128 (talk | contribs)
m External links: i have removed the broken link
 
(44 intermediate revisions by 28 users not shown)
Line 1:
{{Short description|Use of encoding systems for international characters in HTML}}
{{For|a list of character entity references|List of XML and HTML character entity references}}
{{Hatnote|For fixing links within Wikipedia, see [[Help:Percent-encoding#Fixing links with unsupported characters|Help:Percent-encoding § Fixing links with unsupported characters]].}}
{{Use dmy dates|date=December 2021}}
{{Html series}}
While Hypertext Markup Language ([[HTML]]) has been in use since 1991, HTML 4.0 (from December 1997) was the first standardized version where international [[character (computing)|character]]s were given reasonably complete treatment. When an HTML document includes special characters outside the range of seven-bit [[ASCII]], two goals are worth considering: the information's [[integrity]], and universal [[Web browser|browser]] display.
 
==Specifying the document's character encoding==
There are two general ways to specify which character encoding is used in the document.

First, the [[web server]] can include the character encoding or "<code>charset</code>" in the [[Hypertext Transfer Protocol]] (HTTP) <code>Content-Type</code> header, which would typically look like this:<ref>{{citation |chapter-url=http://tools.ietf.org/html/rfc7231#section-3.1.1.5|chapter=Content-Type |title=Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content|publisher=[[IETF]] |date=June 2014 |doi=10.17487/RFC7231 |access-date=2014-07-30|editor-last1=Fielding |editor-last2=Reschke |editor-first1=R |editor-first2=J |last1=Fielding |first1=R. |last2=Reschke |first2=J. |s2cid=14399078 }}</ref>
<syntaxhighlight lang="http">
Content-Type: text/html; charset=utf-8
</syntaxhighlight>
This method gives the HTTP server a convenient way to alter a document's encoding according to [[content negotiation]]; certain HTTP server software can do it, for example Apache with the [[List of Apache modules|module]] <code>mod_charset_lite</code>.<ref>{{cite web| url = http://httpd.apache.org/docs/2.0/en/mod/mod_charset_lite.html| title = Apache Module mod_charset_lite}}</ref>
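
As an illustration of how a client might read the <code>charset</code> parameter from such a header, the following is a minimal Python sketch. The helper name <code>charset_from_content_type</code> is hypothetical; it relies on the standard library's <code>email.message.Message</code> class to parse the header value, and it is not how any particular browser does it.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None.

    Hypothetical helper for illustration only.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases the value and returns None
    # when the header carries no charset parameter.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=utf-8"))
```

A header without a <code>charset</code> parameter yields <code>None</code>, in which case a client has to fall back on other sources of information.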
 
Second, a declaration can be included within the document itself.
 
For HTML it is possible to include this information inside the <code>head</code> element near the top of the document:<ref name=html5charset/>
<!-- Please don't add a closing "/": that is incorrect here. -->
<syntaxhighlight lang="html">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</syntaxhighlight>
 
[[HTML5]] also allows the following syntax to mean exactly the same:<ref name=html5charset>{{citation |chapter-url=http://www.w3.org/TR/html5/document-metadata.html#specifying-the-documents-character-encoding |chapter=Specifying the document's character encoding |title=HTML5 |publisher=[[World Wide Web Consortium]] |date=14 December 2017 |access-date=2018-05-28}}</ref>
<!-- Please don't add a closing "/": that is unnecessary here. -->
<syntaxhighlight lang="html">
<meta charset="utf-8">
</syntaxhighlight>
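
Both forms of the <code>meta</code> declaration can be read with an ordinary HTML parser. The sketch below uses Python's standard <code>html.parser</code> module; the class <code>MetaCharsetFinder</code> and the function <code>declared_charset</code> are hypothetical names, and the handling of the <code>content</code> attribute is deliberately simplified compared with what the HTML specification requires.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Record the first charset declared by a <meta> element (illustrative)."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr = dict(attrs)  # attribute names arrive lower-cased
        if "charset" in attr:
            # HTML5 short form: <meta charset="...">
            self.charset = attr["charset"]
        elif (attr.get("http-equiv") or "").lower() == "content-type":
            # Older form: <meta http-equiv="Content-Type" content="...">
            content = (attr.get("content") or "").lower()
            _, found, value = content.partition("charset=")
            if found:
                self.charset = value.split(";")[0].strip()


def declared_charset(html_text):
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    return finder.charset
```

The function returns <code>None</code> when no declaration is present, which leaves the decision to the other mechanisms described in this section.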
 
[[XHTML]] documents have a third option: to express the character encoding via [[XML]] declaration, as follows:<ref>{{citation |chapter-url=http://www.w3.org/TR/REC-xml/#sec-prolog-dtd |chapter=Prolog and Document Type Declaration |title=XML |first1=T. |last1=Bray |author-link1=Tim Bray |first2=J. |last2=Paoli |first3=C. |last3=Sperberg-McQueen |author-link3=Michael Sperberg-McQueen |first4=E. |last4=Maler |first5=F. |last5=Yergeau |publisher=[[W3C]] |date=26 November 2008 |access-date=8 March 2010}}</ref>
<syntaxhighlight lang="xml">
<?xml version="1.0" encoding="utf-8"?>
</syntaxhighlight>
 
With this second approach, because the character encoding cannot be known until the declaration is parsed, there is a problem knowing which character encoding is used in the document up to and including the declaration itself. If the character encoding is an [[ASCII extension]], then the content up to and including the declaration itself should be pure ASCII, and this will work correctly. For character encodings that are not ASCII extensions (i.e. not a superset of ASCII), such as [[UTF-16BE]] and [[UTF-16LE]], a processor of HTML, such as a web browser, should be able to parse the declaration in some cases through the use of heuristics.
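
The difference between ASCII extensions and other encodings can be seen by encoding the declaration itself. The following Python sketch is illustrative only; the selection of encodings is an assumption made for the example.

```python
declaration = '<meta charset="utf-8">'
ascii_bytes = declaration.encode("ascii")

# In ASCII extensions the declaration is byte-for-byte identical to ASCII,
# so it can be found before the real encoding is known.
ascii_extensions = ("utf-8", "iso-8859-1", "windows-1252")
same_as_ascii = all(declaration.encode(e) == ascii_bytes for e in ascii_extensions)

# UTF-16LE is not an ASCII extension: every ASCII character is followed by
# a zero byte, so a byte-wise search for the ASCII declaration fails.
utf16le_bytes = declaration.encode("utf-16-le")
found_in_utf16 = ascii_bytes in utf16le_bytes

print(same_as_ascii, found_in_utf16, utf16le_bytes[:4])
```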
 
===Encoding detection algorithm===
As of HTML5 the recommended charset is [[UTF-8]].<ref name=html5charset/> An "encoding sniffing algorithm" is defined in the specification to determine the character encoding of the document based on multiple sources of input, including:
# Explicit user instruction
# An explicit meta tag within the first 1024 bytes of the document
# A [[byte order mark]] (BOM) within the first three bytes of the document
# The HTTP Content-Type or other transport layer information
# Analysis of the document bytes looking for specific sequences or ranges of byte values,<ref>{{cite web| url = http://www.w3.org/TR/html5/syntax.html#prescan-a-byte-stream-to-determine-its-encoding| title = HTML5 prescan a byte stream to determine its encoding}}</ref> and other tentative detection mechanisms.
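
One of the inputs listed above, the byte order mark, is simple enough to sketch. The Python function below covers only that step, under the assumption that the first bytes of the document are available as a <code>bytes</code> object; <code>sniff_bom</code> is a hypothetical name and this is not the full algorithm from the specification.

```python
import codecs


def sniff_bom(data):
    """Return an encoding label if data starts with a known BOM, else None."""
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16be"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16le"
    return None


print(sniff_bom(codecs.BOM_UTF8 + b"<!DOCTYPE html>"))
```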
 
For ASCII-compatible character encodings, the consequence of choosing incorrectly is that characters outside the printable ASCII range (32 to 126) usually appear incorrectly. This presents few problems for [[English language|English]]-speaking users, but other languages regularly (in some cases, always) require characters outside that range. In Chinese, Japanese, and Korean ([[CJK characters|CJK]]) language environments where there are several different multi-byte encodings in use, auto-detection is also often employed. Finally, browsers usually permit the user to override an ''incorrect'' charset label manually as well.
 
It is increasingly common for multilingual websites and websites in non-Western languages to use [[UTF-8]], which allows use of the same encoding for all languages. [[UTF-16]] or [[UTF-32]], which can be used for all languages as well, are less widely used because they can be harder to handle in programming languages that assume a [[byte-oriented]] ASCII superset encoding, and they are less efficient for text with a high frequency of ASCII characters, which is usually the case for HTML documents.
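
The efficiency difference for ASCII-heavy markup can be checked by encoding a small fragment in each form. The Python sketch below uses a made-up sample string (Greek text inside ASCII markup, written with escapes to avoid ambiguity); the exact ratios depend on the mix of characters.

```python
# 17 ASCII characters of markup around 5 Greek letters.
sample = '<p lang="el">\u03bb\u03cc\u03b3\u03bf\u03c2</p>'

# The "-le" variants are used so that no BOM is added to the byte count.
sizes = {enc: len(sample.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

for enc, size in sizes.items():
    print(enc, size)
```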
Line 59 ⟶ 67:
* [[Windows-1257]]
* [[Windows-1258]]
* [[GB 18030]]{{efn|Specified with 0xA3A0 as a duplicate encoding of the [[ideographic space]] (U+3000) for compatibility reasons, and as such excluding U+E5E5 (a private use character).<ref name="gbenc"/><ref name="gbindex"/> Also, specified with 0x80 accepted as an alternative encoding of the [[euro sign]] (U+20AC; see [[Windows-936]]).<ref>{{cite web |url=https://encoding.spec.whatwg.org/#gb18030-decoder |title=10.2.1. gb18030 decoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> Otherwise, follows the mappings from the 2005 standard.<ref name="gbindex">{{cite web |url=https://encoding.spec.whatwg.org/#index-gb18030 |title=5. Indexes (§ index gb18030) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[Big5]]{{efn|[[Hong Kong Supplementary Character Set]] variant,<ref name="encoding_rs"/> although most of the HKSCS extensions (those with lead bytes less than 0xA1) are not included by the encoder, only by the decoder.<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-big5-pointer |title=5. Indexes (§ index Big5 pointer) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[Shift JIS]]{{efn|The specification includes [[IBM]] and [[NEC]] extensions,<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-jis0208 |title=5. Indexes (§ Index jis0208) |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> and is more precisely [[Windows-31J]].<ref name="encoding_rs">{{cite web |url=https://docs.rs/encoding_rs/latest/encoding_rs/#notable-differences-from-iana-naming |title=Notable Differences from IANA Naming |work=Crate encoding_rs |publisher=docs.rs |author=Mozilla Foundation |author-link=Mozilla Foundation}}</ref>}}
* [[ISO-2022-JP]]{{efn|The specification uses the same index as used for Shift JIS (insofar as is within reach), i.e. includes NEC extensions. [[Half-width kana]] is converted to fullwidth by the encoder,<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-iso-2022-jp-katakana |title=5. Indexes (§ Index ISO-2022-JP katakana) |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> but accepted using an escape sequence (ESC 0x28 0x49) by the decoder.<ref name="whatwgjisdecoder">{{cite web |url=https://encoding.spec.whatwg.org/#iso-2022-jp-decoder |title=12.2.1. ISO-2022-JP decoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> [[Shift Out]] and [[Shift In]] (0x0E and 0x0F) are excluded entirely to prevent attacks.<ref name="whatwgjisdecoder" /><ref>{{cite web |url=https://encoding.spec.whatwg.org/#iso-2022-jp-encoder |title=12.2.2. ISO-2022-JP encoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[EUC-KR]]{{efn|Actually [[Unified Hangul Code]] (Windows-949), which is a superset which covers the entire [[Hangul Syllables (Unicode block)|Hangul Syllables]] block.<ref name="encoding_rs"/><ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-euc-kr |title=5. Indexes (§ index EUC-KR) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[UTF-16BE]]{{efn|Specified for decoding only; form submissions from UTF-16-coded documents are to be encoded in [[UTF-8]].<ref name="outputenc">{{cite web |url=https://encoding.spec.whatwg.org/#output-encodings |title=4.3. Output encodings |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[UTF-16LE]]{{efn|For compatibility with deployed content, also specified for the plain <code>[[UTF-16]]</code> label,<ref>{{cite web |url=https://encoding.spec.whatwg.org/#utf-16le |title=14.4. UTF-16LE |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> although a [[byte order mark]] (BOM), if present, takes priority over any label.<ref>{{cite web |url=https://encoding.spec.whatwg.org/#decode |title=6. Hooks for standards (§ decode) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> Specified for decoding only; form submissions from UTF-16-coded documents are to be encoded in [[UTF-8]].<ref name="outputenc" />}}
* x-user-defined{{efn|Maps 0x00 through 0x7F to U+0000 through U+007F, and 0x80 through 0xFF to U+F780 through U+F7FF (a [[Private Use Area]] range), such that the low 8 bits of the code point always match the original byte.<ref>{{cite web |url=https://encoding.spec.whatwg.org/#x-user-defined |title=14.5. x-user-defined |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
}}{{notelist}}
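
The <code>x-user-defined</code> mapping described in the note above is small enough to write out. The following Python sketch follows that description directly; the function name <code>decode_x_user_defined</code> is hypothetical.

```python
def decode_x_user_defined(data):
    """Decode bytes per the mapping in the note: ASCII bytes map to
    themselves, and 0x80-0xFF map to U+F780-U+F7FF (Private Use Area)."""
    return "".join(
        chr(b) if b < 0x80 else chr(0xF780 + (b - 0x80))
        for b in data
    )


decoded = decode_x_user_defined(b"A\x80\xff")
# The low 8 bits of every code point equal the original byte.
print([hex(ord(ch)) for ch in decoded])
```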
Line 83 ⟶ 91:
* [[ISO-8859-16]]
* [[KOI8-R]]
* [[KOI8-U]] / [[KOI8-RU]]{{efn|Titled KOI8-U and specified for both <code>KOI8-U</code> and <code>KOI8-RU</code> labels;<ref name="namesandlabels"/> follows [[KOI8-RU]] in positions 0xAE and 0xBE (i.e. includes [[Ў|Ў/ў]])<ref name="whatwg-koi8u">{{cite web |url=https://encoding.spec.whatwg.org/koi8-u.html |title=index KOI8-U visualization |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref><ref>{{cite web |url=https://www.w3.org/Bugs/Public/show_bug.cgi?id=17053 |title=Bug 17053: Support KOI8-RU mapping for KOI8-U |date=2015-08-19 |work=[[W3C]] Bugzilla}}</ref> but KOI8-U in positions 0x93–9F.<ref name="whatwg-koi8u"/>}}
* [[Mac OS Roman]]
* [[Windows-1253]]
* [[Mac OS Cyrillic encoding|Mac OS Cyrillic]]
* [[GBK (character encoding)|GBK]]{{efn|Also specified for <code>[[GB 2312|GB2312]]</code> and related labels. Handled the same as {{nowrap|GB 18030}} for decoding purposes.<ref>{{cite web |url=https://encoding.spec.whatwg.org/#gbk |title=10.1. GBK |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> For encoding purposes, labelling as GBK (or {{nowrap|GB 2312}}) excludes four-byte codes, and favours the one-byte 0x80 representation for U+20AC.<ref name="gbenc">{{cite web |url=https://encoding.spec.whatwg.org/#gb18030-encoder |title=10.2.2. gb18030 encoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[EUC-JP]]{{efn|The specification uses the same index as used for Shift JIS (insofar as is within reach of the EUC code set 1), i.e. includes NEC extensions. [[JIS X 0212]] is included for decoding only.<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-jis0212 |title=5. Indexes (§ Index jis0212) |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
}}{{notelist}}
Line 95 ⟶ 103:
* [[CESU-8]]
* [[UTF-7]]
* [[Binary Ordered Compression for Unicode|BOCU-1]]
* [[Standard Compression Scheme for Unicode|SCSU]]
* [[EBCDIC]]
* [[UTF-32]]
Line 110 ⟶ 118:
 
==Character references==
{{Main|List of XML and HTML character entity references|Numeric character reference}}
 
In addition to native character encodings, characters can also be encoded as ''character references'', which can be ''numeric character references'' ([[decimal]] or [[hexadecimal]]) or ''character entity references''. Character entity references are also sometimes referred to as ''named entities'', or ''HTML entities'' for HTML. HTML's usage of character references derives from [[SGML]].
Line 116 ⟶ 124:
===HTML character references===
<!--Linked from [[Template:Auxiliary template common notice]]-->
A ''[[numeric character reference]]'' in HTML refers to a character by its [[Universal Character Set]]/[[Unicode]] ''[[code point]]'', and uses the format
 
:<code>&#''nnnn'';</code>
Line 128 ⟶ 136:
For codes from 0 to 127, the original 7-bit [[ASCII]] standard set, most of these characters can be used without a character reference. Codes from 160 to 255 can all be created using [[List of XML and HTML character entity references|character entity names]]. Only a few higher-numbered codes can be created using entity names, but all can be created by decimal number character reference.
 
[[List of XML and HTML character entity references|Character entity references]] can also have the format <code>&amp;''name'';</code> where ''name'' is a case-sensitive alphanumeric string. For example, "λ" can also be encoded as <code>&amp;lambda;</code> in an HTML document. The character entity references <code>&amp;lt;</code>, <code>&amp;gt;</code>, <code>&amp;quot;</code> and <code>&amp;amp;</code> are predefined in HTML and SGML, because <code>&lt;</code>, <code>&gt;</code>, <code>"</code> and <code>&amp;</code> are already used to delimit markup. This notably did not include XML's <code>&amp;apos;</code> (') entity prior to [[HTML5]]. For a list of all named HTML character entity references along with the versions in which they were introduced, see [[List of XML and HTML character entity references]].
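
The equivalence of the decimal, hexadecimal and named forms can be demonstrated with Python's standard <code>html</code> module, as in this short sketch. The list <code>forms</code> is made up for the example; λ is U+03BB, which is 955 in decimal.

```python
import html

# Three ways of writing the same character in an HTML document.
forms = ["&#955;", "&#x3BB;", "&lambda;"]
decoded = [html.unescape(form) for form in forms]

# The markup-delimiting characters have predefined references as well.
delimiters = html.unescape("&lt; &gt; &quot; &amp;")

print(decoded, delimiters)
```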
 
Unnecessary use of HTML character references may significantly reduce HTML readability. If the character encoding for a web page is chosen appropriately, then HTML character references are usually only required for markup delimiting characters as mentioned above, and for a few special characters (or none at all if a native [[Unicode]] encoding like [[UTF-8]] is used). Incorrect HTML entity escaping may also open up security vulnerabilities for injection attacks such as [[cross-site scripting]]. If HTML attributes are left unquoted, certain characters, most importantly [[whitespace character|whitespace]], such as space and tab, must be escaped using entities. Other languages related to HTML have their own methods of escaping characters.
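
To show why escaping matters for untrusted input, the following Python sketch escapes a value before placing it in a quoted attribute. The function <code>render_attribute</code> and the <code>input</code> element are hypothetical choices for the example; <code>html.escape</code> is the standard library function, and this is a sketch rather than a complete defence against cross-site scripting.

```python
import html


def render_attribute(value):
    """Place an untrusted value inside a double-quoted HTML attribute."""
    # quote=True also escapes " and ', so the value cannot close the attribute.
    return '<input value="{}">'.format(html.escape(value, quote=True))


untrusted = '" onfocus="alert(1)'
print(render_attribute(untrusted))
```

Without the escaping step, the quotation mark in the input would end the attribute and the rest of the string would be parsed as additional markup.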
 
===XML character references===
Unlike traditional HTML with its large range of character entity references, in [[XML]] there are only five predefined character entity references. These are used to escape characters that are markup sensitive in certain contexts:<ref>{{citation |chapter-url=http://www.w3.org/TR/REC-xml/#sec-references |chapter=Character and Entity References |title=XML |first1=T. |last1=Bray |author-link1=Tim Bray |first2=J. |last2=Paoli |first3=C. |last3=Sperberg-McQueen |author-link3=Michael Sperberg-McQueen |first4=E. |last4=Maler |first5=F. |last5=Yergeau |publisher=[[W3C]] |date=26 November 2008 |access-date=8 March 2010}}</ref>
 
{| class="wikitable"
| <code>&amp;amp;</code> ||align="center"| & || [[ampersand]] || U+0026
|-
| <code>&amp;lt;</code> ||align="center"| < || less-than sign || U+003C
|-
| <code>&amp;gt;</code> ||align="center"| > || greater-than sign || U+003E
|-
| <code>&amp;quot;</code> ||align="center"| " || quotation mark || U+0022
|-
| <code>&amp;apos;</code> ||align="center"| ' || apostrophe || U+0027
|}
 
All other character entity references have to be defined before they can be used. For example, use of <code>&amp;eacute;</code> (which gives é, Latin lower-case E with acute accent, U+00E9 in Unicode) in an XML document will generate an error unless the entity has already been defined. XML also requires that the <code>x</code> in hexadecimal numeric references be in lowercase: for example <code>&amp;#xA1b</code> rather than <code>&amp;#XA1b</code>. [[XHTML]], which is an XML application, supports the HTML entity set, along with XML's predefined entities.
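
These rules can be observed with a conforming XML parser. The Python sketch below uses the standard <code>xml.etree.ElementTree</code> module; the helper <code>parses</code> and the sample documents are made up for the example.

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """Return True if xml_text is accepted by the parser, else False."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


samples = [
    "<p>&amp;&lt;&gt;&quot;&apos;</p>",  # the five predefined entities
    "<p>&eacute;</p>",                   # HTML entity, not defined in plain XML
    "<p>&#xE9;</p>",                     # hexadecimal reference, lowercase x
    "<p>&#XE9;</p>",                     # uppercase X
]
for sample in samples:
    print(sample, parses(sample))
```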
Line 155 ⟶ 169:
* [http://www.w3.org/TR/REC-html40/sgml/entities.html Character entity references in HTML4]
* [http://www.sitepoint.com/article/guide-web-character-encoding/ The Definitive Guide to Web Character Encoding]
* [http://code.google.com/p/browsersec/wiki/Part1#HTML_entity_encoding HTML Entity Encoding chapter of Browser Security Handbook - more information about current browsers and their entity handling]
* [http://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet The Open Web Application Security Project's wiki article on cross-site scripting (XSS)]
 
{{DEFAULTSORT:Character Encodings in Html}}
[[Category:HTML]]
[[Category:World Wide Web Consortium standards]]