Character encodings in HTML: Difference between revisions

Content deleted Content added
Ejazz128 (talk | contribs)
m External links: i have removed the broken link
 
(7 intermediate revisions by 4 users not shown)
Line 69:
* [[GB 18030]]{{efn|Specified with 0xA3A0 as a duplicate encoding of the [[ideographic space]] (U+3000) for compatibility reasons, and as such excluding U+E5E5 (a private use character).<ref name="gbenc"/><ref name="gbindex"/> Also, specified with 0x80 accepted as an alternative encoding of the [[euro sign]] (U+20AC; see [[Windows-936]]).<ref>{{cite web |url=https://encoding.spec.whatwg.org/#gb18030-decoder |title=10.2.1. gb18030 decoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> Otherwise, follows the mappings from the 2005 standard.<ref name="gbindex">{{cite web |url=https://encoding.spec.whatwg.org/#index-gb18030 |title=5. Indexes (§ index gb18030) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[Big5]]{{efn|[[Hong Kong Supplementary Character Set]] variant,<ref name="encoding_rs"/> although most of the HKSCS extensions (those with lead bytes less than 0xA1) are not included by the encoder, only by the decoder.<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-big5-pointer |title=5. Indexes (§ index Big5 pointer) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[Shift JIS]]{{efn|The specification includes [[IBM]] and [[NEC]] extensions,<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-jis0208 |title=5. Indexes (§ Index jis0208) |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> and is more precisely [[Windows-31J]].<ref name="encoding_rs">{{cite web |url=https://docs.rs/encoding_rs/latest/encoding_rs/#notable-differences-from-iana-naming |title=Notable Differences from IANA Naming |work=Crate encoding_rs |publisher=docs.rs |author=Mozilla Foundation |author-link=Mozilla Foundation}}</ref>}}
* [[ISO-2022-JP]]{{efn|The specification uses the same index as used for Shift JIS (insofar as is within reach), i.e. includes NEC extensions. [[Half-width kana]] is converted to fullwidth by the encoder,<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-iso-2022-jp-katakana |title=5. Indexes (§ Index ISO-2022-JP katakana) |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> but accepted using an escape sequence (ESC 0x28 0x49) by the decoder.<ref name="whatwgjisdecoder">{{cite web |url=https://encoding.spec.whatwg.org/#iso-2022-jp-decoder |title=12.2.1. ISO-2022-JP decoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> [[Shift Out]] and [[Shift In]] (0x0E and 0x0F) are excluded entirely to prevent attacks.<ref name="whatwgjisdecoder" /><ref>{{cite web |url=https://encoding.spec.whatwg.org/#iso-2022-jp-encoder |title=12.2.2. ISO-2022-JP encoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[EUC-KR]]{{efn|Actually [[Unified Hangul Code]] (Windows-949), which is a superset which covers the entire [[Hangul Syllables (Unicode block)|Hangul Syllables]] block.<ref name="encoding_rs"/><ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-euc-kr |title=5. Indexes (§ index EUC-KR) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
Line 91:
* [[ISO-8859-16]]
* [[KOI8-R]]
* [[KOI8-U]] / [[KOI8-RU]]{{efn|Titled KOI8-U and specified for both <code>KOI8-U</code> and <code>KOI8-RU</code> labels,;<ref name="namesandlabels"/> but follows [[KOI8-RU]] in positions 0xAE and 0xBE (i.e. includes [[Ў|Ў/ў]]).<ref name="whatwg-koi8u">{{cite web |url=https://encoding.spec.whatwg.org/koi8-u.html |title=index KOI8-U visualization |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref><ref>{{cite web |url=https://www.w3.org/Bugs/Public/show_bug.cgi?id=17053 |title=Bug 17053: Support KOI8-RU mapping for KOI8-U |date=2015-08-19 |work=[[W3C]] Bugzilla}}</ref> but KOI8-U in positions 0x93–9F.<ref name="whatwg-koi8u"/>}}
* [[Mac OS Roman]]
* [[Windows-1253]]
Line 118:
 
==Character references==
{{Main|CharacterList of XML and HTML character entity referencereferences|Numeric character reference}}
 
In addition to native character encodings, characters can also be encoded as ''character references'', which can be ''numeric character references'' ([[decimal]] or [[hexadecimal]]) or ''character entity references''. Character entity references are also sometimes referred to as ''named entities'', or ''HTML entities'' for HTML. HTML's usage of character references derives from [[SGML]].
Line 124:
===HTML character references===
<!--Linked from [[Template:Auxiliary template common notice]]-->
A ''[[numeric character reference]]'' in HTML refers to a character by its [[Universal Character Set]]/[[Unicode]] ''[[code point]]'', and uses the format
 
:<code>&#''nnnn'';</code>
Line 136:
For codes from 0 to 127, the original 7-bit [[ASCII]] standard set, most of these characters can be used without a character reference. Codes from 160 to 255 can all be created using [[List of XML and HTML character entity references|character entity names]]. Only a few higher-numbered codes can be created using entity names, but all can be created by decimal number character reference.
 
[[List of XML and HTML character entity references|Character entity references]] can also have the format <code>&amp;''name'';</code> where ''name'' is a case-sensitive alphanumeric string. For example, "λ" can also be encoded as <code>&amp;lambda;</code> in an HTML document. The character entity references <code>&amp;lt;</code>, <code>&amp;gt;</code>, <code>&amp;quot;</code> and <code>&amp;amp;</code> are predefined in HTML and SGML, because <code>&lt;</code>, <code>&gt;</code>, <code>"</code> and <code>&amp;</code> are already used to delimit markup. This notably did not include XML's <code>&amp;apos;</code> (') entity prior to [[HTML5]]. For a list of all named HTML character entity references along with the versions in which they were introduced, see [[List of XML and HTML character entity references]].
 
Unnecessary use of HTML character references may significantly reduce HTML readability. If the character encoding for a web page is chosen appropriately, then HTML character references are usually only required for markup delimiting characters as mentioned above, and for a few special characters (or none at all if a native [[Unicode]] encoding like [[UTF-8]] is used). Incorrect HTML entity escaping may also open up security vulnerabilities for injection attacks such as [[cross-site scripting]]. If HTML attributes are left unquoted, certain characters, most importantly [[whitespace character|whitespace]], such as space and tab, must be escaped using entities. Other languages related to HTML have their own methods of escaping characters.
Line 167:
 
== External links ==
* [https://owuk.com/html-encode.html Online HTML entity encoder & decoder tool]
* [http://www.w3.org/TR/REC-html40/sgml/entities.html Character entity references in HTML4]
* [http://www.sitepoint.com/article/guide-web-character-encoding/ The Definitive Guide to Web Character Encoding]