Character encodings in HTML: Difference between revisions

Revision as of 15:39, 6 January 2024 by HarJIT (→Permitted encodings)
Latest revision as of 05:06, 16 November 2024 by Ejazz128 (minor edit; →External links: i have removed the broken link)
(7 intermediate revisions by 4 users not shown)
Line 69:
* [[GB 18030]]{{efn|Specified with 0xA3A0 as a duplicate encoding of the [[ideographic space]] (U+3000) for compatibility reasons, and as such excluding U+E5E5 (a private use character).<ref name="gbenc"/><ref name="gbindex"/> Also, specified with 0x80 accepted as an alternative encoding of the [[euro sign]] (U+20AC; see [[Windows-936]]).<ref>{{cite web |url=https://encoding.spec.whatwg.org/#gb18030-decoder |title=10.2.1. gb18030 decoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> Otherwise, follows the mappings from the 2005 standard.<ref name="gbindex">{{cite web |url=https://encoding.spec.whatwg.org/#index-gb18030 |title=5. Indexes (§ index gb18030) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[Big5]]{{efn|[[Hong Kong Supplementary Character Set]] variant,<ref name="encoding_rs"/> although most of the HKSCS extensions (those with lead bytes less than 0xA1) are not included by the encoder, only by the decoder.<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-big5-pointer |title=5. Indexes (§ index Big5 pointer) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[Shift JIS]]{{efn|The specification includes [[IBM]] and [[NEC]] extensions,<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-jis0208 |title=5. Indexes (§ Index jis0208) |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> and is more precisely [[Windows-31J]].<ref name="encoding_rs">{{cite web |url=https://docs.rs/encoding_rs/latest/encoding_rs/#notable-differences-from-iana-naming |title=Notable Differences from IANA Naming |work=Crate encoding_rs |publisher=docs.rs |author=Mozilla Foundation |author-link=Mozilla Foundation}}</ref>}}
* [[ISO-2022-JP]]{{efn|The specification uses the same index as used for Shift JIS (insofar as is within reach), i.e. includes NEC extensions. [[Half-width kana]] is converted to fullwidth by the encoder,<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-iso-2022-jp-katakana |title=5. Indexes (§ Index ISO-2022-JP katakana) |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> but accepted using an escape sequence (ESC 0x28 0x49) by the decoder.<ref name="whatwgjisdecoder">{{cite web |url=https://encoding.spec.whatwg.org/#iso-2022-jp-decoder |title=12.2.1. ISO-2022-JP decoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> [[Shift Out]] and [[Shift In]] (0x0E and 0x0F) are excluded entirely to prevent attacks.<ref name="whatwgjisdecoder" /><ref>{{cite web |url=https://encoding.spec.whatwg.org/#iso-2022-jp-encoder |title=12.2.2. ISO-2022-JP encoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[EUC-KR]]{{efn|Actually [[Unified Hangul Code]] (Windows-949), which is a superset which covers the entire [[Hangul Syllables (Unicode block)|Hangul Syllables]] block.<ref name="encoding_rs"/><ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-euc-kr |title=5. Indexes (§ index EUC-KR) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}

Line 91:
* [[ISO-8859-16]]
* [[KOI8-R]]
* [[KOI8-U]] / [[KOI8-RU]]{{efn|Titled KOI8-U and specified for both <code>KOI8-U</code> and <code>KOI8-RU</code> labels;<ref name="namesandlabels"/> follows [[KOI8-RU]] in positions 0xAE and 0xBE (i.e. includes [[Ў|Ў/ў]]),<ref name="whatwg-koi8u">{{cite web |url=https://encoding.spec.whatwg.org/koi8-u.html |title=index KOI8-U visualization |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref><ref>{{cite web |url=https://www.w3.org/Bugs/Public/show_bug.cgi?id=17053 |title=Bug 17053: Support KOI8-RU mapping for KOI8-U |date=2015-08-19 |work=[[W3C]] Bugzilla}}</ref> but KOI8-U in positions 0x93–9F.<ref name="whatwg-koi8u"/>}}
* [[Mac OS Roman]]
* [[Windows-1253]]

Line 118:
==Character references==
{{Main|List of XML and HTML character entity references|Numeric character reference}}
In addition to native character encodings, characters can also be encoded as ''character references'', which can be ''numeric character references'' ([[decimal]] or [[hexadecimal]]) or ''character entity references''. Character entity references are also sometimes referred to as ''named entities'', or ''HTML entities'' for HTML. HTML's usage of character references derives from [[SGML]].

Line 124:
===HTML character references===
<!--Linked from [[Template:Auxiliary template common notice]]-->
A ''[[numeric character reference]]'' in HTML refers to a character by its [[Universal Character Set]]/[[Unicode]] ''[[code point]]'', and uses the format
:<code>&#''nnnn'';</code>

Line 136:
For codes from 0 to 127, the original 7-bit [[ASCII]] standard set, most of these characters can be used without a character reference.
Codes from 160 to 255 can all be created using [[List of XML and HTML character entity references|character entity names]]. Only a few higher-numbered codes can be created using entity names, but all can be created by decimal number character reference.

[[List of XML and HTML character entity references|Character entity references]] can also have the format <code>&''name'';</code> where ''name'' is a case-sensitive alphanumeric string. For example, "λ" can also be encoded as <code>&lambda;</code> in an HTML document. The character entity references <code>&lt;</code>, <code>&gt;</code>, <code>&quot;</code> and <code>&amp;</code> are predefined in HTML and SGML, because <code><</code>, <code>></code>, <code>"</code> and <code>&</code> are already used to delimit markup. This notably did not include XML's <code>&apos;</code> (') entity prior to [[HTML5]]. For a list of all named HTML character entity references along with the versions in which they were introduced, see [[List of XML and HTML character entity references]].

Unnecessary use of HTML character references may significantly reduce HTML readability. If the character encoding for a web page is chosen appropriately, then HTML character references are usually only required for markup delimiting characters as mentioned above, and for a few special characters (or none at all if a native [[Unicode]] encoding like [[UTF-8]] is used). Incorrect HTML entity escaping may also open up security vulnerabilities for injection attacks such as [[cross-site scripting]].

If HTML attributes are left unquoted, certain characters, most importantly [[whitespace character|whitespace]], such as space and tab, must be escaped using entities.

Other languages related to HTML have their own methods of escaping characters.
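
As an illustration of the numeric and named reference forms described in this section, the following is a minimal sketch in Python using the standard library's <code>html</code> module. The helper name <code>numeric_reference</code> and the sample markup string are assumptions made for this example and do not come from the article or from any specification.

```python
import html


def numeric_reference(ch: str, hexadecimal: bool = False) -> str:
    """Build a numeric character reference for one character (hypothetical helper)."""
    code_point = ord(ch)  # the character's Unicode code point
    if hexadecimal:
        return f"&#x{code_point:X};"  # hexadecimal form, e.g. &#x3BB;
    return f"&#{code_point};"  # decimal form, e.g. &#955;


# Decimal, hexadecimal and named references all decode to the same character.
decimal_ref = numeric_reference("λ")
hex_ref = numeric_reference("λ", hexadecimal=True)
decoded = [html.unescape(ref) for ref in (decimal_ref, hex_ref, "&lambda;")]

# html.escape replaces the markup-delimiting characters &, < and >,
# and (by default) the quote characters, with character references.
escaped = html.escape('<a href="x">Tom & Jerry</a>')

print(decimal_ref, hex_ref, decoded)
print(escaped)
```

Escaping only the markup-delimiting characters, as <code>html.escape</code> does, matches the point above that a page served in a Unicode encoding such as UTF-8 rarely needs references for anything else.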
Line 167:
== External links ==
* [https://owuk.com/html-encode.html Online HTML entity encoder & decoder tool]
* [http://www.w3.org/TR/REC-html40/sgml/entities.html Character entity references in HTML4]
* [http://www.sitepoint.com/article/guide-web-character-encoding/ The Definitive Guide to Web Character Encoding]