Numeric character reference: Difference between revisions

Revision as of 20:27, 9 December 2015 by Compynerd255: Added note in summary about why NCRs would be used - escaping or encoding
Latest revision as of 08:59, 5 February 2025 by Beland (minor edit): convert special characters found by Wikipedia:Typo Team/moss (via WP:JWB)
(59 intermediate revisions by 43 users not shown)
Line 1:
{{Short description|Common markup construct used in SGML, XML, and HTML}}
~~{{Unreferenced|date=December 2009}}~~
{{one source|date=February 2021}}
A '''numeric character reference''' ('''NCR''') is a common [[markup (computer programming)|markup]] construct used in [[SGML]] and SGML-derived markup languages such as [[HTML]] and [[XML]]. It consists of a short sequence of [[character (computing)|character]]s that, in turn, represents a single character. Since [[SGML|WebSgml]], [[XML]] and [[HTML 4]], the code points of the [[Universal Character Set]] (UCS) of [[Unicode]] are used. NCRs are typically used in order to represent characters that are not [[plain text#Encoding|directly encodable]] in a particular document (for example, because they are international characters that ~~don't~~do not fit in the 8-bit ~~character set~~[[Character encoding|character set]] being used, or because they have special syntactic meaning in the language). When the document is interpreted by a markup-aware reader, each NCR is treated as if it were the character it represents.
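The two uses named above (representing characters outside the document's character set, and escaping characters with syntactic meaning) can be illustrated with a short sketch. It assumes Python and its standard `html` module; the helper name `to_ncr` is hypothetical and is not part of any library.

```python
import html


def to_ncr(text: str, hexadecimal: bool = False) -> str:
    """Replace every non-ASCII character with a numeric character reference.

    Hypothetical helper for illustration: ASCII characters pass through
    unchanged, everything else becomes &#N; (decimal) or &#xH; (hexadecimal).
    """
    parts = []
    for ch in text:
        code_point = ord(ch)
        if code_point < 128:
            parts.append(ch)
        elif hexadecimal:
            parts.append(f"&#x{code_point:X};")
        else:
            parts.append(f"&#{code_point};")
    return "".join(parts)


# Encoding: characters that do not fit in ASCII become references.
decimal_form = to_ncr("Σ")                    # "&#931;"
hex_form = to_ncr("Σ", hexadecimal=True)      # "&#x3A3;"

# Decoding: a markup-aware reader treats each reference as the character.
decoded = html.unescape("&#931; &#x3A3; &#x03A3;")

# Escaping: references can stand in for syntactically special characters.
escaped_markup = html.unescape("&#60;b&#62;")

print(decimal_form, hex_form, decoded, escaped_markup)
```

The built-in codec error handler gives the same decimal form without a helper: `"Σ".encode("ascii", "xmlcharrefreplace")` returns `b"&#931;"`.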
==~~Example~~Examples==
In SGML, HTML, and XML, the following are all valid numeric character references for the Greek capital letter Sigma
{| class="wikitable"~~ border="1"~~
|+ Numerical character reference of ~~Greek Sigma~~{{unichar|03A3|GREEK CAPITAL LETTER SIGMA}}<br/>({{hexadecimal|0931}} = 931<sub>10</sub>)
|-
~~| colspan="4" style="text-align:center" | {{unichar|03A3|GREEK CAPITAL LETTER SIGMA}} ({{hexadecimal|0931}} = {{Decimal2Base|0931|10}})~~
|-
! [[Unicode#Upluslink|Unicode character]]
Line 19 ⟶ 18:
|-
| U+03A3 || Hexadecimal || &#x3A3; || Σ
|-
| U+03A3 || Hexadecimal || &#x03A3; || Σ
Line 27 ⟶ 25:
In SGML, HTML, and XML, the following are all valid numeric character references for the Latin capital letter AE
{| class="wikitable"~~ border="1"~~
|+ Numerical character reference of {{unichar|00C6|Latin capital letter AE}} |
|-
~~| colspan="4" style="text-align:center" | {{unichar|00C6|Latin capital letter Æ}}~~
|-
! [[Unicode#Upluslink|Unicode character]]
Line 38 ⟶ 34:
|-
| U+00C6 || Decimal || &#198; || Æ
|-
| U+00C6 || Hexadecimal || &#xC6; || Æ
Line 44 ⟶ 39:
In SGML, HTML, and XML, the following are all valid numeric character references for the Latin small letter sharp s ß
{| class="wikitable"~~ border="1"~~
|+ Numerical character reference of {{unichar|00DF|Latin small letter sharp s}} ~~ß |~~
|-
~~| colspan="4" style="text-align:center" | {{unichar|00DF|Latin small letter sharp s ß}}~~
|-
! [[Unicode#Upluslink|Unicode character]]
Line 55 ⟶ 48:
|-
| U+00DF || Decimal || &#223; || ß
|-
| U+00DF || Hexadecimal || &#xDF; || ß
|}
List of numeric character references for the printable [[ASCII]] characters:
{| class="wikitable"
! [[Unicode#Upluslink|Unicode character]]
! Character<br />Reference<br />(decimal)
! Character<br />Reference<br />(hexadecimal)
! Effect
|-
| U+0020 || &#32; || &#x20; || (space)
|-
| U+0021 || &#33; || &#x21; || !
|-
| U+0022 || &#34; || &#x22; || "
|-
| U+0023 || &#35; || &#x23; || #
|-
| U+0024 || &#36; || &#x24; || $
|-
| U+0025 || &#37; || &#x25; || %
|-
| U+0026 || &#38; || &#x26; || &
|-
| U+0027 || &#39; || &#x27; || '
|-
| U+0028 || &#40; || &#x28; || (
|-
| U+0029 || &#41; || &#x29; || )
|-
| U+002A || &#42; || &#x2A; || *
|-
| U+002B || &#43; || &#x2B; || +
|-
| U+002C || &#44; || &#x2C; || ,
|-
| U+002D || &#45; || &#x2D; || -
|-
| U+002E || &#46; || &#x2E; || .
|-
| U+002F || &#47; || &#x2F; || /
|-
| U+0030 || &#48; || &#x30; || 0
|-
| U+0031 || &#49; || &#x31; || 1
|-
| U+0032 || &#50; || &#x32; || 2
|-
| U+0033 || &#51; || &#x33; || 3
|-
| U+0034 || &#52; || &#x34; || 4
|-
| U+0035 || &#53; || &#x35; || 5
|-
| U+0036 || &#54; || &#x36; || 6
|-
| U+0037 || &#55; || &#x37; || 7
|-
| U+0038 || &#56; || &#x38; || 8
|-
| U+0039 || &#57; || &#x39; || 9
|-
| U+003A || &#58; || &#x3A; || :
|-
| U+003B || &#59; || &#x3B; || ;
|-
| U+003C || &#60; || &#x3C; || <
|-
| U+003D || &#61; || &#x3D; || =
|-
| U+003E || &#62; || &#x3E; || >
|-
| U+003F || &#63; || &#x3F; || ?
|-
| U+0040 || &#64; || &#x40; || @
|-
| U+0041 || &#65; || &#x41; || A
|-
| U+0042 || &#66; || &#x42; || B
|-
| U+0043 || &#67; || &#x43; || C
|-
| U+0044 || &#68; || &#x44; || D
|-
| U+0045 || &#69; || &#x45; || E
|-
| U+0046 || &#70; || &#x46; || F
|-
| U+0047 || &#71; || &#x47; || G
|-
| U+0048 || &#72; || &#x48; || H
|-
| U+0049 || &#73; || &#x49; || I
|-
| U+004A || &#74; || &#x4A; || J
|-
| U+004B || &#75; || &#x4B; || K
|-
| U+004C || &#76; || &#x4C; || L
|-
| U+004D || &#77; || &#x4D; || M
|-
| U+004E || &#78; || &#x4E; || N
|-
| U+004F || &#79; || &#x4F; || O
|-
| U+0050 || &#80; || &#x50; || P
|-
| U+0051 || &#81; || &#x51; || Q
|-
| U+0052 || &#82; || &#x52; || R
|-
| U+0053 || &#83; || &#x53; || S
|-
| U+0054 || &#84; || &#x54; || T
|-
| U+0055 || &#85; || &#x55; || U
|-
| U+0056 || &#86; || &#x56; || V
|-
| U+0057 || &#87; || &#x57; || W
|-
| U+0058 || &#88; || &#x58; || X
|-
| U+0059 || &#89; || &#x59; || Y
|-
| U+005A || &#90; || &#x5A; || Z
|-
| U+005B || &#91; || &#x5B; || [
|-
| U+005C || &#92; || &#x5C; || \
|-
| U+005D || &#93; || &#x5D; || ]
|-
| U+005E || &#94; || &#x5E; || ^
|-
| U+005F || &#95; || &#x5F; || _
|-
| U+0060 || &#96; || &#x60; || `
|-
| U+0061 || &#97; || &#x61; || a
|-
| U+0062 || &#98; || &#x62; || b
|-
| U+0063 || &#99; || &#x63; || c
|-
| U+0064 || &#100; || &#x64; || d
|-
| U+0065 || &#101; || &#x65; || e
|-
| U+0066 || &#102; || &#x66; || f
|-
| U+0067 || &#103; || &#x67; || g
|-
| U+0068 || &#104; || &#x68; || h
|-
| U+0069 || &#105; || &#x69; || i
|-
| U+006A || &#106; || &#x6A; || j
|-
| U+006B || &#107; || &#x6B; || k
|-
| U+006C || &#108; || &#x6C; || l
|-
| U+006D || &#109; || &#x6D; || m
|-
| U+006E || &#110; || &#x6E; || n
|-
| U+006F || &#111; || &#x6F; || o
|-
| U+0070 || &#112; || &#x70; || p
|-
| U+0071 || &#113; || &#x71; || q
|-
| U+0072 || &#114; || &#x72; || r
|-
| U+0073 || &#115; || &#x73; || s
|-
| U+0074 || &#116; || &#x74; || t
|-
| U+0075 || &#117; || &#x75; || u
|-
| U+0076 || &#118; || &#x76; || v
|-
| U+0077 || &#119; || &#x77; || w
|-
| U+0078 || &#120; || &#x78; || x
|-
| U+0079 || &#121; || &#x79; || y
|-
| U+007A || &#122; || &#x7A; || z
|-
| U+007B || &#123; || &#x7B; || {
|-
| U+007C || &#124; || &#x7C; || {{pipe}}
|-
| U+007D || &#125; || &#x7D; || }
|-
| U+007E || &#126; || &#x7E; || ~
|}
Line 89 ⟶ 279:
While the syntax of SGML does not prohibit references to invalid or unassigned code points, such as <code>&#xFFFF;</code>, SGML-derived markup languages such as HTML and XML can, and often do, restrict numeric character references to only those code points that are assigned to characters. Restrictions may also apply for other reasons. For example, in HTML 4, <code>&#12;</code>, which is a reference to a non-printing "form feed" control character, is allowed because a form feed character is allowed. But in XML, the form feed character cannot be used, not even by reference.<ref>{{cite web |title=HTML 5.2: 8. The HTML syntax |url=https://www.w3.org/TR/2017/WD-html52-20170228/syntax.html |website=www.w3.org}}</ref>{{Citation needed|date=May 2013}} As another example, <code>&#128;</code>, which is a reference to another control character, is not allowed to be used or referenced in either HTML or XML, but when used in HTML, it is usually not flagged as an error by web browsers – some of which interpret it as a reference to the character represented by code value 128 in the [[Windows-1252]] encoding for compatibility reasons.
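Two of these restrictions can be observed directly with a general-purpose parser. The following is a minimal sketch, assuming Python's standard library: `xml.etree.ElementTree` stands in for one XML 1.0 parser, and `html.unescape` is documented as following HTML5 rules for character references. Other parsers may behave differently, and the helper name `xml_accepts` is hypothetical.

```python
import html
import xml.etree.ElementTree as ET


def xml_accepts(reference: str) -> bool:
    """Return True if the XML parser accepts the reference as element content.

    Hypothetical helper: wraps the reference in a dummy <a> element and
    reports whether parsing succeeds.
    """
    try:
        ET.fromstring(f"<a>{reference}</a>")
        return True
    except ET.ParseError:
        return False


# Form feed (U+000C) is not a legal XML 1.0 character, even by reference.
form_feed_allowed = xml_accepts("&#12;")

# A reference above U+FFFD is accepted by a current XML 1.0 parser.
supplementary_allowed = xml_accepts("&#65536;")

# HTML5-style decoding maps &#128; to the Windows-1252 character at 0x80,
# which is the same character that &#8364; references properly.
legacy_reference = html.unescape("&#128;")
standard_reference = html.unescape("&#8364;")

print(form_feed_allowed, supplementary_allowed)
print(legacy_reference == standard_reference)
```

The last comparison shows the compatibility mapping only; it does not make `&#128;` a conforming way to write the character.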
That character, the euro sign "€", has to be represented as <code>&#8364;</code> in standards-compliant HTML code. As a further example, prior to the publication of XML 1.0 Second Edition on October 6, 2000, XML 1.0 was based on an older version of ISO 10646 and prohibited using characters above U+FFFD, except in character data, thus making a reference like <code>&#65536;</code> (U+10000) illegal. In XML 1.1 and newer editions of XML 1.0, such a reference is allowed, because the available character repertoire was explicitly extended.
Markup languages also place restrictions on where character references can occur.
Line 95 ⟶ 285:
==Compatibility issues==
In the initial versions of [[SGML]] and [[HTML]], numeric character references were interpreted in relation to the document character encoding, rather than [[Unicode]]. For Latin-script documents, numeric character references to characters between x80 and x9F in those documents will not be correct against [[Unicode]], and must be recoded. HTML standards prior to [[HTML 4]] ~~only~~ supported only Western Latin script documents: the treatment of character references above #7F may vary between applications and national conventions.
For example, as mentioned above, the correct numeric character reference for the [[Euro sign]] "€" <code>U+20AC</code> when using [[Unicode]] is decimal <code>&#8364;</code> and hexadecimal <code>&#x20AC;</code>. However, if using tools supporting obsolete implementations of HTML, the reference <code>&#128;</code> (Euro sign in the [[~~Cp1252~~CP-1252]] code page) or <code>&#164;</code> (Euro sign in [[ISO/IEC 8859-15]]) may work.
As another example, if some text was created originally using the [[MacRoman]] character set, the [[quotation mark ~~glyphs~~|left double quotation mark]] {{~~not a typo~~char|“"}} will be represented with ~~codepoint~~code point xD2. This will not display properly in a system expecting a document encoded as UTF-8, ISO 8859-1, or ~~[[CP1252]]~~CP-1252, where this code point is occupied by the letter [[Ò]]. The correct numeric character reference for {{~~not a typo~~char|“"}} in HTML 4 and newer is <code>&#x201C;</code>, because [[Unicode#Upluslink|U+]]201C is its UCS code. In some systems, the [[List of XML and HTML character entity references|named character reference]] <code>&ldquo;</code> may also be available.
==See also==
* [[List of XML and HTML character entity references]]
==References==
{{Reflist}}
{{Unicode navigation}}
Line 110 ⟶ 303:
[[Category:Unicode]]
[[Category:XML]]
~~[[pl:Odwołania znakowe SGML]]~~