Examine individual changes
This page allows you to examine the variables generated by the Edit Filter for an individual change.
Variables generated for this change
Variable | Value |
---|---|
Edit count of the user (user_editcount) | null |
Name of the user account (user_name) | '2A03:2880:3020:1FFC:FACE:B00C:0:8000' |
Age of the user account (user_age) | 0 |
Groups (including implicit) the user is in (user_groups) | [
0 => '*'
] |
Global groups that the user is in (global_user_groups) | [] |
Whether or not a user is editing through the mobile interface (user_mobile) | false |
Page ID (page_id) | 5315 |
Page namespace (page_namespace) | 0 |
Page title without namespace (page_title) | 'Character encodings in HTML' |
Full page title (page_prefixedtitle) | 'Character encodings in HTML' |
Last ten users to contribute to the page (page_recent_contributors) | [
0 => 'George8211',
1 => '0x010C',
2 => 'Andy Dingley',
3 => '122.160.75.93',
4 => 'Dernier Siècle',
5 => '86.153.77.139',
6 => 'Softtest123',
7 => 'BIL',
8 => '217.238.184.46',
9 => 'Yobot'
] |
Action (action) | 'edit' |
Edit summary/reason (summary) | '/* Foreign Biologists */' |
Whether or not the edit is marked as minor (no longer in use) (minor_edit) | false |
Old page wikitext, before the edit (old_wikitext) | '{{for|a list of character entity references|List of XML and HTML character entity references}}
{{Hatnote|For fixing links within Wikipedia, see [[Help:Percent-encoding#Fixing links with unsupported characters|Help:Percent-encoding (the section Fixing Links with Unsupported Characters)]].}}
{{Html series}}
[[HTML]] (<u>H</u>yper<u>t</u>ext <u>M</u>arkup <u>L</u>anguage) has been in use since 1991, but HTML 4.0 (December 1997) was the first standardized version where international [[character (computing)|character]]s were given reasonably complete treatment. When an HTML document includes special characters outside the range of seven-bit [[ASCII]] two goals are worth considering: the information's [[integrity]], and universal [[Web browser|browser]] display.
==Specifying the document's character encoding==
There are several ways to specify which character encoding is used in the document. First, the [[web server]] can include the character encoding or "<code>charset</code>" in the [[Hypertext Transfer Protocol]] (HTTP) <code>Content-Type</code> header, which would typically look like this:<ref>{{citation |url=http://tools.ietf.org/html/rfc7231#section-3.1.1.5|chapter=Content-Type |title=Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content|publisher=[[IETF]] |date=June 2014 |accessdate=2014-07-30}}</ref>
Content-Type: text/html; charset=ISO-8859-4
This method gives the HTTP server a convenient way to alter document's encoding according to [[content negotiation]]; certain HTTP server software can do it, for example Apache with the [[List of Apache modules|module]] mod_charset_lite.<ref>[http://httpd.apache.org/docs/2.0/en/mod/mod_charset_lite.html Apache Module mod_charset_lite]</ref>
For HTML it is possible to include this information inside the <code>head</code> element near the top of the document:<ref name=html5charset/>
<!-- Please don't add a closing "/": that is incorrect here. -->
<source lang=html4strict>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</source>
[[HTML5]] also allows the following syntax to mean exactly the same:<ref name=html5charset>{{citation |url=http://www.w3.org/TR/html5/document-metadata.html#charset |chapter=Specifying the document's character encoding |title=HTML5 |publisher=[[World Wide Web Consortium]] |date=17 June 2014 |accessdate=2014-07-30}}</ref>
<!-- Please don't add a closing "/": that is unnecessary here. -->
<source lang=html4strict>
<meta charset="utf-8">
</source>
[[XHTML]] documents have a third option: to express the character encoding via [[XML]] declaration, as follows:<ref>{{citation |url=http://www.w3.org/TR/REC-xml/#sec-prolog-dtd |chapter=Prolog and Document Type Declaration |title=XML |first1=T. |last1=Bray |authorlink1=Tim Bray |first2=J. |last2=Paoli |first3=C. |last3=Sperberg-McQueen |authorlink3=Michael Sperberg-McQueen |first4=E. |last4=Maler |first5=F. |last5=Yergeau |publisher=[[W3C]] |date=26 November 2008 |accessdate=8 March 2010}}</ref>
<source lang=xml>
<?xml version="1.0" encoding="ISO-8859-1"?>
</source>
Note that as the character encoding can't be known until this declaration is parsed, there can be a problem knowing which encoding is used for the declaration itself. The main principle is that the declaration shall be encoded in pure ASCII, and therefore (if the declaration is inside the file) the encoding needs to be an [[ASCII extension]]. In order to allow encodings not backwards compatible with ASCII, browsers must be able to parse declarations in such encodings. Examples of such encodings are [[UTF-16BE]] and [[UTF-16LE]].
As of HTML5 the recommended charset is [[UTF-8]].<ref name=html5charset/> An "encoding sniffing algorithm" is defined in the specification to determine the character encoding of the document based on multiple sources of input, including:
# Explicit user instruction
# An explicit meta tag within the first 1024 bytes of the document
# A [[Byte order mark]] within the first three bytes of the document
# The HTTP Content-Type or other transport layer information
# Analysis of the document bytes looking for specific sequences or ranges of byte values,<ref>[http://www.w3.org/TR/html5/syntax.html#prescan-a-byte-stream-to-determine-its-encoding HTML5 prescan a byte stream to determine its encoding]</ref> and other tentative detection mechanisms.
For ASCII-compatible character encodings the consequence of choosing incorrectly is that characters outside the printable ASCII range (32 to 126) usually appear incorrectly. This presents few problems for [[English language|English]]-speaking users, but other languages regularly—in some cases, always—require characters outside that range. In [[CJK]] environments where there are several different multi-byte encodings in use, auto-detection is also often employed. Finally, browsers usually permit the user to override ''incorrect'' charset label manually as well.
It is increasingly common for multilingual websites and websites in non-Western languages to use [[UTF-8]], which allows use of the same encoding for all languages. [[UTF-16]] or [[UTF-32]], which can be used for all languages as well, are less widely used because they can be harder to handle in programming languages that assume a [[byte-oriented]] ASCII superset encoding, and they are less efficient for text with a high frequency of ASCII characters, which is usually the case for HTML documents.
Successful viewing of a page is not necessarily an indication that its encoding is specified correctly. If the page's creator and reader are both assuming some platform-specific character encoding, and the server does not send any identifying information, then the reader will nonetheless see the page as the creator intended, but other readers on different platforms or with different native languages will not see the page as intended.
==Character references==
{{Main|Character entity reference|Numeric character reference}}
In addition to native character encodings, characters can also be encoded as ''character references'', which can be ''numeric character references'' ([[decimal]] or [[hexadecimal]]) or ''character entity references''. Character entity references are also sometimes referred to as ''named entities'', or ''HTML entities'' for HTML. HTML's usage of character references derives from [[SGML]].
===HTML character references===
<!--Linked from [[Template:Auxiliary template common notice]]-->
A ''numeric character reference'' in HTML refers to a character by its [[Universal Character Set]]/[[Unicode]] ''code point'', and uses the format
:<code>&#''nnnn'';</code>
or
:<code>&#x''hhhh'';</code>
where ''nnnn'' is the code point in [[decimal]] form, and ''hhhh'' is the code point in [[hexadecimal]] form. The ''x'' must be lowercase in XML documents. The ''nnnn'' or ''hhhh'' may be any number of digits and may include leading zeros. The ''hhhh'' may mix uppercase and lowercase, though uppercase is the usual style.
Not all [[web browser]]s or [[email client]]s used by receivers of HTML documents, or [[text editor]]s used by authors of HTML documents, will be able to render all HTML characters. Most modern software is able to display most or all of the characters for the user's language, and will draw a box or other clear indicator for characters they cannot render.
For codes from 0 to 127, the original 7-bit [[ASCII]] standard set, most of these characters can be used without a character reference. Codes from 160 to 255 can all be created using [[List of XML and HTML character entity references|character entity names]]. Only a few higher-numbered codes can be created using entity names, but all can be created by decimal number character reference.
Character entity references can also have the format <code>&''name'';</code> where ''name'' is a case-sensitive alphanumeric string. For example, "λ" can also be encoded as <code>&lambda;</code> in an HTML document. The character entity references <code>&lt;</code>, <code>&gt;</code>, <code>&quot;</code> and <code>&amp;</code> are predefined in HTML and SGML, because <code><</code>, <code>></code>, <code>"</code> and <code>&</code> are already used to delimit markup. This notably does not include XML's <code>&apos;</code> (') entity. For a list of all named HTML character entity references (about 250), see [[List of XML and HTML character entity references]].
Unnecessary use of HTML character references may significantly reduce HTML readability. If the character encoding for a web page is chosen appropriately, then HTML character references are usually only required for markup delimiting characters as mentioned above, and for a few special characters (or none at all if a native [[Unicode]] encoding like [[UTF-8]] is used). Incorrect HTML entity escaping may also open up security vulnerabilities for injection attacks such as [[cross-site scripting]]. If HTML attributes are left unquoted, certain characters, most importantly [[whitespace character|whitespace]], such as space and tab, must be escaped using entities. Other languages related to HTML have their own methods of escaping characters.
===Illegal characters===
HTML forbids<ref>{{cite web |url=http://www.w3.org/TR/REC-html40/sgml/sgmldecl.html |title= SGML Declaration of HTML 4 |date= 24 December 1999 |website= HTML 4.01 Specification |publisher= World Wide Web Consortium (W3C) |accessdate= 2014-09-06}}</ref> the use of the characters with [[Universal Character Set]]/[[Unicode]] code points ''(in decimal form, preceded by x in hexadecimal form)''
* 0 to 31, except 9, 10, and 13 (C0 [[control characters]])
* 127 (DEL character)
* 128 to 159 (x80 – x9F, C1 [[control characters]])
* 55296 to 57343 (xD800 – xDFFF, the [[UTF-16]] surrogate halves)
The Unicode standard also forbids:
* 65534 and 65535 (xFFFE – xFFFF), non-characters, related to xFEFF, the [[byte order mark]].
These characters are not allowed by [[numeric character reference]]s. However, references to characters 128–159 are commonly interpreted by lenient web browsers as if they were references to the characters assigned to ''bytes'' 128–159 (decimal) in the [[Windows-1252]] character encoding. This is in violation of HTML and SGML standards, and the characters are already assigned to higher code points, so HTML documents should always use the higher code points. For example the trademark sign (™) should be represented with <code>&#8482;</code> and not with <code>&#153;</code>.
The characters 9 (tab), 10 (linefeed), and 13 (carriage return) are allowed in HTML documents, but, along with 32 (space) are all considered "[[whitespace (computer science)|whitespace]]".<ref>{{cite web |url= http://www.w3.org/TR/REC-html40/struct/text.html#h-9.1 |title= Text - White space |date= 24 December 1999 |website= HTML 4.01 Specification |publisher= World Wide Web Consortium (W3C) |accessdate= 2014-09-06}}</ref> The "form feed" control character, which would be at 12, is not allowed in HTML documents, but is also mentioned as being one of the "white space" characters – perhaps an oversight in the specifications. In HTML, most consecutive occurrences of white space characters, except in a <code><pre></code> block, are interpreted as comprising a single "word separator" for rendering purposes. A word separator is typically rendered a single en-width space in European languages, but not in all the others.
===XML character references===
Unlike traditional HTML with its large range of character entity references, in [[XML]] there are only five predefined character entity references. These are used to escape characters that are markup sensitive in certain contexts:<ref>{{citation |url=http://www.w3.org/TR/REC-xml/#sec-references |chapter=Character and Entity References |title=XML |first1=T. |last1=Bray |authorlink1=Tim Bray |first2=J. |last2=Paoli |first3=C. |last3=Sperberg-McQueen |authorlink3=Michael Sperberg-McQueen |first4=E. |last4=Maler |first5=F. |last5=Yergeau |publisher=[[W3C]] |date=26 November 2008 |accessdate=8 March 2010}}</ref>
*<code>&amp;</code> → & ([[ampersand]], U+0026)
*<code>&lt;</code> → < (less-than sign, U+003C)
*<code>&gt;</code> → > (greater-than sign, U+003E)
*<code>&quot;</code> → " (quotation mark, U+0022)
*<code>&apos;</code> → ' (apostrophe, U+0027)
All other character entity references have to be defined before they can be used. For example, use of <code>&eacute;</code> (which gives é, Latin lower-case E with acute accent, U+00E9 in Unicode) in an XML document will generate an error unless the entity has already been defined. XML also requires that the <code>x</code> in hexadecimal numeric references be in lowercase: for example <code>&#xA1b</code> rather than <code>&#XA1b</code>. [[XHTML]], which is an XML application, supports the HTML entity set, along with XML's predefined entities.
== See also ==
* [[Charset sniffing]] – used by many browsers when character encoding metadata is not available
* [[Unicode and HTML]]
* [[Language code]]
* [[List of XML and HTML character entity references]]
== References ==
{{Reflist}}
== External links ==
* [http://www.w3.org/TR/REC-html40/sgml/entities.html Character entity references in HTML4]
* [http://www.sitepoint.com/article/guide-web-character-encoding/ The Definitive Guide to Web Character Encoding]
* [http://code.google.com/p/browsersec/wiki/Part1#HTML_entity_encoding HTML Entity Encoding chapter of Browser Security Handbook - more information about current browsers and their entity handling]
* [http://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet The Open Web Application Security Project's wiki article on cross-site scripting (XSS)]
{{Use dmy dates|date=August 2011}}
{{DEFAULTSORT:Character Encodings In Html}}
[[Category:HTML]]
[[Category:World Wide Web Consortium standards]]' |
New page wikitext, after the edit (new_wikitext) | '{{for|a list of character entity references|List of XML and HTML character entity references}}
{{Hatnote|For fixing links within Wikipedia, see [[Help:Percent-encoding#Fixing links with unsupported characters|Help:Percent-encoding (the section Fixing Links with Unsupported Characters)]].}}
{{Html series}}
[[HTML]] (<u>H</u>yper<u>t</u>ext <u>M</u>arkup <u>L</u>anguage) has been in use since 1991, but HTML 4.0 (December 1997) was the first standardized version where international [[character (computing)|character]]s were given reasonably complete treatment. When an HTML document includes special characters outside the range of seven-bit [[ASCII]] two goals are worth considering: the information's [[integrity]], and universal [[Web browser|browser]] display.
==Specifying the document's character encoding==
There are several ways to specify which character encoding is used in the document. First, the [[web server]] can include the character encoding or "<code>charset</code>" in the [[Hypertext Transfer Protocol]] (HTTP) <code>Content-Type</code> header, which would typically look like this:<ref>{{citation |url=http://tools.ietf.org/html/rfc7231#section-3.1.1.5|chapter=Content-Type |title=Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content|publisher=[[IETF]] |date=June 2014 |accessdate=2014-07-30}}</ref>
Content-Type: text/html; charset=ISO-8859-4
This method gives the HTTP server a convenient way to alter document's encoding according to [[content negotiation]]; certain HTTP server software can do it, for example Apache with the [[List of Apache modules|module]] mod_charset_lite.<ref>[http://httpd.apache.org/docs/2.0/en/mod/mod_charset_lite.html Apache Module mod_charset_lite]</ref>
For HTML it is possible to include this information inside the <code>head</code> element near the top of the document:<ref name=html5charset/>
<!-- Please don't add a closing "/": that is incorrect here. -->
<source lang=html4strict>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</source>
[[HTML5]] also allows the following syntax to mean exactly the same:<ref name=html5charset>{{citation |url=http://www.w3.org/TR/html5/document-metadata.html#charset |chapter=Specifying the document's character encoding |title=HTML5 |publisher=[[World Wide Web Consortium]] |date=17 June 2014 |accessdate=2014-07-30}}</ref>
<!-- Please don't add a closing "/": that is unnecessary here. -->
<source lang=html4strict>
<meta charset="utf-8">
</source>
[[XHTML]] documents have a third option: to express the character encoding via [[XML]] declaration, as follows:<ref>{{citation |url=http://www.w3.org/TR/REC-xml/#sec-prolog-dtd |chapter=Prolog and Document Type Declaration |title=XML |first1=T. |last1=Bray |authorlink1=Tim Bray |first2=J. |last2=Paoli |first3=C. |last3=Sperberg-McQueen |authorlink3=Michael Sperberg-McQueen |first4=E. |last4=Maler |first5=F. |last5=Yergeau |publisher=[[W3C]] |date=26 November 2008 |accessdate=8 March 2010}}</ref>
<source lang=xml>
<?xml version="1.0" encoding="ISO-8859-1"?>
</source>
Note that as the character encoding can't be known until this declaration is parsed, there can be a problem knowing which encoding is used for the declaration itself. The main principle is that the declaration shall be encoded in pure ASCII, and therefore (if the declaration is inside the file) the encoding needs to be an [[ASCII extension]]. In order to allow encodings not backwards compatible with ASCII, browsers must be able to parse declarations in such encodings. Examples of such encodings are [[UTF-16BE]] and [[UTF-16LE]].
As of HTML5 the recommended charset is [[UTF-8]].<ref name=html5charset/> An "encoding sniffing algorithm" is defined in the specification to determine the character encoding of the document based on multiple sources of input, including:
# Explicit user instruction
# An explicit meta tag within the first 1024 bytes of the document
# A [[Byte order mark]] within the first three bytes of the document
# The HTTP Content-Type or other transport layer information
# Analysis of the document bytes looking for specific sequences or ranges of byte values,<ref>[http://www.w3.org/TR/html5/syntax.html#prescan-a-byte-stream-to-determine-its-encoding HTML5 prescan a byte stream to determine its encoding]</ref> and other tentative detection mechanisms.
For ASCII-compatible character encodings the consequence of choosing incorrectly is that characters outside the printable ASCII range (32 to 126) usually appear incorrectly. This presents few problems for [[English language|English]]-speaking users, but other languages regularly—in some cases, always—require characters outside that range. In [[CJK]] environments where there are several different multi-byte encodings in use, auto-detection is also often employed. Finally, browsers usually permit the user to override ''incorrect'' charset label manually as well.
It is increasingly common for multilingual websites and websites in non-Western languages to use [[UTF-8]], which allows use of the same encoding for all languages. [[UTF-16]] or [[UTF-32]], which can be used for all languages as well, are less widely used because they can be harder to handle in programming languages that assume a [[byte-oriented]] ASCII superset encoding, and they are less efficient for text with a high frequency of ASCII characters, which is usually the case for HTML documents.
Successful viewing of a page is not necessarily an indication that its encoding is specified correctly. If the page's creator and reader are both assuming some platform-specific character encoding, and the server does not send any identifying information, then the reader will nonetheless see the page as the creator intended, but other readers on different platforms or with different native languages will not see the page as intended.
==Character references==
{{Main|Character entity reference|Numeric character reference}}
In addition to native character encodings, characters can also be encoded as ''character references'', which can be ''numeric character references'' ([[decimal]] or [[hexadecimal]]) or ''character entity references''. Character entity references are also sometimes referred to as ''named entities'', or ''HTML entities'' for HTML. HTML's usage of character references derives from [[SGML]].
===HTML character references===
<!--Linked from [[Template:Auxiliary template common notice]]-->
A ''numeric character reference'' in HTML refers to a character by its [[Universal Character Set]]/[[Unicode]] ''code point'', and uses the format
:<code>&#''nnnn'';</code>
or
:<code>&#x''hhhh'';</code>
where ''nnnn'' is the code point in [[decimal]] form, and ''hhhh'' is the code point in [[hexadecimal]] form. The ''x'' must be lowercase in XML documents. The ''nnnn'' or ''hhhh'' may be any number of digits and may include leading zeros. The ''hhhh'' may mix uppercase and lowercase, though uppercase is the usual style.
Not all [[web browser]]s or [[email client]]s used by receivers of HTML documents, or [[text editor]]s used by authors of HTML documents, will be able to render all HTML characters. Most modern software is able to display most or all of the characters for the user's language, and will draw a box or other clear indicator for characters they cannot render.
For codes from 0 to 127, the original 7-bit [[ASCII]] standard set, most of these characters can be used without a character reference. Codes from 160 to 255 can all be created using [[List of XML and HTML character entity references|character entity names]]. Only a few higher-numbered codes can be created using entity names, but all can be created by decimal number character reference.
Character entity references can also have the format <code>&''name'';</code> where ''name'' is a case-sensitive alphanumeric string. For example, "λ" can also be encoded as <code>&lambda;</code> in an HTML document. The character entity references <code>&lt;</code>, <code>&gt;</code>, <code>&quot;</code> and <code>&amp;</code> are predefined in HTML and SGML, because <code><</code>, <code>></code>, <code>"</code> and <code>&</code> are already used to delimit markup. This notably does not include XML's <code>&apos;</code> (') entity. For a list of all named HTML character entity references (about 250), see [[List of XML and HTML character entity references]].
Unnecessary use of HTML character references may significantly reduce HTML readability. If the character encoding for a web page is chosen appropriately, then HTML character references are usually only required for markup delimiting characters as mentioned above, and for a few special characters (or none at all if a native [[Unicode]] encoding like [[UTF-8]] is used). Incorrect HTML entity escaping may also open up security vulnerabilities for injection attacks such as [[cross-site scripting]]. If HTML attributes are left unquoted, certain characters, most importantly [[whitespace character|whitespace]], such as space and tab, must be escaped using entities. Other languages related to HTML have their own methods of escaping characters.
===Foreign Biologists===
===XML character references===
Unlike traditional HTML with its large range of character entity references, in [[XML]] there are only five predefined character entity references. These are used to escape characters that are markup sensitive in certain contexts:<ref>{{citation |url=http://www.w3.org/TR/REC-xml/#sec-references |chapter=Character and Entity References |title=XML |first1=T. |last1=Bray |authorlink1=Tim Bray |first2=J. |last2=Paoli |first3=C. |last3=Sperberg-McQueen |authorlink3=Michael Sperberg-McQueen |first4=E. |last4=Maler |first5=F. |last5=Yergeau |publisher=[[W3C]] |date=26 November 2008 |accessdate=8 March 2010}}</ref>
*<code>&amp;</code> → & ([[ampersand]], U+0026)
*<code>&lt;</code> → < (less-than sign, U+003C)
*<code>&gt;</code> → > (greater-than sign, U+003E)
*<code>&quot;</code> → " (quotation mark, U+0022)
*<code>&apos;</code> → ' (apostrophe, U+0027)
All other character entity references have to be defined before they can be used. For example, use of <code>&eacute;</code> (which gives é, Latin lower-case E with acute accent, U+00E9 in Unicode) in an XML document will generate an error unless the entity has already been defined. XML also requires that the <code>x</code> in hexadecimal numeric references be in lowercase: for example <code>&#xA1b</code> rather than <code>&#XA1b</code>. [[XHTML]], which is an XML application, supports the HTML entity set, along with XML's predefined entities.
== See also ==
* [[Charset sniffing]] – used by many browsers when character encoding metadata is not available
* [[Unicode and HTML]]
* [[Language code]]
* [[List of XML and HTML character entity references]]
== References ==
{{Reflist}}
== External links ==
* [http://www.w3.org/TR/REC-html40/sgml/entities.html Character entity references in HTML4]
* [http://www.sitepoint.com/article/guide-web-character-encoding/ The Definitive Guide to Web Character Encoding]
* [http://code.google.com/p/browsersec/wiki/Part1#HTML_entity_encoding HTML Entity Encoding chapter of Browser Security Handbook - more information about current browsers and their entity handling]
* [http://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet The Open Web Application Security Project's wiki article on cross-site scripting (XSS)]
{{Use dmy dates|date=August 2011}}
{{DEFAULTSORT:Character Encodings In Html}}
[[Category:HTML]]
[[Category:World Wide Web Consortium standards]]' |
Unified diff of changes made by edit (edit_diff) | '@@ -64,17 +64,5 @@
Unnecessary use of HTML character references may significantly reduce HTML readability. If the character encoding for a web page is chosen appropriately, then HTML character references are usually only required for markup delimiting characters as mentioned above, and for a few special characters (or none at all if a native [[Unicode]] encoding like [[UTF-8]] is used). Incorrect HTML entity escaping may also open up security vulnerabilities for injection attacks such as [[cross-site scripting]]. If HTML attributes are left unquoted, certain characters, most importantly [[whitespace character|whitespace]], such as space and tab, must be escaped using entities. Other languages related to HTML have their own methods of escaping characters.
-===Illegal characters===
-
-HTML forbids<ref>{{cite web |url=http://www.w3.org/TR/REC-html40/sgml/sgmldecl.html |title= SGML Declaration of HTML 4 |date= 24 December 1999 |website= HTML 4.01 Specification |publisher= World Wide Web Consortium (W3C) |accessdate= 2014-09-06}}</ref> the use of the characters with [[Universal Character Set]]/[[Unicode]] code points ''(in decimal form, preceded by x in hexadecimal form)''
-* 0 to 31, except 9, 10, and 13 (C0 [[control characters]])
-* 127 (DEL character)
-* 128 to 159 (x80 – x9F, C1 [[control characters]])
-* 55296 to 57343 (xD800 – xDFFF, the [[UTF-16]] surrogate halves)
-The Unicode standard also forbids:
-* 65534 and 65535 (xFFFE – xFFFF), non-characters, related to xFEFF, the [[byte order mark]].
-
-These characters are not allowed by [[numeric character reference]]s. However, references to characters 128–159 are commonly interpreted by lenient web browsers as if they were references to the characters assigned to ''bytes'' 128–159 (decimal) in the [[Windows-1252]] character encoding. This is in violation of HTML and SGML standards, and the characters are already assigned to higher code points, so HTML documents should always use the higher code points. For example the trademark sign (™) should be represented with <code>&#8482;</code> and not with <code>&#153;</code>.
-
-The characters 9 (tab), 10 (linefeed), and 13 (carriage return) are allowed in HTML documents, but, along with 32 (space) are all considered "[[whitespace (computer science)|whitespace]]".<ref>{{cite web |url= http://www.w3.org/TR/REC-html40/struct/text.html#h-9.1 |title= Text - White space |date= 24 December 1999 |website= HTML 4.01 Specification |publisher= World Wide Web Consortium (W3C) |accessdate= 2014-09-06}}</ref> The "form feed" control character, which would be at 12, is not allowed in HTML documents, but is also mentioned as being one of the "white space" characters – perhaps an oversight in the specifications. In HTML, most consecutive occurrences of white space characters, except in a <code><pre></code> block, are interpreted as comprising a single "word separator" for rendering purposes. A word separator is typically rendered a single en-width space in European languages, but not in all the others.
+===Foreign Biologists===
===XML character references===
' |
New page size (new_size) | 11672 |
Old page size (old_size) | 13931 |
Size change in edit (edit_delta) | -2259 |
Lines added in edit (added_lines) | [
0 => '===Foreign Biologists==='
] |
Lines removed in edit (removed_lines) | [
0 => '===Illegal characters===',
1 => false,
2 => 'HTML forbids<ref>{{cite web |url=http://www.w3.org/TR/REC-html40/sgml/sgmldecl.html |title= SGML Declaration of HTML 4 |date= 24 December 1999 |website= HTML 4.01 Specification |publisher= World Wide Web Consortium (W3C) |accessdate= 2014-09-06}}</ref> the use of the characters with [[Universal Character Set]]/[[Unicode]] code points ''(in decimal form, preceded by x in hexadecimal form)''',
3 => '* 0 to 31, except 9, 10, and 13 (C0 [[control characters]]) ',
4 => '* 127 (DEL character)',
5 => '* 128 to 159 (x80 – x9F, C1 [[control characters]])',
6 => '* 55296 to 57343 (xD800 – xDFFF, the [[UTF-16]] surrogate halves)',
7 => 'The Unicode standard also forbids:',
8 => '* 65534 and 65535 (xFFFE – xFFFF), non-characters, related to xFEFF, the [[byte order mark]].',
9 => false,
10 => 'These characters are not allowed by [[numeric character reference]]s. However, references to characters 128–159 are commonly interpreted by lenient web browsers as if they were references to the characters assigned to ''bytes'' 128–159 (decimal) in the [[Windows-1252]] character encoding. This is in violation of HTML and SGML standards, and the characters are already assigned to higher code points, so HTML documents should always use the higher code points. For example the trademark sign (™) should be represented with <code>&#8482;</code> and not with <code>&#153;</code>.',
11 => false,
12 => 'The characters 9 (tab), 10 (linefeed), and 13 (carriage return) are allowed in HTML documents, but, along with 32 (space) are all considered "[[whitespace (computer science)|whitespace]]".<ref>{{cite web |url= http://www.w3.org/TR/REC-html40/struct/text.html#h-9.1 |title= Text - White space |date= 24 December 1999 |website= HTML 4.01 Specification |publisher= World Wide Web Consortium (W3C) |accessdate= 2014-09-06}}</ref> The "form feed" control character, which would be at 12, is not allowed in HTML documents, but is also mentioned as being one of the "white space" characters – perhaps an oversight in the specifications. In HTML, most consecutive occurrences of white space characters, except in a <code><pre></code> block, are interpreted as comprising a single "word separator" for rendering purposes. A word separator is typically rendered a single en-width space in European languages, but not in all the others.'
] |
Whether or not the change was made through a Tor exit node (tor_exit_node) | 0 |
Unix timestamp of change (timestamp) | 1440680707 |
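
The size and time variables in the table can be cross-checked with a short script. The following is a minimal sketch, assuming that edit_delta is new_size minus old_size and that timestamp counts seconds since the Unix epoch in UTC. The function names size_delta and to_utc_iso are hypothetical helpers for illustration and are not part of the Edit Filter.

```python
from datetime import datetime, timezone

# Values copied from the variable table for this change.
new_size = 11672
old_size = 13931
timestamp = 1440680707


def size_delta(new: int, old: int) -> int:
    """Return the size change; negative when the edit removed text (assumed definition)."""
    return new - old


def to_utc_iso(unix_seconds: int) -> str:
    """Render a Unix timestamp as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).isoformat()


edit_delta = size_delta(new_size, old_size)
when = to_utc_iso(timestamp)

print(edit_delta)  # matches the edit_delta row: -2259
print(when)        # 2015-08-27T13:05:07+00:00
```

Under these assumptions the computed delta agrees with the recorded edit_delta, which is consistent with the edit replacing the "Illegal characters" section with a single heading line.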