Character encodings in HTML: Difference between revisions

Unnecessary use of HTML character references may significantly reduce the readability of HTML source. If the character encoding for a web page is chosen appropriately, then HTML character references are usually required only for markup-delimiting characters as mentioned above, and for a few special characters (or none at all if a native [[Unicode]] encoding like [[UTF-8]] is used). Incorrect HTML entity escaping may also open up security vulnerabilities to injection attacks such as [[cross-site scripting]]. If HTML attributes are left unquoted, certain characters, most importantly [[whitespace character|whitespace]] such as space and tab, must be escaped using entities. Other languages related to HTML have their own methods of escaping characters.
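As a minimal sketch of the escaping described above, the following Python uses the standard library's <code>html.escape</code>. The helper names <code>escape_text</code> and <code>escape_attr</code> are hypothetical and exist only for this illustration. It assumes the page is served in a Unicode encoding such as UTF-8, so only the markup delimiters need character references.

```python
import html


def escape_text(s: str) -> str:
    """Escape the markup delimiters &, < and > for use in element content."""
    return html.escape(s, quote=False)


def escape_attr(s: str) -> str:
    """Also escape double and single quotes, for use inside a quoted attribute value."""
    return html.escape(s, quote=True)


if __name__ == "__main__":
    # Non-ASCII letters such as "é" pass through unchanged; with a UTF-8
    # page they need no character reference at all.
    print(escape_text("AT&T <b>café</b>"))
    print(escape_attr('say "hi"'))
```

Values placed in unquoted attributes would need further escaping (for example of spaces and tabs), which this sketch does not attempt.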
 
===Illegal characters===
 
HTML forbids<ref>{{cite web |url=http://www.w3.org/TR/REC-html40/sgml/sgmldecl.html |title= SGML Declaration of HTML 4 |date= 24 December 1999 |website= HTML 4.01 Specification |publisher= World Wide Web Consortium (W3C) |accessdate= 2014-09-06}}</ref> the use of characters with the following [[Universal Character Set]]/[[Unicode]] code points ''(given in decimal; hexadecimal forms are prefixed with x)'':
* 0 to 31, except 9, 10, and 13 (C0 [[control characters]])
* 127 (DEL character)
* 128 to 159 (x80 – x9F, C1 [[control characters]])
* 55296 to 57343 (xD800 – xDFFF, the [[UTF-16]] surrogate halves)
The Unicode standard also forbids:
* 65534 and 65535 (xFFFE – xFFFF), noncharacters; xFFFE is the byte-swapped form of xFEFF, the [[byte order mark]].
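The ranges listed above can be sketched as a simple membership test. The function name <code>is_forbidden_code_point</code> is hypothetical; the ranges are taken directly from the list, and the sketch makes no claim about any other code points.

```python
def is_forbidden_code_point(cp: int) -> bool:
    """Return True if cp falls in one of the ranges listed above."""
    if 0 <= cp <= 31:
        # C0 controls, except tab (9), linefeed (10) and carriage return (13)
        return cp not in (9, 10, 13)
    if cp == 127:
        # DEL
        return True
    if 128 <= cp <= 159:
        # C1 controls (x80 to x9F)
        return True
    if 0xD800 <= cp <= 0xDFFF:
        # UTF-16 surrogate halves
        return True
    # Noncharacters xFFFE and xFFFF, forbidden by the Unicode standard
    return cp in (0xFFFE, 0xFFFF)


if __name__ == "__main__":
    for cp in (9, 12, 127, 153, 0x2122, 0xD800, 0xFFFF):
        print(cp, is_forbidden_code_point(cp))
```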
 
These characters may not be included even by means of [[numeric character reference]]s. However, references to characters 128–159 are commonly interpreted by lenient web browsers as if they were references to the characters assigned to ''bytes'' 128–159 (decimal) in the [[Windows-1252]] character encoding. This is in violation of HTML and SGML standards, and the characters are already assigned to higher code points, so HTML documents should always use the higher code points. For example, the trademark sign (™) should be represented with <code>&amp;#8482;</code> and not with <code>&amp;#153;</code>.
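The lenient interpretation described above can be illustrated by decoding the reference number as a Windows-1252 byte. This is only a sketch: <code>repair_c1_reference</code> is a hypothetical helper, and it relies on Python's built-in <code>cp1252</code> codec, which leaves a few bytes (such as x81) unassigned; those are returned unchanged here.

```python
def repair_c1_reference(cp: int) -> int:
    """Map a reference number in 128-159 to the code point that
    Windows-1252 assigns to the byte with the same value."""
    if not 128 <= cp <= 159:
        return cp
    try:
        return ord(bytes([cp]).decode("cp1252"))
    except UnicodeDecodeError:
        # Python's cp1252 codec has no character assigned to this byte.
        return cp


if __name__ == "__main__":
    # &#153; read as Windows-1252 byte 153 is the trademark sign,
    # whose proper code point is 8482.
    print(repair_c1_reference(153))
```

A tool that cleans up legacy documents could use such a mapping to rewrite <code>&amp;#153;</code> as <code>&amp;#8482;</code>.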
 
The characters 9 (tab), 10 (linefeed), and 13 (carriage return) are allowed in HTML documents but, along with 32 (space), are all considered "[[whitespace (computer science)|whitespace]]".<ref>{{cite web |url= http://www.w3.org/TR/REC-html40/struct/text.html#h-9.1 |title= Text - White space |date= 24 December 1999 |website= HTML 4.01 Specification |publisher= World Wide Web Consortium (W3C) |accessdate= 2014-09-06}}</ref> The "form feed" control character, at code point 12, is not allowed in HTML documents, yet is also listed as one of the "white space" characters, perhaps an oversight in the specifications. In HTML, most consecutive occurrences of white space characters, except in a <code>&lt;pre&gt;</code> block, are interpreted as a single "word separator" for rendering purposes. A word separator is typically rendered as a single en-width space in European languages, but not in all other languages.
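The collapsing of consecutive white space can be approximated as follows. This is a rough sketch, not a model of a browser's rendering: the function name <code>collapse_white_space</code> is hypothetical, the character set is the one named in the paragraph above (including form feed), and the special handling of <code>&lt;pre&gt;</code> blocks is deliberately left out.

```python
import re

# Space, tab, linefeed, carriage return and form feed.
_WHITE_SPACE_RUN = re.compile(r"[ \t\n\r\f]+")


def collapse_white_space(text: str) -> str:
    """Replace each run of white space characters with one space,
    standing in for a single word separator."""
    return _WHITE_SPACE_RUN.sub(" ", text)


if __name__ == "__main__":
    print(collapse_white_space("one \t\n two\r\n\fthree"))
```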
 
===XML character references===