Revision as of 17:42, 28 August 2011: JMF (talk | contribs), Extended confirmed users, 61,389 edits. →Illegal characters: don't pipe Windows-1252 to somewhere else. Also, Windows has been based on Unicode since Windows NT/2000
Revision as of 21:56, 28 August 2011: Spitzak (talk | contribs), Extended confirmed users, 10,503 edits. →Illegal characters: Latest version of Windows still interprets byte files as CP1252
Line 20:
* 55296 to 57343 (xD800–xDFFF, the [[UTF-16]] surrogate halves)

These characters are ''not even allowed by reference''. That is, you should not even write them as [[numeric character reference]]s. However, references to characters 128–159 are commonly interpreted by lenient web browsers as if they were references to the characters assigned to ''bytes'' 128–159 (decimal) in the [[Windows-1252]] character encoding ~~used in historic [[Windows 9x|'9x' versions]] of Windows~~. This violates the HTML and SGML standards, and the characters are already assigned to higher code points, so HTML document authors should always use the higher code points. For example, for the trademark sign (™), use <code>&#8482;</code>, not <code>&#153;</code>.

The characters 9 (tab), 10 (linefeed), and 13 (carriage return) are allowed in HTML documents and, along with 32 (space), are all considered "[[whitespace (computer science)|whitespace]]".<ref>http://www.w3.org/TR/REC-html40/struct/text.html#h-9.1</ref> The "form feed" control character, at code point 12, is not allowed in HTML documents, yet the specification also lists it among the "white space" characters (perhaps an oversight in the specifications). In HTML, most consecutive occurrences of white space characters, except in a <code><pre></code> block, are interpreted as comprising a single "word separator" for rendering purposes. A word separator is typically rendered as a single en-width space in European languages, but not in others.
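The remapping of references 128–159 described above can be sketched in Python as a way to repair such references in existing markup. This is a minimal illustration rather than a definitive tool: the names <code>fix_c1_reference</code> and <code>normalize_references</code> are hypothetical, only decimal references are handled, and the sketch relies on Python's built-in <code>cp1252</code> codec, which leaves a few bytes in that range unassigned (those references are left unchanged).

```python
import re

# Matches decimal numeric character references such as "&#153;".
# Hexadecimal references ("&#x99;") are deliberately out of scope here.
_DECIMAL_REF = re.compile(r"&#(\d+);")


def fix_c1_reference(match):
    """Rewrite a reference in the range 128-159 to its proper code point.

    Hypothetical helper: the number is treated as a Windows-1252 byte,
    which mirrors how lenient browsers interpret such references.
    """
    code = int(match.group(1))
    if 128 <= code <= 159:
        try:
            char = bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            # Python's cp1252 codec has no mapping for this byte,
            # so the reference is returned untouched.
            return match.group(0)
        return "&#%d;" % ord(char)
    # Everything outside 128-159 is left as written.
    return match.group(0)


def normalize_references(html):
    """Replace every decimal reference 128-159 with the higher code point."""
    return _DECIMAL_REF.sub(fix_c1_reference, html)


if __name__ == "__main__":
    print(normalize_references("Acme&#153; costs &#128;5"))
```

Under these assumptions, <code>&#153;</code> becomes <code>&#8482;</code> (the trademark sign), matching the recommendation in the text, while references outside 128–159 pass through unchanged.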

HTML decimal character rendering: Difference between revisions