Unicode and HTML: Difference between revisions

m Reverting possible vandalism by 2A03:2880:3020:1FF5:FACE:B00C:0:8000 to version by Wbm1058. Report False Positive? Thanks, ClueBot NG. (2326363) (Bot)
m typo(s) fixed: For example → For example, (2) using AWB
Line 19:
Like HTML documents, an XHTML document is a sequence of Unicode characters. However, an XHTML document is an [[XML]] document, which, while not having an explicit "document character" layer of [[abstraction]], nevertheless relies upon a similar definition of permissible characters that covers most, but not all, of the Unicode/UCS character definitions. The sets used by HTML and XHTML/XML are slightly different, but these differences have little effect on the average document author.
 
Regardless of whether the document is HTML or XHTML, when stored on a [[file system]] or transmitted over a network, the document's characters are ''encoded'' as a sequence of [[bit]] [[octet (computing)|octet]]s (''[[byte]]s'') according to a particular character encoding. This encoding may either be a [[Unicode Transformation Format]], like [[UTF-8]], that can directly encode any Unicode character, or a legacy encoding, like [[Windows-1252]], that cannot. However, even when using encodings that do not support all Unicode characters, the encoded document may make use of [[numeric character references]]. For example, <code>&amp;#x263A;</code> ({{Unicode|☺}}) is used to indicate a smiling face character in the Unicode character set.
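
The distinction between an encoding that can represent a character directly and one that must fall back on a numeric character reference can be illustrated with a minimal Python sketch. It assumes Python's standard codecs, where <code>cp1252</code> is the codec name for Windows-1252; the helper <code>can_encode</code> is a hypothetical name introduced only for this illustration.

```python
import html

smiley = "\u263A"  # U+263A WHITE SMILING FACE

def can_encode(text, encoding):
    """Hypothetical helper: report whether `encoding` can represent `text` directly."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# UTF-8 can encode any Unicode character directly.
utf8_ok = can_encode(smiley, "utf-8")

# Windows-1252 has no byte for U+263A, so direct encoding fails.
legacy_ok = can_encode(smiley, "cp1252")

# Python's "xmlcharrefreplace" error handler substitutes a decimal
# numeric character reference for each unencodable character.
escaped = smiley.encode("cp1252", errors="xmlcharrefreplace")

# An HTML parser resolves the hexadecimal reference back to the character.
decoded = html.unescape("&#x263A;")
```

In this sketch the legacy-encoded bytes contain only ASCII characters, yet a parser that resolves the reference recovers the same Unicode character that UTF-8 stores directly.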
 
===Character encoding===
Line 52:
 
===Encoding trends===
Because of the legacy of 8-bit text representations in [[programming language]]s and [[operating system]]s and the desire to avoid burdening users with the need to understand the nuances of encoding, many text editors used by HTML authors are unable or unwilling to offer a choice of encodings when saving files to disk and often do not even allow input of characters beyond a very limited range. Consequently, many HTML authors are unaware of encoding issues and may not have any idea what encoding their documents actually use. Misunderstandings, such as the belief that the encoding declaration effects a change in the actual encoding (whereas it is actually just a label that could be inaccurate), are another reason for this editor attitude. A further factor working in the same direction is the arrival of UTF-8, which greatly diminishes the need for other encodings; modern editors therefore tend to default to UTF-8, as recommended by the HTML5 specification.<ref>{{Cite web|url=http://www.w3.org/TR/html5/semantics.html#charset|title=HTML5|author=Ian Hickson|accessdate=17 September 2011|year=2011|quote=Authors are encouraged to use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629] Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]}}</ref>
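
That an encoding declaration is only a label, and not a conversion, can be shown with a short Python sketch. The sample markup and variable names are invented for illustration; the scenario assumes an editor that silently saves in Windows-1252 (<code>cp1252</code> in Python) while the document claims to be UTF-8.

```python
# Hypothetical document: the declaration says UTF-8 ...
text = '<meta charset="utf-8"><p>caf\u00e9</p>'

# ... but the editor actually writes the bytes as Windows-1252.
stored = text.encode("cp1252")

# A consumer that trusts the label tries to decode the bytes as UTF-8.
try:
    stored.decode("utf-8")
    label_is_accurate = True
except UnicodeDecodeError:
    # The lone byte 0xE9 (Windows-1252 "é") is not valid UTF-8 here.
    label_is_accurate = False

# Decoding with the encoding that was really used recovers the text.
recovered = stored.decode("cp1252")
```

Changing the declaration in such a file does not alter the stored bytes; only re-saving the document in the declared encoding makes the label accurate.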
 
===Byte order mark/Unicode sniffing===