Revision as of 18:50, 29 August 2021 by 212.238.182.54 (no edit summary)
Revision as of 20:35, 12 October 2021 by Alexander Davronov (edit summary: →Specifying the document's character encoding: please, subsection; tag: Visual edit)
Line 29:

As the character encoding cannot be known until this{{clarify|date=October 2019}} declaration is parsed, there can be a problem knowing which encoding is used for the declaration itself. The main principle is that the declaration shall be encoded in pure ASCII, and therefore (if the declaration is inside the file) the encoding needs to be an [[ASCII extension]]. In order to allow encodings not backwards compatible with ASCII, browsers must be able to parse declarations in such encodings. Examples of such encodings are [[UTF-16BE]] and [[UTF-16LE]].

=== Encoding detection algorithm ===

As of HTML5 the recommended charset is [[UTF-8]].<ref name=html5charset/> An "encoding sniffing algorithm" is defined in the specification to determine the character encoding of the document based on multiple sources of input, including:

# Explicit user instruction

Line 36 ⟶ 37:

# Analysis of the document bytes looking for specific sequences or ranges of byte values,<ref>[http://www.w3.org/TR/html5/syntax.html#prescan-a-byte-stream-to-determine-its-encoding HTML5 prescan a byte stream to determine its encoding]</ref> and other tentative detection mechanisms.

~~For ASCII-compatible character encodings the consequence of choosing incorrectly is that characters outside~~ Characters outside of the printable ASCII range (32 to 126) usually appear incorrectly. This presents few problems for [[English language|English]]-speaking users, but other languages regularly (in some cases, always) require characters outside that range.

In Chinese, Japanese, and Korean ([[CJK]]) language environments, where several different multi-byte encodings are in use, auto-detection is also often employed. Finally, browsers usually permit the user to override an ''incorrect'' charset label manually as well.

It is increasingly common for multilingual websites and websites in non-Western languages to use [[UTF-8]], which allows use of the same encoding for all languages.
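As a rough illustration of the byte-analysis step in the encoding sniffing algorithm described above, the following Python sketch checks for a byte order mark and then scans the start of the document for a <code>charset</code> declaration. It is a simplification and not the algorithm from the specification: the function name <code>sniff_encoding</code>, the regular expression, the 1024-byte scan window, the <code>windows-1252</code> fallback and the order of the checks are all assumptions made for this example.

```python
import re

# Byte order marks mapped to encoding labels (simplified BOM check).
BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16be"),
    (b"\xff\xfe", "utf-16le"),
)

# Hypothetical, simplified pattern: matches charset=... inside a <meta> tag,
# covering both <meta charset="..."> and the http-equiv/content form.
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)",
    re.IGNORECASE,
)


def sniff_encoding(data, user_override=None, fallback="windows-1252"):
    """Guess the encoding label of an HTML byte stream (illustrative only)."""
    # 1. An explicit user instruction wins (assumed ordering for this sketch).
    if user_override:
        return user_override
    # 2. Look for a byte order mark at the very start of the stream.
    for bom, label in BOMS:
        if data.startswith(bom):
            return label
    # 3. Scan only the first 1024 bytes for an ASCII-encoded declaration.
    match = META_CHARSET.search(data[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # 4. Nothing found: fall back to an assumed default.
    return fallback


print(sniff_encoding(b'<!DOCTYPE html><meta charset="UTF-8"><p>hi</p>'))
```

The scan in step 3 only works because the declaration itself is plain ASCII; a UTF-16 document without a byte order mark would not match this pattern, which is the bootstrapping problem noted at line 29.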
[[UTF-16]] or [[UTF-32]], which can be used for all languages as well, are less widely used because they can be harder to handle in programming languages that assume a [[byte-oriented]] ASCII superset encoding, and they are less efficient for text with a high frequency of ASCII characters, which is usually the case for HTML documents.
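The efficiency point can be checked with a small Python sketch that encodes the same HTML fragment in the three encodings and compares byte counts. The sample markup and the helper name <code>encoded_sizes</code> are made up for this illustration; the little-endian codec variants are used so that no byte order mark is included in the counts.

```python
def encoded_sizes(text):
    """Return the byte length of `text` in UTF-8, UTF-16 and UTF-32."""
    # The "-le" codecs do not prepend a byte order mark, so the counts
    # reflect only the encoded characters.
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


# Hypothetical sample page: mostly ASCII markup around a little Cyrillic text.
SAMPLE = (
    '<!DOCTYPE html><html><head><meta charset="utf-8">'
    "<title>Пример</title></head><body><p>Привет</p></body></html>"
)

for encoding, size in encoded_sizes(SAMPLE).items():
    print(f"{encoding}: {size} bytes")
```

Because the markup itself is ASCII, UTF-8 spends one byte on most characters of this sample, while UTF-16 spends two bytes and UTF-32 four bytes on every character.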

Character encodings in HTML: Difference between revisions