Revision as of 09:38, 28 November 2023 by Zsohl (talk | contribs) (minor): Reverted edits by Saidul777 (talk) (HG) (3.4.12)
Revision as of 15:21, 2 January 2024 by Comp.arch (talk | contribs) (minor): No edit summary
Line 36:
With this second approach, the character encoding cannot be known until the declaration is parsed, so it is unclear which encoding applies to the document up to and including the declaration itself. If the character encoding is an [[ASCII extension]], then the content up to and including the declaration should be pure ASCII and parsing will work correctly. For character encodings that are not ASCII extensions (i.e. not a superset of ASCII), such as [[UTF-16BE]] and [[UTF-16LE]], a processor of HTML, such as a web browser, should be able to parse the declaration in some cases through the use of heuristics.

=== Encoding detection algorithm ===
As of HTML5 the recommended charset is [[UTF-8]].<ref name=html5charset/> An "encoding sniffing algorithm" is defined in the specification to determine the character encoding of the document based on multiple sources of input, including:

# Explicit user instruction
Line 44:
# Analysis of the document bytes looking for specific sequences or ranges of byte values,<ref>{{cite web| url = http://www.w3.org/TR/html5/syntax.html#prescan-a-byte-stream-to-determine-its-encoding| title = HTML5 prescan a byte stream to determine its encoding}}</ref> and other tentative detection mechanisms.

Characters outside of the printable ASCII range (32 to 126) usually appear incorrectly. This presents few problems for [[English language|English]]-speaking users, but other languages regularly (in some cases, always) require characters outside that range. In Chinese, Japanese, and Korean ([[CJK characters|CJK]]) language environments, where several different multi-byte encodings are in use, auto-detection is also often employed. Finally, browsers usually permit the user to override an ''incorrect'' charset label manually as well.

It is increasingly common for multilingual websites and websites in non-Western languages to use [[UTF-8]], which allows use of the same encoding for all languages.
[[UTF-16]] or [[UTF-32]], which can be used for all languages as well, are less widely used because they can be harder to handle in programming languages that assume a [[byte-oriented]] ASCII superset encoding, and they are less efficient for text with a high frequency of ASCII characters, which is usually the case for HTML documents.
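
The efficiency point can be checked with a short Python sketch. The sample markup, the CJK sample string and the helper name <code>encoded_sizes</code> are assumptions chosen only for illustration; the codec names are those of Python's standard library, and the little-endian variants are used so that no byte order mark is counted.

```python
# Hypothetical, ASCII-only sample markup used purely for illustration.
SAMPLE_HTML = (
    '<!DOCTYPE html><html><head><meta charset="utf-8">'
    "<title>Test</title></head><body><p>Hello</p></body></html>"
)

# Hypothetical non-ASCII sample: three CJK characters from the Basic Multilingual Plane.
SAMPLE_CJK = "日本語"


def encoded_sizes(text: str) -> dict[str, int]:
    """Return the encoded byte length of text for three Unicode encodings (no BOM)."""
    return {
        name: len(text.encode(name))
        for name in ("utf-8", "utf-16-le", "utf-32-le")
    }


html_sizes = encoded_sizes(SAMPLE_HTML)
cjk_sizes = encoded_sizes(SAMPLE_CJK)

for label, sizes in (("markup", html_sizes), ("CJK text", cjk_sizes)):
    for name, size in sizes.items():
        print(f"{label} in {name}: {size} bytes")
```

For ASCII-only markup, UTF-16 needs two bytes and UTF-32 four bytes for every byte of UTF-8. For the CJK sample UTF-16 is smaller than UTF-8, but in a typical HTML document the surrounding tags and attributes are ASCII, which is why UTF-8 usually comes out ahead overall.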

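Two of the inputs listed under "Encoding detection algorithm", a byte order mark and a scan of the leading bytes for a charset declaration, can be illustrated with a minimal Python sketch. It is a deliberate simplification and not the algorithm defined in the specification: the function name <code>sniff_encoding</code>, the 1024-byte scan window, the loose regular expression and the <code>windows-1252</code> fallback are all assumptions made for this example, and the returned label is not validated against any list of known encodings.

```python
import re

# Byte order marks checked first in this sketch.
_BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
)

# Deliberately loose pattern (an assumption of this sketch); a real prescan
# walks the bytes with a tokenizer-like state machine instead of a regex.
_META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)",
    re.IGNORECASE,
)


def sniff_encoding(data: bytes, fallback: str = "windows-1252") -> str:
    """Guess an encoding label from a BOM or an ASCII-compatible meta declaration."""
    # 1. A byte order mark identifies the encoding directly.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 2. Otherwise look for a charset declaration near the start of the document.
    #    This only works when the declaration itself is readable as ASCII.
    match = _META_CHARSET.search(data[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. Nothing found: return the caller-supplied default.
    return fallback
```

For example, <code>sniff_encoding(b'&lt;meta charset="ISO-8859-2"&gt;')</code> yields <code>iso-8859-2</code>, while input with no BOM and no declaration yields the fallback. A UTF-16 document without a BOM defeats step 2 because its declaration is not ASCII-readable, which is the case where the article notes that heuristics are needed.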
Character encodings in HTML: Difference between revisions