Comparison of Unicode encodings
 
The next 1,920 characters, U+0080 to U+07FF (encompassing the remainder of almost all [[Latin-script alphabet]]s, and also [[Greek alphabet|Greek]], [[Cyrillic script|Cyrillic]], [[Coptic alphabet|Coptic]], [[Armenian alphabet|Armenian]], [[Hebrew alphabet|Hebrew]], [[Arabic alphabet|Arabic]], [[Syriac alphabet|Syriac]], [[Tāna]] and [[N'Ko alphabet|N'Ko]]), require 16 bits to encode in both UTF-8 and UTF-16, and 32 bits in UTF-32. For U+0800 to U+FFFF, i.e. the remainder of the characters in the [[Basic Multilingual Plane]] (BMP, plane 0, U+0000 to U+FFFF), which encompasses the rest of the characters of most of the world's living languages, UTF-8 needs 24 bits to encode a character, while UTF-16 needs 16 bits and UTF-32 needs 32. Code points U+010000 to U+10FFFF, which represent characters in the [[Plane (Unicode)|supplementary planes]] (planes 1–16), require 32 bits in UTF-8, UTF-16 and UTF-32.
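
For illustration, these per-range sizes can be checked with a short Python sketch (the sample characters are arbitrary; the little-endian codec names are used so that no byte order mark is counted):

<syntaxhighlight lang="python">
# One arbitrarily chosen character from each code point range.
for ch in ("A",      # U+0041, ASCII
           "é",      # U+00E9, U+0080 to U+07FF
           "€",      # U+20AC, U+0800 to U+FFFF (rest of the BMP)
           "😀"):    # U+1F600, supplementary planes
    print(f"U+{ord(ch):04X}:",
          len(ch.encode("utf-8")), "bytes in UTF-8,",
          len(ch.encode("utf-16-le")), "in UTF-16,",
          len(ch.encode("utf-32-le")), "in UTF-32")
# Output: 1/2/4, 2/2/4, 3/2/4 and 4/4/4 bytes respectively.
</syntaxhighlight>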
 
Therefore, a file is shorter in UTF-8 than in UTF-16 if there are more ASCII code points than there are code points in the range U+0800 to U+FFFF. A surprising result is that real-world documents written in languages that use characters only in the high range are still often shorter in UTF-8, due to the extensive use of spaces, digits, punctuation, newlines, HTML markup, and embedded words and acronyms written with Latin letters.<ref>{{Cite web |title=UTF-8 Everywhere |url=https://utf8everywhere.org/#asian |access-date=2022-08-28 |website=utf8everywhere.org}}</ref> UTF-32 is always longer unless there are no code points less than U+10000.
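
For illustration, a short Python sketch on an invented sample of this kind of text (Japanese prose with markup, a digit and a Latin word):

<syntaxhighlight lang="python">
# Invented sample: Japanese text with the HTML markup, digits and
# Latin words typical of real-world documents.
sample = '<p lang="ja">第1章 Unicode とは何か?</p>'
for name in ("utf-8", "utf-16-le", "utf-32-le"):
    print(name, len(sample.encode(name)), "bytes")
# UTF-8 comes out smallest here even though every Japanese character
# costs 3 bytes in UTF-8 and only 2 in UTF-16.
</syntaxhighlight>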
 
All printable characters in [[UTF-EBCDIC]] use at least as many bytes as in UTF-8, and most use more, because of a design decision to allow the C1 control codes to be encoded as single bytes. For seven-bit environments, [[UTF-7]] is more space efficient than the combination of other Unicode encodings with [[quoted-printable]] or [[base64]] for almost all types of text (see "[[#Seven-bit environments|Seven-bit environments]]" below).
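
For illustration, Python's standard library ships a UTF-7 codec alongside base64 and quoted-printable encoders, so the comparison can be sketched directly (the sample text is an arbitrary Latin/Cyrillic mix):

<syntaxhighlight lang="python">
import base64
import quopri

text = "Unicode (Юникод) is a character encoding standard."
utf8 = text.encode("utf-8")
print("UTF-7:                   ", len(text.encode("utf-7")), "bytes")
print("UTF-8 + base64:          ", len(base64.b64encode(utf8)), "bytes")
print("UTF-8 + quoted-printable:", len(quopri.encodestring(utf8)), "bytes")
# UTF-7 leaves the ASCII untouched and base64-encodes only the
# Cyrillic run, so it is the shortest of the three for this text.
</syntaxhighlight>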
 
===Storage utilization===
Each format has its own set of advantages and disadvantages with respect to storage efficiency (and thus also transmission time) and processing efficiency. Storage efficiency depends on where within the Unicode [[code point|code space]] a given text's characters predominantly lie. Since Unicode code space blocks are organized by character set (i.e. alphabet/script), the storage efficiency of any given text effectively depends on the [[alphabet|alphabet/script]] used for that text. For example, UTF-8 needs one less byte per character (8 versus 16 bits) than UTF-16 for the 128 code points between U+0000 and U+007F, but needs one more byte per character (24 versus 16 bits) for the 63,488 code points between U+0800 and U+FFFF. Therefore, if there are more characters in the range U+0000 to U+007F than there are in the range U+0800 to U+FFFF then UTF-8 is more efficient, while if there are fewer, then UTF-16 is more efficient. If the counts are equal then the two encodings produce exactly the same size.
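
For illustration, the counting rule above can be written out directly; in this sketch the function names are invented, and byte order marks and unpaired surrogates are ignored:

<syntaxhighlight lang="python">
def utf8_bytes(text: str) -> int:
    # Bytes per code point by range, as tabulated above.
    return sum(1 if cp < 0x80 else 2 if cp < 0x800 else
               3 if cp < 0x10000 else 4 for cp in map(ord, text))

def utf16_bytes(text: str) -> int:
    # 2 bytes for the BMP, 4 (a surrogate pair) above it.
    return sum(2 if cp < 0x10000 else 4 for cp in map(ord, text))

s = "Größe: 10 µm"          # arbitrary sample
assert utf8_bytes(s) == len(s.encode("utf-8"))
assert utf16_bytes(s) == len(s.encode("utf-16-le"))
</syntaxhighlight>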
 
===Processing time===
Text with a variable-length encoding such as UTF-8 or UTF-16 is harder to process if there is a need to work with individual code units, as opposed to working with sequences of code units. Searching is unaffected by whether the characters are variable sized, since a search for a sequence of code units does not care about the divisions (it does require that the encoding be self-synchronizing, which both UTF-8 and UTF-16 are). A common misconception is that there is a need to "find the ''n''th character" and that this requires a fixed-length encoding; however, in real use the number ''n'' is only derived from examining the {{nowrap|''n−1''}} characters, thus sequential access is needed anyway.{{Citation needed|date=October 2013}}
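
For illustration, both points can be sketched on raw UTF-8 bytes in Python (the helper function is invented for this example):

<syntaxhighlight lang="python">
data = "naïve café".encode("utf-8")

# Searching needs no decoding: because UTF-8 is self-synchronizing,
# a byte-level match can never begin inside a multi-byte character.
print(data.find("café".encode("utf-8")))   # byte offset of the match

# "Find the nth character" does require a scan: count lead bytes
# (anything that is not a 0b10xxxxxx continuation byte).
def nth_char_offset(b: bytes, n: int) -> int:
    seen = 0
    for i, byte in enumerate(b):
        if byte & 0xC0 != 0x80:            # lead byte of a character
            if seen == n:
                return i
            seen += 1
    raise IndexError(n)

print(nth_char_offset(data, 6))            # offset of "c", past 2-byte "ï"
</syntaxhighlight>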

[[UTF-16BE]] and [[UTF-32BE]] are [[endianness|big-endian]]; [[UTF-16LE]] and [[UTF-32LE]] are little-endian. When character sequences in one endian order are loaded onto a machine with a different endian order, the characters need to be converted before they can be processed efficiently (or two processors are needed). Byte-based encodings such as UTF-8 do not have this problem.
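
For illustration, a Python sketch of such a conversion (the sample word is arbitrary, and 2-byte array items are assumed, as on common platforms):

<syntaxhighlight lang="python">
import array
import sys

data_be = "Σημεῖον".encode("utf-16-be")    # e.g. read from a big-endian file

# On a little-endian machine the 16-bit code units must be
# byte-swapped before they can be handled as native integers.
units = array.array("H")                    # native-order 16-bit units
units.frombytes(data_be)
if sys.byteorder == "little":
    units.byteswap()

native = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
print(units.tobytes().decode(native))       # Σημεῖον
</syntaxhighlight>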
 
== Processing issues ==