Comparison of Unicode encodings
 
The next 1,920 characters, U+0080 to U+07FF (encompassing the remainder of almost all [[Latin-script alphabet]]s, and also [[Greek alphabet|Greek]], [[Cyrillic script|Cyrillic]], [[Coptic alphabet|Coptic]], [[Armenian alphabet|Armenian]], [[Hebrew alphabet|Hebrew]], [[Arabic alphabet|Arabic]], [[Syriac alphabet|Syriac]], [[Tāna]] and [[N'Ko alphabet|N'Ko]]), require 16 bits to encode in both UTF-8 and UTF-16, and 32 bits in UTF-32. For U+0800 to U+FFFF, i.e. the remainder of the characters in the [[Basic Multilingual Plane]] (BMP, plane 0, U+0000 to U+FFFF), which encompasses the rest of the characters of most of the world's living languages, UTF-8 needs 24 bits to encode a character, while UTF-16 needs 16 bits and UTF-32 needs 32. Code points U+010000 to U+10FFFF, which represent characters in the [[Plane (Unicode)|supplementary planes]] (planes 1–16), require 32 bits in UTF-8, UTF-16 and UTF-32.
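
For illustration, these per-range sizes can be checked with a short Python sketch (the sample characters are arbitrary; the little-endian codec names are used so that no byte order mark is counted):

<syntaxhighlight lang="python">
# One arbitrarily chosen character from each code point range.
for ch in ("A",      # U+0041, ASCII
           "é",      # U+00E9, U+0080 to U+07FF
           "€",      # U+20AC, U+0800 to U+FFFF (rest of the BMP)
           "😀"):    # U+1F600, supplementary planes
    print(f"U+{ord(ch):04X}:",
          len(ch.encode("utf-8")), "bytes in UTF-8,",
          len(ch.encode("utf-16-le")), "in UTF-16,",
          len(ch.encode("utf-32-le")), "in UTF-32")
# Output: 1/2/4, 2/2/4, 3/2/4 and 4/4/4 bytes respectively.
</syntaxhighlight>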
 
Therefore, a file is shorter in UTF-8 than in UTF-16 if there are more ASCII code points than there are code points in the range U+0800 to U+FFFF. A surprising result is that real-world documents written in languages that use characters only in the high range are still often shorter in UTF-8, due to the extensive use of spaces, digits, punctuation, newlines, HTML markup, and embedded words and acronyms written with Latin letters.<ref>{{Cite web |title=UTF-8 Everywhere |url=https://utf8everywhere.org/#asian |access-date=2022-08-28 |website=utf8everywhere.org}}</ref> UTF-32 is always longer unless there are no code points less than U+10000.
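
For illustration, a short Python sketch on an invented sample of this kind of text (Japanese prose with markup, a digit and a Latin word):

<syntaxhighlight lang="python">
# Invented sample: Japanese text with the HTML markup, digits and
# Latin words typical of real-world documents.
sample = '<p lang="ja">第1章 Unicode とは何か?</p>'
for name in ("utf-8", "utf-16-le", "utf-32-le"):
    print(name, len(sample.encode(name)), "bytes")
# UTF-8 comes out smallest here even though every Japanese character
# costs 3 bytes in UTF-8 and only 2 in UTF-16.
</syntaxhighlight>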
 
All printable characters in [[UTF-EBCDIC]] use at least as many bytes as in UTF-8, and most use more, because of a design decision to allow the C1 control codes to be encoded as single bytes. For seven-bit environments, [[UTF-7]] is more space efficient than the combination of other Unicode encodings with [[quoted-printable]] or [[base64]] for almost all types of text (see "[[#Seven-bit environments|Seven-bit environments]]" below).
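
For illustration, Python's standard library ships a UTF-7 codec alongside base64 and quoted-printable encoders, so the comparison can be sketched directly (the sample text is an arbitrary Latin/Cyrillic mix):

<syntaxhighlight lang="python">
import base64
import quopri

text = "Unicode (Юникод) is a character encoding standard."
utf8 = text.encode("utf-8")
print("UTF-7:                   ", len(text.encode("utf-7")), "bytes")
print("UTF-8 + base64:          ", len(base64.b64encode(utf8)), "bytes")
print("UTF-8 + quoted-printable:", len(quopri.encodestring(utf8)), "bytes")
# UTF-7 leaves the ASCII untouched and base64-encodes only the
# Cyrillic run, so it is the shortest of the three for this text.
</syntaxhighlight>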
 
===Storage utilization===
Each format has its own set of advantages and disadvantages with respect to storage efficiency (and thus also transmission time) and processing efficiency. Storage efficiency depends on where within the Unicode [[code point|code space]] a given text's characters predominantly lie. Since Unicode code space blocks are organized by character set (i.e. alphabet/script), the storage efficiency of any given text effectively depends on the [[alphabet|alphabet/script]] used for that text. For example, UTF-8 needs one less byte per character (8 versus 16 bits) than UTF-16 for the 128 code points between U+0000 and U+007F, but needs one more byte per character (24 versus 16 bits) for the 63,488 code points between U+0800 and U+FFFF. Therefore, if there are more characters in the range U+0000 to U+007F than there are in the range U+0800 to U+FFFF then UTF-8 is more efficient, while if there are fewer, then UTF-16 is more efficient. If the counts are equal then the two encodings produce exactly the same size.
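
For illustration, the counting rule above can be written out directly; in this sketch the function names are invented, and byte order marks and unpaired surrogates are ignored:

<syntaxhighlight lang="python">
def utf8_bytes(text: str) -> int:
    # Bytes per code point by range, as tabulated above.
    return sum(1 if cp < 0x80 else 2 if cp < 0x800 else
               3 if cp < 0x10000 else 4 for cp in map(ord, text))

def utf16_bytes(text: str) -> int:
    # 2 bytes for the BMP, 4 (a surrogate pair) above it.
    return sum(2 if cp < 0x10000 else 4 for cp in map(ord, text))

s = "Größe: 10 µm"          # arbitrary sample
assert utf8_bytes(s) == len(s.encode("utf-8"))
assert utf16_bytes(s) == len(s.encode("utf-16-le"))
</syntaxhighlight>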
 
===Processing time===
Text with a variable-length encoding such as UTF-8 or UTF-16 is harder to process if there is a need to work with individual code units, as opposed to working with sequences of code units. Searching is unaffected by whether the characters are variable sized, since a search for a sequence of code units does not care about the divisions (it does require that the encoding be self-synchronizing, which both UTF-8 and UTF-16 are). A common misconception is that there is a need to "find the ''n''th character" and that this requires a fixed-length encoding; however, in real use the number ''n'' is only derived from examining the {{nowrap|''n−1''}} characters, thus sequential access is needed anyway.{{Citation needed|date=October 2013}}
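
For illustration, both points can be sketched on raw UTF-8 bytes in Python (the helper function is invented for this example):

<syntaxhighlight lang="python">
data = "naïve café".encode("utf-8")

# Searching needs no decoding: because UTF-8 is self-synchronizing,
# a byte-level match can never begin inside a multi-byte character.
print(data.find("café".encode("utf-8")))   # byte offset of the match

# "Find the nth character" does require a scan: count lead bytes
# (anything that is not a 0b10xxxxxx continuation byte).
def nth_char_offset(b: bytes, n: int) -> int:
    seen = 0
    for i, byte in enumerate(b):
        if byte & 0xC0 != 0x80:            # lead byte of a character
            if seen == n:
                return i
            seen += 1
    raise IndexError(n)

print(nth_char_offset(data, 6))            # offset of "c", past 2-byte "ï"
</syntaxhighlight>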

[[UTF-16BE]] and [[UTF-32BE]] are [[endianness|big-endian]]; [[UTF-16LE]] and [[UTF-32LE]] are little-endian. When character sequences in one endian order are loaded onto a machine with a different endian order, the characters need to be converted before they can be processed efficiently (or two processors are needed). Byte-based encodings such as UTF-8 do not have this problem.
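
For illustration, a Python sketch of such a conversion (the sample word is arbitrary, and 2-byte array items are assumed, as on common platforms):

<syntaxhighlight lang="python">
import array
import sys

data_be = "Σημεῖον".encode("utf-16-be")    # e.g. read from a big-endian file

# On a little-endian machine the 16-bit code units must be
# byte-swapped before they can be handled as native integers.
units = array.array("H")                    # native-order 16-bit units
units.frombytes(data_be)
if sys.byteorder == "little":
    units.byteswap()

native = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
print(units.tobytes().decode(native))       # Σημεῖον
</syntaxhighlight>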
 
== Processing issues ==