Comparison of Unicode encodings: Difference between revisions

{{Use dmy dates|date=July 2023}}
{{More footnotes needed|date=July 2019}}
This article compares [[Unicode]] encodings in two types of environments: [[8-bit-clean]] environments (which can be assumed), and environments that forbid the use of [[byte]] values with the high bit set. Originally, such prohibitions allowed for links that used only seven data bits, but they remain in some standards and so some standard-conforming software must generate messages that comply with the restrictions.{{explain|date=July 2024}} The [[Standard Compression Scheme for Unicode]] and the [[Binary Ordered Compression for Unicode]] are excluded from the comparison tables because it is difficult to simply quantify their size.
 
== Compatibility issues ==
A [[UTF-8]] file that contains only [[ASCII]] characters is identical to an ASCII file. Legacy programs can generally handle UTF-8 encoded files, even if they contain non-ASCII characters. For instance, the [[C (programming language)|C]] [[printf]] function can print a UTF-8 string, because it only looks for the ASCII '%' character to define a formatting string. All other bytes are printed unchanged, so non-ASCII characters are output unmodified.
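This pass-through behaviour can be sketched with a toy byte-oriented formatter (a hypothetical <code>ascii_format</code> helper standing in for C's ''printf'', not any real library function): it inspects only the ASCII '%' byte and copies every other byte through unchanged, so UTF-8 multi-byte sequences survive intact.

```python
# Hypothetical ascii_format: a byte-oriented formatter in the style of
# C's printf. Only the ASCII "%s" pattern is interpreted; all other
# bytes, including UTF-8 multi-byte sequences, are copied verbatim.
def ascii_format(fmt: bytes, *args: bytes) -> bytes:
    out, values, i = bytearray(), iter(args), 0
    while i < len(fmt):
        if fmt[i:i + 2] == b"%s":   # the only byte pattern it interprets
            out += next(values)
            i += 2
        else:                       # everything else copied unchanged
            out.append(fmt[i])
            i += 1
    return bytes(out)

result = ascii_format("Grüße, %s!".encode("utf-8"), "Welt".encode("utf-8"))
print(result.decode("utf-8"))  # Grüße, Welt!
```

The non-ASCII bytes of "ü" and "ß" pass through the ASCII-era logic untouched and still decode as valid UTF-8.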
 
[[UTF-16]] and [[UTF-32]] are incompatible with ASCII files, and thus require [[Unicode]]-aware programs to display, print, and manipulate them, even if the file is known to contain only characters in the ASCII subset. Because they contain many zero bytes, character strings representing such files cannot be manipulated by common [[null-terminated string]] handling logic.{{efn|ASCII software ''not'' using null characters to terminate strings would handle UTF-16 and UTF-32 encoded files correctly (such files, if containing only ASCII-subset characters, would appear as normal ASCII padded with [[null character]]s), but such software is not common.{{cn|date=July 2024}}}} The prevalence of string handling using this logic means that, even in the context of UTF-16 systems such as [[Windows]] and [[Java (software platform)|Java]], UTF-16 text files are not commonly used. Rather, older 8-bit encodings such as ASCII or [[ISO-8859-1]] are still used, forgoing Unicode support entirely, or UTF-8 is used for Unicode.{{cn|date=July 2024}} One rare counter-example is the [[Mac OS X Panther|Mac OS X 10.3 Panther]] and later "strings" file used by applications to look up internationalized versions of messages. By default, this file is encoded in UTF-16, with "files encoded using UTF-8 ... not guaranteed to work."<ref>{{Cite web|url=https://developer.apple.com/documentation/MacOSX/Conceptual/BPInternational/Articles/StringsFiles.html|title=Apple Developer Connection: Internationalization Programming Topics: Strings Files}}</ref>
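The failure mode can be illustrated with a sketch of C-style null-terminated handling (a hypothetical <code>c_strlen</code> mimicking C's ''strlen''): scanning stops at the first zero byte, which in UTF-16 appears after the very first ASCII character.

```python
# Hypothetical c_strlen, mimicking C's strlen: the string is assumed
# to end at the first zero byte.
def c_strlen(buf: bytes) -> int:
    return buf.index(0) if 0 in buf else len(buf)

utf8  = "Hi".encode("utf-8")      # b'Hi'          -- no zero bytes
utf16 = "Hi".encode("utf-16-le")  # b'H\x00i\x00'  -- a zero after each ASCII char

print(c_strlen(utf8))   # 2: the whole string is seen
print(c_strlen(utf16))  # 1: truncated after 'H'
```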
 
[[XML]] is [[de facto|conventionally]] encoded as UTF-8,{{cn|date=July 2024}} and all XML processors must at least support UTF-8 (including US-ASCII by definition) and UTF-16.<ref>{{cite web
|url=http://www.w3.org/TR/xml/#charencoding
|title=Character Encoding in Entities
}}</ref>
 
== Efficiency ==
[[UTF-8]] requires 8, 16, 24 or 32 bits (one to four [[Octet (computing)|bytes]]) to encode a Unicode character, [[UTF-16]] requires either 16 or 32 bits to encode a character, and [[UTF-32]] always requires 32 bits to encode a character. The first 128 Unicode [[code point]]s, U+0000 to U+007F, which are used for the [[C0 Controls and Basic Latin]] characters and correspond one-to-one to their ASCII-code equivalents, are encoded using 8 bits in UTF-8, 16 bits in UTF-16, and 32 bits in UTF-32.
 
The next 1,920 characters, U+0080 to U+07FF, represent the rest of the characters used by almost all [[Latin-script alphabet]]s, as well as [[Greek alphabet|Greek]], [[Cyrillic script|Cyrillic]], [[Coptic alphabet|Coptic]], [[Armenian alphabet|Armenian]], [[Hebrew alphabet|Hebrew]], [[Arabic alphabet|Arabic]], [[Syriac alphabet|Syriac]], [[Tāna]] and [[N'Ko alphabet|N'Ko]]. Characters in this range require 16 bits to encode in both UTF-8 and UTF-16, and 32 bits in UTF-32. For U+0800 to U+FFFF, the remaining characters in the [[Basic Multilingual Plane]], which suffice to represent the rest of the characters of most of the world's living languages, UTF-8 needs 24 bits to encode a character, while UTF-16 needs 16 bits and UTF-32 needs 32. Code points U+010000 to U+10FFFF, which represent characters in the [[Plane (Unicode)|supplementary planes]] (planes 1–16), require 32 bits in UTF-8, UTF-16 and UTF-32.
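The per-range sizes above can be verified directly, for example with a small Python helper (the <code>bits</code> function is illustrative, not a standard API) applied to one code point from each range:

```python
# Encoded size, in bits, of one code point in each encoding scheme.
def bits(ch: str, encoding: str) -> int:
    return len(ch.encode(encoding)) * 8

# U+0041 (ASCII), U+00E9 (U+0080..U+07FF), U+20AC (U+0800..U+FFFF),
# U+1F600 (supplementary plane)
for ch in "Aé€😀":
    print(f"U+{ord(ch):04X}", bits(ch, "utf-8"),
          bits(ch, "utf-16-le"), bits(ch, "utf-32-le"))
```

The output matches the table in the text: 8/16/32 bits for ASCII, 16/16/32 for U+0080–U+07FF, 24/16/32 for the rest of the BMP, and 32/32/32 for the supplementary planes.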
 
A file is shorter in UTF-8 than in UTF-16 if there are more ASCII code points than there are code points in the range U+0800 to U+FFFF. Advocates of UTF-8 as the preferred form argue that real-world documents written in languages that use characters only in the high range are still often shorter in UTF-8, due to the extensive use of spaces, digits, punctuation, newlines, [[HTML]] markup, and embedded words and acronyms written with Latin letters.<ref>{{Cite web |title=UTF-8 Everywhere |url=https://utf8everywhere.org/#asian |access-date=2022-08-28 |website=utf8everywhere.org}}</ref> UTF-32, by contrast, is always longer unless there are no code points less than U+10000.
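This effect can be seen with a contrived but representative snippet (the sample markup below is hypothetical): a few CJK characters, each costing 3 bytes in UTF-8 versus 2 in UTF-16, are outweighed by the many ASCII markup and digit characters, each costing 1 byte versus 2.

```python
# Hypothetical HTML-like snippet: 3 CJK characters amid 13 ASCII
# characters of markup, digits, and whitespace per line.
doc = "<p>価格: 100円</p>\n" * 1000

utf8_len  = len(doc.encode("utf-8"))      # CJK chars cost 3 bytes each here
utf16_len = len(doc.encode("utf-16-le"))  # every BMP char costs 2 bytes

print(utf8_len, utf16_len)  # UTF-8 is smaller despite the CJK characters
```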
 
All printable characters in [[UTF-EBCDIC]] use at least as many bytes as in UTF-8, and most use more, due to a decision made to allow encoding the C1 control codes as single bytes. For seven-bit environments, [[UTF-7]] is more space efficient than the combination of other Unicode encodings with [[quoted-printable]] or [[base64]] for almost all types of text{{explain|date=July 2024}} (see "[[#Seven-bit environments|Seven-bit environments]]" below).
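The seven-bit comparison can be sketched with Python's built-in <code>utf-7</code> codec against [[base64]]-wrapped UTF-16 (the sample text is illustrative; exact savings depend on how much of the text is ASCII):

```python
import base64

text = "Héllo wörld " * 50  # mostly ASCII with a few accented letters

# UTF-7 keeps ASCII as-is and wraps only the non-ASCII runs in
# short base64 islands (+...-).
utf7 = text.encode("utf-7")

# Base64-wrapping UTF-16 inflates the entire text by a factor of 4/3
# on top of UTF-16's 2 bytes per character.
b64 = base64.b64encode(text.encode("utf-16-le"))

print(len(utf7), len(b64))  # UTF-7 is considerably smaller here
```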
 
===Processing time===
Text with a variable-length encoding such as UTF-8 or UTF-16 is harder to process if there is a need to work with individual code units, as opposed to working with sequences of code units. Searching is unaffected by whether the characters are variably sized, since a search for a sequence of code units does not care about the divisions. However, it does require that the encoding be [[self-synchronizing code|self-synchronizing]], which both UTF-8 and UTF-16 are. A common misconception is that there is a need to "find the ''n''th character" and that this requires a fixed-length encoding; however, in real use the number ''n'' is only derived from examining the {{nowrap|''n−1''}} characters, thus sequential access is needed anyway.{{Citation needed|date=October 2013}}
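For illustration, a plain byte-level search on the encoded form finds a multi-byte substring without decoding anything:

```python
haystack = "Grüße aus Zürich".encode("utf-8")
needle   = "Zürich".encode("utf-8")

# Byte-level search needs no decoding. Because UTF-8 is
# self-synchronizing (lead and continuation bytes occupy disjoint
# ranges), the needle's bytes can never match starting in the middle
# of another character's multi-byte sequence.
pos = haystack.find(needle)
print(pos)
```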
 
Efficiently using character sequences in one [[endianness|endian order]] loaded onto a machine with a different endian order requires extra processing: the characters must either be converted before use, or processed with two distinct systems. Byte-based encodings such as UTF-8 do not have this problem.{{why|date=July 2024}} [[UTF-16BE]] and [[UTF-32BE]] are [[endianness|big-endian]], [[UTF-16LE]] and [[UTF-32LE]] are [[endianness|little-endian]].
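The byte-order difference is visible directly in the encoded bytes, and converting between orders amounts to swapping the bytes of every code unit:

```python
s = "Aé"  # U+0041, U+00E9

le = s.encode("utf-16-le")  # b'A\x00\xe9\x00' -- low-order byte first
be = s.encode("utf-16-be")  # b'\x00A\x00\xe9' -- high-order byte first
print(le.hex(), be.hex())

# Conversion between orders is a decode/re-encode (a per-unit byte swap):
assert le.decode("utf-16-le").encode("utf-16-be") == be
```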
 
== Processing issues ==
For processing, a format should be easy to search, truncate, and generally process safely.{{cn|date=July 2024}} All normal Unicode encodings use some form of fixed size code unit. Depending on the format and the code point to be encoded, one or more of these code units will represent a Unicode [[code point]]. To allow easy searching and truncation, the sequence of code units for one code point must not occur within a longer sequence or across the boundary of two other sequences. UTF-8, UTF-16, UTF-32 and UTF-EBCDIC have these important properties but [[UTF-7]] and [[GB 18030]] do not.
 
Fixed-size characters can be helpful, but even if there is a fixed byte count per code point (as in UTF-32), there is not a fixed byte count per displayed character due to [[combining character]]s. Considering these incompatibilities and other quirks among different encoding schemes, handling Unicode data with the same (or a compatible) protocol throughout and across interfaces (e.g. using an API/library, handling Unicode characters in a client/server model, etc.) can in general simplify the whole pipeline while also eliminating a potential source of bugs.
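The combining-character caveat can be demonstrated with a word whose accented letter is stored decomposed: the fixed-size UTF-32 units count code points, not displayed characters.

```python
import unicodedata

cafe = "cafe\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(len(cafe))                           # 5 code points
print(len(cafe.encode("utf-32-le")) // 4)  # 5 fixed-size UTF-32 units
# Yet only 4 characters are displayed; NFC composes the pair into 'é':
print(len(unicodedata.normalize("NFC", cafe)))  # 4
```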