Comparison of Unicode encodings

{{short description|None}}
{{Use dmy dates|date=July 2023}}
{{More footnotes needed|date=July 2019}}
This article compares [[Unicode]] encodings in two types of environments: [[8-bit clean]] environments (which can be assumed), and environments that forbid the use of [[byte]] values with the high bit set. Originally, such prohibitions allowed for links that used only seven data bits, but they remain in some standards, so some standard-conforming software must generate messages that comply with the restrictions.{{explain|date=July 2024}} The [[Standard Compression Scheme for Unicode]] and the [[Binary Ordered Compression for Unicode]] are excluded from the comparison tables because it is difficult to simply quantify their size.
 
== Compatibility issues ==
A [[UTF-8]] file that contains only [[ASCII]] characters is identical to an ASCII file. Legacy programs can generally handle UTF-8-encoded files, even if they contain non-ASCII characters. For instance, the [[C (programming language)|C]] [[printf]] function can print a UTF-8 string because it only looks for the ASCII '%' character to define a formatting string. All other bytes are printed unchanged.
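The ASCII transparency described above can be checked with a minimal Python sketch. The sample strings are made up for illustration; only the standard <code>str.encode</code> method is used.

```python
# Illustrative check of ASCII transparency in UTF-8 (sample strings are made up).
ascii_text = "Total: 100%"
mixed_text = "Größe: 100%"

# An ASCII-only string produces the same bytes in both encodings.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Every byte of a multi-byte UTF-8 sequence has the high bit set,
# so the ASCII byte 0x25 can only ever stand for '%' itself.
encoded = mixed_text.encode("utf-8")
percent_positions = [i for i, b in enumerate(encoded) if b == 0x25]
non_ascii_bytes = [b for b in encoded if b >= 0x80]

print(same_bytes, percent_positions, [hex(b) for b in non_ascii_bytes])
```

A byte-oriented scanner that looks only for <code>%</code> therefore finds exactly one match in the second string, and the four bytes that encode "ö" and "ß" pass through untouched.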
 
[[UTF-16]] and [[UTF-32]] are incompatible with ASCII files, and thus require [[Unicode]]-aware programs to display, print, and manipulate them even if the file is known to contain only characters in the ASCII subset. Because they contain many zero bytes, character strings representing such files cannot be manipulated by common [[null-terminated string]] handling logic.{{efn|ASCII software ''not'' using null characters to terminate strings would handle UTF-16 and UTF-32 encoded files correctly (such files, if containing only ASCII-subset characters, would appear as normal ASCII padded with [[null character]]s), but such software is not common.{{cn|date=July 2024}}}} The prevalence of string handling using this logic means that, even in the context of UTF-16 systems such as [[Windows]] and [[Java (software platform)|Java]], UTF-16 text files are not commonly used. Rather, older 8-bit encodings such as ASCII or [[ISO-8859-1]] are still used, forgoing Unicode support entirely, or UTF-8 is used for Unicode.{{cn|date=July 2024}} One rare counter-example is the "strings" file introduced in [[Mac OS X Panther|Mac OS X 10.3 Panther]], which is used by applications to look up internationalized versions of messages. By default, this file is encoded in UTF-16, with "files encoded using UTF-8 ... not guaranteed to work."<ref>{{Cite web|url=https://developer.apple.com/documentation/MacOSX/Conceptual/BPInternational/Articles/StringsFiles.html |title=Apple Developer Connection: Internationalization Programming Topics: Strings Files}}</ref>
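The effect of the embedded zero bytes can be sketched as follows. The helper <code>c_strlen</code> is a hypothetical stand-in for C's <code>strlen</code>, written here only to illustrate the point.

```python
def c_strlen(buf: bytes) -> int:
    """Mimic C's strlen: count the bytes before the first zero byte."""
    idx = buf.find(b"\x00")
    return len(buf) if idx == -1 else idx


text = "Hi!"
utf8 = text.encode("utf-8")       # 48 69 21
utf16 = text.encode("utf-16-le")  # 48 00 69 00 21 00
utf32 = text.encode("utf-32-le")  # 48 00 00 00 69 00 00 00 21 00 00 00

# Null-terminated logic sees the whole UTF-8 string,
# but stops after the first byte of the UTF-16 and UTF-32 forms.
print(c_strlen(utf8), c_strlen(utf16), c_strlen(utf32))
```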
 
[[XML]] is [[de facto|conventionally]] encoded as UTF-8,{{cn|date=July 2024}} and all XML processors must at least support UTF-8 (including US-ASCII by definition) and UTF-16.<ref>{{cite web
|url=http://www.w3.org/TR/xml/#charencoding
|title=Character Encoding in Entities
|work=Extensible Markup Language (XML) 1.0 (Fifth Edition)
|publisher=[[World Wide Web Consortium|W3C]]
|year=2008}}</ref>
 
== Efficiency ==
[[UTF-8]] requires 8, 16, 24 or 32 bits (one to four [[Octet (computing)|bytes]]) to encode a Unicode character, [[UTF-16]] requires either 16 or 32 bits to encode a character, and [[UTF-32]] always requires 32 bits to encode a character.
 
The first 128 Unicode [[code point]]s, U+0000 to U+007F, which are used for the [[C0 Controls and Basic Latin]] characters and which correspond to ASCII, are encoded using 8 bits in UTF-8, 16 bits in UTF-16, and 32 bits in UTF-32. The next 1,920 characters, U+0080 to U+07FF, represent the rest of the characters used by almost all [[Latin-script alphabet]]s, as well as [[Greek alphabet|Greek]], [[Cyrillic script|Cyrillic]], [[Coptic script|Coptic]], [[Armenian alphabet|Armenian]], [[Hebrew alphabet|Hebrew]], [[Arabic alphabet|Arabic]], [[Syriac alphabet|Syriac]], [[Thaana]] and [[N'Ko script|N'Ko]]. Characters in this range require 16 bits to encode in both UTF-8 and UTF-16, and 32 bits in UTF-32. For U+0800 to U+FFFF, the remaining code points in the [[Basic Multilingual Plane]], which cover the rest of the characters of most of the world's living languages, UTF-8 needs 24 bits to encode a character, while UTF-16 needs 16 bits and UTF-32 needs 32. Code points U+010000 to U+10FFFF, which represent characters in the [[Plane (Unicode)|supplementary planes]] (planes 1–16), require 32 bits in UTF-8, UTF-16 and UTF-32.
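The per-range sizes given above can be reproduced with a short Python sketch. The helper name <code>bits_per_code_point</code> and the four sample characters are illustrative choices, one from each range.

```python
def bits_per_code_point(ch: str) -> dict:
    """Return the encoded size in bits of one code point in each encoding."""
    codecs_by_name = {"UTF-8": "utf-8", "UTF-16": "utf-16-be", "UTF-32": "utf-32-be"}
    return {name: 8 * len(ch.encode(codec)) for name, codec in codecs_by_name.items()}


# One sample code point from each of the four ranges discussed above.
samples = {
    "U+0041": "A",            # U+0000 to U+007F
    "U+00E9": "\u00e9",       # U+0080 to U+07FF
    "U+20AC": "\u20ac",       # U+0800 to U+FFFF
    "U+1F600": "\U0001F600",  # U+010000 to U+10FFFF
}

for label, ch in samples.items():
    print(label, bits_per_code_point(ch))
```

The byte-order-specific codecs are used so that no byte-order mark is counted in the result.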
 
A file is shorter in UTF-8 than in UTF-16 if there are more ASCII code points than there are code points in the range U+0800 to U+FFFF. Advocates of UTF-8 as the preferred form argue that real-world documents written in languages that use characters only in the high range are still often shorter in UTF-8 due to the extensive use of spaces, digits, punctuation, newlines, [[HTML]], and embedded words and acronyms written with Latin letters.<ref>{{Cite web |title=UTF-8 Everywhere |url=https://utf8everywhere.org/#asian |access-date=2022-08-28 |website=utf8everywhere.org}}</ref> UTF-32, by contrast, is always longer unless there are no code points less than U+10000.
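The size rule follows from the per-range costs: each ASCII code point saves one byte in UTF-8 relative to UTF-16, each code point in U+0800 to U+FFFF costs one extra byte, and all other code points cost the same in both. A minimal sketch, with a hypothetical helper name and made-up sample strings, assuming text without unpaired surrogates:

```python
def utf8_minus_utf16(text: str) -> int:
    """Predict len(UTF-8) - len(UTF-16) in bytes from code point counts alone."""
    ascii_count = sum(1 for ch in text if ord(ch) <= 0x7F)
    high_bmp_count = sum(1 for ch in text if 0x0800 <= ord(ch) <= 0xFFFF)
    return high_bmp_count - ascii_count


plain = "日本語"
marked_up = '<p class="a">日本語</p>'

for sample in (plain, marked_up):
    actual = len(sample.encode("utf-8")) - len(sample.encode("utf-16-le"))
    print(repr(sample), utf8_minus_utf16(sample), actual)
```

The plain string is three bytes longer in UTF-8, while the same text wrapped in a small amount of markup is fourteen bytes shorter, which is the effect the UTF-8 advocates point to.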
 
All printable characters in [[UTF-EBCDIC]] use at least as many bytes as in UTF-8, and most use more, due to a decision made to allow encoding the C1 control codes as single bytes. For seven-bit environments, [[UTF-7]] is more space efficient than the combination of other Unicode encodings with [[quoted-printable]] or [[base64]] for almost all types of text{{explain|date=July 2024}} (see "[[#Seven-bit environments|Seven-bit environments]]" below).
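As a rough illustration of the seven-bit comparison, the following sketch measures one made-up mixed-script string using Python's built-in <code>utf-7</code> codec and the standard <code>base64</code> and <code>quopri</code> modules. It is a single data point under these assumptions, not a general measurement; text consisting almost entirely of non-ASCII characters can come out differently.

```python
import base64
import quopri

# Made-up sample: mostly ASCII with one short run of Cyrillic.
text = "Hello, Привет, world"

sizes = {
    "UTF-7": len(text.encode("utf-7")),
    "UTF-8 + base64": len(base64.b64encode(text.encode("utf-8"))),
    "UTF-8 + quoted-printable": len(quopri.encodestring(text.encode("utf-8"))),
    "UTF-16 + base64": len(base64.b64encode(text.encode("utf-16-be"))),
}

for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```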
===Processing time===
Text with variable-length encoding such as UTF-8 or UTF-16 is harder to process if there is a need to work with individual code points, as opposed to working with sequences of code units. Searching is unaffected by whether the characters are variably sized, since a search for a sequence of code units does not care about the divisions. However, it does require that the encoding be [[self-synchronizing code|self-synchronizing]], which both UTF-8 and UTF-16 are. A common misconception is that there is a need to "find the ''n''th character" and that this requires a fixed-length encoding; however, in real use the number ''n'' is only derived from examining the {{nowrap|''n−1''}} characters, thus sequential access is needed anyway.{{Citation needed|date=October 2013}}
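Searching by code-unit sequence can be sketched as below: a plain byte search over UTF-8 data finds the substring without decoding anything, and because the encoding is self-synchronizing the match necessarily starts on a code point boundary. The sample strings are made up.

```python
haystack = "naïve café".encode("utf-8")
needle = "café".encode("utf-8")

# A byte-level search needs no knowledge of where code points begin or end.
byte_offset = haystack.find(needle)

# The match starts on a code point boundary, so the prefix decodes cleanly;
# its length gives the position counted in code points.
code_point_offset = len(haystack[:byte_offset].decode("utf-8"))

print(byte_offset, code_point_offset)
```

The two offsets differ because "ï" occupies two bytes, which shows that the code point position only falls out of scanning the preceding text.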
 
Efficiently using character sequences in one [[endianness|endian order]] loaded onto a machine with a different endian order requires extra processing. Characters may either be converted before use or processed with two distinct systems. Byte-based encodings such as UTF-8 do not have this problem.{{why|date=July 2024}} [[UTF-16BE]] and [[UTF-32BE]] are big-endian; [[UTF-16LE]] and [[UTF-32LE]] are little-endian.
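The byte-order difference can be made concrete with a small sketch. The helper <code>swap_utf16_bytes</code> is hypothetical and assumes an even number of input bytes.

```python
def swap_utf16_bytes(data: bytes) -> bytes:
    """Swap each pair of bytes, turning UTF-16BE into UTF-16LE or back."""
    swapped = bytearray(data)
    swapped[0::2], swapped[1::2] = data[1::2], data[0::2]
    return bytes(swapped)


text = "A\u20ac"
big = text.encode("utf-16-be")     # 00 41 20 AC
little = text.encode("utf-16-le")  # 41 00 AC 20

print(big.hex(" "), "|", little.hex(" "), "|", text.encode("utf-8").hex(" "))
```

UTF-8 has a single byte sequence for the same text, so no such conversion step exists for it.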
 
== Processing issues ==
For processing, a format should be easy to search, truncate, and generally process safely.{{cn|date=July 2024}} All normal Unicode encodings use some form of fixed-size code unit. Depending on the format and the [[code point]] to be encoded, one or more of these code units will represent a Unicode [[code point]]. To allow easy searching and truncation, a sequence must not occur within a longer sequence or across the boundary of two other sequences. UTF-8, UTF-16, UTF-32 and UTF-EBCDIC have these important properties but [[UTF-7]] and [[GB 18030]] do not.
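Safe truncation in UTF-8 can be sketched as follows. The function name is illustrative, and the sketch assumes the input is valid UTF-8; it relies only on the fact that continuation bytes have the bit pattern <code>10xxxxxx</code>.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut valid UTF-8 data to at most max_bytes without splitting a code point."""
    if len(data) <= max_bytes:
        return data
    cut = max_bytes
    # If the first excluded byte is a continuation byte (0b10xxxxxx),
    # the cut falls inside a sequence, so back up to its lead byte.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]


encoded = "héllo".encode("utf-8")  # 68 C3 A9 6C 6C 6F
print(truncate_utf8(encoded, 2), truncate_utf8(encoded, 3))
```

With a limit of two bytes the two-byte "é" does not fit and is dropped whole; with a limit of three it is kept.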
 
Fixed-size characters can be helpful, but even if there is a fixed byte count per code point (as in UTF-32), there is not a fixed byte count per displayed character due to [[combining character]]s. Considering these incompatibilities and other quirks among different encoding schemes, handling Unicode data with the same (or compatible) protocol throughout and across the interfaces (e.g. using an API/library, handling Unicode characters in client/server model, etc.) can in general simplify the whole pipeline while eliminating a potential source of bugs.
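The combining-character point can be illustrated with the standard <code>unicodedata</code> module: one displayed character may be two code points, and therefore eight bytes in UTF-32, even though each code point has a fixed size.

```python
import unicodedata

decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9

# Both strings display as the same character, yet their sizes differ.
print(len(decomposed), len(decomposed.encode("utf-32-be")))
print(len(composed), len(composed.encode("utf-32-be")))
```

Not every combining sequence has a precomposed form, so normalization reduces but does not remove the need to handle multi-code-point characters.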
 
UTF-16 is popular because many APIs date to the time when Unicode was 16-bit fixed width (referred to as UCS-2). However, using UTF-16 makes characters outside the [[Mapping of Unicode character planes|Basic Multilingual Plane]] a special case, which increases the risk of oversights related to their handling. That said, programs that mishandle surrogate pairs probably also have problems with combining sequences, so using UTF-32 is unlikely to solve the more general problem of poor handling of multi-code-unit characters.
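The special case can be seen in how a supplementary-plane code point is split into two 16-bit code units. The helper below is an illustrative sketch of the standard surrogate arithmetic, checked against Python's own encoder.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point in U+10000..U+10FFFF into high and low surrogates."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
encoded = "\U0001F600".encode("utf-16-be")

print(f"{high:04X} {low:04X}", encoded.hex(" "))
```

Code that assumes one 16-bit unit per character would count this single code point as two.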
 
If any stored data is in UTF-8 (such as file contents or names), it is very difficult to write a system that uses UTF-16 or UTF-32 as an API. This is due to the oft-overlooked fact that the byte array used by UTF-8 can physically contain invalid sequences. For instance, it is impossible to fix an invalid UTF-8 filename using a UTF-16 API, as no possible UTF-16 string will translate to that invalid filename. The opposite is not true: it is trivial to translate invalid UTF-16 to a unique (though technically invalid) UTF-8 string, so a UTF-8 API can control both UTF-8 and UTF-16 files and names, making UTF-8 preferred in any such mixed environment. An unfortunate but far more common workaround used by UTF-16 systems is to interpret the UTF-8 as some other encoding such as [[CP-1252]] and ignore the [[mojibake]] for any non-ASCII data.
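The asymmetry can be sketched in Python, whose <code>surrogatepass</code> error handler is used here purely as an illustration of mapping an unpaired surrogate (invalid UTF-16) to a unique byte sequence. The file name is made up.

```python
# A byte string that is not valid UTF-8, as might occur in a file name.
bad_name = b"report-\xff.txt"
try:
    bad_name.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False

# An unpaired surrogate is invalid UTF-16, but it still maps to a unique
# (technically invalid) UTF-8-style byte sequence and back again.
lone_surrogate = "\ud800"
as_bytes = lone_surrogate.encode("utf-8", "surrogatepass")
round_trip = as_bytes.decode("utf-8", "surrogatepass")

print(decodable, as_bytes.hex(" "), round_trip == lone_surrogate)
```

A strict decoder rejects both byte strings, which is why a byte-oriented API is needed to name or repair such data.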
UTF-16 and UTF-32 do not have [[endianness]] defined, so a byte order must be selected when receiving them over a byte-oriented network or reading them from a byte-oriented storage. This may be achieved by using a [[byte-order mark]] at the start of the text or assuming big-endian (RFC 2781). [[UTF-8]], [[UTF-16BE]], [[UTF-32BE]], [[UTF-16LE]] and [[UTF-32LE]] are standardised on a single byte order and do not have this problem.
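Byte-order selection for UTF-16 can be sketched as below. The function name is hypothetical; the BOM constants come from the standard <code>codecs</code> module, and the big-endian fallback follows the assumption mentioned above.

```python
import codecs


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 using a byte-order mark if present, else assume big-endian."""
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return data[2:].decode("utf-16-le")
    return data.decode("utf-16-be")  # no BOM: fall back to big-endian


print(decode_utf16(b"\xfe\xff\x00A"), decode_utf16(b"\xff\xfeA\x00"), decode_utf16(b"\x00A"))
```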
 
If the byte stream is subject to [[data corruption|corruption]] then some encodings recover better than others. UTF-8 and UTF-EBCDIC are best in this regard as they can always resynchronize after a corrupt or missing byte at the start of the next code point; GB 18030 is unable to recover until the next ASCII non-number. UTF-16 can handle ''altered'' bytes, but not an odd number of ''missing'' bytes, which will garble all the following text (though it will produce uncommon and/or unassigned characters).{{efn|An ''even'' number of missing bytes in UTF-16, in contrast, will garble at most one character.}} If ''bits'' can be lost, all of them will garble the following text, though UTF-8 can be resynchronized, as incorrect byte boundaries will produce invalid UTF-8 in almost all text longer than a few bytes.
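The difference in recovery can be sketched by dropping one byte from the same made-up text in each encoding and decoding leniently with Python's <code>errors="replace"</code> handler.

```python
text = "héllo wörld"

# Drop the lead byte of "é" from the UTF-8 form.
utf8 = text.encode("utf-8")
damaged8 = utf8[:1] + utf8[2:]
recovered8 = damaged8.decode("utf-8", errors="replace")

# Drop one byte of "é" from the UTF-16LE form, shifting every later code unit.
utf16 = text.encode("utf-16-le")
damaged16 = utf16[:2] + utf16[3:]
recovered16 = damaged16.decode("utf-16-le", errors="replace")

print(repr(recovered8))
print(repr(recovered16))
```

In the UTF-8 case only the damaged character is replaced and the rest of the text survives; in the UTF-16 case everything after the missing byte is misread.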
 
== In detail ==
{{hatnote|The tables below list the numbers of bytes per code point, not per user-visible "character" (or "grapheme cluster"). It can take multiple code points to describe a single grapheme cluster, so even in UTF-32, care must be taken when splitting or concatenating strings.}}
 
The tables below list the number of bytes per code point for different Unicode ranges. Any additional comments needed are included in the table. The figures assume that overheads at the start and end of the block of text are negligible.
 
=== Eight-bit environments ===
|000000 – 00007F||1||rowspan=6|2||rowspan=8|4||rowspan=2|1||1
|-
|000080 – 00009F||rowspan=3|2||rowspan=5|2 for characters inherited from<br>[[GB 2312]]/[[GBK (character encoding)|GBK]] (e.g. most<br>Chinese characters), 4 for<br>everything else.
|-
|0000A0 – 0003FF||2
|rowspan=2|2–6 depending on if the byte values need to be escaped
<!--|rowspan=3|8–12 depending on if the final two byte values need to be escaped -->
|rowspan=2|4–6 for characters inherited from GB2312/GBK (e.g.<br>most Chinese characters), 8 for everything else.
|rowspan=2|{{frac|2|2|3}} for characters inherited from GB2312/GBK (e.g.<br>most Chinese characters), {{frac|5|1|3}} for everything else.
|-
|000800 – 00FFFF
|12
|{{frac|5|1|3}}
|8–12 depending on if the low bytes of the surrogates need to be escaped.
|{{frac|5|1|3}}
|8
[[Binary Ordered Compression for Unicode|BOCU-1]] and [[Standard Compression Scheme for Unicode|SCSU]] are two ways to compress Unicode data. Their [[character encoding|encoding]] relies on how frequently the text is used. Most runs of text use the same script; for example, [[Latin alphabet|Latin]], [[Cyrillic script|Cyrillic]], [[Greek alphabet|Greek]] and so on. This normal use allows many runs of text to compress down to about 1 byte per code point. These stateful encodings make it more difficult to randomly access text at any position of a string.
 
These two compression schemes are not as efficient as other compression schemes, like [[ZIP (file format)|zip]] or [[bzip2]]. Those general-purpose compression schemes can compress longer runs of bytes to just a few bytes. The [[Standard Compression Scheme for Unicode|SCSU]] and [[Binary Ordered Compression for Unicode|BOCU-1]] compression schemes will not compress more than the theoretical 25% of text encoded as UTF-8, UTF-16 or UTF-32. Other general-purpose compression schemes can easily compress to 10% of original text size. The general-purpose schemes require more complicated algorithms and longer chunks of text for a good compression ratio.
 
[https://www.unicode.org/notes/tn14/ Unicode Technical Note #14] contains a more detailed comparison of compression schemes.
 
=== {{anchor|UTF-5|UTF-6}}Historical: UTF-5 and UTF-6 ===
Proposals have been made for a UTF-5 and UTF-6 for the [[Internationalized domain name|internationalization of domain names]] (IDN). The UTF-5 proposal used a [[Base32|base 32]] encoding, where [[Punycode]] is (among other things, and not exactly) a [[base 36]] encoding. The name ''UTF-5'' for a code unit of 5 bits is explained by the equation 2<sup>5</sup> = 32.<ref>Seng, James, [https://archive.today/20120721050018/http://tools.ietf.org/html/draft-jseng-utf5 UTF-5, a transformation format of Unicode and ISO 10646], 28 January 2000</ref> The UTF-6 proposal added a running length encoding to UTF-5; here '''6''' simply stands for ''UTF-5 plus 1''.<ref name="UTF-6">{{cite journal |author-last1=Welter |author-first1=Mark |author-last2=Spolarich |author-first2=Brian W. |title=UTF-6 - Yet Another ASCII-Compatible Encoding for ID |url=https://tools.ietf.org/html/draft-ietf-idn-utf6-00 |newspaper=IETF Datatracker |date=2000-11-16 |access-date=2016-04-09 |url-status=live |archive-url=https://web.archive.org/web/20160523174347/https://tools.ietf.org/html/draft-ietf-idn-utf6-00 |archive-date=2016-05-23}}</ref>
The [[Internet Engineering Task Force|IETF]] IDN WG later adopted the more efficient [[Punycode]] for this purpose.<ref>{{Cite web|title=Internationalized Domain Name (idn)|url=http://tools.ietf.org/wg/idn|access-date=2023-03-20|publisher=Internet Engineering Task Force|language=en}}</ref>
 
=== Not being seriously pursued ===
[[UTF-1]] never gained serious acceptance. UTF-8 is much more frequently used.
 
[[UTF-9 and UTF-18]], despite being functional encodings, were [[April Fools' Day RFC]] joke specifications.
 
==Notes==
 
== References ==
{{reflist|30em}}
 
{{Unicode navigation}}
 
{{DEFAULTSORT:Comparison Of Unicode Encodings}}
[[Category:Unicode Transformation Formats| ]]
[[Category:Software comparisons|Unicode]]