Sapphaline (talk | contribs) →In detail: hatnote
(2 intermediate revisions by 2 users not shown)
Line 3:
{{Use dmy dates|date=July 2023}}
{{More footnotes needed|date=July 2019}}
This article compares [[Unicode]] encodings in two types of environments: [[8-bit
== Compatibility issues ==
A [[UTF-8]] file that contains only [[ASCII]] characters is identical to an ASCII file. Legacy programs can generally handle UTF-8
[[UTF-16]] and [[UTF-32]] are incompatible with ASCII files, and thus require [[Unicode]]-aware programs to display, print, and manipulate them even if the file is known to contain only characters in the ASCII subset. Because they contain many zero bytes, character strings representing such files cannot be manipulated by common [[null-terminated string]] handling logic.{{efn|ASCII software ''not'' using null characters to terminate strings would handle UTF-16 and UTF-32 encoded files correctly (such files, if containing only ASCII-subset characters, would appear as normal ASCII padded with [[null character]]s), but such software is not common.{{cn|date=July 2024}}}} The prevalence of string handling using this logic means that, even in the context of UTF-16 systems such as [[Windows]] and [[Java (software platform)|Java]], UTF-16 text files are not commonly used. Rather, older 8-bit encodings such as ASCII or [[ISO-8859-1]] are still used, forgoing Unicode support entirely, or UTF-8 is used for Unicode.{{cn|date=July 2024}} One rare counter-example is the "strings" file introduced in [[Mac OS X Panther|Mac OS X 10.3 Panther]], which is used by applications to
[[XML]] is [[de facto|conventionally]] encoded as UTF-8,{{cn|date=July 2024}} and all XML processors must at least support UTF-8 and UTF-16.<ref>{{cite web
Line 29:
Text with a variable-length encoding such as UTF-8 or UTF-16 is harder to process when there is a need to work with individual code points as opposed to code units. Searching is unaffected by whether the characters are variably sized, since a search for a sequence of code units does not care about the divisions. However, this does require that the encoding be [[self-synchronizing code|self-synchronizing]], which both UTF-8 and UTF-16 are. A common misconception is that there is a need to "find the ''n''th character" and that this requires a fixed-length encoding; however, in real use the number ''n'' is only derived from examining the {{nowrap|''n−1''}} preceding characters, so sequential access is needed anyway.{{Citation needed|date=October 2013}}
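The point about searching can be illustrated with a minimal Python sketch. It assumes well-formed UTF-8 on both sides; the function name <code>find_code_point_index</code> is hypothetical and used only for this illustration. The search runs purely on code units (bytes), and the byte offset is converted to a code point index afterwards.

```python
def find_code_point_index(haystack: str, needle: str) -> int:
    """Search on UTF-8 code units, then report the match as a code point index.

    Returns -1 when the needle does not occur.
    """
    hay = haystack.encode("utf-8")
    pat = needle.encode("utf-8")
    pos = hay.find(pat)  # plain byte search, unaware of character boundaries
    if pos < 0:
        return -1
    # UTF-8 lead bytes and continuation bytes use disjoint value ranges, so a
    # match of a well-formed pattern in well-formed text starts on a code point
    # boundary. Decoding the prefix therefore cannot fail.
    return len(hay[:pos].decode("utf-8"))


if __name__ == "__main__":
    print(find_code_point_index("naïve café", "café"))
    print(find_code_point_index("日本語テキスト", "テキ"))
```

Converting the byte offset back to a code point index is the only step that needs to know about character boundaries, and it is a sequential scan of the prefix.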
Efficiently using character sequences in one [[endianness|endian order]] loaded onto a machine with a different endian order requires extra processing. Characters may either be converted before use or processed with two distinct systems. Byte-based encodings such as UTF-8 do not have this problem.{{why|date=July 2024}} [[UTF-16BE]] and [[UTF-32BE]] are
== Processing issues ==
For processing, a format should be easy to search, truncate, and generally process safely.{{cn|date=July 2024}} All normal Unicode encodings use some form of fixed
Fixed-size characters can be helpful, but even if there is a fixed byte count per code point (as in UTF-32), there is not a fixed byte count per displayed character due to [[combining character]]s. Considering these incompatibilities and other quirks among different encoding schemes, handling
UTF-16 is popular because many APIs date to the time when Unicode was 16-bit fixed width (referred to as UCS-2). However, using UTF-16 makes characters outside the [[Mapping of Unicode character planes|Basic Multilingual Plane]] a special case, which increases the risk of oversights related to their handling. That said, programs that mishandle surrogate pairs probably also have problems with combining sequences, so using UTF-32 is unlikely to solve the more general problem of poor handling of multi-code-unit characters.
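As a rough illustration, the following Python sketch counts code units for a few sample strings. The helper names <code>utf16_units</code> and <code>utf32_units</code> are assumptions made for this example. It shows that a character outside the Basic Multilingual Plane needs two UTF-16 code units, and that a combining sequence needs more than one code unit even in UTF-32.

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units in the UTF-16 form (no byte order mark)."""
    return len(text.encode("utf-16-le")) // 2


def utf32_units(text: str) -> int:
    """Number of 32-bit code units in the UTF-32 form (no byte order mark)."""
    return len(text.encode("utf-32-le")) // 4


SAMPLES = {
    "BMP letter A": "A",
    "non-BMP U+1F600": "\U0001F600",        # encoded as a surrogate pair in UTF-16
    "e + combining acute": "e\u0301",       # one displayed character, two code points
}

if __name__ == "__main__":
    for label, text in SAMPLES.items():
        print(label, utf16_units(text), utf32_units(text))
```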
If any stored data is in UTF-8 (such as file contents or names), it is very difficult to write a system that uses UTF-16 or UTF-32 as an API. This is due to the oft-overlooked fact that the byte array used by UTF-8 can physically contain invalid sequences. For instance, it is impossible to fix an invalid UTF-8 filename using a UTF-16 API, as no possible UTF-16 string will translate to that invalid filename. The opposite is not true: it is trivial to translate invalid UTF-16 to a unique (though technically invalid) UTF-8 string, so a UTF-8 API can control both UTF-8 and UTF-16 files and names, making UTF-8 preferred in any such mixed environment. An unfortunate but far more common workaround used by UTF-16 systems is to interpret the UTF-8 as some other encoding such as [[CP-1252]] and ignore the [[mojibake]] for any non-ASCII data.
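The asymmetry described above can be sketched in Python under a few stated assumptions: a Python <code>str</code> containing a lone surrogate stands in for invalid UTF-16, the <code>surrogatepass</code> error handler stands in for the "technically invalid" UTF-8 translation, and the file name and helper names are hypothetical.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """True if the byte string is well-formed UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def utf16_style_to_bytes(text: str) -> bytes:
    """Translate text that may hold lone surrogates to UTF-8-like bytes.

    The result is unique per input, but not valid UTF-8 when a lone
    surrogate is present.
    """
    return text.encode("utf-8", "surrogatepass")


def bytes_to_utf16_style(raw: bytes) -> str:
    """Inverse of utf16_style_to_bytes for byte strings that it produced."""
    return raw.decode("utf-8", "surrogatepass")


# Hypothetical file name stored with a Latin-1 byte (0xE9): not valid UTF-8,
# so no strict decoding yields a string that maps back to these bytes.
BAD_NAME = b"caf\xe9.txt"

# The workaround mentioned above: valid UTF-8 read as CP-1252 gives mojibake.
MOJIBAKE = "café.txt".encode("utf-8").decode("cp1252")

if __name__ == "__main__":
    print(is_valid_utf8(BAD_NAME))
    print(utf16_style_to_bytes("\ud800"))
    print(MOJIBAKE)
```

A byte-oriented API can pass <code>BAD_NAME</code> through unchanged, which is why the paragraph above treats the UTF-8 side as the more general one.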
Line 46:
== In detail ==
{{hatnote|The tables below list
The tables below list the number of bytes per code point for different Unicode ranges. Any additional comments needed are included in the table. The figures assume that overheads at the start and end of the block of text are negligible.
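The basic UTF-8, UTF-16 and UTF-32 figures can be reproduced with a short Python sketch. The function name <code>bytes_per_code_point</code> is an assumption for this example; it uses the little-endian codec variants so that no byte order mark is counted, in line with the assumption that start-of-text overhead is negligible. Surrogate code points (U+D800 to U+DFFF) are not handled.

```python
def bytes_per_code_point(code_point: int) -> dict:
    """Encoded size in bytes of one code point in three encoding forms."""
    ch = chr(code_point)
    return {
        "UTF-8": len(ch.encode("utf-8")),
        "UTF-16": len(ch.encode("utf-16-le")),  # "-le" avoids the 2-byte BOM
        "UTF-32": len(ch.encode("utf-32-le")),  # "-le" avoids the 4-byte BOM
    }


if __name__ == "__main__":
    for cp in (0x41, 0x3A9, 0x20AC, 0x1F600):
        print(f"U+{cp:06X}", bytes_per_code_point(cp))
```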
=== Eight-bit environments ===
Line 58 ⟶ 56:
|000000 – 00007F||1||rowspan=6|2||rowspan=8|4||rowspan=2|1||1
|-
|000080 – 00009F||rowspan=3|2||rowspan=5|2 for characters inherited from<br>[[GB 2312]]/[[GBK (character encoding)|GBK]] (e.g. most<br>Chinese characters), 4 for<br>everything else
|-
|0000A0 – 0003FF||2
Line 115 ⟶ 113:
|rowspan=2|2–6 depending on if the byte values need to be escaped
<!--|rowspan=3|8–12 depending on if the final two byte values need to be escaped -->
|rowspan=2|4–6 for characters inherited from GB2312/GBK (e.g.<br>most Chinese characters), 8 for everything else
|rowspan=2|{{frac|2|2|3}} for characters inherited from GB2312/GBK (e.g.<br>most Chinese characters), {{frac|5|1|3}} for everything else
|-
|000800 – 00FFFF
Line 126 ⟶ 124:
|12
|{{frac|5|1|3}}
|8–12 depending on if the low bytes of the surrogates need to be escaped
|{{frac|5|1|3}}
|8
Line 139 ⟶ 137:
[[Binary Ordered Compression for Unicode|BOCU-1]] and [[Standard Compression Scheme for Unicode|SCSU]] are two ways to compress Unicode data. Both [[character encoding|encodings]] exploit the fact that most runs of text stay within a single script, for example [[Latin alphabet|Latin]], [[Cyrillic script|Cyrillic]] or [[Greek alphabet|Greek]]. In such typical text, many runs compress down to about 1 byte per code point. Because these encodings are stateful, it is more difficult to randomly access text at an arbitrary position in a string.
These two compression schemes are not as efficient as other compression schemes, like [[ZIP (file format)|zip]] or [[bzip2]]. Those general-purpose compression schemes can compress longer runs of bytes to just a few bytes. The [[Standard Compression Scheme for Unicode|SCSU]] and [[Binary Ordered Compression for Unicode|BOCU-1]] compression schemes will not compress more than the theoretical 25% of text encoded as UTF-8, UTF-16 or UTF-32. Other general-purpose compression schemes can easily compress to 10% of original text size. The general
[https://www.unicode.org/notes/tn14/ Unicode Technical Note #14] contains a more detailed comparison of compression schemes.
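The general-purpose side of this comparison can be explored with a small Python sketch using the standard <code>zlib</code> and <code>bz2</code> modules. SCSU and BOCU-1 are not part of the Python standard library and are not shown. The function name <code>compressed_sizes</code> and the sample text are assumptions for this example, and a heavily repeated sample exaggerates the achievable ratio compared with real text.

```python
import bz2
import zlib


def compressed_sizes(text: str) -> dict:
    """Raw and compressed byte counts of the text in three encoding forms."""
    sizes = {}
    for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
        raw = text.encode(encoding)
        sizes[encoding] = {
            "raw": len(raw),
            "zlib": len(zlib.compress(raw, 9)),
            "bz2": len(bz2.compress(raw)),
        }
    return sizes


if __name__ == "__main__":
    # Artificial sample: one Cyrillic sentence repeated many times.
    sample = "Пример текста на кириллице. " * 200
    for encoding, counts in compressed_sizes(sample).items():
        print(encoding, counts)
```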
=== {{anchor|UTF-5|UTF-6}}Historical: UTF-5 and UTF-6 ===
Proposals have been made for a UTF-5 and UTF-6 for the [[Internationalized domain name|internationalization of domain names]] (IDN). The UTF-5 proposal used a [[Base32|base 32]] encoding, where [[Punycode]] is (among other things, and not exactly) a [[base 36]] encoding. The name ''UTF-5'' for a code unit of 5 bits is explained by the equation 2<sup>5</sup> = 32.<ref>Seng, James, [https://archive.today/20120721050018/http://tools.ietf.org/html/draft-jseng-utf5 UTF-5, a transformation format of Unicode and ISO 10646], 28 January 2000</ref> The UTF-6 proposal added a running length encoding to UTF-5
The [[Internet Engineering Task Force|IETF]] IDN WG later adopted the more efficient [[Punycode]] for this purpose.<ref>{{Cite web|title=Internationalized Domain Name (idn)|url=http://tools.ietf.org/wg/idn|access-date=2023-03-20|publisher=Internet Engineering Task Force|language=en}}</ref>
Line 160 ⟶ 158:
{{Unicode navigation}}
[[Category:Unicode Transformation Formats| ]]
[[Category:Software comparisons|Unicode]]