m date format audit, refine ref details, typo(s) fixed: Therefore → Therefore, |
Line 1:
{{short description|None}}
{{Use dmy dates|date=July
{{More footnotes needed|date=July 2019}}
This article compares [[Unicode]] encodings. Two situations are considered: [[8-bit-clean]] environments (which can be assumed), and environments that forbid use of [[byte]] values that have the high bit set. Originally such prohibitions were to allow for links that used only seven data bits, but they remain in some standards and so some standard-conforming software must generate messages that comply with the restrictions. [[Standard Compression Scheme for Unicode]] and [[Binary Ordered Compression for Unicode]] are excluded from the comparison tables because it is difficult to simply quantify their size.
Line 16:
|title=Character Encoding in Entities
|work=Extensible Markup Language (XML) 1.0 (Fifth Edition)
|publisher=[[World Wide Web Consortium
|year=2008}}</ref>
Line 24:
The next 1,920 characters, U+0080 to U+07FF (encompassing the remainder of almost all [[Latin-script alphabet]]s, and also [[Greek alphabet|Greek]], [[Cyrillic script|Cyrillic]], [[Coptic alphabet|Coptic]], [[Armenian alphabet|Armenian]], [[Hebrew alphabet|Hebrew]], [[Arabic alphabet|Arabic]], [[Syriac alphabet|Syriac]], [[Tāna]] and [[N'Ko alphabet|N'Ko]]), require 16 bits to encode in both UTF-8 and UTF-16, and 32 bits in UTF-32. For U+0800 to U+FFFF, i.e. the remainder of the characters in the [[Basic Multilingual Plane]] (BMP, plane 0, U+0000 to U+FFFF), which encompasses the rest of the characters of most of the world's living languages, UTF-8 needs 24 bits to encode a character, while UTF-16 needs 16 bits and UTF-32 needs 32. Code points U+010000 to U+10FFFF, which represent characters in the [[Plane (Unicode)|supplementary planes]] (planes 1–16), require 32 bits in UTF-8, UTF-16 and UTF-32.
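The per-range sizes above can be checked with a minimal sketch that relies on Python's built-in codecs. The helper name `encoded_bits` and the sample code points are illustrative assumptions, not part of any standard; the `-be` codec variants are used only so that no byte-order mark is counted.

```python
def encoded_bits(code_point: int) -> dict:
    """Return the number of bits needed to encode one code point.

    Hypothetical helper for illustration. Surrogate code points
    (U+D800 to U+DFFF) are not characters and cannot be encoded.
    """
    ch = chr(code_point)
    return {
        "UTF-8": 8 * len(ch.encode("utf-8")),
        "UTF-16": 8 * len(ch.encode("utf-16-be")),
        "UTF-32": 8 * len(ch.encode("utf-32-be")),
    }


# One sample from each range discussed above (arbitrary choices).
for cp in (0x0041, 0x00E9, 0x20AC, 0x1F600):
    print(f"U+{cp:04X}", encoded_bits(cp))
```

The boundaries U+007F/U+0080, U+07FF/U+0800 and U+FFFF/U+10000 are where the UTF-8 size changes; only the last one changes the UTF-16 size.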
Therefore, a file is shorter in UTF-8 than in UTF-16 if there are more ASCII code points than there are code points in the range U+0800 to U+FFFF. A surprising result is that real-world documents written in languages that use characters only in the high range are still often shorter in UTF-8, due to the extensive use of spaces, digits, punctuation, newlines, HTML markup, and embedded words and acronyms written with Latin letters.<ref>{{Cite web |title=UTF-8 Everywhere |url=https://utf8everywhere.org/#asian |access-date=2022-08-28 |website=utf8everywhere.org}}</ref> UTF-32 is always longer unless there are no code points less than U+10000.
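The rule follows from simple arithmetic: relative to UTF-16, each ASCII code point saves one byte in UTF-8, each code point in U+0800 to U+FFFF costs one extra byte, and all other code points take the same number of bytes in both. The sketch below illustrates this; the function names and the sample string are assumptions made for the example, and byte-order marks are ignored.

```python
def utf8_minus_utf16(text: str) -> int:
    """Byte length in UTF-8 minus byte length in UTF-16 (no BOM)."""
    return len(text.encode("utf-8")) - len(text.encode("utf-16-le"))


def high_bmp_minus_ascii(text: str) -> int:
    """Count of U+0800..U+FFFF code points minus count of ASCII code points."""
    ascii_count = sum(1 for ch in text if ord(ch) < 0x80)
    high_count = sum(1 for ch in text if 0x0800 <= ord(ch) <= 0xFFFF)
    return high_count - ascii_count


# Hypothetical sample: three CJK characters wrapped in HTML markup.
sample = "<p>日本語</p>"
print(utf8_minus_utf16(sample), high_bmp_minus_ascii(sample))
```

The two functions return the same value for any text, so UTF-8 is the shorter encoding exactly when the result is negative, that is, when ASCII code points outnumber those in U+0800 to U+FFFF.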
All printable characters in [[UTF-EBCDIC]] use at least as many bytes as in UTF-8, and most use more, due to a decision made to allow encoding the C1 control codes as single bytes. For seven-bit environments, [[UTF-7]] is more space efficient than the combination of other Unicode encodings with [[quoted-printable]] or [[base64]] for almost all types of text (see "[[#Seven-bit environments|Seven-bit environments]]" below).
Line 45:
UTF-16 and UTF-32 do not have [[endianness]] defined, so a byte order must be selected when receiving them over a byte-oriented network or reading them from byte-oriented storage. This may be achieved by using a [[byte-order mark]] at the start of the text or by assuming big-endian (RFC 2781). [[UTF-8]], [[UTF-16BE]], [[UTF-32BE]], [[UTF-16LE]] and [[UTF-32LE]] are standardised on a single byte order and do not have this problem.
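The byte-order selection described above can be sketched for UTF-16 as follows. The function name `detect_utf16_order` is hypothetical, and the sketch covers UTF-16 only: a UTF-32 stream in little-endian order begins with the same two bytes as the UTF-16 little-endian mark, so a real decoder would need to know which encoding family it is reading.

```python
import codecs


def detect_utf16_order(data: bytes):
    """Pick a UTF-16 byte order for `data`.

    Returns (codec_name, payload), where payload has any byte-order
    mark removed. Falls back to big-endian when no mark is present,
    as suggested by RFC 2781.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be", data[len(codecs.BOM_UTF16_BE):]
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le", data[len(codecs.BOM_UTF16_LE):]
    return "utf-16-be", data  # no mark: assume big-endian


codec, payload = detect_utf16_order(b"\xff\xfeA\x00")
print(codec, payload.decode(codec))
```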
If the byte stream is subject to [[data corruption|corruption]] then some encodings recover better than others. UTF-8 and UTF-EBCDIC are best in this regard as they can always resynchronize
== In detail ==
Line 147:
=== {{anchor|UTF-5|UTF-6}}Historical: UTF-5 and UTF-6 ===
Proposals have been made for a UTF-5 and UTF-6 for the [[Internationalized domain name|internationalization of domain names]] (IDN). The UTF-5 proposal used a [[Base32|base 32]] encoding, whereas [[Punycode]] is (among other things, and not exactly) a [[base 36]] encoding. The name ''UTF-5'' for a code unit of 5 bits is explained by the equation 2<sup>5</sup> = 32.<ref>Seng, James, [https://archive.today/20120721050018/http://tools.ietf.org/html/draft-jseng-utf5 UTF-5, a transformation format of Unicode and ISO 10646], 28 January 2000</ref> The UTF-6 proposal added a run-length encoding to UTF-5; here '''6''' simply stands for ''UTF-5 plus 1''.<ref name="UTF-6">{{cite journal |author-last1=Welter |author-first1=Mark |author-last2=Spolarich |author-first2=Brian W. |title=UTF-6 - Yet Another ASCII-Compatible Encoding for ID |url=https://tools.ietf.org/html/draft-ietf-idn-utf6-00 |newspaper=Ietf Datatracker |date=2000-11-16 |access-date=2016-04-09 |url-status=live |archive-url=https://web.archive.org/web/20160523174347/https://tools.ietf.org/html/draft-ietf-idn-utf6-00 |archive-date=2016-05-23}}</ref>
The [[Internet Engineering Task Force|IETF]] IDN WG later adopted the more efficient [[Punycode]] for this purpose.<ref>{{Cite web|title=Internationalized Domain Name (idn)|url=http://tools.ietf.org/wg/idn|access-date=2023-03-20|
=== Not being seriously pursued ===
Line 158:
== References ==
{{reflist
{{Unicode navigation}}
{{DEFAULTSORT:Comparison
[[Category:Unicode Transformation Formats| ]]
[[Category:Software comparisons|Unicode]]