Comparison of Unicode encodings: Difference between revisions

{{Use dmy dates|date=July 2023}}
{{More footnotes needed|date=July 2019}}
This article compares [[Unicode]] encodings in two types of environments: [[8-bit-clean]] environments (which can be assumed), and environments that forbid the use of [[byte]] values with the high bit set. Originally, such prohibitions allowed for links that used only seven data bits, but they remain in some standards and so some standard-conforming software must generate messages that comply with the restrictions.{{explain|date=July 2024}} The [[Standard Compression Scheme for Unicode]] and the [[Binary Ordered Compression for Unicode]] are excluded from the comparison tables because it is difficult to simply quantify their size.
 
== Compatibility issues ==
A [[UTF-8]] file that contains only [[ASCII]] characters is identical to an ASCII file. Legacy programs can generally handle UTF-8 encoded files, even if they contain non-ASCII characters. For instance, the [[C (programming language)|C]] [[printf]] function can print a UTF-8 string, because it only looks for the ASCII '%' character to define a formatting string. All other bytes are printed unchanged, so non-ASCII characters are output unmodified.
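This pass-through behaviour can be sketched with a toy byte-oriented formatter (a hypothetical <code>ascii_format</code> helper standing in for C's ''printf'', not any real library function): it inspects only the ASCII '%' byte and copies every other byte through unchanged, so UTF-8 multi-byte sequences survive intact.

```python
# Hypothetical ascii_format: a byte-oriented formatter in the style of
# C's printf. Only the ASCII "%s" pattern is interpreted; all other
# bytes, including UTF-8 multi-byte sequences, are copied verbatim.
def ascii_format(fmt: bytes, *args: bytes) -> bytes:
    out, values, i = bytearray(), iter(args), 0
    while i < len(fmt):
        if fmt[i:i + 2] == b"%s":   # the only byte pattern it interprets
            out += next(values)
            i += 2
        else:                       # everything else copied unchanged
            out.append(fmt[i])
            i += 1
    return bytes(out)

result = ascii_format("Grüße, %s!".encode("utf-8"), "Welt".encode("utf-8"))
print(result.decode("utf-8"))  # Grüße, Welt!
```

The non-ASCII bytes of "ü" and "ß" pass through the ASCII-era logic untouched and still decode as valid UTF-8.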
 
[[UTF-16]] and [[UTF-32]] are incompatible with ASCII files, and thus require [[Unicode]]-aware programs to display, print, and manipulate them, even if the file is known to contain only characters in the ASCII subset. Because they contain many zero bytes, character strings representing such files cannot be manipulated by common [[null-terminated string]] handling logic.{{efn|ASCII software ''not'' using null characters to terminate strings would handle UTF-16 and UTF-32 encoded files correctly (such files, if containing only ASCII-subset characters, would appear as normal ASCII padded with [[null character]]s), but such software is not common.{{cn|date=July 2024}}}} The prevalence of string handling using this logic means that, even in the context of UTF-16 systems such as [[Windows]] and [[Java (software platform)|Java]], UTF-16 text files are not commonly used. Rather, older 8-bit encodings such as ASCII or [[ISO-8859-1]] are still used, forgoing Unicode support entirely, or UTF-8 is used for Unicode.{{cn|date=July 2024}} One rare counter-example is the [[Mac OS X Panther|Mac OS X 10.3 Panther]] and later "strings" file used by applications to look up internationalized versions of messages. By default, this file is encoded in UTF-16, with "files encoded using UTF-8 ... not guaranteed to work."<ref>{{Cite web|url=https://developer.apple.com/documentation/MacOSX/Conceptual/BPInternational/Articles/StringsFiles.html|title=Apple Developer Connection: Internationalization Programming Topics: Strings Files}}</ref>
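The failure mode can be illustrated with a sketch of C-style null-terminated handling (a hypothetical <code>c_strlen</code> mimicking C's ''strlen''): scanning stops at the first zero byte, which in UTF-16 appears after the very first ASCII character.

```python
# Hypothetical c_strlen, mimicking C's strlen: the string is assumed
# to end at the first zero byte.
def c_strlen(buf: bytes) -> int:
    return buf.index(0) if 0 in buf else len(buf)

utf8  = "Hi".encode("utf-8")      # b'Hi'          -- no zero bytes
utf16 = "Hi".encode("utf-16-le")  # b'H\x00i\x00'  -- a zero after each ASCII char

print(c_strlen(utf8))   # 2: the whole string is seen
print(c_strlen(utf16))  # 1: truncated after 'H'
```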
 
[[XML]] is [[de facto|conventionally]] encoded as UTF-8,{{cn|date=July 2024}} and all XML processors must at least support UTF-8 (including US-ASCII by definition) and UTF-16.<ref>{{cite web
|url=http://www.w3.org/TR/xml/#charencoding
|title=Character Encoding in Entities
}}</ref>
 
== Efficiency ==
[[UTF-8]] requires 8, 16, 24 or 32 bits (one to four [[Octet (computing)|bytes]]) to encode a Unicode character, [[UTF-16]] requires either 16 or 32 bits to encode a character, and [[UTF-32]] always requires 32 bits to encode a character. The first 128 Unicode [[code point]]s, U+0000 to U+007F, which are used for the [[C0 Controls and Basic Latin]] characters and correspond one-to-one to their ASCII-code equivalents, are encoded using 8 bits in UTF-8, 16 bits in UTF-16, and 32 bits in UTF-32.
 
The next 1,920 characters, U+0080 to U+07FF, represent the rest of the characters used by almost all [[Latin-script alphabet]]s, as well as [[Greek alphabet|Greek]], [[Cyrillic script|Cyrillic]], [[Coptic alphabet|Coptic]], [[Armenian alphabet|Armenian]], [[Hebrew alphabet|Hebrew]], [[Arabic alphabet|Arabic]], [[Syriac alphabet|Syriac]], [[Tāna]] and [[N'Ko alphabet|N'Ko]]. Characters in this range require 16 bits to encode in both UTF-8 and UTF-16, and 32 bits in UTF-32. For U+0800 to U+FFFF, the remaining characters in the [[Basic Multilingual Plane]], which suffice to represent the rest of the characters of most of the world's living languages, UTF-8 needs 24 bits to encode a character, while UTF-16 needs 16 bits and UTF-32 needs 32. Code points U+010000 to U+10FFFF, which represent characters in the [[Plane (Unicode)|supplementary planes]] (planes 1–16), require 32 bits in UTF-8, UTF-16 and UTF-32.
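The per-range sizes above can be verified directly, for example with a small Python helper (the <code>bits</code> function is illustrative, not a standard API) applied to one code point from each range:

```python
# Encoded size, in bits, of one code point in each encoding scheme.
def bits(ch: str, encoding: str) -> int:
    return len(ch.encode(encoding)) * 8

# U+0041 (ASCII), U+00E9 (U+0080..U+07FF), U+20AC (U+0800..U+FFFF),
# U+1F600 (supplementary plane)
for ch in "Aé€😀":
    print(f"U+{ord(ch):04X}", bits(ch, "utf-8"),
          bits(ch, "utf-16-le"), bits(ch, "utf-32-le"))
```

The output matches the table in the text: 8/16/32 bits for ASCII, 16/16/32 for U+0080–U+07FF, 24/16/32 for the rest of the BMP, and 32/32/32 for the supplementary planes.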
 
A file is shorter in UTF-8 than in UTF-16 if there are more ASCII code points than there are code points in the range U+0800 to U+FFFF. Advocates of UTF-8 as the preferred form argue that real-world documents written in languages that use characters only in the high range are still often shorter in UTF-8, due to the extensive use of spaces, digits, punctuation, newlines, [[HTML]] markup, and embedded words and acronyms written with Latin letters.<ref>{{Cite web |title=UTF-8 Everywhere |url=https://utf8everywhere.org/#asian |access-date=2022-08-28 |website=utf8everywhere.org}}</ref> UTF-32, by contrast, is always longer unless there are no code points less than U+10000.
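This effect can be seen with a contrived but representative snippet (the sample markup below is hypothetical): a few CJK characters, each costing 3 bytes in UTF-8 versus 2 in UTF-16, are outweighed by the many ASCII markup and digit characters, each costing 1 byte versus 2.

```python
# Hypothetical HTML-like snippet: 3 CJK characters amid 13 ASCII
# characters of markup, digits, and whitespace per line.
doc = "<p>価格: 100円</p>\n" * 1000

utf8_len  = len(doc.encode("utf-8"))      # CJK chars cost 3 bytes each here
utf16_len = len(doc.encode("utf-16-le"))  # every BMP char costs 2 bytes

print(utf8_len, utf16_len)  # UTF-8 is smaller despite the CJK characters
```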
 
All printable characters in [[UTF-EBCDIC]] use at least as many bytes as in UTF-8, and most use more, due to a decision made to allow encoding the C1 control codes as single bytes. For seven-bit environments, [[UTF-7]] is more space efficient than the combination of other Unicode encodings with [[quoted-printable]] or [[base64]] for almost all types of text{{explain|date=July 2024}} (see "[[#Seven-bit environments|Seven-bit environments]]" below).
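The seven-bit comparison can be sketched with Python's built-in <code>utf-7</code> codec against [[base64]]-wrapped UTF-16 (the sample text is illustrative; exact savings depend on how much of the text is ASCII):

```python
import base64

text = "Héllo wörld " * 50  # mostly ASCII with a few accented letters

# UTF-7 keeps ASCII as-is and wraps only the non-ASCII runs in
# short base64 islands (+...-).
utf7 = text.encode("utf-7")

# Base64-wrapping UTF-16 inflates the entire text by a factor of 4/3
# on top of UTF-16's 2 bytes per character.
b64 = base64.b64encode(text.encode("utf-16-le"))

print(len(utf7), len(b64))  # UTF-7 is considerably smaller here
```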
 
===Processing time===
Text with a variable-length encoding such as UTF-8 or UTF-16 is harder to process if there is a need to work with individual code units, as opposed to working with sequences of code units. Searching is unaffected by whether the characters are variably sized, since a search for a sequence of code units does not care about the divisions. However, it does require that the encoding be [[self-synchronizing code|self-synchronizing]], which both UTF-8 and UTF-16 are. A common misconception is that there is a need to "find the ''n''th character" and that this requires a fixed-length encoding; however, in real use the number ''n'' is only derived from examining the {{nowrap|''n−1''}} characters, thus sequential access is needed anyway.{{Citation needed|date=October 2013}}
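For illustration, a plain byte-level search on the encoded form finds a multi-byte substring without decoding anything:

```python
haystack = "Grüße aus Zürich".encode("utf-8")
needle   = "Zürich".encode("utf-8")

# Byte-level search needs no decoding. Because UTF-8 is
# self-synchronizing (lead and continuation bytes occupy disjoint
# ranges), the needle's bytes can never match starting in the middle
# of another character's multi-byte sequence.
pos = haystack.find(needle)
print(pos)
```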
 
Efficiently using character sequences in one [[endianness|endian order]] loaded onto a machine with a different endian order requires extra processing: the characters must either be converted before use, or processed with two distinct systems. Byte-based encodings such as UTF-8 do not have this problem.{{why|date=July 2024}} [[UTF-16BE]] and [[UTF-32BE]] are [[endianness|big-endian]], [[UTF-16LE]] and [[UTF-32LE]] are [[endianness|little-endian]].
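The byte-order difference is visible directly in the encoded bytes, and converting between orders amounts to swapping the bytes of every code unit:

```python
s = "Aé"  # U+0041, U+00E9

le = s.encode("utf-16-le")  # b'A\x00\xe9\x00' -- low-order byte first
be = s.encode("utf-16-be")  # b'\x00A\x00\xe9' -- high-order byte first
print(le.hex(), be.hex())

# Conversion between orders is a decode/re-encode (a per-unit byte swap):
assert le.decode("utf-16-le").encode("utf-16-be") == be
```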
 
== Processing issues ==
For processing, a format should be easy to search, truncate, and generally process safely.{{cn|date=July 2024}} All normal Unicode encodings use some form of fixed size code unit. Depending on the format and the code point to be encoded, one or more of these code units will represent a Unicode [[code point]]. To allow easy searching and truncation, the sequence of code units for one code point must not occur within a longer sequence or across the boundary of two other sequences. UTF-8, UTF-16, UTF-32 and UTF-EBCDIC have these important properties but [[UTF-7]] and [[GB 18030]] do not.
 
Fixed-size characters can be helpful, but even if there is a fixed byte count per code point (as in UTF-32), there is not a fixed byte count per displayed character due to [[combining character]]s. Considering these incompatibilities and other quirks among different encoding schemes, handling Unicode data with the same (or a compatible) protocol throughout and across interfaces (e.g. using an API/library, handling Unicode characters in a client/server model, etc.) can in general simplify the whole pipeline while also eliminating a potential source of bugs.
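The combining-character caveat can be demonstrated with a word whose accented letter is stored decomposed: the fixed-size UTF-32 units count code points, not displayed characters.

```python
import unicodedata

cafe = "cafe\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(len(cafe))                           # 5 code points
print(len(cafe.encode("utf-32-le")) // 4)  # 5 fixed-size UTF-32 units
# Yet only 4 characters are displayed; NFC composes the pair into 'é':
print(len(unicodedata.normalize("NFC", cafe)))  # 4
```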