Undid revision 1218408594 by 2601:601:513:6D07:5500:C5D2:DDF5:8634 (talk) |
Hawkblade96 (talk | contribs) Cleanup grammar, wording, citations in introduction and first two sections |
Line 3:
{{Use dmy dates|date=July 2023}}
{{More footnotes needed|date=July 2019}}
This article compares [[Unicode]] encodings
== Compatibility issues ==
A [[UTF-8]] file that contains only [[ASCII]] characters is identical to an ASCII file. Legacy programs can generally handle UTF-8 encoded files, even if they contain non-ASCII characters. For instance, the [[C (programming language)|C]] [[printf]] function can print a UTF-8 string
[[UTF-16]] and [[UTF-32]] are incompatible with ASCII files, and thus require [[Unicode]]-aware programs to display, print, and manipulate them
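The compatibility claims above can be checked directly. The following is a minimal sketch in Python, assuming an arbitrary ASCII-only sample string and the little-endian, BOM-less codec variants (<code>utf-16-le</code>, <code>utf-32-le</code>); it is an illustration, not part of any cited source.

```python
# Sample text is a hypothetical ASCII-only string chosen for illustration.
text = "Hello, world"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # little-endian, no byte order mark
utf32_bytes = text.encode("utf-32-le")  # little-endian, no byte order mark

# ASCII-only text yields byte-for-byte identical ASCII and UTF-8 output.
print(utf8_bytes == ascii_bytes)   # True

# UTF-16 and UTF-32 pad each character with zero bytes, so an
# ASCII-only program would not read them as the same text.
print(utf16_bytes == ascii_bytes)  # False
print(len(ascii_bytes), len(utf16_bytes), len(utf32_bytes))  # 12 24 48
print(utf16_bytes[:4])             # b'H\x00e\x00'
```

The embedded zero bytes in the UTF-16 and UTF-32 output are one reason byte-oriented legacy tools, which often treat a zero byte as a string terminator, tend to mishandle those encodings.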
[[XML]] is, [[de facto|by convention]], encoded as UTF-8, and all XML processors must at least support UTF-8 (including US-ASCII by definition) and UTF-16.<ref>{{cite web
|url=http://www.w3.org/TR/xml/#charencoding
|title=Character Encoding in Entities
Line 20 ⟶ 18:
== Efficiency ==
[[UTF-8]] requires 8, 16, 24 or 32 bits (one to four [[Octet (computing)|bytes]]) to encode a Unicode character, [[UTF-16]] requires either 16 or 32 bits to encode a character, and [[UTF-32]] always requires 32 bits to encode a character
The first 128 Unicode [[code point]]s, U+0000 to U+007F, which are used for the [[C0 Controls and Basic Latin]] characters and which correspond to ASCII, are encoded using 8 bits in UTF-8, 16 bits in UTF-16, and 32 bits in UTF-32. The next 1,920 characters, U+0080 to U+07FF,
All printable characters in [[UTF-EBCDIC]] use at least as many bytes as in UTF-8, and most use more, because UTF-EBCDIC was designed to allow the C1 control codes to be encoded as single bytes. For seven-bit environments, [[UTF-7]] is more space-efficient than combining other Unicode encodings with [[quoted-printable]] or [[base64]] for almost all types of text{{explain|date=July 2024}} (see "[[#Seven-bit environments|Seven-bit environments]]" below).
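The per-character sizes for UTF-8, UTF-16 and UTF-32 can be measured rather than taken on trust. The sketch below is illustrative only: the helper name <code>encoded_sizes</code> and the four sample characters are assumptions chosen to cover one-, two-, three- and four-byte UTF-8 sequences.

```python
def encoded_sizes(ch):
    """Return the encoded length in bytes of one character per encoding.

    Hypothetical helper for illustration; the BOM-less little-endian
    codecs are used so that no byte order mark inflates the counts.
    """
    return {name: len(ch.encode(name))
            for name in ("utf-8", "utf-16-le", "utf-32-le")}


samples = [
    "A",           # U+0041, in the ASCII range
    "\u00e9",      # U+00E9, in the range U+0080 to U+07FF
    "\u20ac",      # U+20AC, elsewhere in the Basic Multilingual Plane
    "\U0001F600",  # U+1F600, outside the Basic Multilingual Plane
]

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
# U+0041 {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
# U+00E9 {'utf-8': 2, 'utf-16-le': 2, 'utf-32-le': 4}
# U+20AC {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
# U+1F600 {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```

Which encoding is smallest therefore depends on the text: UTF-8 wins for ASCII-heavy content, while UTF-16 is smaller for characters that need three bytes in UTF-8.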
=== Processing time ===
Text with variable-length encoding such as UTF-8 or UTF-16 is harder to process if there is a need to work with individual code units
== Processing issues ==
For processing, a format should be easy to search, truncate, and generally process safely.{{cn|date=July 2024}} All normal Unicode encodings use some form of fixed-size code unit. Depending on the format and the code point to be encoded, one or more of these code units represent a Unicode [[code point]]. To allow easy searching and truncation, the code unit sequence for one code point must not occur within a longer sequence or across the boundary of two other sequences. UTF-8, UTF-16, UTF-32 and UTF-EBCDIC have these important properties, but [[UTF-7]] and [[GB 18030]] do not.
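As an illustration of why this property matters for truncation, the sketch below cuts a UTF-8 byte string at a code point boundary. It relies on the fact that UTF-8 continuation bytes always have the bit pattern <code>10xxxxxx</code>, so a boundary can be found by stepping backwards. The function name <code>truncate_utf8</code> is hypothetical, and the code assumes its input is already valid UTF-8.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut valid UTF-8 data to at most max_bytes without splitting a code point.

    Hypothetical helper for illustration; input is assumed to be valid UTF-8.
    """
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # data[end] is the first byte that would be dropped. If it is a
    # continuation byte (10xxxxxx), the cut falls inside a multi-byte
    # sequence, so step back until the cut lands before a lead byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


data = "na\u00efve caf\u00e9".encode("utf-8")  # 10 characters, 12 bytes

print(truncate_utf8(data, 3))                   # b'na'
print(truncate_utf8(data, 11).decode("utf-8"))  # naïve caf

# Searching works on raw bytes too: an ASCII byte never appears
# inside the multi-byte sequence of another character.
print(b"e" in "\u00e9".encode("utf-8"))         # False
```

A naive slice such as <code>data[:3]</code> would instead keep the lead byte of "ï" without its continuation byte and produce a sequence that fails to decode.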
Fixed-size characters can be helpful, but even if there is a fixed byte count per code point (as in UTF-32), there is not a fixed byte count per displayed character, because of [[combining character]]s. Given these incompatibilities and other quirks among encoding schemes, handling Unicode data with the same (or a compatible) protocol throughout a system and across its interfaces (for example, when calling an API or library, or when passing Unicode characters between client and server) generally simplifies the whole pipeline and eliminates a potential source of bugs.
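The effect of combining characters on byte counts can be shown with a short sketch. It uses Python's standard <code>unicodedata</code> module; the choice of "é" as the sample character and of NFC as the normalization form are assumptions made for illustration.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: displayed as one character.
decomposed = "e\u0301"
# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# Code point count and UTF-32 byte count for each form.
print(len(decomposed), len(decomposed.encode("utf-32-le")))  # 2 8
print(len(composed), len(composed.encode("utf-32-le")))      # 1 4

# The two strings render alike but do not compare equal as code points.
print(decomposed == composed)  # False
```

Both strings display as a single character, yet they occupy 8 and 4 bytes in UTF-32, which is why a fixed width per code point does not give a fixed width per displayed character.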