Content deleted Content added
→For communication and storage: Surrogate pairs have nothing to do with this, the problem is 1/2 of a code point is missing |
Iridescent (talk | contribs) m →Processing issues: Cleanup and typo fixing, typo(s) fixed: etc) → etc.) |
||
Line 27:
For processing, a format should be easy to search, truncate, and generally process safely. All normal Unicode encodings use some form of fixed size code unit. Depending on the format and the code point to be encoded, one or more of these code units will represent a Unicode [[code point]]. To allow easy searching and truncation, a sequence must not occur within a longer sequence or across the boundary of two other sequences. UTF-8, UTF-16, UTF-32 and UTF-EBCDIC have these important properties but [[UTF-7]] and [[GB 18030]] do not.
Fixed-size characters can be helpful, but even if there is a fixed byte count per code point (as in UTF-32), there is not a fixed byte count per displayed character due to [[combining character]]s. Considering these incompatibilities and other quirks among different encoding schemes, handling unicode data with the same (or compatible) protocol throughout and across the interfaces (e.g. using an API/library, handling unicode characters in client/server model, etc.) can in general simplify the whole pipeline while eliminating a potential source of bugs at the same time.
UTF-16 is popular because many APIs date to the time when Unicode was 16-bit fixed width. However, using UTF-16 makes characters outside the [[Mapping of Unicode character planes|Basic Multilingual Plane]] a special case which increases the risk of oversights related to their handling. That said, programs that mishandle surrogate pairs probably also have problems with combining sequences, so using UTF-32 is unlikely to solve the more general problem of poor handling of multi-code-unit characters.
|