Revision as of 16:31, 11 February 2020, Spitzak: →For communication and storage: Surrogate pairs have nothing to do with this, the problem is 1/2 of a code point is missing
Revision as of 18:25, 18 February 2020, Iridescent: m →Processing issues: Cleanup and typo fixing, typo(s) fixed: etc) → etc.) Tag: AWB
Line 27: For processing, a format should be easy to search, truncate, and generally process safely. All normal Unicode encodings use some form of fixed-size code unit. Depending on the format and the code point to be encoded, one or more of these code units will represent a Unicode [[code point]]. To allow easy searching and truncation, the code unit sequence for one code point must not occur within a longer sequence or across the boundary of two other sequences. UTF-8, UTF-16, UTF-32 and UTF-EBCDIC have these important properties but [[UTF-7]] and [[GB 18030]] do not. Fixed-size characters can be helpful, but even if there is a fixed byte count per code point (as in UTF-32), there is not a fixed byte count per displayed character due to [[combining character]]s. Given these incompatibilities and other quirks among encoding schemes, handling Unicode data with the same (or a compatible) protocol throughout and across interfaces (e.g. when using an API or library, or when handling Unicode characters in a client/server model) generally simplifies the whole pipeline and eliminates a potential source of bugs. UTF-16 is popular because many APIs date to the time when Unicode was 16-bit fixed width. However, using UTF-16 makes characters outside the [[Mapping of Unicode character planes\|Basic Multilingual Plane]] a special case, which increases the risk of oversights in their handling. That said, programs that mishandle surrogate pairs probably also have problems with combining sequences, so using UTF-32 is unlikely to solve the more general problem of poor handling of multi-code-unit characters.
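
The points above about code units and safe truncation can be illustrated with a minimal Python sketch. The function names `code_unit_counts` and `truncate_utf8` are hypothetical, chosen only for this illustration, and `truncate_utf8` assumes its input is already valid UTF-8. It relies on the fact that UTF-8 continuation bytes always have the bit pattern `10xxxxxx`, so a cut position can be moved back to the nearest code point boundary without decoding the whole string.

```python
def code_unit_counts(text: str) -> dict:
    """Count code points and code units of `text` in three encodings."""
    return {
        "code_points": len(text),
        "utf8_units": len(text.encode("utf-8")),            # 1 byte per unit
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # 2 bytes per unit
        "utf32_units": len(text.encode("utf-32-le")) // 4,  # 4 bytes per unit
    }


def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut valid UTF-8 `data` to at most `max_bytes` without splitting a code point.

    Assumption: `data` is well-formed UTF-8. Combining sequences may still
    be split, since they span several code points.
    """
    if max_bytes >= len(data):
        return data
    end = max_bytes
    # If the first dropped byte is a continuation byte (10xxxxxx), the cut
    # falls inside a code point, so back up to that code point's lead byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


# "e" + combining acute accent (one displayed character, two code points),
# followed by U+1F600, which lies outside the Basic Multilingual Plane.
sample = "e\u0301\U0001F600"
print(code_unit_counts(sample))
# {'code_points': 3, 'utf8_units': 7, 'utf16_units': 4, 'utf32_units': 3}

encoded = "a\u00e9\U0001F600".encode("utf-8")  # 1 + 2 + 4 = 7 bytes
print(truncate_utf8(encoded, 5).decode("utf-8") == "a\u00e9")  # True
```

Even in UTF-32 the sample uses three code units for two displayed characters, which is the combining-character caveat noted above; the truncation helper only protects code point boundaries, not combining sequences.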

Comparison of Unicode encodings: Difference between revisions