Variable-width encoding

UTF-8 is one of the best-designed multibyte encodings because the three sorts of units use separate value ranges and are easy for a program to identify. Older variable-width encodings are typically not so well designed: in many of them the lead and trail units may use the same values, and in some all three sorts use overlapping values. Where there is overlap, a text processing application must scan the text from a point where the sequence boundaries are known for certain in order to identify the various units properly and interpret the text correctly. In such encodings, a search for a string in the middle of the text is liable to produce false positives. For example, if DE, DF, E0 and E1 can all be either lead units or trail units, then a search for the two-unit sequence DF E0 yields a false positive in the run DE DF E0 E1, which actually consists of the two consecutive two-unit sequences DE DF and E0 E1. There is also the danger that a single corrupted or lost unit may cause a long run of multiunit sequences to be interpreted completely differently. In a variable-width encoding where all three sorts of units are disjoint, string searching always works without false positives, and the corruption of one unit corrupts only one character.
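
The false positive described above can be reproduced in a few lines. This is only a sketch: it assumes a hypothetical encoding in which every character in the sample is exactly two units wide, and the function names are illustrative rather than taken from any library.

```python
def naive_find(text: bytes, needle: bytes) -> int:
    """Plain byte search that ignores character boundaries."""
    return text.find(needle)


def aligned_find(text: bytes, needle: bytes, width: int = 2) -> int:
    """Search only at offsets where a character starts.

    Assumption: every character is exactly `width` units long, so
    characters start at offsets 0, width, 2*width, ...
    """
    for start in range(0, len(text) - len(needle) + 1, width):
        if text[start:start + len(needle)] == needle:
            return start
    return -1


# Two consecutive two-unit sequences: DE DF and E0 E1.
TEXT = bytes([0xDE, 0xDF, 0xE0, 0xE1])
NEEDLE = bytes([0xDF, 0xE0])

print(naive_find(TEXT, NEEDLE))    # 1: a match that straddles two characters
print(aligned_find(TEXT, NEEDLE))  # -1: no character-aligned match exists
```

The naive search reports offset 1, which lies in the middle of the first character, while the boundary-aware search correctly reports no match.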
 
==[[CJK]] variable-width encodings==
 
The first use of variable-width encodings was for Chinese, Japanese and Korean, which have large character sets well in excess of 256 characters. At first the encoding was constrained to 7 bits. The ISO-2022-JP, ISO-2022-CN and ISO-2022-KR encodings used the range 21-7E for both lead units and trail units, and marked them off from the singletons by using ISO 2022 escape sequences to switch between single-byte and multibyte mode. A total of 8,836 (94×94) characters could be encoded at first, and further sets of 94×94 characters could be reached by switching. The ISO 2022 encoding schemes for CJK are still in use on the Internet.
 
On [[Unix]] platforms, the ISO 2022 7-bit encodings were replaced by a set of 8-bit encoding schemes, the Extended Unix Code: EUC-JP, EUC-CN and EUC-KR. Instead of distinguishing between the multiunit sequences and the singletons with escape sequences, which made the encodings stateful, multiunit sequences were marked by having the most significant bit set, that is, being in the range 80-FF, while the singletons were in the range 00-7F alone. The lead units and trail units were in the range A1 to FE, that is, the same as their range in the ISO 2022 encodings, but with the high bit set to 1.
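
The relationship between the two families can be sketched as a one-line transformation: take the two 7-bit units of an ISO 2022 double-byte character and set the high bit of each. The helper name below is hypothetical, and the check against Python's built-in `euc_jp` codec uses the hiragana letter あ (U+3042), whose 7-bit code is 24 22.

```python
def iso2022_pair_to_euc(pair: bytes) -> bytes:
    """Turn a 7-bit double-byte code (both units in 21-7E) into its
    8-bit EUC form by setting the most significant bit of each unit."""
    if len(pair) != 2 or not all(0x21 <= b <= 0x7E for b in pair):
        raise ValueError("expected two units in the range 21-7E")
    return bytes(b | 0x80 for b in pair)


# 24 22 is the 7-bit code of HIRAGANA LETTER A (U+3042).
euc = iso2022_pair_to_euc(b"\x24\x22")
print(euc.hex())                          # a4a2
print(euc == "\u3042".encode("euc_jp"))   # True
```

Because both resulting units fall in A1-FE, a decoder can tell them apart from the singletons in 00-7F without tracking any escape-sequence state.
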
On the PC ([[MS-DOS]] and [[Microsoft Windows]] platforms), two encodings became established for Japanese and Traditional Chinese in which singletons, lead units and trail units all overlapped: [[Shift-JIS]] and [[Big5]] respectively. In Shift-JIS, lead units had the range 81-9F and E0-FC, trail units had the range 40-7E and 80-FC, and singletons had the range 21-7E and A1-DF. In Big5, lead units had the range A1-FE, trail units had the range 40-7E and A1-FE, and singletons had the range 21-7E.
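
A practical consequence of this overlap can be sketched with the Shift-JIS lead ranges quoted above. The katakana letter ソ (U+30BD) is encoded as 83 5C, and its trail unit 5C is also the singleton for the backslash, so a byte-level search for a backslash matches inside the character. The function name is illustrative, and the scan assumes well-formed input that begins on a character boundary.

```python
def sjis_char_starts(data: bytes) -> list:
    """Return the offsets at which characters start, scanning forward
    from offset 0 and using the lead ranges 81-9F and E0-FC."""
    starts = []
    i = 0
    while i < len(data):
        starts.append(i)
        unit = data[i]
        is_lead = 0x81 <= unit <= 0x9F or 0xE0 <= unit <= 0xFC
        i += 2 if is_lead else 1  # a lead unit consumes its trail unit too
    return starts


data = "\u30bda".encode("shift_jis")   # katakana SO followed by 'a'
print(data.hex())                      # 835c61
hit = data.find(b"\\")                 # naive search for a backslash
print(hit)                             # 1
print(hit in sjis_char_starts(data))   # False: the hit is a trail unit
```

Only a scan from a known boundary, as in `sjis_char_starts`, can tell that the unit at offset 1 is a trail unit rather than a real backslash.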
 
==[[Unicode]] variable-width encodings==
 
The Unicode standard has two variable-width encodings: [[UTF-8]] and [[UTF-16]]. Originally, both the Unicode and [[ISO 10646]] standards were meant to be fixed-width. ISO 10646 provided a variable-width encoding called UTF-1, in which singletons had the range 00-9F, lead units the range A0-FF and trail units the ranges A0-FF and 21-7E. Because of this bad design, which parallels Shift-JIS and Big5 in its overlap of values, the inventors of the [[Plan 9 (operating system)|Plan 9]] operating system, the first to implement Unicode throughout, abandoned it and replaced it with a much better designed variable-width encoding for Unicode: UTF-8. In UTF-8, singletons have the range 00-7F, lead units have the range C0-FD (now actually C2-F4, to avoid overlong sequences and to stay in synchronism with the encoding capacity of UTF-16; see [[UTF-8]] article), and trail units have the range 80-BF. The lead unit also tells how many trail units follow: one after C2-DF, two after E0-EF and three after F0-F4.
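
The rule that the lead unit announces the number of trail units can be sketched as a small lookup. The function name is hypothetical; the ranges are the ones given in the last sentence above, and the loop checks them against Python's own UTF-8 encoder.

```python
def utf8_trail_count(lead: int) -> int:
    """Number of trail units that follow a given UTF-8 lead unit."""
    if 0xC2 <= lead <= 0xDF:
        return 1
    if 0xE0 <= lead <= 0xEF:
        return 2
    if 0xF0 <= lead <= 0xF4:
        return 3
    raise ValueError(f"{lead:#04x} is not a UTF-8 lead unit")


# U+00E9, U+20AC and U+1F600 need two, three and four units respectively.
for ch in ("\u00e9", "\u20ac", "\U0001F600"):
    encoded = ch.encode("utf-8")
    # The first unit alone predicts the length of the whole sequence.
    print(encoded.hex(), utf8_trail_count(encoded[0]), len(encoded) - 1)
```

A decoder can therefore read one unit, learn the full length of the sequence, and skip straight to the next character.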
 
UTF-16 was devised to break free of the 65,536-character limit of the original Unicode (1.x) without breaking compatibility with the 16-bit encoding. In UTF-16, singletons have the ranges 0000-D7FF and E000-FFFF, lead units the range D800-DBFF and trail units the range DC00-DFFF. The lead and trail units, called high surrogates and low surrogates respectively in Unicode terminology, map 1024×1024 or 1,048,576 additional numbers, which together with the original 65,536 makes for a maximum of 1,114,112 possible code points in Unicode.
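
The surrogate arithmetic can be sketched as follows. The function names are illustrative; the constants are the lead and trail ranges given above, with each surrogate carrying 10 bits of the value that remains after subtracting 0x10000.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a code point above FFFF into a (lead, trail) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000           # 20-bit value
    lead = 0xD800 + (offset >> 10)          # top 10 bits
    trail = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return lead, trail


def from_surrogates(lead: int, trail: int) -> int:
    """Recombine a (lead, trail) surrogate pair into a code point."""
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)


lead, trail = to_surrogates(0x1F600)
print(f"{lead:04X} {trail:04X}")            # D83D DE00
print(hex(from_surrogates(lead, trail)))    # 0x1f600
print(0x10000 + 1024 * 1024)                # 1114112
```

The last line reproduces the total quoted above: 65,536 values in the 16-bit range plus 1,048,576 reachable through surrogate pairs.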