Revision as of 01:22, 31 December 2007 by 88.23.94.32 (no edit summary)		Revision as of 22:04, 20 October 2008 by 81.132.25.141 (no edit summary)
Line 17:
The first use of multibyte encodings was for Chinese, Japanese and Korean, whose character sets are far larger than 256 characters. At first these encodings were constrained to 7 bits. The ISO-2022-JP, ISO-2022-CN and ISO-2022-KR encodings used the range 21-7E (hexadecimal) for both lead units and trail units, and marked them off from the singletons by using ISO 2022 escape sequences to switch between single-byte and multibyte mode. A single set could encode 8,836 (94×94) characters, and further sets of 94×94 characters could be reached by switching. The ISO 2022 encoding schemes for CJK are still in use on the Internet. The stateful nature of these encodings and the large overlap make them very awkward to process.

On [[Unix]] platforms, the ISO 2022 7-bit encodings were replaced by a set of 8-bit encoding schemes, the Extended Unix Code: EUC-JP, EUC-CN and EUC-KR. Instead of distinguishing between the multiunit sequences and the singletons with escape sequences, which made the encodings stateful, multiunit sequences were marked by having the most significant bit set, that is, being in the range 80-FF (hexadecimal), while the singletons were in the range 00-7F alone. The lead units and trail units were in the range A1 to FE (hexadecimal), that is, the same as their range in the ISO 2022 encodings, but with the high bit set to 1. These encodings were reasonably easy to work with provided all delimiters were [[ASCII]] characters and strings were not truncated to fixed lengths, but a break in the middle of a multibyte character could still cause major corruption.

On the PC ([[MS-DOS]] and [[Microsoft Windows]] platforms), two encodings became established for Japanese and Traditional Chinese in which singletons, lead units and trail units all overlapped: [[Shift-JIS]] and [[Big5]] respectively.
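
As a rough illustration of the EUC scheme described above, the following Python sketch splits a byte string into singletons and two-byte sequences using only the ranges given in the text (00-7F for singletons, A1-FE for lead and trail units). The function names `iso2022_pair_to_euc` and `split_euc` are hypothetical, chosen for this example. The sketch is a simplification: it handles only two-byte sequences and ignores the single-shift bytes 8E and 8F that real EUC-JP also uses.

```python
def iso2022_pair_to_euc(pair: bytes) -> bytes:
    """Map a 7-bit lead/trail pair (21-7E) to its EUC form by setting the high bit."""
    if len(pair) != 2 or not all(0x21 <= b <= 0x7E for b in pair):
        raise ValueError("expected two bytes in the range 21-7E")
    return bytes(b | 0x80 for b in pair)


def split_euc(data: bytes) -> list:
    """Split EUC-style bytes into singletons and two-byte sequences.

    Simplified sketch: bytes 00-7F are singletons, bytes A1-FE start a
    two-byte sequence. Anything else, or a sequence cut off before its
    trail unit, raises ValueError.
    """
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 0x7F:
            # Singleton: never overlaps with lead or trail units.
            units.append(data[i:i + 1])
            i += 1
        elif 0xA1 <= b <= 0xFE:
            # Lead unit: the next byte must be a trail unit in A1-FE.
            if i + 1 >= len(data) or not (0xA1 <= data[i + 1] <= 0xFE):
                raise ValueError("truncated or malformed sequence at offset %d" % i)
            units.append(data[i:i + 2])
            i += 2
        else:
            raise ValueError("byte outside the ranges handled by this sketch")
    return units
```

Because no singleton has the high bit set, a search for an ASCII delimiter can never match inside a two-byte sequence. A string truncated after a lead unit, such as `b"A\xc6"`, is rejected here, which reflects the corruption risk mentioned above.
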
In Shift-JIS, lead units had the range 81-9F and E0-FC, trail units had the range 40-7E and 80-FC, and singletons had the range 21-7E and A1-DF. In Big5, lead units had the range A1-FE, trail units had the range 40-7E and A1-FE, and singletons had the range 21-7E (all values in hexadecimal). This overlap, again, made processing tricky, though at least most of the symbols had unique byte values (the backslash, 5C, is a notable exception, since it lies within the trail unit range 40-7E).<!--FIXME: GBK and code page 949 should probably also be mentioned here-->

==Unicode variable-width encodings==

Variable-width encoding: Difference between revisions