Revision as of 17:11, 21 December 2004 by Shlomital (no edit summary); revision as of 17:16, 21 December 2004 by Shlomital (minor edit; summary: accuracy, spelling)
A variable-width encoding adds a layer that uses 1+x units (where x>0) to encode the characters that fall outside the range a single unit can encode. The single-unit layer coexists with the multiunit additions. The result is that there are three sorts of units in a variable-width encoding: '''singletons''', which consist of a single unit, '''lead units''', which come first in a multiunit sequence, and '''trail units''', which come afterwards in a multiunit sequence. For example, the word ''can’t'' (with a right single quotation mark for the apostrophe, not the [[ASCII]] apostrophe) is encoded in [[UTF-8]] as <code>63 61 6E E2 80 99 74</code>. In this sequence, 63, 61, 6E and 74 are singletons, E2 is a lead unit, and 80 and 99 are trail units.

UTF-8 is one of the best-designed variable-width encodings: the three sorts of units occupy separate ranges and are easy to identify. Other variable-width encodings are not so well designed, and in them the lead and trail units overlap (the same values serve as both). In some, all three sorts overlap. Where there is overlap, a text processing application that deals with the variable-width encoding must scan the text from a point where the unit boundaries are known for certain in order to identify the various units properly and render the text correctly. In such encodings, one is liable to encounter false positives when searching for a string in the middle of the text. For example, if DE, DF, E0 and E1 can each be either a lead unit or a trail unit, then a search for the two-unit sequence DF E0 can yield a false positive in the two consecutive two-unit sequences DE DF E0 E1, because the match straddles the boundary between them. There is also the danger that a single corrupted or lost unit may change the interpretation of a long run of multiunit sequences that follows it.
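The two points above can be sketched in Python. This is an illustration only, not part of any standard library: the function names <code>classify_utf8_unit</code>, <code>naive_find</code> and <code>aligned_find</code> are hypothetical, the lead-unit range C2 to F4 assumes the current definition of UTF-8, and the second half assumes a made-up encoding in which every character shown happens to be exactly two units wide.

```python
def classify_utf8_unit(byte: int) -> str:
    """Name the sort of unit a single UTF-8 byte is, from its value alone."""
    if byte <= 0x7F:
        return "singleton"      # 00-7F: a complete character by itself
    if 0x80 <= byte <= 0xBF:
        return "trail"          # 80-BF: continues a multiunit sequence
    if 0xC2 <= byte <= 0xF4:
        return "lead"           # C2-F4: starts a multiunit sequence
    return "invalid"            # C0, C1, F5-FF do not occur in UTF-8


# The word from the text, with U+2019 as the apostrophe.
encoded = "can\u2019t".encode("utf-8")      # 63 61 6E E2 80 99 74
kinds = [classify_utf8_unit(b) for b in encoded]


def naive_find(haystack: bytes, needle: bytes) -> int:
    """Unit-by-unit search that ignores character boundaries."""
    return haystack.find(needle)


def aligned_find(haystack: bytes, needle: bytes, width: int = 2) -> int:
    """Search only at character boundaries.

    Assumption: every character in the hypothetical text is `width` units.
    """
    for i in range(0, len(haystack) - len(needle) + 1, width):
        if haystack[i:i + len(needle)] == needle:
            return i
    return -1


# Two consecutive two-unit characters in the hypothetical overlapping encoding.
text = bytes([0xDE, 0xDF, 0xE0, 0xE1])
needle = bytes([0xDF, 0xE0])

false_positive_at = naive_find(text, needle)    # matches across the boundary
aligned_result = aligned_find(text, needle)     # no character-aligned match
```

Here <code>kinds</code> labels E2 as a lead unit and 80 and 99 as trail units, matching the description of ''can’t''. The naive search reports a hit at offset 1, in the middle of the first character, while the boundary-aware search reports none.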
In a variable-width encoding where all three sorts of units are disjoint, string searching always works without false positives, and the corruption of one unit corrupts only one character.

==[[CJK]] variable-width encodings==

The first use of variable-width encodings was for the encoding of Chinese, Japanese and Korean, which have large character sets well in excess of 256 characters. At first the encoding was constrained to the limit of 7 bits. The ISO-2022-JP, ISO-2022-CN and ISO-2022-KR encodings used the range 21-7E for both lead units and trail units, and marked them off from the singletons by using ISO 2022 escape sequences to switch between single-byte and multibyte mode. A total of 8,836 (94×94) characters could be encoded at first, and further sets of 94×94 characters could be reached by switching. The ISO 2022 encoding schemes for CJK are still in use on the Internet.

On [[Unix]] platforms, the ISO 2022 7-bit encodings were replaced by a family of 8-bit encoding schemes, the Extended Unix Code: EUC-JP, EUC-CN and EUC-KR. Instead of distinguishing between the multiunit sequences and the singletons with escape sequences, which made the encodings stateful, multiunit sequences were marked by having the most significant bit set, that is, being in the range 80-FF, while the singletons were in the range 00-7F alone. The lead units and trail units were in the range A1 to FE, that is, the same as their range in the ISO 2022 encodings, but with the high bit set to 1.
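The relationship between the 7-bit and 8-bit ranges can be sketched in Python. The helper names <code>iso2022_pair_to_euc</code> and <code>classify_euc_unit</code> are hypothetical, and the sketch covers only the two-unit sequences described above; any other byte value is simply reported as "other". The sample pair 24 22 is assumed to be the 7-bit code of the hiragana letter あ in the Japanese set.

```python
def iso2022_pair_to_euc(lead: int, trail: int) -> bytes:
    """Map a 7-bit two-unit sequence (21-7E each) to the 8-bit form (A1-FE each)."""
    for unit in (lead, trail):
        if not 0x21 <= unit <= 0x7E:
            raise ValueError(f"unit {unit:#04x} is outside 21-7E")
    # Setting the most significant bit moves 21-7E to A1-FE.
    return bytes([lead | 0x80, trail | 0x80])


def classify_euc_unit(byte: int) -> str:
    """Tell singletons from multiunit bytes by value alone.

    Lead and trail units share the range A1-FE, so this cannot say which
    of the two a byte is; that still requires scanning from a known boundary.
    """
    if byte <= 0x7F:
        return "singleton"
    if 0xA1 <= byte <= 0xFE:
        return "lead or trail"
    return "other"          # values this sketch does not cover


# Number of two-unit combinations available in one 94x94 set.
units_per_position = 0x7E - 0x21 + 1            # 94
pairs_per_set = units_per_position ** 2         # 8836

# Assumed example: 24 22 in the 7-bit Japanese set becomes A4 A2.
euc_bytes = iso2022_pair_to_euc(0x24, 0x22)
```

Because the 8-bit form needs no escape sequences, a byte can be recognised as part of a multiunit sequence without tracking any mode; distinguishing lead from trail, however, still depends on position.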

Variable-width encoding: Difference between revisions