Revision as of 19:40, 13 July 2005: Timwi (Administrators, 32,166 edits). Edit summary: →General Structure: "Lieutenant! The ship is sinking!" -- "Well, go and tell General Structure!"
Revision as of 16:01, 23 August 2005: Plugwash (Extended confirmed users, 9,427 edits). Edit summary: rework and expand a bit
Line 1:

A '''~~Variable~~variable-width encoding''' is a type of [[character encoding]] scheme in which codes of differing lengths are used to encode a [[character set]] (a repertoire of symbols) for representation in a [[computer]]. ~~The most~~ Most common ~~form of~~ variable-width ~~encoding is~~ encodings are '''multibyte ~~encoding~~encodings''', which ~~uses~~ use varying numbers of [[byte]]s ([[octet]]s) to encode different characters.

Early variable-width encodings using less than a byte per character were sometimes used to pack English text into fewer bytes in [[adventure game]]s for early [[microcomputers]]. However, [[disk]]s (which, unlike tapes, allow random access, so text can be loaded on demand), increases in computer memory, and general-purpose [[compression algorithm]]s have rendered such tricks largely redundant.

Variable-width encodings are usually the result of a need to increase the number of characters which can be encoded without breaking [[backward compatibility]] with an existing constraint. For example, with 8 bits per character, one can encode 256 possible characters; in order to encode more than 256 characters, the obvious choice would be to increase the number of bits in each encoding unit, such as to 16 bits, allowing 65,536 possible characters, but such a change would break compatibility with existing systems and therefore might not be feasible at all. ▼

==General structure==▼

▲ ~~Variable-width~~ Multibyte encodings are usually the result of a need to increase the number of characters which can be encoded without breaking [[backward compatibility]] with an existing constraint.
For example, with one byte (8 bits) per character, one can encode 256 possible characters; in order to encode more than 256 characters, the obvious choice would be to ~~increase the number of bits in each~~ use two or more bytes per encoding unit; ~~such as to 16 bits, allowing~~ two bytes (16 bits) would allow 65,536 possible characters, but such a change would break compatibility with existing systems and therefore might not be feasible at all.

A variable-width encoding system adds a layer of [[software]] for generation and interpretation of groups of the base encoding units. This layer can encode a character of its repertoire by using a short sequence of base units, the length of which typically depends on the particular character. The resulting string of units can then generally be handled in an unchanged manner by the other, pre-existing layers of software.

▲==General structure==

~~Due to compatibility needs~~ Since the aim of a multibyte encoding system is to minimise changes to existing application software, some characters must retain their pre-existing single-unit codes, even while other characters have multiple units in their codes. The result is that there are three sorts of units in a variable-width encoding: '''singletons''', which consist of a single unit, '''lead units''', which come first in a multiunit sequence, and '''trail units''', which come afterwards in a multiunit sequence. Input and display software must know the structure of the multibyte encoding scheme, but other software generally does not need to know whether a pair of bytes represents two separate characters or just one.

For example, the four-character string "I♥NY" is encoded in [[UTF-8]] like this (shown as [[hexadecimal]] byte values): 49 E2 99 A5 4E 59. Of the six units in that sequence, 49, 4E, and 59 are singletons (for ''I, N,'' and ''Y''), E2 is a lead unit and 99 and A5 are trail units.
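The classification of the six units above can be reproduced with a short Python sketch. It is only an illustration and is not part of either revision: the function name `classify_unit` is hypothetical, and the byte ranges it tests are the UTF-8 ones (singletons 00–7F, trail units 80–BF, lead units from C0 upwards).

```python
def classify_unit(byte: int) -> str:
    """Classify a single UTF-8 byte as a singleton, lead unit, or trail unit."""
    if byte <= 0x7F:
        return "singleton"  # bit pattern 0xxxxxxx: a complete one-byte character
    if byte <= 0xBF:
        return "trail"      # bit pattern 10xxxxxx: continues a multiunit sequence
    return "lead"           # bit pattern 11xxxxxx: starts a multiunit sequence


# Encode the example string and label every unit.
encoded = "I♥NY".encode("utf-8")
units = [(f"{b:02X}", classify_unit(b)) for b in encoded]

for hex_value, kind in units:
    print(hex_value, kind)
```

Because the three ranges do not overlap, each byte can be classified on its own, without looking at its neighbours.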
The heart symbol is represented by the combination of the lead unit and the two trail units.

UTF-8 is one of the best-designed ~~variable-width~~ multibyte encodings because the three sorts of units use separate ranges of values and are easy for a program to identify. Older variable-width encodings are typically not so well designed: in them the trail and lead units may use the same values, and in some all three sorts use overlapping values. Where there is overlap, a text processing application that deals with the variable-width encoding must scan the text from a point where the unit boundaries are known for certain (such as the beginning of the text) in order to identify the various units properly and interpret the text correctly. In such encodings, one is liable to encounter false positives when searching for a string in the middle of the text. For example, if DE, DF, E0 and E1 can all be either lead units or trail units, then a search for the two-unit sequence DF E0 can yield a false positive in the two consecutive two-unit sequences DE DF E0 E1. There is also the danger that a single corrupted or lost unit may cause a large run of multiunit sequences to be interpreted totally differently. In a variable-width encoding where the three sorts of units are disjoint, string searching always works without false positives, and the corruption of one unit corrupts only one character.

==[[CJK]] variable-width encodings==

Variable-width encoding: Difference between revisions