Revision as of 20:20, 21 March 2005 by Colin Hill (minor edit, no edit summary) → Revision as of 04:08, 4 June 2005 by R. S. Shaw (edit summary: rework first half)
'''Variable-width encoding''' is a type of [[character encoding]] scheme in which codes of differing lengths are used to encode a [[character set]] (a repertoire of symbols) for representation in a [[computer]]. The most common form of variable-width encoding is '''multibyte encoding''', which uses varying numbers of [[byte]]s ([[octet]]s) to encode different characters.

Variable-width encodings are usually the result of a need to increase the number of characters that can be encoded without breaking [[backward compatibility]] with an existing constraint.
For example, with 8 bits per character, one can encode 256 possible characters. To encode more than 256 characters, the obvious choice would be to increase the number of bits in each encoding unit, such as to 16 bits, allowing 65,536 possible characters, but such a change would break compatibility with existing systems and therefore might not be feasible at all.

==General Structure==

A variable-width encoding system adds a layer of [[software]] for generation and interpretation of groups of the base encoding units. This layer can encode a character of its repertoire by using a short sequence of base units, the length of which typically depends on the particular character. The resulting string of units can then generally be handled in an unchanged manner by the other, pre-existing layers of software.

Due to compatibility needs, some characters must retain their pre-existing single-unit codes, even while other characters have multiple units in their codes.
The result is that there are three sorts of units in a variable-width encoding: '''singletons''', which consist of a single unit, '''lead units''', which come first in a multiunit sequence, and '''trail units''', which come afterwards in a multiunit sequence. For example, the four-character string "I♥NY" is encoded in [[UTF-8]] like this (shown as [[hexadecimal]] byte values): 49 E2 99 A5 4E 59. Of the six units in that sequence, 49, 4E, and 59 are singletons (for ''I'', ''N'', and ''Y''), E2 is a lead unit, and 99 and A5 are trail units. The heart symbol is represented by the combination of the lead unit and the two trail units.

UTF-8 is one of the best-designed variable-width encodings, because the three sorts of units are kept apart and are easy for a program to identify. Older variable-width encodings are typically not so well designed: in them the trail and lead units may use the same values, and in some all three sorts use overlapping values. Where there is overlap, a text processing application that deals with the variable-width encoding must scan the text from the beginning of all definitive sequences in order to identify the various units properly and interpret the text correctly. In such encodings, one is liable to encounter false positives when searching for a string in the middle of the text. For example, if DE, DF, E0, and E1 can all be either lead units or trail units, then a search for the two-unit sequence DF E0 can yield a false positive in the two consecutive two-unit sequences DE DF E0 E1. There is also the danger that a single corrupted or lost unit may render the interpretation of a large run of multiunit sequences completely incorrect.
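The two examples above can be checked with a short Python sketch. The UTF-8 byte ranges in `classify_utf8_unit` follow the published UTF-8 definition. The "legacy" encoding in `split_units` is purely hypothetical: it assumes that any byte of 0x80 or above is a lead unit followed by exactly one trail unit of any value, which is enough to reproduce the DE DF E0 E1 false positive. All function names are illustrative and not taken from any library.

```python
def classify_utf8_unit(byte):
    """Classify one UTF-8 byte as a singleton, lead unit, or trail unit."""
    if byte <= 0x7F:
        return "singleton"   # 0xxxxxxx: a complete one-byte character
    if byte <= 0xBF:
        return "trail"       # 10xxxxxx: continues a multibyte sequence
    if 0xC2 <= byte <= 0xF4:
        return "lead"        # starts a two-, three-, or four-byte sequence
    return "invalid"         # C0, C1, and F5..FF never occur in valid UTF-8


# The "I♥NY" example: 49 E2 99 A5 4E 59
encoded = "I♥NY".encode("utf-8")
units = [(f"{b:02X}", classify_utf8_unit(b)) for b in encoded]


def split_units(data):
    """Split bytes into characters for a HYPOTHETICAL overlapping encoding:
    any byte >= 0x80 is a lead unit followed by one trail unit of any value."""
    chars, i = [], 0
    while i < len(data):
        width = 2 if data[i] >= 0x80 else 1
        chars.append(data[i:i + width])
        i += width
    return chars


def find_on_boundaries(text, pattern):
    """Return the byte offset of the first match that starts on a
    character boundary (found by scanning from the start), or -1."""
    offset = 0
    for ch in split_units(text):
        if text.startswith(pattern, offset):
            return offset
        offset += len(ch)
    return -1


text = bytes.fromhex("DE DF E0 E1")      # two characters: DE DF, then E0 E1
pattern = bytes.fromhex("DF E0")         # a different two-unit character

naive_hit = text.find(pattern)                   # byte-level search
aware_hit = find_on_boundaries(text, pattern)    # boundary-aware search
```

Here `naive_hit` is 1, a match that straddles two characters, while `aware_hit` is -1 because no character in the text begins at that offset. The boundary-aware search has to parse from the start of the text, which is the cost of overlapping unit ranges described above.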
In a variable-width encoding where all three sorts of units are disjoint, string searching always works without false positives, and the corruption of one unit corrupts only one character.

==[[CJK]] variable-width encodings==

Variable-width encoding: Difference between revisions