Variable-width encoding: Difference between revisions

Colin Hill (talk | contribs)
mNo edit summary
rework first half
Line 1:
'''Variable-width encoding''' is a type of [[character encoding]] scheme in which codes of differing lengths are used to encode a coded [[character set]] (a repertoire of symbols) for representation in a [[computer]]. The most common form of variable-width encoding is '''multibyte encoding''', which uses varying numbers of [[byte]]s ([[octet]]s) to encode different characters.
 
 
Variable-width encodings are usually the result of a need to increase the number of characters which can be encoded without breaking [[backward compatibility]] with an existing legacy constraint. For example, with 8 bits per character, one can encode 256 possible characters; in order to encode more than 256 characters, the obvious choice would be to increase the number of bits in each encoding unit, such as to 16 bits, allowing 65,536 possible characters, but such a change would break compatibility with existing systems and therefore might not be feasible at all. The first variable-width encodings, the [[ISO 2022]] encodings for Chinese, Japanese and Korean, were constrained even further, to 7 bits per character.
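
The compatibility argument above can be illustrated with a minimal Python sketch. It assumes Latin-1 as a stand-in for "an existing 8-bit encoding", UTF-16 (little-endian) as a stand-in for "16 bits per unit", and UTF-8 as the variable-width alternative; these are illustrative choices, not encodings named in this paragraph.

```python
# Capacity of a fixed-width code with 8-bit and 16-bit units.
print(2 ** 8, 2 ** 16)  # 256 65536

text = "A"

# Assumed stand-ins for the three approaches discussed in the text.
legacy_bytes = text.encode("latin-1")      # existing 8-bit encoding
fixed16_bytes = text.encode("utf-16-le")   # every unit widened to 16 bits
variable_bytes = text.encode("utf-8")      # variable-width encoding

# Widening every unit changes the bytes of already-encoded text,
# so software expecting the legacy form no longer understands it.
print(legacy_bytes == fixed16_bytes)   # False

# The variable-width encoding leaves the legacy character unchanged...
print(legacy_bytes == variable_bytes)  # True

# ...and spends extra units only on characters outside the old range.
heart_bytes = "\u2665".encode("utf-8")
print(len(heart_bytes))  # 3
```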
==General Structure==
 
A variable-width encoding system adds a layer of [[software]] for generation and interpretation of groups of the base encoding units. This layer can encode a character of its repertoire by using a short sequence of base units, the length of which typically depends on the particular character. The resulting string of units can then generally be handled in an unchanged manner by the other, pre-existing layers of software.
A variable-width encoding adds a layer that uses more than one unit to encode characters outside the range that a single unit can represent. The single-unit layer coexists with the multiunit additions. The result is that there are three sorts of units in a variable-width encoding: '''singletons''', which consist of a single unit, '''lead units''', which come first in a multiunit sequence, and '''trail units''', which come afterwards in a multiunit sequence. For example, the word ''can&#8217;t'' (written with a right single quotation mark for the apostrophe, not the [[ASCII]] apostrophe) is encoded like this in [[UTF-8]]: <code>63 61 6E E2 80 99 74</code>. In this sequence, 63, 61, 6E and 74 are singletons, E2 is a lead unit and 80 and 99 are trail units.
 
Due to compatibility needs, some characters must retain their pre-existing single-unit codes, even while other characters have multiple units in their codes. The result is that there are three sorts of units in a variable-width encoding: '''singletons''', which consist of a single unit, '''lead units''', which come first in a multiunit sequence, and '''trail units''', which come afterwards in a multiunit sequence.
 
For example, the four-character string "I&#9829;NY" is encoded in [[UTF-8]] like this (shown as [[hexadecimal]] byte values): 49 E2 99 A5 4E 59. Of the six units in that sequence, 49, 4E, and 59 are singletons (for ''I'', ''N'', and ''Y''), E2 is a lead unit, and 99 and A5 are trail units. The heart symbol is represented by the combination of the lead unit and the two trail units.
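
The classification in the example above can be reproduced with a short Python sketch. The helper name <code>classify_utf8_unit</code> is hypothetical, and the byte ranges are those of the UTF-8 definition; the sketch only labels units and does not validate that a whole sequence is well formed.

```python
def classify_utf8_unit(byte: int) -> str:
    """Label one UTF-8 byte as singleton, lead or trail (hypothetical helper)."""
    if byte <= 0x7F:
        return "singleton"   # 0xxxxxxx: a complete one-unit character
    if byte <= 0xBF:
        return "trail"       # 10xxxxxx: continues a multiunit sequence
    if 0xC2 <= byte <= 0xF4:
        return "lead"        # starts a two-, three- or four-unit sequence
    return "invalid"         # C0, C1 and F5..FF never occur in UTF-8


units = "I\u2665NY".encode("utf-8")
print(units.hex(" ").upper())  # 49 E2 99 A5 4E 59

labels = [(f"{b:02X}", classify_utf8_unit(b)) for b in units]
for value, kind in labels:
    print(value, kind)
```

Because the three ranges do not overlap, each unit can be labelled by looking at that unit alone, without any knowledge of its neighbours.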
 
UTF-8 is one of the best-designed variable-width encodings, because the three sorts of units are kept apart and are easy for a program to identify. Older variable-width encodings are typically not so well designed: in them the trail and lead units may use the same values, and in some all three sorts use overlapping values. Where there is overlap, a text-processing application that deals with the variable-width encoding must scan the text from a point where the sequence boundaries are known for certain in order to identify the various units properly and interpret the text correctly. In such encodings, one is liable to encounter false positives when searching for a string in the middle of the text. For example, if DE, DF, E0 and E1 can all be either lead units or trail units, then a search for the two-unit sequence DF E0 can yield a false positive in the two consecutive two-unit sequences DE DF and E0 E1. There is also the danger that a single corrupted or lost unit may cause a large run of multiunit sequences to be interpreted incorrectly. In a variable-width encoding where all three sorts of units are disjoint, string searching always works without false positives, and the corruption of one unit corrupts only one character.
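
The false positive described above can be demonstrated with a minimal Python sketch. It assumes a hypothetical two-unit encoding in which any byte of 80 or above starts a two-byte character and may also appear as the second byte; this models the overlap problem in general and is not any specific real encoding. The function names are likewise illustrative.

```python
def char_starts(data: bytes) -> list:
    """Offsets at which characters begin, found by scanning from the start.

    Assumed toy encoding: a byte below 0x80 is a singleton; any byte
    >= 0x80 is a lead unit followed by exactly one trail unit, and
    trail units use the same value range as lead units.
    """
    starts = []
    i = 0
    while i < len(data):
        starts.append(i)
        i += 2 if data[i] >= 0x80 else 1
    return starts


def boundary_aware_find(data: bytes, needle: bytes) -> int:
    """Return the first match that begins on a character boundary, or -1."""
    for i in char_starts(data):
        if data.startswith(needle, i):
            return i
    return -1


text = bytes([0xDE, 0xDF, 0xE0, 0xE1])   # two characters: DE DF, E0 E1
needle = bytes([0xDF, 0xE0])             # a different single character

naive_hit = text.find(needle)                  # plain byte search
aware_hit = boundary_aware_find(text, needle)  # search on boundaries only

print(naive_hit)  # 1: false positive straddling two characters
print(aware_hit)  # -1: no such character in the text
```

The boundary-aware search has to walk the text from its beginning, which is the cost the paragraph above attributes to encodings with overlapping unit ranges.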
 
==[[CJK]] variable-width encodings==