Variable-width encoding: Difference between revisions

General structure: finally color is done right
Line 8:
Since the aim of a multibyte encoding system is to minimise changes to existing application software, some characters must retain their pre-existing single-unit codes, even while other characters have multiple units in their codes. The result is that there are three sorts of units in a variable-width encoding: '''singletons''', which consist of a single unit, '''lead units''', which come first in a multiunit sequence, and '''trail units''', which come afterwards in a multiunit sequence. Input and display software needs to know about the structure of the multibyte encoding scheme, but other software generally does not need to know whether a pair of bytes represents two separate characters or just one character.
 
For example, the four-character string "I{{unicode|&#9829;}}NY" is encoded in [[UTF-8]] like this (shown as [[hexadecimal]] byte values): <span style="color:green">49</span> <span style="color:red">E2</span> <span style="color:blue">99</span> <span style="color:blue">A5</span> <span style="color:green">4E</span> <span style="color:green">59</span>. Of the six units in that sequence, <span style="color:green">49</span>, <span style="color:green">4E</span>, and <span style="color:green">59</span> are singletons (for ''I, N,'' and ''Y''), <span style="color:red">E2</span> is a lead unit and <span style="color:blue">99</span> and <span style="color:blue">A5</span> are trail units. The heart symbol is represented by the combination of the lead unit and the two trail units.
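
The classification above can be reproduced with a short sketch. It assumes the standard UTF-8 byte ranges (00–7F for singletons, 80–BF for trail units, C2–F4 for lead units); the function name <code>classify_utf8_unit</code> is hypothetical and chosen only for illustration.

```python
def classify_utf8_unit(byte: int) -> str:
    """Classify a single UTF-8 unit (one byte) by its value range."""
    if not 0x00 <= byte <= 0xFF:
        raise ValueError("a UTF-8 unit must be in the range 0x00-0xFF")
    if byte <= 0x7F:
        return "singleton"   # ASCII range: the unit is a whole character
    if byte <= 0xBF:
        return "trail"       # 0x80-0xBF: continuation of a multiunit sequence
    if 0xC2 <= byte <= 0xF4:
        return "lead"        # first unit of a 2-, 3- or 4-unit sequence
    return "invalid"         # 0xC0, 0xC1 and 0xF5-0xFF never occur in UTF-8


# The example string from the text: I, BLACK HEART SUIT (U+2665), N, Y.
units = "I\u2665NY".encode("utf-8")
for unit in units:
    print(f"{unit:02X} {classify_utf8_unit(unit)}")
```

Because the three ranges do not overlap, each unit can be classified by looking at its value alone, without examining its neighbours.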
 
UTF-8 is one of the best-designed multibyte encodings because the three sorts of units are kept apart and are easy for a program to identify. Older variable-width encodings are typically not so well designed: in them the trail and lead units may use the same values, and in some all three sorts use overlapping values. Where there is overlap, a text processing application that deals with the variable-width encoding must scan the text from a point where the unit boundaries are definitely known, such as the beginning of the text, in order to identify the various units properly and interpret the text correctly. In such encodings, one is liable to encounter false positives when searching for a string in the middle of the text. For example, if DE, DF, E0 and E1 can all be either lead units or trail units, then a search for the two-unit sequence DF E0 can yield a false positive in the two consecutive two-unit sequences DE DF E0 E1. There is also the danger that a single corrupted or lost unit may render the interpretation of a large run of multiunit sequences totally different. In a variable-width encoding where all three sorts of units are disjoint, string searching always works without false positives, and the corruption of one unit corrupts only one character.
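
The false positive described above can be illustrated with a minimal sketch. It assumes a hypothetical two-unit encoding, not any real one, in which every byte from 80 upward that starts a character is a lead unit and the byte after it is its trail unit, so lead and trail values overlap. The names <code>char_boundaries</code> and <code>find_aligned</code> are likewise invented for the example.

```python
def char_boundaries(data: bytes) -> set:
    """Return the offsets at which a character starts, plus the end offset.

    Assumed toy encoding: a byte >= 0x80 at a character start is a lead
    unit and takes the next byte as its trail unit; any other byte is a
    singleton. The scan has to begin at offset 0 to stay in step.
    """
    boundaries = set()
    i = 0
    while i < len(data):
        boundaries.add(i)
        i += 2 if data[i] >= 0x80 else 1
    boundaries.add(len(data))
    return boundaries


def find_aligned(data: bytes, needle: bytes) -> int:
    """Find needle in data, accepting only matches on character boundaries."""
    boundaries = char_boundaries(data)
    pos = data.find(needle)
    while pos != -1:
        if pos in boundaries and pos + len(needle) in boundaries:
            return pos
        pos = data.find(needle, pos + 1)
    return -1


text = bytes([0xDE, 0xDF, 0xE0, 0xE1])   # two characters: DE DF and E0 E1
needle = bytes([0xDF, 0xE0])             # one character: DF E0

naive_hit = text.find(needle)            # byte-wise search, ignores structure
aligned_hit = find_aligned(text, needle) # boundary-aware search
print(naive_hit, aligned_hit)
```

The byte-wise search reports a match at offset 1, which straddles the two characters, while the boundary-aware search reports no match. The cost is that the boundary-aware version must walk the text from the start, which is exactly the scanning burden that an encoding with disjoint unit ranges avoids.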