Variable-width encoding: Difference between revisions

Shlomital (talk | contribs)
m proofreading
Shlomital (talk | contribs)
m decided the maths was only confusing
==General Structure==
 
A variable-width encoding adds a layer that uses more than one unit to encode characters outside the range that a single unit can encode. The single-unit layer coexists with the multiunit additions. As a result, there are three sorts of units in a variable-width encoding: '''singletons''', which consist of a single unit, '''lead units''', which come first in a multiunit sequence, and '''trail units''', which come afterwards in a multiunit sequence. For example, the word ''can&#8217;t'' (written with a right single quotation mark for the apostrophe, not the [[ASCII]] apostrophe) is encoded like this in [[UTF-8]]: <code>63 61 6E E2 80 99 74</code>. In this sequence, 63, 61, 6E and 74 are singletons, E2 is a lead unit, and 80 and 99 are trail units.
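
The classification in the example above can be checked with a short sketch. The function name <code>classify_utf8_unit</code> is a hypothetical helper, not part of any library, and the value ranges are those of the current UTF-8 definition, which are assumed here rather than taken from this article.

```python
def classify_utf8_unit(byte: int) -> str:
    """Classify a single UTF-8 byte by its value range (hypothetical helper)."""
    if byte <= 0x7F:
        return "singleton"  # 0xxxxxxx: a complete character on its own
    if byte <= 0xBF:
        return "trail"      # 10xxxxxx: continues a multiunit sequence
    if 0xC2 <= byte <= 0xF4:
        return "lead"       # starts a two-, three- or four-unit sequence
    return "invalid"        # C0, C1 and F5..FF never appear in valid UTF-8


# The word "can't" with U+2019 (right single quotation mark) as the apostrophe.
units = "can\u2019t".encode("utf-8")
labels = [(f"{b:02X}", classify_utf8_unit(b)) for b in units]

for hex_value, kind in labels:
    print(hex_value, kind)
```

Because the three ranges do not overlap, each unit can be classified by looking at that unit alone, without reading any of its neighbours.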
 
UTF-8 is one of the best-designed variable-width encodings: its three sorts of units occupy separate value ranges and are easy to identify. Other variable-width encodings are not so well designed, and in them the trail and lead units overlap (the same values serve as both). Some are so badly designed that all three overlap. Where there is overlap, a text processing application that deals with the variable-width encoding must scan the text from a position where the sequence boundaries are known for certain, such as the beginning of the text, in order to identify the various units properly and interpret the text correctly. In such encodings, one is liable to encounter false positives when searching for a string in the middle of the text. For example, if DE, DF, E0 and E1 can each be either a lead unit or a trail unit, then a search for the two-unit sequence DF E0 can yield a false positive inside DE DF E0 E1, which actually consists of the two consecutive two-unit sequences DE DF and E0 E1. There is also the danger that a single corrupted or lost unit may change the interpretation of a large run of multiunit sequences entirely. In a variable-width encoding where all three sorts of units are disjoint, string searching always works without false positives, and the corruption of one unit corrupts only one character.
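
The false positive described above can be reproduced with a minimal sketch. It assumes a made-up encoding, chosen only for illustration, in which any unit of 80 or above is a lead unit that starts a two-unit sequence and any value may follow as a trail unit. The helpers <code>char_starts</code> and <code>aligned_find</code> are hypothetical names.

```python
def char_starts(data: bytes) -> set:
    """Offsets at which a character begins, found by scanning from the start.

    Assumed toy encoding: a unit >= 0x80 leads a two-unit sequence,
    anything else is a singleton.
    """
    starts = set()
    i = 0
    while i < len(data):
        starts.add(i)
        i += 2 if data[i] >= 0x80 else 1
    return starts


def aligned_find(data: bytes, needle: bytes) -> int:
    """Find needle only where the match begins on a character boundary."""
    starts = char_starts(data)
    pos = data.find(needle)
    while pos != -1:
        if pos in starts:
            return pos
        pos = data.find(needle, pos + 1)
    return -1


text = bytes.fromhex("DEDFE0E1")   # two characters: DE DF and E0 E1
needle = bytes.fromhex("DFE0")     # one character: DF E0

naive_hit = text.find(needle)          # unit-level search, ignores boundaries
real_hit = aligned_find(text, needle)  # boundary-aware search

print(naive_hit, real_hit)
```

The unit-level search reports a match at offset 1, which straddles the two characters, while the boundary-aware search finds nothing. The cost is that <code>char_starts</code> has to scan from the beginning of the text, which is the burden that encodings with disjoint unit ranges avoid.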