m proofreading |
m decided the maths was only confusing |
==General Structure==
A variable-width encoding adds a layer of using
UTF-8 is one of the best-designed variable-width encodings: its three sorts of units occupy separate ranges of values and are easy to tell apart. Other variable-width encodings are less well designed, and in them the lead and trail units overlap (the same values serve as both). In some, all three sorts overlap. Where there is overlap, a text-processing application that deals with the encoding must scan from the beginning of the text, or from some point known to be a sequence boundary, in order to identify the units properly and interpret the text correctly. In such encodings, a search for a string in the middle of the text is liable to produce false positives. For example, if DE, DF, E0 and E1 can each be either a lead unit or a trail unit, then a search for the two-unit sequence DF E0 can yield a false positive inside the two consecutive two-unit sequences DE DF and E0 E1, because the match straddles the boundary between them. There is also the danger that a single corrupted or lost unit changes the interpretation of a long run of multiunit sequences that follows it. In a variable-width encoding where all three sorts of units are disjoint, string searching never produces such false positives, and the corruption of one unit corrupts only one character.
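The false positive described above can be reproduced in a few lines. The following is only a rough sketch, not a model of any real legacy encoding: the "toy" encoding is hypothetical and assumes every character is exactly two units long, and the helper names (`find_all`, `toy_boundaries`, `is_utf8_trail`) are invented for this illustration. The UTF-8 half relies on the fact that UTF-8 trail units always lie in the range 80 to BF.

```python
def find_all(haystack: bytes, needle: bytes) -> list[int]:
    """Return every offset where needle occurs, comparing raw units only."""
    hits = []
    start = haystack.find(needle)
    while start != -1:
        hits.append(start)
        start = haystack.find(needle, start + 1)
    return hits


def toy_boundaries(text: bytes) -> set[int]:
    """Character boundaries in the hypothetical toy encoding.

    Assumption for this sketch: every character is exactly two units,
    so boundaries can only be found by counting from the start.
    """
    return set(range(0, len(text), 2))


def is_utf8_trail(unit: int) -> bool:
    """UTF-8 trail units are exactly the values 0x80..0xBF."""
    return 0x80 <= unit <= 0xBF


# Overlapping encoding: the two characters DE DF and E0 E1.
toy_text = bytes([0xDE, 0xDF, 0xE0, 0xE1])
toy_needle = bytes([0xDF, 0xE0])
toy_hits = find_all(toy_text, toy_needle)
# Keep only hits that start on a real character boundary.
toy_true_hits = [i for i in toy_hits if i in toy_boundaries(toy_text)]

# UTF-8: a raw match for a whole character cannot start on a trail unit.
utf8_text = "naïve café".encode("utf-8")
utf8_needle = "é".encode("utf-8")
utf8_hits = find_all(utf8_text, utf8_needle)
utf8_true_hits = [i for i in utf8_hits if not is_utf8_trail(utf8_text[i])]

print("toy raw hits:", toy_hits, "on boundaries:", toy_true_hits)
print("UTF-8 raw hits:", utf8_hits, "on boundaries:", utf8_true_hits)
```

In the toy encoding the raw search reports a hit that no boundary-aware scan would accept, whereas in UTF-8 the raw hits and the boundary-checked hits coincide, and the check itself needs only the unit at the match position rather than a scan from the start of the text.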