Revision as of 01:29, 2 January 2024, Comp.arch (m 1 April... Tag: 2017 wikitext editor)
Revision as of 18:09, 28 January 2024, ReadOnlyAccount (m →General structure)
For example, the four-character string "[[I Love New York|I♥NY]]" is encoded in [[UTF-8]] like this (shown as [[hexadecimal]] byte values): {{mono|49 {{maroon|E2}} {{navy (color)|99}} {{navy (color)|A5}} 4E 59}}. Of the six units in that sequence, 49, 4E, and 59 are singletons (for ''I, N,'' and ''Y''), {{maroon|E2}} is a lead unit, and {{navy (color)|99}} and {{navy (color)|A5}} are trail units. The heart symbol is represented by the combination of the lead unit and the two trail units.

UTF-8 makes it easy for a program to identify the three sorts of units, since they fall into separate value ranges. Older variable-width encodings are typically not as well designed, since their ranges may overlap. A text processing application that deals with such an encoding must then scan the text from a position that is known to begin a sequence (in the worst case, the start of the text) in order to identify the various units and interpret the text correctly. In such encodings, one is liable to encounter false positives when searching for a string in the middle of the text. For example, if the hexadecimal values DE, DF, E0, and E1 can all be either lead units or trail units, then a search for the two-unit sequence DF E0 can yield a false positive in the sequence DE DF E0 E1, which consists of the two consecutive two-unit sequences DE DF and E0 E1. There is also the danger that a single corrupted or lost unit may render the interpretation of a large run of multiunit sequences incorrect.

In a variable-width encoding where the value ranges of all three types of units are disjoint, string searching always works without false positives, and (provided the decoder is well written) the corruption or loss of one unit corrupts only one character.

==CJK multibyte encodings==
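
As an illustration of the UTF-8 example from the general structure discussion, the following minimal Python sketch classifies each unit of the encoded string "I♥NY" by value range alone. The function name <code>classify_utf8_unit</code> and the labels it returns are hypothetical names chosen for this illustration; the ranges are those of standard UTF-8.

```python
def classify_utf8_unit(byte: int) -> str:
    """Classify one UTF-8 code unit purely by its value range.

    Hypothetical helper for illustration; not part of any library.
    """
    if byte <= 0x7F:
        return "singleton"  # 0xxxxxxx: a complete one-unit character
    if byte <= 0xBF:
        return "trail"      # 10xxxxxx: continues a multiunit sequence
    if 0xC2 <= byte <= 0xF4:
        return "lead"       # begins a two-, three-, or four-unit sequence
    return "invalid"        # C0, C1, and F5..FF never occur in valid UTF-8


# "I\u2665NY" is the four-character string I, heart symbol, N, Y.
data = "I\u2665NY".encode("utf-8")
kinds = [classify_utf8_unit(b) for b in data]

for b, kind in zip(data, kinds):
    print(f"{b:02X} {kind}")
```

Because the three ranges do not overlap, each unit can be classified without looking at its neighbours, which is what allows a search or a decoder to start in the middle of the text.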

Variable-width encoding: Difference between revisions