Variable-width encoding: Difference between revisions

Tag: categories removed
m Reverted edits by 124.40.246.86 (talk) to last version by AlanUS
 
Multibyte encodings are usually the result of a need to increase the number of characters that can be encoded without breaking [[backward compatibility]] with an existing constraint. For example, with one byte (8 bits) per character, one can encode 256 possible characters. To encode more than 256 characters, the obvious choice would be to use two or more bytes per encoding unit: two bytes (16 bits) would allow 65,536 possible characters. Such a change, however, would break compatibility with existing systems and therefore might not be feasible at all.
 
==General structure==
Since the aim of a multibyte encoding system is to minimise changes to existing application software, some characters must retain their pre-existing single-unit codes, even while other characters have multiple units in their codes. The result is that there are three sorts of units in a variable-width encoding: '''singletons''', which consist of a single unit, '''lead units''', which come first in a multiunit sequence, and '''trail units''', which come afterwards in a multiunit sequence. Input and display software needs to know about the structure of the multibyte encoding scheme, but other software generally does not need to know whether a pair of bytes represents two separate characters or just one character.
 
For example, the four-character string "[[I Love New York|{{Unicode|I♥NY}}]]" is encoded in [[UTF-8]] like this (shown as [[hexadecimal]] byte values): <span style="color:green">49</span> <span style="color:red">E2</span> <span style="color:blue">99</span> <span style="color:blue">A5</span> <span style="color:green">4E</span> <span style="color:green">59</span>. Of the six units in that sequence, <span style="color:green">49</span>, <span style="color:green">4E</span>, and <span style="color:green">59</span> are singletons (for ''I, N,'' and ''Y''), <span style="color:red">E2</span> is a lead unit, and <span style="color:blue">99</span> and <span style="color:blue">A5</span> are trail units. The heart symbol is represented by the combination of the lead unit and the two trail units.
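
The classification above can be reproduced with a short Python sketch. The helper name <code>classify_utf8_unit</code> is hypothetical and introduced only for illustration; the value ranges are those of well-formed UTF-8, and the sketch classifies single bytes without validating whole sequences.

```python
def classify_utf8_unit(byte: int) -> str:
    """Classify one UTF-8 byte by its value range (hypothetical helper)."""
    if byte <= 0x7F:
        return "singleton"   # 00..7F: a complete one-byte character
    if 0x80 <= byte <= 0xBF:
        return "trail"       # 80..BF: continuation byte
    if 0xC2 <= byte <= 0xF4:
        return "lead"        # C2..F4: first byte of a 2- to 4-byte sequence
    return "invalid"         # C0, C1 and F5..FF never occur in well-formed UTF-8


text = "I\u2665NY"               # the string I, heart, N, Y
encoded = text.encode("utf-8")   # six bytes: 49 E2 99 A5 4E 59
units = [(f"{b:02X}", classify_utf8_unit(b)) for b in encoded]

for hex_value, kind in units:
    print(hex_value, kind)
```

Because each kind of unit has its own range, a single comparison on the byte value is enough; no surrounding context is consulted.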
 
UTF-8 makes it easy for a program to identify the three sorts of units, since they fall into separate value ranges. Older variable-width encodings are typically not as well designed, since the ranges may overlap. A text processing application that deals with such an encoding must then scan the text from a point where the unit boundaries are known for certain, usually the very beginning, in order to identify the various units and interpret the text correctly. In such encodings, one is liable to encounter false positives when searching for a string in the middle of the text. For example, if the hexadecimal values DE, DF, E0, and E1 can all be either lead units or trail units, then a search for the two-unit sequence DF E0 can yield a false positive in the sequence DE DF E0 E1, which consists of two consecutive two-unit sequences (DE DF and E0 E1). There is also the danger that a single corrupted or lost unit may render the interpretation of a large run of multiunit sequences incorrect. In a variable-width encoding where the value ranges of all three types of units are disjoint, string searching always works without false positives, and (provided the decoder is well written) the corruption or loss of one unit corrupts only one character.
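
The false positive described above can be demonstrated with a small Python sketch. The encoding modelled here is a made-up assumption, not any specific real code page: bytes 00 to 7F are singletons, and any byte 80 to FF found at a character boundary is a lead unit followed by exactly one trail unit. The function names are likewise hypothetical.

```python
def char_starts(data: bytes) -> list[int]:
    """Scan from the beginning and return the offset where each character starts.

    Assumed toy encoding: 00..7F are singletons; a byte 80..FF at a
    character boundary is a lead unit followed by exactly one trail unit.
    """
    starts = []
    i = 0
    while i < len(data):
        starts.append(i)
        i += 2 if data[i] >= 0x80 else 1
    return starts


def naive_find(data: bytes, needle: bytes) -> int:
    """Plain byte search that ignores character boundaries."""
    return data.find(needle)


def boundary_aware_find(data: bytes, needle: bytes) -> int:
    """Accept a match only if it begins and ends on character boundaries."""
    boundaries = set(char_starts(data)) | {len(data)}
    pos = data.find(needle)
    while pos != -1:
        if pos in boundaries and pos + len(needle) in boundaries:
            return pos
        pos = data.find(needle, pos + 1)
    return -1


text = bytes([0xDE, 0xDF, 0xE0, 0xE1])   # two characters: DE DF and E0 E1
needle = bytes([0xDF, 0xE0])             # a different two-unit character

print(naive_find(text, needle))           # 1: a false positive
print(boundary_aware_find(text, needle))  # -1: no real occurrence
```

The boundary-aware search has to rescan from the start of the text on every call, which is the cost the paragraph above attributes to encodings with overlapping ranges.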
 
==CJK multibyte encodings==
 
UTF-16 was devised to break free of the 65,536-character limit of the original Unicode (1.x) without breaking compatibility with the 16-bit encoding. In UTF-16, singletons have the ranges 0000-D7FF (55,296 code points) and E000-FFFF (8,192 code points), 63,488 in total; lead units have the range D800-DBFF (1,024 code points) and trail units the range DC00-DFFF (1,024 code points), 2,048 in total. The lead and trail units, called high surrogates and low surrogates respectively in Unicode terminology, combine into 1024×1024 or 1,048,576 pairs. This gives Unicode a maximum of 1,114,112 code points (1,048,576 represented by surrogate pairs + 63,488 BMP code points + 2,048 surrogate code points). Of these, 1,112,064 are valid in other encodings such as UTF-8 and UTF-32, where the surrogate ranges are not needed and are forbidden.
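
The surrogate arithmetic and the totals above can be checked with a brief Python sketch. The function names <code>to_surrogate_pair</code> and <code>from_surrogate_pair</code> are hypothetical; the mapping itself (subtract 10000 hexadecimal, then split the remaining 20 bits into two 10-bit halves) is the standard UTF-16 rule.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above FFFF into a (lead, trail) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the range 10000..10FFFF")
    offset = code_point - 0x10000        # 20-bit value
    lead = 0xD800 + (offset >> 10)       # top 10 bits -> D800..DBFF
    trail = 0xDC00 + (offset & 0x3FF)    # low 10 bits -> DC00..DFFF
    return lead, trail


def from_surrogate_pair(lead: int, trail: int) -> int:
    """Recombine a (lead, trail) surrogate pair into one code point."""
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)


# The counts quoted in the text
singletons = (0xD7FF - 0x0000 + 1) + (0xFFFF - 0xE000 + 1)    # 55296 + 8192
surrogates = (0xDBFF - 0xD800 + 1) + (0xDFFF - 0xDC00 + 1)    # 1024 + 1024
pairs = 1024 * 1024
total = pairs + singletons + surrogates

print(singletons, surrogates, pairs, total, total - surrogates)
print([f"{unit:04X}" for unit in to_surrogate_pair(0x1F600)])
```

For U+1F600 the sketch yields the pair D83D DE00, which matches the bytes produced by Python's built-in <code>"\U0001F600".encode("utf-16-be")</code>.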
 
==See also==
*[[wchar_t]] wide characters
 
{{Character encoding}}
 
{{use dmy dates|date=January 2012}}
{{DEFAULTSORT:Variable-Width Encoding}}
[[Category:Character encoding]]