Variable-width encoding: Difference between revisions

SmackBot (talk | contribs)
m remove Erik9bot category,outdated, tag and general fixes
Line 1:
{{Unreferenced|date=December 2009}}
{{Otheruses4|the storage of text in computers|the transmission of data across noisy channels|variable-length code}}
A '''variable-width encoding''' is a type of [[character encoding]] scheme in which codes of differing lengths are used to encode a [[character set]] (a repertoire of symbols) for representation in a [[computer]]. The most common variable-width encodings are '''multibyte encodings''', which use varying numbers of [[byte]]s ([[octet (computing)|octet]]s) to encode different characters.
(Some authors, notably in Microsoft documentation, use the term ''multibyte character set,'' which is a [[misnomer]] since representation size is an attribute of the encoding, not of the character set.)
Line 10 ⟶ 11:
Since the aim of a multibyte encoding system is to minimise changes to existing application software, some characters must retain their pre-existing single-unit codes, even while other characters have multiple units in their codes. The result is that there are three sorts of units in a variable-width encoding: '''singletons''', which consist of a single unit, '''lead units''', which come first in a multiunit sequence, and '''trail units''', which come afterwards in a multiunit sequence. Input and display software needs to know the structure of the multibyte encoding scheme, but other software generally does not need to know whether a pair of bytes represents two separate characters or just one character.
 
For example, the four-character string "[[I Love New York|I{{Unicode|♥}}NY]]" is encoded in [[UTF-8]] like this (shown as [[hexadecimal]] byte values): <span style="color:green">49</span> <span style="color:red">E2</span> <span style="color:blue">99</span> <span style="color:blue">A5</span> <span style="color:green">4E</span> <span style="color:green">59</span>. Of the six units in that sequence, <span style="color:green">49</span>, <span style="color:green">4E</span>, and <span style="color:green">59</span> are singletons (for ''I, N,'' and ''Y''), <span style="color:red">E2</span> is a lead unit and <span style="color:blue">99</span> and <span style="color:blue">A5</span> are trail units. The heart symbol is represented by the combination of the lead unit and the two trail units.
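
The classification above can be reproduced with a short Python sketch. The function name <code>classify_utf8_unit</code> is illustrative only; the byte ranges it tests are the standard UTF-8 ranges (00-7F for singletons, 80-BF for trail units, C2-F4 for lead units).

```python
def classify_utf8_unit(byte: int) -> str:
    """Classify one UTF-8 byte as a singleton, lead unit, or trail unit."""
    if byte <= 0x7F:
        return "singleton"      # ASCII range, a complete character by itself
    if 0x80 <= byte <= 0xBF:
        return "trail"          # continuation byte of a multiunit sequence
    if 0xC2 <= byte <= 0xF4:
        return "lead"           # first byte of a 2-, 3- or 4-byte sequence
    return "invalid"            # C0, C1 and F5-FF never occur in valid UTF-8


text = "I\u2665NY"              # the string I, BLACK HEART SUIT, N, Y
units = text.encode("utf-8")
labels = [(f"{b:02X}", classify_utf8_unit(b)) for b in units]

for hex_value, kind in labels:
    print(hex_value, kind)
```

Because the three ranges never overlap, each byte can be classified without looking at its neighbours.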
 
UTF-8 makes it easy for a program to identify the three sorts of units, since they occupy separate value ranges. Older variable-width encodings are typically not as well designed: their lead and trail units may share values, and in some all three sorts use overlapping values. Where there is overlap, a text-processing application must scan from a point where the unit boundaries are known with certainty (typically the start of the text) in order to identify the units properly and interpret the text correctly. In such encodings, a search for a string in the middle of the text is liable to produce false positives. For example, if the hexadecimal values DE, DF, E0 and E1 can each be either a lead unit or a trail unit, then a search for the two-unit sequence DF E0 can yield a false positive inside the two consecutive two-unit sequences DE DF and E0 E1. There is also the danger that a single corrupted or lost unit may change the interpretation of a long run of subsequent multiunit sequences. In a variable-width encoding where the value ranges of all three sorts of units are disjoint, string searching always works without false positives, and (provided the decoder is well written) the corruption or loss of one unit corrupts only one character.
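
The false positive can be demonstrated with a minimal Python sketch. It assumes a hypothetical toy encoding, invented here for illustration, in which every byte below 80 is a singleton and every byte from 80 upwards starts a two-unit sequence whose trail unit may take any value; no real encoding is being modelled, and the function names are likewise illustrative.

```python
def split_characters(data: bytes) -> list:
    """Split a byte string into characters under the toy encoding.

    Assumption (hypothetical encoding): a byte >= 0x80 is a lead unit that
    is always followed by exactly one trail unit; anything else is a singleton.
    """
    chars = []
    i = 0
    while i < len(data):
        width = 2 if data[i] >= 0x80 else 1
        chars.append(data[i:i + width])
        i += width
    return chars


def find_character_aligned(haystack: bytes, needle: bytes) -> int:
    """Return the byte offset of needle, matching only on character boundaries."""
    hay = split_characters(haystack)
    ned = split_characters(needle)
    offset = 0
    for i in range(len(hay)):
        if hay[i:i + len(ned)] == ned:
            return offset
        offset += len(hay[i])
    return -1


haystack = bytes.fromhex("DEDFE0E1")    # two characters: DE DF and E0 E1
needle = bytes.fromhex("DFE0")          # one character: DF E0

naive_hit = haystack.find(needle)                        # plain byte search
aligned_hit = find_character_aligned(haystack, needle)   # boundary-aware search
print(naive_hit, aligned_hit)
```

The plain byte search reports a match that straddles two characters, while the boundary-aware search, which has to decode from the start of the text, reports none.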
 
==[[CJK]] multibyte encodings==
The first use of multibyte encodings was for the encoding of Chinese, Japanese and Korean, which have large character sets well in excess of 256 characters. At first the encoding was constrained to the limit of 7 bits. The ISO-2022-JP, ISO-2022-CN and ISO-2022-KR encodings used the range 21-7E (hexadecimal) for both lead units and trail units, and marked them off from the singletons by using ISO 2022 escape sequences to switch between single-byte and multibyte mode. A total of 8,836 (94×94) characters could be encoded at first, and further sets of 94×94 characters with switching. The ISO 2022 encoding schemes for CJK are still in use on the Internet. The stateful nature of these encodings and the large overlap make them very awkward to process.
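
The mode switching can be observed with Python's built-in <code>iso2022_jp</code> codec. This is only a sketch: the sample string is arbitrary, and the two escape sequences named in the code are the ones that codec is assumed to emit for JIS X 0208 and ASCII.

```python
ESC_TO_JIS = b"\x1b$B"     # escape sequence: switch to two-byte JIS X 0208 mode
ESC_TO_ASCII = b"\x1b(B"   # escape sequence: switch back to single-byte ASCII

text = "A\u65e5\u672cB"    # 'A', two kanji, 'B'
encoded = text.encode("iso2022_jp")

# Locate the bytes sent while the stream is in multibyte mode.
start = encoded.index(ESC_TO_JIS) + len(ESC_TO_JIS)
end = encoded.index(ESC_TO_ASCII, start)
multibyte_run = encoded[start:end]

print(encoded)
print([f"{b:02X}" for b in multibyte_run])
print(94 * 94)             # size of one 94x94 character set
```

Every byte of the multibyte run lies in the range 21-7E, the same range used by printable ASCII, so a decoder can tell the two apart only by tracking the escape sequences that preceded them.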
 
Line 26 ⟶ 27:
UTF-16 was devised to break free of the 65,536-character limit of the original Unicode (1.x) without breaking compatibility with the 16-bit encoding. In UTF-16, singletons have the range 0000-D7FF and E000-FFFF, lead units the range D800-DBFF and trail units the range DC00-DFFF. The lead and trail units, called in Unicode terminology high surrogates and low surrogates respectively, map 1024×1024 or 1,048,576 numbers; added to the 65,536 values of the original 16-bit range, this makes for a maximum of 1,114,112 possible code points in Unicode.
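
The surrogate arithmetic can be sketched in Python as follows. The function names are illustrative, and U+1F600 is an arbitrary sample code point outside the 16-bit range.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a code point above FFFF into a (lead, trail) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the supplementary range")
    offset = code_point - 0x10000          # 20-bit value
    lead = 0xD800 + (offset >> 10)         # top 10 bits -> D800-DBFF
    trail = 0xDC00 + (offset & 0x3FF)      # low 10 bits -> DC00-DFFF
    return lead, trail


def from_surrogates(lead: int, trail: int) -> int:
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)


lead, trail = to_surrogates(0x1F600)
print(f"{lead:04X} {trail:04X}")

# 1024 lead values x 1024 trail values, plus the original 16-bit range
total_code_points = 1024 * 1024 + 65536
print(total_code_points)
```

Since the lead, trail and singleton ranges do not overlap, a decoder can recognise each 16-bit unit's role from its value alone.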
 
==See also==
*[[wchar_t]] wide characters
 
{{Character encoding}}
 
{{DEFAULTSORT:Variable-Width Encoding}}
[[Category:Character encoding]]
 
[[de:Multibyte Character Set]]