Variable-width encoding: Difference between revisions

Shlomital (talk | contribs)
m more proofreading
merge suggestion
 
{{more citations needed|date=December 2009}}
{{Short description|Type of character encoding scheme}}{{Merge|Variable-length code
| date = February 2025
}}{{About|the storage of text in computers|the transmission of data across noisy channels|variable-length code}}
{{Use dmy dates|date=December 2023}}
A '''variable-width encoding''' is a type of [[character encoding]] scheme in which codes of differing lengths are used to encode a [[character set]] (a repertoire of symbols) for representation, usually in a [[computer]].<ref>{{Cite RFC|last=Crispin|first=M.|date=1 April 2005|title=UTF-9 and UTF-18 Efficient Transformation Formats of Unicode|doi=10.17487/rfc4042|doi-access=}}</ref>{{efn|The concept long precedes the advent of the electronic computer, however, as seen with [[Morse code]].}} Most common variable-width encodings are '''multibyte encodings''' (aka '''MBCS''' – '''multi-byte character set'''), which use varying numbers of [[byte]]s ([[octet (computing)|octets]]) to encode different characters. (Some authors, notably in [[Microsoft]] documentation, use the term ''multibyte character set,'' which is a [[misnomer]], because representation size is an attribute of the encoding, not of the character set.)
 
Early variable-width encodings using less than a byte per character were sometimes used to pack English text into fewer bytes in [[adventure game]]s for early [[microcomputer]]s. However, [[disk storage|disks]] (which, unlike tapes, allow random access, so text can be loaded on demand), increases in computer memory, and general-purpose [[compression algorithm]]s have rendered such tricks largely obsolete.

Multibyte encodings are usually the result of a need to increase the number of characters that can be encoded without breaking [[backward compatibility]] with an existing constraint. For example, with one byte (8&nbsp;bits) per character, one can encode 256 possible characters. To encode more than 256 characters, the obvious choice would be to use two or more bytes per encoding unit (two bytes, or 16&nbsp;bits, would allow 65,536 possible characters), but such a change would break compatibility with existing systems and therefore might not be feasible at all.{{efn|As a real-life example of this, [[UTF-16]], which represents the most common characters in exactly the manner just described (and uses pairs of 16-bit code units for less-common characters), never gained traction as an encoding for text intended for interchange due to its incompatibility with the ubiquitous 7-/8-bit [[ASCII]] encoding, with its intended role instead being taken by [[UTF-8]], which ''does'' preserve ASCII compatibility.}}
==General structure==
Since the aim of a multibyte encoding system is to minimise changes to existing application software, some characters must retain their pre-existing single-unit codes, even while other characters have multiple units in their codes. The result is that there are three sorts of units in a variable-width encoding: '''singletons''', which consist of a single unit, '''lead units''', which come first in a multiunit sequence, and '''trail units''', which come afterwards in a multiunit sequence. Input and display software obviously needs to know about the structure of the multibyte encoding scheme, but other software generally does not need to know whether a pair of bytes represents two separate characters or just one character.
 
For example, the four-character string "[[I Love New York|I♥NY]]" is encoded in [[UTF-8]] like this (shown as [[hexadecimal]] byte values): {{mono|49 {{maroon|E2}} {{navy (color)|99}} {{navy (color)|A5}} 4E 59}}. Of the six units in that sequence, 49, 4E, and 59 are singletons (for ''I, N,'' and ''Y''), {{maroon|E2}} is a lead unit and {{navy (color)|99}} and {{navy (color)|A5}} are trail units. The heart symbol is represented by the combination of the lead unit and the two trail units.

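The classification in the example above can be sketched in Python. This is a minimal illustration, not a validating decoder: the function name <code>classify_utf8_unit</code> is hypothetical, and the value ranges assumed are those of modern UTF-8 (singletons 00–7F, trail units 80–BF, lead units C2–F4).

```python
def classify_utf8_unit(unit: int) -> str:
    """Classify one UTF-8 byte by its value range alone (hypothetical helper)."""
    if unit <= 0x7F:
        return "singleton"
    if 0x80 <= unit <= 0xBF:
        return "trail"
    if 0xC2 <= unit <= 0xF4:
        return "lead"
    return "invalid"  # C0, C1 and F5-FF never occur in well-formed UTF-8


encoded = "I♥NY".encode("utf-8")
kinds = [classify_utf8_unit(b) for b in encoded]

# Print each unit in hexadecimal next to its classification.
for b, kind in zip(encoded, kinds):
    print(f"{b:02X} {kind}")
```

Because the three ranges do not overlap, each unit can be classified without looking at its neighbours.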
UTF-8 makes it easy for a program to identify the three sorts of units, since they fall into separate value ranges. Older variable-width encodings are typically not as well designed, since the ranges may overlap. A text processing application that deals with the variable-width encoding must then scan the text from the beginning of all definitive sequences in order to identify the various units properly and interpret the text correctly. In such encodings, one is liable to encounter false positives when searching for a string in the middle of the text. For example, if the hexadecimal values DE, DF, E0, and E1 can all be either lead units or trail units, then a search for the two-unit sequence DF E0 can yield a false positive in the sequence DE DF E0 E1, which consists of two consecutive two-unit sequences. There is also the danger that a single corrupted or lost unit may render the whole interpretation of a large run of multiunit sequences incorrect. In a variable-width encoding where all three types of units are disjoint, string searching always works without false positives, and (provided the decoder is well written) the corruption or loss of one unit corrupts only one character.
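The false positive described above can be reproduced with a toy encoding. The rule used here (any byte of 80 or above starts a two-unit sequence, and any byte value may appear as a trail unit) is an assumption made purely for illustration and does not correspond to a real encoding. The function names are likewise hypothetical.

```python
def char_boundaries(data: bytes) -> set[int]:
    """Offsets where a character starts, found by scanning from the beginning."""
    starts, i = set(), 0
    while i < len(data):
        starts.add(i)
        # Toy rule (assumption): a byte >= 0x80 leads a two-unit sequence.
        i += 2 if data[i] >= 0x80 else 1
    return starts


def find_aligned(data: bytes, needle: bytes) -> int:
    """Like bytes.find, but only accepts matches that sit on character boundaries."""
    starts = char_boundaries(data)
    ends = starts | {len(data)}
    pos = data.find(needle)
    while pos != -1:
        if pos in starts and pos + len(needle) in ends:
            return pos
        pos = data.find(needle, pos + 1)
    return -1


data = bytes.fromhex("DEDFE0E1")    # two characters: DE DF and E0 E1
needle = bytes.fromhex("DFE0")      # a different character entirely

naive = data.find(needle)           # matches across the character boundary
aligned = find_aligned(data, needle)
print(naive, aligned)
```

The naive byte search reports a match at offset 1, which straddles two characters, while the boundary-aware search correctly reports no match. The cost is that the text must be scanned from a known starting point.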
==[[CJK]] multibyte encodings==
The first use of multibyte encodings was for the encoding of Chinese, Japanese and Korean, which have large character sets well in excess of 256 characters. At first the encoding was constrained to the limit of 7 bits. The [[ISO/IEC 2022|ISO-2022-JP, ISO-2022-CN and ISO-2022-KR encodings]] used the range 21–7E (hexadecimal) for both lead units and trail units, and marked them off from the singletons by using ISO&nbsp;2022 escape sequences to switch between single-byte and multibyte mode. A total of 8,836 (94×94) characters could be encoded at first, and further sets of 94×94 characters with switching. The ISO&nbsp;2022 encoding schemes for CJK are still in use on the Internet. The stateful nature of these encodings and the large overlap make them very awkward to process.
 
On [[Unix]] platforms, the ISO&nbsp;2022 7-bit encodings were replaced by a set of 8-bit encoding schemes, the Extended Unix Code: EUC-JP, EUC-CN and EUC-KR. Instead of distinguishing between the multiunit sequences and the singletons with escape sequences, which made the encodings stateful, multiunit sequences were marked by having the most significant bit set, that is, being in the range 80–FF (hexadecimal), while the singletons were in the range 00–7F alone. The lead units and trail units were in the range A1 to FE (hexadecimal), that is, the same as their range in the ISO&nbsp;2022 encodings, but with the high bit set to 1. These encodings were reasonably easy to work with provided all your delimiters were [[ASCII]] characters and you avoided truncating strings to fixed lengths, but a break in the middle of a multibyte character could still cause major corruption.
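The relationship between the two families can be checked with Python's standard <code>iso2022_jp</code> and <code>euc_jp</code> codecs. The sketch below assumes that, for this sample string, the ISO-2022-JP output consists of a three-byte opening escape, the two-byte codes, and a three-byte closing escape. The sample string itself is arbitrary.

```python
text = "日本"

iso = text.encode("iso2022_jp")
# Assumption: the output is ESC $ B + two-byte codes + ESC ( B,
# so stripping three bytes from each end leaves the raw 7-bit units.
jis_units = iso[3:-3]

euc = text.encode("euc_jp")

# Setting the most significant bit of each 7-bit unit gives the EUC-JP units.
high_bit_set = bytes(b | 0x80 for b in jis_units)

print(jis_units.hex(" "))
print(euc.hex(" "))
```

Since every unit of a multibyte character has its high bit set, such units can never be mistaken for ASCII singletons, and no escape sequences are needed.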
 
On the PC ([[MS-DOS]] and [[Microsoft Windows]] platforms), two encodings became established for Japanese and Traditional Chinese in which singletons, lead units and trail units all overlapped: [[Shift-JIS]] and [[Big5]] respectively. In Shift-JIS, lead units had the ranges 81–9F and E0–FC, trail units had the ranges 40–7E and 80–FC, and singletons had the ranges 21–7E and A1–DF. In Big5, lead units had the range A1–FE, trail units had the ranges 40–7E and A1–FE, and singletons had the range 21–7E (all values in hexadecimal). This overlap again made processing tricky, though at least most of the symbols had unique byte values (strangely, the backslash does not).<!-- FIXME: GBK and code page 949 should probably also be mentioned here. -->
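The backslash problem can be demonstrated with Python's standard <code>shift_jis</code> codec. In this sketch the helper names are hypothetical, and the lead-unit test uses only the ranges quoted above (81–9F and E0–FC). It assumes well-formed input and performs no validation.

```python
def is_sjis_lead(unit: int) -> bool:
    """True if the byte is in one of the Shift-JIS lead-unit ranges."""
    return 0x81 <= unit <= 0x9F or 0xE0 <= unit <= 0xFC


def find_singleton(data: bytes, unit: int) -> int:
    """Find a single-byte character, skipping over two-unit sequences."""
    i = 0
    while i < len(data):
        if is_sjis_lead(data[i]):
            i += 2          # skip the lead unit and its trail unit
        else:
            if data[i] == unit:
                return i
            i += 1
    return -1


# Two kanji followed by a real backslash (byte 5C).
data = "表示".encode("shift_jis") + b"\\"

naive = data.find(b"\\")              # hits the trail unit of the first kanji
aware = find_singleton(data, 0x5C)    # finds the actual backslash
print(data.hex(" "), naive, aware)
```

The first kanji is encoded as 95 5C, so a byte-level search for the backslash stops inside it. Software that treats 5C as a path separator or escape character without tracking lead units can corrupt such text.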
 
==[[Unicode]] variable-width encodings==
The [[Unicode]] standard has two variable-width encodings: [[UTF-8]] and [[UTF-16]] (it also has a fixed-width encoding, [[UTF-32]]). Originally, both the Unicode and [[ISO 10646|ISO&nbsp;10646]] standards were meant to be fixed-width, with Unicode being 16-bit and ISO&nbsp;10646 being 32-bit.{{Citation needed|date=April 2013}} ISO 10646 provided a variable-width encoding called [[UTF-1]], in which singletons had the range 00–9F, lead units the range A0–FF and trail units the ranges A0–FF and 21–7E. Because of this bad design, similar to [[Shift JIS]] and [[Big5]] in its overlap of values, the inventors of the [[Plan 9 from Bell Labs|Plan 9]] operating system, the first to implement Unicode throughout, abandoned it and replaced it with a much better designed variable-width encoding for Unicode: UTF-8, in which singletons have the range 00–7F, lead units have the range C0–FD (now actually C2–F4, to avoid overlong sequences and to maintain synchronism with the encoding capacity of UTF-16; see the [[UTF-8]] article), and trail units have the range 80–BF. The lead unit also tells how many trail units follow: one after C2–DF, two after E0–EF and three after F0–F4.{{efn|In the original version of UTF-8, from its 1992 publication until its code space was restricted to that of UTF-16 in 2003, the range of lead units encoding three-unit trailing sequences was larger (F0–F7); additionally, the lead units F8–FB were followed by four trail units, and FC–FD by five. FE–FF were never valid lead or trail units in any version of UTF-8.}}
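Because the lead unit announces how many trail units follow, a UTF-8 byte string can be split into characters without decoding it. The following Python sketch uses the modern ranges given above. The function names are hypothetical, and the code assumes well-formed input (it does not check that the following bytes really are trail units).

```python
def trail_units_after(lead: int) -> int:
    """Number of trail units announced by a singleton or lead unit."""
    if lead <= 0x7F:
        return 0            # singleton
    if 0xC2 <= lead <= 0xDF:
        return 1
    if 0xE0 <= lead <= 0xEF:
        return 2
    if 0xF0 <= lead <= 0xF4:
        return 3
    raise ValueError(f"{lead:02X} is not a singleton or lead unit")


def split_characters(data: bytes) -> list[bytes]:
    """Split well-formed UTF-8 into one byte string per character."""
    chars, i = [], 0
    while i < len(data):
        n = 1 + trail_units_after(data[i])
        chars.append(data[i:i + n])
        i += n
    return chars


# 'a', e with acute accent, black heart suit, grinning face emoji
sample = "a\u00e9\u2665\U0001F600".encode("utf-8")
pieces = split_characters(sample)
print([p.hex(" ") for p in pieces])
```

A trail unit (80–BF) passed as a starting point raises an error, which reflects the fact that the ranges for the three kinds of unit are disjoint.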
 
UTF-16 was devised to break free of the 65,536-character limit of the original Unicode (1.x) without breaking compatibility with the 16-bit encoding. In UTF-16, singletons have the range 0000–D7FF (55,296 code points) and E000–FFFF (8192 code points, 63,488 in total), lead units the range D800–DBFF (1024 code points) and trail units the range DC00–DFFF (1024 code points, 2048 in total). The lead and trail units, called ''high surrogates'' and ''low surrogates'', respectively, in Unicode terminology, map 1024×1024 or 1,048,576 supplementary characters, making 1,112,064 (63,488 BMP code points + 1,048,576 code points represented by high and low surrogate pairs) encodable code points, or ''scalar values'' in Unicode parlance (surrogates are not encodable).
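The surrogate arithmetic and the code point totals above can be checked with a few lines of Python. The function name <code>to_surrogates</code> is hypothetical, and the sample code point U+1F600 is chosen arbitrarily. The result is compared against Python's built-in <code>utf-16-be</code> codec.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) surrogate units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits select the lead unit
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits select the trail unit
    return high, low


high, low = to_surrogates(0x1F600)
codec_bytes = chr(0x1F600).encode("utf-16-be")

bmp_scalars = 0x10000 - 2048               # singletons: BMP minus surrogates
supplementary = 1024 * 1024                # lead units times trail units
total = bmp_scalars + supplementary

print(f"{high:04X} {low:04X}", total)
```

Each of the 1024 lead units combines with each of the 1024 trail units, which is where the 1,048,576 supplementary code points come from.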
 
==See also==
* [[wchar_t]] wide characters
* [[Lotus Multi-Byte Character Set]] (LMBCS)
* [[Triple-Byte Character Set]] (TBCS)
* [[Double-byte character set|Double-Byte Character Set]] (DBCS)
* [[SBCS|Single-Byte Character Set]] (SBCS)
 
==Notes==
{{notelist}}
 
==References==
{{Reflist}}
{{Character encoding}}
 
{{DEFAULTSORT:Variable-Width Encoding}}
[[Category:Character encoding]]