Content deleted Content added
Tag: categories removed |
merge suggestion |
||
(56 intermediate revisions by 38 users not shown) | |||
Line 1:
{{about|the storage of text in computers|the transmission of data across noisy channels|variable-length code}}▼
{{Short description|Type of character encoding scheme}}{{Merge|Variable-length code
▲{{Unreferenced|date=December 2009}}
| date = February 2025
▲}}{{
{{Use dmy dates|date=December 2023}}
A '''variable-width encoding''' is a type of [[character encoding]] scheme in which codes of differing lengths are used to encode a [[character set]] (a repertoire of symbols) for representation, usually in a [[computer]].<ref>{{Cite RFC|last=Crispin|first=M.|date=1 April 2005|title=UTF-9 and UTF-18 Efficient Transformation Formats of Unicode|doi=10.17487/rfc4042|doi-access=}}</ref>{{efn|The concept long precedes the advent of the electronic computer, however, as seen with [[Morse code]].}} Most common variable-width encodings are '''multibyte encodings''' (aka '''MBCS''' – '''multi-byte character set'''), which use varying numbers of [[byte]]s ([[octet (computing)|
Early variable
▲A '''variable-width encoding''' is a type of [[character encoding]] scheme in which codes of differing lengths are used to encode a [[character set]] (a repertoire of symbols) for representation in a [[computer]]. Most common variable-width encodings are '''multibyte encodings''', which use varying numbers of [[byte]]s ([[octet (computing)|octet]]s) to encode different characters.
Multibyte encodings are usually the result of a need to increase the number of characters which can be encoded without breaking [[backward compatibility]] with an existing constraint. For example, with one byte (8
▲Early variable width encodings using less than a byte per character were sometimes used to pack English text into fewer bytes in [[adventure game]]s for early [[microcomputers]]. However [[disk storage|disk]]s (which unlike tapes allowed random access allowing text to be loaded on demand), increases in computer memory and general purpose [[compression algorithm]]s have rendered such tricks largely obsolete.
==General structure==
▲Multibyte encodings are usually the result of a need to increase the number of characters which can be encoded without breaking [[backward compatibility]] with an existing constraint. For example, with one byte (8 bits) per character, one can encode 256 possible characters; in order to encode more than 256 characters, the obvious choice would be to use two or more bytes per encoding unit, two bytes (16 bits) would allow 65,536 possible characters, but such a change would break compatibility with existing systems and therefore might not be feasible at all.
Since the aim of a multibyte encoding system is to minimise changes to existing application software, some characters must retain their pre-existing single-unit codes, even while other characters have multiple units in their codes. The result is that there are three sorts of units in a variable-width encoding: '''singletons''', which consist of a single unit, '''lead units''', which come first in a multiunit sequence, and '''trail units''', which come afterwards in a multiunit sequence. Input and display software obviously needs to know about the structure of the multibyte encoding scheme, but other software generally doesn't need to know if a pair of bytes represent two separate characters or just one character.
For example, the four character string "[[I Love New York|I♥NY]]" is encoded in [[UTF-8]] like this (shown as [[hexadecimal]] byte values): {{mono|49 {{maroon|E2}} {{navy (color)|99}} {{navy (color)|A5}} 4E 59}}. Of the six units in that sequence, 49, 4E, and 59 are singletons (for ''I, N,'' and ''Y''), {{maroon|E2}} is a lead unit and {{navy (color)|99}} and {{navy (color)|A5}} are trail units. The heart symbol is represented by the combination of the lead unit and the two trail units.
UTF-8 makes it easy for a program to identify the three sorts of units, since they fall into separate value ranges. Older variable-width encodings are typically not as well-designed, since the ranges may overlap. A text processing application that deals with the variable-width encoding must then scan the text from the beginning of all definitive sequences in order to identify the various units and interpret the text correctly. In such encodings, one is liable to encounter false positives when searching for a string in the middle of the text. For example, if the hexadecimal values DE, DF, E0, and E1 can all be either lead units or trail units, then a search for the two-unit sequence DF E0 can yield a false positive in the sequence DE DF E0 E1, which consists of two consecutive two-unit sequences. There is also the danger that a single corrupted or lost unit may render the whole interpretation of a large run of multiunit sequences incorrect. In a variable-width encoding where all three types of units are disjunct, string searching always works without false positives, and (provided the decoder is well written) the corruption or loss of one unit corrupts only one character.
==CJK multibyte encodings==
The first use of multibyte encodings was for the encoding of Chinese, Japanese and Korean, which have large character sets well in excess of 256 characters. At first the encoding was constrained to the limit of 7 bits. The [[ISO/IEC 2022|ISO-2022-JP, ISO-2022-CN and ISO-2022-KR encodings]] used the range
On [[Unix]] platforms, the ISO 2022 7-bit encodings were replaced by a set of 8-bit encoding schemes, the Extended Unix Code: EUC-JP, EUC-CN and EUC-KR. Instead of distinguishing between the multiunit sequences and the singletons with escape sequences, which made the encodings stateful, multiunit sequences were marked by having the most significant bit set, that is, being in the range
On the PC ([[
==Unicode variable-width encodings==
The [[Unicode]] standard has two variable-width encodings: [[UTF-8]] and [[UTF-16]] (it also has a fixed-width encoding, [[UTF-32]]). Originally, both the Unicode and [[ISO 10646|ISO 10646]] standards were meant to be fixed-width, with Unicode being 16
UTF-16 was devised to break free of the 65,536-character limit of the original Unicode (1.x) without breaking compatibility with the 16-bit encoding. In UTF-16, singletons have the range
==See also==
* [[wchar_t]] wide characters
* [[Lotus Multi-Byte Character Set]] (LMBCS)
* [[Triple-Byte Character Set]] (TBCS)
* [[Double-byte character set|Double-Byte Character Set]] (DBCS)
* [[SBCS|Single-Byte Character Set]] (SBCS)
==Notes==
{{notelist}}
==References==
{{Reflist}}
{{Character encoding}}
{{DEFAULTSORT:Variable-Width Encoding}}
▲UTF-16 was devised to break free of the 65,536-character limit of the original Unicode (1.x) without breaking compatibility with the 16-bit encoding. In UTF-16, singletons have the range 0000-D7FF (55296 codepoints) and E000-FFFF (8192 codepoints, 63488 in total), lead units the range D800-DBFF (1024 codepoints) and trail units the range DC00-DFFF (1024 codepoints, 2048 in total). The lead and trail units, called in Unicode terminology high surrogates and low surrogates respectively, map 1024×1024 or 1,048,576 numbers, making for a maximum of possible 1,114,112 (1,048,576 codepoints represented by high and low surrogate pairs + 63488 BMP codepoints + 2048 surrogate codepoints) codepoints in Unicode, of which 1,112,064 codepoints are valid in other encodings: UTF-8, UTF-32, where there surrogate pair ranges are not required and forbidden to be used.
[[Category:Character encoding]]
|