Content deleted Content added
mNo edit summary |
merge suggestion |
||
(23 intermediate revisions by 15 users not shown) | |||
Line 1:
{{about|the storage of text in computers|the transmission of data across noisy channels|variable-length code}}▼
{{more citations needed|date=December 2009}}
{{Short description|Type of character encoding scheme}}{{Merge|Variable-length code
{{use dmy dates|date=January 2012}}▼
| date = February 2025
▲}}{{
A '''variable-width encoding''' is a type of [[character encoding]] scheme in which codes of differing lengths are used to encode a [[character set]] (a repertoire of symbols) for representation, usually in a [[computer]].<ref>{{Cite
Early variable
▲A '''variable-width encoding''' is a type of [[character encoding]] scheme in which codes of differing lengths are used to encode a [[character set]] (a repertoire of symbols) for representation in a [[computer]].<ref>{{Cite journal|last=Crispin|first=M.|date=April 2005|title=UTF-9 and UTF-18 Efficient Transformation Formats of Unicode|doi=10.17487/rfc4042|doi-access=free}}</ref> Most common variable-width encodings are '''multibyte encodings''', which use varying numbers of [[byte]]s ([[octet (computing)|octets]]) to encode different characters.
Multibyte encodings are usually the result of a need to increase the number of characters which can be encoded without breaking [[backward compatibility]] with an existing constraint. For example, with one byte (8
▲Early variable width encodings using less than a byte per character were sometimes used to pack English text into fewer bytes in [[adventure game]]s for early [[microcomputers]]. However [[disk storage|disks]] (which unlike tapes allowed random access allowing text to be loaded on demand), increases in computer memory and general purpose [[compression algorithm]]s have rendered such tricks largely obsolete.
▲Multibyte encodings are usually the result of a need to increase the number of characters which can be encoded without breaking [[backward compatibility]] with an existing constraint. For example, with one byte (8 bits) per character, one can encode 256 possible characters; in order to encode more than 256 characters, the obvious choice would be to use two or more bytes per encoding unit, two bytes (16 bits) would allow 65,536 possible characters, but such a change would break compatibility with existing systems and therefore might not be feasible at all.
==General structure==
Since the aim of a multibyte encoding system is to minimise changes to existing application software, some characters must retain their pre-existing single-unit codes, even while other characters have multiple units in their codes. The result is that there are three sorts of units in a variable-width encoding: '''singletons''', which consist of a single unit, '''lead units''', which come first in a multiunit sequence, and '''trail units''', which come afterwards in a multiunit sequence. Input and display software obviously needs to know about the structure of the multibyte encoding scheme, but other software generally doesn't need to know if a pair of bytes represent two separate characters or just one character.
For example, the four character string "[[I Love New York|I♥NY]]" is encoded in [[UTF-8]] like this (shown as [[hexadecimal]] byte values):
UTF-8 makes it easy for a program to identify the three sorts of units, since they fall into separate value ranges. Older variable-width encodings are typically not as well
==CJK multibyte encodings==
Line 22:
On [[Unix]] platforms, the ISO 2022 7-bit encodings were replaced by a set of 8-bit encoding schemes, the Extended Unix Code: EUC-JP, EUC-CN and EUC-KR. Instead of distinguishing between the multiunit sequences and the singletons with escape sequences, which made the encodings stateful, multiunit sequences were marked by having the most significant bit set, that is, being in the range 80–FF (hexadecimal), while the singletons were in the range 00–7F alone. The lead units and trail units were in the range A1 to FE (hexadecimal), that is, the same as their range in the ISO 2022 encodings, but with the high bit set to 1. These encodings were reasonably easy to work with provided all your delimiters were [[ASCII]] characters and you avoided truncating strings to fixed lengths, but a break in the middle of a multibyte character could still cause major corruption.
On the PC ([[DOS]] and [[Microsoft Windows]] platforms), two encodings became established for Japanese and Traditional Chinese in which all of singletons, lead units and trail units overlapped: [[Shift-JIS]] and [[Big5]] respectively. In Shift-JIS, lead units had the range 81–9F and E0–FC, trail units had the range 40–7E and 80–FC, and singletons had the range 21–7E and A1–DF. In Big5, lead units had the range A1–FE, trail units had the range 40–7E and A1–FE, and singletons had the range 21–7E (all values in hexadecimal). This overlap again made processing tricky, though at least most of the symbols had unique byte values (though strangely the backslash does not).<!-- FIXME: GBK and code page 949 should probably also be mentioned here. -->
==Unicode variable-width encodings==
The [[Unicode]] standard has two variable-width encodings: [[UTF-8]] and [[UTF-16]] (it also has a fixed-width encoding, [[UTF-32]]). Originally, both the Unicode and [[ISO 10646|ISO 10646]] standards were meant to be fixed-width, with Unicode being 16-bit and ISO 10646 being 32-bit.{{Citation needed|date=April 2013}} ISO 10646 provided a variable-width encoding called [[UTF-1]], in which singletons had the range 00–9F, lead units the range A0–FF and trail units the
UTF-16 was devised to break free of the 65,536-character limit of the original Unicode (1.x) without breaking compatibility with the 16-bit encoding. In UTF-16, singletons have the range 0000–D7FF (55,296 code points) and E000–FFFF (8192 code points, 63,488 in total), lead units the range D800–DBFF (1024 code points) and trail units the range DC00–DFFF (1024 code points, 2048 in total). The lead and trail units, called
==See also==
Line 33:
* [[Lotus Multi-Byte Character Set]] (LMBCS)
* [[Triple-Byte Character Set]] (TBCS)
* [[Double-byte character set|Double-Byte Character Set]] (DBCS)
* [[SBCS|Single-Byte Character Set]] (SBCS)
==Notes==
{{Character encoding}}▼
{{notelist}}
==References==
{{Reflist}}
▲{{Character encoding}}
{{DEFAULTSORT:Variable-Width Encoding}}
|