Revision as of 16:39, 19 November 2017 by Themckinlay (→Unicode variable-width encodings: correct false assertion that surrogate code points can be encoded with utf-16)
Revision as of 19:03, 6 June 2018 by Nitpicking polish (m Minor formatting.)
{{Unreferenced|date=December 2009}}
A '''variable-width encoding''' is a type of [[character encoding]] scheme in which codes of differing lengths are used to encode a [[character set]] (a repertoire of symbols) for representation in a [[computer]]. The most common variable-width encodings are '''multibyte encodings''', which use varying numbers of [[byte]]s ([[octet (computing)|octets]]) to encode different characters. (Some authors, notably in Microsoft documentation, use the term ''multibyte character set'', which is a [[misnomer]], because representation size is an attribute of the encoding, not of the character set.)

Early variable-width encodings using less than a byte per character were sometimes used to pack English text into fewer bytes in [[adventure game]]s for early [[microcomputers]]. However, [[disk storage|disks]] (which, unlike tapes, allow random access, so that text can be loaded on demand), increases in computer memory and general-purpose [[compression algorithm]]s have rendered such tricks largely obsolete.

Multibyte encodings are usually the result of a need to increase the number of characters that can be encoded without breaking [[backward compatibility]] with an existing constraint. For example, with one byte (8 bits) per character, one can encode 256 possible characters. To encode more than 256 characters, the obvious choice would be to use two or more bytes per encoding unit: two bytes (16 bits) would allow 65,536 possible characters. Such a change, however, would break compatibility with existing systems and therefore might not be feasible at all.

==CJK multibyte encodings==
The first use of multibyte encodings was for the encoding of Chinese, Japanese and Korean, which have large character sets well in excess of 256 characters. At first the encoding was constrained to the limit of 7 bits.
The ISO-2022-JP, ISO-2022-CN and ISO-2022-KR encodings used the range 21–7E (hexadecimal) for both lead units and trail units, and marked them off from the singletons by using ISO 2022 escape sequences to switch between single-byte and multibyte mode. A total of 8,836 (94×94) characters could be encoded at first, and further sets of 94×94 characters with switching. The ISO 2022 encoding schemes for CJK are still in use on the Internet. The stateful nature of these encodings and the large overlap make them very awkward to process.

On [[Unix]] platforms, the ISO 2022 7-bit encodings were replaced by a set of 8-bit encoding schemes, the Extended Unix Code: EUC-JP, EUC-CN and EUC-KR. Instead of distinguishing between the multiunit sequences and the singletons with escape sequences, which made the encodings stateful, multiunit sequences were marked by having the most significant bit set, that is, being in the range 80–FF (hexadecimal), while the singletons were in the range 00–7F alone. The lead units and trail units were in the range A1 to FE (hexadecimal), that is, the same as their range in the ISO 2022 encodings, but with the high bit set to 1. These encodings were reasonably easy to work with, provided that all delimiters were [[ASCII]] characters and strings were not truncated to fixed lengths, but a break in the middle of a multibyte character could still cause major corruption.

On the PC ([[DOS]] and [[Microsoft Windows]] platforms), two encodings became established for Japanese and Traditional Chinese in which singletons, lead units and trail units all overlapped: [[Shift-JIS]] and [[Big5]] respectively. In Shift-JIS, lead units had the range 81–9F and E0–FC, trail units had the range 40–7E and 80–FC, and singletons had the range 21–7E and A1–DF.
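The consequence of the Shift-JIS overlap can be illustrated with a minimal sketch that splits a byte string into singletons and lead/trail pairs, using only the ranges quoted above. The function names (`is_lead`, `is_trail`, `split_shift_jis`) are hypothetical and chosen for illustration; treating every byte outside the lead-unit ranges as a singleton is a simplifying assumption, and the sketch does not check whether a byte pair maps to an assigned character.

```python
from typing import List


def is_lead(byte: int) -> bool:
    """True if the byte is in a Shift-JIS lead-unit range (81-9F or E0-FC)."""
    return 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC


def is_trail(byte: int) -> bool:
    """True if the byte is in a Shift-JIS trail-unit range (40-7E or 80-FC)."""
    return 0x40 <= byte <= 0x7E or 0x80 <= byte <= 0xFC


def split_shift_jis(data: bytes) -> List[bytes]:
    """Split data into one-byte singletons and two-byte lead/trail pairs.

    Assumption for this sketch: any byte that is not a lead unit is
    treated as a singleton, without checking the singleton ranges.
    """
    units = []
    i = 0
    while i < len(data):
        if is_lead(data[i]):
            # A lead unit must be followed by a valid trail unit.
            if i + 1 >= len(data) or not is_trail(data[i + 1]):
                raise ValueError(f"truncated or invalid sequence at offset {i}")
            units.append(data[i:i + 2])
            i += 2
        else:
            units.append(data[i:i + 1])
            i += 1
    return units
```

For the input `b"\x95\x5cA\xb1"`, the scan yields the pair `95 5C` followed by the singletons `41` and `B1`. The trail unit `5C` has the same value as the ASCII backslash, so a byte-wise search for a backslash would match inside the two-byte character. This is why such text has to be scanned from a known character boundary.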
In Big5, lead units had the range A1–FE, trail units had the range 40–7E and A1–FE, and singletons had the range 21–7E (all values in hexadecimal). This overlap again made processing tricky, although most of the symbols at least had unique byte values (strangely, the backslash did not).<!--FIXME: GBK and code page 949 should probably also be mentioned here-->

==Unicode variable-width encodings==
The [[Unicode]] standard has two variable-width encodings: [[UTF-8]] and [[UTF-16]] (it also has a fixed-width encoding, [[UTF-32]]). Originally, both the Unicode and [[ISO 10646|ISO 10646]] standards were meant to be fixed-width, with Unicode being 16-bit and ISO 10646 being 32-bit.{{Citation needed|date=April 2013}} ISO 10646 provided a variable-width encoding called [[UTF-1]], in which singletons had the range 00–9F, lead units the range A0–FF and trail units the ranges A0–FF and 21–7E. Because of this bad design, which paralleled [[Shift-JIS]] and [[Big5]] in its overlap of values, the inventors of the [[Plan 9 from Bell Labs|Plan 9]] operating system, the first to implement Unicode throughout, abandoned it and replaced it with a much better designed variable-width encoding for Unicode: UTF-8. In UTF-8, singletons have the range 00–7F, lead units have the range C0–FD (now actually C2–F4, to avoid overlong sequences and to maintain synchronism with the encoding capacity of UTF-16; see the [[UTF-8]] article), and trail units have the range 80–BF. The lead unit also tells how many trail units follow: one after C2–DF, two after E0–EF and three after F0–F4.

UTF-16 was devised to break free of the 65,536-character limit of the original Unicode (1.x) without breaking compatibility with the 16-bit encoding.
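The arithmetic behind this extension can be sketched as follows. The sketch assumes the surrogate ranges defined by the Unicode standard (lead units D800–DBFF, trail units DC00–DFFF); the function names `to_surrogate_pair` and `from_surrogate_pair` are hypothetical and serve only as an illustration, not as a reference implementation.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Map a supplementary code point (10000-10FFFF) to a (lead, trail) pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000       # 20-bit value, 0 to FFFFF
    lead = 0xD800 + (offset >> 10)      # upper 10 bits select the lead unit
    trail = 0xDC00 + (offset & 0x3FF)   # lower 10 bits select the trail unit
    return lead, trail


def from_surrogate_pair(lead: int, trail: int) -> int:
    """Recombine a (lead, trail) pair into the code point it represents."""
    if not 0xD800 <= lead <= 0xDBFF:
        raise ValueError("lead unit out of range")
    if not 0xDC00 <= trail <= 0xDFFF:
        raise ValueError("trail unit out of range")
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)
```

Each unit contributes 10 bits, so the pairs cover 1024×1024 code points, starting at 10000 and ending at 10FFFF. Because lead units, trail units and singletons occupy disjoint ranges, a decoder can tell from any single 16-bit unit which role it plays.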
In UTF-16, singletons have the ranges 0000–D7FF (55,296 code points) and E000–FFFF (8,192 code points, 63,488 in total), lead units the range D800–DBFF (1,024 code points) and trail units the range DC00–DFFF (1,024 code points, 2,048 in total). The lead and trail units, called high surrogates and low surrogates respectively in Unicode terminology, map 1024×1024 or 1,048,576 supplementary characters. This makes 1,112,064 encodable code points (63,488 BMP code points plus 1,048,576 code points represented by pairs of high and low surrogates); the surrogate code points themselves are not encodable.

==See also==

Variable-width encoding: Difference between revisions