Revision as of 20:47, 1 March 2006 by Johnteslade (minor edit, no edit summary)
Revision as of 09:16, 28 March 2006 by Phil Boswell (edit summary: fix REDIRECTs following page-move using AWB)
Line 13:
==[[CJK]] multibyte encodings==
The first use of multibyte encodings was for Chinese, Japanese and Korean, whose character sets far exceed 256 characters. At first the encoding was constrained to 7 bits. The ISO-2022-JP, ISO-2022-CN and ISO-2022-KR encodings used the range 21-7E (hexadecimal) for both lead units and trail units, and marked them off from the singletons by using ISO 2022 escape sequences to switch between single-byte and multibyte mode. A total of 8,836 (94×94) characters could be encoded at first, and further sets of 94×94 characters could be reached by switching. The ISO 2022 encoding schemes for CJK are still in use on the Internet. The stateful nature of these encodings and the large overlap between unit ranges make them very awkward to process.

Line 21 ⟶ 20:
==Unicode variable-width encodings==
The [[Unicode]] standard has two variable-width encodings: [[UTF-8]] and [[UTF-16]] (it also has a fixed-width encoding, [[UTF-32]]). Originally, both the Unicode and [[ISO 10646|ISO 10646]] standards were meant to be fixed-width, with Unicode being 16-bit and ISO 10646 being 32-bit. ISO 10646 provided a variable-width encoding called [[UTF-1]], in which singletons had the range 00-9F, lead units the range A0-FF and trail units the ranges A0-FF and 21-7E. Because of this bad design, parallel to Shift-JIS and Big5 in its overlap of values, the inventors of the [[Plan 9 ~~(operating system)~~from Bell Labs|Plan 9]] operating system, the first to implement Unicode throughout, abandoned it and replaced it with a much better designed variable-width encoding for Unicode: UTF-8, in which singletons have the range 00-7F, lead units have the range C0-FD (now actually C2-F4, to avoid overlong sequences and to stay in step with the encoding capacity of UTF-16; see [[UTF-8]] article), and trail units have the range 80-BF.
The lead unit also tells how many trail units follow: one after C2-DF, two after E0-EF and three after F0-F4.

UTF-16 was devised to break free of the 65,536-character limit of the original Unicode (1.x) without breaking compatibility with the 16-bit encoding. In UTF-16, singletons have the ranges 0000-D7FF and E000-FFFF, lead units the range D800-DBFF and trail units the range DC00-DFFF. The lead and trail units, called high surrogates and low surrogates respectively in Unicode terminology, map 1024×1024 or 1,048,576 numbers, which together with the 65,536 values of the original 16-bit range gives a maximum of 1,114,112 possible code points in Unicode.
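
The unit ranges described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `utf8_trail_count`, `to_surrogates` and `from_surrogates` are hypothetical, the UTF-8 ranges are the modern restricted ones (C2-DF, E0-EF, F0-F4), and no validation of the trail units that follow a UTF-8 lead unit is attempted.

```python
def utf8_trail_count(lead: int) -> int:
    """Return how many trail bytes a UTF-8 lead byte announces.

    Assumes the modern ranges: C2-DF, E0-EF, F0-F4.
    """
    if 0xC2 <= lead <= 0xDF:
        return 1
    if 0xE0 <= lead <= 0xEF:
        return 2
    if 0xF0 <= lead <= 0xF4:
        return 3
    raise ValueError(f"0x{lead:02X} is not a UTF-8 lead byte")


def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a code point above FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the surrogate-encoded range")
    offset = code_point - 0x10000           # 20-bit value, 0 .. 0xFFFFF
    high = 0xD800 + (offset >> 10)          # top 10 bits -> D800-DBFF
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits -> DC00-DFFF
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, `to_surrogates(0x1D11E)` splits the code point U+1D11E into one lead unit and one trail unit, and `from_surrogates` reverses the step. The 10-bit shift is where the 1024×1024 figure comes from: each surrogate carries ten bits of the offset.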

Variable-width encoding: Difference between revisions