Revision as of 21:45, 23 August 2005 by R. S. Shaw. Edit summary: all Unicodings are multbyte, but only some are variable-width.
Revision as of 22:14, 23 August 2005 by Plugwash. Edit summary: →[[Unicode]] variable-width encodings: remove double reference to utf-32 and some other edits
On the PC ([[MS-DOS]] and [[Microsoft Windows]] platforms), two encodings became established for Japanese and Traditional Chinese in which singletons, lead units and trail units all overlapped: [[Shift-JIS]] and [[Big5]] respectively. In Shift-JIS, lead units had the ranges 81-9F and E0-FC, trail units had the ranges 40-7E and 80-FC, and singletons had the ranges 21-7E and A1-DF. In Big5, lead units had the range A1-FE, trail units had the ranges 40-7E and A1-FE, and singletons had the range 21-7E.

==[[Unicode]] variable-width encodings==

The [[Unicode]] standard has two variable-width encodings: [[UTF-8]] and [[UTF-16]] (it also has a fixed-width encoding, [[UTF-32]]). Originally, both the Unicode and [[ISO 10646|ISO&nbsp;10646]] standards were meant to be fixed-width, with Unicode being 16-bit and ISO 10646 being 32-bit. ISO 10646 provided a variable-width encoding called UTF-1, in which singletons had the range 00-9F, lead units the range A0-FF and trail units the ranges A0-FF and 21-7E. Because of this bad design, parallel to Shift-JIS and Big5 in its overlap of values, the inventors of the [[Plan 9 (operating system)|Plan 9]] operating system, the first to implement Unicode throughout, abandoned it and replaced it with a much better designed variable-width encoding for Unicode: UTF-8, in which singletons have the range 00-7F, lead units have the range C0-FD (now actually C2-F4, to avoid overlong sequences and to stay in synchronism with the encoding capacity of UTF-16; see the [[UTF-8]] article), and trail units have the range 80-BF. The lead unit also tells how many trail units follow: one after C2-DF, two after E0-EF and three after F0-F4.

UTF-16 was devised to break free of the 65,536-character limit of the original Unicode (1.x) without breaking compatibility with the 16-bit encoding. In UTF-16, singletons have the ranges 0000-D7FF and E000-FFFF, lead units the range D800-DBFF and trail units the range DC00-DFFF.
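
As an illustration of the UTF-8 unit ranges, the following is a minimal Python sketch, not a full decoder. It assumes the current UTF-8 definition (singletons 00-7F, trail units 80-BF, lead units C2-F4); the function names <code>utf8_unit_kind</code> and <code>utf8_trail_count</code> are hypothetical and chosen only for this example.

```python
def utf8_unit_kind(byte: int) -> str:
    """Classify one byte under the current UTF-8 ranges (assumed here)."""
    if not 0x00 <= byte <= 0xFF:
        raise ValueError("not a byte value")
    if byte <= 0x7F:
        return "singleton"
    if byte <= 0xBF:          # 80-BF
        return "trail"
    if 0xC2 <= byte <= 0xF4:  # C0, C1 and F5-FF are never valid
        return "lead"
    return "invalid"


def utf8_trail_count(lead: int) -> int:
    """Return how many trail units follow a given lead unit."""
    if 0xC2 <= lead <= 0xDF:
        return 1
    if 0xE0 <= lead <= 0xEF:
        return 2
    if 0xF0 <= lead <= 0xF4:
        return 3
    raise ValueError("not a UTF-8 lead unit")


if __name__ == "__main__":
    # Compare the rule against Python's own encoder for a few characters.
    for ch in ("\u00e9", "\u20ac", "\U0001d11e"):
        encoded = ch.encode("utf-8")
        print(hex(ord(ch)), encoded.hex(), utf8_trail_count(encoded[0]))
```

Because the three ranges do not overlap, a single byte is enough to tell whether it starts a character, continues one, or stands alone, which is not possible in Shift-JIS, Big5 or UTF-1.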
The lead and trail units, called high surrogates and low surrogates respectively in Unicode terminology, map 1024×1024 or 1,048,576 numbers. Added to the 65,536 values of the original 16-bit range, this makes for a maximum of 1,114,112 possible code points in Unicode.

[[Category:Character encoding]]
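
The surrogate arithmetic can be sketched in Python as follows. This is an illustrative sketch under the ranges given for UTF-16 (lead units D800-DBFF, trail units DC00-DFFF); the function names <code>to_surrogates</code> and <code>from_surrogates</code> are hypothetical.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a code point above FFFF into a (lead, trail) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the supplementary range")
    offset = code_point - 0x10000      # a 20-bit number
    lead = 0xD800 + (offset >> 10)     # upper 10 bits select 1 of 1024 leads
    trail = 0xDC00 + (offset & 0x3FF)  # lower 10 bits select 1 of 1024 trails
    return lead, trail


def from_surrogates(lead: int, trail: int) -> int:
    """Combine a (lead, trail) surrogate pair back into a code point."""
    if not 0xD800 <= lead <= 0xDBFF:
        raise ValueError("not a lead (high) surrogate")
    if not 0xDC00 <= trail <= 0xDFFF:
        raise ValueError("not a trail (low) surrogate")
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)


if __name__ == "__main__":
    lead, trail = to_surrogates(0x1D11E)
    print(hex(lead), hex(trail))
    print(hex(from_surrogates(0xDBFF, 0xDFFF)))
    print(1024 * 1024 + 65536)
```

The last line of the usage example reproduces the count in the text: 1,048,576 pair values plus the 65,536 values of the 16-bit range.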

Variable-width encoding: Difference between revisions