Revision as of 11:42, 10 March 2017 by ArwinJ (m: Punctuation order)
Revision as of 16:22, 19 November 2017 by Themckinlay (m: →Unicode variable-width encodings)
Line 26: The [[Unicode]] standard has two variable-width encodings: [[UTF-8]] and [[UTF-16]] (it also has a fixed-width encoding, [[UTF-32]]). Originally, both the Unicode and [[ISO 10646\|ISO 10646]] standards were meant to be fixed-width, with Unicode being 16-bit and ISO 10646 being 32-bit.{{Citation needed\|date=April 2013}} ISO 10646 provided a variable-width encoding called [[UTF-1]], in which singletons had the range 00-9F, lead units the range A0-FF and trail units the ranges A0-FF and 21-7E. Because this design, like [[Shift-JIS]] and [[Big5]], lets the value ranges of different unit types overlap, the inventors of the [[Plan 9 from Bell Labs\|Plan 9]] operating system, the first to implement Unicode throughout, abandoned it and replaced it with a better-designed variable-width encoding for Unicode: UTF-8, in which singletons have the range 00-7F, lead units have the range C0-FD (now actually C2-F4, to avoid overlong sequences and to stay in step with the encoding capacity of UTF-16; see the [[UTF-8]] article), and trail units have the range 80-BF. The lead unit also tells how many trail units follow: one after C2-DF, two after E0-EF and three after F0-F4.

UTF-16 was devised to break free of the 65,536-character limit of the original Unicode (1.x) without breaking compatibility with the 16-bit encoding. In UTF-16, singletons have the ranges 0000-D7FF (55,296 ~~codepoints~~code points) and E000-FFFF (8,192 ~~codepoints~~code points, 63,488 in total), lead units the range D800-DBFF (1,024 ~~codepoints~~code points) and trail units the range DC00-DFFF (1,024 ~~codepoints~~code points, 2,048 in total).
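The unit ranges above can be illustrated with a minimal Python sketch. The function names (`utf8_unit_kind`, `utf8_trail_count`, `utf16_unit_kind`) are hypothetical and chosen only for this illustration; the code classifies single code units by the ranges quoted in the text and is not a full validator or decoder.

```python
def utf8_unit_kind(byte: int) -> str:
    """Classify one UTF-8 code unit (0-255) by the ranges given in the text."""
    if 0x00 <= byte <= 0x7F:
        return "singleton"
    if 0x80 <= byte <= 0xBF:
        return "trail"
    if 0xC2 <= byte <= 0xF4:
        return "lead"
    # C0, C1 and F5-FF fall outside the modern lead range C2-F4.
    return "invalid"


def utf8_trail_count(lead: int) -> int:
    """Return how many trail units follow a UTF-8 lead unit."""
    if 0xC2 <= lead <= 0xDF:
        return 1
    if 0xE0 <= lead <= 0xEF:
        return 2
    if 0xF0 <= lead <= 0xF4:
        return 3
    raise ValueError(f"{lead:#04x} is not a UTF-8 lead unit")


def utf16_unit_kind(unit: int) -> str:
    """Classify one UTF-16 code unit (0-0xFFFF) by the ranges given in the text."""
    if not 0x0000 <= unit <= 0xFFFF:
        raise ValueError("a UTF-16 code unit must fit in 16 bits")
    if 0xD800 <= unit <= 0xDBFF:
        return "lead"   # high surrogate
    if 0xDC00 <= unit <= 0xDFFF:
        return "trail"  # low surrogate
    return "singleton"


if __name__ == "__main__":
    # The euro sign is encoded in UTF-8 as the bytes E2 82 AC.
    data = "\u20ac".encode("utf-8")
    print([utf8_unit_kind(b) for b in data])
    print(utf8_trail_count(data[0]))
```

Because the three UTF-8 ranges do not overlap, a single byte is enough to tell whether it starts a character, which is the property the text contrasts with UTF-1.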
The lead and trail units, called high surrogates and low surrogates respectively in Unicode terminology, map 1024×1024 or 1,048,576 numbers, giving a maximum of 1,114,112 ~~codepoints~~code points in Unicode (1,048,576 ~~codepoints~~code points represented by high and low surrogate pairs + 63,488 BMP ~~codepoints~~code points + 2,048 surrogate ~~codepoints~~code points), of which 1,112,064 ~~codepoints~~code points are valid in other encodings such as UTF-8 and UTF-32, where the surrogate ranges are not needed and must not be used.

==See also==

Variable-width encoding: Difference between revisions