Variable-width encoding: Difference between revisions

 
==Unicode variable-width encodings==
The [[Unicode]] standard has two variable-width encodings: [[UTF-8]] and [[UTF-16]] (it also has a fixed-width encoding, [[UTF-32]]). Originally, both the Unicode and [[ISO 10646]] standards were meant to be fixed-width, with Unicode being 16-bit and ISO 10646 being 32-bit.{{Citation needed|date=April 2013}} ISO 10646 provided a variable-width encoding called [[UTF-1]], in which singletons had the range 00–9F, lead units the range A0–FF and trail units the ranges A0–FF and 21–7E. Because of this poor design, similar to [[Shift JIS]] and [[Big5]] in its overlap of values between unit types, the inventors of the [[Plan 9 from Bell Labs|Plan 9]] operating system, the first to implement Unicode throughout, abandoned UTF-1 and replaced it with a much better designed variable-width encoding for Unicode: UTF-8. In UTF-8, singletons have the range 00–7F, lead units have the range C0–FD (now actually C2–F4, to avoid overlong sequences and to stay within the encoding capacity of UTF-16; see the [[UTF-8]] article), and trail units have the range 80–BF. The lead unit also tells how many trail units follow: one after C2–DF, two after E0–EF and three after F0–F4.{{efn|In the original version of UTF-8, from its 1992 publication until its code space was restricted to that of UTF-16 in 2003, the range of lead units introducing three-unit trailing sequences was larger (F0–F7); additionally, the lead units F8–FB were followed by four trail units, and FC–FD by five. FE–FF were never valid lead or trail units in any version of UTF-8.}}
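
The byte ranges above can be illustrated with a minimal Python sketch. It assumes the modern, restricted UTF-8 ranges (lead units C2–F4); the function names <code>classify_utf8_byte</code> and <code>trail_count</code> are hypothetical, chosen only for this illustration, and the sketch does not perform full validation of a byte sequence.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify one byte using the modern UTF-8 ranges described above."""
    if not 0x00 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b <= 0x7F:
        return "singleton"   # 00-7F: a complete one-byte character
    if b <= 0xBF:
        return "trail"       # 80-BF: continuation of a multi-byte sequence
    if 0xC2 <= b <= 0xF4:
        return "lead"        # C2-F4: starts a multi-byte sequence
    return "invalid"         # C0, C1 and F5-FF never occur in modern UTF-8


def trail_count(lead: int) -> int:
    """Return how many trail units follow the given lead unit."""
    if 0xC2 <= lead <= 0xDF:
        return 1
    if 0xE0 <= lead <= 0xEF:
        return 2
    if 0xF0 <= lead <= 0xF4:
        return 3
    raise ValueError("not a valid UTF-8 lead unit")


if __name__ == "__main__":
    # Show the unit type of each byte in the UTF-8 form of the euro sign.
    for byte in "\u20ac".encode("utf-8"):
        print(f"{byte:02X}", classify_utf8_byte(byte))
```

Because the three ranges do not overlap, a decoder that starts reading at an arbitrary byte can tell from that byte alone whether it is at the start of a character, which is the property UTF-1 lacked.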
 
UTF-16 was devised to break free of the 65,536-character limit of the original Unicode (1.x) without breaking compatibility with the 16-bit encoding. In UTF-16, singletons have the ranges 0000–D7FF (55,296 code points) and E000–FFFF (8,192 code points), 63,488 in total; lead units have the range D800–DBFF (1,024 code points) and trail units the range DC00–DFFF (1,024 code points), 2,048 in total. The lead and trail units, called high surrogates and low surrogates respectively in Unicode terminology, map 1,024 × 1,024 = 1,048,576 supplementary characters. This gives 1,112,064 encodable code points: 63,488 BMP code points plus 1,048,576 code points represented by surrogate pairs. The surrogate code points themselves are not encodable.
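
The surrogate arithmetic can be sketched in Python as follows. This is an illustrative sketch rather than a reference implementation; the function names <code>to_surrogates</code> and <code>from_surrogates</code> are hypothetical. It relies on the fact that supplementary code points start at U+10000, so the 20-bit offset from that value is split into two 10-bit halves.

```python
HIGH_START, LOW_START = 0xD800, 0xDC00   # first high and first low surrogate
SUPPLEMENTARY_START = 0x10000            # first code point beyond the BMP


def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into a (high, low) surrogate pair."""
    if not SUPPLEMENTARY_START <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - SUPPLEMENTARY_START        # 20-bit value
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)


def from_surrogates(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return SUPPLEMENTARY_START + ((high - HIGH_START) << 10) + (low - LOW_START)


# The counts given in the text, recomputed from the ranges.
SINGLETONS = (0xD7FF - 0x0000 + 1) + (0xFFFF - 0xE000 + 1)   # 63,488
PAIRS = (0xDBFF - 0xD800 + 1) * (0xDFFF - 0xDC00 + 1)        # 1,048,576

if __name__ == "__main__":
    high, low = to_surrogates(0x1F600)
    print(f"{high:04X} {low:04X}")
    print(SINGLETONS + PAIRS)
```

The highest pair, DBFF DFFF, corresponds to U+10FFFF, which is why the Unicode code space ends there.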