Revision as of 21:02, 15 August 2005 by Indefatigable: m Comparison of unicode encodings moved to Comparison of Unicode encodings
Revision as of 21:31, 15 August 2005 by Indefatigable: Clarified comparing size only. Expanded abbreviations. Caps and spelling.
Line 1:
This page compares ~~unicode~~Unicode encodings. Two situations are considered: ~~8~~eight-bit-clean environments and environments like [[Simple Mail Transfer Protocol~~\|SMTP~~]] that use only ~~support~~seven ~~7~~bits per byte, the high-order bit ~~characters~~being ignored or used for [[parity]]. [[~~SCSU~~Standard Compression Scheme for Unicode]] and [[~~BOCU~~Binary Ordered Compression for Unicode]] are excluded from the comparison tables because it is difficult to simply quantify ~~thier~~their size.

==In summary==
If space were the only consideration, UTF-32 ~~loses~~would lose in almost every case since characters outside the ~~BMP~~[[basic multilingual plane]] are very rare, and one of the bytes of BMP characters in ~~utf~~UTF-32 is always 0.

For ~~7~~seven-bit environments UTF-7 clearly wins in terms of size over the combination of other ~~unicode~~Unicode encodings with [[quoted printable]] or [[base64]].

For ~~8~~eight-bit-clean environments things vary ~~considerablly~~considerably depending on what code points are in the text ~~to be encoded~~.

==In detail==
The tables below list the number of bytes per code point for different ~~unicode~~Unicode ranges. Any ~~additonal~~additional comments needed are included in the table. The figures assume that overheads at the start and end of the block of text are ~~negligable~~negligible.

===~~8~~Eight-bit environments===
{\| {{prettytable}}
\|~~code~~Code range (hexadecimal)\|\|[[UTF-8]]\|\|[[UTF-16]]\|\|[[UTF-32]]\|\|[[GB18030]]
\|-
\|000000 - 00007F\|\|1\|\|2\|\|4\|\|1
Line 20:
\|}

===~~7~~Seven-bit environments===
This table may not cover every special case and so should be used for estimation and ~~comparion~~comparison only. To accurately determine the size of text in an encoding ~~please~~, see the actual specifications.
{\| {{prettytable}}
\|code range (hexadecimal)\|\|[[UTF-7]]\|\|[[UTF-8]] [[quoted printable]]\|\|UTF-8 [[base64]]\|\|[[UTF-16]] quoted printable\|\|UTF-16 base64\|\|[[UTF-32]] quoted printable\|\|UTF-32 base64\|\|[[GB18030]] quoted printable\|\|[[GB18030]] base64
Line 27:
\|000000 - 000032\|\|same as 000080-00FFFFFF\|\|3\|\|1⅓\|\|6\|\|2⅔\|\|12\|\|5⅓\|\|3\|\|1⅓
\|-
\|000033 - 00003C\|\|rowspan=3\|1 for "direct characters" and ~~possiblly~~possibly "optional direct characters" (depending on the encoder setting) 2 for +, otherwise same as 000080-00FFFFFF\|\|1\|\|1⅓\|\|4\|\|2⅔\|\|10\|\|5⅓\|\|1\|\|1⅓
\|-
\|00003D (equals sign)\|\|3\|\|1⅓\|\|6\|\|2⅔\|\|12\|\|5⅓\|\|3\|\|1⅓
Line 33:
\|00003E - 00007E\|\|1\|\|1⅓\|\|4\|\|2⅔\|\|10\|\|5⅓\|\|1\|\|1⅓
\|-
\|00007F\|\|rowspan=3\|5 for an ~~isolted~~isolated case inside a run of single byte characters. For runs 2⅔ per character plus padding to make it a whole number of bytes plus two to start and finish the run\|\|3\|\|1⅓\|\|6\|\|2⅔\|\|12\|\|5⅓\|\|3\|\|1⅓
\|-
\|000080 - 0007FF\|\|6\|\|2⅔\|\|rowspan=2\|2-6 depending on if the byte values need to be escaped\|\|2⅔\|\|rowspan=3\|8-12 depending on if the final two byte values need to be escaped\|\|5⅓\|\|rowspan=2\|4-6 for stuff inherited from [[GB2312]]/[[GBK]] (e.g.<br>most Chinese stuff) 6-10 for everything else.\|\|rowspan=2\|2⅔ for stuff inherited from [[GB2312]]/[[GBK]] (e.g.<br>most Chinese stuff) 5⅓ for everything else.
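
The per-code-point figures in the tables can be spot-checked empirically. The Python sketch below is not part of either revision. It is a minimal illustration that assumes Python's built-in `utf-7`, `utf-8`, `utf-16-be`, `utf-32-be` and `gb18030` codecs together with the standard `quopri` and `base64` modules. The function name `bytes_per_code_point` and the sample strings are hypothetical. Python's UTF-7 codec and `quopri` make their own choices about optional direct characters and escaping, so results may differ from the table in special cases, and base64 padding inflates the figure when the raw byte count is not a multiple of three.

```python
import base64
import quopri

# Encodings compared in the tables; big-endian variants avoid a byte order mark.
CODECS = ("utf-8", "utf-16-be", "utf-32-be", "gb18030")


def bytes_per_code_point(text):
    """Return {label: encoded bytes per code point} for a non-empty string."""
    n = len(text)  # Python 3 strings are sequences of code points
    sizes = {"utf-7": len(text.encode("utf-7")) / n}
    for codec in CODECS:
        raw = text.encode(codec)
        sizes[codec] = len(raw) / n
        # Seven-bit transports: wrap the raw bytes in a transfer encoding.
        sizes[codec + " quoted printable"] = len(quopri.encodestring(raw)) / n
        sizes[codec + " base64"] = len(base64.b64encode(raw)) / n
    return sizes


if __name__ == "__main__":
    # Hypothetical samples: ASCII letter, equals sign,
    # U+00E9 (Latin small e with acute), U+4E2D (a CJK ideograph).
    for sample in ("AAA", "===", "\u00e9\u00e9\u00e9", "\u4e2d\u4e2d\u4e2d"):
        print(ascii(sample))
        for label, size in bytes_per_code_point(sample).items():
            print(f"  {label:28s} {size:.2f}")
```

Samples of three code points are used so that short inputs stay under the quoted-printable line-length limit and the base64 padding stays small. Longer texts would give averages closer to the asymptotic figures the tables assume.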

Comparison of Unicode encodings: Difference between revisions