Comparison of Unicode encodings

This page compares the sizes of Unicode encodings. Two situations are considered: eight-bit-clean environments, and environments like [[Simple Mail Transfer Protocol|SMTP]] that use only seven bits per byte, the high-order bit being ignored or used for [[parity]]. [[Standard Compression Scheme for Unicode]] and [[Binary Ordered Compression for Unicode]] are excluded from the comparison tables because it is difficult to simply quantify their size.
 
==In summary==
If space were the only consideration, UTF-32 would lose in almost every case, since characters outside the [[basic multilingual plane]] (BMP) are very rare and two of the four bytes of every BMP character in UTF-32 are always 0. For seven-bit environments, UTF-7 clearly wins in terms of size over the combination of other Unicode encodings with [[quoted printable]] or [[base64]]. For eight-bit-clean environments, things vary considerably depending on which code points are in the text to be encoded.
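As a rough illustration of how the result depends on the text, the following sketch measures a few sample strings with Python's built-in codecs. The helper name `encoded_sizes` and the sample strings are assumptions made for this example; the little-endian codec variants are used only so that no byte-order mark is counted.

```python
# Encodings compared in the eight-bit-clean case. The "-le" variants are
# chosen so that Python does not prepend a byte-order mark.
ENCODINGS = ["utf-8", "utf-16-le", "utf-32-le", "gb18030"]


def encoded_sizes(text):
    """Return a dict mapping each encoding name to the encoded byte count."""
    return {name: len(text.encode(name)) for name in ENCODINGS}


# Illustrative samples only: ASCII text, two common Chinese characters,
# and one code point outside the BMP.
samples = {
    "ASCII": "hello",
    "Chinese": "中文",
    "Non-BMP": "\U0001F600",
}

for label, text in samples.items():
    print(label, encoded_sizes(text))
```

For the ASCII sample UTF-8 and GB18030 are smallest, for the Chinese sample UTF-16 and GB18030 are smallest, and for the non-BMP code point all four encodings use the same number of bytes.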
 
==In detail==
The tables below list the number of bytes per code point for different Unicode ranges. Any additional comments needed are included in the table. The figures assume that overheads at the start and end of the block of text are negligible.
===Eight-bit environments===
 
{| {{prettytable}}
|Code range (hexadecimal)||[[UTF-8]]||[[UTF-16]]||[[UTF-32]]||[[GB18030]]
|-
|000000 - 00007F||1||2||4||1
|}
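The per-code-point byte counts for the three UTF encodings follow directly from the code point value, so they can be sketched as small functions. The function names below are hypothetical, and GB18030 is left out because its length depends on mapping tables rather than on a simple range rule. The cross-check against Python's codecs is only a sanity test over a few boundary values.

```python
def utf8_len(cp):
    """Bytes used by UTF-8 for code point cp."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


def utf16_len(cp):
    """Bytes used by UTF-16: one 16-bit unit in the BMP, else a surrogate pair."""
    return 2 if cp < 0x10000 else 4


def utf32_len(cp):
    """UTF-32 always uses four bytes per code point."""
    return 4


def matches_codecs(cp):
    """Compare the formulas with Python's own encoders (surrogates excluded)."""
    ch = chr(cp)
    return (
        len(ch.encode("utf-8")) == utf8_len(cp)
        and len(ch.encode("utf-16-le")) == utf16_len(cp)
        and len(ch.encode("utf-32-le")) == utf32_len(cp)
    )


# Boundary values of each range; U+D800..U+DFFF are skipped because
# surrogates cannot be encoded on their own.
boundaries = [0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]
print(all(matches_codecs(cp) for cp in boundaries))
```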
 
===Seven-bit environments===
This table may not cover every special case and so should be used for estimation and comparison only. To accurately determine the size of text in an encoding, see the actual specifications.
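For estimation, the seven-bit sizes can also be measured directly. The sketch below uses Python's standard `quopri` and `base64` modules and the built-in `utf-7` codec; the helper names are hypothetical. It assumes the same simplifications as the table: quoted-printable is measured on a single character (so soft line breaks never occur), and base64 is averaged over a run of three characters so that no padding is counted.

```python
import base64
import quopri
from fractions import Fraction


def qp_size(ch, encoding):
    """Bytes needed for one character as quoted-printable over `encoding`."""
    return len(quopri.encodestring(ch.encode(encoding)))


def b64_size(ch, encoding):
    """Average bytes per character as base64 over `encoding`.

    A run of three characters always encodes to a multiple of three bytes,
    so the base64 output carries no padding.
    """
    run = (ch * 3).encode(encoding)
    return Fraction(len(base64.b64encode(run)), 3)


def utf7_size(text):
    """Total bytes of `text` in UTF-7, including any shift characters."""
    return len(text.encode("utf-7"))


for ch in ["A", "=", "\u00e9"]:
    print(
        repr(ch),
        utf7_size(ch),
        qp_size(ch, "utf-8"), b64_size(ch, "utf-8"),
        qp_size(ch, "utf-16-be"), b64_size(ch, "utf-16-be"),
        qp_size(ch, "utf-32-be"), b64_size(ch, "utf-32-be"),
    )
```

Python's UTF-7 encoder makes its own choice about which optional characters to write directly, so its output for characters such as `=` reflects one encoder setting rather than the only valid result.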
{| {{prettytable}}
|Code range (hexadecimal)||[[UTF-7]]||[[UTF-8]] [[quoted printable]]||UTF-8 [[base64]]||[[UTF-16]] quoted printable||UTF-16 base64||[[UTF-32]] quoted printable||UTF-32 base64||[[GB18030]] quoted printable||GB18030 base64
|-
|000000 - 000020||same as 000080-00FFFF||3||1⅓||6||2⅔||12||5⅓||3||1⅓
|-
|000021 - 00003C||rowspan=3|1 for "direct characters" and possibly "optional direct characters" (depending on the encoder setting), 2 for +, otherwise same as 000080-00FFFF||1||1⅓||4||2⅔||10||5⅓||1||1⅓
|-
|00003D (equals sign)||3||1⅓||6||2⅔||12||5⅓||3||1⅓
|-
|00003E - 00007E||1||1⅓||4||2⅔||10||5⅓||1||1⅓
|-
|00007F||rowspan=3|5 for an isolated case inside a run of single-byte characters. For runs, 2⅔ per character, plus padding to make it a whole number of bytes, plus two to start and finish the run||3||1⅓||6||2⅔||12||5⅓||3||1⅓
|-
|000080 - 0007FF||6||2⅔||rowspan=2|2-6 depending on whether the byte values need to be escaped||2⅔||rowspan=3|8-12 depending on whether the final two byte values need to be escaped||5⅓||rowspan=2|4-6 for characters inherited from [[GB2312]]/[[GBK]] (e.g. most Chinese characters),<br>6-10 for everything else||rowspan=2|2⅔ for characters inherited from [[GB2312]]/[[GBK]] (e.g. most Chinese characters),<br>5⅓ for everything else