This page compares Unicode encodings. Two situations are considered: 8-bit-clean environments, and environments such as SMTP that only support 7-bit characters. SCSU and BOCU are excluded from the comparison tables because their size is difficult to quantify simply.
In summary
UTF-32 is the largest option in almost every case, since characters outside the BMP are very rare and one of the four bytes in UTF-32 is always 0. For 7-bit environments, UTF-7 clearly wins over the combination of any other Unicode encoding with quoted-printable or base64. For 8-bit-clean environments, the result varies considerably depending on which code points are in the text to be encoded.
In detail
The tables below list the number of bytes per code point for different Unicode ranges. Any additional comments needed are included in the tables. The figures assume that overheads at the start and end of the block of text are negligible.
8-bit environments
code range (hexadecimal) | UTF-8 | UTF-16 | UTF-32 | GB18030 |
000000 - 00007F | 1 | 2 | 4 | 1 |
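000080 - 0007FF | 2 | 2 | 4 | 2 for characters inherited from GB2312/GBK (e.g. most Chinese characters), 4 for everything else |
000800 - 00FFFF | 3 | 2 | 4 | 2 for characters inherited from GB2312/GBK (e.g. most Chinese characters), 4 for everything else |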
010000 - 10FFFF | 4 | 4 | 4 | 4 |
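The figures in this table can be spot-checked with a short script. The following is a minimal sketch, assuming Python 3 and its built-in codecs; the helper name `bytes_per_code_point` and the sample characters are illustrative choices, not part of any specification. The explicit little-endian codec names are used so that no byte order mark is counted, in line with the assumption that start-of-text overhead is negligible.

```python
def bytes_per_code_point(text: str, encoding: str) -> float:
    """Average number of encoded bytes per code point in `text`."""
    # len() on a Python 3 str counts code points, not UTF-16 code units.
    return len(text.encode(encoding)) / len(text)


# "utf-16" and "utf-32" would prepend a byte order mark; the -le variants do not.
ENCODINGS = ["utf-8", "utf-16-le", "utf-32-le", "gb18030"]

# One illustrative sample per code range (hypothetical choices).
SAMPLES = {
    "000000 - 00007F": "A",
    "000080 - 0007FF": "\u05d0",      # Hebrew alef, not inherited from GB2312/GBK
    "000800 - 00FFFF": "\u4e2d",      # a common Chinese character, in GB2312
    "010000 - 10FFFF": "\U0001F600",  # a character outside the BMP
}

for code_range, sample in SAMPLES.items():
    sizes = {enc: bytes_per_code_point(sample, enc) for enc in ENCODINGS}
    print(code_range, sizes)
```

Swapping the second sample for a character that GB2312/GBK does cover should show the 2-byte GB18030 case instead of the 4-byte one.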
7-bit environments
This table may not cover every special case and so should be used for estimation and comparison only. To accurately determine the size of text in an encoding, please see the actual specifications.
code range (hexadecimal) | UTF-7 | UTF-8 quoted printable | UTF-8 base64 | UTF-16 quoted printable | UTF-16 base64 | UTF-32 quoted printable | UTF-32 base64 | GB18030 quoted printable | GB18030 base64 |
000000 - 000020 | same as 00007F - 00FFFF | 3 | 1⅓ | 6 | 2⅔ | 12 | 5⅓ | 3 | 1⅓ |
000021 - 00003C | 1 for "direct characters" and possibly "optional direct characters" (depending on the encoder setting), 2 for +, otherwise same as 00007F - 00FFFF | 1 | 1⅓ | 4 | 2⅔ | 10 | 5⅓ | 1 | 1⅓ |
00003D (equals sign) | same as 000021 - 00003C | 3 | 1⅓ | 6 | 2⅔ | 12 | 5⅓ | 3 | 1⅓ |
00003E - 00007E | same as 000021 - 00003C | 1 | 1⅓ | 4 | 2⅔ | 10 | 5⅓ | 1 | 1⅓ |
00007F | 5 for an isolated case inside a run of single-byte characters. For runs, 2⅔ per character, plus padding to make it a whole number of bytes, plus 2 to start and finish the run | 3 | 1⅓ | 6 | 2⅔ | 12 | 5⅓ | 3 | 1⅓ |
000080 - 0007FF | same as 00007F | 6 | 2⅔ | 2-6 depending on whether the byte values need to be escaped | 2⅔ | 8-12 depending on whether the final two byte values need to be escaped | 5⅓ | 4-6 for characters inherited from GB2312/GBK (e.g. most Chinese characters), 6-10 for everything else | 2⅔ for characters inherited from GB2312/GBK (e.g. most Chinese characters), 5⅓ for everything else |
000800 - 00FFFF | same as 00007F | 9 | 4 | 2-6 depending on whether the byte values need to be escaped | 2⅔ | 8-12 depending on whether the final two byte values need to be escaped | 5⅓ | 4-6 for characters inherited from GB2312/GBK (e.g. most Chinese characters), 6-10 for everything else | 2⅔ for characters inherited from GB2312/GBK (e.g. most Chinese characters), 5⅓ for everything else |
010000 - 10FFFF | same as two characters from above | 12 | 5⅓ | 8-12 depending on whether the low bytes of the surrogates need to be escaped | 5⅓ | 8-12 depending on whether the final two byte values need to be escaped | 5⅓ | 6-10 | 5⅓ |
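The fractional figures in this table follow from how the transfer encodings work: base64 emits 4 output bytes for every 3 input bytes (hence 1⅓, 2⅔ and 5⅓), and quoted-printable spends 3 bytes on every byte it has to escape. The sketch below, assuming Python 3 with its standard `base64` and `quopri` modules and built-in `utf-7` codec, measures a short run of text under each scheme. The names `sizes_7bit` and `utf7_run_size` are hypothetical helpers, and the run formula is simply the table's UTF-7 rule written out; real encoders may differ in detail, for example in which optional direct characters they leave unencoded.

```python
import base64
import quopri

# Big-endian codec names avoid counting a byte order mark.
CODECS = {
    "UTF-8": "utf-8",
    "UTF-16": "utf-16-be",
    "UTF-32": "utf-32-be",
    "GB18030": "gb18030",
}


def sizes_7bit(text: str) -> dict:
    """Encoded size in bytes of `text` under each 7-bit-safe scheme."""
    sizes = {"UTF-7": len(text.encode("utf-7"))}
    for name, codec in CODECS.items():
        raw = text.encode(codec)
        sizes[f"{name} quoted printable"] = len(quopri.encodestring(raw))
        sizes[f"{name} base64"] = len(base64.b64encode(raw))
    return sizes


def utf7_run_size(code_units: int) -> int:
    """Bytes for a UTF-7 run: 16 bits per UTF-16 code unit packed into
    6-bit symbols (rounded up), plus 2 to start and finish the run."""
    return -(-16 * code_units // 6) + 2


# A run of three code points keeps base64 free of padding for every codec here.
run = "\u00e9" * 3
for scheme, size in sizes_7bit(run).items():
    print(f"{scheme}: {size / len(run):.2f} bytes per code point")
print("UTF-7 run formula:", utf7_run_size(len(run)))
```

Note that quoted-printable encoders insert soft line breaks into long lines, and base64 pads its final group, so measurements on very long or very short inputs will drift slightly from the per-code-point figures in the table.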