Comparison of Unicode encodings

This is an old revision of this page, as edited by Reisio (talk | contribs) at 00:33, 26 August 2005 (Seven-bit environments: cleanup markup). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

This page compares Unicode encodings. Two situations are considered: eight-bit-clean environments, and environments such as Simple Mail Transfer Protocol that forbid byte values with the high bit set. Such prohibitions originally existed to allow for links that carried only 7 data bits, but they remain in the standards, so software must generate messages that comply with them. Standard Compression Scheme for Unicode and Binary Ordered Compression for Unicode are excluded from the comparison tables because their size is difficult to quantify simply.

Summary of size issues

UTF-32 loses in almost every case: characters outside the Basic Multilingual Plane (BMP) are very rare, and two of the four bytes of every BMP character in UTF-32 are always 0. For seven-bit environments, UTF-7 clearly wins over any other Unicode encoding combined with quoted-printable or base64. For eight-bit-clean environments, the result varies considerably depending on which code points occur in the text.
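As a rough illustration of how the result depends on the text, the following Python sketch measures the same strings in three encodings. The function name `encoded_sizes` and the sample strings are illustrative assumptions, not part of any standard; the byte counts come from Python's built-in codecs.

```python
def encoded_sizes(text):
    """Return the encoded length in bytes of `text` for several encodings."""
    # The "-be" codecs are used so that no byte order mark is prepended.
    encodings = ("utf-8", "utf-16-be", "utf-32-be")
    return {name: len(text.encode(name)) for name in encodings}


# Illustrative samples: ASCII, Greek (two-byte UTF-8), CJK (three-byte UTF-8).
samples = {
    "ASCII": "Hello, world",
    "Greek": "\u03b1\u03b2\u03b3\u03b4",  # alpha beta gamma delta
    "CJK": "\u65e5\u672c\u8a9e",          # three CJK ideographs
}

for label, text in samples.items():
    print(label, encoded_sizes(text))
```

For the ASCII sample UTF-8 needs 12 bytes against 24 for UTF-16 and 48 for UTF-32, while for the CJK sample UTF-16 is the smallest (6 bytes against 9 for UTF-8 and 12 for UTF-32).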

Considerations other than size

For processing

For processing, a format should be easy to search, truncate, and generally process safely. Fixed-size characters can be helpful, but even where there is a fixed width per code point (as in UTF-32), there is no fixed width per displayed character, because of combining characters. If you work heavily with a particular API that has standardised on a particular Unicode encoding, it is generally a good idea to use that encoding. UTF-16 is popular because many APIs date from the time when Unicode was a 16-bit fixed-width encoding. Unfortunately, using UTF-16 encourages code that does not properly handle code points outside the BMP.
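Both pitfalls can be seen in a few lines of Python. This is a minimal sketch: the helper name `utf16_code_units` is an illustrative assumption, and the two characters were picked only as convenient examples.

```python
import unicodedata


def utf16_code_units(text):
    """Number of 16-bit code units needed to store `text` in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


# "e" followed by U+0301 COMBINING ACUTE ACCENT:
# two code points, but one displayed character.
decomposed = "e\u0301"
# NFC normalization yields the precomposed single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP.
clef = "\U0001D11E"

print(len(decomposed), len(composed))     # code points in each form
print(len(clef), utf16_code_units(clef))  # code points vs UTF-16 code units
```

The two forms of the accented letter display identically yet differ in code point count, and the clef is one code point but two UTF-16 code units, so code that treats a UTF-16 code unit as a character will split it.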

For communication

Some protocols may limit you to a specific set of encodings, but even when they do not, some encodings may offer better compatibility with existing implementations than others. The cost of converting between your processing format and your communication format should also be considered, both in terms of program size (e.g. GB18030 requires a huge mapping table) and run-time requirements. It may simplify matters, especially for servers, to process text in the same format in which it is communicated.
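The conversion step itself might look like the following sketch, which assumes Python's built-in codecs and a hypothetical helper named `transcode`. Every message crossing the boundary pays for one decode and one encode, which is the run-time cost referred to above.

```python
def transcode(data, source, target):
    """Re-encode bytes from the `source` encoding to the `target` encoding."""
    # Decoding yields Python's internal string form (the processing format);
    # encoding then produces the bytes for the other side.
    return data.decode(source).encode(target)


# Hypothetical message received in GB18030 and converted to UTF-8 for local use.
wire = "\u4e2d\u6587 and ASCII".encode("gb18030")
local = transcode(wire, "gb18030", "utf-8")

# Converting back for the reply restores the original bytes.
assert transcode(local, "utf-8", "gb18030") == wire
```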

In detail

The tables below list the number of bytes per code point for different Unicode ranges. Any additional comments needed are included in the table. The figures assume that overheads at the start and end of the block of text are negligible.

Eight-bit environments

Code range (hexadecimal)   UTF-8   UTF-16   UTF-32   GB18030
000000 - 00007F            1       2        4        1
000080 - 0007FF            2       2        4        2 or 4 [1]
000800 - 00FFFF            3       2        4        2 or 4 [1]
010000 - 10FFFF            4       4        4        4

[1] 2 for characters inherited from GB2312/GBK (e.g. most Chinese characters), 4 for everything else.
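The figures in the table above can be spot-checked with Python's built-in codecs. The sketch below is only a sanity check under that assumption; the function name `bytes_per_code_point` and the choice of representative code points are illustrative.

```python
def bytes_per_code_point(code_point):
    """Return {encoding: byte count} for a single code point."""
    ch = chr(code_point)
    # "-be" codecs avoid the byte order mark that plain "utf-16"/"utf-32" add.
    encodings = ("utf-8", "utf-16-be", "utf-32-be", "gb18030")
    return {name: len(ch.encode(name)) for name in encodings}


# Representative code points:
#   U+0041  LATIN CAPITAL LETTER A
#   U+0627  ARABIC LETTER ALEF (not inherited from GB2312/GBK)
#   U+4E2D  a common CJK ideograph (inherited from GB2312)
#   U+1D11E MUSICAL SYMBOL G CLEF (outside the BMP)
for cp in (0x41, 0x627, 0x4E2D, 0x1D11E):
    print(f"U+{cp:06X}", bytes_per_code_point(cp))
```

The Arabic letter takes 4 bytes in GB18030 while the Chinese ideograph takes 2, matching the two cases in the GB18030 column.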

Seven-bit environments

This table may not cover every special case and so should be used for estimation and comparison only. To accurately determine the size of text in an encoding, see the actual specifications.

QP = quoted-printable, B64 = base64.

Code range (hexadecimal)  UTF-7  UTF-8 QP  UTF-8 B64  UTF-16 QP  UTF-16 B64  UTF-32 QP  UTF-32 B64  GB18030 QP  GB18030 B64
000000 - 000020           [a]    3         1⅓         6          2⅔          12         5⅓          3           1⅓
000021 - 00003C           [b]    1         1⅓         4          2⅔          10         5⅓          1           1⅓
00003D (equals sign)      [b]    3         1⅓         6          2⅔          12         5⅓          3           1⅓
00003E - 00007E           [b]    1         1⅓         4          2⅔          10         5⅓          1           1⅓
00007F                    [c]    3         1⅓         6          2⅔          12         5⅓          3           1⅓
000080 - 0007FF           [c]    6         2⅔         2-6 [d]    2⅔          8-12 [e]   5⅓          [f]         [g]
000800 - 00FFFF           [c]    9         4          2-6 [d]    2⅔          8-12 [e]   5⅓          [f]         [g]
010000 - 10FFFF           [h]    12        5⅓         8-12 [i]   5⅓          8-12 [e]   5⅓          6-10        5⅓

[a] Same as 000080 - 00FFFF.
[b] 1 for "direct characters" and, depending on the encoder setting, possibly "optional direct characters"; 2 for +; otherwise the same as 000080 - 00FFFF.
[c] 5 for an isolated character inside a run of single-byte characters. For runs: 2⅔ per character, plus padding to make a whole number of bytes, plus 2 to start and finish the run.
[d] Depending on whether the byte values need to be escaped.
[e] Depending on whether the final two byte values need to be escaped.
[f] 4-6 for characters inherited from GB2312/GBK (e.g. most Chinese characters), 6-10 for everything else.
[g] 2⅔ for characters inherited from GB2312/GBK (e.g. most Chinese characters), 5⅓ for everything else.
[h] Same as two characters from the 000080 - 00FFFF rows.
[i] Depending on whether the low bytes of the surrogates need to be escaped.
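Individual entries can be estimated with Python's standard `quopri`, `base64` and `utf-7` codecs, as in the sketch below. The names `RUN`, `seven_bit_sizes` and `utf7_isolated_cost` are illustrative assumptions, and Python's encoders make their own choices about which characters to escape, so results for special cases may differ from another implementation.

```python
import base64
import quopri
from fractions import Fraction

# Three copies of a character always encode to a multiple of 3 bytes,
# so base64 needs no padding and the per-character average is exact.
RUN = 3


def seven_bit_sizes(code_point):
    """Average bytes per character for a run of RUN copies of one code point."""
    text = chr(code_point) * RUN
    sizes = {}
    for name in ("utf-8", "utf-16-be", "utf-32-be", "gb18030"):
        raw = text.encode(name)
        sizes[name + " QP"] = Fraction(len(quopri.encodestring(raw)), RUN)
        sizes[name + " base64"] = Fraction(len(base64.b64encode(raw)), RUN)
    return sizes


def utf7_isolated_cost(code_point):
    """Bytes used in UTF-7 by one character placed between two ASCII letters."""
    return len(("A" + chr(code_point) + "A").encode("utf-7")) - 2


# The equals sign, an ASCII letter, and a CJK ideograph.
for cp in (0x3D, 0x41, 0x4E2D):
    print(f"U+{cp:06X}", utf7_isolated_cost(cp), seven_bit_sizes(cp))
```

For U+4E2D this reports 9 bytes per character for UTF-8 with quoted-printable, 4 for UTF-8 with base64, and 5 bytes for the isolated UTF-7 case.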