This page compares Unicode encodings. Two situations are considered: eight-bit-clean environments, and environments such as Simple Mail Transfer Protocol (SMTP) that forbid byte values with the high bit set. Such prohibitions originally accommodated links that carried only seven data bits, but they remain in the standards, so software must generate messages that comply with them. The Standard Compression Scheme for Unicode and Binary Ordered Compression for Unicode are excluded from the comparison tables because their sizes are difficult to quantify simply.
Summary of size issues
UTF-32 requires four bytes to encode any character. Since characters outside the Basic Multilingual Plane (BMP) are rare, a document encoded in UTF-32 will usually be nearly twice as large as its UTF-16-encoded equivalent. UTF-8, on the other hand, uses between one and four bytes per character; it may use fewer, the same number of, or more bytes than UTF-16 to encode the same character. UTF-EBCDIC is never smaller than UTF-8 for printable characters, and is often larger, because of a decision to allow the C1 control codes to be encoded as single bytes.
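The size relationships described above can be checked with a short Python sketch. The helper name `encoded_sizes` and the sample strings are illustrative assumptions, not part of any standard; the codec names are those of Python's built-in codecs.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the encoded size of `text` in bytes for three encodings."""
    # The "-le" variants are used so that no byte order mark is prepended.
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


# Illustrative samples: ASCII, Greek (two bytes each in UTF-8),
# and CJK (three bytes each in UTF-8).
samples = {"ascii": "hello", "greek": "αβγδε", "cjk": "日本語"}

for name, sample in samples.items():
    print(name, encoded_sizes(sample))
```

For these samples UTF-8 is smaller than UTF-16 for the ASCII text, the same size for the Greek text, and larger for the CJK text, while UTF-32 is twice the size of UTF-16 in all three cases.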
For seven-bit environments, UTF-7 is generally more space-efficient than combining another Unicode encoding with quoted-printable or base64.
Considerations other than size
For processing
For processing, a format should be easy to search, truncate, and generally process safely. All common Unicode encodings use some form of fixed-size code unit. Depending on the format and the code point to be encoded, one or more of these code units represent a Unicode code point. To allow easy searching and truncation, the sequence that encodes one code point must not occur within a longer sequence or across the boundary of two other sequences. UTF-8, UTF-16, UTF-32 and UTF-EBCDIC have these important properties, but UTF-7 and GB18030 do not.
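The searching property can be illustrated with a minimal Python sketch. The helper `byte_search` is a hypothetical name chosen for this example, and U+4E02 was picked because its GB18030 form ends in a byte that is also a valid ASCII character.

```python
def byte_search(text: str, needle: str, encoding: str) -> bool:
    """Naive substring test performed on the encoded bytes."""
    return needle.encode(encoding) in text.encode(encoding)


# U+4E02 is encoded in GB18030 as the two bytes 0x81 0x40,
# and 0x40 on its own is the ASCII character "@".
text = "\u4e02"

print(byte_search(text, "@", "utf-8"))    # False: no spurious match
print(byte_search(text, "@", "gb18030"))  # True: matches the trail byte
```

The text contains no "@", yet the byte-level search reports one in GB18030 because the sequence for "@" occurs inside a longer sequence. In UTF-8 every byte of a multi-byte sequence has the high bit set, so the same search cannot produce this kind of false match.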
Fixed-size characters can be helpful, but even where there is a fixed width per code point (as in UTF-32), there is no fixed width per displayed character, because of combining characters. If you work heavily with a particular API that has standardised on a particular Unicode encoding, it is generally a good idea to use that encoding, to avoid converting before every call to the API. Similarly, if you are writing server-side software, it may simplify matters to use the same format for processing that you use for communication.
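The point about combining characters can be seen in a small Python sketch. The variable names are illustrative; `unicodedata.normalize` is part of the Python standard library.

```python
import unicodedata

composed = "\u00e9"     # "é" as a single precomposed code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT

# Both strings display as one character, but their code point counts differ.
print(len(composed), len(decomposed))  # 1 2

# Even in UTF-32 the two forms therefore occupy different numbers of bytes.
print(len(composed.encode("utf-32-le")))    # 4
print(len(decomposed.encode("utf-32-le")))  # 8

# Normalisation to NFC recombines this particular pair into one code point.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Not every combining sequence has a precomposed form, so normalisation reduces the problem in some cases without restoring a fixed width per displayed character.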
UTF-16 is popular because many APIs date to the time when Unicode was a 16-bit fixed-width encoding. Unfortunately, using UTF-16 makes characters outside the BMP a special case, which increases the risk of oversights in their handling.
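A brief Python sketch of that special case follows. The helper `utf16_code_units` is a hypothetical name for this example; it lists the 16-bit code units that UTF-16 produces, so the surrogate pair used for a character outside the BMP becomes visible.

```python
import struct


def utf16_code_units(text: str) -> list[int]:
    """Return the 16-bit code units of `text` when encoded as UTF-16."""
    data = text.encode("utf-16-le")  # little-endian, no byte order mark
    return list(struct.unpack(f"<{len(data) // 2}H", data))


bmp_char = "\u20ac"         # EURO SIGN, inside the BMP
astral_char = "\U0001F600"  # GRINNING FACE, outside the BMP

print([hex(u) for u in utf16_code_units(bmp_char)])     # ['0x20ac']
print([hex(u) for u in utf16_code_units(astral_char)])  # ['0xd83d', '0xde00']

# Truncating after a fixed number of code units can split a surrogate pair.
units = utf16_code_units("a" + astral_char)
truncated = units[:2]  # keeps "a" and only the high surrogate
print(0xD800 <= truncated[-1] <= 0xDBFF)  # True: a dangling high surrogate
```

Code that assumes one code unit per character works for BMP-only text and fails only when a supplementary character appears, which is why such oversights tend to go unnoticed.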
For communication
Some protocols may be limited to a specific set of encodings, but even when they are not, some encodings may offer better compatibility with existing implementations than others. The cost of converting between your processing format and your communication format should also be considered, both in program size (e.g. GB18030 requires a huge mapping table) and in run-time requirements.
In detail
The tables below list the number of bytes per code point for different Unicode ranges. Any additional comments needed are included in the table. The figures assume that overheads at the start and end of the block of text are negligible.
Eight-bit environments
Code range (hexadecimal) | UTF-8 | UTF-16 | UTF-32 | UTF-EBCDIC | GB18030 |
---|---|---|---|---|---|
000000 – 00007F | 1 | 2 | 4 | 1 | 1 |
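000080 – 00009F | 2 | 2 | 4 | 1 | 2 for characters inherited from GB2312/GBK (e.g. most Chinese characters); 4 for everything else |
0000A0 – 0003FF | 2 | 2 | 4 | 2 | 2 for characters inherited from GB2312/GBK (e.g. most Chinese characters); 4 for everything else |
000400 – 0007FF | 2 | 2 | 4 | 3 | 2 for characters inherited from GB2312/GBK (e.g. most Chinese characters); 4 for everything else |
000800 – 003FFF | 3 | 2 | 4 | 3 | 2 for characters inherited from GB2312/GBK (e.g. most Chinese characters); 4 for everything else |
004000 – 00FFFF | 3 | 2 | 4 | 4 | 2 for characters inherited from GB2312/GBK (e.g. most Chinese characters); 4 for everything else |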
010000 – 03FFFF | 4 | 4 | 4 | 4 | 4 |
040000 – 10FFFF | 4 | 4 | 4 | 5 | 4 |
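The UTF-8, UTF-16 and UTF-32 columns of the table can be spot-checked with Python's built-in codecs. This is only a sketch: the names `ROWS`, `widths` and `mismatches` are illustrative, only the first and last code point of each range are tested, and UTF-EBCDIC is omitted because Python ships no codec for it.

```python
# (first, last, UTF-8, UTF-16, UTF-32) bytes per code point, as in the table.
ROWS = [
    (0x000000, 0x00007F, 1, 2, 4),
    (0x000080, 0x00009F, 2, 2, 4),
    (0x0000A0, 0x0003FF, 2, 2, 4),
    (0x000400, 0x0007FF, 2, 2, 4),
    (0x000800, 0x003FFF, 3, 2, 4),
    (0x004000, 0x00FFFF, 3, 2, 4),
    (0x010000, 0x03FFFF, 4, 4, 4),
    (0x040000, 0x10FFFF, 4, 4, 4),
]


def widths(code_point: int) -> tuple[int, ...]:
    """Bytes needed for one code point in UTF-8, UTF-16 and UTF-32."""
    ch = chr(code_point)
    # The "-le" codecs are used so that no byte order mark is counted.
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))


def mismatches() -> list[tuple[int, tuple[int, ...], tuple[int, ...]]]:
    """Return (code point, measured, expected) for every endpoint that disagrees."""
    bad = []
    for first, last, *expected in ROWS:
        for cp in (first, last):
            if widths(cp) != tuple(expected):
                bad.append((cp, widths(cp), tuple(expected)))
    return bad


print(mismatches())  # an empty list means the endpoints agree with the table
```

The range endpoints were chosen because none of them falls in the surrogate block D800 to DFFF, which cannot be encoded on its own.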
Seven-bit environments
This table may not cover every special case and so should be used for estimation and comparison only. To accurately determine the size of text in an encoding, see the actual specifications.
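Code range (hexadecimal) | UTF-7 | UTF-8 quoted-printable | UTF-8 base64 | UTF-16 quoted-printable | UTF-16 base64 | UTF-32 quoted-printable | UTF-32 base64 | GB18030 quoted-printable | GB18030 base64 |
---|---|---|---|---|---|---|---|---|---|
000000 – 00001F | same as 000080 – 00FFFF | 3 | 1⅓ | 6 | 2⅔ | 12 | 5⅓ | 3 | 1⅓ |
000020 – 00003C | 1 for "direct characters" and possibly "optional direct characters" (depending on the encoder setting); 2 for +; otherwise same as 000080 – 00FFFF | 1 | 1⅓ | 4 | 2⅔ | 10 | 5⅓ | 1 | 1⅓ |
00003D (equals sign) | 1 for "direct characters" and possibly "optional direct characters" (depending on the encoder setting); 2 for +; otherwise same as 000080 – 00FFFF | 3 | 1⅓ | 6 | 2⅔ | 12 | 5⅓ | 3 | 1⅓ |
00003E – 00007E | 1 for "direct characters" and possibly "optional direct characters" (depending on the encoder setting); 2 for +; otherwise same as 000080 – 00FFFF | 1 | 1⅓ | 4 | 2⅔ | 10 | 5⅓ | 1 | 1⅓ |
00007F | 5 for an isolated case inside a run of single-byte characters; for runs, 2⅔ per character plus padding to make it a whole number of bytes, plus two to start and finish the run | 3 | 1⅓ | 6 | 2⅔ | 12 | 5⅓ | 3 | 1⅓ |
000080 – 0007FF | 5 for an isolated case inside a run of single-byte characters; for runs, 2⅔ per character plus padding to make it a whole number of bytes, plus two to start and finish the run | 6 | 2⅔ | 2–6 depending on whether the byte values need to be escaped | 2⅔ | 8–12 depending on whether the final two byte values need to be escaped | 5⅓ | 4–6 for characters inherited from GB2312/GBK (e.g. most Chinese characters); 6–10 for everything else | 2⅔ for characters inherited from GB2312/GBK (e.g. most Chinese characters); 5⅓ for everything else |
000800 – 00FFFF | 5 for an isolated case inside a run of single-byte characters; for runs, 2⅔ per character plus padding to make it a whole number of bytes, plus two to start and finish the run | 9 | 4 | 2–6 depending on whether the byte values need to be escaped | 2⅔ | 8–12 depending on whether the final two byte values need to be escaped | 5⅓ | 4–6 for characters inherited from GB2312/GBK (e.g. most Chinese characters); 6–10 for everything else | 2⅔ for characters inherited from GB2312/GBK (e.g. most Chinese characters); 5⅓ for everything else |
010000 – 10FFFF | same as two characters from above | 12 | 5⅓ | 8–12 depending on whether the low bytes of the surrogates need to be escaped | 5⅓ | 8–12 depending on whether the final two byte values need to be escaped | 5⅓ | 6–10 | 5⅓ |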
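For a concrete string, the seven-bit sizes can be measured directly instead of estimated from the table. The following is a minimal sketch using Python's `utf-7` codec and the standard-library `quopri` and `base64` modules; the helper name `seven_bit_sizes` is an assumption of this example, and only the UTF-8 combinations are shown.

```python
import base64
import quopri


def seven_bit_sizes(text: str) -> dict[str, int]:
    """Return the size in bytes of `text` in three seven-bit-safe forms."""
    utf8 = text.encode("utf-8")
    return {
        "utf-7": len(text.encode("utf-7")),
        "utf-8 quoted-printable": len(quopri.encodestring(utf8)),
        "utf-8 base64": len(base64.b64encode(utf8)),
    }


# An isolated non-ASCII character: UTF-7 needs 5 bytes, as in the table.
print(seven_bit_sizes("\u00e9"))

# Plain ASCII text passes through UTF-7 and quoted-printable unchanged.
print(seven_bit_sizes("hello"))
```

The measured figures include base64 padding and the UTF-7 shift characters, which the per-character values in the table treat as negligible overhead, so very short strings can deviate from the table. Quoted-printable output also gains soft line breaks once a line exceeds the length limit.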