Comparison of Unicode encodings

This is an old revision of this page, as edited by Plugwash (talk | contribs) at 23:36, 11 August 2005 (8 bit environments). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

This page compares Unicode encodings. Two situations are considered: 8-bit clean environments, and environments such as SMTP that only support 7-bit characters. SCSU and BOCU are excluded from the comparison tables because their size is difficult to quantify simply.

In summary

For 7-bit environments, UTF-7 is clearly more compact than any other Unicode encoding combined with quoted-printable or base64. For 8-bit clean environments, the result varies considerably depending on which code points the text contains.
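The 7-bit comparison can be tried on sample strings with a short Python sketch. It relies only on the standard library (the built-in "utf-7" codec and the base64 and quopri modules); the function name seven_bit_sizes and the sample strings are illustrative assumptions, not part of the original comparison. It measures a few inputs and does not prove the general claim.

```python
import base64
import quopri


def seven_bit_sizes(text):
    """Return the encoded size in bytes of `text` under several 7-bit-safe schemes.

    Hypothetical helper for illustration only.
    """
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-be")  # explicit byte order, so no BOM is added
    return {
        "UTF-7": len(text.encode("utf-7")),
        "UTF-8 + quoted-printable": len(quopri.encodestring(utf8)),
        "UTF-8 + base64": len(base64.b64encode(utf8)),
        "UTF-16 + base64": len(base64.b64encode(utf16)),
    }


if __name__ == "__main__":
    # Sample inputs are assumptions chosen to cover different scripts.
    samples = [
        ("ASCII only", "Hello world"),
        ("Latin with accents", "na\u00efve caf\u00e9"),
        ("Japanese", "\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8"),
    ]
    for label, sample in samples:
        print(label, seven_bit_sizes(sample))
```

Running it on text representative of the intended use gives a more reliable picture than any single sample, since the ranking depends on the mix of ASCII and non-ASCII characters.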

In detail

The tables below list the number of bytes per code point for different Unicode ranges. Any additional comments needed are included in the table. The figures assume that overheads at the start and end of the block of text are negligible.

8-bit environments

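code range (hexadecimal)   UTF-8   UTF-16   UTF-32   GB18030
000000 - 00007F            1       2        4        1
000080 - 0007FF            2       2        4        2 or 4 (see note)
000800 - 00FFFF            3       2        4        2 or 4 (see note)
010000 - 10FFFF            4       4        4        4

Note: for code points in the range 000080 - 00FFFF, GB18030 uses 2 bytes for characters inherited from GB2312/GBK (e.g. most Chinese characters) and 4 bytes for everything else.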
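The figures in the table can be spot-checked with a minimal Python sketch that uses the standard library codecs ("utf-8", "utf-16-le", "utf-32-le" and "gb18030"). The function name bytes_per_code_point and the chosen sample code points are assumptions made for illustration. The little-endian codec variants are used so that no byte order mark is counted, matching the table's assumption that start-of-text overhead is negligible.

```python
def bytes_per_code_point(code_point):
    """Return the number of bytes each encoding needs for one code point.

    Hypothetical helper for illustration; surrogate code points
    (D800 - DFFF) cannot be encoded and are not handled here.
    """
    ch = chr(code_point)
    return {
        "UTF-8": len(ch.encode("utf-8")),
        "UTF-16": len(ch.encode("utf-16-le")),  # -le variant: no BOM
        "UTF-32": len(ch.encode("utf-32-le")),  # -le variant: no BOM
        "GB18030": len(ch.encode("gb18030")),
    }


if __name__ == "__main__":
    # One or two sample code points per table row (chosen as assumptions).
    for cp in (0x41, 0x80, 0xE9, 0x7FF, 0x800, 0x4E2D, 0x10000):
        print(f"U+{cp:06X}", bytes_per_code_point(cp))
```

Because GB18030 sizes within 000080 - 00FFFF depend on whether a character was inherited from GB2312/GBK, checking individual code points this way is more reliable than reasoning from the range alone.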