Comparison of Unicode encodings

== Efficiency ==
[[UTF-8]] requires 8, 16, 24, or 32 bits (one to four [[Octet (computing)|bytes]]) to encode a Unicode character, [[UTF-16]] requires either 16 or 32 bits to encode a character, and [[UTF-32]] always requires 32 bits to encode a character. The first 128 Unicode [[code point]]s, U+0000 to U+007F, used for the [[C0 Controls and Basic Latin]] characters and corresponding one-to-one to their ASCII-code equivalents, are encoded using 8 bits in UTF-8, 16 bits in UTF-16, and 32 bits in UTF-32.
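The sizes for the ASCII range can be checked directly; the following is a minimal Python sketch (the little-endian codec names "utf-16-le" and "utf-32-le" are used because Python's plain "utf-16"/"utf-32" codecs prepend a byte-order mark, which would inflate the count):

```python
# Byte lengths of an ASCII-range character (U+0000 to U+007F)
# under each Unicode encoding form.
ch = "A"  # U+0041, within the ASCII range

print(len(ch.encode("utf-8")))      # 1 byte  = 8 bits
print(len(ch.encode("utf-16-le")))  # 2 bytes = 16 bits
print(len(ch.encode("utf-32-le")))  # 4 bytes = 32 bits
```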
 
The next 1,920 characters, U+0080 to U+07FF (encompassing the remainder of almost all [[Latin-script alphabet]]s, and also [[Greek alphabet|Greek]], [[Cyrillic script|Cyrillic]], [[Coptic alphabet|Coptic]], [[Armenian alphabet|Armenian]], [[Hebrew alphabet|Hebrew]], [[Arabic alphabet|Arabic]], [[Syriac alphabet|Syriac]], [[Tāna]] and [[N'Ko alphabet|N'Ko]]), require 16 bits to encode in both UTF-8 and UTF-16, and 32 bits in UTF-32. For U+0800 to U+FFFF, i.e. the remainder of the characters in the [[Basic Multilingual Plane]] (BMP, plane 0, U+0000 to U+FFFF), which encompasses the rest of the characters of most of the world's living languages, UTF-8 needs 24 bits to encode a character, while UTF-16 needs 16 bits and UTF-32 needs 32. Code points U+010000 to U+10FFFF, which represent characters in the [[Plane (Unicode)|supplementary planes]] (planes 1–16), require 32 bits in UTF-8, UTF-16 and UTF-32.
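The three ranges above can likewise be illustrated with one sample character each; a sketch in Python (again using the BOM-free "-le" codecs):

```python
# One sample character from each code-point range discussed above:
#   U+00E9 (é)     : U+0080 to U+07FF
#   U+20AC (€)     : rest of the Basic Multilingual Plane
#   U+1F600 (😀)   : supplementary planes, U+010000 to U+10FFFF
for ch in ("\u00e9", "\u20ac", "\U0001f600"):
    print(f"U+{ord(ch):04X}:",
          len(ch.encode("utf-8")),     "bytes in UTF-8,",
          len(ch.encode("utf-16-le")), "bytes in UTF-16,",
          len(ch.encode("utf-32-le")), "bytes in UTF-32")
# U+00E9:  2 bytes in UTF-8, 2 bytes in UTF-16, 4 bytes in UTF-32
# U+20AC:  3 bytes in UTF-8, 2 bytes in UTF-16, 4 bytes in UTF-32
# U+1F600: 4 bytes in UTF-8, 4 bytes in UTF-16, 4 bytes in UTF-32
```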