Comparison of Unicode encodings: Difference between revisions

howver → however
[[UTF-8]] requires 8, 16, 24 or 32 bits (one to four [[Octet (computing)|bytes]]) to encode a Unicode character, [[UTF-16]] requires either 16 or 32 bits to encode a character, and [[UTF-32]] always requires 32 bits to encode a character.
 
The first 128 Unicode [[code point]]s, U+0000 to U+007F, which are used for the [[C0 Controls and Basic Latin]] characters and which correspond to ASCII, are encoded using 8 bits in UTF-8, 16 bits in UTF-16, and 32 bits in UTF-32. The next 1,920 characters, U+0080 to U+07FF, represent the rest of the characters used by almost all [[Latin-script alphabet]]s as well as [[Greek alphabet|Greek]], [[Cyrillic script|Cyrillic]], [[Coptic script|Coptic]], [[Armenian alphabet|Armenian]], [[Hebrew alphabet|Hebrew]], [[Arabic alphabet|Arabic]], [[Syriac alphabet|Syriac]], [[Thaana]] and [[N'Ko script|N'Ko]]. Characters in this range require 16 bits to encode in both UTF-8 and UTF-16, and 32 bits in UTF-32. Code points U+0800 to U+FFFF, the remainder of the [[Basic Multilingual Plane]], cover the rest of the characters of most of the world's living languages; for these, UTF-8 needs 24 bits to encode a character, UTF-16 needs 16 bits, and UTF-32 needs 32 bits. Code points U+010000 to U+10FFFF, which represent characters in the [[Plane (Unicode)|supplementary planes]], require 32 bits in UTF-8, UTF-16 and UTF-32.
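The per-range sizes described above can be checked empirically. The following is a minimal sketch, assuming Python 3.9 or later; the function name <code>encoded_bits</code> and the sample code points are illustrative choices and do not come from the article. The little-endian codec variants are used so that no byte order mark is added to the output.

```python
def encoded_bits(code_point: int) -> dict[str, int]:
    """Return the number of bits each encoding uses for one code point.

    Assumes code_point is a valid Unicode scalar value, that is, not a
    surrogate in the range U+D800 to U+DFFF.
    """
    ch = chr(code_point)
    return {
        "UTF-8": 8 * len(ch.encode("utf-8")),
        # The "-le" codecs emit no byte order mark, so only the
        # character itself is measured.
        "UTF-16": 8 * len(ch.encode("utf-16-le")),
        "UTF-32": 8 * len(ch.encode("utf-32-le")),
    }


# One sample code point from each of the four ranges (illustrative picks).
SAMPLES = {
    "U+0000 to U+007F": 0x0041,      # LATIN CAPITAL LETTER A
    "U+0080 to U+07FF": 0x03B1,      # GREEK SMALL LETTER ALPHA
    "U+0800 to U+FFFF": 0x4E2D,      # a CJK unified ideograph
    "U+010000 to U+10FFFF": 0x1F600,  # an emoji in a supplementary plane
}

for label, cp in SAMPLES.items():
    print(label, encoded_bits(cp))
```

Only one code point per range is sampled here, so this illustrates the pattern rather than proving it for every code point in each range.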
 
A file is shorter in UTF-8 than in UTF-16 if it contains more ASCII code points than code points in the range U+0800 to U+FFFF: each ASCII code point saves one byte in UTF-8, each code point in that range costs one extra byte, and all other code points take the same number of bytes in both encodings. Advocates of UTF-8 as the preferred form argue that real-world documents written in languages that use characters only in the range U+0800 to U+FFFF are still often shorter in UTF-8 due to the extensive use of spaces, digits, punctuation, newlines, [[HTML]], and embedded words and acronyms written with Latin letters.<ref>{{Cite web |title=UTF-8 Everywhere |url=https://utf8everywhere.org/#asian |access-date=2022-08-28 |website=utf8everywhere.org}}</ref> UTF-32, by contrast, is never shorter than either of the other two, and matches them in length only when the file contains no code points below U+10000.
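The size rule above can be illustrated with a short script. This is a sketch under stated assumptions: the function name <code>size_comparison</code> and the sample string are hypothetical, byte order marks are excluded by using the little-endian codecs, and the input is assumed to contain no lone surrogates.

```python
def size_comparison(text: str) -> dict[str, int]:
    """Report encoded sizes in bytes and the two counts that decide
    whether UTF-8 or UTF-16 is shorter for this text."""
    ascii_count = sum(1 for ch in text if ord(ch) <= 0x7F)
    high_bmp_count = sum(1 for ch in text if 0x0800 <= ord(ch) <= 0xFFFF)
    return {
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_bytes": len(text.encode("utf-16-le")),  # no byte order mark
        "utf32_bytes": len(text.encode("utf-32-le")),  # no byte order mark
        "ascii_count": ascii_count,
        "high_bmp_count": high_bmp_count,
    }


# Hypothetical sample: Japanese text wrapped in a small amount of HTML.
result = size_comparison("<p>日本語のテキスト</p>")
print(result)

# The UTF-16 size minus the UTF-8 size equals the ASCII count minus the
# count of code points in U+0800 to U+FFFF, so UTF-8 is shorter exactly
# when the ASCII code points outnumber those in that range.
assert (result["utf16_bytes"] - result["utf8_bytes"]
        == result["ascii_count"] - result["high_bmp_count"])
```

Running the same function on a full HTML page, where markup and whitespace are a larger share of the content, is one way to test the claim that such documents are often shorter in UTF-8.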