Polygnotus (talk | contribs) howver → however
Line 20:
[[UTF-8]] requires 8, 16, 24 or 32 bits (one to four [[Octet (computing)|bytes]]) to encode a Unicode character, [[UTF-16]] requires either 16 or 32 bits to encode a character, and [[UTF-32]] always requires 32 bits to encode a character.
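The per-character sizes can be checked with a minimal Python sketch. The sample characters and the helper name <code>encoded_lengths</code> are illustrative choices rather than part of any standard, and the little-endian codec variants are assumed so that no [[byte order mark]] is counted.

```python
# Bytes needed per character in each Unicode encoding form.
# The "-le" codecs are used so Python does not prepend a byte order mark.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # U+0041, U+00E9, U+20AC, U+1F600


def encoded_lengths(ch):
    """Return (UTF-8, UTF-16, UTF-32) byte counts for a single character."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )


for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

For these four samples the UTF-8 length grows from one to four bytes, UTF-16 uses two bytes except for U+1F600 (four bytes), and UTF-32 is always four bytes.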
The first 128 Unicode [[code point]]s, U+0000 to U+007F, which are used for the [[C0 Controls and Basic Latin]] characters and which correspond to ASCII, are encoded using 8 bits in UTF-8, 16 bits in UTF-16, and 32 bits in UTF-32. The next 1,920 characters, U+0080 to U+07FF, represent the rest of the characters used by almost all [[Latin-script alphabet]]s as well as [[Greek alphabet|Greek]], [[Cyrillic script|Cyrillic]], [[Coptic
A file is shorter in UTF-8 than in UTF-16 if there are more ASCII code points than there are code points in the range U+0800 to U+FFFF. Advocates of UTF-8 as the preferred form argue that real-world documents written in languages that use characters only in the high range are still often shorter in UTF-8 due to the extensive use of spaces, digits, punctuation, newlines, [[HTML]], and embedded words and acronyms written with Latin letters.<ref>{{Cite web |title=UTF-8 Everywhere |url=https://utf8everywhere.org/#asian |access-date=2022-08-28 |website=utf8everywhere.org}}</ref> UTF-32, by contrast, is always longer unless there are no code points less than U+10000.
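The counting rule follows from the per-character sizes: each ASCII code point saves one byte in UTF-8 relative to UTF-16, each code point from U+0800 to U+FFFF costs one extra byte, and all other code points take the same space in both. A minimal Python sketch, with hypothetical function names and sample strings chosen only for illustration, compares the rule against actual encoded sizes:

```python
def utf8_shorter_than_utf16(text):
    """Apply the counting rule: more ASCII code points than U+0800..U+FFFF ones."""
    ascii_count = sum(1 for ch in text if ord(ch) <= 0x7F)
    high_bmp_count = sum(1 for ch in text if 0x0800 <= ord(ch) <= 0xFFFF)
    return ascii_count > high_bmp_count


def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


markup = "<p>日本語</p>"      # 7 ASCII code points, 3 in U+0800..U+FFFF
plain = "日本語のテキスト"     # 8 code points, all in U+0800..U+FFFF

for sample in (markup, plain):
    print(utf8_shorter_than_utf16(sample), encoded_sizes(sample))
```

The marked-up sample is shorter in UTF-8 (16 bytes against 20 in UTF-16), while the plain sample is shorter in UTF-16 (16 bytes against 24 in UTF-8); UTF-32 is the longest in both cases.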