Revision as of 00:36, 15 August 2024 by Polygnotus (edit summary: howver → however)
Revision as of 01:43, 16 September 2024 by Meno25 (edit summary: →Efficiency)
Line 20: [[UTF-8]] requires 8, 16, 24 or 32 bits (one to four [[Octet (computing)|bytes]]) to encode a Unicode character, [[UTF-16]] requires either 16 or 32 bits to encode a character, and [[UTF-32]] always requires 32 bits to encode a character. The first 128 Unicode [[code point]]s, U+0000 to U+007F, which are used for the [[C0 Controls and Basic Latin]] characters and which correspond to ASCII, are encoded using 8 bits in UTF-8, 16 bits in UTF-16, and 32 bits in UTF-32. The next 1,920 characters, U+0080 to U+07FF, represent the rest of the characters used by almost all [[Latin-script alphabet]]s as well as [[Greek alphabet|Greek]], [[Cyrillic script|Cyrillic]], [[Coptic ~~alphabet~~script|Coptic]], [[Armenian alphabet|Armenian]], [[Hebrew alphabet|Hebrew]], [[Arabic alphabet|Arabic]], [[Syriac alphabet|Syriac]], [[~~Tāna~~Thaana]] and [[N'Ko ~~alphabet~~script|N'Ko]]. Characters in this range require 16 bits to encode in both UTF-8 and UTF-16, and 32 bits in UTF-32. Code points U+0800 to U+FFFF, the remaining characters in the [[Basic Multilingual Plane]], can represent the rest of the characters of most of the world's living languages; they require 24 bits in UTF-8, 16 bits in UTF-16, and 32 bits in UTF-32. Code points U+010000 to U+10FFFF, which represent characters in the [[Plane (Unicode)|supplementary planes]], require 32 bits in UTF-8, UTF-16 and UTF-32. A file is shorter in UTF-8 than in UTF-16 if there are more ASCII code points than there are code points in the range U+0800 to U+FFFF.
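The per-range sizes described above can be checked directly. The following is a minimal sketch using Python's built-in codecs; the helper name `encoded_bits` and the sample characters are chosen here for illustration and are not part of the article. The `-le` codec variants are used so that no byte order mark is counted.

```python
def encoded_bits(ch: str) -> dict:
    """Return the number of bits each encoding uses for a single character."""
    # "utf-16-le" and "utf-32-le" encode without a byte order mark,
    # so only the bits of the character itself are counted.
    return {
        "UTF-8": 8 * len(ch.encode("utf-8")),
        "UTF-16": 8 * len(ch.encode("utf-16-le")),
        "UTF-32": 8 * len(ch.encode("utf-32-le")),
    }


# One illustrative sample from each of the four ranges discussed above.
samples = [
    "\u0041",      # U+0041, in U+0000 to U+007F (ASCII)
    "\u00e9",      # U+00E9, in U+0080 to U+07FF
    "\u4e2d",      # U+4E2D, in U+0800 to U+FFFF
    "\U0001f600",  # U+1F600, in U+010000 to U+10FFFF
]

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_bits(ch))
```

Each printed row should match the corresponding range in the paragraph: only the UTF-8 and UTF-16 columns vary, while UTF-32 stays at 32 bits.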
Advocates of UTF-8 as the preferred form argue that real-world documents written in languages that use characters only in the high range are still often shorter in UTF-8 due to the extensive use of spaces, digits, punctuation, newlines, [[HTML]], and embedded words and acronyms written with Latin letters.<ref>{{Cite web |title=UTF-8 Everywhere |url=https://utf8everywhere.org/#asian |access-date=2022-08-28 |website=utf8everywhere.org}}</ref> UTF-32, by contrast, is always longer unless the text contains no code points below U+10000, in which case all three encodings use 32 bits per code point.
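The counting rule for UTF-8 versus UTF-16 and the effect of embedded markup can be illustrated with a short sketch. The function names and the sample strings below are hypothetical and chosen only for this example; they are not taken from the cited source. The rule follows from the sizes above: relative to UTF-16, each ASCII code point saves one byte in UTF-8 and each code point in U+0800 to U+FFFF costs one extra byte.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded length of text in bytes, without byte order marks."""
    return {
        "UTF-8": len(text.encode("utf-8")),
        "UTF-16": len(text.encode("utf-16-le")),
        "UTF-32": len(text.encode("utf-32-le")),
    }


def utf8_shorter_by_count(text: str) -> bool:
    """Apply the counting rule: more ASCII code points than U+0800..U+FFFF ones."""
    ascii_count = sum(1 for ch in text if ord(ch) <= 0x7F)
    high_bmp_count = sum(1 for ch in text if 0x0800 <= ord(ch) <= 0xFFFF)
    return ascii_count > high_bmp_count


# Hypothetical samples: the same Japanese text with and without HTML markup,
# plus a string made only of supplementary-plane code points.
plain = "日本語のテキスト"
marked_up = '<p class="note">' + plain + "</p>"
supplementary_only = "\U0001f600\U0001f601"

for label, text in [("plain", plain), ("HTML", marked_up),
                    ("supplementary", supplementary_only)]:
    sizes = encoded_sizes(text)
    print(label, sizes, "UTF-8 shorter than UTF-16:", sizes["UTF-8"] < sizes["UTF-16"])
```

In this sketch the bare Japanese string is shorter in UTF-16, while wrapping it in ASCII markup tips the balance toward UTF-8. The supplementary-only string is the one case where UTF-32 is no longer than the other two.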

Comparison of Unicode encodings: Difference between revisions