→Semantically distinct characters: Used Template:Unichar for angstrom, ohm and kelvin to bypass normalization. Previously, these were the regular Unicode characters instead of the symbols. Attempting to change them by hand resulted in a null edit.
m Dating maintenance tags: {{Citation needed}}
(4 intermediate revisions by 3 users not shown)
Line 25:
=== Glyph substitution and composition ===
Some compatibility characters are completely dispensable for text processing and display software that conforms to the Unicode standard. These include:
;[[typographic ligature|Ligatures]]: Ligatures such as '
;Precomposed Roman numerals: For example, Roman numeral twelve ('Ⅻ': U+216B) can be decomposed into a Roman numeral ten ('Ⅹ': U+2169) and two Roman numeral ones ('Ⅰ': U+2160). Precomposed characters are in the [[Number Forms]] block.
;Precomposed [[vulgar fraction|fractions]]: These decompositions have the keyword <fraction>. A fully conforming text handler should<ref>{{cite book|author=The Unicode Consortium|authorlink=Unicode Consortium|year=2010|title=The Unicode Standard, Version 6.0.0|publisher=Addison-Wesley Professional|isbn=978-0321480910|pages=212|url=https://www.unicode.org/versions/Unicode6.0.0/ch06.pdf#G12861}}</ref> display the vulgar fraction ¼ (U+00BC) identically to the composed fraction 1⁄4 (numeral 1 with fraction slash U+2044 and numeral 4). Precomposed characters are in the [[Number Forms]] block.
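The decompositions described above can be inspected with Python's standard <code>unicodedata</code> module. The following is a minimal sketch, not part of the Unicode Standard itself; the variable names are illustrative. Note that the character database maps the Roman numeral twelve to the Latin letters X, I and I rather than to other Roman numeral code points.

```python
import unicodedata

# Roman numeral twelve (U+216B) and vulgar fraction one quarter (U+00BC).
roman_twelve = "\u216B"
one_quarter = "\u00BC"

# decomposition() returns the raw mapping from the Unicode Character Database,
# including the formatting keyword such as <compat> or <fraction>.
roman_mapping = unicodedata.decomposition(roman_twelve)
quarter_mapping = unicodedata.decomposition(one_quarter)

# NFKD applies compatibility decomposition to a whole string.
roman_decomposed = unicodedata.normalize("NFKD", roman_twelve)
quarter_decomposed = unicodedata.normalize("NFKD", one_quarter)

print(roman_mapping, "->", roman_decomposed)
print(quarter_mapping, "->", quarter_decomposed)
```

Whether the decomposed and precomposed forms are actually rendered identically depends on the font and text layout engine, which this sketch does not test.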
Line 71:
* Mathematical constants (3): Euler constant ([[ℇ]] U+2107), [[Planck constant]] (ℎ U+210E), [[reduced Planck constant]] (ℏ U+210F)
* Currency symbols (2): rupee sign (₨ U+20A8), rial sign (﷼ U+FDFC)
* Punctuation (4): one dot [[leader (typography)|leader]] (U+2024), no-break space (U+00A0), non-breaking hyphen (U+2011), Tibetan mark delimiter tsheg bstar (U+0F0C)
* Other letter-like symbols (10): information source (ℹ U+2139), account of (℀ U+2100), addressed to the subject (℁ U+2101), care of (℅ U+2105), cada una (℆ U+2106), [[Numero sign|numero]] (№ U+2116), telephone sign (℡ U+2121), facsimile sign (℻ U+213B), trademark (™ U+2122), service mark (℠ U+2120)
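As a rough illustration of why these characters are considered semantically distinct, the sketch below applies NFKC normalization to a handful of the symbols listed above using Python's <code>unicodedata</code> module. The sample selection and the helper name <code>fold</code> are assumptions made for this example only.

```python
import unicodedata

# A few of the symbols from the list above (selection is illustrative).
samples = {
    "Planck constant": "\u210E",
    "numero sign": "\u2116",
    "trademark": "\u2122",
    "rupee sign": "\u20A8",
    "no-break space": "\u00A0",
}

def fold(ch):
    """Apply compatibility composition (NFKC) to a single character."""
    return unicodedata.normalize("NFKC", ch)

folded = {name: fold(ch) for name, ch in samples.items()}

for name, result in folded.items():
    # repr() makes the plain space that replaces the no-break space visible.
    print(f"{name}: {samples[name]!r} -> {result!r}")
```

After folding, the Planck constant is indistinguishable from an ordinary letter h and the no-break space from an ordinary space, so the distinction carried by the original character is lost.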
In addition, several scripts use glyph position, such as superscript or subscript placement, to differentiate semantics. In these cases subscripts and superscripts are not merely rich text, but constitute distinct characters in the writing system (130 total).
* 112 characters representing abstract phonemes from phonetic alphabets such as the
* 14 characters from the [[Kanbun]] block (U+3192 – U+319F)
* 1 character from the [[Tifinagh]] script: Tifinagh Modifier Letter Labialization Mark (ⵯ U+2D6F)
* 1 character from the [[Georgian script]]: Modifier Letter Georgian Nar (ჼ U+10FC)
* masculine ([[º|U+00BA]]) and feminine ([[ª|U+00AA]]) ordinal indicators included in the [[Latin-1
Finally, Unicode designates Roman numerals as compatibility equivalents of the Latin letters that share the same glyphs.{{Citation needed|date=November 2015}}
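The formatting keyword attached to a compatibility mapping can be read from the Unicode Character Database. The sketch below, using Python's <code>unicodedata</code> module, extracts that keyword for the two ordinal indicators and for Roman numeral one; the helper <code>compat_tag</code> is a hypothetical name introduced for this example.

```python
import unicodedata

def compat_tag(ch):
    """Return the formatting keyword of a compatibility mapping, or None.

    Canonical mappings and unmapped characters have no keyword.
    """
    mapping = unicodedata.decomposition(ch)
    if mapping.startswith("<"):
        return mapping.split(">", 1)[0] + ">"
    return None

masculine_ordinal = "\u00BA"
feminine_ordinal = "\u00AA"
roman_one = "\u2160"

for ch in (masculine_ordinal, feminine_ordinal, roman_one):
    folded = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X} {compat_tag(ch)} -> {folded!r}")
```

The ordinal indicators carry the superscript keyword, while Roman numeral one carries the generic compatibility keyword and folds to the Latin capital letter I.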
Line 112:
These thirteen characters are not compatibility characters, and their use is not discouraged in any way. However, U+27EAF 𧺯, the same as U+FA23 﨣, is mistakenly encoded in CJK Unified Ideographs Extension B.<ref>[http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg26/IRGN1218_Response_to_WG2.pdf#page=4 IRGN 1218]</ref> In any event, a normalized text should never contain both U+27EAF 𧺯 and U+FA23 﨣; these code points represent the same character, encoded twice.
Several other characters in these blocks have no compatibility mapping but are clearly intended for legacy support:{{citation needed|date=July 2025}}
Alphabetic Presentation Forms (1)
Line 118:
Arabic Presentation Forms (4)
# "Ornate Left Parenthesis" (U+FD3E): ﴾. A glyph variant for U+
# "Ornate Right Parenthesis" (U+FD3F): ﴿. A glyph variant for U+
# "Ligature Bismillah Ar-Rahman Ar-Raheem" (U+FDFD): ﷽. [[Bismillah ar-Rahman, ar-Raheem|Bismillah Ar-Rahman Ar-Raheem]] is a ligature for Beh (U+0628), Seen (U+0633), Meem (U+0645), Space (U+0020), Alef (U+0627), Lam (U+0644), Lam (U+0644), Heh (U+0647), Space (U+0020), Alef (U+0627), Lam (U+0644), Reh (U+0631), Hah (U+062D), Meem (U+0645), Alef (U+0627), Noon (U+0646), Space (U+0020), Alef (U+0627), Lam (U+0644), Reh (U+0631), Hah (U+062D), Yeh (U+064A), Meem (U+0645), i.e. {{lang|ar|بسم الله الرحمان الرحيم}}.<ref>[https://www.unicode.org/charts/PDF/UFB50.pdf Unicode chart FB50-FDFF (PDF)].</ref><!-- Note: In Unicode, characters are written in "logical" sequence, i.e. from right to left in RTL languages such as Arabic. --> (Similarly, U+FDFA and U+FDFB encode two other Arabic ligatures, of 21 and 9 characters respectively.)
# "Arabic Tail Fragment" (U+FE73): ﹳ. For supporting text systems without contextual glyph handling.
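The absence of a compatibility mapping for the four Arabic Presentation Forms characters above can be checked against the Unicode Character Database. This is a minimal sketch using Python's <code>unicodedata</code> module; the dictionary labels and the helper <code>has_mapping</code> are illustrative names, and the result reflects whichever Unicode version the Python build ships with.

```python
import unicodedata

# The four characters listed above, keyed by informal labels.
legacy_chars = {
    "ornate left parenthesis": "\uFD3E",
    "ornate right parenthesis": "\uFD3F",
    "ligature bismillah": "\uFDFD",
    "arabic tail fragment": "\uFE73",
}

def has_mapping(ch):
    """True if the character has any decomposition mapping."""
    return unicodedata.decomposition(ch) != ""

unmapped = [name for name, ch in legacy_chars.items() if not has_mapping(ch)]
print("No decomposition mapping:", unmapped)

# Without a mapping, NFKC leaves the ligature as a single code point.
bismillah_folded = unicodedata.normalize("NFKC", "\uFDFD")
print(len(bismillah_folded))

# For contrast, U+FDFA does carry a mapping and expands under NFKC.
fdfa_folded = unicodedata.normalize("NFKC", "\uFDFA")
print(len(fdfa_folded))
```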