Unicode compatibility characters: Difference between revisions

Revision as of 13:21, 14 June 2024 by Warudo (talk | contribs): →Semantically distinct characters: Used Template:Unichar for angstrom, ohm and kelvin to bypass normalization. Previously, these were the regular Unicode characters instead of the symbols. Attempting to change them by hand resulted in a null edit.

Latest revision as of 19:13, 28 July 2025 by AnomieBOT (talk | contribs): m Dating maintenance tags: {{Citation needed}}
(4 intermediate revisions by 3 users not shown)
Line 25:
=== Glyph substitution and composition ===
Some compatibility characters are completely dispensable for text processing and display software that conforms to the Unicode standard. These include:

;[[typographic ligature|Ligatures]]: Ligatures such as 'ﬃ' in the Latin script were often encoded as separate characters in legacy character sets. Unicode's approach to ligatures is to treat them as rich text and, if turned on, handle them through glyph substitution.
;Precomposed Roman numerals: For example, Roman numeral twelve ('Ⅻ': U+216B) can be decomposed into a Roman numeral ten ('Ⅹ': U+2169) and two Roman numeral ones ('Ⅰ': U+2160). Precomposed characters are in the [[Number Forms]] block.
;Precomposed [[vulgar fraction|fractions]]: These decompositions have the keyword <fraction>. A fully conforming text handler should<ref>{{cite book|author=The Unicode Consortium|authorlink=Unicode Consortium|year=2010|title=The Unicode Standard, Version 6.0.0|publisher=Addison-Wesley Professional|isbn=978-0321480910|pages=212|url=https://www.unicode.org/versions/Unicode6.0.0/ch06.pdf#G12861}}</ref> display the vulgar fraction ¼ (U+00BC) identically to the composed fraction 1⁄4 (numeral 1 with fraction slash U+2044 and numeral 4). Precomposed characters are in the [[Number Forms]] block.
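The decomposition keywords mentioned above can be inspected directly. The following is a minimal sketch, assuming Python's standard <code>unicodedata</code> module as the lookup tool; the helper name <code>describe</code> and the sample table are illustrative and not part of any standard. It prints the recorded decomposition mapping and the NFKC form of one ligature, one Roman numeral and one vulgar fraction.

```python
import unicodedata

# Illustrative samples taken from the list above (hypothetical table name).
SAMPLES = {
    "ligature ffi": "\ufb03",
    "Roman numeral twelve": "\u216b",
    "vulgar fraction one quarter": "\u00bc",
}

def describe(ch):
    """Return (decomposition mapping, NFKC form) for a single character."""
    return unicodedata.decomposition(ch), unicodedata.normalize("NFKC", ch)

for label, ch in SAMPLES.items():
    mapping, folded = describe(ch)
    print(f"{label}: U+{ord(ch):04X} mapping={mapping!r} NFKC={folded!r}")
```

Note that the mapping recorded for U+216B points to the Latin letters X, I, I rather than to other Roman numeral characters, and that only the fraction carries the <code>&lt;fraction&gt;</code> keyword; the other two samples are tagged <code>&lt;compat&gt;</code>.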
Line 71:
* Mathematical constants (3): Euler constant ([[ℇ]] U+2107), [[Planck constant]] (ℎ U+210E), [[reduced Planck constant]] (ℏ U+210F)
* Currency symbols (2): rupee sign (₨ U+20A8), rial sign (﷼ U+FDFC)
* Punctuation (4): one dot [[leader (typography)|leader]] (U+2024), no-break space (U+00A0), non-breaking hyphen (U+2011), Tibetan mark delimiter tsheg bstar (U+0F0C)
* Other letter-like symbols (10): information source (ℹ U+2139), account of (℀ U+2100), addressed to the subject (℁ U+2101), care of (℅ U+2105), cada una (℆ U+2106), [[Numero sign|numero]] (№ U+2116), telephone sign (℡ U+2121), facsimile sign (℻ U+213B), trademark (™ U+2122), service mark (℠ U+2120)

In addition, several scripts use glyph position such as superscripts and subscripts to differentiate semantics. In these cases subscripts and superscripts are not merely rich text, but constitute distinct characters in the writing system (130 total).

* 112 characters representing abstract phonemes from phonetic alphabets such as the [[International Phonetic Alphabet]] use such positional glyphs to represent semantic differences (U+1D2C – U+1D6A, U+1D78, U+1D9B – U+1DBF, U+02B0 – U+02B8, U+02E0 – U+02E4)
* 14 characters from the [[Kanbun]] block (U+3192 – U+319F)
* 1 character from the [[Tifinagh]] script: Tifinagh Modifier Letter Labialization Mark (ⵯ U+2D6F)
* 1 character from the [[Georgian script]]: Modifier Letter Georgian Nar (ჼ U+10FC)
* 2 characters from the [[Latin-1 Supplement]]{{citation needed|date=January 2012}} block: the masculine ([[º|U+00BA]]) and feminine ([[ª|U+00AA]]) ordinal indicators

Finally, Unicode designates Roman numerals as compatibility equivalents of the Latin letters that share the same glyphs.{{Citation needed|date=November 2015}}

Line 112:
These thirteen characters are not compatibility characters, and their use is not discouraged in any way.
However, U+27EAF 𧺯, the same as U+FA23 﨣, is mistakenly encoded in CJK Unified Ideographs Extension B.<ref>[http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg26/IRGN1218_Response_to_WG2.pdf#page=4 IRGN 1218]</ref> In any event, a normalized text should never contain both U+27EAF 𧺯 and U+FA23 﨣; these code points represent the same character, encoded twice.

Several other characters in these blocks have no compatibility mapping but are clearly intended for legacy support:{{citation needed|date=July 2025}}

Alphabetic Presentation Forms (1)

Line 118:
Arabic Presentation Forms (4)
# "Ornate Left Parenthesis" (U+FD3E): ﴾. A glyph variant for U+0028 '('
# "Ornate Right Parenthesis" (U+FD3F): ﴿. A glyph variant for U+0029 ')'
# "Ligature Bismillah Ar-Rahman Ar-Raheem" (U+FDFD): ﷽. [[Bismillah ar-Rahman, ar-Raheem|Bismillah Ar-Rahman Ar-Raheem]] is a ligature for Beh (U+0628), Seen (U+0633), Meem (U+0645), Space (U+0020), Alef (U+0627), Lam (U+0644), Lam (U+0644), Heh (U+0647), Space (U+0020), Alef (U+0627), Lam (U+0644), Reh (U+0631), Hah (U+062D), Meem (U+0645), Alef (U+0627), Noon (U+0646), Space (U+0020), Alef (U+0627), Lam (U+0644), Reh (U+0631), Hah (U+062D), Yeh (U+064A), Meem (U+0645), i.e. {{lang|ar|بسم الله الرحمان الرحيم}}<ref>[https://www.unicode.org/charts/PDF/UFB50.pdf Unicode chart FB50-FDFF (PDF)].</ref><!-- Note: In Unicode, characters are written in "logical" sequence, i.e. from right to left in RTL languages such as Arabic. --> (Similarly, U+FDFA and U+FDFB code for two other Arabic ligatures, of 21 and 9 characters respectively.)
# "Arabic Tail Fragment" (U+FE73): ﹳ for supporting text systems without contextual glyph handling
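Whether a character carries a decomposition mapping can be checked programmatically. The following is a minimal sketch, assuming Python's standard <code>unicodedata</code> module; the list name <code>LEGACY</code> and the helper <code>has_mapping</code> are illustrative. It reports, for the four Arabic Presentation Forms characters listed above, whether the Unicode Character Database bundled with the interpreter records any mapping for them.

```python
import unicodedata

# The four Arabic Presentation Forms characters listed above
# (hypothetical list name, chosen for this sketch).
LEGACY = ["\ufd3e", "\ufd3f", "\ufdfd", "\ufe73"]

def has_mapping(ch):
    """True if the character has any canonical or compatibility decomposition."""
    return unicodedata.decomposition(ch) != ""

for ch in LEGACY:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: has mapping = {has_mapping(ch)}")
```

Because these characters have no mapping, NFKC and NFKD normalization leave them unchanged, so any folding to ordinary parentheses or to the spelled-out phrase would have to be done by application-specific code.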