Unicode compatibility characters: Difference between revisions

Compatibility mappings types: drop claim tagged as original research
AnomieBOT (talk | contribs)
m Dating maintenance tags: {{Citation needed}}
 
(8 intermediate revisions by 5 users not shown)
Line 25:
=== Glyph substitution and composition ===
Some compatibility characters are completely dispensable for text processing and display software that conforms to the Unicode standard. These include:
;[[typographic ligature|Ligatures]]: Ligatures such as 'ffi' in the Latin script were often encoded as a separate character in legacy character sets. Unicode's approach to ligatures is to treat them as rich text and, if turned on, handle them through glyph substitution.
;Precomposed Roman numerals: For example, Roman numeral twelve ('Ⅻ': U+216B) can be decomposed into a Roman numeral ten ('Ⅹ': U+2169) and two Roman numeral ones ('Ⅰ': U+2160). Precomposed characters are in the [[Number Forms]] block.
;Precomposed [[vulgar fraction|fractions]]: These decompositions have the keyword &lt;fraction&gt;. A fully conforming text handler should<ref>{{cite book|author=The Unicode Consortium|authorlink=Unicode Consortium|year=2010|title=The Unicode Standard, Version 6.0.0|publisher=Addison-Wesley Professional|isbn=978-0321480910|pages=212|url=https://www.unicode.org/versions/Unicode6.0.0/ch06.pdf#G12861}}</ref> display the vulgar fraction ¼ (U+00BC) identically to the composed fraction 1⁄4 (numeral 1 with fraction slash U+2044 and numeral 4). Most precomposed fractions are in the [[Number Forms]] block; ¼, ½ and ¾ are in the [[Latin-1 Supplement]] block.
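
The mappings behind these three kinds of character can be inspected directly. The following is a minimal sketch, assuming Python's standard <code>unicodedata</code> module; the <code>samples</code> table and the <code>describe</code> helper are illustrative names, not part of any standard. Note that the formal compatibility mapping recorded for the Roman numeral twelve goes to the Latin letters X, I, I.

```python
import unicodedata

# Sample characters taken from the list above (labels are illustrative).
samples = {
    "ligature ffi": "\uFB03",
    "Roman numeral twelve": "\u216B",
    "vulgar fraction one quarter": "\u00BC",
}

def describe(ch):
    """Return (raw decomposition field, NFKD result) for one character."""
    return unicodedata.decomposition(ch), unicodedata.normalize("NFKD", ch)

for label, ch in samples.items():
    mapping, nfkd = describe(ch)
    # The mapping starts with a keyword such as <compat> or <fraction>.
    print(f"{label}: U+{ord(ch):04X} -> {mapping!r} -> {nfkd!r}")
```

The keyword at the start of each decomposition field (<code>&lt;compat&gt;</code>, <code>&lt;fraction&gt;</code>) is what marks the mapping as a compatibility mapping rather than a canonical one.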
Line 64:
While these compatibility characters are distinguished from their compatibility decomposition characters only by adding the word "symbol" to their name, they do represent long-standing distinct meanings in written mathematics. However, for all practical purposes they share the same semantics as their compatibility equivalent Greek or Hebrew letter. These may be considered borderline semantically distinguishable characters, so they are not included in the total.
 
Although Unicode does not intend to encode such units of measurement, the repertoire includes six such symbols that authors should not use; the characters' decompositions should be used instead.<ref>Omega, mu, Angstrom, Kelvin: {{cite web |url=http://www.unicode.org/reports/tr25/ |title=Unicode Technical Report #25 / Unicode Support for Mathematics |date=2017-05-30 |page=11 |author=Unicode Consortium}}</ref><ref name="decomp" />
* Unit symbols (6): [[Angstrom]] (Å U+{{unichar|212B|name=none}}: use U+00C5 instead), [[Ohm]] (Ω U+{{unichar|2126|name=none}}: use U+03A9 instead), [[Kelvin]] (K U+{{unichar|212A|name=none}}: use U+004B instead), [[Fahrenheit]] (U+2109: use [[°|U+00B0]] and U+0046 instead), [[Celsius]] (U+2103: use U+00B0 and U+0043 instead), [[micro-|Micro]] Sign (µ U+00B5: use U+03BC instead)
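
The replacements in this list can be applied mechanically. The sketch below assumes Python's standard <code>unicodedata</code> module; the <code>UNIT_SYMBOLS</code> table and the <code>replace_unit_symbols</code> function are hypothetical names introduced for illustration. The final loop checks that each listed replacement agrees with what NFKC normalization produces for that symbol.

```python
import unicodedata

# Discouraged unit symbol -> preferred replacement, as given in the list above.
UNIT_SYMBOLS = {
    "\u212B": "\u00C5",        # Angstrom sign -> A with ring above
    "\u2126": "\u03A9",        # Ohm sign -> Greek capital omega
    "\u212A": "\u004B",        # Kelvin sign -> Latin capital K
    "\u2109": "\u00B0\u0046",  # degree Fahrenheit -> degree sign + F
    "\u2103": "\u00B0\u0043",  # degree Celsius -> degree sign + C
    "\u00B5": "\u03BC",        # micro sign -> Greek small mu
}
_TABLE = str.maketrans(UNIT_SYMBOLS)

def replace_unit_symbols(text):
    """Replace only the six unit symbols, leaving all other text untouched."""
    return text.translate(_TABLE)

for symbol, preferred in UNIT_SYMBOLS.items():
    # Each replacement in the table matches the NFKC form of the symbol.
    assert unicodedata.normalize("NFKC", symbol) == preferred

print(replace_unit_symbols("25 \u2103, 10 k\u2126"))
```

A targeted table like this changes only the six symbols, whereas running NFKC over a whole text would also fold every other compatibility character in it.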
 
Unicode also designates 22 other letter-like symbols as compatibility characters.<ref name="decomp">≈ designates compatibility decomposition according to https://www.unicode.org/versions/Unicode15.0.0/ch24.pdf and is shown in code charts at https://www.unicode.org/charts/nameslist/n_2100.html</ref>
Line 71:
* Mathematical constants (3): Euler constant ([[ℇ]] U+2107), [[Planck constant]] (ℎ U+210E), [[reduced Planck constant]] (ℏ U+210F),
* Currency symbols (2): rupee sign (₨ U+20A8), rial sign (﷼ U+FDFC)
* Punctuation (4): one dot [[leader (typography)|leader]] (U+2024), no-break space (U+00A0), non-breaking hyphen (U+2011), Tibetan mark delimiter tsheg bstar (U+0F0C)
* Other letter-like symbols (10): information source (ℹ U+2139), account of (℀ U+2100), addressed to the subject (℁ U+2101), care of (℅ U+2105), cada una (℆ U+2106), [[Numero sign|numero]] (№ U+2116), telephone sign (℡ U+2121), facsimile sign (℻ U+213B), trademark (™ U+2122), service mark (℠ U+2120)
 
In addition, several scripts use glyph position, such as superscripts and subscripts, to differentiate semantics. In these cases subscripts and superscripts are not merely rich text, but constitute distinct characters in the writing system (130 total).
 
* 112 characters representing abstract phonemes from phonetic alphabets such as the [[International Phonetic Alphabet]] use such positional glyphs to represent semantic differences (U+1D2C – U+1D6A, U+1D78, U+1D9B – U+1DBF, U+02B0 – U+02B8, U+02E0 – U+02E4)
* 14 characters from the [[Kanbun]] block (U+3192 – U+319F)
* 1 character from the [[Tifinagh]] script: Tifinagh Modifier Letter Labialization Mark (ⵯ U+2D6F)
* 1 character from the [[Georgian script]]: Modifier Letter Georgian Nar (ჼ U+10FC)
* masculine ([[º|U+00BA]]) and feminine ([[ª|U+00AA]]) ordinal indicators included in the [[Latin-1 Supplement]]{{citation needed|date=January 2012}} block
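
These positional characters carry <code>&lt;super&gt;</code> or <code>&lt;sub&gt;</code> compatibility mappings, so compatibility normalization replaces them with their base letters and discards the distinction described above. A minimal sketch, assuming Python's standard <code>unicodedata</code> module (<code>positional_base</code> is a hypothetical helper name):

```python
import unicodedata

def positional_base(ch):
    """Return (tag, base) if ch has a <super> or <sub> mapping, else None."""
    fields = unicodedata.decomposition(ch).split()
    if fields and fields[0] in ("<super>", "<sub>"):
        base = "".join(chr(int(cp, 16)) for cp in fields[1:])
        return fields[0], base
    return None

# Masculine ordinal, modifier letter small h, Kanbun mark, subscript i.
for ch in ("\u00BA", "\u02B0", "\u3192", "\u1D62"):
    print(f"U+{ord(ch):04X} -> {positional_base(ch)}")

# NFKC folds the modifier letter into a plain 'h', losing the distinction.
assert unicodedata.normalize("NFKC", "\u02B0") == "h"
```

This is one reason software may prefer to apply NFKC or NFKD only for comparison, rather than to stored phonetic or Kanbun text.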
 
Finally, Unicode designates Roman numerals as compatibility equivalents of the Latin letters that share the same glyphs.{{Citation needed|date=November 2015}}
Line 95:
Several blocks of Unicode characters consist entirely or almost entirely of compatibility characters (U+F900–U+FFEF, except for the noncharacters). With one exception, the rial currency symbol (﷼ U+FDFC), the compatibility blocks contain none of the semantically distinct compatibility characters, so the compatibility decomposable characters in these blocks fall unambiguously into the set of discouraged characters. Unicode recommends that authors use the plain-text compatibility decomposition equivalents instead and complement those characters with rich text markup. This approach is much more flexible and open-ended than using, for example, the finite set of circled or enclosed alphanumerics.
 
There are a small number of characters even within the compatibility blocks that are not themselves compatibility characters and may therefore confuse authors. The "Enclosed CJK Letters and Months" block contains a single non-compatibility character: the 'Korean Standard Symbol' (㉿ U+327F). That symbol and 12 other characters have been included in the blocks for unknown reasons. The "CJK Compatibility Ideographs" block contains these non-compatibility unified Han ideographs:
 
# (U+FA0E): 﨎
Line 112:
These thirteen characters are not compatibility characters, and their use is not discouraged in any way. However, U+27EAF 𧺯, the same as U+FA23 﨣, is mistakenly encoded in CJK Unified Ideographs Extension B.<ref>[http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg26/IRGN1218_Response_to_WG2.pdf#page=4 IRGN 1218]</ref> In any event, a normalized text should never contain both U+27EAF 𧺯 and U+FA23 﨣; these code points represent the same character, encoded twice.
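
One way to see that such characters are not compatibility characters is to check that no normalization form changes them. The following sketch assumes Python's standard <code>unicodedata</code> module; <code>is_normalization_stable</code> is a hypothetical helper name.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def is_normalization_stable(ch):
    """True if none of the four normalization forms changes ch."""
    return all(unicodedata.normalize(form, ch) == ch for form in FORMS)

# Non-compatibility characters that live inside compatibility blocks.
for ch in ("\uFA0E", "\uFA23", "\u327F"):
    print(f"U+{ord(ch):04X} stable: {is_normalization_stable(ch)}")

# For contrast, U+F900 in the same block is replaced by a unified
# ideograph under every normalization form.
print(f"U+F900 -> U+{ord(unicodedata.normalize('NFC', chr(0xF900))):04X}")
```

Under this check U+FA23 and U+27EAF are both stable, so the standard normalization forms do not merge the two; avoiding the duplicate is left to authors or to additional tooling.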
 
Several other characters in these blocks have no compatibility mapping but are clearly intended for legacy support:{{citation needed|date=July 2025}}
 
Alphabetic Presentation Forms (1)
Line 118:
 
Arabic Presentation Forms (4)
# "Ornate Left Parenthesis" (U+FD3E): ﴾. A glyph variant for U+0028 '('
# "Ornate Right Parenthesis" (U+FD3F): ﴿. A glyph variant for U+0029 ')'
# "Ligature Bismillah Ar-Rahman Ar-Raheem" (U+FDFD): ﷽. [[Bismillah ar-Rahman, ar-Raheem|Bismillah Ar-Rahman Ar-Raheem]] is a ligature for Beh (U+0628), Seen (U+0633), Meem (U+0645), Space (U+0020), Alef (U+0627), Lam (U+0644), Lam (U+0644), Heh (U+0647), Space (U+0020), Alef (U+0627), Lam (U+0644), Reh (U+0631), Hah (U+062D), Meem (U+0645), Alef (U+0627), Noon (U+0646), Space (U+0020), Alef (U+0627), Lam (U+0644), Reh (U+0631), Hah (U+062D), Yeh (U+064A), Meem (U+0645) i.e. {{lang|ar|بسم الله الرحمان الرحيم}} <ref>[https://www.unicode.org/charts/PDF/UFB50.pdf Unicode chart FB50-FDFF (PDF)].</ref><!-- Note: In Unicode, characters are written in "logical" sequence, i.e. from right to left in RTL languages such as Arabic. --> (Similarly, U+FDFA and U+FDFB code for two other Arabic ligatures, of 21 and 9 characters respectively.)
# "Arabic Tail Fragment" (U+FE73): ﹳ for supporting text systems without contextual glyph handling
Line 134:
{{main article|Unicode normalization}}
 
Normalization is the process by which Unicode-conforming software performs full compatibility decomposition (or composition) before comparing or collating text strings. This is similar to other operations needed when, for example, a user performs a case- or diacritic-insensitive search within some text. In such cases software must equate or ignore characters it would not otherwise equate or ignore. Typically normalization is performed without altering the underlying stored text data (lossless). However, some software may make permanent changes to text that eliminate the canonical or even non-canonical compatibility character differences from text storage (lossy).
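
The lossless and lossy uses can be contrasted in a short sketch. It assumes Python's standard <code>unicodedata</code> module; <code>compat_key</code> and <code>compat_equal</code> are hypothetical helper names, and combining NFKC with case folding is only one possible choice of comparison key.

```python
import unicodedata

def compat_key(text):
    """Comparison key: NFKC, then case folding, then NFKC again."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFKC", folded)

def compat_equal(a, b):
    """Compare two strings while ignoring compatibility and case differences."""
    return compat_key(a) == compat_key(b)

stored = "o\uFB03ce \u2168"  # 'office' with an ffi ligature, Roman numeral nine
query = "OFFICE IX"

# Lossless: keys are computed on the fly; the stored string is not modified.
print(compat_equal(stored, query))

# Lossy: the stored text itself is rewritten, and the original ligature
# and Roman numeral characters cannot be recovered afterwards.
rewritten = unicodedata.normalize("NFKC", stored)
print(rewritten == stored)
```

Keeping the original string and deriving keys only for comparison preserves any compatibility characters whose distinctions matter to the author.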
 
== See also ==