→Semantically distinct characters: examples listed below Tags: Mobile edit Mobile web edit Advanced mobile edit |
m Dating maintenance tags: {{Citation needed}} |
(13 intermediate revisions by 8 users not shown) | |||
Line 1:
{{
{{Multiple issues|
{{original research|date=July 2008}}
Line 25:
=== Glyph substitution and composition ===
Some compatibility characters are completely dispensable for text processing and display software that conforms to the Unicode standard. These include:
;[[typographic ligature|Ligatures]]: Ligatures such as '
;Precomposed Roman numerals: For example, Roman numeral twelve ('Ⅻ': U+216B) can be decomposed into a Roman numeral ten ('Ⅹ': U+2169) and two Roman numeral ones ('Ⅰ': U+2160). Precomposed characters are in the [[Number Forms]] block.
;Precomposed [[vulgar fraction|fractions]]: These decompositions have the keyword <fraction>. A fully conforming text handler should<ref>{{cite book|author=The Unicode Consortium|authorlink=Unicode Consortium|year=2010|title=The Unicode Standard, Version 6.0.0|publisher=Addison-Wesley Professional|isbn=978-0321480910|pages=212|url=https://www.unicode.org/versions/Unicode6.0.0/ch06.pdf#G12861}}</ref> display the vulgar fraction ¼ (U+00BC) identically to the composed fraction 1⁄4 (numeral 1 with fraction slash U+2044 and numeral 4). Precomposed characters are in the [[Number Forms]] block.
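The decomposition mappings behind these examples can be inspected with Python's standard <code>unicodedata</code> module. The following is a minimal sketch rather than a conformance test; the helper name <code>describe</code> is hypothetical, and the results reflect whichever version of the Unicode Character Database ships with the interpreter. Note that the database maps the Roman numeral to Latin letters.

```python
import unicodedata

def describe(ch: str) -> tuple[str, str]:
    """Return (raw decomposition mapping, NFKD result) for one character."""
    return unicodedata.decomposition(ch), unicodedata.normalize("NFKD", ch)

# A ligature, a precomposed Roman numeral and a precomposed fraction.
for ch in ("\uFB01", "\u216B", "\u00BC"):
    mapping, nfkd = describe(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {mapping!r} -> {nfkd!r}")
```

The first field of each mapping is the decomposition keyword, for example <code>&lt;fraction&gt;</code> for U+00BC.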
Line 64:
While these compatibility characters are distinguished from their compatibility decomposition characters only by the addition of the word "symbol" to their names, they do represent long-standing distinct meanings in written mathematics. However, for all practical purposes they share the same semantics as their compatibility-equivalent Greek or Hebrew letters. These may be considered borderline semantically distinguishable characters, so they are not included in the total.
Although Unicode did not intend to encode such units of measurement, the repertoire includes six such symbols that authors should not use; the characters' decompositions should be used instead.<ref>Omega, mu, Angstrom, Kelvin: {{cite web |url=http://www.unicode.org/reports/tr25/ |title=Unicode Technical Report #25 / Unicode Support for Mathematics |date=2017-05-30 |page=11 |author=Unicode Consortium}}</ref><ref name="decomp" />
* Unit symbols (6): [[Angstrom]]
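As an illustration of substituting a unit symbol's decomposition, the sketch below uses Python's standard <code>unicodedata</code> module on the four signs named in the cited report (omega, mu, angstrom, kelvin). The names <code>UNIT_SIGNS</code> and <code>preferred_letter</code> are hypothetical. The ohm, kelvin and angstrom signs have canonical decompositions, so NFC alone replaces them, whereas the micro sign has only a compatibility decomposition and is replaced only by NFKC or NFKD.

```python
import unicodedata

# Illustrative selection: the four signs named in the cited report.
UNIT_SIGNS = {
    "ohm": "\u2126",
    "kelvin": "\u212A",
    "angstrom": "\u212B",
    "micro": "\u00B5",
}

def preferred_letter(sign: str) -> str:
    """Return the letter that compatibility normalization (NFKC) substitutes."""
    return unicodedata.normalize("NFKC", sign)

for name, sign in UNIT_SIGNS.items():
    letter = preferred_letter(sign)
    changed_by_nfc = unicodedata.normalize("NFC", sign) != sign
    print(f"{name}: U+{ord(sign):04X} -> U+{ord(letter):04X} "
          f"({unicodedata.name(letter)}); changed by NFC alone: {changed_by_nfc}")
```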
Unicode also designates
* Other Greek letter-based symbols (4): lunate epsilon (ϵ U+03F5), lunate sigma (ϲ U+03F2), capital lunate sigma (Ϲ U+03F9), upsilon with hook (ϒ U+03D2)
* Mathematical constants (3): Euler constant ([[ℇ]] U+2107), [[Planck constant]] (ℎ U+210E), [[reduced Planck constant]] (ℏ U+210F)
* Currency symbols (2): rupee sign (₨ U+20A8), rial sign (﷼ U+FDFC)
* Punctuation (4): one dot [[leader (typography)|leader]] (U+2024), no-break space (U+00A0), non-breaking hyphen (U+2011), Tibetan mark delimiter tsheg bstar (U+0F0C)
* Other letter-like symbols (10): information source (ℹ U+2139), account of (℀ U+2100), addressed to the subject (℁ U+2101), care of (℅ U+2105), cada una (℆ U+2106), [[Numero sign|numero]] (№ U+2116), telephone sign (℡ U+2121), facsimile sign (℻ U+213B), trademark (™ U+2122), service mark (℠ U+2120)
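The following Python sketch shows why the distinction matters in practice: applying compatibility normalization to a few of the characters listed above replaces them with ordinary letters or spaces, so the separate meaning is no longer recoverable from the text. It assumes the standard <code>unicodedata</code> module; <code>SAMPLES</code> and <code>compat_fold</code> are hypothetical names chosen for the example.

```python
import unicodedata

# A few characters taken from the lists above.
SAMPLES = ["\u210E", "\u20A8", "\u00A0", "\u2116", "\u2122"]

def compat_fold(text: str) -> str:
    """Apply compatibility decomposition followed by canonical composition."""
    return unicodedata.normalize("NFKC", text)

for ch in SAMPLES:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {compat_fold(ch)!r}")
```

After folding, the trade mark sign is indistinguishable from the two letters "TM", and the no-break space from an ordinary space.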
In addition, several scripts use glyph position such as superscripts and subscripts to differentiate semantics. In these cases subscripts and superscripts are not merely rich text, but constitute a distinct character
* 112 characters representing abstract phonemes from phonetic alphabets such as the
* 14 characters from the [[Kanbun]] block (U+3192 – U+319F)
* 1 character from the [[Tifinagh]] script: Tifinagh Modifier Letter Labialization Mark (ⵯ U+2D6F)
* 1 character from the [[Georgian script]]: Modifier Letter Georgian Nar (ჼ U+10FC)
* masculine ([[º|U+00BA]]) and feminine ([[ª|U+00AA]]) ordinal indicators included in the [[Latin-1
Finally, Unicode designates Roman numerals as compatibility equivalents of the Latin letters that share the same glyphs.{{Citation needed|date=November 2015}}
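The kind of compatibility mapping a character carries can be read from the keyword at the start of its decomposition field. The sketch below, which assumes Python's standard <code>unicodedata</code> module and uses the hypothetical helper <code>decomposition_tag</code>, prints that keyword for a modifier letter, the two ordinal indicators and a Roman numeral.

```python
import unicodedata

def decomposition_tag(ch: str) -> str:
    """Return the compatibility keyword such as '<super>', or '' if there is none."""
    fields = unicodedata.decomposition(ch).split()
    return fields[0] if fields and fields[0].startswith("<") else ""

# Modifier letter small h, feminine and masculine ordinal indicators, Roman numeral one.
for ch in ("\u02B0", "\u00AA", "\u00BA", "\u2160"):
    folded = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"tag {decomposition_tag(ch)!r}, NFKC {folded!r}")
```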
Line 95:
Several blocks of Unicode characters consist entirely or almost entirely of compatibility characters (U+F900–U+FFEF, except for the noncharacters). The compatibility blocks contain none of the semantically distinct compatibility characters, with one exception: the rial currency symbol (﷼ U+FDFC). The compatibility decomposable characters in the compatibility blocks therefore fall unambiguously into the set of discouraged characters. Unicode recommends that authors use the plain-text compatibility decomposition equivalents instead and complement those characters with rich text markup. This approach is much more flexible and open-ended than using, to give just one example, the finite set of circled or enclosed alphanumerics.
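One way to explore this range is to classify each code point by whether it is assigned and whether it has any decomposition mapping. The sketch below is illustrative only: it assumes Python's standard <code>unicodedata</code> module, the names <code>FIRST</code>, <code>LAST</code> and <code>classify</code> are hypothetical, and the totals it prints depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

FIRST, LAST = 0xF900, 0xFFEF  # range quoted in the text

def classify(cp: int) -> str:
    """Sort a code point into 'unassigned', 'decomposable' or 'no mapping'."""
    ch = chr(cp)
    if unicodedata.category(ch) == "Cn":
        return "unassigned"  # category Cn also covers the noncharacters
    return "decomposable" if unicodedata.decomposition(ch) else "no mapping"

counts = {}
for cp in range(FIRST, LAST + 1):
    kind = classify(cp)
    counts[kind] = counts.get(kind, 0) + 1
print(counts)
```

Characters reported as "no mapping" are the ones that need individual attention, since normalization leaves them untouched.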
# (U+FA0E): 﨎
Line 112:
These thirteen characters are not compatibility characters, and their use is not discouraged in any way. However, U+27EAF 𧺯, the same as U+FA23 﨣, is mistakenly encoded in CJK Unified Ideographs Extension B.<ref>[http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg26/IRGN1218_Response_to_WG2.pdf#page=4 IRGN 1218]</ref> In any event, a normalized text should never contain both U+27EAF 𧺯 and U+FA23 﨣; these code points represent the same character, encoded twice.
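The difference between these unified ideographs and true compatibility ideographs can be checked programmatically. In the sketch below, which assumes Python's standard <code>unicodedata</code> module and uses the hypothetical helper <code>survives_normalization</code>, U+F900 serves as an example of a compatibility ideograph with a canonical mapping. The standard normalization forms leave both U+FA23 and U+27EAF unchanged, so keeping only one of them in a text would have to be handled by an application-level rule.

```python
import unicodedata

UNIFIED_IN_COMPAT_BLOCK = "\uFA23"     # no decomposition mapping
EXTENSION_B_DUPLICATE = "\U00027EAF"   # same character, encoded in Extension B
TRUE_COMPAT_IDEOGRAPH = "\uF900"       # canonically maps to U+8C48

def survives_normalization(ch: str, form: str = "NFKC") -> bool:
    """Return True if the given normalization form leaves the character unchanged."""
    return unicodedata.normalize(form, ch) == ch

for ch in (UNIFIED_IN_COMPAT_BLOCK, EXTENSION_B_DUPLICATE, TRUE_COMPAT_IDEOGRAPH):
    print(f"U+{ord(ch):04X}: unchanged by NFKC = {survives_normalization(ch)}")
```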
Several other characters in these blocks have no compatibility mapping but are clearly intended for legacy support:{{citation needed|date=July 2025}}
Alphabetic Presentation Forms (1)
Line 118:
Arabic Presentation Forms (4)
# "Ornate Left Parenthesis" (U+FD3E): ﴾. A glyph variant for U+
# "Ornate Right Parenthesis" (U+FD3F): ﴿. A glyph variant for U+
# "Ligature Bismillah Ar-Rahman Ar-Raheem" (U+FDFD): ﷽. [[Bismillah ar-Rahman, ar-Raheem|Bismillah Ar-Rahman Ar-Raheem]] is a ligature for Beh (U+0628), Seen (U+0633), Meem (U+0645), Space (U+0020), Alef (U+0627), Lam (U+0644), Lam (U+0644), Heh (U+0647), Space (U+0020), Alef (U+0627), Lam (U+0644), Reh (U+0631), Hah (U+062D), Meem (U+0645), Alef (U+0627), Noon (U+0646), Space (U+0020), Alef (U+0627), Lam (U+0644), Reh (U+0631), Hah (U+062D), Yeh (U+064A), Meem (U+0645) i.e. {{lang|ar|بسم الله الرحمان الرحيم}} <ref>[https://www.unicode.org/charts/PDF/UFB50.pdf Unicode chart FB50-FDFF (PDF)].</ref><!-- Note: In Unicode, characters are written in "logical" sequence, i.e. from right to left in RTL languages such as Arabic. --> (Similarly, U+FDFA and U+FDFB code for two other Arabic ligatures, of 21 and 9 characters respectively.)
# "Arabic Tail Fragment" (U+FE73): ﹳ for supporting text systems without contextual glyph handling
Line 134:
{{main article|Unicode normalization}}
Normalization is the process by which Unicode conforming software first performs full compatibility decomposition (or composition) before making comparisons or collating text strings
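A minimal sketch of this idea in Python, assuming the standard <code>unicodedata</code> module: both strings are brought to the same normalization form before they are compared or sorted. The function names <code>compat_equal</code> and <code>compat_sort_key</code> are hypothetical, and real collation would normally add language-specific rules on top of this step.

```python
import unicodedata

def compat_equal(a: str, b: str) -> bool:
    """Compare two strings after compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

def compat_sort_key(s: str) -> str:
    """Sort key based on full compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", s)

words = ["\uFB01g", "fox", "fan"]  # the first word starts with the fi ligature
print(compat_equal("\uFB01le", "file"))
print(sorted(words))                        # code point order
print(sorted(words, key=compat_sort_key))   # order after decomposition
```

Without the key function the ligature sorts after every ASCII letter; with it, the word is ordered as if it were spelled with separate letters.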
== See also ==
* [[CJK Compatibility]]
* [[CJK Compatibility Forms]]
* [[CJK Compatibility Ideographs]]
== References ==
Line 144 ⟶ 150:
{{Unicode navigation}}