Unicode compatibility characters: Difference between revisions

AnomieBOT (talk | contribs)
m Dating maintenance tags: {{When}}
m convert special characters (via WP:JWB)
Line 18:
# Characters included from other character sets or otherwise added to the UCS that constitute [[formatted text|rich text]] rather than the plain text goals of Unicode.
# Some other characters that are semantically distinct, but [[homoglyph|visually similar]].
Because these semantically distinct characters may be displayed with glyphs similar to the glyphs of other characters, text processing software should try to address possible confusion for the sake of end users. When comparing and collating (sorting) text strings, different forms and rich text variants of characters should not alter the text processing results. For example, software users may be confused when they search a page for a capital Latin letter 'I' and their application fails to find the visually similar [[Roman numeral]] 'Ⅰ'.
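
The find-on-page problem described above can be illustrated with a minimal Python sketch. It assumes that folding both strings with NFKC normalization is an acceptable way to compare them; the helper name <code>compat_equal</code> is hypothetical and not part of any standard API.

```python
import unicodedata


def compat_equal(a: str, b: str) -> bool:
    """Compare two strings after NFKC (compatibility) normalization.

    Hypothetical helper for illustration only.
    """
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


latin_i = "I"          # U+0049 LATIN CAPITAL LETTER I
roman_one = "\u2160"   # U+2160 ROMAN NUMERAL ONE

# A plain code point comparison treats the two characters as different.
print(latin_i == roman_one)

# After compatibility folding the Roman numeral maps to the Latin letter.
print(compat_equal(latin_i, roman_one))
```

A search routine that folds both the query and the page text this way would match either character, at the cost of discarding the distinction between them.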
 
== Compatibility mappings types ==
Line 24:
=== Glyph substitution and composition ===
Some compatibility characters are completely dispensable for text processing and display software that conforms to the Unicode standard. These include:
;[[typographic ligature|Ligatures]]: Ligatures such as 'ffi' in the Latin script were often encoded as a separate character in legacy character sets. Unicode's approach to ligatures is to treat them as rich text and, if turned on, to handle them through glyph substitution.
;Precomposed Roman numerals: For example, Roman numeral twelve ('Ⅻ': U+216B) can be decomposed into a Roman numeral ten ('Ⅹ': U+2169) and two Roman numeral ones ('Ⅰ': U+2160).
;Precomposed [[vulgar fraction|fractions]]: These decompositions have the keyword &lt;fraction&gt;. A fully conforming text handler should<ref>{{cite web|author=The Unicode Consortium|authorlink=Unicode Consortium|year=2010|title=The Unicode Standard, Version 6.0.0|publisher=Addison-Wesley Professional|isbn=978-0321480910|pages=212|url=https://www.unicode.org/versions/Unicode6.0.0/ch06.pdf#G12861}}</ref> display the vulgar fraction ¼ (U+00BC) identically to the composed fraction 1⁄4 (numeral 1 with fraction slash U+2044 and numeral 4).
;Contextual glyphs or forms: These arise primarily in the Arabic script. Using fonts with glyph substitution capabilities such as [[OpenType]] and [[Apple Advanced Typography|TrueTypeGX]], Unicode-conforming software can substitute the proper glyphs for the same character depending on whether that character appears at the beginning, middle or end of a word, or in isolation. Such glyph substitution is also necessary for vertical (top to bottom) text layout for some East Asian languages. In this case glyphs must be substituted or synthesized for wide, narrow, small and square glyph forms. Non-conforming software, or software using other character sets, instead uses multiple separate characters for the same letter depending on its position, further complicating text processing.
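
The decomposition keywords mentioned above can be inspected with Python's <code>unicodedata</code> module. The following is a rough sketch; the helper <code>compat_mapping</code> is a hypothetical name, and the results depend on the version of the Unicode Character Database bundled with the interpreter. Note that in this data the Roman numeral twelve maps to plain Latin letters rather than to other Roman numeral characters.

```python
import unicodedata


def compat_mapping(ch: str):
    """Return (keyword, decomposed string) for one character.

    Hypothetical helper: the keyword is None for a canonical
    decomposition, and the character is returned unchanged when the
    database lists no decomposition at all.
    """
    fields = unicodedata.decomposition(ch).split()
    if not fields:
        return None, ch
    tag = fields[0] if fields[0].startswith("<") else None
    codes = fields[1:] if tag else fields
    return tag, "".join(chr(int(code, 16)) for code in codes)


samples = [
    "\u216b",  # ROMAN NUMERAL TWELVE
    "\u00bc",  # VULGAR FRACTION ONE QUARTER
    "\ufb03",  # LATIN SMALL LIGATURE FFI
    "\ufe91",  # ARABIC LETTER BEH INITIAL FORM
]

for ch in samples:
    tag, target = compat_mapping(ch)
    points = " ".join(f"U+{ord(c):04X}" for c in target)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {tag} {points}")
```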
Line 35:
#Substitute (at the author's or reader's discretion) ligatures and contextual glyph variants.
#Lay out CJKV text vertically (at the author's or reader's discretion), substituting glyphs for small, vertical, narrow, wide and square forms, either from font data or synthesized as needed.
#Combine fractions using the '[[⁄|Fraction Slash]]' character (⁄ U+2044) and any other arbitrary characters.
#Combine a '[[̸|Combining Long Solidus Overlay]]' ( ̸ U+0338) with other symbols: for example ∃ (U+2203) followed by U+0338 for [[∄]] (U+2204).
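
The last two items in the list can be checked at the character level (not the rendering level) with a short Python sketch. It assumes the interpreter's bundled Unicode data; the variable names are illustrative only.

```python
import unicodedata

FRACTION_SLASH = "\u2044"        # U+2044 FRACTION SLASH
LONG_SOLIDUS_OVERLAY = "\u0338"  # U+0338 COMBINING LONG SOLIDUS OVERLAY

# An arbitrary fraction built from plain digits and the fraction slash.
one_quarter = "1" + FRACTION_SLASH + "4"

# The precomposed fraction U+00BC decomposes to the same sequence.
decomposed_quarter = unicodedata.normalize("NFKD", "\u00bc")
print(decomposed_quarter == one_quarter)

# THERE EXISTS (U+2203) plus the overlay composes canonically to
# THERE DOES NOT EXIST (U+2204).
negated = unicodedata.normalize("NFC", "\u2203" + LONG_SOLIDUS_OVERLAY)
print(f"U+{ord(negated):04X} {unicodedata.name(negated)}")
```

Whether the sequence with the fraction slash is actually displayed as a stacked or diagonal fraction depends on the font and the rendering software, which this sketch does not cover.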
 
Altogether, these compatibility characters included for incomplete Unicode implementations total 3,779 of the 5,402 designated compatibility characters. They include all of the compatibility characters marked with the keywords &lt;initial>, &lt;medial>, &lt;final>, &lt;isolated>, &lt;fraction>, &lt;wide>, &lt;narrow>, &lt;small>, &lt;vertical>, &lt;square>. They also include nearly all of the canonical and most of the &lt;compat> keyword compatibility characters (the exceptions include those &lt;compat> keyword characters for enclosed alphanumerics, enclosed ideographs and those discussed in [[#Semantically distinct characters|§ Semantically distinct characters]]).
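
A tally of decomposition keywords can be reproduced approximately with the sketch below. It is only an assumption-laden illustration: the function name <code>tag_counts</code> is hypothetical, the counts depend on the Unicode version shipped with the Python interpreter, and they will not necessarily match the totals quoted above, which reflect a particular version of the standard and the article's own classification.

```python
import sys
import unicodedata
from collections import Counter


def tag_counts() -> Counter:
    """Count characters by the keyword of their decomposition mapping.

    Characters with an untagged (canonical) decomposition are counted
    under the key "canonical"; characters without a decomposition in
    the database are skipped.
    """
    counts = Counter()
    for code_point in range(sys.maxunicode + 1):
        decomposition = unicodedata.decomposition(chr(code_point))
        if not decomposition:
            continue
        first = decomposition.split()[0]
        counts[first if first.startswith("<") else "canonical"] += 1
    return counts


counts = tag_counts()
for tag, total in sorted(counts.items()):
    print(f"{tag:12} {total}")
```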
Line 61:
* [[Greek letter]] based symbols (7): beta (ϐ U+03D0), theta (ϑ U+03D1), phi (ϕ U+03D5), pi (ϖ U+03D6), kappa (ϰ U+03F0), rho (ϱ U+03F1), capital theta (ϴ U+03F4)
 
While these compatibility characters are distinguished from their compatibility decomposition characters only by adding the word "symbol" to their name, they do represent long-standing distinct meanings in written mathematics. However, for all practical purposes they share the same semantics as their compatibility equivalent Greek or Hebrew letter. These may be considered borderline semantically distinguishable characters so they are not included in the total.
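
The relationship between the Greek letter based symbols and their compatibility equivalents can be seen in a brief Python sketch, assuming the interpreter's bundled Unicode data. The variable names are illustrative only.

```python
import unicodedata

# The seven Greek letter based symbols listed above.
greek_symbols = "\u03d0\u03d1\u03d5\u03d6\u03f0\u03f1\u03f4"

for ch in greek_symbols:
    # NFKC folds each symbol to its compatibility equivalent letter.
    folded = unicodedata.normalize("NFKC", ch)
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)} -> "
        f"U+{ord(folded):04X} {unicodedata.name(folded)}"
    )

folded_all = unicodedata.normalize("NFKC", greek_symbols)
```

Applying compatibility normalization to mathematical text therefore erases the symbol variants, which may or may not be acceptable depending on the application.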
 
Though it is not the intention of Unicode to encode such measuring units, the repertoire includes six (6) such symbols that authors should not use; the characters' decompositions should be used instead.
Line 94:
Several blocks of Unicode characters consist entirely or almost entirely of compatibility characters (U+F900–U+FFEF, except for the noncharacters). The compatibility blocks contain none of the semantically distinct compatibility characters, with only one exception: the rial currency symbol (﷼ U+FDFC). The compatibility decomposable characters in the compatibility blocks therefore fall unambiguously into the set of discouraged characters. Unicode recommends authors use the plain text compatibility decomposition equivalents instead and complement those characters with rich text markup. To give just one example, this approach is much more flexible and open-ended than using the finite set of circled or enclosed alphanumerics.
 
Unfortunately, there are a small number of characters even within the compatibility blocks that themselves are not compatibility characters and therefore may confuse authors. The "Enclosed CJK Letters and Months" block contains a single non-compatibility character: the 'Korean Standard Symbol' (㉿ U+327F). That symbol and 12 other characters have been included in the blocks for unknown reasons. The "CJK Compatibility Ideographs" block contains these non-compatibility unified Han ideographs:
 
# (U+FA0E): 﨎
Line 117:
 
Arabic Presentation Forms (4)
# "Ornate Left Parenthesis" (U+FD3E): ﴾. A glyph variant for U+0029 ')'
# "Ornate Right Parenthesis" (U+FD3F): ﴿. A glyph variant for U+0028 '('
# "Ligature Bismillah Ar-Rahman Ar-Raheem" (U+FDFD): ﷽. [[Bismillah ar-Rahman, ar-Raheem|Bismillah Ar-Rahman Ar-Raheem]] is a ligature for Beh (U+0628), Seen (U+0633), Meem (U+0645), Space (U+0020), Alef (U+0627), Lam (U+0644), Lam (U+0644), Heh (U+0647), Space (U+0020), Alef (U+0627), Lam (U+0644), Reh (U+0631), Hah (U+062D), Meem (U+0645), Alef (U+0627), Noon (U+0646), Space (U+0020), Alef (U+0627), Lam (U+0644), Reh (U+0631), Hah (U+062D), Yeh (U+064A), Meem (U+0645) i.e. {{rtl-lang|ar|بسم الله الرحمان الرحيم}} <ref>[https://www.unicode.org/charts/PDF/UFB50.pdf Unicode chart FB50-FDFF (PDF)].</ref><!-- Note: In Unicode, characters are written in "logical" sequence, i.e. from right to left in RTL languages such as Arabic. --> (Similarly, U+FDFA and U+FDFB code for two other Arabic ligatures, of 21 and 9 characters respectively.)
# "Arabic Tail Fragment" (U+FE73): ﹳ for supporting text systems without contextual glyph handling
 
CJK Compatibility Forms (2 that are both related to CJK Unified Ideograph: U+4E36 丶)