m Dating maintenance tags: {{When}} |
m convert special characters (via WP:JWB) |
Line 18:
# Characters included from other character sets or otherwise added to the UCS that constitute [[formatted text|rich text]] rather than the plain text goals of Unicode.
# Some other characters that are semantically distinct, but [[homoglyph|visually similar]].
Because these semantically distinct characters may be displayed with glyphs similar to the glyphs of other characters, text processing software should try to address possible confusion for the sake of end users. When comparing and collating (sorting) text strings, different forms and rich text variants of characters should not alter the text processing results. For example, software users may be confused when performing a find on a page for a capital Latin letter
== Compatibility mappings types ==
Line 24:
=== Glyph substitution and composition ===
Some compatibility characters are completely dispensable for text processing and display software that conforms to the Unicode standard. These include:
;[[typographic ligature|Ligatures]]: Ligatures such as
;Precomposed Roman numerals: For example, Roman numeral twelve (
;Precomposed [[vulgar fraction|fractions]]: These decompositions have the keyword <fraction>. A fully conforming text handler should<ref>{{cite web|author=The Unicode Consortium|authorlink=Unicode Consortium|year=2010|title=The Unicode Standard, Version 6.0.0|publisher=Addison-Wesley Professional|isbn=978-0321480910|pages=212|url=https://www.unicode.org/versions/Unicode6.0.0/ch06.pdf#G12861}}</ref> display the vulgar fraction ¼ (U+00BC) identically to the composed fraction 1⁄4 (numeral 1 with fraction slash U+2044 and numeral 4).
;Contextual glyphs or forms: These arise primarily in the Arabic script. Using fonts with glyph substitution capabilities such as [[OpenType]] and [[Apple Advanced Typography|TrueTypeGX]], Unicode-conforming software can substitute the proper glyph for the same character depending on whether that character appears at the beginning, middle, or end of a word, or in isolation. Such glyph substitution is also necessary for vertical (top to bottom) text layout for some East Asian languages. In this case glyphs must be substituted or synthesized for wide, narrow, small and square glyph forms. Non-conforming software, or software using other character sets, instead uses multiple separate characters for the same letter depending on its position, further complicating text processing.
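The compatibility mappings behind the categories above can be inspected directly. The following is a minimal sketch, assuming Python's standard <code>unicodedata</code> module (which reflects whichever Unicode version ships with the interpreter); the helper name <code>describe</code> and the choice of sample characters are illustrative only.

```python
import unicodedata

# Illustrative samples, one per category discussed above.
samples = {
    "ligature fi": "\uFB01",
    "Roman numeral twelve": "\u216B",
    "vulgar fraction one quarter": "\u00BC",
    "Arabic letter beh, initial form": "\uFE91",
}

def describe(ch):
    """Return (keyword, decomposed text) for a single character.

    The keyword is the <...> tag of a compatibility decomposition,
    or an empty string for a canonical or absent decomposition.
    """
    fields = unicodedata.decomposition(ch).split()
    keyword = fields[0] if fields and fields[0].startswith("<") else ""
    code_points = fields[1:] if keyword else fields
    return keyword, "".join(chr(int(cp, 16)) for cp in code_points)

for label, ch in samples.items():
    keyword, plain = describe(ch)
    print(f"U+{ord(ch):04X} {label}: {keyword or 'canonical'} -> {plain!r}")
```

Whether the decomposed sequence is actually rendered like the precomposed character depends on the font and layout engine, which this sketch does not exercise.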
Line 35:
#Substitute (at the author's or reader's discretion) ligatures and contextual glyph variants.
#Lay out CJKV text vertically (at the author's or reader's discretion), substituting glyphs for small, vertical, narrow, wide and square forms, either from font data or synthesized as needed.
#Combine fractions using the
#Combine a
Altogether, the compatibility characters included for incomplete Unicode implementations total 3,779 of the 5,402 designated compatibility characters. These include all of the compatibility characters marked with the keywords <initial>, <medial>, <final>, <isolated>, <fraction>, <wide>, <narrow>, <small>, <vertical> and <square>. They also include nearly all of the canonical and most of the <compat> keyword compatibility characters (the exceptions are those <compat> keyword characters for enclosed alphanumerics, enclosed ideographs and those discussed in [[#Semantically distinct characters|§ Semantically distinct characters]]).
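A tally by keyword can be reproduced approximately from the character database. This is a rough sketch, assuming Python's <code>unicodedata</code> module; because it uses the Unicode version bundled with the interpreter, its counts will generally differ from the figures quoted above, and it counts every decomposable character rather than only the subset described in this section. The function name <code>keyword_counts</code> is hypothetical.

```python
import sys
import unicodedata
from collections import Counter

def keyword_counts():
    """Count decomposable characters, grouped by decomposition keyword.

    Characters whose decomposition carries no <...> tag are grouped
    under the label "canonical".
    """
    counts = Counter()
    for cp in range(sys.maxunicode + 1):
        decomposition = unicodedata.decomposition(chr(cp))
        if not decomposition:
            continue  # no decomposition mapping for this code point
        first = decomposition.split()[0]
        counts[first if first.startswith("<") else "canonical"] += 1
    return counts

counts = keyword_counts()
print("Unicode version:", unicodedata.unidata_version)
for keyword, total in sorted(counts.items()):
    print(f"{keyword:12} {total}")
```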
Line 61:
* [[Greek letter]] based symbols (7): beta (ϐ U+03D0), theta (ϑ U+03D1), phi (ϕ U+03D5), pi (ϖ U+03D6), kappa (ϰ U+03F0), rho (ϱ U+03F1), capital theta (ϴ U+03F4)
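The effect of compatibility normalization on the Greek letter based symbols listed above can be illustrated with a short sketch. It assumes Python's <code>unicodedata</code> module, and the helper name <code>fold</code> is hypothetical. It shows why such normalization is lossy for these characters: the symbol and the ordinary letter become indistinguishable afterwards.

```python
import unicodedata

# The seven Greek letter based symbols listed above.
GREEK_SYMBOLS = "\u03D0\u03D1\u03D5\u03D6\u03F0\u03F1\u03F4"

def fold(text):
    """Apply compatibility normalization (NFKC) to the text."""
    return unicodedata.normalize("NFKC", text)

for ch in GREEK_SYMBOLS:
    folded = fold(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> "
          f"U+{ord(folded):04X} {unicodedata.name(folded)}")
```

A search function might apply <code>fold</code> to both the query and the text so that either form matches, while leaving the stored text unchanged so the semantic distinction is preserved.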
While these compatibility characters are distinguished from their compatibility decomposition characters only by adding the word
Although it was not the intention of Unicode to encode such measuring units, the repertoire includes six such symbols. Authors should not use them; the characters' decompositions should be used instead.
Line 94:
Several blocks of Unicode characters consist entirely or almost entirely of compatibility characters (U+F900–U+FFEF, except for the noncharacters). The compatibility blocks contain none of the semantically distinct compatibility characters, with only one exception: the rial currency symbol (﷼ U+FDFC). The compatibility decomposable characters in the compatibility blocks therefore fall unambiguously into the set of discouraged characters. Unicode recommends that authors use the plain text compatibility decomposition equivalents instead and complement those characters with rich text markup. This approach is much more flexible and open-ended than using, to give just one example, the finite set of circled or enclosed alphanumerics.
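The range mentioned above can be scanned to separate decomposable characters from the rest. This is a minimal sketch, assuming Python's <code>unicodedata</code> module and the Unicode version it bundles; the function name <code>split_compatibility_range</code> is hypothetical, and the range constants simply restate the figures in the text.

```python
import unicodedata

FIRST, LAST = 0xF900, 0xFFEF  # range quoted in the text above

def split_compatibility_range():
    """Partition assigned characters in the range by whether they decompose."""
    decomposable, other = [], []
    for cp in range(FIRST, LAST + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == "Cn":
            continue  # unassigned code point or noncharacter
        if unicodedata.decomposition(ch):
            decomposable.append(ch)
        else:
            other.append(ch)
    return decomposable, other

decomposable, other = split_compatibility_range()
print(len(decomposable), "characters with a decomposition mapping")
print(len(other), "assigned characters without one")
# The rial sign carries a compatibility decomposition despite being
# semantically distinct:
print(unicodedata.decomposition("\uFDFC"))
```

Characters reported in the second group include format and combining characters as well as any letters or ideographs that have no decomposition, so the list needs manual review before drawing conclusions from it.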
Unfortunately, even within the compatibility blocks there are a small number of characters that are not themselves compatibility characters and may therefore confuse authors. The
# (U+FA0E): 﨎
Line 117:
Arabic Presentation Forms (4)
#
#
#
#
CJK Compatibility Forms (2 that are both related to CJK Unified Ideograph: U+4E36 丶)