m clean up, References after punctuation per WP:REFPUNC and WP:PAIC using AWB (8748) |
Line 1:
{{original research|date=July 2008}}
In [[Unicode]] and the [[Universal Character Set|UCS]], a '''compatibility character''' is a character that is encoded solely to maintain round-trip convertibility with other, often older, standards.<ref>{{cite web|title=Chapter 2.3: Compatibility characters|url=http://www.unicode.org/versions/Unicode6.0.0/ch02.pdf#G11062|work=The Unicode Standard 6.0.0}}</ref>
<blockquote>
Line 17:
== Compatibility mappings types ==
=== Glyph substitution and composition ===
Some compatibility characters are completely dispensable for text processing and display software that conforms to the Unicode standard. These include:
Line 22 ⟶ 23:
* '''[[typographic ligature|Ligatures]]'''. Ligatures such as ‘ffi’ in the Latin script were often encoded as separate characters in legacy character sets. Unicode’s approach is to treat ligatures as rich text and, when they are turned on, to handle them through glyph substitution.
* '''Precomposed Roman numerals'''. For example, Roman numeral twelve (‘Ⅻ’: U+216B) can be decomposed into a Roman numeral ten (‘Ⅹ’: U+2169) and two Roman numeral ones (‘Ⅰ’: U+2160).
* '''Precomposed [[vulgar fraction|fractions]]'''. These decompositions have the keyword <fraction>. A fully conforming text handler should<ref>{{cite web|author=The Unicode Consortium|authorlink=Unicode Consortium|
* '''Contextual glyphs or forms'''. These arise primarily in the Arabic script. Using fonts with glyph substitution capabilities such as [[OpenType]] and [[Apple Advanced Typography|TrueTypeGX]], Unicode-conforming software can substitute the proper glyphs for the same character depending on whether that character appears at the beginning, middle or end of a word, or in isolation. Such glyph substitution is also necessary for vertical (top to bottom) text layout for some East Asian languages. In this case glyphs must be substituted or synthesized for wide, narrow, small and square glyph forms. Non-conforming software, or software using other character sets, instead uses multiple separate characters for the same letter depending on its position, further complicating text processing.
Line 32 ⟶ 33:
# Lay out CJKV text vertically (at the author's or reader's discretion), substituting glyphs for small, vertical, narrow, wide and square forms, either from font data or synthesized as needed.
# Combine fractions using the ‘[[⁄|Fraction Slash]]’ character (⁄ U+2044) and any other arbitrary characters.
# Combine a ‘[[
Altogether, these compatibility characters, included for incomplete Unicode implementations, total 3,779 of the 5,402 designated compatibility characters. They include all of the compatibility characters marked with the keywords <initial>, <medial>, <final>, <isolated>, <fraction>, <wide>, <narrow>, <small>, <vertical> and <square>. They also include nearly all of the canonical and most of the <compat> keyword compatibility characters (the exceptions include the <compat> keyword characters for enclosed alphanumerics and enclosed ideographs, and those discussed in the [[Mapping of Unicode characters#Semantically distinct characters|subsequent section]]).
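The keywords listed above can be read from the Unicode Character Database. The following is a minimal sketch, assuming Python's standard <code>unicodedata</code> module; the helper name <code>mapping_keyword</code> is hypothetical and introduced only for illustration. It returns the tag at the start of a character's decomposition mapping, reports untagged mappings as canonical, and returns nothing for characters with no mapping.

```python
import unicodedata


def mapping_keyword(ch):
    """Return the decomposition keyword of ch (hypothetical helper).

    Returns the tag such as '<fraction>' when the mapping is tagged,
    'canonical' when a mapping exists without a tag, and None when the
    character has no decomposition mapping at all.
    """
    decomp = unicodedata.decomposition(ch)
    if not decomp:
        return None
    first = decomp.split()[0]
    if first.startswith("<") and first.endswith(">"):
        return first
    return "canonical"


# Sample characters: ffi ligature, vulgar fraction one half, Arabic alef
# final form, fullwidth Latin A, Roman numeral twelve, plain Latin A.
samples = ["\uFB03", "\u00BD", "\uFE8E", "\uFF21", "\u216B", "A"]
for ch in samples:
    print(f"U+{ord(ch):04X}", mapping_keyword(ch))
```

The exact results depend on the version of the Unicode Character Database bundled with the Python interpreter, although the mappings for the sample characters have long been stable.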
Line 77 ⟶ 78:
* 1 character from the [[Tifinagh]] script: Tifinagh Modifier Letter Labialization Mark (ⵯ U+2D6F)
* 1 character from the [[Georgian script]]: Modifier Letter Georgian Nar (ჼ U+10FC)
* masculine ([[º|U+00BA]]) and feminine ([[ª|U+00AA]]) ordinal indicators included in the Latin-1 supplement{{
Finally, Unicode designates Roman numerals as compatibility equivalents of the Latin letters that share the same glyphs. Here the Unicode Standard makes the same mistake of confusing glyph and character that it so often seeks to prevent. There is certainly a need to deal with the visual ambiguity these characters may suffer when sharing the same glyphs; however, a [[Sign-value notation|sign-value]] numeral for one is a semantically distinct character from a Latin capital or small letter ‘i’.{{citation needed|date=May 2009}} A similar visual ambiguity exists between such characters as the Latin capital letter A (U+0041) and the Greek capital letter Alpha (Α U+0391), yet Unicode does not unify those characters.
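The contrast between the two cases can be observed with compatibility normalization. This is a minimal sketch, assuming Python's standard <code>unicodedata</code> module and Normalization Form KC; the function name <code>compat_fold</code> is hypothetical. Roman numerals fold to Latin letters, while Greek capital Alpha is left distinct from Latin capital A.

```python
import unicodedata


def compat_fold(text):
    """Apply compatibility normalization (NFKC) to text (hypothetical helper)."""
    return unicodedata.normalize("NFKC", text)


roman_twelve = "\u216B"     # Roman numeral twelve
small_roman_one = "\u2170"  # small Roman numeral one
greek_alpha = "\u0391"      # Greek capital letter Alpha

# Roman numerals lose their identity and become plain Latin letters.
print(compat_fold(roman_twelve) == "XII")
print(compat_fold(small_roman_one) == "i")

# Alpha has no mapping to Latin A, so it is unchanged by normalization.
print(compat_fold(greek_alpha) == "A")
print(compat_fold(greek_alpha) == greek_alpha)
```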
Line 107 ⟶ 108:
# (U+FA29): 﨩
These thirteen characters are not compatibility characters, and their use is not discouraged in any way. However, U+27EAF 𧺯, which is identical to U+FA23 﨣, was mistakenly encoded in CJK Unified Ideographs Extension B.<ref>[http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg26/IRGN1218_Response_to_WG2.pdf#page=4 IRGN 1218]</ref>
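One way to see the difference is to check whether a code point in the CJK Compatibility Ideographs block carries a decomposition mapping. The sketch below assumes Python's standard <code>unicodedata</code> module and uses the absence of a mapping as a rough proxy for "not a compatibility character"; the function name <code>has_mapping</code> is hypothetical.

```python
import unicodedata


def has_mapping(ch):
    """Return True if ch has any decomposition mapping (hypothetical helper)."""
    return unicodedata.decomposition(ch) != ""


# U+F900 is a true compatibility ideograph: it maps canonically to U+8C48.
print(has_mapping("\uF900"), unicodedata.decomposition("\uF900"))

# U+FA23 and U+FA29 sit in the same block but carry no mapping at all.
print(has_mapping("\uFA23"))
print(has_mapping("\uFA29"))
```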
Several other characters in these blocks have no compatibility mapping but are clearly intended for legacy support: