Unicode compatibility characters
{{Multiple issues|
{{original research|date=July 2008}}
{{refimprove|date=July 2016}}
}}
 
In [[Unicode]] and the [[Universal Character Set|UCS]], a '''compatibility character''' is a character that is encoded solely to maintain round trip convertibility with other, often older, standards.<ref>{{cite web|title=Chapter 2.3: Compatibility characters|url=http://www.unicode.org/versions/Unicode6.0.0/ch02.pdf#G11062|work=The Unicode Standard 6.0.0}}</ref> As the Unicode Glossary says:
 
* '''Contextual glyphs or forms'''. These arise primarily in the Arabic script. Using fonts with glyph substitution capabilities such as [[OpenType]] and [[Apple Advanced Typography|TrueTypeGX]], Unicode-conforming software can substitute the proper glyph for the same character depending on whether that character appears at the beginning, middle, or end of a word, or in isolation. Such glyph substitution is also necessary for vertical (top-to-bottom) text layout in some East Asian languages, where glyphs must be substituted or synthesized for wide, narrow, small, and square forms. Non-conforming software, or software using other character sets, instead uses multiple separate characters for the same letter depending on its position, further complicating text processing.
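Where legacy standards encoded these contextual forms as separate characters, Unicode carries them as compatibility characters whose compatibility decomposition points back to the plain letter. A minimal sketch of this folding, using Python's standard `unicodedata` module (the specific code points chosen here are illustrative):

```python
import unicodedata

# U+FEE3 ARABIC LETTER MEEM MEDIAL FORM is a contextual-form
# compatibility character; U+0645 ARABIC LETTER MEEM is the plain letter.
medial_meem = "\uFEE3"
plain_meem = "\u0645"

# Canonical normalization (NFC) leaves the presentation form intact;
# compatibility normalization (NFKC) folds it to the base letter.
assert unicodedata.normalize("NFC", medial_meem) == medial_meem
assert unicodedata.normalize("NFKC", medial_meem) == plain_meem
```

In conforming software the positional shaping is then re-derived by the font layer at display time, not stored in the text.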
 
The UCS, Unicode character properties, and the Unicode algorithms provide software implementations with everything needed to display these characters properly from their decomposition equivalents. These decomposable compatibility characters are therefore redundant and unnecessary. Their existence in the character set requires extra text processing to ensure that text is properly compared and collated (see [[Unicode normalization]]). Moreover, these compatibility characters provide no additional or distinct semantics, nor any visually distinct rendering, provided the text layout and fonts are Unicode conforming. Nor are any of these characters required for round-trip convertibility with other character sets, since a transliteration can simply map decomposed characters to their precomposed counterparts in the other character set. Similarly, a contextual form, such as a final Arabic letter, can be mapped to the appropriate legacy character-set form based on its position within a word.
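The equivalence between decomposed characters and their precomposed counterparts described above can be sketched with Python's standard `unicodedata` module:

```python
import unicodedata

# "é" spelled as a single precomposed code point, and as a
# base letter plus combining mark:
precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' + COMBINING ACUTE ACCENT

# The two spellings are canonically equivalent: NFC recomposes and
# NFD decomposes, so either form can be mapped to a legacy character
# set that only encodes the precomposed letter.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```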
 
In order to dispense with these compatibility characters, text software must conform to several Unicode protocols. The software must be able to:
=== Semantically distinct characters ===
 
Many compatibility characters are semantically distinct characters, though they may share representational glyphs with other characters. Some of these characters may have been included because most other character sets focused on a single script or writing system. For example, the ISO and other Latin character sets likely included a character for π (pi) because, focusing primarily on one script, they would not otherwise have had a character for that common mathematical symbol. With Unicode, however, mathematicians are free to use letters from any known script in the world to stand for a mathematical set or mathematical constant. To date, Unicode has added specific semantic support for only a few such mathematical constants (for example the Planck constant, U+210E, and the Euler constant, U+2107, both of which Unicode considers compatibility characters). Unicode therefore designates several mathematical symbols based on Greek and Hebrew letters as compatibility characters. These include:
 
* [[Hebrew letter]] based symbols (4): alef (ℵ U+2135), bet (ℶ U+2136), gimel (ℷ U+2137) and dalet (ℸ U+2138)
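The compatibility status of these letterlike symbols is recorded in their decomposition mapping, which can be inspected with Python's standard `unicodedata` module (one symbol from the list above, as an illustration):

```python
import unicodedata

# ALEF SYMBOL (U+2135) carries a compatibility decomposition
# pointing at the ordinary HEBREW LETTER ALEF (U+05D0).
assert unicodedata.decomposition("\u2135") == "<compat> 05D0"

# NFKC folds the mathematical symbol into the plain letter,
# discarding the "mathematical symbol" semantics in the process.
assert unicodedata.normalize("NFKC", "\u2135") == "\u05D0"
```

Because that fold loses the semantic distinction, NFKC normalization is generally inappropriate for text where these symbols are used with their mathematical meaning.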
 
== Normalization ==
{{main article|Unicode normalization}}
 
Normalization is the process by which Unicode-conforming software first performs canonical or compatibility decomposition before comparing or collating text strings. This is similar to other operations needed when, for example, a user performs a case- or diacritic-insensitive search within some text: the software must equate or ignore characters it would not otherwise equate or ignore. Typically, normalization is performed without altering the underlying stored text data (lossless). However, some software may make permanent changes to stored text that eliminate canonical or even compatibility-character differences (lossy).
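The difference between canonical and compatibility normalization can be seen with Python's standard `unicodedata` module:

```python
import unicodedata

# The ligature "ﬁ" (U+FB01) is a compatibility character.
# Canonical normalization (NFC) leaves it intact, while compatibility
# normalization (NFKC) decomposes it to the two letters "fi",
# so that "ﬁle" and "file" compare equal after normalization.
assert unicodedata.normalize("NFC", "\uFB01le") == "\uFB01le"
assert unicodedata.normalize("NFKC", "\uFB01le") == "file"
assert (unicodedata.normalize("NFKC", "\uFB01le")
        == unicodedata.normalize("NFKC", "file"))
```

Software that normalizes only for comparison (as in a search) keeps the stored text unchanged; writing the NFKC result back to storage would be the lossy case described above.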