Unicode compatibility characters: Difference between revisions

Revision as of 23:41, 9 June 2018 edit Pengo (talk \| contribs) Administrators 19,382 edits m →Rich text compatibility characters: grammar fix ← Previous edit		Latest revision as of 19:13, 28 July 2025 edit undo AnomieBOT (talk \| contribs) Bots 6,860,174 edits m Dating maintenance tags: {{Citation needed}}
(28 intermediate revisions by 19 users not shown)
Line 1:
{{Short description\|Character encoded solely to maintain round-trip convertibility with other standards}}
{{Multiple issues\|
{{original research\|date=July 2008}}
{{~~refimprove~~more citations needed\|date=July 2016}}
}}
In [[Unicode]] and the [[Universal Character Set\|UCS]], a '''compatibility character''' is a character that is encoded solely to maintain [[Round-trip format conversion\|round-trip convertibility]] with other, often older, standards.<ref>{{cite web\|title=Chapter 2.3: Compatibility characters\|url=https://www.unicode.org/versions/Unicode6.0.0/ch02.pdf#G11062\|work=The Unicode Standard 6.0.0}}</ref> As the Unicode Glossary says:
<blockquote>
Line 13 ⟶ 14:
== Compatibility character types and keywords ==
{{Time-context\|section\|that may not be current but does not specify which version of Unicode is being referenced\|date=June 2018}}
The compatibility decomposition property for the 5,402 Unicode compatibility characters{{when\|date=June 2018}} includes a keyword that divides the compatibility characters into 17 logical groups. Characters that have a decomposition but no keyword are termed canonical decomposable characters; they are not compatibility characters. Keywords for compatibility decomposable characters include: <initial>, <medial>, <final>, <isolated>, <wide>, <narrow>, <small>, <square>, <vertical>, <circle>, <noBreak>, <fraction>, <sub>, <super>, and <compat>. These keywords give some indication of the relation between the compatibility character and its compatibility decomposition character sequence.

Compatibility characters fall into three basic categories:
# Characters corresponding to multiple alternate glyph forms and precomposed diacritics to support software and font implementations that do not include complete Unicode text layout capabilities.
# Characters included from other character sets or otherwise added to the UCS that constitute [[formatted text\|rich text]] rather than plain text, which is the goal of Unicode.
# Some other characters that are semantically distinct but [[homoglyph\|visually similar]].

Because these semantically distinct characters may be displayed with glyphs similar to the glyphs of other characters, text processing software should try to address possible confusion for the sake of end users. When comparing and collating (sorting) text strings, different forms and rich text variants of characters should not alter the text processing results. For example, users may be confused when they search a page for a capital Latin letter ~~‘I’~~'I' and their software fails to find the visually similar [[Roman numeral]] ~~‘Ⅰ’~~'Ⅰ'.

== Compatibility mappings types ==
Line 23 ⟶ 25:
=== Glyph substitution and composition ===
Some compatibility characters are completely dispensable for text processing and display software that conforms to the Unicode standard. These include:
;[[typographic ligature\|Ligatures]]: Ligatures such as ~~‘ffi’~~'ﬃ' in the Latin script were often encoded as a separate character in legacy character sets. ~~Unicode’s~~Unicode's approach to ligatures is to treat them as rich text and, if turned on, ~~handled~~handle them through glyph substitution.
;Precomposed Roman numerals: For example, Roman numeral twelve (~~‘Ⅻ’~~'Ⅻ': U+216B) can be decomposed into a Roman numeral ten (~~‘Ⅹ’~~'Ⅹ': U+2169) and two Roman numeral ones (~~‘Ⅰ’~~'Ⅰ': U+2160). Precomposed characters are in the [[Number Forms]] block.
;Precomposed [[vulgar fraction\|fractions]]: These decompositions have the keyword <fraction>.
A fully conforming text handler should<ref>{{cite ~~web~~book\|author=The Unicode Consortium\|authorlink=Unicode Consortium\|year=2010\|title=The Unicode Standard, Version 6.0.0\|publisher=Addison-Wesley Professional\|isbn=978-0321480910\|pages=212\|url=https://www.unicode.org/versions/Unicode6.0.0/ch06.pdf#G12861}}</ref> display the vulgar fraction ¼ (U+00BC) identically to the composed fraction 1⁄4 (numeral 1 with fraction slash U+2044 and numeral 4). Precomposed characters are in the [[Number Forms]] block.
;Contextual glyphs or forms: These arise primarily in the Arabic script. Using fonts with glyph substitution capabilities such as [[OpenType]] and [[Apple Advanced Typography\|TrueTypeGX]], Unicode conforming software can substitute the proper glyphs for the same character depending on whether that character appears at the beginning, middle, or end of a word, or in isolation. Such glyph substitution is also necessary for vertical (top to bottom) text layout for some East Asian languages. In this case glyphs must be substituted or synthesized for wide, narrow, small and square glyph forms. Non-conforming software, or software using other character sets, instead uses multiple separate characters for the same letter depending on its position, further complicating text processing.
Line 32 ⟶ 34:
In order to dispense with these compatibility characters, text software must conform to several Unicode protocols. The software must be able to:
#Compose diacritic-marked graphemes from letter characters and one or more separate combining diacritic marks.
#Substitute (at the author's or ~~readers~~reader's discretion) ligatures and contextual glyph variants.
#~~Layout~~Lay out CJKV text vertically (at the author's or reader's discretion), substituting glyphs for small, vertical, narrow, wide and square forms, either from font data or synthesized as needed.
#Combine fractions using the '[[⁄\|Fraction Slash]]' character (⁄ U+2044) and any other arbitrary characters.
#Combine a '[[̸\|Combining Long Solidus Overlay]]' ( ̸ U+0338) with other symbols: for example, ∃ (U+2203) followed by U+0338 is equivalent to the precomposed [[∄]] (U+2204).

Altogether, the compatibility characters included for incomplete Unicode implementations total 3,779 of the 5,402 designated compatibility characters. These include all of the compatibility characters marked with the keywords <initial>, <medial>, <final>, <isolated>, <fraction>, <wide>, <narrow>, <small>, <vertical>, <square>. They also include nearly all of the canonical and most of the <compat> keyword compatibility characters (the exceptions include those <compat> keyword characters for enclosed alphanumerics, enclosed ideographs and those discussed in [[#Semantically distinct characters\|§ Semantically distinct characters]]).

=== Rich text compatibility characters ===
Many other compatibility characters constitute what Unicode considers rich text and are therefore outside the goals of Unicode and UCS. In some sense even compatibility characters discussed in the previous ~~section — those~~section—those that aid legacy software in displaying ligatures and vertical ~~text — constitute~~text—constitute a form of rich text, since the rich text protocols determine whether text is displayed in one way or another. However, the choices to display text with or without ligatures, or vertically versus horizontally, are both non-semantic rich text. They are simply style differences. This is in contrast to other rich text such as italics, superscripts and subscripts, or list markers, where the styling of the rich text implies certain semantics along with it.

For comparing, collating, handling and storing plain text, rich text variants are semantically redundant. For example, using a superscript character for the numeral 4 is likely indistinguishable from using the standard character for a numeral 4 and then using rich text protocols to make it superscript.
Such alternate rich text characters therefore create ambiguity because they appear visually the same as their plain text counterpart characters with rich text formatting applied. These rich text compatibility characters include:
Line 60 ⟶ 62:
* [[Greek letter]] based symbols (7): beta (ϐ U+03D0), theta (ϑ U+03D1), phi (ϕ U+03D5), pi (ϖ U+03D6), kappa (ϰ U+03F0), rho (ϱ U+03F1), capital theta (ϴ U+03F4)
While these compatibility characters are distinguished from their compatibility decomposition characters only by adding the word ~~“symbol”~~"symbol" to their name, they do represent long-standing distinct meanings in written mathematics. However, for all practical purposes they share the same semantics as their compatibility equivalent Greek or Hebrew letter. These may be considered borderline semantically distinguishable characters, so they are not included in the total.

Although Unicode did not intend to encode such measuring units, the repertoire includes six (6) such symbols that should not be used by authors; the characters' decompositions should be used instead.<ref>Omega, mu, Angstrom, Kelvin: {{cite web \|url=http://www.unicode.org/reports/tr25/ \|title=Unicode Technical Report #25 / Unicode Support for Mathematics \|date=2017-05-30 \|page=11 \|author=Unicode Consortium}}</ref><ref name="decomp" />
* Unit symbols (6): [[Angstrom]] ~~(Å U+~~{{unichar\|212B\|name=none}}: use U+00C5 instead), [[Ohm]] (~~Ω, U+~~{{unichar\|2126\|name=none}}: use U+03A9 instead), [[Kelvin]] (~~K U+~~{{unichar\|212A\|name=none}}: use U+004B instead), [[Fahrenheit]] (℉ U+2109 ℉: use [[°\|U+00B0]] and U+0046 instead), [[Celsius]] (℃ U+2103 ℃: use U+00B0 and U+0043 instead), [[micro-\|Micro]] Sign (µ U+00B5 µ: use U+03BC instead)
Unicode also designates ~~twenty-two (~~22) other letter-like symbols as compatibility characters.<ref name="decomp">≈ designates compatibility decomposition according to https://www.unicode.org/versions/Unicode15.0.0/ch24.pdf and is shown in code charts 
at https://www.unicode.org/charts/nameslist/n_2100.html</ref>
* Other Greek letter-based symbols (4): lunate epsilon (ϵ U+03F5), lunate sigma (ϲ U+03F2), capital lunate sigma (Ϲ U+03F9), upsilon with hook (ϒ U+03D2)
* Mathematical constants (3): Euler constant ([[ℇ]] U+2107), [[Planck constant]] (ℎ U+210E), [[reduced Planck constant]] (ℏ U+210F)
* Currency symbols (2): rupee sign (₨ U+20A8), rial sign (﷼ U+FDFC)
* Punctuation (4): one dot [[leader (typography)\|leader]] (U+2024), no-break space (U+00A0), non-breaking hyphen (U+2011), Tibetan mark delimiter tsheg bstar (U+0F0C)
* Other letter-like symbols (10): information source (ℹ U+2139), account of (℀ U+2100), addressed to the subject (℁ U+2101), care of (℅ U+2105), cada una (℆ U+2106), [[Numero sign\|numero]] (№ U+2116), telephone sign (℡ U+2121), facsimile sign (℻ U+213B), trademark (™ U+2122), service mark (℠ U+2120)
In addition, several scripts~~{{which\|date=January 2012}}~~ use glyph position such as superscripts and subscripts to differentiate semantics. In these cases subscripts and superscripts are not merely rich text, but constitute a distinct character — similar to a hybrid between a diacritic and a letter{{Or\|date=January 2012}}<!-- are they [[Spacing modifier character]]s or what? there is no example, so we should guess author's intentions. --> — in the writing system (130 total).
* 112 characters representing abstract phonemes from phonetic alphabets such as the [[International Phonetic Alphabet]] use such positional glyphs to represent semantic differences (U+1D2C – U+1D6A, U+1D78, U+1D9B – U+1DBF, U+02B0 – U+02B8, U+02E0 – U+02E4)
* 14 characters from the [[Kanbun]] block (U+3192 – U+319F)
* 1 character from the [[Tifinagh]] script: Tifinagh Modifier Letter Labialization Mark (ⵯ U+2D6F)
* 1 character from the [[Georgian script]]: Modifier Letter Georgian Nar (ჼ U+10FC)
* masculine ([[º\|U+00BA]]) and feminine ([[ª\|U+00AA]]) ordinal indicators included in the [[Latin-1 ~~supplement~~Supplement]]{{citation needed\|date=January 2012}} block
Finally, Unicode designates Roman numerals as compatibility equivalents of the Latin letters that share the same glyphs.{{Citation needed\|date=November 2015}}
Line 93 ⟶ 95:
Several blocks of Unicode characters consist entirely or almost entirely of compatibility characters (U+F900–U+FFEF, except for the noncharacters). The compatibility blocks contain none of the semantically distinct compatibility characters, with only one exception: the rial currency symbol (﷼ U+FDFC). The compatibility decomposable characters in the compatibility blocks therefore fall unambiguously into the set of discouraged characters. Unicode recommends authors use the plain text compatibility decomposition equivalents instead and complement those characters with rich text markup. This approach is much more flexible and open-ended than using the finite set of circled or enclosed alphanumerics, to give just one example.

~~Unfortunately, there~~There are a small number of characters even within the compatibility blocks that themselves are not compatibility characters and therefore may confuse authors. The ~~“Enclosed~~"Enclosed CJK Letters and ~~Months”~~Months" block contains a single non-compatibility character: the ~~‘Korean~~'Korean Standard ~~Symbol’~~Symbol' (㉿ U+327F).
That symbol and 12 other characters have been included in the blocks for unknown reasons. The ~~“CJK~~"CJK Compatibility ~~Ideographs”~~Ideographs" block contains these non-compatibility unified Han ideographs:
# (U+FA0E): 﨎
Line 110 ⟶ 112:
These thirteen characters are not compatibility characters, and their use is not discouraged in any way. However, U+27EAF 𧺯, the same as U+FA23 﨣, is mistakenly encoded in CJK Unified Ideographs Extension B.<ref>[http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg26/IRGN1218_Response_to_WG2.pdf#page=4 IRGN 1218]</ref> In any event, a normalized text should never contain both U+27EAF 𧺯 and U+FA23 﨣; these code points represent the same character, encoded twice.

Several other characters in these blocks have no compatibility mapping but are clearly intended for legacy support:{{citation needed\|date=July 2025}}

Alphabetic Presentation Forms (1)
Line 116 ⟶ 118:
Arabic Presentation Forms (4)
# ~~“Ornate~~"Ornate Left ~~Parenthesis”~~Parenthesis" (U+FD3E): ﴾. A glyph variant for U+~~0029~~0028 ~~‘)’~~'('
# ~~“Ornate~~"Ornate Right ~~Parenthesis”~~Parenthesis" (U+FD3F): ﴿. A glyph variant for U+~~0028~~0029 ~~‘ (’~~')'
# ~~“Ligature~~"Ligature Bismillah Ar-Rahman Ar-~~Raheem”~~Raheem" (U+FDFD): ﷽. [[Bismillah ar-Rahman, ar-Raheem\|Bismillah Ar-Rahman Ar-Raheem]] is a ligature for Beh (U+0628), Seen (U+0633), Meem (U+0645), Space (U+0020), Alef (U+0627), Lam (U+0644), Lam (U+0644), Heh (U+0647), Space (U+0020), Alef (U+0627), Lam (U+0644), Reh (U+0631), Hah (U+062D), Meem (U+0645), Alef (U+0627), Noon (U+0646), Space (U+0020), Alef (U+0627), Lam (U+0644), Reh (U+0631), Hah (U+062D), Yeh (U+064A), Meem (U+0645) i.e. {{~~rtl-~~lang\|ar\|بسم الله الرحمان الرحيم}} <ref>[https://www.unicode.org/charts/PDF/UFB50.pdf Unicode chart FB50-FDFF (PDF)].</ref><!-- Note: In Unicode, characters are written in "logical" sequence, i.e. from right to left in RTL languages such as Arabic. 
--> (Similarly, U+FDFA and U+FDFB code for two other Arabic ligatures, of 21 and 9 characters respectively.)
# ~~“Arabic~~"Arabic Tail ~~Fragment”~~Fragment" (U+FE73): ﹳ for supporting text systems without contextual glyph handling
CJK Compatibility Forms (2 that are both related to CJK Unified Ideograph: U+4E36 丶)
Line 132 ⟶ 134:
{{main article\|Unicode normalization}}
Normalization is the process by which Unicode conforming software first performs full compatibility decomposition (or composition) before making comparisons or collating text strings. This is similar to other operations needed when, for example, a user performs a case- or diacritic-insensitive search within some text. In such cases software must equate or ignore characters it would not otherwise equate or ignore. Typically normalization is performed without altering the underlying stored text data (lossless). However, some software may make permanent changes to text that eliminate the canonical or even non-canonical compatibility character differences from text storage (lossy).

== See also ==
* [[CJK Compatibility]]
* [[CJK Compatibility Forms]]
* [[CJK Compatibility Ideographs]]

== References ==
Line 142 ⟶ 150:
{{Unicode navigation}}
~~{{DEFAULTSORT~~[[Category:Unicode \|Compatibility ~~Characters}}~~characters]] ~~[[Category:Unicode]]~~
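
The decomposition keywords and the normalization step described in the text above can be explored with a short Python sketch. It is an illustration only, not part of the article: it relies on the standard library's <code>unicodedata</code> module, whose results depend on the Unicode version bundled with the Python build, and the helper names <code>compat_keyword</code> and <code>compat_equal</code> are hypothetical.

```python
import unicodedata


def compat_keyword(ch):
    """Return the compatibility keyword of one character, or None.

    unicodedata.decomposition() returns the raw decomposition mapping,
    for example '<super> 0034'. A mapping with no leading '<keyword>'
    is a canonical decomposition; an empty string means no decomposition.
    """
    mapping = unicodedata.decomposition(ch)
    if mapping.startswith("<"):
        return mapping.split(" ", 1)[0]
    return None


def compat_equal(a, b):
    """Compare two strings after compatibility normalization (NFKC).

    Only temporary normalized copies are compared; the inputs are left
    unchanged, which corresponds to the lossless case in the text.
    """
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


if __name__ == "__main__":
    # Superscript four, vulgar fraction one quarter, Roman numeral twelve,
    # Latin small ligature ffi, Angstrom sign.
    for ch in ("\u2074", "\u00bc", "\u216b", "\ufb03", "\u212b"):
        print(f"U+{ord(ch):04X}", unicodedata.name(ch), compat_keyword(ch))

    # Roman numeral one (U+2160) matches the capital Latin letter I
    # only after compatibility normalization.
    print(compat_equal("\u2160", "I"))
```

With this sketch, a find-on-page search could call <code>compat_equal</code> (or normalize both the query and the page text once) so that a search for 'I' also matches 'Ⅰ' (U+2160). Canonical normalization (NFC) alone leaves U+2160 unchanged, while it does replace the Angstrom sign U+212B with U+00C5, because that mapping carries no keyword.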